From c1f1775c91cfbdb169490067e50c1ec914f007ea Mon Sep 17 00:00:00 2001 From: Pim van Pelt Date: Mon, 5 Aug 2024 01:11:52 +0200 Subject: [PATCH] Rewrite all images to Hugo format --- .../articles/2016-10-07-fiber7-litexchange.md | 32 +- content/articles/2017-03-01-sixxs-sunset.md | 14 +- content/articles/2021-02-26-history.md | 19 + .../articles/2021-02-27-coloclue-loadtest.md | 403 ++++++ content/articles/2021-02-27-network.md | 156 +++ content/articles/2021-03-27-coloclue-vpp.md | 685 ++++++++++ content/articles/2021-05-17-frankfurt.md | 144 +++ content/articles/2021-05-26-amsterdam.md | 216 ++++ content/articles/2021-05-28-lille.md | 106 ++ content/articles/2021-06-01-paris.md | 185 +++ content/articles/2021-06-28-as112.md | 480 +++++++ content/articles/2021-07-03-geneva.md | 210 +++ content/articles/2021-07-19-pcengines-apu6.md | 201 +++ content/articles/2021-07-26-bucketlist.md | 315 +++++ content/articles/2021-08-07-fs-switch.md | 710 +++++++++++ content/articles/2021-08-12-vpp-1.md | 404 ++++++ content/articles/2021-08-13-vpp-2.md | 345 +++++ content/articles/2021-08-15-vpp-3.md | 414 ++++++ content/articles/2021-08-25-vpp-4.md | 520 ++++++++ content/articles/2021-08-26-fiber7-x.md | 214 ++++ content/articles/2021-09-02-vpp-5.md | 549 ++++++++ content/articles/2021-09-10-vpp-6.md | 409 ++++++ content/articles/2021-09-21-vpp-7.md | 567 +++++++++ content/articles/2021-10-24-as8298.md | 160 +++ content/articles/2021-11-14-routing-policy.md | 726 +++++++++++ content/articles/2021-11-26-netgate-6100.md | 474 +++++++ content/articles/2021-12-23-vpp-playground.md | 529 ++++++++ content/articles/2022-01-12-vpp-l2.md | 682 ++++++++++ content/articles/2022-02-14-vpp-vlan-gym.md | 375 ++++++ content/articles/2022-02-21-asr9006.md | 778 +++++++++++ content/articles/2022-02-24-colo.md | 105 ++ .../articles/2022-03-03-syslog-telegram.md | 281 ++++ content/articles/2022-03-27-vppcfg-1.md | 444 +++++++ content/articles/2022-04-02-vppcfg-2.md | 727 +++++++++++ content/articles/2022-10-14-lab-1.md | 645 ++++++++++ content/articles/2022-11-20-mastodon-1.md | 251 ++++ content/articles/2022-11-24-mastodon-2.md | 215 ++++ content/articles/2022-11-27-mastodon-3.md | 327 +++++ content/articles/2022-12-05-oem-switch-1.md | 638 ++++++++++ content/articles/2022-12-09-oem-switch-2.md | 552 ++++++++ content/articles/2023-02-12-fitlet2.md | 505 ++++++++ content/articles/2023-02-24-coloclue-vpp-2.md | 678 ++++++++++ content/articles/2023-03-11-mpls-core.md | 633 +++++++++ content/articles/2023-03-17-ipng-frontends.md | 481 +++++++ content/articles/2023-03-24-lego-dns01.md | 344 +++++ content/articles/2023-04-09-vpp-stats.md | 382 ++++++ content/articles/2023-05-07-vpp-mpls-1.md | 749 +++++++++++ content/articles/2023-05-17-vpp-mpls-2.md | 635 +++++++++ content/articles/2023-05-21-vpp-mpls-3.md | 463 +++++++ content/articles/2023-05-28-vpp-mpls-4.md | 387 ++++++ content/articles/2023-08-06-pixelfed-1.md | 553 ++++++++ content/articles/2023-08-27-ansible-nginx.md | 628 +++++++++ .../articles/2023-10-21-vpp-ixp-gateway-1.md | 308 +++++ .../articles/2023-11-11-mellanox-sn2700.md | 827 ++++++++++++ content/articles/2023-12-17-defra0-debian.md | 474 +++++++ content/articles/2024-01-27-vpp-papi.md | 717 +++++++++++ content/articles/2024-02-10-vpp-freebsd-1.md | 359 ++++++ content/articles/2024-02-17-vpp-freebsd-2.md | 501 ++++++++ content/articles/2024-03-06-vpp-babel-1.md | 644 ++++++++++ content/articles/2024-04-06-vpp-ospf.md | 438 +++++++ content/articles/2024-04-27-freeix-1.md | 230 ++++ 
content/articles/2024-05-17-smtp.md | 1134 +++++++++++++++++ content/articles/2024-05-25-nat64-1.md | 412 ++++++ content/articles/2024-06-22-vpp-ospf-2.md | 624 +++++++++ content/articles/2024-06-29-coloclue-ipng.md | 617 +++++++++ content/articles/2024-07-05-r86s.md | 566 ++++++++ content/articles/2024-08-03-gowin.md | 443 +++++++ 67 files changed, 29916 insertions(+), 23 deletions(-) create mode 100644 content/articles/2021-02-26-history.md create mode 100644 content/articles/2021-02-27-coloclue-loadtest.md create mode 100644 content/articles/2021-02-27-network.md create mode 100644 content/articles/2021-03-27-coloclue-vpp.md create mode 100644 content/articles/2021-05-17-frankfurt.md create mode 100644 content/articles/2021-05-26-amsterdam.md create mode 100644 content/articles/2021-05-28-lille.md create mode 100644 content/articles/2021-06-01-paris.md create mode 100644 content/articles/2021-06-28-as112.md create mode 100644 content/articles/2021-07-03-geneva.md create mode 100644 content/articles/2021-07-19-pcengines-apu6.md create mode 100644 content/articles/2021-07-26-bucketlist.md create mode 100644 content/articles/2021-08-07-fs-switch.md create mode 100644 content/articles/2021-08-12-vpp-1.md create mode 100644 content/articles/2021-08-13-vpp-2.md create mode 100644 content/articles/2021-08-15-vpp-3.md create mode 100644 content/articles/2021-08-25-vpp-4.md create mode 100644 content/articles/2021-08-26-fiber7-x.md create mode 100644 content/articles/2021-09-02-vpp-5.md create mode 100644 content/articles/2021-09-10-vpp-6.md create mode 100644 content/articles/2021-09-21-vpp-7.md create mode 100644 content/articles/2021-10-24-as8298.md create mode 100644 content/articles/2021-11-14-routing-policy.md create mode 100644 content/articles/2021-11-26-netgate-6100.md create mode 100644 content/articles/2021-12-23-vpp-playground.md create mode 100644 content/articles/2022-01-12-vpp-l2.md create mode 100644 content/articles/2022-02-14-vpp-vlan-gym.md create mode 100644 content/articles/2022-02-21-asr9006.md create mode 100644 content/articles/2022-02-24-colo.md create mode 100644 content/articles/2022-03-03-syslog-telegram.md create mode 100644 content/articles/2022-03-27-vppcfg-1.md create mode 100644 content/articles/2022-04-02-vppcfg-2.md create mode 100644 content/articles/2022-10-14-lab-1.md create mode 100644 content/articles/2022-11-20-mastodon-1.md create mode 100644 content/articles/2022-11-24-mastodon-2.md create mode 100644 content/articles/2022-11-27-mastodon-3.md create mode 100644 content/articles/2022-12-05-oem-switch-1.md create mode 100644 content/articles/2022-12-09-oem-switch-2.md create mode 100644 content/articles/2023-02-12-fitlet2.md create mode 100644 content/articles/2023-02-24-coloclue-vpp-2.md create mode 100644 content/articles/2023-03-11-mpls-core.md create mode 100644 content/articles/2023-03-17-ipng-frontends.md create mode 100644 content/articles/2023-03-24-lego-dns01.md create mode 100644 content/articles/2023-04-09-vpp-stats.md create mode 100644 content/articles/2023-05-07-vpp-mpls-1.md create mode 100644 content/articles/2023-05-17-vpp-mpls-2.md create mode 100644 content/articles/2023-05-21-vpp-mpls-3.md create mode 100644 content/articles/2023-05-28-vpp-mpls-4.md create mode 100644 content/articles/2023-08-06-pixelfed-1.md create mode 100644 content/articles/2023-08-27-ansible-nginx.md create mode 100644 content/articles/2023-10-21-vpp-ixp-gateway-1.md create mode 100644 content/articles/2023-11-11-mellanox-sn2700.md create mode 100644 
content/articles/2023-12-17-defra0-debian.md create mode 100644 content/articles/2024-01-27-vpp-papi.md create mode 100644 content/articles/2024-02-10-vpp-freebsd-1.md create mode 100644 content/articles/2024-02-17-vpp-freebsd-2.md create mode 100644 content/articles/2024-03-06-vpp-babel-1.md create mode 100644 content/articles/2024-04-06-vpp-ospf.md create mode 100644 content/articles/2024-04-27-freeix-1.md create mode 100644 content/articles/2024-05-17-smtp.md create mode 100644 content/articles/2024-05-25-nat64-1.md create mode 100644 content/articles/2024-06-22-vpp-ospf-2.md create mode 100644 content/articles/2024-06-29-coloclue-ipng.md create mode 100644 content/articles/2024-07-05-r86s.md create mode 100644 content/articles/2024-08-03-gowin.md diff --git a/content/articles/2016-10-07-fiber7-litexchange.md b/content/articles/2016-10-07-fiber7-litexchange.md index 8d313a6..cce65a9 100644 --- a/content/articles/2016-10-07-fiber7-litexchange.md +++ b/content/articles/2016-10-07-fiber7-litexchange.md @@ -62,7 +62,7 @@ engineers at Init7 informed me that DHCPv6 was ready, and it worked spotlessly a to request an NA and a /48 PD, and bumping `accept_ra=2` on the egress interface (note: this allows forwarding while at the same time accepting router advertisements). -Additional details of the L3 connection: +Additional details of the L3 connection: 1. The routers operate an L2VPN to a third party provider (IP-Max, AS25091) which routes `194.1.163.32/27` via eBGP using GRE. The MSS on this tunnel is clamped to 1436 (from 1460) to allow @@ -108,7 +108,7 @@ Multiple Amino IPTV devices in multiple backend VLANs can be used at the same ti ``` $ ip mroute | grep 239.44.0 -(109.202.223.18, 239.44.0.77) Iif: eth0.9 Oifs: eth0 +(109.202.223.18, 239.44.0.77) Iif: eth0.9 Oifs: eth0 (109.202.223.18, 239.44.0.78) Iif: eth0.9 Oifs: eth0.2 ``` @@ -182,7 +182,7 @@ rtt min/avg/max/mdev = 1.154/1.451/2.206/0.276 ms IPv6 was initially not natively available on this connection. IPv6 was tunneled via chzrh02.sixxs.net (on-net at AS13030). The IPv6 server endpoint runs on a virtualized platform, with slightly less than bare-bones throughput. Shortly thereafter, native IPv6 was configured on the -Fiber7 product via the LiteXchange platform. +Fiber7 product via the LiteXchange platform. Each OTO delivered by the city of Wangen-Brüttisellen [[site](http://www.werkewb.ch/cms/?page_id=52)] holds four simplex single mode fibers. The first @@ -197,12 +197,12 @@ was used to provide the Fiber7 internet connection. ### Appendix 1 - Terminology **Term** | **Description** --------- | --------------- -ONT | **optical network terminal** - The ONT converts fiber-optic light signals to copper based electric signals, usually Ethernet. +-------- | --------------- +ONT | **optical network terminal** - The ONT converts fiber-optic light signals to copper based electric signals, usually Ethernet. OTO | **optical telecommunication outlet** - The OTO is a fiber optic outlet that allows easy termination of cables in an office and home environment. Installed OTOs are referred to by their OTO-ID. CARP | **common address redundancy protocol** - Its purpose is to allow multiple hosts on the same network segment to share an IP address. CARP is a secure, free alternative to the Virtual Router Redundancy Protocol (VRRP) and the Hot Standby Router Protocol (HSRP). SIT | **simple internet transition** - Its purpose is to interconnect isolated IPv6 networks, located in global IPv4 Internet via tunnels. 
-STB | **set top box** - a device that enables a television set to become a user interface to the Internet and also enables a television set to receive and decode digital television (DTV) broadcasts. +STB | **set top box** - a device that enables a television set to become a user interface to the Internet and also enables a television set to receive and decode digital television (DTV) broadcasts. GRE | **generic routing encapsulation** - a tunneling protocol developed by Cisco Systems that can encapsulate a wide variety of network layer protocols inside virtual point-to-point links over an Internet Protocol network. L2VPN | **layer2 virtual private network** - a service that emulates a switched Ethernet (V)LAN across a pseudo-wire (typically an IP tunnel) DHCP | **dynamic host configuration protocol** - an IPv4 network protocol that enables a server to automatically assign an IP address to a computer from a defined range of numbers. @@ -225,7 +225,7 @@ GRE via IP-Max: [speedtest](http://beta.speedtest.net/result/5668135633) #### Bandwidth with Iperf upstream ``` -(AS13030 IPv4) $ iperf -t 600 -P 4 -i 60 -l 1M -m -c chzrh02.sixxs.net +(AS13030 IPv4) $ iperf -t 600 -P 4 -i 60 -l 1M -m -c chzrh02.sixxs.net ------------------------------------------------------------ Client connecting to chzrh02.sixxs.net, TCP port 5001 TCP window size: 85.0 KByte (default) @@ -335,17 +335,17 @@ interface eth0.9 { # interface VLAN9 - Fiber7 id-assoc pd 1 { prefix ::/48 infinity; - prefix-interface lo { - sla-id 0; - ifid 1; - sla-len 16; + prefix-interface lo { + sla-id 0; + ifid 1; + sla-len 16; }; # Test interface - prefix-interface eth1 { - sla-id 4096; - ifid 1; - sla-len 16; + prefix-interface eth1 { + sla-id 4096; + ifid 1; + sla-len 16; }; }; @@ -360,7 +360,7 @@ Taking IGMPProxy from [github](https://github.com/pali/igmpproxy) and the follow file, IPTV worked reliably throughout the pilot: ``` -$ cat /etc/igmpproxy.conf +$ cat /etc/igmpproxy.conf ##------------------------------------------------------ ## Enable Quickleave mode (Sends Leave instantly) ##------------------------------------------------------ diff --git a/content/articles/2017-03-01-sixxs-sunset.md b/content/articles/2017-03-01-sixxs-sunset.md index 38f1ef2..cf34131 100644 --- a/content/articles/2017-03-01-sixxs-sunset.md +++ b/content/articles/2017-03-01-sixxs-sunset.md @@ -39,7 +39,7 @@ successor, SixXS, gaining more than 18 years of valuable IPv6 experience. As of March 2017, there are 38’393 7-day active users spanning 140 countries. These users configured a total of 44’673 tunnels spanning 118 countries, and 12’632 subnet delegations (28.28%). Our peak 7DA usage was over 50’000 users. Full statistics, including distributions by country, can be found -on the SixXS website [[link](https://www.sixxs.net/misc/usage/)]. +on the SixXS website [[link](https://www.sixxs.net/misc/usage/)]. #### Growth @@ -56,7 +56,7 @@ Another way to visualize this data is to measure the cumulative requested subnet size). The requests for subnets naturally follows the growth of users. In recent years (2014 onwards), requests for additional subnets were clearly tapering off - this is in line with our goal of SixXS. Therefore, in December of 2015, new user signups were suspended. Note: this explains the -flatline of requests in the years 2016 and 2017. +flatline of requests in the years 2016 and 2017. ![Cumulative Requests/Year](/assets/sixxs-sunset/image3.png) *Image: Cumulative requests over time*. @@ -92,7 +92,7 @@ between 2012 and 2017. 
Second graph is average traffic in bits/sec between 2017- Looking at the total footprint of SixXS - In March 2017, 46 PoPs spread over 29 countries were offered by 40 unique Internet service providers. Over the lifetime of SixXS, a total of 65 different -PoPs have been active. +PoPs have been active. ![image](/assets/sixxs-sunset/image6.png) @@ -118,12 +118,12 @@ tunnel brokering service when all the end users can get [Native IPv6](https://www.sixxs.net/faq/connectivity/?faq=native) directly from their own Internet provider*. -For a decade, the industry was divided into content providers and access providers engaged a chicken and egg game: +For a decade, the industry was divided into content providers and access providers engaged a chicken and egg game: 1. Content providers claimed that investing in IPv6 rollout would be useless because there were not sufficient numbers of large ISPs which offered it. 1. Access providers claimed that investing in IPv6 would be useless because there were not - sufficient numbers of large content providers which offered it. + sufficient numbers of large content providers which offered it. 1. Both content providers and access providers claimed their customers didn’t demand it and there was no business justification in doing so. @@ -179,7 +179,7 @@ When we started in 1999, we set ourselves some pretty ambitious goals. They are website as value propositions to [ISPs](https://www.sixxs.net/faq/sixxs/?faq=isp), to [endusers](https://www.sixxs.net/faq/sixxs/?faq=enduser) and we explain in detail [what targets](https://www.sixxs.net/faq/sixxs/?faq=why) we want our project to achieve in order to help -the technical community. +the technical community. To the latter point ([why do this](https://www.sixxs.net/faq/sixxs/?faq=why)?), we have by far exceeded our targets of creating 10 regional PoPs (we created 65, each using their own IPv6 address @@ -237,7 +237,7 @@ to take this opportunity to call out these formative folks: And lastly, we extend our gratitude to the men and women who professionally operate the network, those who arranged the physical or virtual hardware, and those who are in a position to commit to -running all 65 [SixXS PoPs](https://www.sixxs.net/pops/)! +running all 65 [SixXS PoPs](https://www.sixxs.net/pops/)! ## FAQ diff --git a/content/articles/2021-02-26-history.md b/content/articles/2021-02-26-history.md new file mode 100644 index 0000000..00ae9e5 --- /dev/null +++ b/content/articles/2021-02-26-history.md @@ -0,0 +1,19 @@ +--- +date: "2021-02-26T13:07:54Z" +title: IPng History +--- +Historical context - todo, but notes for now + +1. started with stack.nl (when it was still stack.urc.tue.nl), 6bone and watching NASA multicast video in 1997. +2. founded ipng.nl project, first IPv6 in NL that was usable outside of NREN. +3. attacted attention of the first few IPv6 partitipants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day +4. launched IPv6 at AMS-IX, first IXP prefix allocated 2001:768:1::/48 +> My Brilliant Idea Of The Day -- encode AS number in leetspeak: `::AS01:2859:1`, because who would've thought we would ever run out of 16 bit AS numbers :) +5. IPng rearchitected to SixXS, and became a very large scale deployment of IPv6 tunnelbroker; our main central provisioning system moved around a few times between ISPs (Intouch, Concepts ICT, BIT, IP Man) +6. Needed eventually a NOC and servers to operate it that were provider independent, which is where our PI space came from (and is still used) +7. 
High Availability with paphosting of sixxs.net and other sites
+8. Moved to IP-Max in 2014 (and still best of friends with that crew!)
+9. In 2019, Fred said "hey why don't you get an AS number and announce your /24 PI yourself, that'll be fun!"
+> I didn't want to at first, because "it is a lot of work to do it properly".
+10. In 2020, I got to know Openfactory, who are a local ISP (with an office in the town I live in) and offer services on the local FTTH network; so I got a gigabit connection with them
+> And that's when I made the plunge, got AS50869, started announcing my own PI space, built up a few routers, and the rest is ... history :)
diff --git a/content/articles/2021-02-27-coloclue-loadtest.md b/content/articles/2021-02-27-coloclue-loadtest.md
new file mode 100644
index 0000000..50b2f29
--- /dev/null
+++ b/content/articles/2021-02-27-coloclue-loadtest.md
@@ -0,0 +1,403 @@
+---
+date: "2021-02-27T21:31:00Z"
+title: Loadtesting at Coloclue
+---
+* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)>
+* Reviewers: Coloclue Network Committee <[routers@coloclue.net](mailto:routers@coloclue.net)>
+* Status: Draft - Review - **Published**
+
+## Introduction
+
+Coloclue AS8283 operates several Linux routers running Bird. Over the years, the performance of their previous hardware platform (Dell R610) has deteriorated, and they were up for renewal. At the same time, network latency/jitter has been very high, and the variability may be caused by the Linux router hardware, the software they run, the inter-datacenter links, or any combination of these. One specific example of why this is important is that Coloclue runs BFD on their inter-datacenter links, which are VLANs provided to us by Atom86 and Fusix Networks. On these links Coloclue regularly sees ping times of 300-400ms, with huge outliers in the 1000ms range, which triggers BFD timeouts causing iBGP reconvergence events and overall horrible performance. Before we open up a discussion with these (excellent!) L2 providers, we should first establish whether it isn’t more likely that Coloclue's router hardware and/or software should be improved instead.
+
+By means of example, let’s take a look at a Smokeping graph that shows these latency spikes, jitter and loss quite well. It’s taken from a machine at True (in EUNetworks) to a machine at Coloclue (in NorthC); this is the first graph. The same machine at True to a machine at BIT (in Ede) does not exhibit this behavior; this is the second graph.
+
+| | |
+| ---- | ---- |
+| {{< image src="/assets/coloclue-loadtest/image0.png" alt="BIT" width="450px" >}} | {{< image src="/assets/coloclue-loadtest/image1.png" alt="Coloclue" width="450px" >}} |
+
+*Images: Smokeping graph from True to Coloclue (left), and True to BIT (right). There is quite a difference.*
+
+## Summary
+
+I performed three separate loadtests. First, I did a loopback loadtest on the T-Rex machine, proving that it can send 1.488Mpps in both directions simultaneously. Then, I did a loadtest of the Atom86 link by sending the traffic through the Arista in NorthC, over the Atom86 link, to the Arista in EUNetworks, looping two ethernet ports, and sending the traffic back to NorthC. Due to VLAN tagging, this yielded 1.42Mpps throughput, exactly as predicted. Finally, I performed a stateful loadtest that saturated the Atom86 link, while injecting SCTP packets at 1KHz, measuring the latency observed over the Atom86 link.
+ +**All three tests passed.** + +## Loadtest Setup + +After deploying the new NorthC routers (Supermicro Super Server/X11SCW-F with Intel Xeon E-2286G processors), I decided to rule out hardware issues, leaving link and software issues. To get a bit more insight on software or inter-datacenter links, I created the following two loadtest setups. + +### 1. Baseline + +Machine `dcg-2`, carrying an Intel 82576 quad Gigabit NIC, looped from the first two ports (port0 to port1). The point of this loopback test is to ensure that the machine itself is capable of sending and receiving the correct traffic patterns. Usually, one does an “imix” and a “64b” loadtest for this, and it is expected that the loadtester itself passes all traffic out on one port back into the other port, without any loss. The thing I am testing is called the DUT or *Device Under Test* and in this case, it is a UTP cable from NIC-NIC. + +The expected packet rate is: 672 bits for the ethernet frame is 10^9 / 672 == **1488095 packets per second** in each direction and traversing the link once. You will often see 1.488Mpps as “the theoretical maximum”, and this is why. + +### 2. Atom86 + +In this test, Tijn from Coloclue plugged `dcg-2` port0 into the core switch (an Arista) port e17, and he configured that switchport as an access port for VLAN A; which is put on the Atom86 trunk to EUNetworks. The second port1 is plugged into the core switch port e18, and assigned a different VLAN B, which is also put on the Atom86 link to EUNetworks. + +At EUNetworks then, he exposed that same VLAN A on port e17 and VLAN B on port e18. And Tijn used DAC cable to connect e17 <-> e18. Thus, the path the packets travel now becomes the *Device Under Test* (DUT): + +> port0 -> dcg-core-2:e17 -> Atom86 -> eunetworks-core-2:e17 +> +> eunetworks-core-2:e18 -> Atom86 -> dcg-core-2:e18 -> port1 + +I should note that because the loadtester emits traffic which is tagged by the `*-core-2` switches, that the Atom86 link will see each tagged packet twice, and as we'll see, that VLAN tagging actually matters! The maximum expected packet rate is: 672 bits for the ethernet frame + 32 bits for the VLAN tag == 704 bits per packet, sent in both directions, but traversing the link twice. We can deduce that we should see 10^9 / 704 / 2 == **710227 packets per second** in each direction. + +## Detailed Analysis + +This section goes into details, but it is roughly broken down into: + +1. Prepare machine (install T-Rex, needed kernel headers, and some packages) +1. Configure T-Rex (bind NIC from PCI bus into DPDK) +1. Run T-Rex interactively +1. Run T-Rex programmatically + +### Step 1 - Prepare machine + +Download T-Rex from Cisco website and unpack (I used version 2.88) in some directory that is readable by ‘nobody’. I used `/tmp/loadtest/` for this. Install some additional tools: + +``` +sudo apt install linux-headers-`uname -r` build-essential python3-distutils +``` + +### Step 2 - Bind NICs to DPDK + +First I had to find which NICs that can be used, these NICs have to be supported in DPDK, but luckily most Intel NICs are. 
I had a few ethernet NICs to choose from: + +``` +root@dcg-2:/tmp/loadtest/v2.88# lspci | grep -i Ether +01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) +01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) +01:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) +01:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) +05:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) +05:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) +07:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) +07:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) +0c:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03) +0d:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03) +``` + +But the ones that have no link are a good starting point: +``` +root@dcg-2:~/loadtest/v2.88# ip link | grep -v UP | grep enp7 +6: enp7s0f0: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 +7: enp7s0f1: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 +``` + +This is PCI bus 7, slot 0, function 0 and 1, so the configuration file for T-Rex becomes: +``` +root@dcg-2:/tmp/loadtest/v2.88# cat /etc/trex_cfg.yaml +- version : 2 + interfaces : ["07:00.0","07:00.1"] + port_limit : 2 + port_info : + - dest_mac : [0x0,0x0,0x0,0x1,0x0,0x00] # port 0 + src_mac : [0x0,0x0,0x0,0x2,0x0,0x00] + - dest_mac : [0x0,0x0,0x0,0x2,0x0,0x00] # port 1 + src_mac : [0x0,0x0,0x0,0x1,0x0,0x00] +``` + +### Step 3 - Run T-Rex Interactively + +Start the loadtester, this is easiest if you use two terminals, one to run t-rex itself and one to run the console: +``` +root@dcg-2:/tmp/loadtest/v2.88# ./t-rex-64 -i +root@dcg-2:/tmp/loadtest/v2.88# ./trex-console +``` + +The loadtester starts with -i (interactive) and optionally -c (number of cores to use, in this case only 1 CPU core is used). I will be doing a loadtest with gigabit speeds only, so no significant CPU is needed. I will demonstrate below that one CPU core of this machine can generate (sink and source) approximately 72Gbit/s of traffic. The loadtest starts a controlport on :4501 which the client connects to. You can now program the loadtester (programmatically via an API, or via the commandline / CLI tool provided. I’ll demonstrate both). + +In trex-console, I first enter ‘TUI’ mode -- this stands for the Traffic UI. Here, I can load a profile into the loadtester, and while you can write your own profiles, there are many standard ones to choose from. There’s further two types of loadtest, stateful and stateless. I started with a simpler ‘stateless’ one first, take a look at `stl/imix.py` which is self explanatory, but in particular, the mix consists of: +``` + self.ip_range = {'src': {'start': "16.0.0.1", 'end': "16.0.0.254"}, + 'dst': {'start': "48.0.0.1", 'end': "48.0.0.254"}} + + # default IMIX properties + self.imix_table = [ {'size': 60, 'pps': 28, 'isg':0 }, + {'size': 590, 'pps': 16, 'isg':0.1 }, + {'size': 1514, 'pps': 4, 'isg':0.2 } ] +``` + +Above one can see that there will be traffic flowing from 16.0.0.1-254 to 48.0.0.1-254, and there will be three streams generated at a certain ratio, 28 small 60 byte packets, 16 medium sized 590b packets, and 4 large 1514b packets. 
This is typically what a residential user would see (a SIP telephone call, perhaps a Jitsi video stream; some download of data with large MTU-filling packets; and some DNS requests and other smaller stuff). Executing this profile can be done with: + +``` +tui> start -f stl/imix.py -m 1kpps +``` + +.. which will start a 1kpps load of that packet stream. The traffic load can be changed by either specifying an absolute packet rate, or a percentage of line rate, and you can pause and resume as well: +``` +tui> update -m 10kpps +tui> update -m 10% +tui> update -m 50% +tui> pause +# do something, there will be no traffic +tui> resume +tui> update -m 100% +``` + +After this last command, T-Rex will be emitting line rate packets out of port0 and out of port1, and it will be expecting to see the packets that it sent back on port1 and port0 respectively. If the machine is powerful enough, it can saturate traffic up to the line rate in both directions. One can see if things are successfully passing through the device under test (in this case, for now simply a UTP cable from port0-port1). The ‘ibytes’ should match the ‘obytes’, and of course ‘ipackets’ should match the ‘opackets’ in both directions. Typically, a loss rate of 0.01% is considered acceptable. And, typically, a loss rate of a few packets in the beginning of the loadtest is also acceptable (more on that later). + + +Screenshot of port0-port1 loopback test with L2: +``` +Global Statistitcs + +connection : localhost, Port 4501 total_tx_L2 : 1.51 Gbps +version : STL @ v2.88 total_tx_L1 : 1.98 Gbps +cpu_util. : 5.63% @ 1 cores (1 per dual port) total_rx : 1.51 Gbps +rx_cpu_util. : 0.0% / 0 pps total_pps : 2.95 Mpps +async_util. : 0% / 104.03 bps drop_rate : 0 bps +total_cps. : 0 cps queue_full : 0 pkts + +Port Statistics + + port | 0 | 1 | total +-----------+-------------------+-------------------+------------------ +owner | root | root | +link | UP | UP | +state | TRANSMITTING | TRANSMITTING | +speed | 1 Gb/s | 1 Gb/s | +CPU util. | 5.63% | 5.63% | +-- | | | +Tx bps L2 | 755.21 Mbps | 755.21 Mbps | 1.51 Gbps +Tx bps L1 | 991.21 Mbps | 991.21 Mbps | 1.98 Gbps +Tx pps | 1.48 Mpps | 1.48 Mpps | 2.95 Mpps +Line Util. | 99.12 % | 99.12 % | +--- | | | +Rx bps | 755.21 Mbps | 755.21 Mbps | 1.51 Gbps +Rx pps | 1.48 Mpps | 1.48 Mpps | 2.95 Mpps +---- | | | +opackets | 355108111 | 355111209 | 710219320 +ipackets | 355111078 | 355108226 | 710219304 +obytes | 22761267356 | 22761466414 | 45522733770 +ibytes | 22761457966 | 22761274908 | 45522732874 +tx-pkts | 355.11 Mpkts | 355.11 Mpkts | 710.22 Mpkts +rx-pkts | 355.11 Mpkts | 355.11 Mpkts | 710.22 Mpkts +tx-bytes | 22.76 GB | 22.76 GB | 45.52 GB +rx-bytes | 22.76 GB | 22.76 GB | 45.52 GB +----- | | | +oerrors | 0 | 0 | 0 +ierrors | 0 | 0 | 0 +``` + +Instead of `stl/imix.py` as a profile, one can also consider `stl/udp_1pkt_simple_bdir.py` as a profile. This will send UDP packets of 0 bytes payload from a single host 16.0.0.1 to a single host 48.0.0.1 and back. Running the 1pkt UDP profile in both directions at gigabit link speeds will allow for 1.488Mpps in both directions (a minimum ethernet frame carrying IPv4 packet will be 672 bits in length -- see [wikipedia](https://en.wikipedia.org/wiki/Ethernet_Frame) for details). + +Above, one can see the system is in a healthy state - it has saturated the network bandwidth in both directions (991Mps L1 rate, so this is the full 672 bits per ethernet frame, including the header, interpacket gap, etc), at 1.48Mpps. 
All packets sent by port0 (the opackets, obytes) should have been received by port1 (the ipackets, ibytes), and they are. + +One can also learn that T-Rex is utilizing approximately 5.6% of one CPU core sourcing and sinking this load on the two gigabit ports (that’s 2 gigabit out, 2 gigabit in), so for a DPDK application, **one CPU core is capable of 71Gbps and 53Mpps**, an interesting observation. + +### Step 4 - Run T-Rex programmatically + +I wrote a tool previously that allows to run a specific ramp-up profile from 1kpps warmup through to line rate, in order to find the maximum allowable throughput before a DUT exhibits too much loss, usage: + +``` +usage: trex-loadtest.py [-h] [-s SERVER] [-p PROFILE_FILE] [-o OUTPUT_FILE] + [-wm WARMUP_MULT] [-wd WARMUP_DURATION] + [-rt RAMPUP_TARGET] [-rd RAMPUP_DURATION] + [-hd HOLD_DURATION] + +T-Rex Stateless Loadtester -- pim@ipng.nl + +optional arguments: + -h, --help show this help message and exit + -s SERVER, --server SERVER + Remote trex address (default: 127.0.0.1) + -p PROFILE_FILE, --profile PROFILE_FILE + STL profile file to replay (default: imix.py) + -o OUTPUT_FILE, --output OUTPUT_FILE + File to write results into, use "-" for stdout + (default: -) + -wm WARMUP_MULT, --warmup_mult WARMUP_MULT + During warmup, send this "mult" (default: 1kpps) + -wd WARMUP_DURATION, --warmup_duration WARMUP_DURATION + Duration of warmup, in seconds (default: 30) + -rt RAMPUP_TARGET, --rampup_target RAMPUP_TARGET + Target percentage of line rate to ramp up to (default: + 100) + -rd RAMPUP_DURATION, --rampup_duration RAMPUP_DURATION + Time to take to ramp up to target percentage of line + rate, in seconds (default: 600) + -hd HOLD_DURATION, --hold_duration HOLD_DURATION + Time to hold the loadtest at target percentage, in + seconds (default: 30) +``` + +Here, the loadtester will load a profile (imix.py for example), warmup for 30s at 1kpps, then ramp up linearly to 100% of line rate in 600s, and hold at line rate for 30s. The loadtest passes if during this entire time, the DUT had less than 0.01% packet loss. I must note that in this loadtest, I cannot ramp up to line rate (because the Atom86 link is used twice!), and I’ll also note I cannot ramp up to 50% of line rate (because the loadtester is sending untagged traffic, but the Arista is adding tags onto the Atom86 link!), so I expect to see **711Kpps** which is just about **47% of line rate**. + +The loadtester will emit a JSON file with all of its runtime stats, which can be later analyzed and used to plot graphs. 
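+
+As a sanity check, the frame-size arithmetic used throughout this post can be reproduced with a few lines of Python. This is purely illustrative and not part of the T-Rex tooling: a minimum-sized frame occupies 64 bytes plus 20 bytes of preamble, SFD and inter-packet gap on the wire, and an 802.1Q tag adds another 4 bytes:
+
+```
+# Illustrative only: back-of-the-envelope line-rate math for 1 Gbit/s.
+LINE_RATE_BPS = 1_000_000_000
+
+UNTAGGED_BITS = (64 + 20) * 8        # 672 bits on the wire per minimum-size frame
+TAGGED_BITS = UNTAGGED_BITS + 4 * 8  # 704 bits with an 802.1Q tag
+
+untagged_pps = LINE_RATE_BPS // UNTAGGED_BITS  # 1,488,095 pps: the "theoretical maximum"
+tagged_pps = LINE_RATE_BPS // TAGGED_BITS      # 1,420,454 pps, i.e. ~1.42Mpps
+atom86_pps = tagged_pps // 2                   # 710,227 pps: the link is traversed twice
+
+print(untagged_pps, tagged_pps, atom86_pps)
+print(f"{atom86_pps / untagged_pps:.1%} of line rate")  # ~47.7%, "just about 47%"
+```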
First, let’s look at an imix loadtest: +``` +root@dcg-2:/tmp/loadtest# trex-loadtest.py -o ~/imix.json -p imix.py -rt 50 +Running against 127.0.0.1 profile imix.py, warmup 1kpps for 30s, rampup target 50% +of linerate in 600s, hold for 30s output goes to /root/imix.json +Mapped ports to sides [0] <--> [1] +Warming up [0] <--> [1] at rate of 1kpps for 30 seconds +Setting load [0] <--> [1] to 1% of linerate +stats: 4.20 Kpps 2.82 Mbps (0.28% of linerate) +… +stats: 321.30 Kpps 988.14 Mbps (98.81% of linerate) +Loadtest finished, stopping + +Test has passed :-) + +Writing output to /root/imix.json +``` + +And then let’s step up our game with a 64b loadtest: +``` +root@dcg-2:/tmp/loadtest# trex-loadtest.py -o ~/64b.json -p udp_1pkt_simple_bdir.py -rt 50 +Running against 127.0.0.1 profile udp_1pkt_simple_bdir.py, warmup 1kpps for 30s, rampup target 50% +of linerate in 600s, hold for 30s output goes to /root/64b.json +Mapped ports to sides [0] <--> [1] +Warming up [0] <--> [1] at rate of 1kpps for 30 seconds +Setting load [0] <--> [1] to 1% of linerate +stats: 4.20 Kpps 2.82 Mbps (0.28% of linerate) +… +stats: 1.42 Mpps 956.41 Mbps (95.64% of linerate) +stats: 1.42 Mpps 952.44 Mbps (95.24% of linerate) +WARNING: DUT packetloss too high +stats: 1.42 Mpps 955.19 Mbps (95.52% of linerate) +``` + +As an interesting note, this value 1.42Mpps is exactly what I calculated and expected (see above for a full explanation). The math works out at 10^9 / 704 bits/packet == **1.42Mpps**, just short of 1.488M line rate that I would have found had Coloclue not used VLAN tags. + +### Step 5 - Run T-Rex ASTF, measure latency/jitter + +In this mode, T-Rex simulates many stateful flows using a profile, which replays actual PCAP data (as can be obtained with tcpdump), by spacing out the requests and rewriting the source/destination addresses, thereby simulating hundreds or even millions of active sessions - I used `astf/http_simple.py` as a canonical example. In parallel to the test, I let T-Rex run a latency check, by sending SCTP packets at a rate of 1KHz from each interface. By doing this, latency profile and jitter can be accurately measured under partial or full line load. + +#### Bandwidth/Packet rate + +Let’s first take a look at the bandwidth and packet rates: +``` +Global Statistitcs + +connection : localhost, Port 4501 total_tx_L2 : 955.96 Mbps +version : ASTF @ v2.88 total_tx_L1 : 972.91 Mbps +cpu_util. : 5.64% @ 1 cores (1 per dual port) total_rx : 955.93 Mbps +rx_cpu_util. : 0.06% / 2 Kpps total_pps : 105.92 Kpps +async_util. : 0% / 63.14 bps drop_rate : 0 bps +total_cps. : 3.46 Kcps queue_full : 143,837 pkts + +Port Statistics + + port | 0 | 1 | total +-----------+-------------------+-------------------+------------------ +owner | root | root | +link | UP | UP | +state | TRANSMITTING | TRANSMITTING | +speed | 1 Gb/s | 1 Gb/s | +CPU util. | 5.64% | 5.64% | +-- | | | +Tx bps L2 | 17.35 Mbps | 938.61 Mbps | 955.96 Mbps +Tx bps L1 | 20.28 Mbps | 952.62 Mbps | 972.91 Mbps +Tx pps | 18.32 Kpps | 87.6 Kpps | 105.92 Kpps +Line Util. 
| 2.03 % | 95.26 % | +--- | | | +Rx bps | 938.58 Mbps | 17.35 Mbps | 955.93 Mbps +Rx pps | 87.59 Kpps | 18.32 Kpps | 105.91 Kpps +---- | | | +opackets | 8276689 | 39485094 | 47761783 +ipackets | 39484516 | 8275603 | 47760119 +obytes | 978676133 | 52863444478 | 53842120611 +ibytes | 52862853894 | 978555807 | 53841409701 +tx-pkts | 8.28 Mpkts | 39.49 Mpkts | 47.76 Mpkts +rx-pkts | 39.48 Mpkts | 8.28 Mpkts | 47.76 Mpkts +tx-bytes | 978.68 MB | 52.86 GB | 53.84 GB +rx-bytes | 52.86 GB | 978.56 MB | 53.84 GB +----- | | | +oerrors | 0 | 0 | 0 +ierrors | 0 | 0 | 0 +``` + +In the above screen capture, one can see the traffic out of port0 is 20Mbps at 18.3Kpps, while the traffic out of port1 is 952Mbps at 87.6Kpps - this is because the clients are sourcing from port0, while the servers are simulated behind port1. Note the asymmetric traffic flow, T-Rex is using 972Mbps of total bandwidth over this 1Gbps VLAN, and a tiny bit more than that on the Atom86 link, because the Aristas are inserting VLAN tags in transit, to be exact, 18.32+87.6 = 105.92Kpps worth of 4 byte tags, thus 3.389Mbit extra traffic. + +#### Latency Injection + +Now, let’s look at the latency in both directions, depicted in microseconds, at a throughput of 106Kpps (975Mbps): +``` +Global Statistitcs + +connection : localhost, Port 4501 total_tx_L2 : 958.07 Mbps +version : ASTF @ v2.88 total_tx_L1 : 975.05 Mbps +cpu_util. : 4.86% @ 1 cores (1 per dual port) total_rx : 958.06 Mbps +rx_cpu_util. : 0.05% / 2 Kpps total_pps : 106.15 Kpps +async_util. : 0% / 63.14 bps drop_rate : 0 bps +total_cps. : 3.47 Kcps queue_full : 143,837 pkts + +Latency Statistics + + Port ID: | 0 | 1 +-------------+-----------------+---------------- +TX pkts | 244068 | 242961 +RX pkts | 242954 | 244063 +Max latency | 23983 | 23872 +Avg latency | 815 | 702 +-- Window -- | | +Last max | 966 | 867 +Last-1 | 948 | 923 +Last-2 | 945 | 856 +Last-3 | 974 | 880 +Last-4 | 963 | 851 +Last-5 | 985 | 862 +Last-6 | 986 | 870 +Last-7 | 946 | 869 +Last-8 | 976 | 879 +Last-9 | 964 | 867 +Last-10 | 964 | 837 +Last-11 | 970 | 867 +Last-12 | 1019 | 897 +Last-13 | 1009 | 908 +Last-14 | 1006 | 897 +Last-15 | 1022 | 903 +Last-16 | 1015 | 890 +--- | | +Jitter | 42 | 45 +---- | | +Errors | 3 | 2 +``` + +In the capture above, one can see the total latency measurement packets sent, and the latency measurements in microseconds. One can see that from port0->port1 the measured latency was 0.815ms, while the latency in the other direction was 0.702ms. The discrepancy can be explained by the HTTP traffic being asymmetric (clients on port0 have to send their SCTP packets into a much busier port1), which creates queuing latency on the wire and NIC. The `Last-*` lines under it are the values of the last 16 seconds of measurements. The maximum observed latency was 23.9ms in one direction and 23.8ms in the other direction. I have to conclude therefore that the Atom86 line, even under stringent load, does not suffer from outliers in the entire 300s duration of my loadtest. + +Jitter is defined as a variation in the delay of received packets. At the sending side, packets are sent in a continuous stream with the packets spaced evenly apart. Due to network congestion, improper queuing, or configuration errors, this steady stream can become lumpy, or the delay between each packet can vary instead of remaining constant. There was **virtually no jitter**: 42 microseconds in one direction, 45us in the other. 
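+
+T-Rex reports these jitter figures directly, but to make the definition concrete: a common way to estimate jitter (this sketch uses the RFC 3550 estimator; the exact formula T-Rex applies may differ) is a running, smoothed average of the absolute difference between consecutive packet delays:
+
+```
+# Simplified illustration of jitter as "variation in packet delay".
+# RFC 3550 style: J = J + (|D| - J) / 16, where D is the difference
+# between consecutive one-way delays.
+def rfc3550_jitter(delays_us):
+    jitter = 0.0
+    for prev, cur in zip(delays_us, delays_us[1:]):
+        jitter += (abs(cur - prev) - jitter) / 16.0
+    return jitter
+
+# Example: latency probes hovering around ~800us with small variation.
+samples = [815, 790, 842, 805, 820, 798, 836, 811]
+print(f"jitter ~ {rfc3550_jitter(samples):.0f} us")
+```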
+ +#### Latency Distribution + +While performing this test at 106Kpps (975Mbps), it's also useful to look at the latency distribution as a histogram: +``` +Global Statistitcs + +connection : localhost, Port 4501 total_tx_L2 : 958.07 Mbps +version : ASTF @ v2.88 total_tx_L1 : 975.05 Mbps +cpu_util. : 4.86% @ 1 cores (1 per dual port) total_rx : 958.06 Mbps +rx_cpu_util. : 0.05% / 2 Kpps total_pps : 106.15 Kpps +async_util. : 0% / 63.14 bps drop_rate : 0 bps +total_cps. : 3.47 Kcps queue_full : 143,837 pkts + +Latency Histogram + + Port ID: | 0 | 1 +-------------+-----------------+---------------- +20000 | 2545 | 2495 +10000 | 5889 | 6100 +9000 | 456 | 421 +8000 | 874 | 854 +7000 | 692 | 757 +6000 | 619 | 637 +5000 | 985 | 994 +4000 | 579 | 620 +3000 | 547 | 546 +2000 | 381 | 405 +1000 | 798 | 697 +900 | 27451 | 346 +800 | 163717 | 22924 +700 | 102194 | 154021 +600 | 24623 | 171087 +500 | 24882 | 40586 +400 | 18329 | +300 | 26820 | +``` + +In the capture above, one can see the number of packets observed between certain ranges; from port0 to port1, 102K SCTP latency probe packets were in transit some time between 700-799us, 163K probes were between 800-899us. In the other direction, 171K probes were between 600-699us and 154K probes were between 700-799us. This is corroborated by the mean latency I saw above (815us from port0->port1 and 702us from port1->port0). diff --git a/content/articles/2021-02-27-network.md b/content/articles/2021-02-27-network.md new file mode 100644 index 0000000..b97b512 --- /dev/null +++ b/content/articles/2021-02-27-network.md @@ -0,0 +1,156 @@ +--- +date: "2021-02-27T23:46:12Z" +title: IPng Network +--- + +# Introduction to IPng Networks + +At IPng Networks, we run a modest network with European reach. With our home base +in Zurich, Switzerland, we are pretty well connected into the Swiss internet scene. +We operate four sites in Zurich, and an additional set of sites in European cities, +each of which are described on this post. If you're curious as to how the network +runs, you can find two main pieces here: Firstly, the physical parts, where exactly +are IPng's routers and switches, what types of kit does the ISP use, and so on. +Secondly, the logical parts, what operating systems and configurations are in use. + +## Physical + +### Zurich Metropolitan Area + +The Canton of Zurich, Switzerland is our home-base, and it's where IPng +Networks GmbH is registered. The local commercial datacenter scene is dominated +by Interxion, NTT and Equinix. The small town of Brüttisellen (zipcode +CH-8306), is where our founder lives and, due to the ongoing Corona pandemic, +where he works from home. + +{{< image width="400px" float="left" src="/assets/network/zurich-ring.png" alt="Zurich Metro" >}} + +In Brüttisellen, marked with **C**, we have our first two routers, +`chbtl0.ipng.ch` and `chbtl1.ipng.ch`, racked in our office. There are only two +fiber operators in this town - UPC and Swisscom. The orange trace (**C** to **D**) +is a leased line from UPC, which we rent from [Openfactory](https://openfactory.net/) +and it gets terminated at Interxion Glattbrugg, where our first router +called `chgtg0.ipng.ch` is located. From there, Openfactory rents darkfiber +to multiple locations - but notably the dark purple trace (**D** to **E**) +that connects from Interxion Glattbrugg to NTT Rümlang, where our second +router called `chrma0.ipng.ch` is located. + +We rent a 10G CWDM wave between these two datacenters, directly connecting these +two routers. 
Now, Equinix also has a sizable footprint in Zürich, and +operating ZH04 (**B** where we only have passive optical presence) in the +Industriekwartier (our local internet exchange [SwissIX](https://swissix.net/) + was born in the now defunct Equinix ZH01 office building). From the neighboring +building Equinix ZH04, our partner [IP-Max](https://ip-max.net/) rents dark fiber +to Equinix ZH05 in the Zurich Allmend area (the light purple trace **B** to **F**), +and from there, IP-Max rents dark fiber to NTT Rümlang again (**F** to **E**), +completing the ring. We rent a 10G circuit on that path, to redundantly connect our +routers `chgtg0` and `chrma0`. If at any time we'd need to connect partners +or customers, we can do so at a moment's notice, as rackspace is available in +all Equinix sites for IPng Networks. + +The green link (**D** to **B**) is a 10G carrier ethernet circuit between Interxion, +over the light purple path (**B** to **A**) on its last mile to Albisrieden, where +we built a very small colocation site, which you can read about in more detail in our +[informational post]({% post_url 2022-02-24-colo %}) - the colo is open for private +individuals and small businesses ([contact](/s/contact/) us for details!). + +### European Ring + +At IPng, we are strong believers in a free and open Internet. Having seen +the shakeout of internet backbone providers over the last two decades, it +seems to be a race to the bottom, with mergers, acquisitions and takeovers +of datacenters and network carriers. Prices are going lower, and small fish +traffic (let's be honest, IPng Networks is definitely a small provider), to +the point that purchasing IP transit is cheaper than connecting to local +Internet exchange points. We've decided specifically to go the extra mile, +quite literally, and plot a path to several continental european internet +hubs. + +{{< image width="400px" float="left" src="/assets/network/european-ring.png" alt="European Ring" >}} + +***Frankfurt*** - Connected from NTT's datacenter at Rümlang (Zurich) with +a first 10G circuit, and from Interxion's datacenter at Glattbrugg (Zurich) +with a second 10G circuit, this is our first hop into the world. Here, we +connect to [DE-CIX](https://de-cix.net/) from Equinix FR5 at the Kleyerstrasse. +More details in our post [IPng Arrives in Frankfurt]({% post_url 2021-05-17-frankfurt %}). + +***Amsterdam*** - The Amsterdam Science Park is where European Internet was born. +[NIKHEF](https://www.nikhef.nl/) is where we rent rackspace that connects with a 10G +circuit to Frankfurt, and a 10G circuit onwards towards Lille. We connect to +[Speed-IX](https://speed-ix.net/), [LSIX](https://lsix.net/), [NL-IX](https://nl-ix.net), +and an exchange point we help run called [FrysIX](https://www.frys-ix.net/). +More details in our post [IPng Arrives in Amsterdam]({% post_url 2021-05-26-amsterdam %}). + +***Lille*** - [IP-Max](https://ip-max.net/) does lots of business in this +region, with presence in both local datacenters here, one in Lille and one in +Anzin. IPng has a point of presence here too, at the [CIV1](https://www.civ.fr/) +facility, with a northbound 10G circuit to Amsterdam, and a southbound 10G +circuit to Paris. Here, we connect to [LillIX](https://lillix.fr/). +More details in our post [IPng Arrives in Lille]({% post_url 2021-05-28-lille %}). 
+ +***Paris*** - Where two large facilities are placed back-to-back in the middle +of the city, originally Telehouse TH2, with a new facility at Léon Frot, +where we pick up a 10G circuit from Lille and further on the ring with a 10G +circuit to Geneva. Here, we connect to [FranceIX](https://franceix.net). +More details in our post [IPng Arrives in Paris]({% post_url 2021-06-01-paris %}). + +***Geneva*** - The home-base of [IP-Max](https://ip-max.net) is where we close +our ring. From Paris, IP-Max has two redundant paths back to Switzerland, the first +being a DWDM link from to Zurich, and the second being a DWDM link to Lyon and +then into Geneva. Here, at [SafeHost](https://safehost.com/) in Plan les Ouates, +is where we have our fourth Swiss point of presence, with a connection to our very +own [Free-IX](https://free-ix.net/) and a 10G circuit to Interxion at Glattbrugg +(Zurich), and of course to Paris. +More details in our post [IPng Arrives in Geneva]({% post_url 2021-07-03-geneva %}). + +## Logical + +As a small operator, we'd love to be able to boast the newest Juniper [PTX10016](https://www.juniper.net/us/en/products/routers/ptx-series.html) +routers but we neither have the rack space, the power budget, and to be +perfectly honest, the monetary budget to run these at IPng Networks. But it +turns out, we know a fair bit about hardware silicon, architecture and the +controlplane software running on commercial routers. + +We've decided to go a different route. In our opinion, at speeds under 100Gbit, +it's perfectly viable to use software routers on off-the-shelf hardware, notably +Intel network cards and CPUs, notably those that have support for the +[Dataplane Development Kit](https://dpdk.org/) (aka DPDK), which offers libraries +to accelerate packet processing workloads, which turn ordinary servers into very +performant routers. Two notable applications are [VPP](https://fd.io/) and +[Danos](https://danosproject.org). + +### VPP + +VPP originally comes from the house of Cisco [[ref](https://www.cisco.com/c/dam/m/en_us/service-provider/ciscoknowledgenetwork/files/592_05_25-16-fdio_is_the_future_of_software_dataplanes-v2.pdf)] and looks quite a bit like +the commercial ASR9k platform. In development since 2002, VPP is production +code currently running in shipping products. It runs in user space on multiple +architectures including x86, ARM, and Power architectures on both x86 servers +and embedded devices. The design of VPP is hardware, kernel, and deployment +(bare metal, VM, container) agnostic. It runs completely in userspace. + +We've contributed a little bit to the Control Plane abstraction [[ref](https://docs.fd.io/vpp/21.06/dc/d2e/clicmd_src_plugins_linux-cp.html)], +which allows users to combine the throughput of a dataplane with usual routing +software like [Bird](https://bird.network.cz/) or [FRR](https://frrouting.org/). +We've been running it in production since December 2020 on `chbtl1.ipng.ch`. +It's our ultimate goal to run VPP and Linux Control Plane on the entire network, +as the design and architecture really resonates with us as software and systems +engineers. + +### DANOS + +The Disaggregated Network Operating System (DANOS) project originally comes +from AT&T’s “dNOS” software framework and provides an open, cost-effective and +flexible alternative to traditional networking equipment. 
As part of The Linux +Foundation, it now incorporates contributions from complementary open source +communities in building a standardized distributed Network Operating System (NOS) +to speed the adoption and use of white boxes in a service provider’s +infrastructure. + +We've been using DANOS since its first release in August 2019, and it's +currently our routing platform of choice -- it combines the sheer speed of +DPDK with a [Vyatta](https://en.wikipedia.org/wiki/Vyatta) command line +interface. As an appliance, care was taken to complete the _whole package_, +with SNMP, YANG interface, image and upgrade management, interface monitoring +with wireshark semantics, et cetera. Performing easily at wire speed 10G +workloads (including 64byte ethernet frames), and being completely open source, +it fits very well with our philosophy of an open and free internet. diff --git a/content/articles/2021-03-27-coloclue-vpp.md b/content/articles/2021-03-27-coloclue-vpp.md new file mode 100644 index 0000000..2db8299 --- /dev/null +++ b/content/articles/2021-03-27-coloclue-vpp.md @@ -0,0 +1,685 @@ +--- +date: "2021-03-27T11:33:00Z" +title: 'Case Study: VPP at Coloclue, part 1' +--- +* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)> +* Reviewers: Coloclue Network Committee <[routers@coloclue.net](mailto:routers@coloclue.net)> +* Status: Draft - Review - **Published** + +## Introduction + +Coloclue AS8283 operates several Linux routers running Bird. Over the years, the performance of their previous hardware platform (Dell R610) has deteriorated, and they were up for renewal. At the same time, network latency/jitter has been very high, and variability may be caused by the Linux router hardware, their used software, the inter-datacenter links, or any combination of these. The routers were replaced with relatively modern hardware. In a [previous post]({% post_url 2021-02-27-coloclue-loadtest %}), I looked into the links between the datacenters, and demonstrated that they are performing as expected (1.41Mpps of 802.1q ethernet frames in both directions). That leaves the software. This post explores a replacement of the Linux kernel routers by a userspace process running VPP, which is an application built on DPDK. + +### Executive Summary + +I was unable to run VPP due to an issue detecting and making use of the Intel x710 network cards in this chassis. While the Intel i210-AT cards worked well, both with the standard `vfio-pci` driver and with an alternative `igb_uio` driver, I did not manage to get the Intel x710 cards to fully work (noting that I have the same Intel x710 NIC working flawlessly in VPP on another Supermicro chassis). See below for a detailed writeup of what I tried and which results were obtained. In the end, I reverted the machine back to its (mostly) original state, with three pertinent changes: + +1. I left the Debian Backports kernel 5.10 running +1. I turned on IOMMU (Intel VT-d was already on), booting with `iommu=pt intel_iommu=on` +1. I left Hyperthreading off in the BIOS (it was on when I started) + +After I restored the machine to its original Linux+Bird configuration, I noticed a marked improvement in latency, jitter and throughput. A combination of these changes is likely beneficial, so I do recommend making this change on all Coloclue routers, while we continue our quest for faster, more stable network performance. + +So the bad news is: I did not get to prove that VPP and DPDK are awesome in AS8283. Yet. 
+ +But the good news is: **network performance improved** drastically. I'll take it :) + +### Timeline + +| | | +| ---- | ---- | +| {{< image src="/assets/coloclue-vpp/image0.png" alt="AS15703" width="450px" >}} | {{< image src="/assets/coloclue-vpp/image1.png" alt="AS12859" width="450px" >}} | + +The graph on the left shows latency from AS15703 (True) in EUNetworks to a Coloclue machine hosted in NorthC. As far as Smokeping is concerned, latency has been quite poor for as long as it can remember (at least a year). The graph on the right shows the latency from AS12859 (BIT) to the beacon on `185.52.225.1/24` which is announced only on dcg-1, on the day this project was carried out. + +Looking more closely at the second graph: + +**Sunday 07:30**: The machine was put into maintenance, which made the latency jump. This is because the beacon was no longer reachable directly behind dcg-1 from AS12859 over NL-IX, but via an alternative path which traversed several more Coloclue routers, hence higher latency and jitter/loss. + +**Sunday 11:00**: I rolled back the VPP environment on the machine, restoring it to its original configuration, except running kernel 5.10 and with Intel VT-d and Hyperthreading both turned off in the BIOS. A combination of those changes has definitely worked wonders. See also the `mtr` results down below. + +**Sunday 14:50**: Because I didn't want to give up, and because I expected a little more collegiality from my friend dcg-1, I gave it another go by enabling IOMMU and PT, booting the 5.10 kernel with `iommu=pt` and `intel_iommu=on`. Now, with the `igb_uio` driver loaded, VPP detected both the i210 and x710 NICs, however it did not want to initialize the 4th port on the NIC (this was `enp1s0f3`, the port to Fusix Networks), and the port `eno1` only partially worked (IPv6 was fine, IPv4 was not). During this second attempt though, the rest of VPP and Bird came up, including NL-IX, the LACP, all internal interfaces, IPv4 and IPv6 OSPF and all BGP peering sessions with members. + +**Sunday 16:20**: I could not in good faith turn on eBGP peers though, because of the interaction with `eno1` and `enp1s0f3` described in more detail below. I then ran out of time, and restored service with Linux 5.10 kernel and the original Bird configuration, now with Intel VT-d turned on and IOMMU/PT enabled in the kernel. + +### Quick Overview + +This paper, at a high level, discusses the following: + +1. Gives a brief introduction of VPP and its new Linux CP work +1. Discusses a means to isolate a /24 on exactly one Coloclue router +1. Demonstrates changes made to run VPP, even though they were not applied +1. Compares latency/throughput before-and-after in a surprising improvement, unrelated to VPP + +## 1. Introduction to VPP + +VPP stands for _Vector Packet Processing_. In development since 2002, VPP is production code currently running in shipping products. It runs in user space on multiple architectures including x86, ARM, and Power architectures on both x86 servers and embedded devices. The design of VPP is hardware, kernel, and deployment (bare metal, VM, container) agnostic. It runs ***completely in userspace***. VPP helps push extreme limits of performance and scale. Independent testing shows that, at scale, VPP-powered routers are two orders of magnitude faster than currently available technologies. + +The Linux (and BSD) kernel is not optimized for network I/O. 
Each packet (or in some implementations, a small batch of packets) generates an interrupt which causes the kernel to stop what it's doing, schedule the interrupt handler, do the necessary steps in the networking stack for each individual packet in turn: layer2 input, filtering, NAT session matching and packet rewriting, IP next-hop lookup, interface and L2 next-hop lookup, and marshalling the packet back onto the network, or handing it over to an application running on the local machine. And it does this for each packet, one after another. + +VPP removes these inefficiencies in a few ways: +* VPP does not use interrupts, does not use the kernel network driver, and does not use the kernel networking stack at all. Instead, it attaches directly to the PCI device and polls the network card directly for incoming packets. +* Once network traffic gets busier, VPP constructs a _collection of packets_ called a _vector_, to pass through a directed graph of smaller functions. There's a clear performance benefit of such an architecture: the first packet from the vector will possibly hit a cold instruction/data cache in the CPU, but the second through Nth packet from the vector will execute on a hot cache and not need most/any memory access, executing at an order of magnitude faster or even better. +* VPP is multithreaded and can have multiple cores polling and executing receive and transmit queues for network interfaces at the same time. Routing information (like next hops, forwarding tables, etc) needs to be carefully maintained, but in principle, VPP scales linearly with the number of cores. + +It is straightforward to obtain 10Mpps of forwarding throughput per CPU core, so a 32-core machine (handling 320Mpps) can realistically saturate 21x10Gbit interfaces (at 14.88Mpps). A similar 32-core machine, if it has sufficient PCI slots and network cards, can route an internet mixture of traffic at throughputs of roughly 492Gbit (320Mpps at 650Kpps per 10G of imix). + +VPP, upon startup, will disassociate the NICs from the kernel and bind them into the `vpp` process, which will promptly run at 100% CPU due to its DPDK polling. There's a tool `vppctl` which allows the operator to configure the VPP process: create interfaces, set attributes like link state, MTU, MPLS, Bonding, IPv4/IPv6 addresses and add/remove routes in the _forwarding information base_ (or FIB). VPP further works with plugins that add specific functionality; examples are LLDP, DHCP, IKEv2, NAT, DSLITE, Load Balancing, Firewall ACLs, GENEVE, VXLAN, VRRP, and Wireguard, to name but a few popular ones. + +### Introduction to Linux CP Plugin + +However, notably (or perhaps notoriously), VPP is only a dataplane application; it does not have any routing protocols like `OSPF` or `BGP`. A relatively new plugin is called the Linux Control Plane (or LCP), and it consists of two parts: one is public and one is under development at the time of this article. The first plugin allows the operator to create a Linux `tap` interface and pass through or _punt_ traffic from the dataplane into it. This way, the userspace VPP application creates a link back into the kernel, and an interface (eg. `vpp0`) appears. Input packets in VPP have all input features applied (firewall, NAT, session matching, etc), and if the packet is sent to an IP address with an LCP pair associated with it, it is punted to the `tap` device. So if on the Linux side, the same IP address is put on the resulting `vpp0` device, Linux will see it.
Responses from the kernel into the `tap` device are picked up by the Linux CP plugin and re-injected into the dataplane, and all output features of VPP are applied. This makes bidirectional traffic possible. You can read up on the Linux CP plugin in the [VPP documentation](https://docs.fd.io/vpp/21.06/d6/ddb/lcp_8api.html). + +Here's a barebones example of plumbing the VPP interface `GigabitEthernet7/0/0` through a network device `vpp0` in the `dataplane` network namespace. +``` +pim@vpp-west:~$ sudo systemctl restart vpp +pim@vpp-west:~$ vppctl lcp create GigabitEthernet7/0/0 host-if vpp0 namespace dataplane +pim@vpp-west:~$ sudo ip netns exec dataplane ip link +1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 + link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 +12: vpp0: mtu 9000 qdisc mq state DOWN mode DEFAULT group default qlen 1000 + link/ether 52:54:00:8a:0e:97 brd ff:ff:ff:ff:ff:ff + +pim@vpp-west:~$ vppctl show interface + Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count +GigabitEthernet7/0/0 1 down 9000/0/0/0 +local0 0 down 0/0/0/0 +tap1 2 up 9000/0/0/0 +``` + +### Introduction to Linux NL Plugin + +You may be wondering, what happens with interface addresses or static routes? Usually, a userspace application like `ip link add` or `ip address add` or a higher level process like `bird` or `FRR` will want to set routes towards next hops upon interfaces using routing protocols like `OSPF` or `BGP`. The Linux kernel picks these events up and can share them as so called `netlink` messages with interested parties. Enter the second plugin (the one that is under development at the moment), which is a netlink listener. Its job is to pick up netlink messages from the kernel and apply them to the VPP dataplane. With the Linux NL plugin enabled, events like adding or removing links, addresses, routes, set linkstate or MTU, will all mirrored into the dataplane. I'm hoping the netlink code will be released in the upcoming VPP release, but contact me any time if you'd like to discuss details of the code, which can be found currently under community review in the [VPP Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122) + +Building on the example above, with this Linux NL plugin enabled, we can now manipulate VPP state from Linux, for example creating an interface and adding an IPv4 address to it (of course, IPv6 works just as well!): +``` +pim@vpp-west:~$ sudo ip netns exec dataplane ip link set vpp0 up mtu 1500 +pim@vpp-west:~$ sudo ip netns exec dataplane ip addr add 2001:db8::1/64 dev vpp0 +pim@vpp-west:~$ sudo ip netns exec dataplane ip addr add 10.0.13.2/30 dev vpp0 +pim@vpp-west:~$ sudo ip netns exec dataplane ping -c1 10.0.13.1 +PING 10.0.13.1 (10.0.13.1) 56(84) bytes of data. 
+64 bytes from 10.0.13.1: icmp_seq=1 ttl=64 time=0.591 ms + +--- 10.0.13.1 ping statistics --- +1 packets transmitted, 1 received, 0% packet loss, time 0ms +rtt min/avg/max/mdev = 0.591/0.591/0.591/0.000 ms + +pim@vpp-west:~$ vppctl show interface + Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count +GigabitEthernet7/0/0 1 up 1500/0/0/0 rx packets 4 + rx bytes 268 + tx packets 14 + tx bytes 1140 + drops 2 + ip4 2 +local0 0 down 0/0/0/0 +tap1 2 up 9000/0/0/0 rx packets 10 + rx bytes 796 + tx packets 2 + tx bytes 140 + ip4 1 + ip6 8 + +pim@vpp-west:~$ vppctl show interface address +GigabitEthernet7/0/0 (up): + L3 10.0.13.2/30 + L3 2001:db8::1/64 +local0 (dn): +tap1 (up): +``` + +As can be seen above, setting the link state up, setting the MTU, adding an address were all captured by the Linux NL plugin and applied in the dataplane. Further to this, the Linux NL plugin also synchronizes route updates into the _forwarding information base_ (or FIB) of the dataplane: +``` +pim@vpp-west:~$ sudo ip netns exec dataplane ip route add 100.65.0.0/24 via 10.0.13.1 + +pim@vpp-west:~$ vppctl show ip fib 100.65.0.0 +ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ] +100.65.0.0/24 fib:0 index:15 locks:2 + lcp-rt refs:1 src-flags:added,contributing,active, + path-list:[27] locks:2 flags:shared, uPRF-list:19 len:1 itfs:[1, ] + path:[34] pl-index:27 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, + 10.0.13.1 GigabitEthernet7/0/0 + [@0]: ipv4 via 10.0.13.1 GigabitEthernet7/0/0: mtu:1500 next:5 flags:[] 52540015f82a5254008a0e970800 +``` + +***Note***: I built the code for VPP v21.06 including the Linux CP and Linux NL plugins at tag `21.06-rc0~476-g41cf6e23d` on Debian Buster for the rest of this project, to match the operating system in use on Coloclue routers. I did this without additional modifications (even though I must admit, I do know of a few code paths in the netlink handler that still trigger a crash, and I have a few fixes in my client at home, so I'll be careful to avoid the pitfalls for now :-). + + +## 2. Isolating a Device Under Test + +Coloclue has several routers, so to ensure that the traffic traverses only the one router under test, I decided to use an allocated but currently unused IPv4 prefix and announce that only from one of the four routers, so that all traffic to and from that /24 goes over that router. Coloclue uses a piece of software called [Kees](https://github.com/coloclue/kees.git), a set of Python and Jinja2 scripts to generate a Bird1.6 configuration for each router. This is great because that allows me to add a small feature to get what I need: **beacons**. + +A beacon is a prefix that is sent to (some, or all) peers on the internet to attract traffic in a particular way. I added a function called `is_coloclue_beacon()` which reads the input YAML file and uses a construction similar to the existing feature for "supernets". It determines if a given prefix must be announced to peers and upstreams. Any IPv4 and IPv6 prefixes from the `beacons` list will be then matched in `is_coloclue_beacon()` and announced. + +Based on a per-router config (eg. `vars/dcg-1.router.nl.coloclue.net.yml`) I can now add the following YAML stanza: +``` +coloclue: + beacons: + - prefix: "185.52.225.0" + length: 24 + comment: "VPP test prefix (pim)" +``` + +Because tinkering with routers in the _Default Free Zone_ is a great way to cause an outage, I need to ensure that the code I wrote was well tested. 
I first ran `./update-routers.sh check` with no beacon config. This succeeded: +``` +[...] +checking: /opt/router-staging/dcg-1.router.nl.coloclue.net/bird.conf +checking: /opt/router-staging/dcg-1.router.nl.coloclue.net/bird6.conf +checking: /opt/router-staging/dcg-2.router.nl.coloclue.net/bird.conf +checking: /opt/router-staging/dcg-2.router.nl.coloclue.net/bird6.conf +checking: /opt/router-staging/eunetworks-2.router.nl.coloclue.net/bird.conf +checking: /opt/router-staging/eunetworks-2.router.nl.coloclue.net/bird6.conf +checking: /opt/router-staging/eunetworks-3.router.nl.coloclue.net/bird.conf +checking: /opt/router-staging/eunetworks-3.router.nl.coloclue.net/bird6.conf +``` + +And I made sure that the generated function is indeed empty: +``` +function is_coloclue_beacon() +{ + # Prefix must fall within one of our supernets, otherwise it cannot be a beacon. + if (!is_coloclue_more_specific()) then return false; + return false; +} +``` + +Then, I ran the configuration again with one IPv4 beacon set on dcg-1, and still all the bird configs on both IPv4 and IPv6 for all routers parsed correctly, and the generated function on the dcg-1 IPv4 filters file was populated: +``` +function is_coloclue_beacon() +{ + # Prefix must fall within one of our supernets, otherwise it cannot be a beacon. + if (!is_coloclue_more_specific()) then return false; + if (net = 185.52.225.0/24) then return true; /* VPP test prefix (pim) */ + return false; +} +``` + +I then wired up the function into `function ebgp_peering_export()` and submitted the beacon configuration above, as well as a static route for that beacon prefix to a server running in the NorthC (previously called DCG) datacenter. You can read the details in this [Kees commit](https://github.com/coloclue/kees/commit/3710f1447ade10384c86f35b2652565b440c6aa6). The dcg-1 router is connected to [NL-IX](https://nl-ix.net/), so it's expected that after this configuration went live, peers can now see that prefix only via NL-IX, and it's a more specific of the overlapping supernet (which is `185.52.224.0/22`). + +And indeed, a traceroute now only traverses dcg-1 as seen from peer BIT (AS12859 coming from NL-IX): +``` + 1. lo0.leaf-sw4.bit-2b.network.bit.nl + 2. lo0.leaf-sw6.bit-2a.network.bit.nl + 3. xe-1-3-1.jun1.bit-2a.network.bit.nl + 4. coloclue.the-datacenter-group.nl-ix.net + 5. vpp-test.ams.ipng.ch +``` + +As well as return traffic from Coloclue to that peer: +``` + 1. bond0-100.dcg-1.router.nl.coloclue.net + 2. bit.bit2.nl-ix.net + 3. lo0.leaf-sw6.bit-2a.network.bit.nl + 4. lo0.leaf-sw4.bit-2b.network.bit.nl + 5. sandy.ipng.nl +``` + +## 3. Installing VPP + +First, I need to ensure that the machine is reliably reachable via its IPMI interface (normally using serial-over-LAN, but also Remote KVM just to be sure). This is required because all network interfaces above will be bound by VPP, and if the `vpp` process ever were to crash, it will be restarted without configuration. On a production router, one would expect there to be a configuration daemon that can persist a configuration and recreate it in case of a server restart or dataplane crash.
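+
+As a quick sanity check, it's worth confirming that out-of-band access actually works before VPP takes over the NICs. A minimal sketch with `ipmitool`, using a hypothetical BMC address and credentials (placeholders, not the actual Coloclue OOB setup):
+```
+# Check power state and attach to the serial-over-LAN console of the BMC.
+# 192.0.2.10 / admin / secret are placeholders for illustration only.
+ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret chassis power status
+ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret sol activate
+```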
+ +Before we start, let's build VPP with our two beautiful plugins, copy them to dcg-1, and install all the supporting packages we'll need: +``` +pim@vpp-builder:~/src/vpp$ make install-dep +pim@vpp-builder:~/src/vpp$ make build +pim@vpp-builder:~/src/vpp$ make build-release +pim@vpp-builder:~/src/vpp$ make pkg-deb +pim@vpp-builder:~/src/vpp$ dpkg -c build-root/vpp-plugin-core*.deb | egrep 'linux_(cp|nl)_plugin' +-rw-r--r-- root/root 92016 2021-03-27 12:06 ./usr/lib/x86_64-linux-gnu/vpp_plugins/linux_cp_plugin.so +-rw-r--r-- root/root 57208 2021-03-27 12:06 ./usr/lib/x86_64-linux-gnu/vpp_plugins/linux_nl_plugin.so +pim@vpp-builder:~/src/vpp$ scp build-root/*.deb root@dcg-1.nl.router.coloclue.net:/root/vpp/ + +pim@dcg-1:~$ sudo apt install libmbedcrypto3 libmbedtls12 libmbedx509-0 libnl-3-200 \ + libnl-route-3-200 libnuma1 python3-cffi python3-cffi-backend python3-ply python3-pycparser +pim@dcg-1:~$ sudo dpkg -i /root/vpp/*.deb +pim@dcg-1:~$ sudo usermod -a -G vpp pim +``` + +On a BGP-speaking router, `netlink` messages can come in rather quickly as peers come and go. Due to an unfortunate design choice in the Linux kernel, messages are not buffered for clients, which means that a buffer overrun can occur. To avoid this, I'll raise the netlink socket size to 64MB, leveraging a feature that will create a producer queue in the Linux NL plugin, so that VPP can try to drain the messages from the kernel into its memory as quickly as possible. To be able to raise the netlink socket buffer size, we need to set some variables with `sysctl` (take note as well of the usual variables VPP wants to set with regards to [hugepages](https://wiki.debian.org/Hugepages) in `/etc/sysctl.d/80-vpp.conf`, which the Debian package installs for you): +``` +pim@dcg-1:~$ cat << EOF | sudo tee /etc/sysctl.d/81-vpp-netlink.conf +# Increase netlink to 64M +net.core.rmem_default=67108864 +net.core.wmem_default=67108864 +net.core.rmem_max=67108864 +net.core.wmem_max=67108864 +EOF +pim@dcg-1:~$ sudo sysctl -p /etc/sysctl.d/81-vpp-netlink.conf /etc/sysctl.d/80-vpp.conf +``` + +### VPP Configuration + +Now that I'm sure traffic to and from `185.52.225.0/24` will go over dcg-1, let's take a look at the machine itself. It has six network interfaces: two onboard Intel i210 gigabit ports and one Intel x710-DA4 quad-tengig network card. To run VPP, the network cards in the machine need to be supported in [Intel's DPDK](https://www.dpdk.org/) libraries. The ones in this machine are all OK (but as we'll see later, problematic for unexplained reasons): + +``` +root@dcg-1:~# lspci | grep Ether +01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) +01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) +01:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) +01:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) +06:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03) +07:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03) +``` + +To handle the inbound traffic, netlink messages and other internal memory structures, I'll allocate 2GB of _hugepages_ to the VPP process. I'll then of course enable the two Linux CP plugins.
Because VPP has a lot of statistics counters (for example, a few stats for each used prefix in its _forwarding information base_ or FIB), I will need to give it more than the default of 32MB of stats memory. I'd like to execute a few commands to further configure the VPP runtime upon startup, so I'll add a startup-config stanza. Finally, although on a production router I would, here I will not specify the DPDK interfaces, because I know that VPP will take over any supported network card that is in link down state upon startup. As long as I boot the machine with unconfigured NICs, I will be good. + +So, here's the configuration I end up adding to `/etc/vpp/startup.conf`: +``` +unix { + startup-config /etc/vpp/vpp-exec.conf +} + +memory { + main-heap-size 2G + main-heap-page-size default-hugepage +} + +plugins { + path /usr/lib/x86_64-linux-gnu/vpp_plugins + plugin linux_cp_plugin.so { enable } + plugin linux_nl_plugin.so { enable } +} + +statseg { + size 128M +} + +# linux-cp { +# default netns dataplane +# } +``` + +***Note***: It is important to isolate the `tap` devices into their own Linux network namespace. If this is not done, packets arriving via the dataplane will not have a route up and into the kernel for interfaces VPP is not aware of, making those kernel-enabled interfaces unreachable. Due to the use of a network namespace, all applications in Linux will have to be run in that namespace (think: `bird`, `sshd`, `snmpd`, etc) and the firewall rules with `iptables` will also have to be carefully applied into that namespace. Considering that for this test we are using all interfaces in the dataplane, this point is moot, and we'll take a small shortcut and introduce the `tap` devices in the default namespace. + +In the configuration file, I added a `startup-config` (also known as `exec`) stanza. This is a set of VPP CLI commands that will be executed every time the process starts. It's a great way to get the VPP plumbing done ahead of time. I figured, if I let VPP take the network cards, but then re-present `tap` interfaces with the same names that the Linux kernel driver would've given them, the rest of the machine will mostly just work. + +So the final trick is to disable every interface in `/etc/network/interfaces` on dcg-1 and then configure it with a combination of a `/etc/vpp/vpp-exec.conf` and a small shell script that puts the IP addresses and things back just the way Debian would've put them using the `/etc/network/interfaces` file. Here we go! + +``` +# Loopback interface +create loopback interface instance 0 +lcp create loop0 host-if lo0 + +# Core: dcg-2 +lcp create GigabitEthernet6/0/0 host-if eno1 + +# Infra: Not used.
+lcp create GigabitEthernet7/0/0 host-if eno2 + +# LACP to Arista core switch +create bond mode lacp id 0 +set interface state TenGigabitEthernet1/0/0 up +set interface mtu packet 1500 TenGigabitEthernet1/0/0 +set interface state TenGigabitEthernet1/0/1 up +set interface mtu packet 1500 TenGigabitEthernet1/0/1 +bond add BondEthernet0 TenGigabitEthernet1/0/0 +bond add BondEthernet0 TenGigabitEthernet1/0/1 +set interface mtu packet 1500 BondEthernet0 +lcp create BondEthernet0 host-if bond0 + +# VLANs on bond0 +create sub-interfaces BondEthernet0 100 +lcp create BondEthernet0.100 host-if bond0.100 +create sub-interfaces BondEthernet0 101 +lcp create BondEthernet0.101 host-if bond0.101 +create sub-interfaces BondEthernet0 102 +lcp create BondEthernet0.102 host-if bond0.102 +create sub-interfaces BondEthernet0 120 +lcp create BondEthernet0.120 host-if bond0.120 +create sub-interfaces BondEthernet0 201 +lcp create BondEthernet0.201 host-if bond0.201 +create sub-interfaces BondEthernet0 202 +lcp create BondEthernet0.202 host-if bond0.202 +create sub-interfaces BondEthernet0 205 +lcp create BondEthernet0.205 host-if bond0.205 +create sub-interfaces BondEthernet0 206 +lcp create BondEthernet0.206 host-if bond0.206 +create sub-interfaces BondEthernet0 2481 +lcp create BondEthernet0.2481 host-if bond0.2481 + +# NLIX +lcp create TenGigabitEthernet1/0/2 host-if enp1s0f2 +create sub-interfaces TenGigabitEthernet1/0/2 7 +lcp create TenGigabitEthernet1/0/2.7 host-if enp1s0f2.7 +create sub-interfaces TenGigabitEthernet1/0/2 26 +lcp create TenGigabitEthernet1/0/2.26 host-if enp1s0f2.26 + +# Fusix Networks +lcp create TenGigabitEthernet1/0/3 host-if enp1s0f3 +create sub-interfaces TenGigabitEthernet1/0/3 108 +lcp create TenGigabitEthernet1/0/3.108 host-if enp1s0f3.108 +create sub-interfaces TenGigabitEthernet1/0/3 110 +lcp create TenGigabitEthernet1/0/3.110 host-if enp1s0f3.110 +create sub-interfaces TenGigabitEthernet1/0/3 300 +lcp create TenGigabitEthernet1/0/3.300 host-if enp1s0f3.300 +``` + +And then to set up the IP address information, a small shell script: +``` +ip link set lo0 up mtu 16384 +ip addr add 94.142.247.1/32 dev lo0 +ip addr add 2a02:898:0:300::1/128 dev lo0 + +ip link set eno1 up mtu 1500 +ip addr add 94.142.247.224/31 dev eno1 +ip addr add 2a02:898:0:301::12/127 dev eno1 + +ip link set eno2 down + +ip link set bond0 up mtu 1500 +ip link set bond0.100 up mtu 1500 +ip addr add 94.142.244.252/24 dev bond0.100 +ip addr add 2a02:898::d1/64 dev bond0.100 +ip link set bond0.101 up mtu 1500 +ip addr add 172.28.0.252/24 dev bond0.101 +ip link set bond0.102 up mtu 1500 +ip addr add 94.142.247.44/29 dev bond0.102 +ip addr add 2a02:898:0:e::d1/64 dev bond0.102 +ip link set bond0.120 up mtu 1500 +ip addr add 94.142.247.236/31 dev bond0.120 +ip addr add 2a02:898:0:301::6/127 dev bond0.120 +ip link set bond0.201 up mtu 1500 +ip addr add 94.142.246.252/24 dev bond0.201 +ip addr add 2a02:898:62:f6::fffd/64 dev bond0.201 +ip link set bond0.202 up mtu 1500 +ip addr add 94.142.242.140/28 dev bond0.202 +ip addr add 2a02:898:100::d1/64 dev bond0.202 +ip link set bond0.205 up mtu 1500 +ip addr add 94.142.242.98/27 dev bond0.205 +ip addr add 2a02:898:17::fffe/64 dev bond0.205 +ip link set bond0.206 up mtu 1500 +ip addr add 185.52.224.92/28 dev bond0.206 +ip addr add 2a02:898:90:1::2/125 dev bond0.206 +ip link set bond0.2481 up mtu 1500 +ip addr add 94.142.247.82/29 dev bond0.2481 +ip addr add 2a02:898:0:f::2/64 dev bond0.2481 + +ip link set enp1s0f2 up mtu 1500 +ip link set enp1s0f2.7 up mtu 1500 +ip addr add 
193.239.117.111/22 dev enp1s0f2.7 +ip addr add 2001:7f8:13::a500:8283:1/64 dev enp1s0f2.7 +ip link set enp1s0f2.26 up mtu 1500 +ip addr add 213.207.10.53/26 dev enp1s0f2.26 +ip addr add 2a02:10:3::a500:8283:1/64 dev enp1s0f2.26 + +ip link set enp1s0f3 up mtu 1500 +ip link set enp1s0f3.108 up mtu 1500 +ip addr add 94.142.247.243/31 dev enp1s0f3.108 +ip addr add 2a02:898:0:301::15/127 dev enp1s0f3.108 +ip link set enp1s0f3.110 up mtu 1500 +ip addr add 37.139.140.23/31 dev enp1s0f3.110 +ip addr add 2a00:a7c0:e20b:110::2/126 dev enp1s0f3.110 +ip link set enp1s0f3.300 up mtu 1500 +ip addr add 185.1.94.15/24 dev enp1s0f3.300 +ip addr add 2001:7f8:b6::205b:1/64 dev enp1s0f3.300 +``` + +## 4. Results + +And this is where it went horribly wrong. After installing the VPP packages on the dcg-1 machine, running Debian Buster on a Supermicro Super Server/X11SCW-F with BIOS 1.5 dated 10/12/2020, the `vpp` process was unable to bind the PCI devices for the Intel x710 NICs. I tried the following combinations: + +* Stock Buster kernel `4.19.0-14-amd64` and Backports kernel `5.10.0-0.bpo.3-amd64`. +* The kernel driver `vfio-pci` and the DKMS for `igb_uio` from Debian package `dpdk-igb-uio-dkms`. +* Intel IOMMU off, on and strict (kernel boot parameter `intel_iommu=on` and `intel_iommu=strict`) +* BIOS setting for Intel VT-d on and off. + +Each time, I would start VPP with an explicit `dpdk {}` stanza, and observed the following. With the default `vfio-pci` driver, the VPP process would not start, and instead it would be spinning loglines: + +``` +[ 74.378330] vfio-pci 0000:01:00.0: Masking broken INTx support +[ 74.384328] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0 +## Repeated for all of the NICs 0000:01:00.[0123] +``` +Commenting out the `dpdk { dev 0000:01:00.* }` devices would allow it to start, detect the two i210 NICs, which both worked fine. + +With the `igb_uio` driver, VPP would start, but not detect the x710 devices at all, it would detect the two i210 NICs, but they would not pass traffic or even link up: +``` +[ 139.495061] igb_uio 0000:01:00.0: uio device registered with irq 128 +[ 139.522507] DMAR: DRHD: handling fault status reg 2 +[ 139.528383] DMAR: [DMA Read] Request device [01:00.0] PASID ffffffff fault addr 138dac000 [fault reason 06] PTE Read access is not set +## Repeated for all 6 NICs +``` + +I repeated this test of both drivers for all combinations of kernel, IOMMU and BIOS settings for VT-d, with exactly identical results. + +### Baseline + +In a traceroute from BIT to Coloclue (using Junipers on hops 1-3, Linux kernel routing on hop 4), it's clear that (a) only NL-IX is used on hop 4, which means that only dcg-1 is in the path and no other routers at Coloclue. From hop 4 onwards, one can clearly see high variance, with a 49.7ms standard deviation on a **~247.1ms** worst case, even though the end to end latency is only 1.6ms and the NL-IX port is not congested. +``` +sandy (193.109.122.4) 2021-03-27T22:36:11+0100 +Keys: Help Display mode Restart statistics Order of fields quit + Packets Pings + Host Loss% Snt Last Avg Best Wrst StDev + 1. lo0.leaf-sw4.bit-2b.network.bit.nl 0.0% 4877 0.3 0.2 0.1 7.8 0.2 + 2. lo0.leaf-sw6.bit-2a.network.bit.nl 0.0% 4877 0.3 0.2 0.2 1.1 0.1 + 3. xe-1-3-1.jun1.bit-2a.network.bit.nl 0.0% 4877 0.5 0.3 0.2 9.3 0.7 + 4. coloclue.the-datacenter-group.nl-ix.net 0.2% 4877 1.8 18.3 1.7 253.5 45.0 + 5. 
vpp-test.ams.ipng.ch 0.1% 4877 1.9 23.6 1.6 247.1 49.7 +``` + +On the return path, seen by a traceroute from Coloclue to BIT (using Linux kernel routing on hop 1, Junipers on hops 2-4), it becomes clear that the very first hop (the Linux machine dcg-1) is contributing to high variance, with a 49.4ms standard deviation on a **257.9ms** worst case, again on an NL-IX port that was not congested, and with smooth sailing in BIT's 10Gbit network from there on. +``` +vpp-test (185.52.225.1) 2021-03-27T21:36:43+0000 +Keys: Help Display mode Restart statistics Order of fields quit + Packets Pings + Host Loss% Snt Last Avg Best Wrst StDev + 1. bond0-100.dcg-1.router.nl.coloclue.net 0.1% 4839 0.2 12.9 0.1 251.2 38.2 + 2. bit.bit2.nl-ix.net 0.0% 4839 10.7 22.6 1.4 261.8 48.3 + 3. lo0.leaf-sw5.bit-2a.network.bit.nl 0.0% 4839 1.8 20.9 1.6 263.0 46.9 + 4. lo0.leaf-sw3.bit-2b.network.bit.nl 0.0% 4839 155.7 22.7 1.4 282.6 50.9 + 5. sandy.ede.ipng.nl 0.0% 4839 1.8 22.9 1.6 257.9 49.4 +``` + +### New Configuration + +As I mentioned, I had expected this article to have a different outcome, in that I would've wanted to show off the superior routing performance under VPP of the beacon `185.52.225.0/24`, which is found from AS12859 (BIT) via NL-IX directly through dcg-1. Alas, I did not manage to get the Intel x710 NIC to work with VPP, so I ultimately rolled back, but kept a few settings (Intel VT-d enabled and IOMMU on, hyperthreading disabled, Linux kernel 5.10 which uses a much newer version of the `i40e` driver for the NIC). + +That combination definitely helped: the latency is now very smooth between BIT and Coloclue, with a mean of 1.7ms, worst case 4.3ms and a standard deviation of **0.2ms** only. That is as good as you could expect: +``` +sandy (193.109.122.4) 2021-03-28T16:20:05+0200 +Keys: Help Display mode Restart statistics Order of fields quit + Packets Pings + Host Loss% Snt Last Avg Best Wrst StDev + 1. lo0.leaf-sw4.bit-2b.network.bit.nl 0.0% 4342 0.3 0.2 0.2 0.4 0.1 + 2. lo0.leaf-sw6.bit-2a.network.bit.nl 0.0% 4342 0.3 0.2 0.2 0.9 0.1 + 3. xe-1-3-1.jun1.bit-2a.network.bit.nl 0.0% 4341 0.4 1.0 0.3 28.3 2.3 + 4. coloclue.the-datacenter-group.nl-ix.net 0.0% 4341 1.8 1.8 1.7 3.4 0.1 + 5. vpp-test.ams.ipng.ch 0.0% 4341 1.8 1.7 1.7 4.3 0.2 +``` + +On the return path, seen by a traceroute again from Coloclue to BIT, it becomes clear that dcg-1 is no longer causing jitter or loss, at least not to NL-IX and AS12859. The latency there is likewise an expected 1.8ms, with a worst case of 3.5ms and a standard deviation of **0.1ms**, in other words comparable to the BIT --> Coloclue path: +``` +vpp-test (185.52.225.1) 2021-03-28T14:20:50+0000 +Keys: Help Display mode Restart statistics Order of fields quit + Packets Pings + Host Loss% Snt Last Avg Best Wrst StDev + 1. bond0-100.dcg-1.router.nl.coloclue.net 0.0% 4303 0.2 0.2 0.1 0.9 0.1 + 2. bit.bit2.nl-ix.net 0.0% 4303 1.6 2.2 1.4 17.1 2.2 + 3. lo0.leaf-sw5.bit-2a.network.bit.nl 0.0% 4303 1.8 1.7 1.6 6.6 0.4 + 4. lo0.leaf-sw3.bit-2b.network.bit.nl 0.0% 4303 1.6 1.5 1.4 4.2 0.2 + 5. sandy.ede.ipng.nl 0.0% 4303 1.9 1.8 1.7 3.5 0.1 + +``` + +## Appendix + +Assorted set of notes -- because I did give it "one last try" and managed to get VPP to almost work on this Coloclue router :) +* Boot kernel 5.10 with `intel_iommu=on iommu=pt` +* Load kernel module `igb_uio` and unload `vfio-pci` before starting VPP + +What follows is a quick sketch of that driver swap, and then a bunch of debugging information -- useful perhaps for a future attempt at running VPP at Coloclue.
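+
+A hedged sketch of doing that swap by hand (module names as in the logs below; the exact commands are an assumption, and VPP's `dpdk { uio-driver igb_uio }` stanza normally takes care of the binding itself):
+```
+# Unload vfio-pci and load the igb_uio module from dpdk-igb-uio-dkms.
+modprobe -r vfio-pci
+modprobe igb_uio
+lsmod | grep uio
+# Optionally, inspect and bind the x710 ports by hand with DPDK's devbind helper.
+dpdk-devbind.py --status
+dpdk-devbind.py --bind=igb_uio 0000:01:00.0 0000:01:00.1 0000:01:00.2 0000:01:00.3
+```
+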
+``` +root@dcg-1:/etc/vpp# tail -10 startup.conf +dpdk { + uio-driver igb_uio + dev 0000:06:00.0 + dev 0000:07:00.0 + + dev 0000:01:00.0 + dev 0000:01:00.1 + dev 0000:01:00.2 + dev 0000:01:00.3 +} + +root@dcg-1:/etc/vpp# lsmod | grep uio +uio_pci_generic 16384 0 +igb_uio 20480 5 +uio 20480 12 igb_uio,uio_pci_generic + +[ 39.211999] igb_uio: loading out-of-tree module taints kernel. +[ 39.218094] igb_uio: module verification failed: signature and/or required key missing - tainting kernel +[ 39.228147] igb_uio: Use MSIX interrupt by default +[ 91.595243] igb 0000:06:00.0: removed PHC on eno1 +[ 91.716041] igb_uio 0000:06:00.0: mapping 1K dma=0x101c40000 host=0000000095299b4e +[ 91.723683] igb_uio 0000:06:00.0: unmapping 1K dma=0x101c40000 host=0000000095299b4e +[ 91.733221] igb 0000:07:00.0: removed PHC on eno2 +[ 91.856255] igb_uio 0000:07:00.0: mapping 1K dma=0x101c40000 host=0000000095299b4e +[ 91.863918] igb_uio 0000:07:00.0: unmapping 1K dma=0x101c40000 host=0000000095299b4e +[ 91.988718] igb_uio 0000:06:00.0: uio device registered with irq 127 +[ 92.039935] igb_uio 0000:07:00.0: uio device registered with irq 128 +[ 105.040391] i40e 0000:01:00.0: i40e_ptp_stop: removed PHC on enp1s0f0 +[ 105.232452] igb_uio 0000:01:00.0: mapping 1K dma=0x103a64000 host=00000000bc39c074 +[ 105.240108] igb_uio 0000:01:00.0: unmapping 1K dma=0x103a64000 host=00000000bc39c074 +[ 105.249142] i40e 0000:01:00.1: i40e_ptp_stop: removed PHC on enp1s0f1 +[ 105.472489] igb_uio 0000:01:00.1: mapping 1K dma=0x180187000 host=000000003182585c +[ 105.480148] igb_uio 0000:01:00.1: unmapping 1K dma=0x180187000 host=000000003182585c +[ 105.489178] i40e 0000:01:00.2: i40e_ptp_stop: removed PHC on enp1s0f2 +[ 105.700497] igb_uio 0000:01:00.2: mapping 1K dma=0x12108a000 host=000000006ccf7ec6 +[ 105.708160] igb_uio 0000:01:00.2: unmapping 1K dma=0x12108a000 host=000000006ccf7ec6 +[ 105.717272] i40e 0000:01:00.3: i40e_ptp_stop: removed PHC on enp1s0f3 +[ 105.916553] igb_uio 0000:01:00.3: mapping 1K dma=0x121132000 host=00000000a0cf9ceb +[ 105.924214] igb_uio 0000:01:00.3: unmapping 1K dma=0x121132000 host=00000000a0cf9ceb +[ 106.051801] igb_uio 0000:01:00.0: uio device registered with irq 127 +[ 106.131501] igb_uio 0000:01:00.1: uio device registered with irq 128 +[ 106.211155] igb_uio 0000:01:00.2: uio device registered with irq 129 +[ 106.288722] igb_uio 0000:01:00.3: uio device registered with irq 130 +[ 106.367089] igb_uio 0000:06:00.0: uio device registered with irq 130 +[ 106.418175] igb_uio 0000:07:00.0: uio device registered with irq 131 + +### Note above: Gi6/0/0 and Te1/0/3 both use irq 130. + +root@dcg-1:/etc/vpp# vppctl show log | grep dpdk +2021/03/28 15:57:09:184 notice dpdk EAL: Detected 6 lcore(s) +2021/03/28 15:57:09:184 notice dpdk EAL: Detected 1 NUMA nodes +2021/03/28 15:57:09:184 notice dpdk EAL: Selected IOVA mode 'PA' +2021/03/28 15:57:09:184 notice dpdk EAL: No available hugepages reported in hugepages-1048576kB +2021/03/28 15:57:09:184 notice dpdk EAL: No free hugepages reported in hugepages-1048576kB +2021/03/28 15:57:09:184 notice dpdk EAL: No available hugepages reported in hugepages-1048576kB +2021/03/28 15:57:09:184 notice dpdk EAL: Probing VFIO support... +2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! Base virtual address hint (0xa80001000 != 0x7eff80000000) not respected! +2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes +2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! 
Base virtual address hint (0xec0c61000 != 0x7efb7fe00000) not respected! +2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes +2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! Base virtual address hint (0xec18c2000 != 0x7ef77fc00000) not respected! +2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes +2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! Base virtual address hint (0xec2523000 != 0x7ef37fa00000) not respected! +2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes +2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0 +2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.0 (socket 0) +2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0 +2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.1 (socket 0) +2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0 +2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.2 (socket 0) +2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0 +2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.3 (socket 0) +2021/03/28 15:57:09:184 notice dpdk i40e_init_fdir_filter_list(): Failed to allocate memory for fdir filter array! +2021/03/28 15:57:09:184 notice dpdk ethdev initialisation failed +2021/03/28 15:57:09:184 notice dpdk EAL: Requested device 0000:01:00.3 cannot be used +2021/03/28 15:57:09:184 notice dpdk EAL: VFIO support not initialized +2021/03/28 15:57:09:184 notice dpdk EAL: Couldn't map new region for DMA + + +root@dcg-1:/etc/vpp# vppctl show pci +Address Sock VID:PID Link Speed Driver Product Name Vital Product Data +0000:01:00.0 0 8086:1572 8.0 GT/s x8 igb_uio +0000:01:00.1 0 8086:1572 8.0 GT/s x8 igb_uio +0000:01:00.2 0 8086:1572 8.0 GT/s x8 igb_uio +0000:01:00.3 0 8086:1572 8.0 GT/s x8 igb_uio +0000:06:00.0 0 8086:1533 2.5 GT/s x1 igb_uio +0000:07:00.0 0 8086:1533 2.5 GT/s x1 igb_uio + +root@dcg-1:/etc/vpp# ip ro +94.142.242.96/27 dev bond0.205 proto kernel scope link src 94.142.242.98 +94.142.242.128/28 dev bond0.202 proto kernel scope link src 94.142.242.140 +94.142.244.0/24 dev bond0.100 proto kernel scope link src 94.142.244.252 +94.142.246.0/24 dev bond0.201 proto kernel scope link src 94.142.246.252 +94.142.247.40/29 dev bond0.102 proto kernel scope link src 94.142.247.44 +94.142.247.80/29 dev bond0.2481 proto kernel scope link src 94.142.247.82 +94.142.247.224/31 dev eno1 proto kernel scope link src 94.142.247.224 +94.142.247.236/31 dev bond0.120 proto kernel scope link src 94.142.247.236 +172.28.0.0/24 dev bond0.101 proto kernel scope link src 172.28.0.252 +185.52.224.80/28 dev bond0.206 proto kernel scope link src 185.52.224.92 +193.239.116.0/22 dev enp1s0f2.7 proto kernel scope link src 193.239.117.111 +213.207.10.0/26 dev enp1s0f2.26 proto kernel scope link src 213.207.10.53 + +root@dcg-1:/etc/vpp# birdc6 show ospf neighbors +BIRD 1.6.6 ready. +ospf1: +Router ID Pri State DTime Interface Router IP +94.142.247.2 1 Full/PtP 00:35 eno1 fe80::ae1f:6bff:feeb:858c +94.142.247.7 128 Full/PtP 00:35 bond0.120 fe80::9ecc:8300:78b2:8b62 + +root@dcg-1:/etc/vpp# birdc show ospf neighbors +BIRD 1.6.6 ready. 
+ospf1: +Router ID Pri State DTime Interface Router IP +94.142.247.2 1 Exchange/PtP 00:37 eno1 94.142.247.225 +94.142.247.7 128 Exchange/PtP 00:39 bond0.120 94.142.247.237 + + +root@dcg-1:/etc/vpp# vppctl show bond details +BondEthernet0 + mode: lacp + load balance: l2 + number of active members: 2 + TenGigabitEthernet1/0/0 + TenGigabitEthernet1/0/1 + number of members: 2 + TenGigabitEthernet1/0/0 + TenGigabitEthernet1/0/1 + device instance: 0 + interface id: 0 + sw_if_index: 6 + hw_if_index: 6 + +root@dcg-1:/etc/vpp# ping 193.239.116.1 +PING 193.239.116.1 (193.239.116.1) 56(84) bytes of data. +64 bytes from 193.239.116.1: icmp_seq=1 ttl=64 time=2.24 ms +64 bytes from 193.239.116.1: icmp_seq=2 ttl=64 time=0.571 ms +64 bytes from 193.239.116.1: icmp_seq=3 ttl=64 time=0.625 ms +^C +--- 193.239.116.1 ping statistics --- +3 packets transmitted, 3 received, 0% packet loss, time 5ms +rtt min/avg/max/mdev = 0.571/1.146/2.244/0.777 ms + +root@dcg-1:/etc/vpp# ping 94.142.244.85 +PING 94.142.244.85 (94.142.244.85) 56(84) bytes of data. +64 bytes from 94.142.244.85: icmp_seq=1 ttl=64 time=0.226 ms +64 bytes from 94.142.244.85: icmp_seq=2 ttl=64 time=0.207 ms +64 bytes from 94.142.244.85: icmp_seq=3 ttl=64 time=0.200 ms +64 bytes from 94.142.244.85: icmp_seq=4 ttl=64 time=0.204 ms +^C +--- 94.142.244.85 ping statistics --- +4 packets transmitted, 4 received, 0% packet loss, time 66ms +rtt min/avg/max/mdev = 0.200/0.209/0.226/0.014 ms +``` + +### Cleaning up + +``` +apt purge dpdk* vpp* +apt autoremove +rm -rf /etc/vpp +rm /etc/sysctl.d/*vpp*.conf + +cp /etc/network/interfaces.2021-03-28 /etc/network/interfaces +cp /root/.ssh/authorized_keys.2021-03-28 /root/.ssh/authorized_keys +systemctl enable bird +systemctl enable bird6 +systemctl enable keepalived +reboot +``` + +### Next steps + +Taking another look at IOMMU and PT [redhat thread](https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/installation_guide/appe-configuring_a_hypervisor_host_for_pci_passthrough) and in particular the part about `allow_unsafe_interrupts` in the kernel module. Find some ways to get the NICs (1x Intel x710 and 2x Intel i210) to detect in VPP. By then, probably the Linux CP (Interface mirroring and Netlink listener) will be submitted. diff --git a/content/articles/2021-05-17-frankfurt.md b/content/articles/2021-05-17-frankfurt.md new file mode 100644 index 0000000..3309c0a --- /dev/null +++ b/content/articles/2021-05-17-frankfurt.md @@ -0,0 +1,144 @@ +--- +date: "2021-05-17T22:27:34Z" +title: IPng arrives in Frankfurt +--- + +I've been planning a network expansion for a while now. For the next few weeks, +I will be in total geek-mode as I travel to several European cities to deploy +AS50869 on a european ring. At the same time, my buddy Fred from +[IP-Max](https://ip-max.net/) has been wanting to go to Amsterdam. IP-Max's +[network](https://as25091.peeringdb.com/) is considerably larger than mine, but +it just never clicked with the right set of circumstances for them to deploy +in the Netherlands, until the stars aligned ... + +## Leadup to the Roadtrip + +Usually, IP-Max deploys their routers by having them shipped into the destination +location, but this time was special. We decided to make a roadtrip out of it, +so Fred made his way from Geneva to Brüttisellen, stayed the night, and early +on Monday May 17th, we packed up the car and started our trek. 
+ +It turns out we had estimated our risk profile completely wrong - we thought it +would be hard to cross the border into Germany due to the ongoing pandemic, but +actually that part was fine. The Germans had opened their borders for transit +traffic and stays of up to 24hrs just a few days earlier, and we both got a +(negative) PCR test, so we felt we had our bases covered. + +## The Border + +Then when we arrived at the border, perhaps because we had Geneva license +plates, we were asked about our trip, business or pleasure, and we shared that +we had some equipment with us. Thus began the four-and-a-half-hour customs +exercise that was necessary for us to safely send our equipment off to +the European Union. One would think it should be easy, but it actually wasn't +quite that easy, considering we arrived at the border at 9am on a Monday, and +the traffic into Switzerland had all the expeditor and logistics +companies queueing up, so nobody really was willing to help us out. But we made it and left +again shortly after 1:30pm. + +## Frankfurt + +{{< image width="300px" float="right" src="/assets/network/defra0-rack.png" alt="IP-Max at Frankfurt" >}} + +We arrived at Frankfurt Equinix FR5 at the Kleyerstrasse at around 5pm. The +IP-Max rack was quickly found, and while Fred was installing their corporate +Xen host to run remote VMs for the Frankfurt area, I deployed the first router +of the trip: **defra0.ipng.ch**. + +IP-Max at this location has a respectable 30G of DWDM capacity from three +different vendors into Zurich, 30G of LAG capacity towards DE-CIX, and a +10G DWDM wave into Anzin (France), which will be broken up for us in Amsterdam +for a future blogpost - stay tuned :) + +Making use of line card and route processor redundancy, we decided to use +three line cards, reserving one TenGig ethernet port on each: + +* Te0/0/0/4 -- EoMPLS to NTT/eShelter Rümlang (**chrma0.ipng.ch**) +* Te0/1/0/4 -- EoMPLS to Interxion ZUR1 (**chgtg0.ipng.ch**) +* Te0/2/0/4 -- EoMPLS to Amsterdam NIKHEF (**nlams0.ipng.ch**) + +At each site, specifically those that are a bit further away, I deploy a +standard issue [PCEngines APU](https://pcengines.ch/) with 802.11ac WiFi, +serial, and IPMI access to any machine that may be there. If you ever visit +a datacenter floor where I'm present, look for SSID _AS50869 FRA_ in the +case of Kleyerstrasse. The password is _IPngGuest_, you're welcome to some +bits of bandwidth in a pinch :) + +You can see my router dangling off what looks like a fiber optic umbilical +cord under **er01.fra05.ip-max.net**, right at the heart of the Frankfurt +internet. + +### Logical Configuration + +{{< image width="300px" float="left" src="/assets/network/console-fra.png" alt="console-fra.ipng.nl" >}} + +**console.fra.ipng.nl** At the top of the rack you can also see the blue APU3 +with its WiFi antennas. It takes an IPv4 /29 and IPv6 /64 from IP-Max AS25091, +which gives me access to my equipment even if bad things happen (and they will, +it's just a matter of time!). It also exposes a WireGuard endpoint so that I can access +it even without the need for SSH, which can come in useful if a KVM console is +required. Note the logo :-) + +On the inside, the APU configures one RFC1918 WiFi segment and another +RFC1918 wired segment. In this case, the wired segment is connected to the +IPMI port of the Supermicro router. I have really gotten used to this style +of deployment -- I **start** with the OOB.
Once the APU has power (and it does +not need to have an uplink yet), I can already SSH to it from the wireless +segment, and further configure it. Once it's done, I make a habit of rebooting +it to ensure it comes up. Then, I can easily configure (and even entirely +install!!) the server behind it using IPMI serial-over-lan and HTML5 KVM +if need be. It's delicious. And, it has saved my ass several times over the +years! + +{{< image width="300px" float="left" src="/assets/network/defra0.png" alt="defra0.ipng.ch" >}} + +**defra0.ipng.ch** Making use of the line card redundancy, there is now 3x +10Gig connected to my router, which immediately makes it one of the better +connected hosts in this facility. Logging in via IPMI, the [DANOS](https://danosproject.org) +image is quickly configured. There's one link to Interxion ZUR1 in Glattbrugg, +one link to eShelter in Rümlang, and one link up to Amsterdam. The +interface towards Interxion ZUR1 doubles up as an egresspoint for now. There +will be an IPv4/IPv6 transit session with AS25091, a [DE-CIX](https://de-cix.net) +connection and possibly but probably not a [Kleyrex](https://kleyrex.net) +connection, were it not for the murderous cross connect costs at this facility. + +## The results + +{{< image width="100px" float="right" src="/assets/network/iperf-chgtg0-defra0.png" alt="iperf" >}} + +After the OSPF and OSPFv3 adjacencies came up, iBGP was next. For now, the +machine is single-homed off of **chrma0.ipng.ch** but soon there will be as +well a leg towards Amsterdam. So for now, all that we can do is test basic +connectivity. So after finishing our trip to Amsterdam, and checking into +our AirBnB ready to go through our quarantine song-and-dance, we spent a +little time celebrating - we arrived at 1:30am, and turned in for the night +at 3am. The next day, our groceries arrived, somehow unfortunately I had to +be "well prepared" and ordered them to be delivered between 7-8am on Tuesday. + +After a full day of _regular work_, we spent the evening taking a look at +how my kit performs, and we are happy to report it's absolutely great: +``` +pim@defra0:~$ iperf3 -c chgtg0.ipng.ch -P 10 +... +[SUM] 0.00-10.00 sec 11.2 GBytes 9.63 Gbits/sec 281 sender +[SUM] 0.00-10.02 sec 11.2 GBytes 9.56 Gbits/sec receiver + +pim@defra0:~$ iperf3 -c chgtg0.ipng.ch -P 10 -R +... +[SUM] 0.00-10.01 sec 10.2 GBytes 8.73 Gbits/sec 550 sender +[SUM] 0.00-10.00 sec 10.1 GBytes 8.70 Gbits/sec receiver + +pim@defra0:~$ ping4 chrma0.ipng.ch +PING chrma0.ipng.ch (194.1.163.0) 56(84) bytes of data. +... +--- chrma0.ipng.ch ping statistics --- +9 packets transmitted, 9 received, 0% packet loss, time 20ms +rtt min/avg/max/mdev = 5.864/6.022/6.173/0.072 ms +``` + +The roundtrip latency to Zurich is about 6.0ms, and the performance is north of +9Gbit in both directions for my router. Soon, we will go to Amsterdam, and +deploy router number two (of four!) on this epic roadtrip: **nlams0.ipng.ch** +which is a bucket list item of mine -- to peer at Amsterdam Science Park. + +More on that later! diff --git a/content/articles/2021-05-26-amsterdam.md b/content/articles/2021-05-26-amsterdam.md new file mode 100644 index 0000000..429d65e --- /dev/null +++ b/content/articles/2021-05-26-amsterdam.md @@ -0,0 +1,216 @@ +--- +date: "2021-05-26T21:19:34Z" +title: IPng arrives in Amsterdam +--- + +I've been planning a network expansion for a while now. For the next few weeks, +I will be in total geek-mode as I travel to several European cities to deploy +AS50869 on a european ring. 
At the same time, my buddy Fred from +[IP-Max](https://ip-max.net/) has been wanting to go to Amsterdam. IP-Max's +[network](https://as25091.peeringdb.com/) is considerably larger than mine, but +it just never clicked with the right set of circumstances for them to deploy +in the Netherlands, until the stars aligned ... + +## Leadup for IP-Max + +Usually, if I were to go deploy somewhere with IP-Max, I settle down on top of +(or underneath, or in some way physically close to) their router at a point of +presence of theirs. In Amsterdam though, it was different ... because IP-Max +had not yet _built_ a PoP here. + +But I ask: why would that stop us? Fred told me last year that he had always +wanted to build out a PoP in Amsterdam, but somehow he never really found the +time. I offered to do the work to organize the local supplier chain, get a +good spot in a well connected place, long haul to France and Germany, and +otherwise exercise my (social) network to get it done. + +In March 2021, I stumbled across rackspace at [NIKHEF](https://barm.nikhef.nl/housing/) +by working with the folks from [ERITAP](https://eritap.com/) who got their hands +on something that is less of a commodity: a full rack + power (the facility is +always chronically oversubscribed). + +A few chores on the tasklist: +1. Sign for rackspace. Check. +1. Order the IP-Max standard-issue _small pop_ kit, which consists of: + * One Cisco [ASR9001](https://www.cisco.com/c/en/us/products/collateral/routers/asr-9001-router/data_sheet_c78-685687.html) + * One Nexus [3064PQ](https://www.cisco.com/c/en/us/support/switches/nexus-3064-switch/model.html) + * One PCEngines [APU4](https://www.pcengines.ch/apu4c4.htm) for out-of-band + * And all the power/copper/fiber cables, optics, serial dongles we might need +1. Procure an out-of-band provider for our APU4, easily found at NIKHEF (thanks, [Arend](https://eritap.com/))! +1. Get connectivity in and out of Amsterdam! + +### Connectivity + +The most important piece of planning is around the long haul connectivity. +Considering IP-Max already operates a circuit from Frankfurt (Germany) to +Anzin (France), I arranged for that link to be rerouted through Amsterdam +and broken into two segments: Frankfurt-Amsterdam and Amsterdam-Anzin. I +was like a kid in a candy store being able to meticulously choose the route +that the fiber takes -- over Düsseldorf, entering the Netherlands at +Emmerik, over Arnhem and Ede, and to Amsterdam. A very direct route, using +a 10Gig DWDM wave. + +The other span goes from Amsterdam through Antwerp (Belgium) and Brussels +and finally landing in Anzin (near Lille, France), which was the previous +10Gig DWDM wave, so there is no increased latency even though the link is +broken up in Amsterdam. Yaay! + +Delivery of the DWDM waves was ordered on March 30th, and although it should +normally take 25 working days to deliver, for some awkward reason with the +supplier it was going to take way longer than what we could afford, so a +spot of VP style escalation took place, and oh look! Now it would take four +weeks to turn up, was completed last Friday, which was just in time +for our trip. Double yaay! + +## Staging Amsterdam + +{{< image width="300px" float="right" src="/assets/network/nlams0-staging.png" alt="Staging Amsterdam" >}} + +Because this is a completely new site for [IP-Max](https://ip-max.net) as well +as [IPng](https://ipng.ch/), we'll have to do a bit more work. 
And this suites +us just fine, because after driving through Frankfurt (see my [previous post]({% post_url 2021-05-17-frankfurt %})), +to the Netherlands, we have to stay in quarantine for five days (or, ten if we +happen to fail our PCR test after five days!), which gives us plenty of time +to stage and configure what will be our Cisco **er01.ams01.ip-max.net** and +our Nexus **as01.ams01.ip-max.net**. + +Of course, figuring out how all of this fits together is a nice exercise, and +we planned to just _plug and play_ the ASR9k, which worked out rather +successfully by the way, so it had to be completely configured ahead of time. +We created the interfaces, DNS, routing protocols like OSPF, OSPFv3, MPLS/LDP, +BGP and all of the good stuff like ACLs, accounts and et cetera. + +We staged the stuff in the laundry room of our AirBnB, being actually quite +grateful once the staging was complete and we could turn the machines off. + +For IPng, staging **nlams0.ipng.ch** was already done ahead of time. So all +I really needed for it, was to ensure that the EoMPLS circuits were created +ahead of time. I was really looking forward to seeing if we could beat 14ms +to Amsterdam on the [IP-Max](https://ip-max.net/) network. + +## Extracurriculars + +{{< image width="400px" float="right" src="/assets/network/airbnb-staging.png" alt="Dell AirBNB" >}} + +Besides the staging, we also ate some pretty delicious food: +* Mushroom risotto +* HotPot with Arend and Esther +* Chicken vegetable soup +* Tacos w/ Tapas +* Steak w/ broccoli and potatoes +* Red tuna w/ beans and herbs + +But we also took the time to explore a little bit, for example on Kaz's new +boat through the canals and over the river Amstel. But mostly: we sat home and +enjoyed our quarantine the best we could :-) + +## Deployment (day 1) +{{< image width="400px" float="right" src="/assets/network/nlams0-staging-day1.png" alt="IP-Max Staging" >}} + +First before the day started, I drained the Frankfurt-Anzin link by raising +OSPF cost on **er01.fra01.ip-max.net** and **er01.lil02.ip-max.net** while Fred +notified customers and the IP-Max team of the impending update to the network. + +We met up with [ERITAP](https://eritap.com/) on Monday 24th, or target deploy +date. We had labeled and packed up all of our gear, grabbed the car, and made +our way to the Watergraafsmeer to the place where the Internet landed in Europe +in 1982. Almost 40 years later, here we are: IP-Max is moving in! + +The physical work was not very exciting. The Nexus, ASR, two APUs and my own +Supermicro were racked in only a few minutes. But then the interesting bits +begin -- how do we connect all of this without making a _Kabelsalat_ that you +so often see in people's racks. + +{{< image width="400px" float="right" src="/assets/network/nlams0.png" alt="nlams0" >}} +But yet at the same time, both Fred and I were enthusiastic and couldn't wait +to see the ping time to Anzin and Frankfurt from here. I left Fred the honors +to connect his own brand new **er01.ams01.ip-max.net** by opening the patched +through loop from our supplier, and he was beaming once he saw OSPF and OSPFv3 +adjacencies and a latency of just short of 6ms. But he was very kind to let +me do the second honors to connect the router to Anzin, at just over 5ms. That +is a really fantastic performance and very short path indeed. This will be fun +for my next adventure, I'm sure. We'll see the Dell pictured above appear as +**frlil0.ipng.ch** but I get ahead of myself .. 
+ +After we connected the whole thing up and did extensive ping tests, we +undrained the spans and saw a respectable 600Mbit of traffic traverse the +new router. Because there were a few other folks tinkering in the rack (for +example our friends from [Coloclue](https://coloclue.net/)), we decided to +adjourn for the day and visit Paul and Henrieke up in Almere for a fabulous +homecooked meal (thanks again for the Picaña!) and we enjoyed being +followed by the cops when driving back out of Almere -- but we were not +bothered/hassled by them. + +## Deployment (day 2) + +{{< image width="400px" float="right" src="/assets/network/nlams0-pim-sad.png" alt="Pim Cries" >}} +But then (and this is technically day 2 because it was, let's just say, well +after midnight), as the IP-Max network calmed down for the night I did my +stress test and came upon a horrible surprise: interface errors! They were +Frame Check Sequence errors, and while the performance from **defra0.ipng.ch** to +**nlams0.ipng.ch** was impeccable (9.2Gbit, yaay), the transfer speeds in +the reverse direction stalled out at about 35Mbit. That is **NOT** what +the Doctor ordered! + +So luckily we had already decided to go back for a day 2 to complete the +rack install, mostly for things like the fiber patch panel for IP-Max +customers in the ERITAP rack, and to ensure that our power, serial and +network cables would not come loose, because packets don't like loose +cables. Certainly we should avoid the electrons or photons falling onto +the floor... + +But the weird thing about my link errors (as seen by the ASR9k) was that +usually the problem is either a duplex error (which was OK), or a dirty +fiber or transceiver (which was unlikely considering this link was an +SFP+ DAC!). So that leaves either a faulty Cisco or a faulty Supermicro, +neither of which is appealing. + +On day two, after breakfast, we had to do a few chores first (like the claim +for the VAT for imports, see our [previous post]({% post_url 2021-05-17-frankfurt %})), +as well as get a corona PCR test for the way to France (which was absolutely +horrible, by the way; I can *still* feel my nose, which was violated). So we hit +NIKHEF at around 4pm to finish the job and take care of a few small favors for +Coloclue, ERITAP and Byteworks, who are also in the same rack as IPng and +IP-Max. + +## The results + +After I replaced the DAC (ironically with an SFP+ optic), once OSPF and iBGP +came back to life, this is what it looked like: + +``` +pim@chumbucket:~$ traceroute nlams0.ipng.ch +traceroute to nlams0.ipng.ch (194.1.163.32), 30 hops max, 60 byte packets + 1 chbtl1.ipng.ch (194.1.163.67) 0.292 ms 0.216 ms 0.179 ms + 2 chgtg0.ipng.ch (194.1.163.19) 0.599 ms 0.565 ms 0.531 ms + 3 chrma0.ipng.ch (194.1.163.8) 0.873 ms 0.840 ms 0.806 ms + 4 defra0.ipng.ch (194.1.163.25) 6.783 ms 6.751 ms 6.718 ms + 5 nlams0.ipng.ch (194.1.163.32) 12.864 ms 12.831 ms 12.798 ms + +pim@nlams0:~$ iperf3 -P 10 -c chgtg0.ipng.ch +... +[SUM] 0.00-10.00 sec 11.0 GBytes 9.49 Gbits/sec 95 sender +[SUM] 0.00-10.01 sec 11.0 GBytes 9.41 Gbits/sec receiver + +pim@nlams0:~$ iperf3 -P 10 -c chgtg0.ipng.ch -R +... +[SUM] 0.00-10.01 sec 10.0 GBytes 8.62 Gbits/sec 339 sender +[SUM] 0.00-10.00 sec 9.98 GBytes 8.57 Gbits/sec receiver +``` + +{{< image width="400px" float="right" src="/assets/network/nlams0-pim-happy.png" alt="Pim Laughs" >}} +That will do, thanks.
I can hardly believe that the latency from my basement
+workstation in Brüttisellen, Switzerland, to the local internet
+exchange is 0.8ms, through to Frankfurt it is 6.2ms, and all the way
+to Amsterdam the end-to-end round trip latency is 12.2ms. I can stare at
+the smokeping for hours!!
+
+So I spent the remainder of the night hanging out with Fred while pumping
+9Gbit in both directions for 2 hours while traffic was low. It's one thing
+to do an `iperf` in your basement rack, but it's an entirely different feeling
+to do an `iperf` spanning three countries in Europe (CH, DE and NL). I will
+note that the spans from Zurich to Frankfurt didn't even get warm, although
+the one from Frankfurt to Amsterdam kind of broke a sweat for a little while
+there ...
+
+And the coolest thing yet? We're not done with this trip.
diff --git a/content/articles/2021-05-28-lille.md b/content/articles/2021-05-28-lille.md
new file mode 100644
index 0000000..bd293ba
--- /dev/null
+++ b/content/articles/2021-05-28-lille.md
@@ -0,0 +1,106 @@
+---
+date: "2021-05-28T22:16:44Z"
+title: IPng arrives in Lille
+---
+
+I've been planning a network expansion for a while now. For the next few weeks,
+I will be in total geek-mode as I travel to several European cities to deploy
+AS50869 on a European ring. At the same time, my buddy Fred from
+[IP-Max](https://ip-max.net/) has been wanting to go to Amsterdam. IP-Max's
+[network](https://as25091.peeringdb.com/) is considerably larger than mine, but
+it just never clicked with the right set of circumstances for them to deploy
+in the Netherlands, until the stars aligned ...
+
+## Deployment
+
+{{< image width="300px" float="right" src="/assets/network/lille-civ1.png" alt="Lille CIV1" >}}
+
+After our adventure in [Amsterdam]({% post_url 2021-05-26-amsterdam %}), and
+after Fred and I both got negative PCR test results, we made our way down to
+Lille, France. There are two datacenters there where IP-Max has a presence,
+and they are very innovative ones. There's a specific trick with a block of
+ice that allows the facility cooling to run autonomously in case
+of a chiller or power failure. I got to see the storage of the ice cube :)
+
+Now, because Fred is possibly even more enthusiastic about our eurotrip than
+I am, he insisted that I also put a machine here in Lille, even though
+originally it was meant to be only Frankfurt, Amsterdam, Paris and Zurich.
+After some tough negotiations, I reluctantly agreed. While I did not have a
+Supermicro, I did have some freshly procured Dell R610s, the old machines
+from Coloclue AS8283, which they have recently upgraded to newer routers.
+I installed the pair at EUNetworks for Coloclue, and another teammate
+installed the pair at DCG/NorthC, after which four of these units were left
+written off and available. And I would not be me if I were not to accept
+some perfectly serviceable second hand vintage Dells :-)
+
+So the plan is: we drop an R610 here now, and I ship a replacement _standard
+issue_ Supermicro + APU3d3/WiFi later.
+
+### Connectivity
+
+{{< image width="300px" float="right" src="/assets/network/frggh0-rack.png" alt="Lille Rack" >}}
+
+By now I'm getting pretty good at creating L2VPN EoMPLS circuits, so I
+created one for myself from the Amsterdam router **er01.ams01.ip-max.net**
+to the one here at **er01.lil01.ip-max.net**. The one here is a Cisco
+ASR9006, a respectable machine.
In the second point of presence, in
+the neighboring town of Anzin, there is an ASR9010 called **er01.lil02.ip-max.net**,
+but that one is tucked away in a telco room reserved for deities, not
+in the normal server room which is available for _plebs_ like me.
+
+The inbound span goes from Amsterdam through Antwerp (Belgium) and Brussels,
+finally landing in Anzin, and from there it's dark fiber to this rack.
+I realised that because I use [UN/LOCODE](https://unece.org/trade/cefact/unlocode-code-list-country-and-territory),
+I should be precise in my naming. The town here is a suburb of Lille,
+the country's fourth biggest city after Paris, Marseille and Lyon. The
+town itself is called Sainghin-en-Mélantois, which resolves to
+**FR GGH**, and thus my temporary Dell R610 will be called **frggh0.ipng.ch**.
+
+Fred happened to have a spare Intel X552 in his bag, so I commandeered it and
+gave the machine two legs of TenGig, one going to the ASR9006 and the other
+going to the Nexus3064PQ under it. Soon, we will connect here to the local
+internet exchange [Lillix](https://www.lillix.fr/membres/). There are very
+few non-local, let alone international, members at Lillix, but considering
+larger clubs like Zayo require two or more French peering points, this will
+be my ticket to some pretty good peers. Nice!
+
+The idea is that one VLL lands me in Amsterdam, and the other will eventually
+land me in Paris at Telehouse TH2. But that will be after the weekend, as
+first we need to spend some quality time exploring Lille and recreating the
+_grignotage_ (English: nibbling) that must be done in the north of France.
+
+## The results
+
+As always on the IP-Max network, they speak for themselves. During daytime,
+the connectivity from my basement to Frankfurt is at 6.7ms, to Amsterdam
+it's 13ms and all the way to Lille it's 20.3ms, with throughputs that
+are, let's just say, _line rate_ **booyah**!
+
+```
+pim@chumbucket:~$ traceroute frggh0.ipng.ch
+traceroute to frggh0.ipng.ch (194.1.163.34), 30 hops max, 60 byte packets
+ 1  chbtl1.ipng.ch (194.1.163.67)  0.246 ms  0.214 ms  0.135 ms
+ 2  chgtg0.ipng.ch (194.1.163.19)  0.513 ms  0.478 ms  0.445 ms
+ 3  chrma0.ipng.ch (194.1.163.8)  0.633 ms  0.599 ms  0.658 ms
+ 4  defra0.ipng.ch (194.1.163.25)  6.808 ms  6.773 ms  6.740 ms
+ 5  nlams0.ipng.ch (194.1.163.27)  13.090 ms  13.057 ms  13.024 ms
+ 6  frggh0.ipng.ch (194.1.163.34)  20.370 ms  20.550 ms  20.473 ms
+
+pim@frggh0:~$ iperf3 -c chrma0.ipng.ch -P 10 -R
+...
+[SUM]   0.00-10.02  sec  8.84 GBytes  7.58 Gbits/sec  271            sender
+[SUM]   0.00-10.00  sec  8.75 GBytes  7.52 Gbits/sec                 receiver
+
+pim@defra0:~$ iperf3 -P 10 -c frggh0.ipng.ch -R
+...
+[SUM]   0.00-10.02  sec  11.1 GBytes  9.54 Gbits/sec  292            sender
+[SUM]   0.00-10.00  sec  11.1 GBytes  9.51 Gbits/sec                 receiver
+```
+
+After the weekend, we'll be driving on to Paris to complete the ring, after
+which I will have two different ways to traverse it -- clockwise from
+Zurich to Geneva, Paris, Lille, Amsterdam and Frankfurt, or counterclockwise
+on the same ring. There will be 10Gbit between each of my routers in each
+direction. We do not compromise on quality, throughput or latency over here.
+
+I could not be happier with the service provided so far. Paris, here we come!!
diff --git a/content/articles/2021-06-01-paris.md b/content/articles/2021-06-01-paris.md new file mode 100644 index 0000000..021cb08 --- /dev/null +++ b/content/articles/2021-06-01-paris.md @@ -0,0 +1,185 @@ +--- +date: "2021-06-01T18:16:42Z" +title: IPng arrives in Paris +--- + +I've been planning a network expansion for a while now. For the next few weeks, +I will be in total geek-mode as I travel to several European cities to deploy +AS50869 on a european ring. At the same time, my buddy Fred from +[IP-Max](https://ip-max.net/) has been wanting to go to Amsterdam. IP-Max's +[network](https://as25091.peeringdb.com/) is considerably larger than mine, but +it just never clicked with the right set of circumstances for them to deploy +in the Netherlands, until the stars aligned ... + +## First a word + +I just wanted to start with a note on how special it is to partner with [IP-Max](https://www.ip-max.net/) +and having known the founder Fred for so long affords me this specific trip. +It must be known that building a 10G pan-european ring is not an easy thing to +do, and I do appreciate very much the kindness Fred has shown IPng even though +we are under contract -- the ability to travel to every point of presence and +get the _founder's tour_ in each place, get to know the local DC ops, sales +directors, field technicians and local customers, is simply golden. Thank you. + +## Deployment + +{{< image width="300px" float="right" src="/assets/network/frpar0-rack.png" alt="Paris KDDI" >}} + +The city of love, corny as it sounds, also happens to have quite a bit of _fibre +noire_, happily lit by hundreds of local and international carriers. There are +two places in the dead-center of the city, a _cabaret voltaire_ called Telehouse +TH2, and a new facility from [KDDI](https://fra.kddi.com/) which is in the same +physical block, aptly addressed as 65 Rue Léon Frot, 75011 Paris and that +is where my beloved router **frpar0.ipng.ch** will be. Note that in Lille, +actually I had to make do with **frggh0** which was in the town of +Sainghin-en-Mélantois. This one lives in Paris, none of that suburban +bullshit. This is the real deal. There's probably more connectivity in this +one block than in all of the Paris metro combined, maybe even more than +all of France combined -- peut être :) + +After having visited the older location, we took the router, APU, cables and +optics to Léon Frot. The rack was quickly found, and it is obvious that +this is the location: A _fridge_ was awaiting us, and I reserved two Tengig +ports on the ASR9010, one towards Lille and one towards Zürich (which will be +Geneva later on). + +I'm getting the hang of this VLL stuff after our adventures, previously in +[Lille]({% post_url 2021-05-28-lille %}) at CIV, which is a lazy 4.8ms away +from this place (Fred already speaks of making that more like 3.7ms with a +_small call_ to his buddy Laurent). So I went about my business, racking first +the WiFi enabled **console.par.ipng.nl**, connecting to WiFi with it, and +finding my router on **frpar0.ipng.ch** over IPMI serial-over-lan. Configuring +one Tengig port on the Intel X710 towards **er01.lil01.ip-max.net** and another +Tengig port towards **er01.zrh56.ip-max.net**. + +I have two Supermicros on backorder, one of which will go to Lille to replace +the Dell R610 that I placed temporarily in that site, and the other will go to +Geneva. At that point, I will break up this VLL to become one from here to +Geneva and another from Geneva back home to Zürich. 
Seriously though, I will
+have to curb Fred's enthusiasm, because he also mentioned something about
+[LyonIX](https://www.rezopole.net/fr/ixp/lyonix) and another small stopover of
+1HE and 35W over there... this is addictive, save me from myself!!1
+
+## Connectivity on the FLAP
+
+All week, Fred has been talking about _the FLAP_, which I knew to be a term
+but had really never bothered to ask about -- it turns out, he explained,
+that it stands for **F**rankfurt, **L**ondon, **A**msterdam, and **P**aris.
+I'm not in London (yet, please don't dare me ...), however I can legitimately
+claim I am on _the FLAP_ because I have a router in **L**ille. So there's that :)
+
+{{< image float="left" src="/assets/network/frpar0.png" alt="frpar0" >}}
+
+Fred ordered my FranceIX connection this afternoon, delivered from a
+20Gig LAG on **er02.par02.ip-max.net** and directly into my router there. In
+the meantime, I will be busy configuring my DE-CIX port from [a previous post]({% post_url 2021-05-17-frankfurt %}).
+
+The console server here (a standard issue APU3 with 802.11ac WiFi broadcasting
+_AS50869 PAR_ with password _IPngGuest_, you're welcome) connects to the
+router with IPMI, while the router itself connects via USB serial back to the
+APU for maximum resilience.
+
+## A hard knock life
+
+{{< image width="300px" float="right" src="/assets/network/happy-fred.png" alt="happy-fred" >}}
+
+It was not without troubles today. When configuring my VLL to Zurich, I had
+misconfigured the **er01.zrh56.ip-max.net** side (a Cisco 7600, which is on its
+way out), and the VLL would not come up. I could see traffic going in one
+direction but not in the other ... which typically does not make OSPF adjacencies
+happen. After about an hour of messing around, I puppydog-eyed Fred, who
+proceeded to find my bug within 30 seconds: I needed to do some VLAN gymnastics
+by adding `rewrite ingress tag pop 1 symmetric`, as well as adding 4 bytes
+to the MTU (so `mtu 9018` total, cuz packets gotta be sourced directly from the
+[jumbo-jumbo club](https://www.rheinfelderbierhalle.com/)!).
+
+But Fred was miserable as well, because he had updated a Xen hypervisor which
+ended up not being able to boot because of a broken LVM configuration. So we
+literally swapped laptops and while he fixed my VLL, I fixed his volume group
+by running `update-initramfs -u -k all` from a recovery Debian USB stick. For
+an extra bonus, here's a picture of Happy Fred at the local Cafe Leopard, where
+pretty much every day you can find a host of locals and international nerds who
+have emerged from the server floor.
+
+And Mael also helped with the serial port - I had set it to `115200` baud
+but not pinned it at `8N1` for the APU serial console. It shot into life
+as soon as he gave that tip and I committed the config, bravo!
+
+I tend to believe it will not be necessary for me to physically visit the
+facility that often -- simple hardware, no spinning disks, an APU connecting to
+IPMI for full HTML5 based KVM control and serial-over-lan, and the router
+exposing a console back to the APU, which has an OOB network connection from
+[AS25091](https://as25091.peeringdb.com/). Yeah, I think I'll be good.
+
+## The results
+
+But this was a special one indeed, because up until now, my traceroutes kept
+on getting longer and longer as I deployed in Frankfurt, Amsterdam, and Lille.
+
+Deploying in Paris therefore initially looked like this, with the packets taking,
+let us say, the scenic route to my basement:
+
+```
+pim@frpar0:~$ traceroute chumbucket.ipng.nl
+traceroute to chumbucket.ipng.nl (194.1.163.93), 30 hops max, 60 byte packets
+ 1  frggh0.ipng.ch (194.1.163.30)  4.915 ms  4.885 ms  4.866 ms
+ 2  nlams0.ipng.ch (194.1.163.28)  12.396 ms  12.398 ms  12.382 ms
+ 3  defra0.ipng.ch (194.1.163.26)  18.536 ms  18.520 ms  18.541 ms
+ 4  chrma0.ipng.ch (194.1.163.24)  24.572 ms  24.557 ms  24.542 ms
+ 5  chgtg0.ipng.ch (194.1.163.9)  24.549 ms  24.510 ms  24.517 ms
+ 6  chbtl1.ipng.ch (194.1.163.18)  24.707 ms  25.114 ms  25.038 ms
+ 7  chumbucket.ipng.nl (194.1.163.93)  25.320 ms  25.564 ms  25.452 ms
+```
+
+That's quite the scenic route indeed. But! On this glorious day, at exactly
+16:34 UTC, the TenGig European IPv4 and IPv6 ring was closed, with one final set
+of OSPF adjacencies:
+
+```
+pim@frpar0:~$ show protocols ospfv3 neighbor
+Neighbor ID     Pri  DeadTime  State/IfState      Duration  I/F[State]
+194.1.163.34    1    00:00:37  Full/PointToPoint  01:40:51  dp0p6s0f0.100[PointToPoint]
+194.1.163.1     1    00:00:33  Full/PointToPoint  00:01:28  dp0p6s0f1.100[PointToPoint]
+```
+
+This allowed the ring to home in on its shortest paths - eastbound to
+Frankfurt and Amsterdam, and westbound to Paris and Lille. Link and equipment
+failures will not bother me that much: OSPF and OSPFv3 will take care of
+rerouting me around network problems, which, considering the ASR9k gear at
+IP-Max, I expect to be the exception anyway:
+```
+pim@chumbucket:~$ traceroute frggh0.ipng.ch
+traceroute to frggh0.ipng.ch (194.1.163.34), 30 hops max, 60 byte packets
+ 1  chbtl1.ipng.ch (194.1.163.67)  0.317 ms  0.238 ms  0.190 ms
+ 2  chgtg0.ipng.ch (194.1.163.19)  0.619 ms  0.574 ms  0.531 ms
+ 3  frpar0.ipng.ch (194.1.163.40)  15.271 ms  15.226 ms  15.174 ms
+ 4  frggh0.ipng.ch (194.1.163.34)  20.059 ms  20.020 ms  19.977 ms
+
+pim@chumbucket:~$ traceroute nlams0.ipng.ch
+traceroute to nlams0.ipng.ch (194.1.163.32), 30 hops max, 60 byte packets
+ 1  chbtl1.ipng.ch (194.1.163.67)  0.345 ms  0.198 ms  0.284 ms
+ 2  chgtg0.ipng.ch (194.1.163.19)  0.610 ms  0.518 ms  0.538 ms
+ 3  chrma0.ipng.ch (194.1.163.8)  0.732 ms  0.750 ms  0.716 ms
+ 4  defra0.ipng.ch (194.1.163.25)  6.835 ms  6.802 ms  6.767 ms
+ 5  nlams0.ipng.ch (194.1.163.32)  12.799 ms  12.765 ms  12.731 ms
+```
+
+Of course with impeccable throughput, bien sûr:
+
+```
+pim@frpar0:~$ iperf3 -c frggh0.ipng.ch -R
+...
+[ 5]   0.00-10.00  sec  10.7 GBytes  9.22 Gbits/sec    1             sender
+[ 5]   0.00-10.00  sec  10.7 GBytes  9.22 Gbits/sec                  receiver
+
+pim@frpar0:~$ iperf3 -c chgtg0.ipng.ch -R
+...
+[ 5]   0.00-10.01  sec  11.2 GBytes  9.42 Gbits/sec    1             sender
+[ 5]   0.00-10.00  sec  11.2 GBytes  9.42 Gbits/sec                  receiver
+```
+
+I'm tired, but ultimately satisfied with having taken my private AS50869 across
+_the FLAP_, with a physical presence in each city, an IXP connection,
+bidirectional TenGig on the ring, and TenGig IP transit at each location. I
+think this network is good to go for the next few years at least.
diff --git a/content/articles/2021-06-28-as112.md b/content/articles/2021-06-28-as112.md
new file mode 100644
index 0000000..26375ce
--- /dev/null
+++ b/content/articles/2021-06-28-as112.md
@@ -0,0 +1,480 @@
+---
+date: "2021-06-28T10:59:42Z"
+title: Launch of AS112
+---
+
+I'm one of those people who is a fan of low-latency and high performance
+distributed service architectures. After building out the IPng Network across
+Europe, I did notice a rather stark difference in the presence of one particular
+service: **AS112** anycast nameservers.
In particular, I only have one Internet +Exchange in common with a direct presence of AS112, FCIX in California. +Big-up to the kind folks in Fremont who operate [www.as112.net](https://www.as112.net). + +## The Problem + +Looking around Switzerland, no internet exchanges actually have AS112 as a direct +member and as such you'll find the service tucked away behind several ISPs, with +AS paths such as `13030 29670 112`, `6939 112` and `34019 112`. A traceroute +from a popular swiss ISP, [Init7](https://init7.net/) will go to Germany, at a +roundtrip latency of 18.9ms. My own latency is 146ms as my queries are served +from FCIX: + +``` +pim@spongebob:~$ traceroute prisoner.iana.org +traceroute to prisoner.iana.org (192.175.48.1), 64 hops max, 40 byte packets + 1 fiber7.xe8.chbtl0.ipng.ch (194.126.235.33) 2.658 ms 0.754 ms 0.523 ms + 2 1790bre1.fiber7.init7.net (81.6.42.1) 1.132 ms 1.077 ms 3.621 ms + 3 780eff1.fiber7.init7.net (109.202.193.44) 1.238 ms 1.162 ms 1.188 ms + 4 r1win12.core.init7.net (77.109.181.155) 2.096 ms 2.1 ms 2.1 ms + 5 r1zrh6.core.init7.net (82.197.168.222) 2.086 ms 3.904 ms 2.183 ms + 6 r1glb1.core.init7.net (5.180.135.134) 2.043 ms 3.621 ms 2.088 ms + 7 r2zrh2.core.init7.net (82.197.163.213) 2.353 ms 2.522 ms 2.289 ms + 8 r2zrh2.core.init7.net (5.180.135.156) 2.08 ms 2.299 ms 2.202 ms + 9 r1fra3.core.init7.net (5.180.135.173) 7.65 ms 7.582 ms 7.546 ms +10 r1fra2.core.init7.net (5.180.135.126) 7.928 ms 7.831 ms 7.997 ms +11 r1ber1.core.init7.net (77.109.129.8) 19.395 ms 19.287 ms 19.558 ms +12 octalus.in-berlin.a36.community-ix.de (185.1.74.3) 18.839 ms 18.717 ms 29.615 ms +13 prisoner.iana.org (192.175.48.1) 18.536 ms 18.613 ms 18.766 ms + +pim@chumbucket:~$ traceroute blackhole-1.iana.org +traceroute to blackhole-1.iana.org (192.175.48.6), 30 hops max, 60 byte packets + 1 chbtl1.ipng.ch (194.1.163.67) 0.247 ms 0.158 ms 0.107 ms + 2 chgtg0.ipng.ch (194.1.163.19) 0.514 ms 0.474 ms 0.419 ms + 3 usfmt0.ipng.ch (194.1.163.23) 146.451 ms 146.406 ms 146.364 ms + 4 blackhole-1.iana.org (192.175.48.6) 146.323 ms 146.281 ms 146.239 ms +``` + +This path goes to FCIX because it's the only place where AS50869 picks up +AS112 directly, at an internet exchange, and therefore the localpref will +make this route preferred. But that's a long way to go for my DNS queries! + +I think we can do better. + +## Introduction + +Taken from [RFC7534](https://datatracker.ietf.org/doc/html/rfc7534): + +> Many sites connected to the Internet make use of IPv4 addresses that +> are not globally unique. Examples are the addresses designated in +> RFC 1918 for private use within individual sites. +> +> Devices in such environments may occasionally originate Domain Name +> System (DNS) queries (so-called "reverse lookups") corresponding to +> those private-use addresses. Since the addresses concerned have only +> local significance, it is good practice for site administrators to +> ensure that such queries are answered locally. However, it is not +> uncommon for such queries to follow the normal delegation path in the +> public DNS instead of being answered within the site. +> +> It is not possible for public DNS servers to give useful answers to +> such queries. In addition, due to the wide deployment of private-use +> addresses and the continuing growth of the Internet, the volume of +> such queries is large and growing. The AS112 project aims to provide +> a distributed sink for such queries in order to reduce the load on +> the corresponding authoritative servers. 
The AS112 project is named
+> after the Autonomous System Number (ASN) that was assigned to it.
+
+## Deployment
+
+It's actually quite straightforward; the deployment consists of roughly
+three steps:
+
+1. Procure hardware to run the instances of the nameserver on.
+1. Configure the nameserver to serve the zonefiles.
+1. Announce the anycast service locally/regionally.
+
+Let's discuss each in turn.
+
+### Hardware
+
+For the hardware, I've decided to use existing server platforms at IP-Max
+and IPng Networks. There are two types of hardware, both tried and tested:
+one is an HP ProLiant DL380 Gen9, the other an older Dell PowerEdge R610.
+
+Considering each vendor ships its own specific parts, many
+appliance vendors choose to virtualize their environment such that the guest
+operating system finds a very homogeneous configuration. For my purposes,
+the virtualization platform is Xen and the guest is a (para)virtualized
+Debian.
+
+I will be starting with three nodes, one in Geneva and one in Zurich, hosted
+on hypervisors of [IP-Max](https://www.ip-max.net/), and one in Amsterdam,
+hosted on a hypervisor of [IPng](https://ipng.ch/). I have a feeling a few
+more places will follow.
+
+#### Install the OS
+
+Xen makes this repeatable and straightforward. Other systems, such as KVM,
+have very similar installers, for example VMBuilder is popular. Both work
+roughly the same way, and install a guest in a matter of minutes.
+
+I'll install to an LVM volume group on all machines, backed by pairs of SSDs
+for throughput and redundancy. We'll give the guest 4GB of memory and 4
+CPUs. I love how the machine boots using [PyGrub](https://wiki.debian.org/PyGrub),
+fully on serial, and is fully booted and running in 20 seconds.
+
+```
+sudo xen-create-image --hostname as112-1.free-ix.net --ip 46.20.249.197 \
+  --vcpus 4 --pygrub --dist buster --lvm=vg1_hvn04_gva20
+sudo xl create -c as112-1.free-ix.net.cfg
+```
+
+After logging in, the following additional software was installed. We'll be
+using [Bird2](https://bird.network.cz/), which is available from Debian Buster's
+backports. Otherwise, we're pretty vanilla:
+
+```
+$ cat << EOF | sudo tee -a /etc/apt/sources.list
+#
+# Backports
+#
+deb http://deb.debian.org/debian buster-backports main
+EOF
+
+$ sudo apt update
+$ sudo apt install tcpdump sudo net-tools bridge-utils nsd bird2 \
+    netplan.io traceroute ufw curl bind9-dnsutils
+$ sudo apt purge ifupdown
+```
+
+I removed the `/etc/network/interfaces` approach and configured Netplan,
+a personal choice, which aligns the machines more closely with other servers
+in the IPng fleet. The only trick is to ensure that the anycast IP addresses
+are available for the nameserver to listen on, so at the top of Netplan's
+configuration file, we add them like so:
+
+```
+network:
+  version: 2
+  renderer: networkd
+  ethernets:
+    lo:
+      addresses:
+        - 127.0.0.1/8
+        - ::1/128
+        - 192.175.48.1/32       # prisoner.iana.org (anycast)
+        - 2620:4f:8000::1/128   # prisoner.iana.org (anycast)
+        - 192.175.48.6/32       # blackhole-1.iana.org (anycast)
+        - 2620:4f:8000::6/128   # blackhole-1.iana.org (anycast)
+        - 192.175.48.42/32      # blackhole-2.iana.org (anycast)
+        - 2620:4f:8000::42/128  # blackhole-2.iana.org (anycast)
+        - 192.31.196.1/32       # blackhole.as112.arpa (anycast)
+        - 2001:4:112::1/128     # blackhole.as112.arpa (anycast)
+```
+
+### Nameserver
+
+My nameserver of choice is [NSD](https://www.nlnetlabs.nl/projects/nsd/about/),
+and its configuration is similar to BIND, which is described in RFC7534.
In +fact, the zone files are identical, so all we should do is create a few listen +statements and load up the zones: + +``` +$ cat << EOF | sudo tee /etc/nsd/nsd.conf.d/listen.conf +server: + ip-address: 127.0.0.1 + ip-address: ::1 + ip-address: 46.20.249.197 + ip-address: 2a02:2528:a04:202::197 + + ip-address: 192.175.48.1 # prisoner.iana.org (anycast) + ip-address: 2620:4f:8000::1 # prisoner.iana.org (anycast) + + ip-address: 192.175.48.6 # blackhole-1.iana.org (anycast) + ip-address: 2620:4f:8000::6 # blackhole-1.iana.org (anycast) + + ip-address: 192.175.48.42 # blackhole-2.iana.org (anycast) + ip-address: 2620:4f:8000::42 # blackhole-2.iana.org (anycast) + + ip-address: 192.31.196.1 # blackhole.as112.arpa (anycast) + ip-address: 2001:4:112::1 # blackhole.as112.arpa (anycast) + + server-count: 4 +EOF + +$ cat << EOF | sudo tee /etc/nsd/nsd.conf.d/as112.conf +zone: + name: "hostname.as112.net" + zonefile: "/etc/nsd/master/db.hostname.as112.net" + +zone: + name: "hostname.as112.arpa" + zonefile: "/etc/nsd/master/db.hostname.as112.arpa" + +zone: + name: "10.in-addr.arpa" + zonefile: "/etc/nsd/master/db.dd-empty" + +# etcetera +EOF +``` + +While all of the zones are captured by `db.dd-empty` or `db.dr-empty`, which +can be found in the RFC text, I'll note the top two are special, as they are +specific to the instance. For example on our Geneva instance: + +``` +$ cat << EOF | sudo tee /etc/nsd/master/db.hostname.as112.arpa +$TTL 1W +@ SOA chplo01.paphosting.net. noc.ipng.ch. ( + 1 ; serial number + 1W ; refresh + 1M ; retry + 1W ; expire + 1W ) ; negative caching TTL + NS blackhole.as112.arpa. + TXT "AS112 hosted by IPng Networks" "Geneva, Switzerland" + TXT "See https://www.as112.net/ for more information." + TXT "See https://free-ix.net/ for local information." + TXT "Unique IP: 194.1.163.147" + TXT "Unique IP: [2001:678:d78:7::147]" + LOC 46 9 55.501 N 6 6 25.870 E 407.00m 10m 100m 10m +``` + +This is super helpful to users, who want to know which server, exactly, +is serving their request. Not all operators added the `Unique IP` details, +but I found it useful when launching the service, as several anycast nodes +quickly become confusing otherwise :-) + +After this is all done, the nameserver can be started. I rebooted the guest +for good measure, and about 19 seconds later (a fact that continues to +amaze me), the server was up and serving queries, albeit only from localhost +because there is no way to reach the server on the network, yet. + +To validate things work, we can do a few SOA or TXT queries, like this one: +``` +pim@nlams01:~$ ping -c5 -q prisoner.iana.org +PING prisoner.iana.org(prisoner.iana.org (2620:4f:8000::1)) 56 data bytes + +--- prisoner.iana.org ping statistics --- +5 packets transmitted, 5 received, 0% packet loss, time 34ms +rtt min/avg/max/mdev = 0.041/0.045/0.053/0.004 ms + +pim@nlams01:~$ dig @prisoner.iana.org hostname.as112.net TXT +short +norec +"AS112 hosted by IPng Networks" "Amsterdam, The Netherlands" +"See http://www.as112.net/ for more information." +"Unique IP: 94.142.241.187" +"Unique IP: [2a02:898:146::2]" +``` + +### Network + +Now comes the fun part! We're running these instances of the nameservers +in a few locations, and to ensure we don't route traffic to the incorrect +location, we'll announce them using BGP as per recommendation of RFC7534. + +My choice of routing suite is [Bird2](https://bird.network.cz/), which comes +with a lot of extensiblility and a programmatic validation of routing policies. 
+ +We'll only be using `static` and `BGP` routing protocols for Bird, so the +configuration is relatively straight forward, first we create a routing table +export for IPv4 and IPv6, then we define some static _Nullroutes_, which ensure +that our prefixes are always present in the RIB (otherwise BGP will not export +them), then we create some filter functions (one for routeserver sessions, +one for peering sessions, and one for transit sessions), and finally we include +a few specific configuration files, one-per-environment where we'll be active. + +``` +$ cat << EOF | sudo tee /etc/bird/bird.conf +router id 46.20.249.197; + +protocol kernel fib4 { + ipv4 { export all; }; + scan time 60; +} + +protocol kernel fib6 { + ipv6 { export all; }; + scan time 60; +} + +protocol static static_as112_ipv4 { + ipv4; + route 192.175.48.0/24 blackhole; + route 192.31.196.0/24 blackhole; +} + +protocol static static_as112_ipv6 { + ipv6; + route 2620:4f:8000::/48 blackhole; + route 2001:4:112::/48 blackhole; +} + +include "bgp-freeix.conf"; +include "bgp-ipng.conf"; +include "bgp-ipmax.conf"; +EOF +``` + +The configuration file per environment, say `bgp-freeix.conf`, can (and will) +be autogenerated, but the pattern is of the following form: + +``` +$ cat << EOF | tee /etc/bird/bgp-freeix.conf +# +# Bird AS112 configuration for FreeIX +# +define my_ipv4 = 185.1.205.252; +define my_ipv6 = 2001:7f8:111:42::70:1; + +protocol bgp freeix_as51530_1_ipv4 { + description "FreeIX - AS51530 - Routeserver #1"; + local as 112; + source address my_ipv4; + neighbor 185.1.205.254 as 51530; + ipv4 { + import where fn_import_routeserver( 51530 ); + export where proto = "static_as112_ipv4"; + import limit 120000 action restart; + }; +} + +protocol bgp freeix_as51530_1_ipv6 { + description "FreeIX - AS51530 - Routeserver #1"; + local as 112; + source address my_ipv6; + neighbor 2001:7f8:111:42::c94a:1 as 51530; + ipv6 { + import where fn_import_routeserver( 51530 ); + export where proto = "static_as112_ipv6"; + import limit 120000 action restart; + }; +} + +# etcetera +EOF +``` + +If you've seen IXPManager's approach to routeserver configuration generators, +you'll notice I borrowed the `fn_import()` function and its dependents from +there. This allows imports to be specific towards prefix-lists, as-paths and +ensure some _Belts and Braces_ checks are in place (no invalid or tier1 ASN +in the path, a valid nexthop, no tricks with AS path truncation, and so on). + +After bringing up the service, the prefixes make their way into the +routeserver and get distributed to the FreeIX participants: + +``` +$ sudo systemctl start bird +$ sudo birdc show protocol +BIRD 2.0.7 ready. +Name Proto Table State Since Info +fib4 Kernel master4 up 2021-06-28 11:01:35 +fib6 Kernel master6 up 2021-06-28 11:01:35 +device1 Device --- up 2021-06-28 11:01:35 +static_as112_ipv4 Static master4 up 2021-06-28 11:01:35 +static_as112_ipv6 Static master6 up 2021-06-28 11:01:35 +freeix_as51530_1_ipv4 BGP --- up 2021-06-28 11:01:17 Established +freeix_as51530_1_ipv6 BGP --- up 2021-06-28 11:01:19 Established +freeix_as51530_2_ipv4 BGP --- up 2021-06-28 11:01:32 Established +freeix_as51530_2_ipv6 BGP --- up 2021-06-28 11:01:37 Established +``` + +#### Internet Exchanges + +Having one configuration file per group helps a lot with integration of +[IXPManager](https://www.ixpmanager.org/) where we might autogenerate the IXP +versions of these files and install them periodically. 
That way, when members +enable the `AS112` peering checkmark, the servers will automatically download +and set up those sessions without human involvement -- typically this is the +best way to avoid outages: never tinker with production config files by hand. +We'll test this out with [FreeIX](https://free-ix.net/), but hope as well to +offer our service to other internet exchanges, notably SwissIX and CIXP. + +One of the huge benefits of operating within [IP-Max](https://ip-max.net/) +network is their ability to do L2VPN transport from any place on-net to any +other router. As such, connecting these virtual machines to other places, +like SwissIX, CIXP, CHIX-CH, Community-IX or other further away places, +is a piece of cake. All we must do is create an L2VPN and offer it to the +hypervisor (which usually is connected via a LACP _BundleEthernet_) on some +VLAN, after which we can bridge that into the guest OS by creating a new +virtio NIC. This is how, in the example above, our AS112 machines were +introduced to FreeIX. This scales very well, requiring only one guest reboot +per internet exchange, and greatly simplifies operations. + +### Monitoring + +Of course, one would not want to run a production service, certainly not +on the public internet, without a bit of introspection and monitoring. + +There are four things we might want to ensure: +1. Is the machine up and healthy? For this we use NAGIOS. +1. Is NSD serving? For this we use NSD Exporter and Prometheus/Grafana. +1. Is NSD reachable? For this we use CloudProber. +1. If there is an issue, can we alert an operator? For this we use Telegram. + +In a followup post, I'll demonstrate how these things come together into +a comprehensive anycast monitoring and alerting solution. As a fringe benefit +we can show contemporary graphs and dashboards. But seeing as the service +hasn't yet gotten a lot of mileage, it deserves its own followup post, some +time in August. + +## The results + +First things first - latency went waaaay down: +``` +pim@chumbucket:~$ traceroute blackhole-1.iana.org +traceroute to blackhole-1.iana.org (192.175.48.6), 30 hops max, 60 byte packets + 1 chbtl1.ipng.ch (194.1.163.67) 0.257 ms 0.199 ms 0.159 ms + 2 chgtg0.ipng.ch (194.1.163.19) 0.468 ms 0.430 ms 0.430 ms + 3 chrma0.ipng.ch (194.1.163.8) 0.648 ms 0.611 ms 0.597 ms + 4 blackhole-1.iana.org (192.175.48.6) 1.272 ms 1.236 ms 1.201 ms + +pim@chumbucket:~$ dig -6 @prisoner.iana.org hostname.as112.net txt +short +norec +tcp +"Free-IX hosted by IP-Max SA" "Zurich, Switzerland" +"See https://www.as112.net/ for more information." +"See https://free-ix.net/ for local information." +"Unique IP: 46.20.246.67" +"Unique IP: [2a02:2528:1703::67]" +``` + +and this demonstrates why it's super useful to have the `hostname.as112.net` +entry populated well. If I'm in Amsterdam, I'll be served by the local node there: + +``` +pim@gripe:~$ traceroute6 blackhole-2.iana.org +traceroute6 to blackhole-2.iana.org (2620:4f:8000::42), 64 hops max, 60 byte packets + 1 nlams0.ipng.ch (2a02:898:146::1) 0.744 ms 0.879 ms 0.818 ms + 2 blackhole-2.iana.org (2620:4f:8000::42) 1.104 ms 1.064 ms 1.035 ms + +pim@gripe:~$ dig -4 @prisoner.iana.org hostname.as112.net txt +short +norec +tcp +"Hosted by IPng Networks" "Amsterdam, The Netherlands" +"See http://www.as112.net/ for more information." +"Unique IP: 94.142.241.187" +"Unique IP: [2a02:898:146::2]" +``` + +Of course, due to anycast, and me being in Zurich, I will be served primarily +by the Zurich node. 
If it were to go down, for maintenance or a hardware failure,
+BGP would immediately converge on alternate paths; there are currently three
+to choose from:
+
+```
+pim@chrma0:~$ show protocols bgp ipv4 unicast 192.31.196.0/24
+BGP routing table entry for 192.31.196.0/24
+Paths: (10 available, best #2, table default)
+  Advertised to non peer-group peers:
+  185.1.205.251 194.1.163.1 [...]
+  112
+    194.1.163.32 (metric 137) from 194.1.163.32 (194.1.163.32)
+      Origin IGP, localpref 400, valid, internal
+      Community: 50869:3500 50869:4099 50869:5055
+      Last update: Mon Jun 28 11:13:14 2021
+  112
+    185.1.205.251 from 185.1.205.251 (46.20.246.67)
+      Origin IGP, localpref 400, valid, external, bestpath-from-AS 112, best (Local Pref)
+      Community: 50869:3500 50869:4099 50869:5000 50869:5020 50869:5060
+      Last update: Mon Jun 28 11:00:45 2021
+  112
+    185.1.205.251 from 185.1.205.253 (185.1.205.253)
+      Origin IGP, localpref 200, valid, external
+      Community: 50869:1061
+      Last update: Mon Jun 28 11:00:20 2021
+
+(and more)
+```
+
+I am expecting a few more direct paths to come as I harden this service and
+offer it to other Swiss internet exchange points in the future. But mostly, my
+mission of reducing the round trip time from 146ms to 1ms from my desktop at
+home was successfully accomplished.
diff --git a/content/articles/2021-07-03-geneva.md b/content/articles/2021-07-03-geneva.md
new file mode 100644
index 0000000..f7f3f96
--- /dev/null
+++ b/content/articles/2021-07-03-geneva.md
@@ -0,0 +1,210 @@
+---
+date: "2021-07-03T22:16:44Z"
+title: IPng arrives in Geneva
+---
+
+I've been planning a network expansion for a while now. For the next few weeks,
+I will be in total geek-mode as I travel to several European cities to deploy
+AS50869 on a European ring. At the same time, my buddy Fred from
+[IP-Max](https://ip-max.net/) has been wanting to go to Amsterdam. IP-Max's
+[network](https://as25091.peeringdb.com/) is considerably larger than mine, but
+it just never clicked with the right set of circumstances for them to deploy
+in the Netherlands, until the stars aligned ...
+
+## Deployment
+
+{{< image width="400px" float="left" src="/assets/network/qdr.png" alt="Quai du Rhône" >}}
+
+After our adventure in [Frankfurt]({% post_url 2021-05-17-frankfurt %}),
+[Amsterdam]({% post_url 2021-05-26-amsterdam %}), [Lille]({% post_url 2021-05-28-lille %}),
+and [Paris]({% post_url 2021-06-01-paris %}) came to an end, I still had a few
+loose ends to tie up. In particular, in Lille I had dropped an old Dell R610
+while waiting for new Supermicros to be delivered. There is benefit to having
+one standard footprint setup, in my case a PCEngines `APU2`, a Supermicro
+`5018D-FN8T` and an Intel `X710-DA4` expansion NIC. They run fantastically with
+[DANOS](https://danosproject.org/) and [VPP](https://fd.io/) applications.
+
+Of course, we mustn't forget _home base_, Geneva, where IP-Max has its
+headquarters in a beautiful mansion, pictured here. At the same time, my
+family likes to take one trip per month to a city we don't usually go to, sort
+of to keep up with real life now that we are more and more able to travel.
+Marina has a niece in Geneva, who has lived and worked there for 20+ years,
+so we figured we'd combine these things and stay the weekend at her place.
+
+After making our way from Zurich to Geneva, a trip that took us just short
+of six hours (!) by car, we arrived at the second half of the Belgium-Italy
+Eurocup soccer match.
It was perhaps due to our tardiness and lack of physical
+support that the Belgians lost the match that day. Sorry!
+
+### Connectivity
+
+{{< image width="400px" float="right" src="/assets/network/chplo0-rack.png" alt="Geneva Rack" >}}
+
+My current circuit runs from Paris (Léon Frot), `frpar0.ipng.ch`, over a direct
+DWDM wave to Zurich, where I pick it up on `chgtg0.ipng.ch` at Interxion
+Glattbrugg. So what we'll do is break open this VLL at the IP-Max side,
+insert the new router `chplo0.ipng.ch`, reconfigure the Paris side
+to go to the new router, and have the new router create another VLL back
+to Zurich, which due to the topology of IP-Max's underlying DWDM network
+will traverse Paris - Lyon - Geneva instead (shaving off ~1.5ms of latency
+at the same time).
+
+I racked the `APU2` OOB server and the `5018D-FN8T` router, plus another Dell
+R610 to run virtual machines, at Safehost SH1 in Plan-les-Ouates, a southern
+suburb of Geneva. I connected one 10G port to `er01.gva20.ip-max.net` and
+another 10G port to `er02.gva20.ip-max.net` to obtain maximum availability
+benefits. As an example of what the configuration on the ASR9k platform looks
+like for this type of operation, here's what I committed on `er01.gva20`.
+
+Of course, first things first: let's ensure that the OOB machine has
+connectivity, by allocating a /64 IPv6 and a /29 IPv4. I usually configure
+myself a BGP transit session in the same subnet, which means we'll want to
+bridge the 1G UTP connection of the APU with the 10G fiber connection of
+the Supermicro router, like so:
+
+```
+interface BVI911
+ description Cust: IPng OOB and Transit
+ ipv4 address 46.20.250.105 255.255.255.248
+ ipv4 unreachables disable
+ ipv6 nd suppress-ra
+ ipv6 address 2a02:2528:ff05::1/64
+ ipv6 enable
+ load-interval 30
+!
+interface GigabitEthernet0/7/0/38
+ description Cust: IPng APU (OOB)
+ mtu 9064
+ load-interval 30
+ l2transport
+ !
+!
+interface TenGigE0/1/0/3
+ description Cust: IPng (VLL and Transit)
+ mtu 9014
+!
+interface TenGigE0/1/0/3.911 l2transport
+ encapsulation dot1q 911 exact
+ rewrite ingress tag pop 1 symmetric
+ mtu 9018
+!
+
+l2vpn
+ bridge group BG_IPng
+  bridge-domain BD_IPng911
+   interface Te0/1/0/3.911
+   !
+   interface GigabitEthernet0/7/0/38
+   !
+   routed interface BVI911
+  !
+ !
+!
+```
+
+After this, we pulled UTP cable and configured the `APU2`, which then has an
+internal network towards the IPMI port of the Supermicro, and from there on,
+the configuration becomes much easier. Of course, all config can be done
+wirelessly, because the APU `console.plo.ipng.nl` acts as a WiFi access
+point, so I connect to it and commit the network configs.
+
+Once that's online and happy, the router `chplo0.ipng.ch` is next. For this,
+on `er02.par02.ip-max.net`, I reconfigure the current VLL to point to the
+loopback of this router, `er01.gva20.ip-max.net`, using the same `pw-id`. Then,
+I can configure this router as follows:
+
+```
+interface TenGigE0/1/0/3.100 l2transport
+ description Cust: IPng VLL to par02
+ encapsulation dot1q 100
+ rewrite ingress tag pop 1 symmetric
+ mtu 9018
+!
+
+l2vpn
+ pw-class EOMPLS-PW-CLASS
+  encapsulation mpls
+   transport-mode ethernet
+  !
+ !
+ xconnect group IPng
+  p2p IPng_to_par02
+   interface TenGigE0/1/0/3.100
+   neighbor ipv4 46.20.255.33 pw-id 210535705
+    pw-class EOMPLS-PW-CLASS
+   !
+  !
+ !
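+! NB: the pw-id (210535705 here) has to be identical on both ends of the
+! pseudowire -- er02.par02.ip-max.net points its xconnect at this router's
+! loopback using the same pw-id, otherwise the VLL will not come up.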
+``` + +## The results + +And with that, the pseudowire is constructed, and the original interface on +`frpar0.ipng.ch` directly sees the interface here on `chplo0.ipng.ch` using +jumboframes of 9000 bytes (+14 bytes of ethernet overhead and +4 bytes of VLAN +tag on the ingress interface). It is as if the routers are directly connected +by a very long ethernet cable, a _pseudo-wire_ if you wish. Super low pingtimes +are observed between this new router in Geneva and the existing two in Paris +and Zurich: + +``` +pim@chplo0:~$ /bin/ping -4 -c5 frpar0 +PING frpar0.ipng.ch (194.1.163.33) 56(84) bytes of data. +64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=1 ttl=64 time=8.78 ms +64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=2 ttl=64 time=8.80 ms +64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=3 ttl=64 time=8.81 ms +64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=4 ttl=64 time=8.82 ms +64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=5 ttl=64 time=8.85 ms + +--- frpar0.ipng.ch ping statistics --- +5 packets transmitted, 5 received, 0% packet loss, time 10ms +rtt min/avg/max/mdev = 8.783/8.810/8.846/0.104 ms +pim@chplo0:~$ /bin/ping -6 -c5 chgtg0 +PING chgtg0(chgtg0.ipng.ch (2001:678:d78::1)) 56 data bytes +64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=1 ttl=64 time=4.51 ms +64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=2 ttl=64 time=4.44 ms +64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=3 ttl=64 time=4.36 ms +64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=4 ttl=64 time=4.47 ms +64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=5 ttl=64 time=4.41 ms + +--- chgtg0 ping statistics --- +5 packets transmitted, 5 received, 0% packet loss, time 10ms +rtt min/avg/max/mdev = 4.362/4.436/4.506/0.077 ms +``` + +For good measure I've also connected to FreeIX, a new internet exchange project +I'm working on, that will span the Geneva, Zurich and Lugano areas. More on that +in a future post! + +``` +pim@chplo0:~$ iperf3 -4 -c 185.1.205.1 ## chgtg0.ipng.ch +Connecting to host 185.1.205.1, port 5201 +[ 5] local 185.1.205.2 port 46872 connected to 185.1.205.1 port 5201 +[ ID] Interval Transfer Bitrate Retr Cwnd +[ 5] 0.00-1.00 sec 809 MBytes 6.78 Gbits/sec 4 11.4 MBytes +[ 5] 1.00-2.00 sec 869 MBytes 7.29 Gbits/sec 0 11.4 MBytes +[ 5] 2.00-3.00 sec 865 MBytes 7.25 Gbits/sec 0 11.4 MBytes +[ 5] 3.00-4.00 sec 868 MBytes 7.28 Gbits/sec 0 11.4 MBytes +[ 5] 4.00-5.00 sec 836 MBytes 7.01 Gbits/sec 0 11.4 MBytes +[ 5] 5.00-6.00 sec 852 MBytes 7.15 Gbits/sec 0 11.4 MBytes +[ 5] 6.00-7.00 sec 865 MBytes 7.26 Gbits/sec 0 11.4 MBytes +[ 5] 7.00-8.00 sec 865 MBytes 7.26 Gbits/sec 0 11.4 MBytes +[ 5] 8.00-9.00 sec 861 MBytes 7.22 Gbits/sec 0 11.4 MBytes +[ 5] 9.00-10.00 sec 860 MBytes 7.22 Gbits/sec 0 11.4 MBytes +- - - - - - - - - - - - - - - - - - - - - - - - - +[ ID] Interval Transfer Bitrate Retr +[ 5] 0.00-10.00 sec 8.35 GBytes 7.17 Gbits/sec 4 sender +[ 5] 0.00-10.01 sec 8.35 GBytes 7.16 Gbits/sec receiver + +iperf Done. +``` + +You kind of get used to performance stats like this, but that said, it's nice +to see that performance over FreeIX is slightly *lower* than performance over +the IPng backbone, and this is because on my VLLs, I can make use of jumbo +frames, which gives me 20% or so better performance (currently 9.62 Gbits/sec). + +Currently I'm busy at work in the background completing the configuration, the +management environment and physical infrastructure for the internet exchange. 
I'm planning to make a more complete post about the FreeIX project in a few
+weeks once it's ready for launch. Stay tuned!
diff --git a/content/articles/2021-07-19-pcengines-apu6.md b/content/articles/2021-07-19-pcengines-apu6.md
new file mode 100644
index 0000000..007d7f3
--- /dev/null
+++ b/content/articles/2021-07-19-pcengines-apu6.md
@@ -0,0 +1,201 @@
+---
+date: "2021-07-19T16:12:54Z"
+title: 'Review: PCEngines APU6 (with SFP)'
+---
+
+* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)>
+* Reviewed: Pascal Dornier <[pdornier@pcengines.ch](mailto:pdornier@pcengines.ch)>
+* Status: Draft - Review - **Approved**
+
+I did this test back in February, but can now finally publish the results! This little SBC is
+definitely going to be a hit in the ISP industry. See more information about it [here](https://www.pcengines.ch/apu6b4.htm).
+
+[PC Engines](https://pcengines.ch/) develops and sells small single board computers for networking
+to a worldwide customer base. This article discusses a new/unreleased product which PC Engines has
+developed, and which has specific significance in the network operator community: an SBC which comes
+with three RJ45/UTP based network ports, and one SFP optical port.
+
+# Executive Summary
+
+Due to the use of an Intel i210-IS on the SFP port and i211-AT on the three copper ports, and due to
+it having no moving parts (fans, hard disks, etc), this SBC is an excellent choice for network
+appliances such as out-of-band or serial consoles in a datacenter, or routers in a small business
+or home office.
+
+## Detailed findings
+
+{{< image width="300px" float="right" src="/assets/pcengines-apu6/apu6.png" alt="APU6" >}}
+
+The [APU](https://www.pcengines.ch/apu2.htm) series boards typically ship with 2GB or 4GB of DRAM,
+2, 3 or 4 Intel i211-AT network interfaces, and a four core AMD GX-412TC (running at 1GHz). This
+review is about the following **APU6** unit, which comes with 4GB of DRAM (this preproduction unit
+came with 2GB, but that will be fixed in the production version), 3x i211-AT for the RJ45
+network interfaces, and one i210-IS with an SFP cage.
+
+One other significant difference is visible -- the trusty rusty DB9 connector that exposes the first
+serial RS232 port is replaced with a modern CP2104 (USB vendor `10c4:ea60`) from Silicon Labs, which
+exposes the serial port as TTL/serial on a micro USB connector rather than RS232, neat!
+
+## Transceiver Compatibility
+
+{{< image float="right" src="/assets/pcengines-apu6/optics.png" alt="Optics" >}}
+
+The small form-factor pluggable (SFP) is a compact, hot-pluggable network interface module used for
+both telecommunication and data communications applications. An SFP interface on networking hardware
+is a modular slot for a media-specific transceiver in order to connect a fiber-optic cable or
+sometimes a copper cable. Such a slot is typically called a _cage_.
+
+The SFP port accepts most/any optics brand and configuration (copper, regular 850nm/1310nm/1550nm
+based, BiDi as commonly used in FTTH deployments, CWDM for use behind an OADM). I tried 6 different
+vendors and types, all successfully -- every module worked, regardless of vendor or brand. See the
+links in the table below for the output of an optical diagnostics tool (using the SFF-8472 standard
+for SFP/SFP+ management).
+
+Each module provided link and passed traffic. The loadtest below was done with the BiDi optics
+in one interface and a boring RJ45 copper cable in another.
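+
+As an aside: on Linux, that same SFF-8472 DOM data can be read straight out of a module with
+`ethtool`, which is a handy spot check (the interface name below is just an example):
+
+```
+# Dump the SFP module EEPROM and diagnostics (SFF-8472); interface name is illustrative
+$ sudo ethtool -m enp1s0
+```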
+It's going to be fantastic to be able
+to use these APU6s in a datacenter setting as remote / out-of-band serial devices, especially
+nowadays, when UTP is becoming scarce and everybody has fiber infrastructure in their racks.
+
+Vendor  | Type              | Description                     | Details
+------- | ----------------- | ------------------------------- | ------------------
+Finisar | FTLF8519P2BNL-RB  | 850nm duplex                    | [sfp0.txt](/assets/pcengines-apu6/sfp0.txt)
+Generic | Unknown (no DOM)  | 850nm duplex                    | [sfp1.txt](/assets/pcengines-apu6/sfp1.txt)
+Cisco   | GLC-LH-SMD        | 1310nm duplex                   | [sfp2.txt](/assets/pcengines-apu6/sfp2.txt)
+Cisco   | SFP-GE-BX-D       | 1490nm Bidirectional (FTTH CPE) | [sfp3.txt](/assets/pcengines-apu6/sfp3.txt)
+Cisco   | SFP-GE-BX-U       | 1310nm Bidirectional (FTTH COR) | [sfp3.txt](/assets/pcengines-apu6/sfp3.txt)
+Cisco   | BT-OC24-20A       | 1550nm OC24 SDH                 | [sfp4.txt](/assets/pcengines-apu6/sfp4.txt)
+Finisar | FTRJ1319P1BTL-C7  | 1310nm 20km (w/ 6dB attenuator) | [sfp5.txt](/assets/pcengines-apu6/sfp5.txt)
+
+## Network Loadtest
+
+The choice of the Intel i210/i211 network controllers on this board allows operators to use Intel's
+DPDK with relatively high performance, compared to regular (kernel) based routing. I loadtested
+Linux (Ubuntu 20.04), OpenBSD (6.8), and two lesser known but way cooler open source DPDK
+appliances: Danos ([ref](https://www.danosproject.org/)) and VPP ([ref](https://fd.io/)).
+
+It is specifically worth calling out that while Linux and OpenBSD struggled, both DPDK appliances had
+absolutely no problems filling a bidirectional gigabit stream of "regular internet traffic"
+(referred to as `imix`), and came close to _line rate_ with "64b UDP packets". The line rate of
+a gigabit ethernet is 1.48Mpps in one direction, and my loadtests stressed both directions
+simultaneously.
+
+### Methodology
+
+For the loadtests, I used Cisco's T-Rex ([ref](https://trex-tgn.cisco.com/)) in stateless mode,
+with a custom Python controller that ramps traffic from the loadtester to the device under test
+(DUT) up and down, by sending traffic out `port0` to the DUT, and expecting that traffic to be
+presented back out from the DUT to its `port1`, and vice versa (out from `port1` -> DUT -> back
+in on `port0`). The loadtester first sends a few seconds of warmup; this ensures the DUT is
+passing traffic and offers the ability to inspect the traffic before the actual rampup. Then
+the loadtester ramps up linearly from zero to 100% of line rate (in our case, line rate is
+one gigabit in both directions), and finally it holds the traffic at full line rate for a certain
+duration. If at any time the loadtester fails to see the traffic it's emitting return on its
+second port, it flags the DUT as saturated, and this is noted as the maximum bits/second and/or
+packets/second.
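+
+A typical invocation would look something like the following (the T-Rex server address, profile
+and output filename are illustrative; the flags are described in the usage text below):
+
+```
+$ ./trex-loadtest.bin -s 192.0.2.10 -p imix.py -rt 100 -rd 600 -hd 30 -o apu6-linux-imix.json
+```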
+
+```
+usage: trex-loadtest.bin [-h] [-s SERVER] [-p PROFILE_FILE] [-o OUTPUT_FILE] [-wm WARMUP_MULT]
+                         [-wd WARMUP_DURATION] [-rt RAMPUP_TARGET]
+                         [-rd RAMPUP_DURATION] [-hd HOLD_DURATION]
+
+T-Rex Stateless Loadtester -- pim@ipng.nl
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -s SERVER, --server SERVER
+                        Remote trex address (default: 127.0.0.1)
+  -p PROFILE_FILE, --profile PROFILE_FILE
+                        STL profile file to replay (default: imix.py)
+  -o OUTPUT_FILE, --output OUTPUT_FILE
+                        File to write results into, use "-" for stdout (default: -)
+  -wm WARMUP_MULT, --warmup_mult WARMUP_MULT
+                        During warmup, send this "mult" (default: 1kpps)
+  -wd WARMUP_DURATION, --warmup_duration WARMUP_DURATION
+                        Duration of warmup, in seconds (default: 30)
+  -rt RAMPUP_TARGET, --rampup_target RAMPUP_TARGET
+                        Target percentage of line rate to ramp up to (default: 100)
+  -rd RAMPUP_DURATION, --rampup_duration RAMPUP_DURATION
+                        Time to take to ramp up to target percentage of line rate, in seconds (default: 600)
+  -hd HOLD_DURATION, --hold_duration HOLD_DURATION
+                        Time to hold the loadtest at target percentage, in seconds (default: 30)
+```
+
+It's worth pointing out that almost all systems are _pps-bound_, not _bps-bound_. A typical rant
+of mine is that network vendors are imprecise when they specify throughput "up to 40Gbit": more
+often than not they mean "under carefully crafted conditions". For example, by utilizing jumboframes
+(9216 bytes rather than the "usual" 1500 byte MTU found on ethernet), which is easier on the router
+than a typical internet mixture (closer to 1100 bytes), and much easier still than forwarding
+64 byte packets, for instance during a DDoS attack. Or by measuring only in one direction, and
+only with exactly one source/destination IP address/port, which is quite a bit easier than looking
+up destinations in a forwarding table containing 1M entries -- for context, a current internet
+backbone router carries ~845K IPv4 destinations and ~105K IPv6 destinations.
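+
+For reference, the arithmetic behind that 1.48Mpps line rate figure: on the wire, every 64 byte
+ethernet frame is accompanied by 20 bytes of overhead (7 bytes preamble, 1 byte start-of-frame
+delimiter and a 12 byte inter-frame gap), so:
+
+```
+(64 + 20) bytes * 8        = 672 bits per frame on the wire
+1'000'000'000 bit/s / 672  = ~1.488 Mpps per direction, ~2.98 Mpps bidirectionally
+```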
+ +### Results + +Product | Loadtest | Throughput (pps) | Throughput (bps) | % of linerate | Details +------- | -------- | ---------------- | ------------------| ------------- | ---------- +Linux | imix | 150.21 Kpps | 452.81 Mbps | 45.28% | [apu6-linux-imix.json](/assets/pcengines-apu6/apu6-linux-imix.json) +OpenBSD | imix | 145.52 Kpps | 444.51 Mbps | 44.45% | [apu6-openbsd-imix.json](/assets/pcengines-apu6/apu6-openbsd-imix.json) +**VPP** | **imix** | **654.40 Kpps** | **2.00 Gbps** | **199.90%** | [apu6-vpp-imix.json](/assets/pcengines-apu6/apu6-vpp-imix.json) +**Danos** | **imix** | **655.53 Kpps** | **2.00 Gbps** | **200.24%** | [apu6-danos-imix.json](/assets/pcengines-apu6/apu6-danos-imix.json) +Linux | 64b | 96.93 Kpps | 65.14 Mbps | 6.51% | [apu6-linux-64b.json](/assets/pcengines-apu6/apu6-linux-64b.json) +OpenBSD | 64b | 152.09 Kpps | 102.20 Mbps | 10.22% | [apu6-openbsd-64b.json](/assets/pcengines-apu6/apu6-openbsd-64b.json) +VPP | 64b | 1.78 Mpps | 1.19 Gbps | 119.49% | [apu6-vpp-64b.json](/assets/pcengines-apu6/apu6-vpp-64b.json) +**Danos** | **64b** | **2.30 Mpps** | **1.55 Gbps** | **154.62%** | [apu6-danos-64b.json](/assets/pcengines-apu6/apu6-danos-64b.json) + +{{< image width="800px" src="/assets/pcengines-apu6/results.png" alt="Results" >}} + +For more information on the methodology and the scripts that drew these graphs, take a look +at my buddy Michal's [GitHub Page](https://github.com/wejn/trex-loadtest-viz), which, given +time, will probably turn into its own subsection of this website (I can only imagine the value +of a corpus of loadtests of popular equipment in the consumer arena). + +## Caveats + +The unit was shipped to me free of charge by PC Engines for the purposes of load- and systems +integration testing. Other than that, this is not a paid endorsement and views of this review +are my own. + +## Open Questions + +### SFP I2C + +Considering the target audience, I wonder if there is a possibility to break out the I2C pins from +the SFP cage into a header on the board, so that users can connect them through to the CPU's I2C +controller (or bitbang directly on GPIO pins), and use the APU6 as an SFP flasher. I think that +would come in incredibly handy in a datacenter setting. + +### CPU bound + +The DPDK based router implementations are CPU bound, and could benefit from a little bit more power. +I am duly impressed by the throughput seen in terms of packets/sec/watt, but considering a typical +router has a (forwarding) dataplane and needs as well a (configuration) controlplane, we are short +about 30% CPU cycles. If a controlplane (like Bird or FRR ([ref](https://frrouting.org)) is dedicated +one core, that leaves us three cores for forwarding, with which we obtain roughly 154% of linerate, +we'll need that `200/154 == 1.298` to obtain line rate in both directions. That said, the APU6 has +absolutely no problems saturating a gigabit **in both directions** under normal (==imix) +circumstances. + +# Appendix + +## Appendix 1 - Terminology + +**Term** | **Description** +-------- | --------------- +OADM | **optical add drop multiplexer** -- a device used in wavelength-division multiplexing systems for multiplexing and routing different channels of light into or out of a single mode fiber (SMF) +ONT | **optical network terminal** - The ONT converts fiber-optic light signals to copper based electric signals, usually Ethernet. 
+OTO | **optical telecommunication outlet** - The OTO is a fiber optic outlet that allows easy termination of cables in an office and home environment. Installed OTOs are referred to by their OTO-ID. +CARP | **common address redundancy protocol** - Its purpose is to allow multiple hosts on the same network segment to share an IP address. CARP is a secure, free alternative to the Virtual Router Redundancy Protocol (VRRP) and the Hot Standby Router Protocol (HSRP). +SIT | **simple internet transition** - Its purpose is to interconnect isolated IPv6 networks, located in global IPv4 Internet via tunnels. +STB | **set top box** - a device that enables a television set to become a user interface to the Internet and also enables a television set to receive and decode digital television (DTV) broadcasts. +GRE | **generic routing encapsulation** - a tunneling protocol developed by Cisco Systems that can encapsulate a wide variety of network layer protocols inside virtual point-to-point links over an Internet Protocol network. +L2VPN | **layer2 virtual private network** - a service that emulates a switched Ethernet (V)LAN across a pseudo-wire (typically an IP tunnel) +DHCP | **dynamic host configuration protocol** - an IPv4 network protocol that enables a server to automatically assign an IP address to a computer from a defined range of numbers. +DHCP6-PD | **Dynamic host configuration protocol: prefix delegation** - an IPv6 network protocol that enables a server to automatically assign network prefixes to a customer from a defined range of numbers. +NDP NS/NA | **neighbor discovery protocol: neighbor solicitation / advertisement** - an ipv6 specific protocol to discover and judge reachability of other nodes on a shared link. +NDP RS/RA | **neighbor discovery protocol: router solicitation / advertisement** - an ipv6 specific protocol to discover and install local address and gateway information. +SBC | **single board computer** - a compute computer with all peripherals and components directly attached to the board. diff --git a/content/articles/2021-07-26-bucketlist.md b/content/articles/2021-07-26-bucketlist.md new file mode 100644 index 0000000..6099ec7 --- /dev/null +++ b/content/articles/2021-07-26-bucketlist.md @@ -0,0 +1,315 @@ +--- +date: "2021-07-26T11:16:44Z" +title: A story of a Bucketlist +--- + +## Introduction + +Many people maintain what is called a Bucketlist, a list of things they +wish to do before they _kick the bucket_. I have one also, and although +most of the items on that list are earthly and more on the emotional +realm, and private, there is one specific thing that I have wanted to +do ever since I first started working in IT in 1998: Peer at the +Amsterdam Internet Exchange. + +This post details striking this particular item off my bucketlist. It's +both indulgent, humblebraggy and incredibly nerdy and it talks a bit about +mental health. If those are trigger words for you, skip ahead to another +post, like my series on VPP ;-) + + +## 1998 - Netherlands + +{{< image width="300px" float="right" src="/assets/bucketlist/bucketlist-ede.png" alt="The Kelvinstraat" >}} + +I started working when I was still at the TU/Eindhoven, and after a great +sysadmin job at Radar Internet, which became Track and was sold to Wegener +Arcade, I turned towards networking. 
After building Freeler (the first +_free_ ISP in the Netherlands) with Adrianus and co, and a small stint at +their primary uplink Intouch with Rager (rest in peace, Brother), I joined +BIT (AS12859) from 2000 to 2006, and it was here where I developed a true +passion for that which makes the internet 'tick': routing protocols. + +I was secretly jealous that BIT could afford Junipers, F5 loadbalancers and +large Cisco switches, and I loved working with and on those machines. BIT +had a reseller relationship with BBNed, and were able to directly connect +ADSL modems into their own infrastructure, and as such I could afford to get +myself a subnet from 213.154.224.0/19 routed to my house in Wageningen. It +was where I had a half-19" rack in a clothing closet in our guest bedroom, +and it was there that I decided: I want to eventually participate in the +BGP world and peer at AMS-IX (the only exchange at the time, NLIX was just +starting up, thanks again, Jan!). + +Pictured to the right was my first contribution to AS12859 - deploying a +CWDM ring from Ede to Amsterdam and upgrading our backbone from an ATM E3 +(34Mbit) and POS STM1 (155Mbit) leased line to Gigabit Ethernet on Juniper +M5 routers, this was in 2001, 20 years ago almost to the month. + +## 2008 - Switzerland + +{{< image width="300px" float="right" src="/assets/bucketlist/bucketlist-dk2.png" alt="The Cavern" >}} + +Fast forward to 2006, I moved to Switzerland and while I remained friendly +with NLNOG and SWINOG (and a few other network operator groups), I did not +pursue the whole internet exchange thing. I had operated networks for the +greater part of a decade, and with my full time job, I spent a lot of time +learning how to be a good _Site Reliability Engineer_. I still had three /24 +PI space blocks, used for different purposes in the past, but I was much +more comfortable letting the "real" ISPs announce them - in my case AS25091 +[IP-Max](https://ip-max.net/) (thanks, Fred!) and AS13030 [Init7](https://init7.net/) +(thanks, Fredy!) and AS12859 [BIT](https://bit.nl/) (thanks, Michel!). I +cannot remember any meaningful downtime in any of those operators, of course +there is always some, but due to the N+2 nature of my network deployment, I +don't think any global downtime for my internet presence has ever occured. + +It's not a coincidence that even Google for the longest time used my website +at [SixXS](https://sixxs.net/) for their own monitoring, now _that_ is +cool. Although Jeroen and I did decide to retire the SixXS project (see my +[Sunset]({% post_url 2017-03-01-sixxs-sunset %}) article on why), the website +is still up and served off of three distinct networks, because I have to stay +true to the SRE life. + +Pictured to the right was one of the two racks at Deltalis DK2, a datacenter +built into a mountain in the heart of the swiss Alps. Classic edge/core/border +approach with (at the time) state of the art Cisco 7600 routers. One of these +is destined to become my nightstand at some point, this was in 2013, which +is now (almost) 10 years ago. + +### Corona Motivation + +My buddy Fred from IP-Max would regularly ask me "why don't you just announce +your /24 yourself?" It'd be fun, he said. In 2007, we registered a /24 PI for +SixXS, and I was always quite content to let _him_ handle the routing. 
But it +started to itch and a neighbor of mine inadvertently reminded me of this itch +(thanks, Max) by asking me if I was interested to share an L2 ethernet link +with him from our place in Brüttisellen to one of the datacenters in +Zürich, a distance of about 7km as the photons fly. + +{{< image width="110px" float="left" src="/assets/bucketlist/bucketlist-corona.png" alt="The Virus" >}} + +I could not resist any longer. I was working long(er) than average hours due +to the work-from-home situation: you easily chop off 45-60min of commute each +day, but I noticed myself spending it in more meetings instead of in the train. +I was slowly getting into a bad state, and my motivation was very low. I wanted +to do something other than sleep-eat-work-sleep and even my jogging went to an +all time minimum. I had very low emotional energy. + +To put my mind off of things, I decided to reattach to my networking roots in +a few ways: one was to build an AS and operate it for a while (maybe a few years +until I get bored of it, and then re-parent my IP space to some friendly ISP, +or who knows, cash in rich and sell my IP space to the highest bidder!), and +the other was to continue my desire to have a competent replacement for silicon +now that CPUs-of-now are just as fast as ASICs-of-then, and contribute to DANOS +and VPP. + +#### Step 1. Build a basement ISP + +So getting a PC with Bird, or in my case, an appliance called [DANOS](https://danosproject.org/) +which uses [DPDK](https://dpdk.org/) to implement wirespeed routing on commodity +x86/64 hardware. So I happily announced my /24 and /48 from NTT's datacenter, +connected to the local internet exchange [Swissix](https://swissix.ch/) and +rented an L2 circuit to my house via [Openfactory](https://openfactory.net/). Also, +I showed that a simple Supermicro (for example [SYS-5018D-FN8T](https://www.supermicro.com/products/system/1u/5018/SYS-5018D-FN8T.cfm)) +could easily handle line rate 64 byte frames in both directions on its TenGigabit +interfaces, that's 29Mpps, and still have a responsive IPMI serial console. It +reminded me of the early days of Juniper martini class routers, where Jean would +say ".. and the chassis doesn't even get warm". That's certainly correct today, +cuz that Supermicro draws 35W, which is one microwatt per packet routed! + +#### Step 2. Build a European Ring + +{{< image width="350px" float="right" src="/assets/bucketlist/bucketlist-staging-ams.png" alt="Staging Amsterdam" >}} + +Of course, I cannot end there, as I have a bucketlist item to work towards. I always +wanted to peer in Amsterdam, ever since 2001 when I joined BIT. So I worked out a +plan with Fred, who has also been wanting to go to Amsterdam with his Swiss ISP +[IP-Max](https://ip-max.net/). + +So, in a really epic roadtrip full of nerd, Fred and I went into total geek-mode +as we traveled to several European cities to deploy AS50869 on a european ring. I +wrote about my experience extensively in these blog posts: + +* [Frankfurt]({% post_url 2021-05-17-frankfurt %}): May 17th 2021. +* [Amsterdam]({% post_url 2021-05-26-amsterdam %}): May 26th 2021. +* [Lille]({% post_url 2021-05-28-lille %}): May 28th 2021. +* [Paris]({% post_url 2021-06-01-paris %}): June 1st 2021. +* [Geneva]({% post_url 2021-07-03-geneva %}): July 3rd 2021. + +I think we can now say that I'm _peering on the FLAP_. 
It's not that this AS50869 +carries that much traffic, but it's a very welcome relief of daily worklife to be +able to do something _fun_ and _immediately rewarding_ like turn up a BGP session +and see the traffic go from Zurich to any one of these cities at 10Gbit in any +direction. No congestion, no _packetlo_, just pure horsepower performance. + +#### Step 3. Build Linux CP in VPP + +Next month, I plan to take [VPP](https://fd.io/) out for an elaborate spin. I've been +running DANOS on my routers for a while now, and I'm pretty happy with it, but there +are a few quirks that are annoying me more and more. Notably, the conversion of Vyatta +style commands in the configuration into an FRR config, are often lossy. There's a few +key features (such as RPKI or LDP signalling for MPLS paths) that I'm missing, and +the dataplane, although pretty stable, has crashed maybe three or four times over the +last year. Note: One of IP-Max's many Cisco ASR9k also had a few line card reboots in +the last year so maybe these crashes are par for the course. + +Ever since seeing Netgate and Cisco started work on the Linux Control Plane plugin, which +takes interfaces in the VPP dataplane and exposes those as TAP interfaces in Linux, I've +wanted to contribute to that. I've been determined to make use of VPP+LinuxCP in my own +network. However, development has completely stalled on the plugin; the one that ships with +VPP 21.06 is rudimentary at best: doesn't do QinQ/QinAD; doesn't apply any changes from the +dataplane into the Linux network interface; and the plugin that mirrors netlink message has +been stuck in limbo for a few months. So I reached out to the authors in May and offered to +complete / rewrite the plugins. I find that writing code, compiling and testing it, and +being able to immediately see the improvements in a live network incredibly motivating +and energizing. + +Expect to see a few posts in August/September about this work! + +## 2021 - Switzerland + +{{< image width="400px" float="right" src="/assets/bucketlist/bucketlist-mentalhealth.png" alt="Alpine Health" >}} + +I can say that after making a few small tweaks and adjustments, and breaking the WFH +regime into "work" from home and "play" from home, helps a lot. I now have a HDMI +switch that flips my desk from my work Mac into my personal OpenBSD machine, and a +19" rack in my basement with equipment to loadtest and develop VPP, and I often do +some small chores like establish a peering session and happily traceroute from my +basement to Amsterdam. + +I've spent some time in the mountains, in a family commitment to go to a new swiss +canton every month. The picture on the right was taken from First in Grindelwald, +looking south towards Eiger and Mönch. I live in an absolutely beautiful country. +Thanks, Switzerland ;-) + +On the Bucketlist front, I have the following to report. I waited a few months before +writing the post, but I can confidently say that accomplishing this L2/L3 path from +my workstation in Brüttisellen where I'm typing this blogpost, all the way over +Frankfurt to Amsterdam and being able to reach my original colocation machine at AS8283 +[Coloclue](https://coloclue.net/) using only switches, routers and IP addresses I own +is a continual joy. Seeing that my work now affords me a straight gigabit bandwidth +in each direction, makes me just fill with engineering pride and happiness. 
+ +``` +pim@chumbucket:~$ traceroute ghoul.ipng.nl +traceroute to gripe.ipng.nl (94.142.241.186), 30 hops max, 60 byte packets + 1 chbtl0.ipng.ch (194.1.163.66) 0.236 ms 0.178 ms 0.143 ms + 2 chrma0.ipng.ch (194.1.163.17) 1.394 ms 1.363 ms 1.332 ms + 3 defra0.ipng.ch (194.1.163.25) 7.275 ms 7.362 ms 7.213 ms + 4 nlams0.ipng.ch (194.1.163.27) 12.905 ms 12.843 ms 12.844 ms + 5 ghoul.ipng.nl (94.142.244.54) 13.120 ms 13.181 ms 13.044 ms +``` + +And as far as the _actual_ bucketlist item goes, although I made a bit harder on myself +because I moved to Switzerland, IP-Max also made it easier by giving me a great price +on the backhaul connectivity to Amsterdam, so I can report that the bucket list item +is indeed checked off the list: + +``` +pim@nlams0:~$ show protocols bgp address-family ipv6 unicast summary + +IPv6 Unicast Summary: +BGP table version 689670802 +RIB entries 251402, using 46 MiB of memory +Peers 67, using 1427 KiB of memory +Peer groups 32, using 2048 bytes of memory + +Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt +2a02:1668:a2b:5:869::1 4 51088 1561576 216485 0 0 0 08w4d03h 126136 5 +2a02:1668:a2b:5:869::2 4 51088 1546990 216485 0 0 0 08w4d03h 126127 5 +2a02:898::d1 4 8283 812846953 127814 0 0 0 08w6d20h 130590 6 +2a02:898::d2 4 8283 828908332 127814 0 0 0 08w0d16h 130590 6 +2a02:898:146::2 4 112 101560 228562328 0 0 0 06w2d15h 2 132437 +2a07:cd40:1::4 4 212855 105513 238069267 0 0 0 2d14h12m 1 132437 +2602:fed2:fff:ffff::1 4 137933 4180058 124978 0 0 0 04w4d10h 551 7 +2602:fed2:fff:ffff::253 4 209762 2034724 125048 0 0 0 1d00h14m 618 7 +2001:7f8:10f::205b:140 4 8283 137242 121460 0 0 0 08w5d17h 34 7 +2001:7f8:10f::207b:145 4 8315 278651 274793 0 0 0 06w0d12h 34 7 +2001:7f8:10f::500f:139 4 20495 117590 107877 0 0 0 04w3d00h 208 7 +2001:7f8:10f::ac47:131 4 44103 152949 55010 0 0 0 05w1d13h 24 7 +2001:7f8:10f::af36:129 4 44854 134969 146240 0 0 0 09w2d16h 1 7 +2001:7f8:10f::afd1:133 4 45009 35438 35477 0 0 0 01w0d02h 3 7 +2001:7f8:10f::e20a:148 4 57866 302505 280603 0 0 0 05w5d18h 161 7 +2001:7f8:10f::e3bb:137 4 58299 1419455 104321 0 0 0 04w0d13h 531 7 +2001:7f8:10f::ec8d:132 4 60557 120509 108071 0 0 0 01w4d20h 7 7 +2001:7f8:10f::3:259e:143 4 206238 278960 272776 0 0 0 04w4d18h 2 7 +2001:7f8:10f::3:3e9b:134 4 212635 823944 140075 0 0 0 08w5d17h 1 7 +2001:7f8:10f::dc49:253 4 56393 5693179 157171 0 0 0 02w6d22h 26680 7 +2001:7f8:10f::dc49:254 4 56393 5698910 162197 0 0 0 08w5d17h 26680 7 +2a02:2528:1902::1 4 25091 9964126 137696 0 0 0 09w1d22h 113020 5 +2001:7f8:8f::a500:6939:1 4 6939 8496149 138188 0 0 0 01w2d20h 48079 7 +2001:7f8:8f::a500:8283:1 4 8283 23251 52823 0 0 0 03w3d02h Active 0 +2001:7f8:8f::a501:3335:1 4 13335 3279 3199 0 0 0 1d02h35m 102 7 +2001:7f8:8f::a502:495:1 4 20495 117248 107466 0 0 0 04w3d00h 208 7 +2001:7f8:8f::a503:2934:1 4 32934 194428 193990 0 0 0 01w3d08h 30 7 +2001:7f8:8f::a503:2934:2 4 32934 194035 194002 0 0 0 03w3d11h 30 7 +2001:7f8:8f::a504:4854:1 4 44854 0 9052 0 0 0 never Idle (Admin) 0 +2001:7f8:8f::a504:5009:1 4 45009 35433 35467 0 0 0 01w0d02h 3 7 +2001:7f8:8f::a505:7866:1 4 57866 302602 276459 0 0 0 04w4d01h 161 7 +2001:7f8:8f::a505:8299:1 4 58299 912125 141718 0 0 0 04w0d13h 531 7 +2001:7f8:8f::a506:557:1 4 60557 120482 108067 0 0 0 01w4d20h 7 7 +2001:7f8:8f::a521:2635:1 4 212635 622475 85332 0 0 0 02w5d10h 1 7 +2001:7f8:8f::a504:9917:1 4 49917 8370930 158851 0 0 0 03w4d13h 25257 7 +2001:7f8:8f::a504:9917:2 4 49917 8397150 160118 0 0 0 04w4d01h 25011 7 +2001:7f8:13::a500:714:1 4 714 67722 66645 0 0 0 03w2d03h 
146 7 +2001:7f8:13::a500:714:2 4 714 68208 66645 0 0 0 03w2d03h 146 7 +2001:7f8:13::a500:6939:1 4 6939 10980475 98099 0 0 0 07w0d10h 48079 7 +2001:7f8:13::a502:495:1 4 20495 117773 107873 0 0 0 04w0d14h 208 7 +2001:7f8:13::a503:4307:1 4 34307 10709086 100814 0 0 0 09w4d23h 23339 7 +2001:7f8:13::a503:4307:2 4 34307 10694266 100814 0 0 0 09w4d23h 22137 7 +2001:7f8:8f::a504:4103:1 4 44103 152932 55010 0 0 0 05w1d13h 24 7 +2001:7f8:b7::a500:8283:1 4 8283 126035 98846 0 0 0 06w4d22h 34 7 +2001:7f8:b7::a501:3335:1 4 13335 4277 4157 0 0 0 1d10h34m 102 7 +2001:7f8:b7::a502:495:1 4 20495 117588 107871 0 0 0 04w3d00h 208 7 +2001:7f8:b7::a504:5009:1 4 45009 35441 35504 0 0 0 01w0d02h 3 7 +2001:7f8:b7::a506:557:1 4 60557 120546 108067 0 0 0 01w4d20h 7 7 +2001:7f8:b7::a521:2635:1 4 212635 716031 94458 0 0 0 08w5d17h 1 7 +2001:7f8:b7::a504:1441:1 4 41441 12911969 107363 0 0 0 08w2d12h 50606 7 +2001:7f8:b7::a504:1441:2 4 41441 12733337 107304 0 0 0 08w2d12h 50606 7 + +Total number of neighbors 67 + +pim@nlams0:~$ show protocols ospfv3 neighbor +Neighbor ID Pri DeadTime State/IfState Duration I/F[State] +194.1.163.7 1 00:00:32 Full/PointToPoint 62d21:41:24 dp0p6s0f3.100[PointToPoint] +194.1.163.34 1 00:00:39 Full/PointToPoint 27d22:28:30 dp0p6s0f3.200[PointToPoint] +``` + +There are three full IPv4 and IPv6 transit providers: AS51088 ([A2B Internet](https://a2b-internet.com/), +thanks Erik!), AS8283 ([Coloclue](https://coloclue.net/)) and AS25091 ([IP-Max](https://ip-max.net/), +thanks Fred!). Also, the router is connected directly to Speed-IX, LSIX, FrysIX and NL-IX. Along with +the many other internet exchanges I've connected to, it puts my humble AS50869 as #5 +[best connected](https://bgp.he.net/country/CH) ISP in Switzerland! + +I mean, just look at that stability, BGP sessions often times up as long as the machine +has been there (remember, I deployed `nlams0.ipng.ch` only in May, so 9 weeks is all we can ask for!). +OSPF uptime (helpfully shown with duration with OSPFv3 on FRR) is impeccable as well. The link with 27d +of uptime is because I took out that router for maintenance 27 days ago to upgrade it to a preliminary +version of DANOS + Bird2, as I prepare the move to VPP + Bird2 later this year. + +#### A note on mental health + +Mental health includes our emotional, psychological, and social well-being. It +affects how we think, feel, and act. It also helps determine how we handle stress, +relate to others, and make choices. Mental health is important at every stage of +life, from childhood and adolescence through adulthood. + +If you've read so far, thanks! I can imagine that some find this story a mixture of +nerd and brag, and that's OK. I am writing these stories because ***I find happiness in writing*** +about the small and large technical things that I perceive as important to my +feelings of accomplishment and therefor my wellbeing. + +I do many non-nerd and non-technical things, but I try to make it a habit of keeping my personal +life off the internet (I'm not on social media and not often on digital messaging boards or chat +apps). I could tell you equally enthusiastically about those hikes I took in Grindelwald, or +those Bürli I baked, but that would have to be in person. + +Well-being is a positive outcome that is meaningful for people and for many sectors +of society, because it tells us that people perceive that their lives are going +well. 
However, many indicators that measure living conditions fail to measure what
+people think and feel about their lives, such as the quality of their relationships,
+their positive emotions and resilience, the realization of their potential, or their
+overall satisfaction with life.
+
+I find satisfaction in my modest dabbles with IPng Networks, both the software and
+the hardware/physical aspects of it. I encourage everybody to have a safe/fun place
+where they spend some meaningful time doing things that _spark joy_. To your health!
diff --git a/content/articles/2021-08-07-fs-switch.md b/content/articles/2021-08-07-fs-switch.md
new file mode 100644
index 0000000..21387ce
--- /dev/null
+++ b/content/articles/2021-08-07-fs-switch.md
@@ -0,0 +1,710 @@
+---
+date: "2021-08-07T06:17:54Z"
+title: 'Review: FS S5860-20SQ Switch'
+---
+
+[FiberStore](https://fs.com/) is a staple provider of optics and network gear in Europe. Although
+I've been buying optics like SFP+ and QSFP+ from them for years, I rarely looked at the switch
+hardware they have on sale, until my buddy Arend suggested one of their switches as a good
+alternative for an Internet Exchange Point, one with [Frysian roots](https://frys-ix.net/) no less!
+
+# Executive Summary
+
+{{< image width="400px" float="left" src="/assets/fs-switch/fs-switches.png" alt="Switches" >}}
+
+The FS.com S5860 switch is pretty great: 20x 10G SFP+ ports, 4x 25G SFP28 ports
+and 2x 40G QSFP ports, which can also be reconfigured to be 4x10G each. The switch has a
+Cisco-like CLI, and great performance. I loadtested a pair of them in L2, QinQ, and in L3 mode,
+and they handled all the packets I sent to and through them, with all of the 10G, 25G and 40G ports
+in use. Considering the redundant power supplies, relatively low power usage, and silicon based
+switching of L2 and L3, I definitely appreciate the price/performance. The switch would be a
+better match if it allowed for MPLS based L2VPN services, but it doesn't support that.
+
+## Detailed findings
+
+### Hardware
+
+{{< image width="400px" float="right" src="/assets/fs-switch/fs-switch-inside.png" alt="Inside" >}}
+
+The switch is based on Broadcom's [BCM56170](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56170), codename *Hurricane*, with 28x10GbE + 4x25GbE ports internally, for a total switching bandwidth of 380Gbps. I
+noticed that the FS website shows 760Gbps of nonblocking capacity, which I can explain: Broadcom
+has taken the per port ingress capacity, while FS.com is taking the ingress/egress port
+capacity and summing them up. Further, the sales pitch claims 565Mpps, which I found curious: if
+we divide the available bandwidth of 380Gbps (the number from the Broadcom dataspec) by the smallest
+possible frame of 84 bytes (672 bits), we'll see 565Mpps. Why FS.com decided to seemingly
+arbitrarily double the switching capacity while reporting the nominal forwarding rate is
+beyond me.
+
+You can see more (hires) pictures in this [Photo Album](https://photos.app.goo.gl/6M69UgowGf6yWJmU7).
+
+This Broadcom chip is an SOC (System-on-a-Chip) which comes with an Arm A9 and a modest
+amount of TCAM on board, and packs into a 31x31mm *ball grid array* formfactor. The switch
+chip is able to store 16k routes and ACLs - it did not become immediately obvious to me what
+the partitioning is (between IPv4 entries, IPv6 entries, L2/L3/L4 ACL entries). One can only
+assume that the total sum of TCAM based objects must not exceed 4K entries. This means that
+as a campus switch, the L3 functionality will be great, including routing protocols such
+as OSPF and ISIS. However, BGP with any amount of routing table activity will not be a good
+fit for this chip, so my dreams of porting DANOS to it are shot out of the box :-)
+
+This Broadcom chip alone retails for €798,- apiece at [Digikey](https://www.digikey.ie/products/en?keywords=BCM56170B0IFSBG),
+with a manufacturer lead time of 50 weeks as of Aug'21, which may be related to the ongoing
+foundry and supply chain crisis; I don't know. But at that price point, the retail price of
+€1150,- per switch is really attractive.
+
+{{< image width="300px" float="right" src="/assets/fs-switch/noise.png" alt="Noise" >}}
+
+The switch comes with two modular and field-replaceable power supplies (rated at 150W each,
+delivering 12V at 12.5A, one fan installed), and with two modular and equally field-replaceable
+fan trays installed with one fan each. Idle, without any optics installed and with all interfaces
+down, the switch draws about 18W of power, which is nice. The fans spin up only when needed,
+and by default the switch is quiet, but certainly not silent. I measured it after a tip from
+Michael, certainly nothing scientific, but in a silent room that measures a floor of ~30 dBA,
+the switch booted up and briefly burst the fans at 60dBA, after which it **stabilized at 54dBA**
+or thereabouts. This is with both power supplies on, and with my cell phone microphone pointed
+directly towards the rear of the device, at 1 meter distance. Or something, IDK, I'm a network
+engineer, Jim, not an audio specialist!
+
+Besides the 20x 1G/10G SFP+ ports, 4x 25G ports and 2x 40G ports (which, incidentally, can be
+broken out into 4x 10G as well, bringing the TenGig port count to the datasheet specified 28),
+the switch also comes with a USB port (which mounts a filesystem on a USB stick, handy to do
+firmware upgrades and to copy files such as SSH keys back and forth), an RJ45 1G management
+port, which does not participate in the switch at all, and an RJ45 serial port that uses a
+standard Cisco cable for access and presents itself as `9600,8n1` to a console server, although
+flow control must be disabled on the serial port.
+
+#### Transceiver Compatibility
+
+FS did not attempt any vendor lock-in or crippleware with the ports and optics, yaay for that.
+I successfully inserted Cisco optics, Arista optics, FS.com 'Generic' optics, and several DACs
+for 10G, 25G and 40G that I had lying around. The switch is happy to take all of them, and,
+as one would expect, it supports diagnostics, which look like this:
+
+```
+fsw0#show interfaces TFGigabitEthernet0/24 transceiver
+Transceiver Type : 25GBASE-LR-SFP28
+Connector Type : LC
+Wavelength(nm) : 1310
+Transfer Distance :
+  SMF fiber
+  -- 10km
+Digital Diagnostic Monitoring : YES
+Vendor Serial Number : G2006362849
+
+Current diagnostic parameters[AP:Average Power]:
+Temp(Celsius) Voltage(V) Bias(mA) RX power(dBm) TX power(dBm)
+33(OK) 3.29(OK) 38.31(OK) -0.10(OK)[AP] -0.07(OK)
+
+Transceiver current alarm information:
+None
+```
+
+.. with a helpful shorthand `show interfaces ... trans diag` that only shows the optical budget.
+
+### Software
+
+I bought a pair of switches, and they came delivered with a current firmware version. The devices
+identify themselves as `FS Campus Switch (S5860-20SQ) By FS.COM Inc` with a hardware version of `1.00`
+and a software version of `S5860_FSOS 12.4(1)B0101P1`.
Firmware updates can be downloaded from the +FS.com website directly. I'm not certain if there's a viable ONIE firmware for this chip, although the +N8050 certainly can run ONIE, Cumulus and its own ICOS which is backed by Broadcom. Maybe +in the future I could take a better look at the open networking firmware aspects of this type of +hardware, but considering the CAM is tiny and the switch will do L2 in hardware, but L3 only up to +a certain amount of routes (I think 4K or 16K in the FIB, and only 1GB of ram on the SOC), this is +not the right platform to pour energy into trying to get DANOS to run on. + +Taking a look at the CLI, it's very Cisco IOS-esque; there's a few small differences, but the look +and feel is definitely familiar. Base configuration kind of looks like this: + +``` +fsw0#show running-config +hostname fsw0 +! +sntp server oob 216.239.35.12 +sntp enable +! +username pim privilege 15 secret 5 $1$ +! +ip name-server oob 8.8.8.8 +! +service password-encryption +! +enable service ssh-server +no enable service telnet-server +! +interface Mgmt 0 + ip address 192.168.1.10 255.255.255.0 + gateway 192.168.1.1 +! +snmp-server location Zurich, Switzerland +snmp-server contact noc@ipng.ch +snmp-server community 7 ro +! +``` + +Configuration as well follows the familiar `conf t` (configure terminal) that many of us grew up +with, and `show` command allow for `include` and `exclude` modifiers, of course with all the +shortest-next abbriviations such as `sh int | i Forty` and the likes. VLANs are to be declared +up front, with one notable cool feature of `supervlans`, which are the equivalent of aggregating +VLANs together in the switch - a useful example might be an internet exchange platform which has +trunk ports towards resellers, who might resell VLAN 101, 102, 103 each to an individual customer, +but then all end up in the same peering lan VLAN 100. + +A few of the services (SSH, SNMP, DNS, SNTP) can be bound to the management network, but for this +to work, the `oob` keyword has to be used. This likely because the mgmt port is a network interface +that is attached to the SOC, not to the switch fabric itself, and thus its route is not added to +the routing table. I like this, because it avoids the mgmt network to be picked up in OSPF, and +accidentally routed to/from. But it does show a bit more of an awkward config: + +``` +fsw1#show running-config | inc oob +sntp server oob 216.239.35.12 +ip name-server oob 8.8.8.8 +ip name-server oob 1.1.1.1 +ip name-server oob 9.9.9.9 + +fsw1#copy ? + WORD Copy origin file from native + flash: Copy origin file from flash: file system + ftp: Copy origin file from ftp: file system + http: Copy origin file from http: file system + oob_ftp: Copy origin file from oob_ftp: file system + oob_http: Copy origin file from oob_http: file system + oob_tftp: Copy origin file from oob_tftp: file system + running-config Copy origin file from running config + startup-config Copy origin file from startup config + tftp: Copy origin file from tftp: file system + tmp: Copy origin file from tmp: file system + usb0: Copy origin file from usb0: file system +``` + +Note here the hack `oob_ftp:` and such; this would allow the switch to copy things from the +OOB (management) network by overriding the scheme. But that's OK, I guess, not beautiful, +but it gets the job done and these types of commands will rarely be used. 
+ +A few configuration examples, notably QinQ, in which I configure a port to take usual dot1q +traffic, say from a customer, and add it into our local VLAN 200. Therefore, untagged traffic +on that port will turn into our VLAN 200, and tagged traffic will turn into our dot1ad stack +of outer VLAN 200 and inner VLAN whatever the customer provided -- in our case allowing only +VLANs 1000-2000 and untagged traffic into VLAN 200: + +``` +fsw0#confifgure +fsw0(config)#vlan 200 +fsw0(config-vlan)#name v-qinq-outer +fsw0(config-vlan)#exit +fsw0(config)#interface TenGigabitEthernet 0/3 +fsw0(config-if-TenGigabitEthernet 0/3)#switchport mode dot1q-tunnel +fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel native vlan 200 +fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add untagged 200 +fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add tagged 1000-2000 +``` + +The industry remains conflicted about the outer ethernet frame's type -- originally a tag +protocol identifier (TPID) of 0x9100 was suggested, and that's what this switch uses. But +the first specification of Q-in-Q called 802.1ad specified that the TPID should be 0x88a8 +instead of the VLAN tag that was 0x8100. This ugly reality can be reflected directly in the +switchport configuration by adding a `frame-tag tpid 0xXXXX` value to let the switch know +which TPID needs to be used for the outer tag. + +If this type of historical thing interests you, I definitely recommend reading up on Wikipedia on +[802.1q](https://en.wikipedia.org/wiki/IEEE_802.1Q) and [802.1ad](https://en.wikipedia.org/wiki/IEEE_802.1ad) +as well. + +## Loadtests + +For my loadtests, I used Cisco's T-Rex ([ref](https://trex-tgn.cisco.com/)) in stateless mode, +with a custom Python controller that ramps up and down traffic from the loadtester to the device +under test (DUT) by sending traffic out `port0` to the DUT, and expecting that traffic to be +presented back out from the DUT to its `port1`, and vice versa (out from `port1` -> DUT -> back +in on `port0`). You can read a bit more about my setup in my [Loadtesting at Coloclue]({% post_url 2021-02-27-coloclue-loadtest %}) +post. + +To stress test the switch, several pairs at 10G and 25G were used, and since the specs boast +line rate forwarding, I immediately ran T-Rex at maximum load with small frames. I found out, +once again, that Intel's X710 network cards aren't line rate, something I'll dive into in a bit +more detail another day, for now, take a look at the [T-Rex docs](https://github.com/cisco-system-traffic-generator/trex-core/blob/master/doc/trex_stateless_bench.asciidoc). + +### L2 + +First let's test a straight forward configuration. I connect a DAC between a 40G port on each +switch, and connect a loadtester to port `TenGigabitEthernet 0/1` and `TenGigabitEthernet 0/2` +on either switch, and leave everything simply in the default VLAN. This means packets from +Te0/1 and Te0/2 go out on Fo0/26, then through the DAC into Fo0/26 on the second switch, and +out on Te0/1 and Te0/2 there, to return to the loadtester. 
Configuration wise, rather boring: + +``` +fsw0#configure +fsw0(config)#vlan 1 +fsw0(config-vlan)#name v-default + +fsw0#show run int te0/1 +interface TenGigabitEthernet 0/1 + +fsw0#show run int te0/2 +interface TenGigabitEthernet 0/2 + +fsw0#show run int fo0/26 +interface FortyGigabitEthernet 0/26 + switchport mode trunk + switchport trunk allowed vlan only 1 + +fsw0#show vlan id 1 +VLAN Name Status Ports +---------- -------------------------------- --------- ----------------------------------- + 1 v-default STATIC Te0/1, Te0/2, Te0/3, Te0/4 + Te0/5, Te0/6, Te0/7, Te0/8 + Te0/9, Te0/10, Te0/11, Te0/12 + Te0/13, Te0/14, Te0/15, Te0/16 + Te0/17, Te0/18, Te0/19, Te0/20 + TF0/21, TF0/22, TF0/23, Fo0/25 + Fo0/26 +``` + +I set up T-Rex with unique MAC addresses for each of its ports, I find it useful +to codify a few bits of information into the MAC, such as loadtester machine, +PCI bus, port, so that when I try to find them on the switches in the forwarding +table, and I have many loadtesters running at the same time, it's easier to +find what I'm looking for. My trex configuration for this loadtest: +``` +pim@hippo:~$ cat /etc/trex_cfg.yaml +- version : 2 + interfaces : ["42:00.0","42:00.1", "42:00.2", "42:00.3"] + port_limit : 4 + port_info : + - dest_mac : [0x0,0x2,0x1,0x1,0x0,0x00] # port 0 + src_mac : [0x0,0x2,0x1,0x2,0x0,0x00] + - dest_mac : [0x0,0x2,0x1,0x2,0x0,0x00] # port 1 + src_mac : [0x0,0x2,0x1,0x1,0x0,0x00] + - dest_mac : [0x0,0x2,0x1,0x3,0x0,0x00] # port 2 + src_mac : [0x0,0x2,0x1,0x4,0x0,0x00] + - dest_mac : [0x0,0x2,0x1,0x4,0x0,0x00] # port 3 + src_mac : [0x0,0x2,0x1,0x3,0x0,0x00] +``` + +Here's where I notice something I've noticed before: the Intel X710 network cards cannot +actually fill 4x10G at line rate. They're fine at larger frames, but they max out at about +32Mpps throughput -- and we know that each 10G connection filled with small ethernet frames +in one direction will consume 14.88Mpps. The same is true for the XXV710 cards, the chip +used will really only source 30Mpps across all ports, which is sad but true. + +So I have a choice to make: either I run small packets at a rate that's acceptable for the +NIC (~7.5Mpps per port thus 30Mpps across the X710-DA4), or I run `imix` at line rate +but with slightly less packets/sec. I chose the latter for these tests, and will be reporting +the usage based on `imix` profile, which saturates 10G at 3.28Mpps in one direction, or +13.12Mpps per network card. 
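+
+For those who like to double check these numbers, here's a small back-of-the-envelope sketch
+(plain Python, nothing T-Rex specific) that reproduces the 14.88Mpps and ~3.28Mpps figures. It
+assumes the usual 20 bytes of per-packet L1 overhead (preamble, start-of-frame delimiter and
+inter-packet gap) and the imix frame mix that T-Rex uses, both of which I'll come back to in
+more detail further down in this post:
+
+```
+#!/usr/bin/env python3
+# L1 overhead per packet: 7B preamble + 1B start-of-frame delimiter + 12B inter-packet gap.
+L1_OVERHEAD = 20
+
+def pps(line_rate_bps, l2_frame_bytes):
+    """Packets/sec needed to fill a link of line_rate_bps with frames of l2_frame_bytes."""
+    return line_rate_bps / ((l2_frame_bytes + L1_OVERHEAD) * 8)
+
+def imix_avg_frame():
+    # T-Rex imix: 28x 64 byte, 16x 590 byte and 4x 1514 byte frames.
+    mix = [(28, 64), (16, 590), (4, 1514)]
+    return sum(n * size for n, size in mix) / sum(n for n, _ in mix)
+
+print(f"64b  @ 10G: {pps(10e9, 64) / 1e6:.2f} Mpps")                # ~14.88 Mpps
+print(f"imix @ 10G: {pps(10e9, imix_avg_frame()) / 1e6:.2f} Mpps")  # ~3.29 Mpps, the ~3.28 above
+```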
+ +Of course, I can run two of these at the same time, *pourquois pas*, which looks like this: + +``` +fsw0#show mac +Vlan MAC Address Type Interface Live Time +---------- -------------------- -------- ------------------------------ ------------- + 1 0001.0101.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:16:11 + 1 0001.0102.0000 DYNAMIC TenGigabitEthernet 0/1 0d 00:16:11 + 1 0001.0103.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:16:11 + 1 0001.0104.0000 DYNAMIC TenGigabitEthernet 0/2 0d 00:16:10 + 1 0002.0101.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:15:51 + 1 0002.0102.0000 DYNAMIC TenGigabitEthernet 0/3 0d 00:15:51 + 1 0002.0103.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:15:51 + 1 0002.0104.0000 DYNAMIC TenGigabitEthernet 0/4 0d 00:15:50 + +fsw0#show int usage | exclude 0.00 +Interface Bandwidth Average Usage Output Usage Input Usage +------------------------------------ ----------- ---------------- ---------------- ---------------- +TenGigabitEthernet 0/1 10000 Mbit 94.66% 94.66% 94.66% +TenGigabitEthernet 0/2 10000 Mbit 94.66% 94.66% 94.66% +TenGigabitEthernet 0/3 10000 Mbit 94.65% 94.66% 94.66% +TenGigabitEthernet 0/4 10000 Mbit 94.66% 94.66% 94.66% +FortyGigabitEthernet 0/26 40000 Mbit 94.66% 94.66% 94.66% + +fsw0#show cpu core +[Slot 0 : S5860-20SQ] +Core 5Sec 1Min 5Min + 0 16.40% 12.00% 12.80% +``` + +This is the first time that I noticed that the switch usage (94.66%) somewhat confusingly +lines up with the observed T-Rex statistics: what the switch reports, T-Rex considers L2 +(ethernet) use, not L1 use. For an in-depth explanation of this, see below in the L3 +section. But for now, let's just say that when T-Rex says it's sending 37.9Gbps of ethernet +traffic (which is 40.00Gbps of bits on the line), that corresponds to the 94.75% we see +the switch reporting. + +So suffice to say, at 80Gbit actual throughput (40G from Te0/1-3 ingress and 40G to +Te0/1-3 egress), the switch performs at line rate, with no noticable lag or jitter. +The CLI is responsive and the fans aren't spinning harder than at idle, even after 60min +of packets. Good! + +### QinQ + +Then, I reconfigured the switch to let each pair of ports (Te0/1-2 and Te0/3-4) each drop +into a Q-in-Q VLAN, with tag 20 and tag 21 respectively. The configuration: + + +``` +interface TenGigabitEthernet 0/1 + switchport mode dot1q-tunnel + switchport dot1q-tunnel allowed vlan add untagged 20 + switchport dot1q-tunnel native vlan 20 +! +interface TenGigabitEthernet 0/3 + switchport mode dot1q-tunnel + switchport dot1q-tunnel allowed vlan add untagged 21 + switchport dot1q-tunnel native vlan 21 + spanning-tree bpdufilter enable +! +interface FortyGigabitEthernet 0/26 + switchport mode trunk + switchport trunk allowed vlan only 1,20-21 + +fsw0#show mac +Vlan MAC Address Type Interface Live Time +---------- -------------------- -------- ------------------------------ ------------- + 20 0001.0101.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 01:15:02 + 20 0001.0102.0000 DYNAMIC TenGigabitEthernet 0/1 0d 01:15:01 + 20 0001.0103.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 01:15:02 + 20 0001.0104.0000 DYNAMIC TenGigabitEthernet 0/2 0d 01:15:03 + 21 0002.0101.0000 DYNAMIC TenGigabitEthernet 0/4 0d 00:01:50 + 21 0002.0102.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:01:03 + 21 0002.0103.0000 DYNAMIC TenGigabitEthernet 0/3 0d 00:01:59 + 21 0002.0104.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:01:02 +``` + +Two things happen that require a bit of explanation. 
First of all, despite both loadtesters +use the exact same configuration (in fact, I didn't even stop them from emitting packets +while reconfiguring the switch), I now have packetloss, the throughput per 10G port has +reduced from 94.67% to 93.63% and at the same time, I observe that the 40G ports raised +their usage from 94.66% to 94.81%. + +``` +fsw1#show int usage | e 0.00 +Interface Bandwidth Average Usage Output Usage Input Usage +------------------------------------ ----------- ---------------- ---------------- ---------------- +TenGigabitEthernet 0/1 10000 Mbit 94.20% 93.63% 94.67% +TenGigabitEthernet 0/2 10000 Mbit 94.21% 93.65% 94.67% +TenGigabitEthernet 0/3 10000 Mbit 91.05% 94.66% 94.66% +TenGigabitEthernet 0/4 10000 Mbit 90.80% 94.66% 94.66% +FortyGigabitEthernet 0/26 40000 Mbit 94.81% 94.81% 94.81% +``` + +The switches, however, are perfectly fine. The reason for this loss is that when I created +the `dot1q-tunnel`, the switch sticks another VLAN tag (4 bytes, or 32 bits) on each packet +before sending it out the 40G port between the switches, and at these packet rates, it adds +up. Each 10G switchport is receiving 3.28Mpps (for a total of 13.12Mpps) which, when the +switch needs to send it to its peer on the 40G trunk, adds 13.12Mpps * 32 bits = 419.8Mbps +on top of the 40G line rate, implying we're going to be losing roughly 1.045% of our packets. +And indeed, the difference between 94.67 (inbound) and 93.63 (outbound) is 1.04% which lines +up. + + +``` +Global Statistics + +connection : localhost, Port 4501 total_tx_L2 : 37.92 Gbps +version : STL @ v2.91 total_tx_L1 : 40.02 Gbps +cpu_util. : 43.52% @ 8 cores (4 per dual port) total_rx : 37.92 Gbps +rx_cpu_util. : 0.0% / 0 pps total_pps : 13.12 Mpps +async_util. : 0% / 198.64 bps drop_rate : 0 bps +total_cps. : 0 cps queue_full : 0 pkts + +Port Statistics + + port | 0 | 1 | 2 | 3 +-----------+-------------------+-------------------+-------------------+------------------ +owner | pim | pim | pim | pim +link | UP | UP | UP | UP +state | TRANSMITTING | TRANSMITTING | TRANSMITTING | TRANSMITTING +speed | 10 Gb/s | 10 Gb/s | 10 Gb/s | 10 Gb/s +CPU util. | 46.29% | 46.29% | 40.76% | 40.76% +-- | | | | +Tx bps L2 | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps +Tx bps L1 | 10 Gbps | 10 Gbps | 10 Gbps | 10 Gbps +Tx pps | 3.28 Mpps | 3.27 Mpps | 3.27 Mpps | 3.28 Mpps +Line Util. | 100.04 % | 100.04 % | 100.04 % | 100.04 % +--- | | | | +Rx bps | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps +Rx pps | 3.24 Mpps | 3.24 Mpps | 3.23 Mpps | 3.24 Mpps +---- | | | | +opackets | 1891576526 | 1891577716 | 1891547042 | 1891548090 +ipackets | 1891576643 | 1891577837 | 1891547158 | 1891548214 +obytes | 684435443496 | 684435873418 | 684424773684 | 684425153614 +ibytes | 684435484082 | 684435916902 | 684424817178 | 684425197948 +tx-pkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts +rx-pkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts +tx-bytes | 684.44 GB | 684.44 GB | 684.42 GB | 684.43 GB +rx-bytes | 684.44 GB | 684.44 GB | 684.42 GB | 684.43 GB +----- | | | | +oerrors | 0 | 0 | 0 | 0 +ierrors | 0 | 0 | 0 | 0 +``` + +### L3 + +For this test, I reconfigured the 25G ports to become routed rather than switched, and I put them +under 80% load with T-Rex (where 80% here is of L1), thus the ports are emitting 20Gbps of traffic +at a rate of 13.12Mpps. I left two of the 10G ports just continuing their ethernet loadtest at +100%, which is also 20Gbps of traffic and 13.12Mpps. 
In total, I observed 79.95Gbps of traffic +between the two switches: an entirely saturated 40G port in both directions. + +I then created a simple topology with OSPF, both switches configured a `Loopback0` interface with +a /32 IPv4 and /128 IPv6 address, and a transit network between them in a `VLAN100` interface. +OSPF and OSPFv3 both distribute connected and static routes, to keep things simple. + +Finally, I added an IP address on the `Tf0/24` interface, set a static IPv4 route for 16.0.0.0/8 +and 48.0.0.0/8 towards that interface on each switch, and added VLAN 100 to the `Fo0/26` trunk. +It looks like this for switch `fsw0`: + +``` +interface Loopback 0 + ip address 100.64.0.0 255.255.255.255 + ipv6 address 2001:DB8::/128 + ipv6 enable + +interface VLAN 100 + ip address 100.65.2.1 255.255.255.252 + ipv6 enable + ip ospf network point-to-point + ipv6 ospf network point-to-point + ipv6 ospf 1 area 0 + +interface TFGigabitEthernet 0/24 + no switchport + ip address 100.65.1.1 255.255.255.0 + ipv6 address 2001:DB8:1:1::1/64 + +interface FortyGigabitEthernet 0/26 + switchport mode trunk + switchport trunk allowed vlan only 1,20,21,100 + +router ospf 1 + graceful-restart + redistribute connected subnets + redistribute static subnets + area 0 + network 100.65.2.0 0.0.0.3 area 0 +! + +ipv6 router ospf 1 + graceful-restart + redistribute connected + redistribute static + area 0 +! + +ip route 16.0.0.0 255.0.0.0 100.65.1.2 +ipv6 route 2001:db8:100::/40 2001:db8:1:1::2 +``` + +With this topology, an L3 routing domain emerges between `Tf0/24` on switch `fsw0` and `Tf0/24` +on switch `fsw1`, and we can inspect this, taking a look at `fsw1`, I can see that both IPv4 and +IPv6 adjacencies have formed, and that the switches, néé routers, have learned +routes from one another: + +``` +fsw1#show ip ospf neighbor +OSPF process 1, 1 Neighbors, 1 is Full: +Neighbor ID Pri State BFD State Dead Time Address Interface +100.65.2.1 1 Full/ - - 00:00:31 100.65.2.1 VLAN 100 + +fsw1#show ipv6 ospf neighbor +OSPFv3 Process (1), 1 Neighbors, 1 is Full: +Neighbor ID Pri State BFD State Dead Time Instance ID Interface +100.65.2.1 1 Full/ - - 00:00:31 0 VLAN 100 + +fsw1#show ip route +Codes: C - Connected, L - Local, S - Static + R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route + N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2 + E1 - OSPF external type 1, E2 - OSPF external type 2 + SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2 + IA - Inter area, EV - BGP EVPN, A - Arp to host + * - candidate default + +Gateway of last resort is no set +O E2 16.0.0.0/8 [110/20] via 100.65.2.1, 12:42:13, VLAN 100 +S 48.0.0.0/8 [1/0] via 100.65.0.2 +O E2 100.64.0.0/32 [110/20] via 100.65.2.1, 00:05:23, VLAN 100 +C 100.64.0.1/32 is local host. +C 100.65.0.0/24 is directly connected, TFGigabitEthernet 0/24 +C 100.65.0.1/32 is local host. +O E2 100.65.1.0/24 [110/20] via 100.65.2.1, 12:44:57, VLAN 100 +C 100.65.2.0/30 is directly connected, VLAN 100 +C 100.65.2.2/32 is local host. 
+ +fsw1#show ipv6 route +IPv6 routing table name - Default - 12 entries +Codes: C - Connected, L - Local, S - Static + R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route + N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2 + E1 - OSPF external type 1, E2 - OSPF external type 2 + SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2 + IA - Inter area, EV - BGP EVPN, N - Nd to host + +O E2 2001:DB8::/128 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100 +LC 2001:DB8::1/128 via Loopback 0, local host +C 2001:DB8:1::/64 via TFGigabitEthernet 0/24, directly connected +L 2001:DB8:1::1/128 via TFGigabitEthernet 0/24, local host +O E2 2001:DB8:1:1::/64 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100 +O E2 2001:DB8:100::/40 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100 +C FE80::/10 via ::1, Null0 +C FE80::/64 via Loopback 0, directly connected +L FE80::669D:99FF:FED0:A076/128 via Loopback 0, local host +C FE80::/64 via TFGigabitEthernet 0/24, directly connected +L FE80::669D:99FF:FED0:A076/128 via TFGigabitEthernet 0/24, local host +C FE80::/64 via VLAN 100, directly connected +L FE80::669D:99FF:FED0:A076/128 via VLAN 100, local host +``` + +Great success! I can see from the `fsw1` output above, its OSPF process has learned +routes for the IPv4 and IPv6 loopbacks (100.64.0.0/32 and 2001:DB8::1/128 respectively), +the connected routes (100.65.1.0/24 and 2001:DB8:1:1::/64 respectively), and the +static routes (16.0.0.0/8 and 2001:db8:100::/40). + +So let's make use of this topology and change one of the two loadtesters to switch +to L3 mode instead: + +``` +pim@hippo:~$ cat /etc/trex_cfg.yaml +- version : 2 + interfaces : ["0e:00.0", "0e:00.1" ] + port_bandwidth_gb: 25 + port_limit : 2 + port_info : + - ip : 100.65.0.2 + default_gw : 100.65.0.1 + - ip : 100.65.1.2 + default_gw : 100.65.1.1 +``` + +I left the loadtest running for 12hrs or so, and observed the results to be squeaky clean. +The loadtester machine was generating ~96Gb/core at 20% utilization, so lazily generating +40.00Gbit of traffic at 25.98Mpps (remember, this was setting the load to 80% on the 25G +port, and 99% on the 10G ports). Looking at the switch and again being surprised about the +discrepancy, I decided to fully explore the curiosity in this switch's utilization reporting. + +``` +fsw1#show interfaces usage | exclude 0.00 +Interface Bandwidth Average Usage Output Usage Input Usage +------------------------------------ ----------- -------------- -------------- ----------- +TenGigabitEthernet 0/1 10000 Mbit 93.80% 93.79% 93.81% +TenGigabitEthernet 0/2 10000 Mbit 93.80% 93.79% 93.81% +TFGigabitEthernet 0/24 25000 Mbit 75.80% 75.79% 75.81% +FortyGigabitEthernet 0/26 40000 Mbit 94.79% 94.79% 94.79% + +fsw1#show int te0/1 | inc packets/sec + 10 seconds input rate 9381044793 bits/sec, 3240802 packets/sec + 10 seconds output rate 9378930906 bits/sec, 3240123 packets/sec + +fsw1#show int tf0/24 | inc packets/sec + 10 seconds input rate 18952369793 bits/sec, 6547299 packets/sec + 10 seconds output rate 18948317049 bits/sec, 6545915 packets/sec + +fsw1#show int fo0/26 | inc packets/sec + 10 seconds input rate 37915517884 bits/sec, 13032078 packets/sec + 10 seconds output rate 37915335102 bits/sec, 13026051 packets/sec + +``` + +Looking at that number, 75.80% was not the 80% that I had asked for, and actually the usage of +the 10G ports (which I put at 99% load) and 40G port are also lower than I had anticipated. +What's going on there? 
It's quite simple after doing some math: ***the switch is reporting
+L2 bits/sec, not L1 bits/sec!***
+
+On the L3 loadtest, using the `imix` profile, T-Rex is sending 13.02Mpps of load, which,
+according to its own observation, is 37.8Gbit of L2 and 40.00Gbps of L1 bandwidth. On the L2
+loadtest, again using the `imix` profile, T-Rex is sending 4x 3.24Mpps as well, which it claims
+is 37.6Gbps of L2 and 39.66Gbps of L1 bandwidth (note: I put the loadtester here at 99% of
+line rate to ensure I would not end up losing packets due to congestion on the 40G port).
+
+So according to T-Rex, I am sending 75.4Gbps of traffic (37.8Gbps in the L2 test and 37.6Gbps
+in the simultaneous L3 loadtest), yet I'm seeing 37.9Gbps on the switchport. Oh my!
+
+Here's how all of these numbers relate to one another:
+
+* First off, we are sending 99% linerate at 3.24Mpps into Te0/1 and Te0/2 on each switch.
+* Then, we are sending 80% linerate at 6.55Mpps into Tf0/24 on each switch.
+* The Te0/1 and Te0/2 are both in the default VLAN on either side.
+* But, the Tf0/24 is sending its IP traffic through the VLAN 100 interconnect, which means all
+  of that traffic gets a dot1q VLAN tag added. That's 4 bytes for each packet.
+* Sending 6.55Mpps * 32 bits of extra overhead equals 209600000 bits/sec (0.21Gbps).
+* The loadtester claims 37.70Gbps, but the switch sees 37.91Gbps, which is exactly the difference
+  we calculated above (0.21Gbps), and equals the overhead created by adding the VLAN tag on
+  the 25G stream that is in VLAN 100.
+
+Now we are ready to explain the difference between the switch reported port usage and the
+loadtester reported port usage:
+
+* The loadtester is sending an `imix` traffic mix, which consists of a ratio of 28:16:4 of
+  packets that are 64:590:1514 bytes.
+* We already know that to create a packet on the wire, we have to add a 7 byte preamble and
+  a 1 byte start frame delimiter, and end with a 12 byte interpacket gap, so each ethernet
+  frame is 20 bytes longer, making 84 bytes the on-the-wire smallest possible frame.
+* We know we're sending 3.24Mpps on a 10G port at 99% T-Rex (L1) usage:
+  * Each packet needs 20 bytes or 160 bits of overhead, which is 518400000 bits/sec
+  * We are seeing 9381044793 bits/sec on a 10G port (**corresponding switch 93.80% usage**)
+  * Adding these two numbers up gives us 9899444793 bits/sec (**corresponding T-Rex 98.99% usage**)
+* Conversely, the whole system is sending 37.9Gbps on the 40G port (**corresponding switch 37.9/40 == 94.79% usage**)
+  * We know this is 2x 10G streams at 99% utilization and 1x 25G stream at 80% utilization
  * This is 13.03Mpps, which generates 2084800000 bits/sec of overhead
+  * Adding these two numbers up gives us 40.00 Gbps of usage (which is the expected L1 line rate)
+
+I find it very fulfilling to see these numbers meaningfully add up! Oh, and by the way, the
+switches are switching and routing all of this with 0.00% packet loss,
+and the chassis doesn't even get warm :-)
+
+
+```
+Global Statistics
+
+connection : localhost, Port 4501 total_tx_L2 : 38.02 Gbps
+version : STL @ v2.91 total_tx_L1 : 40.02 Gbps
+cpu_util. : 21.88% @ 4 cores (4 per dual port) total_rx : 38.02 Gbps
+rx_cpu_util. : 0.0% / 0 pps total_pps : 13.13 Mpps
+async_util. : 0% / 39.16 bps drop_rate : 0 bps
+total_cps. 
: 0 cps queue_full : 0 pkts + +Port Statistics + + port | 0 | 1 | total +-----------+-------------------+-------------------+------------------ +owner | pim | pim | +link | UP | UP | +state | TRANSMITTING | TRANSMITTING | +speed | 25 Gb/s | 25 Gb/s | +CPU util. | 21.88% | 21.88% | +-- | | | +Tx bps L2 | 19.01 Gbps | 19.01 Gbps | 38.02 Gbps +Tx bps L1 | 20.06 Gbps | 20.06 Gbps | 40.12 Gbps +Tx pps | 6.57 Mpps | 6.57 Mpps | 13.13 Mpps +Line Util. | 80.23 % | 80.23 % | +--- | | | +Rx bps | 19 Gbps | 19 Gbps | 38.01 Gbps +Rx pps | 6.56 Mpps | 6.56 Mpps | 13.13 Mpps +---- | | | +opackets | 292215661081 | 292215652102 | 584431313183 +ipackets | 292152912155 | 292153677482 | 584306589637 +obytes | 105733412810506 | 105733412001676 | 211466824812182 +ibytes | 105710857873526 | 105711223651650 | 211422081525176 +tx-pkts | 292.22 Gpkts | 292.22 Gpkts | 584.43 Gpkts +rx-pkts | 292.15 Gpkts | 292.15 Gpkts | 584.31 Gpkts +tx-bytes | 105.73 TB | 105.73 TB | 211.47 TB +rx-bytes | 105.71 TB | 105.71 TB | 211.42 TB +----- | | | +oerrors | 0 | 0 | 0 +ierrors | 0 | 0 | 0 +``` + +### Conclusions + +It's just super cool to see a switch like this work as expected. I did not manage to +overload it at all, neither with IPv4 loadtest at 20Mpps and 50Gbit of traffic, nor with +L2 loadtest at 26Mpps and 80Gbit of traffic, with QinQ demonstrably done in hardware as +well as IPv4 route lookups. I will be putting these switches into production soon on the +IPng Networks links between Glattbrugg and Rümlang in Zurich, thereby upgrading our +backbone from 10G to 25G CWDM. It seems to me, that using these switches as L3 devices +given a smaller OSPF routing domain (currently, we have ~300 prefixes in our OSPF at +AS50869), would definitely work well, as would pushing and popping QinQ trunks for our +customers (for example on Solnet or Init7 or Openfactory). + +Approved. A+, will buy again. diff --git a/content/articles/2021-08-12-vpp-1.md b/content/articles/2021-08-12-vpp-1.md new file mode 100644 index 0000000..f572c95 --- /dev/null +++ b/content/articles/2021-08-12-vpp-1.md @@ -0,0 +1,404 @@ +--- +date: "2021-08-12T11:17:54Z" +title: VPP Linux CP - Part1 +--- + + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. One thing notably missing, is the higher level control plane, that is +to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a +VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network +devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols +like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet +forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use +VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the +interface state (links, addresses and routes) itself. When the plugin is completed, running software +like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving +>100Mpps and >100Gbps forwarding rates will be well in reach! 
+ +In this first post, let's take a look at tablestakes: making a copy of VPP's interfaces appear in +the Linux kernel. + +## My test setup + +I took two AMD64 machines, each with 32GB of memory and one Intel X710-DA4 network card (which offers +four SFP+ cages), and installed Ubuntu 20.04 on them. I connected each of the network ports back +to back with DAC cables. This gives me plenty of interfaces to play with. On the vanilla Ubuntu machine, +I created a bunch of different types of interfaces and configured IPv4 and IPv6 addresses on them. + +The goal of this post is to show what code needed to be written and which changes needed to +be made to the plugin, in order to mirror each type of interface from VPP into a valid Linux interface. +As we'll see, marrying the Linux network interface approach with the VPP interface approach can be +tricky! Throughout this post, the vanilla Ubuntu machine will keep the following configuration, the +config file of which you can see in the Appendix: + +| Name | type | Addresses +|-----------------|------|---------- +| enp66s0f0 | untagged | 10.0.1.2/30 2001:db8:0:1::2/64 +| enp66s0f0.q | dot1q 1234 | 10.0.2.2/30 2001:db8:0:2::2/64 +| enp66s0f0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.0.3.2/30 2001:db8:0:3::2/64 +| enp66s0f0.ad | dot1ad 2345 | 10.0.4.2/30 2001:db8:0:4::2/64 +| enp66s0f0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.0.5.2/30 2001:db8:0:5::2/64 + +This configuration will allow me to ensure that all common types of sub-interface are supported +by the plugin. + +### Startingpoint + +The `linux-cp` plugin that ships with VPP 21.06, when initialized with the desired startup config +(see Appendix), will yield this (Hippo is the machine that runs my development branch of VPP, it's +called like that because it's always hungry for packets): + +``` + +pim@hippo:~/src/lcpng$ ip ro +default via 194.1.163.65 dev enp6s0 proto static +10.0.1.0/30 dev e0 proto kernel scope link src 10.0.1.1 +10.0.2.0/30 dev e0.1234 proto kernel scope link src 10.0.2.1 +10.0.4.0/30 dev e0.1236 proto kernel scope link src 10.0.4.1 +194.1.163.64/27 dev enp6s0 proto kernel scope link src 194.1.163.88 + +pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2 +10.0.1.2 is alive +10.0.2.2 is alive +10.0.3.2 is unreachable +10.0.4.2 is unreachable +10.0.5.2 is unreachable + +pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \ + 2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2 +2001:db8:0:1::2 is alive +2001:db8:0:2::2 is alive +2001:db8:0:3::2 is unreachable +2001:db8:0:4::2 is unreachable +2001:db8:0:5::2 is unreachable +``` + +Yikes! So the plugin really only knows how to handle untagged interfaces, and sub-interfaces with +one dot1q VLAN tag. The other three scenarios (dot1ad VLAN tag; dot1q in dot1q; and dot1q in dot1ad) +are not ok. And, curiously, the `dot1ad 2345 exact-match` interface _was_ created (as linux interface +`e0.1236`, but it doesn't ping, and I'll show you why :-) But principally: let's fix this plugin! + + +### Anatomy of Linux Interface Pairs + +In VPP, the plumbing to the Linux kernel is done via a TUN/TAP interface. For L3 interfaces, TAP is +used. This TAP appears in the Linux network namesapce as a device with which you can interact. From +the Linux point of view, on egress, all packets coming from the host into the TAP are cross-connected +directly to the logical VPP network interface. 
In VPP, on ingress, packets destined for an L3 address +on any VPP interface, as well as packets that are multicast, are punted into the TAP, which makes +them appear in the kernel. + +In VPP, a linux interface pair (`LIP` for short) is therefore a tuple `{ vpp_phy_idx, vpp_tap_idx, netlink_idx }`. +Creating one of these, is the art of first creating a tap, and associating it with the `vpp_phy`, +copying traffic from it into the dataplane, and punting traffic from the dataplane into the TAP +so that Linux can see it. The plugin exposes an API endpoint that creates, deletes and lists +these linux interface pairs: +``` +lcp create | host-if netns [tun] +lcp delete | +show lcp [phy ] +``` + +If you're still with me, congratulations, because this is where it starts to get fun! + +### Create interface: physical + +The easiest interface type is a physical one. Here, the plugin will create a TAP, copy the MAC +address from the PHY, and set a bunch of attributes on the TAP, such as MTU and link state. +Here, I made my first set of changes (in [[patchset 3](https://gerrit.fd.io/r/c/vpp/+/33481/2..3)]) +to the plugin: + +* Initialize the link state of the VPP interface, not unconditionally set it to 'down'. +* Initialize the MTU of the VPP interface into the TAP, do not assume it is the VPP default + of 9000; if the MTU is not known, assume the TAP has 9216, the largest possible on ethernet. + +Taking a look: +``` +DBGvpp# show int TenGigabitEthernet3/0/0 + Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count +TenGigabitEthernet3/0/0 1 down 9000/0/0/0 +DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0 +DBGvpp# show tap tap1 +Interface: tap1 (ifindex 7) + name "e0" + host-ns "(nil)" + host-mtu-size "9000" + host-mac-addr: 68:05:ca:32:46:14 +... + +DBGvpp# set interface state TenGigabitEthernet3/0/1 up +DBGvpp# set interface mtu packet 1500 TenGigabitEthernet3/0/1 +DBGvpp# lcp create TenGigabitEthernet3/0/1 host-if e1 +``` + +And in Linux, unceremoniously, both interfaces appear: +``` +pim@hippo:~/src/lcpng$ ip link show e0 +291: e0: mtu 9000 qdisc mq state DOWN mode DEFAULT group default qlen 1000 + link/ether 68:05:ca:32:46:14 brd ff:ff:ff:ff:ff:ff +pim@hippo:~/src/lcpng$ ip link show e1 +307: e1: mtu 1500 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 + link/ether 68:05:ca:32:46:15 brd ff:ff:ff:ff:ff:ff +``` + +The MAC address from the physical interface `show hardware-interface TenGigabitEthernet3/0/0` +corresponds to the one seen in the TAP, and the one seen in the Linux interface we just created. +The Linux interfaces respect the MTU and link state of their counterpart VPP interfaces (`e0` is +down at 9000b, `e1` is up at 1500b). + +### Create interface: dot1q + +Note that creating an ethernet sub-interface in VPP takes the following form: +``` +create sub-interfaces { [default|untagged]} | {-} + | { dot1q|dot1ad |any [inner-dot1q |any] [exact-match]} +``` + +Here, I'll start with the simplest form, canonically called a .1q VLAN or a _tagged_ interface. +The plugin handles it just fine, with a codepath that first creates a sub-interface on the parent's +TAP, forwards traffic to/from the VPP subinterface into the parent TAP, asks the Linux kernel to +create a new interface of type vlan with the `id` set to the dot1q tag, as a child of the `e0` +interface. Note however the `exact-match` keyword, which is very important. In VPP, without setting +exact-match, any ethernet frame that matches the sub-interface expression, will be handled by it. 
+
+This means the VLAN with tag 1234, but also a stacked (Q-in-Q or Q-in-AD) VLAN with the outer tag
+set to 1234, will match. This is nonsensical for an IP interface, and as such the first two examples
+will be created successfully, but the third example will crash the plugin:
+
+```
+## Good, shorthand sets exact-match
+DBGvpp# create sub TenGigabitEthernet3/0/0 1234
+DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234
+
+## Good, explicitly set exact-match
+DBGvpp# create sub TenGigabitEthernet3/0/0 1234 dot1q 1234 exact-match
+DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234
+
+## Bad, will crash
+DBGvpp# create sub TenGigabitEthernet3/0/0 1234 dot1q 1234
+DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234
+```
+
+The reason is that the first call is a shorthand: it creates sub-int 1234 as `dot1q 1234 exact-match`,
+which is literally what the second example does, while the third example creates a non-exact-match sub-int
+1234 with `dot1q 1234`. So I changed the behavior to explicitly reject sub-interfaces that are not exact-match
+in [[patchset 4](https://gerrit.fd.io/r/c/vpp/+/33481/3..4)]. Actually, it turns out that VPP upstream also
+crashes on setting an IP address on a sub-int that is not configured with `exact-match`, so I fixed that
+upstream in this [[gerrit](https://gerrit.fd.io/r/c/vpp/+/33444)] too.
+
+### Create interface: dot1ad
+
+While 802.1q _VLAN_ interfaces are by far the most used, there's a lesser known sibling called
+802.1ad -- the only difference is that VLAN ethernet frames with .1q use the well known 0x8100
+ethernet type (called a tag protocol identifier, or [TPID](https://en.wikipedia.org/wiki/IEEE_802.1Q)), while .1ad uses a
+lesser known 0x88a8 type. In the early days, the suggestion was for Q-in-Q to use the 0x88a8 TPID
+for the outer tag and 0x8100 for the inner tag, differentiating the two. But the industry
+was conflicted, and many vendors chose to use 0x8100 for both the inner and outer tag. VPP supports
+both variants, and so does Linux, so let's implement this in [[patchset 5](https://gerrit.fd.io/r/c/vpp/+/33481/4..5)].
+Without this change, the plugin would create the interface, but it would invariably create it as
+.1q on the Linux side, which explains why the `e0.1236` interface exists but doesn't ping in
+my _startingpoint_ above. Now we have the expected behavior:
+
+```
+DBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match
+DBGvpp# lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236
+
+pim@hippo:~/src/lcpng$ ping 10.0.4.2
+PING 10.0.4.2 (10.0.4.2) 56(84) bytes of data.
+64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=0.58 ms
+64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.57 ms
+64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.62 ms
+64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.67 ms
+^C
+--- 10.0.4.2 ping statistics ---
+4 packets transmitted, 4 received, 0% packet loss, time 3005ms
+rtt min/avg/max/mdev = 0.566/0.608/0.672/0.041 ms
+```
+
+### Create interface: dot1q in dot1ad
+
+This is the original Q-in-Q as it was intended. Frames here carry an outer ethernet TPID of 0x88a8
+(dot1ad) which is followed by an inner ethernet TPID of 0x8100 (dot1q). Of course, untagged inner
+frames are also possible - they show up as simply one ethernet TPID of dot1ad followed directly by
+the L3 payload. Here, things get a bit more tricky. On the VPP side, we can simply create the
+sub-interface directly; but on the Linux side, we cannot do that. 
This is because in VPP, +all sub-interfaces are directly parented by their physical interface, while in Linux, the +interfaces are stacked on one another. Compare: + +``` +### VPP idiomatic q-in-ad (1 interface) +DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match + +### Linux idiomatic q-in-ad stack (2 interfaces) +ip link add link e0 name e0.2345 type vlan id 2345 proto 802.1ad +ip link add link e0.2345 name e0.2345.1000 type vlan id 1000 proto 802.1q +``` + +So in order to create Q-in-AD sub-interfaces, for Linux their intermediary parent must exist, while +in VPP this is not necessary. I have to make a compromise, so I'll be a bit more explicit and allow +this type of _LIP_ to be created only under these conditions: +* A sub-int exists with the intermediary (in this case, `dot1ad 2345 exact-match`) +* That sub-int itself has a _LIP_, with a Linux interface device that we can spawn the inner interface off of + +If these conditions don't hold, I reject the request. If they do, I create an interface pair: +``` +DBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match +DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match +DBGvpp# lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236 +DBGvpp# lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237 + +pim@hippo:~/src/lcpng$ ip link show e0.1236 +375: e0.1236@e0: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 + link/ether 68:05:ca:32:46:14 brd ff:ff:ff:ff:ff:ff +pim@hippo:~/src/lcpng$ ip link show e0.1237 +376: e0.1237@e0.1236: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 + link/ether 68:05:ca:32:46:14 brd ff:ff:ff:ff:ff:ff +``` + +Here, `e0.1237` was indeed created as a child of `e0.1236`, which in turn was created as a child of +`e0`. + +The code for this is in [[patchset 6](https://gerrit.fd.io/r/c/vpp/+/33481/5..6)]. + +### Create interface: dot1q in dot1q + +Given the change above, this is an entirely obvious capability that the plugin now handles, but I did +find a failure mode, when I tried to create a _LIP_ for a sub-interface when there are no _LIPs_ created. +It causes a NULL deref when trying to look up the _LIP_ of the parent (which doesn't yet have a _LIP_ +defined). I fixed that in this [[patchset 7](https://gerrit.fd.io/r/c/vpp/+/33481/6..7)]. + +## Results + +After applying the configuration to VPP (in Appendix), here's the results: + +``` +pim@hippo:~/src/lcpng$ ip ro +default via 194.1.163.65 dev enp6s0 proto static +10.0.1.0/30 dev e0 proto kernel scope link src 10.0.1.1 +10.0.2.0/30 dev e0.1234 proto kernel scope link src 10.0.2.1 +10.0.3.0/30 dev e0.1235 proto kernel scope link src 10.0.3.1 +10.0.4.0/30 dev e0.1236 proto kernel scope link src 10.0.4.1 +10.0.5.0/30 dev e0.1237 proto kernel scope link src 10.0.5.1 +194.1.163.64/27 dev enp6s0 proto kernel scope link src 194.1.163.88 + +pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2 +10.0.1.2 is alive +10.0.2.2 is alive +10.0.3.2 is alive +10.0.4.2 is alive +10.0.5.2 is alive + +pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \ + 2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2 +2001:db8:0:1::2 is alive +2001:db8:0:2::2 is alive +2001:db8:0:3::2 is alive +2001:db8:0:4::2 is alive +2001:db8:0:5::2 is alive + +``` + +As can be seen, all interface types ping. Mirroring interfaces from VPP to Linux is now done! 
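+
+As an aside, a quick way to double-check the result on the Linux side is `ip -d link show`, which
+prints the VLAN protocol and id of a vlan device. That makes it easy to confirm that `e0.1236`
+really was created as 802.1ad (and `e0.1237` as a dot1q child of it), rather than silently falling
+back to plain .1q:
+
+```
+ip -d link show e0.1236   ## details should include something like: vlan protocol 802.1ad id 2345
+ip -d link show e0.1237   ## details should include something like: vlan protocol 802.1Q id 1000
+```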
+ +We still have to manually copy the configuration (like link states, MTU changes, IP addresses and +routes) from VPP into Linux, and of course it would be great if we could mirror those states also +into Linux, and this is exactly the topic of my next post. + + +## Credits + +I'd like to make clear that the Linux CP plugin is a great collaboration between several great folks +and that my work stands on their shoulders. I've had a little bit of help along the way from Neale +Ranns, Matthew Smith and Jon Loeliger, and I'd like to thank them for their work! + +## Appendix + +#### Ubuntu config +``` +# Untagged interface +ip addr add 10.0.1.2/30 dev enp66s0f0 +ip addr add 2001:db8:0:1::2/64 dev enp66s0f0 +ip link set enp66s0f0 up mtu 9000 + +# Single 802.1q tag 1234 +ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234 +ip link set enp66s0f0.q up mtu 9000 +ip addr add 10.0.2.2/30 dev enp66s0f0.q +ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q + +# Double 802.1q tag 1234 inner-tag 1000 +ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000 +ip link set enp66s0f0.qinq up mtu 9000 +ip addr add 10.0.3.3/30 dev enp66s0f0.qinq +ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq + +# Single 802.1ad tag 2345 +ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad +ip link set enp66s0f0.ad up mtu 9000 +ip addr add 10.0.4.2/30 dev enp66s0f0.ad +ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad + +# Double 802.1ad tag 2345 inner-tag 1000 +ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q +ip link set enp66s0f0.qinad up mtu 9000 +ip addr add 10.0.5.2/30 dev enp66s0f0.qinad +ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad +``` + +#### VPP config +``` +vppctl set interface state TenGigabitEthernet3/0/0 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0 +vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64 +vppctl lcp create TenGigabitEthernet3/0/0 host-if e0 +ip link set e0 up mtu 9000 +ip addr add 10.0.1.1/30 dev e0 +ip addr add 2001:db8:0:1::1/64 dev e0 + +vppctl create sub TenGigabitEthernet3/0/0 1234 +vppctl set interface state TenGigabitEthernet3/0/0.1234 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1234 +vppctl set interface ip address TenGigabitEthernet3/0/0.1234 10.0.2.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1234 2001:db8:0:2::1/64 +vppctl lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234 +ip link set e0.1234 up mtu 9000 +ip addr add 10.0.2.1/30 dev e0.1234 +ip addr add 2001:db8:0:2::1/64 dev e0.1234 + +vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match +vppctl set interface state TenGigabitEthernet3/0/0.1235 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1235 +vppctl set interface ip address TenGigabitEthernet3/0/0.1235 10.0.3.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1235 2001:db8:0:3::1/64 +vppctl lcp create TenGigabitEthernet3/0/0.1235 host-if e0.1235 +ip link set e0.1235 up mtu 9000 +ip addr add 10.0.3.1/30 dev e0.1235 +ip addr add 2001:db8:0:3::1/64 dev e0.1235 + +vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match +vppctl set interface state TenGigabitEthernet3/0/0.1236 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1236 +vppctl set interface ip address TenGigabitEthernet3/0/0.1236 10.0.4.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1236 
2001:db8:0:4::1/64 +vppctl lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236 +ip link set e0.1236 up mtu 9000 +ip addr add 10.0.4.1/30 dev e0.1236 +ip addr add 2001:db8:0:4::1/64 dev e0.1236 + +vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match +vppctl set interface state TenGigabitEthernet3/0/0.1237 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1237 +vppctl set interface ip address TenGigabitEthernet3/0/0.1237 10.0.5.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1237 2001:db8:0:5::1/64 +vppctl lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237 +ip link set e0.1237 up mtu 9000 +ip addr add 10.0.5.1/30 dev e0.1237 +ip addr add 2001:db8:0:5::1/64 dev e0.1237 +``` diff --git a/content/articles/2021-08-13-vpp-2.md b/content/articles/2021-08-13-vpp-2.md new file mode 100644 index 0000000..565eb3d --- /dev/null +++ b/content/articles/2021-08-13-vpp-2.md @@ -0,0 +1,345 @@ +--- +custom_css: +- /assets/vpp/asciinema-player.css +date: "2021-08-13T15:33:14Z" +title: VPP Linux CP - Part2 +--- + + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. One thing notably missing, is the higher level control plane, that is +to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a +VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network +devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols +like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet +forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use +VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the +interface state (links, addresses and routes) itself. When the plugin is completed, running software +like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving +>100Mpps and >100Gbps forwarding rates will be well in reach! + +In this second post, let's make the plugin a bit more useful by making it copy forward state changes +to interfaces in VPP, into their Linux CP counterparts. + +## My test setup + +I'm using the same setup from the [previous post]({% post_url 2021-08-12-vpp-1 %}). The goal of this +post is to show what code needed to be written and which changes needed to be made to the plugin, in +order to propagate changes to VPP interfaces to the Linux TAP devices. + +### Startingpoint + +The `linux-cp` plugin that ships with VPP 21.06, even with my [changes](https://gerrit.fd.io/r/c/vpp/+/33481) +is still _only_ able to create _LIP_ devices. 
It's not very user friendly to have to +apply state changes meticulously on both sides, but it can be done: + +``` +vppctl lcp create TenGigabitEthernet3/0/0 host-if e0 +vppctl set interface state TenGigabitEthernet3/0/0 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0 +vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64 +ip link set e0 up +ip link set e0 mtu 9000 +ip addr add 10.0.1.1/30 dev e0 +ip addr add 2001:db8:0:1::1/64 dev e0 +``` + +In this snippet, we can see that after creating the _LIP_, thus conjuring up the unconfigured +`e0` interface in Linux, I changed the VPP interface in three ways: +1. I set the state of the VPP interface to 'up' +1. I set the MTU of the VPP interface to 9000 +1. I add an IPv4 and IPv6 address to the interface + +Because state does not (yet) propagate, I have to make those changes as well on the Linux side +with the subsequent `ip` commands. + +### Configuration + +I can imagine that operators want to have more control and facilitate the Linux and VPP changes +themselves. This is why I'll start off by adding a variable called `lcp_sync`, along with a +startup configuration keyword and a CLI setter. This allows me to turn the whole sync behavior on +and off, for example in `startup.conf`: + +``` +linux-cp { + default netns dataplane + lcp-sync +} +``` + +And in the CLI: +``` +DBGvpp# show lcp +lcp default netns dataplane +lcp lcp-sync on + +DBGvpp# lcp lcp-sync off +DBGvpp# show lcp +lcp default netns dataplane +lcp lcp-sync off +``` + +The prep work for the rest of the interface syncer starts with this +[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and +for the rest of this blog post, the behavior will be in the 'on' position. + +### Change interface: state + +Immediately, I find a dissonance between VPP and Linux: When Linux sets a parent interface down, +all children go to state `M-DOWN`. When Linux sets a parent interface up, all of its children +automatically go to state `UP` and `LOWER_UP`. To illustrate: + +``` +ip link set enp66s0f1 down +ip link add link enp66s0f1 name foo type vlan id 1234 +ip link set foo down +## Both interfaces are down, which makes sense because I set them both down +ip link | grep enp66s0f1 +9: enp66s0f1: mtu 9000 qdisc mq state DOWN mode DEFAULT group default qlen 1000 +61: foo@enp66s0f1: mtu 9000 qdisc noop state DOWN mode DEFAULT group default qlen 1000 + +ip link set enp66s0f1 up +ip link | grep enp66s0f1 +## Both interfaces are up, which doesn't make sense because I only changed one of them! +9: enp66s0f1: mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000 +61: foo@enp66s0f1: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 +``` + +VPP does not work this way. In VPP, the admin state of each interface is individually +controllable, so it's possible to bring up the parent while leaving the sub-interface in +the state it was. I did notice that you can't bring up a sub-interface if its parent +is down, which I found counterintuitive, but that's neither here nor there. + +All of this is to say that we have to be careful when copying state forward, because as +this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)] +shows, issuing `set int state ... 
up` on an interface, won't touch its sub-interfaces in VPP, but +the subsequent netlink message to bring the _LIP_ for that interface up, **will** update the +children, thus desynchronising Linux and VPP: Linux will have interface **and all its +sub-interfaces** up unconditionally; VPP will have the interface up and its sub-interfaces in +whatever state they were before. + +To address this, a second +[[commit](https://github.com/pimvanpelt/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was +needed. I'm not too sure I want to keep this behavior, but for now, it results in an intuitive +end-state, which is that all interfaces states are exactly the same between Linux and VPP. + +``` +DBGvpp# create sub TenGigabitEthernet3/0/0 10 +DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0 +DBGvpp# lcp create TenGigabitEthernet3/0/0.10 host-if e0.10 +DBGvpp# set int state TenGigabitEthernet3/0/0 up +## Correct: parent is up, sub-int is not +694: e0: mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 +695: e0.10@e0: mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 + +DBGvpp# set int state TenGigabitEthernet3/0/0.10 up +## Correct: both interfaces up +694: e0: mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 +695: e0.10@e0: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 + +DBGvpp# set int state TenGigabitEthernet3/0/0 down +DBGvpp# set int state TenGigabitEthernet3/0/0.10 down +DBGvpp# set int state TenGigabitEthernet3/0/0 up +## Correct: only the parent is up +694: e0: mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 +695: e0.10@e0: mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 +``` + +### Change interface: MTU + +Finally, a straight forward +[[commit](https://github.com/pimvanpelt/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or +so I thought. When the MTU changes in VPP (with `set interface mtu packet N `), there is +callback that can be registered which copies this into the _LIP_. I did notice a specific corner +case: In VPP, a sub-interface can have a larger MTU than its parent. In Linux, this cannot happen, +so the following remains problematic: + +``` +DBGvpp# create sub TenGigabitEthernet3/0/0 10 +DBGvpp# set int mtu packet 1500 TenGigabitEthernet3/0/0 +DBGvpp# set int mtu packet 9000 TenGigabitEthernet3/0/0.10 +## Incorrect: sub-int has larger MTU than parent, valid in VPP, not in Linux +694: e0: mtu 1500 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 +695: e0.10@e0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 +``` + +I think the best way to ensure this works is to _clamp_ the sub-int to a maximum MTU of +that of its parent, and revert the user's request to change the VPP sub-int to anything +higher than that, perhaps logging an error explaining why. This means two things: +1. Any change in VPP of a child MTU to larger than its parent, must be reverted. +1. Any change in VPP of a parent MTU should ensure all children are clamped to at most that. + +I addressed the issue in this +[[commit](https://github.com/pimvanpelt/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)]. + +### Change interface: IP Addresses + +There are three scenarios in which IP addresses will need to be copied from +VPP into the companion Linux devices: + +1. `set interface ip address` adds an IPv4 or IPv6 address. 
This is handled by + `lcp_itf_ip[46]_add_del_interface_addr()` which is a callback installed in + `lcp_itf_pair_init()` at plugin initialization time. +1. `set interface ip address del` removes addresses. This is also handled by + `lcp_itf_ip[46]_add_del_interface_addr()` but curiously there is no + upstream `vnet_netlink_del_ip[46]_addr()` so I had to write them inline here. + I will try to get them upstreamed, as they appear to be obvious companions + in `vnet/device/netlink.h`. +1. This one is easy to overlook, but upon _LIP_ creation, it could be that there + are already L3 addresses present on the VPP interface. If so, set them in the + _LIP_ with `lcp_itf_set_interface_addr()`. + +This means with this +[[commit](https://github.com/pimvanpelt/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at +any time a new _LIP_ is created, the IPv4 and IPv6 address on the VPP interface are fully copied +over by the third change, while at runtime, new addresses can be set/removed as well by the first +and second change. + +### Further work + +I noticed that [Bird](https://bird.network.cz/) periodically scans the Linux +interface list and (re)learns information from them. I have a suspicion that +such a feature might be useful in the VPP plugin as well: I can imagine a +periodical process that walks over the _LIP_ interface list, and compares +what it finds in Linux with what is configured in VPP. What's not entirely +clear to me is which direction should 'trump', that is, should the Linux +state be forced into VPP, or should the VPP state be forced into Linux? I +don't yet have a good feeling of the answer, so I'll punt on that for now. + +## Results + +After applying the configuration to VPP (in Appendix), here's the results: + +``` +pim@hippo:~/src/lcpng$ ip ro +default via 194.1.163.65 dev enp6s0 proto static +10.0.1.0/30 dev e0 proto kernel scope link src 10.0.1.1 +10.0.2.0/30 dev e0.1234 proto kernel scope link src 10.0.2.1 +10.0.3.0/30 dev e0.1235 proto kernel scope link src 10.0.3.1 +10.0.4.0/30 dev e0.1236 proto kernel scope link src 10.0.4.1 +10.0.5.0/30 dev e0.1237 proto kernel scope link src 10.0.5.1 +194.1.163.64/27 dev enp6s0 proto kernel scope link src 194.1.163.88 + +pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2 +10.0.1.2 is alive +10.0.2.2 is alive +10.0.3.2 is alive +10.0.4.2 is alive +10.0.5.2 is alive + +pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \ + 2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2 +2001:db8:0:1::2 is alive +2001:db8:0:2::2 is alive +2001:db8:0:3::2 is alive +2001:db8:0:4::2 is alive +2001:db8:0:5::2 is alive + +``` + +In case you were wondering: my previous post ended in the same huzzah moment. It did. + +The difference is that now the VPP configuration is _much shorter_! Comparing +the Appendix from this post with my [first post]({% post_url 2021-08-12-vpp-1 %}), after +all of this work I no longer have to manually copy the configuration (like link states, +MTU changes, IP addresses) from VPP into Linux, instead the plugin does all of this work +for me, and I can configure both sides entirely with `vppctl` commands! + +### Bonus screencast! + +Humor me as I [take the code out](https://asciinema.org/a/430411) for a 5 minute spin :-) + + + + +## Credits + +I'd like to make clear that the Linux CP plugin is a great collaboration between several great folks +and that my work stands on their shoulders. 
I've had a little bit of help along the way from Neale +Ranns, Matthew Smith and Jon Loeliger, and I'd like to thank them for their work! + +## Appendix + +#### Ubuntu config +``` +# Untagged interface +ip addr add 10.0.1.2/30 dev enp66s0f0 +ip addr add 2001:db8:0:1::2/64 dev enp66s0f0 +ip link set enp66s0f0 up mtu 9000 + +# Single 802.1q tag 1234 +ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234 +ip link set enp66s0f0.q up mtu 9000 +ip addr add 10.0.2.2/30 dev enp66s0f0.q +ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q + +# Double 802.1q tag 1234 inner-tag 1000 +ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000 +ip link set enp66s0f0.qinq up mtu 9000 +ip addr add 10.0.3.3/30 dev enp66s0f0.qinq +ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq + +# Single 802.1ad tag 2345 +ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad +ip link set enp66s0f0.ad up mtu 9000 +ip addr add 10.0.4.2/30 dev enp66s0f0.ad +ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad + +# Double 802.1ad tag 2345 inner-tag 1000 +ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q +ip link set enp66s0f0.qinad up mtu 9000 +ip addr add 10.0.5.2/30 dev enp66s0f0.qinad +ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad +``` + +#### VPP config +``` +## Look mom, no `ip` commands!! :-) +vppctl set interface state TenGigabitEthernet3/0/0 up +vppctl lcp create TenGigabitEthernet3/0/0 host-if e0 +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0 +vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64 + +vppctl create sub TenGigabitEthernet3/0/0 1234 +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1234 +vppctl lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234 +vppctl set interface state TenGigabitEthernet3/0/0.1234 up +vppctl set interface ip address TenGigabitEthernet3/0/0.1234 10.0.2.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1234 2001:db8:0:2::1/64 + +vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match +vppctl set interface state TenGigabitEthernet3/0/0.1235 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1235 +vppctl lcp create TenGigabitEthernet3/0/0.1235 host-if e0.1235 +vppctl set interface ip address TenGigabitEthernet3/0/0.1235 10.0.3.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1235 2001:db8:0:3::1/64 + +vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match +vppctl set interface state TenGigabitEthernet3/0/0.1236 up +vppctl lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236 +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1236 +vppctl set interface ip address TenGigabitEthernet3/0/0.1236 10.0.4.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1236 2001:db8:0:4::1/64 + +vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match +vppctl set interface state TenGigabitEthernet3/0/0.1237 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1237 +vppctl set interface ip address TenGigabitEthernet3/0/0.1237 10.0.5.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1237 2001:db8:0:5::1/64 +vppctl lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237 +``` + +#### Final note + +You may have noticed that the [commit] links are all git commits in my private working copy. 
I want +to wait until my [previous work](https://gerrit.fd.io/r/c/vpp/+/33481) is reviewed and submitted +before piling on more changes. Feel free to contact vpp-dev@ for more information in the mean time +:-) diff --git a/content/articles/2021-08-15-vpp-3.md b/content/articles/2021-08-15-vpp-3.md new file mode 100644 index 0000000..76cfd3b --- /dev/null +++ b/content/articles/2021-08-15-vpp-3.md @@ -0,0 +1,414 @@ +--- +date: "2021-08-15T11:13:14Z" +title: VPP Linux CP - Part3 +--- + + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. One thing notably missing, is the higher level control plane, that is +to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a +VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network +devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols +like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet +forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use +VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the +interface state (links, addresses and routes) itself. When the plugin is completed, running software +like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving +>100Mpps and >100Gbps forwarding rates will be well in reach! + +In this third post, I'll be adding a convenience feature that I think will be popular: the plugin +will now automatically create or delete _LIPs_ for sub-interfaces where-ever the parent has a _LIP_ +configured. + +## My test setup + +I've extended the setup from the [first post]({% post_url 2021-08-12-vpp-1 %}). The base +configuration for the `enp66s0f0` interface remains exactly the same, but I've also added +an LACP `bond0` interface, which also has the whole kitten kaboodle of sub-interfaces defined, see +below in the Appendix for details, but here's the table again for reference: + +| Name | type | Addresses +|-----------------|------|---------- +| enp66s0f0 | untagged | 10.0.1.2/30 2001:db8:0:1::2/64 +| enp66s0f0.q | dot1q 1234 | 10.0.2.2/30 2001:db8:0:2::2/64 +| enp66s0f0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.0.3.2/30 2001:db8:0:3::2/64 +| enp66s0f0.ad | dot1ad 2345 | 10.0.4.2/30 2001:db8:0:4::2/64 +| enp66s0f0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.0.5.2/30 2001:db8:0:5::2/64 +| bond0 | untagged | 10.1.1.2/30 2001:db8:1:1::2/64 +| bond0.q | dot1q 1234 | 10.1.2.2/30 2001:db8:1:2::2/64 +| bond0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.1.3.2/30 2001:db8:1:3::2/64 +| bond0.ad | dot1ad 2345 | 10.1.4.2/30 2001:db8:1:4::2/64 +| bond0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.1.5.2/30 2001:db8:1:5::2/64 + +The goal of this post is to show what code needed to be written and which changes needed to be +made to the plugin, in order to automatically create and delete sub-interfaces. 
+ +### Startingpoint + +Based on the state of the plugin after the [second post]({% post_url 2021-08-13-vpp-2 %}), +operators must create _LIP_ instances for interfaces as well as each sub-interface +explicitly: + +``` +DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0 +DBGvpp# create sub TenGigabitEthernet3/0/0 1234 +DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234 +DBGvpp# create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match +DBGvpp# lcp create TenGigabitEthernet3/0/0.1235 host-if e0.1235 +DBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match +DBGvpp# lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236 +DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match +DBGvpp# lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237 +``` + +But one might ask -- is it really useful to have L3 interfaces in VPP without a companion interface +in an appropriate Linux namespace? I think the answer might be 'yes' for individual interfaces +(for example, in a mgmt VRF that has no need to run routing protocols), but I also think the answer +is probably 'no' for sub-interfaces, once their parent has a _LIP_ defined. + +### Configuration + +The original plugin (the one that ships with VPP 21.06) has a configuration flag that seems +promising by defining a flag `interface-auto-create`, but its implementation was never finished. +I've removed that flag and replaced it with a new one. The main reason for this decision is +that there are actually two kinds of auto configuration: the first one is detailed in this post, +but in the future, I will also make it possible to create VPP interfaces by creating their Linux +counterpart (eg. `ip link add link e0 name e0.1234 type vlan id 1234` with a configuration statement +that might be called `netlink-auto-subint`), and I'd like for the plugin to individually +enable/disable both types. Also, I find the name unfortunate, as the feature should create +_and delete LIPs_ on sub-interfaces, not just create them. So out with the old, in with the new :) + +I have to acknowledge that not everybody will want automagically created interfaces, similar to +the original configuration, so I define a new configuration flag called `lcp-auto-subint` which goes +into the `linux-cp` module configuration stanza in VPP's `startup.conf`, which might look a little +like this: + +``` +linux-cp { + default netns dataplane + lcp-auto-subint +} +``` + +Based on this config, I set the startup default in `lcp_set_lcp_auto_subint()`, but I realize that +an administrator may want to turn it on/off at runtime, too, so I add a CLI getter/setter that +interacts with the flag in this [[commit](https://github.com/pimvanpelt/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]: + +``` +DBGvpp# show lcp +lcp default netns dataplane +lcp lcp-auto-subint on +lcp lcp-sync off + +DBGvpp# lcp lcp-auto-subint off +DBGvpp# show lcp +lcp default netns dataplane +lcp lcp-auto-subint off +lcp lcp-sync off +``` + +The prep work for the rest of the interface syncer starts with this +[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and +for the rest of this blog post, the behavior will be in the 'on' position. + +The code for the configuration toggle is in this +[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)]. 
+ +### Auto create/delete sub-interfaces + +The original plugin code (that ships with VPP 21.06) made a start by defining a function called +`lcp_itf_phy_add()` and registering an intent with `VNET_SW_INTERFACE_ADD_DEL_FUNCTION()`. I've +moved the function to the source file I created in [Part 2]({% post_url 2021-08-13-vpp-2 %}) +(called `lcp_if_sync.c`), specifically to handle interface syncing, and gave it a name that +matches the VPP callback, so `lcp_itf_interface_add_del()`. + +The logic of that function is pretty straight forward. I want to only continue if `lcp-auto-subint` +is set, and I only want to create or delete sub-interfaces, not parents. This way, the operator +can decide on a per-interface basis if they want it to participate in Linux (eg, issuing +`lcp create BondEthernet0 host-if be0`). After I've established that (a) the caller wants +auto-creation/auto-deletion, and (b) we're fielding a callback for a sub-int, all I must do is: +* On creation: does the parent interface `sw->sup_sw_if_index` have a `LIP`? If yes, let's + create a `LIP` for this sub-interface, too. We determine that Linux interface name by taking + the parent name (say, `be0`), and sticking the sub-int number after it, like `be0.1234`. +* On deletion: does this sub-interface we're fielding the callback for have a `LIP`? If yes, + then delete it. + +I noticed that interface deletion had a bug (one that I fell victim to as well: it does not +remove the netlink device in the correct network namespace), which I fixed. + +The code for the auto create/delete and the bugfix is in this +[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)]. + +### Further Work + +One other thing I noticed (and this is actually a bug!) is that on `BondEthernet` +interfaces, upon creation a temporary MAC is assigned, which is subsequently +overwritten by the first physical interface that is added to the bond, which means +that when a _LIP_ is created _before_ the first interface is added, its MAC +will be the temporary MAC. Compare: + +``` +vppctl create bond mode lacp load-balance l2 +vppctl lcp create BondEthernet0 host-if be0 +## MAC of be0 is now a temp/ephemeral MAC + +vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2 +vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3 +## MAC of the BondEthernet0 device is now that of TenGigabitEthernet3/0/2 +## MAC of TenGigabitEthernet3/0/3 is that of BondEthernet0 (ie. Te3/0/2) +``` + +In such a situation, `be0` will not be reachable unless it's manually set to the correct MAC. +I looked around but found no callback of event handler for MAC address changes in VPP -- so I +should add one probably, but in the mean time, I'll just add interfaces to the bond before +creating the _LIP_, like so: + +``` +vppctl create bond mode lacp load-balance l2 +vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2 +## MAC of the BondEthernet0 device is now that of TenGigabitEthernet3/0/2 + +vppctl lcp create BondEthernet0 host-if be0 +## MAC of be0 is now that of BondEthernet0 + +vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3 +## MAC of TenGigabitEthernet3/0/3 is that of BondEthernet0 (ie. Te3/0/2) +``` + +.. which is an adequate workaround for now. + +## Results + +After this code is in, the operator will only have to create a LIP for the main interfaces, and +the plugin will take care of the rest! 
+ +``` +pim@hippo:~/src/lcpng$ grep 'create' config3.sh +vppctl lcp lcp-auto-subint on +vppctl lcp create TenGigabitEthernet3/0/0 host-if e0 +vppctl create sub TenGigabitEthernet3/0/0 1234 +vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match +vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match +vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match + +vppctl create bond mode lacp load-balance l2 +vppctl lcp create BondEthernet0 host-if be0 +vppctl create sub BondEthernet0 1234 +vppctl create sub BondEthernet0 1235 dot1q 1234 inner-dot1q 1000 exact-match +vppctl create sub BondEthernet0 1236 dot1ad 2345 exact-match +vppctl create sub BondEthernet0 1237 dot1ad 2345 inner-dot1q 1000 exact-match +``` + +And as an end-to-end functional validation, now extended as well to ping the Ubuntu machine over +the LACP interface and all of its subinterfaces, works like a charm: + +``` +pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip link | grep e0 +1063: e0: mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 +1064: be0: mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 +209: e0.1234@e0: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 +210: e0.1235@e0.1234: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 +211: e0.1236@e0: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 +212: e0.1237@e0.1236: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 +213: be0.1234@be0: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 +214: be0.1235@be0.1234: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 +215: be0.1236@be0: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 +216: be0.1237@be0.1236: mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 + +# The TenGigabitEthernet3/0/0 (e0) interfaces +pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2 +10.0.1.2 is alive +10.0.2.2 is alive +10.0.3.2 is alive +10.0.4.2 is alive +10.0.5.2 is alive + +pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \ + 2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2 +2001:db8:0:1::2 is alive +2001:db8:0:2::2 is alive +2001:db8:0:3::2 is alive +2001:db8:0:4::2 is alive +2001:db8:0:5::2 is alive + +## The BondEthernet0 (be0) interfaces +pim@hippo:~/src/lcpng$ fping 10.1.1.2 10.1.2.2 10.1.3.2 10.1.4.2 10.1.5.2 +10.1.1.2 is alive +10.1.2.2 is alive +10.1.3.2 is alive +10.1.4.2 is alive +10.1.5.2 is alive + +pim@hippo:~/src/lcpng$ fping6 2001:db8:1:1::2 2001:db8:1:2::2 \ + 2001:db8:1:3::2 2001:db8:1:4::2 2001:db8:1:5::2 +2001:db8:1:1::2 is alive +2001:db8:1:2::2 is alive +2001:db8:1:3::2 is alive +2001:db8:1:4::2 is alive +2001:db8:1:5::2 is alive +``` + + +## Credits + +I'd like to make clear that the Linux CP plugin is a great collaboration between several great folks +and that my work stands on their shoulders. I've had a little bit of help along the way from Neale +Ranns, Matthew Smith and Jon Loeliger, and I'd like to thank them for their work! 
+ +## Appendix + +#### Ubuntu config +``` +# Untagged interface +ip addr add 10.0.1.2/30 dev enp66s0f0 +ip addr add 2001:db8:0:1::2/64 dev enp66s0f0 +ip link set enp66s0f0 up mtu 9000 + +# Single 802.1q tag 1234 +ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234 +ip link set enp66s0f0.q up mtu 9000 +ip addr add 10.0.2.2/30 dev enp66s0f0.q +ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q + +# Double 802.1q tag 1234 inner-tag 1000 +ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000 +ip link set enp66s0f0.qinq up mtu 9000 +ip addr add 10.0.3.2/30 dev enp66s0f0.qinq +ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq + +# Single 802.1ad tag 2345 +ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad +ip link set enp66s0f0.ad up mtu 9000 +ip addr add 10.0.4.2/30 dev enp66s0f0.ad +ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad + +# Double 802.1ad tag 2345 inner-tag 1000 +ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q +ip link set enp66s0f0.qinad up mtu 9000 +ip addr add 10.0.5.2/30 dev enp66s0f0.qinad +ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad + +## Bond interface +ip link add bond0 type bond mode 802.3ad +ip link set enp66s0f2 down +ip link set enp66s0f3 down +ip link set enp66s0f2 master bond0 +ip link set enp66s0f3 master bond0 +ip link set enp66s0f2 up +ip link set enp66s0f3 up +ip link set bond0 up + +ip addr add 10.1.1.2/30 dev bond0 +ip addr add 2001:db8:1:1::2/64 dev bond0 +ip link set bond0 up mtu 9000 + +# Single 802.1q tag 1234 +ip link add link bond0 name bond0.q type vlan id 1234 +ip link set bond0.q up mtu 9000 +ip addr add 10.1.2.2/30 dev bond0.q +ip addr add 2001:db8:1:2::2/64 dev bond0.q + +# Double 802.1q tag 1234 inner-tag 1000 +ip link add link bond0.q name bond0.qinq type vlan id 1000 +ip link set bond0.qinq up mtu 9000 +ip addr add 10.1.3.2/30 dev bond0.qinq +ip addr add 2001:db8:1:3::2/64 dev bond0.qinq + +# Single 802.1ad tag 2345 +ip link add link bond0 name bond0.ad type vlan id 2345 proto 802.1ad +ip link set bond0.ad up mtu 9000 +ip addr add 10.1.4.2/30 dev bond0.ad +ip addr add 2001:db8:1:4::2/64 dev bond0.ad + +# Double 802.1ad tag 2345 inner-tag 1000 +ip link add link bond0.ad name bond0.qinad type vlan id 1000 proto 802.1q +ip link set bond0.qinad up mtu 9000 +ip addr add 10.1.5.2/30 dev bond0.qinad +ip addr add 2001:db8:1:5::2/64 dev bond0.qinad +``` + +#### VPP config +``` +## No more `lcp create` commands for sub-interfaces. 
+vppctl lcp default netns dataplane +vppctl lcp lcp-auto-subint on + +vppctl lcp create TenGigabitEthernet3/0/0 host-if e0 +vppctl set interface state TenGigabitEthernet3/0/0 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0 +vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64 + +vppctl create sub TenGigabitEthernet3/0/0 1234 +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1234 +vppctl set interface state TenGigabitEthernet3/0/0.1234 up +vppctl set interface ip address TenGigabitEthernet3/0/0.1234 10.0.2.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1234 2001:db8:0:2::1/64 + +vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match +vppctl set interface state TenGigabitEthernet3/0/0.1235 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1235 +vppctl set interface ip address TenGigabitEthernet3/0/0.1235 10.0.3.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1235 2001:db8:0:3::1/64 + +vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match +vppctl set interface state TenGigabitEthernet3/0/0.1236 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1236 +vppctl set interface ip address TenGigabitEthernet3/0/0.1236 10.0.4.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1236 2001:db8:0:4::1/64 + +vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match +vppctl set interface state TenGigabitEthernet3/0/0.1237 up +vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1237 +vppctl set interface ip address TenGigabitEthernet3/0/0.1237 10.0.5.1/30 +vppctl set interface ip address TenGigabitEthernet3/0/0.1237 2001:db8:0:5::1/64 + +## The LACP bond +vppctl create bond mode lacp load-balance l2 +vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2 +vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3 +vppctl lcp create BondEthernet0 host-if be0 +vppctl set interface state TenGigabitEthernet3/0/2 up +vppctl set interface state TenGigabitEthernet3/0/3 up +vppctl set interface state BondEthernet0 up +vppctl set interface mtu packet 9000 BondEthernet0 +vppctl set interface ip address BondEthernet0 10.1.1.1/30 +vppctl set interface ip address BondEthernet0 2001:db8:1:1::1/64 + +vppctl create sub BondEthernet0 1234 +vppctl set interface mtu packet 9000 BondEthernet0.1234 +vppctl set interface state BondEthernet0.1234 up +vppctl set interface ip address BondEthernet0.1234 10.1.2.1/30 +vppctl set interface ip address BondEthernet0.1234 2001:db8:1:2::1/64 + +vppctl create sub BondEthernet0 1235 dot1q 1234 inner-dot1q 1000 exact-match +vppctl set interface state BondEthernet0.1235 up +vppctl set interface mtu packet 9000 BondEthernet0.1235 +vppctl set interface ip address BondEthernet0.1235 10.1.3.1/30 +vppctl set interface ip address BondEthernet0.1235 2001:db8:1:3::1/64 + +vppctl create sub BondEthernet0 1236 dot1ad 2345 exact-match +vppctl set interface state BondEthernet0.1236 up +vppctl set interface mtu packet 9000 BondEthernet0.1236 +vppctl set interface ip address BondEthernet0.1236 10.1.4.1/30 +vppctl set interface ip address BondEthernet0.1236 2001:db8:1:4::1/64 + +vppctl create sub BondEthernet0 1237 dot1ad 2345 inner-dot1q 1000 exact-match +vppctl set interface state BondEthernet0.1237 up +vppctl set interface mtu packet 9000 BondEthernet0.1237 +vppctl set interface ip address BondEthernet0.1237 10.1.5.1/30 +vppctl set interface 
ip address BondEthernet0.1237 2001:db8:1:5::1/64 +``` + +#### Final note + +You may have noticed that the [commit] links are all git commits in my private working copy. +I want to wait until my [previous work](https://gerrit.fd.io/r/c/vpp/+/33481) is reviewed and +submitted before piling on more changes. Feel free to contact vpp-dev@ for more information in the +mean time :-) diff --git a/content/articles/2021-08-25-vpp-4.md b/content/articles/2021-08-25-vpp-4.md new file mode 100644 index 0000000..86d664d --- /dev/null +++ b/content/articles/2021-08-25-vpp-4.md @@ -0,0 +1,520 @@ +--- +date: "2021-08-25T08:55:14Z" +title: VPP Linux CP - Part4 +--- + + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. One thing notably missing, is the higher level control plane, that is +to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a +VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network +devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols +like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet +forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use +VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the +interface state (links, addresses and routes) itself. When the plugin is completed, running software +like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving +>100Mpps and >100Gbps forwarding rates will be well in reach! + +In the first three posts, I added the ability for VPP to synchronize its state (like link state, +MTU, and interface addresses) into Linux. In this post, I'll make a start on the other direction: +allowing changes to interfaces made in Linux to make their way back into VPP! + +## My test setup + +I'm keeping the setup from the [third post]({% post_url 2021-08-15-vpp-3 %}). A Linux machine has an +interface `enp66s0f0` which has 4 sub-interfaces (one dot1q tagged, one q-in-q, one dot1ad tagged, +and one q-in-ad), giving me five flavors in total. 
Then, I created an LACP `bond0` interface, which +also has the whole kit and caboodle of sub-interfaces defined, see below in the Appendix for details, +but here's the table again for reference: + +| Name | type | Addresses +|-----------------|------|---------- +| enp66s0f0 | untagged | 10.0.1.2/30 2001:db8:0:1::2/64 +| enp66s0f0.q | dot1q 1234 | 10.0.2.2/30 2001:db8:0:2::2/64 +| enp66s0f0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.0.3.2/30 2001:db8:0:3::2/64 +| enp66s0f0.ad | dot1ad 2345 | 10.0.4.2/30 2001:db8:0:4::2/64 +| enp66s0f0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.0.5.2/30 2001:db8:0:5::2/64 +| bond0 | untagged | 10.1.1.2/30 2001:db8:1:1::2/64 +| bond0.q | dot1q 1234 | 10.1.2.2/30 2001:db8:1:2::2/64 +| bond0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.1.3.2/30 2001:db8:1:3::2/64 +| bond0.ad | dot1ad 2345 | 10.1.4.2/30 2001:db8:1:4::2/64 +| bond0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.1.5.2/30 2001:db8:1:5::2/64 + +The goal of this post is to show what code needed to be written and introduces an entirely _new +plugin_, so that we can separate concerns (and have a higher chance of community acceptance +of the plugins). In the first plugin, now called the **Interface Mirror**, I have previously +implemented the VPP-to-Linux synchronization. In this new plugin (called the **Netlink Listener**) +I implement the Linux-to-VPP synchronization using, _quelle surprise_, Netlink message handlers. + +### Startingpoint + +Based on the state of the plugin after the [third post]({% post_url 2021-08-15-vpp-3 %}), +operators can enable `lcp-sync` (which copies changes made in VPP into their Linux counterpart) +and `lcp-auto-subint` (which extends sub-interface creation in VPP to automatically create a +Linux Interface Pair, or _LIP_, and its companion Linux network interface): + +``` +DBGvpp# lcp lcp-sync on +DBGvpp# lcp lcp-auto-subint on +DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0 +DBGvpp# create sub TenGigabitEthernet3/0/0 1234 +DBGvpp# create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match +DBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match +DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match + +pim@hippo:~/src/lcpng$ ip link | grep e0 +1286: e0.1234@e0: mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 +1287: e0.1235@e0.1234: mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 +1288: e0.1236@e0: mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 +1289: e0.1237@e0.1236: mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 +1701: e0: mtu 9050 qdisc mq state DOWN mode DEFAULT group default qlen 1000 +``` + +The vision for this plugin has been that Linux can drive most control-plane operations, such as +creating sub-interfaces, adding/removing addresses, changing MTU on links, etc. We can do that by +listening to [Netlink](https://en.wikipedia.org/wiki/Netlink) messages, which were designed for +transferring miscellaneous networking information between the kernel space and userspace processes +(like `VPP`). Networking utilities, such as the _iproute2_ family and its command line utilities +(like `ip`) use Netlink to communicate with the Linux kernel from userspace. + +## Netlink Listener + +The first task at hand is to install a Netlink listener. In this new plugin, I first register +`lcp_nl_init()` which adds Linux interface pair (_LIP_) add/del callbacks from the first plugin. 
+
+This way, I'm made aware of new _LIPs_ as they are created.
+
+In `lcp_nl_pair_add_cb()`, I will initiate a Netlink listener for the first interface that gets created,
+noting its netns. If subsequent adds are in other netns, I'll just issue a warning. And I will keep
+a refcount so I know how many _LIPs_ are bound to this listener.
+
+In `lcp_nl_pair_del_cb()`, I can remove the listener when the last interface pair is removed.
+
+Then for the listening itself, a Netlink socket is opened, and because Linux can be quite chatty on
+Netlink sockets, I'll raise its read/write buffers to something quite large (typically 64M read
+and 16K write size). One note on this size: it needs a few sysctl settings to be in place before VPP
+starts, typically done as follows:
+
+```
+pim@hippo:~/src/vpp$ cat << EOF | sudo tee /etc/sysctl.d/81-vpp-Netlink.conf
+# Increase Netlink to 64M
+net.core.rmem_default=67108864
+net.core.wmem_default=67108864
+net.core.rmem_max=67108864
+net.core.wmem_max=67108864
+EOF
+pim@hippo:~/src/vpp$ sudo sysctl -p
+```
+
+After creating the Netlink socket, I add its file descriptor to VPP's built-in file handler, which
+will see to polling it. On the file handler, I install `lcp_nl_read_cb()` and `lcp_nl_error_cb()`
+callbacks which will be invoked when anything interesting happens on the socket.
+
+A bit of explanation on why I'd use a queue rather than just consuming the Netlink messages directly
+as they are offered: I _have to_ use a queue for the common case in which VPP is running single threaded.
+If I were to consume a block of potentially a million route del/adds in one go (say, if BGP is
+reconverging), I would block VPP from reading new packets from DPDK, but more importantly, from reading
+new Netlink messages from the kernel. That would fill up the 64M socket buffer and overflow it, losing
+Netlink messages, which is bad because it requires an end to end resync of the Linux namespace into the
+VPP dataplane (something called an `NLM_F_DUMP`, but that's a story for another day).
+
+So I process only a batch of messages and only for a maximum amount of time per batch. If there are still
+some messages left in the queue, I'll just reschedule consumption after M milliseconds. This allows new
+Netlink messages to continuously be read from the kernel by VPP's file handler, even if there's a lot of
+work to do.
+
+* `lcp_nl_read_cb()` calls `lcp_nl_callback()` which pushes Netlink messages onto a queue and
+  issues an `NL_EVENT_READ` event; any socket read error issues an `NL_EVENT_READ_ERR` event.
+* `lcp_nl_error_cb()` simply issues an `NL_EVENT_READ_ERR` event and moves on with life.
+
+To capture these events, I initialize a process node called `lcp_nl_process()`, which handles:
+* `NL_EVENT_READ` by calling `lcp_nl_process_msgs()` and processing a batch of messages (either
+  a maximum count, or a maximum duration, whichever is reached first).
+* `NL_EVENT_READ_ERR` is the other event that can happen, in case VPP's file handler or my own
+  `lcp_nl_read_cb()` encounter a read error. All it does is close and reopen the Netlink socket
+  in the same network namespace we were in before, in an attempt to minimize the damage, _dazed and
+  confused, but trying to continue_.
+
+Alright, so at this point, I have a producer queue that gets added to by the Netlink reader
+machinery, so all I have to do is consume the messages. `lcp_nl_process_msgs()` processes up to N
+messages and/or for up to M msecs, whichever comes first, and for each individual Netlink message, it
+will call `lcp_nl_dispatch()` to handle messages of a given type.
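+
+To make the batching idea a bit more concrete, here's a small self-contained sketch of the pattern in
+plain C -- a toy queue and made-up constants, not the actual plugin code (which uses VPP's process node
+and timer machinery for the rescheduling):
+
+```
+#include <stdio.h>
+#include <time.h>
+
+#define BATCH_MAX_MSGS  250    /* made-up cap on messages per batch   */
+#define BATCH_MAX_MSECS 10.0   /* made-up cap on time spent per batch */
+
+typedef struct { int seq; } nl_msg_t;   /* stand-in for a Netlink message */
+
+#define QUEUE_CAP 1000
+static nl_msg_t queue[QUEUE_CAP];
+static int q_head, q_tail;
+
+static nl_msg_t *queue_pop(void) {
+  return (q_head == q_tail) ? NULL : &queue[q_head++];
+}
+
+static void dispatch(nl_msg_t *m) {
+  (void)m;   /* the real plugin would switch on the message type here */
+}
+
+static double elapsed_msecs(const struct timespec *start) {
+  struct timespec now;
+  clock_gettime(CLOCK_MONOTONIC, &now);
+  return (now.tv_sec - start->tv_sec) * 1e3 +
+         (now.tv_nsec - start->tv_nsec) / 1e6;
+}
+
+/* Drain at most BATCH_MAX_MSGS messages or BATCH_MAX_MSECS of work,
+ * whichever comes first. Returns non-zero if messages are left over,
+ * in which case the caller should schedule another run soon instead
+ * of looping here and starving the main loop. */
+static int process_msgs(void) {
+  struct timespec start;
+  int n = 0;
+  nl_msg_t *m;
+
+  clock_gettime(CLOCK_MONOTONIC, &start);
+  while ((m = queue_pop()) != NULL) {
+    dispatch(m);
+    if (++n >= BATCH_MAX_MSGS || elapsed_msecs(&start) > BATCH_MAX_MSECS)
+      break;
+  }
+  printf("processed %d message(s), %d left\n", n, q_tail - q_head);
+  return q_head != q_tail;
+}
+
+int main(void) {
+  for (int i = 0; i < 600; i++)       /* pretend the reader queued a burst */
+    queue[q_tail++] = (nl_msg_t){ .seq = i };
+
+  while (process_msgs())
+    ;   /* a real main loop would poll the socket / sleep here instead */
+  return 0;
+}
+```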
+ +For now, `lcp_nl_dispatch()` just throws the message away after logging it with `format_nl_object()`, +a function that will come in very useful as I start to explore all the different Netlink message types. + +The code that forms the basis of our Netlink Listener lives in [[this +commit](https://github.com/pimvanpelt/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and +specifically, here I want to call out I was not the primary author, I worked off of Matt and Neale's +awesome work in this pending [Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122). + +### Netlink: Neighbor + +ARP and IPv6 Neighbor Discovery will trigger a set of Netlink messages, which are of type +`RTM_NEWNEIGH` and `RTM_DELNEIGH` + +First, I'll add a new source file `lcpng_nl_sync.c` that will house these handler functions. +Their purpose is to take state learned from Netlink messages, and apply that state to VPP. + +Then, I add `lcp_nl_neigh_add()` and `lcp_nl_neigh_del()` which implement the following +pattern: Most Netlink messages are somehow about a `link`, which is identified by an +interface index (`ifindex` or just idx for short). That's the same interface index I stored +when I created the _LIP_, calling it `vif_index` because in VPP, it describes a `virtio` +device which implements the IO for the TAP. + +If I'm handling a message for link with a given ifindex, I can correlate it with a _LIP_. Not all +messages will be related to something VPP knows or cares about, I'll discuss that more later when +I discuss `RTM_NEWLINK` messages. + +If there is no _LIP_ associated with the `ifindex`, then clearly this message is about a +Linux interface VPP is not aware of. But, if I can find the _LIP_, I can convert the lladdr +(MAC address) and IP address from the Netlink message into their VPP variants, and then simply +add or remove the ip4/ip6 neighbor adjacency. + +The code for this first Netlink message handler lives in this +[[commit](https://github.com/pimvanpelt/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An +ironic insight is that after writing the code, I don't think any of it will be necessary, because +the interface plugin will already copy ARP and IPv6 ND packets back and forth and itself update its +neighbor adjacency tables; but I'm leaving the code in for now. + +### Netlink: Address + +A decidedly more interesting message is `RTM_NEWADDR` and its deletion companion `RTM_DELADDR`. + +It's pretty straight forward to add and remove IPv4 and IPv6 addresses on interfaces. I have +to convert the Netlink representation of an IP address to its VPP counterpart with a helper, add +it or remove it, and if there are no link-local addresses left, disable IPv6 on the interface. +There's also a few multicast routes to add (notably 224.0.0.0/24 and ff00::/8, all-local-subnet). 
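+
+Both the neighbor and the address handlers follow the same basic skeleton: look up the _LIP_ for the
+ifindex in the message, ignore the message if VPP doesn't know about the interface, and otherwise
+convert the values and apply them. A condensed stand-alone sketch of that pattern (made-up types and
+helper names, not the plugin's actual structures):
+
+```
+#include <stdio.h>
+#include <stdbool.h>
+#include <stddef.h>
+
+typedef struct {
+  int vif_index;         /* Linux ifindex of the TAP                  */
+  int phy_sw_if_index;   /* VPP sw_if_index of the mirrored interface */
+} lip_t;
+
+/* toy "database" of interface pairs; the plugin keeps a pool plus an index */
+static lip_t lips[] = { { .vif_index = 1488, .phy_sw_if_index = 1 } };
+
+static lip_t *lip_find_by_vif(int vif_index) {
+  for (size_t i = 0; i < sizeof(lips) / sizeof(lips[0]); i++)
+    if (lips[i].vif_index == vif_index)
+      return &lips[i];
+  return NULL;   /* a Linux interface VPP is not aware of */
+}
+
+/* stand-in for "convert the Netlink address and program it into VPP" */
+static void vpp_addr_add_del(const lip_t *lip, const char *addr, bool is_add) {
+  printf("%s %s on sw_if_index %d\n", is_add ? "add" : "del", addr,
+         lip->phy_sw_if_index);
+}
+
+static void handle_addr_msg(int ifindex, const char *addr, bool is_add) {
+  lip_t *lip = lip_find_by_vif(ifindex);
+  if (!lip)
+    return;                       /* not ours: silently ignore */
+  vpp_addr_add_del(lip, addr, is_add);
+}
+
+int main(void) {
+  handle_addr_msg(1488, "10.0.1.1/30", true);    /* known ifindex: applied   */
+  handle_addr_msg(9999, "192.0.2.1/24", true);   /* unknown ifindex: ignored */
+  return 0;
+}
+```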
+ +The code for IP address handling is in this +[[commit]](https://github.com/pimvanpelt/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but +when I took it out for a spin, I noticed something curious, looking at the log lines that are +generated for the following sequence: + +``` +ip addr add 10.0.1.1/30 dev e0 + debug linux-cp/nl addr_add: Netlink route/addr: add idx 1488 family inet local 10.0.1.1/30 flags 0x0080 (permanent) + warn linux-cp/nl dispatch: ignored route/route: add family inet type 2 proto 2 table 255 dst 10.0.1.1 nexthops { idx 1488 } + warn linux-cp/nl dispatch: ignored route/route: add family inet type 1 proto 2 table 254 dst 10.0.1.0/30 nexthops { idx 1488 } + warn linux-cp/nl dispatch: ignored route/route: add family inet type 3 proto 2 table 255 dst 10.0.1.0 nexthops { idx 1488 } + warn linux-cp/nl dispatch: ignored route/route: add family inet type 3 proto 2 table 255 dst 10.0.1.3 nexthops { idx 1488 } + +ping 10.0.1.2 + debug linux-cp/nl neigh_add: Netlink route/neigh: add idx 1488 family inet lladdr 68:05:ca:32:45:94 dst 10.0.1.2 state 0x0002 (reachable) flags 0x0000 + notice linux-cp/nl neigh_add: Added 10.0.1.2 lladdr 68:05:ca:32:45:94 iface TenGigabitEthernet3/0/0 + +ip addr del 10.0.1.1/30 dev e0 + debug linux-cp/nl addr_del: Netlink route/addr: del idx 1488 family inet local 10.0.1.1/30 flags 0x0080 (permanent) + notice linux-cp/nl addr_del: Deleted 10.0.1.1/30 iface TenGigabitEthernet3/0/0 + warn linux-cp/nl dispatch: ignored route/route: del family inet type 1 proto 2 table 254 dst 10.0.1.0/30 nexthops { idx 1488 } + warn linux-cp/nl dispatch: ignored route/route: del family inet type 3 proto 2 table 255 dst 10.0.1.3 nexthops { idx 1488 } + warn linux-cp/nl dispatch: ignored route/route: del family inet type 3 proto 2 table 255 dst 10.0.1.0 nexthops { idx 1488 } + warn linux-cp/nl dispatch: ignored route/route: del family inet type 2 proto 2 table 255 dst 10.0.1.1 nexthops { idx 1488 } + debug linux-cp/nl neigh_del: Netlink route/neigh: del idx 1488 family inet lladdr 68:05:ca:32:45:94 dst 10.0.1.2 state 0x0002 (reachable) flags 0x0000 + error linux-cp/nl neigh_del: Failed 10.0.1.2 iface TenGigabitEthernet3/0/0 +``` + +It is this very last message that's a bit of a surprise -- the ping brought the peer's +lladdr into the neighbor cache; and the subsequent address deletion first removed the address, +then all the typical local routes (the connected, the broadcast, the network, and the self/local); +but then as well explicitly deleted the neighbor, which I suppose is correct behavior for Linux, +were it not that VPP already invalidates the neighbor cache and adds/removes the connected routes +for example in `ip/ip4_forward.c` L826-L830 and L583. + +I can see more of these false positive non-errors like the one on `lcp_nl_neigh_del()` because +interface and directly connected route addition/deletion is slightly different in VPP than in Linux. +So, I decide to take a little shortcut -- if an addition returns "already there", or a deletion returns +"no such entry", I'll just consider it a successful addition and deletion respectively, saving my eyes +from being screamed at by this red error message. I changed that in this +[[commit](https://github.com/pimvanpelt/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)], +turning this situation in a friendly green notice instead. + +### Netlink: Link (existing) + +There's a bunch of use cases for these messages `RTM_NEWLINK` and `RTM_DELLINK`. 
 They carry information
+about carrier (link, no-link), admin state (up/down), MTU, and so on. The function `lcp_nl_link_del()`
+is the easier of the two. If I see a message like this for an ifindex that VPP has a _LIP_ for, I'll
+just remove it. This means first calling the `lcp_itf_pair_delete()` function and then, if the message
+was for a VLAN interface, removing the accompanying sub-interfaces: both the physical one
+(eg. `TenGigabitEthernet3/0/0.1234`) and the TAP that we use to communicate with the host
+(eg. `tap8.1234`).
+
+The other message (the `RTM_NEWLINK` one) is much more complicated, because it's actually many types
+of operation in one message type: we can set the link up/down, change its MTU, and change its MAC
+address, in any combination, perhaps like so:
+```
+ip link set e0 mtu 9216 address 00:11:22:33:44:55 down
+```
+
+So in turn, `lcp_nl_link_add()` will first look at the admin state and apply it to the phy and tap,
+apply the MTU if it differs from what VPP has, and likewise apply the MAC address if it differs,
+notably applying MAC addresses only on 'hardware' interfaces, which I now know are not just physical
+ones like `TenGigabitEthernet3/0/0` but also virtual ones like `BondEthernet0`.
+
+One thing I noticed is that link state and MTU changes tend to go around in circles (from Netlink
+into VPP with this code, but when `lcp-sync` is on in the interface mirror plugin, changes to link
+and MTU will trigger a callback there, which will in turn generate a Netlink message, and so on).
+To avoid this loop, I temporarily turn off `lcp-sync` just before handling a batch of messages, and
+turn it back to its original state when I'm done with that.
+
+The code for add/del of existing links is in this
+[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].
+
+### Netlink: Link (new)
+
+Here's where it gets interesting! What if the `RTM_NEWLINK` message was for an interface that VPP
+doesn't have a _LIP_ for, but specifically describes a VLAN interface? Well, then clearly the operator
+is trying to create a new sub-interface. And supporting that operation would be super cool, so let's go!
+
+Using the earlier placeholder hint in `lcp_nl_link_add()` (see the previous
+[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
+I know that I've gotten a NEWLINK request but the Linux ifindex doesn't have a _LIP_. This could be
+because the interface is entirely foreign to VPP, for example somebody created a dummy interface or
+a VLAN sub-interface on one:
+```
+ip link add dum0 type dummy
+ip link add link dum0 name dum0.10 type vlan id 10
+```
+
+Or perhaps more interestingly, the operator is actually trying to create a VLAN sub-interface on an
+interface we created in VPP earlier, like these:
+```
+ip link add link e0 name e0.1234 type vlan id 1234
+ip link add link e0.1234 name e0.1235 type vlan id 1000
+ip link add link e0 name e0.1236 type vlan id 2345 proto 802.1ad
+ip link add link e0.1236 name e0.1237 type vlan id 1000
+```
+
+None of these `RTM_NEWLINK` messages, identified by their vif (Linux ifindex), will have a
+corresponding _LIP_. So, I try to _create one_ by calling `lcp_nl_link_add_vlan()`.
+
+First, I'll look up the parent ifindex (`dum0` or `e0` in the examples above). The first example
+parent, `dum0`, doesn't have a _LIP_, so I bail after logging a warning. The second example parent,
+`e0`, definitely does have a _LIP_, so it's known to VPP.
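+
+Putting the above together, the dispatch for `RTM_NEWLINK` roughly follows this shape. This is a
+simplified sketch with hypothetical types and helper names (none of them are the plugin's real
+functions), and the QinQ and sub-interface id details are discussed below:
+
+```
+/* Illustrative sketch only -- hypothetical types and helpers. */
+typedef struct lip lip_t;
+
+extern lip_t *lip_get_by_vif (unsigned vif_index);         /* assumed lookup */
+extern int nl_link_is_vlan (const void *link);             /* assumed */
+extern unsigned nl_link_parent_ifindex (const void *link); /* assumed */
+extern void nl_link_sync_existing (lip_t *lip, const void *link); /* admin/MTU/MAC */
+extern void nl_link_create_vlan_subif (const void *link);  /* create sub-int + LIP */
+extern void log_warn (const char *msg);
+
+static void
+nl_link_add_sketch (unsigned ifindex, const void *link)
+{
+  lip_t *lip = lip_get_by_vif (ifindex);
+
+  if (lip)
+    {
+      /* Known interface: apply admin state, MTU and MAC changes. */
+      nl_link_sync_existing (lip, link);
+      return;
+    }
+
+  if (!nl_link_is_vlan (link))
+    return; /* foreign interface (e.g. a dummy): not our business */
+
+  if (!lip_get_by_vif (nl_link_parent_ifindex (link)))
+    {
+      log_warn ("parent has no LIP, ignoring"); /* e.g. dum0.10 */
+      return;
+    }
+
+  /* Parent is known to VPP: create the matching VPP sub-interface. */
+  nl_link_create_vlan_subif (link);
+}
+```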
+ +Now, I have two further choices: + +1. the _LIP_ is a phy (ie `TenGigabitEthernet3/0/0` or `BondEthernet0`) and this is a regular tagged + interface with a given proto (dot1q or dot1ad); or +1. the _LIP_ is itself a subint (ie `TenGigabitEthernet3/0/0.1234`) and what I'm being asked for is + actually a QinQ or QinAD sub-interface. Remember, there's an important difference: + - In Linux these sub-interfaces are chained (`e0` creates child `e0.1234@e0` for a normal VLAN, + and `e0.1234` creates child `e0.1235@e0.1234` for the QinQ). + - In VPP these are actually all flat sub-interfaces, with the 'regular' VLAN interface carrying + the `one_tag` flag with only an `outer_vlan_id` set, and the latter QinQ carrying the `two_tags` + flag with both an `outer_vlan_id` (1234) and an `inner_vlan_id` (1000). + +So I look up both the parent _LIP_ as well the phy _LIP_. I now have all the ingredients I need to create +the VPP sub-interfaces with the correct inner-dot1q and outer dot1q or dot1ad. + +Of course, I don't really know what subinterface ID to use. It's appealing to "just" use the vlan id, +but that's not helpful if the outer tag and the inner tag are the same. So I write a helper function +`vnet_sw_interface_get_available_subid()` whose job it is to return an unused subid for the phy, +starting from 1. + +Here as well, the interface plugin can be configured to automatically create _LIPs_ for sub-interfaces, +which I have to turn off temporarily to let my new form of creation do its thing. I carefully ensure that +the thread barrier is taken/released and the original setting of `lcp-auto-subint` is restored at all +exit points. One cool thing is that the new link's name is given in the Netlink message, so I can just +use that one. I like the aesthetic a bit more, because here the operator can give the Linux interface +any name they like, where-as in the other direction, VPP's `lcp-auto-subint` feature has to make up +a boring `.` name. + +Alright, without further ado, the code for the main innovation here, the implementation of +`lcp_nl_link_add_vlan()`, is in this +[[commit](https://github.com/pimvanpelt/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)]. + +## Results + +The functional regression test I made on day one, the one that ensures end-to-end connectivity to and +from the Linux host interfaces works for all 5 interface types (untagged, .1q tagged, QinQ, .1ad tagged +and QinAD) and for both physical and virtual interfaces (like `TenGigabitEthernet3/0/0` and `BondEthernet0`), +still works. + +After this code is in, the operator will only have to create a _LIP_ for any phy interfaces, and +can rely on the new Netlink Listener plugin and the use of `ip` in Linux for all the rest. This +implementation starts approaching 'vanilla' Linux user experience! + +Here's [a screencast](https://asciinema.org/a/432243) showing me playing around a bit, demonstrating +that synchronization works pretty well in both directions, a huge improvement from the +[previous screencast](https://asciinema.org/a/430411) in my [second post]({% post_url 2021-08-13-vpp-2 %}), +which was only two weeks ago: + + + +### Further Work + +You will note that there's one important Netlink message type that's missing: routes! They are so +important in fact, that they're a topic of their very own post. Also, I haven't written the code +for them yet :-) + +A few things worth noting, as future work. 
+ +**Multiple NetNS** - The original Netlink Listener ([ref](https://gerrit.fd.io/r/c/vpp/+/31122)) would +only listen to the default netns specified in the configuration file. This is problematic because the +interface plugin does allow interfaces to be made in other namespaces (by issuing +`lcp create ... host-if X netns foo`), the Netlink world of which will be unknown to VPP. I +created `struct lcp_nl_netlink_namespace` to hold the stuff needed for the Netlink listener, +which is a good starting point to create not one but multiple listeners, one for each unique +namespace that has one or more _LIPs_ defined. This is version-two work :) + +**Multithreading** - In testing, I noticed that while my plugin itself are (or seem to be..) thread +safe, `virtio` may not be totally clean, and I noticed that in a multithreaded VPP instance with many +workers, there's a crash in `lcp_arp_phy_node()` where `vlib_buffer_copy()` returns NULL, which should +not happen. When VPP is in such a state, other plugins (notably DHCP and IPv6 ND) also start complaining, +and `show errors` shows millions of `virtio-input` errors about unavailable buffers. +I do confirm though, that running VPP single threaded does not have these issues. + +## Credits + +I'd like to make clear that the Linux CP plugin is a collaboration between several great minds, +and that my work stands on other software engineer's shoulders. In particular most of the Netlink +socket handling and Netlink message queueing was written by Matthew Smith, and I've had a little bit +of help along the way from Neale Ranns and Jon Loeliger. I'd like to thank them for their work! + +## Appendix + +#### Ubuntu config + +This configuration has been the exact same ever since [my first post]({% post_url 2021-08-12-vpp-1 %}): +``` +# Untagged interface +ip addr add 10.0.1.2/30 dev enp66s0f0 +ip addr add 2001:db8:0:1::2/64 dev enp66s0f0 +ip link set enp66s0f0 up mtu 9000 + +# Single 802.1q tag 1234 +ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234 +ip link set enp66s0f0.q up mtu 9000 +ip addr add 10.0.2.2/30 dev enp66s0f0.q +ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q + +# Double 802.1q tag 1234 inner-tag 1000 +ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000 +ip link set enp66s0f0.qinq up mtu 9000 +ip addr add 10.0.3.2/30 dev enp66s0f0.qinq +ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq + +# Single 802.1ad tag 2345 +ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad +ip link set enp66s0f0.ad up mtu 9000 +ip addr add 10.0.4.2/30 dev enp66s0f0.ad +ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad + +# Double 802.1ad tag 2345 inner-tag 1000 +ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q +ip link set enp66s0f0.qinad up mtu 9000 +ip addr add 10.0.5.2/30 dev enp66s0f0.qinad +ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad + +## Bond interface +ip link add bond0 type bond mode 802.3ad +ip link set enp66s0f2 down +ip link set enp66s0f3 down +ip link set enp66s0f2 master bond0 +ip link set enp66s0f3 master bond0 +ip link set enp66s0f2 up +ip link set enp66s0f3 up +ip link set bond0 up + +ip addr add 10.1.1.2/30 dev bond0 +ip addr add 2001:db8:1:1::2/64 dev bond0 +ip link set bond0 up mtu 9000 + +# Single 802.1q tag 1234 +ip link add link bond0 name bond0.q type vlan id 1234 +ip link set bond0.q up mtu 9000 +ip addr add 10.1.2.2/30 dev bond0.q +ip addr add 2001:db8:1:2::2/64 dev bond0.q + +# Double 802.1q tag 1234 inner-tag 1000 +ip link add link bond0.q name 
bond0.qinq type vlan id 1000 +ip link set bond0.qinq up mtu 9000 +ip addr add 10.1.3.2/30 dev bond0.qinq +ip addr add 2001:db8:1:3::2/64 dev bond0.qinq + +# Single 802.1ad tag 2345 +ip link add link bond0 name bond0.ad type vlan id 2345 proto 802.1ad +ip link set bond0.ad up mtu 9000 +ip addr add 10.1.4.2/30 dev bond0.ad +ip addr add 2001:db8:1:4::2/64 dev bond0.ad + +# Double 802.1ad tag 2345 inner-tag 1000 +ip link add link bond0.ad name bond0.qinad type vlan id 1000 proto 802.1q +ip link set bond0.qinad up mtu 9000 +ip addr add 10.1.5.2/30 dev bond0.qinad +ip addr add 2001:db8:1:5::2/64 dev bond0.qinad +``` + +#### VPP config + +We can whittle down the VPP configuration to the bare minimum: +``` +vppctl lcp default netns dataplane +vppctl lcp lcp-sync on +vppctl lcp lcp-auto-subint on + +## Create `e0` +vppctl lcp create TenGigabitEthernet3/0/0 host-if e0 + +## Create `be0` +vppctl create bond mode lacp load-balance l34 +vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2 +vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3 +vppctl set interface state TenGigabitEthernet3/0/2 up +vppctl set interface state TenGigabitEthernet3/0/3 up +vppctl lcp create BondEthernet0 host-if be0 +``` + + +And the rest of the confifuration work is done entirely from the Linux side! +``` +IP="sudo ip netns exec dataplane ip" +## `e0` aka TenGigabitEthernet3/0/0 +$IP link add link e0 name e0.1234 type vlan id 1234 +$IP link add link e0.1234 name e0.1235 type vlan id 1000 +$IP link add link e0 name e0.1236 type vlan id 2345 proto 802.1ad +$IP link add link e0.1236 name e0.1237 type vlan id 1000 +$IP link set e0 up mtu 9000 + +$IP addr add 10.0.1.1/30 dev e0 +$IP addr add 2001:db8:0:1::1/64 dev e0 +$IP addr add 10.0.2.1/30 dev e0.1234 +$IP addr add 2001:db8:0:2::1/64 dev e0.1234 +$IP addr add 10.0.3.1/30 dev e0.1235 +$IP addr add 2001:db8:0:3::1/64 dev e0.1235 +$IP addr add 10.0.4.1/30 dev e0.1236 +$IP addr add 2001:db8:0:4::1/64 dev e0.1236 +$IP addr add 10.0.5.1/30 dev e0.1237 +$IP addr add 2001:db8:0:5::1/64 dev e0.1237 + +## `be0` aka BondEthernet0 +$IP link add link be0 name be0.1234 type vlan id 1234 +$IP link add link be0.1234 name be0.1235 type vlan id 1000 +$IP link add link be0 name be0.1236 type vlan id 2345 proto 802.1ad +$IP link add link be0.1236 name be0.1237 type vlan id 1000 +$IP link set be0 up mtu 9000 + +$IP addr add 10.1.1.1/30 dev be0 +$IP addr add 2001:db8:1:1::1/64 dev be0 +$IP addr add 10.1.2.1/30 dev be0.1234 +$IP addr add 2001:db8:1:2::1/64 dev be0.1234 +$IP addr add 10.1.3.1/30 dev be0.1235 +$IP addr add 2001:db8:1:3::1/64 dev be0.1235 +$IP addr add 10.1.4.1/30 dev be0.1236 +$IP addr add 2001:db8:1:4::1/64 dev be0.1236 +$IP addr add 10.1.5.1/30 dev be0.1237 +$IP addr add 2001:db8:1:5::1/64 dev be0.1237 +``` + +#### Final note + +You may have noticed that the [commit] links are all to git commits in my private working copy. I +want to wait until my [previous work](https://gerrit.fd.io/r/c/vpp/+/33481) is reviewed and +submitted before piling on more changes. 
Feel free to contact vpp-dev@ for more information in the +mean time :-) diff --git a/content/articles/2021-08-26-fiber7-x.md b/content/articles/2021-08-26-fiber7-x.md new file mode 100644 index 0000000..6c60702 --- /dev/null +++ b/content/articles/2021-08-26-fiber7-x.md @@ -0,0 +1,214 @@ +--- +date: "2021-08-26T12:55:44Z" +title: Fiber7-X in 1790BRE +--- + +## Introduction + +I've been a very happy Init7 customer since 2016, when the fiber to the home +ISP I was a subscriber at back then, a small company called Easyzone, got +acquired by Init7. The technical situation in Wangen-Brüttisellen was +a bit different back in 2016. There was a switch provided by Litecom in which +ports were resold OEM to upstream ISPs, and Litecom would provide the L2 +backhaul to a central place to hand off the customers to the ISPs, in my case +Easyzone. In Oct'16, Fredy asked me if I could do a test of +Fiber7-on-Litecom, which I did and reported on in a [blog post]({% post_url 2016-10-07-fiber7-litexchange %}). + +Some time early 2017, Init7 deployed a POP in Dietlikon (790BRE) and then +magically another one in Brüttisellen (1790BRE). It's a funny story +why the Dietlikon point of presence is called 790BRE, but I'll leave that +for the bar, not this post :-) + +## Fiber7's Next Gen + +Some of us read a rather curious tweet in back in May: + +{{< image width="400px" float="center" src="/assets/fiber7-x/tweet.png" alt="Tweet-X2" >}} + +Translated -- _'7 years ago our Gigabit-Internet was born. To celebrate this day, +here's a riddle for #Nerds: Gordon Moore's law says dictates doubling every 18 months. +What does that mean for our 7 year old Fiber7?'_ Well, 7 years is 84 months, and +doubling every 18 months means 84/18 = 4.6667 doublings and 1Gbpbs * 2^4.6667 = +25.4Gbps. Holy shitballs, Init7 just announced that their new platform will offer 25G +symmetric ethernet?! + +"I wonder what that will cost?", I remember myself thinking. "**The same price**", +was the answer. I can see why -- monitoring my own family's use, we're doing a good +60Mbit or so when we stream Netflix and/or Spotify (which we all do daily). And some +IPTV maybe at 4k will go for a few hundred megs, but the only time we actually use +the gigabit, is when we do a speedtest of an iperf :-) Moreover, offering 25G fits +the company's marketing strategy well, because our larger Swiss national telco and +cable providers are all muddying the waters with their DOCSIS and GPON offering, +both of which _can_ do 10Gbit, but it's a TDM (time division multiplexing) offering +which makes any number of subscribers share that bandwidth to a central office. And +when I say any number, it's easy to imagine 128 and 256 subscribers on one XGSPON, +and many of those transponders in a telco line terminator, each with redundant uplinks +of 2x10G or sometimes 2x40G. But that's an oversubscription of easily 2000x, taking +128 (subscribers per PON) x16 (PONs per linecard) x8 (linecards), is 16K subscribers +of 10G using 80G (or only 20G) of uplink bandwidth. That's massively inferior from +a technical perspective. And, as we'll see below, it doesn't really allow for +advanced services, like L2 backhaul from the subscriber to a central office. + +Now to be fair, the 1790BRE pop that I am personally connected to has 2x 10G uplinks +and ~200 or so 1G downlinks, which is also a local overbooking of 10:1, or 20:1 if only +one of the uplinks is used at any given time. 
Worth noting, sometimes several cities +are daisy chained, which makes for larger overbooking if you're deep in the Fiber7 +access network. I am pretty close (790BRE-790SCW-790OER-Core; and an alternate path of +780EFF-Core; only one of which is used because the Fiber7 edge switches use OSPF and +a limited TCAM space means only few if any public routes are there; I assume a default +is injected into OSPF at every core site and limited traffic engineering is done). +The longer the traceroute, the cooler it looks, but the more customers are ahead of +you, causing more overbooking. YMMV ;-) + +## Upgrading 1790BRE + +{{< image width="300px" float="right" src="/assets/fiber7-x/before.png" alt="Before" >}} + +Wouldn't it be cool if Init7 upgraded to 100G intra-pop? Well, this is the story +of their Access'21 project! My buddy Pascal (who is now the CTO at Init7, good +choice!), explained it to me in a business call back in June, but also shared it +in a [presentation](/assets/fiber7-x/UKNOF_20210803.pdf) which I definitely encourage +you to browse through. If you thought I was jaded on GPON, check out their assessment, +it's totally next level! + +Anyway, the new POPs are based on Cisco's C9500 switches, which come in two variants: +Access switches are C9500-48Y4C which take 48x SPF28 (1/10/25Gbit) and 4x QSFP+ (40/100Gbit) +and aggregation switches are C9500-32C which take 32x QSFP+ (40/100Gbit). + +As a subscriber, we all got a courtesy headsup on the date of 1790BRE's upgrade. +It was [scheduled](https://as13030.net/status/?ticket=4238550) for Thursday Aug 26th +starting at midnight. As I've written about before (for example at the bottom of my +[Bucketlist post]({% post_url 2021-07-26-bucketlist %})), I really enjoy the immediate +gratification of physical labor in a datacenter. Most of my projects at work are on the +quarters-to-years timeframe, and being able to do a thing and see the result of that +thing ~immmediately, is a huge boost for me. + +So I offered to let one of the two Init7 people take the night off and help perform the +upgrade myself. The picture on the right is how the switch looked like until now, with +four linecards of 48x1G trunked into 2x10G uplinks, one towards Effretikon and one +towards Dietlikon. It's an aging Cisco 4510 switch (they were released around 2010), +but it has served us well here in Brüttisellen for many years, thank you, little +chassis! + +## The Upgrade + +{{< image width="300px" float="right" src="/assets/fiber7-x/during.png" alt="During" >}} + +I met the Init7 engineer in front of the Werke Wangen-Brüttisellen, which is about +170m from my house, as the photons fly, at around 23:30. We chatted for a little while, +I had already gotten to know him due to mutual hosting at NTT in Rümlang, so of +course our basement ISPs peer over [CommunityIX](https://communityix.ch/) and so on, +but it's cool to put a face to the name. + +The new switches were already racked by Pascal previously, and DWDM multiplexers have +appeared, and that what used to be a simplex fiber, is now two pairs of duplex fibers. +Maybe DWDM services are in reach for me at some point? I should look in to that ... but +for now let's focus on the task at hand. 
+ +In the picture on the right, you can see from top to bottom: DWDM mux to ZH11/790ZHB +which immediately struck my eye as clever - it's a 8 channel DWDM mux with channels C31-C38 +and two wideband passthroughs, one is 1310W which means "a wideband 1310nm" which is where +the 100G optics are sending; and the other is UPG which is an upgrade port, allowing to add +more DWDM channels in a separate mux into the fiber at a later date, at the expense of +2dB or so of insertion loss. Nice. The second is an identical unit, a DWDM mux to 780EFF +which has again one 100G 1310nm wideband channel towards Effretikon and then on to +Winterthur, and CH31 in use with what is the original C4510 switch (that link used to +be a dark fiber with vanilla 10G optics connecting 1790BRE with 780EFF). + +Then there are two redundant aggregation switches (the 32x100G kind), which have each +four access switches connected to them, with the pink cables. Those are interesting: +to make 100G very cheap, optics can make use of 4x25G lasers that each take one fiber, +so 8 fibers in total, and those pink cables are 12-fiber multimode trunks with an +[MPO](https://vitextech.com/mpo-mtp-connectors-difference/) connector. The optics for this +type of connection are super cheap, for example this [Flexoptix](https://www.flexoptix.net/en/qsfp28-sr4-transceiver-100-gigabit-mm-850nm-100m-1db-ddm-dom.html?co8502=76636) one. I have the 40G variant at home, also running multimode +4x10G MPO cables, at a fraction of the price of singlemode single-laser variants. So +when people say "multimode is useless, always use singlemode", point them at this post +please! + +{{< image width="300px" float="left" src="/assets/fiber7-x/after.png" alt="After" >}} + +There were 11 subscribers who upgraded their service, ten of them to 10Gbps (myself +included) and one of them to 25Gbps, lucky bastard. So in a first pass we shut down all +the ports on the C4510 and moved over optics and fibers one by one into the new C9500 +switches, of which there were four. + +Werke Wangen-Brüttisellen (the local telcoroom owners in my town) historically did do +a great job at labeling every fiber with little numbered clips, so it's easy to ensure +that what used to be fiber #33, is now still in port #33. I worked from the right, +taking two optics from the old switch, moving them into the new switch, and reinserting +the fibers. The Init7 engineer worked from the left, doing the same. We managed to +complete this swap-over in record time, according to Pascal who was monitoring from +remote, and reconfiguring the switches to put the subscribers back into service. We +started at 00:05 and completed the physical reconfiguration at 01:21am. Go, us! + +After the physical work, we conducted an Init7 post-maintenance ritual which was eating +a cheeseburger to replenish our body's salt and fat contents. We did that at my place +and luckily I have access to a microwave oven and also some Blairs Mega Death hotsauce +(with liquid rage) which my buddy enthusiastically drizzled onto the burger, but it did +make him burp just a little bit as sweat poured out of his face. That was fun! I took +some more pictures, published with permission, in [this album](https://photos.app.goo.gl/VozxYvnuXSQPBePG7). + +
+### 10G VLL + +One more thing! I had waited to order this until the time was right, and the upgrade of +1790BRE was it -- since I operate AS50869, a little basement ISP, I had always hoped to +change my 1500 byte MTU L3 service into a Jumboframe capable L2 service. After some +negotiation on the contractuals, I signed an order ahead of this maintenance to upgrade +to a 10G virtual leased line (VLL) from this place to the NTT datacenter in Rümlang. + +In the afternoon, I had already patched my side of the link in the datacenter, and I +noticed that the Init7 side of the patch was dangling in their rack without an optic. So +we went to the datacenter (at 2am, the drive from my house to NTT is 9 minutes, without +speeding!), and plugged in an optic to let my lonely photons hit a friendly receiver. + +I then got to configure the VLL together with my buddy, which was a hilight of the night +for me. I now have access to a spiffy new 10 gigabit VLL operating at 9190 MTU, from +1790BRE directly to my router `chrma0.ipng.ch` at NTT Rümlang, while previously I +had secured a 1G carrier ethernet operating at 9000 MTU directly to my router +`chgtg0.ipng.ch` at Interxion Glattbrugg. Between the two sites, I have a CWDM wave +which currently runs 10G optics but I have the 25G CWDM optics and switches ready for +deployment. It's somewhat (ok, utterly) over the top, but I like (ok, love) it. + +``` +pim@chbtl0:~$ show protocols ospfv3 neighbor +Neighbor ID Pri DeadTime State/IfState Duration I/F[State] +194.1.163.4 1 00:00:38 Full/PointToPoint 87d05:37:45 dp0p6s0f3[PointToPoint] +194.1.163.86 1 00:00:31 Full/DROther 16:18:39 dp0p6s0f2.101[BDR] +194.1.163.87 1 00:00:30 Full/DR 7d15:48:41 dp0p6s0f2.101[BDR] +194.1.163.0 1 00:00:38 Full/PointToPoint 2d12:02:19 dp0p6s0f0[PointToPoint] +``` + +The latency from my workstation on which I'm writing this blogpost to, say, my Bucketlist +location of NIKHEF in the Amsterdam Watergraafsmeer, is pretty much as fast as light +goes (I've seen 12.2ms, but considering it's ~820km, this is not bad at all): + +``` +pim@chumbucket:~$ traceroute gripe +traceroute to gripe (94.142.241.186), 30 hops max, 60 byte packets + 1 chbtl0.ipng.ch (194.1.163.66) 0.211 ms 0.186 ms 0.189 ms + 2 chrma0.ipng.ch (194.1.163.17) 1.463 ms 1.416 ms 1.432 ms + 3 defra0.ipng.ch (194.1.163.25) 7.376 ms 7.344 ms 7.330 ms + 4 nlams0.ipng.ch (194.1.163.27) 12.952 ms 13.115 ms 12.925 ms + 5 gripe.ipng.nl (94.142.241.186) 13.250 ms 13.337 ms 13.223 ms +``` + +And, due to the work we did above, now the bandwidth is up to par as well, with +comparable down- and upload speeds of 9.2Gbit from NL>CH and 8.9Gbit from +CH>NL, and, while I'm not going to prove it here, this would work equally +well with 9000 byte, 1500 byte or 64 byte frames due to my use of DPDK based +routers who just don't G.A.F. : +``` +pim@chumbucket:~$ iperf3 -c nlams0.ipng.ch -R -P 10 ## Richtung Schweiz! +Connecting to host nlams0, port 5201 +Reverse mode, remote host nlams0 is sending +... +[SUM] 0.00-10.01 sec 10.8 GBytes 9.26 Gbits/sec 53 sender +[SUM] 0.00-10.00 sec 10.7 GBytes 9.19 Gbits/sec receiver + +pim@chumbucket:~$ iperf3 -c nlams0.ipng.ch -P 10 ## Naar Nederland toe! +Connecting to host nlams0, port 5201 +... 
+[SUM] 0.00-10.00 sec 9.93 GBytes 8.87 Gbits/sec 405 sender +[SUM] 0.00-10.02 sec 9.91 GBytes 8.84 Gbits/sec receiver +``` diff --git a/content/articles/2021-09-02-vpp-5.md b/content/articles/2021-09-02-vpp-5.md new file mode 100644 index 0000000..71b5ff0 --- /dev/null +++ b/content/articles/2021-09-02-vpp-5.md @@ -0,0 +1,549 @@ +--- +date: "2021-09-02T12:19:14Z" +title: VPP Linux CP - Part5 +--- + + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. One thing notably missing, is the higher level control plane, that is +to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a +VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network +devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols +like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet +forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use +VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the +interface state (links, addresses and routes) itself. When the plugin is completed, running software +like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving +>100Mpps and >100Gbps forwarding rates will be well in reach! + +In the previous post, I added support for VPP to consume Netlink messages that describe interfaces, +IP addresses and ARP/ND neighbor changes. This post completes the stablestakes Netlink handler by +adding IPv4 and IPv6 route messages, and ends up with a router in the DFZ consuming 133K IPv6 +prefixes and 870K IPv4 prefixes. + +## My test setup + +The goal of this post is to show what code needed to be written to extend the **Netlink Listener** +plugin I wrote in the [fourth post]({% post_url 2021-08-25-vpp-4 %}), so that it can consume +route additions/deletions, a thing that is common in dynamic routing protocols such as OSPF and +BGP. + +The setup from my [third post]({% post_url 2021-08-15-vpp-3 %}) is still there, but it's no longer +a focal point for me. I use it (the regular interface + subints and the BondEthernet + subints) +just to ensure my new code doesn't have a regression. + +Instead, I'm creating two VLAN interfaces now: +- The first is in my home network's _servers_ VLAN. There are three OSPF speakers there: + - `chbtl0.ipng.ch` and `chbtl1.ipng.ch` are my main routers, they run DANOS and are in + the Default Free Zone (or DFZ for short). + - `rr0.chbtl0.ipng.ch` is one of AS50869's three route-reflectors. Every one of the 13 + routers in AS50869 exchanges BGP information with these, and it cuts down on the total + amount of iBGP sessions I have to maintain -- see [here](https://networklessons.com/bgp/bgp-route-reflector) + for details on Route Reflectors. +- The second is an L2 connection to a local BGP exchange, with only three members (IPng Networks, + AS50869, Openfactory AS58299, and Stucchinet AS58280). 
In this VLAN, Openfactory was so kind + as to configure a full transit session for me, and I'll use it in my test bench. + +The test setup offers me the ability to consume OSPF, OSPFv3 and BGP. + +### Startingpoint + +Based on the state of the plugin after the [fourth post]({% post_url 2021-08-25-vpp-4 %}), +operators can create VLANs (including .1q, .1ad, QinQ and QinAD subinterfaces) directly in +Linux. They can change link attributes (like set admin state 'up' or 'down', or change +the MTU on a link), they can add/remove IP addresses, and the system will add/remove IPv4 +and IPv6 neighbors. But notably, the following Netlink messages are not yet consumed, as shown +by the following example: + +``` +pim@hippo:~/src/lcpng$ sudo ip link add link e1 name servers type vlan id 101 +pim@hippo:~/src/lcpng$ sudo ip link up mtu 1500 servers +pim@hippo:~/src/lcpng$ sudo ip addr add 194.1.163.86/27 dev servers +pim@hippo:~/src/lcpng$ sudo ip ro add default via 194.1.163.65 +``` + +which does the first three commands just fine, but the fourth: +``` +linux-cp/nl [debug ]: dispatch: ignored route/route: add family inet type 1 proto 3 + table 254 dst 0.0.0.0/0 nexthops { gateway 194.1.163.65 idx 197 } +``` + +In this post, I'll implement that last missing piece in two functions called `lcp_nl_route_add()` +and `lcp_nl_route_del()`. Here we go! + +## Netlink Routes + +Reusing the approach from the work-in-progress [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)], I introduce two FIB sources: one +for manual routes (ie. the ones that an operator might set with `ip route add`), and another one +for dynamic routes (ie. what a routing protocol like Bird or FRR might set), this is in +`lcp_nl_proto_fib_source()`. Next, I need a bunch of helper functions that can translate the +Netlink message information into VPP primitives: + +- `lcp_nl_mk_addr46()` converts a Netlink `nl_addr` to a VPP `ip46_address_t`. +- `lcp_nl_mk_route_prefix()` converts a Netlink `rtnl_route` to a VPP `fib_prefix_t`. +- `lcp_nl_mk_route_mprefix()` converts a Netlink `rtnl_route` to a VPP `mfib_prefix_t` (for + multicast routes). +- `lcp_nl_mk_route_entry_flags()` generates `fib_entry_flag_t` from the Netlink route type, + table and proto metadata. +- `lcp_nl_proto_fib_source()` selects the most appropciate FIB source by looking at the + `rt_proto` field from the Netlink message (see `/etc/iproute2/rt_protos` for a list of + these). Anything **RTPROT_STATIC** or better is `fib_src`, while anything above that + becomes `fib_src_dynamic`. +- `lcp_nl_route_path_parse()` converts a Netlink `rtnl_nexthop` to a VPP `fib_route_path_t` + and adds that to a growing list of paths. Similar to Netlink's nethops being a list, so + are the individual paths in VPP, so that lines up perfectly. +- `lcp_nl_route_path_add_special()` adds a blackhole/unreach/prohibit route to the list + of paths, in the special-case there is not yet a path for the destination. + +With these helpers, I will have enough to manipulate VPP's forwarding information base or _FIB_ +for short. But in VPP, the _FIB_ consists of any number of _tables_ (think of them as _VRFs_ +or Virtual Routing/Forwarding domains). So first, I need to add these: + +- `lcp_nl_table_find()` selects the matching `{table-id,protocol}` (v4/v6) tuple from + an internally kept hash of tables. +- `lcp_nl_table_add_or_lock()` if a table with key `{table-id,protocol}` (v4/v6) hasn't + been used yet, create one in VPP, and store it for future reference. 
Otherwise increment + a table reference counter so I know how many FIB entries VPP will have in this table. +- `lcp_nl_table_unlock()` given a table, decrease the refcount on it, and if no more + prefixes are in the table, remove it from VPP. + +All of this code was heavily inspired by the pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)] +but a few finishing touches were added, and wrapped up in this +[[commit](https://github.com/pimvanpelt/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)]. + +### Deletion + +Our main function `lcp_nl_route_del()` will remove a route from the given table-id/protocol. +I do this by applying `rtnl_route_foreach_nexthop()` callbacks to the list of Netlink message +nexthops, converting each of them into VPP paths in a `lcp_nl_route_path_parse_t` structure. +If the route is for unreachable/blackhole/prohibit in Linux, add that path too. + +Then, remove the VPP paths from the FIB and reduce refcnt or remove the table if it's empty. +This is reasonably straight forward. + +### Addition + +Adding routes to the FIB is done with `lcp_nl_route_add()`. It immediately becomes obvious +that not all routes are relevant for VPP. A prime example are those in table 255, they are +'local' routes, which have already been set up by IPv4 and IPv6 address addition functions +in VPP. There are some other route types that are invalid, so I'll just skip those. + +Link-local IPv6 and IPv6 multicast is also skipped, because they're also added when interfaces +get their IP addresses configured. But for the other routes, similar to deletion, I'll extract +the paths from the Netlink message's netxhops list, by constructing an `lcp_nl_route_path_parse_t` +by walking those Netlink nexthops, and optionally add a _special_ route (in case the route was +for unreachable/blackhole/prohibit in Linux -- those won't have a nexthop). + +Then, insert the VPP paths found in the Netlink message into the FIB or the multicast FIB, +respectively. + +## Control Plane: Bird + +So with this newly added code, the example above of setting a default route shoots to life. +But I can do better! At IPng Networks, my routing suite of choice is Bird2, and I have some +code to generate configurations for it and push those configs safely to routers. So, let's +take a closer look at a configuration on the test machine running VPP + Linux CP with this +new Netlink route handler. + +``` +router id 194.1.163.86; +protocol device { scan time 10; } +protocol direct { ipv4; ipv6; check link yes; } +``` + +These first two protocols are internal implementation details. The first, called _device_ +periodically scans the network interface list in Linux, to pick up new interfaces. You can +compare it to issuing `ip link` and acting on additions/removals as they occur. The second, +called _direct_, generates directly connected routes for interfaces that have IPv4 or IPv6 +addresses configured. It turns out that if I add `194.1.163.86/27` as an IPv4 address on +an interface, it'll generate several Netlink messages: one for the `RTM_NEWADDR` which +I discussed in my [fourth post]({% post_url 2021-08-25-vpp-4 %}), and also a `RTM_NEWROUTE` +for the connected `194.1.163.64/27` in this case. It helps the kernel understand that if +we want to send a packet to a host in that prefix, we should not send it to the default +gateway, but rather to a nexthop of the device. Those are intermittently called `direct` +or `connected` routes. 
Ironically, these are called `RTS_DEVICE` routes in Bird2 +[ref](https://github.com/BIRD/bird/blob/master/nest/route.h#L373) even though they are +generated by the `direct` routing protocol. + +That brings me to the third protocol, one for each address type: +``` +protocol kernel kernel4 { + ipv4 { + import all; + export where source != RTS_DEVICE; + }; +} + +protocol kernel kernel6 { + ipv6 { + import all; + export where source != RTS_DEVICE; + }; +} +``` +We're asking Bird to import any route it learns from the kernel, and we're asking it to +export any route that's not an `RTS_DEVICE` route. The reason for this is that when we +create IPv4/IPv6 addresses, the `ip` command already adds the connected route, and this +avoids Bird from inserting a second, identical route for those connected routes. And with +that, I have a very simple view, given for example these two interfaces: + +``` +pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip route +45.129.224.232/29 dev ixp proto kernel scope link src 45.129.224.235 +194.1.163.64/27 dev servers proto kernel scope link src 194.1.163.86 + +pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 route +2a0e:5040:0:2::/64 dev ixp proto kernel metric 256 pref medium +2001:678:d78:3::/64 dev servers proto kernel metric 256 pref medium + +pim@hippo:/etc/bird$ birdc show route +BIRD 2.0.7 ready. +Table master4: +45.129.224.232/29 unicast [direct1 20:48:55.547] * (240) + dev ixp +194.1.163.64/27 unicast [direct1 20:48:55.547] * (240) + dev servers + +Table master6: +2a0e:5040:1001::/64 unicast [direct1 20:48:55.547] * (240) + dev stucchi +2001:678:d78:3::/64 unicast [direct1 20:48:55.547] * (240) + dev servers +``` + +## Control Plane: OSPF + +Considering the `servers` network above has a few OSPF speakers in it, I will introduce this +router there as well. The configuration is very straight forward in Bird, let's just add +the OSPF and OSPFv3 protocols as follows: + +``` +protocol ospf v2 ospf4 { + ipv4 { export where source = RTS_DEVICE; import all; }; + area 0 { + interface "lo" { stub yes; }; + interface "servers" { type broadcast; cost 5; }; + }; +} + +protocol ospf v3 ospf6 { + ipv6 { export where source = RTS_DEVICE; import all; }; + area 0 { + interface "lo" { stub yes; }; + interface "servers" { type broadcast; cost 5; }; + }; +} +``` + +Here, I tell OSPF to export all `connected` routes, and accept any route given to it. The only +difference between IPv4 and IPv6 is that the former uses OSPF version 2 of the protocol, and IPv6 +uses version 3 of the protocol. And, as with the `kernel` routing protocol above, each instance +has to has its own unique name, so I make the obvious choice. + +Within a few seconds, the OSPF Hello packets can be seen going out of the `servers` interface, +and adjacencies form shortly thereafter: + +``` + +pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip ro | wc -l +83 +pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 ro | wc -l +74 + +pim@hippo:~/src/lcpng$ birdc show ospf nei ospf4 +BIRD 2.0.7 ready. +ospf4: +Router ID Pri State DTime Interface Router IP +194.1.163.3 1 Full/Other 39.588 servers 194.1.163.66 +194.1.163.87 1 Full/DR 39.588 servers 194.1.163.87 +194.1.163.4 1 Full/Other 39.588 servers 194.1.163.67 + +pim@hippo:~/src/lcpng$ birdc show ospf nei ospf6 +BIRD 2.0.7 ready. 
+ospf6: +Router ID Pri State DTime Interface Router IP +194.1.163.87 1 Full/DR 32.221 servers fe80::5054:ff:feaa:2b24 +194.1.163.3 1 Full/BDR 39.504 servers fe80::9e69:b4ff:fe61:7679 +194.1.163.4 1 2-Way/Other 38.357 servers fe80::9e69:b4ff:fe61:a1dd +``` + +And all of these were inserted into the VPP forwarding information base, taking for example +the IPng router in Amsterdam, loopback address `194.1.163.32` and `2001:678:d78::8`: + +``` +DBGvpp# show ip fib 194.1.163.32 +ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, lcp-rt:1, nat-hi:2, ] +194.1.163.32/32 fib:0 index:70 locks:2 + lcp-rt-dynamic refs:1 src-flags:added,contributing,active, + path-list:[49] locks:142 flags:shared,popular, uPRF-list:49 len:1 itfs:[16, ] + path:[69] pl-index:49 ip4 weight=1 pref=32 attached-nexthop: oper-flags:resolved, + 194.1.163.67 TenGigabitEthernet3/0/1.3 + [@0]: ipv4 via 194.1.163.67 TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca324615810000650800 + + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:72 buckets:1 uRPF:49 to:[0:0]] + [0] [@5]: ipv4 via 194.1.163.67 TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca324615810000650800 + +DBGvpp# show ip6 fib 2001:678:d78::8 +ipv6-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, ] +2001:678:d78::8/128 fib:0 index:130058 locks:2 + lcp-rt-dynamic refs:1 src-flags:added,contributing,active, + path-list:[116] locks:220 flags:shared,popular, uPRF-list:106 len:1 itfs:[16, ] + path:[141] pl-index:116 ip6 weight=1 pref=32 attached-nexthop: oper-flags:resolved, + fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3 + [@0]: ipv6 via fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca3246158100006586dd + + forwarding: unicast-ip6-chain + [@0]: dpo-load-balance: [proto:ip6 index:130060 buckets:1 uRPF:106 to:[0:0]] + [0] [@5]: ipv6 via fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca3246158100006586dd +``` + +In the snippet above we can see elements of the Linux CP Netlink Listener plugin doing its work. +It found the right nexthop, the right interface, enabled the FIB entry, and marked it with the +correct _FIB_ source `lcp-rt-dynamic`. And, with OSPF and OSPFv3 now enabled, VPP has gained visibility +to all of my internal network: +``` +pim@hippo:~/src/lcpng$ traceroute nlams0.ipng.ch +traceroute to nlams0.ipng.ch (2001:678:d78::8) from 2001:678:d78:3::86, 30 hops max, 24 byte packets + 1 chbtl1.ipng.ch (2001:678:d78:3::1) 0.3182 ms 0.2840 ms 0.1841 ms + 2 chgtg0.ipng.ch (2001:678:d78::2:4:2) 0.5473 ms 0.6996 ms 0.6836 ms + 3 chrma0.ipng.ch (2001:678:d78::2:0:1) 0.7700 ms 0.7693 ms 0.7692 ms + 4 defra0.ipng.ch (2001:678:d78::7) 6.6586 ms 6.6443 ms 6.9292 ms + 5 nlams0.ipng.ch (2001:678:d78::8) 12.8321 ms 12.9398 ms 12.6225 ms +``` + +## Control Plane: BGP + +But the holy grail, and what got me started on this whole adventure, is to be able to participate in the +_Default Free Zone_ using BGP, So let's put these plugins to the test and load up a so-called _full table_ +which means: all the routing information needed to reach any part of the internet. As of August'21, +there are about 870'000 such prefixes for IPv4, and aboug 133'000 prefixes for IPv6. 
We passed the magic +1M number, which I'm sure makes some silicon vendors anxious, because lots of older kit in the field won't +scale beyond a certain size. VPP is totally immune to this problem, so here we go! + +``` +template bgp T_IBGP4 { + local as 50869; + neighbor as 50869; + source address 194.1.163.86; + ipv4 { import all; export none; next hop self on; }; +}; +protocol bgp rr4_frggh0 from T_IBGP4 { neighbor 194.1.163.140; } +protocol bgp rr4_chplo0 from T_IBGP4 { neighbor 194.1.163.148; } +protocol bgp rr4_chbtl0 from T_IBGP4 { neighbor 194.1.163.87; } + +template bgp T_IBGP6 { + local as 50869; + neighbor as 50869; + source address 2001:678:d78:3::86; + ipv6 { import all; export none; next hop self ibgp; }; +}; +protocol bgp rr6_frggh0 from T_IBGP6 { neighbor 2001:678:d78:6::140; } +protocol bgp rr6_chplo0 from T_IBGP6 { neighbor 2001:678:d78:7::148; } +protocol bgp rr6_chbtl0 from T_IBGP6 { neighbor 2001:678:d78:3::87; } +``` + +And with these two blocks, I've added six new protocols -- three of them are IPv4 route-reflector +clients, and three of them are IPv6 ones. Once this commits, Bird will be able to find these IP +addresses due to the OSPF routes being loaded into the _FIB_, and once it does that, each of the +route-reflector servers will download a full routing table into Bird's memory, and in turn Bird +will use the `kernel4` and `kernel6` protocol to export them into Linux (essentially performing +an `ip ro add ... via ...` on each), and the kernel will then generate a Netlink message, which +the Linux CP **Netlink Listener** plugin will pick up and the rest, as they say, is history. + +I gotta tell you - the first time I saw this working end to end, I was elated. Just seeing blocks +of 6800-7000 of these being pumped into VPP's _FIB_ each 40ms block was just .. magical. And the +performance is pretty good, too, because 7000/40ms is 175K/sec alluding to VPP operators being +able to not only consume but also program into the _FIB_, a full IPv4 and IPv6 table in about 6 +seconds, whoa! + +``` +DBGvpp# +linux-cp/nl [warn ]: process_msgs: Processed 6550 messages in 40001 usecs, 2607 left in queue +linux-cp/nl [warn ]: process_msgs: Processed 6368 messages in 40000 usecs, 7012 left in queue +linux-cp/nl [warn ]: process_msgs: Processed 6460 messages in 40001 usecs, 13163 left in queue +... +linux-cp/nl [warn ]: process_msgs: Processed 6418 messages in 40004 usecs, 93606 left in queue +linux-cp/nl [warn ]: process_msgs: Processed 6438 messages in 40002 usecs, 96944 left in queue +linux-cp/nl [warn ]: process_msgs: Processed 6575 messages in 40002 usecs, 99986 left in queue +linux-cp/nl [warn ]: process_msgs: Processed 6552 messages in 40004 usecs, 94767 left in queue +linux-cp/nl [warn ]: process_msgs: Processed 5890 messages in 40001 usecs, 88877 left in queue +linux-cp/nl [warn ]: process_msgs: Processed 6829 messages in 40003 usecs, 82048 left in queue +... +linux-cp/nl [warn ]: process_msgs: Processed 6685 messages in 40004 usecs, 13576 left in queue +linux-cp/nl [warn ]: process_msgs: Processed 6701 messages in 40003 usecs, 6893 left in queue +linux-cp/nl [warn ]: process_msgs: Processed 6579 messages in 40003 usecs, 314 left in queue +DBGvpp# +``` + +Due to a good cooperative multitasking approach in the Netlink message queue producer, I will continuously +read Netlink messages from the kernel and put them in a queue, but only consume 40ms or 8000 messages +whichever comes first, after which I yield control back to VPP. 
So you can see here that when the +kernel is flooding the Netlink messages of the learned BGP routing table, the plugin correctly consumes +what it can, the queue grows (in this case to just about 100K messages) and then quickly shrinks again. + +And indeed, Bird, IP and VPP all seem to agree, we did a good job: +``` +pim@hippo:~/src/lcpng$ birdc show route count +BIRD 2.0.7 ready. +1741035 of 1741035 routes for 870479 networks in table master4 +396518 of 396518 routes for 132479 networks in table master6 +Total: 2137553 of 2137553 routes for 1002958 networks in 2 tables + +pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 ro | wc -l +132430 +pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip ro | wc -l +870494 + +pim@hippo:~/src/lcpng$ vppctl sh ip6 fib sum | awk '$1~/[0-9]+/ { total += $2 } END { print total }' +132479 +pim@hippo:~/src/lcpng$ vppctl sh ip fib sum | awk '$1~/[0-9]+/ { total += $2 } END { print total }' +870529 +``` + + +## Results + +The functional regression test I made on day one, the one that ensures end-to-end connectivity to and +from the Linux host interfaces works for all 5 interface types (untagged, .1q tagged, QinQ, .1ad tagged +and QinAD) and for both physical and virtual interfaces (like `TenGigabitEthernet3/0/0` and `BondEthernet0`), +still works. Great. + +Here's [a screencast](https://asciinema.org/a/432943) showing me playing around a bit with that configuration +shown above, demonstrating that RIB and FIB synchronisation works pretty well in both directions, making the +combination of these two plugins sufficient to run a VPP router in the _Default Free Zone_, Whoohoo! + + + +### Future work + +**Atomic Updates** - When running VPP + Linux CP in a default free zone BGP environment, +IPv4 and IPv6 prefixes will be constantly updated as the internet topology morphs and changes. +One thing I noticed is that those are often deletes followed by adds with the exact same +nexthop (ie. something in Germany flapped, and this is not deduplicated), which shows up +as many of these pairs of messages like so: + +``` +linux-cp/nl [debug ]: route_del: netlink route/route: del family inet6 type 1 proto 12 table 254 dst 2a10:cc40:b03::/48 nexthops { gateway fe80::9e69:b4ff:fe61:a1dd idx 197 } +linux-cp/nl [debug ]: route_path_parse: path ip6 fe80::9e69:b4ff:fe61:a1dd, TenGigabitEthernet3/0/1.3, [] +linux-cp/nl [info ]: route_del: table 254 prefix 2a10:cc40:b03::/48 flags +linux-cp/nl [debug ]: route_add: netlink route/route: add family inet6 type 1 proto 12 table 254 dst 2a10:cc40:b03::/48 nexthops { gateway fe80::9e69:b4ff:fe61:a1dd idx 197 } +linux-cp/nl [debug ]: route_path_parse: path ip6 fe80::9e69:b4ff:fe61:a1dd, TenGigabitEthernet3/0/1.3, [] +linux-cp/nl [info ]: route_add: table 254 prefix 2a10:cc40:b03::/48 flags +linux-cp/nl [info ]: process_msgs: Processed 2 messages in 225 usecs +``` + +See how `2a10:cc40:b03::/48` is first removed, and then immediately reinstalled to the exact same +nexthop `fe80::9e69:b4ff:fe61:a1dd` on interface `TenGigabitEthernet3/0/1.3` ? Although it only takes +225µs, it's still a bit sad to parse, create paths, just to remove from the FIB and re-insert the +exact same thing into the FIB. But more importantly, if a packet destined for this prefix arrives in that +225µs window, it will be lost. So I think I'll build a peek-ahead mechanism to capture specifically +this occurence, and let the two del+add messages cancel each other out. 
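+
+One possible shape for that peek-ahead, sketched here purely as an illustration: none of this
+exists in the plugin yet, and all types and helpers (`nl_queue_peek()`, the `nl_msg_*` predicates,
+`lcp_nl_dispatch_sketch()`) are hypothetical.
+
+```
+/* Not implemented -- just a sketch of the idea. If a route del is
+ * immediately followed in the queue by an add for the same prefix with the
+ * same paths, the pair cancels out and the FIB is left untouched. */
+typedef struct nl_msg nl_msg_t;
+
+extern nl_msg_t *nl_queue_peek (void);              /* assumed */
+extern nl_msg_t *nl_queue_pop (void);               /* assumed */
+extern int nl_msg_is_route_del (const nl_msg_t *m); /* assumed */
+extern int nl_msg_is_route_add (const nl_msg_t *m); /* assumed */
+extern int nl_msg_route_same_dst_paths (const nl_msg_t *a, const nl_msg_t *b);
+extern void lcp_nl_dispatch_sketch (nl_msg_t *m);
+extern void nl_msg_free (nl_msg_t *m);
+
+static void
+nl_dispatch_with_peek_ahead (nl_msg_t *m)
+{
+  nl_msg_t *next = nl_queue_peek ();
+
+  if (next && nl_msg_is_route_del (m) && nl_msg_is_route_add (next)
+      && nl_msg_route_same_dst_paths (m, next))
+    {
+      /* Identical del+add: drop both, avoiding the FIB churn and the
+       * brief window in which the prefix would be unreachable. */
+      nl_msg_free (nl_queue_pop ());
+      nl_msg_free (m);
+      return;
+    }
+
+  lcp_nl_dispatch_sketch (m);
+  nl_msg_free (m);
+}
+```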
+ +**Prefix updates towards lo** - When writing the code, I borrowed a bunch from the +pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)] but that one has a nasty crash which was hard to +debug and I haven't yet fully understood it. When a add/del occurs for a route towards IPv6 localhost (these +are typically seen when Bird shuts down eBGP sessions and I no longer have a path to a prefix, it'll mark +it as 'unreachable' rather than deleting it. These are *additions* which have a nexthop without a gateway +but with an interface index of 1 (which, in Netlink, is 'lo'). This makes VPP intermittently crash, so I +currently commented this out, while I gain better understanding. Result: blackhole/unreachable/prohibit +specials can not be set using the plugin. Beware! +(disabled in this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]). + +## Credits + +I'd like to make clear that the Linux CP plugin is a collaboration between several great minds, +and that my work stands on other software engineer's shoulders. In particular most of the Netlink +socket handling and Netlink message queueing was written by Matthew Smith, and I've had a little bit +of help along the way from Neale Ranns and Jon Loeliger. I'd like to thank them for their work! + +## Appendix + +#### VPP config + +We only use one TenGigabitEthernet device on the router, and create two VLANs on it: + +``` +IP="sudo ip netns exec dataplane ip" + +vppctl set logging class linux-cp rate-limit 1000 level warn syslog-level notice +vppctl lcp create TenGigabitEthernet3/0/1 host-if e1 netns dataplane +$IP link set e1 mtu 1500 up + +$IP link add link e1 name ixp type vlan id 179 +$IP link set ixp mtu 1500 up +$IP addr add 45.129.224.235/29 dev ixp +$IP addr add 2a0e:5040:0:2::235/64 dev ixp + +$IP link add link e1 name servers type vlan id 101 +$IP link set servers mtu 1500 up +$IP addr add 194.1.163.86/27 dev servers +$IP addr add 2001:678:d78:3::86/64 dev servers +``` + +#### Bird config + +I'm using a purposefully minimalist configuration for demonstration purposes, posted here +in full for posterity: + +``` +log syslog all; +log "/var/log/bird/bird.log" { debug, trace, info, remote, warning, error, auth, fatal, bug }; + +router id 194.1.163.86; + +protocol device { scan time 10; } +protocol direct { ipv4; ipv6; check link yes; } +protocol kernel kernel4 { ipv4 { import all; export where source != RTS_DEVICE; }; } +protocol kernel kernel6 { ipv6 { import all; export where source != RTS_DEVICE; }; } + +protocol ospf v2 ospf4 { + ipv4 { export where source = RTS_DEVICE; import all; }; + area 0 { + interface "lo" { stub yes; }; + interface "servers" { type broadcast; cost 5; }; + }; +} + +protocol ospf v3 ospf6 { + ipv6 { export where source = RTS_DEVICE; import all; }; + area 0 { + interface "lo" { stub yes; }; + interface "servers" { type broadcast; cost 5; }; + }; +} + +template bgp T_IBGP4 { + local as 50869; + neighbor as 50869; + source address 194.1.163.86; + ipv4 { import all; export none; next hop self on; }; +}; +protocol bgp rr4_frggh0 from T_IBGP4 { neighbor 194.1.163.140; } +protocol bgp rr4_chplo0 from T_IBGP4 { neighbor 194.1.163.148; } +protocol bgp rr4_chbtl0 from T_IBGP4 { neighbor 194.1.163.87; } + +template bgp T_IBGP6 { + local as 50869; + neighbor as 50869; + source address 2001:678:d78:3::86; + ipv6 { import all; export none; next hop self ibgp; }; +}; +protocol bgp rr6_frggh0 from T_IBGP6 { neighbor 2001:678:d78:6::140; } +protocol bgp rr6_chplo0 from T_IBGP6 { 
neighbor 2001:678:d78:7::148; }
+protocol bgp rr6_chbtl0 from T_IBGP6 { neighbor 2001:678:d78:3::87; }
+```
+
+#### Final note
+
+You may have noticed that the [commit] links are all to git commits in my private working copy. I
+want to wait until my [previous work](https://gerrit.fd.io/r/c/vpp/+/33481) is reviewed and
+submitted before piling on more changes. Feel free to contact vpp-dev@ for more information in the
+meantime :-)
+
diff --git a/content/articles/2021-09-10-vpp-6.md b/content/articles/2021-09-10-vpp-6.md
new file mode 100644
index 0000000..8fc451d
--- /dev/null
+++ b/content/articles/2021-09-10-vpp-6.md
@@ -0,0 +1,409 @@
+---
+date: "2021-09-10T13:21:14Z"
+title: VPP Linux CP - Part6
+---
+
+
+{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
+
+# About this series
+
+Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
+performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
+_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
+are shared between the two. One thing notably missing, is the higher level control plane, that is
+to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a
+VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network
+devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols
+like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet
+forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use
+VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the
+interface state (links, addresses and routes) itself. When the plugin is completed, running software
+like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
+>100Mpps and >100Gbps forwarding rates will be well in reach!
+
+## SNMP in VPP
+
+Now that the **Interface Mirror** and **Netlink Listener** plugins are in good shape, this post
+shows a few finishing touches. First off, although the native habitat of VPP is [Prometheus](https://prometheus.io/),
+many folks still run classic network monitoring systems like the popular [Observium](https://observium.org/)
+or its sibling [LibreNMS](https://librenms.org/). Although the metrics-based approach is modern,
+we really ought to have an old-skool [SNMP](https://datatracker.ietf.org/doc/html/rfc1157) interface
+so that we can swear it _by the Old Gods and the New_.
+
+### VPP's Stats Segment
+
+VPP maintains lots of interesting statistics at runtime - for example for nodes and their activity,
+but also, importantly, for each interface known to the system. So I take a look at the stats segment,
+configured in `startup.conf`, and I notice that VPP will create a socket at `/run/vpp/stats.sock` which
+can be connected to. 
There's also a few introspection tools, notably `vpp_get_stats`, which can either +list, dump once, or continuously dump the data: + +``` +pim@hippo:~$ vpp_get_stats socket-name /run/vpp/stats.sock ls | wc -l +3800 +pim@hippo:~$ vpp_get_stats socket-name /run/vpp/stats.sock dump /if/names +[0]: local0 /if/names +[1]: TenGigabitEthernet3/0/0 /if/names +[2]: TenGigabitEthernet3/0/1 /if/names +[3]: TenGigabitEthernet3/0/2 /if/names +[4]: TenGigabitEthernet3/0/3 /if/names +[5]: GigabitEthernet5/0/0 /if/names +[6]: GigabitEthernet5/0/1 /if/names +[7]: GigabitEthernet5/0/2 /if/names +[8]: GigabitEthernet5/0/3 /if/names +[9]: TwentyFiveGigabitEthernet11/0/0 /if/names +[10]: TwentyFiveGigabitEthernet11/0/1 /if/names +[11]: tap2 /if/names +[12]: TenGigabitEthernet3/0/1.1 /if/names +[13]: tap2.1 /if/names +[14]: TenGigabitEthernet3/0/1.2 /if/names +[15]: tap2.2 /if/names +[16]: TenGigabitEthernet3/0/1.3 /if/names +[17]: tap2.3 /if/names +[18]: tap3 /if/names +[19]: tap4 /if/names +``` + +Alright! Clearly, the `/if/` prefix is the one I'm looking for. I find a Python library that allows +for this data to be MMAPd and directly read as a dictionary, including some neat aggregation +functions (see `src/vpp-api/python/vpp_papi/vpp_stats.py`): + +``` +Counters can be accessed in either dimension. +stat['/if/rx'] - returns 2D lists +stat['/if/rx'][0] - returns counters for all interfaces for thread 0 +stat['/if/rx'][0][1] - returns counter for interface 1 on thread 0 +stat['/if/rx'][0][1]['packets'] - returns the packet counter + for interface 1 on thread 0 +stat['/if/rx'][:, 1] - returns the counters for interface 1 on all threads +stat['/if/rx'][:, 1].packets() - returns the packet counters for + interface 1 on all threads +stat['/if/rx'][:, 1].sum_packets() - returns the sum of packet counters for + interface 1 on all threads +stat['/if/rx-miss'][:, 1].sum() - returns the sum of packet counters for + interface 1 on all threads for simple counters +``` + +Alright, so let's grab that file and refactor it into a small library for me to use, I do +this in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)]. + +### VPP's API + +In a previous project, I already got a little bit of exposure to the Python API (`vpp_papi`), +and it's pretty straight forward to use. Each API is published in a JSON file in +`/usr/share/vpp/api/{core,plugins}/` and those can be read by the Python library and +exposed to callers. This gives me full programmatic read/write access to the VPP runtime +configuration, which is super cool. + +There are dozens of APIs to call (the Linux CP plugin even added one!), and in the case +of enumerating interfaces, we can see the definition in `core/interface.api.json` where +there is an element called `services.sw_interface_dump` which shows its reply is +`sw_interface_details`, and in that message we can see all the fields that will be +set in the request and all that will be present in the response. Nice! 
Here's a quick +demonstration: + +```python +from vpp_papi import VPPApiClient +import os +import fnmatch +import sys + +vpp_json_dir = '/usr/share/vpp/api/' + +# construct a list of all the json api files +jsonfiles = [] +for root, dirnames, filenames in os.walk(vpp_json_dir): + for filename in fnmatch.filter(filenames, '*.api.json'): + jsonfiles.append(os.path.join(root, filename)) + +vpp = VPPApiClient(apifiles=jsonfiles, server_address='/run/vpp/api.sock') +vpp.connect("test-client") + +v = vpp.api.show_version() +print('VPP version is %s' % v.version) + +iface_list = vpp.api.sw_interface_dump() +for iface in iface_list: + print("idx=%d name=%s mac=%s mtu=%d flags=%d" % (iface.sw_if_index, + iface.interface_name, iface.l2_address, iface.mtu[0], iface.flags)) +``` + +The output: +``` +$ python3 vppapi-test.py +VPP version is 21.10-rc0~325-g4976c3b72 +idx=0 name=local0 mac=00:00:00:00:00:00 mtu=0 flags=0 +idx=1 name=TenGigabitEthernet3/0/0 mac=68:05:ca:32:46:14 mtu=9000 flags=0 +idx=2 name=TenGigabitEthernet3/0/1 mac=68:05:ca:32:46:15 mtu=1500 flags=3 +idx=3 name=TenGigabitEthernet3/0/2 mac=68:05:ca:32:46:16 mtu=9000 flags=1 +idx=4 name=TenGigabitEthernet3/0/3 mac=68:05:ca:32:46:17 mtu=9000 flags=1 +idx=5 name=GigabitEthernet5/0/0 mac=a0:36:9f:c8:a0:54 mtu=9000 flags=0 +idx=6 name=GigabitEthernet5/0/1 mac=a0:36:9f:c8:a0:55 mtu=9000 flags=0 +idx=7 name=GigabitEthernet5/0/2 mac=a0:36:9f:c8:a0:56 mtu=9000 flags=0 +idx=8 name=GigabitEthernet5/0/3 mac=a0:36:9f:c8:a0:57 mtu=9000 flags=0 +idx=9 name=TwentyFiveGigabitEthernet11/0/0 mac=6c:b3:11:20:e0:c4 mtu=9000 flags=0 +idx=10 name=TwentyFiveGigabitEthernet11/0/1 mac=6c:b3:11:20:e0:c6 mtu=9000 flags=0 +idx=11 name=tap2 mac=02:fe:07:ae:31:c3 mtu=1500 flags=3 +idx=12 name=TenGigabitEthernet3/0/1.1 mac=00:00:00:00:00:00 mtu=1500 flags=3 +idx=13 name=tap2.1 mac=00:00:00:00:00:00 mtu=1500 flags=3 +idx=14 name=TenGigabitEthernet3/0/1.2 mac=00:00:00:00:00:00 mtu=1500 flags=3 +idx=15 name=tap2.2 mac=00:00:00:00:00:00 mtu=1500 flags=3 +idx=16 name=TenGigabitEthernet3/0/1.3 mac=00:00:00:00:00:00 mtu=1500 flags=3 +idx=17 name=tap2.3 mac=00:00:00:00:00:00 mtu=1500 flags=3 +idx=18 name=tap3 mac=02:fe:95:db:3f:c4 mtu=9000 flags=3 +idx=19 name=tap4 mac=02:fe:17:06:fc:af mtu=9000 flags=3 +``` + +So I added a little abstration with some error handling and one main function +to return interfaces as a Python dictionary of those `sw_interface_details` +tuples in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)]. + +### AgentX + +Now that we are able to enumerate the interfaces and their metadata (like admin/oper +status, link speed, name, index, MAC address, and what have you), and as well +the highly sought after interface statistics as 64bit counters (with a wealth of +extra information like broadcast/multicast/unicast packets, octets received and +transmitted, errors and drops). I am ready to tie things together. + +It took a bit of sleuthing, but I eventually found a library on sourceforge (!) +that has a rudimentary implementation of [RFC 2741](https://datatracker.ietf.org/doc/html/rfc2741) +which is the SNMP Agent Extensibility (AgentX) Protocol. In a nutshell, this allows +an external program to connect to the main SNMP daemon, register an interest in +certain OIDs, and get called whenever the SNMPd is being queried for them. + +The flow is pretty simple (see section 6.2 of the RFC), the Agent (client): +1. opens a TCP or Unix domain socket to the SNMPd +1. 
sends an Open PDU, which the server will respond or reject. +1. (optionally) can send a Ping PDU, the server will respond. +1. registers an interest with Register PDU + +It then waits and gets called by the SNMPd with Get PDUs (to retrieve one +single value), GetNext PDU (to enable snmpwalk), GetBulk PDU (to retrieve a whole +subsection of the MIB), all of which are answered by a Response PDU. + +If the Agent is to support writing, it will also have to implement TestSet, CommitSet, +CommitUndoSet and CommitCleanupSet PDUs. For this agent, we don't need to implement +those, so I'll just ignore those requests and implement the read-only stuff. Sounds easy :) + +The first order of business is to create the values for two main MIBs of interest: + +1. `.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable.` - This table is an older variant + and it contains a bunch of relevant fields, one per interface, notably `ifIndex`, + `ifName`, `ifType`, `ifMtu`, `ifSpeed`, `ifPhysAddress`, `ifOperStatus`, `ifAdminStatus` + and a bunch of 32bit counters for octets/packets in and out of the interfaces. +1. `.iso.org.dod.internet.mgmt.mib-2.ifMIB.ifMIBObjects.ifXTable.` - This table is a makeover + of the other one (the **X** here stands for eXtra), and adds a few 64 bit counters + for the interface stats, and as well an `ifHighSpeed` which is in megabits instead of + kilobits in the previous MIB. + +Populating these MIBs can be done periodically by retrieving the interfaces from VPP and +then simply walking the dictionary with Stats Segment data. I then register these two +main MIB entrypoints with SNMPd as I connect to it, and spit out the correct values +once asked with `GetPDU` or `GetNextPDU` requests, by issuing a corresponding `ResponsePDU` +to the SNMP server -- it takes care of all the rest! + +The resulting code is in [[this +commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)] +but you can also check out the whole thing on +[[Github](https://github.com/pimvanpelt/vpp-snmp-agent)]. + +### Building + +Shipping a bunch of Python files around is not ideal, so I decide to build this stuff +together in a binary that I can easily distribute to my machines: I just simply install +`pyinstaller` with _PIP_ and run it: + +``` +sudo pip install pyinstaller +pyinstaller vpp-snmp-agent.py --onefile + +## Run it on console +dist/vpp-snmp-agent -h +usage: vpp-snmp-agent [-h] [-a ADDRESS] [-p PERIOD] [-d] + +optional arguments: + -h, --help show this help message and exit + -a ADDRESS Location of the SNMPd agent (unix-path or host:port), default localhost:705 + -p PERIOD Period to poll VPP, default 30 (seconds) + -d Enable debug, default False + + +## Install +sudo cp dist/vpp-snmp-agent /usr/sbin/ +``` + +### Running + +After installing `Net-SNMP`, the default in Ubuntu, I do have to ensure that it runs in +the correct namespace. So what I do is disable the systemd unit that ships with the Ubuntu +package, and instead create these: + +``` +pim@hippo:~/src/vpp-snmp-agentx$ cat < EOF | sudo tee /usr/lib/systemd/system/netns-dataplane.service +[Unit] +Description=Dataplane network namespace +After=systemd-sysctl.service network-pre.target +Before=network.target network-online.target + +[Service] +Type=oneshot +RemainAfterExit=yes + +# PrivateNetwork will create network namespace which can be +# used in JoinsNamespaceOf=. 
+PrivateNetwork=yes + +# To set `ip netns` name for this namespace, we create a second namespace +# with required name, unmount it, and then bind our PrivateNetwork +# namespace to it. After this we can use our PrivateNetwork as a named +# namespace in `ip netns` commands. +ExecStartPre=-/usr/bin/echo "Creating dataplane network namespace" +ExecStart=-/usr/sbin/ip netns delete dataplane +ExecStart=-/usr/bin/mkdir -p /etc/netns/dataplane +ExecStart=-/usr/bin/touch /etc/netns/dataplane/resolv.conf +ExecStart=-/usr/sbin/ip netns add dataplane +ExecStart=-/usr/bin/umount /var/run/netns/dataplane +ExecStart=-/usr/bin/mount --bind /proc/self/ns/net /var/run/netns/dataplane +# Apply default sysctl for dataplane namespace +ExecStart=-/usr/sbin/ip netns exec dataplane /usr/lib/systemd/systemd-sysctl +ExecStop=-/usr/sbin/ip netns delete dataplane + +[Install] +WantedBy=multi-user.target +WantedBy=network-online.target +EOF + +pim@hippo:~/src/vpp-snmp-agentx$ cat < EOF | sudo tee /usr/lib/systemd/system/snmpd-dataplane.service +[Unit] +Description=Simple Network Management Protocol (SNMP) Daemon. +After=network.target +ConditionPathExists=/etc/snmp/snmpd.conf + +[Service] +Type=simple +ExecStartPre=/bin/mkdir -p /var/run/agentx-dataplane/ +NetworkNamespacePath=/var/run/netns/dataplane +ExecStart=/usr/sbin/snmpd -LOw -u Debian-snmp -g vpp -I -smux,mteTrigger,mteTriggerConf -f -p /run/snmpd-dataplane.pid +ExecReload=/bin/kill -HUP \$MAINPID + +[Install] +WantedBy=multi-user.target +EOF + +pim@hippo:~/src/vpp-snmp-agentx$ cat < EOF | sudo tee /usr/lib/systemd/system/vpp-snmp-agent.service +[Unit] +Description=SNMP AgentX Daemon for VPP dataplane statistics +After=network.target +ConditionPathExists=/etc/snmp/snmpd.conf + +[Service] +Type=simple +NetworkNamespacePath=/var/run/netns/dataplane +ExecStart=/usr/sbin/vpp-snmp-agent +Group=vpp +ExecReload=/bin/kill -HUP \$MAINPID +Restart=on-failure +RestartSec=5s + +[Install] +WantedBy=multi-user.target +EOF +``` + +Note the use of `NetworkNamespacePath` here -- this ensures that the snmpd and its agent both +run in the `dataplane` namespace which was created by `netns-dataplane.service`. + +## Results + +I now install the binary and, using the `snmpd.conf` configuration file (see Appendix): + +``` +pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl stop snmpd +pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl disable snmpd +pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl daemon-reload +pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl enable netns-dataplane +pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl start netns-dataplane +pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl enable snmpd-dataplane +pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl start snmpd-dataplane +pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl enable vpp-snmp-agent +pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl start vpp-snmp-agent + +pim@hippo:~/src/vpp-snmp-agentx$ sudo journalctl -u vpp-snmp-agent +[INFO ] agentx.agent - run : Calling setup +[INFO ] agentx.agent - setup : Connecting to VPP Stats... 
+[INFO ] agentx.vppapi - connect : Connecting to VPP
+[INFO ] agentx.vppapi - connect : VPP version is 21.10-rc0~325-g4976c3b72
+[INFO ] agentx.agent - run : Initial update
+[INFO ] agentx.network - update : Setting initial serving dataset (740 OIDs)
+[INFO ] agentx.agent - run : Opening AgentX connection
+[INFO ] agentx.network - connect : Connecting to localhost:705
+[INFO ] agentx.network - start : Registering: 1.3.6.1.2.1.2.2.1
+[INFO ] agentx.network - start : Registering: 1.3.6.1.2.1.31.1.1.1
+[INFO ] agentx.network - update : Replacing serving dataset (740 OIDs)
+[INFO ] agentx.network - update : Replacing serving dataset (740 OIDs)
+[INFO ] agentx.network - update : Replacing serving dataset (740 OIDs)
+[INFO ] agentx.network - update : Replacing serving dataset (740 OIDs)
+```
+
+{{< image width="800px" src="/assets/vpp/librenms.png" alt="LibreNMS" >}}
+
+## Appendix
+
+#### SNMPd Config
+
+```
+$ cat << EOF | sudo tee /etc/snmp/snmpd.conf
+com2sec readonly default <>
+
+group MyROGroup v2c readonly
+view all included .1 80
+access MyROGroup "" any noauth exact all none none
+
+sysLocation Ruemlang, Zurich, Switzerland
+sysContact noc@ipng.ch
+
+master agentx
+agentXSocket tcp:localhost:705,unix:/var/agentx/master,unix:/run/vpp/agentx.sock
+
+agentaddress udp:161,udp6:161
+
+#OS Distribution Detection
+extend distro /usr/bin/distro
+
+#Hardware Detection
+extend manufacturer '/bin/cat /sys/devices/virtual/dmi/id/sys_vendor'
+extend hardware '/bin/cat /sys/devices/virtual/dmi/id/product_name'
+extend serial '/bin/cat /var/run/snmpd.serial'
+EOF
+```
+
+Note the use of a few helpers here - `/usr/bin/distro` comes from LibreNMS [ref](https://docs.librenms.org/Support/SNMP-Configuration-Examples/)
+and tries to figure out what distribution is used. The very last line of that file
+echoes the distribution it found, to which I prepend a string, like `echo "VPP ${OSSTR}"`.
+The other file of interest, `/var/run/snmpd.serial`, is computed at boot time by running
+the following in `/etc/rc.local`:
+
+```
+# Assemble serial number for snmpd
+BS=$(cat /sys/devices/virtual/dmi/id/board_serial)
+PS=$(cat /sys/devices/virtual/dmi/id/product_serial)
+echo $BS.$PS > /var/run/snmpd.serial
+```
+
+I have to do this because SNMPd runs as a non-privileged user, yet those DMI elements are
+root-readable only (for reasons that are beyond me). Seeing as they will not change at
+runtime anyway, I just create that file and cat it into the `serial` field. It then shows
+up nicely in LibreNMS alongside the others.
+
+{{< image width="200px" float="left" src="/assets/vpp/vpp.png" alt="VPP Hound" >}}
+
+Oh, and one last thing. The VPP Hound logo!
+
+In LibreNMS, the icons in the _devices_ view use a function that leverages this `distro`
+field: it takes the first word (in our case "VPP") and looks for an icon with that name and
+an extension of either .svg or .png in an icons directory, usually `html/images/os/`. I dropped
+the hound from the [fd.io](https://fd.io/) homepage in there, and will add the icon upstream for
+future use, in this [[librenms PR](https://github.com/librenms/librenms/pull/13230)] and its companion
+change to [[librenms-agent PR](https://github.com/librenms/librenms-agent/pull/374)]. 
diff --git a/content/articles/2021-09-21-vpp-7.md b/content/articles/2021-09-21-vpp-7.md new file mode 100644 index 0000000..23d5aae --- /dev/null +++ b/content/articles/2021-09-21-vpp-7.md @@ -0,0 +1,567 @@ +--- +date: "2021-09-21T00:41:14Z" +title: VPP Linux CP - Part7 +--- + + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. One thing notably missing, is the higher level control plane, that is +to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a +VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network +devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols +like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet +forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use +VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the +interface state (links, addresses and routes) itself. When the plugin is completed, running software +like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving +>100Mpps and >100Gbps forwarding rates will be well in reach! + +## Running in Production + +In the first articles from this series, I showed the code that needed to be written to implement the +**Control Plane** and **Netlink Listener** plugins. In the [penultimate post]({% post_url 2021-09-10-vpp-6 %}), +I wrote an SNMP Agentx that exposes the VPP interface data to, say, LibreNMS. + +But what are the things one might do to deploy a router end-to-end? That is the topic of this post. + +### A note on hardware + +Before I get into the details, here's some specifications on the router hardware that I use at +IPng Networks (AS50869). See more about our network [here]({% post_url 2021-02-27-network %}). + +The chassis is a Supermicro SYS-5018D-FN8T, which includes: +* Full IPMI support (power, serial-over-lan and kvm-over-ip with HTML5), on a dedicated network port. +* A 4-core, 8-thread Xeon D1518 CPU which runs at 35W +* Two independent Intel i210 NICs (Gigabit) +* A Quad Intel i350 NIC (Gigabit) +* Two Intel X552 (TenGig) +* (optional) One Intel X710 Quad-TenGig NIC in the expansion bus +* m.SATA 120G boot SSD +* 2x16GB of ECC RAM + +The only downside for this machine is that it has only one power supply, so datacenters which do +periodical feed-maintenance (such as Interxion is known to do), are likely to reboot the machine +from time to time. However, the machine is very well spec'd for VPP in "low" performance scenarios. +A machine like this is very affordable (I bought the chassis for about USD 800,- a piece) but its +CPU/Memory/PCIe construction is enough to provide forwarding at approximately 35Mpps. + +Doing a lazy 1Mpps on this machine's Xeon D1518, VPP comes in at ~660 clocks per packet with a vector +length of ~3.49. 
This means that if I dedicate 3 cores running at 2200MHz to VPP (leaving 1C2T for +the controlplane), this machine has a forwarding capacity of ~34.7Mpps, which fits really well with +the Intel X710 NICs (which are limited to 40Mpps [[ref](https://trex-tgn.cisco.com/trex/doc/trex_faq.html)]). + +A reasonable step-up from here would be Supermicro's SIS810 with a Xeon E-2288G (8 cores / 16 threads) +which carries a dual-PSU, up to 8x Intel i210 NICs and 2x Intel X710 Quad-Tengigs, but it's quite +a bit more expensive. I commit to do that the day AS50869 is forwarding 10Mpps in practice :-) + +### Install HOWTO + +First, I install the "canonical" (pun intended) operating system that VPP is most comfortable running +on: Ubuntu 20.04.3. Nothing special selected when installing and after the install is done, I make sure +that GRUB uses the serial IPMI port by adding to `/etc/default/grub`: + +``` +GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 isolcpus=1,2,3,5,6,7" +GRUB_TERMINAL=serial +GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1" + +# Followed by a gratuitous install and update +grub-install /dev/sda +update-grub +``` + +Note that the `isolcpus` is a neat trick that tells the Linux task scheduler to avoid scheduling any +workloads on those CPUs. Because the Xeon-D1518 has 4 cores (0,1,2,3) and 4 additional hyperthreads +(4,5,6,7), this stanza effectively makes core 1,2,3 unavailable to Linux, leaving only core 0 and its +hyperthread 4 are available. This means that our controlplane will have 2 CPUs available to run things +like Bird, SNMP, SSH etc, while hyperthreading is essentially turned off on CPU 1,2,3 giving +those cores entirely to VPP. + +In case you were wondering why I would turn off hyperthreading in this way: hyperthreads share +CPU instruction and data cache. The premise of VPP is that a `vector` (a list) of packets will +go through the same routines (like `ethernet-input` or `ip4-lookup`) all at once. In such a +computational model, VPP leverages the i-cache and d-cache to have subsequent packets make use +of the warmed up cache from their predecessor, without having to use the (much slower, relatively +speaking) main memory. + +The last thing you'd want, is for the hyperthread to come along and replace the cache contents with +what-ever it's doing (be it Linux tasks, or another VPP thread). + +So: disaallowing scheduling on 1,2,3 and their counterpart hyperthreads 5,6,7 AND constraining +VPP to run only on lcore 1,2,3 will essentially maximize the CPU cache hitrate for VPP, greatly +improving performance. + +### Network Namespace + +Originally proposed by [TNSR](https://netgate.com/tnsr), a Netgate commercial productionization of +VPP, it's a good idea to run VPP and its controlplane in a separate Linux network +[namespace](https://man7.org/linux/man-pages/man8/ip-netns.8.html). A network namespace is +logically another copy of the network stack, with its own routes, firewall rules, and network +devices. + +Creating a namespace looks like follows, on a machine running `systemd`, like Ubuntu or Debian: + +``` +cat << EOF | sudo tee /usr/lib/systemd/system/netns-dataplane.service +[Unit] +Description=Dataplane network namespace +After=systemd-sysctl.service network-pre.target +Before=network.target network-online.target + +[Service] +Type=oneshot +RemainAfterExit=yes + +# PrivateNetwork will create network namespace which can be +# used in JoinsNamespaceOf=. 
+PrivateNetwork=yes + +# To set `ip netns` name for this namespace, we create a second namespace +# with required name, unmount it, and then bind our PrivateNetwork +# namespace to it. After this we can use our PrivateNetwork as a named +# namespace in `ip netns` commands. +ExecStartPre=-/usr/bin/echo "Creating dataplane network namespace" +ExecStart=-/usr/sbin/ip netns delete dataplane +ExecStart=-/usr/bin/mkdir -p /etc/netns/dataplane +ExecStart=-/usr/bin/touch /etc/netns/dataplane/resolv.conf +ExecStart=-/usr/sbin/ip netns add dataplane +ExecStart=-/usr/bin/umount /var/run/netns/dataplane +ExecStart=-/usr/bin/mount --bind /proc/self/ns/net /var/run/netns/dataplane +# Apply default sysctl for dataplane namespace +ExecStart=-/usr/sbin/ip netns exec dataplane /usr/lib/systemd/systemd-sysctl +ExecStop=-/usr/sbin/ip netns delete dataplane + +[Install] +WantedBy=multi-user.target +WantedBy=network-online.target +EOF +sudo systemctl daemon-reload +sudo systemctl enable netns-dataplane +sudo systemctl start netns-dataplane +``` + +Now, every time we reboot the system, a new network namespace will exist with the +name `dataplane`. That's where you've seen me create interfaces in my previous posts, +and that's where our life-as-a-VPP-router will be born. + +### Preparing the machine + +After creating the namespace, I'll install a bunch of useful packages and further prepare +the machine, but also I'm going to remove a few out-of-the-box installed packages: + +``` +## Remove what we don't need +sudo apt purge cloud-init snapd + +## Usual tools for Linux +sudo apt install rsync net-tools traceroute snmpd snmp iptables ipmitool bird2 lm-sensors + +## And for VPP +sudo apt install libmbedcrypto3 libmbedtls12 libmbedx509-0 libnl-3-200 libnl-route-3-200 \ + libnuma1 python3-cffi python3-cffi-backend python3-ply python3-pycparser libsubunit0 + +## Disable Bird and SNMPd because it will be running in another namespace +for i in bird snmpd; do + sudo systemctl stop $i + sudo systemctl disable $i + sudo systemctl mask $i +done + +# Ensure all temp/fan sensors are detected +sensors-detect --auto +``` + +### Installing VPP + +After [building](https://fdio-vpp.readthedocs.io/en/latest/gettingstarted/developers/building.html) +the code, specifically after issuing a successful `make pkg-deb`, a set of Debian packages +will be in the `build-root` sub-directory. 
Take these and install them like so: + +``` +## Install VPP +sudo mkdir -p /var/log/vpp/ +sudo dpkg -i *.deb + +## Reserve 6GB (3072 x 2MB) of memory for hugepages +cat << EOF | sudo tee /etc/sysctl.d/80-vpp.conf +vm.nr_hugepages=3072 +vm.max_map_count=7168 +vm.hugetlb_shm_group=0 +kernel.shmmax=6442450944 +EOF + +## Set 64MB netlink buffer size +cat << EOF | sudo tee /etc/sysctl.d/81-vpp-netlink.conf +net.core.rmem_default=67108864 +net.core.wmem_default=67108864 +net.core.rmem_max=67108864 +net.core.wmem_max=67108864 +EOF + +## Apply these sysctl settings +sudo sysctl -p -f /etc/sysctl.d/80-vpp.conf +sudo sysctl -p -f /etc/sysctl.d/81-vpp-netlink.conf + +## Add user to relevant groups +sudo adduser $USER bird +sudo adduser $USER vpp +``` + +Next up, I make a backup of the original, and then create a reasonable startup configuration +for VPP: +``` +## Create suitable startup configuration for VPP +cd /etc/vpp +sudo cp startup.conf startup.conf.orig + +cat << EOF | sudo tee startup.conf +unix { + nodaemon + log /var/log/vpp/vpp.log + full-coredump + cli-listen /run/vpp/cli.sock + gid vpp + exec /etc/vpp/bootstrap.vpp +} + +api-trace { on } +api-segment { gid vpp } +socksvr { default } + +memory { + main-heap-size 1536M + main-heap-page-size default-hugepage +} + +cpu { + main-core 0 + workers 3 +} + +buffers { + buffers-per-numa 128000 + default data-size 2048 + page-size default-hugepage +} + +statseg { + size 1G + page-size default-hugepage + per-node-counters off +} + +plugins { + plugin lcpng_nl_plugin.so { enable } + plugin lcpng_if_plugin.so { enable } +} + +logging { + default-log-level info + default-syslog-log-level notice +} +EOF +``` + +A few notes specific to my hardware configuration: +* the `cpu` stanza says to run the main thread on CPU 0, and then run three workers (on + CPU 1,2,3; the ones for which I disabled the Linux scheduler by means of `isolcpus`). So + CPU 0 and its hyperthread CPU 4 are available for Linux to schedule on, while there are + three full cores dedicated to forwarding. This will ensure very low latency/jitter and + predictably high throughput! +* HugePages are a memory optimization mechanism in Linux. In virtual memory management, + the kernel maintains a table in which it has a mapping of the virtual memory address + to a physical address. For every page transaction, the kernel needs to load related + mapping. If you have small size pages then you need to load more numbers of pages + resulting kernel to load more mapping tables. This decreases performance. I set these + to a larger size of 2MB (the default is 4KB), reducing mapping load and thereby + considerably improving performance. +* I need to ensure there's enough _Stats Segment_ memory available - each worker thread + keeps counters of each prefix, and with a full BGP table (weighing in at 1M prefixes + in Q3'21), the amount of memory needed is substantial. Similarly, I need to ensure there + are sufficient _Buffers_ available. + +Finally, observe the stanza `unix { exec /etc/vpp/bootstrap.vpp }` and this is a way for me to +tell VPP to run a bunch of CLI commands as soon as it starts. This ensures that if VPP were to +crash, or the machine were to reboot (more likely :-), that VPP will start up with a working +interface and IP address configuration, and any other things I might want VPP to do (like +bridge-domains). 
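+
+To sanity-check the memory settings above, here's a quick back-of-the-envelope calculation (a
+sketch only: it assumes a single NUMA node and ignores per-buffer metadata overhead) showing that
+the hugepage reservation from the sysctl file comfortably covers the main heap, the stats segment
+and the packet buffers configured in `startup.conf`:
+
+```python
+#!/usr/bin/env python3
+"""Rough check: do the hugepage-backed allocations fit in the vm.nr_hugepages reservation?"""
+
+MiB = 1024 * 1024
+
+reserved = 3072 * 2 * MiB      # vm.nr_hugepages=3072, 2MB hugepages
+main_heap = 1536 * MiB         # memory { main-heap-size 1536M }
+statseg = 1024 * MiB           # statseg { size 1G }
+buffers = 128000 * 2048        # buffers-per-numa 128000 x data-size 2048 (one NUMA node assumed)
+
+needed = main_heap + statseg + buffers
+print(f"reserved : {reserved // MiB:5d} MiB")
+print(f"needed   : {needed // MiB:5d} MiB (plus per-buffer metadata)")
+print(f"headroom : {(reserved - needed) // MiB:5d} MiB")
+```
+
+With these numbers there's roughly 3.3GB of headroom left in the hugepage reservation.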
+ +A note on VPP's binding of interfaces: by default, VPP's `dpdk` driver will acquire any interface +from Linux that is not in use (which means: any interface that is admin-down/unconfigured). +To make sure that VPP gets all interfaces, I will remove `/etc/netplan/*` (or in Debian's case, +`/etc/network/interfaces`). This is why Supermicro's KVM and serial-over-lan are so valuable, as +they allow me to log in and deconfigure the entire machine, in order to yield all interfaces +to VPP. They also allow me to reinstall or switch from DANOS to Ubuntu+VPP on a server that's +700km away. + +Anyway, I can start VPP simply like so: + +``` +sudo rm -f /etc/netplan/* +sudo rm -f /etc/network/interfaces +## Set any link to down, or reboot the machine and access over KVM or Serial + +sudo systemctl restart vpp +vppctl show interface +``` + +See all interfaces? Great. Moving on :) + +### Configuring VPP + +I set a VPP interface configuration (which it'll read and apply any time it starts or restarts, +thereby making the configuration persistent across crashes and reboots). Using the `exec` +stanza described above, the contents now become, taking as an example, our first router in +Lille, France [[details]({% post_url 2021-05-28-lille %})], configured as so: + +``` +cat << EOF | sudo tee /etc/vpp/bootstrap.vpp +set logging class linux-cp rate-limit 1000 level warn syslog-level notice + +lcp default netns dataplane +lcp lcp-sync on +lcp lcp-auto-subint on + +create loopback interface instance 0 +lcp create loop0 host-if loop0 +set interface state loop0 up +set interface ip address loop0 194.1.163.34/32 +set interface ip address loop0 2001:678:d78::a/128 + +lcp create TenGigabitEthernet4/0/0 host-if xe0-0 +lcp create TenGigabitEthernet4/0/1 host-if xe0-1 +lcp create TenGigabitEthernet6/0/0 host-if xe1-0 +lcp create TenGigabitEthernet6/0/1 host-if xe1-1 +lcp create TenGigabitEthernet6/0/2 host-if xe1-2 +lcp create TenGigabitEthernet6/0/3 host-if xe1-3 + +lcp create GigabitEthernetb/0/0 host-if e1-0 +lcp create GigabitEthernetb/0/1 host-if e1-1 +lcp create GigabitEthernetb/0/2 host-if e1-2 +lcp create GigabitEthernetb/0/3 host-if e1-3 +EOF +``` + +This base-line configuration will: +* Ensure all host interfaces are created in namespace `dataplane` which we created earlier +* Turn on `lcp-sync`, which copies forward any configuration from VPP into Linux (see + [VPP Part 2]({% post_url 2021-08-13-vpp-2 %})) +* Turn on `lcp-auto-subint`, which automatically creates _LIPs_ (Linux interface pairs) + for all sub-interfaces (see [VPP Part 3]({% post_url 2021-08-15-vpp-3 %})) +* Create a loopback interface, give it IPv4/IPv6 addresses, and expose it to Linux +* Create one _LIP_ interface for four of the Gigabit and all 6x TenGigabit interfaces +* Leave 2 interfaces (`GigabitEthernet7/0/0` and `GigabitEthernet8/0/0`) for later + +Further, sub-interfaces and bridge-groups might be configured as such: +``` +comment { Infra: er01.lil01.ip-max.net Te0/0/0/6 } +set interface mtu packet 9216 TenGigabitEthernet6/0/2 +set interface state TenGigabitEthernet6/0/2 up + +create sub TenGigabitEthernet6/0/2 100 +set interface mtu packet 9000 TenGigabitEthernet6/0/2.100 +set interface state TenGigabitEthernet6/0/2.100 up +set interface ip address TenGigabitEthernet6/0/2.100 194.1.163.30/31 +set interface unnumbered TenGigabitEthernet6/0/2.100 use loop0 + +comment { Infra: Bridge Domain for mgmt } +create bridge-domain 1 +create loopback interface instance 1 +lcp create loop1 host-if bvi1 +set interface ip address loop1 
192.168.0.81/29 +set interface ip address loop1 2001:678:d78::1:a:1/112 +set interface l2 bridge loop1 1 bvi +set interface l2 bridge GigabitEthernet7/0/0 1 +set interface l2 bridge GigabitEthernet8/0/0 1 +set interface state GigabitEthernet7/0/0 up +set interface state GigabitEthernet8/0/0 up +set interface state loop1 up +``` + +Particularly the last stanza, creating a bridge-domain, will remind Cisco operators +of the same semantics on the ASR9k and IOS/XR operating system. What it does is create +a bridge with two physical interfaces, and one so-called _bridge virtual interface_ +which I expose to Linux as `bvi1`, with an IPv4 and IPv6 address. Beautiful! + +### Configuring Bird + +Now that VPP's interfaces are up, which I can validate with both `vppctl show int addr` +and as well `sudo ip netns exec dataplane ip addr`, I am ready to configure Bird and +put the router in the _default free zone_ (ie. run BGP on it): + +``` +cat << EOF > /etc/bird/bird.conf +router id 194.1.163.34; + +protocol device { scan time 30; } +protocol direct { ipv4; ipv6; check link yes; } +protocol kernel kernel4 { + ipv4 { import none; export where source != RTS_DEVICE; }; + learn off; + scan time 300; +} +protocol kernel kernel6 { + ipv6 { import none; export where source != RTS_DEVICE; }; + learn off; + scan time 300; +} + +include "static.conf"; +include "core/ospf.conf"; +include "core/ibgp.conf"; +EOF +``` + +The most important thing to note in the configuration is that Bird tends to add a route +for all of the connected interfaces, while Linux has already added those. Therefore, I +avoid the source `RTS_DEVICE`, which means "connected routes", but otherwise offer all +routes to the kernel, which in turn propagates these as Netlink messages which are +consumed by VPP. A detailed discussion of Bird's configuration semantics is in my +[VPP Part 5]({% post_url 2021-09-02-vpp-5 %}) post. + +### Configuring SSH + +While Ubuntu (or Debian) will start an SSH daemon upon startup, they will do this in the +default namespace. However, our interfaces (like `loop0` or `xe1-2.100` above) are configured +to be present in the `dataplane` namespace. Therefor, I'll add a second SSH +daemon that runs specifically in the alternate namespace, like so: + +``` +cat << EOF | sudo tee /usr/lib/systemd/system/ssh-dataplane.service +[Unit] +Description=OpenBSD Secure Shell server (Dataplane Namespace) +Documentation=man:sshd(8) man:sshd_config(5) +After=network.target auditd.service +ConditionPathExists=!/etc/ssh/sshd_not_to_be_run +Requires=netns-dataplane.service +After=netns-dataplane.service + +[Service] +EnvironmentFile=-/etc/default/ssh +ExecStartPre=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd -t +ExecStart=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd -oPidFile=/run/sshd-dataplane.pid -D $SSHD_OPTS +ExecReload=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd -t +ExecReload=/usr/sbin/ip netns exec dataplane /bin/kill -HUP $MAINPID +KillMode=process +Restart=on-failure +RestartPreventExitStatus=255 +Type=notify +RuntimeDirectory=sshd +RuntimeDirectoryMode=0755 + +[Install] +WantedBy=multi-user.target +Alias=sshd-dataplane.service +EOF + +sudo systemctl enable ssh-dataplane +sudo systemctl start ssh-dataplane +``` + +And with that, our loopback address, and indeed any other interface created in the +`dataplane` namespace, will accept SSH connections. Yaay! + +### Configuring SNMPd + +At IPng Networks, we use [LibreNMS](https://librenms.org/) to monitor our machines and +routers in production. 
Similar to SSH, I want the snmpd (which we disabled all the way +at the top of this article), to be exposed in the `dataplane` namespace. However, that +namespace will have interfaces like `xe0-0` or `loop0` or `bvi1` configured, and it's +important to note that Linux will only see those packets that were _punted_ by VPP, that +is to say, those packets which were destined to any IP address configured on the control +plane. Any traffic going _through_ VPP will never be seen by Linux! So, I'll have to be +clever and count this traffic by polling VPP instead. This was the topic of my previous +[VPP Part 6]({% post_url 2021-09-10-vpp-6 %}) about the SNMP Agent. All of that code +was released to [Github](https://github.com/pimvanpelt/vpp-snmp-agent), notably there's +a hint there for an `snmpd-dataplane.service` and a `vpp-snmp-agent.service`, including +the compiled binary that reads from VPP and feeds this to SNMP. + +Then, the SNMP daemon configuration file, assuming `net-snmp` (the default for Ubuntu and +Debian) which was installed in the very first step above, I'll yield the following simple +configuration file: + +``` +cat << EOF | tee /etc/snmp/snmpd.conf +com2sec readonly default public +com2sec6 readonly default public + +group MyROGroup v2c readonly +view all included .1 80 +# Don't serve ipRouteTable and ipCidrRouteEntry (they're huge) +view all excluded .1.3.6.1.2.1.4.21 +view all excluded .1.3.6.1.2.1.4.24 +access MyROGroup "" any noauth exact all none none + +sysLocation Rue des Saules, 59262 Sainghin en Melantois, France +sysContact noc@ipng.ch + +master agentx +agentXSocket tcp:localhost:705,unix:/var/agentx/master + +agentaddress udp:161,udp6:161 + +# OS Distribution Detection +extend distro /usr/bin/distro + +# Hardware Detection +extend manufacturer '/bin/cat /sys/devices/virtual/dmi/id/sys_vendor' +extend hardware '/bin/cat /sys/devices/virtual/dmi/id/product_name' +extend serial '/bin/cat /var/run/snmpd.serial' +EOF +``` + +This config assumes that `/var/run/snmpd.serial` exists as a regular file rather than a `/sys` +entry. That's because while the `sys_vendor` and `product_name` fields are easily retrievable +as user from the `/sys` filesystem, for some reason `board_serial` and `product_serial` are +only readable by root, and our SNMPd runs as user `Debian-snmp`. So, I'll just generate this at +boot-time in `/etc/rc.local`, like so: + +``` +cat << EOF | sudo tee /etc/rc.local +#!/bin/sh + +# Assemble serial number for snmpd +BS=\$(cat /sys/devices/virtual/dmi/id/board_serial) +PS=\$(cat /sys/devices/virtual/dmi/id/product_serial) +echo \$BS.\$PS > /var/run/snmpd.serial + +[ -x /etc/rc.firewall ] && /etc/rc.firewall +EOF +sudo chmod 755 /etc/rc.local +sudo /etc/rc.local +sudo systemctl restart snmpd-dataplane +``` + +## Results + +With all of this, I'm ready to pick up the machine in LibreNMS, which looks a bit like +this: + +{{< image width="800px" src="/assets/vpp/librenms.png" alt="LibreNMS" >}} + +Or a specific traffic pattern looking at interfaces: +{{< image width="800px" src="/assets/vpp/librenms-frggh0.png" alt="LibreNMS" >}} + +Clearly, looking at the 17d of ~18Gbit of traffic going through this particular router, +with zero crashes and zero SNMPd / Agent restarts, this thing is a winner: + +``` +pim@frggh0:/etc/bird$ date +Tue 21 Sep 2021 01:26:49 AM UTC + +pim@frggh0:/etc/bird$ ps auxw | grep vpp +root 1294 307 0.1 154273928 44972 ? Rsl Sep04 73578:50 /usr/bin/vpp -c /etc/vpp/startup.conf +Debian-+ 331639 0.2 0.0 21216 11812 ? 
Ss Sep04 22:23 /usr/sbin/snmpd -LOw -u Debian-snmp -g vpp -I -smux mteTrigger mteTriggerConf -f -p /run/snmpd-dataplane.pid +Debian-+ 507638 0.0 0.0 2900 592 ? Ss Sep04 0:00 /usr/sbin/vpp-snmp-agent -a localhost:705 -p 30 +Debian-+ 507659 1.6 0.1 1317772 43508 ? Sl Sep04 2:16 /usr/sbin/vpp-snmp-agent -a localhost:705 -p 30 +pim 510503 0.0 0.0 6432 736 pts/0 S+ 01:25 0:00 grep --color=auto vpp +``` + +Thanks for reading this far :-) + diff --git a/content/articles/2021-10-24-as8298.md b/content/articles/2021-10-24-as8298.md new file mode 100644 index 0000000..c1a7c88 --- /dev/null +++ b/content/articles/2021-10-24-as8298.md @@ -0,0 +1,160 @@ +--- +date: "2021-10-24T08:49:09Z" +title: IPng acquires AS8298 +--- + +# A Brief History + +In January of 2003, my buddy Jeroen announced a project called the [Ghost Route Hunters](/assets/as8298/RIPE44-IPv6-GRH.pdf), after the industry had been plagued for a few years with anomalies in the DFZ - routes would show up with phantom BGP paths, unable to be traced down to a source or faulty implementation. Jeroen presented his [findings](/assets/as8298/RIPE46-IPv6-Routing-Table-Anomalies.pdf) at RIPE-46 and for years after this, the industry used the [SixXS GRH](https://www.sixxs.net/tools/grh/how/) as a distributed looking glass. At the time, one of SixXS's point of presence providers kindly lent the project AS8298 to build this looking glass and underlying infrastructure. + +After running SixXS for 16 years, Jeroen and I decided to [Sunset]({%post_url 2017-03-01-sixxs-sunset %}) it, which meant that in June of 2017, the Ghost Route Hunter project came to an end as well, and as we tore down the infrastructure, AS8298 became dormant. + +Then in August of 2021, I was doing a little bit of cleaning on the IPng Networks serving infrastructure, and came across some old mail from RIPE NCC about that AS number. And while IPng Networks is running [just fine]({%post_url 2021-02-27-network %}) on AS50869 today, it would be just that little bit cooler if it were to run on AS8298. So, I embarked on a journey to move a running ISP into a new AS number, which sounds like fun! This post describes the situation going in to this renumbering project, and there will be another post, likely in January 2022, that describes the retrospective (this future post may be either celebratory, or a huge postmortem, to be determined). + +## The Plan + +#### Step 0. Acquire the AS + +First off, I had to actually acquire the AS number. Back in the days (I'm speaking of 2002), RIPE NCC was a little bit less formal than it is today. As such, our loaned AS number was simply _registered_ to SixXS, which is not a legal entity. Later, to do things properly, it was placed in Jeroen's custody (by creating ORG-SIXX1-RIPE). But precisely because Jeroen nor SixXS was an LIR at that time, the previous holder (Easynet, with LIR `de.easynet`, later acquired by Sky, also called British Sky Broadcasting, trading under LIR `uk.bskyb`) became the sponsoring LIR. So I had to arrange two things: +1. A Transfer Agreement between Jeroen and myself; that signaled his willingness to transfer the AS number to me. This is [boilerplate](https://www.ripe.net/manage-ips-and-asns/resource-transfers-and-mergers/transfer-of-ip-addresses-and-as-numbers/transfer-agreement-template) stuff, and example contracts can be downloaded on the RIPE website. +1. An agreement between Sky and IPng Networks; that signaled the transfer from sponsoring LIR to our LIR `ch.ipng`. 
This is rather non-bureaucratic and a well traveled path; sponsoring LIRs and movements of resource between holders happen all the time. + +With the permission of the previous holder, and with the help of the previous sponsoring LIR, the transfer itself was a matter of filing the correct paperwork at RIPE NCC, quoting the transfer agreement, and providing identification for the offering party (Jeroen) and the receiving party (Pim). And within a matter of a few days, the AS number was transfered to `ORG-PVP9-RIPE`, the `ch.ipng` LIR. + +**NOTE** - In case you're wondering, I registered the `ch.ipng` LIR a few months *before* I incorporated IPng Networks as a swiss limited liability company (called a `GmbH`, or *Gesellschaft mit beschränkter Haftung* in German), so for now I'm still trading as my natural person. RIPE has a cooldown period of two years before new LIRs can acquire/merge/rename. I do expect that some time in 2023 the PeeringDB page and bgp.he.net and friends to drop my personal name and take my company name :) For now, trading as Pim will have to do, slightly more business risk, but just as much fun! + +#### Step 1. Split AS50869 into two networks + + +{{< image width="300px" float="right" src="/assets/network/zurich-ring.png" alt="Zurich Metro" >}} + +The autonomous system of IPng Networks spans two main parts. Firstly, in Zurich IPng Networks operates four sites and six routers: +* Two in a private colocation site at Daedalean (**A**) in Albisrieden called `ddln0` and `ddln1`, they are running [DANOS](https://danosproject.org/) +* Two at our offices in Brüttisellen (**C**), called `chbtl0` and `chbtl1`, they are running [Debian](https://debian.org/) +* One at Interxion ZUR1 datacenter in Glattbrugg (**D**), called `chgtg0`, running [VPP]({%post_url 2021-09-21-vpp-7 %}), connecting to a public internet exchange CHIX-CH and taking transit from IP-Max and Openfactory. +* One at NTT's datacenter in Rümlang (**E**), called `chrma0`, also running [VPP]({%post_url 2021-09-21-vpp-7 %}), connecting to a public internet exchange SwissIX and taking transit from IP-Max and Meerfarbig. + +NOTE: You can read a lot about my work on VPP in a series of [VPP articles](/s/articles/), please take a look! + +There's a few downstream IP Transit networks and lots of local connected networks, such as in the DDLN colo. Then, from `chrma0`, we connect to our european ring northwards (towards Frankfurt), and from `chgtg0` we connect to our european ring south-westwards (towards Geneva). + +{{< image width="300px" float="right" src="/assets/network/european-ring.png" alt="European Ring" >}} + +That ring, then, consists of five additional sites and five routers, all running [VPP]({%post_url 2021-09-21-vpp-7 %}): +* Frankfurt: `defra0`, connecting to four DE-CIX exchangepoints in Frankfurt itself directly, and remotely to Munich, Düsseldorf and Hamburg +* Amsterdam: `nlams0`, connecting to NL-IX, SpeedIX, FrysIX (our favorite!), and LSIX; we also pick up two transit providers (A2B and Coloclue). +* Lille: `frggh0`, connecting to the northern france exchange called LillIX +* Paris: `frpar0`, connecting to two FranceIX exchange points, directly in Paris, and remotely to Marseille +* Geneva: `chplo0`, connecting to our very own Free IX + +Every one of these sites has an upstream session with AS25091 (IP-Max). Considering these folks are organizationally very close to me, it is easy for me to rejigger any one of those sessions between AS50869 (current) and AS8298 (new). 
And considering AS8298 has been a member of our as-set `AS-IPNG` for a while, it'll also be a natural propagation to rely on IP-Max, even if some peering sessions might be down. + +So before I start, IPng Networks' view from AS25091 in Amsterdam looks like this: +``` + Network Next Hop Metric LocPrf Weight Path +* 92.119.38.0/24 46.20.242.210 0 50869 i +* 176.119.215.0/24 46.20.242.210 0 50869 60557 i +* 185.36.229.0/24 46.20.242.210 0 50869 212855 i +* 185.173.128.0/24 46.20.242.210 0 50869 57777 i +* 185.209.12.0/24 46.20.242.210 0 50869 212323 i +* 192.31.196.0/24 46.20.242.210 0 50869 112 i +* 192.175.48.0/24 46.20.242.210 0 50869 112 i +* 194.1.163.0/24 46.20.242.210 0 50869 i +* 194.126.235.0/24 46.20.242.210 0 50869 i + + Network Next Hop Metric LocPrf Weight Path +* 2001:4:112::/48 2a02:2528:1902::210 0 50869 112 i +* 2001:678:3d4::/48 2a02:2528:1902::210 0 50869 201723 i +* 2001:678:ce4::/48 2a02:2528:1902::210 0 50869 60557 i +* 2001:678:ce8::/48 2a02:2528:1902::210 0 50869 60557 i +* 2001:678:cec::/48 2a02:2528:1902::210 0 50869 60557 i +* 2001:678:cf0::/48 2a02:2528:1902::210 0 50869 60557 i +* 2001:678:d78::/48 2a02:2528:1902::210 0 50869 i +* 2001:67c:6bc::/48 2a02:2528:1902::210 0 50869 201723 i +* 2620:4f:8000::/48 2a02:2528:1902::210 0 50869 112 i +* 2a07:cd40::/29 2a02:2528:1902::210 0 50869 212855 i +* 2a0b:dd80::/29 2a02:2528:1902::210 0 50869 i +* 2a0b:dd80::/32 2a02:2528:1902::210 0 50869 i +* 2a0d:8d06::/32 2a02:2528:1902::210 0 50869 60557 i +* 2a0e:fd40:200::/48 2a02:2528:1902::210 0 50869 60557 i +* 2a0e:fd45:da0::/48 2a02:2528:1902::210 0 50869 60557 i +* 2a10:d200::/29 2a02:2528:1902::210 0 50869 212323 i +* 2a10:fc40::/29 2a02:2528:1902::210 0 50869 57777 i +``` + +#### Step 2. Restrict the routers that originate our prefixes + +As a preparation to actually starting to use AS8298, I'll create RPKI records of authorization for all of our prefixes in **both** AS50869 and AS8298, and I'll add `route:` objects for all of them in both as well. + +Now, I'm ready to make my first networking topology change: instead of originating our prefixes in _all_ routers, I will originate our prefixes in AS50869 only from _two_ routers: `chbtl0` and `chbtl1`. Nobody on the internet will notice this change, the as-path will remain `^50869$` for all prefixes. + +I'll also prepare the two routers to speak to eachother with an iBGP session (rather than to IPng Networks' route-reflectors, which are still in AS50869). + +#### Step 3. Convert Brüttisellen to AS8298 + +Now, I'll switch these routers `chbtl0` and `chbtl1` out of AS50869 and into AS8298. The only damage at this point might be that my personal Spotify and Netflix stop working, and my family yells at me (but they do that all the time anyway, so it's a wash...). If things go poorly, the backout plan is to switch back to AS50869 and return things to normal. But if things go well, from this point onwards everybody will see our own IPng Networks prefixes via as-path `^50869_8298$` and effectively, AS50869 will become a transit provider for AS8298, which will be singlehomed for the moment. + +My buddy Max runs a small /29 and /64 exchangepoint in Brüttisellen with only three members on it - IPng Networks, Stucchinet and Openfactory. I will ask both of them to be my canary, and change their peering session from AS50869 to AS8298. If things go bad, that's no worries, I can drop/disable these peering sessions as I have sessions with both as well in other places. But it'd be a good place to see and test if things work as expected. 
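+
+One way to convince yourself that steps 2 and 3 are safe is to check that the prefixes validate as
+RPKI-valid regardless of which of the two AS numbers originates them. Below is a small standalone
+sketch of RFC 6811 style origin validation; the ROA list in it is illustrative only, not the real
+published set:
+
+```python
+#!/usr/bin/env python3
+"""Sketch: RFC 6811 style origin validation with ROAs for both origin AS numbers."""
+import ipaddress
+
+# Illustrative ROAs only: (prefix, maxLength, origin AS)
+ROAS = [
+    ("194.1.163.0/24", 24, 50869),
+    ("194.1.163.0/24", 24, 8298),
+    ("2001:678:d78::/48", 48, 50869),
+    ("2001:678:d78::/48", 48, 8298),
+]
+
+
+def validate(prefix: str, origin_as: int) -> str:
+    """Classify an announcement as valid, invalid or not-found."""
+    route = ipaddress.ip_network(prefix)
+    covered = False
+    for roa_prefix, max_len, roa_as in ROAS:
+        roa = ipaddress.ip_network(roa_prefix)
+        if roa.version != route.version or not route.subnet_of(roa):
+            continue  # this ROA does not cover the announced route
+        covered = True
+        if route.prefixlen <= max_len and origin_as == roa_as:
+            return "valid"
+    return "invalid" if covered else "not-found"
+
+
+for asn in (50869, 8298, 64496):
+    print(f"194.1.163.0/24 originated by AS{asn}: {validate('194.1.163.0/24', asn)}")
+```
+
+As long as ROAs exist for both AS50869 and AS8298, the announcement stays valid at every point
+during the renumbering, while a rogue origin would still be rejected.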
+ +#### Step 4. Convert DDLN to AS8298 + +Now that our IPv4 and IPv6 prefixes have moved and AS50869 does not originate prefixes anymore, you may wonder 'what about the colo?'. Indeed, the colo runs at Daedalean behind `ddln0` and `ddln1`, both of which are still in AS50869. However, the only way to ever be able to reach those prefixes, would be to find an entrypoint in AS50869 (as it is the only uplink of AS8298). All routers in AS50869 and AS8298 share an underlying OSPF and OSPFv3 interior gateway protocol, which means that if anything is destined to `194.1.163.0/24` or `2001:678:d78::/48`, those packets will find their way to the correct location using the IGP. That's neat, because it means that even though `ddln0` is speaking BGP in AS50869, it will happily forward traffic to its more specific prefixes from AS8298. + +Considering there's only one IP transit network in DDLN, those two routers will be the first for me to convert. After converting them to AS8298, they will receive transit from AS50869 just like the ones in Brüttisellen. I'll rejigger Jeroen's AS57777 to receive transit directly from AS8298, and it will then be the first to transition. Jeroen's prefixes will briefly become `^50869_8298_57777$`, which will be the only change, but this will validate that, indeed, AS8298 can provide transit. Apart from the longer as-path, the physical path that IP packets take will remain the same because Jeroen’s network is currently *cough* singlehomed at DDLN. + +Now that I have four routers in AS8298, I'll add a iBGP sessions directly only amongst the pairs, rather than in a full mesh on these routers. I've now created _two islands_ of AS8298, interconnected by AS50869. You'd think this is bad, but it's actually a fine situation to be in and there will be no loss of service, because: +* we've already established that AS50869 has reachability to all more-specifics in AS8298 (Step 3) +* AS50869 has reachability to AS57777 via its downstream AS8298 +* AS50869 is the only way in and out of the DDLN or Brüttisellen routers + +Nevertheless, I'd rather swiftly continue with the next step, and reconnect these two islands. It's is a good time for me to give a headsup to the larger internet exchanges (notably SwissIX) so folks can prepare for what's coming. I think many NOC teams know how to establish/remove a peering session, but I do expect the ITIL-automated teams to not have a playbook for "peer on IP 192.0.2.1 has changed from AS666 to AS42". I'll observe their performance on this task and take notes, as there's quite a few public IXPs to go... + +#### Step 5. Convert Rümlang and Glattbrugg to AS8298 + +After four of our routers (two redundant pairs) have been operating in AS8298 for a few days, I'm ready to renumber our first machine connected to a public internet exchange: `chgtg0` is connected to [CHIX-CH](https://chix.ch/). I'll contact the members with whom IPng Networks has a direct peering, and of course the internet exchange folks, to ask them to renumber our AS50869 to AS8298. After restarting our router into the new AS, one by one I'll establish sessions with our peers there - this is an important exercise because I'll be doing this later in Step 5 on every peering/border router in the european ring, and there will be in total ~1'850 BGP adjacencies that have to be renumbered. 
+ +At CHIX-CH, IPng has in total four BGP sessions with the two routeservers in AS212100, and in total 38 direct BGP sessions; two of which are somewhat important: AS13030 (Init7) and AS15169 (Google), to which a large fraction of our traffic flows. While upgrading this router, I'll also switch my one downstream network (Daedalean itself, operating AS212323) to receive transit from AS8298. Because I've canaried this with Jeroen's AS57777 in the DDLN colo previously, I'll be reasonably certain at this point that it'll work well. If not, they have alternative uplinks (notably AS174), so they should be fine without me. + +At SwissIX, IPng has as well four BGP sessions with the two routeservers in AS42476, and in total 132 direct BGP sessions. I think that, once these two peering routers are complete, I'll checkpoint and let things run like this for a while. Let's take a few weeks off, giving me a while to hunt down peers at SwissIX and CH-IX to catch up and re-establish their sessions with the new AS8298 :) + +After this step, phase one of the transition is complete, and AS8298 (and its networks AS57777 and AS212323) will be directly visible in Switzerland, and still tucked away behind `^50869_8298_212323$` for the international traffic. I will however have 2 transit sessions from IP-Max (AS25091), 2 transit sessions from Openfactory (AS58299) and 1 transit session from Meerfarbig (AS34549); so it is expected to be a quite stable network at this point, which is good. + +#### Step 6. Convert European Ring to AS8298 + +For this task, I'll swap my iBGP sessions to all use the new AS8298. I do this by first dismantling `rr0.chbtl0.ipng.ch` and bring it back up as an iBGP speaker in the new AS; then one by one push all routers to speak to that route reflector (in addition to the existing route reflectors in AS50869). After that stabilizes, rinse and repeat with `rr0.chplo0.ipng.ch`; and finally finish the job with `rr0.frggh0.ipng.ch`. The two pairs of routers who were by themselves in AS8298 (`chbtl0`/`chbtl1` in Brüttisellen; and `ddln0`/`ddln1` at Daedalean) can now be reattached to the iBGP route-reflectors as well. + +It will be a bit of a schlepp, but now comes the notification of all international peers (there are an additional ~250 direct peerings) and downstreams (there are three left: Raymon, Jelle and Krik), and upstreams (there are two additional ones: Coloclue, and A2B). While normally this is a matter of merely swapping the AS number, it has to be done on both sides - on my side, I can do it with a one-line change to the git repository, and it'll be pushed by Kees (the network build and push automation that was inspired by Coloclue's [Kees](https://github.com/coloclue/kees/)), on the remote side it will be a matter of patience. One by one, folks will (or.. won't) update their peering session. The only folks I'll actively chase is the DE-CIX, FranceIX, NL-IX, LSIX and FrysIX routeserver operators, as the vast majority of adjacencies are learned via those. By means of a headsup one week in advance, and a few reminders on the day of, and the day after maintenance, I should minimize downtime. But, in this case, because I already have two transit providers in Switzerland (AS25091 IP-Max, and AS58299 Openfactory) who provide me full tables in AS8298, it should be operationally smooth sailing. +At the end of this exercise, the as-path will be `^8298$` for my own prefixes and `^8298_.*$` for my downstream networks, and AS50869 will no longer be in any as-path in the DFZ. 
Rolling back is tricky. Although Bird can run individual peering sessions with differing AS numbers, I don't think this is a good idea, as it would mean many (many) changes to the repository. So, in the interest of simplicity and of not breaking things that work, I will do the migration router-by-router rather than session-by-session, and send a few reminders to folks to update their records to match my PeeringDB entries.
+
+#### Step 7. Repurpose/retire AS50869
+
+There's really not that much to do -- delete the `route:` and `route6:` objects and remove RPKI ROAs for the old announcements; but mostly it'll be a matter of hunting down peering partners who have not (yet) updated their records and sessions. I imagine lots of folks will hesitate and be unfamiliar with this type of operation (even though it literally is an `s/50869/8298/g` for them). I'll take most of December to remind folks a few times, and ultimately just clean up broken peering sessions in January 2022.
+
+And of course, then lick my wounds and count pokemon - on October 24th, the day this post was published, Hurricane Electric showed 1'845 adjacencies in total for AS50869, of which 1'653 IPv4 and 1'430 IPv6. I will consider it a success if I lose less than 200 adjacencies. I'll keep AS50869 around, as a test AS number to do a few experiments.
+
+## The Timeline
+
+Almost all intrusive maintenance will be done in maintenance windows between 22:00 - 03:00 UTC from Thursday to Friday. The tentative planning for the project, which starts on October 22nd and lasts through the end of the year (10 weeks):
+
+1. 2021-10-22 - RIPE updates for `route:`, `route6:` and RPKI ROAs
+1. 2021-10-24 - Originate prefixes from chbtl0/chbtl1 (no-op for the DFZ)
+1. 2021-10-28 - Move chbtl0/chbtl1 at Brüttisellen to AS8298
+1. 2021-10-29 - Update Brüttisellen IXP (AS58299 and AS58280), canary upstream AS58299
+1. 2021-10-29 - Headsup to CHIX-CH and SwissIX peers and Transit partners, announcing move
+1. 2021-11-04 - Move ddln0/ddln1 at Daedalean to AS8298, canary downstream AS57777
+1. 2021-11-11 - Move chgtg0 Glattbrugg to AS8298, add downstream AS212323
+1. 2021-11-12 - Move CHIX-CH peers to AS8298
+1. 2021-11-15 - Move chrma0 Rümlang to AS8298
+1. 2021-11-16 - Move SwissIX peers to AS8298
+1. Two week cooldown period, start to move IXPs to AS8298
+1. 2021-11-29 - Headsup to all European IXPs and IP Transit partners, announcing move
+1. 2021-12-02 - Move defra0, nlams0, frggh0, frpar0, chplo0 to AS8298
+1. 2021-12 - Move European IXPs to AS8298
+1. 2021-12-06 - First reminder for peerings that did not re-establish
+1. 2021-12-13 - Second reminder for peerings that did not re-establish
+1. 2021-12-20 - Third (final) reminder for peerings that did not re-establish
+1. 2022-01-10 - Remove peerings that did not re-establish
+
+## Appendix
+
+This blogpost turned into a talk at [RIPE #83](/assets/as8298/RIPE83-IPngNetworks.pdf), in case you wanted to take a look at [the recording](https://ripe83.ripe.net/archives/video/649/) and [a few questions](https://ripe83.ripe.net/archives/video/650/).
diff --git a/content/articles/2021-11-14-routing-policy.md b/content/articles/2021-11-14-routing-policy.md
new file mode 100644
index 0000000..28a9be5
--- /dev/null
+++ b/content/articles/2021-11-14-routing-policy.md
@@ -0,0 +1,726 @@
+---
+date: "2021-11-14T22:49:09Z"
+title: Case Study - BGP Routing Policy
+---
+
+# Introduction
+
+BGP Routing policy is a very interesting topic. I get asked about it formally and informally all the time.
I have to admit, there are lots of ways to organize an autonomous system. Vendors have unique features and templating / procedural functions, but in the end, BGP routing policy all boils down to two+two things:
+
+1. Not accepting the prefixes you don't want (inbound)
+   * For those prefixes accepted, ensure they have correct attributes.
+1. Not announcing prefixes to folks who shouldn't see them (outbound)
+   * For those prefixes announced, ensure they have correct attributes.
+
+At IPng Networks, I've cycled through a few iterations and landed on a specific setup that works well for me. It provides sufficient information to enable our downstreams (customers) to make good decisions on what they should accept from us, as well as enough expressivity for them to determine which prefixes we should propagate for them, where, and how.
+
+This article describes one approach to a relatively feature-rich routing policy which is in use at IPng Networks (AS8298). It uses the [Bird2](https://bird.network.cz/) configuration language, although the concepts would be implementable in ~any modern routing suite (ie. FRR, Cisco, Juniper, Arista, Extreme, et cetera).
+
+Interested in one operator's opinion? Read on!
+
+## 1. Concepts
+
+There are three basic pieces of routing filtering, which I'll describe briefly.
+
+### Prefix Lists
+
+A prefix list (also sometimes referred to as an access-list in older software) is a list of IPv4 or IPv6 prefixes, often with a prefixlen boundary, that determines if a given prefix is "in" or "out".
+
+An example could be: `2001:db8::/32{32,48}` which describes any prefix in the supernet `2001:db8::/32` that has a prefix length of anywhere between /32 and /48, inclusive.
+
+### AS Paths
+
+In BGP, each prefix learned comes with an AS path on how to reach it. If my router learns a prefix from a peer with AS number `65520`, it'll see every prefix that peer sends as a list of AS numbers starting with 65520. With AS Paths, the very first one in the list is the one the router directly learned the prefix from, and the very last one is the origin of the prefix. Often times the path is shown as a regular expression, starting with `^` and ending with `$`, and to help readability, spaces are often written as `_`.
+
+Examples: `^25091_1299_3301$` and `^58299_174_1299_3301$`
+
+### BGP Communities
+
+When learning (or originating) a prefix in BGP, zero or more so-called `communities` can be added to it along the way. The _Routing Information Base_ or _RIB_ carries these communities and can share them between peering sessions. Communities can be added, removed and modified. Some communities have special meaning (which is agreed upon by everyone), and some have local meaning (agreed upon by only one or a small set of operators).
+
+There are three types of communities: _normal_ communities are a pair of 16-bit integers; _extended_ communities are 8 bytes, split into one 16-bit integer and an additional 48-bit value; and finally _large_ communities consist of a triplet of 32-bit values.
+
+Examples: `(8298, 1234)` (normal), or `(8298, 3, 212323)` (large)
+
+# Routing Policy
+
+Now that I've explained a little bit about the ingredients we have to work with, let me share an observation that took me a few decades to make: BGP sessions are really all the same. As such, every single one of the BGP sessions at IPng Networks is generated with one template.
What makes the difference between 'Transit', 'Customer', 'Peer' and 'Private Interconnect' really all boils down to what types of filtering are applied on in- and outbound updates. I will demonstrate this by means of two main functions in Bird: `ebgp_import()`, discussed first in the ***Inbound: Learning Routes*** section, and `ebgp_export()`, discussed in the ***Outbound: Announcing Routes*** section.
+
+## 2. Inbound: Learning Routes
+
+Let's consider this function:
+
+```
+function ebgp_import(int remote_as) {
+  if aspath_bogon() then return false;
+  if (net.type = NET_IP4 && ipv4_bogon()) then return false;
+  if (net.type = NET_IP6 && ipv6_bogon()) then return false;
+
+  if (net.type = NET_IP4 && ipv4_rpki_invalid()) then return false;
+  if (net.type = NET_IP6 && ipv6_rpki_invalid()) then return false;
+
+  # Demote certain AS nexthops to lower pref
+  if (bgp_path.first ~ AS_LOCALPREF50 && bgp_path.len > 1) then bgp_local_pref = 50;
+  if (bgp_path.first ~ AS_LOCALPREF30 && bgp_path.len > 1) then bgp_local_pref = 30;
+  if (bgp_path.first ~ AS_LOCALPREF10 && bgp_path.len > 1) then bgp_local_pref = 10;
+
+  # Graceful Shutdown (RFC8326)
+  if (65535, 0) ~ bgp_community then bgp_local_pref = 0;
+
+  # Scrub BLACKHOLE community
+  bgp_community.delete((65535, 666));
+
+  return true;
+}
+```
+
+The function works by order of elimination -- for each prefix that is offered on the session, it will either be rejected (by means of returning `false`), or modified (by means of setting attributes like `bgp_local_pref`) and then accepted (by means of returning `true`).
+
+***AS-Path Bogon*** filtering is a way to remove prefixes that have an invalid AS number in their path. The main examples of this are the private and reserved AS number ranges (64496-131071) and their 32-bit equivalents (4200000000-4294967295). In case you haven't come across this yet, AS number 23456 is also magic, see [RFC4893](https://datatracker.ietf.org/doc/html/rfc4893) for details:
+```
+function aspath_bogon() {
+  return bgp_path ~ [0, 23456, 64496..131071, 4200000000..4294967295];
+}
+```
+
+***Prefix Bogon*** filtering comes next, as certain prefixes are not publicly routable (you know, such as [RFC1918](https://datatracker.ietf.org/doc/html/rfc1918) space, but there are many others).
They look differently for IPv4 and IPv6: +``` +function ipv4_bogon() { + return net ~ [ + 0.0.0.0/0, # Default + 0.0.0.0/32-, # RFC 5735 Special Use IPv4 Addresses + 0.0.0.0/0{0,7}, # RFC 1122 Requirements for Internet Hosts -- Communication Layers 3.2.1.3 + 10.0.0.0/8+, # RFC 1918 Address Allocation for Private Internets + 100.64.0.0/10+, # RFC 6598 IANA-Reserved IPv4 Prefix for Shared Address Space + 127.0.0.0/8+, # RFC 1122 Requirements for Internet Hosts -- Communication Layers 3.2.1.3 + 169.254.0.0/16+, # RFC 3927 Dynamic Configuration of IPv4 Link-Local Addresses + 172.16.0.0/12+, # RFC 1918 Address Allocation for Private Internets + 192.0.0.0/24+, # RFC 6890 Special-Purpose Address Registries + 192.0.2.0/24+, # RFC 5737 IPv4 Address Blocks Reserved for Documentation + 192.168.0.0/16+, # RFC 1918 Address Allocation for Private Internets + 198.18.0.0/15+, # RFC 2544 Benchmarking Methodology for Network Interconnect Devices + 198.51.100.0/24+, # RFC 5737 IPv4 Address Blocks Reserved for Documentation + 203.0.113.0/24+, # RFC 5737 IPv4 Address Blocks Reserved for Documentation + 224.0.0.0/4+, # RFC 1112 Host Extensions for IP Multicasting + 240.0.0.0/4+ # RFC 6890 Special-Purpose Address Registries + ]; +} + +function ipv6_bogon() { + return net ~ [ + ::/0, # Default + ::/96, # IPv4-compatible IPv6 address - deprecated by RFC4291 + ::/128, # Unspecified address + ::1/128, # Local host loopback address + ::ffff:0.0.0.0/96+, # IPv4-mapped addresses + ::224.0.0.0/100+, # Compatible address (IPv4 format) + ::127.0.0.0/104+, # Compatible address (IPv4 format) + ::0.0.0.0/104+, # Compatible address (IPv4 format) + ::255.0.0.0/104+, # Compatible address (IPv4 format) + 0000::/8+, # Pool used for unspecified, loopback and embedded IPv4 addresses + 0100::/8+, # RFC 6666 - reserved for Discard-Only Address Block + 0200::/7+, # OSI NSAP-mapped prefix set (RFC4548) - deprecated by RFC4048 + 0400::/6+, # RFC 4291 - Reserved by IETF + 0800::/5+, # RFC 4291 - Reserved by IETF + 1000::/4+, # RFC 4291 - Reserved by IETF + 2001:10::/28+, # RFC 4843 - Deprecated (previously ORCHID) + 2001:20::/28+, # RFC 7343 - ORCHIDv2 + 2001:db8::/32+, # Reserved by IANA for special purposes and documentation + 2002:e000::/20+, # Invalid 6to4 packets (IPv4 multicast) + 2002:7f00::/24+, # Invalid 6to4 packets (IPv4 loopback) + 2002:0000::/24+, # Invalid 6to4 packets (IPv4 default) + 2002:ff00::/24+, # Invalid 6to4 packets + 2002:0a00::/24+, # Invalid 6to4 packets (IPv4 private 10.0.0.0/8 network) + 2002:ac10::/28+, # Invalid 6to4 packets (IPv4 private 172.16.0.0/12 network) + 2002:c0a8::/32+, # Invalid 6to4 packets (IPv4 private 192.168.0.0/16 network) + 3ffe::/16+, # Former 6bone, now decommissioned + 4000::/3+, # RFC 4291 - Reserved by IETF + 5f00::/8+, # RFC 5156 - used for the 6bone but was returned + 6000::/3+, # RFC 4291 - Reserved by IETF + 8000::/3+, # RFC 4291 - Reserved by IETF + a000::/3+, # RFC 4291 - Reserved by IETF + c000::/3+, # RFC 4291 - Reserved by IETF + e000::/4+, # RFC 4291 - Reserved by IETF + f000::/5+, # RFC 4291 - Reserved by IETF + f800::/6+, # RFC 4291 - Reserved by IETF + fc00::/7+, # Unicast Unique Local Addresses (ULA) - RFC 4193 + fe80::/10+, # Link-local Unicast + fec0::/10+, # Site-local Unicast - deprecated by RFC 3879 (replaced by ULA) + ff00::/8+ # Multicast + ]; +} +``` + +That's a long list!! But operators on the _DFZ_ should really never be accepting any +of these, and we should all collectively yell at those who propagate them. 
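+
+A quick note on the Bird prefix pattern syntax these sets rely on, since the suffix carries the meaning: `+` matches the prefix itself plus all of its more-specifics, `-` matches the prefix plus all of its less-specifics, and `{m,n}` matches anything inside the prefix with a length between m and n. A tiny standalone illustration (not part of the production filters above):
+
+```
+function pattern_examples() {
+  return net ~ [
+    10.0.0.0/8+,       # 10.0.0.0/8 and every more-specific, /8 through /32
+    10.0.0.0/8-,       # 10.0.0.0/8 and every less-specific, /8 up to /0
+    10.0.0.0/8{9,16}   # anything inside 10.0.0.0/8 with a length of /9 to /16
+  ];
+}
+```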
+
+***RPKI Filtering*** is a fantastic routing security feature, described in [RFC6810](https://datatracker.ietf.org/doc/html/rfc6810) and relatively straightforward to implement. For each _originating_ AS number, we can check in a table of known `{prefix, origin AS}` mappings whether it is the correct ISP to originate the prefix. The lookup can match (which makes the prefix RPKI valid), it can fail because the prefix is missing (which makes the prefix RPKI unknown), or it can specifically mismatch (which makes the prefix RPKI invalid). Operators are encouraged to flag and drop _invalid_ prefixes:
+
+```
+function ipv4_rpki_invalid() {
+  return roa_check(t_roa4, net, bgp_path.last) = ROA_INVALID;
+}
+
+function ipv6_rpki_invalid() {
+  return roa_check(t_roa6, net, bgp_path.last) = ROA_INVALID;
+}
+```
+
+***NOTE***: In NLNOG my post sparked a bit of debate on the use of `bgp_path.last_nonaggregated` versus simply `bgp_path.last`. Job Snijders did some spelunking and offered [this post](https://bird.network.cz/pipermail/bird-users/2019-September/013805.html) and a reference to [RFC6907](https://datatracker.ietf.org/doc/html/rfc6907) for details, and Tijn confirmed that Coloclue (on which many of my approaches have been modeled) indeed uses `bgp_path.last`. I've updated my configs, with many thanks for the discussion.
+
+Alright, now that I've determined the as-path and prefix are kosher, and that it is not known to be hijacked (ie. is either `ROA_VALID` or `ROA_UNKNOWN`), I'm ready to set a few attributes, notably:
+
+* ***AS_LOCALPREF***: If the peer I learned this prefix from is in the given list, set the BGP local preference to either 50, 30 or 10 respectively (a lower localpref means the prefix is less likely to be selected). Some internet providers send lots of prefixes, but have poor network connectivity to the place I learned the routes from (a few examples of this: 6939 is often oversubscribed in Amsterdam, 39533 was for a while connected via a tunnel (!) to Zurich, and several hobby/amateur IXPs are on a VXLAN bridged domain rather than a physical switch).
+
+* ***Graceful Shutdown***, described in [RFC8326](https://datatracker.ietf.org/doc/html/rfc8326), gives operators a way to pre-announce their downtime by setting a special BGP community that informs their peers to deselect that path by setting the local preference to the lowest possible value. This oneliner matching on `(65535,0)` implements that behavior.
+
+* ***Blackhole Community***, described in [RFC7999](https://datatracker.ietf.org/doc/html/rfc7999), is another special BGP community of `(65535,666)` which signals the need to stop sending traffic to the prefix at hand. I haven't yet implemented the blackhole routing (this has to do with an intricacy of the VPP Linux-CP code that I wrote), so for now I'll just remove the community.
+
+Alright, based on this one template, I'm now ready to implement all three types of BGP session: ***Peer***, ***Upstream***, and ***Downstream***.
+
+### Peers
+
+```
+function ebgp_import_peer(int remote_as) {
+  # Scrub BGP Communities (RFC 7454 Section 11)
+  bgp_community.delete([(8298, *)]);
+  bgp_large_community.delete([(8298, *, *)]);
+
+  return ebgp_import(remote_as);
+}
+```
+
+It's dangerous to accept communities for my own AS8298 from peers. This is because several of them can actively change the behavior of route propagation (these types of communities are commonly called _action_ communities).
So with peering relationships, I'll just toss them all.
+
+Now, working my way up to the actual BGP peering session, taking for example a peer that I'm connecting to at LSIX (the routeserver, in fact) in Amsterdam:
+
+```
+filter ebgp_lsix_49917_import {
+  if ! ebgp_import_peer(49917) then reject;
+
+  # Add IXP Communities
+  bgp_community.add((8298,1036));
+  bgp_large_community.add((8298,1,1036));
+
+  accept;
+}
+
+protocol bgp lsix_49917_ipv4_1 {
+  description "LSIX IX Route Servers (LSIX)";
+  local as 8298;
+  source address 185.1.32.74;
+  neighbor 185.1.32.254 as 49917;
+  default bgp_med 0;
+  default bgp_local_pref 200;
+  ipv4 {
+    import keep filtered;
+    import filter ebgp_lsix_49917_import;
+    export filter ebgp_lsix_49917_export;
+    receive limit 100000 action restart;
+    next hop self on;
+  };
+};
+```
+
+Parsing this through: the ipv4 import filter is called `ebgp_lsix_49917_import` and its job is to run the whole kittenkaboodle of filtering I described above, and then, if the `ebgp_import_peer()` function returns false, to simply drop the prefix. But if it is accepted, I'll tag it with a few communities. As I'll show later, any other peer will receive these communities if I decide to propagate the prefix to them. This is specifically useful for downstreams (customers), who can decide to accept/deny the prefix based on a well-known set of communities we tag.
+
+***IXP Community***: If the prefix is learned at an IXP, I'll add a large community `(8298,1,*)` and a backwards-compatible normal community `(8298,10XX)`.
+
+One last thing I'll note, and this is a matter of taste: most peering prefixes picked up at internet exchanges (like LSIX) are typically much cheaper per megabit than the transit routes, so I will set a default `bgp_local_pref` of 200 (a higher localpref is more likely to be selected as the active route).
+
+### Upstream
+
+An interesting observation: from Peers and from Upstreams I typically am happy to take all the prefixes I can get (but see the epilog below for an important note on this). For a Peer, this is mostly "their own prefixes" and for a Transit, this is mostly "all prefixes", but there are things in the middle, say partial transit of "all prefixes learned at IXPs A, B and C". Really, all inbound sessions are very similar:
+
+```
+function ebgp_import_upstream(int remote_as) {
+  # Scrub BGP Communities (RFC 7454 Section 11)
+  bgp_community.delete([(8298, *)]);
+  bgp_large_community.delete([(8298, *, *)]);
+
+  return ebgp_import(remote_as);
+}
+```
+
+... is in fact identical to the `ebgp_import_peer()` function above, so I'll not discuss it further. But for the sessions to upstream (==transit) providers, it can make sense to use slightly different BGP community tags and a lower localpref:
+
+```
+filter ebgp_ipmax_25091_import {
+  if ! 
ebgp_import_upstream(25091) then reject; + + # Add BGP Large Communities + bgp_large_community.add((8298,2,25091)); + + # Add BGP Communities + bgp_community.add((8298,2000)); + + accept; +} + +protocol bgp ipmax_25091_ipv4_1 { + description "IP-Max Transit"; + local as 8298; + source address 46.20.242.210; + neighbor 46.20.242.209 as 25091; + default bgp_med 0; + default bgp_local_pref 50; + ipv4 { + import keep filtered; + import filter ebgp_ipmax_25091_import; + export filter ebgp_ipmax_25091_export; + next hop self on; + }; +}; +``` + +Again, a very similar pattern; the only material difference is that the inbound prefixes +are tagged with an ***Upstream Community*** which is of the form `(8298,2,*)` and backwards +compatible `(8298,20XX)`. Downstream customers can use this, if they wish, to select or +reject routes (maybe they don't like routes coming from AS25091, although they should know +better because IP-Max rocks!). + +The other slight change here is the `bgp_local_pref` is set to 50, which implies that it will +be used only if there are no alternatives in the _RIB_ with a higher localpref, or with a +similar localpref but shorter as-path, or many other scenarios which I won't get into here, +because BGP selection criteria 101 is a whole blogpost of its own. + +## Downstream + +That brings us to the third type of BGP sessions -- commonly referred to as customers except +that not everybody pays :) so I just call them _downstreams_: + +``` +function ebgp_import_downstream(int remote_as) { + # We do not scrub BGP Communities (RFC 7454 Section 11) for customers + return ebgp_import(remote_as); +} +``` + +Here, I have a special relationship with the `remote_as`, and I do not scrub the communities, +letting the downstream operator set whichever they like. As I'll demonstrate in the next +chapter, they can use these communities to drive certain types of behavior. + +Here's how I use this `ebgp_import_downstream()` function in the full filter for a downstream: + +``` +# bgpq4 -Ab4 -R 24 -m 24 -l 'define AS201723_IPV4' AS201723 +define AS201723_IPV4 = [ + 185.54.95.0/24 +]; + +# bgpq4 -Ab6 -R 48 -m 48 -l 'define AS201723_IPV6' AS201723 +define AS201723_IPV6 = [ + 2001:678:3d4::/48, + 2001:67c:6bc::/48 +]; + +filter ebgp_raymon_201723_import { + if (net.type = NET_IP4 && ! (net ~ AS201723_IPV4)) then reject; + if (net.type = NET_IP6 && ! (net ~ AS201723_IPV6)) then reject; + if ! ebgp_import_downstream(201723) then reject; + + # Add BGP Large Communities + bgp_large_community.add((8298,3,201723)); + + # Add BGP Communities + bgp_community.add((8298,3500)); + + accept; +} + +protocol bgp raymon_201723_ipv4_1 { + local as 8298; + source address 185.54.95.250; + neighbor 185.54.95.251 as 201723; + default bgp_med 0; + default bgp_local_pref 400; + ipv4 { + import keep filtered; + import filter ebgp_raymon_201723_import; + export filter ebgp_raymon_201723_export; + receive limit 94 action restart; + next hop self on; + }; +}; +``` + +OK, so this is a mouthful, but the one thing that I really need to do with customers is +ensure that I only accept prefixes from them that they're supposed to send me. I do this +with a `prefix-list` for IPv4 and IPv6, and in the importer, I simply reject any prefixes +that are not in the list. From then on, it looks very much like a peer, with identical +filtering and tagging, except now I'm using yet another ***Customer Community*** which +starts with `(8298,3,*)` and a vanilla `(8298,3500)` community. 
Anybody who wishes to, +can act on the presence of these communities to know that it's a downstream of IPng Networks +AS8298. + +***A note on Peers and Downstreams***: + +Some ISPs will not peer with their customers (as in: once you become a transit customer +they will terminate all BGP sessions at public internet exchanges), and I find that silly. +However, for me the situation becomes a little bit more complex if I were to have AS201723 +both as a Downstream (as shown here) as well as a Peer (which in fact, I do, at multiple Amsterdam +based internet exchanges). Note how the `bgp_local_pref` is 400 on this session, and it +will always be lower on other types of sessions. The implication is that this prefix from the _RIB_ +which carries `(8298,3,201723)` will be selected, and the ones I learn from LSIX will +carry `(8298,1,*)` and the ones I learn from A2B (a transit provider) will carry `(8298,2,51088)` +and both will not be selected due to those having a lower localpref. As I'll demonstrate below, +I can make smart use of these communities when announcing prefixes to my own peers and upstreams, +... read on :) + +## 3. Outbound: Announcing Routes + +Alright, the _RIB_ is now filled with lots of prefixes that have the right localpref and +communities, for example from having been learned at an IXP, from an Upstream, or from a +Downstream. Now let's consider the following generic exporter: + +``` +function ebgp_export(int remote_as) { + # Remove private ASNs + bgp_path.delete([64512..65535, 4200000000..4294967295]); + + # Well known BGP Large Communities + if (8298, 0, remote_as) ~ bgp_large_community then return false; + if (8298, 0, 0) ~ bgp_large_community then return false; + + # Well known BGP Communities + if (0, 8298) ~ bgp_community then return false; + if (remote_as < 65536 && (0, remote_as) ~ bgp_community) then return false; + + # AS path prepending + if ((8298, 103, remote_as) ~ bgp_large_community || + (8298, 103, 0) ~ bgp_large_community) then { + bgp_path.prepend( bgp_path.first ); + bgp_path.prepend( bgp_path.first ); + bgp_path.prepend( bgp_path.first ); + } else if ((8298, 102, remote_as) ~ bgp_large_community || + (8298, 102, 0) ~ bgp_large_community) then { + bgp_path.prepend( bgp_path.first ); + bgp_path.prepend( bgp_path.first ); + } else if ((8298, 101, remote_as) ~ bgp_large_community || + (8298, 101, 0) ~ bgp_large_community) then { + bgp_path.prepend( bgp_path.first ); + } + + return true; +} +``` + +Oh, wow! There's some really cool stuff to unpack here. As a belt-and-braces type safety, +I will remove any private AS numbers from the as-path - this avoids my own announcements +from tripping any as-path bogon filtering. But then, there's a few well-known communities +that help determine if the announcement is made or not, and there are three-and-a-half +ways of doing this: +1. `(8298,0,remote_as)` +1. `(8298,0,0)` +1. `(0,8298)` +1. `(0,remote_as)` but only if the remote_as is 16 bits. + +All four of these methods will tell the router to refuse announcing the prefix on this +session. Note that downstreams are allowed to set `(8298,*,*)` and `(8298,*)` communities +(and they're the only ones who are allowed to do so). So here is where some of the cool +magic starts to happen. + +Then, to drive prepending of the prefix on this session, I'll again match certain +communities `(8298, 103, *)` will prepend the customer's AS number three times, using +`102` will prepend twice, and `101` will prepend once. 
If the third digit is `0`, then any session with this filter will prepend. If the third digit is an AS number, then only sessions to this AS number will be prepended.
+
+Using these types of communities gives downstreams (customers) incredibly fine-grained propagation actions, at the per-IPng-session level. Not many ISPs offer this functionality!
+
+### Peers
+
+Exporting to peers, I really need to make sure that I don't send too many prefixes. Most of us have at some point gone through the embarrassing motions of being told by a fellow operator "hey you're sending a full table". It is paramount to good peering hygiene that I do not leak. So I'll define a healthy set of _defense in depth_ principles here:
+
+```
+# bgpq4 -A4b -R 24 -m 24 -l 'define AS8298_IPV4' AS8298
+define AS8298_IPV4 = [ 92.119.38.0/24, 194.1.163.0/24, 194.126.235.0/24 ];
+
+# bgpq4 -A6bR 48 -m 48 -l 'define AS8298_IPV6' AS8298
+define AS8298_IPV6 = [ 2001:678:d78::/48, 2a0b:dd80::/29{29,48} ];
+
+# bgpq4 -A4b -R 24 -m 24 -l 'define AS_IPNG_IPV4' AS-IPNG
+define AS_IPNG_IPV4 = [ ... ## Removed for brevity ];
+
+# bgpq4 -A6bR 48 -m 48 -l 'define AS_IPNG_IPV6' AS-IPNG
+define AS_IPNG_IPV6 = [ .. ## Removed for brevity ];
+
+# bgpq4 -t4b -l 'define AS_IPNG' AS-IPNG
+define AS_IPNG = [112, 8298, 50869, 57777, 60557, 201723, 212323, 212855];
+
+function aspath_first_valid() {
+  return (bgp_path.len = 0 || bgp_path.first ~ AS_IPNG);
+}
+
+# A list of well-known tier1 transit providers
+function aspath_contains_tier1() {
+  return bgp_path ~ [
+    174,    # Cogent
+    209,    # Qwest (HE carries this on IXPs IPv6 (Jul 12 2018))
+    701,    # UUNET
+    702,    # UUNET
+    1239,   # Sprint
+    1299,   # Telia
+    2914,   # NTT Communications
+    3257,   # GTT Backbone
+    3320,   # Deutsche Telekom AG (DTAG)
+    3356,   # Level3
+    3549,   # Level3
+    3561,   # Savvis / CenturyLink
+    4134,   # Chinanet
+    5511,   # Orange opentransit
+    6453,   # Tata Communications
+    6762,   # Seabone / Telecom Italia
+    7018 ]; # AT&T
+}
+
+# The list of our own uplink (transit) providers
+# Note: This list is autogenerated by our automation.
+function aspath_contains_upstream() {
+  return bgp_path ~ [ 8283,25091,34549,51088,58299 ];
+}
+
+function ipv4_prefix_valid() {
+  # Our (locally sourced) prefixes
+  if (net ~ AS8298_IPV4) then return true;
+
+  # Customer prefixes in AS-IPNG must be tagged with customer community
+  if (net ~ AS_IPNG_IPV4 &&
+      (bgp_large_community ~ [(8298, 3, *)] || bgp_community ~ [(8298, 3500)])
+     ) then return true;
+
+  return false;
+}
+function ipv6_prefix_valid() {
+  # Our (locally sourced) prefixes
+  if (net ~ AS8298_IPV6) then return true;
+
+  # Customer prefixes in AS-IPNG must be tagged with customer community
+  if (net ~ AS_IPNG_IPV6 &&
+      (bgp_large_community ~ [(8298, 3, *)] || bgp_community ~ [(8298, 3500)])
+     ) then return true;
+
+  return false;
+}
+function prefix_valid() {
+  # as-path based filtering
+  if !aspath_first_valid() then return false;
+  if aspath_contains_tier1() then return false;
+  if aspath_contains_upstream() then return false;
+
+  # prefix (and BGP community) based filtering
+  if (net.type = NET_IP4 && !ipv4_prefix_valid()) then return false;
+  if (net.type = NET_IP6 && !ipv6_prefix_valid()) then return false;
+  return true;
+}
+
+function ebgp_export_peer(int remote_as) {
+  if !prefix_valid() then return false;
+  return ebgp_export(remote_as);
+}
+```
+
+Wow, alrighty then!! All I'm doing here is checking if the call to `prefix_valid()` returns true. That function isn't very complex.
It takes a look at three as-path based filters and then a prefix-list based filter. Let's go over them in turn:
+
+***aspath_first_valid()*** takes a look at the first hop in the as-path. I need to make sure that I've received this prefix from an actual downstream, and those are collected in a RIPE `as-set` called `AS-IPNG`. So if the first BGP hop in the path is not one of these, I'll refuse to announce the prefix.
+
+***aspath_contains_tier1()*** is a belt-and-braces style check. How on earth would I provide transit for any prefix for which there's already a global _Tier1_ provider in the path? I mean, in no universe would AS174 or AS1299 need me to reach any of their customers, or indeed, any place in the world. So this filter helps me never announce the prefix, if it has one of these ISPs in the path.
+
+***aspath_contains_upstream()***: similarly, if I am receiving a full table from an upstream provider, I should not be passing this prefix along - I would for similar reasons never be a transit provider for A2B or IP-Max or Meerfarbig. I had a bug in my configuration here; my buddy Erik kindly pointed out the issue to me, so hat-tip to him for the intelligence.
+
+***ipv[46]_prefix_valid()*** is the main thrust of prefix-based filtering. At this point we've already established that the as-path is clean, but it could be that the downstream is sending prefixes they should not (possibly leaking a full table), so let's take a look at a good way to avoid this.
+* First, we look at locally sourced routes from `AS8298`, that is, the ones that I myself originate at IPng Networks. These are always OK. The list is carefully curated.
+* Alternatively, the prefix needs to be from the as-set `AS-IPNG` (which contains both my prefixes and all `route` and `route6` objects belonging to any AS number that I consider a downstream),
+* Finally, if the prefix is from `AS-IPNG`, I'll still add one additional check to ensure that there is a so-called _customer community_ attached. Remember that I discussed this specifically up in the ***Inbound - Downstream*** section.
+
+So before I announce anything on such a session, all _four_ of as-path, inbound prefix-list, outbound prefix-list and bgp-community are checked. This makes it incredibly unlikely that AS8298 ever leaks prefixes -- knock on wood!
+
+### Upstream
+
+Interestingly, and if you think about it unsurprisingly, an upstream configuration is exactly identical to a peer's:
+
+```
+function ebgp_export_upstream(int remote_as) {
+  if !prefix_valid() then return false;
+  return ebgp_export(remote_as);
+}
+```
+
+Alright, nothing to see here, moving on ...
+
+### Downstream
+
+Now the difference between a Peer and an Upstream on the one hand, and a Downstream on the other, is that the former two will only see a very limited set of prefixes, heavily guarded by all of that filtering I described. But a downstream typically has the luxury of getting to learn every prefix I've learned:
+
+```
+function ipv4_acceptable_size() {
+  if net.len < 8 then return false;
+  if net.len > 24 then return false;
+  return true;
+}
+function ipv6_acceptable_size() {
+  if net.len < 12 then return false;
+  if net.len > 48 then return false;
+  return true;
+}
+function ebgp_export_downstream(int remote_as) {
+  if (source != RTS_BGP && source != RTS_STATIC) then return false;
+  if (net.type = NET_IP4 && ! ipv4_acceptable_size()) then return false;
+  if (net.type = NET_IP6 && ! 
ipv6_acceptable_size()) then return false;
+
+  return ebgp_export(remote_as);
+}
+```
+
+So here I'll assert that the prefix has to be either from the `RTS_BGP` source, or from the `RTS_STATIC` source. This latter source is what Bird uses for locally generated routes (ie. the ones in AS8298 itself). Locally generated routes are not known from BGP, but known instead because they are blackholed / null-routed on the router itself. And from these routes, I further deselect those prefixes that are too short or too long; the limits are slightly different per address family (IPv4 is anywhere between /8 and /24, and IPv6 anywhere between /12 and /48).
+
+Now, I will note that I've seen many operators who inject OSPF or connected or static routes into BGP, and all of those folks will have to maintain elaborate egress "bogon" route filters, for example for those IXP prefixes that they picked up due to them being directly connected. If those operators simply did not propagate directly connected routes, their life would be so much simpler ... but I digress and it's time for me to wrap up.
+
+## Epilog
+
+I hope this little dissertation proves useful for other Bird enthusiasts out there. I myself had to fiddle a bit over the years with the idiosyncrasies (and bugs) of Bird and Bird2. I wanted to make a few comments:
+
+1. Thanks to the crew at [Coloclue](https://coloclue.net/) for having a really phenomenal routing setup, with a lot of thoughtful documentation, action communities, and strict ingress and egress filtering. It's also fully automated and I've derived, although completely rewritten, my own automation based off of [Kees](https://github.com/coloclue/kees).
+1. I understand that the main distinction between inbound Peer and Upstream sessions is that for Peers, many folks will want to do strict filtering. I've considered this for a long time and ultimately decided against it, because a combination of max prefix, tier1 as-path filtering and RPKI filtering takes care of the most egregious mistakes, and otherwise I'm actually happy to get more prefixes via IXPs rather than fewer.
diff --git a/content/articles/2021-11-26-netgate-6100.md b/content/articles/2021-11-26-netgate-6100.md
new file mode 100644
index 0000000..021a0a2
--- /dev/null
+++ b/content/articles/2021-11-26-netgate-6100.md
@@ -0,0 +1,474 @@
+---
+date: "2021-11-26T08:51:23Z"
+title: 'Review: Netgate 6100'
+---
+
+* Author: Pim van Pelt <[pim@ipng.nl](mailto:pim@ipng.nl)>
+* Reviewed: Jim Thompson <[jim@netgate.com](mailto:jim@netgate.com)>
+* Status: Draft - Review - **Approved**
+
+A few weeks ago, Jim Thompson from Netgate stumbled across my [APU6 Post]({% post_url 2021-07-19-pcengines-apu6 %}) and introduced me to their new desktop router/firewall, the Netgate 6100. It currently ships with [pfSense Plus](https://www.netgate.com/pfsense-plus-software), but he mentioned that it's designed as well to run their [TNSR](https://www.netgate.com/tnsr) software, considering the device ships with 2x 1GbE SFP/RJ45 combo, 2x 10GbE SFP+, and 4x 2.5GbE RJ45 ports, and all network interfaces are Intel / DPDK capable chips. He asked me if I was willing to take it around the block with VPP, which of course I'd be happy to do, and here are my findings. The TNSR image isn't yet public for this device, but that's not a problem because [AS8298 runs VPP]({% post_url 2021-09-10-vpp-6 %}), so I'll just go ahead and install it myself ...
+
+# Executive Summary
+
+The Netgate 6100 router running pfSense has a single core performance of 623kpps and a total chassis throughput of 2.3Mpps, which is sufficient for line rate _in both directions_ at 1514b packets (1.58Mpps), about 6.2Gbit of _imix_ traffic, and about 419Mbit of 64b packets. Running Linux on the router yields very similar results.
+
+With VPP though, the router's single core performance leaps to 5.01Mpps at 438 CPU cycles/packet. This means that all three of 1514b, _imix_ and 64b packets can be forwarded at line rate in one direction on 10Gbit. Due to its Atom C3558 processor (which has 4 cores, 3 of which are dedicated to VPP's worker threads, and 1 to its main thread and controlplane running in Linux), achieving 10Gbit line rate in both directions when using 64 byte packets is not possible.
+
+Running at 19W and a total forwarding **capacity of 15.1Mpps**, it consumes only **_1.26µJ_ of energy per forwarded packet**, while at the same time easily handling a full BGP table with room to spare. I find this Netgate 6100 appliance pretty impressive, and when TNSR becomes available, performance will be similar to what I've tested here, at a price tag of USD 699,-
+
+## Detailed findings
+
+{{< image width="400px" float="left" src="/assets/netgate-6100/netgate-6100-back.png" alt="Netgate 6100" >}}
+
+The [Netgate 6100](https://www.netgate.com/blog/introducing-the-new-netgate-6100) ships with an Intel Atom C3558 CPU (4 cores, with AES-NI and QuickAssist), 8GB of memory and either 16GB of eMMC, or 128GB of NVME storage. The network cards are its main forte: it comes with 2x i354 gigabit combo (SFP and RJ45), 4x i225 ports (these are 2.5GbE), and 2x X553 10GbE ports with an SFP+ cage each, for a total of 8 ports and lots of connectivity.
+
+The machine is fanless and this is made possible by its power-efficient CPU: the Atom here runs at 16W TDP only, and the whole machine clocks in at a very efficient 19W. It comes with an external power brick, but only one power supply, so no redundancy, unfortunately. To make up for that small omission, here are a few nice touches that I noticed:
+* The power jack has a screw-on barrel - no more accidentally rebooting the machine when fumbling around under the desk.
+* There's both a Cisco RJ45 console port (115200,8n1), as well as a CP2102 onboard USB/serial connector, which means you can connect to its serial port as well with a standard issue micro-USB cable. Cool!
+
+### Battle of Operating Systems
+
+Netgate ships the device with pfSense - it's a pretty great appliance and massively popular - delivering firewall, router and VPN functionality to homes and small businesses across the globe. I myself am partial to BSD (albeit a bit more of the Puffy persuasion), but DPDK and VPP are more of a Linux cup of tea. So I'll have to deface this little guy, and reinstall it with Linux. My game plan is:
+
+1. Based on the shipped pfSense 21.05 (FreeBSD 12.2), do all the loadtests
+1. Reinstall the machine with Linux (Ubuntu 20.04.3), do all the loadtests
+1. Install VPP using my own [HOWTO]({% post_url 2021-09-21-vpp-7 %}), and do all the loadtests
+
+This allows for, I think, a pretty sweet comparison between FreeBSD, Linux, and DPDK/VPP. Now, on to a description of the defacing, err, reinstall process on this Netgate 6100 machine, as it was not as easy as I had anticipated (but is it ever easy, really?)
+ +{{< image width="400px" float="right" src="/assets/netgate-6100/blinkboot.png" alt="Blinkboot" >}} + +Turning on the device, it presents me with some BIOS firmware from Insyde Software which +is loading some software called _BlinkBoot_ [[ref](https://www.insyde.com/products/blinkboot)], which +in turn is loading modules called _Lenses_, pictured right. Anyway, this ultimately presents +me with a ***Press F2 for Boot Options***. Aha! That's exactly what I'm looking for. I'm really +grateful that Netgate decides to ship a device with a BIOS that will allow me to boot off of other +media, notably the USB stick in order to [reinstall pfSense](https://docs.netgate.com/pfsense/en/latest/solutions/netgate-6100/reinstall-pfsense.html) +but in my case, also to install another operating system entirely. + +My first approach was to get a default image to boot off of USB (the device has two USB3 ports on the +side). But none of the USB ports want to load my UEFI `bootx64.efi` prepared USB key. So my second +attempt was to prepare a PXE boot image, taking a few hints from Ubuntu's documentation [[ref](https://wiki.ubuntu.com/UEFI/PXE-netboot-install)]: + +``` +wget http://archive.ubuntu.com/ubuntu/dists/focal/main/installer-amd64/current/legacy-images/netboot/mini.iso +mv mini.iso /tmp/mini-focal.iso +grub-mkimage --format=x86_64-efi \ + --output=/var/tftpboot/grubnetx64.efi.signed \ + --memdisk=/tmp/mini-focal.iso \ + `ls /usr/lib/grub/x86_64-efi | sed -n 's/\.mod//gp'` +``` + +After preparing DHCPd and a TFTP server, and getting a slight feeling of being transported back in time +to the stone age, I see the PXE both request an IPv4 address, and the image I prepared. And, it boots, yippie! + +``` +Nov 25 14:52:10 spongebob dhcpd[43424]: DHCPDISCOVER from 90:ec:77:1b:63:55 via bond0 +Nov 25 14:52:11 spongebob dhcpd[43424]: DHCPOFFER on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0 +Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPREQUEST for 192.168.1.206 (192.168.1.254) from 90:ec:77:1b:63:55 via bond0 +Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPACK on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0 +Nov 25 15:04:56 spongebob tftpd[2076]: 192.168.1.206: read request for 'grubnetx64.efi.signed' +``` + +I took a peek while the `grubnetx64` was booting, and saw that the available output terminals +on this machine are `spkmodem morse gfxterm serial_efi0 serial_com0 serial cbmemc audio`, and that +the default/active one is `console`, so I make a note that Grub wants to run on 'console' (and +specifically NOT on 'serial', as is usual, see below for a few more details on this) while the Linux +kernel will of course be running on serial, so I have to add `console=ttyS0,115200n8` to the kernel boot +string before booting. + +Piece of cake, by which I mean I spent about four hours staring at the boot loader and failing to get +it quite right -- pro-tip: install OpenSSH and fix the GRUB and Kernel configs before finishing the +`mini.iso` install: + +``` +mount --bind /proc /target/proc +mount --bind /dev /target/dev +mount --bind /sys /target/sys +chroot /target /bin/bash + +# Install OpenSSH, otherwise the machine boots w/o access :) +apt update +apt install openssh-server + +# Fix serial for GRUB and Kernel +vi /etc/default/grub +## set GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0,115200n8" +## set GRUB_TERMINAL=console (and comment out the serial stuff) +grub-install /dev/sda +update-grub +``` + +Rebooting now brings me to Ubuntu: Pat on the back, Caveman Pim, you've still got it! 
+ +## Network Loadtest + +After that small but exciting detour, let me get back to the loadtesting. The choice of Intel's +network controller on this board allows me to use Intel's DPDK with relatively high +performance, compared to regular (kernel) based routing. I loadtested the stock firmware pfSense +(21.05, based on FreeBSD 12.2), Linux (Ubuntu 20.04.3), and VPP (22.02, [[ref](https://fd.io/)]). + +Specifically worth calling out is that while Linux and FreeBSD struggled in the packets-per-second +department, the use of DPDK in VPP meant absolutely no problems filling a unidirectional 10G stream +of "regular internet traffic" (referred to as `imix`), it was also able to fill _line rate_ with +"64b UDP packets", with just a little headroom there, but it ultimately struggled with _bidirectional_ +64b UDP packets. + +### Methodology + +For the loadtests, I used Cisco's T-Rex [[ref](https://trex-tgn.cisco.com/)] in stateless mode, +with a custom Python controller that ramps up and down traffic from the loadtester to the _device +under test_ (DUT) by sending traffic out `port0` to the DUT, and expecting that traffic to be +presented back out from the DUT to its `port1`, and vice versa (out from `port1` -> DUT -> back +in on `port0`). The loadtester first sends a few seconds of warmup, this is to ensure the DUT is +passing traffic and offers the ability to inspect the traffic before the actual rampup. Then +the loadtester ramps up linearly from zero to 100% of line rate (in this case, line rate is +10Gbps in both directions), finally it holds the traffic at full line rate for a certain +duration. If at any time the loadtester fails to see the traffic it's emitting return on its +second port, it flags the DUT as saturated; and this is noted as the maximum bits/second and/or +packets/second. + +Since my last loadtesting [post]({% post_url 2021-07-19-pcengines-apu6 %}), I've learned a lot +more about packet forwarding and how to make it easier or harder on the router. Let me go into a +few more details about the various loadtests that I've done here. + +#### Method 1: Single CPU Thread Saturation + +Most kernels (certainly OpenBSD, FreeBSD and Linux) will make use of multiple receive queues +if the network card supports it. The Intel NICs in this machine are all capable of _Receive Side +Scaling_ (RSS), which means the NIC can offload its packets into multiple queues. The kernel +will typically enable one queue for each CPU core -- the Atom has 4 cores, so 4 queues are +initialized, and inbound traffic is sent, typically using some hashing function, to individual +CPUs, allowing for a higher aggregate throughput. + +Mostly, this hashing function is based on some L3 or L4 payload, for example a hash over +the source IP/port and destination IP/port. So one interesting test is to send **the same packet** +over and over again -- the hash function will then return the same value for each packet, which +means all traffic goes into exactly one of the N available queues, and therefore handled by +only one core. 
+
+One such TRex stateless traffic profile is `udp_1pkt_simple.py` which, as the name implies, simply sends the same UDP packet from source IP/port and destination IP/port, padded with a bunch of 'x' characters, over and over again:
+
+```
+    packet = STLPktBuilder(pkt =
+        Ether()/
+        IP(src="16.0.0.1",dst="48.0.0.1")/
+        UDP(dport=12,sport=1025)/(10*'x')
+    )
+```
+
+#### Method 2: Rampup using trex-loadtest.py
+
+TRex ships with a very handy `bench.py` stateless traffic profile which, without any additional arguments, does the same thing as the above method. However, this profile optionally takes a few arguments, which are called _tunables_, notably:
+* ***size*** - set the size of the packets to either a number (ie. 64, the default, or any number up to a maximum of 9216 bytes), or the string `imix` which will send a traffic mix consisting of 60b, 590b and 1514b packets.
+* ***vm*** - set the packet source/dest generation. By default (when the flag is `None`), the same src (16.0.0.1) and dst (48.0.0.1) are set for each packet. When setting the value to `var1`, the source IP is incremented from `16.0.0.[4-254]`. If the value is set to `var2`, the source _and_ destination IP are incremented, the destination from `48.0.0.[4-254]`.
+
+So tinkering with the `vm` parameter is an excellent way of driving one or many receive queues. Armed with this, I will perform a loadtest with four modes of interest, from easier to more challenging:
+1. ***bench-var2-1514b***: multiple flows, ~815Kpps at 10Gbps; this is the easiest test to perform, as the traffic consists of large (1514 byte) packets, and both source and destination are different each time, which means lots of multiplexing across receive queues, and relatively few packets/sec.
+1. ***bench-var2-imix***: multiple flows, with a mix of 60, 590 and 1514b frames in a certain ratio. This yields what can be reasonably expected from _normal internet use_, just about 3.2Mpps at 10Gbps. This is the most representative test for normal use, but still the packet rate is quite low due to (relatively) large packets. Any respectable router should be able to perform well at an imix profile.
+1. ***bench-var2-64b***: Still multiple flows, but very small packets, 14.88Mpps at 10Gbps, often referred to as the theoretical maximum throughput on Tengig. Now it's getting harder, as the loadtester will fill the line with small packets (of 64 bytes, the smallest that an ethernet packet is allowed to be). This is a good way to see if the router vendor is actually capable of what is referred to as _line rate_ forwarding.
+1. ***bench***: Now restricted to a constant src/dst IP:port tuple, at the same rate of 14.88Mpps at 10Gbps, which means only one Rx queue (and thus, one CPU core) can be used. This is where single-core performance becomes relevant. Notably, vendors who boast many CPUs will often struggle with a test like this, in case any given CPU core cannot individually handle a full line rate. I'm looking at you, Tilera!
+
+Further to this list, I can send traffic in one direction only (TRex will emit this from its port0 and expect the traffic to be seen back at port1); or I can send it ***in both directions***. The latter will double the packet rate and bandwidth, to approx 29.7Mpps.
+
+***NOTE***: At these rates, TRex can be a bit fickle trying to fit all these packets into its own transmit queues, so I decide to drive it a bit less close to the cliff and stop at 97% of line rate (this is 28.3Mpps). 
It explains why lots of these loadtests top out at that number. + +### Results + +#### Method 1: Single CPU Thread Saturation + +Given the approaches above, for the first method I can "just" saturate the line and see how many packets +emerge through the DUT on the other port, so that's only 3 tests: + +Netgate 6100 | Loadtest | Throughput (pps) | L1 Throughput (bps) | % of linerate +------------- | --------------- | ---------------- | -------------------- | ------------- +pfSense | 64b 1-flow | 622.98 Kpps | 418.65 Mbps | 4.19% +Linux | 64b 1-flow | 642.71 Kpps | 431.90 Mbps | 4.32% +***VPP*** | ***64b 1-flow***| ***5.01 Mpps*** | ***3.37 Gbps*** | ***33.67%*** + +***NOTE***: The bandwidth figures here are so called _L1 throughput_ which means bits on the wire, as opposed +to _L2 throughput_ which means bits in the ethernet frame. This is relevant particularly at 64b loadtests as +the overhead for each ethernet frame is 20 bytes (7 bytes preamble, 1 byte start-frame, and 12 bytes inter-frame gap +[[ref](https://en.wikipedia.org/wiki/Ethernet_frame)]). At 64 byte frames, this is 31.25% overhead! It also +means that when L1 bandwidth is fully consumed at 10Gbps, that the observed L2 bandwidth will be only 7.62Gbps. + +#### Interlude - VPP efficiency + +In VPP it can be pretty cool to take a look at efficiency -- one of the main reasons why it's so quick is because +VPP will consume the entire core, and grab ***a set of packets*** from the NIC rather than do work for each +individual packet. VPP then advances the set of packets, called a _vector_, through a directed graph. The first +of these packets will result in the code for the current graph node to be fetched into the CPU's instruction cache, +and the second and further packets will make use of the warmed up cache, greatly improving per-packet efficiency. + +I can demonstrate this by running a 1kpps, 1Mpps and 10Mpps test against the VPP install on this router, and +observing how many CPU cycles each packet needs to get forwarded from the input interface to the output interface. +I expect this number _to go down_ when the machine has more work to do, due to the higher CPU i/d-cache hit rate. +Seeing the time spent in each of VPP's graph nodes, and for each individual worker thread (which correspond 1:1 +with CPU cores), can be done with `vppctl show runtime` command and some `awk` magic: + +``` +$ vppctl clear run && sleep 30 && vppctl show run | \ + awk '$2 ~ /active|polling/ && $4 > 25000 { + print $0; + if ($1=="ethernet-input") { packets = $4}; + if ($1=="dpdk-input") { dpdk_time = $6}; + total_time += $6 + } END { + print packets/30, "packets/sec, at",total_time,"cycles/packet,", + total_time-dpdk_time,"cycles/packet not counting DPDK" + }' +``` + +This gives me the following, somewhat verbose but super interesting output, which I've edited down to fit on screen, +and omit the columns that are not super relevant. Ready? Here we go! 
+ +``` +tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1kpps +Graph Node Name Clocks Vectors/Call +---------------------------------------------------------------- +TenGigabitEthernet3/0/1-output 6.07e2 1.00 +TenGigabitEthernet3/0/1-tx 8.61e2 1.00 +dpdk-input 1.51e6 0.00 +ethernet-input 1.22e3 1.00 +ip4-input-no-checksum 6.59e2 1.00 +ip4-load-balance 4.50e2 1.00 +ip4-lookup 5.63e2 1.00 +ip4-rewrite 5.83e2 1.00 +1000.17 packets/sec, at 1514943 cycles/packet, 4943 cycles/pkt not counting DPDK +``` + +I'll observe that a lot of time is spent in `dpdk-input`, because that is a node that is constantly polling +the network card, as fast as it can, to see if there's any work for it to do. Apparently not, because the average +vectors per call is pretty much zero, and considering that, most of the CPU time is going to sit in a nice "do +nothing". Because reporting CPU cycles spent doing nothing isn't particularly interesting, I shall report on +both the total cycles spent, that is to say including DPDK, and as well the cycles spent per packet in the +_other active_ nodes. In this case, at 1kpps, VPP is spending 4953 cycles on each packet. + +Now, take a look what happens when I raise the traffic to 1Mpps: + +``` +tui>start -f stl/udp_1pkt_simple.py -p 0 -m 1mpps +Graph Node Name Clocks Vectors/Call +---------------------------------------------------------------- +TenGigabitEthernet3/0/1-output 3.80e1 18.57 +TenGigabitEthernet3/0/1-tx 1.44e2 18.57 +dpdk-input 1.15e3 .39 +ethernet-input 1.39e2 18.57 +ip4-input-no-checksum 8.26e1 18.57 +ip4-load-balance 5.85e1 18.57 +ip4-lookup 7.94e1 18.57 +ip4-rewrite 7.86e1 18.57 +981830 packets/sec, at 1770.1 cycles/packet, 620 cycles/pkt not counting DPDK +``` + +Whoa! The system is now running the VPP loop with ~18.6 packets per vector, and you can clearly see that +the CPU efficiency went up greatly, from 4953 cycles/packet at 1kpps, to 620 cycles/packet at 1Mpps. +That's an order of magnitude improvement! + +Finally, let's give this Netgate 6100 router a run for its money, and slam it with 10Mpps: + +``` +tui>start -f stl/udp_1pkt_simple.py -p 0 -m 10mpps +Graph Node Name Clocks Vectors/Call +---------------------------------------------------------------- +TenGigabitEthernet3/0/1-output 1.41e1 256.00 +TenGigabitEthernet3/0/1-tx 1.23e2 256.00 +dpdk-input 7.95e1 256.00 +ethernet-input 6.74e1 256.00 +ip4-input-no-checksum 3.95e1 256.00 +ip4-load-balance 2.54e1 256.00 +ip4-lookup 4.12e1 256.00 +ip4-rewrite 4.78e1 256.00 +5.01426e+06 packets/sec, at 437.9 cycles/packet, 358 cycles/pkt not counting DPDK +``` + +And here is where I learn the maximum packets/sec that this one CPU thread can handle: ***5.01Mpps***, at which +point every packet is super efficiently handled at 358 CPU cycles each, or 13.8 times (4953/438) +as efficient under high load than when the CPU is unloaded. Sweet!! + +Another really cool thing to do here is derive the effective clock speed of the Atom CPU. We know it runs at +2200Mhz, and we're doing 5.01Mpps at 438 cycles/packet including the time spent in DPDK, which adds up to 2194MHz, +remarkable precision. Color me impressed :-) + +#### Method 2: Rampup using trex-loadtest.py + +For the second methodology, I have to perform a _lot_ of loadtests. In total, I'm testing 4 modes (1514b, imix, +64b-multi and 64b 1-flow), then take a look at unidirectional traffic and bidirectional traffic, and perform each +of these loadtests on pfSense, Ubuntu, and VPP with one, two or three Rx/Tx queues. That's a total of 40 +loadtests! 
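+(4 packet profiles, times 2 traffic directions, times 5 software configurations -- pfSense, Ubuntu, and VPP with one, two or three queues -- indeed makes 40 individual runs.)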
+
+Loadtest  | pfSense  | Ubuntu   | VPP 1Q  | VPP 2Q  | VPP 3Q    | Details
+--------- | -------- | -------- | ------- | ------- | --------- | ----------
+***Unidirectional*** |
+1514b     | 97%      | 97%      | 97%     | 97%     | ***97%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-unidirectional.html)]
+imix      | 61%      | 75%      | 96%     | 95%     | ***95%*** | [[graphs](/assets/netgate-6100/netgate-6100.imix-unidirectional.html)]
+64b       | 15%      | 17%      | 33%     | 66%     | ***96%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-unidirectional.html)]
+64b 1-flow| 4.4%     | 4.7%     | 33%     | 33%     | ***33%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-unidirectional.html)]
+***Bidirectional*** |
+1514b     | 192%     | 193%     | 193%    | 193%    | ***194%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-1514b-bidirectional.html)]
+imix      | 63%      | 71%      | 190%    | 190%    | ***191%*** | [[graphs](/assets/netgate-6100/netgate-6100.imix-bidirectional.html)]
+64b       | 15%      | 16%      | 61%     | 63%     | ***81%*** | [[graphs](/assets/netgate-6100/netgate-6100.bench-var2-64b-bidirectional.html)]
+64b 1-flow| 8.6%     | 9.0%     | 61%     | ***61%*** | 33% (+) | [[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]
+
+A picture says a thousand words - so I invite you to take a look at the interactive graphs from the table above. I'll cherrypick the one I find most interesting here:
+
+{{< image width="1000px" src="/assets/netgate-6100/bench-var2-64b-unidirectional.png" alt="Netgate 6100" >}}
+
+The graph above is of the unidirectional _64b_ loadtest. Some observations:
+* pfSense 21.05 (running FreeBSD 12.2, the bottom blue trace) and Ubuntu 20.04.3 (running Linux 5.13, the orange trace just above it) are equal performers. They handle full-sized (1514 byte) packets just fine, struggle a little bit with imix, and completely suck at 64b packets (shown here), in particular if only 1 CPU core can be used.
+* Even at 64b packets, VPP scales linearly from 33% of line rate with 1Q (the green trace), 66% with 2Q (the red trace) and 96% with 3Q (the purple trace, which makes it through to the end).
+* With VPP taking 3Q, one CPU is left over for the main thread and controlplane software like FRR or Bird2.
+
+## Caveats
+
+The unit was shipped courtesy of Netgate (thanks again, Jim, this was fun!) for the purposes of load- and systems integration testing and comparing their internal benchmarking with my findings. Other than that, this is not a paid endorsement and the views in this review are my own.
+
+One quirk I noticed is that while running VPP with 3Q and bidirectional traffic, performance is much worse than with 2Q or 1Q. This is not a fluke with the loadtest, as I have observed the same strange performance with other machines (Supermicro 5018D-FN8T for example). I confirmed that each VPP worker thread is used for each queue, so I would've expected ~15Mpps shared by both interfaces (so a per-direction linerate of ~50%), but I get 16.8% instead [[graphs](/assets/netgate-6100/netgate-6100.bench-bidirectional.html)]. I'll have to understand that better, but for now I'm releasing the data as-is.
+
+## Appendix
+
+### Generating the data
+
+You can find all of my loadtest runs in [this archive](/assets/netgate-6100/trex-loadtest-json.tar.gz). The archive contains the `trex-loadtest.py` script as well, for curious readers! 
+These JSON files can be fed directly into Michal's [visualizer](https://github.com/wejn/trex-loadtest-viz) +to plot interactive graphs (which I've done for the table above): + +``` +DEVICE=netgate-6100 +ruby graph.rb -t 'Netgate 6100 All Loadtests' -o ${DEVICE}.html netgate-*.json +for i in bench-var2-1514b bench-var2-64b bench imix; do + ruby graph.rb -t 'Netgate 6100 Unidirectional Loadtests' --only-channels 0 \ + netgate-*-${i}-unidi*.json -o ${DEVICE}.$i-unidirectional.html +done +for i in bench-var2-1514b bench-var2-64b bench imix; do + ruby graph.rb -t 'Netgate 6100 Bidirectional Loadtests' \ + netgate-*-${i}.json -o ${DEVICE}.$i-bidirectional.html +done +``` + +### Notes on pfSense + +I'm not a pfSense user, but I know my way around FreeBSD just fine. After installing the firmware, I +simply choose the 'Give me a Shell' option, and take it from there. The router will run `pf` out of +the box, and it is pretty complex, so I'll just configure some addresses, routes and disable the +firewall alltogether. That sounds just fair, as the same tests with Linux and VPP also do not use +a firewall (even though obviously, both VPP and Linux support firewalls just fine). + +``` +ifconfig ix0 inet 100.65.1.1/24 +ifconfig ix1 inet 100.65.2.1/24 +route add -net 16.0.0.0/8 100.65.1.2 +route add -net 48.0.0.0/8 100.65.2.2 +pfctl -d +``` + +### Notes on Linux + +When doing loadtests on Ubuntu, I have to ensure irqbalance is turned off, otherwise the kernel will +thrash around re-routing softirq's between CPU threads, and at the end of the day, I'm trying to saturate +all CPUs anyway, so balancing/moving them around doesn't make any sense. Further, Linux wants to configure +a static ARP entry for the interfaces from TRex: + +``` +sudo systemctl disable irqbalance +sudo systemctl stop irqbalance +sudo systemctl mask irqbalance + +sudo ip addr add 100.65.1.1/24 dev enp3s0f0 +sudo ip addr add 100.65.2.1/24 dev enp3s0f1 +sudo ip nei replace 100.65.1.2 lladdr 68:05:ca:32:45:94 dev enp3s0f0 ## TRex port0 +sudo ip nei replace 100.65.2.2 lladdr 68:05:ca:32:45:95 dev enp3s0f1 ## TRex port1 +sudo ip ro add 16.0.0.0/8 via 100.65.1.2 +sudo ip ro add 48.0.0.0/8 via 100.65.2.2 +``` + +On Linux, I now see a reasonable spread of IRQs by CPU while doing a unidirectional loadtest: +``` +root@netgate:/home/pim# cat /proc/softirqs + CPU0 CPU1 CPU2 CPU3 + HI: 3 0 0 1 + TIMER: 203788 247280 259544 401401 + NET_TX: 8956 8373 7836 6154 + NET_RX: 22003822 19316480 22526729 19430299 + BLOCK: 2545 3153 2430 1463 + IRQ_POLL: 0 0 0 0 + TASKLET: 5084 60 1830 23 + SCHED: 137647 117482 56371 49112 + HRTIMER: 0 0 0 0 + RCU: 11550 9023 8975 8075 +``` + diff --git a/content/articles/2021-12-23-vpp-playground.md b/content/articles/2021-12-23-vpp-playground.md new file mode 100644 index 0000000..037595c --- /dev/null +++ b/content/articles/2021-12-23-vpp-playground.md @@ -0,0 +1,529 @@ +--- +date: "2021-12-23T17:58:14Z" +title: VPP Linux CP - Virtual Machine Playground +--- + + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. One thing notably missing, is the higher level control plane, that is +to say: there is no OSPF or ISIS, BGP, LDP and the like. 
This series of posts details my work on a +VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network +devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols +like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet +forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use +VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the +interface state (links, addresses and routes) itself. When the plugin is completed, running software +like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving +>100Mpps and >100Gbps forwarding rates will be well in reach! + +Before we head off into the end of year holidays, I thought I'd make good on a promise I made a while +ago, and that's to explain how to create a Debian (Buster or Bullseye), or Ubuntu (Focal Fossai LTS) +virtual machine running in Qemu/KVM into a working setup with both [Free Range Routing](https://frrouting.org/) +and [Bird](https://bird.network.cz/) installed side by side. + +**NOTE**: If you're just interested in the resulting image, here's the most pertinent information: +> * ***vpp-proto.qcow2.lrz [[Download](https://ipng.ch/media/vpp-proto/vpp-proto-bookworm-20231015.qcow2.lrz)]*** +> * ***SHA256*** `bff03a80ccd1c0094d867d1eb1b669720a1838330c0a5a526439ecb1a2457309` +> * ***Debian Bookworm (12.4)*** and ***VPP 24.02-rc0~46-ga16463610e*** +> * ***CPU*** Make sure the (virtualized) CPU supports AVX +> * ***RAM*** The image needs at least 4GB of RAM, and the hypervisor should support hugepages and AVX +> * ***Username***: `ipng` with ***password***: `ipng loves vpp` and is sudo-enabled +> * ***Root Password***: `IPng loves VPP` + +Of course, I do recommend that you change the passwords for the `ipng` and `root` user as soon as you +boot the VM. I am offering the KVM images as-is and without any support. [Contact](/s/contact/) us if +you'd like to discuss support on commission. + +## Reminder - Linux CP + +[Vector Packet Processing](https://fd.io/) by itself offers _only_ a dataplane implementation, that is +to say it cannot run controlplane software like OSPF, BGP, LDP etc out of the box. However, VPP allows +_plugins_ to offer additional functionalty. Rather than adding the routing protocols as VPP plugins, +I much rather leverage high quality and well supported community efforts like [FRR](https://frrouting.org/) +or [Bird](https://bird.network.cz/). + +I wrote a series of in-depth [articles](/s/articles/) explaining in detail the design and implementation, +but for the purposes of this article, I will keep it brief. The Linux Control Plane (LCP) is a set of two +plugins: + +1. The ***Interface plugin*** is responsible for taking VPP interfaces (like ethernet, tunnel, bond) + and exposing them in Linux as a TAP device. When configuration such as link MTU, state, MAC + address or IP address are applied in VPP, the plugin will copy this forward into the host + interface representation. +1. The ***Netlink plugin*** is responsible for taking events in Linux (like a user setting an IP address + or route, or the system receiving ARP or IPv6 neighbor request/reply from neighbors), and applying + these events to the VPP dataplane. 
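+
+To make the division of labour between these two plugins concrete, here's a minimal sketch of what the
+mirroring looks like once an interface pair exists. The interface `GigabitEthernet10/0/0`, its Linux
+counterpart `e0` and the `dataplane` network namespace are the names used later in this article; treat
+them as placeholders for your own setup:
+
+```
+## Interface plugin: changes made on the VPP side are copied into the Linux TAP device.
+vppctl set interface mtu packet 1500 GigabitEthernet10/0/0
+vppctl set interface state GigabitEthernet10/0/0 up
+
+## Netlink plugin: addresses and routes configured in Linux are copied into the VPP FIB.
+sudo ip -n dataplane address add 198.51.100.1/24 dev e0
+sudo ip -n dataplane route add 203.0.113.0/24 via 198.51.100.254
+vppctl show interface addr GigabitEthernet10/0/0
+vppctl show ip fib 203.0.113.0/24
+```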
+ +I've published the code on [Github](https://github.com/pimvanpelt/lcpng/) and I am targeting a release +in upstream VPP, hoping to make the upcoming 22.02 release in February 2022. I have a lot of ground to +cover, but I will note that the plugin has been running in production in [AS8298]({% post_url 2021-02-27-network %}) +since Sep'21 and no crashes related to LinuxCP have been observed. + +To help tinkerers, this article describes a KVM disk image in _qcow2_ format, which will boot a vanilla +Debian install and further comes pre-loaded with a fully functioning VPP, LinuxCP and both FRR and Bird +controlplane environment. I'll go into detail on precisely how you can build your own. Of course, you're +welcome to just take the results of this work and download the `qcow2` image above. + +### Building the Debian KVM image + +In this section I'll try to be precise in the steps I took to create the KVM _qcow2_ image, in case you're +interested in reproducing for yourself. Overall, I find that reading about how folks build images teaches +me a lot about the underlying configurations, and I'm as well keen on remembering how to do it myself, +so this article serves as well as reference documentation for IPng Networks in case we want to build +images in the future. + +#### Step 1. Install Debian + +For this, I'll use `virt-install` completely on the prompt of my workstation, a Linux machine which +is running Ubuntu Hirsute (21.04). Assuming KVM is installed [ref](https://help.ubuntu.com/community/KVM/Installation) +and already running, let's build a simple Debian Bullseye _qcow2_ bootdisk: + +``` +pim@chumbucket:~$ sudo apt-get install qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils +pim@chumbucket:~$ sudo apt-get install virtinst +pim@chumbucket:~$ sudo adduser `id -un` libvirt +pim@chumbucket:~$ sudo adduser `id -un` kvm + +pim@chumbucket:~$ qemu-img create -f qcow2 vpp-proto.qcow2 8G +pim@chumbucket:~$ virt-install --virt-type kvm --name vpp-proto \ + --location http://deb.debian.org/debian/dists/bullseye/main/installer-amd64/ \ + --os-variant debian10 \ + --disk /home/pim/vpp-proto.qcow2,bus=virtio \ + --memory 4096 \ + --graphics none \ + --network=bridge:mgmt \ + --console pty,target_type=serial \ + --extra-args "console=ttyS0" \ + --check all=off +``` + +_Note_: You may want to use a different network bridge, commonly `bridge:virbr0`. In my case, the +network which runs DHCP is on a bridge called `mgmt`. And, just for pedantry, it's good to make +yourself a member of groups `kvm` and `libvirt` so that you can run most `virsh` commands as an +unprivileged user. + +During the Debian Bullseye install, I try to leave everything as vanilla as possible, but I do +enter the following specifics: +* ***Root Password*** the string `IPng loves VPP` +* ***User*** login `ipng` with password `ipng loves vpp` +* ***Disk*** is entirely in one partition / (all 8GB of it), no swap +* ***Software selection*** remove everything but `SSH server` and `standard system utilities` + +When the machine is done installing, it'll reboot and I'll log in as root to install a few packages, +most notably `sudo` which will allow the user `ipng` to act as root. The other seemingly weird +packages are to help the VPP install along later. 
+ +``` +root@vpp-proto:~# apt install rsync net-tools traceroute snmpd snmp iptables sudo gnupg2 \ + curl libmbedcrypto3 libmbedtls12 libmbedx509-0 libnl-3-200 libnl-route-3-200 \ + libnuma1 python3-cffi python3-cffi-backend python3-ply python3-pycparser libsubunit0 +root@vpp-proto:~# adduser ipng sudo +root@vpp-proto:~# poweroff +``` + +Finally, after I stop the VM, I'll edit its XML config to give it a few VirtIO NICs to play with, +nicely grouped on the same virtual PCI bus/slot. I look for the existing `` block that +`virt-install` added for me, and add four new ones under that, all added to a newly created bridge +called `empty`, for now: + +``` +pim@chumbucket:~$ sudo brctl addbr empty +pim@chumbucket:~$ virsh edit vpp-proto + + + + + + +
+    <!-- one stanza per VirtIO NIC, all on bridge 'empty' and grouped on the same
+         virtual PCI bus; the PCI slot numbers are illustrative placeholders and
+         libvirt will generate MAC addresses automatically when none are given -->
+    <interface type='bridge'>
+      <source bridge='empty'/>
+      <model type='virtio'/>
+      <address type='pci' domain='0x0000' bus='0x10' slot='0x01' function='0x0'/>
+    </interface>
+    <interface type='bridge'>
+      <source bridge='empty'/>
+      <model type='virtio'/>
+      <address type='pci' domain='0x0000' bus='0x10' slot='0x02' function='0x0'/>
+    </interface>
+    <interface type='bridge'>
+      <source bridge='empty'/>
+      <model type='virtio'/>
+      <address type='pci' domain='0x0000' bus='0x10' slot='0x03' function='0x0'/>
+    </interface>
+    <interface type='bridge'>
+      <source bridge='empty'/>
+      <model type='virtio'/>
+      <address type='pci' domain='0x0000' bus='0x10' slot='0x04' function='0x0'/>
+    </interface>
+ +pim@chumbucket:~$ virsh start --console vpp-proto +``` + +And with that, I have a lovely virtual machine to play with, serial and all, beautiful! +![VPP Proto](/assets/vpp-proto/vpp-proto.png) + +#### Step 2. Compile VPP + Linux CP + +Compiling DPDK and VPP can both take a while, and to avoid cluttering the virtual machine, I'll do +this step on my buildfarm and copy the resulting Debian packages back onto the VM. + +This step simply follows [VPP's doc](https://fdio-vpp.readthedocs.io/en/latest/gettingstarted/developers/building.html) +but to recap the individual steps here, I will: +* use Git to check out both VPP and my plugin +* ensure all Debian dependencies are installed +* build DPDK libraries as a Debian package +* build VPP and its plugins (including LinuxCP) +* finally, build a set of Debian packages out of the VPP, Plugins, DPDK, etc. + +The resulting Packages will work both on Debian (Buster and Bullseye) as well as Ubuntu (Focal, 20.04). +So grab a cup of tea, while we let Rhino stretch its legs, ehh, CPUs ... + +``` +pim@rhino:~$ mkdir -p ~/src +pim@rhino:~$ cd ~/src +pim@rhino:~/src$ sudo apt install libmnl-dev +pim@rhino:~/src$ git clone https://github.com/pimvanpelt/lcpng.git +pim@rhino:~/src$ git clone https://gerrit.fd.io/r/vpp +pim@rhino:~/src$ ln -s ~/src/lcpng ~/src/vpp/src/plugins/lcpng +pim@rhino:~/src$ cd ~/src/vpp +pim@rhino:~/src/vpp$ make install-deps +pim@rhino:~/src/vpp$ make install-ext-deps +pim@rhino:~/src/vpp$ make build-release +pim@rhino:~/src/vpp$ make pkg-deb +``` + +Which will yield the following Debian packages, would you believe that, at exactly leet-o'clock :-) +``` +pim@rhino:~/src/vpp$ ls -hSl build-root/*.deb +-rw-r--r-- 1 pim pim 71M Dec 23 13:37 build-root/vpp-dbg_22.02-rc0~421-ge6387b2b9_amd64.deb +-rw-r--r-- 1 pim pim 4.7M Dec 23 13:37 build-root/vpp_22.02-rc0~421-ge6387b2b9_amd64.deb +-rw-r--r-- 1 pim pim 4.2M Dec 23 13:37 build-root/vpp-plugin-core_22.02-rc0~421-ge6387b2b9_amd64.deb +-rw-r--r-- 1 pim pim 3.7M Dec 23 13:37 build-root/vpp-plugin-dpdk_22.02-rc0~421-ge6387b2b9_amd64.deb +-rw-r--r-- 1 pim pim 1.3M Dec 23 13:37 build-root/vpp-dev_22.02-rc0~421-ge6387b2b9_amd64.deb +-rw-r--r-- 1 pim pim 308K Dec 23 13:37 build-root/vpp-plugin-devtools_22.02-rc0~421-ge6387b2b9_amd64.deb +-rw-r--r-- 1 pim pim 173K Dec 23 13:37 build-root/libvppinfra_22.02-rc0~421-ge6387b2b9_amd64.deb +-rw-r--r-- 1 pim pim 138K Dec 23 13:37 build-root/libvppinfra-dev_22.02-rc0~421-ge6387b2b9_amd64.deb +-rw-r--r-- 1 pim pim 27K Dec 23 13:37 build-root/python3-vpp-api_22.02-rc0~421-ge6387b2b9_amd64.deb +``` + +I've copied these packages to our `vpp-proto` image in `~ipng/packages/`, where I'll simply install +them using `dpkg`: + +``` +ipng@vpp-proto:~$ sudo mkdir -p /var/log/vpp +ipng@vpp-proto:~$ sudo dpkg -i ~/packages/*.deb +ipng@vpp-proto:~$ sudo adduser `id -un` vpp +``` + +I'll configure 2GB of hugepages and 64MB of netlink buffer size - see my [VPP #7]({% post_url 2021-09-21-vpp-7 %}) +post for more details and lots of background information: + +``` +ipng@vpp-proto:~$ cat << EOF | sudo tee /etc/sysctl.d/80-vpp.conf +vm.nr_hugepages=1024 +vm.max_map_count=3096 +vm.hugetlb_shm_group=0 +kernel.shmmax=2147483648 +EOF + +ipng@vpp-proto:~$ cat << EOF | sudo tee /etc/sysctl.d/81-vpp-netlink.conf +net.core.rmem_default=67108864 +net.core.wmem_default=67108864 +net.core.rmem_max=67108864 +net.core.wmem_max=67108864 +EOF + +ipng@vpp-proto:~$ sudo sysctl -p -f /etc/sysctl.d/80-vpp.conf +ipng@vpp-proto:~$ sudo sysctl -p -f /etc/sysctl.d/81-vpp-netlink.conf +``` + +Next, 
I'll create a network namespace for VPP and associated controlplane software to run in, this is because +VPP will want to create its TUN/TAP devices separate from the _default_ namespace: + +``` +ipng@vpp-proto:~$ cat << EOF | sudo tee /usr/lib/systemd/system/netns-dataplane.service +[Unit] +Description=Dataplane network namespace +After=systemd-sysctl.service network-pre.target +Before=network.target network-online.target + +[Service] +Type=oneshot +RemainAfterExit=yes + +# PrivateNetwork will create network namespace which can be +# used in JoinsNamespaceOf=. +PrivateNetwork=yes + +# To set `ip netns` name for this namespace, we create a second namespace +# with required name, unmount it, and then bind our PrivateNetwork +# namespace to it. After this we can use our PrivateNetwork as a named +# namespace in `ip netns` commands. +ExecStartPre=-/usr/bin/echo "Creating dataplane network namespace" +ExecStart=-/usr/sbin/ip netns delete dataplane +ExecStart=-/usr/bin/mkdir -p /etc/netns/dataplane +ExecStart=-/usr/bin/touch /etc/netns/dataplane/resolv.conf +ExecStart=-/usr/sbin/ip netns add dataplane +ExecStart=-/usr/bin/umount /var/run/netns/dataplane +ExecStart=-/usr/bin/mount --bind /proc/self/ns/net /var/run/netns/dataplane +# Apply default sysctl for dataplane namespace +ExecStart=-/usr/sbin/ip netns exec dataplane /usr/lib/systemd/systemd-sysctl +ExecStop=-/usr/sbin/ip netns delete dataplane + +[Install] +WantedBy=multi-user.target +WantedBy=network-online.target +EOF +ipng@vpp-proto:~$ sudo systemctl enable netns-dataplane +ipng@vpp-proto:~$ sudo systemctl start netns-dataplane +``` + +Finally, I'll add a useful startup configuration for VPP (note the comment on `poll-sleep-usec` +which slows down the DPDK poller, making it a little bit milder on the CPU: +``` +ipng@vpp-proto:~$ cd /etc/vpp +ipng@vpp-proto:/etc/vpp$ sudo cp startup.conf startup.conf.orig +ipng@vpp-proto:/etc/vpp$ cat << EOF | sudo tee startup.conf +unix { + nodaemon + log /var/log/vpp/vpp.log + cli-listen /run/vpp/cli.sock + gid vpp + + ## This makes VPP sleep 1ms between each DPDK poll, greatly + ## reducing CPU usage, at the expense of latency/throughput. 
+ poll-sleep-usec 1000 + + ## Execute all CLI commands from this file upon startup + exec /etc/vpp/bootstrap.vpp +} + +api-trace { on } +api-segment { gid vpp } +socksvr { default } + +memory { + main-heap-size 512M + main-heap-page-size default-hugepage +} + +buffers { + buffers-per-numa 128000 + default data-size 2048 + page-size default-hugepage +} + +statseg { + size 1G + page-size default-hugepage + per-node-counters off +} + +plugins { + plugin lcpng_nl_plugin.so { enable } + plugin lcpng_if_plugin.so { enable } +} + +logging { + default-log-level info + default-syslog-log-level notice + class linux-cp/if { rate-limit 10000 level debug syslog-level debug } + class linux-cp/nl { rate-limit 10000 level debug syslog-level debug } +} + +lcpng { + default netns dataplane + lcp-sync + lcp-auto-subint +} +EOF + +ipng@vpp-proto:/etc/vpp$ cat << EOF | sudo tee bootstrap.vpp +comment { Create a loopback interface } +create loopback interface instance 0 +lcp create loop0 host-if loop0 +set interface state loop0 up +set interface ip address loop0 2001:db8::1/64 +set interface ip address loop0 192.0.2.1/24 + +comment { Create Linux Control Plane interfaces } +lcp create GigabitEthernet10/0/0 host-if e0 +lcp create GigabitEthernet10/0/1 host-if e1 +lcp create GigabitEthernet10/0/2 host-if e2 +lcp create GigabitEthernet10/0/3 host-if e3 +EOF + +ipng@vpp-proto:/etc/vpp$ sudo systemctl restart vpp +``` + +After all of this, the following screenshot is a reasonable confirmation of success. +![VPP Interfaces + LCP](/assets/vpp-proto/vppctl-ip-link.png) + +#### Step 3. Install / Configure FRR + +Debian Bullseye ships with FRR 7.5.1, which will be fine. But for completeness, I'll point out that +FRR maintains their own Debian package [repo](https://deb.frrouting.org/) as well, and they're currently +releasing FRR 8.1 as stable, so I opt to install that one instead: +``` +ipng@vpp-proto:~$ curl -s https://deb.frrouting.org/frr/keys.asc | sudo apt-key add - +ipng@vpp-proto:~$ FRRVER="frr-stable" +ipng@vpp-proto:~$ echo deb https://deb.frrouting.org/frr $(lsb_release -s -c) $FRRVER | \ + sudo tee -a /etc/apt/sources.list.d/frr.list +ipng@vpp-proto:~$ sudo apt update && sudo apt install frr frr-pythontools +ipng@vpp-proto:~$ sudo adduser `id -un` frr +ipng@vpp-proto:~$ sudo adduser `id -un` frrvty +``` + +After installing, FRR will start up in the _default_ network namespace, but I'm going to be using +VPP in a custom namespace called `dataplane`. FRR after version 7.5 can work with multiple namespaces +[ref](http://docs.frrouting.org/en/stable-8.1/setup.html?highlight=pathspace%20netns#network-namespaces) +which boils down to adding the following `daemons` file: + +``` +ipng@vpp-proto:~$ cat << EOF | sudo tee /etc/frr/daemons +bgpd=yes +ospfd=yes +ospf6d=yes +bfdd=yes + +vtysh_enable=yes +watchfrr_options="--netns=dataplane" +zebra_options=" -A 127.0.0.1 -s 67108864" +bgpd_options=" -A 127.0.0.1" +ospfd_options=" -A 127.0.0.1" +ospf6d_options=" -A ::1" +staticd_options="-A 127.0.0.1" +bfdd_options=" -A 127.0.0.1" +EOF +ipng@vpp-proto:~$ sudo systemctl restart frr +``` + +After restarting FRR with this _namespace_ aware configuration, I can check to ensure it found +the `loop0` and `e0-3` interfaces VPP defined above. Let's take a look, while I set link `e0` +up and give it an IPv4 address. I'll do this in the `dataplane` namespace, and expect that FRR +picks this up as it's monitoring the netlink messages in that namespace as well: + +![VPP VtySH](/assets/vpp-proto/vpp-frr.png) + +#### Step 4. 
Install / Configure Bird2 + +Installing Bird2 is straight forward, although as with FRR above, after installing it'll want to +run in the _default_ namespace, which we ought to change. And as well, let's give it a bit of a +default configuration to get started: + +``` +ipng@vpp-proto:~$ sudo apt-get install bird2 +ipng@vpp-proto:~$ sudo systemctl stop bird +ipng@vpp-proto:~$ sudo systemctl disable bird +ipng@vpp-proto:~$ sudo systemctl mask bird +ipng@vpp-proto:~$ sudo adduser `id -un` bird +``` + +Then, I create a systemd unit for Bird running in the dataplane: +``` +ipng@vpp-proto:~$ sed -e 's,ExecStart=,ExecStart=/usr/sbin/ip netns exec dataplane ,' < \ + /usr/lib/systemd/system/bird.service | sudo tee /usr/lib/systemd/system/bird-dataplane.service +ipng@vpp-proto:~$ sudo systemctl enable bird-dataplane +``` + +And, finally, I create some reasonable default config and start bird in the dataplane namespace: +``` +ipng@vpp-proto:~$ cd /etc/bird +ipng@vpp-proto:/etc/bird$ sudo cp bird.conf bird.conf.orig +ipng@vpp-proto:/etc/bird$ cat << EOF | sudo tee bird.conf +router id 192.0.2.1; + +protocol device { scan time 30; } +protocol direct { ipv4; ipv6; check link yes; } +protocol kernel kernel4 { + ipv4 { import none; export where source != RTS_DEVICE; }; + learn off; + scan time 300; +} +protocol kernel kernel6 { + ipv6 { import none; export where source != RTS_DEVICE; }; + learn off; + scan time 300; +} +EOF + +ipng@vpp-proto:/usr/lib/systemd/system$ sudo systemctl start bird-dataplane +``` + +And the results work quite similar to FRR, due to the VPP plugins working via Netlink, +basically any program that operates in the _dataplane_ namespace can interact with the +kernel TAP interfaces, create/remove links, set state and MTU, add/remove IP addresses +and routes: + +![VPP Bird2](/assets/vpp-proto/vpp-bird.png) + +### Choosing FRR or Bird + +At IPng Networks, we have historically, and continue to use Bird as our routing system +of choice. But I totally realize the potential of FRR, in fact its implementation of LDP +is what may drive me onto the platform after all, as I'd love to add MPLS support to the +LinuxCP plugin at some point :-) + +By default the KVM image comes with **both FRR and Bird enabled**. This is OK because there +is no configuration on them yet, and they won't be in each others' way. It makes sense for +users of the image to make a conscious choice which of the two they'd like to use, and simply +disable and mask the other one: + +#### If FRR is your preference: +``` +ipng@vpp-proto:~$ sudo systemctl stop bird-dataplane +ipng@vpp-proto:~$ sudo systemctl disable bird-dataplane +ipng@vpp-proto:~$ sudo systemctl mask bird-dataplane +ipng@vpp-proto:~$ sudo systemctl unmask frr +ipng@vpp-proto:~$ sudo systemctl enable frr +ipng@vpp-proto:~$ sudo systemctl start frr +``` + +#### If Bird is your preference: +``` +ipng@vpp-proto:~$ sudo systemctl stop frr +ipng@vpp-proto:~$ sudo systemctl disable frr +ipng@vpp-proto:~$ sudo systemctl mask frr +ipng@vpp-proto:~$ sudo systemctl unmask bird-dataplane +ipng@vpp-proto:~$ sudo systemctl enable bird-dataplane +ipng@vpp-proto:~$ sudo systemctl start bird-dataplane +``` + +And with that, I hope to have given you a good overview of what comes into play when +installing a Debian machine with VPP, my LinuxCP plugin, and FRR or Bird: Happy hacking! + +### One last thing .. + +After I created the KVM image, I made a qcow2 snapshot of it in pristine state. 
This means +you can mess around with the VM, and easily revert to that pristine state without having +to download the image again. You can also add some customization (as I've done for our own +VPP Lab at IPng Networks) and set another snapshot and roll forwards and backwards between +them. The syntax is: + +``` +## Create a named snapshot +pim@chumbucket:~$ qemu-img snapshot -c pristine vpp-proto.qcow2 + +## List snapshots in the image +pim@chumbucket:~$ qemu-img snapshot -l vpp-proto.qcow2 +Snapshot list: +ID TAG VM SIZE DATE VM CLOCK ICOUNT +1 pristine 0 B 2021-12-23 17:52:36 00:00:00.000 0 + +## Revert to the named snapshot +pim@chumbucket:~$ qemu-img snapshot -a pristine vpp-proto.qcow2 + +## Delete the named snapshot +pim@chumbucket:~$ qemu-img snapshot -d pristine vpp-proto.qcow2 +``` diff --git a/content/articles/2022-01-12-vpp-l2.md b/content/articles/2022-01-12-vpp-l2.md new file mode 100644 index 0000000..2d0629c --- /dev/null +++ b/content/articles/2022-01-12-vpp-l2.md @@ -0,0 +1,682 @@ +--- +date: "2022-01-12T18:35:14Z" +title: Case Study - Virtual Leased Line (VLL) in VPP +--- + + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. + +After completing the Linux CP plugin, interfaces and their attributes such as addresses and routes +can be shared between VPP and the Linux kernel in a clever way, so running software like +[FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving +>100Mpps and >100Gbps forwarding rates are easily in reach! + +If you've read my previous articles (thank you!), you will have noticed that I have done a lot of +work on making VPP work well in an ISP (BGP/OSPF) environment with Linux CP. However, there's many other cool +things about VPP that make it a very competent advanced services router. One that that has always been +super interesting to me, is being able to offer L2 connectivity over wide-area-network. For example, a +virtual leased line from our Colo in Zurich to Amsterdam NIKHEF. This article explores this space. + +***NOTE***: If you're only interested in results, scroll all the way down to the markdown table and +graph for performance stats. + +## Introduction + +ISPs can offer ethernet services, often called _Virtual Leased Lines_ (VLLs), _Layer2 VPN_ (L2VPN) +or _Ethernet Backhaul_. They mean the same thing: imagine a switchport in location A that appears +to be transparently and directly connected to a switchport in location B, with the ISP (layer3, so +IPv4 and IPv6) network in between. The "simple" old-school setup would be to have switches which +define VLANs and are all interconnected. But we collectively learned that it's a bad idea for several +reasons: + +* Large broadcast domains tend to encouter L2 forwarding loops sooner rather than later +* Spanning-Tree and its kin are a stopgap, but they often disable an entire port from forwarding, + which can be expensive if that port is connected to a dark fiber into another datacenter far + away. 
+* Large VLAN setups that are intended to interconnect with other operators run into overlapping + VLAN tags, which means switches have to do tag rewriting and filtering and such. +* Traffic engineering is all but non-existent in L2-only networking domains, while L3 has all sorts + of smart TE extensions, ECMP, and so on. + +The canonical solution is for ISPs to encapsulate the ethernet traffic of their customers in some +tunneling mechanism, for example in MPLS or in some IP tunneling protocol. Fundamentally, these are +the same, except for the chosen protocol and overhead/cost of forwarding. MPLS is a very thin layer +under the packet, but other IP based tunneling mechanisms exist, commonly used are GRE, VXLAN and GENEVE +but many others exist. + +They all work roughly the same: +* An IP packet has a _maximum transmission unit_ (MTU) of 1500 bytes, while the ethernet header is + typically an additional 14 bytes: a 6 byte source MAC, 6 byte destination MAC, and 2 byte ethernet + type, which is 0x0800 for an IPv4 datagram, 0x0806 for ARP, and 0x86dd for IPv6, and many others + [[ref](https://en.wikipedia.org/wiki/EtherType)]. + * If VLANs are used, an additional 4 bytes are needed [[ref](https://en.wikipedia.org/wiki/IEEE_802.1Q)] + making the ethernet frame at most 1518 bytes long, with an ethertype of 0x8100. + * If QinQ or QinAD are used, yet again 4 bytes are needed [[ref](https://en.wikipedia.org/wiki/IEEE_802.1ad)], + making the ethernet frame at most 1522 bytes long, with an ethertype of either 0x8100 or 0x9100, + depending on the implementation. +* We can take such an ethernet frame, and make it the _payload_ of another IP packet, encapsulating the original + ethernet frame in a new IPv4 or IPv6 packet. We can then route it over an IP network to a remote + site. +* Upon receipt of such a packet, by looking at the headers the remote router can determine that this + packet represents an encapsulated ethernet frame, unpack it all, and forward the original frame onto a + given interface. + +### IP Tunneling Protocols + +First let's get some theory out of the way -- I'll discuss three common IP tunneling protocols here, and then +move on to demonstrate how they are configured in VPP and perhaps more importantly, how they perform in VPP. +Each tunneling protocol has its own advantages and disadvantages, but I'll stick to the basics first: + +#### GRE: Generic Routing Encapsulation + +_Generic Routing Encapsulation_ (GRE, described in [RFC2784](https://datatracker.ietf.org/doc/html/rfc2784)) is a +very old and well known tunneling protocol. The packet is an IP datagram with protocol number 47, consisting +of a header with 4 bits of flags, 8 reserved bits, 3 bits for the version (normally set to all-zeros), and +16 bits for the inner protocol (ether)type, so 0x0800 for IPv4, 0x8100 for 802.1q and so on. It's a very +small header of only 4 bytes and an optional key (4 bytes) and sequence number (also 4 bytes) whieah means +that to be able to transport any ethernet frame (including the fancy QinQ and QinAD ones), the _underlay_ +must have an end to end MTU of at least 1522 + 20(IPv4)+12(GRE) = ***1554 bytes for IPv4*** and ***1574 +bytes for IPv6***. 
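+
+Since this overhead arithmetic is easy to get wrong, here is the same calculation spelled out, assuming
+the worst case QinQ/QinAD frame of 1522 bytes and a GRE header that carries both the optional key and
+the sequence number:
+
+```
+## inner frame + outer IP header + GRE base header + key + sequence number
+echo $(( 1522 + 20 + 4 + 4 + 4 ))   ## IPv4 underlay: 1554 bytes
+echo $(( 1522 + 40 + 4 + 4 + 4 ))   ## IPv6 underlay: 1574 bytes
+```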
+ +#### VXLAN: Virtual Extensible LAN + +_Virtual Extensible LAN_ (VXLAN, described in [RFC7348](https://datatracker.ietf.org/doc/html/rfc7348)) +is a UDP datagram which has a header consisting of 8 bits worth +of flags, 24 bits reserved for future expansion, 24 bits of _Virtual Network Identifier_ (VNI) and +an additional 8 bits or reserved space at the end. It uses UDP port 4789 as assigned by IANA. VXLAN +encapsulation adds 20(IPv4)+8(UDP)+8(VXLAN) = 36 bytes, and considering IPv6 is 40 bytes, there it +adds 56 bytes. This means that to be able to transport any ethernet frame, the _underlay_ +network must have an end to end MTU of at least 1522+36 = ***1558 bytes for IPv4*** and ***1578 bytes +for IPv6***. + +#### GENEVE: Generic Network Virtualization Encapsulation + +_GEneric NEtwork Virtualization Encapsulation_ (GENEVE, described in [RFC8926](https://datatracker.ietf.org/doc/html/rfc8926)) +is somewhat similar to VXLAN although it was an attempt to stop the wild growth of tunneling protocols, +I'm sure there is an [XKCD](https://xkcd.com/927/) out there specifically for this approach. The packet is +also a UDP datagram with destination port 6081, followed by an 8 byte GENEVE specific header, containing +2 bits of version, 8 bits for flags, a 16 bit inner ethertype, a 24 bit _Virtual Network Identifier_ (VNI), +and 8 bits of reserved space. With GENEVE, several options are available and will be tacked onto the GENEVE +header, but they are typically not used. If they are though, the options can add an additional 16 bytes +which means that to be able to transport any ethernet frame, the _underlay_ network must have an end to +end MTU of at least 1522+52 = ***1574 bytes for IPv4*** and ***1594 bytes for IPv6***. + +### Hardware setup + +First let's take a look at the physical setup. I'm using three servers and a switch in the IPng Networks lab: + +{{< image width="400px" float="right" src="/assets/vpp/l2-xconnect-lab.png" alt="Loadtest Setup" >}} + +* `hvn0`: Dell R720xd, load generator + * Dual E5-2620, 24 CPUs, 2 threads per core, 2 numa nodes + * 64GB DDR3 at 1333MT/s + * Intel X710 4-port 10G, Speed 8.0GT/s Width x8 (64 Gbit/s) +* `Hippo` and `Rhino`: VPP routers + * ASRock B550 Taichi + * Ryzen 5950X 32 CPUs, 2 threads per core, 1 numa node + * 64GB DDR4 at 2133 MT/s + * Intel 810C 2-port 100G, Speed 16.0 GT/s Width x16 (256 Gbit/s) +* `fsw0`: FS.com switch S5860-48SC, 8x 100G, 48x 10G + * VLAN 4 (blue) connects Rhino's `Hu12/0/1` to Hippo's `Hu12/0/1` + * VLAN 5 (red) connects hvn0's `enp5s0f0` to Rhino's `Hu12/0/0` + * VLAN 6 (green) connects hvn0's `enp5s0f1` to Hippo's `Hu12/0/0` + * All switchports have jumbo frames enabled and are set to 9216 bytes. + +Further, Hippo and Rhino are running VPP at head `vpp v22.02-rc0~490-gde3648db0`, and hvn0 is running +T-Rex v2.93 in L2 mode, with MAC address `00:00:00:01:01:00` on the first port, and MAC address +`00:00:00:02:01:00` on the second port. This machine can saturate 10G in both directions with small +packets even when using only one flow, as can be seen, if the ports are just looped back onto one +another, for example by physically crossconnecting them with an SFP+ or DAC; or in my case by putting +`fsw0` port `Te0/1` and `Te0/2` in the same VLAN together: + +{{< image width="800px" src="/assets/vpp/l2-xconnect-trex.png" alt="TRex on hvn0" >}} + +Now that I have shared all the context and hardware, I'm ready to actually dive in to what I wanted to +talk about: how does all this _virtual leased line_ business look like, for VPP. 
Ready? Here we go! + +### Direct L2 CrossConnect + +The simplest thing I can show in VPP, is to configure a layer2 cross-connect (_l2 xconnect_) between +two ports. In this case, VPP doesn't even need to have an IP address, all I do is bring up the ports, +set their MTU to be able to carry the 1522 bytes frames (ethernet at 1514, dot1q at 1518, and QinQ +at 1522 bytes). The configuration is identical on both Rhino and Hippo: +``` +set interface state HundredGigabitEthernet12/0/0 up +set interface state HundredGigabitEthernet12/0/1 up +set interface mtu packet 1522 HundredGigabitEthernet12/0/0 +set interface mtu packet 1522 HundredGigabitEthernet12/0/1 +set interface l2 xconnect HundredGigabitEthernet12/0/0 HundredGigabitEthernet12/0/1 +set interface l2 xconnect HundredGigabitEthernet12/0/1 HundredGigabitEthernet12/0/0 +``` + +I'd say the only thing to keep in mind here, is that the cross-connect commands only +link in one direction (receive in A, forward to B), and that's why I have to type them twice (receive in B, +forward to A). Of course, this must be really cheap on VPP -- because all it has to do now is receive +from DPDK and immediately schedule for transmit on the other port. Looking at `show runtime` I can +see how much CPU time is spent in each of VPP's nodes: + +``` +Time 1241.5, 10 sec internal node vector rate 28.70 loops/sec 475009.85 + vector rates in 1.4879e7, out 1.4879e7, drop 0.0000e0, punt 0.0000e0 + Name Calls Vectors Clocks Vectors/Call +HundredGigabitEthernet12/0/1-o 650727833 18472218801 7.49e0 28.39 +HundredGigabitEthernet12/0/1-t 650727833 18472218801 4.12e1 28.39 +ethernet-input 650727833 18472218801 5.55e1 28.39 +l2-input 650727833 18472218801 1.52e1 28.39 +l2-output 650727833 18472218801 1.32e1 28.39 +``` + +In this simple cross connect mode, the only thing VPP has to do is receive the ethernet, funnel it +into `l2-input`, and immediately send it straight through `l2-output` back out, which does not cost +much in terms of CPU cycles at all. In total, this CPU thread is forwarding 14.88Mpps (line rate 10G +at 64 bytes), at an average of 133 cycles per packet (not counting the time spent in DPDK). The CPU +has room to spare in this mode, in other words even _one CPU thread_ can handle this workload at +line rate, impressive! + +Although cool, doing an L2 crossconnect like this isn't super useful. Usually, the customer leased line +has to be transported to another location, and for that we'll need some form of encapsulation ... + +### Crossconnect over IPv6 VXLAN + +Let's start with VXLAN. The concept is pretty straight forward in VPP. Based on the configuration +I put in Rhino and Hippo above, I first will have to bring `Hu12/0/1` out of L2 mode, give both interfaces an +IPv6 address, create a tunnel with a given _VNI_, and then crossconnect the customer side `Hu12/0/0` +into the `vxlan_tunnel0` and vice-versa. 
Piece of cake: + +``` +## On Rhino +set interface mtu packet 1600 HundredGigabitEthernet12/0/1 +set interface l3 HundredGigabitEthernet12/0/1 +set interface ip address HundredGigabitEthernet12/0/1 2001:db8::1/64 +create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298 +set interface state vxlan_tunnel0 up +set interface mtu packet 1522 vxlan_tunnel0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0 +set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0 + +## On Hippo +set interface mtu packet 1600 HundredGigabitEthernet12/0/1 +set interface l3 HundredGigabitEthernet12/0/1 +set interface ip address HundredGigabitEthernet12/0/1 2001:db8::2/64 +create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298 +set interface state vxlan_tunnel0 up +set interface mtu packet 1522 vxlan_tunnel0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0 +set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0 +``` + +Of course, now we're actually beginning to make VPP do some work, and the exciting thing is, if there +would be an (opaque) ISP network between Rhino and Hippo, this would work just fine considering the +encapsulation is 'just' IPv6 UDP. Under the covers, for each received frame, VPP has to encapsulate it +into VXLAN, and route the resulting L3 packet by doing an IPv6 routing table lookup: + +``` +Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 32132.74 + vector rates in 8.5423e6, out 8.5423e6, drop 0.0000e0, punt 0.0000e0 + Name Calls Vectors Clocks Vectors/Call +HundredGigabitEthernet12/0/0-o 333777 85445944 2.74e0 255.99 +HundredGigabitEthernet12/0/0-t 333777 85445944 5.28e1 255.99 +ethernet-input 333777 85445944 4.25e1 255.99 +ip6-input 333777 85445944 1.25e1 255.99 +ip6-lookup 333777 85445944 2.41e1 255.99 +ip6-receive 333777 85445944 1.71e2 255.99 +ip6-udp-lookup 333777 85445944 1.55e1 255.99 +l2-input 333777 85445944 8.94e0 255.99 +l2-output 333777 85445944 4.44e0 255.99 +vxlan6-input 333777 85445944 2.12e1 255.99 +``` + +I can definitely see a lot more action here. In this mode, VPP is handlnig 8.54Mpps on this CPU thread +before saturating. At full load, VPP is spending 356 CPU cycles per packet, of which almost half is in +node `ip6-receive`. + +### Crossconnect over IPv4 VXLAN + +Seeing `ip6-receive` being such a big part of the cost (almost half!), I wonder what it might look like if +I change the tunnel to use IPv4. 
So I'll give Rhino and Hippo an IPv4 address as well, delete the vxlan tunnel I made +before (the IPv6 one), and create a new one with IPv4: + +``` +set interface ip address HundredGigabitEthernet12/0/1 10.0.0.0/31 +create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298 del +create vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298 +set interface state vxlan_tunnel0 up +set interface mtu packet 1522 vxlan_tunnel0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0 +set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0 + +set interface ip address HundredGigabitEthernet12/0/1 10.0.0.1/31 +create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298 del +create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298 +set interface state vxlan_tunnel0 up +set interface mtu packet 1522 vxlan_tunnel0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0 +set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0 +``` + +And after letting this run for a few seconds, I can take a look and see how the `ip4-*` version of +the VPP code performs: + +``` +Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 53309.71 + vector rates in 1.4151e7, out 1.4151e7, drop 0.0000e0, punt 0.0000e0 + Name Calls Vectors Clocks Vectors/Call +HundredGigabitEthernet12/0/0-o 552890 141539600 2.76e0 255.99 +HundredGigabitEthernet12/0/0-t 552890 141539600 5.30e1 255.99 +ethernet-input 552890 141539600 4.13e1 255.99 +ip4-input-no-checksum 552890 141539600 1.18e1 255.99 +ip4-lookup 552890 141539600 1.68e1 255.99 +ip4-receive 552890 141539600 2.74e1 255.99 +ip4-udp-lookup 552890 141539600 1.79e1 255.99 +l2-input 552890 141539600 8.68e0 255.99 +l2-output 552890 141539600 4.41e0 255.99 +vxlan4-input 552890 141539600 1.76e1 255.99 +``` + +Throughput is now quite a bit higher, clocking a cool 14.2Mpps (just short of line rate!) at 202 CPU +cycles per packet, considerably less time spent than in IPv6, but keep in mind that VPP has an ~empty +routing table in all of these tests. + +### Crossconnect over IPv6 GENEVE + +Another popular cross connect type, also based on IPv4 and IPv6 UDP packets, is GENEVE. 
The configuration +is almost identical, so I delete the IPv4 VXLAN and create an IPv6 GENEVE tunnel instead: + +``` +create vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298 del +create geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298 +set interface state geneve_tunnel0 up +set interface mtu packet 1522 geneve_tunnel0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0 +set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0 + +create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298 del +create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298 +set interface state geneve_tunnel0 up +set interface mtu packet 1522 geneve_tunnel0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0 +set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0 +``` + +All the while, the TRex on the customer machine `hvn0`, is sending 14.88Mpps in both directions, and +after just a short (second or so) interruption, the GENEVE tunnel comes up, cross-connects into the +customer `Hu12/0/0` interfaces, and starts to carry traffic: + +``` +Thread 8 vpp_wk_7 (lcore 8) +Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 29688.03 + vector rates in 8.3179e6, out 8.3179e6, drop 0.0000e0, punt 0.0000e0 + Name Calls Vectors Clocks Vectors/Call +HundredGigabitEthernet12/0/0-o 324981 83194664 2.74e0 255.99 +HundredGigabitEthernet12/0/0-t 324981 83194664 5.18e1 255.99 +ethernet-input 324981 83194664 4.26e1 255.99 +geneve6-input 324981 83194664 3.87e1 255.99 +ip6-input 324981 83194664 1.22e1 255.99 +ip6-lookup 324981 83194664 2.39e1 255.99 +ip6-receive 324981 83194664 1.67e2 255.99 +ip6-udp-lookup 324981 83194664 1.54e1 255.99 +l2-input 324981 83194664 9.28e0 255.99 +l2-output 324981 83194664 4.47e0 255.99 +``` + +Similar to VXLAN when using IPv6 the total for GENEVE-v6 is also comparatively slow (I say comparatively +because you should not expect anything like this performance when using Linux or BSD kernel routing!). +The lower throughput is again due to the `ip6-receive` node being costly. It is just slightly worse of a +performer at 8.32Mpps per core and 368 CPU cycles per packet. + +### Crossconnect over IPv4 GENEVE + +I am now suspecting that GENEVE over IPv4 would have similar gains to when I switched from VXLAN +IPv6 to IPv4 above. 
So I remove the IPv6 tunnel, create a new IPv4 tunnel instead, and hook it +back up to the customer port on both Rhino and Hippo, like so: + +``` +create geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298 del +create geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298 +set interface state geneve_tunnel0 up +set interface mtu packet 1522 geneve_tunnel0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0 +set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0 + +create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298 del +create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298 +set interface state geneve_tunnel0 up +set interface mtu packet 1522 geneve_tunnel0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0 +set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0 +``` + +And the results, indeed a significant improvement: + +``` +Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 48639.97 + vector rates in 1.3737e7, out 1.3737e7, drop 0.0000e0, punt 0.0000e0 + Name Calls Vectors Clocks Vectors/Call +HundredGigabitEthernet12/0/0-o 536763 137409904 2.76e0 255.99 +HundredGigabitEthernet12/0/0-t 536763 137409904 5.19e1 255.99 +ethernet-input 536763 137409904 4.19e1 255.99 +geneve4-input 536763 137409904 2.39e1 255.99 +ip4-input-no-checksum 536763 137409904 1.18e1 255.99 +ip4-lookup 536763 137409904 1.69e1 255.99 +ip4-receive 536763 137409904 2.71e1 255.99 +ip4-udp-lookup 536763 137409904 1.79e1 255.99 +l2-input 536763 137409904 8.81e0 255.99 +l2-output 536763 137409904 4.47e0 255.99 +``` + +So, close to line rate again! Performance of GENEVE-v4 clocks in at 13.7Mpps per core or 207 +CPU cycles per packet. + +## Crossconnect over IPv6 GRE + +Now I can't help but wonder, that if those `ip4|6-udp-lookup` nodes burn valuable CPU cycles, +GRE will possibly do better, because it's an L3 protocol (proto number 47) and will never have to +inspect beyond the IP header, so I delete the GENEVE tunnel and give GRE a go too: + +``` +create geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298 del +create gre tunnel src 2001:db8::1 dst 2001:db8::2 teb +set interface state gre0 up +set interface mtu packet 1522 gre0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0 +set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0 + +create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298 del +create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb +set interface state gre0 up +set interface mtu packet 1522 gre0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0 +set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0 +``` + +Results: +``` +Time 10.0, 10 sec internal node vector rate 255.99 loops/sec 37129.87 + vector rates in 9.9254e6, out 9.9254e6, drop 0.0000e0, punt 0.0000e0 + Name Calls Vectors Clocks Vectors/Call +HundredGigabitEthernet12/0/0-o 387881 99297464 2.80e0 255.99 +HundredGigabitEthernet12/0/0-t 387881 99297464 5.21e1 255.99 +ethernet-input 775762 198594928 5.97e1 255.99 +gre6-input 387881 99297464 2.81e1 255.99 +ip6-input 387881 99297464 1.21e1 255.99 +ip6-lookup 387881 99297464 2.39e1 255.99 +ip6-receive 387881 99297464 5.09e1 255.99 +l2-input 387881 99297464 9.35e0 255.99 +l2-output 387881 99297464 4.40e0 255.99 +``` + +The performance of GRE-v6 (in transparent ethernet bridge aka _TEB_ mode) is 9.9Mpps per core or +243 CPU cycles per packet, and I'll also note that while the `ip6-receive` node in all the +UDP based tunneling were in the 170 clocks/packet arena, 
now we're down to only 51 or so, so +indeed a huge improvement. + +### Crossconnect over IPv4 GRE + +To round off the set, I'll remove the IPv6 GRE tunnel and put an IPv4 GRE tunnel in place: + +``` +create gre tunnel src 2001:db8::1 dst 2001:db8::2 teb del +create gre tunnel src 10.0.0.0 dst 10.0.0.1 teb +set interface state gre0 up +set interface mtu packet 1522 gre0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0 +set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0 + +create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb del +create gre tunnel src 10.0.0.1 dst 10.0.0.0 teb +set interface state gre0 up +set interface mtu packet 1522 gre0 +set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0 +set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0 +``` + +And without further ado: +``` +Time 10.0, 10 sec internal node vector rate 255.87 loops/sec 52898.61 + vector rates in 1.4039e7, out 1.4039e7, drop 0.0000e0, punt 0.0000e0 + Name Calls Vectors Clocks Vectors/Call +HundredGigabitEthernet12/0/0-o 548684 140435080 2.80e0 255.95 +HundredGigabitEthernet12/0/0-t 548684 140435080 5.22e1 255.95 +ethernet-input 1097368 280870160 2.92e1 255.95 +gre4-input 548684 140435080 2.51e1 255.95 +ip4-input-no-checksum 548684 140435080 1.19e1 255.95 +ip4-lookup 548684 140435080 1.68e1 255.95 +ip4-receive 548684 140435080 2.03e1 255.95 +l2-input 548684 140435080 8.72e0 255.95 +l2-output 548684 140435080 4.43e0 255.95 +``` + +The performance of GRE-v4 (in transparent ethernet bridge aka _TEB_ mode) is 14.0Mpps per +core or 171 CPU cycles per packet. This is really very low, the best of all the tunneling +protocols, but (for obvious reasons) will not outperform a direct L2 crossconnect, as that +cuts out the L3 (and L4) middleperson entirely. Whohoo! + +## Conclusions + +First, let me give a recap of the tests I did, from left to right the better to worse +performer. + +Test | L2XC | GRE-v4 | VXLAN-v4 | GENEVE-v4 | GRE-v6 | VXLAN-v6 | GENEVE-v6 +------------- | ----------- | --------- | --------- | ---------- | ---------- | --------- | --------- +pps/core | >14.88M | 14.34M | 14.15M | 13.74M | 9.93M | 8.54M | 8.32M +cycles/packet | 132.59 | 171.45 | 201.65 | 207.44 | 243.35 | 355.72 | 368.09 + +***(!)*** Achtung! Because in the L2XC mode the CPU was not fully consumed (VPP was consuming only +~28 frames per vector), it did not yet achieve its optimum CPU performance. Under full load, the +cycles/packet will be somewhat lower than what is shown here. + +Taking a closer look at the VPP nodes in use, below I draw a graph of CPU cycles spent in each VPP +node, for each type of cross connect, where the lower the stack is, the faster cross connect will +be: + +{{< image width="1000px" src="/assets/vpp/l2-xconnect-cycles.png" alt="Cycles by node" >}} + +Although clearly GREv4 is the winner, I still would not use it for the following reason: +VPP does not support GRE keys, and considering it is an L3 protocol, I will have to use unique +IPv4 or IPv6 addresses for each tunnel src/dst pair, otherwise VPP will not know upon receipt of +a GRE packet, which tunnel it belongs to. For IPv6 this is not a huge deal (I can bind a whole +/64 to a loopback and just be done with it), but GREv6 does not perform as well as VXLAN-v4 or +GENEVE-v4. + +VXLAN and GENEVE are equal performers, both in IPv4 and in IPv6. In both cases, IPv4 is +significantly faster than IPv6. 
But due to the use of _VNI_ fields in the header, contrary +to GRE, both VXLAN and GENEVE can have the same src/dst IP for any number of tunnels, which +is a huge benefit. + +#### Multithreading + +Usually, the customer facing side is an ethernet port (or sub-interface with tag popping) that will be +receiving IPv4 or IPv6 traffic (either tagged or untagged) and this allows the NIC to use _RSS_ to assign +this inbound traffic to multiple queues, and thus multiple CPU threads. That's great, it means linear +encapsulation performance. + +Once the traffic is encapsulated, it risks becoming single flow with respect to the remote host, if +Rhino would be sending from 10.0.0.0:4789 to Hippo's 10.0.0.1:4789. However, the VPP VXLAN and GENEVE +implementation both inspect the _inner_ payload, and uses it to scramble the source port (thanks to +Neale for pointing this out, it's in `vxlan/encap.c:246`). Deterministically changing the source port +based on the inner-flow will allow Hippo to use _RSS_ on the receiving end, which allows these tunneling +protocols to scale linearly. I proved this for myself by attaching a port-mirror to the switch and +copying all traffic between Hippo and Rhino to a spare machine in the rack: + +``` +pim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 4789 +11:19:54.887763 IP 10.0.0.1.4452 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298 +11:19:54.888283 IP 10.0.0.1.42537 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298 +11:19:54.888285 IP 10.0.0.0.17895 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298 +11:19:54.899353 IP 10.0.0.1.40751 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298 +11:19:54.899355 IP 10.0.0.0.35475 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298 +11:19:54.904642 IP 10.0.0.0.60633 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298 + +pim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 6081 +11:22:55.802406 IP 10.0.0.0.32299 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a: +11:22:55.802409 IP 10.0.0.1.44011 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a: +11:22:55.807711 IP 10.0.0.1.45503 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a: +11:22:55.807712 IP 10.0.0.0.45532 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a: +11:22:55.841495 IP 10.0.0.0.61694 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a: +11:22:55.851719 IP 10.0.0.1.47581 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a: +``` + +Considering I was sending the T-Rex profile `bench.py` with tunables `vm=var2,size=64`, the latter +of which chooses randomized source and destination (inner) IP addresses in the loadtester, I can +conclude that the outer source port is chosen based on a hash of the inner packet. Slick!! + +#### Final conclusion + +The most important practical conclusion to draw is that I can feel safe to offer L2VPN services at +IPng Networks using VPP and a VXLAN or GENEVE IPv4 underlay -- our backbone is 9000 bytes everywhere, +so it will be possible to provide up to 8942 bytes of customer payload taking into account the +VXLAN-v4 overhead. At least gigabit symmetric _VLLs_ filled with 64b packets will not be a +problem for the routers we have, as they forward approximately 10.2Mpps per core and 35Mpps +per chassis when fully loaded. Even considering the overhead and CPU consumption that VXLAN +encap/decap brings with it, due to the use of multiple transmit and receive threads, +the router would have plenty of room to spare. 
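+
+A quick way to convince yourself that an underlay really has that headroom is to push a full sized,
+unfragmentable ICMP packet across it. The sketch below assumes a Linux host somewhere on the 9000 byte
+underlay path and borrows the lab address from above as the target -- substitute your own:
+
+```
+## 8972 bytes of ICMP payload + 8 (ICMP header) + 20 (IPv4 header) = 9000 bytes on the wire;
+## -M do sets the Don't Fragment bit, so any undersized hop fails loudly instead of fragmenting.
+ping -M do -s 8972 -c 3 10.0.0.1
+```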
+ +## Appendix + +The backing data for the graph in this article are captured in this [Google Sheet](https://docs.google.com/spreadsheets/d/1WZ4xvO1pAjCswpCDC9GfOGIkogS81ZES74_scHQryb0/edit?usp=sharing). + +### VPP Configuration + +For completeness, the `startup.conf` used on both Rhino and Hippo: + +``` +unix { + nodaemon + log /var/log/vpp/vpp.log + full-coredump + cli-listen /run/vpp/cli.sock + cli-prompt rhino# + gid vpp +} + +api-trace { on } +api-segment { gid vpp } +socksvr { default } + +memory { + main-heap-size 1536M + main-heap-page-size default-hugepage +} + +cpu { + main-core 0 + corelist-workers 1-15 +} + +buffers { + buffers-per-numa 300000 + default data-size 2048 + page-size default-hugepage +} + +statseg { + size 1G + page-size default-hugepage + per-node-counters off +} + +dpdk { + dev default { + num-rx-queues 7 + } + decimal-interface-names + dev 0000:0c:00.0 + dev 0000:0c:00.1 +} + +plugins { + plugin lcpng_nl_plugin.so { enable } + plugin lcpng_if_plugin.so { enable } +} + +logging { + default-log-level info + default-syslog-log-level crit + class linux-cp/if { rate-limit 10000 level debug syslog-level debug } + class linux-cp/nl { rate-limit 10000 level debug syslog-level debug } +} + +lcpng { + default netns dataplane + lcp-sync + lcp-auto-subint +} +``` + +### Other Details + +For posterity, some other stats on the VPP deployment. First of all, a confirmation that PCIe 4.0 x16 +slots were used, and that the _Comms_ DDP was loaded: + +``` +[ 0.433903] pci 0000:0c:00.0: [8086:1592] type 00 class 0x020000 +[ 0.433924] pci 0000:0c:00.0: reg 0x10: [mem 0xea000000-0xebffffff 64bit pref] +[ 0.433946] pci 0000:0c:00.0: reg 0x1c: [mem 0xee010000-0xee01ffff 64bit pref] +[ 0.433964] pci 0000:0c:00.0: reg 0x30: [mem 0xfcf00000-0xfcffffff pref] +[ 0.434104] pci 0000:0c:00.0: reg 0x184: [mem 0xed000000-0xed01ffff 64bit pref] +[ 0.434106] pci 0000:0c:00.0: VF(n) BAR0 space: [mem 0xed000000-0xedffffff 64bit pref] (contains BAR0 for 128 VFs) +[ 0.434128] pci 0000:0c:00.0: reg 0x190: [mem 0xee220000-0xee223fff 64bit pref] +[ 0.434129] pci 0000:0c:00.0: VF(n) BAR3 space: [mem 0xee220000-0xee41ffff 64bit pref] (contains BAR3 for 128 VFs) +[ 11.216343] ice 0000:0c:00.0: The DDP package was successfully loaded: ICE COMMS Package version 1.3.30.0 +[ 11.280567] ice 0000:0c:00.0: PTP init successful +[ 11.317826] ice 0000:0c:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8 +[ 11.317828] ice 0000:0c:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode. 
+[ 11.317829] ice 0000:0c:00.0: Commit DCB Configuration to the hardware +[ 11.320608] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link) +``` + +And how the NIC shows up in VPP, in particular the rx/tx burst modes and functions are interesting: +``` +hippo# show hardware-interfaces + Name Idx Link Hardware +HundredGigabitEthernet12/0/0 1 up HundredGigabitEthernet12/0/0 + Link speed: 100 Gbps + RX Queues: + queue thread mode + 0 vpp_wk_0 (1) polling + 1 vpp_wk_1 (2) polling + 2 vpp_wk_2 (3) polling + 3 vpp_wk_3 (4) polling + 4 vpp_wk_4 (5) polling + 5 vpp_wk_5 (6) polling + 6 vpp_wk_6 (7) polling + Ethernet address b4:96:91:b3:b1:10 + Intel E810 Family + carrier up full duplex mtu 9190 promisc + flags: admin-up promisc maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-supported + Devargs: + rx: queues 7 (max 64), desc 1024 (min 64 max 4096 align 32) + tx: queues 16 (max 64), desc 1024 (min 64 max 4096 align 32) + pci: device 8086:1592 subsystem 8086:0002 address 0000:0c:00.00 numa 0 + max rx packet len: 9728 + promiscuous: unicast on all-multicast on + vlan offload: strip off filter off qinq off + rx offload avail: vlan-strip ipv4-cksum udp-cksum tcp-cksum qinq-strip + outer-ipv4-cksum vlan-filter vlan-extend jumbo-frame + scatter keep-crc rss-hash + rx offload active: ipv4-cksum jumbo-frame scatter + tx offload avail: vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum + tcp-tso outer-ipv4-cksum qinq-insert multi-segs mbuf-fast-free + outer-udp-cksum + tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs + rss avail: ipv4-frag ipv4-tcp ipv4-udp ipv4-sctp ipv4-other ipv4 + ipv6-frag ipv6-tcp ipv6-udp ipv6-sctp ipv6-other ipv6 + l2-payload + rss active: ipv4-frag ipv4-tcp ipv4-udp ipv4 ipv6-frag ipv6-tcp + ipv6-udp ipv6 + tx burst mode: Scalar + tx burst function: ice_recv_scattered_pkts_vec_avx2_offload + rx burst mode: Offload Vector AVX2 Scattered + rx burst function: ice_xmit_pkts +``` + +Finally, in case it's interesting, an output of [lscpu](/assets/vpp/l2-xconnect-lscpu.txt), +[lspci](/assets/vpp/l2-xconnect-lspci.txt) and [dmidecode](/assets/vpp/l2-xconnect-dmidecode.txt) +as run on Hippo (Rhino is an identical machine). diff --git a/content/articles/2022-02-14-vpp-vlan-gym.md b/content/articles/2022-02-14-vpp-vlan-gym.md new file mode 100644 index 0000000..1cdeb7d --- /dev/null +++ b/content/articles/2022-02-14-vpp-vlan-gym.md @@ -0,0 +1,375 @@ +--- +date: "2022-02-14T11:35:14Z" +title: Case Study - VLAN Gymnastics with VPP +--- + + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. + +After completing the Linux CP plugin, interfaces and their attributes such as addresses and routes +can be shared between VPP and the Linux kernel in a clever way, so running software like +[FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving +>100Mpps and >100Gbps forwarding rates are easily in reach! But after the controlplane is +up and running, VPP has so much more to offer - many interesting L2 and L3 services that you'd expect +in commercial (and very pricy) routers like Cisco ASR are well within reach. 
+
+When Fred and I were in Paris [[report]({%post_url 2021-06-01-paris %})], I got stuck trying to
+configure an Ethernet over MPLS circuit for IPng from Paris to Zurich. Fred took a look for me
+and quickly determined "Ah, you forgot to do the VLAN gymnastics". I found it a fun way to describe
+the solution to my problem back then, and come to think of it: the router really can be configured
+to hook up anything to pretty much anything -- this post takes a look at similar flexibility in VPP.
+
+## Introduction
+
+When I first started learning how to work on Cisco's Advanced Services Router platform (Cisco IOS/XR),
+I was surprised that there is no concept of a _switch_. Like many network engineers, I was used to being
+able to put a number of ports in the same switch VLAN, and to take a different set of ports and put them
+into L3 mode with an IPv4/IPv6 address, or activate MPLS. And I was used to combining these two concepts
+by creating VLAN (L3) interfaces.
+
+Turning to VPP, much like its commercial sibling Cisco IOS/XR, the mental model and approach it takes
+is different. Each physical interface can have a number of sub-interfaces which carry an encapsulation,
+for example a _dot1q_, or a _dot1ad_ or even a double-tagged (QinQ or QinAD). When ethernet frames
+arrive on the physical interface, VPP will match them to the sub-interface which is configured to
+receive frames of that specific encapsulation, and drop frames that do not match any sub-interface.
+
+### Sub Interfaces in VPP
+
+There are several forms of sub-interface; let's take a look at them:
+
+```
+1. create sub <interface> <subId> dot1q|dot1ad <vlanId>
+2. create sub <interface> <subId> dot1q|dot1ad <vlanId> exact-match
+3. create sub <interface> <subId> dot1q|dot1ad <vlanId> inner-dot1q <innerVlanId>|any
+4. create sub <interface> <subId> dot1q|dot1ad <vlanId> inner-dot1q <innerVlanId> exact-match
+5. create sub <interface> <subId>
+6. create sub <interface> <subId>-<subId>
+7. create sub <interface> <subId> untagged
+8. create sub <interface> <subId> default
+```
+
+Alright, that's a lot of choice! Let me go over these one by one.
+1. The first variant creates a sub-interface which will match frames with the first VLAN tag being
+   either _dot1q_ or _dot1ad_ with the given **vlanId**. An important note to this: there might be
+   more VLAN tags following in the ethernet frame, ie the frame may be _QinQ_ or _QinAD_, and all
+   of these will be matched.
+1. The second variant looks to do the same thing, but there, the frame will only match if there is
+   **exactly one** VLAN tag, not more, not less. So this sub-interface will not match frames which are
+   _QinQ_ or _QinAD_.
+1. The third variant creates a sub-interface which matches an outer _dot1q_ or _dot1ad_ VLAN and in
+   addition an inner _dot1q_ tag. The special keyword **any** can be specified, which will make the
+   sub-interface match _QinQ_ or _QinAD_ frames without caring which inner tag is used.
+1. The fourth variant looks a bit like the second one, in that it will match frames which have
+   **exactly two** VLAN tags (either _dot1q_._dot1q_ or _dot1ad_._dot1q_). In this _exact-match_ mode of
+   operation, precisely those two tags must be present, and no other tags may follow.
+1. The fifth variant is simply a shorthand for the second one: it creates an exact-match dot1q with
+   a **vlanId** equal to the given **subId**. This is the most obvious form, and people will recognize
+   this as "just" a VLAN :)
+1. The sixth variant further expands on this pattern, and creates a list of these dot1q exact-match
+   (eg. 100-200 will create 101 sub-interfaces).
+1. The seventh variant creates a sub-interface that matches any frames that have **exactly zero**
+   tags (ie. _untagged_), and finally
+1. The eighth variant matches anything that is not matched by any other sub-interface (ie. the
+   fallthrough _default_).
+
+When I first saw this, it seemed overly complicated to me, but now that I've gotten to know this
+way of thinking, what's being presented here is a way for any physical interface to branch off
+inbound traffic based on exactly zero (_untagged_), exactly one (_dot1q_ or _dot1ad_ with
+_exact-match_), exactly two (outer _dot1q_ or _dot1ad_ followed by _inner-dot1q_ with _exact-match_),
+or one outer tag followed by _any_ inner tag(s). In other words, any combination of zero, one or
+two present tags on the frame can be matched and acted on by this logic.
+
+A few other considerations:
+* If a sub-interface is created with a given _dot1q_ or _dot1ad_ tag, you can't have another sub-interface
+  with a different matching logic on that same tag, for example creating `dot1q 100` means you can't
+  then also create `dot1q 100 exact-match`. If that behavior is desired, then you'll want to create
+  `dot1q 100 inner-dot1q any` followed by `dot1q 100 exact-match`
+* For L3 interfaces, it only makes sense to have _exact-match_ interfaces. I found a bug
+  in VPP that leads to a crash, which I've fixed in [[this gerrit](https://gerrit.fd.io/r/c/vpp/+/33444)],
+  so now the API and CLI throw an error instead of taking down the router.
+
+### Bridge Domains
+
+So how do we make the functional equivalent of a VLAN, where several interfaces are bound together into
+an L2 broadcast domain, like a regular switch might do? The VPP answer to this is a _bridge-domain_
+which I can create and give a number, and then add any interface to it, like so:
+
+```
+vpp# create bridge-domain 10
+vpp# set interface l2 bridge GigabitEthernet10/0/0 10
+vpp# set interface l2 bridge BondEthernet0 10
+```
+
+And if I want to add an IP address (creating the equivalent of a routable _VLAN Interface_), I create what
+is called a Bridge Virtual Interface or _BVI_, add that interface to the bridge domain, and optionally
+expose it in Linux with the [LinuxCP]({%post_url 2021-09-21-vpp-7%}) plugin:
+
+```
+vpp# bvi create instance 10 mac 02:fe:4b:4c:22:8f
+vpp# set interface l2 bridge bvi10 10 bvi
+vpp# set interface ip address bvi10 192.0.2.1/24
+vpp# set interface ip address bvi10 2001:db8::1/64
+vpp# lcp create bvi10 host-if bvi10
+```
+
+A bridge-domain is fully configurable - by default it'll participate in L2 learning, maintain a FIB (which
+MAC addresses are seen behind which interface), and pass along ARP requests and Neighbor Discovery. But I can
+configure it to turn on/off forwarding, ARP, handling of unknown unicast frames, and so on. The complete
+list of functionality that can be changed at runtime is:
+```
+set bridge-domain arp entry <bd-id> [<ip-addr> <mac-addr> [del] | del-all]
+set bridge-domain arp term <bd-id> [disable]
+set bridge-domain arp-ufwd <bd-id> [disable]
+set bridge-domain default-learn-limit <maxentries>
+set bridge-domain flood <bd-id> [disable]
+set bridge-domain forward <bd-id> [disable]
+set bridge-domain learn <bd-id> [disable]
+set bridge-domain learn-limit <bd-id> <learn-limit>
+set bridge-domain mac-age <bd-id> <mac-age>
+set bridge-domain rewrite <bd-id> [disable]
+set bridge-domain uu-flood <bd-id> [disable]
+```
+
+This makes bridge domains a very powerful concept, and actually much more powerful than (a strict superset
+of) what I might be able to configure on an L2 switch.
+
+### L2 CrossConnect
+
+I thought it'd be useful to point out another powerful concept, which made an appearance in my previous
+post about [Virtual Leased Lines]({%post_url 2022-01-12-vpp-l2%}).
If all I want to do is connect two +interfaces together, there won't be a need for learning, L2 FIB, and so on. It is computationally much +simpler to just take any frame received on interface A and transmit it out on interface B, unmodified. +This is known in VPP as a layer2 crossconnect, and can be configured like so: + +``` +vpp# set interface l2 xconnect GigabitEthernet10/0/0 GigabitEthernet10/0/3 +vpp# set interface l2 xconnect GigabitEthernet10/0/3 GigabitEthernet10/0/0 +``` + +I should point out that this has to be done in both directions. The first invocation will transmit any +frame received on Gi10/0/0 directly out on Gi10/0/3, and the second one will transmit any frame from Gi10/0/3 +directly out on Gi10/0/0, turning this into a very efficient way to connect two interfaces together. +Obviously, this only works in pairs, if more interfaces have to be connected, the bridge-domain is +the way to go. That said, L2 cross connects are super common. + +### Tag Rewriting + +If I want to connect two tagged _sub_-interfaces together, for example Gi10/0/0.123 to Gi10/0/3.321, +things get a bit more complicated. When VPP receives the frame from the first interface, it'll arrive tagged +with VLAN 123, so what happens if that is l2 crossconnected to Gi10/0/3.321? The answer will surprise you, +so let's take a look: + +``` +vpp# set interface state GigabitEthernet10/0/0 up +vpp# set interface state GigabitEthernet10/0/3 up +vpp# create sub GigabitEthernet10/0/0 123 +vpp# set interface state GigabitEthernet10/0/0.123 up +vpp# create sub GigabitEthernet10/0/3 321 +vpp# set interface state GigabitEthernet10/0/3.321 up +vpp# set interface l2 xconnect GigabitEthernet10/0/0.123 GigabitEthernet10/0/3.321 +vpp# set interface l2 xconnect GigabitEthernet10/0/3.321 GigabitEthernet10/0/0.123 +``` + +If I send a packet into Gi10/0/0.123, the L2 crossconnect will copy the entire frame, unmodified +into Gi10/0/3.321, but how can that be? That interface Gi10/0/3.321 is tagged with VLAN 321! VPP will +end up sending the frame out on interface Gi10/0/3 tagged as **VLAN 123**. In the other direction, frames +received on Gi10/0/3.321 will be sent out tagged as **VLAN 321** on Gi10/0/0. This is certainly not +what I expected. + +To address this, VPP can add or remove VLAN tags when it receives a frame, when it transmits a frame, +or both, let me show you this concept up close, as it's really powerful! + +VLAN tag rewrite provides the ability to change the VLAN tags on a packet. Existing tags can be popped, +new tags can be pushed, and existing tags can be swapped with new tags. The rewrite feature is attached +to a sub-interface as input and output operations. The input operation is explicitly configured by CLI or +API calls, and the output operation is the symmetric opposite and is automatically derived from the input +operation. + +* **POP**: For pop operations, the sub-interface encapsulation (the vlan tags specified when it was created) + must have at least the number of popped tags. e.g. the "pop 2" operation would be rejected on a + single-vlan interface. The output tag-rewrite operation will push the specified number of vlan + tags onto the packet before transmitting. The pushed tag values are taken from the sub-interface encapsulation + configuration. +* **PUSH**: For push operations, the ethertype (_dot1q_ or _dot1ad_) is also specified. The output tag-rewrite + operation for pushes is to pop the same number of tags off the packet. If the packet doesn't have enough tags + it is dropped. 
+* **TRANSLATE**: This is a combination of a pop and a push operation. + +This may be confusing at first, so let me demonstrate how this works, by extending the example above. On +the machine connected to Gi10/0/0.123, I'll configure an IP address and try to ping its neighbor: + +``` +pim@hippo:~$ sudo ip link add link enp4s0f0 name vlan123 type vlan id 123 +pim@hippo:~$ sudo ip link set vlan123 up +pim@hippo:~$ sudo ip addr add 192.0.2.1/30 dev vlan123 +pim@hippo:~$ ping 192.0.2.2 +PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data. +... +``` + +On the other side, I'll tcpdump what comes out the Gi10/0/3 port (which, as I observed above, is not carrying +the tag, 321, but instead carrying the original ingress tag, 123): +``` +16:33:59.489246 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 46: + ethertype 802.1Q (0x8100), vlan 123, p 0, + ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), + Request who-has 192.0.2.2 tell 192.0.2.1, length 28 +``` + +Now, to demonstrate tag rewriting, I will remove (pop) the ingress VLAN tag from Gi10/0/0.123 when +a packet is received: + +``` +vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 pop 1 + +16:37:42.721424 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 42: + ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), + Request who-has 192.0.2.2 tell 192.0.2.1, length 28 +``` + +There is no tag at all. What happened here is that when Gi10/0/0.123 received the frame, the 'pop' operation +stripped 1 VLAN tag off the frame. And as we'll see later, when that sub-interface transmits a frame, the +'pop' operation will add one VLAN tag (123) to the front of the frame. + +Remember how I pointed out above that the 'pop' operation is symmetric? I can use that because if I were to +_also_ apply this on the Gi10/0/3.321 interface, then it will push the tag (of Gi10/0/3.321) onto the packet +before sending it, and of course the other way around as well: + +``` +vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 pop 1 +vpp# set interface l2 tag-rewrite GigabitEthernet10/0/3.321 pop 1 + +16:41:00.352840 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 46: + ethertype 802.1Q (0x8100), vlan 321, p 0, + ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), + Request who-has 192.0.2.2 tell 192.0.2.1, length 28 +16:41:00.352867 fe:54:00:00:10:03 > fe:54:00:00:10:00, length 46: + ethertype 802.1Q (0x8100), vlan 321, p 0, + ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), + Reply 192.0.2.2 is-at fe:54:00:00:10:03, length 28 +``` + +Hey look, there's our ARP reply packet! That packet coming back into Gi10/0/3.321, when hitting the +tag-rewrite, will in turn remove the tag, and the 'pop' being symmetrical, will of course add a new +tag 123 on egress of Gi10/0/0.123, and I can now see connectivity end to end. Neat! 
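+
+As an aside, if you'd rather not create Linux VLAN interfaces just to generate test traffic, the same
+tagged ARP request can be hand-crafted. Here's a minimal sketch using Scapy -- assuming Scapy is
+installed, and simply reusing the `enp4s0f0` interface name and the addresses from the example above:
+
+```
+#!/usr/bin/env python3
+# Emit one dot1q-tagged ARP request, so that the effect of 'tag-rewrite pop 1'
+# can be observed with tcpdump on the monitor port. Run as root on the machine
+# that is attached to Gi10/0/0.
+from scapy.all import ARP, Dot1Q, Ether, sendp
+
+frame = (
+    Ether(src="fe:54:00:00:10:00", dst="ff:ff:ff:ff:ff:ff")
+    / Dot1Q(vlan=123)
+    / ARP(op="who-has", hwsrc="fe:54:00:00:10:00", psrc="192.0.2.1", pdst="192.0.2.2")
+)
+sendp(frame, iface="enp4s0f0", count=1)
+```
+
+Changing the `Dot1Q(vlan=...)` value is then a quick way to confirm that frames with any other tag are
+simply dropped, because they don't match the Gi10/0/0.123 sub-interface.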
+ +Other operations that are interesting, include arbitrarily adding a _dot1q_ tag (or even two tags): + +``` +vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 push dot1q 100 + +16:45:33.121049 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 50: + ethertype 802.1Q (0x8100), vlan 100, p 0, + ethertype 802.1Q (0x8100), vlan 123, p 0, + ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), + Request who-has 192.0.2.2 tell 192.0.2.1, length 28 + +vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 push dot1q 100 200 + +16:48:15.936807 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 54: + ethertype 802.1Q (0x8100), vlan 100, p 0, + ethertype 802.1Q (0x8100), vlan 200, p 0, + ethertype 802.1Q (0x8100), vlan 123, p 0, + ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), + Request who-has 192.0.2.2 tell 192.0.2.1, length 28 +``` + +And finally, swapping (translating) VLAN tags: + +``` +vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 translate 1-1 dot1ad 100 + +16:50:56.705015 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 46: + ethertype 802.1Q-QinQ (0x88a8), vlan 100, p 0, + ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), + Request who-has 192.0.2.2 tell 192.0.2.1, length 28 + +vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 translate 1-1 dot1q 321 +vpp# set interface l2 tag-rewrite GigabitEthernet10/0/3.321 translate 1-1 dot1q 123 + +16:44:03.462842 fe:54:00:00:10:00 > ff:ff:ff:ff:ff:ff, length 46: + ethertype 802.1Q (0x8100), vlan 321, p 0, + ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), + Request who-has 192.0.2.2 tell 192.0.2.1, length 28 +16:44:03.462847 fe:54:00:00:10:03 > fe:54:00:00:10:00, length 46: + ethertype 802.1Q (0x8100), vlan 321, p 0, + ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), + Reply 192.0.2.2 is-at fe:54:00:00:10:03, length 28 +``` + +This last set of 'translate 1-1' has a similar effect to the 'pop 1', the VLAN is rewritten to 321 +when receiving from Gi10/0/0.123, and it's rewritten to 123 when receiving from Gi10/0/3.321, making +end to end traffic possible again. + +#### Final conclusion + +The four concepts discussed here can be combined in countless interesting ways: + +* Create sub-interface with or without exact-match, to handle certain encapsulated packets +* Provide layer2 crossconnect functionality between any two interfaces or sub-interfaces +* Add multiple interfaces and sub-interfaces into a bridge-domain +* Ensure that VLAN tags are popped and pushed consistently on tagged sub-interfaces + +The practical conclusion is that VPP can provide fully transparent, dot1q and jumboframe enabled +virtual leased lines (see my previous post on [VLL performance]({%post_url 2022-01-12-vpp-l2 %})), +including using regular breakout switches to greatly increase the total port count for customers. + +I'll leave you with a working example of an L2VPN between a breakout switch behind **nlams0.ipng.ch** +in Amsterdam and a remote VPP router in Zurich called **ddln0.ipng.ch**. 
Take the following
+[S5860-20SQ]({%post_url 2021-08-07-fs-switch %}) switch, which connects to the VPP router on Te0/1
+and a customer on Te0/2:
+
+```
+fsw0(config)#vlan 3438
+fsw0(config-vlan)#name v-vll-customer
+fsw0(config-vlan)#exit
+fsw0(config)#interface TenGigabitEthernet 0/1
+fsw0(config-if-TenGigabitEthernet 0/1)#description Core: nlams0.ipng.ch Te6/0/0
+fsw0(config-if-TenGigabitEthernet 0/1)#mtu 9216
+fsw0(config-if-TenGigabitEthernet 0/1)#switchport mode trunk
+fsw0(config-if-TenGigabitEthernet 0/1)#switchport trunk allowed vlan add 3438
+
+fsw0(config)#interface TenGigabitEthernet 0/2
+fsw0(config-if-TenGigabitEthernet 0/2)#description Cust: Customer VLL Port NIKHEF
+fsw0(config-if-TenGigabitEthernet 0/2)#mtu 1522
+fsw0(config-if-TenGigabitEthernet 0/2)#switchport mode dot1q-tunnel
+fsw0(config-if-TenGigabitEthernet 0/2)#switchport dot1q-tunnel native vlan 3438
+fsw0(config-if-TenGigabitEthernet 0/2)#switchport dot1q-tunnel allowed vlan add untagged 3438
+fsw0(config-if-TenGigabitEthernet 0/2)#switchport dot1q-tunnel allowed vlan add tagged 1000-2000
+```
+
+I configure the first port here to be a VLAN trunk port to the router, and add VLAN 3438 to it. Then,
+I configure the second port to be a customer _dot1q-tunnel_ port, which accepts untagged frames and
+puts them in VLAN 3438, and additionally accepts tagged frames in VLAN 1000-2000 and prepends the
+customer VLAN 3438 to them - so these will become QinQ double tagged 3438.1000-2000.
+
+The corresponding snippet of the VPP router configuration looks like this:
+```
+comment { Customer VLL to DDLN }
+lcp lcp-auto-sub-int off
+create sub TenGigabitEthernet6/0/0 3438 dot1q 3438
+set interface mtu packet 1518 TenGigabitEthernet6/0/0.3438
+set interface state TenGigabitEthernet6/0/0.3438 up
+set interface l2 tag-rewrite TenGigabitEthernet6/0/0.3438 pop 1
+
+create vxlan tunnel instance 12 src 194.1.163.32 dst 194.1.163.5 vni 320501 decap-next l2
+set interface state vxlan_tunnel12 up
+set interface mtu packet 1518 vxlan_tunnel12
+set interface l2 xconnect TenGigabitEthernet6/0/0.3438 vxlan_tunnel12
+set interface l2 xconnect vxlan_tunnel12 TenGigabitEthernet6/0/0.3438
+lcp lcp-auto-sub-int on
+```
+
+The customer facing interfaces have an MTU of 1518 bytes, which is enough for a 1500 byte IP packet
+plus 14 bytes of L2 overhead (src-mac, dst-mac, ethertype) and one optional VLAN tag.
+In other words, this VLL is _dot1q_ capable, because the VPP sub-interface Te6/0/0.3438 did not specify
+_exact-match_, so it'll accept any additional VLAN tags. Of course this does require the path from
+**nlams0.ipng.ch** to **ddln0.ipng.ch** to be (baby)jumbo enabled, which it is, as AS8298 is fully
+9000 byte capable.
diff --git a/content/articles/2022-02-21-asr9006.md b/content/articles/2022-02-21-asr9006.md
new file mode 100644
index 0000000..feece8f
--- /dev/null
+++ b/content/articles/2022-02-21-asr9006.md
@@ -0,0 +1,778 @@
+---
+date: "2022-02-21T09:35:14Z"
+title: 'Review: Cisco ASR9006/RSP440-SE'
+---
+
+## Introduction
+
+{{< image width="180px" float="right" src="/assets/asr9006/ipmax.png" alt="IP-Max" >}}
+
+If you've read up on my articles, you'll know that I have deployed a [European Ring]({%post_url 2021-02-27-network %}),
+which was reformatted late last year into [AS8298]({%post_url 2021-10-24-as8298 %}) and upgraded to run
+[VPP Routers]({%post_url 2021-09-21-vpp-7 %}) with 10G between each city. IPng Networks rents these 10G point to point
+virtual leased lines between each of our locations. It's a really great network, and it performs so well because it's
+built on an EoMPLS underlay provided by [IP-Max](https://ip-max.net/). They, in turn, run carrier grade hardware in the
+form of Cisco ASR9k. In part, we're such a good match together, because my choice of [VPP](https://fd.io/) on the IPng
+Networks routers fits very well with Fred's choice of [IOS/XR](https://en.wikipedia.org/wiki/Cisco_IOS_XR) on the
+IP-Max routers.
+
+
+And if you follow us on Twitter (I post as [@IPngNetworks](https://twitter.com/IPngNetworks/)), you may have seen a
+recent post where I upgraded an aging [ASR9006](https://www.cisco.com/c/en/us/support/routers/asr-9006-router/model.html)
+with a significantly larger [ASR9010](https://www.cisco.com/c/en/us/support/routers/asr-9010-router/model.html). The ASR9006
+was initially deployed at Equinix Zurich ZH05 in Oberengstringen near Zurich, Switzerland in 2015, which is seven years ago.
+It has hauled countless packets from Zurich to Paris, Frankfurt and Lausanne. When it was deployed, it came with an
+A9K-RSP-4G route switch processor, which in 2019 was upgraded to the A9K-RSP-8G, and after so many hours^W years of
+runtime needed a replacement. Also, IP-Max was starting to run out of ports for the chassis, hence the upgrade.
+
+{{< image width="300px" float="left" src="/assets/asr9006/staging.png" alt="IP-Max" >}}
+
+If you're interested in the line-up, there's this epic reference guide from [Cisco Live!](https://www.cisco.com/c/en/us/td/docs/iosxr/asr9000/hardware-install/overview-reference/b-asr9k-overview-reference-guide/b-asr9k-overview-reference-guide_chapter_010.html#con_733653)
+that offers a deep dive into the ASR9k architecture. The chassis and power supplies can host several generations of silicon,
+and even mix-and-match generations. So IP-Max ordered a few new RSPs, and after deploying the ASR9010 at ZH05, we made
+plans to redeploy this ASR9006 at NTT Zurich in Rümlang next to the airport, to replace an even older Cisco 7600
+at that location. Seeing as we have to order XFP optics (IP-Max has some DWDM/CWDM links in service at NTT), we have
+to park the chassis in and around Zurich. What better place to park it than in my lab? :-)
+
+The IPng Networks laboratory is where I do most of my work on [VPP](https://fd.io/). The rack you see to the left here holds my coveted
+Rhino and Hippo (two beefy AMD Ryzen 5950X machines with 100G network cards), and a few Dells that comprise my VPP
+lab. There was not enough room, so I gave this little fridge a place just adjacent to the rack, connected with 10x 10Gbps
+and serial and management ports.
+
+I immediately had a little giggle when booting up the machine. It comes with 4x 3kW power supply slots (3 are installed),
+and when booting the machine, I was happy that there was no debris lying on the side or back of the router, as its
+fans create a veritable vortex of airflow. Also, overnight the temperature in my basement lab + office room rose a
+few degrees. It's now nice and toasty in my office, no need for the heater in the winter. Yet the machine stays quite
+cool at 26C intake, consuming 2.2kW _idle_, with each of the two route processors (RSP440) drawing 240 Watts, each of the
+three 8x TenGigE blades drawing 575 Watts, and the 40x GigE blade drawing a respectable 320 Watts.
+ +``` +RP/0/RSP0/CPU0:fridge(admin)#show environment power-supply +R/S/I Power Supply Voltage Current + (W) (V) (A) +0/PS0/M1/* 741.1 54.9 13.5 +0/PS0/M2/* 712.4 54.8 13.0 +0/PS0/M3/* 765.8 55.1 13.9 +-------------- +Total: 2219.3 +``` + +For reference, Rhino and Hippo draw approximately 265W each, but they come with 4x1G, 4x10G, 2x100G and forward ~300Mpps +when fully loaded. By the end of this article, I hope you'll see why this is a funny juxtaposition to me. + +### Installing the ASR9006 + +The Cisco RSPs came to me new-in-refurbished-box. When booting, I had no idea what username/password was used for the +preinstall, and none of the standard passwords worked. So the first order of business is to take ownership of the +machine. I do this by putting both RSPs in _rommon_ (which is done by sending _Break_ after powercycling the machine -- +my choice of _tio(1)_ has ***Ctrl-t b*** as the magic incantation). The first RSP (in slot 0) is then set to a different +`confreg 0x142`, while the other is kept in rommon so it doesn't boot and take over the machine. After booting, I'm +then presented with a root user setup dialog. I create a user `pim` with some temporary password, set back the configuration +register, and reload. When the RSP is about to boot, I release the standby RSP to catch up, and voila: I'm _In like Flynn._ + +Wiring this up - I connect Te0/0/0/0 to IPng's office switch on port sfp-sfpplus9, and I assign the router an IPv4 and IPv6 +address. Then, I connect four Tengig ports to the lab switch, so that I can play around with loadtests a little bit. +After turning on LLDP, I can see the following physical view: + +``` +RP/0/RSP0/CPU0:fridge#show lldp neighbors +Sun Feb 20 19:14:21.775 UTC +Capability codes: + (R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device + (W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other + +Device ID Local Intf Hold-time Capability Port ID +xsw1-btl Te0/0/0/0 120 B,R bridge/sfp-sfpplus9 +fsw0 Te0/1/0/0 41 P,B,R TenGigabitEthernet 0/9 +fsw0 Te0/1/0/1 41 P,B,R TenGigabitEthernet 0/10 +fsw0 Te0/2/0/0 41 P,B,R TenGigabitEthernet 0/7 +fsw0 Te0/2/0/1 41 P,B,R TenGigabitEthernet 0/8 + +Total entries displayed: 5 +``` + +First, I decide to hook up basic connectivity behind port Te0/0/0/0. I establish OSPF, OSPFv3 and this gives me +visibility to the route-reflectors at IPng's AS8298. Next, I also establish three IPv4 and IPv6 iBGP sessions, so +the machine enters the Default Free Zone (also, _daaayum_, that table keeps on growing at 903K IPv4 prefixes and +143K IPv6 prefixes). 
+ +``` +RP/0/RSP0/CPU0:fridge#show ip ospf neighbor +Neighbor ID Pri State Dead Time Address Interface +194.1.163.3 1 2WAY/DROTHER 00:00:35 194.1.163.66 TenGigE0/0/0/0.101 + Neighbor is up for 00:11:14 +194.1.163.4 1 FULL/BDR 00:00:38 194.1.163.67 TenGigE0/0/0/0.101 + Neighbor is up for 00:11:11 +194.1.163.87 1 FULL/DR 00:00:37 194.1.163.87 TenGigE0/0/0/0.101 + Neighbor is up for 00:11:12 + +RP/0/RSP0/CPU0:fridge#show ospfv3 neighbor +Neighbor ID Pri State Dead Time Interface ID Interface +194.1.163.87 1 FULL/DR 00:00:35 2 TenGigE0/0/0/0.101 + Neighbor is up for 00:12:14 +194.1.163.3 1 2WAY/DROTHER 00:00:33 16 TenGigE0/0/0/0.101 + Neighbor is up for 00:12:16 +194.1.163.4 1 FULL/BDR 00:00:36 20 TenGigE0/0/0/0.101 + Neighbor is up for 00:12:12 + + +RP/0/RSP0/CPU0:fridge#show bgp ipv4 uni sum +Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer +Speaker 915517 915517 915517 915517 915517 915517 + +Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd +194.1.163.87 0 8298 172514 9 915517 0 0 00:04:47 903406 +194.1.163.140 0 8298 171853 9 915517 0 0 00:04:56 903406 +194.1.163.148 0 8298 176244 9 915517 0 0 00:04:49 903406 + +RP/0/RSP0/CPU0:fridge#show bgp ipv6 uni sum +Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer +Speaker 151597 151597 151597 151597 151597 151597 + +Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd +2001:678:d78:3::87 + 0 8298 54763 10 151597 0 0 00:05:19 142542 +2001:678:d78:6::140 + 0 8298 51350 10 151597 0 0 00:05:23 142542 +2001:678:d78:7::148 + 0 8298 54572 10 151597 0 0 00:05:25 142542 +``` + +One of the acceptance tests of new hardware at AS25091 IP-Max is to ensure that it takes a full table +to help ensure memory is present, accounted for, and working. These route switch processor boards come +with 12GB of ECC memory, and can scale the routing table for a small while to come. If/when they are +at the end of their useful life, they will be replaced with A9K-RSP-880's, which will also give us +access to 40G and 100G and 24x10G SFP+ line cards. At that point, the upgrade path is much easier as +the chassis will already be installed. It's a matter of popping in new RSPs and replacing the line +cards one by one. + +## Loadtesting the ASR9006/RSP440-SE + +Now that this router has some basic connectivity, I'll do something that I always wanted to do: loadtest +an ASR9k! I have mad amounts of respect for Cisco's ASR9k series, but as we'll soon see, their stability +is their most redeeming quality, not their performance. Nowadays, many flashy 100G machines are around, +which do indeed have the performance, but not the stability! I've seen routers with an uptime of 7 years, +and BGP sessions and OSPF adjacencies with an uptime of 5 years+. It's just .. I've not seen that type +of stability beyond Cisco and maybe Juniper. So if you want _Rock Solid Internet_, this is definitely +the way to go. + +I have written a word or two on how VPP (an open source dataplane very similar to these industrial machines) +works. A great example is my recent [VPP VLAN Gymnastics]({%post_url 2022-02-14-vpp-vlan-gym %}) article. +There's a lot I can learn from comparing the performance between VPP and Cisco ASR9k, so I will focus +on the following set of practical questions: + +1. See if unidirectional versus bidirectional traffic impacts performance. +1. See if there is a performance penalty of using _Bundle-Ether_ (LACP controlled link aggregation). +1. 
Of course, replay my standard issue 1514b large packets, internet mix (_imix_) packets, small 64b packets + from random source/destination addresses (ie. multiple flows); and finally the killer test of small 64b + packets from a static source/destination address (ie. single flow). + +This is in total 2 (uni/bi) x2 (lag/plain) x4 (packet mix) or 16 loadtest runs, for three forwarding types ... + +1. See performance of L2VPN (Point-to-Point), similar to what VPP would call "l2 xconnect". I'll create an + L2 crossconnect between port Te0/1/0/0 and Te0/2/0/0; this is the simplest form computationally: it + forwards any frame received on the first interface directly out on the second interface. +1. Take a look at performance of L2VPN (Bridge Domain), what VPP would call "bridge-domain". I'll create a + Bridge Domain between port Te0/1/0/0 and Te0/2/0/0; this includes layer2 learning and FIB, and can tie + together any number of interfaces into a layer2 broadcast domain. +1. And of course, tablestakes, see performance of IPv4 forwarding, with Te0/1/0/0 as 100.64.0.1/30 and + Te0/2/0/0 as 100.64.1.1/30 and setting a static for 48.0.0.0/8 and 16.0.0.0/8 back to the loadtester. + +... making a grand total of 48 loadtests. I have my work cut out for me! So I boot up Rhino, which has a +Mellanox ConnectX5-Ex (PCIe v4.0 x16) network card sporting two 100G interfaces, and it can easily keep up +with this 2x10G single interface, and 2x20G LAG, even with 64 byte packets. I am continually amazed that +a full line rate loadtest of small 64 byte packets at a rate of 40Gbps boils down to 59.52Mpps! + +For each loadtest, I ramp up the traffic using a [T-Rex loadtester]({%post_url 2021-02-27-coloclue-loadtest %}) +that I wrote. It starts with a low-pps warmup duration of 30s, then it ramps up from 0% to a certain line rate +(in this case, alternating to 10GbpsL1 for the single TenGig tests, or 20GbpsL1 for the LACP tests), with a +rampup duration of 120s and finally it holds for duration of 30s. + +The following sections describe the methodology and the configuration statements on the ASR9k, with a quick +table of results per test, and a longer set of thoughts all the way at the bottom of this document. I so +encourage you to not skip ahead. Instead, read on and learn a bit (as I did!) from the configuration itself. + +**The question to answer**: Can this beasty mini-fridge sustain line rate? Let's go take a look! + +## Test 1 - 2x 10G + +In this test, I configure a very simple physical environment (this is a good time to take another look at the LLDP table +above). The Cisco is connected with 4x 10G to the switch, Rhino and Hippo are connected with 2x 100G to the switch +and I have a Dell connected as well with 2x 10G to the switch (this can be very useful to take a look at what's going +on on the wire). The switch is an FS S5860-48SC (with 48x10G SFP+ ports, and 8x100G QSFP ports), which is a piece of +kit that I highly recommend by the way. + +Its configuration: + +``` +interface TenGigabitEthernet 0/1 + description Infra: Dell R720xd hvn0:enp5s0f0 + no switchport + mtu 9216 +! +interface TenGigabitEthernet 0/2 + description Infra: Dell R720xd hvn0:enp5s0f1 + no switchport + mtu 9216 +! +interface TenGigabitEthernet 0/7 + description Cust: Fridge Te0/2/0/0 + mtu 9216 + switchport access vlan 20 +! +interface TenGigabitEthernet 0/9 + description Cust: Fridge Te0/1/0/0 + mtu 9216 + switchport access vlan 10 +! 
+interface HundredGigabitEthernet 0/53 + description Cust: Rhino HundredGigabitEthernet15/0/1 + mtu 9216 + switchport access vlan 10 +! +interface HundredGigabitEthernet 0/54 + description Cust: Rhino HundredGigabitEthernet15/0/0 + mtu 9216 + switchport access vlan 20 +! +monitor session 1 destination interface TenGigabitEthernet 0/1 +monitor session 1 source vlan 10 rx +monitor session 2 destination interface TenGigabitEthernet 0/2 +monitor session 2 source vlan 20 rx +``` + +What this does is connect Rhino's Hu15/0/1 and Fridge's Te0/1/0/0 in VLAN 10, and sends a readonly copy of all +traffic to the Dell's enp5s0f0 interface. Similarly, Rhino's Hu15/0/0 and Fridge's Te0/2/0/0 in VLAN 20 with a copy +of traffic to the Dell's enp5s0f1 interface. I can now run `tcpdump` on the Dell to see what's going back and forth. + +In case you're curious: the monitor on Te0/1 and Te0/2 ports will saturate in case both machines are transmitting at +a combined rate of over 10Gbps. If this is the case, the traffic that doesn't fit is simply dropped from the monitor +port, but it's of course forwarded correctly between the original Hu0/53 and Te0/9 ports. In other words: the monitor +session has no performance penalty. It's merely a convenience to be able to take a look on ports where `tcpdump` is +not easily available (ie. both VPP as well as the ASR9k in this case!) + +### Test 1.1: 10G L2 Cross Connect + +A simple matter of virtually patching one interface into the other, I choose the first port on blade 1 and 2, and +tie them together in a `p2p` cross connect. In my [VLAN Gymnastics]({%post_url 2022-02-14-vpp-vlan-gym %}) post, I +called this a `l2 xconnect`, and although the configuration statements are a bit different, the purpose and expected +semantics are identical: + +``` +interface TenGigE0/1/0/0 + l2transport + ! +! +interface TenGigE0/2/0/0 + l2transport + ! +! +l2vpn + xconnect group loadtest + p2p xc01 + interface TenGigE0/1/0/0 + interface TenGigE0/2/0/0 + ! + ! +``` + +The results of this loadtest look promising - although I can already see that the port will not sustain +line rate at 64 byte packets, which I find somewhat surprising. Both when using multiple flows (ie. random +source and destination IP addresses), as well as when using a single flow (repeating the same src/dst packet), +the machine tops out at around 20 Mpps which is 68% of line rate (29.76 Mpps). Fascinating! + +Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps) +---------- | ---------- | ------------ | ----------- | ------------ +1514b | 810 kpps | 9.94 Gbps | 1.61 Mpps | 19.77 Gbps +imix | 3.25 Mpps | 9.94 Gbps | 6.46 Mpps | 19.78 Gbps +64b Multi | 14.66 Mpps | 9.86 Gbps | 20.3 Mpps | 13.64 Gbps +64b Single | 14.28 Mpps | 9.60 Gbps | 20.3 Mpps | 13.62 Gbps + + +### Test 1.2: 10G L2 Bridge Domain + +I then keep the two physical interfaces in `l2transport` mode, but change the type of l2vpn into a +`bridge-domain`, which I described in my [VLAN Gymnastics]({%post_url 2022-02-14-vpp-vlan-gym %}) post +as well. VPP and Cisco IOS/XR semantics look very similar indeed, they differ really only in the way +in which the configuration is expressed: + +``` +interface TenGigE0/1/0/0 + l2transport + ! +! +interface TenGigE0/2/0/0 + l2transport + ! +! +l2vpn + xconnect group loadtest + ! + bridge group loadtest + bridge-domain bd01 + interface TenGigE0/1/0/0 + ! + interface TenGigE0/2/0/0 + ! + ! + ! +! 
+``` + +Here, I find that performance in one direction is line rate, and with 64b packets ever so slightly better +than the L2 crossconnect test above. In both directions though, the router struggles to obtain line rate +in small packets, delivering 64% (or 19.0 Mpps) of the total offered 29.76 Mpps back to the loadtester. + +Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps) +---------- | ---------- | ------------ | ----------- | ------------ +1514b | 807 kpps | 9.91 Gbps | 1.63 Mpps | 19.96 Gbps +imix | 3.24 Mpps | 9.92 Gbps | 6.47 Mpps | 19.81 Gbps +64b Multi | 14.82 Mpps | 9.96 Gbps | 19.0 Mpps | 12.79 Gbps +64b Single | 14.86 Mpps | 9.98 Gbps | 19.0 Mpps | 12.81 Gbps + +I would say that in practice, the performance of a bridge-domain is comparable to that of an L2XC. + +### Test 1.3: 10G L3 IPv4 Routing + +This is the most straight forward test: the T-Rex loadtester in this case is sourcing traffic from +100.64.0.2 on its first interface, and 100.64.1.2 on its second interface. It will send ARP for the +nexthop (100.64.0.1 and 100.64.1.1, the Cisco), but the Cisco will not maintain an ARP table for the +loadtester, so I have to add static ARP entries for it. Otherwise, this is a simple test, which stress +tests the IPv4 forwarding path: + +``` +interface TenGigE0/1/0/0 + ipv4 address 100.64.0.1 255.255.255.252 +! +interface TenGigE0/2/0/0 + ipv4 address 100.64.1.1 255.255.255.252 +! +router static + address-family ipv4 unicast + 16.0.0.0/8 100.64.1.2 + 48.0.0.0/8 100.64.0.2 + ! +! +arp vrf default 100.64.0.2 043f.72c3.d048 ARPA +arp vrf default 100.64.1.2 043f.72c3.d049 ARPA +! +``` + +Alright, so the cracks definitely show on this loadtest. The performance of small routed packets is quite +poor, weighing in at 35% of line rate in the unidirectional test, and 43% in the bidirectional test. It seems +that the ASR9k (at least in this hardware profile of `l3xl`) is not happy forwarding traffic at line rate, +and the routing performance is indeed significantly lower than the L2VPN performance. That's good to know! + +Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps) +---------- | ---------- | ------------ | ----------- | ------------ +1514b | 815 kpps | 10.0 Gbps | 1.63 Mpps | 19.98 Gbps +imix | 3.27 Mpps | 9.99 Gbps | 6.52 Mpps | 19.96 Gbps +64b Multi | 5.14 Mpps | 3.45 Gbps | 12.3 Mpps | 8.28 Gbps +64b Single | 5.25 Mpps | 3.53 Gbps | 12.6 Mpps | 8.51 Gbps + +## Test 2 - LACP 2x 20G + +Link aggregation ([ref](https://en.wikipedia.org/wiki/Link_aggregation)) means combining or aggregating multiple +network connections in parallel by any of several methods, in order to increase throughput beyond what a single +connection could sustain, to provide redundancy in case one of the links should fail, or both. A link aggregation +group (LAG) is the combined collection of physical ports. Other umbrella terms used to describe the concept include +_trunking_, _bundling_, _bonding_, _channeling_ or _teaming_. Bundling ports together on a Cisco IOS/XR platform +like the ASR9k can be done by creating a _Bundle-Ether_ or _BE_. For reference, the same concept on VPP is called +a _BondEthernet_ and in Linux it'll often be referred to as simply a _bond_. They all refer to the same concept. + +One thing that immediately comes to mind when thinking about LAGs is: how will the member port be selected on +outgoing traffic? 
A sensible approach will be to either hash on the L2 source and/or destination (ie. the ethernet +host on either side of the LAG), but in the case of a router and as is the case in our loadtest here, there is only +one MAC address on either side of the LAG. So a different hashing algorithm has to be chosen, preferably of the +source and/or destination _L3_ (IPv4 or IPv6) address. Luckily, both the FS switch as well as the Cisco ASR9006 +support this. + +First I'll reconfigure the switch, and then reconfigure the router to use the newly created 2x 20G LAG ports. + +``` +interface TenGigabitEthernet 0/7 + description Cust: Fridge Te0/2/0/0 + port-group 2 mode active +! +interface TenGigabitEthernet 0/8 + description Cust: Fridge Te0/2/0/1 + port-group 2 mode active +! +interface TenGigabitEthernet 0/9 + description Cust: Fridge Te0/1/0/0 + port-group 1 mode active +! +interface TenGigabitEthernet 0/10 + description Cust: Fridge Te0/1/0/1 + port-group 1 mode active +! +interface AggregatePort 1 + mtu 9216 + aggregateport load-balance dst-ip + switchport access vlan 10 +! +interface AggregatePort 2 + mtu 9216 + aggregateport load-balance dst-ip + switchport access vlan 20 +! +``` + +And after the Cisco is converted to use _Bundle-Ether_ as well, the link status looks like this: + +``` +fsw0#show int ag1 +... + Aggregate Port Informations: + Aggregate Number: 1 + Name: "AggregatePort 1" + Members: (count=2) + Lower Limit: 1 + TenGigabitEthernet 0/9 Link Status: Up Lacp Status: bndl + TenGigabitEthernet 0/10 Link Status: Up Lacp Status: bndl + Load Balance by: Destination IP + +fsw0#show int usage up +Interface Bandwidth Average Usage Output Usage Input Usage +-------------------------------- ----------- ---------------- ---------------- ---------------- +TenGigabitEthernet 0/1 10000 Mbit 0.0000018300% 0.0000013100% 0.0000023500% +TenGigabitEthernet 0/2 10000 Mbit 0.0000003450% 0.0000004700% 0.0000002200% +TenGigabitEthernet 0/7 10000 Mbit 0.0000012350% 0.0000022900% 0.0000001800% +TenGigabitEthernet 0/8 10000 Mbit 0.0000011450% 0.0000021800% 0.0000001100% +TenGigabitEthernet 0/9 10000 Mbit 0.0000011350% 0.0000022300% 0.0000000400% +TenGigabitEthernet 0/10 10000 Mbit 0.0000016700% 0.0000022500% 0.0000010900% +HundredGigabitEthernet 0/53 100000 Mbit 0.00000011900% 0.00000023800% 0.00000000000% +HundredGigabitEthernet 0/54 100000 Mbit 0.00000012500% 0.00000025000% 0.00000000000% +AggregatePort 1 20000 Mbit 0.0000014600% 0.0000023400% 0.0000005799% +AggregatePort 2 20000 Mbit 0.0000019575% 0.0000023950% 0.0000015200% +``` + +It's clear that both `AggregatePort` interfaces have 20Gbps of capacity and are using an L3 +loadbalancing policy. Cool beans! + +If you recall my loadtest theory in for example my [Netgate 6100 review]({%post_url 2021-11-26-netgate-6100%}), +it can sometimes be useful to operate a single-flow loadtest, in which the source and destination +IP:Port stay the same. As I'll demonstrate, it's not only relevant for PC based routers like ones built +on VPP, it can also be very relevant in silicon vendors and high-end routers! + +### Test 2.1 - 2x 20G LAG L2 Cross Connect + +I scratched my head a little while (and with a little while I mean more like an hour or so!), because usually +I come across _Bundle-Ether_ interfaces which have hashing turned on in the interface stanza, but in my +first loadtest run I did not see any traffic on the second member port. 
I then found out that I need L2VPN +setting `l2vpn load-balancing flow src-dst-ip` applied rather than the Interface setting: + +``` +interface Bundle-Ether1 + description LAG1 + l2transport + ! +! +interface TenGigE0/1/0/0 + bundle id 1 mode active +! +interface TenGigE0/1/0/1 + bundle id 1 mode active +! +interface Bundle-Ether2 + description LAG2 + l2transport + ! +! +interface TenGigE0/2/0/0 + bundle id 2 mode active +! +interface TenGigE0/2/0/1 + bundle id 2 mode active +! +l2vpn + load-balancing flow src-dst-ip + xconnect group loadtest + p2p xc01 + interface Bundle-Ether1 + interface Bundle-Ether2 + ! + ! +! +``` + +Overall, the router performs as well as can be expected. In the single-flow 64 byte test, however, due to +the hashing over the available members in the LAG being on L3 information, the router is forced to always +choose the same member and effectively perform at 10G throughput, so it'll get a pass from me on the 64b +single test. In the multi-flow test, I can see that it does indeed forward over both LAG members, however +it reaches only 34.9Mpps which is 59% of line rate. + +Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps) +---------- | ---------- | ------------ | ----------- | ------------ +1514b | 1.61 Mpps | 19.8 Gbps | 3.23 Mpps | 39.64 Gbps +imix | 6.40 Mpps | 19.8 Gbps | 12.8 Mpps | 39.53 Gbps +64b Multi | 29.44 Mpps | 19.8 Gbps | 34.9 Mpps | 23.48 Gbps +64b Single | 14.86 Mpps | 9.99 Gbps | 29.8 Mpps | 20.0 Gbps + +### Test 2.2 - 2x 20G LAG Bridge Domain + +Just like with Test 1.2 above, I can now transform this service from a Cross Connect into a fully formed +L2 bridge, by simply putting the two _Bundle-Ether_ interfaces in a _bridge-domain_ together, again +being careful to apply the L3 load-balancing policy on the `l2vpn` scope rather than the `interface` +scope: + +``` +l2vpn + load-balancing flow src-dst-ip + no xconnect group loadtest + bridge group loadtest + bridge-domain bd01 + interface Bundle-Ether1 + ! + interface Bundle-Ether2 + ! + ! + ! +! +``` + +The results for this test show that indeed L2XC is computationally cheaper than _bridge-domain_ work. With +imix and 1514b packets, the router is fine and forwards 20G and 40G respectively. When the bridge is slammed +with 64 byte packets, its performance reaches only 65% with multiple flows in the unidirectional, and 47% +in the bidirectional loadtest. I found the performance difference with the L2 crossconnect above remarkable. + +The single-flow loadtest cannot meaningfully stress both members of the LAG due to the src/dst being identical: +the best I can expect here, is 10G performance regardless how many LAG members there are. + +Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps) +---------- | ---------- | ------------ | ----------- | ------------ +1514b | 1.61 Mpps | 19.8 Gbps | 3.22 Mpps | 39.56 Gbps +imix | 6.39 Mpps | 19.8 Gbps | 12.8 Mpps | 39.58 Gbps +64b Multi | 20.12 Mpps | 13.5 Gbps | 28.2 Mpps | 18.93 Gbps +64b Single | 9.49 Mpps | 6.38 Gbps | 19.0 Mpps | 12.78 Gbps + +### Test 2.3 - 2x 20G LAG L3 IPv4 Routing + +And finally I turn my attention to the usual suspect: IPv4 routing. Here, I simply remove the `l2vpn` +stanza alltogether, and remember to put the load-balancing policy on the _Bundle-Ether_ interfaces. +This ensures that upon transmission, both members of the LAG are used. 
That is, if and only if the +IP src/dst addresses differ, which is the case in most, but not all of my loadtests :-) + +``` +no l2vpn + +interface Bundle-Ether1 + description LAG1 + ipv4 address 100.64.1.1 255.255.255.252 + bundle load-balancing hash src-ip +! +interface TenGigE0/1/0/0 + bundle id 1 mode active +! +interface TenGigE0/1/0/1 + bundle id 1 mode active +! +interface Bundle-Ether2 + description LAG2 + ipv4 address 100.64.0.1 255.255.255.252 + bundle load-balancing hash src-ip +! +interface TenGigE0/2/0/0 + bundle id 2 mode active +! +interface TenGigE0/2/0/1 + bundle id 2 mode active +! +``` + +The LAG is fine at forwarding IPv4 traffic in 1514b and imix - full line rate and 40Gbps of traffic is +passed in the bidirectional test. With the 64b frames though, the forwarding performance is not line rate +but rather 84% of line in one direction, and 76% of line rate in the bidirectional test. + +And once again, the single-flow loadtest cannot make use of more than one member port in the LAG, so it +will be constrained to 10G throughput -- that said, it performs at 42.6% of line rate only. + +Loadtest | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps) +---------- | ---------- | ------------ | ----------- | ------------ +1514b | 1.63 Mpps | 20.0 Gbps | 3.25 Mpps | 39.92 Gbps +imix | 6.51 Mpps | 19.9 Gbps | 13.04 Mpps | 39.91 Gbps +64b Multi | 12.52 Mpps | 8.41 Gbps | 22.49 Mpps | 15.11 Gbps +64b Single | 6.49 Mpps | 4.36 Gbps | 11.62 Mpps | 7.81 Gbps + +## Bonus - ASR9k linear scaling + +{{< image width="300px" float="right" src="/assets/asr9006/loaded.png" alt="ASR9k Loaded" >}} + +As I've shown above, the loadtests often topped out at well under line rate for tests with small packet sizes, but I +can also see that the LAG tests offered a higher performance, although not quite double that of single ports. I can't +help but wonder: is this perhaps ***a per-port limit*** rather than a router-wide limit? + +To answer this question, I decide to pull out the stops and populate the ASR9k with as many XFPs as I have in my +stash, which is 9 pieces. One (Te0/0/0/0) still goes to uplink, because the machine should be carrying IGP and full +BGP tables at all times; which leaves me with 8x 10G XFPs, which I decide it might be nice to combine all three +scenarios in one test: + +1. Test 1.1 with Te0/1/0/2 cross connected to Te0/2/0/2, with a loadtest at 20Gbps. +1. Test 1.2 with Te0/1/0/3 in a bridge-domain with Te0/2/0/3, also with a loadtest at 20Gbps. +1. Test 2.3 with Te0/1/0/0+Te0/2/0/0 on one end, and Te0/1/0/1+Te0/2/0/1 on the other end, with an IPv4 + loadtest at 40Gbps. + +### 64 byte packets + +It would be unfair to use single-flow on the LAG, considering the hashing is on L3 source and/or destination IPv4 +addresses, so really only one member port would be used. To avoid this pitfall, I run with `vm=var2`. On the other +two tests, however, I do run the most stringent of traffic pattern with single-flow loadtests. So off I go, firing +up ***three T-Rex*** instances. + +First, the 10G L2 Cross Connect test (approximately 17.7Mpps): +``` +Tx bps L2 | 7.64 Gbps | 7.64 Gbps | 15.27 Gbps +Tx bps L1 | 10.02 Gbps | 10.02 Gbps | 20.05 Gbps +Tx pps | 14.92 Mpps | 14.92 Mpps | 29.83 Mpps +Line Util. 
| 100.24 % | 100.24 % | +--- | | | +Rx bps | 4.52 Gbps | 4.52 Gbps | 9.05 Gbps +Rx pps | 8.84 Mpps | 8.84 Mpps | 17.67 Mpps +``` + +Then, the 10G Bridge Domain test (approximately 17.0Mpps): +``` +Tx bps L2 | 7.61 Gbps | 7.61 Gbps | 15.22 Gbps +Tx bps L1 | 9.99 Gbps | 9.99 Gbps | 19.97 Gbps +Tx pps | 14.86 Mpps | 14.86 Mpps | 29.72 Mpps +Line Util. | 99.87 % | 99.87 % | +--- | | | +Rx bps | 4.36 Gbps | 4.36 Gbps | 8.72 Gbps +Rx pps | 8.51 Mpps | 8.51 Mpps | 17.02 Mpps +``` + +Finally, the 20G LAG IPv4 forwarding test (approximately 24.4Mpps), noting that the _Line Util._ here is of the 100G +loadtester ports, so 20% is expected: +``` +Tx bps L2 | 15.22 Gbps | 15.23 Gbps | 30.45 Gbps +Tx bps L1 | 19.97 Gbps | 19.99 Gbps | 39.96 Gbps +Tx pps | 29.72 Mpps | 29.74 Mpps | 59.46 Mpps +Line Util. | 19.97 % | 19.99 % | +--- | | | +Rx bps | 5.68 Gbps | 6.82 Gbps | 12.51 Gbps +Rx pps | 11.1 Mpps | 13.33 Mpps | 24.43 Mpps +``` + +To summarize, in the above tests I am pumping 80Gbit (which is 8x 10Gbit full linerate at 64 byte packets, in +other words 119Mpps) into the machine, and it's returning 30.28Gbps (or 59.2Mpps which is 38%) of that traffic back +to the loadtesters. Features: yes; linerate: nope! + +### 256 byte packets + +Seeing the lowest performance of the router coming in at 8.5Mpps (or 57% of linerate), it stands to reason +that sending 256 byte packets will stay under the per-port observed packets/sec limits, so I decide to restart +the loadtesters with 256b packets. The expected ethernet frame is now 256 + 20 byte overhead, or 2208 bits, +of which ~4.53Mpps can fit into a 10G link. Immediately all ports go up entirely to full capacity. As seen from +the Cisco's commandline: + +``` +RP/0/RSP0/CPU0:fridge#show interfaces | utility egrep 'output.*packets/sec' | exclude 0 packets +Mon Feb 21 22:14:02.250 UTC + 5 minute output rate 18390237000 bits/sec, 9075919 packets/sec + 5 minute output rate 18391127000 bits/sec, 9056714 packets/sec + 5 minute output rate 9278278000 bits/sec, 4547012 packets/sec + 5 minute output rate 9242023000 bits/sec, 4528937 packets/sec + 5 minute output rate 9287749000 bits/sec, 4563507 packets/sec + 5 minute output rate 9273688000 bits/sec, 4537368 packets/sec + 5 minute output rate 9237466000 bits/sec, 4519367 packets/sec + 5 minute output rate 9289136000 bits/sec, 4562365 packets/sec + 5 minute output rate 9290096000 bits/sec, 4554872 packets/sec +``` + +The first two ports there are _Bundle-Ether_ interface _BE1_ and _BE2_, and the other eight are the TenGigE +ports. You can see that each one is forwarding the expected 4.53Mpps, and this lines up perfectly with T-Rex +which is sending 10Gbps of L1, and 9.28Gbps of L2 (the difference here is the ethernet overhead of 20 bytes +per frame, or 4.53 * 160 bits = 724Mbps), and it's receiving all of that traffic back on the other side, which +is good. + +This clearly demonstrates the hypothesis that the machine is ***per-port pps-bound***. + +So the conclusion is that, the A9K-RSP440-SE typically will forward maybe only 8Mpps on a single TenGigE port, and +13Mpps on a two-member LAG. However, it will do this _for every port_, and with at least 8x 10G ports saturated, +it remained fully responsive, OSPF and iBGP adjacencies stayed up, and ping times on the regular (Te0/0/0/0) +uplink port were smooth. + +## Results + +### 1514b and imix: OK! 
+
+{{< image width="1200px" src="/assets/asr9006/results-imix.png" alt="ASR9k Results - imix" >}}
+
+Let me start by showing a side-by-side comparison of the imix tests in all scenarios in the graph above. The
+graph for 1514b tests looks very similar, differing only in the left-Y axis: imix is a 3.2Mpps stream, while
+1514b saturates the 10G port already at 810Kpps. But obviously, the router can do this just fine; even when used
+on 8 ports, it doesn't mind at all. As I later learned, any traffic mix larger than 256b packets, or 4.5Mpps
+per port, forwards fine in any configuration.
+
+### 64b: Not so much :)
+
+{{< image width="1200px" src="/assets/asr9006/results-64b.png" alt="ASR9k Results - 64b" >}}
+
+{{< image width="1200px" src="/assets/asr9006/results-lacp-64b.png" alt="ASR9k Results - LACP 64b" >}}
+
+These graphs show the throughput of the ASR9006 with a pair of A9K-RSP440-SE route switch processors. They
+are rated at 440Gbps per slot, but their packets/sec rates are significantly lower than line rate. The top
+graph shows the tests with 10G ports, and the bottom graph shows the same tests but with 2x10G ports in a
+_Bundle-Ether_ LAG.
+
+In an ideal situation, each test would follow the loadtester up to completion, and there would be no horizontal
+lines breaking out partway through. As I showed, some of the loadtests really performed poorly in terms of
+packets/sec forwarded. Understandably, the 20G LAG with single-flow can only utilize one member port (which is
+logical) but then managed to push through only 6Mpps or so. Other tests did better, but overall I must say, the
+results were lower than I had expected.
+
+### That juxtaposition
+
+At the very top of this article I alluded to what I think is a cool juxtaposition. On the one hand, we have these
+beasty ASR9k routers, running idle at 2.2kW for 24x10G and 40x1G ports (as is the case for the IP-Max router that
+I took out for a spin here). They are large (10U of rackspace), heavy (40kg loaded), and expensive (who cares about
+list price, the street price is easily $10'000,- apiece).
+
+On the other hand, we have these PC based machines with Vector Packet Processing, operating as low as 19W for 2x10G,
+2x1G and 4x2.5G ports (like the [Netgate 6100]({%post_url 2021-11-26-netgate-6100%})) and offering roughly equal
+performance per port, while having to drop only $700,- apiece. The VPP machines come with ~infinite RAM: even a
+16GB machine will run much larger routing tables, including full BGP and so on - there is no (need for) TCAM, and yet
+routing performance scales out with CPUs and larger CPU instruction/data-cache. Looking at my Ryzen 5950X based Hippo/Rhino
+VPP machines, they *can* sustain line rate 64b packets on their 10G ports, due to each CPU being able to process
+around 22.3Mpps, and the machine has 15 usable CPU cores. Intel or Mellanox 100G network cards are affordable; the
+whole machine with 2x100G, 4x10G and 4x1G will set me back about $3'000,- in 1U and run 265 Watts when fully loaded.
+
+See an extended rationale with backing data in my [FOSDEM'22 talk](/media/fosdem22/index.html).
+
+## Conclusion
+
+I set out to answer three questions in this article, and I'm ready to opine now:
+
+1. Unidirectional vs Bidirectional: there is an impact - bidirectional tests (stressing both ingress and egress
+   of each individual router port) have lower performance, notably in packets smaller than 256b.
+1. 
LACP performance penalty: there is an impact - 64b multiflow loadtest on LAG obtained 59%, 47% and 42% (for + Test 2.1-3) while for single ports, they obtained 68%, 64% and 43% (for Test 1.1-3). So while aggregate + throughput grows with the LACP _Bundle-Ether_ ports, individual port throughput is reduced. +1. The router performs line rate 1514b, imix, and anything beyond 256b packets really. However, it does _not_ + sustain line rate at 64b packets. Some tests passed with a unidirectional loadtest, but all tests failed + with bidirectional loadtests. + +After all of these tests, I have to say I am ***still a huge fan*** of the ASR9k. I had kind of expected that it +would perform at line rate for any/all of my tests, but the theme became clear after a few - the ports will only +forward between 8Mpps and 11Mpps (out of the needed 14.88Mpps), but _every_ port will do that, which means +the machine will still scale up significantly in practice. But for business internet, colocation, and non-residential +purposes, I would argue that routing _stability_ is most important, and with regards to performance, I would argue +that _aggregate bandwidth_ is more important than pure _packets/sec_ performance. Finally, the ASR in Cisco ASR9k stands +for _Advanced Services Router_, and being able to mix-and-match MPLS, L2VPN, Bridges, encapsulation, tunneling, and +have an expectation of 8-10Mpps per 10G port is absolutely reasonable. The ASR9k is a very competent machine. + +### Loadtest data + +I've dropped all loadtest data [here](/assets/asr9006/asr9006-loadtest.tar.gz) and if you'd like to play around with +the data, take a look at the HTML files in [this directory](/assets/asr9006/), they were built with Michal's +[trex-loadtest-viz](https://github.com/wejn/trex-loadtest-viz/) scripts. + +## Acknowledgements + +I wanted to give a shout-out to Fred and the crew at IP-Max for allowing me to play with their router during +these loadtests. I'll be configuring it to replace their router at NTT in March, so if you have a connection +to SwissIX via IP-Max, you will be notified for maintenance ahead of time as we plan the maintenance window. + +We call these things Fridges in the IP-Max world, because they emit so much cool air when they start :) The +ASR9001 is the microfridge, this ASR9006 is the minifridge, and the ASR9010 is the regular fridge. + diff --git a/content/articles/2022-02-24-colo.md b/content/articles/2022-02-24-colo.md new file mode 100644 index 0000000..710cc83 --- /dev/null +++ b/content/articles/2022-02-24-colo.md @@ -0,0 +1,105 @@ +--- +date: "2022-02-24T13:46:12Z" +title: IPng Networks - Colocation +--- + +## Introduction + +As with most companies, it started with an opportunity. I got my hands on a location which has a +raised floor at 60m2 and a significant power connection of 3x200A, and a metro fiber connection at +10Gbps. I asked my buddy Luuk 'what would it take to turn this into a colo?' and the rest is history. +Thanks to Daedalean AG who benefit from this infrastructure as well, making this first small colocation +site was not only interesting, but also very rewarding. + +The colocation business is murder in Zurich - there are several very large datacenters (Equinix, NTT, +Colozüri, Interxion) all directly in or around the city, and I'm known to dwell in most of these. The +networking and service provider industry is quite small and well organized into _Network Operator Groups_, +so I work under the assumption that everybody knows everybody. 
I definitely like to pitch in and share
+what I have built, both the physical bits and the narrative.
+
+This article describes the small serverroom I built at a partner's premises in Zurich Albisrieden. The
+colo is _open for business_, that is to say: Please feel free to [reach out](/s/contact) if you're interested.
+
+## Physical
+
+{{< image width="180px" float="right" src="/assets/colo/power.png" alt="Power" >}}
+
+It starts with competent power distribution. Pictured to the right is a 200Amp 3-phase distribution panel
+at Daedalean AG in Zurich. There's another similar panel on the other side of the floor, and both are
+directly connected to EWZ and have plenty of smaller and larger breakers available (the room it's in used
+to be a serverroom of the previous tenant, the City of Zurich).
+
+{{< image width="180px" float="left" src="/assets/colo/eastron-sdm630.png" alt="Eastron SDM630" >}}
+
+I start by installing a set of Eastron SDM630 power meters, so that I know what is being used by IPng
+Networks (and can pay my dues), and so that I can remotely read the state and power consumption over
+MODBUS. This yields two 3-phase supplies, each behind a 32A breaker.
+
+{{< image width="180px" float="left" src="/assets/colo/pdus.png" alt="PDUs" >}}
+
+Then I go scouring the Internet for a few second hand 19" racks. I find two 800x1000mm racks, but they are
+all the way across Switzerland. They're very affordable though, and better still, each rack comes with two
+APC remotely switchable zero-U power distribution strips. Score!
+
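+Coming back to those SDM630 meters for a moment: reading one out is only a handful of lines of Python. The
+sketch below is an illustration rather than IPng's actual tooling; it assumes pymodbus 2.x, an RS485 adapter
+on `/dev/ttyUSB0`, and the commonly published SDM630 input registers (phase 1 voltage at `0x0000`, total
+system power at `0x0034`, each an IEEE754 float32 spread over two registers) -- double-check these against
+the meter's datasheet.
+
+```
+#!/usr/bin/env python3
+"""Read an Eastron SDM630 over Modbus/RTU -- a rough sketch, not IPng's production setup."""
+import struct
+from pymodbus.client.sync import ModbusSerialClient  # pymodbus 2.x style import
+
+REGISTERS = {"l1_voltage_v": 0x0000, "total_power_w": 0x0034}
+
+client = ModbusSerialClient(method="rtu", port="/dev/ttyUSB0", baudrate=9600,
+                            parity="N", stopbits=1, bytesize=8, timeout=1)
+client.connect()
+for name, addr in REGISTERS.items():
+    rr = client.read_input_registers(addr, 2, unit=1)  # unit: the Modbus slave id of this meter
+    value = struct.unpack(">f", struct.pack(">HH", *rr.registers))[0]
+    print(f"{name}: {value:.1f}")
+client.close()
+```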
+ +{{< image width="180px" float="right" src="/assets/colo/racks-installed1.png" alt="Racks Installed" >}} + +Laura and I rented a little (with which I mean: huge) minivan and went to pick up the racks. The folks at +Daedalean kindly helped us schlepp them up the stairs to the serverroom, and we installed the racks in the +serverroom, connecting them redundantly to power using the four PDUs. I have to be honest: there is no battery +or diesel backup in this room, as it's in the middle of the city and it'd be weird to have generators on site +for such a small room. It's a compromise we have to make. + +{{< image width="180px" float="left" src="/assets/colo/racks-installed2.png" alt="Racks Installed w/ doors" >}} + +Of course, I have to supply some form of eye-candy, so I decide to make a few decals for the racks, so that they +sport the _IPng @ DDLN_ designation. There are a few other racks and infrastructure in the same room, of course, +and it's cool to be able to identify IPng's kit upon entering the room. They even have doors, look! + +The floor space here is about 60m2 of usable serverroom, so there is plenty of room to grow, and if the network +ever grows larger than 2x10G uplinks, it is definitely possible to rent dark fiber from this location thanks to +the liberal Swiss telco situation. But for now, we start small with 1x 10G layer2 backhaul to Interxion in +Glattbrugg. In 2022, I expect to expand with a second 10G layer2 backhaul to NTT in Rümlang to make the site +fully redundant. + + + + + +## Logical + +The physical situation is sorted, we have cooling, power, 19" racks with PDUs, and uplink connectivity. It's time +to think about a simple yet redundant colocation setup: + +{{< image width="800px" src="/assets/colo/DDLN Logical Sketch.png" alt="Design" >}} + +In this design, I'm keeping it relatively straight forward. The 10G ethernet leased line from Solnet plugs into one +switch, and the 10G leased line from Init7 plugs into the other. Everything is then built in pairs. +I bring: +* Two switches (Mikrotik CRS354, with 48x1G, 4x10G and 2x40G), two power supplies, connect them with 40G together. +* Two Dell R630 routers running VPP (of course), two power supplies, with 3x10G each: + * One leg goes back-to-back for OSPF/OSPFv3 between the two routers + * One leg goes to each switch; the "local" leg will be in a VLAN into the uplink VLL, and expose the router on the + colocation VLAN and any L2 backhaul services. The "remote" leg will be in a VLAN to the other uplink VLL. +* Two Supermicro hypervisors, each connected with 10G to their own switch +* Two PCEngines APU4 machines, each connected to Daedalean's corporate network for OOB + * These have serial connection to the PDUs and Mikrotik switches + * They also have mgmt network connection to the Dell VPP routers and Mikrotik switches + * They also run a Wireguard access service which exposes an IPMI VLAN for colo clusters + +The result is that each of these can fail without disturbing traffic to/from the servers in the colocation. Each +server in the colo gets two power connections (one on each feed), two 1Gbps ports (one for IPMI and one for Internet). + +The logical colocation network has VRRP configured for direct/live failover of IPv4 and IPv6 gateways, but the VPP +routers can offer full redundant IPv4 and IPv6 transit, as well as L2 backhaul to any other location where IPng +Networks has a presence (which is [quite a few](https://as8298.peeringdb.com/)). 
+
+## Conclusion
+
+The colocation that I built, together with Daedalean, is very special. It's not carrier grade -- it doesn't
+have a building/room wide UPS or diesel generators -- but it does have competent power, cooling, and a solid
+physical and logical deployment. But most of all: it redundantly connects to AS8298 and offers full N+1
+redundancy on the logical level.
+
+If you're interested in hosting a server in this colocation, [contact us](/s/contact/)!
diff --git a/content/articles/2022-03-03-syslog-telegram.md b/content/articles/2022-03-03-syslog-telegram.md
new file mode 100644
index 0000000..01ad074
--- /dev/null
+++ b/content/articles/2022-03-03-syslog-telegram.md
@@ -0,0 +1,281 @@
+---
+date: "2022-03-03T19:05:14Z"
+title: Syslog to Telegram
+---
+
+## Introduction
+
+From time to time, I wish I could be made aware of failures earlier. There are two events, in particular,
+that I am interested to know about very quickly, as they may impact service at AS8298:
+
+1. _Open Shortest Path First_ (OSPF) adjacency removals. OSPF is a link-state protocol: when a physical
+   link goes down, it knows that the peer (neighbor) is no longer reachable, and it can then recompute
+   paths to other routers fairly quickly. But if the link stays up while connectivity is interrupted,
+   for example because there is a switch in the path, it can take a relatively long time to detect.
+1. _Bidirectional Forwarding Detection_ (BFD) session timeouts. BFD sets up a rapid (for example every
+   50ms, or 20Hz) unidirectional UDP stream between two hosts. If a number of packets (for example
+   40 packets, or 2 seconds) are not received, the link can be assumed to be dead.
+
+Notably, [BIRD](https://bird.network.cz/), as many other vendors do, can combine the two. At IPng, each
+OSPF adjacency is protected by BFD. What happens is that once an OSPF enabled link comes up, OSPF _Hello_
+packets will be periodically transmitted (with a period called a _Hello Timer_, typically once every
+10 seconds). When a number of these are missed (called a _Dead Timer_, typically 40 seconds), the neighbor
+is considered missing in action and the session cleaned up.
+
+To help recover from link failure faster than 40 seconds, a new BFD session can be set up to any
+neighbor that sends a _Hello_ packet. From then on, BFD will send a steady stream of UDP packets, and
+expect the neighbor to do the same. If BFD detects a timeout, it can inform BIRD to take action well
+before the OSPF _Dead Timer_ expires.
+
+Very strict timers are known to be used, for example 10ms and 5 missed packets, or 50ms (!!) of timeout.
+But at IPng, in the typical example above, I instruct BFD to send packets every 50ms, and time out
+after 40 missed packets, or two (2) seconds of link downtime. Considering BIRD+VPP converge a full
+routing table in about 7 seconds, that gives me an end-to-end recovery time of under 10 seconds, which
+is respectable, all the while avoiding triggering on false positives.
+
+I'd like to be made aware of these events, which could signal a darkfiber cut or WDM optic failure, an
+EoMPLS (ie _Virtual Leased Line_ or VLL) failure, or a non-recoverable VPP dataplane crash. To a lesser
+extent, being made explicitly aware of BGP adjacency changes to downstream (IP Transit customers) or
+upstream (IP Transit providers) can be useful.
+
+### Syslog NG
+
+There are two parts to this. First I want to have a (set of) central receiver servers that will each
+receive messages from the routers in the field.
I decide to take three servers: the main one being +`nms.ipng.nl`, which runs LibreNMS, and further two read-only route collectors `rr0.ddln0.ipng.ch` at +our own DDLN [colocation]({%post_url 2022-02-24-colo %}) in Zurich, and `rr0.nlams0.ipng.ch` running +at Coloclue in DCG, Amsterdam. + +Of course, it would be a mistake to use UDP as a transport for messages that discuss potential network +outages. Having receivers in multiple places in the network does help a little bit. But I decide to +configure the server (and the clients) later to use TCP. This way, messages are queued to be sent, +and if the TCP connection has to be rerouted when the underlying network converges, I am pretty certain +that the messages will arrive at the central logserver _eventually_. + +#### Syslog Server + +The configuration for each of the receiving servers is the same, very straight forward: +``` +$ cat << EOF | sudo tee /etc/syslog-ng/conf.d/listen.conf +template t_remote { + template("\$ISODATE \$FULLHOST_FROM [\$LEVEL] \${PROGRAM}: \${MESSAGE}\n"); + template_escape(no); +}; + +source s_network_tcp { + network( transport("tcp") ip("::") ip-protocol(6) port(601) max-connections(300) ); +}; + +destination d_ipng { file("/var/log/ipng.log" template(t_remote) template-escape(no)); }; + +log { source(s_network_tcp); destination(d_ipng); }; +EOF + +$ sudo systemctl restart syslog-ng +``` + +First, I define a _template_ which logs in a consistent and predictable manner. Then, I configure a +_source_ which listens on IPv4 and IPv6 on TCP port 601, which allows for more than the default 10 +connections. I configure a _destination_ into a file, using the template. Then I tie the log source +into the destination, and restart `syslog-ng`. + +One thing that took me a while to realize is that for `syslog-ng`, the parser applied to incoming +messages is different depending on the port used ([ref](https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.16/administration-guide/17)): + +* 514, both TCP and UDP, for [RFC3164](https://datatracker.ietf.org/doc/html/rfc3164) (BSD-syslog) formatted traffic +* 601 TCP, for [RFC5424](https://datatracker.ietf.org/doc/html/rfc5424) (IETF-syslog) formatted traffic +* 6514 TCP, for TLS-encrypted traffic (of IETF-syslog messages) + +After seeing malformed messages in the syslog, notably with duplicate host/program/timestamp, I ultimately +understood that this was because I was sending RFC5424 style messages to an RFC3164 enabled port (514). +Once I moved the transport to be port 601, the parser matched and loglines were correct. + +And another detail -- I feel a little bit proud for not forgetting to add a logrotate entry for +this new log file, keeping 10 days worth of compressed logs: +``` +$ cat << EOF | sudo tee /etc/logrotate.d/syslog-ng-ipng +/var/log/ipng.log +{ + rotate 10 + daily + missingok + notifempty + delaycompress + compress + postrotate + invoke-rc.d syslog-ng reload > /dev/null + endscript +} +EOF +``` + +I open up the firewall in these new syslog servers for TCP port 601, from any loopback addresses on +AS8298's network. + +#### Syslog Clients + +The clients install `syslog-ng-core` (which avoids all of the extra packages). On the routers, I +have to make sure that the syslog server runs in the `dataplane` namespace, otherwise it will not +have connectivity to send its messages. 
And, quite importantly, I should make sure that the +TCP connections are bound to the loopback address of the router, not any arbitrary interface, +as those could go down, rendering the TCP connection useless. So taking `nlams0.ipng.ch` as +an example, here's a configuration snippet: + +``` +$ sudo apt install syslog-ng-core +$ sudo sed -i -e 's,ExecStart=,ExecStart=/usr/sbin/ip netns exec dataplane ,' \ + /lib/systemd/system/syslog-ng.service + +$ LO4=194.1.163.32 +$ LO6=2001:678:d78::8 +$ cat << EOF | sudo tee /etc/syslog-ng/conf.d/remote.conf +destination d_nms_tcp { tcp("194.1.163.89" localip("$LO4") port(601)); }; +destination d_rr0_nlams0_tcp { tcp("2a02:898:146::4" localip("$LO6") port(601)); }; +destination d_rr0_ddln0_tcp { tcp("2001:678:d78:4::1:4" localip("$LO6") port(601)); }; + +filter f_bird { program(bird); }; + +log { source(s_src); filter(f_bird); destination(d_nms_tcp); }; +log { source(s_src); filter(f_bird); destination(d_rr0_nlams0_tcp); }; +log { source(s_src); filter(f_bird); destination(d_rr0_ddln0_tcp); }; +EOF + +$ sudo systemctl restart syslog-ng +``` +Here, I create simply three _destination_ entries, one for each log-sink. Then I create a _filter_ +that grabs logs sent, but only for the BIRD server. You can imagine that later, I can add other things +to this -- for example `keepalived` for VRRP failovers. Finally, I tie these together by applying +the filter to the source and sending the result to each syslog server. + +So far, so good. + +### Bird + +For consistency, (although not strictly necessary for the logging and further handling), I add +ISO data timestamping and enable syslogging in `/etc/bird/bird.conf`: + +``` +timeformat base iso long; +timeformat log iso long; +timeformat protocol iso long; +timeformat route iso long; + +log syslog all; +``` + +And for the two protocols of interest, I add `debug { events };` to the BFD and OSPF protocols. Note +that `bfd on` stanza in the OSPF interfaces -- this instructs BIRD to create BFD session for each of +the neighbors that are found on such an interface, and if BFD were to fail, tear down the adjacency +faster than the regular _Dead Timer_ timeouts. + +``` +protocol bfd bfd1 { + debug { events }; + interface "*" { interval 50 ms; multiplier 40; }; +} + +protocol ospf v2 ospf4 { + debug { events }; + ipv4 { export filter ospf_export; import all; }; + area 0 { + interface "loop0" { stub yes; }; + interface "xe1-3.100" { type pointopoint; cost 61; bfd on; }; + interface "xe1-3.200" { type pointopoint; cost 75; bfd on; }; + }; +} +``` + +This will emit loglines for (amongst others), state changes on BFD neighbors and OSPF adjacencies. +There are a lot of messages to choose from, but I found that the following messages contain the minimally +needed information to convey links going down or up (both from BFD's point of view as well as from OSPF +and OSPFv3's point of view). I can demonstrate that by making the link between Hippo and Rhino go down +(ie. by shutting the switchport, or unplugging the cable). 
+
+And after this, I can see on `nms.ipng.nl` that the logs start streaming in:
+```
+pim@nms:~$ tail -f /var/log/ipng.log | egrep '(ospf[46]|bfd1):.*changed state.*to (Down|Up|Full)'
+2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to 192.168.10.17 changed state from Up to Down
+2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: ospf4: Neighbor 192.168.10.1 on e2 changed state from Full to Down
+2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to fe80::5054:ff:fe01:1001 changed state from Up to Down
+2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: ospf6: Neighbor 192.168.10.1 on e2 changed state from Full to Down
+
+2022-02-24T18:17:18+00:00 hippo.btl.ipng.ch [debug] bird: ospf6: Neighbor 192.168.10.1 on e2 changed state from Loading to Full
+2022-02-24T18:17:18+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to fe80::5054:ff:fe01:1001 changed state from Init to Up
+2022-02-24T18:17:22+00:00 hippo.btl.ipng.ch [debug] bird: ospf4: Neighbor 192.168.10.1 on e2 changed state from Loading to Full
+2022-02-24T18:17:22+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to 192.168.10.17 changed state from Down to Up
+```
+
+And now I can see that these important events are detected and sent, using reliable TCP transport, to multiple
+logging machines: the messages about BFD and OSPF adjacency changes all make it to a central place.
+
+### Telegram Bot
+
+{{< image width="150px" float="left" src="/assets/syslog-telegram/ptb-logo.png" alt="PTB" >}}
+
+Of course I can go tail the logfile on one of the servers, but I think it'd be a bit more elegant to have
+a computer do the pattern matching for me. One way might be to use the `syslog-ng` destination feature _program()_
+([ref](https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.22/administration-guide/43)),
+which pipes these logs through a userspace process, receiving them on stdin and doing interesting things with
+them, such as interacting with Telegram, the delivery mechanism of choice for IPng's monitoring systems.
+Building such a Telegram enabled bot is very straight forward, thanks to the excellent documentation of the
+Telegram API, and the existence of `python-telegram-bot` ([ref](https://github.com/python-telegram-bot/python-telegram-bot)).
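+
+Just to show how little is needed on the Telegram side: delivering a message into a group chat is one HTTPS
+call to the Bot API. The snippet below is a minimal sketch, not the bot itself; it uses the raw API endpoint
+with `requests` purely for brevity, and the token and chat-id are placeholders:
+
+```
+#!/usr/bin/env python3
+"""Send a single message to a Telegram group chat -- a minimal sketch, not the actual bot."""
+import requests
+
+TOKEN = "123456:ABC-DEF"   # placeholder bot token, as handed out by @BotFather
+CHAT_ID = -100123456789    # placeholder group chat id
+
+resp = requests.post(
+    f"https://api.telegram.org/bot{TOKEN}/sendMessage",
+    json={"chat_id": CHAT_ID,
+          "text": "bird: ospf4: Neighbor 192.168.10.1 on e2 changed state from Full to Down"},
+    timeout=10,
+)
+resp.raise_for_status()
+```
+
+Everything else in the bot -- tailing files, matching patterns, coalescing and silencing incidents -- is
+plumbing around that one call.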
+
+However, to keep my bot from being tied at the hip to `syslog-ng`, I decide to simply tail a number of
+logfiles from the commandline (ie `pttb /var/log/*.log`) - and here emerges the name of my little bot:
+Python Telegram Tail Bot, or _pttb_ for short, which:
+
+* Tails the syslog logstream from one or more files, ie `/var/log/ipng.log`
+* Pattern matches on loglines, after which an `incident` is created
+* Waits for a predefined number of seconds (which may be zero) to see if more loglines match, adding them to
+  the `incident`
+* Holds the `incident` against a list of known regular expression `silences`, throwing away those which
+  aren't meant to be distributed
+* Sends those incidents which aren't silenced to a predefined group chat
+
+The bot should offer the following features, based on a YAML configuration file, which allows it to be
+restarted and upgraded:
+* A (mandatory) `TOKEN` to interact with the Telegram API
+* A (mandatory) single `chat-id` - messages will be sent to this Telegram group chat
+* An (optional) list of logline triggers, consisting of:
+  * a regular expression to match in the logstream
+  * a grace period to coalesce additional loglines of the same trigger into the incident
+  * a description to send once the incident is sent to Telegram
+* An (optional) list of silences, consisting of:
+  * a regular expression to match against incident message data
+  * an expiry timestamp
+  * a description carrying the reason for the silence
+
+The bot will start up, announce itself on the `chat-id` group, and then listen on Telegram for the following
+commands:
+
+* **/help** - a list of available commands
+* **/trigger** - without parameters, list the current triggers
+* **/trigger add <regexp> [duration] [<message>]** - with one parameter, set a trigger on a regular expression. Optionally,
+  add a duration in seconds between [0..3600>, within which additional matched loglines will be added to the
+  same incident, and an optional message to include in the Telegram alert.
+* **/trigger del <idx>** - with one parameter, remove the trigger with that index (use /trigger to see the list).
+* **/silence** - without parameters, list the current silences.
+* **/silence add <regexp> [duration] [<reason>]** - with one parameter, set a default silence for 1d; optionally
+  add a duration in the form of `[1-9][0-9]*([hdm])` which defaults to hours (and can be days or minutes), and an optional
+  reason for the silence.
+* **/silence del <idx>** - with one parameter, remove the silence with that index (use /silence to see the list).
+* **/stfu [duration]** - a shorthand for a silence with regular expression `.*`, which will suppress all notifications, with a
+  duration similar to the **/silence add** subcommand.
+* **/stats** - shows some runtime statistics, notably how many loglines were processed, how many incidents created,
+  and how many were sent or suppressed due to a silence.
+
+It will save its configuration file any time a silence or trigger is added or deleted. It will (obviously) then
+start sending incidents to the `chat-id` group-chat when they occur.
+
+## Results
+
+And a few fun hours of hacking later, I submitted a first rough approximation of a useful syslog scanner telegram bot
+on [Github](https://github.com/pimvanpelt/python-telegram-tail-bot).
It does seem to work, although not all functions +are implemented yet (I'll get them done in the month of March, probably): + +{{< image src="/assets/syslog-telegram/demo-telegram.png" alt="PTTB" >}} + +So now I'll be pretty quickly and elegantly kept up to date by this logscanner, in addition to my already existing +LibreNMS logging, monitoring and alerting. If you find this stuff useful, feel free to grab a copy from +[Github](https://github.com/pimvanpelt/python-telegram-tail-bot), the code is open source and licensed with a liberal +APACHE 2.0 license, and is based on excellent work of [Python Telegram Bot](https://github.com/python-telegram-bot/python-telegram-bot). diff --git a/content/articles/2022-03-27-vppcfg-1.md b/content/articles/2022-03-27-vppcfg-1.md new file mode 100644 index 0000000..67b8f7c --- /dev/null +++ b/content/articles/2022-03-27-vppcfg-1.md @@ -0,0 +1,444 @@ +--- +date: "2022-03-27T14:19:23Z" +title: VPP Configuration - Part1 +--- + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +I use VPP - Vector Packet Processor - extensively at IPng Networks. Earlier this year, the VPP community +merged the [Linux Control Plane]({%post_url 2021-08-12-vpp-1 %}) plugin. I wrote about its deployment +to both regular servers like the [Supermicro]({%post_url 2021-09-21-vpp-7 %}) routers that run on our +[AS8298]({% post_url 2021-02-27-network %}), as well as virtual machines running in +[KVM/Qemu]({% post_url 2021-12-23-vpp-playground %}). + +Now that I've been running VPP in production for about half a year, I can't help but notice one specific +drawback: VPP is a programmable dataplane, and _by design_ it does not include any configuration or +controlplane management stack. It's meant to be integrated into a full stack by operators. For end-users, +this unfortunately means that typing on the CLI won't persist any configuration, and if VPP is restarted, +it will not pick up where it left off. There's one developer convenience in the form of the `exec` +command-line (and startup.conf!) option, which will read a file and apply the contents to the CLI line +by line. However, if any typo is made in the file, processing immediately stops. It's meant as a convenience +for VPP developers, and is certainly not a useful configuration method for all but the simplest topologies. + +Luckily, VPP comes with an extensive set of APIs to allow it to be programmed. So in this series of posts, +I'll detail the work I've done to create a configuration utility that can take a YAML configuration file, +compare it to a running VPP instance, and step-by-step plan through the API calls needed to safely apply +the configuration to the dataplane. Welcome to `vppcfg`! + +In this first post, let's take a look at tablestakes: writing a YAML specification which models the main +configuration elements of VPP, and then ensures that the YAML file is both syntactically as well as +semantically correct. + +**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for +prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves) +or reach out by [contacting us](/s/contact/). + +## YAML Specification + +I decide to use [Yamale](https://github.com/23andMe/Yamale/), which is a schema description language +and validator for [YAML](http://www.yaml.org/spec/1.2/spec.html). 
YAML is a very simple, text/human-readable
+annotation format that can be used to store a wide range of data types. An interesting, quick introduction
+to the YAML language can be found on CraftIRC's [GitHub](https://github.com/Animosity/CraftIRC/wiki/Complete-idiot's-introduction-to-yaml)
+page.
+
+The first order of business for me is to devise a YAML file specification which models the configuration
+options of VPP objects in an idiomatic way. It's appealing to immediately build a
+higher level abstraction, but I resist the urge and instead look at the types of objects that exist in
+VPP, for example the `VNET_DEVICE_CLASS` types:
+
+* ***ethernet_simulated_device_class***: Loopbacks
+* ***bvi_device_class***: Bridge Virtual Interfaces
+* ***dpdk_device_class***: DPDK Interfaces
+* ***rdma_device_class***: RDMA Interfaces
+* ***bond_device_class***: BondEthernet Interfaces
+* ***vxlan_device_class***: VXLAN Tunnels
+
+There are several others, but I decide to start with these, as I'll be needing each one of these in my
+own network. Looking over the device class specification, I learn a lot about how they are configured,
+which arguments they need and of which types, and which data-structures they are represented as internally
+in VPP.
+
+### Syntax Validation
+
+Yamale first reads a _schema_ definition file, and then holds a given YAML file against the definition
+and shows whether or not the file is well-formed. As a practical example, let me start
+with the following definition:
+
+```
+$ cat << EOF > schema.yaml
+sub-interfaces: map(include('sub-interface'),key=int())
+---
+sub-interface:
+  description: str(exclude='\'"',len=64,required=False)
+  lcp: str(max=15,matches='[a-z]+[a-z0-9-]*',required=False)
+  mtu: int(min=128,max=9216,required=False)
+  addresses: list(ip(version=6),required=False)
+  encapsulation: include('encapsulation',required=False)
+---
+encapsulation:
+  dot1q: int(min=1,max=4095,required=False)
+  dot1ad: int(min=1,max=4095,required=False)
+  inner-dot1q: int(min=1,max=4095,required=False)
+  exact-match: bool(required=False)
+EOF
+```
+
+This snippet creates two types, one called `sub-interface` and the other called `encapsulation`. The fields
+of the sub-interface, for example the `description` field, must follow the given typing to be valid. In the
+case of the description, it must be at most 64 characters long and it must not contain the ' or "
+characters. The designation `required=False` notes that this is an optional field and may be omitted.
+The `lcp` field is also a string, but it must match a certain regular expression and start with a lowercase
+letter. The `mtu` field must be an integer between 128 and 9216, and so on.
+
+One nice feature of Yamale is the ability to reference other object types. I do this here with the `encapsulation`
+field, which references an object type of the same name, and again, is optional. This means that when the
+`encapsulation` field is encountered in the YAML file Yamale is validating, it'll hold the contents of that
+field against the schema below. There, we have `dot1q`, `dot1ad`, `inner-dot1q` and `exact-match` fields, which are
+all optional.
+
+Then, at the top of the file, I create the entrypoint schema, which expects YAML files to contain a map
+called `sub-interfaces` which is keyed by integers and contains values of type `sub-interface`, tying it all
+together.
+
+Yamale comes with a commandline utility to do direct schema validation, which is handy.
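+
+The same schema can also be driven from Python, which is presumably how a configuration tool would embed it.
+A minimal sketch, using the `make_schema` / `make_data` / `validate` calls from Yamale's documentation and
+the file names from the example above:
+
+```
+#!/usr/bin/env python3
+"""Validate a YAML file against schema.yaml -- a minimal sketch, not vppcfg itself."""
+import yamale
+
+schema = yamale.make_schema("schema.yaml")
+data = yamale.make_data("good.yaml")
+try:
+    yamale.validate(schema, data)
+    print("Validation success!")
+except yamale.YamaleError as e:
+    # Each result carries per-field error strings, much like the CLI output shows.
+    for result in e.results:
+        for error in result.errors:
+            print(f"  {error}")
+```
+
+That said, the commandline utility is all I need for the rest of this article.
+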
Let me demonstrate with
+the following terrible YAML:
+```
+$ cat << EOF > bad.yaml
+sub-interfaces:
+  100:
+    description: "Pim's illegal description"
+    lcp: "NotAGoodName-AmIRite"
+    mtu: 16384
+    addresses: 192.0.2.1
+    encapsulation: False
+EOF
+
+$ yamale -s schema.yaml bad.yaml
+Validating /home/pim/bad.yaml...
+Validation failed!
+Error validating data '/home/pim/bad.yaml' with schema '/home/pim/schema.yaml'
+    sub-interfaces.100.description: 'Pim's illegal description' contains excluded character '''
+    sub-interfaces.100.lcp: Length of NotAGoodName-AmIRite is greater than 15
+    sub-interfaces.100.lcp: NotAGoodName-AmIRite is not a regex match.
+    sub-interfaces.100.mtu: 16384 is greater than 9216
+    sub-interfaces.100.addresses: '192.0.2.1' is not a list.
+    sub-interfaces.100.encapsulation : 'False' is not a map
+```
+
+This file trips so many syntax violations, it should be a crime! In fact every single field is invalid. The one that
+is closest to being correct is the `addresses` field, but the schema expects a _list_ there (not a scalar), and even
+then, the list elements are expected to be IPv6 addresses, not IPv4 ones.
+
+So let me try again:
+
+```
+$ cat << EOF > good.yaml
+sub-interfaces:
+  100:
+    description: "Core: switch.example.com Te0/1"
+    lcp: "xe3-0-0"
+    mtu: 9216
+    addresses: [ 2001:db8::1, 2001:db8:1::1 ]
+    encapsulation:
+      dot1q: 100
+      exact-match: True
+EOF
+
+$ yamale good.yaml
+Validating /home/pim/good.yaml...
+Validation success! 👍
+```
+
+### Semantic Validation
+
+When using Yamale, I can make a good start on _syntax_ validation, that is to say: if a field is present, it follows
+a prescribed type. But that's not the whole story. There are many configuration files I can think of that
+would be syntactically correct, but still make no sense in practice. For example, creating an encapsulation which
+has both `dot1q` as well as `dot1ad`, or creating a _LIP_ (Linux Interface Pair) for a sub-interface which does not
+have `exact-match` set. Or how's about having two sub-interfaces with the same exact encapsulation?
+
+Here's where _semantic_ validation comes into play. So I set out to create all sorts of constraints, and after
+reading the (Yamale validated, so syntactically correct) YAML file, I can hand it to a set of validators that
+check for violations of these constraints. By means of example, let me create a few constraints that might capture
+the issues described above:
+
+1. If a sub-interface has encapsulation:
+   1. It MUST have `dot1q` OR `dot1ad` set
+   1. It MUST NOT have `dot1q` AND `dot1ad` both set
+1. If a sub-interface has one or more `addresses`:
+   1. Its encapsulation MUST be set to `exact-match`
+   1. It MUST have an `lcp` set.
+   1. Each individual `address` MUST NOT occur in any other interface
+
+## Config Validation
+
+After spending a few weeks thinking about the problem, I came up with 59 semantic constraints, that is to say
+things that might appear OK, but will yield impossible-to-implement or otherwise erratic VPP configurations.
+This article would be a bad place to discuss them all, so I will talk about the structure of `vppcfg` instead.
+
+First, a `Validator` class is instantiated with the Yamale schema. Then, a YAML file is read and passed to the
+validator's `validate()` method. It will first run Yamale on the YAML file and make note of any issues that arise.
+If there are any, it will enumerate them in a list and return (bool, [list-of-messages]).
The validation will have failed +if the boolean returned is _false_, and if so, the list of messages will help understand which constraint was +violated. + +The `vppcfg` schema consists of toplevel types, which are validated in order: + +* ***validate_bondethernets()***'s job is to ensure that anything configured in the `bondethernets` toplevel map + is correct. For example, if a _BondEthernet_ device is created there, its members should reference existing + interfaces, and it itself should make an appearance in the `interfaces` map, and the MTU of each member should + be equal to the MTU of the _BondEthernet_, and so on. See `config/bondethernet.py` for a complete rundown. +* ***validate_loopbacks()*** is pretty straight forward. It makes a few common assertions, such as that if the + loopback has addresses, it must also have an LCP, and if it has an LCP, that no other interface has the same + LCP name, and that all of the addresses configured are unique. +* ***validate_vxlan_tunnels()*** Yamale already asserts that the `local` and `remote` fields are present and an + IP address. The semantic validator ensures that the address family of the tunnel endpoints are the same, and that + the used `VNI` is unique. +* ***validate_bridgedomains()*** fiddles with its _Bridge Virtual Interface_, making sure that its addresses and + LCP name are unique. Further, it makes sure that a given member interface is in at most one bridge, and that said + member is in L2 mode, in other words, that it doesn't have an LCP or an address. An L2 interface can be either in + a bridgedomain, or act as an L2 Cross Connect, but not both. Finally, it asserts that each member has an MTU + identical to the bridge's MTU value. +* ***validate_interfaces()*** is by far the most complex, but a few common things worth calling out is that each + sub-interface must have a unique encapsulation, and if a given QinQ or QinAD 2-tagged sub-interface has an LCP, + that there exist a parent Dot1Q or Dot1AD interface with the correct encapsulation, and that it also has an LCP. + See `config/interface.py` for an extensive overview. + +## Testing + +Of course, in a configuration model so complex as a VPP router, being able to do a lot of validation helps ensure that +the constraints above are implemented correctly. To help this along, I use _regular_ unittesting as provided by +the Python3 [unittest](https://docs.python.org/3/library/unittest.html) framework, but I extend it to run as well +a special kind of test which I call a `YAMLTest`. + +### Unit Testing + +This is bread and butter, and should be straight forward for software engineers. I took a model of so called +test-driven development, where I start off by writing a test, which of course fails because the code hasn't been +implemented yet. Then I implement the code, and run this and all other unittests expecting them to pass. + +Let me give an example based on BondEthernets, with a YAML config file as follows: + +``` +bondethernets: + BondEthernet0: + interfaces: [ GigabitEthernet1/0/0, GigabitEthernet1/0/1 ] +interfaces: + GigabitEthernet1/0/0: + mtu: 3000 + GigabitEthernet1/0/1: + mtu: 3000 + GigabitEthernet2/0/0: + mtu: 3000 + sub-interfaces: + 100: + mtu: 2000 + + BondEthernet0: + mtu: 3000 + lcp: "be012345678" + addresses: [ 192.0.2.1/29, 2001:db8::1/64 ] + sub-interfaces: + 100: + mtu: 2000 + addresses: [ 192.0.2.9/29, 2001:db8:1::1/64 ] +``` + +As I mentioned when discussing the semantic constraints, there's a few here that jump out at me. 
First, the +BondEthernet members `Gi1/0/0` and `Gi1/0/1` must exist. There is one BondEthernet defined in this file (obvious, +I know, but bear with me), and `Gi2/0/0` is not a bond member, and certainly `Gi2/0/0.100` is not a bond member, +because having a sub-interface as an LACP member would be super weird. Taking things like this into account, here's +a few tests that could assert that the behavior of the `bondethernets` map in the YAML config is correct: + +``` +class TestBondEthernetMethods(unittest.TestCase): + def setUp(self): + with open("unittest/test_bondethernet.yaml", "r") as f: + self.cfg = yaml.load(f, Loader = yaml.FullLoader) + + def test_get_by_name(self): + ifname, iface = bondethernet.get_by_name(self.cfg, "BondEthernet0") + self.assertIsNotNone(iface) + self.assertEqual("BondEthernet0", ifname) + self.assertIn("GigabitEthernet1/0/0", iface['interfaces']) + self.assertNotIn("GigabitEthernet2/0/0", iface['interfaces']) + + ifname, iface = bondethernet.get_by_name(self.cfg, "BondEthernet-notexist") + self.assertIsNone(iface) + self.assertIsNone(ifname) + + def test_members(self): + self.assertTrue(bondethernet.is_bond_member(self.cfg, "GigabitEthernet1/0/0")) + self.assertTrue(bondethernet.is_bond_member(self.cfg, "GigabitEthernet1/0/1")) + self.assertFalse(bondethernet.is_bond_member(self.cfg, "GigabitEthernet2/0/0")) + self.assertFalse(bondethernet.is_bond_member(self.cfg, "GigabitEthernet2/0/0.100")) + + def test_is_bondethernet(self): + self.assertTrue(bondethernet.is_bondethernet(self.cfg, "BondEthernet0")) + self.assertFalse(bondethernet.is_bondethernet(self.cfg, "BondEthernet-notexist")) + self.assertFalse(bondethernet.is_bondethernet(self.cfg, "GigabitEthernet1/0/0")) + + def test_enumerators(self): + ifs = bondethernet.get_bondethernets(self.cfg) + self.assertEqual(len(ifs), 1) + self.assertIn("BondEthernet0", ifs) + self.assertNotIn("BondEthernet-noexist", ifs) +``` + +Every single function that is defined in the file `config/bondethernet.py` (there are four) will have +an accompanying unittest to ensure it works as expected. And every validator module, will have a suite +of unittests fully covering their functionality. In total, I wrote a few dozen unit tests like this, +in an attempt to be reasonably certain that the config validator functionality works as advertised. + +### YAML Testing + +I added one additional class of unittest called a ***YAMLTest***. What happens here is that a certain YAML configuration +file, which may be valid or have errors, is offered to the end to end config parser (so both the Yamale schema +validator as well as the semantic validators), and all errors are accounted for. 
As an example, two sub-interfaces
+on the same parent cannot have the same encapsulation, so offering the following file to the config validator
+is _expected_ to trip errors:
+
+```
+$ cat << EOF > unittest/yaml/error-subinterface1.yaml
+test:
+  description: "Two subinterfaces can't have the same encapsulation"
+  errors:
+    expected:
+    - "sub-interface .*.100 does not have unique encapsulation"
+    - "sub-interface .*.102 does not have unique encapsulation"
+    count: 2
+---
+interfaces:
+  GigabitEthernet1/0/0:
+    sub-interfaces:
+      100:
+        description: "VLAN 100"
+      101:
+        description: "Another VLAN 100, but without exact-match"
+        encapsulation:
+          dot1q: 100
+      102:
+        description: "Another VLAN 100, but without exact-match"
+        encapsulation:
+          dot1q: 100
+          exact-match: True
+EOF
+```
+
+You can see the file here has two YAML documents (separated by `---`): the first one explains to the YAMLTest
+class what to expect. There can either be no errors (in which case `test.errors.count=0`), or there can be
+specific errors that are expected. In this case, `Gi1/0/0.100` and `Gi1/0/0.102` have the same encapsulation
+but `Gi1/0/0.101` is unique (if you're curious, this is because the encap on 100 and 102 has exact-match,
+but the one on 101 does _not_ have exact-match).
+
+The implementation of this YAMLTest class is in `tests.py`, which in turn runs all YAML tests on the files it
+finds in `unittest/yaml/*.yaml` (currently 47 specific cases are tested there, which cover 100% of the
+semantic constraints), and regular unittests (currently 42, which is a coincidence, I swear!)
+
+# What's next?
+
+These tests, together, give me a pretty strong assurance that any given YAML file that passes the validator
+is indeed a valid configuration for VPP. In my next post, I'll go one step further, and talk about applying
+the configuration to a running VPP instance, which is of course the overarching goal. But I would not want
+to mess up my (or your!) VPP router by feeding it garbage, so the lion's share of my time so far on this project
+has been to assert the YAML file is both syntactically and semantically valid.
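+
+To make the YAMLTest mechanism above a little more tangible, here's a stripped-down sketch of what such a
+runner could look like. This is an illustration of the idea rather than the actual `tests.py`: it assumes the
+two-document file layout shown earlier, and a `validate(config)` callable that returns the
+(bool, [list-of-messages]) tuple described in the Config Validation section.
+
+```
+#!/usr/bin/env python3
+"""Run a single YAMLTest file -- an illustrative sketch, not the real tests.py."""
+import re
+import yaml
+
+def run_yamltest(path, validate):
+    """validate(config) is assumed to return (bool, [list-of-messages])."""
+    with open(path, "r") as f:
+        expectation, config = list(yaml.safe_load_all(f))[:2]
+    _, msgs = validate(config)
+    expected = expectation.get("test", {}).get("errors", {})
+    for regexp in expected.get("expected", []):
+        if not any(re.search(regexp, msg) for msg in msgs):
+            print(f"FAIL {path}: no error message matched {regexp!r}")
+            return False
+    if len(msgs) != expected.get("count", 0):
+        print(f"FAIL {path}: got {len(msgs)} errors, expected {expected.get('count', 0)}")
+        return False
+    print(f"OK   {path}")
+    return True
+```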
+ + +In the mean time, you can take a look at my code on [GitHub](https://github.com/pimvanpelt/vppcfg), but to +whet your appetite, here's a hefty configuration that demonstrates all implemented types: + +``` +bondethernets: + BondEthernet0: + interfaces: [ GigabitEthernet3/0/0, GigabitEthernet3/0/1 ] + +interfaces: + GigabitEthernet3/0/0: + mtu: 9000 + description: "LAG #1" + GigabitEthernet3/0/1: + mtu: 9000 + description: "LAG #2" + + HundredGigabitEthernet12/0/0: + lcp: "ice0" + mtu: 9000 + addresses: [ 192.0.2.17/30, 2001:db8:3::1/64 ] + sub-interfaces: + 1234: + mtu: 1200 + lcp: "ice0.1234" + encapsulation: + dot1q: 1234 + exact-match: True + 1235: + mtu: 1100 + lcp: "ice0.1234.1000" + encapsulation: + dot1q: 1234 + inner-dot1q: 1000 + exact-match: True + + HundredGigabitEthernet12/0/1: + mtu: 2000 + description: "Bridged" + + BondEthernet0: + mtu: 9000 + lcp: "be0" + sub-interfaces: + 100: + mtu: 2500 + l2xc: BondEthernet0.200 + encapsulation: + dot1q: 100 + exact-match: False + 200: + mtu: 2500 + l2xc: BondEthernet0.100 + encapsulation: + dot1q: 200 + exact-match: False + 500: + mtu: 2000 + encapsulation: + dot1ad: 500 + exact-match: False + 501: + mtu: 2000 + encapsulation: + dot1ad: 501 + exact-match: False + vxlan_tunnel1: + mtu: 2000 + +loopbacks: + loop0: + lcp: "lo0" + addresses: [ 10.0.0.1/32, 2001:db8::1/128 ] + loop1: + lcp: "bvi1" + addresses: [ 10.0.1.1/24, 2001:db8:1::1/64 ] + +bridgedomains: + bd1: + mtu: 2000 + bvi: loop1 + interfaces: [ BondEthernet0.500, BondEthernet0.501, HundredGigabitEthernet12/0/1, vxlan_tunnel1 ] + bd11: + mtu: 1500 + +vxlan_tunnels: + vxlan_tunnel1: + local: 192.0.2.1 + remote: 192.0.2.2 + vni: 101 +``` + +The vision for my VPP Configuration utility is that it can move from any existing VPP configuration to any +other (validated successfully) configuration with a minimal amount of steps, and that it will plan its +way declaratively from A to B, ordering the calls to the API safely and quickly. Interested? Good, because +I do expect that a utility like this would be very valuable to serious VPP users! + diff --git a/content/articles/2022-04-02-vppcfg-2.md b/content/articles/2022-04-02-vppcfg-2.md new file mode 100644 index 0000000..7d4232d --- /dev/null +++ b/content/articles/2022-04-02-vppcfg-2.md @@ -0,0 +1,727 @@ +--- +date: "2022-04-02T08:50:19Z" +title: VPP Configuration - Part2 +--- + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +I use VPP - Vector Packet Processor - extensively at IPng Networks. Earlier this year, the VPP community +merged the [Linux Control Plane]({%post_url 2021-08-12-vpp-1 %}) plugin. I wrote about its deployment +to both regular servers like the [Supermicro]({%post_url 2021-09-21-vpp-7 %}) routers that run on our +[AS8298]({% post_url 2021-02-27-network %}), as well as virtual machines running in +[KVM/Qemu]({% post_url 2021-12-23-vpp-playground %}). + +Now that I've been running VPP in production for about half a year, I can't help but notice one specific +drawback: VPP is a programmable dataplane, and _by design_ it does not include any configuration or +controlplane management stack. It's meant to be integrated into a full stack by operators. For end-users, +this unfortunately means that typing on the CLI won't persist any configuration, and if VPP is restarted, +it will not pick up where it left off. There's one developer convenience in the form of the `exec` +command-line (and startup.conf!) 
option, which will read a file and apply the contents to the CLI line
+by line. However, if any typo is made in the file, processing immediately stops. It's meant as a convenience
+for VPP developers, and is certainly not a useful configuration method for all but the simplest topologies.
+
+Luckily, VPP comes with an extensive set of APIs to allow it to be programmed. So in this series of posts,
+I'll detail the work I've done to create a configuration utility that can take a YAML configuration file,
+compare it to a running VPP instance, and step-by-step plan through the API calls needed to safely apply
+the configuration to the dataplane. Welcome to `vppcfg`!
+
+In this second post of the series, I want to talk a little bit about what planning a path from a running
+configuration to a desired new configuration might look like.
+
+**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
+prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
+or reach out by [contacting us](/s/contact/).
+
+## VPP Config: a DAG
+
+Before we dive into my `vppcfg` code, let me first introduce a mental model of how configuration is built. We
+rarely stop and think about it, but when we configure our routers (no matter if it's a Cisco or a Juniper or
+a VPP router), in our mind we logically order the operations in a very particular way. To state the obvious,
+if I want to create a sub-interface which also has an address, I would create the sub-int _before_ adding the
+address, right? Similarly, if I wanted to expose a sub-interface `Hu12/0/0.100` in Linux as a _LIP_, I would
+create it only _after_ having created a _LIP_ for the parent interface `Hu12/0/0`, to satisfy Linux's
+requirement that all sub-interfaces have a parent interface, like so:
+
+```
+vpp# create sub HundredGigabitEthernet12/0/0 100
+vpp# set interface ip address HundredGigabitEthernet12/0/0.100 192.0.2.1/29
+vpp# lcp create HundredGigabitEthernet12/0/0 host-if ice0
+vpp# lcp create HundredGigabitEthernet12/0/0.100 host-if ice0.100
+vpp# set interface state HundredGigabitEthernet12/0/0 up
+vpp# set interface state HundredGigabitEthernet12/0/0.100 up
+```
+
+Of course, some of the ordering doesn't strictly matter. For example, I can set the state of
+`Hu12/0/0.100` up before adding the address, or after adding the address, or even after adding the
+_LIP_, but one thing is certain: I cannot set its state to up before it was created in the first place!
+In the other direction, when removing things, it's easy to see that you cannot manipulate the state
+of a sub-interface after deleting it, so to cleanly remove the construction above, I would have to
+walk the statements back in reverse, like so:
+
+```
+vpp# set interface state HundredGigabitEthernet12/0/0.100 down
+vpp# set interface state HundredGigabitEthernet12/0/0 down
+vpp# lcp delete HundredGigabitEthernet12/0/0.100 host-if ice0.100
+vpp# lcp delete HundredGigabitEthernet12/0/0 host-if ice0
+vpp# set interface ip address del HundredGigabitEthernet12/0/0.100 192.0.2.1/29
+vpp# delete sub HundredGigabitEthernet12/0/0.100
+```
+
+Because of this reasonably straight forward ordering, it's possible to construct a graph showing
+operations that depend on other operations having been completed beforehand. Such a graph is called
+a [Directed Acyclic Graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph), or _DAG_.
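+
+As a quick aside before the theory: Python's standard library ships exactly this ordering primitive in the
+form of `graphlib.TopologicalSorter` (Python 3.9+). The toy sketch below uses a small, hypothetical dependency
+map rather than vppcfg's real one, just to show how a create order (and its reverse, the delete order) falls
+out of such a graph:
+
+```
+#!/usr/bin/env python3
+"""Toy example of ordering create/delete operations -- not vppcfg's actual planner."""
+from graphlib import TopologicalSorter
+
+# Each object maps to the set of objects it depends on.
+deps = {
+    "phy Hu12/0/0": set(),
+    "sub Hu12/0/0.100": {"phy Hu12/0/0"},
+    "lcp ice0": {"phy Hu12/0/0"},
+    "lcp ice0.100": {"sub Hu12/0/0.100", "lcp ice0"},
+    "address 192.0.2.1/29": {"sub Hu12/0/0.100"},
+}
+
+create_order = list(TopologicalSorter(deps).static_order())  # dependencies come first
+delete_order = list(reversed(create_order))                  # tear down in reverse
+
+print("create:", " -> ".join(create_order))
+print("delete:", " -> ".join(delete_order))
+```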
+
+{{< image width="400px" float="left" src="/assets/vppcfg/vppcfg-dag.png" alt="DAG" >}}
+
+First some theory (from Wikipedia): A directed graph is a DAG if and only if it can be topologically
+ordered, by arranging the vertices as a linear ordering that is consistent with all edge directions.
+DAGs have numerous scientific and computational applications, but the ones I'm mostly interested in here
+are dependency mapping and computational scheduling.
+
+A graph is formed by vertices and by edges connecting pairs of vertices, where the vertices are
+objects that might exist in VPP (interfaces, bridge-domains, VXLAN tunnels, IP addresses, etc),
+and these objects are connected in pairs by edges. In the case of a directed graph, each edge has an
+orientation (or direction), from one (source) vertex to another (destination) vertex. A path in a
+directed graph is a sequence of edges having the property that the ending vertex of each edge in the
+sequence is the same as the starting vertex of the next edge in the sequence; a path forms a cycle
+if the starting vertex of its first edge equals the ending vertex of its last edge. A directed acyclic
+graph is a directed graph that has no cycles, which in this particular case means that objects'
+existence can't rely on other things that ultimately rely back on their own existence.
+
+Now that I've got that technobabble out of the way: practically speaking, the _edges_ in this graph model
+dependencies. Let me give a few examples:
+
+1. The arrow from _Sub Interface_ pointing at _BondEther_ and _Physical Int_ makes the claim that
+   for the sub-int to exist, it _depends on_ the existence of either a BondEthernet or a PHY.
+1. The arrow from the _BondEther_ to the _Physical Int_ makes the claim that for the BondEthernet
+   to work, it must have one or more PHYs in it.
+1. There is no arrow from _BondEther_ to _Sub Interface_, which makes the claim that a BondEthernet does
+   not depend on its sub-interfaces: there is no need for a sub-int to exist in order for a BondEthernet
+   to work.
+
+## VPP Config: Ordering
+
+In my [previous]({% post_url 2022-03-27-vppcfg-1 %}) post, I talked about a bunch of constraints that
+make certain YAML configurations invalid (for example, having both _dot1q_ and _dot1ad_ on a sub-interface,
+which wouldn't make any sense). Here, I'm going to talk about another type of constraint: ***Temporal
+Constraints*** are statements about the ordering of operations. With the example DAG above, I derive the
+following constraints:
+
+* A parent interface must exist before a sub-interface can be created on it
+* An interface (whether sub-int or phy) must exist before an IP address can be added to it
+* A _LIP_ can be created on a sub-int only if its parent PHY has a _LIP_
+* _LIPs_ must be removed from all sub-interfaces before a PHY's _LIP_ can be removed
+* The admin-state of a sub-interface can only be up if its PHY is up
+* ... and so on.
+
+But there's a second thing to keep in mind, and this is a bit more specific to the VPP configuration
+operations themselves. Sometimes, I may find that an object already exists, say a sub-interface, but
+that it has configuration attributes that are not what I wanted. For example, I may have previously
+configured a sub-int to be of a certain encapsulation `dot1q 1000 inner-dot1q 1234`, but I changed
+my mind and want the sub-int to now be `dot1ad 1000 inner-dot1q 1234` instead.
Some attributes of
+an interface can be changed on the fly (like the MTU, for example), but some really cannot, and in
+my example here, the encapsulation change has to be done another way.
+
+I'll make an obvious but hopefully helpful observation: I can't create the second sub-int with
+the same subid, because one already exists (duh). The intuitive way to solve this, of course, is to
+delete the old sub-int _first_ and then create a _new_ sub-int with the correct attributes (`dot1ad`
+outer encapsulation).
+
+Here's another scenario that illustrates the ordering: Let's say I want to move an IP address
+from interface A to interface B. In VPP, I can't configure the same IP address/prefixlen on two
+interfaces at the same time, so as with the previous scenario of the encap changing, I will want
+to remove the IP address from A before adding it to B.
+
+Come to think of it, there are lots of scenarios where remove-before-add is required:
+* If an interface was in bridge-domain A but now wants to be put in bridge-domain B, it'll have
+  to be _removed_ from the first bridge before being _added_ to the second bridge, because an
+  interface can't be in two bridges at the same time.
+* If an interface was a member of a BondEthernet, but will be moved to be a member of a
+  bridge-domain now, it will have to be _removed_ from the bond before being _added_ to the
+  bridge, because an interface can't be both a bondethernet member and a member of a bridge
+  at the same time.
+* And to add to the list, the scenario above: A sub-interface that differs in its intended
+  encapsulation must be _removed_ before a new one with the same `subid` can be _created_.
+
+All of these cases can be modeled as edges (arrows) between vertices (objects) in the graph
+describing the ordering of operations in VPP! I'm now ready to draw two important conclusions:
+
+1. All objects that differ from their intended configuration must be removed before being
+   added elsewhere, in order to avoid them being referenced/used twice.
+1. All objects must be created before their attributes can be set.
+
+### vppcfg: Path Planning
+
+By thinking about the configuration in this way, I can precisely predict the order of
+operations needed to go from any running dataplane configuration to _any new_ target
+dataplane configuration. A so-called path-planner emerges, which has three main phases of
+execution:
+
+1. **Prune** phase (remove objects from VPP that are not in the config)
+1. **Create** phase (add objects to VPP that are in the config but not VPP)
+1. **Sync** phase (synchronize the attributes of each object in the config)
+
+When removing things, care has to be taken to remove inner-most objects first (first removing
+LCP, then QinQ, Dot1Q, BondEthernet, and lastly PHY), because indeed, there exists a dependency
+relationship between objects in this DAG. Conversely, when creating objects, the edges flip their
+directionality, because creation must be done on outer-most objects first (first creating the
+PHY, then BondEthernet, Dot1Q, QinQ and lastly LCP).
+
+For example, QinQ/QinAD sub-interfaces should be removed before their intermediary
+Dot1Q/Dot1AD can be removed. Another example: the MTU of parents should be raised before that of
+their children, while children should shrink before their parents.
+
+**Order matters**.
+
+**Pruning**: First, `vppcfg` will ensure all objects do not have attributes which they should not (eg. IP
+addresses) and that objects are destroyed that are not needed (ie. have been removed from the
+target config).
After this phase, I am certain that any object that exists in the dataplane, +both (a) has the right to exist (because it's in the target configuration), and (b) has the +correct create-time (ie non syncable) attributes. + +**Creating**: Next, `vppcfg` will ensure that all objects that are not yet present (including the ones that +it just removed because they were present but had incorrect attributes), get (re)created in the +right order. After this phase, I am certain that _all objects_ in the dataplane now (a) have the +right to exist (because they are in the target configuration), (b) have the correct attributes, +but newly, also that (c) all objects that are in the target configuration also got created and +now exist in the dataplane. + +**Syncing**: Finally, all objects are synchronized with the target configuration (IP addresses, +MTU etc), taking care to shrink children before their parents, and growing parents before their +children (this is for the special case of any given sub-interface's MTU having to be equal to or +lower than their parent's MTU). + +### vppcfg: Demonstration + +I'll create three configurations and let vppcfg path-plan between them. I start a completely +empty VPP dataplane which has two GigabitEthernet and two HundredGigabitEthernet interfaces: + +``` +pim@hippo:~/src/vpp$ make run + _______ _ _ _____ ___ + __/ __/ _ \ (_)__ | | / / _ \/ _ \ + _/ _// // / / / _ \ | |/ / ___/ ___/ + /_/ /____(_)_/\___/ |___/_/ /_/ + +DBGvpp# show interface + Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count +GigabitEthernet3/0/0 1 down 9000/0/0/0 +GigabitEthernet3/0/1 2 down 9000/0/0/0 +HundredGigabitEthernet12/0/0 3 down 9000/0/0/0 +HundredGigabitEthernet12/0/1 4 down 9000/0/0/0 +local0 0 down 0/0/0/0 +``` + +#### Demo 1: First time config (empty VPP) + +First, starting simple, I write the following YAML configuration called `hippo4.yaml`. It defines a +few sub-interfaces, a bridgedomain with one QinQ sub-interface `Hu12/0/0.101` in it, and it then +cross-connects `Gi3/0/0.100` with `Hu12/0/1.100`, keeping all sub-interfaces at an MTU of 2000 and +their PHYs at an MTU of 9216: + +``` +interfaces: + GigabitEthernet3/0/0: + mtu: 9216 + sub-interfaces: + 100: + mtu: 2000 + l2xc: HundredGigabitEthernet12/0/1.100 + GigabitEthernet3/0/1: + description: Not Used + HundredGigabitEthernet12/0/0: + mtu: 9216 + sub-interfaces: + 100: + mtu: 3000 + 101: + mtu: 2000 + encapsulation: + dot1q: 100 + inner-dot1q: 200 + exact-match: True + HundredGigabitEthernet12/0/1: + mtu: 9216 + sub-interfaces: + 100: + mtu: 2000 + l2xc: GigabitEthernet3/0/0.100 + +bridgedomains: + bd10: + description: "Bridge Domain 10" + mtu: 2000 + interfaces: [ HundredGigabitEthernet12/0/0.101 ] +``` + +If I offer this config to `vppcfg` and ask it to plan a path, there won't be any **pruning** going on, +because there are no objects in the newly started VPP dataplane that need to be deleted. But I do expect +to see a bunch of sub-interface and one bridge-domain **creation**, followed by **syncing** a bunch of +interfaces with bridge-domain memberships and L2 Cross Connects. 
Finally, the MTU of the interfaces will +be sync'd to their configured values, and the path is planned like so: + +``` +pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo4.yaml plan +[INFO ] root.main: Loading configfile hippo4.yaml +[INFO ] vppcfg.config.valid_config: Configuration validated successfully +[INFO ] root.main: Configuration is valid +[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823 +create sub GigabitEthernet3/0/0 100 dot1q 100 exact-match +create sub HundredGigabitEthernet12/0/0 100 dot1q 100 exact-match +create sub HundredGigabitEthernet12/0/1 100 dot1q 100 exact-match +create sub HundredGigabitEthernet12/0/0 101 dot1q 100 inner-dot1q 200 exact-match +create bridge-domain 10 +set interface l2 bridge HundredGigabitEthernet12/0/0.101 10 +set interface l2 tag-rewrite HundredGigabitEthernet12/0/0.101 pop 2 +set interface l2 xconnect GigabitEthernet3/0/0.100 HundredGigabitEthernet12/0/1.100 +set interface l2 tag-rewrite GigabitEthernet3/0/0.100 pop 1 +set interface l2 xconnect HundredGigabitEthernet12/0/1.100 GigabitEthernet3/0/0.100 +set interface l2 tag-rewrite HundredGigabitEthernet12/0/1.100 pop 1 +set interface mtu 9216 GigabitEthernet3/0/0 +set interface mtu 9216 HundredGigabitEthernet12/0/0 +set interface mtu 9216 HundredGigabitEthernet12/0/1 +set interface mtu packet 1500 GigabitEthernet3/0/1 +set interface mtu packet 9216 GigabitEthernet3/0/0 +set interface mtu packet 9216 HundredGigabitEthernet12/0/0 +set interface mtu packet 9216 HundredGigabitEthernet12/0/1 +set interface mtu packet 2000 GigabitEthernet3/0/0.100 +set interface mtu packet 3000 HundredGigabitEthernet12/0/0.100 +set interface mtu packet 2000 HundredGigabitEthernet12/0/1.100 +set interface mtu packet 2000 HundredGigabitEthernet12/0/0.101 +set interface mtu 1500 GigabitEthernet3/0/1 +set interface state GigabitEthernet3/0/0 up +set interface state GigabitEthernet3/0/0.100 up +set interface state GigabitEthernet3/0/1 up +set interface state HundredGigabitEthernet12/0/0 up +set interface state HundredGigabitEthernet12/0/0.100 up +set interface state HundredGigabitEthernet12/0/0.101 up +set interface state HundredGigabitEthernet12/0/1 up +set interface state HundredGigabitEthernet12/0/1.100 up +[INFO ] root.main: Planning succeeded +``` + +On the `vppctl` commandline, I can simply cut-and-paste these CLI commands and the dataplane ends up +configured exactly like was desired in the `hippo4.yaml` configuration file. One nice way to tell if +the reconciliation of the config file into the running VPP instance was successful is by running the +planner again with the same YAML config file. 
It should not find anything worth pruning, creating nor +syncing, and indeed: + +``` +pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo4.yaml plan +[INFO ] root.main: Loading configfile hippo4.yaml +[INFO ] vppcfg.config.valid_config: Configuration validated successfully +[INFO ] root.main: Configuration is valid +[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823 +[INFO ] root.main: Planning succeeded +``` + +#### Demo 2: Moving from one config to another + +To demonstrate how my reconciliation algorithm works in practice, I decide to invent a radically +different configuration for Hippo, called `hippo12.yaml`, in which a new BondEthernet appears, +two of its sub-interfaces are cross connected, `Hu12/0/0` now gets a _LIP_ and some IP addresses, and +the bridge-domain `bd10` is replaced by two others, `bd1` and `bd11`, the former of which also sports +a BVI (with a _LIP_ called `bvi1`) and a VXLAN Tunnel bridged into `bd1` for good measure: + +``` +bondethernets: + BondEthernet0: + interfaces: [ GigabitEthernet3/0/0, GigabitEthernet3/0/1 ] + +interfaces: + GigabitEthernet3/0/0: + mtu: 9000 + description: "LAG #1" + GigabitEthernet3/0/1: + mtu: 9000 + description: "LAG #2" + + HundredGigabitEthernet12/0/0: + lcp: "ice12-0-0" + mtu: 9000 + addresses: [ 192.0.2.17/30, 2001:db8:3::1/64 ] + sub-interfaces: + 1234: + mtu: 1200 + lcp: "ice0.1234" + encapsulation: + dot1q: 1234 + exact-match: True + 1235: + mtu: 1100 + lcp: "ice0.1234.1000" + encapsulation: + dot1q: 1234 + inner-dot1q: 1000 + exact-match: True + + HundredGigabitEthernet12/0/1: + mtu: 2000 + description: "Bridged" + BondEthernet0: + mtu: 9000 + lcp: "bond0" + sub-interfaces: + 10: + lcp: "bond0.10" + mtu: 3000 + 100: + mtu: 2500 + l2xc: BondEthernet0.200 + encapsulation: + dot1q: 100 + exact-match: False + 200: + mtu: 2500 + l2xc: BondEthernet0.100 + encapsulation: + dot1q: 200 + exact-match: False + 500: + mtu: 2000 + encapsulation: + dot1ad: 500 + exact-match: False + 501: + mtu: 2000 + encapsulation: + dot1ad: 501 + exact-match: False + vxlan_tunnel1: + mtu: 2000 + +loopbacks: + loop0: + lcp: "lo0" + addresses: [ 10.0.0.1/32, 2001:db8::1/128 ] + loop1: + lcp: "bvi1" + addresses: [ 10.0.1.1/24, 2001:db8:1::1/64 ] + +bridgedomains: + bd1: + mtu: 2000 + bvi: loop1 + interfaces: [ BondEthernet0.500, BondEthernet0.501, HundredGigabitEthernet12/0/1, vxlan_tunnel1 ] + bd11: + mtu: 1500 + +vxlan_tunnels: + vxlan_tunnel1: + local: 192.0.2.1 + remote: 192.0.2.2 + vni: 101 +``` + +Before writing `vppcfg`, the art of moving from `hippo4.yaml` to this radically different `hippo12.yaml` +would be a nightmare, and almost certainly have caused me to miss a step and cause an outage. 
But, due to
+the fundamental understanding of ordering, and the methodical execution of **pruning**, **creating** and
+**syncing** the objects, the path planner comes up with the following sequence, which I'll break down
+into its three constituent phases:
+
+```
+pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo12.yaml plan
+[INFO ] root.main: Loading configfile hippo12.yaml
+[INFO ] vppcfg.config.valid_config: Configuration validated successfully
+[INFO ] root.main: Configuration is valid
+[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823
+set interface state HundredGigabitEthernet12/0/0.101 down
+set interface state GigabitEthernet3/0/0.100 down
+set interface state HundredGigabitEthernet12/0/0.100 down
+set interface state HundredGigabitEthernet12/0/1.100 down
+set interface l2 tag-rewrite HundredGigabitEthernet12/0/0.101 disable
+set interface l3 HundredGigabitEthernet12/0/0.101
+create bridge-domain 10 del
+set interface l2 tag-rewrite GigabitEthernet3/0/0.100 disable
+set interface l3 GigabitEthernet3/0/0.100
+set interface l2 tag-rewrite HundredGigabitEthernet12/0/1.100 disable
+set interface l3 HundredGigabitEthernet12/0/1.100
+delete sub HundredGigabitEthernet12/0/0.101
+delete sub GigabitEthernet3/0/0.100
+delete sub HundredGigabitEthernet12/0/0.100
+delete sub HundredGigabitEthernet12/0/1.100
+```
+
+First, `vppcfg` concludes that `Hu12/0/0.101`, `Hu12/0/1.100` and `Gi3/0/0.100` are no longer
+needed, so it sets them all admin-state down. The bridge-domain `bd10` no longer has the right to
+exist, the poor thing. But before it is deleted, the interface that was in `bd10` can be pruned
+(membership _depends_ on the bridge, so in pruning, dependents are removed before the objects they
+depend on). Considering `Hu12/0/1.100` and `Gi3/0/0.100` were an L2XC pair before, they are returned
+to default (L3) mode, and because it's no longer needed, the [VLAN Gymnastics]({%post_url 2022-02-14-vpp-vlan-gym %})
+tag rewriting is also cleaned up for both interfaces. Finally, the sub-interfaces that do not appear
+in the target configuration are deleted, completing the **pruning** phase.
+
+It then continues with the **create** phase:
+
+```
+create loopback interface instance 0
+create loopback interface instance 1
+create bond mode lacp load-balance l34 id 0
+create vxlan tunnel src 192.0.2.1 dst 192.0.2.2 instance 1 vni 101 decap-next l2
+create sub HundredGigabitEthernet12/0/0 1234 dot1q 1234 exact-match
+create sub BondEthernet0 10 dot1q 10 exact-match
+create sub BondEthernet0 100 dot1q 100
+create sub BondEthernet0 200 dot1q 200
+create sub BondEthernet0 500 dot1ad 500
+create sub BondEthernet0 501 dot1ad 501
+create sub HundredGigabitEthernet12/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match
+create bridge-domain 1
+create bridge-domain 11
+lcp create HundredGigabitEthernet12/0/0 host-if ice12-0-0
+lcp create BondEthernet0 host-if bond0
+lcp create loop0 host-if lo0
+lcp create loop1 host-if bvi1
+lcp create HundredGigabitEthernet12/0/0.1234 host-if ice0.1234
+lcp create BondEthernet0.10 host-if bond0.10
+lcp create HundredGigabitEthernet12/0/0.1235 host-if ice0.1234.1000
+```
+
+Here, interfaces are created in order of loopbacks first, then BondEthernets, then Tunnels, and
+finally sub-interfaces, first creating single-tagged and then creating dual-tagged sub-interfaces.
+Of course, the BondEthernet has to be created before any sub-int can be created on it.
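+
+To make the ordering idea a bit more concrete, here is a small sketch (in no way `vppcfg`'s actual
+implementation, just an illustration of the concept) that models a handful of the objects above as
+a DAG, where each object lists what it depends on. A topological sort then yields a safe create
+order, and pruning is simply that order reversed:
+
+```
+# Sketch only: model "object depends on parent object" edges and derive the ordering.
+from graphlib import TopologicalSorter  # Python 3.9+ standard library
+
+deps = {
+    "HundredGigabitEthernet12/0/0": set(),                                       # PHY
+    "HundredGigabitEthernet12/0/0.1234": {"HundredGigabitEthernet12/0/0"},       # dot1q sub-int
+    "HundredGigabitEthernet12/0/0.1235": {"HundredGigabitEthernet12/0/0.1234"},  # QinQ sub-int
+    "lcp ice12-0-0": {"HundredGigabitEthernet12/0/0"},
+    "lcp ice0.1234": {"HundredGigabitEthernet12/0/0.1234", "lcp ice12-0-0"},
+    "lcp ice0.1234.1000": {"HundredGigabitEthernet12/0/0.1235", "lcp ice0.1234"},
+}
+
+create_order = list(TopologicalSorter(deps).static_order())  # parents before children
+prune_order = list(reversed(create_order))                   # children before parents
+
+print("create:", create_order)
+print("prune: ", prune_order)
+```
+
+Running the prune in the reverse order of the create is what guarantees that, for instance, an LCP
+disappears before its sub-interface, and a QinQ sub-interface disappears before its dot1q parent.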
+Note that the QinQ `Hu12/0/0.1235` will be created after its intermediary parent `Hu12/0/0.1234` +due to this ordering requirement. + +Then, the two new bridgedomains `bd1` and `bd11` are created, and finally the _LIP_ plumbing is +performed, starting with the PHY `ice12-0-0` and BondEthernet `bond0`, then the two loopbacks, +and only then advancing to the two single-tag dot1q interfaces and finally the QinQ interface. For +LCPs, this is very important, because in Linux, the interfaces are a tree, not a list. `ice12-0-0` +must be created before its child `ice0.1234@ice12-0-0` can be created, and only then can the QinQ +`ice0.1234.1000@ice0.1234` be created. This creation order follows from the DAG having an edge +signalling an LCP depending on the sub-interface, and an edge between the sub-interface with two +tags depending on the sub-interface with one tag, and an edge between the single-tagged sub-interface +depending on its PHY. + +After all this work, `vppcfg` can assert (a) every object that now exists in VPP is in the +target configuration and (b) that any object that exists in the configuration also is present in +VPP (with the correct attributes). + +But there's one last thing to do, and that's ensure that the attributes that can be changed at +runtime (IP addresses, L2XCs, BondEthernet and bridge-domain members, etc) , are **sync'd** into +their respective objects in VPP based on what's in the target configuration: + +``` +bond add BondEthernet0 GigabitEthernet3/0/0 +bond add BondEthernet0 GigabitEthernet3/0/1 +comment { ip link set bond0 address 00:25:90:0c:05:01 } +set interface l2 bridge loop1 1 bvi +set interface l2 bridge BondEthernet0.500 1 +set interface l2 tag-rewrite BondEthernet0.500 pop 1 +set interface l2 bridge BondEthernet0.501 1 +set interface l2 tag-rewrite BondEthernet0.501 pop 1 +set interface l2 bridge HundredGigabitEthernet12/0/1 1 +set interface l2 tag-rewrite HundredGigabitEthernet12/0/1 disable +set interface l2 bridge vxlan_tunnel1 1 +set interface l2 tag-rewrite vxlan_tunnel1 disable +set interface l2 xconnect BondEthernet0.100 BondEthernet0.200 +set interface l2 tag-rewrite BondEthernet0.100 pop 1 +set interface l2 xconnect BondEthernet0.200 BondEthernet0.100 +set interface l2 tag-rewrite BondEthernet0.200 pop 1 +set interface state GigabitEthernet3/0/1 down +set interface mtu 9000 GigabitEthernet3/0/1 +set interface state GigabitEthernet3/0/1 up +set interface mtu packet 9000 GigabitEthernet3/0/0 +set interface mtu packet 9000 HundredGigabitEthernet12/0/0 +set interface mtu packet 2000 HundredGigabitEthernet12/0/1 +set interface mtu packet 2000 vxlan_tunnel1 +set interface mtu packet 1500 loop0 +set interface mtu packet 1500 loop1 +set interface mtu packet 9000 GigabitEthernet3/0/1 +set interface mtu packet 1200 HundredGigabitEthernet12/0/0.1234 +set interface mtu packet 3000 BondEthernet0.10 +set interface mtu packet 2500 BondEthernet0.100 +set interface mtu packet 2500 BondEthernet0.200 +set interface mtu packet 2000 BondEthernet0.500 +set interface mtu packet 2000 BondEthernet0.501 +set interface mtu packet 1100 HundredGigabitEthernet12/0/0.1235 +set interface state GigabitEthernet3/0/0 down +set interface mtu 9000 GigabitEthernet3/0/0 +set interface state GigabitEthernet3/0/0 up +set interface state HundredGigabitEthernet12/0/0 down +set interface mtu 9000 HundredGigabitEthernet12/0/0 +set interface state HundredGigabitEthernet12/0/0 up +set interface state HundredGigabitEthernet12/0/1 down +set interface mtu 2000 HundredGigabitEthernet12/0/1 +set 
interface state HundredGigabitEthernet12/0/1 up
+set interface ip address HundredGigabitEthernet12/0/0 192.0.2.17/30
+set interface ip address HundredGigabitEthernet12/0/0 2001:db8:3::1/64
+set interface ip address loop0 10.0.0.1/32
+set interface ip address loop0 2001:db8::1/128
+set interface ip address loop1 10.0.1.1/24
+set interface ip address loop1 2001:db8:1::1/64
+set interface state HundredGigabitEthernet12/0/0.1234 up
+set interface state HundredGigabitEthernet12/0/0.1235 up
+set interface state BondEthernet0 up
+set interface state BondEthernet0.10 up
+set interface state BondEthernet0.100 up
+set interface state BondEthernet0.200 up
+set interface state BondEthernet0.500 up
+set interface state BondEthernet0.501 up
+set interface state vxlan_tunnel1 up
+set interface state loop0 up
+set interface state loop1 up
+```
+
+I'm not gonna lie, it's a tonne of work, but it's all a pretty straightforward juggle. The sync
+phase will look at each object in the config and ensure that the attributes that same object has in the
+dataplane are present and correct. In my demo, `hippo12.yaml` creates a lot of interfaces and IP
+addresses, and changes the MTU of pretty much every interface, but in order:
+
+* The BondEthernet gets its members `Gi3/0/0` and `Gi3/0/1`. As an interesting aside, when VPP
+  creates a BondEthernet it'll initially assign it an ephemeral MAC address. Then, when its first
+  member is added, the MAC address of the BondEthernet will change to that of the first member.
+  The comment reminds me to also set this MAC on the Linux device `bond0`. In the future, I'll add
+  some `PyRoute2` code to do that automatically.
+* BridgeDomains are next. The BVI `loop1` is added first, then a few sub-interfaces and a tunnel,
+  and VLAN tag-rewriting for tagged interfaces is configured. There are two bridges, but only one
+  of them has members, so there's not much (in fact, there's nothing) to do for the other one.
+* L2 Cross Connects can be changed at runtime, and they're next. The two interfaces `BE0.100` and
+  `BE0.200` are connected to one another and tag-rewrites are set up for them, considering they
+  are both tagged sub-interfaces.
+* MTU is next. There are two variants of this. The first one, `set interface mtu`, is actually a
+  change in the DPDK driver to change the maximum allowable frame size. For this change, some
+  interface types have to be brought down first, the max frame size changed, and then brought back
+  up again. For all the others, the MTU will be changed in a specific order:
+  1. PHYs will grow their MTU first, as growing a PHY is guaranteed to be always safe.
+  1. Sub-interfaces will shrink QinX first, then Dot1Q/Dot1AD, then untagged interfaces. This is
+     to ensure we do not leave VPP and LinuxCP in a state where a QinQ sub-int has a higher MTU
+     than any of its parents.
+  1. Sub-interfaces will grow untagged first, then Dot1Q/Dot1AD, and finally QinX sub-interfaces.
+     Same reason as step 2: no sub-interface will end up with a higher MTU than any of its
+     parents.
+  1. PHYs will shrink their MTU last. The YAML configuration validation asserts that no PHY can
+     have an MTU lower than any of its children, so this is safe.
+* Finally, IP addresses are added to `Hu12/0/0`, `loop0` and `loop1`. I can guarantee that adding
+  IP addresses will not clash with any other interface, because pruning would've previously removed
+  IP addresses from interfaces where they don't belong.
+
+* And to finish off, the admin state for interfaces is set, again going from PHY, Bond, Tunnel,
+  1-tagged sub-interfaces and finally 2-tagged sub-interfaces and loopbacks.
+
+Let's put it to the test:
+
+```
+pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo12.yaml plan -o hippo4-to-12.exec
+[INFO ] root.main: Loading configfile hippo12.yaml
+[INFO ] vppcfg.config.valid_config: Configuration validated successfully
+[INFO ] root.main: Configuration is valid
+[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823
+[INFO ] vppcfg.reconciler.write: Wrote 94 lines to hippo4-to-12.exec
+[INFO ] root.main: Planning succeeded
+
+pim@hippo:~/src/vppcfg$ vppctl exec ~/src/vppcfg/hippo4-to-12.exec
+
+pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo12.yaml plan
+[INFO ] root.main: Loading configfile hippo12.yaml
+[INFO ] vppcfg.config.valid_config: Configuration validated successfully
+[INFO ] root.main: Configuration is valid
+[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823
+[INFO ] root.main: Planning succeeded
+```
+
+Notice that after applying `hippo4-to-12.exec`, the planner had nothing else to say. VPP is now in
+the target configuration state, slick!
+
+#### Demo 3: Returning VPP to empty
+
+This one is easy, but shows the pruning in action. Let's say I wanted to return VPP to a default
+configuration without any objects, and its interfaces all at MTU 1500:
+
+```
+interfaces:
+  GigabitEthernet3/0/0:
+    mtu: 1500
+    description: Not Used
+  GigabitEthernet3/0/1:
+    mtu: 1500
+    description: Not Used
+  HundredGigabitEthernet12/0/0:
+    mtu: 1500
+    description: Not Used
+  HundredGigabitEthernet12/0/1:
+    mtu: 1500
+    description: Not Used
+```
+
+Simply applying that plan:
+
+```
+pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo-empty.yaml plan -o 12-to-empty.exec
+[INFO ] root.main: Loading configfile hippo-empty.yaml
+[INFO ] vppcfg.config.valid_config: Configuration validated successfully
+[INFO ] root.main: Configuration is valid
+[INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823
+[INFO ] vppcfg.reconciler.write: Wrote 66 lines to 12-to-empty.exec
+[INFO ] root.main: Planning succeeded
+
+pim@hippo:~/src/vppcfg$ vppctl
+vpp# exec ~/src/vppcfg/12-to-empty.exec
+vpp# show interface
+              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
+GigabitEthernet3/0/0              1      up          1500/0/0/0
+GigabitEthernet3/0/1              2      up          1500/0/0/0
+HundredGigabitEthernet12/0/0      3      up          1500/0/0/0
+HundredGigabitEthernet12/0/1      4      up          1500/0/0/0
+local0                            0     down          0/0/0/0
+```
+
+### Final notes
+
+Now you may have been wondering why I would call the first file `hippo4.yaml` and the second one
+`hippo12.yaml`. This is because I have 20 such YAML files that bring Hippo into all sorts of
+esoteric configuration states, and I do this so that I can do a full integration test of any config
+morphing into any other config:
+
+```
+for i in hippo[0-9]*.yaml; do
+  echo "Clearing: Moving to hippo-empty.yaml"
+  ./vppcfg -c hippo-empty.yaml > /tmp/vppcfg-exec-empty
+  [ -s /tmp/vppcfg-exec-empty ] && vppctl exec /tmp/vppcfg-exec-empty
+
+  for j in hippo[0-9]*.yaml; do
+    echo " - Moving to $i .. "
+    ./vppcfg -c $i > /tmp/vppcfg-exec_$i
+    [ -s /tmp/vppcfg-exec_$i ] && vppctl exec /tmp/vppcfg-exec_$i
+
+    echo " - Moving from $i to $j"
+    ./vppcfg -c $j > /tmp/vppcfg-exec_${i}_${j}
+    [ -s /tmp/vppcfg-exec_${i}_${j} ] && vppctl exec /tmp/vppcfg-exec_${i}_${j}
+
+    echo " - Checking that from $j to $j is empty"
+    ./vppcfg -c $j > /tmp/vppcfg-exec_${j}_${j}_null
+  done
+done
+```
+
+What this does is start Hippo off with an empty config, then move it to `hippo1.yaml`, and from
+there move the configuration to each YAML file and back to `hippo1.yaml`. Doing this proves that,
+no matter which configuration I want to obtain, I can get there safely when the VPP dataplane
+config starts out looking like what is described in `hippo1.yaml`. I'll then move it back to empty,
+and into `hippo2.yaml`, doing the whole cycle again. So for 20 files, this means ~400 or so
+configuration transitions. And some of these are special, notably moving from `hippoN.yaml` to
+the same `hippoN.yaml` should result in zero diffs.
+
+With this path planner reasonably well tested, I have pretty high confidence that `vppcfg` can
+change the dataplane from any existing configuration to any desired target configuration.
+
+## What's next
+
+One thing that I didn't mention yet is that the `vppcfg` path planner works by reading the API
+configuration state exactly once (at startup), and then it figures out the CLI calls to print
+without needing to talk to VPP again. This is super useful as it's a non-intrusive way to inspect
+the changes before applying them, and it's a property I'd like to carry forward.
+
+However, I don't necessarily think that emitting the CLI statements is the best user experience;
+they are mostly useful for analysis. What I really want to do is emit
+API calls after the plan is created and reviewed/approved, directly reprogramming the VPP dataplane,
+and likely the Linux network namespace interfaces as well, for example setting the MAC address of
+a BondEthernet as I showed in that one comment above, or setting interface alias names based on
+the configured descriptions.
+
+However, the VPP API set needed to do this is not 100% baked yet. For example, I observed crashes
+when tinkering with BVIs and Loopbacks ([thread](https://lists.fd.io/g/vpp-dev/message/21116)), and
+fixed a few obvious errors in the Linux CP API ([gerrit](https://gerrit.fd.io/r/c/vpp/+/35479)), but
+there are still a few more issues to work through before I can take the next step with `vppcfg`.
+
+But for now, it's already helping me out tremendously at IPng Networks, and I hope it'll be useful
+for others, too.
diff --git a/content/articles/2022-10-14-lab-1.md b/content/articles/2022-10-14-lab-1.md
new file mode 100644
index 0000000..f533cd4
--- /dev/null
+++ b/content/articles/2022-10-14-lab-1.md
@@ -0,0 +1,645 @@
+---
+date: "2022-10-14T19:52:11Z"
+title: VPP Lab - Setup
+---
+
+{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
+
+# Introduction
+
+In a previous post ([VPP Linux CP - Virtual Machine Playground]({% post_url 2021-12-23-vpp-playground %})), I
+wrote a bit about building a QEMU image so that folks can play with the [Vector Packet Processor](https://fd.io)
+and the Linux Control Plane code. Judging by our access logs, this image has definitely been downloaded a bunch,
+and I myself use it regularly when I want to tinker a little bit, without wanting to impact the production
+routers at [AS8298]({% post_url 2021-02-27-network %}).
+
+The topology of my tests has become a bit more complicated over time, and often just one router would not be
+enough. Yet, repeatability is quite important, and I found myself constantly reinstalling / recheckpointing
+the `vpp-proto` virtual machine I was using. I got my hands on some LAB hardware, so it's time for an upgrade!
+
+## IPng Networks LAB - Physical
+
+{{< image width="300px" float="left" src="/assets/lab/physical.png" alt="Physical" >}}
+
+First, I specc'd out a few machines that will serve as hypervisors. From top to bottom in the picture here, two
+FS.com S5680-20SQ switches -- I reviewed these earlier [[ref]({% post_url 2021-08-07-fs-switch %})], and I really
+like these, as they come with 20x10G, 4x25G and 2x40G ports, an OOB management port and serial to configure them.
+Under it is its larger brother, with 48x10G and 8x100G ports, the FS.com S5860-48SC. Although it's a bit more
+expensive, it's also necessary because I often test VPP at higher bandwidth, and as such being able to make
+ethernet topologies by mixing 10, 25, 40, 100G is super useful for me. So, this switch is `fsw0.lab.ipng.ch`
+and dedicated to lab experiments.
+
+Connected to the switch are my trusty `Rhino` and `Hippo` machines. If you remember that game _Hungry Hungry Hippos_,
+that's where the name comes from. They are both Ryzen 5950X on ASUS B550 motherboards, each with 2x1G i350 copper
+NICs (pictured here not connected), and 2x100G i810 QSFP network cards (properly slotted in the motherboard's
+PCIe v4.0 x16 slot).
+
+Finally, three Dell R720XD machines serve as the VPP testbed to be built. They each come with 128GB of RAM, 2x500G
+SSDs, two Intel 82599ES dual 10G NICs (four ports total), and four Broadcom BCM5720 1G NICs. The first 1G port is
+connected to a management switch, and it doubles up as an IPMI speaker, so I can turn on/off the hypervisors
+remotely. All four 10G ports are connected with DACs to `fsw0-lab`, as are two 1G copper ports (the blue UTP
+cables). Everything can be turned on/off remotely, which is useful for noise, heat and overall the environment 🍀.
+
+## IPng Networks LAB - Logical
+
+{{< image width="200px" float="right" src="/assets/lab/logical.svg" alt="Logical" >}}
+
+I have three of these Dell R720XD machines in the lab, and each one of them will run one complete lab environment,
+consisting of four VPP virtual machines, network plumbing, and uplink. That way, I can turn on one hypervisor,
+say `hvn0.lab.ipng.ch`, prepare and boot the VMs, mess around with it, and when I'm done, return the VMs to a
+pristine state, and turn off the hypervisor. And, because I have three of these machines, I can run three separate
+LABs at the same time, or one really big one spanning all the machines. Pictured on the right is a logical sketch
+of one of the LABs (LAB id=0), with a bunch of VPP virtual machines, each with four NICs daisychained together, and
+a few NICs left for experimenting.
+
+### Headend
+
+At the top of the logical environment, I am going to be using one of our production machines (`hvn0.chbtl0.ipng.ch`),
+which will run a permanent LAB _headend_, a Debian VM called `lab.ipng.ch`. This allows me to hermetically
+seal the LAB environments, letting me run them entirely in RFC1918 space, and by forcing the LABs to be connected
+under this machine, I can ensure that no unwanted traffic enters or exits the network [imagine a loadtest at
+100Gbit accidentally leaking, this may or totally may not have once happened to me before ...].
+ +### Disk images + +On this production hypervisor (`hvn0.chbtl0.ipng.ch`), I'll also prepare and maintain a prototype `vpp-proto` disk +image, which will serve as a consistent image to boot the LAB virtual machines. This _main_ image will be replicated +over the network into all three `hvn0 - hvn2` hypervisor machines. This way, I can do periodical maintenance on the +_main_ `vpp-proto` image, snapshot it, publish it as a QCOW2 for downloading (see my [[VPP Linux CP - Virtual Machine +Playground]({% post_url 2021-12-23-vpp-playground %})] post for details on how it's built and what you can do with it +yourself!). The snapshots will then also be sync'd to all hypervisors, and from there I can use simple ZFS filesystem +_cloning_ and _snapshotting_ to maintain the LAB virtual machines. + +### Networking + +Each hypervisor will get an install of [Open vSwitch](https://openvswitch.org/), a production quality, multilayer virtual switch designed to +enable massive network automation through programmatic extension, while still supporting standard management interfaces +and protocols. This takes lots of the guesswork and tinkering out of Linux bridges in KVM/QEMU, and it's a perfect fit +due to its tight integration with `libvirt` (the thing most of us use in Debian/Ubuntu hypervisors). If need be, I can +add one or more of the 1G or 10G ports as well to the OVS fabric, to build more complicated topologies. And, because +the OVS infrastructure and libvirt both allow themselves to be configured over the network, I can control all aspects +of the runtime directly from the `lab.ipng.ch` headend, not having to log in to the hypervisor machines at all. Slick! + +# Implementation Details + +I start with image management. On the production hypervisor, I create a 6GB ZFS dataset that will serve as my `vpp-proto` +machine, and install it using the exact same method as the playground [[ref]({% post_url 2021-12-23-vpp-playground %})]. +Once I have it the way I like it, I'll poweroff the VM, and see to this image being replicated to all hypervisors. + +## ZFS Replication + +Enter [zrepl](https://zrepl.github.io/), a one-stop, integrated solution for ZFS replication. This tool is incredibly +powerful, and can do snapshot management, sourcing / sinking replication, of course using incremental snapshots as they +are native to ZFS. Because this is a LAB article, not a zrepl tutorial, I'll just cut to the chase and show the +configuration I came up with. 
+
+```
+pim@hvn0-chbtl0:~$ cat << EOF | sudo tee /etc/zrepl/zrepl.yml
+global:
+  logging:
+    # use syslog instead of stdout because it makes journald happy
+    - type: syslog
+      format: human
+      level: warn
+
+jobs:
+- name: snap-vpp-proto
+  type: snap
+  filesystems:
+    'ssd-vol0/vpp-proto-disk0<': true
+  snapshotting:
+    type: manual
+  pruning:
+    keep:
+      - type: last_n
+        count: 10
+
+- name: source-vpp-proto
+  type: source
+  serve:
+    type: stdinserver
+    client_identities:
+      - "hvn0-lab"
+      - "hvn1-lab"
+      - "hvn2-lab"
+  filesystems:
+    'ssd-vol0/vpp-proto-disk0<': true   # all filesystems
+  snapshotting:
+    type: manual
+EOF
+
+pim@hvn0-chbtl0:~$ cat << EOF | sudo tee -a /root/.ssh/authorized_keys
+# ZFS Replication Clients for IPng Networks LAB
+command="zrepl stdinserver hvn0-lab",restrict ecdsa-sha2-nistp256 root@hvn0.lab.ipng.ch
+command="zrepl stdinserver hvn1-lab",restrict ecdsa-sha2-nistp256 root@hvn1.lab.ipng.ch
+command="zrepl stdinserver hvn2-lab",restrict ecdsa-sha2-nistp256 root@hvn2.lab.ipng.ch
+EOF
+```
+
+To unpack this, there are two jobs configured in **zrepl**:
+
+* `snap-vpp-proto` - the purpose of this job is to track snapshots as they are created. Normally, zrepl is configured
+  to automatically make snapshots every hour and copy them out, but in my case, I only want to take snapshots when I have
+  changed and released the `vpp-proto` image, not periodically. So, I set the snapshotting to manual, and let the system
+  keep the last ten images.
+* `source-vpp-proto` - this is a source job that uses a _lazy_ (albeit fine in this lab environment) method to serve the
+  snapshots to clients. The SSH keys above are added to the _authorized_keys_ file, but restricted to executing
+  only the `zrepl stdinserver` command, and nothing else (ie. these keys cannot log in to the machine). If any given server
+  presents one of these keys, I can map it to a **zrepl client** (for example, `hvn0-lab` for the SSH key presented by
+  hostname `hvn0.lab.ipng.ch`). The source job now knows to serve the listed filesystems (and their dataset children, noted
+  by the `<` suffix) to those clients.
+
+For the client side, each of the hypervisors gets only one job, called a _pull_ job, which will periodically wake up (every
+minute) and ensure that any pending snapshots and their incrementals from the remote _source_ are slurped in and replicated
+to a _root_fs_ dataset; in this case I called it `ssd-vol0/hvn0.chbtl0.ipng.ch` so I can track where the datasets come from.
+ +``` +pim@hvn0-lab:~$ sudo ssh-keygen -t ecdsa -f /etc/zrepl/ssh/identity -C "root@$(hostname -f)" +pim@hvn0-lab:~$ cat << EOF | sudo tee /etc/zrepl/zrepl.yml +global: + logging: + # use syslog instead of stdout because it makes journald happy + - type: syslog + format: human + level: warn + +jobs: +- name: vpp-proto + type: pull + connect: + type: ssh+stdinserver + host: hvn0.chbtl0.ipng.ch + user: root + port: 22 + identity_file: /etc/zrepl/ssh/identity + root_fs: ssd-vol0/hvn0.chbtl0.ipng.ch + interval: 1m + pruning: + keep_sender: + - type: regex + regex: '.*' + keep_receiver: + - type: last_n + count: 10 + recv: + placeholder: + encryption: off +``` + +After restarting zrepl for each of the machines (the _source_ machine and the three _pull_ machines), I can now do the +following cool hat trick: + +``` +pim@hvn0-chbtl0:~$ virsh start --console vpp-proto +## Do whatever maintenance, and then poweroff the VM +pim@hvn0-chbtl0:~$ sudo zfs snapshot ssd-vol0/vpp-proto-disk0@20221019-release +pim@hvn0-chbtl0:~$ sudo zrepl signal wakeup source-vpp-proto +``` + +This signals the zrepl daemon to re-read the snapshots, which will pick up the newest one, and then without me doing +much of anything else: + +``` +pim@hvn0-lab:~$ sudo zfs list -t all | grep vpp-proto +ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0 6.60G 367G 6.04G - +ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221013-release 499M - 6.04G - +ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221018-release 24.1M - 6.04G - +ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221019-release 0B - 6.04G - +``` + +That last image was just pushed automatically to all hypervisors! If they're turned off, no worries, as soon as they +start up, their local **zrepl** will make its next minutely poll, and pull in all snapshots, bringing the machine up +to date. So even when the hypervisors are normally turned off, this is zero-touch and maintenance free. + +## VM image maintenance + +Now that I have a stable image to work off of, all I have to do is `zfs clone` this image into new per-VM datasets, +after which I can mess around on the VMs all I want, and when I'm done, I can `zfs destroy` the clone and bring it +back to normal. However, I clearly don't want one and the same clone for each of the VMs, as they do have lots of +config files that are specific to that one _instance_. For example, the mgmt IPv4/IPv6 addresses are unique, and +the VPP and Bird/FRR configs are unique as well. But how unique are they, really? + +Enter Jinja (known mostly from Ansible). I decide to make some form of per-VM config files that are generated based +on some templates. That way, I can clone the base ZFS dataset, copy in the deltas, and boot that instead. And to +be extra efficient, I can also make a per-VM `zfs snapshot` of the cloned+updated filesystem, before tinkering with +the VMs, which I'll call a `pristine` snapshot. Still with me? + +1. First, clone the base dataset into a per-VM dataset, say `ssd-vol0/vpp0-0` +1. Then, generate a bunch of override files, copying them into the per-VM dataset `ssd-vol0/vpp0-0` +1. Finally, create a snapshot of that, called `ssd-vol0/vpp0-0@pristine` and boot off of that. + +Now, returning the VM to a pristine state is simply a matter of shutting down the VM, performing a `zfs rollback` +to the `pristine` snapshot, and starting the VM again. Ready? Let's go! 
+
+### Generator
+
+So off I go, writing a small Python generator that uses Jinja to read a bunch of YAML files, merging them along
+the way, and then traversing a set of directories with template files and per-VM overrides, to assemble a build
+output directory with a fully formed set of files that I can copy into the per-VM dataset.
+
+Take a look at this as a minimally viable configuration:
+
+```
+pim@lab:~/src/lab$ cat config/common/generic.yaml
+overlays:
+  default:
+    path: overlays/bird/
+    build: build/default/
+lab:
+  mgmt:
+    ipv4: 192.168.1.80/24
+    ipv6: 2001:678:d78:101::80/64
+    gw4: 192.168.1.252
+    gw6: 2001:678:d78:101::1
+  nameserver:
+    search: [ "lab.ipng.ch", "ipng.ch", "rfc1918.ipng.nl", "ipng.nl" ]
+  nodes: 4
+
+pim@lab:~/src/lab$ cat config/hvn0.lab.ipng.ch.yaml
+lab:
+  id: 0
+  ipv4: 192.168.10.0/24
+  ipv6: 2001:678:d78:200::/60
+  nameserver:
+    addresses: [ 192.168.10.4, 2001:678:d78:201::ffff ]
+  hypervisor: hvn0.lab.ipng.ch
+```
+
+Here I define a common config file with fields and attributes which will apply to all LAB environments, things
+such as the mgmt network, nameserver search paths, and how many VPP virtual machine nodes I want to build. Then,
+for `hvn0.lab.ipng.ch`, I specify an IPv4 and IPv6 prefix assigned to it, and some specific nameserver endpoints
+that will point at an `unbound` running on `lab.ipng.ch` itself.
+
+I can now create any file I'd like, which may use variable substitution and other Jinja2-style templating. Take
+for example these two files:
+
+{% raw %}
+```
+pim@lab:~/src/lab$ cat overlays/bird/common/etc/netplan/01-netcfg.yaml.j2
+network:
+  version: 2
+  renderer: networkd
+  ethernets:
+    enp1s0:
+      optional: true
+      accept-ra: false
+      dhcp4: false
+      addresses: [ {{node.mgmt.ipv4}}, {{node.mgmt.ipv6}} ]
+      gateway4: {{lab.mgmt.gw4}}
+      gateway6: {{lab.mgmt.gw6}}
+
+pim@lab:~/src/lab$ cat overlays/bird/common/etc/netns/dataplane/resolv.conf.j2
+domain lab.ipng.ch
+search{% for domain in lab.nameserver.search %} {{domain}}{%endfor %}
+
+{% for resolver in lab.nameserver.addresses %}
+nameserver {{resolver}}
+{%endfor%}
+```
+{% endraw %}
+
+The first file is a [[NetPlan.io](https://netplan.io/)] configuration that substitutes the correct management
+IPv4 and IPv6 addresses and gateways. The second one enumerates a set of search domains and nameservers, so that
+each LAB can have its own unique resolvers. I point these at the `lab.ipng.ch` uplink interface; in the case
+of the LAB `hvn0.lab.ipng.ch`, this will be 192.168.10.4 and 2001:678:d78:201::ffff, but on `hvn1.lab.ipng.ch`
+I can override that to become 192.168.11.4 and 2001:678:d78:211::ffff.
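+
+Putting those pieces together, the generator itself does not need to be much more than a YAML merge plus a
+Jinja2 render pass over the overlay tree. The sketch below is **not** the actual IPng generator (which isn't
+published here); it's a minimal illustration that assumes the file layout shown above, uses PyYAML and Jinja2,
+performs a naive recursive merge, and hard-codes one node's management addresses where the real tool would
+derive them from `lab.id` and the node number:
+
+```
+#!/usr/bin/env python3
+# Minimal sketch of the generator idea: merge YAML configs, render Jinja2 templates.
+import os
+import yaml
+from jinja2 import Environment, FileSystemLoader
+
+def deep_merge(base, override):
+    """Recursively merge 'override' into 'base' (override wins)."""
+    for key, value in override.items():
+        if isinstance(value, dict) and isinstance(base.get(key), dict):
+            deep_merge(base[key], value)
+        else:
+            base[key] = value
+    return base
+
+# Merge the common config with the per-hypervisor one.
+config = {}
+for fn in ("config/common/generic.yaml", "config/hvn0.lab.ipng.ch.yaml"):
+    with open(fn) as f:
+        deep_merge(config, yaml.safe_load(f))
+
+# Hypothetical per-node values; the real generator derives these per VM.
+node = {"mgmt": {"ipv4": "192.168.1.80/24", "ipv6": "2001:678:d78:101::80/64"}}
+
+src = "overlays/bird/common"
+dst = "build/default/hvn0.lab.ipng.ch/vpp0-0"
+env = Environment(loader=FileSystemLoader(src))
+
+# Render every *.j2 template in the overlay tree into the build directory.
+for root, _, files in os.walk(src):
+    for name in files:
+        if not name.endswith(".j2"):
+            continue
+        relpath = os.path.relpath(os.path.join(root, name), src)
+        rendered = env.get_template(relpath).render(lab=config["lab"], node=node)
+        outfile = os.path.join(dst, relpath[:-3])
+        os.makedirs(os.path.dirname(outfile), exist_ok=True)
+        with open(outfile, "w") as f:
+            f.write(rendered)
+```
+
+The real `./generate` additionally walks the per-`hostname` override tree and takes a `--host` selector, but
+the merge-then-render flow above is the essence.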
+
+There's one subdirectory for each _overlay_ type (imagine that I want a lab that runs Bird2, but I may also
+want one which runs FRR, or another thing still). Within the _overlay_ directory, there's one _common_
+tree, with files that apply to every machine in the LAB, and a _hostname_ tree, with files that apply
+only to specific nodes (VMs) in the LAB:
+
+```
+pim@lab:~/src/lab$ tree overlays/default/
+overlays/default/
+├── common
+│   ├── etc
+│   │   ├── bird
+│   │   │   ├── bfd.conf.j2
+│   │   │   ├── bird.conf.j2
+│   │   │   ├── ibgp.conf.j2
+│   │   │   ├── ospf.conf.j2
+│   │   │   └── static.conf.j2
+│   │   ├── hostname.j2
+│   │   ├── hosts.j2
+│   │   ├── netns
+│   │   │   └── dataplane
+│   │   │       └── resolv.conf.j2
+│   │   ├── netplan
+│   │   │   └── 01-netcfg.yaml.j2
+│   │   ├── resolv.conf.j2
+│   │   └── vpp
+│   │       ├── bootstrap.vpp.j2
+│   │       └── config
+│   │           ├── defaults.vpp
+│   │           ├── flowprobe.vpp.j2
+│   │           ├── interface.vpp.j2
+│   │           ├── lcp.vpp
+│   │           ├── loopback.vpp.j2
+│   │           └── manual.vpp.j2
+│   ├── home
+│   │   └── ipng
+│   └── root
+└── hostname
+    ├── vpp0-0
+    │   └── etc
+    │       └── vpp
+    │           └── config
+    │               └── interface.vpp
+    └── (etc)
+```
+
+Now all that's left to do is generate this hierarchy, and of course I can check this in to git and track changes to the
+templates and their resulting generated filesystem overrides over time:
+
+```
+pim@lab:~/src/lab$ ./generate -q --host hvn0.lab.ipng.ch
+pim@lab:~/src/lab$ find build/default/hvn0.lab.ipng.ch/vpp0-0/ -type f
+build/default/hvn0.lab.ipng.ch/vpp0-0/home/ipng/.ssh/authorized_keys
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/hosts
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/resolv.conf
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/static.conf
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/bfd.conf
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/bird.conf
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/ibgp.conf
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/ospf.conf
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/loopback.vpp
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/flowprobe.vpp
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/interface.vpp
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/defaults.vpp
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/lcp.vpp
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/manual.vpp
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/bootstrap.vpp
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/netplan/01-netcfg.yaml
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/netns/dataplane/resolv.conf
+build/default/hvn0.lab.ipng.ch/vpp0-0/etc/hostname
+build/default/hvn0.lab.ipng.ch/vpp0-0/root/.ssh/authorized_keys
+```
+
+## Open vSwitch maintenance
+
+The OVS install on each Debian hypervisor in the lab is the same. I install the required Debian packages, create
+a switch fabric, add one physical network port (the one that will serve as the _uplink_ (VLAN 10 in the sketch above)
+for the LAB), and all the virtio ports from KVM.
+
+```
+pim@hvn0-lab:~$ sudo vi /etc/netplan/01-netcfg.yaml
+network:
+  vlans:
+    uplink:
+      optional: true
+      accept-ra: false
+      dhcp4: false
+      link: eno1
+      id: 200
+pim@hvn0-lab:~$ sudo netplan apply
+pim@hvn0-lab:~$ sudo apt install openvswitch-switch python3-openvswitch
+pim@hvn0-lab:~$ sudo ovs-vsctl add-br vpplan
+pim@hvn0-lab:~$ sudo ovs-vsctl add-port vpplan uplink tag=10
+```
+
+The `vpplan` switch fabric and its uplink port will persist across reboots. Then I add a small change to the
+`libvirt`-defined virtual machines:
+
+```
+pim@hvn0-lab:~$ virsh edit vpp0-0
+...
+    <interface type='bridge'>
+      ...
+      <source bridge='vpplan'/>
+      <virtualport type='openvswitch'/>
+      <model type='virtio'/>
+    </interface>
+    <interface type='bridge'>
+      ...
+      <source bridge='vpplan'/>
+      <virtualport type='openvswitch'/>
+      <model type='virtio'/>
+    </interface>
+
+... etc
+```
+
+The only two things I need to do are to ensure that the _source bridge_ is called the same as
+the OVS fabric, in my case `vpplan`, and that the _virtualport_ type is `openvswitch`, and that's it!
+Once all four `vpp0-*` virtual machines have all four of their network cards updated, the hypervisor
+will add each of them as new untagged ports in the OVS fabric when they boot.
+
+To then build the topology that I have in mind for the LAB, where each VPP machine is daisychained to
+its sibling, all we have to do is program that into the OVS configuration:
+
+```
+pim@hvn0-lab:~$ cat << EOF > ovs-config.sh
+#!/bin/sh
+#
+# OVS configuration for the `default` overlay
+
+LAB=${LAB:=0}
+for node in 0 1 2 3; do
+  for int in 0 1 2 3; do
+    ovs-vsctl set port vpp${LAB}-${node}-${int} vlan_mode=native-untagged
+  done
+done
+
+# Uplink is VLAN 10
+ovs-vsctl add port vpp${LAB}-0-0 tag 10
+ovs-vsctl add port uplink tag 10
+
+# Link vpp${LAB}-0 <-> vpp${LAB}-1 in VLAN 20
+ovs-vsctl add port vpp${LAB}-0-1 tag 20
+ovs-vsctl add port vpp${LAB}-1-0 tag 20
+
+# Link vpp${LAB}-1 <-> vpp${LAB}-2 in VLAN 21
+ovs-vsctl add port vpp${LAB}-1-1 tag 21
+ovs-vsctl add port vpp${LAB}-2-0 tag 21
+
+# Link vpp${LAB}-2 <-> vpp${LAB}-3 in VLAN 22
+ovs-vsctl add port vpp${LAB}-2-1 tag 22
+ovs-vsctl add port vpp${LAB}-3-0 tag 22
+EOF
+
+pim@hvn0-lab:~$ chmod 755 ovs-config.sh
+pim@hvn0-lab:~$ sudo ./ovs-config.sh
+```
+
+The first block here loops over all nodes and, for each of their ports, sets the VLAN mode to what
+OVS calls 'native-untagged'. In this mode, the `tag` becomes the VLAN in which the port will operate,
+but to also add additional dot1q-tagged VLANs, we can use the syntax `add port ... trunks 10,20,30`.
+
+To see the configuration, `ovs-vsctl list port vpp0-0-0` will show the switch port configuration, while
+`ovs-vsctl list interface vpp0-0-0` will show the virtual machine's NIC configuration (think of the
+difference here as the switch port on the one hand, and the NIC (interface) plugged into it on the other).
+
+### Deployment
+
+There are three main points to consider when deploying these lab VMs:
+
+1. Create the VMs and their ZFS datasets
+1. Destroy the VMs and their ZFS datasets
+1. Bring the VMs into a pristine state
+
+#### Create
+
+If the hypervisor doesn't yet have a LAB running, we need to create it:
+
+```
+BASE=${BASE:=ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221019-release}
+BUILD=${BUILD:=default}
+LAB=${LAB:=0}
+
+## Do not touch below this line
+LABDIR=/var/lab
+STAGING=$LABDIR/staging
+HVN="hvn${LAB}.lab.ipng.ch"
+
+echo "* Cloning base"
+ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
+  mkdir -p $STAGING/\$VM; zfs clone $BASE ssd-vol0/\$VM; done"
+sleep 1
+
+echo "* Mounting in staging"
+ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
+  mount /dev/zvol/ssd-vol0/\$VM-part1 $STAGING/\$VM; done"
+
+echo "* Rsyncing build"
+rsync -avugP build/$BUILD/$HVN/ root@hvn${LAB}.lab.ipng.ch:$STAGING
+
+echo "* Setting permissions"
+ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
+  chown -R root.
$STAGING/\$VM/root; done" + +echo "* Unmounting and snapshotting pristine state" +ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \ + umount $STAGING/\$VM; zfs snapshot ssd-vol0/\${VM}@pristine; done" + +echo "* Starting VMs" +ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \ + virsh start \$VM; done" + +echo "* Committing OVS config" +scp overlays/$BUILD/ovs-config.sh root@$HVN:$LABDIR +ssh root@$HVN "set -x; LAB=$LAB $LABDIR/ovs-config.sh" +``` + +After running this, the hypervisor will have 4 clones, and 4 snapshots (one for each virtual machine): + +``` +root@hvn0-lab:~# zfs list -t all +NAME USED AVAIL REFER MOUNTPOINT +ssd-vol0 6.80G 367G 24K /ssd-vol0 +ssd-vol0/hvn0.chbtl0.ipng.ch 6.60G 367G 24K none +ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0 6.60G 367G 24K none +ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0 6.60G 367G 6.04G - +ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221013-release 499M - 6.04G - +ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221018-release 24.1M - 6.04G - +ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221019-release 0B - 6.04G - +ssd-vol0/vpp0-0 43.6M 367G 6.04G - +ssd-vol0/vpp0-0@pristine 1.13M - 6.04G - +ssd-vol0/vpp0-1 25.0M 367G 6.04G - +ssd-vol0/vpp0-1@pristine 1.14M - 6.04G - +ssd-vol0/vpp0-2 42.2M 367G 6.04G - +ssd-vol0/vpp0-2@pristine 1.13M - 6.04G - +ssd-vol0/vpp0-3 79.1M 367G 6.04G - +ssd-vol0/vpp0-3@pristine 1.13M - 6.04G - +``` + +The last thing the create script does is commit the OVS configuration, because when the VMs are shutdown +or newly created, KVM will add them to the switching fabric as untagged/unconfigured ports. + +But would you look at that! The delta between the base image and the `pristine` snapshots is about 1MB of +configuration files, the ones that I generated and rsync'd in above, and then once the machine boots, it +will have a read/write mounted filesystem as per normal, except it's a delta on top of the snapshotted, +cloned dataset. + +#### Destroy + +I love destroying things! But in this case, I'm removing what are essentially ephemeral disk images, as +I still have the base image to clone from. But, the destroy is conceptually very simple: + +``` +BASE=${BASE:=ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221018-release} +LAB=${LAB:=0} + +## Do not touch below this line +HVN="hvn${LAB}.lab.ipng.ch" + +echo "* Destroying VMs" +ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \ + virsh destroy \$VM; done" + +echo "* Destroying ZFS datasets" +ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \ + zfs destroy -r ssd-vol0/\$VM; done" +``` + +After running this, the VMs will be shutdown and their cloned filesystems (including any snapshots +those may have) are wiped. To get back into a working state, all I must do is run `./create` again! + +#### Pristine + +Sometimes though, I don't need to completely destroy the VMs, but rather I want to put them back into +the state they where just after creating the LAB. 
Luckily, the create made a snapshot (called `pristine`)
+for each VM before booting it, so bringing the LAB back to _factory default_ settings is really easy:
+
+```
+BUILD=${BUILD:=default}
+LAB=${LAB:=0}
+
+## Do not touch below this line
+LABDIR=/var/lab
+STAGING=$LABDIR/staging
+HVN="hvn${LAB}.lab.ipng.ch"
+
+## Bring back into pristine state
+echo "* Restarting VMs from pristine snapshot"
+ssh root@$HVN "set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\${node}; \
+  virsh destroy \$VM;
+  zfs rollback ssd-vol0/\${VM}@pristine;
+  virsh start \$VM; done"
+
+echo "* Committing OVS config"
+scp overlays/$BUILD/ovs-config.sh root@$HVN:$LABDIR
+ssh root@$HVN "set -x; $LABDIR/ovs-config.sh"
+```
+
+## Results
+
+After completing this project, I have a completely hands-off, automated and autogenerated, and very manageable set
+of three LABs, each booting up in a running OSPF/OSPFv3 enabled topology for IPv4 and IPv6:
+
+```
+pim@lab:~/src/lab$ traceroute -q1 vpp0-3
+traceroute to vpp0-3 (192.168.10.3), 30 hops max, 60 byte packets
+ 1  e0.vpp0-0.lab.ipng.ch (192.168.10.5)  1.752 ms
+ 2  e0.vpp0-1.lab.ipng.ch (192.168.10.7)  4.064 ms
+ 3  e0.vpp0-2.lab.ipng.ch (192.168.10.9)  5.178 ms
+ 4  vpp0-3.lab.ipng.ch (192.168.10.3)  7.469 ms
+pim@lab:~/src/lab$ ssh ipng@vpp0-3
+
+ipng@vpp0-3:~$ traceroute6 -q1 vpp2-3
+traceroute to vpp2-3 (2001:678:d78:220::3), 30 hops max, 80 byte packets
+ 1  e1.vpp0-2.lab.ipng.ch (2001:678:d78:201::3:2)  2.088 ms
+ 2  e1.vpp0-1.lab.ipng.ch (2001:678:d78:201::2:1)  6.958 ms
+ 3  e1.vpp0-0.lab.ipng.ch (2001:678:d78:201::1:0)  8.841 ms
+ 4  lab0.lab.ipng.ch (2001:678:d78:201::ffff)  7.381 ms
+ 5  e0.vpp2-0.lab.ipng.ch (2001:678:d78:221::fffe)  8.304 ms
+ 6  e0.vpp2-1.lab.ipng.ch (2001:678:d78:221::1:21)  11.633 ms
+ 7  e0.vpp2-2.lab.ipng.ch (2001:678:d78:221::2:22)  13.704 ms
+ 8  vpp2-3.lab.ipng.ch (2001:678:d78:220::3)  15.597 ms
+```
+
+If you read this far, thanks! Each of these three LABs comes with 4x10Gbit DPDK-based packet generators (Cisco T-Rex),
+four VPP machines running either Bird2 or FRR, and together they are connected to a 100G capable switch.
+
+**These LABs are for rent, and we offer hands-on training on them.** Please **[contact](/s/contact/)** us for
+daily/weekly rates, and custom training sessions.
+
+I checked the generator and deploy scripts in to a git repository, which I'm happy to share if there's
+an interest. But because it contains a few implementation details and doesn't do a lot of fool-proofing, and
+because most of this can be easily recreated by interested parties from this blogpost, I decided not to publish
+the LAB project on GitHub, but to host it on our private git.ipng.ch server instead. Mail us if you'd like to take
+a closer look; I'm happy to share the code.
diff --git a/content/articles/2022-11-20-mastodon-1.md b/content/articles/2022-11-20-mastodon-1.md
new file mode 100644
index 0000000..1aa4597
--- /dev/null
+++ b/content/articles/2022-11-20-mastodon-1.md
@@ -0,0 +1,251 @@
+---
+date: "2022-11-20T22:35:14Z"
+title: Mastodon - Part 1 - Installing
+---
+
+# About this series
+
+{{< image width="200px" float="right" src="/assets/mastodon/mastodon-logo.svg" alt="Mastodon" >}}
+
+I have seen companies achieve great successes in the consumer internet and entertainment industry. I've been feeling less
+enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using "free" services
+is convenient, but these companies are sometimes taking away my autonomy and exerting control over society.
To each their own of course, but +for me it's time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to +privately operated ones. + +This series details my findings starting a micro blogging website, which uses a new set of super interesting open interconnect protocols to +share media (text, pictures, videos, etc) between producers and their followers, using an open source project called +[Mastodon](https://joinmastodon.org/). + +## Introduction + +Similar to how blogging is the act of publishing updates to a website, microblogging is the act of publishing small updates to a stream of +updates on your profile. You can publish text posts and optionally attach media such as pictures, audio, video, or polls. Mastodon lets you +follow friends and discover new ones. It doesn't do this in a centralized way, however. + +Groups of people congregate on a given server, of which they become a user by creating an account on that server. Then, they interact with +one another on that server, but users can also interact with folks on _other_ servers. Instead of following **@IPngNetworks**, they might +follow a user on a given server domain, like **@IPngNetworks@ublog.tech**. This way, all these servers can be run _independently_ but +interact with each other using a common protocol (called ActivityPub). I've heard this concept be compared to choosing an e-mail provider: I +might choose Google's gmail.com, and you might use Microsoft's live.com. However we can send e-mails back and forth due to this common +protocol (called SMTP). + +### uBlog.tech + +I thought I would give it a go, mostly out of engineering curiosity but also because I more strongly feel today that we (the users) ought to +take a bit more ownership back. I've been a regular blogging and micro-blogging user since approximately for ever, and I think it may be a +good investment of my time to learn a bit more about the architecture of Mastodon. So, I've decided to build and _productionize_ a server +instance. + +I registered [uBlog.tech](https://ublog.tech). Incidentally, if you're reading this and would like to participate, the server welcomes users +in the network-, systems- and software engineering disciplines. But, before I can get to the fun parts though, I have to do a bunch of work +to get this server in a shape in which it can be trusted with user generated content. + +### Hardware + +I'm running Debian on (a set of) Dell R720s hosted by IPng Networks in Zurich, Switzerland. These machines are all roughly the same, and +come with: + +* 2x10C/10T Intel E5-2680 (so 40 CPUs) +* 256GB ECC RAM +* 2x240G SSD in mdraid to boot from +* 3x1T SSD in ZFS for fast storage +* 6x16T harddisk with 2x500G SSD for L2ARC, in ZFS for bulk storage + +Data integrity and durability is important to me. It's the one thing that typically the commercial vendors do really well, and my pride +prohibits me from losing data due to things like "disk failure" or "computer broken" or "datacenter on fire". So, I handle backups in two +main ways: borg(1) and zrepl(1). + +* **Hypervisor hosts** make a daily copy of their entire filesystem using **borgbackup(1)** to a set of two remote fileservers. This way, the + important file metadata, configs for the virtual machines, and so on, are all safely stored remotely. +* **Virtual machines** are running on ZFS blockdevices on either the SSD pool, or the disk pool, or both. 
Using a tool called **zrepl(1)**
+  (which I described a little bit in a [[previous post]({% post_url 2022-10-14-lab-1 %})]), I create a snapshot every 12hrs on the local
+  blockdevice, and incrementally copy away those snapshots daily to the remote fileservers.
+
+If I do something silly on a given virtual machine, I can roll back the machine filesystem state to the previous checkpoint and reboot. This has
+saved my butt a number of times, during say a PHP 7 to 8 upgrade for Librenms, or during an OpenBSD upgrade that ran out of disk midway
+through. Being able to roll back to a last known good state is awesome, and completely transparent for the virtual machine, as the
+snapshotting is done on the underlying storage pool in the hypervisor. The fileservers run physically separated from the server pools, one in
+Zurich and another in Geneva, so this way, if I were to lose the entire machine, I still have a ~12h old backup in two locations.
+
+### Software
+
+I provision a VM with 8vCPUs (dedicated on the underlying hypervisor), including 16GB of memory and two virtio network cards. One NIC will
+connect to a backend LAN in some RFC1918 address space, and the other will present an IPv4 and IPv6 interface to the internet. I give this
+machine two blockdevices, one small one of 16GB (vda) that is created on the hypervisor's `ssd-vol0/libvirt/ublog-disk0`, to be used only
+for boot, logs and OS. Then, a second one (vdb) is created at 300GB on `ssd-vol1/libvirt/ublog-disk1` and it will be used for Mastodon and
+its supporting services.
+
+Then I simply install Debian into **vda** using `virt-install`. At IPng Networks we have some ansible-style automation that takes over the
+machine, and further installs all sorts of Debian packages that we use (like a Prometheus node exporter, more on that later), and sets up a
+firewall that allows SSH access for our trusted networks, and otherwise only allows port 80 and 443 because this is to be a webserver.
+
+After installing Debian Bullseye, I'll create the following ZFS filesystems on **vdb**:
+
+```
+pim@ublog:~$ sudo zfs create -o mountpoint=/home/mastodon data/mastodon -V10G
+pim@ublog:~$ sudo zfs create -o mountpoint=/var/lib/elasticsearch data/elasticsearch -V10G
+pim@ublog:~$ sudo zfs create -o mountpoint=/var/lib/postgresql data/postgresql -V20G
+pim@ublog:~$ sudo zfs create -o mountpoint=/var/lib/redis data/redis -V2G
+pim@ublog:~$ sudo zfs create -o mountpoint=/home/mastodon/live/public/system data/mastodon-system
+```
+
+As a sidenote, I realize that this ZFS filesystem pool consists only of **vdb**, but its underlying blockdevice is protected in a raidz, and
+it is copied incrementally daily off-site by the hypervisor. I'm pretty confident on safety here, but I prefer to use ZFS for the virtual
+machine guests as well, because now I can do local snapshotting, of say `data/mastodon-system`, and I can more easily grow/shrink the
+datasets for the supporting services, as well as monitor them individually for wildgrowth.
+
+#### Installing Mastodon
+
+I then go through the public [Mastodon docs](https://docs.joinmastodon.org/admin/install/) to further install the machine. I choose not to
+go the Docker route, but instead stick to systemd installs. The install itself is pretty straightforward, but I did find the nginx config
+a bit rough around the edges (notably because the default files I'm asked to use have their ssl certificate stanzas commented out, while
+trying to listen on port 443, and this makes nginx and certbot very confused).
A cup of tea later, and we're all good.
+
+I am not going to start prematurely optimizing, and after a very engaging thread on Mastodon itself
+[[@davidlars@hachyderm.io](https://ublog.tech/@davidlars@hachyderm.io/109381163342345835)] with a few fellow admins, the consensus really is
+to _KISS_ (keep it simple, silly!). In that thread, I made a few general observations on scaling up and out (none of which I'll be doing
+initially), just by using some previous experience as a systems engineer, and knowing a bit about the components used here:
+
+* Running services on dedicated machines (ie. separate storage, postgres, Redis, Puma and Sidekiq workers)
+* Fiddle with Puma worker pool (more workers, and/or more threads per worker)
+* Fiddle with Sidekiq worker pool and dedicated instances per queue
+* Put storage on local minio cluster
+* Run multiple postgres databases, read-only replicas, or multimaster
+* Run cluster of multiple redis instances instead of one
+* Split off the cache redis into mem-only
+* Frontend the service with a cluster of NGINX + object caching
+
+Some other points of interest for those of us on the adventure of running our own machines follow:
+
+#### Logging
+
+Mastodon is a chatty one - it is logging to stdout/stderr and most of its tasks in Sidekiq have a lot to say. On Debian, by default this
+output goes from **systemd** into **journald** which in turn copies it into **syslogd**. The result of this is that each logline hits the
+disk three (!) times. And also by default, Debian and Ubuntu aren't too great at log hygiene. While `/var/log/` is scrubbed by logrotate(8),
+nothing keeps the journal from growing unboundedly. So I quickly make the following change:
+
+```
+pim@ublog:~$ cat << EOF | sudo tee /etc/systemd/journald.conf
+[Journal]
+SystemMaxUse=500M
+ForwardToSyslog=no
+EOF
+pim@ublog:~$ sudo systemctl restart systemd-journald
+```
+
+#### Paperclip and ImageMagick
+
+I noticed while tailing the journal (`journalctl -f`) that lots of incoming media gets first spooled to /tmp and then run through a conversion
+step to ensure the media is of the right format/aspect ratio. Mastodon calls a library called `paperclip` which in turn uses file(1) and
+identify(1) to determine the type of file, and based on the answer for images runs convert(1) or ffmpeg(1) to munge it into the shape it
+wants. I suspect that this will cause a fair bit of I/O in `/tmp`, so something to keep in mind is to either lazily turn that mountpoint
+into a `tmpfs` (which is in general frowned upon), or to change the paperclip library to use a user-defined filesystem like `~mastodon/tmp`
+and make _that_ a memory backed filesystem instead. The log signature in case you're curious:
+
+```
+Nov 20 21:02:10 ublog bundle[408189]: Command :: file -b --mime '/tmp/a22ab94adb939b0eb3c224bb9046c9cf20221123-408189-s0rsty.jpg'
+Nov 20 21:02:10 ublog bundle[408189]: Command :: identify -format %m '/tmp/6205b887c6c337b1a72ae2a7ccb359c920221123-408189-e9jul1.jpg[0]'
+Nov 20 21:02:10 ublog bundle[408189]: Command :: convert '/tmp/6205b887c6c337b1a72ae2a7ccb359c920221123-408189-e9jul1.jpg[0]' -auto-orient -resize "400x400>" -coalesce '/tmp/8ce2976b99d4b5e861e6c988459ee20c20221123-408189-1p5gg4'
+Nov 20 21:02:10 ublog bundle[408189]: Command :: convert '/tmp/8ce2976b99d4b5e861e6c988459ee20c20221123-408189-1p5gg4' -depth 8 RGB:-
+```
+
+I will put a pin in this until it becomes a bottleneck, but larger server admins may have thought about this before, and if so, let me know
+what you came up with!
#### Elasticsearch

There's a little bit of a timebomb here, unfortunately. Following the [[Full-text
search](https://docs.joinmastodon.org/admin/optional/elasticsearch/)] docs, the install and integration is super easy. But in an upcoming
release, Elasticsearch is going to _force_ authentication by default; even though the current version is still tolerant of
non-secured instances, those will break in the future. So I'm going to get ahead of that and create my instance with the minimally required
security setup in mind [[ref](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html)]:

```
pim@ublog:~$ cat << EOF | sudo tee -a /etc/elasticsearch/elasticsearch.yml
xpack.security.enabled: true
discovery.type: single-node
EOF
pim@ublog:~$ PASS=$(openssl rand -base64 12)
pim@ublog:~$ /usr/share/elasticsearch/bin/elasticsearch-setup-passwords interactive
(use this $PASS for the 'elastic' user)
pim@ublog:~$ cat << EOF | sudo tee -a ~mastodon/live/.env.production
ES_USER=elastic
ES_PASS=$PASS
EOF
pim@ublog:~$ sudo systemctl restart mastodon-streaming mastodon-web mastodon-sidekiq
```

Elasticsearch is a memory hog, which is not that strange considering its job is to supply full text retrieval over a large
amount of documents and data at high performance. It'll by default grab roughly half of the machine's memory, which it
really doesn't need for now. So, I'll give it a little bit of a smaller playground to expand into, by limiting its heap
to 2 GB to get us started:

```
pim@ublog:~$ cat << EOF | sudo tee /etc/elasticsearch/jvm.options.d/memory.options
-Xms2048M
-Xmx2048M
EOF
pim@ublog:~$ sudo systemctl restart elasticsearch
```

#### Mail

E-mail can be quite tricky to get right. At IPng we've been running mailservers for a while now, and we're reasonably good at delivering
mail even to the most hard-line providers (looking at you, GMX and Google). We use relays from a previous project of mine called
[[PaPHosting](https://paphosting.net)], which you can clearly see comes from the Dark Ages when the Internet was still easy. These days, our
mailservers run a combination of MTA-STS, TLS certs from Let's Encrypt, DMARC, and SPF. So our outbound mail is simply using OpenBSD's
smtpd(8), and it forwards to the remote relay pool of five servers using authentication, but only after rewriting the envelope to always
come from `@ublog.tech` and match the e-mail sender (which allows for strict SPF):

```
pim@ublog:~$ cat /etc/smtpd.conf
table aliases file:/etc/aliases
table secrets file:/etc/mail/secrets

listen on localhost

action "local_mail" mbox alias <aliases>
action "outbound" relay host "smtp+tls://papmx@smtp.paphosting.net" auth <secrets> \
    mail-from "@ublog.tech"

match from local for local action "local_mail"
match from local for any action "outbound"
```

Inbound mail to the `@ublog.tech` domain is also handled by the paphosting servers, which forward it all to our respective inboxes.

#### Server Settings

After reading a post from [[@rriemann@chaos.social](https://ublog.tech/@rriemann@chaos.social/109384055799108617)], I was quickly convinced
that having a good privacy policy is worth the time. I took their excellent advice to create a reasonable [[Privacy
Policy](https://ublog.tech/privacy-policy)]. Thanks again for that, and if you're running a server in Europe or with European users,
definitely check it out.

Rules are important. I didn't give this as much thought, but I did assert some ground rules.
Even though I do believe in [[Postel's
Robustness Principle](https://en.wikipedia.org/wiki/Robustness_principle)] (_Be liberal in what you accept, and conservative in what you
send._), I generally tend to believe that computers lose their temper less often than humans, so I started off with:

1. **Behavioral Tenets**: Use welcoming and inclusive language, be respectful of differing viewpoints and experiences, gracefully accept
   constructive criticism, focus on what is best for the community, show empathy towards other community members. Be kind to each other, and
   yourself.
1. **Unacceptable behavior**: Use of sexualized language or imagery, unsolicited romantic attention, trolling, derogatory
   comments, personal or political attacks, and doxxing are strictly prohibited, as is any other conduct that could reasonably be considered
   inappropriate in a professional setting.

{{< image width="70px" float="left" src="/assets/mastodon/msie.png" alt="Favicon" >}}
I also read an entertaining (likely insider-joke) post from [[@nova@hachyderm.io](https://ublog.tech/@nova@hachyderm.io/109389072740558566)],
in which she was asking about the Internet Explorer favicon on her instance, so I couldn't resist replacing the Mastodon favicon with the
IPng Networks one. Vanity matters.

## What's next

Now that the server is up, and I have a small number of users (mostly folks I know from the tech industry), I took some time to explore
the Fediverse, reach out to friends old and new, participate in a few random discussions, fiddle with the iOS apps (and in the end,
settled on Toot! with a runner-up of Metatext), and generally had an *amazing* time on Mastodon these last few days.

Now, I think I'm ready to further productionize the experience. My next article will cover monitoring - a vital aspect of any serious
project. I'll go over Prometheus, Grafana, Alertmanager and how to get the most signal out of a running Mastodon instance. Stay tuned!

If you're looking for a home, feel free to sign up at [https://ublog.tech/](https://ublog.tech/) as I'm sure that having a bit more load /
traffic on this instance will allow me to learn (and in turn, to share with others)!

diff --git a/content/articles/2022-11-24-mastodon-2.md b/content/articles/2022-11-24-mastodon-2.md
new file mode 100644
index 0000000..3e068c8
--- /dev/null
+++ b/content/articles/2022-11-24-mastodon-2.md
@@ -0,0 +1,215 @@
---
date: "2022-11-24T01:20:14Z"
title: Mastodon - Part 2 - Monitoring
---

# About this series

{{< image width="200px" float="right" src="/assets/mastodon/mastodon-logo.svg" alt="Mastodon" >}}

I have seen companies achieve great successes in the consumer internet and entertainment industries. I've been feeling less
enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using "free" services
is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but
for me it's time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to
privately operated ones.

In the [[previous post]({% post_url 2022-11-20-mastodon-1 %})], I shared some thoughts on how the overall install of a Mastodon instance
went, making it a point to ensure my users' (and my own!) data is somehow safe, and that the machine runs on good hardware and with good
connectivity. Thanks IPng, for that 10G connection!
In this post, I visit an old friend, +[[Borgmon](https://sre.google/sre-book/practical-alerting/)], which has since reincarnated and become the _de facto_ open source +observability and signals ecosystem, and its incomparably awesome friend. Hello, **Prometheus** and **Grafana**! + +## Anatomy of Mastodon + +Looking more closely at the architecture of Mastodon, it consists of a few moving parts: + +1. **Storage**: At the bottom, there's persistent storage, in my case **ZFS**, on which account information (like avatars), media + attachments, and site-specific media lives. As posts stream to my instance, their media is spooled locally for performance. +1. **State**: Application state is kept in two databases: + * Firstly, a **SQL database** which is chosen to be [[PostgreSQL](https://postgresql.org)]. + * Secondly, a memory based **key-value storage** system [[Redis](https://redis.io)] is used to track the vitals of home feeds, + list feeds, Sidekiq queues as well as Mastodon's streaming API. +1. **Web (transactional)**: The webserver that serves end user requests and the API is written in a Ruby framework called + [[Puma](https://github.com/puma/puma)]. Puma tries to do its job efficiently, and doesn't allow itself to be bogged down by long lived web + sessions, such as the ones where clients get streaming updates to their timelines on the web- or mobile client. +1. **Web (streaming)**: This webserver is written in [[NodeJS](https://nodejs.org/en/about/)] and excels at long lived connections + that use Websockets, by providing a Streaming API to clients. +1. **Web (frontend)**: To tie all the current and future microservices together, provide SSL (for HTTPS), and a local object cache for + things that don't change often, one or more [[NGINX](https://nginx.org)] servers are used. +1. **Backend (processing)**: Many interactions with the server (such as distributing posts) turn in to background tasks that are enqueued + and handled asynchronously by a worker pool provided by [[Sidekiq](https://github.com/mperham/sidekiq)]. +1. **Backend (search)**: Users that wish to search the local corpus of posts and media, can interact with an instance of + [[Elastic](https://www.elastic.co/)], a free and open search and analytics solution. + +These systems all interact in particular ways, but I immediately noticed one interesting tidbit. Pretty much every system in this list can +(or can be easily made to) emit metrics in a popular [[Prometheus](https://prometheus.io/)] format. I cannot overstate the love I have for +this project, both technically but also socially because I know how it came to be. Ben, thanks for the RC racecars (I still have them!). +Matt, I admire your Go- and Java-skills and your general workplace awesomeness. And Richi sorry to have missed you last week in Hamburg at +[[DENOG14](https://www.denog.de/de/meetings/denog14/agenda.html)]! + +## Prometheus + +Taking stock of the architecture here, I think my best bet is to rig this stuff up with Prometheus. This works mostly by having a central, +in my case external to [[uBlog.tech](https://ublog.tech)] server scrape a bunch of timeseries metrics periodically, after which I can create +pretty graphs of them, but also monitor if some values seem out of whack, like a Sidekiq queue delay raising, CPU or disk I/O running a bit +hot. And the best thing yet? 
I will get pretty much all of this for free, because other, smarter folks have contributed to this
ecosystem already:

* **Server**: monitoring is canonically done by [[Node Exporter](https://prometheus.io/docs/guides/node-exporter/)]. It provides metrics for
  all the low-level machine and kernel stats you'd ever think to want: network, disk, cpu, processes, load, and so on.
* **Redis**: is provided by [[Redis Exporter](https://github.com/oliver006/redis_exporter)] and can show all sorts of operations on data
  realms served by Redis.
* **PostgreSQL**: is provided by [[Postgres Exporter](https://github.com/prometheus-community/postgres_exporter)] which is
  maintained by the Prometheus Community.
* **NGINX**: is provided by [[NGINX Exporter](https://github.com/nginxinc/nginx-prometheus-exporter)] which is maintained by the company
  behind NGINX. I used to have a Lua based exporter (when I ran [[SixXS](https://sixxs.net/)]) which had lots of interesting additional
  stats, but for the time being I'll just use this one.
* **Elastic**: has a converter from its own metrics system in the [[Elasticsearch
  Exporter](https://github.com/prometheus-community/elasticsearch_exporter)], once again maintained by the (impressively fabulous!)
  Prometheus Community.

All of these implement a common pattern: they take the (bespoke, internal) representation of statistics counters or dials/gauges, and
transform them into a common format called the _Metrics Exposition_ format, and they provide this either on an HTTP endpoint (typically
using a `/metrics` URI handler directly on the webserver), or via a push mechanism using a popular
[[Pushgateway](https://prometheus.io/docs/instrumenting/pushing/)] in case there is no server to poll, for example a batch process that did
some work and wanted to report on its results.

Incidentally, a fair amount of popular open source infrastructure already has a Prometheus exporter -- check out [[this
list](https://prometheus.io/docs/instrumenting/exporters/)], but also the assigned [[TCP
ports](https://github.com/prometheus/prometheus/wiki/Default-port-allocations)] for popular things that you might also be using.
Maybe you'll get lucky and find out that somebody has already provided an exporter, so you don't have to!

### Configuring Exporters

Now that I have found a whole swarm of these Prometheus Exporter microservices, and understand how to plumb each of them through to
whatever it is they are monitoring, I can get cracking on some observability. Let me provide some notes for posterity, both for myself
if I ever revisit the topic and ... kind of forget what I had done so far :), but maybe also for the adventurous who are interested in
using Prometheus on their own Mastodon instance.

First of all, it's worth mentioning that while these exporters (typically written in Go) have command line flags, they can often also take
their configuration from environment variables, provided mostly because they operate in Docker or Kubernetes. My exporters will all run
_vanilla_ in **systemd**, but these systemd units can also be configured to use environments, which is neat!
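As an aside, the Prometheus side of all this boils down to a handful of scrape targets. A minimal sketch of what that scrape job could
look like follows; the `10.0.0.2` backend-LAN address is a placeholder, and the ports are simply the exporters' registered defaults:

```
scrape_configs:
  - job_name: 'ublog'
    scrape_interval: 10s
    static_configs:
      - targets:
          - '10.0.0.2:9100'   # node exporter
          - '10.0.0.2:9114'   # elasticsearch exporter
          - '10.0.0.2:9121'   # redis exporter
          - '10.0.0.2:9187'   # postgres exporter
          - '10.0.0.2:9113'   # nginx exporter
```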
First, I create a few environment files for each **systemd** unit that contains a Prometheus exporter:

```
pim@ublog:~$ ls -la /etc/default/*exporter
-rw-r----- 1 root root   49 Nov 23 18:15 /etc/default/elasticsearch-exporter
-rw-r----- 1 root root   76 Nov 22 17:13 /etc/default/nginx-exporter
-rw-r----- 1 root root  170 Nov 22 22:41 /etc/default/postgres-exporter
-rw-r----- 1 root root 9789 May 27  2021 /etc/default/prometheus-node-exporter
-rw-r----- 1 root root    0 Nov 22 22:56 /etc/default/redis-exporter
-rw-r----- 1 root root   67 Nov 22 23:56 /etc/default/statsd-exporter
```

The contents of these files will give away passwords, like the one for ElasticSearch or Postgres, so I specifically make them readable only
by `root:root`. I won't share my passwords with you, dear reader, so you'll have to guess the contents here!

Priming the environment with these values, I will take the **systemd** unit for elasticsearch as an example:

```
pim@ublog:~$ cat << EOF | sudo tee /lib/systemd/system/elasticsearch-exporter.service
[Unit]
Description=Elasticsearch Prometheus Exporter
After=network.target

[Service]
EnvironmentFile=/etc/default/elasticsearch-exporter
ExecStart=/usr/local/bin/elasticsearch_exporter
User=elasticsearch
Group=elasticsearch
Restart=always

[Install]
WantedBy=multi-user.target
EOF

pim@ublog:~$ cat << EOF | sudo tee /etc/default/elasticsearch-exporter
ES_USERNAME=elastic
ES_PASSWORD=$(SOMETHING_SECRET)
EOF

pim@ublog:~$ sudo systemctl enable elasticsearch-exporter
pim@ublog:~$ sudo systemctl start elasticsearch-exporter
```

Et voilà, just like that the service starts, connects to elasticsearch, transforms all of its innards into beautiful Prometheus metrics, and
exposes them on its "registered" port, in this case 9114, which can be scraped by the Prometheus instance a few computers away, connected to
the uBlog VM via backend LAN over RFC1918. I just _knew_ that second NIC would come in useful!

{{< image width="400px" float="right" src="/assets/mastodon/prom-metrics.png" alt="ElasticSearch Metrics" >}}

All **five** of the exporters are configured and exposed. They are now providing a wealth of realtime information
on how the various Mastodon components are doing. And if any of them start malfunctioning, or running out of steam, or simply taking the day
off, I will be able to see this either by certain metrics going out of expected ranges, or by the exporter reporting that it cannot even
find the service at all (which we can also detect and turn into alarms, more on that later).

Pictured here (you should probably open it in full resolution unless you have hawk eyes) is an example of those metrics, of which
Prometheus is happy to handle several million at a relatively high scrape frequency; in my case it comes around every 10 seconds and pulls
the data from these five exporters. While these metrics are human readable, they aren't very practical...

## Grafana

... so let's visualize them with an equally awesome tool: [[Grafana](https://grafana.com/)]. This tool provides operational dashboards for any
data that is stored here, there, or anywhere :) Grafana can render stuff from a plethora of backends; one popular and established one is
Prometheus. And as it turns out, as with Prometheus, lots of work has been done already with canonical, almost out-of-the-box, dashboards
that were contributed by folks in the field.
In fact, every single one of the five exporters I installed also has an accompanying
dashboard, sometimes even multiple to choose from! Grafana allows you to [[search and download](https://grafana.com/grafana/dashboards/)]
these from a corpus they provide, referring to them by their `id`, or alternatively downloading a JSON representation of the dashboard, for
example one that comes with the exporter, or one you find on GitHub.

For uBlog, I installed: [[Node Exporter](https://grafana.com/grafana/dashboards/1860-node-exporter-full/)], [[Postgres
Exporter](https://grafana.com/grafana/dashboards/9628-postgresql-database/)], [[Redis
Exporter](https://grafana.com/grafana/dashboards/11692-redis-dashboard-for-prometheus-redis-exporter-1-x/)], [[NGINX
Exporter](https://github.com/nginxinc/nginx-prometheus-exporter/blob/main/grafana/README.md)], and [[ElasticSearch
Exporter](https://grafana.com/grafana/dashboards/14191-elasticsearch-overview/)].

{{< image width="400px" float="right" src="/assets/mastodon/grafana-psql.png" alt="Grafana Postgres" >}}

To the right (top) you'll see a dashboard for PostgreSQL - it has lots of expert insights on how databases are used, how many read/write
operations (like SELECT and UPDATE/DELETE queries) are performed, and their respective latency expectations. What I find particularly useful
is the total amount of memory, CPU and disk activity. This allows me to see at a glance when it's time to break out
[[pgTune](https://github.com/gregs1104/pgtune)] to help change system settings for Postgres, or even inform me when it's time to move
the database to its own server rather than co-habitating with the other stuff running on this virtual machine. In my experience, stateful
systems are often the source of bottlenecks, so I take special care to monitor them and observe their performance over time. In particular,
slowness will be seen in Mastodon if the database is slow (sound familiar?).

{{< image width="400px" float="right" src="/assets/mastodon/grafana-redis.png" alt="Grafana Redis" >}}

Next, to the right (middle) you'll see a dashboard for Redis. This one shows me how full the Redis cache is (the yellow line in
the first graph is when I restarted Redis to give it a `maxmemory` setting of 1GB), but also a high resolution overview of how many
operations it's doing. I can see that the load is spiky, and upon closer inspection this is the `pfcount` command with a period of exactly
300 seconds, in other words something is spiking every 5min. I have a feeling that this might become an issue... and when it does, I'll get
to learn all about this elusive [[pfcount](https://redis.io/commands/pfcount/)] command. But until then, I can see the average time by
command: because Redis serves from RAM and this is a pretty quick server, I see the turnaround time for most queries to it in the
200-500 µs range, wow!

{{< image width="400px" float="right" src="/assets/mastodon/grafana-node.png" alt="Grafana Node" >}}

But while these dashboards are awesome, what I find has saved me (and my ISP, IPng Networks) a metric tonne of time is the most fundamental
monitoring in the Node Exporter dashboard, pictured to the right (bottom). What I really love about this dashboard is that it shows at a
glance the parts of the _computer_ that are going to become a problem.
If RAM is full (but not because of filesystem cache), or CPU is
running hot, or the network is flatlining at a certain throughput or packets/sec limit, these are all things that the applications running
_on_ the machine won't necessarily be able to show me more information on, but the _Node Exporter_ comes to the rescue: it has so many
interesting pieces of kernel and host operating system telemetry that it is one of the most useful tools I know. Every physical host and
every virtual machine is exporting metrics into IPng Networks' Prometheus instance, and it constantly shows me what to improve. Thanks, Obama!

## What's next

Careful readers will have noticed that this whole article talks about all sorts of interesting telemetry, observability metrics, and
dashboards, but they are all _common components_, and none of them touch on the internals of Mastodon's processes, like _Puma_ or _Sidekiq_
or the _API Services_ that Mastodon exposes. Consider this a cliffhanger (eh, mostly because I'm a bit busy at work and will need a little
more time).

In an upcoming post, I'll take a deep dive into this application-specific behavior and how to extract this telemetry (spoiler alert: it can be
done! and I will open source it!), as I've started to learn more about how Ruby gathers and exposes its own internals. Interestingly, one of
the things that I'll talk about is _NSA_ -- not the American agency, but rather a comical wordplay from some open source minded folks who have
blazed the path in making Ruby on Rails application performance metrics available to external observers. In a round-about way, I hope to
show how to plug these into Prometheus in the same way all the other exporters already have.

By the way: If you're looking for a home, feel free to sign up at [https://ublog.tech/](https://ublog.tech/) as I'm sure that having a bit
more load / traffic on this instance will allow me to learn (and in turn, to share with others)!

diff --git a/content/articles/2022-11-27-mastodon-3.md b/content/articles/2022-11-27-mastodon-3.md
new file mode 100644
index 0000000..0f2a059
--- /dev/null
+++ b/content/articles/2022-11-27-mastodon-3.md
@@ -0,0 +1,327 @@
---
date: "2022-11-27T00:01:14Z"
title: Mastodon - Part 3 - statsd and Prometheus
---

# About this series

{{< image width="200px" float="right" src="/assets/mastodon/mastodon-logo.svg" alt="Mastodon" >}}

I have seen companies achieve great successes in the consumer internet and entertainment industries. I've been feeling less
enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using "free" services
is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but
for me it's time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to
privately operated ones.

In my [[first post]({% post_url 2022-11-20-mastodon-1 %})], I shared some thoughts on how I installed a Mastodon instance for myself. In a
[[followup post]({% post_url 2022-11-24-mastodon-2 %})] I talked about its overall architecture and how one might use Prometheus to monitor
vital backends like Redis, Postgres and Elastic. But Mastodon _itself_ is also an application which can provide a wealth of telemetry using
a protocol called [[StatsD](https://github.com/statsd/statsd)].
In this post, I'll show how I tie these all together in a custom **Grafana Mastodon dashboard**!

## Mastodon Statistics

I noticed in the [[Mastodon docs](https://docs.joinmastodon.org/admin/config/#statsd)], that there's a one-liner breadcrumb that might be
easy to overlook, as it doesn't give many details:

> `STATSD_ADDR`: If set, Mastodon will log some events and metrics into a StatsD instance identified by its hostname and port.

Interesting, but what is this **statsd**, precisely? It's a simple text-only protocol that allows applications to send key-value pairs in
the form of `{name}:{value}|{type}` strings, that carry statistics of a certain _type_ across the network, using either TCP or UDP.
Cool! To make use of these stats, I first add this `STATSD_ADDR` environment variable from the docs to my `.env.production` file, pointing
it at `localhost:9125`. This should make Mastodon apps emit some statistics of sorts.

I decide to take a look at those packets, instructing tcpdump to show the contents of the packets (using the `-A` flag). Considering my
destination is `localhost`, I also know which interface to tcpdump on (using the `-i lo` flag). My first attempt is a little bit noisy,
because the packet dump contains the [[IPv4 header](https://en.wikipedia.org/wiki/IPv4)] (20 bytes) and [[UDP
header](https://en.wikipedia.org/wiki/User_Datagram_Protocol)] (8 bytes) as well, but sure enough, if I skip past those first 28 bytes,
I get human-readable data, in a bunch of strings that start with `Mastodon`:

```
pim@ublog:~$ sudo tcpdump -Ani lo port 9125 | grep 'Mastodon.production.' | sed -e 's,.*Mas,Mas,'
Mastodon.production.sidekiq.ActivityPub..ProcessingWorker.processing_time:16272|ms
Mastodon.production.sidekiq.ActivityPub..ProcessingWorker.success:1|c
Mastodon.production.sidekiq.scheduled_size:25|g
Mastodon.production.db.tables.accounts.queries.select.duration:1.8323479999999999|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.total_duration:33.856679|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.db_time:2.3943890118971467|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.view_time:1|ms
Mastodon.production.web.ActivityPub.InboxesController.create.json.status.202:1|c
...
```

**statsd** organizes its variable names in a dot-delimited tree hierarchy. I can clearly see some patterns in here, but why guess
when you're working with Open Source? Mastodon turns out to be using a popular Ruby library called the [[National Statsd
Agency](https://github.com/localshred/nsa)], a wordplay that I don't necessarily find all that funny. Naming aside though, this library
collects application level statistics in four main categories:

1. **:action_controller**: listens to the ActionController class that is extended into ApplicationControllers in Mastodon
1. **:active_record**: listens to any database (SQL) queries and emits timing information for them
1. **:active_support_cache**: records information regarding caching (Redis) queries, and emits timing information for them
1. **:sidekiq**: listens to Sidekiq middleware and emits information about queues, workers and their jobs

Using the library's [[docs](https://github.com/localshred/nsa)], I can clearly see the patterns described; for example, in the SQL
recorder the format will be `{ns}.{prefix}.tables.{table_name}.queries.{operation}.duration` where _operation_ here means one of the
classic SQL query types: **SELECT**, **INSERT**, **UPDATE**, and **DELETE**.
Similarly, in the cache recorder, the format will be
`{ns}.{prefix}.{operation}.duration` where _operation_ denotes one of **read_hit**, **read_miss**, **generate**, **delete**, and so on.

Reading a bit more of the Mastodon and **statsd** library code, I learn that for all variables emitted, the namespace `{ns}` is always a
combination of the application name and Rails environment, ie. **Mastodon.production**, and the `{prefix}` is the collector name,
one of **web**, **db**, **cache** or **sidekiq**. If you're curious, the Mastodon code that initializes the **statsd** collectors lives
in `config/initializers/statsd.rb`. Alright, I conclude that this is all I need to know about the naming schema.

Moving along, **statsd** gives each variable name a [[metric type](https://github.com/statsd/statsd/blob/master/docs/metric_types.md)], which
can be counters **c**, timers **ms** and gauges **g**. In the packet dump above you can see examples of each of these. The counter type in
particular is a little bit different -- applications emit increments here - in the case of the ActivityPub.InboxesController, it
merely signaled to increment the counter by 1, not the absolute value of the counter. This is actually pretty smart, because now any number
of workers/servers can all contribute to a global counter, by each just sending incrementals which are aggregated by the receiver.

As a small critique, I happened to notice that in the sidekiq datastream, some of what I think are _counters_ are actually modeled as
_gauges_ (notably the **processed** and **failed** jobs from the workers). I will have to remember that, but after observing for a few
minutes, I think I can see lots of nifty data in here.

## Prometheus

At IPng Networks, we use Prometheus as a monitoring and observability tool. It's worth pointing out that **statsd** has a few options itself to
visualise data, but considering I already have lots of telemetry in Prometheus and Grafana (see my [[previous post]({% post_url
2022-11-24-mastodon-2 %})]), I'm going to take a bit of a detour, and convert these metrics into the Prometheus _exposition format_, so that
they can be scraped on a `/metrics` endpoint just like the others. This way, I have all monitoring in one place, using one tool.
Monitoring is hard enough as it is, and having to learn multiple tools is _no bueno_ :)

### Statsd Exporter: overview

The community maintains a Prometheus [[Statsd Exporter](https://github.com/prometheus/statsd_exporter)] on GitHub. This tool, like many
others in the exporter family, will connect to a local source of telemetry, and convert these into the required format for consumption by
Prometheus. If left completely unconfigured, it will simply receive the **statsd** UDP packets on the Mastodon side, and export them
verbatim on the Prometheus side. This has a few downsides; notably, when new operations or controllers come into existence, I would
have to explicitly make Prometheus aware of them.

I think we can do better: because of the patterns noted above, I can condense the many metricnames from **statsd** into a few
carefully chosen Prometheus metrics, and add their variability as _labels_ on those time series.
Taking SQL queries as an example, I see
that there's a metricname for each known SQL table in Mastodon (and there are many), and then for each table, a unique metric is created for
each of the four operations:

```
Mastodon.production.db.tables.{table_name}.queries.select.duration
Mastodon.production.db.tables.{table_name}.queries.insert.duration
Mastodon.production.db.tables.{table_name}.queries.update.duration
Mastodon.production.db.tables.{table_name}.queries.delete.duration
```

What if I could rewrite these by capturing the `{table_name}` into a label, and, further observing that there are four query types (SELECT,
INSERT, UPDATE, DELETE), capturing those into an `{operation}` label, like so:

```
mastodon_db_operation_sum{operation="select",table="users"} 85.910
mastodon_db_operation_sum{operation="insert",table="accounts"} 112.70
mastodon_db_operation_sum{operation="update",table="web_push_subscriptions"} 6.55
mastodon_db_operation_sum{operation="delete",table="web_settings"} 9.668
mastodon_db_operation_count{operation="select",table="users"} 28790
mastodon_db_operation_count{operation="insert",table="accounts"} 610
mastodon_db_operation_count{operation="update",table="web_push_subscriptions"} 380
mastodon_db_operation_count{operation="delete",table="web_settings"} 4
```

This way, there are only two Prometheus metric names: **mastodon_db_operation_sum** and **mastodon_db_operation_count**. The first one counts
the cumulative time spent performing operations of that type on the table, and the second one counts the total number of queries of that
type on the table. If I take the **rate()** of the count variable, I will have queries-per-second, and if I divide the **rate()** of the
time spent by the **rate()** of the count, I will have a running average time spent per query over that time interval.

### Statsd Exporter: configuration

The Prometheus folks also thought of this, _quelle surprise_, and the exporter provides incredibly powerful transformation functionality
between the hierarchical tree-form of **statsd** and the multi-dimensional labeling format of Prometheus. This is called the [[Mapping
Configuration](https://github.com/prometheus/statsd_exporter#metric-mapping-and-configuration)], and it allows either globbing or regular
expression matching of the input metricnames, turning them into labeled output metrics. Building further on our example for SQL queries, I
can create a mapping like so:

```
pim@ublog:~$ cat << 'EOF' | sudo tee /etc/prometheus/statsd-mapping.yaml
mappings:
  - match: Mastodon\.production\.db\.tables\.(.+)\.queries\.(.+)\.duration
    match_type: regex
    name: "mastodon_db_operation"
    labels:
      table: "$1"
      operation: "$2"
EOF
```

This snippet will use a regular expression to match input metricnames, carefully escaping the dot-delimiters. Within the input, I match
two groups: the segment following `tables.` holds the variable SQL table name, and the segment following `queries.` captures the SQL operation.
Once this matches, the exporter will give the resulting variable in Prometheus simply the name `mastodon_db_operation` and add two labels
with the results of the regexp capture groups.

This one mapping I showed above will take care of _all of the metrics_ from the **database** collector, but there are three other collectors in
Mastodon's Ruby world. In the interest of brevity, I'll not bore you with them in this article, as this is mostly a rinse-and-repeat jobbie.
But I have attached a copy of the complete mapping configuration at the end of this article. With all of that hard work on mapping
completed, I can now start the **statsd** exporter and see its beautifully formed and labeled timeseries show up on port 9102, the default
[[assigned port](https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exporters.md)] for this exporter type.

## Grafana

First let me start by saying I'm incredibly grateful to all the folks who have contributed to existing exporters and Grafana dashboards,
notably for [[Node Exporter](https://grafana.com/grafana/dashboards/1860-node-exporter-full/)],
[[Postgres Exporter](https://grafana.com/grafana/dashboards/9628-postgresql-database/)],
[[Redis Exporter](https://grafana.com/grafana/dashboards/11692-redis-dashboard-for-prometheus-redis-exporter-1-x/)],
[[NGINX Exporter](https://github.com/nginxinc/nginx-prometheus-exporter/blob/main/grafana/README.md)], and [[ElasticSearch
Exporter](https://grafana.com/grafana/dashboards/14191-elasticsearch-overview/)]. I'm ready to make a modest contribution back to this
wonderful community of monitoring dashboards, in the form of a Grafana dashboard for Mastodon!

Writing these is pretty rewarding. I'll take some time to explain a few Grafana concepts, although this is not meant to be a tutorial at
all and honestly, I'm not that good at this anyway. A good dashboard design starts from a 30'000ft overview of the most vital stats (not
necessarily graphs, but using visual cues like colors), and gives more information in so-called _drill-down_ dashboards that allow a much
finer grained / higher resolution picture of a specific part of the monitored application.

Seeing as the collectors emit telemetry for four main parts of the application (remember, the `{prefix}` is one of **web**, **db**, **cache**, or
**sidekiq**), I will give the dashboard the same structure. Also, I will try my best not to invent new terminology: the application
developers have given their telemetry certain names, and I will stick to those. By building the dashboard this way, application developers as
well as application operators are more likely to be talking about the same things.

#### Mastodon Overview

{{< image src="/assets/mastodon/mastodon-stats-overview.png" alt="Mastodon Stats Overview" >}}

In the **Mastodon Overview**, each of the four collectors gets one or two stats-chips to present its highest level vital signs on. For a
web application, this will largely be requests per second, latency and possibly errors served. For a SQL database, this is typically the
issued queries and their latency. For a cache, the types of operation and again the latency observed in those operations. For Sidekiq (the
background worker pool that performs certain tasks in a queue on behalf of the system or user), I decide to focus on units of work, latency
and queue sizes.

Setting up the Prometheus queries in Grafana that fetch the data I need for these is typically going to be one of two things:

1. **QPS**: This is a rate of the monotonically increasing *_count*; taken over, say, one minute, it shows me the average queries-per-second.
   Considering the counters I created have labels that tell me what they are counting (for example in Puma, which API endpoint is being queried, and
   what format that request is using), I can now elegantly aggregate those application-wide, like so:
   > sum by (mastodon)(rate(mastodon_controller_duration_count[1m]))

2. **Latency**: The metrics in Prometheus also aggregate a monotonically increasing *_sum*, which tells me about the total time spent doing
   those things. It's pretty easy to calculate the running average latency over the last minute, by simply dividing the rate of time spent
   by the rate of requests served, like so:
   > sum by (mastodon)(rate(mastodon_controller_duration_sum[1m])) /
   > sum by (mastodon)(rate(mastodon_controller_duration_count[1m]))

To avoid clutter, I will leave the detailed full resolution view (like which _controller_ exactly, and what _format_ was queried, and which
_action_ was taken in the API) to a drilldown below. These two patterns are continued throughout the overview panel. Each QPS value is
rendered in dark blue, while the latency gets a special treatment on colors: I define a threshold which I consider "unacceptable", and then
create a few thresholds in Grafana to change the color as I approach that unacceptable max limit. By means of example, the Puma Latency
element I described above will have a maximum acceptable latency of 250ms. If the latency is above 40% of that, the color will turn yellow;
above 60% it'll turn orange and above 80% it'll turn red. This provides a visual cue that something may be wrong.

#### Puma Controllers

The APIs that Mastodon offers are served by a component called [[Puma](https://github.com/puma/puma)], a simple, fast, multi-threaded, and
highly parallel HTTP 1.1 server for Ruby/Rack applications. The application running in Puma typically defines endpoints as so-called
ActionControllers, which Mastodon expands on in a derived concept called ApplicationControllers, which each have a unique _controller_ name
(for example **ActivityPub.InboxesController**), an _action_ performed on them (for example **create**, **show** or **destroy**), and a
_format_ in which the data is handled (for example **html** or **json**). For each cardinal combination, a set of timeseries (counter, time
spent and latency quantiles) will exist. At the moment, there are about 53 API controllers, 8 actions, and 4 formats, which means there are
1'696 interesting metrics to inspect.

Drawing all of these in one graph quickly turns into an unmanageable mess, but there's a neat trick in Grafana: what if I could make these
variables selectable, and maybe pin them to exactly one value (for example, all information for a specific _controller_)? That would
greatly reduce the amount of data we have to show. To implement this, the dashboard can pre-populate a variable based on a Prometheus query.
By means of example, to find the possible values of _controller_, I might take a look at all Prometheus metrics with name
**mastodon_controller_duration_count** and search for labels within them with a regular expression, for example **/controller="([^"]+)"/**.

What this will do is select all values in the group `"([^"]+)"`, which may seem a little bit cryptic at first. The logic behind it is: first
create a group between parentheses `(...)`, then within that group match a set of characters `[...]`, where the set is all characters
except the double-quote `[^"]`, repeated one or more times with the `+` suffix. So this will precisely select the string between
the double-quotes in the label: `controller="whatever"` will return `whatever` with this expression.

{{< image width="400px" float="right" src="/assets/mastodon/mastodon-puma-details.png" alt="Mastodon Puma Controller Details" >}}

After creating three of these, one for _controller_, _action_ and _format_, three new dropdown selectors appear at the top of my dashboard.
I will allow any combination of selections, including "All" of them (the default). Then, if I wish to drill down, I can pin one or more of
these variables to narrow down the total amount of timeseries to draw.
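To make the drill-down concrete, a panel query under these selections could look something like the sketch below -- assuming the three
dashboard variables were named `$controller`, `$action` and `$format`, and set up as multi-value variables whose "All" option maps to a
`.*` regex:

```
sum by (controller, action, format) (
  rate(mastodon_controller_duration_count{
    controller=~"$controller", action=~"$action", format=~"$format"
  }[1m])
)
```

With everything left at "All" this draws every combination, and pinning any of the three dropdowns simply narrows the label matchers.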
+ +Shown to the right are two examples, one with "All" timeseries in the graph, which shows at least which one(s) are outliers. In this case, +the orange trace in the top graph showing more than average operations is the so-called **ActivityPub.InboxesController**. + +I can find this out by hovering over the orange trace, the tooltip will show me the current name and value. Then, selecting this in the top +navigation dropdown **Puma Controller**, Grafana will narrow down the data for me to only those relevant to this controller, which is super +cool. + +#### Drilldown in Grafana + +{{< image width="400px" float="right" src="/assets/mastodon/mastodon-puma-controller.png" alt="Mastodon Puma Controller Details" >}} + +Where the graph down below (called _Action Format Controller Operations_) showed all 1'600 or so timeseries, selecting the one controller I'm +interested in, shows me a much cleaner graph with only three timeseries, take a look to the right. Just by playing around with this data, I'm +learning a lot about the architecture of this application! + +For example, I know that the only _action_ on this particular controller seems to be **create**, and there are three available _formats_ in which +this create action can be performed: **all**, **html** and **json**. And using the graph above that got me started on this little journey, I now +know that the traffic spike was for `controller=ActivityPub.InboxesController, action=create, format=all`. Dope! + +#### SQL Details + +{{< image src="/assets/mastodon/mastodon-sql-details.png" alt="Mastodon SQL Details" >}} + +While I already have a really great [[Postgres Dashboard](https://grafana.com/grafana/dashboards/9628-postgresql-database/)] (the one that +came with Postgres _server_), it is also good to be able to see what the _client_ is experiencing. Here, we can drill down on two variables, +called `$sql_table` and `$sql_operation`. For each {table,operation}-tuple, the average, median and 90th/99th percentile latency are +available. So I end up with the following graphs and dials for tail latency: the top left graph shows me something interesting -- most +queries are SELECT, but the bottom graph shows me lots of tables (at the time of this article, Mastodon has 73 unique SQL tables). If I +wanted to answer the question "which table gets most SELECTs", I can drill down first by selecting the **SQL Operation** to be **select**, +after which I see decidedly less traces in the _SQL Table Operations_ graph. Further analysis shows that the two places that are mostly read +from are the tables called **statuses** and **accounts**. When I drill down using the selectors at the top of Grafana's dashboard UI, the tail +latency is automatically filtered to only that which is selected. If I were to see very slow queries at some point in the future, it'll be +very easy to narrow down exactly which table and which operation is the culprit. + +#### Cache Details + +{{< image src="/assets/mastodon/mastodon-cache-details.png" alt="Mastodon Cache Details" >}} + +For the cache statistics collector, I learn there are a few different operators. Similar to Postgres, I already have a really cool +[[Redis Dashboard](https://grafana.com/grafana/dashboards/11692-redis-dashboard-for-prometheus-redis-exporter-1-x/)], for which I can see +the Redis _server_ view. 
But in Mastodon, I can now also see the _client_ view, and see when any of these operations spike in either +queries/sec (left graph), latency (middle graph), or tail latency for common operations (the dials on the right). This is bound to come in +handy at some point -- I already saw one or two spikes in the **generate** operation (see the blue spike in the screenshot above), which is +something to keep an eye on. + +#### Sidekiq Details + +{{< image src="/assets/mastodon/mastodon-sidekiq-details.png" alt="Mastodon Sidekiq Details" >}} + + +The single most +interesting thing in the Mastodon application is undoubtedly its _Sidekiq_ workers, the ones that do all sorts of system- and user-triggered +work such as distributing the posts to federated servers, prefetch links and media, and calculate trending tags, posts and links. +Sidekiq is a [[producer-consumer](https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem)] system where new units of work (called +jobs) are written to a _queue_ in Redis by a producer (typically Mastodon's webserver Puma, or another Sidekiq task that needs something to +happen at some point in the future), and then consumed by one or more pools which execute the _worker_ jobs. + +There are several queues defined in Mastodon, and each _worker_ has a name, a _failure_ and _success_ rate, and a running tally of how much +_processing_time_ they've spent executing this type of work. Sidekiq workers will consume jobs in +[[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] order, and it has a finite amount of workers (by default on a small +instance it runs one worker with 25 threads). If you're interested in this type of provisioning, [[Nora +Tindall](https://nora.codes/post/scaling-mastodon-in-the-face-of-an-exodus/)] wrote a great article about it. + +This drill-down dashboard shows all of the Sidekiq _worker_ types known to Prometheus, and can be selected at the top of the dashboard in +the dropdown called **Sidekiq Worker**. A total amount of worker jobs/second, as well as the running average time spent performing those jobs +is shown in the first two graphs. The three dials show the median, 90th percentile and 99th percentile latency of the work being performed. + +If all threads are busy, new work is left in the queue, until a worker thread is available to execute the job. This will lead to a queue +delay on a busy server that is underprovisioned. For jobs that had to wait for an available thread to pick them up, the number of jobs per +queue, and the time in seconds that the jobs were waiting to be picked up by a worker, are shown in the two lists at bottom right. + +And with that, as [[North of the Border](https://www.youtube.com/@NorthoftheBorder)] would say: _"We're on to the glamour shots!"_. + +## What's next + +I made a promise on two references that will be needed to successfully hook up Prometheus and Grafana to the `STATSD_ADDR` configuration for +Mastodon's Rails environment, and here they are: + +* The Statsd Exporter mapping configuration file: [[/etc/prometheus/statsd-mapping.yaml](/assets/mastodon/statsd-mapping.yaml)] +* The Grafana Dashboard: [[grafana.com/dashboards/](https://grafana.com/grafana/dashboards/17492-mastodon-stats/)] + +**As a call to action:** if you are running a larger instance and would allow me to take a look and learn from you, I'd be very grateful. 
I'm going to monitor my own instance for a little while, so that I can start to get a feeling for where the edges of performance cliffs are,
in other words: How slow is _too slow_? How much load is _too much_? In an upcoming post, I will take a closer look at alerting in Prometheus,
so that I can catch these performance cliffs and make human operators aware of them by means of alerts, delivered via Telegram or Slack.

By the way: If you're looking for a home, feel free to sign up at [https://ublog.tech/](https://ublog.tech/) as I'm sure that having a bit
more load / traffic on this instance will allow me to learn (and in turn, to share with others)!

diff --git a/content/articles/2022-12-05-oem-switch-1.md b/content/articles/2022-12-05-oem-switch-1.md
new file mode 100644
index 0000000..1f892ad
--- /dev/null
+++ b/content/articles/2022-12-05-oem-switch-1.md
@@ -0,0 +1,638 @@
---
date: "2022-12-05T11:56:54Z"
title: 'Review: S5648X-2Q4Z Switch - Part 1: VxLAN/GENEVE/NvGRE'
---

After receiving an e-mail from a newer [[China-based switch OEM](https://starry-networks.com/)], I
had a chat with their founder and learned that the combination of switch silicon and software may be
a good match for IPng Networks. You may recall my previous endeavors in the Fiberstore lineup,
notably an in-depth review of the [[S5860-20SQ]({% post_url 2021-08-07-fs-switch %})] which sports
20x10G, 4x25G and 2x40G optics, and its larger sibling the S5860-48SC which comes with 48x10G and
8x100G cages. I use them in production at IPng Networks and their featureset versus price point is
pretty good. In that article, I made one critical note reviewing those FS switches, in that they'd
be a better fit if they allowed for MPLS or IP based L2VPN services in hardware.

{{< image width="450px" float="left" src="/assets/oem-switch/S5624X-front.png" alt="S5624X Front" >}}

{{< image width="450px" float="right" src="/assets/oem-switch/S5648X-front.png" alt="S5648X Front" >}}

I got cautiously enthusiastic (albeit suitably skeptical) when this new vendor claimed VxLAN,
GENEVE, MPLS and GRE at 56 ports and line rate, on a really affordable budget (sub-$4K for the
56-port and sub-$2K for the 26-port switch). This reseller is using a lesser-known silicon vendor called
[[Centec](https://www.centec.com/silicon)], who have a lineup of Ethernet silicon. In this device,
the CTC8096 (GoldenGate) is used for cost-effective, high-density 10GbE/40GbE applications paired
with 4x100GbE uplink capability. This is Centec's fourth generation chip, so the CTC8096 inherits a feature
set that ranges from L2/L3 switching to advanced data center and metro Ethernet features, with some
innovative enhancements. The switch chip provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE
ports, inheriting from its predecessors a variety of features, including L2, L3, MPLS, VXLAN, MPLS SR,
and OAM/APS. Highlight features include telemetry, programmability, security and traffic management,
and network time synchronization.

This will be the first of a set of write-ups exploring the hard- and software functionality of this new
vendor. As we'll see, it's all about the _software_.

## Detailed findings

### Hardware

{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-front-opencase.png" alt="Front" >}}

The switch comes well packaged with two removable 400W Gold power supplies from _Compuware
Technology_ which output 12V/33A and +5V/3A, as well as four removable PWM-controlled fans from
_Protechnic_. The fans are expelling air, so they are cooling front-to-back on this unit. Looking at
the fans, changing them to pull air back-to-front would be possible after-sale, by flipping the fans
around as they're attached in their case by two M4 flat-head screws. This is truly meant to be
an OEM switch -- there is no logo or sticker with the vendor's name, so I should probably print a
few vinyl IPng stickers to skin them later.

{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-daughterboard.png" alt="S5648 Daughterboard" >}}

On the front, the switch sports an RJ45 standard serial console, a mini-USB connector of which the
function is not clear to me, an RJ45 network port used for management, a pinhole which houses a
reset button labeled `RST` and two LED indicators labeled `ID` and `SYS`. The serial port runs at
115200,8n1 and the management network port is Gigabit.

Regarding the regular switch ports, there are 48x SFP+ cages, 4x QSFP28 (port 49-52) running at
100Gbit, and 2x QSFP+ ports (53-54) running at 40Gbit. All ports (management and switch) present a
MAC address from OUI `00-1E-08`, which is assigned to Centec.

The switch is not particularly quiet: all six of its fans start up at a high pitch, but once the
switch boots, they calm down and emit noise levels as you would expect from a datacenter unit. I
measured it at 74dBA when booting, and otherwise at around 62dBA when running. On the inside, the
PCB is rather clean. It comes with a daughterboard, housing a small PowerPC P1010 with a 533MHz CPU,
1GB of RAM, and 2GB flash on board, which is running Linux. This is the same card that many of the
FS.com switches use (eg. S5860-48S6Q), a cheaper alternative to the high-end Intel Xeon-D.

{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-switchchip-block.png" alt="S5648 Switchchip" >}}

#### S5648X (48x10, 2x40, 4x100)

There is one switch chip, on the front of the PCB, connecting all 54 ports.
It has a sizable
heatsink on it, drawing air backwards through ports (36-48). The switch uses a lesser-known and
somewhat dated Centec [[CTC8096](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC8096)],
codenamed GoldenGate and released in 2015, which is rated for 1.2Tbps of aggregate throughput. The
chip can be programmed to handle a bunch of SDN protocols, including VxLAN, GRE, GENEVE, and MPLS /
MPLS SR, with a limited TCAM to hold things like ACLs, IPv4/IPv6 routes and MPLS labels. The CTC8096
provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE ports. The SerDES design is
pretty flexible, allowing it to mix and match ports.

You can see more (hires) pictures and screenshots throughout these articles in this [[Photo
Album](https://photos.app.goo.gl/Mxzs38p355Bo4qZB6)].

#### S5624X (24x10, 2x100)

In case you're curious (I certainly was!), the smaller unit (with 24x10+2x100) is built off of the
Centec [[CTC7132](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC7132)], codenamed
TsingMa and released in 2019, and it offers a variety of similar features, including L2, L3, MPLS,
VXLAN, MPLS SR, and OAM/APS. Highlight features include telemetry, programmability, security and
traffic management, and network time synchronization. The SoC has an embedded ARM A53 CPU core
running at 800MHz, and the SerDES on this chip allows for 24x1G/2.5G/5G/10G and 2x40G/100G for a
throughput of 440Gbps, but at a fairly sharp price point.

One thing worth noting (because I know some of my regular readers will already be wondering!): this
series of chips (both the fourth generation CTC8096 and the sixth generation CTC7132) comes with a very
modest TCAM, which means in practice 32K MAC addresses, 8K IPv4 routes, 1K IPv6 routes, a 6K MPLS
table, 1K L2VPN instances, and 64 VPLS instances. The Centec also comes with a modest 32MB packet
buffer shared between all ports, and the controlplane comes with 1GB of memory and a 533MHz ARM. So
no, this won't take a full table :-) but in all honesty, that's not the thing this machine is built
to do.

When booted, the switch draws roughly 68 Watts combined on its two power supplies, and I find that
pretty cool considering the total throughput offered. Of course, once optics are inserted, the total
power draw will go up. Also worth noting: when the switch is under load, the Centec chip will
consume more power; for example, when forwarding 8x10G + 2x100G, the total consumption was 88 Watts,
which is totally respectable now that datacenter power bills are skyrocketing.

### Topology

{{< image width="500px" float="right" src="/assets/oem-switch/topology.svg" alt="Front" >}}

On the heels of my [[DENOG14 Talk](/media/denog14/index.html)], in which I showed how VPP can
_route_ 150Mpps and 180Gbps on a 10-year-old Dell while consuming a full 1M BGP table in 7 seconds
or so, I still had a little bit of a LAB left to repurpose. So I built the following topology using
the loadtester, packet analyzer, and switches:

* **msw-top**: S5624-2Z-EI switch
* **msw-core**: S5648X-2Q4ZA switch
* **msw-bottom**: S5624-2Z-EI switch
* All switches connect to:
   * each other with 100G DACs (right, black)
   * T-Rex machine with 4x10G (left, rainbow)
* Each switch gets a mgmt IPv4 and IPv6

With this topology I will have enough wiggle room to patch anything to anything. Now that the
physical part is out of the way, let's take a look at the firmware of these things!
+ +### Software + +As can be seen in the topology above, I am testing three of these switches - two are the smaller sibling +[S5624X 2Z-EI] (which come with 24x10G SFP+ and 2x100G QSFP28), and one is this [S5648X 2Q4Z] +pictured above. The vendor has a licensing system, for basic L2, basic L3 and advanced metro L3. +These switches come with the most advanced/liberal licenses, which means all of the features will +work on the switches, notably, MPLS/LDP and VPWS/VPLS. + +Taking a look at the CLI, it's very Cisco IOS-esque; there's a few small differences, but the look +and feel is definitely familiar. Base configuration kind of looks like this: + +#### Basic config + +``` +msw-core# show running-config +management ip address 192.168.1.33/24 +management route add gateway 192.168.1.252 +! +ntp server 216.239.35.4 +ntp server 216.239.35.8 +ntp mgmt-if enable +! +snmp-server enable +snmp-server system-contact noc@ipng.ch +snmp-server system-location Bruttisellen, Switzerland +snmp-server community public read-only +snmp-server version v2c + +msw-core# conf t +msw-core(config)# stm prefer ipran +``` + +A few small things of note. There is no `mgmt0` device as I would've expected. Instead, the SoC +exposes its management interface to be configured with these `management ...` commands. The IPv4 can +be either DHCP or a static address, and IPv6 can only do static addresses. Only one (default) +gateway can be set for either protocol. Then, NTP can be set up to work on the `mgmt-if` which is a +useful way to use it for timekeeping. + +The SNMP server works both from the `mgmt-if` and from the dataplane, which is nice. +SNMP supports everything you'd expect, including v3 and traps for all sorts of events, including +IPv6 targets and either dataplane or `mgmt-if`. + +I did notice that the nameserver cannot use the `mgmt-if`, so I left it unconfigured. I found it a +little bit odd, considering all the other functionality does work just fine over the `mgmt-if`. + +If you've run CAM-based systems before, you'll likely have come across some form of _partitioning_ +mechanism, to allow certain types in the CAM (eg. IPv4, IPv6, L2 MACs, MPLS labels, ACLs) to have +more or fewer entries. This is particularly relevant on this switch because it has a comparatively +small CAM. It turns out, that by default MPLS is entirely disabled, and to turn it on (and sacrifice +some of that sweet sweet content addressable memory), I have to issue the command `stm prefer ipran` +(other flavors are _ipv6_, _layer3_, _ptn_, and _default_), and reload the switch. + +Having been in the networking industry for a while, I scratched my head on the acronym **IPRAN**, so +I will admit having to look it up. It's a general term used to describe an IP based Radio Access +Network (2G, 3G, 4G or 5G) which uses IP as a transport layer technology. I find it funny in a +twisted sort of way, that to get the oldskool MPLS service, I have to turn on IPRAN. 
+ +Anyway, after changing the STM profile to _ipran_, the following partition is available: + +**IPRAN** CAM | S5648X (msw-core) | S5624 (msw-top & msw-bottom) +---------------- | ---------------------------- | ----------------------------- +MAC Addresses | 32k | 98k +IPv4 routes | host: 4k, indirect: 8k | host: 12k, indirect: 56k +IPv6 routes | host: 512, indirect: 512 | host: 2048, indirect: 1024 +MPLS labels | 6656 | 6144 +VPWS instances | 1024 | 1024 +VPLS instances | 64 | 64 +Port ACL entries | ingress: 1927, egress: 176 | ingress: 2976, egress: 928 +VLAN ACL entries | ingress: 256, egress: 32 | ingress: 256, egress: 64 + +First off: there's quite a few differences here! The big switch has relatively few MAC, IPv4 and +IPv6 routes, compared to the little ones. But, it has a few more MPLS labels. ACL wise, the small +switch once again has a bit more capacity. But, of course the large switch has lots more ports (56 +versus 26), and is more expensive. Choose wisely :) + +Regarding IPv4/IPv6 and MPLS space, luckily [[AS8298]({% post_url 2021-02-27-network %})] is +relatively compact in its IGP. As of today, it carries 41 IPv4 and 48 IPv6 prefixes in OSPF, which +means that these switches would be fine participating in Area 0. If CAM space does turn into an +issue down the line, I can put them in stub areas and advertise only a default. As an aside, VPP +doesn't have any CAM at all, so for my routers the size is basically goverened by system memory +(which on modern computers equals "infinite routes"). As long as I keep it out of the DFZ, this +switch should be fine, for example in a BGP-free core that switches traffic based on VxLAN or MPLS, +but I digress. + +#### L2 + +First let's test a straight forward configuration: + +``` +msw-top# configure terminal +msw-top(config)# vlan database +msw-top(config-vlan)# vlan 5-8 +msw-top(config-vlan)# interface eth-0-1 +msw-top(config-if)# switchport access vlan 5 +msw-top(config-vlan)# interface eth-0-2 +msw-top(config-if)# switchport access vlan 6 +msw-top(config-vlan)# interface eth-0-3 +msw-top(config-if)# switchport access vlan 7 +msw-top(config-vlan)# interface eth-0-4 +msw-top(config-if)# switchport mode dot1q-tunnel +msw-top(config-if)# switchport dot1q-tunnel native vlan 8 +msw-top(config-vlan)# interface eth-0-26 +msw-top(config-if)# switchport mode trunk +msw-top(config-if)# switchport trunk allowed vlan only 5-8 +``` + +By means of demonstration, I created port `eth-0-4` as a QinQ capable port - which means that any +untagged frames coming into it will become VLAN 8, but any tagged frames will become s-tag 8 and +c-tag with whatever tag was sent, in other words standard issue QinQ tunneling. The configuration +of `msw-bottom` is exactly the same, and because we're connecting these VLANs through `msw-core`, +I'll have to make it a member of all these interfaces using the `interface range` shortcut: + +``` +msw-core# configure terminal +msw-core(config)# vlan database +msw-core(config-vlan)# vlan 5-8 +msw-core(config-vlan)# interface range eth-0-49 - 50 +msw-core(config-if)# switchport mode trunk +msw-core(config-if)# switchport trunk allowed vlan only 5-8 +``` + +The loadtest results in T-Rex are, quite unsurprisingly, line rate. 
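+
+As a quick aside on the arithmetic: a 64 byte frame occupies 84 bytes of wire time and a 128 byte
+frame 148 bytes, once preamble, start-of-frame delimiter and inter-frame gap are added. So the
+theoretical ceiling per 10G port is easy to compute -- a back-of-the-envelope sketch, not output
+from any of the gear in this lab:
+
+```
+$ python3 -c 'print(round(10e9 / ((64+20)*8)))'    # 64 byte frames, one 10G port
+14880952
+$ python3 -c 'print(round(10e9 / ((128+20)*8)))'   # 128 byte frames, one 10G port
+8445946
+```
+
+Four ports at 128 bytes is therefore ~33.8Mpps (versus ~59.5Mpps at 64 bytes), and eight ports is
+~67.6Mpps, which is what I should be seeing here.
+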
In the screenshot below, I'm +sending 128 byte frames at 8x10G (40G from `msw-top` through `msw-core` and out `msw-bottom`, and +40G in the other direction): + +{{< image src="/assets/oem-switch/l2-trex.png" alt="L2 T-Rex" >}} + +A few notes, for critical observers: +* I have to use 128 byte frames because the T-Rex loadtester is armed with 3x Intel x710 NICs, + which have a total packet rate of 40Mpps only. Intel made these with LACP redundancy in mind, + and do not recommend fully loading them. As 64b frames would be ~59.52Mpps, the NIC won't keep + up. So, I let T-Rex send 128b frames, which is ~33.8Mpps. +* T-Rex shows only the first 4 ports in detail, and you can see all four ports are sending 10Gbps + of L1 traffic, which at this frame size is 8.66Gbps of ethernet (as each frame also has a + 24 byte overhead [[ref](https://en.wikipedia.org/wiki/Ethernet_frame)]). We can clearly see + though, that all Tx packets/sec are also Rx packets/sec, which means all traffic is safely + accounted for. +* In the top panel, you will see not 4x10, but **8x10Gbps and 67.62Mpps** of total throughput, + with no traffic lost, and the loadtester CPU well within limits: 👍 + +``` +msw-top# show int summary | exc DOWN +RXBS: rx rate (bits/sec) RXPS: rx rate (pkts/sec) +TXBS: tx rate (bits/sec) TXPS: tx rate (pkts/sec) + +Interface Link RXBS RXPS TXBS TXPS +----------------------------------------------------------------------------- +eth-0-1 UP 10016060422 8459510 10016060652 8459510 +eth-0-2 UP 10016080176 8459527 10016079835 8459526 +eth-0-3 UP 10015294254 8458863 10015294258 8458863 +eth-0-4 UP 10016083019 8459529 10016083126 8459529 +eth-0-25 UP 449 0 501 0 +eth-0-26 UP 41362394687 33837608 41362394527 33837608 +``` + +Clearly, all three switches are happy to forward 40Gbps in both directions, and the 100G port is +happy to forward (at least) 40G symmetric - and because the uplink port is trunked, each ethernet +frame will be 4 bytes longer due to the dot1q tag, which, at 128b frames means we'll be using +132/128 * 4 * 10G == 41.3G of traffic, which it spot on. + + +#### L3 + +For this test, I will reconfigure the 100G ports to become routed rather than switched. Remember, +`msw-top` connects to `msw-core`, which in turn connects to `msw-bottom`, so I'll need two IPv4 /31 +and two IPv6 /64 transit networks. I'll also create a loopback interface with a stable IPv4 and IPv6 +address on each switch, and I'll tie all of these together in IPv4 and IPv6 OSPF in Area 0. The +configuration for the `msw-top` switch becomes: + +``` +msw-top# configure terminal +interface loopback0 + ip address 172.20.0.2/32 + ipv6 address 2001:678:d78:400::2/128 + ipv6 router ospf 8298 area 0 +! +interface eth-0-26 + description Core: msw-core eth-0-49 + speed 100G + no switchport + mtu 9216 + ip address 172.20.0.11/31 + ipv6 address 2001:678:d78:400::2:2/112 + ip ospf network point-to-point + ip ospf cost 1004 + ipv6 ospf network point-to-point + ipv6 ospf cost 1006 + ipv6 router ospf 8298 area 0 +! +router ospf 8298 + router-id 172.20.0.2 + network 172.20.0.0/22 area 0 + redistribute static +! +router ipv6 ospf 8298 + router-id 172.20.0.2 + redistribute static +``` + +Now that the IGP is up for IPv4 and IPv6 and I can ping the loopbacks from any switch to any other +switch, I can continue with the loadtest. I'll configure four IPv4 interfaces: + +``` +msw-top# configure terminal +interface eth-0-1 + no switchport + ip address 100.65.1.1/30 +! +interface eth-0-2 + no switchport + ip address 100.65.2.1/30 +! 
+interface eth-0-3 + no switchport + ip address 100.65.3.1/30 +! +interface eth-0-4 + no switchport + ip address 100.65.4.1/30 +! +ip route 16.0.1.0/24 100.65.1.2 +ip route 16.0.2.0/24 100.65.2.2 +ip route 16.0.3.0/24 100.65.3.2 +ip route 16.0.4.0/24 100.65.4.2 +``` + +After which I can see these transit networks and static routes propagate, through `msw-core`, and +into `msw-bottom`: + +``` +msw-bottom# show ip route +Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP + O - OSPF, IA - OSPF inter area + N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2 + E1 - OSPF external type 1, E2 - OSPF external type 2 + i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area + Dc - DHCP Client + [*] - [AD/Metric] + * - candidate default + +O 16.0.1.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56 +O 16.0.2.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56 +O 16.0.3.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56 +O 16.0.4.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56 +O 100.65.1.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56 +O 100.65.2.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56 +O 100.65.3.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56 +O 100.65.4.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56 +C 172.20.0.0/32 is directly connected, loopback0 +O 172.20.0.1/32 [110/1005] via 172.20.0.9, eth-0-26, 05:50:48 +O 172.20.0.2/32 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56 +C 172.20.0.8/31 is directly connected, eth-0-26 +C 172.20.0.8/32 is in local loopback, eth-0-26 +O 172.20.0.10/31 [110/1018] via 172.20.0.9, eth-0-26, 05:50:48 +``` + +I now instruct the T-Rex loadtester to send single-flow loadtest traffic from 16.0.X.1 -> 48.0.X.1 +on port 0; and back from 48.0.X.1 -> 16.0.X.1 on port 1; and then for port2+3 I use X=2, for +port4+5 I will use X=3, and port 6+7 I will use X=4. After T-Rex starts up, it's sending 80Gbps of +traffic with a grand total of 67.6Mpps in 8 unique flows of 8.45Mpps at 128b each, and the three +switches forward this L3 IPv4 unicast traffic effortlessly: + +{{< image src="/assets/oem-switch/l3-trex.png" alt="L3 T-Rex" >}} + +#### Overlay + +What I've built just now would be acceptable really only if the switches were in the same rack (or +at best, facility). As an industry professional, I frown upon things like _VLAN-stretching_, a term +that describes bridging VLANs between buildings (or, as some might admit to .. between cities or +even countries🤮). A long time ago (in December 1999), Luca Martini invented what is now called [[Martini +Tunnels](https://datatracker.ietf.org/doc/html/draft-martini-l2circuit-trans-mpls-00)], defining how +to transport Ethernet frames over an MPLS network, which is what I really want to demonstrate, albeit +in the next article. + +What folks don't always realize is that the industry is _moving on_ from MPLS to a set of more +flexible IP based solutions, notably tunneling using IPv4 or IPv6 UDP packets such as found in VxLAN +or GENEVE, two of my favorite protocols. This certainly does cost a little bit in VPP, as I wrote +about in my post on [[VLLs in VPP]({% post_url 2022-01-12-vpp-l2 %})], although you'd be surprised +how many VxLAN encapsulated packets/sec a simple AMD64 router can forward. With respect to these +switches, though, let's find out if tunneling this way incurs an overhead or performance penalty. +Ready? Let's go! 
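+
+As an aside for readers who are more at home in VPP than in this switch CLI: the building block I am
+about to configure corresponds roughly to the following on a VPP router. This is a sketch from
+memory, reusing the loopback addresses of this lab; the GigabitEthernet name is a placeholder, and
+the exact syntax may differ a little between VPP releases:
+
+```
+# VxLAN tunnel between two loopbacks, cross-connected with a customer-facing port.
+# Assumes 172.20.0.0 is already reachable over the underlay, as it is in this lab.
+vppctl create vxlan tunnel src 172.20.0.2 dst 172.20.0.0 vni 829810
+vppctl set interface state vxlan_tunnel0 up
+vppctl set interface l2 xconnect GigabitEthernet3/0/0 vxlan_tunnel0
+vppctl set interface l2 xconnect vxlan_tunnel0 GigabitEthernet3/0/0
+```
+
+This is also the shape of the interop I'll come back to in the conclusions: a VTEP on the Centec
+switch on one side, and a VxLAN tunnel on a VPP router on the other. But back to the switches.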
+ +First I will put the first four interfaces in range `eth-0-1 - 4` into a new set of VLANs, but in the +VLAN database I will enable what is called _overlay_ on them: + +``` +msw-top# configure terminal +vlan database + vlan 5-8,10,20,30,40 + vlan 10 name v-vxlan-xco10 + vlan 10 overlay enable + vlan 20 name v-vxlan-xco20 + vlan 20 overlay enable + vlan 30 name v-vxlan-xco30 + vlan 30 overlay enable + vlan 40 name v-vxlan-xco40 + vlan 40 overlay enable +! +interface eth-0-1 + switchport access vlan 10 +! +interface eth-0-2 + switchport access vlan 20 +! +interface eth-0-3 + switchport access vlan 30 +! +interface eth-0-4 + switchport access vlan 40 +``` + +Next, I create two new loopback interfaces (bear with me on this one), and configure the transport +of these overlays in the switch. This configuration will pick up the VLANs and move them to remote sites +in either VxLAN, GENEVE or NvGRE protocol, like this: + +``` +msw-top# configure terminal +! +interface loopback1 + ip address 172.20.1.2/32 +! +interface loopback2 + ip address 172.20.2.2/32 +! +overlay + remote-vtep 1 ip-address 172.20.0.0 type vxlan src-ip 172.20.0.2 + remote-vtep 2 ip-address 172.20.1.0 type nvgre src-ip 172.20.1.2 + remote-vtep 3 ip-address 172.20.2.0 type geneve src-ip 172.20.2.2 keep-vlan-tag + vlan 10 vni 829810 + vlan 10 remote-vtep 1 + vlan 20 vni 829820 + vlan 20 remote-vtep 2 + vlan 30 vni 829830 + vlan 30 remote-vtep 3 + vlan 40 vni 829840 + vlan 40 remote-vtep 1 +! +``` + +Alright, this is seriously cool! The first overlay defines what is called a remote `VTEP` (virtual +tunnel end point), of type `VxLAN` towards IPv4 address `172.20.0.0`, coming from source address +`172.20.0.2` (which is our `loopback0` interface on switch `msw-top`). As it turns out, I am not +allowed to create different overlay _types_ to the same _destination_ address, but not to worry: I +can create a few unique loopback interfaces with unique IPv4 addresses (see `loopback1` and +`loopback2`; and create new VTEPs using these. So, VTEP at index 2 is of type `NvGRE` and the one at +index 3 is of type `GENEVE` and due to the use of `keep-vlan-tag`, the encapsulated traffic will +carry dot1q tags, where-as in the other two VTEPs the tag will be stripped and what is transported +on the wire is untagged traffic. + +``` +msw-top# show vlan all +VLAN ID Name State STP ID Member ports + (u)-Untagged, (t)-Tagged +======= =============================== ======= ======= ======================== +(...) 
+10 v-vxlan-xco10 ACTIVE 0 eth-0-1(u) + VxLAN: 172.20.0.2->172.20.0.0 +20 v-vxlan-xco20 ACTIVE 0 eth-0-2(u) + NvGRE: 172.20.1.2->172.20.1.0 +30 v-vxlan-xco30 ACTIVE 0 eth-0-3(u) + GENEVE: 172.20.2.2->172.20.2.0 +40 v-vxlan-xco40 ACTIVE 0 eth-0-4(u) + VxLAN: 172.20.0.2->172.20.0.0 +msw-top# show mac address-table + Mac Address Table +------------------------------------------- +(*) - Security Entry (M) - MLAG Entry +(MO) - MLAG Output Entry (MI) - MLAG Input Entry +(E) - EVPN Entry (EO) - EVPN Output Entry +(EI) - EVPN Input Entry +Vlan Mac Address Type Ports +---- ----------- -------- ----- +10 6805.ca32.4595 dynamic VxLAN: 172.20.0.2->172.20.0.0 +10 6805.ca32.4594 dynamic eth-0-1 +20 6805.ca32.4596 dynamic NvGRE: 172.20.1.2->172.20.1.0 +20 6805.ca32.4597 dynamic eth-0-2 +30 9c69.b461.7679 dynamic GENEVE: 172.20.2.2->172.20.2.0 +30 9c69.b461.7678 dynamic eth-0-3 +40 9c69.b461.767a dynamic VxLAN: 172.20.0.2->172.20.0.0 +40 9c69.b461.767b dynamic eth-0-4 +``` +Turning my attention to the VLAN database, I can now see the power of this become obvious. This +switch has any number of local interfaces either tagged or untagged (in the case of VLAN 10 we can +see `eth-0-1(u)` which means that interface is participating in the VLAN untagged), but we can also +see that this VLAN 10 has a member port called `VxLAN: 172.20.0.2->172.20.0.0`. This port is just +like any other, in that it'll participate in unknown unicast, broadcast and multicast, and "learn" +MAC addresses behind these virtual overlay ports. In VLAN 10 (and VLAN 40), I can see in the L2 FIB +(`show mac address-table`), that there's a local MAC address learned (from the T-Rex loadtester) +behind `eth-0-1`, but there's also a remote MAC address learned behind the VxLAN port. I'm +impressed. + +I can add any number of VLANs (and dot1q-tunnels) into a VTEP endpoint, after assigning each of them +a unique `VNI` (virtual network identifier). If you're curious about these, take a look at the +[[VxLAN](https://datatracker.ietf.org/doc/html/rfc7348)], [[GENEVE](https://datatracker.ietf.org/doc/html/rfc8926)] +and [[NvGRE](https://www.rfc-editor.org/rfc/rfc7637.html)] specifications. Basically, the encapsulation is just putting the +ethernet frame as a payload of an UDP packet, so let's take a look at those. + +#### Inspecting overlay + +As you'll recall, the VLAN 10,20,30,40 trafffic is now traveling over an IP network, notably +encapsulated by the source switch `msw-top` and delivered to `msw-bottom` via IGP (in my case, +OSPF), while it transits through `msw-core`. I decide to take a look at this, by configuring a +monitor port on `msw-core`: + +``` +msw-core# show run | inc moni +monitor session 1 source interface eth-0-49 both +monitor session 1 destination interface eth-0-1 +``` + +This will copy all in- and egress traffic from interface `eth-0-49` (connected to `msw-top`) through +to local interface `eth-0-1`, which is connected to the loadtester. 
I can simply tcpdump this stuff: + +``` +pim@trex01:~$ sudo tcpdump -ni eno2 '(proto gre) or (udp and port 4789) or (udp and port 6081)' +01:26:24.685666 00:1e:08:26:ec:f3 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 174: + (tos 0x0, ttl 127, id 7496, offset 0, flags [DF], proto UDP (17), length 160) + 172.20.0.0.49208 > 172.20.0.2.4789: VXLAN, flags [I] (0x08), vni 829810 + 68:05:ca:32:45:95 > 68:05:ca:32:45:94, ethertype IPv4 (0x0800), length 124: + (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110) + 48.0.1.47.1025 > 16.0.1.47.12: UDP, length 82 +01:26:24.688305 00:1e:08:0d:6e:88 > 00:1e:08:26:ec:f3, ethertype IPv4 (0x0800), length 166: + (tos 0x0, ttl 128, id 44814, offset 0, flags [DF], proto GRE (47), length 152) + 172.20.1.2 > 172.20.1.0: GREv0, Flags [key present], key=0xca97c38, proto TEB (0x6558), length 132 + 68:05:ca:32:45:97 > 68:05:ca:32:45:96, ethertype IPv4 (0x0800), length 124: + (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110) + 48.0.2.73.1025 > 16.0.2.73.12: UDP, length 82 +01:26:24.689100 00:1e:08:26:ec:f3 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 178: + (tos 0x0, ttl 127, id 7502, offset 0, flags [DF], proto UDP (17), length 164) + 172.20.2.0.49208 > 172.20.2.2.6081: GENEVE, Flags [none], vni 0xca986, proto TEB (0x6558) + 9c:69:b4:61:76:79 > 9c:69:b4:61:76:78, ethertype 802.1Q (0x8100), length 128: vlan 30, p 0, ethertype IPv4 (0x0800), + (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110) + 48.0.3.109.1025 > 16.0.3.109.12: UDP, length 82 +01:26:24.701666 00:1e:08:0d:6e:89 > 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 174: + (tos 0x0, ttl 127, id 7496, offset 0, flags [DF], proto UDP (17), length 160) + 172.20.0.0.49208 > 172.20.0.2.4789: VXLAN, flags [I] (0x08), vni 829840 + 68:05:ca:32:45:95 > 68:05:ca:32:45:94, ethertype IPv4 (0x0800), length 124: + (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110) + 48.0.4.47.1025 > 16.0.4.47.12: UDP, length 82 +``` + +We can see packets for all four tunnels in this dump. The first one is a UDP packet to port 4789, +which is the standard port for VxLAN, and it has VNI 829810. The second packet is proto GRE with +flag `TEB` which stands for _transparent ethernet bridge_ in other words an L2 variant of GRE that +carries ethernet frames. The third one shows that feature I configured above (in case you forgot it, +it's the `keep-vlan-tag` option when creating the VTEP), and because of that flag we can see that +the inner payload carries the `vlan 30` tag, neat! The `VNI` there is `0xca986` which is hex for +`829830`. Finally, the fourth one shows VLAN40 traffic that is sent to the same VTEP endpoint as +VLAN10 traffic (showing that multiple VLANs can be transported across the same tunnel, distinguished +by VNI). + +{{< image width="90px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}} + +At this point I make an important observation. VxLAN and GENEVE both have this really cool feature +that they can hash their _inner_ payload (ie. the IPv4/IPv6 address and ports if available) and use +that to randomize the _source port_, which makes them preferable to GRE. The reason why this is +preferable is hashing makes these inner flows become unique outer flows, which in turn allows them +to be loadbalanced in intermediate networks, but also in the receiver if it has multiple receive +queues. 
However, and this is important: the switch does not hash, which means that all ethernet
+traffic in the VxLAN, GENEVE and NvGRE tunnels always has the exact same outer header, so
+loadbalancing and multiple receive queues are out of the question. I wonder if this is a limitation
+of the Centec chip, or a failure of the firmware to program or configure it.
+
+With that gripe out of the way, let's take a look at 80Gbit of tunneled traffic, shall we?
+
+{{< image src="/assets/oem-switch/overlay-trex.png" alt="Overlay T-Rex" >}}
+
+Once again, all three switches are acing it. So at least 40Gbps of encapsulation and 40Gbps of
+decapsulation per switch, and the transport over IPv4 through the `msw-core` switch to the other
+side, is all in working order. On top of that, I've shown that multiple types of overlay can live
+alongside one another, even between the same pair of switches, and that multiple VLANs can share
+the same underlay transport. The only downside is the **single flow** nature of these UDP transports.
+
+A final inspection of the switch throughput:
+
+```
+msw-top# show interface summary | exc DOWN
+RXBS: rx rate (bits/sec) RXPS: rx rate (pkts/sec)
+TXBS: tx rate (bits/sec) TXPS: tx rate (pkts/sec)
+
+Interface Link RXBS RXPS TXBS TXPS
+-----------------------------------------------------------------------------
+eth-0-1 UP 10013004482 8456929 10013004548 8456929
+eth-0-2 UP 10013030687 8456951 10013030801 8456951
+eth-0-3 UP 10012625863 8456609 10012626030 8456609
+eth-0-4 UP 10013032737 8456953 10013034423 8456954
+eth-0-25 UP 505 0 513 0
+eth-0-26 UP 51147539721 33827761 51147540111 33827762
+
+```
+
+Take a look at that `eth-0-26` interface: it's using significantly more bandwidth (51Gbps) than
+the sum of the four transports (4x10Gbps). This is because each ethernet frame (of 128b) has to be
+wrapped in an IPv4 UDP packet (or in the case of NvGRE, an IPv4 packet with a GRE header), which
+incurs quite some overhead, for small packets at least. But it definitely proves that the switches
+here are happy to do this forwarding at line rate, and that's what counts!
+
+### Conclusions
+
+It's just super cool to see a switch like this work as expected. I did not manage to overload it at
+all, neither with an IPv4 loadtest at 67Mpps and 80Gbit of traffic, nor with an L2 loadtest with
+four ports transported over VxLAN, NvGRE and GENEVE, _at the same time_. Although the underlay can
+only use IPv4 (no IPv6 is available in the switch chip), this is not a huge problem for me. At
+AS8298, I can easily define some private VRF with IPv4 space from RFC1918 to do the transport of
+traffic over VxLAN. And what's even better, this can perfectly inter-operate with my VPP routers
+which also do VxLAN en/decapsulation.
+
+Now there is one more thing for me to test (and, cliffhanger, I've tested it already but I'll have
+to write up all of my data and results ...). I need to do what I said I would do in the beginning of
+this article, and what I had hoped to achieve with the FS switches but failed to achieve due to lack
+of support: MPLS L2VPN transport (and its more complex but cooler sibling, VPLS).
diff --git a/content/articles/2022-12-09-oem-switch-2.md b/content/articles/2022-12-09-oem-switch-2.md new file mode 100644 index 0000000..e60cf9e --- /dev/null +++ b/content/articles/2022-12-09-oem-switch-2.md @@ -0,0 +1,552 @@ +--- +date: "2022-12-09T11:56:54Z" +title: 'Review: S5648X-2Q4Z Switch - Part 2: MPLS' +--- + +After receiving an e-mail from a newer [[China based OEM](https://starry-networks.com)], I had a chat with their +founder and learned that the combination of switch silicon and software may be a good match for IPng Networks. + +I got pretty enthusiastic when this new vendor claimed VxLAN, GENEVE, MPLS and GRE at 56 ports and +line rate, on a really affordable budget ($4'200,- for the 56 port; and $1'650,- for the 26 port +switch). This reseller is using a less known silicon vendor called +[[Centec](https://www.centec.com/silicon)], who have a lineup of ethernet silicon. In this device, +the CTC8096 (GoldenGate) is used for cost effective high density 10GbE/40GbE applications paired +with 4x100GbE uplink capability. This is Centec's fourth generation, so CTC8096 inherits the feature +set from L2/L3 switching to advanced data center and metro Ethernet features with innovative +enhancement. The switch chip provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE +ports, inheriting from its predecessors a variety of features, including L2, L3, MPLS, VXLAN, MPLS +SR, and OAM/APS. Highlights features include Telemetry, Programmability, Security and traffic +management, and Network time synchronization. + +{{< image width="450px" float="left" src="/assets/oem-switch/S5624X-front.png" alt="S5624X Front" >}} + +{{< image width="450px" float="right" src="/assets/oem-switch/S5648X-front.png" alt="S5648X Front" >}} + +

+
+After discussing basic L2, L3 and Overlay functionality in my [[previous post]({% post_url
+2022-12-05-oem-switch-1 %})], I left somewhat of a cliffhanger alluding to all this fancy MPLS and
+VPLS stuff. Honestly, I needed a bit more time to play around with the featureset and clarify a few
+things. I'm now ready to assert that this stuff is really possible on this switch, and if this
+tickles your fancy, by all means read on :)
+
+## Detailed findings
+
+### Hardware
+
+{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-front-opencase.png" alt="Front" >}}
+
+The switch comes well packaged with two removable 400W Gold powersupplies from _Compuware
+Technology_ which output 12V/33A and +5V/3A as well as four removable PWM controlled fans from
+_Protechnic_. The switch chip is a Centec
+[[CTC8096](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC8096)] which is a competent silicon unit that can offer
+48x10, 2x40 and 4x100G, and its smaller sibling carries the newer
+[[CTC7132](https://www.centec.com/silicon/Ethernet-Switch-Silicon/CTC7132)] from 2019, which brings
+24x10 and 2x100G connectivity. The firmware names differ slightly: the large one shows
+`NetworkOS-e580-v7.4.4.r.bin` as the firmware, and the smaller one shows `uImage-v7.0.4.40.bin`.
+I get the impression that the latter is a compiled down version of the former to work with the
+newer chipset.
+
+In my [[previous post]({% post_url 2022-12-05-oem-switch-1 %})], I showed the L2, L3, VxLAN, GENEVE
+and NvGRE capabilities of this switch to be line rate. But the hardware also supports MPLS, so I
+figured I'd complete the Overlay series by exploring VxLAN, and the MPLS, EoMPLS (L2VPN, Martini
+style), and VPLS functionality of these units.
+
+### Topology
+
+{{< image width="500px" float="right" src="/assets/oem-switch/topology.svg" alt="Front" >}}
+
+In the [[IPng Networks LAB]({% post_url 2022-10-14-lab-1 %})], I built the following topology using
+the loadtester, packet analyzer, and switches:
+
+* **msw-top**: S5624-2Z-EI switch
+* **msw-core**: S5648X-2Q4ZA switch
+* **msw-bottom**: S5624-2Z-EI switch
+* All switches connect to:
+  * each other with 100G DACs (right, black)
+  * T-Rex machine with 4x10G (left, rainbow)
+* Each switch gets a mgmt IPv4 and IPv6
+
+This is the same topology as in the previous post, and it gives me lots of wiggle room to patch
+anything to anything as I build point to point MPLS tunnels, VPLS clouds and eVPN overlays. Although
+I will also load/stress test these configurations, this post is more about the higher level
+configuration work that goes into building such an MPLS enabled telco network.
+
+### MPLS
+
+Why even bother, if we have these fancy new IP based transports that I [[wrote about]({% post_url
+2022-12-05-oem-switch-1 %})] last week? I mentioned that the industry is _moving on_ from MPLS
+to a set of more flexible IP based solutions like VxLAN and GENEVE, as they certainly offer lots of
+benefits in deployment (notably as overlays on top of existing IP networks).
+
+Here's one plausible answer: you may have come across an architectural network design concept known
+as the [[BGP Free Core](https://bgphelp.com/2017/02/12/bgp-free-core/)]; operating this way gives
+very little room for outages to occur in the L2 (Ethernet and MPLS) transport network, because it's
+relatively simple in design and implementation.
Some advantages worth mentioning: + +* Transport devices do not need to be capable of supporting a large number of IPv4/IPv6 routes, either + in the RIB or FIB, allowing them to be much cheaper. +* As there is no eBGP, transport devices will not be impacted by BGP-related issues, such as high CPU + utilization during massive BGP re-convergence. +* Also, without eBGP, some of the attack vectors in ISPs (loopback DDoS or ARP storms on public + internet exchange, to take two common examples) can be eliminated. If a new BGP security + vulnerability were to be discovered, transport devices aren't impacted. +* Operator errors (the #1 reason for outages in our industry) associated with BGP configuration and + the use of large RIBs (eg. leaking into IGP, flapping transit sessions, etc) can be eradicated. +* New transport services such as MPLS point to point virtual leased lines, SR-MPLS, VPLS clouds, and + eVPN can all be introduced without modifying the routing core. + +If deployed correctly, this type of transport-only network can be kept entirely isolated from the Internet, +making DDoS and hacking attacks against transport elements impossible, and it also opens up possibilities +for relatively safe sharing of infrastructure resources between ISPs (think of things like dark fibers +between locations, rackspace, power, cross connects). + +For smaller clubs (like IPng Networks), being able to share a 100G wave with others, significantly reduces +price per Megabit! So if you're in Zurich, Switzerland, or Europe and find this an interesting avenue to +expand your reach in a co-op style environment, [[reach out](/s/contact)] to us, any time! + +#### MPLS + LDP Configuration + +OK, let's talk bits and bytes. Table stakes functionality is of course MPLS switching and label distribution, +which is performed with LDP, described in [[RFC3036](https://www.rfc-editor.org/rfc/rfc3036.html)]. +Enabling these features is relatively straight forward: + +``` +msw-top# show run int loop0 +interface loopback0 + ip address 172.20.0.2/32 + ipv6 address 2001:678:d78:400::2/128 + ipv6 router ospf 8298 area 0 + +msw-top# show run int eth-0-25 +interface eth-0-25 + description Core: msw-bottom eth-0-25 + speed 100G + no switchport + mtu 9216 + label-switching + ip address 172.20.0.12/31 + ipv6 address 2001:678:d78:400::3:1/112 + ip ospf network point-to-point + ip ospf cost 104 + ipv6 ospf network point-to-point + ipv6 ospf cost 106 + ipv6 router ospf 8298 area 0 + enable-ldp + +msw-top# show run router ospf +router ospf 8298 + network 172.20.0.0/24 area 0 + +msw-top# show run router ipv6 ospf +router ipv6 ospf 8298 + router-id 172.20.0.2 + +msw-top# show run router ldp +router ldp + router-id 172.20.0.2 + transport-address 172.20.0.2 +``` + +This seems like a mouthful, but really not too complicated. From the top, I create a loopback +interface with an IPv4 (/32) and IPv6 (/128) address. Then, on the 100G transport interfaces, I +specify an IPv4 (/31, let's not be wasteful, take a look at [[RFC +3021](https://www.rfc-editor.org/rfc/rfc3021.html)]) and IPv6 (/112) transit network, after which I +add the interface to OSPF and OSPFv3. + +The main two things to note in the interface definition is the use of `label-switching` which enables +MPLS on the interface, and `enable-ldp` which makes it periodically multicast LDP discovery +packets. If another device is also doing that, an LDP _adjacency_ is formed using a TCP session. 
+The two devices then exchange MPLS label tables, so that they learn from each other how to switch +MPLS packets across the network. + +LDP _signalling_ kind of looks like this on the wire: +``` +14:21:43.741089 IP 172.20.0.12.646 > 224.0.0.2.646: LDP, Label-Space-ID: 172.20.0.2:0, pdu-length: 30 +14:21:44.331613 IP 172.20.0.13.646 > 224.0.0.2.646: LDP, Label-Space-ID: 172.20.0.1:0, pdu-length: 30 + +14:21:44.332773 IP 172.20.0.2.36475 > 172.20.0.1.646: Flags [S],seq 195175, win 27528, + options [mss 9176,sackOK,TS val 104349486 ecr 0,nop,wscale 7], length 0 +14:21:44.333700 IP 172.20.0.1.646 > 172.20.0.2.36475: Flags [S.], seq 466968, ack 195176, win 18328, + options [mss 9176,sackOK,TS val 104335979 ecr 104349486,nop,wscale 7], length 0 +14:21:44.334313 IP 172.20.0.2.36475 > 172.20.0.1.646: Flags [.], ack 1, win 216, + options [nop,nop,TS val 104349486 ecr 104335979], length 0 +``` + +The first two packets here are the routers announcing to [[well known +multicast](https://en.wikipedia.org/wiki/Multicast_address)] address for _all-routers_ (224.0.0.2), +and well known port 646 (for LDP), in a packet called a _Hello Message_. The router with +address 172.20.0.12 is the one we just configured (`msw-top`), and the one with address 172.20.0.13 is +the other side (`msw-bottom`). In these _Hello messages_, the router informs multicast listeners +where they should connect (called the _IPv4 transport address_), in the case of `msw-top`, it's +172.20.0.2. + +Now that they've noticed one anothers willingness to form an adjacency, a TCP connection is +initiated from our router's loopback address (specified by `transport-address` in the LDP +configuration), towards the loopback that was learned from the _Hello Message_ in the multicast +packet earlier. A TCP three way handshake follows, in which the routers also tell each other their +MTU (by means of the MSS field set to 9176, which is 9216 minus 20 bytes [[IPv4 +header](https://en.wikipedia.org/wiki/Internet_Protocol_version_4)] and 20 bytes [[TCP +header](https://en.wikipedia.org/wiki/Transmission_Control_Protocol)]). The adjacency forms and both +routers exchange label information (in things called a _Label Mapping Message_). Once done +exchanging this info, `msw-top` can now switch MPLS packets across its two 100G interfaces. + +Zooming back out from what happened on the wire with the LDP _signalling_, I can take a look at the +`msw-top` switch: besides the adjacency that I described in detail above, another one has formed over +the IPv4 transit network between `msw-top` and `msw-core` (refer to the topology diagram to see what +connects where). As this is a layer3 network, icky things like spanning tree and forwarding loops +are no longer an issue. Any switch can forward MPLS packets to any neighbor in this topology, preference +on the used path is informed with OSPF costs for the IPv4 interfaces (because LDP is using IPv4 here). + +``` +msw-top# show ldp adjacency +IP Address Intf Name Holdtime LDP-Identifier +172.20.0.10 eth-0-26 15 172.20.0.1:0 +172.20.0.13 eth-0-25 15 172.20.0.0:0 + +msw-top# show ldp session +Peer IP Address IF Name My Role State KeepAlive +172.20.0.0 eth-0-25 Active OPERATIONAL 30 +172.20.0.1 eth-0-26 Active OPERATIONAL 30 +``` + +#### MPLS pseudowire + +The easiest form (and possibly most widely used one), is to create a point to point ethernet link +betwen an interface on one switch, through the MPLS network, and into another switch's interface on +the other side. Think of this as a really long network cable. 
Ethernet frames are encapsulated into +an MPLS frame, and passed through the network though some sort of tunnel, called a _pseudowire_. + +There are many names of this tunneling technique. Folks refer to them as PWs (PseudoWires), VLLs +(Virtual Leased Lines), Carrier Ethernet, or Metro Ethernet. Luckily, these are almost always +interoperable, because under the covers, the vendors are implementing these MPLS cross connect +circuits using [[Martini Tunnels](https://datatracker.ietf.org/doc/html/draft-martini-l2circuit-trans-mpls-00)] +which were formalized in [[RFC 4447](https://datatracker.ietf.org/doc/html/rfc4447)]. + +The way Martini tunnels work is by creating an extension in LDP signalling. An MPLS label-switched-path is +annotated as being of a certain type, carrying a 32 bit _pseudowire ID_, which is ignored by all +intermediate routers (they will just switch the MPLS packet onto the next hop), but the last router +will inspect the MPLS packet and find which _pseudowire ID_ it belongs to, and look up in its local +table what to do with it (mostly just unwrap the MPLS packet, and marshall the resulting ethernet +frame into an interface or tagged sub-interface). + +Configuring the _pseudowire_ is really simple: + +``` +msw-top# configure terminal +interface eth-0-1 + mpls-l2-circuit pw-vll1 ethernet +! + +mpls l2-circuit pw-vll1 829800 172.20.0.0 raw mtu 9000 + +msw-top# show ldp mpls-l2-circuit 829800 +Transport Client VC Trans Local Remote Destination +VC ID Binding State Type VC Label VC Label Address +829800 eth-0-1 UP Ethernet 32774 32773 172.20.0.0 + +``` + +After I've configured this on both `msw-top` and `msw-bottom`, using LDP signalling, a new LSP will be +set up which carries ethernet packets at up to 9000 bytes, encapsulated MPLS, over the network. To +show this in more detail, I'll take the two ethernet interfaces that are connected to `msw-top:eth-0-1` +and `msw-bottom:eth-0-1`, and move them in their own network namespace on the lab machine: + +``` +root@dut-lab:~# ip netns add top +root@dut-lab:~# ip netns add bottom +root@dut-lab:~# ip link set netns top enp66s0f0 +root@dut-lab:~# ip link set netns bottom enp66s0f1 +``` + +I can now enter the _top_ and _bottom_ namespaces, and play around with those interfaces, for example +I'll give them an IPv4 address and a sub-interface with dot1q tag 1234 and an IPv6 address: + +``` +root@dut-lab:~# nsenter --net=/var/run/netns/bottom +root@dut-lab:~# ip addr add 192.0.2.1/31 dev enp66s0f1 +root@dut-lab:~# ip link add link enp66s0f1 name v1234 type vlan id 1234 +root@dut-lab:~# ip addr add 2001:db8::2/64 dev v1234 +root@dut-lab:~# ip link set v1234 up + +root@dut-lab:~# nsenter --net=/var/run/netns/top +root@dut-lab:~# ip addr add 192.0.2.0/31 dev enp66s0f0 +root@dut-lab:~# ip link add link enp66s0f0 name v1234 type vlan id 1234 +root@dut-lab:~# ip addr add 2001:db8::1/64 dev v1234 +root@dut-lab:~# ip link set v1234 up + +root@dut-lab:~# ping -c 5 2001:db8::2 +PING 2001:db8::2(2001:db8::2) 56 data bytes +64 bytes from 2001:db8::2: icmp_seq=1 ttl=64 time=0.158 ms +64 bytes from 2001:db8::2: icmp_seq=2 ttl=64 time=0.155 ms +64 bytes from 2001:db8::2: icmp_seq=3 ttl=64 time=0.162 ms +``` + +The `mpls-l2-circuit` that I created will transport the received ethernet frames between `enp66s0f0` +(in the _top_ namespace) and `enp66s0f1` (in the _bottom_ namespace), using MPLS encapsulation, and +giving the packets a stack of _two_ labels. 
The outer most label helps the switches determine where +to switch the MPLS packet (in other words, route it from `msw-top` to `msw-bottom`). Once the +destination is reached, the outer label is popped off the stack, to reveal the second label, the +purpose of which is to tell the `msw-bottom` switch what, preciesly, to do with this payload. The +switch will find that the second label instructs it to transmit the MPLS payload as an ethernet +frame out on port `eth-0-1`. + +If I want to look at what happens on the wire with tcpdump(8), I can use the monitor port on +`msw-core` which mirrors all packets transiting through it. But, I don't get very far: + +``` +root@dut-lab:~# tcpdump -evni eno2 mpls +19:57:37.055854 00:1e:08:0d:6e:88 > 00:1e:08:26:ec:f3, ethertype MPLS unicast (0x8847), length 144: +MPLS (label 32768, exp 0, ttl 255) (label 32773, exp 0, [S], ttl 255) + 0x0000: 9c69 b461 7679 9c69 b461 7678 8100 04d2 .i.avy.i.avx.... + 0x0010: 86dd 6003 4a42 0040 3a40 2001 0db8 0000 ..`.JB.@:@...... + 0x0020: 0000 0000 0000 0000 0001 2001 0db8 0000 ................ + 0x0030: 0000 0000 0000 0000 0002 8000 3553 9326 ............5S.& + 0x0040: 0001 2185 9363 0000 0000 e7d9 0000 0000 ..!..c.......... + 0x0050: 0000 1011 1213 1415 1617 1819 1a1b 1c1d ................ + 0x0060: 1e1f 2021 2223 2425 2627 2829 2a2b 2c2d ...!"#$%&'()*+,- + 0x0070: 2e2f 3031 3233 3435 3637 ./01234567 +19:57:37.055890 00:1e:08:26:ec:f3 > 00:1e:08:0d:6e:88, ethertype MPLS unicast (0x8847), length 140: +MPLS (label 32774, exp 0, [S], ttl 254) + 0x0000: 9c69 b461 7678 9c69 b461 7679 8100 04d2 .i.avx.i.avy.... + 0x0010: 86dd 6009 4122 0040 3a40 2001 0db8 0000 ..`.A".@:@...... + 0x0020: 0000 0000 0000 0000 0002 2001 0db8 0000 ................ + 0x0030: 0000 0000 0000 0000 0001 8100 3453 9326 ............4S.& + 0x0040: 0001 2185 9363 0000 0000 e7d9 0000 0000 ..!..c.......... + 0x0050: 0000 1011 1213 1415 1617 1819 1a1b 1c1d ................ + 0x0060: 1e1f 2021 2223 2425 2627 2829 2a2b 2c2d ...!"#$%&'()*+,- + 0x0070: 2e2f 3031 3233 3435 3637 ./01234567 +``` + +For a brief moment, I stare closely at the first part of the hex dump, and I recognize two MAC addresses +`9c69.b461.7678` and `9c69.b461.7679` followed by what appears to be `0x8100` (the ethertype for +[[Dot1Q](https://en.wikipedia.org/wiki/IEEE_802.1Q)]) and then `0x04d2` (which is 1234 in decimal, +the VLAN tag I chose). + +Clearly, the hexdump here is "just" an ethernet frame. So why doesn't tcpdump decode it? The answer is simple: +nothing in the MPLS packet tells me that the payload is actually ethernet. It could be anything, and +it's really up to the recipient of the packet with the label 32773 to determine what its payload means. +Luckily, Wireshark can be prompted to decode further based on which MPLS label is present. Using the +_Decode As..._ option, I can specify that data following label 32773 is _Ethernet PW (no CW)_, where +PW here means _pseudowire_ and CW means _controlword_. Et, voilà, the first packet reveals itself: + +{{< image src="/assets/oem-switch/mpls-wireshark-1.png" alt="MPLS Frame #1 disected" >}} + +#### Pseudowires on Sub Interfaces + +One very common use case for me at IPng Networks is to work with excellent partners like +[[IP-Max](https://www.ip-max.net/)] who provide Internet Exchange transport, for example from DE-CIX +or SwissIX, to the customer premises. 
IP-Max uses Cisco's ASR9k routers, an absolutely beautiful piece +of technology [[ref]({% post_url 2022-02-21-asr9006 %})], and with those you can terminate a _L2VPN_ in +any sub-interface. + +Let's configure something similar. I take one port on `msw-top`, and branch that out into three +remote locations, in this case `msw-bottom` port 1, 2 and 3. I will be terminating all three _pseudowires_ +on the same endpoint, but obviously this could also be one port that goes to three internet exchanges, +say SwissIX, DE-CIX and FranceIX, on three different endpoints. + +The configuration for both switches will look like this: + +``` +msw-top# configure terminal +interface eth-0-1 + switchport mode trunk + switchport trunk native vlan 5 + switchport trunk allowed vlan add 6-8 + mpls-l2-circuit pw-vlan10 vlan 10 + mpls-l2-circuit pw-vlan20 vlan 20 + mpls-l2-circuit pw-vlan30 vlan 30 + +mpls l2-circuit pw-vlan10 829810 172.20.0.0 raw mtu 9000 +mpls l2-circuit pw-vlan20 829820 172.20.0.0 raw mtu 9000 +mpls l2-circuit pw-vlan30 829830 172.20.0.0 raw mtu 9000 + +msw-bottom# configure terminal +interface eth-0-1 + mpls-l2-circuit pw-vlan10 ethernet +interface eth-0-2 + mpls-l2-circuit pw-vlan20 ethernet +interface eth-0-3 + mpls-l2-circuit pw-vlan30 ethernet + +mpls l2-circuit pw-vlan10 829810 172.20.0.2 raw mtu 9000 +mpls l2-circuit pw-vlan20 829820 172.20.0.2 raw mtu 9000 +mpls l2-circuit pw-vlan30 829830 172.20.0.2 raw mtu 9000 +``` + +Previously, I configured the port in _ethernet_ mode, which takes all frames and forwards them into +the MPLS tunnel. In this case, I'm using _vlan_ mode, specifying a VLAN tag that, when frames arrive +on the port matching it, will selectively be put into a pseudowire. As an added benefit, this allows +me to still use the port as a regular switchport, in the snippet above it will take untagged frames +and assign them to VLAN 5, allow tagged frames with dot1q VLAN tag 6, 7 or 8, and handle them as any +normal switch would. VLAN tag 10, however, is directed into the pseudowire called _pw-vlan10_, and +the other two tags similarly get put into their own `l2-circuit`. Using LDP signalling, the _pw-id_ +(829810, 829820, and 829830) determines which label is assigned. On the way back, that label +allows the switch to correlate the ethernet frame with the correct port and transmit the it with the +configured VLAN tag. + +To show this from an end-user point of view, let's take a look at the Linux server connected to these +switches. 
I'll put one port in a namespace called _top_, and three other ports in a network namespace
+called _bottom_, and then proceed to give them a little bit of config:
+
+```
+root@dut-lab:~# ip link set netns top dev enp66s0f0
+root@dut-lab:~# ip link set netns bottom dev enp66s0f1
+root@dut-lab:~# ip link set netns bottom dev enp66s0f2
+root@dut-lab:~# ip link set netns bottom dev enp4s0f1
+
+root@dut-lab:~# nsenter --net=/var/run/netns/top
+root@dut-lab:~# ip link add link enp66s0f0 name v10 type vlan id 10
+root@dut-lab:~# ip link add link enp66s0f0 name v20 type vlan id 20
+root@dut-lab:~# ip link add link enp66s0f0 name v30 type vlan id 30
+root@dut-lab:~# ip addr add 192.0.2.0/31 dev v10
+root@dut-lab:~# ip addr add 192.0.2.2/31 dev v20
+root@dut-lab:~# ip addr add 192.0.2.4/31 dev v30
+
+root@dut-lab:~# nsenter --net=/var/run/netns/bottom
+root@dut-lab:~# ip addr add 192.0.2.1/31 dev enp66s0f1
+root@dut-lab:~# ip addr add 192.0.2.3/31 dev enp66s0f2
+root@dut-lab:~# ip addr add 192.0.2.5/31 dev enp4s0f1
+root@dut-lab:~# ping 192.0.2.4
+PING 192.0.2.4 (192.0.2.4) 56(84) bytes of data.
+64 bytes from 192.0.2.4: icmp_seq=1 ttl=64 time=0.153 ms
+64 bytes from 192.0.2.4: icmp_seq=2 ttl=64 time=0.209 ms
+```
+
+To unpack this a little bit, in the first block I assign the interfaces to their respective
+namespaces. Then, for the interface connected to the `msw-top` switch, I create three dot1q
+sub-interfaces, corresponding to the pseudowires I created. Note: untagged traffic out of
+`enp66s0f0` will simply be picked up by the switch and assigned VLAN 5 (and I'm also allowed to send
+VLAN tags 6, 7 and 8, which will all be handled locally).
+
+But, VLAN 10, 20 and 30 will be moved through the MPLS network and pop out on the `msw-bottom`
+switch, where they are each assigned a unique port, represented by `enp66s0f1`, `enp66s0f2` and
+`enp4s0f1` connected to the bottom switch.
+
+When I finally ping 192.0.2.4, that ICMP packet goes out on `enp4s0f1` and enters
+`msw-bottom:eth-0-3`, where it gets assigned the pseudowire name _pw-vlan30_, which corresponds to
+the _pw-id_ 829830. It then travels over the MPLS network, arriving at `msw-top` carrying a label
+that tells that switch that it belongs to its local _pw-id_ 829830, which corresponds to the name
+_pw-vlan30_ and is assigned VLAN tag 30 on port `eth-0-1`. Phew, I made it. It actually makes sense
+when you think about it!
+
+#### VPLS
+
+The _pseudowires_ that I described in the previous section are simply ethernet cross connects
+spanning over an MPLS network. They are inherently point-to-point, much like a physical Ethernet
+cable is. Sometimes, it makes more sense to take a local port and create what is called a _Virtual
+Private LAN Service_ (VPLS), described in [[RFC4762]](https://www.rfc-editor.org/rfc/rfc4762.html),
+where packets into this port are capable of being sent to any number of other ports on any number of
+other switches, while using MPLS as transport.
+
+By means of example, let's say a telco offers me one port in Amsterdam, one in Zurich and one in
+Frankfurt. A VPLS instance would create an emulated LAN segment between these locations, in other
+words a Layer 2 broadcast domain that is fully capable of learning and forwarding on Ethernet MAC
+addresses, but where the ports are dedicated to me and isolated from other customers. The telco
+has essentially created a three-port switch for me, but at the same time, that telco can create
+any number of VPLS services, each one unique to their individual customers.
It's a pretty powerful +concept. + +In principle, a VPLS consists of two parts: + +1. A full mesh of simple MPLS point-to-point tunnels from each participating switch to + each other one. These are just _pseudowires_ with a given _pw-id_, just like I showed before. +1. The _pseudowires_ are then tied together in a form of bridge domain, and learning is applied to + MAC addresses that appear behind each port, signalling that these are available behind the port. + +Configuration on the switch looks like this: + +``` +msw-top# configure terminal +interface eth-0-1 + mpls-vpls v-ipng ethernet +interface eth-0-2 + mpls-vpls v-ipng ethernet +interface eth-0-3 + mpls-vpls v-ipng ethernet +interface eth-0-4 + mpls-vpls v-ipng ethernet +! +mpls vpls v-ipng 829801 + vpls-peer 172.20.0.0 raw + vpls-peer 172.20.0.1 raw +``` + +The first set of commands add each individual interface into the VPLS instance by binding it to a +name, in this case _v-ipng_. Then, the VPLS neighbors are specified, by offering a _pw-id_ (829801) +which is used to construct a _pseudowire_ to the two peers. The first, 172.20.0.0 is `msw-bottom`, and +the other, 172.20.0.1 is `msw-core`. Each switch that participates in the VPLS for _v-ipng_ will signal +LSPs to each of its peers, and MAC learning will be enabled just as if each of these _pseudowires_ +were a regular switchport. + +Once I configure this pattern on all three switches, effectively interfaces `eth-0-1 - 4` are now +bound together as a virtual switch with a unique broadcast domain dedicated to instance _v-ipng_. +I've created a fully transparent 12-port switch, which means that what-ever traffic I generate, will +be encapsulated in MPLS and sent through the MPLS network towards its destination port. + +Let's take a look at the `msw-core` switch to see how this looks like: + +``` +msw-core# show ldp vpls +VPLS-ID Peer Address State Type Label-Sent Label-Rcvd Cw +829801 172.20.0.0 Up ethernet 32774 32773 0 +829801 172.20.0.2 Up ethernet 32776 32774 0 + +msw-core# show mpls vpls mesh +VPLS-ID Peer Addr/name In-Label Out-Intf Out-Label Type St Evpn Type2 Sr-tunid +829801 172.20.0.0/- 32777 eth-0-50 32775 RAW Up N N - +829801 172.20.0.2/- 32778 eth-0-49 32776 RAW Up N N - + +msw-core# show mpls vpls detail +Virtual Private LAN Service Instance: v-ipng, ID: 829801 + Group ID: 0, Configured MTU: NULL + Description: none + AC interface : + Name TYPE Vlan + eth-0-1 Ethernet ALL + eth-0-2 Ethernet ALL + eth-0-3 Ethernet ALL + eth-0-4 Ethernet ALL + Mesh Peers : + Peer TYPE State C-Word Tunnel name LSP name + 172.20.0.0 RAW UP Disable N/A N/A + 172.20.0.2 RAW UP Disable N/A N/A + Vpls-mac-learning enable + Discard broadcast disabled + Discard unknown-unicast disabled + Discard unknown-multicast disabled +``` + +Putting this to the test, I decide to run a loadtest saturating 12x 10G of traffic through this +spiffy 12-port virtual switch. I randomly assign ports on the loadtester to the 12 ports in the +_v-ipng_ VPLS, and then I start full line rate load with 128 byte packets. Considering I'm using +twelve TenGig ports, I would expect 12x8.43 or roughly 101Mpps flowing, and indeed, the loadtests +demonstrate this mark nicely: + +{{< image src="/assets/oem-switch/vpls-trex.png" alt="VPLS T-Rex" >}} + +**Important**: The screenshot above shows the first four ports on the T-Rex interface only, but there +are actually _twelve ports_ participating in this loadtest. In the top right corner, the total +throughput is correctly represented. 
The switches are handling 120Gbps of L1 and 103.5Gbps of L2 (which
+is expected at 128b frames, as there is a little bit of ethernet overhead for each frame), for a
+whopping 101Mpps -- exactly what I would expect.
+
+And the chassis doesn't even get warm.
+
+### Conclusions
+
+It's just super cool to see a switch like this work as expected. I did not manage to overload it at
+all: in my [[previous article]({% post_url 2022-12-05-oem-switch-1 %})], I showed VxLAN, GENEVE and
+NvGRE overlays at line rate, and here I can see that MPLS with all of its Martini bells and
+whistles, as well as the more advanced VPLS, is keeping up like a champ. I think that, at least for
+initial configuration and throughput on all MPLS features I tested, both the small 24x10 + 2x100G
+switch and the larger 48x10 + 2x40 + 4x100G switch are keeping up just fine.
+
+A duration test will have to show if the configuration and switch fabric are stable _over time_, but
+I am hopeful that Centec is hitting the exact sweet spot for me on the MPLS transport front.
+
+Yes, yes, yes. I did also promise to take a look at eVPN functionality (this is another form of
+L2VPN which uses iBGP to share which MAC addresses live behind which VxLAN ports). This post has
+been fun, but also quite long (4300 words!), so I'll follow up in a future article on the eVPN
+capabilities of the Centec switches.
diff --git a/content/articles/2023-02-12-fitlet2.md b/content/articles/2023-02-12-fitlet2.md
new file mode 100644
index 0000000..5912e7d
--- /dev/null
+++ b/content/articles/2023-02-12-fitlet2.md
@@ -0,0 +1,505 @@
+---
+date: "2023-02-12T09:51:23Z"
+title: 'Review: Compulab Fitlet2'
+---
+
+{{< image width="400px" float="right" src="/assets/fitlet2/Fitlet2-stock.png" alt="Fitlet" >}}
+
+A while ago, in June 2021, we were discussing home routers that can keep up with 1G+ internet
+connections in the [CommunityRack](https://www.communityrack.org) telegram channel. Of course
+at IPng Networks we are fond of the Supermicro Xeon D1518 [[ref]({% post_url 2021-09-21-vpp-7 %})],
+which has a bunch of 10Gbit X522 and 1Gbit i350 and i210 intel NICs, but it does come at a certain
+price.
+
+For smaller applications, PC Engines APU6 [[ref]({% post_url 2021-07-19-pcengines-apu6 %})] is
+kind of cool and definitely more affordable. But, in this chat, Patrick offered an alternative,
+the [[Fitlet2](https://fit-iot.com/web/products/fitlet2/)], which is a small, passively cooled,
+and expandable IoT-esque machine.
+
+Fast forward 18 months, and Patrick decided to sell off his units, so I bought one off of him,
+and decided to loadtest it. Considering the price tag (the unit I will be testing ships for
+around $400) and its ability to use (1G/SFP) fiber optics, it may be a pretty cool one!
+
+# Executive Summary
+
+**TL/DR: Definitely a cool VPP router, 3x 1Gbit line rate, A- would buy again**
+
+With some care on the VPP configuration (notably RX/TX descriptors), this unit can handle L2XC at
+(almost) line rate in both directions (2.94Mpps out of a theoretical 2.97Mpps), with one VPP worker
+thread, which is not just good, it's _Good Enough™_, at which point there is still plenty of
+headroom on the CPU, as the Atom E3950 has 4 cores.
+
+In IPv4 routing, using two VPP worker threads, and 2 RX/TX queues on each NIC, the machine keeps up
+with 64 byte traffic in both directions (ie 2.97Mpps), again with compute power to spare, and while
+using only two out of four CPU cores on the Atom E3950.
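+
+For the curious, the `startup.conf` stanzas that matter for the two worker / two queue setup
+mentioned above look roughly like this. This is a sketch: the PCI address and the descriptor counts
+are illustrative placeholders, not the exact values from this unit:
+
+```
+cpu {
+    main-core 0
+    corelist-workers 1-2
+}
+dpdk {
+    dev 0000:02:00.0 {
+        num-rx-queues 2
+        num-tx-queues 2
+        num-rx-desc 512
+        num-tx-desc 512
+    }
+}
+```
+
+Each worker then polls its own RX queue on each NIC, which is where the "two out of four CPU cores"
+figure above comes from.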
+
+For a $400,- machine that draws close to 11 Watts fully loaded and sports 8GB of memory (at a max of
+16GB), this Fitlet2 is a gem: it will easily keep up with 3x 1Gbit in a production environment, while
+carrying multiple full BGP tables (900K IPv4 and 170K IPv6), with room to spare. _It's a classy
+little machine!_
+
+## Detailed findings
+
+{{< image width="250px" float="right" src="/assets/fitlet2/Fitlet2-BottomOpen.png" alt="Fitlet2 Open" >}}
+
+The first thing that I noticed when it arrived is how small it is! The design of the Fitlet2 has a
+motherboard with a non-removable Atom E3950 CPU running at 1.6GHz, from the _Goldmont_ series. This
+is a notoriously slow budget CPU: it comes with 4C/4T, each CPU thread comes with 24kB of L1
+and 1MB of L2 cache, and there is no L3 cache on this CPU at all. That would mean performance in
+applications like VPP (which try to leverage these caches) will be poorer -- the main question on
+my mind is: does the CPU have enough __oompff__ to keep up with the 1G network cards? I'll want this
+CPU to be able to handle roughly 4.5Mpps in total, in order for the Fitlet2 to count itself amongst
+the _wirespeed_ routers.
+
+Looking further, the Fitlet2 has one HDMI and one MiniDP port, two USB2 and two USB3 ports, and two
+Intel i211 NICs with RJ45 ports (these are 1Gbit). There's a helpful MicroSD slot, two LEDs and an
+audio in- and output 3.5mm jack. The power button does worry me a little bit: I feel like just
+brushing against it may turn the machine off. I do appreciate the cooling situation - the top finned
+plate mates with the CPU on the top of the motherboard, and the bottom bracket holds a sizable
+aluminium cooling block which further helps dissipate heat, without needing any active cooling. The
+Fitlet folks claim this machine can run in environments anywhere between -50C and +112C, which I
+won't be doing :)
+
+{{< image width="400px" float="right" src="/assets/fitlet2/Fitlet2+FACET.png" alt="Fitlet2" >}}
+
+Inside, there's a single DDR3 SODIMM slot for memory (the one I have came with 8GB at 1600MT/s) and
+a custom, albeit open-specification, expansion board called a __FACET-Card__ which stands for
+**F**unction **A**nd **C**onnectivity **E**xtension **T**-Card, well okay then! The __FACET__ card
+in this little machine sports one extra Intel i210-IS NIC, an M2 for an SSD, and an M2E for a WiFi
+port. The NIC is a 1Gbit SFP capable device. You can see its optic cage on the _FACET_ card above,
+next to the yellow CMOS / Clock battery.
+
+The whole thing is fed with a 12V power brick delivering 2A, and a nice touch is that the barrel
+connector has a plastic bracket that locks it into the chassis by turning it 90 degrees, so it won't
+flap around in the breeze and detach. I wish other embedded PCs would ship with those, as I've been
+fumbling around in 19" racks that are, let me say, less tightly cable organized, and may or may not
+have disconnected the CHIX routeserver at some point in the past. Sorry, Max :)
+
+For the curious, here's a list of interesting details: [[lspci](/assets/fitlet2/lspci.txt)] -
+[[dmidecode](/assets/fitlet2/dmidecode.txt)] -
+[[likwid-topology](/assets/fitlet2/likwid-topology.txt)] - [[dmesg](/assets/fitlet2/dmesg.txt)].
+
+## Preparing the Fitlet2
+
+First, I grab a USB key and install Debian _Bullseye_ (11.5) on it, using the UEFI installer. After
+booting, I carry through the instructions on my [[VPP Production]({% post_url 2021-09-21-vpp-7 %})]
+post. 
Notably, I create the `dataplane` namespace, run an SSH and SNMP agent there, and boot with
+`isolcpus=1-3` so that I can give three worker threads to VPP. I start off giving it only one (1)
+worker thread, so that I can see what the performance of a single CPU thread is, before scaling out
+to the three (3) threads that this CPU can offer. I also take the defaults for DPDK, notably
+allowing the DPDK poll-mode drivers to use their proposed defaults:
+
+* **GigabitEthernet1/0/0**: Intel Corporation I211 Gigabit Network Connection (rev 03)
+ > rx: queues 1 (max 2), desc 512 (min 32 max 4096 align 8)
+ > tx: queues 2 (max 2), desc 512 (min 32 max 4096 align 8) +* **GigabitEthernet3/0/0**: Intel Corporation I210 Gigabit Fiber Network Connection (rev 03) + > rx: queues 1 (max 4), desc 512 (min 32 max 4096 align 8)
+ > tx: queues 2 (max 4), desc 512 (min 32 max 4096 align 8) + +I observe that the i211 NIC allows for a maximum of two (2) RX/TX queues, while the (older!) i210 +will allow for four (4) of them. And another thing that I see here is that there are two (2) TX +queues active, but I only have one worker thread, so what gives? This is because there is always a +main thread and a worker thread, and it could be that the main thread needs to / wants to send +traffic out on an interface, so it always attaches to a queue in addition to the worker thread(s). +When exploring new hardware, I find it useful to take a look at the output of a few tactical `show` +commands on the CLI, such as: + +**1. What CPU is in this machine?** + +``` +vpp# show cpu +Model name: Intel(R) Atom(TM) Processor E3950 @ 1.60GHz +Microarch model (family): [0x6] Goldmont ([0x5c] Apollo Lake) stepping 0x9 +Flags: sse3 pclmulqdq ssse3 sse41 sse42 rdrand pqe rdseed aes sha invariant_tsc +Base frequency: 1.59 GHz +``` + +**2. Which devices on the PCI bus, PCIe speed details, and driver?** + +``` +vpp# show pci +Address Sock VID:PID Link Speed Driver Product Name Vital Product Data +0000:01:00.0 0 8086:1539 2.5 GT/s x1 uio_pci_generic +0000:02:00.0 0 8086:1539 2.5 GT/s x1 igb +0000:03:00.0 0 8086:1536 2.5 GT/s x1 uio_pci_generic +``` + +__Note__: This device at slot `02:00.0` is the second onboard RJ45 i211 NIC. I have used this one +to log in to the Fitlet2 and more easily kill/restart VPP and so on, but I could of course just as +well give it to VPP, in which case I'd have three gigabit interfaces to play with! + +**3. What details are known for the physical NICs?** + +``` +vpp# show hardware GigabitEthernet1/0/0 +GigabitEthernet1/0/0 1 up GigabitEthernet1/0/0 + Link speed: 1 Gbps + RX Queues: + queue thread mode + 0 vpp_wk_0 (1) polling + TX Queues: + TX Hash: [name: hash-eth-l34 priority: 50 description: Hash ethernet L34 headers] + queue shared thread(s) + 0 no 0 + 1 no 1 + Ethernet address 00:01:c0:2a:eb:a8 + Intel e1000 + carrier up full duplex max-frame-size 2048 + flags: admin-up maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-supported + rx: queues 1 (max 2), desc 512 (min 32 max 4096 align 8) + tx: queues 2 (max 2), desc 512 (min 32 max 4096 align 8) + pci: device 8086:1539 subsystem 8086:0000 address 0000:01:00.00 numa 0 + max rx packet len: 16383 + promiscuous: unicast off all-multicast on + vlan offload: strip off filter off qinq off + rx offload avail: vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter + vlan-extend scatter keep-crc rss-hash + rx offload active: ipv4-cksum scatter + tx offload avail: vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum + tcp-tso multi-segs + tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs + rss avail: ipv4-tcp ipv4-udp ipv4 ipv6-tcp-ex ipv6-udp-ex ipv6-tcp + ipv6-udp ipv6-ex ipv6 + rss active: none + tx burst function: (not available) + rx burst function: (not available) +``` + +### Configuring VPP + +After this exploratory exercise, I have learned enough about the hardware to be able to take the +Fitlet2 out for a spin. To configure the VPP instance, I turn to +[[vppcfg](https://github.com/pimvanpelt/vppcfg)], which can take a YAML configuration file +describing the desired VPP configuration, and apply it safely to the running dataplane using the VPP +API. I've written a few more posts on how it does that, notably on its [[syntax]({% post_url +2022-03-27-vppcfg-1 %})] and its [[planner]({% post_url 2022-04-02-vppcfg-2 %})]. 
A complete
+configuration guide on vppcfg can be found
+[[here](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md)].
+
+```
+pim@fitlet:~$ sudo dpkg -i {lib,}vpp*23.06*deb
+pim@fitlet:~$ sudo apt install python3-pip
+pim@fitlet:~$ sudo pip install vppcfg-0.0.3-py3-none-any.whl
+```
+
+### Methodology
+
+#### Method 1: Single CPU Thread Saturation
+
+First, I will take VPP out for a spin by creating an L2 Cross Connect where any ethernet frame
+received on `Gi1/0/0` will be directly transmitted as-is on `Gi3/0/0` and vice versa. This is a
+relatively cheap operation for VPP, as it will not have to do any routing table lookups. The
+configuration looks like this:
+
+```
+pim@fitlet:~$ cat << EOF > l2xc.yaml
+interfaces:
+  GigabitEthernet1/0/0:
+    mtu: 1500
+    l2xc: GigabitEthernet3/0/0
+  GigabitEthernet3/0/0:
+    mtu: 1500
+    l2xc: GigabitEthernet1/0/0
+EOF
+pim@fitlet:~$ vppcfg plan -c l2xc.yaml
+[INFO ] root.main: Loading configfile l2xc.yaml
+[INFO ] vppcfg.config.valid_config: Configuration validated successfully
+[INFO ] root.main: Configuration is valid
+[INFO ] vppcfg.vppapi.connect: VPP version is 23.06-rc0~35-gaf4046134
+comment { vppcfg sync: 10 CLI statement(s) follow }
+set interface l2 xconnect GigabitEthernet1/0/0 GigabitEthernet3/0/0
+set interface l2 tag-rewrite GigabitEthernet1/0/0 disable
+set interface l2 xconnect GigabitEthernet3/0/0 GigabitEthernet1/0/0
+set interface l2 tag-rewrite GigabitEthernet3/0/0 disable
+set interface mtu 1500 GigabitEthernet1/0/0
+set interface mtu 1500 GigabitEthernet3/0/0
+set interface mtu packet 1500 GigabitEthernet1/0/0
+set interface mtu packet 1500 GigabitEthernet3/0/0
+set interface state GigabitEthernet1/0/0 up
+set interface state GigabitEthernet3/0/0 up
+[INFO ] vppcfg.reconciler.write: Wrote 11 lines to (stdout)
+[INFO ] root.main: Planning succeeded
+```
+
+{{< image width="500px" float="right" src="/assets/fitlet2/l2xc-demo1.png" alt="Fitlet2 L2XC First Try" >}}
+
+After I paste these commands into the VPP CLI, I start T-Rex in L2 stateless mode, and generate some
+activity by starting the `bench` profile on port 0 with packets of 64 bytes in size and with varying
+IPv4 source and destination addresses _and_ ports:
+
+```
+tui>start -f stl/bench.py -m 1.48mpps -p 0
+ -t size=64,vm=var2
+```
+
+Let me explain a few highlights from the picture to the right. When starting this profile, I
+specified 1.48Mpps, which is the maximum number of packets per second that can be generated on a
+1Gbit link when using 64 byte frames (the smallest permissible ethernet frames). I do this because
+the loadtester comes with 10Gbit (and 100Gbit) ports, but the Fitlet2 has only 1Gbit ports. Then, I
+see that port0 is indeed transmitting (**Tx pps**) 1.48 Mpps, shown in dark blue. This is about 992
+Mbps on the wire (the **Tx bps L1**), but due to the overhead of ethernet (each 64 byte ethernet
+frame needs an additional 20 bytes [[details](https://en.wikipedia.org/wiki/Ethernet_frame)]), the
+**Tx bps L2** is about `64/84 * 992.35 = 756.08` Mbps, which lines up.
+
+Then, after the Fitlet2 tries its best to forward those from its receiving Gi1/0/0 port onto its
+transmitting port Gi3/0/0, they are received again by T-Rex on port 1. Here, I can see that the **Rx
+pps** is 1.29 Mpps, with an **Rx bps** of 660.49 Mbps (which is the L2 counter), and in bright red
+at the top I see the **drop_rate** is about 95.59 Mbps. In other words, the Fitlet2 is _not keeping
+up_. 
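+
+As an aside, that 1.48Mpps ceiling is simply 1Gbit line rate at minimum frame size: assuming the
+usual 20 bytes of per-frame overhead (preamble, start-of-frame delimiter and inter-frame gap), a 64
+byte frame occupies 84 bytes on the wire, so:
+
+```
+10^9 / ((64 + 20) * 8) = 1'488'095 packets/second per 1Gbit port
+2 ports, both directions = ~2.97 Mpps ; 3 ports = ~4.46 Mpps
+```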
+ +But, after I take a look at the runtime statistics, I see that the CPU isn't very busy at all: + +``` +vpp# show run +... +Thread 1 vpp_wk_0 (lcore 1) +Time 23.8, 10 sec internal node vector rate 4.30 loops/sec 1638976.68 + vector rates in 1.2908e6, out 1.2908e6, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +GigabitEthernet3/0/0-output active 6323688 27119700 0 9.14e1 4.29 +GigabitEthernet3/0/0-tx active 6323688 27119700 0 1.79e2 4.29 +dpdk-input polling 44406936 27119701 0 5.35e2 .61 +ethernet-input active 6323689 27119701 0 1.42e2 4.29 +l2-input active 6323689 27119701 0 9.94e1 4.29 +l2-output active 6323689 27119701 0 9.77e1 4.29 +``` + +Very interesting! Notice that the line above says `vector rates in .. out ..` are saying that the +thread is receiving only 1.29Mpps, and it is managing to send all of them out as well. When a VPP +worker is busy, each DPDK call will yield many packets, up to 256 in one call, which means the +amount of "vectors per call" will rise. Here, I see that on average, DPDK is returning an average of +only 0.61 packets each time it polls the NIC, and in each time a bunch of the packets are sent off +into the VPP graph, there is an average of 4.29 packets per loop. If the CPU was the bottleneck, it +would look more like 256 in the Vectors/Call column -- so the **bottleneck must be in the NIC**. + +Remember above, when I showed the `show hardware` command output? There's a clue in there. The +Fitlet2 has two onboard i211 NICs and one i210 NIC on the _FACET_ card. Despite the lower number, +the i210 is a bit more advanced +[[datasheet](/assets/fitlet2/i210_ethernet_controller_datasheet-257785.pdf)]. If I reverse the +direction of flow (so receiving on the i210 Gi3/0/0, and transmitting on the i211 Gi1/0/0), things +look a fair bit better: + +``` +vpp# show run +... +Thread 1 vpp_wk_0 (lcore 1) +Time 12.6, 10 sec internal node vector rate 4.02 loops/sec 853956.73 + vector rates in 1.4799e6, out 1.4799e6, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +GigabitEthernet1/0/0-output active 4642964 18652932 0 9.34e1 4.02 +GigabitEthernet1/0/0-tx active 4642964 18652420 0 1.73e2 4.02 +dpdk-input polling 12200880 18652933 0 3.27e2 1.53 +ethernet-input active 4642965 18652933 0 1.54e2 4.02 +l2-input active 4642964 18652933 0 1.04e2 4.02 +l2-output active 4642964 18652933 0 1.01e2 4.02 +``` + +Hey, would you look at that! The line up top here shows vector rates of in 1.4799e6 (which is +1.48Mpps) and outbound is the same number. And in this configuration as well, the DPDK node isn't +even reading that many packets, and the graph traversal is on average with 4.02 packets per run, +which means that this CPU can do in excess of 1.48Mpps on one (1) CPU thread. Slick! + +So what _is_ the maximum throughput per CPU thread? To show this, I will saturate both ports with +line rate traffic, and see what makes it through the other side. After instructing the T-Rex to +perform the following profile: + +``` +tui>start -f stl/bench.py -m 1.48mpps -p 0 1 \ + -t size=64,vm=var2 +``` + +T-Rex will faithfully start to send traffic on both ports and expect the same amount back from the +Fitlet2 (the _Device Under Test_ or _DUT_). I can see that from T-Rex port 1->0 all traffic makes +its way back, but from port 0->1 there is a little bit of loss (for the 1.48Mpps sent, only 1.43Mpps +is returned). 
This is the same phenomenon that I explained above -- the i211 NIC is not quite as +good at eating packets as the i210 NIC is. + +Even when doing this though, the (still) single threaded VPP is keeping up just fine, CPU wise: + +``` +vpp# show run +... +Thread 1 vpp_wk_0 (lcore 1) +Time 13.4, 10 sec internal node vector rate 13.59 loops/sec 122820.33 + vector rates in 2.9599e6, out 2.8834e6, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +GigabitEthernet1/0/0-output active 1822674 19826616 0 3.69e1 10.88 +GigabitEthernet1/0/0-tx active 1822674 19597360 0 1.51e2 10.75 +GigabitEthernet3/0/0-output active 1823770 19826612 0 4.79e1 10.87 +GigabitEthernet3/0/0-tx active 1823770 19029508 0 1.56e2 10.43 +dpdk-input polling 1827320 39653228 0 1.62e2 21.70 +ethernet-input active 3646444 39653228 0 7.67e1 10.87 +l2-input active 1825356 39653228 0 4.96e1 21.72 +l2-output active 1825356 39653228 0 4.58e1 21.72 +``` + +Here we can see 2.96Mpps received (_vector rates in_) while only 2.88Mpps are transmitted (_vector +rates out_). First off, this lines up perfectly with the reporting of T-Rex in the screenshot above, +and it also shows that one direction loses more packets than the other. We're dropping some 80kpps, +but where did they go? Looking at the statistics counters, which include any packets which had +errors in processing, we learn more: + +``` +vpp# show err + Count Node Reason Severity +3109141488 l2-output L2 output packets error +3109141488 l2-input L2 input packets error + 9936649 GigabitEthernet1/0/0-tx Tx packet drops (dpdk tx failure) error + 32120469 GigabitEthernet3/0/0-tx Tx packet drops (dpdk tx failure) error +``` + +{{< image width="500px" float="right" src="/assets/fitlet2/l2xc-demo2.png" alt="Fitlet2 L2XC Second Try" >}} + +Aha! From previous experience I know that when DPDK signals packet drops due to 'tx failure', +that this is often because it's trying to hand off the packet to the NIC, which has a ringbuffer to +collect them while the hardware transmits them onto the wire, and this NIC has run out of slots, +which means the packet has to be dropped and a kitten gets hurt. But, I can raise the number of +RX and TX slots, by setting them in VPP's `startup.conf` file: + +``` +dpdk { + dev default { + num-rx-desc 512 ## default + num-tx-desc 1024 + } + no-multi-seg +} +``` + +And with that simple tweak, I've succeeded in configuring the Fitlet2 in a way that it is capable of +receiving and transmitting 64 byte packets in both directions at (almost) line rate, with **one CPU +thread**. + +#### Method 2: Rampup using trex-loadtest.py + +For this test, I decide to put the Fitlet2 into L3 mode (up until now it was set up in _L2 Cross +Connect_ mode). To do this, I give the interfaces an IPv4 address and set a route for the loadtest +traffic (which will be coming from `16.0.0.0/8` and going to `48.0.0.0/8`). I will once again look +to `vppcfg` to do this, because manipulating the YAML files like this allow me to easily and reliabily +swap back and forth, letting `vppcfg` do the mundane chore of figuring out what commands to type, in +which order, safely. 
+ +From my existing L2XC dataplane configuration, I switch to L3 like so: + +``` +pim@fitlet:~$ cat << EOF > l3.yaml +interfaces: + GigabitEthernet1/0/0: + mtu: 1500 + lcp: e1-0-0 + addresses: [ 100.64.10.1/30 ] + GigabitEthernet3/0/0: + mtu: 1500 + lcp: e3-0-0 + addresses: [ 100.64.10.5/30 ] +EOF +pim@fitlet:~$ vppcfg plan -c l3.yaml +[INFO ] root.main: Loading configfile l3.yaml +[INFO ] vppcfg.config.valid_config: Configuration validated successfully +[INFO ] root.main: Configuration is valid +[INFO ] vppcfg.vppapi.connect: VPP version is 23.06-rc0~35-gaf4046134 +comment { vppcfg prune: 2 CLI statement(s) follow } +set interface l3 GigabitEthernet1/0/0 +set interface l3 GigabitEthernet3/0/0 +comment { vppcfg create: 2 CLI statement(s) follow } +lcp create GigabitEthernet1/0/0 host-if e1-0-0 +lcp create GigabitEthernet3/0/0 host-if e3-0-0 +comment { vppcfg sync: 2 CLI statement(s) follow } +set interface ip address GigabitEthernet1/0/0 100.64.10.1/30 +set interface ip address GigabitEthernet3/0/0 100.64.10.5/30 +[INFO ] vppcfg.reconciler.write: Wrote 9 lines to (stdout) +[INFO ] root.main: Planning succeeded +``` + +One small note -- `vppcfg` cannot set routes, and this is by design as the Linux Control Plane is +meant to take care of that. I can either set routes using `ip` in the `dataplane` network namespace, +like so: + +``` +pim@fitlet:~$ sudo nsenter --net=/var/run/netns/dataplane +root@fitlet:/home/pim# ip route add 16.0.0.0/8 via 100.64.10.2 +root@fitlet:/home/pim# ip route add 48.0.0.0/8 via 100.64.10.6 +``` + +Or, alternatively, I can set them directly on VPP in the CLI, interestingly with identical syntax: +``` +pim@fitlet:~$ vppctl +vpp# ip route add 16.0.0.0/8 via 100.64.10.2 +vpp# ip route add 48.0.0.0/8 via 100.64.10.6 +``` + +The loadtester will run a bunch of profiles (1514b, _imix_, 64b with multiple flows, and 64b with +only one flow), either in unidirectional or bidirectional mode, which gives me a wealth of data to +share: + +Loadtest | 1514b | imix | Multi 64b | Single 64b +-------------------- | -------- | -------- | --------- | ---------- +***Bidirectional*** | [81.7k (100%)](/assets/fitlet2/fitlet2.bench-var2-1514b-bidirectional.html) | [327k (100%)](/assets/fitlet2/fitlet2.bench-var2-imix-bidirectional.html) | [1.48M (100%)](/assets/fitlet2/fitlet2.bench-var2-bidirectional.html) | [1.43M (98.8%)](/assets/fitlet2/fitlet2.bench-bidirectional.html) +***Unidirectional*** | [73.2k (89.6%)](/assets/fitlet2/fitlet2.bench-var2-1514b-unidirectional.html) | [255k (78.2%)](/assets/fitlet2/fitlet2.bench-var2-imix-unidirectional.html) | [1.18M (79.4%)](/assets/fitlet2/fitlet2.bench-var2-unidirectional.html) | [1.23M (82.7%)](/assets/fitlet2/fitlet2.bench-bidirectional.html) + +## Caveats + +While all results of the loadtests are navigable [[here](/assets/fitlet2/fitlet2.html)], I will cherrypick +one interesting bundle showing the results of _all_ (bi- and unidirectional) tests: + +{{< image src="/assets/fitlet2/loadtest.png" alt="Fitlet2 All Loadtests" >}} + +I have to admit I was a bit stumped with the unidirectional loadtests - these +are pushing traffic into the i211 (onboard RJ45) NIC, and out of the i210 +(_FACET_ SFP) NIC. What I found super weird (and can't really explain), is +that the _unidirectional_ load, which in the end serves half the packets/sec, +is __lower__ than the _bidirectional_ load, which was almost perfect dropping +only a little bit of traffic at the very end. 
A picture says a thousand words -
+so here's a graph of all the loadtests, which you can also find by clicking on
+the links in the table.
+
+## Appendix
+
+### Generating the data
+
+The JSON files that are emitted by my loadtester script can be fed directly into Michal's
+[visualizer](https://github.com/wejn/trex-loadtest-viz) to plot interactive graphs (which I've
+done for the table above):
+
+```
+DEVICE=Fitlet2
+
+## Loadtest
+
+SERVER=${SERVER:=hvn0.lab.ipng.ch}
+TARGET=${TARGET:=l3}
+RATE=${RATE:=10} ## % of line
+DURATION=${DURATION:=600}
+OFFSET=${OFFSET:=10}
+PROFILE=${PROFILE:="ipng"}
+
+for DIR in unidirectional bidirectional; do
+  for SIZE in 1514 imix 64; do
+    ## (Re)set the per-direction flags on every iteration
+    FLAGS=""
+    [ $DIR == "unidirectional" ] && FLAGS="-u "
+
+    ## Multiple Flows
+    ./trex-loadtest -s ${SERVER} ${FLAGS} -p ${PROFILE}.py -t "offset=${OFFSET},vm=var2,size=${SIZE}" \
+      -rd ${DURATION} -rt ${RATE} -o ${DEVICE}-${TARGET}-${PROFILE}-var2-${SIZE}-${DIR}.json
+
+    [ "$SIZE" == "64" ] && {
+      ## Specialcase: Single Flow
+      ./trex-loadtest -s ${SERVER} ${FLAGS} -p ${PROFILE}.py -t "offset=${OFFSET},size=${SIZE}" \
+        -rd ${DURATION} -rt ${RATE} -o ${DEVICE}-${TARGET}-${PROFILE}-${SIZE}-${DIR}.json
+    }
+  done
+done
+
+## Graphs
+
+ruby graph.rb -t "${DEVICE} All Loadtests" ${DEVICE}*.json -o ${DEVICE}.html
+ruby graph.rb -t "${DEVICE} Unidirectional Loadtests" ${DEVICE}*unidir*.json \
+  -o ${DEVICE}.unidirectional.html
+ruby graph.rb -t "${DEVICE} Bidirectional Loadtests" ${DEVICE}*bidir*.json \
+  -o ${DEVICE}.bidirectional.html
+
+for i in ${PROFILE}-var2-1514 ${PROFILE}-var2-imix ${PROFILE}-var2-64 ${PROFILE}-64; do
+  ruby graph.rb -t "${DEVICE} Unidirectional Loadtests" ${DEVICE}*-${i}*unidirectional.json \
+    -o ${DEVICE}.$i-unidirectional.html
+  ruby graph.rb -t "${DEVICE} Bidirectional Loadtests" ${DEVICE}*-${i}*bidirectional.json \
+    -o ${DEVICE}.$i-bidirectional.html
+done
+```
diff --git a/content/articles/2023-02-24-coloclue-vpp-2.md b/content/articles/2023-02-24-coloclue-vpp-2.md
new file mode 100644
index 0000000..6ca42cb
--- /dev/null
+++ b/content/articles/2023-02-24-coloclue-vpp-2.md
@@ -0,0 +1,678 @@
+---
+date: "2023-02-24T07:31:00Z"
+title: 'Case Study: VPP at Coloclue, part 2'
+---
+
+{{< image width="300px" float="right" src="/assets/coloclue-vpp/coloclue_logo2.png" alt="Yoloclue" >}}
+
+* Author: Pim van Pelt, Rogier Krieger
+* Reviewers: Coloclue Network Committee
+* Status: Draft - Review - **Published**
+
+Almost precisely two years ago, in February of 2021, I created a loadtesting environment at
+[[Coloclue](https://coloclue.net)] to prove that a provider of L2 connectivity between two
+datacenters in Amsterdam was not incurring jitter or loss on its services -- I wrote up my findings
+in [[an article]({% post_url 2021-02-27-coloclue-loadtest %})], which demonstrated that the service
+provider indeed provides a perfect service. One month later, in March 2021, I briefly ran
+[[VPP](https://fd.io)] on one of the routers at Coloclue, but due to lack of time and a few
+technical hurdles along the way, I had to roll back [[ref]({% post_url 2021-03-27-coloclue-vpp %})].
+
+## The Problem
+
+Over the years, Coloclue AS8283 has continued to suffer from packet loss in its network. 
Taking a look +at a simple traceroute, in this case from IPng AS8298, shows very high variance and _packetlo_ +when entering the network (at hop 5 in a router called `eunetworks-2.router.nl.coloclue.net`): + +``` + My traceroute [v0.94] +squanchy.ipng.ch (194.1.193.90) -> 185.52.227.1 2023-02-24T09:03:36+0100 +Keys: Help Display mode Restart statistics Order of fields quit + Packets Pings + Host Loss% Snt Last Avg Best Wrst StDev + 1. chbtl0.ipng.ch 0.0% 49904 1.3 0.9 0.7 1.7 0.2 + 2. chrma0.ipng.ch 0.0% 49904 1.7 1.2 1.2 2.1 0.9 + 3. defra0.ipng.ch 0.0% 49904 6.3 6.2 6.0 19.2 1.3 + 4. nlams0.ipng.ch 0.0% 49904 12.7 12.6 12.4 19.8 1.8 + 5. bond0-105.eunetworks-2.router.nl.coloclue.net 0.2% 49903 98.8 12.3 12.0 272.8 23.0 + 6. 185.52.227.1 6.6% 49903 15.3 12.5 12.3 308.7 20.4 +``` + +The last two hops show the packet loss well north of 6.5%, some paths are better, some are worse, +but notably when more than one router is in the path, it's difficult to pinpoint where or what is +responsible. But honestly, any source will reveal packet loss and high variance when traversing +through one or more Coloclue routers, to more or lesser degree: + +--------------- | --------------------- +![Before nlede01](/assets/coloclue-vpp/coloclue-beacon-before-nlams01.png) | ![Before chbtl01](/assets/coloclue-vpp/coloclue-beacon-before-chbtl01.png) + +_The screenshots above are smokeping from (left) a machine at AS8283 Coloclue (in Amsterdam, the +Netherlands), and from (right) a machine at AS8298 IPng (in Brüttisellen, Switzerland), both +are showing ~4.8-5.0% packetlo and high variance in end to end latency. No bueno!_ + +## Isolating a Device Under Test + +Because Coloclue has several routers, I want to ensure that traffic traverses only the _one router_ under +test. I decide to use an allocated but currently unused IPv4 prefix and announce that only from one +of the four routers, so that all traffic to and from that /24 goes over that router. Coloclue uses a +piece of software called [Kees](https://github.com/coloclue/kees.git), a set of Python and Jinja2 +scripts to generate a Bird1.6 configuration for each router. This is great because that allows me to +add a small feature to get what I need: **beacons**. + +### Setting up the beacon + +A beacon is a prefix that is sent to (some, or all) peers on the internet to attract traffic in a +particular way. I added a function called `is_coloclue_beacon()` which reads the input YAML file and +uses a construction similar to the existing feature for "supernets". It determines if a given prefix +must be announced to peers and upstreams. Any IPv4 and IPv6 prefixes from the `beacons` list will be +then matched in `is_coloclue_beacon()` and announced. For the curious, [[this +commit](https://github.com/coloclue/kees/commit/3710f1447ade10384c86f35b2652565b440c6aa6)] holds the +logic and tests to ensure this is safe. + +Based on a per-router config (eg. `vars/eunetworks-2.router.nl.coloclue.net.yml`) I can now add the +following YAML stanza: + +``` +coloclue: + beacons: + - prefix: "185.52.227.0" + length: 24 + comment: "VPP test prefix (pim, rogier)" +``` + +And further, from this router, I can forward all traffic destined to this /24 to a machine running +in EUNetworks (my Dell R630 called `hvn0.nlams2.ipng.ch`), using a simple static route: + +``` +statics: + ... + - route: "185.52.227.0/24" + via: "94.142.240.71" + comment: "VPP test prefix (pim, rogier)" +``` + +After running Kees, I can now see traffic for that /24 show up on my machine. 
The last step is to +ensure that traffic that is destined for the beacon will always traverse back over `eunetworks-2`. +Coloclue has VRRP and sometimes another router might be the logical router. With a little trick on +my machine, I can force traffic by means of _policy based routing_: + +``` +pim@hvn0-nlams2:~$ sudo ip ro add default via 94.142.240.254 +pim@hvn0-nlams2:~$ sudo ip ro add prohibit 185.52.227.0/24 +pim@hvn0-nlams2:~$ sudo ip addr add 185.52.227.1/32 dev lo +pim@hvn0-nlams2:~$ sudo ip rule add from 185.52.227.0/24 lookup 10 +pim@hvn0-nlams2:~$ sudo ip ro add default via 94.142.240.253 table 10 +``` + +First, I set the default gateway to be the VRRP address that floats between multiple routers. Then, +I will set a `prohibit` route for the covering /24, which means the machine will send an ICMP +unreachable (rather than discarding the packets), which can be useful later. Next, I'll add .1 as an +IPv4 address onto loopback, after which the machine will start replying to ICMP packets there with +icmp-echo rather than dst-unreach. To make sure routing is always symmetric, I'll add an `ip rule` +which is a classifier that matches packets based on their source address, and then diverts these to +an alternate routing table, which has only one entry: send via .253 (which is `eunetworks-2`). + +Let me show this in action: + + +``` +pim@hvn0-nlams2:~$ dig +short -x 94.142.240.254 +eunetworks-gateway-100.router.nl.coloclue.net. +pim@hvn0-nlams2:~$ dig +short -x 94.142.240.253 +bond0-100.eunetworks-2.router.nl.coloclue.net. +pim@hvn0-nlams2:~$ dig +short -x 94.142.240.252 +bond0-100.eunetworks-3.router.nl.coloclue.net. + +pim@hvn0-nlams2:~$ ip -4 nei | grep '94.142.240.25[234]' +94.142.240.252 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE +94.142.240.253 dev coloclue lladdr 64:9d:99:b1:31:af REACHABLE +94.142.240.254 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE +``` + +In the output above, I can see that `eunetworks-2` (94.142.240.253) has MAC address +`64:9d:99:b1:31:af`, and that `eunetworks-3` (94.142.240.252) has MAC address `64:9d:99:b1:31:db`. +My default gateway, handled by VRRP, is at .254 and it's using the second MAC address, so I know that +`eunetworks-3` is primary, and will handle my egress traffic. + +### Verifying symmetric routing of the beacon + +A quick demonstration to show the symmetric routing case, I can tcpdump and see that my "usual" +egress traffic will be sent to the MAC address of the VRRP primary (which I showed to be +`eunetworks-3` above), while traffic coming from 185.52.227.0/24 ought to be sent to the +MAC address of `eunetworks-2` due to the `ip rule` and alternate routing table 10: + +``` +pim@hvn0-nlams2:~$ sudo tcpdump -eni coloclue host 194.1.163.93 and icmp +tcpdump: verbose output suppressed, use -v[v]... 
for full protocol decode +listening on coloclue, link-type EN10MB (Ethernet), snapshot length 262144 bytes +10:02:17.193844 64:9d:99:b1:31:af > 6e:fa:52:d0:c1:ff, ethertype IPv4 (0x0800), length 98: + 194.1.163.93 > 94.142.240.71: ICMP echo request, id 16287, seq 1, length 64 +10:02:17.193882 6e:fa:52:d0:c1:ff > 64:9d:99:b1:31:db, ethertype IPv4 (0x0800), length 98: + 94.142.240.71 > 194.1.163.93: ICMP echo reply, id 16287, seq 1, length 64 + +10:02:19.276657 64:9d:99:b1:31:af > 6e:fa:52:d0:c1:ff, ethertype IPv4 (0x0800), length 98: + 194.1.163.93 > 185.52.227.1: ICMP echo request, id 6646, seq 1, length 64 +10:02:19.276694 6e:fa:52:d0:c1:ff > 64:9d:99:b1:31:af, ethertype IPv4 (0x0800), length 98: + 185.52.227.1 > 194.1.163.93: ICMP echo reply, id 6646, seq 1, length 64 + +``` + +It takes a keen eye to spot the difference here the first packet (which is going to the main +IPv4 address 94.142.240.71), is returned via MAC address `64:9d:99:b1:31:db` (the VRRP default +gateway), but the second one (going to the beacon 185.52.227.1) is returned via MAC address +`64:9d:99:b1:31:af`. + +I've now ensured that traffic to and from 185.52.227.1 will always traverse through the DUT +(`eunetworks-2` with MAC `64:9d:99:b1:31:af`). Very elegant :-) + +## Installing VPP + +I've written about this before, the general _spiel_ is just following my previous article (I'm +often very glad to read back my own articles as they serve as pretty good documentation to my +forgetful chipmunk-sized brain!), so here, I'll only recap what's already written in +[[vpp-7]({% post_url 2021-09-21-vpp-7 %})]: + +1. Build VPP with Linux Control Plane +1. Bring `eunetworks-2` into maintenance mode, so we can safely tinker with it +1. Start services like ssh, snmp, keepalived and bird in a new `dataplane` namespace +1. Start VPP and give the LCP interface names the same as their original +1. Slowly introduce the router: OSPF, OSPFv3, iBGP, members-bgp, eBGP, in that order +1. Re-enable keepalived and let the machine forward traffic +1. Stare at the latency graphs + +{{< image width="500px" float="right" src="/assets/coloclue-vpp/likwid-topology.png" alt="Likwid Topology" >}} + +**1. BUILD:** For the first step, the build is straight forward, and yields a VPP instance based on +`vpp-ext-deps_23.06-1` at version `23.06-rc0~71-g182d2b466`, which contains my +[[LCPng](https://github.com/pimvanpelt/lcpng.git)] plugin. I then copy the packages to the router. +The router has an E-2286G CPU @ 4.00GHz with 6 cores and 6 hyperthreads. There's a really handy tool +called `likwid-topology` that can show how the L1, L2 and L3 cache lines up with respect to CPU +cores. Here I learn that CPU (0+6) and (1+7) share L1 and L2 cache -- so I can conclude that 0-5 are +CPU cores which share a hyperthread with 6-11 respectively. + +I also see that L3 cache is shared across all of the cores+hyperthreads, which is normal. I decide +to give CPUs 0,1 and their hyperthread 6,7 to Linux for general purpose scheduling, and I want to +block the remaining CPUs and their hyperthreads to dedicated to VPP. So the kernel is rebooted with +`isolcpus=2-5,8-11`. + +**2. DRAIN:** In the mean time, Rogier prepares the drain, which is two step process. First he marks all +the BGP sessions as `graceful_shutdown: True`, and waits for the traffic to die down. Then, he marks +the machine as `maintenance_mode: True` which will make Kees set OSPF cost to 65535 and avoid +attracting or sending traffic through this machine. 
After he submits these, we are free to tinker
+with the router, as it will not affect any Coloclue members. Rogier also ensures we will have
+out-of-band access to this little machine in Amsterdam, by preparing an IPMI serial-over-LAN
+connection and KVM.
+
+**3. PREPARE:** Starting an ssh and snmpd in the dataplane is the most important part. This way, we will be
+able to scrape the machine using SNMP just as if it were a Linux native router. And of course we
+will want to be able to log in to the router. I start with these two services. The only small
+note is that, because I want to run two copies (one in the default namespace and one additional
+one in the dataplane namespace), I'll want to tweak the startup flags (pid file, config file, etc) a
+little bit:
+
+```
+## in snmpd-dataplane.service
+ExecStart=/sbin/ip netns exec dataplane /usr/sbin/snmpd -LOw -u Debian-snmp \
+ -g vpp -I -smux,mteTrigger,mteTriggerConf -f -p /run/snmpd-dataplane.pid \
+ -C -c /etc/snmp/snmpd-dataplane.conf
+
+## in ssh-dataplane.service
+ExecStart=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd \
+ -oPidFile=/run/sshd-dataplane.pid -D $SSHD_OPTS
+```
+
+**4. LAUNCH:** Now what's left for us to do is switch from our SSH session to an IPMI serial-over-LAN session
+so that we can safely transition to the VPP world. Rogier and I log in and share a tmux session,
+after which I bring down all ethernet links, remove VLAN sub-interfaces and the LACP BondEthernet,
+leaving only the main physical interfaces. I then set link down on them, and restart VPP -- which
+will take all DPDK eligible interfaces that are link admin-down, and then let the magic happen:
+
+```
+root@eunetworks-2:~# vppctl show int
+ Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
+GigabitEthernet5/0/0 5 down 9000/0/0/0
+GigabitEthernet6/0/0 6 down 9000/0/0/0
+TenGigabitEthernet1/0/0 1 down 9000/0/0/0
+TenGigabitEthernet1/0/1 2 down 9000/0/0/0
+TenGigabitEthernet1/0/2 3 down 9000/0/0/0
+TenGigabitEthernet1/0/3 4 down 9000/0/0/0
+```
+
+Dope! One way to trick the rest of the machine into thinking it hasn't changed is to recreate these
+interfaces in the `dataplane` network namespace using their _original_ interface names (eg.
+`enp1s0f3` for AMS-IX, and `bond0` for the LACP-signaled BondEthernet that we'll create). 
Rogier +prepared an excellent `vppcfg` config file: + +``` +loopbacks: + loop0: + description: 'eunetworks-2.router.nl.coloclue.net' + lcp: 'loop0' + mtu: 9216 + addresses: [ 94.142.247.3/32, 2a02:898:0:300::3/128 ] + +bondethernets: + BondEthernet0: + description: 'Core: MLAG member switches' + interfaces: [ TenGigabitEthernet1/0/0, TenGigabitEthernet1/0/1 ] + mode: 'lacp' + load-balance: 'l34' + mac: '64:9d:99:b1:31:af' + +interfaces: + GigabitEthernet5/0/0: + description: "igb 0000:05:00.0 eno1 # FiberRing" + lcp: 'eno1' + mtu: 9216 + sub-interfaces: + 205: + description: "Peering: Arelion" + lcp: 'eno1.205' + addresses: [ 62.115.144.33/31, 2001:2000:3080:ebc::2/126 ] + mtu: 1500 + 992: + description: "Transit: FiberRing" + lcp: 'eno1.992' + addresses: [ 87.255.32.130/30, 2a00:ec8::102/126 ] + mtu: 1500 + + GigabitEthernet6/0/0: + description: "igb 0000:06:00.0 eno2 # Free" + lcp: 'eno2' + mtu: 9216 + state: down + + TenGigabitEthernet1/0/0: + description: "i40e 0000:01:00.0 enp1s0f0 (bond-member)" + mtu: 9216 + + TenGigabitEthernet1/0/1: + description: "i40e 0000:01:00.1 enp1s0f1 (bond-member)" + mtu: 9216 + + TenGigabitEthernet1/0/2: + description: 'Core: link between eunetworks-2 and eunetworks-3' + lcp: 'enp1s0f2' + addresses: [ 94.142.247.246/31, 2a02:898:0:301::/127 ] + mtu: 9214 + + TenGigabitEthernet1/0/3: + description: "i40e 0000:01:00.3 enp1s0f3 # AMS-IX" + lcp: 'enp1s0f3' + mtu: 9216 + sub-interfaces: + 501: + description: "Peering: AMS-IX" + lcp: 'enp1s0f3.501' + addresses: [ 80.249.211.161/21, 2001:7f8:1::a500:8283:1/64 ] + mtu: 1500 + 511: + description: "Peering: NBIP-NaWas via AMS-IX" + lcp: 'enp1s0f3.511' + addresses: [ 194.62.128.38/24, 2001:67c:608::f200:8283:1/64 ] + mtu: 1500 + + BondEthernet0: + lcp: 'bond0' + mtu: 9216 + sub-interfaces: + 100: + description: "Cust: Members" + lcp: 'bond0.100' + mtu: 1500 + addresses: [ 94.142.240.253/24, 2a02:898:0:20::e2/64 ] + 101: + description: "Core: Powerbars" + lcp: 'bond0.101' + mtu: 1500 + addresses: [ 172.28.3.253/24 ] + 105: + description: "Cust: Members (no strict uRPF filtering)" + lcp: 'bond0.105' + mtu: 1500 + addresses: [ 185.52.225.14/28, 2a02:898:0:21::e2/64 ] + 130: + description: "Core: Link between eunetworks-2 and dcg-1" + lcp: 'bond0.130' + mtu: 1500 + addresses: [ 94.142.247.242/31, 2a02:898:0:301::14/127 ] + 2502: + description: "Transit: Fusix Networks" + lcp: 'bond0.2502' + mtu: 1500 + addresses: [ 37.139.140.27/31, 2a00:a7c0:e20b:104::2/126 ] +``` + +We take this configuration and pre-generate a suitable VPP config, which exposes two little bugs +in `vppcfg`: + +* Rogier had used captial letters in his IPv6 addresses (ie. `2001:2000:3080:0EBC::2`), while the + dataplane reports lower case (ie. `2001:2000:3080:ebc::2`), which consistently yield a diff that's + not there. I make a note to fix that. +* When I create the initial `--novpp` config, there's a bug in `vppcfg` where I incorrectly + reference a dataplane object which I haven't initialized (because with `--novpp` the tool + will not contact the dataplane at all. That one was easy to fix, which I did in [[this + commit](https://github.com/pimvanpelt/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]). 
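+
+As an aside on that first, purely cosmetic, bug: one way to make the comparison robust is to parse
+both sides before diffing, instead of comparing strings. A minimal sketch in Python (this is not
+vppcfg's actual code, just an illustration of the idea using the standard `ipaddress` module):
+
+```
+import ipaddress
+
+def same_interface_address(configured: str, dataplane: str) -> bool:
+    """Compare 'address/prefixlen' strings as parsed objects, so that case and
+    leading zeroes (2001:2000:3080:0EBC::2 vs 2001:2000:3080:ebc::2) don't matter."""
+    return ipaddress.ip_interface(configured) == ipaddress.ip_interface(dataplane)
+
+print(same_interface_address("2001:2000:3080:0EBC::2/126", "2001:2000:3080:ebc::2/126"))  # True
+```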
+ +After that small detour, I can now proceed to configure the dataplane by offering the resulting +VPP commands, like so: +``` +root@eunetworks-2:~# vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml \ + -o /etc/vpp/config/vppcfg.vpp +[INFO ] root.main: Loading configfile /etc/vpp/vppcfg.yaml +[INFO ] vppcfg.config.valid_config: Configuration validated successfully +[INFO ] root.main: Configuration is valid +[INFO ] vppcfg.reconciler.write: Wrote 84 lines to /etc/vpp/config/vppcfg.vpp +[INFO ] root.main: Planning succeeded + +root@eunetworks-2:~# vppctl exec /etc/vpp/config/vppcfg.vpp +``` + +**5. UNDRAIN:** The VPP dataplane comes to life, only to immediately hang. Whoops! What follows is a +90 minute forray into the innards of VPP (and Bird) which I haven't yet fully understood, but will +definitely want to learn more about (future article, anyone?) -- but the TL/DR of our investigation +is that if an IPv6 address is added to a loopback device, and an OSPFv3 (IPv6) stub area is created +on it, as is common for IPv4 and IPv6 loopback addresses in OSPF, then the dataplane immediately +hangs on the controlplane, but does continue to forward traffic. + +However, we also find a workaround, which is to put the IPv6 loopback address on a physical +interface instead of a loopback interface. Then, we observe a perfectly functioning dataplane, which +has a working BondEthernet with LACP signalling: + +``` +root@eunetworks-2:~# vppctl show bond details +BondEthernet0 + mode: lacp + load balance: l34 + number of active members: 2 + TenGigabitEthernet1/0/1 + TenGigabitEthernet1/0/0 + number of members: 2 + TenGigabitEthernet1/0/0 + TenGigabitEthernet1/0/1 + device instance: 0 + interface id: 0 + sw_if_index: 8 + hw_if_index: 8 + +root@eunetworks-2:~# vppctl show lacp + actor state partner state +interface name sw_if_index bond interface exp/def/dis/col/syn/agg/tim/act exp/def/dis/col/syn/agg/tim/act +TenGigabitEthernet1/0/0 1 BondEthernet0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 + LAG ID: [(ffff,64-9d-99-b1-31-af,0008,00ff,0001), (8000,02-1c-73-0f-8b-bc,0015,8000,8015)] + RX-state: CURRENT, TX-state: TRANSMIT, MUX-state: COLLECTING_DISTRIBUTING, PTX-state: PERIODIC_TX +TenGigabitEthernet1/0/1 2 BondEthernet0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 + LAG ID: [(ffff,64-9d-99-b1-31-af,0008,00ff,0002), (8000,02-1c-73-0f-8b-bc,0015,8000,0015)] + RX-state: CURRENT, TX-state: TRANSMIT, MUX-state: COLLECTING_DISTRIBUTING, PTX-state: PERIODIC_TX +``` + +**6. WRAP UP:** After doing a bit of standard issue ping / ping6 and `show err` and `show log`, +things are looking good. Rogier and I are now ready to slowly introduce the router: we first turn on +OSPF and OSPFv3, see adjacencies and BFD turn up. We make a note that `enp1s0f2` (which is now a +_LIP_ in the dataplane) does not have BFD while it does have OSPF, and the explanation for this is +that `bond0` is connected to a switch, while `enp1s0f2` is directly connected to its peer via a +cross connect cable, so if it fails, it'll be able to use link-state to quickly reconverge, while +the ethernet link may still be up on `bond0` if something along the transport path were to fail, so +BFD is the better choice there. Smart thinking, Coloclue! + +``` +root@eunetworks-2:~# birdc6 show ospf nei ospf1 +BIRD 1.6.8 ready. +ospf1: +Router ID Pri State DTime Interface Router IP +94.142.247.1 1 Full/PtP 00:33 bond0.130 fe80::669d:99ff:feb1:394b +94.142.247.6 1 Full/PtP 00:31 enp1s0f2 fe80::669d:99ff:feb1:31d8 + +root@eunetworks-2:~# birdc show bfd ses +BIRD 1.6.8 ready. 
+bfd1: +IP address Interface State Since Interval Timeout +94.142.247.243 bond0.130 Up 2023-02-24 15:56:29 0.100 0.500 +``` + +We are then ready to undrain iBGP and eBGP to members, transit and peering sessions. Rogier swiftly +takes care of business, and the router finds its spot in the DFZ just a few minutes later: + +``` +root@eunetworks-2:~# birdc show route count +BIRD 1.6.8 ready. +6239493 of 6239493 routes for 907650 networks + +root@eunetworks-2:~# birdc6 show route count +BIRD 1.6.8 ready. +1152345 of 1152345 routes for 169987 networks + +root@eunetworks-2:~# vppctl show ip fib sum +ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none +locks:[adjacency:1, default-route:1, lcp-rt:1, ] + Prefix length Count + 0 1 + 4 2 + 8 16 + 9 13 + 10 38 + 11 103 + 12 299 + 13 577 + 14 1214 + 15 2093 + 16 13477 + 17 8250 + 18 13824 + 19 24990 + 20 43089 + 21 51191 + 22 109106 + 23 97073 + 24 542106 + 27 3 + 28 13 + 29 32 + 30 36 + 31 41 + 32 788 + +root@eunetworks-2:~# vppctl show ip6 fib sum +ipv6-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none +locks:[adjacency:1, default-route:1, lcp-rt:1, ] + Prefix length Count + 128 863 + 127 4 + 126 4 + 125 1 + 120 2 + 64 22 + 60 17 + 52 2 + 49 2 + 48 80069 + 47 3535 + 46 3411 + 45 1726 + 44 14909 + 43 1041 + 42 2529 + 41 932 + 40 14126 + 39 1459 + 38 1654 + 37 988 + 36 6640 + 35 1374 + 34 3419 + 33 3707 + 32 22819 + 31 294 + 30 589 + 29 4373 + 28 196 + 27 20 + 26 15 + 25 8 + 24 30 + 23 7 + 22 7 + 21 3 + 20 15 + 19 1 + 10 1 + 0 1 + +``` + +One thing that I really appreciate is how ... _normal_ ... this machine looks, with no interfaces in +the default namespace, but after switching to the dataplane network namespace using `nsenter`, there +they are and they look (unsurprisingly, because we configured them that way), identical to what was +running before, except now all goverend by VPP instead of the Linux kernel: + +``` +root@eunetworks-2:~# ip -br l +lo UNKNOWN 00:00:00:00:00:00 + +root@eunetworks-2:~# nsenter --net=/var/run/netns/dataplane +root@eunetworks-2:~# ip -br l +lo UNKNOWN 00:00:00:00:00:00 +eno1 UP ac:1f:6b:e0:b1:0c +eno2 DOWN ac:1f:6b:e0:b1:0d +enp1s0f2 UP 64:9d:99:b1:31:ad +enp1s0f3 UP 64:9d:99:b1:31:ac +bond0 UP 64:9d:99:b1:31:af +loop0 UP de:ad:00:00:00:00 +eno1.205@eno1 UP ac:1f:6b:e0:b1:0c +eno1.992@eno1 UP ac:1f:6b:e0:b1:0c +enp1s0f3.501@enp1s0f3 UP 64:9d:99:b1:31:ac +enp1s0f3.511@enp1s0f3 UP 64:9d:99:b1:31:ac +bond0.100@bond0 UP 64:9d:99:b1:31:af +bond0.101@bond0 UP 64:9d:99:b1:31:af +bond0.105@bond0 UP 64:9d:99:b1:31:af +bond0.130@bond0 UP 64:9d:99:b1:31:af +bond0.2502@bond0 UP 64:9d:99:b1:31:af + +root@eunetworks-2:~# ip -br a +lo UNKNOWN 127.0.0.1/8 ::1/128 +eno1 UP fe80::ae1f:6bff:fee0:b10c/64 +eno2 DOWN +enp1s0f2 UP 94.142.247.246/31 2a02:898:0:300::3/128 2a02:898:0:301::/127 fe80::669d:99ff:feb1:31ad/64 +enp1s0f3 UP fe80::669d:99ff:feb1:31ac/64 +bond0 UP fe80::669d:99ff:feb1:31af/64 +loop0 UP 94.142.247.3/32 fe80::dcad:ff:fe00:0/64 +eno1.205@eno1 UP 62.115.144.33/31 2001:2000:3080:ebc::2/126 fe80::ae1f:6bff:fee0:b10c/64 +eno1.992@eno1 UP 87.255.32.130/30 2a00:ec8::102/126 fe80::ae1f:6bff:fee0:b10c/64 +enp1s0f3.501@enp1s0f3 UP 80.249.211.161/21 2001:7f8:1::a500:8283:1/64 fe80::669d:99ff:feb1:31ac/64 +enp1s0f3.511@enp1s0f3 UP 194.62.128.38/24 2001:67c:608::f200:8283:1/64 fe80::669d:99ff:feb1:31ac/64 +bond0.100@bond0 UP 94.142.240.253/24 2a02:898:0:20::e2/64 fe80::669d:99ff:feb1:31af/64 +bond0.101@bond0 UP 172.28.3.253/24 fe80::669d:99ff:feb1:31af/64 
+bond0.105@bond0 UP 185.52.225.14/28 2a02:898:0:21::e2/64 fe80::669d:99ff:feb1:31af/64 +bond0.130@bond0 UP 94.142.247.242/31 2a02:898:0:301::14/127 fe80::669d:99ff:feb1:31af/64 +bond0.2502@bond0 UP 37.139.140.27/31 2a00:a7c0:e20b:104::2/126 fe80::669d:99ff:feb1:31af/64 +``` + +{{< image width="400px" float="right" src="/assets/coloclue-vpp/traffic.jpg" alt="Traffic" >}} + +Of course, VPP handles all the traffic _through_ the machine, and the only traffic that Linux will +see is that which is destined to the controlplane (eg, to one of the IPv4 or IPv6 addresses or +multicast/broadcast groups that they are participating in), so things like tcpdump or SNMP won't +really work. + +However, due to my [[vpp-snmp-agent](https://github.com/pimvanpelt/vpp-snmp-agent.git)], which is +feeding as an AgentX behind an snmpd that in turn is running in the `dataplane` namespace, SNMP scrapes +work as they did before, albeit with a few different interface names. + +**6.** Earlier, I had failed over `keepalived` and stopped the service. This way, the peer router +on `eunetworks-3` would pick up all outbound traffic to the virtual IPv4 and IPv6 for our users' +default gateway. Because we're mainly interested in non-intrusively measuring the BGP beacon (which +is forced to always go through this machine), and we know some of our members use BGP and take a +preference over this router because it's connected to AMS-IX, we make a decision to leave keepalived +turned off for now. + +But, traffic is flowing, and in fact a little bit more throughput, possibly because traffic flows +faster when there's not 5% packet loss on certain egress paths? I don't know but OK, moving along! + +## Results + +Clearly VPP is a winner in this scenario. If you recall the traceroute from before the operation, the latency +was good up until `nlams0.ipng.ch`, after which loss occured and variance was very high. Rogier and +I let the VPP instance run overnight, and started this traceroute after our maintenance was +concluded: + +``` + My traceroute [v0.94] +squanchy.ipng.ch (194.1.163.90) -> 185.52.227.1 2023-02-25T09:48:46+0100 +Keys: Help Display mode Restart statistics Order of fields quit + Packets Pings + Host Loss% Snt Last Avg Best Wrst StDev + 1. chbtl0.ipng.ch 0.0% 51796 0.6 0.2 0.1 1.7 0.2 + 2. chrma0.ipng.ch 0.0% 51796 1.6 1.0 0.9 5.5 1.2 + 3. defra0.ipng.ch 0.0% 51796 7.0 6.5 6.4 27.7 1.9 + 4. nlams0.ipng.ch 0.0% 51796 12.7 12.6 12.5 43.8 3.9 + 5. bond0-105.eunetworks-2.router.nl.coloclue.net 0.0% 51796 13.3 13.0 12.8 138.9 11.1 + 6. 185.52.227.1 0.0% 51796 13.6 12.7 12.3 46.6 8.3 +``` + +{{< image width="400px" float="right" src="/assets/coloclue-vpp/bill-clinton-zero.gif" alt="Clinton Zero" >}} + +This mtr shows clear network weather with absolutely no packets dropped from Brüttisellen (near +Zurich, Switzerland) all the way to the BGP beacon running in EUNetworks in Amsterdam. Considering +I've been running VPP for a few years now, including writing the code necessary to plumb the +dataplane interfaces through to Linux so that a higher order control plane (such as Bird, or FRR) +can manipulate them, I am reasonably bullish, but I do hope to convert others. 
+
+This computer now forwards packets like a boss; its packet loss is →
+
+Looking at the local situation, a quick test from a hypervisor running at IPng Networks in Equinix
+AM3, via FrysIX, through VPP and into the dataplane of the Coloclue router `eunetworks-2`, shows
+quite reasonable throughput as well:
+
+```
+root@eunetworks-2:~# traceroute hvn0.nlams3.ipng.ch
+traceroute to 46.20.243.179 (46.20.243.179), 30 hops max, 60 byte packets
+ 1 enp1s0f3.eunetworks-3.router.nl.coloclue.net (94.142.247.247) 0.087 ms 0.078 ms 0.071 ms
+ 2 frys-ix.ip-max.net (185.1.203.135) 1.288 ms 1.432 ms 1.479 ms
+ 3 hvn0.nlams3.ipng.ch (46.20.243.179) 0.524 ms 0.534 ms 0.531 ms
+
+root@eunetworks-2:~# iperf3 -c 46.20.243.179 -P 10
+Connecting to host 46.20.243.179, port 5201
+...
+[SUM] 0.00-10.00 sec 6.70 GBytes 5.76 Gbits/sec 192 sender
+[SUM] 0.00-10.03 sec 6.58 GBytes 5.64 Gbits/sec receiver
+
+root@eunetworks-2:~# iperf3 -c 46.20.243.179 -P 10 -R
+Connecting to host 46.20.243.179, port 5201
+Reverse mode, remote host 46.20.243.179 is sending
+...
+[SUM] 0.00-10.03 sec 6.07 GBytes 5.20 Gbits/sec 54623 sender
+[SUM] 0.00-10.00 sec 6.03 GBytes 5.18 Gbits/sec receiver
+```
+
+And the smokepings look just plain gorgeous:
+
+--------------- | ---------------------
+![After nlams01](/assets/coloclue-vpp/coloclue-beacon-after-nlams01.png) | ![After chbtl01](/assets/coloclue-vpp/coloclue-beacon-after-chbtl01.png)
+
+_The screenshots above are smokeping from (left) a machine at AS8283 Coloclue (in Amsterdam, the
+Netherlands), and from (right) a machine at AS8298 IPng (in Brüttisellen, Switzerland), both
+showing no packet loss and clearly improved end-to-end latency. Super!_
+
+## What's next
+
+The performance of the one router we upgraded definitely improved, no question about that. But
+there are a couple of things that I think we still need to do, so Rogier and I rolled back the
+change to the previous situation with kernel-based routing.
+
+* We didn't migrate keepalived, although IPng runs this in our DDLN [[colocation]({% post_url 2022-02-24-colo %})]
+ site, so I'm pretty confident that it will work.
+* Kees and Ansible at Coloclue will need a few careful changes to facilitate ongoing automation:
+ dataplane and controlplane firewalls, sysctls (uRPF et al), fastnetmon, and so on will need a
+ meaningful overhaul.
+* There's an unknown dataplane hang when Bird enables an OSPFv3 (IPv6) `stub` interface on
+ `lo0`. We worked around that by putting the loopback IPv6 address on another interface, but this
+ needs to be fully understood.
+* Completely unrelated to Coloclue, there's one dataplane hang regarding IPv6 RA/NS and/or BFD
+ and/or Linux Control Plane that the VPP developer community is hunting down - it happens with
+ my plugin but also with [[TNSR](http://www.netgate.com/tnsr)] (which uses the upstream `linux-cp` plugin).
+ I've been working with a few folks from Netgate and customers of IPng Networks to try to find the root
+ cause, as AS8298 has been bitten by this a few times over the last ~quarter or so. I cannot
+ recommend in good faith running VPP until this is sorted out.
+
+As an important side note, VPP is not well enough understood at Coloclue - rolling this out further
+risks making me a single point of failure in the networking committee, and I'm not comfortable taking
+that responsibility. 
I recommend that Coloclue network committee members gain experience with VPP, +DPDK, `vppcfg` and the other ecosystem tools, and that at least the bird6 OSPF issue and possible +IPv6 NS/RA issue are understood, before making the jump to the VPP world. diff --git a/content/articles/2023-03-11-mpls-core.md b/content/articles/2023-03-11-mpls-core.md new file mode 100644 index 0000000..74dd68f --- /dev/null +++ b/content/articles/2023-03-11-mpls-core.md @@ -0,0 +1,633 @@ +--- +date: "2023-03-11T11:56:54Z" +title: 'Case Study: Centec MPLS Core' +--- + +After receiving an e-mail from [[Starry Networks](https://starry-networks.com)], I had a chat with their +founder and learned that the combination of switch silicon and software may be a good match for IPng Networks. + +I got pretty enthusiastic when this new vendor claimed VxLAN, GENEVE, MPLS and GRE at 56 ports and +line rate, on a really affordable budget ($4'200,- for the 56 port; and $1'650,- for the 26 port +switch). This reseller is using a less known silicon vendor called +[[Centec](https://www.centec.com/silicon)], who have a lineup of ethernet chipsets. In this device, +the CTC8096 (GoldenGate) is used for cost effective high density 10GbE/40GbE applications paired +with 4x100GbE uplink capability. This is Centec's fourth generation, so CTC8096 inherits the feature +set from L2/L3 switching to advanced data center and metro Ethernet features with innovative +enhancement. The switch chip provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE +ports, inheriting from its predecessors a variety of features, including L2, L3, MPLS, VXLAN, MPLS +SR, and OAM/APS. Highlights features include Telemetry, Programmability, Security and traffic +management, and Network time synchronization. + +{{< image width="450px" float="left" src="/assets/oem-switch/S5624X-front.png" alt="S5624X Front" >}} + +{{< image width="450px" float="right" src="/assets/oem-switch/S5648X-front.png" alt="S5648X Front" >}} + +

+ +After discussing basic L2, L3 and Overlay functionality in my [[first post]({% post_url +2022-12-05-oem-switch-1 %})], and explored the functionality and performance of MPLS and VPLS in my +[[second post]({% post_url 2022-12-09-oem-switch-2 %})], I convinced myself and committed to a bunch +of these for IPng Networks. I'm now ready to roll out these switches and create a BGP-free core +network for IPng Networks. If this kind of thing tickles your fancy, by all means read on :) + +## Overview + +You may be wondering what folks mean when they talk about a [[BGP Free +Core](https://bgphelp.com/2017/02/12/bgp-free-core/)], and also you may ask yourself why would I +decide to retrofit this in our network. For most, operating this way gives very little room for +outages to occur in the L2 (Ethernet and MPLS) transport network, because it's relatively simple in +design and implementation. Some advantages worth mentioning: + +* Transport devices do not need to be capable of supporting a large number of IPv4/IPv6 routes, either + in the RIB or FIB, allowing them to be much cheaper. +* As there is no eBGP, transport devices will not be impacted by BGP-related issues, such as high CPU + utilization during massive BGP re-convergence. +* Also, without eBGP, some of the attack vectors in ISPs (loopback DDoS or ARP storms on public + internet exchange, to take two common examples) can be eliminated. If a new BGP security + vulnerability were to be discovered, transport devices aren't impacted. +* Operator errors (the #1 reason for outages in our industry) associated with BGP configuration and + the use of large RIBs (eg. leaking into IGP, flapping transit sessions, etc) can be eradicated. +* New transport services such as MPLS point to point virtual leased lines, SR-MPLS, VPLS clouds, and + eVPN can all be introduced without modifying the routing core. + +If deployed correctly, this type of transport-only network can be kept entirely isolated from the Internet, +making DDoS and hacking attacks against transport elements impossible, and it also opens up possibilities +for relatively safe sharing of infrastructure resources between ISPs (think of things like dark fibers +between locations, rackspace, power, cross connects). + +For smaller clubs (like IPng Networks), being able to share a 100G wave with others, significantly reduces +price per Megabit! So if you're in Zurich, Switzerland, or Europe and find this an interesting avenue to +expand your reach in a co-op style environment, [[reach out](/s/contact)] to us, any time! + +### Hybrid Design + +I've decided to make this the direction of IPng's core network -- I know that the specs of the +Centec switches I've bought will allow for a modest but not huge amount of routes in the hardware +forwarding tables. I loadtested them in [[a previous article]({% post_url 2022-12-05-oem-switch-1 +%})] at line rate (well, at least 8x10G at 64b packets and around 110Mpps), so they were forwarding +both IPv4 and MPLS traffic effortlessly, and at 45 Watts I might add! However, they clearly cannot +operate in the DFZ for two main reasons: + +1. The FIB is limited to 12K IPv4, 2K IPv6 entries, so they can't hold a full table +1. The CPU is a bit whimpy so it won't be happy doing large BGP reconvergence operations + +IPng Networks has three (3) /24 IPv4 networks, which means we're not swimming in IPv4 addresses. +But, I'm possibly the world's okayest systems engineer, and I happen to know that most things don't +really need an IPv4 address anymore. 
There's all sorts of fancy loadbalancers like +[[NGINX](https://nginx.org)] and [[HAProxy](https://www.haproxy.org/)] which can take traffic (UDP, +TCP or higher level constructs like HTTP traffic), provide SSL offloading, and then talk to one or +more loadbalanced backends to retrieve the actual content. + +#### IPv4 versus IPv6 + +{{< image float="right" src="/assets/mpls-core/go-for-it-90s.gif" alt="The 90s" >}} + +Most modern operating systems can operate in IPv6-only mode, certainly the Debian and Ubuntu and +Apple machines that are common in my network are happily dual-stack and probably mono-stack as well. +Seeing that I've been running IPv6 since, eh, the 90s (my first allocation was on the 6bone in +1996, and I did run [[SixXS](https://sixxs.net/)] for longer than I can remember!). + +You might be inclined to argue that I should be able to advance the core of my serverpark to +IPv6-only ... but unfortunately that's not only up to me, as it has been mentioned to me a number of times +that my [[Videos](https://video.ipng.ch/)] are not reachable, which of course they are, but only if +your computer speaks IPv6. + +In addition to my stuff needing _legacy_ reachability, some external websites, including pretty big +ones (I'm looking at you, [[GitHub](https://github.com/)] and [[Cisco +T-Rex](https://trex-tgn.cisco.com/)]) are still IPv4 only, and some network gear still hasn't +really caught on to the IPv6 control- and management plane scene (for example, SNMP traps or +scraping, BFD, LDP, and a few others, even in a modern switch like the Centecs that I'm about to +deploy). + +#### AS8298 BGP-Free Core + +I have a few options -- I could be stubborn and do NAT64 for an IPv6-only internal network. But if I'm going +to be doing NAT anyway, I decide to make a compromise and deploy my new network using private IPv4 +space alongside public IPv6 space, and to deploy a few strategically placed border gateways that can +do the translation and frontending for me. + +There's quite a few private/reserved IPv4 ranges on the internet, which the current LIRs on the +RIPE [[Waiting List](https://www.ripe.net/manage-ips-and-asns/ipv4/ipv4-waiting-list)] are salivating +all over, gross. However, there's a few ones beyond canonical [[RFC1918](https://www.rfc-editor.org/rfc/rfc1918)] +that are quite frequently used in enterprise networking, for example by large Cloud providers. They build +what is called a _Virtual Private Cloud_ or +[[VPC](https://www.cloudflare.com/learning/cloud/what-is-a-virtual-private-cloud/)]. And if they can +do it, so can I! + +#### Numberplan + +Let me draw your attention to [[RFC5735](https://www.rfc-editor.org/rfc/rfc5735)], which describes +special use IPv4 addresses. One of these is **198.18.0.0/15**: this block has been allocated for use +in benchmark tests of network interconnect devices. What I found interesting, is that +[[RFC2544](https://www.rfc-editor.org/rfc/rfc2544)] explains that this range was assigned to minimize the +chance of conflict in case a testing device were to be accidentally connected to part of the Internet. +Packets with source addresses from this range are not meant to be forwarded across the Internet. +But, they can _totally_ be used to build a pan-european private network that is not directly connected +to the internet. I grab my free private Class-B, like so: + +* For IPv4, I take the second /16 from that to use as my IPv4 block: **198.19.0.0/16**. 
+* For IPv6, I carve out a small part of IPng's own IPv6 PI block: **2001:678:d78:500::/56** + +First order of business is to create a simple numberplan that's not totally broken: + +Purpose | IPv4 Prefix | IPv6 Prefix +--------------|-------------------|------------------ +Loopbacks | 198.19.0.0/24 (size /32) | 2001:678:d78:500::/64 (size /128) +P2P Networks | 198.19.2.0/23 (size /31) | 2001:678:d78:501::/64 (size /112) +Site Local Networks | 198.19.4.0/22 (size /27) | 2001:678:d78:502::/56 (size /64) + +This simple start leaves most of the IPv4 space allocatable for the future, while giving me lots of +IPv4 and IPv6 addresses to retrofit this network in all sites where IPng is present, which is +[[quite a few](https://as8298.peeringdb.com/)]. All of **198.19.1.0/24** (reserved either for P2P +networks or for loopbacks, whichever I'll need first), **198.19.8.0/21**, **198.19.16.0/20**, +**198.19.32.0/19**, **198.19.64.0/18** and **198.19.128.0/17** will be ready for me to use in the +future, and they are all nicely tucked away under one **19.198.in-addr.arpa** reversed domain, which +I stub out on IPng's resolvers. Winner! + +### Inserting MPLS Under AS8298 + +I am currently running [[VPP](https://fd.io)] based on my own deployment [[article]({% post_url +2021-09-21-vpp-7%})], and this has a bunch of routers connected back-to-back with one another using +either crossconnects (if there are multiple routers in the same location), or a CWDM/DWDM wave over +dark fiber (if they are in adjacent buildings and I have found a provider willing to share their +dark fiber with me), or a Carrier Ethernet virtual leased line (L2VPN, provided by folks like +[[Init7](https://init7.net)] in Switzerland, or [[IP-Max](https://ip-max.net)] throughout europe in +our [[backbone]({% post_url 2021-02-27-network %})]). + +{{< image width="350px" float="right" src="/assets/mpls-core/before.svg" alt="Before" >}} + +Most of these links are actually "just" point to point ethernet links, which I can use untagged (eg +`xe1-0`), or add any _dot1q_ sub-interfaces (eg `xe1-0.10`). In some cases, the ISP will deliver the +circuit to me with an additional _outer_ tag, in which case I can still use that interface (eg +`xe1-0.400`) and create _qinq_ sub-interfaces (eg `xe1-0.400.10`). + +In January 2023, my Zurich metro deployment looks a bit like the top drawing to the right. Of +course, these routers connect to all sorts of other things, like internet exchange points +([[SwissIX](https://swissix.net/)], [[CHIX](https://ch-ix.ch/)], +[[CommunityIX](https://communityrack.org/)], and [[FreeIX](https://free-ix.net/)]), IP transit +upstreams (in Zurich mainly [[IP-Max](https://as25091.peeringdb.com/)] and +[[Openfactory](https://as58299.peeringdb.com/)]), and customer downstreams, colocation space, +private network interconnects with others, and so on. + +I want to draw your attention to the four _main_ links between these routers: + +1. Orange (bottom): chbtl0 and chbtl1 are at our offices in Brüttisellen; they're in two +separate racks, and have 24 fibers between them. Here, the two routers connect back to back with a +25G SFP28 optic at 1310nm. +1. Blue (top): Between chrma0 (at NTT datacenter in Rümlang) and chgtg0 (at Interxion datacenter +in Glattbrugg), IPng rents a CWDM wave from Openfactory, so the two routers here connect back to +back also, albeit over 4.2km of dark fiber between the two datacenters, with a 25G SFP28 optic at 1610nm. +1. 
Red (left): Between chbtl0 and chrma0, Init7 provides a 10G L2VPN over MPLS ethernet circuit, +starting in our offices with a BiDi 10G optic, and delivered at NTT on a BiDi 10G optic as well (we +did this, so that the cross connect between our racks might in the future be able to use the other +fiber). Init7 delivers both ports tagged VLAN 400. +1. Green (right): Between chbtl1 and chgtg0, Openfactory provides a 10G VLAN ethernet circuit, +starting in our offices with a BiDi 10G optic to the local telco, and then transported over dark +fiber by UPC to Interxion. Openfactory delivers both sides tagged VLAN 200-209 to us. + +This is a super fun puzzle! I am running a live network, with customers, and I want to retrofit this +MPLS network _underneath_ my existing network, and after thinking about it for a while, I see how I +can do it. + +{{< image width="350px" float="right" src="/assets/mpls-core/after.svg" alt="After" >}} + +To avoid using the link, I raise OSPF cost for the link chbtl0-chrma0, the red link in the graph. +Traffic will now flow via chgtg0 and through chbtl1. After I've taken the link out of service, I +make a few important changes: + +1. First, I move the interface on both VPP routers from it's _dot1q_ tagged `xe1-0.400`, to a double +tagged `xe1-0.400.10`. Init7 will pass this through for me, and after I make the change, I can ping +both sides again (with a subtle loss of 4 bytes because of the second tag). +1. Next, I unplug the Init7 link on both sides and plug them into a TenGig port on a Centec switch +that I deployed in both sites, and I take a second TenGig port and I plug that into the router. I +make both ports a _trunk_ mode switchport, and allow VLAN 400 tagged on it. +1. Finally, on the switch I create interface `vlan400` on both sides, and the two switches can see +each other directly connected now on the single-tagged interface, while the two routers can see each +other directly connected now on the double-tagged interface. + +With the _red_ leg taken care of, I ask the kind folks from Openfactory if they would mind if I use +a second wavelength for the duration of my migration, which they kindly agree to. So, I plug a new +CWDM 25G optic on another channel (1270nm), and bring the network to Glattbrugg, where I deploy a +Centec switch. + +With the _blue_/_purple_ leg taken care of, all I have to do is undrain the _red_ link (lower OSPF +cost) while draining the _green_ link (raising its OSPF cost). Traffic now flips back from chgtg0 +through chrma0 and into chbtl0. I can rinse and repeat the green leg, moving the interfaces on the +routers to a double-tagged `xe1-0.200.10` on both sides, inserting and moving the _green_ link from +the routers into the switches, and connecting them in turn to the routers. + +## Configuration + +And just like that, I've inserted a triangle of Centec switches without disrupting any traffic, +would you look at that! They are however, still "just" switches, each with two ports sharing +the _red_ VLAN 400 and the _green_ VLAN 200, and doing ... decidedly nothing on the _purple_ leg, +as those ports aren't even switchports! + +Next up: configuring these switches to become, you guessed it, routers! + +### Interfaces + +I will take the switch at NTT Rümlang as an example, but the switches really are all very +similar. First, I define the loopback addresses and transit networks to chbtl0 (_red_ link) and +chgtg0 (_purple_ link). 
+ +``` +interface loopback0 + description Core: msw0.chrma0.net.ipng.ch + ip address 198.19.0.2/32 + ipv6 address 2001:678:d78:500::2/128 +! +interface vlan400 + description Core: msw0.chbtl0.net.ipng.ch (Init7) + mtu 9172 + ip address 198.19.2.1/31 + ipv6 address 2001:678:d78:501::2/112 +! +interface eth-0-38 + description Core: msw0.chgtg0.net.ipng.ch (UPC 1270nm) + mtu 9216 + ip address 198.19.2.4/31 + ipv6 address 2001:678:d78:501::2:1/112 +``` + +I need to make sure that the MTU is correct on both sides (this will be important later when OSPF is +turned on), and I ensure that the underlay has sufficient MTU (in the case of Init7, as the _purple_ +interface goes over dark fiber with no active equipment in between!) I issue a set of ping commands +ensuring that the dont-fragment bit is set and the size of the resulting IP packet is exactly that +which my MTU claims I should allow, and validate that indeed, we're good. + +### OSPF, LDP, MPLS + +For OSPF, I am certain that this network should never carry or propagate anything other than the +**198.19.0.0/16** and **2001:678:d78:500::/56** networks that I have assigned to it, even if it were +to be connected to other things (like an out-of-band connection, or even AS8298), so as belt-and-braces +style protection I take the following base-line configuration: + +``` +ip prefix-list pl-ospf seq 5 permit 198.19.0.0/16 le 32 +ipv6 prefix-list pl-ospf seq 5 permit 2001:678:d78:500::/56 le 128 +! +route-map ospf-export permit 10 + match ipv6 address prefix-list pl-ospf +route-map ospf-export permit 20 + match ip address prefix-list pl-ospf +route-map ospf-export deny 9999 +! +router ospf + router-id 198.19.0.2 + redistribute connected route-map ospf-export + redistribute static route-map ospf-export + network 198.19.0.0/16 area 0 +! +router ipv6 ospf + router-id 198.19.0.2 + redistribute connected route-map ospf-export + redistribute static route-map ospf-export +! +ip route 198.19.0.0/16 null0 +ipv6 route 2001:678:d78:500::/56 null0 +``` + +I also set a static discard by means of a _nullroute_, for the space beloning to the private +network. This way, packets will not loop around if there is not a more specific for them in OSPF. +The route-map ensures that I'll only be advertising _our space_, even if the switches eventually get +connected to other networks, for example some out-of-band access mechanism. + +Next up, enabling LDP and MPLS, which is very straight forward. In my interfaces, I'll add the +***label-switching*** and ***enable-ldp*** keywords, as well as ensure that the OSPF and OSPFv3 +speakers on these interfaces know that they are in __point-to-point__ mode. For the cost, I will +start off with the cost in tenths of milliseconds, in other words, if the latency between chbtl0 and +chrma0 is 0.8ms, I will set the cost to 8: + +``` +interface vlan400 + description Core: msw0.chbtl0.net.ipng.ch (Init7) + mtu 9172 + label-switching + ip address 198.19.2.1/31 + ipv6 address 2001:678:d78:501::2/112 + ip ospf network point-to-point + ip ospf cost 8 + ip ospf bfd + ipv6 ospf network point-to-point + ipv6 ospf cost 8 + ipv6 router ospf area 0 + enable-ldp +! +router ldp + router-id 198.19.0.2 + transport-address 198.19.0.2 +! +``` + +The rest is really just rinse-and-repeat. 
I loop around all relevant interfaces, and see all of +OSPF, OSPFv3, and LDP adjacencies form: + +``` +msw0.chrma0# show ip ospf nei +OSPF process 0: +Neighbor ID Pri State Dead Time Address Interface +198.19.0.0 1 Full/ - 00:00:35 198.19.2.0 vlan400 +198.19.0.3 1 Full/ - 00:00:39 198.19.2.5 eth-0-38 + +msw0.chrma0# show ipv6 ospf nei +OSPFv3 Process (0) +Neighbor ID Pri State Dead Time Interface Instance ID +198.19.0.0 1 Full/ - 00:00:37 vlan400 0 +198.19.0.3 1 Full/ - 00:00:39 eth-0-38 0 + +msw0.chrma0# show ldp session +Peer IP Address IF Name My Role State KeepAlive +198.19.0.0 vlan400 Active OPERATIONAL 30 +198.19.0.3 eth-0-38 Active OPERATIONAL 30 +``` + +### Connectivity + +{{< image float="right" src="/assets/mpls-core/backbone.svg" alt="Backbone" >}} + +And after I'm done with this heavy lifting, I can now build MPLS services (like L2VPN and VPLS) on +these three switches. But as you may remember, IPng is in a few more sites than just +Brüttisellen, Rümlang and Glattbrugg. While a lot of work, retrofitting every +site in exactly the same way is not mentally challenging, so I'm not going to spend a lot of words +describing it. Wax on, wax off. + +Once I'm done though, the (MPLS) network looks a little bit like this. What's really cool about it, +is that it's an fully capable IPv4 and IPv6 network running OSPF and OSPFv3, LDP and MPLS services, +albeit one that's not connected to the internet, yet. This means that I've successfully created both +a completely private network that spans all sites we have active equipment in, but also did not +stand in the way of our public facing (VPP) routers in AS8298. Customers haven't noticed a single +thing, except now they can benefit from any L2 services (using MPLS tunnels or VPLS clouds) from any +of our sites. Neat! + +Our VPP routers are connected through the switches, (carrier) L2VPN and WDM waves just as they were +before, but carried transparently by the Centec switches. Performance wise, there is no regression, +because the switches do line rate L2/MPLS switching and L3 forwarding. This means that the VPP +routers, except for having a little detour in-and-out the switch for their long haul, have the same +throughput as they had before. + +I will deploy three additional features, to make this new private network a fair bit more powerful: + +**1. Site Local Connectivity** + +Each switch gets what is called an IPng Site Local (or _ipng-sl_) interface. This is a /27 IPv4 and +a /64 IPv6 that is bound on a local VLAN on each switch on our private network. Remember: the links +_between_ sites are no longer switched, they are _routed_ and pass ethernet frames only using MPLS. +I can connect for example all of the fleet's hypervisors to this internal network. 
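+For illustration, such a site-local interface looks just like the core-facing interfaces shown
+earlier, only with a /27 and a /64 out of the site-local ranges bound to a local VLAN, and with
+jumboframes enabled. A sketch (the VLAN ID and the specific addresses here are made up for the
+example, not the actual allocation):
+
+```
+interface vlan100
+ description Infra: IPng Site Local (ipng-sl)
+ mtu 9000
+ ip address 198.19.4.33/27
+ ipv6 address 2001:678:d78:502::1/64
+```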
I have given our +three bastion _jumphosts_ (Squanchy, Glootie and Pencilvester) an address on this internal network +as well, just look at this beautiful result: + +``` +pim@hvn0-ddln0:~$ traceroute hvn0.nlams3.net.ipng.ch +traceroute to hvn0.nlams3.net.ipng.ch (198.19.4.98), 64 hops max, 40 byte packets + 1 msw0.ddln0.net.ipng.ch (198.19.4.129) 1.488 ms 1.233 ms 1.102 ms + 2 msw0.chrma0.net.ipng.ch (198.19.2.1) 2.138 ms 2.04 ms 1.949 ms + 3 msw0.defra0.net.ipng.ch (198.19.2.13) 6.207 ms 6.288 ms 7.862 ms + 4 msw0.nlams0.net.ipng.ch (198.19.2.14) 13.424 ms 13.459 ms 13.513 ms + 5 hvn0.nlams3.net.ipng.ch (198.19.4.98) 12.221 ms 12.131 ms 12.161 ms + +pim@hvn0-ddln0:~$ iperf3 -6 -c hvn0.nlams3.net.ipng.ch -P 10 +Connecting to host hvn0.nlams3, port 5201 +- - - - - - - - - - - - - - - - - - - - - - - - - +[ 5] 9.00-10.00 sec 60.0 MBytes 503 Mbits/sec 0 1.47 MBytes +[ 7] 9.00-10.00 sec 71.2 MBytes 598 Mbits/sec 0 1.73 MBytes +[ 9] 9.00-10.00 sec 61.2 MBytes 530 Mbits/sec 0 1.30 MBytes +[ 11] 9.00-10.00 sec 91.2 MBytes 765 Mbits/sec 0 2.16 MBytes +[ 13] 9.00-10.00 sec 88.8 MBytes 744 Mbits/sec 0 2.13 MBytes +[ 15] 9.00-10.00 sec 62.5 MBytes 524 Mbits/sec 0 1.57 MBytes +[ 17] 9.00-10.00 sec 60.0 MBytes 503 Mbits/sec 0 1.47 MBytes +[ 19] 9.00-10.00 sec 65.0 MBytes 561 Mbits/sec 0 1.39 MBytes +[ 21] 9.00-10.00 sec 61.2 MBytes 530 Mbits/sec 0 1.24 MBytes +[ 23] 9.00-10.00 sec 63.8 MBytes 535 Mbits/sec 0 1.58 MBytes +[SUM] 9.00-10.00 sec 685 MBytes 5.79 Gbits/sec 0 +... +[SUM] 0.00-10.00 sec 7.38 GBytes 6.34 Gbits/sec 177 sender +[SUM] 0.00-10.02 sec 7.37 GBytes 6.32 Gbits/sec receiver +``` + +**2. Egress Connectivity** + +Having a private network is great, as it allows me to run the entire internal environment with 9000 +byte jumboframes, mix IPv4 and IPv6, segment off background tasks such as ZFS replication and +borgbackup between physical sites, and employ monitoring with Prometheus and LibreNMS and log in +safely with SSH or IPMI without ever needing to leave the safety of the walled garden that is +**198.19.0.0/16**. + +Hypervisors will now typically get a management interface _only_ in this network, and for them to be +able to do things like run _apt upgrade_, some remote repositories will need to be reachable over +IPv4 as well. For this, I decide to add three internet gateways, which will have one leg into the +private network, and one leg out into the world. For IPv4 they'll provide NAT, and for IPv6 they'll +ensure only _trusted_ traffic can enter the private network. + +These gateways will: + +* Connect to the internal network with OSPF and OSPFv3: + * They will learn 198.19.0.0/16, 2001:687:d78:500::/56 and their more specifics from it + * They will inject a default route for 0.0.0.0/0 and ::/0 to it +* Connect to AS8298 with BGP: + * They will receive a default IPv4 and IPv6 route from AS8298 + * They will announce the two aggregate prefixes to it with **no-export** community set +* Provide a WireGuard endpoint to allow remote management: + * Clients will be put in 192.168.6.0/24 and 2001:678:d78:300::/56 + * These ranges will be announced both to AS8298 externally and to OSPF internally + +This provides dynamic routing at its best. If the gateway, the physical connection to the internal +network, or the OSPF adjacency is down, AS8298 will not learn the routes into the internal network +at this node. 
If the gateway, the physical connection to the external network, or the BGP adjacency
+is down, the Centec switch will not pick up the default routes, and no traffic will be sent through
+it. By having three such nodes geographically separated (one in Brüttisellen, one in
+Plan-les-Ouates and one in Amsterdam), I am very likely to have stable and resilient connectivity.
+
+At the same time, these three machines serve as WireGuard endpoints to be able to remotely manage
+the network. For this purpose, I've carved out **192.168.6.0/26** and **2001:678:d78:300::/56** and
+will hand out IP addresses from those to clients. I'd like these two networks to have access to the
+internal private network as well.
+
+The Bird2 OSPF configuration for one of the nodes (in Brüttisellen) looks like this:
+
+```
+filter ospf_export {
+  if (net.type = NET_IP4 && net ~ [ 0.0.0.0/0, 192.168.6.0/26 ]) then accept;
+  if (net.type = NET_IP6 && net ~ [ ::/0, 2001:678:d78:300::/64 ]) then accept;
+  if (source = RTS_DEVICE) then accept;
+  reject;
+}
+
+filter ospf_import {
+  if (net.type = NET_IP4 && net ~ [ 198.19.0.0/16 ]) then accept;
+  if (net.type = NET_IP6 && net ~ [ 2001:678:d78:500::/56 ]) then accept;
+  reject;
+}
+
+protocol ospf v2 ospf4 {
+  debug { events };
+  ipv4 { export filter ospf_export; import filter ospf_import; };
+  area 0 {
+    interface "lo" { stub yes; };
+    interface "wg0" { stub yes; };
+    interface "ipng-sl" { type broadcast; cost 15; bfd on; };
+  };
+}
+
+protocol ospf v3 ospf6 {
+  debug { events };
+  ipv6 { export filter ospf_export; import filter ospf_import; };
+  area 0 {
+    interface "lo" { stub yes; };
+    interface "wg0" { stub yes; };
+    interface "ipng-sl" { type broadcast; cost 15; bfd off; };
+  };
+}
+```
+
+The ***ospf_export*** filter is what we're telling the Centec switches. Here, precisely the default
+route and the WireGuard space are announced, in addition to connected routes. The ***ospf_import***
+is what we're willing to learn from the Centec switches, and here we will accept exactly the
+aggregate **198.19.0.0/16** and **2001:678:d78:500::/56** prefixes belonging to the private internal
+network.
+
+The Bird2 BGP configuration for this gateway then looks like this:
+
+```
+filter bgp_export {
+  if (net.type = NET_IP4 && ! (net ~ [ 198.19.0.0/16, 192.168.6.0/26 ])) then reject;
+  if (net.type = NET_IP6 && ! (net ~ [ 2001:678:d78:500::/56, 2001:678:d78:300::/64 ])) then reject;
+
+  # Add BGP Wellknown community no-export (FFFF:FF01)
+  bgp_community.add((65535,65281));
+  accept;
+}
+
+template bgp T_GW4 {
+  local as 64512;
+  source address 194.1.163.72;
+  default bgp_med 0;
+  default bgp_local_pref 400;
+  ipv4 { import all; export filter bgp_export; next hop self on; };
+}
+
+template bgp T_GW6 {
+  local as 64512;
+  source address 2001:678:d78:3::72;
+  default bgp_med 0;
+  default bgp_local_pref 400;
+  ipv6 { import all; export filter bgp_export; next hop self on; };
+}
+
+protocol bgp chbtl0_ipv4_1 from T_GW4 { neighbor 194.1.163.66 as 8298; };
+protocol bgp chbtl1_ipv4_1 from T_GW4 { neighbor 194.1.163.67 as 8298; };
+protocol bgp chbtl0_ipv6_1 from T_GW6 { neighbor 2001:678:d78:3::2 as 8298; };
+protocol bgp chbtl1_ipv6_1 from T_GW6 { neighbor 2001:678:d78:3::3 as 8298; };
+```
+
+The ***bgp_export*** filter is where we restrict our announcements to only exactly the prefixes
+we've learned from the Centec, and WireGuard. We'll set the _no-export_ BGP community on it, which
+will allow the prefixes to live in AS8298 but never be announced to any eBGP peers. If any of the
+machine, the BGP session, the WireGuard interface, or the default route were missing, they would
+simply not be announced. In the other direction, if the Centec is not feeding the gateway its
+prefixes via OSPF, the BGP session may be up, but it will not be propagating these prefixes, and the
+gateway will not attract network traffic to it. There are two BGP uplinks to AS8298 here, which also
+provides resilience in case one of them is down for maintenance or in fault condition. __N+k__ is a
+great rule to live by, when it comes to network engineering.
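+
+Before moving on, it helps to ask BIRD directly whether the filters do what I intend. A quick
+sanity check could look like this (just a sketch: the protocol names match the configuration above,
+and I'm leaving the output out here):
+
+```
+# Which prefixes would this gateway export towards AS8298?
+birdc 'show route export chbtl0_ipv4_1'
+birdc 'show route export chbtl0_ipv6_1'
+
+# And which routes were learned from the Centec switches via OSPF?
+birdc 'show route protocol ospf4'
+birdc 'show route protocol ospf6'
+```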
+
+The last two things I should provide on each gateway are **(A)** a NAT translator from internal to
+external, and **(B)** a firewall that ensures only authorized traffic gets passed to the Centec
+network.
+
+First, I'll provide an IPv4 NAT translation to the internet facing AS8298 (`ipng`), for traffic
+that is coming from WireGuard or the private network, while allowing it to pass _between_ the two
+networks without performing NAT. The first rule says to jump to __ACCEPT__ (skipping the NAT rules)
+if the source is WireGuard. The second two rules say to provide NAT towards the internet for any
+traffic coming from WireGuard or the private network. The fourth and last rule says to provide NAT
+towards the _internal_ private network, so that anything trying to get into the network will be
+coming from an address in **198.19.0.0/16** as well. Here they are:
+
+```
+iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -o ipng-sl -j ACCEPT
+iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -o ipng -j MASQUERADE
+iptables -t nat -A POSTROUTING -s 198.19.0.0/16 -o ipng -j MASQUERADE
+iptables -t nat -A POSTROUTING -o ipng-sl -j MASQUERADE
+```
+
+**3. Ingress Connectivity**
+
+For inbound traffic, the rules are similarly permissive for _trusted_ sources but otherwise prohibit
+any passing traffic. Prefixes are allowed to be forwarded from WireGuard, and from some (not
+disclosed, cuz I'm not stoopid!) trusted prefixes for IPv4 and IPv6, but ultimately, if not
+specified, the forwarding tables end in a default policy of __DROP__, which means no traffic will be
+passed into the WireGuard or Centec internal networks unless explicitly allowed here:
+
+```
+iptables -P FORWARD DROP
+ip6tables -P FORWARD DROP
+for SRC4 in 192.168.6.0/24 ...; do
+  iptables -I FORWARD -s $SRC4 -j ACCEPT
+done
+for SRC6 in 2001:678:d78:300::/56 ...; do
+  ip6tables -I FORWARD -s $SRC6 -j ACCEPT
+done
+```
+
+With that, any machine in the Centec (and WireGuard) private internal network will have full access
+to one another, and they will be NATed to the internet through these three (N+2) gateways. If I
+turn one of them off, things look like this:
+
+```
+pim@hvn0-ddln0:~$ traceroute 8.8.8.8
+traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
+ 1  msw0.ddln0.net.ipng.ch (198.19.4.129)  0.733 ms  1.040 ms  1.340 ms
+ 2  msw0.chrma0.net.ipng.ch (198.19.2.6)  1.249 ms  1.555 ms  1.799 ms
+ 3  msw0.chbtl0.net.ipng.ch (198.19.2.0)  2.733 ms  2.840 ms  2.974 ms
+ 4  hvn0.chbtl0.net.ipng.ch (198.19.4.2)  1.447 ms  1.423 ms  1.402 ms
+ 5  chbtl0.ipng.ch (194.1.163.66)  1.672 ms  1.652 ms  1.632 ms
+ 6  chrma0.ipng.ch (194.1.163.17)  2.414 ms  2.431 ms  2.322 ms
+ 7  as15169.lup.swissix.ch (91.206.52.223)  2.353 ms  2.331 ms  2.311 ms
+ ...
+
+pim@hvn0-chbtl0:~$ sudo systemctl stop bird
+
+pim@hvn0-ddln0:~$ traceroute 8.8.8.8
+traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
+ 1  msw0.ddln0.net.ipng.ch (198.19.4.129)  0.770 ms  1.058 ms  1.311 ms
+ 2  msw0.chrma0.net.ipng.ch (198.19.2.6)  1.251 ms  1.662 ms  2.036 ms
+ 3  msw0.chplo0.net.ipng.ch (198.19.2.22)  5.828 ms  5.455 ms  6.064 ms
+ 4  hvn1.chplo0.net.ipng.ch (198.19.4.163)  4.901 ms  4.879 ms  4.858 ms
+ 5  chplo0.ipng.ch (194.1.163.145)  4.867 ms  4.958 ms  5.113 ms
+ 6  chrma0.ipng.ch (194.1.163.50)  9.274 ms  9.306 ms  9.313 ms
+ 7  as15169.lup.swissix.ch (91.206.52.223)  10.168 ms  10.127 ms  10.090 ms
+ ...
+```
+
+{{< image width="200px" float="right" src="/assets/mpls-core/swedish_chef.jpg" alt="Chef's Kiss" >}}
+
+**How cool is that :)** First I do a traceroute from the hypervisor pool in the DDLN colocation site,
+which finds its closest default at `msw0.chbtl0.net.ipng.ch` which exits via `hvn0.chbtl0` and into
+the public internet. Then, I shut down bird on that hypervisor/gateway, which means it won't be
+advertising the default into the private network, nor will it be picking up traffic to/from it.
+About one second later, the next default route is found to be at `msw0.chplo0.net.ipng.ch` over its
+hypervisor in Geneva (note, 4ms down the line), after which the egress is performed at `hvn1.chplo0`
+into the public internet. Of course, it's then sent back to Zurich to still find its way to Google at
+SwissIX, but the only penalty is a scenic route: looping from Brüttisellen to Geneva and back
+adds pretty much 8ms of end to end latency.
+
+Just look at that beautiful resilience at play. Chef's kiss.
+
+## What's next
+
+The ring hasn't been fully deployed yet. I am waiting on a backorder of switches from Starry
+Networks, due to arrive early April. The delivery of those will allow me to deploy in Paris and Lille,
+hopefully in a cool roadtrip with Fred :)
+
+But, I got pretty far, so what's next for me is the following few fun things:
+
+1. Start offering EoMPLS / L2VPN / VPLS services to IPng customers. Who wants some?!
+1. Move replication traffic from the current public internet, towards the internal _private_ network.
+This can both leverage 9000 byte jumboframes and use wirespeed forwarding from the Centec
+network gear.
+1. Move all unneeded IPv4 addresses into the _private_ network, such as maintenance and management
+/ controlplane, route reflectors, backup servers, hypervisors, and so on.
+1. Move frontends to be dual-homed as well: one leg towards AS8298 using Public IPv4 and IPv6
+addresses, and then finding backend servers in the private network (think of it like an NGINX
+frontend that terminates the HTTP/HTTPS connection [_SSL is inserted and removed here :)_], and then
+has one or more backend servers in the private network). This can be useful for Mastodon, Peertube,
+and of course our own websites.
diff --git a/content/articles/2023-03-17-ipng-frontends.md b/content/articles/2023-03-17-ipng-frontends.md
new file mode 100644
index 0000000..aca1130
--- /dev/null
+++ b/content/articles/2023-03-17-ipng-frontends.md
@@ -0,0 +1,481 @@
+---
+date: "2023-03-17T10:56:54Z"
+title: 'Case Study: Site Local NGINX'
+---
+
+A while ago I rolled out an important change to the IPng Networks design: I inserted a bunch of
+[[Centec MPLS](https://starry-networks.com)] and IPv4/IPv6 capable switches underneath
+[[AS8298]({% post_url 2021-02-27-network %})], which gave me two specific advantages:
+
+1. The entire IPng network is now capable of delivering L2VPN services, taking the form of MPLS
+point-to-point ethernet, and VPLS, as shown in a previous [[deep dive]({% post_url
+2022-12-09-oem-switch-2 %})], in addition to IPv4 and IPv6 transit provided by VPP in an elaborate
+and elegant [[BGP Routing Policy]({% post_url 2021-11-14-routing-policy %})].
+
+1. A new internal private network becomes available to any device connected to IPng switches, with
+addressing in **198.19.0.0/16** and **2001:678:d78:500::/56**. This network is completely isolated
+from the Internet, with access controlled via N+2 redundant gateways/firewalls, described in more
+detail in a previous [[deep dive]({% post_url 2023-03-11-mpls-core %})] as well.
+
+## Overview
+
+{{< image width="220px" float="left" src="/assets/ipng-frontends/soad.png" alt="Toxicity" >}}
+
+After rolling out this spiffy BGP Free [[MPLS Core]({% post_url 2023-03-11-mpls-core %})], I wanted
+to take a look at maybe conserving a few IP addresses here and there, as well as tightening access
+and protecting the more important machines that IPng Networks runs. You see, most enterprise
+networks will include a bunch of internal services, like databases, network attached storage, backup
+servers, network monitoring, billing/CRM et cetera. IPng Networks is no different.
+
+Somewhere between the sacred silence and sleep, lives my little AS8298. It's a gnarly and toxic
+place out there in the DFZ, how do you own disorder?
+
+### Connectivity
+
+{{< image float="right" src="/assets/mpls-core/backbone.svg" alt="Backbone" >}}
+
+As a refresher, here's the current situation at IPng Networks:
+
+**1. Site Local Connectivity**
+
+Each switch gets what is called an IPng Site Local (or _ipng-sl_) interface. This is a /27 IPv4 and
+a /64 IPv6 that is bound on a local VLAN on each switch on our private network. Remember: the links
+_between_ sites are no longer switched, they are _routed_ and pass ethernet frames only using MPLS.
+I can connect, for example, all of the fleet's hypervisors to this internal network with jumboframes,
+using **198.19.0.0/16** and **2001:678:d78:500::/56**, which are not connected to the internet.
+
+**2. Egress Connectivity**
+
+There are three geographically diverse gateways that inject an _OSPF E1_ default route into the
+Centec Site Local network, and they will provide NAT for IPv4 and IPv6 to the internet. This setup
+allows all machines in the internal private network to reach the internet, using their closest
+gateway. Failing over between gateways is fully automatic: when one is unavailable or down for
+maintenance, the network will simply find the next-closest gateway.
+
+**3. Ingress Connectivity**
+
+Inbound traffic (from the internet to IPng Site Local) is held at the gateways. First of all, the
+reserved IPv4 space **198.18.0.0/15** is a bogon and will not be routed on the public internet, but
+our VPP routers in AS8298 do carry the route, albeit with the well-known BGP _no-export_ community
+set, so traffic could arrive at the gateway coming from our own network only. This is not true for
+IPv6, because here our prefix is a part of the AS8298 IPv6 PI space, and traffic will be globally
+routable. Even then, only very few prefixes are allowed to enter into the IPng Site Local private
+network, nominally only our NOC prefixes, one or two external bastion hosts, and our own WireGuard
+endpoints which are running on the gateways.
+
+### Frontend Setup
+
+One of my goals for the private network is IPv4 conservation. I decided to move our web-frontends to
+be dual-homed: one network interface towards the internet using public IPv4 and IPv6 addresses, and
+another network interface that finds backend servers in the IPng Site Local private network.
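+
+What dual-homed means in practice: each frontend has one leg with public addresses facing AS8298,
+and one leg in IPng Site Local with a 9000 byte MTU. Purely as an illustration (interface names and
+all addresses below are made up, ifupdown syntax assumed, IPv6 omitted for brevity), such a machine
+could be configured like this:
+
+```
+# /etc/network/interfaces -- illustrative only
+auto ens160
+iface ens160 inet static
+    # public leg, faces the internet via AS8298 (placeholder addressing)
+    address 192.0.2.10/24
+    gateway 192.0.2.1
+
+auto ens192
+iface ens192 inet static
+    # private leg, IPng Site Local, jumboframes
+    address 198.19.4.200/27
+    mtu 9000
+```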
+
+This way, I can have one NGINX instance (or a pool of them) terminate the HTTP/HTTPS connection
+(there's an InfraSec joke about _SSL is inserted and removed here :)_), no matter how many websites,
+domains, or physical webservers I want to use. Some SSL certificate providers allow for wildcards
+(ie. `*.ipng.ch`), but I'm going to keep it relatively simple and use [[Let's
+Encrypt](https://letsencrypt.org/)] which offers free certificates with a validity of three months.
+
+#### Installing NGINX
+
+First, I will install three _minimal_ VMs with Debian Bullseye on separate hypervisors (in
+Rümlang `chrma0`, Plan-les-Ouates `chplo0` and Amsterdam `nlams1`), giving them each 4 CPUs,
+a 16G blockdevice on the hypervisor's ZFS (which is automatically snapshotted and backed up offsite
+using ZFS replication!), and 1GB of memory. These machines will be the IPng Frontend servers, and
+handle all client traffic to our web properties. Their job is to forward that HTTP/HTTPS traffic
+internally to webservers that are running entirely in the IPng Site Local (private) network.
+
+I'll install a few table-stakes packages on them, taking `nginx0.chrma0` as an example:
+
+```
+pim@nginx0-chrma0:~$ sudo apt install nginx iptables ufw rsync
+pim@nginx0-chrma0:~$ sudo ufw allow 80
+pim@nginx0-chrma0:~$ sudo ufw allow 443
+pim@nginx0-chrma0:~$ sudo ufw allow from 198.19.0.0/16
+pim@nginx0-chrma0:~$ sudo ufw allow from 2001:678:d78:500::/56
+pim@nginx0-chrma0:~$ sudo ufw enable
+```
+
+#### Installing Lego
+
+Next, I'll install one more highly secured _minimal_ VM with Debian Bullseye, giving it 1 CPU, a 16G
+blockdevice and 1GB of memory. This is where my Let's Encrypt SSL certificate store will live. This
+machine does not need to be publicly available, so it will only get one interface, connected to the
+IPng Site Local network, so it'll be using private IPs.
+
+This virtual machine really is bare-bones, it only gets a firewall, rsync, and the _lego_ package.
+It doesn't technically even need to run SSH, because I can log into the serial console using the
+hypervisor. Considering it's an internal-only server (not connected to the internet), but also
+because I do believe in OpenSSH's track record of safety, I decide to leave SSH enabled:
+
+```
+pim@lego:~$ sudo apt install ufw lego rsync
+pim@lego:~$ sudo ufw allow 8080
+pim@lego:~$ sudo ufw allow 22
+pim@lego:~$ sudo ufw enable
+```
+
+Now that all four machines are set up and appropriately filtered (using a simple `ufw` Debian package):
+* NGINX will allow port 80 and 443 for public facing web traffic, and is permissive for the IPng Site
+  Local network, to allow SSH for rsync and maintenance tasks
+* LEGO will be entirely closed off, allowing access only from trusted sources for SSH, and to one
+  TCP port 8080 on which `HTTP-01` certificate challenges will be served.
+
+I make a pre-flight check to make sure that jumbo frames are possible from the frontends into the
+backend network.
+ +``` +pim@nginx0-nlams1:~$ traceroute lego +traceroute to lego (198.19.4.6), 30 hops max, 60 byte packets + 1 msw0.nlams0.net.ipng.ch (198.19.4.97) 0.737 ms 0.958 ms 1.155 ms + 2 msw0.defra0.net.ipng.ch (198.19.2.22) 6.414 ms 6.748 ms 7.089 ms + 3 msw0.chrma0.net.ipng.ch (198.19.2.7) 12.147 ms 12.315 ms 12.401 ms + 2 msw0.chbtl0.net.ipng.ch (198.19.2.0) 12.685 ms 12.429 ms 12.557 ms + 3 lego.net.ipng.ch (198.19.4.7) 12.916 ms 12.864 ms 12.944 ms + +pim@nginx0-nlams1:~$ ping -c 3 -6 -M do -s 8952 lego +PING lego(lego.net.ipng.ch (2001:678:d78:503::6)) 8952 data bytes +8960 bytes from lego.net.ipng.ch (2001:678:d78:503::7): icmp_seq=1 ttl=62 time=13.33 ms +8960 bytes from lego.net.ipng.ch (2001:678:d78:503::7): icmp_seq=2 ttl=62 time=13.52 ms +8960 bytes from lego.net.ipng.ch (2001:678:d78:503::7): icmp_seq=3 ttl=62 time=13.28 ms + +--- lego ping statistics --- +3 packets transmitted, 3 received, 0% packet loss, time 4005ms +rtt min/avg/max/mdev = 13.280/13.437/13.590/0.116 ms + +pim@nginx0-nlams1:~$ ping -c 5 -3 -M do -s 8972 lego +PING (198.19.4.6) 8972(9000) bytes of data. +8980 bytes from lego.net.ipng.ch (198.19.4.7): icmp_seq=1 ttl=62 time=12.85 ms +8980 bytes from lego.net.ipng.ch (198.19.4.7): icmp_seq=2 ttl=62 time=12.82 ms +8980 bytes from lego.net.ipng.ch (198.19.4.7): icmp_seq=3 ttl=62 time=12.91 ms + +--- ping statistics --- +3 packets transmitted, 3 received, 0% packet loss, time 4007ms +rtt min/avg/max/mdev = 12.823/12.843/12.913/0.138 ms +``` + +A note on the size used: An IPv4 header is 20 bytes, an IPv6 header is 40 bytes, and an ICMP header +is 8 bytes. If the MTU defined on the network is 9000, then the size of the ping payload can be +9000-20-8=**8972** bytes for IPv4 and 9000-40-8=**8952** for IPv6 packets. Using jumboframes +internally is a small optimization for the benefit of the internal webservers - less packets/sec +means more throughput and performance in general. It's also cool :) + +#### CSRs and ACME, oh my! + +In the old days, (and indeed, still today in many cases!) operators would write a Certificate +Signing Request or _CSR_ with the pertinent information for their website, and the SSL authority would +then issue a certificate, send it to the operator via e-mail (or would you believe it, paper mail), +after which the webserver operator could install and use the cert. + +Today, most SSL authorities and their customers use the Automatic Certificate Management Environment +or _ACME protocol_ which is described in [[RFC8555](https://www.rfc-editor.org/rfc/rfc8555)]. It +defines a way for certificate authorities to check the websites that they are asked to issue a +certificate for using so-called challenges. There are several challenge types to choose from, but +the one I'll be focusing on is called `HTTP-01`. These challenges are served from a well known +URI, unsurprisingly in the path `/.well-known/...`, as described in [[RFC5785](https://www.rfc-editor.org/rfc/rfc5785)]. + +{{< image width="200px" float="right" src="/assets/ipng-frontends/certbot.svg" alt="Certbot" >}} + +***Certbot***: Usually when running a webserver with SSL enabled, folks will use the excellent +[[Certbot](https://certbot.eff.org/)] tool from the electronic frontier foundation. This tool is +really smart, and has plugins that can automatically take a webserver running common server +software like Apache, Nginx, HAProxy or Plesk, figure out how you configured the webserver (which +hostname, options, etc), request a certificate and rewrite your configuration. 
What I find a nice +touch is that it automatically installs certificate renewal using a crontab. + +{{< image width="200px" float="right" src="/assets/ipng-frontends/lego-logo.min.svg" alt="LEGO" >}} +***LEGO***: A Let’s Encrypt client and ACME library written in Go +[[ref](https://go-acme.github.io/lego/)] and it's super powerful, able to solve for multiple ACME +challenges, and tailored to work well with Let's Encrypt as a certificate authority. The `HTTP-01` +challenge works as follows: when an operator wants to prove that they own a given domain name, the +CA can challenge the client to host a mutually agreed upon random number at a random URL under their +webserver's `/.well-known/acme-challenge/` on port 80. The CA will send an HTTP GET to this random +URI and expect the number back in the response. + +#### Shared SSL at Edge + +Because I will be running multiple frontends in different locations, it's operationally tricky to serve +this `HTTP-01` challenge random number in a randomly named **file** on all three NGINX servers. But +while the LEGO client can write the challenge file directly into a file in the webroot of a server, it +can _also_ run as an HTTP **server** with the sole purpose of responding to the challenge. + +{{< image width="500px" float="left" src="/assets/ipng-frontends/acme-flow.svg" alt="ACME Flow" >}} + +This is a killer feature: if I point the `/.well-known/acme-challenge/` URI on all the NGINX servers +to the one LEGO instance running centrally, it no longer matters which of the NGINX servers Let's +Encrypt will try to use to solve the challenge - they will all serve the same thing! The LEGO client +will construct the challenge request, ask Let's Encrypt to send the challenge, and then serve the +response. The only thing left to do then is copy the resulting certificate to the frontends. + +Let me demonstrate how this works, by taking an example based on four websites, none of which run on +servers that are reachable from the internet: [[go.ipng.ch](https://go.ipng.ch/)], +[[video.ipng.ch](https://video.ipng.ch/)], [[billing.ipng.ch](https://billing.ipng.ch/)] and +[[grafana.ipng.ch](https://grafana.ipng.ch/)]. These run on four separate virtual machines (or +docker containers), all within the IPng Site Local private network in **198.19.0.0/16** and +**2001:678:d78:500::/56** which aren't reachable from the internet. + +Ready? Let's go! + +``` +lego@lego:~$ lego --path /etc/lego/ --http --http.port :8080 --email=noc@ipng.ch \ + --domains=nginx0.ipng.ch --domains=grafana.ipng.ch --domains=go.ipng.ch \ + --domains=video.ipng.ch --domains=billing.ipng.ch \ + run +``` + +The flow of requests is as follows: + +1. The _LEGO_ client contacts the Certificate Authority and requests validation for a list of the +cluster hostname `nginx0.ipng.ch` and the additional four domains. It asks the CA to perform an +`HTTP-01` challenge. The CA will share two random numbers with _LEGO_, which will start a +webserver on port 8080 and serve the URI `/.well-known/acme-challenge/$(NUMBER1)`. + +1. The CA will now resolve the A/AAAA addresses for the domain (`grafana.ipng.ch`), which is a CNAME +for the cluster (`nginx0.ipng.ch`), which in turn has multiple A/AAAA pointing to the three machines +associated with it. Visit any one of the _NGINX servers_ on that negotiated URI, and they will forward +requests for `/.well-known/acme-challenge` internally back to the machine running LEGO on its port 8080. + +1. 
The _LEGO_ client will know that it's going to be visited on the URI
+`/.well-known/acme-challenge/$(NUMBER1)`, as it has negotiated that with the CA in step 1. When the
+challenge request arrives, LEGO will know to respond using the contents as agreed upon in
+`$(NUMBER2)`.
+
+1. After validating that the response on the random URI contains the agreed upon random number, the
+CA knows that the operator of the webserver is the same as the certificate requestor for the domain.
+It issues a certificate to the _LEGO_ client, which stores it on its local filesystem.
+
+1. The _LEGO_ machine finally distributes the private key and certificate to all NGINX machines, which
+are now capable of serving SSL traffic under the given names.
+
+This sequence is done for each of the domains (and indeed, any other domain I'd like to add), and in
+the end a bundled certificate with the common name `nginx0.ipng.ch` and the four additional alternate
+names is issued and saved in the certificate store. Up until this point, NGINX has been operating in
+**clear text**, that is to say the CA has issued the ACME challenge on port 80, and NGINX has
+forwarded it internally to the machine running _LEGO_ on its port 8080 without using encryption.
+
+Taking a look at the certificate that I'll install in the NGINX frontends (note: never share your
+`.key` material, but `.crt` files are public knowledge):
+
+```
+lego@lego:~$ openssl x509 -noout -text -in /etc/lego/certificates/nginx0.ipng.ch.crt
+...
+Certificate:
+    Data:
+        Version: 3 (0x2)
+        Serial Number:
+            03:db:3d:99:05:f8:c0:92:ec:6b:f6:27:f2:31:55:81:0d:10
+        Signature Algorithm: sha256WithRSAEncryption
+        Issuer: C = US, O = Let's Encrypt, CN = R3
+        Validity
+            Not Before: Mar 16 19:16:29 2023 GMT
+            Not After : Jun 14 19:16:28 2023 GMT
+        Subject: CN = nginx0.ipng.ch
+...
+        X509v3 extensions:
+            X509v3 Subject Alternative Name:
+                DNS:billing.ipng.ch, DNS:go.ipng.ch, DNS:grafana.ipng.ch,
+                DNS:nginx0.ipng.ch, DNS:video.ipng.ch
+```
+
+While the amount of output of this certificate is considerable, I've highlighted the cool bits. The
+_Subject_ (also called _Common Name_ or _CN_) of the cert is the first `--domains` entry, and the
+alternate names are that one plus all other `--domains` given when calling _LEGO_ earlier. In other
+words, this certificate is valid for all five DNS domain names. Sweet!
+
+#### NGINX HTTP Configuration
+
+I find it useful to think about the NGINX configuration in two parts: (1) the cleartext / non-ssl
+parts on port 80, and (2) the website itself that lives behind SSL on port 443. So in order, here's
+my configuration for the acme-challenge bits on port 80:
+
+```
+pim@nginx0-chrma0:~$ cat << 'EOF' | sudo tee /etc/nginx/conf.d/lego.inc
+location /.well-known/acme-challenge/ {
+  auth_basic off;
+  proxy_intercept_errors on;
+  proxy_http_version 1.1;
+  proxy_set_header Host $host;
+  proxy_pass http://lego.net.ipng.ch:8080;
+  break;
+}
+EOF
+
+pim@nginx0-chrma0:~$ cat << 'EOF' | sudo tee /etc/nginx/sites-available/go.ipng.ch.conf
+server {
+  listen [::]:80;
+  listen 0.0.0.0:80;
+
+  server_name go.ipng.ch go.net.ipng.ch go;
+  access_log /var/log/nginx/go.ipng.ch-access.log;
+
+  include "conf.d/lego.inc";
+
+  location / {
+    return 301 https://go.ipng.ch$request_uri;
+  }
+}
+EOF
+```
+
+The first file is an include-file that is shared between all websites I'll serve from this cluster.
+Its purpose is to forward any requests that start with the well-known ACME challenge URI onto the
+backend _LEGO_ virtual machine, without requiring any authorization. Then, the second snippet
+defines a simple webserver on port 80, giving it a few names (the FQDN `go.ipng.ch` but also two
+shorthands `go.net.ipng.ch` and `go`). Due to the include, the ACME challenge will be performed on
+port 80. All other requests will be rewritten and returned as a redirect to
+`https://go.ipng.ch/`. If you've ever wondered how folks are able to type http://go/foo and still
+avoid certificate errors, here's a cool trick that accomplishes that.
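+
+Before actually requesting a certificate, I like to check that this forwarding works end to end. A
+quick sketch of such a check (the URI here is made up, and what exactly comes back depends on
+whether the _LEGO_ webserver is running at that moment):
+
+```
+# From the outside: does the well-known path on the frontend get proxied at all?
+curl -v http://go.ipng.ch/.well-known/acme-challenge/test
+
+# From within IPng Site Local: is anything answering on the LEGO machine's port 8080?
+curl -v http://lego.net.ipng.ch:8080/.well-known/acme-challenge/test
+```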
+
+Actually, these two things are all that's needed to obtain the SSL cert from Let's Encrypt. I haven't
+even started a webserver on port 443 yet! To recap:
+
+* Listen ***only to*** `/.well-known/acme-challenge/` on port 80, and forward those requests to LEGO.
+* Rewrite ***all other*** port-80 traffic to `https://go.ipng.ch/` to avoid serving any unencrypted
+content.
+
+#### NGINX HTTPS Configuration
+
+Now that I have the SSL certificate in hand, I can start to write webserver configs to handle the SSL
+parts. I'll include a few common options to make SSL as safe as it can be (borrowed from Certbot),
+and then create the configs for the webserver itself:
+
+```
+pim@nginx0-chrma0:~$ cat << 'EOF' | sudo tee -a /etc/nginx/conf.d/options-ssl-nginx.inc
+ssl_session_cache shared:le_nginx_SSL:10m;
+ssl_session_timeout 1440m;
+ssl_session_tickets off;
+
+ssl_protocols TLSv1.2 TLSv1.3;
+ssl_prefer_server_ciphers off;
+
+ssl_ciphers "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:
+            ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:
+            DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384";
+EOF
+
+pim@nginx0-chrma0:~$ cat << 'EOF' | sudo tee /etc/nginx/sites-available/go.ipng.ch.conf
+server {
+  listen [::]:443 ssl http2;
+  listen 0.0.0.0:443 ssl http2;
+  ssl_certificate /etc/nginx/conf.d/nginx0.ipng.ch.crt;
+  ssl_certificate_key /etc/nginx/conf.d/nginx0.ipng.ch.key;
+  include /etc/nginx/conf.d/options-ssl-nginx.inc;
+  ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.pem;
+
+  server_name go.ipng.ch;
+  access_log /var/log/nginx/go.ipng.ch-access.log upstream;
+
+  location /edit/ {
+    proxy_pass http://git.net.ipng.ch:5000;
+    proxy_set_header Host $host;
+    proxy_set_header X-Real-IP $remote_addr;
+    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+    proxy_set_header X-Forwarded-Proto $scheme;
+
+    satisfy any;
+    allow 198.19.0.0/16;
+    allow 194.1.163.0/24;
+    allow 2001:678:d78::/48;
+    deny all;
+    auth_basic "Go Edits";
+    auth_basic_user_file /etc/nginx/conf.d/go.ipng.ch-htpasswd;
+  }
+
+  location / {
+    proxy_pass http://git.net.ipng.ch:5000;
+    proxy_set_header Host $host;
+    proxy_set_header X-Real-IP $remote_addr;
+    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+    proxy_set_header X-Forwarded-Proto $scheme;
+  }
+}
+EOF
+```
+
+The certificate and SSL options are loaded first from `/etc/nginx/conf.d/nginx0.ipng.ch.{crt,key}`.
+
+Next, I don't want folks on the internet to be able to create or edit/overwrite my go-links, so I'll
+add an ACL on the URI starting with `/edit/`. Either you come from a trusted IPv4/IPv6 prefix, in
+which case you can edit links at will, or alternatively you present a username and password that is
+stored in the `go.ipng.ch-htpasswd` file (created using the Debian package `apache2-utils`).
+
+Finally, all other traffic is forwarded internally to the machine `git.net.ipng.ch` on port 5000, where
+the go-link server is running as a Docker container. That server accepts requests from the IPv4 and
+IPv6 IPng Site Local addresses of all three NGINX frontends to its port 5000.
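+
+One piece the configs above don't show is how the key and certificate actually travel from the LEGO
+machine to the three frontends. A minimal sketch of how that could be done with the `rsync` package
+installed earlier -- the frontend hostnames here are my shorthand, and in practice this would be
+wrapped in a cronjob or CI step:
+
+```
+#!/bin/sh
+# Push the current bundle to each frontend and reload NGINX to pick it up.
+# Paths match the ssl_certificate / ssl_certificate_key directives above.
+for FE in nginx0-chrma0 nginx0-chplo0 nginx0-nlams1; do
+  rsync -av /etc/lego/certificates/nginx0.ipng.ch.crt \
+            /etc/lego/certificates/nginx0.ipng.ch.key \
+            ${FE}.net.ipng.ch:/etc/nginx/conf.d/
+  ssh ${FE}.net.ipng.ch sudo systemctl reload nginx
+done
+```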
+
+### Icing on the cake: Internal SSL
+
+The go-links server I described above doesn't itself speak SSL. It's meant to be frontended on the
+same machine by an Apache or NGINX or HAProxy which handles the client en- and decryption, and
+usually that frontend will be running on the same server, at which point I could just let it bind
+`localhost:5000`. However, the astute observer will point out that the traffic on the IPng Site
+Local network is cleartext. Now, I don't think that my go-links traffic poses a security or privacy
+threat, but certainly other sites (like `billing.ipng.ch`) are more touchy, and as such require
+end to end encryption on the network.
+
+In 2003, twenty years ago, a feature was added to TLS that allows the client to specify the hostname
+it was expecting to connect to, in a feature called _Server Name Indication_ or _SNI_, described in
+detail in [[RFC3546](https://www.rfc-editor.org/rfc/rfc3546)]:
+
+> [TLS] does not provide a mechanism for a client to tell a server the name of the server it is
+> contacting. It may be desirable for clients to provide this information to facilitate secure
+> connections to servers that host multiple 'virtual' servers at a single underlying network
+> address.
+>
+> In order to provide the server name, clients MAY include an extension of type "server_name" in
+> the (extended) client hello.
+
+Every modern webserver and browser can utilize the _SNI_ extension when talking to each other. NGINX
+can be configured to pass traffic along to the internal webserver by re-encrypting it with a new SSL
+connection. Considering the internal hostname will not necessarily be the same as the external website
+hostname, I can use _SNI_ to force the NGINX->Billing connection to re-use the `billing.ipng.ch`
+hostname:
+
+```
+  server_name billing.ipng.ch;
+  ...
+  location / {
+    proxy_set_header Host $host;
+    proxy_set_header X-Real-IP $remote_addr;
+    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+    proxy_set_header X-Forwarded-Proto $scheme;
+    proxy_read_timeout 60;
+
+    proxy_pass https://billing.net.ipng.ch:443;
+    proxy_ssl_name $host;
+    proxy_ssl_server_name on;
+  }
+```
+
+What happens here is that the upstream server is hit on port 443 with hostname `billing.net.ipng.ch`
+but the SNI value is set back to `$host` which is `billing.ipng.ch` (note, without the \*.net.ipng.ch
+domain). The cool thing is, now the internal webserver can reuse the same certificate! I can use the
+mechanism described here to obtain the bundled certificate, and then pass that key+cert along to the
+billing machine, and serve it there using the same certificate files as the frontend NGINX.
+
+### What's next
+
+Of course, the mission to save IPv4 addresses is achieved - I can now run dozens of websites behind
+these three IPv4 and IPv6 addresses, and security gets a little bit better too, as the webservers
+themselves are tucked away in IPng Site Local and unreachable from the public internet.
+
+This IPng Frontend design also helps with reliability and latency. I can put frontends in any
+number of places, and renumber them relatively easily (by adding or removing A/AAAA records to
+`nginx0.ipng.ch` and otherwise CNAMEing all my websites to that cluster-name). If load becomes an
+issue, NGINX has a bunch of features like caching, cookie-persistence, and loadbalancing with health
+checking (so I could use multiple backend webservers and round-robin over the healthy ones), and so
+on. Our Mastodon server on [[ublog.tech](https://ublog.tech)] or our Peertube server on
+[[video.ipng.ch](https://video.ipng.ch/)] can make use of many of these optimizations, but while I
+do love engineering, I am also super lazy, so I prefer not to prematurely over-optimize.
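+
+Since every public website name is just a CNAME to the cluster name, renumbering really does boil
+down to editing the A/AAAA records of `nginx0.ipng.ch`. A quick way to eyeball that everything points
+where it should (only the queries are shown here; the actual answers depend on the zone, of course):
+
+```
+# The cluster name should resolve to the public addresses of all three frontends
+dig +short A    nginx0.ipng.ch
+dig +short AAAA nginx0.ipng.ch
+
+# ... and each website should be a CNAME to the cluster name
+for SITE in go.ipng.ch video.ipng.ch billing.ipng.ch grafana.ipng.ch; do
+  dig +short CNAME ${SITE}
+done
+```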
If load becomes an +issue, NGINX has a bunch of features like caching, cookie-persistence, loadbalancing with health +checking (so I could use multiple backend webservers and round-robin over the healthy ones), and so +on. Our Mastodon server on [[ublog.tech](https://ublog.tech)] or our Peertube server on +[[video.ipng.ch](https://video.ipng.ch/)] can make use of many of these optimizations, but while I +do love engineering, I am also super lazy so I prefer not to prematurely over-optimize. + +The main thing that's next is to automate a bit more of this. IPng Networks has an Ansible +controller, which I'd like to add maintenance of the NGINX and LEGO configuration. That would sort +of look like defining pool `nginx0` with hostnames A, B and C; and then having a playbook that +creates the virtual machine, installes and configures NGINX, and plumbs it through to the _LEGO_ +machine. I can imagine running a specific playbook that ensures the certificates stay fresh in some +`CI/CD` (I have a drone runner alongside our [[Gitea](https://git.ipng.ch/)] server), or just add +something clever to a cronjob on the _LEGO_ machine that periodically runs `lego ... renew` and when +new certificates are issued, copy them out to the NGINX machines in the given cluster with rsync, +and reloading their configuration to pick up the new certs. + +But considering Ansible is its whole own elaborate bundle of joy, I'll leave that for maybe another +article. diff --git a/content/articles/2023-03-24-lego-dns01.md b/content/articles/2023-03-24-lego-dns01.md new file mode 100644 index 0000000..8e5748c --- /dev/null +++ b/content/articles/2023-03-24-lego-dns01.md @@ -0,0 +1,344 @@ +--- +date: "2023-03-24T10:56:54Z" +title: 'Case Study: Let''s Encrypt DNS-01' +--- + +Last week I shared how IPng Networks deployed a loadbalanced frontend cluster of NGINX webservers +that have public IPv4 / IPv6 addresses, but talk to a bunch of internal webservers that are in a +private network which isn't directly connected to the internet, so called _IPng Site Local_ +[[ref]({%post_url 2023-03-11-mpls-core %})] with addresses **198.19.0.0/16** and +**2001:678:d78:500::/56**. + +I wrote in [[that article]({% post_url 2023-03-17-ipng-frontends %})] that IPng will be using +_ACME_ HTTP-01 validation, which asks the certificate authority, in this case Let's Encrypt, to +contact the webserver on a well-known URI for each domain that I'm requesting a certificate for. +Unsurprisingly, several folks reached out to me asking "well what about DNS-01", and one sentence +caught their eye: + +> Some SSL certificate providers allow for wildcards (ie. `*.ipng.ch`), but I'm going to keep it +> relatively simple and use [[Let's Encrypt](https://letsencrypt.org/)] which offers free +> certificates with a validity of three months. + +I could've seen this one coming! The sentence can be read to imply it doesn't, but **of course** +Let's Encrypt offers wildcard certificates. It just doesn't satisfy my _relatively simple_ qualifier +of the second part of the sentence ... So here I go, down the rabbit hole that is understanding +(for myself, and possibly for readers of this article), how the DNS-01 challenge works, in greater +detail. Hopefully after writing this (me) and reading this (you), we can all agree that I was +wrong, and that using DNS-01 ***is*** relatively simple after all. 
+
+## Overview
+
+I've installed three frontend NGINX servers (running at Coloclue AS8283, IPng AS8298 and IP-Max
+AS25091), and one LEGO certificate machine (running in the internal _IPng Site Local_ network).
+In the [[previous article]({% post_url 2023-03-17-ipng-frontends %})], I described the setup and
+the use of Let's Encrypt with HTTP-01 challenges. I'll skip that here.
+
+#### HTTP-01 vs DNS-01
+
+{{< image width="200px" float="right" src="/assets/ipng-frontends/lego-logo.min.svg" alt="LEGO" >}}
+
+Today, most SSL authorities and their customers use the Automatic Certificate Management Environment
+or _ACME protocol_ which is described in [[RFC8555](https://www.rfc-editor.org/rfc/rfc8555)]. It
+defines a way for certificate authorities to check the websites that they are asked to issue a
+certificate for using so-called challenges. One popular challenge is the so-called `HTTP-01`, in
+which the certificate authority will visit a well-known URI on the website domain for which the
+certificate is being requested, namely `/.well-known/acme-challenge/`, which is described in
+[[RFC5785](https://www.rfc-editor.org/rfc/rfc5785)]. The CA will expect the webserver to respond
+with an agreed-upon string of numbers at that location, in which case proof of ownership is
+established and a certificate is issued.
+
+In some situations, this `HTTP-01` challenge can be difficult to perform:
+
+* If the webserver is not reachable from the internet, or not reachable from the Let's Encrypt
+servers, for example if it is on an intranet, such as _IPng Site Local_ itself.
+* If the operator would prefer a wildcard certificate, proving ownership of all possible
+sub-domains is no longer feasible with `HTTP-01` but proving ownership of the parent domain is.
+
+
+One possible solution for these cases is to use the ACME challenge `DNS-01`, which doesn't use the
+webserver running on `go.ipng.ch` to prove ownership, but the _nameserver_ that serves `ipng.ch`
+instead. The Let's Encrypt GO client [[ref](https://go-acme.github.io/lego/)] supports both
+challenge types.
+
+The flow of requests in a `DNS-01` challenge is as follows:
+
+{{< image width="400px" float="right" src="/assets/ipng-frontends/acme-flow-dns01.svg" alt="ACME Flow DNS01" >}}
+
+1. First, the _LEGO_ client registers itself with the ACME-DNS server running on `auth.ipng.ch`.
+After successful registration, _LEGO_ is given a username, password, and access to one DNS
+recordname $(RRNAME).
+It is expected that the operator sets up a CNAME for a well-known record `_acme-challenge.ipng.ch`
+which points to that `$(RRNAME).auth.ipng.ch`. This happens only once.
+
+1. When a certificate is needed, the _LEGO_ client contacts the Certificate Authority and requests
+validation for the hostname `go.ipng.ch`. The CA will inform the client of a random
+number $(RANDOM) that it expects to see in a well-known TXT record for `_acme-challenge.ipng.ch`
+(which is the CNAME set up previously).
+
+1. The _LEGO_ client now uses the username and password it received in step 1, to update the TXT
+record of its `$(RRNAME).auth.ipng.ch` record to contain the $(RANDOM) number it learned in step 2.
+
+1. The CA will issue a TXT query for `_acme-challenge.ipng.ch`, which is a CNAME to
+`$(RRNAME).auth.ipng.ch`, which ultimately responds to the TXT query with the $(RANDOM) number.
+
+1. After validating that the response in the TXT record contains the agreed-upon random number, the
+CA knows that the operator of the nameserver is the same as the certificate requestor for the domain.
+It issues a certificate to the _LEGO_ client, which stores it on its local filesystem.
+
+1. Similar to any other challenge, the _LEGO_ machine can now distribute the private key and
+certificate to all NGINX machines, which are now capable of serving SSL traffic under the given names.
+
+One thing worth noting is that the TXT query is for _domain_ names, not _hostnames_; in other
+words, anything in the `ipng.ch` domain will solicit a query to `_acme-challenge.ipng.ch` by the
+`DNS-01` challenge. It is for this reason that the challenge allows for wildcard certificates,
+which can greatly reduce operational complexity and the total number of certificates needed.
+
+### ACME DNS
+
+Originally, DNS providers were expected to give their clients the ability to _directly_ update the
+well-known `_acme-challenge` TXT record, and while many commercial providers allow for this, IPng
+Networks runs just plain-old [[NSD](https://nlnetlabs.nl/projects/nsd/about)] as authoritative
+nameservers (shown above as `nsd0`, `nsd1` and `nsd2`). So what to do? Luckily, it was quickly
+understood by the community that if there is a lookup for the TXT record of `_acme-challenge.ipng.ch`,
+it would be absolutely OK to make some form of DNS-symlink by means of a CNAME.
+
+One really great solution that leverages this ability is written by Joona Hoikkala, called
+[[ACME-DNS](https://github.com/joohoi/acme-dns)]. Its sole purpose is to allow for an API, served
+over https, to register new clients, let those clients update their TXT record(s), and then serve
+them out in DNS. It's meant to be a multi-tenant system, by which I mean one ACME-DNS instance can
+host millions of domains from thousands of distinct users.
+
+#### Installing
+
+I noticed that ACME-DNS relies on features in relatively modern Go, and the standard version that
+comes with Debian Bullseye is a tad old, so first I need to install Go v1.19 from backports, before
+I can continue with the build of the binary:
+
+```
+lego@lego:~$ sudo apt -t bullseye-backports install golang
+lego@lego:~/src$ git clone https://github.com/joohoi/acme-dns
+lego@lego:~/src/acme-dns$ export GOPATH=/tmp/acme-dns
+lego@lego:~/src/acme-dns$ go build
+lego@lego:~/src/acme-dns$ sudo cp acme-dns /usr/local/bin/acme-dns
+lego@lego:~/src/acme-dns$ cat << EOF | sudo tee /lib/systemd/system/acme-dns.service
+[Unit]
+Description=Limited DNS server with RESTful HTTP API to handle ACME DNS challenges easily and securely
+After=network.target
+
+[Service]
+User=lego
+Group=lego
+AmbientCapabilities=CAP_NET_BIND_SERVICE
+WorkingDirectory=~
+ExecStart=/usr/local/bin/acme-dns -c /home/lego/acme-dns/config.cfg
+Restart=on-failure
+
+[Install]
+WantedBy=multi-user.target
+EOF
+```
+
+This authoritative nameserver will want to listen on UDP and TCP port 53, for which it either needs
+to run as root, or perhaps better, run as a non-privileged user with the `CAP_NET_BIND_SERVICE`
+capability. The only other difference with the provided unit file is that I'll be running this as
+the `lego` user, with a configuration file and working path in its home-directory.
+
+#### Configuring
+
+***Step 1. Delegate auth.ipng.ch***
+
+The first thing I should do is configure the subdomain for ACME-DNS, which I decide will be hosted on
+`auth.ipng.ch`.
I assign it an NS, an A and a AAAA record, and then update the `ipng.ch` domain: + +``` +$ORIGIN ipng.ch. +$TTL 86400 +@ IN SOA ns.paphosting.net. hostmaster.ipng.ch. ( 2023032401 28800 7200 604800 86400) + NS ns.paphosting.nl. + NS ns.paphosting.net. + NS ns.paphosting.eu. + +; ACME DNS +auth NS auth.ipng.ch. + A 194.1.163.93 + AAAA 2001:678:d78:3::93 +``` + +This snippet will make a DNS delegation for sub-domain `auth.ipng.ch` to the server also called +`auth.ipng.ch` and because the downstream delegation is in the same domain, I need to provide _glue_ +records, that tell clients who are querying for `auth.ipng.ch` where to find that nameserver. At +this point, any request for `*.auth.ipng.ch` will end up being forwarded to the authoritative +nameserver, which can be found at either 194.1.163.93 or 2001:678:d78:3::93. + +***Step 2. Start ACME DNS*** + +After having built the acme-dns server and given it a suitable systemd unit file, and knowing that +it's going to be responsible for the sub-domain `auth.ipng.ch`, I give it the following straight +forward configuration file: + +``` +lego@lego:~$ mkdir ~/acme-dns/ +lego@lego:~$ cat << EOF > acme-dns/config.cfg +[general] +listen = "[::]:53" +protocol = "both" +domain = "auth.ipng.ch" +nsname = "auth.ipng.ch" +nsadmin = "hostmaster.ipng.ch" +records = [ + "auth.ipng.ch. NS auth.ipng.ch.", + "auth.ipng.ch. A 194.1.163.93", + "auth.ipng.ch. AAAA 2001:678:d78:3::93", +] +debug = false + +[database] +engine = "sqlite3" +connection = "/home/lego/acme-dns/acme-dns.db" + +[api] +ip = "[::]" +disable_registration = false +port = "443" +tls = "letsencrypt" +acme_cache_dir = "/home/lego/acme-dns/api-certs" +notification_email = "hostmaster+dns-auth@ipng.ch" +corsorigins = [ "*" ] +use_header = false +header_name = "X-Forwarded-For" + +[logconfig] +loglevel = "debug" +logtype = "stdout" +logformat = "text" +EOF +lego@lego:~$ sudo systemctl enable acme-dns +lego@lego:~$ sudo systemctl start acme-dns +``` + +The first part of this tells the server how to construct the SOA record (domain, nsname and +nsadmin), and which records to put in the apex, nominally the NS/A/AAAA records that describe the +nameserver which is authoritative for the `auth.ipng.ch` domain. Then, the database part is where +user credentials will be stored, and the API portion shows how users will be able to interact with +the controlplane part of the service, notably registering new clients, and updating nameserver TXT +records for existing clients. + +{{< image width="200px" float="right" src="/assets/ipng-frontends/turtles.png" alt="Turtles" >}} + +Interestingly, the API is served on HTTPS port 443, and for that it needs, you guessed it, a +certificate! ACME-DNS eats its own dogfood, which I can appreciate: it will use `DNS-01` validation +to get a certificate for `auth.ipng.ch` _itself_, by serving the challenge for well known record +`_acme-challenge.auth.ipng.ch`, so it's turtles all the way down! + +***Step 3. Register a new client*** + +Seeing as many public DNS providers allow programmatic setting of the contents of the zonefiles, for +them it's a matter of directly being driven by _LEGO_. But for me, running NSD, I am going to be using +the ACME DNS server to fulfill that purpose, so I have to configure it to do that for me. + +In the explanation of `DNS-01` challenges above, you'll remember I made a mention of registering. 
Here's +a closer look at what that means: + +``` +lego@lego:~$ curl -s -X POST https://auth.ipng.ch/register | json_pp +{ + "allowfrom" : [], + "fulldomain" : "76f88564-740b-4483-9bc0-86d1fb531e20.auth.ipng.ch", + "password" : "", + "subdomain" : "76f88564-740b-4483-9bc0-86d1fb531e20", + "username" : "e4608fdf-9a69-4930-8cf1-57218738792d" +} +``` + +What happened here is that, using the HTTPS endpoint, I asked the ACME-DNS server to create for me an empty +DNS record, which it did on `76f88564-740b-4483-9bc0-86d1fb531e20.auth.ipng.ch`. Further, if I offer +the given username and password, I am able to update that record's value. Let's take a look: + +``` +lego@lego:~$ dig +short TXT 02e3acfc-bbca-46bb-9cee-8eab52c73c30.auth.ipng.ch + +lego@lego:~$ curl -s -X POST -H "X-Api-User: 5f3591d1-0d13-4816-a329-7965a8639ab5" \ + -H "X-Api-Key: " \ + -d '{"subdomain": "02e3acfc-bbca-46bb-9cee-8eab52c73c30", \ + "txt": "___Hello_World_token_______________________"}' \ + https://auth.ipng.ch/update +``` + +Numbers everywhere, but I learned a lot here! Notice how the first time I sent the `dig` request for +the `02e3acfc-bbca-46bb-9cee-8eab52c73c30.auth.ipng.ch` it did not respond anything (an empty +record). But then, using the username/password I could update the record with a 41 character +string, and I was informed of the `fulldomain` key there, which is the one that I should be +configuring in the domain(s) for which I want to get a certificate. + +I configure it in the `ipng.ch` and `ipng.nl` domain as follows (taking `ipng.nl` as an example): + +``` +$ORIGIN ipng.nl. +$TTL 86400 +@ IN SOA ns.paphosting.net. hostmaster.ipng.nl. ( 2023032401 28800 7200 604800 86400) + IN NS ns.paphosting.nl. + IN NS ns.paphosting.net. + IN NS ns.paphosting.eu. + CAA 0 issue "letsencrypt.org" + CAA 0 issuewild "letsencrypt.org" + CAA 0 iodef "mailto:hostmaster@ipng.ch" +_acme-challenge CNAME 8ee2969b-571c-4b3a-b6a0-6d6221130c96.auth.ipng.ch. +``` + +The records here are a `CAA` which is a type of DNS record used to provide additional confirmation +for the Certificate Authority when validating an SSL certificate. This record allows me to specify +which certificate authorities are authorized to deliver SSL certificates for the domain. Then, the +well known `_acme-challenge.ipng.nl` record is merely telling the client by means of a `CNAME` to go +ask for `8ee2969b-571c-4b3a-b6a0-6d6221130c96.auth.ipng.ch` instead. + +Putting this part all together now, I can issue a query for that ipng.nl domain ... + +``` +lego@lego:~$ dig +short TXT _acme-challenge.ipng.nl. +"___Hello_World_token_______________________" +``` + +... and would you look at that! The query for the ipng.nl domain, is a CNAME to the specific uuid +record in the auth.ipng.ch domain, where ACME-DNS is serving it with the response that I can +programmatically set to different values, yee-haw! + +***Step 4. Run LEGO*** + +The _LEGO_ client has all sorts of challenge providers linked in. Once again, Debian is a bit behind +on things, shipping version 3.2.0-3.1+b5 in Bullseye, although upstream is much further along. So I +purge the Debian package and download the v4.10.2 amd64 package directly from its +[[Github](https://github.com/go-acme/lego/releases)] releases page. The ACME-DNS handler was only +added in v4 of the client. 
But now all that's left for me to do is run it: + +``` +lego@lego:~$ export ACME_DNS_API_BASE=https://auth.ipng.ch/ +lego@lego:~$ export ACME_DNS_STORAGE_PATH=/home/lego/acme-dns/credentials.json +lego@lego:~$ /home/lego/bin/lego --path /etc/lego/ --email noc@ipng.ch --accept-tos --dns acme-dns \ + --domains ipng.ch --domains *.ipng.ch \ + --domains ipng.nl --domains *.ipng.nl \ + run +``` + +The LEGO client goes through the ACME flow that I described at the top of this article, and ends up +spitting out a certificate \o/ + +``` +lego@lego:~$ openssl x509 -noout -text -in /etc/lego/certificates/ipng.ch.crt +Certificate: + Data: + Version: 3 (0x2) + Serial Number: + 03:58:8f:c1:25:00:e2:f3:d3:3f:d6:ed:ba:bc:1d:0d:54:ea + Signature Algorithm: sha256WithRSAEncryption + Issuer: C = US, O = Let's Encrypt, CN = R3 + Validity + Not Before: Mar 21 20:24:08 2023 GMT + Not After : Jun 19 20:24:07 2023 GMT + Subject: CN = ipng.ch + X509v3 extensions: + X509v3 Subject Alternative Name: + DNS:*.ipng.ch, DNS:*.ipng.nl, DNS:ipng.ch, DNS:ipng.nl +``` + +Et voila! Wildcard certificates for multiple domains using ACME-DNS. diff --git a/content/articles/2023-04-09-vpp-stats.md b/content/articles/2023-04-09-vpp-stats.md new file mode 100644 index 0000000..abdbb65 --- /dev/null +++ b/content/articles/2023-04-09-vpp-stats.md @@ -0,0 +1,382 @@ +--- +date: "2023-04-09T11:01:14Z" +title: VPP - Monitoring +--- + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. + +I've been working on the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)], which you +can read all about in my series on VPP back in 2021: + +[![DENOG14](/assets/vpp-stats/denog14-thumbnail.png){: style="width:300px; float: right; margin-left: 1em;"}](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4) + +* [[Part 1]({% post_url 2021-08-12-vpp-1 %})]: Punting traffic through TUN/TAP interfaces into Linux +* [[Part 2]({% post_url 2021-08-13-vpp-2 %})]: Mirroring VPP interface configuration into Linux +* [[Part 3]({% post_url 2021-08-15-vpp-3 %})]: Automatically creating sub-interfaces in Linux +* [[Part 4]({% post_url 2021-08-25-vpp-4 %})]: Synchronize link state, MTU and addresses to Linux +* [[Part 5]({% post_url 2021-09-02-vpp-5 %})]: Netlink Listener, synchronizing state from Linux to VPP +* [[Part 6]({% post_url 2021-09-10-vpp-6 %})]: Observability with LibreNMS and VPP SNMP Agent +* [[Part 7]({% post_url 2021-09-21-vpp-7 %})]: Productionizing and reference Supermicro fleet at IPng + +With this, I can make a regular server running Linux use VPP as kind of a software ASIC for super +fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, +addresses and routes) itself. With Linux CP, running software like [FRR](https://frrouting.org/) or +[Bird](https://bird.network.cz/) on top of VPP and achieving >150Mpps and >180Gbps forwarding +rates are easily in reach. If you find that hard to believe, check out [[my DENOG14 +talk](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)] or click the thumbnail above. I am +continuously surprised at the performance per watt, and the performance per Swiss Franc spent. 
+
+## Monitoring VPP
+
+Of course, it's important to be able to see what routers are _doing_ in production. For the longest
+time, the _de facto_ standard for monitoring in the networking industry has been Simple Network
+Management Protocol (SNMP), described in [[RFC 1157](https://www.rfc-editor.org/rfc/rfc1157)]. But
+there's another way, using a metrics and time series system called _Borgmon_, originally designed by
+Google [[ref](https://sre.google/sre-book/practical-alerting/)] but popularized by Soundcloud in an
+open source interpretation called **Prometheus** [[ref](https://prometheus.io/)]. IPng Networks ♥ Prometheus.
+
+I'm a really huge fan of Prometheus and its graphical frontend Grafana, as you can see with my work on
+Mastodon in [[this article]({% post_url 2022-11-27-mastodon-3 %})]. Join me on
+[[ublog.tech](https://ublog.tech)] if you haven't joined the Fediverse yet. It's well monitored!
+
+### SNMP
+
+SNMP defines an extensible model by which parts of the OID (object identifier) tree can be delegated
+to another process, and the main SNMP daemon will call out to it using the _AgentX_ protocol,
+described in [[RFC 2741](https://datatracker.ietf.org/doc/html/rfc2741)]. In a nutshell, this
+allows an external program to connect to the main SNMP daemon, register an interest in certain OIDs,
+and get called whenever the SNMPd is being queried for them.
+
+{{< image width="400px" float="right" src="/assets/vpp-stats/librenms.png" alt="LibreNMS" >}}
+
+The flow is pretty simple (see section 6.2 of the RFC); the Agent (client):
+1. opens a TCP or Unix domain socket to the SNMPd
+1. sends an Open PDU, which the server will either accept or reject.
+1. (optionally) can send a Ping PDU, to which the server will respond.
+1. registers an interest with a Register PDU
+
+It then waits and gets called by the SNMPd with Get PDUs (to retrieve one single value), GetNext PDU
+(to enable snmpwalk), GetBulk PDU (to retrieve a whole subsection of the MIB), all of which are
+answered by a Response PDU.
+
+Using parts of a Python AgentX library written by GitHub user hosthvo
+[[ref](https://github.com/hosthvo/pyagentx)], I tried my hand at writing one of these AgentX clients.
+The resulting source code is on [[GitHub](https://github.com/pimvanpelt/vpp-snmp-agent)]. That's the
+one that's been running in production ever since I started running VPP routers at IPng Networks AS8298.
+After the _AgentX_ exposes the dataplane interfaces and their statistics into _SNMP_, an open source
+monitoring tool such as LibreNMS [[ref](https://librenms.org/)] can discover the routers and draw
+pretty graphs, as well as detect when interfaces go down, or are overloaded, and so on. That's
+pretty slick.
+
+### VPP Stats Segment in Go
+
+But if I may offer some critique on my own approach, SNMP monitoring is _very_ 1990s. I'm
+continuously surprised that our industry is still clinging to this archaic approach. VPP offers
+_a lot_ of observability: its statistics segment is chock-full of interesting counters and gauges
+that can be really helpful to understand how the dataplane performs. If there are errors or a
+bottleneck develops in the router, going over `show runtime` or `show errors` can be a life saver.
+Let's take another look at that Stats Segment (the one that the SNMP AgentX connects to in order to
+query it for packets/byte counters and interface names).
+
+You can think of the Stats Segment as a directory hierarchy where each file represents a type of
+counter.
VPP comes with a small helper tool called VPP Stats FS, which uses a FUSE based read-only +filesystem to expose those counters in an intuitive way, so let's take a look + +``` +pim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo systemctl start vpp +pim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo make start +pim@hippo:~/src/vpp/extras/vpp_stats_fs$ mount | grep stats +rawBridge on /run/vpp/stats_fs_dir type fuse.rawBridge (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other) + +pim@hippo:/run/vpp/stats_fs_dir$ ls -la +drwxr-xr-x 0 root root 0 Apr 9 14:07 bfd +drwxr-xr-x 0 root root 0 Apr 9 14:07 buffer-pools +drwxr-xr-x 0 root root 0 Apr 9 14:07 err +drwxr-xr-x 0 root root 0 Apr 9 14:07 if +drwxr-xr-x 0 root root 0 Apr 9 14:07 interfaces +drwxr-xr-x 0 root root 0 Apr 9 14:07 mem +drwxr-xr-x 0 root root 0 Apr 9 14:07 net +drwxr-xr-x 0 root root 0 Apr 9 14:07 node +drwxr-xr-x 0 root root 0 Apr 9 14:07 nodes +drwxr-xr-x 0 root root 0 Apr 9 14:07 sys + +pim@hippo:/run/vpp/stats_fs_dir$ cat sys/boottime +1681042046.00 +pim@hippo:/run/vpp/stats_fs_dir$ date +%s +1681042058 +pim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo make stop +``` + +There's lots of really interesting stuff in here - for example in the `/sys` hierarchy we can see a +`boottime` file, and from there I can determine the uptime of the process. Further, the `/mem` +hierarchy shows the current memory usage for each of the _main_, _api_ and _stats_ segment heaps. +And of course, in the `/interfaces` hierarchy we can see all the usual packets and bytes counters +for any interface created in the dataplane. + +### VPP Stats Segment in C + +I wish I were good at Go, but I never really took to the language. I'm pretty good at Python, but +sorting through the stats segment isn't super quick as I've already noticed in the Python3 based +[[VPP SNMP Agent](https://github.com/pimvanpelt/vpp-snmp-agent)]. I'm probably the world's least +terrible C programmer, so maybe I can take a look at the VPP Stats Client and make sense of it. Luckily, +there's an example already in `src/vpp/app/vpp_get_stats.c` and it reveals the following pattern: + +1. assemble a vector of regular expression patterns in the hierarchy, or just `^/` to start +1. get a handle to the stats segment with `stats_segment_ls()` using the pattern(s) +1. use the handle to dump the stats segment into a vector with `stat_segment_dump()`. +1. iterate over the returned stats structure, each element has a type and a given name: + * ***STAT_DIR_TYPE_SCALAR_INDEX***: these are floating point doubles + * ***STAT_DIR_TYPE_COUNTER_VECTOR_SIMPLE***: single uint32 counter + * ***STAT_DIR_TYPE_COUNTER_VECTOR_COMBINED***: two uint32 counters +1. freeing the used stats structure with `stat_segment_data_free()` + +The simple and combined stats turn out to be associative arrays, the outer of which notes the +_thread_ and the inner of which refers to the _index_. As such, a statistic of type +***VECTOR_SIMPLE*** can be decoded like so: + +``` +if (res[i].type == STAT_DIR_TYPE_COUNTER_VECTOR_SIMPLE) + for (k = 0; k < vec_len (res[i].simple_counter_vec); k++) + for (j = 0; j < vec_len (res[i].simple_counter_vec[k]); j++) + printf ("[%d @ %d]: %llu packets %s\n", j, k, res[i].simple_counter_vec[k][j], res[i].name); +``` + +The statistic of type ***VECTOR_COMBINED*** is very similar, except the union type there is a +`combined_counter_vec[k][j]` which has a member `.packets` and a member called `.bytes`. The +simplest form, ***SCALAR_INDEX***, is just a single floating point number attached to the name. 
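+
+Before writing any C, a quick sanity check with the stock `vpp_get_stats` tool (which I'll use more
+below) shows what a ***SCALAR_INDEX*** looks like in practice. The exact print format differs a
+little between VPP releases, so treat the shape of this output as illustrative -- I'm reusing the
+boottime and wallclock values from the Stats FS example above:
+
+```
+pim@hippo:~$ vpp_get_stats dump /sys/boottime
+1681042046.00 /sys/boottime
+pim@hippo:~$ echo $(( $(date +%s) - 1681042046 ))
+12
+```
+
+Subtracting the boottime scalar from the current wallclock time gives the dataplane's uptime in
+seconds -- the same trick as with the Stats FS `boottime` file earlier.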
+ +In principle, this should be really easy to sift through and decode. Now that I've figured that +out, let me dump a bunch of stats with the `vpp_get_stats` tool that comes with vanilla VPP: + +``` +pim@chrma0:~$ vpp_get_stats dump /interfaces/TenGig.*40121 | grep -v ': 0' +[0 @ 2]: 67057 packets /interfaces/TenGigabitEthernet81_0_0.40121/drops +[0 @ 2]: 76125287 packets /interfaces/TenGigabitEthernet81_0_0.40121/ip4 +[0 @ 2]: 1793946 packets /interfaces/TenGigabitEthernet81_0_0.40121/ip6 +[0 @ 2]: 77919629 packets, 66184628769 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx +[0 @ 0]: 7 packets, 610 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx +[0 @ 1]: 26687 packets, 18771919 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx +[0 @ 2]: 6448944 packets, 3663975508 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx +[0 @ 3]: 138924 packets, 20599785 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx +[0 @ 4]: 130720342 packets, 57436383614 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx +``` + +I can see both types of counter at play here, let me explain the first line: it is saying that the +counter of name `/interfaces/TenGigabitEthernet81_0_0.40121/drops`, at counter index 0, CPU thread +2, has a simple counter with value 67057. Taking the last line, this is a combined counter type with +name `/interfaces/TenGigabitEthernet81_0_0.40121/tx` at index 0, all five CPU threads (the main +thread and four worker threads) have all sent traffic into this interface, and the counters for each +in packets and bytes is given. + +For readability's sake, my `grep -v` above doesn't print any counter that is 0. For example, +interface `Te81/0/0` has only one receive queue, and it's bound to thread 2. The other threads will +not receive any packets for it, consequently their `rx` counters stay zero: + +``` +pim@chrma0:~/src/vpp$ vpp_get_stats dump /interfaces/TenGig.*40121 | grep rx$ +[0 @ 0]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx +[0 @ 1]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx +[0 @ 2]: 80720186 packets, 68458816253 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx +[0 @ 3]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx +[0 @ 4]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx +``` + +### Hierarchy: Pattern Matching + +I quickly discover a pattern in most of these names: they start with a scope, say `/interfaces`, +then have a path entry for the interface name, and finally a specific counter (`/rx` or `/mpls`). 
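+
+A handy way to see that naming pattern without dumping any values at all is the `ls` mode of the
+same tool, which just prints the matching counter names. The listing below is abridged to the
+counters I touch on in this article:
+
+```
+pim@chrma0:~$ vpp_get_stats ls /interfaces/TenGig.*40121
+/interfaces/TenGigabitEthernet81_0_0.40121/drops
+/interfaces/TenGigabitEthernet81_0_0.40121/ip4
+/interfaces/TenGigabitEthernet81_0_0.40121/ip6
+/interfaces/TenGigabitEthernet81_0_0.40121/rx
+/interfaces/TenGigabitEthernet81_0_0.40121/tx
+```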
+This is true also for the `/nodes` hierarchy, in which all of VPP's graph nodes have a set of counters:
+
+```
+pim@chrma0:~$ vpp_get_stats dump /nodes/ip4-lookup | grep -v ': 0'
+[0 @ 1]: 11365675493301 packets /nodes/ip4-lookup/clocks
+[0 @ 2]: 3256664129799 packets /nodes/ip4-lookup/clocks
+[0 @ 3]: 28364098623954 packets /nodes/ip4-lookup/clocks
+[0 @ 4]: 30198798628761 packets /nodes/ip4-lookup/clocks
+[0 @ 1]: 80870763789 packets /nodes/ip4-lookup/vectors
+[0 @ 2]: 17392446654 packets /nodes/ip4-lookup/vectors
+[0 @ 3]: 259363625369 packets /nodes/ip4-lookup/vectors
+[0 @ 4]: 298176625181 packets /nodes/ip4-lookup/vectors
+[0 @ 1]: 49730112811 packets /nodes/ip4-lookup/calls
+[0 @ 2]: 13035172295 packets /nodes/ip4-lookup/calls
+[0 @ 3]: 109088424231 packets /nodes/ip4-lookup/calls
+[0 @ 4]: 119789874274 packets /nodes/ip4-lookup/calls
+```
+
+If you've ever seen the output of `show runtime`, it looks like this:
+
+```
+vpp# show runtime
+Thread 1 vpp_wk_0 (lcore 28)
+Time 3377500.2, 10 sec internal node vector rate 1.46 loops/sec 3301017.05
+  vector rates in 2.7440e6, out 2.7210e6, drop 3.6025e1, punt 7.2243e-5
+             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
+...
+ip4-lookup                       active     49732141978     80873724903               0          1.41e2            1.63
+```
+
+Hey look! On thread 1, which is called `vpp_wk_0` and is running on logical CPU core #28, there are
+a bunch of VPP graph nodes that are all keeping stats of what they've been doing, and you can see
+here that the following numbers line up between `show runtime` and the VPP Stats dumper:
+
+* ***Name***: This is the name of the VPP graph node, in this case `ip4-lookup`, which is performing an
+  IPv4 FIB lookup to figure out what the L3 nexthop is of a given IPv4 packet we're trying to route.
+* ***Calls***: How often did we invoke this graph node, 49.7 billion times so far.
+* ***Vectors***: How many packets did we push through, 80.87 billion, humble brag.
+* ***Clocks***: This one is a bit different -- you can see the cumulative clock cycles spent by
+  this CPU thread in the stats dump: 11365675493301 divided by 80870763789 packets is 140.54 CPU
+  cycles per packet. It's a cool interview question "How many CPU cycles does it take to do an
+  IPv4 routing table lookup?". You now know the answer :-)
+* ***Vectors/Call***: This is a measure of how busy the node is (did it run for only one packet,
+  or for many packets?). On average when the worker thread gave the `ip4-lookup` node some work to
+  do, there have been a total of 80873724903 packets handled in 49732141978 calls, so 1.626
+  packets per call. If ever you're handling 256 packets per call (the most VPP will allow per call),
+  your router will be sobbing.
+
+### Prometheus Metrics
+
+Prometheus has metrics which carry a name, and zero or more labels. The Prometheus query language
+can then use these labels to do aggregation, division, averages, and so on. As a practical example,
+above I looked at interface stats and saw that the Rx/Tx numbers were counted one per thread. If
+we'd like the total on the interface, it would be great if we could `sum without (thread,index)`,
+which will have the effect of adding all of these numbers together.
+For the monotonically increasing counter numbers (like the total vectors/calls/clocks per node), we
+can take the running _rate_ of change, showing the time spent over the last minute, or so.
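+
+As a first taste of that, using the metric names I'll construct in the next section, the
+per-interface total from above turns into a PromQL expression along these lines (illustrative, not
+copy-paste ready):
+
+```
+# total packets/sec received per interface, summed over all worker threads
+sum without (thread, index) (rate(interfaces_rx_packets[60s]))
+```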
+This way, spikes in traffic will clearly correlate not only with a spike in packets/sec or bytes/sec
+on the interface, but also with a higher number of _vectors/call_, and correspondingly typically a
+lower number of _clocks/vector_, as VPP gets more efficient when it can re-use the CPU's instruction
+and data cache to do repeat work on multiple packets.
+
+I decide to massage the statistic names a little bit, by transforming them into the basic format:
+`prefix_suffix{label="X",index="A",thread="B"} value`
+
+A few examples:
+* The single counter that looks like `[6 @ 0]: 994403888 packets /mem/main heap` becomes:
+  * `mem{heap="main heap",index="6",thread="0"}`
+* The combined counter `[0 @ 1]: 79582338270 packets, 16265349667188 bytes /interfaces/Te1_0_2/rx`
+  becomes:
+  * `interfaces_rx_packets{interface="Te1_0_2",index="0",thread="1"} 79582338270`
+  * `interfaces_rx_bytes{interface="Te1_0_2",index="0",thread="1"} 16265349667188`
+* The node information running on, say thread 4, becomes:
+  * `nodes_clocks{node="ip4-lookup",index="0",thread="4"} 30198798628761`
+  * `nodes_vectors{node="ip4-lookup",index="0",thread="4"} 298176625181`
+  * `nodes_calls{node="ip4-lookup",index="0",thread="4"} 119789874274`
+  * `nodes_suspends{node="ip4-lookup",index="0",thread="4"} 0`
+
+### VPP Exporter
+
+I wish I had things like `split()` and `re.match()` but in C (well, I guess I do have POSIX regular
+expressions...), but it's all a little bit more low level. Based on my basic loop that opens the
+stats segment, registers its desired patterns, and then retrieves a vector of {name, type,
+counter}-tuples, I decide to do a little bit of non-intrusive string tokenization first:
+
+```
+static int tokenize (const char *str, char delimiter, char **tokens, int *lengths) {
+  char *p = (char *) str;
+  char *savep = p;
+  int i = 0;
+
+  while (*p) if (*p == delimiter) {
+      tokens[i] = (char *) savep;
+      lengths[i] = (int) (p - savep);
+      i++; p++; savep = p;
+    } else p++;
+  tokens[i] = (char *) savep;
+  lengths[i] = (int) (p - savep);
+  return i++;
+}
+
+/* The call site */
+  char *tokens[10];
+  int lengths[10];
+  int num_tokens = tokenize (res[i].name, '/', tokens, lengths);
+```
+
+The tokenizer takes an array of N pointers to the resulting tokens, and their lengths. This sets it
+apart from `strtok()` and friends, because those will overwrite the occurrences of the delimiter in
+the input string with `\0`, and as such cannot take a `const char *str` as input. This one leaves
+the string alone though, and will return the tokens as {ptr, len}-tuples, including how many
+tokens it found.
+
+One thing I'll probably regret is that there's no bounds checking on the number of tokens -- if I
+have more than 10 of these, I'll come to regret it. But for now, the depth of the hierarchy is only
+3, so I should be fine. Besides, I got into a fight with ChatGPT after it declared a romantic
+interest in my cat, so it won't write code for me anymore :-(
+
+But using this simple tokenizer, and knowledge of the structure of well-known hierarchy paths, the
+rest of the exporter is quickly in hand. Some variables don't have a label (for example
+`/sys/boottime`), but those that do will see that field transposed from the directory path
+`/mem/main heap/free` into the label as I showed above.
+
+### Results
+
+{{< image width="400px" float="right" src="/assets/vpp-stats/grafana1.png" alt="Grafana 1" >}}
+
+With this VPP Prometheus Exporter, I can now hook the VPP routers up to Prometheus and Grafana.
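+
+Hooking a router up is then just a matter of adding a scrape job for the exporter -- something
+along these lines in `prometheus.yml`, where the hostname is merely an example and 9482 is the port
+my exporter happens to reuse (more on that below):
+
+```
+scrape_configs:
+  - job_name: 'vpp'
+    scrape_interval: 10s
+    static_configs:
+      - targets: [ 'chrma0.net.ipng.ch:9482' ]
+```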
+Aggregations in Grafana are super easy and scalable, due to the conversion of the static paths into
+dynamically created labels on the Prometheus metric names.
+
+Drawing a graph of the running time spent by each individual VPP graph node might look something
+like this:
+
+```
+sum without (thread, index)(rate(nodes_clocks[60s]))
+  /
+sum without (thread, index)(rate(nodes_vectors[60s]))
+```
+
+The plot to the right shows a system under a loadtest that ramps up from 0% to 100% of line rate,
+and the traces are the cumulative time spent in each node (on a logarithmic scale). The top purple
+line represents `dpdk-input`. When a VPP dataplane is idle, the worker threads will be repeatedly
+polling DPDK to ask it if it has something to do, spending 100% of their time being told "there is
+nothing for you to do". But once load starts appearing, the other nodes start spending CPU time,
+for example the chain of IPv4 forwarding is `ethernet-input`, `ip4-input`, `ip4-lookup`, followed by
+`ip4-rewrite` and ultimately the packet is transmitted on some other interface. When the system is
+lightly loaded, the `ethernet-input` node for example will spend 1100 or so CPU cycles per packet,
+but when the machine is under higher load, the time spent will decrease to as low as 22 CPU cycles
+per packet. This is true for almost all of the nodes - VPP gets relatively _more efficient_ under
+load.
+
+{{< image width="400px" float="right" src="/assets/vpp-stats/grafana2.png" alt="Grafana 2" >}}
+
+Another cool graph that I won't be able to see when using only LibreNMS and SNMP polling is how
+busy the router is. In VPP, each dispatch of the worker loop will poll DPDK and dispatch the packets
+through the directed graph of nodes that I showed above. But how many packets can be moved through
+the graph per CPU? The largest number of packets that VPP will ever offer into a call of the nodes
+is 256. Typically an unloaded machine will have an average number of Vectors/Call of around 1.00.
+When the worker thread is loaded, it may sit at around 130-150 Vectors/Call. If it's saturated, it
+will quickly shoot up to 256.
+
+As a good approximation, Vectors/Call normalized to 100% will be an indication of how busy the
+dataplane is. In the picture above, between 10:30 and 11:00 my test router was pushing about 180Gbps
+of traffic, but with large packets so its total vectors/call was modest (roughly 35-40), which you
+can see as all threads there are running in the ~25% load range. Then at 11:00 a few threads got
+hotter, and one of them completely saturated, and the traffic being forwarded by that CPU thread was
+suffering _packetlo_, even though the others were absolutely fine... forwarding 150Mpps on a 10 year
+old Dell R720!
+
+### What's Next
+
+Together with the graph above, I can also see how many CPU cycles are spent in which
+type of operation. For example, encapsulation of GENEVE or VxLAN is not _free_, although it's also
+not very expensive. If I know how many CPU cycles are available (roughly the clock speed of the CPU
+threads, in our case Xeon D-1518 (2.2GHz) or Xeon E5-2683 v4 CPUs (3GHz)), I can pretty accurately
+calculate what a given mix of traffic and features is going to cost, and how many packets/sec our
+routers at IPng will be able to forward. Spoiler alert: it's way more than currently needed.
Our
+supermicros can handle roughly 35Mpps each, and considering a regular mixture of internet traffic
+(called _imix_) is about 3Mpps per 10G, I will have room to spare for the time being.
+
+This is super valuable information for folks running VPP in production.
+I haven't put the finishing touches on the VPP Prometheus Exporter: for example, there are no
+commandline flags yet, and it doesn't listen on any port other than 9482 (the same one that the toy
+exporter in `src/vpp/app/vpp_prometheus_export.c` ships with
+[[ref](https://github.com/prometheus/prometheus/wiki/Default-port-allocations)]). My Grafana
+dashboard is also not fully completed yet. I hope to get that done in April, and publish both the
+exporter and the dashboard on GitHub. Stay tuned!
diff --git a/content/articles/2023-05-07-vpp-mpls-1.md b/content/articles/2023-05-07-vpp-mpls-1.md
new file mode 100644
index 0000000..d2a28aa
--- /dev/null
+++ b/content/articles/2023-05-07-vpp-mpls-1.md
@@ -0,0 +1,749 @@
+---
+date: "2023-05-07T10:01:14Z"
+title: VPP MPLS - Part 1
+---
+
+{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
+
+# About this series
+
+Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
+performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
+_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
+are shared between the two.
+
+I've deployed an MPLS core for IPng Networks, which allows me to provide L2VPN services, and at the
+same time keep an IPng Site Local network with IPv4 and IPv6 that is separate from the internet,
+based on hardware/silicon based forwarding at line rate and high availability. You can read all
+about my Centec MPLS shenanigans in [[this article]({% post_url 2023-03-11-mpls-core %})].
+
+Ever since the release of the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)]
+plugin in VPP, folks have asked "What about MPLS?" -- I have never really felt the need to go down
+this rabbit hole, because I figured that in this day and age, higher level IP protocols that do
+tunneling are just as performant, and a little bit less of an 'art' to get right. For example, the
+Centec switches I deployed perform VxLAN, GENEVE and GRE all at line rate in silicon. And in an
+earlier article, I showed that the performance of VPP in these tunneling protocols is actually
+pretty good. Take a look at my [[VPP L2 article]({% post_url 2022-01-12-vpp-l2 %})] for context.
+
+You might ask yourself: _Then why bother?_ To which I would respond: if you have to ask that question,
+clearly you don't know me :) This article will form a deep dive into MPLS as implemented by VPP. In
+a later set of articles, I'll partner with the incomparable [@vifino](https://chaos.social/@vifino) who
+is adding MPLS support to the Linux Controlplane plugin. After that, I do expect VPP to be able to act
+as a fully fledged provider- and provider-edge MPLS router.
+
+## Lab Setup
+
+A while ago I created a [[VPP Lab]({% post_url 2022-10-14-lab-1 %})] which is pretty slick; I use it
+all the time. Most of the time I find myself messing around on the hypervisor and adding namespaces
+with interfaces in it, to pair up with the VPP interfaces. And I tcpdump a lot!
It's time for me to +make an upgrade to the Lab -- take a look at this picture: + +{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}} + +There's quite a bit to unpack here, but it will be useful to know this layout as I'll be referring +to the components here throughout the rest of the article. Each **lab** now has seven virtual +machines: + +1. **vppX-Y** are Debian Testing machines running a reasonably fresh VPP - they are daisychained +with the first one attaching to the headend called `lab.ipng.ch`, using its Gi10/0/0 interface, and +onwards to its eastbound neighbor vpp0-1 using its GI10/0/1 interface. +1. **hostX-Y** are two Debian machines which have their 4 network cards (enp16s0fX) connected each +to one VPP instance's Gi10/0/2 interface (for `host0-0`) or Gi10/0/3 (for `host0-1`). This way, I +can test all sorts of topologies with one router, two routers, or multiple routers. +1. **tapX-0** is a special virtual machine which receives a copy of every packet on the underlying +Open vSwitch network fabric. + +***NOTE***: X is the 0-based lab number, and Y stands for the 0-based logical machine number, so `vpp1-3` +is the fourth VPP virtualmachine on the second lab. + +### Detour 1: Open vSwitch + +To explain this tap a little bit - let me first talk about the underlay. All seven of these machines +(and each their four network cards) are bound by the hypervisor into an Open vSwitch bridge called +`vpplan`. Then, I use two features to build this topology: + +Firstly, each pair of interfaces will be added as an access port into individual VLANs. For +example, `vpp0-0.Gi10/0/1` connects with `vpp0-1.Gi10/0/0` in VLAN 20 (annotated in orange), and +`vpp0-0.Gi10/0/2` connects to `host0-0.enp16s0f0` in VLAN 30 (annotated in purple). You can see the +East-West traffic over the VPP backbone are in the 20s, the `host0-0` traffic northbound is in the +30s, and the `host0-1` traffic southbound is in the 40s. Finally, the whole Open vSwitch fabric is +connected to `lab.ipng.ch` using VLAN 10 and a physical network card on the hypervisor (annotated in +green). The `lab.ipng.ch` machine then has internet connectivity. + +``` +BR=vpplan +for p in $(ovs-vsctl list-ifaces $BR); do + ovs-vsctl set port $p vlan_mode=access +done + +# Uplink (Green) +ovs-vsctl set port uplink tag=10 ## eno1.200 on the Hypervisor +ovs-vsctl set port vpp0-0-0 tag=10 + +# Backbone (Orange) +ovs-vsctl set port vpp0-0-1 tag=20 +ovs-vsctl set port vpp0-1-0 tag=20 +... + +# Northbound (Purple) +ovs-vsctl set port vpp0-0-2 tag=30 +ovs-vsctl set port host0-0-0 tag=30 +... + +# Southbound (Red) +... +ovs-vsctl set port vpp0-3-3 tag=43 +ovs-vsctl set port host0-1-3 tag=43 +``` + +**NOTE**: The KVM interface names such as `vppX-Y-Z` where X means the lab number (0 in this case -- +IPng does have multiple labs so I can run experiments and lab environments independently and isolated), +Y is the machine number, and Z is the interface number on the machine (from [0..3]). + +### Detour 2: Mirroring Traffic + +Secondly, now that I have created a 29 port switch with 12 VLANs, I decide to create an OVS _mirror +port_, which can be used to make a copy of traffic going in- or out of (a list of) ports. 
This is a
+super powerful feature, and it looks like this:
+
+```
+BR=vpplan
+MIRROR=mirror-rx
+ovs-vsctl set port tap0-0-0 vlan_mode=access
+
+ovs-vsctl list mirror $MIRROR >/dev/null 2>&1 && \
+  ovs-vsctl --id=@m get mirror $MIRROR -- remove bridge $BR mirrors @m
+
+ovs-vsctl --id=@m create mirror name=$MIRROR \
+   -- --id=@p get port tap0-0-0 \
+   -- add bridge $BR mirrors @m \
+   -- set mirror $MIRROR output-port=@p \
+   -- set mirror $MIRROR select_dst_port=[] \
+   -- set mirror $MIRROR select_src_port=[]
+
+for iface in $(ovs-vsctl list-ports $BR); do
+  [[ $iface == tap* ]] && continue
+  ovs-vsctl add mirror $MIRROR select_dst_port $(ovs-vsctl get port $iface _uuid)
+done
+```
+
+The first call sets up the OVS switchport called `tap0-0-0` (which is enp16s0f0 on the machine
+`tap0-0`) as an access port. To allow for this script to be idempotent, the second command checks
+if the mirror already exists and, if so, deletes it. Then, I (re)create a mirror port with a given
+name (`mirror-rx`), add it to the bridge, make the mirror's output port become `tap0-0-0`, and
+finally clear the selected source and destination ports (this is where the traffic is mirrored
+_from_). At this point, I have an empty mirror. To give it something useful to do, I loop over all
+of the ports in the `vpplan` bridge and add them to the mirror, if they are the _destination_ port
+(here I have to specify the uuid of the interface, not its name). I will add all interfaces, except
+those of the `tap0-0` machine itself, to avoid loops.
+
+In the end, I create two of these, one called `mirror-rx` which is forwarded to `tap0-0-0`
+(enp16s0f0) and the other called `mirror-tx` which is forwarded to `tap0-0-1` (enp16s0f1). I can use
+tcpdump on either of these ports, to show all the traffic either going _ingress_ to any port on any
+machine, or emitting _egress_ from any port on any machine, respectively.
+
+## Preparing the LAB
+
+I wrote a little bit about the automation I use to maintain a few reproducible lab environments in a
+[[previous article]({% post_url 2022-10-14-lab-1 %})], so I'll only show the commands themselves here,
+not the underlying systems. When the LAB boots up, it comes with a basic Linux CP configuration that
+uses OSPF and OSPFv3 running in Bird2, to connect the `vpp1-0` through `vpp1-3` machines together (each
+router's Gi10/0/0 port connects to the next router's Gi10/0/1 port). LAB0 is in use by
+[@vifino](https://chaos.social/@vifino) at the moment, so I'll take the next one running on its own
+hypervisor, called LAB1.
+
+Each machine has an IPv4 and IPv6 loopback, so the LAB will come up with basic connectivity:
+
+```
+pim@lab:~/src/lab$ LAB=1 ./create
+pim@lab:~/src/lab$ LAB=1 ./command pristine
+pim@lab:~/src/lab$ LAB=1 ./command start && sleep 150
+pim@lab:~/src/lab$ traceroute6 vpp1-3.lab.ipng.ch
+traceroute to vpp1-3.lab.ipng.ch (2001:678:d78:211::3), 30 hops max, 24 byte packets
+ 1  e0.vpp1-0.lab.ipng.ch (2001:678:d78:211::fffe)  2.0363 ms  2.0123 ms  2.0138 ms
+ 2  e0.vpp1-1.lab.ipng.ch (2001:678:d78:211::1:11)  3.0969 ms  3.1261 ms  3.3413 ms
+ 3  e0.vpp1-2.lab.ipng.ch (2001:678:d78:211::2:12)  6.4845 ms  6.3981 ms  6.5409 ms
+ 4  vpp1-3.lab.ipng.ch (2001:678:d78:211::3)  7.4610 ms  7.5698 ms  7.6413 ms
+```
+
+## MPLS For Dummies
+
+.. like me! MPLS stands for [[Multi Protocol Label
+Switching](https://en.wikipedia.org/wiki/Multiprotocol_Label_Switching)].
Rather than looking at the
+IPv4 or IPv6 header in the packet, and making the routing decision based on the destination address,
+MPLS takes the whole packet and encapsulates it into a new datagram that carries a 20-bit number
+(called the _label_), three bits to classify the traffic, one _S-bit_ to signal that this is the
+last label in a stack of labels, and finally 8 bits of TTL.
+
+In total, 32 bits are prepended to the whole IP packet, or Ethernet frame, or any other type of inner
+datagram. This is why it's called _Multi Protocol_. The _S-bit_ allows routers to know if the
+following data is the inner payload (S=1), or if the following 32 bits are another MPLS label (S=0).
+This way, routers can add more than one label into a _label stack_.
+
+Forwarding decisions are made on the contents of this MPLS _label_, without the need to examine
+the packet itself. Two significant benefits become obvious:
+
+1. The inner data payload (ie. an IPv6 packet or an Ethernet frame) doesn't have to be
+rewritten, no new checksum created, no TTL decremented. Any datagram can be stuffed into an MPLS
+packet; the routing (or _packet switching_) happens entirely using only the MPLS headers.
+
+2. Importantly, no source- or destination IP addresses have to be looked up in a possibly very
+large (~1M entry) FIB tree to figure out the next hop. Rather than traversing a [[Radix
+Trie](https://en.wikipedia.org/wiki/Radix_tree)] or other datastructure to find the next-hop, a
+static [[Hash Table](https://en.wikipedia.org/wiki/Hash_table)] with literal integer MPLS labels
+can be consulted. This greatly simplifies the computational complexity in transit.
+
+***P-Router***: The simplest form of an MPLS router is a so-called Label-Switch-Router (_LSR_), which is synonymous
+with Provider-Router (_P-Router_). This is the router that sits in the core of the network, and its
+only purpose is to receive MPLS packets, look up what to do with them based on the _label_ value,
+and then forward the packet onto the next router. Sometimes the router can (and will) rewrite the
+label, in an operation called a SWAP, but it can also leave the label as it was (in other words, the
+input label value can be the same as the outgoing label value). The logic kind of goes like
+**MPLS In-Label** => **{ MPLS Out-Label, Out-Interface, Out-NextHop }**. It's this behavior
+that explains the name _Label Switching_.
+
+If you were to imagine plotting a path through the lab network from say `vpp1-0` on the one side,
+through `vpp1-1` and `vpp1-2` and finally onwards to `vpp1-3`, each router would be receiving MPLS packets on one
+interface, and emitting them on their way to the next router on another interface. That *path* of
+*switching* operations on the *labels* of those MPLS packets thus forms a so-called _Label-Switched-Path
+(LSP)_. These LSPs are fundamental building blocks of MPLS networks, as I'll demonstrate later.
+
+***PE-Router***: Some routers have a less boring job to do - those that sit at the edge of an MPLS network, accepting
+customer traffic and doing something useful with it. These are called Label-Edge-Router (_LER_), which
+is often colloquially called a Provider-Edge-Router (_PE-Router_). These routers receive normal
+packets (ethernet or IP or otherwise), and perform the encapsulation by adding MPLS labels to them
+upon receipt (ingress, called PUSH), or removing the encapsulation (called POP) and finding the
+inner payload, continuing to handle them as per normal.
The logic for these can be much more
+complicated, but you can imagine it goes something like **MPLS In-Label** => **{ Operation }**
+where the operation may be "take the resulting datagram, assume it is an IPv4 packet, so look it up
+in the IPv4 routing table" or "take the resulting datagram, assume it is an ethernet frame, and emit
+it on a specific interface", and really any number of other "operations".
+
+The cool thing about MPLS is that the types of operations are vendor-extensible. If two routers A and B
+agree what label 1234 means _to them_, they can simply insert it at the bottom of the _labelstack_, say
+{100,1234}, where the outer label (the 100 that all the _P-Routers_ see) carries the semantic
+meaning of "switch this packet onto the destination _PE-router_", where that _PE-router_ can pop the
+outer label, to reveal the 1234-label, which it can look up in its table to tell it what to do next
+with the MPLS payload in any way it chooses - the _P-Routers_ don't have to understand the meaning
+of label 1234, they don't have to use or inspect it at all!
+
+### Step 0: End Host setup
+
+{{< image src="/assets/vpp-mpls/LAB v2 (1).svg" alt="Lab Setup" >}}
+
+For this lab, I'm going to boot up instance LAB1 with no changes (for posterity, using image
+`vpp-proto-disk0@20230403-release`). As an aside, IPng Networks has several of these lab
+environments, and while [@vifino](https://chaos.social/@vifino) is doing some development testing on
+LAB0, I simply switch to LAB1 to let him work in peace.
+
+With the MPLS concepts introduced, let me start by configuring `host1-0` and `host1-1` and giving them
+an IPv4 loopback address, and a transit network to their routers `vpp1-3` and `vpp1-0` respectively:
+
+```
+root@host1-1:~# ip link set enp16s0f0 up mtu 1500
+root@host1-1:~# ip addr add 192.0.2.2/31 dev enp16s0f0
+root@host1-1:~# ip addr add 10.0.1.1/32 dev lo
+root@host1-1:~# ip ro add 10.0.1.0/32 via 192.0.2.3
+
+root@host1-0:~# ip link set enp16s0f3 up mtu 1500
+root@host1-0:~# ip addr add 192.0.2.0/31 dev enp16s0f3
+root@host1-0:~# ip addr add 10.0.1.0/32 dev lo
+root@host1-0:~# ip ro add 10.0.1.1/32 via 192.0.2.1
+root@host1-0:~# ping -I 10.0.1.0 10.0.1.1
+```
+
+At this point, I don't expect to see much, as I haven't configured VPP yet. But `host1-0` will start
+ARPing for 192.0.2.1 on `enp16s0f3`, which is connected to `vpp1-3.e2`. Let me take a look at the
+Open vSwitch mirror to confirm that:
+
+```
+root@tap1-0:~# tcpdump -vni enp16s0f0 vlan 33
+12:41:27.174052 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
+12:41:28.333901 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
+12:41:29.517415 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
+12:41:30.645418 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
+```
+
+Alright! I'm going to leave the ping running in the background, and I'll trace packets through the
+network using the Open vSwitch mirror, as well as take a look at what VPP is doing with the packets
+using its packet tracer.
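+
+Since the lab has both an ingress and an egress mirror, I like to keep two tcpdump sessions open
+side by side while doing this. Assuming the second mirror port is wired to `enp16s0f1` on `tap1-0`,
+analogous to the `mirror-tx` setup I described earlier, that looks something like:
+
+```
+root@tap1-0:~# tcpdump -eni enp16s0f0 vlan or mpls   # copies of traffic ingressing any port
+root@tap1-0:~# tcpdump -eni enp16s0f1 vlan or mpls   # copies of traffic egressing any port
+```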
+ +### Step 1: PE Ingress + +``` +vpp1-3# set interface state GigabitEthernet10/0/2 up +vpp1-3# set interface ip address GigabitEthernet10/0/2 192.0.2.1/31 +vpp1-3# mpls table add 0 +vpp1-3# set interface mpls GigabitEthernet10/0/1 enable +vpp1-3# set interface mpls GigabitEthernet10/0/0 enable +vpp1-3# ip route add 10.0.1.1/32 via 192.168.11.10 GigabitEthernet10/0/0 out-labels 100 +``` + +Now the ARP resolution succeeds, and I can see that `host1-0` starts sending ICMP packets towards the +loopback that I have configured on `host1-1`, and it's of course using the newly learned L2 adjacency +for 192.0.2.1 at 52:54:00:13:10:02 (which is `vpp1-3.e2`). But, take a look at what the VPP router +does next: due to the `ip route add ...` command, I've told it to reach 10.0.1.1 via a nexthop of +`vpp1-2.e1`, but it will PUSH a single MPLS label 100,S=1 and forward it out on its Gi10/0/0 +interface: + +``` +root@tap1-0:~# tcpdump -eni enp16s0f0 vlan or mpls +12:45:56.551896 52:54:00:20:10:03 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33 + p 0, ethertype ARP (0x0806), Request who-has 192.0.2.1 tell 192.0.2.0, length 28 +12:45:56.553311 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 46: vlan 33 + p 0, ethertype ARP (0x0806), Reply 192.0.2.1 is-at 52:54:00:13:10:02, length 28 + +12:45:56.620924 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33 + p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 38791, seq 184, length 64 +12:45:56.621473 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22 + p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64) + 10.0.1.0 > 10.0.1.1: ICMP echo request, id 38791, seq 184, length 64 +``` + +My MPLS journey on VPP has officially begun! The first exchange in the tcpdump (packets 1 and 2) +is the ARP resolution of 192.0.2.1 by `host1-0`, after which it knows where to send the ICMP echo +(packet 3, on VLAN33), which is then sent out by `vpp1-3` as MPLS to `vpp1-2` (packet 4, on VLAN22). + +Let me show you what such a packet looks like from the point of view of VPP. It has a _packet +tracing_ function which shows how any individual packet traverses the graph of nodes through the +router. 
It's a lot of information, but as a VPP operator, let alone a developer, it's really +important skill to learn -- so off I go, capturing and tracing a handful of packets: + +``` +vpp1-3# trace add dpdk-input 10 +vpp1-3# show trace +------------------- Start of thread 0 vpp_main ------------------- +Packet 1 + +20:15:00:496109: dpdk-input + GigabitEthernet10/0/2 rx queue 0 + buffer 0x4c44df: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0 + ext-hdr-valid + PKT MBUF: port 2, nb_segs 1, pkt_len 98 + buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x2ed13840 + packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 + rss 0x0 fdir.hi 0x0 fdir.lo 0x0 + IP4: 52:54:00:20:10:03 -> 52:54:00:13:10:02 + ICMP: 10.0.1.0 -> 10.0.1.1 + tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN + fragment id 0x2706, flags DONT_FRAGMENT + ICMP echo_request checksum 0x3bd6 id 8399 + +20:15:00:496167: ethernet-input + frame: flags 0x1, hw-if-index 3, sw-if-index 3 + IP4: 52:54:00:20:10:03 -> 52:54:00:13:10:02 + +20:15:00:496201: ip4-input + ICMP: 10.0.1.0 -> 10.0.1.1 + tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN + fragment id 0x2706, flags DONT_FRAGMENT + ICMP echo_request checksum 0x3bd6 id 8399 + +20:15:00:496225: ip4-lookup + fib 0 dpo-idx 1 flow hash: 0x00000000 + ICMP: 10.0.1.0 -> 10.0.1.1 + tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN + fragment id 0x2706, flags DONT_FRAGMENT + ICMP echo_request checksum 0x3bd6 id 8399 + +20:15:00:496256: ip4-mpls-label-imposition-pipe + mpls-header:[100:64:0:eos] + +20:15:00:496258: mpls-output + adj-idx 25 : mpls via 192.168.11.10 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254001210015254001310008847 flow hash: 0x00000000 + +20:15:00:496260: GigabitEthernet10/0/0-output + GigabitEthernet10/0/0 flags 0x0018000d + MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01 + label 100 exp 0, s 1, ttl 64 + +20:15:00:496262: GigabitEthernet10/0/0-tx + GigabitEthernet10/0/0 tx queue 0 + buffer 0x4c44df: current data -4, length 102, buffer-pool 0, ref-count 1, trace handle 0x0 + ext-hdr-valid + l2-hdr-offset 0 l3-hdr-offset 14 + PKT MBUF: port 2, nb_segs 1, pkt_len 102 + buf_len 2176, data_len 102, ol_flags 0x0, data_off 124, phys_addr 0x2ed13840 + packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 + rss 0x0 fdir.hi 0x0 fdir.lo 0x0 + MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01 + label 100 exp 0, s 1, ttl 64 +``` + +This packet has gone through a total of eight nodes, and the local timestamps are the uptime of VPP +when the packets were received. I'll try to explain them in turn: + +1. ***dpdk-input***: The packet is initially received by from Gi10/0/2 receive queue 0. It was an +ethernet packet from 52:54:00:20:10:03 (`host1-0.enp16s0f3`) to 52:54:00:13:10:02 (`vpp1-3.e2`). Some +more information is gleaned here, notably that it was an ethernet frame, an L3 IPv4 and L4 ICMP +packet. +1. ***ethernet-input***: Since it was an ethernet frame, it was passed into this node. Here VPP +concludes that this is an IPv4 packet, because the ethertype is 0x0800. +1. ***ip4-input***: We know it's an IPv4 packet, and the layer4 information shows this is an ICMP +echo packet from 10.0.1.0 to 10.0.1.1 (configured on `host1-1.lo`). VPP now needs to figure out where +to route this packet. +1. 
+1. ***ip4-lookup***: VPP takes a look at its FIB for 10.0.1.1 - note the information I specified
+above (the `ip route add ...` on `vpp1-3`) - the next-hop here is 192.168.11.10 on Gi10/0/0 _but_ VPP
+also sees that I'm intending to add an MPLS _out-label_ of 100.
+1. ***ip4-mpls-label-imposition-pipe***: An MPLS packet header is prepended in front of the IPv4
+packet, which will have only one label (100) and since it's the only label, it will set the S-bit
+(end-of-stack) to 1, and the MPLS TTL initializes at 64.
+1. ***mpls-output***: Now that the IPv4 packet is wrapped into an MPLS packet, VPP uses the rest
+of the FIB entry (notably the next-hop 192.168.11.10 and the output interface Gi10/0/0) to find where
+this thing is supposed to go.
+1. ***Gi10/0/0-output***: VPP now prepares the packet to be sent out on Gi10/0/0 as an MPLS
+ethernet type. It uses the L2FIB adjacency table to figure out that we'll be sending it from our MAC
+address 52:54:00:13:10:00 (`vpp1-3.e0`) to the next hop on 52:54:00:12:10:01 (`vpp1-2.e1`).
+1. ***Gi10/0/0-tx***: VPP hands the fully formed packet with all necessary information back to
+DPDK to marshall it on the wire.
+
+Can you imagine that this router can do such a thing at a rate of 18-20 Million packets per second,
+linearly scaling up per added CPU thread? I learn something new every time I look at a packet trace --
+I simply love this dataplane implementation!
+
+### Step 2: P-routers
+
+In Step 1 I've shown that `vpp1-3` did send the MPLS packet to `vpp1-2`, but I haven't configured
+anything there yet, and because I didn't enable MPLS, each of these beautiful packets is brutally
+sent to the bit-bucket (also called _dpo-drop_):
+
+```
+vpp1-2# show err
+ Count Node Reason Severity
+ 132 mpls-input MPLS input packets decapsulated info
+ 132 mpls-input MPLS not enabled error
+```
+
+The purpose of a _P-router_ is to switch labels along the _Label-Switched-Path_. So let's manually
+create the following to tell this `vpp1-2` router what to do when it receives an MPLS frame with
+label 100:
+
+```
+vpp1-2# mpls table add 0
+vpp1-2# set interface mpls GigabitEthernet10/0/0 enable
+vpp1-2# set interface mpls GigabitEthernet10/0/1 enable
+vpp1-2# mpls local-label add 100 eos via 192.168.11.8 GigabitEthernet10/0/0 out-labels 100
+```
+
+Remember, above I explained that the _P-Router_ has a simple job? It really does! All I'm doing here
+is telling VPP that if it receives an MPLS packet on any MPLS-enabled interface (notably Gi10/0/1,
+on which it is currently receiving MPLS packets from `vpp1-3`), it should send the MPLS packet
+out on Gi10/0/0 to neighbor 192.168.11.8 after imposing label 100.
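+
+To make that swap a bit more tangible, here's a small Python sketch (my own illustration, not VPP
+source code) of the 32-bit MPLS shim header and of what the SWAP operation on `vpp1-2` amounts to:
+rewrite the 20-bit label field, keep the EXP and S bits, and decrement the MPLS TTL:
+
+```
+import struct
+
+def unpack_shim(data: bytes) -> dict:
+    # One 32-bit MPLS shim: label (20 bits), exp (3), S (1), TTL (8)
+    (word,) = struct.unpack("!I", data[:4])
+    return {"label": word >> 12, "exp": (word >> 9) & 0x7,
+            "s": (word >> 8) & 0x1, "ttl": word & 0xFF}
+
+def pack_shim(label: int, exp: int, s: int, ttl: int) -> bytes:
+    return struct.pack("!I", (label << 12) | (exp << 9) | (s << 8) | ttl)
+
+def p_router_swap(shim: bytes, out_label: int) -> bytes:
+    # What a P-Router does: swap the label, keep exp/S, decrement the TTL
+    hdr = unpack_shim(shim)
+    return pack_shim(out_label, hdr["exp"], hdr["s"], hdr["ttl"] - 1)
+
+incoming = pack_shim(label=100, exp=0, s=1, ttl=64)   # as sent by vpp1-3
+outgoing = p_router_swap(incoming, out_label=100)     # as forwarded by vpp1-2
+print(unpack_shim(incoming))   # {'label': 100, 'exp': 0, 's': 1, 'ttl': 64}
+print(unpack_shim(outgoing))   # {'label': 100, 'exp': 0, 's': 1, 'ttl': 63}
+```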
+
+If I've done a good job, I should be able to see this packet traversing the P-Router in a packet
+trace:
+
+```
+20:42:51:151144: dpdk-input
+ GigabitEthernet10/0/1 rx queue 0
+ buffer 0x4c7d8b: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
+ ext-hdr-valid
+ PKT MBUF: port 1, nb_segs 1, pkt_len 102
+ buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1d1f6340
+ packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
+ rss 0x0 fdir.hi 0x0 fdir.lo 0x0
+ MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
+ label 100 exp 0, s 1, ttl 64
+
+20:42:51:151161: ethernet-input
+ frame: flags 0x1, hw-if-index 2, sw-if-index 2
+ MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
+
+20:42:51:151171: mpls-input
+ MPLS: next mpls-lookup[1] label 100 ttl 64 exp 0
+
+20:42:51:151174: mpls-lookup
+ MPLS: next [6], lookup fib index 0, LB index 74 hash 0 label 100 eos 1
+
+20:42:51:151177: mpls-label-imposition-pipe
+ mpls-header:[100:63:0:eos]
+
+20:42:51:151179: mpls-output
+ adj-idx 28 : mpls via 192.168.11.8 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254001110015254001210008847 flow hash: 0x00000000
+
+20:42:51:151181: GigabitEthernet10/0/0-output
+ GigabitEthernet10/0/0 flags 0x0018000d
+ MPLS: 52:54:00:12:10:00 -> 52:54:00:11:10:01
+ label 100 exp 0, s 1, ttl 63
+
+20:42:51:151184: GigabitEthernet10/0/0-tx
+ GigabitEthernet10/0/0 tx queue 0
+ buffer 0x4c7d8b: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
+ ext-hdr-valid
+ l2-hdr-offset 0 l3-hdr-offset 14
+ PKT MBUF: port 1, nb_segs 1, pkt_len 102
+ buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1d1f6340
+ packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
+ rss 0x0 fdir.hi 0x0 fdir.lo 0x0
+ MPLS: 52:54:00:12:10:00 -> 52:54:00:11:10:01
+ label 100 exp 0, s 1, ttl 63
+```
+
+In order, the following nodes are traversed:
+1. ***dpdk-input***: received the frame from the network interface Gi10/0/1
+1. ***ethernet-input***: the frame was an ethernet frame, and VPP determines based on the
+ethertype (0x8847) that it is an MPLS frame
+1. ***mpls-input***: inspects the MPLS labelstack and sees the outermost label (the only one on
+this frame) with a value of 100.
+1. ***mpls-lookup***: looks up in the MPLS FIB what to do with packets which are End-Of-Stack or
+_EOS_ (i.e. with the S-bit set to 1), and are labeled 100. At this point VPP could make a different
+choice if there is 1 label (as in this case), or a stack of multiple labels (Not-End-of-Stack or
+_NEOS_, i.e. with the S-bit set to 0).
+1. ***mpls-label-imposition-pipe***: reads from the FIB that the outer label needs to be SWAPped
+to a new _out-label_ (also with value 100). Because it's the same label, this is a no-op. However,
+since this router is forwarding the MPLS packet, it will decrement the TTL to 63.
+1. ***mpls-output***: VPP then uses the rest of the FIB information to determine that the L3 nexthop is
+192.168.11.8 on Gi10/0/0.
+1. ***Gi10/0/0-output***: uses the L2FIB adjacency table to determine that the L2 nexthop is MAC
+address 52:54:00:11:10:01 (`vpp1-1.e1`). If there is no L2 adjacency, this would be a good time for
+VPP to send an ARP request to resolve the IP-to-MAC and store it in the L2FIB.
+1. ***Gi10/0/0-tx***: hands off the frame to DPDK for marshalling on the wire.
+
+If you counted with me, you'll see that this flow in the _P-Router_ also has eight nodes.
+However, while the IPv4 FIB can and will be north of one million entries, organized in a
+longest-prefix match radix trie (which is computationally expensive), the MPLS FIB contains far
+fewer entries, which are organized as an exact-match key lookup in a hash table. And compared to
+IPv4 routing, the inner packet that is being transported does not need its TTL decremented, which
+would also require a recalculated IPv4 checksum. MPLS switching is _much_ cheaper than IPv4 routing!
+
+Now that our packets are switched from `vpp1-2` to `vpp1-1` (which is also a _P-Router_), I'll just
+rinse and repeat there, using the L3 adjacency pointing at `vpp1-0.e1` (192.168.11.6 on Gi10/0/0):
+
+```
+vpp1-1# mpls table add 0
+vpp1-1# set interface mpls GigabitEthernet10/0/0 enable
+vpp1-1# set interface mpls GigabitEthernet10/0/1 enable
+vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100
+```
+
+Did I do this correctly? One way to check is by taking a look at which packets are seen on the Open
+vSwitch mirror ports:
+
+```
+root@tap1-0:~# tcpdump -eni enp16s0f0
+13:42:47.724107 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
+ p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64
+
+13:42:47.724769 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
+ p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
+ 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64
+
+13:42:47.725038 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21
+ p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
+ 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64
+
+13:42:47.726155 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20
+ p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62)
+ 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64
+```
+
+Nice!! I confirm that the ICMP packet first travels over VLAN 33 (from `host1-0` to `vpp1-3`), and then
+the MPLS packets travel from `vpp1-3`, through `vpp1-2`, through `vpp1-1` and towards `vpp1-0` over VLAN
+22, 21 and 20 respectively.
+
+### Step 3: PE Egress
+
+Seeing as I haven't done anything with `vpp1-0` yet, now the MPLS packets all get dropped there. But not
+for much longer, as I'm now ready to tell `vpp1-0` what to do with those packets:
+
+```
+vpp1-0# mpls table add 0
+vpp1-0# set interface mpls GigabitEthernet10/0/0 enable
+vpp1-0# set interface mpls GigabitEthernet10/0/1 enable
+vpp1-0# mpls local-label add 100 eos via ip4-lookup-in-table 0
+vpp1-0# ip route add 10.0.1.1/32 via 192.0.2.2
+```
+
+The difference between the _P-Routers_ in Step 2 and this _PE-Router_ is the operation provided in
+the MPLS FIB. When an MPLS packet with _label_ value 100 is received, instead of forwarding it into
+another interface (which is what the _P-Router_ would do), I tell VPP here to unwrap the MPLS label,
+and expect to find an IPv4 packet which I'm asking it to route by looking up an IPv4 next hop in the
+(IPv4) FIB table 0.
+
+All that's left for me to do is add a regular static route for 10.0.1.1/32 via 192.0.2.2 (which is
+the address on interface `host1-1.enp16s0f3`).
+If my thinking cap is still working, I should now see packets being emitted from `vpp1-0` on Gi10/0/3:
+
+```
+vpp1-0# trace add dpdk-input 10
+vpp1-0# show trace
+
+21:34:39:370589: dpdk-input
+ GigabitEthernet10/0/1 rx queue 0
+ buffer 0x4c4a34: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
+ ext-hdr-valid
+ PKT MBUF: port 1, nb_segs 1, pkt_len 102
+ buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x2ff28d80
+ packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
+ rss 0x0 fdir.hi 0x0 fdir.lo 0x0
+ MPLS: 52:54:00:11:10:00 -> 52:54:00:10:10:01
+ label 100 exp 0, s 1, ttl 62
+
+21:34:39:370672: ethernet-input
+ frame: flags 0x1, hw-if-index 2, sw-if-index 2
+ MPLS: 52:54:00:11:10:00 -> 52:54:00:10:10:01
+
+21:34:39:370702: mpls-input
+ MPLS: next mpls-lookup[1] label 100 ttl 62 exp 0
+
+21:34:39:370704: mpls-lookup
+ MPLS: next [6], lookup fib index 0, LB index 83 hash 0 label 100 eos 1
+
+21:34:39:370706: ip4-mpls-label-disposition-pipe
+ rpf-id:-1 ip4, pipe
+
+21:34:39:370708: lookup-ip4-dst
+ fib-index:0 addr:10.0.1.1 load-balance:82
+
+21:34:39:370710: ip4-rewrite
+ tx_sw_if_index 4 dpo-idx 32 : ipv4 via 192.0.2.2 GigabitEthernet10/0/3: mtu:9000 next:9 flags:[] 5254002110005254001010030800 flow hash: 0x00000000
+ 00000000: 5254002110005254001010030800450000543dec40003e01e8bc0a0001000a00
+ 00000020: 01010800173d231c01a0fce65864000000009ce80b00000000001011
+
+21:34:39:370735: GigabitEthernet10/0/3-output
+ GigabitEthernet10/0/3 flags 0x0418000d
+ IP4: 52:54:00:10:10:03 -> 52:54:00:21:10:00
+ ICMP: 10.0.1.0 -> 10.0.1.1
+ tos 0x00, ttl 62, length 84, checksum 0xe8bc dscp CS0 ecn NON_ECN
+ fragment id 0x3dec, flags DONT_FRAGMENT
+ ICMP echo_request checksum 0x173d id 8988
+
+21:34:39:370739: GigabitEthernet10/0/3-tx
+ GigabitEthernet10/0/3 tx queue 0
+ buffer 0x4c4a34: current data 4, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
+ ext-hdr-valid
+ l2-hdr-offset 0 l3-hdr-offset 14 loop-counter 1
+ PKT MBUF: port 1, nb_segs 1, pkt_len 98
+ buf_len 2176, data_len 98, ol_flags 0x0, data_off 132, phys_addr 0x2ff28d80
+ packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
+ rss 0x0 fdir.hi 0x0 fdir.lo 0x0
+ IP4: 52:54:00:10:10:03 -> 52:54:00:21:10:00
+ ICMP: 10.0.1.0 -> 10.0.1.1
+ tos 0x00, ttl 62, length 84, checksum 0xe8bc dscp CS0 ecn NON_ECN
+ fragment id 0x3dec, flags DONT_FRAGMENT
+ ICMP echo_request checksum 0x173d id 8988
+```
+
+Alright, another one of those huge blobs of information about a single packet traversing the VPP
+dataplane, but it's the last one for this article, I promise! In order:
+
+1. ***dpdk-input***: DPDK reads the frame which is arriving from `vpp1-1` on Gi10/0/1 and determines
+that this is an ethernet frame
+1. ***ethernet-input***: Based on the ethertype 0x8847, it knows that this ethernet frame is an
+MPLS packet
+1. ***mpls-input***: The MPLS _labelstack_ has one label, value 100, with (obviously) the
+EndOfStack _S-bit_ set to 1; I can also see the (MPLS) TTL is 62 here, because it has traversed three
+routers (`vpp1-3` TTL=64, `vpp1-2` TTL=63, and `vpp1-1` TTL=62)
+1. ***mpls-lookup***: The lookup of local _label_ 100 informs VPP that it should switch to IPv4
+processing and handle the packet as such
+1. ***ip4-mpls-label-disposition-pipe***: The MPLS label is removed, revealing an IPv4 packet as
+the inner payload of the MPLS datagram
+1. ***lookup-ip4-dst***: VPP can now do a regular IPv4 forwarding table lookup for 10.0.1.1 which
+informs it that it should forward the packet via 192.0.2.2 which is directly connected to Gi10/0/3.
+1. ***ip4-rewrite***: The IPv4 TTL is decremented and the IP header checksum recomputed
+1. ***Gi10/0/3-output***: VPP can now look up the L2FIB adjacency belonging to 192.0.2.2 on Gi10/0/3,
+which informs it that 52:54:00:21:10:00 is the ethernet nexthop
+1. ***Gi10/0/3-tx***: The packet is now handed off to DPDK to marshall on the wire, destined to
+`host1-1.enp16s0f3`
+
+That means I should be able to see it on `host1-1`, right? If you, too, are dying to know, check this out:
+
+```
+root@host1-1:~# tcpdump -ni enp16s0f0 icmp
+tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
+listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
+14:25:53.776486 IP 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8988, seq 1249, length 64
+14:25:53.776522 IP 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 8988, seq 1249, length 64
+14:25:54.799829 IP 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8988, seq 1250, length 64
+14:25:54.799866 IP 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 8988, seq 1250, length 64
+```
+
+"Jiggle jiggle, wiggle wiggle!", as I do a premature congratulatory dance on the chair in my lab! I
+created a _label-switched-path_ using VPP as MPLS provider-edge and provider routers, to move
+this ICMP echo packet all the way from `host1-0` to `host1-1`, but there's absolutely nothing to
+suggest that the resulting ICMP echo-reply can go back from `host1-1` to `host1-0`, because
+_LSPs_ are unidirectional. The final step for me to do is create an _LSP_ back in the other
+direction:
+
+```
+vpp1-0# ip route add 10.0.1.0/32 via 192.168.11.7 GigabitEthernet10/0/1 out-labels 103
+vpp1-1# mpls local-label add 103 eos via 192.168.11.9 GigabitEthernet10/0/1 out-labels 103
+vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103
+vpp1-3# mpls local-label add 103 eos via ip4-lookup-in-table 0
+vpp1-3# ip route add 10.0.1.0/32 via 192.0.2.0
+```
+
+And with that, the ping I started at the beginning of this article shoots to life:
+
+```
+root@host1-0:~# ping -I 10.0.1.0 10.0.1.1
+PING 10.0.1.1 (10.0.1.1) from 10.0.1.0 : 56(84) bytes of data.
+64 bytes from 10.0.1.1: icmp_seq=7644 ttl=62 time=6.28 ms
+64 bytes from 10.0.1.1: icmp_seq=7645 ttl=62 time=7.45 ms
+64 bytes from 10.0.1.1: icmp_seq=7646 ttl=62 time=7.01 ms
+64 bytes from 10.0.1.1: icmp_seq=7647 ttl=62 time=5.76 ms
+64 bytes from 10.0.1.1: icmp_seq=7648 ttl=62 time=5.88 ms
+64 bytes from 10.0.1.1: icmp_seq=7649 ttl=62 time=9.23 ms
+```
+
+I will leave you with this packetdump from the Open vSwitch mirror, showing the entire flow of one
+ICMP packet through the network:
+
+```
+root@tap1-0:~# tcpdump -c 10 -eni enp16s0f0
+tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
+listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
+14:41:07.526861 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
+ p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
+14:41:07.528103 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
+ p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
+ 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
+14:41:07.529342 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21
+ p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
+ 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
+14:41:07.530421 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20
+ p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62)
+ 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
+14:41:07.531160 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40
+ p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
+
+14:41:07.531455 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40
+ p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
+14:41:07.532245 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20
+ p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64)
+ 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
+14:41:07.532732 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21
+ p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63)
+ 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
+14:41:07.533923 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22
+ p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 62)
+ 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
+14:41:07.535040 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33
+ p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
+10 packets captured
+10 packets received by filter
+```
+
+You can see all of the attributes I demonstrated in this article in one go: ingress ICMP packet on
+VLAN 33, encapsulation with label 100, S=1 and TTL decrementing as the MPLS packet traverses
+eastwards through the string of VPP routers on VLANs 22, 21 and 20, ultimately being sent out on
+VLAN 40. There, the ICMP echo request packet is responded to, and we can trace the ICMP response as
+it makes its way back westwards through the MPLS network using label 103, ultimately re-appearing on
+VLAN 33.
+
+There you have it. This is a fun story on _Multi Protocol Label Switching (MPLS)_ bringing a packet from
+a _Label-Edge-Router (LER)_ through several _Label-Switch-Routers (LSRs)_ over a statically
+configured _Label-Switched-Path (LSP)_. I feel like I can now more confidently use these terms
+without sounding silly.
+
+## What's next
+
+The first mission is accomplished. I've taken a good look at forwarding IPv4 traffic through the VPP
+dataplane as MPLS packets, thereby en- and decapsulating the traffic using _PE-Routers_ and forwarding the
+traffic using intermediary _P-Routers_.
+MPLS switching is cheaper than IPv4/IPv6 routing, and it also opens up a bunch of possibilities
+for advanced service offerings, such as my coveted _Martini Tunnels_ which transport ethernet
+frames point-to-point over an MPLS backbone. That will be the topic of an upcoming article, in
+which I will join forces with [@vifino](https://chaos.social/@vifino) who is adding
+Linux Controlplane functionality to program the MPLS FIB using Netlink -- such that things like 'ip'
+and 'FRR' can discover and share label information using a Label Distribution Protocol or _LDP_.
diff --git a/content/articles/2023-05-17-vpp-mpls-2.md b/content/articles/2023-05-17-vpp-mpls-2.md
new file mode 100644
index 0000000..3f5fc01
--- /dev/null
+++ b/content/articles/2023-05-17-vpp-mpls-2.md
@@ -0,0 +1,635 @@
+---
+date: "2023-05-17T10:01:14Z"
+title: VPP MPLS - Part 2
+---
+
+{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
+
+# About this series
+
+Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
+performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
+_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
+are shared between the two.
+
+I've deployed an MPLS core for IPng Networks, which allows me to provide L2VPN services, and at the
+same time keep an IPng Site Local network with IPv4 and IPv6 that is separate from the internet,
+based on hardware/silicon based forwarding at line rate and high availability. You can read all
+about my Centec MPLS shenanigans in [[this article]({% post_url 2023-03-11-mpls-core %})].
+
+In the last article, I explored VPP's MPLS implementation a little bit. All the while,
+[@vifino](https://chaos.social/@vifino) has been tinkering with the Linux Control Plane and adding
+MPLS support to it, and together we learned a lot about how VPP does MPLS forwarding and how it
+sometimes differs from other implementations. During the process, we talked a bit about
+_implicit-null_ and _explicit-null_. When my buddy Fred read the [[previous article]({% post_url
+2023-05-07-vpp-mpls-1 %})], he also talked about a feature called _penultimate-hop-popping_ which
+maybe deserves a bit more explanation. At the same time, I could not help but wonder what the
+performance is of VPP as a _P-Router_ and _PE-Router_, compared to, say, IPv4 forwarding.
+
+## Lab Setup: VMs
+
+{{< image src="/assets/vpp-mpls/LAB v2 (1).svg" alt="Lab Setup" >}}
+
+For this article, I'm going to boot up instance LAB1 with no changes (for posterity, using image
+`vpp-proto-disk0@20230403-release`), and it will be in the same state it was at the end of my
+previous [[MPLS article]({% post_url 2023-05-07-vpp-mpls-1 %})]. To recap, there are four routers
+daisychained in a string, and they are called `vpp1-0` through `vpp1-3`. I've then connected a
+Debian virtual machine on both sides of the string. `host1-0.enp16s0f3` connects to `vpp1-3.e2`
+and `host1-1.enp16s0f0` connects to `vpp1-0.e3`. Finally, recall that all of the links between these
+routers and hosts can be inspected with the machine `tap1-0` which is connected to a mirror port on
+the underlying Open vSwitch fabric. I bound some RFC1918 addresses on `host1-0` and `host1-1` and
+can ping between the machines, using the VPP routers as MPLS transport.
+ +### MPLS: Simple LSP + +In this mode, I can plumb two _label switched paths (LSPs)_, the first one westbound from `vpp1-3` +to `vpp1-0`, and it wraps the packet destined to 10.0.1.1 into an MPLS packet with a single label +100: +``` +vpp1-3# ip route add 10.0.1.1/32 via 192.168.11.10 GigabitEthernet10/0/0 out-labels 100 +vpp1-2# mpls local-label add 100 eos via 192.168.11.8 GigabitEthernet10/0/0 out-labels 100 +vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100 +vpp1-0# mpls local-label add 100 eos via ip4-lookup-in-table 0 +vpp1-0# ip route add 10.0.1.1/32 via 192.0.2.2 +``` + +The second is eastbound from `vpp1-0` to `vpp1-3`, and it is using MPLS label 103. Remember: +LSPs are unidirectional! + +``` +vpp1-0# ip route add 10.0.1.0/32 via 192.168.11.7 GigabitEthernet10/0/1 out-labels 103 +vpp1-1# mpls local-label add 103 eos via 192.168.11.9 GigabitEthernet10/0/1 out-labels 103 +vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103 +vpp1-3# mpls local-label add 103 eos via ip4-lookup-in-table 0 +vpp1-3# ip route add 10.0.1.0/32 via 192.0.2.0 +``` + +With these two _LSPs_ established, the ICMP echo request and subsequent ICMP echo reply can be seen +traveling through the network entirely as MPLS: + +``` +root@tap1-0:~# tcpdump -c 10 -eni enp16s0f0 +tcpdump: verbose output suppressed, use -v[v]... for full protocol decode +listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes +14:41:07.526861 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33 + p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 +14:41:07.528103 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22 + p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64) + 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 +14:41:07.529342 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21 + p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63) + 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 +14:41:07.530421 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20 + p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62) + 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 +14:41:07.531160 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40 + p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 + +14:41:07.531455 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40 + p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 +14:41:07.532245 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20 + p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64) + 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 +14:41:07.532732 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21 + p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63) + 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 +14:41:07.533923 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22 + p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 62) + 10.0.1.1 > 
10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
+14:41:07.535040 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33
+ p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
+10 packets captured
+10 packets received by filter
+```
+
+When `vpp1-0` receives the MPLS frame with label 100,S=1, it looks up label 100 in the MPLS FIB to
+figure out what _operation_ to perform. For this packet, the operation is to POP the label, revealing
+the inner IPv4 payload, which it must then look up in the IPv4 FIB and forward as per normal. This is
+a bit more expensive than it could be, and the folks who established MPLS protocols found a few
+clever ways to cut down on cost!
+
+### MPLS: Well-Known Label Values
+
+I didn't know this until I started tinkering with MPLS on VPP, and as an operator it's easy to
+overlook these things. As it so turns out, there are a few MPLS label values that have a very specific
+meaning. Taking a read on [[RFC3032](https://www.rfc-editor.org/rfc/rfc3032.html)], label values 0-15
+are reserved and they each serve a specific purpose:
+
+* ***Value 0***: IPv4 Explicit NULL Label
+* ***Value 1***: Router Alert Label
+* ***Value 2***: IPv6 Explicit NULL Label
+* ***Value 3***: Implicit NULL Label
+
+There are a few other label values, 4-15, and if you're curious you could take a look at the [[IANA
+List](https://www.iana.org/assignments/mpls-label-values/mpls-label-values.xhtml)] for them. For my
+purposes, though, I'm only going to look at these weird little _NULL_ labels. What do they do?
+
+### MPLS: Explicit Null
+
+RFC3032 discusses the IPv4 explicit NULL label, value 0 (and the IPv6 variant with value 2):
+
+> This label value is only legal at the bottom of the label
+> stack. It indicates that the label stack must be popped,
+> and the forwarding of the packet must then be based on the
+> IPv4 header.
+
+What this means in practice is that we can allow MPLS _PE-Routers_ to take a little shortcut. If the
+MPLS label in the last hop is just telling the router to POP the label and take a look in its IPv4
+forwarding table, I can also set the label to 0 in the router just preceding it. This way, when the
+last router sees label value 0, it already knows what to do, saving it one FIB lookup.
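+
+To illustrate the shortcut, here's a small Python sketch (my own rendering of the idea, not how
+VPP's FIB is actually implemented): an egress LSR can dispatch on the reserved label values without
+ever consulting its MPLS FIB:
+
+```
+IPV4_EXPLICIT_NULL, ROUTER_ALERT, IPV6_EXPLICIT_NULL, IMPLICIT_NULL = 0, 1, 2, 3
+
+def egress_dispatch(label: int, mpls_fib: dict) -> str:
+    if label == IPV4_EXPLICIT_NULL:
+        return "pop, then ip4-lookup"       # reserved value: no MPLS FIB lookup needed
+    if label == IPV6_EXPLICIT_NULL:
+        return "pop, then ip6-lookup"
+    return mpls_fib.get(label, "drop")      # regular label: one extra FIB lookup first
+
+mpls_fib = {100: "pop, then ip4-lookup (via ip4-lookup-in-table 0)"}
+print(egress_dispatch(0, mpls_fib))         # the explicit-null shortcut
+print(egress_dispatch(100, mpls_fib))       # the classic end-of-stack label from before
+```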
+
+I can reconfigure both LSPs to make use of this feature, by changing the MPLS FIB entries on the
+penultimate routers: on `vpp1-1` for the _LSP_ towards `vpp1-0`, and on `vpp1-2` for the _LSP_
+towards `vpp1-3`. I remove what I configured before (`mpls local-label del ...`) and replace that
+with an out-label value of 0 (`mpls local-label add ...`):
+
+```
+vpp1-1# mpls local-label del 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100
+vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 0
+
+vpp1-2# mpls local-label del 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103
+vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 0
+```
+
+Due to this, the last routers in the _LSP_ now already know what to do, so I can clean these up:
+
+```
+vpp1-0# mpls local-label del 100 eos via ip4-lookup-in-table 0
+vpp1-3# mpls local-label del 103 eos via ip4-lookup-in-table 0
+```
+
+If I ping from `host1-0` to `host1-1` again, I can see a subtle but important difference in the
+packets on the wire:
+
+```
+17:49:23.770119 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0,
+ ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
+17:49:23.770403 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0,
+ ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
+ 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
+17:49:23.771184 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0,
+ ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
+ 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
+17:49:23.772503 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0,
+ ethertype MPLS unicast (0x8847), MPLS (label 0, exp 0, [S], ttl 62)
+ 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
+17:49:23.773392 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0,
+ ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
+
+17:49:23.773602 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0,
+ ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
+17:49:23.774592 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0,
+ ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64)
+ 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
+17:49:23.775804 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0,
+ ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63)
+ 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
+17:49:23.776973 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0,
+ ethertype MPLS unicast (0x8847), MPLS (label 0, exp 0, [S], ttl 62)
+ 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
+17:49:23.778255 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0,
+ ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
+```
+
+Did you spot it? :) If your eyes are spinning, don't worry! I have configured the router `vpp1-1`
+towards `vpp1-0` in VLAN 20 to use _IPv4 Explicit NULL_ (label 0).
+You can spot it on the fourth packet in the tcpdump above. On the way back, `vpp1-2` towards `vpp1-3`
+in VLAN 22 also sets _IPv4 Explicit NULL_ for the echo-reply. But, I do notice that end to end, the
+packets are still traversing the network entirely as MPLS. The optimization here is that `vpp1-0`
+knows that label value 0 at the end of the label-stack just means 'what follows is an IPv4 packet,
+route it'.
+
+### MPLS: Implicit Null
+
+Did that really help that much? I think I can answer the question by loadtesting, but first let me
+take a closer look at what RFC3032 has to say about the _Implicit NULL Label_:
+
+> A value of 3 represents the "Implicit NULL Label". This
+> is a label that an LSR may assign and distribute, but
+> which never actually appears in the encapsulation. When
+> an LSR would otherwise replace the label at the top of the
+> stack with a new label, but the new label is "Implicit
+> NULL", the LSR will pop the stack instead of doing the
+> replacement. Although this value may never appear in the
+> encapsulation, it needs to be specified in the Label
+> Distribution Protocol, so a value is reserved.
+
+{{< image width="200px" float="right" src="/assets/vpp-mpls/PHP-logo.svg" alt="PHP Logo" >}}
+
+Oh, groovy! What this tells me is that I can take one further shortcut: if I set the label value 0
+(_Explicit NULL IPv4_), or 2 (_Explicit NULL IPv6_), my last router in the chain will know to look
+up the FIB entry automatically, saving one MPLS FIB lookup. But in this case, label value 3
+(_Implicit NULL_) is telling the router to just unwrap the MPLS parts (it's looking at them anyway!)
+and forward the bare inner payload, which is an IPv4 or IPv6 packet, directly onto the last
+router. This is what all the real geeks call _Penultimate Hop Popping_ or PHP, none of that website
+programming language rubbish!
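+
+As one more illustration in Python (again my own sketch of the concept, not VPP code), the decision
+the penultimate router makes boils down to the out-label value it learned for the LSP: label 3 never
+appears on the wire, it simply means 'pop here and hand over the bare payload':
+
+```
+IPV4_EXPLICIT_NULL, IMPLICIT_NULL = 0, 3
+
+def penultimate_forward(inner_ipv4: bytes, mpls_ttl: int, out_label: int):
+    if out_label == IMPLICIT_NULL:
+        # PHP: strip the shim entirely; the egress LSR receives a plain IPv4 packet
+        return ("ipv4", inner_ipv4)
+    # any other value (explicit null included): classic SWAP, the shim stays on
+    return ("mpls", {"label": out_label, "ttl": mpls_ttl - 1, "payload": inner_ipv4})
+
+print(penultimate_forward(b"...icmp...", 63, out_label=IPV4_EXPLICIT_NULL))  # still MPLS on the wire
+print(penultimate_forward(b"...icmp...", 63, out_label=IMPLICIT_NULL))       # plain IPv4 on the wire
+```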
+ +Let me replace the FIB entries in the penultimate routers with this magic label value (3): + +``` +vpp1-1# mpls local-label del 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 0 +vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 3 + +vpp1-2# mpls local-label del 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 0 +vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 3 +``` + +Now I would expect this _penultimate hop popping_ to yield an IPv4 packet between `vpp1-1` and +`vpp1-0` on the ICMP echo-request, and as well an IPv4 packet between `vpp1-2` and `vpp1-3` on the ICMP +echo-reply way back, and would you look at that: + +``` +17:45:35.783214 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0, + ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64 +17:45:35.783879 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0, + ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64) + 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64 +17:45:35.784222 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0, + ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63) + 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64 +17:45:35.785123 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0, + ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64 +17:45:35.785311 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0, + ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64 + +17:45:35.785533 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0, + ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64 +17:45:35.786465 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0, + ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64) + 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64 +17:45:35.787354 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0, + ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63) + 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64 +17:45:35.787575 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, + ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64 +17:45:35.788320 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0, + ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64 +``` + +I can now see that the behavior has changed in a subtle way once again. Where before, there were +**three** MPLS packets all the way between `vpp1-3` through `vpp1-2` and `vpp1-1` onto `vpp1-0`, now +there are only **two** MPLS packets, and the last one (on the way out in VLAN 20, and on the way +back in VLAN 22), is just an IPv4 packet. PHP is slick! 
+
+## Loadtesting Setup: Bare Metal
+
+{{< image width="350px" float="right" src="/assets/vpp-mpls/MPLS Lab.svg" alt="Bare Metal Setup" >}}
+
+In 1997, an Internet Engineering Task Force (IETF) working group created standards to help fix the
+issues of the time, mostly around internet traffic routing. MPLS was developed as an alternative to
+multilayer switching and IP over asynchronous transfer mode (ATM). In the 90s, routers were
+comparatively weak in terms of CPU, and things like _content addressable memory_ to facilitate
+faster lookups were incredibly expensive. Back then, every FIB lookup counted, so tricks like
+_Penultimate Hop Popping_ really helped. But what about now? I'm reasonably confident that any
+silicon based router would not mind having one extra MPLS FIB operation, and equally would not mind
+unwrapping the MPLS packet at the end. But, since these things exist, I thought it would be a fun
+activity to see how much they would help in the VPP world, where just like in the old days, every
+operation performed on a packet does cost valuable CPU cycles.
+
+I can't really perform a loadtest on the virtual machines backed by Open vSwitch, while tightly
+packing six machines on one hypervisor. That setup is made specifically to do functional testing and
+development work. To do a proper loadtest, I will need bare metal. So, I grabbed three Supermicro
+SYS-5018D-FN8T, which I'm running throughout [[AS8298]({% post_url 2021-02-27-network %})], as I
+know their performance quite well. I'll daisychain them with TenGig ports.
+This way, I can take a look at the cost of _P-Routers_ (which only SWAP MPLS labels and forward the
+result), as well as _PE-Routers_ (which have to encapsulate, and sometimes decapsulate the IP or
+Ethernet traffic).
+
+These machines get a fresh Debian Bookworm install and VPP 23.06 without any plugins. It's weird for
+me to run a VPP instance without Linux CP, but in this case I'm going completely vanilla, so I
+disable all plugins and give each VPP machine one worker thread. The install follows my popular
+[[VPP-7]({% post_url 2021-09-21-vpp-7 %})]. By the way, did you know that you can just type the
+search query [VPP-7] directly into Google to find this article? Am I an influencer now? Jokes aside,
+I decide to call the bare metal machines _France_,
+_Belgium_ and _Netherlands_. And because if it ain't Dutch, it ain't much, the Netherlands machine
+sits on top :)
+
+### IPv4 forwarding performance
+
+The way Cisco T-Rex works in its simplest stateless loadtesting mode is that it reads a Scapy file,
+for example `bench.py`, and it then generates a stream of traffic from its first port, through the
+_device under test (DUT)_, and expects to see that traffic returned on its second port. In a
+bidirectional mode, traffic is sent from `16.0.0.0/8` to `48.0.0.0/8` in one direction, and back
+from `48.0.0.0/8` to `16.0.0.0/8` in the other.
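+
+The actual `bench.py` profile ships with T-Rex itself; purely as an illustration of what such a
+stream boils down to, here is a rough Scapy-flavored Python sketch (my own simplification, not the
+real profile) of the base packet for each direction:
+
+```
+from scapy.all import Ether, IP, UDP
+
+def base_packet(direction: int, size: int = 64):
+    # direction 0: 16.0.0.0/8 -> 48.0.0.0/8; direction 1 is the mirror image
+    src, dst = ("16.0.0.1", "48.0.0.1") if direction == 0 else ("48.0.0.1", "16.0.0.1")
+    pkt = Ether() / IP(src=src, dst=dst) / UDP(sport=1025, dport=12)
+    return pkt / (b"\x00" * max(0, size - len(pkt)))   # pad up to the requested frame size
+
+print(base_packet(0).summary())
+print(base_packet(1).summary())
+```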
+
+OK, so first things first, let me configure a basic skeleton, taking _Netherlands_ as an example:
+
+```
+netherlands# set interface ip address TenGigabitEthernet6/0/1 192.168.13.7/31
+netherlands# set interface ip address TenGigabitEthernet6/0/1 2001:678:d78:230::2:2/112
+netherlands# set interface state TenGigabitEthernet6/0/1 up
+netherlands# ip route add 100.64.0.0/30 via 192.168.13.6
+netherlands# ip route add 192.168.13.4/31 via 192.168.13.6
+
+netherlands# set interface ip address TenGigabitEthernet6/0/0 100.64.1.2/30
+netherlands# set interface state TenGigabitEthernet6/0/0 up
+netherlands# ip nei TenGigabitEthernet6/0/0 100.64.1.1 9c:69:b4:61:ff:40 static
+
+netherlands# ip route add 16.0.0.0/8 via 100.64.1.1
+netherlands# ip route add 48.0.0.0/8 via 192.168.13.6
+```
+
+The _Belgium_ router just has static routes back and forth, and the _France_ router looks similar
+except it has its static routes all pointing in the other direction, and of course it has different
+/31 transit networks towards T-Rex and _Belgium_. The one thing that is a bit curious is the use of
+a static ARP entry that allows the VPP routers to resolve the nexthop for T-Rex -- in the case
+above, T-Rex is sourcing from 100.64.1.1/30 (which has MAC address 9c:69:b4:61:ff:40) and sending to
+our 100.64.1.2 on Te6/0/0.
+
+{{< image src="/assets/vpp-mpls/trex.png" alt="T-Rex Baseline" >}}
+
+After fiddling around a little bit with `imix`, I do notice the machine is still keeping up
+with one CPU thread in both directions (~6.5Mpps). So I switch to 64b packets and ramp up traffic
+until that one VPP worker thread is saturated, which is at around the 9.2Mpps mark, so I lower it
+slightly to a cool 9Mpps. Note: this CPU can have 3 worker threads in production, so it can do roughly
+27Mpps per router, which is way cool!
+
+The machines are at this point all doing exactly the same: receive ethernet from DPDK, do an IPv4
+lookup, rewrite the header, and emit the frame on another interface. I can see that clearly in the
+runtime statistics, taking a look at _Belgium_ for example:
+
+```
+belgium# show run
+Thread 1 vpp_wk_0 (lcore 1)
+Time 7912.6, 10 sec internal node vector rate 207.47 loops/sec 20604.47
+ vector rates in 8.9997e6, out 9.0054e6, drop 0.0000e0, punt 0.0000e0
+ Name State Calls Vectors Suspends Clocks Vectors/Call
+TenGigabitEthernet6/0/0-output active 172120948 35740749991 0 6.47e0 207.65
+TenGigabitEthernet6/0/0-tx active 171687877 35650752635 0 8.49e1 207.65
+TenGigabitEthernet6/0/1-output active 172119849 35740963315 0 7.79e0 207.65
+TenGigabitEthernet6/0/1-tx active 171471125 35605967085 0 8.48e1 207.65
+dpdk-input polling 171588827 71211720238 0 4.87e1 415.01
+ethernet-input active 344675998 71571710136 0 2.16e1 207.65
+ip4-input-no-checksum active 343340278 71751697912 0 1.86e1 208.98
+ip4-load-balance active 342929714 71661706997 0 1.44e1 208.97
+ip4-lookup active 341632798 71391716172 0 2.28e1 208.97
+ip4-rewrite active 342498637 71571712383 0 2.59e1 208.97
+```
+
+Looking at the time spent for one individual packet, it's about 245 CPU cycles, and considering the
+cores on this Xeon D1518 run at 2.2GHz, that checks out very accurately: 2.2e9 / 245 = 9Mpps! Every
+time that DPDK is asked for some work, it yields on average a vector of 208 packets -- and this is
+why VPP is so super fast: the first packet may need to page in the instructions belonging to one of
+the graph nodes, but the second through 208th packet will find almost 100% hitrate in the CPU's
+instruction cache. Who needs RAM anyway?
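+
+As a quick sanity check of that arithmetic (a trivial Python snippet, using only the clock rate and
+the per-packet cycle counts quoted above):
+
+```
+CLOCK_HZ = 2.2e9                      # one Xeon D1518 worker thread
+
+def mpps(cycles_per_packet: float) -> float:
+    return CLOCK_HZ / cycles_per_packet / 1e6
+
+print(f"{mpps(245):.2f} Mpps")        # ~8.98 Mpps for the IPv4 baseline above
+```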
+
+### MPLS forwarding performance
+
+Now that I have a baseline, I can take a look at the difference between the IPv4 path and the MPLS
+path, and here's where the routers will start to behave differently. _France_ and _Netherlands_ will
+be _PE-Routers_ and handle encapsulation/decapsulation, while _Belgium_ has a comparatively easy
+job, as it will only handle MPLS forwarding. I'll choose country-codes for the labels: traffic
+destined to _France_ will have MPLS label 33,S=1, while traffic going to _Netherlands_ will have
+MPLS label 31,S=1.
+
+```
+netherlands# ip ro del 48.0.0.0/8 via 192.168.13.6
+netherlands# ip ro add 48.0.0.0/8 via 192.168.13.6 TenGigabitEthernet6/0/1 out-labels 33
+netherlands# mpls local-label add 31 eos via ip4-lookup-in-table 0
+
+belgium# ip route del 48.0.0.0/8 via 192.168.13.4
+belgium# ip route del 16.0.0.0/8 via 192.168.13.7
+belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 33
+belgium# mpls local-label add 31 eos via 192.168.13.7 TenGigabitEthernet6/0/0 out-labels 31
+
+france# ip route del 16.0.0.0/8 via 192.168.13.5
+france# ip route add 16.0.0.0/8 via 192.168.13.5 TenGigabitEthernet6/0/1 out-labels 31
+france# mpls local-label add 33 eos via ip4-lookup-in-table 0
+```
+
+The types of _operation_ in MPLS are no longer symmetric. On the way in, the _PE-Router_ has to
+encapsulate the IPv4 packet into an MPLS packet, and on the way out, the _PE-Router_ has to
+decapsulate the MPLS packet to reveal the IPv4 packet. So, I change the loadtester to be
+unidirectional, and ask it to send 10Mpps from _Netherlands_ to _France_. As soon as I reconfigure
+the routers in this mode, I see quite a bit of _packetlo_, as only 7.3Mpps make it through.
+Interesting! I wonder where this traffic is dropped, and what the bottleneck is, precisely.
+
+#### MPLS: PE Ingress Performance
+
+First, let's take a look at _Netherlands_, to try to understand why it is more expensive:
+
+```
+netherlands# show run
+Time 255.5, 10 sec internal node vector rate 256.00 loops/sec 29399.92
+ vector rates in 7.6937e6, out 7.6937e6, drop 0.0000e0, punt 0.0000e0
+ Name State Calls Vectors Suspends Clocks Vectors/Call
+TenGigabitEthernet6/0/1-output active 7978541 2042505472 0 7.28e0 255.99
+TenGigabitEthernet6/0/1-tx active 7678013 1965570304 0 8.25e1 255.99
+dpdk-input polling 7684444 1965570304 0 4.55e1 255.79
+ethernet-input active 7978549 2042507520 0 1.94e1 255.99
+ip4-input-no-checksum active 7978557 2042509568 0 1.75e1 255.99
+ip4-lookup active 7678013 1965570304 0 2.17e1 255.99
+ip4-mpls-label-imposition-pipe active 7678013 1965570304 0 2.42e1 255.99
+mpls-output active 7678013 1965570304 0 6.71e1 255.99
+```
+
+Each packet goes from `dpdk-input` into `ethernet-input`; the resulting IPv4 packet visits
+`ip4-lookup`, where the MPLS out-label is found in the IPv4 FIB; the packet is then wrapped into
+an MPLS packet in `ip4-mpls-label-imposition-pipe` and sent through `mpls-output` to the NIC.
+In total the input path (`ip4-*` plus `mpls-*`) takes **131 CPU cycles** for each packet. Including
+all the nodes, from DPDK input to DPDK output sums up to 285 cycles, so 2.2GHz/285 = 7.69Mpps which
+checks out.
+
+#### MPLS: P Transit Performance
+
+I would expect that _Belgium_ has it easier, as it's only doing label swapping and MPLS forwarding.
+
+```
+belgium# show run
+Thread 1 vpp_wk_0 (lcore 1)
+Time 595.6, 10 sec internal node vector rate 47.68 loops/sec 224464.40
+ vector rates in 7.6930e6, out 7.6930e6, drop 0.0000e0, punt 0.0000e0
+ Name State Calls Vectors Suspends Clocks Vectors/Call
+TenGigabitEthernet6/0/1-output active 97711093 4659109793 0 8.83e0 47.68
+TenGigabitEthernet6/0/1-tx active 96096377 4582172229 0 8.14e1 47.68
+dpdk-input polling 161102959 4582172278 0 5.72e1 28.44
+ethernet-input active 97710991 4659111684 0 2.45e1 47.68
+mpls-input active 97709468 4659096718 0 2.25e1 47.68
+mpls-label-imposition-pipe active 99324916 4736048227 0 2.52e1 47.68
+mpls-lookup active 99324903 4736045943 0 3.25e1 47.68
+mpls-output active 97710989 4659111742 0 3.04e1 47.68
+```
+
+Indeed, _Belgium_ can still breathe: it's spending **110 Cycles per packet** doing the MPLS
+switching (`mpls-*`), which is 18% less than the _PE-Router_ ingress. Judging by the _vectors/Call_
+(last column), it's also running a bit cooler than the ingress router.
+
+It's nice to see that the claim that _P-Routers_ are cheaper on the CPU can be verified to be true
+in practice!
+
+#### MPLS: PE Egress Performance
+
+On to the last router, _France_, which is in charge of decapsulating the MPLS packet and doing the
+resulting IPv4 lookup:
+
+```
+france# show run
+Thread 1 vpp_wk_0 (lcore 1)
+Time 1067.2, 10 sec internal node vector rate 256.00 loops/sec 27986.96
+ vector rates in 7.3234e6, out 7.3234e6, drop 0.0000e0, punt 0.0000e0
+ Name State Calls Vectors Suspends Clocks Vectors/Call
+TenGigabitEthernet6/0/0-output active 30528978 7815395072 0 6.59e0 255.99
+TenGigabitEthernet6/0/0-tx active 30528978 7815395072 0 8.20e1 255.99
+dpdk-input polling 30534880 7815395072 0 4.68e1 255.95
+ethernet-input active 30528978 7815395072 0 1.97e1 255.99
+ip4-load-balance active 30528978 7815395072 0 1.35e1 255.99
+ip4-mpls-label-disposition-pip active 30528978 7815395072 0 2.82e1 255.99
+ip4-rewrite active 30528978 7815395072 0 2.48e1 255.99
+lookup-ip4-dst active 30815069 7888634368 0 3.09e1 255.99
+mpls-input active 30528978 7815395072 0 1.86e1 255.99
+mpls-lookup active 30528978 7815395072 0 2.85e1 255.99
+```
+
+This router is spending its time (in `*ip4*` and `mpls-*`) at roughly **144.5 Cycles per
+packet** and reveals itself as the bottleneck. _Netherlands_ sent _Belgium_ 7.69Mpps which it all
+forwarded to _France_, where only 7.3Mpps make it through this _PE-Router_ egress, and into the
+hands of T-Rex. In total, this router is spending 298 cycles/packet, which amounts to 7.37Mpps.
+
+### MPLS Explicit Null performance
+
+At the beginning of this article, I made a claim that we could take some shortcuts, and now is a
+good time to see if those shortcuts are worthwhile in the VPP setting. I'll reconfigure the
+_Belgium_ router to set the _IPv4 Explicit NULL_ label (0), which can help my poor overloaded
+_France_ router save some valuable CPU cycles.
+
+```
+belgium# mpls local-label del 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 33
+belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 0
+```
+
+The situation for _Belgium_ doesn't change at all: it's still doing the SWAP operation on the
+incoming packet, but it's writing label 0,S=1 now (instead of label 33,S=1 before).
But, haha!, take +a look at _France_ for an important difference: + +``` +france# show run +Thread 1 vpp_wk_0 (lcore 1) +Time 53.3, 10 sec internal node vector rate 85.35 loops/sec 77643.80 + vector rates in 7.6933e6, out 7.6933e6, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +TenGigabitEthernet6/0/0-output active 4773870 409847372 0 6.96e0 85.85 +TenGigabitEthernet6/0/0-tx active 4773870 409847372 0 8.07e1 85.85 +dpdk-input polling 4865704 409847372 0 5.01e1 84.23 +ethernet-input active 4773870 409847372 0 2.15e1 85.85 +ip4-load-balance active 4773869 409847235 0 1.51e1 85.85 +ip4-rewrite active 4773870 409847372 0 2.60e1 85.85 +lookup-ip4-dst-itf active 4773870 409847372 0 3.41e1 85.85 +mpls-input active 4773870 409847372 0 1.99e1 85.85 +mpls-lookup active 4773870 409847372 0 3.01e1 85.85 +``` + +First off, I notice the _input_ vector rates match the _output_ vector rates, both at 7.69Mpps, and +that the average _Vectors/Call_ is no longer pegged at 256. The router is now spending **125 Cycles +per packet** which is a lot better than it was before (15.4% better than 144.5 Cycles/packet). + +Conclusion: **MPLS Explicit NULL is cheaper**! + +### MPLS Implicit Null (PHP) performance + +So there's one mode of operation left for me to play with. What if we asked _Belgium_ to unwrap the +MPLS packet and forward it as an IPv4 packet towards _France_, in other words apply _Penultimate Hop +Popping_? Of course, the ingress _Netherlands_ won't change at all, but I reconfigure the _Belgium_ +router, like so: + +``` +belgium# mpls local-label del 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 0 +belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 3 +``` + +The situation in _Belgium_ now looks subtly different: + +``` +belgium# show run +Thread 1 vpp_wk_0 (lcore 1) +Time 171.1, 10 sec internal node vector rate 50.64 loops/sec 188552.87 + vector rates in 7.6966e6, out 7.6966e6, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +TenGigabitEthernet6/0/1-output active 26128425 1316828499 0 8.74e0 50.39 +TenGigabitEthernet6/0/1-tx active 26128424 1316828327 0 8.16e1 50.39 +dpdk-input polling 39339977 1316828499 0 5.58e1 33.47 +ethernet-input active 26128425 1316828499 0 2.39e1 50.39 +ip4-mpls-label-disposition-pip active 26128425 1316828499 0 3.07e1 50.39 +ip4-rewrite active 27648864 1393790359 0 2.82e1 50.41 +mpls-input active 26128425 1316828499 0 2.21e1 50.39 +mpls-lookup active 26128422 1316828355 0 3.16e1 50.39 +``` + +After doing the `mpls-lookup`, this router finds that it can just toss the label and forward the +packet as IPv4 down south. Cost for _Belgium_: **113 Cycles per packet**. + +_France_ is now not participating in MPLS at all - it is simply receiving IPv4 packets which it has +to route back towards T-Rex. 
I take one final look at _France_ to see where it's spending its time:
+
+```
+france# show run
+Thread 1 vpp_wk_0 (lcore 1)
+Time 397.3, 10 sec internal node vector rate 42.17 loops/sec 259634.88
+ vector rates in 7.7112e6, out 7.6964e6, drop 0.0000e0, punt 0.0000e0
+ Name State Calls Vectors Suspends Clocks Vectors/Call
+TenGigabitEthernet6/0/0-output active 74381543 3211443520 0 9.47e0 43.18
+TenGigabitEthernet6/0/0-tx active 70820630 3057504872 0 8.26e1 43.17
+dpdk-input polling 131873061 3063377312 0 6.09e1 23.23
+ethernet-input active 72645873 3134461107 0 2.66e1 43.15
+ip4-input-no-checksum active 70820629 3057504812 0 2.68e1 43.17
+ip4-load-balance active 72646140 3134473660 0 1.74e1 43.15
+ip4-lookup active 70820628 3057504796 0 2.79e1 43.17
+ip4-rewrite active 70820631 3057504924 0 2.96e1 43.17
+```
+
+As an IPv4 router, _France_ spends in total **102 Cycles per packet**. This matches very closely
+with the 104 cycles/packet I found when doing my baseline loadtest with only IPv4 routing. I love it
+when numbers align!!
+
+### Scaling
+
+One thing that I was curious to know is whether MPLS packets would allow for multiple receive queues, to
+enable horizontal scaling by adding more VPP worker threads. The answer is a resounding YES! If I
+restart the VPP routers _Netherlands_, _Belgium_ and _France_ with three workers and set DPDK
+`num-rx-queues` to 3 as well, I see perfect linear scaling: in other words, these little routers
+would be able to forward roughly 27Mpps of MPLS packets with varying inner payloads (be it IPv4 or
+IPv6 or Ethernet traffic with differing src/dest MAC addresses). All things said, IPv4 is still a
+little bit cheaper on the CPU, at least on these routers with only a very small routing table. But,
+it's great to see that MPLS forwarding can leverage RSS.
+
+### Conclusions
+
+This is all fine and dandy, but I think it's a bit trickier to see if PHP is actually cheaper or not.
+To answer this question, I think I should count the total amount of CPU cycles spent end to end: for a
+packet traveling from T-Rex coming into _Netherlands_, through _Belgium_ and _France_, and back out
+to T-Rex.
+
+| | Netherlands | Belgium | France | Total Cost |
+| ------------------------- | ----------- | ----------- | ----------- | ------------ |
+| Regular IPv4 path | 104 cycles | 104 cycles | 104 cycles | 312 cycles |
+| MPLS: Simple LSP | 131 cycles | 110 cycles | 145 cycles | 386 cycles |
+| MPLS: Explicit NULL LSP | 131 cycles | 110 cycles | 125 cycles | 366 cycles |
+| MPLS: Penultimate Hop Pop | 131 cycles | 113 cycles | 102 cycles | ***346 cycles*** |
+
+***Note***: The clock cycle numbers here are only `*mpls*` and `*ip4*` nodes, excluding the `*-input`,
+`*-output` and `*-tx` nodes, as they will add the same cost for all modes of operation.
+
+I threw a lot of numbers into this article, and my head is spinning as I write this. But I still
+think I can wrap it up in a way that allows me to have a few high level takeaways:
+
+* IPv4 forwarding is a fair bit cheaper than MPLS forwarding (with an empty FIB, anyway). I had
+ not expected this!
+* End to end, the MPLS bottleneck is in the _PE-Ingress_ operation.
+* Explicit NULL helps without any drawbacks, as it cuts off one MPLS FIB lookup in the _PE-Egress_
+ operation.
+* Implicit NULL (aka Penultimate Hop Popping) is also the fastest way to do MPLS with VPP, all
+ things considered.
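+
+As a final cross-check of the table above, the 'Total Cost' column is simply the row sum of the
+three per-router measurements -- a trivial Python snippet to reproduce it:
+
+```
+paths = {
+    "Regular IPv4 path":         (104, 104, 104),
+    "MPLS: Simple LSP":          (131, 110, 145),
+    "MPLS: Explicit NULL LSP":   (131, 110, 125),
+    "MPLS: Penultimate Hop Pop": (131, 113, 102),
+}
+for name, cycles in paths.items():
+    print(f"{name:27s} {sum(cycles)} cycles end to end")
+```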
+
+## What's next
+
+I joined forces with [@vifino](https://chaos.social/@vifino) who has effectively added MPLS handling
+to the Linux Control Plane, so VPP can start to function as an MPLS router using FRR's label
+distribution protocol implementation. Gosh, I wish Bird3 would have LDP :)
+
+Our work is mostly complete; there are two pending Gerrits which should be ready to review and
+certainly ready to play with:
+
+1. [[Gerrit 38826](https://gerrit.fd.io/r/c/vpp/+/38826)]: This adds the ability to listen to internal
+ state changes of an interface, so that the Linux Control Plane plugin can enable MPLS on the
+ _LIP_ interfaces and set the Linux sysctl for MPLS input.
+1. [[Gerrit 38702](https://gerrit.fd.io/r/c/vpp/+/38702)]: This adds the ability to listen to Netlink
+ messages in the Linux Control Plane plugin, and sensibly apply these routes to the IPv4, IPv6
+ and MPLS FIB in the VPP dataplane.
+
+If you'd like to test this - reach out to the VPP Developer mailinglist
+[[ref](mailto:vpp-dev@lists.fd.io)] any time!
diff --git a/content/articles/2023-05-21-vpp-mpls-3.md b/content/articles/2023-05-21-vpp-mpls-3.md
new file mode 100644
index 0000000..1280238
--- /dev/null
+++ b/content/articles/2023-05-21-vpp-mpls-3.md
@@ -0,0 +1,463 @@
+---
+date: "2023-05-21T11:01:14Z"
+title: VPP MPLS - Part 3
+---
+
+{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
+
+# About this series
+
+**Special Thanks**: Adrian _vifino_ Pistol for writing this code and for the wonderful collaboration!
+
+Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
+performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
+_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
+are shared between the two.
+
+In the [[first article]({%post_url 2023-05-07-vpp-mpls-1 %})] of this series, I took a look at MPLS
+in general, and how setting up static _Label Switched Paths_ can be done in VPP. A few details on
+special case labels (such as _Implicit Null_ which enabled the fabled _Penultimate Hop Popping_)
+were missing, so I took a good look at them in the [[second article]({% post_url
+2023-05-17-vpp-mpls-2 %})] of the series.
+
+This was all just good fun but also allowed me to buy some time for
+[@vifino](https://chaos.social/@vifino) who has been implementing MPLS handling within the Linux
+Control Plane plugin for VPP! This final article in the series shows the engineering considerations
+that went into writing the plugin, which is currently under review but reasonably complete.
+Considering the VPP 23.06 cutoff is next week, I'm not super hopeful that we'll be able to get a
+full community / committer review in time, but at this point both @vifino and I think this code is
+ready for consumption - considering FRR has a good _Label Distribution Protocol_ daemon, I'll switch
+out of my usual habitat of Bird and install a LAB with FRR.
+
+Caveat emptor: outside of a modest functional and load-test, this MPLS functionality
+hasn't seen a lot of mileage as it's only a few weeks old at this point, so it could definitely
+contain some rough edges. Use at your own risk, but if you did want to discuss issues, the
+[[vpp-dev@](mailto:vpp-dev@lists.fd.io)] mailinglist is a good first stop.
+ +## Introduction + +MPLS support is fairly complete in VPP already, but programming the dataplane would require custom +integrations, while using the Linux netlink subsystem feels easier from an end-user point of view. +This is a technical deep dive into the implementation of MPLS in the Linux Control Plane plugin for +VPP. If you haven't already, now is a good time to read up on the initial implementation of LCP: + +* [[Part 1]({% post_url 2021-08-12-vpp-1 %})]: Punting traffic through TUN/TAP interfaces into Linux +* [[Part 2]({% post_url 2021-08-13-vpp-2 %})]: Mirroring VPP interface configuration into Linux +* [[Part 3]({% post_url 2021-08-15-vpp-3 %})]: Automatically creating sub-interfaces in Linux +* [[Part 4]({% post_url 2021-08-25-vpp-4 %})]: Synchronize link state, MTU and addresses to Linux +* [[Part 5]({% post_url 2021-09-02-vpp-5 %})]: Netlink Listener, synchronizing state from Linux to VPP +* [[Part 6]({% post_url 2021-09-10-vpp-6 %})]: Observability with LibreNMS and VPP SNMP Agent +* [[Part 7]({% post_url 2021-09-21-vpp-7 %})]: Productionizing and reference Supermicro fleet at IPng + +To keep this writeup focused, I'll assume the anatomy of VPP plugins and the Linux Controlplane +_Interface_ and _Netlink_ plugins are understood. That way, I can focus on the _changes_ needed for +MPLS integration, which at first glance seem reasonably straight forward. + +## VPP Linux-CP: Interfaces + +First off, to enable any MPLS forwarding at all in VPP, I have to create the MPLS forwarding table +and enable MPLS on one or more interfaces: + +``` +vpp# mpls table add 0 +vpp# lcp create GigabitEthernet10/0/0 host-if e0 +vpp# set int mpls GigabitEthernet10/0/0 enable +``` + +What happens when the Gi10/0/0 interface has a _Linux Interface Pair (LIP)_ is that there exists a +corresponding TAP interface in the dataplane (typically called `tapX`) which in turn appears on the +Linux side as `e0`. Linux will want to be able to send MPLS datagrams into `e0`, and for that, two +things must happen: + +1. Linux kernel must enable MPLS input on `e0`, typically with a sysctl. +1. VPP must enable MPLS on the TAP, in addition to the phy Gi10/0/0. + +Therefore, the first order of business is to create a hook where the Linux CP interface plugin can +be made aware if MPLS is enabled or disabled in VPP - it turns out, such a callback function +_definition_ already exists, but it was never implemented. [[Gerrit +38826](https://gerrit.fd.io/r/c/vpp/+/38826)] adds a function `mpls_interface_state_change_add_callback()`, +which implements the ability to register a callback on MPLS on/off in VPP. + +Now that the callback plumbing exists, Linux CP will want to register one of these, so that it can +set MPLS to the same enabled or disabled state on the Linux interface using +`/proc/sys/net/mpls/conf/${host-if}/input` (which is the moral equivalent of running `sysctl`), and +it'll also call `mpls_sw_interface_enable_disable()` on the TAP interface. With these changes both +implemented, enabling MPLS now looks like this in the logs: + +``` +linux-cp/mpls-sync: sync_state_cb: called for sw_if_index 1 +linux-cp/mpls-sync: sync_state_cb: mpls enabled 1 parent itf-pair: [1] GigabitEthernet10/0/0 tap2 e0 97 type tap netns dataplane +linux-cp/mpls-sync: sync_state_cb: called for sw_if_index 8 +linux-cp/mpls-sync: sync_state_cb: set mpls input for e0 +``` + +Take a look at the code that implements enable/disable semantics in `src/plugins/linux-cp/lcp_mpls_sync.c`. 
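+
+As an aside, the effect of this callback is easy to verify from the Linux side. A quick check could
+look like this (illustrative only; it assumes the _LIP_ lives in a network namespace called
+`dataplane`, as in the log above):
+
+```
+$ sudo ip netns exec dataplane sysctl net.mpls.conf.e0.input
+net.mpls.conf.e0.input = 1
+$ sudo ip netns exec dataplane cat /proc/sys/net/mpls/conf/e0/input
+1
+```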
+
+## VPP Linux-CP: Netlink
+
+When Linux installs a route with MPLS labels, it will be seen in the return value of
+`rtnl_route_nh_get_encap_mpls_dst()`. One or more labels can now be read using
+`nl_addr_get_binary_addr()` yielding `struct mpls_label`, which contains the label value, experiment
+bits and TTL, and these can be added to the route path in VPP by casting them to `struct
+fib_mpls_label_t`. The last label in the stack will have the S-bit set, so we can continue consuming these
+until we find that condition. The first patchset that plays around with these semantics is
+[[38702#2](https://gerrit.fd.io/r/c/vpp/+/38702/2)]. As you can see, MPLS is going to look very much
+like the IPv4 and IPv6 route updates in [[previous work]({% post_url 2021-09-02-vpp-5 %})], in that we
+take the Netlink representation, rewrite it into the VPP representation, and update the FIB.
+
+Up until now, the Linux Controlplane netlink plugin understands only IPv4 and IPv6. So some
+preparation work is called for:
+
+* ***lcp_router_proto_k2f()*** gains the ability to cast Linux `AF_*` into VPP's `FIB_PROTOCOL_*`.
+* ***lcp_router_route_mk_prefix()*** turns into a switch statement that creates a `fib_prefix_t`
+  for the MPLS address family, in addition to the existing IPv4 and IPv6 types. It uses the non-EOS
+  type.
+* ***lcp_router_mpls_nladdr_to_path()*** implements the loop that I described above, taking the
+  stack of `struct mpls_label` from Netlink and turning them into a vector of `fib_mpls_label_t`
+  for the VPP FIB.
+* ***lcp_router_route_path_parse()*** becomes aware of MPLS SWAP and POP operations (the latter
+  being the case if there are 0 labels in the Netlink label stack).
+* ***lcp_router_fib_route_path_dup()*** is a helper function to make a copy of the FIB path
+  for the EOS and non-EOS VPP FIB inserts.
+
+The VPP FIB differentiates between entries that are non-EOS (S=0), and can treat them differently to
+those which are EOS (end of stack, S=1). Linux does not make this distinction, so it's safest to
+just install non-EOS **and** EOS entries for each route from Linux. This is why
+`lcp_router_fib_route_path_dup()` exists; otherwise, Netlink route deletions for the MPLS routes
+would yield a double free later on.
+
+This prep work then allows for the following two main functions to become MPLS aware:
+
+* ***lcp_router_route_add()***: when Linux sends a Netlink message about a new route, and that
+  route carries MPLS labels, make a copy of the path for the EOS entry and proceed to insert
+  both the non-EOS and newly created EOS entries into the FIB.
+* ***lcp_router_route_del()***: when Linux sends a Netlink message about a deleted route, we can
+  remove both the EOS and non-EOS variants of the route from VPP's FIB.
+
+## VPP Linux-CP: MPLS with FRR
+
+{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
+
+I finally get to show off @vifino's lab! It's installed based on a Debian Bookworm build, because
+there are a few Netlink library changes that haven't made their way into Debian Bullseye yet. The LAB
+image is quickly built and distributed, and for this LAB I'm specifically choosing
+[[FRR](https://frrouting.org/)] because it ships with a _Label Distribution Protocol_ daemon out of
+the box.
+
+First order of business is to enable MPLS on the correct interfaces, and create the MPLS FIB table.
+On each machine, I insert the following in the startup sequence:
+
+```
+ipng@vpp0-1:~$ cat << EOF | tee -a /etc/vpp/config/manual.vpp
+mpls table add 0
+set interface mpls GigabitEthernet10/0/0 enable
+set interface mpls GigabitEthernet10/0/1 enable
+EOF
+```
+
+The lab comes with OSPF and OSPFv3 enabled on each of the Gi10/0/0 and Gi10/0/1 interfaces that go
+from East to West. This extra sequence enables MPLS on those interfaces, and because they have a
+_Linux Interface Pair (LIP)_, VPP will enable MPLS on the internal TAP interfaces, as well as set
+the Linux `sysctl` to allow the kernel to send MPLS encapsulated packets towards VPP.
+
+Next up, turning on _LDP_ for FRR, which is easy enough:
+```
+ipng@vpp0-1:~$ vtysh
+vpp0-1# conf t
+vpp0-1(config)# mpls ldp
+ router-id 192.168.10.1
+ dual-stack cisco-interop
+ ordered-control
+ !
+ address-family ipv4
+  discovery transport-address 192.168.10.1
+  label local advertise explicit-null
+  interface e0
+  interface e1
+ exit-address-family
+ !
+ address-family ipv6
+  discovery transport-address 2001:678:d78:200::1
+  label local advertise explicit-null
+  ttl-security disable
+  interface e0
+  interface e1
+ exit-address-family
+exit
+```
+
+I configure _LDP_ here to prefer advertising locally connected routes as _MPLS Explicit NULL_, which I
+described in detail in the [[previous post]({% post_url 2023-05-17-vpp-mpls-2 %})]. It tells the
+penultimate router to send this router a packet as MPLS with label value 0,S=1 for IPv4 and value 2,S=1
+for IPv6, so that VPP knows immediately to decapsulate the packet and continue to IPv4/IPv6 forwarding.
+An alternative here is setting implicit-null, which instructs the router before this one to perform
+_Penultimate Hop Popping_. If this is confusing, take a look at that article for reference!
+
+Other than that, I just give each router a transport-address on a loopback interface and a unique
+router-id (the same as used in OSPF and OSPFv3), and we're off to the races. Just take a look at how
+easy this was:
+
+```
+vpp0-1# show mpls ldp discovery
+AF   ID              Type     Source       Holdtime
+ipv4 192.168.10.0    Link     e0                 15
+ipv4 192.168.10.2    Link     e1                 15
+ipv6 192.168.10.0    Link     e0                 15
+ipv6 192.168.10.2    Link     e1                 15
+
+vpp0-1# show mpls ldp neighbor
+AF   ID              State       Remote Address                     Uptime
+ipv6 192.168.10.0    OPERATIONAL 2001:678:d78:200::
+                                                                  19:49:10
+ipv6 192.168.10.2    OPERATIONAL 2001:678:d78:200::2
+                                                                  19:49:10
+```
+
+The first `show ... discovery` shows which interfaces are receiving multicast _LDP Hello Packets_,
+and because I enabled discovery for both IPv4 and IPv6, I can see two pairs there. If I look at
+which interfaces formed adjacencies, `show ... neighbor` reveals that LDP is preferring IPv6, and
+that both adjacencies to `vpp0-0` and `vpp0-2` are operational. Awesome sauce!
+
+I see _LDP_ neighbor adjacencies, so let me show you what label information was actually
+exchanged, in three different places: **FRR**'s label distribution protocol daemon, **Linux**'s
+IPv4, IPv6 and MPLS routing tables, and **VPP**'s dataplane forwarding information base.
+ +### MPLS: FRR view + +There are two things to note -- the IPv4 and IPv6 routing table, called a _Forwarding Equivalent Class +(FEC)_, and the MPLS forwarding table, called the _MPLS FIB_: + +``` +vpp0-1# show mpls ldp binding +AF Destination Nexthop Local Label Remote Label In Use +ipv4 192.168.10.0/32 192.168.10.0 20 exp-null yes +ipv4 192.168.10.1/32 0.0.0.0 exp-null - no +ipv4 192.168.10.2/32 192.168.10.2 16 exp-null yes +ipv4 192.168.10.3/32 192.168.10.2 33 33 yes +ipv4 192.168.10.4/31 192.168.10.0 21 exp-null yes +ipv4 192.168.10.6/31 192.168.10.0 exp-null exp-null no +ipv4 192.168.10.8/31 192.168.10.2 exp-null exp-null no +ipv4 192.168.10.10/31 192.168.10.2 17 exp-null yes +ipv6 2001:678:d78:200::/128 192.168.10.0 18 exp-null yes +ipv6 2001:678:d78:200::1/128 0.0.0.0 exp-null - no +ipv6 2001:678:d78:200::2/128 192.168.10.2 31 exp-null yes +ipv6 2001:678:d78:200::3/128 192.168.10.2 38 34 yes +ipv6 2001:678:d78:210::/60 0.0.0.0 48 - no +ipv6 2001:678:d78:210::/128 0.0.0.0 39 - no +ipv6 2001:678:d78:210::1/128 0.0.0.0 40 - no +ipv6 2001:678:d78:210::2/128 0.0.0.0 41 - no +ipv6 2001:678:d78:210::3/128 0.0.0.0 42 - no + +vpp0-1# show mpls table + Inbound Label Type Nexthop Outbound Label + ------------------------------------------------------------------ + 16 LDP 192.168.10.9 IPv4 Explicit Null + 17 LDP 192.168.10.9 IPv4 Explicit Null + 18 LDP fe80::5054:ff:fe00:1001 IPv6 Explicit Null + 19 LDP fe80::5054:ff:fe00:1001 IPv6 Explicit Null + 20 LDP 192.168.10.6 IPv4 Explicit Null + 21 LDP 192.168.10.6 IPv4 Explicit Null + 31 LDP fe80::5054:ff:fe02:1000 IPv6 Explicit Null + 32 LDP fe80::5054:ff:fe02:1000 IPv6 Explicit Null + 33 LDP 192.168.10.9 33 + 38 LDP fe80::5054:ff:fe02:1000 34 +``` + +In the first table, each entry of the IPv4 and IPv6 routing table, as fed by OSPF and OSPFv3, +will get a label associated with them. The negotiation of _LDP_ will ask our peer to set a +specific label, and it'll inform the peer on which label we are intending to use for the +_Label Switched Path_ towards that destination. I'll give two examples to illustrate how +this table is used: +1. This router (`vpp0-1`) has a peer `vpp0-0` and when this router wants to send traffic to + it, it'll be sent with `exp-null` (because it is the last router in the _LSP_), but when + other routers might want to use this router to reach `vpp0-0`, they should use the MPLS + label value 20. +1. This router (`vpp0-1`) is _not_ directly connected to `vpp0-3` and as such, its IPv4 and IPv6 + loopback addresses are going to contain labels in both directions: if `vpp0-1` itself + wants to send a packet to `vpp0-3`, it will use label value 33 and 38 respectively. + However, if other routers want to use this router to reach `vpp0-3`, they should use the + MPLS label value 33 and 34 respectively. + +The second table describes the MPLS _Forwarding Information Base (FIB)_. When receiving an +MPLS packet with an inbound label noted in this table, the operation applied is _SWAP_ to the +outbound label, and forward towards a nexthop -- this is the stuff that _P-Routers_ use when +transiting MPLS traffic. 
+ +### MPLS: Linux view + +FRR's LDP daemon will offer both of these routing tables to the Linux kernel using Netlink +messages, so the Linux view looks similar: + +``` +root@vpp0-1:~# ip ro +192.168.10.0 nhid 230 encap mpls 0 via 192.168.10.6 dev e0 proto ospf src 192.168.10.1 metric 20 +192.168.10.2 nhid 226 encap mpls 0 via 192.168.10.9 dev e1 proto ospf src 192.168.10.1 metric 20 +192.168.10.3 nhid 227 encap mpls 33 via 192.168.10.9 dev e1 proto ospf src 192.168.10.1 metric 20 +192.168.10.4/31 nhid 230 encap mpls 0 via 192.168.10.6 dev e0 proto ospf src 192.168.10.1 metric 20 +192.168.10.6/31 dev e0 proto kernel scope link src 192.168.10.7 +192.168.10.8/31 dev e1 proto kernel scope link src 192.168.10.8 +192.168.10.10/31 nhid 226 encap mpls 0 via 192.168.10.9 dev e1 proto ospf src 192.168.10.1 metric 20 + +root@vpp0-1:~# ip -6 ro +2001:678:d78:200:: nhid 231 encap mpls 2 via fe80::5054:ff:fe00:1001 dev e0 proto ospf src 2001:678:d78:200::1 metric 20 pref medium +2001:678:d78:200::1 dev loop0 proto kernel metric 256 pref medium +2001:678:d78:200::2 nhid 237 encap mpls 2 via fe80::5054:ff:fe02:1000 dev e1 proto ospf src 2001:678:d78:200::1 metric 20 pref medium +2001:678:d78:200::3 nhid 239 encap mpls 34 via fe80::5054:ff:fe02:1000 dev e1 proto ospf src 2001:678:d78:200::1 metric 20 pref medium +2001:678:d78:201::/112 nhid 231 encap mpls 2 via fe80::5054:ff:fe00:1001 dev e0 proto ospf src 2001:678:d78:200::1 metric 20 pref medium +2001:678:d78:201::1:0/112 dev e0 proto kernel metric 256 pref medium +2001:678:d78:201::2:0/112 dev e1 proto kernel metric 256 pref medium +2001:678:d78:201::3:0/112 nhid 237 encap mpls 2 via fe80::5054:ff:fe02:1000 dev e1 proto ospf src 2001:678:d78:200::1 metric 20 pref medium + +root@vpp0-1:~# ip -f mpls ro +16 as to 0 via inet 192.168.10.9 dev e1 proto ldp +17 as to 0 via inet 192.168.10.9 dev e1 proto ldp +18 as to 2 via inet6 fe80::5054:ff:fe00:1001 dev e0 proto ldp +19 as to 2 via inet6 fe80::5054:ff:fe00:1001 dev e0 proto ldp +20 as to 0 via inet 192.168.10.6 dev e0 proto ldp +21 as to 0 via inet 192.168.10.6 dev e0 proto ldp +31 as to 2 via inet6 fe80::5054:ff:fe02:1000 dev e1 proto ldp +32 as to 2 via inet6 fe80::5054:ff:fe02:1000 dev e1 proto ldp +33 as to 33 via inet 192.168.10.9 dev e1 proto ldp +38 as to 34 via inet6 fe80::5054:ff:fe02:1000 dev e1 proto ldp +``` + +The first two tabled show a 'regular' Linux routing table for IPv4 and IPv6 respectively, except there's +an `encap mpls ` added for all not-directly-connected prefixes. In this case, `vpp0-1` connects on +`e0` to `vpp0-0` to the West, and on interface `e1` to `vpp0-2` to the East. These connected routes do +not carry MPLS information and in fact, this is how LDP can continue to work and exchange information +naturally even when no _LSPs_ are established yet. + +The third table is the _MPLS FIB_, and it shows the special case of _MPLS Explicit NULL_ clearly. All IPv4 +routes for which this router is the penultimate hop carry the outbound label value 0,S=1, while the IPv6 +routes carry the value 2,S=1. Booyah! + +### MPLS: VPP view + +The _FIB_ information in general is super densely populated in VPP. 
Rather than dumping the whole table, +I'll show one example, for `192.168.10.3` which we can see above will be encapsulated into an MPLS +packet with label value 33,S=0 before being fowarded: + +``` +root@vpp0-1:~# vppctl show ip fib 192.168.10.3 +ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ] +192.168.10.3/32 fib:0 index:78 locks:2 + lcp-rt-dynamic refs:1 src-flags:added,contributing,active, + path-list:[29] locks:6 flags:shared, uPRF-list:53 len:1 itfs:[2, ] + path:[41] pl-index:29 ip4 weight=1 pref=20 attached-nexthop: oper-flags:resolved, + 192.168.10.9 GigabitEthernet10/0/1 + [@0]: ipv4 via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 5254000210005254000110010800 + Extensions: + path:41 labels:[[33 pipe ttl:0 exp:0]] + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:81 buckets:1 uRPF:53 to:[2421:363846]] + [0] [@13]: mpls-label[@4]:[33:64:0:eos] + [@1]: mpls via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:3 flags:[] 5254000210005254000110018847 +``` + +The trick is looking at the Extensions, which shows the _out-labels_ set to 33, with ttl=0 (which makes +VPP copy the TTL from the IPv4 packet itself), and exp=0. It can then forward the packet as MPLS onto +the nexthop at 192.168.10.9 (`vpp0-2.e0` on Gi10/0/1). + +The MPLS _FIB_ is also a bit chatty, and shows a fundamental difference with Linux: + +``` +root@vpp0-1:~# vppctl show mpls fib 33 +MPLS-VRF:0, fib_index:0 locks:[interface:6, CLI:1, lcp-rt:1, ] +33:neos/21 fib:0 index:37 locks:2 + lcp-rt-dynamic refs:1 src-flags:added,contributing,active, + path-list:[57] locks:12 flags:shared, uPRF-list:21 len:1 itfs:[2, ] + path:[81] pl-index:57 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, + 192.168.10.9 GigabitEthernet10/0/1 + [@0]: ipv4 via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 5254000210005254000110010800 + Extensions: + path:81 labels:[[33 pipe ttl:0 exp:0]] + forwarding: mpls-neos-chain + [@0]: dpo-load-balance: [proto:mpls index:40 buckets:1 uRPF:21 to:[0:0]] + [0] [@6]: mpls-label[@28]:[33:64:0:neos] + [@1]: mpls via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:3 flags:[] 5254000210005254000110018847 + +33:eos/21 fib:0 index:64 locks:2 + lcp-rt-dynamic refs:1 src-flags:added,contributing,active, + path-list:[57] locks:12 flags:shared, uPRF-list:21 len:1 itfs:[2, ] + path:[81] pl-index:57 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, + 192.168.10.9 GigabitEthernet10/0/1 + [@0]: ipv4 via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 5254000210005254000110010800 + Extensions: + path:81 labels:[[33 pipe ttl:0 exp:0]] + forwarding: mpls-eos-chain + [@0]: dpo-load-balance: [proto:mpls index:67 buckets:1 uRPF:21 to:[73347:10747680]] + [0] [@6]: mpls-label[@29]:[33:64:0:eos] + [@1]: mpls via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:3 flags:[] 5254000210005254000110018847 +``` + +I note that there are two entries here -- I wrote about them above. The MPLS implementation in VPP +allows for a different forwarding behavior in the case that the label inspected is the last one in the +stack (S=1), which is the usual case called _End of Stack (EOS)_. But, it also has a second entry +which tells it what to do if S=0 or _Not End of Stack (NEOS)_. Linux doesn't make the destinction, so +@vifino added two identical entries using that ***lcp_router_fib_route_path_dup()*** function. 
+ +But, what the entries themselves mean is that if this `vpp0-1` router were to receive an MPLS +packet with label value 33,S=1 (or value 33,S=0), it'll perform a _SWAP_ operation and put as +new outbound label (the same) value 33, and forward the packet as MPLS onto 192.168.10.9 on Gi10/0/1. + +## Results + +And with that, I think we achieved a running LDP with IPv4 and IPv6 and forwarding + encapsulation +of MPLS with VPP. One cool wrapup I thought I'd leave you with, is showing how these MPLS routers +are transparent with respect to IP traffic going through them. If I look at the diagram above, `lab` +reaches `vpp0-3` via three hops: first into `vpp0-0` where it is wrapped into MPLS and forwarded +to `vpp0-1`, and then through `vpp0-2`, which sets the _Explicit NULL_ label and forwards again +as MPLS onto `vpp0-3`, which does the IPv4 and IPv6 lookup. + +Check this out: + +``` +pim@lab:~$ for node in $(seq 0 3); do traceroute -4 -q1 vpp0-$node; done +traceroute to vpp0-0 (192.168.10.0), 30 hops max, 60 byte packets + 1 vpp0-0.lab.ipng.ch (192.168.10.0) 1.907 ms +traceroute to vpp0-1 (192.168.10.1), 30 hops max, 60 byte packets + 1 vpp0-1.lab.ipng.ch (192.168.10.1) 2.460 ms +traceroute to vpp0-1 (192.168.10.2), 30 hops max, 60 byte packets + 1 vpp0-2.lab.ipng.ch (192.168.10.2) 3.860 ms +traceroute to vpp0-1 (192.168.10.3), 30 hops max, 60 byte packets + 1 vpp0-3.lab.ipng.ch (192.168.10.3) 4.414 ms + +pim@lab:~$ for node in $(seq 0 3); do traceroute -6 -q1 vpp0-$node; done +traceroute to vpp0-0 (2001:678:d78:200::), 30 hops max, 80 byte packets + 1 vpp0-0.lab.ipng.ch (2001:678:d78:200::) 3.037 ms +traceroute to vpp0-1 (2001:678:d78:200::1), 30 hops max, 80 byte packets + 1 vpp0-1.lab.ipng.ch (2001:678:d78:200::1) 5.125 ms +traceroute to vpp0-1 (2001:678:d78:200::2), 30 hops max, 80 byte packets + 1 vpp0-2.lab.ipng.ch (2001:678:d78:200::2) 7.135 ms +traceroute to vpp0-1 (2001:678:d78:200::3), 30 hops max, 80 byte packets + 1 vpp0-3.lab.ipng.ch (2001:678:d78:200::3) 8.763 ms +``` + +With MPLS, each of these routers appears to the naked eye to be directly connected to the +`lab` headend machine, but we know better! :) + +## What's next + +I joined forces with [@vifino](https://chaos.social/@vifino) who has effectively added MPLS handling +to the Linux Control Plane, so VPP can start to function as an MPLS router using FRR's label +distribution protocol implementation. Gosh, I wish Bird3 would have LDP :) + +Our work is mostly complete, there's two pending Gerrit's which should be ready to review and +certainly ready to play with: + +1. [[Gerrit 38826](https://gerrit.fd.io/r/c/vpp/+/38826)]: This adds the ability to listen to internal + state changes of an interface, so that the Linux Control Plane plugin can enable MPLS on the + _LIP_ interfaces and Linux sysctl for MPLS input. +1. [[Gerrit 38702](https://gerrit.fd.io/r/c/vpp/+/38702)]: This adds the ability to listen to Netlink + messages in the Linux Control Plane plugin, and sensibly apply these routes to the IPv4, IPv6 + and MPLS FIB in the VPP dataplane. + +Finally, a note from your friendly neighborhood developers: this code is brand-new and has had _very +limited_ peer-review from the VPP developer community. It adds a significant feature to the Linux +Controlplane plugin, so make sure you both understand the semantics, the differences between Linux +and VPP, and the overall implementation before attempting to use in production. We're pretty sure +we got at least some of this right, but testing and runtime experience will tell. 
+
+I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
+[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
+Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!
+
diff --git a/content/articles/2023-05-28-vpp-mpls-4.md b/content/articles/2023-05-28-vpp-mpls-4.md
new file mode 100644
index 0000000..f54a022
--- /dev/null
+++ b/content/articles/2023-05-28-vpp-mpls-4.md
@@ -0,0 +1,387 @@
+---
+date: "2023-05-28T10:08:14Z"
+title: VPP MPLS - Part 4
+---
+
+{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
+
+# About this series
+
+**Special Thanks**: Adrian _vifino_ Pistol for writing this code and for the wonderful collaboration!
+
+Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
+performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
+_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
+are shared between the two.
+
+In the last three articles, I thought I had described "all we need to know" to perform MPLS using
+the Linux Controlplane in VPP:
+
+1. In the [[first article]({% post_url 2023-05-07-vpp-mpls-1 %})] of this series, I took a look at MPLS
+   in general.
+2. In the [[second article]({% post_url 2023-05-17-vpp-mpls-2 %})] of the series, I demonstrated a few
+   special case labels (such as _Explicit Null_ and _Implicit Null_, which enables the fabled
+   _Penultimate Hop Popping_ behavior of MPLS).
+3. Then, in the [[third article]({% post_url 2023-05-21-vpp-mpls-3 %})], I worked with
+   [@vifino](https://chaos.social/@vifino) to implement the plumbing for MPLS in the Linux Control Plane
+   plugin for VPP. He did most of the work, I just watched :)
+
+As if in a state of premonition, I mentioned:
+> Caveat emptor, outside of a modest functional and load-test, this MPLS functionality
+> hasn't seen a lot of mileage as it's only a few weeks old at this point, so it could definitely
+> contain some rough edges. Use at your own risk, but if you did want to discuss issues, the
+> [[vpp-dev@](mailto:vpp-dev@lists.fd.io)] mailinglist is a good first stop.
+
+
+## Introduction
+
+{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
+
+As a reminder, the LAB we built is running VPP with a feature added to the Linux Control Plane plugin,
+which lets it consume MPLS routes and program the IPv4/IPv6 routing table as well as the MPLS
+forwarding table in VPP. At this point, we are running [[Gerrit 38702, PatchSet
+10](https://gerrit.fd.io/r/c/vpp/+/38702/10)].
+
+First, let me specify the problem statement: @vifino and I both noticed that sometimes, pinging from
+one VPP node to another worked fine, while SSHing did not. This article describes an issue I
+diagnosed, and provided a fix for, in the Linux Controlplane plugin implementation.
+
+### Clue 1: Intermittent ping
+
+My first finding is that our LAB machines run _all_ the VPP plugins, notably the `ping` plugin,
+which means that VPP itself was responding to ping/ping6. The Linux controlplane plugin sometimes did
+not receive any traffic, while other times it did receive the traffic, say a TCP SYN for port 22,
+and dutifully responded to it, but that SYN/ACK was never seen back.
+
+If I disable the `ping` plugin, pinging between seemingly random pairs of `vpp0-[0123]` indeed
+no longer works, while pinging direct neighbors (e.g. 
`vpp0-0.e1` to `vpp0-1.e0`) consistently works +well. + +### Clue 2: Corrupted MPLS packets + +Using the `tap0-0` virtual machine, which sees a copy of all packets on the Open vSwitch underlay in +our lab, I started tcpdumping and noticed two curious packets from time to time: + +``` +09:22:55.349977 52:54:00:03:10:00 > 52:54:00:02:10:01, ethertype 802.1Q (0x8100), length 140: vlan 22, p 0, + ethertype MPLS unicast (0x8847), MPLS (label 2 (IPv6 explicit NULL), tc 0, [S], ttl 63) + version error: 4 != 6 + +09:23:00.357583 52:54:00:01:10:00 > 52:54:00:00:10:01, ethertype 802.1Q (0x8100), length 160: vlan 20, p 0, + ethertype MPLS unicast (0x8847), MPLS (label 0 (IPv4 explicit NULL), tc 0, [S], ttl 61) + IP6, wrong link-layer encapsulation (invalid) +``` + +Looking at the payload of these broken packets, they are DNS packets coming from the `vpp0-3` Linux +Control Plane there, and they are being sent to either the IPv4 address of `192.168.10.4` or the +IPv6 address of `2001:678:d78:201::ffff`. Interestingly, these are the lab's resolvers, so I think +`vpp0-3` is just trying to resolve something. + +### Clue 3: Vanishing MPLS packets + +As I mentioned, some source/destination pairs in the lab do not seem to pass traffic, while others +are fine. One such case of _packetlo_ is any traffic from `vpp0-3` to the IPv4 address of +`vpp0-1.e0`. The path from `vpp0-3` to that IPv4 address should go out on `vpp0-3.e0` and into +`vpp0-2.e1`, but using tcpdump shows absolutely no such traffic at between `vpp0-3` and `vpp0-2`, +while I'd expect to see it on VLAN 22! + +## Diagnosis + +Well, based on **Clue 3**, I take a look at what is happening on `vpp0-3`. I start by looking at the Linux +controlplane view, where the route to `lab` looks like this: + +``` +root@vpp0-3:~$ ip route get 192.168.10.4 +192.168.10.4/31 nhid 154 encap mpls 36 via 192.168.10.10 dev e0 proto ospf src 192.168.10.3 metric 20 + +root@vpp0-3:~$ tcpdump -evni e0 mpls 36 +15:07:50.864605 52:54:00:03:10:00 > 52:54:00:02:10:01, ethertype MPLS unicast (0x8847), length 136: + MPLS (label 36, tc 0, [S], ttl 64) + (tos 0x0, ttl 64, id 15752, offset 0, flags [DF], proto UDP (17), length 118) + 192.168.10.3.36954 > 192.168.10.4.53: 20950+ PTR? + 1.9.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.0.0.8.7.d.0.8.7.6.0.1.0.0.2.ip6.arpa. (90) +``` + +Yes indeed, Linux is sending an IPv4 DNS packet out on `e0`, so what am I seeing on the switch fabric? +In the LAB diagram above, I can look up that traffic from `vpp0-3` destined to `vpp0-2` should show up +on VLAN 22: + +``` +root@tap0-0:~$ tcpdump -evni enp16s0f0 -s 1500 -X vlan 22 and mpls +15:19:56.453521 52:54:00:03:10:00 > 52:54:00:02:10:01, ethertype 802.1Q (0x8100), length 140: vlan 22, p 0, + ethertype MPLS unicast (0x8847), MPLS (label 2 (IPv6 explicit NULL), tc 0, [S], ttl 63) + version error: 4 != 6 + 0x0000: 0000 213f 4500 0076 d17e 4000 4011 d3a0 ..!?E..v.~@.@... + 0x0010: c0a8 0a03 c0a8 0a04 e139 0035 0062 0dde .........9.5.b.. + 0x0020: 079e 0100 0001 0000 0000 0000 0131 0139 .............1.9 + 0x0030: 0130 0130 0130 0130 0130 0130 0130 0130 .0.0.0.0.0.0.0.0 + 0x0040: 0130 0130 0130 0130 0130 0130 0133 0130 .0.0.0.0.0.0.3.0 + 0x0050: 0130 0130 0138 0137 0164 0130 0138 0137 .0.0.8.7.d.0.8.7 + 0x0060: 0136 0130 0131 0130 0130 0132 0369 7036 .6.0.1.0.0.2.ip6 + 0x0070: 0461 7270 6100 000c 0001 .arpa..... +``` + +### MPLS Corruption + +Ouch, that hurts my eyes! Linux sent an IPv4 packet into the TAP device carrying label value 36, so +why is it being observed as an _IPv6 Explicit Null_ with label value 2? 
That can't be right. In an +attempt to learn more, I ask VPP to give me a packet trace. I happen to remember that on the way +from Linux to VPP, the `virtio-input` driver is used (while, on the way from the wire to VPP, I see +`dpdk-input` is used). + +The trace teaches me something really valuable: + +``` +vpp0-3# trace add virtio-input 100 +vpp0-3# show trace + +00:03:27:192490: virtio-input + virtio: hw_if_index 7 next-index 4 vring 0 len 136 + hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1 +00:03:27:192500: ethernet-input + MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01 +00:03:27:192504: mpls-input + MPLS: next mpls-lookup[1] label 36 ttl 64 exp 0 +00:03:27:192506: mpls-lookup + MPLS: next [6], lookup fib index 0, LB index 92 hash 0 label 36 eos 1 +00:03:27:192510: mpls-label-imposition-pipe + mpls-header:[ipv6-explicit-null:63:0:eos] +00:03:27:192512: mpls-output + adj-idx 21 : mpls via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254000210015254000310008847 flow hash: 0x00000000 +00:03:27:192515: GigabitEthernet10/0/0-output + GigabitEthernet10/0/0 flags 0x00180005 + MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01 + label 2 exp 0, s 1, ttl 63 +00:03:27:192517: GigabitEthernet10/0/0-tx + GigabitEthernet10/0/0 tx queue 0 + buffer 0x4c2ea1: current data 0, length 136, buffer-pool 0, ref-count 1, trace handle 0x7 + l2-hdr-offset 0 l3-hdr-offset 14 + PKT MBUF: port 65535, nb_segs 1, pkt_len 136 + buf_len 2176, data_len 136, ol_flags 0x0, data_off 128, phys_addr 0x730ba8c0 + packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 + rss 0x0 fdir.hi 0x0 fdir.lo 0x0 + MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01 + label 2 exp 0, s 1, ttl 63 +``` + +At this point, I think I've figured it out. I can see clearly that the MPLS packet is seen coming +from Linux, and it has label value 36. But, it is then offered to graph node `mpls-input`, which does +what it is designed to do, namely look up the label in the FIB: + +``` +vpp0-3# show mpls fib 36 +MPLS-VRF:0, fib_index:0 locks:[interface:4, CLI:1, lcp-rt:1, ] +36:neos/21 fib:0 index:88 locks:2 + lcp-rt-dynamic refs:1 src-flags:added,contributing,active, + path-list:[50] locks:24 flags:shared, uPRF-list:38 len:1 itfs:[1, ] + path:[66] pl-index:50 ip6 weight=1 pref=0 attached-nexthop: oper-flags:resolved, + fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0 + [@0]: ipv6 via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:4 flags:[] 52540002100152540003100086dd + Extensions: + path:66 labels:[[ipv6-explicit-null pipe ttl:0 exp:0]] + forwarding: mpls-neos-chain + [@0]: dpo-load-balance: [proto:mpls index:91 buckets:1 uRPF:38 to:[0:0]] + [0] [@6]: mpls-label[@34]:[ipv6-explicit-null:64:0:neos] + [@1]: mpls via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254000210015254000310008847 +``` + +{{< image width="80px" float="left" src="/assets/vpp-mpls/lightbulb.svg" alt="Lightbulb" >}} + +Haha, I love it when the brain-ligutbulb goes to the _on_ position. What's happening is that when we +turned on the MPLS feature on the VPP `tap` that is connected to `e0`, and VPP saw an MPLS packet, +that it looked up in the MPLS FIB what to do with label 36, learning that it must _SWAP_ it for +_IPv6 Explicit NULL_ (which is label value 2), and send it out on Gi10/0/0 to an IPv6 nexthop. Yeah, +**that'll break all right**. + +### MPLS Drops + +OK, that explains the garbled packets, but what about the ones that I never even saw on the wire +(**Clue 3**)? 
Well, now that I've enjoyed my lightbulb moment, I know exactly where to look. +Consider the following route in Linux, which is sending out encapsulated with MPLS label value 37; +and consider also what happens if `mpls-input` receives an MPLS frame with that value: + +``` +root@vpp0-3:~# ip ro get 192.168.10.6 +192.168.10.6 encap mpls 37 via 192.168.10.10 dev e0 src 192.168.10.3 uid 0 + +vpp0-3# show mpls fib 37 +MPLS-VRF:0, fib_index:0 locks:[interface:4, CLI:1, lcp-rt:1, ] +``` + +.. that's right, there ***IS*** no entry. As such, I would expect VPP to not know what to do with +such a mislabeled packet, and drop it. Unsurprisingly at this point, here's a nice proof: + +``` +00:10:31:107882: virtio-input + virtio: hw_if_index 7 next-index 4 vring 0 len 102 + hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1 +00:10:31:107891: ethernet-input + MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01 +00:10:31:107897: mpls-input + MPLS: next mpls-lookup[1] label 37 ttl 64 exp 0 +00:10:31:107898: mpls-lookup + MPLS: next [0], lookup fib index 0, LB index 22 hash 0 label 37 eos 1 +00:10:31:107901: mpls-drop + drop +00:10:31:107902: error-drop + rx:tap1 +00:10:31:107905: drop + mpls-input: MPLS DROP DPO +``` + +Conclusion: ***tadaa.wav***. When VPP receives the MPLS packet from Linux, it has already been +routed (encapsulated and put in an MPLS packet that's meant to be sent to the next router), so it +should be left alone. Instead, VPP is forcing the packet through the MPLS FIB, where if I'm lucky +(and I'm not, clearly ...) the right thing happens. But, sometimes, the MPLS FIB has instructions +that are different to what Linux had intended, bad things happen, and kittens get hurt. I can't +allow that to happen. I like kittens! + +## Fixing Linux CP + MPLS + +Now that I know what's actually going on, the fix comes quickly into focus. Of course, when Linux +sends an MPLS packet, VPP **must not** do a FIB lookup. Instead, it should emit the packet on the +correct interface as-is. It sounds a little bit like re-arranging the directed graph that VPP uses +internally. I've never done this before, but why not give it a go .. you know, for science :) + +VPP has a concept called _feature arcs_. These are codepoints where features can be inserted and +turned on/off. There's a feature arc for MPLS called `mpls-input`. I can create a graph node that +does anything I'd like to the packets at this point, and what I want to do is take the packet and +instead of offering it to the `mpls-input` node, just emit it on its egress interface using +`interface-output`. + +First, I call `VLIB_NODE_FN` which defines a new node in VPP, and I call it `lcp_xc_mpls()`. I +register this node with `VLIB_REGISTER_NODE` giving it the symbolic name `linux-cp-xc-mpls` which +extends the existing code in this plugin for ARP and IPv4/IPv6 forwarding. Once the packet enters my +new node, there are two possible places for it to go, defined by the `next_nodes` field: + +1. ***LCP_XC_MPLS_NEXT_DROP***: If I can't figure out where this packet is headed (there should be + an existing adjacency for it), I will send it to `error-drop` where it will be discarded. +1. ***LCP_XC_MPLS_NEXT_IO***: If I do know, however, I ask VPP to send this packet simply to + `interface-output`, where it will be marshalled on the wire, unmodified. 
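+
+To make that a little more concrete, the registration boilerplate for such a graph node looks roughly
+like the sketch below. This is an illustration pieced together from the description above, not the
+code from the Gerrit; the trace formatter name and the include set are assumptions:
+
+```
+#include <vlib/vlib.h>
+#include <vnet/vnet.h>
+#include <vnet/feature/feature.h>
+
+/* Assumed: the existing linux-cp trace formatter for lcp_xc_trace_t */
+u8 *format_lcp_xc_trace (u8 *s, va_list *args);
+
+/* Possible dispositions for an MPLS packet received from a LIP tap */
+typedef enum
+{
+  LCP_XC_MPLS_NEXT_DROP,
+  LCP_XC_MPLS_NEXT_IO,
+  LCP_XC_MPLS_N_NEXT,
+} lcp_xc_mpls_next_t;
+
+VLIB_NODE_FN (lcp_xc_mpls_node)
+(vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
+{
+  /* For each buffer: look up the LIP for the RX tap, set the TX sw_if_index to
+   * the paired phy, resolve the adjacency from the ethernet header, and enqueue
+   * to LCP_XC_MPLS_NEXT_IO (or LCP_XC_MPLS_NEXT_DROP if no adjacency is found).
+   * Body omitted in this sketch. */
+  return frame->n_vectors;
+}
+
+VLIB_REGISTER_NODE (lcp_xc_mpls_node) = {
+  .name = "linux-cp-xc-mpls",
+  .vector_size = sizeof (u32),
+  .format_trace = format_lcp_xc_trace,
+  .type = VLIB_NODE_TYPE_INTERNAL,
+  .n_next_nodes = LCP_XC_MPLS_N_NEXT,
+  .next_nodes = {
+    [LCP_XC_MPLS_NEXT_DROP] = "error-drop",
+    [LCP_XC_MPLS_NEXT_IO] = "interface-output",
+  },
+};
+
+/* Attach the node as a feature on the mpls-input arc; it is then toggled per
+ * TAP interface in lcp_itf_pair_add() / lcp_itf_pair_del(). */
+VNET_FEATURE_INIT (lcp_xc_mpls_feat, static) = {
+  .arc_name = "mpls-input",
+  .node_name = "linux-cp-xc-mpls",
+};
+```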
+ +Taking this short cut for MPLS packets avoids them being looked up in the FIB, and in hindsight this +is no different to how IPv4 and IPv6 packets are also short circuited: for those, `ip4-lookup` and +`ip6-lookup` are also not called, but instead `lcp_xc_inline()` does the business. + +I can inform VPP that my new node should be attached as a feature on the `mpls-input` arc, +by calling `VNET_FEATURE_INIT` with it. + +Implementing the VPP node is a bit of fiddling - but I take inspiration from the existing function +`lc_xc_inline()` which does this for IPv4 and IPv6. Really all I must do, is two things: + +1. Using the _Linux Interface Pair (LIP)_ entry, figure out which physical interface corresponds + to the TAP interface I just received the packet on, and then set the TX interface to that. +1. Retrieve the ethernet adjacency based on the destination MAC address, use it to set the correct + L2 nexthop. If I don't know what adjacency to use, set `LCP_XC_MPLS_NEXT_DROP` as the next node, + otherwise set `LCP_XC_MPLS_NEXT_IO`. + +The finishing touch on the graph node is to make sure that it's trace-aware. I use packet tracing _a +lot_, as can be seen as well in this article, so I'll detect if tracing for a given packet is turned +on, and if so, tack on a `lcp_xc_trace_t` object, so traces will reveal my new node in use. + +Once the node is ready, I have one final step. When constructing the _Linux Interface Pair_ in +`lcp_itf_pair_add()`, I will enable the newly created feature called `linux-cp-xc-mpls` on the +`mpls-input` feature arc for the TAP interface, by calling `vnet_feature_enable_disable()`. +Conversely, I'll disable the feature when removing the _LIP_ in `lcp_itf_pair_del()`. + +## Results + +After rebasing @vifino's change, I add my code in [[Gerrit 38702, PatchSet +11-14](https://gerrit.fd.io/r/c/vpp/+/38702/11..14)]. I think the simplest thing to show the effect +of the change is by taking a look at these MPLS packets that come in from Linux Controlplane, and +how they now get moved into `linux-cp-xc-mpls` instead of `mpls-input` before: + +``` +00:04:12:846748: virtio-input + virtio: hw_if_index 7 next-index 4 vring 0 len 102 + hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1 +00:04:12:846804: ethernet-input + MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01 +00:04:12:846811: mpls-input + MPLS: next BUG![3] label 37 ttl 64 exp 0 +00:04:12:846812: linux-cp-xc-mpls + lcp-xc: itf:1 adj:21 +00:04:12:846844: GigabitEthernet10/0/0-output + GigabitEthernet10/0/0 flags 0x00180005 + MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01 + label 37 exp 0, s 1, ttl 64 +00:04:12:846846: GigabitEthernet10/0/0-tx + GigabitEthernet10/0/0 tx queue 0 + buffer 0x4be948: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0 + l2-hdr-offset 0 l3-hdr-offset 14 + PKT MBUF: port 65535, nb_segs 1, pkt_len 102 + buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1f9a5280 + packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 + rss 0x0 fdir.hi 0x0 fdir.lo 0x0 + MPLS: 52:54:00:03:10:00 -> 52:54:00:02:10:01 + label 37 exp 0, s 1, ttl 64 +``` + +The same is true for the original DNS packet with MPLS label 36 -- it just transmits out on +`Gi10/0/0` with the same label, which is dope! 
Indeed, no more garbled MPLS packets are seen, and
+the following simple acceptance test shows that all machines can reach all other machines on the LAB
+cluster with both IPv4 and IPv6:
+
+```
+ipng@vpp0-3:~$ fping -g 192.168.10.0 192.168.10.3
+192.168.10.0 is alive
+192.168.10.1 is alive
+192.168.10.2 is alive
+192.168.10.3 is alive
+
+ipng@vpp0-3:~$ fping6 2001:678:d78:200:: 2001:678:d78:200::1 2001:678:d78:200::2 2001:678:d78:200::3
+2001:678:d78:200:: is alive
+2001:678:d78:200::1 is alive
+2001:678:d78:200::2 is alive
+2001:678:d78:200::3 is alive
+```
+
+My ping test here from `vpp0-3` tries to ping (via the Linux controlplane) each of the other
+routers, including itself. It first does this with IPv4, and then with IPv6, showing that all
+_eight_ possible destinations are alive. Progress, sweet sweet progress.
+
+I then expand that with this nice oneliner:
+
+```
+pim@lab:~$ for af in 4 6; do \
+    for node in $(seq 0 3); do \
+      ssh -$af ipng@vpp0-$node "fping -g 192.168.10.0 192.168.10.3; \
+        fping6 2001:678:d78:200:: 2001:678:d78:200::1 2001:678:d78:200::2 2001:678:d78:200::3"; \
+    done \
+  done | grep -c alive
+
+64
+```
+
+Explanation: for both IPv4 and IPv6, I log in to all four nodes (so in total I invoke SSH eight
+times), and then perform both `fping` operations, receiving _eight_ responses each time,
+_sixty-four_ in total. This checks out. I am very pleased with my work.
+
+## What's next
+
+I joined forces with [@vifino](https://chaos.social/@vifino) who has effectively added MPLS handling
+to the Linux Control Plane, so VPP can start to function as an MPLS router using FRR's label
+distribution protocol implementation. Gosh, I wish Bird3 would have LDP :)
+
+Our work is mostly complete; there are two pending Gerrits which should be ready to review and
+certainly ready to play with:
+
+1. [[Gerrit 38826](https://gerrit.fd.io/r/c/vpp/+/38826)]: This adds the ability to listen to internal
+   state changes of an interface, so that the Linux Control Plane plugin can enable MPLS on the
+   _LIP_ interfaces and Linux sysctl for MPLS input.
+1. [[Gerrit 38702/10](https://gerrit.fd.io/r/c/vpp/+/38702/10)]: This adds the ability to listen to Netlink
+   messages in the Linux Control Plane plugin, and sensibly apply these routes to the IPv4, IPv6
+   and MPLS FIB in the VPP dataplane.
+1. [[Gerrit 38702/14](https://gerrit.fd.io/r/c/vpp/+/38702/11..14)]: This Gerrit now also adds the
+   ability to directly output MPLS packets from Linux out on the correct interface, without
+   pulling them through the MPLS FIB.
+
+Finally, a note from your friendly neighborhood developers: this code is brand-new and has had _very
+limited_ peer-review from the VPP developer community. It adds a significant feature to the Linux
+Controlplane plugin, so make sure you understand the semantics, the differences between Linux
+and VPP, and the overall implementation before attempting to use it in production. We're pretty sure
+we got at least some of this right, but testing and runtime experience will tell.
+
+I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
+[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
+Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!
diff --git a/content/articles/2023-08-06-pixelfed-1.md b/content/articles/2023-08-06-pixelfed-1.md new file mode 100644 index 0000000..733bfca --- /dev/null +++ b/content/articles/2023-08-06-pixelfed-1.md @@ -0,0 +1,553 @@ +--- +date: "2023-08-06T05:35:14Z" +title: Pixelfed - Part 1 - Installing +--- + +# About this series + +{{< image width="200px" float="right" src="/assets/pixelfed/pixelfed-logo.svg" alt="Pixelfed" >}} + +I have seen companies achieve great successes in the space of consumer internet and entertainment industry. I've been feeling less +enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using "free" services +is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but +for me it's time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to +privately operated ones. + +After having written a fair bit about my Mastodon [[install]({% post_url 2022-11-20-mastodon-1 %})] and +[[monitoring]({% post_url 2022-11-27-mastodon-3 %})], I've been using it every day. This morning, my buddy Ramón asked if he could +make a second account on **ublog.tech** for his _Campervan Adventures_, and notably to post pics of where he and his family went. + +But if pics is your jam, why not ... [[Pixelfed](https://pixelfed.org/)]! + +## Introduction + +Similar to how blogging is the act of publishing updates to a website, microblogging is the act of publishing small updates to a stream of +updates on your profile. Very similar to the relationship between _Facebook_ and _Instagram_, _Mastodon_ and _Pixelfed_ give the ability to +post and share, cross-link, discuss, comment and like, across the entire _Fediverse_. Except, Pixelfed doesn't do this in a centralized +way, and I get to be a steward of my own data. + +As is common in the _Fediverse_, groups of people congregate on a given server, of which they become a user by creating an account on that +server. Then, they interact with one another on that server, but users can also interact with folks on _other_ servers. Instead of following +**@IPngNetworks**, they might follow a user on a given server domain, like **@IPngNetworks@pix.ublog.tech**. This way, all these servers can +be run _independently_ but interact with each other using a common protocol (called ActivityPub). I've heard this concept be compared to +choosing an e-mail provider: I might choose Google's gmail.com, and you might use Microsoft's live.com. However we can send e-mails back and +forth due to this common protocol (called SMTP). + +### pix.uBlog.tech + +I thought I would give it a go, mostly out of engineering curiosity but also because I more strongly feel today that we (the users) ought to +take a bit more ownership back. I've been a regular blogging and micro-blogging user since approximately for ever, and I think it may be a +good investment of my time to learn a bit more about the architecture of Pixelfed. So, I've decided to build and _productionize_ a server +instance. + +Previously, I registered [uBlog.tech](https://ublog.tech) and have been running that for about a year as a _Mastodon_ instance. +Incidentally, if you're reading this and would like to participate, the server welcomes users in the network-, systems- and software +engineering disciplines. 
But, before I can get to the fun parts though, I have to do a bunch of work to get this server in a shape in which +it can be trusted with user generated content. + +### The IPng environment + +#### Pixelfed: Virtual Machine + +I provision a VM with 8vCPUs (dedicated on the underlying hypervisor), including 16GB of memory and one virtio network card. For disks, I +will use two block devices, one small one of 16GB (vda) that is created on the hypervisor's `ssd-vol1/libvirt/pixelfed-disk0`, to be used only +for boot, logs and OS. Then, a second one (vdb) is created at 2TB on `vol0/pixelfed-disk1` and it will be used for Pixelfed itself. + +I simply install Debian into **vda** using `virt-install`. At IPng Networks we have some ansible-style automation that takes over the +machine, and further installs all sorts of Debian packages that we use (like a Prometheus node exporter, more on that later), and sets up a +firewall that allows SSH access for our trusted networks, and otherwise only allows port 80 because this is to be a (backend) webserver +behind the NGINX cluster. + +After installing Debian Bullseye, I'll create the following ZFS filesystems on **vdb**: + +``` +pim@pixelfed:~$ sudo zpool create data /dev/vdb +pim@pixelfed:~$ sudo zfs create -o data/pixelfed -V10G +pim@pixelfed:~$ sudo zfs create -o mountpoint=/data/pixelfed/pixelfed/storage data/pixelfed-storage +pim@pixelfed:~$ sudo zfs create -o mountpoint=/var/lib/mysql data/mysql -V20G +pim@pixelfed:~$ sudo zfs create -o mountpoint=/var/lib/redis data/redis -V2G +``` + +As a sidenote, I realize that this ZFS filesystem pool consists only of **vdb**, but its underlying blockdevice is protected in a raidz, and +it is copied incrementally daily off-site by the hypervisor. I'm pretty confident on safety here, but I prefer to use ZFS for the virtual +machine guests as well, because now I can do local snapshotting, of say `data/pixelfed`, and I can more easily grow/shrink the +datasets for the supporting services, as well as isolate them individually against sibling wildgrowth. + +The VM gets one virtual NIC, which will connect to the [[IPng Site Local]({% post_url 2023-03-17-ipng-frontends %})] network using +jumboframes. This way, the machine itself is disconnected from the internet, saving a few IPv4 addresses and allowing for the IPng NGINX +frontends to expose it. I give it the name `pixelfed.net.ipng.ch` with addresses 198.19.4.141 and 2001:678:d78:507::d, which will be +firewalled and NATed via the IPng SL gateways. + +#### IPng Frontend: Wildcard SSL + +I run most websites behind a cluster of NGINX webservers, which are carrying an SSL certificate which support wildcards. 
The system is
+using [[DNS-01]({% post_url 2023-03-24-lego-dns01 %})] challenges, so the first order of business is to expand the certificate from serving
+only [[ublog.tech](https://ublog.tech)] (which is in use by the companion Mastodon instance), to also include _*.ublog.tech_ so that I can
+add the new Pixelfed instance as [[pix.ublog.tech](https://pix.ublog.tech)]:
+
+```
+lego@lego:~$ certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \
+    --work-dir /home/lego/workdir --manual \
+    --manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \
+    --preferred-challenges dns --debug-challenges \
+    -d ipng.ch -d *.ipng.ch -d *.net.ipng.ch \
+    -d ipng.nl -d *.ipng.nl \
+    -d ipng.eu -d *.ipng.eu \
+    -d ipng.li -d *.ipng.li \
+    -d ublog.tech -d *.ublog.tech \
+    -d as8298.net \
+    -d as50869.net
+
+CERT=ipng.ch
+CERTFILE=/home/lego/acme-dns/live/ipng.ch/fullchain.pem
+KEYFILE=/home/lego/acme-dns/live/ipng.ch/privkey.pem
+MACHS="nginx0.chrma0.ipng.ch nginx0.chplo0.ipng.ch nginx0.nlams1.ipng.ch nginx0.nlams2.ipng.ch"
+
+for MACH in $MACHS; do
+  fping -q $MACH 2>/dev/null || {
+    echo "$MACH: Skipping (unreachable)"
+    continue
+  }
+  echo $MACH: Copying $CERT
+  scp -q $CERTFILE $MACH:/etc/nginx/certs/$CERT.crt
+  scp -q $KEYFILE $MACH:/etc/nginx/certs/$CERT.key
+  echo $MACH: Reloading nginx
+  ssh $MACH 'sudo systemctl reload nginx'
+done
+```
+
+The first command here requests a certificate with `certbot`; note the addition of the flag `-d *.ublog.tech`. It'll correctly say that
+there are 11 existing domains in this certificate, and ask me if I'd like to request a new cert with the 12th one added. I answer yes, and
+a few seconds later, `acme-dns` has answered all of Let's Encrypt's challenges, and a certificate is issued.
+
+The second command then distributes that certificate to the four NGINX frontends, and reloads the cert. Now, I can use the hostname
+`pix.ublog.tech`, as far as the SSL certs are concerned. Of course, the regular certbot cronjob renews the cert regularly, so I tucked away
+the second part here into a script called `bin/certbot-distribute`, using the `RENEWED_LINEAGE` variable that certbot(1) sets when using the
+flag `--deploy-hook`:
+
+```
+lego@lego:~$ cat /etc/cron.d/certbot
+SHELL=/bin/sh
+PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
+
+0 */12 * * * lego perl -e 'sleep int(rand(43200))' && \
+    certbot -q renew --config-dir /home/lego/acme-dns \
+    --logs-dir /home/lego/logs --work-dir /home/lego/workdir \
+    --deploy-hook "/home/lego/bin/certbot-distribute"
+```
+
+#### IPng Frontend: NGINX
+
+The previous `certbot-distribute` shell script has copied the certificate to four separate NGINX instances, two in Amsterdam hosted at
+AS8283 (Coloclue), one in Zurich hosted at AS25091 (IP-Max), and one in Geneva hosted at AS8298 (IPng Networks). Each of these NGINX servers
+has a frontend IPv4 and IPv6 address, and a backend jumboframe enabled interface in IPng Site Local (198.19.0.0/16). 
Because updating the
+configuration on four production machines is cumbersome, I previously created an Ansible playbook, which I now add this new site to:
+
+```
+pim@squanchy:~/src/ipng-ansible$ cat roles/nginx/files/sites-available/pix.ublog.tech.conf
+server {
+  listen [::]:80;
+  listen 0.0.0.0:80;
+
+  server_name pix.ublog.tech;
+  access_log /var/log/nginx/pix.ublog.tech-access.log;
+  include /etc/nginx/conf.d/ipng-headers.inc;
+
+  include "conf.d/lego.inc";
+
+  location / {
+    return 301 https://$host$request_uri;
+  }
+}
+
+server {
+  listen [::]:443 ssl http2;
+  listen 0.0.0.0:443 ssl http2;
+  ssl_certificate /etc/nginx/certs/ipng.ch.crt;
+  ssl_certificate_key /etc/nginx/certs/ipng.ch.key;
+  include /etc/nginx/conf.d/options-ssl-nginx.inc;
+  ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
+
+  server_name pix.ublog.tech;
+  access_log /var/log/nginx/pix.ublog.tech-access.log upstream;
+  include /etc/nginx/conf.d/ipng-headers.inc;
+
+  keepalive_timeout 70;
+  sendfile on;
+  client_max_body_size 80m;
+
+  location / {
+    proxy_pass http://pixelfed.net.ipng.ch:80;
+    proxy_set_header Host $host;
+    proxy_set_header X-Forwarded-Proto $scheme;
+    proxy_set_header X-Real-IP $remote_addr;
+    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+  }
+}
+```
+
+The configuration is very straightforward. The first server block bounces all traffic destined to port 80 towards its port 443 equivalent.
+The second server block (listening on port 443) carries the certificate I just renewed, which serves `*.ublog.tech` and allows the cluster to
+offload SSL and forward the traffic over the internal private network to the VM I created earlier.
+
+One quick Ansible playbook run later, and the reverse proxies are ready to rock and roll:
+
+{{< image src="/assets/pixelfed/ansible.png" alt="Ansible" >}}
+
+Of course, this website will just time out for the time being, because there's nothing listening (yet) on `pixelfed.net.ipng.ch:80`.
+
+#### Installing Pixelfed
+
+So off I go, installing Pixelfed on the new Debian VM. First, I'll install the set of Debian packages this instance will need, including
+PHP 8.1 (which is the minimum supported, according to the Pixelfed docs):
+
+```
+pim@pixelfed:~$ sudo apt install apt-transport-https lsb-release ca-certificates git wget curl \
+  build-essential apache2 mariadb-server pngquant optipng jpegoptim gifsicle ffmpeg redis
+pim@pixelfed:~$ sudo wget -O /etc/apt/trusted.gpg.d/php.gpg https://packages.sury.org/php/apt.gpg
+pim@pixelfed:~$ echo "deb https://packages.sury.org/php/ $(lsb_release -sc) main" \
+  | sudo tee -a /etc/apt/sources.list.d/php.list
+pim@pixelfed:~$ sudo apt update
+pim@pixelfed:~$ sudo apt-get install php8.1-fpm php8.1 php8.1-common php8.1-cli php8.1-gd \
+  php8.1-mbstring php8.1-xml php8.1-bcmath php8.1-pgsql php8.1-curl php8.1-xml php8.1-xmlrpc \
+  php8.1-imagick php8.1-gd php8.1-mysql php8.1-cli php8.1-intl php8.1-zip php8.1-redis
+```
+
+After all those bits and bytes settle on the filesystem, I simply follow the regular [[install
+guide](https://docs.pixelfed.org/running-pixelfed/installation/)] from the upstream documentation.
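+
+Before moving on, a quick sanity check (illustrative) that PHP 8.1 and the extensions Pixelfed cares about actually landed:
+
+```
+pim@pixelfed:~$ php -v
+pim@pixelfed:~$ php -m | grep -Ei 'imagick|intl|redis|bcmath|gd'
+```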
+
+I update the PHP config to allow larger uploads:
+
+```
+pim@pixelfed:~$ sudo vim /etc/php/8.1/fpm/php.ini
+upload_max_filesize = 100M
+post_max_size = 100M
+```
+
+I create a FastCGI pool for Pixelfed:
+```
+pim@pixelfed:~$ cat << EOF | sudo tee /etc/php/8.1/fpm/pool.d/pixelfed.conf
+[pixelfed]
+user = pixelfed
+group = pixelfed
+listen.owner = www-data
+listen.group = www-data
+listen.mode = 0660
+listen = /var/run/php.pixelfed.sock
+pm = dynamic
+pm.max_children = 20
+pm.start_servers = 5
+pm.min_spare_servers = 5
+pm.max_spare_servers = 20
+chdir = /data/pixelfed
+php_flag[display_errors] = on
+php_admin_value[error_log] = /data/pixelfed/php.error.log
+php_admin_flag[log_errors] = on
+php_admin_value[open_basedir] = /data/pixelfed:/usr/share/:/tmp:/var/lib/php
+EOF
+```
+
+I reference this pool in a simple non-SSL Apache config, after enabling the modules that Pixelfed needs:
+
+```
+pim@pixelfed:~$ cat << 'EOF' | sudo tee /etc/apache2/sites-available/pixelfed.conf
+<VirtualHost *:80>
+  ServerName pix.ublog.tech
+  ServerAdmin pixelfed@ublog.tech
+
+  DocumentRoot /data/pixelfed/pixelfed/public
+  LogLevel debug
+  <Directory /data/pixelfed/pixelfed/public>
+    Options Indexes FollowSymLinks
+    AllowOverride All
+    Require all granted
+  </Directory>
+
+  ErrorLog ${APACHE_LOG_DIR}/pixelfed.error.log
+  CustomLog ${APACHE_LOG_DIR}/pixelfed.access.log combined
+  <FilesMatch \.php$>
+    SetHandler "proxy:unix:/var/run/php.pixelfed.sock|fcgi://localhost"
+  </FilesMatch>
+</VirtualHost>
+EOF
+```
+
+I create a user and database, install the `composer` tool, and finally download the Pixelfed source code:
+```
+pim@pixelfed:~$ sudo useradd pixelfed -m -d /data/pixelfed -s /bin/bash -r -c "Pixelfed User"
+pim@pixelfed:~$ sudo mysql
+CREATE DATABASE pixelfed;
+GRANT ALL ON pixelfed.* TO pixelfed@localhost IDENTIFIED BY '';
+exit
+
+pim@pixelfed:~$ wget -O composer-setup.php https://getcomposer.org/installer
+pim@pixelfed:~$ sudo php composer-setup.php
+pim@pixelfed:~$ sudo cp composer.phar /usr/local/bin/composer
+pim@pixelfed:~$ rm composer-setup.php
+
+pim@pixelfed:~$ sudo su pixelfed
+pixelfed@pixelfed:~$ git clone -b dev https://github.com/pixelfed/pixelfed.git pixelfed
+pixelfed@pixelfed:~$ cd pixelfed
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ composer install --no-ansi --no-interaction --optimize-autoloader
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ composer update
+```
+
+With the basic installation of packages and dependencies all squared away, I'm ready to configure the instance:
+
+```
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ vim .env
+APP_NAME="uBlog Pixelfed"
+APP_URL="https://pix.ublog.tech"
+APP_DOMAIN="pix.ublog.tech"
+ADMIN_DOMAIN="pix.ublog.tech"
+SESSION_DOMAIN="pix.ublog.tech"
+TRUST_PROXIES="*"
+
+# Database Configuration
+DB_CONNECTION="mysql"
+DB_HOST="127.0.0.1"
+DB_PORT="3306"
+DB_DATABASE="pixelfed"
+DB_USERNAME="pixelfed"
+DB_PASSWORD=""
+MAIL_DRIVER=smtp
+MAIL_HOST=localhost
+MAIL_PORT=25
+MAIL_FROM_ADDRESS="pixelfed@ublog.tech"
+MAIL_FROM_NAME="uBlog Pixelfed"
+
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan key:generate
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan storage:link
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan migrate --force
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan import:cities
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan instance:actor
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan passport:keys
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan route:cache
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan view:cache
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan config:cache
+```
+
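+One small step that is easy to forget after dropping in a new FastCGI pool definition: php-fpm has to pick it up.
+A restart of the FPM service does that (the service name assumes the sury `php8.1-fpm` package installed earlier):
+
+```
+pim@pixelfed:~$ sudo systemctl restart php8.1-fpm
+pim@pixelfed:~$ ls -l /var/run/php.pixelfed.sock   # owned www-data:www-data, mode 0660 per the pool config
+```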
+Pixelfed is based on [[Laravel](https://laravel.com/)], a PHP framework for _Web Artisans_ (which I guess, now that I run
+LibreNMS, IXPManager and Pixelfed, makes me one too?). Laravel commonly uses two types of runners. One is task queuing via a
+module called _Laravel Horizon_, which uses Redis to store work items to be consumed by task workers:
+
+```
+pim@pixelfed:~$ cat << EOF | sudo tee /lib/systemd/system/pixelfed.service
+[Unit]
+Description=Pixelfed task queueing via Laravel Horizon
+After=network.target
+Requires=mariadb
+Requires=php-fpm
+Requires=redis
+Requires=apache
+
+[Service]
+Type=simple
+ExecStart=/usr/bin/php /data/pixelfed/pixelfed/artisan horizon
+User=pixelfed
+Restart=on-failure
+
+[Install]
+WantedBy=multi-user.target
+EOF
+pim@pixelfed:~$ sudo systemctl enable --now pixelfed
+```
+
+The other type of runner is periodic tasks, typically configured in a crontab, like so:
+```
+pim@pixelfed:~$ cat << EOF | sudo tee /etc/cron.d/pixelfed
+* * * * * pixelfed /usr/bin/php /data/pixelfed/pixelfed/artisan schedule:run >> /dev/null 2>&1
+EOF
+```
+
+I run the _schedule:run_ task once by hand and it exits cleanly, so I think this is good to go, even though I'm not a huge fan of
+redirecting output to `/dev/null` like that.
+
+I will create one `admin` user on the commandline first:
+```
+pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan user:create
+```
+
+And now that everything is ready, I can put the icing on the cake by enabling and starting the Apache2 webserver:
+```
+pim@pixelfed:~$ sudo a2enmod rewrite proxy proxy_fcgi
+pim@pixelfed:~$ sudo a2ensite pixelfed
+pim@pixelfed:~$ sudo systemctl restart apache2
+```
+
+{{< image src="/assets/pixelfed/fipo.png" alt="First Post" >}}
+
+### Finishing Touches
+
+#### File permissions
+
+After signing up, logging in and uploading my first post (which is of a BLT sandwich and a bowl of noodles, of course),
+I noticed that the permissions are overly strict, and the pictures I just uploaded are not visible. It turns out that the PHP FastCGI is
+running as user `pixelfed` while the webserver is running as user `www-data`, and the former is writing files with permissions `rw-------`
+and directories with `rwx------`, which doesn't seem quite right to me, so I make a small edit in `config/filesystems.php`, changing the
+`0600` to `0644` and the `0700` to `0755`, after which my post is visible.
+
+#### uBlog's logo
+
+{{< image width="300px" float="right" src="/assets/pixelfed/ublog.png" alt="uBlog" >}}
+
+Although I do like the Pixelfed logo, I wanted to keep a **ublog.tech** branding, so I replaced the `public/storage/headers/default.jpg`
+with my own mountains-picture in roughly the same size. By the way, I took that picture in Grindelwald, Switzerland during a
+[[serene moment]({% post_url 2021-07-26-bucketlist %})] in which I discovered why tinkering with things like this is so important to my
+mental health.
+
+#### Backups
+
+Of course, since Ramón is a good friend, I would not want to lose his pictures. Data integrity and durability are important to me.
+It's the one thing that the commercial vendors typically do really well, and my pride prohibits me from losing data due to things like "disk
+failure" or "computer broken" or "datacenter on fire".
+
+To honor this promise, I handle backups in three main ways: zrepl(1), borg(1) and mysqldump(1).
+
+* **VM Block Devices** are running on the hypervisor's ZFS on either the SSD pool, or the disk pool, or both.
Using a tool called **zrepl(1)** + (which I described a little bit in a [[previous post]({% post_url 2022-10-14-lab-1 %})]), I create a snapshot every 12hrs on the local + blockdevice, and incrementally copy away those snapshots daily to the remote fileservers. + +``` +pim@hvn0.ddln0:~$ sudo cat /etc/zrepl/zrepl.yaml +jobs: +- name: snap-libvirt + type: snap + filesystems: { + "ssd-vol0/libvirt<": true, + "ssd-vol1/libvirt<": true + } + snapshotting: + type: periodic + prefix: zrepl_ + interval: 12h + pruning: + keep: + - type: grid + grid: 4x12h(keep=all) | 7x1d + regex: "^zrepl_.*" + +- type: push + name: "push-st0-chplo0" + filesystems: { + "ssd-vol0/libvirt<": true, + "ssd-vol1/libvirt<": true + } + connect: + type: ssh+stdinserver + host: st0.chplo0.net.ipng.ch + user: root + port: 22 + identity_file: /etc/zrepl/ssh/identity + snapshotting: + type: manual + send: + encrypted: false + pruning: + keep_sender: + - type: not_replicated + - type: last_n + count: 10 + regex: ^zrepl_.*$ # optional + keep_receiver: + - type: grid + grid: 8x12h(keep=all) | 7x1d | 6x7d + regex: "^zrepl_.*" +``` + +* **Filesystem Backups** make a daily copy of their entire VM filesystem using **borgbackup(1)** to a set of two remote fileservers. This way, + the important file metadata, configs for the virtual machines, and so on, are all safely stored remotely. + +``` +pim@pixelfed:~$ sudo mkdir -p /etc/borgmatic/ssh +pim@pixelfed:~$ sudo ssh-keygen -t ecdsa -f /etc/borgmatic/ssh/identity -C root@pixelfed.net.ipng.ch +pim@pixelfed:~$ cat << EOF | sudo tee /etc/borgmatic/config.yaml +location: + source_directories: + - / + repositories: + - u022eaebe661@st0.chbtl0.ipng.ch:borg/{fqdn} + - u022eaebe661@st0.chplo0.ipng.ch:borg/{fqdn} + exclude_patterns: + - /proc + - /sys + - /dev + - /run + - /swap.img + exclude_if_present: + - .nobackup + - .borgskip +storage: + encryption_passphrase: + ssh_command: "ssh -i /etc/borgmatic/identity -6" + compression: lz4 + umask: 0077 + lock_wait: 5 +retention: + keep_daily: 7 + keep_weekly: 4 + keep_monthly: 6 +consistency: + checks: + - repository + - archives + check_last: 3 +output: + color: false +``` + +* **MySQL** has a running binary log to recover from failures/restarts, but I also run a daily mysqldump(1) operation that dumps the + database to the local filesystem, allowing for quick and painless recovery. As the dump is a regular file on the filesystem, it'll be + picked up by the filesystem backup every night as well, for long term and off-site safety. + +``` +pim@pixelfed:~$ sudo zfs create data/mysql-backups +pim@pixelfed:~$ cat << EOF | sudo tee /etc/cron.d/bitcron +25 5 * * * root /usr/local/bin/bitcron mysql-backup.cron +EOF +``` + +For my friends at AS12859 [[bit.nl](https://www.bit.nl/)], I still use `bitcron(1)` :-) For the rest of you -- bitcron is a little wrapper +written in Bash that defines a few primitives such as logging, iteration, info/warning/error/fatals etc, and then runs whatever you define +in a function called `bitcron_main()`, sending e-mail to an operator only if there are warnings or errors, and otherwise logging to +`/var/log/bitcron`. 
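+Bitcron itself isn't mine to share, but to give a feel for what such a wrapper does, here's a minimal sketch in Bash.
+The function names mirror the primitives described above; everything else (paths, the operator address, the mail
+invocation) is made up for illustration, and the real thing also provides helpers like `rotate` and captures plain
+`echo` output into the log, which I'm leaving out here:
+
+```
+#!/usr/bin/env bash
+# Minimal sketch of a bitcron-style wrapper. Usage: bitcron mysql-backup.cron
+JOB="$1"                       # job file, expected to define bitcron_main()
+LOG="/var/log/bitcron/$(basename "$JOB" .cron).log"
+WARNINGS=0
+
+log()     { printf '%s %s\n' "$(date +%FT%T)" "$*" >> "$LOG"; }
+info()    { log "INFO: $*"; }
+warning() { log "WARNING: $*"; WARNINGS=$((WARNINGS+1)); }
+fatal()   { log "FATAL: $*"; WARNINGS=$((WARNINGS+1)); report; exit 1; }
+report()  {
+  # operator@example.com is a placeholder; only mail when something went wrong
+  if [ "$WARNINGS" -gt 0 ]; then
+    mail -s "bitcron: $JOB: $WARNINGS warning(s)" operator@example.com < "$LOG"
+  fi
+}
+
+. "$JOB"                       # source the job, which defines bitcron_main()
+bitcron_main
+report
+```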
The gist of the mysql-backup bitcron is this: + +``` +echo "Rotating the $DESTDIR directory" +rotate 10 +echo "Done (rotate)" +echo "" + +echo "Creating $DESTDIR/0/ to store today's backup" +mkdir -p $DESTDIR/0 || fatal "Could not create $DESTDIR/0/" +echo "Done (mkdir)" +echo "" + +echo "Fetching databases" +DBS=$(echo 'show databases' | mysql -u$MYSQLUSER -p$MYSQLPASS | egrep -v '^Database') +echo "Done (fetching DBs)" +echo "" + +echo "Backing up all databases" +for DB in $DBS; +do + echo " * Database $DB" + mysqldump --single-transaction -u$MYSQLUSER -p$MYSQLPASS -a $DB | gzip -9 -c \ + > $DESTDIR/0/mysqldump_$DB.gz \ + || warning "Could not dump database $DB" +done +echo "Done backing up all databases" +echo "" +``` + +## What's next + +Now that the server is up, and I have a small amount of users (mostly folks I know from the tech industry), I took some time to explore +both the Fediverse, reach out to friends old and new, participate in a few random discussions possibly about food, datacenter pics and +camping trips, as well as fiddle with the iOS and Android apps (for now, I've settled on Vernissage after switching my iPhone away from the +horrible HEIC format which literally nobody supports). This is going to be fun :) + +Now, I think I'm ready to further productionize the experience. It's important to monitor these applications, so in an upcoming post I'll be +looking at how to do _blackbox_ and _whitebox_ monitoring on this instance. + +If you're looking for a home, feel free to sign up at [https://pix.ublog.tech/](https://pix.ublog.tech/) as I'm sure that having a bit more load / +traffic on this instance will allow me to learn (and in turn, to share with others)! Of course, my Mastodon instance at +[https://ublog.tech/](https://ublog.tech) is also happy to serve. diff --git a/content/articles/2023-08-27-ansible-nginx.md b/content/articles/2023-08-27-ansible-nginx.md new file mode 100644 index 0000000..5a4c06c --- /dev/null +++ b/content/articles/2023-08-27-ansible-nginx.md @@ -0,0 +1,628 @@ +--- +date: "2023-08-27T08:56:54Z" +title: 'Case Study: NGINX + Certbot with Ansible' +--- + +# About this series + +{{< image width="200px" float="right" src="/assets/ansible/Ansible_logo.svg" alt="Ansible" >}} + +In the distant past (to be precise, in November of 2009) I wrote a little piece of automation together with my buddy Paul, called +_PaPHosting_. The goal was to be able to configure common attributes like servername, config files, webserver and DNS configs in a +consistent way, tracked in Subversion. By the way despite this project deriving its name from the first two authors, our mutual buddy Jeroen +also started using it, and has written lots of additional cool stuff in the repo, as well as helped to move from Subversion to Git a few +years ago. + +Michael DeHaan [[ref](https://www.ansible.com/blog/author/michael-dehaan)] founded Ansible in 2012, and by then our little _PaPHosting_ +project, which was written as a set of bash scripts, had sufficiently solved our automation needs. But, as is the case with most home-grown +systems, over time I kept on seeing more and more interesting features and integrations emerge, solid documentation, large user group, and +eventually I had to reconsider our 1.5K LOC of Bash and ~16.5K files under maintenance, and in the end, I settled on Ansible. 
+
+```
+commit c986260040df5a9bf24bef6bfc28e1f3fa4392ed
+Author: Pim van Pelt
+Date:   Thu Nov 26 23:13:21 2009 +0000
+
+pim@squanchy:~/src/paphosting$ find * -type f | wc -l
+  16541
+
+pim@squanchy:~/src/paphosting/scripts$ wc -l *push.sh funcs
+  132 apache-push.sh
+  148 dns-push.sh
+   92 files-push.sh
+  100 nagios-push.sh
+  178 nginx-push.sh
+  271 pkg-push.sh
+  100 sendmail-push.sh
+   76 smokeping-push.sh
+  371 funcs
+ 1468 total
+```
+
+In a [[previous article]({% post_url 2023-03-17-ipng-frontends %})], I talked about having not one but a cluster of NGINX servers that would
+each share a set of SSL certificates and pose as a reverse proxy for a bunch of websites. At the bottom of that article, I wrote:
+
+> The main thing that's next is to automate a bit more of this. IPng Networks has an Ansible controller, which I'd like to add ...
+> but considering Ansible is its whole own elaborate bundle of joy, I'll leave that for maybe another article.
+
+**Tadaah.wav** that article is here! This is by no means an introduction or howto for Ansible. For that, please take a look at the
+incomparable Jeff Geerling [[ref](https://www.jeffgeerling.com/)] and his book: [[Ansible for Devops](https://www.ansiblefordevops.com/)]. I
+bought and read this book, and I highly recommend it.
+
+## Ansible: Playbook Anatomy
+
+The first thing I do is install four Debian Bookworm virtual machines, two in Amsterdam, one in Geneva and one in Zurich. These will be my
+first group of NGINX servers, which will become my geo-distributed frontend pool. I don't do any specific configuration or
+installation of packages, I just leave whatever debootstrap gives me, which is a relatively lean install with 8 vCPUs, 16GB of memory, a 20GB
+boot disk and a 30G second disk for caching and static websites.
+
+Ansible is a simple, but powerful, server and configuration management tool (with a few other tricks up its sleeve). It consists of an
+_inventory_ (the hosts I'll manage) which are put into one or more _groups_, a registry of _variables_ (telling me things about
+those hosts and groups), and an elaborate system to run small bits of automation, called _tasks_, organized in things called _Playbooks_.
+
+### NGINX Cluster: Group Basics
+
+First of all, I create an Ansible _group_ called **nginx** and I add the following four freshly installed virtual machine hosts to it:
+
+```
+pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee -a inventory/nodes.yml
+nginx:
+  hosts:
+    nginx0.chrma0.net.ipng.ch:
+    nginx0.chplo0.net.ipng.ch:
+    nginx0.nlams1.net.ipng.ch:
+    nginx0.nlams2.net.ipng.ch:
+EOF
+```
+
+I have a mixture of Debian and OpenBSD machines at IPng Networks, so I will add this group **nginx** as a child to another group called
+**debian**, so that I can run "common debian tasks", such as installing Debian packages that I want all of my servers to have, adding users
+and their SSH key for folks who need access, installing and configuring the firewall and things like Borgmatic backups.
+
+I'm not going to go into all the details here for the **debian** playbook, though. It's just there to make the base system consistent across
+all servers (bare metal or virtual). The one thing I will mention, though, is that the **debian** playbook will see to it that the correct
+users are created, with their SSH pubkey, and I'm going to use this feature first by creating two users:
+
+1.
`lego`: As I described in a [[post on DNS-01]({% post_url 2023-03-24-lego-dns01 %})], IPng has a certificate machine that answers Let's + Encrypt DNS-01 challenges, and its job is to regularly prove ownership of my domains, and then request a (wildcard!) certificate. + Once that renews, copy the certificate to all NGINX machines. To do that copy, `lego` needs an account on these machines, it needs + to be able to write the certs and issue a reload to the NGINX server. +1. `drone`: Most of my websites are static, for example `ipng.ch` is generated by Jekyll. I typically write an article on my laptop, and + once I'm happy with it, I'll git commit and push it, after which a _Continuous Integration_ system called [[Drone](https://drone.io)] + gets triggered, builds the website, runs some tests, and ultimately copies it out to the NGINX machines. Similar to the first user, + this second user must have an account and the ability to write its web data to the NGINX server in the right spot. + +That explains the following: + +```yaml +pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee group_vars/nginx.yml +--- +users: + lego: + comment: Lets Encrypt + password: "!" + groups: [ lego ] + drone: + comment: Drone CI + password: "!" + groups: [ www-data ] + +sshkeys: + lego: + - key: ecdsa-sha2-nistp256 + comment: lego@lego.net.ipng.ch + drone: + - key: ecdsa-sha2-nistp256 + comment: drone@git.net.ipng.ch +``` + +I note that the `users` and `sshkeys` used here are dictionaries, and that the `users` role defines a few default accounts like my own +account `pim`, so writing this to the **group_vars** means that these new entries are applied to all machines that belong to the group +**nginx**, so they'll get these users created _in addition to_ the other users in the dictionary. Nifty! + +### NGINX Cluster: Config + +I wanted to be able to conserve IP addresses, and just a few months ago, had a discussion with some folks at Coloclue where we shared the +frustration that what was hip in the 90s (go to RIPE NCC and ask for a /20, justifying that with "I run SSL websites") is somehow still +being used today, even though that's no longer required, or in fact, desirable. So I take one IPv4 and IPv6 address and will use a TLS +extension called _Server Name Indication_ or [[SNI](https://en.wikipedia.org/wiki/Server_Name_Indication)], designed in 2003 (**20 years +old today**), which you can see described in [[RFC 3546](https://datatracker.ietf.org/doc/html/rfc3546)]. + +Folks who try to argue they need multiple IPv4 addresses because they run multiple SSL websites are somewhat of a trigger to me, so this +article doubles up as a "how to do SNI and conserve IPv4 addresses". + +I will group my websites that share the same SSL certificate, and I'll call these things _clusters_. 
An IPng NGINX Cluster: + +* is identified by a name, for example `ipng` or `frysix` +* is served by one or more NGINX servers, for example `nginx0.chplo0.ipng.ch` and `nginx0.nlams1.ipng.ch` +* serves one or more distinct websites, for example `www.ipng.ch` and `nagios.ipng.ch` and `go.ipng.ch` +* has exactly one SSL certificate, which should cover all of the website(s), preferably using wildcard certs, for example `*.ipng.ch, + ipng.ch` + +And then, I define several clusters this way, in the following configuration file: + +```yaml +pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee vars/nginx.yml +--- +nginx: + clusters: + ipng: + members: [ nginx0.chrma0.net.ipng.ch, nginx0.chplo0.net.ipng.ch, nginx0.nlams1.net.ipng.ch, nginx0.nlams2.net.ipng.ch ] + ssl_common_name: ipng.ch + sites: + ipng.ch: + nagios.ipng.ch: + go.ipng.ch: + frysix: + members: [ nginx0.nlams1.net.ipng.ch, nginx0.nlams2.net.ipng.ch ] + ssl_common_name: frys-ix.net + sites: + frys-ix.net: +``` + +This way I can neatly group the websites (eg. the **ipng** websites) together, call them by name, and immediately see which servers are going to +be serving them using which certificate common name. For future expansion (hint: an upcoming article on monitoring), I decide to make the +**sites** element here a _dictionary_ with only keys and no values as opposed to a _list_, because later I will want to add some bits and +pieces of information for each website. + +### NGINX Cluster: Sites + +As is common with NGINX, I will keep a list of websites in the directory `/etc/nginx/sites-available/` and once I need a given machine to +actually serve that website, I'll symlink it from `/etc/nginx/sites-enabled/`. In addition, I decide to add a few common configuration +snippets, such as logging and SSL/TLS parameter files and options, which allow the webserver to score relatively high on SSL certificate +checker sites. It helps to keep the security buffs off my case. + +So I decide on the following structure, each file to be copied to all nginx machines in `/etc/nginx/`: + +``` +roles/nginx/files/conf.d/http-log.conf +roles/nginx/files/conf.d/ipng-headers.inc +roles/nginx/files/conf.d/options-ssl-nginx.inc +roles/nginx/files/conf.d/ssl-dhparams.inc +roles/nginx/files/sites-available/ipng.ch.conf +roles/nginx/files/sites-available/nagios.ipng.ch.conf +roles/nginx/files/sites-available/go.ipng.ch.conf +roles/nginx/files/sites-available/go.ipng.ch.htpasswd +roles/nginx/files/sites-available/... +``` + +In order: +* `conf.d/http-log.conf` defines a custom logline type called `upstream` that contains a few interesting additional items that show me + the performance of NGINX: +> log_format upstream '$remote_addr - $remote_user [$time_local] ' '"$request" $status $body_bytes_sent ' '"$http_referer" "$http_user_agent" ' 'rt=$request_time uct=$upstream_connect_time uht=$upstream_header_time urt=$upstream_response_time'; +* `conf.d/ipng-headers.inc` adds a header served to end-users from this NGINX, that reveals the instance that served the request. + Debugging a cluster becomes a lot easier if you know which server served what: +> add_header X-IPng-Frontend $hostname always; +* `conf.d/options-ssl-nginx.inc` and `conf.d/ssl-dhparams.inc` are files borrowed from Certbot's NGINX configuration, and ensure the best + TLS and SSL session parameters are used. +* `sites-available/*.conf` are the configuration blocks for the port-80 (HTTP) and port-443 (SSL certificate) websites. 
In the interest of + brevity I won't copy them here, but if you're curious I showed a bunch of these in a [[previous article]({% post_url +2023-03-17-ipng-frontends %})]. These per-website config files sensibly include the SSL defaults, custom IPng headers and `upstream` log + format. + +### NGINX Cluster: Let's Encrypt + +I figure the single most important thing to get right is how to enable multiple groups of websites, including SSL certificates, in multiple +_Clusters_ (say `ipng` and `frysix`), to be served using different SSL certificates, but on the same IPv4 and IPv6 address, using _Server +Name Indication_ or SNI. Let's first take a look at building these two of these certificates, one for [[IPng Networks](https://ipng.ch)] and +one for [[FrysIX](https://frys-ix.net/)], the internet exchange with Frysian roots, which incidentally offers free 1G, 10G, 40G and 100G +ports all over the Amsterdam metro. My buddy Arend and I are running that exchange, so please do join it! + +I described the usual `HTTP-01` certificate challenge a while ago in [[this article]({% post_url 2023-03-17-ipng-frontends %})], but I +rarely use it because I've found that once installed, `DNS-01` is vastly superior. I wrote about the ability to request a single certificate +with multiple _wildcard_ entries in a [[DNS-01 article]({% post_url 2023-03-24-lego-dns01 %})], so I'm going to save you the repetition, and +simply use `certbot`, `acme-dns` and the `DNS-01` challenge type, to request the following _two_ certificates: + +```bash +lego@lego:~$ certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \ + --work-dir /home/lego/workdir --manual --manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \ + --preferred-challenges dns --debug-challenges \ + -d ipng.ch -d *.ipng.ch -d *.net.ipng.ch \ + -d ipng.nl -d *.ipng.nl \ + -d ipng.eu -d *.ipng.eu \ + -d ipng.li -d *.ipng.li \ + -d ublog.tech -d *.ublog.tech \ + -d as8298.net -d *.as8298.net \ + -d as50869.net -d *.as50869.net + +lego@lego:~$ certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \ + --work-dir /home/lego/workdir --manual --manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \ + --preferred-challenges dns --debug-challenges \ + -d frys-ix.net -d *.frys-ix.net +``` + +First off, while I showed how to get these certificates by hand, actually generating these two commands is easily doable in Ansible (which +I'll show at the end of this article!) I defined which cluster has which main certificate name, and which websites it's wanting to serve. +Looking at `vars/nginx.yml`, it becomes quickly obvious how I can automate this. Using a relatively straight forward construct, I can let +Ansible create for me a list of commandline arguments programmatically: + +1. Initialize a variable `CERT_ALTNAMES` as a list of `nginx.clusters.ipng.ssl_common_name` and its wildcard, in other words `[ipng.ch, + *.ipng.ch]`. +1. As a convenience, tack onto the `CERT_ALTNAMES` list any entries in the `nginx.clusters.ipng.ssl_altname`, such as `[*.net.ipng.ch]`. +1. Then looping over each entry in the `nginx.clusters.ipng.sites` dictionary, use `fnmatch` to match it against any entries in the + `CERT_ALTNAMES` list: + * If it matches, for example with `go.ipng.ch`, skip and continue. This website is covered already by an altname. + * If it doesn't match, for example with `ublog.tech`, simply add it and its wildcard to the `CERT_ALTNAMES` list: `[ublog.tech, *.ublog.tech]`. 
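+Before translating this into Ansible, here is the same accumulation logic as a small standalone shell sketch, using
+bash's glob matching in `case` as a stand-in for `fnmatch`. The input data is the `ipng` example from above, with
+`ublog.tech` thrown in as a site that is not yet covered by an existing altname:
+
+```
+#!/usr/bin/env bash
+set -f   # keep the shell from expanding the wildcards against the filesystem
+CERT_ALTNAMES="ipng.ch *.ipng.ch *.net.ipng.ch"
+SITES="ipng.ch nagios.ipng.ch go.ipng.ch ublog.tech"
+
+for SITE in $SITES; do
+  MATCHED=no
+  for ALT in $CERT_ALTNAMES; do
+    case "$SITE" in ($ALT) MATCHED=yes; break ;; esac    # glob match, like fnmatch
+  done
+  [ "$MATCHED" = "no" ] && CERT_ALTNAMES="$CERT_ALTNAMES $SITE *.$SITE"
+done
+echo "$CERT_ALTNAMES"
+# prints: ipng.ch *.ipng.ch *.net.ipng.ch ublog.tech *.ublog.tech
+```
+
+With that mental model in place, the Jinja2 and lookup-plugin version further down is hopefully easier to follow.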
+ + +Now, the first time I run this for a new cluster (which has never had a certificate issued before), `certbot` will ask me to ensure the correct +`_acme-challenge` records are in each respective DNS zone. After doing that, it will issue two separate certificates and install a cronjob +that will periodically check the age, and renew the certificate(s) when they are up for renewal. In a post-renewal hook, I will create a +script that copies the new certificate to the NGINX cluster (using the `lego` user + SSH key that I defined above). + +```bash +lego@lego:~$ find /home/lego/acme-dns/live/ -type f +/home/lego/acme-dns/live/README +/home/lego/acme-dns/live/frys-ix.net/README +/home/lego/acme-dns/live/frys-ix.net/chain.pem +/home/lego/acme-dns/live/frys-ix.net/privkey.pem +/home/lego/acme-dns/live/frys-ix.net/cert.pem +/home/lego/acme-dns/live/frys-ix.net/fullchain.pem +/home/lego/acme-dns/live/ipng.ch/README +/home/lego/acme-dns/live/ipng.ch/chain.pem +/home/lego/acme-dns/live/ipng.ch/privkey.pem +/home/lego/acme-dns/live/ipng.ch/cert.pem +/home/lego/acme-dns/live/ipng.ch/fullchain.pem +``` + +The crontab entry that Certbot normally installs makes soms assumptions on directory and which user is running the renewal. I am not a fan of +having the `root` user do this, so I've changed it to this: + +```bash +lego@lego:~$ cat /etc/cron.d/certbot +0 */12 * * * lego perl -e 'sleep int(rand(43200))' && certbot -q renew \ + --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \ + --work-dir /home/lego/workdir \ + --deploy-hook "/home/lego/bin/certbot-distribute" +``` + +And some pretty cool magic happens with this `certbot-distribute` script. When `certbot` has successfully received a new +certificate, it'll set a few environment variables and execute the deploy hook with them: + +* ***RENEWED_LINEAGE***: will point to the config live subdirectory (eg. `/home/lego/acme-dns/live/ipng.ch`) containing the new + certificates and keys +* ***RENEWED_DOMAINS*** will contain a space-delimited list of renewed certificate domains (eg. `ipng.ch *.ipng.ch *.net.ipng.ch`) + +Using the first of those two things, I guess it becomes straight forward to distribute the new certs: + +```bash +#!/bin/sh + +CERT=$(basename $RENEWED_LINEAGE) +CERTFILE=$RENEWED_LINEAGE/fullchain.pem +KEYFILE=$RENEWED_LINEAGE/privkey.pem + +if [ "$CERT" = "ipng.ch" ]; then + MACHS="nginx0.chrma0.ipng.ch nginx0.chplo0.ipng.ch nginx0.nlams1.ipng.ch nginx0.nlams2.ipng.ch" +elif [ "$CERT" = "frys-ix.net" ]; then + MACHS="nginx0.nlams1.ipng.ch nginx0.nlams2.ipng.ch" +else + echo "Unknown certificate $CERT, do not know which machines to copy to" + exit 3 +fi + +for MACH in $MACHS; do + fping -q $MACH 2>/dev/null || { + echo "$MACH: Skipping (unreachable)" + continue + } + echo $MACH: Copying $CERT + scp -q $CERTFILE $MACH:/etc/nginx/certs/$CERT.crt + scp -q $KEYFILE $MACH:/etc/nginx/certs/$CERT.key + echo $MACH: Reloading nginx + ssh $MACH 'sudo systemctl reload nginx' +done +``` + +There are a few things to note, if you look at my little shell script. I already kind of know which `CERT` belongs to which `MACHS`, +because this was configured in `vars/nginx.yml`, where I have a cluster name, say `ipng`, which conveniently has two variables, one called +`members` which is a list of machines, and the second is `ssl_common_name` which is `ipng.ch`. I think that I can find a way to let +Ansible generate this file for me also, whoot! 
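+Because the deploy hook only fires when a certificate actually renews, I find it useful to be able to run it by hand.
+Certbot passes its parameters as environment variables, so a manual test run (hypothetical, but using the paths from
+the `find` output above) looks like this:
+
+```
+lego@lego:~$ RENEWED_LINEAGE=/home/lego/acme-dns/live/frys-ix.net \
+    RENEWED_DOMAINS="frys-ix.net *.frys-ix.net" \
+    /home/lego/bin/certbot-distribute
+```
+
+Since the script only copies the current `fullchain.pem` and `privkey.pem` and then reloads NGINX, re-running it like
+this should be harmless.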
+ +### Ansible: NGINX + +Tying it all together (frankly, a tiny bit surprised you're still reading this!), I can now offer an Ansible role that automates all of this. + +```yaml +{%- raw %} +pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee roles/nginx/tasks/main.yml +- name: Install Debian packages + ansible.builtin.apt: + update_cache: true + pkg: [ nginx, ufw, net-tools, apache2-utils, mtr-tiny, rsync ] + +- name: Copy config files + ansible.builtin.copy: + src: "{{ item }}" + dest: "/etc/nginx/" + owner: root + group: root + mode: u=rw,g=r,o=r + directory_mode: u=rwx,g=rx,o=rx + loop: [ conf.d, sites-available ] + notify: Reload nginx + +- name: Add cluster + ansible.builtin.include_tasks: + file: cluster.yml + loop: "{{ nginx.clusters | dict2items }}" + loop_control: + label: "{{ item.key }}" +EOF + +pim@squanchy:~/src/ipng-ansible$ cat << EOF > roles/nginx/handlers/main.yml +- name: Reload nginx + ansible.builtin.service: + name: nginx + state: reloaded +EOF +{% endraw %} +``` + +The first task installs the Debian packages I'll want to use. The `apache2-utils` package is to create and maintain `htpasswd` files and +some other useful things. The `rsync` package is needed to accept both website data from the `drone` continuous integration user, as well as +certificate data from the `lego` user. + +The second task copies all of the (static) configuration files onto the machine, populating `/etc/nginx/conf.d/` and +`/etc/nginx/sites-available/`. It uses a `notify` stanza to make note if any of these files (notably the ones in `conf.d/`) have changed, and +if so, remember to invoke a _handler_ to reload the running NGINX to pick up those changes later on. + +Finally, the third task branches out and executes the tasks defined in `tasks/cluster.yml` one for each NGINX cluster (in my case, `ipng` +and then `frysix`): + +```yaml +{%- raw %} +pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee roles/nginx/tasks/cluster.yml +- name: "Enable sites for cluster {{ item.key }}" + ansible.builtin.file: + src: "/etc/nginx/sites-available/{{ sites_item.key }}.conf" + dest: "/etc/nginx/sites-enabled/{{ sites_item.key }}.conf" + owner: root + group: root + state: link + loop: "{{ (nginx.clusters[item.key].sites | default({}) | dict2items) }}" + when: inventory_hostname in nginx.clusters[item.key].members | default([]) + loop_control: + loop_var: sites_item + label: "{{ sites_item.key }}" + notify: Reload nginx +EOF +{% endraw %} +``` + +This task is a bit more complicated, so let me go over it from outwards facing in. The thing that called us, already has a loop variable +called `item` which has a key (`ipng`) and a value (the whole cluster defined under `nginx.clusters.ipng`). Now if I take that +`item.key` variable and look at its `sites` dictionary (in other words: `nginx.clusters.ipng.sites`, I can create another loop over all the +sites belonging to that cluster. Iterating over a dictionary in Ansible is done with a filter called `dict2items`, and because technically +the cluster could have zero sites, I can ensure the `sites` dictionary defaults to the empty dictionary `{}`. Phew! + +Ansible is running this for each machine, and of course I only want to execute this block, if the given machine (which is referenced as +`inventory_hostname` occurs in the clusters' `members` list. If not: skip, if yes: go! which is what the `when` line does. 
+ +The loop itself then runs for each site in the `sites` dictionary, allowing the `loop_control` to give that loop variable a unique name +called `sites_item`, and when printing information on the CLI, using the `label` set to the `sites_item.key` variable (eg. `frys-ix.net`) +rather than the whole dictionary belonging to it. + +With all of that said, the inner loop is easy: create a (sym)link for each website config file from `sites-available` to `sites-enabled` and +if new links are created, invoke the _Reload nginx_ handler. + +### Ansible: Certbot + +***But what about that LEGO stuff?*** Fair question. The two scripts I described above (one to create the certbot certificate, and another +to copy it to the correct machines), both need to be generated and copied to the right places, so here I go, appending to the tasks: + +```yaml +{%- raw %} +pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee -a roles/nginx/tasks/main.yml +- name: Create LEGO directory + ansible.builtin.file: + path: "/etc/nginx/certs/" + owner: lego + group: lego + mode: u=rwx,g=rx,o= + +- name: Add sudoers.d + ansible.builtin.copy: + src: sudoers + dest: "/etc/sudoers.d/lego-ipng" + owner: root + group: root + +- name: Generate Certbot Distribute script + delegate_to: lego.net.ipng.ch + run_once: true + ansible.builtin.template: + src: certbot-distribute.j2 + dest: "/home/lego/bin/certbot-distribute" + owner: lego + group: lego + mode: u=rwx,g=rx,o= + +- name: Generate Certbot Cluster scripts + delegate_to: lego.net.ipng.ch + run_once: true + ansible.builtin.template: + src: certbot-cluster.j2 + dest: "/home/lego/bin/certbot-{{ item.key }}" + owner: lego + group: lego + mode: u=rwx,g=rx,o= + loop: "{{ nginx.clusters | dict2items }}" +EOF + +pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee roles/nginx/files/sudoers +## *** Managed by IPng Ansible *** +# +%lego ALL=(ALL) NOPASSWD: /usr/bin/systemctl reload nginx +EOF +{% endraw -%} +``` + +The first task creates `/etc/nginx/certs` which will be owned by the user `lego`, and that's where Certbot will rsync the certificates after +renewal. The second task then allows `lego` user to issue a `systemctl reload nginx` so that NGINX can pick up the certificates once they've +changed on disk. + +The third task generated the `certbot-distribute` script, that, depending on the common name of the certificate (for example `ipng.ch` or +`frys-ix.net`), knows which NGINX machines to copy it to. Its logic is pretty similar to the plain-old shellscript I started with, but does +have a few variable expansions. If you'll recall, that script had hard coded way to assemble the MACHS variable, which can be replaced now: + +```bash +{%- raw %} +# ... +{% for cluster_name, cluster in nginx.clusters.items() | default({}) %} +{% if not loop.first%}el{% endif %}if [ "$CERT" = "{{ cluster.ssl_common_name }}" ]; then + MACHS="{{ cluster.members | join(' ') }}" +{% endfor %} +else + echo "Unknown certificate $CERT, do not know which machines to copy to" + exit 3 +fi +{% endraw %} +``` + +One common Ansible trick here is to detect if a given loop has just begun (in which case `loop.first` will be true), or if this is the last +element in the loop (in which case `loop.last` will be true). I can use this to emit the `if` (first) versus `elif` (not first) statements. + +Looking back at what I wrote in this _Certbot Distribute_ task, you'll see I used two additional configuration elements: +1. 
***run_once***: Since there are potentially many machines in the **nginx** _Group_, by default Ansible will run this task for each machine. However, the Certbot cluster and distribute scripts really only need to be generated once per _Playbook_ execution, which is determined by this `run_once` field. +1. ***delegate_to***: This task should be executed not on an NGINX machine, rather instead on the `lego.net.ipng.ch` machine, which is specified by the `delegate_to` field. + +#### Ansible: lookup example + +And now for the _pièce de résistance_, the fourth and final task generates a shell script that captures for each cluster the primary name +(called `ssl_common_name`) and the list of alternate names which will turn into full commandline to request a certificate with all wildcard +domains added (eg. `ipng.ch` and `*.ipng.ch`). To do this, I decide to create an Ansible [[Lookup +Plugin](https://docs.ansible.com/ansible/latest/plugins/lookup.html)]. This lookup will simply return **true** if a given sitename is +covered by any of the existing certificace altnames, including wildcard domains, for which I can use the standard python `fnmatch`. + +First, I can create the lookup plugin in a a well-known directory, so Ansible can discover it: + +``` +pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee roles/nginx/lookup_plugins/altname_match.py +import ansible.utils as utils +import ansible.errors as errors +from ansible.plugins.lookup import LookupBase +import fnmatch + +class LookupModule(LookupBase): + def __init__(self, basedir=None, **kwargs): + self.basedir = basedir + def run(self, terms, variables=None, **kwargs): + sitename = terms[0] + cert_altnames = terms[1] + for altname in cert_altnames: + if sitename == altname: + return [True] + if fnmatch.fnmatch(sitename, altname): + return [True] + return [False] +EOF +``` + +The Python class here will compare the website name in `terms[0]` with a list of altnames given in +`terms[1]` and will return True either if a literal match occured, or if the altname `fnmatch` with the sitename. +It will return False otherwise. Dope! Here's how I use it in the `certbot-cluster` script, which is +starting to get pretty fancy: + +```bash +{%- raw %} +pim@squanchy:~/src/ipng-ansible$ cat << EOF | tee roles/nginx/templates/certbot-cluster.j2 +#!/bin/sh +### +### {{ ansible_managed }} +### +{% set cluster_name = item.key %} +{% set cluster = item.value %} +{% set sites = nginx.clusters[cluster_name].sites | default({}) %} +# +# This script generates a certbot commandline to initialize (or re-initialize) a given certificate for an NGINX cluster. +# +### Metadata for this cluster: +# +# {{ cluster_name }}: {{ cluster }} +{% set cert_altname = [ cluster.ssl_common_name, '*.' 
+ cluster.ssl_common_name ] %} +{% do cert_altname.extend(cluster.ssl_altname|default([])) %} +{% for sitename, site in sites.items() %} +{% set altname_matched = lookup('altname_match', sitename, cert_altname) %} +{% if not altname_matched %} +{% do cert_altname.append(sitename) %} +{% do cert_altname.append("*."+sitename) %} +{% endif %} +{% endfor %} +# CERT_ALTNAME: {{ cert_altname | join(' ') }} +# +### + +certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs --work-dir /home/lego/workdir \ + --manual --manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \ + --preferred-challenges dns --debug-challenges \ +{% for domain in cert_altname %} + -d {{ domain }}{% if not loop.last %} \{% endif %} + +{% endfor %} +EOF +{% endraw %} +``` + +Ansible provides a lot of templating and logic evaluation in its Jinja2 templating language, but it isn't really a programming language. +That said, from the top, here's what happens: +* I set three variables, `cluster_name`, `cluster` (the dictionary with the cluster config) and as a shorthand `sites` which is a + dictionary of sites, defaulting to `{}` if it doesn't exist. +* I'll print the cluster name and the cluster config for posterity. Who knows, eventually I'll be debugging this anyway :-) +* Then comes the main thrust, the simple loop that I described above, but in Jinja2: + * Initialize the `cert_altname` list with the `ssl_common_name` and its wildcard variant, optionally extending it with the list + of altnames in `ssl_altname`, if it's set. + * For each site in the sites dictionary, invoke the lookup and capture its (boolean) result in `altname_matched`. + * If the match failed, we have a new domain, so add it and its wildcard variant to the `cert_altname` list, I use the `do` + Jinja2 extension there comes from package `jinja2.ext.do`. +* At the end of this, all of these website names have been reduced to their domain+wildcard variant, which I can loop over to emit + the `-d` flags to `certbot` at the bottom of the file. + +And with that, I can generate both the certificate request command, and distribute the resulting +certificates to those NGINX servers that need them. + +## Results + +{{< image src="/assets/ansible/ansible-run.png" alt="Ansible Run" >}} + +I'm very pleased with the results. I can clearly see that the two servers that I assigned to this +NGINX cluster (the two in Amsterdam) got their sites enabled, whereas the other two (Zurich and +Geneva) were skipped. I can also see that the new certbot request scripts was generated and the +existing certbot-distribute script was updated (to be aware of where to copy a renewed cert for this +cluster). And, in the end only the two relevant NGINX servers were reloaded, reducing overall risk. 
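+Another quick way to see the cluster doing its job from the outside is the `X-IPng-Frontend` response header that
+`ipng-headers.inc` adds: it reveals which frontend served a given request, and may differ between requests depending on
+which frontend DNS hands out:
+
+```
+pim@squanchy:~$ curl -sI https://www.ipng.ch/ | grep -i x-ipng-frontend
+pim@squanchy:~$ curl -sI https://www.frys-ix.net/ | grep -i x-ipng-frontend
+```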
+ +One other way to show that the very same IPv4 and IPv6 address can be used to serve multiple +distinct multi-domain/wildcard SSL certificates, using this _Server Name Indication_ (SNI, which, I +repeat, has been available **since 2003** or so), is this: + +```bash +pim@squanchy:~$ HOST=nginx0.nlams1.ipng.ch +pim@squanchy:~$ PORT=443 +pim@squanchy:~$ SERVERNAME=www.ipng.ch +pim@squanchy:~$ openssl s_client -connect $HOST:$PORT -servername $SERVERNAME /dev/null \ + | openssl x509 -text | grep DNS: | sed -e 's,^ *,,' +DNS:*.ipng.ch, DNS:*.ipng.eu, DNS:*.ipng.li, DNS:*.ipng.nl, DNS:*.net.ipng.ch, DNS:*.ublog.tech, +DNS:as50869.net, DNS:as8298.net, DNS:ipng.ch, DNS:ipng.eu, DNS:ipng.li, DNS:ipng.nl, DNS:ublog.tech + +pim@squanchy:~$ SERVERNAME=www.frys-ix.net +pim@squanchy:~$ openssl s_client -connect $HOST:$PORT -servername $SERVERNAME /dev/null \ + | openssl x509 -text | grep DNS: | sed -e 's,^ *,,' +DNS:*.frys-ix.net, DNS:frys-ix.net +``` + +Ansible is really powerful, and once I got to know it a little bit, will readily admit it's way +cooler than PaPhosting ever was :) + +## What's Next + +If you remember, I wrote that the `nginx.clusters.*.sites` would not be a list but rather a +dictionary, because I'd like to be able to carry other bits of information. And if you take a close +look at my screenshot above, you'll see I revealed something about Nagios... so in an upcoming post +I'd like to share how IPng Networks arranges its Nagios environment, and I'll use the NGINX configs +here to show how I automatically monitor all servers participating in an NGINX _Cluster_, both for +pending certificate expiry, which should not generally happen precisely due to the automation here, +but also in case any backend server takes the day off. + +Stay tuned! Oh, and if you're good at Ansible and would like to point out how silly I approach +things, please do drop me a line on Mastodon, where you can reach me on +[[@IPngNetworks@ublog.tech](https://ublog.tech/@IPngNetworks)]. diff --git a/content/articles/2023-10-21-vpp-ixp-gateway-1.md b/content/articles/2023-10-21-vpp-ixp-gateway-1.md new file mode 100644 index 0000000..96e56d3 --- /dev/null +++ b/content/articles/2023-10-21-vpp-ixp-gateway-1.md @@ -0,0 +1,308 @@ +--- +date: "2023-10-21T11:35:14Z" +title: VPP IXP Gateway - Part 1 +--- + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. + +There's some really fantastic features in VPP, some of which are lesser well known, and not always +very well documented. In this article, I will describe a unique usecase in which I think VPP will +excel, notably acting as a gateway for Internet Exchange Points. + +In this first article, I'll take a closer look at three things that would make such a gateway +possible: bridge domains, MAC address filtering and traffic shaping. + +## Introduction + +Internet Exchanges are typically L2 (ethernet) switch platforms that allow their connected members +to exchange traffic amongst themselves. Not all members share physical locations with the Internet +Exchange itself, for example the IXP may be at NTT Zurich, but the member may be present in +Interxion Zurich. 
For smaller clubs, like IPng Networks, it's not always financially feasible (or +desirable) to order a dark fiber between two adjacent datacenters, or even a cross connect in the +same datacenter (as many of them are charging exorbitant fees for what is essentially passive fiber +optics and patch panels), if the amount of traffic passed is modest. + +One solution to such problems is to have one member transport multiple end-user downstream members +to the platform, for example by means of an Ethernet over MPLS or VxLAN transport from where the +enduser lives, to the physical port of the Internet Exchange. These transport members are often +called _IXP Resellers_ noting that usually, but not always, some form of payment is required. + +From the point of view of the IXP, it's often the case that there is a one MAC address per member +limitation, and not all members will have the same bandwidth guarantees. Many IXPs will offer +physical connection speeds (like a Gigabit, TenGig or HundredGig port), but they _also_ have a +common practice to limit the passed traffic by means of traffic shaping, for example one might have +a TenGig port but only entitled to pass 3.0 Gbit/sec of traffic in- and out of the platform. + +For a long time I thought this kind of sucked, after all, who wants to connect to an internet +exchange point but then see their traffic rate limited? But if you think about it, this is often to +protect both the member, and the reseller, and the exchange itself: if the total downstream +bandwidth to the reseller is potentially larger than the reseller's port to the exchange, and this +is almost certainly the case in the other direction: the total IXP bandwidth that might go to one +individual members, is significantly larger than the reseller's port to the exchange. + +Due to these two issues, a reseller port may become a bottleneck and _packetlo_ may occur. To +protect the ecosystem, having the internet exchange try to enforce fairness and bandwidth limits +makes operational sense. + +## VPP as an IXP Gateway + +{{< image width="400px" float="right" src="/assets/vpp-ixp-gateway/VPP IXP Gateway.svg" alt="VPP IXP Gateway" >}} + +Here's a few requirements that may be necessary to provide an end-to-end solution: +1. Downstream ports MAY be _untagged_, or _tagged_, in which case encapsulation (for example + .1q VLAN tags) SHOULD be provided, one per downstream member. +1. Each downstream member MUST ONLY be allowed to send traffic from one or more registered MAC + addresses, in other words, strict filtering MUST be applied by the gateway. +1. If a downstream member is assigned an up- and downstream bandwidth limit, this MUST be + enforced by the gateway. + +Of course, all sorts of other things come to mind -- perhaps MPLS encapsulation, or VxLAN/GENEVE +tunneling endpoints, and certainly some monitoring with SNMP or Prometheus, and how about just +directly integrating this gateway with [[IXPManager](https://www.ixpmanager.org/)] while we're at +it. Yes, yes! But for this article, I'm going to stick to the bits and pieces regarding VPP itself, +and leave the other parts for another day! 
+ +First, I build a quick lab out of this, by taking one supermicro bare metal server with VPP (it will +be the VPP IXP Gateway), and a couple of Debian servers and switches to simulate clients (A-J): +* Client A-D (on port `e0`-`e3`) will use `192.0.2.1-4/24` and `2001:db8::1-5/64` +* Client E-G (on switch port `e0`-`e2` of switch0, behind port `xe0`) will use `192.0.2.5-7/24` and + `2001:db8::5-7/64` +* Client H-J (on switch port `e0`-`e2` of switch1, behind port `xe1`) will use `192.0.2.8-10/24` and + `2001:db8::8-a/64` +* There will be a server attached to port `xxv0` with address `198.0.2.254/24` and `2001:db8::ff/64` + * The server will run `iperf3`. + +### VPP: Bridge Domains + +The fundamental topology described in the picture above tries to bridge together a bunch of untagged +ports (`e0`..`e3` 1Gbit each)) with two tagged ports (`xe0` and `xe1`, 10Gbit) into an upstream IXP +port (`xxv0`, 25Gbit). One thing to note for the pedants (and I love me some good pedantry) is that +the total physical bandwidth to downstream members in this gateway (4x1+2x10 == 24Gbit) is lower +than the physical bandwidth to the IXP platform (25Gbit), which makes sense. It means that there +will not be contention per se. + +Building this topology in VPP is rather straight forward by using a so called **Bridge Domain**, +which will be referred to by its bridge-id, for which I'll rather arbitrarily choose 8298: +``` +vpp# create bridge-domain 8298 +vpp# set interface l2 bridge xxv0 8298 +vpp# set interface l2 bridge e0 8298 +vpp# set interface l2 bridge e1 8298 +vpp# set interface l2 bridge e2 8298 +vpp# set interface l2 bridge e3 8298 +vpp# set interface l2 bridge xe0 8298 +vpp# set interface l2 bridge xe1 8298 +``` + +### VPP: Bridge Domain Encapsulations + +I cheated a little bit in the previous section: I added the two TenGig ports called `xe0` and `xe1` +directly to the bridge; however they are trunk ports to breakout switches which will each contain +three additional downstream customers. So to add these six new customers, I will do the following: + +``` +vpp# set interface l3 xe0 +vpp# create sub-interfaces xe0 10 +vpp# create sub-interfaces xe0 20 +vpp# create sub-interfaces xe0 30 +vpp# set interface l2 bridge xe0.10 8298 +vpp# set interface l2 bridge xe0.20 8298 +vpp# set interface l2 bridge xe0.30 8298 +``` + +The first command here puts the interface `xe0` back into Layer3 mode, which will detach it from the +bridge-domain. The second set of commands creates sub-interfaces with dot1q tags 10, 20 and 30 +respectively. The third set then adds these three sub-interfaces to the bridge. By the way, I'll do +this for both `xe0` shown above, but also for the second `xe1` port, so all-up that makes 6 +downstream member ports. + +Readers of my articles at this point may have a little bit of an uneasy feeling: "What about the +VLAN Gymnastics?" I hear you ask :) You see, VPP will generally just pick up these ethernet frames +from `xe0.10` which are tagged, and add them as-is to the bridge, which is weird, because all the +other bridge ports are expecting untagged frames. So what I must do is tell VPP, upon receipt of a +tagged ethernet frame on these ports, to strip the tag; and on the way out, before transmitting the +ethernet frame, to wrap it into its correct encapsulation. This is called **tag rewriting** in VPP, +and I've written a bit about it in [[this article]({% post_url 2022-02-14-vpp-vlan-gym %})] in case +you're curious. 
But to cut to the chase: + +``` +vpp# set interface l2 tag-rewrite xe0.10 pop 1 +vpp# set interface l2 tag-rewrite xe0.20 pop 1 +vpp# set interface l2 tag-rewrite xe0.30 pop 1 +vpp# set interface l2 tag-rewrite xe1.10 pop 1 +vpp# set interface l2 tag-rewrite xe1.20 pop 1 +vpp# set interface l2 tag-rewrite xe1.30 pop 1 +``` + +Allright, with the VLAN gymnastics properly applied, I now have a bridge with all ten downstream +members and one upstream port (`xxv0`): + +``` +vpp# show bridge-domain 8298 int + BD-ID Index BSN Age(min) Learning U-Forwrd UU-Flood Flooding ARP-Term arp-ufwd Learn-co Learn-li BVI-Intf + 8298 1 0 off on on flood on off off 1 16777216 N/A + + Interface If-idx ISN SHG BVI TxFlood VLAN-Tag-Rewrite + xxv0 3 1 0 - * none + e0 5 1 0 - * none + e1 6 1 0 - * none + e2 7 1 0 - * none + e3 8 1 0 - * none + xe0.10 19 1 0 - * pop-1 + xe0.20 20 1 0 - * pop-1 + xe0.30 21 1 0 - * pop-1 + xe1.10 22 1 0 - * pop-1 + xe1.20 23 1 0 - * pop-1 + xe1.30 24 1 0 - * pop-1 +``` + +One cool thing to re-iterate is that VPP is really a router, not a switch. It's +entirely possible and common to create two completely independent subinterfaces with .1q tag 10 (in +my case, `xe0.10` and `xe1.10` and use the bridge-domain to tie them together. + +#### Validating Bridge Domains + +Looking at my clients above, I can see that several of them are untagged (`e0`-`e3`) and a few of +them are tagged behind ports `xe0` and `xe1`. It should be straight forward to validate reachability +with the following simple ping command: + +``` +pim@clientA:~$ fping -a -g 192.0.2.0/24 +192.0.2.1 is alive +192.0.2.2 is alive +192.0.2.3 is alive +192.0.2.4 is alive +192.0.2.5 is alive +192.0.2.6 is alive +192.0.2.7 is alive +192.0.2.8 is alive +192.0.2.9 is alive +192.0.2.10 is alive +192.0.2.254 is alive +``` + +At this point the table stakes configuration provides for a Layer2 bridge domain spanning all of +these ports, including performing the correct encapsulation on the TenGig ports that connect to +the switches. There is L2 reachability between all clients over this VPP IXP Gateway. + +**✅ Requirement #1 is implemented!** + +### VPP: MAC Address Filtering + +Enter classifiers! Actually while doing the research for this article, I accidentally nerd-sniped +myself while going through the features provided by VPP's classifier system, and holy moly is that +thing powerful! + +I'm only going to show the results of that little journey through the code base and documentation, +but in an upcoming article I intend to do a thorough deep-dive into VPP classifiers, and add them to +`vppcfg` because I think that would be the bee's knees! + +Back to the topic of MAC address filtering, a classifier would look roughly like this: + +``` +vpp# classify table acl-miss-next deny mask l2 src table 5 +vpp# classify session acl-hit-next permit table-index 5 match l2 src 00:01:02:03:ca:fe +vpp# classify session acl-hit-next permit table-index 5 match l2 src 00:01:02:03:d0:d0 +vpp# set interface input acl intfc e0 l2-table 5 +vpp# show inacl type l2 + Intfc idx Classify table Interface name + 5 5 e0 +``` + +The first line create a classify table where we'll want to match on Layer2 source addresses, and if +there is no entry in the table that matches, the default will be to _deny_ (drop) the ethernet +frame. The next two lines add an entry for ethernet frames which have Layer2 source of the _cafe_ +and _d0d0_ MAC addresses. When matching, the action is to _permit_ (accept) the ethernet frame. 
+Then, I apply this classifier as an l2 input ACL on interface `e0`. + +Incidentally the input ACL can operate at five distinct points in the packet's journey through +the dataplane. At the Layer2 input stage, like I'm using here, in the IPv4 and IPv6 input path, and +when punting traffic for IPv4 and IPv6 respectively. + +#### Validating MAC filtering + +Remember when I created the classify table and added two bogus MAC addresses to it? Let me show you +what would happen on client A, which is directly connected to port `e0`. + +``` +pim@clientA:~$ ip -br link show eno3 +eno3 UP 3c:ec:ef:6a:7b:74 + +pim@clientA:~$ ping 192.0.2.254 +PING 192.0.2.254 (192.0.2.254) 56(84) bytes of data. +... +``` + +This is expected because ClientA's MAC address has not yet been added to the classify table driving +the Layer2 input ACL, which is quicky remedied like so: + +``` +vpp# classify session acl-hit-next permit table-index 5 match l2 src 3c:ec:ef:6a:7b:74 +... +64 bytes from 192.0.2.254: icmp_seq=34 ttl=64 time=2048 ms +64 bytes from 192.0.2.254: icmp_seq=35 ttl=64 time=1024 ms +64 bytes from 192.0.2.254: icmp_seq=36 ttl=64 time=0.450 ms +64 bytes from 192.0.2.254: icmp_seq=37 ttl=64 time=0.262 ms +``` + +**✅ Requirement #2 is implemented!** + +### VPP: Traffic Policers + +I realize that from the IXP's point of view, not all the available bandwidth behind `xxv0` should be +made available to all clients. Some may have negotiated a higher- or lower- bandwidth available to +them. Therefor, the VPP IXP Gateway should be able to rate limit the traffic through the it, for +which a VPP feature already exists: Policers. + +Consider for a moment our client A (untagged on port `e0`), and client E (behind port `xe0` with a +dot1q tag of 10). Client A has a bandwidth of 1Gbit, but client E nominally has a bandwidth of +10Gbit. If I were to want to restrict both clients to, say, 150Mbit, I could do the following: + +``` +vpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit +vpp# policer input name client-a e0 +vpp# policer output name client-a e0 + +vpp# policer add name client-e rate kbps cir 150000 cb 15000000 conform-action transmit +vpp# policer input name client-e xe0.10 +vpp# policer output name client-e xe0.10 +``` + +And here's where I bump into a stubborn VPP dataplane. I would've expected the input and output +packet shaping to occur on both the untagged interface `e0` as well as the tagged interface +`xe0.10`, but alas, the policer only works in one of these four cases. Ouch! + +I read the code around `vnet/src/policer/` and understand the following: +* On input, the policer is applied on `device-input` which is the Phy, not the Sub-Interface. This + explains why the policer works on untagged, but not on tagged interfaces. +* On output, the policer is applied on `ip4-output` and `ip6-output`, which works only for L3 + enabled interfaces, not for L2 ones like the ones in this bridge domain. + +I also tried to work with classifiers, like in the MAC address filtering above -- but I concluded +here as well, that the policer works only on input, not on output. So the mission is now to figure +out how to enable an L2 policer on (1) untagged output, and (2) tagged in- and output. + +**❌ Requirement #3 is not implemented!** + +## What's Next + +It's too bad that policers are a bit fickle. That's quite unfortunate, but I think fixable. 
I've +started a thread on `vpp-dev@` to discuss, and will reach out to Stanislav who added the +_policer output_ capability in commit `e5a3ae0179`. + +Of course, this is just a proof of concept. I typed most of the configuration by hand on the VPP IXP +Gateway, just to show a few of the more advanced features of VPP. For me, this triggered a whole new +line of thinking: classifiers. This extract/match/act pattern can be used in policers, ACLs and +arbitrary traffic redirection through VPP's directed graph (eg. selecting a next node for +processing). I'm going to deep-dive into this classifier behavior in an upcoming article, and see +how I might add this to [[vppcfg](https://github.com/pimvanpelt/vppcfg.git)], because I think it +would be super powerful to abstract away the rather complex underlying API into something a little +bit more ... user friendly. Stay tuned! :) + diff --git a/content/articles/2023-11-11-mellanox-sn2700.md b/content/articles/2023-11-11-mellanox-sn2700.md new file mode 100644 index 0000000..e68f6ce --- /dev/null +++ b/content/articles/2023-11-11-mellanox-sn2700.md @@ -0,0 +1,827 @@ +--- +date: "2023-11-11T13:31:00Z" +title: Debian on Mellanox SN2700 (32x100G) +--- + +# Introduction + +I'm still hunting for a set of machines with which I can generate 1Tbps and 1Gpps of VPP traffic, +and considering a 100G network interface can do at most 148.8Mpps, I will need 7 or 8 of these +network cards. Doing a loadtest like this with DACs back-to-back is definitely possible, but it's a +bit more convenient to connect them all to a switch. However, for this to work I would need (at +least) fourteen or more HundredGigabitEthernet ports, and these switches tend to get expensive, real +quick. + +Or do they? + +## Hardware + +{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-1.png" alt="SN2700" >}} + +I thought I'd ask the #nlnog IRC channel for advice, and of course the usual suspects came past, +such as Juniper, Arista, and Cisco. But somebody mentioned "How about Mellanox, like SN2700?" and I +remembered my buddy Eric was a fan of those switches. I looked them up on the refurbished market and +I found one for EUR 1'400,- for 32x100G which felt suspiciously low priced... but I thought YOLO and +I ordered it. It arrived a few days later via UPS from Denmark to Switzerland. + +The switch specs are pretty impressive, with 32x100G QSFP28 ports, which can be broken out to a set +of sub-ports (each of 1/10/25/50G), with a specified switch throughput of 6.4Tbps in 4.76Gpps, while +only consuming ~150W all-up. + +Further digging revealed that the architecture of this switch consists of two main parts: + +{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-2.png" alt="SN2700" >}} + +1. an AMD64 component with an mSATA disk to boot from, two e1000 network cards, and a single USB + and RJ45 serial port with standard pinout. It has a PCIe connection to a switch board in the + front of the chassis, further more it's equipped with 8GB of RAM in an SO-DIMM, and its CPU is a + two core Celeron(R) CPU 1047UE @ 1.40GHz. + +1. the silicon used in this switch is called _Spectrum_ and identifies itself in Linux as PCI + device `03:00.0` called _Mellanox Technologies MT52100_, so the front dataplane with 32x100G is + separated from the Linux based controlplane. 
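+Jumping ahead a little: once a Linux of any flavor is running on the CPU board, that split is easy to see on the PCIe
+bus. A purely illustrative peek (the `switch` hostname in the prompt is a placeholder):
+
+```
+pim@switch:~$ lspci | grep -i mellanox
+pim@switch:~$ lspci -s 03:00.0 -vnn | head -5
+```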
+ +{{< image width="400px" float="right" src="/assets/mlxsw/sn2700-3.png" alt="SN2700" >}} + +When turning on the device, the serial port comes to life and shows me a BIOS, quickly after which +it jumps into GRUB2 and wants me to install it using something called ONIE. I've heard of that, but +now it's time for me to learn a little bit more about that stuff. I ask around and there's plenty of +ONIE images for this particular type of chip to be found - some are open source, some are semi-open +source (as in: were once available but now are behind paywalls etc). + +Before messing around with the switch and possibly locking myself out or bricking it, I take out the +16GB mSATA and make a copy of it for safe keeping. I feel somewhat invincible by doing this. How bad +could I mess up this switch, if I can just copy back a bitwise backup of the 16GB mSATA? I'm about +to find out, so read on! + +## Software + +The Mellanox SN2700 switch is an ONIE (Open Network Install Environment) based platform that +supports a multitude of operating systems, as well as utilizing the advantages of Open Ethernet and +the capabilities of the Mellanox Spectrum® ASIC. The SN2700 has three modes of operation: + +* Preinstalled with Mellanox Onyx (successor to MLNX-OS Ethernet), a home-grown operating system + utilizing common networking user experiences and industry standard CLI. +* Preinstalled with Cumulus Linux, a revolutionary operating system taking the Linux user experience + from servers to switches and providing a rich routing functionality for large scale applications. +* Provided with a bare ONIE image ready to be installed with the aforementioned or other ONIE-based + operating systems. + +I asked around a bit more and found that there's a few more things one might do with this switch. +One of them is [[SONiC](https://github.com/sonic-net/SONiC)], which stands for _Software for Open +Networking in the Cloud_, and has support for the _Spectrum_ and notably the _SN2700_ switch. Cool! + +I also learned about [[DENT](https://dent.dev)], which utilizes the Linux Kernel, Switchdev, +and other Linux based projects as the basis for building a new standardized network operating system +without abstractions or overhead. Unfortunately, while the _Spectrum_ chipset is known to DENT, this +particular layout on SN2700 is not supported. + +Finally, my buddy `fall0ut` said "why not just Debian with switchdev?" and now my eyes opened wide. +I had not yet come across [[switchdev](https://docs.kernel.org/networking/switchdev.html)], which is +a standard Linux kernel driver model for switch devices which offload the forwarding (data)plane +from the kernel. As it turns out, Mellanox did a really good job writing a switchdev implementation +in the [[linux kernel](https://github.com/torvalds/linux/tree/master/drivers/net/ethernet/mellanox/mlxsw)] +for the Spectrum series of silicon, and it's all upstreamed to the Linux kernel. Wait, what?! + +### Mellanox Switchdev + +I start by reading the [[brochure](/assets/mlxsw/PB_Spectrum_Linux_Switch.pdf)], which shows me the +intentions Mellanox had when designing and marketing these switches. It seems that they really meant +it when they said this thing is a fully customizable Linux switch, check out this paragraph: + +> Once the Mellanox Switchdev driver is loaded into the Linux Kernel, each +> of the switch’s physical ports is registered as a net_device within the +> kernel. 
Using standard Linux tools (for example, bridge, tc, iproute), ports +> can be bridged, bonded, tunneled, divided into VLANs, configured for L3 +> routing and more. Linux switching and routing tables are reflected in the +> switch hardware. Network traffic is then handled directly by the switch. +> Standard Linux networking applications can be natively deployed and +> run on switchdev. This may include open source routing protocol stacks, +> such as Quagga, Bird and XORP, OpenFlow applications, or user-specific +> implementations. + +### Installing Debian on SN2700 + +.. they had me at Bird :) so off I go, to install a vanilla Debian AMD64 Bookworm on a 120G mSATA I +had laying around. After installing it, I noticed that the coveted `mlxsw` driver is not shipped by +default on the Linux kernel image in Debian, so I decide to build my own, letting the [[Debian +docs](https://wiki.debian.org/BuildADebianKernelPackage)] take my hand and guide me through it. + +I find a reference on the Mellanox [[GitHub +wiki](https://github.com/Mellanox/mlxsw/wiki/Installing-a-New-Kernel)] which shows me which kernel +modules to include to successfully use the _Spectrum_ under Linux, so I think I know what to do: + +``` +pim@summer:/usr/src$ sudo apt-get install build-essential linux-source bc kmod cpio flex \ + libncurses5-dev libelf-dev libssl-dev dwarves bison +pim@summer:/usr/src$ sudo apt install linux-source-6.1 +pim@summer:/usr/src$ sudo tar xf linux-source-6.1.tar.xz +pim@summer:/usr/src$ cd linux-source-6.1/ +pim@summer:/usr/src/linux-source-6.1$ sudo cp /boot/config-6.1.0-12-amd64 .config +pim@summer:/usr/src/linux-source-6.1$ cat << EOF | sudo tee -a .config +CONFIG_NET_IPIP=m +CONFIG_NET_IPGRE_DEMUX=m +CONFIG_NET_IPGRE=m +CONFIG_IPV6_GRE=m +CONFIG_IP_MROUTE_MULTIPLE_TABLES=y +CONFIG_IP_MULTIPLE_TABLES=y +CONFIG_IPV6_MULTIPLE_TABLES=y +CONFIG_BRIDGE=m +CONFIG_VLAN_8021Q=m +CONFIG_BRIDGE_VLAN_FILTERING=y +CONFIG_BRIDGE_IGMP_SNOOPING=y +CONFIG_NET_SWITCHDEV=y +CONFIG_NET_DEVLINK=y +CONFIG_MLXFW=m +CONFIG_MLXSW_CORE=m +CONFIG_MLXSW_CORE_HWMON=y +CONFIG_MLXSW_CORE_THERMAL=y +CONFIG_MLXSW_PCI=m +CONFIG_MLXSW_I2C=m +CONFIG_MLXSW_MINIMAL=y +CONFIG_MLXSW_SWITCHX2=m +CONFIG_MLXSW_SPECTRUM=m +CONFIG_MLXSW_SPECTRUM_DCB=y +CONFIG_LEDS_MLXCPLD=m +CONFIG_NET_SCH_PRIO=m +CONFIG_NET_SCH_RED=m +CONFIG_NET_SCH_INGRESS=m +CONFIG_NET_CLS=y +CONFIG_NET_CLS_ACT=y +CONFIG_NET_ACT_MIRRED=m +CONFIG_NET_CLS_MATCHALL=m +CONFIG_NET_CLS_FLOWER=m +CONFIG_NET_ACT_GACT=m +CONFIG_NET_ACT_MIRRED=m +CONFIG_NET_ACT_SAMPLE=m +CONFIG_NET_ACT_VLAN=m +CONFIG_NET_L3_MASTER_DEV=y +CONFIG_NET_VRF=m +EOF +pim@summer:/usr/src/linux-source-6.1$ sudo make menuconfig +pim@summer:/usr/src/linux-source-6.1$ sudo make -j`nproc` bindeb-pkg +``` + +I run a gratuitous `make menuconfig` after adding all those config statements to the end of the +`.config` file, and it figures out how to combine what I wrote before with what was in the file +earlier, and I used the standard Bookworm 6.1 kernel config that came from the default installer, so +that it would be a minimal diff to what Debian itself shipped with. 
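+
+To double-check that the dependency resolution in `make menuconfig` didn't silently drop any of the
+switchdev bits, a quick grep of the resulting `.config` goes a long way. A sketch, with the output
+trimmed to the options that matter most here:
+
+```
+pim@summer:/usr/src/linux-source-6.1$ grep -E 'CONFIG_NET_SWITCHDEV=|CONFIG_MLXSW_(CORE|PCI|SPECTRUM)=' .config
+CONFIG_NET_SWITCHDEV=y
+CONFIG_MLXSW_CORE=m
+CONFIG_MLXSW_PCI=m
+CONFIG_MLXSW_SPECTRUM=m
+```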
+ +After Summer stretches her legs a bit compiling this kernel for me, look at the result: +``` +pim@summer:/usr/src$ dpkg -c linux-image-6.1.55_6.1.55-4_amd64.deb | grep mlxsw +drwxr-xr-x root/root 0 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/ +-rw-r--r-- root/root 414897 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_core.ko +-rw-r--r-- root/root 19721 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_i2c.ko +-rw-r--r-- root/root 31817 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_minimal.ko +-rw-r--r-- root/root 65161 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_pci.ko +-rw-r--r-- root/root 1425065 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_spectrum.ko +``` + +Good job, Summer! On my mSATA disk, I tell Linux to boot its kernel using the following in GRUB, +which will make the kernel not create spiffy interface names like `enp6s0` or `eno1` but just +enumerate them all one by one and call them `eth0` and so on: +``` +pim@fafo:~$ grep GRUB_CMDLINE /etc/default/grub +GRUB_CMDLINE_LINUX_DEFAULT="" +GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0" +``` + +## Mellanox SN2700 running Debian+Switchdev + +{{< image width="400px" float="right" src="/assets/mlxsw/debian.png" alt="Debian" >}} + +I insert the freshly installed Debian Bookworm with custom compiled 6.1.55+mlxsw kernel into the +switch, and it boots on the first try. I see 34 (!) ethernet ports, noting that the first two come +from an Intel NIC but carrying a MAC address from Mellanox (starting with `0c:42:a1`) and the other +32 have a common MAC address (from Mellanox, starting with `04:3f:72`), and what I noticed is that +the MAC addresses here are skipping one between subsequent ports, which leads me to believe that +these 100G ports can be split into two (perhaps 2x50G, 2x40G, 2x25G, 2x10G, which I intend to find +out later). According to the official spec sheet, the switch allows 2-way breakout ports as well as +converter modules, to insert for example a 25G SFP28 into a QSFP28 switchport. + +Honestly, I did not think I would get this far, so I humorously (at least, I think so) decide to +call this switch [[FAFO](https://www.urbandictionary.com/define.php?term=FAFO)]. + +First off, the `mlxsw` driver loaded: +``` +root@fafo:~# lsmod | grep mlx +mlxsw_spectrum 708608 0 +mlxsw_pci 36864 1 mlxsw_spectrum +mlxsw_core 217088 2 mlxsw_pci,mlxsw_spectrum +mlxfw 36864 1 mlxsw_core +vxlan 106496 1 mlxsw_spectrum +ip6_tunnel 45056 1 mlxsw_spectrum +objagg 53248 1 mlxsw_spectrum +psample 20480 1 mlxsw_spectrum +parman 16384 1 mlxsw_spectrum +bridge 311296 1 mlxsw_spectrum +``` + +I run `sensors-detect` and `pwmconfig`, let the fans calibrate and write their config file. The fans +come back down to a more chill (pun intended) speed, and I take a closer look. 
It seems all fans and +all thermometers, including the ones in the QSFP28 cages and the _Spectrum_ switch ASIC are +accounted for: + +``` +root@fafo:~# sensors +coretemp-isa-0000 +Adapter: ISA adapter +Package id 0: +30.0°C (high = +87.0°C, crit = +105.0°C) +Core 0: +29.0°C (high = +87.0°C, crit = +105.0°C) +Core 1: +30.0°C (high = +87.0°C, crit = +105.0°C) + +acpitz-acpi-0 +Adapter: ACPI interface +temp1: +27.8°C (crit = +106.0°C) +temp2: +29.8°C (crit = +106.0°C) + +mlxsw-pci-0300 +Adapter: PCI adapter +fan1: 6239 RPM +fan2: 5378 RPM +fan3: 6268 RPM +fan4: 5378 RPM +fan5: 6326 RPM +fan6: 5442 RPM +fan7: 6268 RPM +fan8: 5315 RPM +temp1: +37.0°C (highest = +41.0°C) +front panel 001: +23.0°C (crit = +73.0°C, emerg = +75.0°C) +front panel 002: +24.0°C (crit = +73.0°C, emerg = +75.0°C) +front panel 003: +23.0°C (crit = +73.0°C, emerg = +75.0°C) +front panel 004: +26.0°C (crit = +73.0°C, emerg = +75.0°C) +... +``` + +From the top, first I see the classic CPU core temps, then an `ACPI` interface which I'm not quite +sure I understand the purpose of (possibly motherboard, but not PSU because pulling one out does not +change any values). Finally, the sensors using driver `mlxsw-pci-0300`, are those on the switch PCB +carrying the _Spectrum_ silicon, and there's a thermometer for each of the QSFP28 cages, possibly +reading from the optic, as most of them are empty except the first four which I inserted optics to. +Slick! + + +{{< image width="500px" float="right" src="/assets/mlxsw/debian-ethernet.png" alt="Ethernet" >}} + +I notice that the ports are in a bit of a weird order. Firstly, eth0-1 are the two 1G ports on the +Debian machine. But then, the rest of the ports are the Mellanox Spectrum ASIC: +* eth2-17 correspond to port 17-32, which seems normal, but +* eth18-19 correspond to port 15-16 +* eth20-21 correspond to port 13-14 +* eth30-31 correspond to port 3-4 +* eth32-33 correspond to port 1-2 + +The switchports are actually sequentially numbered with respect to MAC addresses, with `eth2` +starting at `04:3f:72:74:a9:41` and finally `eth34` having `04:3f:72:74:a9:7f` (for 64 consecutive +MACs). + +Somehow though, the ports are wired in a different way on the front panel. As it turns out, I can +insert a little `udev` ruleset that will take care of this: + +``` +root@fafo:~# cat << EOF > /etc/udev/rules.d/10-local.rules +SUBSYSTEM=="net", ACTION=="add", DRIVERS=="mlxsw_spectrum*", \ + NAME="sw$attr{phys_port_name}" +EOF +``` + +After rebooting the switch, the ports are now called `swp1` .. `swp32` and they also correspond with +their physical ports on the front panel. One way to check this, is using `ethtool --identify swp1` +which will blink the LED of port 1, until I press ^C. Nice. + +### Debian SN2700: Diagnostics + +The first thing I'm curious to try, is if _Link Layer Discovery Protocol_ +[[LLDP](https://en.wikipedia.org/wiki/Link_Layer_Discovery_Protocol)] works. This is a +vendor-neutral protocol that network devices use to advertise their identity to peers over Ethernet. +I install an open source LLDP daemon and plug in a DAC from port1 to a Centec switch in the lab. 
+
+And indeed, quickly after that, I see two devices: the first on the Linux machine `eth0`, which is
+the Unifi switch that has my LAN, and the second is the Centec behind `swp1`:
+
+```
+root@fafo:~# apt-get install lldpd
+root@fafo:~# lldpcli show nei summary
+-------------------------------------------------------------------------------
+LLDP neighbors:
+-------------------------------------------------------------------------------
+Interface: eth0, via: LLDP
+  Chassis:
+    ChassisID: mac 44:d9:e7:05:ff:46
+    SysName: usw6-BasementServerroom
+  Port:
+    PortID: local Port 9
+    PortDescr: fafo.lab
+    TTL: 120
+Interface: swp1, via: LLDP
+  Chassis:
+    ChassisID: mac 60:76:23:00:01:ea
+    SysName: sw3.lab
+  Port:
+    PortID: ifname eth-0-25
+    PortDescr: eth-0-25
+    TTL: 120
+```
+
+With this I learn that the switch forwards these datagrams (ethernet type `0x88CC`) from the
+dataplane to the Linux controlplane. I would call this _punting_ in VPP language, but switchdev
+calls it _trapping_, and I can see the LLDP packets when tcpdumping on ethernet device `swp1`.
+So today I learned how to trap packets :-)
+
+### Debian SN2700: ethtool
+
+One popular diagnostics tool (and hopefully a well known one, because it's awesome) is `ethtool`,
+a command-line tool in Linux for managing network interface devices. It allows me to modify the
+parameters of the ports and their transceivers, as well as query the information of those devices.
+
+Here are a few common examples, all of which work on this switch running Debian:
+
+* `ethtool swp1`: Shows link capabilities (eg. 1G/10G/25G/40G/100G)
+* `ethtool -s swp1 speed 40000 duplex full autoneg off`: Force speed/duplex
+* `ethtool -m swp1`: Shows transceiver diagnostics like SFP+ light levels, link levels
+  (also `--module-info`)
+* `ethtool -p swp1`: Flashes the transceiver port LED (also `--identify`)
+* `ethtool -S swp1`: Shows packet and octet counters, and sizes, discards, errors, and so on
+  (also `--statistics`)
+
+I specifically love the _digital diagnostics monitoring_ (DDM), originally specified in
+[[SFF-8472](https://members.snia.org/document/dl/25916)], which allows me to read the EEPROM of
+optical transceivers and get all sorts of critical diagnostics. I wish DPDK and VPP had that!
+
+### Debian SN2700: devlink
+
+In reading up on the _switchdev_ ecosystem, I stumbled across `devlink`, an API to expose device
+information and resources not directly related to any device class, such as switch ASIC
+configuration. As a fun fact, devlink was written by the same engineer who wrote the `mlxsw` driver
+for Linux, Jiří Pírko. Its documentation can be found in the [[linux
+kernel](https://docs.kernel.org/networking/devlink/index.html)], and it ships with any modern
+`iproute2` distribution. The specific (somewhat terse) documentation of the `mlxsw` driver
+[[lives there](https://docs.kernel.org/networking/devlink/mlxsw.html)] as well.
+
+There's a lot to explore here, but I'll focus my attention on three things:
+
+#### 1. devlink resource
+
+When learning that the switch also does IPv4 and IPv6 routing, I immediately thought: how many
+prefixes can be offloaded to the ASIC? 
One way to find out is to query what types of _resources_ it +has: + +``` +root@fafo:~# devlink resource show pci/0000:03:00.0 +pci/0000:03:00.0: + name kvd size 258048 unit entry dpipe_tables none + resources: + name linear size 98304 occ 1 unit entry size_min 0 size_max 159744 size_gran 128 dpipe_tables none + resources: + name singles size 16384 occ 1 unit entry size_min 0 size_max 159744 size_gran 1 dpipe_tables none + name chunks size 49152 occ 0 unit entry size_min 0 size_max 159744 size_gran 32 dpipe_tables none + name large_chunks size 32768 occ 0 unit entry size_min 0 size_max 159744 size_gran 512 dpipe_tables none + name hash_double size 65408 unit entry size_min 32768 size_max 192512 size_gran 128 dpipe_tables none + name hash_single size 94336 unit entry size_min 65536 size_max 225280 size_gran 128 dpipe_tables none + name span_agents size 3 occ 0 unit entry dpipe_tables none + name counters size 32000 occ 4 unit entry dpipe_tables none + resources: + name rif size 8192 occ 0 unit entry dpipe_tables none + name flow size 23808 occ 4 unit entry dpipe_tables none + name global_policers size 1000 unit entry dpipe_tables none + resources: + name single_rate_policers size 968 occ 0 unit entry dpipe_tables none + name rif_mac_profiles size 1 occ 0 unit entry dpipe_tables none + name rifs size 1000 occ 1 unit entry dpipe_tables none + name physical_ports size 64 occ 36 unit entry dpipe_tables none +``` + +There's a lot to unpack here, but this is a tree of resources, each with names and children. Let me +focus on the first one, called `kvd`, which stands for _Key Value Database_ (in other words, a set +of lookup tables). It contains a bunch of children called `linear`, `hash_double` and `hash_single`. +The kernel [[docs](https://www.kernel.org/doc/Documentation/ABI/testing/devlink-resource-mlxsw)] +explain it in more detail, but this is where the switch will keep its FIB in _Content Addressable +Memory_ (CAM) of certain types of elements of a given length and count. All up, the size is 252KB, +which is not huge, but also certainly not tiny! + +Here I learn that it's subdivided into: +* ***linear***: 96KB bytes of flat memory using an index, further divided into regions: + * ***singles***: 16KB of size 1, nexthops + * ***chunks***: 48KB of size 32, multipath routes with <32 entries + * ***large_chunks***: 32KB of size 512, multipath routes with <512 entries +* ***hash_single***: 92KB bytes of hash table for keys smaller than 64 bits (eg. L2 FIB, IPv4 FIB and + neighbors) +* ***hash_double***: 63KB bytes of hash table for keys larger than 64 bits (eg. IPv6 FIB and + neighbors) + + +#### 2. devlink dpipe + +Now that I know the memory layout and regions of the CAM, I can start making some guesses on the FIB +size. The devlink pipeline debug API (DPIPE) is aimed at providing the user visibility into the +ASIC's pipeline in a generic way. The API is described in detail in the [[kernel +docs](https://docs.kernel.org/networking/devlink/devlink-dpipe.html)]. 
I feel free to take a peek at +the dataplane configuration innards: + +``` +root@fafo:~# devlink dpipe table show pci/0000:03:00.0 +pci/0000:03:00.0: + name mlxsw_erif size 1000 counters_enabled false + match: + type field_exact header mlxsw_meta field erif_port mapping ifindex + action: + type field_modify header mlxsw_meta field l3_forward + type field_modify header mlxsw_meta field l3_drop + name mlxsw_host4 size 0 counters_enabled false resource_path /kvd/hash_single resource_units 1 + match: + type field_exact header mlxsw_meta field erif_port mapping ifindex + type field_exact header ipv4 field destination ip + action: + type field_modify header ethernet field destination mac + name mlxsw_host6 size 0 counters_enabled false resource_path /kvd/hash_double resource_units 2 + match: + type field_exact header mlxsw_meta field erif_port mapping ifindex + type field_exact header ipv6 field destination ip + action: + type field_modify header ethernet field destination mac + name mlxsw_adj size 0 counters_enabled false resource_path /kvd/linear resource_units 1 + match: + type field_exact header mlxsw_meta field adj_index + type field_exact header mlxsw_meta field adj_size + type field_exact header mlxsw_meta field adj_hash_index + action: + type field_modify header ethernet field destination mac + type field_modify header mlxsw_meta field erif_port mapping ifindex +``` + +From this I can puzzle together how the CAM is _actually_ used: + +* ***mlxsw_host4***: matches on the interface port and IPv4 destination IP, using `hash_single` + above with one unit for each entry, and when looking that up, puts the result into the ethernet + destination MAC (in other words, the FIB entry points at an L2 nexthop!) +* ***mlxsw_host6***: matches on the interface port and IPv6 destination IP using `hash_double` + with two units for each entry. +* ***mlxsw_adj***: holds the L2 adjacencies, and the lookup key is an index, size and hash index, + where the returned value is used to rewrite the destination MAC and select the egress port! + + +Now that I know the types of tables and what they are matching on (and then which action they are +performing), I can also take a look at the _actual data_ in the FIB. For example, if I create an IPv4 +interface on the switch and ping a member on directly connected network there, I can see an entry +show up in the L2 adjacency table, like so: + +``` +root@fafo:~# ip addr add 100.65.1.1/30 dev swp31 +root@fafo:~# ping 100.65.1.2 +root@fafo:~# devlink dpipe table dump pci/0000:03:00.0 name mlxsw_host4 +pci/0000:03:00.0: + index 0 + match_value: + type field_exact header mlxsw_meta field erif_port mapping ifindex mapping_value 71 value 1 + type field_exact header ipv4 field destination ip value 100.65.1.2 + action_value: + type field_modify header ethernet field destination mac value b4:96:91:b3:b1:10 +``` + +To decypher what the switch is doing: if the ifindex is `71` (which corresponds to `swp31`), and the +IPv4 destination IP address is `100.65.1.2`, then the destination MAC address will be set to +`b4:96:91:b3:b1:10`, so the switch knows where to send this ethernet datagram. + +And now I have found what I need to know to be able to answer the question of the FIB size. This +switch can take **92K IPv4 routes** and **31.5K IPv6 routes**, and I can even inspect the FIB in great +detail. Rock on! + +#### 3. devlink port split + +But reading the switch chip configuration and FIB is not all that `devlink` can do, it can also make +changes! 
One particularly interesting feature is the ability to split and unsplit ports. What this means
+is that, when you take a 100Gbit port, it internally is divided into four so-called _lanes_ of
+25Gbit each, while a 40Gbit port is internally divided into four _lanes_ of 10Gbit each. Splitting
+ports is the act of taking such a port and reconfiguring its lanes.
+
+Let me show you, by means of example, what splitting the first two switchports might look like. They
+begin their life as 100G ports, which support a number of link speeds, notably: 100G, 50G, 25G, but
+also 40G, 10G, and finally 1G:
+
+```
+root@fafo:~# ethtool swp1
+Settings for swp1:
+  Supported ports: [ FIBRE ]
+  Supported link modes: 1000baseKX/Full
+                        10000baseKR/Full
+                        40000baseCR4/Full
+                        40000baseSR4/Full
+                        40000baseLR4/Full
+                        25000baseCR/Full
+                        25000baseSR/Full
+                        50000baseCR2/Full
+                        100000baseSR4/Full
+                        100000baseCR4/Full
+                        100000baseLR4_ER4/Full
+
+root@fafo:~# devlink port show | grep 'swp[12] '
+pci/0000:03:00.0/61: type eth netdev swp1 flavour physical port 1 splittable true lanes 4
+pci/0000:03:00.0/63: type eth netdev swp2 flavour physical port 2 splittable true lanes 4
+root@fafo:~# devlink port split pci/0000:03:00.0/61 count 4
+[ 629.593819] mlxsw_spectrum 0000:03:00.0 swp1: link down
+[ 629.722731] mlxsw_spectrum 0000:03:00.0 swp2: link down
+[ 630.049709] mlxsw_spectrum 0000:03:00.0: EMAD retries (1/5) (tid=64b1a5870000c726)
+[ 630.092179] mlxsw_spectrum 0000:03:00.0 swp1s0: renamed from eth2
+[ 630.148860] mlxsw_spectrum 0000:03:00.0 swp1s1: renamed from eth2
+[ 630.375401] mlxsw_spectrum 0000:03:00.0 swp1s2: renamed from eth2
+[ 630.375401] mlxsw_spectrum 0000:03:00.0 swp1s3: renamed from eth2
+
+root@fafo:~# ethtool swp1s0
+Settings for swp1s0:
+  Supported ports: [ FIBRE ]
+  Supported link modes: 1000baseKX/Full
+                        10000baseKR/Full
+                        25000baseCR/Full
+                        25000baseSR/Full
+```
+
+Whoa, what just happened here? The switch took the port defined by `pci/0000:03:00.0/61`, which says
+it is _splittable_ and has four lanes, and split it into four NEW ports called `swp1s0`-`swp1s3`,
+and the resulting ports are 25G, 10G or 1G.
+
+{{< image width="100px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
+
+However, I make an important observation. When splitting `swp1` into four, the switch also removed port
+`swp2`, and remember at the beginning of this article I mentioned that the MAC addresses seemed to
+skip one entry between subsequent interfaces? Now I understand why: when splitting the port into two,
+it will use the second MAC address for the second 50G port; but if I split it into four, it'll use
+the MAC addresses from the adjacent port and decommission it. In other words: this switch can do
+32x100G, or 64x50G, or 64x25G/10G/1G.
+
+It doesn't matter which of the PCI interfaces I split on. The operation is also reversible: I can
+issue `devlink port unsplit` to return the port to its aggregate state (eg. 4 lanes and 100Gbit),
+which will remove the `swp1s0-3` ports and put back `swp1` and `swp2` again.
+
+What I find particularly impressive about this is that for most hardware vendors, this splitting of
+ports requires a reboot of the chassis, while here it can happen entirely online. Well done,
+Mellanox!
+
+## Performance
+
+OK, so this all seems to work, but does it work _well_? 
If you're a reader of my blog you'll know
+that I love doing loadtests, so I boot my machine, Hippo, and I connect it with two 100G DACs to the
+switch on ports 31 and 32:
+
+```
+[ 1.354802] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
+[ 1.447677] ice 0000:0c:00.0: firmware: direct-loading firmware intel/ice/ddp/ice.pkg
+[ 1.561979] ice 0000:0c:00.1: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
+[ 7.738198] ice 0000:0c:00.0 enp12s0f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC,
+    Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None
+[ 7.802572] ice 0000:0c:00.1 enp12s0f1: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC,
+    Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None
+```
+
+I hope you're hungry, Hippo, cuz you're about to get fed!
+
+### Debian SN2700: L2
+
+To use the switch in L2 mode, I intuitively create a Linux bridge, say `br0`, and add ports to that.
+From the Mellanox documentation I learn that there can be multiple bridges, each isolated from one
+another, but there can only be _one_ such bridge with `vlan_filtering` set. VLAN Filtering allows
+the switch to only accept tagged frames from a list of _configured_ VLANs, and drop the rest. This
+is what you'd imagine a regular commercial switch would provide.
+
+So off I go, creating the bridge to which I'll add two ports (HundredGigabitEthernet ports swp31 and
+swp32), and I will allow for the maximum MTU size of 9216, also known as [[Jumbo Frames](https://en.wikipedia.org/wiki/Jumbo_frame)].
+
+```
+root@fafo:~# ip link add name br0 type bridge
+root@fafo:~# ip link set br0 type bridge vlan_filtering 1 mtu 9216 up
+root@fafo:~# ip link set swp31 mtu 9216 master br0 up
+root@fafo:~# ip link set swp32 mtu 9216 master br0 up
+```
+
+These two ports are now _access_ ports, that is to say they accept and emit only untagged traffic,
+and due to the `vlan_filtering` flag, they will drop all other frames. Using the standard `bridge`
+utility from Linux, I can manipulate the VLANs on these ports.
+
+First, I'll remove the _default VLAN_ and add VLAN 1234 to both ports, specifying that VLAN 1234 is
+the so-called _Port VLAN ID_ (pvid). This makes them the equivalent of Cisco's _switchport access
+vlan 1234_:
+
+```
+root@fafo:~# bridge vlan del vid 1 dev swp1
+root@fafo:~# bridge vlan del vid 1 dev swp2
+root@fafo:~# bridge vlan add vid 1234 dev swp1 pvid
+root@fafo:~# bridge vlan add vid 1234 dev swp2 pvid
+```
+Then, I'll add a few tagged VLANs to the ports, so that they become the Cisco equivalent of a
+_trunk_ port allowing these tagged VLANs and assuming untagged traffic is still VLAN 1234:
+
+```
+root@fafo:~# for port in swp1 swp2; do for vlan in 100 200 300 400; do \
+    bridge vlan add vid $vlan dev $port; done; done
+root@fafo:~# bridge vlan
+port       vlan-id
+swp1       100 200 300 400
+           1234 PVID
+swp2       100 200 300 400
+           1234 PVID
+br0        1 PVID Egress Untagged
+```
+
+When these commands are run against the interfaces `swp*`, they are picked up by the `mlxsw` kernel
+driver, and transmitted to the _Spectrum_ switch chip; in other words, these commands end up
+programming the silicon. Traffic through these switch ports on the front rarely (if ever) gets
+forwarded to the Linux kernel, very similar to [[VPP](https://fd.io/)]: the traffic stays mostly in
+the dataplane. 
Some traffic, such as LLDP (and as we'll see later, IPv4 ARP and IPv6 neighbor +discovery), will be forwarded from the switch chip over the PCIe link to the kernel, after which the +results are transmitted back via PCIe to program the switch chip L2/L3 _Forwarding Information Base_ +(FIB). + +Now I turn my attention to the loadtest, by configuring T-Rex in L2 Stateless mode. I start a +bidirectional loadtest with 256b packets at 50% of line rate, which looks just fine: + +{{< image src="/assets/mlxsw/trex1.png" alt="Trex L2" >}} + +At this point I can already conclude that this is all happening in the dataplane, as the _Spectrum_ +switch is connected to the Debian machine using a PCIe v3.0 x8 link, which is even obscured by +another device on the PCIe bus, so the Debian kernel is in no way able to process more than a token +amount of traffic, and yet I'm seeing 100Gbit go through the switch chip and the CPU load on the +kernel pretty much zero. I can however retrieve the link statistics using `ip stats`, and those will +show me the actual counters of the silicon, not _just_ the trapped packets. If you'll recall, in VPP +the only packets that the TAP interfaces see are those packets that are _punted_, and the Linux +kernel there is completely oblivious to the total dataplane throughput. Here, the interface is +showing the correct dataplane packet and byte counters, which means that things like SNMP will +automatically just do the right thing. + +``` +root@fafo:~# dmesg | grep 03:00.*bandwidth +[ 2.180410] pci 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s + PCIe x4 link at 0000:00:01.2 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link) + +root@fafo:~# uptime + 03:19:16 up 2 days, 14:14, 1 user, load average: 0.00, 0.00, 0.00 + +root@fafo:~# ip stats show dev swp32 group link +72: swp32: group link + RX: bytes packets errors dropped missed mcast + 5106713943502 15175926564 0 0 0 103 + TX: bytes packets errors dropped carrier collsns + 23464859508367 103495791750 0 0 0 0 +``` + +### Debian SN2700: IPv4 and IPv6 + +I now take a look at the L3 capabilities of the switch. To do this, I simply destroy the bridge +`br0`, which will return the enslaved switchports. I then convert the T-Rex loadtester to use an L3 +profile, and configure the switch as follows: + +``` +root@fafo:~# ip addr add 100.65.1.1/30 dev swp31 +root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev swp31 +root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2 dev swp31 + +root@fafo:~# ip addr add 100.65.2.1/30 dev swp32 +root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev swp32 +root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2 dev swp32 +``` +Several other routers I've loadtested have the same (cosmetic) issue, that T-Rex doesn't reply to +ARP packets after the first few seconds, so I first set the IPv4 address, then add a static L2 +adjacency for the T-Rex side (on MAC `b4:96:91:b3:b1:10`), and route 16.0.0.0/8 to port 0 and I +route 48.0.0.0/8 to port 1 of the loadtester. + +{{< image src="/assets/mlxsw/trex2.png" alt="Trex L3" >}} + +I start a stateless L3 loadtest with 192 byte packets in both directions, and the switch keeps up +just fine. Taking a closer look at the `ip stats` instrumentation, I see that there's the ability to +turn on L3 counters in addition to L2 (ethernet) counters. 
So I do that on my two router ports while +they are happily forwarding 58.9Mpps, and I can now see the difference between dataplane (forwarded in +hardware) and CPU (forwarded by the CPU) + +``` +root@fafo:~# ip stats set dev swp31 l3_stats on +root@fafo:~# ip stats set dev swp32 l3_stats on +root@fafo:~# ip stats show dev swp32 group offload subgroup l3_stats +72: swp32: group offload subgroup l3_stats on used on + RX: bytes packets errors dropped mcast + 270222574848200 1137559577576 0 0 0 + TX: bytes packets errors dropped + 281073635911430 1196677185749 0 0 +root@fafo:~# ip stats show dev swp32 group offload subgroup cpu_hit +72: swp32: group offload subgroup cpu_hit + RX: bytes packets errors dropped missed mcast + 1068742 17810 0 0 0 0 + TX: bytes packets errors dropped carrier collsns + 468546 2191 0 0 0 0 +``` + +The statistics above clearly demonstrate that the lion's share of the packets have been forwarded by +the ASIC, and only a few (notably things like IPv6 neighbor discovery, IPv4 ARP, LLDP, and of course +any traffic _to_ the IP addresses configured on the router) will go to the kernel. + +#### Debian SN2700: BVI (or VLAN Interfaces) + +I've played around a little bit with L2 (switch) and L3 (router) ports, but there is one middle +ground. I'll keep the T-Rex loadtest running in L3 mode, but now I'll reconfigure the switch to put +the ports back into the bridge, each port in its own VLAN, and have so-called _Bridge Virtual +Interface_, also known as VLAN interfaces -- this is where the switch has a bunch of ports together +in a VLAN, but the switch itself has an IPv4 or IPv6 address in that VLAN as well, which can act as +a router. + + +I reconfigure the switch to put the interfaces back into VLAN 1000 and 2000 respectively, and move +the IPv4 addresses and routes there -- so here I go, first putting the switch interfaces back into +L2 mode and adding them to the bridge, each in their own VLAN, by making them access ports: + +``` +root@fafo:~# ip link add name br0 type bridge vlan_filtering 1 +root@fafo:~# ip link set br0 address 04:3f:72:74:a9:7d mtu 9216 up +root@fafo:~# ip link set swp31 master br0 mtu 9216 up +root@fafo:~# ip link set swp32 master br0 mtu 9216 up +root@fafo:~# bridge vlan del vid 1 dev swp31 +root@fafo:~# bridge vlan del vid 1 dev swp32 +root@fafo:~# bridge vlan add vid 1000 dev swp31 pvid +root@fafo:~# bridge vlan add vid 2000 dev swp32 pvid +``` + +From the ASIC specs, I understand that these BVIs need to (re)use a MAC from one of the members, so +the first thing I do is give `br0` the right MAC address. Then I put the switch ports into the bridge, +remove VLAN 1 and put them in their respective VLANs. At this point, the loadtester reports 100% +packet loss, because the two ports can no longer see each other at layer2, and layer3 configs have +been removed. 
But I can restore connectivity with two _BVIs_ as follows:
+
+```
+root@fafo:~# for vlan in 1000 2000; do
+  ip link add link br0 name br0.$vlan type vlan id $vlan
+  bridge vlan add dev br0 vid $vlan self
+  ip link set br0.$vlan up mtu 9216
+done
+
+root@fafo:~# ip addr add 100.65.1.1/24 dev br0.1000
+root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2
+root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev br0.1000
+
+root@fafo:~# ip addr add 100.65.2.1/24 dev br0.2000
+root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2
+root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev br0.2000
+```
+
+And with that, the loadtest shoots back in action:
+{{< image src="/assets/mlxsw/trex3.png" alt="Trex L3 BVI" >}}
+
+First a quick overview of the situation I have created:
+
+```
+root@fafo:~# bridge vlan
+port       vlan-id
+swp31      1000 PVID
+swp32      2000 PVID
+br0        1 PVID Egress Untagged
+
+root@fafo:~# ip -4 ro
+default via 198.19.5.1 dev eth0 onlink rt_trap
+16.0.0.0/8 via 100.65.1.2 dev br0.1000 offload rt_offload
+48.0.0.0/8 via 100.65.2.2 dev br0.2000 offload rt_offload
+100.65.1.0/24 dev br0.1000 proto kernel scope link src 100.65.1.1 rt_offload
+100.65.2.0/24 dev br0.2000 proto kernel scope link src 100.65.2.1 rt_offload
+198.19.5.0/26 dev eth0 proto kernel scope link src 198.19.5.62 rt_trap
+
+root@fafo:~# ip -4 nei
+198.19.5.1 dev eth0 lladdr 00:1e:08:26:ec:f3 REACHABLE
+100.65.1.2 dev br0.1000 lladdr b4:96:91:b3:b1:10 offload PERMANENT
+100.65.2.2 dev br0.2000 lladdr b4:96:91:b3:b1:11 offload PERMANENT
+```
+
+Looking at the situation now, compared to the regular IPv4 L3 loadtest, there is one important
+difference. Now, the switch can have any number of ports in VLAN 1000, which will all amongst
+themselves do L2 forwarding at line rate, and when they need to send IPv4 traffic out, they will ARP
+for the gateway (for example at `100.65.1.1/24`), which will get _trapped_ and forwarded to the CPU,
+after which the ARP reply will go out so that the machines know where to find the gateway. From that
+point on, IPv4 forwarding happens once again in hardware, which can be shown by the keywords
+`rt_offload` in the routing table (`br0`, in the ASIC), compared to the `rt_trap` (`eth0`, in the
+kernel). Similarly for the IPv4 neighbors, the L2 adjacency is programmed into the CAM (the output
+of which I took a look at above), so forwarding can be done directly by the ASIC without
+intervention from the CPU.
+
+As a result, these _VLAN Interfaces_ (which are synonymous with _BVIs_) work at line rate out of
+the box.
+
+### Results
+
+This switch is phenomenal, and Jiří Pírko and the Mellanox team truly outdid themselves with their `mlxsw`
+switchdev implementation. I have in my hands a very affordable 32x100G or 64x(50G, 25G, 10G, 1G) and
+anything in between, with IPv4 and IPv6 forwarding in hardware, with a limited FIB size, not too
+dissimilar from the [[Centec]({% post_url 2022-12-09-oem-switch-2 %})] switches that IPng Networks
+runs in its AS8298 network, albeit without MPLS forwarding capabilities.
+
+Still, for a LAB switch, to better test 25G and 100G topologies, this switch is very good value for
+my money spent, and it runs Debian and is fully configurable with things like Kees and Ansible.
+Considering there's a whole range of 48x10G and 48x25G switches as well from Mellanox, all
+completely open and _officially allowed_ to run OSS stuff on, these make a perfect fit for IPng
+Networks! 
+ +### Acknowledgements + +This article was written after fussing around and finding out, but a few references were particularly +helpful, and I'd like to acknowledge the following super useful sites: + +* [[mlxsw wiki](https://github.com/Mellanox/mlxsw/wiki)] on GitHub +* [[jpirko's kernel driver](https://github.com/jpirko/linux_mlxsw)] on GitHub +* [[SONiC wiki](https://github.com/sonic-net/SONiC/wiki/)] on GitHub +* [[Spectrum Docs](https://www.nvidia.com/en-us/networking/ethernet-switching/spectrum-sn2000/)] on NVIDIA + +And to the community for writing and maintaining this excellent switchdev implementation. diff --git a/content/articles/2023-12-17-defra0-debian.md b/content/articles/2023-12-17-defra0-debian.md new file mode 100644 index 0000000..232be78 --- /dev/null +++ b/content/articles/2023-12-17-defra0-debian.md @@ -0,0 +1,474 @@ +--- +date: "2023-12-17T13:37:00Z" +title: Debian on IPng's VPP Routers +--- + +{{< image width="200px" float="right" src="/assets/debian-vpp/debian-logo.png" alt="Debian" >}} + +# Introduction + +When IPng Networks first built out a european network, I was running the Disaggregated Network +Operating System [[ref](https://www.danosproject.org/)], initially based on AT&T’s “dNOS” software +framework. Over time though, the DANOS project slowed down, and the developers with whom I had a +pretty good relationship all left for greener pastures. + +In 2019, Pierre Pfister (and several others) built a VPP _router sandbox_ [[ref](https://wiki.fd.io/view/VPP_Sandbox/router)], +which graduated into a feature called the Linux Control Plane plugin +[[ref](https://s3-docs.fd.io/vpp/22.10/developer/plugins/lcp.html)]. Lots of folks put in an effort +for the Linux Control Plane, notably Neale Ranns from Cisco (these days Graphiant), and Matt Smith +and Jon Loeliger from Netgate (who ship this as TNSR [[ref](https://netgate.com/tnsr)], check it out!). +I helped as well, by adding a bunch of Netlink handling and VPP->Linux synchronization code, +which I've written about a bunch on this blog in the 2021 VPP development series [[ref]({% post_url +2021-08-12-vpp-1 %})]. + +At the time, Ubuntu and CentOS were the supported platforms, so I installed a bunch of Ubuntu +machines when doing the deploy with my buddy Fred from IP-Max [[ref](https://ip-max.net)]. But as +time went by, I fell back to my old habit of running Debian on hypervisors and VMs for the services +at IPng Networks. After some time automating mostly everything with Ansible and Kees, I got tired of +those places where I needed branches like `if Ubuntu then ... elif Debian then ... elif OpenBSD +then ... else panic`. + +I took stock of the fleet at the end of 2023, and I found the following: + +* ***OpenBSD***: 3 virtual machines, bastion jumphosts connected to Internet and IPng Site Local +* ***Ubuntu***: 4 physical machines, VPP routers (`nlams0`, `defra0`, `chplo0` and `usfmt0`) +* ***Debian***: 22 physical machines and 116 virtual machines, running internal and public services, + almost all of these machines are entirely in IPng Site Local [[ref]({% post_url +2023-03-11-mpls-core %})], not connected to the + internet at all. + +It became clear to me that I could make a small sprint to standardize all physical hardware on +Debian Bookworm, and move away from Ubuntu LTS. In case you're wondering: there's **nothing wrong with +Ubuntu**, although I will admit I'm not a big fan of `snapd` and `cloud-init` but they are easily +disabled. 
I guess with the way the situation evolved in AS8298, I ended up running a fair few more Debian
+physical (and virtual) machines, so I'll make an executive decision to move to Debian. By the way,
+the fun thing about IPng is that being the _Chief of Everything_ (COE), I get to make those calls
+unilaterally :)
+
+## Upgrading to Debian
+
+Luckily, I already have a fair number of VPP routers that have been deployed on Debian (mostly
+_Bullseye_, but one of them is _Bookworm_), and my LAB environment [[ref]({% post_url
+2022-10-14-lab-1 %})] is running Debian Bookworm as well. Although its native habitat is Ubuntu, I
+regularly run VPP in a Debian environment, for example when Adrian contributed the MPLS code
+[[ref]({% post_url 2023-05-21-vpp-mpls-3 %})], he also recommended Debian 12, because that ships
+with a modern libnl which supports a few bits and pieces he needed.
+
+### Preparations
+
+{{< image width="300px" float="right" src="/assets/debian-vpp/defra0.png" alt="Frankfurt" >}}
+
+OK, while my network is not large, it's also not completely devoid of customers, so instead of a
+YOLO, I decide to make an action plan that roughly looks like this:
+
+1. Notify customers of upcoming maintenance
+1. For each of the routers to-be-upgraded:
+   1. Check the borgmatic daily backups
+   1. Drain traffic away from the router
+   1. Use IPMI to re-install it remotely
+   1. Put the VPP, Bird, SSH configs back
+   1. Undrain the router
+1. Drink my advent-calendar tea!
+
+When deploying a datacenter site, I am adamant about having a consistent and dependable environment. At
+each site, specifically those that are a bit further away, I deploy a standard issue PCEngines
+APU [[ref](https://pcengines.ch/)] with 802.11ac WiFi, serial, and IPMI access to any machine that may be
+there. If you ever visit a datacenter floor where I'm present, look for an SSID like `AS8298 FRA` in the
+case of the Frankfurt site. The password is `IPngGuest`; you're welcome to some bits of bandwidth in a
+pinch :)
+
+You can find the APU in the picture to the right. All the way at the top, you'll see a
+small blue machine with two antennas sticking out. It's connected to my carrier, AS25091's packet
+factory Cisco ASR9010, for out of band connectivity. Then, all the way at the bottom, you can see my
+Supermicro SYS-5018D-FN8T called `defra0.ipng.ch` paired with a Centec MPLS switch for transport and
+breakout ports 😍.
+
+When I installed all of this kit, I did two specific things that will greatly benefit me now:
+
+1. I enabled IPMI KVM and Serial-over-LAN on the Supermicro, so I can reach it over its dedicated
+   IPMI port, and see what its VGA does. Also, in case anything weird happens to VPP and/or the
+   Centec switches and IPng Site Local becomes unavailable, I can still log in and take a look via serial.
+2. I installed Samba on the APU, which allows me to instruct the IPMI to insert a virtual USB 'stick' by
+   means of mounting a SAMBA share. This is incredibly useful in scenarios such as this reinstall!
+
+Although I do trust it, I would hate to reboot the machine to find that IPMI or serial doesn't
+work. So let me make sure that the machine is still good to go:
+
+```
+pim@summer:~$ ssh -L 8443:defra0-ipmi:443 cons0.defra0
+pim@cons0-defra0:~$ ipmitool -I lanplus -H defra0-ipmi -U ${IPMI_USER} -P ${IPMI_PASS} sol activate
+[SOL Session operational. Use ~? for help]
+
+defra0 login:
+```
+
+Nice going! 
Checking the samba configuration, it is super straightforward: +``` +pim@cons0-defra0:~$ cat /etc/samba/smb.conf +[global] +workgroup = WINSHARE +server string = Ubuntu Samba %v +netbios name = console +security = user +map to guest = bad user +dns proxy = no +server min protocol = NT1 +#============================ Share Definitions ============================== + +[share] +path = /var/samba +browsable = yes +writable = no +guest ok = yes +read only = yes + +pim@cons0-defra0:/var/samba$ ls -lrt +total 2306000 +-rw-r--r-- 1 pim pim 441450496 Feb 10 2021 danos-2012-base-amd64.iso +-rw-r--r-- 1 pim pim 1261371392 Aug 24 2021 ubuntu-20.04.3-live-server-amd64.iso +-rw-r--r-- 1 pim pim 658505728 Dec 17 17:20 debian-12.4.0-amd64-netinst.iso + +pim@cons0-defra0:~$ ip -br a +internal UP 172.16.13.1/24 fd25:8c03:9b1c:100d::1/64 fe80::b49b:1cff:feb2:7f2f/64 +external UP 46.20.246.50/29 2a02:2528:ff01::2/64 fe80::d8fe:8ff:fe73:8c99/64 +wlp4s0 UP 172.16.14.1/24 fd25:8c03:9b1c:100e::1/64 fe80::6f0:21ff:fe9b:562e/64 +``` + +You can see the lifecycle progression on this server. In Feb'21, I installed DANOS 20.12, then +moving to Ubuntu LTS 20.04 around Aug'21, and now it is time to advance once again, this time to +Debian 12. + +As a final pre-flight check, while using the port forwarding I set up (`-L` flag above), I will log +in to the IPMI controller remotely, to insert this CD image into the virtual CDROM drive, like so: + +{{< image src="/assets/debian-vpp/supermicro-ipmi.png" alt="IPMI" >}} + +And indeed, it pops up in the running Ubuntu router: +``` +pim@defra0:~$ uname -a +Linux defra0 5.4.0-109-generic #123-Ubuntu SMP Fri Apr 8 09:10:54 UTC 2022 x86_64 GNU/Linux +pim@defra0:~$ uptime + 15:51:10 up 600 days, 17:40, 1 user, load average: 3.44, 3.30, 3.31 +pim@defra0:~$ dmesg | tail -10 +[51852396.194030] usb 2-4.2: New USB device strings: Mfr=0, Product=0, SerialNumber=0 +[51852396.215804] usb-storage 2-4.2:1.0: USB Mass Storage device detected +[51852396.215993] scsi host6: usb-storage 2-4.2:1.0 +[51852396.216107] usbcore: registered new interface driver usb-storage +[51852396.219915] usbcore: registered new interface driver uas +[51852396.232081] scsi 6:0:0:0: CD-ROM ATEN Virtual CDROM YS0J PQ: 0 ANSI: 0 CCS +[51852396.232475] scsi 6:0:0:0: Attached scsi generic sg1 type 5 +[51852396.251038] sr 6:0:0:0: [sr0] scsi3-mmc drive: 40x/40x cd/rw xa/form2 cdda tray +[51852396.251047] cdrom: Uniform CD-ROM driver Revision: 3.20 +[51852396.267643] sr 6:0:0:0: Attached scsi CD-ROM sr0 +``` + +I just love it when this stuff works. And it's nice to see the happenstance of the machine being up +for 600 days. Good power, great operating system and awesome hosting provider. Thanks for the service +so far, my sweet little Ubuntu router ❤️ ! + +## Installing + +### Drain + +Considering there is live traffic on the network, typically what an operator would do is drain the +links to route around the maintenance. To do this in my case, I need to make two changes, notably +draining OSPF and eBGP. + +***OSPF***: In AS8298, all backbone connections use OSPF, and typically traffic from Zurich to Amsterdam +will be over Frankfurt because the OSPF cost is slightly lower than the other way around. I've +decided to standardize the OSPF link cost to be in tenths of milliseconds. In other words, if the +latency from `chrma0` to `defra0` is 5.6 ms, the OSPF cost will be 56. One way for me to avoid using +the Frankfurt router, is to make the cost of all traffic in- and out of the router be synthetically +high. 
I do this by adding +1000 to the OSPF cost.
+
+***BGP***: But there are also a bunch of internet exchanges (such as Kleyrex, DE-CIX and LoCIX), and two IP
+transit upstreams (IP-Max and Meerfarbig) connected to this router in Frankfurt. I do not want
+them to send IPng any traffic here during the maintenance, so I will drain eBGP as well by setting the
+groups to _shutdown_ state in Kees.
+
+```
+pim@squanchy:~/src/ipng-kees$ git diff
+diff --git a/config/defra0.ipng.ch.yaml b/config/defra0.ipng.ch.yaml
+index 869058c..105630c 100644
+--- a/config/defra0.ipng.ch.yaml
++++ b/config/defra0.ipng.ch.yaml
+@@ -151,12 +151,13 @@ vppcfg:
+ ospf:
+   xe1-0.304:
+     description: chrma0
+-    cost: 56
++    cost: 1056
+   xe1-1.302:
+     description: defra0
+-    cost: 61
++    cost: 1061
+ 
+ ebgp:
++  shutdown: true
+   groups:
+     decix_dus:
+       local-addresses: [ 185.1.171.43/23, 2001:7f8:9e::206a:0:1/64 ]
+```
+
+By raising the OSPF cost, the network will route around the machine that I want to play with:
+
+```
+pim@squanchy:~/src/ipng-kees$ traceroute nlams0.ipng.ch
+traceroute to nlams0.ipng.ch (194.1.163.32), 64 hops max, 40 byte packets
+ 1 chbtl0 (194.1.163.66) 0.492 ms 0.64 ms 0.615 ms
+ 2 chrma0 (194.1.163.17) 1.268 ms 1.196 ms 1.194 ms
+ 3 chplo0 (194.1.163.51) 5.682 ms 5.514 ms 5.603 ms
+ 4 frpar0 (194.1.163.40) 14.481 ms 14.605 ms 14.58 ms
+ 5 frggh0 (194.1.163.30) 19.545 ms 18.61 ms 18.684 ms
+ 6 nlams0 (194.1.163.32) 47.613 ms 47.765 ms 47.584 ms
+```
+
+And by setting the sessions to _shutdown_, Kees will regenerate all of the BGP sessions
+with an `export none` and a low `bgp_local_pref`, which will make the router itself stop announcing
+any prefixes, for example this session in Düsseldorf:
+
+```
+@@ -25,11 +25,11 @@ protocol bgp decix_dus_56890_ipv4_1 {
+   source address 185.1.171.43;
+   neighbor 185.1.170.252 as 56890;
+   default bgp_med 0;
+-  default bgp_local_pref 200;
++  default bgp_local_pref 0;    # shutdown
+   ipv4 {
+     import keep filtered;
+     import filter ebgp_decix_dus_56890_import;
+-    export filter ebgp_decix_dus_56890_export;
++    export none;    # shutdown
+     receive limit 250000 action restart;
+     next hop self on;
+   };
+```
+
+{{< image width="80px" float="left" src="/assets/debian-vpp/warning.png" alt="Warning" >}}
+
+This is where it's a good idea to grab some tea. Quite a few internet providers have
+incredibly slow convergence, so just stopping the announcement of `AS8298:AS-IPNG` prefixes at
+this internet exchange doesn't mean things get updated quickly everywhere. It makes sense to wait a few
+minutes (by default I wait 15min) so that every router that might be a slow-poke (I'm looking at
+you, Juniper!) has time to update their RIB and FIB.
+
+VPP itself pretty immediately flips all of its paths to other places, and it converges a full table
+of 950K IPv4 and 195K IPv6 routes in about 7 seconds or so, but not everybody has such fast CPUs in
+their vendor-silicon-fancypants-router :-)
+
+### Upgrade
+
+The tea in my advent calendar for December 17th is _Whittard's Lemon & Ginger_ infusion, and it is
+delicious. What could possibly go wrong?! Now that the router is fully drained, I start a ping to
+the loopback, and flip the virtual powerswitch on the IPMI console. A few seconds later, the machine
+expectedly stops pinging and ... 
the world doesn't end, my SSH session to a hypervisor in Amsterdam +is still alive, and most importantly, Spotify is still playing music: + +``` +pim@squanchy:~/src/ipng-kees$ ping defra0.ipng.ch +PING defra0.ipng.ch (194.1.163.7): 56 data bytes +64 bytes from 194.1.163.7: icmp_seq=0 ttl=62 time=6.3 ms +64 bytes from 194.1.163.7: icmp_seq=1 ttl=62 time=6.5 ms +64 bytes from 194.1.163.7: icmp_seq=2 ttl=62 time=6.2 ms +... +``` + +I open the IPMI KVM console, hit F10 and select the CDROM option, which has my previously inserted +Debian 12 _netinst_ ISO: + +{{< image src="/assets/debian-vpp/debian-ipmi.png" alt="Debian on IPMI" >}} + +At this point I can't help but smile. I'm sitting here in Brüttisellen, roughly 400km south of +this computer in Frankfurt, and I am looking at the VGA output of a fresh Debian installer. Come on, +you have to admit, that's pretty slick! Installing Debian follows pretty precisely my previous VPP#7 +article [[ref]({% post_url 2021-09-21-vpp-7 %})]. I go through the installer options and a few +minutes later, it's mission accomplished. I give the router its IPv4/IPv6 address in _IPng Site +Local_, so that it has management network connectivity, and just before it wants to reboot, I +quickly edit `/etc/default/grub` to turn on serial output, just like in the article: + +``` +GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 isolcpus=1,2,3,5,6,7" +GRUB_TERMINAL=serial +GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200 --stop=1 --parity=no --word=8" +``` + +As the machine reboots, I eject the CDROM from the IPMI web interface, and attach to the +serial-over-lan interface instead. Booyah, it boots! + +### Configure + +On my workstation, I mount yesterday's Borg backup for the machine, because instead of doing the +whole router build over from scratch, I'm going to selectively copy a few bits and pieces over, in +the interest of time. Also, it's nice to actually use borgbackup for once, although Fred and I have +made grateful use of it in an emergency when one of IP-Max's hypervisors failed in Geneva. + +``` +pim@summer:~$ sudo borg mount ssh://${BORG_REPO}/defra0.ipng.ch/ /var/borgbackup/ +Enter passphrase for key ssh://${BORG_REPO}/defra0.ipng.ch: + +pim@summer:~$ sudo ls -l /var/borgbackup/defra0-2023-12-17T01:45:47.983599 +bin boot cdrom etc home lib lib32 lib64 libx32 lost+found media mnt opt +root sbin srv tmp usr var +``` + +In case you're wondering why I mount the backup as root, it's because that way I can guarantee all +the correct users/permissions etc are present in the restore. 
I've done a practice run of the +upgrade, yesterday, at `chplo0.ipng.ch`, so by now I think I have a pretty good handle on what needs +to happen, so while connected to the freshly installed Debian Bookworm machine via serial-over-lan, +here's what I do: + +``` +root@defra0:~# apt install sudo rsync net-tools traceroute snmpd snmp iptables ipmitool bird2 \ + lm-sensors netplan.io build-essential borgmatic unbound tcpdump \ + libnl-3-200 libnl-route-3-200 + +root@defra0:~# adduser pim sudo +root@defra0:~# adduser pim bird +root@defra0:~# systemctl stop bird; systemctl disable bird; systemctl mask bird +root@defra0:~# sensors-detect --auto +root@defra0:~# export REPO=summer.net.ipng.ch:/var/borgbackup/defra0-2023-12-17T01:45:47.983599 + +root@defra0:~# mv /etc/network/interfaces /etc/network/interfaces.orig +root@defra0:~# rsync -avugP $REPO/etc/netplan/ /etc/netplan/ + +root@defra0:~# rm -f /etc/ssh/ssh_host* +root@defra0:~# rsync -avugP $REPO/etc/ssh/ssh_host* /etc/ssh/ + +root@defra0:~# rsync -avugP $REPO/etc/sysctl.d/80* /etc/sysctl.d/ +root@defra0:~# rsync -avugP $REPO/etc/bird/ /etc/bird/ +root@defra0:~# rsync -avugP $REPO/etc/vpp/ /etc/vpp/ +root@defra0:~# rsync -avugP $REPO/etc/borgmatic/ /etc/borgmatic/ +root@defra0:~# rsync -avugP $REPO/etc/rc.local /etc/rc.local +root@defra0:~# rsync -avugP $REPO/lib/systemd/system/*dataplane* /lib/systemd/system +``` + +I decide to selectively copy only the specific configuration files necessary to boot the dataplane. +This means the systemd services (like snmpd, sshd, and their network namespace), and all the Bird +and VPP config files. Because I prefer not to have to clear the SSH host keys, I also copy the old +SSH host keys over. And considering IPng Networks standardizes on netplan for interface config, I'll +move the Debian-default `interfaces` out of the way. + +Finally, I add a few finishing touches and reboot one last time to ensure things are settled: + +``` +root@defra0:~# cat << EOF | tee -a /etc/modules +coretemp +mpls_router +vfio_pci +EOF +root@defra0:~# update-initramfs -k all -u +root@defra0:~# update-grub + +root@defra0:~# mkdir -p /etc/systemd/system/unbound.service.d/ +root@defra0:~# mkdir -p /etc/systemd/system/snmpd.service.d/ +root@defra0:~# cat << EOF | tee /etc/systemd/system/unbound.service.d/override.conf +[Service] +NetworkNamespacePath=/var/run/netns/dataplane +EOF +root@defra0:~# cp /etc/systemd/system/unbound.service.d/override.conf \ + /etc/systemd/system/snmpd.service.d/override.conf +root@defra0:~# reboot +``` + +The machine once again comes up, and now it's loaded the VFIO and MPLS kernel modules, so I'm ready +for the grand finale, which is installing VPP at the same version as the other routers in the fleet: + +``` +root@defra0:~# mkdir -p /var/log/vpp/ +root@defra0:~# wget -m --no-parent https://ipng.ch/media/vpp/bookworm/24.02-rc0~175-g31d4891cf/ +root@defra0:~# dpkg -i ipng.ch/media/vpp/bookworm/24.02-rc0~175-g31d4891cf/*.deb +root@defra0:~# adduser pim vpp +root@defra0:~# vppctl show version +vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 +``` + +In the corner of my eye, I see one of my xterms move. Hah! It's the ping I left running on +squanchy before, check it out: + +``` +pim@squanchy:~/src/ipng-kees$ ping defra0.ipng.ch +PING defra0.ipng.ch (194.1.163.7): 56 data bytes +64 bytes from 194.1.163.7: icmp_seq=0 ttl=62 time=6.3 ms +64 bytes from 194.1.163.7: icmp_seq=1 ttl=62 time=6.5 ms +64 bytes from 194.1.163.7: icmp_seq=2 ttl=62 time=6.2 ms +... 
+64 bytes from 194.1.163.7: icmp_seq=1484 ttl=62 time=6.5 ms
+64 bytes from 194.1.163.7: icmp_seq=1485 ttl=62 time=6.6 ms
+64 bytes from 194.1.163.7: icmp_seq=1486 ttl=62 time=6.8 ms
+```
+
+One think-o I made is that the Bird configs that I just put back from the backup were those from before I
+set the drains (remember, raising the OSPF cost and setting the EBGP sessions to _shutdown_) so they
+are now all alive again. But it's all good - the dataplane came up, Bird2 came up and formed OSPF
+and OSPFv3 adjacencies a few seconds later, and BGP sessions all shot to life. I take a quick look
+at the state of the dataplane to make sure I'm not accidentally introducing a broken router:
+
+```
+pim@defra0:~$ birdc show route count
+BIRD 2.0.12 ready.
+6782372 of 6782372 routes for 958020 networks in table master4
+1848350 of 1848350 routes for 198255 networks in table master6
+1620753 of 1620753 routes for 405189 networks in table t_roa4
+367875 of 367875 routes for 91969 networks in table t_roa6
+Total: 10619350 of 10619350 routes for 1653433 networks in 4 tables
+
+pim@defra0:~$ vppctl show ip fib summary | awk '{ TOTAL += $2 } END { print TOTAL }'
+958664
+pim@defra0:~$ vppctl show ip6 fib summary | awk '{ TOTAL += $2 } END { print TOTAL }'
+198322
+```
+
+OK, looking at the output I can conclude that my think-o was benign and the router has all routes
+accounted for in the RIB, it has slurped in the RPKI tables, and it has successfully transferred all
+of this into VPP's FIB. So this entire upgrade took 1482 seconds, which is just under 25 minutes.
+***Gnarly!***
+
+### Post Install
+
+The machine is up and running, and there's one last thing for me to do, which is perform an Ansible
+run to make sure that the whole machine is configured correctly (for example, the correct access
+list for _Unbound_, the correct IPv4/IPv6 firewall for the Linux controlplane, the correct SSH
+daemon options, working mailer and NTP daemon, et cetera).
+
+So I fire off a one-shot Ansible playbook run, and it pokes and prods the machine a bit:
+
+{{< image src="/assets/debian-vpp/ansible.png" alt="Ansible" >}}
+
+Now the machine is completely up-to-snuff, its latest VPP SNMP agent, Prometheus exporter, Bird
+exporter, and so on are all good. I check LibreNMS and indeed, the machine is back with a half an
+hour or so of monitoring data missing. I'm still grinning as I write this, as most Juniper and Cisco
+_firmware_ upgrades take more than 30min, while for me the whole thing from start to finish was less
+than that.
+
+## Results
+
+{{< image src="/assets/debian-vpp/smokeping.png" alt="Smokeping" >}}
+
+This article describes how I managed to upgrade the entire network of routers, remotely, from the
+comfort of my home, while sipping tea, and without having a single network outage. The bump in the
+graph is the moment at which I drained `defra0` and traffic from the monitoring machine at `nlams0`
+had to go via France to my house at `chbtl0`. No packets were lost in the making of this upgrade!
+
+Yesterday I practiced on `chplo0`, and today for this article I did `defra0`, after which I also
+did the last remaining router `nlams0`.
Every router is now up to date running Debian Bookworm as
+well as VPP version 24.02 (including a bunch of desirable fixes for IPFIX/Flowprobe):
+
+```
+pim@squanchy:~/src/ipng-kees$ ./doall.sh 'echo -n $(hostname -s):\ ; vppctl show version'
+chbtl0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
+chbtl1: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
+chgtg0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
+chplo0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
+chrma0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
+ddln0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
+ddln1: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
+defra0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
+frggh0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
+frpar0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
+nlams0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52
+usfmt0: vpp v24.02-rc0~175-g31d4891cf built by pim on bullseye-builder at 2023-12-09T16:27:33
+```
+
+For the hawk-eyed, yes `usfmt0` has not been done. I don't have a Supermicro with IPMI there, so the
+next time I visit California, I'll make a stop at the local Hurricane Electric datacenter to upgrade
+that last one :-)
diff --git a/content/articles/2024-01-27-vpp-papi.md b/content/articles/2024-01-27-vpp-papi.md
new file mode 100644
index 0000000..e65ad52
--- /dev/null
+++ b/content/articles/2024-01-27-vpp-papi.md
@@ -0,0 +1,717 @@
+---
+date: "2024-01-27T10:01:14Z"
+title: VPP Python API
+---
+
+{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
+
+# About this series
+
+Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
+performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
+_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
+are shared between the two.
+
+You'll hear me talk about VPP being API centric, with no configuration persistence, and that's by
+design. However, there is also this CLI utility called `vppctl`, right, so what gives? In truth,
+the CLI is used a lot by folks to configure their dataplane, but it really was always meant to be
+a debug utility. There's a whole wealth of programmability that is _not_ exposed via the CLI at all,
+and the VPP community develops and maintains an elaborate set of tools to allow external programs
+to (re)configure the dataplane. One such tool is my own [[vppcfg]({% post_url 2022-04-02-vppcfg-2
+%})] which takes a YAML specification that describes the dataplane configuration, and applies it
+safely to a running VPP instance.
+
+## Introduction
+
+In case you're interested in writing your own automation, this article is for you! I'll provide a
+deep dive into the Python API which ships with VPP.
It's actually very easy to use once you get used
+to it -- assuming you know a little bit of Python of course :)
+
+### VPP API: Anatomy
+
+When developers write their VPP features, they'll add an API definition file that describes
+control-plane messages that are typically called via a shared memory interface, which explains why
+these things are called _memclnt_ in VPP. Certain API _types_ can be created, resembling their
+underlying C structures, and these types are passed along in _messages_. Finally, a _service_ is a
+**Request/Reply** pair of messages. When requests are received, VPP executes a _handler_ whose job it
+is to parse the request and send either a singular reply, or a _stream_ of replies (like a list of
+interfaces).
+
+Clients connect to a unix domain socket, typically `/run/vpp/api.sock`. A TCP port can also
+be used, with the caveat that there is no access control provided. Messages are exchanged over this
+channel _asynchronously_. A common pattern of async API design is to have a client identifier
+(called a _client_index_) and some random number (called a _context_) with which the client
+identifies their request. Using these two things, VPP will issue a callback using (a) the
+_client_index_ to send the reply to, and (b) the client will know which _context_ the reply is meant for.
+
+By the way, this _asynchronous_ design pattern gives programmers one really cool benefit out of the
+box: events that are not explicitly requested, like say, link-state change on an interface, can now
+be implemented by simply registering a standing callback for a certain message type - I'll show how
+that works at the end of this article. As a result, any number of clients, their requests and even
+arbitrary VPP initiated events can be in flight at the same time, which is pretty slick!
+
+#### API Types
+
+Most API requests pass along datastructures, which follow their internal representation in VPP. I'll start
+by taking a look at a simple example -- the VPE itself. It defines a few things in
+`src/vpp/vpe_types.api`, notably a few type definitions and one enum:
+
+```
+typedef version
+{
+  u32 major;
+  u32 minor;
+  u32 patch;
+
+  /* since we can't guarantee that only fixed length args will follow the typedef,
+     string type not supported for typedef for now. */
+  u8 pre_release[17];           /* 16 + "\0" */
+  u8 build_metadata[17];        /* 16 + "\0" */
+};
+
+typedef f64 timestamp;
+typedef f64 timedelta;
+
+enum log_level {
+  VPE_API_LOG_LEVEL_EMERG = 0,      /* emerg */
+  VPE_API_LOG_LEVEL_ALERT = 1,      /* alert */
+  VPE_API_LOG_LEVEL_CRIT = 2,       /* crit */
+  VPE_API_LOG_LEVEL_ERR = 3,        /* err */
+  VPE_API_LOG_LEVEL_WARNING = 4,    /* warn */
+  VPE_API_LOG_LEVEL_NOTICE = 5,     /* notice */
+  VPE_API_LOG_LEVEL_INFO = 6,       /* info */
+  VPE_API_LOG_LEVEL_DEBUG = 7,      /* debug */
+  VPE_API_LOG_LEVEL_DISABLED = 8,   /* disabled */
+};
+```
+
+By doing this, API requests and replies can start referring to these types. When reading this, it
+feels a bit like a C header file, showing me the structure. For example, I know that if I ever need
+to pass along an argument called `log_level`, I know which values I can provide, together with their
+meaning.
+
+#### API Messages
+
+I now take a look at `src/vpp/api/vpe.api` itself -- this is where the VPE API is defined. It includes
+the former `vpe_types.api` file, so it can reference these typedefs and the enum.
Here, I see a few +messages defined that constitute a **Request/Reply** pair: + +``` +define show_version +{ + u32 client_index; + u32 context; +}; + +define show_version_reply +{ + u32 context; + i32 retval; + string program[32]; + string version[32]; + string build_date[32]; + string build_directory[256]; +}; +``` + +There's one small surprise here out of the gate. I would've expected that beautiful typedef called +`version` from the `vpe_types.api` file to make an appearance, but it's conspicuously missing from +the `show_version_reply` message. Ha! But the rest of it seems reasonably self-explanatory -- as I +already know about the _client_index_ and _context_ fields, I now know that this request does not +carry any arguments, and that the reply has a _retval_ for application errors, similar to how most +libC functions return 0 on success, and some negative value error number defined in +[[errno.h](https://en.wikipedia.org/wiki/Errno.h)]. Then, there are four strings of the given +length, which I should be able to consume. + +#### API Services + +The VPP API defines three types of message exchanges: + +1. **Request/Reply** - The client sends a request message and the server replies with a single reply + message. The convention is that the reply message is named as `method_name + _reply`. + +1. **Dump/Detail** - The client sends a “bulk” request message to the server, and the server replies + with a set of detail messages. These messages may be of different type. The method name must end + with `method + _dump`, the reply message should be named `method + _details`. These Dump/Detail + methods are typically used for acquiring bulk information, like the complete FIB table. + +1. **Events** - The client can register for getting asynchronous notifications from the server. This + is useful for getting interface state changes, and so on. The method name for requesting + notifications is conventionally prefixed with `want_`, for example `want_interface_events`. + +If the convention is kept, the API machinery will correlate the `foo` and `foo_reply` messages into +RPC services. But it's also possible to be explicit about these, by defining _service_ scopes in the +`*.api` files. I'll take two examples, the first one is from the Linux Control Plane plugin (which +I've [[written about]({% post_url 2021-08-12-vpp-1 %})] a lot while I was contributing to it back in +2021). + +**Dump/Detail (example)**: When enumerating _Linux Interface Pairs_, the service definition looks like +this: + +``` +service { + rpc lcp_itf_pair_get returns lcp_itf_pair_get_reply + stream lcp_itf_pair_details; +}; +``` + +To puzzle this together, the request called `lcp_itf_pair_get` is paired up with a reply called +`lcp_itf_pair_get_reply` followed by a stream of zero-or-more `lcp_itf_pair_details` messages. Note +the use of the pattern _rpc_ X _returns_ Y _stream_ Z. + + +**Events (example)**: I also take a look at an event handler like the one in the interface API that +made an appearance in my list of API message types, above: + +``` +service { + rpc want_interface_events returns want_interface_events_reply + events sw_interface_event; +}; +``` + +Here, the request is `want_interface_events` which returns a `want_interface_events_reply` followed +by zero or more `sw_interface_event` messages, which is very similar to the streaming (dump/detail) +pattern. 
The semantic difference is that streams are lists of things and events are asynchronously
+happening things in the dataplane -- in other words the _stream_ is meant to end while the _events_
+messages are generated by VPP when the event occurs. In this case, if an interface is created or
+deleted, or the link state of an interface changes, one of these is sent from VPP to the client(s)
+that registered an interest in it by calling the `want_interface_events` RPC.
+
+#### JSON Representation
+
+VPP comes with an internal API compiler that scans the source code for these `*.api` files and
+assembles them into a few output formats. I take a look at the Python implementation of it in
+`src/tools/vppapigen/` and see that it generates C, Go and JSON. As an aside, I chuckle a little bit
+at a Python script generating Go and C, but I quickly get over myself. I'm not that funny.
+
+The `vppapigen` tool outputs a bunch of JSON files, one per API specification, which wraps up all of
+the information from the _types_, _unions_ and _enums_, the _message_ and _service_ definitions,
+together with a few other bits and bobs, and when VPP is installed, these end up in
+`/usr/share/vpp/api/`. As of the upcoming VPP 24.02 release, there's about 50 of these _core_ APIs
+and an additional 80 or so APIs defined by _plugins_ like the Linux Control Plane.
+
+Implementing APIs is pretty user friendly, largely due to the `vppapigen` tool taking so much of the
+boilerplate and autogenerating things. As an example, I need to be able to enumerate the interfaces
+that are MPLS enabled, so that I can use my [[vppcfg]({% post_url 2022-03-27-vppcfg-1 %})] utility to
+configure MPLS. I contributed an API called `mpls_interface_dump` which returns a
+stream of `mpls_interface_details` messages. You can see that small contribution in the merged [[Gerrit
+39022](https://gerrit.fd.io/r/c/vpp/+/39022)].
+
+## VPP Python API
+
+The VPP API has been ported to many languages (C, C++, Go, Lua, Rust, Python, probably a few others).
+I am primarily a user of the Python API, which ships alongside VPP in a separate Debian package. The
+source code lives in `src/vpp-api/python/` which doesn't have any dependencies other than Python's
+own `setuptools`. Its implementation is canonically called `vpp_papi`, which, I cannot tell a lie,
+reminds me of Spanish rap music. But, if you're still reading, maybe now is a good time to depart
+from the fundamental, and get to the practical!
+
+### Example: Hello World
+
+Without further ado, I dive right in with this tiny program:
+
+```python
+from vpp_papi import VPPApiClient, VPPApiJSONFiles
+
+vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
+vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
+vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
+
+vpp.connect("ipng-client")
+
+api_reply = vpp.api.show_version()
+print(api_reply)
+```
+
+The first thing this program does is construct a so-called `VPPApiClient` object. To do this, I need
+to feed it a list of JSON definitions, so that it knows what types of APIs are available. As I
+mentioned above, those JSON files normally live in `/usr/share/vpp/api/`, and I could enumerate them
+myself to create the list of files, but there are two handy helpers here:
+1. **find_api_dir()** - This is a helper that finds the location of the API files. Normally, the JSON
+   files get installed in `/usr/share/vpp/api/`, but when I'm writing code, it's more likely that the
+   files are in `/home/pim/src/vpp/` somewhere.
This helper function tries to do the right thing and + detect if I'm in a client or if I'm using a production install, and will return the correct + directory. +1. **find_api_files()** - Now, I could rummage through that directory and find the JSON files, but + there's another handy helper that does that for me, given a directory (like the one I just got + handed to me). Life is easy. + +Once I have the JSON files in hand, I can construct a client by specifying the _server_address_ +location to connect to -- this is typically a unix domain socket in `/run/vpp/api.sock` but it can +also be a TCP endpoint. As a quick aside: If you, like me, stumbled over the socket being owned by +`root:vpp` but not writable by the group, that finally got fixed by Georgy in +[[Gerrit 39862](https://gerrit.fd.io/r/c/vpp/+/39862)]. + +Once I'm connected, I can start calling arbitrary API methods, like `show_version()` which does not +take any arguments. Its reply is a named tuple, and it looks like this: + +```bash +pim@vpp0-0:~/vpp_papi_examples$ ./00-version.py +show_version_reply(_0=1415, context=1, + retval=0, program='vpe', version='24.02-rc0~46-ga16463610', + build_date='2023-10-15T14:50:49', build_directory='/home/pim/src/vpp') +``` + +And here is my beautiful hello world in seven (!) lines of code. All that reading and preparing +finally starts paying off. Neat-oh! + +### Example: Listing Interfaces + +From here on out, it's just incremental learning. Here's an example of how to extend the hello world +example above and make it list the dataplane interfaces and their IPv4/IPv6 addresses: + +```python +api_reply = vpp.api.sw_interface_dump() +for iface in api_reply: + str = f"[{iface.sw_if_index}] {iface.interface_name}" + ipr = vpp.api.ip_address_dump(sw_if_index=iface.sw_if_index, is_ipv6=False) + for addr in ipr: + str += f" {addr.prefix}" + ipr = vpp.api.ip_address_dump(sw_if_index=iface.sw_if_index, is_ipv6=True) + for addr in ipr: + str += f" {addr.prefix}" + print(str) +``` + +The API method `sw_interface_dump()` can take a few optional arguments. Notably, if `sw_if_index` is +set, the call will dump that exact interface. If it's not set, it will default to -1 which will dump +all interfaces, and this is how I use it here. For completeness, the method also has an optional +string `name_filter`, which will dump all interfaces which contain a given substring. For example +passing `name_filter='loop'` and `name_filter_value=True` as arguments, would enumerate all interfaces +that have the word 'loop' in them. + +Now, the definition of the `sw_interface_dump` method suggests that it returns a stream (remember +the **Dump/Detail** pattern above), so I can predict that the messages I will receive are of type +`sw_interface_details`. There's *lots* of cool information in here, like the MAC address, MTU, +encapsulation (if this is a sub-interface), but for now I'll only make note of the `sw_if_index` and +`interface_name`. + +Using this interface index, I then call the `ip_address_dump()` method, which looks like this: + +``` +define ip_address_dump +{ + u32 client_index; + u32 context; + vl_api_interface_index_t sw_if_index; + bool is_ipv6; +}; + +define ip_address_details +{ + u32 context; + vl_api_interface_index_t sw_if_index; + vl_api_address_with_prefix_t prefix; +}; +``` + +Allright then! If I want the IPv4 addresses for a given interface (referred to not by its name, but +by its index), I can call it with argument `is_ipv6=False`. 
The return is zero or more messages that
+contain the index again, and a prefix, the precise type of which can be looked up in `ip_types.api`.
+After doing a form of layer-one traceroute through the API specification files, it turns out that
+this prefix is cast to an instance of the `IPv4Interface()` class in Python. I won't bore you with
+it, but the second call sets `is_ipv6=True` and, unsurprisingly, returns a bunch of
+`IPv6Interface()` objects.
+
+To put it all together, the output of my little script:
+```
+pim@vpp0-0:~/vpp_papi_examples$ ./01-interface.py
+VPP version is 24.02-rc0~46-ga16463610
+[0] local0
+[1] GigabitEthernet10/0/0 192.168.10.5/31 2001:678:d78:201::fffe/112
+[2] GigabitEthernet10/0/1 192.168.10.6/31 2001:678:d78:201::1:0/112
+[3] GigabitEthernet10/0/2
+[4] GigabitEthernet10/0/3
+[5] loop0 192.168.10.0/32 2001:678:d78:200::/128
+```
+
+### Example: Linux Control Plane
+
+Normally, services are either a **Request/Reply** or a **Dump/Detail** type. But careful readers may
+have noticed that the Linux Control Plane does a little bit of both. It has a **Request/Reply/Detail**
+triplet, because for request `lcp_itf_pair_get`, it will return a `lcp_itf_pair_get_reply` AND a
+_stream_ of `lcp_itf_pair_details`. Perhaps, in hindsight, a more idiomatic way to do this would have
+been to simply create a `lcp_itf_pair_dump`, but considering this is what we ended up with, I can use it as
+a good example case -- how might I handle such a response?
+
+```python
+api_reply = vpp.api.lcp_itf_pair_get()
+if isinstance(api_reply, tuple) and api_reply[0].retval == 0:
+    for lcp in api_reply[1]:
+        str = f"[{lcp.vif_index}] {lcp.host_if_name}"
+        api_reply2 = vpp.api.sw_interface_dump(sw_if_index=lcp.host_sw_if_index)
+        tap_iface = api_reply2[0]
+        api_reply2 = vpp.api.sw_interface_dump(sw_if_index=lcp.phy_sw_if_index)
+        phy_iface = api_reply2[0]
+        str += f" tap {tap_iface.interface_name} phy {phy_iface.interface_name} mtu {phy_iface.link_mtu}"
+        print(str)
+```
+
+This particular API first sends its _reply_ and then its _stream_, so I can expect it to be a tuple
+with the first element being a namedtuple and the second element being a list of details messages. A
+good way to ensure that is to check that the reply's _retval_ field is 0 (success) before trying
+to enumerate the _Linux Interface Pairs_. These consist of a VPP interface (say
+`GigabitEthernet10/0/0`), which corresponds to a TUN/TAP device which in turn has a VPP name (e.g.
+`tap1`) and a Linux name (e.g. `e0`).
+
+The Linux Control Plane call will return these dataplane objects as numerical interface indexes,
+not names. However, I can resolve them to names by calling the `sw_interface_dump()` method and
+specifying the index as an argument. Because this is a **Dump/Detail** type API call, the return will
+be a _stream_ (a list), which will have either zero elements (if the index didn't exist) or one element
+(if it did).
+
+Using this I can puzzle together the following output:
+```
+pim@vpp0-0:~/vpp_papi_examples$ ./02-lcp.py
+VPP version is 24.02-rc0~46-ga16463610
+[2] loop0 tap tap0 phy loop0 mtu 9000
+[3] e0 tap tap1 phy GigabitEthernet10/0/0 mtu 9000
+[4] e1 tap tap2 phy GigabitEthernet10/0/1 mtu 9000
+[5] e2 tap tap3 phy GigabitEthernet10/0/2 mtu 9000
+[6] e3 tap tap4 phy GigabitEthernet10/0/3 mtu 9000
+```
+
+### VPP's Python API objects
+
+The objects in the VPP dataplane can be arbitrarily complex. They can have nested objects, enums,
+unions, repeated fields and so on.
To illustrate a more complete example, I will take a look at an
+MPLS tunnel object in the dataplane. I first create the MPLS tunnel using the CLI, as follows:
+
+```
+vpp# mpls tunnel l2-only via 192.168.10.3 GigabitEthernet10/0/1 out-labels 8298 100 200
+vpp# mpls local-label add 8298 eos via l2-input-on mpls-tunnel0
+```
+The first command creates an interface called `mpls-tunnel0` which, if it receives an ethernet frame, will
+encapsulate it into an MPLS datagram with a labelstack of `8298.100.200`, and then forward it on to
+the router at 192.168.10.3. The second command adds a FIB entry to the MPLS table: upon receipt of a
+datagram with the label `8298`, unwrap it and present the resulting datagram contents as an ethernet
+frame into `mpls-tunnel0`. By cross connecting this MPLS tunnel with any other dataplane interface
+(for example, `HundredGigabitEthernet10/0/1.1234`), this would be an elegant way to configure a
+classic L2VPN ethernet-over-MPLS transport. Which is hella cool, but I digress :)
+
+I want to inspect this tunnel using the API, and I find an `mpls_tunnel_dump()` method. As we
+know well by now, this is a **Dump/Detail** type method, so the return value will be a list of
+zero or more `mpls_tunnel_details` messages.
+
+The `mpls_tunnel_details` message is simply a wrapper around an `mpls_tunnel` type as can be seen in
+`mpls.api`, and it references the `fib_path` type as well. Here they are:
+
+```
+typedef fib_path
+{
+  u32 sw_if_index;
+  u32 table_id;
+  u32 rpf_id;
+  u8 weight;
+  u8 preference;
+
+  vl_api_fib_path_type_t type;
+  vl_api_fib_path_flags_t flags;
+  vl_api_fib_path_nh_proto_t proto;
+  vl_api_fib_path_nh_t nh;
+  u8 n_labels;
+  vl_api_fib_mpls_label_t label_stack[16];
+};
+
+typedef mpls_tunnel
+{
+  vl_api_interface_index_t mt_sw_if_index;
+  u32 mt_tunnel_index;
+  bool mt_l2_only;
+  bool mt_is_multicast;
+  string mt_tag[64];
+  u8 mt_n_paths;
+  vl_api_fib_path_t mt_paths[mt_n_paths];
+};
+
+define mpls_tunnel_details
+{
+  u32 context;
+  vl_api_mpls_tunnel_t mt_tunnel;
+};
+```
+
+Taking a closer look, the `mpls_tunnel` message consists of an index, then an `mt_tunnel_index`
+which corresponds to the tunnel number (i.e. interface mpls-tunnel**N**), some boolean flags, and
+then a vector of N FIB paths. Incidentally, you'll find FIB paths all over the place in VPP: in
+routes, tunnels like this one, ACLs, and so on, so it's good to get to know them a bit.
+
+Remember that when I created the tunnel, I specified something like `.. via ..`? That's a tell-tale
+sign that what follows is a FIB path. I specified a nexthop (192.168.10.3 GigabitEthernet10/0/1) and
+a list of three out-labels (8298, 100 and 200), all of which VPP has tucked away in this
+`mt_paths` field.
+
+Although it's a bit verbose, I'll paste the complete object for this tunnel, including the FIB path.
+You know, for science: + +``` +mpls_tunnel_details(_0=1185, context=5, + mt_tunnel=vl_api_mpls_tunnel_t( + mt_sw_if_index=17, + mt_tunnel_index=0, + mt_l2_only=True, + mt_is_multicast=False, + mt_tag='', + mt_n_paths=1, + mt_paths=[ + vl_api_fib_path_t(sw_if_index=2, table_id=0,rpf_id=0, weight=1, preference=0, + type=, + flags=, + proto=, + nh=vl_api_fib_path_nh_t( + address=vl_api_address_union_t( + ip4=IPv4Address('192.168.10.3'), ip6=IPv6Address('c0a8:a03::')), + via_label=0, obj_id=0, classify_table_index=0), + n_labels=3, + label_stack=[ + vl_api_fib_mpls_label_t(is_uniform=0, label=8298, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=100, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=200, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), + vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0) + ] + ) + ] + ) + ) +``` + +This `mt_paths` is really interesting, and I'd like to make a few observations: +* **type**, **flags** and **proto** are ENUMs which I can find in `fib_types.api` +* **nh** is the nexthop - there is only one nexthop specified per path entry, so when things like + ECMP multipath are in play, this will be a vector of N _paths_ each with one _nh_. Good to know. + This nexthop specifies an address which is a _union_ just like in C. It can be either an + _ip4_ or an _ip6_. I will know which to choose due to the _proto_ field above. +* **n_labels** and **label_stack**: The MPLS label stack has a fixed size. VPP reveals here (in + the API definition but also in the response) that the label-stack can be at most 16 labels deep. + I feel like this is an interview question at Cisco, somehow. I know how many labels are relevant + because of the _n_labels_ field above. Their type is of `fib_mpls_label` which can be found in + `mpls.api`. + +After having consumed all of this, I am ready to write a program that wheels over these message +types and prints something a little bit more compact. 
The final program, in all of its glory -- + +```python +from vpp_papi import VPPApiClient, VPPApiJSONFiles, VppEnum + +def format_path(path): + str = "" + if path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_IP4: + str += f" ipv4 via {path.nh.address.ip4}" + elif path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_IP6: + str += f" ipv6 via {path.nh.address.ip6}" + elif path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_MPLS: + str += f" mpls" + elif path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_ETHERNET: + api_reply2 = vpp.api.sw_interface_dump(sw_if_index=path.sw_if_index) + iface = api_reply2[0] + str += f" ethernet to {iface.interface_name}" + else: + print(path) + if path.n_labels > 0: + str += " label" + for i in range(path.n_labels): + str += f" {path.label_stack[i].label}" + return str + +api_reply = vpp.api.mpls_tunnel_dump() +for tunnel in api_reply: + str = f"Tunnel [{tunnel.mt_tunnel.mt_sw_if_index}] mpls-tunnel{tunnel.mt_tunnel.mt_tunnel_index}" + for path in tunnel.mt_tunnel.mt_paths: + str += format_path(path) + print(str) + +api_reply = vpp.api.mpls_table_dump() +for table in api_reply: + print(f"Table [{table.mt_table.mt_table_id}] {table.mt_table.mt_name}") + api_reply2 = vpp.api.mpls_route_dump(table=table.mt_table.mt_table_id) + for route in api_reply2: + str = f" label {route.mr_route.mr_label} eos {route.mr_route.mr_eos}" + for path in route.mr_route.mr_paths: + str += format_path(path) + print(str) +``` + +{{< image width="50px" float="left" src="/assets/vpp-papi/warning.png" alt="Warning" >}} + +Funny detail - it took me almost two years to discover `VppEnum`, which contains all of these +symbols. If you end up reading this after a Bing, Yahoo or DuckDuckGo search, feel free to buy +me a bottle of Glenmorangie - sláinte! + +The `format_path()` method here has the smarts. Depending on the _proto_ field, I print either +an IPv4 path, an IPv6 path, an internal MPLS path (for example for the reserved labels 0..15), or an +Ethernet path, which is the case in the FIB entry above that diverts incoming packets with label 8298 to be +presented as ethernet datagrams into the intererface `mpls-tunnel0`. If it is an Ethernet proto, I +can use the _sw_if_index_ field to figure out which interface, and retrieve its details to find its +name. + +The `format_path()` method finally adds the stack of labels to the returned string, if the _n_labels_ +field is non-zero. + +My program's output: + +``` +pim@vpp0-0:~/vpp_papi_examples$ ./03-mpls.py +VPP version is 24.02-rc0~46-ga16463610 +Tunnel [17] mpls-tunnel0 ipv4 via 192.168.10.3 label 8298 100 200 +Table [0] MPLS-VRF:0 + label 0 eos 0 mpls + label 0 eos 1 mpls + label 1 eos 0 mpls + label 1 eos 1 mpls + label 2 eos 0 mpls + label 2 eos 1 mpls + label 8298 eos 1 ethernet to mpls-tunnel0 +``` + +### Creating VxLAN Tunnels + +Until now, all I've done is _inspect_ the dataplane, in other words I've called a bunch of APIs +that do not change state. Of course, many of VPP's API methods _change_ state as well. I'll turn to +another example API -- The VxLAN tunnel API is defined in `plugins/vxlan/vxlan.api` and it's gone +through a few iterations. The VPP community tries to keep backwards compatibility, and a simple way +of doing this is to create new versions of the methods by tagging them with suffixes such as `_v2`, +while eventually marking the older versions as deprecated by setting the `option deprecated;` field +in the definition. 
In this API specification I can see that we're already at version 3 of the
+**Request/Reply** method in `vxlan_add_del_tunnel_v3` and version 2 of the **Dump/Detail** method in
+`vxlan_tunnel_v2_dump`.
+
+Once again, using these `*.api` definitions, finding an incantation to create a unicast VxLAN tunnel
+with a given VNI, then listing the tunnels, and finally deleting the tunnel I just created, would
+look like this:
+
+```python
+api_reply = vpp.api.vxlan_add_del_tunnel_v3(is_add=True, instance=100, vni=8298,
+            src_address="192.0.2.1", dst_address="192.0.2.254", decap_next_index=1)
+if api_reply.retval == 0:
+    print(f"Created VXLAN tunnel with sw_if_index={api_reply.sw_if_index}")
+
+api_reply = vpp.api.vxlan_tunnel_v2_dump()
+for vxlan in api_reply:
+    str = f"[{vxlan.sw_if_index}] instance {vxlan.instance} vni {vxlan.vni}"
+    str += f" src {vxlan.src_address}:{vxlan.src_port} dst {vxlan.dst_address}:{vxlan.dst_port}"
+    print(str)
+
+api_reply = vpp.api.vxlan_add_del_tunnel_v3(is_add=False, instance=100, vni=8298,
+            src_address="192.0.2.1", dst_address="192.0.2.254", decap_next_index=1)
+if api_reply.retval == 0:
+    print(f"Deleted VXLAN tunnel with sw_if_index={api_reply.sw_if_index}")
+```
+
+Many of the APIs in VPP will have create and delete in the same method, mostly by specifying the
+operation with an `is_add` argument like here. I think it's kind of nice because it makes the
+creation and deletion symmetric, even though the deletion needs to specify a fair bit more than
+strictly necessary: the _instance_ uniquely identifies the tunnel and should have been enough.
+
+The output of this [[CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete)] sequence
+(which stands for **C**reate, **R**ead, **U**pdate, **D**elete, in case you haven't come across that
+acronym yet) then looks like this:
+
+```
+pim@vpp0-0:~/vpp_papi_examples$ ./04-vxlan.py
+VPP version is 24.02-rc0~46-ga16463610
+Created VXLAN tunnel with sw_if_index=18
+[18] instance 100 vni 8298 src 192.0.2.1:4789 dst 192.0.2.254:4789
+Deleted VXLAN tunnel with sw_if_index=18
+```
+
+### Listening to Events
+
+But wait, there's more! Just one more thing, I promise. Way in the beginning of this article, I
+mentioned that there is a special variant of the **Dump/Detail** pattern, and that's the **Events**
+pattern. With the VPP API client, first I register a single callback function, and then I can
+enable/disable events to trigger this callback.
+
+{{< image width="80px" float="left" src="/assets/vpp-papi/warning.png" alt="Warning" >}}
+One important note to this: enabling this callback will spawn a new (Python) thread so that the main
+program can continue to execute. Because of this, all the standard care has to be taken to make the
+program thread-aware. Make sure to pass information from the events-thread to the main-thread in a
+safe way!
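+
+Before I get to the full demonstration program below, here is a minimal sketch (my own illustrative
+example, not something that ships with VPP) of one way to keep this safe: have the callback do nothing
+but push each event onto a `queue.Queue`, and let the main thread consume events from that queue at
+its own pace:
+
+```python
+#!/usr/bin/env python3
+
+import queue
+
+from vpp_papi import VPPApiClient, VPPApiJSONFiles
+
+# Events arrive on the vpp_papi callback thread; a Queue is a thread-safe
+# hand-off point between that thread and the main thread.
+events = queue.Queue()
+
+def vpp_event_callback(msg_name, msg):
+    # Runs on the callback thread: do as little as possible here.
+    events.put((msg_name, msg))
+
+vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
+vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
+vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
+
+vpp.connect("ipng-client")
+vpp.register_event_callback(vpp_event_callback)
+vpp.api.want_interface_events(enable_disable=True, pid=8298)
+
+while True:
+    # Runs on the main thread: block until the callback thread hands us an event.
+    msg_name, msg = events.get()
+    # sw_if_index and flags are fields of sw_interface_event, per its definition below.
+    print(f"{msg_name}: sw_if_index={msg.sw_if_index} flags={msg.flags}")
+```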
+ +Let me demonstrate this powerful functionality with a program that listens on +`want_interface_events` which is defined in `interface.api`: + +``` +service { + rpc want_interface_events returns want_interface_events_reply + events sw_interface_event; +}; + +define sw_interface_event +{ + u32 client_index; + u32 pid; + vl_api_interface_index_t sw_if_index; + vl_api_if_status_flags_t flags; + bool deleted; +}; +``` + +Here's a complete program, shebang and all, that accomplishes this in a minimalistic way: + +```python +#!/usr/bin/env python3 + +import time +from vpp_papi import VPPApiClient, VPPApiJSONFiles, VppEnum + +def sw_interface_event(msg): + print(msg) + +def vpp_event_callback(msg_name, msg): + if msg_name == "sw_interface_event": + sw_interface_event(msg) + else: + print(f"Received unknown callback: {msg_name} => {msg}") + +vpp_api_dir = VPPApiJSONFiles.find_api_dir([]) +vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir) +vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock") + +vpp.connect("ipng-client") +vpp.register_event_callback(vpp_event_callback) +vpp.api.want_interface_events(enable_disable=True, pid=8298) + +api_reply = vpp.api.show_version() +print(f"VPP version is {api_reply.version}") + +try: + while True: + time.sleep(1) +except KeyboardInterrupt: + pass +``` + +## Results + +After all of this deep-diving, all that's left is for me to demonstrate the API by means of this +little asciinema [[screencast](/assets/vpp-papi/vpp_papi_clean.cast)] - I hope you enjoy it as much +as I enjoyed creating it: + +{{< image src="/assets/vpp-papi/vpp_papi_clean.gif" alt="Asciinema" >}} + +Note to self: + +``` +$ asciinema-edit quantize --range 0.18,0.8 --range 0.5,1.5 --range 1.5 \ + vpp_papi.cast > clean.cast +$ Insert the ANSI colorcodes from the mac's terminal into clean.cast's header: +"theme":{"fg": "#ffffff","bg":"#000000", + "palette":"#000000:#990000:#00A600:#999900:#0000B3:#B300B3:#999900:#BFBFBF: + #666666:#F60000:#00F600:#F6F600:#0000F6:#F600F6:#00F6F6:#F6F6F6"} +$ agg --font-size 18 clean.cast clean.gif +$ gifsicle --lossy=80 -k 128 -O2 -Okeep-empty clean.gif -o vpp_papi_clean.gif +``` diff --git a/content/articles/2024-02-10-vpp-freebsd-1.md b/content/articles/2024-02-10-vpp-freebsd-1.md new file mode 100644 index 0000000..c41a5b8 --- /dev/null +++ b/content/articles/2024-02-10-vpp-freebsd-1.md @@ -0,0 +1,359 @@ +--- +date: "2024-02-10T10:17:54Z" +title: VPP on FreeBSD - Part 1 +--- + + +# About this series + +{{< image width="300px" float="right" src="/assets/freebsd-vpp/freebsd-logo.png" alt="FreeBSD" >}} + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. Over the years, folks have asked me regularly "What about BSD?" and to +my surprise, late last year I read an announcement from the _FreeBSD Foundation_ +[[ref](https://freebsdfoundation.org/blog/2023-in-review-software-development/)] as they looked back +over 2023 and forward to 2024: + +> ***Porting the Vector Packet Processor to FreeBSD*** +> +> Vector Packet Processing (VPP) is an open-source, high-performance user space networking stack +> that provides fast packet processing suitable for software-defined networking and network function +> virtualization applications. 
VPP aims to optimize packet processing through vectorized operations +> and parallelism, making it well-suited for high-speed networking applications. In November of this +> year, the Foundation began a contract with Tom Jones, a FreeBSD developer specializing in network +> performance, to port VPP to FreeBSD. Under the contract, Tom will also allocate time for other +> tasks such as testing FreeBSD on common virtualization platforms to improve the desktop +> experience, improving hardware support on arm64 platforms, and adding support for low power idle +> on Intel and arm64 hardware. + +I reached out to Tom and introduced myself -- and IPng Networks -- and offered to partner. Tom knows +FreeBSD very well, and I know VPP very well. And considering lots of folks have asked me that loaded +"What about BSD?" question, I think a reasonable answer might now be: Coming up! +Tom will be porting VPP to FreeBSD, and I'll be providing a test environment with a few VMs, +physical machines with varying architectures (think single-numa, AMD64 and Intel platforms). + +In this first article, let's take a look at tablestakes: installing FreeBSD 14.0-RELEASE and doing +all the little steps necessary to get VPP up and running. + +## My test setup + +Tom and I will be using two main test environments. The first is a set of VMs running on QEMU, which +we can do functional testing on, by configuring a bunch of VPP routers with a set of normal FreeBSD +hosts attached to them. The second environment will be a few Supermicro bare metal servers that +we'll use for performance testing, notably to compare the FreeBSD kernel routing, fancy features +like `netmap`, and of course VPP itself. I do intend to do some side-by-side comparisons between +Debian and FreeBSD when they run VPP. + +{{< image width="100px" float="left" src="/assets/freebsd-vpp/brain.png" alt="brain" >}} + +If you know me a little bit, you'll know that I typically forget how I did a thing, so I'm using +this article for others as well as myself in case I want to reproduce this whole thing 5 years down +the line. Oh, and if you don't know me at all, now you know my brain, pictured left, is not too +different from a leaky sieve. + +### VMs: IPng Lab + +I really like the virtual machine environment that the [[IPng Lab]({% post_url 2022-10-14-lab-1 %})] +provides. So my very first step is to grab an UFS based image like [[these +ones](https://download.freebsd.org/releases/VM-IMAGES/14.0-RELEASE/amd64/Latest/)], and I prepare a +lab image. This goes roughly as follows -- + +1. Download the UFS qcow2 and `unxz` it. +1. Create a 10GB ZFS blockdevice `zfs create ssd-vol0/vpp-proto-freebsd-disk0 -V10G` +1. Make a copy of my existing vpp-proto-bookworm libvirt config, and edit it with new MAC addresses, + UUID and hostname (essentially just an `s/bookworm/freebsd/g` +1. Boot the VM once using VNC, to add serial booting to `/boot/loader.conf` +1. 
Finally, install a bunch of stuff that I would normally use on a FreeBSD machine: + * A user account 'pim' and 'ipng', set the 'root' password + * A bunch of packages (things like vim, bash, python3, rsync) + * SSH host keys and `authorized_keys` files + * A sensible `rc.conf` that DHCPs on its first network card `vtnet0` + +I notice that FreeBSD has something pretty neat in `rc.conf`, called `growfs_enable`, which will +take a look at the total disk size available in slice 4 (the one that contains the main filesystem), +and if the disk has free space beyond the end of the partition, it'll slurp it up and resize the +filesystem to fit. Reading the `/etc/rc.d/growfs` file, I see that this works for both ZFS and UFS. +A chef's kiss that I found super cool! + +Next, I take a snapshot of the disk image and add it to the Lab's `zrepl` configuration, so that +this base image gets propagated to all hypervisors, the result is a nice 10GB large base install +that boots off of serial. + +``` +pim@hvn0-chbtl0:~$ zfs list -t all | grep vpp-proto-freebsd +ssd-vol0/vpp-proto-freebsd-disk0 13.0G 45.9G 6.14G - +ssd-vol0/vpp-proto-freebsd-disk0@20240206-release 614M - 6.07G - +ssd-vol0/vpp-proto-freebsd-disk0@20240207-release 3.95M - 6.14G - +ssd-vol0/vpp-proto-freebsd-disk0@20240207-2 0B - 6.14G - +ssd-vol0/vpp-proto-freebsd-disk0#zrepl_CURSOR_G_760881003460c452_J_source-vpp-proto - - 6.14G - +``` + +One note for the pedants -- the kernel that ships with Debian, for some reason I don't quite +understand, does not come with an UFS kernel module that allows to mount these filesystems +read-write. Maybe this is because there are a few different flavors of UFS out there, and the +maintainer of that kernel module is not comfortable enabling write-mode on all of them. I +don't know, but my use case isn't critical as my build will just copy a few files on the +otherwise ephemeral ZFS cloned filesystem. + +So off I go, asking Summer to build me a Linux 6.1 kernel for Debian Bookworm (which is what the +hypervisors are running). For those following along at home, here's how that looked like for me: + +``` +pim@summer:/usr/src$ sudo apt-get install build-essential linux-source bc kmod cpio flex \ + libncurses5-dev libelf-dev libssl-dev dwarves bison +pim@summer:/usr/src$ sudo apt install linux-source-6.1 +pim@summer:/usr/src$ sudo tar xf linux-source-6.1.tar.xz +pim@summer:/usr/src$ cd linux-source-6.1/ +pim@summer:/usr/src/linux-source-6.1$ sudo cp /boot/config-6.1.0-16-amd64 .config +pim@summer:/usr/src/linux-source-6.1$ cat << EOF | sudo tee -a .config +CONFIG_UFS_FS=m +CONFIG_UFS_FS_WRITE=y +EOF +pim@summer:/usr/src/linux-source-6.1$ sudo make menuconfig +pim@summer:/usr/src/linux-source-6.1$ sudo make -j`nproc` bindeb-pkg +``` + +Finally, I add a new LAB overlay type called `freebsd` to the Python/Jinja2 tool I built, which I +use to create and maintain the LAB hypervisors. If you're curious about this part, take a look at +the [[article]({% post_url 2022-10-14-lab-1 %})] I wrote about the environment. I reserve LAB #2 +running on `hvn2.lab.ipng.ch` for the time being, as LAB #0 and #1 are in use by other projects. To +cut to the chase, here's what I type to generate the overlay and launch a LAB using the FreeBSD I +just made. There's not much in the overlay, really just some templated `rc.conf` to set the correct +hostname and mgmt IPv4/IPv6 addresses and so on. 
+
+```
+pim@lab:~/src/lab$ find overlays/freebsd/ -type f
+overlays/freebsd/common/home/ipng/.ssh/authorized_keys.j2
+overlays/freebsd/common/etc/rc.local.j2
+overlays/freebsd/common/etc/rc.conf.j2
+overlays/freebsd/common/etc/resolv.conf.j2
+overlays/freebsd/common/root/lab-build/perms
+overlays/freebsd/common/root/.ssh/authorized_keys.j2
+
+pim@lab:~/src/lab$ ./generate --host hvn2.lab.ipng.ch --overlay freebsd
+pim@lab:~/src/lab$ export BASE=vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-freebsd-disk0@20240207-2
+pim@lab:~/src/lab$ OVERLAY=freebsd LAB=2 ./create
+pim@lab:~/src/lab$ LAB=2 ./command start
+```
+
+After rebooting the hypervisors with their new UFS2-write-capable kernel, I can finish the job and
+create the lab VMs. The `create` call above first makes a ZFS _clone_ of the base image, then mounts
+it, rsyncs the generated overlay files over it, then creates a ZFS _snapshot_ called `@pristine`,
+before booting up the seven virtual machines that comprise this spiffy new FreeBSD lab:
+
+{{< image src="/assets/freebsd-vpp/LAB v2.svg" alt="Lab Setup" >}}
+
+I decide to park the LAB for now, as that beautiful daisy-chain of `vpp2-0` - `vpp2-3` routers will
+first need a working VPP install, which I don't quite have yet.
+
+### Bare Metal
+
+Next, I take three spare Supermicro SYS-5018D-FN8T, which have the following specs:
+* Full IPMI support (power, serial-over-lan and kvm-over-ip with HTML5), on a dedicated network port.
+* A 4-core, 8-thread Xeon D1518 CPU which runs at 35W TDP
+* Two independent Intel i210 NICs (Gigabit)
+* A Quad Intel i350 NIC (Gigabit)
+* Two Intel X552 (TenGigabitEthernet)
+* Two Intel X710-XXV (TwentyFiveGigabitEthernet) ports in the PCIe v3.0 x8 slot
+* m.SATA 120G boot SSD
+* 2x16GB of ECC RAM
+
+These were still arranged in a test network from when Adrian and I worked on the [[VPP MPLS]({%
+post_url 2023-05-07-vpp-mpls-1 %})] project together, and back then I called the three machines
+`France`, `Belgium` and `Netherlands`. I decide to reuse that, and save myself some recabling.
+Using IPMI, I install the `France` server with FreeBSD, while the other two, for now, are still
+running Debian. This can be useful for (a) side by side comparison tests and (b) to be able to
+quickly run some T-Rex loadtests.
+
+{{< image src="/assets/freebsd-vpp/freebsd-bootscreen.png" alt="FreeBSD" >}}
+
+I have to admit - I **love** Supermicro's IPMI implementation. Being able to plop in an ISO over
+Samba, then boot on VGA, including going into the BIOS to set/change things, and then completely
+reinstall while hanging out on the couch drinking tea, is absolutely great.
+
+## Starting Point
+
+I use the base image I described above to clone a beefy VM for building and development purposes. I
+give that machine 32GB of RAM and 24 cores on one of IPng's production hypervisors. I spent some
+time with Tom this week to go over a few details about the build, and he patiently described where
+he's at with the porting. It's not done yet, but he has good news: it does compile cleanly on his
+machine, so there is hope for me yet! He has prepared a GitHub repository with all of the changes
+staged - and he will be sequencing them out one by one to merge upstream. In case you want to follow
+along with his work, take a look at this [[Gerrit
+search](https://gerrit.fd.io/r/q/repo:vpp+owner:thj%2540freebsd.org)].
+
+First, I need to go build a whole bunch of stuff. Here's a recap --
+1. Download ports and kernel source
+1. Build and install a GENERIC kernel
+1.
Build DPDK including its FreeBSD kernel modules `contigmem` and `nic_uio` +1. Build netmap `bridge` utility +1. Build VPP :) + +To explain a little bit: Linux has _hugepages_ which are 2MB or 1GB memory pages. These come with a +significant performance benefit, mostly because the CPU will have a table called the _Translation +Lookaside Buffer_ or [[TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)] which keeps +a mapping between virtual and physical memory pages. If there is too much memory allocated to a +process, this TLB table thrashes which comes at a performance penalty. When allocating not the +standard 4kB pages, but larger 2MB or 1GB ones, this does not happen. For FreeBSD, the DPDK library +provides an equivalent kernel module, which is called `contigmem`. + +Many (but not all!) DPDK poll mode drivers will remove the kernel network card driver and rebind the +network card to a _Userspace IO_ or _UIO_ driver. DPDK also ships one of these for FreeBSD, called +`nic_uio`. So my first three steps are compiling all of these things, including a standard DPDK +install from ports. + +### Build: FreeBSD + DPDK + +Building things on FreeBSD is all very well documented in the [[FreeBSD +Handbook](ttps://docs.freebsd.org/en/books/developers-handbook/kernelbuild/)]. +In order to avoid filling up the UFS boot disk, I snuck in another SAS-12 SSD to get a bit +faster builds, and I mount `/usr/src` and `/usr/obj` on it. + +Here's a recap of what I ended up doing to build a fresh GENERIC kernel and the `DPDK` port: + +``` +[pim@freebsd-builder ~]$ sudo zfs create -o mountpoint=/usr/src ssd-vol0/src +[pim@freebsd-builder ~]$ sudo zfs create -o mountpoint=/usr/obj ssd-vol0/obj +[pim@freebsd-builder ~]$ sudo git clone --branch stable/14 https://git.FreeBSD.org/src.git /usr/src +[pim@freebsd-builder /usr/src]$ sudo make buildkernel KERNCONF=GENERIC +[pim@freebsd-builder /usr/src]$ sudo make installkernel KERNCONF=GENERIC +[pim@freebsd-builder ~]$ sudo git clone https://git.FreeBSD.org/ports.git /usr/ports +[pim@freebsd-builder /usr/ports/net/dpdk ]$ sudo make install +``` + +I patiently answer a bunch of questions (all of them just with the default) when the build process +asks me what I want. DPDK is a significant project, and it pulls in lots of dependencies to build as +well. After what feels like an eternity, the builds are complete, and I have a kernel together with +kernel modules, as well as a bunch of handy DPDK helper utilities (like `dpdk-testpmd`) installed. +Just to set expectations -- the build took about an hour for me from start to finish (on a 32GB machine +with 24 vCPUs), so hunker down if you go this route. + +**NOTE**: I wanted to see what I was being asked in this build process, but since I ended up answering +everything with the default, you can feel free to add `BATCH=yes` to the make of DPDK (and see the man +page of dpdk(7) for details). + +### Build: contigmem and nic_uio + +Using a few `sysctl` calls, I can configure four buffers of 1GB each, which will serve as my +equivalent _hugepages_ from Linux, and I add the following to `/boot/loader.conf`, so that these +contiguous regions are reserved early in the boot cycle, when memory is not yet fragmented: + +``` +hw.contigmem.num_buffers=4 +hw.contigmem.buffer_size=1073741824 +contigmem_load="YES" +``` + +To figure out which network devices to rebind to the _UIO_ driver, I can inspect the PCI bus with the +`pciconf` utility: + +``` +[pim@freebsd-builder ~]$ pciconf -vl | less +... 
+virtio_pci0@pci0:1:0:0: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100 + vendor = 'Red Hat, Inc.' + device = 'Virtio 1.0 network device' + class = network + subclass = ethernet +virtio_pci1@pci0:1:0:1: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100 + vendor = 'Red Hat, Inc.' + device = 'Virtio 1.0 network device' + class = network + subclass = ethernet +virtio_pci0@pci0:1:0:2: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100 + vendor = 'Red Hat, Inc.' + device = 'Virtio 1.0 network device' + class = network + subclass = ethernet +virtio_pci1@pci0:1:0:3: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100 + vendor = 'Red Hat, Inc.' + device = 'Virtio 1.0 network device' + class = network + subclass = ethernet +``` + +My virtio based network devices are on PCI location `1:0:0` -- `1:0:3` and I decide to take away the +last two, which makes my final loader configuration for the kernel: +``` +[pim@freebsd-builder ~]$ cat /boot/loader.conf +kern.geom.label.disk_ident.enable=0 +zfs_load=YES +boot_multicons=YES +boot_serial=YES +comconsole_speed=115200 +console="comconsole,vidconsole" +hw.contigmem.num_buffers=4 +hw.contigmem.buffer_size=1073741824 +contigmem_load="YES" +nic_uio_load="YES" +hw.nic_uio.bdfs="1:0:2,1:0:3" +``` + +## Build: Results + +Now that all of this is done, the machine boots with these drivers loaded, and I can see only my +first two network devices (`vtnet0` and `vtnet1`), while the other two are gone. This is good news, +because that means they are now under control of the DPDK `nic_uio` kernel driver, whohoo! + +``` +[pim@freebsd-builder ~]$ kldstat +Id Refs Address Size Name + 1 28 0xffffffff80200000 1d36230 kernel + 2 1 0xffffffff81f37000 4258 nic_uio.ko + 3 1 0xffffffff81f3c000 5d5618 zfs.ko + 4 1 0xffffffff82513000 5378 contigmem.ko + 5 1 0xffffffff82c18000 3250 ichsmb.ko + 6 1 0xffffffff82c1c000 2178 smbus.ko + 7 1 0xffffffff82c1f000 430c virtio_console.ko + 8 1 0xffffffff82c24000 22a8 virtio_random.ko +``` + +## Build: VPP + +Tom has prepared a branch on his GitHub account, which poses a few small issues with the build. +Notably, we have to use a few GNU tools like `gmake`. But overall, I find the build is very straight +forward - kind of looking like this: + +``` +[pim@freebsd-builder ~]$ sudo pkg install py39-ply git gmake gsed cmake libepoll-shim gdb python3 ninja +[pim@freebsd-builder ~/src]$ git clone git@github.com:adventureloop/vpp.git +[pim@freebsd-builder ~/src/vpp]$ git checkout freebsd-vpp +[pim@freebsd-builder ~]$ gmake install-dep +[pim@freebsd-builder ~]$ gmake build +``` + +# Results + +Now, taking into account that not everything works (for example there isn't a packaging yet, let +alone something as fancy as a port), and that there's a bit of manual tinkering going on, let me +show you at least the absolute gem that is this screenshot: + +{{< image src="/assets/freebsd-vpp/freebsd-vpp.png" alt="VPP :-)" >}} + +The (debug build) VPP instance started, the DPDK plugin loaded, and it found the two devices that +were bound by the newly installed `nic_uio` driver. Setting an IPv4 address on one of these +interfaces works, and I can ping another machine on the LAN connected to Gi10/0/2, which I find +dope. + +Hello, World! + +# What's next ? + +There's a lot of ground to cover with this port. 
While Tom munches away at the Gerrits he has
+stacked up, I'm going to start kicking the tires on the FreeBSD machines. I showed in this article
+the tablestakes preparation: a FreeBSD lab on the hypervisors, a build machine that has DPDK,
+kernel and VPP in a somewhat working state (with two NICs in VirtIO), and a Supermicro
+bare metal machine installed to do the same.
+
+In a future set of articles in this series, I will:
+* Do a comparative loadtest between FreeBSD kernel, Netmap, VPP+Netmap, and VPP+DPDK
+* Take a look at how FreeBSD stacks up against Debian on the same machine
+* Do a bit of functional testing, to ensure dataplane functionality is in place
+
+A few things will need some attention:
+* Some Linux details have leaked, for example `show cpu` and `show pci` in VPP
+* Linux Control Plane uses TAP devices which Tom has mentioned may need some work
+* Similarly, Linux Control Plane netlink handling may or may not work as expected in FreeBSD
+* Build and packaging, obviously there is no `make pkg-deb`
diff --git a/content/articles/2024-02-17-vpp-freebsd-2.md b/content/articles/2024-02-17-vpp-freebsd-2.md
new file mode 100644
index 0000000..7b9c18e
--- /dev/null
+++ b/content/articles/2024-02-17-vpp-freebsd-2.md
@@ -0,0 +1,501 @@
+---
+date: "2024-02-17T12:17:54Z"
+title: VPP on FreeBSD - Part 2
+---
+
+
+# About this series
+
+{{< image width="300px" float="right" src="/assets/freebsd-vpp/freebsd-logo.png" alt="FreeBSD" >}}
+
+Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
+performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
+_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
+are shared between the two. Over the years, folks have asked me regularly "What about BSD?" and to
+my surprise, late last year I read an announcement from the _FreeBSD Foundation_
+[[ref](https://freebsdfoundation.org/blog/2023-in-review-software-development/)] as they looked back
+over 2023 and forward to 2024:
+
+> ***Porting the Vector Packet Processor to FreeBSD***
+>
+> Vector Packet Processing (VPP) is an open-source, high-performance user space networking stack
+> that provides fast packet processing suitable for software-defined networking and network function
+> virtualization applications. VPP aims to optimize packet processing through vectorized operations
+> and parallelism, making it well-suited for high-speed networking applications. In November of this
+> year, the Foundation began a contract with Tom Jones, a FreeBSD developer specializing in network
+> performance, to port VPP to FreeBSD. Under the contract, Tom will also allocate time for other
+> tasks such as testing FreeBSD on common virtualization platforms to improve the desktop
+> experience, improving hardware support on arm64 platforms, and adding support for low power idle
+> on Intel and arm64 hardware.
+
+In my first [[article]({% post_url 2024-02-10-vpp-freebsd-1 %})], I wrote a sort of a _hello world_
+by installing FreeBSD 14.0-RELEASE on both a VM and a bare metal Supermicro, and showed that Tom's
+VPP branch compiles, runs and pings. In this article, I'll take a look at some comparative
+performance numbers.
+
+## Comparing implementations
+
+FreeBSD has an extensive network stack, including regular _kernel_ based functionality such as
+routing, filtering and bridging, a faster _netmap_ based datapath, including some userspace
+utilities like a _netmap_ bridge, and of course completely userspace based dataplanes, such as the
+VPP project that I'm working on here. Last week, I learned that VPP has a _netmap_ driver, and from
+previous travels I am already quite familiar with its _DPDK_ based forwarding. I decide to do a
+baseline loadtest for each of these on the Supermicro Xeon-D1518 that I installed last week. See the
+[[article]({% post_url 2024-02-10-vpp-freebsd-1 %})] for details on the setup.
+
+The loadtests will use a common set of different configurations, using Cisco T-Rex's default
+benchmark profile called `bench.py`:
+1. **var2-1514b**: Large Packets, multiple flows with modulating source and destination IPv4
+   addresses, often called an 'iperf test', with packets of 1514 bytes.
+1. **var2-imix**: Mixed Packets, multiple flows, often called an 'imix test', which includes a
+   bunch of 64b, 390b and 1514b packets.
+1. **var2-64b**: Small Packets, still multiple flows, 64 bytes, which allows for multiple receive
+   queues and kernel or application threads.
+1. **64b**: Small Packets, but now single flow, often called 'linerate test', with a packet size
+   of 64 bytes, limiting to one receive queue.
+
+Each of these four loadtests can be run either unidirectionally (port0 -> port1) or bidirectionally
+(port0 <-> port1). This yields eight different loadtests, each taking about 8 minutes; I'll show an
+example T-Rex invocation a little further down. I put the kettle on and get underway.
+
+### FreeBSD 14: Kernel Bridge
+
+The machine I'm testing has a quad-port Intel i350 (1Gbps copper, using the FreeBSD `igb(4)` driver),
+a dual-port Intel X522 (10Gbps SFP+, using the `ix(4)` driver), and a dual-port Intel i710-XXV
+(25Gbps SFP28, using the `ixl(4)` driver). I decide to live it up a little, and choose the 25G ports
+for my loadtests today, even if I think this machine with its relatively low-end Xeon-D1518 CPU
+will struggle a little bit at very high packet rates. No pain, no gain, _amirite_?
+
+I take my fresh FreeBSD 14.0-RELEASE install, without any tinkering other than compiling a GENERIC
+kernel that has support for the DPDK modules I'll need later.
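+
+For reference, this is roughly how I drive each of these tests from the T-Rex stateless console. The
+tunables are the ones from `bench.py` listed above, while the core count and multipliers here are
+just illustrative placeholders rather than the exact values I used:
+
+```
+loadtester$ sudo ./t-rex-64 -i -c 4       # start T-Rex in interactive (stateless) mode
+loadtester$ ./trex-console                # attach a console to it
+
+trex>start -f stl/bench.py -t vm=var2,size=64 -m 30mpps -p 0      # a unidirectional run, port0 only
+trex>stop -a
+trex>start -f stl/bench.py -t vm=var2,size=1514 -m 100% -p 0 1    # a bidirectional run, both ports
+trex>tui                                  # watch per-port packet and bit rates
+```
+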
For my first loadtest, I create a +kernel based bridge as follows, just tying the two 25G interfaces together: + +``` +[pim@france /usr/obj]$ uname -a +FreeBSD france 14.0-RELEASE FreeBSD 14.0-RELEASE #0: Sat Feb 10 22:18:51 CET 2024 root@france:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 + +[pim@france ~]$ dmesg | grep ixl +ixl0: mem 0xf8000000-0xf8ffffff,0xf9008000-0xf900ffff irq 16 at device 0.0 on pci7 +ixl1: mem 0xf7000000-0xf7ffffff,0xf9000000-0xf9007fff irq 16 at device 0.1 on pci7 + +[pim@france ~]$ sudo ifconfig bridge0 create +[pim@france ~]$ sudo ifconfig bridge0 addm ixl0 addm ixl1 up +[pim@france ~]$ sudo ifconfig ixl0 up +[pim@france ~]$ sudo ifconfig ixl1 up +[pim@france ~]$ ifconfig bridge0 +bridge0: flags=1008843 metric 0 mtu 1500 + options=0 + ether 58:9c:fc:10:6c:2e + id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15 + maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200 + root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0 + member: ixl1 flags=143 + ifmaxaddr 0 port 4 priority 128 path cost 800 + member: ixl0 flags=143 + ifmaxaddr 0 port 3 priority 128 path cost 800 + groups: bridge + nd6 options=9 + +``` + +One thing that I quickly realize, is that FreeBSD, when using hyperthreading, does have 8 threads +available, but only 4 of them participate in forwarding. When I put the machine under load, I see a +curious 399% spent in _kernel_ while I see 402% in _idle_: + +{{< image src="/assets/freebsd-vpp/top-kernel-bridge.png" alt="FreeBSD top" >}} + +When I then do a single-flow unidirectional loadtest, the expected outcome is that only one CPU +participates (100% in _kernel_ and 700% in _idle_) and if I perform a single-flow bidirectional +loadtest, my expectations are confirmed again, seeing two CPU threads do the work (200% in _kernel_ +and 600% in _idle_). + +While the math checks out, the performance is a little bit less impressive: + +| Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate | +| ---- | --------- | ----------- | -------- | --------- | +| vm=var2,size=1514 | Unidirectional | 2.02Mpps | 24.77Gbps | 99% | +| vm=var2,size=imix | Unidirectional | 3.48Mpps | 10.23Gbps | 43% | +| vm=var2,size=64 | Unidirectional | 3.61Mpps | 2.43Gbps | 9.7% | +| size=64 | Unidirectional | 1.22Mpps | 0.82Gbps | 3.2% | +| vm=var2,size=1514 | Bidirectional | 3.77Mpps | 46.31Gbps | 93% | +| vm=var2,size=imix | Bidirectional | 3.81Mpps | 11.22Gbps | 24% | +| vm=var2,size=64 | Bidirectional | 4.02Mpps | 2.69Gbps | 5.4% | +| size=64 | Bidirectional | 2.29Mpps | 1.54Gbps | 3.1% | + +***Conclusion***: FreeBSD's kernel on this Xeon-D1518 processor can handle about 1.2Mpps per CPU +thread, and I can use only four of them. FreeBSD is happy to forward big packets, and I can +reasonably reach 2x25Gbps but once I start ramping up the packets/sec by lowering the packet size, +things very quickly deteriorate. + +### FreeBSD 14: netmap Bridge + +Tom pointed out a tool in the source tree, called the _netmap bridge_ originally written by Luigi +Rizzo and Matteo Landi. FreeBSD ships the source code, but you can also take a look at their GitHub +repository [[ref](https://github.com/luigirizzo/netmap/blob/master/apps/bridge/bridge.c)]. + +What is _netmap_ anyway? It's a framework for extremely fast and efficient packet I/O for userspace +and kernel clients, and for Virtual Machines. It runs on FreeBSD, Linux and some versions of +Windows. 
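+
+Since _netmap_ is compiled into the GENERIC kernel, a quick way to confirm that a host actually has
+it available is to look for its control device and its sysctl tree, something like:
+
+```
+[pim@france ~]$ ls -l /dev/netmap     # the control device that netmap clients open
+[pim@france ~]$ sysctl dev.netmap     # lists the netmap knobs (admode, buf_size, ring_size, ...)
+```
+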
As an aside, my buddy Pavel from FastNetMon pointed out a blogpost from 2015 in which +Cloudflare folks described a way to do DDoS mitigation on Linux using traffic classification to +program the network cards to move certain offensive traffic to a dedicated hardware queue, and +service that queue from a _netmap_ client. If you're curious (I certainly was!), you might take a +look at that cool write-up +[[here](https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap)]. + +I compile the code and put it to work, and the man-page tells me that I need to fiddle with the +interfaces a bit. They need to be: + +* set to _promiscuous_, which makes sense as they have to receive ethernet frames sent to MAC + addresses other than their own +* turn off any hardware offloading, notably `-rxcsum -txcsum -tso4 -tso6 -lro` +* my user needs write permission to `/dev/netmap` to bind the interfaces from userspace. + +``` +[pim@france /usr/src/tools/tools/netmap]$ make +[pim@france /usr/src/tools/tools/netmap]$ cd /usr/obj/usr/src/amd64.amd64/tools/tools/netmap +[pim@france .../tools/netmap]$ sudo ifconfig ixl0 -rxcsum -txcsum -tso4 -tso6 -lro promisc +[pim@france .../tools/netmap]$ sudo ifconfig ixl1 -rxcsum -txcsum -tso4 -tso6 -lro promisc +[pim@france .../tools/netmap]$ sudo chmod 660 /dev/netmap +[pim@france .../tools/netmap]$ ./bridge -i netmap:ixl0 -i netmap:ixl1 +065.804686 main [290] ------- zerocopy supported +065.804708 main [297] Wait 4 secs for link to come up... +075.810547 main [301] Ready to go, ixl0 0x0/4 <-> ixl1 0x0/4. +``` + +{{< image width="80px" float="left" src="/assets/freebsd-vpp/warning.png" alt="Warning" >}} + +I start my first loadtest, which pretty immediately fails. It's an interesting behavior pattern which +I've not seen before. After staring at the problem, and reading the code of `bridge.c`, which is a +remarkably straight forward program, I restart the bridge utility, and traffic passes again but only +for a little while. Whoops! + +I took a [[screencast](/assets/freebsd-vpp/netmap_bridge.cast)] in case any kind soul on freebsd-net +wants to take a closer look at this: + +{{< image src="/assets/freebsd-vpp/netmap_bridge.gif" alt="FreeBSD netmap Bridge" >}} + +I start a bit of trial and error in which I conclude that if I send **a lot** of traffic (like 10Mpps), +forwarding is fine; but if I send **a little** traffic (like 1kpps), at some point forwarding stops +alltogether. So while it's not great, this does allow me to measure the total throughput just by +sending a lot of traffic, say 30Mpps, and seeing what amount comes out the other side. + +Here I go, and I'm having fun: + +| Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate | +| ---- | --------- | ----------- | -------- | --------- | +| vm=var2,size=1514 | Unidirectional | 2.04Mpps | 24.72Gbps | 100% | +| vm=var2,size=imix | Unidirectional | 8.16Mpps | 23.76Gbps | 100% | +| vm=var2,size=64 | Unidirectional | 10.83Mpps | 5.55Gbps | 29% | +| size=64 | Unidirectional | 11.42Mpps | 5.83Gbps | 31% | +| vm=var2,size=1514 | Bidirectional | 3.91Mpps | 47.27Gbps | 96% | +| vm=var2,size=imix | Bidirectional | 11.31Mpps | 32.74Gbps | 77% | +| vm=var2,size=64 | Bidirectional | 11.39Mpps | 5.83Gbps | 15% | +| size=64 | Bidirectional | 11.57Mpps | 5.93Gbps | 16% | + +***Conclusion***: FreeBSD's _netmap_ implementation is also bound by packets/sec, and in this +setup, the Xeon-D1518 machine is capable of forwarding roughly 11.2Mpps. 
What I find cool is that +single flow or multiple flows doesn't seem to matter that much, in fact bidirectional 64b single +flow loadtest was most favorable at 11.57Mpps, which is _an order of magnitude_ better than using just +the kernel (which clocked in at 1.2Mpps). + +### FreeBSD 14: VPP with netmap + +It's good to have a baseline on this machine on how the FreeBSD kernel itself performs. But of +course this series is about Vector Packet Processing, so I now turn my attention to the VPP branch +that Tom shared with me. I wrote a bunch of details about the VM and bare metal install in my +[[first article]({% post_url 2024-02-10-vpp-freebsd-1 %})] so I'll just go straight to the +configuration parts: + +``` +DBGvpp# create netmap name ixl0 +DBGvpp# create netmap name ixl1 +DBGvpp# set int state netmap-ixl0 up +DBGvpp# set int state netmap-ixl1 up +DBGvpp# set int l2 xconnect netmap-ixl0 netmap-ixl1 +DBGvpp# set int l2 xconnect netmap-ixl1 netmap-ixl0 + +DBGvpp# show int + Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count +local0 0 down 0/0/0/0 +netmap-ixl0 1 up 9000/0/0/0 rx packets 25622 + rx bytes 1537320 + tx packets 25437 + tx bytes 1526220 +netmap-ixl1 2 up 9000/0/0/0 rx packets 25437 + rx bytes 1526220 + tx packets 25622 + tx bytes 1537320 +``` + +At this point I can pretty much rule out that the _netmap_ `bridge.c` is the issue, because a +few seconds after introducing 10Kpps of traffic and seeing it successfully pass, the loadtester +receives no more packets, even though T-Rex is still sending it. However, about a minute later +I can _also_ see the RX **and** TX counters continue to increase in the VPP dataplane: + +``` +DBGvpp# show int + Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count +local0 0 down 0/0/0/0 +netmap-ixl0 1 up 9000/0/0/0 rx packets 515843 + rx bytes 30950580 + tx packets 515657 + tx bytes 30939420 +netmap-ixl1 2 up 9000/0/0/0 rx packets 515657 + rx bytes 30939420 + tx packets 515843 + tx bytes 30950580 +``` + +.. and I can see that every packet that VPP received is accounted for: interface `ixl0` has received +515843 packets, and `ixl1` claims to have transmitted _exactly_ that amount of packets. So I think +perhaps they are getting lost somewhere on egress between the kernel and the Intel i710-XXV network +card. + +However, counter to the previous case, I cannot sustain any reasonable amount of traffic, be it +1Kpps, 10Kpps or 10Mpps, the system pretty consistently comes to a halt mere seconds after +introducing the load. Restarting VPP makes it forward traffic again for a few seconds, just to end +up in the same upset state. I don't learn much. + +***Conclusion***: This setup with VPP using _netmap_ does not yield results, for the moment. I have a +suspicion that whatever the cause is of the _netmap_ bridge in the previous test, is likely also the +culprit for this test. + +### FreeBSD 14: VPP with DPDK + +But not all is lost - I have one test left, and judging by what I learned last week when bringing up +the first test environment, this one is going to be a fair bit better. 
In my previous loadtests, the
+network interfaces were on their usual kernel driver (`ixl(4)` in the case of the Intel i710-XXV
+interfaces), but now I'm going to mix it up a little, and rebind these interfaces to a specific DPDK
+driver called `nic_uio(4)`, which stands for _Network Interface Card Userspace Input/Output_:
+
+```
+[pim@france ~]$ cat << EOF | sudo tee -a /boot/loader.conf
+nic_uio_load="YES"
+hw.nic_uio.bdfs="6:0:0,6:0:1"
+EOF
+```
+
+After I reboot, the network interfaces are gone from the output of `ifconfig(8)`, which is good. I
+start up VPP with a minimal config file [[ref](/assets/freebsd-vpp/startup.conf)], which defines
+three worker threads and starts DPDK with 3 RX queues and 4 TX queues. It's a common question why
+there would be one more TX queue. The explanation is that in VPP, there is one (1) _main_ thread and
+zero or more _worker_ threads. If the _main_ thread wants to send traffic (for example, in a plugin
+like _LLDP_ which sends periodic announcements), it would be most efficient to use a transmit queue
+specific to that _main_ thread. Any return traffic will be picked up by the _DPDK Process_ on worker
+threads (as _main_ does not have one of these). That's why the general rule is num(TX) = num(RX)+1.
+
+```
+[pim@france ~/src/vpp]$ export STARTUP_CONF=/home/pim/src/startup.conf
+[pim@france ~/src/vpp]$ gmake run-release
+
+vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/0 TwentyFiveGigabitEthernet6/0/1
+vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/1 TwentyFiveGigabitEthernet6/0/0
+vpp# set int state TwentyFiveGigabitEthernet6/0/0 up
+vpp# set int state TwentyFiveGigabitEthernet6/0/1 up
+vpp# show int
+              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
+TwentyFiveGigabitEthernet6/0/0    1      up          9000/0/0/0     rx packets           11615035382
+                                                                    rx bytes           1785998048960
+                                                                    tx packets             700076496
+                                                                    tx bytes            161043604594
+TwentyFiveGigabitEthernet6/0/1    2      up          9000/0/0/0     rx packets             700076542
+                                                                    rx bytes            161043674054
+                                                                    tx packets           11615035440
+                                                                    tx bytes           1785998136540
+local0                            0     down          0/0/0/0
+
+```
+
+And with that, the dataplane shoots to life and starts forwarding (lots of) packets. To my great
+relief, sending either 1kpps or 1Mpps "just works". I can run my loadtest as per normal, first with
+1514 byte packets, then imix, then 64 byte packets, and finally single-flow 64 byte packets. And of
+course, both unidirectionally and bidirectionally.
+
+I take a look at the system load while the loadtests are running:
+
+{{< image src="/assets/freebsd-vpp/top-vpp-dpdk.png" alt="FreeBSD top" >}}
+
+It is fully expected that the VPP process is spinning 300%+epsilon of CPU time. This is because it
+has started three _worker_ threads, and these are executing the DPDK _Poll Mode Driver_ which is
+essentially a tight loop that asks the network cards for work, and if there are any packets
+arriving, executes on that work. As such, each _worker_ thread is always burning 100% of its
+assigned CPU.
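+
+The actual `startup.conf` is linked above; the part that matters for the worker and queue layout
+boils down to something like this (a sketch, not the verbatim file, assuming the 25G ports sit at
+PCI bus 6 as configured in `loader.conf`):
+
+```
+unix { interactive }
+cpu  { main-core 0 workers 3 }             # 1 main thread plus 3 workers
+dpdk {
+  dev 0000:06:00.0 { num-rx-queues 3 }     # one RX queue per worker;
+  dev 0000:06:00.1 { num-rx-queues 3 }     # VPP adds a 4th TX queue for the main thread
+}
+```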
+ +That said, I can take a look at finer grained statistics in the dataplane itself: + + +``` +vpp# show run +Thread 0 vpp_main (lcore 0) +Time .9, 10 sec internal node vector rate 0.00 loops/sec 297041.19 + vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +ip4-full-reassembly-expire-wal any wait 0 0 18 2.39e3 0.00 +ip6-full-reassembly-expire-wal any wait 0 0 18 3.08e3 0.00 +unix-cli-process-0 active 0 0 9 7.62e4 0.00 +unix-epoll-input polling 13066 0 0 1.50e5 0.00 +--------------- +Thread 1 vpp_wk_0 (lcore 1) +Time .9, 10 sec internal node vector rate 12.38 loops/sec 1467742.01 + vector rates in 5.6294e6, out 5.6294e6, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +TwentyFiveGigabitEthernet6/0/1 active 399663 5047800 0 2.20e1 12.63 +TwentyFiveGigabitEthernet6/0/1 active 399663 5047800 0 9.54e1 12.63 +dpdk-input polling 1531252 5047800 0 1.45e2 3.29 +ethernet-input active 399663 5047800 0 3.97e1 12.63 +l2-input active 399663 5047800 0 2.93e1 12.63 +l2-output active 399663 5047800 0 2.53e1 12.63 +unix-epoll-input polling 1494 0 0 3.09e2 0.00 + +(et cetera) +``` +I showed only one _worker_ thread's output, but there are actually three _worker_ threads, and they are +all doing similar work, because they are picking up 33% of the traffic each assigned to the three RX +queues in the network card. + +While the overall CPU load is 300%, here I can see a different picture. Thread 0 (the _main_ thread) +is doing essentially ~nothing. It is polling a set of unix sockets in the node called +`unix-epoll-input`, but other than that, _main_ doesn't have much on its plate. Thread 1 however is +a _worker_ thread, and I can see that it is busy doing work: + +* `dpdk-input`: it's polling the NIC for work, it has been called 1.53M times, and in total it has + handled just over 5.04M _vectors_ (which are packets). So I can derive, that each time the _Poll + Mode Driver_ gives work, on average there are 3.29 _vectors_ (packets), and each packet is + taking about 145 CPU clocks. +* `ethernet-input`: The DPDK vectors are all ethernet frames coming from the loadtester. Seeing as + I have cross connected all traffic from Tf6/0/0 to Tf6/0/1 and vice-versa, VPP knows that it + should handle the packets in the L2 forwarding path. +* `l2-input` is called with the (list of N) ethernet frames, which all get cross connected to the + output interface, in this case Tf6/0/1. +* `l2-output` prepares the ethernet frames for output into their egress interface. +* `TwentyFiveGigabitEthernet6/0/1-output` (**Note**: the name is truncated) If this were to have + been L3 traffic, this would be the place where the destination MAC address is inserted into the + ethernet frame, but since this is an L2 cross connect, the node simply passes the ethernet frames + through to the final egress node in DPDK. +* `TwentyFiveGigabitEthernet6/0/1-tx` (**Note**: the name is truncated) hands them to the DPDK + driver for marshalling on the wire. 
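+
+As a quick sanity check on thread 1's numbers: summing the _Clocks_ column (everything except
+`unix-epoll-input`) gives the cost per packet, and multiplying that by the packet rate should land
+near the CPU's clock frequency:
+
+```
+ 145 (dpdk-input) + 39.7 (ethernet-input) + 29.3 (l2-input) + 25.3 (l2-output)
+ + 22.0 + 95.4 (the two truncated Tf6/0/1 output nodes)   ≈ 357 clocks/packet
+
+ 5.63 Mpps x 357 clocks/packet                            ≈ 2.0 GHz
+```
+
+That is comfortably within the Xeon-D1518's 2.2GHz budget. This little bit of bookkeeping becomes
+interesting below, where the same sums stop adding up.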
+ +Halfway through, I see that there's an issue with the distribution of ingress traffic over the +three workers, maybe you can spot it too: + +``` +--------------- +Thread 1 vpp_wk_0 (lcore 1) +Time 56.7, 10 sec internal node vector rate 38.59 loops/sec 106879.84 + vector rates in 7.2982e6, out 7.2982e6, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +TwentyFiveGigabitEthernet6/0/0 active 6689553 206899956 0 1.34e1 30.93 +TwentyFiveGigabitEthernet6/0/0 active 6689553 206899956 0 1.37e2 30.93 +TwentyFiveGigabitEthernet6/0/1 active 6688572 206902836 0 1.45e1 30.93 +TwentyFiveGigabitEthernet6/0/1 active 6688572 206902836 0 1.34e2 30.93 +dpdk-input polling 7128012 413802792 0 8.77e1 58.05 +ethernet-input active 13378125 413802792 0 2.77e1 30.93 +l2-input active 6809002 413802792 0 1.81e1 60.77 +l2-output active 6809002 413802792 0 1.68e1 60.77 +unix-epoll-input polling 6954 0 0 6.61e2 0.00 +--------------- +Thread 2 vpp_wk_1 (lcore 2) +Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7702.68 + vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +TwentyFiveGigabitEthernet6/0/0 active 456112 116764672 0 1.27e1 256.00 +TwentyFiveGigabitEthernet6/0/0 active 456112 116764672 0 2.64e2 256.00 +TwentyFiveGigabitEthernet6/0/1 active 456112 116764672 0 1.39e1 256.00 +TwentyFiveGigabitEthernet6/0/1 active 456112 116764672 0 2.74e2 256.00 +dpdk-input polling 456112 233529344 0 1.41e2 512.00 +ethernet-input active 912224 233529344 0 5.71e1 256.00 +l2-input active 912224 233529344 0 3.66e1 256.00 +l2-output active 912224 233529344 0 1.70e1 256.00 +unix-epoll-input polling 445 0 0 9.59e2 0.00 +--------------- +Thread 3 vpp_wk_2 (lcore 3) +Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7742.43 + vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +TwentyFiveGigabitEthernet6/0/0 active 456113 116764928 0 8.94e0 256.00 +TwentyFiveGigabitEthernet6/0/0 active 456113 116764928 0 2.81e2 256.00 +TwentyFiveGigabitEthernet6/0/1 active 456113 116764928 0 9.54e0 256.00 +TwentyFiveGigabitEthernet6/0/1 active 456113 116764928 0 2.72e2 256.00 +dpdk-input polling 456113 233529856 0 1.61e2 512.00 +ethernet-input active 912226 233529856 0 4.50e1 256.00 +l2-input active 912226 233529856 0 2.93e1 256.00 +l2-output active 912226 233529856 0 1.23e1 256.00 +unix-epoll-input polling 445 0 0 1.03e3 0.00 +``` + +Thread 1 (`vpp_wk_0`) is handling 7.29Mpps and moderately loaded, while Thread 2 and 3 are handling +each 4.11Mpps and are completely pegged. That said, the relative amount of CPU clocks they are +spending per packet is reasonably similar, but they don't quite add up: + +* Thread 1 is doing 7.29Mpps and is spending on average 449 CPU cycles per packet. I get this + number by adding up all of the values in the _Clocks_ column, except for the `unix-epoll-input` + node. But that's somewhat strange, because this Xeon D1518 clocks at 2.2GHz -- and yet 7.29M * + 449 is 3.27GHz. My experience (in Linux) is that these numbers actually line up quite well. +* Thread 2 is doing 4.12Mpps and is spending on average 816 CPU cycles per packet. This kind of + makes sense as the cycles/packet is roughly double that of thread 1, and the packet/sec is + roughly half ... and the total of 4.12M * 816 is 3.36GHz. 
+* I can see similar values for thread 3: 4.12Mpps and also 819 CPU cycles per packet, which
+  amounts to VPP self-reporting using 3.37GHz worth of cycles on this thread.
+
+When I look at the thread to CPU placement, I get another surprise:
+
+```
+vpp# show threads
+ID     Name                Type        LWP     Sched Policy (Priority)  lcore  Core   Socket State
+0      vpp_main                        100346  (nil) (n/a)              0      42949674294967
+1      vpp_wk_0            workers     100473  (nil) (n/a)              1      42949674294967
+2      vpp_wk_1            workers     100474  (nil) (n/a)              2      42949674294967
+3      vpp_wk_2            workers     100475  (nil) (n/a)              3      42949674294967
+
+vpp# show cpu
+Model name:               Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz
+Microarch model (family): [0x6] Broadwell ([0x56] Broadwell DE) stepping 0x3
+Flags:                    sse3 pclmulqdq ssse3 sse41 sse42 avx rdrand avx2 bmi2 rtm pqm pqe
+                          rdseed aes invariant_tsc
+Base frequency:           2.19 GHz
+```
+
+The numbers in `show threads` are all messed up, and I don't quite know what to make of it yet. I
+think the perhaps overly specific Linux implementation of the thread pool management is throwing off
+FreeBSD a bit. Perhaps some profiling could be useful, so I make a note to discuss this with Tom or
+the freebsd-net mailing list, who will know a fair bit more about this type of stuff on FreeBSD than
+I do.
+
+Anyway, functionally: this works. Performance-wise: I have some questions :-) I let all eight
+loadtests complete and without further ado, here's the results:
+
+| Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
+| ---- | --------- | ----------- | -------- | --------- |
+| vm=var2,size=1514 | Unidirectional | 2.01Mpps | 24.45Gbps | 99% |
+| vm=var2,size=imix | Unidirectional | 8.07Mpps | 23.42Gbps | 99% |
+| vm=var2,size=64 | Unidirectional | 23.93Mpps | 12.25Gbps | 64% |
+| size=64 | Unidirectional | 12.80Mpps | 6.56Gbps | 34% |
+| vm=var2,size=1514 | Bidirectional | 3.91Mpps | 47.35Gbps | 86% |
+| vm=var2,size=imix | Bidirectional | 13.38Mpps | 38.81Gbps | 82% |
+| vm=var2,size=64 | Bidirectional | 15.56Mpps | 7.97Gbps | 21% |
+| size=64 | Bidirectional | 20.96Mpps | 10.73Gbps | 28% |
+
+***Conclusion***: I have to say: 12.8Mpps on a unidirectional 64b single-flow loadtest (thereby only
+being able to make use of one DPDK worker), and 20.96Mpps on a bidirectional 64b single-flow
+loadtest, is not too shabby. But seeing as one CPU thread can do 12.8Mpps, I would imagine that
+three CPU threads would perform at 38.4Mpps or thereabouts, but I'm seeing only 23.9Mpps and some
+unexplained variance in per-thread performance.
+
+## Results
+
+I learned a lot! Some highlights:
+1. The _netmap_ implementation is not playing ball for the moment, as forwarding stops consistently, in
+   both the `bridge.c` as well as the VPP plugin.
+1. It is clear, though, that _netmap_ is a fair bit faster (11.4Mpps) than _kernel forwarding_ which came in at
+   roughly 1.2Mpps per CPU thread. What's a bit troubling is that _netmap_ doesn't seem to work
+   very well in VPP -- traffic forwarding also stops here.
+1. DPDK performs quite well on FreeBSD: I manage to see a throughput of 20.96Mpps which is almost
+   twice the throughput of _netmap_, which is cool but I can't quite explain the stark variance
+   in throughput between the worker threads. Perhaps VPP is placing the workers on hyperthreads?
+   Perhaps an equivalent of `isolcpus` in the Linux kernel would help?
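+
+If the hyperthread theory holds up, the first things I'd try are checking which logical CPUs are SMT
+siblings, and pinning the workers onto distinct physical cores -- something like this (a sketch,
+untested on this box):
+
+```
+[pim@france ~]$ sysctl kern.sched.topology_spec     # the THREAD group entries show SMT siblings
+[pim@france ~]$ cpuset -g -p $(pgrep -x vpp)        # which CPUs the running dataplane may use
+```
+
+and then, in `startup.conf`, placing the workers explicitly:
+
+```
+cpu { main-core 0 corelist-workers 1-3 }
+```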
+ +For the curious, I've bundled up a few files that describe the machine and its setup: +[[dmesg](/assets/freebsd-vpp/france-dmesg.txt)] +[[pciconf](/assets/freebsd-vpp/france-pciconf.txt)] +[[loader.conf](/assets/freebsd-vpp/france-loader.conf)] +[[VPP startup.conf](/assets/freebsd-vpp/france-startup.conf)] diff --git a/content/articles/2024-03-06-vpp-babel-1.md b/content/articles/2024-03-06-vpp-babel-1.md new file mode 100644 index 0000000..fcd7a77 --- /dev/null +++ b/content/articles/2024-03-06-vpp-babel-1.md @@ -0,0 +1,644 @@ +--- +date: "2024-03-06T20:17:54Z" +title: VPP with Babel - Part 1 +--- + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +# About this series + +Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its +performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic +_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches +are shared between the two. Thanks to the [[Linux ControlPlane]({% post_url 2021-08-12-vpp-1 %})] +plugin, higher level control plane software becomes available, that is to say: things like BGP, +OSPF, LDP, VRRP and so on become quite natural for VPP. + +IPng Networks is a small service provider that has built a network based entirely on open source: +[[Debian]({% post_url 2023-12-17-defra0-debian %})] servers with widely available Intel and Mellanox +10G/25G/100G network cards, paired with [[VPP](https://fd.io/)] for the dataplane, and +[[Bird2](https://bird.nic.cz/)] for the controlplane. + +As a small provider, I am well aware of the cost of IPv4 address space. Long gone are the times at +which an initial allocation was a /19, and subsequent allocations usually a /20 based on +justification. Then it watered down to a /22 for new _Local Internet Registries_, then that became a +/24 for new _LIRs_, and ultimately we ran out. What was once a plentiful resource, has now become a +very constrained resource. + +In this first article, I want to show a rather clever way to conserve IPv4 addresses by exploring +one of the newer routing protocols: Babel. + +## 🙁 A sad waste + +I have to go back to something very fundamental about routing. When RouterA holds a routing table, +it will associate prefixes with next-hops and their associated interfaces. When RouterA gets a +packet, it'll look up the destination address, and then forward the packet on to RouterB which is +the next router in the path towards the destination: + +1. RouterA does a route lookup in its routing table. For destination `192.0.2.1`, the covering + prefix is `192.0.2.0/24` and it might find that it can reach it via IPv4 next hop `100.64.0.1`. +1. RouterA then does another lookup in its routing table, to figure out how can it reach + `100.64.0.1`. It may find that this address is directly connected, say to interface `eth0`, on + which RouterA is `100.64.0.2/30`. +1. Assuming that `eth0` is an ethernet device, which the vast majority of interfaces are, then + RouterA can look up the link-layer address for that IPv4 address `100.64.0.1`, by using ARP. +1. The ARP request asks, quite literally `who-has 100.64.0.1?` using a broadcast message on + `eth0`, to which the other RouterB will answer `100.64.0.1 is-at 90:e2:ba:3f:ca:d5`. +1. Now that RouterA knows that, it can forward along the IP packet out on its `eth0` device and + towards `90:e2:ba:3f:ca:d5`. Huzzah. 
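+
+On the wire, steps 3 through 5 are just the familiar ARP exchange. With `tcpdump -eni eth0 arp` on
+RouterA it would look roughly like this (RouterA's MAC address here is made up for the illustration):
+
+```
+90:e2:ba:3f:ca:d4 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806): Request who-has 100.64.0.1 tell 100.64.0.2
+90:e2:ba:3f:ca:d5 > 90:e2:ba:3f:ca:d4, ethertype ARP (0x0806): Reply 100.64.0.1 is-at 90:e2:ba:3f:ca:d5
+```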
+ +## 🥰 A clever trick + +I can't help but notice that the only purpose of having the `100.64.0.0/30` transit network between +these two routers is to: + +1. provide the routers the ability to resolve IPv4 next hops towards link-layer MAC addresses, + using ARP resolution. +1. provide a means for the routers to send ICMP messages, for example in a traceroute, each hop + along the way will respond with an TTL exceeded message. And I do like traceroutes! + +Let me discuss these two purposes in more detail: + +### 1. IPv4 ARP, née IPv6 NDP + +{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}} + +One really neat trick is simply replacing ARP resolution by something that can resolve the +link-layer MAC address in a different way. As it turns out, IPv6 has an equivalent that's +called _Neighbor Discovery Protocol_ in which a router can determine the link-layer address of a +neighbor, or to verify that a neighbor is still reachable via a cached link-layer address. This uses +ICMPv6 to send out a query with the _Neighbor Solicitation_, which is followed by a response in the +form of a _Neighbor Advertisement_. + +Why am I talking about IPv6 neighbor discovery when I'm explaining IPv4 forwarding, you may be +wondering? Well, because of this neat trick that the IPv4 prefix brokers don't want you to know: + +``` +pim@vpp0-0:~$ sudo ip ro add 192.0.2.0/24 via inet6 fe80::5054:ff:fef0:1110 dev e1 + +pim@vpp0-0:~$ ip -br a show e1 +e1 UP fe80::5054:ff:fef0:1101/64 +pim@vpp0-0:~$ ip ro get 192.0.2.0 +192.0.2.0 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0 + cache +pim@vpp0-0:~$ ip neighbor | grep fe80::5054:ff:fef0:1110 +fe80::5054:ff:fef0:1110 dev e1 lladdr 52:54:00:f0:11:10 REACHABLE + +pim@vpp0-0:~$ sudo tcpdump -evni e1 host 192.0.2.0 +tcpdump: listening on e1, link-type EN10MB (Ethernet), snapshot length 262144 bytes +16:21:30.002878 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype IPv4 (0x0800), length 98: + (tos 0x0, ttl 64, id 21521, offset 0, flags [DF], proto ICMP (1), length 84) + 192.168.10.0 > 192.0.2.0: ICMP echo request, id 54710, seq 20, length 64 +``` + +While it looks counter-intuitive at first, this is actually pretty straight forward. When the router +gets a packet destined for `192.0.2.0/24`, it will know that the next hop is some link-local IPv6 +address, which it can resolve by _NDP_ on ethernet interface `e1`. It can then simply forward the +IPv4 datagram to the MAC address it found. + +Who would've thunk that you do not need ARP or even IPv4 on the interface at all? + +### 2. Originating ICMP messages + +{{< image width="200px" float="right" src="/assets/vpp-babel/too-big.png" alt="Too Big" >}} + +The Internet Control Message Protocol is described in +[[RFC792](https://datatracker.ietf.org/doc/html/rfc792)]. It's mostly used to carry diagnostic and +debugging information, either originated by end hosts, for example the "destination unreachable, +port unreachable" types of messages, but they may also be originated by intermediate routers, for +example with most other kinds of "destination unreachable" packets. + +Path MTU Discovery, described in [[RFC1191](https://datatracker.ietf.org/doc/html/rfc1191)] allows a +host to discover the maximum packet size that a route is able to carry. 
There's a few different +types of _PMTUd_, but the most common one uses ICMPv4 packets coming from these intermediate +routers, informing them that packets which are marked as un-fragmentable, will not be able to be +transmitted due to them being too large. + +Without the ability for a router to signal these ICMPv4 packets, end to end connectivity quality +might break undetected. So, every router that is able to forward IPv4 traffic SHOULD be able +originate ICMPv4 traffic. + +If you're curious, you can read more in this [[IETF +Draft](https://www.ietf.org/archive/id/draft-chroboczek-int-v4-via-v6-01.html)] from Juliusz +Chroboczek et al. It's really insightful, yet elegant. + +{{< image width="200px" float="right" src="/assets/vpp-babel/Babel_logo_black.svg" alt="Babel Logo" >}} + +## Introducing Babel + +I've learned so far that I (a) **MAY** use IPv6 link-local networks in order to _forward_ IPv4 +packets, as I can use IPv6 _NDP_ to find the link-layer next hop; and (b) each router **SHOULD** be +able to _originate_ ICMPv4 packets, therefore it needs _at least one_ IPv4 address. + +These two claims mean that I need _at most one_ IPv4 address on each router. Could it be?! + +**Babel** is a loop-avoiding distance-vector routing protocol that is designed to be robust and +efficient both in networks using prefix-based routing and in networks using flat routing ("mesh +networks"), and both in relatively stable wired networks and in highly dynamic wireless networks. + +The definitive [[RFC8966](https://datatracker.ietf.org/doc/html/rfc8966)] describes it in great +detail, and previous work are in [[RFC7557](https://datatracker.ietf.org/doc/html/rfc7557)] and +[[RFC6126](https://datatracker.ietf.org/doc/html/rfc6126)]. Lots of reading :) Babel is a _hybrid_ +routing protocol, in the sense that it can carry routes for multiple network-layer protocols (IPv4 +and IPv6), regardless of which protocol the Babel packets are themselves being carried over. + +I quickly realise that Babel is hybrid in a different and very interesting way: it can set next-hops +across address families, which is described in [[RFC9229](https://datatracker.ietf.org/doc/html/rfc9229)]: + +> When a packet is routed according to a given routing table entry, the forwarding plane typically +> uses a neighbour discovery protocol (the Neighbour Discovery (ND) protocol +> [[RFC4861](https://datatracker.ietf.org/doc/html/rfc4861)] in the case of IPv6 and the Address +> Resolution Protocol (ARP) [[RFC826](https://datatracker.ietf.org/doc/html/rfc826)] in the case of +> IPv4) to map the next-hop address to a link-layer address (a "Media Access Control (MAC) +> address"), which is then used to construct the link-layer frames that encapsulate forwarded +> packets. +> +> It is apparent from the description above that there is no fundamental reason why the destination +> prefix and the next-hop address should be in the same address family: there is nothing preventing +> an IPv6 packet from being routed through a next hop with an IPv4 address (in which case the next +> hop's MAC address will be obtained using ARP) or, conversely, an IPv4 packet from being routed +> through a next hop with an IPv6 address. (In fact, it is even possible to store link-layer +> addresses directly in the next-hop entry of the routing table, which is commonly done in networks +> using the OSI protocol suite). + +### Babel and Bird2 + +There's an implementation of Babel in Bird2, the routing solution that I use at AS8298. 
What made me +extra enthusiastic, is that I found out the functionality described in RFC9229 was committed about a +year ago in Bird2 +[[ref](https://gitlab.nic.cz/labs/bird/-/commit/eecc3f02e41bcb91d463c4c1189fd56bc44e6514)], with a +hat-tip to Toke Høiland-Jørgensen. + +The Debian machines at IPng are current (Bookworm 12.5), but Debian still ships a version older than +this commit, so my first order of business is to get a Debian package: + +``` +pim@summer:~/src$ sudo apt install devscripts +pim@summer:~/src$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14.orig.tar.gz +pim@summer:~/src$ tar xzf bird2_2.14.orig.tar.gz +pim@summer:~/src/bird-2.14$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14-1.debian.tar.xz +pim@summer:~/src/bird-2.14$ tar xf bird2_2.14-1.debian.tar.xz +pim@summer:~/src/bird-2.14$ sudo mk-build-deps -i +pim@summer:~/src/bird-2.14$ sudo dpkg-buildpackage -b -uc -us +``` + +And that yields me a fresh Bird 2.14 package. I can't help but wonder though, why did the semantic +versioning [[ref](https://semver.org/)] of `2.0.X` change to `2.14`? I found an answer in the NEWS +file of the 2.13 release +[[link](https://gitlab.nic.cz/labs/bird/-/blob/7d2c7d59a363e690995eb958959f0bc12445355c/NEWS#L45-50)]. +It's a little bit of a disappointment, but I quickly get over myself because I want to take this +Babel-Bird out for a test flight. Thank you for the Babel-Bird-Build, Summer! + +### Babel and the LAB + +I decide to take an IPng [[lab]({% post_url 2022-10-14-lab-1 %})] out for a spin. These labs come +with four VPP routers and two Debian machines connected like so: + +{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}} + +The configuration snippet for Bird2 is very simple, as most of the defaults are sensible: + +``` +pim@vpp0-0:~$ cat << EOF | sudo tee -a /etc/bird/bird.conf +protocol babel { + interface "e*" { + type wired; + extended next hop on; + }; + ipv6 { import all; export all; }; + ipv4 { import all; export all; }; +} +EOF + +pim@vpp0-0:~$ birdc show babel interfaces +BIRD 2.14 ready. +babel1: +Interface State Auth RX cost Nbrs Timer Next hop (v4) Next hop (v6) +e1 Up No 96 1 0.958 :: fe80::5054:ff:fef0:1101 + +pim@vpp0-0:~$ birdc show babel neigh +BIRD 2.14 ready. +babel1: +IP address Interface Metric Routes Hellos Expires Auth RTT (ms) +fe80::5054:ff:fef0:1110 e1 96 8 16 5.003 No 4.831 + +pim@vpp0-0:~$ birdc show babel entries +BIRD 2.14 ready. +babel1: +Prefix Router ID Metric Seqno Routes Sources +192.168.10.0/32 00:00:00:00:c0:a8:0a:00 0 1 0 0 +192.168.10.0/24 00:00:00:00:c0:a8:0a:00 0 1 1 0 +192.168.10.1/32 00:00:00:00:c0:a8:0a:01 96 7 1 0 +2001:678:d78:200::/128 00:00:00:00:c0:a8:0a:00 0 1 0 0 +2001:678:d78:200::/60 00:00:00:00:c0:a8:0a:00 0 1 1 0 +2001:678:d78:200::1/128 00:00:00:00:c0:a8:0a:01 96 7 1 0 +``` + +Based on this simple configuration, Bird2 will start the babel protocol on `e0` and `e1`, and it +quickly finds a neighbor with which it establishes an adjacency. Looking at the routing protocol +database (called _entries_), I can see my own IPv4 and IPv6 loopbacks (192.168.10.0 and +2001:678:d78:200::), the neighbor's IPv4 and IPv6 loopbacks (192.168.10.1 and 201:678:d78:200::1), +and finally the two supernets (192.168.10.0/24 and 2001:678:d78:200::/60). 
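+
+For completeness: those two supernets show up because I originate them elsewhere in the Bird2
+configuration, typically with nothing more than a static protocol per address family, along these
+lines (my actual config may differ a bit):
+
+```
+protocol static static4 {
+  ipv4;
+  route 192.168.10.0/24 unreachable;
+}
+
+protocol static static6 {
+  ipv6;
+  route 2001:678:d78:200::/60 unreachable;
+}
+```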
+ +The coolest part is the `extended next hop on` statement, which enables Babel to set the nexthop +to be an IPv6 address, which becomes clear very quickly when looking at the Linux routing table: + +``` +pim@vpp0-0:~$ ip ro +192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32 +unreachable 192.168.10.0/24 proto bird metric 32 + +pim@vpp0-0:~$ ip -6 ro +2001:678:d78:200:: dev loop0 proto kernel metric 256 pref medium +2001:678:d78:200::1 via fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32 pref medium +unreachable 2001:678:d78:200::/60 dev lo proto bird metric 32 pref medium +fe80::/64 dev loop0 proto kernel metric 256 pref medium +fe80::/64 dev e1 proto kernel metric 256 pref medium +``` + +**✅ Setting IPv4 routes over IPv6 nexthops works!** + +### Babel and VPP + +For the [[VPP](https://fd.io)] configuration, I start off with a pretty much _empty_ configuration, +creating only a loopback interface called `loop0`, setting the interfaces up, and exposing them in +_LinuxCP_: + +``` +vpp0-0# create loopback interface instance 0 +vpp0-0# set interface state loop0 up +vpp0-0# set interface ip address loop0 192.168.10.0/32 +vpp0-0# set interface ip address loop0 2001:678:d78:200::/128 + +vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0 +vpp0-0# set interface state GigabitEthernet10/0/0 up +vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1 +vpp0-0# set interface state GigabitEthernet10/0/1 up + +vpp0-0# lcp create loop0 host-if loop0 +vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0 +vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1 +``` + +Between the four VPP routers, the only relevant difference is the IPv4 and IPv6 addresses of the +loopback device. For the rest, things are good. The routing tables quickly fill with all IPv4 and +IPv6 loopbacks across the network. + +### Adding support to VPP + +IPv6 pings and looks good. However, IPv4 endpoints do not ping yet. The first thing I look at, is +does VPP understand how to interpret an IPv4 route with an IPv6 nexthop? I think it does, because I +remember reviewing a change from Adrian during our MPLS [[project]({% post_url 2023-05-28-vpp-mpls-4 +%})], which he submitted in this [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/38633)]. His change +allows VPP to use routes with `rtnl_route_nh_get_via()` to map them to a different address family, +exactly what I am looking for. The routes are correctly installed in the FIB: + +``` +pim@vpp0-0:~$ vppctl show ip fib 192.168.10.1 +ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[default-route:1, lcp-rt:1, ] +192.168.10.1/32 fib:0 index:31 locks:2 + lcp-rt-dynamic refs:1 src-flags:added,contributing,active, + path-list:[51] locks:4 flags:shared, uPRF-list:42 len:1 itfs:[2, ] + path:[72] pl-index:51 ip6 weight=1 pref=32 attached-nexthop: oper-flags:resolved, + fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1 + [@0]: ipv6 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f0110186dd + + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:34 buckets:1 uRPF:42 to:[0:0]] + [0] [@5]: ipv4 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f011010800 +``` + + +Using the Open vSwitch tap I can see I can clearly see the packets go out from `vpp0-0.e1` and into +`vpp0-1.e0`, but there is no response, so they are getting lost in `vpp0-1` somewhere. 
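+
+The tool for finding out where they die is VPP's packet tracer: capture a handful of packets at the
+`dpdk-input` node while the pings are running, roughly like so:
+
+```
+pim@vpp0-1:~$ vppctl clear trace
+pim@vpp0-1:~$ vppctl trace add dpdk-input 10
+... send a few pings from vpp0-0 ...
+```
+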
I take a look +at a packet trace on `vpp0-1`, I'm expecting the ICMP packet there: + +``` +pim@vpp0-1:~$ vppctl show trace +07:42:53:178694: dpdk-input + GigabitEthernet10/0/0 rx queue 0 + buffer 0x4c513d: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0 + ext-hdr-valid + PKT MBUF: port 0, nb_segs 1, pkt_len 98 + buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x29944fc0 + packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 + rss 0x0 fdir.hi 0x0 fdir.lo 0x0 + IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10 + ICMP: 192.168.10.0 -> 192.168.10.1 + tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN + fragment id 0xf52b, flags DONT_FRAGMENT + ICMP echo_request checksum 0x43b7 id 26166 +07:42:53:178765: ethernet-input + frame: flags 0x1, hw-if-index 1, sw-if-index 1 + IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10 +07:42:53:178791: ip4-input + ICMP: 192.168.10.0 -> 192.168.10.1 + tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN + fragment id 0xf52b, flags DONT_FRAGMENT + ICMP echo_request checksum 0x43b7 id 26166 +07:42:53:178810: ip4-not-enabled + ICMP: 192.168.10.0 -> 192.168.10.1 + tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN + fragment id 0xf52b, flags DONT_FRAGMENT + ICMP echo_request checksum 0x43b7 id 26166 +07:42:53:178833: error-drop + rx:GigabitEthernet10/0/0 +07:42:53:178835: drop + dpdk-input: no error +``` + +Okay, that checks out! Going over this packet trace, the `ip4-input` node indeed got handed a packet, +which it promptly rejected by forwarding it to `ip4-not-enabled` which drops it. It kind of makes +sense, the VPP dataplane doesn't think it's logical to handle IPv4 traffic on an interface which +does not have an IPv4 address. Except -- I'm bending the rules a little bit by doing exactly that. + +#### Approach 1: force-enable ip4 in VPP + +There's an internal function `ip4_sw_interface_enable_disable()` which is called to enable IPv4 +processing on an interface once the first IPv4 address is added. So my first fix is to force this to +be enabled for any interface that is exposed via Linux Control Plane, notably in `lcp_itf_pair_create()` +[[here](https://github.com/pimvanpelt/lcpng/blob/main/lcpng_interface.c#L777)]. + +This approach is partially effective: + +``` +pim@vpp0-0:~$ ip ro get 192.168.10.1 +192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0 + cache +pim@vpp0-0:~$ ping -c5 192.168.10.1 +PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data. +64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=3.92 ms +64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=3.81 ms +64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.75 ms +64 bytes from 192.168.10.1: icmp_seq=4 ttl=64 time=3.23 ms +64 bytes from 192.168.10.1: icmp_seq=5 ttl=64 time=2.67 ms +^C +--- 192.168.10.1 ping statistics --- +5 packets transmitted, 5 received, 0% packet loss, time 4006ms +rtt min/avg/max/mdev = 2.673/3.477/3.921/0.467 ms + +pim@vpp0-0:~$ traceroute 192.168.10.3 +traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets + 1 * * * + 2 * * * + 3 192.168.10.3 (192.168.10.3) 10.418 ms 10.343 ms 11.362 ms +``` + +I take a moment to think about why the traceroutes are not responding in the routers in the middle, +and it dawns on me that when the router needs to send an ICMPv4 TTL Exceeded message, it can't +select an IPv4 address to originate the message from, as the interface has none. 
+ +**🟠 Forwarding works, but ❌ PMTUd does not!** + +#### Approach 2: Use unnumbered interfaces + +Looking at my options, I see that VPP is capable of using so-called _unnumbered_ interfaces. These +can be left unconfigured, but borrow an address from another interface. It's a good idea to +borrow from `loop0`, which has a valid IPv4 and IPv6 address. It looks like this in VPP: + +``` +vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0 +vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0 + +vpp0-0# show interface address +GigabitEthernet10/0/0 (dn): + unnumbered, use loop0 + L3 192.168.10.0/32 + L3 2001:678:d78:200::/128 +GigabitEthernet10/0/1 (up): + unnumbered, use loop0 + L3 192.168.10.0/32 + L3 2001:678:d78:200::/128 +loop0 (up): + L3 192.168.10.0/32 + L3 2001:678:d78:200::/128 +``` + +The Linux ControlPlane configuration will always synchronize interface information from VPP to +Linux, as I described back then when I [[worked on the plugin]({% post_url 2021-08-13-vpp-2 %})]. +Babel starts and sets next hops for IPv4 that look like this: + +``` +pim@vpp0-2:~$ ip -br a +lo UNKNOWN 127.0.0.1/8 ::1/128 +loop0 UNKNOWN 192.168.10.2/32 2001:678:d78:200::2/128 fe80::dcad:ff:fe00:0/64 +e0 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1120/64 +e1 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1121/64 + +pim@vpp0-2:~$ ip ro +192.168.10.0 via 192.168.10.1 dev e0 proto bird metric 32 onlink +unreachable 192.168.10.0/24 proto bird metric 32 +192.168.10.1 via 192.168.10.1 dev e0 proto bird metric 32 onlink +192.168.10.3 via 192.168.10.3 dev e1 proto bird metric 32 onlink +``` + +While on the surface this looks good, for VPP it clearly poses a problem, as my IPv4 neighbors +(192.168.10.1 and 192.168.10.3) are not reachable: + +``` +pim@vpp0-2:~# ping -c3 192.168.10.1 +PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data. +From 192.168.10.2 icmp_seq=1 Destination Host Unreachable +From 192.168.10.2 icmp_seq=2 Destination Host Unreachable +From 192.168.10.2 icmp_seq=3 Destination Host Unreachable + +--- 192.168.10.1 ping statistics --- +3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2034ms +``` + +I take a look at why that might be, and I notice this on the neighbor `vpp0-1` when I try to ping it +from `vpp0-2`: + +``` +vpp0-1# show err + Count Node Reason Severity + 5 arp-reply IP4 source address not local to sub error + 1 arp-reply IP4 source address matches local in error +``` + +Oh, snap! I traced this down to `src/vnet/arp/arp.c` around line 522 where I can see that VPP, when +it receives an ARP request, wants that to be coming from a peer that is in its own subnet. But with a +point to point link like this one, there is nobody else in the `192.168.10.1/32` subnet! I think +this error should not be returned if the interface is `arp_unnumbered()`, defined further up in the +same source file. I write a small patch in Gerrit [[40482](https://gerrit.fd.io/r/c/vpp/+/40482)] +which removes this requirement and the test that asserts the previous behavior, allowing the ARP +request to succeed, and things shoot to life: + +``` +pim@vpp0-2:~$ ping -c3 192.168.10.1 +PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data. 
+64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=11.5 ms
+64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=1.69 ms
+64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.03 ms
+
+--- 192.168.10.1 ping statistics ---
+3 packets transmitted, 3 received, 0% packet loss, time 2004ms
+rtt min/avg/max/mdev = 1.689/5.394/11.468/4.329 ms
+```
+
+I make a mental note to discuss this ARP relaxation Gerrit with [[vpp-dev](https://lists.fd.io/g/vpp-dev/message/24155)],
+and I'll see where that takes me.
+
+**✅ Forwarding IPv4 routes over IPv4 point-to-point nexthops works!**
+
+#### Approach 3: VPP Unnumbered Hack
+
+At this point, I think I'm good, but one of the cool features of Babel is that it can use IPv6 next
+hops for IPv4 destinations. Setting `GigabitEthernet10/0/X` to unnumbered will make
+`192.168.10.X/32` reappear on the `e0` and `e1` interfaces, which will make Babel prefer the more
+classic IPv4 next-hops. So can I trick it somehow to use IPv6 anyway?
+
+One option is to ask Babel to use `extended next hop` even when IPv4 is available, which would be a
+change to Bird (and possibly a violation of the Babel specification, I should read up on that).
+
+But I think there's another way, so I take a look at the VPP code which prints out the **unnumbered,
+use loop0** message, and I find a way to know if an interface is borrowing addresses in this way. I
+decide to change the LCP plugin to _inhibit_ sync'ing the addresses if they belong to an interface
+which is unnumbered. Because I don't know for sure if everybody would find this behavior desirable,
+I make sure to guard the behavior behind a backwards compatible configuration option.
+
+If you're curious, please take a look at the change in my [[GitHub
+repo](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
+which I:
+1. add a new configuration option, `lcp-sync-unnumbered`, which defaults to `on`. That would be
+   what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.
+1. add a CLI call to change the value, `lcp lcp-sync-unnumbered [on|enable|off|disable]`
+1.
extend the CLI call to show the LCP plugin state, as an additional output of `lcp show` + +And with that, the VPP configuration becomes: + +``` +vpp0-0# lcp lcp-sync on +vpp0-0# lcp lcp-sync-unnumbered off + +vpp0-0# create loopback interface instance 0 +vpp0-0# set interface state loop0 up +vpp0-0# set interface ip address loop0 192.168.10.0/32 +vpp0-0# set interface ip address loop0 2001:678:d78:200::/128 + +vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0 +vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0 +vpp0-0# set interface state GigabitEthernet10/0/0 up +vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1 +vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0 +vpp0-0# set interface state GigabitEthernet10/0/1 up + +vpp0-0# lcp create loop0 host-if loop0 +vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0 +vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1 +``` + +## Results + +I can claim plausible success on this effort, which makes me wiggle in my seat a little bit, I have +to admit: + +``` +pim@vpp0-0:~$ ip -br a +lo UNKNOWN 127.0.0.1/8 ::1/128 +loop0 UNKNOWN 192.168.10.0/32 2001:678:d78:200::/128 fe80::dcad:ff:fe00:0/64 +e0 UP fe80::5054:ff:fef0:1100/64 +e1 UP fe80::5054:ff:fef0:1101/64 +e2 DOWN +e3 DOWN + +pim@vpp0-0:~$ traceroute -n 192.168.10.3 +traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets + 1 192.168.10.1 1.882 ms 2.231 ms 1.472 ms + 2 192.168.10.2 4.243 ms 3.492 ms 2.797 ms + 3 192.168.10.3 6.689 ms 5.925 ms 5.157 ms + +pim@vpp0-0:~$ traceroute -n 2001:678:d78:200::3 +traceroute to 2001:678:d78:200::3 (2001:678:d78:200::3), 30 hops max, 80 byte packets + 1 2001:678:d78:200::1 2.543 ms 1.762 ms 2.154 ms + 2 2001:678:d78:200::2 4.943 ms 3.063 ms 3.562 ms + 3 2001:678:d78:200::3 6.273 ms 6.694 ms 7.086 ms +``` +**✅ Forwarding IPv4 routes over IPv6 nexthops works, ICMPv4 works, PMTUd works!** + +I recorded a little [[screencast](/assets/vpp-babel/screencast.cast)] that shows my work, so far: + +{{< image src="/assets/vpp-babel/screencast.gif" alt="Babel IPv4-less VPP" >}} + +## Additional thoughts + +### Comparing OSPFv2 and Babel + +Ondrej from the Bird team pointed out (thank you!) that OSPFv2 can also be made to avoid use of IPv4 +transit networks, by making use of this `peer` pattern, which is similar but not quite the same as +what I discussed in _Approach 2_ above: + +``` +$ ip addr add 192.168.10.2 peer 192.168.10.1 dev e0 +$ ip addr add 192.168.10.2 peer 192.168.10.3 dev e1 +``` + +The Linux ControlPlane plugin is not currently capable of accepting the `peer` netlink message, and +I can see a problem: VPP does not allow for two interfaces to have the same IP address, _unless_ one +is borrowing from another using _unnumbered_. I wonder why that is ... + +I could certainly give implementing that `peer` pattern in Netlink a go, but I'm not enthusiastic. +To consume the netlink message correctly, the plugin would need to assert that left hand (source) IPv4 +address strictly corresponds to a loopback, and then internally rewrite the address addition into +a _unnumbered_ use, and also somehow reject (delete?) the netlink configuration otherwise. Ick! + +I think there's a more idiomatic way of doing this in VPP. OSPFv2 doesn't really _need_ to use the +`peer` pattern, as long as the point to point peer is reachable. Babel is emitting a static route +over the interface after using IPv6 to learn its peer's IPv4 address, which is really neat! 
I suppose that for OSPFv2, setting a manual static route for the peer onto the device would do the
+trick as well.
+
+The VPP idiom for the `peer` pattern above, which Babel does naturally, and OSPFv2 could be manually
+configured to do, would look like this:
+
+```
+vpp0-2# set interface ip address loop0 192.168.10.2/32
+vpp0-2# set interface state loop0 up
+
+vpp0-2# set interface unnumbered GigabitEthernet10/0/0 use loop0
+vpp0-2# set interface state GigabitEthernet10/0/0 up
+vpp0-2# ip route add 192.168.10.1/32 via 192.168.10.1 GigabitEthernet10/0/0
+
+vpp0-2# set interface unnumbered GigabitEthernet10/0/1 use loop0
+vpp0-2# set interface state GigabitEthernet10/0/1 up
+vpp0-2# ip route add 192.168.10.3/32 via 192.168.10.3 GigabitEthernet10/0/1
+```
+
+Either way, using point to point connections (like these explicit static routes, or the implied
+static routes that the `peer` pattern will yield) over an ethernet broadcast medium will require
+getting the ARP [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)] merged. This one seems reasonably
+straightforward, because allowing point to point to work over an ethernet broadcast medium is done
+successfully by many popular vendors, and I can't find any RFC that forbids it. Perhaps VPP is
+being a bit too strict.
+
+### To Unnumbered or Not To Unnumbered
+
+I'm torn between _Approach 2_ and _Approach 3_. On the one hand, having the _unnumbered_
+interface reflected in Linux is the most faithful representation, but it is not without problems. If
+the operator subsequently tries to remove one of the addresses on `e0` or `e1`, that will yield a
+desync between Linux and VPP (Linux will have removed the address, but VPP will still be
+_unnumbered_). On the other hand, tricking Linux (and the operator) into believing there isn't an
+IPv4 (and IPv6) address configured on the interface, is also not great.
+
+Of the two approaches, I think I prefer _Approach 3_ (changing the Linux CP plugin to not sync
+unnumbered addresses), because it minimizes the chance of operator error. If you're reading this and
+have an Opinion™, would you please let me know?
+
+## What's Next
+
+I think that over time, IPng Networks might replace OSPF and OSPFv3 with Babel, as it will allow me
+to retire the many /31 IPv4 and /112 IPv6 transit networks (which consume about half of my routable
+IPv4 addresses!). I will discuss my change with the VPP and Babel/Bird Developer communities and see
+if it makes sense to upstream my changes. Personally, I think it's a reasonable direction, because
+(a) both changes are backwards compatible and (b) their semantics are pretty straightforward. I'll
+also add some configuration knobs to [[vppcfg]({% post_url 2022-04-02-vppcfg-2 %})] to make it
+easier to configure VPP in this way.
+
+
+Of course, migrating AS8298 won't happen overnight: I need to gain a bit more confidence, and
+obviously upgrade both Bird2 and VPP using my changes, which I think might benefit from a bit of
+peer review. And finally I need to roll this new IPv4-less IGP out very carefully and without
+interruptions, which, considering the IGP is the most fundamental building block of the network, may
+be tricky.
+
+But, I am uncomfortably excited by the prospect of having my network go entirely without backbone
+transit networks. By the way: Babel is amazing!
diff --git a/content/articles/2024-04-06-vpp-ospf.md b/content/articles/2024-04-06-vpp-ospf.md new file mode 100644 index 0000000..0d3c3b1 --- /dev/null +++ b/content/articles/2024-04-06-vpp-ospf.md @@ -0,0 +1,438 @@ +--- +date: "2024-04-06T10:17:54Z" +title: VPP with loopback-only OSPFv3 - Part 1 +--- + +{{< image width="200px" float="right" src="/assets/vpp-ospf/bird-logo.svg" alt="Bird" >}} + +# Introduction + +A few weeks ago I took a good look at the [[Babel]({% post_url 2024-03-06-vpp-babel-1 %})] protocol. +I found a set of features there that I really appreciated. The first was a latency aware routing +protocol - this is useful for mesh (wireless) networks but it is also a good fit for IPng's usecase, +notably because it makes use of carrier ethernet which, if any link in the underlying MPLS network +fails, will automatically re-route but sometimes with much higher latency. In these cases, Babel can +reconverge on its own to a topology that has the lowest end to end latency. + +But a second really cool find, is that Babel can use IPv6 nexthops for IPv4 destinations - which is +_super_ useful because it will allow me to retire all of the IPv4 /31 point to point networks +between my routers. AS8298 has about half of a /24 tied up in these otherwise pointless (pun +intended) transit networks. + +In the same week, my buddy Benoit asked a question about OSPFv3 on the Bird users mailinglist +[[ref](https://www.mail-archive.com/bird-users@network.cz/msg07961.html)] which may or may not have +been because I had been messing around with Babel using only IPv4 loopback interfaces. And just a +few weeks before that, the incomparable Nico from [[Ungleich](https://ungleich.ch/)] had a very +similar question [[ref](https://bird.network.cz/pipermail/bird-users/2024-January/017370.html)]. + +These three folks have something in common - we're all trying to conserve IPv4 addresses! + +# OSPFv3 with IPv4 🙁 + +Nico's thread referenced [[RFC 5838](https://datatracker.ietf.org/doc/html/rfc5838)] which defines +support for multiple address families in OSPFv3. It does this by mapping a given address family to +a specific instance of OSPFv3 using the _instance id_ and adding a new option to the _options field_ +that tells neighbors that multiple address families are supported in this instance (and thus, that +the neighbor should not assume all link state advertisements are IPv6-only). + +This way, multiple instances can run on the same router, and they will only form adjacencies with +neighbors that are operating in the same address family. This in itself doesn't change much: rather +than using IPv4 multicast in the hello's while forming adjacencies, OSPFv3 will use IPv6 link local +addresses for them. + +RFC 5838, Section 2.5 says: +> Although IPv6 link local addresses could be used as next hops for IPv4 address families, it is +> desirable to have IPv4 next-hop addresses. [ ... ] In order to achieve this, the link's IPv4 +> address will be advertised in the "link local address" field of the IPv4 instance's Link-LSA. +> This address is placed in the first 32 bits of the "link local address" field and is used for IPv4 +> next-hop calculations. The remaining bits MUST be set to zero. + +First my hopes are raised by saying IPv6 link local addresses _could_ be used as +next hops (just like Babel, yaay!), but then it goes on to say the link local address field will be +overridden with an IPv4 address in the top 32 bits. That's ... gross. 
I understand why this was +done, it allows for a minimal deviation of the OSPFv3 protocol, but this unfortunate choice +precludes the ability for IPv6 nexthops to be used. Crap on a cracker! + +# OSPFv3 with IPv4 🥰 + +But wait, not all is lost! Remember in my [[VPP Babel]({% post_url 2024-03-06-vpp-babel-1 %})] +article I mentioned that VPP has this ability to run _unnumbered_ interfaces? To recap, this is a +configuration where a primary interface, typically a loopback, will have an IPv4 and IPv6 address, +say **192.168.10.2/32** and **2001:678:d78:200::2/128** and other interfaces will borrow from that. +That will allow for the IPv4 address to be present on multiple interfaces, like so: + +``` +pim@vpp0-2:~$ ip -br a +lo UNKNOWN 127.0.0.1/8 ::1/128 +loop0 UNKNOWN 192.168.10.2/32 2001:678:d78:200::2/128 fe80::dcad:ff:fe00:0/64 +e0 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1120/64 +e1 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1121/64 +``` + +### VPP Changes + +Historically in VPP, broadcast medium like ethernet will respond to ARP requests only if the +requestor is in the same subnet. With these point to point interfaces, the remote will _never_ be in +the same subnet, because we're using /32 addresses here! VPP logs these as invalid ARP requests. +With a small change though, I can make VPP become tolerant of this scenario, and the consensus +in the VPP community is that this is OK. + +{{< image src="/assets/vpp-ospf/vpp-diff1.png" alt="VPP Diff #1" >}} + +Check out [[40482](https://gerrit.fd.io/r/c/vpp/+/40482)] for the full change, but in a nutshell, +just before deciding to return an error because the requesting source address is not directly +connected (called an `attached` route in VPP), I'll change the condition to allow for it, if and +only if the ARP request comes from an _unnumbered_ interface. + +I think this is a good direction, if only because most other popular implementations (including +Linux, FreeBSD, Cisco IOS/XR and Juniper) will answer ARP requests that are onlink but not directly +connected, in the same way. + +### Bird2 Changes + +Meanwhile, in the Bird community, we were thinking about solving this problem in a different way. +Babel allows a feature to use IPv6 transit networks with IPv4 destinations, by specifying an option +called `extended next hop`. With this option, Babel will set a nexthop across address families. It +may sound freaky at first, but it's not too strange when you think about it. Take a look at my +explanation in the [[Babel]({% post_url 2024-03-06-vpp-babel-1 %})] article on how IPv6 neighbor +discovery can take the place of IPv4 ARP resolution to figure out the ethernet next hop. + +So our initial take was: why don't we do that with OSPFv3 as well? We thought of a trick to +get that Link LSA hack from RFC5838 removed: what if Bird, upon setting the `extended next hop` +feature on an interface, would simply put the IPv6 address back like it was, rather than overwriting +it with the IPv4 address? That way, we'd just learn routes to IPv4 destinations with nexthops on +IPv6 linklocal addresses. It would break compatibility with other vendors, but seeing as it is an +optional feature which defaults to off, perhaps it is a reasonable compromise... + +Ondrej started to work on it, but came back a few days later with a different solution, which is +quite clever. Any IPv4 router needs at least one IPv4 address anyways, to be able to send ICMP +messages, so there is no need to put IPv4 addresses on links. 
Ondrej's theory corroborates my +previous comments for Babel's IPv4-less routing: + +> I’ve learned so far that I (a) MAY use IPv6 link-local networks in order to forward IPv4 packets, +> as I can use IPv6 NDP to find the link-layer next hop; and (b) each router SHOULD be able to +> originate ICMPv4 packets, therefore it needs at least one IPv4 address. +> +> These two claims mean that I need at most one IPv4 address on each router. + +Ondrej's proposal for Bird2 will, when OSPFv3 is used with IPv4 destinations, keep the RFC5838 +behavior and try to _find_ a working IPv4 address to put in the Link LSA: +{{< image src="/assets/vpp-ospf/bird-diff1.png" alt="Bird Diff #1" >}} + +He adds a function `update_loopback_addr()`, which scans all interfaces for an IPv4 address, and if +there are multiple, prefer host addresses, then addresses from OSPF stub interfaces, and finally +just any old IPv4 address. Now that IPv4 address can be simply used to put in the Link LSA. Slick! + +His change also removes next-hop-in-address-range check for OSPFv3 when using IPv4, and +automatically adds onlink flag to such routes, which newly accepts next hops that are not directly +connected: +{{< image src="/assets/vpp-ospf/bird-diff2.png" alt="Bird Diff #2" >}} + +I realize when reading the code that this change paired with the [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)] +are perfect partners: +1. Ondrej's change will make the Link LSA be _onlink_, which is a way to describe that the next hop + is not directly connected, in other words nexthop `192.168.10.3/32`, while the router itself is + `192.168.10.2/32`. +1. My change will make VPP answer for ARP requests in such a scenario where the router with an + _unnumbered_ interface with `192.168.10.3/32` will respond to a request from the not directly + connected _onlink_ peer at `192.168.10.2`. + +## Tying it together + +With all of that, I am ready to demonstrate two working solutions now. I first compile Bird2 with +Ondrej's [[commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)]. +Then, I compile VPP with my pending [[gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. Finally, +to demonstrate how `update_loopback_addr()` might work, I compile `lcpng` with my previous +[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], +which allows me to inhibit copying forward addresses from VPP to Linux, when using _unnumbered_ +interfaces. 
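+
+Before taking it for a spin: here's a tiny Python sketch -- not Ondrej's actual C code, and the
+`Iface` data structure is entirely made up -- of the address preference that `update_loopback_addr()`
+applies, as described above (host addresses first, then OSPF stub interfaces, then anything else):
+
+```
+# Sketch only: prefer host (/32) addresses, then addresses on OSPF stub
+# interfaces, and finally any other IPv4 address.
+from dataclasses import dataclass
+from ipaddress import IPv4Interface
+from typing import Optional
+
+@dataclass
+class Iface:                                # made-up stand-in for Bird's interface list
+    name: str
+    addr: Optional[IPv4Interface]           # None if no IPv4 address is configured
+    ospf_stub: bool = False
+
+def update_loopback_addr(ifaces):
+    candidates = [i for i in ifaces if i.addr is not None]
+    def rank(i):
+        if i.addr.network.prefixlen == 32:  # host address, e.g. a loopback /32
+            return 0
+        if i.ospf_stub:                     # address on an OSPF stub interface
+            return 1
+        return 2                            # any old IPv4 address
+    return min(candidates, key=rank).addr if candidates else None
+
+ifaces = [Iface("e0", None), Iface("loop0", IPv4Interface("192.168.10.3/32"), ospf_stub=True)]
+print(update_loopback_addr(ifaces))         # 192.168.10.3/32
+```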
+ +I take an IPng lab instance out for a spin with this updated Bird2 and VPP+lcpng environment: + +{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}} + +### Solution 1: Somewhat unnumbered + +I configure an otherwise empty VPP dataplane as follows: + +``` +vpp0-3# lcp lcp-sync on +vpp0-3# lcp lcp-sync-unnumbered on + +vpp0-3# create loopback interface instance 0 +vpp0-3# set interface state loop0 up +vpp0-3# set interface ip address loop0 192.168.10.3/32 +vpp0-3# set interface ip address loop0 2001:678:d78:200::3/128 + +vpp0-3# set interface mtu 9000 GigabitEthernet10/0/0 +vpp0-3# set interface mtu packet 9000 GigabitEthernet10/0/0 +vpp0-3# set interface unnumbered GigabitEthernet10/0/0 use loop0 +vpp0-3# set interface state GigabitEthernet10/0/0 up + +vpp0-3# lcp create loop0 host-if loop0 +vpp0-3# lcp create GigabitEthernet10/0/0 host-if e0 +``` + +Which yields the following configuration: +``` +pim@vpp0-3:~$ ip -br a +lo UNKNOWN 127.0.0.1/8 ::1/128 +loop0 UNKNOWN 192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64 +e0 UP 192.168.10.3/32 2001:678:d78:200::3/128 fe80::5054:ff:fef0:1130/64 +pim@vpp0-3:~$ ip route get 182.168.10.2 +RTNETLINK answers: Network is unreachable +``` + +I can see that VPP copied forward the IPv4/IPv6 addresses to interface `e0`, and because there's no +routing protocol running yet, the neighbor router `vpp0-2` is unreachable. Let me fix that, next. I +start bird in the VPP dataplane network namespace, and configure it as follows: + +``` +router id 192.168.10.3; + +protocol device { scan time 30; } +protocol direct { ipv4; ipv6; check link yes; } + +protocol kernel kernel4 { + ipv4 { import none; export where source != RTS_DEVICE; }; + learn off; scan time 300; +} + +protocol kernel kernel6 { + ipv6 { import none; export where source != RTS_DEVICE; }; + learn off; scan time 300; +} + +protocol bfd bfd1 { + interface "e*" { + interval 100 ms; + multiplier 20; + }; +} + +protocol ospf v3 ospf4 { + ipv4 { export all; import where (net ~ [ 192.168.10.0/24+, 0.0.0.0/0 ]); }; + area 0 { + interface "loop0" { stub yes; }; + interface "e0" { type pointopoint; cost 5; bfd on; }; + }; +} + +protocol ospf v3 ospf6 { + ipv6 { export all; import where (net ~ [ 2001:678:d78:200::/56, ::/0 ]); }; + area 0 { + interface "loop0" { stub yes; }; + interface "e0" { type pointopoint; cost 5; bfd on; }; + }; +} +``` + +This minimal Bird2 configuration will configure the main protocols `device`, `direct`, +and two kernel protocols `kernel4` and `kernel6`, which are instructed to export learned routes from +the kernel for all but directly connected routes (because the Linux kernel and VPP already have +these when an interface is brought up, this avoids duplicate connected route entries). + +If you haven't come across it yet, _Bidirectional Forwarding Detection_ or _BFD_ is a protocol that +repeatedly sends UDP packets between routers, to be able to detect if the forwarding is interrupted +even if the interface link stays up. It's described in detail in +[[RFC5880](https://www.rfc-editor.org/rfc/rfc5880.txt)], and I use it at IPng Networks all over the +place. + +{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}} + +Then I'll configure two OSPF protocols, one for IPv4 called `ospf4` and another for IPv6 called +`ospf6`. It's easy to overlook, but while usually the IPv4 protocol is OSPFv2 and the IPv6 protocol +is OSPFv3, here _both_ are using OSPFv3! 
I'll instruct Bird to erect a _BFD_ session for any +neighbor it establishes an adjacency with. If at any point the BFD session times out (currently at +20x100ms or 2.0s), OSPF will tear down the adjacency. + +The OSPFv3 protocols each define one channel, in which I allow Bird to export anything, but import +only those routes that are in the LAB IPv4 (192.168.10.0/24) and IPv6 (2001:687:d78:200::/56), and +I'll also allow a default to be learned over OSPF for both address families. That'll come in handy +later. + +I start up Bird on the rightmost two routers in the lab (`vpp0-3` and `vpp0-2`). Looking at +`vpp0-3`, Bird starts sending IPv6 hello packets on interface `e0`, and pretty quickly finds not one +but two neighbors: + +``` +pim@vpp0-3:~$ birdc show ospf neighbors +BIRD v2.15.1-4-g280daed5-x ready. +ospf4: +Router ID Pri State DTime Interface Router IP +192.168.10.2 1 Full/PtP 30.870 e0 fe80::5054:ff:fef0:1121 + +ospf6: +Router ID Pri State DTime Interface Router IP +192.168.10.2 1 Full/PtP 30.870 e0 fe80::5054:ff:fef0:1121 +``` + +Bird is able to sort out which is which on account of the 'normal' IPv6 OSPFv3 having an _instance +id_ value of 0 (IPv6 Unicast), and the IPv4 OSPFv3 having an _instance id_ of 64 (IPv4 Unicast). +Further, the IPv4 variant will set the AF-bit in the OSPFv3 options, so the peer will know it +supports using the Link LSA to model IPv4 nexthops rather than IPv6 nexthops. + +Indeed, routes are quickly learned: + +``` +pim@vpp0-3:~$ birdc show route table master4 +BIRD v2.15.1-4-g280daed5-x ready. +Table master4: +192.168.10.3/32 unicast [direct1 13:02:56.883] * (240) + dev loop0 + unicast [direct1 13:02:56.883] (240) + dev e0 + unicast [ospf4 13:02:56.980] I (150/0) [192.168.10.3] + dev loop0 + dev e0 +192.168.10.2/32 unicast [ospf4 13:03:04.980] * I (150/5) [192.168.10.2] + via 192.168.10.2 on e0 onlink + +``` + +They are quickly propagated both to the Linux kernel, and by means of Netlink into the Linux Control +Plane plugin in VPP, which programs it into VPP's _FIB_: + +``` +pim@vpp0-3:~$ ip ro +192.168.10.2 via 192.168.10.2 dev e0 proto bird metric 32 onlink + +pim@vpp0-3:~$ vppctl show ip fib 192.168.10.2 +ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ] +192.168.10.2/32 fib:0 index:23 locks:3 + lcp-rt-dynamic refs:1 src-flags:added,contributing,active, + path-list:[40] locks:2 flags:shared, uPRF-list:22 len:1 itfs:[1, ] + path:[53] pl-index:40 ip4 weight=1 pref=32 attached-nexthop: oper-flags:resolved, + 192.168.10.2 GigabitEthernet10/0/0 + [@0]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800 + + adjacency refs:1 entry-flags:attached, src-flags:added, cover:-1 + path-list:[43] locks:1 uPRF-list:24 len:1 itfs:[1, ] + path:[56] pl-index:43 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, + 192.168.10.2 GigabitEthernet10/0/0 + [@0]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800 + Extensions: + path:56 + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:28 buckets:1 uRPF:22 to:[0:0]] + [0] [@5]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800 +``` + +The neighbor is reachable, over IPv6 (which is nothing special), but also over IPv4: + +``` +pim@vpp0-3:~$ ping -c5 2001:678:d78:200::2 +PING 2001:678:d78:200::2(2001:678:d78:200::2) 56 data bytes +64 bytes from 2001:678:d78:200::2: 
icmp_seq=1 ttl=64 time=2.16 ms +64 bytes from 2001:678:d78:200::2: icmp_seq=2 ttl=64 time=3.69 ms +64 bytes from 2001:678:d78:200::2: icmp_seq=3 ttl=64 time=2.66 ms +64 bytes from 2001:678:d78:200::2: icmp_seq=4 ttl=64 time=2.30 ms +64 bytes from 2001:678:d78:200::2: icmp_seq=5 ttl=64 time=2.92 ms + +--- 2001:678:d78:200::2 ping statistics --- +5 packets transmitted, 5 received, 0% packet loss, time 4006ms +rtt min/avg/max/mdev = 2.164/2.747/3.687/0.540 ms + +pim@vpp0-3:~$ ping -c5 192.168.10.2 +PING 192.168.10.2 (192.168.10.2) 56(84) bytes of data. +64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=3.58 ms +64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=3.40 ms +64 bytes from 192.168.10.2: icmp_seq=3 ttl=64 time=3.28 ms +64 bytes from 192.168.10.2: icmp_seq=4 ttl=64 time=3.32 ms +64 bytes from 192.168.10.2: icmp_seq=5 ttl=64 time=3.29 ms + +--- 192.168.10.2 ping statistics --- +5 packets transmitted, 5 received, 0% packet loss, time 4007ms +rtt min/avg/max/mdev = 3.283/3.374/3.577/0.109 ms +``` + +**✅ OSPFv3 with IPv4/IPv6 on-link nexthops works!** + +### Solution 2: Truly unnumbered + +However, Ondrej's patch does something in addition to this. I repeat the same setup, except now I +set one additional feature when starting up VPP: `lcp lcp-sync-unnumbered off` + +What happens next is that VPP's dataplane looks subtly different. It has created an _unnumbered_ +interface keyed off of `loop0`, but it doesn't propagate the addresses to Linux. + +``` +pim@vpp0-3:~$ ip -br a +lo UNKNOWN 127.0.0.1/8 ::1/128 +loop0 UNKNOWN 192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64 +e0 UP fe80::5054:ff:fef0:1130/64 +``` + +With `e0` only having a linklocal address, Bird can still form an adjacency with its neighbor +`vpp0-2`, because adjacencies in OSPFv3 are formed using IPv6 only. However, the clever trick to +walk the list of interfaces in `update_loopback_addr()` will be able to find a usable IPv4 address, +and use that to put in the Link LSA using RFC5838. In this case, it finds `192.168.10.3` from +interface `loop0` so it'll use that to signal the next hop for LSAs that it sends. 
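+
+Purely as an illustration (my own sketch, not Bird source code), this is what RFC 5838 does with
+that IPv4 address: it lands in the first 32 bits of the 128-bit "link local address" field of the
+Link-LSA, with the remaining 96 bits set to zero:
+
+```
+# Illustration of RFC 5838 section 2.5: pack the IPv4 address into the first
+# 32 bits of the 128-bit "link local address" field, zero the rest.
+from ipaddress import IPv4Address, IPv6Address
+
+def ipv4_to_linklsa_field(addr):
+    return IPv4Address(addr).packed + bytes(12)   # 4 bytes of IPv4, 96 zero bits
+
+field = ipv4_to_linklsa_field("192.168.10.3")
+print(field.hex())           # c0a80a03000000000000000000000000
+print(IPv6Address(field))    # c0a8:a03:: -- the same field read back as an IPv6 address
+```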
+ +Now I start the same VPP and Bird configuration on all four VPP routers, but on `vpp0-0` I'll add a +static route out of the LAB to the internet: + +``` +protocol static static4 { + ipv4 { export all; }; + route 0.0.0.0/0 via 192.168.10.4; +} + +protocol static static6 { + ipv6 { export all; }; + route ::/0 via 2001:678:d78:201::ffff; +} +``` + +These two default routes from `vpp0-0` quickly propagate through the network, where `vpp0-3` +ultimately sees this: + +``` +pim@vpp0-3:~$ ip -br a +lo UNKNOWN 127.0.0.1/8 ::1/128 +loop0 UNKNOWN 192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64 +e0 UP fe80::5054:ff:fef0:1130/64 + +pim@vpp0-3:~$ ip ro +default via 192.168.10.2 dev e0 proto bird metric 32 onlink +192.168.10.0 via 192.168.10.2 dev e0 proto bird metric 32 onlink +192.168.10.1 via 192.168.10.2 dev e0 proto bird metric 32 onlink +192.168.10.2 via 192.168.10.2 dev e0 proto bird metric 32 onlink +192.168.10.4/31 via 192.168.10.2 dev e0 proto bird metric 32 onlink + +pim@vpp0-3:~$ ip -6 ro +2001:678:d78:200:: via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium +2001:678:d78:200::1 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium +2001:678:d78:200::2 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium +2001:678:d78:200::3 dev loop0 proto kernel metric 256 pref medium +2001:678:d78:201::/112 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium +fe80::/64 dev loop0 proto kernel metric 256 pref medium +fe80::/64 dev e0 proto kernel metric 256 pref medium +default via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium +``` + +**✅ OSPFv3 with loopback-only, _unnumbered_ IPv4/IPv6 interfaces works!** + +## Results + +I thought I'd record a little [[asciinema](/assets/vpp-ospf/clean.cast)] that shows the end to end +configuration, starting from an empty dataplane and bird configuration. I'll show _Solution 2_, that +is, the solution that doesn't copy the _unnumbered_ interfaces in VPP to Linux. + +Ready? Here I go! + +{{< image src="/assets/vpp-ospf/clean.gif" alt="OSPF IPv4-less VPP" >}} + + +### To unnumbered or Not To unnumbered + +I'm torn between _Solution 1_ and _Solution 2_. While on the one hand, setting the _unnumbered_ +interface would be best reflected in Linux, it is not without problems. If the operator subsequently +tries to remove one of the addresses on `e0` or `e1`, that will yield a desync between Linux and +VPP (Linux will have removed the address, but VPP will still be _unnumbered_). On the other hand, +tricking Linux (and the operator) to believe there isn't an IPv4 (and IPv6) address configured on +the interface, is also not great. + +Of the two approaches, I think I prefer _Solution 2_ (configuring the Linux CP plugin to not sync +_unnumbered_ addresses), because it minimizes the chance of operator error. If you're reading this and +have an Opinion™, would you please let me know? diff --git a/content/articles/2024-04-27-freeix-1.md b/content/articles/2024-04-27-freeix-1.md new file mode 100644 index 0000000..65fe6db --- /dev/null +++ b/content/articles/2024-04-27-freeix-1.md @@ -0,0 +1,230 @@ +--- +date: "2024-04-27T10:52:11Z" +title: FreeIX - Remote +--- + +# Introduction + +{{< image width="300px" float="right" src="/assets/freeix/openart-image_REzWzO43_1714219288118_raw.jpg" alt="OpenART" >}} + +Tier1 and aspiring Tier2 providers interconnect only in large metropolitan areas, due to commercial incentives and +politics. 
They won't often peer with smaller providers, because why peer with a potential customer? Due to this, +it’s entirely likely that traffic between two parties in Thessaloniki is sent to Frankfurt or Milan and back. + +One possible antidote to this is to connect to a local Internet Exchange point. Not all ISPs have access to large +metropolitan datacenters where larger internet exchanges have a point of presence, and it doesn't help that the +datacenter operator is happy to charge a substantial amount of money each month, just for the privilege of having +a passive fiber cross connect to the exchange. Many Internet Exchanges these days ask for per-month port costs *and* +meter the traffic with policers and rate limiters, such that the total cost of peering starts to exceed what one +might pay for transit, especially at low volumes, which further exacerbates the problem. Bah. + +This is an unfortunate market effect (the race to the bottom), where transit providers are continuously lowering their +prices to compete. And while transit providers can make up to some extent due to economies of scale, at some point they +are mostly all of equal size, and thus the only thing that can flex is quality of service. + +The benefit of using an Internet Exchange is to reduce the portion of an ISP’s (and CDN’s) traffic that must be +delivered via their upstream transit providers, thereby reducing the average per-bit delivery cost and as well reducing +the end to end latency as seen by their users or customers. Furthermore, the increased number of paths available through +the IXP improves routing efficiency and fault-tolerance, and it avoids traffic going the scenic route to a large hub +like Frankfurt, London, Amsterdam, Paris or Rome, if it could very well remain local. + +IPng Networks really believes in an open and affordable Internet, and I would like to do my part in ensuring the +internet stays accessible for smaller parties. + +## Smöl IXPs + +One notable problem with small exchanges, like for example [[FNC-IX](https://www.fnc-ix.net/)] in the Paris metro, or +[[CHIX-CH](https://ch-ix.ch/)], [[Community IX](https://www.community-ix.ch/)] and [[Free-IX](https://free-ix.ch/)] in +the Zurich metropolitan area, is that they are, well, small. They may be cheaper to connect to, in some cases even free, +but they don't have a sizable membership which means that there is inherently less traffic flowing, which in turn makes +it less appealing for prospect members to connect to. + +At IPng, I have partnered with a few super cool ISPs and carriers to offer a Free Internet Exchange platform. Just to +head the main question off at the pass: _Free_ here actually does mean "Free as in beer" or +[[Gratis](https://en.wikipedia.org/wiki/Gratis)], a gift to the community that does not cost money. It also more +philosophically wants to be "Free as in open, and transparent" or +[[Libre](https://en.wikipedia.org/wiki/Free_software)]. + +Two examples are: +* [[Free IX: Switzerland](https://free-ix.ch/)] with POPs at STACK GEN01 Geneva, NTT Zurich and Bancadati Lugano. +* [[Free IX: Greece](https://free-ix.gr/)] with POPs at TISparkle in Athens and Balkan Gate in Thessaloniki. + +.. but there are actually quite a few out there once you start looking :) + +## Growing Smöl IXPs + +Some internet exchanges break through the magical 1Tbps barrier (and get a courtesy callout on Twitter from Dr. King), +but many remain smöl. Perhaps it's time to break the _chicken-and-egg_ problem. 
What if there was a way to interconnect +these exchanges? + +Let's take for example the Free IX in Greece that was announced at GRNOG16 in Athens on April 19th. This exchange +initially targets Athens and Thessaloniki, with 2x100G between the two cities. Members can connect to either site for +the cost of only a cross connect. The 1G/10G/25G ports will be _Gratis_. But I will be connecting one very special +member to Free IX Greece, AS50869: + +{{< image src="/assets/freeix/Free IX Remote.svg" alt="FreeIX Remote" >}} + +## Free IX: Remote + +Here's what I am going to build. The _Free IX Remote_ project offers an outreach infrastructure which connects to +internet exchange points, and allows members to benefit from that in the following way: + +1. FreeIX uses AS50869 to peer with any network operator who is available at public internet exchanges or using + private interconnects. It looks like a normal service provider in this regard. It will connect to internet + exchanges, and learn a bunch of routes. +1. FreeIX _members_ can join the program, after which they are granted certain propagation permissions by FreeIX + at the point where they have a BGP session with AS50869. The prefixes learned on these _member_ sessions are marked + as such, and will be allowed to propagate. Members will receive some or all learned prefixes from AS50869. +1. FreeIX _members_ can set fine grained BGP communities to determine which of their prefixes are propagated and at + which locations. + +Members at smaller internet exchanges greatly benefit from this type of outreach, by receiving large portions of the +public internet directly at their preferred peering location. Similarly, the _Free IX Remote_ routers will carry +their traffic to these remote internet exchanges. + +## Detailed Design + +### Peer types + +There are two types of BGP neighbor adjacency: + +1. ***Members***: these are {ip-address,AS}-tuples which FreeIX has explicitly configured. Learned prefixes are added + to as-set AS50869:AS-MEMBERS. Members receive _all_ prefixes from FreeIX, each annotated with BGP **informational** + communities, and members can drive certain behavior with BGP **action** communities. + +1. ***Peers***: these are all other entities with whom FreeIX has an adjacency at public internet exchanges or private + network interconnects. Peers receive some (or all) _member prefixes_ from FreeIX and cannot drive any behavior + with communities. With respect to internet exchanges and peers, AS50869 looks like a completely normal ISP, + advertising subsets of the customer AS cone from AS50869:AS-MEMBERS at each exchange point. + +BGP sessions with members use strict ingress filtering by means of `bgpq4`, and will be tagged with a set of +informational BGP communities, such as where the prefix was learned, and what propagation permissions that it received +(eg. at which internet exchanges will it be allowed to be announced). Of course, prefixes that are RPKI invalid will be +dropped, while valid and unknown prefixes will be accepted. Members are granted _permissions_ by FreeIX, which determine +where their prefixes will be announced by AS50869. Further, members can perform optional actions by means of BGP communities +at their ingress point, to inhibit announcements to a certain peer or at a given exchange point. + +Peers on the other hand are not granted any permissions and all action BGP communities will be stripped on prefixes +learned. Informational communities will still be tagged on learned prefixes. Two things happen here. 
Firstly, members +will be offered only those prefixes for which they have permission -- in other words, I will create a configuration file +that says member AS8298 may receive prefixes learned from Frys-IX. Secondly, even for those prefixes that are advertised, +the member AS8298 can use the informational communities to further filter what they accept from Free IX Remote AS50869. + +# BGP Classic Communities + +Members are allowed to set the following legacy action BGP communities for coarse grained distribution of their prefixes +through the FreeIX network. + +* `(50869,0)` or `(50869,3000)` do not announce anywhere +* `(50869,666)` or `(65535,666)` blackhole everywhere (can be on any more specific from the member's AS-SET) +* `(50869,3100)` prepend once everywhere +* `(50869,3200)` prepend twice everywhere +* `(50869,3300)` prepend three times everywhere + +Peers, on the other hand, are not allowed to set _any_ communities, so all classic BGP communities from them are stripped +on ingress. + +# BGP Large Communities + +Free IX Remote will use three types of BGP Large Communities, which each serve a distinct purpose: + +1. ***Informational***: These communities are set by the FreeIX router when learning a prefix. They cannot be set by + peers or members, and will be stripped on ingress. They will be sent to both members and peers, allowing operators to + choose which prefixes to learn based on their origin details, like which country or internet exchange they were + learned at. + +1. ***Permission***: These communities are also set by FreeIX operators when learning a prefix (eg. on the ingress + router). They cannot be set by peers or members, and will be stripped on ingress. The permission communities + determine where FreeIX will allow the prefix to propagate. They will be stripped on egress. + +1. ***Action***: Based on the permissions, members can further steer announcements by sending certain action communities + to FreeIX. These actions cannot be sent by peers, but in certain cases they can be set by FreeIX operators on ingress. + Similarly to the _permission_ communties, all _action_ communities will be stripped on egress. + +Regular peers of AS50869 at exchange points and private network interconnects will not be able to set any communities, +so all large BGP communities from them are stripped on ingress. + +### Informational Communities + +When FreeIX routers learn prefixes, they will annotate them with certain communities. For example, the router at +Amsterdam NIKHEF (which is router #1, country #2), when learning a prefix at FrysIX (which is ixp #1152), will set the +following BGP large communities: + +* `(50869,1010,1)`: Informational (10XX), Router (1010), vpp0.nlams0.free-ix.net (1) +* `(50869,1020,2)`: Informational (10XX), Country (1020), Netherlands (2) +* `(50869,1030,1152)`: Informational (10XX), IXP (1030), PeeringDB IXP for FrysIX (1152) + +When propagating these prefixes to neighbors (both members and peers), these informational communities can be used to +determine local policy, for example by setting a different localpref or dropping prefixes from a certain location. +Informational communities can be read, but they can't be _set_ by peers or members -- they are always cleared by FreeIX +routers when learning prefixes, and as such the only routers which will set them are the FreeIX ones. + +### Permission Communities + +FreeIX maintains a list of permissions per member. When members announce their prefixes to FreeIX routers, these +permissions communities are set. 
They determine what the member is allowed to do with FreeIX propagation - notably which +routers, countries, and internet exchanges the member will be allowed to propagate to. + +Usually, member prefixes are allowed to propagate everywhere, so the following communities might be set by the FreeIX +router on ingress: + +* `(50869,2010,0)`: Permission (20XX), Router (2010), everywhere (0) +* `(50869,2020,0)`: Permission (20XX), Country (2020), everywhere (0) +* `(50869,2030,0)`: Permission (20XX), IXP (2030), everywhere (0) + +If the member prefixes are allowed to propagate only to certain places, the 'everywhere' communities will not be set, +and instead lists of communities with finer grained permissions can be used, for example: + +* `(50869,2010,2)`: Permission (20XX), Router (2010), vpp0.grskg0.free-ix.net (2) +* `(50869,2020,3)`: Permission (20XX), Country (2020), Greece (3) +* `(50869,2030,60)`: Permission (20XX), IXP (2030), PeeringDB IXP for SwissIX (60) + +Permission communities can't be set by peers, nor by members -- they are always cleared by FreeIX routers when learning +prefixes, and are configured explicitly by FreeIX operators. + +### Action Communities + +Based on the permission communities, zero or more egress routers, countries and internet exchanges are eligible to +propagate member prefixes by AS50869 to its peers. Members can define very fine grained action communities to further +tweak which prefixes propagate on which routers, in which countries and towards which internet exchanges and private +network interconnects: + +* `(50869,3010,3)`: Inhibit Action (30XX), Router (3010), vpp0.gratt0.free-ix.net (3) +* `(50869,3020,1)`: Inhibit Action (30XX), Country (3020), Switzerland (1) +* `(50869,3030,1308)`: Inhibit Action (30XX), IXP (3030), PeeringDB IXP for LS-IX (1308) + +Further actions can be placed on a per-remote-neighbor basis: + +* `(50869,3040,13030)`: Inhibit Action (30XX), AS (3040), Init7 (AS13030) +* `(50869,3041,6939)`: Prepend Action (30XX), Prepend Once (3041), Hurricane Electric (AS6939) +* `(50869,3042,12859)`: Prepend Action (30XX), Prepend Twice (3042), BIT BV (AS12859) +* `(50869,3043,8283)`: Prepend Action (30XX), Prepend Three Times (3043), Coloclue (AS8283) + +Peers cannot set these actions, as all action communities will be stripped on ingress. Members can set these action +communities on their sessions with FreeIX routers, however in some cases they may also be set by FreeIX operators when +learning prefixes. + +## What's next + +{{< image width="200px" float="right" src="/assets/freeix/bird-logo.svg" alt="Bird" >}} + +Perhaps this interaction between _informational_, _permission_ and _action_ BGP communities gives you an idea on how +such a network may operate. It's somewhat different to a classic Transit provider, in that AS50869 will not carry a +full table. It'll _merely_ provide a form of partial transit from member A at IXP #1, to and from all peers that +can be found at IXPs #2-#N. Makes the mind boggle? Don't worry, we'll figure it out together :) + +In an upcoming article I'll detail the programming work that goes into implementing this complex peering policy in Bird2 +as driving VPP routers (duh), with an IGP that is IPv4-less, because at this point, I [[may as well]({%post_url +2024-04-06-vpp-ospf %})] put my money where my mouth is. + +If you're interested in this kind of stuff, take a look at the IPng Networks AS8298 [[Routing Policy]({% post_url +2021-11-14-routing-policy %})]. 
Similar to that one, this one will use a combination of functional programming, templates, +and clever expansions to make a customized per-member and per-peer configuration based on a YAML input file which +dictates which member and which prefix is allowed to go where. + +{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}} + +First, I need to get a replacement router for the Thessaloniki router, which will run VPP of course. My buddy Antonis +noticed that there are CPU and/or DDR errors on that chassis, so it may need to be RMAd. But once it's operational, I will +start by deploying one instance in Amsterdam NIKHEF, and another in Thessaloniki Balkan Gate, with a 100G connection +between them, graciously provided by [[LANCOM](https://www.lancom.gr/en/)]. Just look at that FD.io hound runnnnn!!1 diff --git a/content/articles/2024-05-17-smtp.md b/content/articles/2024-05-17-smtp.md new file mode 100644 index 0000000..ae54f6b --- /dev/null +++ b/content/articles/2024-05-17-smtp.md @@ -0,0 +1,1134 @@ +--- +date: "2024-05-17T10:56:54Z" +title: 'Case Study: IPng''s mail servers' +--- + +## Intro + +I have seen companies achieve great successes in the space of consumer internet and entertainment +industry. I've been feeling less enthusiastic about the stronghold that these corporations have over +my digital presence. I am the first to admit that using "free" services is convenient, but these +companies are sometimes taking away my autonomy and exerting control over society. To each their own +of course, but for the last few years, I've been more and more inclined to take back a little bit of +responsibility for my online social presence, away from centrally hosted services and to privately +operated ones. + +{{< image width="120px" float="right" src="/assets/smtp/gmail_logo.png" alt="GMail" >}} + +First off - I **love** Google's Workspace products. I started using GMail just after it launched, +back in 2004. Its user interface is sleek, performant, and very intuitive. Its filtering, granted, +could be a bit less ... robotic, but that's made up by labels and an incredibly comprehensive +search function. I would dare say that between GMail and Photos, those are my absolute favorite +products on the internet. + +That said, I have been running e-mail servers since well before Google existed as a company. I +started off at M.C.G.V. Stack, the computer club of the University of Eindhoven, in 1995. We ran +sendmail back then, and until about two months ago, I have continuously run sendmail in production +using the PaPHosting platform [[ref](https://paphosting.net/)] that I wrote with my buddies Paul +and Jeroen. + +However, two things happened, both of them somewhat nerdsnipe-esque: +1. Mrs IPngNetworks said "Well if you are going to use NextCloud and PeerTube and PixelFed and + Mastodon, why would you not run your own mailserver?" +1. I added a forward for `event@frys-ix.net` on my Sendmail relays at PaPHosting, and was tipped + by my buddy Jelle that his e-mail to it was bouncing due to SPF strictness. + +### I tried to resist ... + +{{< image width="150px" float="left" src="/assets/smtp/pulling_hair.png" alt="Pulling Hair" >}} +My main argument against running a mailserver has been the mailspool. Before I moved to GMail, I had +the misfortune of having my mail and primary DNS fail, running on `bfib.ipng.nl` at the time, a +server 700km away from me, without redundancy. Even the nameserver slaves went beyond their zone +refresh. 
It was not a good month for me, even though it was twenty years or so ago :) + +Last year, during a roadtrip with Fred, he and I spent a few long hours restoring a backup after a +catastrophic failure of a hypervisor at IP-Max on which his mailserver was running. Luckily, backups +were awesome and saved the day, but having to go into a Red Alert mode and not being able to +communicate, really can be stressful. I don't want to run mailservers!!!1 + +### .. but resistance is futile + +After this nerdsnipe, I had a short conversation with Jeroen who mentioned that since I last had a +look at this, Dovecot, a popular imap/pop3 server, had gained the ability to do mailbox +synchronization across multiple machines. That's a really nifty feature - but also it meant that +there will be no more single points of failure, if I do this properly. Oh crap, there's no longer an +argument of resistance? Nerd-snipe accepted! + +Let me first introduce the mail^W main characters of my story: + +| ![Postfix](/assets/smtp/postfix_logo.png){: style="width:100px; margin: 1em;"} | ![Dovecot](/assets/smtp/dovecot_logo.png){: style="width:100px; margin: 1em;"} | ![NGINX](/assets/smtp/nginx_logo.png){: style="width:100px; margin: 1em;"} | ![rspamd](/assets/smtp/rspamd_logo.png){: style="width:100px; margin: 1em;"} | ![Unbound](/assets/smtp/unbound_logo.png){: style="width:100px; margin: 1em;"} | ![Postfix](/assets/smtp/roundcube_logo.png){: style="width:100px; margin: 1em;"} | + +* ***Postfix***: is Wietse Venema's mail server that started life at IBM research as an + alternative to the widely-used Sendmail program. After eight years at Google, Wietse continues + to maintain Postfix. +* ***Dovecot***: an open source IMAP and POP3 email server for Linux/UNIX-like systems, written + with security primarily in mind. Dovecot is an excellent choice for both small and large + installations. +* ***NGINX***: an HTTP and reverse proxy server, a mail proxy server, and a generic TCP/UDP proxy + server, originally written by Igor Sysoev. +* ***Rspamd***: an advanced spam filtering system and email processing framework that allows + evaluation of messages by a number of rules including regular expressions, statistical analysis + and custom services such as URL black lists. +* ***OpenDKIM***: is a community effort to develop and maintain a C library for producing + DKIM-aware applications and an open source milter for providing DKIM service. +* ***Unbound***: a validating, recursive, caching DNS resolver. It is designed to be fast and + lean and incorporates modern features based on open standards. +* ***Roundcube***: a web-based IMAP email client. Roundcube's most prominent feature is the + pervasive use of Ajax technology. + +In the rest of this article, I'll go over four main parts that I used to build a fully redundant +and self-healing mail service at IPng Networks: + +1. **Green**: `smtp-in.ipng.ch` which handles inbound e-mail +1. **Red**: `imap.ipng.ch` which serves mailboxes to users +1. **Blue**: `smtp-out.ipng.ch` which handles outbound e-mail +1. 
**Magenta**: `webmail.ipng.ch` which exposes the mail in a web browser + +Let me start with a functional diagram, using those colors: + +{{< image src="/assets/smtp/IPng Mail Cluster.svg" alt="IPng Mail Cluster" >}} + +As you can see in this diagram, I will be separating concerns and splitting the design into three +discrete parts, which will also be in three sets of redundantly configured backend servers running +on IPng's hypervisors in Zurich (CH), Lille (FR) and Amsterdam (NL). + +### 1. Outbound: smtp-out + +I'm going to start with a relatively simple component first: outbound mail. This service will be +listening on the _smtp submission_ port 587, require TLS and user authentication from clients, +validate outbound e-mail using a spam detection agent, and finally provide DKIM signing on all +outbound e-mails. It should spool and retry the delivery in case there is a temporary issue (like +greylisting, or server failure) on the receiving side. + +Because the only way to send e-mail will be using TLS and user authentication, the smtp-out servers +themselves will not need to do any DNSBL lookups, which is convenient because it means I can put +them behind a loadbalancer and serve them entirely within IPng Site Local. If you're curious as to +what this site local thing means, basically it's an internal network spanning all IPng's points of +presence, with an IPv4, IPv6 and MPLS backbone that is disconnected from the internet. For more +details on the design goals, take a look at the [[article]({% post_url 2023-03-11-mpls-core +%})] I wrote about it last year. + +#### Debian VMs + +I'll take three identical virtual machines, hosted on three separate hypervisors each in their own +country. + +``` +pim@summer:~$ dig ANY smtp-out.net.ipng.ch +smtp-out.net.ipng.ch. 60 IN A 198.19.6.73 +smtp-out.net.ipng.ch. 60 IN A 198.19.4.230 +smtp-out.net.ipng.ch. 60 IN A 198.19.6.135 +smtp-out.net.ipng.ch. 60 IN AAAA 2001:678:d78:50e::9 +smtp-out.net.ipng.ch. 60 IN AAAA 2001:678:d78:50a::6 +smtp-out.net.ipng.ch. 60 IN AAAA 2001:678:d78:510::7 +```` + +I will give them each 8GB of memory, 4 vCPUs, and 16GB of bootdisk. I'm pretty confident that the +whole system will be running in only a fraction of that. 
I will install a standard issue Debian +Bookworm (12.5), and while my VMs by default have 4 virtual NICs, I only need one, connected to the +IPng Site Local: + +``` +pim@smtp-out-chrma0:~$ ip -br a +lo UNKNOWN 127.0.0.1/8 ::1/128 +enp1s0f0 UP 198.19.6.135/27 2001:678:d78:510::7/64 fe80::5054:ff:fe99:81b5/64 +enp1s0f1 UP fe80::5054:ff:fe99:81b6/64 +enp1s0f2 UP fe80::5054:ff:fe99:81b7/64 +enp1s0f3 UP fe80::5054:ff:fe99:81b8/64 + +pim@smtp-out-chrma0:~$ mtr -6 -c5 -r dns.google +Start: 2024-05-17T13:49:28+0200 +HOST: smtp-out-chrma0 Loss% Snt Last Avg Best Wrst StDev + 1.|-- msw1.chrma0.net.ipng.ch 0.0% 5 1.6 1.5 1.3 1.6 0.1 + 2.|-- msw0.chrma0.net.ipng.ch 0.0% 5 1.4 1.3 1.3 1.4 0.1 + 3.|-- msw0.chbtl0.net.ipng.ch 0.0% 5 3.2 3.1 2.8 3.2 0.2 + 4.|-- hvn0.chbtl0.net.ipng.ch 0.0% 5 1.5 1.5 1.4 1.5 0.0 + 5.|-- chbtl0.ipng.ch 0.0% 5 1.6 1.7 1.6 1.7 0.0 + 6.|-- chrma0.ipng.ch 0.0% 5 2.4 2.4 2.4 2.5 0.0 + 7.|-- as15169.lup.swissix.ch 0.0% 5 3.2 3.8 3.2 5.6 1.0 + 8.|-- 2001:4860:0:1::6083 0.0% 5 4.5 4.5 4.5 4.5 0.0 + 9.|-- 2001:4860:0:1::12f9 0.0% 5 3.4 3.5 3.4 3.5 0.0 + 10.|-- dns.google 0.0% 5 3.8 3.9 3.8 4.0 0.1 +``` + +One cool observation: these machines are not really connected to the internet - you'll note that +their IPv4 address is from reserved space, and their IPv6 supernet (2001:678:d78:500::/56) is +filtered at the border. I'll get to that later! + +#### Postfix + +I will install Postfix, and make a few adjustments to its config. First off, this mailserver will only +be receiving `submission` mail, which is port 587. It will not participate or listen to the regular +`smtp` port 25, nor `smtps` port 465, as such the `master.cf` file for Postfix becomes: + +``` +#smtp inet n - y - - smtpd +# -o smtpd_sasl_auth_enable=no +submission inet n - y - - smtpd + -o syslog_name=postfix/submission + -o smtpd_tls_security_level=encrypt + -o smtpd_sasl_auth_enable=yes + -o smtpd_reject_unlisted_recipient=no + -o smtpd_client_restrictions=permit_sasl_authenticated,permit_mynetworks,reject + -o milter_macro_daemon_name=ORIGINATING +#smtps inet n - y - - smtpd +# -o syslog_name=postfix/smtps +# -o smtpd_tls_wrappermode=yes +# -o smtpd_sasl_auth_enable=yes +``` + +The only thing I will make note of is that the `submission` service has a set of client +restrictions. In other words, to be able to use this service, the client must either be SASL +authenticated, or from a list of network prefixes that are allowed to relay. If neither of those two +conditions are satisfied, relaying will be denied. + +Now, I understand that pasting in the entire postfix configuration is a bit verbose, but honestly +I've spent many an hour trying to puzzle together an end-to-end valid configuration, so I'm just +going to swim upstream and post the whole `main.cf`, which I'll try to annotate the broad strokes +of, in case there's anybody out there trying to learn: + +``` +myhostname = smtp-out.ipng.ch +myorigin = smtp-out.ipng.ch +mydestination = $myhostname, smtp-out.chrma0.net.ipng.ch, localhost.net.ipng.ch, localhost +mynetworks = 127.0.0.0/8, [::1]/128 + +recipient_delimiter = + +inet_interfaces = all +inet_protocols = all +biff = no + +# appending .domain is the MUA's job. +append_dot_mydomain = no +readme_directory = no + +# See http://www.postfix.org/COMPATIBILITY_README.html -- default to 3.6 on fresh installs. 
+compatibility_level = 3.6 + +# SMTP Server +smtpd_banner = $myhostname ESMTP $mail_name (smtp-out.chrma0.net.ipng.ch) +smtpd_relay_restrictions = permit_mynetworks permit_sasl_authenticated defer_unauth_destination +smtpd_tls_cert_file = /etc/certs/ipng.ch/fullchain.pem +smtpd_tls_key_file = /etc/certs/ipng.ch/privkey.pem +smtpd_tls_CAfile = /etc/ssl/certs/ca-certificates.crt +smtpd_tls_CApath = /etc/ssl/certs +smtpd_use_tls = yes +smtpd_tls_received_header = yes +smtpd_tls_auth_only = yes +smtpd_tls_session_cache_database = btree:$data_directory/smtpd_scache +smtpd_client_connection_count_limit = 4 +smtpd_client_connection_rate_limit = 10 +smtpd_client_message_rate_limit = 60 +smtpd_client_event_limit_exceptions = $mynetworks + +# Dovecot auth +smtpd_sasl_type = dovecot +smtpd_sasl_path = private/auth +smtpd_sasl_authenticated_header = yes +smtpd_sasl_auth_enable = yes +smtpd_sasl_security_options = noanonymous, noplaintext +smtpd_sasl_tls_security_options = noanonymous + +# SMTP Client +smtp_use_tls = yes +smtp_tls_note_starttls_offer = yes +smtp_tls_cert_file = /etc/certs/ipng.ch/fullchain.pem +smtp_tls_key_file = /etc/certs/ipng.ch/privkey.pem +smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt +smtp_tls_CApath = /etc/ssl/certs +smtp_tls_mandatory_ciphers = medium +smtp_tls_session_cache_database = btree:${data_directory}/smtp_scache +smtp_tls_security_level = encrypt + +header_size_limit = 4096000 +message_size_limit = 52428800 +mailbox_size_limit = 0 + +# OpenDKIM, Rspamd +smtpd_milters = inet:localhost:8891,inet:rspamd.net.ipng.ch:11332 +non_smtpd_milters = $smtpd_milters + +# Local aliases +alias_maps = hash:/etc/postfix/aliases +alias_database = hash:/etc/postfix/aliases +``` + +***Hostnames***: The full (internal) hostname for the server is `smtp-out.$(site).net.ipng.ch`, in +this case for `chrma0` in Rümlang, Switzerland. However, when clients connect to the public hostname +`smtp-out.ipng.ch`, they will expect that the TLS certificate matches _that_ hostname. This is why I +let the server present itself as simply `smtp-out.ipng.ch`, which will also be its public DNS name +later, but put the internal FQDN for debugging purposes between parenthesis. See the `smtpd_banner` +and `myhostname` for the destinction. I'll load up the `*.ipng.ch` wildcard certificate which I +described in my Let's Encrypt [[DNS-01]({% post_url 2023-03-24-lego-dns01 %})] article. + + +***Authorization***: I will make Postfix accept relaying for those users that are either in the +`mynetworks` (which is only localhost) OR `sasl_authenticated` (ie. presenting a username and +password). This password exchange will only be possible after encryption has been triggered using +the `STARTTLS` SMTP feature. This way, user/pass combos will be safe on the network. + + +***Authentication***: Those username and password combos can come from a few places. One popular way +to do this is via a `dovecot` authentication service. Via the `smtpd_sasl_path`, I tell Postfix to +ask these authentication questions using the dovecot protocol on a certain file path. I'll let +Dovecot listen in the `/var/spool/postfix/private/auth` directory. This is how Postfix will know +which user to relay for, and which to deny. + + +***DKIM/SPF***: These days, most (large and small) mail providers will be suspicious of e-mail that is +delivered to them without proper SPF and DKIM fields. 
***DKIM*** is a mechanism to create a cryptographic +signature over some of the E-Mail header fields (usually From/Subject/Date), which can be checked +by the recipient for validity. ***SPF*** is a mechanism to use DNS to inform receiving mailservers +of which are the valid IPv4/IPv6 addresses that should be used to deliver mail for a given sender +domain. + +#### Dovecot (auth) + +The configuration for Dovecot is incredibly simple. The only thing I do is create a mostly empty +`dovecot.conf` file which defines the `auth` service listening in the place where Postfix expects +it. Then, I add a password file called `sasl-users` which will contain user:password tuples: + +``` +service auth { + unix_listener /var/spool/postfix/private/auth { + mode = 0660 + # Assuming the default Postfix user and group + user = postfix + group = postfix + } +} + +passdb { + driver = passwd-file + args = username_format=%n /etc/dovecot/sasl-users +} +``` + +I can use `doveadm pw` to generate such passwords. I do this in an upstream Ansible repository and +then push out the same configuration to any number of `smtp-out` servers, so they are all configured +identically to this one. + +#### OpenDKIM (signing) + +Now that I can authorize (via SASL) and authenticate (via Dovecot backend) a user, it will be +entitled to use the `smtp-out` Postfix to send e-mail. However, there's a good chance that +recipients will bounce the e-mail, unless it comes with a DKIM signature, and from the _correct_ IP +addresses. + +To configure DKIM signing, I use OpenDKIM, which I give the following `/etc/opendkim.conf` file: + +``` +pim@smtp-out-chrma0:~$ cat /etc/opendkim.conf +Syslog yes +LogWhy yes +UMask 007 +Mode sv +AlwaysAddARHeader yes +SignatureAlgorithm rsa-sha256 +X-Header no +KeyTable refile:/etc/opendkim/keytable +SigningTable refile:/etc/opendkim/signers +RequireSafeKeys false +Canonicalization relaxed +TrustAnchorFile /usr/share/dns/root.key +UserID opendkim +PidFile /run/opendkim/opendkim.pid +Socket inet6:8891 +``` + +It opens a socket at port 8891, which is where Postfix expects it, based on its `smtpd_milter` +configuration option. It will look at the so-called **SigningTable** to determine which outbound +e-mail addresses it can sign. This table looks up From addresses, including wildcards, and informs +which symbolic keyname in the **KeyTable** to use for the signature, like so: + +``` +pim@smtp-out-chrma0:/etc/opendkim$ cat signers +*@*.ipng.nl ipng-nl +*@ipng.nl ipng-nl +*@*.ipng.ch ipng-ch +*@ipng.ch ipng-ch +*@*.ublog.tech ublog +*@ublog.tech ublog +... + +pim@smtp-out-chrma0:/etc/opendkim$ cat keytable +ipng-nl ipng.nl:DKIM2022:/etc/opendkim/keys/DKIM2022-ipng.nl-private +ipng-ch ipng.ch:DKIM2022:/etc/opendkim/keys/DKIM2022-ipng.ch-private +ublog ublog.tech:DKIM2022:/etc/opendkim/keys/DKIM2022-ublog.tech-private +... +``` + +This allows OpenDKIM to sign messages for any number of domains, using the correct key. Slick! + +#### NGINX + +Now that I have three of these identical VMs, I am ready to hook them up to the internet. On the way +in, I will point `smtp-out.ipng.ch` to our NGINX cluster. I wrote about that cluster in a [[previous +article]({% post_url 2023-03-17-ipng-frontends %})]. 
I will add a snippet there, that exposes these +VMs behind a TCP loadbalancer like so: + +``` +pim@squanchy:~/src/ipng-ansible/roles/nginx/files/streams-available$ cat smtp-out.ipng.ch.conf +upstream smtp_out { + server smtp-out.chrma0.net.ipng.ch:587 fail_timeout=10s max_fails=2; + server smtp-out.frggh0.net.ipng.ch:587 fail_timeout=10s max_fails=2 backup; + server smtp-out.nlams2.net.ipng.ch:587 fail_timeout=10s max_fails=2 backup; +} + +server { + listen [::]:587; + listen 0.0.0.0:587; + proxy_pass smtp_out; +} +``` + +I make use of the `backup` keyword, which will make the loadbalancer choose, if it's available, the +primary server in `chrma0`. If it were to go down, no problem, two connection failures within ten +seconds will make NGINX choose the alternative ones in `frggh0` or `nlams2`. + +#### IPng Site Local gateway + +When the `smtp-out` server receives the e-mail from the customer/client, it'll spool it and start to +deliver it to the remote MX record. To do this, it'll create an outbound connection from its cozy +spot within IPng Site Local (which, you will remember, is not connected directly to the internet). +There are three redundant gateways in IPng Site Local (in Geneva, Brüttisellen and Amsterdam). +If any of these were to go down for maintenance or fail, the network will use OSPF E1 to find the +next closest default gateway. I wrote about how this entire european network is connected via three +gateways that are self-repairing in this [[article]({% post_url 2023-03-11-mpls-core %})], in case +you're curious. + +But, for the purposes of SMTP, it means that each of the internal `smtp-out` VMs will be seen by +remote mailservers as NATted via ***one of these egress points***. This allows me to determine the +SPF records in DNS. With that, I'm ready to share the publicly visible details for this service: + +``` +_spf.ipng.ch. 3600 IN TXT "v=spf1 include:_spf4.ipng.ch include:_spf6.ipng.ch ~all" +_spf4.ipng.ch. 3600 IN TXT "v=spf1 ip4:46.20.246.112/28 ip4:46.20.243.176/28 ip4:94.142.245.80/29" + "ip4:94.142.241.184/29 ip4:194.1.163.0/24 ~all" +_spf6.ipng.ch. 3600 IN TXT "v=spf1 ip6:2a02:2528:ff00::/40 ip6:2a02:898:146::/48" + "ip6:2001:678:d78::/48 ~all" + +smtp-out.ipng.ch. 3600 IN CNAME nginx0.ipng.ch. +nginx0.ipng.ch. 600 IN A 194.1.163.151 +nginx0.ipng.ch. 600 IN A 46.20.246.124 +nginx0.ipng.ch. 600 IN A 94.142.241.189 +nginx0.ipng.ch. 600 IN AAAA 2001:678:d78:7::151 +nginx0.ipng.ch. 600 IN AAAA 2a02:2528:ff00::124 +nginx0.ipng.ch. 600 IN AAAA 2a02:898:146::5 +``` + +To re-iterate one point: the _inbound_ path of the mail is via the redundant cluster of `nginx0` +entrypoints, while the _outbound_ path will be seen from `gw0.chbtl0.ipng.ch`, `gw0.chplo0.ipng.ch` +or `gw0.nlams3.ipng.ch`, which are all covered by the SPF records for IPv4 and IPv6. + +#### Bonus: opensmtpd on clients + +By the way, every single server (VM, hypervisor, router) at IPng Neworks will all use smtp-out to +send e-mail. 
I use `opensmtpd` for that, and it's incredibly simple: + +``` +pim@squanchy:~$ cat /etc/mail/smtpd.conf +table aliases file:/etc/mail/aliases +table secrets file:/etc/mail/secrets + +listen on localhost + +action "local_mail" mbox alias +action "outbound" relay host "smtp+tls://ipng@smtp-out.ipng.ch:587" auth mail-from "@ipng.ch" + +match from local for local action "local_mail" +match from local for any action "outbound" + +pim@squanchy:~$ sudo cat /etc/mail/secrets +ipng bastion: +``` + +{{< image width="120px" float="left" src="/assets/smtp/lightbulb.svg" alt="Lightbulb" >}} + +What happens here is, every time this server `squanchy` wants to send an e-mail, it will use an SMTP +session with TLS, on port 587, of the machine called `smtp-out.ipng.ch`, and it'll authenticate +using the opensmtpd realm called `ipng`, which maps to a username:password tuple in the _secrets_ +file. It will also rewrite the envelope to be always from `@ipng.ch`. As a best practice I organize +my SMTP users by Ansible group. Squanchy is in the group `bastion`, hence its username. By doing it +this way, I can make use of the DKIM and SPF, which makes all mails properly formatted, routed, +signed and delivered. I love it, so much! + +### 2. Inbound: smtp-in + +The `smtp-out` service I described in the previous section is completely standalone. That is to say, +its purpose is only to receive submitted mail from humans and servers, sign it, spool it if need be, +and deliver it. But users also want to deliver e-mail to me and my customers. For this, I'll build a +second cluster of redundant _inbound_ mailservers: `smtp-in`. + +Here, the base setup is not too different from above, so I won't repeat it. I'll take three +identical VMs, in three different datacenters, and install them with Debian and Postfix as well. +But, contrary to the outbound servers, here I will make them listen to the _smtp_ port 25 and the +_smtps_ port 465, and I'll turn off the ability to authenticate with SASL (and thereby, refuse to +forward any e-mail that I'm not the MX record for), making `master.cf` look like this: + +``` +smtp inet n - y - - smtpd + -o smtpd_sasl_auth_enable=no +#submission inet n - y - - smtpd +# -o syslog_name=postfix/submission +# -o smtpd_tls_security_level=encrypt +# -o smtpd_sasl_auth_enable=yes +# -o smtpd_reject_unlisted_recipient=no +# -o smtpd_client_restrictions=permit_sasl_authenticated,permit_mynetworks,reject +# -o milter_macro_daemon_name=ORIGINATING +smtps inet n - y - - smtpd + -o syslog_name=postfix/smtps + -o smtpd_tls_wrappermode=yes + -o smtpd_sasl_auth_enable=no +``` + +Many of the `main.cf` attributes are the same, unsurprisingly the `myhostname` configuration option +is set to `smtp-in.ipng.ch`, which is going to be expected to match the wildcard SSL certificate +from the `smtpd_tls_cert_file` config option. The banner is a bit more telling, as it shows also the +FQDN hostname (eg. `smtp-in.frggh0.net.ipng.ch`), helpful when debugging. 
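+
+Before pointing any real senders at it, I like to poke at these listeners from the outside, if
+only to see the banner and the certificate that Postfix presents. A quick sanity check, for
+example with `openssl s_client` (just an illustration, any SMTP capable client will do):
+
+```
+pim@summer:~$ openssl s_client -connect smtp-in.ipng.ch:465
+pim@summer:~$ openssl s_client -starttls smtp -connect smtp-in.ipng.ch:25
+```
+
+Back on the server side, the rest of `main.cf` deals with the DNSBL restrictions, the Rspamd
+milter, PostSRSd and the virtual domains: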
+ +``` +# Impose DNSBL restrictions at SMTP time +smtpd_recipient_restrictions = permit_mynetworks, + reject_invalid_helo_hostname, + reject_non_fqdn_recipient, + reject_unknown_recipient_domain, + reject_unauth_pipelining, + reject_unauth_destination, + reject_rbl_client zen.spamhaus.org=127.0.0.[2..11], + reject_rhsbl_sender dbl.spamhaus.org=127.0.1.[2..99], + reject_rhsbl_helo dbl.spamhaus.org=127.0.1.[2..99], + reject_rhsbl_reverse_client dbl.spamhaus.org=127.0.1.[2..99], + warn_if_reject reject_rbl_client zen.spamhaus.org=127.255.255.[1..255], + reject_rbl_client dnsbl-1.uceprotect.net, + reject_rbl_client bl.0spam.org=127.0.0.[7..9], + permit + +# Milter for rspamd +smtpd_milters = inet:rspamd.net.ipng.ch:11332 +milter_default_action = accept + +# PostSRSd +sender_canonical_maps = tcp:localhost:10001 +sender_canonical_classes = envelope_sender +recipient_canonical_maps = tcp:localhost:10002 +recipient_canonical_classes= envelope_recipient,header_recipient + +# Virtual domains +virtual_alias_domains = hash:/etc/postfix/virtual-domains +virtual_alias_maps = hash:/etc/postfix/virtual +``` + +The config arguably is quite compact. but I will hilight four specific pieces. + +***DNSBL***: When connecting and receiving the envelope (ie. `MAIL FROM` and `RCPT TO` in the SMTP +transaction), I'll ask Postfix to do a bunch of DNS blocklist lookups. Many sender domains, and +infected hosts/networks are mapped in public DNSBL zones, notably +[[Spamhaus](https://www.spamhaus.org/)], [[UCEProtect](http://www.uceprotect.net/en/index.php)], and +[[0Spam](https://0spam.org/)] do a great job at identifying malicious and spammy domain-names and +networks. So I'll tell Postfix to reject folks attempting to connect from these low-reputation +places. + +***Rspamd***: Here's where I hook up a redundant cluster of rspamd servers. Each e-mail, once +accepted, will be routed through this milter, and after thinking about it a little bit, the Rspamd +server will either answer: +* ***greylisted***: where Rspamd recommends a _tempfail_ so the remote mailserver comes back + after a few minutes after connecting for the first time, many spammers will not do this. +* ***blocked***: if Rspamd finds the e-mail is egregious, it'll simply recommend a _permfail_ + so Postfix immediately rejects it. +* ***tagged***: if Rspamd is iffy about the e-mail, it may insert an `X-Spam` header, so that + downstream mail clients like Thunderbird or Mail.app can decide for themselves to consider the + e-mail junk or not. + +***PostSRS***: This is a really useful feature which allows Postfix to safely _forward_ an e-mail to +another mailhost. Perhaps best explained with an example, notably the aforementioned nerdsnipe from +my buddy Jelle: + +1. Let's say `jelle@luteijn.email` sends an e-mail to `event@frys-ix.net` for which IPng is the + mailhost. +1. Jelle configured his SPF records to allow mail to come from either `ip4:185.36.229.0/24` + or `ip6:2a07:cd40::/29`, and if it comes from neither of those, to hard fail the SPF check + `-all`. +1. My spiffy `smtp-in.ipng.ch` receives this e-mail and decides to forward it internally by + rewriting it to `foo@eritap.com`. +1. The mailserver for `eritap.com` now sees an e-mail coming From: `jelle@luteijn.email` going to + its `foo@eritap.com`. It does an SPF check and concludes: Yikes! That mailserver + `smtp-in.ipng.ch` is NOT authorized to send e-mail on behalf of Jelle, so reject it! A kitten + gets hurt, which is obviously unacceptable. 
+ +To handle this, PostSRSd detects when such a forward is about to happen, and rewrites the envelope +From: header to be something that `smtp-in.ipng.ch` might be allowed to deliver mail for: something +in the `@ipng.ch` domain! Using a secret (shared between the replicas of IPng's `smtp-in` cluster), +it can insert a little cryptographic signature as it does this rewrite. + +In the example above, the e-mail from `jelle@luteijn.email` will be rewritten to an envelope such as +`SRS0=CCIM=MT=luteijn.email=jelle@ipng.ch` and while hideous, it **is** in the `@ipng.ch` domain. If +a bounce for this e-mail were to be generated, PostSRSd can also rewrite in reverse, re-assembling +the original envelope From when sending the bounce on to Jelle's mailserver. + +I configure Postfix to do this using the _sender_ and _recipient_ canonical maps. I read these from +a server running on localhost port 10001 and 10002 respectively. This is where PostSRSd does its +magic. + +Oh, what's that I hear? The telephone is ringing! 1982 called, and it wants to change the title of +[[RFC821](https://datatracker.ietf.org/doc/html/rfc821)] from SMTP to CMTP (**Convoluted Mail +Transfer Protocol**). + +***Virtual***: With all of that out of the way, I can now receive and forward aliased e-mails. I +won't be using local mail delivery (to unix users on the local machine), but rather I will forward +the mails for my local users onwards to what is called a redundant `maildrop` server. So for the +virtualized part of the Postfix config, I have things like this: + +``` +pim@smtp-in-chrma0:~$ cat /etc/postfix/virtual-domains +ublog.tech ublog.tech +frys-ix.net frys-ix.net +ipng.nl ipng.nl +ipng.ch ipng.ch +... + +pim@smtp-in-chrma0:~$ cat /etc/postfix/virtual +## Virtual domain: ipng.ch +postmaster@ipng.ch pim+postmaster@maildrop.net.ipng.ch +hostmaster@ipng.ch pim+hostmaster@maildrop.net.ipng.ch +abuse@ipng.ch pim+abuse@maildrop.net.ipng.ch +pim@ipng.ch pim+ipng@maildrop.net.ipng.ch +noreply@ipng.ch /dev/null +... +## Virtual domain: ipng.nl +@ipng.nl @ipng.ch +## Virtual domain: frys-ix.net +postmaster@frys-ix.net pim+postmaster@maildrop.net.ipng.ch +hostmaster@frys-ix.net pim+hostmaster@maildrop.net.ipng.ch +abuse@frys-ix.net pim+abuse@maildrop.net.ipng.ch +noc@frys-ix.net pim+frysix@maildrop.net.ipng.ch,noc@eritap.com +pim@frys-ix.net pim+frysix@maildrop.net.ipng.ch +arend@frys-ix.net arend+frysix@eritap.com +event@frys-ix.net someplace@example.com +... +``` + +The first file here `virtual_alias_domains`, simply explains to Postfix which domains it is to +accept e-mail for. This avoids users trying to use it as a relay. If the domain is not listed in +the lefthand side of the table, it's not welcome here. But then once Postfix knows it's supposed to +be accepting e-mail for this domain, it will consult the `virtual_alias_maps` configuration. Here, I +showed three domains, and a few features: + +* I can simply forward along `pim@ipng.ch` to `pim+ipng@maildrop.net.ipng.ch`. Cool. +* I can toss the email by passing it to `/dev/null` (useful for things like `noreply@` and + `nobody@`) +* I can forward it to multiple recipients as well, for example `noc@frys-ix.net` goes to me and + Eritap (hoi, Arend!) + +When such a forward happens, PostSRSd kicks in, and for that e-mail, the envelope rewrite will +happen such that `smtp-in` can safely deliver this to even the strictest of SPF users. + +#### Why no NGINX ? 
+ +There's an important technical reason for me not to be able to use an inbound loadbalancer, even +though I'd love to frontend port 25 and 465 on IPng's nginx cluster. I have enabled the use of +DNSBL, which implies that Postfix needs to know the remotely connecting IPv4 and IPv6 addresses. +While for domain-based blocklists this is not important, for IP based ones like _zen.spamhaus.org_ +it is critical. Therefore, I will assign a public IPv4 and IPv6 address to each of the machines in +the cluster. They will be used in a round-robin way, and if one of them is down for a while, remote +mail servers will automatically and gracefully use another replica. + +With that, the public DNS entries: + +``` +ublog.tech. 86400 IN MX 10 smtp-in.ipng.ch. +ipng.nl. 86400 IN MX 10 smtp-in.ipng.ch. +ipng.ch. 86400 IN MX 10 smtp-in.ipng.ch. + +smtp-in.ipng.ch. 60 IN A 46.20.246.125 +smtp-in.ipng.ch. 60 IN A 94.142.245.85 +smtp-in.ipng.ch. 60 IN A 194.1.163.141 +smtp-in.ipng.ch. 60 IN AAAA 2a02:2528:ff00::125 +smtp-in.ipng.ch. 60 IN AAAA 2a02:898:146:1::5 +smtp-in.ipng.ch. 60 IN AAAA 2001:678:d78:6::141 +``` + +### 3. Dovecot: maildrop + +Remember when I said that mail to `pim@ipng.ch` is forwarded to `pim+ipng@maildrop.net.ipng.ch`? +Doing this allows me to have replicated, fully redundant, IMAP servers! As it turns out, Dovecot, a +very popular open source pop3/imap server, has the ability to do realtime synchronization between +multiple machines serving the same user. + +On these servers, I'll start with enabling Postfix only using the `smtp` and `smtps` transport in +`master.cf`. The maildrop servers will be entirely within IPng Site Local, and cannot be reached +from the internet directly, just the same as the `smtp-out` server replicas. + +Postfix on the server receives mail from the `smtp-in` servers as the final destination for an +e-mail. It does this very similar to the `smtp-in` server pool I described above, with two notable +differences: + +1. It does not need to do DNSBL lookups or spam analysis -- those have already happened upstream + from these maildrop servers by the `smtp-in` servers. That's also why these can be safely + tucked away in IPng Site Local. +1. Their virtual maps point to what is called an ***LMTP***: Local Mail Transport Protocol, where + I'll ask Postfix to pump them into a redundalty replicated Dovecot pair. + +``` +# Completely virtual +virtual_alias_maps = hash:/etc/postfix/virtual-maildrop +virtual_mailbox_domains = maildrop.net.ipng.ch +virtual_transport = lmtp:unix:private/dovecot-lmtp + +pim@maildrop0-chbtl0:$ cat /etc/postfix/virtual-maildrop +pim@maildrop.net.ipng.ch pim +``` + +What I've done here is define only one `virtual_mailbox_domains` entry, for which I look up the +users in the `virtual_alias_maps` and use a `virtual_transport` to deliver the enduser (`pim`) to a +unix domain socket in `/var/spool/postfix/private/dovecot-lmtp`. Once again, mail servers are super +simple after you've spent ten hours reading configuration manuals and RFCs and asked at least three +other people how they did theirs.... Super... Simple! + +#### Dovecot + +By default, Dovecot ships with a very elaborate configuration file hierarchy. I decide to replace it +with an autogenerated one from Ansible that has fewer includes (namely: none at all). Here's the +features that I want to enable in Dovecot: + +* ***UserDB***: To define username, password and mail directory for users. 
+* ***LMTP***: To be a local recepticle for the Postfix delivery +* ***IMAP***: To serve SSL enabled IMAP to mail clients like Mail.app, Thunderbird, Roundcube, + etc. +* ***Replicator***: To replicate mailboxes between pairs of Dovecot servers. +* ***Sieve***: To allow users to create mail filters using the Sieve protocol. + +Starting from the easier bits, here's how I configure the User Database in `dovecot.conf`: + +``` +passdb { + driver = passwd-file + args = username_format=%n /etc/dovecot/maildrop-users +} + +userdb { + driver = passwd-file + args = username_format=%n /etc/dovecot/maildrop-users + default_fields = uid=vmail gid=vmail home=/var/dovecot/users/%n +} + +mail_plugins = $mail_plugins notify push_notification replication +mail_location = mdbox:~/mdbox +``` + +I can add a user `pim` with an encrypted password from `doveadm pw` like so: + +``` +pim@maildrop0-chbtl0:/etc/dovecot$ sudo cat maildrop-users +... +pim:{CRYPT}$2y$:::: +``` + +Due to the `passdb` option, this user can authenticate with username and password, and due to the +`userdb` option, this user receives a mailbox homedir in the specified location. One important +observation is that the _unix_ user is `vmail:vmail` for every mailbox. This is pretty cool as it +allows the whole mail delivery system to be virtualized under Dovecot's guidance. Slick! + +There are two tried-and-tested mailbox formats: Maildir and mbox. mbox is one giant file per mail +folder, and can be expensive to search and sort and delete mails out of. Maildir is cheaper to +search and sort and delete, but is essentially one file per e-mail, which is bulky. Dovecot has its +own high performance mailbox, which is the best of both worlds: an indexed append-only chunked mail +format called mdbox. I learned more about the options and trade offs reading [[this +doc](https://doc.dovecot.org/admin_manual/mailbox_formats/dbox/)]. + +#### Dovecot: LMTP + +The following `dovecot.conf` snippet ties Postfix into Dovecot: + +``` +protocols = $protocols lmtp +protocol lmtp { + mail_plugins = $mail_plugins sieve +} + +service lmtp { + unix_listener /var/spool/postfix/private/dovecot-lmtp { + mode = 0660 + user = postfix + group = postfix + } +} +``` + +Recall that in Postfix above, the `virtual_transport` field specified the same location. This is how +user `pim` gets mail handed to Dovecot. One other tidbit here is that the LMTP protocol enables a +plugin called `sieve`. What this does, is upon receipt of each e-mail, a list of filters is run +through, on the server side! It is here that I can tell Dovecot that some mail goes to different folders +and sub-folders, some might be forwarded, marked read or discarded entirely. I'll get to that in a +minute. + +#### Dovecot: IMAP + +Then, I enable SSL enabled IMAP in `dovecot.conf`: + +``` +disable_plaintext_auth = yes +auth_mechanisms = plain login + +protocols = $protocols imap +protocol imap { + mail_max_userip_connections = 50 + mail_plugins = $mail_plugins imap_sieve +} + +service imap-login { + inet_listener imap { + port = 0 ## Disabled + } + inet_listener imaps { + port = 993 + } +} +``` + +With this snippet, I instruct Dovecot to disable any plain-text authentication, and use either +`plain` or `login` challenges to authenticate users. I'll disable the un-encrypted IMAP listener +by setting its port to 0, and I'll allow for an IMAP+SSL listener on the common port 993, which will +be presenting a `*.ipng.ch` wildcard certificate that's shared between all sorts of services at +IPng. 
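+
+Before wiring up real mail clients, I can convince myself that this works by speaking a little bit
+of IMAP to one of the replicas by hand, using the `pim` user from the `maildrop-users` file above.
+A minimal check (these are the commands I type, server output omitted, and `<password>` is
+obviously a placeholder):
+
+```
+pim@summer:~$ openssl s_client -quiet -connect maildrop0.chbtl0.net.ipng.ch:993
+a1 LOGIN pim <password>
+a2 LIST "" "*"
+a3 SELECT INBOX
+a4 LOGOUT
+```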
+ +#### Dovecot: Replication + +And now for something really magical. Dovecot can be instructed to replicate in multi-master (ie. +read/write) mailboxes to remote machines also running Dovecot. This is called ***dsync*** and it's +hella cool! In reading the [[docs](https://doc.dovecot.org/configuration_manual/replication/)], I +take note that the same user should be directed to a stable replica in normal use, but changes do +not get lost even if the same user modifies mails simultaneously on both replicas, some mails just +might have to be redownloaded in that case. The replication is done by looking at Dovecot index +files (not what exists in filesystem), so no mails get lost due to filesystem corruption or an +accidental `rm -rf`, they will simply be replicated back. + +This is amazing!! The configuration for it is remarkably straight forward: + +``` +mail_plugins = $mail_plugins notify replication + +# Replication details +replication_max_conns = 10 +replication_full_sync_interval = 1h + +service aggregator { + fifo_listener replication-notify-fifo { + user = vmail + group = vmail + mode = 0660 + } + unix_listener replication-notify { + user = vmail + group = vmail + mode = 0660 + } +} + +# Enable doveadm replicator commands +service replicator { + process_min_avail = 1 + unix_listener replicator-doveadm { + mode = 0660 + user = vmail + group = vmail + } +} + +doveadm_port = 63301 +doveadm_password = +service doveadm { + vsz_limit=512 MB + inet_listener { + port = 63301 + } +} + +plugin { + mail_replica = tcp:maildrop0.ddln0.net.ipng.ch +} +``` + +To try to explain this - The first service, the `aggregator` opens some notification FIFOs that will +notify listeners of new replication events. Then, Dovecot will start a process called `replicator`, +which gets these cues when there is work to be done. It will connect to a `mail_replica` on another +host, on the `doveadm_port` (in my case 63301) which is protected by a shared password. And with +that, all e-mail that is delivered via **LMTP** on this machine, is both retrievable via **IMAPS** +but also gets copied to the remote machine `maildrop0.ddln0.net.ipng.ch` (and in _its_ +configuration, it'll synchronize mail to `maildrop0.chbtl0.net.ipng.ch`). Nice! + +#### Dovecot: Sieve + +Having a flat mailbox is just no fun (unless you're using GMail, in which case: tolerable). Enter +_Sieve_, described in [[RFC5228](https://datatracker.ietf.org/doc/html/rfc5228)]. Scripts written in +Sieve are executed during final delivery, when the message is moved to the user-accessible mailbox. +In systems where the Mail Transfer Agent (MTA) does final delivery, such as traditional Unix mail, +it is reasonable to filter when the MTA deposits mail into the user's mailbox. 
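+
+To give a feel for the language before diving into the Dovecot side of things, here's a tiny,
+made-up Sieve script that files mail for one of my aliases into its own folder, and moves anything
+that Rspamd tagged on the way in to Junk:
+
+```
+require ["fileinto"];
+
+# File the FrysIX noc alias into its own folder
+if address :is "to" "noc@frys-ix.net" {
+    fileinto "FrysIX";
+}
+
+# Rspamd was iffy about this one: off to Junk it goes
+if header :contains "X-Spam" "Yes" {
+    fileinto "Junk";
+}
+```
+
+Hooking Sieve into Dovecot happens with a few more snippets in `dovecot.conf`: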
+ +``` +protocols = $protocols sieve + +plugin { + sieve = ~/.dovecot.sieve + sieve_global_path = /etc/dovecot/sieve/default.sieve + sieve_dir = ~/sieve + sieve_global_dir = /etc/dovecot/sieve/ + sieve_extensions = +editheader + sieve_before = /etc/dovecot/sieve/before.d + sieve_after = /etc/dovecot/sieve/after.d +} + +plugin { + sieve_plugins = sieve_imapsieve sieve_extprograms + + # From elsewhere to Junk folder + imapsieve_mailbox1_name = Junk + imapsieve_mailbox1_causes = COPY + imapsieve_mailbox1_before = file:/etc/dovecot/sieve/report-spam.sieve + + # From Junk folder to elsewhere + imapsieve_mailbox2_name = * + imapsieve_mailbox2_from = Junk + imapsieve_mailbox2_causes = COPY + imapsieve_mailbox2_before = file:/etc/dovecot/sieve/report-ham.sieve + + sieve_pipe_bin_dir = /etc/dovecot/sieve + + sieve_global_extensions = +vnd.dovecot.pipe +} +``` + +This is a mouthful, but only because it's hella cool. By default, each mailbox will have a +`.dovecot.sieve` file that is consulted at each delivery. If no file exists there, the default sieve +will be used. But also, some sieve filters might happen either `sieve_before` the users' one is +called, or `sieve_after`. Then, in the plugin I create two specific triggers: + +1. if a file is copied to the Junk folder, I will run it through a script called + `report-spam.sieve`. +1. similarly, if it is moved *out of* the Junk folder, I'll run the `report-ham.sieve` script. + +Using an `rspamc` client, I can wheel over the cluster of Rspamd servers one by one and offer them +these two events (the train-spam and train-ham are similar, so I'll only show one): + +``` +pim@maildrop0-chbtl0:~$ cat /etc/dovecot/sieve/report-spam.sieve +require ["vnd.dovecot.pipe", "copy", "imapsieve", "environment", "variables"]; + +if environment :matches "imap.email" "*" { + set "email" "${1}"; +} + +pipe :copy "train-spam.sh" [ "${email}" ]; +pim@maildrop0-chbtl0:~$ cat /etc/dovecot/sieve/train-spam.sh +logger learning spam +/usr/bin/rspamc -h rspamd.net.ipng.ch:11332 learn_spam +``` + +Users of Dovecot can now add their Sieve configs to their mailbox: + +``` +pim@maildrop0-chbtl0:~$ sudo ls -la /var/dovecot/users/pim/ +lrwxrwxrwx 1 vmail vmail 19 Apr 2 17:27 .dovecot.sieve -> sieve/ipng_v1.sieve +-rw------- 1 vmail vmail 2113 Mar 29 11:35 .dovecot.sieve.log +-rw------- 1 vmail vmail 5001 May 14 14:34 .dovecot.svbin +drwx------ 4 vmail vmail 4096 May 17 16:02 mdbox +drwx------ 3 vmail vmail 4096 May 14 14:30 sieve +``` + +but seeing as (a) it's tedious to have to edit these files on multiple dovecot replicas, and (b) my +users will not receive access to `vmail` user in order to actually do that, as it would be a +security risk, I need one more thing. + +#### Dovecot: IMAP Sieve + +Dovecot has an implementation of a replication-aware Sieve filter editor called `managesieve`: + +``` +service managesieve-login { + inet_listener sieve { + port = 4190 + } +} + +service managesieve { + process_limit = 256 +} + +protocol sieve { +} +``` + +It will use the IMAP credentials to allow users to edit their _Sieve_ filter online. For example, +Thunderbird has a plugin for it, which does syntax checking and what-not. When the filter is edited, +it is syntax checked, compiled and replicated to the other Dovecot instance. + +{{< image src="/assets/smtp/sieve_thunderbird.png" alt="Sieve Thunderbird" >}} + +#### NGINX + +I have an imap server and a mangesieve server, redundantly running on two Dovecot machines. 
I recall +reading in the Dovecot manual that it is slightly preferable to have users go to a consistent +replica and not bounce around between them. Luckily, I can do exactly that using the NGINX +frontends: + +``` +upstream imap { + server maildrop0.chbtl0.net.ipng.ch:993 fail_timeout=10s max_fails=2; + server maildrop0.ddln0.net.ipng.ch:993 fail_timeout=10s max_fails=2 backup; +} + +server { + listen [::]:993; + listen 0.0.0.0:993; + proxy_pass imap; +} + +upstream sieve { + server maildrop0.chbtl0.net.ipng.ch:4190 fail_timeout=10s max_fails=2; + server maildrop0.ddln0.net.ipng.ch:4190 fail_timeout=10s max_fails=2 backup; +} + +server { + listen [::]:4190; + listen 0.0.0.0:4190; + proxy_pass sieve; +} +``` + +I keep port 993 for `maildrop` as well as port 587 for `smtp-out` unfiltered on the NGINX cluster. +I'm a little bit more protective of the `managesieve` service, so port 4190 is allowed only when +users are connected to the VPN or the internal (office/home) network. + +Now, you'll recall that in the `smtp-in` servers, I forward mail to `pim@maildrop.net.ipng.ch`, +for which the redundant Dovecot servers are both accepting mail. On the way in, I can see to it that +the primary replica is used , by giving it a slightly lower preference in DNS MX records: + +``` +maildrop.net.ipng.ch. 300 IN MX 10 maildrop0.chbtl0.net.ipng.ch. +maildrop.net.ipng.ch. 300 IN MX 20 maildrop0.ddln0.net.ipng.ch. + +imap.ipng.ch. 60 IN CNAME nginx0.ipng.ch. +nginx0.ipng.ch. 600 IN A 194.1.163.151 +nginx0.ipng.ch. 600 IN A 46.20.246.124 +nginx0.ipng.ch. 600 IN A 94.142.241.189 +nginx0.ipng.ch. 600 IN AAAA 2001:678:d78:7::151 +nginx0.ipng.ch. 600 IN AAAA 2a02:2528:ff00::124 +nginx0.ipng.ch. 600 IN AAAA 2a02:898:146::5 +``` + +This will make the `smtp-in` hosts prefer to use the `chbtl0` maildrop replica when it's available. +If ever it were to go down, they will automatically fail over and use `ddln0`, which will replicate +back any changes while `chbtl0` is down for maintenance or hardware failure. On the way to out, the +nginx cluster will prefer to use `chbtl0` as well, as it has marked the `ddln0` replica as `backup`. + +### 4. Webmail: Roundcube + +Now that I have all of the infrastructure up and running, I thought I'd put some icing on the cake +with Roundcube, a web-based IMAP email client. Roundcube's most prominent feature is the pervasive +use of Ajax technology. It also comes with an online Sieve editor, and runs in Docker. What more +can I ask for? + +Installing it is really really easy in my case. Since I have an nginx cluster to frontend it and do +the SSL offloading, I choose the simplest version with the following `docker-compose.yaml`: + + +``` +version: '2' + +services: + roundcubemail: + image: roundcube/roundcubemail:latest + container_name: roundcubemail + volumes: + - ./www:/var/www/html + - ./db/sqlite:/var/roundcube/db + ports: + - 9002:80 + environment: + - ROUNDCUBEMAIL_DB_TYPE=sqlite + - ROUNDCUBEMAIL_SKIN=elastic + - ROUNDCUBEMAIL_DEFAULT_HOST=ssl://maildrop0.net.ipng.ch + - ROUNDCUBEMAIL_DEFAULT_PORT=993 + - ROUNDCUBEMAIL_SMTP_SERVER=tls://smtp-out.net.ipng.ch + - ROUNDCUBEMAIL_SMTP_PORT=587 +``` + +There's a small snag, in that by default the SMTP user and password are expected to be the same as +for the IMAP server, which is not the case for my design. So, I create a user `roundcube` on the +`smtp-out` cluster and give it a suitable password. 
I nose around a little bit, and decide my +preference is to have _threaded_ view by default, and I also enable the `managesieve` plugin: + +``` + $config['log_driver'] = 'stdout'; + $config['zipdownload_selection'] = true; + $config['des_key'] = ''; + $config['enable_spellcheck'] = true; + $config['spellcheck_engine'] = 'pspell'; + $config['smtp_user'] = 'roundcube'; + $config['smtp_pass'] = ''; + $config['plugins'] = array('managesieve'); + $config['managesieve_host'] = 'tls://maildrop0.net.ipng.ch:4190'; + $config['default_list_mode'] = 'threads'; +``` + +I start the docker containers, and very quickly after, Roundcube shoots to life. I can expose it +behind the nginx cluster, while keeping it accessible only for VPN + office/home network users: + +``` +server { + listen [::]:80; + listen 0.0.0.0:80; + + server_name webmail.ipng.ch webmail.net.ipng.ch webmail; + access_log /var/log/nginx/webmail.ipng.ch-access.log; + include /etc/nginx/conf.d/ipng-headers.inc; + + location / { + return 301 https://webmail.ipng.ch$request_uri; + } +} + +geo $allowed_user { + default 0; + include /etc/nginx/conf.d/geo-ipng.inc; +} + +server { + listen [::]:443 ssl http2; + listen 0.0.0.0:443 ssl http2; + ssl_certificate /etc/certs/ipng.ch/fullchain.pem; + ssl_certificate_key /etc/certs/ipng.ch/privkey.pem; + include /etc/nginx/conf.d/options-ssl-nginx.inc; + ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc; + + server_name webmail.ipng.ch; + access_log /var/log/nginx/webmail.ipng.ch-access.log upstream; + include /etc/nginx/conf.d/ipng-headers.inc; + + if ($allowed_user = 0) { rewrite ^ https://ipng.ch/ break; } + + location / { + proxy_pass http://docker0.frggh0.net.ipng.ch:9002; + proxy_set_header Host $host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + proxy_set_header X-Forwarded-Proto $scheme; + } +} +``` + +The configuration has one neat trick in it -- it uses the `geo` module in NGINX to assert that the +client address is used to set the value of `allowed_user`. It will be 1 if the client connected from +any network defined in the `geo-ipng.inc` file, and 0 otherwise. I then use it to bounce unwanted +visitors back to the main [[website](https://ipng.ch/)], and expose Roundcube for those that are +welcome. + +While the Roundcube instance is not replicated, it's also non-essential. I will be using +Thunderbird, Mail.app and other clients more regularly than Roundcube. It may just be handy in a +pinch to either check mail using a browser, but also to edit the _Sieve_ filters easily. + +In my defense, considering roundcube is pretty much stateless, I can actually just run multiple +copies of it on a few docker hosts at IPng -- then in the nginx configs I might use a similar +construct as for the `maildrop` and `smtp-out` services, with a primary and hot standby. 
But that +will be for that one day that the docker host in Lille dies AND I decided I absolutely require +Roundcube precisely on that day :) diff --git a/content/articles/2024-05-25-nat64-1.md b/content/articles/2024-05-25-nat64-1.md new file mode 100644 index 0000000..b9dfe28 --- /dev/null +++ b/content/articles/2024-05-25-nat64-1.md @@ -0,0 +1,412 @@ +--- +date: "2024-05-25T12:23:54Z" +title: 'Case Study: NAT64' +--- + +# Introduction + +{{< image width="400px" float="right" src="/assets/oem-switch/s5648x-front-opencase.png" alt="Front" >}} + +IPng's network is built up in two main layers, (1) an MPLS transport layer, which is disconnected +from the Internet, and (2) a VPP overlay, which carries the Internet. I created a BGP Free core +transport network, which uses MPLS switches from a company called Centec. These switches offer IPv4, +IPv6, VxLAN, GENEVE and GRE all in silicon, are very cheap on power and relatively affordable per +port. + +Centec switches allow for a modest but not huge amount of routes in the hardware forwarding tables. +I loadtested them in [[a previous article]({% post_url 2022-12-05-oem-switch-1 %})] at line rate +(well, at least 8x10G at 64b packets and around 110Mpps), and they forward IPv4, IPv6 and MPLS +traffic effortlessly, at 45 watts. + +I wrote more about the Centec switches in [[my review]({% post_url 2023-03-11-mpls-core %})] of them +back in 2022. + +### IPng Site Local + +{{< image width="400px" float="right" src="/assets/nat64/MPLS IPng Site Local v2.svg" alt="IPng SL" >}} + +I leverage this internal transport network for more than _just_ MPLS. The transport switches are +perfectly capable of line rate (at 100G+) IPv4 and IPv6 forwarding as well. When designing IPng Site +Local, I created a number plan that assigns ***IPv4*** from the **198.19.0.0/16** prefix, and ***IPv6*** +from the **2001:678:d78:500::/56** prefix. Within these, I allocate blocks for _Loopback_ addresses, +_PointToPoint_ subnets, and hypervisor networks for VMs and internal traffic. + +Take a look at the diagram to the right. Each site has one or more Centec switches (in red), and +there are three redundant gateways that connect the IPng Site Local network to the Internet (in +orange). I run lots of services in this red portion of the network: site to site backups +[[Borgbackup](https://www.borgbackup.org/)], ZFS replication [[ZRepl](https://zrepl.github.io/)], a +message bus using [[Nats](https://nats.io)], and of course monitoring with SNMP and Prometheus all +make use of this network. But it's not only internal services like management traffic, I also +actively use this private network to expose _public_ services! + +For example, I operate a bunch of [[NGINX Frontends]({% post_url 2023-03-17-ipng-frontends %})] that +have a public IPv4/IPv6 address, and reversed proxy for webservices (like +[[ublog.tech](https://ublog.tech)] or [[Rallly](https://rallly.ipng.ch/)]) which run on VMs and +Docker hosts which don't have public IP addresses. Another example which I wrote about [[last +week]({% post_url 2024-05-17-smtp %})], is a bunch of mail services that run on VMs without public +access, but are each carefully exposed via reversed proxies (like Postfix, Dovecot, or +[[Roundcube](https://webmail.ipng.ch)]). It's an incredibly versatile network design! + +### Border Gateways + +Seeing as IPng Site Local uses native IPv6, it's rather straight forward to give each hypervisor and +VM an IPv6 address, and configure IPv4 only on the externally facing NGINX Frontends. 
As a reversed +proxy, NGINX will create a new TCP session to the internal server, and that's a fine solution. +However, I also want my internal hypervisors and servers to have full Internet connectivity. For +IPv6, this feels pretty straight forward, as I can just route the **2001:678:d78:500::/56** through +a firewall that blocks incoming traffic, and call it a day. For IPv4, similarly I can use classic +NAT just like one would in a residential network. + +**But what if I wanted to go IPv6-only?** This poses a small challenge, because while IPng is fully +IPv6 capable, and has been since the early 2000s, the rest of the internet is not quite there yet. +For example, the quite popular [[GitHub](https://github.com/pimvanpelt/)] hosting site still has +only an IPv4 address. Come on, folks, what's taking you so long?! It is for this purpose that NAT64 +was invented. Described in [[RFC6146](https://datatracker.ietf.org/doc/html/rfc6146)]: + +> Stateful NAT64 translation allows IPv6-only clients to contact IPv4 servers using unicast +> UDP, TCP, or ICMP. One or more public IPv4 addresses assigned to a NAT64 translator are shared +> among several IPv6-only clients. When stateful NAT64 is used in conjunction with DNS64, no +> changes are usually required in the IPv6 client or the IPv4 server. + +The rest of this article describes version 2 of the IPng SL border gateways, which opens the path +for IPng to go IPv6-only. By the way, I thought it would be super complicated, but in hindsight: I +should have done this years ago! + +#### Gateway Design + +{{< image width="400px" float="right" src="/assets/nat64/IPng NAT64.svg" alt="IPng Border Gateway" >}} + +Let me take a closer look at the orange boxes that I drew in the network diagram above. I call these +machines _Border Gateways_. Their job is to sit between IPng Site Local and the Internet. They'll +each have one network interface connected to the Centec switch, and another connected to +the VPP routers at AS8298. They will provide two main functions: firewalling, so that no unwanted +traffic enters IPng Site local, and NAT translation, so that: +1. IPv4 users from **198.19.0.0/16** can reach external IPv4 addresses, +1. IPv6 users from **2001:678:d78:500::/56** can reach external IPv6, +1. _IPv6-only_ users can reach external **IPv4** addresses, a neat trick. + +#### IPv4 and IPv6 NAT + +Let me start off with the basic tablestakes. You'll likely be familiar with _masquerading_, a +NAT technique in Linux that uses the public IPv4 address assigned by your provider, allowing +many internal clients, often using [[RFC1918](https://datatracker.ietf.org/doc/html/rfc1918)] addresses, +to access the internet via that shared IPv4 address. You may not have come across IPv6 _masquerading_ +though, but it's equally possible to take an internal (private, non-routable) +IPv6 network and access the internet via a shared IPv6 address. 
+ +I will assign a pool of four public IPv4 addresses and eight IPv6 addresses to each border gateway: + +| **Machine** | **IPv4 pool** | **IPv6 pool** | +| border0.chbtl0.net.ipng.ch | 194.126.235.0/30 | 2001:678:d78::3:0:0/125 | +| border0.chrma0.net.ipng.ch | 194.126.235.4/30 | 2001:678:d78::3:1:0/125 | +| border0.chplo0.net.ipng.ch | 194.126.235.8/30 | 2001:678:d78::3:2:0/125 | +| border0.nlams0.net.ipng.ch | 194.126.235.12/30 | 2001:678:d78::3:3:0/125 | + +Linux iptables _masquerading_ will only work with the IP addresses assigned to the external +interface, so I will need to use a slightly different approach to be able to use these _pools_. In +case you're wondering -- IPng's internal network has grown to the size now that I cannot expose it +all behind a single IPv4 address; there will not be enough TCP/UDP ports. Luckily, NATing via a pool +is pretty easy using the _SNAT_ module: + +``` +pim@border0-chrma0:~$ cat << EOF | sudo tee /etc/rc.firewall.ipng-sl +# IPng Site Local: Enable stateful firewalling on IPv4/IPv6 forwarding +iptables -P FORWARD DROP +ip6tables -P FORWARD DROP +iptables -I FORWARD -i enp1s0f1 -m state --state NEW -s 198.19.0.0/16 -j ACCEPT +ip6tables -I FORWARD -i enp1s0f1 -m state --state NEW -s 2001:678:d78:500::/56 -j ACCEPT +iptables -I FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT +ip6tables -I FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT + +# IPng Site Local: Enable NAT on external interface using NAT pools +iptables -t nat -I POSTROUTING -s 198.19.0.0/16 -o enp1s0f0 \ + -j SNAT --to 194.126.235.4-194.126.235.7 +ip6tables -t nat -I POSTROUTING -s 2001:678:d78:500::/56 -o enp1s0f0 \ + -j SNAT --to 2001:678:d78::3:1:0-2001:678:d78::3:1:7 +EOF +``` + +From the top -- I'll first make it the default for the kernel to refuse to _FORWARD_ any traffic that +is not explicitly accepted. I will only allow traffic that comes in via `enp1s0f1` (the internal +interface), only if it comes from the assigned IPv4 and IPv6 site local prefixes. On the way back, +I'll allow traffic that matches states created on the way out. This is the _firewalling_ portion of +the setup. + +Then, two _POSTROUTING_ rules turn on network address translation. If the source address is any of +the site local prefixes, I'll rewrite it to come from the IPv4 or IPv6 pool addresses, respectively. +This is the _NAT44_ and _NAT66_ portion of the setup. + +#### NAT64: Jool + +{{< image width="400px" float="right" src="/assets/nat64/jool.png" alt="Jool" >}} + +So far, so good. But this article is about NAT64 :-) Here's where I grossly overestimated how +difficult it might be -- and if there's one takeaway from my story here, it should be that NAT64 is +as straight forward as the others! Enter [[Jool](https://jool.mx)], an Open Source SIIT and NAT64 +for Linux. It's available in Debian as a DKMS kernel module and userspace tool, and it integrates +cleanly with both _iptables_ and _netfilter_. + +Jool is a network address and port translating +implementation, which is referred to as _NAPT_, just as regular IPv4 NAT. When internal IPv6 clients +try to reach an external endpoint, Jool will make note of the internal src6:port, then select an +external IPv4 address:port, rewrite the packet, and on the way back, correlate the src4:port with +the internal src6:port, and rewrite the packet. If this sounds an awful lot like NAT, then you're +not wrong! The only difference is, Jool will also translate the *address family*: it will rewrite +the internal IPv6 addresses to external IPv4 addresses. 
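+
+The address family rewrite hinges on embedding the original IPv4 destination in the bottom 32 bits
+of a dedicated IPv6 prefix. A quick illustration, using the well-known NAT64 prefix
+**64:ff9b::/96** and the documentation address `192.0.2.1`:
+
+```
+pim@border0-chrma0:~$ printf '64:ff9b::%02x%02x:%02x%02x\n' 192 0 2 1
+64:ff9b::c000:0201
+```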
+ +Installing Jool is as simple as this: + +``` +pim@border0-chrma0:~$ sudo apt install jool-dkms jool-tools +pim@border0-chrma0:~$ sudo mkdir /etc/jool +pim@border0-chrma0:~$ cat << EOF | sudo tee /etc/jool/jool.conf +{ + "comment": { + "description": "Full NAT64 configuration for border0.chrma0.net.ipng.ch", + "last update": "2024-05-21" + }, + "instance": "default", + "framework": "netfilter", + "global": { "pool6": "2001:678:d78:564::/96", "lowest-ipv6-mtu": 1280, "logging-debug": false }, + "pool4": [ + { "protocol": "TCP", "prefix": "194.126.235.4/30", "port range": "1024-65535" }, + { "protocol": "UDP", "prefix": "194.126.235.4/30", "port range": "1024-65535" }, + { "protocol": "ICMP", "prefix": "194.126.235.4/30" } + ] +} +EOF +pim@border0-chrma0:~$ sudo systemctl start jool +``` + +.. and that, as they say, is all there is to it! There's two things I make note of here: +1. I have assigned **2001:678:d78:564::/96** as NAT64 `pool6`, which means that if this machine + sees any traffic _destined_ to that prefix, it'll activate Jool, select an available IPv4 + address:port from the `pool4`, and send the packet to the IPv4 destination address which it + takes from the last 32 bits of the original IPv6 destination address. +1. Cool trick: I am **reusing** the same IPv4 pool as for regular NAT. The Jool kernel module + happily coexists with the _iptables_ implementation! + +#### DNS64: Unbound + +There's one vital piece of information missing, and it took me a little while to appreciate that. If +I take an IPv6 only host, like Summer, and I try to connect to an IPv4-only host, how does that even +work? + +``` +pim@summer:~$ ip -br a +lo UNKNOWN 127.0.0.1/8 ::1/128 +eno1 UP 2001:678:d78:50b::f/64 fe80::7e4d:8fff:fe03:3c00/64 +pim@summer:~$ ip -6 ro +2001:678:d78:50b::/64 dev eno1 proto kernel metric 256 pref medium +fe80::/64 dev eno1 proto kernel metric 256 pref medium +default via 2001:678:d78:50b::1 dev eno1 proto static metric 1024 pref medium + +pim@summer:~$ host github.com +github.com has address 140.82.121.4 +pim@summer:~$ ping github.com +ping: connect: Network is unreachable +``` + +Now comes the really clever reveal -- NAT64 works by assigning an IPv6 prefix that snugly fits the +entire IPv4 address space, typically **64:ff9b::/96**, but operators can chose any prefix they'd like. +For IPng's site local network, I decided to assign **2001:678:d78:564::/96** for this purpose +(this is the `global.pool6` attribute in Jool's config file I described above). A resolver can then +tweak DNS lookups for IPv6-only hosts to return addresses from that IPv6 range. This tweaking is +called DNS64, described in [[RFC6147](https://datatracker.ietf.org/doc/html/rfc6147)]: + +> DNS64 is a mechanism for synthesizing AAAA records from A records. DNS64 is used with an +> IPv6/IPv4 translator to enable client-server communication between an IPv6-only client and an +> IPv4-only server, without requiring any changes to either the IPv6 or the IPv4 node, for the +> class of applications that work through NATs. + +I run the popular [[Unbound](https://www.nlnetlabs.nl/projects/unbound/about/)] resolver at IPng, +deployed as a set of anycasted instances across the network. 
With two lines of configuration only, I +can turn on this feature: + +``` +pim@border0-chrma0:~$ cat << EOF | sudo tee /etc/unbound/unbound.conf.d/dns64.conf +server: + module-config: "dns64 iterator" + dns64-prefix: 2001:678:d78:564::/96 +EOF +pim@border0-chrma0:~$ sudo systemctl restat unbound +``` + +The behavior of the resolver now changes in a very subtle but cool way: + +``` +pim@summer:~$ host github.com +github.com has address 140.82.121.3 +github.com has IPv6 address 2001:678:d78:564::8c52:7903 +pim@summer:~$ host 2001:678:d78:564::8c52:7903 +3.0.9.7.2.5.c.8.0.0.0.0.0.0.0.0.4.6.5.0.8.7.d.0.8.7.6.0.1.0.0.2.ip6.arpa + domain name pointer lb-140-82-121-3-fra.github.com. +``` + +Before, [[github.com](https://github.com/pimvanpelt/)] did not return an AAAA record, so there was +no way for Summer to connect to it. But now, not only does it return an AAAA record, but it also +rewrites the PTR request, knowing that I'm asking for something in the DNS64 range of +**2001:678:d78:564::/96**, Unbound will instead strip off the last 32 bits (`8c52:7903`, which is the +hex encoding for the original IPv4 address), and return the answer for a PTR lookup for the original +`3.121.82.140.in-addr.arpa` instead. Game changer! + +{{< image width="400px" float="right" src="/assets/nat64/IPng NAT64.svg" alt="IPng Border Gateway" >}} + +#### DNS64 + NAT64 + +What I learned from this, is that the _combination_ of these two tools provides the magic: + +1. When an IPv6-only client asks for AAAA for an IPv4-only hostname, Unbound will synthesize an AAAA + from the IPv4 address, casting it into the last 32 bits of its NAT64 prefix **2001:678:d78:564::/96** +1. When an IPv6-only client tries to send traffic to **2001:678:d78:564::/96**, Jool will do the + address family (and address/port) translation. This is represented by the red (ipv6) flow in the + diagram to the right turning into a green (ipv4) flow to the left. + +What's left for me to do is to ensure that (a) the NAT64 prefix is routed from IPng Site Local to +the gateways and (b) that the IPv4 and IPv6 NAT address pools is routed from the Internet to the +gateways. + +#### Internal: OSPF + +I use Bird2 to accomplish the dynamic routing - and considering the Centec switch network is by +design _BGP Free_, I will use OSPF and OSPFv3 for these announcements. Using OSPF has an important +benefit: I can selectively turn on and off the Bird announcements to the Centec IPng Site local +network. Seeing as there will be multiple redundant gateways, if one of them goes down (either due +to failure or because of maintenance), the network will quickly reconverge on another replica. Neat! + +Here's how I configure the OSPF import and export filters: + +``` +filter ospf_import { + if (net.type = NET_IP4 && net ~ [ 198.19.0.0/16 ]) then accept; + if (net.type = NET_IP6 && net ~ [ 2001:678:d78:500::/56 ]) then accept; + reject; +} + +filter ospf_export { + if (net.type=NET_IP4 && !(net~[198.19.0.255/32,0.0.0.0/0])) then reject; + if (net.type=NET_IP6 && !(net~[2001:678:d78:564::/96,2001:678:d78:500::1:0/128,::/0])) then reject; + + ospf_metric1 = 200; unset(ospf_metric2); + accept; +} +``` + +When learning prefixes _from_ the Centec switch, I will only accept precisely the IPng Site Local +IPv4 (198.19.0.0/16) and IPv6 (2001:678:d78:500::/56) supernets. On sending prefixes _to_ the Centec +switches, I will announce: +* ***198.19.0.255/32*** and ***2001:678:d78:500::1:0/128***: These are the anycast addresses of the Unbound resolver. 
+* ***0.0.0.0/0*** and ***::/0***: These are default routes for IPv4 and IPv6 respectively +* ***2001:678:d78:564::/96***: This is the NAT64 prefix, which will attract the IPv6-only traffic + towards DNS64-rewritten destinations, for example 2001:678:d78:564::8c52:7903 as DNS64 representation + of github.com, which is reachable only at legacy address 140.82.121.3. + +{{< image width="100px" float="left" src="/assets/nat64/brain.png" alt="Brain" >}} + +I have to be careful with the announcements into OSPF. The cost of E1 routes is the cost of the +external metric **in addition to** the internal cost within OSPF to reach that network. The cost +of E2 routes will always be the external metric, the metric will take no notice of the internal +cost to reach that router. Therefor, I emit these prefixes without Bird's `ospf_metric2` set, so +that the closest border gateway is always used. + +With that, I can see the following: +``` +pim@summer:~$ traceroute6 github.com +traceroute to github.com (2001:678:d78:564::8c52:7903), 30 hops max, 80 byte packets + 1 msw0.chbtl0.net.ipng.ch (2001:678:d78:50b::1) 4.134 ms 4.640 ms 4.796 ms + 2 border0.chbtl0.net.ipng.ch (2001:678:d78:503::13) 0.751 ms 0.818 ms 0.688 ms + 3 * * * + 4 * * * ^C +``` + +I'm not quite there yet, I have one more step to go. What's happening at the Border Gateway? Let me +take a look at this, while I ping6 to github.com: + +``` +pim@summer:~$ ping6 github.com +PING github.com(lb-140-82-121-4-fra.github.com (2001:678:d78:564::8c52:7904)) 56 data bytes +... (nothing) + +pim@border0-chbtl0:~$ sudo tcpdump -ni any src host 2001:678:d78:50b::f or dst host 140.82.121.4 +11:25:19.225509 enp1s0f1 In IP6 2001:678:d78:50b::f > 2001:678:d78:564::8c52:7904: + ICMP6, echo request, id 3904, seq 7, length 64 +11:25:19.225603 enp1s0f0 Out IP 194.126.235.3 > 140.82.121.4: + ICMP echo request, id 61668, seq 7, length 64 +``` + +Unbound and Jool are doing great work. Unbound saw my DNS request for IPv4-only github.com, and +synthesized a DNS64 response for me. Jool then saw the inbound packet from enp1s0f1, the internal +interface pointed at IPng Site Local. This is because the **2001:678:d78:564::/96** prefix is +announced in OSPFv3 so every host knows to route traffic to that prefix to this border gateway. +But then, I see the NAT64 in action on the outbound interface enp1s0f0. Here, one of the IPv4 pool +addresses is selected as source address. But there is no return packet, because there is no route +back from the Internet, yet. + +#### External: BGP + +The final step for me is to allow return traffic, from the Internet to the IPv4 and IPv6 pools to +reach this Border Gateway instance. For this, I configure BGP with the following Bird2 +configuration snippet: + + +``` +filter bgp_import { + if (net.type = NET_IP4 && !(net = 0.0.0.0/0)) then reject; + if (net.type = NET_IP6 && !(net = ::/0)) then reject; + accept; +} +filter bgp_export { + if (net.type = NET_IP4 && !(net ~ [ 194.126.235.4/30 ])) then reject; + if (net.type = NET_IP6 && !(net ~ [ 2001:678:d78::3:1:0/125 ])) then reject; + + # Add BGP Wellknown community no-export (FFFF:FF01) + bgp_community.add((65535,65281)); + accept; +} +``` + +I then establish an eBGP session from private AS64513 to two of IPng Networks' core routers at +AS8298. I add the wellknown BGP no-export community (`FFFF:FF01`) so that these prefixes are learned +in AS8298, but never propagated. 
It's not strictly necessary, because AS8298 won't announce more +specifics like these anyway, but it's a nice way to really assert that these are meant to stay +local. Because AS8298 is already announcing **194.126.235.0/24** and **2001:678:d78::/48** +supernets, return traffic will already be able to reach IPng's routers upstream. With these more +specific announcements of the /30 and /125 pools, the upstream VPP routers will be able to route the +return traffic to this specific server. + +And with that, the ping to Unbound's DNS64 provided IPv6 address for github.com shoots to life. + +### Results + +I deployed four of these Border Gateways using Ansible: one at my office in Brüttisellen, one +in Zurich, one in Geneva and one in Amsterdam. They do all three types of NAT: + +* Announcing the IPv4 default **0.0.0.0/0** will allow them to serve as NAT44 gateways for + **198.19.0.0/16** +* Announcing the IPv6 default **::/0** will allow them to serve as NAT66 gateway for + **2001:678:d78:500::/56** +* Announcing the IPv6 nat64 prefix **2001:678:d78:564::/96** will allow them to serve as NAT64 gateway +* Announcing the IPv4 and IPv6 anycast address for `nscache.net.ipng.ch` allows them to serve DNS64 + +Each individual service can be turned on or off. For example, stopping to announce the IPv4 default +into the Centec network, will no longer attract NAT44 traffic through a replica. Similarly, stopping +to announce the NAT64 prefix will no longer attract NAT64 traffic through that replica. OSPF in the +IPng Site Local network will automatically select an alternative replica in such cases. Shutting +down Bird2 alltogether will immediately drain the machine of all traffic, while traffic is +immediately rerouted. + +If you're curious, here's a few minutes of me playing with failover, while watching YouTube videos +concurrently. + +{{< image src="/assets/nat64/nat64.gif" alt="Asciinema" >}} + +### What's Next + +I've added an Ansible module in which I can configure the individual instances' IPv4 and IPv6 NAT +pools, and turn on/off the three NAT types by means of steering the OSPF announcements. I can also +turn on/off the Anycast Unbound announcements, in much the same way. + +If you're a regular reader of my stories, you'll maybe be asking: Why didn't you use VPP? And that +would be an excellent question. I need to noodle a little bit more with respect to having all three +NAT types concurrently working alongside Linux CP for the Bird and Unbound stuff, but I think in the +future you might see a followup article on how to do all of this in VPP. Stay tuned! diff --git a/content/articles/2024-06-22-vpp-ospf-2.md b/content/articles/2024-06-22-vpp-ospf-2.md new file mode 100644 index 0000000..c4347c4 --- /dev/null +++ b/content/articles/2024-06-22-vpp-ospf-2.md @@ -0,0 +1,624 @@ +--- +date: "2024-06-22T09:17:54Z" +title: VPP with loopback-only OSPFv3 - Part 2 +--- + +{{< image width="200px" float="right" src="/assets/vpp-ospf/bird-logo.svg" alt="Bird" >}} + +# Introduction + +When I first built IPng Networks AS8298, I decided to use OSPF as an IPv4 and IPv6 internal gateway +protocol. Back in March I took a look at two slightly different ways of doing this for IPng, notably +against a backdrop of conserving IPv4 addresses. As the network grows, the little point to point +transit networks between routers really start adding up. + +I explored two potential solutions to this problem: + +1. 
**[[Babel]({% post_url 2024-03-06-vpp-babel-1 %})]** can use IPv6 nexthops for IPv4 destinations - + which is _super_ useful because it would allow me to retire all of the IPv4 /31 point to point + networks between my routers. +1. **[[OSPFv3]({% post_url 2024-04-06-vpp-ospf %})]** makes it difficult to use IPv6 nexthops for + IPv4 destinations, but in a discussion with the Bird Users mailinglist, we found a way: by reusing + a single IPv4 loopback address on adjacent interfaces + +{{< image width="90px" float="left" src="/assets/vpp-ospf/canary.png" alt="Canary" >}} + +In May I ran a modest set of two _canaries_, one between the two routers in my house (`chbtl0` and +`chbtl1`), and another between a router at the Daedalean colocation and Interxion datacenters (`ddln0` +and `chgtg0`). AS8298 has about quarter of a /24 tied up in these otherwise pointless point-to-point +transit networks (see what I did there?). I want to reclaim these! + +Seeing as the two tests went well, I decided to roll this out and make it official. This post +describes how I rolled out an (almost) IPv4-less core network for IPng Networks. It was actually way +easier than I had anticipated, and apparently I was not alone - several of my buddies in the +industry have asked me about it, so I thought I'd write a little bit about the configuration. + +# Background: OSPFv3 with IPv4 + +***💩 /30: 4 addresses***: In the oldest of days, two routers that formed an IPv4 OSPF adjacency would +have a /30 _point-to-point_ transit network between them. Router A would have the lower available +IPv4 address, and Router B would have the upper available IPv4 address. The other two addresses in +the /30 would be the _network_ and _broadcast_ addresses of the prefix. Not a very efficient way to +do things, but back in the old days, IPv4 addresses were in infinite supply. + +***🥈 /31: 2 addresses***: Enter [[RFC3021](https://datatracker.ietf.org/doc/html/rfc3021)], from +December 2000, which some might argue are also the old days. With ever-increasing pressure to +conserve IP address space on the Internet, it makes sense to consider where relatively minor changes +can be made to fielded practice to improve numbering efficiency. This RFC describes how to halve the +amount of address space assigned to point-to-point links (common throughout the Internet +infrastructure) by allowing the use of /31 prefixes for them. At some point, even our friends from +Latvia figured it out! + +***🥇 /32: 1 address***: In most networks, each router has what is called a _loopback_ IPv4 and IPv6 +address, typically a /32 and /128 in size. This allows the router to select a unique address that is +not bound to any given interface. It comes in handy in many ways -- for example to have stable +addresses to manage the router, and to allow it to connect to iBGP route reflectors and peers from +well known addresses. + +As it so turns out, two routers that form an adjacency can advertise ~any IPv4 address as nexthop, +provided that their adjacent peer knows how to find that address. Of course, with a /30 or /31 this +is obvious: if I have a directly connected /31, I can simply ARP for the other side, learn its MAC +address, and use that to forward traffic to the other router. + +### The Trick + +What would it look like if there's no subnet that directly connects two adjacent routers? Well, I +happen to know that RouterA and RouterB both have a /32 loopback address. 
So if I simply let RouterA +(1) advertise _its loopback_ address to neighbor RouterB, and also (2) answer ARP requests for that +address, the two routers should be able to form an adjacency. This is exactly what Ondrej's [[Bird2 +commit (1)](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)] and +my [[VPP gerrit (2)](https://gerrit.fd.io/r/c/vpp/+/40482)] accomplish, as perfect partners: + +1. Ondrej's change will make the Link LSA be _onlink_, which is a way to describe that the next hop + is not directly connected, in other words RouterB will be at nexthop `192.0.2.1`, while + RouterA itself is `192.0.2.0/32`. +1. My change will make VPP answer for ARP requests in such a scenario where RouterA with an + _unnumbered_ interface with `192.0.2.0/32` will respond to a request from the not directly + connected _onlink_ peer RouterB at `192.0.2.1`. + +## Rolling out P2P-less OSPFv3 + +### 1. Upgrade VPP + Bird2 + +First order of business is to upgrade all routers. I need a VPP version with the [[ARP +gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)] and a Bird2 version with the [[OSPFv3 +commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)]. I build +a set of Debian packages on `bookworm-builder` and upload them to IPng's website +[[ref](https://ipng.ch/media/vpp/bookworm/)]. + +I schedule a two nightly maintenance windows. In the first one, I'll upgrade two routers (`frggh0` +and `ddln1`) by means of canary. I'll let them run for a few days, and then wheel over the rest +after I'm confident there are no regressions. + +For each router, I will first _drain it_: this means in Kees, setting the OSPFv2 and OSPFv3 cost of +routers neighboring it to a higher number, so that traffic flows around the 'expensive' link. I will +also move the eBGP sessions into _shutdown_ mode, which will make the BGP sessions stay connected, +but the router will not announce any prefixes nor accept any from peers. Without it announcing or +learning any prefixes, the router stops seeing traffic. After about 10 minutes, it is safe to make +intrusive changes to it. + +Seeing as I'll be moving from OSPFv2 to OSPFv3, I will allow for a seemless transition by +configuring both protocols to run at the same time. The filter that applies to both flavors of OSPF +is the same: I will only allow more specifics of IPng's own prefixes to be propagated, and in +particular I'll drop all prefixes that come from BGP. I'll rename the protocol called `ospf4` to +`ospf4_old`, and create a new (OSPFv3) protocol called `ospf4` which has only the loopback interface +in it. This way, when I'm done, the final running protocol will simply be called `ospf4`: + +``` +filter f_ospf { + if (source = RTS_BGP) then reject; + if (net ~ [ 92.119.38.0/24{25,32}, 194.1.163.0/24{25,32}, 194.126.235.0/24{25,32} ]) then accept; + if (net ~ [ 2001:678:d78::/48{56,128}, 2a0b:dd80:3000::/36{48,48} ]) then accept; + reject; +} +protocol ospf v2 ospf4_old { + ipv4 { export filter f_ospf; import filter f_ospf; }; + area 0 { + interface "loop0" { stub yes; }; + interface "xe1-1.302" { type pointopoint; cost 61; bfd on; }; + interface "xe1-0.304" { type pointopoint; cost 56; bfd on; }; + }; +} +protocol ospf v3 ospf4 { + ipv4 { export filter f_ospf; import filter f_ospf; }; + area 0 { + interface "loop0","lo" { stub yes; }; + }; +} +``` + +In one terminal, I will start a ping to the router's IPv4 loopback: + +``` +pim@summer:~$ ping defra0.ipng.ch +PING (194.1.163.7) 56(84) bytes of data. 
+64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=1 ttl=61 time=6.94 ms +64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=2 ttl=61 time=7.00 ms +64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=3 ttl=61 time=7.03 ms +64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=4 ttl=61 time=7.03 ms +... +``` + +While in the other, I log in to the IPng Site Local connection to the router's management plane, to +perform the ugprade: + +``` +pim@squanchy:~$ ssh defra0.net.ipng.ch +pim@defra0:~$ wget -m --no-parent https://ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/ +pim@defra0:~$ cd ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/ +pim@defra0:~$ sudo nsenter --net=/var/run/netns/dataplane +root@defra0:~# pkill -9 vpp && systemctl stop bird-dataplane vpp && \ + dpkg -i ~pim/ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/*.deb && \ + dpkg -i ~pim/bird2_2.15.1_amd64.deb && \ + systemctl start bird-dataplane && \ + systemctl restart vpp-snmp-agent-dataplane vpp-exporter-dataplane +``` + +Then comes the small window of awkward staring at the ping I started in the other terminal. It +always makes me smile because it all comes back very quickly, within 90 seconds the router is back +online and fully converged with BGP: + +``` +pim@summer:~$ ping defra0.ipng.ch +PING (194.1.163.7) 56(84) bytes of data. +64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=1 ttl=61 time=6.94 ms +64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=2 ttl=61 time=7.00 ms +64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=3 ttl=61 time=7.03 ms +64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=4 ttl=61 time=7.03 ms +... +64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=94 ttl=61 time=1003.83 ms +64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=95 ttl=61 time=7.03 ms +64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=96 ttl=61 time=7.02 ms +64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=97 ttl=61 time=7.03 ms + +pim@defra0:~$ birdc show ospf nei +BIRD v2.15.1-4-g280daed5-x ready. +ospf4_old: +Router ID Pri State DTime Interface Router IP +194.1.163.8 1 Full/PtP 32.113 xe1-1.302 194.1.163.27 +194.1.163.0 1 Full/PtP 30.936 xe1-0.304 194.1.163.24 + +ospf4: +Router ID Pri State DTime Interface Router IP + +ospf6: +Router ID Pri State DTime Interface Router IP +194.1.163.8 1 Full/PtP 32.113 xe1-1.302 fe80::3eec:efff:fe46:68a8 +194.1.163.0 1 Full/PtP 30.936 xe1-0.304 fe80::6a05:caff:fe32:4616 +``` + +I can see that the OSPFv2 adjacencies have reformed, which is totally expected. 
Looking at the +router's current addresses: + +``` +pim@defra0:~$ ip -br a | grep UP +loop0 UP 194.1.163.7/32 2001:678:d78::7/128 fe80::dcad:ff:fe00:0/64 +xe1-0 UP fe80::6a05:caff:fe32:3e48/64 +xe1-1 UP fe80::6a05:caff:fe32:3e49/64 +xe1-2 UP fe80::6a05:caff:fe32:3e4a/64 +xe1-3 UP fe80::6a05:caff:fe32:3e4b/64 +xe1-0.304@xe1-0 UP 194.1.163.25/31 2001:678:d78::2:7:2/112 fe80::6a05:caff:fe32:3e48/64 +xe1-1.302@xe1-1 UP 194.1.163.26/31 2001:678:d78::2:8:1/112 fe80::6a05:caff:fe32:3e49/64 +xe1-2.441@xe1-2 UP 46.20.246.51/29 2a02:2528:ff01::3/64 fe80::6a05:caff:fe32:3e4a/64 +xe1-2.503@xe1-2 UP 80.81.197.38/21 2001:7f8::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64 +xe1-2.514@xe1-2 UP 185.1.210.235/23 2001:7f8:3d::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64 +xe1-2.515@xe1-2 UP 185.1.208.84/23 2001:7f8:44::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64 +xe1-2.516@xe1-2 UP 185.1.171.43/23 2001:7f8:9e::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64 +xe1-3.900@xe1-3 UP 193.189.83.55/23 2001:7f8:33::a100:8298:1/64 fe80::6a05:caff:fe32:3e4b/64 +xe1-3.2003@xe1-3 UP 185.1.155.116/24 2a0c:b641:701::8298:1/64 fe80::6a05:caff:fe32:3e4b/64 +xe1-3.3145@xe1-3 UP 185.1.167.136/23 2001:7f8:f2:e1::8298:1/64 fe80::6a05:caff:fe32:3e4b/64 +xe1-3.1405@xe1-3 UP 80.77.16.214/30 2a00:f820:839::2/64 fe80::6a05:caff:fe32:3e4b/64 +``` + +Take a look at interfaces `xe1-0.304` which is southbound from Frankfurt to Zurich +(`chrma0.ipng.ch`) and `xe1-1.302` which is northbound from Frankfurt to Amsterdam +(`nlams0.ipng.ch`). I am going to get rid of the IPv4 and IPv6 global unicast addresses on these two +interfaces, and let OSPFv3 borrow the IPv4 address from `loop0` instead. + +But first, rinse and repeat, until all routers are upgraded. + +### 2. A situational overview + +First, let me draw a diagram that helps show what I'm about to do: + +{{< image src="/assets/vpp-ospf/BTL-GTG-RMA Before.svg" alt="Step 2: Before" >}} + +In the network overview I've drawn four of IPng's routers. The ones at the bottom are the two +routers at my office in Brüttisellen, Switzerland, which explains their name `chbtl0` and +`chbtl1`, and they are connected via a local fiber trunk using 10Gig optics (drawn in red). On the left, the first router is connected via a +10G Ethernet-over-MPLS link (depicted in green) +to the NTT Datacenter in Rümlang. From there, IPng rents a 25Gbps wavelength to the Interxion +datacenter in Glattbrugg (shown in blue). Finally, +the Interxion router connects back to Brüttisellen using a 10G Ethernet-over-MPLS link (colored +in pink), completing the ring. + +You can also see that each router has a set of _loopback_ addresses, for example `chbtl0` in the +bottom left has IPv4 address `194.1.163.3/32` and IPv6 address `2001:678:d78::3/128`. Each point to +point network has assigned one /31 and one /112 with each router taking one address at either side. +Counting them up real quick, I see **twelve IPv4 addresses** in this diagram. This is a classic OSPF +design pattern. I seek to save eight of these addresses! + +### 3. First OSPFv3 link + +The rollout has to start somewhere, and I decide to start close to home, literally. I'm going to +remove the IPv4 and IPv6 addresses from the red link between the two +routers in Brüttisellen. They are directly connected, and if anything goes wrong, I can walk +over and rescue them. Sounds like a safe way to start! + +I quickly add the ability for [[vppcfg](https://github.com/pimvanpelt/vppcfg)] to configure +_unnumbered_ interfaces. 
In VPP, these are interfaces that don't have an IPv4 or IPv6 address of +their own, but they borrow one from another interface. If you're curious, you can take a look at the +[[User Guide](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md#interfaces)] on +GitHub. + +Looking at their `vppcfg` files, the change is actually very easy, taking as an example the +configuration file for `chbtl0.ipng.ch`: + +``` +loopbacks: + loop0: + description: 'Core: chbtl1.ipng.ch' + addresses: ['194.1.163.3/32', '2001:678:d78::3/128'] + lcp: loop0 + mtu: 9000 +interfaces: + TenGigabitEthernet6/0/0: + device-type: dpdk + description: 'Core: chbtl1.ipng.ch' + mtu: 9000 + lcp: xe1-0 +# addresses: [ '194.1.163.20/31', '2001:678:d78::2:5:1/112' ] + unnumbered: loop0 +``` + +By commenting out the `addresses` field, and replacing it with `unnumbered: loop0`, I instruct +vppcfg to make Te6/0/0, which in Linux is called `xe1-0`, borrow its addresses from the loopback +interface `loop0`. + +{{< image width="100px" float="left" src="/assets/freebsd-vpp/brain.png" alt="brain" >}} + +Planning and applying this is straight forward, but there's one detail I should +mention. In my [[previous article]({% post_url 2024-04-06-vpp-ospf %})] I asked myself a question: +would it be better to leave the addresses unconfigured in Linux, or would it be better to make the +Linux Control Plane plugin carry forward the borrowed addresses? In the end, I decided to _not_ copy +them forward. VPP will be aware of the addresses, but Linux will only carry them on the `loop0` +interface. + +In the article, you'll see that discussed as _Solution 2_, and it includes a bit of rationale why I +find this better. I implemented it in this +[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in +case you're curious, and the commandline keyword is `lcp lcp-sync-unnumbered off` (the default is +_on_). 
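+
+In case you want to flip the same switch on your own router: the `lcp lcp-sync-unnumbered off`
+keyword from that commit is the only part I'm asserting here; running it through `vppctl` on
+`chbtl0`, as sketched below, is simply how I'd expect to issue it interactively (it can equally
+well be scripted at startup):
+
+```
+pim@chbtl0:~$ vppctl lcp lcp-sync-unnumbered off
+```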
+ +``` +pim@chbtl0:~$ vppcfg plan -c /etc/vpp/vppcfg.yaml +[INFO ] root.main: Loading configfile /etc/vpp/vppcfg.yaml +[INFO ] vppcfg.config.valid_config: Configuration validated successfully +[INFO ] root.main: Configuration is valid +[INFO ] vppcfg.vppapi.connect: VPP version is 24.06-rc0~183-gb0d433978 +comment { vppcfg prune: 2 CLI statement(s) follow } +set interface ip address del TenGigabitEthernet6/0/0 194.1.163.20/31 +set interface ip address del TenGigabitEthernet6/0/0 2001:678:d78::2:5:1/112 +comment { vppcfg sync: 1 CLI statement(s) follow } +set interface unnumbered TenGigabitEthernet6/0/0 use loop0 +[INFO ] vppcfg.reconciler.write: Wrote 5 lines to (stdout) +[INFO ] root.main: Planning succeeded + +pim@chbtl0:~$ vppcfg show int addr TenGigabitEthernet6/0/0 +TenGigabitEthernet6/0/0 (up): + unnumbered, use loop0 + L3 194.1.163.3/32 + L3 2001:678:d78::3/128 + +pim@chbtl0:~$ vppctl show lcp | grep TenGigabitEthernet6/0/0 +itf-pair: [9] TenGigabitEthernet6/0/0 tap9 xe1-0 65 type tap netns dataplane + +pim@chbtl0:~$ ip -br a | grep UP +xe0-0 UP fe80::92e2:baff:fe3f:cad4/64 +xe0-1 UP fe80::92e2:baff:fe3f:cad5/64 +xe0-1.400@xe0-1 UP fe80::92e2:baff:fe3f:cad4/64 +xe0-1.400.10@xe0-1.400 UP 194.1.163.16/31 2001:678:d78:2:3:1/112 fe80::92e2:baff:fe3f:cad4/64 +xe1-0 UP fe80::21b:21ff:fe55:1dbc/64 +xe1-1.101@xe1-1 UP 194.1.163.65/27 2001:678:d78:3::1/64 fe80::14b4:c6ff:fe1e:68a3/64 +xe1-1.179@xe1-1 UP 45.129.224.236/29 2a0e:5040:0:2::236/64 fe80::92e2:baff:fe3f:cad5/64 +``` + +After applying this configuration, I can see that Te6/0/0 indeed is _unnumbered, use loop0_ noting +the IPv4 and IPv6 addresses that it borrowed. I can see with the second command that Te6/0/0 +corresponds in Linux with `xe1-0`, and finally with the third command I can list the addresses of +the Linux view, and indeed I confirm that `xe1-0` only has a link local address. Slick! + +After applying this change, the OSPFv2 adjacency in the `ospf4_old` protocol expires, and I see the +routing table converge. A traceroute between `chbtl0` and `chbtl1` now takes a bit of a detour: + +``` +pim@chbtl0:~$ traceroute chbtl1.ipng.ch +traceroute to chbtl1 (194.1.163.4), 30 hops max, 60 byte packets + 1 chrma0.ipng.ch (194.1.163.17) 0.981 ms 0.969 ms 0.953 ms + 2 chgtg0.ipng.ch (194.1.163.9) 1.194 ms 1.192 ms 1.176 ms + 3 chbtl1.ipng.ch (194.1.163.4) 1.875 ms 1.866 ms 1.911 ms +``` + +I can now introduce the very first OSPFv3 adjacency for IPv4, and I do this by moving the neighbor +from the `ospf4_old` protocol to the `ospf4` prototol. Of course, I also update chbtl1 with the +_unnumbered_ interface on its `xe1-0`, and update OSPF there. And with that, something magical +happens: + +``` +pim@chbtl0:~$ birdc show ospf nei +BIRD v2.15.1-4-g280daed5-x ready. +ospf4_old: +Router ID Pri State DTime Interface Router IP +194.1.163.0 1 Full/PtP 30.571 xe0-1.400.10 fe80::266e:96ff:fe37:934c + +ospf4: +Router ID Pri State DTime Interface Router IP +194.1.163.4 1 Full/PtP 31.955 xe1-0 fe80::9e69:b4ff:fe61:ff18 + +ospf6: +Router ID Pri State DTime Interface Router IP +194.1.163.4 1 Full/PtP 31.955 xe1-0 fe80::9e69:b4ff:fe61:ff18 +194.1.163.0 1 Full/PtP 30.571 xe0-1.400.10 fe80::266e:96ff:fe37:934c + +pim@chbtl0:~$ birdc show route protocol ospf4 +BIRD v2.15.1-4-g280daed5-x ready. 
+Table master4: +194.1.163.4/32 unicast [ospf4 2024-05-19 20:58:04] * I (150/2) [194.1.163.4] + via 194.1.163.4 on xe1-0 onlink +194.1.163.64/27 unicast [ospf4 2024-05-19 20:58:04] E2 (150/2/10000) [194.1.163.4] + via 194.1.163.4 on xe1-0 onlink +``` + +Aww, would you look at that! Especially the first entry is interesting to me. It says that this +router has learned the address `194.1.163.4/32`, the loopback address of `chbtl1` via nexthop +**also** `194.1.163.4` on interface `xe1-0` with a flag _onlink_. + +The kernel routing table agrees with this construction: + +``` +pim@chbtl0:~$ ip ro get 194.1.163.4 +194.1.163.4 via 194.1.163.4 dev xe1-0 src 194.1.163.3 uid 1000 + cache +``` + +Now, what this construction tells the kernel, is that it should ARP for `194.1.163.4` using local +address `194.1.163.3`, for which VPP on the other side will respond, thanks to my [[VPP ARP +gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. As such, I should expect now a FIB entry for VPP: + +``` +pim@chbtl0:~$ vppctl show ip fib 194.1.163.4 +ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ] +194.1.163.4/32 fib:0 index:973099 locks:3 + lcp-rt-dynamic refs:1 src-flags:added,contributing,active, + path-list:[189] locks:98 flags:shared,popular, uPRF-list:507 len:1 itfs:[36, ] + path:[166] pl-index:189 ip4 weight=1 pref=32 attached-nexthop: oper-flags:resolved, + 194.1.163.4 TenGigabitEthernet6/0/0 + [@0]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800 + + adjacency refs:1 entry-flags:attached, src-flags:added, cover:-1 + path-list:[1025] locks:1 uPRF-list:1521 len:1 itfs:[36, ] + path:[379] pl-index:1025 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, + 194.1.163.4 TenGigabitEthernet6/0/0 + [@0]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800 + Extensions: + path:379 + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:848961 buckets:1 uRPF:507 to:[1966944:611861009]] + [0] [@5]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800 +``` + +Nice work, VPP and Bird2! I confirm that I can ping the neighbor again, and that the traceroute is +direct rather than the scenic route from before, and I validate that IPv6 still works for good +measure: + +``` +pim@chbtl0:~$ ping -4 chbtl1.ipng.ch +PING 194.1.163.4 (194.1.163.4) 56(84) bytes of data. 
+64 bytes from 194.1.163.4: icmp_seq=1 ttl=63 time=0.169 ms +64 bytes from 194.1.163.4: icmp_seq=2 ttl=63 time=0.283 ms +64 bytes from 194.1.163.4: icmp_seq=3 ttl=63 time=0.232 ms +64 bytes from 194.1.163.4: icmp_seq=4 ttl=63 time=0.271 ms +^C +--- 194.1.163.4 ping statistics --- +4 packets transmitted, 4 received, 0% packet loss, time 3003ms +rtt min/avg/max/mdev = 0.163/0.233/0.276/0.045 ms + +pim@chbtl0:~$ traceroute chbtl1.ipng.ch +traceroute to chbtl1 (194.1.163.4), 30 hops max, 60 byte packets + 1 chbtl1.ipng.ch (194.1.163.4) 0.190 ms 0.176 ms 0.147 ms + +pim@chbtl0:~$ ping6 chbtl1.ipng.ch +PING chbtl1.ipng.ch(chbtl1.ipng.ch (2001:678:d78::4)) 56 data bytes +64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=1 ttl=64 time=0.205 ms +64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=2 ttl=64 time=0.203 ms +64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=3 ttl=64 time=0.213 ms +64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=4 ttl=64 time=0.219 ms +^C +--- chbtl1.ipng.ch ping statistics --- +4 packets transmitted, 4 received, 0% packet loss, time 3068ms +rtt min/avg/max/mdev = 0.203/0.210/0.219/0.006 ms + +pim@chbtl0:~$ traceroute6 chbtl1.ipng.ch +traceroute to chbtl1.ipng.ch (2001:678:d78::4), 30 hops max, 80 byte packets + 1 chbtl1.ipng.ch (2001:678:d78::4) 0.163 ms 0.147 ms 0.124 ms +``` + +### 4. From one to two + +{{< image width="300px" float="right" src="/assets/vpp-ospf/OSPFv3 Rollout Step 4.svg" alt="Step 4: Canary" >}} + +At this point I have two IPv4 IGPs running. This is not ideal, but it's also not completely broken, +because the OSPF filter allows the routers to learn and propagate any more specific prefix from +`194.1.163.0/24`. This way, the legacy OSPFv2 called `ospf4_old` and this new OSPFv3 called `ospf4` +will be aware of all routes. Bird will learn them twice, and routing decisions may be a bit funky +because the OSPF protocols learn the routes from each other as OSPF-E2. There are two implications +of this: + +1. It means that the routes that are learned from the other OSPF protocol will have a fixed metric + (==cost), and for the time being, I won't be able to cleanly add up link costs between the + routers that are speaking OSPFv2 and those that are speaking OSPFv3. + +1. If an OSPF External Type E1 and Type E2 route exist to the same destination the E1 route will + always be preferred irrespective of the metric. This means that within the routers that speak + OSPFv2, cost will remain consistent; and also within the routers that speak OSPFv3, it will be + consistent. Between them, routes will be learned, but cost will be roughly meaningless. + +I upgrade another link, between router `chgtg0` and `ddln0` at my [[colo]({% post_url +2022-02-24-colo %})], which is connected via a 10G EoMPLS link from a local telco called Solnet. The +colo, similar to IPng's office, has two redundant 10G uplinks, so if things were to fall apart, I +can always quickly shutdown the offending link (thereby removing OSPFv3 adjacencies), and traffic +will reroute. I have created two islands of OSPFv3, drawn in orange, with exactly two links using IPv4-less point to +point networks. I let this run for a few weeks, to make sure things do not fail in mysterious ways. + +### 5. From two to many + +{{< image width="300px" float="right" src="/assets/vpp-ospf/OSPFv3 Rollout Step 5.svg" alt="Step 5: Zurich" >}} + +From this point on it's just rinse-and-repeat. For each backbone link, I will: + +1. 
I will drain the backbone link I'm about to work on, by raising OSPFv2 and OSPFv3 cost on both
+   sides. If the cost was, say, 56, I will temporarily make that 1056. This will make traffic
+   avoid using the link if at all possible. Due to redundancy, every router has (at least) two
+   backbone links. Traffic will be diverted.
+1. I first change the VPP router's `vppcfg.yaml` to remove the p2p addresses and replace them with
+   an `unnumbered: loop0` instead. I apply the diff, and the OSPF adjacency breaks for IPv4.
+   The BFD adjacency for IPv4 will disappear. Curiously, the IPv6 adjacency stays up, because
+   OSPFv3 adjacencies use link-local addresses.
+1. I move the interface section of the old OSPFv2 `ospf4_old` protocol to the new OSPFv3
+   `ospf4` protocol, which will also use link-local addresses to form adjacencies. The two routers
+   will exchange Link LSA and be able to find each other directly connected. Now the link is
+   running **two** OSPFv3 protocols, each in its own address family. They will share the same BFD
+   session.
+1. I finally undrain the link by setting the OSPF link cost back to what it was. This link is now
+   part of the OSPFv3 portion of the network.
+
+I work my way through the network. The first one I do is the link between `chgtg0` and `chbtl1`
+(which I've colored in the diagram in pink), so
+that there are four contiguous OSPFv3 links, spanning from chbtl0 - chbtl1 - chgtg0 - ddln0. I
+constantly do a traceroute to a machine that is directly connected behind ddln0, and also use
+RIPE Atlas and the NLNOG Ring to ensure that I have reachability:
+
+```
+pim@squanchy:~$ traceroute ipng.mm.fcix.net
+traceroute to ipng.mm.fcix.net (194.1.163.59), 64 hops max, 40 byte packets
+ 1 chbtl0 (194.1.163.65) 0.279 ms 0.362 ms 0.249 ms
+ 2 chbtl1 (194.1.163.3) 0.455 ms 0.394 ms 0.384 ms
+ 3 chgtg0 (194.1.163.1) 1.302 ms 1.296 ms 1.294 ms
+ 4 ddln0 (194.1.163.5) 2.232 ms 2.385 ms 2.322 ms
+ 5 mm0.ddln0.ipng.ch (194.1.163.59) 2.377 ms 2.577 ms 2.364 ms
+```
+
+I work my way outwards from there. First completing the ring chbtl0 - chrma0 - chgtg0 - chbtl1, and
+then completing the ring ddln0 - ddln1 - chrma0 - chgtg0, after which the Zurich metro area is
+converted. I then work my way clockwise from Zurich to Geneva, Paris, Lille, Amsterdam, Frankfurt,
+and end up with the last link completing the set: defra0 - chrma0.
+
+## Results
+
+{{< image width="300px" float="right" src="/assets/vpp-ospf/OSPFv3 Rollout After.svg" alt="OSPFv3: After" >}}
+
+In total I reconfigure thirteen backbone links, and they all become _unnumbered_ using the router's
+loopback addresses for IPv4 and IPv6, and they all switch over from their OSPFv2 IGP to the new
+OSPFv3 IGP; the total number of routers running the old IGP shrinks until there are none left. Once
+that happens, I can simply remove the OSPFv2 protocol called `ospf4_old`, and keep the two OSPFv3
+protocols now intuitively called `ospf4` and `ospf6`. Nice.
+
+This maintenance isn't super intrusive. For IPng's customers, latency goes up from time to time as
+backbone links are drained, the link is reconfigured to become unnumbered and OSPFv3, and put back
+into service. The whole operation takes a few hours, and I enjoy the repetitive tasks, getting
+pretty good at the drain-reconfigure-undrain cycle after a while.
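+
+To make the 'reconfigure' step a bit more concrete: once a link has been moved over, its interface
+stanza simply lives in the `ospf4` (OSPFv3) protocol instead of in `ospf4_old`. Reusing the `defra0`
+example from earlier as a sketch -- same filter, same interface names and costs, only the protocol
+they sit in changes -- the end state would look roughly like this:
+
+```
+protocol ospf v3 ospf4 {
+  ipv4 { export filter f_ospf; import filter f_ospf; };
+  area 0 {
+    interface "loop0","lo" { stub yes; };
+    interface "xe1-1.302" { type pointopoint; cost 61; bfd on; };
+    interface "xe1-0.304" { type pointopoint; cost 56; bfd on; };
+  };
+}
+```
+
+The `ospf4_old` stanza loses the same two interface lines, and once the last interface is gone from
+it, the whole protocol can be deleted.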
+ +It looks really cool on transit routers, like this one in Lille, France: + + +``` +pim@frggh0:~$ ip -br a | grep UP +loop0 UP 194.1.163.10/32 2001:678:d78::a/128 fe80::dcad:ff:fe00:0/64 +xe0-0 UP 193.34.197.143/25 2001:7f8:6d::8298:1/64 fe80::3eec:efff:fe70:24a/64 +xe0-1 UP fe80::3eec:efff:fe70:24b/64 +xe1-0 UP fe80::6a05:caff:fe32:45ac/64 +xe1-1 UP fe80::6a05:caff:fe32:45ad/64 +xe1-2 UP fe80::6a05:caff:fe32:45ae/64 +xe1-2.100@xe1-2 UP fe80::6a05:caff:fe32:45ae/64 +xe1-2.200@xe1-2 UP fe80::6a05:caff:fe32:45ae/64 +xe1-2.391@xe1-2 UP 46.20.247.3/29 2a02:2528:ff03::3/64 fe80::6a05:caff:fe32:45ae/64 +xe0-1.100@xe0-1 UP 194.1.163.137/29 2001:678:d78:6::1/64 fe80::3eec:efff:fe70:24b/64 + +pim@frggh0:~$ birdc show bfd ses +BIRD v2.15.1-4-g280daed5-x ready. +bfd1: +IP address Interface State Since Interval Timeout +fe80::3eec:efff:fe46:68a9 xe1-2.200 Up 2024-06-19 20:16:58 0.100 3.000 +fe80::6a05:caff:fe32:3e38 xe1-2.100 Up 2024-06-19 20:13:11 0.100 3.000 + +pim@frggh0:~$ birdc show ospf nei +BIRD v2.15.1-4-g280daed5-x ready. +ospf4: +Router ID Pri State DTime Interface Router IP +194.1.163.9 1 Full/PtP 34.947 xe1-2.100 fe80::6a05:caff:fe32:3e38 +194.1.163.8 1 Full/PtP 31.940 xe1-2.200 fe80::3eec:efff:fe46:68a9 + +ospf6: +Router ID Pri State DTime Interface Router IP +194.1.163.9 1 Full/PtP 34.947 xe1-2.100 fe80::6a05:caff:fe32:3e38 +194.1.163.8 1 Full/PtP 31.940 xe1-2.200 fe80::3eec:efff:fe46:68a9 +``` + +You can see here that the router indeed has an IPv4 loopback address 194.1.163.10/32, and +2001:678:d78::a/128. It has two backbone links, on `xe1-2.100` towards Paris and `xe1-2.200` +towards Amsterdam. Judging by the time between the BFD sessions, it took me somewhere around four +minutes to drain, reconfigure, and undrain each link. I kept on listening to Nora en Pure's +[[Episode #408](https://www.youtube.com/watch?v=AzfCrOEW7e8)] the whole time. + +### A traceroute + +The beauty of this solution is that the routers will still have one IPv4 and IPv6 address, from +their `loop0` interface. The VPP dataplane will use this when generating ICMP error messages, for +example in a traceroute. It will look quite normal: + +``` +pim@squanchy:~/src/ipng.ch$ traceroute bit.nl +traceroute to bit.nl (213.136.12.97), 30 hops max, 60 byte packets + 1 chbtl0.ipng.ch (194.1.163.65) 0.366 ms 0.408 ms 0.393 ms + 2 chrma0.ipng.ch (194.1.163.0) 1.219 ms 1.252 ms 1.180 ms + 3 defra0.ipng.ch (194.1.163.7) 6.943 ms 6.887 ms 6.922 ms + 4 nlams0.ipng.ch (194.1.163.8) 12.882 ms 12.835 ms 12.910 ms + 5 as12859.frys-ix.net (185.1.203.186) 14.028 ms 14.160 ms 14.436 ms + 6 http-bit-ev-new.lb.network.bit.nl (213.136.12.97) 14.098 ms 14.671 ms 14.965 ms + +pim@squanchy:~$ traceroute6 bit.nl +traceroute6 to bit.nl (2001:7b8:3:5::80:19), 64 hops max, 60 byte packets + 1 chbtl0.ipng.ch (2001:678:d78:3::1) 0.871 ms 0.373 ms 0.304 ms + 2 chrma0.ipng.ch (2001:678:d78::) 1.418 ms 1.387 ms 1.764 ms + 3 defra0.ipng.ch (2001:678:d78::7) 6.974 ms 6.877 ms 6.912 ms + 4 nlams0.ipng.ch (2001:678:d78::8) 13.023 ms 13.014 ms 13.013 ms + 5 as12859.frys-ix.net (2001:7f8:10f::323b:186) 14.322 ms 14.181 ms 14.827 ms + 6 http-bit-ev-new.lb.network.bit.nl (2001:7b8:3:5::80:19) 14.176 ms 14.24 ms 14.093 ms +``` + +The only difference from before is that now, these traceroute hops are from the loopback addresses, +not the P2P transit links (eg the second hop, through `chrma0` is now 194.1.163.0 and 2001:678:d78:: +respectively, where before that would have been 194.1.163.17 and 2001:678:d78::2:3:2 respectively. +Subtle, but super dope. 
+ +### Link Flap Test + +The proof is in the pudding, they say. After all of this link draining, reconfiguring and undraining, +I gain confidence that this stuff actually works as advertised! I thought it'd be a nice touch to +demonstrate a link drain, between Frankfurt and Amsterdam. I recorded a little asciinema +[[screencast](/assets/vpp-ospf/rollout.cast)], shown here: + +{{< image src="/assets/vpp-ospf/rollout.gif" alt="Asciinema" >}} + +### Returning IPv4 (and IPv6!) addresses + +Now that the backbone links no longer carry global unicast addresses, and they borrow from the one +IPv4 and IPv6 address in `loop0`, I can return a whole stack of addresses: + +{{< image src="/assets/vpp-ospf/roi.png" alt="ROI" >}} + +In total, I returned 34 IPv4 addresses from IPng's /24, which is 13.3%. This is huge, and I'm +confident that I will find a better use for these little addresses than being pointless +point-to-point links! diff --git a/content/articles/2024-06-29-coloclue-ipng.md b/content/articles/2024-06-29-coloclue-ipng.md new file mode 100644 index 0000000..aaa3dec --- /dev/null +++ b/content/articles/2024-06-29-coloclue-ipng.md @@ -0,0 +1,617 @@ +--- +date: "2024-06-29T06:31:00Z" +title: 'Case Study: IPng at Coloclue' +--- + +{{< image width="250px" float="right" src="/assets/coloclue-vpp/coloclue_logo2.png" alt="Coloclue" >}} + +I have been a member of the Coloclue association in Amsterdam for a long time. This is a networking +association in the social and technical sense of the word. [[Coloclue](https://coloclue.net)] is +based in Amsterdam with members throughout the Netherlands and Europe. Its goals are to facilitate +learning about and operating IP based networks and services. It has about 225 members who, together, +have built this network and deployed about 135 servers across 8 racks in 3 datacenters (Qupra, +EUNetworks and NIKHEF). Coloclue is operating [[AS8283](https://as8283.peeringdb.net/)] across +several local and international internet exchange points. + +A small while ago, one of our members, Sebas, shared their setup with the membership. It generated a +bit of a show-and-tell response, with Sebas and other folks on our mailinglist curious as to how we +all deployed our stuff. My buddy Tim pinged me on Telegram: "This is something you should share for +IPng as well!", so this article is a bit different than my usual dabbles. It will be more of a show +and tell: how did I deploy and configure the _Amsterdam Chapter_ of IPng Networks? + +I'll make this article a bit more picture-dense, to show the look-and-feel of the equipment. + +### Network + +{{< image width="350px" float="right" src="/assets/coloclue-ipng/MPLS-Amsterdam.svg" alt="MPLS Ring" >}} + +One thing that Coloclue and IPng Networks have in common is that we are networking clubs :) And +readers of my articles may well know that I do so very much like writing about networking. During +the Corona Pandemic, my buddy Fred asked "Hey you have this PI /24, why don't you just announce it +yourself? It'll be fun!" -- and after resisting for a while, I finally decided to go for it. Fred +owns a Swiss ISP called [[IP-Max](https://ip-max.net/)] and he was about to expand into Amsterdam, +and in an epic roadtrip we deployed a point of presence for IPng Networks in each site where IP-Max +has a point of presence. + +In Amsterdam, I introduced Fred to Arend Brouwer from [[ERITAP](https://eritap.com/)], and we +deployed our stuff in a brand new rack he had acquired at NIKHEF. 
It was fun to be in an AirBnB,
+drive over to NIKHEF, and together with Arend move into this completely new and empty rack in one
+of the most iconic internet places on the planet. I am deeply grateful for the opportunity.
+
+{{< image width="350px" float="right" src="/assets/coloclue-ipng/staging-ams01.png" alt="Staging" >}}
+
+For IP-Max, this deployment means a Nexus 3068PQ switch and an ASR9001 router, with one 10G
+wavelength towards Newtelco in Frankfurt, Germany, and another 10G wavelength towards ETIX in Lille,
+France. For IPng it means a Centec S5612X switch and a Supermicro router. To the right you'll see
+the network as it was deployed during that roadtrip - a ring of sites from Zurich, Frankfurt,
+Amsterdam, Lille, Paris, Geneva and back to Zurich. They are all identical in terms of hardware.
+Pictured to the right is our _staging_ environment, in that AirBnB in Amsterdam: Fred's Nexus and
+Cisco ASR9k, two of my Supermicro routers, and an APU which is used as an OOB access point.
+
+### Hardware
+
+Considering Coloclue is a computer and network association, lots of folks are interested in the
+physical bits. I'll take some time to detail the hardware that I use for my network, focusing
+specifically on the Amsterdam sites.
+
+#### Switches
+
+{{< image width="350px" float="right" src="/assets/coloclue-ipng/centec-stack.png" alt="Centec" >}}
+
+My switches are from a brand called Centec. They make their own switch silicon, and are known for
+being very power efficient and offering an affordable cost per port. What's important for me is that
+the switches offer MPLS, VPLS and L2VPN, as well as VxLAN, GENEVE and GRE functionality, all in hardware.
+
+Pictured on the right you can see the main Centec switch types that I use (in the red boxes above
+called _IPng Site Local_):
+* Centec S5612X: 8x1G RJ45, 8x1G SFP, 12x10G SFP+, 2x40G QSFP+ and 8x25G SFP28
+* Centec S5548X: 48x1G RJ45, 2x40G QSFP+ and 4x25G SFP28
+* Centec S5624X: 24x10G SFP+ and 2x100G QSFP28
+
+There are also bigger variants, such as the S7548N-8Z switch (48x25G, 8x100G, which is delicious),
+or S7800-32Z (32x100G, which is possibly even more yummy). Overall, I have very good experiences
+with these switches and the vendor ecosystem around them (optics, patch cables, WDM muxes, and so
+on).
+
+Sandwiched between the switches, you'll see some black Supermicro machines with 6x1G, 2x10G SFP+
+and 2x25G SFP28. I'll detail them below; they are IPng's default choice for low-power routers, as
+fully loaded they consume about 45W, and can forward 65Gbps and around 35Mpps or so, enough to
+kill a horse. And definitely enough for IPng!
+
+{{< image width="350px" float="right" src="/assets/coloclue-ipng/msw0.nlams0.png" alt="msw0.nlams0" >}}
+
+A photo album with a few pictures of the Centec switches, including their innards, lives
+[[here](https://photos.app.goo.gl/Mxzs38p355Bo4qZB6)]. In Amsterdam, I have one S5624X, which
+connects with 3x10G to IP-Max (one link to Lille, another to Frankfurt, and the third for local
+services off of IP-Max's ASR9001). Pictured right is that Centec S5624X MPLS switch at NIKHEF,
+called `msw0.nlams0.net.ipng.ch`, which is where my Amsterdam story really begins: _"Goed zat voor
+doordeweeks!"_
+
+#### Router
+
+{{< image width="350px" float="right" src="/assets/coloclue-ipng/nlams0-inside.png" alt="nlams0 innards" >}}
+
+The European ring that I built on my roadtrip with Fred consists of identical routers in each
+location. I was looking for a machine that had competent out of band operations with IPMI or iDRAC,
+could carry 32GB of ECC memory, had at least 4C/8T, and as many 10Gig ports as could realistically
+fit. I settled on the Supermicro 5018D-FN8T
+[[ref](https://www.supermicro.com/en/products/system/1u/5018/sys-5018d-fn8t.cfm)], because of its
+relatively low power CPU (an Intel Xeon D-1518 at 35W TDP), its ability to boot off of NVME or mSATA,
+and its PCIe v3.0 x8 expansion port, which carries an additional Intel X710-DA4 quad-tengig card.
+
+I've loadtested these routers extensively while I was working on the Linux Control Plane in VPP, and
+I can sustain full port speeds on all six TenGig ports, to a maximum of roughly 35Mpps of IPv4, IPv6
+or MPLS traffic. Considering the machine, when fully loaded, will draw about 45 Watts, this is a
+very affordable and power efficient router. I love them!
+
+{{< image width="350px" float="right" src="/assets/coloclue-ipng/nlams0.png" alt="nlams0" >}}
+
+The only thing I'd change is to add a second power supply. I personally have never had a PSU fail in
+any of the Supermicro routers I operate, but sometimes datacenters do need to take a feed offline,
+and that's unfortunate if it causes an interruption. If I were to do it again, I would go for dual PSU,
+but I can't complain either, as my router in NIKHEF has been running since 2021 without any power
+issues.
+
+I'm a huge fan of Supermicro's IPMI design based on the ASpeed AST2400 BMC; it supports almost all
+IPMI features, notably serial-over-LAN, remote power off/cycle, remote KVM over HTML5, remote USB
+disk mount, remote install of operating systems, and requires no download of client software. It all
+just works in Firefox or Chrome or Safari -- and I've even reinstalled several routers remotely, as
+I described in my article [[Debian on IPng's VPP routers]({% post_url 2023-12-17-defra0-debian %})].
+There's just something magical about remote-mounting a Debian Bookworm iso image from my workstation
+in Brüttisellen, Switzerland, in a router running in Amsterdam, to then proceed to use KVM over
+HTML5 to reinstall the whole thing remotely. We didn't have that, growing up!!
+
+#### Hypervisors
+
+{{< image width="350px" float="right" src="/assets/coloclue-ipng/hvn0.nlams1.png" alt="hvn0.nlams1" >}}
+
+I host two machines at Coloclue. I started off way back in 2010 or so with one Dell R210-II. That
+machine still runs today, albeit in the Telehouse2 datacenter in Paris. At the end of 2022, I made a
+trip to Amsterdam to deploy three identical machines, all reasonably spec'd Dell PowerEdge R630
+servers:
+
+1. 8x32GB (256GB) Registered/Buffered (ECC) DDR4
+1. 2x Intel Xeon E5-2696 v4 (88 CPU threads in total)
+1. 1x LSI SAS3008 SAS12 controller
+1. 2x 500G Crucial MX500 (TLC) SSD
+1. 3x 3840G Seagate ST3840FM0003 (MLC) SAS12
+1. Dell rNDC 2x Intel I350 (RJ45), 2x Intel X520 (10G SFP+)
+
+{{< image width="350px" float="right" src="/assets/coloclue-ipng/R630-S5612X.png" alt="Dell+Centec" >}}
+
+Once you go to enterprise storage, you will never want to go back. I take specific care to buy
+redundant boot drives, mostly Crucial MX500 because it's TLC flash, and a bit more reliable. However,
+the MLC flash from these Seagate and HPE SAS-3 drives (12Gbps bus speeds) is next level. The Seagate
+3.84TB drives are in a RAIDZ1 together, and read/write over them is a sustained 2.6GB/s and roughly
+380Kops/sec per drive. It really makes the VMs on the hypervisor fly -- and at the same time has
+much, much, much better durability and lifetime. Before I switched to enterprise storage, I would
+physically wear out a Samsung consumer SSD in about 12-15mo, and reads/writes would become
+unbearably slow over time. With these MLC based drives: no such problem. _Ga hard!_
+
+All hypervisors run Debian Bookworm and have a dedicated iDRAC enterprise port + license. I find the
+Supermicro IPMI a little bit easier to work with, but the basic features are supported on the Dell
+as well: serial-over-LAN (which comes in super handy at Coloclue), remote power on/off/cycle, power
+metering (using `ipmitool sensors`), and a clunky KVM over HTML if need be.
+
+### Coloclue: Routing
+
+Let's dive into the Coloclue deployment. Here's an overview picture, with three colors:
+blue for Coloclue's network components,
+red for IPng's internal network, and
+orange for IPng's public network services.
+
+{{< image src="/assets/coloclue-ipng/AS8298-Amsterdam.svg" alt="Amsterdam" >}}
+
+{{< image width="350px" float="right" src="/assets/coloclue-ipng/nikhef.png" alt="NIKHEF" >}}
+
+Coloclue currently has three main locations: Qupra, EUNetworks and NIKHEF. I've drawn the Coloclue
+network in blue. It's pretty impressive, with
+a 10G wave between each of the locations. Within the two primary colocation sites (Qupra and
+EUNetworks), there are two core switches from Arista, which connect to top-of-rack switches from
+FS.com in an MLAG configuration. This means that each TOR is connected redundantly to both core
+switches with 10G. The switch in NIKHEF connects to a set of local internet exchanges, as well as to
+IP-Max, who deliver a bunch of remote IXPs to Coloclue, notably DE-CIX, FranceIX, and SwissIX. It is
+in NIKHEF where `nikhef-core-1.switch.nl.coloclue.net` (colored blue) connects to my `msw0.nlams0.net.ipng.ch`
+switch (in red). IPng Networks' European backbone
+then connects from here to Frankfurt and southbound onwards to Zurich, but it also connects from
+here to Lille and southbound onwards to Paris.
+
+In the picture to the right you can see ERITAP's rack R181 in NIKHEF, when it was ... younger. It did
+not take long for many folks to request many cross connects (I myself already have a dozen or so,
+and I'm only one customer in this rack!)
+
+One of the advantages of being a launching customer is that I got to see the rack when it was mostly
+empty. Here we can see Coloclue's switch at the top (with the white flat ribbon RJ45 being my
+interconnect with it). Then there are two PC Engines APUs, which are IP-Max and IPng's OOB serial
+machines. Then comes the ASR9001 called `er01.zrh56.ip-max.net` and under it the Nexus switch that
+IP-Max uses for its local customers (including Coloclue and IPng!).
+
+My main router is all the way at the bottom of the picture, called `nlams0.ipng.ch`, one of those
+Supermicro D-1518 machines. It is connected with 2x10G in a LAG to the MPLS switch, and then to most
+of the available internet exchanges in NIKHEF. I also have two transit providers in Amsterdam:
+IP-Max (10Gbit) and A2B Internet (10Gbit).
+
+```
+pim@nlams0:~$ birdc show route count
+BIRD v2.15.1-4-g280daed5-x ready.
+10735329 of 10735329 routes for 969311 networks in table master4 +2469998 of 2469998 routes for 203707 networks in table master6 +1852412 of 1852412 routes for 463103 networks in table t_roa4 +438100 of 438100 routes for 109525 networks in table t_roa6 +Total: 15495839 of 15495839 routes for 1745646 networks in 4 tables +``` + +With the RIB at over 15M entries, I would say this site is very well connected! + +### Coloclue: Hypervisors + +I have one Dell R630 at Qupra (`hvn0.nlams1.net.ipng.ch`), one at EUNetworks (`hvn0.nlams2.net.ipng.ch`), +and a third one with ERITAP at Equinix AM3 (`hvn0.nlams3.net.ipng.ch`). That last one is connected +with a 10Gbit wavelength to IPng's switch `msw0.nlams0.net.ipng.ch`, and another 10Gbit port to +FrysIX. + +{{< image width="350px" float="right" src="/assets/coloclue-ipng/qupra.png" alt="EUNetworks" >}} + +Arend and I run a small internet exchange called [[FrysIX](https://ixpmanager.frys-ix.net/)]. I +supply most of the services from that third hypervisor, which has a 10G connection to the local +FrysIX switch at Equinix AM3. More recently, it became possible to request cross connects at Qupra, so +I've put in a request to connect my hypervisor there to FrysIX with 10G as well - this will not be +for peering purposes, but to be able to redundantly connect things like our routeservers, +ixpmanager, librenms, sflow services, and so on. It's nice to be able to have two hypervisors +available, as it makes maintenance just that much easier. + +{{< image width="350px" float="right" src="/assets/coloclue-ipng/eunetworks.png" alt="EUNetworks" >}} + +Turning my attention to the two hypervisors at Coloclue, one really cool feature that Coloclue +offers, is an L2 VLAN connection from your colo server to the NIKHEF site over our 10G waves between +the datacenters. I requested one of these in each site to NIKHEF using Coloclue's VLAN 402 at Qupra, +and VLAN 412 at EUNetworks. It is over these VLANs that I carry _IPng Site Local_ to the +hypervisors. I showed this in the overview diagram as an orange dashed line. I bridge Coloclue's VLAN 105 (which +is their eBGP VLAN which has loose uRPF filtering on the Coloclue routers) into a Q-in-Q transport +towards NIKHEF. These two links are colored purple from EUnetworks and green from Qupra. Finally, I transport my own colocation +VLAN to each site using another Q-in-Q transport with inner VLAN 100. + +That may seem overly complicated, so let me describe these one by one: + +**1. Colocation Connectivity**: + +I will first create a bridge called `coloclue`, which I'll give an MTU of 1500. I will add to that +the port that connects to the Coloclue TOR switch, called `eno4`. However, I will give that port an +MTU of 9216 as I will support jumbo frames on other VLANs later. + +``` +pim@hvn0-nlams1:~$ sudo ip link add coloclue type bridge +pim@hvn0-nlams1:~$ sudo ip link set coloclue mtu 1500 up +pim@hvn0-nlams1:~$ sudo ip link set eno4 mtu 9216 master coloclue up +pim@hvn0-nlams1:~$ sudo ip addr add 94.142.244.54/24 dev coloclue +pim@hvn0-nlams1:~$ sudo ip addr add 2a02:898::146:1/64 dev coloclue +pim@hvn0-nlams1:~$ sudo ip route add default via 94.142.244.254 +pim@hvn0-nlams1:~$ sudo ip route add default via 2a02:898::1 +``` + +**2. IPng Site Local**: + +All hypervisors at IPng are connected to a private network called _IPng Site Local_ with IPv4 +addresses from `198.19.0.0/16` and IPv6 addresses from `2001:678:d78:500::/56`, both of which are +not routed on the public Internet. 
I will give the hypervisor an address and a route towards IPng +Site local like so: + +``` +pim@hvn0-nlams1:~$ sudo ip link add ipng-sl type bridge +pim@hvn0-nlams1:~$ sudo ip link set ipng-sl mtu 9000 up +pim@hvn0-nlams1:~$ sudo ip link add link eno4 name eno4.402 type vlan id 402 +pim@hvn0-nlams1:~$ sudo ip link set link eno4.402 mtu 9216 master ipng-sl up +pim@hvn0-nlams1:~$ sudo ip addr add 198.19.4.194/27 dev ipng-sl +pim@hvn0-nlams1:~$ sudo ip addr add 2001:678:d78:509::2/64 dev ipng-sl +pim@hvn0-nlams1:~$ sudo ip route add 198.19.0.0/16 via 198.19.4.193 +pim@hvn0-nlams1:~$ sudo ip route add 2001:678:d78:500::/56 via 2001:678:d78:509::1 +``` + +Note the MTU here. While the hypervisor is connected via 1500 bytes to the Coloclue network, it is +connected with 9000 bytes to IPng Site local. On the other side of VLAN 402 lives the Centec switch, +which is configured simply with a VLAN interface: + +``` +interface vlan402 + description Infra: IPng Site Local (Qupra) + mtu 9000 + ip address 198.19.4.193/27 + ipv6 address 2001:678:d78:509::1/64 +! +interface vlan301 + description Core: msw0.defra0.net.ipng.ch + mtu 9028 + label-switching + ip address 198.19.2.13/31 + ipv6 address 2001:678:d78:501::6:2/112 + ip ospf network point-to-point + ip ospf cost 73 + ipv6 ospf network point-to-point + ipv6 ospf cost 73 + ipv6 router ospf area 0 + enable-ldp +! +interface vlan303 + description Core: msw0.frggh0.net.ipng.ch + mtu 9028 + label-switching + ip address 198.19.2.24/31 + ipv6 address 2001:678:d78:501::c:1/112 + ip ospf network point-to-point + ip ospf cost 85 + ipv6 ospf network point-to-point + ipv6 ospf cost 85 + ipv6 router ospf area 0 + enable-ldp +``` + +There are two other interfaces here: `vlan301` towards the MPLS switch in Frankfurt Equinix FR5 and +`vlan303` towards the MPLS switch in Lille ETIX#2. I've configured those to enable OSPF, LDP and +MPLS forwarding. As such, this network with `hvn0.nlams1.net.ipng.ch` becomes a leaf node with a /27 and +/64 in IPng Site Local, in which I can run virtual machines and stuff. + +Traceroutes on this private underlay network are very pretty, using the `net.ipng.ch` domain, and +entirely using silicon-based wirespeed routers with IPv4, IPv6 and MPLS and jumbo frames, never +hitting the public Internet: + +``` +pim@hvn0-nlams1:~$ traceroute6 squanchy.net.ipng.ch 9000 +traceroute to squanchy.net.ipng.ch (2001:678:d78:503::4), 30 hops max, 9000 byte packets + 1 msw0.nlams0.net.ipng.ch (2001:678:d78:509::1) 1.116 ms 1.720 ms 2.369 ms + 2 msw0.defra0.net.ipng.ch (2001:678:d78:501::6:1) 7.804 ms 7.812 ms 7.823 ms + 3 msw0.chrma0.net.ipng.ch (2001:678:d78:501::5:1) 12.839 ms 13.498 ms 14.138 ms + 4 msw1.chrma0.net.ipng.ch (2001:678:d78:501::11:2) 12.686 ms 13.363 ms 13.951 ms + 5 msw0.chbtl0.net.ipng.ch (2001:678:d78:501::1) 13.446 ms 13.523 ms 13.683 ms + 6 squanchy.net.ipng.ch (2001:678:d78:503::4) 12.890 ms 12.751 ms 12.767 ms +``` + +**3. Coloclue BGP uplink**: + +I make use of the IP transit offering of Coloclue. Coloclue has four routers in total: two in +EUNetworks and two in Qupra, which I'll show the configuration for here. I don't take the transit +session on the hypervisor, but rather I forward the traffic Layer2 to my VPP router called +`nlams0.ipng.ch` over VLAN 402 purple and VLAN +412 green VLANs to NIKHEF. 
I'll show the +configuration for Qupra (VLAN 402) first: + +``` +pim@hvn0-nlams1:~$ sudo ip link add coloclue-bgp type bridge +pim@hvn0-nlams1:~$ sudo ip link set coloclue-bgp mtu 1500 up +pim@hvn0-nlams1:~$ sudo ip link add link eno4 name eno4.105 type vlan id 105 +pim@hvn0-nlams1:~$ sudo ip link add link eno4.402 name eno4.402.105 type vlan id 105 +pim@hvn0-nlams1:~$ sudo ip link set eno4.105 mtu 1500 master coloclue-bgp up +pim@hvn0-nlams1:~$ sudo ip link set eno4.402.105 mtu 1500 master coloclue-bgp up +``` + +These VLANs terminate on `msw0.nlams0.net.ipng.ch` where I just offer them directly to the VPP +router: + +``` +interface eth-0-2 + description Infra: nikhef-core-1.switch.nl.coloclue.net e1/34 + switchport mode trunk + switchport trunk allowed vlan add 402,412 + switchport trunk allowed vlan remove 1 + lldp disable +! +interface eth-0-3 + description Infra: nlams0.ipng.ch:Gi8/0/0 + switchport mode trunk + switchport trunk allowed vlan add 402,412 + switchport trunk allowed vlan remove 1 +``` + +**4. IPng Services VLANs**: + +I have one more thing to share. Up until now, the hypervisor has internal connectivity to _IPng Site +Local_, and a single IPv4 / IPv6 address in the shared colocation network. Almost all VMs at IPng +run entirely in IPng Site Local, and will use reversed proxies and other tricks to expose themselves +to the internet. But, I also use a modest amount of IPv4 and IPv6 addresses on the VMs here, for +example for those NGINX reversed proxies [[ref]({% post_url 2023-03-17-ipng-frontends %})], or my +SMTP relays [[ref]({% post_url 2024-05-17-smtp %})]. + +For this purpose, I will need to plumb through some form of colocation VLAN in each site, which +looks very similar to the BGP uplink VLAN I described previously: + +``` +pim@hvn0-nlams1:~$ sudo ip link add ipng type bridge +pim@hvn0-nlams1:~$ sudo ip link set ipng mtu 9000 up +pim@hvn0-nlams1:~$ sudo ip link add link eno4 name eno4.100 type vlan id 100 +pim@hvn0-nlams1:~$ sudo ip link add link eno4.402 name eno4.402.100 type vlan id 100 +pim@hvn0-nlams1:~$ sudo ip link set eno4.100 mtu 9000 master ipng up +pim@hvn0-nlams1:~$ sudo ip link set eno4.402.100 mtu 9000 master ipng up +``` + +Looking at the VPP router, it picks up these two VLANs 402 and 412, which are used for _IPng Site +Local_. On top of those, the router will add two Q-in-Q VLANs: `402.105` will be the BGP uplink, and +Q-in-Q `402.100` will be the IPv4 space assigned to IPng: + +``` +interfaces: + GigabitEthernet8/0/0: + device-type: dpdk + description: 'Infra: msw0.nlams0.ipng.ch:eth-0-3' + lcp: e0-1 + mac: '3c:ec:ef:46:65:97' + mtu: 9216 + sub-interfaces: + 402: + description: 'Infra: VLAN to Qupra' + lcp: e0-0.402 + mtu: 9000 + 412: + description: 'Infra: VLAN to EUNetworks' + lcp: e0-0.412 + mtu: 9000 + 402100: + description: 'Infra: hvn0.nlams1.ipng.ch' + addresses: ['94.142.241.184/32', '2a02:898:146::1/64'] + lcp: e0-0.402.100 + mtu: 9000 + encapsulation: + dot1q: 402 + exact-match: True + inner-dot1q: 100 + 402105: + description: 'Transit: Coloclue (urpf-shared-vlan Qupra)' + addresses: ['185.52.225.34/28', '2a02:898:0:1::146:1/64'] + lcp: e0-0.402.105 + mtu: 1500 + encapsulation: + dot1q: 402 + exact-match: True + inner-dot1q: 105 +``` + +Using BGP, my AS8298 will announce my own prefixes and two /29s that I have assigned to me from +Coloclue. One of them is `94.142.241.184/29` in Qupra, and the other is `94.142.245.80/29` in +EUNetworks. 
But, I don't like wasting IP space, so I assign only the first /32 from that range to +the interface, and use Bird2 to set a route for the other 7 addresses into the interface, which will +allow me to use all eight addresses! + +``` +pim@border0-nlams3:~$ traceroute nginx0.nlams1.ipng.ch +traceroute to nginx0.nlams1.ipng.ch (94.142.241.189), 30 hops max, 60 byte packets + 1 ipmax.nlams0.ipng.ch (46.20.243.177) 1.190 ms 1.102 ms 1.101 ms + 2 speed-ix.coloclue.net (185.1.222.16) 0.448 ms 0.405 ms 0.361 ms + 3 nlams0.ipng.ch (185.52.225.34) 0.461 ms 0.461 ms 0.382 ms + 4 nginx0.nlams1.ipng.ch (94.142.241.189) 1.084 ms 1.042 ms 1.004 ms + +pim@border0-nlams3:~$ traceroute smtp.nlams2.ipng.ch +traceroute to smtp.nlams2.ipng.ch (94.142.245.85), 30 hops max, 60 byte packets + 1 ipmax.nlams0.ipng.ch (46.20.243.177) 2.842 ms 2.743 ms 3.264 ms + 2 speed-ix.coloclue.net (185.1.222.16) 0.383 ms 0.338 ms 0.338 ms + 3 nlams0.ipng.ch (185.52.225.34) 0.372 ms 0.365 ms 0.304 ms + 4 smtp.nlams2.ipng.ch (94.142.245.85) 1.042 ms 1.000 ms 0.959 ms +``` + +### Coloclue: Services + +I run a bunch of services on these hypervisors. Some are for me personally, or for my company IPng +Networks GmbH, and some are for community projects. Let me list a few things here: + +**AS112 Services** \ +I run an anycasted AS112 cluster in all sites where IPng has hypervisor capacity. Notably in +Amsterdam, my nodes are running on both Qupra and EUNetworks, and connect to LSIX, SpeedIX, FogIXP, +FrysIX and behind AS8283 and AS8298. The nodes here handle roughly 5kqps at peak, and if RIPE NCC's +node in Amsterdam goes down, this can go up to 13kqps (right, WEiRD?). I described the setup in an +[[article]({% post_url 2021-06-28-as112 %})]. You may be wondering: how do I get those internet +exchanges backhauled to a VM at Coloclue? The answer is: VxLAN transport! Here's a relevant snippet +from the `nlams0.ipng.ch` router config: + +``` +vxlan_tunnels: + vxlan_tunnel1: + local: 94.142.241.184 + remote: 94.142.241.187 + vni: 11201 + +interfaces: + TenGigabitEthernet4/0/0: + device-type: dpdk + description: 'Infra: msw0.nlams0:eth-0-9' + lcp: xe0-0 + mac: '3c:ec:ef:46:68:a8' + mtu: 9216 + sub-interfaces: + 112: + description: 'Peering: LSIX for AS112' + l2xc: vxlan_tunnel1 + mtu: 1522 + vxlan_tunnel1: + description: 'Infra: AS112 LSIX' + l2xc: TenGigabitEthernet4/0/0.112 + mtu: 1522 +``` + +And the Centec switch config: + +``` +vlan database + vlan 112 name v-lsix-as112 mac learning disable +interface eth-0-5 + description Infra: LSIX AS112 + switchport access vlan 112 +interface eth-0-9 + description Infra: nlams0.ipng.ch:Te4/0/0 + switchport mode trunk + switchport trunk allowed vlan add 100,101,110-112,302,311,312,501-503,2604 + switchport trunk allowed vlan remove 1 +``` + +What happens is: LSIX connects the AS112 port to the Centec switch on `eth-0-5`, which offers it +tagged to `Te4/0/0.112` on the VPP router and without wasting CAM space for the MAC addresses (by +turing off MAC learning -- this is possible because there's only 2 ports in the VLAN, so the switch +implicitly always knows where to forward the frames!). + +After sending it out on `eth-0-9` tagged as VLAN 112, VPP in turn encapsulates it with VxLAN and sends +it as VNI 11201 to remote endpoint `94.142.241.187`. Because that path has an MTU of 9000, the +traffic arrives to the VM with 1500b, no worries. 
Most of my AS112 traffic arrives at a VM this
+way, as it's really easy to flip the remote endpoint of the VxLAN tunnel to another replica in case
+of an outage or maintenance. Typically, BGP sessions won't even notice.
+
+**NGINX Frontends** \
+At IPng, almost everything runs in the internal network called _IPng Site Local_. I expose this
+network via a few carefully placed NGINX frontends. There are two in my own network (in Geneva and
+Zurich), one in IP-Max's network (in Zurich), and two at Coloclue (in Amsterdam). They frontend
+and do SSL offloading and TCP loadbalancing for a variety of websites and services. I described the
+architecture and design in an [[article]({% post_url 2023-03-17-ipng-frontends %})]. There are
+currently ~120 websites frontended on this cluster.
+
+**SMTP Relays** \
+I self-host my mail, and I tried to make a fully redundant and self-repairing setup: SMTP in- and
+outbound with Postfix, an IMAP server and redundant maildrop storage with Dovecot, a webmail service
+with Roundcube, and so on. Because I need to perform DNSBL lookups, this requires routable IPv4 and IPv6
+addresses. Two of my four mailservers run at Coloclue, which I described in an [[article]({%
+post_url 2024-05-17-smtp %})].
+
+**Mailman Service** \
+For FrysIX, FreeIX, and IPng itself, I run a set of mailing lists. The mailman service runs
+partially in IPng Site Local, and has one IPv4 address for outbound e-mail. I separated this from
+the IPng relays so that IP based reputation does not interfere between these two types of
+mail service.
+
+**FrysIX Services** \
+The routeserver `rs2.frys-ix.net`, the authoritative nameserver `ns2.frys-ix.net`, the IXPManager
+and LibreNMS monitoring service all run on hypervisors at either Coloclue (Qupra) or ERITAP
+(Equinix AM3). By the way, remember the part about the enterprise storage? The ixpmanager is
+currently running on `hvn0.nlams3.net.ipng.ch` which has a set of three Samsung EVO consumer SSDs,
+which are really at the end of their life. Please, can I connect to FrysIX from Qupra so I can move
+these VMs to the Seagate SAS-3 MLC storage pool? :)
+
+{{< image width="100px" float="right" src="/assets/coloclue-ipng/pencilvester.png" alt="Pencilvester" >}}
+
+**IPng OpenBSD Bastion Hosts** \
+IPng Networks has three OpenBSD bastion jumphosts with an open SSH port 22, which are named after
+characters from a TV show called Rick and Morty. **Squanchy** lives in my house on
+`hvn0.chbtl0.net.ipng.ch`, **Glootie** lives at IP-Max on `hvn0.chrma0.net.ipng.ch`, and **Pencilvester**
+lives on a hypervisor at Coloclue on `hvn0.nlams1.net.ipng.ch`. These bastion hosts connect both to the public
+internet and to the _IPng Site Local_ network. As such, if I have SSH access, I will also have
+access to the internal network of IPng.
+
+**IPng Border Gateways** \
+The internal network of IPng is mostly disconnected from the Internet. Although I can log in via these
+bastion hosts, I also have a set of four so-called _Border Gateways_, which are connected both
+to the _IPng Site Local_ network and to the Internet. Each of them runs an IPv4 and IPv6
+WireGuard endpoint, and I'm pretty much always connected with these. This allows me full access to the
+internal network, and NAT'ed access towards the Internet.
+
+Each border gateway announces a default route towards the Centec switches, and connects to AS8298,
+AS8283 and AS25091 for internet connectivity. One of them runs in Amsterdam, and I wrote about
+these gateways in an [[article]({% post_url 2023-03-11-mpls-core %})].
+
+**Public NAT64/DNS64 Gateways** \
+I operate a set of four private NAT64/DNS64 gateways, one of which is in Amsterdam. They pair up with
+and complement the WireGuard and NAT44/NAT66 functionality of the _Border Gateways_. Because NAT64 is
+useful in general, I also operate two public NAT64/DNS64 gateways, one at Qupra and one at
+EUNetworks. You can try them for yourself by using the following anycasted resolver:
+`2a02:898:146:64::64` and performing a traceroute to an IPv4-only host, like `github.com`. Note:
+this works from anywhere, but for safety reasons, I filter some ports like SMTP, NETBIOS and so on,
+roughly the same way a Tor exit router would. I wrote about them in an [[article]({% post_url
+2024-05-25-nat64-1 %})].
+
+```
+pim@cons0-nlams0:~$ cat /etc/resolv.conf
+# *** Managed by IPng Ansible ***
+#
+domain ipng.ch
+search net.ipng.ch ipng.ch
+nameserver 2a02:898:146:64::64
+
+pim@cons0-nlams0:~$ traceroute6 -q1 ipv4.tlund.se
+traceroute to ipv4c.tlund.se (2a02:898:146:64::c10f:e4c3), 30 hops max, 80 byte packets
+ 1 2a10:e300:26:48::1 (2a10:e300:26:48::1) 0.221 ms
+ 2 as8283.ix.frl (2001:7f8:10f::205b:187) 0.443 ms
+ 3 hvn0.nlams1.ipng.ch (2a02:898::146:1) 0.866 ms
+ 4 bond0-100.dc5-1.router.nl.coloclue.net (2a02:898:146:64::5e8e:f4fc) 0.900 ms
+ 5 bond0-130.eunetworks-2.router.nl.coloclue.net (2a02:898:146:64::5e8e:f7f2) 0.920 ms
+ 6 ams13-peer-1.hundredgige2-3-0.tele2.net (2a02:898:146:64::50f9:d18b) 2.302 ms
+ 7 ams13-agg-1.bundle-ether4.tele2.net (2a02:898:146:64::5b81:e1e) 22.760 ms
+ 8 gbg-cagg-1.bundle-ether7.tele2.net (2a02:898:146:64::5b81:ef8) 22.983 ms
+ 9 bck3-core-1.bundle-ether6.tele2.net (2a02:898:146:64::5b81:c74) 22.295 ms
+10 lba5-core-2.bundle-ether2.tele2.net (2a02:898:146:64::5b81:c2f) 21.951 ms
+11 avk-core-2.bundle-ether9.tele2.net (2a02:898:146:64::5b81:c24) 21.760 ms
+12 avk-cagg-1.bundle-ether4.tele2.net (2a02:898:146:64::5b81:c0d) 22.602 ms
+13 skst123-lgw-2.bundle-ether50.tele2.net (2a02:898:146:64::5b81:e23) 21.553 ms
+14 skst123-pe-1.gigabiteth0-2.tele2.net (2a02:898:146:64::82f4:5045) 21.336 ms
+15 2a02:898:146:64::c10f:e4c3 (2a02:898:146:64::c10f:e4c3) 21.722 ms
+```
+
+### Thanks for reading
+
+{{< image width="150px" float="left" src="/assets/coloclue-ipng/heart.png" alt="Heart" >}}
+
+This article is a bit different to my usual writing - it doesn't deep dive into any protocol or code
+that I've written, but it does describe a good chunk of the way I think about systems and
+networking. I appreciate the opportunities that Coloclue as a networking community and hobby club
+affords. I'm always happy to talk about routing, network- and systems engineering, and the stuff I
+develop at IPng Networks, notably our VPP routing stack. I encourage folks to become a member and
+learn about and develop novel approaches to this thing we call the Internet.
+
+Oh, and if you're a Coloclue member looking for a secondary location, IPng offers colocation and
+hosting services in Zurich, Geneva, and soon in Lucerne as well :) Houdoe!
+ diff --git a/content/articles/2024-07-05-r86s.md b/content/articles/2024-07-05-r86s.md new file mode 100644 index 0000000..a436dbb --- /dev/null +++ b/content/articles/2024-07-05-r86s.md @@ -0,0 +1,566 @@ +--- +date: "2024-07-05T12:51:23Z" +title: 'Review: R86S (Jasper Lake - N6005)' +--- + +# Introduction + +{{< image width="250px" float="right" src="/assets/r86s/r86s-front.png" alt="R86S Front" >}} + +I am always interested in finding new hardware that is capable of running VPP. Of course, a standard +issue 19" rack mountable machine like a Dell, HPE or SuperMicro machine is an obvious choice. They +come with redundant power supplies, PCIe v3.0 or better expansion slots, and can boot off of mSATA +or NVME, with plenty of RAM. But for some people and in some locations, the power envelope or +size/cost of these 19" rack mountable machines can be prohibitive. Sometimes, just having a smaller +form factor can be very useful: \ +***Enter the GoWin R86S!*** + +{{< image width="250px" float="right" src="/assets/r86s/r86s-nvme.png" alt="R86S NVME" >}} + +I stumbled across this lesser known build from GoWin, which is an ultra compact but modern design, +featuring three 2.5GbE ethernet ports and optionally two 10GbE, or as I'll show here, two 25GbE +ports. What I really liked about the machine is that it comes with 32GB of LPDDR4 memory and can boot +off of an m.2 NVME -- which makes it immediately an appealing device to put in the field. I +noticed that the height of the machine is just a few millimeters smaller than 1U which is 1.75" +(44.5mm), which gives me the bright idea to 3D print a bracket to be able to rack these and because +they are very compact -- a width of 78mm only, I can manage to fit four of them in one 1U front, or +maybe a Mikrotik CRS305 breakout switch. Slick! + +{{< image width="250px" float="right" src="/assets/r86s/r86s-ocp.png" alt="R86S OCP" >}} + +I picked up two of these _R86S Pro_ and when they arrived, I noticed that their 10GbE is actually +an _Open Compute Project_ (OCP) footprint expansion card, which struck me as clever. It means that I +can replace the Mellanox `CX342A` network card with perhaps something more modern, such as an Intel +`X520-DA2` or Mellanox `MCX542B_ACAN` which is even dual-25G! So I take to ebay and buy myself a few +expansion OCP boards, which are surprisingly cheap, perhaps because the OCP form factor isn't as +popular as 'normal' PCIe v3.0 cards. + +I put a Google photos album online [[here](https://photos.app.goo.gl/gPMAp21FcXFiuNaH7)], in case +you'd like some more detailed shots. + +In this article, I'll write about a mixture of hardware, systems engineering (how the hardware like +network cards and motherboard and CPU interact with one another), and VPP performance diagnostics. +I hope that it helps a few wary Internet denizens feel their way around these challenging but +otherwise fascinating technical topics. Ready? Let's go! + +# Hardware Specs + +{{< image width="250px" float="right" src="/assets/r86s/nics.png" alt="NICs" >}} + +For the CHF 314,- I paid for each Intel Pentium N6005, this machine is delightful! 
They feature:
+
+* Intel Pentium Silver N6005 @ 2.00GHz (4 cores)
+* 2x16GB Micron LPDDR4 memory @2933MT/s
+* 1x Samsung SSD 980 PRO 1TB NVME
+* 3x Intel I226-V 2.5GbE network ports
+* 1x OCP v2.0 connector with PCIe v3.0 x4 delivered
+* USB-C power supply
+* 2x USB3 (one on front, one on side)
+* 1x USB2 (on the side)
+* 1x MicroSD slot
+* 1x MicroHDMI video out
+* Wi-Fi 6 AX201 160MHz onboard
+
+To the right I've put the three OCP network interface cards side by side. On the top, the Mellanox
+Cx3 (2x10G) that shipped with the R86S units. In the middle, a spiffy Mellanox Cx5 (2x25G), and at
+the bottom, the _classic_ Intel 82599ES (2x10G) card. As I'll demonstrate, despite having the same
+form factor, each of these has a unique story to tell, well beyond their rated portspeed.
+
+There are quite a few options for the CPU out there - GoWin sells them with Jasper Lake (Celeron N5105
+or Pentium N6005, the one I bought), but also with the newer Alder Lake (N100 or N305). Price,
+performance and power draw will vary. I looked at a few differences in Passmark, and I think I made
+a good trade-off between cost, power and performance. You may of course choose differently!
+
+The R86S form factor is very compact, coming in at 80mm x 120mm x 40mm, and the case is made of
+sturdy aluminium. It feels like a good quality build, and the inside is also pretty neat. In the
+kit, a cute little M2 hex driver is included. This allows me to remove the bottom plate (to service
+the NVME) and separate the case to access the OCP connector (and replace the NIC!). Finally, the two
+antennae at the back are tri-band, suitable for WiFi 6. There is one fan included in the chassis,
+with a few cut-outs in the top of the case, to let the air flow through the case. The fan is not
+noisy, but definitely noticeable.
+
+## Compiling VPP on R86S
+
+I first install Debian Bookworm on them, and retrofit one of them with the Intel X520 and the other
+with the Mellanox Cx5 network card. While the Mellanox Cx342A that comes with the R86S does have
+DPDK support (using the MLX4 poll mode driver), it has a quirk in that it does not enumerate both
+ports as unique PCI devices, causing VPP to crash with duplicate graph node names:
+
+```
+vlib_register_node:418: more than one node named `FiftySixGigabitEthernet5/0/0-tx'
+Failed to save post-mortem API trace to /tmp/api_post_mortem.794
+received signal SIGABRT, PC 0x7f9445aa9e2c
+```
+
+The way VPP enumerates DPDK devices is by walking the PCI bus, but considering the Connect-X3 has
+two ports behind the same PCI address, it'll try to create two interfaces, which fails. It's pretty
+easily fixable with a small [[patch](/assets/r86s/vpp-cx3.patch)]. Off I go, to compile VPP (version
+`24.10-rc0~88-ge3469369dd`) with Mellanox DPDK support, to get the best side-by-side comparison
+between the Cx3 and X520 cards on the one hand needing DPDK, and the Cx5 card optionally also being
+able to use VPP's RDMA driver. They will _all_ be using DPDK in my tests.
+
+I'm not out of the woods yet, because VPP throws an error when enumerating and attaching the
+Mellanox Cx342.
I read the DPDK documentation for this poll mode driver +[[ref](https://doc.dpdk.org/guides/nics/mlx4.html)] and find that when using DPDK applications, the +`mlx4_core` driver in the kernel has to be initialized with a specific flag, like so: + +``` +GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=1-3 iommu=on intel_iommu=on mlx4_core.log_num_mgm_entry_size=-1" +``` + +And because I'm using `iommu`, the correct driver to load for Cx3 is `vfio_pci`, so I put that in +`/etc/modules`, rebuild the initrd, and reboot the machine. With all of that sleuthing out of the +way, I am now ready to take the R86S out for a spin and see how much this little machine is capable +of forwarding as a router. + +### Power: Idle and Under Load +I note that the Intel Pentium Silver CPU has 4 cores, one of which will be used by OS and +controlplane, leaving 3 worker threads left for VPP. The Pentium Silver N6005 comes with 32kB of L1 +per core, and 1.5MB of L2 + 4MB of L3 cache shared between the cores. It's not much, but then again +the TDP is shockingly low 10 Watts. Before VPP runs (and makes the CPUs work really hard), the +entire machine idles at 12 Watts. When powered on under full load, the Mellanox Cx3 and Intel +x520-DA2 both sip 17 Watts of power and the Mellanox Cx5 slurps 20 Watts of power all-up. Neat! + +## Loadtest Results + +{{< image width="400px" float="right" src="/assets/r86s/loadtest.png" alt="Loadtest" >}} + +For each network interface I will do a bunch of loadtests, to show different aspects of the setup. +First, I'll do a bunch of unidirectional tests, where traffic goes into one port and exits another. +I will do this with either large packets (1514b), small packets (64b) but many flows, which allow me +to use multiple hardware receive queues assigned to individual worker threads, or small packets with +only one flow, limiting VPP to only one RX queue and consequently only one CPU thread. Because I +think it's hella cool, I will also loadtest MPLS label switching (eg. MPLS frame with label '16' on +ingress, forwarded with a swapped label '17' on egress). In general, MPLS lookups can be a bit +faster as they are (constant time) hashtable lookups, while IPv4 longest prefix match lookups use a +trie. MPLS won't be significantly faster than IPv4 in these tests, because the FIB is tiny with +only a handful of entries. + +Second, I'll do the same loadtests but in both directions, which means traffic is both entering NIC0 +and being emitted on NIC1, but also entering on NIC1 to be emitted on NIC0. In these loadtests, +again large packets, small packets multi-flow, small packets single-flow, and MPLS, the network chip +has to do more work to maintain its RX queues *and* its TX queues simultaneously. As I'll +demonstrate, this tends to matter quite a bit on consumer hardware. + +### Intel i226-V (2.5GbE) + +This is a 2.5G network interface from the _Foxville_ family, released in Q2 2022 with a ten year +expected availability, it's currently a very good choice. It is a consumer/client chip, which means +I cannot expect super performance from it. In this machine, the three RJ45 ports are connected to +PCI slot 01:00.0, 02:00.0 and 03:00.0, each at 5.0GT/s (this means they are PCIe v2.0) and they take +one x1 PCIe lane to the CPU. 
I leave the first port as management, and take the second and third ports and give
+them to VPP like so:
+
+```
+dpdk {
+ dev 0000:02:00.0 { name e0 }
+ dev 0000:03:00.0 { name e1 }
+ no-multi-seg
+ decimal-interface-names
+ uio-driver vfio-pci
+}
+```
+
+The logical configuration then becomes:
+
+```
+set int state e0 up
+set int state e1 up
+set int ip address e0 100.64.1.1/30
+set int ip address e1 100.64.2.1/30
+ip route add 16.0.0.0/24 via 100.64.1.2
+ip route add 48.0.0.0/24 via 100.64.2.2
+ip neighbor e0 100.64.1.2 50:7c:6f:20:30:70
+ip neighbor e1 100.64.2.2 50:7c:6f:20:30:71
+
+mpls table add 0
+set interface mpls e0 enable
+set interface mpls e1 enable
+mpls local-label add 16 eos via 100.64.2.2 e1
+mpls local-label add 17 eos via 100.64.1.2 e0
+```
+
+In the first block, I'll bring up interfaces `e0` and `e1`, give them an IPv4 address in a /30
+transit net, and set a route to the other side. I'll route packets destined to 16.0.0.0/24 to the
+Cisco T-Rex loadtester at 100.64.1.2, and I'll route packets for 48.0.0.0/24 to the T-Rex at
+100.64.2.2. To avoid the need to ARP for T-Rex, I'll set some static ARP entries to the loadtester's
+MAC addresses.
+
+In the second block, I'll enable MPLS, turn it on for the two interfaces, and add two FIB entries. If
+VPP receives an MPLS packet with label 16, it'll forward it on to Cisco T-Rex on port `e1`, and if it
+receives a packet with label 17, it'll forward it to T-Rex on port `e0`.
+
+Without further ado, here are the results of the i226-V loadtest:
+
+| ***Intel i226-V***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate |
+|---------------|----------|-------------|-----------|
+| Unidirectional 1514b | 2.44Gbps | 202kpps | 99.4% |
+| Unidirectional 64b Multi | 1.58Gbps | 3.28Mpps | 88.1% |
+| Unidirectional 64b Single | 1.58Gbps | 3.28Mpps | 88.1% |
+| Unidirectional 64b MPLS | 1.57Gbps | 3.27Mpps | 87.9% |
+| Bidirectional 1514b | 4.84Gbps | 404kpps | 99.4% |
+| Bidirectional 64b Multi | 2.44Gbps | 5.07Mpps | 68.2% |
+| Bidirectional 64b Single | 2.44Gbps | 5.07Mpps | 68.2% |
+| Bidirectional 64b MPLS | 2.43Gbps | 5.07Mpps | 68.2% |
+
+First response: very respectable!
+
+#### Important Notes
+
+**1. L1 vs L2** \
+There are a few observations I want to make, as these numbers can be confusing. First off, when
+given large packets, VPP can easily sustain almost exactly (!) the line rate of 2.5GbE. There's
+always a debate about these numbers, so let me offer some theoretical background (with a quick
+worked example right after this list) --
+
+1. The L2 Ethernet frame that Cisco T-Rex sends consists of the source/destination MAC (6
+ bytes each), a type (2 bytes), the payload, and a frame checksum (4 bytes). It shows us this
+ number as `Tx bps L2`.
+1. But on the wire, the PHY has to additionally send a _preamble_ (7 bytes), a _start frame
+ delimiter_ (1 byte), and at the end, an _interpacket gap_ (12 bytes), which is 20 bytes of
+ overhead. This means that the total size on the wire will be **1534 bytes**. It shows us this
+ number as `Tx bps L1`.
+1. This 1534 byte L1 frame on the wire is 12272 bits. For a 2.5Gigabit line rate, this means we
+ can send at most 2'500'000'000 / 12272 = **203715 packets per second**. Regardless of L1 or L2,
+ this number is always `Tx pps`.
+1. The smallest (L2) Ethernet frame we're allowed to send is 64 bytes, and anything shorter than
+ this is called a _Runt_. On the wire, such a frame will be 84 bytes (672 bits). With 2.5GbE, this
+ means **3.72Mpps** is the theoretical maximum.
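+
+To make that arithmetic easy to double-check, here's the same calculation done straight from a
+shell -- nothing VPP or T-Rex specific, just the 20 bytes of L1 overhead added to each frame:
+
+```
+$ # max packets/sec at 2.5GbE for 1514 byte and 64 byte L2 frames (L1 adds 20 bytes per frame)
+$ echo $(( 2500000000 / ((1514+20)*8) ))
+203715
+$ echo $(( 2500000000 / ((64+20)*8) ))
+3720238
+```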
+
+When reading back loadtest results from Cisco T-Rex, it shows us packets per second (Rx pps), but it
+only shows us the `Rx bps`, which is the **L2 bits/sec** that corresponds to the sending port's `Tx
+bps L2`. When I describe the percentage of Line-Rate, I calculate this with what physically fits on
+the wire, e.g. the **L1 bits/sec**, because that makes most sense to me.
+
+When sending small 64b packets, the difference is significant: taking the above _Unidirectional 64b
+Single_ as an example, I observed 3.28M packets/sec. This is a bandwidth of 3.28M\*64\*8 = 1.679Gbit
+of L2 traffic, but a bandwidth of 3.28M\*(64+20)\*8 = 2.204Gbit of L1 traffic, which is how I
+determine that it is 88.1% of Line-Rate.
+
+**2. One RX queue** \
+A less pedantic observation is that there is no difference between _Multi_ and _Single_ flow
+loadtests. This is because the NIC only uses one RX queue, and therefore only one VPP worker thread.
+I did do a few loadtests with multiple receive queues, but it does not matter for performance. When
+performing this 3.28Mpps of load, I can see that VPP itself is not saturated. I can see that most of
+the time it's just sitting there waiting for DPDK to give it work, which manifests as a relatively
+low vectors/call:
+
+```
+---------------
+Thread 2 vpp_wk_1 (lcore 2)
+Time 10.9, 10 sec internal node vector rate 40.39 loops/sec 68325.87
+ vector rates in 3.2814e6, out 3.2814e6, drop 0.0000e0, punt 0.0000e0
+ Name State Calls Vectors Suspends Clocks Vectors/Call
+dpdk-input polling 61933 2846586 0 1.28e2 45.96
+ethernet-input active 61733 2846586 0 1.71e2 46.11
+ip4-input-no-checksum active 61733 2846586 0 6.54e1 46.11
+ip4-load-balance active 61733 2846586 0 4.70e1 46.11
+ip4-lookup active 61733 2846586 0 7.50e1 46.11
+ip4-rewrite active 61733 2846586 0 7.23e1 46.11
+e1-output active 61733 2846586 0 2.53e1 46.11
+e1-tx active 61733 2846586 0 1.38e2 46.11
+```
+
+By the way, the other numbers here are fascinating as well. Take a look at them:
+* **Calls**: How often has VPP executed this graph node.
+* **Vectors**: How many packets (which are internally called vectors) have been handled.
+* **Vectors/Call**: Every time VPP executes the graph node, on average how many packets are done
+ at once? An unloaded VPP will hover around 1.00, and the maximum permissible is 256.00.
+* **Clocks**: How many CPU cycles, on average, did each packet spend in each graph node.
+ Interestingly, summing up this number gets very close to the total CPU clock cycles available
+ (on this machine 2.4GHz).
+
+Zooming in on the **clocks** number a bit more: every time a packet was handled, roughly 594 CPU
+cycles were spent in VPP's directed graph. An additional 128 CPU cycles were spent asking DPDK for
+work. Summing it all up, 3.28M\*(594+128) = 2'369'170'800 which is eerily close to the 2.4GHz I
+mentioned above. I love it when the math checks out!!
+
+By the way, in case you were wondering what happens on an unloaded VPP thread, the clocks spent
+in `dpdk-input` (and other polling nodes like `unix-epoll-input`) just go up to consume the whole
+core. I explain that in a bit more detail below.
+
+**3. Uni- vs Bidirectional** \
+I noticed a non-linear response between loadtests in one direction versus both directions. At large
+packets, it did not matter. Both directions saturated the line nearly perfectly (202kpps in one
+direction, and 404kpps in both directions). However, in the smaller packets, some contention became
+clear. In only one direction, IPv4 and MPLS forwarding were roughly 3.28Mpps; but in both
+directions, this went down to 2.53Mpps in each direction (which is my reported 5.07Mpps). So it's
+interesting to see how these i226-V chips do seem to care whether they are only receiving or only
+transmitting, or performing both receiving *and* transmitting.
+
+### Intel X520 (10GbE)
+
+{{< image width="100px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
+
+This network card is based on the classic Intel _Niantic_ chipset, also known as the 82599ES chip,
+first released in 2009. It's super reliable, but there is one downside: it's a PCIe v2.0 device
+(5.0GT/s) and to be able to run two ports, it needs eight lanes of PCI connectivity. However, a
+quick inspection using `dmesg` shows me that there are only 4 lanes brought to the OCP connector:
+
+```
+ixgbe 0000:05:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x4 link
+ at 0000:00:1c.4 (capable of 32.000 Gb/s with 5.0 GT/s PCIe x8 link)
+ixgbe 0000:05:00.0: MAC: 2, PHY: 1, PBA No: H31656-000
+ixgbe 0000:05:00.0: 90:e2:ba:c5:c9:38
+ixgbe 0000:05:00.0: Intel(R) 10 Gigabit Network Connection
+```
+
+That's a bummer, because there are two TenGig ports on this OCP card, and this chip is a PCIe v2.0
+device. That means the PCI encoding is 8b/10b, so each lane can deliver only about 80% of its
+5.0GT/s, and 80% of 20GT/s is 16.0Gbit. By the way, when PCIe v3.0 was released, not only did the
+transfer speed go to 8.0GT/s per lane, the encoding also changed to 128b/130b which lowers the
+overhead from a whopping 20% to only 1.5%. It's not a bad investment of time to read up on PCI
+Express standards on [[Wikipedia](https://en.wikipedia.org/wiki/PCI_Express)], as PCIe limitations
+and blocked lanes (like in this case!) are the number one reason for poor VPP performance, as my
+buddy Sander also noted during my NLNOG talk last year.
+
+#### Intel X520: Loadtest Results
+
+Now that I've shown a few of these runtime statistics, I think it's good to review three pertinent graphs.
+I proceed to hook up the loadtester to the 10G ports of the R86S unit that has the Intel X520-DA2
+adapter. I'll run the same eight loadtests:
+**{1514b,64b,64b-1Q,MPLS} x {unidirectional,bidirectional}**
+
+Above, I showed the output of `show runtime` in the VPP debug CLI. These numbers are
+also exported by a Prometheus exporter. I wrote about that in this [[article]({% post_url
+2023-04-09-vpp-stats %})]. In Grafana, I can draw these timeseries as graphs, and it shows me a lot
+about where VPP is spending its time. Each _node_ in the directed graph counts how many vectors
+(packets) it has seen, and how many CPU cycles it has spent doing its work.
+
+{{< image src="/assets/r86s/grafana-vectors.png" alt="Grafana Vectors" >}}
+
+In VPP, a graph of _vectors/sec_ shows how many packets per second the router is forwarding. The
+graph above is on a logarithmic scale, and I've annotated each of the eight loadtests in orange. The
+first block of four are the ***U***_nidirectional_ tests and of course, higher values are better.
+
+I notice that some of these loadtests ramp up until a certain point, after which they flatline (I
+drew orange arrows where this happens). The first
+time this clearly happens is in the ***U3*** loadtest. It makes sense to me, because having one flow
+implies only one worker thread, whereas in the ***U2*** loadtest the system can make use of multiple
+receive queues and therefore multiple worker threads.
It stands to reason that ***U2*** has a +slightly better performance than ***U3***. + +The fourth test, the _MPLS_ loadtest, is forwarding the same identical packets with label 16, out on +another interface with label 17. They are therefore also single flow, and this explains why the +***U4*** loadtest looks very similar to the ***U3*** one. Some NICs can hash MPLS traffic to +multiple receive queues based on the inner payload, but I conclude that the Intel X520-DA2 aka +82599ES cannot do that. + +The second block of four are the ***B***_idirectional_ tests. Similar to the tests I did with the +i226-V 2.5GbE NICs, here each of the network cards has to both receive traffic as well as sent +traffic. It is with this graph that I can determine the overall throughput in packets/sec +of these network interfaces. Of course the bits/sec and packets/sec also come from the T-Rex +loadtester output JSON. Here they are, for the Intel X520-DA2: + +| ***Intel 82599ES***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate | +|---------------|----------|-------------|-----------| +| U1: Unidirectional 1514b | 9.77Gbps | 809kpps | 99.2% | +| U2: Unidirectional 64b Multi | 6.48Gbps | 13.4Mpps | 90.1% | +| U3: Unidirectional 64b Single | 3.73Gbps | 7.77Mpps | 52.2% | +| U4: Unidirectional 64b MPLS | 3.32Gbps | 6.91Mpps | 46.4% | +| B1: Bidirectional 1514b | 12.9Gbps | 1.07Mpps | 65.6% | +| B2: Bidirectional 64b Multi | 6.08Gbps | 12.7Mpps | 42.7% | +| B3: Bidirectional 64b Single | 6.25Gbps | 13.0Mpps | 43.7% | +| B4: Bidirectional 64b MPLS | 3.26Gbps | 6.79Mpps | 22.8% | + +A few further observations: +1. ***U1***'s loadtest shows that the machine can sustain 10Gbps in one direction, while ***B1*** + shows that bidirectional loadtests are not yielding twice as much throughput. This is very + likely because the PCIe 5.0GT/s x4 link is constrained to 16Gbps total throughput, while the + OCP NIC supports PCIe 5.0GT/s x8 (32Gbps). +1. ***U3***'s loadtest shows that one single CPU can do 7.77Mpps max, if it's the only CPU that is + doing work. This is likely because if it's the only thread doing work, it gets to use the + entire L2/L3 cache for itself. +1. ***U2***'s test shows that when multiple workers perform work, the throughput raises to + 13.4Mpps, but this is not double that of a single worker. Similar to before, I think this is + because the threads now need to share the CPU's modest L2/L3 cache. +1. ***B3***'s loadtest shows that two CPU threads together can do 6.50Mpps each (for a total of + 13.0Mpps), which I think is likely because each NIC now has to receive _and_ transit packets. + +If you're reading this and think you have an alternative explanation, do let me know! + +### Mellanox Cx3 (10GbE) + +When VPP is doing its work, it typically asks DPDK (or other input types like virtio, AVF, or RDMA) +for a _list_ of packets, rather than one individual packet. It then brings these packets, called +_vectors_, through a directed acyclic graph inside of VPP. Each graph node does something specific to +the packets, for example in `ethernet-input`, the node checks what ethernet type each packet is (ARP, +IPv4, IPv6, MPLS, ...), and hands them off to the correct next node, such as `ip4-input` or `mpls-input`. +If VPP is idle, there may be only one or two packets in the list, which means every time the packets +go into a new node, a new chunk of code has to be loaded from working memory into the CPU's +instruction cache. 
Conversely, if there are many packets in the list, only the first packet may need +to pull things into the i-cache, the second through Nth packet will become cache hits and execute +_much_ faster. Moreover, some nodes in VPP make use of processor optimizations like _SIMD_ (single +instruction, multiple data), to save on clock cycles if the same operation needs to be executed +multiple times. + +{{< image src="/assets/r86s/grafana-clocks.png" alt="Grafana Clocks" >}} + +This graph shows the average CPU cycles per packet for each node. In the first three loadtests +(***U1***, ***U2*** and ***U3***), you can see four lines representing the VPP nodes `ip4-input` +`ip4-lookup`, `ip4-load-balance` and `ip4-rewrite`. In the fourth loadtest ***U4***, you can see +only three nodes: `mpls-input`, `mpls-lookup`, and `ip4-mpls-label-disposition-pipe` (where the MPLS +label '16' is swapped for outgoing label '17'). + +It's clear to me that when VPP has not many packets/sec to route (ie ***U1*** loadtest), that the +cost _per packet_ is actually quite high at around 200 CPU cycles per packet per node. But, if I +slam the VPP instance with lots of packets/sec (ie ***U3*** loadtest), that VPP gets _much_ more +efficient at what it does. What used to take 200+ cycles per packet, now only takes between 34-52 +cycles per packet, which is a whopping 5x increase in efficiency. How cool is that?! + +And with that, the Mellanox C3 loadtest completes, and the results are in: + +| ***Mellanox MCX342A-XCCN***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate | +|---------------|----------|-------------|-----------| +| U1: Unidirectional 1514b | 9.73Gbps | 805kpps | 99.7% | +| U2: Unidirectional 64b Multi | 1.11Gbps | 2.30Mpps | 15.5% | +| U3: Unidirectional 64b Single | 1.10Gbps | 2.27Mpps | 15.3% | +| U4: Unidirectional 64b MPLS | 1.10Gbps | 2.27Mpps | 15.3% | +| B1: Bidirectional 1514b | 18.7Gbps | 1.53Mpps | 94.9% | +| B2: Bidirectional 64b Multi | 1.54Gbps | 2.29Mpps | 7.69% | +| B3: Bidirectional 64b Single | 1.54Gbps | 2.29Mpps | 7.69% | +| B4: Bidirectional 64b MPLS | 1.54Gbps | 2.29Mpps | 7.69% | + +Here's something that I find strange though. VPP is clearly not saturated by these 64b loadtests. I +know this, because in the case of the Intel X520-DA2 above, I could easily see 13Mpps in a +bidirectional test, yet with this Mellanox Cx3 card, no matter if I do one direction or both +directions, the max packets/sec tops at 2.3Mpps only -- that's an order of magnitude lower. + +Looking at VPP, both worker threads (the one reading from Port 5/0/0, and the other reading from +Port 5/0/1), are not very busy at all. If a VPP worker thread is saturated, this typically shows as +a vectors/call of 256.00 and 100% of CPU cycles consumed. 
But here, that's not the case at all, and +most time is spent in DPDK waiting for traffic: + +``` +Thread 1 vpp_wk_0 (lcore 1) +Time 31.2, 10 sec internal node vector rate 2.26 loops/sec 988626.15 + vector rates in 1.1521e6, out 1.1521e6, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +FiftySixGigabitEthernet5/0/1-o active 15949560 35929200 0 8.39e1 2.25 +FiftySixGigabitEthernet5/0/1-t active 15949560 35929200 0 2.59e2 2.25 +dpdk-input polling 36250611 35929200 0 6.55e2 .99 +ethernet-input active 15949560 35929200 0 2.69e2 2.25 +ip4-input-no-checksum active 15949560 35929200 0 1.01e2 2.25 +ip4-load-balance active 15949560 35929200 0 7.64e1 2.25 +ip4-lookup active 15949560 35929200 0 9.26e1 2.25 +ip4-rewrite active 15949560 35929200 0 9.28e1 2.25 +unix-epoll-input polling 35367 0 0 1.29e3 0.00 +--------------- +Thread 2 vpp_wk_1 (lcore 2) +Time 31.2, 10 sec internal node vector rate 2.43 loops/sec 659534.38 + vector rates in 1.1517e6, out 1.1517e6, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call +FiftySixGigabitEthernet5/0/0-o active 14845221 35913927 0 8.66e1 2.42 +FiftySixGigabitEthernet5/0/0-t active 14845221 35913927 0 2.72e2 2.42 +dpdk-input polling 23114538 35913927 0 6.99e2 1.55 +ethernet-input active 14845221 35913927 0 2.65e2 2.42 +ip4-input-no-checksum active 14845221 35913927 0 9.73e1 2.42 +ip4-load-balance active 14845220 35913923 0 7.17e1 2.42 +ip4-lookup active 14845221 35913927 0 9.03e1 2.42 +ip4-rewrite active 14845221 35913927 0 8.97e1 2.42 +unix-epoll-input polling 22551 0 0 1.37e3 0.00 +``` + +{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}} + +I kind of wonder why that is. Is the Mellanox Connect-X3 such a poor performer? Or does it not like +small packets? I've read online that Mellanox cards do some form of message compression on the PCI +bus, something perhaps to turn off. I don't know, but I don't like it! + + +### Mellanox Cx5 (25GbE) + +VPP has a few _polling_ nodes, which are pieces of code that execute back-to-back in a tight +execution loop. A classic example of a _polling_ node is a _Poll Mode Driver_ from DPDK: this will +ask the network cards if they have any packets, and if so: marshall them through the directed graph +of VPP. As soon as that's done, the node will immediately ask again. If there is no work to do, this +turns into a tight loop with DPDK continuously asking for work. There is however another, lesser +known, _polling_ node: `unix-epoll-input`. This node services a local pool of file descriptors, like +the _Linux Control Plane_ netlink socket for example, or the clients attached to the Statistics +segment, CLI or API. You can see the open files with `show unix files`. + +This design explains why the CPU load of a typical DPDK application is 100% of each worker thread. As +an aside, you can ask the PMD to start off in _interrupt_ mode, and only after a certain load switch +seemlessly to _polling_ mode. Take a look at `set interface rx-mode` on how to change from _polling_ +to _interrupt_ or _adaptive_ modes. For performance reasons, I always leave the node in _polling_ +mode (the default in VPP). + +{{< image src="/assets/r86s/grafana-cpu.png" alt="Grafana CPU" >}} + +The stats segment shows how many clock cycles are being spent in each call of each node. It also +knows how often nodes are called. 
Considering the `unix-epoll-input` and `dpdk-input` nodes will +perform what is essentially a tight-loop, the CPU should always add up to 100%. I found that one +cool way to show how busy a VPP instance really is, is to look over all CPU threads, and sort +through the fraction of time spent in each node: + +* ***Input Nodes***: are those which handle the receive path from DPDK and into the directed graph + for routing -- for example `ethernet-input`, then `ip4-input` through to `ip4-lookup` and finally + `ip4-rewrite`. This is where VPP usually spends most of its CPU cycles. +* ***Output Nodes***: are those which handle the transmit path into DPDK. You'll see these are + nodes whose name ends in `-output` or `-tx`. You can also see that in ***U2***, there are only + two nodes consuming CPU, while in ***B2*** there are four nodes (because two interfaces are + transmitting!) +* ***epoll***: the _polling_ node called `unix-epoll-input` depicted in brown in this graph. +* ***dpdk***: the _polling_ node called `dpdk-input` depicted in green in this graph. + +If there is no work to do, as was the case at around 20:30 in the graph above, the _dpdk_ and +_epoll_ nodes are the only two that are consuming CPU. If there's lots of work to do, as was the +case in the unidirectional 64b loadtest between 19:40-19:50, and the bidirectional 64b loadtest +between 20:45-20:55, I can observe lots of other nodes doing meaningful work, ultimately starving +the _dpdk_ and _epoll_ threads until an equilibrium is achieved. This is how I know the VPP process +is the bottleneck and not, for example, the PCI bus. + +I let the eight loadtests run, and make note of the bits/sec and packets/sec for each, in this table +for the Mellanox Cx5: + +| ***Mellanox MCX542_ACAT***: Loadtest | L2 bits/sec | Packets/sec | % of Line-Rate | +|---------------|----------|-------------|-----------| +| U1: Unidirectional 1514b | 24.2Gbps | 2.01Mpps | 98.6% | +| U2: Unidirectional 64b Multi | 7.43Gbps | 15.5Mpps | 41.6% | +| U3: Unidirectional 64b Single | 3.52Gbps | 7.34Mpps | 19.7% | +| U4: Unidirectional 64b MPLS | 7.34Gbps | 15.3Mpps | 46.4% | +| B1: Bidirectional 1514b | 24.9Gbps | 2.06Mpps | 50.4% | +| B2: Bidirectional 64b Multi | 6.58Gbps | 13.7Mpps | 18.4% | +| B3: Bidirectional 64b Single | 3.15Gbps | 6.55Mpps | 8.81% | +| B4: Bidirectional 64b MPLS | 6.55Gbps | 13.6Mpps | 18.3% | + +Some observations: +1. This Mellanox Cx5 runs quite a bit hotter than the other two cards. It's a PCIe v3.0 which means + that despite there only being 4 lanes to the OCP port, it can achieve 31.504 Gbit/s (in case + you're wondering, this is 128b/130b encoding on 8.0GT/s x4). +1. It easily saturates 25Gbit in one direction with big packets in ***U1***, but as soon as smaller + packets are offered, each worker thread tops out at 7.34Mpps or so in ***U2***. +1. When testing in both directions, each thread can do about 6.55Mpps or so in ***B2***. Similar to + the other NICs, there is a clear slowdown due to CPU cache contention (when using multiple + threads), and RX/TX simultaneously (when doing bidirectional tests). +1. MPLS is a lot faster -- nearly double based on the use of multiple threads. I think this is + because the Cx5 has a hardware hashing function for MPLS packets that looks at the inner + payload to sort the traffic into multiple queues, while the Cx3 and Intel X520-DA2 do not. + +## Summary and closing thoughts + +There's a lot to say about these OCP cards. 
While the Intel is cheap, the Mellanox Cx3 is a bit
+quirky with its VPP enumeration, and the Mellanox Cx5 is a bit more expensive (and draws a fair bit
+more power, coming in at 20W) but does 25Gbit reasonably well, so it's pretty difficult to make a
+solid recommendation. What I find interesting is the very low limit in packets/sec on 64b packets
+coming from the Cx3, while at the same time there seems to be an added benefit in MPLS hashing that
+the other two cards do not have.
+
+All things considered, I think I would recommend the Intel x520-DA2 (based on the _Niantic_ chip,
+Intel 82599ES, total machine coming in at 17W). It seems like it pairs best with the available CPU
+on the machine. Maybe a Mellanox ConnectX-4 could be a good alternative though, hmmmm :)
+
+Here are a few files I gathered along the way, in case they are useful:
+
+* [[LSCPU](/assets/r86s/lscpu.txt)] - [[Likwid Topology](/assets/r86s/likwid-topology.txt)] -
+ [[DMI Decode](/assets/r86s/dmidecode.txt)] - [[LSBLK](/assets/r86s/lsblk.txt)]
+* Mellanox Cx341: [[dmesg](/assets/r86s/dmesg-cx3.txt)] - [[LSPCI](/assets/r86s/lspci-cx3.txt)] -
+ [[LSHW](/assets/r86s/lshw-cx3.txt)] - [[VPP Patch](/assets/r86s/vpp-cx3.patch)]
+* Mellanox Cx542: [[dmesg](/assets/r86s/dmesg-cx5.txt)] - [[LSPCI](/assets/r86s/lspci-cx5.txt)] -
+ [[LSHW](/assets/r86s/lshw-cx5.txt)]
+* Intel X520-DA2: [[dmesg](/assets/r86s/dmesg-x520.txt)] - [[LSPCI](/assets/r86s/lspci-x520.txt)] -
+ [[LSHW](/assets/r86s/lshw-x520.txt)]
+* VPP Configs: [[startup.conf](/assets/r86s/vpp/startup.conf)] - [[L2 Config](/assets/r86s/vpp/config/l2.vpp)] -
+ [[L3 Config](/assets/r86s/vpp/config/l3.vpp)] - [[MPLS Config](/assets/r86s/vpp/config/mpls.vpp)]
+
diff --git a/content/articles/2024-08-03-gowin.md b/content/articles/2024-08-03-gowin.md
new file mode 100644
index 0000000..0082715
--- /dev/null
+++ b/content/articles/2024-08-03-gowin.md
@@ -0,0 +1,443 @@
+---
+date: "2024-08-03T10:51:23Z"
+title: 'Review: Gowin 1U 2x25G (Alder Lake - N305)'
+---
+
+# Introduction
+
+{{< image float="right" src="/assets/gowin-n305/gowin-logo.png" alt="Gowin logo" >}}
+
+Last month, I took a good look at the Gowin R86S based on the Jasper Lake (N6005) CPU
+[[ref](https://www.gowinfanless.com/products/network-device/r86s-firewall-router/gw-r86s-u-series)],
+which is a really neat little 10G (and, if you fiddle with it a little bit, 25G!) router that runs
+off of USB-C power and can be rack mounted if you print a bracket. Check out my findings in this
+[[article]({% post_url 2024-07-05-r86s %})].
+
+David from Gowin reached out and asked me if I was willing to also take a look at their Alder Lake
+(N305) CPU, which comes in a 19" rack mountable chassis, running off of 110V/220V AC mains power,
+but also with a 2x25G ConnectX-4 network card. Why not! For critical readers: David sent me this
+machine, but made no attempt to influence this article.
+
+### Hardware Specs
+
+{{< image width="500px" float="right" src="/assets/gowin-n305/case.jpg" alt="Gowin overview" >}}
+
+There are a few differences between this 19" model and the compact mini-pc R86S. The most obvious
+difference is the form factor. The R86S is super compact, not inherently rack mountable,
+although I 3D printed a bracket for it. Looking inside, the motherboard is mostly obscured by a large
+cooling block with fins that are flush with the top plate. There are 5 copper ports in the front: 2x
+Intel i226-V (these are 2.5Gbit) and 3x Intel i210 (these are 1Gbit), and one of them offers PoE,
+which can be very handy to power a camera or WiFi access point. A nice touch.
+
+The Gowin server comes with an OCP v2.0 port, just like the R86S does. There's a custom bracket with
+a ribbon cable to the motherboard, and in the bracket is housed a Mellanox ConnectX-4 LX 2x25Gbit
+network card.
+
+### A look inside
+
+{{< image width="350px" float="right" src="/assets/gowin-n305/mobo.jpg" alt="Gowin mobo" >}}
+
+The machine comes with an Intel i3-N305 (Alder Lake) CPU running at a max clock of 3GHz and 4x8GB of
+LPDDR5 memory at 4800MT/s -- and considering the Alder Lake can make use of 4-channel memory, this
+thing should be plenty fast. The memory is soldered to the board, though, so there's no option of
+expanding or changing the memory after buying the unit.
+
+Using `likwid-topology`, I determine that the 8-core CPU has no hyperthreads, but just straight up 8
+cores with 32kB of L1 cache, two times 2MB of L2 cache (one bank shared between cores 0-3, and
+another shared between cores 4-7), and 6MB of L3 cache shared between all 8 cores. This is again a
+step up from the Jasper Lake CPU, and should make VPP run a little bit faster.
+
+What I find a nice touch is that Gowin has shipped this board with a 128GB MMC flash disk, which
+appears in Linux as `/dev/mmcblk0` and can be used to install an OS. However, there are also two
+NVME slots with M.2 2280, one M.SATA slot and two additional SATA slots with 4-pin power. On the
+side of the chassis is a clever bracket that holds three 2.5" SSDs in a staircase configuration.
+That's quite a lot of storage options, and given the CPU has some oomph, this little one could
+realistically be a NAS, although I'd prefer it to be a VPP router!
+
+{{< image width="350px" float="right" src="/assets/gowin-n305/ocp-ssd.jpg" alt="Gowin ocp-ssd" >}}
+
+The copper RJ45 ports are all on the motherboard, and there's an OCP breakout port that fits any OCP
+v2.0 network card. Gowin shipped it with a ConnectX-4 LX, but since I had a ConnectX-5 EN, I will
+take a look at performance with both cards. One critical observation, as with the Jasper Lake R86S,
+is that there are only 4 PCIe v3.0 lanes routed to the OCP, which means that the spiffy x8 network
+interfaces (both the Cx4 and the Cx5 I have here) will run at half speed. Bummer!
+
+The power supply is a 100-240V switching PSU with about 150W of power available. When running idle,
+with one 1TB NVME drive, I measure 38.2W on the 220V side. When running VPP at full load, I measure
+47.5W of total load. That's totally respectable for a 2x 25G + 2x 2.5G + 3x 1G VPP router.
+
+I've added some pictures to a [[Google Photos](https://photos.app.goo.gl/rbd9xJBUUcnCgW7v9)] album,
+if you'd like to take a look.
+
+## VPP Loadtest: RDMA versus DPDK
+
+You (hopefully) came here to read about VPP stuff. For years now, I have been curious as to the
+performance and functional differences in VPP between using DPDK and the native RDMA driver support
+that Mellanox network cards offer. In this article, I'll do four loadtests, with the
+stock Mellanox Cx4 that comes with the Gowin server, and with the Mellanox Cx5 card that I had
+bought for the R86S. I'll take a look at the differences between DPDK on the one hand and RDMA on the
+other. This will yield, for me at least, a better understanding of the differences. Spoiler: there
+are not many!
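+
+Concretely, the difference between the two is mostly in how the NIC gets attached to VPP. Here's a
+minimal sketch of both, using this machine's PCI addresses and Linux netdev names (the RDMA variant
+appears in full later in this article, and the full `startup.conf` is linked at the bottom; the
+`xxv0`/`xxv1` names on the DPDK side are just my own labels for this sketch):
+
+```
+# DPDK: the plugin claims the PCI devices in startup.conf
+dpdk {
+ dev 0000:0e:00.0 { name xxv0 }
+ dev 0000:0e:00.1 { name xxv1 }
+}
+
+# RDMA: no dpdk stanza for these ports; create the interfaces at runtime instead
+$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 num-rx-queues 3
+$ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 num-rx-queues 3
+```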
+ +## DPDK + +{{< image float="right" src="/assets/gowin-n305/dpdk.png" alt="DPDK logo" >}} + +The Data Plane Development Kit (DPDK) is an open source software project managed by the Linux +Foundation. It provides a set of data plane libraries and network interface controller polling-mode +drivers for offloading ethernet packet processing from the operating system kernel to processes +running in user space. This offloading achieves higher computing efficiency and higher packet +throughput than is possible using the interrupt-driven processing provided in the kernel. + +You can read more about it on [[Wikipedia](https://en.wikipedia.org/wiki/Data_Plane_Development_Kit)] +or on the [[DPDK Homepage](https://dpdk.org/)]. VPP uses DPDK as one of the (more popular) drivers +for network card interaction. + +### DPDK: ConnectX-4 Lx + +This is the OCP network card that came with the Gowin server. It identifies in Linux as: + +``` +0e:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] +0e:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] +``` + +Albeit with an important warning in `dmesg`, about the lack of PCIe lanes: + +``` +[3.704174] pci 0000:0e:00.0: [15b3:1015] type 00 class 0x020000 +[3.708154] pci 0000:0e:00.0: reg 0x10: [mem 0x60e2000000-0x60e3ffffff 64bit pref] +[3.716221] pci 0000:0e:00.0: reg 0x30: [mem 0x80d00000-0x80dfffff pref] +[3.724079] pci 0000:0e:00.0: Max Payload Size set to 256 (was 128, max 512) +[3.732678] pci 0000:0e:00.0: PME# supported from D3cold +[3.736296] pci 0000:0e:00.0: reg 0x1a4: [mem 0x60e4800000-0x60e48fffff 64bit pref] +[3.756916] pci 0000:0e:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link + at 0000:00:1d.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link) +``` + +With a PCIe v3.0 overhead of 130b/128b, that means the card will have (128/130) * 32 = 31.508 Gbps +available and I'm actually not quite sure why the kernel claims 31.504G in the log message. Anyway, +the card itself works just fine at this speed, and is immediately detected in DPDK while continuing +to use the `mlx5_core` driver. This would be a bit different with Intel based cards, as there the +driver has to be rebound to `vfio_pci` or `uio_pci_generic`. Here, the NIC itself remains visible +(and usable!) in Linux, which is kind of neat. + +I do my standard set of eight loadtests: {unidirectional,bidirectional} x {1514b, 64b multiflow, 64b +singleflow, MPLS}. This teaches me a lot about how the NIC uses flow hashing, and what it's maximum +performance is. Without further ado, here's the results: + +| Loadtest: Gowin CX4 DPDK | L1 bits/sec | Packets/sec | % of Line | +|----------------------------------|-------------|-------------|-----------| +| 1514b-unidirectional | 25.00 Gbps | 2.04 Mpps | 100.2 % | +| 64b-unidirectional | 7.43 Gbps | 11.05 Mpps | 29.7 % | +| 64b-single-unidirectional | 3.09 Gbps | 4.59 Mpps | 12.4 % | +| 64b-mpls-unidirectional | 7.34 Gbps | 10.93 Mpps | 29.4 % | +| 1514b-bidirectional | 22.63 Gbps | 1.84 Mpps | 45.2 % | +| 64b-bidirectional | 7.42 Gbps | 11.04 Mpps | 14.8 % | +| 64b-single-bidirectional | 5.33 Gbps | 7.93 Mpps | 10.7 % | +| 64b-mpls-bidirectional | 7.36 Gbps | 10.96 Mpps | 14.8 % | + +Some observations: + +* In the large packet department, the NIC easily saturates the port speed in _unidirectional_, and + saturates the PCI bus (x4) in _bidirectional_ forwarding. I'm surprised that the bidirectional + forwarding capacity is a bit lower (1.84Mpps versus 2.04Mpps). 
+* The NIC is using three queues, and the difference between single flow (which could only use one + queue, and one CPU thread) is not exactly linear (4.59Mpps vs 11.05Mpps for 3 RX queues) +* The MPLS performance is higher than single flow, which I think means that the NIC is capable of + hashing the packets based on the _inner packet_. Otherwise, while using the same MPLS label, the + Cx3 and other NICs tend to just leverage only one receive queue. + +I'm very curious how this NIC stacks up between DPDK and RDMA -- read on below for my results! + +### DPDK: ConnectX-5 EN + +I swap the card out of its OCP bay and replace it with a ConnectX-5 EN that I have from when I +tested the [[R86S]({% post_url 2024-07-05-r86s %})]. It identifies as: + +``` +0e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] +0e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] +``` + +And similar to the ConnectX-4, this card also complains about PCIe bandwidth: + +``` +[6.478898] mlx5_core 0000:0e:00.0: firmware version: 16.25.4062 +[6.485393] mlx5_core 0000:0e:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link + at 0000:00:1d.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link) +[6.816156] mlx5_core 0000:0e:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384) +[6.841005] mlx5_core 0000:0e:00.0: Port module event: module 0, Cable plugged +[7.023602] mlx5_core 0000:0e:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) +[7.177744] mlx5_core 0000:0e:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295 +``` + +With that said, the loadtests are quite a bit more favorable for the newer ConnectX-5: + + +| Loadtest: Gowin CX5 DPDK | L1 bits/sec | Packets/sec | % of Line | +|----------------------------------|-------------|-------------|-----------| +| 1514b-unidirectional | 24.98 Gbps | 2.04 Mpps | 99.7 % | +| 64b-unidirectional | 10.71 Gbps | 15.93 Mpps | 42.8 % | +| 64b-single-unidirectional | 4.44 Gbps | 6.61 Mpps | 17.8 % | +| 64b-mpls-unidirectional | 10.36 Gbps | 15.42 Mpps | 41.5 % | +| 1514b-bidirectional | 24.70 Gbps | 2.01 Mpps | 49.4 % | +| 64b-bidirectional | 14.58 Gbps | 21.69 Mpps | 29.1 % | +| 64b-single-bidirectional | 8.38 Gbps | 12.47 Mpps | 16.8 % | +| 64b-mpls-bidirectional | 14.50 Gbps | 21.58 Mpps | 29.1 % | + +Some observations: + +* The NIC also saturates 25G in one direction with large packets, and saturates the PCI bus when + pushing in both directions. +* Single queue / thread operation at 6.61Mpps is a fair bit higher than Cx4 (which is 4.59Mpps) +* Multiple threads scale almost linearly, from 6.61Mpps in 1Q to 15.93Mpps in 3Q. That's + respectable! +* Bidirectional small packet performance is pretty great at 21.69Mpps, more than double that of + the Cx4 (which is 11.04Mpps). +* MPLS rocks! The NIC forwards 21.58Mpps of MPLS traffic. + +One thing I should note, is that at this point, the CPUs are not fully saturated. Looking at +Prometheus/Grafana for this set of loadtests: + +{{< image src="/assets/gowin-n305/cx5-cpu.png" alt="Cx5 CPU" >}} + +What I find interesting is that in no cases did any CPU thread run to 100% utilization. In the 64b +single flow loadtests (from 14:00-14:10 and from 15:05-15:15), the CPU threads definitely got close, +but they did not clip -- which does lead me to believe that the NIC (or the PCIe bus!) are the +bottleneck. 
+ +By the way, the bidirectional single flow 64b loadtest shows two threads that have an overall +slightly _lower_ utilization (63%) versus the unidirectional single flow 64 loadtest (at 78.5%). I +think this can be explained by the two threads being able to use/re-use each others' cache lines. + + +***Conclusion***: ConnectX-5 performs significantly better than ConnectX-4 with DPDK. + +## RDMA + +{{< image float="right" src="/assets/gowin-n305/rdma.png" alt="RDMA artwork" >}} + +RDMA supports zero-copy networking by enabling the network adapter to transfer data from the wire +directly to application memory or from application memory directly to the wire, eliminating the need +to copy data between application memory and the data buffers in the operating system. Such transfers +require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel +with other system operations. This reduces latency in message transfer. + +You can read more about it on [[Wikipedia](https://en.wikipedia.org/wiki/Remote_direct_memory_access)] +VPP uses RDMA in a clever way, relying on the Linux library for rdma-core (libibverb) to create a +custom userspace poll-mode driver, specifically for Ethernet packets. Despite using the RDMA APIs, +this is not about RDMA (no Infiniband, no RoCE, no iWARP), just pure traditional Ethernet packets. +Many VPP developers recommend and prefer RDMA for Mellanox devices. I myself have been more +comfortable with DPDK. But, now is the time to _FAFO_. + +### RDMA: ConnectX-4 Lx + +Considering I used three RX queues for DPDK, I instruct VPP now to use 3 receive queues for RDMA as +well. I remove the `dpdk_plugin.so` from `startup.conf`, although I could also have kept the DPDK +plugin running (to drive the 1.0G and 2.5G ports!) and de-selected the `0000:0e:00.0` and +`0000:0e:00.1` PCI entries, so that the RDMA driver can grab them. + +The VPP startup now looks like this: + +``` +vpp# create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 512 tx-queue-size 512 + num-rx-queues 3 no-multi-seg no-striding max-pktlen 2026 +vpp# create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 512 tx-queue-size 512 + num-rx-queues 3 no-multi-seg no-striding max-pktlen 2026 +vpp# set int mac address xxv0 02:fe:4a:ce:c2:fc +vpp# set int mac address xxv1 02:fe:4e:f5:82:e7 +``` + +I realize something pretty cool - the RDMA interface gets an ephemeral (randomly generated) MAC +address, while the main network card in Linux stays available. The NIC internally has a hardware +filter for the RDMA bound MAC address and gives it to VPP -- the implication is that the 25G NICs +can *also* be used in Linux itself. That's slick. + +Performance wise: + +| Loadtest: Gowin CX4 with RDMA | L1 bits/sec | Packets/sec | % of Line | +|----------------------------------|-------------|-------------|-----------| +| 1514b-unidirectional | 25.01 Gbps | 2.04 Mpps | 100.3 % | +| 64b-unidirectional | 12.32 Gbps | 18.34 Mpps | 49.1 % | +| 64b-single-unidirectional | 6.21 Gbps | 9.24 Mpps | 24.8 % | +| 64b-mpls-unidirectional | 11.95 Gbps | 17.78 Mpps | 47.8 % | +| 1514b-bidirectional | 26.24 Gbps | 2.14 Mpps | 52.5 % | +| 64b-bidirectional | 14.94 Gbps | 22.23 Mpps | 29.9 % | +| 64b-single-bidirectional | 11.53 Gbps | 17.16 Mpps | 23.1 % | +| 64b-mpls-bidirectional | 14.99 Gbps | 22.30 Mpps | 30.0 % | + +Some thoughts: + +* The RDMA driver is significantly _faster_ than DPDK in this configuration. Hah! 
+* 1514b are fine in both directions, RDMA slightly outperforms DPDK in the bidirectional test.
+* 64b is massively faster:
+ * Unidirectional multiflow: RDMA 18.34Mpps, DPDK 11.05Mpps
+ * Bidirectional multiflow: RDMA 22.23Mpps, DPDK 11.04Mpps
+ * Bidirectional MPLS: RDMA 22.30Mpps, DPDK 10.96Mpps.
+
+***Conclusion***: I would say, roughly speaking, that RDMA outperforms DPDK on the Cx4 by a factor
+of two. That's really cool, especially because ConnectX-4 network cards are found very cheap these
+days.
+
+### RDMA: ConnectX-5 EN
+
+Well then, what about the newer Mellanox ConnectX-5 card? Something surprising happens when I boot
+the machine and start the exact same configuration as with the Cx4: the loadtest results almost
+invariably suck:
+
+| Loadtest: Gowin CX5 with RDMA | L1 bits/sec | Packets/sec | % of Line |
+|----------------------------------|-------------|-------------|-----------|
+| 1514b-unidirectional | 24.95 Gbps | 2.03 Mpps | 99.6 % |
+| 64b-unidirectional | 6.19 Gbps | 9.22 Mpps | 24.8 % |
+| 64b-single-unidirectional | 3.27 Gbps | 4.87 Mpps | 13.1 % |
+| 64b-mpls-unidirectional | 6.18 Gbps | 9.20 Mpps | 24.7 % |
+| 1514b-bidirectional | 24.59 Gbps | 2.00 Mpps | 49.2 % |
+| 64b-bidirectional | 8.84 Gbps | 13.15 Mpps | 17.7 % |
+| 64b-single-bidirectional | 5.57 Gbps | 8.29 Mpps | 11.1 % |
+| 64b-mpls-bidirectional | 8.77 Gbps | 13.05 Mpps | 17.5 % |
+
+Yikes! The Cx5 in its default mode can still saturate the 1514b loadtests, but turns in single digit
+results with almost all other loadtest types. I'm also surprised that the single flow loadtest
+clocks in at only 4.87Mpps, which is about the same speed I saw with the ConnectX-4 using DPDK. This
+does not look good at all, and honestly, I don't believe it.
+
+So I start fiddling with settings.
+
+#### ConnectX-5 EN: Tuning Parameters
+
+There are a few things I found that might speed up processing in the ConnectX network card:
+
+1. Allowing for larger PCI packets - by default 512b, I can raise this to 1k, 2k or even 4k.
+ `setpci -s 0e:00.0 68.w` will return some hex number ABCD, the A here stands for max read size.
+ 0=128b, 1=256b, 2=512b, 3=1k, 4=2k, 8=4k. I can set the value by writing `setpci -s 0e:00.0
+ 68.w=3BCD`, which immediately speeds up the loadtests!
+1. Mellanox recommends turning on CQE compression, to allow for the PCI messages to be aggressively
+ compressed, saving bandwidth. This helps specifically with _smaller_ packets, as the PCI message
+ overhead really starts to matter. `mlxconfig -d 0e:00.0 set CQE_COMPRESSION=1` and reboot.
+1. For MPLS, the Cx5 can do flow matching on the inner packet (rather than hashing all packets to
+ the same queue based on the MPLS label) -- `mlxconfig -d 0e:00.0 set
+ FLEX_PARSER_PROFILE_ENABLE=1` and reboot.
+1. Likely the number of receive queues matters, and can be set in the `create interface rdma`
+ command.
+
+I notice that CQE_COMPRESSION and FLEX_PARSER_PROFILE_ENABLE help in all cases, so I set them and
+reboot. The PCI packet resizing also helps specifically with smaller packets, so I set that too in
+`/etc/rc.local`, as sketched below. The fourth variable is left over, which is varying the receive
+queue count.
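+
+For reference, here is roughly what that ends up looking like -- a sketch only, assuming the NIC is
+still at PCI address `0e:00.0` and that the lower three nibbles read back from `setpci` stay as they
+were (the `mlxconfig` changes live in the NIC's NVRAM and only take effect after a reboot):
+
+```
+# one-time, persisted by the NIC, applied at the next reboot:
+mlxconfig -d 0e:00.0 set CQE_COMPRESSION=1
+mlxconfig -d 0e:00.0 set FLEX_PARSER_PROFILE_ENABLE=1
+
+# in /etc/rc.local, re-applied on every boot: raise max read request size to 1k by
+# setting the leading nibble to 3 (first read 'setpci -s 0e:00.0 68.w' and keep the
+# remaining 'BCD' nibbles exactly as returned)
+setpci -s 0e:00.0 68.w=3BCD
+```
+
+With that in place, here's a comparison that, to me at least, was surprising.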
With three receive queues, thus three +CPU threads each receiving 4.7Mpps and sending 3.1Mpps, performance looked like this: +``` +$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096 + num-rx-queues 3 mode dv no-multi-seg max-pktlen 2026 +$ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096 + num-rx-queues 3 mode dv no-multi-seg max-pktlen 2026 + +$ vppctl show run | grep vector\ rates | grep -v in\ 0 + vector rates in 4.7586e6, out 3.2259e6, drop 3.7335e2, punt 0.0000e0 + vector rates in 4.9881e6, out 3.2206e6, drop 3.8344e2, punt 0.0000e0 + vector rates in 5.0136e6, out 3.2169e6, drop 3.7335e2, punt 0.0000e0 +``` + +This is fishy - why is the inbound rate much higher than the outbound rate? The behavior is +consistent in multi-queue setups. If I create 2 queues it's 8.45Mpps in and 7.98Mpps out: + +``` +$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096 + num-rx-queues 2 mode dv no-multi-seg max-pktlen 2026 +$ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096 + num-rx-queues 2 mode dv no-multi-seg max-pktlen 2026 + +$ vppctl show run | grep vector\ rates | grep -v in\ 0 + vector rates in 8.4533e6, out 7.9804e6, drop 0.0000e0, punt 0.0000e0 + vector rates in 8.4517e6, out 7.9798e6, drop 0.0000e0, punt 0.0000e0 +``` + +And when I create only one queue, the same appears: + +``` +$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096 + num-rx-queues 1 mode dv no-multi-seg max-pktlen 2026 +$ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096 + num-rx-queues 1 mode dv no-multi-seg max-pktlen 2026 + +$ vppctl show run | grep vector\ rates | grep -v in\ 0 + vector rates in 1.2082e7, out 9.3865e6, drop 0.0000e0, punt 0.0000e0 +``` + +But now that I've scaled down to only one queue (and thus one CPU thread doing all the work), I +manage to find a clue in the `show runtime` command: +``` +Thread 1 vpp_wk_0 (lcore 1) +Time 321.1, 10 sec internal node vector rate 256.00 loops/sec 46813.09 + vector rates in 1.2392e7, out 9.4015e6, drop 0.0000e0, punt 1.5571e-2 + Name State Calls Vectors Suspends Clocks Vectors/Call +ethernet-input active 15543357 3979099392 0 2.79e1 256.00 +ip4-input-no-checksum active 15543352 3979098112 0 1.26e1 256.00 +ip4-load-balance active 15543357 3979099387 0 9.17e0 255.99 +ip4-lookup active 15543357 3979099387 0 1.43e1 255.99 +ip4-rewrite active 15543357 3979099387 0 1.69e1 255.99 +rdma-input polling 15543357 3979099392 0 2.57e1 256.00 +xxv1-output active 15543357 3979099387 0 5.03e0 255.99 +xxv1-tx active 15543357 3018807035 0 4.35e1 194.22 +``` + +It takes a bit of practice to spot this, but see how `xx1-output` is running at 256 vectors/call, +while `xxv1-tx` is running at only 194.22 vectors/call? That means that VPP is dutifully handling +the whole packet, but when it is handed off to RDMA to marshall onto the hardware, it's getting +lost! And indeed, this is corroborated by `show errors`: + +``` +$ vppctl show err + Count Node Reason Severity + 3334 null-node blackholed packets error + 7421 ip4-arp ARP requests throttled info + 3 ip4-arp ARP requests sent info +1454511616 xxv1-tx no free tx slots error + 16 null-node blackholed packets error +``` + +Wow, billions of packets have been routed by VPP but then had to be discarded because RDMA output +could not keep up. Ouch. 
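+
+As an aside: these error counters accumulate across loadtests, so when I'm chasing a number like
+this one, I zero the counters just before starting a run, so that whatever `show errors` reports
+afterwards belongs to that run alone. A minimal sketch, using nothing more exotic than stock
+`vppctl` commands:
+
+```
+$ vppctl clear errors; vppctl clear runtime; vppctl clear interfaces
+$ # ... run the loadtest ...
+$ vppctl show errors
+$ vppctl show runtime | grep -e '-tx' -e 'vector rates'
+```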
+
+Compare the previous CPU utilization graph (from the Cx5/DPDK loadtest) with this Cx5/RDMA/1-RXQ
+loadtest:
+
+{{< image src="/assets/gowin-n305/cx5-cpu-rdma1q.png" alt="Cx5 CPU with 1Q" >}}
+
+{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
+
+Here I can clearly see that the one CPU thread (in yellow for unidirectional) and the two CPU
+threads (one for each of the bidirectional flows) jump up to 100% and stay there. This means that
+when VPP is completely pegged, it is receiving 12.4Mpps _per core_, but only manages to get RDMA to
+send 9.40Mpps of those on the wire. The performance further deteriorates when multiple receive queues
+are in play. Note: 12.4Mpps is pretty great for these CPU threads.
+
+
+***Conclusion***: A single queue RDMA based Cx5 will allow for about 9Mpps per interface, which is a
+little bit better than DPDK performance; but Cx4 and Cx5 performance are not too far apart.
+
+## Summary and closing thoughts
+
+Looking at the RDMA results for both Cx4 and Cx5: using only one thread gives fair performance
+with very low CPU cost per port -- however I could not manage to get rid of the `no free tx slots`
+errors, and VPP can consume / process / forward more packets than RDMA is willing to marshall out on
+the wire, which is disappointing.
+
+That said, both RDMA and DPDK performance is line rate at 25G unidirectional with sufficiently large
+packets, and for small packets, it can realistically handle roughly 9Mpps per CPU thread.
+Considering the CPU has 8 threads -- of which 6 are usable by VPP -- the machine has more CPU than it
+needs to drive the NICs. It should be a really great router at 10Gbps traffic rates, and a very fair
+router at 25Gbps with either RDMA or DPDK.
+
+Here are a few files I gathered along the way, in case they are useful:
+
+* [[LSCPU](/assets/gowin-n305/lscpu.txt)] - [[Likwid Topology](/assets/gowin-n305/likwid-topology.txt)] -
+ [[DMI Decode](/assets/gowin-n305/dmidecode.txt)] - [[LSBLK](/assets/gowin-n305/lsblk.txt)]
+* Mellanox MCX4421A-ACAN: [[dmesg](/assets/gowin-n305/dmesg-cx4.txt)] - [[LSPCI](/assets/gowin-n305/lspci-cx4.txt)] -
+ [[LSHW](/assets/gowin-n305/lshw-cx4.txt)]
+* Mellanox MCX542B-ACAN: [[dmesg](/assets/gowin-n305/dmesg-cx5.txt)] - [[LSPCI](/assets/gowin-n305/lspci-cx5.txt)] -
+ [[LSHW](/assets/gowin-n305/lshw-cx5.txt)]
+* VPP Configs: [[startup.conf](/assets/gowin-n305/vpp/startup.conf)] - [[L2 Config](/assets/gowin-n305/vpp/config/l2.vpp)] -
+ [[L3 Config](/assets/gowin-n305/vpp/config/l3.vpp)] - [[MPLS Config](/assets/gowin-n305/vpp/config/mpls.vpp)]
+