---
date: "2023-02-24T07:31:00Z"
title: 'Case Study: VPP at Coloclue, part 2'
aliases:
- /s/articles/2023/02/24/coloclue-vpp-2.html
---

{{< image width="300px" float="right" src="/assets/coloclue-vpp/coloclue_logo2.png" alt="Yoloclue" >}}

* Author: Pim van Pelt, Rogier Krieger
* Reviewers: Coloclue Network Committee
* Status: Draft - Review - **Published**

Almost precisely two years ago, in February of 2021, I created a loadtesting environment at
[[Coloclue](https://coloclue.net)] to prove that a provider of L2 connectivity between two
datacenters in Amsterdam was not incurring jitter or loss on its services -- I wrote up my findings
in [[an article]({{< ref "2021-02-27-coloclue-loadtest" >}})], which demonstrated that the service
provider indeed delivers a perfect service. One month later, in March 2021, I briefly ran
[[VPP](https://fd.io)] on one of the routers at Coloclue, but due to lack of time and a few
technical hurdles along the way, I had to roll back [[ref]({{< ref "2021-03-27-coloclue-vpp" >}})].

## The Problem

Over the years, Coloclue AS8283 has continued to suffer from packet loss in its network. A look
at a simple traceroute, in this case from IPng AS8298, shows very high variance and _packetlo_
when entering the network (at hop 5, a router called `eunetworks-2.router.nl.coloclue.net`):

```
                                      My traceroute  [v0.94]
squanchy.ipng.ch (194.1.163.90) -> 185.52.227.1                        2023-02-24T09:03:36+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                   Packets               Pings
 Host                                            Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. chbtl0.ipng.ch                                0.0% 49904    1.3   0.9   0.7   1.7   0.2
 2. chrma0.ipng.ch                                0.0% 49904    1.7   1.2   1.2   2.1   0.9
 3. defra0.ipng.ch                                0.0% 49904    6.3   6.2   6.0  19.2   1.3
 4. nlams0.ipng.ch                                0.0% 49904   12.7  12.6  12.4  19.8   1.8
 5. bond0-105.eunetworks-2.router.nl.coloclue.net 0.2% 49903   98.8  12.3  12.0 272.8  23.0
 6. 185.52.227.1                                  6.6% 49903   15.3  12.5  12.3 308.7  20.4
```

The last two hops show packet loss well north of 6.5%. Some paths are better, some are worse,
but notably, when more than one router is in the path, it's difficult to pinpoint where or what is
responsible. Honestly though, any source will reveal packet loss and high variance when traversing
one or more Coloclue routers, to a greater or lesser degree:

--------------- | ---------------------
 | 

_The screenshots above are smokeping from (left) a machine at AS8283 Coloclue (in Amsterdam, the
Netherlands), and from (right) a machine at AS8298 IPng (in Brüttisellen, Switzerland), both
showing ~4.8-5.0% packetlo and high variance in end-to-end latency. No bueno!_

## Isolating a Device Under Test

Because Coloclue has several routers, I want to ensure that traffic traverses only the _one router_ under
test. I decide to use an allocated but currently unused IPv4 prefix and announce it from only one
of the four routers, so that all traffic to and from that /24 goes over that router. Coloclue uses a
piece of software called [Kees](https://github.com/coloclue/kees.git), a set of Python and Jinja2
scripts that generate a Bird 1.6 configuration for each router. This is great, because it allows me to
add a small feature to get what I need: **beacons**.

### Setting up the beacon

A beacon is a prefix that is announced to (some, or all) peers on the internet to attract traffic in a
particular way. I added a function called `is_coloclue_beacon()` which reads the input YAML file and
uses a construction similar to the existing feature for "supernets". It determines whether a given prefix
must be announced to peers and upstreams. Any IPv4 and IPv6 prefixes from the `beacons` list will then
be matched in `is_coloclue_beacon()` and announced. For the curious, [[this
commit](https://github.com/coloclue/kees/commit/3710f1447ade10384c86f35b2652565b440c6aa6)] holds the
logic and tests to ensure this is safe.

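The matching itself boils down to an exact comparison of a candidate prefix against the configured
beacon list. A minimal sketch of that idea in Python, using the `ipaddress` module -- the function
name and the shape of the beacon list merely mirror the Kees YAML above, this is an illustration and
not the actual Kees code:

```python
import ipaddress

# Hypothetical beacon list, shaped like the YAML stanza used by Kees.
BEACONS = [
    {"prefix": "185.52.227.0", "length": 24},
]

def is_beacon(prefix: str) -> bool:
    """Return True if the given prefix exactly matches a configured beacon."""
    candidate = ipaddress.ip_network(prefix)
    for b in BEACONS:
        beacon = ipaddress.ip_network(f'{b["prefix"]}/{b["length"]}')
        if candidate == beacon:
            return True
    return False

print(is_beacon("185.52.227.0/24"))  # the beacon itself: True
print(is_beacon("185.52.0.0/17"))    # an unrelated prefix: False
```

Exact matching (rather than subnet containment) is what keeps this safe: only prefixes deliberately
listed as beacons ever get announced.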
Based on a per-router config (e.g. `vars/eunetworks-2.router.nl.coloclue.net.yml`) I can now add the
following YAML stanza:

```
coloclue:
  beacons:
    - prefix: "185.52.227.0"
      length: 24
      comment: "VPP test prefix (pim, rogier)"
```

And further, from this router, I can forward all traffic destined to this /24 to a machine running
in EUNetworks (my Dell R630 called `hvn0.nlams2.ipng.ch`), using a simple static route:

```
statics:
  ...
  - route: "185.52.227.0/24"
    via: "94.142.240.71"
    comment: "VPP test prefix (pim, rogier)"
```

After running Kees, I can now see traffic for that /24 show up on my machine. The last step is to
ensure that traffic destined for the beacon will always traverse back over `eunetworks-2`.
Coloclue uses VRRP, and sometimes another router may be the primary. With a little trick on
my machine, I can force the traffic by means of _policy based routing_:

```
pim@hvn0-nlams2:~$ sudo ip ro add default via 94.142.240.254
pim@hvn0-nlams2:~$ sudo ip ro add prohibit 185.52.227.0/24
pim@hvn0-nlams2:~$ sudo ip addr add 185.52.227.1/32 dev lo
pim@hvn0-nlams2:~$ sudo ip rule add from 185.52.227.0/24 lookup 10
pim@hvn0-nlams2:~$ sudo ip ro add default via 94.142.240.253 table 10
```

First, I set the default gateway to be the VRRP address that floats between multiple routers. Then,
I set a `prohibit` route for the covering /24, which means the machine will send an ICMP
unreachable (rather than silently discarding the packets), which can be useful later. Next, I add .1 as an
IPv4 address on loopback, after which the machine will start replying to ICMP packets there with
icmp-echo rather than dst-unreach. To make sure routing is always symmetric, I add an `ip rule`,
which is a classifier that matches packets based on their source address and diverts them to
an alternate routing table, which has only one entry: send via .253 (which is `eunetworks-2`).

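The rule-based lookup above can be modeled in a few lines of Python: packets sourced from the beacon
prefix consult table 10 (next hop .253, always `eunetworks-2`), while everything else uses the main
table (next hop .254, the floating VRRP address). The dictionaries are a simplified stand-in for the
kernel's routing tables, just to illustrate the five `ip` commands above:

```python
import ipaddress

BEACON = ipaddress.ip_network("185.52.227.0/24")

# Simplified view of the two routing tables configured above.
MAIN_TABLE = {"default": "94.142.240.254"}  # floating VRRP gateway
TABLE_10 = {"default": "94.142.240.253"}    # always eunetworks-2

def next_hop(src: str) -> str:
    """Pick the egress gateway the way 'ip rule from 185.52.227.0/24 lookup 10' would."""
    if ipaddress.ip_address(src) in BEACON:
        return TABLE_10["default"]
    return MAIN_TABLE["default"]

print(next_hop("185.52.227.1"))   # beacon-sourced: via eunetworks-2 (.253)
print(next_hop("94.142.240.71"))  # regular traffic: via the VRRP primary (.254)
```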
Let me show this in action:

```
pim@hvn0-nlams2:~$ dig +short -x 94.142.240.254
eunetworks-gateway-100.router.nl.coloclue.net.
pim@hvn0-nlams2:~$ dig +short -x 94.142.240.253
bond0-100.eunetworks-2.router.nl.coloclue.net.
pim@hvn0-nlams2:~$ dig +short -x 94.142.240.252
bond0-100.eunetworks-3.router.nl.coloclue.net.

pim@hvn0-nlams2:~$ ip -4 nei | grep '94.142.240.25[234]'
94.142.240.252 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE
94.142.240.253 dev coloclue lladdr 64:9d:99:b1:31:af REACHABLE
94.142.240.254 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE
```

In the output above, I can see that `eunetworks-2` (94.142.240.253) has MAC address
`64:9d:99:b1:31:af`, and that `eunetworks-3` (94.142.240.252) has MAC address `64:9d:99:b1:31:db`.
My default gateway, handled by VRRP, is at .254 and uses the second MAC address, so I know that
`eunetworks-3` is primary, and will handle my egress traffic.

### Verifying symmetric routing of the beacon

As a quick demonstration of the symmetric routing, I can run tcpdump and see that my "usual"
egress traffic is sent to the MAC address of the VRRP primary (which I showed to be
`eunetworks-3` above), while traffic coming from 185.52.227.0/24 ought to be sent to the
MAC address of `eunetworks-2`, due to the `ip rule` and alternate routing table 10:

```
pim@hvn0-nlams2:~$ sudo tcpdump -eni coloclue host 194.1.163.93 and icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on coloclue, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:02:17.193844 64:9d:99:b1:31:af > 6e:fa:52:d0:c1:ff, ethertype IPv4 (0x0800), length 98:
    194.1.163.93 > 94.142.240.71: ICMP echo request, id 16287, seq 1, length 64
10:02:17.193882 6e:fa:52:d0:c1:ff > 64:9d:99:b1:31:db, ethertype IPv4 (0x0800), length 98:
    94.142.240.71 > 194.1.163.93: ICMP echo reply, id 16287, seq 1, length 64

10:02:19.276657 64:9d:99:b1:31:af > 6e:fa:52:d0:c1:ff, ethertype IPv4 (0x0800), length 98:
    194.1.163.93 > 185.52.227.1: ICMP echo request, id 6646, seq 1, length 64
10:02:19.276694 6e:fa:52:d0:c1:ff > 64:9d:99:b1:31:af, ethertype IPv4 (0x0800), length 98:
    185.52.227.1 > 194.1.163.93: ICMP echo reply, id 6646, seq 1, length 64
```

It takes a keen eye to spot the difference here: the first packet (which is going to the main
IPv4 address 94.142.240.71) is returned via MAC address `64:9d:99:b1:31:db` (the VRRP default
gateway), but the second one (going to the beacon 185.52.227.1) is returned via MAC address
`64:9d:99:b1:31:af`.

I've now ensured that traffic to and from 185.52.227.1 will always traverse the DUT
(`eunetworks-2` with MAC `64:9d:99:b1:31:af`). Very elegant :-)

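For a longer capture, eyeballing MAC addresses gets tedious. A short script can classify return
traffic by destination MAC instead; a sketch, assuming `tcpdump -e` lines shaped exactly like the
output above (the MAC-to-router mapping comes from the earlier `ip -4 nei` output):

```python
# Classify 'tcpdump -e' lines by destination MAC address, to check which
# router carries the return traffic.
ROUTERS = {
    "64:9d:99:b1:31:db": "eunetworks-3",
    "64:9d:99:b1:31:af": "eunetworks-2",
}

def egress_router(line: str) -> str:
    # Token layout of a 'tcpdump -e' line: timestamp, src-mac, '>', dst-mac, ...
    dst_mac = line.split()[3].rstrip(",")
    return ROUTERS.get(dst_mac, "unknown")

reply_via_vrrp = ("10:02:17.193882 6e:fa:52:d0:c1:ff > 64:9d:99:b1:31:db, "
                  "ethertype IPv4 (0x0800), length 98:")
reply_via_dut = ("10:02:19.276694 6e:fa:52:d0:c1:ff > 64:9d:99:b1:31:af, "
                 "ethertype IPv4 (0x0800), length 98:")

print(egress_router(reply_via_vrrp))  # eunetworks-3
print(egress_router(reply_via_dut))   # eunetworks-2
```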
## Installing VPP

I've written about this before; the general _spiel_ is just to follow my previous article (I'm
often very glad to read back my own articles, as they serve as pretty good documentation for my
forgetful chipmunk-sized brain!), so here I'll only recap what's already written in
[[vpp-7]({{< ref "2021-09-21-vpp-7" >}})]:

1. Build VPP with Linux Control Plane
1. Bring `eunetworks-2` into maintenance mode, so we can safely tinker with it
1. Start services like ssh, snmp, keepalived and bird in a new `dataplane` namespace
1. Start VPP and give the LCP interfaces the same names as their originals
1. Slowly introduce the router: OSPF, OSPFv3, iBGP, members-bgp, eBGP, in that order
1. Re-enable keepalived and let the machine forward traffic
1. Stare at the latency graphs

{{< image width="500px" float="right" src="/assets/coloclue-vpp/likwid-topology.png" alt="Likwid Topology" >}}

**1. BUILD:** For the first step, the build is straightforward, and yields a VPP instance based on
`vpp-ext-deps_23.06-1` at version `23.06-rc0~71-g182d2b466`, which contains my
[[LCPng](https://git.ipng.ch/ipng/lcpng.git)] plugin. I then copy the packages to the router.
The router has an E-2286G CPU @ 4.00GHz with 6 cores and 6 hyperthreads. There's a really handy tool
called `likwid-topology` that can show how the L1, L2 and L3 caches line up with respect to CPU
cores. Here I learn that CPUs (0+6) and (1+7) share L1 and L2 cache -- so I can conclude that 0-5 are
CPU cores which share a hyperthread with 6-11, respectively.

I also see that L3 cache is shared across all of the cores+hyperthreads, which is normal. I decide
to give CPUs 0,1 and their hyperthreads 6,7 to Linux for general purpose scheduling, and to
reserve the remaining CPUs and their hyperthreads for VPP. So the kernel is rebooted with
`isolcpus=2-5,8-11`.

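On this topology the hyperthread sibling of core N is simply N+6, so the `isolcpus=` value can be
derived mechanically. A small sketch of that bookkeeping in Python -- the core counts match this
router, but nothing here queries the real topology (that's what `likwid-topology` is for):

```python
NUM_CORES = 6         # physical cores; the sibling of core n is n + NUM_CORES
LINUX_CORES = [0, 1]  # cores (and their siblings) left to the kernel

def isolcpus(num_cores: int, linux_cores: list) -> str:
    """Build an isolcpus= string covering all cores not given to Linux,
    plus their hyperthread siblings."""
    vpp = [c for c in range(num_cores) if c not in linux_cores]
    cpus = vpp + [c + num_cores for c in vpp]
    # Collapse consecutive CPU ids into ranges, e.g. [2,3,4,5] -> "2-5".
    ranges, start, prev = [], None, None
    for c in cpus:
        if start is None:
            start = prev = c
        elif c == prev + 1:
            prev = c
        else:
            ranges.append(f"{start}-{prev}" if start != prev else str(start))
            start = prev = c
    ranges.append(f"{start}-{prev}" if start != prev else str(start))
    return ",".join(ranges)

print(isolcpus(NUM_CORES, LINUX_CORES))  # 2-5,8-11
```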
**2. DRAIN:** In the mean time, Rogier prepares the drain, which is a two-step process. First he marks all
the BGP sessions as `graceful_shutdown: True`, and waits for the traffic to die down. Then, he marks
the machine as `maintenance_mode: True`, which makes Kees set the OSPF cost to 65535 and avoid
attracting or sending traffic through this machine. After he submits these, we are free to tinker
with the router, as it will not affect any Coloclue members. Rogier also ensures we will keep access
to this little machine in Amsterdam, by preparing an IPMI serial-over-LAN connection and KVM.

**3. PREPARE:** Starting sshd and snmpd in the dataplane namespace is the most important part. This way, we will be
able to scrape the machine using SNMP just as if it were a Linux native router. And of course we
will want to be able to log in to the router. I start with these two services; the only small
note is that, because I want to run two copies (one in the default namespace and one additional
one in the dataplane namespace), I'll want to tweak the startup flags (pid file, config file, etc) a
little bit:

```
## in snmpd-dataplane.service
ExecStart=/sbin/ip netns exec dataplane /usr/sbin/snmpd -LOw -u Debian-snmp \
    -g vpp -I -smux,mteTrigger,mteTriggerConf -f -p /run/snmpd-dataplane.pid \
    -C -c /etc/snmp/snmpd-dataplane.conf

## in ssh-dataplane.service
ExecStart=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd \
    -oPidFile=/run/sshd-dataplane.pid -D $SSHD_OPTS
```

**4. LAUNCH:** Now what's left for us to do is switch from our SSH session to an IPMI serial-over-LAN session,
so that we can safely transition to the VPP world. Rogier and I log in and share a tmux session,
after which I bring down all ethernet links, remove the VLAN sub-interfaces and the LACP BondEthernet,
leaving only the main physical interfaces. I then set link down on them, and restart VPP -- which
will take all DPDK eligible interfaces that are link admin-down, and then let the magic happen:

```
root@eunetworks-2:~# vppctl show int
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)
GigabitEthernet5/0/0              5     down         9000/0/0/0
GigabitEthernet6/0/0              6     down         9000/0/0/0
TenGigabitEthernet1/0/0           1     down         9000/0/0/0
TenGigabitEthernet1/0/1           2     down         9000/0/0/0
TenGigabitEthernet1/0/2           3     down         9000/0/0/0
TenGigabitEthernet1/0/3           4     down         9000/0/0/0
```

Dope! One way to trick the rest of the machine into thinking it hasn't changed is to recreate these
interfaces in the `dataplane` network namespace using their _original_ interface names (e.g.
`enp1s0f3` for AMS-IX, and `bond0` for the LACP-signaled BondEthernet that we'll create). Rogier
prepared an excellent `vppcfg` config file:

```
loopbacks:
  loop0:
    description: 'eunetworks-2.router.nl.coloclue.net'
    lcp: 'loop0'
    mtu: 9216
    addresses: [ 94.142.247.3/32, 2a02:898:0:300::3/128 ]

bondethernets:
  BondEthernet0:
    description: 'Core: MLAG member switches'
    interfaces: [ TenGigabitEthernet1/0/0, TenGigabitEthernet1/0/1 ]
    mode: 'lacp'
    load-balance: 'l34'
    mac: '64:9d:99:b1:31:af'

interfaces:
  GigabitEthernet5/0/0:
    description: "igb 0000:05:00.0 eno1 # FiberRing"
    lcp: 'eno1'
    mtu: 9216
    sub-interfaces:
      205:
        description: "Peering: Arelion"
        lcp: 'eno1.205'
        addresses: [ 62.115.144.33/31, 2001:2000:3080:ebc::2/126 ]
        mtu: 1500
      992:
        description: "Transit: FiberRing"
        lcp: 'eno1.992'
        addresses: [ 87.255.32.130/30, 2a00:ec8::102/126 ]
        mtu: 1500

  GigabitEthernet6/0/0:
    description: "igb 0000:06:00.0 eno2 # Free"
    lcp: 'eno2'
    mtu: 9216
    state: down

  TenGigabitEthernet1/0/0:
    description: "i40e 0000:01:00.0 enp1s0f0 (bond-member)"
    mtu: 9216

  TenGigabitEthernet1/0/1:
    description: "i40e 0000:01:00.1 enp1s0f1 (bond-member)"
    mtu: 9216

  TenGigabitEthernet1/0/2:
    description: 'Core: link between eunetworks-2 and eunetworks-3'
    lcp: 'enp1s0f2'
    addresses: [ 94.142.247.246/31, 2a02:898:0:301::/127 ]
    mtu: 9214

  TenGigabitEthernet1/0/3:
    description: "i40e 0000:01:00.3 enp1s0f3 # AMS-IX"
    lcp: 'enp1s0f3'
    mtu: 9216
    sub-interfaces:
      501:
        description: "Peering: AMS-IX"
        lcp: 'enp1s0f3.501'
        addresses: [ 80.249.211.161/21, 2001:7f8:1::a500:8283:1/64 ]
        mtu: 1500
      511:
        description: "Peering: NBIP-NaWas via AMS-IX"
        lcp: 'enp1s0f3.511'
        addresses: [ 194.62.128.38/24, 2001:67c:608::f200:8283:1/64 ]
        mtu: 1500

  BondEthernet0:
    lcp: 'bond0'
    mtu: 9216
    sub-interfaces:
      100:
        description: "Cust: Members"
        lcp: 'bond0.100'
        mtu: 1500
        addresses: [ 94.142.240.253/24, 2a02:898:0:20::e2/64 ]
      101:
        description: "Core: Powerbars"
        lcp: 'bond0.101'
        mtu: 1500
        addresses: [ 172.28.3.253/24 ]
      105:
        description: "Cust: Members (no strict uRPF filtering)"
        lcp: 'bond0.105'
        mtu: 1500
        addresses: [ 185.52.225.14/28, 2a02:898:0:21::e2/64 ]
      130:
        description: "Core: Link between eunetworks-2 and dcg-1"
        lcp: 'bond0.130'
        mtu: 1500
        addresses: [ 94.142.247.242/31, 2a02:898:0:301::14/127 ]
      2502:
        description: "Transit: Fusix Networks"
        lcp: 'bond0.2502'
        mtu: 1500
        addresses: [ 37.139.140.27/31, 2a00:a7c0:e20b:104::2/126 ]
```

We take this configuration and pre-generate a suitable VPP config, which exposes two little bugs
in `vppcfg`:

* Rogier had used capital letters in his IPv6 addresses (i.e. `2001:2000:3080:0EBC::2`), while the
  dataplane reports lower case (i.e. `2001:2000:3080:ebc::2`), which consistently yields a diff that
  isn't really there. I make a note to fix that.
* When I create the initial `--novpp` config, there's a bug in `vppcfg` where I incorrectly
  reference a dataplane object which I haven't initialized (because with `--novpp` the tool
  will not contact the dataplane at all). That one was easy to fix, which I did in [[this
  commit](https://git.ipng.ch/ipng/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)].

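The first bug is a classic normalization issue: an IPv6 address has many textual spellings but one
canonical form (lower case, per RFC 5952). Comparing parsed addresses instead of strings makes the
phantom diff disappear; Python's `ipaddress` module shows the idea (this illustrates the problem,
it is not the actual vppcfg fix):

```python
import ipaddress

configured = "2001:2000:3080:0EBC::2"  # upper case, as in the YAML
reported = "2001:2000:3080:ebc::2"     # lower case, as the dataplane reports it

# A naive string comparison flags a difference that isn't really there...
print(configured == reported)  # False

# ...but parsing both yields the same address, in canonical lower case.
a, b = ipaddress.ip_address(configured), ipaddress.ip_address(reported)
print(a == b)  # True
print(str(a))  # 2001:2000:3080:ebc::2
```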
After that small detour, I can now proceed to configure the dataplane by offering the resulting
VPP commands, like so:

```
root@eunetworks-2:~# vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml \
    -o /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Loading configfile /etc/vpp/vppcfg.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.reconciler.write: Wrote 84 lines to /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Planning succeeded

root@eunetworks-2:~# vppctl exec /etc/vpp/config/vppcfg.vpp
```

**5. UNDRAIN:** The VPP dataplane comes to life, only to immediately hang. Whoops! What follows is a
90-minute foray into the innards of VPP (and Bird) which I haven't yet fully understood, but will
definitely want to learn more about (future article, anyone?) -- but the TL/DR of our investigation
is: if an IPv6 address is added to a loopback device, and an OSPFv3 (IPv6) stub area is created
on it, as is common for IPv4 and IPv6 loopback addresses in OSPF, then the dataplane immediately
hangs on the controlplane side, but does continue to forward traffic.

However, we also find a workaround, which is to put the IPv6 loopback address on a physical
interface instead of a loopback interface. Then, we observe a perfectly functioning dataplane, which
has a working BondEthernet with LACP signalling:

```
root@eunetworks-2:~# vppctl show bond details
BondEthernet0
  mode: lacp
  load balance: l34
  number of active members: 2
    TenGigabitEthernet1/0/1
    TenGigabitEthernet1/0/0
  number of members: 2
    TenGigabitEthernet1/0/0
    TenGigabitEthernet1/0/1
  device instance: 0
  interface id: 0
  sw_if_index: 8
  hw_if_index: 8

root@eunetworks-2:~# vppctl show lacp
                                                        actor state                      partner state
interface name            sw_if_index  bond interface  exp/def/dis/col/syn/agg/tim/act  exp/def/dis/col/syn/agg/tim/act
TenGigabitEthernet1/0/0   1            BondEthernet0     0   0   1   1   1   1   1   1    0   0   1   1   1   1   0   1
  LAG ID: [(ffff,64-9d-99-b1-31-af,0008,00ff,0001), (8000,02-1c-73-0f-8b-bc,0015,8000,8015)]
  RX-state: CURRENT, TX-state: TRANSMIT, MUX-state: COLLECTING_DISTRIBUTING, PTX-state: PERIODIC_TX
TenGigabitEthernet1/0/1   2            BondEthernet0     0   0   1   1   1   1   1   1    0   0   1   1   1   1   0   1
  LAG ID: [(ffff,64-9d-99-b1-31-af,0008,00ff,0002), (8000,02-1c-73-0f-8b-bc,0015,8000,0015)]
  RX-state: CURRENT, TX-state: TRANSMIT, MUX-state: COLLECTING_DISTRIBUTING, PTX-state: PERIODIC_TX
```

**6. WRAP UP:** After doing a bit of standard issue ping / ping6 and `show err` and `show log`,
things are looking good. Rogier and I are now ready to slowly introduce the router: we first turn on
OSPF and OSPFv3, and see adjacencies and BFD come up. We make a note that `enp1s0f2` (which is now a
_LIP_ in the dataplane) does not have BFD while it does have OSPF. The explanation: `bond0` is
connected to a switch, while `enp1s0f2` is directly connected to its peer via a cross connect cable.
If the cross connect fails, link-state will quickly reconverge; on `bond0`, however, the ethernet
link may still be up even if something along the transport path fails, so BFD is the better choice
there. Smart thinking, Coloclue!

```
root@eunetworks-2:~# birdc6 show ospf nei ospf1
BIRD 1.6.8 ready.
ospf1:
Router ID       Pri          State      DTime   Interface  Router IP
94.142.247.1      1     Full/PtP        00:33   bond0.130  fe80::669d:99ff:feb1:394b
94.142.247.6      1     Full/PtP        00:31   enp1s0f2   fe80::669d:99ff:feb1:31d8

root@eunetworks-2:~# birdc show bfd ses
BIRD 1.6.8 ready.
bfd1:
IP address       Interface  State      Since                Interval  Timeout
94.142.247.243   bond0.130  Up         2023-02-24 15:56:29    0.100    0.500
```

We are then ready to undrain iBGP and eBGP to members, transit and peering sessions. Rogier swiftly
takes care of business, and the router finds its spot in the DFZ just a few minutes later:

```
root@eunetworks-2:~# birdc show route count
BIRD 1.6.8 ready.
6239493 of 6239493 routes for 907650 networks

root@eunetworks-2:~# birdc6 show route count
BIRD 1.6.8 ready.
1152345 of 1152345 routes for 169987 networks

root@eunetworks-2:~# vppctl show ip fib sum
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none
locks:[adjacency:1, default-route:1, lcp-rt:1, ]
    Prefix length         Count
               0               1
               4               2
               8              16
               9              13
              10              38
              11             103
              12             299
              13             577
              14            1214
              15            2093
              16           13477
              17            8250
              18           13824
              19           24990
              20           43089
              21           51191
              22          109106
              23           97073
              24          542106
              27               3
              28              13
              29              32
              30              36
              31              41
              32             788

root@eunetworks-2:~# vppctl show ip6 fib sum
ipv6-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none
locks:[adjacency:1, default-route:1, lcp-rt:1, ]
    Prefix length         Count
             128             863
             127               4
             126               4
             125               1
             120               2
              64              22
              60              17
              52               2
              49               2
              48           80069
              47            3535
              46            3411
              45            1726
              44           14909
              43            1041
              42            2529
              41             932
              40           14126
              39            1459
              38            1654
              37             988
              36            6640
              35            1374
              34            3419
              33            3707
              32           22819
              31             294
              30             589
              29            4373
              28             196
              27              20
              26              15
              25               8
              24              30
              23               7
              22               7
              21               3
              20              15
              19               1
              10               1
               0               1
```

One thing that I really appreciate is how ... _normal_ ... this machine looks, with no interfaces in
the default namespace; but after switching to the dataplane network namespace using `nsenter`, there
they are, and they look (unsurprisingly, because we configured them that way) identical to what was
running before, except now all governed by VPP instead of the Linux kernel:

```
root@eunetworks-2:~# ip -br l
lo                     UNKNOWN  00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>

root@eunetworks-2:~# nsenter --net=/var/run/netns/dataplane
root@eunetworks-2:~# ip -br l
lo                     UNKNOWN  00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eno1                   UP       ac:1f:6b:e0:b1:0c <BROADCAST,MULTICAST,UP,LOWER_UP>
eno2                   DOWN     ac:1f:6b:e0:b1:0d <BROADCAST,MULTICAST>
enp1s0f2               UP       64:9d:99:b1:31:ad <BROADCAST,MULTICAST,UP,LOWER_UP>
enp1s0f3               UP       64:9d:99:b1:31:ac <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0                  UP       64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
loop0                  UP       de:ad:00:00:00:00 <BROADCAST,MULTICAST,UP,LOWER_UP>
eno1.205@eno1          UP       ac:1f:6b:e0:b1:0c <BROADCAST,MULTICAST,UP,LOWER_UP>
eno1.992@eno1          UP       ac:1f:6b:e0:b1:0c <BROADCAST,MULTICAST,UP,LOWER_UP>
enp1s0f3.501@enp1s0f3  UP       64:9d:99:b1:31:ac <BROADCAST,MULTICAST,UP,LOWER_UP>
enp1s0f3.511@enp1s0f3  UP       64:9d:99:b1:31:ac <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0.100@bond0        UP       64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0.101@bond0        UP       64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0.105@bond0        UP       64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0.130@bond0        UP       64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
bond0.2502@bond0       UP       64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>

root@eunetworks-2:~# ip -br a
lo                     UNKNOWN  127.0.0.1/8 ::1/128
eno1                   UP       fe80::ae1f:6bff:fee0:b10c/64
eno2                   DOWN
enp1s0f2               UP       94.142.247.246/31 2a02:898:0:300::3/128 2a02:898:0:301::/127 fe80::669d:99ff:feb1:31ad/64
enp1s0f3               UP       fe80::669d:99ff:feb1:31ac/64
bond0                  UP       fe80::669d:99ff:feb1:31af/64
loop0                  UP       94.142.247.3/32 fe80::dcad:ff:fe00:0/64
eno1.205@eno1          UP       62.115.144.33/31 2001:2000:3080:ebc::2/126 fe80::ae1f:6bff:fee0:b10c/64
eno1.992@eno1          UP       87.255.32.130/30 2a00:ec8::102/126 fe80::ae1f:6bff:fee0:b10c/64
enp1s0f3.501@enp1s0f3  UP       80.249.211.161/21 2001:7f8:1::a500:8283:1/64 fe80::669d:99ff:feb1:31ac/64
enp1s0f3.511@enp1s0f3  UP       194.62.128.38/24 2001:67c:608::f200:8283:1/64 fe80::669d:99ff:feb1:31ac/64
bond0.100@bond0        UP       94.142.240.253/24 2a02:898:0:20::e2/64 fe80::669d:99ff:feb1:31af/64
bond0.101@bond0        UP       172.28.3.253/24 fe80::669d:99ff:feb1:31af/64
bond0.105@bond0        UP       185.52.225.14/28 2a02:898:0:21::e2/64 fe80::669d:99ff:feb1:31af/64
bond0.130@bond0        UP       94.142.247.242/31 2a02:898:0:301::14/127 fe80::669d:99ff:feb1:31af/64
bond0.2502@bond0       UP       37.139.140.27/31 2a00:a7c0:e20b:104::2/126 fe80::669d:99ff:feb1:31af/64
```

{{< image width="400px" float="right" src="/assets/coloclue-vpp/traffic.jpg" alt="Traffic" >}}

Of course, VPP handles all the traffic _through_ the machine, and the only traffic that Linux will
see is that which is destined to the controlplane (e.g. to one of the IPv4 or IPv6 addresses, or to
the multicast/broadcast groups that it participates in), so things like tcpdump or SNMP won't
really work.

However, due to my [[vpp-snmp-agent](https://git.ipng.ch/ipng/vpp-snmp-agent.git)], which feeds, as
an AgentX subagent, the snmpd that in turn runs in the `dataplane` namespace, SNMP scrapes
work as they did before, albeit with a few different interface names.

**7.** Earlier, I had failed over `keepalived` and stopped the service. This way, the peer router,
`eunetworks-3`, would pick up all outbound traffic to the virtual IPv4 and IPv6 addresses of our users'
default gateway. Because we're mainly interested in non-intrusively measuring the BGP beacon (which
is forced to always go through this machine), and we know some of our members use BGP and prefer
this router because it's connected to AMS-IX, we decide to leave keepalived turned off for now.

But, traffic is flowing, and in fact there is a little bit more throughput; possibly because traffic
flows faster when there isn't 5% packet loss on certain egress paths? I don't know, but OK, moving
along!

## Results

Clearly VPP is the winner in this scenario. If you recall the traceroute from before the operation, the latency
was good up until `nlams0.ipng.ch`, after which loss occurred and variance was very high. Rogier and
I let the VPP instance run overnight, and started this traceroute after our maintenance was
concluded:

```
                                      My traceroute  [v0.94]
squanchy.ipng.ch (194.1.163.90) -> 185.52.227.1                        2023-02-25T09:48:46+0100
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                   Packets               Pings
 Host                                            Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. chbtl0.ipng.ch                                0.0% 51796    0.6   0.2   0.1   1.7   0.2
 2. chrma0.ipng.ch                                0.0% 51796    1.6   1.0   0.9   5.5   1.2
 3. defra0.ipng.ch                                0.0% 51796    7.0   6.5   6.4  27.7   1.9
 4. nlams0.ipng.ch                                0.0% 51796   12.7  12.6  12.5  43.8   3.9
 5. bond0-105.eunetworks-2.router.nl.coloclue.net 0.0% 51796   13.3  13.0  12.8 138.9  11.1
 6. 185.52.227.1                                  0.0% 51796   13.6  12.7  12.3  46.6   8.3
```

{{< image width="400px" float="right" src="/assets/coloclue-vpp/bill-clinton-zero.gif" alt="Clinton Zero" >}}

This mtr shows clear network weather, with absolutely no packets dropped from Brüttisellen (near
Zurich, Switzerland) all the way to the BGP beacon running at EUNetworks in Amsterdam. Considering
I've been running VPP for a few years now, including writing the code necessary to plumb the
dataplane interfaces through to Linux so that a higher order control plane (such as Bird, or FRR)
can manipulate them, I am reasonably bullish, but I do hope to convert others.

This computer now forwards packets like a boss; its packet loss is →

Looking at the local situation, from a hypervisor running at IPng Networks in Equinix AM3, via
FrysIX, through VPP and into the dataplane of the Coloclue router `eunetworks-2`, shows quite
reasonable throughput as well:

```
root@eunetworks-2:~# traceroute hvn0.nlams3.ipng.ch
traceroute to 46.20.243.179 (46.20.243.179), 30 hops max, 60 byte packets
 1  enp1s0f3.eunetworks-3.router.nl.coloclue.net (94.142.247.247)  0.087 ms  0.078 ms  0.071 ms
 2  frys-ix.ip-max.net (185.1.203.135)  1.288 ms  1.432 ms  1.479 ms
 3  hvn0.nlams3.ipng.ch (46.20.243.179)  0.524 ms  0.534 ms  0.531 ms

root@eunetworks-2:~# iperf3 -c 46.20.243.179 -P 10
Connecting to host 46.20.243.179, port 5201
...
[SUM]   0.00-10.00  sec  6.70 GBytes  5.76 Gbits/sec  192             sender
[SUM]   0.00-10.03  sec  6.58 GBytes  5.64 Gbits/sec                  receiver

root@eunetworks-2:~# iperf3 -c 46.20.243.179 -P 10 -R
Connecting to host 46.20.243.179, port 5201
Reverse mode, remote host 46.20.243.179 is sending
...
[SUM]   0.00-10.03  sec  6.07 GBytes  5.20 Gbits/sec  54623           sender
[SUM]   0.00-10.00  sec  6.03 GBytes  5.18 Gbits/sec                  receiver
```

And the smokepings look just plain gorgeous:
|
|
|
|
--------------- | ---------------------
|
|
 | 
|
|
|
|
_The screenshots above are smokeping from (left) a machine at AS8283 Coloclue (in Amsterdam, the
|
|
Netherlands), and from (right) a machine at AS8298 IPng (in Brüttisellen, Switzerland), both
|
|
are showing no packetloss and clearly improved performance in end to end latency. Super!_
|
|
|
|
## What's next

The performance of the one router we upgraded definitely improved, no question about that. But
there are a couple of things that I think we still need to do, so Rogier and I rolled back the
change to the previous situation and kernel based routing.

* We didn't migrate keepalived, although IPng runs it in our DDLN [[colocation]({{< ref "2022-02-24-colo" >}})]
  site, so I'm pretty confident that it will work.
* Kees and Ansible at Coloclue will need a few careful changes to facilitate ongoing automation:
  dataplane and controlplane firewalls, sysctls (uRPF et al), fastnetmon, and so on will
  need a meaningful overhaul.
* There's an unexplained dataplane hang when Bird enables an IPv6 `stub` OSPF interface on interface
  `lo0`. We worked around it by putting the loopback IPv6 address on another interface, but this
  needs to be fully understood.
* Completely unrelated to Coloclue, there's one dataplane hang regarding IPv6 RA/NS and/or BFD
  and/or Linux Control Plane that the VPP developer community is hunting down - it happens with
  my plugin but also with [[TNSR](http://www.netgate.com/tnsr)] (which uses the upstream `linux-cp` plugin).
  I've been working with a few folks from Netgate and customers of IPng Networks to try to find the root
  cause, as AS8298 has been bitten by this a few times over the last ~quarter or so. I cannot in
  good faith recommend running VPP until this is sorted out.

As an important side note, VPP is not well enough understood at Coloclue - rolling this out further
risks making me a single point of failure in the networking committee, and I'm not comfortable taking
that responsibility. I recommend that Coloclue network committee members gain experience with VPP,
DPDK, `vppcfg` and the other ecosystem tools, and that at least the bird6 OSPF issue and possible
IPv6 NS/RA issue are understood, before making the jump to the VPP world.