Rewrite all images to Hugo format
content/articles/2023-02-24-coloclue-vpp-2.md
|
||||
---
|
||||
date: "2023-02-24T07:31:00Z"
|
||||
title: 'Case Study: VPP at Coloclue, part 2'
|
||||
---
|
||||
|
||||
{{< image width="300px" float="right" src="/assets/coloclue-vpp/coloclue_logo2.png" alt="Yoloclue" >}}
|
||||
|
||||
* Authors: Pim van Pelt, Rogier Krieger
|
||||
* Reviewers: Coloclue Network Committee
|
||||
* Status: Draft - Review - **Published**
|
||||
|
||||
Almost precisely two years ago, in February of 2021, I created a loadtesting environment at
|
||||
[[Coloclue](https://coloclue.net)] to prove that a provider of L2 connectivity between two
|
||||
datacenters in Amsterdam was not incurring jitter or loss on its services -- I wrote up my findings
|
||||
in [[an article]({% post_url 2021-02-27-coloclue-loadtest %})], which demonstrated that the service
|
||||
provider indeed provides a perfect service. One month later, in March 2021, I briefly ran
|
||||
[[VPP](https://fd.io)] on one of the routers at Coloclue, but due to lack of time and a few
|
||||
technical hurdles along the way, I had to roll back [[ref]({% post_url 2021-03-27-coloclue-vpp %})].
|
||||
|
||||
## The Problem
|
||||
|
||||
Over the years, Coloclue AS8283 has continued to suffer from packet loss in its network. Taking a look
|
||||
at a simple traceroute, in this case from IPng AS8298, shows very high variance and _packetlo_
|
||||
when entering the network (at hop 5 in a router called `eunetworks-2.router.nl.coloclue.net`):
|
||||
|
||||
```
|
||||
My traceroute [v0.94]
|
||||
squanchy.ipng.ch (194.1.193.90) -> 185.52.227.1 2023-02-24T09:03:36+0100
|
||||
Keys: Help Display mode Restart statistics Order of fields quit
|
||||
Packets Pings
|
||||
Host Loss% Snt Last Avg Best Wrst StDev
|
||||
1. chbtl0.ipng.ch 0.0% 49904 1.3 0.9 0.7 1.7 0.2
|
||||
2. chrma0.ipng.ch 0.0% 49904 1.7 1.2 1.2 2.1 0.9
|
||||
3. defra0.ipng.ch 0.0% 49904 6.3 6.2 6.0 19.2 1.3
|
||||
4. nlams0.ipng.ch 0.0% 49904 12.7 12.6 12.4 19.8 1.8
|
||||
5. bond0-105.eunetworks-2.router.nl.coloclue.net 0.2% 49903 98.8 12.3 12.0 272.8 23.0
|
||||
6. 185.52.227.1 6.6% 49903 15.3 12.5 12.3 308.7 20.4
|
||||
```
|
||||
|
||||
The last two hops show packet loss, well north of 6.5% at the final hop. Some paths are better, some are worse,
|
||||
but notably when more than one router is in the path, it's difficult to pinpoint where or what is
|
||||
responsible. But honestly, any source will reveal packet loss and high variance when traversing
|
||||
through one or more Coloclue routers, to a greater or lesser degree:
|
||||
|
||||
--------------- | ---------------------
|
||||
 | 
|
||||
|
||||
_The screenshots above are smokeping from (left) a machine at AS8283 Coloclue (in Amsterdam, the
|
||||
Netherlands), and from (right) a machine at AS8298 IPng (in Brüttisellen, Switzerland), both
|
||||
are showing ~4.8-5.0% packetlo and high variance in end to end latency. No bueno!_
|
||||
|
||||
## Isolating a Device Under Test
|
||||
|
||||
Because Coloclue has several routers, I want to ensure that traffic traverses only the _one router_ under
|
||||
test. I decide to use an allocated but currently unused IPv4 prefix and announce that only from one
|
||||
of the four routers, so that all traffic to and from that /24 goes over that router. Coloclue uses a
|
||||
piece of software called [Kees](https://github.com/coloclue/kees.git), a set of Python and Jinja2
|
||||
scripts to generate a Bird1.6 configuration for each router. This is great because that allows me to
|
||||
add a small feature to get what I need: **beacons**.
|
||||
|
||||
### Setting up the beacon
|
||||
|
||||
A beacon is a prefix that is sent to (some, or all) peers on the internet to attract traffic in a
|
||||
particular way. I added a function called `is_coloclue_beacon()` which reads the input YAML file and
|
||||
uses a construction similar to the existing feature for "supernets". It determines if a given prefix
|
||||
must be announced to peers and upstreams. Any IPv4 and IPv6 prefixes from the `beacons` list will
then be matched in `is_coloclue_beacon()` and announced. For the curious, [[this
|
||||
commit](https://github.com/coloclue/kees/commit/3710f1447ade10384c86f35b2652565b440c6aa6)] holds the
|
||||
logic and tests to ensure this is safe.
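
To make the matching idea concrete, here is a minimal Python sketch of a beacon check. It is not the Kees implementation (that lives in the commit linked above and ends up as generated Bird configuration). The function name `is_beacon` and the in-memory `BEACONS` list are assumptions for illustration, shaped like the YAML stanza used in this article.

```python
import ipaddress

# Hypothetical in-memory form of a per-router `beacons` list.
BEACONS = [
    {"prefix": "185.52.227.0", "length": 24, "comment": "VPP test prefix"},
]


def is_beacon(candidate, beacons=BEACONS):
    """Return True if `candidate` (e.g. "185.52.227.0/24") exactly matches a beacon."""
    net = ipaddress.ip_network(candidate, strict=True)
    for entry in beacons:
        beacon_net = ipaddress.ip_network(
            f"{entry['prefix']}/{entry['length']}", strict=True
        )
        # Networks of different address families never compare equal.
        if net == beacon_net:
            return True
    return False


if __name__ == "__main__":
    print(is_beacon("185.52.227.0/24"))
    print(is_beacon("185.52.226.0/24"))
```

An exact match (rather than "is contained in") is used here on purpose, so that a more specific or a covering prefix is not announced by accident.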
|
||||
|
||||
Based on a per-router config (eg. `vars/eunetworks-2.router.nl.coloclue.net.yml`) I can now add the
|
||||
following YAML stanza:
|
||||
|
||||
```
coloclue:
  beacons:
    - prefix: "185.52.227.0"
      length: 24
      comment: "VPP test prefix (pim, rogier)"
```
|
||||
|
||||
And further, from this router, I can forward all traffic destined to this /24 to a machine running
|
||||
in EUNetworks (my Dell R630 called `hvn0.nlams2.ipng.ch`), using a simple static route:
|
||||
|
||||
```
statics:
  ...
  - route: "185.52.227.0/24"
    via: "94.142.240.71"
    comment: "VPP test prefix (pim, rogier)"
```
|
||||
|
||||
After running Kees, I can now see traffic for that /24 show up on my machine. The last step is to
ensure that return traffic sourced from the beacon will always traverse back over `eunetworks-2`.
|
||||
Coloclue has VRRP and sometimes another router might be the logical router. With a little trick on
|
||||
my machine, I can force traffic by means of _policy based routing_:
|
||||
|
||||
```
|
||||
pim@hvn0-nlams2:~$ sudo ip ro add default via 94.142.240.254
|
||||
pim@hvn0-nlams2:~$ sudo ip ro add prohibit 185.52.227.0/24
|
||||
pim@hvn0-nlams2:~$ sudo ip addr add 185.52.227.1/32 dev lo
|
||||
pim@hvn0-nlams2:~$ sudo ip rule add from 185.52.227.0/24 lookup 10
|
||||
pim@hvn0-nlams2:~$ sudo ip ro add default via 94.142.240.253 table 10
|
||||
```
|
||||
|
||||
First, I set the default gateway to be the VRRP address that floats between multiple routers. Then,
|
||||
I will set a `prohibit` route for the covering /24, which means the machine will send an ICMP
|
||||
unreachable (rather than discarding the packets), which can be useful later. Next, I'll add .1 as an
|
||||
IPv4 address onto loopback, after which the machine will start answering ICMP echo requests there
with echo replies rather than dst-unreach. To make sure routing is always symmetric, I'll add an `ip rule`
|
||||
which is a classifier that matches packets based on their source address, and then diverts these to
|
||||
an alternate routing table, which has only one entry: send via .253 (which is `eunetworks-2`).
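
The source-based selection described above can be modelled in a few lines of Python. This is a deliberately simplified sketch: the real kernel evaluates rules by priority and then does a longest-prefix match inside the selected table, whereas here each table only holds a default gateway. The names `RULES`, `DEFAULT_GATEWAY` and `next_hop_for_source` are made up for illustration.

```python
import ipaddress

# Rules are checked in order; the first match selects the routing table.
# This mirrors `ip rule add from 185.52.227.0/24 lookup 10` plus the main table.
RULES = [
    (ipaddress.ip_network("185.52.227.0/24"), "10"),  # sourced from the beacon
    (ipaddress.ip_network("0.0.0.0/0"), "main"),      # everything else
]

# Default gateway per table, as configured with `ip ro add default ...`.
DEFAULT_GATEWAY = {
    "10": "94.142.240.253",    # eunetworks-2, the device under test
    "main": "94.142.240.254",  # VRRP address floating between routers
}


def next_hop_for_source(src):
    """Return the gateway an outbound packet with source address `src` would use."""
    addr = ipaddress.ip_address(src)
    for network, table in RULES:
        if addr in network:
            return DEFAULT_GATEWAY[table]
    raise LookupError(f"no rule matches {src}")


if __name__ == "__main__":
    print(next_hop_for_source("185.52.227.1"))
    print(next_hop_for_source("94.142.240.71"))
```

Replies sourced from the beacon address leave via .253, while all other traffic keeps using the VRRP gateway.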
|
||||
|
||||
Let me show this in action:
|
||||
|
||||
|
||||
```
|
||||
pim@hvn0-nlams2:~$ dig +short -x 94.142.240.254
|
||||
eunetworks-gateway-100.router.nl.coloclue.net.
|
||||
pim@hvn0-nlams2:~$ dig +short -x 94.142.240.253
|
||||
bond0-100.eunetworks-2.router.nl.coloclue.net.
|
||||
pim@hvn0-nlams2:~$ dig +short -x 94.142.240.252
|
||||
bond0-100.eunetworks-3.router.nl.coloclue.net.
|
||||
|
||||
pim@hvn0-nlams2:~$ ip -4 nei | grep '94.142.240.25[234]'
|
||||
94.142.240.252 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE
|
||||
94.142.240.253 dev coloclue lladdr 64:9d:99:b1:31:af REACHABLE
|
||||
94.142.240.254 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE
|
||||
```
|
||||
|
||||
In the output above, I can see that `eunetworks-2` (94.142.240.253) has MAC address
|
||||
`64:9d:99:b1:31:af`, and that `eunetworks-3` (94.142.240.252) has MAC address `64:9d:99:b1:31:db`.
|
||||
My default gateway, handled by VRRP, is at .254 and it's using the second MAC address, so I know that
|
||||
`eunetworks-3` is primary, and will handle my egress traffic.
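
The same deduction can be scripted. The sketch below parses `ip -4 nei` style lines and reports which real router address shares its MAC with the virtual gateway. It assumes, as the output above suggests, that the VRRP address answers with the primary router's own MAC rather than a dedicated virtual MAC. The helper names are hypothetical.

```python
NEIGHBORS = """\
94.142.240.252 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE
94.142.240.253 dev coloclue lladdr 64:9d:99:b1:31:af REACHABLE
94.142.240.254 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE
"""


def parse_neighbors(text):
    """Map IP address -> MAC address from `ip -4 nei` style lines."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if "lladdr" in fields:
            table[fields[0]] = fields[fields.index("lladdr") + 1]
    return table


def vrrp_primary(text, virtual_ip):
    """Return the real addresses that share a MAC with the virtual IP."""
    table = parse_neighbors(text)
    virtual_mac = table[virtual_ip]
    return [
        ip for ip, mac in table.items()
        if mac == virtual_mac and ip != virtual_ip
    ]


if __name__ == "__main__":
    print(vrrp_primary(NEIGHBORS, "94.142.240.254"))
```

With the neighbor table shown above, the only match is 94.142.240.252, which is `eunetworks-3`.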
|
||||
|
||||
### Verifying symmetric routing of the beacon
|
||||
|
||||
As a quick demonstration of the symmetric routing case, I can tcpdump and see that my "usual"
|
||||
egress traffic will be sent to the MAC address of the VRRP primary (which I showed to be
|
||||
`eunetworks-3` above), while traffic coming from 185.52.227.0/24 ought to be sent to the
|
||||
MAC address of `eunetworks-2` due to the `ip rule` and alternate routing table 10:
|
||||
|
||||
```
|
||||
pim@hvn0-nlams2:~$ sudo tcpdump -eni coloclue host 194.1.163.93 and icmp
|
||||
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
|
||||
listening on coloclue, link-type EN10MB (Ethernet), snapshot length 262144 bytes
|
||||
10:02:17.193844 64:9d:99:b1:31:af > 6e:fa:52:d0:c1:ff, ethertype IPv4 (0x0800), length 98:
|
||||
194.1.163.93 > 94.142.240.71: ICMP echo request, id 16287, seq 1, length 64
|
||||
10:02:17.193882 6e:fa:52:d0:c1:ff > 64:9d:99:b1:31:db, ethertype IPv4 (0x0800), length 98:
|
||||
94.142.240.71 > 194.1.163.93: ICMP echo reply, id 16287, seq 1, length 64
|
||||
|
||||
10:02:19.276657 64:9d:99:b1:31:af > 6e:fa:52:d0:c1:ff, ethertype IPv4 (0x0800), length 98:
|
||||
194.1.163.93 > 185.52.227.1: ICMP echo request, id 6646, seq 1, length 64
|
||||
10:02:19.276694 6e:fa:52:d0:c1:ff > 64:9d:99:b1:31:af, ethertype IPv4 (0x0800), length 98:
|
||||
185.52.227.1 > 194.1.163.93: ICMP echo reply, id 6646, seq 1, length 64
|
||||
|
||||
```
|
||||
|
||||
It takes a keen eye to spot the difference here: the first packet (which is going to the main
|
||||
IPv4 address 94.142.240.71), is returned via MAC address `64:9d:99:b1:31:db` (the VRRP default
|
||||
gateway), but the second one (going to the beacon 185.52.227.1) is returned via MAC address
|
||||
`64:9d:99:b1:31:af`.
|
||||
|
||||
I've now ensured that traffic to and from 185.52.227.1 will always traverse through the DUT
|
||||
(`eunetworks-2` with MAC `64:9d:99:b1:31:af`). Very elegant :-)
|
||||
|
||||
## Installing VPP
|
||||
|
||||
I've written about this before; the general _spiel_ is just following my previous article (I'm
|
||||
often very glad to read back my own articles as they serve as pretty good documentation to my
|
||||
forgetful chipmunk-sized brain!), so here, I'll only recap what's already written in
|
||||
[[vpp-7]({% post_url 2021-09-21-vpp-7 %})]:
|
||||
|
||||
1. Build VPP with Linux Control Plane
|
||||
1. Bring `eunetworks-2` into maintenance mode, so we can safely tinker with it
|
||||
1. Start services like ssh, snmp, keepalived and bird in a new `dataplane` namespace
|
||||
1. Start VPP and give the LCP interfaces the same names as their originals
|
||||
1. Slowly introduce the router: OSPF, OSPFv3, iBGP, members-bgp, eBGP, in that order
|
||||
1. Re-enable keepalived and let the machine forward traffic
|
||||
1. Stare at the latency graphs
|
||||
|
||||
{{< image width="500px" float="right" src="/assets/coloclue-vpp/likwid-topology.png" alt="Likwid Topology" >}}
|
||||
|
||||
**1. BUILD:** For the first step, the build is straightforward, and yields a VPP instance based on
|
||||
`vpp-ext-deps_23.06-1` at version `23.06-rc0~71-g182d2b466`, which contains my
|
||||
[[LCPng](https://github.com/pimvanpelt/lcpng.git)] plugin. I then copy the packages to the router.
|
||||
The router has an E-2286G CPU @ 4.00GHz with 6 cores and 6 hyperthreads. There's a really handy tool
|
||||
called `likwid-topology` that can show how the L1, L2 and L3 cache lines up with respect to CPU
|
||||
cores. Here I learn that CPU (0+6) and (1+7) share L1 and L2 cache -- so I can conclude that 0-5 are
|
||||
CPU cores which share a hyperthread with 6-11 respectively.
|
||||
|
||||
I also see that L3 cache is shared across all of the cores+hyperthreads, which is normal. I decide
|
||||
to give CPUs 0,1 and their hyperthreads 6,7 to Linux for general purpose scheduling, and I want to
block off the remaining CPUs and their hyperthreads, dedicating them to VPP. So the kernel is rebooted with
|
||||
`isolcpus=2-5,8-11`.
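
As a small sanity check of that CPU list, the sketch below derives it from the topology: six cores, where core N shares its caches with hyperthread N+6 (the pairing read from `likwid-topology` above), and cores 0 and 1 are left to Linux. The function names are hypothetical; only the resulting `isolcpus=` value comes from the article.

```python
def format_cpu_list(cpus):
    """Collapse CPU ids into kernel cpu-list syntax, e.g. [2, 3, 4, 5] -> "2-5"."""
    ranges = []
    start = prev = None
    for cpu in sorted(set(cpus)):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def isolated_cpus(num_cores, linux_cores):
    """Cores not given to Linux, plus their hyperthread siblings (core + num_cores)."""
    dataplane = [c for c in range(num_cores) if c not in linux_cores]
    return dataplane + [c + num_cores for c in dataplane]


if __name__ == "__main__":
    print("isolcpus=" + format_cpu_list(isolated_cpus(6, {0, 1})))
```

Run as a script, this prints `isolcpus=2-5,8-11`, matching the kernel argument used here.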
|
||||
|
||||
**2. DRAIN:** In the meantime, Rogier prepares the drain, which is a two-step process. First he marks all
|
||||
the BGP sessions as `graceful_shutdown: True`, and waits for the traffic to die down. Then, he marks
|
||||
the machine as `maintenance_mode: True` which will make Kees set OSPF cost to 65535 and avoid
|
||||
attracting or sending traffic through this machine. After he submits these, we are free to tinker
|
||||
with the router, as it will not affect any Coloclue members. Rogier also ensures we will keep
control of this little machine in Amsterdam, by preparing an IPMI serial-over-lan connection and KVM.
|
||||
|
||||
**3. PREPARE:** Starting sshd and snmpd in the dataplane namespace is the most important part. This way, we will be
able to scrape the machine using SNMP just as if it were a Linux native router. And of course we
|
||||
will want to be able to log in to the router. I start with these two services, the only small
|
||||
note is that, because I want to run two copies (one in the default namespace and one additional
|
||||
one in the dataplane namespace), I'll want to tweak the startup flags (pid file, config file, etc) a
|
||||
little bit:
|
||||
|
||||
```
## in snmpd-dataplane.service
ExecStart=/sbin/ip netns exec dataplane /usr/sbin/snmpd -LOw -u Debian-snmp \
    -g vpp -I -smux,mteTrigger,mteTriggerConf -f -p /run/snmpd-dataplane.pid \
    -C -c /etc/snmp/snmpd-dataplane.conf

## in ssh-dataplane.service
ExecStart=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd \
    -oPidFile=/run/sshd-dataplane.pid -D $SSHD_OPTS
```
|
||||
**4. LAUNCH:** Now what's left for us to do is switch from our SSH session to an IPMI serial-over-lan session
|
||||
so that we can safely transition to the VPP world. Rogier and I log in and share a tmux session,
|
||||
after which I bring down all ethernet links, remove VLAN sub-interfaces and the LACP BondEthernet,
|
||||
leaving only the main physical interfaces. I then set link down on them, and restart VPP -- which
|
||||
will take all DPDK eligible interfaces that are link admin-down, and then let the magic happen:
|
||||
|
||||
```
|
||||
root@eunetworks-2:~# vppctl show int
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
|
||||
GigabitEthernet5/0/0 5 down 9000/0/0/0
|
||||
GigabitEthernet6/0/0 6 down 9000/0/0/0
|
||||
TenGigabitEthernet1/0/0 1 down 9000/0/0/0
|
||||
TenGigabitEthernet1/0/1 2 down 9000/0/0/0
|
||||
TenGigabitEthernet1/0/2 3 down 9000/0/0/0
|
||||
TenGigabitEthernet1/0/3 4 down 9000/0/0/0
|
||||
```
|
||||
|
||||
Dope! One way to trick the rest of the machine into thinking it hasn't changed, is to recreate these
|
||||
interfaces in the `dataplane` network namespace using their _original_ interface names (eg.
|
||||
`enp1s0f3` for AMS-IX, and `bond0` for the LACP signaled BondEthernet that we'll create). Rogier
|
||||
prepared an excellent `vppcfg` config file:
|
||||
|
||||
```
loopbacks:
  loop0:
    description: 'eunetworks-2.router.nl.coloclue.net'
    lcp: 'loop0'
    mtu: 9216
    addresses: [ 94.142.247.3/32, 2a02:898:0:300::3/128 ]

bondethernets:
  BondEthernet0:
    description: 'Core: MLAG member switches'
    interfaces: [ TenGigabitEthernet1/0/0, TenGigabitEthernet1/0/1 ]
    mode: 'lacp'
    load-balance: 'l34'
    mac: '64:9d:99:b1:31:af'

interfaces:
  GigabitEthernet5/0/0:
    description: "igb 0000:05:00.0 eno1 # FiberRing"
    lcp: 'eno1'
    mtu: 9216
    sub-interfaces:
      205:
        description: "Peering: Arelion"
        lcp: 'eno1.205'
        addresses: [ 62.115.144.33/31, 2001:2000:3080:ebc::2/126 ]
        mtu: 1500
      992:
        description: "Transit: FiberRing"
        lcp: 'eno1.992'
        addresses: [ 87.255.32.130/30, 2a00:ec8::102/126 ]
        mtu: 1500

  GigabitEthernet6/0/0:
    description: "igb 0000:06:00.0 eno2 # Free"
    lcp: 'eno2'
    mtu: 9216
    state: down

  TenGigabitEthernet1/0/0:
    description: "i40e 0000:01:00.0 enp1s0f0 (bond-member)"
    mtu: 9216

  TenGigabitEthernet1/0/1:
    description: "i40e 0000:01:00.1 enp1s0f1 (bond-member)"
    mtu: 9216

  TenGigabitEthernet1/0/2:
    description: 'Core: link between eunetworks-2 and eunetworks-3'
    lcp: 'enp1s0f2'
    addresses: [ 94.142.247.246/31, 2a02:898:0:301::/127 ]
    mtu: 9214

  TenGigabitEthernet1/0/3:
    description: "i40e 0000:01:00.3 enp1s0f3 # AMS-IX"
    lcp: 'enp1s0f3'
    mtu: 9216
    sub-interfaces:
      501:
        description: "Peering: AMS-IX"
        lcp: 'enp1s0f3.501'
        addresses: [ 80.249.211.161/21, 2001:7f8:1::a500:8283:1/64 ]
        mtu: 1500
      511:
        description: "Peering: NBIP-NaWas via AMS-IX"
        lcp: 'enp1s0f3.511'
        addresses: [ 194.62.128.38/24, 2001:67c:608::f200:8283:1/64 ]
        mtu: 1500

  BondEthernet0:
    lcp: 'bond0'
    mtu: 9216
    sub-interfaces:
      100:
        description: "Cust: Members"
        lcp: 'bond0.100'
        mtu: 1500
        addresses: [ 94.142.240.253/24, 2a02:898:0:20::e2/64 ]
      101:
        description: "Core: Powerbars"
        lcp: 'bond0.101'
        mtu: 1500
        addresses: [ 172.28.3.253/24 ]
      105:
        description: "Cust: Members (no strict uRPF filtering)"
        lcp: 'bond0.105'
        mtu: 1500
        addresses: [ 185.52.225.14/28, 2a02:898:0:21::e2/64 ]
      130:
        description: "Core: Link between eunetworks-2 and dcg-1"
        lcp: 'bond0.130'
        mtu: 1500
        addresses: [ 94.142.247.242/31, 2a02:898:0:301::14/127 ]
      2502:
        description: "Transit: Fusix Networks"
        lcp: 'bond0.2502'
        mtu: 1500
        addresses: [ 37.139.140.27/31, 2a00:a7c0:e20b:104::2/126 ]
```
|
||||
|
||||
We take this configuration and pre-generate a suitable VPP config, which exposes two little bugs
|
||||
in `vppcfg`:
|
||||
|
||||
* Rogier had used capital letters in his IPv6 addresses (ie. `2001:2000:3080:0EBC::2`), while the
dataplane reports lower case (ie. `2001:2000:3080:ebc::2`), which consistently yields a diff that's
not there. I make a note to fix that.
* When I create the initial `--novpp` config, there's a bug in `vppcfg` where I incorrectly
reference a dataplane object which I haven't initialized (because with `--novpp` the tool
will not contact the dataplane at all). That one was easy to fix, which I did in [[this
commit](https://github.com/pimvanpelt/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)].
|
||||
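
The first of these two bugs is a textual comparison problem: the same IPv6 address can be written in several ways. The sketch below shows the general idea of comparing parsed addresses instead of strings, using Python's standard `ipaddress` module. It is not the actual `vppcfg` fix, and the helper name `same_interface_address` is made up for this example.

```python
import ipaddress


def same_interface_address(a, b):
    """Compare two "address/prefixlen" strings by value rather than by spelling."""
    return ipaddress.ip_interface(a) == ipaddress.ip_interface(b)


def canonical(addr):
    """Return the compressed, lower-case form of an "address/prefixlen" string."""
    return str(ipaddress.ip_interface(addr))


if __name__ == "__main__":
    configured = "2001:2000:3080:0EBC::2/126"  # as written in the YAML
    reported = "2001:2000:3080:ebc::2/126"     # as reported by the dataplane
    print(configured == reported)
    print(same_interface_address(configured, reported))
    print(canonical(configured))
```

The plain string comparison differs, while the parsed comparison treats both spellings as the same address.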
|
||||
After that small detour, I can now proceed to configure the dataplane by offering the resulting
|
||||
VPP commands, like so:
|
||||
```
|
||||
root@eunetworks-2:~# vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml \
|
||||
-o /etc/vpp/config/vppcfg.vpp
|
||||
[INFO ] root.main: Loading configfile /etc/vpp/vppcfg.yaml
|
||||
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
|
||||
[INFO ] root.main: Configuration is valid
|
||||
[INFO ] vppcfg.reconciler.write: Wrote 84 lines to /etc/vpp/config/vppcfg.vpp
|
||||
[INFO ] root.main: Planning succeeded
|
||||
|
||||
root@eunetworks-2:~# vppctl exec /etc/vpp/config/vppcfg.vpp
|
||||
```
|
||||
|
||||
**5. UNDRAIN:** The VPP dataplane comes to life, only to immediately hang. Whoops! What follows is a
|
||||
90-minute foray into the innards of VPP (and Bird) which I haven't yet fully understood, but will
definitely want to learn more about (future article, anyone?) -- but the TL;DR of our investigation
|
||||
is that if an IPv6 address is added to a loopback device, and an OSPFv3 (IPv6) stub area is created
|
||||
on it, as is common for IPv4 and IPv6 loopback addresses in OSPF, then the dataplane immediately
|
||||
hangs on the controlplane, but does continue to forward traffic.
|
||||
|
||||
However, we also find a workaround, which is to put the IPv6 loopback address on a physical
|
||||
interface instead of a loopback interface. Then, we observe a perfectly functioning dataplane, which
|
||||
has a working BondEthernet with LACP signalling:
|
||||
|
||||
```
|
||||
root@eunetworks-2:~# vppctl show bond details
|
||||
BondEthernet0
|
||||
mode: lacp
|
||||
load balance: l34
|
||||
number of active members: 2
|
||||
TenGigabitEthernet1/0/1
|
||||
TenGigabitEthernet1/0/0
|
||||
number of members: 2
|
||||
TenGigabitEthernet1/0/0
|
||||
TenGigabitEthernet1/0/1
|
||||
device instance: 0
|
||||
interface id: 0
|
||||
sw_if_index: 8
|
||||
hw_if_index: 8
|
||||
|
||||
root@eunetworks-2:~# vppctl show lacp
|
||||
actor state partner state
|
||||
interface name sw_if_index bond interface exp/def/dis/col/syn/agg/tim/act exp/def/dis/col/syn/agg/tim/act
|
||||
TenGigabitEthernet1/0/0 1 BondEthernet0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1
|
||||
LAG ID: [(ffff,64-9d-99-b1-31-af,0008,00ff,0001), (8000,02-1c-73-0f-8b-bc,0015,8000,8015)]
|
||||
RX-state: CURRENT, TX-state: TRANSMIT, MUX-state: COLLECTING_DISTRIBUTING, PTX-state: PERIODIC_TX
|
||||
TenGigabitEthernet1/0/1 2 BondEthernet0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1
|
||||
LAG ID: [(ffff,64-9d-99-b1-31-af,0008,00ff,0002), (8000,02-1c-73-0f-8b-bc,0015,8000,0015)]
|
||||
RX-state: CURRENT, TX-state: TRANSMIT, MUX-state: COLLECTING_DISTRIBUTING, PTX-state: PERIODIC_TX
|
||||
```
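
For readers less familiar with the eight state columns in `show lacp`, here is a small Python sketch that decodes them. The short flag names are taken from the column header above. Their meaning (expired, defaulted, distributing, collecting, synchronization, aggregation, timeout, activity) follows the LACP standard as I understand it, and the `is_forwarding` rule of thumb is an assumption for illustration, not something VPP exposes.

```python
LACP_FLAGS = ("exp", "def", "dis", "col", "syn", "agg", "tim", "act")


def decode_lacp_state(bits):
    """Turn a string like "0 0 1 1 1 1 1 1" into a dict of flag name -> bool."""
    values = [int(b) for b in bits.split()]
    if len(values) != len(LACP_FLAGS) or any(v not in (0, 1) for v in values):
        raise ValueError(f"expected {len(LACP_FLAGS)} binary flags, got {bits!r}")
    return dict(zip(LACP_FLAGS, (bool(v) for v in values)))


def is_forwarding(state):
    """Rule of thumb: in sync, collecting and distributing, and not expired."""
    return state["syn"] and state["col"] and state["dis"] and not state["exp"]


if __name__ == "__main__":
    actor = decode_lacp_state("0 0 1 1 1 1 1 1")
    partner = decode_lacp_state("0 0 1 1 1 1 0 1")
    print("actor forwarding:", is_forwarding(actor))
    print("partner forwarding:", is_forwarding(partner))
    print("timeout flag differs:", actor["tim"] != partner["tim"])
```

In the output above, both sides are collecting and distributing. The only difference is the `tim` flag, which is set on the actor (VPP) side and clear on the partner (switch) side.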
|
||||
|
||||
**6. WRAP UP:** After doing a bit of standard issue ping / ping6 and `show err` and `show log`,
|
||||
things are looking good. Rogier and I are now ready to slowly introduce the router: we first turn on
|
||||
OSPF and OSPFv3, see adjacencies and BFD turn up. We make a note that `enp1s0f2` (which is now a
_LIP_ in the dataplane) does not have BFD while it does have OSPF. The explanation is that
`enp1s0f2` is directly connected to its peer via a cross connect cable, so if that link fails, it
can use link-state to quickly reconverge. By contrast, `bond0` is connected to a switch, and its
ethernet link may still be up if something along the transport path were to fail, so BFD is the
better choice there. Smart thinking, Coloclue!
|
||||
|
||||
```
|
||||
root@eunetworks-2:~# birdc6 show ospf nei ospf1
|
||||
BIRD 1.6.8 ready.
|
||||
ospf1:
|
||||
Router ID Pri State DTime Interface Router IP
|
||||
94.142.247.1 1 Full/PtP 00:33 bond0.130 fe80::669d:99ff:feb1:394b
|
||||
94.142.247.6 1 Full/PtP 00:31 enp1s0f2 fe80::669d:99ff:feb1:31d8
|
||||
|
||||
root@eunetworks-2:~# birdc show bfd ses
|
||||
BIRD 1.6.8 ready.
|
||||
bfd1:
|
||||
IP address Interface State Since Interval Timeout
|
||||
94.142.247.243 bond0.130 Up 2023-02-24 15:56:29 0.100 0.500
|
||||
```
|
||||
|
||||
We are then ready to undrain iBGP and eBGP to members, transit and peering sessions. Rogier swiftly
|
||||
takes care of business, and the router finds its spot in the DFZ just a few minutes later:
|
||||
|
||||
```
|
||||
root@eunetworks-2:~# birdc show route count
|
||||
BIRD 1.6.8 ready.
|
||||
6239493 of 6239493 routes for 907650 networks
|
||||
|
||||
root@eunetworks-2:~# birdc6 show route count
|
||||
BIRD 1.6.8 ready.
|
||||
1152345 of 1152345 routes for 169987 networks
|
||||
|
||||
root@eunetworks-2:~# vppctl show ip fib sum
|
||||
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none
|
||||
locks:[adjacency:1, default-route:1, lcp-rt:1, ]
|
||||
Prefix length Count
|
||||
0 1
|
||||
4 2
|
||||
8 16
|
||||
9 13
|
||||
10 38
|
||||
11 103
|
||||
12 299
|
||||
13 577
|
||||
14 1214
|
||||
15 2093
|
||||
16 13477
|
||||
17 8250
|
||||
18 13824
|
||||
19 24990
|
||||
20 43089
|
||||
21 51191
|
||||
22 109106
|
||||
23 97073
|
||||
24 542106
|
||||
27 3
|
||||
28 13
|
||||
29 32
|
||||
30 36
|
||||
31 41
|
||||
32 788
|
||||
|
||||
root@eunetworks-2:~# vppctl show ip6 fib sum
|
||||
ipv6-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none
|
||||
locks:[adjacency:1, default-route:1, lcp-rt:1, ]
|
||||
Prefix length Count
|
||||
128 863
|
||||
127 4
|
||||
126 4
|
||||
125 1
|
||||
120 2
|
||||
64 22
|
||||
60 17
|
||||
52 2
|
||||
49 2
|
||||
48 80069
|
||||
47 3535
|
||||
46 3411
|
||||
45 1726
|
||||
44 14909
|
||||
43 1041
|
||||
42 2529
|
||||
41 932
|
||||
40 14126
|
||||
39 1459
|
||||
38 1654
|
||||
37 988
|
||||
36 6640
|
||||
35 1374
|
||||
34 3419
|
||||
33 3707
|
||||
32 22819
|
||||
31 294
|
||||
30 589
|
||||
29 4373
|
||||
28 196
|
||||
27 20
|
||||
26 15
|
||||
25 8
|
||||
24 30
|
||||
23 7
|
||||
22 7
|
||||
21 3
|
||||
20 15
|
||||
19 1
|
||||
10 1
|
||||
0 1
|
||||
|
||||
```
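
If you want a single number out of those per-prefix-length histograms, for example to eyeball it against what Bird reports, a few lines of Python will total the `Count` column. This is only a convenience sketch. The function name `fib_total` is made up, and it assumes the two-column "prefix length, count" layout shown above.

```python
def fib_total(summary):
    """Sum the Count column of `show ip fib sum` / `show ip6 fib sum` style output."""
    total = 0
    for line in summary.splitlines():
        fields = line.split()
        # Data rows are exactly two integers: prefix length and count.
        if len(fields) == 2 and all(f.isdigit() for f in fields):
            total += int(fields[1])
    return total


if __name__ == "__main__":
    sample = """\
    Prefix length Count
    0 1
    24 542106
    32 788
    """
    print(fib_total(sample))
```

Header lines such as the VRF description and the `locks:` line are skipped because they do not consist of exactly two integers.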
|
||||
|
||||
One thing that I really appreciate is how ... _normal_ ... this machine looks, with no interfaces in
|
||||
the default namespace, but after switching to the dataplane network namespace using `nsenter`, there
|
||||
they are and they look (unsurprisingly, because we configured them that way), identical to what was
|
||||
running before, except now all governed by VPP instead of the Linux kernel:
|
||||
|
||||
```
|
||||
root@eunetworks-2:~# ip -br l
|
||||
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
|
||||
|
||||
root@eunetworks-2:~# nsenter --net=/var/run/netns/dataplane
|
||||
root@eunetworks-2:~# ip -br l
|
||||
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
|
||||
eno1 UP ac:1f:6b:e0:b1:0c <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
eno2 DOWN ac:1f:6b:e0:b1:0d <BROADCAST,MULTICAST>
|
||||
enp1s0f2 UP 64:9d:99:b1:31:ad <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
enp1s0f3 UP 64:9d:99:b1:31:ac <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
loop0 UP de:ad:00:00:00:00 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
eno1.205@eno1 UP ac:1f:6b:e0:b1:0c <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
eno1.992@eno1 UP ac:1f:6b:e0:b1:0c <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
enp1s0f3.501@enp1s0f3 UP 64:9d:99:b1:31:ac <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
enp1s0f3.511@enp1s0f3 UP 64:9d:99:b1:31:ac <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
bond0.100@bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
bond0.101@bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
bond0.105@bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
bond0.130@bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
bond0.2502@bond0 UP 64:9d:99:b1:31:af <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
|
||||
root@eunetworks-2:~# ip -br a
|
||||
lo UNKNOWN 127.0.0.1/8 ::1/128
|
||||
eno1 UP fe80::ae1f:6bff:fee0:b10c/64
|
||||
eno2 DOWN
|
||||
enp1s0f2 UP 94.142.247.246/31 2a02:898:0:300::3/128 2a02:898:0:301::/127 fe80::669d:99ff:feb1:31ad/64
|
||||
enp1s0f3 UP fe80::669d:99ff:feb1:31ac/64
|
||||
bond0 UP fe80::669d:99ff:feb1:31af/64
|
||||
loop0 UP 94.142.247.3/32 fe80::dcad:ff:fe00:0/64
|
||||
eno1.205@eno1 UP 62.115.144.33/31 2001:2000:3080:ebc::2/126 fe80::ae1f:6bff:fee0:b10c/64
|
||||
eno1.992@eno1 UP 87.255.32.130/30 2a00:ec8::102/126 fe80::ae1f:6bff:fee0:b10c/64
|
||||
enp1s0f3.501@enp1s0f3 UP 80.249.211.161/21 2001:7f8:1::a500:8283:1/64 fe80::669d:99ff:feb1:31ac/64
|
||||
enp1s0f3.511@enp1s0f3 UP 194.62.128.38/24 2001:67c:608::f200:8283:1/64 fe80::669d:99ff:feb1:31ac/64
|
||||
bond0.100@bond0 UP 94.142.240.253/24 2a02:898:0:20::e2/64 fe80::669d:99ff:feb1:31af/64
|
||||
bond0.101@bond0 UP 172.28.3.253/24 fe80::669d:99ff:feb1:31af/64
|
||||
bond0.105@bond0 UP 185.52.225.14/28 2a02:898:0:21::e2/64 fe80::669d:99ff:feb1:31af/64
|
||||
bond0.130@bond0 UP 94.142.247.242/31 2a02:898:0:301::14/127 fe80::669d:99ff:feb1:31af/64
|
||||
bond0.2502@bond0 UP 37.139.140.27/31 2a00:a7c0:e20b:104::2/126 fe80::669d:99ff:feb1:31af/64
|
||||
```
|
||||
|
||||
{{< image width="400px" float="right" src="/assets/coloclue-vpp/traffic.jpg" alt="Traffic" >}}
|
||||
|
||||
Of course, VPP handles all the traffic _through_ the machine, and the only traffic that Linux will
|
||||
see is that which is destined to the controlplane (eg, to one of the IPv4 or IPv6 addresses or
|
||||
multicast/broadcast groups that they are participating in), so things like tcpdump or SNMP won't
|
||||
really work.
|
||||
|
||||
However, thanks to my [[vpp-snmp-agent](https://github.com/pimvanpelt/vpp-snmp-agent.git)], which
acts as an AgentX subagent behind an snmpd that in turn is running in the `dataplane` namespace, SNMP scrapes
|
||||
work as they did before, albeit with a few different interface names.
|
||||
|
||||
**6.** Earlier, I had failed over `keepalived` and stopped the service. This way, the peer router
|
||||
on `eunetworks-3` would pick up all outbound traffic to the virtual IPv4 and IPv6 for our users'
|
||||
default gateway. Because we're mainly interested in non-intrusively measuring the BGP beacon (which
|
||||
is forced to always go through this machine), and we know some of our members use BGP and
prefer this router because it's connected to AMS-IX, we make a decision to leave keepalived
|
||||
turned off for now.
|
||||
|
||||
But traffic is flowing, and in fact with a little bit more throughput, possibly because traffic flows
|
||||
faster when there's not 5% packet loss on certain egress paths? I don't know but OK, moving along!
|
||||
|
||||
## Results
|
||||
|
||||
Clearly VPP is a winner in this scenario. If you recall the traceroute from before the operation, the latency
|
||||
was good up until `nlams0.ipng.ch`, after which loss occurred and variance was very high. Rogier and
|
||||
I let the VPP instance run overnight, and started this traceroute after our maintenance was
|
||||
concluded:
|
||||
|
||||
```
|
||||
My traceroute [v0.94]
|
||||
squanchy.ipng.ch (194.1.163.90) -> 185.52.227.1 2023-02-25T09:48:46+0100
|
||||
Keys: Help Display mode Restart statistics Order of fields quit
|
||||
Packets Pings
|
||||
Host Loss% Snt Last Avg Best Wrst StDev
|
||||
1. chbtl0.ipng.ch 0.0% 51796 0.6 0.2 0.1 1.7 0.2
|
||||
2. chrma0.ipng.ch 0.0% 51796 1.6 1.0 0.9 5.5 1.2
|
||||
3. defra0.ipng.ch 0.0% 51796 7.0 6.5 6.4 27.7 1.9
|
||||
4. nlams0.ipng.ch 0.0% 51796 12.7 12.6 12.5 43.8 3.9
|
||||
5. bond0-105.eunetworks-2.router.nl.coloclue.net 0.0% 51796 13.3 13.0 12.8 138.9 11.1
|
||||
6. 185.52.227.1 0.0% 51796 13.6 12.7 12.3 46.6 8.3
|
||||
```
|
||||
|
||||
{{< image width="400px" float="right" src="/assets/coloclue-vpp/bill-clinton-zero.gif" alt="Clinton Zero" >}}
|
||||
|
||||
This mtr shows clear network weather with absolutely no packets dropped from Brüttisellen (near
|
||||
Zurich, Switzerland) all the way to the BGP beacon running in EUNetworks in Amsterdam. Considering
|
||||
I've been running VPP for a few years now, including writing the code necessary to plumb the
|
||||
dataplane interfaces through to Linux so that a higher order control plane (such as Bird, or FRR)
|
||||
can manipulate them, I am reasonably bullish, but I do hope to convert others.
|
||||
|
||||
This computer now forwards packets like a boss, its packet loss is →
|
||||
|
||||
Looking at the local situation, from a hypervisor running at IPng Networks in Equinix AM3 via FrysIX
|
||||
through VPP and into the dataplane of the Coloclue router `eunetworks-2`, shows quite reasonable
|
||||
throughput as well:
|
||||
|
||||
```
|
||||
root@eunetworks-2:~# traceroute hvn0.nlams3.ipng.ch
|
||||
traceroute to 46.20.243.179 (46.20.243.179), 30 hops max, 60 byte packets
|
||||
1 enp1s0f3.eunetworks-3.router.nl.coloclue.net (94.142.247.247) 0.087 ms 0.078 ms 0.071 ms
|
||||
2 frys-ix.ip-max.net (185.1.203.135) 1.288 ms 1.432 ms 1.479 ms
|
||||
3 hvn0.nlams3.ipng.ch (46.20.243.179) 0.524 ms 0.534 ms 0.531 ms
|
||||
|
||||
root@eunetworks-2:~# iperf3 -c 46.20.243.179 -P 10
|
||||
Connecting to host 46.20.243.179, port 5201
|
||||
...
|
||||
[SUM] 0.00-10.00 sec 6.70 GBytes 5.76 Gbits/sec 192 sender
|
||||
[SUM] 0.00-10.03 sec 6.58 GBytes 5.64 Gbits/sec receiver
|
||||
|
||||
root@eunetworks-2:~# iperf3 -c 46.20.243.179 -P 10 -R
|
||||
Connecting to host 46.20.243.179, port 5201
|
||||
Reverse mode, remote host 46.20.243.179 is sending
|
||||
...
|
||||
[SUM] 0.00-10.03 sec 6.07 GBytes 5.20 Gbits/sec 54623 sender
|
||||
[SUM] 0.00-10.00 sec 6.03 GBytes 5.18 Gbits/sec receiver
|
||||
```
|
||||
|
||||
And the smokepings look just plain gorgeous:
|
||||
|
||||
--------------- | ---------------------
|
||||
 | 
|
||||
|
||||
_The screenshots above are smokeping from (left) a machine at AS8283 Coloclue (in Amsterdam, the
|
||||
Netherlands), and from (right) a machine at AS8298 IPng (in Brüttisellen, Switzerland), both
|
||||
are showing no packetloss and clearly improved performance in end to end latency. Super!_
|
||||
|
||||
## What's next
|
||||
|
||||
The performance of the one router we upgraded definitely improved, no question about that. But
|
||||
there's a couple of things that I think we still need to do, so Rogier and I rolled back the change
|
||||
to the previous situation and kernel based routing.
|
||||
|
||||
* We didn't migrate keepalived, although IPng runs this in our DDLN [[colocation]({% post_url 2022-02-24-colo %})]
|
||||
site, so I'm pretty confident that it will work.
|
||||
* Kees and Ansible at Coloclue will need a few careful changes to facilitate ongoing automation:
think of dataplane and controlplane firewalls, sysctls (uRPF et al), fastnetmon, and so on, all of
which will need a meaningful overhaul.
|
||||
* There's an unexplained dataplane hang when Bird IPv6 enables a `stub` OSPF interface on interface
|
||||
`lo0`. We worked around that by putting the loopback IPv6 address on another interface, but this
|
||||
needs to be fully understood.
|
||||
* Completely unrelated to Coloclue, there's one dataplane hang regarding IPv6 RA/NS and/or BFD
|
||||
and/or Linux Control Plane that the VPP developer community is hunting down - it happens with
|
||||
my plugin but also with [[TNSR](http://www.netgate.com/tnsr)] (which uses the upstream `linux-cp` plugin).
|
||||
I've been working with a few folks from Netgate and customers of IPng Networks to try to find the root
|
||||
cause, as AS8298 has been bitten by this a few times over the last ~quarter or so. I cannot
|
||||
recommend in good faith running VPP until this is sorted out.
|
||||
|
||||
As an important side note, VPP is not well enough understood at Coloclue - rolling this out further
|
||||
risks making me a single point of failure in the networking committee, and I'm not comfortable taking
|
||||
that responsibility. I recommend that Coloclue network committee members gain experience with VPP,
|
||||
DPDK, `vppcfg` and the other ecosystem tools, and that at least the bird6 OSPF issue and possible
|
||||
IPv6 NS/RA issue are understood, before making the jump to the VPP world.
|