---
date: "2024-06-22T09:17:54Z"
title: VPP with loopback-only OSPFv3 - Part 2
aliases:
- /s/articles/2024/06/22/vpp-ospf-2.html
params:
asciinema: true
---
{{< image width="200px" float="right" src="/assets/vpp-ospf/bird-logo.svg" alt="Bird" >}}
# Introduction
When I first built IPng Networks AS8298, I decided to use OSPF as an IPv4 and IPv6 internal gateway
protocol. Back in March I took a look at two slightly different ways of doing this for IPng, notably
against a backdrop of conserving IPv4 addresses. As the network grows, the little point to point
transit networks between routers really start adding up.
I explored two potential solutions to this problem:
1. **[[Babel]({{< ref "2024-03-06-vpp-babel-1" >}})]** can use IPv6 nexthops for IPv4 destinations -
which is _super_ useful because it would allow me to retire all of the IPv4 /31 point to point
networks between my routers.
1. **[[OSPFv3]({{< ref "2024-04-06-vpp-ospf" >}})]** makes it difficult to use IPv6 nexthops for
IPv4 destinations, but in a discussion on the Bird Users mailing list, we found a way: by reusing
a single IPv4 loopback address on adjacent interfaces.
{{< image width="90px" float="left" src="/assets/vpp-ospf/canary.png" alt="Canary" >}}
In May I ran a modest set of two _canaries_, one between the two routers in my house (`chbtl0` and
`chbtl1`), and another between a router at the Daedalean colocation and Interxion datacenters (`ddln0`
and `chgtg0`). AS8298 has about a quarter of a /24 tied up in these otherwise pointless point-to-point
transit networks (see what I did there?). I want to reclaim these!
Seeing as the two tests went well, I decided to roll this out and make it official. This post
describes how I rolled out an (almost) IPv4-less core network for IPng Networks. It was actually way
easier than I had anticipated, and apparently I was not alone - several of my buddies in the
industry have asked me about it, so I thought I'd write a little bit about the configuration.
# Background: OSPFv3 with IPv4
***💩 /30: 4 addresses***: In the oldest of days, two routers that formed an IPv4 OSPF adjacency would
have a /30 _point-to-point_ transit network between them. Router A would have the lower available
IPv4 address, and Router B would have the upper available IPv4 address. The other two addresses in
the /30 would be the _network_ and _broadcast_ addresses of the prefix. Not a very efficient way to
do things, but back in the old days, IPv4 addresses were in infinite supply.
***🥈 /31: 2 addresses***: Enter [[RFC3021](https://datatracker.ietf.org/doc/html/rfc3021)], from
December 2000, which some might argue is also the old days. With ever-increasing pressure to
conserve IP address space on the Internet, it makes sense to consider where relatively minor changes
can be made to fielded practice to improve numbering efficiency. This RFC describes how to halve the
amount of address space assigned to point-to-point links (common throughout the Internet
infrastructure) by allowing the use of /31 prefixes for them. At some point, even our friends from
Latvia figured it out!
***🥇 /32: 1 address***: In most networks, each router has what is called a _loopback_ IPv4 and IPv6
address, typically a /32 and /128 in size. This allows the router to select a unique address that is
not bound to any given interface. It comes in handy in many ways -- for example to have stable
addresses to manage the router, and to allow it to connect to iBGP route reflectors and peers from
well known addresses.
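The address accounting above can be sanity-checked with Python's `ipaddress` module. This is purely
illustrative, using documentation prefixes rather than IPng's addresses:

```
import ipaddress

# /30: four addresses, of which only two are usable by routers
p30 = ipaddress.ip_network("192.0.2.0/30")
print(p30.network_address, p30.broadcast_address)  # 192.0.2.0 192.0.2.3
print([str(h) for h in p30.hosts()])               # ['192.0.2.1', '192.0.2.2']

# /31 (RFC 3021): both addresses are usable, no network/broadcast overhead
p31 = ipaddress.ip_network("192.0.2.0/31")
print([str(h) for h in p31.hosts()])               # ['192.0.2.0', '192.0.2.1']

# /32: a single loopback address, no transit subnet at all
p32 = ipaddress.ip_network("192.0.2.1/32")
print([str(h) for h in p32.hosts()])               # ['192.0.2.1']
```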
As it so turns out, two routers that form an adjacency can advertise ~any IPv4 address as nexthop,
provided that their adjacent peer knows how to find that address. Of course, with a /30 or /31 this
is obvious: if I have a directly connected /31, I can simply ARP for the other side, learn its MAC
address, and use that to forward traffic to the other router.
### The Trick
What would it look like if there's no subnet that directly connects two adjacent routers? Well, I
happen to know that RouterA and RouterB both have a /32 loopback address. So if I simply let RouterA
(1) advertise _its loopback_ address to neighbor RouterB, and also (2) answer ARP requests for that
address, the two routers should be able to form an adjacency. This is exactly what Ondrej's [[Bird2
commit (1)](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)] and
my [[VPP gerrit (2)](https://gerrit.fd.io/r/c/vpp/+/40482)] accomplish, as perfect partners:
1. Ondrej's change will make the Link LSA be _onlink_, which is a way to describe that the next hop
is not directly connected, in other words RouterB will be at nexthop `192.0.2.1`, while
RouterA itself is `192.0.2.0/32`.
1. My change will make VPP answer for ARP requests in such a scenario where RouterA with an
_unnumbered_ interface with `192.0.2.0/32` will respond to a request from the not directly
connected _onlink_ peer RouterB at `192.0.2.1`.
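In kernel terms, the end state of this trick is an _onlink_ gateway route. As a hypothetical manual
equivalent in `iproute2` commands (Bird and VPP program this automatically; `eth0` here stands in
for the real interface):

```
# On RouterA: the loopback /32 is its only IPv4 address
ip addr add 192.0.2.0/32 dev lo

# RouterB's loopback is reachable directly on the link, even though no
# subnet is shared; 'onlink' suppresses the gateway reachability check
ip route add 192.0.2.1/32 via 192.0.2.1 dev eth0 onlink
```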
## Rolling out P2P-less OSPFv3
### 1. Upgrade VPP + Bird2
First order of business is to upgrade all routers. I need a VPP version with the [[ARP
gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)] and a Bird2 version with the [[OSPFv3
commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)]. I build
a set of Debian packages on `bookworm-builder` and upload them to IPng's website
[[ref](https://ipng.ch/media/vpp/bookworm/)].
I schedule two nightly maintenance windows. In the first one, I'll upgrade two routers (`frggh0`
and `ddln1`) as canaries. I'll let them run for a few days, and then wheel over the rest
after I'm confident there are no regressions.
For each router, I will first _drain it_: this means in Kees, setting the OSPFv2 and OSPFv3 cost of
routers neighboring it to a higher number, so that traffic flows around the 'expensive' link. I will
also move the eBGP sessions into _shutdown_ mode, which will make the BGP sessions stay connected,
but the router will not announce any prefixes nor accept any from peers. Without it announcing or
learning any prefixes, the router stops seeing traffic. After about 10 minutes, it is safe to make
intrusive changes to it.
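Concretely, a drain means the neighbors' generated Bird configuration temporarily carries an
inflated cost. A sketch, borrowing the interface name and cost from the configuration shown below
(the 1000-point bump is just my convention):

```
area 0 {
  # drained: cost raised from 56 to 1056, so traffic routes around this link
  interface "xe1-0.304" { type pointopoint; cost 1056; bfd on; };
};
```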
Seeing as I'll be moving from OSPFv2 to OSPFv3, I will allow for a seamless transition by
configuring both protocols to run at the same time. The filter that applies to both flavors of OSPF
is the same: I will only allow more specifics of IPng's own prefixes to be propagated, and in
particular I'll drop all prefixes that come from BGP. I'll rename the protocol called `ospf4` to
`ospf4_old`, and create a new (OSPFv3) protocol called `ospf4` which has only the loopback interface
in it. This way, when I'm done, the final running protocol will simply be called `ospf4`:
```
filter f_ospf {
  if (source = RTS_BGP) then reject;
  if (net ~ [ 92.119.38.0/24{25,32}, 194.1.163.0/24{25,32}, 194.126.235.0/24{25,32} ]) then accept;
  if (net ~ [ 2001:678:d78::/48{56,128}, 2a0b:dd80:3000::/36{48,48} ]) then accept;
  reject;
}

protocol ospf v2 ospf4_old {
  ipv4 { export filter f_ospf; import filter f_ospf; };
  area 0 {
    interface "loop0" { stub yes; };
    interface "xe1-1.302" { type pointopoint; cost 61; bfd on; };
    interface "xe1-0.304" { type pointopoint; cost 56; bfd on; };
  };
}

protocol ospf v3 ospf4 {
  ipv4 { export filter f_ospf; import filter f_ospf; };
  area 0 {
    interface "loop0","lo" { stub yes; };
  };
}
```
In one terminal, I will start a ping to the router's IPv4 loopback:
```
pim@summer:~$ ping defra0.ipng.ch
PING defra0.ipng.ch (194.1.163.7) 56(84) bytes of data.
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=1 ttl=61 time=6.94 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=2 ttl=61 time=7.00 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=3 ttl=61 time=7.03 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=4 ttl=61 time=7.03 ms
...
```
While in the other, I log in to the IPng Site Local connection to the router's management plane, to
perform the upgrade:
```
pim@squanchy:~$ ssh defra0.net.ipng.ch
pim@defra0:~$ wget -m --no-parent https://ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/
pim@defra0:~$ cd ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/
pim@defra0:~$ sudo nsenter --net=/var/run/netns/dataplane
root@defra0:~# pkill -9 vpp && systemctl stop bird-dataplane vpp && \
dpkg -i ~pim/ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/*.deb && \
dpkg -i ~pim/bird2_2.15.1_amd64.deb && \
systemctl start bird-dataplane && \
systemctl restart vpp-snmp-agent-dataplane vpp-exporter-dataplane
```
Then comes the small window of awkward staring at the ping I started in the other terminal. It
always makes me smile how quickly it all comes back: within 90 seconds the router is back
online and fully converged with BGP:
```
pim@summer:~$ ping defra0.ipng.ch
PING defra0.ipng.ch (194.1.163.7) 56(84) bytes of data.
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=1 ttl=61 time=6.94 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=2 ttl=61 time=7.00 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=3 ttl=61 time=7.03 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=4 ttl=61 time=7.03 ms
...
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=94 ttl=61 time=1003.83 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=95 ttl=61 time=7.03 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=96 ttl=61 time=7.02 ms
64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=97 ttl=61 time=7.03 ms
pim@defra0:~$ birdc show ospf nei
BIRD v2.15.1-4-g280daed5-x ready.
ospf4_old:
Router ID Pri State DTime Interface Router IP
194.1.163.8 1 Full/PtP 32.113 xe1-1.302 194.1.163.27
194.1.163.0 1 Full/PtP 30.936 xe1-0.304 194.1.163.24
ospf4:
Router ID Pri State DTime Interface Router IP
ospf6:
Router ID Pri State DTime Interface Router IP
194.1.163.8 1 Full/PtP 32.113 xe1-1.302 fe80::3eec:efff:fe46:68a8
194.1.163.0 1 Full/PtP 30.936 xe1-0.304 fe80::6a05:caff:fe32:4616
```
I can see that the OSPFv2 adjacencies have reformed, which is totally expected. Looking at the
router's current addresses:
```
pim@defra0:~$ ip -br a | grep UP
loop0 UP 194.1.163.7/32 2001:678:d78::7/128 fe80::dcad:ff:fe00:0/64
xe1-0 UP fe80::6a05:caff:fe32:3e48/64
xe1-1 UP fe80::6a05:caff:fe32:3e49/64
xe1-2 UP fe80::6a05:caff:fe32:3e4a/64
xe1-3 UP fe80::6a05:caff:fe32:3e4b/64
xe1-0.304@xe1-0 UP 194.1.163.25/31 2001:678:d78::2:7:2/112 fe80::6a05:caff:fe32:3e48/64
xe1-1.302@xe1-1 UP 194.1.163.26/31 2001:678:d78::2:8:1/112 fe80::6a05:caff:fe32:3e49/64
xe1-2.441@xe1-2 UP 46.20.246.51/29 2a02:2528:ff01::3/64 fe80::6a05:caff:fe32:3e4a/64
xe1-2.503@xe1-2 UP 80.81.197.38/21 2001:7f8::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64
xe1-2.514@xe1-2 UP 185.1.210.235/23 2001:7f8:3d::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64
xe1-2.515@xe1-2 UP 185.1.208.84/23 2001:7f8:44::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64
xe1-2.516@xe1-2 UP 185.1.171.43/23 2001:7f8:9e::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64
xe1-3.900@xe1-3 UP 193.189.83.55/23 2001:7f8:33::a100:8298:1/64 fe80::6a05:caff:fe32:3e4b/64
xe1-3.2003@xe1-3 UP 185.1.155.116/24 2a0c:b641:701::8298:1/64 fe80::6a05:caff:fe32:3e4b/64
xe1-3.3145@xe1-3 UP 185.1.167.136/23 2001:7f8:f2:e1::8298:1/64 fe80::6a05:caff:fe32:3e4b/64
xe1-3.1405@xe1-3 UP 80.77.16.214/30 2a00:f820:839::2/64 fe80::6a05:caff:fe32:3e4b/64
```
Take a look at interfaces `xe1-0.304` which is southbound from Frankfurt to Zurich
(`chrma0.ipng.ch`) and `xe1-1.302` which is northbound from Frankfurt to Amsterdam
(`nlams0.ipng.ch`). I am going to get rid of the IPv4 and IPv6 global unicast addresses on these two
interfaces, and let OSPFv3 borrow the IPv4 address from `loop0` instead.
But first, rinse and repeat, until all routers are upgraded.
### 2. A situational overview
First, let me draw a diagram that helps show what I'm about to do:
{{< image src="/assets/vpp-ospf/BTL-GTG-RMA Before.svg" alt="Step 2: Before" >}}
In the network overview I've drawn four of IPng's routers. The ones at the bottom are the two
routers at my office in Br&uuml;ttisellen, Switzerland, which explains their names `chbtl0` and
`chbtl1`, and they are connected via a local fiber trunk using 10Gig optics (drawn in <span
style='color:red;font-weight:bold;'>red</span>). On the left, the first router is connected via a
10G Ethernet-over-MPLS link (depicted in <span style='color:green;font-weight:bold;'>green</span>)
to the NTT Datacenter in R&uuml;mlang. From there, IPng rents a 25Gbps wavelength to the Interxion
datacenter in Glattbrugg (shown in <span style='color:blue;font-weight:bold;'>blue</span>). Finally,
the Interxion router connects back to Br&uuml;ttisellen using a 10G Ethernet-over-MPLS link (colored
in <span style='color:#ff00ff;font-weight:bold;'>pink</span>), completing the ring.
You can also see that each router has a set of _loopback_ addresses, for example `chbtl0` in the
bottom left has IPv4 address `194.1.163.3/32` and IPv6 address `2001:678:d78::3/128`. Each
point-to-point network is assigned one /31 and one /112, with each router taking one address on either side.
Counting them up real quick, I see **twelve IPv4 addresses** in this diagram. This is a classic OSPF
design pattern. I seek to save eight of these addresses!
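Tallying that up, as an illustrative back-of-the-envelope matching the diagram:

```
routers = 4      # chbtl0, chbtl1, chrma0, chgtg0
p2p_links = 4    # red, green, blue and pink

loopbacks = routers * 1     # one /32 per router
transit = p2p_links * 2     # one /31 (two addresses) per link

print(loopbacks + transit)  # 12 addresses in the diagram
print(transit)              # 8 of which can be reclaimed
```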
### 3. First OSPFv3 link
The rollout has to start somewhere, and I decide to start close to home, literally. I'm going to
remove the IPv4 and IPv6 addresses from the <span style='color:red;font-weight:bold;'>red</span> link between the two
routers in Br&uuml;ttisellen. They are directly connected, and if anything goes wrong, I can walk
over and rescue them. Sounds like a safe way to start!
I quickly add the ability for [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] to configure
_unnumbered_ interfaces. In VPP, these are interfaces that don't have an IPv4 or IPv6 address of
their own, but they borrow one from another interface. If you're curious, you can take a look at the
[[User Guide](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md#interfaces)] in the
repository.
Looking at their `vppcfg` files, the change is actually very easy, taking as an example the
configuration file for `chbtl0.ipng.ch`:
```
loopbacks:
  loop0:
    description: 'Core: chbtl1.ipng.ch'
    addresses: ['194.1.163.3/32', '2001:678:d78::3/128']
    lcp: loop0
    mtu: 9000
interfaces:
  TenGigabitEthernet6/0/0:
    device-type: dpdk
    description: 'Core: chbtl1.ipng.ch'
    mtu: 9000
    lcp: xe1-0
    # addresses: [ '194.1.163.20/31', '2001:678:d78::2:5:1/112' ]
    unnumbered: loop0
```
By commenting out the `addresses` field, and replacing it with `unnumbered: loop0`, I instruct
vppcfg to make Te6/0/0, which in Linux is called `xe1-0`, borrow its addresses from the loopback
interface `loop0`.
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="brain" >}}
Planning and applying this is straightforward, but there's one detail I should
mention. In my [[previous article]({{< ref "2024-04-06-vpp-ospf" >}})] I asked myself a question:
would it be better to leave the addresses unconfigured in Linux, or would it be better to make the
Linux Control Plane plugin carry forward the borrowed addresses? In the end, I decided to _not_ copy
them forward. VPP will be aware of the addresses, but Linux will only carry them on the `loop0`
interface.
In the article, you'll see that discussed as _Solution 2_, and it includes a bit of rationale why I
find this better. I implemented it in this
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
case you're curious, and the commandline keyword is `lcp lcp-sync-unnumbered off` (the default is
_on_).
```
pim@chbtl0:~$ vppcfg plan -c /etc/vpp/vppcfg.yaml
[INFO ] root.main: Loading configfile /etc/vpp/vppcfg.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 24.06-rc0~183-gb0d433978
comment { vppcfg prune: 2 CLI statement(s) follow }
set interface ip address del TenGigabitEthernet6/0/0 194.1.163.20/31
set interface ip address del TenGigabitEthernet6/0/0 2001:678:d78::2:5:1/112
comment { vppcfg sync: 1 CLI statement(s) follow }
set interface unnumbered TenGigabitEthernet6/0/0 use loop0
[INFO ] vppcfg.reconciler.write: Wrote 5 lines to (stdout)
[INFO ] root.main: Planning succeeded
pim@chbtl0:~$ vppcfg show int addr TenGigabitEthernet6/0/0
TenGigabitEthernet6/0/0 (up):
unnumbered, use loop0
L3 194.1.163.3/32
L3 2001:678:d78::3/128
pim@chbtl0:~$ vppctl show lcp | grep TenGigabitEthernet6/0/0
itf-pair: [9] TenGigabitEthernet6/0/0 tap9 xe1-0 65 type tap netns dataplane
pim@chbtl0:~$ ip -br a | grep UP
xe0-0 UP fe80::92e2:baff:fe3f:cad4/64
xe0-1 UP fe80::92e2:baff:fe3f:cad5/64
xe0-1.400@xe0-1 UP fe80::92e2:baff:fe3f:cad4/64
xe0-1.400.10@xe0-1.400 UP 194.1.163.16/31 2001:678:d78::2:3:1/112 fe80::92e2:baff:fe3f:cad4/64
xe1-0 UP fe80::21b:21ff:fe55:1dbc/64
xe1-1.101@xe1-1 UP 194.1.163.65/27 2001:678:d78:3::1/64 fe80::14b4:c6ff:fe1e:68a3/64
xe1-1.179@xe1-1 UP 45.129.224.236/29 2a0e:5040:0:2::236/64 fe80::92e2:baff:fe3f:cad5/64
```
After applying this configuration, I can see that Te6/0/0 is indeed _unnumbered, use loop0_, noting
the IPv4 and IPv6 addresses that it borrowed. The second command shows that Te6/0/0
corresponds in Linux with `xe1-0`, and the third command lists the addresses as Linux sees
them: indeed, `xe1-0` only has a link-local address. Slick!
After applying this change, the OSPFv2 adjacency in the `ospf4_old` protocol expires, and I see the
routing table converge. A traceroute between `chbtl0` and `chbtl1` now takes a bit of a detour:
```
pim@chbtl0:~$ traceroute chbtl1.ipng.ch
traceroute to chbtl1 (194.1.163.4), 30 hops max, 60 byte packets
1 chrma0.ipng.ch (194.1.163.17) 0.981 ms 0.969 ms 0.953 ms
2 chgtg0.ipng.ch (194.1.163.9) 1.194 ms 1.192 ms 1.176 ms
3 chbtl1.ipng.ch (194.1.163.4) 1.875 ms 1.866 ms 1.911 ms
```
I can now introduce the very first OSPFv3 adjacency for IPv4, and I do this by moving the neighbor
from the `ospf4_old` protocol to the `ospf4` protocol. Of course, I also update `chbtl1` with the
_unnumbered_ interface on its `xe1-0`, and update OSPF there. And with that, something magical
happens:
```
pim@chbtl0:~$ birdc show ospf nei
BIRD v2.15.1-4-g280daed5-x ready.
ospf4_old:
Router ID Pri State DTime Interface Router IP
194.1.163.0 1 Full/PtP 30.571 xe0-1.400.10 fe80::266e:96ff:fe37:934c
ospf4:
Router ID Pri State DTime Interface Router IP
194.1.163.4 1 Full/PtP 31.955 xe1-0 fe80::9e69:b4ff:fe61:ff18
ospf6:
Router ID Pri State DTime Interface Router IP
194.1.163.4 1 Full/PtP 31.955 xe1-0 fe80::9e69:b4ff:fe61:ff18
194.1.163.0 1 Full/PtP 30.571 xe0-1.400.10 fe80::266e:96ff:fe37:934c
pim@chbtl0:~$ birdc show route protocol ospf4
BIRD v2.15.1-4-g280daed5-x ready.
Table master4:
194.1.163.4/32 unicast [ospf4 2024-05-19 20:58:04] * I (150/2) [194.1.163.4]
via 194.1.163.4 on xe1-0 onlink
194.1.163.64/27 unicast [ospf4 2024-05-19 20:58:04] E2 (150/2/10000) [194.1.163.4]
via 194.1.163.4 on xe1-0 onlink
```
Aww, would you look at that! Especially the first entry is interesting to me. It says that this
router has learned the address `194.1.163.4/32`, the loopback address of `chbtl1` via nexthop
**also** `194.1.163.4` on interface `xe1-0` with a flag _onlink_.
The kernel routing table agrees with this construction:
```
pim@chbtl0:~$ ip ro get 194.1.163.4
194.1.163.4 via 194.1.163.4 dev xe1-0 src 194.1.163.3 uid 1000
cache
```
What this construction tells the kernel is that it should ARP for `194.1.163.4` using local
address `194.1.163.3`, to which VPP on the other side will respond, thanks to my [[VPP ARP
gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. As such, I now expect a corresponding FIB entry in VPP:
```
pim@chbtl0:~$ vppctl show ip fib 194.1.163.4
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ]
194.1.163.4/32 fib:0 index:973099 locks:3
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[189] locks:98 flags:shared,popular, uPRF-list:507 len:1 itfs:[36, ]
path:[166] pl-index:189 ip4 weight=1 pref=32 attached-nexthop: oper-flags:resolved,
194.1.163.4 TenGigabitEthernet6/0/0
[@0]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800
adjacency refs:1 entry-flags:attached, src-flags:added, cover:-1
path-list:[1025] locks:1 uPRF-list:1521 len:1 itfs:[36, ]
path:[379] pl-index:1025 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
194.1.163.4 TenGigabitEthernet6/0/0
[@0]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800
Extensions:
path:379
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:848961 buckets:1 uRPF:507 to:[1966944:611861009]]
[0] [@5]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800
```
Nice work, VPP and Bird2! I confirm that I can ping the neighbor again, and that the traceroute is
direct rather than the scenic route from before, and I validate that IPv6 still works for good
measure:
```
pim@chbtl0:~$ ping -4 chbtl1.ipng.ch
PING 194.1.163.4 (194.1.163.4) 56(84) bytes of data.
64 bytes from 194.1.163.4: icmp_seq=1 ttl=63 time=0.169 ms
64 bytes from 194.1.163.4: icmp_seq=2 ttl=63 time=0.283 ms
64 bytes from 194.1.163.4: icmp_seq=3 ttl=63 time=0.232 ms
64 bytes from 194.1.163.4: icmp_seq=4 ttl=63 time=0.271 ms
^C
--- 194.1.163.4 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 0.163/0.233/0.276/0.045 ms
pim@chbtl0:~$ traceroute chbtl1.ipng.ch
traceroute to chbtl1 (194.1.163.4), 30 hops max, 60 byte packets
1 chbtl1.ipng.ch (194.1.163.4) 0.190 ms 0.176 ms 0.147 ms
pim@chbtl0:~$ ping6 chbtl1.ipng.ch
PING chbtl1.ipng.ch(chbtl1.ipng.ch (2001:678:d78::4)) 56 data bytes
64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=1 ttl=64 time=0.205 ms
64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=2 ttl=64 time=0.203 ms
64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=3 ttl=64 time=0.213 ms
64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=4 ttl=64 time=0.219 ms
^C
--- chbtl1.ipng.ch ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3068ms
rtt min/avg/max/mdev = 0.203/0.210/0.219/0.006 ms
pim@chbtl0:~$ traceroute6 chbtl1.ipng.ch
traceroute to chbtl1.ipng.ch (2001:678:d78::4), 30 hops max, 80 byte packets
1 chbtl1.ipng.ch (2001:678:d78::4) 0.163 ms 0.147 ms 0.124 ms
```
### 4. From one to two
{{< image width="300px" float="right" src="/assets/vpp-ospf/OSPFv3 Rollout Step 4.svg" alt="Step 4: Canary" >}}
At this point I have two IPv4 IGPs running. This is not ideal, but it's also not completely broken,
because the OSPF filter allows the routers to learn and propagate any more specific prefix from
`194.1.163.0/24`. This way, the legacy OSPFv2 called `ospf4_old` and this new OSPFv3 called `ospf4`
will be aware of all routes. Bird will learn them twice, and routing decisions may be a bit funky
because the OSPF protocols learn the routes from each other as OSPF-E2. There are two implications
of this:
1. It means that the routes that are learned from the other OSPF protocol will have a fixed metric
(==cost), and for the time being, I won't be able to cleanly add up link costs between the
routers that are speaking OSPFv2 and those that are speaking OSPFv3.
1. If an OSPF External Type E1 and Type E2 route exist to the same destination, the E1 route will
always be preferred irrespective of the metric. This means that within the routers that speak
OSPFv2, cost will remain consistent; and also within the routers that speak OSPFv3, it will be
consistent. Between them, routes will be learned, but cost will be roughly meaningless.
I upgrade another link, between router `chgtg0` and `ddln0` at my [[colo]({{< ref "2022-02-24-colo" >}})], which is connected via a 10G EoMPLS link from a local telco called Solnet. The
colo, similar to IPng's office, has two redundant 10G uplinks, so if things were to fall apart, I
can always quickly shutdown the offending link (thereby removing OSPFv3 adjacencies), and traffic
will reroute. I have created two islands of OSPFv3, drawn in <span
style='color:orange;font-weight:bold'>orange</span>, with exactly two links using IPv4-less point to
point networks. I let this run for a few weeks, to make sure things do not fail in mysterious ways.
### 5. From two to many
{{< image width="300px" float="right" src="/assets/vpp-ospf/OSPFv3 Rollout Step 5.svg" alt="Step 5: Zurich" >}}
From this point on it's just rinse-and-repeat. For each backbone link:
1. I will drain the backbone link I'm about to work on, by raising OSPFv2 and OSPFv3 cost on both
sides. If the cost was, say, 56, I will temporarily make that 1056. This will make traffic
avoid using the link if at all possible. Due to redundancy, every router has (at least) two
backbone links. Traffic will be diverted.
1. I first change the VPP router's `vppcfg.yaml` to remove the p2p addresses and replace them with
an `unnumbered: loop0` instead. I apply the diff, and the OSPF adjacency breaks for IPv4.
The BFD adjacency for IPv4 will disappear. Curiously, the IPv6 adjacency stays up, because
OSPFv3 adjacencies use link-local addresses.
1. I move the interface section of the old OSPFv2 `ospf4_old` protocol to the new OSPFv3
`ospf4` protocol, which will also use link-local addresses to form adjacencies. The two routers
will exchange Link LSA and be able to find each other directly connected. Now the link is
running **two** OSPFv3 protocols, each in their own address family. They will share the same BFD
session.
1. I finally undrain the link by setting the OSPF link cost back to what it was. This link is now
a part of the OSPFv3 part of the network.
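Step 3, using the `defra0` configuration from earlier as an example, amounts to moving one
interface stanza between the two protocols. A sketch of the end state after migrating `xe1-0.304`:

```
protocol ospf v2 ospf4_old {
  ipv4 { export filter f_ospf; import filter f_ospf; };
  area 0 {
    interface "loop0" { stub yes; };
    # xe1-0.304 moved to the OSPFv3 'ospf4' protocol below
    interface "xe1-1.302" { type pointopoint; cost 61; bfd on; };
  };
}
protocol ospf v3 ospf4 {
  ipv4 { export filter f_ospf; import filter f_ospf; };
  area 0 {
    interface "loop0","lo" { stub yes; };
    interface "xe1-0.304" { type pointopoint; cost 56; bfd on; };
  };
}
```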
I work my way through the network. The first one I do is the link between `chgtg0` and `chbtl1`
(which I've colored in the diagram in <span style='color:#ff00ff;font-weight:bold'>pink</span>), so
that there are four contiguous OSPFv3 links, spanning chbtl0 - chbtl1 - chgtg0 - ddln0. I
continuously run a traceroute to a machine that is directly connected behind `ddln0`, and also use
RIPE Atlas and the NLNOG Ring to ensure that I have reachability:
```
pim@squanchy:~$ traceroute ipng.mm.fcix.net
traceroute to ipng.mm.fcix.net (194.1.163.59), 64 hops max, 40 byte packets
1 chbtl0 (194.1.163.65) 0.279 ms 0.362 ms 0.249 ms
2 chbtl1 (194.1.163.3) 0.455 ms 0.394 ms 0.384 ms
3 chgtg0 (194.1.163.1) 1.302 ms 1.296 ms 1.294 ms
4 ddln0 (194.1.163.5) 2.232 ms 2.385 ms 2.322 ms
5 mm0.ddln0.ipng.ch (194.1.163.59) 2.377 ms 2.577 ms 2.364 ms
```
I work my way outwards from there. First completing the ring chbtl0 - chrma0 - chgtg0 - chbtl1, and
then completing the ring ddln0 - ddln1 - chrma0 - chgtg0, after which the Zurich metro area is
converted. I then work my way clockwise from Zurich to Geneva, Paris, Lille, Amsterdam, Frankfurt,
and end up with the last link completing the set: defra0 - chrma0.
## Results
{{< image width="300px" float="right" src="/assets/vpp-ospf/OSPFv3 Rollout After.svg" alt="OSPFv3: After" >}}
In total I reconfigure thirteen backbone links, and they all become _unnumbered_ using the router's
loopback addresses for IPv4 and IPv6, and they all switch over from their OSPFv2 IGP to the new
OSPFv3 IGP; the total number of routers running the old IGP shrinks until there are none left. Once
that happens, I can simply remove the OSPFv2 protocol called `ospf4_old`, and keep the two OSPFv3
protocols now intuitively called `ospf4` and `ospf6`. Nice.
This maintenance isn't super intrusive. For IPng's customers, latency goes up from time to time as
backbone links are drained, the link is reconfigured to become unnumbered and OSPFv3, and put back
into service. The whole operation takes a few hours, and I enjoy the repetitive tasks, getting
pretty good at the drain-reconfigure-undrain cycle after a while.
It looks really cool on transit routers, like this one in Lille, France:
```
pim@frggh0:~$ ip -br a | grep UP
loop0 UP 194.1.163.10/32 2001:678:d78::a/128 fe80::dcad:ff:fe00:0/64
xe0-0 UP 193.34.197.143/25 2001:7f8:6d::8298:1/64 fe80::3eec:efff:fe70:24a/64
xe0-1 UP fe80::3eec:efff:fe70:24b/64
xe1-0 UP fe80::6a05:caff:fe32:45ac/64
xe1-1 UP fe80::6a05:caff:fe32:45ad/64
xe1-2 UP fe80::6a05:caff:fe32:45ae/64
xe1-2.100@xe1-2 UP fe80::6a05:caff:fe32:45ae/64
xe1-2.200@xe1-2 UP fe80::6a05:caff:fe32:45ae/64
xe1-2.391@xe1-2 UP 46.20.247.3/29 2a02:2528:ff03::3/64 fe80::6a05:caff:fe32:45ae/64
xe0-1.100@xe0-1 UP 194.1.163.137/29 2001:678:d78:6::1/64 fe80::3eec:efff:fe70:24b/64
pim@frggh0:~$ birdc show bfd ses
BIRD v2.15.1-4-g280daed5-x ready.
bfd1:
IP address Interface State Since Interval Timeout
fe80::3eec:efff:fe46:68a9 xe1-2.200 Up 2024-06-19 20:16:58 0.100 3.000
fe80::6a05:caff:fe32:3e38 xe1-2.100 Up 2024-06-19 20:13:11 0.100 3.000
pim@frggh0:~$ birdc show ospf nei
BIRD v2.15.1-4-g280daed5-x ready.
ospf4:
Router ID Pri State DTime Interface Router IP
194.1.163.9 1 Full/PtP 34.947 xe1-2.100 fe80::6a05:caff:fe32:3e38
194.1.163.8 1 Full/PtP 31.940 xe1-2.200 fe80::3eec:efff:fe46:68a9
ospf6:
Router ID Pri State DTime Interface Router IP
194.1.163.9 1 Full/PtP 34.947 xe1-2.100 fe80::6a05:caff:fe32:3e38
194.1.163.8 1 Full/PtP 31.940 xe1-2.200 fe80::3eec:efff:fe46:68a9
```
You can see here that the router indeed has an IPv4 loopback address 194.1.163.10/32, and
2001:678:d78::a/128. It has two backbone links, on `xe1-2.100` towards Paris and `xe1-2.200`
towards Amsterdam. Judging by the time between the BFD sessions, it took me somewhere around four
minutes to drain, reconfigure, and undrain each link. I kept on listening to Nora en Pure's
[[Episode #408](https://www.youtube.com/watch?v=AzfCrOEW7e8)] the whole time.
### A traceroute
The beauty of this solution is that the routers will still have one IPv4 and IPv6 address, from
their `loop0` interface. The VPP dataplane will use this when generating ICMP error messages, for
example in a traceroute. It will look quite normal:
```
pim@squanchy:~/src/ipng.ch$ traceroute bit.nl
traceroute to bit.nl (213.136.12.97), 30 hops max, 60 byte packets
1 chbtl0.ipng.ch (194.1.163.65) 0.366 ms 0.408 ms 0.393 ms
2 chrma0.ipng.ch (194.1.163.0) 1.219 ms 1.252 ms 1.180 ms
3 defra0.ipng.ch (194.1.163.7) 6.943 ms 6.887 ms 6.922 ms
4 nlams0.ipng.ch (194.1.163.8) 12.882 ms 12.835 ms 12.910 ms
5 as12859.frys-ix.net (185.1.203.186) 14.028 ms 14.160 ms 14.436 ms
6 http-bit-ev-new.lb.network.bit.nl (213.136.12.97) 14.098 ms 14.671 ms 14.965 ms
pim@squanchy:~$ traceroute6 bit.nl
traceroute6 to bit.nl (2001:7b8:3:5::80:19), 64 hops max, 60 byte packets
1 chbtl0.ipng.ch (2001:678:d78:3::1) 0.871 ms 0.373 ms 0.304 ms
2 chrma0.ipng.ch (2001:678:d78::) 1.418 ms 1.387 ms 1.764 ms
3 defra0.ipng.ch (2001:678:d78::7) 6.974 ms 6.877 ms 6.912 ms
4 nlams0.ipng.ch (2001:678:d78::8) 13.023 ms 13.014 ms 13.013 ms
5 as12859.frys-ix.net (2001:7f8:10f::323b:186) 14.322 ms 14.181 ms 14.827 ms
6 http-bit-ev-new.lb.network.bit.nl (2001:7b8:3:5::80:19) 14.176 ms 14.24 ms 14.093 ms
```
The only difference from before is that these traceroute hops now come from the loopback addresses,
not the P2P transit links (e.g. the second hop, through `chrma0`, is now 194.1.163.0 and
2001:678:d78::, where before it would have been 194.1.163.17 and 2001:678:d78::2:3:2).
Subtle, but super dope.
### Link Flap Test
The proof is in the pudding, they say. After all of this link draining, reconfiguring and undraining,
I gain confidence that this stuff actually works as advertised! I thought it'd be a nice touch to
demonstrate a link drain, between Frankfurt and Amsterdam. I recorded a little screencast
[[asciinema](/assets/vpp-ospf/rollout.cast), [gif](/assets/vpp-ospf/rollout.gif)], shown here:
{{< asciinema src="/assets/vpp-ospf/rollout.cast" >}}
### Returning IPv4 (and IPv6!) addresses
Now that the backbone links no longer carry global unicast addresses, and they borrow from the one
IPv4 and IPv6 address in `loop0`, I can return a whole stack of addresses:
{{< image src="/assets/vpp-ospf/roi.png" alt="ROI" >}}
In total, I returned 34 IPv4 addresses from IPng's /24, which is 13.3%. This is huge, and I'm
confident that I will find a better use for these little addresses than being pointless
point-to-point links!
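For the record, that percentage is easy to verify (illustration only):

```
returned = 34    # IPv4 addresses handed back
total = 256      # one /24

print(f"{returned / total * 100:.1f}%")  # 13.3%
```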