---
date: "2024-04-06T10:17:54Z"
title: VPP with loopback-only OSPFv3 - Part 1
aliases:
- /s/articles/2024/04/07/vpp-ospf.html
params:
  asciinema: true
---

{{< image width="200px" float="right" src="/assets/vpp-ospf/bird-logo.svg" alt="Bird" >}}

# Introduction

A few weeks ago I took a good look at the [[Babel]({{< ref "2024-03-06-vpp-babel-1" >}})] protocol.
I found a set of features there that I really appreciated. The first is latency-aware routing:
this is useful for mesh (wireless) networks, but it is also a good fit for IPng's use case, because
IPng makes use of carrier ethernet. If any link in the underlying MPLS network fails, the carrier
will automatically re-route, but sometimes with much higher latency. In these cases, Babel can
reconverge on its own to a topology that has the lowest end to end latency.

But a second really cool find is that Babel can use IPv6 nexthops for IPv4 destinations - which is
_super_ useful because it will allow me to retire all of the IPv4 /31 point to point networks
between my routers. AS8298 has about half of a /24 tied up in these otherwise pointless (pun
intended) transit networks.

In the same week, my buddy Benoit asked a question about OSPFv3 on the Bird users mailing list
[[ref](https://www.mail-archive.com/bird-users@network.cz/msg07961.html)] which may or may not have
been because I had been messing around with Babel using only IPv4 loopback interfaces. And just a
few weeks before that, the incomparable Nico from [[Ungleich](https://ungleich.ch/)] had a very
similar question [[ref](https://bird.network.cz/pipermail/bird-users/2024-January/017370.html)].

The three of us have something in common - we're all trying to conserve IPv4 addresses!

# OSPFv3 with IPv4 🙁

Nico's thread referenced [[RFC 5838](https://datatracker.ietf.org/doc/html/rfc5838)] which defines
support for multiple address families in OSPFv3. It does this by mapping a given address family to
a specific instance of OSPFv3 using the _instance id_ and adding a new option to the _options field_
that tells neighbors that multiple address families are supported in this instance (and thus, that
the neighbor should not assume all link state advertisements are IPv6-only).

This way, multiple instances can run on the same router, and they will only form adjacencies with
neighbors that are operating in the same address family. This in itself doesn't change much: rather
than using IPv4 multicast in the hellos while forming adjacencies, OSPFv3 will use IPv6 link local
addresses for them.

RFC 5838, Section 2.5 says:
> Although IPv6 link local addresses could be used as next hops for IPv4 address families, it is
> desirable to have IPv4 next-hop addresses. [ ... ] In order to achieve this, the link's IPv4
> address will be advertised in the "link local address" field of the IPv4 instance's Link-LSA.
> This address is placed in the first 32 bits of the "link local address" field and is used for IPv4
> next-hop calculations.  The remaining bits MUST be set to zero.

First the RFC raises my hopes by saying IPv6 link local addresses _could_ be used as
next hops (just like Babel, yaay!), but then it goes on to say the link local address field will be
overridden with an IPv4 address in the top 32 bits. That's ... gross. I understand why this was
done, as it allows for a minimal deviation from the OSPFv3 protocol, but this unfortunate choice
precludes the use of IPv6 nexthops. Crap on a cracker!
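
To make the RFC's wording a bit more tangible, here is a minimal Python sketch of that encoding: an
IPv4 address in the first 32 bits of the 16-byte "link local address" field, with the remaining 96
bits set to zero. The function names are my own and purely illustrative. This is not how Bird
implements it, just the byte layout as I read Section 2.5.

```python
import ipaddress


def encode_link_lsa_address(ipv4: str) -> bytes:
    """Build the 16-byte field that carries an IPv4 next hop (RFC 5838, 2.5)."""
    v4 = ipaddress.IPv4Address(ipv4).packed  # 4 bytes, network byte order
    return v4 + bytes(12)                    # the remaining 96 bits MUST be zero


def decode_link_lsa_address(field: bytes) -> ipaddress.IPv4Address:
    """Recover the IPv4 next hop from the field, rejecting malformed input."""
    if len(field) != 16:
        raise ValueError("field must be exactly 16 bytes")
    if any(field[4:]):
        raise ValueError("trailing 96 bits must be zero")
    return ipaddress.IPv4Address(field[:4])


field = encode_link_lsa_address("192.168.10.2")
print(field.hex())                     # c0a80a02000000000000000000000000
print(ipaddress.IPv6Address(field))    # c0a8:a02::
print(decode_link_lsa_address(field))  # 192.168.10.2
```

Read as an IPv6 address, the field becomes `c0a8:a02::`, which is clearly not a link local address
anymore, and that is exactly why an IPv6 nexthop can no longer be signalled in it.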

# OSPFv3 with IPv4 🥰

But wait, not all is lost! Remember in my [[VPP Babel]({{< ref "2024-03-06-vpp-babel-1" >}})]
article I mentioned that VPP has this ability to run _unnumbered_ interfaces? To recap, this is a
configuration where a primary interface, typically a loopback, will have an IPv4 and IPv6 address,
say **192.168.10.2/32** and **2001:678:d78:200::2/128** and other interfaces will borrow from that.
That will allow for the IPv4 address to be present on multiple interfaces, like so:

```
pim@vpp0-2:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
loop0            UNKNOWN        192.168.10.2/32 2001:678:d78:200::2/128 fe80::dcad:ff:fe00:0/64
e0               UP             192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1120/64
e1               UP             192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1121/64
```

### VPP Changes

Historically in VPP, an interface on a broadcast medium like ethernet will respond to ARP requests
only if the requestor is in the same subnet. With these point to point interfaces, the remote will
_never_ be in the same subnet, because we're using /32 addresses here! VPP logs these as invalid ARP
requests. With a small change though, I can make VPP become tolerant of this scenario, and the
consensus in the VPP community is that this is OK.

{{< image src="/assets/vpp-ospf/vpp-diff1.png" alt="VPP Diff #1" >}}

Check out [[40482](https://gerrit.fd.io/r/c/vpp/+/40482)] for the full change, but in a nutshell,
just before deciding to return an error because the requesting source address is not directly
connected (called an `attached` route in VPP), I'll change the condition to allow for it, if and
only if the ARP request comes from an _unnumbered_ interface.

I think this is a good direction, if only because most other popular implementations (including
Linux, FreeBSD, Cisco IOS/XR and Juniper) will answer ARP requests that are onlink but not directly
connected, in the same way.
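
The decision itself is small enough to model in a few lines of Python. To be clear, this is a
simplified sketch of the logic described above and not VPP's actual code: the `Interface` class and
the `should_answer_arp()` helper are hypothetical names of mine, and the `/31` interface is made up
for contrast.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import List


@dataclass
class Interface:
    name: str
    connected: List[str] = field(default_factory=list)  # attached prefixes
    unnumbered: bool = False                             # borrows its address


def should_answer_arp(iface: Interface, sender_ip: str) -> bool:
    """Answer if the sender is attached, or if the interface is unnumbered."""
    sender = ipaddress.ip_address(sender_ip)
    attached = any(sender in ipaddress.ip_network(p) for p in iface.connected)
    # Old behavior: 'return attached'. The change adds the unnumbered exception.
    return attached or iface.unnumbered


e0 = Interface("e0", ["192.168.10.3/32"], unnumbered=True)
e1 = Interface("e1", ["192.168.10.4/31"])  # hypothetical numbered interface

print(should_answer_arp(e0, "192.168.10.2"))  # True: unnumbered, so tolerated
print(should_answer_arp(e1, "192.168.10.5"))  # True: sender is attached
print(should_answer_arp(e1, "192.168.10.2"))  # False: not attached, numbered
```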

### Bird2 Changes

Meanwhile, in the Bird community, we were thinking about solving this problem in a different way.
Babel has a feature that uses IPv6 transit networks for IPv4 destinations, enabled with an option
called `extended next hop`. With this option, Babel will set a nexthop across address families. It
may sound freaky at first, but it's not too strange when you think about it. Take a look at my
explanation in the [[Babel]({{< ref "2024-03-06-vpp-babel-1" >}})] article on how IPv6 neighbor
discovery can take the place of IPv4 ARP resolution to figure out the ethernet next hop.

So our initial take was: why don't we do that with OSPFv3 as well? We thought of a trick to
get that Link LSA hack from RFC5838 removed: what if Bird, upon setting the `extended next hop`
feature on an interface, would simply put the IPv6 address back like it was, rather than overwriting
it with the IPv4 address? That way, we'd just learn routes to IPv4 destinations with nexthops on
IPv6 linklocal addresses. It would break compatibility with other vendors, but seeing as it is an
optional feature which defaults to off, perhaps it is a reasonable compromise...

Ondrej started to work on it, but came back a few days later with a different solution, which is
quite clever. Any IPv4 router needs at least one IPv4 address anyway, to be able to send ICMP
messages, so there is no need to put IPv4 addresses on links.  Ondrej's theory corroborates my
previous comments for Babel's IPv4-less routing:

> I’ve learned so far that I (a) MAY use IPv6 link-local networks in order to forward IPv4 packets,
> as I can use IPv6 NDP to find the link-layer next hop; and (b) each router SHOULD be able to
> originate ICMPv4 packets, therefore it needs at least one IPv4 address.
>
> These two claims mean that I need at most one IPv4 address on each router.

Ondrej's proposal for Bird2 will, when OSPFv3 is used with IPv4 destinations, keep the RFC5838
behavior and try to _find_ a working IPv4 address to put in the Link LSA:
{{< image src="/assets/vpp-ospf/bird-diff1.png" alt="Bird Diff #1" >}}

He adds a function `update_loopback_addr()`, which scans all interfaces for an IPv4 address, and if
there are multiple, prefers host addresses, then addresses from OSPF stub interfaces, and finally
just any old IPv4 address. That IPv4 address can then simply be put in the Link LSA. Slick!
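
That preference order is easy to sketch in Python. This is only a model of the selection as
described above, not a port of Ondrej's C code: the `Addr` tuple, the `pick_loopback_addr()` name
and the `lan0` interface with its `10.0.0.1/24` address are all made up for illustration.

```python
import ipaddress
from typing import List, NamedTuple, Optional


class Addr(NamedTuple):
    ifname: str
    prefix: str  # address with prefix length, e.g. "192.168.10.3/32"
    stub: bool   # True if the interface is an OSPF stub interface


def pick_loopback_addr(addrs: List[Addr]) -> Optional[str]:
    """Prefer host (/32) addresses, then stub interfaces, then anything."""
    def rank(a: Addr) -> int:
        if ipaddress.ip_interface(a.prefix).network.prefixlen == 32:
            return 0
        return 1 if a.stub else 2

    v4 = [a for a in addrs if ipaddress.ip_interface(a.prefix).version == 4]
    if not v4:
        return None  # no IPv4 address anywhere on this router
    best = min(v4, key=rank)  # min() keeps the first of equally ranked entries
    return str(ipaddress.ip_interface(best.prefix).ip)


addrs = [
    Addr("lan0", "10.0.0.1/24", stub=False),
    Addr("loop0", "2001:678:d78:200::3/128", stub=True),
    Addr("loop0", "192.168.10.3/32", stub=True),
]
print(pick_loopback_addr(addrs))  # 192.168.10.3
```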

His change also removes the next-hop-in-address-range check for OSPFv3 when using IPv4, and
automatically adds the onlink flag to such routes, so that Bird now accepts next hops that are not
directly connected:
{{< image src="/assets/vpp-ospf/bird-diff2.png" alt="Bird Diff #2" >}}

I realize when reading the code that this change and my [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]
are perfect partners:
1.   Ondrej's change will mark the nexthop learned from the Link LSA as _onlink_, which is a way to
     describe that the next hop is not directly connected, in other words nexthop `192.168.10.3/32`,
     while the router itself is `192.168.10.2/32`.
1.   My change will make VPP answer ARP requests in such a scenario, where the router with an
     _unnumbered_ interface with `192.168.10.3/32` will respond to a request from the not directly
     connected _onlink_ peer at `192.168.10.2`.

## Tying it together

With all of that, I am ready to demonstrate two working solutions now. I first compile Bird2 with
Ondrej's [[commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)].
Then, I compile VPP with my pending [[gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. Finally,
to demonstrate how `update_loopback_addr()` might work, I compile `lcpng` with my previous
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
which allows me to inhibit copying forward addresses from VPP to Linux, when using _unnumbered_
interfaces.

I take an IPng lab instance out for a spin with this updated Bird2 and VPP+lcpng environment:

{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}

### Solution 1: Somewhat unnumbered

I configure an otherwise empty VPP dataplane as follows:

```
vpp0-3# lcp lcp-sync on
vpp0-3# lcp lcp-sync-unnumbered on

vpp0-3# create loopback interface instance 0
vpp0-3# set interface state loop0 up
vpp0-3# set interface ip address loop0 192.168.10.3/32
vpp0-3# set interface ip address loop0 2001:678:d78:200::3/128

vpp0-3# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-3# set interface mtu packet 9000 GigabitEthernet10/0/0
vpp0-3# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-3# set interface state GigabitEthernet10/0/0 up

vpp0-3# lcp create loop0 host-if loop0
vpp0-3# lcp create GigabitEthernet10/0/0 host-if e0
```

Which yields the following configuration:
```
pim@vpp0-3:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
loop0            UNKNOWN        192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64
e0               UP             192.168.10.3/32 2001:678:d78:200::3/128 fe80::5054:ff:fef0:1130/64
pim@vpp0-3:~$ ip route get 192.168.10.2
RTNETLINK answers: Network is unreachable
```

I can see that VPP copied forward the IPv4/IPv6 addresses to interface `e0`, and because there's no
routing protocol running yet, the neighbor router `vpp0-2` is unreachable. Let me fix that, next.  I
start bird in the VPP dataplane network namespace, and configure it as follows:

```
router id 192.168.10.3;

protocol device { scan time 30; }
protocol direct { ipv4; ipv6; check link yes; }

protocol kernel kernel4 {
  ipv4 { import none; export where source != RTS_DEVICE; };
  learn off; scan time 300;
}

protocol kernel kernel6 {
  ipv6 { import none; export where source != RTS_DEVICE; };
  learn off; scan time 300;
}

protocol bfd bfd1 {
  interface "e*" {
    interval 100 ms;
    multiplier 20;
  };
}

protocol ospf v3 ospf4 {
  ipv4 { export all; import where (net ~ [ 192.168.10.0/24+, 0.0.0.0/0 ]); };
  area 0 {
    interface "loop0" { stub yes; };
    interface "e0" { type pointopoint; cost 5; bfd on; };
  };
}

protocol ospf v3 ospf6 {
  ipv6 { export all; import where (net ~ [ 2001:678:d78:200::/56+, ::/0 ]); };
  area 0 {
    interface "loop0" { stub yes; };
    interface "e0" { type pointopoint; cost 5; bfd on; };
  };
}
```

This minimal Bird2 configuration will configure the main protocols `device`, `direct`,
and two kernel protocols `kernel4` and `kernel6`, which are instructed to export learned routes to
the kernel for all but directly connected routes (because the Linux kernel and VPP already have
these when an interface is brought up, this avoids duplicate connected route entries).

If you haven't come across it yet, _Bidirectional Forwarding Detection_ or _BFD_ is a protocol that
repeatedly sends UDP packets between routers, to be able to detect if the forwarding is interrupted
even if the interface link stays up.  It's described in detail in
[[RFC5880](https://www.rfc-editor.org/rfc/rfc5880.txt)], and I use it at IPng Networks all over the
place.

{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

Then I'll configure two OSPF protocols, one for IPv4 called `ospf4` and another for IPv6 called
`ospf6`. It's easy to overlook, but while usually the IPv4 protocol is OSPFv2 and the IPv6 protocol
is OSPFv3, here _both_ are using OSPFv3! I'll instruct Bird to erect a _BFD_ session for any
neighbor it establishes an adjacency with. If at any point the BFD session times out (currently at
20x100ms or 2.0s), OSPF will tear down the adjacency.
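
For the avoidance of doubt, that timeout is just the interval multiplied by the multiplier. The
snippet below is a deliberately simplified sketch which assumes both routers use the same `interval`
and `multiplier`. The real detection time in RFC5880 is negotiated between the two peers, which I
ignore here, and the helper name is my own.

```python
from datetime import timedelta


def bfd_detection_time(interval: timedelta, multiplier: int) -> timedelta:
    """Time without received BFD packets before the session is declared down."""
    return interval * multiplier


timeout = bfd_detection_time(timedelta(milliseconds=100), 20)
print(timeout.total_seconds())  # 2.0
```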

The OSPFv3 protocols each define one channel, in which I allow Bird to export anything, but import
only those routes that are in the LAB IPv4 (192.168.10.0/24) and IPv6 (2001:678:d78:200::/56), and
I'll also allow a default to be learned over OSPF for both address families. That'll come in handy
later.
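
If Bird's prefix set syntax is new to you: a trailing `+` matches the prefix and everything more
specific, while a bare prefix matches only exactly that prefix. Here is a small Python model of the
IPv4 import filter, assuming only these two pattern forms (Bird supports more, which I leave out).
The `matches()` and `accept()` helpers are hypothetical and exist only for this illustration.

```python
import ipaddress
from typing import List


def matches(net: str, pattern: str) -> bool:
    """Match a route against 'p/len' (exact) or 'p/len+' (p and subprefixes)."""
    route = ipaddress.ip_network(net)
    if pattern.endswith("+"):
        base = ipaddress.ip_network(pattern[:-1])
        return route.version == base.version and route.subnet_of(base)
    return route == ipaddress.ip_network(pattern)


def accept(net: str, patterns: List[str]) -> bool:
    return any(matches(net, p) for p in patterns)


ipv4_import = ["192.168.10.0/24+", "0.0.0.0/0"]

for net in ["192.168.10.2/32", "192.168.10.4/31", "0.0.0.0/0", "10.0.0.0/8"]:
    print(net, accept(net, ipv4_import))
# 192.168.10.2/32 True
# 192.168.10.4/31 True
# 0.0.0.0/0 True
# 10.0.0.0/8 False
```

Note that `0.0.0.0/0` without the `+` admits only the default route itself, not every IPv4 prefix.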

I start up Bird on the rightmost two routers in the lab (`vpp0-3` and `vpp0-2`). Looking at
`vpp0-3`, Bird starts sending IPv6 hello packets on interface `e0`, and pretty quickly finds its
neighbor not once but twice, once for each OSPFv3 instance:

```
pim@vpp0-3:~$ birdc show ospf neighbors
BIRD v2.15.1-4-g280daed5-x ready.
ospf4:
Router ID   	Pri	     State     	DTime	Interface  Router IP
192.168.10.2	  1	Full/PtP  	30.870	e0         fe80::5054:ff:fef0:1121

ospf6:
Router ID   	Pri	     State     	DTime	Interface  Router IP
192.168.10.2	  1	Full/PtP  	30.870	e0         fe80::5054:ff:fef0:1121
```

Bird is able to sort out which is which on account of the 'normal' IPv6 OSPFv3 having an _instance
id_ value of 0 (IPv6 Unicast), and the IPv4 OSPFv3 having an _instance id_ of 64 (IPv4 Unicast).
Further, the IPv4 variant will set the AF-bit in the OSPFv3 options, so the peer will know it
supports using the Link LSA to model IPv4 nexthops rather than IPv6 nexthops.
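
The mapping from _instance id_ to address family comes from the ranges that RFC 5838 reserves in
Section 2.1. A quick Python sketch of that table, with a function name of my own choosing:

```python
def address_family(instance_id: int) -> str:
    """Map an OSPFv3 instance id to its address family (RFC 5838, 2.1)."""
    if not 0 <= instance_id <= 255:
        raise ValueError("instance id is an 8-bit value")
    ranges = [
        (0, 31, "IPv6 unicast"),
        (32, 63, "IPv6 multicast"),
        (64, 95, "IPv4 unicast"),
        (96, 127, "IPv4 multicast"),
    ]
    for low, high, family in ranges:
        if low <= instance_id <= high:
            return family
    return "unassigned"  # 128 through 255


print(address_family(0))   # IPv6 unicast
print(address_family(64))  # IPv4 unicast
```

Two instances with ids in different ranges will simply never form an adjacency with each other,
which is how `ospf4` and `ospf6` can share the same link.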

Indeed, routes are quickly learned:

```
pim@vpp0-3:~$ birdc show route table master4
BIRD v2.15.1-4-g280daed5-x ready.
Table master4:
192.168.10.3/32      unicast [direct1 13:02:56.883] * (240)
	dev loop0
                     unicast [direct1 13:02:56.883] (240)
	dev e0
                     unicast [ospf4 13:02:56.980] I (150/0) [192.168.10.3]
	dev loop0
	dev e0
192.168.10.2/32      unicast [ospf4 13:03:04.980] * I (150/5) [192.168.10.2]
	via 192.168.10.2 on e0 onlink

```

They are quickly propagated both to the Linux kernel, and by means of Netlink into the Linux Control
Plane plugin in VPP, which programs them into VPP's _FIB_:

```
pim@vpp0-3:~$ ip ro
192.168.10.2 via 192.168.10.2 dev e0 proto bird metric 32 onlink

pim@vpp0-3:~$ vppctl show ip fib 192.168.10.2
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ]
192.168.10.2/32 fib:0 index:23 locks:3
  lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
    path-list:[40] locks:2 flags:shared, uPRF-list:22 len:1 itfs:[1, ]
      path:[53] pl-index:40 ip4 weight=1 pref=32 attached-nexthop:  oper-flags:resolved,
        192.168.10.2 GigabitEthernet10/0/0
      [@0]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800

  adjacency refs:1 entry-flags:attached, src-flags:added, cover:-1
    path-list:[43] locks:1 uPRF-list:24 len:1 itfs:[1, ]
      path:[56] pl-index:43 ip4 weight=1 pref=0 attached-nexthop:  oper-flags:resolved,
        192.168.10.2 GigabitEthernet10/0/0
      [@0]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800
    Extensions:
     path:56
 forwarding:   unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:28 buckets:1 uRPF:22 to:[0:0]]
    [0] [@5]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800
```
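
The hex string at the end of each adjacency line is the rewrite that VPP prepends to the packet,
here a plain 14-byte ethernet header. The Python sketch below decodes it, and as a sanity check
derives the EUI-64 link local address that belongs to each MAC. Both helper names are mine, and I'm
assuming an untagged ethernet rewrite (a VLAN tagged one would be longer).

```python
import ipaddress


def decode_rewrite(hexstr: str) -> dict:
    """Split a 14-byte ethernet rewrite into dst MAC, src MAC and ethertype."""
    raw = bytes.fromhex(hexstr)
    if len(raw) != 14:
        raise ValueError("expected an untagged 14-byte ethernet header")

    def mac(b: bytes) -> str:
        return ":".join(f"{x:02x}" for x in b)

    return {
        "dst": mac(raw[0:6]),
        "src": mac(raw[6:12]),
        "ethertype": int.from_bytes(raw[12:14], "big"),
    }


def eui64_link_local(mac: str) -> str:
    """Derive the modified EUI-64 IPv6 link local address for a MAC address."""
    b = bytearray(int(x, 16) for x in mac.split(":"))
    b[0] ^= 0x02  # flip the universal/local bit
    iid = bytes(b[:3]) + b"\xff\xfe" + bytes(b[3:])
    return str(ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + iid))


hdr = decode_rewrite("525400f01121525400f011300800")
print(hdr["dst"], hex(hdr["ethertype"]))  # 52:54:00:f0:11:21 0x800
print(eui64_link_local(hdr["dst"]))       # fe80::5054:ff:fef0:1121
print(eui64_link_local(hdr["src"]))       # fe80::5054:ff:fef0:1130
```

The destination MAC is the one behind `fe80::5054:ff:fef0:1121`, the neighbor that Bird reported,
and the source MAC is the one behind the link local address of my own `e0`. The ethertype `0x0800`
confirms that this adjacency forwards IPv4.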

The neighbor is reachable, over IPv6 (which is nothing special), but also over IPv4:

```
pim@vpp0-3:~$ ping -c5 2001:678:d78:200::2
PING 2001:678:d78:200::2(2001:678:d78:200::2) 56 data bytes
64 bytes from 2001:678:d78:200::2: icmp_seq=1 ttl=64 time=2.16 ms
64 bytes from 2001:678:d78:200::2: icmp_seq=2 ttl=64 time=3.69 ms
64 bytes from 2001:678:d78:200::2: icmp_seq=3 ttl=64 time=2.66 ms
64 bytes from 2001:678:d78:200::2: icmp_seq=4 ttl=64 time=2.30 ms
64 bytes from 2001:678:d78:200::2: icmp_seq=5 ttl=64 time=2.92 ms

--- 2001:678:d78:200::2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 2.164/2.747/3.687/0.540 ms

pim@vpp0-3:~$ ping -c5 192.168.10.2
PING 192.168.10.2 (192.168.10.2) 56(84) bytes of data.
64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=3.58 ms
64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=3.40 ms
64 bytes from 192.168.10.2: icmp_seq=3 ttl=64 time=3.28 ms
64 bytes from 192.168.10.2: icmp_seq=4 ttl=64 time=3.32 ms
64 bytes from 192.168.10.2: icmp_seq=5 ttl=64 time=3.29 ms

--- 192.168.10.2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4007ms
rtt min/avg/max/mdev = 3.283/3.374/3.577/0.109 ms
```

**✅ OSPFv3 with IPv4/IPv6 on-link nexthops works!**

### Solution 2: Truly unnumbered

However, Ondrej's patch does something in addition to this. I repeat the same setup, except now I
set one additional feature when starting up VPP: `lcp lcp-sync-unnumbered off`

What happens next is that VPP's dataplane looks subtly different. It has created an _unnumbered_
interface keyed off of `loop0`, but it doesn't propagate the addresses to Linux.

```
pim@vpp0-3:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
loop0            UNKNOWN        192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64
e0               UP             fe80::5054:ff:fef0:1130/64
```

With `e0` only having a linklocal address, Bird can still form an adjacency with its neighbor
`vpp0-2`, because adjacencies in OSPFv3 are formed using IPv6 only. Meanwhile, the clever trick in
`update_loopback_addr()` walks the list of interfaces to find a usable IPv4 address, and puts that
in the Link LSA as per RFC5838. In this case, it finds `192.168.10.3` on interface `loop0`, so
it'll use that to signal the next hop in the LSAs that it sends.

Now I start the same VPP and Bird configuration on all four VPP routers, but on `vpp0-0` I'll add a
static route out of the LAB to the internet:

```
protocol static static4 {
  ipv4 { export all; };
  route 0.0.0.0/0 via 192.168.10.4;
}

protocol static static6 {
  ipv6 { export all; };
  route ::/0 via 2001:678:d78:201::ffff;
}
```

These two default routes from `vpp0-0` quickly propagate through the network, where `vpp0-3`
ultimately sees this:

```
pim@vpp0-3:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
loop0            UNKNOWN        192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64
e0               UP             fe80::5054:ff:fef0:1130/64

pim@vpp0-3:~$ ip ro
default via 192.168.10.2 dev e0 proto bird metric 32 onlink
192.168.10.0 via 192.168.10.2 dev e0 proto bird metric 32 onlink
192.168.10.1 via 192.168.10.2 dev e0 proto bird metric 32 onlink
192.168.10.2 via 192.168.10.2 dev e0 proto bird metric 32 onlink
192.168.10.4/31 via 192.168.10.2 dev e0 proto bird metric 32 onlink

pim@vpp0-3:~$ ip -6 ro
2001:678:d78:200:: via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium
2001:678:d78:200::1 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium
2001:678:d78:200::2 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium
2001:678:d78:200::3 dev loop0 proto kernel metric 256 pref medium
2001:678:d78:201::/112 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium
fe80::/64 dev loop0 proto kernel metric 256 pref medium
fe80::/64 dev e0 proto kernel metric 256 pref medium
default via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium
```

**✅ OSPFv3 with loopback-only, _unnumbered_ IPv4/IPv6 interfaces works!**

## Results

I thought I'd record a little [[asciinema](/assets/vpp-ospf/clean.cast),
[gif](/assets/vpp-ospf/clean.gif)] that shows the end to end
configuration, starting from an empty dataplane and bird configuration. I'll show _Solution 2_, that
is, the solution that doesn't copy the _unnumbered_ interfaces in VPP to Linux.

Ready? Here I go!

{{< asciinema src="/assets/vpp-ospf/clean.cast" >}}


### To unnumbered or Not To unnumbered

I'm torn between _Solution 1_ and _Solution 2_. On the one hand, the _unnumbered_ interface
configuration would ideally be reflected in Linux, but that is not without problems. If the operator
subsequently tries to remove one of the addresses on `e0` or `e1`, that will yield a desync between
Linux and VPP (Linux will have removed the address, but VPP will still be _unnumbered_). On the
other hand, tricking Linux (and the operator) into believing there isn't an IPv4 (and IPv6) address
configured on the interface is also not great.

Of the two approaches, I think I prefer _Solution 2_ (configuring the Linux CP plugin to not sync
_unnumbered_ addresses), because it minimizes the chance of operator error. If you're reading this and
have an Opinion&trade;, would you please let me know?