---
date: "2024-03-06T20:17:54Z"
title: VPP with Babel - Part 1
aliases:
- /s/articles/2024/03/06/vpp-babel-1.html
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. Thanks to the [[Linux ControlPlane]({{< ref "2021-08-12-vpp-1" >}})]
plugin, higher level control plane software becomes available, that is to say: things like BGP,
OSPF, LDP, VRRP and so on become quite natural for VPP.
IPng Networks is a small service provider that has built a network based entirely on open source:
[[Debian]({{< ref "2023-12-17-defra0-debian" >}})] servers with widely available Intel and Mellanox
10G/25G/100G network cards, paired with [[VPP](https://fd.io/)] for the dataplane, and
[[Bird2](https://bird.nic.cz/)] for the controlplane.
As a small provider, I am well aware of the cost of IPv4 address space. Long gone are the times at
which an initial allocation was a /19, and subsequent allocations usually a /20 based on
justification. Then it watered down to a /22 for new _Local Internet Registries_, then that became a
/24 for new _LIRs_, and ultimately we ran out. What was once a plentiful resource, has now become a
very constrained resource.
In this first article, I want to show a rather clever way to conserve IPv4 addresses by exploring
one of the newer routing protocols: Babel.
## 🙁 A sad waste
I have to go back to something very fundamental about routing. When RouterA holds a routing table,
it will associate prefixes with next-hops and their associated interfaces. When RouterA gets a
packet, it'll look up the destination address, and then forward the packet on to RouterB which is
the next router in the path towards the destination:
1. RouterA does a route lookup in its routing table. For destination `192.0.2.1`, the covering
prefix is `192.0.2.0/24` and it might find that it can reach it via IPv4 next hop `100.64.0.1`.
1. RouterA then does another lookup in its routing table, to figure out how it can reach
`100.64.0.1`. It may find that this address is directly connected, say to interface `eth0`, on
which RouterA is `100.64.0.2/30`.
1. Assuming that `eth0` is an ethernet device, which the vast majority of interfaces are, then
RouterA can look up the link-layer address for that IPv4 address `100.64.0.1`, by using ARP.
1. The ARP request asks, quite literally `who-has 100.64.0.1?` using a broadcast message on
`eth0`, to which the other RouterB will answer `100.64.0.1 is-at 90:e2:ba:3f:ca:d5`.
1. Now that RouterA knows that, it can forward along the IP packet out on its `eth0` device and
towards `90:e2:ba:3f:ca:d5`. Huzzah.
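
The first two steps above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `ROUTES` table, `longest_prefix_match()` and `resolve()` are hypothetical names made up for illustration, and the table holds just the two prefixes from the example.

```python
import ipaddress

# Hypothetical routing table for RouterA: prefix -> (next hop, interface).
# A next hop of None marks a directly connected prefix.
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): (ipaddress.ip_address("100.64.0.1"), None),
    ipaddress.ip_network("100.64.0.0/30"): (None, "eth0"),
}

def longest_prefix_match(dst):
    """Return the most specific (prefix, next hop, interface) covering dst."""
    candidates = [p for p in ROUTES if dst in p]
    if not candidates:
        raise LookupError(f"no route to {dst}")
    best = max(candidates, key=lambda p: p.prefixlen)
    nexthop, ifname = ROUTES[best]
    return best, nexthop, ifname

def resolve(dst):
    """Follow next hops until a directly connected prefix is found."""
    target = ipaddress.ip_address(dst)
    for _ in range(8):  # guard against routing loops in this toy table
        _prefix, nexthop, ifname = longest_prefix_match(target)
        if nexthop is None:
            # target is on-link: this is the address ARP would be asked about
            return target, ifname
        target = nexthop
    raise RuntimeError("next hop recursion too deep")

neighbor, ifname = resolve("192.0.2.1")
print(f"192.0.2.1 -> ARP who-has {neighbor} on {ifname}")
```

Running it prints `192.0.2.1 -> ARP who-has 100.64.0.1 on eth0`, which is the point at which steps 3 to 5 (ARP and the actual forwarding) take over.
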
## 🥰 A clever trick
I can't help but notice that the only purpose of having the `100.64.0.0/30` transit network between
these two routers is to:
1. provide the routers the ability to resolve IPv4 next hops towards link-layer MAC addresses,
using ARP resolution.
1. provide a means for the routers to send ICMP messages. For example, in a traceroute each hop
along the way will respond with a TTL exceeded message. And I do like traceroutes!
Let me discuss these two purposes in more detail:
### 1. IPv4 ARP, n&eacute;e IPv6 NDP
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}
One really neat trick is simply replacing ARP resolution by something that can resolve the
link-layer MAC address in a different way. As it turns out, IPv6 has an equivalent that's
called _Neighbor Discovery Protocol_, in which a router can determine the link-layer address of a
neighbor, or verify that a neighbor is still reachable via a cached link-layer address. This uses
ICMPv6 to send out a query with the _Neighbor Solicitation_, which is followed by a response in the
form of a _Neighbor Advertisement_.
Why am I talking about IPv6 neighbor discovery when I'm explaining IPv4 forwarding, you may be
wondering? Well, because of this neat trick that the IPv4 prefix brokers don't want you to know:
```
pim@vpp0-0:~$ sudo ip ro add 192.0.2.0/24 via inet6 fe80::5054:ff:fef0:1110 dev e1
pim@vpp0-0:~$ ip -br a show e1
e1 UP fe80::5054:ff:fef0:1101/64
pim@vpp0-0:~$ ip ro get 192.0.2.0
192.0.2.0 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0
cache
pim@vpp0-0:~$ ip neighbor | grep fe80::5054:ff:fef0:1110
fe80::5054:ff:fef0:1110 dev e1 lladdr 52:54:00:f0:11:10 REACHABLE
pim@vpp0-0:~$ sudo tcpdump -evni e1 host 192.0.2.0
tcpdump: listening on e1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:21:30.002878 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype IPv4 (0x0800), length 98:
(tos 0x0, ttl 64, id 21521, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.10.0 > 192.0.2.0: ICMP echo request, id 54710, seq 20, length 64
```
While it looks counter-intuitive at first, this is actually pretty straightforward. When the router
gets a packet destined for `192.0.2.0/24`, it will know that the next hop is some link-local IPv6
address, which it can resolve by _NDP_ on ethernet interface `e1`. It can then simply forward the
IPv4 datagram to the MAC address it found.
Who would've thunk that you do not need ARP or even IPv4 on the interface at all?
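
The key observation fits in a tiny Python sketch: the neighbor protocol is chosen by the address family of the _next hop_, while the ethertype on the wire is chosen by the address family of the _packet_. The function name `forwarding_plan()` and the two lookup tables are hypothetical, purely for illustration.

```python
import ipaddress

ETHERTYPES = {4: 0x0800, 6: 0x86DD}   # IPv4 and IPv6 ethertypes
RESOLVERS = {4: "ARP", 6: "NDP"}      # neighbor protocol per next hop family

def forwarding_plan(destination, next_hop):
    """Pick the neighbor protocol from the next hop, the ethertype from the packet."""
    dst = ipaddress.ip_address(destination)
    nh = ipaddress.ip_address(next_hop)
    return {
        "resolve_with": RESOLVERS[nh.version],
        "ethertype": f"0x{ETHERTYPES[dst.version]:04x}",
    }

# Classic case: IPv4 destination, IPv4 next hop.
print(forwarding_plan("192.0.2.1", "100.64.0.1"))
# The trick: IPv4 destination, IPv6 link-local next hop.
print(forwarding_plan("192.0.2.0", "fe80::5054:ff:fef0:1110"))
```

Both calls report ethertype `0x0800`, and only the second resolves its next hop with NDP, which matches what the `tcpdump` output above shows on the wire.
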
### 2. Originating ICMP messages
{{< image width="200px" float="right" src="/assets/vpp-babel/too-big.png" alt="Too Big" >}}
The Internet Control Message Protocol is described in
[[RFC792](https://datatracker.ietf.org/doc/html/rfc792)]. It's mostly used to carry diagnostic and
debugging information, either originated by end hosts, for example the "destination unreachable,
port unreachable" types of messages, but they may also be originated by intermediate routers, for
example with most other kinds of "destination unreachable" packets.
Path MTU Discovery, described in [[RFC1191](https://datatracker.ietf.org/doc/html/rfc1191)], allows a
host to discover the maximum packet size that a route is able to carry. There are a few different
types of _PMTUd_, but the most common one uses ICMPv4 packets coming from these intermediate
routers, informing the sending host that packets which are marked as un-fragmentable cannot be
transmitted because they are too large.
Without the ability for a router to send these ICMPv4 packets, end to end connectivity
might degrade or break undetected. So, every router that is able to forward IPv4 traffic SHOULD be
able to originate ICMPv4 traffic.
If you're curious, you can read more in this [[IETF
Draft](https://www.ietf.org/archive/id/draft-chroboczek-int-v4-via-v6-01.html)] from Juliusz
Chroboczek et al. It's really insightful, yet elegant.
{{< image width="200px" float="right" src="/assets/vpp-babel/Babel_logo_black.svg" alt="Babel Logo" >}}
## Introducing Babel
I've learned so far that I (a) **MAY** use IPv6 link-local networks in order to _forward_ IPv4
packets, as I can use IPv6 _NDP_ to find the link-layer next hop; and (b) each router **SHOULD** be
able to _originate_ ICMPv4 packets, therefore it needs _at least one_ IPv4 address.
These two claims mean that I need _at most one_ IPv4 address on each router. Could it be?!
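
To get a feel for what is at stake, here is some back-of-the-envelope arithmetic in Python. The topology (four routers in a ring) is a hypothetical example and not a description of any real network, and both function names are made up for this sketch.

```python
def transit_addresses(routers, links, per_link=2):
    """IPv4 addresses used with numbered point-to-point links.

    per_link is 2 for a /31 transit network (a /30 would consume 4).
    """
    return routers + links * per_link  # one loopback per router, plus transit

def loopback_only_addresses(routers):
    """IPv4 addresses used when links carry only IPv6 link-local addresses."""
    return routers

# Hypothetical topology: 4 routers connected in a ring by 4 links.
routers, links = 4, 4
numbered = transit_addresses(routers, links)
unnumbered = loopback_only_addresses(routers)
print(f"numbered: {numbered}, loopback-only: {unnumbered}, saved: {numbered - unnumbered}")
```

For this toy ring the numbered design needs 12 addresses and the loopback-only design needs 4. The saving grows with the number of links, not the number of routers.
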
**Babel** is a loop-avoiding distance-vector routing protocol that is designed to be robust and
efficient both in networks using prefix-based routing and in networks using flat routing ("mesh
networks"), and both in relatively stable wired networks and in highly dynamic wireless networks.
The definitive [[RFC8966](https://datatracker.ietf.org/doc/html/rfc8966)] describes it in great
detail, and previous work is in [[RFC7557](https://datatracker.ietf.org/doc/html/rfc7557)] and
[[RFC6126](https://datatracker.ietf.org/doc/html/rfc6126)]. Lots of reading :) Babel is a _hybrid_
routing protocol, in the sense that it can carry routes for multiple network-layer protocols (IPv4
and IPv6), regardless of which protocol the Babel packets are themselves being carried over.
I quickly realise that Babel is hybrid in a different and very interesting way: it can set next-hops
across address families, which is described in [[RFC9229](https://datatracker.ietf.org/doc/html/rfc9229)]:
> When a packet is routed according to a given routing table entry, the forwarding plane typically
> uses a neighbour discovery protocol (the Neighbour Discovery (ND) protocol
> [[RFC4861](https://datatracker.ietf.org/doc/html/rfc4861)] in the case of IPv6 and the Address
> Resolution Protocol (ARP) [[RFC826](https://datatracker.ietf.org/doc/html/rfc826)] in the case of
> IPv4) to map the next-hop address to a link-layer address (a "Media Access Control (MAC)
> address"), which is then used to construct the link-layer frames that encapsulate forwarded
> packets.
>
> It is apparent from the description above that there is no fundamental reason why the destination
> prefix and the next-hop address should be in the same address family: there is nothing preventing
> an IPv6 packet from being routed through a next hop with an IPv4 address (in which case the next
> hop's MAC address will be obtained using ARP) or, conversely, an IPv4 packet from being routed
> through a next hop with an IPv6 address. (In fact, it is even possible to store link-layer
> addresses directly in the next-hop entry of the routing table, which is commonly done in networks
> using the OSI protocol suite).
### Babel and Bird2
There's an implementation of Babel in Bird2, the routing solution that I use at AS8298. What made me
extra enthusiastic, is that I found out the functionality described in RFC9229 was committed about a
year ago in Bird2
[[ref](https://gitlab.nic.cz/labs/bird/-/commit/eecc3f02e41bcb91d463c4c1189fd56bc44e6514)], with a
hat-tip to Toke Høiland-Jørgensen.
The Debian machines at IPng are current (Bookworm 12.5), but Debian still ships a version older than
this commit, so my first order of business is to get a Debian package:
```
pim@summer:~/src$ sudo apt install devscripts
pim@summer:~/src$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14.orig.tar.gz
pim@summer:~/src$ tar xzf bird2_2.14.orig.tar.gz
pim@summer:~/src/bird-2.14$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14-1.debian.tar.xz
pim@summer:~/src/bird-2.14$ tar xf bird2_2.14-1.debian.tar.xz
pim@summer:~/src/bird-2.14$ sudo mk-build-deps -i
pim@summer:~/src/bird-2.14$ sudo dpkg-buildpackage -b -uc -us
```
And that yields me a fresh Bird 2.14 package. I can't help but wonder though, why did the semantic
versioning [[ref](https://semver.org/)] of `2.0.X` change to `2.14`? I found an answer in the NEWS
file of the 2.13 release
[[link](https://gitlab.nic.cz/labs/bird/-/blob/7d2c7d59a363e690995eb958959f0bc12445355c/NEWS#L45-50)].
It's a little bit of a disappointment, but I quickly get over myself because I want to take this
Babel-Bird out for a test flight. Thank you for the Babel-Bird-Build, Summer!
### Babel and the LAB
I decide to take an IPng [[lab]({{< ref "2022-10-14-lab-1" >}})] out for a spin. These labs come
with four VPP routers and two Debian machines connected like so:
{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
The configuration snippet for Bird2 is very simple, as most of the defaults are sensible:
```
pim@vpp0-0:~$ cat << EOF | sudo tee -a /etc/bird/bird.conf
protocol babel {
interface "e*" {
type wired;
extended next hop on;
};
ipv6 { import all; export all; };
ipv4 { import all; export all; };
}
EOF
pim@vpp0-0:~$ birdc show babel interfaces
BIRD 2.14 ready.
babel1:
Interface State Auth RX cost Nbrs Timer Next hop (v4) Next hop (v6)
e1 Up No 96 1 0.958 :: fe80::5054:ff:fef0:1101
pim@vpp0-0:~$ birdc show babel neigh
BIRD 2.14 ready.
babel1:
IP address Interface Metric Routes Hellos Expires Auth RTT (ms)
fe80::5054:ff:fef0:1110 e1 96 8 16 5.003 No 4.831
pim@vpp0-0:~$ birdc show babel entries
BIRD 2.14 ready.
babel1:
Prefix Router ID Metric Seqno Routes Sources
192.168.10.0/32 00:00:00:00:c0:a8:0a:00 0 1 0 0
192.168.10.0/24 00:00:00:00:c0:a8:0a:00 0 1 1 0
192.168.10.1/32 00:00:00:00:c0:a8:0a:01 96 7 1 0
2001:678:d78:200::/128 00:00:00:00:c0:a8:0a:00 0 1 0 0
2001:678:d78:200::/60 00:00:00:00:c0:a8:0a:00 0 1 1 0
2001:678:d78:200::1/128 00:00:00:00:c0:a8:0a:01 96 7 1 0
```
Based on this simple configuration, Bird2 will start the babel protocol on `e0` and `e1`, and it
quickly finds a neighbor with which it establishes an adjacency. Looking at the routing protocol
database (called _entries_), I can see my own IPv4 and IPv6 loopbacks (192.168.10.0 and
2001:678:d78:200::), the neighbor's IPv4 and IPv6 loopbacks (192.168.10.1 and 2001:678:d78:200::1),
and finally the two supernets (192.168.10.0/24 and 2001:678:d78:200::/60).
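
As a small aside, the _Router ID_ column in that output is easier to read once decoded. In this lab the low 32 bits of each 64-bit router ID happen to spell out the router's IPv4 loopback. That is an observation about this particular output, not a claim about how Bird picks router IDs in general. The helper name below is hypothetical.

```python
import ipaddress

def router_id_to_ipv4(router_id):
    """Interpret the low 32 bits of a 64-bit Babel router ID as an IPv4 address."""
    raw = bytes.fromhex(router_id.replace(":", ""))
    if len(raw) != 8:
        raise ValueError("expected a 64-bit router ID")
    return ipaddress.IPv4Address(raw[4:])

for rid in ("00:00:00:00:c0:a8:0a:00", "00:00:00:00:c0:a8:0a:01"):
    print(rid, "->", router_id_to_ipv4(rid))
```

The two IDs decode to `192.168.10.0` (this router) and `192.168.10.1` (the neighbor), which lines up with the metric column: 0 for local entries, 96 for those learned over `e1`.
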
The coolest part is the `extended next hop on` statement, which enables Babel to set the nexthop
to be an IPv6 address, which becomes clear very quickly when looking at the Linux routing table:
```
pim@vpp0-0:~$ ip ro
192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32
unreachable 192.168.10.0/24 proto bird metric 32
pim@vpp0-0:~$ ip -6 ro
2001:678:d78:200:: dev loop0 proto kernel metric 256 pref medium
2001:678:d78:200::1 via fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32 pref medium
unreachable 2001:678:d78:200::/60 dev lo proto bird metric 32 pref medium
fe80::/64 dev loop0 proto kernel metric 256 pref medium
fe80::/64 dev e1 proto kernel metric 256 pref medium
```
**✅ Setting IPv4 routes over IPv6 nexthops works!**
### Babel and VPP
For the [[VPP](https://fd.io)] configuration, I start off with a pretty much _empty_ configuration,
creating only a loopback interface called `loop0`, setting the interfaces up, and exposing them in
_LinuxCP_:
```
vpp0-0# create loopback interface instance 0
vpp0-0# set interface state loop0 up
vpp0-0# set interface ip address loop0 192.168.10.0/32
vpp0-0# set interface ip address loop0 2001:678:d78:200::/128
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-0# set interface state GigabitEthernet10/0/0 up
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1
vpp0-0# set interface state GigabitEthernet10/0/1 up
vpp0-0# lcp create loop0 host-if loop0
vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0
vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1
```
Between the four VPP routers, the only relevant difference is the IPv4 and IPv6 addresses of the
loopback device. For the rest, things are good. The routing tables quickly fill with all IPv4 and
IPv6 loopbacks across the network.
### Adding support to VPP
IPv6 pings fine and looks good. However, IPv4 endpoints do not ping yet. The first thing I look at is
whether VPP understands how to interpret an IPv4 route with an IPv6 nexthop. I think it does, because I
remember reviewing a change from Adrian during our MPLS [[project]({{< ref "2023-05-28-vpp-mpls-4" >}})],
which he submitted in this [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/38633)]. His change
allows VPP to use routes with `rtnl_route_nh_get_via()` to map them to a different address family,
exactly what I am looking for. The routes are correctly installed in the FIB:
```
pim@vpp0-0:~$ vppctl show ip fib 192.168.10.1
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[default-route:1, lcp-rt:1, ]
192.168.10.1/32 fib:0 index:31 locks:2
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[51] locks:4 flags:shared, uPRF-list:42 len:1 itfs:[2, ]
path:[72] pl-index:51 ip6 weight=1 pref=32 attached-nexthop: oper-flags:resolved,
fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1
[@0]: ipv6 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f0110186dd
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:34 buckets:1 uRPF:42 to:[0:0]]
[0] [@5]: ipv4 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f011010800
```
Using the Open vSwitch tap, I can clearly see the packets go out from `vpp0-0.e1` and into
`vpp0-1.e0`, but there is no response, so they are getting lost in `vpp0-1` somewhere. I take a look
at a packet trace on `vpp0-1`, where I'm expecting to find the ICMP packet:
```
pim@vpp0-1:~$ vppctl show trace
07:42:53:178694: dpdk-input
GigabitEthernet10/0/0 rx queue 0
buffer 0x4c513d: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
ext-hdr-valid
PKT MBUF: port 0, nb_segs 1, pkt_len 98
buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x29944fc0
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10
ICMP: 192.168.10.0 -> 192.168.10.1
tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
fragment id 0xf52b, flags DONT_FRAGMENT
ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178765: ethernet-input
frame: flags 0x1, hw-if-index 1, sw-if-index 1
IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10
07:42:53:178791: ip4-input
ICMP: 192.168.10.0 -> 192.168.10.1
tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
fragment id 0xf52b, flags DONT_FRAGMENT
ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178810: ip4-not-enabled
ICMP: 192.168.10.0 -> 192.168.10.1
tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
fragment id 0xf52b, flags DONT_FRAGMENT
ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178833: error-drop
rx:GigabitEthernet10/0/0
07:42:53:178835: drop
dpdk-input: no error
```
Okay, that checks out! Going over this packet trace, the `ip4-input` node indeed got handed a packet,
which it promptly rejected by forwarding it to `ip4-not-enabled` which drops it. It kind of makes
sense, the VPP dataplane doesn't think it's logical to handle IPv4 traffic on an interface which
does not have an IPv4 address. Except -- I'm bending the rules a little bit by doing exactly that.
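
When traces get long, it can help to boil them down to just the graph nodes a packet visited. Here is a minimal sketch in Python that does this for an abbreviated copy of the trace above. The regular expression assumes node header lines of the form `HH:MM:SS:uuuuuu: node-name` starting in the first column, as they appear in this trace; `node_path()` is a made-up helper.

```python
import re

# Abbreviated copy of the trace above: only the node header lines matter here.
TRACE = """\
07:42:53:178694: dpdk-input
  GigabitEthernet10/0/0 rx queue 0
07:42:53:178765: ethernet-input
  frame: flags 0x1, hw-if-index 1, sw-if-index 1
07:42:53:178791: ip4-input
  ICMP: 192.168.10.0 -> 192.168.10.1
07:42:53:178810: ip4-not-enabled
  ICMP: 192.168.10.0 -> 192.168.10.1
07:42:53:178833: error-drop
  rx:GigabitEthernet10/0/0
07:42:53:178835: drop
  dpdk-input: no error
"""

NODE_LINE = re.compile(r"^\d{2}:\d{2}:\d{2}:\d{6}: (\S+)$")

def node_path(trace):
    """Return the graph nodes a traced packet visited, in order."""
    return [m.group(1) for line in trace.splitlines() if (m := NODE_LINE.match(line))]

path = node_path(TRACE)
print(" -> ".join(path))
```

The result, `dpdk-input -> ethernet-input -> ip4-input -> ip4-not-enabled -> error-drop -> drop`, shows at a glance where the packet leaves the happy path.
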
#### Approach 1: force-enable ip4 in VPP
There's an internal function `ip4_sw_interface_enable_disable()` which is called to enable IPv4
processing on an interface once the first IPv4 address is added. So my first fix is to force this to
be enabled for any interface that is exposed via Linux Control Plane, notably in `lcp_itf_pair_create()`
[[here](https://git.ipng.ch/ipng/lcpng/blob/main/lcpng_interface.c#L777)].
This approach is partially effective:
```
pim@vpp0-0:~$ ip ro get 192.168.10.1
192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0
cache
pim@vpp0-0:~$ ping -c5 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=3.92 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=3.81 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.75 ms
64 bytes from 192.168.10.1: icmp_seq=4 ttl=64 time=3.23 ms
64 bytes from 192.168.10.1: icmp_seq=5 ttl=64 time=2.67 ms
^C
--- 192.168.10.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 2.673/3.477/3.921/0.467 ms
pim@vpp0-0:~$ traceroute 192.168.10.3
traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets
1 * * *
2 * * *
3 192.168.10.3 (192.168.10.3) 10.418 ms 10.343 ms 11.362 ms
```
I take a moment to think about why the routers in the middle are not responding to the traceroute,
and it dawns on me that when a router needs to send an ICMPv4 TTL Exceeded message, it can't
select an IPv4 address to originate the message from, as the interface has none.
**🟠 Forwarding works, but ❌ PMTUd does not!**
#### Approach 2: Use unnumbered interfaces
Looking at my options, I see that VPP is capable of using so-called _unnumbered_ interfaces. These
can be left unconfigured, but borrow an address from another interface. It's a good idea to
borrow from `loop0`, which has a valid IPv4 and IPv6 address. It looks like this in VPP:
```
vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-0# show interface address
GigabitEthernet10/0/0 (dn):
unnumbered, use loop0
L3 192.168.10.0/32
L3 2001:678:d78:200::/128
GigabitEthernet10/0/1 (up):
unnumbered, use loop0
L3 192.168.10.0/32
L3 2001:678:d78:200::/128
loop0 (up):
L3 192.168.10.0/32
L3 2001:678:d78:200::/128
```
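
One way to picture what _unnumbered_ changes for ICMP origination is the toy model below. It is emphatically not VPP code: the `Interface` class and `icmp_source()` are hypothetical, and real source address selection has more to it. The sketch only captures the idea that an interface without its own address can fall back to the one it borrows.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Interface:
    name: str
    ipv4: list = field(default_factory=list)       # addresses configured directly
    unnumbered_from: Optional["Interface"] = None   # interface to borrow from

def icmp_source(iface):
    """Pick an IPv4 source for an ICMP error leaving via iface, or None."""
    if iface.ipv4:
        return iface.ipv4[0]
    if iface.unnumbered_from is not None:
        return icmp_source(iface.unnumbered_from)
    return None  # nothing to originate from: the traceroute hop shows '* * *'

loop0 = Interface("loop0", ipv4=["192.168.10.0"])
bare = Interface("GigabitEthernet10/0/1")                            # Approach 1
borrowing = Interface("GigabitEthernet10/0/1", unnumbered_from=loop0)  # Approach 2

print(icmp_source(bare))
print(icmp_source(borrowing))
```

The first call yields `None`, which corresponds to the silent hops in the traceroute from _Approach 1_. The second yields `192.168.10.0`, borrowed from `loop0`.
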
The Linux ControlPlane configuration will always synchronize interface information from VPP to
Linux, as I described back then when I [[worked on the plugin]({{< ref "2021-08-13-vpp-2" >}})].
Babel starts and sets next hops for IPv4 that look like this:
```
pim@vpp0-2:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
loop0 UNKNOWN 192.168.10.2/32 2001:678:d78:200::2/128 fe80::dcad:ff:fe00:0/64
e0 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1120/64
e1 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1121/64
pim@vpp0-2:~$ ip ro
192.168.10.0 via 192.168.10.1 dev e0 proto bird metric 32 onlink
unreachable 192.168.10.0/24 proto bird metric 32
192.168.10.1 via 192.168.10.1 dev e0 proto bird metric 32 onlink
192.168.10.3 via 192.168.10.3 dev e1 proto bird metric 32 onlink
```
While on the surface this looks good, for VPP it clearly poses a problem, as my IPv4 neighbors
(192.168.10.1 and 192.168.10.3) are not reachable:
```
pim@vpp0-2:~# ping -c3 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
From 192.168.10.2 icmp_seq=1 Destination Host Unreachable
From 192.168.10.2 icmp_seq=2 Destination Host Unreachable
From 192.168.10.2 icmp_seq=3 Destination Host Unreachable
--- 192.168.10.1 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2034ms
```
I take a look at why that might be, and I notice this on the neighbor `vpp0-1` when I try to ping it
from `vpp0-2`:
```
vpp0-1# show err
Count Node Reason Severity
5 arp-reply IP4 source address not local to sub error
1 arp-reply IP4 source address matches local in error
```
Oh, snap! I traced this down to `src/vnet/arp/arp.c` around line 522 where I can see that VPP, when
it receives an ARP request, wants that to be coming from a peer that is in its own subnet. But with a
point to point link like this one, there is nobody else in the `192.168.10.1/32` subnet! I think
this error should not be returned if the interface is `arp_unnumbered()`, defined further up in the
same source file. I write a small patch in Gerrit [[40482](https://gerrit.fd.io/r/c/vpp/+/40482)]
which removes this requirement and the test that asserts the previous behavior, allowing the ARP
request to succeed, and things shoot to life:
```
pim@vpp0-2:~$ ping -c3 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=11.5 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=1.69 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.03 ms
--- 192.168.10.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 1.689/5.394/11.468/4.329 ms
```
I make a mental note to discuss this ARP relaxation Gerrit with [[vpp-dev](https://lists.fd.io/g/vpp-dev/message/24155)],
and I'll see where that takes me.
**✅ Forwarding IPv4 routes over IPv4 point-to-point nexthops works!**
#### Approach 3: VPP Unnumbered Hack
At this point, I think I'm good, but one of the cool features of Babel is that it can use IPv6 next
hops for IPv4 destinations. Setting `GigabitEthernet10/0/X` to unnumbered will make
`192.168.10.X/32` reappear on the `e0` and `e1` interfaces, which will make Babel prefer the more
classic IPv4 next-hops. So can I trick it somehow to use IPv6 anyway?
One option is to ask Babel to use `extended next hop` even when IPv4 is available, which would be a
change to Bird (and possibly a violation of the Babel specification, I should read up on that).
But I think there's another way, so I take a look at the VPP code which prints out the **unnumbered,
use loop0** message, and I find a way to know if an interface is borrowing addresses in this way. I
decide to change the LCP plugin to _inhibit_ sync'ing the addresses if they belong to an interface
which is unnumbered. Because I don't know for sure if everybody would find this behavior desirable,
I make sure to guard the behavior behind a backwards compatible configuration option.
If you're curious, please take a look at the change in my [[Git
repo](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
which I:
1. add a new configuration option, `lcp-sync-unnumbered`, which defaults to `on`. That would be
what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.
1. add a CLI call to change the value, `lcp lcp-sync-unnumbered [on|enable|off|disable]`
1. extend the CLI call to show the LCP plugin state, as an additional output of `lcp show`
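
The decision the new option controls can be summarized with a small Python model. It is a sketch of the behavior described in the list above, not the plugin's C code, and `addresses_to_sync()` is a hypothetical name.

```python
def addresses_to_sync(own, borrowed, sync_unnumbered=True):
    """Return the addresses a Linux host interface would receive.

    own:      addresses configured directly on the VPP interface
    borrowed: addresses inherited through 'set interface unnumbered'
    """
    if sync_unnumbered:
        return list(own) + list(borrowed)  # default: copy borrowed addresses too
    return list(own)                       # inhibit copying borrowed addresses

borrowed = ["192.168.10.0/32", "2001:678:d78:200::/128"]
print(addresses_to_sync([], borrowed, sync_unnumbered=True))
print(addresses_to_sync([], borrowed, sync_unnumbered=False))
```

With the option off, the host interface receives no addresses from VPP at all, which is what steers Babel back to the IPv6 link-local next hops.
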
And with that, the VPP configuration becomes:
```
vpp0-0# lcp lcp-sync on
vpp0-0# lcp lcp-sync-unnumbered off
vpp0-0# create loopback interface instance 0
vpp0-0# set interface state loop0 up
vpp0-0# set interface ip address loop0 192.168.10.0/32
vpp0-0# set interface ip address loop0 2001:678:d78:200::/128
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-0# set interface state GigabitEthernet10/0/0 up
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1
vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-0# set interface state GigabitEthernet10/0/1 up
vpp0-0# lcp create loop0 host-if loop0
vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0
vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1
```
## Results
I can claim plausible success on this effort, which makes me wiggle in my seat a little bit, I have
to admit:
```
pim@vpp0-0:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
loop0 UNKNOWN 192.168.10.0/32 2001:678:d78:200::/128 fe80::dcad:ff:fe00:0/64
e0 UP fe80::5054:ff:fef0:1100/64
e1 UP fe80::5054:ff:fef0:1101/64
e2 DOWN
e3 DOWN
pim@vpp0-0:~$ traceroute -n 192.168.10.3
traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets
1 192.168.10.1 1.882 ms 2.231 ms 1.472 ms
2 192.168.10.2 4.243 ms 3.492 ms 2.797 ms
3 192.168.10.3 6.689 ms 5.925 ms 5.157 ms
pim@vpp0-0:~$ traceroute -n 2001:678:d78:200::3
traceroute to 2001:678:d78:200::3 (2001:678:d78:200::3), 30 hops max, 80 byte packets
1 2001:678:d78:200::1 2.543 ms 1.762 ms 2.154 ms
2 2001:678:d78:200::2 4.943 ms 3.063 ms 3.562 ms
3 2001:678:d78:200::3 6.273 ms 6.694 ms 7.086 ms
```
**✅ Forwarding IPv4 routes over IPv6 nexthops works, ICMPv4 works, PMTUd works!**
I recorded a little [[screencast](/assets/vpp-babel/screencast.cast)] that shows my work, so far:
{{< image src="/assets/vpp-babel/screencast.gif" alt="Babel IPv4-less VPP" >}}
## Additional thoughts
### Comparing OSPFv2 and Babel
Ondrej from the Bird team pointed out (thank you!) that OSPFv2 can also be made to avoid use of IPv4
transit networks, by making use of this `peer` pattern, which is similar but not quite the same as
what I discussed in _Approach 2_ above:
```
$ ip addr add 192.168.10.2 peer 192.168.10.1 dev e0
$ ip addr add 192.168.10.2 peer 192.168.10.3 dev e1
```
The Linux ControlPlane plugin is not currently capable of accepting the `peer` netlink message, and
I can see a problem: VPP does not allow for two interfaces to have the same IP address, _unless_ one
is borrowing from another using _unnumbered_. I wonder why that is ...
I could certainly give implementing that `peer` pattern in Netlink a go, but I'm not enthusiastic.
To consume the netlink message correctly, the plugin would need to assert that the left hand (source) IPv4
address strictly corresponds to a loopback, and then internally rewrite the address addition into
an _unnumbered_ use, and also somehow reject (delete?) the netlink configuration otherwise. Ick!
I think there's a more idiomatic way of doing this in VPP. OSPFv2 doesn't really _need_ to use the
`peer` pattern, as long as the point to point peer is reachable. Babel is emitting a static route
over the interface after using IPv6 to learn its peer's IPv4 address, which is really neat! I
suppose for OSPFv2 setting a manual static route for the peer into the device would do the trick as
well.
The VPP idiom for the `peer` pattern above, which Babel does naturally, and OSPFv2 could be manually
configured to do, would look like this:
```
vpp0-2# set interface ip address loop0 192.168.10.2/32
vpp0-2# set interface state loop0 up
vpp0-2# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-2# set interface state GigabitEthernet10/0/0 up
vpp0-2# ip route add 192.168.10.1/32 via 192.168.10.1 GigabitEthernet10/0/0
vpp0-2# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-2# set interface state GigabitEthernet10/0/1 up
vpp0-2# ip route add 192.168.10.3/32 via 192.168.10.3 GigabitEthernet10/0/1
```
Either way, using point to point connections (like these explicit static routes, or the implied
static routes that the `peer` pattern will yield) over an ethernet broadcast medium will require
getting the ARP [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)] merged. This one seems reasonably
straightforward, because allowing point to point to work over an ethernet broadcast medium is
successfully done by many popular vendors, and I can't find any RFC that forbids it. Perhaps VPP is
being a bit too strict.
### To Unnumbered or Not To Unnumbered
I'm torn between _Approach 2_ and _Approach 3_. While on the one hand, setting the _unnumbered_
interface would be best reflected in Linux, it is not without problems. If the operator subsequently
tries to remove one of the addresses on `e0` or `e1`, that will yield a desync between Linux and
VPP (Linux will have removed the address, but VPP will still be _unnumbered_). On the other hand,
tricking Linux (and the operator) into believing there isn't an IPv4 (and IPv6) address configured on
the interface, is also not great.
Of the two approaches, I think I prefer _Approach 3_ (changing the Linux CP plugin to not sync
unnumbered addresses), because it minimizes the chance of operator error. If you're reading this and
have an Opinion&trade;, would you please let me know?
## What's Next
I think that over time, IPng Networks might replace OSPF and OSPFv3 with Babel, as it will allow me
to retire the many /31 IPv4 and /112 IPv6 transit networks (which consume about half of my routable
IPv4 addresses!). I will discuss my change with the VPP and Babel/Bird Developer communities and see
if it makes sense to upstream my changes. Personally, I think it's a reasonable direction, because
(a) both changes are backwards compatible and (b) their semantics are pretty straightforward. I'll
also add some configuration knobs to [[vppcfg]({{< ref "2022-04-02-vppcfg-2" >}})] to make it
easier to configure VPP in this way.
Of course, migrating AS8298 won't happen overnight. I need to gain a bit more confidence, and obviously
upgrade both Bird2 and VPP using my changes, which I think might benefit from a bit of peer review.
And finally I need to roll this new IPv4-less IGP out very carefully and without interruptions,
which, considering the IGP is the most fundamental building block of the network, may be tricky.
But, I am uncomfortably excited by the prospect of having my network go entirely without backbone
transit networks. By the way: Babel is amazing!