---
date: "2024-03-06T20:17:54Z"
title: VPP with Babel - Part 1
aliases:
- /s/articles/2024/03/06/vpp-babel-1.html
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. Thanks to the [[Linux ControlPlane]({{< ref "2021-08-12-vpp-1" >}})]
plugin, higher level control plane software becomes available, that is to say: things like BGP,
OSPF, LDP, VRRP and so on become quite natural for VPP.

IPng Networks is a small service provider that has built a network based entirely on open source:
[[Debian]({{< ref "2023-12-17-defra0-debian" >}})] servers with widely available Intel and Mellanox
10G/25G/100G network cards, paired with [[VPP](https://fd.io/)] for the dataplane, and
[[Bird2](https://bird.nic.cz/)] for the controlplane.

As a small provider, I am well aware of the cost of IPv4 address space. Long gone are the times at
which an initial allocation was a /19, and subsequent allocations usually a /20 based on
justification. Then it watered down to a /22 for new _Local Internet Registries_, then that became a
/24 for new _LIRs_, and ultimately we ran out. What was once a plentiful resource has now become a
very constrained one.

In this first article, I want to show a rather clever way to conserve IPv4 addresses by exploring
one of the newer routing protocols: Babel.

## 🙁 A sad waste

I have to go back to something very fundamental about routing. When RouterA holds a routing table,
it will associate prefixes with next-hops and their associated interfaces. When RouterA gets a
packet, it'll look up the destination address, and then forward the packet on to RouterB which is
the next router in the path towards the destination:

1. RouterA does a route lookup in its routing table. For destination `192.0.2.1`, the covering
   prefix is `192.0.2.0/24` and it might find that it can reach it via IPv4 next hop `100.64.0.1`.
1. RouterA then does another lookup in its routing table, to figure out how it can reach
   `100.64.0.1`. It may find that this address is directly connected, say to interface `eth0`, on
   which RouterA is `100.64.0.2/30`.
1. Assuming that `eth0` is an ethernet device, which the vast majority of interfaces are, then
   RouterA can look up the link-layer address for that IPv4 address `100.64.0.1`, by using ARP.
1. The ARP request asks, quite literally `who-has 100.64.0.1?` using a broadcast message on
   `eth0`, to which the other router, RouterB, will answer `100.64.0.1 is-at 90:e2:ba:3f:ca:d5`.
1. Now that RouterA knows that, it can forward along the IP packet out on its `eth0` device and
   towards `90:e2:ba:3f:ca:d5`. Huzzah.

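
The lookup chain above can be sketched in a few lines of Python. This is a toy model and not how
any real forwarding plane is implemented: the `ROUTES` and `ARP_CACHE` tables and the `resolve()`
helper are names I made up for illustration, seeded with the addresses from the example.

```python
import ipaddress

# Toy routing table: prefix -> (next hop, interface).
# A next hop of None marks a directly connected prefix.
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): ("100.64.0.1", None),
    ipaddress.ip_network("100.64.0.0/30"): (None, "eth0"),
}

# Toy ARP cache, as it would look after the who-has / is-at exchange.
ARP_CACHE = {("eth0", "100.64.0.1"): "90:e2:ba:3f:ca:d5"}


def longest_prefix_match(address):
    """Return the most specific route covering address, or None."""
    ip = ipaddress.ip_address(address)
    candidates = [net for net in ROUTES if ip in net]
    if not candidates:
        return None
    best = max(candidates, key=lambda net: net.prefixlen)
    return ROUTES[best]


def resolve(destination):
    """Follow next hops until a connected route is found, then consult ARP."""
    target = destination
    for _ in range(8):  # guard against next hop loops in a broken table
        route = longest_prefix_match(target)
        if route is None:
            raise LookupError(f"no route to {target}")
        next_hop, interface = route
        if next_hop is None:
            return interface, target, ARP_CACHE[(interface, target)]
        target = next_hop
    raise LookupError("next hop recursion too deep")


print(resolve("192.0.2.1"))
```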
## 🥰 A clever trick

I can't help but notice that the only purpose of having the `100.64.0.0/30` transit network between
these two routers is to:

1. provide the routers the ability to resolve IPv4 next hops towards link-layer MAC addresses,
   using ARP resolution.
1. provide a means for the routers to send ICMP messages, for example in a traceroute, where each
   hop along the way will respond with a TTL exceeded message. And I do like traceroutes!

Let me discuss these two purposes in more detail:

### 1. IPv4 ARP, née IPv6 NDP

{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

One really neat trick is simply replacing ARP resolution by something that can resolve the
link-layer MAC address in a different way. As it turns out, IPv6 has an equivalent that's
called _Neighbor Discovery Protocol_ in which a router can determine the link-layer address of a
neighbor, or verify that a neighbor is still reachable via a cached link-layer address. This uses
ICMPv6 to send out a query with the _Neighbor Solicitation_, which is followed by a response in the
form of a _Neighbor Advertisement_.

Why am I talking about IPv6 neighbor discovery when I'm explaining IPv4 forwarding, you may be
wondering? Well, because of this neat trick that the IPv4 prefix brokers don't want you to know:

```
pim@vpp0-0:~$ sudo ip ro add 192.0.2.0/24 via inet6 fe80::5054:ff:fef0:1110 dev e1

pim@vpp0-0:~$ ip -br a show e1
e1               UP             fe80::5054:ff:fef0:1101/64
pim@vpp0-0:~$ ip ro get 192.0.2.0
192.0.2.0 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0
    cache
pim@vpp0-0:~$ ip neighbor | grep fe80::5054:ff:fef0:1110
fe80::5054:ff:fef0:1110 dev e1 lladdr 52:54:00:f0:11:10 REACHABLE

pim@vpp0-0:~$ sudo tcpdump -evni e1 host 192.0.2.0
tcpdump: listening on e1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:21:30.002878 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype IPv4 (0x0800), length 98:
    (tos 0x0, ttl 64, id 21521, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.10.0 > 192.0.2.0: ICMP echo request, id 54710, seq 20, length 64
```

While it looks counter-intuitive at first, this is actually pretty straightforward. When the router
gets a packet destined for `192.0.2.0/24`, it will know that the next hop is some link-local IPv6
address, which it can resolve by _NDP_ on ethernet interface `e1`. It can then simply forward the
IPv4 datagram to the MAC address it found.

Who would've thunk that you do not need ARP or even IPv4 on the interface at all?

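
As a cross-check on the output above, here is a small sketch that recovers a MAC address from a
link-local address, assuming the address was derived from the MAC using the modified EUI-64 scheme
(which the `ff:fe` in the middle of these addresses suggests). To be clear, this is not what _NDP_
does: the router learns the MAC from a _Neighbor Advertisement_ and never computes it. The helper
name `mac_from_eui64_link_local()` is my own.

```python
import ipaddress


def mac_from_eui64_link_local(address):
    """Recover the MAC address embedded in a modified EUI-64 IPv6 address."""
    interface_id = ipaddress.IPv6Address(address).packed[8:]
    if interface_id[3:5] != b"\xff\xfe":
        raise ValueError(f"{address} is not EUI-64 derived")
    # Drop the ff:fe filler and flip the universal/local bit back.
    mac = bytes([interface_id[0] ^ 0x02]) + interface_id[1:3] + interface_id[5:]
    return ":".join(f"{octet:02x}" for octet in mac)


# The next hop used in the route, and the local address on e1.
print(mac_from_eui64_link_local("fe80::5054:ff:fef0:1110"))
print(mac_from_eui64_link_local("fe80::5054:ff:fef0:1101"))
```

The two results line up with the `lladdr` in the neighbor table and with the destination and source
MAC addresses that tcpdump shows on the wire.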
### 2. Originating ICMP messages

{{< image width="200px" float="right" src="/assets/vpp-babel/too-big.png" alt="Too Big" >}}

The Internet Control Message Protocol is described in
[[RFC792](https://datatracker.ietf.org/doc/html/rfc792)]. It's mostly used to carry diagnostic and
debugging information. These messages may be originated by end hosts, for example the "destination
unreachable, port unreachable" types of messages, but they may also be originated by intermediate
routers, for example with most other kinds of "destination unreachable" packets.

Path MTU Discovery, described in [[RFC1191](https://datatracker.ietf.org/doc/html/rfc1191)] allows a
host to discover the maximum packet size that a route is able to carry. There are a few different
types of _PMTUd_, but the most common one uses ICMPv4 packets coming from these intermediate
routers, informing the sending host that packets which are marked as un-fragmentable cannot be
transmitted because they are too large.

Without the ability for a router to signal these ICMPv4 packets, end to end connectivity quality
might break undetected. So, every router that is able to forward IPv4 traffic SHOULD be able
to originate ICMPv4 traffic.

If you're curious, you can read more in this [[IETF
Draft](https://www.ietf.org/archive/id/draft-chroboczek-int-v4-via-v6-01.html)] from Juliusz
Chroboczek et al. It's really insightful, yet elegant.

{{< image width="200px" float="right" src="/assets/vpp-babel/Babel_logo_black.svg" alt="Babel Logo" >}}

## Introducing Babel

I've learned so far that I (a) **MAY** use IPv6 link-local networks in order to _forward_ IPv4
packets, as I can use IPv6 _NDP_ to find the link-layer next hop; and (b) each router **SHOULD** be
able to _originate_ ICMPv4 packets, therefore it needs _at least one_ IPv4 address.

These two claims mean that I need _at most one_ IPv4 address on each router. Could it be?!

**Babel** is a loop-avoiding distance-vector routing protocol that is designed to be robust and
efficient both in networks using prefix-based routing and in networks using flat routing ("mesh
networks"), and both in relatively stable wired networks and in highly dynamic wireless networks.

The definitive [[RFC8966](https://datatracker.ietf.org/doc/html/rfc8966)] describes it in great
detail, and previous work is in [[RFC7557](https://datatracker.ietf.org/doc/html/rfc7557)] and
[[RFC6126](https://datatracker.ietf.org/doc/html/rfc6126)]. Lots of reading :) Babel is a _hybrid_
routing protocol, in the sense that it can carry routes for multiple network-layer protocols (IPv4
and IPv6), regardless of which protocol the Babel packets are themselves being carried over.

I quickly realise that Babel is hybrid in a different and very interesting way: it can set next-hops
across address families, which is described in [[RFC9229](https://datatracker.ietf.org/doc/html/rfc9229)]:

> When a packet is routed according to a given routing table entry, the forwarding plane typically
> uses a neighbour discovery protocol (the Neighbour Discovery (ND) protocol
> [[RFC4861](https://datatracker.ietf.org/doc/html/rfc4861)] in the case of IPv6 and the Address
> Resolution Protocol (ARP) [[RFC826](https://datatracker.ietf.org/doc/html/rfc826)] in the case of
> IPv4) to map the next-hop address to a link-layer address (a "Media Access Control (MAC)
> address"), which is then used to construct the link-layer frames that encapsulate forwarded
> packets.
>
> It is apparent from the description above that there is no fundamental reason why the destination
> prefix and the next-hop address should be in the same address family: there is nothing preventing
> an IPv6 packet from being routed through a next hop with an IPv4 address (in which case the next
> hop's MAC address will be obtained using ARP) or, conversely, an IPv4 packet from being routed
> through a next hop with an IPv6 address. (In fact, it is even possible to store link-layer
> addresses directly in the next-hop entry of the routing table, which is commonly done in networks
> using the OSI protocol suite).

### Babel and Bird2

There's an implementation of Babel in Bird2, the routing solution that I use at AS8298. What made me
extra enthusiastic is that I found out the functionality described in RFC9229 was committed about a
year ago in Bird2
[[ref](https://gitlab.nic.cz/labs/bird/-/commit/eecc3f02e41bcb91d463c4c1189fd56bc44e6514)], with a
hat-tip to Toke Høiland-Jørgensen.

The Debian machines at IPng are current (Bookworm 12.5), but Debian still ships a version older than
this commit, so my first order of business is to get a Debian package:

```
pim@summer:~/src$ sudo apt install devscripts
pim@summer:~/src$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14.orig.tar.gz
pim@summer:~/src$ tar xzf bird2_2.14.orig.tar.gz
pim@summer:~/src/bird-2.14$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14-1.debian.tar.xz
pim@summer:~/src/bird-2.14$ tar xf bird2_2.14-1.debian.tar.xz
pim@summer:~/src/bird-2.14$ sudo mk-build-deps -i
pim@summer:~/src/bird-2.14$ sudo dpkg-buildpackage -b -uc -us
```

And that yields me a fresh Bird 2.14 package. I can't help but wonder though, why did the semantic
versioning [[ref](https://semver.org/)] of `2.0.X` change to `2.14`? I found an answer in the NEWS
file of the 2.13 release
[[link](https://gitlab.nic.cz/labs/bird/-/blob/7d2c7d59a363e690995eb958959f0bc12445355c/NEWS#L45-50)].
It's a little bit of a disappointment, but I quickly get over myself because I want to take this
Babel-Bird out for a test flight. Thank you for the Babel-Bird-Build, Summer!

### Babel and the LAB

I decide to take an IPng [[lab]({{< ref "2022-10-14-lab-1" >}})] out for a spin. These labs come
with four VPP routers and two Debian machines connected like so:

{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}

The configuration snippet for Bird2 is very simple, as most of the defaults are sensible:

```
pim@vpp0-0:~$ cat << EOF | sudo tee -a /etc/bird/bird.conf
protocol babel {
  interface "e*" {
    type wired;
    extended next hop on;
  };
  ipv6 { import all; export all; };
  ipv4 { import all; export all; };
}
EOF

pim@vpp0-0:~$ birdc show babel interfaces
BIRD 2.14 ready.
babel1:
Interface  State  Auth  RX cost  Nbrs  Timer  Next hop (v4)  Next hop (v6)
e1         Up     No         96     1  0.958  ::             fe80::5054:ff:fef0:1101

pim@vpp0-0:~$ birdc show babel neigh
BIRD 2.14 ready.
babel1:
IP address               Interface  Metric  Routes  Hellos  Expires  Auth  RTT (ms)
fe80::5054:ff:fef0:1110  e1             96       8      16    5.003  No       4.831

pim@vpp0-0:~$ birdc show babel entries
BIRD 2.14 ready.
babel1:
Prefix                   Router ID                Metric  Seqno  Routes  Sources
192.168.10.0/32          00:00:00:00:c0:a8:0a:00       0      1       0        0
192.168.10.0/24          00:00:00:00:c0:a8:0a:00       0      1       1        0
192.168.10.1/32          00:00:00:00:c0:a8:0a:01      96      7       1        0
2001:678:d78:200::/128   00:00:00:00:c0:a8:0a:00       0      1       0        0
2001:678:d78:200::/60    00:00:00:00:c0:a8:0a:00       0      1       1        0
2001:678:d78:200::1/128  00:00:00:00:c0:a8:0a:01      96      7       1        0
```

Based on this simple configuration, Bird2 will start the babel protocol on `e0` and `e1`, and it
quickly finds a neighbor with which it establishes an adjacency. Looking at the routing protocol
database (called _entries_), I can see my own IPv4 and IPv6 loopbacks (192.168.10.0 and
2001:678:d78:200::), the neighbor's IPv4 and IPv6 loopbacks (192.168.10.1 and 2001:678:d78:200::1),
and finally the two supernets (192.168.10.0/24 and 2001:678:d78:200::/60).

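
As an aside, the _Router ID_ column in the entries above looks like it carries each router's IPv4
loopback in its last four octets. I'm inferring that purely from the output, and I have not checked
how Bird derives the value, so please treat the sketch below as an observation and not a
specification. The helper name `router_id_to_ipv4()` is hypothetical.

```python
import ipaddress


def router_id_to_ipv4(router_id):
    """Interpret the low 32 bits of a Babel router ID as an IPv4 address."""
    octets = bytes(int(part, 16) for part in router_id.split(":"))
    if len(octets) != 8:
        raise ValueError("a Babel router ID is eight octets long")
    return str(ipaddress.IPv4Address(octets[4:]))


# The two router IDs seen in 'birdc show babel entries'.
print(router_id_to_ipv4("00:00:00:00:c0:a8:0a:00"))
print(router_id_to_ipv4("00:00:00:00:c0:a8:0a:01"))
```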
The coolest part is the `extended next hop on` statement, which enables Babel to set the nexthop
to be an IPv6 address, which becomes clear very quickly when looking at the Linux routing table:

```
pim@vpp0-0:~$ ip ro
192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32
unreachable 192.168.10.0/24 proto bird metric 32

pim@vpp0-0:~$ ip -6 ro
2001:678:d78:200:: dev loop0 proto kernel metric 256 pref medium
2001:678:d78:200::1 via fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32 pref medium
unreachable 2001:678:d78:200::/60 dev lo proto bird metric 32 pref medium
fe80::/64 dev loop0 proto kernel metric 256 pref medium
fe80::/64 dev e1 proto kernel metric 256 pref medium
```

**✅ Setting IPv4 routes over IPv6 nexthops works!**

### Babel and VPP

For the [[VPP](https://fd.io)] configuration, I start off with a pretty much _empty_ configuration,
creating only a loopback interface called `loop0`, setting the interfaces up, and exposing them in
_LinuxCP_:

```
vpp0-0# create loopback interface instance 0
vpp0-0# set interface state loop0 up
vpp0-0# set interface ip address loop0 192.168.10.0/32
vpp0-0# set interface ip address loop0 2001:678:d78:200::/128

vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-0# set interface state GigabitEthernet10/0/0 up
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1
vpp0-0# set interface state GigabitEthernet10/0/1 up

vpp0-0# lcp create loop0 host-if loop0
vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0
vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1
```

Between the four VPP routers, the only relevant difference is the IPv4 and IPv6 addresses of the
loopback device. For the rest, things are good. The routing tables quickly fill with all IPv4 and
IPv6 loopbacks across the network.

### Adding support to VPP

IPv6 pings fine and looks good. However, IPv4 endpoints do not ping yet. The first thing I look at
is whether VPP understands how to interpret an IPv4 route with an IPv6 nexthop. I think it does,
because I remember reviewing a change from Adrian during our MPLS
[[project]({{< ref "2023-05-28-vpp-mpls-4" >}})], which he submitted in this
[[Gerrit](https://gerrit.fd.io/r/c/vpp/+/38633)]. His change allows VPP to use routes with
`rtnl_route_nh_get_via()` to map them to a different address family, exactly what I am looking for.
The routes are correctly installed in the FIB:

```
pim@vpp0-0:~$ vppctl show ip fib 192.168.10.1
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[default-route:1, lcp-rt:1, ]
192.168.10.1/32 fib:0 index:31 locks:2
  lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
    path-list:[51] locks:4 flags:shared, uPRF-list:42 len:1 itfs:[2, ]
      path:[72] pl-index:51 ip6 weight=1 pref=32 attached-nexthop: oper-flags:resolved,
        fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1
      [@0]: ipv6 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f0110186dd

 forwarding: unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:34 buckets:1 uRPF:42 to:[0:0]]
    [0] [@5]: ipv4 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f011010800
```

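
The long hex string at the end of each adjacency in the FIB output is the Ethernet header that VPP
will prepend to the packet. A minimal sketch to take it apart, assuming an untagged 14-byte header
(the function name `decode_ethernet_rewrite()` is mine, not something VPP provides):

```python
def decode_ethernet_rewrite(hex_string):
    """Split a 14-byte Ethernet rewrite into destination, source and EtherType."""
    raw = bytes.fromhex(hex_string)
    if len(raw) != 14:
        raise ValueError("expected a 14-byte untagged Ethernet header")

    def fmt(mac):
        return ":".join(f"{octet:02x}" for octet in mac)

    ethertype = int.from_bytes(raw[12:14], "big")
    names = {0x0800: "IPv4", 0x86DD: "IPv6"}
    return fmt(raw[0:6]), fmt(raw[6:12]), names.get(ethertype, hex(ethertype))


# The ip4 forwarding chain, and the ip6 path it was resolved through.
print(decode_ethernet_rewrite("525400f01110525400f011010800"))
print(decode_ethernet_rewrite("525400f01110525400f0110186dd"))
```

Both rewrites point at the same neighbor MAC address, which was learned over IPv6, and differ only
in the EtherType: the forwarding chain for `192.168.10.1/32` emits an IPv4 frame.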
Using the Open vSwitch tap I can clearly see the packets go out from `vpp0-0.e1` and into
`vpp0-1.e0`, but there is no response, so they are getting lost in `vpp0-1` somewhere. I take a look
at a packet trace on `vpp0-1`, as I'm expecting the ICMP packet there:

```
pim@vpp0-1:~$ vppctl show trace
07:42:53:178694: dpdk-input
  GigabitEthernet10/0/0 rx queue 0
  buffer 0x4c513d: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid
  PKT MBUF: port 0, nb_segs 1, pkt_len 98
    buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x29944fc0
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10
  ICMP: 192.168.10.0 -> 192.168.10.1
    tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
    fragment id 0xf52b, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178765: ethernet-input
  frame: flags 0x1, hw-if-index 1, sw-if-index 1
  IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10
07:42:53:178791: ip4-input
  ICMP: 192.168.10.0 -> 192.168.10.1
    tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
    fragment id 0xf52b, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178810: ip4-not-enabled
    ICMP: 192.168.10.0 -> 192.168.10.1
      tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
      fragment id 0xf52b, flags DONT_FRAGMENT
    ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178833: error-drop
  rx:GigabitEthernet10/0/0
07:42:53:178835: drop
  dpdk-input: no error
```

Okay, that checks out! Going over this packet trace, the `ip4-input` node indeed got handed a packet,
which it promptly rejected by forwarding it to `ip4-not-enabled` which drops it. It kind of makes
sense, the VPP dataplane doesn't think it's logical to handle IPv4 traffic on an interface which
does not have an IPv4 address. Except -- I'm bending the rules a little bit by doing exactly that.

#### Approach 1: force-enable ip4 in VPP

There's an internal function `ip4_sw_interface_enable_disable()` which is called to enable IPv4
processing on an interface once the first IPv4 address is added. So my first fix is to force this to
be enabled for any interface that is exposed via Linux Control Plane, notably in `lcp_itf_pair_create()`
[[here](https://git.ipng.ch/ipng/lcpng/blob/main/lcpng_interface.c#L777)].

This approach is partially effective:

```
pim@vpp0-0:~$ ip ro get 192.168.10.1
192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0
    cache
pim@vpp0-0:~$ ping -c5 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=3.92 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=3.81 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.75 ms
64 bytes from 192.168.10.1: icmp_seq=4 ttl=64 time=3.23 ms
64 bytes from 192.168.10.1: icmp_seq=5 ttl=64 time=2.67 ms
^C
--- 192.168.10.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 2.673/3.477/3.921/0.467 ms

pim@vpp0-0:~$ traceroute 192.168.10.3
traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  192.168.10.3 (192.168.10.3)  10.418 ms  10.343 ms  11.362 ms
```

I take a moment to think about why the routers in the middle are not responding to the traceroute,
and it dawns on me that when a router needs to send an ICMPv4 TTL Exceeded message, it can't
select an IPv4 address to originate the message from, as the interface has none.

**🟠 Forwarding works, but ❌ PMTUd does not!**

#### Approach 2: Use unnumbered interfaces

Looking at my options, I see that VPP is capable of using so-called _unnumbered_ interfaces. These
can be left unconfigured, but borrow an address from another interface. It's a good idea to
borrow from `loop0`, which has a valid IPv4 and IPv6 address. It looks like this in VPP:

```
vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0

vpp0-0# show interface address
GigabitEthernet10/0/0 (dn):
  unnumbered, use loop0
  L3 192.168.10.0/32
  L3 2001:678:d78:200::/128
GigabitEthernet10/0/1 (up):
  unnumbered, use loop0
  L3 192.168.10.0/32
  L3 2001:678:d78:200::/128
loop0 (up):
  L3 192.168.10.0/32
  L3 2001:678:d78:200::/128
```

The Linux ControlPlane configuration will always synchronize interface information from VPP to
Linux, as I described back then when I [[worked on the plugin]({{< ref "2021-08-13-vpp-2" >}})].
Babel starts and sets next hops for IPv4 that look like this:

```
pim@vpp0-2:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
loop0            UNKNOWN        192.168.10.2/32 2001:678:d78:200::2/128 fe80::dcad:ff:fe00:0/64
e0               UP             192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1120/64
e1               UP             192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1121/64

pim@vpp0-2:~$ ip ro
192.168.10.0 via 192.168.10.1 dev e0 proto bird metric 32 onlink
unreachable 192.168.10.0/24 proto bird metric 32
192.168.10.1 via 192.168.10.1 dev e0 proto bird metric 32 onlink
192.168.10.3 via 192.168.10.3 dev e1 proto bird metric 32 onlink
```

While on the surface this looks good, for VPP it clearly poses a problem, as my IPv4 neighbors
(192.168.10.1 and 192.168.10.3) are not reachable:

```
pim@vpp0-2:~# ping -c3 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
From 192.168.10.2 icmp_seq=1 Destination Host Unreachable
From 192.168.10.2 icmp_seq=2 Destination Host Unreachable
From 192.168.10.2 icmp_seq=3 Destination Host Unreachable

--- 192.168.10.1 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2034ms
```

I take a look at why that might be, and I notice this on the neighbor `vpp0-1` when I try to ping it
from `vpp0-2`:

```
vpp0-1# show err
   Count        Node         Reason                                 Severity
       5      arp-reply      IP4 source address not local to sub    error
       1      arp-reply      IP4 source address matches local in    error
```

Oh, snap! I traced this down to `src/vnet/arp/arp.c` around line 522 where I can see that VPP, when
it receives an ARP request, wants that to be coming from a peer that is in its own subnet. But with a
point to point link like this one, there is nobody else in the `192.168.10.1/32` subnet! I think
this error should not be returned if the interface is `arp_unnumbered()`, defined further up in the
same source file. I write a small patch in Gerrit [[40482](https://gerrit.fd.io/r/c/vpp/+/40482)]
which removes this requirement and the test that asserts the previous behavior, allowing the ARP
request to succeed, and things shoot to life:

```
pim@vpp0-2:~$ ping -c3 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=11.5 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=1.69 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.03 ms

--- 192.168.10.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 1.689/5.394/11.468/4.329 ms
```

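
To make the ARP source check a bit more tangible, here is a toy model of it in Python. It is my
simplified reading of the behavior described above and emphatically not the code in `arp.c` or in
the Gerrit: the function `accept_arp_request()` and its parameters are invented for illustration.

```python
import ipaddress


def accept_arp_request(sender_ip, interface_prefixes, unnumbered):
    """Toy model of the ARP source check, not the actual VPP code.

    sender_ip:          IPv4 source address carried in the ARP request
    interface_prefixes: prefixes configured on (or borrowed by) the interface
    unnumbered:         True if the interface borrows its addresses
    """
    sender = ipaddress.ip_address(sender_ip)
    on_link = any(
        sender in ipaddress.ip_network(prefix, strict=False)
        for prefix in interface_prefixes
    )
    # Relaxation: a borrowed /32 has no other hosts in its subnet, so the
    # on-link requirement is skipped for unnumbered interfaces.
    return on_link or unnumbered


# vpp0-2 (192.168.10.2) asks vpp0-1, whose interface borrows 192.168.10.1/32.
print(accept_arp_request("192.168.10.2", ["192.168.10.1/32"], unnumbered=False))
print(accept_arp_request("192.168.10.2", ["192.168.10.1/32"], unnumbered=True))
```

In this model a classic numbered transit link is unaffected: a sender inside the interface's own
subnet passes the check either way.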
I make a mental note to discuss this ARP relaxation Gerrit with [[vpp-dev](https://lists.fd.io/g/vpp-dev/message/24155)],
and I'll see where that takes me.

**✅ Forwarding IPv4 routes over IPv4 point-to-point nexthops works!**

#### Approach 3: VPP Unnumbered Hack

At this point, I think I'm good, but one of the cool features of Babel is that it can use IPv6 next
hops for IPv4 destinations. Setting `GigabitEthernet10/0/X` to unnumbered will make
`192.168.10.X/32` reappear on the `e0` and `e1` interfaces, which will make Babel prefer the more
classic IPv4 next-hops. So can I trick it somehow to use IPv6 anyway?

One option is to ask Babel to use `extended next hop` even when IPv4 is available, which would be a
change to Bird (and possibly a violation of the Babel specification, I should read up on that).

But I think there's another way, so I take a look at the VPP code which prints out the **unnumbered,
use loop0** message, and I find a way to know if an interface is borrowing addresses in this way. I
decide to change the LCP plugin to _inhibit_ sync'ing the addresses if they belong to an interface
which is unnumbered. Because I don't know for sure if everybody would find this behavior desirable,
I make sure to guard the behavior behind a backwards compatible configuration option.

If you're curious, please take a look at the change in my [[Git
repo](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
which I:
1. add a new configuration option, `lcp-sync-unnumbered`, which defaults to `on`. That would be
   what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.
1. add a CLI call to change the value, `lcp lcp-sync-unnumbered [on|enable|off|disable]`
1. extend the CLI call to show the LCP plugin state, as an additional output of `lcp show`

And with that, the VPP configuration becomes:

```
vpp0-0# lcp lcp-sync on
vpp0-0# lcp lcp-sync-unnumbered off

vpp0-0# create loopback interface instance 0
vpp0-0# set interface state loop0 up
vpp0-0# set interface ip address loop0 192.168.10.0/32
vpp0-0# set interface ip address loop0 2001:678:d78:200::/128

vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-0# set interface state GigabitEthernet10/0/0 up
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1
vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-0# set interface state GigabitEthernet10/0/1 up

vpp0-0# lcp create loop0 host-if loop0
vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0
vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1
```

## Results

I can claim plausible success on this effort, which makes me wiggle in my seat a little bit, I have
to admit:

```
pim@vpp0-0:~$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
loop0            UNKNOWN        192.168.10.0/32 2001:678:d78:200::/128 fe80::dcad:ff:fe00:0/64
e0               UP             fe80::5054:ff:fef0:1100/64
e1               UP             fe80::5054:ff:fef0:1101/64
e2               DOWN
e3               DOWN

pim@vpp0-0:~$ traceroute -n 192.168.10.3
traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets
 1  192.168.10.1  1.882 ms  2.231 ms  1.472 ms
 2  192.168.10.2  4.243 ms  3.492 ms  2.797 ms
 3  192.168.10.3  6.689 ms  5.925 ms  5.157 ms

pim@vpp0-0:~$ traceroute -n 2001:678:d78:200::3
traceroute to 2001:678:d78:200::3 (2001:678:d78:200::3), 30 hops max, 80 byte packets
 1  2001:678:d78:200::1  2.543 ms  1.762 ms  2.154 ms
 2  2001:678:d78:200::2  4.943 ms  3.063 ms  3.562 ms
 3  2001:678:d78:200::3  6.273 ms  6.694 ms  7.086 ms
```
**✅ Forwarding IPv4 routes over IPv6 nexthops works, ICMPv4 works, PMTUd works!**

I recorded a little [[screencast](/assets/vpp-babel/screencast.cast)] that shows my work, so far:

{{< image src="/assets/vpp-babel/screencast.gif" alt="Babel IPv4-less VPP" >}}

## Additional thoughts

### Comparing OSPFv2 and Babel

Ondrej from the Bird team pointed out (thank you!) that OSPFv2 can also be made to avoid use of IPv4
transit networks, by making use of this `peer` pattern, which is similar but not quite the same as
what I discussed in _Approach 2_ above:

```
$ ip addr add 192.168.10.2 peer 192.168.10.1 dev e0
$ ip addr add 192.168.10.2 peer 192.168.10.3 dev e1
```

The Linux ControlPlane plugin is not currently capable of accepting the `peer` netlink message, and
I can see a problem: VPP does not allow for two interfaces to have the same IP address, _unless_ one
is borrowing from another using _unnumbered_. I wonder why that is ...

I could certainly give implementing that `peer` pattern in Netlink a go, but I'm not enthusiastic.
To consume the netlink message correctly, the plugin would need to assert that the left-hand
(source) IPv4 address strictly corresponds to a loopback, and then internally rewrite the address
addition into an _unnumbered_ use, and also somehow reject (delete?) the netlink configuration
otherwise. Ick!

I think there's a more idiomatic way of doing this in VPP. OSPFv2 doesn't really _need_ to use the
`peer` pattern, as long as the point to point peer is reachable. Babel is emitting a static route
over the interface after using IPv6 to learn its peer's IPv4 address, which is really neat! I
suppose for OSPFv2 setting a manual static route for the peer into the device would do the trick as
well.

The VPP idiom for the `peer` pattern above, which Babel does naturally, and OSPFv2 could be manually
configured to do, would look like this:

```
vpp0-2# set interface ip address loop0 192.168.10.2/32
vpp0-2# set interface state loop0 up

vpp0-2# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-2# set interface state GigabitEthernet10/0/0 up
vpp0-2# ip route add 192.168.10.1/32 via 192.168.10.1 GigabitEthernet10/0/0

vpp0-2# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-2# set interface state GigabitEthernet10/0/1 up
vpp0-2# ip route add 192.168.10.3/32 via 192.168.10.3 GigabitEthernet10/0/1
```

Either way, using point to point connections (like these explicit static routes, or the implied
static routes that the `peer` pattern will yield) over an ethernet broadcast medium will require
getting the ARP [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)] merged. This one seems reasonably
straightforward, because allowing point to point to work over an ethernet broadcast medium is
successfully done by many popular vendors, and I can't find any RFC that forbids it. Perhaps VPP is
being a bit too strict.

### To Unnumbered or Not To Unnumbered

I'm torn between _Approach 2_ and _Approach 3_. On the one hand, the _unnumbered_ interface
would be best reflected in Linux by showing the borrowed addresses there, but that is not without
problems. If the operator subsequently tries to remove one of the addresses on `e0` or `e1`, that
will yield a desync between Linux and VPP (Linux will have removed the address, but VPP will still
be _unnumbered_). On the other hand, tricking Linux (and the operator) into believing there isn't
an IPv4 (and IPv6) address configured on the interface is also not great.

Of the two approaches, I think I prefer _Approach 3_ (changing the Linux CP plugin to not sync
unnumbered addresses), because it minimizes the chance of operator error. If you're reading this and
have an Opinion™, would you please let me know?

## What's Next

I think that over time, IPng Networks might replace OSPF and OSPFv3 with Babel, as it will allow me
to retire the many /31 IPv4 and /112 IPv6 transit networks (which consume about half of my routable
IPv4 addresses!). I will discuss my change with the VPP and Babel/Bird Developer communities and see
if it makes sense to upstream my changes. Personally, I think it's a reasonable direction, because
(a) both changes are backwards compatible and (b) their semantics are pretty straightforward. I'll
also add some configuration knobs to [[vppcfg]({{< ref "2022-04-02-vppcfg-2" >}})] to make it
easier to configure VPP in this way.

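
For a feel of what retiring transit networks buys, here is a back-of-the-envelope sketch. The
link count is a hypothetical number chosen to match a chain of four routers like the one in the
traceroutes above, and the helper `transit_addresses()` is made up for this example.

```python
import ipaddress


def transit_addresses(links, prefix):
    """Addresses consumed when every link gets its own transit prefix."""
    return links * ipaddress.ip_network(prefix).num_addresses


LINKS = 3  # hypothetical: a chain of four routers has three links

print(transit_addresses(LINKS, "100.64.0.0/30"))  # a /30 per link
print(transit_addresses(LINKS, "100.64.0.0/31"))  # a /31 per link
```

With IPv6 link-local next hops the per-link IPv4 cost drops to zero, leaving only one loopback
address per router.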
Of course, migrating AS8298 won't happen overnight: I need to gain a bit more confidence, and
obviously upgrade both Bird2 and VPP using my changes, which I think might benefit from a bit of
peer review. And finally I need to roll this new IPv4-less IGP out very carefully and without
interruptions, which, considering the IGP is the most fundamental building block of the network,
may be tricky.

But, I am uncomfortably excited by the prospect of having my network go entirely without backbone
transit networks. By the way: Babel is amazing!