---
date: "2021-09-02T12:19:14Z"
title: VPP Linux CP - Part5
aliases:
- /s/articles/2021/09/02/vpp-5.html
params:
  asciinema: true
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. One thing notably missing is the higher-level control plane, that is
to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a
VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network
devices that mirror their VPP dataplane counterparts. IPv4 and IPv6 traffic, and associated protocols
like ARP and IPv6 Neighbor Discovery, can now be handled by Linux, while the heavy lifting of packet
forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use
VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the
interface state (links, addresses and routes) itself. When the plugin is completed, running software
like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and
achieving >100Mpps and >100Gbps forwarding rates will be well in reach!

In the previous post, I added support for VPP to consume Netlink messages that describe interfaces,
IP addresses and ARP/ND neighbor changes. This post completes the table-stakes Netlink handler by
adding IPv4 and IPv6 route messages, and ends up with a router in the DFZ consuming 133K IPv6
prefixes and 870K IPv4 prefixes.

## My test setup

The goal of this post is to show what code needed to be written to extend the **Netlink Listener**
plugin I wrote in the [fourth post]({{< ref "2021-08-25-vpp-4" >}}), so that it can consume
route additions/deletions, a thing that is common in dynamic routing protocols such as OSPF and
BGP.

The setup from my [third post]({{< ref "2021-08-15-vpp-3" >}}) is still there, but it's no longer
a focal point for me. I use it (the regular interface + subints and the BondEthernet + subints)
just to ensure my new code doesn't introduce a regression.

Instead, I'm creating two VLAN interfaces now:

- The first is in my home network's _servers_ VLAN. There are three OSPF speakers there:
  - `chbtl0.ipng.ch` and `chbtl1.ipng.ch` are my main routers, they run DANOS and are in
    the Default Free Zone (or DFZ for short).
  - `rr0.chbtl0.ipng.ch` is one of AS50869's three route-reflectors. Every one of the 13
    routers in AS50869 exchanges BGP information with these, and it cuts down on the total
    number of iBGP sessions I have to maintain -- see [here](https://networklessons.com/bgp/bgp-route-reflector)
    for details on Route Reflectors.
- The second is an L2 connection to a local BGP exchange with three members: IPng Networks
  AS50869, Openfactory AS58299, and Stucchinet AS58280. In this VLAN, Openfactory was so kind
  as to configure a full transit session for me, and I'll use it in my test bench.

The test setup offers me the ability to consume OSPF, OSPFv3 and BGP.

### Starting point

Based on the state of the plugin after the [fourth post]({{< ref "2021-08-25-vpp-4" >}}),
operators can create VLANs (including .1q, .1ad, QinQ and QinAD subinterfaces) directly in
Linux. They can change link attributes (like set admin state 'up' or 'down', or change
the MTU on a link), they can add/remove IP addresses, and the system will add/remove IPv4
and IPv6 neighbors. But notably, the following Netlink messages are not yet consumed, as shown
by the following example:

```
pim@hippo:~/src/lcpng$ sudo ip link add link e1 name servers type vlan id 101
pim@hippo:~/src/lcpng$ sudo ip link set servers mtu 1500 up
pim@hippo:~/src/lcpng$ sudo ip addr add 194.1.163.86/27 dev servers
pim@hippo:~/src/lcpng$ sudo ip ro add default via 194.1.163.65
```

which handles the first three commands just fine, but ignores the fourth:

```
linux-cp/nl [debug ]: dispatch: ignored route/route: add family inet type 1 proto 3
table 254 dst 0.0.0.0/0 nexthops { gateway 194.1.163.65 idx 197 }
```

In this post, I'll implement that last missing piece in two functions called `lcp_nl_route_add()`
and `lcp_nl_route_del()`. Here we go!

## Netlink Routes

Reusing the approach from the work-in-progress [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)],
I introduce two FIB sources: one for manual routes (ie. the ones that an operator might set with
`ip route add`), and another one for dynamic routes (ie. what a routing protocol like Bird or FRR
might set). This is handled in `lcp_nl_proto_fib_source()`. Next, I need a bunch of helper
functions that can translate the Netlink message information into VPP primitives:

- `lcp_nl_mk_addr46()` converts a Netlink `nl_addr` to a VPP `ip46_address_t`.
- `lcp_nl_mk_route_prefix()` converts a Netlink `rtnl_route` to a VPP `fib_prefix_t`.
- `lcp_nl_mk_route_mprefix()` converts a Netlink `rtnl_route` to a VPP `mfib_prefix_t` (for
  multicast routes).
- `lcp_nl_mk_route_entry_flags()` generates `fib_entry_flag_t` from the Netlink route type,
  table and proto metadata.
- `lcp_nl_proto_fib_source()` selects the most appropriate FIB source by looking at the
  `rt_proto` field from the Netlink message (see `/etc/iproute2/rt_protos` for a list of
  these). Anything **RTPROT_STATIC** or better is `fib_src`, while anything above that
  becomes `fib_src_dynamic` -- see the sketch after this list.
- `lcp_nl_route_path_parse()` converts a Netlink `rtnl_nexthop` to a VPP `fib_route_path_t`
  and adds that to a growing list of paths. Just as Netlink's nexthops form a list, so do
  the individual paths in VPP, so the two line up perfectly.
- `lcp_nl_route_path_add_special()` adds a blackhole/unreach/prohibit route to the list
  of paths, in the special case where there is not yet a path for the destination.

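To make that split concrete, here's a minimal sketch of such a selector, assuming `fib_src` and
`fib_src_dynamic` are `fib_source_t` handles registered at plugin initialization; the `RTPROT_*`
values come from `linux/rtnetlink.h`:

```c
#include <linux/rtnetlink.h> /* RTPROT_STATIC and friends */

/* Sketch: routes at or below RTPROT_STATIC (redirect, kernel, boot,
 * static) count as manual; higher values (bird, zebra, bgp, ...) are
 * dynamic. fib_src and fib_src_dynamic are assumed to be fib_source_t
 * handles registered elsewhere in the plugin. */
static fib_source_t
lcp_nl_proto_fib_source (uint8_t rt_proto)
{
  return (rt_proto <= RTPROT_STATIC) ? fib_src : fib_src_dynamic;
}
```
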

With these helpers, I have enough to manipulate VPP's forwarding information base, or _FIB_
for short. But in VPP, the _FIB_ consists of any number of _tables_ (think of them as _VRFs_,
or Virtual Routing/Forwarding domains). So first, I need to add these:

- `lcp_nl_table_find()` selects the matching `{table-id,protocol}` (v4/v6) tuple from
  an internally kept hash of tables.
- `lcp_nl_table_add_or_lock()` creates a table in VPP if one with key `{table-id,protocol}`
  (v4/v6) hasn't been used yet, and stores it for future reference. Otherwise it increments
  a table reference counter so I know how many FIB entries VPP will have in this table --
  see the sketch after this list.
- `lcp_nl_table_unlock()` given a table, decreases the refcount on it, and if no more
  prefixes are in the table, removes it from VPP.

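The bookkeeping is plain reference counting. A sketch of the add-or-lock path follows, with a
hypothetical table structure and an assumed `lcp_nl_table_alloc()` helper;
`fib_table_find_or_create_and_lock()` is VPP's real FIB API:

```c
/* Hypothetical bookkeeping entry for a {table-id, protocol} tuple. */
typedef struct
{
  uint32_t table_id;    /* e.g. 254 for the Linux 'main' table */
  fib_protocol_t proto; /* FIB_PROTOCOL_IP4 or FIB_PROTOCOL_IP6 */
  uint32_t fib_index;   /* VPP's index for this FIB table */
  uint32_t refcount;    /* how many FIB entries we put in it */
} lcp_nl_table_t;

static lcp_nl_table_t *
lcp_nl_table_add_or_lock (uint32_t table_id, fib_protocol_t proto)
{
  lcp_nl_table_t *t = lcp_nl_table_find (table_id, proto);

  if (!t)
    {
      /* First prefix in this table: create it in VPP and remember it. */
      t = lcp_nl_table_alloc (table_id, proto); /* assumed helper */
      t->fib_index =
        fib_table_find_or_create_and_lock (proto, table_id, fib_src);
    }
  t->refcount++;
  return t;
}
```
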

All of this code was heavily inspired by the pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)],
but a few finishing touches were added, and it's all wrapped up in this
[[commit](https://git.ipng.ch/ipng/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].

### Deletion

Our main function `lcp_nl_route_del()` removes a route from the given table-id/protocol.
I do this by applying `rtnl_route_foreach_nexthop()` callbacks to the list of Netlink message
nexthops, converting each of them into VPP paths in a `lcp_nl_route_path_parse_t` structure.
If the route is for unreachable/blackhole/prohibit in Linux, I add that path too.

Then, I remove the VPP paths from the FIB and decrease the refcount, removing the table if
it's empty. This is reasonably straightforward; a sketch of the flow follows below.

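Condensed to its essence, the flow could look like this sketch, not the actual plugin code. The
libnl and VPP FIB calls (`rtnl_route_foreach_nexthop()`, `rtnl_route_get_table()`,
`rtnl_route_get_protocol()`, `fib_table_entry_path_remove2()`) are real APIs; the `lcp_nl_*`
helpers and the `paths` member are assumed from the descriptions above:

```c
/* Sketch of the deletion flow; error handling elided. */
static void
lcp_nl_route_del (struct rtnl_route *rr)
{
  fib_prefix_t pfx;
  lcp_nl_route_path_parse_t np = { 0 };
  lcp_nl_table_t *t;

  lcp_nl_mk_route_prefix (rr, &pfx);
  t = lcp_nl_table_find (rtnl_route_get_table (rr), pfx.fp_proto);
  if (!t)
    return; /* we never added anything to this table */

  /* Convert each Netlink nexthop into a VPP fib_route_path_t ... */
  rtnl_route_foreach_nexthop (rr, lcp_nl_route_path_parse, &np);
  /* ... and unreachable/blackhole/prohibit, which carry no nexthop. */
  lcp_nl_route_path_add_special (rr, &np);

  fib_table_entry_path_remove2 (
    t->fib_index, &pfx,
    lcp_nl_proto_fib_source (rtnl_route_get_protocol (rr)), np.paths);
  lcp_nl_table_unlock (t);
}
```
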
### Addition

Adding routes to the FIB is done with `lcp_nl_route_add()`. It immediately becomes obvious
that not all routes are relevant for VPP. A prime example are those in table 255: they are
'local' routes, which have already been set up by the IPv4 and IPv6 address addition functions
in VPP. There are some other route types that are invalid, so I'll just skip those.

Link-local IPv6 and IPv6 multicast routes are also skipped, because they're added when interfaces
get their IP addresses configured. But for the other routes, similar to deletion, I'll extract
the paths from the Netlink message's nexthops list, by constructing an `lcp_nl_route_path_parse_t`
by walking those Netlink nexthops, and optionally add a _special_ route (in case the route was
for unreachable/blackhole/prohibit in Linux -- those won't have a nexthop).

Then, I insert the VPP paths found in the Netlink message into the FIB or the multicast FIB,
respectively.

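To make the early-outs concrete, here's a minimal sketch; the helper name is hypothetical, and the
`RTN_*` constants come from `linux/rtnetlink.h` (table 255 is the kernel's 'local' table):

```c
#include <linux/rtnetlink.h>     /* RTN_UNICAST, RTN_BLACKHOLE, ... */
#include <netlink/route/route.h> /* rtnl_route_get_table(), _get_type() */

/* Sketch of the early-out filtering in lcp_nl_route_add(). */
static int
lcp_nl_route_is_relevant (struct rtnl_route *rr)
{
  uint32_t table_id = rtnl_route_get_table (rr);
  uint8_t type = rtnl_route_get_type (rr);

  /* Table 255 ('local') is maintained by the kernel and already
   * mirrored into VPP by the address-addition code path. */
  if (table_id == 255)
    return 0;

  /* Only unicast and the blackhole/unreach/prohibit specials make
   * sense for the VPP FIB; other types are skipped. */
  return (type == RTN_UNICAST || type == RTN_BLACKHOLE
          || type == RTN_UNREACHABLE || type == RTN_PROHIBIT);
}
```
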
## Control Plane: Bird

So with this newly added code, the example above of setting a default route springs to life.
But I can do better! At IPng Networks, my routing suite of choice is Bird2, and I have some
code to generate configurations for it and push those configs safely to routers. So, let's
take a closer look at a configuration on the test machine running VPP + Linux CP with this
new Netlink route handler.

```
router id 194.1.163.86;
protocol device { scan time 10; }
protocol direct { ipv4; ipv6; check link yes; }
```

These first two protocols are internal implementation details. The first, called _device_,
periodically scans the network interface list in Linux, to pick up new interfaces. You can
compare it to issuing `ip link` and acting on additions/removals as they occur. The second,
called _direct_, generates directly connected routes for interfaces that have IPv4 or IPv6
addresses configured. It turns out that if I add `194.1.163.86/27` as an IPv4 address on
an interface, it'll generate several Netlink messages: one for the `RTM_NEWADDR` which
I discussed in my [fourth post]({{< ref "2021-08-25-vpp-4" >}}), and also a `RTM_NEWROUTE`
for the connected `194.1.163.64/27` in this case. It helps the kernel understand that if
we want to send a packet to a host in that prefix, we should not send it to the default
gateway, but rather to a nexthop of the device. Those are interchangeably called `direct`
or `connected` routes. Ironically, these are called `RTS_DEVICE` routes in Bird2
[ref](https://github.com/BIRD/bird/blob/master/nest/route.h#L373) even though they are
generated by the `direct` routing protocol.

That brings me to the third protocol, one for each address type:

```
protocol kernel kernel4 {
  ipv4 {
    import all;
    export where source != RTS_DEVICE;
  };
}

protocol kernel kernel6 {
  ipv6 {
    import all;
    export where source != RTS_DEVICE;
  };
}
```

We're asking Bird to import any route it learns from the kernel, and we're asking it to
export any route that's not an `RTS_DEVICE` route. The reason for this is that when we
create IPv4/IPv6 addresses, the `ip` command already adds the connected route, and this
keeps Bird from inserting a second, identical route for those connected routes. And with
that, I have a very simple view, given for example these two interfaces:

```
pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip route
45.129.224.232/29 dev ixp proto kernel scope link src 45.129.224.235
194.1.163.64/27 dev servers proto kernel scope link src 194.1.163.86

pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 route
2a0e:5040:0:2::/64 dev ixp proto kernel metric 256 pref medium
2001:678:d78:3::/64 dev servers proto kernel metric 256 pref medium

pim@hippo:/etc/bird$ birdc show route
BIRD 2.0.7 ready.
Table master4:
45.129.224.232/29    unicast [direct1 20:48:55.547] * (240)
        dev ixp
194.1.163.64/27      unicast [direct1 20:48:55.547] * (240)
        dev servers

Table master6:
2a0e:5040:1001::/64  unicast [direct1 20:48:55.547] * (240)
        dev stucchi
2001:678:d78:3::/64  unicast [direct1 20:48:55.547] * (240)
        dev servers
```


## Control Plane: OSPF

Considering the `servers` network above has a few OSPF speakers in it, I will introduce this
router there as well. The configuration is very straightforward in Bird; let's just add
the OSPF and OSPFv3 protocols as follows:

```
protocol ospf v2 ospf4 {
  ipv4 { export where source = RTS_DEVICE; import all; };
  area 0 {
    interface "lo" { stub yes; };
    interface "servers" { type broadcast; cost 5; };
  };
}

protocol ospf v3 ospf6 {
  ipv6 { export where source = RTS_DEVICE; import all; };
  area 0 {
    interface "lo" { stub yes; };
    interface "servers" { type broadcast; cost 5; };
  };
}
```


Here, I tell OSPF to export all `connected` routes, and accept any route given to it. The only
difference between IPv4 and IPv6 is that the former uses OSPF version 2 of the protocol, and IPv6
uses version 3 of the protocol. And, as with the `kernel` routing protocol above, each instance
has to have its own unique name, so I make the obvious choice.

Within a few seconds, the OSPF Hello packets can be seen going out of the `servers` interface,
and adjacencies form shortly thereafter:

```
pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip ro | wc -l
83
pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 ro | wc -l
74

pim@hippo:~/src/lcpng$ birdc show ospf nei ospf4
BIRD 2.0.7 ready.
ospf4:
Router ID       Pri     State       DTime   Interface  Router IP
194.1.163.3       1     Full/Other  39.588  servers    194.1.163.66
194.1.163.87      1     Full/DR     39.588  servers    194.1.163.87
194.1.163.4       1     Full/Other  39.588  servers    194.1.163.67

pim@hippo:~/src/lcpng$ birdc show ospf nei ospf6
BIRD 2.0.7 ready.
ospf6:
Router ID       Pri     State       DTime   Interface  Router IP
194.1.163.87      1     Full/DR     32.221  servers    fe80::5054:ff:feaa:2b24
194.1.163.3       1     Full/BDR    39.504  servers    fe80::9e69:b4ff:fe61:7679
194.1.163.4       1     2-Way/Other 38.357  servers    fe80::9e69:b4ff:fe61:a1dd
```


And all of these were inserted into the VPP forwarding information base. Take for example
the IPng router in Amsterdam, with loopback addresses `194.1.163.32` and `2001:678:d78::8`:

```
DBGvpp# show ip fib 194.1.163.32
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, lcp-rt:1, nat-hi:2, ]
194.1.163.32/32 fib:0 index:70 locks:2
  lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
    path-list:[49] locks:142 flags:shared,popular, uPRF-list:49 len:1 itfs:[16, ]
      path:[69] pl-index:49 ip4 weight=1 pref=32 attached-nexthop: oper-flags:resolved,
        194.1.163.67 TenGigabitEthernet3/0/1.3
      [@0]: ipv4 via 194.1.163.67 TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca324615810000650800

 forwarding:   unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:72 buckets:1 uRPF:49 to:[0:0]]
    [0] [@5]: ipv4 via 194.1.163.67 TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca324615810000650800

DBGvpp# show ip6 fib 2001:678:d78::8
ipv6-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, ]
2001:678:d78::8/128 fib:0 index:130058 locks:2
  lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
    path-list:[116] locks:220 flags:shared,popular, uPRF-list:106 len:1 itfs:[16, ]
      path:[141] pl-index:116 ip6 weight=1 pref=32 attached-nexthop: oper-flags:resolved,
        fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3
      [@0]: ipv6 via fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca3246158100006586dd

 forwarding:   unicast-ip6-chain
  [@0]: dpo-load-balance: [proto:ip6 index:130060 buckets:1 uRPF:106 to:[0:0]]
    [0] [@5]: ipv6 via fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca3246158100006586dd
```


In the snippet above, we can see the Linux CP Netlink Listener plugin doing its work.
It found the right nexthop, the right interface, enabled the FIB entry, and marked it with the
correct _FIB_ source `lcp-rt-dynamic`. And, with OSPF and OSPFv3 now enabled, VPP has gained
visibility to all of my internal network:

```
pim@hippo:~/src/lcpng$ traceroute nlams0.ipng.ch
traceroute to nlams0.ipng.ch (2001:678:d78::8) from 2001:678:d78:3::86, 30 hops max, 24 byte packets
 1  chbtl1.ipng.ch (2001:678:d78:3::1)  0.3182 ms  0.2840 ms  0.1841 ms
 2  chgtg0.ipng.ch (2001:678:d78::2:4:2)  0.5473 ms  0.6996 ms  0.6836 ms
 3  chrma0.ipng.ch (2001:678:d78::2:0:1)  0.7700 ms  0.7693 ms  0.7692 ms
 4  defra0.ipng.ch (2001:678:d78::7)  6.6586 ms  6.6443 ms  6.9292 ms
 5  nlams0.ipng.ch (2001:678:d78::8)  12.8321 ms  12.9398 ms  12.6225 ms
```


## Control Plane: BGP

But the holy grail, and what got me started on this whole adventure, is to be able to participate in the
_Default Free Zone_ using BGP. So let's put these plugins to the test and load up a so-called _full table_,
which means: all the routing information needed to reach any part of the internet. As of August'21,
there are about 870'000 such prefixes for IPv4, and about 133'000 prefixes for IPv6. We passed the magic
1M number, which I'm sure makes some silicon vendors anxious, because lots of older kit in the field won't
scale beyond a certain size. VPP is totally immune to this problem, so here we go!

```
template bgp T_IBGP4 {
  local as 50869;
  neighbor as 50869;
  source address 194.1.163.86;
  ipv4 { import all; export none; next hop self on; };
};
protocol bgp rr4_frggh0 from T_IBGP4 { neighbor 194.1.163.140; }
protocol bgp rr4_chplo0 from T_IBGP4 { neighbor 194.1.163.148; }
protocol bgp rr4_chbtl0 from T_IBGP4 { neighbor 194.1.163.87; }

template bgp T_IBGP6 {
  local as 50869;
  neighbor as 50869;
  source address 2001:678:d78:3::86;
  ipv6 { import all; export none; next hop self ibgp; };
};
protocol bgp rr6_frggh0 from T_IBGP6 { neighbor 2001:678:d78:6::140; }
protocol bgp rr6_chplo0 from T_IBGP6 { neighbor 2001:678:d78:7::148; }
protocol bgp rr6_chbtl0 from T_IBGP6 { neighbor 2001:678:d78:3::87; }
```


And with these two blocks, I've added six new protocols -- three of them are IPv4 route-reflector
clients, and three of them are IPv6 ones. Once this configuration is committed, Bird will be able
to find these IP addresses due to the OSPF routes being loaded into the _FIB_, and once it does
that, each of the route-reflector servers will download a full routing table into Bird's memory,
and in turn Bird will use the `kernel4` and `kernel6` protocols to export them into Linux
(essentially performing an `ip ro add ... via ...` for each), and the kernel will then generate a
Netlink message, which the Linux CP **Netlink Listener** plugin will pick up and the rest, as they
say, is history.

I gotta tell you - the first time I saw this working end to end, I was elated. Just seeing blocks
of 6800-7000 of these being pumped into VPP's _FIB_ every 40ms was just .. magical. And the
performance is pretty good, too: 7000 messages per 40ms is 175K/sec, which means a VPP operator
can not only consume, but also program into the _FIB_, a full IPv4 and IPv6 table in about 6
seconds, whoa!

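A quick back-of-the-envelope check on those numbers, using the full-table sizes quoted earlier:

```latex
% ~7000 messages per 40 ms batch:
\frac{7000\ \text{msgs}}{40\ \text{ms}} = 175\,000\ \text{msgs/s}
% ~870k IPv4 + ~133k IPv6 prefixes:
\frac{870\,000 + 133\,000\ \text{prefixes}}{175\,000\ \text{msgs/s}} \approx 5.7\ \text{s}
```
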
```
DBGvpp#
linux-cp/nl [warn ]: process_msgs: Processed 6550 messages in 40001 usecs, 2607 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6368 messages in 40000 usecs, 7012 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6460 messages in 40001 usecs, 13163 left in queue
...
linux-cp/nl [warn ]: process_msgs: Processed 6418 messages in 40004 usecs, 93606 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6438 messages in 40002 usecs, 96944 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6575 messages in 40002 usecs, 99986 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6552 messages in 40004 usecs, 94767 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 5890 messages in 40001 usecs, 88877 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6829 messages in 40003 usecs, 82048 left in queue
...
linux-cp/nl [warn ]: process_msgs: Processed 6685 messages in 40004 usecs, 13576 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6701 messages in 40003 usecs, 6893 left in queue
linux-cp/nl [warn ]: process_msgs: Processed 6579 messages in 40003 usecs, 314 left in queue
DBGvpp#
```


Due to a good cooperative multitasking approach in the Netlink message queue producer, I
continuously read Netlink messages from the kernel and put them in a queue, but only consume
40ms worth or 8000 messages, whichever comes first, after which I yield control back to VPP.
So you can see here that when the kernel is flooding the Netlink messages of the learned BGP
routing table, the plugin correctly consumes what it can, the queue grows (in this case to just
about 100K messages) and then quickly shrinks again. A sketch of that consumer loop follows below.

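Here is a minimal sketch of such a budgeted consumer; the queue helpers (`lcp_nl_queue_pop()`,
`lcp_nl_dispatch()`) and the constants are hypothetical, while `vlib_time_now()` and libnl's
`nlmsg_free()` are real APIs:

```c
#include <vlib/vlib.h>   /* vlib_time_now(), vlib_get_main() */
#include <netlink/msg.h> /* struct nl_msg, nlmsg_free() */

#define LCP_NL_BATCH_MAX_MSGS  8000  /* assumed per-run message budget */
#define LCP_NL_BATCH_MAX_USECS 40000 /* assumed per-run time budget */

/* Sketch: drain at most 40ms / 8000 messages worth of queued Netlink
 * messages per invocation, then yield back to VPP's main loop; the
 * remainder stays queued for the next run. */
static void
lcp_nl_process_msgs (void)
{
  f64 start = vlib_time_now (vlib_get_main ());
  u32 n = 0;
  struct nl_msg *msg;

  while (n < LCP_NL_BATCH_MAX_MSGS
         && (msg = lcp_nl_queue_pop ()) != NULL) /* assumed helper */
    {
      lcp_nl_dispatch (msg); /* route/addr/link/neigh handling */
      nlmsg_free (msg);
      n++;
      if ((vlib_time_now (vlib_get_main ()) - start) * 1e6
          >= LCP_NL_BATCH_MAX_USECS)
        break; /* time budget exhausted */
    }
}
```
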

And indeed, Bird, IP and VPP all seem to agree, we did a good job:
```
pim@hippo:~/src/lcpng$ birdc show route count
BIRD 2.0.7 ready.
1741035 of 1741035 routes for 870479 networks in table master4
396518 of 396518 routes for 132479 networks in table master6
Total: 2137553 of 2137553 routes for 1002958 networks in 2 tables

pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 ro | wc -l
132430
pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip ro | wc -l
870494

pim@hippo:~/src/lcpng$ vppctl sh ip6 fib sum | awk '$1~/[0-9]+/ { total += $2 } END { print total }'
132479
pim@hippo:~/src/lcpng$ vppctl sh ip fib sum | awk '$1~/[0-9]+/ { total += $2 } END { print total }'
870529
```


## Results

The functional regression test I made on day one, the one that ensures end-to-end connectivity to and
from the Linux host interfaces works for all 5 interface types (untagged, .1q tagged, QinQ, .1ad tagged
and QinAD) and for both physical and virtual interfaces (like `TenGigabitEthernet3/0/0` and `BondEthernet0`),
still works. Great.

Here's a screencast [[asciinema](/assets/vpp/432943.cast), [gif](/assets/vpp/432942.gif)] showing me
playing around a bit with the configuration shown above, demonstrating that RIB and FIB
synchronisation works pretty well in both directions, making the combination of these two plugins
sufficient to run a VPP router in the _Default Free Zone_. Whoohoo!

{{< asciinema src="/assets/vpp/432943.cast" >}}

### Future work

**Atomic Updates** - When running VPP + Linux CP in a default-free-zone BGP environment,
IPv4 and IPv6 prefixes will be constantly updated as the internet topology morphs and changes.
One thing I noticed is that those are often deletes followed by adds with the exact same
nexthop (ie. something in Germany flapped, and this is not deduplicated), which shows up
as many pairs of messages like so:

```
linux-cp/nl [debug ]: route_del: netlink route/route: del family inet6 type 1 proto 12 table 254 dst 2a10:cc40:b03::/48 nexthops { gateway fe80::9e69:b4ff:fe61:a1dd idx 197 }
linux-cp/nl [debug ]: route_path_parse: path ip6 fe80::9e69:b4ff:fe61:a1dd, TenGigabitEthernet3/0/1.3, []
linux-cp/nl [info ]: route_del: table 254 prefix 2a10:cc40:b03::/48 flags
linux-cp/nl [debug ]: route_add: netlink route/route: add family inet6 type 1 proto 12 table 254 dst 2a10:cc40:b03::/48 nexthops { gateway fe80::9e69:b4ff:fe61:a1dd idx 197 }
linux-cp/nl [debug ]: route_path_parse: path ip6 fe80::9e69:b4ff:fe61:a1dd, TenGigabitEthernet3/0/1.3, []
linux-cp/nl [info ]: route_add: table 254 prefix 2a10:cc40:b03::/48 flags
linux-cp/nl [info ]: process_msgs: Processed 2 messages in 225 usecs
```


See how `2a10:cc40:b03::/48` is first removed, and then immediately reinstalled with the exact same
nexthop `fe80::9e69:b4ff:fe61:a1dd` on interface `TenGigabitEthernet3/0/1.3`? Although it only takes
225µs, it's still a bit sad to parse and create paths, just to remove the entry from the FIB and
re-insert the exact same thing. But more importantly, if a packet destined for this prefix arrives
in that 225µs window, it will be lost. So I think I'll build a peek-ahead mechanism to capture
specifically this occurrence, and let the two del+add messages cancel each other out -- see the
sketch below.

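Such a peek-ahead could look like the following sketch; `lcp_nl_queue_peek()`,
`lcp_nl_queue_drop()` and `lcp_nl_route_identical()` are hypothetical helpers, while
`nl_object_get_msgtype()` is libnl's real accessor:

```c
#include <linux/rtnetlink.h> /* RTM_NEWROUTE, RTM_DELROUTE */
#include <netlink/object.h>  /* nl_object_get_msgtype() */

/* Sketch: before executing an RTM_DELROUTE, peek at the next queued
 * message; if it is an RTM_NEWROUTE for the same prefix with the same
 * nexthops, drop both and leave the FIB entry untouched. */
static int
lcp_nl_try_cancel_del_add (struct nl_object *del)
{
  struct nl_object *next = lcp_nl_queue_peek (); /* assumed helper */

  if (next && nl_object_get_msgtype (del) == RTM_DELROUTE
      && nl_object_get_msgtype (next) == RTM_NEWROUTE
      && lcp_nl_route_identical (del, next)) /* prefix + paths match */
    {
      lcp_nl_queue_drop (); /* consume the add without applying it */
      return 1;             /* tell the caller to skip the delete, too */
    }
  return 0;
}
```
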
**Prefix updates towards lo** - When writing the code, I borrowed a bunch from the
pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)], but that one has a nasty crash which was
hard to debug and which I haven't yet fully understood. It happens when an add/del occurs for a
route towards IPv6 localhost (these are typically seen when Bird shuts down eBGP sessions: if it
no longer has a path to a prefix, it'll mark the prefix as 'unreachable' rather than deleting it).
These are *additions* which have a nexthop without a gateway but with an interface index of 1
(which, in Netlink, is 'lo'). This makes VPP intermittently crash, so I have currently commented
this out while I gain a better understanding. As a result, blackhole/unreachable/prohibit specials
can not be set using the plugin. Beware!
(disabled in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).

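Until then, the guard that skips these messages could be as small as this sketch (the helper name
is hypothetical; the libnl accessors are real):

```c
#include <netlink/route/nexthop.h> /* rtnl_route_nh_get_*() */

/* Sketch: detect the problematic 'unreachable via lo' nexthop shape,
 * ie. no gateway and interface index 1 ('lo' in Netlink), so the
 * caller can skip it instead of crashing VPP. */
static int
lcp_nl_nh_is_lo_special (struct rtnl_nexthop *nh)
{
  return (rtnl_route_nh_get_gateway (nh) == NULL
          && rtnl_route_nh_get_ifindex (nh) == 1);
}
```
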
## Credits

I'd like to make clear that the Linux CP plugin is a collaboration between several great minds,
and that my work stands on other software engineers' shoulders. In particular, most of the Netlink
socket handling and Netlink message queueing was written by Matthew Smith, and I've had a little bit
of help along the way from Neale Ranns and Jon Loeliger. I'd like to thank them for their work!

## Appendix

#### VPP config

We only use one TenGigabitEthernet device on the router, and create two VLANs on it:

```
IP="sudo ip netns exec dataplane ip"

vppctl set logging class linux-cp rate-limit 1000 level warn syslog-level notice
vppctl lcp create TenGigabitEthernet3/0/1 host-if e1 netns dataplane
$IP link set e1 mtu 1500 up

$IP link add link e1 name ixp type vlan id 179
$IP link set ixp mtu 1500 up
$IP addr add 45.129.224.235/29 dev ixp
$IP addr add 2a0e:5040:0:2::235/64 dev ixp

$IP link add link e1 name servers type vlan id 101
$IP link set servers mtu 1500 up
$IP addr add 194.1.163.86/27 dev servers
$IP addr add 2001:678:d78:3::86/64 dev servers
```


#### Bird config

I'm using a purposefully minimalist configuration for demonstration purposes, posted here
in full for posterity:

```
log syslog all;
log "/var/log/bird/bird.log" { debug, trace, info, remote, warning, error, auth, fatal, bug };

router id 194.1.163.86;

protocol device { scan time 10; }
protocol direct { ipv4; ipv6; check link yes; }
protocol kernel kernel4 { ipv4 { import all; export where source != RTS_DEVICE; }; }
protocol kernel kernel6 { ipv6 { import all; export where source != RTS_DEVICE; }; }

protocol ospf v2 ospf4 {
  ipv4 { export where source = RTS_DEVICE; import all; };
  area 0 {
    interface "lo" { stub yes; };
    interface "servers" { type broadcast; cost 5; };
  };
}

protocol ospf v3 ospf6 {
  ipv6 { export where source = RTS_DEVICE; import all; };
  area 0 {
    interface "lo" { stub yes; };
    interface "servers" { type broadcast; cost 5; };
  };
}

template bgp T_IBGP4 {
  local as 50869;
  neighbor as 50869;
  source address 194.1.163.86;
  ipv4 { import all; export none; next hop self on; };
};
protocol bgp rr4_frggh0 from T_IBGP4 { neighbor 194.1.163.140; }
protocol bgp rr4_chplo0 from T_IBGP4 { neighbor 194.1.163.148; }
protocol bgp rr4_chbtl0 from T_IBGP4 { neighbor 194.1.163.87; }

template bgp T_IBGP6 {
  local as 50869;
  neighbor as 50869;
  source address 2001:678:d78:3::86;
  ipv6 { import all; export none; next hop self ibgp; };
};
protocol bgp rr6_frggh0 from T_IBGP6 { neighbor 2001:678:d78:6::140; }
protocol bgp rr6_chplo0 from T_IBGP6 { neighbor 2001:678:d78:7::148; }
protocol bgp rr6_chbtl0 from T_IBGP6 { neighbor 2001:678:d78:3::87; }
```


#### Final note

You may have noticed that the [commit] links are all to git commits in my private working copy. I
want to wait until my [previous work](https://gerrit.fd.io/r/c/vpp/+/33481) is reviewed and
submitted before piling on more changes. Feel free to contact vpp-dev@ for more information in the
mean time :-)