---
date: "2021-08-25T08:55:14Z"
title: VPP Linux CP - Part4
aliases:
- /s/articles/2021/08/25/vpp-4.html
params:
  asciinema: true
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two. One thing notably missing is the higher level control plane, that is
to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a
VPP _plugin_ which is called the **Linux Control Plane**, or LCP for short, which creates Linux network
devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols
like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet
forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use
VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the
interface state (links, addresses and routes) itself. When the plugin is completed, running software
like [FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving >100Mpps
and >100Gbps forwarding rates will be well within reach!

In the first three posts, I added the ability for VPP to synchronize its state (like link state,
MTU, and interface addresses) into Linux. In this post, I'll make a start on the other direction:
allowing changes to interfaces made in Linux to make their way back into VPP!

## My test setup

I'm keeping the setup from the [third post]({{< ref "2021-08-15-vpp-3" >}}). A Linux machine has an
interface `enp66s0f0` which has 4 sub-interfaces (one dot1q tagged, one q-in-q, one dot1ad tagged,
and one q-in-ad), giving me five flavors in total. Then, I created an LACP `bond0` interface, which
also has the whole kit and caboodle of sub-interfaces defined (see the Appendix below for details).
Here's the table again for reference:

| Name | Type | Addresses
|-----------------|------|----------
| enp66s0f0 | untagged | 10.0.1.2/30 2001:db8:0:1::2/64
| enp66s0f0.q | dot1q 1234 | 10.0.2.2/30 2001:db8:0:2::2/64
| enp66s0f0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.0.3.2/30 2001:db8:0:3::2/64
| enp66s0f0.ad | dot1ad 2345 | 10.0.4.2/30 2001:db8:0:4::2/64
| enp66s0f0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.0.5.2/30 2001:db8:0:5::2/64
| bond0 | untagged | 10.1.1.2/30 2001:db8:1:1::2/64
| bond0.q | dot1q 1234 | 10.1.2.2/30 2001:db8:1:2::2/64
| bond0.qinq | outer dot1q 1234, inner dot1q 1000 | 10.1.3.2/30 2001:db8:1:3::2/64
| bond0.ad | dot1ad 2345 | 10.1.4.2/30 2001:db8:1:4::2/64
| bond0.qinad | outer dot1ad 2345, inner dot1q 1000 | 10.1.5.2/30 2001:db8:1:5::2/64

The goal of this post is to show what code needed to be written. It introduces an entirely _new
plugin_, so that we can separate concerns (and have a higher chance of community acceptance
of the plugins). In the first plugin, now called the **Interface Mirror**, I have previously
implemented the VPP-to-Linux synchronization. In this new plugin (called the **Netlink Listener**)
I implement the Linux-to-VPP synchronization using, _quelle surprise_, Netlink message handlers.

### Starting point

Based on the state of the plugin after the [third post]({{< ref "2021-08-15-vpp-3" >}}),
operators can enable `lcp-sync` (which copies changes made in VPP into their Linux counterpart)
and `lcp-auto-subint` (which extends sub-interface creation in VPP to automatically create a
Linux Interface Pair, or _LIP_, and its companion Linux network interface):

```
DBGvpp# lcp lcp-sync on
DBGvpp# lcp lcp-auto-subint on
DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0
DBGvpp# create sub TenGigabitEthernet3/0/0 1234
DBGvpp# create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match
DBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match
DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match

pim@hippo:~/src/lcpng$ ip link | grep e0
1286: e0.1234@e0: <BROADCAST,MULTICAST,M-DOWN> mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1287: e0.1235@e0.1234: <BROADCAST,MULTICAST,M-DOWN> mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1288: e0.1236@e0: <BROADCAST,MULTICAST,M-DOWN> mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1289: e0.1237@e0.1236: <BROADCAST,MULTICAST,M-DOWN> mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
1701: e0: <BROADCAST,MULTICAST> mtu 9050 qdisc mq state DOWN mode DEFAULT group default qlen 1000
```

The vision for this plugin has been that Linux can drive most control-plane operations, such as
creating sub-interfaces, adding/removing addresses, changing MTU on links, etc. We can do that by
listening to [Netlink](https://en.wikipedia.org/wiki/Netlink) messages, which were designed for
transferring miscellaneous networking information between the kernel space and userspace processes
(like `VPP`). Networking utilities, such as the _iproute2_ family and its command line utilities
(like `ip`) use Netlink to communicate with the Linux kernel from userspace.

## Netlink Listener

The first task at hand is to install a Netlink listener. In this new plugin, I first register
`lcp_nl_init()` which adds Linux interface pair (_LIP_) add/del callbacks from the first plugin.
I'm now made aware of new _LIPs_ as they are created.

In `lcp_nl_pair_add_cb()`, I will initiate the Netlink listener for the first interface that gets created,
noting its netns. If subsequent adds are in other netns, I'll just issue a warning. And, I will keep
a refcount so I know how many _LIPs_ are bound to this listener.

In `lcp_nl_pair_del_cb()`, I can remove the listener when the last interface pair is removed.

Then for listening itself, a Netlink socket is opened, and because Linux can be quite chatty on
Netlink sockets, I'll raise its read/write buffers to something quite large (typically 64M read
and 16K write size). One note on this size: it needs some sysctl values to be set before VPP starts,
typically done as follows:

```
pim@hippo:~/src/vpp$ cat << EOF | sudo tee /etc/sysctl.d/81-vpp-Netlink.conf
# Increase Netlink to 64M
net.core.rmem_default=67108864
net.core.wmem_default=67108864
net.core.rmem_max=67108864
net.core.wmem_max=67108864
EOF
pim@hippo:~/src/vpp$ sudo sysctl -p /etc/sysctl.d/81-vpp-Netlink.conf
```

After creating the Netlink socket, I add its file descriptor to VPP's built in file handler, which
will see to polling it. On the file handler, I install `lcp_nl_read_cb()` and `lcp_nl_error_cb()`
callbacks which will be invoked when anything interesting happens on the socket.

A bit of explanation on why I'd use a queue rather than just consuming the Netlink messages directly
as they are offered. I _have to_ use a queue for the common case in which VPP is running single threaded.
Consuming a block of potentially a million route del/adds in one go (say, if BGP is reconverging)
would block VPP from reading new packets from DPDK and, more importantly, new Netlink
messages from the kernel. Those would fill the 64M socket buffer and overflow it, losing Netlink messages,
which is bad because it requires an end to end resync of the Linux namespace into the VPP dataplane,
something called an `NLM_F_DUMP`, but that's a story for another day.

So I process only a batch of messages and only for a maximum amount of time per batch. If there are still
some messages left in the queue, I'll just reschedule consumption after M milliseconds. This allows new
Netlink messages to continuously be read from the kernel by VPP's file handler, even if there's a lot of
work to do.

* `lcp_nl_read_cb()` calls `lcp_nl_callback()` which pushes Netlink messages onto a queue and
  issues an `NL_EVENT_READ` event; any socket read error issues an `NL_EVENT_READ_ERR` event.
* `lcp_nl_error_cb()` simply issues an `NL_EVENT_READ_ERR` event and moves on with life.

To capture these events, I initialize a process node called `lcp_nl_process()`, which handles:
* `NL_EVENT_READ` by calling `lcp_nl_process_msgs()` and processing a batch of messages (either
  a maximum count, or a maximum duration, whichever is reached first).
* `NL_EVENT_READ_ERR` is the other event that can happen, in case VPP's file handler or my own
  `lcp_nl_read_cb()` encounters a read error. All it does is close and reopen the Netlink socket
  in the same network namespace we were in before, in an attempt to minimize the damage, _dazed and
  confused, but trying to continue_.

Alright, so at this point, I have a producer queue that gets added to by the Netlink reader
machinery, so all I have to do is consume the messages. `lcp_nl_process_msgs()` processes up to N messages
and/or for up to M msecs, whichever comes first, and for each individual Netlink message, it
will call `lcp_nl_dispatch()` to handle messages of a given type.

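The batching idea can be sketched in a few lines of Python. This is not the plugin's C code:
`process_batch()`, its defaults for N (`max_msgs`) and M (`max_msecs`), and the injectable `clock`
are assumptions made purely for illustration. It consumes messages until either limit is hit and
reports whether anything is left, so that the caller knows it should reschedule itself:

```python
import time
from collections import deque


def process_batch(queue, dispatch, max_msgs=64, max_msecs=10.0, clock=time.monotonic):
    """Consume queued messages until max_msgs or max_msecs is reached.

    Returns (processed, more_pending); more_pending tells the caller
    to reschedule consumption instead of blocking any longer.
    """
    deadline = clock() + max_msecs / 1000.0
    processed = 0
    while queue and processed < max_msgs:
        dispatch(queue.popleft())
        processed += 1
        if clock() >= deadline:
            break  # time budget spent, leave the rest for the next round
    return processed, bool(queue)


if __name__ == "__main__":
    pending = deque("message-%d" % i for i in range(10))
    done, more = process_batch(pending, print, max_msgs=4)
    print("processed", done, "reschedule" if more else "idle")
```
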
For now, `lcp_nl_dispatch()` just throws the message away after logging it with `format_nl_object()`,
a function that will come in very useful as I start to explore all the different Netlink message types.

The code that forms the basis of our Netlink Listener lives in [[this
commit](https://git.ipng.ch/ipng/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)]. I
specifically want to call out that I was not the primary author: I worked off of Matt and Neale's
awesome work in this pending [Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122).

### Netlink: Neighbor

ARP and IPv6 Neighbor Discovery will trigger a set of Netlink messages, which are of type
`RTM_NEWNEIGH` and `RTM_DELNEIGH`.

First, I'll add a new source file `lcpng_nl_sync.c` that will house these handler functions.
Their purpose is to take state learned from Netlink messages, and apply that state to VPP.

Then, I add `lcp_nl_neigh_add()` and `lcp_nl_neigh_del()` which implement the following
pattern: most Netlink messages are somehow about a `link`, which is identified by an
interface index (`ifindex` or just idx for short). That's the same interface index I stored
when I created the _LIP_, calling it `vif_index` because in VPP, it describes a `virtio`
device which implements the IO for the TAP.

If I'm handling a message for a link with a given ifindex, I can correlate it with a _LIP_. Not all
messages will be related to something VPP knows or cares about; I'll discuss that more later when
I discuss `RTM_NEWLINK` messages.

If there is no _LIP_ associated with the `ifindex`, then clearly this message is about a
Linux interface VPP is not aware of. But, if I can find the _LIP_, I can convert the lladdr
(MAC address) and IP address from the Netlink message into their VPP variants, and then simply
add or remove the ip4/ip6 neighbor adjacency.

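As a rough illustration of that pattern, here's a Python sketch. It is not the plugin's C code:
the message is modelled as a plain dict, and `handle_neigh()`, `parse_lladdr()` and the two
dictionaries standing in for the _LIP_ database and VPP's neighbor table are hypothetical names
chosen for this example:

```python
import ipaddress


def parse_lladdr(text):
    """Convert 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    octets = bytes(int(part, 16) for part in text.split(":"))
    if len(octets) != 6:
        raise ValueError("not a MAC address: %r" % text)
    return octets


def handle_neigh(msg, lips, neighbors, add=True):
    """Apply one neighbor message; returns False if no LIP matches the ifindex.

    lips maps a Linux ifindex to a VPP interface name, and neighbors maps
    (interface name, IP address) to a MAC address.
    """
    lip = lips.get(msg["ifindex"])
    if lip is None:
        return False  # a Linux interface VPP is not aware of
    key = (lip, ipaddress.ip_address(msg["dst"]))
    if add:
        neighbors[key] = parse_lladdr(msg["lladdr"])
    else:
        neighbors.pop(key, None)
    return True
```
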
The code for this first Netlink message handler lives in this
[[commit](https://git.ipng.ch/ipng/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
ironic insight is that after writing the code, I don't think any of it will be necessary, because
the interface plugin will already copy ARP and IPv6 ND packets back and forth and itself update its
neighbor adjacency tables; but I'm leaving the code in for now.

### Netlink: Address

A decidedly more interesting message is `RTM_NEWADDR` and its deletion companion `RTM_DELADDR`.

It's pretty straightforward to add and remove IPv4 and IPv6 addresses on interfaces. I have
to convert the Netlink representation of an IP address to its VPP counterpart with a helper, add
it or remove it, and if there are no link-local addresses left, disable IPv6 on the interface.
There are also a few multicast routes to add (notably 224.0.0.0/24 and ff00::/8, all-local-subnet).

The code for IP address handling is in this
[[commit](https://git.ipng.ch/ipng/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485)], but
when I took it out for a spin, I noticed something curious, looking at the log lines that are
generated for the following sequence:

```
ip addr add 10.0.1.1/30 dev e0
debug linux-cp/nl addr_add: Netlink route/addr: add idx 1488 family inet local 10.0.1.1/30 flags 0x0080 (permanent)
warn linux-cp/nl dispatch: ignored route/route: add family inet type 2 proto 2 table 255 dst 10.0.1.1 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: add family inet type 1 proto 2 table 254 dst 10.0.1.0/30 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: add family inet type 3 proto 2 table 255 dst 10.0.1.0 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: add family inet type 3 proto 2 table 255 dst 10.0.1.3 nexthops { idx 1488 }

ping 10.0.1.2
debug linux-cp/nl neigh_add: Netlink route/neigh: add idx 1488 family inet lladdr 68:05:ca:32:45:94 dst 10.0.1.2 state 0x0002 (reachable) flags 0x0000
notice linux-cp/nl neigh_add: Added 10.0.1.2 lladdr 68:05:ca:32:45:94 iface TenGigabitEthernet3/0/0

ip addr del 10.0.1.1/30 dev e0
debug linux-cp/nl addr_del: Netlink route/addr: del idx 1488 family inet local 10.0.1.1/30 flags 0x0080 (permanent)
notice linux-cp/nl addr_del: Deleted 10.0.1.1/30 iface TenGigabitEthernet3/0/0
warn linux-cp/nl dispatch: ignored route/route: del family inet type 1 proto 2 table 254 dst 10.0.1.0/30 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: del family inet type 3 proto 2 table 255 dst 10.0.1.3 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: del family inet type 3 proto 2 table 255 dst 10.0.1.0 nexthops { idx 1488 }
warn linux-cp/nl dispatch: ignored route/route: del family inet type 2 proto 2 table 255 dst 10.0.1.1 nexthops { idx 1488 }
debug linux-cp/nl neigh_del: Netlink route/neigh: del idx 1488 family inet lladdr 68:05:ca:32:45:94 dst 10.0.1.2 state 0x0002 (reachable) flags 0x0000
error linux-cp/nl neigh_del: Failed 10.0.1.2 iface TenGigabitEthernet3/0/0
```

It is this very last message that's a bit of a surprise -- the ping brought the peer's
lladdr into the neighbor cache; and the subsequent address deletion first removed the address,
then all the typical local routes (the connected, the broadcast, the network, and the self/local);
but then also explicitly deleted the neighbor, which I suppose is correct behavior for Linux,
were it not that VPP already invalidates the neighbor cache and adds/removes the connected routes,
for example in `ip/ip4_forward.c` L826-L830 and L583.

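The numeric `type`, `proto` and `table` fields in the ignored route lines follow the constants
in the kernel's `linux/rtnetlink.h`. The small Python sketch below is only a reading aid and not
part of the plugin; `describe_route()` is a made-up helper, and the tables list just the values
that occur in the log above:

```python
# Values as defined in linux/rtnetlink.h; only those seen in the log are listed.
ROUTE_TYPES = {1: "unicast", 2: "local", 3: "broadcast"}
ROUTE_TABLES = {254: "main", 255: "local"}
ROUTE_PROTOS = {2: "kernel"}


def describe_route(rtype, proto, table, dst):
    """Spell out one route log line; unknown numbers are passed through as-is."""
    return "{} route to {} in table {} (proto {})".format(
        ROUTE_TYPES.get(rtype, rtype),
        dst,
        ROUTE_TABLES.get(table, table),
        ROUTE_PROTOS.get(proto, proto),
    )


if __name__ == "__main__":
    # The four routes the kernel added for 10.0.1.1/30 on e0.
    for args in [(2, 2, 255, "10.0.1.1"), (1, 2, 254, "10.0.1.0/30"),
                 (3, 2, 255, "10.0.1.0"), (3, 2, 255, "10.0.1.3")]:
        print(describe_route(*args))
```
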
I can see more of these false positive non-errors like the one on `lcp_nl_neigh_del()` because
interface and directly connected route addition/deletion is slightly different in VPP than in Linux.
So, I decide to take a little shortcut -- if an addition returns "already there", or a deletion returns
"no such entry", I'll just consider it a successful addition and deletion respectively, saving my eyes
from being screamed at by this red error message. I changed that in this
[[commit](https://git.ipng.ch/ipng/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
turning this situation into a friendly green notice instead.

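In other words, additions and deletions are treated as idempotent. A minimal Python sketch of
that rule follows; the `Result` enum and `is_effective_success()` are hypothetical stand-ins for
illustration and do not mirror VPP's actual return codes:

```python
from enum import Enum


class Result(Enum):
    OK = 0
    ALREADY_EXISTS = 1
    NO_SUCH_ENTRY = 2
    OTHER_ERROR = 3


def is_effective_success(operation, result):
    """An add that finds the entry present, or a del that finds it gone, is fine."""
    if result is Result.OK:
        return True
    if operation == "add":
        return result is Result.ALREADY_EXISTS
    if operation == "del":
        return result is Result.NO_SUCH_ENTRY
    return False
```
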
### Netlink: Link (existing)

There's a bunch of use cases for these messages `RTM_NEWLINK` and `RTM_DELLINK`. They carry information
about carrier (link, no-link), admin state (up/down), MTU, and so on. The function `lcp_nl_link_del()`
is the easier of the two. If I see a message like this for an ifindex that VPP has a _LIP_ for, I'll
just remove it. This means first calling the `lcp_itf_pair_delete()` function and then, if the message
was for a VLAN interface, removing the accompanying sub-interfaces: both the physical one (eg. `TenGigabitEthernet3/0/0.1234`)
as well as the TAP that we used to communicate to the host with (eg. `tap8.1234`).

The other message (the `RTM_NEWLINK` one) is much more complicated, because it's actually many types
of operation all in one message type: we can set the link up/down, change its MTU, and change its MAC
address, in any combination, perhaps like so:
```
ip link set e0 mtu 9216 address 00:11:22:33:44:55 down
```

So in turn, `lcp_nl_link_add()` will first look at admin state and apply it to the phy and tap,
apply the MTU if it's different to what VPP has, and apply the MAC address if it's different to
what VPP has, notably applying MAC addresses only in 'hardware' interfaces, which I now know are
not just physical ones like `TenGigabitEthernet3/0/0` but also virtual ones like `BondEthernet0`.

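The order of operations can be sketched as follows. This is a simplified Python model rather
than the plugin's code: `plan_link_changes()` and the dict layout (`up`, `mtu`, `mac`) are
assumptions for this example, and it only computes which changes would be applied:

```python
def plan_link_changes(current, wanted, is_hardware):
    """List the changes one link message leads to, in the order they are applied.

    current and wanted are dicts with 'up', 'mtu' and 'mac' keys.
    """
    changes = [("admin-state", wanted["up"])]  # applied to both phy and tap
    if wanted["mtu"] != current["mtu"]:
        changes.append(("mtu", wanted["mtu"]))
    # MAC addresses are only applied to 'hardware' interfaces (phys and bonds).
    if is_hardware and wanted["mac"] != current["mac"]:
        changes.append(("mac", wanted["mac"]))
    return changes


if __name__ == "__main__":
    # ip link set e0 mtu 9216 address 00:11:22:33:44:55 down
    vpp = {"up": True, "mtu": 9000, "mac": "68:05:ca:32:46:14"}
    msg = {"up": False, "mtu": 9216, "mac": "00:11:22:33:44:55"}
    print(plan_link_changes(vpp, msg, is_hardware=True))
```
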
One thing I noticed is that link state and MTU changes tend to go around in circles: they go from Netlink
into VPP with this code, but when `lcp-sync` is on in the interface mirror plugin, changes to link
and MTU will trigger a callback there, which will in turn generate a Netlink message, and so on.
To avoid this loop, I temporarily turn off `lcp-sync` just before handling a batch of messages, and
turn it back to its original state when I'm done with that.

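The save, disable and restore dance is a common pattern. Here's how it might look as a Python
context manager, with the caveat that `SyncState` and `sync_suspended()` are invented for this
sketch and the plugin does this in C around its batch handler:

```python
from contextlib import contextmanager


class SyncState:
    """Stand-in for the interface mirror plugin's lcp-sync setting."""

    def __init__(self, enabled):
        self.enabled = enabled


@contextmanager
def sync_suspended(state):
    """Turn syncing off for the duration of a batch, then restore the old value."""
    saved = state.enabled
    state.enabled = False
    try:
        yield state
    finally:
        state.enabled = saved  # restored even if a handler raises


if __name__ == "__main__":
    lcp_sync = SyncState(enabled=True)
    with sync_suspended(lcp_sync):
        print("handling batch, lcp-sync is", lcp_sync.enabled)
    print("batch done, lcp-sync is", lcp_sync.enabled)
```
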
The code for add/del of existing links is in this
[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].

### Netlink: Link (new)

Here's where it gets interesting! What if the `RTM_NEWLINK` message was for an interface that VPP
doesn't have a _LIP_ for, but specifically describes a VLAN interface? Well, then clearly the operator
is trying to create a new sub-interface. And supporting that operation would be super cool, so let's go!

Using the earlier placeholder hint in `lcp_nl_link_add()` (see the previous
[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
I know that I've gotten a NEWLINK request but the Linux ifindex doesn't have a _LIP_. This could be
because the interface is entirely foreign to VPP, for example somebody created a dummy interface or
a VLAN sub-interface on one:
```
ip link add dum0 type dummy
ip link add link dum0 name dum0.10 type vlan id 10
```

Or perhaps more interestingly, the operator is actually trying to create a VLAN sub-interface on an
interface we created in VPP earlier, like these:
```
ip link add link e0 name e0.1234 type vlan id 1234
ip link add link e0.1234 name e0.1235 type vlan id 1000
ip link add link e0 name e0.1236 type vlan id 2345 proto 802.1ad
ip link add link e0.1236 name e0.1237 type vlan id 1000
```

None of these `RTM_NEWLINK` messages, represented by vif (Linux ifindex), will have a corresponding _LIP_.
So, I try to _create one_ by calling `lcp_nl_link_add_vlan()`.

First, I'll look up the parent ifindex (`dum0` or `e0` in the examples above). The first example parent,
`dum0`, doesn't have a _LIP_, so I bail after logging a warning. The second example however, `e0`,
definitely does have a _LIP_, so it's known to VPP.

Now, I have two further choices:

1. the _LIP_ is a phy (ie `TenGigabitEthernet3/0/0` or `BondEthernet0`) and this is a regular tagged
   interface with a given proto (dot1q or dot1ad); or
1. the _LIP_ is itself a subint (ie `TenGigabitEthernet3/0/0.1234`) and what I'm being asked for is
   actually a QinQ or QinAD sub-interface. Remember, there's an important difference:
   - In Linux these sub-interfaces are chained (`e0` creates child `e0.1234@e0` for a normal VLAN,
     and `e0.1234` creates child `e0.1235@e0.1234` for the QinQ).
   - In VPP these are actually all flat sub-interfaces, with the 'regular' VLAN interface carrying
     the `one_tag` flag with only an `outer_vlan_id` set, and the latter QinQ carrying the `two_tags`
     flag with both an `outer_vlan_id` (1234) and an `inner_vlan_id` (1000).

So I look up both the parent _LIP_ as well as the phy _LIP_. I now have all the ingredients I need to create
the VPP sub-interfaces with the correct inner-dot1q and outer dot1q or dot1ad.

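To make the chained-versus-flat difference concrete, here's a Python sketch that flattens the
Linux view into the tuple VPP needs. It is an illustration only: `flatten_vlan()`, the
`SubInterface` tuple and the `links` dict layout are hypothetical, and the real code works on
_LIPs_ and Netlink attributes rather than on names:

```python
from typing import NamedTuple, Optional


class SubInterface(NamedTuple):
    phy: str
    outer_vlan_id: int
    inner_vlan_id: Optional[int]  # None means one tag only
    dot1ad: bool                  # ethertype of the outer tag


def flatten_vlan(links, name):
    """Turn a chained Linux VLAN interface into a flat VPP-style description.

    links maps an interface name to {} for a phy, or to a dict with
    'parent', 'vlan_id' and 'proto' keys for a VLAN interface.
    """
    link = links[name]
    parent = links[link["parent"]]
    if "vlan_id" not in parent:
        # Parent is a phy: a regular single-tagged sub-interface.
        return SubInterface(link["parent"], link["vlan_id"], None,
                            link["proto"] == "802.1ad")
    # Parent is itself a sub-interface: QinQ or QinAD on the parent's phy.
    return SubInterface(parent["parent"], parent["vlan_id"], link["vlan_id"],
                        parent["proto"] == "802.1ad")


EXAMPLE_LINKS = {
    "e0": {},
    "e0.1234": {"parent": "e0", "vlan_id": 1234, "proto": "802.1q"},
    "e0.1235": {"parent": "e0.1234", "vlan_id": 1000, "proto": "802.1q"},
    "e0.1236": {"parent": "e0", "vlan_id": 2345, "proto": "802.1ad"},
    "e0.1237": {"parent": "e0.1236", "vlan_id": 1000, "proto": "802.1q"},
}
```
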
Of course, I don't really know what sub-interface ID to use. It's appealing to "just" use the vlan id,
but that's not helpful if the outer tag and the inner tag are the same. So I write a helper function
`vnet_sw_interface_get_available_subid()` whose job it is to return an unused subid for the phy,
starting from 1.

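The simplest way to find such an ID is a linear scan, sketched here in Python. Whether the C
helper does exactly this is an assumption on my part; `get_available_subid()` below is just an
illustration of "return the first unused subid, starting from 1":

```python
def get_available_subid(used_subids, start=1):
    """Return the lowest sub-interface ID >= start that is not yet in use."""
    used = set(used_subids)
    subid = start
    while subid in used:
        subid += 1
    return subid


if __name__ == "__main__":
    # Outer 1234 and QinQ 1234.1234 cannot both be called "<phy>.1234",
    # so the IDs are handed out independently of the VLAN tags.
    print(get_available_subid([]))         # first sub-interface on the phy
    print(get_available_subid([1, 2, 4]))  # fills the gap
```
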
Here as well, the interface plugin can be configured to automatically create _LIPs_ for sub-interfaces,
which I have to turn off temporarily to let my new form of creation do its thing. I carefully ensure that
the thread barrier is taken/released and the original setting of `lcp-auto-subint` is restored at all
exit points. One cool thing is that the new link's name is given in the Netlink message, so I can just
use that one. I like the aesthetic a bit more, because here the operator can give the Linux interface
any name they like, whereas in the other direction, VPP's `lcp-auto-subint` feature has to make up
a boring `<phy>.<subid>` name.

Alright, without further ado, the code for the main innovation here, the implementation of
`lcp_nl_link_add_vlan()`, is in this
[[commit](https://git.ipng.ch/ipng/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].

## Results

The functional regression test I made on day one, the one that ensures end-to-end connectivity to and
from the Linux host interfaces works for all 5 interface types (untagged, .1q tagged, QinQ, .1ad tagged
and QinAD) and for both physical and virtual interfaces (like `TenGigabitEthernet3/0/0` and `BondEthernet0`),
still works.

After this code is in, the operator will only have to create a _LIP_ for any phy interfaces, and
can rely on the new Netlink Listener plugin and the use of `ip` in Linux for all the rest. This
implementation starts approaching the 'vanilla' Linux user experience!

Here's a new screencast [[asciinema](/assets/vpp/432243.cast), [gif](/assets/vpp/432243.gif)]
showing me playing around a bit, demonstrating that synchronization works pretty well in both
directions, a huge improvement from the [[previous asciinema](/assets/vpp/430411.cast),
[gif](/assets/vpp/430411.gif)] in my [[second post]({{< ref "2021-08-13-vpp-2" >}})],
which was only two weeks ago:

{{< asciinema src="/assets/vpp/432243.cast" >}}

### Further Work

You will note that there's one important Netlink message type that's missing: routes! They are so
important in fact, that they're a topic of their very own post. Also, I haven't written the code
for them yet :-)

A few things worth noting, as future work.

**Multiple NetNS** - The original Netlink Listener ([ref](https://gerrit.fd.io/r/c/vpp/+/31122)) would
only listen to the default netns specified in the configuration file. This is problematic because the
interface plugin does allow interfaces to be made in other namespaces (by issuing
`lcp create ... host-if X netns foo`), the Netlink world of which will be unknown to VPP. I
created `struct lcp_nl_netlink_namespace` to hold the stuff needed for the Netlink listener,
which is a good starting point to create not one but multiple listeners, one for each unique
namespace that has one or more _LIPs_ defined. This is version-two work :)

**Multithreading** - In testing, I noticed that while my plugins themselves are (or seem to be) thread
safe, `virtio` may not be totally clean, and I noticed that in a multithreaded VPP instance with many
workers, there's a crash in `lcp_arp_phy_node()` where `vlib_buffer_copy()` returns NULL, which should
not happen. When VPP is in such a state, other plugins (notably DHCP and IPv6 ND) also start complaining,
and `show errors` shows millions of `virtio-input` errors about unavailable buffers.
I do confirm though, that running VPP single threaded does not have these issues.

## Credits

I'd like to make clear that the Linux CP plugin is a collaboration between several great minds,
and that my work stands on other software engineers' shoulders. In particular most of the Netlink
socket handling and Netlink message queueing was written by Matthew Smith, and I've had a little bit
of help along the way from Neale Ranns and Jon Loeliger. I'd like to thank them for their work!

## Appendix

#### Ubuntu config

This configuration has been the exact same ever since [my first post]({{< ref "2021-08-12-vpp-1" >}}):
```
# Untagged interface
ip addr add 10.0.1.2/30 dev enp66s0f0
ip addr add 2001:db8:0:1::2/64 dev enp66s0f0
ip link set enp66s0f0 up mtu 9000

# Single 802.1q tag 1234
ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234
ip link set enp66s0f0.q up mtu 9000
ip addr add 10.0.2.2/30 dev enp66s0f0.q
ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q

# Double 802.1q tag 1234 inner-tag 1000
ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000
ip link set enp66s0f0.qinq up mtu 9000
ip addr add 10.0.3.2/30 dev enp66s0f0.qinq
ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq

# Single 802.1ad tag 2345
ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad
ip link set enp66s0f0.ad up mtu 9000
ip addr add 10.0.4.2/30 dev enp66s0f0.ad
ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad

# Double 802.1ad tag 2345 inner-tag 1000
ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q
ip link set enp66s0f0.qinad up mtu 9000
ip addr add 10.0.5.2/30 dev enp66s0f0.qinad
ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad

## Bond interface
ip link add bond0 type bond mode 802.3ad
ip link set enp66s0f2 down
ip link set enp66s0f3 down
ip link set enp66s0f2 master bond0
ip link set enp66s0f3 master bond0
ip link set enp66s0f2 up
ip link set enp66s0f3 up
ip link set bond0 up

ip addr add 10.1.1.2/30 dev bond0
ip addr add 2001:db8:1:1::2/64 dev bond0
ip link set bond0 up mtu 9000

# Single 802.1q tag 1234
ip link add link bond0 name bond0.q type vlan id 1234
ip link set bond0.q up mtu 9000
ip addr add 10.1.2.2/30 dev bond0.q
ip addr add 2001:db8:1:2::2/64 dev bond0.q

# Double 802.1q tag 1234 inner-tag 1000
ip link add link bond0.q name bond0.qinq type vlan id 1000
ip link set bond0.qinq up mtu 9000
ip addr add 10.1.3.2/30 dev bond0.qinq
ip addr add 2001:db8:1:3::2/64 dev bond0.qinq

# Single 802.1ad tag 2345
ip link add link bond0 name bond0.ad type vlan id 2345 proto 802.1ad
ip link set bond0.ad up mtu 9000
ip addr add 10.1.4.2/30 dev bond0.ad
ip addr add 2001:db8:1:4::2/64 dev bond0.ad

# Double 802.1ad tag 2345 inner-tag 1000
ip link add link bond0.ad name bond0.qinad type vlan id 1000 proto 802.1q
ip link set bond0.qinad up mtu 9000
ip addr add 10.1.5.2/30 dev bond0.qinad
ip addr add 2001:db8:1:5::2/64 dev bond0.qinad
```

#### VPP config

We can whittle down the VPP configuration to the bare minimum:
```
vppctl lcp default netns dataplane
vppctl lcp lcp-sync on
vppctl lcp lcp-auto-subint on

## Create `e0`
vppctl lcp create TenGigabitEthernet3/0/0 host-if e0

## Create `be0`
vppctl create bond mode lacp load-balance l34
vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2
vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3
vppctl set interface state TenGigabitEthernet3/0/2 up
vppctl set interface state TenGigabitEthernet3/0/3 up
vppctl lcp create BondEthernet0 host-if be0
```

And the rest of the configuration work is done entirely from the Linux side!
```
IP="sudo ip netns exec dataplane ip"
## `e0` aka TenGigabitEthernet3/0/0
$IP link add link e0 name e0.1234 type vlan id 1234
$IP link add link e0.1234 name e0.1235 type vlan id 1000
$IP link add link e0 name e0.1236 type vlan id 2345 proto 802.1ad
$IP link add link e0.1236 name e0.1237 type vlan id 1000
$IP link set e0 up mtu 9000

$IP addr add 10.0.1.1/30 dev e0
$IP addr add 2001:db8:0:1::1/64 dev e0
$IP addr add 10.0.2.1/30 dev e0.1234
$IP addr add 2001:db8:0:2::1/64 dev e0.1234
$IP addr add 10.0.3.1/30 dev e0.1235
$IP addr add 2001:db8:0:3::1/64 dev e0.1235
$IP addr add 10.0.4.1/30 dev e0.1236
$IP addr add 2001:db8:0:4::1/64 dev e0.1236
$IP addr add 10.0.5.1/30 dev e0.1237
$IP addr add 2001:db8:0:5::1/64 dev e0.1237

## `be0` aka BondEthernet0
$IP link add link be0 name be0.1234 type vlan id 1234
$IP link add link be0.1234 name be0.1235 type vlan id 1000
$IP link add link be0 name be0.1236 type vlan id 2345 proto 802.1ad
$IP link add link be0.1236 name be0.1237 type vlan id 1000
$IP link set be0 up mtu 9000

$IP addr add 10.1.1.1/30 dev be0
$IP addr add 2001:db8:1:1::1/64 dev be0
$IP addr add 10.1.2.1/30 dev be0.1234
$IP addr add 2001:db8:1:2::1/64 dev be0.1234
$IP addr add 10.1.3.1/30 dev be0.1235
$IP addr add 2001:db8:1:3::1/64 dev be0.1235
$IP addr add 10.1.4.1/30 dev be0.1236
$IP addr add 2001:db8:1:4::1/64 dev be0.1236
$IP addr add 10.1.5.1/30 dev be0.1237
$IP addr add 2001:db8:1:5::1/64 dev be0.1237
```

#### Final note

You may have noticed that the [commit] links are all to git commits in my private working copy. I
want to wait until my [previous work](https://gerrit.fd.io/r/c/vpp/+/33481) is reviewed and
submitted before piling on more changes. Feel free to contact vpp-dev@ for more information in the
meantime :-)