1) Imports ENCAP_MPLS labels from IPv4/IPv6 routes.
Note that this requires libnl 3.6.0 or newer.
In previous patches, the fib_path_ext_t had a path ID of -1.
After a long investigation, it turned out to be caused by route weight
being set to 0. There is a comment explaining more details.
2) Handles MPLS routes.
MPLS routes were wrongly added as IPv4 routes before.
POP and SWAP are now both supported.
All the routes are installed as NON-EOS and EOS routes,
as the Linux kernel does not differentiate.
EOS POP used in PHP uses the next-hop address family
to determine the resulting address family.
This patch is sufficient for P setups.
PE setups with implicit null should also function okay, as long as a
separate label gets programmed per address family.
PE setups with explicit null will also forward packets,
but punting is a bit odd and needs MPLS input enabled on the LCP host
device.
Make sure to enable MPLS in VPP first.
3) Propagate MPLS input state to LCP Pair and Linux.
Since the Linux kernel uses the MPLS routes itself,
the LCP pair tap needs MPLS enabled to allow host originated packets.
This also syncs the Linux `net.mpls.conf.<host_if>.input` sysctl to
allow punted packets to have MPLS labels, mostly explicit nulls.
For that to work, load the mpls kernel modules.
4) Cross connect MPLS packets from Linux directly to interface-output
This is a port of https://gerrit.fd.io/r/c/vpp/+/38702
-- it works fine for phy's that are carrier-up, but crashes if they are
carrier-down.
0: /home/pim/src/vpp/src/vnet/interface_funcs.h:46 (vnet_get_hw_interface) assertion `! pool_is_free (vnm->interface_main.hw_interfaces, _e)' fails
at /home/pim/src/vpp/src/vppinfra/error.c:143
ns=0x7fff98774e80 "dataplane", host_sw_if_indexp=0x0) at /home/pim/src/vpp/src/plugins/lcpng/lcpng_interface.c:998
at /home/pim/src/vpp/src/plugins/lcpng/lcpng_if_cli.c:96
parent_command_index=371) at /home/pim/src/vpp/src/vlib/cli.c:591
parent_command_index=0) at /home/pim/src/vpp/src/vlib/cli.c:548
at /home/pim/src/vpp/src/vlib/cli.c:694
- move tap_set_carrier() upstream to lcp_itf_set_link_state()
- refuse to set admin-up on sub-int if parent is down
- no need to switch namespaces, lcp_itf_set_link_state() already does
- in change_mtu and change_admin_state, if the interface is a sub,
we only have to sync that one interface. Otherwise, walk the parent
interface and all sub-ints with lcp_itf_pair_sync_state_hw() and
make note of this in the (DBG) log
This is super complicated work, taken mostly verbatim from the upstream
linux-cp Gerrit, with due credit to mgsmith@netgate.com and neale@grafiant.com.
First, add main handler lcp_nl_route_add() and lcp_nl_route_del()
Introduce two FIB sources: one for manual routes, one for dynamic
routes. See lcp_nl_proto_fib_source() for details.
Add a bunch of helpers that translate Netlink message data into VPP
primitives:
- lcp_nl_mk_addr46()
converts a Netlink nl_addr to a VPP ip46_address_t.
- lcp_nl_mk_route_prefix()
converts a Netlink rtnl_route to a VPP fib_prefix_t.
- lcp_nl_mk_route_mprefix()
converts a Netlink rtnl_route to a VPP mfib_prefix_t.
- lcp_nl_proto_fib_source()
selects the most appropriate fib_src by looking at the rt_proto
(see /etc/iproute2/rt_protos for a hint). Anything RTPROT_STATIC or
better is 'fib_src', while anything above that becomes fib_src_dynamic.
- lcp_nl_mk_route_entry_flags()
generates fib_entry_flag_t from the Netlink route type,
table and proto metadata.
- lcp_nl_route_path_parse()
converts a Netlink rtnl_nexthop to VPP fib_route_path_t and adds
that to a growing list of paths.
- lcp_nl_route_path_add_special()
adds a blackhole/unreach/prohibit route to the list of paths, in
the special case where there is not yet a path for the destination.
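As a minimal sketch of the source selection described for
lcp_nl_proto_fib_source(): the RTPROT values below match
<linux/rtnetlink.h>, but the sketch_* names are hypothetical, and reading
"RTPROT_STATIC or better" as "numerically lower or equal" is an assumption.

```c
/* Route protocol numbers, same values as in <linux/rtnetlink.h>. */
enum
{
  SKETCH_RTPROT_KERNEL = 2,
  SKETCH_RTPROT_BOOT = 3,
  SKETCH_RTPROT_STATIC = 4,
  SKETCH_RTPROT_BIRD = 12,
};

/* Hypothetical stand-ins for the two FIB sources the plugin registers. */
typedef enum
{
  SKETCH_FIB_SRC_STATIC,  /* manual routes */
  SKETCH_FIB_SRC_DYNAMIC, /* routes learned by a routing daemon */
} sketch_fib_src_t;

/* Assumption: "RTPROT_STATIC or better" means rt_proto <= RTPROT_STATIC. */
sketch_fib_src_t
sketch_proto_fib_source (unsigned char rt_proto)
{
  return (rt_proto <= SKETCH_RTPROT_STATIC) ? SKETCH_FIB_SRC_STATIC :
					      SKETCH_FIB_SRC_DYNAMIC;
}
```

With this split, a route installed by hand lands in the manual source, while
one installed by Bird (rt_proto 12) lands in the dynamic source.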
Now we're ready to insert FIB entries:
- lcp_nl_table_find()
selects the matching table-id,protocol(v4/v6) from a hash of tables.
- lcp_nl_table_add_or_lock()
if a table-id,protocol(v4/v6) hasn't been used yet, create one,
otherwise increment a table reference counter so we know how many
FIB entries we have in this table. Then, return it.
- lcp_nl_table_unlock()
Decrease the refcount on a table, and if no more prefixes are in
the table, remove it from VPP.
- lcp_nl_route_del()
Remove a route from the given table-id/protocol. Do this by applying
rtnl_route_foreach_nexthop() to the list of Netlink nexthops,
converting them into VPP paths in a lcp_nl_route_path_parse_t
structure. If the route is for unreachable/blackhole/prohibit in
Linux, add that path too.
Then, remove the VPP paths from the FIB and reduce refcnt or
remove the table if it's empty using table_unlock().
- lcp_nl_route_add()
Not all routes are relevant for VPP. Those in table 255 are 'local'
routes, already set up by ip[46]_address_add(), and some other route
types are invalid; skip those. Link-local IPv6 and IPv6 multicast are
also skipped. Then, construct lcp_nl_route_path_parse_t by walking
the Netlink nexthops, and optionally add a special (in case the
route was for unreachable/blackhole/prohibit in Linux -- those won't
have a nexthop).
Then, insert the VPP paths found in the Netlink message into the FIB
or the multicast FIB, respectively.
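The table reference counting behind lcp_nl_table_add_or_lock() and
lcp_nl_table_unlock() can be sketched roughly as follows. This is an
illustration only: the sketch_* names are hypothetical, and a linked list
stands in for the hash of tables to keep the example self-contained.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical table record, keyed on (table_id, is_ip6). */
typedef struct sketch_table
{
  uint32_t table_id;
  int is_ip6;
  uint32_t refcnt; /* number of FIB entries using this table */
  struct sketch_table *next;
} sketch_table_t;

typedef struct
{
  sketch_table_t *head;
} sketch_table_db_t;

/* Return the table matching (table_id, is_ip6), or NULL. */
sketch_table_t *
sketch_table_find (sketch_table_db_t *db, uint32_t table_id, int is_ip6)
{
  for (sketch_table_t *t = db->head; t; t = t->next)
    if (t->table_id == table_id && t->is_ip6 == is_ip6)
      return t;
  return NULL;
}

/* Create the table on first use, otherwise take another reference. */
sketch_table_t *
sketch_table_add_or_lock (sketch_table_db_t *db, uint32_t table_id,
			  int is_ip6)
{
  sketch_table_t *t = sketch_table_find (db, table_id, is_ip6);
  if (!t)
    {
      t = calloc (1, sizeof (*t));
      if (!t)
	return NULL;
      t->table_id = table_id;
      t->is_ip6 = is_ip6;
      t->next = db->head;
      db->head = t;
    }
  t->refcnt++;
  return t;
}

/* Drop one reference; remove the table when nothing uses it anymore. */
void
sketch_table_unlock (sketch_table_db_t *db, sketch_table_t *t)
{
  assert (t->refcnt > 0);
  if (--t->refcnt > 0)
    return;
  for (sketch_table_t **pp = &db->head; *pp; pp = &(*pp)->next)
    if (*pp == t)
      {
	*pp = t->next;
	break;
      }
  free (t);
}
```

Each route add takes one reference and each route delete releases one, so
the table disappears exactly when its last prefix is removed.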
And with that, Bird shoots to life. Both IPv4 and IPv6 OSPF interior
gateway protocol and BGP full tables can be consumed, on my bench in
about 9 seconds:
- A batch of 2048 Netlink messages is handled in 9-11ms, so we can do
approx 200K messages/sec at peak. This will consume 50% CPU due
to the yielding logic in lcp_nl_process(): see the 'case
NL_EVENT_READ' block, which adds a cooldown period of
LCP_NL_PROCESS_WAIT milliseconds between batches.
- With 3 route reflectors and 2 full BGP peers, at peak I could see
309K messages left in the producer queue.
- All IPv4 and IPv6 prefixes made their way into the FIB
pim@hippo:~/src/lcpng$ echo -n "IPv6: "; vppctl sh ip6 fib summary | awk '$1~/[0-9]+/ { total += $2 } END { print total }'
IPv6: 132506
pim@hippo:~/src/lcpng$ echo -n "IPv4: "; vppctl sh ip fib summary | awk '$1~/[0-9]+/ { total += $2 } END { print total }'
IPv4: 869966
- Compared to Bird2's view:
pim@hippo:~/src/lcpng$ birdc show route count
BIRD 2.0.7 ready.
3477845 of 3477845 routes for 869942 networks in table master4
527887 of 527887 routes for 132484 networks in table master6
Total: 4005732 of 4005732 routes for 1002426 networks in 2 tables
- Flipping one of the full feeds to another, forcing a reconvergence
of every prefix in the FIB took about 8 seconds, peaking at 242K
messages in the queue, with again an average consumption of 2048
messages per 9-10ms.
- All of this was done while iperf'ing 6Gbps to and from the
controlplane.
---
Because handling a full BGP table is O(1M) messages, I will have to make
some changes in the logging:
- all neigh/route messages become DBG/INFO at best
- all addr/link messages become INFO/NOTICE at best
- when we overflow time/msgs, turn process_msgs into a WARN, otherwise
keep it at INFO so as not to spam.
In lcpng_interface.c:
- Log NOTICE for pair_add() and pair_del() call;
- Log NOTICE for set_interface_addr() call;
With this approach, setting the logging level of the linux-cp/nl plugin
to 'notice' hits the sweet spot: it shows the things that the operator has
~explicitly done, while implicit actions (BGP route adds/dels, ARP/ND)
stay below the NOTICE level.
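The overflow rule for process_msgs above could look something like this
sketch. The sketch_* names and the limit parameters are hypothetical, and
treating "overflow" as "the batch hit either its message or its time limit"
is an assumption.

```c
/* Hypothetical subset of log levels, ordered by severity. */
typedef enum
{
  SKETCH_LOG_INFO,
  SKETCH_LOG_WARN,
} sketch_log_level_t;

/*
 * Pick the log level for the per-batch summary line.
 * Assumption: a batch "overflows" when it reaches its message budget or
 * its time budget; only then is the summary worth a warning.
 */
sketch_log_level_t
sketch_batch_log_level (unsigned n_msgs, unsigned max_msgs,
			double elapsed_ms, double max_ms)
{
  if (n_msgs >= max_msgs || elapsed_ms >= max_ms)
    return SKETCH_LOG_WARN;
  return SKETCH_LOG_INFO;
}
```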
Introduce lcp_main.lcp_sync, which determines if state changes made
to interfaces in VPP do or don't propagate into Linux.
- Add a startup.conf directive 'lcp-sync' to enable at startup time.
- Add CLI .short_help = "lcp lcp-sync [on|enable|off|disable]",
- Show the current value in "show lcp".
Gate changes in mtu, state and address on lcp_lcp_sync().
When the operator issues 'lcp lcp-sync on', it is prudent to do a
one-off sync of all interface attributes from VPP into Linux.
For this, add a lcp_itf_pair_sync_state_all() function.
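A rough sketch of the gating and the one-off sync, with hypothetical
sketch_* names. The counter stands in for a call to
lcp_itf_pair_sync_state_all(), and running the full sync only on an
off-to-on transition is an assumption.

```c
/* Hypothetical stand-in for the relevant part of lcp_main. */
typedef struct
{
  int lcp_sync;		 /* 1 if VPP changes propagate into Linux */
  unsigned n_full_syncs; /* how often the one-off full sync has run */
  unsigned n_mtu_syncs;	 /* how often an MTU change was propagated */
} sketch_lcp_main_t;

/* Toggle the setting; a full sync runs when it goes from off to on. */
void
sketch_lcp_set_sync (sketch_lcp_main_t *lm, int on)
{
  int was_on = lm->lcp_sync;
  lm->lcp_sync = on ? 1 : 0;
  if (!was_on && lm->lcp_sync)
    lm->n_full_syncs++; /* stands in for lcp_itf_pair_sync_state_all() */
}

/* Gate an MTU change: returns 1 if it was propagated, 0 if skipped. */
int
sketch_lcp_mtu_change (sketch_lcp_main_t *lm)
{
  if (!lm->lcp_sync)
    return 0;
  lm->n_mtu_syncs++; /* stands in for writing the MTU via netlink */
  return 1;
}
```

State and address changes would be gated the same way as the MTU change
shown here.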
In preparation of another feature 'netlink-auto-subint', rename
lcp_main's field to "lcp_auto_subint".
Add CLI .short_help = "lcp lcp-auto-subint [on|enable|off|disable]"
Show status of the field on "lcp show" output.
I was looking at the hw interface list, which makes sense for ethernet
devices but not for other devices, notably BondEthernets.
In addition, creating two separate interfaces with the same outer
(for example Gi3/0/0 dot1ad 2345 inner-dot1q 1000 AND the same on
Gi3/0/1) would yield an erratic match and a crash.
Switch to walking the sw interface list instead, and search for
the sup_sw_if_index that has the desired outer. Result:
BondEthernet0.{1234,1235,1236,1237} can be created and are functional.
Update the pair_config() parser to follow suit.
When the configuration 'lcp-auto-subint' is set, and the interface at
hand is a subinterface, in lcp_itf_interface_add_del():
- if it's a deletion and we're a sub-int, and we have a LIP: delete it.
- if it's a creation and we're a sub-int, and our parent has a LIP, create one.
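The two rules above boil down to a small decision function. This sketch
uses hypothetical sketch_* names and plain integer flags in place of the
real lookups done in lcp_itf_interface_add_del().

```c
typedef enum
{
  SKETCH_LIP_NONE,   /* nothing to do */
  SKETCH_LIP_CREATE, /* create a LIP for the new sub-int */
  SKETCH_LIP_DELETE, /* delete the LIP of the removed sub-int */
} sketch_lip_action_t;

/*
 * Decide what to do when an interface is added or deleted.
 *   auto_subint    - the 'lcp-auto-subint' setting
 *   is_add         - 1 on creation, 0 on deletion
 *   is_sub         - the interface is a sub-interface
 *   has_lip        - the interface itself already has a LIP
 *   parent_has_lip - the parent interface has a LIP
 */
sketch_lip_action_t
sketch_auto_subint_action (int auto_subint, int is_add, int is_sub,
			   int has_lip, int parent_has_lip)
{
  if (!auto_subint || !is_sub)
    return SKETCH_LIP_NONE;
  if (!is_add)
    return has_lip ? SKETCH_LIP_DELETE : SKETCH_LIP_NONE;
  return parent_has_lip ? SKETCH_LIP_CREATE : SKETCH_LIP_NONE;
}
```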
Fix a few logging consistency issues (pair_del), and in
pair_delete_by_index() ensure that the right namespace is selected.
Due to this quirk with lip->lip_namespace not wanting to be a vec_dup()
string, rewrite them all to be strdup/free instead.
This first preparation moves lcp_itf_phy_add() to lcpng_if_sync.c
and renames it lcp_itf_interface_add_del().
It does all the pre-flight checks to validate that a new device, given
by sw_if_index, can have a LIP created:
- must be a sub-int
- must have a sw_sup_if_index, which itself has a LIP
However, I realize that I cannot create an interface from within an
interface add callback, so I'll have to schedule the child LIP to be
created by a process, after the callback returns.
I'll do that in the next commit.
I've made a few cosmetic adjustments:
- introduce debug, info, notice, warn and err loggers
- remove redundant logging, and set correct (conservative) log levels
- turn the sync-state error into a warning
And a little debt paydown:
- refactor sync-state into its own function, to be called instead of
all the spot fixes elsewhere. It's going to be the case that
sync-state is "the reconciliation function".
- Fix a bug in lip->lip_namespace copy: vec_dup() there doesn't seem
to work for me, use strdup() instead and make a mental note to
revisit.
The plugin now works with 'lcpng default netns dataplane' in its
startup.conf; and with 'lcp default netns dataplane' as its first
command. A few of these fixes should go upstream, too, which I'll
do next.
I have been very careless in using the correct network namespace when
manipulating LCP host devices. Around any/every netlink write operation,
we must first clib_setns() into the correct namespace. So, wrap every
call of vnet_netlink_*() in all places.
For consistency, use the convention 'curr_ns_fd' (for the one we are
coming from) and 'vif_ns_fd' (to signal the one that the netlink VIF
is in).
Be careful as well to enter and exit everywhere without losing file
descriptors.
There are three ways in which IP addresses will want to be copied
from VPP into the companion Linux devices:
1) set interface ip address ... adds an IPv4 or IPv6 address
- this is handled by lcp_itf_ip[46]_add_del_interface_addr() which
is a callback installed in lcp_itf_pair_init()
2) set interface ip address del ... removes them
- also handled by lcp_itf_ip[46]_add_del_interface_addr() but
curiously there is no upstream vnet_netlink_del_ip[46]_addr() so
I wrote them inline here - I will try to get them upstreamed, as
they appear to be obvious companions in vnet/device/netlink.h
3) Upon LIP creation, it could be that there are L3 addresses already
on the VPP interface. If so, set them with lcp_itf_set_interface_addr()
This means that now, at any time a new LIP is created, its state from
VPP is fully copied over (MTU, Link state, IPv4/IPv6 addresses)!
At runtime, new addresses can be set/removed as well.