---
|
|
date: "2026-02-21T11:35:14Z"
|
|
title: VPP SRv6 L2VPN
|
|
---
|
|
|
|
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
|
|
|
|
# About this series
|
|
|
|
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
|
|
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
|
|
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
|
|
are shared between the two.
|
|
|
|
Segment Routing is a lesser known technique that allows network operators to determine a path
|
|
through their network by encoding the path inside headers in the packet itself, rather than relying
|
|
on the IGP to determine the path. Originally created to help traffic engineering of MPLS packets,
|
|
the concepts were carried forward for IPv6 as well.
|
|
|
|
In this article I take SRv6 out for a spin, implement some missing features in VPP, and stumble
across, and manage to fix, a nasty bug in its implementation.
|
|
|
|
## Introduction
|
|
|
|
SRv6 - Segment Routing for IPv6 - is defined in a number of RFCs.
|
|
|
|
1. [[RFC 8402](https://datatracker.ietf.org/doc/html/rfc8402)]: Segment Routing Architecture. This
|
|
document describes the fundamentals. It defines the general concepts of Segment Routing (nodes,
|
|
segments, and steering) for both MPLS and IPv6.
|
|
1. [[RFC 8754](https://datatracker.ietf.org/doc/html/rfc8754)]: IPv6 Segment Routing Header (SRH).
|
|
This RFC defines the specific IPv6 Extension Header used for SRv6. It explains how segments are listed
|
|
and how the Segments Left field works.
|
|
1. [[RFC 8986](https://datatracker.ietf.org/doc/html/rfc8986)]: SRv6 Network Programming.
|
|
This one describes the so-called "behaviors" associated with a Segment ID (SID). It defines functions
|
|
like End (Endpoint), End.X (Layer-3 cross-connect), and End.DT4/6 (VRF decapsulation).
|
|
|
|
While reading these RFCs, I learn that I can configure an SRv6 path through the network that picks
|
|
up an ethernet packet on the ingress, and decapsulates and cross connects that ethernet packet to an
|
|
interface on the egress: an L2VPN using Ethernet-over-IPv6. That sounds dope to me!
|
|
|
|
### SRv6 in VPP - Segment Routing Header
|
|
|
|
For the dataplane, there are two parts of note. Firstly, when an IPv6 packet arrives with an IPv6
|
|
extension header, the so-called _Segment Routing Header_ or SRH, any router supporting SRv6 needs to
|
|
inspect it. The presence of an SRH changes the forwarding logic from a simple "look at the
|
|
destination, do a FIB lookup for next hop, and send the packet on its merry way" to a more
|
|
customized "process the instruction and update the IPv6 headers" kind of thing.
|
|
|
|
In IPv6, an (almost) arbitrary number of headers can be chained from the base IPv6 packet header to
the ultimate layer4 protocol header like ICMP, TCP or UDP. In IPv4, this is not the case: there is
only the L3 header (IPv4) and the L4 header (TCP/UDP/ICMP etc). These intermediate headers are
called Extension Headers. One of them is the Routing header (next-header 43), and the SRH is the
Routing header with Routing Type 4.
|
|
|
|
The fields in this header are:
|
|
|
|
* ***Next Header***: Identifies the type of header following the SRH. It can be another routing
|
|
extension header or it might be the Layer4 protocol header like TCP, UDP or ICMP.
|
|
* ***Flags***: IANA loves reserving optionality for the future. The authors of SRv6 added an 8-bit
|
|
flags field, but none of them have been assigned yet.
|
|
* ***Tag***: Moar optionality! This 16-bit tag is not defined in the RFC, simply stating that _The
|
|
allocation and use of tag is outside the scope of this document_. OK then!
|
|
* ***Segments Left (SL)***: A counter indicating how many intermediate nodes still need to be
|
|
visited.
|
|
* ***Last Entry***: The index (starting from 0) of the last element in the Segment List.
|
|
* ***Segment List***: This is an array of 128-bit IPv6 addresses, listed in reverse order of the
|
|
path. The first segment to be visited is at the highest index.
|
|
* (optional) ***TLVs***: These Type-Length-Value objects can encode other information, like HMAC
|
|
signatures, operational and performance monitoring data, and so on.
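
To make the field layout concrete, here is a minimal sketch in Python that packs and parses an SRH without TLVs. It assumes the fixed 8-byte layout from RFC 8754, which also carries two fields not listed above (_Hdr Ext Len_ and _Routing Type_). The function names `build_srh()` and `parse_srh()` are made up for this illustration and are not part of VPP.

```python
import ipaddress
import struct

SRH_ROUTING_TYPE = 4  # Routing Type that identifies an SRH (RFC 8754)


def build_srh(next_header, segments_left, segment_list, flags=0, tag=0):
    """Pack an SRH without TLVs. segment_list is already in wire (reverse) order."""
    hdr_ext_len = 2 * len(segment_list)  # 8-octet units, excluding the first 8 octets
    last_entry = len(segment_list) - 1   # zero-based index of the last element
    fixed = struct.pack("!BBBBBBH", next_header, hdr_ext_len, SRH_ROUTING_TYPE,
                        segments_left, last_entry, flags, tag)
    sids = b"".join(ipaddress.IPv6Address(s).packed for s in segment_list)
    return fixed + sids


def parse_srh(data):
    """Unpack the fixed part and the Segment List of an SRH (TLVs are ignored)."""
    nh, hlen, rtype, sl, last, flags, tag = struct.unpack("!BBBBBBH", data[:8])
    segments = [str(ipaddress.IPv6Address(data[8 + 16 * i:24 + 16 * i]))
                for i in range(last + 1)]
    return {"next_header": nh, "hdr_ext_len": hlen, "routing_type": rtype,
            "segments_left": sl, "last_entry": last, "flags": flags,
            "tag": tag, "segments": segments}


# Three segments in reverse order: the first one to visit sits at the highest index.
wire_order = ["2001:db8::3", "2001:db8::2", "2001:db8::1"]
srh = build_srh(next_header=143, segments_left=2, segment_list=wire_order)
parsed = parse_srh(srh)
print(len(srh), parsed["hdr_ext_len"])               # 56 6
print(parsed["segments"][parsed["segments_left"]])   # 2001:db8::1
```

The active segment is always `segments[segments_left]`, which is why the list is stored back to front.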
|
|
|
|
### SRv6: Anatomy
|
|
|
|
{{< image width="14em" float="right" src="/assets/vpp-srv6/magnets.jpg" alt="Insane Clown" >}}
|
|
|
|
Much like magnets, you might be wondering _SRv6 Routers: How do they work?_. There are really only
three relevant things: SR Policies (they determine how packets are steered into the SRv6 routing
domain), SRv6 Source nodes (they handle the ingress part), and SRv6 Segment Endpoint Nodes (they
handle both the intermediate routers that participate in SRv6, and also the egress part where the
packet leaves the SRv6 routing domain).
|
|
|
|
#### SRv6: Policies
|
|
|
|
A _Segment Routing Policy_ is the same for MPLS and SRv6. They are represented by either a stack of
|
|
MPLS labels, or by a stack of IPv6 addresses, and they are uniquely identified by either an MPLS
|
|
label or an IPv6 address as well. The identifier is called a _Binding Segment ID_ or BSID, and the
|
|
elements of the list are called _Segment IDs_ or SIDs.
|
|
|
|
```
|
|
BSID := SID [, SID] [, SID] ...
|
|
8298::1 := 2001:db8::1 , 2001:db8::2 , 2001:db8::3
|
|
```
|
|
|
|
These policies are written to the FIB in the router. I can now do a lookup for `8298::1`, and find
|
|
that it points to this _SR Policy_ object with the list of three IPv6 addresses. In the case of
|
|
MPLS, the _BSID_ will be in the MPLS FIB and point at a list of three MPLS labels, but I'm going to
|
|
stop talking about MPLS now :)
|
|
|
|
#### SRv6: Source Node
|
|
|
|
An _SR Source Node_ originates an IPv6 packet with a Segment in the destination address, and it
|
|
optionally adds an SRH with a list of instructions on the network. The _SR Source Node_ is the
|
|
ingress point and enables SRv6 processing in the network, which is called _steering_. Instead of
|
|
setting the destination address to the final destination, the source node will set it to the first
|
|
Segment, which is the first router that needs to be visited.
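
What the source node does with a policy can be sketched in a few lines of Python. This is an illustration under assumptions, not VPP's code: the helper name `steer()` is hypothetical, and the single-segment shortcut relies on RFC 8754 permitting the SRH to be omitted when the policy has only one segment.

```python
def steer(policy_segments):
    """Return (destination, srh_segment_list, segments_left) for an SR Policy.

    policy_segments is in visiting order. The SRH stores it reversed, and
    Segments Left starts out pointing at the first segment to visit.
    """
    if not policy_segments:
        raise ValueError("an SR Policy needs at least one segment")
    destination = policy_segments[0]
    if len(policy_segments) == 1:
        # One segment: the destination address says it all, no SRH needed.
        return destination, None, None
    wire_order = list(reversed(policy_segments))
    return destination, wire_order, len(policy_segments) - 1


# The example policy 8298::1 from the previous section.
dst, srh_list, segments_left = steer(["2001:db8::1", "2001:db8::2", "2001:db8::3"])
print(dst)            # 2001:db8::1
print(srh_list)       # ['2001:db8::3', '2001:db8::2', '2001:db8::1']
print(segments_left)  # 2
```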
|
|
|
|
#### SRv6: Transit Node
|
|
|
|
Spoiler alert! This node type doesn't have anything to do with SRv6. SRv6 packets really do look
|
|
like normal packets, the IPv6 source address is the Source Node, and the destination address is the
|
|
_Transit Node_, which can just forward it like any other packet using their routing table. Notably,
|
|
those routers are not actively participating in SRv6 and they don't need to know anything about it.
|
|
|
|
#### SRv6: Segment Endpoint Node
|
|
|
|
The _Segment Endpoint Node_ is a router that is SRv6 capable. A packet may arrive with a locally
|
|
configured address in the IPv6 destination. The magic happens here - one of two things:
|
|
|
|
1. The _Segment Routing Header_ is inspected. If _Segments Left_ is 0, then the next header
(typically UDP, TCP, ICMP) is processed. Otherwise, the _Segments Left_ field is decremented, the
segment it now points at is read from the _Segment List_, and the IPv6 destination address is
overwritten with it. In this case the packet is routed normally through a bunch of potential transit
routers, who are blissfully ignorant of what is happening, and onto a next _Segment Endpoint_
router.
|
|
|
|
1. The IPv6 destination address might have an entry in the forwarding table which points at a
|
|
specific local meaning, called a _Local Segment ID_ or _LocalSID_. The LocalSID tells this router
|
|
what to do, for example decapsulate the packet and do a next-hop lookup in a specific routing table,
|
|
useful for L3VPNs; or perhaps an instruction to decapsulate the packet and cross connect it to a
|
|
local interface, useful for L2VPN. The key insight here is, that the local FIB entry can carry any
|
|
type of further instruction.
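
The first case, the plain _End_ behavior, is small enough to sketch in Python. This follows the processing order of RFC 8986 (decrement first, then copy the segment that _Segments Left_ now points at). The function name `end_behavior()` and the returned action strings are hypothetical, chosen only for this illustration.

```python
def end_behavior(destination, segments_left, segment_list):
    """Apply the End behavior to a packet that arrived at a local SID.

    Returns (new_destination, new_segments_left, action).
    """
    if segments_left == 0:
        # Nothing left to visit: hand the packet to the next header.
        return destination, 0, "process-next-header"
    segments_left -= 1
    return segment_list[segments_left], segments_left, "forward"


# Segment List in wire order (reversed), as the source node wrote it.
segment_list = ["2001:db8::3", "2001:db8::2", "2001:db8::1"]

hop1 = end_behavior("2001:db8::1", 2, segment_list)
print(hop1)  # ('2001:db8::2', 1, 'forward')
hop2 = end_behavior(hop1[0], hop1[1], segment_list)
print(hop2)  # ('2001:db8::3', 0, 'forward')
hop3 = end_behavior(hop2[0], hop2[1], segment_list)
print(hop3)  # ('2001:db8::3', 0, 'process-next-header')
```

Transit routers between two endpoints never run this code. They only see an ordinary IPv6 destination address.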
|
|
|
|
## VPP: IPng LAB
|
|
|
|
At this point I'm pretty sure I've bored you to tears with all the RFC stuff and theory. I do think
that segment routing (both the MPLS and the SRv6 variant) is sufficiently complex that taking a
read of the main RFCs at least once is useful. But for me, the fun part is seeing it work in
|
|
practice. So I boot the [[IPng Lab]({{< ref 2022-10-14-lab-1 >}})], which looks a bit like this.
|
|
|
|
{{< image width="100%" src="/assets/lab/LAB v2.svg" alt="Logical" >}}
|
|
|
|
In this environment, each of the VPP routers is running Bird2 with OSPF and OSPFv3. They are
|
|
connected in a string, and each VPP router has an interface (`Gi10/0/2`) connected to a debian host
|
|
called `host0-0` (at the bottom), as well as an interface (`Gi10/0/3`) connected to a host called
|
|
`host0-1` (at the top). One really cool feature of the LAB is that all links are on an OpenVSwitch
|
|
which is mirroring all traffic to a tap host called `tap0-0`, so I can see traffic clearly:
|
|
|
|
```
|
|
root@vpp0-0:/etc/bird# ping -n 2001:678:d78:200::3 -c1
|
|
PING 2001:678:d78:200::3 (2001:678:d78:200::3) 56 data bytes
|
|
64 bytes from 2001:678:d78:200::3: icmp_seq=1 ttl=62 time=3.24 ms
|
|
|
|
--- 2001:678:d78:200::3 ping statistics ---
|
|
1 packets transmitted, 1 received, 0% packet loss, time 0ms
|
|
rtt min/avg/max/mdev = 3.240/3.240/3.240/0.000 ms
|
|
|
|
root@tap0-0:~# tcpdump -eni enp16s0f0
|
|
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
|
|
listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
|
|
10:39:23.558942 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype 802.1Q (0x8100), length 122: vlan 20, p 0,
|
|
ethertype IPv6 (0x86dd), 2001:678:d78:201::1:0 > 2001:678:d78:200::3: ICMP6, echo request, id 12, seq 1, length 64
|
|
10:39:23.558942 52:54:00:f0:11:11 > 52:54:00:f0:11:20, ethertype 802.1Q (0x8100), length 122: vlan 21, p 0,
|
|
ethertype IPv6 (0x86dd), 2001:678:d78:201::1:0 > 2001:678:d78:200::3: ICMP6, echo request, id 12, seq 1, length 64
|
|
10:39:23.559993 52:54:00:f0:11:21 > 52:54:00:f0:11:30, ethertype 802.1Q (0x8100), length 122: vlan 22, p 0,
|
|
ethertype IPv6 (0x86dd), 2001:678:d78:201::1:0 > 2001:678:d78:200::3: ICMP6, echo request, id 12, seq 1, length 64
|
|
|
|
10:39:23.560179 52:54:00:f0:11:30 > 52:54:00:f0:11:21, ethertype 802.1Q (0x8100), length 122: vlan 22, p 0,
|
|
ethertype IPv6 (0x86dd), 2001:678:d78:200::3 > 2001:678:d78:201::1:0: ICMP6, echo reply, id 12, seq 1, length 64
|
|
10:39:23.561070 52:54:00:f0:11:20 > 52:54:00:f0:11:11, ethertype 802.1Q (0x8100), length 122: vlan 21, p 0,
|
|
ethertype IPv6 (0x86dd), 2001:678:d78:200::3 > 2001:678:d78:201::1:0: ICMP6, echo reply, id 12, seq 1, length 64
|
|
10:39:23.561248 52:54:00:f0:11:10 > 52:54:00:f0:11:01, ethertype 802.1Q (0x8100), length 122: vlan 20, p 0,
|
|
ethertype IPv6 (0x86dd), 2001:678:d78:200::3 > 2001:678:d78:201::1:0: ICMP6, echo reply, id 12, seq 1, length 64
|
|
```
|
|
|
|
Here you can see the packet path from `vpp0-0` sending one ICMPv6 echo request to `vpp0-3`, which
|
|
responded with one ICMPv6 echo reply. I can see the packet on vlan 20, 21, 22 on the way out, and
|
|
back again on vlan 22, 21 and 20.
|
|
|
|
### VPP: SRv6 Example
|
|
|
|
Alright, here I go! With the following short snippet, I can sum up all of the theory above in a
|
|
practical first example:
|
|
|
|
```
|
|
vpp0-0# set sr encaps source addr 2001:678:d78:200::
|
|
vpp0-0# sr policy add bsid 8298::2:1 next 2001:678:d78:20F::3:1 encap
|
|
vpp0-0# sr steer l2 GigabitEthernet10/0/2 via bsid 8298::2:1
|
|
vpp0-0# sr localsid address 2001:678:d78:20f::0:1 behavior end.dx2 GigabitEthernet10/0/2
|
|
vpp0-0# set int state GigabitEthernet10/0/2 up
|
|
```
|
|
|
|
Looking at what I typed on `vpp0-0`, first I tell the system that its encapsulation source address
|
|
is its IPv6 loopback address. Then I add a _Binding SID_ with one _Segment ID_ and I instruct this
|
|
policy to encapsulate the packet. Then, I add an L2 steering from interface `Gi10/0/2` via this
|
|
_BSID_. At this point, `vpp0-0` knows that if an ethernet frame comes in on that interface, it needs
|
|
to encapsulate it in SRv6 from `2001:678:d78:200::` and send it to `2001:678:d78:20F::3:1`. Finally,
|
|
I tell the system that if an IPv6 packet arrives with destination address `2001:678:d78:20f::0:1`,
|
|
that it needs to decapsulate it and send the resulting L2 datagram out on Gi10/0/2.
|
|
|
|
There is one last thing I have to do, and that's somehow attract this `2001:678:d78:20F::0:0/112` prefix
|
|
to `vpp0-0` and `2001:678:d78:20F::3:0/112` prefix to `vpp0-3`. I can do this by adding the prefix
|
|
to `loop0`, like so:
|
|
|
|
```
|
|
vpp0-0# create loopback interface instance 0
|
|
vpp0-0# set interface state loop0 up
|
|
vpp0-0# set interface ip address loop0 192.168.10.0/32
|
|
vpp0-0# set interface ip address loop0 2001:678:d78:200::0/128
|
|
vpp0-0# set interface ip address loop0 2001:678:d78:20F::0:0/112
|
|
```
|
|
|
|
This will be picked up in OSPFv3, and all routers will install a FIB entry pointing at `vpp0-0` for
|
|
the /112. Did it work?
|
|
|
|
```
|
|
root@host0-0:~# ping6 ff02::1%enp16s0f0
|
|
PING ff02::1%enp16s0f0 (ff02::1%enp16s0f0) 56 data bytes
|
|
64 bytes from fe80::5054:ff:fef0:1000%enp16s0f0: icmp_seq=1 ttl=64 time=0.156 ms
|
|
64 bytes from fe80::5054:ff:fef0:1013%enp16s0f0: icmp_seq=1 ttl=64 time=4.03 ms
|
|
^C
|
|
--- ff02::1%enp16s0f0 ping statistics ---
|
|
1 packets transmitted, 1 received, +1 duplicates, 0% packet loss, time 0ms
|
|
rtt min/avg/max/mdev = 0.156/2.092/4.029/1.936 ms
|
|
```
|
|
|
|
{{< image width="12em" float="right" src="/assets/vpp-srv6/hannibal-plan.png" alt="Hannibal Smith loves it" >}}
|
|
|
|
Yes, it worked! I love it when a plan comes together! This IPv6 address that I pinged, `ff02::1`, is
the link-local _all-nodes_ multicast address, and I can see one reply from `fe80::5054:ff:fef0:1000` which is host0-0's own
link-local address, and a second reply from `fe80::5054:ff:fef0:1013` which is host0-1's address.
|
|
I have created a point to point L2VPN or _Virtual Leased Line_ between `vpp0-0:Gi10/0/2` and
|
|
`vpp0-3:Gi10/0/3` and any ethernet traffic between these two ports is passed through the network as
|
|
IPv6 packets including segment routing. Nice going!
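
As a sanity check on the addressing plan, a small Python sketch with the standard `ipaddress` module shows which router attracts which SID. The prefix table only contains the two /112 prefixes named above, and the helper `owner_of()` is made up for this illustration.

```python
import ipaddress

# The /112 SID prefixes that the routers announce from loop0.
SID_PREFIXES = {
    "vpp0-0": ipaddress.ip_network("2001:678:d78:20f::0:0/112"),
    "vpp0-3": ipaddress.ip_network("2001:678:d78:20f::3:0/112"),
}


def owner_of(sid):
    """Return the router whose SID prefix covers this address, or None."""
    address = ipaddress.ip_address(sid)
    for router, prefix in SID_PREFIXES.items():
        if address in prefix:
            return router
    return None


print(owner_of("2001:678:d78:20F::3:1"))  # vpp0-3
print(owner_of("2001:678:d78:20f::0:1"))  # vpp0-0
print(owner_of("2001:678:d78:200::"))     # None
```

The last lookup returns `None` because the encapsulation source is the regular loopback address, which sits outside the SID prefixes.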
|
|
|
|
### SRv6 on the Wire
|
|
|
|
I learn something curious. I configure an IPv4 address on both hosts:
|
|
|
|
```
|
|
root@host0-0:~# ip addr add 192.0.2.0/31 dev enp16s0f0
|
|
|
|
root@host0-1:~# ip addr add 192.0.2.1/31 dev enp16s0f3
|
|
root@host0-1:~# ping 192.0.2.0
|
|
PING 192.0.2.0 (192.0.2.0) 56(84) bytes of data.
|
|
64 bytes from 192.0.2.0: icmp_seq=1 ttl=64 time=5.27 ms
|
|
^C
|
|
--- 192.0.2.0 ping statistics ---
|
|
1 packets transmitted, 1 received, 0% packet loss, time 0ms
|
|
rtt min/avg/max/mdev = 5.274/5.274/5.274/0.000 ms
|
|
```
|
|
|
|
And then I take a look at this IPv4 ICMP packet on the wire:
|
|
|
|
```
|
|
11:03:22.118770 52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype 802.1Q-QinQ (0x88a8), length 102: vlan 30, p 0,
|
|
ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 35014, offset 0, flags [DF], proto ICMP (1), length 84)
|
|
192.0.2.0 > 192.0.2.1: ICMP echo request, id 50, seq 1, length 64
|
|
|
|
11:03:22.119078 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype 802.1Q (0x8100), length 156: vlan 20, p 0,
|
|
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 63, next-header Ethernet (143) payload length: 98) 2001:678:d78:200:: > 2001:678:d78:20f::3:1:
|
|
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 35014, offset 0, flags [DF], proto ICMP (1), length 84)
|
|
192.0.2.0 > 192.0.2.1: ICMP echo request, id 50, seq 1, length 64
|
|
```
|
|
|
|
The first packet is coming in on vlan 30 (`host0-0:enp16s0f0` to `vpp0-0:Gi10/0/2`). I then see it
|
|
go out on vlan 20 (from `vpp0-0` to `vpp0-1`). I see it is an IPv6 packet from `2001:678:d78:200::`
|
|
(the encapsulation address I configured), and to `2001:678:d78:20f::3:1` (the _BSID_ steers towards
this _Segment ID_), and then I see the Ethernet inner payload with the ICMP echo packet. But
|
|
where's the _Segment Routing Header_??
|
|
|
|
It is here that I learn why the RFC says that the SRH is optional. This packet carries everything it
needs in the destination address, `2001:678:d78:20f::3:1`, which is routed towards the loopback
interface of `vpp0-3`. There, it is looked up in the FIB and the _Local Segment ID_ or LocalSID
determines that packets to this address must be decapsulated and forwarded out on `vpp0-3:Gi10/0/3`.
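
The lengths in the capture can be reproduced with simple arithmetic. The sketch below is a Python illustration that assumes one 802.1Q tag on the inter-router link and, as observed in the capture, no SRH for a single-segment policy. The constant and function names are made up.

```python
ETH_HEADER = 14   # destination MAC, source MAC, ethertype
DOT1Q_TAG = 4     # one 802.1Q tag on the lab's inter-router links
IPV6_HEADER = 40  # fixed IPv6 header
SRH_FIXED = 8     # fixed part of the SRH
SID_LEN = 16      # one 128-bit segment


def srh_length(n_segments):
    """SRH size in bytes, assuming no SRH at all for a single-segment policy."""
    if n_segments <= 1:
        return 0
    return SRH_FIXED + SID_LEN * n_segments


def ipv6_payload_length(inner_frame_len, n_segments):
    """Everything after the outer IPv6 header: optional SRH plus the inner frame."""
    return srh_length(n_segments) + inner_frame_len


def wire_length(inner_frame_len, n_segments):
    """Length of the outer frame as tcpdump reports it on a tagged link."""
    return (ETH_HEADER + DOT1Q_TAG + IPV6_HEADER
            + ipv6_payload_length(inner_frame_len, n_segments))


# The 98-byte inner Ethernet frame carrying the ICMP echo request.
print(ipv6_payload_length(98, 1), wire_length(98, 1))  # 98 156
print(ipv6_payload_length(98, 3), wire_length(98, 3))  # 154 212
```

So the L2VPN costs 58 bytes of overhead per frame on a tagged link without SRH, and another 8 + 16 bytes per segment once an SRH is needed, which is worth keeping in mind for MTU planning.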
|
|
|
|
### VPP: Let's ZigZag
|
|
|
|
So how do I get these elusive SRH headers? Easy: make more than one segment in the BSID, because
|
|
then, the _SR Source Node_ will have to encode it in the _Segment List_, for which it needs to
|
|
construct an SRH.
|
|
|
|
I want to tell `vpp0-0` to do some scenic routing. I want it to send the packet first to `vpp0-2`,
|
|
then `vpp0-1` and then `vpp0-3`. I struggle a little bit, because how should I construct the
|
|
_Segment List_ ? If I put `vpp0-2`'s loopback address in there, the packet will be seen as local,
|
|
and sent for local processing, in VPP's `ip6-receive` node. I don't want that to happen, instead I
|
|
want VPP to inspect the SRH in this case. After reading a little bit in
|
|
`src/vnet/srv6/sr_localsid.c`, I realize the trick is simple (once you know it, of course): I need
|
|
to tell all routers to handle a specific localsid as _End_ behavior, which will make the
|
|
intermediate routers run `end_srh_processing()` which processes the SRH and does the destination
|
|
swap.
|
|
|
|
```
|
|
vpp0-3# sr localsid address 2001:678:d78:20F::3:ffff behavior end
|
|
vpp0-2# sr localsid address 2001:678:d78:20F::2:ffff behavior end
|
|
vpp0-1# sr localsid address 2001:678:d78:20F::1:ffff behavior end
|
|
vpp0-0# sr localsid address 2001:678:d78:20F::0:ffff behavior end
|
|
vpp0-0# sr policy add bsid 8298::2:2 next 2001:678:d78:20F::2:ffff next 2001:678:d78:20F::1:ffff
|
|
next 2001:678:d78:20f::3:1 encap
|
|
```
|
|
|
|
Now each router knows that if an IPv6 packet is destined to its `:ffff` address, it needs to
"End" the segment by inspecting the SRH. And the _SR Policy_ for `vpp0-0` is to send it first to
`::2:ffff`, which is `vpp0-2`, which now inspects the SRH and advances the _Segment List_.
|
|
|
|
|
|
The proof is in the tcpdump pudding, and it makes me smile to see the icmp-echo packet bounce back
and forth on its scenic route:
|
|
|
|
```
|
|
root@tap0-0:~# tcpdump -veni enp16s0f0 src 2001:678:d78:200::
|
|
tcpdump: listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
|
|
|
|
12:15:39.442587 52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype 802.1Q-QinQ (0x88a8), length 102: vlan 30, p 0,
|
|
ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 5534, offset 0, flags [DF], proto ICMP (1), length 84)
|
|
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 561, length 64
|
|
|
|
12:15:39.501353 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype 802.1Q (0x8100), length 212: vlan 20, p 0,
|
|
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 63, next-header Routing (43) payload length: 154)
|
|
2001:678:d78:200:: > 2001:678:d78:20f::2:ffff: RT6 (len=6, type=4, segleft=2, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff)
|
|
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84)
|
|
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 6, length 64
|
|
|
|
12:15:39.501902 52:54:00:f0:11:11 > 52:54:00:f0:11:20, ethertype 802.1Q (0x8100), length 212: vlan 21, p 0,
|
|
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 62, next-header Routing (43) payload length: 154)
|
|
2001:678:d78:200:: > 2001:678:d78:20f::2:ffff: RT6 (len=6, type=4, segleft=2, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff)
|
|
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84)
|
|
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 6, length 64
|
|
|
|
12:15:39.502658 52:54:00:f0:11:20 > 52:54:00:f0:11:11, ethertype 802.1Q (0x8100), length 212: vlan 21, p 0,
|
|
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 61, next-header Routing (43) payload length: 154)
|
|
2001:678:d78:200:: > 2001:678:d78:20f::1:ffff: RT6 (len=6, type=4, segleft=1, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff)
|
|
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84)
|
|
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 6, length 64
|
|
|
|
12:15:39.502990 52:54:00:f0:11:11 > 52:54:00:f0:11:20, ethertype 802.1Q (0x8100), length 212: vlan 21, p 0,
|
|
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 60, next-header Routing (43) payload length: 154)
|
|
2001:678:d78:200:: > 2001:678:d78:20f::3:1: RT6 (len=6, type=4, segleft=0, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff)
|
|
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84)
|
|
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 6, length 64
|
|
|
|
12:15:39.503813 52:54:00:f0:11:21 > 52:54:00:f0:11:30, ethertype 802.1Q (0x8100), length 212: vlan 22, p 0,
|
|
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 59, next-header Routing (43) payload length: 154)
|
|
2001:678:d78:200:: > 2001:678:d78:20f::3:1: RT6 (len=6, type=4, segleft=0, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff)
|
|
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84)
|
|
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 6, length 64
|
|
|
|
12:15:39.525605 52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype 802.1Q-QinQ (0x88a8), length 102: vlan 43, p 0,
|
|
ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 5534, offset 0, flags [DF], proto ICMP (1), length 84)
|
|
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 561, length 64
|
|
```
|
|
|
|
The echo-request packet can be observed seven times:
|
|
1. coming in on vlan 30 (between `host0-0` and `vpp0-0:Gi10/0/2`), here it is simply an IPv4
|
|
packet.
|
|
1. on vlan 20, encapsulated in an IPv6 packet, this time _including_ SRH header showing where it
|
|
is expected to go.
|
|
1. on vlan 21, because the first segment wants the packet to go to `vpp0-2`, and `vpp0-1` is
acting as a transit router (just normally using IPv6 FIB lookup to pass it along).
|
|
1. on vlan 21 again, because when `vpp0-2` got it, it popped the SRH list, and sent it to the
|
|
second segment, which is onwards to `vpp0-1`.
|
|
1. on vlan 21 yet again, because when `vpp0-1` got it, it popped the SRH list, and sent it to the
|
|
third and final segment, which is onwards to `vpp0-3`.
|
|
1. on vlan 22, because `vpp0-2` is a transit router, using its FIB to pass it along to `vpp0-3`,
|
|
which decapsulates it with End.DX2 and sends it as an L2 packet on Gi10/0/3.
|
|
1. coming out of vlan 43 (between `vpp0-3:Gi10/0/3` and `host0-1`), where it is simply an IPv4
|
|
packet again.
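
The seven sightings follow directly from the lab's string topology. Here is a minimal Python sketch that replays the scenic route, assuming the routers sit in a line with VLANs 20, 21 and 22 between neighbours and VLANs 30 and 43 towards the hosts, as seen in the captures. The function names are hypothetical.

```python
ROUTERS = ["vpp0-0", "vpp0-1", "vpp0-2", "vpp0-3"]
LINK_VLANS = [20, 21, 22]  # LINK_VLANS[i] connects ROUTERS[i] and ROUTERS[i + 1]


def vlans_between(src, dst):
    """VLANs crossed when routing from src to dst along the string."""
    a, b = ROUTERS.index(src), ROUTERS.index(dst)
    if a <= b:
        return LINK_VLANS[a:b]
    return list(reversed(LINK_VLANS[b:a]))


def scenic_route(ingress, waypoints):
    """VLANs on which the encapsulated packet shows up, in order."""
    seen, here = [], ingress
    for waypoint in waypoints:
        seen += vlans_between(here, waypoint)
        here = waypoint
    return seen


# vpp0-0 steers via vpp0-2, then vpp0-1, and finally vpp0-3.
core = scenic_route("vpp0-0", ["vpp0-2", "vpp0-1", "vpp0-3"])
print(core)                 # [20, 21, 21, 21, 22]
print([30] + core + [43])   # [30, 20, 21, 21, 21, 22, 43]
```

The link between `vpp0-1` and `vpp0-2` carries the packet three times, which is exactly the zigzag visible on vlan 21.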
|
|
|
|
VPP has a few relevant dataplane nodes:
|
|
1. ***sr-pl-rewrite-encaps-l2***: This node encapsulates ethernet at the ingress point by steering
|
|
packets into an _SR Policy_ named by its _Binding Segment ID_
|
|
1. ***sr-localsid***: This node implements End behavior, in this case sending to the next Segment
|
|
Router by looking up its _Local Segment ID_ in the FIB
|
|
1. ***sr-localsid-d***: This node decapsulates the ethernet, on an `End.DX2` behavior, by looking
|
|
up its _Local Segment ID_ in the FIB
|
|
|
|
## VPP: Adding SRv6 encap/decap on sub-interface
|
|
|
|
A few years ago, I thought maybe it'd be cool to use SRv6 for L2VPN at IPng. But I was quickly
|
|
disappointed because SRv6 encap and decap is only implemented on the `device-input` path which means
|
|
it will not work with sub-interfaces.
|
|
|
|
A few weeks ago, I worked on Gerrit [[44654](https://gerrit.fd.io/r/c/vpp/+/44654)], which
|
|
implements policers on sub-interfaces. I wrote about it in a [[policer article]({{< ref
|
|
2026-02-14-vpp-policers >}})], but since my brain's instruction cache is still warm with the code I
|
|
wrote to enable L2 features on input- and output, I thought I'd give it another go. If you're not
|
|
interested in the software engineering parts, you can stop reading now :-)
|
|
|
|
***0. Remove vlan_index everywhere***
|
|
|
|
The original author followed the RFC, where there is an `End.DX2V` behavior that allows
decapsulating to a VLAN tag on an interface, but they never implemented it and added a note to the
code to that effect. I can see why: DX2V is not idiomatic for VPP, but there's an alternative. It
would make more sense to decapsulate with `End.DX2` to a sub-interface. So I removed `vlan_index` from the
codebase in all places except the API functions, where I marked it as 'not implemented', which is
true at this point anyway.
|
|
|
|
***1. Add feature bitmap entries***
|
|
|
|
I added `L2INPUT_FEAT_SRV6` to `l2_input.h`. This allows me to turn on an SRv6 feature bit, and on
|
|
ingress, send L2 datagrams from `l2-input` node directly to `sr-pl-rewrite-encaps-l2` node,
|
|
regardless of the interface being a PHY like `Gi10/0/0` or a SUB like `Gi10/0/0.100`. It comes at a
small CPU cost though, because the old approach of moving from the `device-input` arc directly to
the encapsulation node skipped a bunch of L2 processing, like L2 ACL and VLAN Tag Rewriting (which
doesn't make sense on an untagged interface anyway). In return, I can apply SRv6 encapsulation to
any interface type.
|
|
|
|
***2. Precompute DX2 headerlen***
|
|
|
|
In the case of an `End.DX2` to a sub-interface, I need to add either 4 bytes (single tag) or 8 bytes
|
|
(QinQ or QinAD double tag) to the packet length. I know which at creation time, because I can look
|
|
that up from the to-be-DX2'd interface. I'll store this in the localsid structure as `ls->l2_len`
|
|
(either 14, 18, or 22 bytes).
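
The three possible values are just the Ethernet header plus 4 bytes per VLAN tag. As a tiny Python sketch (the function name is made up, the real code stores the result in `ls->l2_len`):

```python
ETHERNET_HEADER_LEN = 14  # destination MAC, source MAC, ethertype
VLAN_TAG_LEN = 4          # one 802.1Q or 802.1ad tag


def l2_header_len(num_tags):
    """L2 header length for an untagged, single-tagged or double-tagged interface."""
    if num_tags not in (0, 1, 2):
        raise ValueError("expected 0, 1 or 2 VLAN tags")
    return ETHERNET_HEADER_LEN + VLAN_TAG_LEN * num_tags


print([l2_header_len(n) for n in (0, 1, 2)])  # [14, 18, 22]
```

Because the number of tags is a property of the target (sub-)interface, this can be computed once when the localsid is created rather than per packet.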
|
|
|
|
***3a. Connect to `l2-input` on ingress***
|
|
|
|
When enabling the `sr steer` with keyword `encap`, I need to change two things: first, I need to
|
|
allow `VNET_SW_INTERFACE_TYPE_SUB` in addition to the already present
|
|
`VNET_SW_INTERFACE_TYPE_HARDWARE`, and then if the steer policy is `SR_STEER_L2`, I remove the bits
|
|
which initialize the feature arc on `device-input`, and instead, call
|
|
`set_int_l2_mode()` in `MODE_L2_XC` (cross connect), but then I sneakily clear the feature bitmap
|
|
bit for `L2INPUT_FEAT_XCONNECT`, and instead set my new `L2INPUT_FEAT_SRV6` bit. This means that
|
|
from now on, any L2 frames will get sent to node `sr-pl-rewrite-encaps-l2` instead of `l2-output`
|
|
which is what the L2XC would've done. Finally, I initialize the L2 feature bitmap next-nodes for the
|
|
encapsulation node in function `sr_policy_rewrite_init()`.
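
The bitmap trick can be modelled in a few lines of Python. This is only a sketch of the idea: the bit positions are invented, `OTHER_FEATURE` is a placeholder for any unrelated L2 input feature, and the real `l2-input` node dispatches on its feature bitmap in a more general way than this two-way choice.

```python
# Hypothetical bit positions, the real values live in l2_input.h.
L2INPUT_FEAT_XCONNECT = 1 << 0
L2INPUT_FEAT_SRV6 = 1 << 1
OTHER_FEATURE = 1 << 2  # placeholder for any unrelated feature bit


def enable_srv6_steering(feature_bitmap):
    """Swap the cross-connect bit for the SRv6 bit, leaving other bits alone."""
    feature_bitmap &= ~L2INPUT_FEAT_XCONNECT
    feature_bitmap |= L2INPUT_FEAT_SRV6
    return feature_bitmap


def next_node(feature_bitmap):
    """Simplified dispatch: where does an L2 frame go after l2-input?"""
    if feature_bitmap & L2INPUT_FEAT_SRV6:
        return "sr-pl-rewrite-encaps-l2"
    if feature_bitmap & L2INPUT_FEAT_XCONNECT:
        return "l2-output"
    return "drop"


after_l2xc = L2INPUT_FEAT_XCONNECT | OTHER_FEATURE  # state right after MODE_L2_XC
print(next_node(after_l2xc))                        # l2-output
print(next_node(enable_srv6_steering(after_l2xc)))  # sr-pl-rewrite-encaps-l2
```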
|
|
|
|
***3b. Connect to `l2-output` on egress***
|
|
|
|
I call `l2output_create_output_node_mapping()` on the (sub)-interface, so that traffic into it will
|
|
go to `l2-output`, where I can inspect the feature bitmap to see if I need to send it to
|
|
decapsulation or not. I also need to update `sr_localsid_next` to remove `interface-output` and
|
|
replace it with `l2-output` so that egress traffic visits `l2-output`. In
|
|
`end_decaps_srh_processing()`, I need to set the `l2_len` on the buffer, and change the next node to
|
|
be `SR_LOCALSID_NEXT_L2_OUTPUT` instead of `SR_LOCALSID_NEXT_INTERFACE_OUTPUT`, so that
|
|
sub-interface processing can occur (eg, VLAN Tag Rewriting, ACLs, SPAN, and so on).
|
|
|
|
***4. Fix a bug in `sr_policy_rewrite_encaps_l2`***
|
|
|
|
I kind of thought I would be done, and it did work, but I had about 75% packet loss and iperf
|
|
performance was 20Mbps or so, while on the bench I usually expect 350+ Mbps. I scratched my head a
|
|
little bit, but then found a bug in the quad-loop processing of `sr_policy_rewrite_encaps_l2()`.
|
|
Maybe you can spot it too?
|
|
|
|
|
|
```
|
|
if (vec_len (sp0->segments_lists) == 1)
  vnet_buffer (b0)->ip.adj_index[VLIB_TX] = sp0->segments_lists[0];
else {
  vnet_buffer (b0)->ip.flow_hash = flow_label0;
  vnet_buffer (b0)->ip.adj_index[VLIB_TX] = sp0->segments_lists[(vnet_buffer (b0)->ip.flow_hash & (vec_len (sp0->segments_lists) - 1))];
}

if (vec_len (sp1->segments_lists) == 1)
  vnet_buffer (b1)->ip.adj_index[VLIB_TX] = sp1->segments_lists[1];
else {
  vnet_buffer (b1)->ip.flow_hash = flow_label1;
  vnet_buffer (b1)->ip.adj_index[VLIB_TX] = sp1->segments_lists[(vnet_buffer (b1)->ip.flow_hash & (vec_len (sp1->segments_lists) - 1))];
}
|
|
```
|
|
|
|
Once I found this, I became quite certain that nobody uses L2 encapsulation in VPP. Whenever four or
more packets are present in the vector, the quad-loop handles them, and if the _SR Policy_ has only
one segment list, then for the second through fourth packet (`b1`-`b3`) the index is incorrectly set
to the garbage entry `segments_lists[1]` rather than the first and only entry `segments_lists[0]`.
Yikes! But it explains perfectly why I had roughly 75% packet loss, lots of TCP retransmits, and
terrible throughput. I fix this bug and SRv6 encap starts to work flawlessly.
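
The 75% figure falls out of a small simulation. The Python sketch below is a simplified model, not VPP's code. It assumes the quad-loop runs whenever at least four packets remain, that a single loop handles the remainder, and that `b2` and `b3` carry the same wrong index as the `b1` branch shown above.

```python
def pick_index_buggy(slot, n_lists, flow_hash):
    """Segment list index as the buggy quad-loop picks it for slot b0..b3."""
    if n_lists == 1:
        return 0 if slot == 0 else 1  # b1..b3 read past the only entry
    return flow_hash & (n_lists - 1)


def pick_index_fixed(slot, n_lists, flow_hash):
    """Segment list index after the fix: a lone segment list is always index 0."""
    if n_lists == 1:
        return 0
    return flow_hash & (n_lists - 1)


def missteered(n_packets, pick, n_lists=1):
    """Count packets in one vector whose index is out of bounds."""
    bad, done = 0, 0
    while n_packets - done >= 4:  # quad-loop: four packets per iteration
        bad += sum(1 for slot in range(4) if pick(slot, n_lists, 0) >= n_lists)
        done += 4
    while done < n_packets:       # single loop: always uses the b0 code path
        bad += 1 if pick(0, n_lists, 0) >= n_lists else 0
        done += 1
    return bad


print(missteered(256, pick_index_buggy) / 256)  # 0.75
print(missteered(3, pick_index_buggy))          # 0
print(missteered(256, pick_index_fixed))        # 0
```

Vectors of fewer than four packets never enter the quad-loop in this model, which would explain why a slow ping can succeed while iperf falls apart.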
|
|
|
|
***5. Add tests***
|
|
|
|
I decide to add four tests: for {PHY, SUB} x {Encap, Decap}. On the encap side, I create an _SR
Policy_ with BSID `a3::9999:1` which encapsulates from source `a3::` and sends to _Segment List_
[`a4::`, `a5::`, `a6::c7`]. I then _steer_ L2 traffic from interface `pg0` using this _BSID_. I'll
generate a packet and want to receive it from `pg1` encapsulated with the correct SRH and
destination address. On the decap side, I create an SRv6 packet and send it into `pg1`, and want to
see it decapsulated and exit on interface `pg0`.
|
|
|
|
I try to get consistency by adding a `send_and_verify_pkts()` which takes a validator function as an
argument, either `compare_rx_tx_packet_T_Encaps_L2()` or `compare_rx_tx_packet_End_DX2()`. These
|
|
four tests succeed, look at me!
|
|
|
|
```
|
|
==============================================================================
|
|
SRv6 L2 Sub-Interface Steering Test Case [main thread only]
|
|
==============================================================================
|
|
Test SRv6 End.DX2 decapsulation to a hardware (phy) interface. 1.53 OK
|
|
Test SRv6 End.DX2 decapsulation to a sub-interface (VLAN). 1.00 OK
|
|
Test SRv6 L2 encapsulation on a hardware (phy) interface. 1.97 OK
|
|
Test SRv6 L2 encapsulation on a sub-interface (VLAN). 1.93 OK
|
|
|
|
==============================================================================
|
|
TEST RESULTS:
|
|
Scheduled tests: 4
|
|
Executed tests: 4
|
|
Passed tests: 4
|
|
==============================================================================
|
|
```
|
|
|
|
## Results
|
|
|
|
With this change, it becomes possible to `sr steer` into a sub-interface, and to have an `sr
|
|
localsid` that outputs to a sub-interface, which I can demonstrate like so:
|
|
|
|
```
|
|
vpp0-0# create sub-interfaces GigabitEthernet10/0/2 100
|
|
vpp0-0# set int l2 tag-rewrite GigabitEthernet10/0/2.100 pop 1
|
|
vpp0-0# set int state GigabitEthernet10/0/2.100 up
|
|
vpp0-0# sr policy add bsid 8298::2:2 next 2001:678:d78:20f::3:2 encap
|
|
vpp0-0# sr steer l2 GigabitEthernet10/0/2.100 via bsid 8298::2:2
|
|
vpp0-0# sr localsid address 2001:678:d78:20f::0:2 behavior end.dx2 GigabitEthernet10/0/2.100
|
|
|
|
vpp0-3# create sub-interfaces GigabitEthernet10/0/3 200
|
|
vpp0-3# set int l2 tag-rewrite GigabitEthernet10/0/3.200 pop 1
|
|
vpp0-3# set int state GigabitEthernet10/0/3.200 up
|
|
vpp0-3# sr policy add bsid 8298::2:2 next 2001:678:d78:20F::2 encap
|
|
vpp0-3# sr steer l2 GigabitEthernet10/0/3.200 via bsid 8298::2:2
|
|
vpp0-3# sr localsid address 2001:678:d78:20f::3:2 behavior end.dx2 GigabitEthernet10/0/3.200
|
|
```
|
|
|
|
One thing to remember is that when sub-interfaces are created and used in L2 mode, they have to get
the [[VLAN Gymnastics]({{< ref 2022-02-14-vpp-vlan-gym >}})] applied to them. In VPP terminology, it
means applying the _VTR_ or _VLAN Tag Rewrite_ feature, where the tag is removed upon ingress, and
re-added on egress. That way, the ethernet frame that gets put into the SRv6 L2VPN is untagged. It
allows me to have a different encapsulation on each side.
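
To illustrate what `pop 1` does to a frame, here is a minimal Python sketch that strips an 802.1Q tag on ingress and pushes a different one on egress. It operates on raw bytes and only models the tag handling, not VPP's VTR implementation. The helper names `pop_tag()` and `push_tag()` are made up.

```python
import struct

TPID_DOT1Q = 0x8100  # ethertype announcing an 802.1Q tag


def pop_tag(frame):
    """Remove the outer 802.1Q tag. Returns (vlan_id, untagged_frame)."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_DOT1Q:
        raise ValueError("frame is not 802.1Q tagged")
    return tci & 0x0FFF, frame[:12] + frame[16:]


def push_tag(frame, vlan_id):
    """Insert an 802.1Q tag (priority 0) right after the two MAC addresses."""
    tag = struct.pack("!HH", TPID_DOT1Q, vlan_id & 0x0FFF)
    return frame[:12] + tag + frame[12:]


# A frame from host0-0 on VLAN 100: MACs, 802.1Q tag, IPv4 ethertype, payload.
macs = bytes.fromhex("525400f01013") + bytes.fromhex("525400f01000")
tagged_in = macs + struct.pack("!HH", TPID_DOT1Q, 100) + b"\x08\x00" + b"payload"

vlan_in, untagged = pop_tag(tagged_in)  # ingress: this is what enters the L2VPN
tagged_out = push_tag(untagged, 200)    # egress on the far side uses VLAN 200
print(vlan_in, len(tagged_in) - len(untagged))  # 100 4
print(pop_tag(tagged_out)[0])                   # 200
```

Only the untagged frame travels inside SRv6, so VLAN 100 on one side and VLAN 200 on the other can be stitched together without either host noticing.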
|
|
|
|
Now, for the _moment suprème_, on the two hosts, I can now create this sub-interface and use the tagged L2VPN also:
|
|
```
|
|
root@host0-0:~# ip link add link enp16s0f0 name enp16s0f0.100 type vlan id 100
|
|
root@host0-0:~# ip link set enp16s0f0.100 up mtu 1500
|
|
root@host0-0:~# ip addr add 192.0.2.128/31 dev enp16s0f0.100
|
|
|
|
root@host0-1:~# ip link add link enp16s0f3 name enp16s0f3.200 type vlan id 200
|
|
root@host0-1:~# ip link set enp16s0f3.200 up mtu 1500
|
|
root@host0-1:~# ip addr add 192.0.2.129/31 dev enp16s0f3.200
|
|
root@host0-1:~# ping 192.0.2.128
|
|
PING 192.0.2.128 (192.0.2.128) 56(84) bytes of data.
|
|
64 bytes from 192.0.2.128: icmp_seq=1 ttl=64 time=9.88 ms
|
|
64 bytes from 192.0.2.128: icmp_seq=2 ttl=64 time=4.88 ms
|
|
64 bytes from 192.0.2.128: icmp_seq=3 ttl=64 time=7.07 ms
|
|
^C
|
|
--- 192.0.2.128 ping statistics ---
|
|
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
|
|
rtt min/avg/max/mdev = 4.880/7.273/9.876/2.044 ms
|
|
|
|
root@host0-1:~# ip nei | grep 200
|
|
192.0.2.128 dev enp16s0f3.200 lladdr 52:54:00:f0:10:00 REACHABLE
|
|
fe80::5054:ff:fef0:1000 dev enp16s0f3.200 lladdr 52:54:00:f0:10:00 DELAY
|
|
```
|
|
|
|
## What's Next
|
|
|
|
I've sent the change, which is about 850 LOC, off for review. You can follow along on
Gerrit in [[44899](https://gerrit.fd.io/r/c/vpp/+/44899)]. I'm happy to have fixed the quad-loop
|
|
encap bug, but it does show me that SRv6 (at least in L2 transport mode) is not super common for
|
|
VPP, perhaps not common in the industry? I am not convinced that I want to use this in production on
|
|
AS8298, but if I did, the basic functionality would be adding an IPv6 prefix to each of the loopback
|
|
devices, in order to attract traffic to the router, add an 'End' localsid on every router so that
|
|
they can participate in multi-hop SRv6, and add some static config to
|
|
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] to do the encap/decap for L2VPN. By the way, there's a
|
|
whole world of encap and decap behaviors, including L3VPN for IPv4, IPv6, GTP-U, and so on.
|
|
|
|
For me, I've still set my sights on eVPN VxLAN as a destination, because that will give me
|
|
multi-point ethernet mesh akin to VPLS. However, there's a lot of ground to cover for me,
|
|
considering IPng uses Bird2 as a routing controlplane. Bird2 is starting to get eVPN support, but
|
|
there's a lot for me to learn. Stay tuned!
|
|
|