---
date: "2026-02-21T11:35:14Z"
title: VPP SRv6 L2VPN
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
Segment Routing is a lesser known technique that allows network operators to determine a path
through their network by encoding the path inside headers in the packet itself, rather than relying
on the IGP to determine the path. Originally created to help traffic engineering of MPLS packets,
the concepts were carried forward for IPv6 as well.
In this article I take SRv6 out for a spin, implement some missing features in VPP, and stumble across, and manage to fix, a nasty bug in its implementation.
## Introduction
SRv6 - Segment Routing for IPv6 - is defined in a number of RFCs.
1. [[RFC 8402](https://datatracker.ietf.org/doc/html/rfc8402)]: Segment Routing Architecture. This
document describes the fundamentals. It defines the general concepts of Segment Routing (nodes,
segments, and steering) for both MPLS and IPv6.
1. [[RFC 8754](https://datatracker.ietf.org/doc/html/rfc8754)]: IPv6 Segment Routing Header (SRH).
This RFC defines the specific IPv6 Extension Header used for SRv6. It explains how segments are listed
and how the Segments Left field works.
1. [[RFC 8986](https://datatracker.ietf.org/doc/html/rfc8986)]: SRv6 Network Programming.
This one describes the so-called "behaviors" associated with a Segment ID (SID). It defines functions
like End (Endpoint), End.X (Layer-3 cross-connect), and End.DT4/6 (VRF decapsulation).
While reading these RFCs, I learn that I can configure an SRv6 path through the network that picks
up an ethernet packet on the ingress, and decapsulates and cross connects that ethernet packet to an
interface on the egress: an L2VPN using Ethernet-over-IPv6. That sounds dope to me!
### SRv6 in VPP - Segment Routing Header
For the dataplane, there are two parts of note. Firstly, when an IPv6 packet arrives with an IPv6
extension header, the so-called _Segment Routing Header_ or SRH, any router supporting SRv6 needs to
inspect it. The presence of an SRH changes the forwarding logic from a simple "look at the
destination, do a FIB lookup for next hop, and send the packet on its merry way" to a more
customized "process the instruction and update the IPv6 headers" kind of thing.
In IPv6, an (almost) arbitrary number of extension headers can be chained between the base IPv6 header and the ultimate layer-4 protocol header like ICMP, TCP or UDP. In IPv4 this is not the case: there is only the L3 header (IPv4) and the L4 header (TCP/UDP/ICMP etc). One of these intermediate headers is the Routing Extension Header, and the SRH is the one with routing type 4.
The fields in this header are:
* ***Next Header***: Identifies the type of header following the SRH. It can be another
extension header, or the layer-4 protocol header like TCP, UDP or ICMP.
* ***Flags***: IANA loves reserving optionality for the future. The authors of SRv6 added an 8-bit
flags field, but no flags have been assigned yet.
* ***Tag***: Moar optionality! This 16-bit tag is not defined in the RFC, simply stating that _The
allocation and use of tag is outside the scope of this document_. OK then!
* ***Segments Left (SL)***: A counter indicating how many intermediate nodes still need to be
visited.
* ***Last Entry***: The index (starting from 0) of the last element in the Segment List.
* ***Segment List***: This is an array of 128-bit IPv6 addresses, listed in reverse order of the
path. The first segment to be visited is at the highest index.
* (optional) ***TLVs***: These Type-Length-Value objects can encode other information, like HMAC
signatures, operational and performance monitoring data, and so on.
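To make that layout concrete, here's a minimal C sketch of the fixed part of the SRH as defined in RFC 8754. The names (`srh_t`, `srh_wire_len`) are illustrative, not VPP's internal types, which live under `src/vnet/srv6/`:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout of the fixed 8-byte part of the SRH (RFC 8754);
 * a list of (last_entry + 1) 128-bit SIDs follows it on the wire. */
typedef struct
{
  uint8_t next_header;	 /* header after the SRH, e.g. 143 for Ethernet */
  uint8_t hdr_ext_len;	 /* SRH length in 8-octet units, minus the first 8 */
  uint8_t routing_type;	 /* 4 identifies the SRH among routing headers */
  uint8_t segments_left; /* how many segments still need to be visited */
  uint8_t last_entry;	 /* zero-based index of the last segment */
  uint8_t flags;	 /* none assigned */
  uint16_t tag;		 /* use outside the scope of the RFC */
} srh_t;

/* Total SRH size on the wire for a given number of SIDs (no TLVs). */
static unsigned
srh_wire_len (unsigned n_sids)
{
  return sizeof (srh_t) + 16 * n_sids;
}

/* The value tcpdump prints as "len=" is hdr_ext_len. */
static unsigned
srh_hdr_ext_len (unsigned n_sids)
{
  return (srh_wire_len (n_sids) - 8) / 8;
}
```

With three SIDs this yields an `hdr_ext_len` of 6, which is exactly the `len=6` that tcpdump prints in the captures further down.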
### SRv6: Anatomy
{{< image width="14em" float="right" src="/assets/vpp-srv6/magnets.jpg" alt="Insane Clown" >}}
Much like magnets, you might be wondering _SRv6 Routers: How do they work?_ There are really only
three relevant things: SR Policies (they determine how packets are steered into the SRv6 routing
domain), SRv6 Source Nodes (they handle the ingress part), and SRv6 Segment Endpoint Nodes (they
handle both the intermediate routers that participate in SRv6, and the egress part where the
packet leaves the SRv6 routing domain).
#### SRv6: Policies
_Segment Routing Policies_ are the same for MPLS and SRv6. They are represented by either a stack of
MPLS labels or a stack of IPv6 addresses, and they are uniquely identified by either an MPLS
label or an IPv6 address as well. The identifier is called a _Binding Segment ID_ or BSID, and the
elements of the list are called _Segment IDs_ or SIDs.
```
BSID := SID [, SID] [, SID] ...
8298::1 := 2001:db8::1 , 2001:db8::2 , 2001:db8::3
```
These policies are written to the FIB in the router. I can now do a lookup for `8298::1`, and find
that it points to this _SR Policy_ object with the list of three IPv6 addresses. In the case of
MPLS, the _BSID_ will be in the MPLS FIB and point at a list of three MPLS labels, but I'm going to
stop talking about MPLS now :)
#### SRv6: Source Node
An _SR Source Node_ originates an IPv6 packet with a Segment in the destination address, and
optionally adds an SRH with a list of further instructions. The _SR Source Node_ is the
ingress point and enables SRv6 processing in the network, which is called _steering_. Instead of
setting the destination address to the final destination, the source node sets it to the first
Segment, which is the first router that needs to be visited.
#### SRv6: Transit Node
Spoiler alert! This node type doesn't have anything to do with SRv6. SRv6 packets really do look
like normal packets: the IPv6 source address is the Source Node, and the destination address is the
next Segment. A _Transit Node_ can forward such a packet like any other, using its routing table.
Notably, these routers are not actively participating in SRv6 and they don't need to know anything about it.
#### SRv6: Segment Endpoint Node
The _Segment Endpoint Node_ is a router that is SRv6 capable. A packet may arrive with a locally
configured address in the IPv6 destination. The magic happens here - one of two things:
1. The _Segment Routing Header_ is inspected. If _Segments Left_ is 0, then the next header
(typically UDP, TCP, ICMP) is processed. Otherwise, the next segment is read from the _Segment
List_, the IPv6 destination address is overwritten with it, and the _Segments Left_ field is
decremented. The packet is then routed normally through any number of transit
routers, which are blissfully ignorant of what is happening, on to the next _Segment Endpoint_
router.
1. The IPv6 destination address might have an entry in the forwarding table which points at a
specific local meaning, called a _Local Segment ID_ or _LocalSID_. The LocalSID tells this router
what to do: for example, decapsulate the packet and do a next-hop lookup in a specific routing table,
useful for L3VPNs; or decapsulate the packet and cross connect it to a
local interface, useful for L2VPNs. The key insight here is that the local FIB entry can carry any
type of further instruction.
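The first case above can be sketched in a few lines of C. This is a simplification with illustrative names (`end_advance`, `sid_t`), not VPP's implementation, which lives in `src/vnet/srv6/sr_localsid.c`:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t a[16]; } sid_t; /* a 128-bit Segment ID */

/* Sketch of the "End" behavior: if segments remain, decrement Segments
 * Left and overwrite the IPv6 destination with the next SID (the list
 * is stored in reverse order of the path). Returns 0 when the last
 * segment has been reached, meaning the LocalSID instruction (such as
 * decapsulation) should run instead. */
static int
end_advance (uint8_t *segments_left, const sid_t *seg_list, sid_t *dst_addr)
{
  if (*segments_left == 0)
    return 0;			      /* process the next header   */
  *segments_left -= 1;		      /* one fewer hop to visit    */
  *dst_addr = seg_list[*segments_left]; /* route towards the new SID */
  return 1;
}
```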
## VPP: IPng LAB
At this point I'm pretty sure I've bored you to tears with all the RFC stuff and theory. I do think
that segment routing (both the MPLS and the SRv6 variant) is sufficiently complex that reading
the main RFCs at least once is useful. But for me, the fun part is seeing it work in
practice. So I boot the [[IPng Lab]({{< ref 2022-10-14-lab-1 >}})], which looks a bit like this.
{{< image width="100%" src="/assets/lab/LAB v2.svg" alt="Logical" >}}
In this environment, each of the VPP routers is running Bird2 with OSPF and OSPFv3. They are
connected in a string, and each VPP router has an interface (`Gi10/0/2`) connected to a debian host
called `host0-0` (at the bottom), as well as an interface (`Gi10/0/3`) connected to a host called
`host0-1` (at the top). One really cool feature of the LAB is that all links are on an OpenVSwitch
which is mirroring all traffic to a tap host called `tap0-0`, so I can see traffic clearly:
```
root@vpp0-0:/etc/bird# ping -n 2001:678:d78:200::3 -c1
PING 2001:678:d78:200::3 (2001:678:d78:200::3) 56 data bytes
64 bytes from 2001:678:d78:200::3: icmp_seq=1 ttl=62 time=3.24 ms
--- 2001:678:d78:200::3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.240/3.240/3.240/0.000 ms
root@tap0-0:~# tcpdump -eni enp16s0f0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:39:23.558942 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype 802.1Q (0x8100), length 122: vlan 20, p 0,
ethertype IPv6 (0x86dd), 2001:678:d78:201::1:0 > 2001:678:d78:200::3: ICMP6, echo request, id 12, seq 1, length 64
10:39:23.558942 52:54:00:f0:11:11 > 52:54:00:f0:11:20, ethertype 802.1Q (0x8100), length 122: vlan 21, p 0,
ethertype IPv6 (0x86dd), 2001:678:d78:201::1:0 > 2001:678:d78:200::3: ICMP6, echo request, id 12, seq 1, length 64
10:39:23.559993 52:54:00:f0:11:21 > 52:54:00:f0:11:30, ethertype 802.1Q (0x8100), length 122: vlan 22, p 0,
ethertype IPv6 (0x86dd), 2001:678:d78:201::1:0 > 2001:678:d78:200::3: ICMP6, echo request, id 12, seq 1, length 64
10:39:23.560179 52:54:00:f0:11:30 > 52:54:00:f0:11:21, ethertype 802.1Q (0x8100), length 122: vlan 22, p 0,
ethertype IPv6 (0x86dd), 2001:678:d78:200::3 > 2001:678:d78:201::1:0: ICMP6, echo reply, id 12, seq 1, length 64
10:39:23.561070 52:54:00:f0:11:20 > 52:54:00:f0:11:11, ethertype 802.1Q (0x8100), length 122: vlan 21, p 0,
ethertype IPv6 (0x86dd), 2001:678:d78:200::3 > 2001:678:d78:201::1:0: ICMP6, echo reply, id 12, seq 1, length 64
10:39:23.561248 52:54:00:f0:11:10 > 52:54:00:f0:11:01, ethertype 802.1Q (0x8100), length 122: vlan 20, p 0,
ethertype IPv6 (0x86dd), 2001:678:d78:200::3 > 2001:678:d78:201::1:0: ICMP6, echo reply, id 12, seq 1, length 64
```
Here you can see the packet path from `vpp0-0` sending one ICMPv6 echo request to `vpp0-3`, which
responded with one ICMPv6 echo reply. I can see the packet on vlan 20, 21, 22 on the way out, and
back again on vlan 22, 21 and 20.
### VPP: SRv6 Example
Alright, here I go! With the following short snippet, I can sum up all of the theory above in a
practical first example:
```
vpp0-0# set sr encaps source addr 2001:678:d78:200::
vpp0-0# sr policy add bsid 8298::2:1 next 2001:678:d78:20F::3:1 encap
vpp0-0# sr steer l2 GigabitEthernet10/0/2 via bsid 8298::2:1
vpp0-0# sr localsid address 2001:678:d78:20f::0:1 behavior end.dx2 GigabitEthernet10/0/2
vpp0-0# set int state GigabitEthernet10/0/2 up
```
Looking at what I typed on `vpp0-0`, first I tell the system that its encapsulation source address
is its IPv6 loopback address. Then I add a _Binding SID_ with one _Segment ID_ and I instruct this
policy to encapsulate the packet. Then, I add an L2 steering from interface `Gi10/0/2` via this
_BSID_. At this point, `vpp0-0` knows that if an ethernet frame comes in on that interface, it needs
to encapsulate it in SRv6 from `2001:678:d78:200::` and send it to `2001:678:d78:20F::3:1`. Finally,
I tell the system that if an IPv6 packet arrives with destination address `2001:678:d78:20f::0:1`,
that it needs to decapsulate it and send the resulting L2 datagram out on Gi10/0/2.
There is one last thing I have to do, and that is to somehow attract the `2001:678:d78:20F::0:0/112` prefix
to `vpp0-0` and the `2001:678:d78:20F::3:0/112` prefix to `vpp0-3`. I can do this by adding the prefix
to `loop0`, like so:
```
vpp0-0# create loopback interface instance 0
vpp0-0# set interface state loop0 up
vpp0-0# set interface ip address loop0 192.168.10.0/32
vpp0-0# set interface ip address loop0 2001:678:d78:200::0/128
vpp0-0# set interface ip address loop0 2001:678:d78:20F::0:0/112
```
This will be picked up in OSPFv3, and all routers will install a FIB entry pointing at `vpp0-0` for
the /112. Did it work?
```
root@host0-0:~# ping6 ff02::1%enp16s0f0
PING ff02::1%enp16s0f0 (ff02::1%enp16s0f0) 56 data bytes
64 bytes from fe80::5054:ff:fef0:1000%enp16s0f0: icmp_seq=1 ttl=64 time=0.156 ms
64 bytes from fe80::5054:ff:fef0:1013%enp16s0f0: icmp_seq=1 ttl=64 time=4.03 ms
^C
--- ff02::1%enp16s0f0 ping statistics ---
1 packets transmitted, 1 received, +1 duplicates, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.156/2.092/4.029/1.936 ms
```
{{< image width="12em" float="right" src="/assets/vpp-srv6/hannibal-plan.png" alt="Hannibal Smith loves it" >}}
Yes, it worked! I love it when a plan comes together! The IPv6 address that I pinged, `ff02::1`, is
the link-local _all-nodes_ multicast address, and I can see one reply from `fe80::5054:ff:fef0:1000` which is host0-0's own
link-local address, and a second reply from `fe80::5054:ff:fef0:1013` which is host0-1's address.
I have created a point-to-point L2VPN or _Virtual Leased Line_ between `vpp0-0:Gi10/0/2` and
`vpp0-3:Gi10/0/3`, and any ethernet traffic between these two ports is passed through the network as
segment-routed IPv6 packets. Nice going!
### SRv6 on the Wire
I learn something curious. I configure an IPv4 address on both hosts:
```
root@host0-0:~# ip addr add 192.0.2.0/31 dev enp16s0f0
root@host0-1:~# ip addr add 192.0.2.1/31 dev enp16s0f3
root@host0-1:~# ping 192.0.2.0
PING 192.0.2.0 (192.0.2.0) 56(84) bytes of data.
64 bytes from 192.0.2.0: icmp_seq=1 ttl=64 time=5.27 ms
^C
--- 192.0.2.0 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.274/5.274/5.274/0.000 ms
```
And then I take a look at this IPv4 ICMP packet on the wire:
```
11:03:22.118770 52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype 802.1Q-QinQ (0x88a8), length 102: vlan 30, p 0,
ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 35014, offset 0, flags [DF], proto ICMP (1), length 84)
192.0.2.0 > 192.0.2.1: ICMP echo request, id 50, seq 1, length 64
11:03:22.119078 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype 802.1Q (0x8100), length 156: vlan 20, p 0,
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 63, next-header Ethernet (143) payload length: 98) 2001:678:d78:200:: > 2001:678:d78:20f::3:1:
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 35014, offset 0, flags [DF], proto ICMP (1), length 84)
192.0.2.0 > 192.0.2.1: ICMP echo request, id 50, seq 1, length 64
```
The first packet is coming in on vlan 30 (`host0-0:enp16s0f0` to `vpp0-0:Gi10/0/2`). I then see it
go out on vlan 20 (from `vpp0-0` to `vpp0-1`). I see it is an IPv6 packet from `2001:678:d78:200::`
(the encapsulation address I configured), and to `2001:678:d78:20f::3:1` (the _BSID_ steers towards
this Segment Router ID), and then I see the Ethernet inner payload with the ICMP echo packet. But
where's the _Segment Routing Header_??
It is here that I learn why the RFC says the SRH is optional. This packet has everything it needs
in the destination address, `2001:678:d78:20f::3:1`, which is routed towards the loopback
interface of `vpp0-3`. There, it is looked up in the FIB, and the _Local Segment ID_ or LocalSID
determines that packets to this address must be decapsulated and forwarded out on `vpp0-3:Gi10/0/3`.
### VPP: Let's ZigZag
So how do I get these elusive SRH headers? Easy: put more than one segment in the _SR Policy_,
because then the _SR Source Node_ has to encode the path in a _Segment List_, for which it needs to
construct an SRH.
I want to tell `vpp0-0` to do some scenic routing: send the packet first to `vpp0-2`,
then `vpp0-1`, and then `vpp0-3`. I struggle a little bit, because how should I construct the
_Segment List_? If I put `vpp0-2`'s loopback address in there, the packet will be seen as local
and sent for local processing in VPP's `ip6-receive` node. I don't want that to happen; instead I
want VPP to inspect the SRH in this case. After reading a little bit in
`src/vnet/srv6/sr_localsid.c`, I realize the trick is simple (once you know it, of course): I need
to tell all routers to handle a specific localsid with the _End_ behavior, which makes the
intermediate routers run `end_srh_processing()`, processing the SRH and doing the destination
swap.
```
vpp0-3# sr localsid address 2001:678:d78:20F::3:ffff behavior end
vpp0-2# sr localsid address 2001:678:d78:20F::2:ffff behavior end
vpp0-1# sr localsid address 2001:678:d78:20F::1:ffff behavior end
vpp0-0# sr localsid address 2001:678:d78:20F::0:ffff behavior end
vpp0-0# sr policy add bsid 8298::2:2 next 2001:678:d78:20F::2:ffff next 2001:678:d78:20F::1:ffff
next 2001:678:d78:20f::3:1 encap
```
Now each router knows that if an IPv6 packet is destined to its `:ffff` address, it needs to
"End" the segment by inspecting the SRH. The _SR Policy_ on `vpp0-0` sends the packet first to
`::2:ffff`, which is `vpp0-2`, which now inspects the SRH and advances the _Segment List_.
The proof is in the tcpdump pudding, and it makes me smile to see the icmp-echo packet bounce back
and forth on its scenic route:
```
root@tap0-0:~# tcpdump -veni enp16s0f0 src 2001:678:d78:200::
tcpdump: listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:15:39.442587 52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype 802.1Q-QinQ (0x88a8), length 102: vlan 30, p 0,
ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 5534, offset 0, flags [DF], proto ICMP (1), length 84)
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 561, length 64
12:15:39.501353 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype 802.1Q (0x8100), length 212: vlan 20, p 0,
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 63, next-header Routing (43) payload length: 154)
2001:678:d78:200:: > 2001:678:d78:20f::2:ffff: RT6 (len=6, type=4, segleft=2, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff)
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84)
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 6, length 64
12:15:39.501902 52:54:00:f0:11:11 > 52:54:00:f0:11:20, ethertype 802.1Q (0x8100), length 212: vlan 21, p 0,
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 62, next-header Routing (43) payload length: 154)
2001:678:d78:200:: > 2001:678:d78:20f::2:ffff: RT6 (len=6, type=4, segleft=2, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff)
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84)
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 6, length 64
12:15:39.502658 52:54:00:f0:11:20 > 52:54:00:f0:11:11, ethertype 802.1Q (0x8100), length 212: vlan 21, p 0,
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 61, next-header Routing (43) payload length: 154)
2001:678:d78:200:: > 2001:678:d78:20f::1:ffff: RT6 (len=6, type=4, segleft=1, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff)
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84)
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 6, length 64
12:15:39.502990 52:54:00:f0:11:11 > 52:54:00:f0:11:20, ethertype 802.1Q (0x8100), length 212: vlan 21, p 0,
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 60, next-header Routing (43) payload length: 154)
2001:678:d78:200:: > 2001:678:d78:20f::3:1: RT6 (len=6, type=4, segleft=0, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff)
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84)
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 6, length 64
12:15:39.503813 52:54:00:f0:11:21 > 52:54:00:f0:11:30, ethertype 802.1Q (0x8100), length 212: vlan 22, p 0,
ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 59, next-header Routing (43) payload length: 154)
2001:678:d78:200:: > 2001:678:d78:20f::3:1: RT6 (len=6, type=4, segleft=0, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff)
52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84)
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 6, length 64
12:15:39.525605 52:54:00:f0:10:00 > 52:54:00:f0:10:13, ethertype 802.1Q-QinQ (0x88a8), length 102: vlan 43, p 0,
ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 5534, offset 0, flags [DF], proto ICMP (1), length 84)
192.0.2.0 > 192.0.2.1: ICMP echo request, id 51, seq 561, length 64
```
The echo-request packet can be observed seven times:
1. coming in on vlan 30 (between `host0-0` and `vpp0-0:Gi10/0/2`), here it is simply an IPv4
packet.
1. on vlan 20, encapsulated in an IPv6 packet, this time _including_ SRH header showing where it
is expected to go.
1. on vlan 21, because the first segment wants the packet to go to `vpp0-2`, and `vpp0-1` is
acting as a transit router (just normally using an IPv6 FIB lookup to pass it along)
1. on vlan 21 again, because when `vpp0-2` got it, it popped the SRH list, and sent it to the
second segment, which is onwards to `vpp0-1`.
1. on vlan 21 yet again, because when `vpp0-1` got it, it popped the SRH list, and sent it to the
third and final segment, which is onwards to `vpp0-3`.
1. on vlan 22, because `vpp0-2` is a transit router, using its FIB to pass it along to `vpp0-3`,
which decapsulates it with End.DX2 and sends it as an L2 packet on Gi10/0/3.
1. coming out of vlan 43 (between `vpp0-3:Gi10/0/3` and `host0-1`), where it is simply an IPv4
packet again.
VPP has a few relevant dataplane nodes:
1. ***sr-pl-rewrite-encaps-l2***: This node encapsulates ethernet at the ingress point by steering
packets into an _SR Policy_ named by its _Binding Segment ID_
1. ***sr-localsid***: This node implements End behavior, in this case sending to the next Segment
Router by looking up its _Local Segment ID_ in the FIB
1. ***sr-localsid-d***: This node decapsulates the ethernet, on an `End.DX2` behavior, by looking
up its _Local Segment ID_ in the FIB
## VPP: Adding SRv6 encap/decap on sub-interface
A few years ago, I thought maybe it'd be cool to use SRv6 for L2VPN at IPng. But I was quickly
disappointed, because SRv6 encap and decap are only implemented on the `device-input` path, which
means they will not work with sub-interfaces.
A few weeks ago, I worked on Gerrit [[44654](https://gerrit.fd.io/r/c/vpp/+/44654)], which
implements policers on sub-interfaces. I wrote about it in a [[policer article]({{< ref
2026-02-14-vpp-policers >}})], and since my brain's instruction cache is still warm with the code I
wrote to enable L2 features on input and output, I thought I'd give it another go. If you're not
interested in the software engineering parts, you can stop reading now :-)
***0. Remove vlan_index everywhere***
The original author followed the RFC, where there is an `End.DX2V` behavior that allows
decapsulating to a VLAN tag on an interface, but they never implemented it and added a note to the
code to that effect. I can see why: DX2V is not idiomatic for VPP, and there's an alternative that
makes more sense, decapsulating with `End.DX2` to a sub-interface. So I removed this from the
codebase in all places except the API functions, where I marked them as 'not implemented', which is
true at this point anyway.
***1. Add feature bitmap entries***
I added `L2INPUT_FEAT_SRV6` to `l2_input.h`. This allows me to turn on an SRv6 feature bit and, on
ingress, send L2 datagrams from the `l2-input` node directly to the `sr-pl-rewrite-encaps-l2` node,
regardless of whether the interface is a PHY like `Gi10/0/0` or a SUB like `Gi10/0/0.100`. It comes
at a small CPU cost though: going directly from the `device-input` arc to the encapsulation node
used to skip a bunch of L2 processing, like L2 ACLs and VLAN Tag Rewriting (which doesn't make sense
on an untagged interface anyway). But in return, I can apply SRv6 encapsulation to any interface type.
***2. Precompute DX2 headerlen***
In the case of an `End.DX2` to a sub-interface, I need to add either 4 bytes (single tag) or 8 bytes
(QinQ or QinAD double tag) to the packet length. I know which one at creation time, because I can
look it up from the to-be-DX2'd interface. I'll store this in the localsid structure as `ls->l2_len`
(either 14, 18, or 22 bytes).
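The arithmetic is simply the Ethernet header plus 4 bytes per tag; `dx2_l2_len` below is an illustrative helper, not the actual VPP code:

```c
#include <assert.h>

/* Illustrative: L2 header length for an End.DX2 rewrite, given the
 * number of VLAN tags on the target (sub-)interface.
 * 0 tags = plain Ethernet, 1 = dot1q, 2 = QinQ/QinAD. */
static unsigned
dx2_l2_len (unsigned n_vlan_tags)
{
  return 14 + 4 * n_vlan_tags; /* 6+6+2 Ethernet header + 4 per tag */
}
```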
***3a. Connect to `l2-input` on ingress***
When enabling the `sr steer` with keyword `encap`, I need to change two things: first, I need to
allow `VNET_SW_INTERFACE_TYPE_SUB` in addition to the already present
`VNET_SW_INTERFACE_TYPE_HARDWARE`, and then if the steer policy is `SR_STEER_L2`, I remove the bits
which initialize the feature arc on `device-input`, and instead, call
`set_int_l2_mode()` in `MODE_L2_XC` (cross connect), but then I sneakily clear the feature bitmap
bit for `L2INPUT_FEAT_XCONNECT`, and instead set my new `L2INPUT_FEAT_SRV6` bit. This means that
from now on, any L2 frames will get sent to node `sr-pl-rewrite-encaps-l2` instead of `l2-output`
which is what the L2XC would've done. Finally, I initialize the L2 feature bitmap next-nodes for the
encapsulation node in function `sr_policy_rewrite_init()`.
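The bitmap juggling above can be sketched like this; the flag values here are illustrative, the real bits come from `l2_input.h`:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values; VPP's real bits live in l2_input.h. */
enum
{
  FEAT_XCONNECT = 1u << 0, /* stands in for L2INPUT_FEAT_XCONNECT */
  FEAT_SRV6 = 1u << 1,	   /* stands in for L2INPUT_FEAT_SRV6 (the new bit) */
};

/* After set_int_l2_mode(..., MODE_L2_XC), clear the xconnect bit and
 * set the SRv6 bit, so that l2-input sends frames to the
 * sr-pl-rewrite-encaps-l2 node instead of l2-output. */
static uint32_t
steer_l2_feature_bitmap (uint32_t bitmap)
{
  bitmap &= ~(uint32_t) FEAT_XCONNECT;
  bitmap |= FEAT_SRV6;
  return bitmap;
}
```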
***3b. Connect to `l2-output` on egress***
I call `l2output_create_output_node_mapping()` on the (sub)-interface, so that traffic into it will
go to `l2-output`, where I can inspect the feature bitmap to see if I need to send it to
decapsulation or not. I also need to update `sr_localsid_next` to remove `interface-output` and
replace it with `l2-output` so that egress traffic visits `l2-output`. In
`end_decaps_srh_processing()`, I need to set the `l2_len` on the buffer, and change the next node to
be `SR_LOCALSID_NEXT_L2_OUTPUT` instead of `SR_LOCALSID_NEXT_INTERFACE_OUTPUT`, so that
sub-interface processing can occur (e.g. VLAN Tag Rewriting, ACLs, SPAN, and so on).
***4. Fix a bug in `sr_policy_rewrite_encaps_l2`***
I kind of thought I would be done, and it did work, but I had about 75% packet loss and iperf
performance was 20Mbps or so, while on the bench I usually expect 350+ Mbps. I scratched my head a
little bit, but then found a bug in the quad-loop processing of `sr_policy_rewrite_encaps_l2()`.
Maybe you can spot it too?
```
if (vec_len (sp0->segments_lists) == 1)
vnet_buffer (b0)->ip.adj_index[VLIB_TX] = sp0->segments_lists[0];
else {
vnet_buffer (b0)->ip.flow_hash = flow_label0;
vnet_buffer (b0)->ip.adj_index[VLIB_TX] = sp0->segments_lists[(vnet_buffer (b0)->ip.flow_hash & (vec_len (sp0->segments_lists) - 1))];
}
if (vec_len (sp1->segments_lists) == 1)
vnet_buffer (b1)->ip.adj_index[VLIB_TX] = sp1->segments_lists[1];
else {
vnet_buffer (b1)->ip.flow_hash = flow_label1;
vnet_buffer (b1)->ip.adj_index[VLIB_TX] = sp1->segments_lists[(vnet_buffer (b1)->ip.flow_hash & (vec_len (sp1->segments_lists) - 1))];
}
```
Once I found this, I became quite certain that nobody uses L2 encapsulation in VPP, because if 4+
packets were present in the vector, then for the second through fourth packets (`b1`-`b3`), if the
segment list had length 1, the segment list index would incorrectly read garbage at
`segments_lists[1]` rather than the first and only entry `segments_lists[0]`. Yikes! But it explains
perfectly why I had roughly 75% packet loss, lots of TCP retransmits, and terrible throughput. I fix
this bug and SRv6 encap starts to work flawlessly.
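Reduced to its essence, the corrected per-packet selection looks like this sketch (with illustrative types; in VPP the chosen entry is a DPO index written into the buffer's `adj_index`):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the per-packet segment-list selection after the fix: every
 * packet in the quad loop must use index 0 when there is only one list
 * (the buggy code used index 1 for packets b1..b3), and otherwise
 * load-balance by flow hash over a power-of-two number of lists. */
static uint32_t
pick_segment_list (const uint32_t *lists, uint32_t n_lists, uint32_t flow_hash)
{
  if (n_lists == 1)
    return lists[0]; /* the fix: this read lists[1] for b1..b3 */
  return lists[flow_hash & (n_lists - 1)];
}
```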
***5. Add tests***
I decide to add four tests: for {PHY, SUB} x {Encap, Decap}. On the encap side, I create an _SR
Policy_ with BSID `a3::9999:1` which encapsulates from source `a3::` and sends to the _Segment List_
[`a4::`, `a5::`, `a6::c7`]. I then _steer_ L2 traffic from interface `pg0` using this _BSID_. I
generate a packet and want to receive it from `pg1` encapsulated with the correct SRH and
destination address. On the decap side, I create an SRv6 packet and send it into `pg1`, and want to
see it decapsulated and exit on interface `pg0`.
I try to get consistency by adding a `send_and_verify_pkts()` which takes a validator function as
an argument, either `compare_rx_tx_packet_T_Encaps_L2()` or `compare_rx_tx_packet_End_DX2()`. These
four tests succeed, look at me!
```
==============================================================================
SRv6 L2 Sub-Interface Steering Test Case [main thread only]
==============================================================================
Test SRv6 End.DX2 decapsulation to a hardware (phy) interface. 1.53 OK
Test SRv6 End.DX2 decapsulation to a sub-interface (VLAN). 1.00 OK
Test SRv6 L2 encapsulation on a hardware (phy) interface. 1.97 OK
Test SRv6 L2 encapsulation on a sub-interface (VLAN). 1.93 OK
==============================================================================
TEST RESULTS:
Scheduled tests: 4
Executed tests: 4
Passed tests: 4
==============================================================================
```
## Results
With this change, it becomes possible to `sr steer` into a sub-interface, and to have an `sr
localsid` that outputs to a sub-interface, which I can demonstrate like so:
```
vpp0-0# create sub-interfaces GigabitEthernet10/0/2 100
vpp0-0# set int l2 tag-rewrite GigabitEthernet10/0/2.100 pop 1
vpp0-0# set int state GigabitEthernet10/0/2.100 up
vpp0-0# sr policy add bsid 8298::2:2 next 2001:678:d78:20f::3:2 encap
vpp0-0# sr steer l2 GigabitEthernet10/0/2.100 via bsid 8298::2:2
vpp0-0# sr localsid address 2001:678:d78:20f::0:2 behavior end.dx2 GigabitEthernet10/0/2.100
vpp0-3# create sub-interfaces GigabitEthernet10/0/3 200
vpp0-3# set int l2 tag-rewrite GigabitEthernet10/0/3.200 pop 1
vpp0-3# set int state GigabitEthernet10/0/3.200 up
vpp0-3# sr policy add bsid 8298::2:2 next 2001:678:d78:20F::2 encap
vpp0-3# sr steer l2 GigabitEthernet10/0/3.200 via bsid 8298::2:2
vpp0-3# sr localsid address 2001:678:d78:20f::3:2 behavior end.dx2 GigabitEthernet10/0/3.200
```
One thing to remember is that when sub-interfaces are created and used in L2 mode, they have to get
the [[VLAN Gymnastics]({{< ref 2022-02-14-vpp-vlan-gym >}})] applied to them. In VPP terminology,
this means applying the _VTR_ or _VLAN Tag Rewrite_ feature, where the tag is removed upon ingress
and re-added on egress. That way, the ethernet frame that gets put into the SRv6 L2VPN is untagged,
which allows me to have different encapsulation on either side.
Now, for the _moment supr&ecirc;me_: on the two hosts, I can now create this sub-interface and use the tagged L2VPN as well:
```
root@host0-0:~# ip link add link enp16s0f0 name enp16s0f0.100 type vlan id 100
root@host0-0:~# ip link set enp16s0f0.100 up mtu 1500
root@host0-0:~# ip addr add 192.0.2.128/31 dev enp16s0f0.100
root@host0-1:~# ip link add link enp16s0f3 name enp16s0f3.200 type vlan id 200
root@host0-1:~# ip link set enp16s0f3.200 up mtu 1500
root@host0-1:~# ip addr add 192.0.2.129/31 dev enp16s0f3.200
root@host0-1:~# ping 192.0.2.128
PING 192.0.2.128 (192.0.2.128) 56(84) bytes of data.
64 bytes from 192.0.2.128: icmp_seq=1 ttl=64 time=9.88 ms
64 bytes from 192.0.2.128: icmp_seq=2 ttl=64 time=4.88 ms
64 bytes from 192.0.2.128: icmp_seq=3 ttl=64 time=7.07 ms
^C
--- 192.0.2.128 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 4.880/7.273/9.876/2.044 ms
root@host0-1:~# ip nei | grep 200
192.0.2.128 dev enp16s0f3.200 lladdr 52:54:00:f0:10:00 REACHABLE
fe80::5054:ff:fef0:1000 dev enp16s0f3.200 lladdr 52:54:00:f0:10:00 DELAY
```
## What's Next
I've sent the change, about 850 lines of code, off for review. You can follow along on
Gerrit [[44899](https://gerrit.fd.io/r/c/vpp/+/44899)]. I'm happy to have fixed the quad-loop
encap bug, but it does show me that SRv6 (at least in L2 transport mode) is not super common for
VPP, and perhaps not common in the industry either. I am not convinced that I want to use this in
production on AS8298, but if I did, the basic functionality would be: adding an IPv6 prefix to each
of the loopback devices in order to attract traffic to the router, adding an 'End' localsid on every
router so that they can participate in multi-hop SRv6, and adding some static config to
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] to do the encap/decap for L2VPN. By the way, there's a
whole world of encap and decap behaviors, including L3VPN for IPv4, IPv6, GTP-U, and so on.
As for me, I've still set my sights on eVPN VxLAN as a destination, because that will give me a
multi-point ethernet mesh akin to VPLS. However, there's a lot of ground to cover,
considering IPng uses Bird2 as its routing control plane. Bird2 is starting to get eVPN support, but
there's a lot for me to learn. Stay tuned!