---
date: "2023-05-07T10:01:14Z"
title: VPP MPLS - Part 1
aliases:
- /s/articles/2023/05/07/vpp-mpls-1.html
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

I've deployed an MPLS core for IPng Networks, which allows me to provide L2VPN services, and at the
same time keep an IPng Site Local network with IPv4 and IPv6 that is separate from the internet,
based on hardware/silicon based forwarding at line rate and high availability. You can read all
about my Centec MPLS shenanigans in [[this article]({{< ref "2023-03-11-mpls-core" >}})].

Ever since the release of the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)]
plugin in VPP, folks have asked "What about MPLS?" -- I have never really felt the need to go down this
rabbit hole, because I figured that in this day and age, higher level IP protocols that do tunneling
are just as performant, and a little bit less of an 'art' to get right. For example, the Centec
switches I deployed perform VxLAN, GENEVE and GRE all at line rate in silicon. And in an earlier
article, I showed that the performance of VPP in these tunneling protocols is actually pretty good.
Take a look at my [[VPP L2 article]({{< ref "2022-01-12-vpp-l2" >}})] for context.

You might ask yourself: _Then why bother?_ To which I would respond: if you have to ask that question,
clearly you don't know me :) This article will form a deep dive into MPLS as implemented by VPP. In
a later set of articles, I'll partner with the incomparable [@vifino](https://chaos.social/@vifino) who
is adding MPLS support to the Linux Controlplane plugin. After that, I do expect VPP to be able to act
as a fully fledged provider- and provider-edge MPLS router.

## Lab Setup

A while ago I created a [[VPP Lab]({{< ref "2022-10-14-lab-1" >}})] which is pretty slick, I use it
all the time. Most of the time I find myself messing around on the hypervisor and adding namespaces
with interfaces in it, to pair up with the VPP interfaces. And I tcpdump a lot! It's time for me to
make an upgrade to the Lab -- take a look at this picture:

{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}

There's quite a bit to unpack here, but it will be useful to know this layout as I'll be referring
to the components here throughout the rest of the article. Each **lab** now has seven virtual
machines:

1.   **vppX-Y** are Debian Testing machines running a reasonably fresh VPP - they are daisychained
with the first one attaching to the headend called `lab.ipng.ch`, using its Gi10/0/0 interface, and
onwards to its eastbound neighbor vpp0-1 using its Gi10/0/1 interface.
1.   **hostX-Y** are two Debian machines which have their 4 network cards (enp16s0fX) connected each
to one VPP instance's Gi10/0/2 interface (for `host0-0`) or Gi10/0/3 (for `host0-1`). This way, I
can test all sorts of topologies with one router, two routers, or multiple routers.
1.   **tapX-0** is a special virtual machine which receives a copy of every packet on the underlying
Open vSwitch network fabric.

***NOTE***: X is the 0-based lab number, and Y stands for the 0-based logical machine number, so `vpp1-3`
is the fourth VPP virtual machine on the second lab.

### Detour 1: Open vSwitch

To explain this tap a little bit - let me first talk about the underlay. All seven of these machines
(and each of their four network cards) are bound by the hypervisor into an Open vSwitch bridge called
`vpplan`. Then, I use two features to build this topology:

Firstly, each pair of interfaces will be added as an access port into individual VLANs. For
example, `vpp0-0.Gi10/0/1` connects with `vpp0-1.Gi10/0/0` in VLAN 20 (annotated in orange), and
`vpp0-0.Gi10/0/2` connects to `host0-0.enp16s0f0` in VLAN 30 (annotated in purple). You can see the
East-West traffic over the VPP backbone is in the 20s, the `host0-0` traffic northbound is in the
30s, and the `host0-1` traffic southbound is in the 40s. Finally, the whole Open vSwitch fabric is
connected to `lab.ipng.ch` using VLAN 10 and a physical network card on the hypervisor (annotated in
green). The `lab.ipng.ch` machine then has internet connectivity.

```
BR=vpplan
for p in $(ovs-vsctl list-ifaces $BR); do
  ovs-vsctl set port $p vlan_mode=access
done

# Uplink (Green)
ovs-vsctl set port uplink tag=10    ## eno1.200 on the Hypervisor
ovs-vsctl set port vpp0-0-0 tag=10

# Backbone (Orange)
ovs-vsctl set port vpp0-0-1 tag=20
ovs-vsctl set port vpp0-1-0 tag=20
...

# Northbound (Purple)
ovs-vsctl set port vpp0-0-2 tag=30
ovs-vsctl set port host0-0-0 tag=30
...

# Southbound (Red)
...
ovs-vsctl set port vpp0-3-3 tag=43
ovs-vsctl set port host0-1-3 tag=43
```

**NOTE**: The KVM interface names take the form `vppX-Y-Z`, where X is the lab number (0 in this case --
IPng does have multiple labs so I can run experiments and lab environments independently and isolated),
Y is the machine number, and Z is the interface number on the machine (from [0..3]).

### Detour 2: Mirroring Traffic

Secondly, now that I have created a 29 port switch with 12 VLANs, I decide to create an OVS _mirror
port_, which can be used to make a copy of traffic going in- or out of (a list of) ports. This is a
super powerful feature, and it looks like this:

```
BR=vpplan
MIRROR=mirror-rx
ovs-vsctl set port tap0-0-0 vlan_mode=access

ovs-vsctl list mirror $MIRROR >/dev/null 2>&1 && \
  ovs-vsctl --id=@m get mirror $MIRROR -- remove bridge $BR mirrors @m

ovs-vsctl --id=@m create mirror name=$MIRROR \
  -- --id=@p get port tap0-0-0 \
  -- add bridge $BR mirrors @m \
  -- set mirror $MIRROR output-port=@p \
  -- set mirror $MIRROR select_dst_port=[] \
  -- set mirror $MIRROR select_src_port=[]

for iface in $(ovs-vsctl list-ports $BR); do
  [[ $iface == tap* ]] && continue
  ovs-vsctl add mirror $MIRROR select_dst_port $(ovs-vsctl get port $iface _uuid)
done
```

The first call sets up the OVS switchport called `tap0-0-0` (which is enp16s0f0 on the machine
`tap0-0`) as an access port. To allow for this script to be idempotent, the second line will look up
if the mirror exists and if so, delete it. Then, I (re)create a mirror port with a given name
(`mirror-rx`), add it to the bridge, make the mirror's output port become `tap0-0-0`, and finally
clear the selected source and destination ports (this is where the traffic is mirrored _from_). At
this point, I have an empty mirror. To give it something useful to do, I loop over all of the ports
in the `vpplan` bridge and add each of them to the mirror as a _destination_ port (here I have
to specify the uuid of the port, not its name). I will add all interfaces, except those
of the `tap0-0` machine itself, to avoid loops.

In the end, I create two of these, one called `mirror-rx` which is forwarded to `tap0-0-0`
(enp16s0f0) and the other called `mirror-tx` which is forwarded to `tap0-0-1` (enp16s0f1). I can use
tcpdump on either of these ports, to show all the traffic either going _ingress_ to any port on any
machine, or emitting _egress_ from any port on any machine, respectively.

## Preparing the LAB

I wrote a little bit about the automation I use to maintain a few reproducible lab environments in a
[[previous article]({{< ref "2022-10-14-lab-1" >}})], so I'll only show the commands themselves here,
not the underlying systems. When the LAB boots up, it comes with a basic Linux CP configuration that
uses OSPF and OSPFv3 running in Bird2, to connect the `vpp0-0` through `vpp0-3` machines together (each
router's Gi10/0/1 port connects to the next router's Gi10/0/0 port). LAB0 is in use by
[@vifino](https://chaos.social/@vifino) at the moment, so I'll take the next one running on its own
hypervisor, called LAB1.

Each machine has an IPv4 and IPv6 loopback, so the LAB will come up with basic connectivity:

```
pim@lab:~/src/lab$ LAB=1 ./create
pim@lab:~/src/lab$ LAB=1 ./command pristine
pim@lab:~/src/lab$ LAB=1 ./command start && sleep 150
pim@lab:~/src/lab$ traceroute6 vpp1-3.lab.ipng.ch
traceroute to vpp1-3.lab.ipng.ch (2001:678:d78:211::3), 30 hops max, 24 byte packets
 1  e0.vpp1-0.lab.ipng.ch (2001:678:d78:211::fffe)  2.0363 ms  2.0123 ms  2.0138 ms
 2  e0.vpp1-1.lab.ipng.ch (2001:678:d78:211::1:11)  3.0969 ms  3.1261 ms  3.3413 ms
 3  e0.vpp1-2.lab.ipng.ch (2001:678:d78:211::2:12)  6.4845 ms  6.3981 ms  6.5409 ms
 4  vpp1-3.lab.ipng.ch (2001:678:d78:211::3)  7.4610 ms  7.5698 ms  7.6413 ms
```

## MPLS For Dummies

.. like me! MPLS stands for [[Multi Protocol Label
Switching](https://en.wikipedia.org/wiki/Multiprotocol_Label_Switching)]. Rather than looking at the
IPv4 or IPv6 header in the packet, and making the routing decision based on the destination address,
MPLS takes the whole packet and encapsulates it into a new datagram that carries a 20-bit number
(called the _label_), three bits to classify the traffic, one _S-bit_ to signal that this is the
last label in a stack of labels, and finally 8 bits of TTL.

In total, 32 bits are prepended to the whole IP packet, or Ethernet frame, or any other type of inner
datagram. This is why it's called _Multi Protocol_. The _S-bit_ allows routers to know if the
following data is the inner payload (S=1), or if the following 32 bits are another MPLS label (S=0).
This way, routers can add more than one label into a _label stack_.
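
To make the bit layout concrete, here is a minimal sketch in Python that packs and unpacks one such 32-bit entry. The names `LabelStackEntry`, `pack_entry` and `unpack_entry` are made up for this illustration, and the label value 100 is just an example:

```python
import struct
from typing import NamedTuple


class LabelStackEntry(NamedTuple):
    label: int  # 20 bits
    tc: int     # 3 bits, used to classify the traffic
    s: int      # 1 bit, set to 1 on the last label in the stack
    ttl: int    # 8 bits


def pack_entry(e: LabelStackEntry) -> bytes:
    """Encode one label stack entry as 4 bytes in network byte order."""
    if not (0 <= e.label < 2**20 and 0 <= e.tc < 8 and e.s in (0, 1) and 0 <= e.ttl < 256):
        raise ValueError(f"field out of range: {e}")
    word = (e.label << 12) | (e.tc << 9) | (e.s << 8) | e.ttl
    return struct.pack("!I", word)


def unpack_entry(data: bytes) -> LabelStackEntry:
    """Decode the first 4 bytes of data into its four fields."""
    (word,) = struct.unpack("!I", data[:4])
    return LabelStackEntry(word >> 12, (word >> 9) & 0x7, (word >> 8) & 0x1, word & 0xFF)


wire = pack_entry(LabelStackEntry(label=100, tc=0, s=1, ttl=64))
print(wire.hex())          # 00064140
print(unpack_entry(wire))
```

Everything that follows these four bytes is either the inner payload (S=1) or another four-byte entry (S=0).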

Forwarding decisions are made on the contents of this MPLS _label_, without the need to examine
the packet itself. Two significant benefits become obvious:

1.   The inner data payload (ie. an IPv6 packet or an Ethernet frame) doesn't have to be
rewritten, no new checksum created, no TTL decremented. Any datagram can be stuffed into an MPLS
packet, the routing (or _packet switching_) entirely happens using only the MPLS headers.

2.   Importantly, no source- or destination IP addresses have to be looked up in a possibly very
large (~1M entries) FIB to figure out the next hop. Rather than traversing a [[Radix
Trie](https://en.wikipedia.org/wiki/Radix_tree)] or other datastructure to find the next-hop, a
static [[Hash Table](https://en.wikipedia.org/wiki/Hash_table)] with literal integer MPLS labels
can be consulted. This greatly simplifies the computational complexity in transit.

***P-Router***: The simplest form of an MPLS router is a so-called Label-Switch-Router (_LSR_) which is synonymous
with Provider-Router (_P-Router_). This is the router that sits in the core of the network, and its
only purpose is to receive MPLS packets, look up what to do with them based on the _label_ value,
and then forward the packet onto the next router.  Sometimes the router can (and will) rewrite the
label, in an operation called a SWAP, but it can also leave the label as it was (in other words, the
input label value can be the same as the outgoing label value). The logic kind of goes like
**MPLS In-Label** => **{ MPLS Out-Label, Out-Interface, Out-NextHop }**. It's this behavior
that explains the name _Label Switching_.

If you were to imagine plotting a path through the lab network from say `vpp1-0` on the one side,
through `vpp1-1` and `vpp1-2` and finally onwards to `vpp1-3`, each router would be receiving MPLS packets on one
interface, and emitting them on their way to the next router on another interface. That *path* of
*switching* operations on the *labels* of those MPLS packets thus forms a so-called _Label-Switched-Path
(LSP)_. These LSPs are fundamental building blocks of MPLS networks, as I'll demonstrate later.
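
As a toy model of such an LSP, each _P-Router_ can be thought of as holding a table keyed on the incoming label. The router names below follow the lab, but the label values, the table layout and the `walk_lsp` helper are hypothetical and only meant to illustrate the idea; this is not how VPP stores its MPLS FIB:

```python
from typing import Dict, List, NamedTuple, Tuple


class LfibEntry(NamedTuple):
    out_label: int
    out_interface: str
    next_hop: str


# Hypothetical tables: MPLS In-Label => { Out-Label, Out-Interface, Out-NextHop }
LFIB: Dict[str, Dict[int, LfibEntry]] = {
    "vpp1-1": {100: LfibEntry(200, "Gi10/0/1", "vpp1-2")},  # SWAP 100 -> 200
    "vpp1-2": {200: LfibEntry(200, "Gi10/0/1", "vpp1-3")},  # label stays the same
}


def walk_lsp(router: str, label: int, ttl: int) -> Tuple[str, int, int, List[str]]:
    """Follow the label operations hop by hop until a router has no entry."""
    path = [router]
    while label in LFIB.get(router, {}):
        entry = LFIB[router][label]
        ttl -= 1  # every switching hop decrements the MPLS TTL
        if ttl <= 0:
            raise RuntimeError(f"MPLS TTL expired at {router}")
        router, label = entry.next_hop, entry.out_label
        path.append(router)
    return router, label, ttl, path


# A packet that vpp1-0 labelled with 100 and handed to vpp1-1:
print(walk_lsp("vpp1-1", 100, 64))
```

The walk ends at `vpp1-3` because it has no entry in this toy table; in a real network that is where the edge router takes over.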

***PE-Router***: Some routers have a less boring job to do - those that sit at the edge of an MPLS network, accept
customer traffic and do something useful with it. Such a router is called a Label-Edge-Router (_LER_), which
is often colloquially called a Provider-Edge-Router (_PE-Router_). These routers receive normal
packets (ethernet or IP or otherwise), and perform the encapsulation by adding MPLS labels to them
upon receipt (ingress, called PUSH), or removing the encapsulation (called POP) and finding the
inner payload, continuing to handle them as per normal. The logic for these can be much more
complicated, but you can imagine it goes something like **MPLS In-Label** => **{ Operation }**
where the operation may be "take the resulting datagram, assume it is an IPv4 packet, so look it up
in the IPv4 routing table" or "take the resulting datagram, assume it is an ethernet frame, and emit
it on a specific interface", and really any number of other "operations".

The cool thing about MPLS is that the types of operations are vendor-extensible. If two routers A and B
agree what label 1234 means _to them_, they can simply insert it at the bottom of the _labelstack_, say
{100,1234}, where the outer one (the 100 label that all the _P-Routers_ see) carries the semantic
meaning of "switch this packet onto the destination _PE-router_", where that _PE-router_ can pop the
outer label, to reveal the 1234-label, which it can look up in its table to tell it what to do next
with the MPLS payload in any way it chooses - the _P-Routers_ don't have to understand the meaning
of label 1234, they don't have to use or inspect it at all!
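
A rough sketch of that two-label idea in Python follows. The helper names, the `OPERATIONS` table and the meaning attached to label 1234 are all invented for illustration; only the {100,1234} stack comes from the example above:

```python
import struct
from typing import Callable, Dict, List, Tuple


def encode_stack(labels: List[int], ttl: int = 64) -> bytes:
    """Encode labels outermost first; only the last entry gets S=1."""
    out = b""
    for i, label in enumerate(labels):
        s = 1 if i == len(labels) - 1 else 0
        out += struct.pack("!I", (label << 12) | (s << 8) | ttl)
    return out


def pop(packet: bytes) -> Tuple[int, int, bytes]:
    """Remove the outermost entry and return (label, s_bit, remainder)."""
    (word,) = struct.unpack("!I", packet[:4])
    return word >> 12, (word >> 8) & 1, packet[4:]


# Hypothetical agreement between routers A and B on what an inner label means.
OPERATIONS: Dict[int, Callable[[bytes], str]] = {
    1234: lambda payload: f"emit {len(payload)}-byte ethernet frame on a customer port",
}


def pe_receive(packet: bytes) -> str:
    outer, s, rest = pop(packet)  # the label the P-Routers switched on
    if s == 1:
        return f"label {outer} was the only label, payload is {len(rest)} bytes"
    inner, _, payload = pop(rest)  # revealed only at the PE-Router
    return OPERATIONS[inner](payload)


packet = encode_stack([100, 1234]) + bytes(60)  # 60 zero bytes stand in for a frame
print(encode_stack([100, 1234]).hex())
print(pe_receive(packet))
```

Note that only `pe_receive` ever looks at the second entry, which mirrors the point that transit routers never need to inspect it.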

### Step 0: End Host setup

{{< image src="/assets/vpp-mpls/LAB v2 (1).svg" alt="Lab Setup" >}}

For this lab, I'm going to boot up instance LAB1 with no changes (for posterity, using image
`vpp-proto-disk0@20230403-release`). As an aside, IPng Networks has several of these lab
environments, and while [@vifino](https://chaos.social/@vifino) is doing some development testing on
LAB0, I simply switch to LAB1 to let him work in peace.

With the MPLS concepts introduced, let me start by configuring `host1-0` and `host1-1` and giving them
an IPv4 loopback address, and a transit network to their routers `vpp1-3` and `vpp1-0` respectively:

```
root@host1-1:~# ip link set enp16s0f0 up mtu 1500
root@host1-1:~# ip addr add 192.0.2.2/31 dev enp16s0f0
root@host1-1:~# ip addr add 10.0.1.1/32 dev lo
root@host1-1:~# ip ro add 10.0.1.0/32 via 192.0.2.3

root@host1-0:~# ip link set enp16s0f3 up mtu 1500
root@host1-0:~# ip addr add 192.0.2.0/31 dev enp16s0f3
root@host1-0:~# ip addr add 10.0.1.0/32 dev lo
root@host1-0:~# ip ro add 10.0.1.1/32 via 192.0.2.1
root@host1-0:~# ping -I 10.0.1.0 10.0.1.1
```

At this point, I don't expect to see much, as I haven't configured VPP yet. But `host1-0` will start
ARPing for 192.0.2.1 on `enp16s0f3`, which is connected to `vpp1-3.e2`. Let me take a look on the
Open vSwitch mirror to confirm that:

```
root@tap1-0:~# tcpdump -vni enp16s0f0 vlan 33
12:41:27.174052 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:41:28.333901 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:41:29.517415 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:41:30.645418 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
```

Alright! I'm going to leave the ping running in the background, and I'll trace packets through the
network using the Open vSwitch mirror, as well as take a look at what VPP is doing with the packets
using its packet tracer.

### Step 1: PE Ingress

```
vpp1-3# set interface state GigabitEthernet10/0/2 up
vpp1-3# set interface ip address GigabitEthernet10/0/2 192.0.2.1/31
vpp1-3# mpls table add 0
vpp1-3# set interface mpls GigabitEthernet10/0/1 enable
vpp1-3# set interface mpls GigabitEthernet10/0/0 enable
vpp1-3# ip route add 10.0.1.1/32 via 192.168.11.10 GigabitEthernet10/0/0 out-labels 100
```

Now the ARP resolution succeeds, and I can see that `host1-0` starts sending ICMP packets towards the
loopback that I have configured on `host1-1`, and it's of course using the newly learned L2 adjacency
for 192.0.2.1 at 52:54:00:13:10:02 (which is `vpp1-3.e2`). But, take a look at what the VPP router
does next: due to the `ip route add ...` command, I've told it to reach 10.0.1.1 via a nexthop of
`vpp1-2.e1`, but it will PUSH a single MPLS label 100,S=1 and forward it out on its Gi10/0/0
interface:

```
root@tap1-0:~# tcpdump -eni enp16s0f0 vlan or mpls
12:45:56.551896 52:54:00:20:10:03 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33
  p 0, ethertype ARP (0x0806), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:45:56.553311 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 46: vlan 33
  p 0, ethertype ARP (0x0806), Reply 192.0.2.1 is-at 52:54:00:13:10:02, length 28

12:45:56.620924 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
  p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 38791, seq 184, length 64
12:45:56.621473 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 38791, seq 184, length 64
```

My MPLS journey on VPP has officially begun! The first exchange in the tcpdump (packets 1 and 2)
is the ARP resolution of 192.0.2.1 by `host1-0`, after which it knows where to send the ICMP echo
(packet 3, on VLAN33), which is then sent out by `vpp1-3` as MPLS to `vpp1-2` (packet 4, on VLAN22).
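
The frame lengths in this capture are a nice sanity check, because the only difference between packet 3 and packet 4 should be one 4-byte MPLS entry. A small sketch of the arithmetic in Python, where the constant and function names are made up and the payload sizes are taken from the tcpdump output above (assuming an IPv4 header without options):

```python
ETHERNET_HEADER = 14  # destination MAC, source MAC, ethertype
DOT1Q_TAG = 4         # VLAN tag, visible in the mirrored capture
MPLS_ENTRY = 4        # one 32-bit label stack entry


def frame_length(payload: int, mpls_labels: int = 0, vlan: bool = True) -> int:
    """Length in bytes of an ethernet frame carrying `payload` bytes."""
    return ETHERNET_HEADER + (DOT1Q_TAG if vlan else 0) + mpls_labels * MPLS_ENTRY + payload


ARP = 28             # "length 28" in the capture
IPV4_ICMP = 20 + 64  # 20-byte IPv4 header plus ICMP "length 64"

print(frame_length(ARP))                       # 46: ARP on VLAN 33
print(frame_length(IPV4_ICMP))                 # 102: ICMP echo request on VLAN 33
print(frame_length(IPV4_ICMP, mpls_labels=1))  # 106: same packet with one label on VLAN 22
```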

Let me show you what such a packet looks like from the point of view of VPP. It has a _packet
tracing_ function which shows how any individual packet traverses the graph of nodes through the
router. It's a lot of information, but as a VPP operator, let alone a developer, it's a really
important skill to learn -- so off I go, capturing and tracing a handful of packets:

```
vpp1-3# trace add dpdk-input 10
vpp1-3# show trace
------------------- Start of thread 0 vpp_main -------------------
Packet 1

20:15:00:496109: dpdk-input
  GigabitEthernet10/0/2 rx queue 0
  buffer 0x4c44df: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid
  PKT MBUF: port 2, nb_segs 1, pkt_len 98
    buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x2ed13840
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  IP4: 52:54:00:20:10:03 -> 52:54:00:13:10:02
  ICMP: 10.0.1.0 -> 10.0.1.1
    tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN
    fragment id 0x2706, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x3bd6 id 8399

20:15:00:496167: ethernet-input
  frame: flags 0x1, hw-if-index 3, sw-if-index 3
  IP4: 52:54:00:20:10:03 -> 52:54:00:13:10:02

20:15:00:496201: ip4-input
  ICMP: 10.0.1.0 -> 10.0.1.1
    tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN
    fragment id 0x2706, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x3bd6 id 8399

20:15:00:496225: ip4-lookup
  fib 0 dpo-idx 1 flow hash: 0x00000000
  ICMP: 10.0.1.0 -> 10.0.1.1
    tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN
    fragment id 0x2706, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x3bd6 id 8399

20:15:00:496256: ip4-mpls-label-imposition-pipe
    mpls-header:[100:64:0:eos]

20:15:00:496258: mpls-output
  adj-idx 25 : mpls via 192.168.11.10 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254001210015254001310008847 flow hash: 0x00000000

20:15:00:496260: GigabitEthernet10/0/0-output
  GigabitEthernet10/0/0 flags 0x0018000d
  MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
  label 100 exp 0, s 1, ttl 64

20:15:00:496262: GigabitEthernet10/0/0-tx
  GigabitEthernet10/0/0 tx queue 0
  buffer 0x4c44df: current data -4, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid
                   l2-hdr-offset 0 l3-hdr-offset 14
  PKT MBUF: port 2, nb_segs 1, pkt_len 102
    buf_len 2176, data_len 102, ol_flags 0x0, data_off 124, phys_addr 0x2ed13840
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
  label 100 exp 0, s 1, ttl 64
```

This packet has gone through a total of eight nodes, and the local timestamps are the uptime of VPP
when the packets were received. I'll try to explain them in turn:

1.   ***dpdk-input***: The packet is initially received by DPDK from Gi10/0/2 receive queue 0. It was an
ethernet packet from 52:54:00:20:10:03 (`host1-0.enp16s0f3`) to 52:54:00:13:10:02 (`vpp1-3.e2`). Some
more information is gleaned here, notably that it was an ethernet frame, an L3 IPv4 and L4 ICMP
packet.
1.   ***ethernet-input***: Since it was an ethernet frame, it was passed into this node. Here VPP
concludes that this is an IPv4 packet, because the ethertype is 0x0800.
1.   ***ip4-input***: We know it's an IPv4 packet, and the layer4 information shows this is an ICMP
echo packet from 10.0.1.0 to 10.0.1.1 (configured on `host1-1.lo`). VPP now needs to figure out where
to route this packet.
1.   ***ip4-lookup***: VPP takes a look at its FIB for 10.0.1.1 - note the information I specified
above (the `ip route add ...` on `vpp1-3`) - the next-hop here is 192.168.11.10 on Gi10/0/0 _but_ VPP
also sees that I'm intending to add an MPLS _out-label_ of 100.
1.   ***ip4-mpls-label-imposition-pipe***: An MPLS packet header is prepended in front of the IPv4
packet, which will have only one label (100) and since it's the only label, it will set the S-bit
(end-of-stack) to 1, and the MPLS TTL initializes at 64.
1.   ***mpls-output***: Now that the IPv4 packet is wrapped into an MPLS packet, VPP uses the rest
of the FIB entry (notably the next-hop 192.168.11.10 and the output interface Gi10/0/0) to find where
this thing is supposed to go.
1.   ***Gi10/0/0-output***: VPP now prepares the packet to be sent out on Gi10/0/0 as an MPLS
ethernet type. It uses the L2FIB adjacency table to figure out that we'll be sending it from our mac
address 52:54:00:13:10:00 (`vpp1-3.e0`) to the next hop on 52:54:00:12:10:01 (`vpp1-2.e1`).
1.   ***Gi10/0/0-tx***: VPP hands the fully formed packet with all necessary information back to
DPDK to marshall it on the wire.
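
One detail worth decoding is the hex string at the end of the `mpls-output` line in the trace. Reading it as a 14-byte ethernet header is an assumption on my part, but the fields it yields match the MAC addresses and ethertype shown by the `Gi10/0/0-output` node. The `decode_rewrite` helper is hypothetical:

```python
def decode_rewrite(hexstr: str) -> dict:
    """Split a 14-byte ethernet header into destination, source and ethertype."""
    raw = bytes.fromhex(hexstr)
    if len(raw) != 14:
        raise ValueError(f"expected 14 bytes, got {len(raw)}")

    def mac(b: bytes) -> str:
        return ":".join(f"{octet:02x}" for octet in b)

    return {
        "dst": mac(raw[0:6]),
        "src": mac(raw[6:12]),
        "ethertype": f"0x{int.from_bytes(raw[12:14], 'big'):04x}",
    }


# The string from the mpls-output node in the trace above
print(decode_rewrite("5254001210015254001310008847"))
```

The result names `vpp1-2.e1` as destination, `vpp1-3.e0` as source, and the MPLS unicast ethertype.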

Can you imagine this router can do such a thing at a rate of 18-20 Million packets per second,
linearly scaling up per added CPU thread? I learn something new every time I look at a packet trace,
I simply love this dataplane implementation!

### Step 2: P-routers

In Step 1 I've shown that `vpp1-3` did send the MPLS packet to `vpp1-2`, but I haven't configured
anything there yet, and because I didn't enable MPLS, each of these beautiful packets is brutally
sent to the bit-bucket (also called _dpo-drop_):

```
vpp1-2# show err
   Count                  Node                              Reason               Severity
       132             mpls-input              MPLS input packets decapsulated     info
       132             mpls-input                      MPLS not enabled            error
```

The purpose of a _P-router_ is to switch labels along the _Label-Switched-Path_. So let's manually
create the following to tell this `vpp1-2` router what to do when it receives an MPLS frame with
label 100:

```
vpp1-2# mpls table add 0
vpp1-2# set interface mpls GigabitEthernet10/0/0 enable
vpp1-2# set interface mpls GigabitEthernet10/0/1 enable
vpp1-2# mpls local-label add 100 eos via 192.168.11.8 GigabitEthernet10/0/0 out-labels 100
```

Remember, above I explained that the _P-Router_ has a simple job? It really does! All I'm doing here
is telling VPP, that if it receives an MPLS packet on any MPLS-enabled interface (notably Gi10/0/1
from which it is currently receiving MPLS packets from `vpp1-3`), that it should send the MPLS packet
out on Gi10/0/0 to neighbor 192.168.11.8 after imposing label 100.

If I've done a good job, I should be able to see this packet traversing the P-Router in a packet
trace:

```
20:42:51:151144: dpdk-input
  GigabitEthernet10/0/1 rx queue 0
  buffer 0x4c7d8b: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid
  PKT MBUF: port 1, nb_segs 1, pkt_len 102
    buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1d1f6340
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
  label 100 exp 0, s 1, ttl 64

20:42:51:151161: ethernet-input
  frame: flags 0x1, hw-if-index 2, sw-if-index 2
  MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01

20:42:51:151171: mpls-input
  MPLS: next mpls-lookup[1]  label 100 ttl 64 exp 0

20:42:51:151174: mpls-lookup
  MPLS: next [6], lookup fib index 0, LB index 74 hash 0 label 100 eos 1

20:42:51:151177: mpls-label-imposition-pipe
    mpls-header:[100:63:0:eos]

20:42:51:151179: mpls-output
  adj-idx 28 : mpls via 192.168.11.8 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254001110015254001210008847 flow hash: 0x00000000

20:42:51:151181: GigabitEthernet10/0/0-output
  GigabitEthernet10/0/0 flags 0x0018000d
  MPLS: 52:54:00:12:10:00 -> 52:54:00:11:10:01
  label 100 exp 0, s 1, ttl 63

20:42:51:151184: GigabitEthernet10/0/0-tx
  GigabitEthernet10/0/0 tx queue 0
  buffer 0x4c7d8b: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid
                   l2-hdr-offset 0 l3-hdr-offset 14
  PKT MBUF: port 1, nb_segs 1, pkt_len 102
    buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1d1f6340
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  MPLS: 52:54:00:12:10:00 -> 52:54:00:11:10:01
  label 100 exp 0, s 1, ttl 63
```

In order, the following nodes are traversed:
1.    ***dpdk-input***: received the frame from the network interface Gi10/0/1
1.    ***ethernet-input***: the frame was an ethernet frame, and VPP determines based on the
ethertype (0x8847) that it is an MPLS frame
1.    ***mpls-input***: inspects the MPLS labelstack and sees the outermost label (the only one on
this frame) with a value of 100.
1.    ***mpls-lookup***: looks up in the MPLS FIB what to do with packets which are End-Of-Stack or
_EOS_ (ie. with the S-bit set to 1), and are labeled 100. At this point VPP could make a different
choice if there is 1 label (as in this case), or a stack of multiple labels (Not-End-of-Stack or
_NEOS_, ie. with the S-bit set to 0).
1.    ***mpls-label-imposition-pipe***: reads from the FIB that the outer label needs to be SWAPd
to a new _out-label_ (also with value 100). Because it's the same label, this is a no-op. However,
since this router is forwarding the MPLS packet, it will decrement the TTL to 63.
1.    ***mpls-output***: VPP then uses the rest of the FIB information to determine the L3 nexthop is
192.168.11.8 on Gi10/0/0.
1.    ***Gi10/0/0-output***: uses the L2FIB adjacency table to determine that the L2 nexthop is MAC
address 52:54:00:11:10:01 (`vpp1-1.e1`). If there is no L2 adjacency, this would be a good time for
VPP to send an ARP request to resolve the IP-to-MAC and store it in the L2FIB.
1.    ***Gi10/0/0-tx***: hands off the frame to DPDK for marshalling on the wire.

If you counted with me, you'll see that this flow in the _P-Router_ also has eight nodes. However,
while the IPv4 FIB can and will be north of one million entries in a longest-prefix match radix trie
(which is computationally expensive), the MPLS FIB contains far fewer entries which are organized as
a literal key lookup in a hash table. In addition, unlike with IPv4 routing, the packet that is being
transported does not need its TTL decremented, which would require recalculating the IPv4 checksum. MPLS
switching is _much_ cheaper than IPv4 routing!
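
To illustrate the checksum part of that claim, here is a minimal Python sketch of what an IPv4 router has to do on every hop. The example header (ICMP from 10.0.1.0 to 10.0.1.1 with TTL 64) and the function names are chosen for illustration. It recomputes the checksum from scratch for clarity, whereas real dataplanes typically update it incrementally:

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """Ones' complement of the ones' complement sum of all 16-bit words."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def decrement_ttl(header: bytes) -> bytes:
    """Return a copy of the header with TTL-1 and a freshly computed checksum."""
    hdr = bytearray(header)
    hdr[8] -= 1                # TTL lives in byte 8
    hdr[10:12] = b"\x00\x00"   # zero the checksum field before summing
    hdr[10:12] = struct.pack("!H", ipv4_checksum(bytes(hdr)))
    return bytes(hdr)


# 20-byte header with the checksum field still zeroed
zeroed = bytes.fromhex("450000543dec4000400100000a0001000a000101")
before = zeroed[:10] + struct.pack("!H", ipv4_checksum(zeroed)) + zeroed[12:]
after = decrement_ttl(before)

print(f"ttl {before[8]} checksum 0x{before[10:12].hex()}")
print(f"ttl {after[8]} checksum 0x{after[10:12].hex()}")
```

An MPLS _P-Router_ skips all of this: it only touches the TTL in the 4-byte label entry, which carries no checksum of its own.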

Now that our packets are switched from `vpp1-2` to `vpp1-1` (which is also a _P-Router_), I'll just
rinse and repeat there, using the L3 adjacency pointing at `vpp1-0.e1` (192.168.11.6 on Gi10/0/0):

```
vpp1-1# mpls table add 0
vpp1-1# set interface mpls GigabitEthernet10/0/0 enable
vpp1-1# set interface mpls GigabitEthernet10/0/1 enable
vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100
```

Did I do this correctly? One way to check is by taking a look at which packets are seen on the Open
vSwitch mirror ports:


```
root@tap1-0:~# tcpdump -eni enp16s0f0
13:42:47.724107 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
  p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64

13:42:47.724769 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64

13:42:47.725038 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64

13:42:47.726155 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64
```

Nice!! I confirm that the ICMP packet first travels over VLAN 33 (from `host1-0` to `vpp1-3`), and then
the MPLS packets travel from `vpp1-3`, through `vpp1-2`, through `vpp1-1` and towards `vpp1-0` over VLAN
22, 21 and 20 respectively.

### Step 3: PE Egress

Seeing as I haven't done anything with `vpp1-0` yet, now the MPLS packets all get dropped there. But not
for much longer, as I'm now ready to tell `vpp1-0` what to do with those packets:

```
vpp1-0# mpls table add 0
vpp1-0# set interface mpls GigabitEthernet10/0/0 enable
vpp1-0# set interface mpls GigabitEthernet10/0/1 enable
vpp1-0# mpls local-label add 100 eos via ip4-lookup-in-table 0
vpp1-0# ip route add 10.0.1.1/32 via 192.0.2.2
```

The difference between the _P-Routers_ in Step 2 and this _PE-Router_, is the operation provided in
the MPLS FIB. When an MPLS packet with _label_ value 100 is received, instead of forwarding it into
another interface (which is what the _P-Router_ would do), I tell VPP here to unwrap the MPLS label,
and expect to find an IPv4 packet which I'm asking it to route by looking up an IPv4 next hop in the
(IPv4) FIB table 0.

All that's left for me to do is add a regular static route for 10.0.1.1/32 via 192.0.2.2 (which is
the address on interface `host1-1.enp16s0f0`).  If my thinking cap is still working, I should now see
packets emit from `vpp1-0` on Gi10/0/3:

```
vpp1-0# trace add dpdk-input 10
vpp1-0# show trace

21:34:39:370589: dpdk-input
  GigabitEthernet10/0/1 rx queue 0
  buffer 0x4c4a34: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid
  PKT MBUF: port 1, nb_segs 1, pkt_len 102
    buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x2ff28d80
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  MPLS: 52:54:00:11:10:00 -> 52:54:00:10:10:01
  label 100 exp 0, s 1, ttl 62

21:34:39:370672: ethernet-input
  frame: flags 0x1, hw-if-index 2, sw-if-index 2
  MPLS: 52:54:00:11:10:00 -> 52:54:00:10:10:01

21:34:39:370702: mpls-input
  MPLS: next mpls-lookup[1]  label 100 ttl 62 exp 0

21:34:39:370704: mpls-lookup
  MPLS: next [6], lookup fib index 0, LB index 83 hash 0 label 100 eos 1

21:34:39:370706: ip4-mpls-label-disposition-pipe
  rpf-id:-1 ip4, pipe

21:34:39:370708: lookup-ip4-dst
     fib-index:0 addr:10.0.1.1 load-balance:82

21:34:39:370710: ip4-rewrite
  tx_sw_if_index 4 dpo-idx 32 : ipv4 via 192.0.2.2 GigabitEthernet10/0/3: mtu:9000 next:9 flags:[] 5254002110005254001010030800 flow hash: 0x00000000
  00000000: 5254002110005254001010030800450000543dec40003e01e8bc0a0001000a00
  00000020: 01010800173d231c01a0fce65864000000009ce80b00000000001011

21:34:39:370735: GigabitEthernet10/0/3-output
  GigabitEthernet10/0/3 flags 0x0418000d
  IP4: 52:54:00:10:10:03 -> 52:54:00:21:10:00
  ICMP: 10.0.1.0 -> 10.0.1.1
    tos 0x00, ttl 62, length 84, checksum 0xe8bc dscp CS0 ecn NON_ECN
    fragment id 0x3dec, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x173d id 8988

21:34:39:370739: GigabitEthernet10/0/3-tx
  GigabitEthernet10/0/3 tx queue 0
  buffer 0x4c4a34: current data 4, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid
                   l2-hdr-offset 0 l3-hdr-offset 14 loop-counter 1
  PKT MBUF: port 1, nb_segs 1, pkt_len 98
    buf_len 2176, data_len 98, ol_flags 0x0, data_off 132, phys_addr 0x2ff28d80
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  IP4: 52:54:00:10:10:03 -> 52:54:00:21:10:00
  ICMP: 10.0.1.0 -> 10.0.1.1
    tos 0x00, ttl 62, length 84, checksum 0xe8bc dscp CS0 ecn NON_ECN
    fragment id 0x3dec, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x173d id 8988
```

Alright, another one of those huge blobs of information about a single packet traversing the VPP
dataplane, but it's the last one for this article, I promise! In order:

1.   ***dpdk-input***: DPDK reads the frame arriving from `vpp1-1` on Gi10/0/1 and determines
that it is an ethernet frame
1.   ***ethernet-input***: Based on the ethertype 0x8847, it knows that this ethernet frame is an
MPLS packet
1.   ***mpls-input***: The MPLS _labelstack_ has one label, value 100, with (obviously) the
EndOfStack _S-bit_ set to 1; I can also see the (MPLS) TTL is 62 here, because it has traversed three
routers (`vpp1-3` TTL=64, `vpp1-2` TTL=63, and `vpp1-1` TTL=62)
1.   ***mpls-lookup***: The lookup of local _label_ 100 informs VPP that it should switch to IPv4
processing and handle the packet as such
1.   ***ip4-mpls-label-disposition-pipe***: The MPLS label is removed, revealing an IPv4 packet as
the inner payload of the MPLS datagram
1.   ***lookup-ip4-dst***: VPP can now do a regular IPv4 forwarding table lookup for 10.0.1.1 which
informs it that it should forward the packet via 192.0.2.2 which is directly connected to Gi10/0/3.
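1.   ***ip4-rewrite***: The IPv4 TTL is decremented and the IP header checksum recomputed; the
adjacency for 192.0.2.2 on Gi10/0/3 also supplies the ethernet header, visible in the trace as the
rewrite string that starts with the nexthop MAC 52:54:00:21:10:00
1.   ***Gi10/0/3-output***: The frame is passed to the egress interface, now carrying
52:54:00:21:10:00 as its ethernet nexthop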
1.   ***Gi10/0/3-tx***: The packet is now handed off to DPDK to marshall on the wire, destined to
`host1-1.enp16s0f3`
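
The hex bytes printed by `ip4-rewrite` can be checked by hand against the decoded fields further
down in the trace. Below is a small Python sketch that does so for the first 34 bytes (ethernet plus
IPv4 header), copied from the trace above. The helper names `mac()` and `ip4_checksum()` are mine,
and the checksum routine is the generic ones' complement sum from RFC 1071, not VPP's code.

```python
import struct

# First 34 bytes of the hexdump in the ip4-rewrite trace above.
FRAME = bytes.fromhex(
    "5254002110005254001010030800"              # ethernet: dst, src, ethertype
    "450000543dec40003e01e8bc0a0001000a000101"  # IPv4 header (20 bytes, no options)
)


def mac(raw):
    """Format six bytes as a colon separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def ip4_checksum(header):
    """Ones' complement sum over 16 bit words, with the checksum field zeroed."""
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    total = sum(struct.unpack(f"!{len(zeroed) // 2}H", zeroed))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


dst, src = mac(FRAME[0:6]), mac(FRAME[6:12])
(ethertype,) = struct.unpack("!H", FRAME[12:14])
ip = FRAME[14:34]
ttl, proto = ip[8], ip[9]
(checksum,) = struct.unpack("!H", ip[10:12])

print(f"{src} -> {dst} ethertype 0x{ethertype:04x}")
print(f"ttl {ttl} proto {proto} checksum 0x{checksum:04x}")
print(f"recomputed checksum 0x{ip4_checksum(ip):04x}")
```

The recomputed value comes out as 0xe8bc, the same as the checksum in the trace, and the ethertype is
0x0800 rather than 0x8847 because the label is already gone by the time the packet is rewritten.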

That means I should be able to see it on `host1-1`, right? If you, too, are dying to know, check this out:

```
root@host1-1:~# tcpdump -ni enp16s0f0 icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:25:53.776486 IP 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8988, seq 1249, length 64
14:25:53.776522 IP 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 8988, seq 1249, length 64
14:25:54.799829 IP 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8988, seq 1250, length 64
14:25:54.799866 IP 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 8988, seq 1250, length 64
```

"Jiggle jiggle, wiggle wiggle!", as I do a premature congratulatory dance on the chair in my lab! I
created a _label-switched-path_ using VPP as MPLS provider-edge and provider routers, to move
this ICMP echo packet all the way from `host1-0` to `host1-1`, but there's absolutely nothing to
suggest that the resulting ICMP echo-reply can go back from `host1-1` to `host1-0`, because
_LSPs_ are unidirectional. The final step for me to do is create an _LSP_ back in the other
direction:

```
vpp1-0# ip route add 10.0.1.0/32 via 192.168.11.7 GigabitEthernet10/0/1 out-labels 103
vpp1-1# mpls local-label add 103 eos via 192.168.11.9 GigabitEthernet10/0/1 out-labels 103
vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103
vpp1-3# mpls local-label add 103 eos via ip4-lookup-in-table 0
vpp1-3# ip route add 10.0.1.0/32 via 192.0.2.0
```
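
To convince myself that these five commands form one contiguous path, I find it useful to walk
them hop by hop. The Python sketch below is only a paper model of the configuration above: the
tables `PUSH`, `SWAP` and `POP`, the function `walk_lsp()` and the mapping of each next hop address
to a router name are assumptions of mine, not something VPP exposes.

```python
# Assumption: which device owns each next hop address used in the commands above.
NEXT_HOP_OWNER = {
    "192.168.11.7": "vpp1-1",
    "192.168.11.9": "vpp1-2",
    "192.168.11.11": "vpp1-3",
    "192.0.2.0": "host1-0",
}

# Ingress PE: IPv4 prefix -> (next hop, label to push).
PUSH = {"vpp1-0": {"10.0.1.0/32": ("192.168.11.7", 103)}}
# P routers: incoming label -> (next hop, outgoing label).
SWAP = {
    "vpp1-1": {103: ("192.168.11.9", 103)},
    "vpp1-2": {103: ("192.168.11.11", 103)},
}
# Egress PE: incoming label -> IPv4 routes consulted after the pop.
POP = {"vpp1-3": {103: {"10.0.1.0/32": "192.0.2.0"}}}


def walk_lsp(prefix, ingress):
    """Follow a prefix from the ingress PE to the host behind the egress PE."""
    next_hop, label = PUSH[ingress][prefix]
    hops = [(ingress, "push", label)]
    router = NEXT_HOP_OWNER[next_hop]
    while router in SWAP:
        next_hop, label = SWAP[router][label]
        hops.append((router, "swap", label))
        router = NEXT_HOP_OWNER[next_hop]
    next_hop = POP[router][label][prefix]
    hops.append((router, "pop", label))
    hops.append((NEXT_HOP_OWNER[next_hop], "deliver", None))
    return hops


for hop in walk_lsp("10.0.1.0/32", "vpp1-0"):
    print(hop)
# Only the return direction is modelled here, so walk_lsp("10.0.1.1/32", "vpp1-3")
# raises KeyError: an LSP says nothing about the opposite direction.
```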

And with that, the ping I started at the beginning of this article springs to life:

```
root@host1-0:~# ping -I 10.0.1.0 10.0.1.1
PING 10.0.1.1 (10.0.1.1) from 10.0.1.0 : 56(84) bytes of data.
64 bytes from 10.0.1.1: icmp_seq=7644 ttl=62 time=6.28 ms
64 bytes from 10.0.1.1: icmp_seq=7645 ttl=62 time=7.45 ms
64 bytes from 10.0.1.1: icmp_seq=7646 ttl=62 time=7.01 ms
64 bytes from 10.0.1.1: icmp_seq=7647 ttl=62 time=5.76 ms
64 bytes from 10.0.1.1: icmp_seq=7648 ttl=62 time=5.88 ms
64 bytes from 10.0.1.1: icmp_seq=7649 ttl=62 time=9.23 ms
```

I will leave you with this packetdump from the Open vSwitch mirror, showing the entire flow of one
ICMP packet through the network:

```
root@tap1-0:~# tcpdump -c 10 -eni enp16s0f0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:41:07.526861 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
  p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.528103 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.529342 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.530421 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.531160 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40
  p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64

14:41:07.531455 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40
  p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.532245 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.532732 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.533923 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 62)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.535040 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33
  p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
10 packets captured
10 packets received by filter
```

You can see all of the attributes I demonstrated in this article in one go: ingress ICMP packet on
VLAN 33, encapsulation with label 100, S=1 and ttl decrementing as the MPLS packet traverses
eastwards through the string of VPP routers on VLANs 22, 21 and 20, ultimately being sent out on
VLAN 40. There, the ICMP echo request packet is responded to, and we can trace the ICMP response as
it makes its way back westwards through the MPLS network using label 103, ultimately re-appearing on
VLAN 33.
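
The packet dump also shows why the labeled frames are 106 bytes long while the plain IPv4 frames are
102: each MPLS label stack entry is exactly four bytes. As a sketch, assuming the standard RFC 3032
layout (20 bit label, 3 bit EXP, 1 bit S, 8 bit TTL), the three eastbound entries can be encoded and
decoded in Python like this. The function names `encode_lse()` and `decode_lse()` are mine:

```python
import struct


def encode_lse(label, exp, s, ttl):
    """Pack one MPLS label stack entry into 4 bytes (network byte order)."""
    word = (label << 12) | (exp << 9) | (s << 8) | ttl
    return struct.pack("!I", word)


def decode_lse(raw):
    """Unpack 4 bytes into the fields tcpdump prints for an MPLS header."""
    (word,) = struct.unpack("!I", raw)
    return {
        "label": word >> 12,
        "exp": (word >> 9) & 0x7,
        "s": (word >> 8) & 0x1,
        "ttl": word & 0xFF,
    }


# Label 100 with S=1, as seen on VLANs 22, 21 and 20 respectively.
for ttl in (64, 63, 62):
    raw = encode_lse(100, 0, 1, ttl)
    print(f"ttl={ttl} -> {raw.hex()}")
# prints:
# ttl=64 -> 00064140
# ttl=63 -> 0006413f
# ttl=62 -> 0006413e
```

Only the last byte changes from hop to hop, which is the TTL decrementing while the label, EXP and S
bits stay put.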

There you have it. This is a fun story on _Multiprotocol Label Switching (MPLS)_ bringing a packet from
a _Label-Edge-Router (LER)_ through several _Label-Switch-Routers (LSRs)_ over a statically
configured _Label-Switched-Path (LSP)_. I feel like I can now more confidently use these terms
without sounding silly.

## What's next

The first mission is accomplished. I've taken a good look at IPv4 forwarding in the VPP dataplane as
MPLS packets, thereby en- and decapsulating the traffic using _PE-Routers_ and forwarding the
traffic using intermediary _P-Routers_. MPLS switching is cheaper than IPv4/IPv6 routing, but it can
also open a bunch of possibilities regarding advanced service offerings, such as my coveted _Martini
Tunnels_ which transport ethernet frames point-to-point over an MPLS backbone. That will be the topic
of an upcoming article, and I will also join forces with [@vifino](https://chaos.social/@vifino) who is adding
Linux Controlplane functionality to program the MPLS FIB using Netlink -- such that things like 'ip'
and 'FRR' can discover and share label information using a Label Distribution Protocol or _LDP_.