---
date: "2023-05-17T10:01:14Z"
title: VPP MPLS - Part 2
aliases:
- /s/articles/2023/05/17/vpp-mpls-2.html
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

I've deployed an MPLS core for IPng Networks, which allows me to provide L2VPN services, and at the
same time keep an IPng Site Local network with IPv4 and IPv6 that is separate from the internet,
based on hardware/silicon based forwarding at line rate and high availability. You can read all
about my Centec MPLS shenanigans in [[this article]({{< ref "2023-03-11-mpls-core" >}})].

In the last article, I explored VPP's MPLS implementation a little bit. All the while,
[@vifino](https://chaos.social/@vifino) has been tinkering with the Linux Control Plane and adding
MPLS support to it, and together we learned a lot about how VPP does MPLS forwarding and how it
sometimes differs from other implementations. During the process, we talked a bit about
_implicit-null_ and _explicit-null_. When my buddy Fred read the [[previous article]({{< ref "2023-05-07-vpp-mpls-1" >}})], he also mentioned a feature called _penultimate-hop-popping_, which
deserves a bit more explanation. At the same time, I could not help but wonder what the
performance of VPP is as a _P-Router_ and _PE-Router_, compared to, say, IPv4 forwarding.

## Lab Setup: VMs

{{< image src="/assets/vpp-mpls/LAB v2 (1).svg" alt="Lab Setup" >}}

For this article, I'm going to boot up instance LAB1 with no changes (for posterity, using image
`vpp-proto-disk0@20230403-release`), and it will be in the same state it was at the end of my
previous [[MPLS article]({{< ref "2023-05-07-vpp-mpls-1" >}})]. To recap, there are four routers
daisychained in a string, and they are called `vpp1-0` through `vpp1-3`. I've then connected a
Debian virtual machine on both sides of the string. `host1-0.enp16s0f3` connects to `vpp1-3.e2`
and `host1-1.enp16s0f0` connects to `vpp1-0.e3`. Finally, recall that all of the links between these
routers and hosts can be inspected with the machine `tap1-0` which is connected to a mirror port on
the underlying Open vSwitch fabric. I bound some RFC1918 addresses on `host1-0` and `host1-1` and
can ping between the machines, using the VPP routers as MPLS transport.

### MPLS: Simple LSP

In this mode, I can plumb two _label switched paths (LSPs)_. The first one is westbound from `vpp1-3`
to `vpp1-0`, and it wraps the packet destined to 10.0.1.1 into an MPLS packet with a single label
100:

```
vpp1-3# ip route add 10.0.1.1/32 via 192.168.11.10 GigabitEthernet10/0/0 out-labels 100
vpp1-2# mpls local-label add 100 eos via 192.168.11.8 GigabitEthernet10/0/0 out-labels 100
vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100
vpp1-0# mpls local-label add 100 eos via ip4-lookup-in-table 0
vpp1-0# ip route add 10.0.1.1/32 via 192.0.2.2
```

The second is eastbound from `vpp1-0` to `vpp1-3`, and it uses MPLS label 103. Remember:
LSPs are unidirectional!

```
vpp1-0# ip route add 10.0.1.0/32 via 192.168.11.7 GigabitEthernet10/0/1 out-labels 103
vpp1-1# mpls local-label add 103 eos via 192.168.11.9 GigabitEthernet10/0/1 out-labels 103
vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103
vpp1-3# mpls local-label add 103 eos via ip4-lookup-in-table 0
vpp1-3# ip route add 10.0.1.0/32 via 192.0.2.0
```

With these two _LSPs_ established, the ICMP echo request and subsequent ICMP echo reply can be seen
traveling through the network entirely as MPLS:

```
root@tap1-0:~# tcpdump -c 10 -eni enp16s0f0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:41:07.526861 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
    p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.528103 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
    p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.529342 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21
    p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.530421 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20
    p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.531160 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40
    p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64

14:41:07.531455 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40
    p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.532245 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20
    p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.532732 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21
    p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.533923 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22
    p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 62)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.535040 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33
    p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
10 packets captured
10 packets received by filter
```

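Each `MPLS (label 100, exp 0, [S], ttl 64)` line that tcpdump prints is decoded from a single
32-bit MPLS label stack entry in the frame. As a sketch of that wire format (a little helper of my
own, not part of VPP or tcpdump), the fields unpack like this:

```python
import struct

def decode_mpls_shim(raw: bytes) -> dict:
    """Decode one 32-bit MPLS label stack entry (RFC 3032):
    label (20 bits) | exp (3 bits) | S bottom-of-stack (1 bit) | TTL (8 bits)."""
    (entry,) = struct.unpack("!I", raw)
    return {
        "label": entry >> 12,
        "exp": (entry >> 9) & 0x7,
        "s": (entry >> 8) & 0x1,
        "ttl": entry & 0xFF,
    }

# The first MPLS frame in the capture above: label 100, exp 0, S=1, ttl 64
print(decode_mpls_shim(bytes([0x00, 0x06, 0x41, 0x40])))
# {'label': 100, 'exp': 0, 's': 1, 'ttl': 64}
```

Note how the TTL inside the shim header is what the _P-Routers_ decrement at each hop, which is why
the capture shows ttl 64, 63, 62 along the path.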
When `vpp1-0` receives the MPLS frame with label 100,S=1, it looks up the label in its MPLS FIB to
figure out which _operation_ to perform on this packet: POP the label, revealing the inner payload,
which it must then look up in the IPv4 FIB and forward as per normal. This is a bit more expensive
than it could be, and the folks who established the MPLS protocols found a few clever ways to cut
down on cost!

### MPLS: Well-known Label Values

I didn't know this until I started tinkering with MPLS on VPP, and as an operator it's easy to
overlook these things. As it turns out, there are a few MPLS label values that have a very specific
meaning. Taking a read of [[RFC3032](https://www.rfc-editor.org/rfc/rfc3032.html)], label values 0-15
are reserved, and they each serve a specific purpose:

* ***Value 0***: IPv4 Explicit NULL Label
* ***Value 1***: Router Alert Label
* ***Value 2***: IPv6 Explicit NULL Label
* ***Value 3***: Implicit NULL Label

There are a few other label values, 4-15, and if you're curious you could take a look at the [[IANA
List](https://www.iana.org/assignments/mpls-label-values/mpls-label-values.xhtml)] for them. For my
purposes, though, I'm only going to look at these weird little _NULL_ labels. What do they do?

### MPLS: Explicit Null

RFC3032 discusses the IPv4 explicit NULL label, value 0 (and the IPv6 variant with value 2):

> This label value is only legal at the bottom of the label
> stack.  It indicates that the label stack must be popped,
> and the forwarding of the packet must then be based on the
> IPv4 header.

What this means in practice is that we can allow MPLS _PE-Routers_ to take a little shortcut. If the
MPLS label in the last hop is just telling the router to POP the label and take a look in its IPv4
forwarding table, I can also set the label to 0 in the router just preceding it. This way, when the
last router sees label value 0, it knows already what to do, saving it one FIB lookup.

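In other words, the egress router can branch on the reserved value before it ever consults its MPLS
FIB. A toy Python model of that shortcut (my own illustration of the idea, not actual VPP code —
the node names mirror VPP's graph nodes):

```python
IPV4_EXPLICIT_NULL = 0  # reserved label value from RFC 3032

def egress_lookups(label: int, mpls_fib: dict) -> list:
    """Which FIB lookups a PE egress router performs for an incoming label."""
    if label == IPV4_EXPLICIT_NULL:
        # Reserved value: pop and go straight to the IPv4 FIB.
        return ["ip4-lookup"]
    # Ordinary label: first find the operation in the MPLS FIB...
    operation = mpls_fib[label]
    if operation == "pop-and-ip4-lookup":
        # ...and only then look up the revealed IPv4 payload.
        return ["mpls-lookup", "ip4-lookup"]
    return ["mpls-lookup"]

print(egress_lookups(100, {100: "pop-and-ip4-lookup"}))  # two FIB lookups
print(egress_lookups(0, {}))                             # one lookup saved
```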
I can reconfigure both LSPs to make use of this feature, by changing the MPLS FIB entries on
`vpp1-1` and `vpp1-2` that point the _LSPs_ towards the egress routers, removing what I configured
before (`mpls local-label del ...`) and replacing it with an out-label value of 0 (`mpls
local-label add ...`):

```
vpp1-1# mpls local-label del 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100
vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 0

vpp1-2# mpls local-label del 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103
vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 0
```

Due to this, the last routers in the _LSP_ now already know what to do, so I can clean these up:

```
vpp1-0# mpls local-label del 100 eos via ip4-lookup-in-table 0
vpp1-3# mpls local-label del 103 eos via ip4-lookup-in-table 0
```

If I ping from `host1-0` to `host1-1` again, I can see a subtle but important difference in the
packets on the wire:

```
17:49:23.770119 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0,
    ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
17:49:23.770403 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0,
    ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
17:49:23.771184 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0,
    ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
17:49:23.772503 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0,
    ethertype MPLS unicast (0x8847), MPLS (label 0, exp 0, [S], ttl 62)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64
17:49:23.773392 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0,
    ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64

17:49:23.773602 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0,
    ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
17:49:23.774592 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0,
    ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
17:49:23.775804 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0,
    ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
17:49:23.776973 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0,
    ethertype MPLS unicast (0x8847), MPLS (label 0, exp 0, [S], ttl 62)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
17:49:23.778255 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0,
    ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64
```

Did you spot it? :) If your eyes are spinning, don't worry! I have configured the router `vpp1-1`
towards `vpp1-0` in VLAN 20 to use _IPv4 Explicit NULL_ (label 0). You can spot it in the fourth
packet of the tcpdump above. On the way back, `vpp1-2` towards `vpp1-3` in VLAN 22 also sets _IPv4
Explicit NULL_ for the echo-reply. But I do notice that, end to end, the packet still traverses
the network entirely as MPLS. The optimization here is that `vpp1-0` knows that label value
0 at the end of the label stack simply means 'what follows is an IPv4 packet, route it'.

### MPLS: Implicit Null

Did that really help that much? I think I can answer the question by loadtesting, but first let me
take a closer look at what RFC3032 has to say about the _Implicit NULL Label_:

> A value of 3 represents the "Implicit NULL Label".  This
> is a label that an LSR may assign and distribute, but
> which never actually appears in the encapsulation.  When
> an LSR would otherwise replace the label at the top of the
> stack with a new label, but the new label is "Implicit
> NULL", the LSR will pop the stack instead of doing the
> replacement.  Although this value may never appear in the
> encapsulation, it needs to be specified in the Label
> Distribution Protocol, so a value is reserved.

{{< image width="200px" float="right" src="/assets/vpp-mpls/PHP-logo.svg" alt="PHP Logo" >}}

Oh, groovy! What this tells me is that I can take one further shortcut. If I set the label value to
0 (_IPv4 Explicit NULL_) or 2 (_IPv6 Explicit NULL_), my last router in the chain will know to look
up the FIB entry automatically, saving one MPLS FIB lookup. But label value 3 (_Implicit NULL_)
tells the router to just unwrap the MPLS parts (it's looking at them anyway!) and forward the bare
inner payload, an IPv4 or IPv6 packet, directly onto the last router. This is what all the real
geeks call _Penultimate Hop Popping_ or PHP, none of that website programming language rubbish!

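The swap-versus-pop decision on the penultimate router can be sketched in a few lines of Python (a
toy model of the RFC behavior, not VPP's implementation):

```python
IMPLICIT_NULL = 3  # reserved label value that never appears on the wire

def penultimate_rewrite(label_stack: list, out_label: int) -> list:
    """Label operation on the penultimate LSR: normally SWAP the top label,
    but an out-label of 3 means POP instead, so the bare IP payload leaves."""
    if out_label == IMPLICIT_NULL:
        return label_stack[:-1]                # PHP: no label hits the wire
    return label_stack[:-1] + [out_label]      # ordinary swap

print(penultimate_rewrite([100], 3))  # [] -> plain IPv4 towards the last router
print(penultimate_rewrite([100], 0))  # [0] -> still MPLS, with Explicit NULL
```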
Let me replace the FIB entries in the penultimate routers with this magic label value (3):

```
vpp1-1# mpls local-label del 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 0
vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 3

vpp1-2# mpls local-label del 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 0
vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 3
```

Now I would expect this _penultimate hop popping_ to yield an IPv4 packet between `vpp1-1` and
`vpp1-0` for the ICMP echo-request, as well as an IPv4 packet between `vpp1-2` and `vpp1-3` for the
ICMP echo-reply on the way back, and would you look at that:

```
17:45:35.783214 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0,
    ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64
17:45:35.783879 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0,
    ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64
17:45:35.784222 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0,
    ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64
17:45:35.785123 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0,
    ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64
17:45:35.785311 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0,
    ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64

17:45:35.785533 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0,
    ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64
17:45:35.786465 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0,
    ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64
17:45:35.787354 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0,
    ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64
17:45:35.787575 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 102: vlan 22, p 0,
    ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64
17:45:35.788320 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0,
    ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64
```

I can now see that the behavior has changed in a subtle way once again. Where before, there were
**three** MPLS packets all the way from `vpp1-3` through `vpp1-2` and `vpp1-1` onto `vpp1-0`, now
there are only **two** MPLS packets, and the last hop (on the way out in VLAN 20, and on the way
back in VLAN 22) carries just an IPv4 packet. PHP is slick!

## Loadtesting Setup: Bare Metal

{{< image width="350px" float="right" src="/assets/vpp-mpls/MPLS Lab.svg" alt="Bare Metal Setup" >}}

In 1997, an Internet Engineering Task Force (IETF) working group created standards to help fix the
issues of the time, mostly around internet traffic routing. MPLS was developed as an alternative to
multilayer switching and IP over asynchronous transfer mode (ATM). In the 90s, routers were
comparatively weak in terms of CPU, and things like _content addressable memory_ to facilitate
faster lookups were incredibly expensive. Back then, every FIB lookup counted, so tricks like
_Penultimate Hop Popping_ really helped. But what about now? I'm reasonably confident that any
silicon based router would not mind one extra MPLS FIB operation, and equally would not mind
unwrapping the MPLS packet at the end. But, since these things exist, I thought it would be a fun
activity to see how much they help in the VPP world where, just like in the old days, every
operation performed on a packet costs valuable CPU cycles.

I can't really perform a loadtest on the virtual machines backed by Open vSwitch, with six machines
tightly packed on one hypervisor. That setup is made specifically for functional testing and
development work. To do a proper loadtest, I will need bare metal. So, I grabbed three Supermicro
SYS-5018D-FN8T, the same model I run throughout [[AS8298]({{< ref "2021-02-27-network" >}})], as I
know their performance quite well. I'll daisychain these three machines with TenGig ports.
This way, I can take a look at the cost of _P-Routers_ (which only SWAP MPLS labels and forward the
result), as well as _PE-Routers_ (which have to encapsulate, and sometimes decapsulate, the IP or
Ethernet traffic).

These machines get a fresh Debian Bookworm install and VPP 23.06 without any plugins. It's weird for
me to run a VPP instance without Linux CP, but in this case I'm going completely vanilla, so I
disable all plugins and give each VPP machine one worker thread. The install follows my popular
[[VPP-7]({{< ref "2021-09-21-vpp-7" >}})] article. By the way, did you know that you can type the
search query [VPP-7] directly into Google to find that article? Am I an influencer now? Jokes aside,
I decide to call the bare metal machines _France_, _Belgium_ and _Netherlands_. And because if it
ain't dutch, it ain't much, the Netherlands machine sits on top :)

### IPv4 forwarding performance

The way Cisco T-Rex works in its simplest stateless loadtesting mode is that it reads a Scapy file,
for example `bench.py`, and it then generates a stream of traffic from its first port, through the
_device under test (DUT)_, and expects to see that traffic returned on its second port. In
bidirectional mode, traffic is sent from `16.0.0.0/8` to `48.0.0.0/8` in one direction, and back
from `48.0.0.0/8` to `16.0.0.0/8` in the other.

OK, so first things first, let me configure a basic skeleton, taking _Netherlands_ as an example:

```
netherlands# set interface ip address TenGigabitEthernet6/0/1 192.168.13.7/31
netherlands# set interface ip address TenGigabitEthernet6/0/1 2001:678:d78:230::2:2/112
netherlands# set interface state TenGigabitEthernet6/0/1 up
netherlands# ip route add 100.64.0.0/30 via 192.168.13.6
netherlands# ip route add 192.168.13.4/31 via 192.168.13.6

netherlands# set interface ip address TenGigabitEthernet6/0/0 100.64.1.2/30
netherlands# set interface state TenGigabitEthernet6/0/0 up
netherlands# ip nei TenGigabitEthernet6/0/0 100.64.1.1 9c:69:b4:61:ff:40 static

netherlands# ip route add 16.0.0.0/8 via 100.64.1.1
netherlands# ip route add 48.0.0.0/8 via 192.168.13.6
```

The _Belgium_ router just has static routes back and forth, and the _France_ router looks similar
except it has its static routes all pointing in the other direction, and of course it has different
/31 transit networks towards T-Rex and _Belgium_. The one thing that is a bit curious is the use of
a static ARP entry that allows the VPP routers to resolve the nexthop for T-Rex -- in the case
above, T-Rex is sourcing from 100.64.1.1/30 (which has MAC address 9c:69:b4:61:ff:40) and sending to
our 100.64.1.2 on Te6/0/0.

{{< image src="/assets/vpp-mpls/trex.png" alt="T-Rex Baseline" >}}

After fiddling around a little bit with `imix`, I notice the machine is still keeping up
with one CPU thread in both directions (~6.5Mpps). So I switch to 64b packets and ramp up traffic
until that one VPP worker thread is saturated, which is around the 9.2Mpps mark, so I lower it
slightly to a cool 9Mpps. Note: this CPU can have 3 worker threads in production, so it can do
roughly 27Mpps per router, which is way cool!

The machines are at this point all doing exactly the same: receive ethernet from DPDK, do an IPv4
lookup, rewrite the header, and emit the frame on another interface. I can see that clearly in the
runtime statistics, taking a look at _Belgium_ for example:

```
belgium# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 7912.6, 10 sec internal node vector rate 207.47 loops/sec 20604.47
  vector rates in 8.9997e6, out 9.0054e6, drop 0.0000e0, punt 0.0000e0
             Name                 State        Calls        Vectors  Suspends   Clocks  Vectors/Call
TenGigabitEthernet6/0/0-output    active   172120948    35740749991         0   6.47e0        207.65
TenGigabitEthernet6/0/0-tx        active   171687877    35650752635         0   8.49e1        207.65
TenGigabitEthernet6/0/1-output    active   172119849    35740963315         0   7.79e0        207.65
TenGigabitEthernet6/0/1-tx        active   171471125    35605967085         0   8.48e1        207.65
dpdk-input                        polling  171588827    71211720238         0   4.87e1        415.01
ethernet-input                    active   344675998    71571710136         0   2.16e1        207.65
ip4-input-no-checksum             active   343340278    71751697912         0   1.86e1        208.98
ip4-load-balance                  active   342929714    71661706997         0   1.44e1        208.97
ip4-lookup                        active   341632798    71391716172         0   2.28e1        208.97
ip4-rewrite                       active   342498637    71571712383         0   2.59e1        208.97
```

Looking at the time spent on one individual packet, it's about 245 CPU cycles, and considering the
cores on this Xeon D1518 run at 2.2GHz, that checks out very accurately: 2.2e9 / 245 = 9Mpps! Every
time that DPDK is asked for some work, it yields on average a vector of 208 packets -- and this is
why VPP is so super fast: the first packet may need to page in the instructions belonging to one of
the graph nodes, but the second through 208th packet will find an almost 100% hitrate in the CPU's
instruction cache. Who needs RAM anyway?

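That back-of-the-envelope calculation is worth writing down once, because the same arithmetic
recurs for every measurement in this article (the clock speed and cycle count below are the ones
measured here):

```python
def mpps(clock_hz: float, cycles_per_packet: float) -> float:
    """Million packets per second a single worker thread can sustain."""
    return clock_hz / cycles_per_packet / 1e6

# Xeon D1518 worker at 2.2GHz, ~245 cycles per packet end to end:
print(f"{mpps(2.2e9, 245):.1f} Mpps")  # ~9.0 Mpps
```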
### MPLS forwarding performance

Now that I have a baseline, I can take a look at the difference between the IPv4 path and the MPLS
path, and here's where the routers will start to behave differently. _France_ and _Netherlands_ will
be _PE-Routers_ and handle encapsulation/decapsulation, while _Belgium_ has a comparatively easy
job, as it will only handle MPLS forwarding. I'll choose country calling codes for the labels:
traffic destined to _France_ will have MPLS label 33,S=1, while traffic going to _Netherlands_ will
have MPLS label 31,S=1.

```
netherlands# ip ro del 48.0.0.0/8 via 192.168.13.6
netherlands# ip ro add 48.0.0.0/8 via 192.168.13.6 TenGigabitEthernet6/0/1 out-labels 33
netherlands# mpls local-label add 31 eos via ip4-lookup-in-table 0

belgium# ip route del 48.0.0.0/8 via 192.168.13.4
belgium# ip route del 16.0.0.0/8 via 192.168.13.7
belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 33
belgium# mpls local-label add 31 eos via 192.168.13.7 TenGigabitEthernet6/0/0 out-labels 31

france# ip route del 16.0.0.0/8 via 192.168.13.5
france# ip route add 16.0.0.0/8 via 192.168.13.5 TenGigabitEthernet6/0/1 out-labels 31
france# mpls local-label add 33 eos via ip4-lookup-in-table 0
```

The _operations_ in MPLS are no longer symmetric. On the way in, the _PE-Router_ has to
encapsulate the IPv4 packet into an MPLS packet, and on the way out, the _PE-Router_ has to
decapsulate the MPLS packet to reveal the IPv4 packet. So, I change the loadtester to be
unidirectional, and ask it to send 10Mpps from _Netherlands_ to _France_. As soon as I reconfigure
the routers in this mode, I see quite a bit of _packetlo_, as only 7.3Mpps make it through.
Interesting! I wonder where this traffic is dropped, and what the bottleneck is, precisely.

#### MPLS: PE Ingress Performance

First, let's take a look at _Netherlands_, to try to understand why it is more expensive:

```
netherlands# show run
Time 255.5, 10 sec internal node vector rate 256.00 loops/sec 29399.92
  vector rates in 7.6937e6, out 7.6937e6, drop 0.0000e0, punt 0.0000e0
             Name                 State      Calls      Vectors  Suspends   Clocks  Vectors/Call
TenGigabitEthernet6/0/1-output    active   7978541   2042505472         0   7.28e0        255.99
TenGigabitEthernet6/0/1-tx        active   7678013   1965570304         0   8.25e1        255.99
dpdk-input                        polling  7684444   1965570304         0   4.55e1        255.79
ethernet-input                    active   7978549   2042507520         0   1.94e1        255.99
ip4-input-no-checksum             active   7978557   2042509568         0   1.75e1        255.99
ip4-lookup                        active   7678013   1965570304         0   2.17e1        255.99
ip4-mpls-label-imposition-pipe    active   7678013   1965570304         0   2.42e1        255.99
mpls-output                       active   7678013   1965570304         0   6.71e1        255.99
```

Each packet goes from `dpdk-input` into `ethernet-input`; the IPv4 packet then visits
`ip4-lookup`, where the MPLS out-label is found in the IPv4 FIB; the packet is then wrapped into
an MPLS packet in `ip4-mpls-label-imposition-pipe` and sent through `mpls-output` to the NIC.
In total the input path (`ip4-*` plus `mpls-*`) takes **131 CPU cycles** for each packet. Including
all the nodes, from DPDK input to DPDK output, sums up to 285 cycles, so 2.2GHz/285 = 7.69Mpps,
which checks out.

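Summing the _Clocks_ column from the `show run` output above reproduces those numbers (values
transcribed from the table; the small difference from the measured 7.69Mpps is rounding in the
per-node clocks):

```python
# Per-node clocks (cycles/packet) for the Netherlands PE ingress, from `show run`
clocks = {
    "TenGigabitEthernet6/0/1-output": 7.28,
    "TenGigabitEthernet6/0/1-tx": 82.5,
    "dpdk-input": 45.5,
    "ethernet-input": 19.4,
    "ip4-input-no-checksum": 17.5,
    "ip4-lookup": 21.7,
    "ip4-mpls-label-imposition-pipe": 24.2,
    "mpls-output": 67.1,
}
mpls_path = sum(v for k, v in clocks.items() if k.startswith(("ip4", "mpls")))
total = sum(clocks.values())
print(f"ip4/mpls path: {mpls_path:.1f} cycles, total: {total:.1f} cycles")
print(f"expected rate: {2.2e9 / total / 1e6:.2f} Mpps")
```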
#### MPLS: P Transit Performance

I would expect that _Belgium_ has it easier, as it's only doing label swapping and MPLS forwarding.

```
belgium# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 595.6, 10 sec internal node vector rate 47.68 loops/sec 224464.40
  vector rates in 7.6930e6, out 7.6930e6, drop 0.0000e0, punt 0.0000e0
             Name                 State        Calls      Vectors  Suspends   Clocks  Vectors/Call
TenGigabitEthernet6/0/1-output    active    97711093   4659109793         0   8.83e0         47.68
TenGigabitEthernet6/0/1-tx        active    96096377   4582172229         0   8.14e1         47.68
dpdk-input                        polling  161102959   4582172278         0   5.72e1         28.44
ethernet-input                    active    97710991   4659111684         0   2.45e1         47.68
mpls-input                        active    97709468   4659096718         0   2.25e1         47.68
mpls-label-imposition-pipe        active    99324916   4736048227         0   2.52e1         47.68
mpls-lookup                       active    99324903   4736045943         0   3.25e1         47.68
mpls-output                       active    97710989   4659111742         0   3.04e1         47.68
```

Indeed, _Belgium_ can still breathe: it's spending **110 cycles per packet** doing the MPLS
switching (`mpls-*`), which is 18% less than the _PE-Router_ ingress. Judging by the _Vectors/Call_
(last column), it's also running a bit cooler than the ingress router.

It's nice to see that the claim that _P-Routers_ are cheaper on the CPU can be verified to be true
in practice!

#### MPLS: PE Egress Performance

On to the last router, _France_, which is in charge of decapsulating the MPLS packet and doing the
resulting IPv4 lookup:

```
france# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 1067.2, 10 sec internal node vector rate 256.00 loops/sec 27986.96
  vector rates in 7.3234e6, out 7.3234e6, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls      Vectors  Suspends   Clocks  Vectors/Call
TenGigabitEthernet6/0/0-output    active   30528978   7815395072         0   6.59e0        255.99
TenGigabitEthernet6/0/0-tx        active   30528978   7815395072         0   8.20e1        255.99
dpdk-input                        polling  30534880   7815395072         0   4.68e1        255.95
ethernet-input                    active   30528978   7815395072         0   1.97e1        255.99
ip4-load-balance                  active   30528978   7815395072         0   1.35e1        255.99
ip4-mpls-label-disposition-pip    active   30528978   7815395072         0   2.82e1        255.99
ip4-rewrite                       active   30528978   7815395072         0   2.48e1        255.99
lookup-ip4-dst                    active   30815069   7888634368         0   3.09e1        255.99
mpls-input                        active   30528978   7815395072         0   1.86e1        255.99
mpls-lookup                       active   30528978   7815395072         0   2.85e1        255.99
```

This router is spending roughly **144.5 cycles per packet** in `ip4-*` and `mpls-*`, and reveals
itself as the bottleneck. _Netherlands_ sent _Belgium_ 7.69Mpps, which it all forwarded to _France_,
where only 7.3Mpps make it through this _PE-Router_ egress and into the hands of T-Rex. In total,
this router spends 298 cycles/packet, which amounts to 7.37Mpps.

### MPLS Explicit Null performance

At the beginning of this article, I made a claim that we could take some shortcuts, and now is a
good time to see if those shortcuts are worthwhile in the VPP setting. I'll reconfigure the
_Belgium_ router to set the _IPv4 Explicit NULL_ label (0), which can help my poor overloaded
_France_ router save some valuable CPU cycles.

```
belgium# mpls local-label del 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 33
belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 0
```

The situation for _Belgium_ doesn't change at all: it's still doing the SWAP operation on the
incoming packet, but it's writing label 0,S=1 now (instead of label 33,S=1 before). But, haha, take
a look at _France_ for an important difference:

```
france# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 53.3, 10 sec internal node vector rate 85.35 loops/sec 77643.80
  vector rates in 7.6933e6, out 7.6933e6, drop 0.0000e0, punt 0.0000e0
Name                             State        Calls        Vectors       Suspends    Clocks       Vectors/Call
TenGigabitEthernet6/0/0-output   active       4773870      409847372        0        6.96e0       85.85
TenGigabitEthernet6/0/0-tx       active       4773870      409847372        0        8.07e1       85.85
dpdk-input                       polling      4865704      409847372        0        5.01e1       84.23
ethernet-input                   active       4773870      409847372        0        2.15e1       85.85
ip4-load-balance                 active       4773869      409847235        0        1.51e1       85.85
ip4-rewrite                      active       4773870      409847372        0        2.60e1       85.85
lookup-ip4-dst-itf               active       4773870      409847372        0        3.41e1       85.85
mpls-input                       active       4773870      409847372        0        1.99e1       85.85
mpls-lookup                      active       4773870      409847372        0        3.01e1       85.85
```

First off, I notice the _input_ vector rates match the _output_ vector rates, both at 7.69Mpps, and
that the average _Vectors/Call_ is no longer pegged at 256. The router is now spending **125 cycles
per packet**, which is a good deal better than before (about 13.5% fewer cycles than the 144.5
cycles/packet it spent on the simple LSP).

Conclusion: **MPLS Explicit NULL is cheaper**!

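To see why the SWAP itself stays cheap, recall the RFC 3032 label stack entry layout: 20 bits of
label, 3 bits of traffic class, the bottom-of-stack bit, and an 8-bit TTL. A SWAP only rewrites the
20-bit label field; with explicit null it simply writes 0 there. A minimal sketch (the helper
functions are mine, purely illustrative):

```python
# RFC 3032 MPLS label stack entry: Label(20) | TC(3) | S(1) | TTL(8).
# Writing the reserved value 0 (IPv4 Explicit NULL) keeps the entry on
# the wire but tells the egress LSR to go straight to an IPv4 lookup.
IPV4_EXPLICIT_NULL = 0

def encode_lse(label, tc=0, s=1, ttl=64):
    return (label << 12) | (tc << 9) | (s << 8) | ttl

def swap(lse, new_label):
    # Keep TC, S and TTL; a real LSR would also decrement the TTL.
    return (new_label << 12) | (lse & 0xFFF)

lse = encode_lse(33)                 # label 33, S=1: what Belgium receives
out = swap(lse, IPV4_EXPLICIT_NULL)  # label 0, S=1: what Belgium now sends
print(hex(lse), hex(out))            # 0x21140 0x140
```
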
### MPLS Implicit Null (PHP) performance

So there's one mode of operation left for me to play with. What if we asked _Belgium_ to unwrap the
MPLS packet and forward it as an IPv4 packet towards _France_, in other words apply _Penultimate Hop
Popping_? Of course, the ingress _Netherlands_ won't change at all, but I reconfigure the _Belgium_
router, like so:

```
belgium# mpls local-label del 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 0
belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 3
```

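Label 3 deserves a note: it is the reserved _implicit null_ label from RFC 3032, which only ever
exists in the control plane - it never appears on the wire. A penultimate hop that is told to use
out-label 3 pops the label instead of swapping it. A tiny sketch of that decision (the function is
mine for illustration, not how VPP spells this internally):

```python
# Reserved MPLS label values (RFC 3032) and the resulting penultimate-hop
# behaviour. Illustrative only - not a VPP API.
IPV4_EXPLICIT_NULL = 0   # stays on the wire; egress does an IPv4 lookup
IPV6_EXPLICIT_NULL = 2   # same, for IPv6 payloads
IMPLICIT_NULL      = 3   # control-plane only: asks the neighbor to pop

def penultimate_hop_op(out_label):
    if out_label == IMPLICIT_NULL:
        return "pop"                  # PHP: forward the bare IP packet
    return "swap to %d" % out_label   # includes explicit null (0 or 2)

print(penultimate_hop_op(3))   # pop
print(penultimate_hop_op(0))   # swap to 0
```
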
The situation in _Belgium_ now looks subtly different:

```
belgium# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 171.1, 10 sec internal node vector rate 50.64 loops/sec 188552.87
  vector rates in 7.6966e6, out 7.6966e6, drop 0.0000e0, punt 0.0000e0
Name                             State        Calls        Vectors       Suspends    Clocks       Vectors/Call
TenGigabitEthernet6/0/1-output   active       26128425     1316828499       0        8.74e0       50.39
TenGigabitEthernet6/0/1-tx       active       26128424     1316828327       0        8.16e1       50.39
dpdk-input                       polling      39339977     1316828499       0        5.58e1       33.47
ethernet-input                   active       26128425     1316828499       0        2.39e1       50.39
ip4-mpls-label-disposition-pip   active       26128425     1316828499       0        3.07e1       50.39
ip4-rewrite                      active       27648864     1393790359       0        2.82e1       50.41
mpls-input                       active       26128425     1316828499       0        2.21e1       50.39
mpls-lookup                      active       26128422     1316828355       0        3.16e1       50.39
```

After doing the `mpls-lookup`, this router finds that it can just toss the label and forward the
packet as IPv4 down south. Cost for _Belgium_: **113 cycles per packet**.

_France_ is now not participating in MPLS at all - it is simply receiving IPv4 packets which it has
to route back towards T-Rex. I take one final look at _France_ to see where it's spending its time:

```
france# show run
Thread 1 vpp_wk_0 (lcore 1)
Time 397.3, 10 sec internal node vector rate 42.17 loops/sec 259634.88
  vector rates in 7.7112e6, out 7.6964e6, drop 0.0000e0, punt 0.0000e0
Name                             State        Calls        Vectors       Suspends    Clocks       Vectors/Call
TenGigabitEthernet6/0/0-output   active       74381543     3211443520       0        9.47e0       43.18
TenGigabitEthernet6/0/0-tx       active       70820630     3057504872       0        8.26e1       43.17
dpdk-input                       polling      131873061    3063377312       0        6.09e1       23.23
ethernet-input                   active       72645873     3134461107       0        2.66e1       43.15
ip4-input-no-checksum            active       70820629     3057504812       0        2.68e1       43.17
ip4-load-balance                 active       72646140     3134473660       0        1.74e1       43.15
ip4-lookup                       active       70820628     3057504796       0        2.79e1       43.17
ip4-rewrite                      active       70820631     3057504924       0        2.96e1       43.17
```

As an IPv4 router, _France_ spends in total **102 cycles per packet**. This matches very closely
with the 104 cycles/packet I found when doing my baseline loadtest with only IPv4 routing. I love it
when numbers align!!

### Scaling

One thing that I was curious to know is whether MPLS packets would allow for multiple receive queues,
to enable horizontal scaling by adding more VPP worker threads. The answer is a resounding YES! If I
restart the VPP routers _Netherlands_, _Belgium_ and _France_ with three workers and set DPDK
`num-rx-queues` to 3 as well, I see perfect linear scaling; in other words, these little routers
would be able to forward roughly 27Mpps of MPLS packets with varying inner payloads (be it IPv4,
IPv6, or Ethernet traffic with differing src/dst MAC addresses). All things said, IPv4 is still a
little bit cheaper on the CPU, at least on these routers with only a very small routing table. But
it's great to see that MPLS forwarding can leverage RSS.

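A quick back-of-the-envelope model ties these numbers together: a single worker forwards roughly
`clock_hz / cycles_per_packet` packets per second, and RSS lets that scale with the worker count.
The 2.2GHz clock below is an assumption I infer from the 298 cycles/packet and 7.37Mpps pairing
earlier in this article; real results depend on vector sizes and memory behavior:

```python
# Crude throughput model: pps = CPU clock / cycles-per-packet, times the
# number of workers (assuming RSS spreads flows evenly across queues).
CLOCK_HZ = 2.2e9  # assumed: implied by 298 cycles/packet ~ 7.37 Mpps above

def mpps(cycles_per_packet, workers=1):
    return workers * CLOCK_HZ / cycles_per_packet / 1e6

print(round(mpps(298), 2))             # 7.38 Mpps: one worker, full MPLS path
print(round(mpps(298, workers=3), 2))  # three workers, ideal linear scaling
```
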
### Conclusions

This is all fine and dandy, but it's a bit trickier to see whether PHP is actually cheaper or not.
To answer this question, I should count the total number of CPU cycles spent end to end: for a
packet traveling from T-Rex into _Netherlands_, through _Belgium_ and _France_, and back out
to T-Rex.

|                           | Netherlands | Belgium     | France      | Total Cost       |
| ------------------------- | ----------- | ----------- | ----------- | ---------------- |
| Regular IPv4 path         | 104 cycles  | 104 cycles  | 104 cycles  | 312 cycles       |
| MPLS: Simple LSP          | 131 cycles  | 110 cycles  | 145 cycles  | 386 cycles       |
| MPLS: Explicit NULL LSP   | 131 cycles  | 110 cycles  | 125 cycles  | 366 cycles       |
| MPLS: Penultimate Hop Pop | 131 cycles  | 113 cycles  | 102 cycles  | ***346 cycles*** |

***Note***: The clock cycle numbers here count only the `*mpls*` and `*ip4*` nodes, excluding the
`*-input`, `*-output` and `*-tx` nodes, as those add the same cost in all modes of operation.

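The _Total Cost_ column is a straight per-hop sum; a quick sanity check in Python:

```python
# Recompute the Total Cost column from the per-hop cycles in the table
# above (Netherlands, Belgium, France).
paths = {
    "Regular IPv4 path":         (104, 104, 104),
    "MPLS: Simple LSP":          (131, 110, 145),
    "MPLS: Explicit NULL LSP":   (131, 110, 125),
    "MPLS: Penultimate Hop Pop": (131, 113, 102),
}
totals = {name: sum(hops) for name, hops in paths.items()}
print(totals["MPLS: Penultimate Hop Pop"])  # 346: the cheapest MPLS option
```
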
I threw a lot of numbers into this article, and my head is spinning as I write this. But I still
think I can wrap it up in a way that allows me to have a few high level takeaways:

* IPv4 forwarding is a fair bit cheaper than MPLS forwarding (with an empty FIB, anyway). I had
  not expected this!
* End to end, the MPLS bottleneck is in the _PE-Ingress_ operation.
* Explicit NULL helps without any drawbacks, as it cuts out one MPLS FIB lookup in the _PE-Egress_
  operation.
* Implicit NULL (aka Penultimate Hop Popping) is also the fastest way to do MPLS with VPP, all
  things considered.

## What's next

I joined forces with [@vifino](https://chaos.social/@vifino), who has effectively added MPLS handling
to the Linux Control Plane, so VPP can start to function as an MPLS router using FRR's label
distribution protocol implementation. Gosh, I wish Bird3 would have LDP :)

Our work is mostly complete; there are two pending Gerrits which should be ready to review and
certainly ready to play with:

1. [[Gerrit 38826](https://gerrit.fd.io/r/c/vpp/+/38826)]: This adds the ability to listen to internal
   state changes of an interface, so that the Linux Control Plane plugin can enable MPLS on the
   _LIP_ interfaces and set the Linux sysctl for MPLS input.
1. [[Gerrit 38702](https://gerrit.fd.io/r/c/vpp/+/38702)]: This adds the ability to listen to Netlink
   messages in the Linux Control Plane plugin, and sensibly apply these routes to the IPv4, IPv6
   and MPLS FIBs in the VPP dataplane.

If you'd like to test this - reach out to the VPP Developer mailing list
[[ref](mailto:vpp-dev@lists.fd.io)] any time!