---
date: "2022-01-12T18:35:14Z"
title: Case Study - Virtual Leased Line (VLL) in VPP
aliases:
- /s/articles/2022/01/13/vpp-l2.html
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
After completing the Linux CP plugin, interfaces and their attributes such as addresses and routes
can be shared between VPP and the Linux kernel in a clever way, so running software like
[FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
&gt;100Mpps and &gt;100Gbps forwarding rates are easily in reach!
If you've read my previous articles (thank you!), you will have noticed that I have done a lot of
work on making VPP work well in an ISP (BGP/OSPF) environment with Linux CP. However, there are many other cool
things about VPP that make it a very competent advanced services router. One that has always been
super interesting to me is being able to offer L2 connectivity over a wide area network. For example, a
virtual leased line from our Colo in Zurich to Amsterdam NIKHEF. This article explores this space.
***NOTE***: If you're only interested in results, scroll all the way down to the markdown table and
graph for performance stats.
## Introduction
ISPs can offer ethernet services, often called _Virtual Leased Lines_ (VLLs), _Layer2 VPN_ (L2VPN)
or _Ethernet Backhaul_. They mean the same thing: imagine a switchport in location A that appears
to be transparently and directly connected to a switchport in location B, with the ISP (layer3, so
IPv4 and IPv6) network in between. The "simple" old-school setup would be to have switches which
define VLANs and are all interconnected. But we collectively learned that it's a bad idea for several
reasons:
* Large broadcast domains tend to encounter L2 forwarding loops sooner rather than later
* Spanning-Tree and its kin are a stopgap, but they often disable an entire port from forwarding,
which can be expensive if that port is connected to a dark fiber into another datacenter far
away.
* Large VLAN setups that are intended to interconnect with other operators run into overlapping
VLAN tags, which means switches have to do tag rewriting and filtering and such.
* Traffic engineering is all but non-existent in L2-only networking domains, while L3 has all sorts
of smart TE extensions, ECMP, and so on.
The canonical solution is for ISPs to encapsulate the ethernet traffic of their customers in some
tunneling mechanism, for example in MPLS or in some IP tunneling protocol. Fundamentally, these are
the same, except for the chosen protocol and overhead/cost of forwarding. MPLS is a very thin layer
under the packet, but other IP based tunneling mechanisms exist as well; commonly used ones are GRE, VXLAN
and GENEVE, among many others.
They all work roughly the same:
* An IP packet typically has a _maximum transmission unit_ (MTU) of 1500 bytes, while the ethernet header is
an additional 14 bytes: a 6 byte source MAC, a 6 byte destination MAC, and a 2 byte ethernet
type, which is 0x0800 for an IPv4 datagram, 0x0806 for ARP, 0x86dd for IPv6, and so on
[[ref](https://en.wikipedia.org/wiki/EtherType)].
* If VLANs are used, an additional 4 bytes are needed [[ref](https://en.wikipedia.org/wiki/IEEE_802.1Q)]
making the ethernet frame at most 1518 bytes long, with an ethertype of 0x8100.
* If QinQ or QinAD are used, yet again 4 bytes are needed [[ref](https://en.wikipedia.org/wiki/IEEE_802.1ad)],
making the ethernet frame at most 1522 bytes long, with an ethertype of either 0x8100 or 0x9100,
depending on the implementation.
* We can take such an ethernet frame, and make it the _payload_ of another IP packet, encapsulating the original
ethernet frame in a new IPv4 or IPv6 packet. We can then route it over an IP network to a remote
site.
* Upon receipt of such a packet, by looking at the headers the remote router can determine that this
packet represents an encapsulated ethernet frame, unpack it all, and forward the original frame onto a
given interface.
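To make this arithmetic concrete, here's a tiny back-of-the-envelope helper (plain Python, nothing to do
with VPP itself) that computes the largest customer frame and the minimum underlay MTU for a given
encapsulation overhead:
```
ETH_HDR = 14     # 6 byte src MAC + 6 byte dst MAC + 2 byte ethertype
DOT1Q   = 4      # one 802.1q tag; a QinQ frame carries two of these
IP_MTU  = 1500   # the classic IP MTU of the customer's traffic

def max_frame(tags=2):
    """Largest ethernet frame to transport: 1514 untagged, 1518 dot1q, 1522 QinQ."""
    return IP_MTU + ETH_HDR + tags * DOT1Q

def underlay_mtu(encap_overhead, tags=2):
    """Minimum underlay MTU: the whole customer frame plus the tunnel overhead."""
    return max_frame(tags) + encap_overhead

print(max_frame(0), max_frame(1), max_frame(2))   # 1514 1518 1522
print(underlay_mtu(20 + 8 + 8))                   # 1558, e.g. IPv4+UDP+VXLAN (see below)
```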
### IP Tunneling Protocols
First let's get some theory out of the way -- I'll discuss three common IP tunneling protocols here, and then
move on to demonstrate how they are configured in VPP and perhaps more importantly, how they perform in VPP.
Each tunneling protocol has its own advantages and disadvantages, but I'll stick to the basics first:
#### GRE: Generic Routing Encapsulation
_Generic Routing Encapsulation_ (GRE, described in [RFC2784](https://datatracker.ietf.org/doc/html/rfc2784)) is a
very old and well known tunneling protocol. The packet is an IP datagram with protocol number 47, consisting
of a header with 4 flag bits, 9 reserved bits, 3 bits for the version (normally set to all-zeros), and
16 bits for the inner protocol (ether)type, so 0x0800 for IPv4, 0x8100 for 802.1q and so on. It's a very
small header of only 4 bytes, plus an optional key (4 bytes) and sequence number (also 4 bytes), which means
that to be able to transport any ethernet frame (including the fancy QinQ and QinAD ones), the _underlay_
network must have an end to end MTU of at least 1522 + 20(IPv4) + 12(GRE) = ***1554 bytes for IPv4*** and
***1574 bytes for IPv6***.
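To picture the wire format, here's a Python sketch of the GRE base header layout (taken from the RFC, not
from the VPP source); carrying a whole ethernet frame uses the 'transparent ethernet bridging' ethertype
0x6558:
```
import struct

def gre_base_header(inner_ethertype=0x6558):
    """Base GRE header: 16 bits of flag/reserved/version bits (all zero here,
    i.e. no checksum, key or sequence number present), followed by the 16 bit
    inner protocol type. 0x6558 is 'transparent ethernet bridging', used when
    the payload is a complete L2 frame rather than an IP packet."""
    flags_reserved_version = 0x0000
    return struct.pack("!HH", flags_reserved_version, inner_ethertype)

print(gre_base_header().hex())   # '00006558' -> 4 bytes on the wire
```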
#### VXLAN: Virtual Extensible LAN
_Virtual Extensible LAN_ (VXLAN, described in [RFC7348](https://datatracker.ietf.org/doc/html/rfc7348))
is a UDP datagram which has a header consisting of 8 bits worth
of flags, 24 bits reserved for future expansion, 24 bits of _Virtual Network Identifier_ (VNI) and
an additional 8 bits of reserved space at the end. It uses UDP port 4789 as assigned by IANA. VXLAN
encapsulation adds 20(IPv4)+8(UDP)+8(VXLAN) = 36 bytes, and since an IPv6 header is 40 bytes, there it
adds 56 bytes. This means that to be able to transport any ethernet frame, the _underlay_
network must have an end to end MTU of at least 1522+36 = ***1558 bytes for IPv4*** and ***1578 bytes
for IPv6***.
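The VXLAN header is just as easy to picture. A small Python sketch of the RFC7348 layout (again, not the
VPP implementation), using the VNI 8298 that appears in the configurations further down:
```
import struct

VXLAN_FLAG_I = 0x08   # the 'VNI present' flag

def vxlan_header(vni):
    """8 byte VXLAN header: 8 flag bits, 24 reserved bits, 24 bit VNI, 8 reserved bits."""
    return struct.pack("!II", VXLAN_FLAG_I << 24, (vni & 0xFFFFFF) << 8)

def vxlan_vni(header):
    """Recover the VNI from a received VXLAN header."""
    _, word2 = struct.unpack("!II", header[:8])
    return word2 >> 8

hdr = vxlan_header(8298)
print(len(hdr), hex(vxlan_vni(hdr)))   # 8 bytes, VNI 0x206a (8298)
```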
#### GENEVE: Generic Network Virtualization Encapsulation
_GEneric NEtwork Virtualization Encapsulation_ (GENEVE, described in [RFC8926](https://datatracker.ietf.org/doc/html/rfc8926))
is somewhat similar to VXLAN, although it was an attempt to stop the wild growth of tunneling protocols;
I'm sure there is an [XKCD](https://xkcd.com/927/) out there specifically for this approach. The packet is
also a UDP datagram with destination port 6081, followed by an 8 byte GENEVE specific header, containing
2 bits of version, 6 bits of option length, 2 flag bits, 6 reserved bits, a 16 bit inner protocol type,
a 24 bit _Virtual Network Identifier_ (VNI), and 8 bits of reserved space. With GENEVE, several options
are available and will be tacked onto the GENEVE
header, but they are typically not used. If they are though, the options can add an additional 16 bytes
which means that to be able to transport any ethernet frame, the _underlay_ network must have an end to
end MTU of at least 1522+52 = ***1574 bytes for IPv4*** and ***1594 bytes for IPv6***.
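And the GENEVE base header, once more as a Python sketch of the RFC8926 bit layout rather than anything
taken from VPP:
```
import struct

def geneve_header(vni, proto=0x6558, opt_len_words=0):
    """8 byte GENEVE base header: 2 bit version, 6 bit option length (in 4 byte
    words), O and C flags, 6 reserved bits, 16 bit protocol type, 24 bit VNI and
    8 reserved bits. 0x6558 again means a complete ethernet frame follows."""
    byte0 = (0 << 6) | (opt_len_words & 0x3F)   # version 0, no options
    byte1 = 0                                   # O and C flags both clear
    return struct.pack("!BBHI", byte0, byte1, proto, (vni & 0xFFFFFF) << 8)

print(geneve_header(8298).hex())   # '0000655800206a00': proto 0x6558, VNI 0x206a
```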
### Hardware setup
First let's take a look at the physical setup. I'm using three servers and a switch in the IPng Networks lab:
{{< image width="400px" float="right" src="/assets/vpp/l2-xconnect-lab.png" alt="Loadtest Setup" >}}
* `hvn0`: Dell R720xd, load generator
  * Dual E5-2620, 24 CPUs, 2 threads per core, 2 numa nodes
  * 64GB DDR3 at 1333MT/s
  * Intel X710 4-port 10G, Speed 8.0GT/s Width x8 (64 Gbit/s)
* `Hippo` and `Rhino`: VPP routers
  * ASRock B550 Taichi
  * Ryzen 5950X 32 CPUs, 2 threads per core, 1 numa node
  * 64GB DDR4 at 2133 MT/s
  * Intel 810C 2-port 100G, Speed 16.0 GT/s Width x16 (256 Gbit/s)
* `fsw0`: FS.com switch S5860-48SC, 8x 100G, 48x 10G
  * VLAN 4 (blue) connects Rhino's `Hu12/0/1` to Hippo's `Hu12/0/1`
  * VLAN 5 (red) connects hvn0's `enp5s0f0` to Rhino's `Hu12/0/0`
  * VLAN 6 (green) connects hvn0's `enp5s0f1` to Hippo's `Hu12/0/0`
  * All switchports have jumbo frames enabled and are set to 9216 bytes.
Further, Hippo and Rhino are running VPP at head `vpp v22.02-rc0~490-gde3648db0`, and hvn0 is running
T-Rex v2.93 in L2 mode, with MAC address `00:00:00:01:01:00` on the first port, and MAC address
`00:00:00:02:01:00` on the second port. This machine can saturate 10G in both directions with small
packets even when using only one flow, as can be seen if the ports are just looped back onto one
another, for example by physically crossconnecting them with an SFP+ or DAC; or in my case by putting
`fsw0` port `Te0/1` and `Te0/2` in the same VLAN together:
{{< image width="800px" src="/assets/vpp/l2-xconnect-trex.png" alt="TRex on hvn0" >}}
Now that I have shared all the context and hardware, I'm ready to actually dive in to what I wanted to
talk about: what all this _virtual leased line_ business looks like in VPP. Ready? Here we go!
### Direct L2 CrossConnect
The simplest thing I can show in VPP is to configure a layer2 cross-connect (_l2 xconnect_) between
two ports. In this case, VPP doesn't even need to have an IP address; all I do is bring up the ports and
set their MTU to be able to carry 1522 byte frames (ethernet at 1514, dot1q at 1518, and QinQ
at 1522 bytes). The configuration is identical on both Rhino and Hippo:
```
set interface state HundredGigabitEthernet12/0/0 up
set interface state HundredGigabitEthernet12/0/1 up
set interface mtu packet 1522 HundredGigabitEthernet12/0/0
set interface mtu packet 1522 HundredGigabitEthernet12/0/1
set interface l2 xconnect HundredGigabitEthernet12/0/0 HundredGigabitEthernet12/0/1
set interface l2 xconnect HundredGigabitEthernet12/0/1 HundredGigabitEthernet12/0/0
```
I'd say the only thing to keep in mind here is that the cross-connect commands only
link in one direction (receive in A, forward to B), and that's why I have to type them twice (receive in B,
forward to A). Of course, this must be really cheap on VPP -- because all it has to do now is receive
from DPDK and immediately schedule for transmit on the other port. Looking at `show runtime` I can
see how much CPU time is spent in each of VPP's nodes:
```
Time 1241.5, 10 sec internal node vector rate 28.70 loops/sec 475009.85
  vector rates in 1.4879e7, out 1.4879e7, drop 0.0000e0, punt 0.0000e0
             Name                    Calls      Vectors    Clocks  Vectors/Call
HundredGigabitEthernet12/0/1-o   650727833  18472218801    7.49e0         28.39
HundredGigabitEthernet12/0/1-t   650727833  18472218801    4.12e1         28.39
ethernet-input                   650727833  18472218801    5.55e1         28.39
l2-input                         650727833  18472218801    1.52e1         28.39
l2-output                        650727833  18472218801    1.32e1         28.39
```
In this simple cross connect mode, the only thing VPP has to do is receive the ethernet, funnel it
into `l2-input`, and immediately send it straight through `l2-output` back out, which does not cost
much in terms of CPU cycles at all. In total, this CPU thread is forwarding 14.88Mpps (line rate 10G
at 64 bytes), at an average of 133 cycles per packet (not counting the time spent in DPDK). The CPU
has room to spare in this mode; in other words, even _one CPU thread_ can handle this workload at
line rate. Impressive!
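As a side note on these numbers: 14.88Mpps is simply the theoretical maximum for 10GbE with 64 byte
frames, and the cycle budget per packet follows from the CPU clock. A quick back-of-the-envelope check in
Python (the ~3.4GHz figure is my assumption for the Ryzen 5950X base clock, it is not something this
output reports):
```
# A 64 byte frame occupies 84 bytes on the wire: 7 bytes preamble + 1 byte SFD
# + 12 bytes inter-frame gap (the 4 byte FCS is already part of the 64 bytes).
line_rate_bps  = 10e9
bits_per_frame = (64 + 20) * 8
print(line_rate_bps / bits_per_frame)   # ~14.88e6 packets per second

# Cycle budget for one worker thread to keep up, assuming a ~3.4GHz clock:
print(3.4e9 / 14.88e6)                  # ~228 cycles per packet available
```
With only ~133 of those ~228 cycles spent, it's clear why this single worker thread still has headroom.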
Although cool, doing an L2 crossconnect like this isn't super useful. Usually, the customer leased line
has to be transported to another location, and for that we'll need some form of encapsulation ...
### Crossconnect over IPv6 VXLAN
Let's start with VXLAN. The concept is pretty straightforward in VPP. Based on the configuration
I put in Rhino and Hippo above, I will first have to bring `Hu12/0/1` out of L2 mode, give it an
IPv6 address on both routers, create a tunnel with a given _VNI_, and then crossconnect the customer side `Hu12/0/0`
into the `vxlan_tunnel0` and vice-versa. Piece of cake:
```
## On Rhino
set interface mtu packet 1600 HundredGigabitEthernet12/0/1
set interface l3 HundredGigabitEthernet12/0/1
set interface ip address HundredGigabitEthernet12/0/1 2001:db8::1/64
create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0
## On Hippo
set interface mtu packet 1600 HundredGigabitEthernet12/0/1
set interface l3 HundredGigabitEthernet12/0/1
set interface ip address HundredGigabitEthernet12/0/1 2001:db8::2/64
create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0
```
Of course, now we're actually beginning to make VPP do some work, and the exciting thing is: if there
were an (opaque) ISP network between Rhino and Hippo, this would work just fine, considering the
encapsulation is 'just' IPv6 UDP. Under the covers, for each received frame, VPP has to encapsulate it
into VXLAN, and route the resulting L3 packet by doing an IPv6 routing table lookup:
```
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 32132.74
  vector rates in 8.5423e6, out 8.5423e6, drop 0.0000e0, punt 0.0000e0
             Name                  Calls   Vectors    Clocks  Vectors/Call
HundredGigabitEthernet12/0/0-o   333777  85445944    2.74e0        255.99
HundredGigabitEthernet12/0/0-t   333777  85445944    5.28e1        255.99
ethernet-input                   333777  85445944    4.25e1        255.99
ip6-input                        333777  85445944    1.25e1        255.99
ip6-lookup                       333777  85445944    2.41e1        255.99
ip6-receive                      333777  85445944    1.71e2        255.99
ip6-udp-lookup                   333777  85445944    1.55e1        255.99
l2-input                         333777  85445944    8.94e0        255.99
l2-output                        333777  85445944    4.44e0        255.99
vxlan6-input                     333777  85445944    2.12e1        255.99
```
I can definitely see a lot more action here. In this mode, VPP is handling 8.54Mpps on this CPU thread
before saturating. At full load, VPP is spending 356 CPU cycles per packet, of which almost half is in
node `ip6-receive`.
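That 356, by the way, is just the `Clocks` column from the `show runtime` output above, summed over all
the nodes a packet traverses (time spent in DPDK rx/tx is not included):
```
# Per-node Clocks for the VXLAN-v6 case, copied from the output above:
clocks = [2.74, 52.8, 42.5, 12.5, 24.1, 171.0, 15.5, 8.94, 4.44, 21.2]
print(round(sum(clocks), 2))   # 355.72 -> the ~356 cycles/packet quoted above
```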
### Crossconnect over IPv4 VXLAN
Seeing `ip6-receive` being such a big part of the cost (almost half!), I wonder what it might look like if
I change the tunnel to use IPv4. So I'll give Rhino and Hippo an IPv4 address as well, delete the vxlan tunnel I made
before (the IPv6 one), and create a new one with IPv4:
```
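## On Rhino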
set interface ip address HundredGigabitEthernet12/0/1 10.0.0.0/31
create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298 del
create vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0
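## On Hippo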
set interface ip address HundredGigabitEthernet12/0/1 10.0.0.1/31
create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298 del
create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0
```
And after letting this run for a few seconds, I can take a look and see how the `ip4-*` version of
the VPP code performs:
```
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 53309.71
  vector rates in 1.4151e7, out 1.4151e7, drop 0.0000e0, punt 0.0000e0
             Name                  Calls    Vectors    Clocks  Vectors/Call
HundredGigabitEthernet12/0/0-o   552890  141539600    2.76e0        255.99
HundredGigabitEthernet12/0/0-t   552890  141539600    5.30e1        255.99
ethernet-input                   552890  141539600    4.13e1        255.99
ip4-input-no-checksum            552890  141539600    1.18e1        255.99
ip4-lookup                       552890  141539600    1.68e1        255.99
ip4-receive                      552890  141539600    2.74e1        255.99
ip4-udp-lookup                   552890  141539600    1.79e1        255.99
l2-input                         552890  141539600    8.68e0        255.99
l2-output                        552890  141539600    4.41e0        255.99
vxlan4-input                     552890  141539600    1.76e1        255.99
```
Throughput is now quite a bit higher, clocking a cool 14.2Mpps (just short of line rate!) at 202 CPU
cycles per packet: considerably less time than with IPv6. Keep in mind though that VPP has a nearly empty
routing table in all of these tests.
### Crossconnect over IPv6 GENEVE
Another popular cross connect type, also based on IPv4 and IPv6 UDP packets, is GENEVE. The configuration
is almost identical, so I delete the IPv4 VXLAN and create an IPv6 GENEVE tunnel instead:
```
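## On Rhino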
create vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298 del
create geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0
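## On Hippo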
create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298 del
create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0
```
All the while, the TRex on the customer machine `hvn0` is sending 14.88Mpps in both directions, and
after just a short (second or so) interruption, the GENEVE tunnel comes up, cross-connects into the
customer `Hu12/0/0` interfaces, and starts to carry traffic:
```
Thread 8 vpp_wk_7 (lcore 8)
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 29688.03
  vector rates in 8.3179e6, out 8.3179e6, drop 0.0000e0, punt 0.0000e0
             Name                  Calls   Vectors    Clocks  Vectors/Call
HundredGigabitEthernet12/0/0-o   324981  83194664    2.74e0        255.99
HundredGigabitEthernet12/0/0-t   324981  83194664    5.18e1        255.99
ethernet-input                   324981  83194664    4.26e1        255.99
geneve6-input                    324981  83194664    3.87e1        255.99
ip6-input                        324981  83194664    1.22e1        255.99
ip6-lookup                       324981  83194664    2.39e1        255.99
ip6-receive                      324981  83194664    1.67e2        255.99
ip6-udp-lookup                   324981  83194664    1.54e1        255.99
l2-input                         324981  83194664    9.28e0        255.99
l2-output                        324981  83194664    4.47e0        255.99
```
As with VXLAN over IPv6, GENEVE-v6 is also comparatively slow (I say comparatively,
because you should not expect anything like this performance when using Linux or BSD kernel routing!).
The lower throughput is again due to the costly `ip6-receive` node; GENEVE-v6 performs just slightly
worse at 8.32Mpps per core and 368 CPU cycles per packet.
### Crossconnect over IPv4 GENEVE
I now suspect that GENEVE over IPv4 will show similar gains to those I saw when switching VXLAN from
IPv6 to IPv4 above. So I remove the IPv6 tunnel, create a new IPv4 tunnel instead, and hook it
back up to the customer port on both Rhino and Hippo, like so:
```
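## On Rhino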
create geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298 del
create geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0
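## On Hippo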
create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298 del
create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0
```
And the results, indeed a significant improvement:
```
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 48639.97
  vector rates in 1.3737e7, out 1.3737e7, drop 0.0000e0, punt 0.0000e0
             Name                  Calls    Vectors    Clocks  Vectors/Call
HundredGigabitEthernet12/0/0-o   536763  137409904    2.76e0        255.99
HundredGigabitEthernet12/0/0-t   536763  137409904    5.19e1        255.99
ethernet-input                   536763  137409904    4.19e1        255.99
geneve4-input                    536763  137409904    2.39e1        255.99
ip4-input-no-checksum            536763  137409904    1.18e1        255.99
ip4-lookup                       536763  137409904    1.69e1        255.99
ip4-receive                      536763  137409904    2.71e1        255.99
ip4-udp-lookup                   536763  137409904    1.79e1        255.99
l2-input                         536763  137409904    8.81e0        255.99
l2-output                        536763  137409904    4.47e0        255.99
```
So, close to line rate again! Performance of GENEVE-v4 clocks in at 13.7Mpps per core or 207
CPU cycles per packet.
### Crossconnect over IPv6 GRE
Now I can't help but wonder: if those `ip4|6-udp-lookup` nodes burn valuable CPU cycles,
GRE will possibly do better, because it's an L3 protocol (protocol number 47) and will never have to
inspect beyond the IP header. So I delete the GENEVE tunnel and give GRE a go too:
```
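## On Rhino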
create geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298 del
create gre tunnel src 2001:db8::1 dst 2001:db8::2 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0
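## On Hippo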
create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298 del
create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0
```
Results:
```
Time 10.0, 10 sec internal node vector rate 255.99 loops/sec 37129.87
  vector rates in 9.9254e6, out 9.9254e6, drop 0.0000e0, punt 0.0000e0
             Name                  Calls    Vectors    Clocks  Vectors/Call
HundredGigabitEthernet12/0/0-o   387881   99297464    2.80e0        255.99
HundredGigabitEthernet12/0/0-t   387881   99297464    5.21e1        255.99
ethernet-input                   775762  198594928    5.97e1        255.99
gre6-input                       387881   99297464    2.81e1        255.99
ip6-input                        387881   99297464    1.21e1        255.99
ip6-lookup                       387881   99297464    2.39e1        255.99
ip6-receive                      387881   99297464    5.09e1        255.99
l2-input                         387881   99297464    9.35e0        255.99
l2-output                        387881   99297464    4.40e0        255.99
```
The performance of GRE-v6 (in transparent ethernet bridge aka _TEB_ mode) is 9.9Mpps per core or
243 CPU cycles per packet. I'll also note that while the `ip6-receive` node sat in the 170 clocks/packet
range for all the UDP based tunnels, it is now down to only 51 or so: a huge improvement indeed.
### Crossconnect over IPv4 GRE
To round off the set, I'll remove the IPv6 GRE tunnel and put an IPv4 GRE tunnel in place:
```
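## On Rhino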
create gre tunnel src 2001:db8::1 dst 2001:db8::2 teb del
create gre tunnel src 10.0.0.0 dst 10.0.0.1 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0
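## On Hippo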
create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb del
create gre tunnel src 10.0.0.1 dst 10.0.0.0 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0
```
And without further ado:
```
Time 10.0, 10 sec internal node vector rate 255.87 loops/sec 52898.61
  vector rates in 1.4039e7, out 1.4039e7, drop 0.0000e0, punt 0.0000e0
             Name                   Calls    Vectors    Clocks  Vectors/Call
HundredGigabitEthernet12/0/0-o    548684  140435080    2.80e0        255.95
HundredGigabitEthernet12/0/0-t    548684  140435080    5.22e1        255.95
ethernet-input                   1097368  280870160    2.92e1        255.95
gre4-input                        548684  140435080    2.51e1        255.95
ip4-input-no-checksum             548684  140435080    1.19e1        255.95
ip4-lookup                        548684  140435080    1.68e1        255.95
ip4-receive                       548684  140435080    2.03e1        255.95
l2-input                          548684  140435080    8.72e0        255.95
l2-output                         548684  140435080    4.43e0        255.95
```
The performance of GRE-v4 (in transparent ethernet bridge aka _TEB_ mode) is 14.0Mpps per
core or 171 CPU cycles per packet. This is the lowest of all the tunneling
protocols, but (for obvious reasons) it will not outperform a direct L2 crossconnect, as that
cuts out the L3 (and L4) middleperson entirely. Woohoo!
## Conclusions
First, let me give a recap of the tests I did, ordered from left to right from best to worst
performer.
Test | L2XC | GRE-v4 | VXLAN-v4 | GENEVE-v4 | GRE-v6 | VXLAN-v6 | GENEVE-v6
------------- | ----------- | --------- | --------- | ---------- | ---------- | --------- | ---------
pps/core | >14.88M | 14.34M | 14.15M | 13.74M | 9.93M | 8.54M | 8.32M
cycles/packet | 132.59 | 171.45 | 201.65 | 207.44 | 243.35 | 355.72 | 368.09
***(!)*** Achtung! Because in the L2XC mode the CPU was not fully consumed (VPP was consuming only
~28 frames per vector), it did not yet achieve its optimum CPU performance. Under full load, the
cycles/packet will be somewhat lower than what is shown here.
Taking a closer look at the VPP nodes in use, below I draw a graph of CPU cycles spent in each VPP
node, for each type of cross connect, where the lower the stack is, the faster cross connect will
be:
{{< image width="1000px" src="/assets/vpp/l2-xconnect-cycles.png" alt="Cycles by node" >}}
Although clearly GREv4 is the winner, I still would not use it for the following reason:
VPP does not support GRE keys, and considering it is an L3 protocol, I will have to use unique
IPv4 or IPv6 addresses for each tunnel src/dst pair; otherwise VPP will not know, upon receipt of
a GRE packet, which tunnel it belongs to. For IPv6 this is not a huge deal (I can bind a whole
/64 to a loopback and just be done with it), but GREv6 does not perform as well as VXLAN-v4 or
GENEVE-v4.
VXLAN and GENEVE are nearly equal performers, both in IPv4 and in IPv6. In both cases, IPv4 is
significantly faster than IPv6. But due to the use of _VNI_ fields in the header, contrary
to GRE, both VXLAN and GENEVE can have the same src/dst IP for any number of tunnels, which
is a huge benefit.
#### Multithreading
Usually, the customer facing side is an ethernet port (or sub-interface with tag popping) that will be
receiving IPv4 or IPv6 traffic (either tagged or untagged) and this allows the NIC to use _RSS_ to assign
this inbound traffic to multiple queues, and thus multiple CPU threads. That's great, it means linear
encapsulation performance.
Once the traffic is encapsulated, it risks becoming a single flow with respect to the remote host, if
Rhino were to send from 10.0.0.0:4789 to Hippo's 10.0.0.1:4789. However, the VPP VXLAN and GENEVE
implementations both inspect the _inner_ payload and use it to scramble the source port (thanks to
Neale for pointing this out, it's in `vxlan/encap.c:246`). Deterministically changing the source port
based on the inner-flow will allow Hippo to use _RSS_ on the receiving end, which allows these tunneling
protocols to scale linearly. I proved this for myself by attaching a port-mirror to the switch and
copying all traffic between Hippo and Rhino to a spare machine in the rack:
```
pim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 4789
11:19:54.887763 IP 10.0.0.1.4452 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.888283 IP 10.0.0.1.42537 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.888285 IP 10.0.0.0.17895 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.899353 IP 10.0.0.1.40751 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.899355 IP 10.0.0.0.35475 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.904642 IP 10.0.0.0.60633 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298
pim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 6081
11:22:55.802406 IP 10.0.0.0.32299 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.802409 IP 10.0.0.1.44011 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.807711 IP 10.0.0.1.45503 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.807712 IP 10.0.0.0.45532 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.841495 IP 10.0.0.0.61694 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.851719 IP 10.0.0.1.47581 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:
```
Considering I was sending the T-Rex profile `bench.py` with tunables `vm=var2,size=64`, the latter
of which chooses randomized source and destination (inner) IP addresses in the loadtester, I can
conclude that the outer source port is chosen based on a hash of the inner packet. Slick!!
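Conceptually, the encapsulation path is doing something like this. The snippet below is only a simplified
Python sketch of the idea, not the actual C code referenced above, and both the bytes being hashed and the
resulting port range are illustrative assumptions on my part:
```
import zlib

def entropy_source_port(inner_frame):
    """Derive a stable UDP source port from a hash over the start of the inner
    frame (MACs, IPs, ports). The same inner flow always maps to the same outer
    port, while different flows spread out over many ports, which lets the
    receiving NIC RSS the tunneled traffic across multiple queues and threads."""
    h = zlib.crc32(inner_frame[:64])
    return 49152 + (h % 16384)   # keep the result in the ephemeral port range

# Two different inner flows will (very likely) get different outer source ports:
print(entropy_source_port(b"\x00" * 60 + b"\x01\x02\x03\x04"))
print(entropy_source_port(b"\x00" * 60 + b"\x05\x06\x07\x08"))
```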
#### Final conclusion
The most important practical conclusion to draw is that I can feel safe to offer L2VPN services at
IPng Networks using VPP and a VXLAN or GENEVE IPv4 underlay -- our backbone is 9000 bytes everywhere,
so it will be possible to provide up to 8942 bytes of customer payload taking into account the
VXLAN-v4 overhead. At least gigabit symmetric _VLLs_ filled with 64b packets will not be a
problem for the routers we have, as they forward approximately 10.2Mpps per core and 35Mpps
per chassis when fully loaded. Even considering the overhead and CPU consumption that VXLAN
encap/decap brings with it, due to the use of multiple transmit and receive threads,
the router would have plenty of room to spare.
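For reference, the 8942 byte figure is the same arithmetic as the MTU discussion at the top of this
article, applied to our 9000 byte backbone and an inner QinQ frame:
```
backbone_mtu      = 9000
vxlan_v4_overhead = 20 + 8 + 8   # IPv4 + UDP + VXLAN
inner_l2_headers  = 14 + 4 + 4   # ethernet header + dot1q tag + QinQ tag
print(backbone_mtu - vxlan_v4_overhead - inner_l2_headers)   # 8942
```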
## Appendix
The backing data for the graph in this article are captured in this [Google Sheet](https://docs.google.com/spreadsheets/d/1WZ4xvO1pAjCswpCDC9GfOGIkogS81ZES74_scHQryb0/edit?usp=sharing).
### VPP Configuration
For completeness, the `startup.conf` used on both Rhino and Hippo:
```
unix {
  nodaemon
  log /var/log/vpp/vpp.log
  full-coredump
  cli-listen /run/vpp/cli.sock
  cli-prompt rhino#
  gid vpp
}
api-trace { on }
api-segment { gid vpp }
socksvr { default }
memory {
  main-heap-size 1536M
  main-heap-page-size default-hugepage
}
cpu {
  main-core 0
  corelist-workers 1-15
}
buffers {
  buffers-per-numa 300000
  default data-size 2048
  page-size default-hugepage
}
statseg {
  size 1G
  page-size default-hugepage
  per-node-counters off
}
dpdk {
  dev default {
    num-rx-queues 7
  }
  decimal-interface-names
  dev 0000:0c:00.0
  dev 0000:0c:00.1
}
plugins {
  plugin lcpng_nl_plugin.so { enable }
  plugin lcpng_if_plugin.so { enable }
}
logging {
  default-log-level info
  default-syslog-log-level crit
  class linux-cp/if { rate-limit 10000 level debug syslog-level debug }
  class linux-cp/nl { rate-limit 10000 level debug syslog-level debug }
}
lcpng {
  default netns dataplane
  lcp-sync
  lcp-auto-subint
}
```
### Other Details
For posterity, some other stats on the VPP deployment. First of all, a confirmation that PCIe 4.0 x16
slots were used, and that the _Comms_ DDP was loaded:
```
[ 0.433903] pci 0000:0c:00.0: [8086:1592] type 00 class 0x020000
[ 0.433924] pci 0000:0c:00.0: reg 0x10: [mem 0xea000000-0xebffffff 64bit pref]
[ 0.433946] pci 0000:0c:00.0: reg 0x1c: [mem 0xee010000-0xee01ffff 64bit pref]
[ 0.433964] pci 0000:0c:00.0: reg 0x30: [mem 0xfcf00000-0xfcffffff pref]
[ 0.434104] pci 0000:0c:00.0: reg 0x184: [mem 0xed000000-0xed01ffff 64bit pref]
[ 0.434106] pci 0000:0c:00.0: VF(n) BAR0 space: [mem 0xed000000-0xedffffff 64bit pref] (contains BAR0 for 128 VFs)
[ 0.434128] pci 0000:0c:00.0: reg 0x190: [mem 0xee220000-0xee223fff 64bit pref]
[ 0.434129] pci 0000:0c:00.0: VF(n) BAR3 space: [mem 0xee220000-0xee41ffff 64bit pref] (contains BAR3 for 128 VFs)
[ 11.216343] ice 0000:0c:00.0: The DDP package was successfully loaded: ICE COMMS Package version 1.3.30.0
[ 11.280567] ice 0000:0c:00.0: PTP init successful
[ 11.317826] ice 0000:0c:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8
[ 11.317828] ice 0000:0c:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode.
[ 11.317829] ice 0000:0c:00.0: Commit DCB Configuration to the hardware
[ 11.320608] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
```
And this is how the NIC shows up in VPP; in particular, the rx/tx burst modes and functions are interesting:
```
hippo# show hardware-interfaces
              Name                Idx   Link  Hardware
HundredGigabitEthernet12/0/0       1     up   HundredGigabitEthernet12/0/0
  Link speed: 100 Gbps
  RX Queues:
    queue thread         mode
    0     vpp_wk_0 (1)   polling
    1     vpp_wk_1 (2)   polling
    2     vpp_wk_2 (3)   polling
    3     vpp_wk_3 (4)   polling
    4     vpp_wk_4 (5)   polling
    5     vpp_wk_5 (6)   polling
    6     vpp_wk_6 (7)   polling
  Ethernet address b4:96:91:b3:b1:10
  Intel E810 Family
    carrier up full duplex mtu 9190 promisc
    flags: admin-up promisc maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-supported
    Devargs:
    rx: queues 7 (max 64), desc 1024 (min 64 max 4096 align 32)
    tx: queues 16 (max 64), desc 1024 (min 64 max 4096 align 32)
    pci: device 8086:1592 subsystem 8086:0002 address 0000:0c:00.00 numa 0
    max rx packet len: 9728
    promiscuous: unicast on all-multicast on
    vlan offload: strip off filter off qinq off
    rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum qinq-strip
                       outer-ipv4-cksum vlan-filter vlan-extend jumbo-frame
                       scatter keep-crc rss-hash
    rx offload active: ipv4-cksum jumbo-frame scatter
    tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum
                       tcp-tso outer-ipv4-cksum qinq-insert multi-segs mbuf-fast-free
                       outer-udp-cksum
    tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs
    rss avail:         ipv4-frag ipv4-tcp ipv4-udp ipv4-sctp ipv4-other ipv4
                       ipv6-frag ipv6-tcp ipv6-udp ipv6-sctp ipv6-other ipv6
                       l2-payload
    rss active:        ipv4-frag ipv4-tcp ipv4-udp ipv4 ipv6-frag ipv6-tcp
                       ipv6-udp ipv6
    tx burst mode: Scalar
    tx burst function: ice_recv_scattered_pkts_vec_avx2_offload
    rx burst mode: Offload Vector AVX2 Scattered
    rx burst function: ice_xmit_pkts
```
Finally, in case it's interesting, an output of [lscpu](/assets/vpp/l2-xconnect-lscpu.txt),
[lspci](/assets/vpp/l2-xconnect-lspci.txt) and [dmidecode](/assets/vpp/l2-xconnect-dmidecode.txt)
as run on Hippo (Rhino is an identical machine).