---
date: "2022-01-12T18:35:14Z"
title: Case Study - Virtual Leased Line (VLL) in VPP
aliases:
- /s/articles/2022/01/13/vpp-l2.html
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

After completing the Linux CP plugin, interfaces and their attributes such as addresses and routes
can be shared between VPP and the Linux kernel in a clever way, so running software like
[FRR](https://frrouting.org/) or [Bird](https://bird.network.cz/) on top of VPP and achieving
>100Mpps and >100Gbps forwarding rates is easily within reach!

If you've read my previous articles (thank you!), you will have noticed that I have done a lot of
work on making VPP work well in an ISP (BGP/OSPF) environment with Linux CP. However, there are many
other cool things about VPP that make it a very competent advanced services router. One that has
always been super interesting to me is being able to offer L2 connectivity over a wide area network,
for example a virtual leased line from our colo in Zurich to NIKHEF in Amsterdam. This article
explores this space.

***NOTE***: If you're only interested in results, scroll all the way down to the markdown table and
graph for performance stats.

## Introduction

ISPs can offer ethernet services, often called _Virtual Leased Lines_ (VLLs), _Layer2 VPN_ (L2VPN)
or _Ethernet Backhaul_. They all mean the same thing: imagine a switchport in location A that appears
to be transparently and directly connected to a switchport in location B, with the ISP (layer3, so
IPv4 and IPv6) network in between. The "simple" old-school setup would be to have switches which
define VLANs and are all interconnected. But we collectively learned that this is a bad idea, for
several reasons:

* Large broadcast domains tend to encounter L2 forwarding loops sooner rather than later.
* Spanning-Tree and its kin are a stopgap, but they often disable an entire port from forwarding,
  which can be expensive if that port is connected to a dark fiber into another datacenter far
  away.
* Large VLAN setups that are intended to interconnect with other operators run into overlapping
  VLAN tags, which means switches have to do tag rewriting and filtering and such.
* Traffic engineering is all but non-existent in L2-only networking domains, while L3 has all sorts
  of smart TE extensions, ECMP, and so on.

The canonical solution is for ISPs to encapsulate the ethernet traffic of their customers in some
tunneling mechanism, for example in MPLS or in some IP tunneling protocol. Fundamentally, these are
the same, except for the chosen protocol and the overhead/cost of forwarding. MPLS is a very thin
layer under the packet, but IP based tunneling mechanisms are also widely used, most commonly GRE,
VXLAN and GENEVE, although many others exist.

They all work roughly the same:

* An IP packet has a _maximum transmission unit_ (MTU) of 1500 bytes, while the ethernet header is
  typically an additional 14 bytes: a 6 byte source MAC, 6 byte destination MAC, and 2 byte ethernet
  type, which is 0x0800 for an IPv4 datagram, 0x0806 for ARP, and 0x86dd for IPv6, among many others
  [[ref](https://en.wikipedia.org/wiki/EtherType)].
* If VLANs are used, an additional 4 bytes are needed [[ref](https://en.wikipedia.org/wiki/IEEE_802.1Q)],
  making the ethernet frame at most 1518 bytes long, with an ethertype of 0x8100.
* If QinQ or QinAD is used, yet another 4 bytes are needed [[ref](https://en.wikipedia.org/wiki/IEEE_802.1ad)],
  making the ethernet frame at most 1522 bytes long, with an ethertype of either 0x8100 or 0x9100,
  depending on the implementation.
* We can take such an ethernet frame and make it the _payload_ of another IP packet, encapsulating the
  original ethernet frame in a new IPv4 or IPv6 packet. We can then route it over an IP network to a
  remote site.
* Upon receipt of such a packet, by looking at the headers the remote router can determine that this
  packet represents an encapsulated ethernet frame, unpack it all, and forward the original frame onto
  a given interface.

### IP Tunneling Protocols

First let's get some theory out of the way -- I'll discuss three common IP tunneling protocols here, and then
move on to demonstrate how they are configured in VPP and, perhaps more importantly, how they perform in VPP.
Each tunneling protocol has its own advantages and disadvantages, but I'll stick to the basics first:

#### GRE: Generic Routing Encapsulation

_Generic Routing Encapsulation_ (GRE, described in [RFC2784](https://datatracker.ietf.org/doc/html/rfc2784)) is a
very old and well known tunneling protocol. The packet is an IP datagram with protocol number 47, consisting
of a header with 4 bits of flags, 9 reserved bits, 3 bits for the version (normally set to all-zeros), and
16 bits for the inner protocol (ether)type, so 0x0800 for IPv4, 0x8100 for 802.1q and so on. It's a very
small header of only 4 bytes, plus an optional key (4 bytes) and sequence number (also 4 bytes), which means
that to be able to transport any ethernet frame (including the fancy QinQ and QinAD ones), the _underlay_
must have an end to end MTU of at least 1522 + 20(IPv4) + 12(GRE) = ***1554 bytes for IPv4*** and ***1574
bytes for IPv6***.

#### VXLAN: Virtual Extensible LAN

_Virtual Extensible LAN_ (VXLAN, described in [RFC7348](https://datatracker.ietf.org/doc/html/rfc7348))
is a UDP datagram which has a header consisting of 8 bits worth
of flags, 24 bits reserved for future expansion, 24 bits of _Virtual Network Identifier_ (VNI) and
an additional 8 bits of reserved space at the end. It uses UDP port 4789 as assigned by IANA. VXLAN
encapsulation adds 20(IPv4) + 8(UDP) + 8(VXLAN) = 36 bytes, and since an IPv6 header is 40 bytes, it
adds 56 bytes there. This means that to be able to transport any ethernet frame, the _underlay_
network must have an end to end MTU of at least 1522+36 = ***1558 bytes for IPv4*** and ***1578 bytes
for IPv6***.

#### GENEVE: Generic Network Virtualization Encapsulation

_GEneric NEtwork Virtualization Encapsulation_ (GENEVE, described in [RFC8926](https://datatracker.ietf.org/doc/html/rfc8926))
is somewhat similar to VXLAN, although it was an attempt to stop the wild growth of tunneling protocols --
I'm sure there is an [XKCD](https://xkcd.com/927/) out there specifically for this approach. The packet is
also a UDP datagram, with destination port 6081, followed by an 8 byte GENEVE specific header containing
2 bits of version, 6 bits of option length, 8 bits for flags, a 16 bit inner ethertype, a 24 bit
_Virtual Network Identifier_ (VNI), and 8 bits of reserved space. With GENEVE, several options are
available and will be tacked onto the GENEVE header, but they are typically not used. If they are though,
the options can add an additional 16 bytes, which means that to be able to transport any ethernet frame,
the _underlay_ network must have an end to end MTU of at least 1522+52 = ***1574 bytes for IPv4*** and
***1594 bytes for IPv6***.

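To sanity-check these MTU numbers, here's a small Python sketch (my own back-of-the-envelope arithmetic
based on the header sizes above, not anything from VPP) that reproduces the underlay MTU requirements for
all three protocols over IPv4 and IPv6:

```
ETH, DOT1Q, QINQ = 14, 4, 4          # ethernet header, 802.1q tag, 802.1ad outer tag
IP4, IP6, UDP = 20, 40, 8            # outer header sizes

inner_frame = 1500 + ETH + DOT1Q + QINQ          # 1522: largest customer frame we want to carry

overhead = {
    "GRE (with key+seq)": 4 + 4 + 4,             # base header + optional key + sequence number
    "VXLAN": UDP + 8,                            # UDP + VXLAN header
    "GENEVE (with 16B options)": UDP + 8 + 16,   # UDP + GENEVE header + options
}

for name, extra in overhead.items():
    for ipname, iphdr in (("IPv4", IP4), ("IPv6", IP6)):
        print(f"{name} over {ipname}: underlay MTU >= {inner_frame + iphdr + extra}")
```
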
### Hardware setup

First let's take a look at the physical setup. I'm using three servers and a switch in the IPng Networks lab:

{{< image width="400px" float="right" src="/assets/vpp/l2-xconnect-lab.png" alt="Loadtest Setup" >}}

* `hvn0`: Dell R720xd, load generator
  * Dual E5-2620, 24 CPUs, 2 threads per core, 2 numa nodes
  * 64GB DDR3 at 1333MT/s
  * Intel X710 4-port 10G, Speed 8.0GT/s Width x8 (64 Gbit/s)
* `Hippo` and `Rhino`: VPP routers
  * ASRock B550 Taichi
  * Ryzen 5950X, 32 CPUs, 2 threads per core, 1 numa node
  * 64GB DDR4 at 2133 MT/s
  * Intel 810C 2-port 100G, Speed 16.0 GT/s Width x16 (256 Gbit/s)
* `fsw0`: FS.com switch S5860-48SC, 8x 100G, 48x 10G
  * VLAN 4 (blue) connects Rhino's `Hu12/0/1` to Hippo's `Hu12/0/1`
  * VLAN 5 (red) connects hvn0's `enp5s0f0` to Rhino's `Hu12/0/0`
  * VLAN 6 (green) connects hvn0's `enp5s0f1` to Hippo's `Hu12/0/0`
* All switchports have jumbo frames enabled and are set to 9216 bytes.

Further, Hippo and Rhino are running VPP at head `vpp v22.02-rc0~490-gde3648db0`, and hvn0 is running
T-Rex v2.93 in L2 mode, with MAC address `00:00:00:01:01:00` on the first port and MAC address
`00:00:00:02:01:00` on the second port. This machine can saturate 10G in both directions with small
packets even when using only one flow, as can be seen when the ports are simply looped back onto one
another, for example by physically cross-connecting them with an SFP+ or DAC, or, in my case, by putting
`fsw0` ports `Te0/1` and `Te0/2` in the same VLAN together:

{{< image width="800px" src="/assets/vpp/l2-xconnect-trex.png" alt="TRex on hvn0" >}}

Now that I have shared all the context and hardware, I'm ready to actually dive in to what I wanted to
talk about: what all this _virtual leased line_ business looks like in VPP. Ready? Here we go!

### Direct L2 CrossConnect

The simplest thing I can show in VPP is to configure a layer2 cross-connect (_l2 xconnect_) between
two ports. In this case, VPP doesn't even need to have an IP address; all I do is bring up the ports and
set their MTU to be able to carry 1522 byte frames (ethernet at 1514, dot1q at 1518, and QinQ
at 1522 bytes). The configuration is identical on both Rhino and Hippo:

```
set interface state HundredGigabitEthernet12/0/0 up
set interface state HundredGigabitEthernet12/0/1 up
set interface mtu packet 1522 HundredGigabitEthernet12/0/0
set interface mtu packet 1522 HundredGigabitEthernet12/0/1
set interface l2 xconnect HundredGigabitEthernet12/0/0 HundredGigabitEthernet12/0/1
set interface l2 xconnect HundredGigabitEthernet12/0/1 HundredGigabitEthernet12/0/0
```

I'd say the only thing to keep in mind here is that the cross-connect commands only
link in one direction (receive in A, forward to B), and that's why I have to type them twice (receive in B,
forward to A). Of course, this must be really cheap on VPP -- because all it has to do now is receive
from DPDK and immediately schedule for transmit on the other port. Looking at `show runtime` I can
see how much CPU time is spent in each of VPP's nodes:

```
Time 1241.5, 10 sec internal node vector rate 28.70 loops/sec 475009.85
  vector rates in 1.4879e7, out 1.4879e7, drop 0.0000e0, punt 0.0000e0
             Name                    Calls         Vectors       Clocks   Vectors/Call
HundredGigabitEthernet12/0/1-o    650727833     18472218801     7.49e0       28.39
HundredGigabitEthernet12/0/1-t    650727833     18472218801     4.12e1       28.39
ethernet-input                    650727833     18472218801     5.55e1       28.39
l2-input                          650727833     18472218801     1.52e1       28.39
l2-output                         650727833     18472218801     1.32e1       28.39
```

In this simple cross connect mode, the only thing VPP has to do is receive the ethernet, funnel it
into `l2-input`, and immediately send it straight through `l2-output` back out, which does not cost
much in terms of CPU cycles at all. In total, this CPU thread is forwarding 14.88Mpps (line rate 10G
at 64 bytes), at an average of 133 cycles per packet (not counting the time spent in DPDK). The CPU
has room to spare in this mode; in other words, even _one CPU thread_ can handle this workload at
line rate, impressive!

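As a sanity check on these numbers (my own arithmetic, not VPP output): the cycles/packet figure is simply
the sum of the per-node clocks in the `show runtime` snippet above, and 14.88Mpps is the theoretical 64 byte
line rate of a 10G port. In Python:

```
# Sum of the per-node Clocks column from the 'show runtime' output above.
node_clocks = {
    "HundredGigabitEthernet12/0/1-o": 7.49,
    "HundredGigabitEthernet12/0/1-t": 41.2,
    "ethernet-input": 55.5,
    "l2-input": 15.2,
    "l2-output": 13.2,
}
print(f"cycles/packet: {sum(node_clocks.values()):.2f}")            # ~132.6

# A 64 byte frame occupies 84 bytes on the wire: 64 + 8 (preamble/SFD) + 12 (inter-frame gap).
print(f"10G line rate at 64b: {10e9 / (84 * 8) / 1e6:.2f} Mpps")    # ~14.88
```
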
Although cool, doing an L2 crossconnect like this isn't super useful. Usually, the customer leased line
has to be transported to another location, and for that we'll need some form of encapsulation ...

### Crossconnect over IPv6 VXLAN

Let's start with VXLAN. The concept is pretty straightforward in VPP. Based on the configuration
I put in Rhino and Hippo above, I first will have to bring `Hu12/0/1` out of L2 mode, give both interfaces an
IPv6 address, create a tunnel with a given _VNI_, and then crossconnect the customer side `Hu12/0/0`
into the `vxlan_tunnel0` and vice-versa. Piece of cake:

```
## On Rhino
set interface mtu packet 1600 HundredGigabitEthernet12/0/1
set interface l3 HundredGigabitEthernet12/0/1
set interface ip address HundredGigabitEthernet12/0/1 2001:db8::1/64
create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0

## On Hippo
set interface mtu packet 1600 HundredGigabitEthernet12/0/1
set interface l3 HundredGigabitEthernet12/0/1
set interface ip address HundredGigabitEthernet12/0/1 2001:db8::2/64
create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0
```

Of course, now we're actually beginning to make VPP do some work, and the exciting thing is, if there
were an (opaque) ISP network between Rhino and Hippo, this would work just fine, considering the
encapsulation is 'just' IPv6 UDP. Under the covers, for each received frame, VPP has to encapsulate it
into VXLAN and route the resulting L3 packet by doing an IPv6 routing table lookup:

```
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 32132.74
  vector rates in 8.5423e6, out 8.5423e6, drop 0.0000e0, punt 0.0000e0
             Name                    Calls       Vectors      Clocks   Vectors/Call
HundredGigabitEthernet12/0/0-o      333777     85445944      2.74e0      255.99
HundredGigabitEthernet12/0/0-t      333777     85445944      5.28e1      255.99
ethernet-input                      333777     85445944      4.25e1      255.99
ip6-input                           333777     85445944      1.25e1      255.99
ip6-lookup                          333777     85445944      2.41e1      255.99
ip6-receive                         333777     85445944      1.71e2      255.99
ip6-udp-lookup                      333777     85445944      1.55e1      255.99
l2-input                            333777     85445944      8.94e0      255.99
l2-output                           333777     85445944      4.44e0      255.99
vxlan6-input                        333777     85445944      2.12e1      255.99
```

I can definitely see a lot more action here. In this mode, VPP is handling 8.54Mpps on this CPU thread
before saturating. At full load, VPP is spending 356 CPU cycles per packet, of which almost half is in
the `ip6-receive` node.

### Crossconnect over IPv4 VXLAN

Seeing `ip6-receive` being such a big part of the cost (almost half!), I wonder what it might look like if
I change the tunnel to use IPv4. So I'll give Rhino and Hippo an IPv4 address as well, delete the vxlan tunnel I made
before (the IPv6 one), and create a new one with IPv4:

```
set interface ip address HundredGigabitEthernet12/0/1 10.0.0.0/31
create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298 del
create vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0

set interface ip address HundredGigabitEthernet12/0/1 10.0.0.1/31
create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298 del
create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0
```

And after letting this run for a few seconds, I can take a look and see how the `ip4-*` version of
the VPP code performs:

```
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 53309.71
  vector rates in 1.4151e7, out 1.4151e7, drop 0.0000e0, punt 0.0000e0
             Name                    Calls       Vectors      Clocks   Vectors/Call
HundredGigabitEthernet12/0/0-o      552890    141539600      2.76e0      255.99
HundredGigabitEthernet12/0/0-t      552890    141539600      5.30e1      255.99
ethernet-input                      552890    141539600      4.13e1      255.99
ip4-input-no-checksum               552890    141539600      1.18e1      255.99
ip4-lookup                          552890    141539600      1.68e1      255.99
ip4-receive                         552890    141539600      2.74e1      255.99
ip4-udp-lookup                      552890    141539600      1.79e1      255.99
l2-input                            552890    141539600      8.68e0      255.99
l2-output                           552890    141539600      4.41e0      255.99
vxlan4-input                        552890    141539600      1.76e1      255.99
```

Throughput is now quite a bit higher, clocking a cool 14.2Mpps (just short of line rate!) at 202 CPU
cycles per packet, considerably less time spent than with IPv6 -- but keep in mind that VPP has an ~empty
routing table in all of these tests.

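The gap between the two VXLAN runs is almost entirely explained by the cost of terminating the tunnel in
IPv6 versus IPv4. Summing the per-node clocks of the two `show runtime` snippets side by side makes this
visible (node names abbreviated; again just my own arithmetic over the outputs shown above):

```
# Per-node clocks from the VXLAN-v6 and VXLAN-v4 'show runtime' snippets above.
vxlan_v6 = {"Hu12/0/0-o": 2.74, "Hu12/0/0-t": 52.8, "ethernet-input": 42.5,
            "ip6-input": 12.5, "ip6-lookup": 24.1, "ip6-receive": 171.0,
            "ip6-udp-lookup": 15.5, "l2-input": 8.94, "l2-output": 4.44,
            "vxlan6-input": 21.2}
vxlan_v4 = {"Hu12/0/0-o": 2.76, "Hu12/0/0-t": 53.0, "ethernet-input": 41.3,
            "ip4-input-no-checksum": 11.8, "ip4-lookup": 16.8, "ip4-receive": 27.4,
            "ip4-udp-lookup": 17.9, "l2-input": 8.68, "l2-output": 4.41,
            "vxlan4-input": 17.6}
print(f"VXLAN-v6: {sum(vxlan_v6.values()):.2f} cycles/packet")    # ~356
print(f"VXLAN-v4: {sum(vxlan_v4.values()):.2f} cycles/packet")    # ~202
print(f"ip6-receive vs ip4-receive: {vxlan_v6['ip6-receive']} vs {vxlan_v4['ip4-receive']} clocks")
```
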
### Crossconnect over IPv6 GENEVE

Another popular cross connect type, also based on IPv4 and IPv6 UDP packets, is GENEVE. The configuration
is almost identical, so I delete the IPv4 VXLAN and create an IPv6 GENEVE tunnel instead:

```
create vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298 del
create geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0

create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298 del
create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0
```

All the while, the T-Rex on the customer machine `hvn0` is sending 14.88Mpps in both directions, and
after just a short (second or so) interruption, the GENEVE tunnel comes up, cross-connects into the
customer `Hu12/0/0` interfaces, and starts to carry traffic:

```
Thread 8 vpp_wk_7 (lcore 8)
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 29688.03
  vector rates in 8.3179e6, out 8.3179e6, drop 0.0000e0, punt 0.0000e0
             Name                    Calls       Vectors      Clocks   Vectors/Call
HundredGigabitEthernet12/0/0-o      324981     83194664      2.74e0      255.99
HundredGigabitEthernet12/0/0-t      324981     83194664      5.18e1      255.99
ethernet-input                      324981     83194664      4.26e1      255.99
geneve6-input                       324981     83194664      3.87e1      255.99
ip6-input                           324981     83194664      1.22e1      255.99
ip6-lookup                          324981     83194664      2.39e1      255.99
ip6-receive                         324981     83194664      1.67e2      255.99
ip6-udp-lookup                      324981     83194664      1.54e1      255.99
l2-input                            324981     83194664      9.28e0      255.99
l2-output                           324981     83194664      4.47e0      255.99
```

Similar to VXLAN over IPv6, the total for GENEVE-v6 is also comparatively slow (I say comparatively,
because you should not expect anything like this performance when using Linux or BSD kernel routing!).
The lower throughput is again due to the costly `ip6-receive` node. GENEVE-v6 is a slightly worse
performer, at 8.32Mpps per core and 368 CPU cycles per packet.

### Crossconnect over IPv4 GENEVE

I now suspect that GENEVE over IPv4 will show gains similar to what I saw when I switched from VXLAN
IPv6 to IPv4 above. So I remove the IPv6 tunnel, create a new IPv4 tunnel instead, and hook it
back up to the customer port on both Rhino and Hippo, like so:

```
create geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298 del
create geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0

create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298 del
create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0
```

And the results, indeed a significant improvement:

```
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 48639.97
  vector rates in 1.3737e7, out 1.3737e7, drop 0.0000e0, punt 0.0000e0
             Name                    Calls       Vectors      Clocks   Vectors/Call
HundredGigabitEthernet12/0/0-o      536763    137409904      2.76e0      255.99
HundredGigabitEthernet12/0/0-t      536763    137409904      5.19e1      255.99
ethernet-input                      536763    137409904      4.19e1      255.99
geneve4-input                       536763    137409904      2.39e1      255.99
ip4-input-no-checksum               536763    137409904      1.18e1      255.99
ip4-lookup                          536763    137409904      1.69e1      255.99
ip4-receive                         536763    137409904      2.71e1      255.99
ip4-udp-lookup                      536763    137409904      1.79e1      255.99
l2-input                            536763    137409904      8.81e0      255.99
l2-output                           536763    137409904      4.47e0      255.99
```

So, close to line rate again! Performance of GENEVE-v4 clocks in at 13.7Mpps per core or 207
CPU cycles per packet.

### Crossconnect over IPv6 GRE

Now I can't help but wonder: if those `ip4|6-udp-lookup` nodes burn valuable CPU cycles, perhaps
GRE will do better, because it's an L3 protocol (protocol number 47) and will never have to
inspect beyond the IP header. So I delete the GENEVE tunnel and give GRE a go too:

```
create geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298 del
create gre tunnel src 2001:db8::1 dst 2001:db8::2 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0

create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298 del
create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0
```

Results:

```
Time 10.0, 10 sec internal node vector rate 255.99 loops/sec 37129.87
  vector rates in 9.9254e6, out 9.9254e6, drop 0.0000e0, punt 0.0000e0
             Name                    Calls       Vectors      Clocks   Vectors/Call
HundredGigabitEthernet12/0/0-o      387881     99297464      2.80e0      255.99
HundredGigabitEthernet12/0/0-t      387881     99297464      5.21e1      255.99
ethernet-input                      775762    198594928      5.97e1      255.99
gre6-input                          387881     99297464      2.81e1      255.99
ip6-input                           387881     99297464      1.21e1      255.99
ip6-lookup                          387881     99297464      2.39e1      255.99
ip6-receive                         387881     99297464      5.09e1      255.99
l2-input                            387881     99297464      9.35e0      255.99
l2-output                           387881     99297464      4.40e0      255.99
```

The performance of GRE-v6 (in transparent ethernet bridge, aka _TEB_, mode) is 9.9Mpps per core or
243 CPU cycles per packet. I'll also note that while the `ip6-receive` node in all the UDP based
tunneling protocols was in the 170 clocks/packet arena, we're now down to only 51 or so -- indeed a
huge improvement.

### Crossconnect over IPv4 GRE

To round off the set, I'll remove the IPv6 GRE tunnel and put an IPv4 GRE tunnel in place:

```
create gre tunnel src 2001:db8::1 dst 2001:db8::2 teb del
create gre tunnel src 10.0.0.0 dst 10.0.0.1 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0

create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb del
create gre tunnel src 10.0.0.1 dst 10.0.0.0 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0
```

And without further ado:

```
Time 10.0, 10 sec internal node vector rate 255.87 loops/sec 52898.61
  vector rates in 1.4039e7, out 1.4039e7, drop 0.0000e0, punt 0.0000e0
             Name                    Calls       Vectors      Clocks   Vectors/Call
HundredGigabitEthernet12/0/0-o      548684    140435080      2.80e0      255.95
HundredGigabitEthernet12/0/0-t      548684    140435080      5.22e1      255.95
ethernet-input                     1097368    280870160      2.92e1      255.95
gre4-input                          548684    140435080      2.51e1      255.95
ip4-input-no-checksum               548684    140435080      1.19e1      255.95
ip4-lookup                          548684    140435080      1.68e1      255.95
ip4-receive                         548684    140435080      2.03e1      255.95
l2-input                            548684    140435080      8.72e0      255.95
l2-output                           548684    140435080      4.43e0      255.95
```

The performance of GRE-v4 (in transparent ethernet bridge, aka _TEB_, mode) is 14.0Mpps per
core or 171 CPU cycles per packet. This is really very low -- the best of all the tunneling
protocols -- but (for obvious reasons) it will not outperform a direct L2 crossconnect, as that
cuts out the L3 (and L4) middleperson entirely. Woohoo!

## Conclusions

First, let me give a recap of the tests I did, ordered from best to worst performer, left to right.

Test          | L2XC        | GRE-v4    | VXLAN-v4  | GENEVE-v4  | GRE-v6     | VXLAN-v6  | GENEVE-v6
------------- | ----------- | --------- | --------- | ---------- | ---------- | --------- | ---------
pps/core      | >14.88M     | 14.34M    | 14.15M    | 13.74M     | 9.93M      | 8.54M     | 8.32M
cycles/packet | 132.59      | 171.45    | 201.65    | 207.44     | 243.35     | 355.72    | 368.09

***(!)*** Achtung! Because in the L2XC mode the CPU was not fully consumed (VPP was consuming only
~28 frames per vector), it did not yet achieve its optimum CPU performance. Under full load, the
cycles/packet will be somewhat lower than what is shown here.

Taking a closer look at the VPP nodes in use, below I draw a graph of CPU cycles spent in each VPP
node for each type of cross connect, where the lower the stack is, the faster the cross connect will
be:

{{< image width="1000px" src="/assets/vpp/l2-xconnect-cycles.png" alt="Cycles by node" >}}

Although GREv4 is clearly the winner, I still would not use it, for the following reason:
VPP does not support GRE keys, and considering GRE is an L3 protocol, I would have to use unique
IPv4 or IPv6 addresses for each tunnel src/dst pair; otherwise VPP will not know, upon receipt of
a GRE packet, which tunnel it belongs to. For IPv6 this is not a huge deal (I can bind a whole
/64 to a loopback and just be done with it), but GRE-v6 does not perform as well as VXLAN-v4 or
GENEVE-v4.

VXLAN and GENEVE are equal performers, both in IPv4 and in IPv6. In both cases, IPv4 is
significantly faster than IPv6. But due to the use of _VNI_ fields in the header, contrary
to GRE, both VXLAN and GENEVE can have the same src/dst IP for any number of tunnels, which
is a huge benefit.

#### Multithreading

Usually, the customer facing side is an ethernet port (or sub-interface with tag popping) that will be
receiving IPv4 or IPv6 traffic (either tagged or untagged), and this allows the NIC to use _RSS_ to assign
this inbound traffic to multiple queues, and thus multiple CPU threads. That's great: it means linear
encapsulation performance.

Once the traffic is encapsulated, it risks becoming a single flow with respect to the remote host, if
Rhino were to send everything from 10.0.0.0:4789 to Hippo's 10.0.0.1:4789. However, the VPP VXLAN and GENEVE
implementations both inspect the _inner_ payload and use it to scramble the source port (thanks to
Neale for pointing this out, it's in `vxlan/encap.c:246`). Deterministically changing the source port
based on the inner flow allows Hippo to use _RSS_ on the receiving end, which lets these tunneling
protocols scale linearly. I proved this for myself by attaching a port-mirror to the switch and
copying all traffic between Hippo and Rhino to a spare machine in the rack:

```
pim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 4789
11:19:54.887763 IP 10.0.0.1.4452 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.888283 IP 10.0.0.1.42537 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.888285 IP 10.0.0.0.17895 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.899353 IP 10.0.0.1.40751 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.899355 IP 10.0.0.0.35475 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.904642 IP 10.0.0.0.60633 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298

pim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 6081
11:22:55.802406 IP 10.0.0.0.32299 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.802409 IP 10.0.0.1.44011 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.807711 IP 10.0.0.1.45503 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.807712 IP 10.0.0.0.45532 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.841495 IP 10.0.0.0.61694 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.851719 IP 10.0.0.1.47581 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:
```

Considering I was sending the T-Rex profile `bench.py` with tunables `vm=var2,size=64`, which chooses
randomized source and destination (inner) IP addresses in the loadtester, I can conclude that the outer
source port is indeed chosen based on a hash of the inner packet. Slick!!

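To make the idea concrete, here is a tiny Python sketch of the "hash the inner flow, use it as the outer
UDP source port" trick. This is purely illustrative: VPP's real implementation lives in `vxlan/encap.c`
and uses its own flow hash; the CRC32-based function and names below are assumptions of mine, for
demonstration only:

```
import zlib

def outer_src_port(inner_src_ip: str, inner_dst_ip: str,
                   inner_src_port: int, inner_dst_port: int) -> int:
    """Map an inner flow onto a deterministic outer UDP source port.

    Equal inner flows always get the same outer port; different inner flows
    spread out over many ports, so the receiving NIC can RSS on the outer
    5-tuple and fan the traffic out over multiple worker threads."""
    key = f"{inner_src_ip}|{inner_dst_ip}|{inner_src_port}|{inner_dst_port}".encode()
    return 1024 + zlib.crc32(key) % (65536 - 1024)    # stay out of the well-known port range

print(outer_src_port("10.1.1.1", "10.2.2.2", 1234, 80))   # same inner flow -> same outer port every time
print(outer_src_port("10.1.1.3", "10.2.2.2", 1234, 80))   # different flow -> (very likely) a different port
```
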
#### Final conclusion

The most important practical conclusion to draw is that I can feel safe to offer L2VPN services at
IPng Networks using VPP and a VXLAN or GENEVE IPv4 underlay -- our backbone is 9000 bytes everywhere,
so it will be possible to provide up to 8942 bytes of customer payload taking into account the
VXLAN-v4 overhead. At least gigabit symmetric _VLLs_ filled with 64b packets will not be a
problem for the routers we have, as they forward approximately 10.2Mpps per core and 35Mpps
per chassis when fully loaded. Even considering the overhead and CPU consumption that VXLAN
encap/decap brings with it, due to the use of multiple transmit and receive threads,
the router would have plenty of room to spare.

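The 8942 byte figure is just the overhead arithmetic from earlier in this article applied to a 9000 byte
backbone (my own quick check, assuming a QinQ customer frame carried over VXLAN-v4):

```
BACKBONE_MTU = 9000
VXLAN_V4_OVERHEAD = 20 + 8 + 8     # outer IPv4 + UDP + VXLAN headers
INNER_L2_HEADERS = 14 + 4 + 4      # customer ethernet header + dot1q + QinQ tags

print(BACKBONE_MTU - VXLAN_V4_OVERHEAD - INNER_L2_HEADERS)   # 8942 bytes of customer payload
```
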
## Appendix

The backing data for the graph in this article are captured in this [Google Sheet](https://docs.google.com/spreadsheets/d/1WZ4xvO1pAjCswpCDC9GfOGIkogS81ZES74_scHQryb0/edit?usp=sharing).

### VPP Configuration

For completeness, the `startup.conf` used on both Rhino and Hippo:

```
unix {
  nodaemon
  log /var/log/vpp/vpp.log
  full-coredump
  cli-listen /run/vpp/cli.sock
  cli-prompt rhino#
  gid vpp
}

api-trace { on }
api-segment { gid vpp }
socksvr { default }

memory {
  main-heap-size 1536M
  main-heap-page-size default-hugepage
}

cpu {
  main-core 0
  corelist-workers 1-15
}

buffers {
  buffers-per-numa 300000
  default data-size 2048
  page-size default-hugepage
}

statseg {
  size 1G
  page-size default-hugepage
  per-node-counters off
}

dpdk {
  dev default {
    num-rx-queues 7
  }
  decimal-interface-names
  dev 0000:0c:00.0
  dev 0000:0c:00.1
}

plugins {
  plugin lcpng_nl_plugin.so { enable }
  plugin lcpng_if_plugin.so { enable }
}

logging {
  default-log-level info
  default-syslog-log-level crit
  class linux-cp/if { rate-limit 10000 level debug syslog-level debug }
  class linux-cp/nl { rate-limit 10000 level debug syslog-level debug }
}

lcpng {
  default netns dataplane
  lcp-sync
  lcp-auto-subint
}
```

### Other Details

For posterity, some other stats on the VPP deployment. First of all, a confirmation that PCIe 4.0 x16
slots were used, and that the _Comms_ DDP was loaded:

```
[    0.433903] pci 0000:0c:00.0: [8086:1592] type 00 class 0x020000
[    0.433924] pci 0000:0c:00.0: reg 0x10: [mem 0xea000000-0xebffffff 64bit pref]
[    0.433946] pci 0000:0c:00.0: reg 0x1c: [mem 0xee010000-0xee01ffff 64bit pref]
[    0.433964] pci 0000:0c:00.0: reg 0x30: [mem 0xfcf00000-0xfcffffff pref]
[    0.434104] pci 0000:0c:00.0: reg 0x184: [mem 0xed000000-0xed01ffff 64bit pref]
[    0.434106] pci 0000:0c:00.0: VF(n) BAR0 space: [mem 0xed000000-0xedffffff 64bit pref] (contains BAR0 for 128 VFs)
[    0.434128] pci 0000:0c:00.0: reg 0x190: [mem 0xee220000-0xee223fff 64bit pref]
[    0.434129] pci 0000:0c:00.0: VF(n) BAR3 space: [mem 0xee220000-0xee41ffff 64bit pref] (contains BAR3 for 128 VFs)
[   11.216343] ice 0000:0c:00.0: The DDP package was successfully loaded: ICE COMMS Package version 1.3.30.0
[   11.280567] ice 0000:0c:00.0: PTP init successful
[   11.317826] ice 0000:0c:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8
[   11.317828] ice 0000:0c:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode.
[   11.317829] ice 0000:0c:00.0: Commit DCB Configuration to the hardware
[   11.320608] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
```

And how the NIC shows up in VPP; in particular, the rx/tx burst modes and functions are interesting:

```
hippo# show hardware-interfaces
              Name                Idx   Link  Hardware
HundredGigabitEthernet12/0/0       1     up   HundredGigabitEthernet12/0/0
  Link speed: 100 Gbps
  RX Queues:
    queue thread         mode
    0     vpp_wk_0 (1)   polling
    1     vpp_wk_1 (2)   polling
    2     vpp_wk_2 (3)   polling
    3     vpp_wk_3 (4)   polling
    4     vpp_wk_4 (5)   polling
    5     vpp_wk_5 (6)   polling
    6     vpp_wk_6 (7)   polling
  Ethernet address b4:96:91:b3:b1:10
  Intel E810 Family
    carrier up full duplex mtu 9190 promisc
    flags: admin-up promisc maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-supported
    Devargs:
    rx: queues 7 (max 64), desc 1024 (min 64 max 4096 align 32)
    tx: queues 16 (max 64), desc 1024 (min 64 max 4096 align 32)
    pci: device 8086:1592 subsystem 8086:0002 address 0000:0c:00.00 numa 0
    max rx packet len: 9728
    promiscuous: unicast on all-multicast on
    vlan offload: strip off filter off qinq off
    rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum qinq-strip
                       outer-ipv4-cksum vlan-filter vlan-extend jumbo-frame
                       scatter keep-crc rss-hash
    rx offload active: ipv4-cksum jumbo-frame scatter
    tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum
                       tcp-tso outer-ipv4-cksum qinq-insert multi-segs mbuf-fast-free
                       outer-udp-cksum
    tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs
    rss avail:         ipv4-frag ipv4-tcp ipv4-udp ipv4-sctp ipv4-other ipv4
                       ipv6-frag ipv6-tcp ipv6-udp ipv6-sctp ipv6-other ipv6
                       l2-payload
    rss active:        ipv4-frag ipv4-tcp ipv4-udp ipv4 ipv6-frag ipv6-tcp
                       ipv6-udp ipv6
    tx burst mode: Scalar
    tx burst function: ice_recv_scattered_pkts_vec_avx2_offload
    rx burst mode: Offload Vector AVX2 Scattered
    rx burst function: ice_xmit_pkts
```

Finally, in case it's interesting, here is the output of [lscpu](/assets/vpp/l2-xconnect-lscpu.txt),
[lspci](/assets/vpp/l2-xconnect-lspci.txt) and [dmidecode](/assets/vpp/l2-xconnect-dmidecode.txt)
as run on Hippo (Rhino is an identical machine).