---
date: "2023-10-21T11:35:14Z"
title: VPP IXP Gateway - Part 1
aliases:
- /s/articles/2023/10/21/vpp-ixp-gateway-1.html
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

There are some really fantastic features in VPP, some of which are lesser known and not always
very well documented. In this article, I will describe a unique use case in which I think VPP will
excel, notably acting as a gateway for Internet Exchange Points.

In this first article, I'll take a closer look at three things that would make such a gateway
possible: bridge domains, MAC address filtering and traffic shaping.

## Introduction

Internet Exchanges are typically L2 (ethernet) switch platforms that allow their connected members
to exchange traffic amongst themselves. Not all members share physical locations with the Internet
Exchange itself; for example the IXP may be at NTT Zurich, but the member may be present in
Interxion Zurich. For smaller clubs, like IPng Networks, it's not always financially feasible (or
desirable) to order a dark fiber between two adjacent datacenters, or even a cross connect in the
same datacenter (as many of them charge exorbitant fees for what is essentially passive fiber
optics and patch panels), if the amount of traffic passed is modest.

One solution to such problems is to have one member transport multiple end-user downstream members
to the platform, for example by means of an Ethernet over MPLS or VxLAN transport from where the
end-user lives, to the physical port of the Internet Exchange. These transport members are often
called _IXP Resellers_, noting that usually, but not always, some form of payment is required.

From the point of view of the IXP, there is often a one-MAC-address-per-member limitation, and not
all members will have the same bandwidth guarantees. Many IXPs will offer physical connection
speeds (like a Gigabit, TenGig or HundredGig port), but it is _also_ common practice to limit the
passed traffic by means of traffic shaping; for example one might have a TenGig port but only be
entitled to pass 3.0 Gbit/sec of traffic in and out of the platform.

For a long time I thought this kind of sucked; after all, who wants to connect to an internet
exchange point but then see their traffic rate limited? But if you think about it, this is often to
protect the member, the reseller, and the exchange itself: the total downstream bandwidth of the
reseller's members is potentially larger than the reseller's port to the exchange, and this is
almost certainly the case in the other direction: the total IXP bandwidth that might go to one
individual member is significantly larger than the reseller's port to the exchange.

Due to these two issues, a reseller port may become a bottleneck and _packetlo_ may occur. To
protect the ecosystem, having the internet exchange try to enforce fairness and bandwidth limits
makes operational sense.

## VPP as an IXP Gateway

{{< image width="400px" float="right" src="/assets/vpp-ixp-gateway/VPP IXP Gateway.svg" alt="VPP IXP Gateway" >}}

Here are a few requirements that may be necessary to provide an end-to-end solution:
1. Downstream ports MAY be _untagged_, or _tagged_, in which case encapsulation (for example
   .1q VLAN tags) SHOULD be provided, one per downstream member.
1. Each downstream member MUST ONLY be allowed to send traffic from one or more registered MAC
   addresses, in other words, strict filtering MUST be applied by the gateway.
1. If a downstream member is assigned an up- and downstream bandwidth limit, this MUST be
   enforced by the gateway.

Of course, all sorts of other things come to mind -- perhaps MPLS encapsulation, or VxLAN/GENEVE
tunneling endpoints, and certainly some monitoring with SNMP or Prometheus, and how about just
directly integrating this gateway with [[IXPManager](https://www.ixpmanager.org/)] while we're at
it. Yes, yes! But for this article, I'm going to stick to the bits and pieces regarding VPP itself,
and leave the other parts for another day!

First, I build a quick lab for this by taking one Supermicro bare metal server with VPP (it will
be the VPP IXP Gateway), and a couple of Debian servers and switches to simulate clients (A-J); a
quick sketch of the client-side addressing follows the list:
* Client A-D (on port `e0`-`e3`) will use `192.0.2.1-4/24` and `2001:db8::1-4/64`
* Client E-G (on switch port `e0`-`e2` of switch0, behind port `xe0`) will use `192.0.2.5-7/24` and
  `2001:db8::5-7/64`
* Client H-J (on switch port `e0`-`e2` of switch1, behind port `xe1`) will use `192.0.2.8-10/24` and
  `2001:db8::8-a/64`
* There will be a server attached to port `xxv0` with address `192.0.2.254/24` and `2001:db8::ff/64`
* The server will run `iperf3`.

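Since the clients are plain Debian machines, there is not much to their configuration: just an
address on the NIC that faces the gateway. A minimal sketch for client A, assuming its NIC is
`eno3` (as it appears later in this article) and that the addresses are added by hand rather than
via netplan:

```
pim@clientA:~$ sudo ip link set eno3 up
pim@clientA:~$ sudo ip addr add 192.0.2.1/24 dev eno3
pim@clientA:~$ sudo ip addr add 2001:db8::1/64 dev eno3
```

The other clients and the server would be configured the same way, each with their own address from
the list above.
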
### VPP: Bridge Domains

The fundamental topology described in the picture above tries to bridge together a bunch of untagged
ports (`e0`..`e3`, 1Gbit each) with two tagged ports (`xe0` and `xe1`, 10Gbit) into an upstream IXP
port (`xxv0`, 25Gbit). One thing to note for the pedants (and I love me some good pedantry) is that
the total physical bandwidth to downstream members in this gateway (4x1+2x10 == 24Gbit) is lower
than the physical bandwidth to the IXP platform (25Gbit), which makes sense. It means that there
will not be contention per se.

Building this topology in VPP is rather straightforward by using a so-called **Bridge Domain**,
which will be referred to by its bridge-id, for which I'll rather arbitrarily choose 8298:
```
vpp# create bridge-domain 8298
vpp# set interface l2 bridge xxv0 8298
vpp# set interface l2 bridge e0 8298
vpp# set interface l2 bridge e1 8298
vpp# set interface l2 bridge e2 8298
vpp# set interface l2 bridge e3 8298
vpp# set interface l2 bridge xe0 8298
vpp# set interface l2 bridge xe1 8298
```

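A small practical note: VPP interfaces start out administratively down, so on a freshly configured
dataplane the ports would also have to be brought up before any traffic flows. A quick sketch for
the first few of them:

```
vpp# set interface state xxv0 up
vpp# set interface state e0 up
vpp# set interface state e1 up
vpp# set interface state xe0 up
```
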
### VPP: Bridge Domain Encapsulations

I cheated a little bit in the previous section: I added the two TenGig ports called `xe0` and `xe1`
directly to the bridge; however they are trunk ports to breakout switches which will each contain
three additional downstream customers. So to add these six new customers, I will do the following:

```
vpp# set interface l3 xe0
vpp# create sub-interfaces xe0 10
vpp# create sub-interfaces xe0 20
vpp# create sub-interfaces xe0 30
vpp# set interface l2 bridge xe0.10 8298
vpp# set interface l2 bridge xe0.20 8298
vpp# set interface l2 bridge xe0.30 8298
```

The first command here puts the interface `xe0` back into Layer3 mode, which will detach it from the
bridge-domain. The second set of commands creates sub-interfaces with dot1q tags 10, 20 and 30
respectively. The third set then adds these three sub-interfaces to the bridge. By the way, I'll do
this for both `xe0` shown above, but also for the second `xe1` port, so all-up that makes 6
downstream member ports.

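For completeness, the equivalent commands for the second breakout switch port `xe1` would look like
this:

```
vpp# set interface l3 xe1
vpp# create sub-interfaces xe1 10
vpp# create sub-interfaces xe1 20
vpp# create sub-interfaces xe1 30
vpp# set interface l2 bridge xe1.10 8298
vpp# set interface l2 bridge xe1.20 8298
vpp# set interface l2 bridge xe1.30 8298
```
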
Readers of my articles at this point may have a little bit of an uneasy feeling: "What about the
VLAN Gymnastics?" I hear you ask :) You see, VPP will generally just pick up these ethernet frames
from `xe0.10` which are tagged, and add them as-is to the bridge, which is weird, because all the
other bridge ports are expecting untagged frames. So what I must do is tell VPP, upon receipt of a
tagged ethernet frame on these ports, to strip the tag; and on the way out, before transmitting the
ethernet frame, to wrap it into its correct encapsulation. This is called **tag rewriting** in VPP,
and I've written a bit about it in [[this article]({{< ref "2022-02-14-vpp-vlan-gym" >}})] in case
you're curious. But to cut to the chase:

```
vpp# set interface l2 tag-rewrite xe0.10 pop 1
vpp# set interface l2 tag-rewrite xe0.20 pop 1
vpp# set interface l2 tag-rewrite xe0.30 pop 1
vpp# set interface l2 tag-rewrite xe1.10 pop 1
vpp# set interface l2 tag-rewrite xe1.20 pop 1
vpp# set interface l2 tag-rewrite xe1.30 pop 1
```

Alright, with the VLAN gymnastics properly applied, I now have a bridge with all ten downstream
members and one upstream port (`xxv0`):

```
vpp# show bridge-domain 8298 int
  BD-ID  Index  BSN  Age(min)  Learning  U-Forwrd  UU-Flood  Flooding  ARP-Term  arp-ufwd  Learn-co  Learn-li  BVI-Intf
  8298     1     0     off        on        on       flood      on       off       off        1      16777216     N/A

           Interface           If-idx ISN  SHG  BVI  TxFlood        VLAN-Tag-Rewrite
             xxv0                3     1    0    -      *                 none
             e0                  5     1    0    -      *                 none
             e1                  6     1    0    -      *                 none
             e2                  7     1    0    -      *                 none
             e3                  8     1    0    -      *                 none
             xe0.10             19     1    0    -      *                 pop-1
             xe0.20             20     1    0    -      *                 pop-1
             xe0.30             21     1    0    -      *                 pop-1
             xe1.10             22     1    0    -      *                 pop-1
             xe1.20             23     1    0    -      *                 pop-1
             xe1.30             24     1    0    -      *                 pop-1
```

One cool thing to re-iterate is that VPP is really a router, not a switch. It's
entirely possible and common to create two completely independent sub-interfaces with .1q tag 10 (in
my case, `xe0.10` and `xe1.10`) and use the bridge-domain to tie them together.

#### Validating Bridge Domains

Looking at my clients above, I can see that several of them are untagged (`e0`-`e3`) and a few of
them are tagged behind ports `xe0` and `xe1`. It should be straightforward to validate reachability
with the following simple ping command:

```
pim@clientA:~$ fping -a -g 192.0.2.0/24
192.0.2.1 is alive
192.0.2.2 is alive
192.0.2.3 is alive
192.0.2.4 is alive
192.0.2.5 is alive
192.0.2.6 is alive
192.0.2.7 is alive
192.0.2.8 is alive
192.0.2.9 is alive
192.0.2.10 is alive
192.0.2.254 is alive
```

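On the VPP side, the MAC addresses that the bridge has learned from these clients can be
double-checked as well; a quick sketch (output omitted, and the exact flags may differ a bit between
VPP releases):

```
vpp# show l2fib verbose
vpp# show l2fib bd_id 8298
```
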
At this point the table stakes configuration provides for a Layer2 bridge domain spanning all of
these ports, including performing the correct encapsulation on the TenGig ports that connect to
the switches. There is L2 reachability between all clients over this VPP IXP Gateway.

**✅ Requirement #1 is implemented!**

### VPP: MAC Address Filtering

Enter classifiers! Actually while doing the research for this article, I accidentally nerd-sniped
myself while going through the features provided by VPP's classifier system, and holy moly is that
thing powerful!

I'm only going to show the results of that little journey through the code base and documentation,
but in an upcoming article I intend to do a thorough deep-dive into VPP classifiers, and add them to
`vppcfg` because I think that would be the bee's knees!

Back to the topic of MAC address filtering, a classifier would look roughly like this:

```
vpp# classify table acl-miss-next deny mask l2 src table 5
vpp# classify session acl-hit-next permit table-index 5 match l2 src 00:01:02:03:ca:fe
vpp# classify session acl-hit-next permit table-index 5 match l2 src 00:01:02:03:d0:d0
vpp# set interface input acl intfc e0 l2-table 5
vpp# show inacl type l2
 Intfc idx      Classify table    Interface name
         5                   5    e0
```

The first line creates a classify table where we'll want to match on Layer2 source addresses, and if
there is no entry in the table that matches, the default will be to _deny_ (drop) the ethernet
frame. The next two lines add an entry for ethernet frames which have a Layer2 source of the _cafe_
and _d0d0_ MAC addresses. When matching, the action is to _permit_ (accept) the ethernet frame.
Then, I apply this classifier as an l2 input ACL on interface `e0`.

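This pattern repeats per member port: each downstream interface gets its own classify table with its
own set of permitted source MAC addresses. A sketch of what that might look like for the next member
on `e1`, with a made-up MAC address; I'm assuming here that the newly created table comes back with
index 6, which can be confirmed with `show classify tables`:

```
vpp# classify table acl-miss-next deny mask l2 src
vpp# classify session acl-hit-next permit table-index 6 match l2 src 00:01:02:03:be:ef
vpp# set interface input acl intfc e1 l2-table 6
```
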
Incidentally, the input ACL can operate at five distinct points in the packet's journey through
the dataplane: at the Layer2 input stage, like I'm using here; in the IPv4 and IPv6 input paths; and
when punting traffic for IPv4 and IPv6, respectively.

#### Validating MAC filtering

Remember when I created the classify table and added two bogus MAC addresses to it? Let me show you
what would happen on client A, which is directly connected to port `e0`.

```
pim@clientA:~$ ip -br link show eno3
eno3             UP             3c:ec:ef:6a:7b:74 <BROADCAST,MULTICAST,UP,LOWER_UP>

pim@clientA:~$ ping 192.0.2.254
PING 192.0.2.254 (192.0.2.254) 56(84) bytes of data.
...
```

This is expected because client A's MAC address has not yet been added to the classify table driving
the Layer2 input ACL, which is quickly remedied like so:

```
vpp# classify session acl-hit-next permit table-index 5 match l2 src 3c:ec:ef:6a:7b:74
...
64 bytes from 192.0.2.254: icmp_seq=34 ttl=64 time=2048 ms
64 bytes from 192.0.2.254: icmp_seq=35 ttl=64 time=1024 ms
64 bytes from 192.0.2.254: icmp_seq=36 ttl=64 time=0.450 ms
64 bytes from 192.0.2.254: icmp_seq=37 ttl=64 time=0.262 ms
```

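To get a bit more visibility into what the classifier is doing, the tables and their sessions can be
inspected from the dataplane, and dropped frames should show up in the node error counters; a sketch
(output omitted):

```
vpp# show classify tables verbose
vpp# show errors
```
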
**✅ Requirement #2 is implemented!**

### VPP: Traffic Policers

I realize that from the IXP's point of view, not all the available bandwidth behind `xxv0` should be
made available to all clients. Some may have negotiated a higher or lower bandwidth available to
them. Therefore, the VPP IXP Gateway should be able to rate limit the traffic through it, for
which a VPP feature already exists: Policers.

Consider for a moment our client A (untagged on port `e0`), and client E (behind port `xe0` with a
dot1q tag of 10). Client A has a bandwidth of 1Gbit, but client E nominally has a bandwidth of
10Gbit. If I wanted to restrict both clients to, say, 150Mbit, I could do the following:

```
vpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-a e0
vpp# policer output name client-a e0

vpp# policer add name client-e rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-e xe0.10
vpp# policer output name client-e xe0.10
```

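To exercise these policers, an obvious test is to run `iperf3` from a client towards the server
behind `xxv0`; a minimal sketch, assuming the server (shown here with a hypothetical hostname) runs
the iperf3 daemon, and using `-R` to also test the reverse direction:

```
pim@server:~$ iperf3 -s -D

pim@clientA:~$ iperf3 -c 192.0.2.254 -t 10
pim@clientA:~$ iperf3 -c 192.0.2.254 -t 10 -R
```
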
And here's where I bump into a stubborn VPP dataplane. I would've expected the input and output
packet shaping to occur on both the untagged interface `e0` as well as the tagged interface
`xe0.10`, but alas, the policer only works in one of these four cases. Ouch!

I read the code around `src/vnet/policer/` and understand the following:
* On input, the policer is applied on `device-input`, which is the Phy, not the Sub-Interface. This
  explains why the policer works on untagged, but not on tagged interfaces.
* On output, the policer is applied on `ip4-output` and `ip6-output`, which works only for L3
  enabled interfaces, not for L2 ones like the ones in this bridge domain.

I also tried to work with classifiers, like in the MAC address filtering above -- but I concluded
here as well, that the policer works only on input, not on output. So the mission is now to figure
out how to enable an L2 policer on (1) untagged output, and (2) tagged in- and output.

**❌ Requirement #3 is not implemented!**

## What's Next

It's too bad that policers are a bit fickle, but I think that's fixable. I've
started a thread on `vpp-dev@` to discuss, and will reach out to Stanislav, who added the
_policer output_ capability in commit `e5a3ae0179`.

Of course, this is just a proof of concept. I typed most of the configuration by hand on the VPP IXP
Gateway, just to show a few of the more advanced features of VPP. For me, this triggered a whole new
line of thinking: classifiers. This extract/match/act pattern can be used in policers, ACLs and
arbitrary traffic redirection through VPP's directed graph (e.g. selecting a next node for
processing). I'm going to deep-dive into this classifier behavior in an upcoming article, and see
how I might add this to [[vppcfg](https://git.ipng.ch/ipng/vppcfg.git)], because I think it
would be super powerful to abstract away the rather complex underlying API into something a little
bit more ... user friendly. Stay tuned! :)