---
date: "2023-10-21T11:35:14Z"
title: VPP IXP Gateway - Part 1
aliases:
- /s/articles/2023/10/21/vpp-ixp-gateway-1.html
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation services router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

There are some really fantastic features in VPP, some of which are lesser known and not always
very well documented. In this article, I will describe a unique use case in which I think VPP will
excel, notably acting as a gateway for Internet Exchange Points.

In this first article, I'll take a closer look at three things that would make such a gateway
possible: bridge domains, MAC address filtering and traffic shaping.

## Introduction

Internet Exchanges are typically L2 (ethernet) switch platforms that allow their connected members
to exchange traffic amongst themselves. Not all members share physical locations with the Internet
Exchange itself; for example the IXP may be at NTT Zurich, but the member may be present in
Interxion Zurich. For smaller clubs, like IPng Networks, it's not always financially feasible (or
desirable) to order a dark fiber between two adjacent datacenters, or even a cross connect in the
same datacenter (as many of them charge exorbitant fees for what is essentially passive fiber
optics and patch panels), if the amount of traffic passed is modest.

One solution to such problems is to have one member transport multiple end-user downstream members
to the platform, for example by means of an Ethernet over MPLS or VxLAN transport from where the
end-user lives, to the physical port of the Internet Exchange. These transport members are often
called _IXP Resellers_, noting that usually, but not always, some form of payment is required.

From the point of view of the IXP, there is often a one-MAC-address-per-member limitation, and not
all members will have the same bandwidth guarantees. Many IXPs will offer physical connection
speeds (like a Gigabit, TenGig or HundredGig port), but it is _also_ common practice to limit the
passed traffic by means of traffic shaping; for example one might have a TenGig port but only be
entitled to pass 3.0 Gbit/sec of traffic in and out of the platform.

For a long time I thought this kind of sucked; after all, who wants to connect to an internet
exchange point but then see their traffic rate limited? But if you think about it, this is often to
protect the member, the reseller, and the exchange itself: the total downstream bandwidth of the
reseller's members is potentially larger than the reseller's port to the exchange, and this is
almost certainly the case in the other direction: the total IXP bandwidth that might go to one
individual member is significantly larger than the reseller's port to the exchange.

Due to these two issues, a reseller port may become a bottleneck and _packetlo_ may occur. To
protect the ecosystem, having the internet exchange try to enforce fairness and bandwidth limits
makes operational sense.

## VPP as an IXP Gateway

{{< image width="400px" float="right" src="/assets/vpp-ixp-gateway/VPP IXP Gateway.svg" alt="VPP IXP Gateway" >}}

Here are a few requirements that may be necessary to provide an end-to-end solution:
1. Downstream ports MAY be _untagged_, or _tagged_, in which case encapsulation (for example
   .1q VLAN tags) SHOULD be provided, one per downstream member.
1. Each downstream member MUST ONLY be allowed to send traffic from one or more registered MAC
   addresses, in other words, strict filtering MUST be applied by the gateway.
1. If a downstream member is assigned an up- and downstream bandwidth limit, this MUST be
   enforced by the gateway.

Of course, all sorts of other things come to mind -- perhaps MPLS encapsulation, or VxLAN/GENEVE
tunneling endpoints, and certainly some monitoring with SNMP or Prometheus, and how about just
directly integrating this gateway with [[IXPManager](https://www.ixpmanager.org/)] while we're at
it. Yes, yes! But for this article, I'm going to stick to the bits and pieces regarding VPP itself,
and leave the other parts for another day!

First, I build a quick lab for this by taking one Supermicro bare metal server with VPP (it will
be the VPP IXP Gateway), and a couple of Debian servers and switches to simulate clients (A-J); a
quick sketch of the client-side addressing follows the list:
* Client A-D (on port `e0`-`e3`) will use `192.0.2.1-4/24` and `2001:db8::1-4/64`
* Client E-G (on switch port `e0`-`e2` of switch0, behind port `xe0`) will use `192.0.2.5-7/24` and
  `2001:db8::5-7/64`
* Client H-J (on switch port `e0`-`e2` of switch1, behind port `xe1`) will use `192.0.2.8-10/24` and
  `2001:db8::8-a/64`
* There will be a server attached to port `xxv0` with address `192.0.2.254/24` and `2001:db8::ff/64`
* The server will run `iperf3`.

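Since the clients are plain Debian machines, there is not much to their configuration: just an
address on the NIC that faces the gateway. A minimal sketch for client A, assuming its NIC is
`eno3` (as it appears later in this article) and that the addresses are added by hand rather than
via netplan:

```
pim@clientA:~$ sudo ip link set eno3 up
pim@clientA:~$ sudo ip addr add 192.0.2.1/24 dev eno3
pim@clientA:~$ sudo ip addr add 2001:db8::1/64 dev eno3
```

The other clients and the server would be configured the same way, each with their own address from
the list above.
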
### VPP: Bridge Domains

The fundamental topology described in the picture above tries to bridge together a bunch of untagged
ports (`e0`..`e3`, 1Gbit each) with two tagged ports (`xe0` and `xe1`, 10Gbit) into an upstream IXP
port (`xxv0`, 25Gbit). One thing to note for the pedants (and I love me some good pedantry) is that
the total physical bandwidth to downstream members in this gateway (4x1+2x10 == 24Gbit) is lower
than the physical bandwidth to the IXP platform (25Gbit), which makes sense. It means that there
will not be contention per se.

Building this topology in VPP is rather straightforward by using a so-called **Bridge Domain**,
which will be referred to by its bridge-id, for which I'll rather arbitrarily choose 8298:
```
vpp# create bridge-domain 8298
vpp# set interface l2 bridge xxv0 8298
vpp# set interface l2 bridge e0 8298
vpp# set interface l2 bridge e1 8298
vpp# set interface l2 bridge e2 8298
vpp# set interface l2 bridge e3 8298
vpp# set interface l2 bridge xe0 8298
vpp# set interface l2 bridge xe1 8298
```

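A small practical note: VPP interfaces start out administratively down, so on a freshly configured
dataplane the ports would also have to be brought up before any traffic flows. A quick sketch for
the first few of them:

```
vpp# set interface state xxv0 up
vpp# set interface state e0 up
vpp# set interface state e1 up
vpp# set interface state xe0 up
```
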
### VPP: Bridge Domain Encapsulations

I cheated a little bit in the previous section: I added the two TenGig ports called `xe0` and `xe1`
directly to the bridge; however they are trunk ports to breakout switches which will each contain
three additional downstream customers. So to add these six new customers, I will do the following:

```
vpp# set interface l3 xe0
vpp# create sub-interfaces xe0 10
vpp# create sub-interfaces xe0 20
vpp# create sub-interfaces xe0 30
vpp# set interface l2 bridge xe0.10 8298
vpp# set interface l2 bridge xe0.20 8298
vpp# set interface l2 bridge xe0.30 8298
```

The first command here puts the interface `xe0` back into Layer3 mode, which will detach it from the
bridge-domain. The second set of commands creates sub-interfaces with dot1q tags 10, 20 and 30
respectively. The third set then adds these three sub-interfaces to the bridge. By the way, I'll do
this for both `xe0` shown above, but also for the second `xe1` port, so all-up that makes 6
downstream member ports.

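For completeness, the equivalent commands for the second breakout switch port `xe1` would look like
this:

```
vpp# set interface l3 xe1
vpp# create sub-interfaces xe1 10
vpp# create sub-interfaces xe1 20
vpp# create sub-interfaces xe1 30
vpp# set interface l2 bridge xe1.10 8298
vpp# set interface l2 bridge xe1.20 8298
vpp# set interface l2 bridge xe1.30 8298
```
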
Readers of my articles at this point may have a little bit of an uneasy feeling: "What about the
VLAN Gymnastics?" I hear you ask :) You see, VPP will generally just pick up these ethernet frames
from `xe0.10` which are tagged, and add them as-is to the bridge, which is weird, because all the
other bridge ports are expecting untagged frames. So what I must do is tell VPP, upon receipt of a
tagged ethernet frame on these ports, to strip the tag; and on the way out, before transmitting the
ethernet frame, to wrap it into its correct encapsulation. This is called **tag rewriting** in VPP,
and I've written a bit about it in [[this article]({{< ref "2022-02-14-vpp-vlan-gym" >}})] in case
you're curious. But to cut to the chase:

```
vpp# set interface l2 tag-rewrite xe0.10 pop 1
vpp# set interface l2 tag-rewrite xe0.20 pop 1
vpp# set interface l2 tag-rewrite xe0.30 pop 1
vpp# set interface l2 tag-rewrite xe1.10 pop 1
vpp# set interface l2 tag-rewrite xe1.20 pop 1
vpp# set interface l2 tag-rewrite xe1.30 pop 1
```

Alright, with the VLAN gymnastics properly applied, I now have a bridge with all ten downstream
members and one upstream port (`xxv0`):

```
vpp# show bridge-domain 8298 int
  BD-ID  Index  BSN  Age(min)  Learning  U-Forwrd  UU-Flood  Flooding  ARP-Term  arp-ufwd  Learn-co  Learn-li  BVI-Intf
  8298     1     0     off        on        on       flood      on       off       off        1      16777216     N/A

           Interface           If-idx ISN  SHG  BVI  TxFlood        VLAN-Tag-Rewrite
             xxv0                3     1    0    -      *                 none
             e0                  5     1    0    -      *                 none
             e1                  6     1    0    -      *                 none
             e2                  7     1    0    -      *                 none
             e3                  8     1    0    -      *                 none
             xe0.10             19     1    0    -      *                 pop-1
             xe0.20             20     1    0    -      *                 pop-1
             xe0.30             21     1    0    -      *                 pop-1
             xe1.10             22     1    0    -      *                 pop-1
             xe1.20             23     1    0    -      *                 pop-1
             xe1.30             24     1    0    -      *                 pop-1
```

One cool thing to re-iterate is that VPP is really a router, not a switch. It's
entirely possible and common to create two completely independent sub-interfaces with .1q tag 10 (in
my case, `xe0.10` and `xe1.10`) and use the bridge-domain to tie them together.

#### Validating Bridge Domains

Looking at my clients above, I can see that several of them are untagged (`e0`-`e3`) and a few of
them are tagged behind ports `xe0` and `xe1`. It should be straightforward to validate reachability
with the following simple ping command:

```
pim@clientA:~$ fping -a -g 192.0.2.0/24
192.0.2.1 is alive
192.0.2.2 is alive
192.0.2.3 is alive
192.0.2.4 is alive
192.0.2.5 is alive
192.0.2.6 is alive
192.0.2.7 is alive
192.0.2.8 is alive
192.0.2.9 is alive
192.0.2.10 is alive
192.0.2.254 is alive
```

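On the VPP side, the MAC addresses that the bridge has learned from these clients can be
double-checked as well; a quick sketch (output omitted, and the exact flags may differ a bit between
VPP releases):

```
vpp# show l2fib verbose
vpp# show l2fib bd_id 8298
```
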
At this point the table stakes configuration provides for a Layer2 bridge domain spanning all of
these ports, including performing the correct encapsulation on the TenGig ports that connect to
the switches. There is L2 reachability between all clients over this VPP IXP Gateway.

**✅ Requirement #1 is implemented!**

### VPP: MAC Address Filtering

Enter classifiers! Actually while doing the research for this article, I accidentally nerd-sniped
myself while going through the features provided by VPP's classifier system, and holy moly is that
thing powerful!

I'm only going to show the results of that little journey through the code base and documentation,
but in an upcoming article I intend to do a thorough deep-dive into VPP classifiers, and add them to
`vppcfg` because I think that would be the bee's knees!

Back to the topic of MAC address filtering, a classifier would look roughly like this:

```
vpp# classify table acl-miss-next deny mask l2 src table 5
vpp# classify session acl-hit-next permit table-index 5 match l2 src 00:01:02:03:ca:fe
vpp# classify session acl-hit-next permit table-index 5 match l2 src 00:01:02:03:d0:d0
vpp# set interface input acl intfc e0 l2-table 5
vpp# show inacl type l2
 Intfc idx      Classify table    Interface name
         5                   5    e0
```

The first line creates a classify table where we'll want to match on Layer2 source addresses, and if
there is no entry in the table that matches, the default will be to _deny_ (drop) the ethernet
frame. The next two lines add an entry for ethernet frames which have a Layer2 source of the _cafe_
and _d0d0_ MAC addresses. When matching, the action is to _permit_ (accept) the ethernet frame.
Then, I apply this classifier as an l2 input ACL on interface `e0`.

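This pattern repeats per member port: each downstream interface gets its own classify table with its
own set of permitted source MAC addresses. A sketch of what that might look like for the next member
on `e1`, with a made-up MAC address; I'm assuming here that the newly created table comes back with
index 6, which can be confirmed with `show classify tables`:

```
vpp# classify table acl-miss-next deny mask l2 src
vpp# classify session acl-hit-next permit table-index 6 match l2 src 00:01:02:03:be:ef
vpp# set interface input acl intfc e1 l2-table 6
```
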
Incidentally, the input ACL can operate at five distinct points in the packet's journey through
the dataplane: at the Layer2 input stage, like I'm using here; in the IPv4 and IPv6 input paths; and
when punting traffic for IPv4 and IPv6, respectively.

#### Validating MAC filtering

Remember when I created the classify table and added two bogus MAC addresses to it? Let me show you
what would happen on client A, which is directly connected to port `e0`.

```
pim@clientA:~$ ip -br link show eno3
eno3             UP             3c:ec:ef:6a:7b:74 <BROADCAST,MULTICAST,UP,LOWER_UP>

pim@clientA:~$ ping 192.0.2.254
PING 192.0.2.254 (192.0.2.254) 56(84) bytes of data.
...
```

This is expected because client A's MAC address has not yet been added to the classify table driving
the Layer2 input ACL, which is quickly remedied like so:

```
vpp# classify session acl-hit-next permit table-index 5 match l2 src 3c:ec:ef:6a:7b:74
...
64 bytes from 192.0.2.254: icmp_seq=34 ttl=64 time=2048 ms
64 bytes from 192.0.2.254: icmp_seq=35 ttl=64 time=1024 ms
64 bytes from 192.0.2.254: icmp_seq=36 ttl=64 time=0.450 ms
64 bytes from 192.0.2.254: icmp_seq=37 ttl=64 time=0.262 ms
```

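To get a bit more visibility into what the classifier is doing, the tables and their sessions can be
inspected from the dataplane, and dropped frames should show up in the node error counters; a sketch
(output omitted):

```
vpp# show classify tables verbose
vpp# show errors
```
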
**✅ Requirement #2 is implemented!**

### VPP: Traffic Policers

I realize that from the IXP's point of view, not all the available bandwidth behind `xxv0` should be
made available to all clients. Some may have negotiated a higher or lower bandwidth available to
them. Therefore, the VPP IXP Gateway should be able to rate limit the traffic through it, for
which a VPP feature already exists: Policers.

Consider for a moment our client A (untagged on port `e0`), and client E (behind port `xe0` with a
dot1q tag of 10). Client A has a bandwidth of 1Gbit, but client E nominally has a bandwidth of
10Gbit. If I wanted to restrict both clients to, say, 150Mbit, I could do the following:

```
vpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-a e0
vpp# policer output name client-a e0

vpp# policer add name client-e rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-e xe0.10
vpp# policer output name client-e xe0.10
```

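To exercise these policers, an obvious test is to run `iperf3` from a client towards the server
behind `xxv0`; a minimal sketch, assuming the server (shown here with a hypothetical hostname) runs
the iperf3 daemon, and using `-R` to also test the reverse direction:

```
pim@server:~$ iperf3 -s -D

pim@clientA:~$ iperf3 -c 192.0.2.254 -t 10
pim@clientA:~$ iperf3 -c 192.0.2.254 -t 10 -R
```
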
And here's where I bump into a stubborn VPP dataplane. I would've expected the input and output
packet shaping to occur on both the untagged interface `e0` as well as the tagged interface
`xe0.10`, but alas, the policer only works in one of these four cases. Ouch!

I read the code around `src/vnet/policer/` and understand the following:
* On input, the policer is applied on `device-input`, which is the Phy, not the Sub-Interface. This
  explains why the policer works on untagged, but not on tagged interfaces.
* On output, the policer is applied on `ip4-output` and `ip6-output`, which works only for L3
  enabled interfaces, not for L2 ones like the ones in this bridge domain.

I also tried to work with classifiers, like in the MAC address filtering above -- but I concluded
here as well, that the policer works only on input, not on output. So the mission is now to figure
out how to enable an L2 policer on (1) untagged output, and (2) tagged in- and output.

**❌ Requirement #3 is not implemented!**

## What's Next

It's too bad that policers are a bit fickle, but I think that's fixable. I've
started a thread on `vpp-dev@` to discuss, and will reach out to Stanislav, who added the
_policer output_ capability in commit `e5a3ae0179`.

Of course, this is just a proof of concept. I typed most of the configuration by hand on the VPP IXP
Gateway, just to show a few of the more advanced features of VPP. For me, this triggered a whole new
line of thinking: classifiers. This extract/match/act pattern can be used in policers, ACLs and
arbitrary traffic redirection through VPP's directed graph (e.g. selecting a next node for
processing). I'm going to deep-dive into this classifier behavior in an upcoming article, and see
how I might add this to [[vppcfg](https://git.ipng.ch/ipng/vppcfg.git)], because I think it
would be super powerful to abstract away the rather complex underlying API into something a little
bit more ... user friendly. Stay tuned! :)