---
date: "2026-02-14T11:35:14Z"
title: VPP Policers
---
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
There are some really fantastic features in VPP, some of which are less well known and not always
very well documented. In this article, I will describe a unique use case in which I think VPP will
excel, notably acting as a gateway for Internet Exchange Points.
A few years ago, I toyed with the idea of using VPP as an _IXP Reseller_ concentrator, allowing
several carriers to connect with say 10G or 25G ports, and carry sub-customers on tagged interfaces
with safety (like MAC address ACLs) and rate limiting (say any given customer limited to 1Gbps on a
10G or 100G trunk), all provided by VPP. You can take a look at my [[VPP IXP Gateway]({{< ref
2023-10-21-vpp-ixp-gateway-1 >}})] article for details. I never ended up deploying it.
In this article, I follow up and fix a few shortcomings in VPP's policer framework.
## Introduction
Consider the following policer in VPP:
```
vpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-a GigabitEthernet10/0/1
vpp# policer output name client-a GigabitEthernet10/0/1
```
The idea is to give a _committed information rate_ of 150Mbps with a _committed burst_ of 15MB.
The _CIR_ represents the average bandwidth allowed for the interface, while the _CB_ represents the
maximum amount of data (in bytes) that can be sent at line speed in a single burst before the _CIR_
kicks in to throttle the traffic.
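To make those semantics concrete, here's a minimal token-bucket sketch in Python: a simplified model of a single-rate, two-color policer, not VPP's actual implementation.

```python
# Simplified single-rate, two-color (1R2C) token bucket illustrating CIR/CB
# semantics only -- a model, not VPP's actual implementation.
class Policer:
    def __init__(self, cir_kbps, cb_bytes):
        self.rate = cir_kbps * 1000 / 8   # token refill rate, in bytes/sec
        self.cb = cb_bytes                # bucket depth (committed burst)
        self.tokens = cb_bytes            # bucket starts full
        self.last = 0.0                   # timestamp of last update

    def police(self, now, pkt_len):
        # Refill tokens for the elapsed time, capped at the bucket depth
        self.tokens = min(self.cb, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if pkt_len <= self.tokens:
            self.tokens -= pkt_len
            return "conform"
        return "violate"

p = Policer(cir_kbps=150000, cb_bytes=15000000)
# A 10MB burst at t=0 conforms (it fits within the 15MB committed burst) ...
print(p.police(0.0, 10_000_000))   # conform
# ... but a second 10MB burst does not: only 5MB of tokens remain.
print(p.police(0.0, 10_000_000))   # violate
```

As time passes, the bucket refills at the CIR, so the same burst conforms again later.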
Back in October of 2023, I reached the conclusion that the policer works in the following modes:
* ***On input***, the policer is applied on `device-input`, which means it takes frames directly
from the PHY. It will not work on any sub-interfaces, which explains why the policer worked on
untagged (`Gi10/0/1`) but not on tagged (`Gi10/0/1.100`) sub-interfaces.
* ***On output***, the policer is applied on `ip4-output` and `ip6-output`, which works only for
L3 enabled interfaces, not for L2 ones such as members of a bridge domain or an L2 cross-connect.
## VPP Infra: L2 Feature Maps
The benefit of using the `device-input` input arc is that it's efficient: every packet that comes
from the device (`Gi10/0/1`), tagged or untagged, will be handed off to the policer plugin. It means
any traffic (L2, L3, sub-interface, tagged, untagged) will all go through the same policer.
In `src/vnet/l2/` there are two nodes called `l2-input` and `l2-output`. I can configure VPP to call
these nodes before `ip[46]-unicast` and before `ip[46]-output` respectively. These L2 nodes have a
feature bitmap with 32 entries. The l2-input / l2-output nodes use a bitmap walk: they find the
highest set bit, and then dispatch the packet to a pre-configured graph node. Upon return, the
`feat-bitmap-next` checks the next bit, and if that one is set, dispatches the packet to the next
pre-configured graph node. This continues until all the bits are checked and packets have been
handed to their respective graph node if any given bit is set.
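The bitmap walk can be sketched in a few lines of Python (the node names and bit positions here are illustrative, abbreviated from VPP's real tables):

```python
# Sketch of the l2-input feature bitmap walk: dispatch to the node for the
# highest set bit, clear it, and repeat until the bitmap is empty.
# Bit positions and node names are illustrative, not VPP's full table.
FEAT_NODES = {17: "l2-span", 14: "l2-policer-classify", 10: "l2-input-vtr",
              1: "l2-xconnect", 0: "feature-bitmap-drop"}

def walk(feature_bitmap):
    """Yield graph-node names in the order the packet visits them."""
    while feature_bitmap:
        bit = feature_bitmap.bit_length() - 1   # highest set bit
        feature_bitmap &= ~(1 << bit)           # each node clears its own bit
        yield FEAT_NODES[bit]

# A packet with VTR(10) and XCONNECT(1) enabled visits VTR first:
print(list(walk((1 << 10) | (1 << 1))))   # ['l2-input-vtr', 'l2-xconnect']
```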
To show what I can do with these nodes, let me dive into an example. When a packet arrives on an
interface configured in L2 mode, either because it's a bridge-domain or an L2XC, `ethernet-input`
will send it to `l2-input`. This node does three things:
1. It will classify the packet, by reading the interface configuration (`l2input_main.configs`) for
the sw_if_index, which contains the mode of the interface (`bridge-domain`, `l2xc`, or `bvi`). It
also contains the feature bitmap: a statically configured set of features for this interface.
1. It will store the effective feature bitmap for each individual packet in the packet buffer. For
bridge mode, depending on the packet being unicast or multicast, some features are disabled. For
example, flooding for unicast packets is not performed, so those bits are cleared. The result is
stored in a per-packet working copy that downstream nodes can be triggered on, in turn.
1. For each of the bits set in the packet buffer's `l2.feature_bitmap`, starting from highest bit
set, `l2-input` will set the next node, for example `l2-input-vtr` to do VLAN Tag Rewriting. Once
that node is finished, it'll clear its own bit, and search for the next one set, in order to set a
new node.
I note that processing order is HIGH to LOW bits. By reading `l2_input.h`, I can see that the full
`l2-input` chain looks like this:
```
l2-input
→ SPAN(17) → INPUT_CLASSIFY(16) → INPUT_FEAT_ARC(15) → POLICER_CLAS(14)
→ ACL(13) → VPATH(12) → L2_IP_QOS_RECORD(11) → VTR(10) → LEARN(9) → RW(8)
→ FWD(7) → UU_FWD(6) → UU_FLOOD(5) → ARP_TERM(4) → ARP_UFWD(3) → FLOOD(2)
→ XCONNECT(1) → DROP(0)
l2-output
→ XCRW(12) → OUTPUT_FEAT_ARC(11) → OUTPUT_CLASSIFY(10) → LINESTATUS_DOWN(9)
→ STP_BLOCKED(8) → IPIW(7) → EFP_FILTER(6) → L2PT(5) → ACL(4) → QOS(3)
→ CFM(2) → SPAN(1) → OUTPUT(0)
```
If none of the L2 processing nodes set the next node, ultimately `feature-bitmap-drop` gently takes
the packet behind the shed and drops it. On the way out, ultimately the last `OUTPUT` bit sends the
packet to `interface-output`, which hands off to the driver's TX node.
### Enabling L2 features
There's lots of places in VPP where L2 feature bitmaps are set/cleared. Here's a few examples:
```
# VTR: sets L2INPUT_FEAT_VTR + configures output VTR (VLAN Tag Rewriting)
vpp# set interface l2 tag-rewrite GigE0/0/0.100 pop 1
# ACL: sets L2INPUT_FEAT_ACL / L2OUTPUT_FEAT_ACL
vpp# set interface l2 input acl intfc GigE0/0/0 ip4-table 0
vpp# set interface l2 output acl intfc GigE0/0/0 ip4-table 0
# SPAN: sets L2INPUT_FEAT_SPAN / L2OUTPUT_FEAT_SPAN
vpp# set interface span GigE0/0/0 l2 destination GigE0/0/1
# Bridge domain level (affects bd_feature_bitmap, applied to all bridge members)
vpp# set bridge-domain learn 1 # enable/disable LEARN in BD
vpp# set bridge-domain forward 1 # enable/disable FWD in BD
vpp# set bridge-domain flood 1 # enable/disable FLOOD in BD
```
I'm starting to see how these L2 feature bitmaps are super powerful, yet flexible. I'm ready to add one!
### Creating L2 features
First, I need to insert my new `POLICER` bit in `l2_input.h` and `l2_output.h`. Then, I can call
`l2input_intf_bitmap_enable()` and its companion `l2output_intf_bitmap_enable()` to enable or
disable the L2 feature, and point it at a new graph node.
```cpp
/* Enable policer both on L2 feature bitmap, and L3 feature arcs */
if (dir == VLIB_RX) {
  l2input_intf_bitmap_enable (sw_if_index, L2INPUT_FEAT_POLICER, apply);
  vnet_feature_enable_disable ("ip4-unicast", "policer-input", sw_if_index, apply, 0, 0);
  vnet_feature_enable_disable ("ip6-unicast", "policer-input", sw_if_index, apply, 0, 0);
} else {
  l2output_intf_bitmap_enable (sw_if_index, L2OUTPUT_FEAT_POLICER, apply);
  vnet_feature_enable_disable ("ip4-output", "policer-output", sw_if_index, apply, 0, 0);
  vnet_feature_enable_disable ("ip6-output", "policer-output", sw_if_index, apply, 0, 0);
}
```
What this means is that if the interface happens to be in L2 mode, in other words when it is a
`bridge-domain` member or part of an L2XC, I will enable the L2 features. However, for
L3 packets, I will still proceed to enable the existing `policer-input` node by calling
`vnet_feature_enable_disable()` on the IPv4 and IPv6 input arc. I make a mental note that MPLS and
other non-IP traffic will not be policed in this way.
### Updating Policer graph node
The policer framework has an existing dataplane node called `vnet_policer_inline()`, which I extend
to take a flag `is_l2`. Using this flag, I can set the next graph node either with
`vnet_l2_feature_next()`, or, in the pre-existing L3 case, with `vnet_feature_next()` on the packets
that move through the node. The nodes now look like this:
```cpp
VLIB_NODE_FN (policer_l2_input_node)
(vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
{
  return vnet_policer_inline (vm, node, frame, VLIB_RX, 1 /* is_l2 */);
}

VLIB_REGISTER_NODE (policer_l2_input_node) = {
  .name = "l2-policer-input",
  .vector_size = sizeof (u32),
  .format_trace = format_policer_trace,
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_errors = ARRAY_LEN (vnet_policer_error_strings),
  .error_strings = vnet_policer_error_strings,
  .n_next_nodes = VNET_POLICER_N_NEXT,
  .next_nodes = {
    [VNET_POLICER_NEXT_DROP] = "error-drop",
    [VNET_POLICER_NEXT_HANDOFF] = "policer-input-handoff",
  },
};

/* Register on IP unicast arcs for L3 routed sub-interfaces */
VNET_FEATURE_INIT (policer_ip4_unicast, static) = {
  .arc_name = "ip4-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip4-lookup"),
};

VNET_FEATURE_INIT (policer_ip6_unicast, static) = {
  .arc_name = "ip6-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip6-lookup"),
};
```
Here, I install the L3 feature before `ip[46]-lookup`, and hook up the L2 feature with a new node
that really just calls the existing node but with `is_l2` set to true. I do something very similar
for the output direction, except there I'll hook the L3 feature before `ip[46]-output`.
## Tests!
I think writing unit- and integration tests is a great idea. I add a new file
`test/test_policer_subif.py` which actually tests all four new cases:
1. **L3 Input**: on a routed sub-interface
1. **L3 Output**: on a routed sub-interface
1. **L2 Input**: on a bridge-domain sub-interface
1. **L2 Output**: on a bridge-domain sub-interface
The existing `test/test_policer.py` covers the pre-existing cases, and of course it's important
that my work does not break them. Lucky me, the existing tests all still pass :)
### Test: L3 in/output
The tests use a VPP feature called `packet-generator`, which creates virtual devices upon which I
can emit packets using Scapy, and use pcap to receive them. For the input test, first I'll create
the interface and apply a new policer to it:
```python
sub_if0 = VppDot1QSubint(self, self.pg0, 10)
sub_if0.admin_up()
sub_if0.config_ip4()
sub_if0.resolve_arp()
# Create policer
action_tx = PolicerAction(VppEnum.vl_api_sse2_qos_action_type_t.SSE2_QOS_ACTION_API_TRANSMIT, 0)
policer = VppPolicer(self, "subif_l3_pol", 80, 0, 1000, 0,
                     conform_action=action_tx, exceed_action=action_tx,
                     violate_action=action_tx)
policer.add_vpp_config()
# Apply policer to sub-interface input on pg0
policer.apply_vpp_config(sub_if0.sw_if_index, Dir.RX, True)
```
The policer named `subif_l3_pol` has a _CIR_ of 80kbps, an _EIR_ of 0kbps, a _CB_ of 1000 bytes, and
an _EB_ of 0 bytes, and otherwise always accepts packets. I do this so that I can eventually detect
if and how many packets were seen, and how many bytes were counted in the conform and violate actions.
Next, I can generate a few packets and send them out from `pg0`, and wait to receive them on `pg1`:
```python
# Send packets with VLAN tag from sub_if0 to sub_if1
pkts = []
for i in range(NUM_PKTS):  # NUM_PKTS = 67
    pkt = (
        Ether(src=self.pg0.remote_mac, dst=self.pg0.local_mac) / Dot1Q(vlan=10)
        / IP(src=sub_if0.remote_ip4, dst=sub_if1.remote_ip4) / UDP(sport=1234, dport=1234)
        / Raw(b"\xa5" * 100)
    )
    pkts.append(pkt)
# Send and verify packets are policed and forwarded
rx = self.send_and_expect(self.pg0, pkts, self.pg1)
stats = policer.get_stats()
# Verify policing happened
self.assertGreater(stats["conform_packets"], 0)
self.assertEqual(stats["exceed_packets"], 0)
self.assertGreater(stats["violate_packets"], 0)
self.logger.info(f"L3 sub-interface input policer stats: {stats}")
```
Similar to the L3 sub-interface input policer, I also write a test for L3 sub-interface output
policer. The only difference between the two is that in the output case, the policer is applied to
`pg1` in the `Dir.TX` direction, while in the input case, it's applied to `pg0` in the `Dir.RX`
direction.
I can predict the outcome. Every packet is exactly 146 bytes:
* 14 bytes of Ethernet header (src/dst MAC and ethertype) in `Ether()`
* 4 bytes VLAN tag (10) in `Dot1Q()`
* 20 bytes IPv4 header in `IP()`
* 8 bytes UDP header in `UDP()`
* 100 bytes of additional payload.
When allowing a burst of 1000 bytes, that means 6 packets should make it through (876 bytes) in the
`conform` bucket while the other 61 should be in the `violate` bucket. I won't see any packets in
the `exceed` bucket, because the policer I created is a simple one-rate, two-color `1R2C` policer
with `EB` set to 0, so every non-conforming packet goes straight to violate as there is no extra
budget in the exceed bucket. However, all packets are still transmitted, because the action was set
to transmit in all cases.
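A quick sketch verifies the prediction (plain Python, assuming the packets arrive back-to-back so the CIR refill between packets is negligible):

```python
# Predict the 1R2C policer stats for the test: 67 packets of 146 bytes each
# through a bucket of CB=1000 bytes, sent back-to-back (no meaningful refill).
NUM_PKTS, PKT_LEN, CB = 67, 146, 1000

tokens, conform, violate = CB, 0, 0
for _ in range(NUM_PKTS):
    if PKT_LEN <= tokens:
        tokens -= PKT_LEN   # conforming packets drain the committed bucket
        conform += 1
    else:
        violate += 1        # no exceed bucket (EB=0), straight to violate

print(conform, conform * PKT_LEN)   # 6 876
print(violate, violate * PKT_LEN)   # 61 8906
```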
```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2
15:21:46,868 L3 sub-interface input policer stats: {'conform_packets': 7, 'conform_bytes': 896,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 60, 'violate_bytes': 7680}
15:21:47,919 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
**Whoops!** So much for predicting the outcome! I see that 7 packets (896 bytes) make it through on input
while 6 packets (876 bytes) made it through on output. In the input case, the packet size is
`896/7 = 128` bytes, which is 18 bytes short. What's going on?
### Side Quest: Policer Accounting
On the vpp-dev mailinglist, Ben points out that the accounting changes when moving from
`device-input` to `ip[46]-input`: after device-input, the packet buffer is advanced to the
L3 portion, and will start at the IPv4 or IPv6 header. Considering I was using dot1q tagged
sub-interfaces, that means I will be short exactly 18 bytes. The reason this does not happen on
the way out is that `ip[46]-rewrite` has already wound the buffer back to be able to insert
the ethernet frame and encapsulation, so no adjustment is needed there.
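The arithmetic of the discrepancy, as a quick Python sketch:

```python
# Why the input path counted 7 conform packets instead of 6: on the input
# arc the buffer starts at the IP header, so the policer sees 18 fewer
# bytes (14-byte Ethernet header + 4-byte dot1q tag) per packet.
ETH_HDR, DOT1Q_TAG = 14, 4
PKT_LEN, CB = 146, 1000

l3_len = PKT_LEN - ETH_HDR - DOT1Q_TAG   # bytes as seen by the input policer
print(l3_len)        # 128
print(CB // l3_len)  # 7 packets now fit in the 1000-byte committed burst
print(7 * l3_len)    # 896 bytes -- exactly the stats I observed
```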
Ben also points out that when applying the policer to the interface, I can detect at creation time if
it's a PHY, a single-tagged or a double-tagged interface, and store some information to help correct
the accounting. We discuss it a bit on the mailinglist, and agree that it's best for all four
cases (L2 input/output and L3 input/output) to use the full L2 frame bytes in the accounting,
which as an added benefit also remains backwards compatible with the `device-input` accounting.
Chapeau, Ben, you're so clever!
I add a little helper function:
```cpp
static u8
vnet_policer_compute_l2_overhead (vnet_main_t *vnm, u32 sw_if_index, vlib_dir_t dir)
{
  if (dir == VLIB_TX)
    return 0;

  vnet_hw_interface_t *hi = vnet_get_sup_hw_interface (vnm, sw_if_index);
  if (PREDICT_FALSE (hi->hw_class_index != ethernet_hw_interface_class.index))
    return 0; /* Not Ethernet */

  vnet_sw_interface_t *si = vnet_get_sw_interface (vnm, sw_if_index);
  if (si->type == VNET_SW_INTERFACE_TYPE_SUB)
    {
      if (si->sub.eth.flags.one_tag)
        return 18; /* Ethernet + single VLAN */
      if (si->sub.eth.flags.two_tags)
        return 22; /* Ethernet + QinQ */
    }
  return 14; /* Untagged Ethernet */
}
```
And in the policer struct, I also add an `l2_overhead_by_sw_if_index[dir][sw_if_index]` to store
these values. That way, I do not need to do this calculation for every packet in the dataplane, but
just blindly add the value I pre-computed at creation time. This is safe, because sub-interfaces
cannot change their encapsulation after being created.
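The idea of that lookup table, sketched in Python (the function name, table shape, and the example sw_if_index are mine, mirroring the C helper):

```python
# Sketch of pre-computing per-interface L2 overhead once at apply time,
# rather than per packet. Names and indices are illustrative.
RX, TX = 0, 1

def compute_l2_overhead(direction, tags):
    """L2 bytes missing from the buffer; only the RX/input path needs them."""
    if direction == TX:
        return 0           # ip[46]-rewrite has already restored the L2 header
    return 14 + 4 * tags   # Ethernet header plus 4 bytes per VLAN tag

# Filled once when the policer is applied, indexed [dir][sw_if_index]:
l2_overhead = {RX: {}, TX: {}}
l2_overhead[RX][5] = compute_l2_overhead(RX, tags=1)  # a dot1q sub-interface
print(l2_overhead[RX][5])   # 18
```

This pre-computation is safe precisely because a sub-interface's encapsulation is fixed at creation time.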
In the `vnet_policer_police()` dataplane function, I add an `l2_overhead` argument, and then call it
like so:
```cpp
u16 l2_overhead0 = (is_l2) ? 0 : pm->l2_overhead_by_sw_if_index[dir][sw_if_index0];
act0 = vnet_policer_police (vm, b0, pi0, ..., l2_overhead0);
```
And with that, my two tests give the same results:
```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'policer stats'
15:38:39,720 L3 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:38:40,715 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```
Yaay, great success!
### Test: L2 in/output
The tests for the L2 input and output case are not radically different. In the setup, rather than
giving the VLAN sub-interfaces an IPv4 address, I'll just add them to a bridge-domain:
```python
# Create VLAN sub-interfaces on pg0 and pg1
sub_if0 = VppDot1QSubint(self, self.pg0, 30)
sub_if0.admin_up()
sub_if1 = VppDot1QSubint(self, self.pg1, 30)
sub_if1.admin_up()
# Add both sub-interfaces to bridge domain 1
self.vapi.sw_interface_set_l2_bridge(sub_if0.sw_if_index, bd_id=1)
self.vapi.sw_interface_set_l2_bridge(sub_if1.sw_if_index, bd_id=1)
```
This puts the sub-interfaces in L2 mode, after which the `l2-input` and `l2-output` feature bitmaps
kick in. Without further ado:
```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'L2.*policer stats'
15:50:15,217 L2 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:50:16,217 L2 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```
## Results
The policer works in all sorts of cool scenarios now. Let me give a concrete example, where I
create an L2XC with VTR and then apply a policer. I've written about VTR, which stands for _VLAN
Tag Rewriting_, before, in an old article lovingly called [[VPP VLAN Gymnastics]({{< ref
"2022-02-14-vpp-vlan-gym" >}})]. It all looks like this:
```
vpp# create sub Gi10/0/0 100
vpp# create sub Gi10/0/1 200
vpp# set interface l2 xconnect Gi10/0/0.100 Gi10/0/1.200
vpp# set interface l2 xconnect Gi10/0/1.200 Gi10/0/0.100
vpp# set interface l2 tag-rewrite Gi10/0/0.100 pop 1
vpp# set interface l2 tag-rewrite Gi10/0/1.200 pop 1
vpp# policer add name pol-test rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name pol-test Gi10/0/0.100
```
After applying this configuration, the input bitmap on Gi10/0/0.100 becomes `POLICER(14) | VTR(10) |
XCONNECT(1) | DROP(0)`. Packets now take the following path through the dataplane:
```
ethernet-input
→ l2-input (computes bitmap, dispatches to bit 14)
→ l2-policer-input (clears bit 14, polices, dispatches to bit 10)
→ l2-input-vtr (clears bit 10, pops 1 tag, dispatches to bit 1)
→ l2-output (XCONNECT: sw_if_index[TX]=Gi10/0/1.200)
→ inline output VTR (pushes 1 tag for .200)
→ interface-output
→ Gi10/0/1-tx
```
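For fun, the bitmap value and the dispatch order can be computed directly (a plain Python sketch, using the bit positions from the chain above):

```python
# The input feature bitmap on Gi10/0/0.100 after the configuration above:
# POLICER(14) | VTR(10) | XCONNECT(1) | DROP(0).
POLICER, VTR, XCONNECT, DROP = 14, 10, 1, 0

bitmap = (1 << POLICER) | (1 << VTR) | (1 << XCONNECT) | (1 << DROP)
print(hex(bitmap))   # 0x4403

# Walking from the highest set bit down gives the dispatch order:
order = []
while bitmap:
    bit = bitmap.bit_length() - 1
    bitmap &= ~(1 << bit)
    order.append(bit)
print(order)   # [14, 10, 1, 0]
```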
## What's Next
I've sent the change, which was only about ~300 LOC, off for review. You can follow along on the
gerrit on [[44654](https://gerrit.fd.io/r/c/vpp/+/44654)]. I don't think the policer got much slower
after adding the l2 path, and one might argue it doesn't matter because policing didn't work on
sub-interfaces and L2 output at all, before this change. However, for the L3 input/output case, and
for the PHY input case, there are a few CPU cycles added now to address the L2 and sub-int use
cases. Perhaps I should do a side-by-side comparison of packets/sec throughput on the bench some
time.
It would be great if VPP supported FQ-CoDel (Flow Queue Controlled Delay), an algorithm and packet
scheduler designed to eliminate bufferbloat (high latency caused by excessive buffering in network
equipment) while ensuring fair bandwidth distribution among competing traffic flows. I know that
Dave Täht - may he rest in peace - always wanted that.
For me, I've set my sights on EVPN VXLAN, and I have also started toying with SRv6 L2 transport. I hope
that in the spring I'll have a bit more time to contribute to VPP and write about it. Stay tuned!