From bf7de1181cef92ec47e4d378399144fddbfc4250 Mon Sep 17 00:00:00 2001
From: Pim van Pelt
Date: Sat, 14 Feb 2026 16:01:54 +0100
Subject: [PATCH] Add L2/L3 sub-int policers
---
 content/articles/2026-02-14-vpp-policers.md | 401 ++++++++++++++++++++
 1 file changed, 401 insertions(+)
 create mode 100644 content/articles/2026-02-14-vpp-policers.md

diff --git a/content/articles/2026-02-14-vpp-policers.md b/content/articles/2026-02-14-vpp-policers.md
new file mode 100644
index 0000000..77a7cac
--- /dev/null
+++ b/content/articles/2026-02-14-vpp-policers.md
@@ -0,0 +1,401 @@
---
date: "2026-02-14T11:35:14Z"
title: VPP Policers
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

There are some really fantastic features in VPP, some of which are less well known and not always
very well documented. In this article, I will describe a unique use case in which I think VPP will
excel, notably acting as a gateway for Internet Exchange Points.

A few years ago, I toyed with the idea of using VPP as an _IXP Reseller_ concentrator, allowing
several carriers to connect with say 10G or 25G ports, and carry sub-customers on tagged interfaces
with safety (like MAC address ACLs) and rate limiting (say any given customer limited to 1Gbps on a
10G or 100G trunk), all provided by VPP. You can take a look at my [[VPP IXP Gateway]({{< ref
2023-10-21-vpp-ixp-gateway-1 >}})] article for details. I never ended up deploying it.

In this article, I follow up and fix a few shortcomings in VPP's policer framework.

## Introduction

Consider the following policer in VPP:

```
vpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-a GigabitEthernet10/0/1
vpp# policer output name client-a GigabitEthernet10/0/1
```

The idea is to give a _committed information rate_ of 150Mbps with a _committed burst_ of 15MB.
The _CIR_ represents the average bandwidth allowed for the interface, while the _CB_ represents the
maximum amount of data (in bytes) that can be sent at line speed in a single burst before the _CIR_
kicks in to throttle the traffic.

Back in October of 2023, I reached the conclusion that the policer works in the following modes:
* ***On input***, the policer is applied on `device-input`, which means it takes frames directly
  from the PHY. It will not work on any sub-interfaces. This explains why the policer worked on
  the untagged interface (`Gi10/0/1`) but not on tagged sub-interfaces (`Gi10/0/1.100`).
* ***On output***, the policer is applied on `ip4-output` and `ip6-output`, which works only for
  L3-enabled interfaces, not for L2 interfaces such as those used in bridge domains or L2 cross
  connects.

## VPP Infra: L2 Feature Maps

The benefit of using the `device-input` arc is that it's efficient: every packet that comes from the
device (`Gi10/0/1`), tagged or not, is handed off to the policer plugin. This means all traffic
(L2, L3, sub-interface, tagged, untagged) goes through the same policer.

In `src/vnet/l2/` there are two nodes called `l2-input` and `l2-output`. I can configure VPP to call
these nodes before `ip[46]-unicast` and before `ip[46]-output` respectively. These L2 nodes have a
feature bitmap with 32 entries. The l2-input / l2-output nodes use a bitmap walk: they find the
highest set bit, and then dispatch the packet to a pre-configured graph node.
Upon return, the
`feat-bitmap-next` infra checks the next bit and, if it is set, dispatches the packet to the next
pre-configured graph node. This continues until all the bits are checked, and each packet has been
handed to the graph node for every bit that was set.

To show what I can do with these nodes, let me dive into an example. When a packet arrives on an
interface configured in L2 mode, either because it's a bridge-domain or an L2XC, `ethernet-input`
will send it to `l2-input`. This node does three things:

1. It will classify the packet by reading the interface configuration (`l2input_main.configs`) for
the sw_if_index, which contains the mode of the interface (`bridge-domain`, `l2xc`, or `bvi`). It
also contains the feature bitmap: a statically configured set of features for this interface.

1. It will store the effective feature bitmap for each individual packet in the packet buffer. For
bridge mode, depending on the packet being unicast or multicast, some features are disabled. For
example, flooding for unicast packets is not performed, so those bits are cleared. The result is
stored in a per-packet working copy that downstream nodes consume in turn.

1. For each of the bits set in the packet buffer's `l2.feature_bitmap`, starting from the highest
bit set, `l2-input` will set the next node, for example `l2-input-vtr` to do VLAN Tag Rewriting.
Once that node is finished, it'll clear its own bit and search for the next one that is set, in
order to pick the next node.

I note that processing order is HIGH to LOW bits.
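To make the walk concrete, here's a tiny Python model of it. This is just an illustration of the high-to-low dispatch order, not the actual VPP implementation (which does this per packet in C); the bit assignments and node names in the example are borrowed from the chain below.

```python
def feature_bitmap_walk(bitmap, dispatch):
    """Model of the l2-input/l2-output feature bitmap walk.

    Repeatedly find the highest set bit, dispatch to the graph node
    registered for that bit, clear the bit, and continue until no
    bits remain. dispatch maps bit number -> node name.
    """
    visited = []
    while bitmap:
        bit = bitmap.bit_length() - 1  # highest set bit
        visited.append(dispatch[bit])
        bitmap &= ~(1 << bit)  # each node clears its own bit
    return visited


# Example: SPAN (bit 17), VTR (bit 10) and XCONNECT (bit 1) enabled
nodes = {17: "l2-input-span", 10: "l2-input-vtr", 1: "l2-xconnect"}
bm = (1 << 17) | (1 << 10) | (1 << 1)
print(feature_bitmap_walk(bm, nodes))
# ['l2-input-span', 'l2-input-vtr', 'l2-xconnect']
```

The walk always terminates at bit 0 or 1 (`DROP` or `XCONNECT` on input), because those bits are the catch-all endpoints of the chain.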
By reading `l2_input.h` and `l2_output.h`, I can see that the full
`l2-input` and `l2-output` chains look like this:

```
l2-input
  → SPAN(17) → INPUT_CLASSIFY(16) → INPUT_FEAT_ARC(15) → POLICER_CLAS(14)
  → ACL(13) → VPATH(12) → L2_IP_QOS_RECORD(11) → VTR(10) → LEARN(9) → RW(8)
  → FWD(7) → UU_FWD(6) → UU_FLOOD(5) → ARP_TERM(4) → ARP_UFWD(3) → FLOOD(2)
  → XCONNECT(1) → DROP(0)

l2-output
  → XCRW(12) → OUTPUT_FEAT_ARC(11) → OUTPUT_CLASSIFY(10) → LINESTATUS_DOWN(9)
  → STP_BLOCKED(8) → IPIW(7) → EFP_FILTER(6) → L2PT(5) → ACL(4) → QOS(3)
  → CFM(2) → SPAN(1) → OUTPUT(0)
```

If none of the L2 processing nodes set the next node, ultimately `feature-bitmap-drop` gently takes
the packet behind the shed and drops it. On the way out, ultimately the last `OUTPUT` bit sends the
packet to `interface-output`, which hands off to the driver's TX node.

### Enabling L2 features

There are lots of places in VPP where L2 feature bitmaps are set/cleared. Here are a few examples:

```
# VTR: sets L2INPUT_FEAT_VTR + configures output VTR (VLAN Tag Rewriting)
vpp# set interface l2 tag-rewrite GigE0/0/0.100 pop 1

# ACL: sets L2INPUT_FEAT_ACL / L2OUTPUT_FEAT_ACL
vpp# set interface l2 input acl intfc GigE0/0/0 ip4-table 0
vpp# set interface l2 output acl intfc GigE0/0/0 ip4-table 0

# SPAN: sets L2INPUT_FEAT_SPAN / L2OUTPUT_FEAT_SPAN
vpp# set interface span GigE0/0/0 l2 destination GigE0/0/1

# Bridge domain level (affects bd_feature_bitmap, applied to all bridge members)
vpp# set bridge-domain learn 1     # enable/disable LEARN in BD
vpp# set bridge-domain forward 1   # enable/disable FWD in BD
vpp# set bridge-domain flood 1     # enable/disable FLOOD in BD
```

I'm starting to see how these L2 feature bitmaps are powerful yet flexible. I'm ready to add one!

### Creating L2 features

First, I need to insert my new `POLICER` bit in `l2_input.h` and `l2_output.h`.
Then, I can call
`l2input_intf_bitmap_enable()` and its companion `l2output_intf_bitmap_enable()` to enable or
disable the L2 feature, and point it at a new graph node.

```cpp
  /* Enable policer both on L2 feature bitmap, and L3 feature arcs */
  if (dir == VLIB_RX) {
    l2input_intf_bitmap_enable (sw_if_index, L2INPUT_FEAT_POLICER, apply);
    vnet_feature_enable_disable ("ip4-unicast", "policer-input", sw_if_index, apply, 0, 0);
    vnet_feature_enable_disable ("ip6-unicast", "policer-input", sw_if_index, apply, 0, 0);
  } else {
    l2output_intf_bitmap_enable (sw_if_index, L2OUTPUT_FEAT_POLICER, apply);
    vnet_feature_enable_disable ("ip4-output", "policer-output", sw_if_index, apply, 0, 0);
    vnet_feature_enable_disable ("ip6-output", "policer-output", sw_if_index, apply, 0, 0);
  }
```

What this means is that if the interface happens to be in L2 mode, in other words when it is a
`bridge-domain` member or when it is in `L2XC` mode, I will enable the L2 features. However, for
L3 packets, I will still proceed to enable the existing `policer-input` node by calling
`vnet_feature_enable_disable()` on the IPv4 and IPv6 input arc. I make a mental note that MPLS and
other non-IP traffic will not be policed in this way.

### Updating Policer graph node

The policer framework has an existing dataplane function called `vnet_policer_inline()` which I
extend to take a flag `is_l2`. Using this flag, I can either set the next graph node with
`vnet_l2_feature_next()`, or, in the pre-existing L3 case, with `vnet_feature_next()` on the packets
that move through the node.
The nodes now look like this:

```cpp
VLIB_NODE_FN (policer_l2_input_node)
(vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
{
  return vnet_policer_inline (vm, node, frame, VLIB_RX, 1 /* is_l2 */);
}

VLIB_REGISTER_NODE (policer_l2_input_node) = {
  .name = "l2-policer-input",
  .vector_size = sizeof (u32),
  .format_trace = format_policer_trace,
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_errors = ARRAY_LEN(vnet_policer_error_strings),
  .error_strings = vnet_policer_error_strings,
  .n_next_nodes = VNET_POLICER_N_NEXT,
  .next_nodes = {
    [VNET_POLICER_NEXT_DROP] = "error-drop",
    [VNET_POLICER_NEXT_HANDOFF] = "policer-input-handoff",
  },
};

/* Register on IP unicast arcs for L3 routed sub-interfaces */
VNET_FEATURE_INIT (policer_ip4_unicast, static) = {
  .arc_name = "ip4-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip4-lookup"),
};

VNET_FEATURE_INIT (policer_ip6_unicast, static) = {
  .arc_name = "ip6-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip6-lookup"),
};
```

Here, I install the L3 feature before `ip[46]-lookup`, and hook up the L2 feature with a new node
that really just calls the existing node but with `is_l2` set to true. I do something very similar
for the output direction, except there I'll hook the L3 feature before `ip[46]-output`.

## Tests!

I think writing unit- and integration tests is a great idea. I add a new file
`test/test_policer_subif.py` which tests all four new cases:
1. **L3 Input**: on a routed sub-interface
1. **L3 Output**: on a routed sub-interface
1. **L2 Input**: on a bridge-domain sub-interface
1. **L2 Output**: on a bridge-domain sub-interface

The existing `test/test_policer.py` should also cover the existing cases, and of course it's
important that my work does not break them.
Lucky me, the existing tests all still pass :)

### Test: L3 in/output

The tests use a VPP feature called `packet-generator`, which creates virtual devices from which I
can emit packets crafted with Scapy, and on which I can capture received packets with pcap. For the
input, first I'll create the interface and apply a new policer to it:

```python
    sub_if0 = VppDot1QSubint(self, self.pg0, 10)
    sub_if0.admin_up()
    sub_if0.config_ip4()
    sub_if0.resolve_arp()

    # Create policer
    action_tx = PolicerAction(VppEnum.vl_api_sse2_qos_action_type_t.SSE2_QOS_ACTION_API_TRANSMIT, 0)
    policer = VppPolicer(self, "subif_l3_pol", 80, 0, 1000, 0,
        conform_action=action_tx, exceed_action=action_tx, violate_action=action_tx,
    )
    policer.add_vpp_config()

    # Apply policer to sub-interface input on pg0
    policer.apply_vpp_config(sub_if0.sw_if_index, Dir.RX, True)
```

The policer with name `subif_l3_pol` has a _CIR_ of 80kbps, an _EIR_ of 0kbps, a _CB_ of 1000
bytes, and an _EB_ of 0 bytes, and otherwise always accepts packets. I do this so that I can
eventually detect if and how many packets were seen, and how many bytes were passed in the conform
and violate actions.

Next, I can generate a few packets and send them out from `pg0`, and wait to receive them on `pg1`:

```python
    # Send packets with VLAN tag from sub_if0 to sub_if1
    pkts = []
    for i in range(NUM_PKTS):  # NUM_PKTS = 67
        pkt = (
            Ether(src=self.pg0.remote_mac, dst=self.pg0.local_mac) / Dot1Q(vlan=10)
            / IP(src=sub_if0.remote_ip4, dst=sub_if1.remote_ip4) / UDP(sport=1234, dport=1234)
            / Raw(b"\xa5" * 100)
        )
        pkts.append(pkt)

    # Send and verify packets are policed and forwarded
    rx = self.send_and_expect(self.pg0, pkts, self.pg1)

    stats = policer.get_stats()
    # Verify policing happened
    self.assertGreater(stats["conform_packets"], 0)
    self.assertEqual(stats["exceed_packets"], 0)
    self.assertGreater(stats["violate_packets"], 0)

    self.logger.info(f"L3 sub-interface input policer stats: {stats}")
```

Similar to the L3 sub-interface input policer, I also write a test for the L3 sub-interface output
policer. The only difference between the two is that in the output case, the policer is applied to
`pg1` in the `Dir.TX` direction, while in the input case, it's applied to `pg0` in the `Dir.RX`
direction.

I can predict the outcome. Every packet is exactly 146 bytes:
* 14 bytes of Ethernet header (src/dst MAC and ethertype) in `Ether()`
* 4 bytes VLAN tag (10) in `Dot1Q()`
* 20 bytes IPv4 header in `IP()`
* 8 bytes UDP header in `UDP()`
* 100 bytes of additional payload.

When allowing a burst of 1000 bytes, that means 6 packets should make it through (876 bytes) in the
`conform` bucket while the other 61 should be in the `violate` bucket. I won't see any packets in
the `exceed` bucket, because the policer I created is a simple one-rate, two-color `1R2C` policer
with `EB` set to 0, so every non-conforming packet goes straight to violate as there is no extra
budget in the exceed bucket. However, they are all sent, because the action was set to transmit in
all cases.
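The arithmetic behind that prediction can be double-checked with a tiny standalone model. This is a sketch, not VPP code: it assumes the packets arrive back-to-back at line rate, so token refill during the burst is negligible and simply ignored.

```python
def police_1r2c(num_pkts, pkt_bytes, cb_bytes):
    """Single-rate two-color policer model for a back-to-back burst.

    The committed bucket starts full at cb_bytes; each conforming
    packet drains pkt_bytes from it. With EB = 0, any packet that
    does not fit goes straight to violate.
    Returns (conform, violate) packet counts.
    """
    tokens = cb_bytes
    conform = violate = 0
    for _ in range(num_pkts):
        if tokens >= pkt_bytes:
            conform += 1
            tokens -= pkt_bytes
        else:
            violate += 1
    return conform, violate


print(police_1r2c(67, 146, 1000))  # → (6, 61)
```

Six packets drain 876 of the 1000 committed bytes; the remaining 124 bytes cannot fit a seventh 146-byte packet, so the other 61 violate.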
+ +``` +pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 +15:21:46,868 L3 sub-interface input policer stats: {'conform_packets': 7, 'conform_bytes': 896, + 'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 60, 'violate_bytes': 7680} +15:21:47,919 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876, + 'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906} +``` + +{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}} + +**Whoops!** So much for predicting the outcome! I see that 7 packets (896 bytes) make it through on input +while 6 packets (876 bytes) made it through on output. In the input case, the packet size is +`896/7 = 128` bytes, which is 18 bytes short. What's going on? + +### Side Quest: Policer Accounting + +On the vpp-dev mailinglist, Ben points out that the accounting will be changing when moving from +`device-input` to `ip[46]-input`, because after device-input, the packet buffer is advanced to the +L3 portion, and will start at the IPv4 or IPv6 header. Considering I was using dot1q tagged +sub-interfaces, that means I will be short exactly 18 bytes. The reason why this does not happen on +the way out, is that `ip[46]-rewrite` have both already wound back the buffer to be able to insert +the ethernet frame and encapsulation, so no adjustment is needed there. + +Ben also points out that when applying the policer to the interface, I can detect at creation time if +it's a PHY, a single-tagged or a double-tagged interface, and store some information to help correct +the accounting. We discuss a little bit on the mailinglist, and agree that it's best for all four +cases (L2 input/output and L3 intput/output) to use the full L2 frame bytes in the accounting, which +as an added benefit also that is remains backwards compatible with the `device-input` accounting. +Chapeau, Ben you're so clever! 
I add a little helper function:

```cpp
static u8 vnet_policer_compute_l2_overhead (vnet_main_t *vnm, u32 sw_if_index, vlib_dir_t dir)
{
  if (dir == VLIB_TX) return 0;

  vnet_hw_interface_t *hi = vnet_get_sup_hw_interface (vnm, sw_if_index);
  if (PREDICT_FALSE (hi->hw_class_index != ethernet_hw_interface_class.index))
    return 0; /* Not Ethernet */

  vnet_sw_interface_t *si = vnet_get_sw_interface (vnm, sw_if_index);
  if (si->type == VNET_SW_INTERFACE_TYPE_SUB) {
    if (si->sub.eth.flags.one_tag) return 18;  /* Ethernet + single VLAN */
    if (si->sub.eth.flags.two_tags) return 22; /* Ethernet + QinQ */
  }

  return 14; /* Untagged Ethernet */
}
```

And in the policer struct, I also add an `l2_overhead_by_sw_if_index[dir][sw_if_index]` table to
store these values. That way, I do not need to do this calculation for every packet in the
dataplane, but just blindly add the value I pre-computed at creation time. This is safe, because
sub-interfaces cannot change their encapsulation after being created.

In the `vnet_policer_police()` dataplane function, I add an `l2_overhead` argument, and then call it
like so:

```cpp
  u16 l2_overhead0 = (is_l2) ? 0 : pm->l2_overhead_by_sw_if_index[dir][sw_if_index0];
  act0 = vnet_policer_police (vm, b0, pi0, ..., l2_overhead0);
```

And with that, my two tests give the same results:

```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'policer stats'
15:38:39,720 L3 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:38:40,715 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```

Yaay, great success!

### Test: L2 in/output

The tests for the L2 input and output cases are not radically different.
In the setup, rather than
giving the VLAN sub-interfaces an IPv4 address, I'll just add them to a bridge-domain:

```python
    # Create VLAN sub-interfaces on pg0 and pg1
    sub_if0 = VppDot1QSubint(self, self.pg0, 30)
    sub_if0.admin_up()
    sub_if1 = VppDot1QSubint(self, self.pg1, 30)
    sub_if1.admin_up()

    # Add both sub-interfaces to bridge domain 1
    self.vapi.sw_interface_set_l2_bridge(sub_if0.sw_if_index, bd_id=1)
    self.vapi.sw_interface_set_l2_bridge(sub_if1.sw_if_index, bd_id=1)
```

This puts the sub-interfaces in L2 mode, after which the `l2-input` and `l2-output` feature bitmaps
kick in. Without further ado:

```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'L2.*policer stats'
15:50:15,217 L2 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:50:16,217 L2 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```

## What's Next

I've sent the change, which was only about 300 LOC, off for review. You can follow along on the
gerrit on [[44654](https://gerrit.fd.io/r/c/vpp/+/44654)]. I don't think the policer got much slower
after adding the L2 path, and one might argue it doesn't matter because policing didn't work on
sub-interfaces and L2 output at all before this change. However, for the L3 input/output case, and
for the PHY input case, there are a few CPU cycles added now to address the L2 and sub-int use
cases. Perhaps I should do a side-by-side comparison of packets/sec throughput on the bench some
time.

It would be great if VPP supported FQ-CoDel (FlowQueue-CoDel), an Active Queue Management (AQM)
algorithm and packet scheduler designed to eliminate bufferbloat (high latency caused by excessive
buffering in network equipment) while ensuring fair bandwidth distribution among competing traffic
flows. I know that Dave Täht - may he rest in peace - always wanted that.

For me, I've set my sights on EVPN VxLAN, and started toying with SRv6 as well. I hope that in the
spring I'll have a bit more time to contribute to VPP and write about it. Stay tuned!