From bf7de1181cef92ec47e4d378399144fddbfc4250 Mon Sep 17 00:00:00 2001
From: Pim van Pelt
Date: Sat, 14 Feb 2026 16:01:54 +0100
Subject: [PATCH] Add L2/L3 sub-int policers
---
 content/articles/2026-02-14-vpp-policers.md | 401 ++++++++++++++++++++
 1 file changed, 401 insertions(+)
 create mode 100644 content/articles/2026-02-14-vpp-policers.md

diff --git a/content/articles/2026-02-14-vpp-policers.md b/content/articles/2026-02-14-vpp-policers.md
new file mode 100644
index 0000000..77a7cac
--- /dev/null
+++ b/content/articles/2026-02-14-vpp-policers.md
@@ -0,0 +1,401 @@
---
date: "2026-02-14T11:35:14Z"
title: VPP Policers
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

There are some really fantastic features in VPP, some of which are less well known and not always
very well documented. In this article, I will describe a unique use case in which I think VPP will
excel, notably acting as a gateway for Internet Exchange Points.

A few years ago, I toyed with the idea of using VPP as an _IXP Reseller_ concentrator, allowing
several carriers to connect with say 10G or 25G ports, and carry sub-customers on tagged interfaces
with safety (like MAC address ACLs) and rate limiting (say any given customer limited to 1Gbps on a
10G or 100G trunk), all provided by VPP. You can take a look at my [[VPP IXP Gateway]({{< ref
2023-10-21-vpp-ixp-gateway-1 >}})] article for details. I never ended up deploying it.

In this article, I follow up and fix a few shortcomings in VPP's policer framework.

## Introduction

Consider the following policer in VPP:

```
vpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-a GigabitEthernet10/0/1
vpp# policer output name client-a GigabitEthernet10/0/1
```

The idea is to give a _committed information rate_ of 150Mbps with a _committed burst_ of 15MB.
The _CIR_ represents the average bandwidth allowed for the interface, while the _CB_ represents the
maximum amount of data (in bytes) that can be sent at line speed in a single burst before the _CIR_
kicks in to throttle the traffic.

Back in October of 2023, I reached the conclusion that the policer works in the following modes:
* ***On input***, the policer is applied on `device-input`, which means it takes frames directly
  from the PHY. It will not work on any sub-interfaces. This explains why the policer worked on
  the untagged interface (`Gi10/0/1`) but not on tagged sub-interfaces (`Gi10/0/1.100`).
* ***On output***, the policer is applied on `ip4-output` and `ip6-output`, which works only for
  L3-enabled interfaces, not for L2 interfaces such as those used in bridge domains or L2 cross
  connects.

## VPP Infra: L2 Feature Maps

The benefit of using the `device-input` arc is that it's efficient: every packet that comes from the
device (`Gi10/0/1`), tagged or not, is handed off to the policer plugin. This means all traffic
(L2, L3, sub-interface, tagged, untagged) goes through the same policer.

In `src/vnet/l2/` there are two nodes called `l2-input` and `l2-output`. I can configure VPP to call
these nodes before `ip[46]-unicast` and before `ip[46]-output` respectively. These L2 nodes have a
feature bitmap with 32 entries. The l2-input / l2-output nodes use a bitmap walk: they find the
highest set bit, and then dispatch the packet to a pre-configured graph node.
Upon return, the
`feat-bitmap-next` infra checks the next bit and, if it is set, dispatches the packet to the next
pre-configured graph node. This continues until all the bits are checked, and each packet has been
handed to the graph node for every bit that was set.

To show what I can do with these nodes, let me dive into an example. When a packet arrives on an
interface configured in L2 mode, either because it's a bridge-domain or an L2XC, `ethernet-input`
will send it to `l2-input`. This node does three things:

1. It will classify the packet by reading the interface configuration (`l2input_main.configs`) for
the sw_if_index, which contains the mode of the interface (`bridge-domain`, `l2xc`, or `bvi`). It
also contains the feature bitmap: a statically configured set of features for this interface.

1. It will store the effective feature bitmap for each individual packet in the packet buffer. For
bridge mode, depending on the packet being unicast or multicast, some features are disabled. For
example, flooding for unicast packets is not performed, so those bits are cleared. The result is
stored in a per-packet working copy that downstream nodes consume in turn.

1. For each of the bits set in the packet buffer's `l2.feature_bitmap`, starting from the highest
bit set, `l2-input` will set the next node, for example `l2-input-vtr` to do VLAN Tag Rewriting.
Once that node is finished, it'll clear its own bit and search for the next one that is set, in
order to pick the next node.

I note that processing order is HIGH to LOW bits.
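To make the walk concrete, here's a tiny Python model of it. This is just an illustration of the high-to-low dispatch order, not the actual VPP implementation (which does this per packet in C); the bit assignments and node names in the example are borrowed from the chain below.

```python
def feature_bitmap_walk(bitmap, dispatch):
    """Model of the l2-input/l2-output feature bitmap walk.

    Repeatedly find the highest set bit, dispatch to the graph node
    registered for that bit, clear the bit, and continue until no
    bits remain. dispatch maps bit number -> node name.
    """
    visited = []
    while bitmap:
        bit = bitmap.bit_length() - 1  # highest set bit
        visited.append(dispatch[bit])
        bitmap &= ~(1 << bit)  # each node clears its own bit
    return visited


# Example: SPAN (bit 17), VTR (bit 10) and XCONNECT (bit 1) enabled
nodes = {17: "l2-input-span", 10: "l2-input-vtr", 1: "l2-xconnect"}
bm = (1 << 17) | (1 << 10) | (1 << 1)
print(feature_bitmap_walk(bm, nodes))
# ['l2-input-span', 'l2-input-vtr', 'l2-xconnect']
```

The walk always terminates at bit 0 or 1 (`DROP` or `XCONNECT` on input), because those bits are the catch-all endpoints of the chain.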
By reading `l2_input.h` and `l2_output.h`, I can see that the full
`l2-input` and `l2-output` chains look like this:

```
l2-input
  → SPAN(17) → INPUT_CLASSIFY(16) → INPUT_FEAT_ARC(15) → POLICER_CLAS(14)
  → ACL(13) → VPATH(12) → L2_IP_QOS_RECORD(11) → VTR(10) → LEARN(9) → RW(8)
  → FWD(7) → UU_FWD(6) → UU_FLOOD(5) → ARP_TERM(4) → ARP_UFWD(3) → FLOOD(2)
  → XCONNECT(1) → DROP(0)

l2-output
  → XCRW(12) → OUTPUT_FEAT_ARC(11) → OUTPUT_CLASSIFY(10) → LINESTATUS_DOWN(9)
  → STP_BLOCKED(8) → IPIW(7) → EFP_FILTER(6) → L2PT(5) → ACL(4) → QOS(3)
  → CFM(2) → SPAN(1) → OUTPUT(0)
```

If none of the L2 processing nodes set the next node, ultimately `feature-bitmap-drop` gently takes
the packet behind the shed and drops it. On the way out, ultimately the last `OUTPUT` bit sends the
packet to `interface-output`, which hands off to the driver's TX node.

### Enabling L2 features

There are lots of places in VPP where L2 feature bitmaps are set/cleared. Here are a few examples:

```
# VTR: sets L2INPUT_FEAT_VTR + configures output VTR (VLAN Tag Rewriting)
vpp# set interface l2 tag-rewrite GigE0/0/0.100 pop 1

# ACL: sets L2INPUT_FEAT_ACL / L2OUTPUT_FEAT_ACL
vpp# set interface l2 input acl intfc GigE0/0/0 ip4-table 0
vpp# set interface l2 output acl intfc GigE0/0/0 ip4-table 0

# SPAN: sets L2INPUT_FEAT_SPAN / L2OUTPUT_FEAT_SPAN
vpp# set interface span GigE0/0/0 l2 destination GigE0/0/1

# Bridge domain level (affects bd_feature_bitmap, applied to all bridge members)
vpp# set bridge-domain learn 1     # enable/disable LEARN in BD
vpp# set bridge-domain forward 1   # enable/disable FWD in BD
vpp# set bridge-domain flood 1     # enable/disable FLOOD in BD
```

I'm starting to see how these L2 feature bitmaps are powerful yet flexible. I'm ready to add one!

### Creating L2 features

First, I need to insert my new `POLICER` bit in `l2_input.h` and `l2_output.h`.
Then, I can call
`l2input_intf_bitmap_enable()` and its companion `l2output_intf_bitmap_enable()` to enable or
disable the L2 feature, and point it at a new graph node.

```cpp
  /* Enable policer both on L2 feature bitmap, and L3 feature arcs */
  if (dir == VLIB_RX) {
    l2input_intf_bitmap_enable (sw_if_index, L2INPUT_FEAT_POLICER, apply);
    vnet_feature_enable_disable ("ip4-unicast", "policer-input", sw_if_index, apply, 0, 0);
    vnet_feature_enable_disable ("ip6-unicast", "policer-input", sw_if_index, apply, 0, 0);
  } else {
    l2output_intf_bitmap_enable (sw_if_index, L2OUTPUT_FEAT_POLICER, apply);
    vnet_feature_enable_disable ("ip4-output", "policer-output", sw_if_index, apply, 0, 0);
    vnet_feature_enable_disable ("ip6-output", "policer-output", sw_if_index, apply, 0, 0);
  }
```

What this means is that if the interface happens to be in L2 mode, in other words when it is a
`bridge-domain` member or when it is in `L2XC` mode, I will enable the L2 features. However, for
L3 packets, I will still proceed to enable the existing `policer-input` node by calling
`vnet_feature_enable_disable()` on the IPv4 and IPv6 input arc. I make a mental note that MPLS and
other non-IP traffic will not be policed in this way.

### Updating Policer graph node

The policer framework has an existing dataplane function called `vnet_policer_inline()` which I
extend to take a flag `is_l2`. Using this flag, I can either set the next graph node with
`vnet_l2_feature_next()`, or, in the pre-existing L3 case, with `vnet_feature_next()` on the packets
that move through the node.
The nodes now look like this:

```cpp
VLIB_NODE_FN (policer_l2_input_node)
(vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
{
  return vnet_policer_inline (vm, node, frame, VLIB_RX, 1 /* is_l2 */);
}

VLIB_REGISTER_NODE (policer_l2_input_node) = {
  .name = "l2-policer-input",
  .vector_size = sizeof (u32),
  .format_trace = format_policer_trace,
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_errors = ARRAY_LEN(vnet_policer_error_strings),
  .error_strings = vnet_policer_error_strings,
  .n_next_nodes = VNET_POLICER_N_NEXT,
  .next_nodes = {
    [VNET_POLICER_NEXT_DROP] = "error-drop",
    [VNET_POLICER_NEXT_HANDOFF] = "policer-input-handoff",
  },
};

/* Register on IP unicast arcs for L3 routed sub-interfaces */
VNET_FEATURE_INIT (policer_ip4_unicast, static) = {
  .arc_name = "ip4-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip4-lookup"),
};

VNET_FEATURE_INIT (policer_ip6_unicast, static) = {
  .arc_name = "ip6-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip6-lookup"),
};
```

Here, I install the L3 feature before `ip[46]-lookup`, and hook up the L2 feature with a new node
that really just calls the existing node but with `is_l2` set to true. I do something very similar
for the output direction, except there I'll hook the L3 feature before `ip[46]-output`.

## Tests!

I think writing unit- and integration tests is a great idea. I add a new file
`test/test_policer_subif.py` which tests all four new cases:
1. **L3 Input**: on a routed sub-interface
1. **L3 Output**: on a routed sub-interface
1. **L2 Input**: on a bridge-domain sub-interface
1. **L2 Output**: on a bridge-domain sub-interface

The existing `test/test_policer.py` should also cover the existing cases, and of course it's
important that my work does not break them.
Lucky me, the existing tests all still pass :)

### Test: L3 in/output

The tests use a VPP feature called `packet-generator`, which creates virtual devices from which I
can emit packets crafted with Scapy, and on which I can capture received packets with pcap. For the
input, first I'll create the interface and apply a new policer to it:

```python
    sub_if0 = VppDot1QSubint(self, self.pg0, 10)
    sub_if0.admin_up()
    sub_if0.config_ip4()
    sub_if0.resolve_arp()

    # Create policer
    action_tx = PolicerAction(VppEnum.vl_api_sse2_qos_action_type_t.SSE2_QOS_ACTION_API_TRANSMIT, 0)
    policer = VppPolicer(self, "subif_l3_pol", 80, 0, 1000, 0,
        conform_action=action_tx, exceed_action=action_tx, violate_action=action_tx,
    )
    policer.add_vpp_config()

    # Apply policer to sub-interface input on pg0
    policer.apply_vpp_config(sub_if0.sw_if_index, Dir.RX, True)
```

The policer with name `subif_l3_pol` has a _CIR_ of 80kbps, an _EIR_ of 0kbps, a _CB_ of 1000
bytes, and an _EB_ of 0 bytes, and otherwise always accepts packets. I do this so that I can
eventually detect if and how many packets were seen, and how many bytes were passed in the conform
and violate actions.

Next, I can generate a few packets and send them out from `pg0`, and wait to receive them on `pg1`:

```python
    # Send packets with VLAN tag from sub_if0 to sub_if1
    pkts = []
    for i in range(NUM_PKTS):  # NUM_PKTS = 67
        pkt = (
            Ether(src=self.pg0.remote_mac, dst=self.pg0.local_mac) / Dot1Q(vlan=10)
            / IP(src=sub_if0.remote_ip4, dst=sub_if1.remote_ip4) / UDP(sport=1234, dport=1234)
            / Raw(b"\xa5" * 100)
        )
        pkts.append(pkt)

    # Send and verify packets are policed and forwarded
    rx = self.send_and_expect(self.pg0, pkts, self.pg1)

    stats = policer.get_stats()
    # Verify policing happened
    self.assertGreater(stats["conform_packets"], 0)
    self.assertEqual(stats["exceed_packets"], 0)
    self.assertGreater(stats["violate_packets"], 0)

    self.logger.info(f"L3 sub-interface input policer stats: {stats}")
```

Similar to the L3 sub-interface input policer, I also write a test for the L3 sub-interface output
policer. The only difference between the two is that in the output case, the policer is applied to
`pg1` in the `Dir.TX` direction, while in the input case, it's applied to `pg0` in the `Dir.RX`
direction.

I can predict the outcome. Every packet is exactly 146 bytes:
* 14 bytes of Ethernet header (src/dst MAC and ethertype) in `Ether()`
* 4 bytes VLAN tag (10) in `Dot1Q()`
* 20 bytes IPv4 header in `IP()`
* 8 bytes UDP header in `UDP()`
* 100 bytes of additional payload.

When allowing a burst of 1000 bytes, that means 6 packets should make it through (876 bytes) in the
`conform` bucket while the other 61 should be in the `violate` bucket. I won't see any packets in
the `exceed` bucket, because the policer I created is a simple one-rate, two-color `1R2C` policer
with `EB` set to 0, so every non-conforming packet goes straight to violate as there is no extra
budget in the exceed bucket. However, they are all sent, because the action was set to transmit in
all cases.
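The arithmetic behind that prediction can be double-checked with a tiny standalone model. This is a sketch, not VPP code: it assumes the packets arrive back-to-back at line rate, so token refill during the burst is negligible and simply ignored.

```python
def police_1r2c(num_pkts, pkt_bytes, cb_bytes):
    """Single-rate two-color policer model for a back-to-back burst.

    The committed bucket starts full at cb_bytes; each conforming
    packet drains pkt_bytes from it. With EB = 0, any packet that
    does not fit goes straight to violate.
    Returns (conform, violate) packet counts.
    """
    tokens = cb_bytes
    conform = violate = 0
    for _ in range(num_pkts):
        if tokens >= pkt_bytes:
            conform += 1
            tokens -= pkt_bytes
        else:
            violate += 1
    return conform, violate


print(police_1r2c(67, 146, 1000))  # → (6, 61)
```

Six packets drain 876 of the 1000 committed bytes; the remaining 124 bytes cannot fit a seventh 146-byte packet, so the other 61 violate.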
+ +``` +pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 +15:21:46,868 L3 sub-interface input policer stats: {'conform_packets': 7, 'conform_bytes': 896, + 'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 60, 'violate_bytes': 7680} +15:21:47,919 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876, + 'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906} +``` + +{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}} + +**Whoops!** So much for predicting the outcome! I see that 7 packets (896 bytes) make it through on input +while 6 packets (876 bytes) made it through on output. In the input case, the packet size is +`896/7 = 128` bytes, which is 18 bytes short. What's going on? + +### Side Quest: Policer Accounting + +On the vpp-dev mailinglist, Ben points out that the accounting will be changing when moving from +`device-input` to `ip[46]-input`, because after device-input, the packet buffer is advanced to the +L3 portion, and will start at the IPv4 or IPv6 header. Considering I was using dot1q tagged +sub-interfaces, that means I will be short exactly 18 bytes. The reason why this does not happen on +the way out, is that `ip[46]-rewrite` have both already wound back the buffer to be able to insert +the ethernet frame and encapsulation, so no adjustment is needed there. + +Ben also points out that when applying the policer to the interface, I can detect at creation time if +it's a PHY, a single-tagged or a double-tagged interface, and store some information to help correct +the accounting. We discuss a little bit on the mailinglist, and agree that it's best for all four +cases (L2 input/output and L3 intput/output) to use the full L2 frame bytes in the accounting, which +as an added benefit also that is remains backwards compatible with the `device-input` accounting. +Chapeau, Ben you're so clever! 
I add a little helper function:

```cpp
static u8 vnet_policer_compute_l2_overhead (vnet_main_t *vnm, u32 sw_if_index, vlib_dir_t dir)
{
  if (dir == VLIB_TX) return 0;

  vnet_hw_interface_t *hi = vnet_get_sup_hw_interface (vnm, sw_if_index);
  if (PREDICT_FALSE (hi->hw_class_index != ethernet_hw_interface_class.index))
    return 0; /* Not Ethernet */

  vnet_sw_interface_t *si = vnet_get_sw_interface (vnm, sw_if_index);
  if (si->type == VNET_SW_INTERFACE_TYPE_SUB) {
    if (si->sub.eth.flags.one_tag) return 18;  /* Ethernet + single VLAN */
    if (si->sub.eth.flags.two_tags) return 22; /* Ethernet + QinQ */
  }

  return 14; /* Untagged Ethernet */
}
```

And in the policer struct, I also add an `l2_overhead_by_sw_if_index[dir][sw_if_index]` table to
store these values. That way, I do not need to do this calculation for every packet in the
dataplane, but just blindly add the value I pre-computed at creation time. This is safe, because
sub-interfaces cannot change their encapsulation after being created.

In the `vnet_policer_police()` dataplane function, I add an `l2_overhead` argument, and then call it
like so:

```cpp
  u16 l2_overhead0 = (is_l2) ? 0 : pm->l2_overhead_by_sw_if_index[dir][sw_if_index0];
  act0 = vnet_policer_police (vm, b0, pi0, ..., l2_overhead0);
```

And with that, my two tests give the same results:

```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'policer stats'
15:38:39,720 L3 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:38:40,715 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```

Yaay, great success!

### Test: L2 in/output

The tests for the L2 input and output cases are not radically different.
In the setup, rather than
giving the VLAN sub-interfaces an IPv4 address, I'll just add them to a bridge-domain:

```python
    # Create VLAN sub-interfaces on pg0 and pg1
    sub_if0 = VppDot1QSubint(self, self.pg0, 30)
    sub_if0.admin_up()
    sub_if1 = VppDot1QSubint(self, self.pg1, 30)
    sub_if1.admin_up()

    # Add both sub-interfaces to bridge domain 1
    self.vapi.sw_interface_set_l2_bridge(sub_if0.sw_if_index, bd_id=1)
    self.vapi.sw_interface_set_l2_bridge(sub_if1.sw_if_index, bd_id=1)
```

This puts the sub-interfaces in L2 mode, after which the `l2-input` and `l2-output` feature bitmaps
kick in. Without further ado:

```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'L2.*policer stats'
15:50:15,217 L2 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:50:16,217 L2 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```

## What's Next

I've sent the change, which was only about 300 LOC, off for review. You can follow along on the
gerrit on [[44654](https://gerrit.fd.io/r/c/vpp/+/44654)]. I don't think the policer got much slower
after adding the L2 path, and one might argue it doesn't matter because policing didn't work on
sub-interfaces and L2 output at all before this change. However, for the L3 input/output case, and
for the PHY input case, there are a few CPU cycles added now to address the L2 and sub-int use
cases. Perhaps I should do a side-by-side comparison of packets/sec throughput on the bench some
time.

It would be great if VPP supported FQ-CoDel (FlowQueue-CoDel), an Active Queue Management (AQM)
algorithm and packet scheduler designed to eliminate bufferbloat (high latency caused by excessive
buffering in network equipment) while ensuring fair bandwidth distribution among competing traffic
flows. I know that Dave Täht - may he rest in peace - always wanted that.

For me, I've set my sights on EVPN VxLAN, and started toying with SRv6 as well. I hope that in the
spring I'll have a bit more time to contribute to VPP and write about it. Stay tuned!