---
date: "2026-02-14T11:35:14Z"
title: VPP Policers
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (Aggregation Services Router), VPP will look and feel quite familiar, as many of the
approaches are shared between the two. There are some really fantastic features in VPP, some of
which are less well known, and not always very well documented. In this article, I will describe a
unique use case in which I think VPP will excel, notably acting as a gateway for Internet Exchange
Points.

A few years ago, I toyed with the idea of using VPP as an _IXP Reseller_ concentrator, allowing
several carriers to connect with, say, 10G or 25G ports, and carry sub-customers on tagged
interfaces with safety (like MAC address ACLs) and rate limiting (say, any given customer limited
to 1Gbps on a 10G or 100G trunk), all provided by VPP. You can take a look at my
[[VPP IXP Gateway]({{< ref 2023-10-21-vpp-ixp-gateway-1 >}})] article for details. I never ended
up deploying it. In this article, I follow up and fix a few shortcomings in VPP's policer
framework.

## Introduction

Consider the following policer in VPP:

```
vpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-a GigabitEthernet10/0/1
vpp# policer output name client-a GigabitEthernet10/0/1
```

The idea is to give a _committed information rate_ of 150Mbps with a _committed burst_ size of
15MB. The _CIR_ represents the average bandwidth allowed for the interface, while the _CB_
represents the maximum amount of data (in bytes) that can be sent at line speed in a single burst
before the _CIR_ kicks in to throttle the traffic.

Back in October of 2023, I reached the conclusion that the policer works in the following modes:

* ***On input***, the policer is applied on `device-input`, which means it takes frames directly
  from the Phy. It will not work on any sub-interfaces. This explains why the policer worked on
  untagged (`Gi10/0/1`) but not on tagged (`Gi10/0/1.100`) sub-interfaces.
* ***On output***, the policer is applied on `ip4-output` and `ip6-output`, which works only for
  L3 enabled interfaces, not for L2 ones like the ones one might use in a bridge domain or an L2
  cross connect.

## VPP Infra: L2 Feature Maps

The benefit of using the `device-input` input arc is that it's efficient: every packet that comes
from the device (`Gi10/0/1`), tagged or not, will be handed off to the policer plugin. It means
any traffic (L2, L3, sub-interface, tagged, untagged) will all go through the same policer.

In `src/vnet/l2/` there are two nodes called `l2-input` and `l2-output`. I can configure VPP to
call these nodes before `ip[46]-unicast` and before `ip[46]-output`, respectively. These L2 nodes
have a feature bitmap with 32 entries. The `l2-input` / `l2-output` nodes use a bitmap walk: they
find the highest set bit, and then dispatch the packet to a pre-configured graph node. Upon
return, `feat-bitmap-next` checks the next bit, and if that one is set, dispatches the packet to
the next pre-configured graph node. This continues until all the bits are checked and packets
have been handed to their respective graph node for each bit that is set.
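The walk itself is tiny. Here's a simplified sketch of the dispatch step, illustrative only and
not VPP's actual code (the real helper lives in `src/vnet/l2/feat_bitmap.h`, and the table of
next nodes is set up at node registration time):

```cpp
#include <stdint.h>

/* Illustrative sketch: dispatch to the node that implements the highest set
 * bit. Each feature node clears its own bit before this lookup runs again,
 * so the walk proceeds from high bits to low bits. The bitmap is never zero
 * in practice, because the DROP(0) bit is always set. */
static inline uint32_t
feat_next_node (const uint32_t *next_node_by_bit, uint32_t feature_bitmap)
{
  int bit = 31 - __builtin_clz (feature_bitmap); /* highest set bit */
  return next_node_by_bit[bit];
}
```

For example, with `VTR(10)` and `XCONNECT(1)` set, the first lookup dispatches to `l2-input-vtr`;
that node clears bit 10, and the next lookup dispatches to the cross-connect output.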
To show what I can do with these nodes, let me dive into an example. When a packet arrives on an
interface configured in L2 mode, either because it's a bridge-domain member or an L2XC,
`ethernet-input` will send it to `l2-input`. This node does three things:

1. It will classify the packet, by reading the interface configuration (`l2input_main.configs`)
   for the sw_if_index, which contains the mode of the interface (`bridge-domain`, `l2xc`, or
   `bvi`). It also contains the feature bitmap: a statically configured set of features for this
   interface.
1. It will store the effective feature bitmap for each individual packet in the packet buffer.
   For bridge mode, depending on the packet being unicast or multicast, some features are
   disabled. For example, flooding is not performed for unicast packets, so those bits are
   cleared. The result is stored in a per-packet working copy that downstream nodes act on, in
   turn.
1. For each of the bits set in the packet buffer's `l2.feature_bitmap`, starting from the highest
   bit set, `l2-input` will set the next node, for example `l2-input-vtr` to do VLAN Tag
   Rewriting. Once that node is finished, it'll clear its own bit, and search for the next one
   set, in order to select a new next node.

I note that processing order is HIGH to LOW bits. By reading `l2_input.h`, I can see that the
full `l2-input` chain looks like this:

```
l2-input  → SPAN(17) → INPUT_CLASSIFY(16) → INPUT_FEAT_ARC(15) → POLICER_CLAS(14) → ACL(13) →
            VPATH(12) → L2_IP_QOS_RECORD(11) → VTR(10) → LEARN(9) → RW(8) → FWD(7) → UU_FWD(6) →
            UU_FLOOD(5) → ARP_TERM(4) → ARP_UFWD(3) → FLOOD(2) → XCONNECT(1) → DROP(0)

l2-output → XCRW(12) → OUTPUT_FEAT_ARC(11) → OUTPUT_CLASSIFY(10) → LINESTATUS_DOWN(9) →
            STP_BLOCKED(8) → IPIW(7) → EFP_FILTER(6) → L2PT(5) → ACL(4) → QOS(3) → CFM(2) →
            SPAN(1) → OUTPUT(0)
```

If none of the L2 processing nodes set the next node, ultimately `feature-bitmap-drop` gently
takes the packet behind the shed and drops it. On the way out, ultimately the last `OUTPUT` bit
sends the packet to `interface-output`, which hands off to the driver's TX node.

### Enabling L2 features

There are lots of places in VPP where L2 feature bitmaps are set/cleared. Here are a few
examples:

```
# VTR: sets L2INPUT_FEAT_VTR + configures output VTR (VLAN Tag Rewriting)
vpp# set interface l2 tag-rewrite GigE0/0/0.100 pop 1

# ACL: sets L2INPUT_FEAT_ACL / L2OUTPUT_FEAT_ACL
vpp# set interface l2 input acl intfc GigE0/0/0 ip4-table 0
vpp# set interface l2 output acl intfc GigE0/0/0 ip4-table 0

# SPAN: sets L2INPUT_FEAT_SPAN / L2OUTPUT_FEAT_SPAN
vpp# set interface span GigE0/0/0 l2 destination GigE0/0/1

# Bridge domain level (affects bd_feature_bitmap, applied to all bridge members)
vpp# set bridge-domain learn 1     # enable/disable LEARN in BD
vpp# set bridge-domain forward 1   # enable/disable FWD in BD
vpp# set bridge-domain flood 1     # enable/disable FLOOD in BD
```

I'm starting to see how these L2 feature bitmaps are super powerful, yet flexible. I'm ready to
add one!

### Creating L2 features

First, I need to insert my new `POLICER` bit in `l2_input.h` and `l2_output.h`. Then, I can call
`l2input_intf_bitmap_enable()` and its companion `l2output_intf_bitmap_enable()` to enable or
disable the L2 feature, and point it at a new graph node.
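For the first step, `l2_input.h` (and analogously `l2_output.h`) defines the features with an
X-macro list that maps each feature bit to the graph node implementing it. A sketch of the
pattern, with most entries elided and the exact placement of the new bit as per the patch:

```cpp
/* Sketch of the X-macro pattern in l2_input.h (most entries elided). The
 * POLICER entry is my addition, pointing at the new l2-policer-input node. */
#define foreach_l2input_feat          \
  _ (DROP, "feature-bitmap-drop")     \
  _ (XCONNECT, "l2-output")           \
  /* ... */                           \
  _ (POLICER, "l2-policer-input")     \
  _ (SPAN, "span-l2-input")

/* The list expands into bit positions and one-hot masks: */
typedef enum
{
#define _(sym, str) L2INPUT_FEAT_##sym##_BIT,
  foreach_l2input_feat
#undef _
    L2INPUT_N_FEAT,
} l2input_feat_t;

typedef enum
{
#define _(sym, str) L2INPUT_FEAT_##sym = (1 << L2INPUT_FEAT_##sym##_BIT),
  foreach_l2input_feat
#undef _
} l2input_feat_masks_t;
```

With the bit defined, the apply-time plumbing looks like this: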
```cpp
/* Enable policer both on L2 feature bitmap, and L3 feature arcs */
if (dir == VLIB_RX)
  {
    l2input_intf_bitmap_enable (sw_if_index, L2INPUT_FEAT_POLICER, apply);
    vnet_feature_enable_disable ("ip4-unicast", "policer-input", sw_if_index, apply, 0, 0);
    vnet_feature_enable_disable ("ip6-unicast", "policer-input", sw_if_index, apply, 0, 0);
  }
else
  {
    l2output_intf_bitmap_enable (sw_if_index, L2OUTPUT_FEAT_POLICER, apply);
    vnet_feature_enable_disable ("ip4-output", "policer-output", sw_if_index, apply, 0, 0);
    vnet_feature_enable_disable ("ip6-output", "policer-output", sw_if_index, apply, 0, 0);
  }
```

What this means is that if the interface happens to be in L2 mode, in other words when it is a
bridge-domain member or part of an L2XC, I will enable the L2 features. However, for L3 packets,
I will still proceed to enable the existing `policer-input` node by calling
`vnet_feature_enable_disable()` on the IPv4 and IPv6 input arcs. I make a mental note that MPLS
and other non-IP traffic will not be policed in this way.

### Updating Policer graph node

The policer framework has an existing dataplane node called `vnet_policer_inline()`, which I
extend to take a flag `is_l2`. Using this flag, I can select the next graph node either with
`vnet_l2_feature_next()`, or, in the pre-existing L3 case, with `vnet_feature_next()` on the
packets that move through the node. The nodes now look like this:

```cpp
VLIB_NODE_FN (policer_l2_input_node)
(vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
{
  return vnet_policer_inline (vm, node, frame, VLIB_RX, 1 /* is_l2 */);
}

VLIB_REGISTER_NODE (policer_l2_input_node) = {
  .name = "l2-policer-input",
  .vector_size = sizeof (u32),
  .format_trace = format_policer_trace,
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_errors = ARRAY_LEN (vnet_policer_error_strings),
  .error_strings = vnet_policer_error_strings,
  .n_next_nodes = VNET_POLICER_N_NEXT,
  .next_nodes = {
    [VNET_POLICER_NEXT_DROP] = "error-drop",
    [VNET_POLICER_NEXT_HANDOFF] = "policer-input-handoff",
  },
};

/* Register on IP unicast arcs for L3 routed sub-interfaces */
VNET_FEATURE_INIT (policer_ip4_unicast, static) = {
  .arc_name = "ip4-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip4-lookup"),
};

VNET_FEATURE_INIT (policer_ip6_unicast, static) = {
  .arc_name = "ip6-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip6-lookup"),
};
```

Here, I install the L3 feature before `ip[46]-lookup`, and hook up the L2 feature with a new node
that really just calls the existing code, but with `is_l2` set to true. I do something very
similar for the output direction, except there I'll hook the L3 feature into the `ip[46]-output`
arc.

## Tests!

I think writing unit and integration tests is a great idea. I add a new file
`test/test_policer_subif.py`, which tests all four new cases:

1. **L3 Input**: on a routed sub-interface
1. **L3 Output**: on a routed sub-interface
1. **L2 Input**: on a bridge-domain sub-interface
1. **L2 Output**: on a bridge-domain sub-interface

The existing `test/test_policer.py` continues to cover the pre-existing cases, and of course it's
important that my work does not break them. Lucky me, the existing tests all still pass :)

### Test: L3 in/output

The tests use a VPP feature called `packet-generator`, which creates virtual devices upon which I
can emit packets using Scapy, and use pcap to receive them.
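The scaffolding follows the usual `VppTestCase` pattern, roughly like this (the class name and
interface count here are just a sketch of my setup):

```python
from framework import VppTestCase


class TestPolicerSubif(VppTestCase):
    """Policer on sub-interfaces: L2/L3, input/output"""

    @classmethod
    def setUpClass(cls):
        super(TestPolicerSubif, cls).setUpClass()
        # Two packet-generator interfaces: pg0 (ingress) and pg1 (egress)
        cls.create_pg_interfaces(range(2))
        for i in cls.pg_interfaces:
            i.admin_up()
            i.config_ip4()
            i.resolve_arp()
```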
For the input, first I'll create the interface and apply a new policer to it:

```python
sub_if0 = VppDot1QSubint(self, self.pg0, 10)
sub_if0.admin_up()
sub_if0.config_ip4()
sub_if0.resolve_arp()

# Create policer
action_tx = PolicerAction(
    VppEnum.vl_api_sse2_qos_action_type_t.SSE2_QOS_ACTION_API_TRANSMIT, 0
)
policer = VppPolicer(
    self,
    "subif_l3_pol",
    80,
    0,
    1000,
    0,
    conform_action=action_tx,
    exceed_action=action_tx,
    violate_action=action_tx,
)
policer.add_vpp_config()

# Apply policer to sub-interface input on pg0
policer.apply_vpp_config(sub_if0.sw_if_index, Dir.RX, True)
```

The policer with name `subif_l3_pol` has a _CIR_ of 80 kbps, an _EIR_ of 0 kbps, a _CB_ of 1000
bytes, and an _EB_ of 0 bytes, and otherwise always accepts packets. I do this so that I can
eventually detect if and how many packets were seen, and how many bytes were passed, in the
conform and violate actions.

Next, I can generate a few packets and send them out from `pg0`, and wait to receive them on
`pg1`:

```python
# Send packets with VLAN tag from sub_if0 to sub_if1
pkts = []
for i in range(NUM_PKTS):  # NUM_PKTS = 67
    pkt = (
        Ether(src=self.pg0.remote_mac, dst=self.pg0.local_mac)
        / Dot1Q(vlan=10)
        / IP(src=sub_if0.remote_ip4, dst=sub_if1.remote_ip4)
        / UDP(sport=1234, dport=1234)
        / Raw(b"\xa5" * 100)
    )
    pkts.append(pkt)

# Send and verify packets are policed and forwarded
rx = self.send_and_expect(self.pg0, pkts, self.pg1)

stats = policer.get_stats()

# Verify policing happened
self.assertGreater(stats["conform_packets"], 0)
self.assertEqual(stats["exceed_packets"], 0)
self.assertGreater(stats["violate_packets"], 0)

self.logger.info(f"L3 sub-interface input policer stats: {stats}")
```

Similar to the L3 sub-interface input test, I also write one for the L3 sub-interface output
policer. The only difference between the two is that in the output case, the policer is applied
to `pg1` in the `Dir.TX` direction, while in the input case, it's applied to `pg0` in the
`Dir.RX` direction.

I can predict the outcome. Every packet is exactly 146 bytes:

* 14 bytes of Ethernet header (src/dst MAC plus ethertype) in `Ether()`
* 4 bytes VLAN tag (10) in `Dot1Q()`
* 20 bytes IPv4 header in `IP()`
* 8 bytes UDP header in `UDP()`
* 100 bytes of additional payload in `Raw()`

When allowing a burst of 1000 bytes, that means 6 packets (876 bytes) should make it through into
the `conform` bucket, while the other 61 should land in the `violate` bucket. I won't see any
packets in the `exceed` bucket, because the policer I created is a simple one-rate, two-color
(`1R2C`) policer with _EB_ set to 0, so every non-conforming packet goes straight to violate, as
there is no extra budget in the exceed bucket. However, they are all sent, because the action was
set to transmit in all cases.

```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2
15:21:46,868 L3 sub-interface input policer stats: {'conform_packets': 7, 'conform_bytes': 896, 'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 60, 'violate_bytes': 7680}
15:21:47,919 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876, 'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```

{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

**Whoops!** So much for predicting the outcome! I see that 7 packets (896 bytes) make it through
on input, while 6 packets (876 bytes) make it through on output. In the input case, the packet
size is `896/7 = 128` bytes, which is 18 bytes short. What's going on?
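Before digging in, I can sanity-check the arithmetic by rebuilding the same frame in standalone
Scapy (the addresses below are just placeholders):

```python
from scapy.layers.l2 import Ether, Dot1Q
from scapy.layers.inet import IP, UDP
from scapy.packet import Raw

pkt = (
    Ether(src="02:fe:00:00:00:01", dst="02:fe:00:00:00:02")
    / Dot1Q(vlan=10)
    / IP(src="10.0.0.2", dst="10.0.1.2")
    / UDP(sport=1234, dport=1234)
    / Raw(b"\xa5" * 100)
)
assert len(pkt) == 146      # the full L2 frame, dot1q tag included
assert len(pkt[IP]) == 128  # from the IP header on: 146 - 14 (Ether) - 4 (Dot1Q)
```

Sure enough, stripping the Ethernet header and the dot1q tag leaves 128 bytes, exactly the
per-packet size the input policer counted.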
### Side Quest: Policer Accounting

On the vpp-dev mailinglist, Ben points out that the accounting changes when moving from
`device-input` to `ip[46]-input`, because after `device-input`, the packet buffer is advanced to
the L3 portion, and will start at the IPv4 or IPv6 header. Considering I was using dot1q tagged
sub-interfaces, that means I will be short exactly 18 bytes. The reason why this does not happen
on the way out is that `ip4-rewrite` and `ip6-rewrite` have both already wound the buffer back to
be able to insert the ethernet frame and encapsulation, so no adjustment is needed there.

Ben also points out that when applying the policer to the interface, I can detect at creation
time whether it's a PHY, a single-tagged, or a double-tagged interface, and store some
information to help correct the accounting. We discuss a little bit on the mailinglist, and agree
that it's best for all four cases (L2 input/output and L3 input/output) to use the full L2 frame
bytes in the accounting, which as an added benefit also remains backwards compatible with the
`device-input` accounting. Chapeau, Ben, you're so clever! I add a little helper function:

```cpp
static u8
vnet_policer_compute_l2_overhead (vnet_main_t *vnm, u32 sw_if_index, vlib_dir_t dir)
{
  if (dir == VLIB_TX)
    return 0;

  vnet_hw_interface_t *hi = vnet_get_sup_hw_interface (vnm, sw_if_index);
  if (PREDICT_FALSE (hi->hw_class_index != ethernet_hw_interface_class.index))
    return 0; /* Not Ethernet */

  vnet_sw_interface_t *si = vnet_get_sw_interface (vnm, sw_if_index);
  if (si->type == VNET_SW_INTERFACE_TYPE_SUB)
    {
      if (si->sub.eth.flags.one_tag)
	return 18; /* Ethernet + single VLAN */
      if (si->sub.eth.flags.two_tags)
	return 22; /* Ethernet + QinQ */
    }
  return 14; /* Untagged Ethernet */
}
```

And in the policer struct, I also add an `l2_overhead_by_sw_if_index[dir][sw_if_index]` table to
store these values. That way, I do not need to do this calculation for every packet in the
dataplane, but can just blindly add the value I pre-computed at creation time. This is safe,
because sub-interfaces cannot change their encapsulation after being created.

In the `vnet_policer_police()` dataplane function, I add an `l2_overhead` argument, and then call
it like so:

```cpp
u16 l2_overhead0 = (is_l2) ? 0 : pm->l2_overhead_by_sw_if_index[dir][sw_if_index0];
act0 = vnet_policer_police (vm, b0, pi0, ..., l2_overhead0);
```

And with that, my two tests give the same results:

```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'policer stats'
15:38:39,720 L3 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876, 'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:38:40,715 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876, 'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```

Yaay, great success!
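To recap where the correction lands, here is a toy single-rate, two-color bucket (illustrative
only, not VPP's implementation): refill at _CIR_, then charge the full L2 frame length, which is
the bytes the node sees plus the pre-computed overhead.

```cpp
#include <stdint.h>

/* Toy 1R2C policer, illustrative only: refill the committed bucket at CIR,
 * then charge the full L2 frame, i.e. the packet bytes as seen by the node
 * plus the pre-computed l2_overhead (18 for dot1q L3 input, 0 for L2). */
typedef struct
{
  double cir_bytes_per_sec; /* committed information rate, in bytes/sec */
  double cb_bytes;	    /* committed burst: bucket depth, in bytes */
  double tokens;	    /* current fill, in bytes */
  double last_update;	    /* timestamp of the last refill, in seconds */
} toy_policer_t;

static int /* 1 = conform, 0 = violate (no exceed bucket in 1R2C) */
toy_police (toy_policer_t *p, uint32_t pkt_bytes, uint16_t l2_overhead, double now)
{
  double len = pkt_bytes + l2_overhead;
  p->tokens += (now - p->last_update) * p->cir_bytes_per_sec;
  if (p->tokens > p->cb_bytes)
    p->tokens = p->cb_bytes;
  p->last_update = now;
  if (p->tokens < len)
    return 0;
  p->tokens -= len;
  return 1;
}
```

With a 1000-byte bucket and back-to-back 146-byte frames, the first six conform (876 bytes) and
the seventh finds only 124 tokens left: exactly the test results above.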
### Test: L2 in/output

The tests for the L2 input and output cases are not radically different. In the setup, rather
than giving the VLAN sub-interfaces an IPv4 address, I'll just add them to a bridge-domain:

```python
# Create VLAN sub-interfaces on pg0 and pg1
sub_if0 = VppDot1QSubint(self, self.pg0, 30)
sub_if0.admin_up()
sub_if1 = VppDot1QSubint(self, self.pg1, 30)
sub_if1.admin_up()

# Add both sub-interfaces to bridge domain 1
self.vapi.sw_interface_set_l2_bridge(sub_if0.sw_if_index, bd_id=1)
self.vapi.sw_interface_set_l2_bridge(sub_if1.sw_if_index, bd_id=1)
```

This puts the sub-interfaces in L2 mode, after which the `l2-input` and `l2-output` feature
bitmaps kick in. Without further ado:

```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'L2.*policer stats'
15:50:15,217 L2 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876, 'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:50:16,217 L2 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876, 'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```

## Results

The policer now works in all sorts of cool scenarios. Let me give a concrete example, where I
create an L2XC with VTR and then apply a policer. I've written about VTR, which stands for _VLAN
Tag Rewriting_, before, in an old article lovingly called
[[VPP VLAN Gymnastics]({{< ref "2022-02-14-vpp-vlan-gym" >}})]. It all looks like this:

```
vpp# create sub Gi10/0/0 100
vpp# create sub Gi10/0/1 200
vpp# set interface l2 xconnect Gi10/0/0.100 Gi10/0/1.200
vpp# set interface l2 xconnect Gi10/0/1.200 Gi10/0/0.100
vpp# set interface l2 tag-rewrite Gi10/0/0.100 pop 1
vpp# set interface l2 tag-rewrite Gi10/0/1.200 pop 1
vpp# policer add name pol-test rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name pol-test Gi10/0/0.100
```

After applying this configuration, the input bitmap on Gi10/0/0.100 becomes
`POLICER(14) | VTR(10) | XCONNECT(1) | DROP(0)`. Packets now take the following path through the
dataplane:

```
ethernet-input
  → l2-input           (computes bitmap, dispatches to bit 14)
  → l2-policer-input   (clears bit 14, polices, dispatches to bit 10)
  → l2-input-vtr       (clears bit 10, pops 1 tag, dispatches to bit 1)
  → l2-output          (XCONNECT: sw_if_index[TX]=Gi10/0/1.200)
  → inline output VTR  (pushes 1 tag for .200)
  → interface-output
  → Gi10/0/1-tx
```

## What's Next

I've sent the change, which is only about 300 LOC, off for review. You can follow along with the
Gerrit change at [[44654](https://gerrit.fd.io/r/c/vpp/+/44654)].

I don't think the policer got much slower after adding the L2 path, and one might argue it
doesn't matter, because policing didn't work on sub-interfaces and L2 output at all before this
change. However, for the L3 input/output case, and for the PHY input case, a few CPU cycles are
now added to address the L2 and sub-interface use cases. Perhaps I should do a side-by-side
comparison of packets/sec throughput on the bench some time.

It would be great if VPP supported FQ-CoDel (FlowQueue-Controlled Delay), an algorithm and packet
scheduler designed to eliminate bufferbloat - the high latency caused by excessive buffering in
network equipment - while ensuring fair bandwidth distribution among competing traffic flows. I
know that Dave Täht - may he rest in peace - always wanted that.

For me, I've set my sights on eVPN VxLAN, and I've started toying with SRv6 L2 transport as well.
I hope that in the spring I'll have a bit more time to contribute to VPP and write about it. Stay
tuned!