---
date: "2026-02-14T11:35:14Z"
title: VPP Policers
---

{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}

# About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its
performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic
_ASR_ (Aggregation Services Router), VPP will look and feel quite familiar, as many of the approaches
are shared between the two.

There are some really fantastic features in VPP, some of which are lesser known and not always
very well documented. In this article, I will describe a unique use case in which I think VPP will
excel, notably acting as a gateway for Internet Exchange Points.

A few years ago, I toyed with the idea of using VPP as an _IXP Reseller_ concentrator, allowing
several carriers to connect with, say, 10G or 25G ports, and carry sub-customers on tagged interfaces
with safety (like MAC address ACLs) and rate limiting (say, any given customer limited to 1Gbps on a
10G or 100G trunk), all provided by VPP. You can take a look at my [[VPP IXP Gateway]({{< ref
2023-10-21-vpp-ixp-gateway-1 >}})] article for details. I never ended up deploying it.

In this article, I follow up and fix a few shortcomings in VPP's policer framework.

## Introduction

Consider the following policer in VPP:

```
vpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-a GigabitEthernet10/0/1
vpp# policer output name client-a GigabitEthernet10/0/1
```

The idea is to give a _committed information rate_ of 150Mbps with a _committed burst_ of 15MB.
The _CIR_ represents the average bandwidth allowed on the interface, while the _CB_ represents the
maximum amount of data (in bytes) that can be sent at line speed in a single burst before the _CIR_
kicks in to throttle the traffic.

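To make the _CIR_ / _CB_ interplay concrete, here's a minimal, self-contained sketch of a one-rate,
two-color token bucket. This is not VPP's implementation and all names are made up for illustration:
tokens refill at the CIR, the bucket never holds more than CB bytes, and a packet conforms only if
enough tokens are available.

```cpp
#include <stdbool.h>
#include <stdint.h>

typedef struct {
  double cir_bytes_per_sec; /* committed information rate, in bytes/second */
  double cb_bytes;          /* committed burst: maximum bucket depth, in bytes */
  double tokens;            /* current bucket fill, in bytes */
  double last_update;       /* timestamp of the last refill, in seconds */
} toy_policer_t;

/* Returns true if the packet conforms (transmit), false if it violates. */
static bool
toy_policer_conform (toy_policer_t *p, uint32_t packet_bytes, double now)
{
  /* Refill tokens at the CIR, capped at the committed burst size. */
  p->tokens += (now - p->last_update) * p->cir_bytes_per_sec;
  if (p->tokens > p->cb_bytes)
    p->tokens = p->cb_bytes;
  p->last_update = now;

  if (p->tokens >= packet_bytes)
    {
      p->tokens -= packet_bytes; /* conform: spend the tokens */
      return true;
    }
  return false; /* violate: not enough budget right now */
}
```

With a CIR of 150Mbps and a CB of 15MB, an interface that has been quiet for a while can burst 15MB
at line rate before the refill rate becomes the limit.
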
Back in October of 2023, I reached the conclusion that the policer works in the following modes:
* ***On input***, the policer is applied on `device-input`, which means it takes frames directly
from the PHY (see the sketch below for how this hook is registered). It will not work on any
sub-interfaces. This explains why the policer worked on untagged (`Gi10/0/1`) but not on tagged
(`Gi10/0/1.100`) sub-interfaces.
* ***On output***, the policer is applied on `ip4-output` and `ip6-output`, which works only for
L3-enabled interfaces, not for L2 ones like the ones one might use in a bridge domain or an L2
cross-connect.

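For reference, the input hook that causes the first limitation is a `device-input` feature. Its
registration in the upstream policer plugin looks roughly like the sketch below; treat the exact
`runs_before` constraint as an assumption on my part.

```cpp
/* Sketch: policer-input hangs off the per-device input arc, so it only sees
 * packets as they arrive on the physical interface, before any sub-interface
 * demultiplexing has happened. */
VNET_FEATURE_INIT (policer_input_node, static) = {
  .arc_name = "device-input",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ethernet-input"),
};
```
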
## VPP Infra: L2 Feature Maps

The benefit of using the `device-input` arc is that it's efficient: every packet that comes from the
device (`Gi10/0/1`), tagged or not, will be handed off to the policer plugin. It means
any traffic (L2, L3, sub-interface, tagged, untagged) will all go through the same policer.

In `src/vnet/l2/` there are two nodes called `l2-input` and `l2-output`. I can configure VPP to call
these nodes before `ip[46]-unicast` and before `ip[46]-output`, respectively. These L2 nodes have a
feature bitmap with 32 entries. The `l2-input` / `l2-output` nodes use a bitmap walk: they find the
highest set bit and dispatch the packet to a pre-configured graph node. Upon return,
`feat-bitmap-next` checks the next bit and, if that one is set, dispatches the packet to the next
pre-configured graph node. This continues until all the bits have been checked, with the packet
handed to the respective graph node for every bit that is set.

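As a mental model, the walk looks roughly like the self-contained sketch below. This is not VPP's
code (the real helper lives in `src/vnet/l2/feat_bitmap.h`), but it shows the "highest set bit picks
the next node" dispatch described above.

```cpp
#include <stdint.h>

#define N_L2_FEATURES 32

/* Dispatch table: feature bit position -> graph node index, filled at init. */
static uint32_t feat_next_node_index[N_L2_FEATURES];

/* Find the highest set bit in the remaining feature bitmap, clear it, and
 * return the graph node that should process the packet next. */
static uint32_t
l2_feat_next_sketch (uint32_t *feature_bitmap)
{
  if (*feature_bitmap == 0)
    return ~0u; /* no features left: nothing more to do */

  uint32_t bit = 31 - __builtin_clz (*feature_bitmap); /* highest set bit */
  *feature_bitmap &= ~(1u << bit);                     /* clear our own bit */
  return feat_next_node_index[bit];
}
```
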
To show what I can do with these nodes, let me dive into an example. When a packet arrives on an
interface configured in L2 mode, either because it's in a bridge-domain or an L2XC, `ethernet-input`
will send it to `l2-input`. This node does three things:

1. It will classify the packet by reading the interface configuration (`l2input_main.configs`) for
the sw_if_index, which contains the mode of the interface (`bridge-domain`, `l2xc`, or `bvi`). It
also contains the feature bitmap: a statically configured set of features for this interface.

1. It will store the effective feature bitmap for each individual packet in the packet buffer. For
bridge mode, depending on the packet being unicast or multicast, some features are disabled. For
example, flooding for unicast packets is not performed, so those bits are cleared. The result is
stored in a per-packet working copy that downstream nodes can act on, in turn (see the sketch after
this list).

1. For each of the bits set in the packet buffer's `l2.feature_bitmap`, starting from the highest bit
set, `l2-input` will set the next node, for example `l2-input-vtr` to do VLAN Tag Rewriting. Once
that node is finished, it'll clear its own bit and search for the next one that is set, in order to
select a new node.

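Steps 1 and 2, condensed into a sketch. The type and field names follow the ones mentioned above,
but this is my paraphrase rather than the literal code in `l2_input.c`, and the details of exactly
which bits get cleared are elided.

```cpp
/* Sketch of l2-input's classification for a single packet b0. */
u32 sw_if_index0 = vnet_buffer (b0)->sw_if_index[VLIB_RX];
l2_input_config_t *config = vec_elt_at_index (l2input_main.configs, sw_if_index0);

/* Start from the statically configured features for this interface... */
u32 feat_mask = config->feature_bitmap;

/* ...then prune features that don't apply to this packet; for example, a
 * known-unicast packet in a bridge does not need the flood features. */

/* Store the per-packet working copy; downstream nodes walk these bits. */
vnet_buffer (b0)->l2.feature_bitmap = feat_mask;
```
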
I note that processing order is HIGH to LOW bits. By reading `l2_input.h`, I can see that the full
`l2-input` chain looks like this:

```
l2-input
  → SPAN(17) → INPUT_CLASSIFY(16) → INPUT_FEAT_ARC(15) → POLICER_CLAS(14)
  → ACL(13) → VPATH(12) → L2_IP_QOS_RECORD(11) → VTR(10) → LEARN(9) → RW(8)
  → FWD(7) → UU_FWD(6) → UU_FLOOD(5) → ARP_TERM(4) → ARP_UFWD(3) → FLOOD(2)
  → XCONNECT(1) → DROP(0)

l2-output
  → XCRW(12) → OUTPUT_FEAT_ARC(11) → OUTPUT_CLASSIFY(10) → LINESTATUS_DOWN(9)
  → STP_BLOCKED(8) → IPIW(7) → EFP_FILTER(6) → L2PT(5) → ACL(4) → QOS(3)
  → CFM(2) → SPAN(1) → OUTPUT(0)
```

If none of the L2 processing nodes set the next node, ultimately `feature-bitmap-drop` gently takes
the packet behind the shed and drops it. On the way out, the last `OUTPUT` bit sends the
packet to `interface-output`, which hands off to the driver's TX node.

### Enabling L2 features

There are lots of places in VPP where L2 feature bitmaps are set or cleared. Here are a few examples:

```
# VTR: sets L2INPUT_FEAT_VTR + configures output VTR (VLAN Tag Rewriting)
vpp# set interface l2 tag-rewrite GigE0/0/0.100 pop 1

# ACL: sets L2INPUT_FEAT_ACL / L2OUTPUT_FEAT_ACL
vpp# set interface l2 input acl intfc GigE0/0/0 ip4-table 0
vpp# set interface l2 output acl intfc GigE0/0/0 ip4-table 0

# SPAN: sets L2INPUT_FEAT_SPAN / L2OUTPUT_FEAT_SPAN
vpp# set interface span GigE0/0/0 l2 destination GigE0/0/1

# Bridge domain level (affects bd_feature_bitmap, applied to all bridge members)
vpp# set bridge-domain learn 1       # enable/disable LEARN in BD
vpp# set bridge-domain forward 1     # enable/disable FWD in BD
vpp# set bridge-domain flood 1       # enable/disable FLOOD in BD
```

I'm starting to see how these L2 feature bitmaps are super powerful, yet flexible. I'm ready to add one!

### Creating L2 features

First, I need to insert my new `POLICER` bit in `l2_input.h` and `l2_output.h`. Then, I can call
`l2input_intf_bitmap_enable()` and its companion `l2output_intf_bitmap_enable()` to enable or
disable the L2 feature, and point it at a new graph node.

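As a sketch of that first step: each entry in the feature list in `l2_input.h` is a (bit, graph node)
pair, and its position in the list determines where the feature sits in the chain. Abbreviated, with
the existing entries elided and the exact placement left out, the new entry might look like this
(`l2_output.h` gets a matching `POLICER` entry pointing at the output node):

```cpp
/* l2_input.h (sketch): each _() entry defines an L2INPUT_FEAT_* bit and names
 * the graph node that handles it; the position in this list sets the bit, and
 * thus where the feature runs in the chain. */
#define foreach_l2input_feat       \
  _ (DROP, "feature-bitmap-drop")  \
  _ (POLICER, "l2-policer-input")
```
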
```cpp
/* Enable policer both on L2 feature bitmap, and L3 feature arcs */
if (dir == VLIB_RX) {
  l2input_intf_bitmap_enable (sw_if_index, L2INPUT_FEAT_POLICER, apply);
  vnet_feature_enable_disable ("ip4-unicast", "policer-input", sw_if_index, apply, 0, 0);
  vnet_feature_enable_disable ("ip6-unicast", "policer-input", sw_if_index, apply, 0, 0);
} else {
  l2output_intf_bitmap_enable (sw_if_index, L2OUTPUT_FEAT_POLICER, apply);
  vnet_feature_enable_disable ("ip4-output", "policer-output", sw_if_index, apply, 0, 0);
  vnet_feature_enable_disable ("ip6-output", "policer-output", sw_if_index, apply, 0, 0);
}
```

What this means is that if the interface happens to be in L2 mode, in other words when it is a
`bridge-domain` member or when it is in L2XC mode, I will enable the L2 features. However, for
L3 packets, I will still proceed to enable the existing `policer-input` node by calling
`vnet_feature_enable_disable()` on the IPv4 and IPv6 input arcs. I make a mental note that MPLS and
other non-IP traffic will not be policed in this way.

### Updating Policer graph node

The policer framework has an existing dataplane function called `vnet_policer_inline()`, which I
extend to take a flag `is_l2`. Using this flag, I select the next graph node for the packets that
move through the node with either `vnet_l2_feature_next()` or, in the pre-existing L3 case, with
`vnet_feature_next()`. The nodes now look like this:

```cpp
VLIB_NODE_FN (policer_l2_input_node)
(vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
{
  return vnet_policer_inline (vm, node, frame, VLIB_RX, 1 /* is_l2 */);
}

VLIB_REGISTER_NODE (policer_l2_input_node) = {
  .name = "l2-policer-input",
  .vector_size = sizeof (u32),
  .format_trace = format_policer_trace,
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_errors = ARRAY_LEN (vnet_policer_error_strings),
  .error_strings = vnet_policer_error_strings,
  .n_next_nodes = VNET_POLICER_N_NEXT,
  .next_nodes = {
    [VNET_POLICER_NEXT_DROP] = "error-drop",
    [VNET_POLICER_NEXT_HANDOFF] = "policer-input-handoff",
  },
};

/* Register on IP unicast arcs for L3 routed sub-interfaces */
VNET_FEATURE_INIT (policer_ip4_unicast, static) = {
  .arc_name = "ip4-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip4-lookup"),
};

VNET_FEATURE_INIT (policer_ip6_unicast, static) = {
  .arc_name = "ip6-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip6-lookup"),
};
```

Here, I install the L3 feature before `ip[46]-lookup`, and hook up the L2 feature with a new node
that really just calls the existing node but with `is_l2` set to true. I do something very similar
for the output direction, except there I'll hook the L3 feature before `ip[46]-output`.

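Inside `vnet_policer_inline()`, the per-packet next-node selection then becomes a small branch,
roughly like the sketch below. The input direction is shown; `l2_feat_next_node_index` is a name I
made up for the dispatch table, and the exact `vnet_l2_feature_next()` signature is from my reading
of `l2_input.h`, so treat it as an approximation.

```cpp
/* Sketch: pick the next node depending on how this node was reached. */
if (is_l2)
  {
    /* L2 path: clear our POLICER bit in the packet's l2.feature_bitmap and
     * jump to the node for the next set bit in the chain. */
    next0 = vnet_l2_feature_next (b0, pm->l2_feat_next_node_index,
                                  L2INPUT_FEAT_POLICER);
  }
else
  {
    /* L3 path: the ip[46]-unicast (or ip[46]-output) feature arc knows what
     * comes next. */
    vnet_feature_next (&next0, b0);
  }
```
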
## Tests!

I think writing unit and integration tests is a great idea. I add a new file
`test/test_policer_subif.py` which tests all four new cases:
1. **L3 Input**: on a routed sub-interface
1. **L3 Output**: on a routed sub-interface
1. **L2 Input**: on a bridge-domain sub-interface
1. **L2 Output**: on a bridge-domain sub-interface

The existing `test/test_policer.py` already covers the existing cases, and of course it's important
that my work does not regress those. Lucky me, the existing tests all still pass :)

### Test: L3 in/output

The tests use a VPP feature called `packet-generator`, which creates virtual devices upon which I
can emit packets using Scapy, and use pcap to receive them. For the input case, first I'll create
the sub-interface and apply a new policer to it:

```python
sub_if0 = VppDot1QSubint(self, self.pg0, 10)
sub_if0.admin_up()
sub_if0.config_ip4()
sub_if0.resolve_arp()

# Create policer
action_tx = PolicerAction(VppEnum.vl_api_sse2_qos_action_type_t.SSE2_QOS_ACTION_API_TRANSMIT, 0)
policer = VppPolicer(self, "subif_l3_pol", 80, 0, 1000, 0,
    conform_action=action_tx, exceed_action=action_tx, violate_action=action_tx,
)
policer.add_vpp_config()

# Apply policer to sub-interface input on pg0
policer.apply_vpp_config(sub_if0.sw_if_index, Dir.RX, True)
```

The policer named `subif_l3_pol` has a _CIR_ of 80kbps, an _EIR_ of 0kbps, a _CB_ of 1000 bytes, and
an _EB_ of 0 bytes, and otherwise always accepts packets. I do this so that I can eventually detect
if and how many packets were seen, and how many bytes were passed in the conform and violate actions.

Next, I can generate a few packets and send them out from `pg0`, and wait to receive them on `pg1`:

```python
# Send packets with VLAN tag from sub_if0 to sub_if1
pkts = []
for i in range(NUM_PKTS):  # NUM_PKTS = 67
    pkt = (
        Ether(src=self.pg0.remote_mac, dst=self.pg0.local_mac) / Dot1Q(vlan=10)
        / IP(src=sub_if0.remote_ip4, dst=sub_if1.remote_ip4) / UDP(sport=1234, dport=1234)
        / Raw(b"\xa5" * 100)
    )
    pkts.append(pkt)

# Send and verify packets are policed and forwarded
rx = self.send_and_expect(self.pg0, pkts, self.pg1)

stats = policer.get_stats()
# Verify policing happened
self.assertGreater(stats["conform_packets"], 0)
self.assertEqual(stats["exceed_packets"], 0)
self.assertGreater(stats["violate_packets"], 0)

self.logger.info(f"L3 sub-interface input policer stats: {stats}")
```

Similar to the L3 sub-interface input policer, I also write a test for the L3 sub-interface output
policer. The only difference between the two is that in the output case, the policer is applied to
`pg1` in the `Dir.TX` direction, while in the input case, it's applied to `pg0` in the `Dir.RX`
direction.

I can predict the outcome. Every packet is exactly 146 bytes:
* 14 bytes of Ethernet header (src/dst MAC and ethertype) in `Ether()`
* 4 bytes of VLAN tag (VLAN 10) in `Dot1Q()`
* 20 bytes of IPv4 header in `IP()`
* 8 bytes of UDP header in `UDP()`
* 100 bytes of additional payload

When allowing a burst of 1000 bytes, that means 6 packets should make it through (876 bytes) in the
`conform` bucket while the other 61 should end up in the `violate` bucket. I won't see any packets in
the `exceed` bucket, because the policer I created is a simple one-rate, two-color (`1R2C`) policer
with `EB` set to 0, so every non-conforming packet goes straight to violate as there is no extra
budget in the exceed bucket. However, they are all transmitted, because the action was set to
transmit in all cases.

```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2
15:21:46,868 L3 sub-interface input policer stats: {'conform_packets': 7, 'conform_bytes': 896,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 60, 'violate_bytes': 7680}
15:21:47,919 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```

{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||||
|
|
||||||
|
**Whoops!** So much for predicting the outcome! I see that 7 packets (896 bytes) make it through on input
|
||||||
|
while 6 packets (876 bytes) made it through on output. In the input case, the packet size is
|
||||||
|
`896/7 = 128` bytes, which is 18 bytes short. What's going on?
|
||||||
|
|
||||||
|
### Side Quest: Policer Accounting

On the vpp-dev mailing list, Ben points out that the accounting changes when moving from
`device-input` to `ip[46]-input`, because after device-input, the packet buffer is advanced to the
L3 portion and will start at the IPv4 or IPv6 header. Considering I was using dot1q tagged
sub-interfaces, that means I will be short exactly 18 bytes (14 bytes of Ethernet header plus a
4-byte VLAN tag). The reason this does not happen on the way out is that `ip[46]-rewrite` has
already wound back the buffer to be able to insert the Ethernet frame and encapsulation, so no
adjustment is needed there.

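In buffer terms, my understanding of the mechanics is sketched below: by the time the
`ip[46]-unicast` arc (and thus `policer-input`) runs, the buffer has already been advanced past the
L2 header, so a length taken from the current position onward no longer includes it.

```cpp
/* Sketch: on a dot1q sub-interface, the buffer is advanced past the 14-byte
 * Ethernet header and the 4-byte VLAN tag before the IP feature arcs run. */
vlib_buffer_advance (b0, sizeof (ethernet_header_t) + sizeof (ethernet_vlan_header_t));

/* What the policer then measures is only the L3 length onward: */
u32 l3_len = vlib_buffer_length_in_chain (vm, b0); /* 128 bytes in my test, not 146 */
```
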
Ben also points out that when applying the policer to the interface, I can detect at creation time if
it's a PHY, a single-tagged or a double-tagged interface, and store some information to help correct
the accounting. We discuss it a little bit on the mailing list, and agree that it's best for all four
cases (L2 input/output and L3 input/output) to use the full L2 frame bytes in the accounting, which,
as an added benefit, also remains backwards compatible with the `device-input` accounting.
Chapeau, Ben, you're so clever!

I add a little helper function:

```cpp
static u8
vnet_policer_compute_l2_overhead (vnet_main_t *vnm, u32 sw_if_index, vlib_dir_t dir)
{
  if (dir == VLIB_TX)
    return 0;

  vnet_hw_interface_t *hi = vnet_get_sup_hw_interface (vnm, sw_if_index);
  if (PREDICT_FALSE (hi->hw_class_index != ethernet_hw_interface_class.index))
    return 0; /* Not Ethernet */

  vnet_sw_interface_t *si = vnet_get_sw_interface (vnm, sw_if_index);
  if (si->type == VNET_SW_INTERFACE_TYPE_SUB) {
    if (si->sub.eth.flags.one_tag) return 18;  /* Ethernet + single VLAN */
    if (si->sub.eth.flags.two_tags) return 22; /* Ethernet + QinQ */
  }

  return 14; /* Untagged Ethernet */
}
```

And in the policer struct, I also add a `l2_overhead_by_sw_if_index[dir][sw_if_index]` to store
these values. That way, I do not need to do this calculation for every packet in the dataplane, but
just blindly add the value I pre-computed at creation time. This is safe, because sub-interfaces
cannot change their encapsulation after being created.

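The storage itself is nothing fancy; I picture it as a member of the policer main struct shaped
roughly like the sketch below (the exact field shape is an assumption derived from the
`pm->l2_overhead_by_sw_if_index[dir][sw_if_index0]` usage that follows).

```cpp
/* Sketch: pre-computed L2 header overhead, one vector per direction,
 * indexed by sw_if_index and filled in when the policer is applied. */
u8 *l2_overhead_by_sw_if_index[VLIB_N_RX_TX];
```
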
In the `vnet_policer_police()` dataplane function, I add an `l2_overhead` argument, and then call it
like so:

```cpp
u16 l2_overhead0 = (is_l2) ? 0 : pm->l2_overhead_by_sw_if_index[dir][sw_if_index0];
act0 = vnet_policer_police (vm, b0, pi0, ..., l2_overhead0);
```

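Inside `vnet_policer_police()`, the only change is that the byte count fed into the token bucket now
includes this overhead, roughly as in the sketch below (not the literal diff):

```cpp
/* Sketch: account for the full L2 frame rather than only the bytes from the
 * buffer's current position onward. */
u32 len = vlib_buffer_length_in_chain (vm, b) + l2_overhead;
```
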
And with that, my two tests give the same results:

```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'policer stats'
15:38:39,720 L3 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:38:40,715 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```

Yaay, great success!

### Test: L2 in/output

The tests for the L2 input and output case are not radically different. In the setup, rather than
giving the VLAN sub-interfaces an IPv4 address, I'll just add them to a bridge-domain:

```python
# Create VLAN sub-interfaces on pg0 and pg1
sub_if0 = VppDot1QSubint(self, self.pg0, 30)
sub_if0.admin_up()
sub_if1 = VppDot1QSubint(self, self.pg1, 30)
sub_if1.admin_up()

# Add both sub-interfaces to bridge domain 1
self.vapi.sw_interface_set_l2_bridge(sub_if0.sw_if_index, bd_id=1)
self.vapi.sw_interface_set_l2_bridge(sub_if1.sw_if_index, bd_id=1)
```

This puts the sub-interfaces in L2 mode, after which the `l2-input` and `l2-output` feature bitmaps
kick in. Without further ado:

```
pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'L2.*policer stats'
15:50:15,217 L2 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:50:16,217 L2 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
```

## What's Next

I've sent the change, which was only about 300 LOC, off for review. You can follow along on the
gerrit on [[44654](https://gerrit.fd.io/r/c/vpp/+/44654)]. I don't think the policer got much slower
after adding the L2 path, and one might argue it doesn't matter because policing didn't work on
sub-interfaces and L2 output at all before this change. However, for the L3 input/output case, and
for the PHY input case, there are a few CPU cycles added now to address the L2 and sub-interface use
cases. Perhaps I should do a side-by-side comparison of packets/sec throughput on the bench some
time.

It would be great if VPP were to support FQ-CoDel (Flow Queue CoDel), an Active Queue Management
(AQM) algorithm and packet scheduler designed to eliminate bufferbloat, the high latency caused by
excessive buffering in network equipment, while ensuring fair bandwidth distribution among competing
traffic flows. I know that Dave Täht - may he rest in peace - always wanted that.

As for me, I've set my sights on EVPN VxLAN, and have started toying with SRv6 as well. I hope that
in the spring I'll have a bit more time to contribute to VPP and write about it. Stay tuned!