--- date: "2022-02-21T09:35:14Z" title: 'Review: Cisco ASR9006/RSP440-SE' aliases: - /s/articles/2022/02/21/asr9006.html --- ## Introduction {{< image width="180px" float="right" src="/assets/asr9006/ipmax.png" alt="IP-Max" >}} If you've read up on my articles, you'll know that I have deployed a [European Ring]({{< ref "2021-02-27-network" >}}), which was reformatted late last year into [AS8298]({{< ref "2021-10-24-as8298" >}}) and upgraded to run [VPP Routers]({{< ref "2021-09-21-vpp-7" >}}) with 10G between each city. IPng Networks rents these 10G point to point virtual leased lines between each of our locations. It's a really great network, and it performs so well because it's built on an EoMPLS underlay provided by [IP-Max](https://ip-max.net/). They, in turn, run carrier grade hardware in the form of Cisco ASR9k. In part, we're such a good match together, because my choice of [VPP](https://fd.io/) on the IPng Networks routers fits very well with Fred's choice of [IOS/XR](https://en.wikipedia.org/wiki/Cisco_IOS_XR) on the IP-Max routers. And if you follow us on Twitter (I post as [@IPngNetworks](https://twitter.com/IPngNetworks/)), you may have seen a recent post where I upgraded an aging [ASR9006](https://www.cisco.com/c/en/us/support/routers/asr-9006-router/model.html) with a significantly larger [ASR9010](https://www.cisco.com/c/en/us/support/routers/asr-9010-router/model.html). The ASR9006 was initially deployed at Equinix Zurich ZH05 in Oberenstringen near Zurich, Switzerland in 2015, which is seven years ago. It has hauled countless packets from Zurich to Paris, Frankfurt and Lausanne. When it was deployed, it came with a A9K-RSP-4G route switch processor, which in 2019 was upgraded to the A9K-RSP-8G, and after so many hours^W years of runtime needed a replacement. Also, IP-Max was starting to run out of ports for the chassis, hence the upgrade. {{< image width="300px" float="left" src="/assets/asr9006/staging.png" alt="IP-Max" >}} If you're interested in the line-up, there's this epic reference guide from [Cisco Live!](https://www.cisco.com/c/en/us/td/docs/iosxr/asr9000/hardware-install/overview-reference/b-asr9k-overview-reference-guide/b-asr9k-overview-reference-guide_chapter_010.html#con_733653) that shows a deep dive of the ASR9k architecture. The chassis and power supplies can host several generations of silicon, and even mix-and-match generations. So IP-Max ordered a few new RSPs, and after deploying the ASR9010 at ZH05, we made plans to redeploy this ASR9006 at NTT Zurich in Rümlang next to the airport, to replace an even older Cisco 7600 at that location. Seeing as we have to order XFP optics (IP-Max has some DWDM/CWDM links in service at NTT), we have to park the chassis in and around Zurich. What better place to park it, than in my lab ? :-) The IPng Networks laboratory is where I do most of my work on [VPP](https://fd.io/). The rack you see to the left here holds my coveted Rhino and Hippo (two beefy AMD Ryzen 5950X machines with 100G network cards), and a few Dells that comprise my VPP lab. There was not enough room, so I gave this little fridge a place just adjacent to the rack, connected with 10x 10Gbps and serial and management ports. I immediately had a little giggle when booting up the machine. It comes with 4x 3kW power supply slots (3 are installed), and when booting the machine, I was happy that there was no debris laying on the side or back of the router, as its fans create a veritable vortex of airflow. 
Also, overnight the temperature in my basement lab + office room rose a few degrees. It's now nice and toasty in my
office, no need for the heater in the winter. Yet the machine stays quite cool at 26C intake, consuming 2.2kW _idle_,
with each of the two route switch processors (RSP440) drawing 240 Watts, each of the three 8x TenGigE blades drawing
575 Watts, and the 40x GigE blade drawing a respectable 320 Watts.

```
RP/0/RSP0/CPU0:fridge(admin)#show environment power-supply
R/S/I           Power Supply    Voltage         Current
                  (W)             (V)             (A)
0/PS0/M1/*        741.1           54.9            13.5
0/PS0/M2/*        712.4           54.8            13.0
0/PS0/M3/*        765.8           55.1            13.9
--------------
Total:   2219.3
```

For reference, Rhino and Hippo draw approximately 265W each, but they come with 4x1G, 4x10G, 2x100G and forward
~300Mpps when fully loaded. By the end of this article, I hope you'll see why this is a funny juxtaposition to me.

### Installing the ASR9006

The Cisco RSPs came to me new-in-refurbished-box. When booting, I had no idea what username/password was used for the
preinstall, and none of the standard passwords worked. So the first order of business is to take ownership of the
machine. I do this by putting both RSPs in _rommon_ (which is done by sending _Break_ after powercycling the machine --
my choice of _tio(1)_ has ***Ctrl-t b*** as the magic incantation). The first RSP (in slot 0) is then set to a
different `confreg 0x142`, while the other is kept in rommon so it doesn't boot and take over the machine. After
booting, I'm then presented with a root user setup dialog. I create a user `pim` with some temporary password, set
back the configuration register, and reload. When the RSP is about to boot, I release the standby RSP to catch up, and
voila: I'm _In like Flynn._

Wiring this up - I connect Te0/0/0/0 to IPng's office switch on port sfp-sfpplus9, and I assign the router an IPv4 and
IPv6 address. Then, I connect four TenGig ports to the lab switch, so that I can play around with loadtests a little
bit. After turning on LLDP, I can see the following physical view:

```
RP/0/RSP0/CPU0:fridge#show lldp neighbors
Sun Feb 20 19:14:21.775 UTC
Capability codes:
        (R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device
        (W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other

Device ID       Local Intf          Hold-time  Capability     Port ID
xsw1-btl        Te0/0/0/0           120        B,R            bridge/sfp-sfpplus9
fsw0            Te0/1/0/0           41         P,B,R          TenGigabitEthernet 0/9
fsw0            Te0/1/0/1           41         P,B,R          TenGigabitEthernet 0/10
fsw0            Te0/2/0/0           41         P,B,R          TenGigabitEthernet 0/7
fsw0            Te0/2/0/1           41         P,B,R          TenGigabitEthernet 0/8

Total entries displayed: 5
```

First, I decide to hook up basic connectivity behind port Te0/0/0/0. I establish OSPF and OSPFv3 adjacencies, which
gives me visibility to the route-reflectors at IPng's AS8298. Next, I also establish three IPv4 and IPv6 iBGP
sessions, so the machine enters the Default Free Zone (also, _daaayum_, that table keeps on growing, at 903K IPv4
prefixes and 143K IPv6 prefixes).
```
RP/0/RSP0/CPU0:fridge#show ip ospf neighbor
Neighbor ID     Pri   State           Dead Time   Address         Interface
194.1.163.3     1     2WAY/DROTHER    00:00:35    194.1.163.66    TenGigE0/0/0/0.101
    Neighbor is up for 00:11:14
194.1.163.4     1     FULL/BDR        00:00:38    194.1.163.67    TenGigE0/0/0/0.101
    Neighbor is up for 00:11:11
194.1.163.87    1     FULL/DR         00:00:37    194.1.163.87    TenGigE0/0/0/0.101
    Neighbor is up for 00:11:12

RP/0/RSP0/CPU0:fridge#show ospfv3 neighbor
Neighbor ID     Pri   State           Dead Time   Interface ID    Interface
194.1.163.87    1     FULL/DR         00:00:35    2               TenGigE0/0/0/0.101
    Neighbor is up for 00:12:14
194.1.163.3     1     2WAY/DROTHER    00:00:33    16              TenGigE0/0/0/0.101
    Neighbor is up for 00:12:16
194.1.163.4     1     FULL/BDR        00:00:36    20              TenGigE0/0/0/0.101
    Neighbor is up for 00:12:12

RP/0/RSP0/CPU0:fridge#show bgp ipv4 uni sum
Process       RcvTblVer   bRIB/RIB   LabelVer  ImportVer  SendTblVer  StandbyVer
Speaker          915517     915517     915517     915517      915517      915517

Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
194.1.163.87      0  8298  172514       9   915517    0    0  00:04:47     903406
194.1.163.140     0  8298  171853       9   915517    0    0  00:04:56     903406
194.1.163.148     0  8298  176244       9   915517    0    0  00:04:49     903406

RP/0/RSP0/CPU0:fridge#show bgp ipv6 uni sum
Process       RcvTblVer   bRIB/RIB   LabelVer  ImportVer  SendTblVer  StandbyVer
Speaker          151597     151597     151597     151597      151597      151597

Neighbor             Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
2001:678:d78:3::87     0  8298   54763      10   151597    0    0  00:05:19     142542
2001:678:d78:6::140    0  8298   51350      10   151597    0    0  00:05:23     142542
2001:678:d78:7::148    0  8298   54572      10   151597    0    0  00:05:25     142542
```

One of the acceptance tests for new hardware at AS25091 IP-Max is to let it take a full table, to verify that memory
is present, accounted for, and working. These route switch processor boards come with 12GB of ECC memory, and can keep
up with routing table growth for a while to come. If/when they are at the end of their useful life, they will be
replaced with A9K-RSP880s, which will also give us access to 40G, 100G and 24x10G SFP+ line cards. At that point, the
upgrade path is much easier, as the chassis will already be installed. It's a matter of popping in new RSPs and
replacing the line cards one by one.

## Loadtesting the ASR9006/RSP440-SE

Now that this router has some basic connectivity, I'll do something that I always wanted to do: loadtest an ASR9k! I
have mad amounts of respect for Cisco's ASR9k series, but as we'll soon see, their stability, not their raw
performance, is their most redeeming quality. Nowadays, many flashy 100G machines are around, which do indeed have the
performance, but not the stability! I've seen routers with an uptime of 7 years, and BGP sessions and OSPF adjacencies
with an uptime of 5 years+. It's just ... I've not seen that type of stability beyond Cisco and maybe Juniper. So if
you want _Rock Solid Internet_, this is definitely the way to go.

I have written a word or two on how VPP (an open source dataplane very similar to these industrial machines) works. A
great example is my recent [VPP VLAN Gymnastics]({{< ref "2022-02-14-vpp-vlan-gym" >}}) article. There's a lot I can
learn from comparing the performance between VPP and the Cisco ASR9k, so I will focus on the following set of
practical questions:

1.  See if unidirectional versus bidirectional traffic impacts performance.
1.  See if there is a performance penalty of using _Bundle-Ether_ (LACP controlled link aggregation).
1.  Of course, replay my standard issue 1514b large packets, internet mix (_imix_) packets, small 64b packets from
    random source/destination addresses (ie.
    multiple flows); and finally the killer test of small 64b packets from a static source/destination address
    (ie. single flow).

This is in total 2 (uni/bi) x 2 (lag/plain) x 4 (packet mix) or 16 loadtest runs, for three forwarding types ...

1.  See performance of L2VPN (Point-to-Point), similar to what VPP would call "l2 xconnect". I'll create an L2
    crossconnect between port Te0/1/0/0 and Te0/2/0/0; this is the simplest form computationally: it forwards any
    frame received on the first interface directly out on the second interface.
1.  Take a look at performance of L2VPN (Bridge Domain), what VPP would call "bridge-domain". I'll create a Bridge
    Domain between port Te0/1/0/0 and Te0/2/0/0; this includes layer2 learning and FIB, and can tie together any
    number of interfaces into a layer2 broadcast domain.
1.  And of course, tablestakes, see performance of IPv4 forwarding, with Te0/1/0/0 as 100.64.0.1/30 and Te0/2/0/0 as
    100.64.1.1/30, and setting static routes for 48.0.0.0/8 and 16.0.0.0/8 back to the loadtester.

... making a grand total of 48 loadtests. I have my work cut out for me!

So I boot up Rhino, which has a Mellanox ConnectX5-Ex (PCIe v4.0 x16) network card sporting two 100G interfaces, and
it can easily keep up with this 2x10G single interface, and 2x20G LAG, even with 64 byte packets. I am continually
amazed that a full line rate loadtest of small 64 byte packets at a rate of 40Gbps boils down to 59.52Mpps!

For each loadtest, I ramp up the traffic using a [T-Rex loadtester]({{< ref "2021-02-27-coloclue-loadtest" >}}) script
that I wrote. It starts with a low-pps warmup of 30s, then it ramps up from 0% to a certain line rate (in this case,
10Gbps of L1 for the single TenGig tests, or 20Gbps of L1 for the LACP tests) over a rampup duration of 120s, and
finally it holds for a duration of 30s. The following sections describe the methodology and the configuration
statements on the ASR9k, with a quick table of results per test, and a longer set of thoughts all the way at the
bottom of this document. So I encourage you not to skip ahead. Instead, read on and learn a bit (as I did!) from the
configuration itself.

**The question to answer**: Can this beasty mini-fridge sustain line rate? Let's go take a look!

## Test 1 - 2x 10G

In this test, I configure a very simple physical environment (this is a good time to take another look at the LLDP
table above). The Cisco is connected with 4x 10G to the switch, Rhino and Hippo are connected with 2x 100G to the
switch, and I have a Dell connected as well with 2x 10G to the switch (this can be very useful to take a look at
what's going on on the wire). The switch is an FS S5860-48SC (with 48x 10G SFP+ ports and 8x 100G QSFP ports), which
is a piece of kit that I highly recommend, by the way. Its configuration:

```
interface TenGigabitEthernet 0/1
 description Infra: Dell R720xd hvn0:enp5s0f0
 no switchport
 mtu 9216
!
interface TenGigabitEthernet 0/2
 description Infra: Dell R720xd hvn0:enp5s0f1
 no switchport
 mtu 9216
!
interface TenGigabitEthernet 0/7
 description Cust: Fridge Te0/2/0/0
 mtu 9216
 switchport access vlan 20
!
interface TenGigabitEthernet 0/9
 description Cust: Fridge Te0/1/0/0
 mtu 9216
 switchport access vlan 10
!
interface HundredGigabitEthernet 0/53
 description Cust: Rhino HundredGigabitEthernet15/0/1
 mtu 9216
 switchport access vlan 10
!
interface HundredGigabitEthernet 0/54
 description Cust: Rhino HundredGigabitEthernet15/0/0
 mtu 9216
 switchport access vlan 20
!
monitor session 1 destination interface TenGigabitEthernet 0/1
monitor session 1 source vlan 10 rx
monitor session 2 destination interface TenGigabitEthernet 0/2
monitor session 2 source vlan 20 rx
```

What this does is connect Rhino's Hu15/0/1 and Fridge's Te0/1/0/0 in VLAN 10, and send a read-only copy of all traffic
to the Dell's enp5s0f0 interface. Similarly, it connects Rhino's Hu15/0/0 and Fridge's Te0/2/0/0 in VLAN 20, with a
copy of the traffic going to the Dell's enp5s0f1 interface. I can now run `tcpdump` on the Dell to see what's going
back and forth.

In case you're curious: the monitor ports Te0/1 and Te0/2 will saturate if both machines are transmitting at a
combined rate of over 10Gbps. If this is the case, the traffic that doesn't fit is simply dropped from the monitor
port, but it's of course forwarded correctly between the original Hu0/53 and Te0/9 ports. In other words: the monitor
session has no performance penalty. It's merely a convenience to be able to take a look on ports where `tcpdump` is
not easily available (ie. both VPP as well as the ASR9k in this case!)

### Test 1.1: 10G L2 Cross Connect

A simple matter of virtually patching one interface into the other: I choose the first port on blade 1 and 2, and tie
them together in a `p2p` cross connect. In my [VLAN Gymnastics]({{< ref "2022-02-14-vpp-vlan-gym" >}}) post, I called
this an `l2 xconnect`, and although the configuration statements are a bit different, the purpose and expected
semantics are identical:

```
interface TenGigE0/1/0/0
 l2transport
 !
!
interface TenGigE0/2/0/0
 l2transport
 !
!
l2vpn
 xconnect group loadtest
  p2p xc01
   interface TenGigE0/1/0/0
   interface TenGigE0/2/0/0
  !
 !
```

The results of this loadtest look promising - although I can already see that the port will not sustain line rate at
64 byte packets, which I find somewhat surprising. Both when using multiple flows (ie. random source and destination
IP addresses), as well as when using a single flow (repeating the same src/dst packet), the machine tops out at around
20 Mpps, which is 68% of line rate (29.76 Mpps). Fascinating!

Loadtest   | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b      | 810 kpps   | 9.94 Gbps    | 1.61 Mpps   | 19.77 Gbps
imix       | 3.25 Mpps  | 9.94 Gbps    | 6.46 Mpps   | 19.78 Gbps
64b Multi  | 14.66 Mpps | 9.86 Gbps    | 20.3 Mpps   | 13.64 Gbps
64b Single | 14.28 Mpps | 9.60 Gbps    | 20.3 Mpps   | 13.62 Gbps

### Test 1.2: 10G L2 Bridge Domain

I then keep the two physical interfaces in `l2transport` mode, but change the type of l2vpn into a `bridge-domain`,
which I described in my [VLAN Gymnastics]({{< ref "2022-02-14-vpp-vlan-gym" >}}) post as well. VPP and Cisco IOS/XR
semantics look very similar indeed; they differ really only in the way in which the configuration is expressed:

```
interface TenGigE0/1/0/0
 l2transport
 !
!
interface TenGigE0/2/0/0
 l2transport
 !
!
l2vpn
 xconnect group loadtest
 !
 bridge group loadtest
  bridge-domain bd01
   interface TenGigE0/1/0/0
   !
   interface TenGigE0/2/0/0
   !
  !
 !
!
```

Here, I find that performance in one direction is line rate, and with 64b packets ever so slightly better than the L2
crossconnect test above. In both directions though, the router struggles to obtain line rate with small packets,
delivering 64% (or 19.0 Mpps) of the total offered 29.76 Mpps back to the loadtester.

Loadtest   | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b      | 807 kpps   | 9.91 Gbps    | 1.63 Mpps   | 19.96 Gbps
imix       | 3.24 Mpps  | 9.92 Gbps    | 6.47 Mpps   | 19.81 Gbps
64b Multi  | 14.82 Mpps | 9.96 Gbps    | 19.0 Mpps   | 12.79 Gbps
64b Single | 14.86 Mpps | 9.98 Gbps    | 19.0 Mpps   | 12.81 Gbps

I would say that in practice, the performance of a bridge-domain is comparable to that of an L2XC.

### Test 1.3: 10G L3 IPv4 Routing

This is the most straightforward test: the T-Rex loadtester in this case is sourcing traffic from 100.64.0.2 on its
first interface, and 100.64.1.2 on its second interface. It will send ARP for the nexthop (100.64.0.1 and 100.64.1.1,
the Cisco), but the Cisco will not maintain an ARP entry for the loadtester, so I have to add static ARP entries for
it. Otherwise, this is a simple test, which stress tests the IPv4 forwarding path:

```
interface TenGigE0/1/0/0
 ipv4 address 100.64.0.1 255.255.255.252
!
interface TenGigE0/2/0/0
 ipv4 address 100.64.1.1 255.255.255.252
!
router static
 address-family ipv4 unicast
  16.0.0.0/8 100.64.1.2
  48.0.0.0/8 100.64.0.2
 !
!
arp vrf default 100.64.0.2 043f.72c3.d048 ARPA
arp vrf default 100.64.1.2 043f.72c3.d049 ARPA
!
```

Alright, so the cracks definitely show on this loadtest. The performance of small routed packets is quite poor,
weighing in at 35% of line rate in the unidirectional test, and 43% in the bidirectional test. It seems that the ASR9k
(at least in this hardware profile of `l3xl`) is not happy forwarding this traffic at line rate, and the routing
performance is indeed significantly lower than the L2VPN performance. That's good to know!

Loadtest   | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b      | 815 kpps   | 10.0 Gbps    | 1.63 Mpps   | 19.98 Gbps
imix       | 3.27 Mpps  | 9.99 Gbps    | 6.52 Mpps   | 19.96 Gbps
64b Multi  | 5.14 Mpps  | 3.45 Gbps    | 12.3 Mpps   | 8.28 Gbps
64b Single | 5.25 Mpps  | 3.53 Gbps    | 12.6 Mpps   | 8.51 Gbps

## Test 2 - LACP 2x 20G

Link aggregation ([ref](https://en.wikipedia.org/wiki/Link_aggregation)) means combining or aggregating multiple
network connections in parallel by any of several methods, in order to increase throughput beyond what a single
connection could sustain, to provide redundancy in case one of the links should fail, or both. A link aggregation
group (LAG) is the combined collection of physical ports. Other umbrella terms used to describe the concept include
_trunking_, _bundling_, _bonding_, _channeling_ or _teaming_.

Bundling ports together on a Cisco IOS/XR platform like the ASR9k can be done by creating a _Bundle-Ether_ or _BE_.
For reference, the same concept on VPP is called a _BondEthernet_, and in Linux it'll often be referred to as simply a
_bond_. They all refer to the same concept.

One thing that immediately comes to mind when thinking about LAGs is: how will the member port be selected for
outgoing traffic? A sensible approach would be to hash on the L2 source and/or destination address (ie. the ethernet
hosts on either side of the LAG), but in the case of a router, as in our loadtest here, there is only one MAC address
on either side of the LAG. So a different hashing algorithm has to be chosen, preferably over the source and/or
destination _L3_ (IPv4 or IPv6) addresses. Luckily, both the FS switch as well as the Cisco ASR9006 support this.
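To make that member selection a bit more tangible, here's a tiny sketch in Python (my own illustration; neither the FS
switch nor the ASR9k publishes its exact hash function, so any deterministic hash over the src/dst tuple serves the
argument) of how a src-dst-ip load-balancing policy maps flows onto LAG members. The takeaway: a single flow always
yields the same hash, and therefore always the same member port.

```python
# A toy model of 'src-dst-ip' load balancing over a 2-member LAG.
# NOT the vendor's real algorithm; the member names are just labels.
import ipaddress
import zlib

def lag_member(src, dst, members):
    """Pick the egress member based on a hash of the src/dst IPv4 addresses."""
    key = int(ipaddress.ip_address(src)) ^ int(ipaddress.ip_address(dst))
    return members[zlib.crc32(key.to_bytes(4, "big")) % len(members)]

members = ["Te0/1/0/0", "Te0/1/0/1"]

# Different src/dst tuples (multi-flow traffic) can land on either member ...
for i in range(4):
    print(lag_member(f"16.0.0.{i}", "48.0.0.1", members))

# ... but a single flow always hashes to the same member, so it can never
# use more than one 10G port of the 20G bundle.
for _ in range(3):
    print(lag_member("16.0.0.1", "48.0.0.1", members))
```

Keep this in mind for the single-flow loadtests below: no matter how clever the hash, one flow never spreads over more
than one member.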
First I'll reconfigure the switch, and then reconfigure the router to use the newly created 2x 20G LAG ports.

```
interface TenGigabitEthernet 0/7
 description Cust: Fridge Te0/2/0/0
 port-group 2 mode active
!
interface TenGigabitEthernet 0/8
 description Cust: Fridge Te0/2/0/1
 port-group 2 mode active
!
interface TenGigabitEthernet 0/9
 description Cust: Fridge Te0/1/0/0
 port-group 1 mode active
!
interface TenGigabitEthernet 0/10
 description Cust: Fridge Te0/1/0/1
 port-group 1 mode active
!
interface AggregatePort 1
 mtu 9216
 aggregateport load-balance dst-ip
 switchport access vlan 10
!
interface AggregatePort 2
 mtu 9216
 aggregateport load-balance dst-ip
 switchport access vlan 20
!
```

And after the Cisco is converted to use _Bundle-Ether_ as well, the link status looks like this:

```
fsw0#show int ag1
...
 Aggregate Port Informations:
        Aggregate Number: 1
        Name: "AggregatePort 1"
        Members: (count=2)
        Lower Limit: 1
        TenGigabitEthernet 0/9            Link Status: Up    Lacp Status: bndl
        TenGigabitEthernet 0/10           Link Status: Up    Lacp Status: bndl
        Load Balance by: Destination IP

fsw0#show int usage up
Interface                        Bandwidth   Average Usage    Output Usage     Input Usage
-------------------------------- ----------- ---------------- ---------------- ----------------
TenGigabitEthernet 0/1           10000 Mbit  0.0000018300%    0.0000013100%    0.0000023500%
TenGigabitEthernet 0/2           10000 Mbit  0.0000003450%    0.0000004700%    0.0000002200%
TenGigabitEthernet 0/7           10000 Mbit  0.0000012350%    0.0000022900%    0.0000001800%
TenGigabitEthernet 0/8           10000 Mbit  0.0000011450%    0.0000021800%    0.0000001100%
TenGigabitEthernet 0/9           10000 Mbit  0.0000011350%    0.0000022300%    0.0000000400%
TenGigabitEthernet 0/10          10000 Mbit  0.0000016700%    0.0000022500%    0.0000010900%
HundredGigabitEthernet 0/53      100000 Mbit 0.00000011900%   0.00000023800%   0.00000000000%
HundredGigabitEthernet 0/54      100000 Mbit 0.00000012500%   0.00000025000%   0.00000000000%
AggregatePort 1                  20000 Mbit  0.0000014600%    0.0000023400%    0.0000005799%
AggregatePort 2                  20000 Mbit  0.0000019575%    0.0000023950%    0.0000015200%
```

It's clear that both `AggregatePort` interfaces have 20Gbps of capacity and are using an L3 loadbalancing policy. Cool
beans!

If you recall my loadtest theory in, for example, my [Netgate 6100 review]({{< ref "2021-11-26-netgate-6100" >}}), it
can sometimes be useful to run a single-flow loadtest, in which the source and destination IP:port stay the same. As
I'll demonstrate, this is not only relevant for PC based routers like the ones built on VPP, it can also be very
relevant for silicon based vendor hardware and high-end routers!

### Test 2.1 - 2x 20G LAG L2 Cross Connect

I scratched my head a little while (and with a little while I mean more like an hour or so!), because usually I come
across _Bundle-Ether_ interfaces which have hashing turned on in the interface stanza, but in my first loadtest run I
did not see any traffic on the second member port. I then found out that for this L2VPN use case I need the `l2vpn
load-balancing flow src-dst-ip` setting applied, rather than the interface setting:

```
interface Bundle-Ether1
 description LAG1
 l2transport
 !
!
interface TenGigE0/1/0/0
 bundle id 1 mode active
!
interface TenGigE0/1/0/1
 bundle id 1 mode active
!
interface Bundle-Ether2
 description LAG2
 l2transport
 !
!
interface TenGigE0/2/0/0
 bundle id 2 mode active
!
interface TenGigE0/2/0/1
 bundle id 2 mode active
!
l2vpn
 load-balancing flow src-dst-ip
 xconnect group loadtest
  p2p xc01
   interface Bundle-Ether1
   interface Bundle-Ether2
  !
 !
!
```

Overall, the router performs as well as can be expected.
In the single-flow 64 byte test, however, the hashing over the available LAG members is done on L3 information, so the
router is forced to always choose the same member and effectively performs at 10G throughput; it'll get a pass from me
on the 64b single test. In the multi-flow test, I can see that it does indeed forward over both LAG members, however
it reaches only 34.9Mpps, which is 59% of line rate.

Loadtest   | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b      | 1.61 Mpps  | 19.8 Gbps    | 3.23 Mpps   | 39.64 Gbps
imix       | 6.40 Mpps  | 19.8 Gbps    | 12.8 Mpps   | 39.53 Gbps
64b Multi  | 29.44 Mpps | 19.8 Gbps    | 34.9 Mpps   | 23.48 Gbps
64b Single | 14.86 Mpps | 9.99 Gbps    | 29.8 Mpps   | 20.0 Gbps

### Test 2.2 - 2x 20G LAG Bridge Domain

Just like with Test 1.2 above, I can now transform this service from a Cross Connect into a fully formed L2 bridge, by
simply putting the two _Bundle-Ether_ interfaces in a _bridge-domain_ together, again being careful to apply the L3
load-balancing policy at the `l2vpn` scope rather than the `interface` scope:

```
l2vpn
 load-balancing flow src-dst-ip
 no xconnect group loadtest
 bridge group loadtest
  bridge-domain bd01
   interface Bundle-Ether1
   !
   interface Bundle-Ether2
   !
  !
 !
!
```

The results for this test show that L2XC is indeed computationally cheaper than _bridge-domain_ work. With imix and
1514b packets, the router is fine and forwards 20G and 40G respectively. When the bridge is slammed with 64 byte
packets, its performance reaches only 65% with multiple flows in the unidirectional loadtest, and 47% in the
bidirectional loadtest. I found the performance difference with the L2 crossconnect above remarkable. The single-flow
loadtest cannot meaningfully stress both members of the LAG due to the src/dst being identical: the best I can expect
here is 10G performance, regardless of how many LAG members there are.

Loadtest   | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b      | 1.61 Mpps  | 19.8 Gbps    | 3.22 Mpps   | 39.56 Gbps
imix       | 6.39 Mpps  | 19.8 Gbps    | 12.8 Mpps   | 39.58 Gbps
64b Multi  | 20.12 Mpps | 13.5 Gbps    | 28.2 Mpps   | 18.93 Gbps
64b Single | 9.49 Mpps  | 6.38 Gbps    | 19.0 Mpps   | 12.78 Gbps

### Test 2.3 - 2x 20G LAG L3 IPv4 Routing

And finally I turn my attention to the usual suspect: IPv4 routing. Here, I simply remove the `l2vpn` stanza
altogether, and remember to put the load-balancing policy on the _Bundle-Ether_ interfaces. This ensures that upon
transmission, both members of the LAG are used. That is, if and only if the IP src/dst addresses differ, which is the
case in most, but not all, of my loadtests :-)

```
no l2vpn
interface Bundle-Ether1
 description LAG1
 ipv4 address 100.64.1.1 255.255.255.252
 bundle load-balancing hash src-ip
!
interface TenGigE0/1/0/0
 bundle id 1 mode active
!
interface TenGigE0/1/0/1
 bundle id 1 mode active
!
interface Bundle-Ether2
 description LAG2
 ipv4 address 100.64.0.1 255.255.255.252
 bundle load-balancing hash src-ip
!
interface TenGigE0/2/0/0
 bundle id 2 mode active
!
interface TenGigE0/2/0/1
 bundle id 2 mode active
!
```

The LAG is fine at forwarding IPv4 traffic in 1514b and imix - full line rate and 40Gbps of traffic is passed in the
bidirectional test.
With the 64b frames though, the forwarding performance is not line rate, but rather 84% of line rate in one direction,
and 76% of line rate in the bidirectional test. And once again, the single-flow loadtest cannot make use of more than
one member port in the LAG, so it will be constrained to 10G throughput -- that said, it performs at only 42.6% of
line rate.

Loadtest   | Unidirectional (pps) | L1 Unidirectional (bps) | Bidirectional (pps) | L1 Bidirectional (bps)
---------- | ---------- | ------------ | ----------- | ------------
1514b      | 1.63 Mpps  | 20.0 Gbps    | 3.25 Mpps   | 39.92 Gbps
imix       | 6.51 Mpps  | 19.9 Gbps    | 13.04 Mpps  | 39.91 Gbps
64b Multi  | 12.52 Mpps | 8.41 Gbps    | 22.49 Mpps  | 15.11 Gbps
64b Single | 6.49 Mpps  | 4.36 Gbps    | 11.62 Mpps  | 7.81 Gbps

## Bonus - ASR9k linear scaling

{{< image width="300px" float="right" src="/assets/asr9006/loaded.png" alt="ASR9k Loaded" >}}

As I've shown above, the loadtests often topped out at well under line rate for tests with small packet sizes, but I
can also see that the LAG tests offered higher performance, although not quite double that of single ports. I can't
help but wonder: is this perhaps ***a per-port limit*** rather than a router-wide limit? To answer this question, I
decide to pull out all the stops and populate the ASR9k with as many XFPs as I have in my stash, which is 9 pieces.
One (Te0/0/0/0) still goes to the uplink, because the machine should be carrying IGP and full BGP tables at all times.
That leaves me with 8x 10G XFPs, and I decide it might be nice to combine all three scenarios in one test:

1.  Test 1.1 with Te0/1/0/2 cross connected to Te0/2/0/2, with a loadtest at 20Gbps.
1.  Test 1.2 with Te0/1/0/3 in a bridge-domain with Te0/2/0/3, also with a loadtest at 20Gbps.
1.  Test 2.3 with Te0/1/0/0+Te0/2/0/0 on one end, and Te0/1/0/1+Te0/2/0/1 on the other end, with an IPv4 loadtest at
    40Gbps.

### 64 byte packets

It would be unfair to use single-flow on the LAG, considering the hashing is on L3 source and/or destination IPv4
addresses, so really only one member port would be used. To avoid this pitfall, I run with `vm=var2`. On the other two
tests, however, I do run the most stringent traffic pattern: single-flow loadtests. So off I go, firing up ***three
T-Rex*** instances.

First, the 10G L2 Cross Connect test (approximately 17.7Mpps):

```
Tx bps L2   |  7.64 Gbps  |  7.64 Gbps  | 15.27 Gbps
Tx bps L1   | 10.02 Gbps  | 10.02 Gbps  | 20.05 Gbps
Tx pps      | 14.92 Mpps  | 14.92 Mpps  | 29.83 Mpps
Line Util.  |   100.24 %  |   100.24 %  |    ---
            |             |             |
Rx bps      |  4.52 Gbps  |  4.52 Gbps  |  9.05 Gbps
Rx pps      |  8.84 Mpps  |  8.84 Mpps  | 17.67 Mpps
```

Then, the 10G Bridge Domain test (approximately 17.0Mpps):

```
Tx bps L2   |  7.61 Gbps  |  7.61 Gbps  | 15.22 Gbps
Tx bps L1   |  9.99 Gbps  |  9.99 Gbps  | 19.97 Gbps
Tx pps      | 14.86 Mpps  | 14.86 Mpps  | 29.72 Mpps
Line Util.  |    99.87 %  |    99.87 %  |    ---
            |             |             |
Rx bps      |  4.36 Gbps  |  4.36 Gbps  |  8.72 Gbps
Rx pps      |  8.51 Mpps  |  8.51 Mpps  | 17.02 Mpps
```

Finally, the 20G LAG IPv4 forwarding test (approximately 24.4Mpps), noting that the _Line Util._ here is that of the
100G loadtester ports, so 20% is expected:

```
Tx bps L2   | 15.22 Gbps  | 15.23 Gbps  | 30.45 Gbps
Tx bps L1   | 19.97 Gbps  | 19.99 Gbps  | 39.96 Gbps
Tx pps      | 29.72 Mpps  | 29.74 Mpps  | 59.46 Mpps
Line Util.  |    19.97 %  |    19.99 %  |    ---
            |             |             |
Rx bps      |  5.68 Gbps  |  6.82 Gbps  | 12.51 Gbps
Rx pps      |  11.1 Mpps  | 13.33 Mpps  | 24.43 Mpps
```

To summarize: in the above tests I am pumping 80Gbit (which is 8x 10Gbit full line rate at 64 byte packets, in other
words 119Mpps) into the machine, and it's returning 30.28Gbps (59.2Mpps), or 38% of that traffic, back to the
loadtesters. Features: yes; line rate: nope!

### 256 byte packets

Seeing the lowest performance of the router coming in at 8.5Mpps (or 57% of line rate), it stands to reason that
sending 256 byte packets will stay under the observed per-port packets/sec limits, so I decide to restart the
loadtesters with 256b packets. The expected ethernet frame is now 256 + 20 bytes of overhead, or 2208 bits, of which
~4.53Mpps fit into a 10G link. Immediately, all ports come up to full capacity. As seen from the Cisco's commandline:

```
RP/0/RSP0/CPU0:fridge#show interfaces | utility egrep 'output.*packets/sec' | exclude 0 packets
Mon Feb 21 22:14:02.250 UTC
  5 minute output rate 18390237000 bits/sec, 9075919 packets/sec
  5 minute output rate 18391127000 bits/sec, 9056714 packets/sec
  5 minute output rate 9278278000 bits/sec, 4547012 packets/sec
  5 minute output rate 9242023000 bits/sec, 4528937 packets/sec
  5 minute output rate 9287749000 bits/sec, 4563507 packets/sec
  5 minute output rate 9273688000 bits/sec, 4537368 packets/sec
  5 minute output rate 9237466000 bits/sec, 4519367 packets/sec
  5 minute output rate 9289136000 bits/sec, 4562365 packets/sec
  5 minute output rate 9290096000 bits/sec, 4554872 packets/sec
```

The first two entries there are _Bundle-Ether_ interfaces _BE1_ and _BE2_, and the other eight are the TenGigE ports.
You can see that each one is forwarding the expected 4.53Mpps, and this lines up perfectly with T-Rex, which is
sending 10Gbps of L1 and 9.28Gbps of L2 (the difference being the ethernet overhead of 20 bytes per frame, or 4.53 *
160 bits = 724Mbps), and it's receiving all of that traffic back on the other side, which is good. This clearly
confirms the hypothesis that the machine is ***per-port pps-bound***.

So the conclusion is that the A9K-RSP440-SE will typically forward maybe only 8Mpps on a single TenGigE port, and
13Mpps on a two-member LAG. However, it will do this _for every port_, and with at least 8x 10G ports saturated, it
remained fully responsive, OSPF and iBGP adjacencies stayed up, and ping times on the regular (Te0/0/0/0) uplink port
were smooth.

## Results

### 1514b and imix: OK!

{{< image width="1200px" src="/assets/asr9006/results-imix.png" alt="ASR9k Results - imix" >}}

Let me start by showing a side-by-side comparison of the imix tests in all scenarios in the graph above. The graph for
the 1514b tests looks very similar, differing only in the left Y-axis: imix is a 3.2Mpps stream, while 1514b saturates
the 10G port already at 810Kpps. But obviously the router can do this just fine; even when used on 8 ports, it doesn't
mind at all. As I later learned, any traffic mix of 256b packets or larger, which works out to at most 4.5Mpps per
port, forwards fine in any configuration.

### 64b: Not so much :)

{{< image width="1200px" src="/assets/asr9006/results-64b.png" alt="ASR9k Results - 64b" >}}
{{< image width="1200px" src="/assets/asr9006/results-lacp-64b.png" alt="ASR9k Results - LACP 64b" >}}

These graphs show the throughput of the ASR9006 with a pair of A9K-RSP440-SE route switch processors. They are rated
at 440Gbps per slot, but their packets/sec rates are significantly lower than line rate.
The top graph shows the tests with single 10G ports, and the bottom graph shows the same tests but with 2x10G ports in
a _Bundle-Ether_ LAG. In an ideal situation, each test would follow the loadtester up to completion, and there would
be no horizontal lines flattening out partway through. As I showed, some of the loadtests really performed poorly in
terms of packets/sec forwarded. The 20G LAG with a single flow can only utilize one member port (which is logical),
but it then managed to push through only 6Mpps or so. Other tests did better, but overall I must say, the results were
lower than I had expected.

### That juxtaposition

At the very top of this article I alluded to what I think is a cool juxtaposition. On the one hand, we have these
beasty ASR9k routers, running idle at 2.2kW for 24x10G and 40x1G ports (as is the case for the IP-Max router that I
took out for a spin here). They are large (10U of rackspace), heavy (40kg loaded), and expensive (who cares about list
price, the street price is easily $10'000,- apiece). On the other hand, we have these PC based machines with Vector
Packet Processing, operating as low as 19W for 2x10G, 2x1G and 4x2.5G ports (like the
[Netgate 6100]({{< ref "2021-11-26-netgate-6100" >}})) and offering roughly equal performance per port, while costing
only $700,- apiece.

The VPP machines come with ~infinite RAM: even a 16GB machine will hold much larger routing tables, including full BGP
tables and so on. There is no (need for) TCAM, and yet routing performance scales out with more CPU cores and larger
CPU instruction/data caches. Looking at my Ryzen 5950X based Hippo/Rhino VPP machines, they *can* sustain line rate
64b packets on their 10G ports, due to each CPU core being able to process around 22.3Mpps, and the machine having 15
usable CPU cores. Intel or Mellanox 100G network cards are affordable, and the whole machine with 2x100G, 4x10G and
4x1G will set me back about $3'000,-, fit in 1U, and run 265 Watts when fully loaded. See an extended rationale with
backing data in my [FOSDEM'22 talk](/media/fosdem22/index.html).

## Conclusion

I set out to answer three questions in this article, and I'm ready to opine now:

1.  Unidirectional vs bidirectional: there is an impact - bidirectional tests (stressing both ingress and egress of
    each individual router port) have lower performance, notably with packets smaller than 256b.
1.  LACP performance penalty: there is an impact - the 64b multi-flow loadtests on the LAG obtained 59%, 47% and 42%
    (for Tests 2.1-2.3), while for single ports they obtained 68%, 64% and 43% (for Tests 1.1-1.3). So while aggregate
    throughput grows with the LACP _Bundle-Ether_ ports, individual port throughput is reduced.
1.  The router forwards 1514b, imix, and really anything of 256b or larger at line rate. However, it does _not_
    sustain line rate at 64b packets. Some tests passed with a unidirectional loadtest, but all tests failed with
    bidirectional loadtests.

After all of these tests, I have to say I am ***still a huge fan*** of the ASR9k. I had kind of expected that it would
perform at line rate for any/all of my tests, but the theme became clear after a few - the ports will only forward
between 8Mpps and 11Mpps (out of the needed 14.88Mpps), but _every_ port will do that, which means the machine will
still scale up significantly in practice.
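To put some arithmetic behind that last observation (my own back-of-the-envelope math, using the usual 20 bytes of
per-frame overhead for preamble, start-of-frame delimiter and inter-frame gap), here's a small sketch of the packet
rates involved, and the frame size at which a budget of 8-11Mpps per port still amounts to full 10G line rate:

```python
# Back-of-the-envelope: 10G line rate in pps at various frame sizes, and the
# smallest frame size at which a given pps budget still fills a 10G port.
LINE_BPS = 10e9   # 10 Gbps of L1 bandwidth
OVERHEAD = 20     # preamble + SFD + inter-frame gap, in bytes per frame

def line_rate_pps(frame_bytes):
    """Packets per second needed to fill 10G at a given frame size."""
    return LINE_BPS / ((frame_bytes + OVERHEAD) * 8)

for size in (64, 256, 1514):
    print(f"{size:>5}b: {line_rate_pps(size) / 1e6:.2f} Mpps")
# 64b: 14.88 Mpps, 256b: 4.53 Mpps, 1514b: 0.81 Mpps

def breakeven_frame(pps_budget):
    """Smallest frame size (bytes) at which pps_budget equals 10G line rate."""
    return LINE_BPS / (pps_budget * 8) - OVERHEAD

for budget in (8e6, 11e6):
    print(f"{budget / 1e6:.0f} Mpps is line rate from ~{breakeven_frame(budget):.0f} byte frames")
# 8 Mpps is line rate from ~136 byte frames; 11 Mpps from ~94 byte frames
```

That lines up with what I measured: anything from roughly 256b upward sails through at line rate, while 64b (14.88Mpps
per port) is simply out of reach.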
But for business internet, colocation, and non-residential purposes, I would argue that routing _stability_ is most
important, and that when it comes to performance, _aggregate bandwidth_ matters more than pure _packets/sec_
performance. Finally, the ASR in Cisco ASR9k stands for _Advanced Services Router_, and being able to mix-and-match
MPLS, L2VPN, bridges, encapsulation, and tunneling, with an expectation of 8-10Mpps per 10G port, is absolutely
reasonable. The ASR9k is a very competent machine.

### Loadtest data

I've dropped all loadtest data [here](/assets/asr9006/asr9006-loadtest.tar.gz), and if you'd like to play around with
it, take a look at the HTML files in [this directory](/assets/asr9006/); they were built with Michal's
[trex-loadtest-viz](https://github.com/wejn/trex-loadtest-viz/) scripts.

## Acknowledgements

I wanted to give a shout-out to Fred and the crew at IP-Max for allowing me to play with their router during these
loadtests. I'll be configuring it to replace their router at NTT in March, so if you have a connection to SwissIX via
IP-Max, you will be notified of the maintenance ahead of time as we plan the maintenance window.

We call these things Fridges in the IP-Max world, because they emit so much cool air when they start :) The ASR9001 is
the microfridge, this ASR9006 is the minifridge, and the ASR9010 is the regular fridge.