diff --git a/content/articles/2025-05-03-containerlab-1.md b/content/articles/2025-05-03-containerlab-1.md
new file mode 100644
index 0000000..7bab914
--- /dev/null
+++ b/content/articles/2025-05-03-containerlab-1.md
@@ -0,0 +1,464 @@
---
date: "2025-05-03T15:07:23Z"
title: 'VPP in Containerlab - Part 1'
---

{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}

# Introduction

From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
However, VPP is quite virtualization-friendly. Notably, it runs really well in virtual machines
like Qemu/KVM or VMWare. I can pass through PCI devices directly to the guest, and use CPU pinning to
give the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
performs almost the same as on bare metal. But did you know that VPP can also run in Docker?

The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
container-based networking labs. It starts the containers, builds a virtual wiring between them to
create lab topologies of the user's choice, and manages the lab lifecycle.

Quite regularly I am asked 'when will you add VPP to Containerlab?', and at ZANOG I made a promise
to actually do it. Here I go, on a journey to integrate VPP into Containerlab!

## Containerized VPP

The folks at [[Tigera](https://www.tigera.io/project-calico/)] maintain a project called _Calico_,
which accelerates Kubernetes CNI (Container Network Interface) by using [[FD.io](https://fd.io)]
VPP.
Since the origins of Kubernetes are to run containers in a Docker environment, it stands to
reason that it should be possible to run a containerized VPP. I start by reading up on how they
create their Docker image, and I learn a lot.

### Docker Build

Considering IPng runs bare metal Debian (currently Bookworm) machines, my Docker image will be based
on `debian:bookworm` as well. The build starts off quite modest:

```
pim@summer:~$ mkdir -p src/vpp-containerlab
pim@summer:~/src/vpp-containerlab$ cat << 'EOF' > Dockerfile.bookworm
FROM debian:bookworm
ARG DEBIAN_FRONTEND=noninteractive
ARG VPP_INSTALL_SKIP_SYSCTL=true
ARG REPO=release
RUN apt-get update && apt-get -y install curl procps && apt-get clean

# Install VPP
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean

CMD ["/usr/bin/vpp","-c","/etc/vpp/startup.conf"]
EOF
pim@summer:~/src/vpp-containerlab$ docker build -f Dockerfile.bookworm . -t pimvanpelt/vpp-containerlab
```

One gotcha: when I install the upstream VPP Debian packages, their post-install script tries to
apply a set of `sysctl`s. However, I can't set sysctls in the container, so the build fails. I take
a look at the VPP source code and find `src/pkg/debian/vpp.postinst`, which helpfully contains a
means to skip setting the sysctls, using an environment variable called `VPP_INSTALL_SKIP_SYSCTL`.

### Running VPP in Docker

With the Docker image built, I need to tweak the VPP startup configuration a little bit, to allow it
to run well in a Docker environment. There are a few things I make note of:
1. We may not have huge pages on the host machine, so I'll set all the page sizes to the
   linux-default 4kB rather than 2MB or 1GB hugepages.
This creates a performance regression, but
   in the case of Containerlab, we're not here to build high performance stuff; rather, users
   will be doing functional testing.
1. DPDK requires either the UIO or VFIO kernel drivers, so that it can bind its so-called _poll mode
   driver_ to the network cards. It also requires huge pages. Since my first version will be
   using only virtual ethernet interfaces, I'll disable DPDK and VFIO altogether.
1. VPP can run any number of CPU worker threads. In its simplest form, I can also run it with only
   one thread. Of course, this will not be a high performance setup, but since I'm already not
   using hugepages, I'll use only 1 thread.

The VPP `startup.conf` configuration file I came up with:

```
pim@summer:~/src/vpp-containerlab$ cat << 'EOF' > clab-startup.conf
unix {
  interactive
  log /var/log/vpp/vpp.log
  full-coredump
  cli-listen /run/vpp/cli.sock
  cli-prompt vpp-clab#
  cli-no-pager
  poll-sleep-usec 100
}

api-trace {
  on
}

memory {
  main-heap-size 512M
  main-heap-page-size 4k
}

buffers {
  buffers-per-numa 16000
  default data-size 2048
  page-size 4k
}

statseg {
  size 64M
  page-size 4k
  per-node-counters on
}

plugins {
  plugin default { enable }
  plugin dpdk_plugin.so { disable }
}
EOF
```

Just a couple of notes for those who are running VPP in production. Each of the `*-page-size` config
settings takes the normal Linux page size of 4kB, which effectively keeps VPP from using any
hugepages. Then, I specifically disable the DPDK plugin, although I didn't even install it in the
Dockerfile build, as it lives in its own dedicated Debian package called `vpp-plugin-dpdk`. Finally,
I make VPP use less CPU by telling it to sleep for 100 microseconds between each poll iteration.
In production environments, VPP will use 100% of the CPUs it's assigned, but in this lab, it will
not be quite as hungry.
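That last knob is worth double-checking. Once a container is running with this config (I use the
name `clab-pim` later in this article), the effect is easy to observe from the host; a couple of
commands I'd reach for, purely as a sanity check:

```
# With poll-sleep-usec 100, the vpp process should idle at a few
# percent CPU rather than pinning a full core (watch the CPU% column):
docker stats --no-stream clab-pim

# And since every *-page-size is 4k, the host's hugepage counters
# should not budge when VPP starts:
grep -i '^hugepages' /proc/meminfo
```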
By the way, even in this sleepy mode, it'll still easily handle a gigabit
of traffic!

Now, VPP wants to run as root, and it needs a few host features, notably tuntap devices and vhost,
as well as a few capabilities, notably SYS_NICE, NET_ADMIN and SYS_PTRACE. I take a look at the
[[manpage](https://man7.org/linux/man-pages/man7/capabilities.7.html)]:
* ***CAP_SYS_NICE***: allows the process to set real-time scheduling, CPU affinity and I/O
  scheduling class, and to migrate and move memory pages.
* ***CAP_NET_ADMIN***: allows the process to perform various network-related operations like
  interface configs, routing tables, nested network namespaces, multicast, setting promiscuous
  mode, and so on.
* ***CAP_SYS_PTRACE***: allows the process to trace arbitrary processes using `ptrace(2)`, and a
  few related kernel system calls.

Being a networking dataplane implementation, VPP wants to be able to tinker with network devices.
This is not typically allowed in Docker containers, although the Docker developers did make some
concessions for those containers that need just that little bit more access. They describe it in
their
[[docs](https://docs.docker.com/engine/containers/run/#runtime-privilege-and-linux-capabilities)] as
follows:

> The --privileged flag gives all capabilities to the container. When the operator executes docker
> run --privileged, Docker enables access to all devices on the host, and reconfigures AppArmor or
> SELinux to allow the container nearly all the same access to the host as processes running outside
> containers on the host. Use this flag with caution. For more information about the --privileged
> flag, see the docker run reference.

{{< image width="4em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
At this point, I feel I should point out that running a Docker container with the `--privileged`
flag set does give it _a lot_ of privileges. A container with `--privileged` is not a securely
sandboxed process.
Containers in this mode can get a root shell on the host and take control over the system.

With that little fine-print warning out of the way, I am going to Yolo like a boss:

```
pim@summer:~/src/vpp-containerlab$ docker run --name clab-pim \
    --cap-add=NET_ADMIN --cap-add=SYS_NICE --cap-add=SYS_PTRACE \
    --device=/dev/net/tun:/dev/net/tun --device=/dev/vhost-net:/dev/vhost-net \
    --privileged -v $(pwd)/clab-startup.conf:/etc/vpp/startup.conf:ro \
    docker.io/pimvanpelt/vpp-containerlab
clab-pim
```

### Configuring VPP in Docker

And with that, the Docker container is running! I post a screenshot on
[[Mastodon](https://ublog.tech/@IPngNetworks/114392852468494211)] and my buddy John responds with a
polite but firm insistence that I explain myself. Here you go, buddy :)

In another terminal, I can play around with this VPP instance a little bit:
```
pim@summer:~$ docker exec -it clab-pim bash
root@d57c3716eee9:/# ip -br l
lo               UNKNOWN        00:00:00:00:00:00
eth0@if530566    UP             02:42:ac:11:00:02

root@d57c3716eee9:/# ps auxw
USER       PID %CPU %MEM      VSZ    RSS TTY   STAT START  TIME COMMAND
root         1  2.2  0.2 17498852 160300 ?     Rs   15:11  0:00 /usr/bin/vpp -c /etc/vpp/startup.conf
root        10  0.0  0.0     4192   3388 pts/0 Ss   15:11  0:00 bash
root        18  0.0  0.0     8104   4056 pts/0 R+   15:12  0:00 ps auxw

root@d57c3716eee9:/# vppctl
    _______    _        _   _____  ___
 __/ __/ _ \  (_)__    | | / / _ \/ _ \
 _/ _// // / / / _ \   | |/ / ___/ ___/
 /_/ /____(_)_/\___/   |___/_/  /_/

vpp-clab# show version
vpp v25.02-release built by root on d5cd2c304b7f at 2025-02-26T13:58:32
vpp-clab# show interfaces
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
local0                            0     down          0/0/0/0
```

Slick! I can see that the container has an `eth0` device, which Docker has connected to the main
bridged network. For now, there's only one process running; pid 1 proudly shows VPP (as in Docker,
the `CMD` entry simply replaces `init`).
Later on, I can imagine running a few more daemons like
SSH and so on, but for now, I'm happy.

Looking at VPP itself, it has no network interfaces yet, except for the default `local0` interface.

### Adding Interfaces in Docker

But if I don't have DPDK, how will I add interfaces? Enter `veth(4)`. From the
[[manpage](https://man7.org/linux/man-pages/man4/veth.4.html)], I learn that veth devices are
virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to
a physical network device in another namespace, but can also be used as standalone network devices.
veth devices are always created in interconnected pairs.

Of course, Docker users will recognize this. It's the bread and butter of how containers
communicate with one another - and with the host they're running on. I can simply create a Docker
network and connect the running container to it, like so:

```
pim@summer:~$ docker network create --driver=bridge clab-network \
    --subnet 192.0.2.0/24 --ipv6 --subnet 2001:db8::/64
5711b95c6c32ac0ed185a54f39e5af4b499677171ff3d00f99497034e09320d2
pim@summer:~$ docker network connect clab-network clab-pim --ip '' --ip6 ''
```

The first command here creates a new network called `clab-network` in Docker. As a result, a new
bridge called `br-5711b95c6c32` shows up on the host; the bridge name is derived from the UUID of
the Docker object. Seeing as I added an IPv4 and IPv6 subnet to the bridge, it gets configured with
the first address in both:

```
pim@summer:~/src/vpp-containerlab$ brctl show br-5711b95c6c32
bridge name         bridge id           STP enabled   interfaces
br-5711b95c6c32     8000.0242099728c6   no            veth021e363

pim@summer:~/src/vpp-containerlab$ ip -br a show dev br-5711b95c6c32
br-5711b95c6c32  UP  192.0.2.1/24 2001:db8::1/64 fe80::42:9ff:fe97:28c6/64 fe80::1/64
```

The second command creates a `veth` pair and puts one half of it in the bridge; this interface
is called `veth021e363` above.
The other half of it pops up as `eth1` in the Docker container:

```
pim@summer:~/src/vpp-containerlab$ docker exec -it clab-pim bash
root@d57c3716eee9:/# ip -br l
lo               UNKNOWN        00:00:00:00:00:00
eth0@if530566    UP             02:42:ac:11:00:02
eth1@if530577    UP             02:42:c0:00:02:02
```

One of the many awesome features of VPP is its ability to attach to these `veth` devices by means of
its `af-packet` driver, reusing the same MAC address (in this case `02:42:c0:00:02:02`). I first
take a look at the Linux [[manpage](https://man7.org/linux/man-pages/man7/packet.7.html)] for it,
and then read up on the VPP
[[documentation](https://fd.io/docs/vpp/v2101/gettingstarted/progressivevpp/interface)] on the
topic.

However, my attention is drawn to Docker assigning an IPv4 and IPv6 address to the container:
```
root@d57c3716eee9:/# ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128
eth0@if530566    UP             172.17.0.2/16
eth1@if530577    UP             192.0.2.2/24 2001:db8::2/64 fe80::42:c0ff:fe00:202/64
root@d57c3716eee9:/# ip addr del 192.0.2.2/24 dev eth1
root@d57c3716eee9:/# ip addr del 2001:db8::2/64 dev eth1
```

I decide to remove them, as in the end, `eth1` will be owned by VPP, so _it_ should be
setting the IPv4 and IPv6 addresses. For the life of me, I don't see how I can keep Docker from
assigning IPv4 and IPv6 addresses to this container ... and the
[[docs](https://docs.docker.com/engine/network/)] seem to be off as well: they suggest I can pass
a flag `--ipv4=False`, but that flag doesn't exist, at least not in my Bookworm Docker variant. I
make a mental note to discuss this with the folks in the Containerlab community.
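In the meantime, a workaround is to clean the addresses off inside the container before handing
`eth1` to VPP. The two `ip addr del` commands above do that by hand; a flush is shorter. Here's a
minimal sketch of such a pre-flight snippet - the `addr_gen_mode` sysctl is my own addition, to
stop the kernel from regenerating an IPv6 link-local address on the interface:

```
# Remove all addresses Docker (and the kernel) assigned to eth1;
# VPP's af-packet interface will own the addressing from here on.
ip addr flush dev eth1

# Tell the kernel not to auto-generate an IPv6 link-local/SLAAC
# address on this interface (addr_gen_mode 1 means 'none').
sysctl -w net.ipv6.conf.eth1.addr_gen_mode=1
```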
Anyway, armed with this knowledge I can bind the container-side half of the `veth` pair, called
`eth1`, to VPP, like so:

```
root@d57c3716eee9:/# vppctl
    _______    _        _   _____  ___
 __/ __/ _ \  (_)__    | | / / _ \/ _ \
 _/ _// // / / / _ \   | |/ / ___/ ___/
 /_/ /____(_)_/\___/   |___/_/  /_/

vpp-clab# create host-interface name eth1 hw-addr 02:42:c0:00:02:02
vpp-clab# set interface name host-eth1 eth1
vpp-clab# set interface mtu 1500 eth1
vpp-clab# set interface ip address eth1 192.0.2.2/24
vpp-clab# set interface ip address eth1 2001:db8::2/64
vpp-clab# set interface state eth1 up
vpp-clab# show int addr
eth1 (up):
  L3 192.0.2.2/24
  L3 2001:db8::2/64
local0 (dn):
```

## Results

After all this work, I've successfully created a Docker image based on Debian Bookworm and VPP 25.02
(the current stable release), started a container with it, and added a network bridge in Docker
which connects the host `summer` to the container. Proof, as they say, is in the ping-pudding:

```
pim@summer:~/src/vpp-containerlab$ ping -c5 2001:db8::2
PING 2001:db8::2(2001:db8::2) 56 data bytes
64 bytes from 2001:db8::2: icmp_seq=1 ttl=64 time=0.113 ms
64 bytes from 2001:db8::2: icmp_seq=2 ttl=64 time=0.056 ms
64 bytes from 2001:db8::2: icmp_seq=3 ttl=64 time=0.202 ms
64 bytes from 2001:db8::2: icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from 2001:db8::2: icmp_seq=5 ttl=64 time=0.100 ms

--- 2001:db8::2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4098ms
rtt min/avg/max/mdev = 0.056/0.114/0.202/0.047 ms
pim@summer:~/src/vpp-containerlab$ ping -c5 192.0.2.2
PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data.
+64 bytes from 192.0.2.2: icmp_seq=1 ttl=64 time=0.043 ms +64 bytes from 192.0.2.2: icmp_seq=2 ttl=64 time=0.032 ms +64 bytes from 192.0.2.2: icmp_seq=3 ttl=64 time=0.019 ms +64 bytes from 192.0.2.2: icmp_seq=4 ttl=64 time=0.041 ms +64 bytes from 192.0.2.2: icmp_seq=5 ttl=64 time=0.027 ms + +--- 192.0.2.2 ping statistics --- +5 packets transmitted, 5 received, 0% packet loss, time 4063ms +rtt min/avg/max/mdev = 0.019/0.032/0.043/0.008 ms +``` + +And in case that simple ping-test wasn't enough to get you excited, here's a packet trace from VPP +itself, while I'm performing this ping: + +``` +vpp-clab# trace add af-packet-input 100 +vpp-clab# wait 3 +vpp-clab# show trace +------------------- Start of thread 0 vpp_main ------------------- +Packet 1 + +00:07:03:979275: af-packet-input + af_packet: hw_if_index 1 rx-queue 0 next-index 4 + block 47: + address 0x7fbf23b7d000 version 2 seq_num 48 pkt_num 0 + tpacket3_hdr: + status 0x20000001 len 98 snaplen 98 mac 92 net 106 + sec 0x68164381 nsec 0x258e7659 vlan 0 vlan_tpid 0 + vnet-hdr: + flags 0x00 gso_type 0x00 hdr_len 0 + gso_size 0 csum_start 0 csum_offset 0 +00:07:03:979293: ethernet-input + IP4: 02:42:09:97:28:c6 -> 02:42:c0:00:02:02 +00:07:03:979306: ip4-input + ICMP: 192.0.2.1 -> 192.0.2.2 + tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN + fragment id 0x5813, flags DONT_FRAGMENT + ICMP echo_request checksum 0xc16 id 21197 +00:07:03:979315: ip4-lookup + fib 0 dpo-idx 9 flow hash: 0x00000000 + ICMP: 192.0.2.1 -> 192.0.2.2 + tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN + fragment id 0x5813, flags DONT_FRAGMENT + ICMP echo_request checksum 0xc16 id 21197 +00:07:03:979322: ip4-receive + fib:0 adj:9 flow:0x00000000 + ICMP: 192.0.2.1 -> 192.0.2.2 + tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN + fragment id 0x5813, flags DONT_FRAGMENT + ICMP echo_request checksum 0xc16 id 21197 +00:07:03:979323: ip4-icmp-input + ICMP: 192.0.2.1 -> 192.0.2.2 + tos 0x00, ttl 64, 
length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
    fragment id 0x5813, flags DONT_FRAGMENT
  ICMP echo_request checksum 0xc16 id 21197
00:07:03:979323: ip4-icmp-echo-request
  ICMP: 192.0.2.1 -> 192.0.2.2
    tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
    fragment id 0x5813, flags DONT_FRAGMENT
  ICMP echo_request checksum 0xc16 id 21197
00:07:03:979326: ip4-load-balance
  fib 0 dpo-idx 5 flow hash: 0x00000000
  ICMP: 192.0.2.2 -> 192.0.2.1
    tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
    fragment id 0x2dc4, flags DONT_FRAGMENT
  ICMP echo_reply checksum 0x1416 id 21197
00:07:03:979325: ip4-rewrite
  tx_sw_if_index 1 dpo-idx 5 : ipv4 via 192.0.2.1 eth1: mtu:1500 next:3 flags:[] 0242099728c60242c00002020800 flow hash: 0x00000000
  00000000: 0242099728c60242c00002020800450000542dc44000400188e1c0000202c000
  00000020: 02010000141652cd00018143166800000000399d0900000000001011
00:07:03:979326: eth1-output
  eth1 flags 0x02180005
  IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
  ICMP: 192.0.2.2 -> 192.0.2.1
    tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
    fragment id 0x2dc4, flags DONT_FRAGMENT
  ICMP echo_reply checksum 0x1416 id 21197
00:07:03:979327: eth1-tx
  af_packet: hw_if_index 1 tx-queue 0
  tpacket3_hdr:
    status 0x1 len 108 snaplen 108 mac 0 net 0
    sec 0x0 nsec 0x0 vlan 0 vlan_tpid 0
  vnet-hdr:
    flags 0x00 gso_type 0x00 hdr_len 0
    gso_size 0 csum_start 0 csum_offset 0
  buffer 0xf97c4:
    current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
    local l2-hdr-offset 0 l3-hdr-offset 14
  IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
  ICMP: 192.0.2.2 -> 192.0.2.1
    tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
    fragment id 0x2dc4, flags DONT_FRAGMENT
  ICMP echo_reply checksum 0x1416 id 21197
```

Well, that's a mouthful, isn't it! Here, I get to show you VPP in action.
After receiving the
packet on its `af-packet-input` node from 192.0.2.1 (Summer, who is pinging us) to 192.0.2.2 (the
VPP container), the packet traverses the dataplane graph. It goes through `ethernet-input`, then
`ip4-input`, which sees that it's destined to a locally configured IPv4 address, so the packet is
handed to `ip4-receive`. That one sees that the IP protocol is ICMP, so it hands the packet to
`ip4-icmp-input`, which notices that the packet is an ICMP echo request, so off to
`ip4-icmp-echo-request` our little packet goes. The ICMP plugin in VPP now answers by
`ip4-rewrite`'ing the packet, sending the reply to 192.0.2.1 at MAC address `02:42:09:97:28:c6`
(this is Summer, the host doing the pinging!), after which the newly created ICMP echo-reply is
handed to `eth1-output`, which marshals it back into the kernel's AF_PACKET interface using
`eth1-tx`.

Boom. I could not be more pleased.

## What's Next

This was a nice exercise for me! I'm going in this direction because the
[[Containerlab](https://containerlab.dev)] framework starts containers with given NOS images,
not too dissimilar from the one I just made, and then attaches `veth` pairs between the containers.
I started dabbling with a [[pull-request](https://github.com/srl-labs/containerlab/pull/2569)], but
I got stuck with a part of the Containerlab code that pre-deploys config files into the containers.
You see, I will need to generate two files:

1. A `startup.conf` file that is specific to each Containerlab Docker container. I'd like them to
   each set their own hostname, so that the CLI has a unique prompt. I can do this by setting `unix
   { cli-prompt {{ .ShortName }}# }` in the template renderer.
1. Containerlab will know all of the veth pairs that are planned to be created in each VPP
   container. I'll need it to then write a little snippet of config that does the `create
   host-interface` spiel, to attach these `veth` pairs to the VPP dataplane.
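To make that second file concrete, here's roughly what I imagine Containerlab templating out per
node. The file name `clab.vpp` and the `{{ .MAC }}` variable are purely illustrative on my part -
this is not (yet) what the pull-request actually renders:

```
# startup.conf fragment, templated per node: a unique prompt, plus an
# exec script that VPP runs at startup
unix {
  cli-prompt {{ .ShortName }}#
  exec /etc/vpp/clab.vpp
}

# /etc/vpp/clab.vpp: one stanza per veth pair that Containerlab wired up
create host-interface name eth1 hw-addr {{ .MAC }}
set interface state host-eth1 up
```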
I reached out to Roman from Nokia, who is one of the authors and the current maintainer of
Containerlab. Roman was keen to help out, and seeing as he knows the Containerlab stuff well, and I
know the VPP stuff well, this is a reasonable partnership! Soon, he and I plan to have a bare-bones
setup that will connect a few VPP containers together with an SR Linux node in a lab. Stand by!

Once we have that, there's still quite some work for me to do. Notably:
* Configuration persistence. `clab` allows you to save the running config. For that, I'll need to
  introduce [[vppcfg](https://github.com/pimvanpelt/vppcfg.git)] and a means to invoke it when
  the lab operator wants to save their config, and then reconfigure VPP when the container
  restarts.
* I'll need to have a few files from `clab` shared with the host, notably the `startup.conf` and
  `vppcfg.yaml`, as well as some manual pre- and post-flight configuration for the more esoteric
  stuff. Building the plumbing for this is a TODO for now.

## Acknowledgements

I wanted to give a shout-out to Nardus le Roux, who inspired me to contribute this Containerlab VPP
node type, and to Roman Dodin for his help getting the Containerlab parts squared away when I got a
little bit stuck.

First order of business: get it to ping at all ... it'll go faster from there on out :)
diff --git a/static/assets/containerlab/containerlab.svg b/static/assets/containerlab/containerlab.svg
new file mode 100644
index 0000000..c26dfc6
--- /dev/null
+++ b/static/assets/containerlab/containerlab.svg
@@ -0,0 +1 @@
+
\ No newline at end of file