Compare commits: 7ea57ba11e ... main
125 commits
| SHA1 |
| --- |
| 1b95e25331 |
| 512cfd75dc |
| 8683d570a1 |
| a1a98ad3c6 |
| 26ae98d977 |
| 619a1dfdf2 |
| a9e978effb |
| 825335cef9 |
| a97115593c |
| 3dd0d8a656 |
| f137326339 |
| 51098ed43c |
| 6b337e1167 |
| bbf36f5a4e |
| b324d71b3f |
| 2681861e4b |
| 4f0188abeb |
| f4ed332b18 |
| d9066aa241 |
| c68799703b |
| c32d1779f8 |
| eda80e7e66 |
| d13da5608d |
| d47261a3b7 |
| 383a598fc7 |
| 8afa2ff944 |
| fe1207ee78 |
| 6a59b7d7e6 |
| bc2a9bb352 |
| 5d02b6466c |
| b6b419471d |
| 85b41ba4e0 |
| ebbb0f8e24 |
| 218ee84d5f |
| c476fa56fb |
| a76abc331f |
| 44deb34685 |
| ca46bcf6d5 |
| 5042f822ef |
| fdb77838b8 |
| 6d3f4ac206 |
| baa3e78045 |
| 0972cf4aa1 |
| 4f81d377a0 |
| 153048eda4 |
| 4aa5745d06 |
| 7d3f617966 |
| 8918821413 |
| 9783c7d39c |
| af68c1ec3b |
| 0baadb5089 |
| 3b7e576d20 |
| d0a7cdbe38 |
| ed087f3fc6 |
| 51e6c0e1c2 |
| 8a991bee47 |
| d9e2f407e7 |
| 01820776af |
| d5d4f7ff55 |
| 2a61bdc028 |
| c2b8eef4f4 |
| 533cca0108 |
| 4ac8c47127 |
| bcbb119b20 |
| ce6e6cde22 |
| 610835925b |
| 16ac42bad9 |
| 26397d69c6 |
| 388293baef |
| b2129702ae |
| ba068c1c52 |
| 3c69130cea |
| 255d3905d7 |
| 4cd42b9824 |
| f12247d278 |
| 36b422ce08 |
| 2e1bb69772 |
| ceb16714b6 |
| 72b99b20c6 |
| 4b5bd40fce |
| 1379c77181 |
| 08d55e6ac0 |
| 3feb217aa8 |
| 2f63fc0ebb |
| 4113615096 |
| 52cba49c90 |
| b5c0819bfa |
| ea05b39ddf |
| 27ab370dc4 |
| 1e5e965572 |
| d8c36e5077 |
| 8b23bba61d |
| 5dc5a17f40 |
| 52d3606b1b |
| d017f1c2cf |
| e867f75a34 |
| 7da66c5f35 |
| f201aeb596 |
| ee4534c23a |
| 6ef9a21206 |
| a4884a28d9 |
| 5b0f1acbf6 |
| 9727d065b8 |
| ef83fd569d |
| bf9a070ea5 |
| 090cf21170 |
| f23a5ace77 |
| 3db7156652 |
| 9b47359318 |
| ecb0062105 |
| 44a854dc8e |
| 7fc65b87df |
| 413498e4c1 |
| b576a15a30 |
| 7f73540fd7 |
| 0b5ed8683c |
| 20022b77dd |
| 0542c1e2d9 |
| 4aa6f0bf10 |
| 4210f97c9d |
| 34981afe2e |
| de61265f82 |
| b09f7437b2 |
| 005add2b74 |
| 4122f50cb1 |
.drone.yml (Normal file, 34 lines added)
@@ -0,0 +1,34 @@
kind: pipeline
name: default

steps:
  - name: git-lfs
    image: alpine/git
    commands:
      - git lfs install
      - git lfs pull
  - name: build
    image: git.ipng.ch/ipng/drone-hugo:release-0.148.2
    settings:
      hugo_version: 0.148.2
      extended: true
  - name: rsync
    image: drillster/drone-rsync
    settings:
      user: drone
      key:
        from_secret: drone_sshkey
      hosts:
        - nginx0.chrma0.net.ipng.ch
        - nginx0.chplo0.net.ipng.ch
        - nginx0.nlams1.net.ipng.ch
        - nginx0.nlams2.net.ipng.ch
      port: 22
      args: '-6u --delete-after'
      source: public/
      target: /nginx/sites/ipng.ch/
      recursive: true
    secrets: [ drone_sshkey ]

image_pull_secrets:
  - git_ipng_ch_docker
.gitignore (vendored, 1 line added)
@@ -1,3 +1,4 @@
.hugo*
public/
resources/_gen/
+.DS_Store
@@ -8,7 +8,7 @@ Historical context - todo, but notes for now

1. started with stack.nl (when it was still stack.urc.tue.nl), 6bone and watching NASA multicast video in 1997.
2. founded ipng.nl project, first IPv6 in NL that was usable outside of NREN.
-3. attacted attention of the first few IPv6 partitipants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day
+3. attracted attention of the first few IPv6 participants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day
4. launched IPv6 at AMS-IX, first IXP prefix allocated 2001:768:1::/48
> My Brilliant Idea Of The Day -- encode AS number in leetspeak: `::AS01:2859:1`, because who would've thought we would ever run out of 16 bit AS numbers :)
5. IPng rearchitected to SixXS, and became a very large scale deployment of IPv6 tunnelbroker; our main central provisioning system moved around a few times between ISPs (Intouch, Concepts ICT, BIT, IP Man)

@@ -185,7 +185,7 @@ function is_coloclue_beacon()
}
```

-Then, I ran the configuration again with one IPv4 beacon set on dcg-1, and still all the bird configs on both IPv4 and IPv6 for all routers parsed correctly, and the generated function on the dcg-1 IPv4 filters file was popupated:
+Then, I ran the configuration again with one IPv4 beacon set on dcg-1, and still all the bird configs on both IPv4 and IPv6 for all routers parsed correctly, and the generated function on the dcg-1 IPv4 filters file was populated:
```
function is_coloclue_beacon()
{
@@ -89,7 +89,7 @@ lcp lcp-sync off
```

The prep work for the rest of the interface syncer starts with this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.

### Change interface: state

@@ -120,7 +120,7 @@ the state it was. I did notice that you can't bring up a sub-interface if its pa
is down, which I found counterintuitive, but that's neither here nor there.

All of this is to say that we have to be careful when copying state forward, because as
-this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
+this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
shows, issuing `set int state ... up` on an interface, won't touch its sub-interfaces in VPP, but
the subsequent netlink message to bring the _LIP_ for that interface up, **will** update the
children, thus desynchronising Linux and VPP: Linux will have interface **and all its

@@ -128,7 +128,7 @@ sub-interfaces** up unconditionally; VPP will have the interface up and its sub-
whatever state they were before.

To address this, a second
-[[commit](https://github.com/pimvanpelt/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
needed. I'm not too sure I want to keep this behavior, but for now, it results in an intuitive
end-state, which is that all interfaces states are exactly the same between Linux and VPP.

@@ -157,7 +157,7 @@ DBGvpp# set int state TenGigabitEthernet3/0/0 up
### Change interface: MTU

Finally, a straight forward
-[[commit](https://github.com/pimvanpelt/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
so I thought. When the MTU changes in VPP (with `set interface mtu packet N <int>`), there is
callback that can be registered which copies this into the _LIP_. I did notice a specific corner
case: In VPP, a sub-interface can have a larger MTU than its parent. In Linux, this cannot happen,

@@ -179,7 +179,7 @@ higher than that, perhaps logging an error explaining why. This means two things
1. Any change in VPP of a parent MTU should ensure all children are clamped to at most that.

I addressed the issue in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].

### Change interface: IP Addresses

@@ -199,7 +199,7 @@ VPP into the companion Linux devices:
_LIP_ with `lcp_itf_set_interface_addr()`.

This means with this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
any time a new _LIP_ is created, the IPv4 and IPv6 address on the VPP interface are fully copied
over by the third change, while at runtime, new addresses can be set/removed as well by the first
and second change.
@@ -100,7 +100,7 @@ linux-cp {

Based on this config, I set the startup default in `lcp_set_lcp_auto_subint()`, but I realize that
an administrator may want to turn it on/off at runtime, too, so I add a CLI getter/setter that
-interacts with the flag in this [[commit](https://github.com/pimvanpelt/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:
+interacts with the flag in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:

```
DBGvpp# show lcp

@@ -116,11 +116,11 @@ lcp lcp-sync off
```

The prep work for the rest of the interface syncer starts with this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.

The code for the configuration toggle is in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].

### Auto create/delete sub-interfaces

@@ -145,7 +145,7 @@ I noticed that interface deletion had a bug (one that I fell victim to as well:
remove the netlink device in the correct network namespace), which I fixed.

The code for the auto create/delete and the bugfix is in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].

### Further Work

@@ -154,7 +154,7 @@ For now, `lcp_nl_dispatch()` just throws the message away after logging it with
a function that will come in very useful as I start to explore all the different Netlink message types.

The code that forms the basis of our Netlink Listener lives in [[this
-commit](https://github.com/pimvanpelt/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
+commit](https://git.ipng.ch/ipng/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
specifically, here I want to call out I was not the primary author, I worked off of Matt and Neale's
awesome work in this pending [Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122).

@@ -182,7 +182,7 @@ Linux interface VPP is not aware of. But, if I can find the _LIP_, I can convert
add or remove the ip4/ip6 neighbor adjacency.

The code for this first Netlink message handler lives in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
ironic insight is that after writing the code, I don't think any of it will be necessary, because
the interface plugin will already copy ARP and IPv6 ND packets back and forth and itself update its
neighbor adjacency tables; but I'm leaving the code in for now.
@@ -197,7 +197,7 @@ it or remove it, and if there are no link-local addresses left, disable IPv6 on
There's also a few multicast routes to add (notably 224.0.0.0/24 and ff00::/8, all-local-subnet).

The code for IP address handling is in this
-[[commit]](https://github.com/pimvanpelt/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
+[[commit]](https://git.ipng.ch/ipng/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
when I took it out for a spin, I noticed something curious, looking at the log lines that are
generated for the following sequence:

@@ -236,7 +236,7 @@ interface and directly connected route addition/deletion is slightly different i
So, I decide to take a little shortcut -- if an addition returns "already there", or a deletion returns
"no such entry", I'll just consider it a successful addition and deletion respectively, saving my eyes
from being screamed at by this red error message. I changed that in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
turning this situation in a friendly green notice instead.

### Netlink: Link (existing)

@@ -267,7 +267,7 @@ To avoid this loop, I temporarily turn off `lcp-sync` just before handling a bat
turn it back to its original state when I'm done with that.

The code for all/del of existing links is in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].

### Netlink: Link (new)

@@ -276,7 +276,7 @@ doesn't have a _LIP_ for, but specifically describes a VLAN interface? Well, th
is trying to create a new sub-interface. And supporting that operation would be super cool, so let's go!

Using the earlier placeholder hint in `lcp_nl_link_add()` (see the previous
-[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
I know that I've gotten a NEWLINK request but the Linux ifindex doesn't have a _LIP_. This could be
because the interface is entirely foreign to VPP, for example somebody created a dummy interface or
a VLAN sub-interface on one:

@@ -331,7 +331,7 @@ a boring `<phy>.<subid>` name.

Alright, without further ado, the code for the main innovation here, the implementation of
`lcp_nl_link_add_vlan()`, is in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].

## Results

@@ -118,7 +118,7 @@ or Virtual Routing/Forwarding domains). So first, I need to add these:

All of this code was heavily inspired by the pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)]
but a few finishing touches were added, and wrapped up in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].

### Deletion

@@ -459,7 +459,7 @@ it as 'unreachable' rather than deleting it. These are *additions* which have a
but with an interface index of 1 (which, in Netlink, is 'lo'). This makes VPP intermittently crash, so I
currently commented this out, while I gain better understanding. Result: blackhole/unreachable/prohibit
specials can not be set using the plugin. Beware!
-(disabled in this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).
+(disabled in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).

## Credits

@@ -88,7 +88,7 @@ stat['/if/rx-miss'][:, 1].sum() - returns the sum of packet counters for
```

Alright, so let's grab that file and refactor it into a small library for me to use, I do
-this in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
+this in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].

### VPP's API

@@ -159,7 +159,7 @@ idx=19 name=tap4 mac=02:fe:17:06:fc:af mtu=9000 flags=3

So I added a little abstration with some error handling and one main function
to return interfaces as a Python dictionary of those `sw_interface_details`
-tuples in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
+tuples in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].

### AgentX

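As an aside: the hunk above describes one main function that returns the interfaces as a Python dictionary of `sw_interface_details` tuples. A minimal sketch of that idea, assuming an already-connected `vpp_papi` client object is passed in (the helper name and its argument are illustrative, not taken from the repository), might look like this:

```
# Sketch only: turn the replies of sw_interface_dump() into a dictionary
# keyed by interface name, which is the shape described in the hunk above.
# 'vpp' is assumed to be an already-connected vpp_papi client.
def get_interfaces(vpp):
    interfaces = {}
    for detail in vpp.api.sw_interface_dump():
        interfaces[detail.interface_name] = detail
    return interfaces
```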
@@ -207,9 +207,9 @@ once asked with `GetPDU` or `GetNextPDU` requests, by issuing a corresponding `R
to the SNMP server -- it takes care of all the rest!

The resulting code is in [[this
-commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
+commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
but you can also check out the whole thing on
-[[Github](https://github.com/pimvanpelt/vpp-snmp-agent)].
+[[Github](https://git.ipng.ch/ipng/vpp-snmp-agent)].

### Building

@@ -480,7 +480,7 @@ is to say, those packets which were destined to any IP address configured on the
plane. Any traffic going _through_ VPP will never be seen by Linux! So, I'll have to be
clever and count this traffic by polling VPP instead. This was the topic of my previous
[VPP Part 6]({{< ref "2021-09-10-vpp-6" >}}) about the SNMP Agent. All of that code
-was released to [Github](https://github.com/pimvanpelt/vpp-snmp-agent), notably there's
+was released to [Github](https://git.ipng.ch/ipng/vpp-snmp-agent), notably there's
a hint there for an `snmpd-dataplane.service` and a `vpp-snmp-agent.service`, including
the compiled binary that reads from VPP and feeds this to SNMP.

@@ -30,9 +30,9 @@ virtual machine running in Qemu/KVM into a working setup with both [Free Range R
and [Bird](https://bird.network.cz/) installed side by side.

**NOTE**: If you're just interested in the resulting image, here's the most pertinent information:
-> * ***vpp-proto.qcow2.lrz [[Download](https://ipng.ch/media/vpp-proto/vpp-proto-bookworm-20231015.qcow2.lrz)]***
-> * ***SHA256*** `bff03a80ccd1c0094d867d1eb1b669720a1838330c0a5a526439ecb1a2457309`
-> * ***Debian Bookworm (12.4)*** and ***VPP 24.02-rc0~46-ga16463610e***
+> * ***vpp-proto.qcow2.lrz*** [[Download](https://ipng.ch/media/vpp-proto/vpp-proto-bookworm-20250607.qcow2.lrz)]
+> * ***SHA256*** `a5fdf157c03f2d202dcccdf6ed97db49c8aa5fdb6b9ca83a1da958a8a24780ab`
+> * ***Debian Bookworm (12.11)*** and ***VPP 25.10-rc0~49-g90d92196***
> * ***CPU*** Make sure the (virtualized) CPU supports AVX
> * ***RAM*** The image needs at least 4GB of RAM, and the hypervisor should support hugepages and AVX
> * ***Username***: `ipng` with ***password***: `ipng loves vpp` and is sudo-enabled

@@ -62,7 +62,7 @@ plugins:
or route, or the system receiving ARP or IPv6 neighbor request/reply from neighbors), and applying
these events to the VPP dataplane.

-I've published the code on [Github](https://github.com/pimvanpelt/lcpng/) and I am targeting a release
+I've published the code on [Github](https://git.ipng.ch/ipng/lcpng/) and I am targeting a release
in upstream VPP, hoping to make the upcoming 22.02 release in February 2022. I have a lot of ground to
cover, but I will note that the plugin has been running in production in [AS8298]({{< ref "2021-02-27-network" >}})
since Sep'21 and no crashes related to LinuxCP have been observed.

@@ -195,7 +195,7 @@ So grab a cup of tea, while we let Rhino stretch its legs, ehh, CPUs ...
pim@rhino:~$ mkdir -p ~/src
pim@rhino:~$ cd ~/src
pim@rhino:~/src$ sudo apt install libmnl-dev
-pim@rhino:~/src$ git clone https://github.com/pimvanpelt/lcpng.git
+pim@rhino:~/src$ git clone https://git.ipng.ch/ipng/lcpng.git
pim@rhino:~/src$ git clone https://gerrit.fd.io/r/vpp
pim@rhino:~/src$ ln -s ~/src/lcpng ~/src/vpp/src/plugins/lcpng
pim@rhino:~/src$ cd ~/src/vpp
@@ -33,7 +33,7 @@ In this first post, let's take a look at tablestakes: writing a YAML specificati
configuration elements of VPP, and then ensures that the YAML file is both syntactically as well as
semantically correct.

-**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
+**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).

@@ -348,7 +348,7 @@ to mess up my (or your!) VPP router by feeding it garbage, so the lions' share o
has been to assert the YAML file is both syntactically and semantically valid.


-In the mean time, you can take a look at my code on [GitHub](https://github.com/pimvanpelt/vppcfg), but to
+In the mean time, you can take a look at my code on [GitHub](https://git.ipng.ch/ipng/vppcfg), but to
whet your appetite, here's a hefty configuration that demonstrates all implemented types:

```

@@ -32,7 +32,7 @@ the configuration to the dataplane. Welcome to `vppcfg`!
In this second post of the series, I want to talk a little bit about how planning a path from a running
configuration to a desired new configuration might look like.

-**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
+**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).

@@ -275,7 +275,6 @@ that will point at an `unbound` running on `lab.ipng.ch` itself.
I can now create any file I'd like which may use variable substition and other jinja2 style templating. Take
for example these two files:

-{% raw %}
```
pim@lab:~/src/lab$ cat overlays/bird/common/etc/netplan/01-netcfg.yaml.j2
network:

@@ -292,13 +291,12 @@ network:

pim@lab:~/src/lab$ cat overlays/bird/common/etc/netns/dataplane/resolv.conf.j2
domain lab.ipng.ch
-search{% for domain in lab.nameserver.search %} {{domain}}{%endfor %}
+search{% for domain in lab.nameserver.search %} {{ domain }}{% endfor %}

{% for resolver in lab.nameserver.addresses %}
-nameserver {{resolver}}
-{%endfor%}
+nameserver {{ resolver }}
+{% endfor %}
```
-{% endraw %}

The first file is a [[NetPlan.io](https://netplan.io/)] configuration that substitutes the correct management
IPv4 and IPv6 addresses and gateways. The second one enumerates a set of search domains and nameservers, so that
@@ -578,7 +578,7 @@ the inner payload carries the `vlan 30` tag, neat! The `VNI` there is `0xca986`
VLAN10 traffic (showing that multiple VLANs can be transported across the same tunnel, distinguished
by VNI).

-{{< image width="90px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
+{{< image width="90px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

At this point I make an important observation. VxLAN and GENEVE both have this really cool feature
that they can hash their _inner_ payload (ie. the IPv4/IPv6 address and ports if available) and use

@@ -171,12 +171,12 @@ GigabitEthernet1/0/0 1 up GigabitEthernet1/0/0

After this exploratory exercise, I have learned enough about the hardware to be able to take the
Fitlet2 out for a spin. To configure the VPP instance, I turn to
-[[vppcfg](https://github.com/pimvanpelt/vppcfg)], which can take a YAML configuration file
+[[vppcfg](https://git.ipng.ch/ipng/vppcfg)], which can take a YAML configuration file
describing the desired VPP configuration, and apply it safely to the running dataplane using the VPP
API. I've written a few more posts on how it does that, notably on its [[syntax]({{< ref "2022-03-27-vppcfg-1" >}})]
and its [[planner]({{< ref "2022-04-02-vppcfg-2" >}})]. A complete
configuration guide on vppcfg can be found
-[[here](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md)].
+[[here](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md)].

```
pim@fitlet:~$ sudo dpkg -i {lib,}vpp*23.06*deb

@@ -185,7 +185,7 @@ forgetful chipmunk-sized brain!), so here, I'll only recap what's already writte

**1. BUILD:** For the first step, the build is straight forward, and yields a VPP instance based on
`vpp-ext-deps_23.06-1` at version `23.06-rc0~71-g182d2b466`, which contains my
-[[LCPng](https://github.com/pimvanpelt/lcpng.git)] plugin. I then copy the packages to the router.
+[[LCPng](https://git.ipng.ch/ipng/lcpng.git)] plugin. I then copy the packages to the router.
The router has an E-2286G CPU @ 4.00GHz with 6 cores and 6 hyperthreads. There's a really handy tool
called `likwid-topology` that can show how the L1, L2 and L3 cache lines up with respect to CPU
cores. Here I learn that CPU (0+6) and (1+7) share L1 and L2 cache -- so I can conclude that 0-5 are

@@ -351,7 +351,7 @@ in `vppcfg`:

* When I create the initial `--novpp` config, there's a bug in `vppcfg` where I incorrectly
reference a dataplane object which I haven't initialized (because with `--novpp` the tool
will not contact the dataplane at all. That one was easy to fix, which I did in [[this
-commit](https://github.com/pimvanpelt/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).
+commit](https://git.ipng.ch/ipng/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).

After that small detour, I can now proceed to configure the dataplane by offering the resulting
VPP commands, like so:

@@ -573,7 +573,7 @@ see is that which is destined to the controlplane (eg, to one of the IPv4 or IPv
multicast/broadcast groups that they are participating in), so things like tcpdump or SNMP won't
really work.

-However, due to my [[vpp-snmp-agent](https://github.com/pimvanpelt/vpp-snmp-agent.git)], which is
+However, due to my [[vpp-snmp-agent](https://git.ipng.ch/ipng/vpp-snmp-agent.git)], which is
feeding as an AgentX behind an snmpd that in turn is running in the `dataplane` namespace, SNMP scrapes
work as they did before, albeit with a few different interface names.
@@ -14,7 +14,7 @@ performance and versatility. For those of us who have used Cisco IOS/XR devices,
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

-I've been working on the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)], which you
+I've been working on the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)], which you
can read all about in my series on VPP back in 2021:

[{: style="width:300px; float: right; margin-left: 1em;"}](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)

@@ -70,7 +70,7 @@ answered by a Response PDU.

Using parts of a Python Agentx library written by GitHub user hosthvo
[[ref](https://github.com/hosthvo/pyagentx)], I tried my hands at writing one of these AgentX's.
-The resulting source code is on [[GitHub](https://github.com/pimvanpelt/vpp-snmp-agent)]. That's the
+The resulting source code is on [[GitHub](https://git.ipng.ch/ipng/vpp-snmp-agent)]. That's the
one that's running in production ever since I started running VPP routers at IPng Networks AS8298.
After the _AgentX_ exposes the dataplane interfaces and their statistics into _SNMP_, an open source
monitoring tool such as LibreNMS [[ref](https://librenms.org/)] can discover the routers and draw

@@ -126,7 +126,7 @@ for any interface created in the dataplane.

I wish I were good at Go, but I never really took to the language. I'm pretty good at Python, but
sorting through the stats segment isn't super quick as I've already noticed in the Python3 based
-[[VPP SNMP Agent](https://github.com/pimvanpelt/vpp-snmp-agent)]. I'm probably the world's least
+[[VPP SNMP Agent](https://git.ipng.ch/ipng/vpp-snmp-agent)]. I'm probably the world's least
terrible C programmer, so maybe I can take a look at the VPP Stats Client and make sense of it. Luckily,
there's an example already in `src/vpp/app/vpp_get_stats.c` and it reveals the following pattern:

@@ -19,7 +19,7 @@ same time keep an IPng Site Local network with IPv4 and IPv6 that is separate fr
based on hardware/silicon based forwarding at line rate and high availability. You can read all
about my Centec MPLS shenanigans in [[this article]({{< ref "2023-03-11-mpls-core" >}})].

-Ever since the release of the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)]
+Ever since the release of the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)]
plugin in VPP, folks have asked "What about MPLS?" -- I have never really felt the need to go this
rabbit hole, because I figured that in this day and age, higher level IP protocols that do tunneling
are just as performant, and a little bit less of an 'art' to get right. For example, the Centec

@@ -459,6 +459,6 @@ and VPP, and the overall implementation before attempting to use in production.
we got at least some of this right, but testing and runtime experience will tell.

I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
-[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
+[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!
@@ -187,7 +187,7 @@ MPLS-VRF:0, fib_index:0 locks:[interface:4, CLI:1, lcp-rt:1, ]
[@1]: mpls via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254000210015254000310008847
```

-{{< image width="80px" float="left" src="/assets/vpp-mpls/lightbulb.svg" alt="Lightbulb" >}}
+{{< image width="80px" float="left" src="/assets/shared/lightbulb.svg" alt="Lightbulb" >}}

Haha, I love it when the brain-ligutbulb goes to the _on_ position. What's happening is that when we
turned on the MPLS feature on the VPP `tap` that is connected to `e0`, and VPP saw an MPLS packet,

@@ -385,5 +385,5 @@ and VPP, and the overall implementation before attempting to use in production.
we got at least some of this right, but testing and runtime experience will tell.

I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
-[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
+[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!

@@ -304,7 +304,7 @@ Gateway, just to show a few of the more advanced features of VPP. For me, this t
line of thinking: classifiers. This extract/match/act pattern can be used in policers, ACLs and
arbitrary traffic redirection through VPP's directed graph (eg. selecting a next node for
processing). I'm going to deep-dive into this classifier behavior in an upcoming article, and see
-how I might add this to [[vppcfg](https://github.com/pimvanpelt/vppcfg.git)], because I think it
+how I might add this to [[vppcfg](https://git.ipng.ch/ipng/vppcfg.git)], because I think it
would be super powerful to abstract away the rather complex underlying API into something a little
bit more ... user friendly. Stay tuned! :)

@@ -543,7 +543,7 @@ Whoa, what just happened here? The switch took the port defined by `pci/0000:03:
it is _splittable_ and has four lanes, and split it into four NEW ports called `swp1s0`-`swp1s3`,
and the resulting ports are 25G, 10G or 1G.

-{{< image width="100px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
+{{< image width="100px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

However, I make an important observation. When splitting `swp1` in 4, the switch also removed port
`swp2`, and remember at the beginning of this article I mentioned that the MAC addresses seemed to

@@ -243,7 +243,7 @@ any prefixes, for example this session in Düsseldorf:
};
```

-{{< image width="80px" float="left" src="/assets/debian-vpp/warning.png" alt="Warning" >}}
+{{< image width="80px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

This is where it's a good idea to grab some tea. Quite a few internet providers have
incredibly slow convergence, so just by stopping the announcment of `AS8298:AS-IPNG` prefixes at

@@ -548,7 +548,7 @@ for table in api_reply:
print(str)
```

-{{< image width="50px" float="left" src="/assets/vpp-papi/warning.png" alt="Warning" >}}
+{{< image width="50px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

Funny detail - it took me almost two years to discover `VppEnum`, which contains all of these
symbols. If you end up reading this after a Bing, Yahoo or DuckDuckGo search, feel free to buy
@@ -47,7 +47,7 @@ we'll use for performance testing, notably to compare the FreeBSD kernel routing
like `netmap`, and of course VPP itself. I do intend to do some side-by-side comparisons between
Debian and FreeBSD when they run VPP.

-{{< image width="100px" float="left" src="/assets/freebsd-vpp/brain.png" alt="brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="brain" >}}

If you know me a little bit, you'll know that I typically forget how I did a thing, so I'm using
this article for others as well as myself in case I want to reproduce this whole thing 5 years down

@@ -163,7 +163,7 @@ interfaces a bit. They need to be:
075.810547 main [301] Ready to go, ixl0 0x0/4 <-> ixl1 0x0/4.
```

-{{< image width="80px" float="left" src="/assets/freebsd-vpp/warning.png" alt="Warning" >}}
+{{< image width="80px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

I start my first loadtest, which pretty immediately fails. It's an interesting behavior pattern which
I've not seen before. After staring at the problem, and reading the code of `bridge.c`, which is a

@@ -63,7 +63,7 @@ Let me discuss these two purposes in more detail:

### 1. IPv4 ARP, née IPv6 NDP

-{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

One really neat trick is simply replacing ARP resolution by something that can resolve the
link-layer MAC address in a different way. As it turns out, IPv6 has an equivalent that's

@@ -359,7 +359,7 @@ does not have an IPv4 address. Except -- I'm bending the rules a little bit by d
There's an internal function `ip4_sw_interface_enable_disable()` which is called to enable IPv4
processing on an interface once the first IPv4 address is added. So my first fix is to force this to
be enabled for any interface that is exposed via Linux Control Plane, notably in `lcp_itf_pair_create()`
-[[here](https://github.com/pimvanpelt/lcpng/blob/main/lcpng_interface.c#L777)].
+[[here](https://git.ipng.ch/ipng/lcpng/blob/main/lcpng_interface.c#L777)].

This approach is partially effective:

@@ -500,7 +500,7 @@ which is unnumbered. Because I don't know for sure if everybody would find this
I make sure to guard the behavior behind a backwards compatible configuration option.

If you're curious, please take a look at the change in my [[GitHub
-repo](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
+repo](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
which I:
1. add a new configuration option, `lcp-sync-unnumbered`, which defaults to `on`. That would be
what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.
@@ -147,7 +147,7 @@ With all of that, I am ready to demonstrate two working solutions now. I first c
Ondrej's [[commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)].
Then, I compile VPP with my pending [[gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. Finally,
to demonstrate how `update_loopback_addr()` might work, I compile `lcpng` with my previous
-[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
which allows me to inhibit copying forward addresses from VPP to Linux, when using _unnumbered_
interfaces.

@@ -242,7 +242,7 @@ even if the interface link stays up. It's described in detail in
[[RFC5880](https://www.rfc-editor.org/rfc/rfc5880.txt)], and I use it at IPng Networks all over the
place.

-{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

Then I'll configure two OSPF protocols, one for IPv4 called `ospf4` and another for IPv6 called
`ospf6`. It's easy to overlook, but while usually the IPv4 protocol is OSPFv2 and the IPv6 protocol

@@ -1,8 +1,9 @@
---
date: "2024-04-27T10:52:11Z"
-title: FreeIX - Remote
+title: "FreeIX Remote - Part 1"
aliases:
- /s/articles/2024/04/27/freeix-1.html
+- /s/articles/2024/04/27/freeix-remote/
---

# Introduction

@@ -91,7 +92,7 @@ their traffic to these remote internet exchanges.
There are two types of BGP neighbor adjacency:

1. ***Members***: these are {ip-address,AS}-tuples which FreeIX has explicitly configured. Learned prefixes are added
-to as-set AS50869:AS-MEMBERS. Members receive _all_ prefixes from FreeIX, each annotated with BGP **informational**
+to as-set AS50869:AS-MEMBERS. Members receive _some or all_ prefixes from FreeIX, each annotated with BGP **informational**
communities, and members can drive certain behavior with BGP **action** communities.

1. ***Peers***: these are all other entities with whom FreeIX has an adjacency at public internet exchanges or private
@@ -195,12 +196,12 @@ network interconnects:
* `(50869,3020,1)`: Inhibit Action (30XX), Country (3020), Switzerland (1)
* `(50869,3030,1308)`: Inhibit Action (30XX), IXP (3030), PeeringDB IXP for LS-IX (1308)

-Further actions can be placed on a per-remote-neighbor basis:
+Four actions can be placed on a per-remote-asn basis:

* `(50869,3040,13030)`: Inhibit Action (30XX), AS (3040), Init7 (AS13030)
-* `(50869,3041,6939)`: Prepend Action (30XX), Prepend Once (3041), Hurricane Electric (AS6939)
-* `(50869,3042,12859)`: Prepend Action (30XX), Prepend Twice (3042), BIT BV (AS12859)
-* `(50869,3043,8283)`: Prepend Action (30XX), Prepend Three Times (3043), Coloclue (AS8283)
+* `(50869,3100,6939)`: Prepend Once Action (3100), Hurricane Electric (AS6939)
+* `(50869,3200,12859)`: Prepend Twice Action (3200), BIT BV (AS12859)
+* `(50869,3300,8283)`: Prepend Thice Action (3300), Coloclue (AS8283)

Peers cannot set these actions, as all action communities will be stripped on ingress. Members can set these action
communities on their sessions with FreeIX routers, however in some cases they may also be set by FreeIX operators when

@@ -58,7 +58,8 @@ argument of resistance? Nerd-snipe accepted!

Let me first introduce the mail^W main characters of my story:

-| {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} |
+| {{< image src="/assets/smtp/postfix_logo.png" width="8em" >}} | {{< image src="/assets/smtp/dovecot_logo.png" width="8em" >}} | {{< image src="/assets/smtp/nginx_logo.png" width="8em" >}} | {{< image src="/assets/smtp/rspamd_logo.png" width="8em" >}} | {{< image src="/assets/smtp/unbound_logo.png" width="8em" >}} | {{< image src="/assets/smtp/roundcube_logo.png" width="8em" >}} |
| ---- | ---- | ---- | ---- | ---- | ---- |

* ***Postfix***: is Wietse Venema's mail server that started life at IBM research as an
alternative to the widely-used Sendmail program. After eight years at Google, Wietse continues

@@ -444,7 +445,7 @@ pim@squanchy:~$ sudo cat /etc/mail/secrets
ipng bastion:<haha-made-you-look>
```

-{{< image width="120px" float="left" src="/assets/smtp/lightbulb.svg" alt="Lightbulb" >}}
+{{< image width="120px" float="left" src="/assets/shared/lightbulb.svg" alt="Lightbulb" >}}

What happens here is, every time this server `squanchy` wants to send an e-mail, it will use an SMTP
session with TLS, on port 587, of the machine called `smtp-out.ipng.ch`, and it'll authenticate
@@ -101,6 +101,7 @@ IPv6 network and access the internet via a shared IPv6 address.
I will assign a pool of four public IPv4 addresses and eight IPv6 addresses to each border gateway:

| **Machine** | **IPv4 pool** | **IPv6 pool** |
| ----------- | ------------- | ------------- |
| border0.chbtl0.net.ipng.ch | <span style='color:green;'>194.126.235.0/30</span> | <span style='color:blue;'>2001:678:d78::3:0:0/125</span> |
| border0.chrma0.net.ipng.ch | <span style='color:green;'>194.126.235.4/30</span> | <span style='color:blue;'>2001:678:d78::3:1:0/125</span> |
| border0.chplo0.net.ipng.ch | <span style='color:green;'>194.126.235.8/30</span> | <span style='color:blue;'>2001:678:d78::3:2:0/125</span> |

@@ -305,7 +306,7 @@ switches, I will announce:
towards DNS64-rewritten destinations, for example 2001:678:d78:564::8c52:7903 as DNS64 representation
of github.com, which is reachable only at legacy address 140.82.121.3.

-{{< image width="100px" float="left" src="/assets/nat64/brain.png" alt="Brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

I have to be careful with the announcements into OSPF. The cost of E1 routes is the cost of the
external metric **in addition to** the internal cost within OSPF to reach that network. The cost

@@ -250,10 +250,10 @@ remove the IPv4 and IPv6 addresses from the <span style='color:red;font-weight:b
routers in Brüttisellen. They are directly connected, and if anything goes wrong, I can walk
over and rescue them. Sounds like a safe way to start!

-I quickly add the ability for [[vppcfg](https://github.com/pimvanpelt/vppcfg)] to configure
+I quickly add the ability for [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] to configure
_unnumbered_ interfaces. In VPP, these are interfaces that don't have an IPv4 or IPv6 address of
their own, but they borrow one from another interface. If you're curious, you can take a look at the
-[[User Guide](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
+[[User Guide](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
GitHub.

Looking at their `vppcfg` files, the change is actually very easy, taking as an example the

@@ -280,7 +280,7 @@ By commenting out the `addresses` field, and replacing it with `unnumbered: loop
vppcfg to make Te6/0/0, which in Linux is called `xe1-0`, borrow its addresses from the loopback
interface `loop0`.

-{{< image width="100px" float="left" src="/assets/freebsd-vpp/brain.png" alt="brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="brain" >}}

Planning and applying this is straight forward, but there's one detail I should
mention. In my [[previous article]({{< ref "2024-04-06-vpp-ospf" >}})] I asked myself a question:

@@ -291,7 +291,7 @@ interface.

In the article, you'll see that discussed as _Solution 2_, and it includes a bit of rationale why I
find this better. I implemented it in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
case you're curious, and the commandline keyword is `lcp lcp-sync-unnumbered off` (the default is
_on_).
@@ -292,7 +292,7 @@ transmitting, or performing both receiving *and* transmitting.

### Intel X520 (10GbE)

-{{< image width="100px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
+{{< image width="100px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

This network card is based on the classic Intel _Niantic_ chipset, also known as the 82599ES chip,
first released in 2009. It's super reliable, but there is one downside. It's a PCIe v2.0 device

@@ -462,7 +462,7 @@ ip4-rewrite active 14845221 35913927 0 8.9
unix-epoll-input polling 22551 0 0 1.37e3 0.00
```

-{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

I kind of wonder why that is. Is the Mellanox Connect-X3 such a poor performer? Or does it not like
small packets? I've read online that Mellanox cards do some form of message compression on the PCI

@@ -407,7 +407,7 @@ loadtest:

{{< image src="/assets/gowin-n305/cx5-cpu-rdma1q.png" alt="Cx5 CPU with 1Q" >}}

-{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

Here I can clearly see that the one CPU thread (in yellow for unidirectional) and the two CPU
therads (one for each of the bidirectional flows) jump up to 100% and stay there. This means that
452
content/articles/2024-08-12-jekyll-hugo.md
Normal file
452
content/articles/2024-08-12-jekyll-hugo.md
Normal file
@@ -0,0 +1,452 @@
|
||||
---
|
||||
date: "2024-08-12T09:01:23Z"
|
||||
title: 'Case Study: From Jekyll to Hugo'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image width="16em" float="right" src="/assets/jekyll-hugo/before.png" alt="ipng.nl before" >}}
|
||||
|
||||
In the _before-days_, I had a very modest personal website running on [[ipng.nl](https://ipng.nl)]
|
||||
and [[ipng.ch](https://ipng.ch/)]. Over the years I've had quite a few different designs, and
|
||||
although one of them was hosted (on Google Sites) for a brief moment, they were mostly very much web
|
||||
1.0, "The 90s called, they wanted their website back!" style.
|
||||
|
||||
The site didn't have much other than a little blurb on a few open source projects of mine, and a
|
||||
gallery hosted on PicasaWeb [which Google subsequently turned down], and a mostly empty Blogger
|
||||
page. Would you imagine that I hand-typed the XHTML and CSS for this website, where the menu at the
|
||||
top (thinks like `Home` - `Resume` - `History` - `Articles`) would just have a HTML page which
|
||||
meticulously linked to the other HTML pages. It was the way of the world, in the 1990s.
|
||||
|
||||
## Jekyll
|
||||
|
||||
{{< image width="9em" float="right" src="/assets/jekyll-hugo/jekyll-logo.png" alt="Jekyll" >}}
|
||||
|
||||
My buddy Michal suggested in May of 2021 that, if I was going to write all of the HTML skeleton by
|
||||
hand, I may as well switch to a static website generator. He's fluent in Ruby, and suggested I take
|
||||
a look at [[Jekyll](https://jekyllrb.com/)], a static site generator. It takes text written in
|
||||
your favorite markup language and uses layouts to create a static website. You can tweak the site’s
|
||||
look and feel, URLs, the data displayed on the page, and more.
|
||||
|
||||
I immediately fell in love! As an experiment, I moved [[IPng.ch](https://ipng.ch)] to a new
|
||||
webserver, and kept my personal website on [[IPng.nl](https://ipng.nl)]. I had always wanted to
|
||||
write a little bit more about technology, and since I was working on an interesting project [[Linux
|
||||
Control Plane]({{< ref 2021-08-12-vpp-1 >}})] in VPP, I thought it'd be nice to write a little bit
|
||||
about it, but certainly not while hand-crafting all of the HTML exoskeleton. I just wanted to write
|
||||
Markdown, and this is precisely the _raison d'être_ of Jekyll!
|
||||
|
||||
Since April 2021, I wrote in total 67 articles with Jekyll. Some of them proved to become quite
|
||||
popular, and (_humblebrag_) my website is widely considered one of the best resources for Vector
|
||||
Packet Processing, with my [[VPP]({{< ref 2021-09-21-vpp-7 >}})] series, [[MPLS]({{< ref
|
||||
2023-05-07-vpp-mpls-1 >}})] series and a few others like the [[Mastodon]({{< ref
|
||||
2022-11-20-mastodon-1 >}})] series being amongst some of the top visited articles, with ~7.5-8K
|
||||
monthly unique visitors.
|
||||
|
||||
## The catalyst
|
||||
|
||||
There were two distinct events that lead up to this. Firstly, I started a side project called [[Free
|
||||
IX](https://free-ix.ch/)], which I also created in Jekyll. When I did that, I branched the
|
||||
[[IPng.ch](https://ipng.ch)] site, but the build faild with Ruby errors. My buddy Antonios fixed
|
||||
those, and we were underway. Secondly, later on I attempted to upgrade the IPng website to the same
|
||||
fixes that Antonios had provided for Free-IX, and all hell broke loose (luckily, only in staging
|
||||
environment). I spent several hours pulling my hear out re-assembling the dependencies, downgrading
|
||||
Jekyll, pulling new `gems`, downgrading `ruby`. Finally, I got it to work again, only to see after
|
||||
my first production build, that the build immediately failed because the Docker container that does
|
||||
the build no longer liked what I had put in the `Gemfile` and `_config.yml`. It was something to do
|
||||
with `sass-embedded` gem, and I spent waaaay too long fixing this incredibly frustrating breakage.
|
||||
|
||||
## Hugo
|
||||
|
||||
{{< image width="9em" float="right" src="/assets/jekyll-hugo/hugo-logo-wide.svg" alt="Hugo" >}}
|
||||
|
||||
When I made my roadtrip from Zurich to the North Cape with my buddy Paul, we took extensive notes on
|
||||
our daily travels, and put them on a [[2022roadtripnose](https://2022roadtripnose.weirdnet.nl/)]
|
||||
website. At the time, I was looking for a photo caroussel for Jekyll, and while I found a few, none
|
||||
of them really worked in the way I wanted them to. I stumbled across [[Hugo](https://gohugo.io)],
|
||||
which says on its website that it is one of the most popular open-source static site generators.
|
||||
With its amazing speed and flexibility, Hugo makes building websites fun again. So I dabbled a bit
|
||||
and liked what I saw. I used the [[notrack](https://github.com/gevhaz/hugo-theme-notrack)] theme from
|
||||
GitHub user `@gevhaz`, as they had made a really nice gallery widget (called a `shortcode` in Hugo).
|
||||
|
||||
The main reason for me to move to Hugo is that it is a **standalone Go** program, with no runtime or
|
||||
build time dependencies. The Hugo [[GitHub](https://github.com/gohugoio/hugo)] delivers ready to go
|
||||
build artifacts, tests amd releases regularly, and has a vibrant user community.
|
||||
|
||||
### Migrating
|
||||
|
||||
I have only a few strong requirements if I am to move my website:
|
||||
|
||||
1. The site's URL namespace MUST be *identical* (not just similar) to Jekyll. I do not want to
|
||||
lose my precious ranking on popular search engines.
|
||||
1. MUST be built in a CI/CD tool like Drone or Jenkins, and autodeploy
|
||||
1. Code MUST be _hermetic_, not pulling in external dependencies, neither in the build system (eg.
|
||||
Hugo itself) nor the website (eg. dependencies, themes, etc).
|
||||
1. Theme MUST support images, videos and SHOULD support asciinema.
|
||||
1. Theme SHOULD try to look very similar to the current Jekyll `minima` theme.
|
||||
|
||||
|
||||
#### Attempt 1: Auto import ❌
|
||||
|
||||
With that in mind, I notice that Hugo has a site _importer_, that can import a site from Jekyll! I
|
||||
run it, but it produces completely broken code, and Hugo doesn't even want to compile the site. This
|
||||
turns out to be a _theme_ issue, so I take Hugo's advice and install the recommended theme. The site
|
||||
comes up, but is pretty screwed up. I now realize that the `hugo import jekyll` imports the markdown
|
||||
as-is, and only rewrites the _frontmatter_ (the little blurb of YAML metadata at the top of each
|
||||
file). Two notable problems:
|
||||
|
||||
**1. images** - I make liberal use of Markdown images, which in Jekyll can be decorated with CSS
|
||||
styling, like so:
|
||||
```
|
||||
{: style="width:200px; float: right; margin: 1em;"}
|
||||
```
|
||||
|
||||
**2. post_url** - Another widely used feature is cross-linking my own articles, using Jekyll
|
||||
template expansion, like so:
|
||||
```
|
||||
.. Remember in my [[VPP Babel]({% post_url 2024-03-06-vpp-babel-1 %})] ..
|
||||
```
|
||||
|
||||
I do some grepping, and have 246 such Jekyll template expansions, and 272 images OK, that's a dud.
|
||||
|
||||
#### Attempt 2: Skeleton ✅
|
||||
|
||||
I decide to do this one step at a time. First, I create a completely new website `hugo new site
|
||||
ipng.ch`, download the `notrack` theme, and add only the front page `index.md` from the
|
||||
original IPng site. OK, that renders.
|
||||
|
||||
Now comes a fun part: going over the `notrack` theme's SCSS to adjust it to look and feel similar to
|
||||
the Jekyll `minima` theme. I change a bunch of stuff in the skeleton of the website:
|
||||
|
||||
First, I take a look at the site media breakpoints, so that they feel correct on desktop, tablet
|
||||
and iPhone/Android screens. Then, I inspect the font family, size and H1/H2/H3...
|
||||
magnifications, also scaling them with media size. Finally I notice the footer, which in `notrack`
|
||||
spans the whole width of the browser. I change it to be as wide as the header and main page.
|
||||
|
||||
I go one by one on the site's main pages and, just as on the Jekyll site, I make them into menu
|
||||
items at the top of the page. The [[Services]({{< ref services >}})] page serves as my proof of
|
||||
concept, as it has both the `image` and the `post_url` pattern in Jekyll. It references six articles
|
||||
and has two images which float on the right side of the canvas. If I can figure out how to rewrite
|
||||
these to fit the Hugo variants of the same pattern, I should be home free.
|
||||
|
||||
### Hugo: image
|
||||
|
||||
The idiomatic way in `notrack` is an `image` shortcode. I hope you know where to find the curly
|
||||
braces on your keyboard - because geez, Hugo templating sure does like them!
|
||||
|
||||
```
|
||||
<figure class="image-shortcode{{ with .Get "class" }} {{ . }}{{ end }}
|
||||
{{- with .Get "wide" }}{{- if eq . "true" }} wide{{ end -}}{{ end -}}
|
||||
{{- with .Get "frame" }}{{- if eq . "true" }} frame{{ end -}}{{ end -}}
|
||||
{{- with .Get "float" }} {{ . }}{{ end -}}"
|
||||
style="
|
||||
{{- with .Get "width" }}width: {{ . }};{{ end -}}
|
||||
{{- with .Get "height" }}height: {{ . }};{{ end -}}">
|
||||
{{- if .Get "link" -}}
|
||||
<a href="{{ .Get "link" }}"{{ with .Get "target" }} target="{{ . }}"{{ end -}}
|
||||
{{- with .Get "rel" }} rel="{{ . }}"{{ end }}>
|
||||
{{- end }}
|
||||
<img src="{{ .Get "src" | relURL }}"
|
||||
{{- if or (.Get "alt") (.Get "caption") }}
|
||||
alt="{{ with .Get "alt" }}{{ replace . "'" "'" }}{{ else -}}
|
||||
{{- .Get "caption" | markdownify| plainify }}{{ end }}"
|
||||
{{- end -}}
|
||||
/> <!-- Closing img tag -->
|
||||
{{- if .Get "link" }}</a>{{ end -}}
|
||||
{{- if or (or (.Get "title") (.Get "caption")) (.Get "attr") -}}
|
||||
<figcaption>
|
||||
{{ with (.Get "title") -}}
|
||||
<h4>{{ . }}</h4>
|
||||
{{- end -}}
|
||||
{{- if or (.Get "caption") (.Get "attr") -}}<p>
|
||||
{{- .Get "caption" | markdownify -}}
|
||||
{{- with .Get "attrlink" }}
|
||||
<a href="{{ . }}">
|
||||
{{- end -}}
|
||||
{{- .Get "attr" | markdownify -}}
|
||||
{{- if .Get "attrlink" }}</a>{{ end }}</p>
|
||||
{{- end }}
|
||||
</figcaption>
|
||||
{{- end }}
|
||||
</figure>
|
||||
```
|
||||
|
||||
From the top - Hugo creates a figure with a certain set of classes, the default `image-shortcode`
|
||||
but also classes for `frame`, `wide` and `float` to further decorate the image. Then it applies
|
||||
direct styling for `width` and `height`, optionally inserts a link (something I had missed out on in
|
||||
Jekyll), then inlines the `<img>` tag with an `alt` or (markdown based!) `caption`. It then reuses
|
||||
the `caption` or `title` or `attr` variables to assemble a `<figcaption>` block. I absolutely love it!
|
||||
|
||||
I've rather consistently placed my images by themselves, on a single line, and they all have at
|
||||
least one style (be it `width`, or `float`), so it's really straightforward to rewrite this with a
|
||||
little bit of Python:
|
||||
|
||||
```
|
||||
import re
import sys

def convert_image(line):
|
||||
p = re.compile(r'^!\[(.+)\]\((.+)\){:\s*(.*)}')
|
||||
m = p.match(line)
|
||||
if not m:
|
||||
return False
|
||||
|
||||
alt=m.group(1)
|
||||
src=m.group(2)
|
||||
style=m.group(3)
|
||||
|
||||
image_line = "{{</* image "
|
||||
if sm := re.search(r'width:\s*(\d+px)', style):
|
||||
image_line += f'width="{sm.group(1)}" '
|
||||
if sm := re.search(r'float:\s*(\w+)', style):
|
||||
image_line += f'float="{sm.group(1)}" '
|
||||
image_line += f'src="{src}" alt="{alt}" */>}}}}'
|
||||
|
||||
print(image_line)
|
||||
return True
|
||||
|
||||
with open(sys.argv[1], "r", encoding="utf-8") as file_handle:
|
||||
for line in file_handle.readlines():
|
||||
if not convert_image(line):
|
||||
print(line.rstrip())
|
||||
```
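To make that concrete, a hypothetical article line before and after conversion (path and styling invented for the example, but this is exactly the rewrite the script performs):

```
Jekyll: ![Frontends](/assets/frontends/nginx.png){: style="width:400px; float: right; margin: 1em;"}
Hugo:   {{</* image width="400px" float="right" src="/assets/frontends/nginx.png" alt="Frontends" */>}}
```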
|
||||
|
||||
### Hugo: ref
|
||||
|
||||
In Hugo, the idiomatic way to reference another document in the corpus is with the builtin `ref`
|
||||
shortcode, requiring a single argument: the path to a content document, with or without a file
|
||||
extension, with or without an anchor. Paths without a leading / are first resolved relative to the
|
||||
current page, then to the remainder of the site. This is super cool, because I can essentially
|
||||
reference any file by just its name!
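For example, assuming an article file named `2024-03-06-vpp-babel-1.md` under `content/articles/`, both of these resolve to the same page:

```
[[VPP Babel]({{</* ref 2024-03-06-vpp-babel-1 */>}})]
[[VPP Babel]({{</* ref "/articles/2024-03-06-vpp-babel-1.md" */>}})]
```

Converting all of the Jekyll `post_url` expansions then boils down to a one-liner: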
|
||||
|
||||
```
|
||||
for fn in $(find content/ -name \*.md); do
|
||||
sed -i -r 's/{%[ ]?post_url (.*)[ ]?%}/{{</* ref \1 */>}}/' $fn
|
||||
done
|
||||
```
|
||||
|
||||
And with that, the converted markdown from Jekyll renders perfectly in Hugo. Of course, other sites
|
||||
may use other templating commands, but for [[IPng.ch](https://ipng.ch)], these were the only two
|
||||
special cases.
|
||||
|
||||
### Hugo: URL redirects
|
||||
|
||||
It is a hard requirement for me to keep the same URLs that I had from Jekyll. Luckily, this is a
|
||||
trivial matter for Hugo, as it supports URL aliases in the _frontmatter_. Jekyll will add a file
|
||||
extension to the article _slugs_, while Hugo uses only the directory and serves an `index.html` from
|
||||
it. Also, the default for Hugo is to put content in a different directory.
|
||||
|
||||
The first change I make is to the main `hugo.toml` config file:
|
||||
|
||||
```
|
||||
[permalinks]
|
||||
articles = "/s/articles/:year/:month/:day/:slug"
|
||||
```
|
||||
|
||||
That solves the main directory problem, as back then, I chose `s/articles/` in Jekyll. Then, adding
|
||||
the URL redirect is a simple matter of looking up which filename Jekyll ultimately used, and adding
|
||||
a little frontmatter at the top of each article, for example my [[VPP #1]({{< ref
|
||||
2021-08-12-vpp-1 >}})] article would get this addition:
|
||||
|
||||
```
|
||||
---
|
||||
date: "2021-08-12T11:17:54Z"
|
||||
title: VPP Linux CP - Part1
|
||||
aliases:
|
||||
- /s/articles/2021/08/12/vpp-1.html
|
||||
---
|
||||
```
|
||||
|
||||
Hugo by default renders it in `/s/articles/2021/08/12/vpp-linux-cp-part1/index.html` but the
|
||||
addition of the `alias` makes it also generate a drop-in placeholder HTML page that offers a
|
||||
permanent redirect, cleverly setting `noindex` for web crawlers and offering the `canonical` link
|
||||
to the new location:
|
||||
|
||||
```
|
||||
$ curl https://ipng.ch/s/articles/2021/08/12/vpp-1.html
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-us">
|
||||
<head>
|
||||
<title>https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/</title>
|
||||
<link rel="canonical" href="https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/">
|
||||
<meta name="robots" content="noindex">
|
||||
<meta charset="utf-8">
|
||||
<meta http-equiv="refresh" content="0; url=https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/">
|
||||
</head>
|
||||
</html>
|
||||
```
|
||||
|
||||
### Hugo: Asciinema
|
||||
|
||||
One thing that I always wanted to add is the ability to inline [[Asciinema](https://asciinema.org)]
|
||||
screen recordings. First, I take a look at what is needed to serve Asciinema: One Javascript file,
|
||||
and one CSS file, followed by a named `<div>` which invokes the Javascript. Armed with that
|
||||
knowledge, I dive into the `shortcode` language a little bit:
|
||||
|
||||
```
|
||||
$ cat themes/hugo-theme-ipng/layouts/shortcodes/asciinema.html
|
||||
<div id='{{ .Get "src" | replaceRE "[[:^alnum:]]" "" }}'></div>
|
||||
<script>
|
||||
AsciinemaPlayer.create("{{ .Get "src" }}",
|
||||
document.getElementById('{{ .Get "src" | replaceRE "[[:^alnum:]]" "" }}'));
|
||||
</script>
|
||||
```
|
||||
|
||||
This file creates the `id` of the `<div>` by means of stripping all non-alphanumeric characters from
|
||||
the `src` argument of the _shortcode_. So if I were to create an `{{</* asciinema
|
||||
src='/casts/my.cast' */>}}`, the resulting DIV will be uniquely called `castsmycast`. This way, I
|
||||
can add multiple screencasts in the same document, which is dope.
|
||||
|
||||
But, as I now know, I need to load some CSS and JS so that the `AsciinemaPlayer` class becomes
|
||||
available. For this, I use a relatively new feature in Hugo, which allows for `params` to be set in
|
||||
the frontmatter, for example in the [[VPP OSPF #2]({{< ref 2024-06-22-vpp-ospf-2 >}})] article:
|
||||
|
||||
```
|
||||
---
|
||||
date: "2024-06-22T09:17:54Z"
|
||||
title: VPP with loopback-only OSPFv3 - Part 2
|
||||
aliases:
|
||||
- /s/articles/2024/06/22/vpp-ospf-2.html
|
||||
params:
|
||||
asciinema: true
|
||||
---
|
||||
```
|
||||
|
||||
The presence of that `params.asciinema` can be used in any page, including the HTML skeleton of the
|
||||
theme, like so:
|
||||
|
||||
```
|
||||
$ cat themes/hugo-theme-ipng/layouts/partials/head.html
|
||||
<head>
|
||||
...
|
||||
{{ if eq .Params.asciinema true -}}
|
||||
<link rel="stylesheet" type="text/css" href="{{ "css/asciinema-player.css" | relURL }}" />
|
||||
<script src="{{ "js/asciinema-player.min.js" | relURL }}"></script>
|
||||
{{- end }}
|
||||
</head>
|
||||
```
|
||||
|
||||
Now all that's left for me to do is drop the two Asciinema player files in their respective theme
|
||||
directories, and for each article that wants to use an Asciinema, set the `param` and it'll ship the
|
||||
CSS and Javascript to the browser. I think I'm going to have a good relationship with Hugo :)
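For completeness, the two player files simply go into the theme's static asset directories that `head.html` references above, assuming the usual Hugo `static/` layout:

```
$ cp asciinema-player.css themes/hugo-theme-ipng/static/css/
$ cp asciinema-player.min.js themes/hugo-theme-ipng/static/js/
```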
|
||||
|
||||
### Gitea: Large File Support
|
||||
|
||||
One mistake I made with the old Jekyll-based website is that I checked in all of the images and
|
||||
binary files directly into Git. This bloats the repository and is otherwise completely unnecessary.
|
||||
For this new repository, I enable [[Git LFS](https://git-lfs.com/)], which is available for OpenBSD
|
||||
(packages), Debian (apt) and MacOS (homebrew). Turning this on is very simple:
|
||||
|
||||
```
|
||||
$ brew install git-lfs
|
||||
$ cd ipng.ch
|
||||
$ git lfs install
|
||||
$ for i in gz png gif jpg jpeg tgz zip; do \
|
||||
    git lfs track "*.$i"; \
|
||||
    git lfs migrate import --everything --include="*.$i"; \
|
||||
done
|
||||
$ git push --force --all
|
||||
```
|
||||
|
||||
The `force` push rewrites the history of the repo to reference the binary blobs in LFS instead of
|
||||
directly in the repo. As a result, the size of the repository greatly shrinks, and handling it
|
||||
becomes easier once it grows. A really nice feature!
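A quick way to sanity-check the result afterwards (output omitted, it will differ per repository):

```
$ git lfs ls-files | head -3
$ git count-objects -vH
```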
|
||||
|
||||
### Gitea: CI/CD with Drone
|
||||
|
||||
At IPng, I run a [[Gitea](https://gitea.io)] server, which is one of the coolest pieces of open
|
||||
source that I use on a daily basis. There's a very clean integration of a continuous integration
|
||||
tool called [[Drone](https://drone.io/)] and these two tools are literally made for each other.
|
||||
Drone can be enabled for any Git repo in Gitea, and given the presence of a `.drone.yml` file,
|
||||
execute a set of steps upon repository events, called _triggers_. It can then run a sequence of
|
||||
steps, hermetically in a Docker container called a _drone-runner_, which first checks out the
|
||||
repository at the latest commit, and then does whatever I'd like with it. I'd like to build and
|
||||
distribute a Hugo website, please!
|
||||
|
||||
As it turns out, there is a [[Drone Hugo](https://plugins.drone.io/plugins/hugo)] plugin available,
|
||||
but it seems to be very outdated. Luckily, this being open source and all, I can download the source
|
||||
on [[GitHub](https://github.com/drone-plugins/drone-hugo)], and in the `Dockerfile`, bump the Alpine
|
||||
version, the Go version and build the latest Hugo release, which is 0.130.1 at the moment. I really
|
||||
do need this version, because the `params` feature was introduced in 0.123 and the upstream package
|
||||
is still for 0.77 -- which is about four years old. Ouch!
|
||||
|
||||
I build a Docker image and upload it to my private repo at IPng, which is also hosted on Gitea, by
|
||||
the way. As I said, it really is a great piece of kit! In case anybody else would like to give it a
|
||||
whirl, ping me on Mastodon or e-mail and I'll upload one to public Docker Hub as well.
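The build-and-push itself is nothing special, roughly this with my own registry path and tag (the interesting part is the Dockerfile version bumps described above):

```
$ git clone https://github.com/drone-plugins/drone-hugo && cd drone-hugo
$ docker build -t git.ipng.ch/ipng/drone-hugo:release-0.130.0 .
$ docker push git.ipng.ch/ipng/drone-hugo:release-0.130.0
```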
|
||||
|
||||
### Putting it all together
|
||||
|
||||
With Drone activated for this repo, and the Drone Hugo plugin built with a new version, I can submit
|
||||
the following file to the root directory of the `ipng.ch` repository:
|
||||
|
||||
|
||||
```
|
||||
$ cat .drone.yml
|
||||
kind: pipeline
|
||||
name: default
|
||||
|
||||
steps:
|
||||
- name: git-lfs
|
||||
image: alpine/git
|
||||
commands:
|
||||
- git lfs install
|
||||
- git lfs pull
|
||||
- name: build
|
||||
image: git.ipng.ch/ipng/drone-hugo:release-0.130.0
|
||||
settings:
|
||||
hugo_version: 0.130.0
|
||||
extended: true
|
||||
- name: rsync
|
||||
image: drillster/drone-rsync
|
||||
settings:
|
||||
user: drone
|
||||
key:
|
||||
from_secret: drone_sshkey
|
||||
hosts:
|
||||
- nginx0.chrma0.net.ipng.ch
|
||||
- nginx0.chplo0.net.ipng.ch
|
||||
- nginx0.nlams1.net.ipng.ch
|
||||
- nginx0.nlams2.net.ipng.ch
|
||||
port: 22
|
||||
args: '-6u --delete-after'
|
||||
source: public/
|
||||
target: /var/www/ipng.ch/
|
||||
recursive: true
|
||||
secrets: [ drone_sshkey ]
|
||||
|
||||
image_pull_secrets:
|
||||
- git_ipng_ch_docker
|
||||
```
|
||||
|
||||
The file is relatively self-explanatory. Before my first step runs, Drone already checks out the
|
||||
repo in the current working directory of the Docker container. I then use the `alpine/git` image
|
||||
and run the `git lfs install` and `git lfs pull` commands to resolve the LFS symlinks into actual
|
||||
files by pulling those objects that are referenced (and, notably, not all historical versions of any
|
||||
binary file ever added to the repo).
|
||||
|
||||
Then, I run a step called `build` which invokes the Hugo Drone package that I created before.
|
||||
|
||||
Finally, I run a step called `rsync` which uses the `drillster/drone-rsync` image to rsync-over-ssh
|
||||
the files to the four NGINX servers running at IPng: two in Amsterdam, one in Geneva and one in
|
||||
Zurich.
|
||||
|
||||
One really cool feature is the use of so called _Drone Secrets_ which are references to locked
|
||||
secrets such as the SSH key, and, notably, the Docker Repository credentials, because Gitea at IPng
|
||||
does not run a public docker repo. Using secrets is nifty, because it allows me to safely check in the
|
||||
`.drone.yml` configuration file without leaking any specifics.
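For the record, adding such a secret with the Drone CLI looks roughly like this (repository slug and key path are illustrative, and the exact flags may differ per Drone version):

```
$ drone secret add --repository ipng/ipng.ch --name drone_sshkey --data @/home/drone/.ssh/id_ed25519
```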
|
||||
|
||||
### NGINX and SSL
|
||||
|
||||
Now that the website is automatically built and rsync'd to the webservers upon every `git merge`,
|
||||
all that's left for me to do is arm the webservers with SSL certificates. I actually wrote a whole
|
||||
story about specifically that, as for `*.ipng.ch` and `*.ipng.nl` and a bunch of others,
|
||||
periodically there is a background task that retrieves multiple wildcard certificates with Let's
|
||||
Encrypt, and distributes them to any server that needs them (like the NGINX cluster, or the Postfix
|
||||
cluster). I wrote about the [[Frontends]({{< ref 2023-03-17-ipng-frontends >}})], the spiffy
|
||||
[[DNS-01]({{< ref 2023-03-24-lego-dns01.md >}})] certificate subsystem, and the internal network
|
||||
called [[IPng Site Local]({{< ref 2023-03-11-mpls-core >}})] each in their own articles, so I won't
|
||||
repeat that information here.
|
||||
|
||||
## The Results
|
||||
|
||||
The results are really cool, as I'll demonstrate in this video. I can just submit and merge this
|
||||
change, and it'll automatically kick off a build and push. Take a look at this video which was
|
||||
performed in real time as I pushed this very article live:
|
||||
|
||||
{{< video src="https://ipng.ch/media/vdo/hugo-drone.mp4" >}}
|
||||
238
content/articles/2024-09-03-asr9001.md
Normal file
@@ -0,0 +1,238 @@
|
||||
---
|
||||
date: "2024-09-03T13:07:54Z"
|
||||
title: Loadtest notes, ASR9001
|
||||
draft: true
|
||||
---
|
||||
|
||||
### L2 point-to-point (L2XC) config
|
||||
|
||||
```
|
||||
interface TenGigE0/0/0/0
|
||||
mtu 9216
|
||||
load-interval 30
|
||||
l2transport
|
||||
!
|
||||
!
|
||||
interface TenGigE0/0/0/1
|
||||
mtu 9216
|
||||
load-interval 30
|
||||
l2transport
|
||||
!
|
||||
!
|
||||
interface TenGigE0/0/0/2
|
||||
mtu 9216
|
||||
load-interval 30
|
||||
l2transport
|
||||
!
|
||||
!
|
||||
interface TenGigE0/0/0/3
|
||||
mtu 9216
|
||||
load-interval 30
|
||||
l2transport
|
||||
!
|
||||
!
|
||||
|
||||
|
||||
...
|
||||
l2vpn
|
||||
load-balancing flow src-dst-ip
|
||||
logging
|
||||
bridge-domain
|
||||
pseudowire
|
||||
!
|
||||
xconnect group LoadTest
|
||||
p2p pair0
|
||||
interface TenGigE0/0/2/0
|
||||
interface TenGigE0/0/2/1
|
||||
!
|
||||
p2p pair1
|
||||
interface TenGigE0/0/2/2
|
||||
interface TenGigE0/0/2/3
|
||||
!
|
||||
...
|
||||
```
|
||||
|
||||
|
||||
### L2 Bridge-Domain
|
||||
|
||||
```
|
||||
l2vpn
|
||||
bridge group LoadTestp
|
||||
bridge-domain bd0
|
||||
interface TenGigE0/0/0/0
|
||||
!
|
||||
interface TenGigE0/0/0/1
|
||||
!
|
||||
!
|
||||
bridge-domain bd1
|
||||
interface TenGigE0/0/0/2
|
||||
!
|
||||
interface TenGigE0/0/0/3
|
||||
!
|
||||
!
|
||||
...
|
||||
```
|
||||
```
RP/0/RSP0/CPU0:micro-fridge#show l2vpn forwarding bridge-domain mac-address location 0/0/CPU0
|
||||
Sat Aug 31 12:09:08.957 UTC
|
||||
Mac Address Type Learned from/Filtered on LC learned Resync Age Mapped to
|
||||
--------------------------------------------------------------------------------
|
||||
9c69.b461.fcf2 dynamic Te0/0/0/0 0/0/CPU0 0d 0h 0m 14s N/A
|
||||
9c69.b461.fcf3 dynamic Te0/0/0/1 0/0/CPU0 0d 0h 0m 2s N/A
|
||||
001b.2155.1f11 dynamic Te0/0/0/2 0/0/CPU0 0d 0h 0m 0s N/A
|
||||
001b.2155.1f10 dynamic Te0/0/0/3 0/0/CPU0 0d 0h 0m 15s N/A
|
||||
001b.21bc.47a4 dynamic Te0/0/1/0 0/0/CPU0 0d 0h 0m 6s N/A
|
||||
001b.21bc.47a5 dynamic Te0/0/1/1 0/0/CPU0 0d 0h 0m 21s N/A
|
||||
9c69.b461.ff41 dynamic Te0/0/1/2 0/0/CPU0 0d 0h 0m 16s N/A
|
||||
9c69.b461.ff40 dynamic Te0/0/1/3 0/0/CPU0 0d 0h 0m 10s N/A
|
||||
001b.2155.1d1d dynamic Te0/0/2/0 0/0/CPU0 0d 0h 0m 9s N/A
|
||||
001b.2155.1d1c dynamic Te0/0/2/1 0/0/CPU0 0d 0h 0m 16s N/A
|
||||
001b.2155.1e08 dynamic Te0/0/2/2 0/0/CPU0 0d 0h 0m 4s N/A
|
||||
001b.2155.1e09 dynamic Te0/0/2/3 0/0/CPU0 0d 0h 0m 11s N/A
|
||||
```
|
||||
|
||||
Interesting finding: after a bridge-domain overload occurs, forwarding pretty much stops:
|
||||
```
|
||||
Te0/0/0/0:
|
||||
30 second input rate 6931755000 bits/sec, 14441158 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
Te0/0/0/1:
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 19492000 bits/sec, 40609 packets/sec
|
||||
|
||||
Te0/0/0/2:
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 19720000 bits/sec, 41084 packets/sec
|
||||
Te0/0/0/3:
|
||||
30 second input rate 6931728000 bits/sec, 14441100 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
|
||||
... and so on
|
||||
|
||||
30 second input rate 6931558000 bits/sec, 14440748 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 12627000 bits/sec, 26307 packets/sec
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 12710000 bits/sec, 26479 packets/sec
|
||||
30 second input rate 6931542000 bits/sec, 14440712 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 19196000 bits/sec, 39992 packets/sec
|
||||
30 second input rate 6931651000 bits/sec, 14440938 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
30 second input rate 6931658000 bits/sec, 14440958 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 13167000 bits/sec, 27431 packets/sec
|
||||
```
|
||||
|
||||
MPLS enabled test:
|
||||
```
|
||||
arp vrf default 100.64.0.2 001b.2155.1e08 ARPA
|
||||
arp vrf default 100.64.1.2 001b.2155.1e09 ARPA
|
||||
arp vrf default 100.64.2.2 001b.2155.1d1c ARPA
|
||||
arp vrf default 100.64.3.2 001b.2155.1d1d ARPA
|
||||
arp vrf default 100.64.4.2 001b.21bc.47a4 ARPA
|
||||
arp vrf default 100.64.5.2 001b.21bc.47a5 ARPA
|
||||
arp vrf default 100.64.6.2 9c69.b461.fcf2 ARPA
|
||||
arp vrf default 100.64.7.2 9c69.b461.fcf3 ARPA
|
||||
arp vrf default 100.64.8.2 001b.2155.1f10 ARPA
|
||||
arp vrf default 100.64.9.2 001b.2155.1f11 ARPA
|
||||
arp vrf default 100.64.10.2 9c69.b461.ff40 ARPA
|
||||
arp vrf default 100.64.11.2 9c69.b461.ff41 ARPA
|
||||
|
||||
router static
|
||||
address-family ipv4 unicast
|
||||
0.0.0.0/0 198.19.5.1
|
||||
16.0.0.0/24 100.64.0.2
|
||||
16.0.1.0/24 100.64.2.2
|
||||
16.0.2.0/24 100.64.4.2
|
||||
16.0.3.0/24 100.64.6.2
|
||||
16.0.4.0/24 100.64.8.2
|
||||
16.0.5.0/24 100.64.10.2
|
||||
48.0.0.0/24 100.64.1.2
|
||||
48.0.1.0/24 100.64.3.2
|
||||
48.0.2.0/24 100.64.5.2
|
||||
48.0.3.0/24 100.64.7.2
|
||||
48.0.4.0/24 100.64.9.2
|
||||
48.0.5.0/24 100.64.11.2
|
||||
!
|
||||
!
|
||||
|
||||
mpls static
|
||||
interface TenGigE0/0/0/0
|
||||
interface TenGigE0/0/0/1
|
||||
interface TenGigE0/0/0/2
|
||||
interface TenGigE0/0/0/3
|
||||
interface TenGigE0/0/1/0
|
||||
interface TenGigE0/0/1/1
|
||||
interface TenGigE0/0/1/2
|
||||
interface TenGigE0/0/1/3
|
||||
interface TenGigE0/0/2/0
|
||||
interface TenGigE0/0/2/1
|
||||
interface TenGigE0/0/2/2
|
||||
interface TenGigE0/0/2/3
|
||||
address-family ipv4 unicast
|
||||
local-label 16 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/2/3 100.64.1.2 out-label 17
|
||||
!
|
||||
!
|
||||
local-label 17 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/2/2 100.64.0.2 out-label 16
|
||||
!
|
||||
!
|
||||
local-label 18 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/2/0 100.64.3.2 out-label 19
|
||||
!
|
||||
!
|
||||
local-label 19 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/2/1 100.64.2.2 out-label 18
|
||||
!
|
||||
!
|
||||
local-label 20 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/1/1 100.64.5.2 out-label 21
|
||||
!
|
||||
!
|
||||
local-label 21 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/1/0 100.64.4.2 out-label 20
|
||||
!
|
||||
!
|
||||
local-label 22 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/0/1 100.64.7.2 out-label 23
|
||||
!
|
||||
!
|
||||
local-label 23 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/0/0 100.64.6.2 out-label 22
|
||||
!
|
||||
!
|
||||
local-label 24 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/0/2 100.64.9.2 out-label 25
|
||||
!
|
||||
!
|
||||
local-label 25 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/0/3 100.64.8.2 out-label 24
|
||||
!
|
||||
!
|
||||
local-label 26 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/1/2 100.64.11.2 out-label 27
|
||||
!
|
||||
!
|
||||
local-label 27 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/1/3 100.64.10.2 out-label 26
|
||||
!
|
||||
!
|
||||
!
|
||||
!
|
||||
```
|
||||
725
content/articles/2024-09-08-sflow-1.md
Normal file
@@ -0,0 +1,725 @@
|
||||
---
|
||||
date: "2024-09-08T12:51:23Z"
|
||||
title: 'VPP with sFlow - Part 1'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
|
||||
|
||||
In January of 2023, an uncomfortably long time ago at this point, an acquaintance of mine called
|
||||
Ciprian reached out to me after seeing my [[DENOG
|
||||
#14](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)] presentation. He was interested to learn about
|
||||
IPFIX and was asking if sFlow would be an option. At the time, there was a plugin in VPP called
|
||||
[[flowprobe](https://s3-docs.fd.io/vpp/24.10/cli-reference/clis/clicmd_src_plugins_flowprobe.html)]
|
||||
which is able to emit IPFIX records. Unfortunately I never really got it to work well in my tests,
|
||||
as either the records were corrupted, sub-interfaces didn't work, or the plugin would just crash the
|
||||
dataplane entirely. In the meantime, the folks at [[Netgate](https://netgate.com/)] submitted quite
|
||||
a few fixes to flowprobe, but it remains an expensive operation computationally. Wouldn't copying
|
||||
one in a thousand or ten thousand packet headers with flow _sampling_ be just as good?
|
||||
|
||||
In the months that followed, I discussed the feature with the incredible folks at
|
||||
[[inMon](https://inmon.com/)], the original designers and maintainers of the sFlow protocol and
|
||||
toolkit. Neil from inMon wrote a prototype and put it on [[GitHub](https://github.com/sflow/vpp)]
|
||||
but for lack of time I didn't manage to get it to work, which was largely my fault by the way.
|
||||
|
||||
However, I have a bit of time on my hands in September and October, and just a few weeks ago,
|
||||
my buddy Pavel from [[FastNetMon](https://fastnetmon.com/)] pinged that very dormant thread about
|
||||
sFlow being a potentially useful tool for anti DDoS protection using VPP. And I very much agree!
|
||||
|
||||
## sFlow: Protocol
|
||||
|
||||
Maintenance of the protocol is performed by the [[sFlow.org](https://sflow.org/)] consortium, the
|
||||
authoritative source of the sFlow protocol specifications. The current version of sFlow is v5.
|
||||
|
||||
sFlow, short for _sampled Flow_, works at the ethernet layer of the stack, where it inspects one in
|
||||
N datagrams (typically 1:1000 or 1:10000) going through the physical network interfaces of a device.
|
||||
On the device, an **sFlow Agent** does the sampling. For each sample the Agent takes, the first M
|
||||
bytes (typically 128) are copied into an sFlow Datagram. Sampling metadata is added, such as
|
||||
the ingress (or egress) interface and sampling process parameters. The Agent can then optionally add
|
||||
forwarding information (such as router source- and destination prefix, MPLS LSP information, BGP
|
||||
communities, and what-not). Finally the Agent will periodically read the octet and packet counters of
|
||||
physical network interface(s). Ultimately, the Agent will send the samples and additional
|
||||
information over the network as a UDP datagram, to an **sFlow Collector** for further processing.
|
||||
|
||||
sFlow has been specifically designed to take advantages of the statistical properties of packet
|
||||
sampling and can be modeled using statistical sampling theory. This means that the sFlow traffic
|
||||
monitoring system will always produce statistically quantifiable measurements. You can read more
|
||||
about it in Peter Phaal and Sonia Panchen's
|
||||
[[paper](https://sflow.org/packetSamplingBasics/index.htm)], I certainly did and my head spun a
|
||||
little bit at the math :)
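The rule of thumb I took away from it, quoting from memory so do double-check the paper: the relative error of a traffic class depends only on the number of samples c taken of that class, roughly:

```
error% ≈ 196 * sqrt(1/c)      (at 95% confidence)
c = 10'000 samples  ->  error ≈ 2%
```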
|
||||
|
||||
### sFlow: Netlink PSAMPLE
|
||||
|
||||
sFlow is meant to be a very _lightweight_ operation for the sampling equipment. It can typically be
|
||||
done in hardware, but there also exist several software implementations. One very clever thing, I
|
||||
think, is decoupling the sampler from the rest of the Agent. The Linux kernel has a packet sampling
|
||||
API called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)], which
|
||||
allows _producers_ to send samples to a certain _group_, and then allows _consumers_ to subscribe to
|
||||
samples of a certain _group_. The PSAMPLE API uses
|
||||
[[NetLink](https://docs.kernel.org/userspace-api/netlink/intro.html)] under the covers. The cool
|
||||
thing, for me anyway, is that I have a little bit of experience with Netlink due to my work on VPP's
|
||||
[[Linux Control Plane]({{< ref 2021-08-25-vpp-4 >}})] plugin.
|
||||
|
||||
The idea here is that some **sFlow Agent**, notably a VPP plugin, will be taking periodic samples
|
||||
from the physical network interfaces, and producing Netlink messages. Then, some other program,
|
||||
notably outside of VPP, can consume these messages and further handle them, creating UDP packets
|
||||
with sFlow samples and counters and other information, and sending them to an **sFlow Collector**
|
||||
somewhere else on the network.
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Warning" >}}
|
||||
|
||||
There's a handy utility called [[psampletest](https://github.com/sflow/psampletest)] which can
|
||||
subscribe to these PSAMPLE netlink groups and retrieve the samples. The first time I used all of
|
||||
this stuff, I wasn't aware of this utility and I kept on getting errors. It turns out, there's a
|
||||
kernel module that needs to be loaded: `modprobe psample` and `psampletest` helpfully does that for
|
||||
you [[ref](https://github.com/sflow/psampletest/blob/main/psampletest.c#L799)], so just make sure
|
||||
the module is loaded and added to `/etc/modules` before you spend as many hours as I did pulling out
|
||||
hair.
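In other words, on the box that will consume the samples, something along these lines before anything else:

```
$ sudo modprobe psample
$ echo psample | sudo tee -a /etc/modules
$ lsmod | grep psample
```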
|
||||
|
||||
## VPP: sFlow Plugin
|
||||
|
||||
For the purposes of my initial testing, I'll simply take a look at Neil's prototype on
|
||||
[[GitHub](https://github.com/sflow/vpp)] and see what I learn in terms of functionality and
|
||||
performance.
|
||||
|
||||
### sFlow Plugin: Anatomy
|
||||
|
||||
The design is purposefully minimal, to do all of the heavy lifting outside of the VPP dataplane. The
|
||||
plugin will create a new VPP _graph node_ called `sflow`, which the operator can insert after
|
||||
`device-input`, in other words, if enabled, the plugin will get a copy of all packets that are read
|
||||
from an input provider, such as `dpdk-input` or `rdma-input`. The plugin's job is to process the
|
||||
packet, and if it's not selected for sampling, just move it onwards to the next node, typically
|
||||
`ethernet-input`. Almost all of the interesting action is in `node.c`
|
||||
|
||||
The kicker is, that one in N packets will be selected to sample, after which:
|
||||
1. the ethernet header (`*en`) is extracted from the packet
|
||||
1. the input interface (`hw_if_index`) is extracted from the VPP buffer. Remember, sFlow works
|
||||
with physical network interfaces!
|
||||
1. if there are too many samples from this worker thread being worked on, it is discarded and an
|
||||
error counter is incremented. This protects the main thread from being slammed with samples if
|
||||
there are simply too many being fished out of the dataplane.
|
||||
1. Otherwise:
|
||||
* a new `sflow_sample_t` is created, with all the sampling process metadata filled in
|
||||
* the first 128 bytes of the packet are copied into the sample
|
||||
* an RPC is dispatched to the main thread, which will send the sample to the PSAMPLE channel
|
||||
|
||||
Both a debug CLI command and API call are added:
|
||||
|
||||
```
|
||||
sflow enable-disable <interface-name> [<sampling_N>]|[disable]
|
||||
```
|
||||
|
||||
Some observations:
|
||||
|
||||
First off, the sampling_N in Neil's demo is a global rather than per-interface setting. It would
|
||||
make sense to make this be per-interface, as routers typically have a mixture of 1G/10G and faster
|
||||
100G network cards available. It was a surprise when I set one interface to 1:1000 and the other to
|
||||
1:10000 and then saw the first interface change its sampling rate also. It's a small thing, and
|
||||
will not be an issue to change.
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
Secondly, sending the RPC to main uses `vl_api_rpc_call_main_thread()`, which
|
||||
requires a _spinlock_ in `src/vlibmemory/memclnt_api.c:649`. I'm somewhat worried that when many
|
||||
samples are sent from many threads, there will be lock contention and performance will suffer.
|
||||
|
||||
### sFlow Plugin: Functional
|
||||
|
||||
I boot up the [[IPng Lab]({{< ref 2022-10-14-lab-1 >}})] and install a bunch of sFlow tools on it,
|
||||
make sure the `psample` kernel module is loaded. In this first test I'll take a look at
|
||||
tablestakes. I compile VPP with the sFlow plugin, and enable that plugin in `startup.conf` on each
|
||||
of the four VPP routers. For reference, the Lab looks like this:
|
||||
|
||||
{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
|
||||
|
||||
What I'll do is start an `iperf3` server on `vpp0-3` and then hit it from `vpp0-0`, to generate
|
||||
a few TCP traffic streams back and forth, which will be traversing `vpp0-2` and `vpp0-1`, like so:
|
||||
|
||||
```
|
||||
pim@vpp0-3:~ $ iperf3 -s -D
|
||||
pim@vpp0-0:~ $ iperf3 -c vpp0-3.lab.ipng.ch -t 86400 -P 10 -b 10M
|
||||
```
|
||||
|
||||
### Configuring VPP for sFlow
|
||||
|
||||
While this `iperf3` is running, I'll log on to `vpp0-2` to take a closer look. The first thing I do,
|
||||
is turn on packet sampling on `vpp0-2`'s interface that points at `vpp0-3`, which is `Gi10/0/1`, and
|
||||
the interface that points at `vpp0-0`, which is `Gi10/0/0`. That's easy enough, and I will use a
|
||||
sampling rate of 1:1000 as these interfaces are GigabitEthernet:
|
||||
|
||||
```
|
||||
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/0 1000
|
||||
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/1 1000
|
||||
root@vpp0-2:~# vppctl show run | egrep '(Name|sflow)'
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 5656 24168 0 9.01e2 4.27
|
||||
```
|
||||
|
||||
Nice! VPP inserted the `sflow` node between `dpdk-input` and `ethernet-input` where it can do its
|
||||
business. But is it sending data? To answer this question, I can first take a look at the
|
||||
`psampletest` tool:
|
||||
|
||||
```
|
||||
root@vpp0-2:~# psampletest
|
||||
pstest: modprobe psample returned 0
|
||||
pstest: netlink socket number = 1637
|
||||
pstest: getFamily
|
||||
pstest: generic netlink CMD = 1
|
||||
pstest: generic family name: psample
|
||||
pstest: generic family id: 32
|
||||
pstest: psample attr type: 4 (nested=0) len: 8
|
||||
pstest: psample attr type: 5 (nested=0) len: 8
|
||||
pstest: psample attr type: 6 (nested=0) len: 24
|
||||
pstest: psample multicast group id: 9
|
||||
pstest: psample multicast group: config
|
||||
pstest: psample multicast group id: 10
|
||||
pstest: psample multicast group: packets
|
||||
pstest: psample found group packets=10
|
||||
pstest: joinGroup 10
|
||||
pstest: received Netlink ACK
|
||||
pstest: joinGroup 10
|
||||
pstest: set headers...
|
||||
pstest: serialize...
|
||||
pstest: print before sending...
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=7 out=9 n=1000 seq=1 pktlen=1514 hdrlen=31 pkt=0x558c08ba4958 q=3 depth=33333333 delay=123456
|
||||
pstest: send...
|
||||
pstest: send_psample getuid=0 geteuid=0
|
||||
pstest: sendmsg returned 140
|
||||
pstest: free...
|
||||
pstest: start read loop...
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600320 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600321 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600322 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=2 out=0 n=1000 seq=600423 pktlen=66 hdrlen=70 pkt=0x7ffdb0d5a1e8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600324 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
```
|
||||
|
||||
I am amazed! The `psampletest` output shows a few packets, considering I'm asking `iperf3` to push
|
||||
100Mbit using 9000 byte jumboframes (which would be something like 1400 packets/second), I can
|
||||
expect two or three samples per second. I immediately notice a few things:
|
||||
|
||||
***1. Network Namespace***: The Netlink sampling channel belongs to a network _namespace_. The VPP
|
||||
process is running in the _default_ netns, so its PSAMPLE netlink messages will be in that namespace.
|
||||
Thus, the `psampletest` and other tools must also run in that namespace. I mention this because in
|
||||
Linux CP, often times the controlplane interfaces are created in a dedicated `dataplane` network
|
||||
namespace.
|
||||
|
||||
***2. pktlen and hdrlen***: The pktlen is wrong, and this is a bug. In VPP, packets are put into
|
||||
buffers of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for
|
||||
the same packet. The packet length here ought to be 9000 in one direction. Looking at the `in=2`
|
||||
packet with length 66, that looks like a legitimate ACK packet on the way back. But why is the
|
||||
hdrlen set to 70 there? I'm going to want to ask Neil about that.
|
||||
|
||||
***3. ingress and egress***: The `in=1` and one packet with `in=2` represent the input `hw_if_index`
|
||||
which is the ifIndex that VPP assigns to its devices. And looking at `show interfaces`, indeed
|
||||
number 1 corresponds with `GigabitEthernet10/0/0` and 2 is `GigabitEthernet10/0/1`, which checks
|
||||
out:
|
||||
```
|
||||
root@vpp0-2:~# vppctl show int
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
|
||||
GigabitEthernet10/0/0 1 up 9000/0/0/0 rx packets 469552764
|
||||
rx bytes 4218754400233
|
||||
tx packets 133717230
|
||||
tx bytes 8887341013
|
||||
drops 6050
|
||||
ip4 469321635
|
||||
ip6 225164
|
||||
GigabitEthernet10/0/1 2 up 9000/0/0/0 rx packets 133527636
|
||||
rx bytes 8816920909
|
||||
tx packets 469353481
|
||||
tx bytes 4218736200819
|
||||
drops 6060
|
||||
ip4 133489925
|
||||
ip6 29139
|
||||
|
||||
```
|
||||
|
||||
***4. ifIndexes are orthogonal***: These `in=1` or `in=2` ifIndex numbers are constructs of the VPP
|
||||
dataplane. Notably, VPP's numbering of interface index is strictly _orthogonal_ to Linux, and it's
|
||||
not guaranteed that there even _exists_ an interface in Linux for the PHY upon which the sampling is
|
||||
happening. Said differently, `in=1` here is meant to reference VPP's `GigabitEthernet10/0/0`
|
||||
interface, but in Linux, `ifIndex=1` is a completely different interface (`lo`) in the default
|
||||
network namespace. Similarly `in=2` for VPP's `Gi10/0/1` interface corresponds to interface `enp1s0`
|
||||
in Linux:
|
||||
|
||||
```
|
||||
root@vpp0-2:~# ip link
|
||||
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
|
||||
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
||||
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
|
||||
link/ether 52:54:00:f0:01:20 brd ff:ff:ff:ff:ff:ff
|
||||
```
|
||||
|
||||
***5. Counters***: sFlow periodically polls the interface counters for all interfaces. It will
|
||||
normally use `/proc/net/` entries for that, but there are two problems with this:
|
||||
|
||||
1. There may not exist a Linux representation of the interface, for example if it's only doing L2
|
||||
bridging or cross connects in the VPP dataplane, and it does not have a Linux Control Plane
|
||||
interface, or `linux-cp` is not used at all.
|
||||
|
||||
1. Even if it does exist and it's the "correct" ifIndex in Linux, for example if the _Linux
|
||||
Interface Pair_'s tuntap `host_vif_index` index is used, even then the statistics counters in the
|
||||
Linux representation will only count packets and octets of _punted_ packets, that is to say, the
|
||||
stuff that LinuxCP has decided need to go to the Linux kernel through the TUN/TAP device. Important
|
||||
to note that east-west traffic that goes _through_ the dataplane, is never punted to Linux, and as
|
||||
such, the counters will be undershooting: only counting traffic _to_ the router, not _through_ the
|
||||
router.
|
||||
|
||||
### VPP sFlow: Performance
|
||||
|
||||
Now that I've shown that Neil's proof of concept works, I will take a better look at the performance
|
||||
of the plugin. I've made a mental note that the plugin sends RPCs from worker threads to the main
|
||||
thread to marshall the PSAMPLE messages out. I'd like to see how expensive that is, in general. So,
|
||||
I boot two Dell R730 machines in IPng's Lab and put them to work. The first machine will run
|
||||
Cisco's T-Rex loadtester with 8x 10Gbps ports (4x dual-port Intel 82599), while the second (identical)
|
||||
machine will run VPP, also with 8x 10Gbps ports (2x Intel X710-DA4).
|
||||
|
||||
I will test a bunch of things in parallel. First off, I'll test L2 (xconnect) and L3 (IPv4 routing),
|
||||
and secondly I'll test that with and without sFlow turned on. This gives me 8 ports to configure,
|
||||
and I'll start with the L2 configuration, as follows:
|
||||
|
||||
```
|
||||
vpp# set int state TenGigabitEthernet3/0/2 up
|
||||
vpp# set int state TenGigabitEthernet3/0/3 up
|
||||
vpp# set int state TenGigabitEthernet130/0/2 up
|
||||
vpp# set int state TenGigabitEthernet130/0/3 up
|
||||
vpp# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
|
||||
vpp# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
|
||||
vpp# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
|
||||
vpp# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
|
||||
```
|
||||
|
||||
Then, the L3 configuration looks like this:
|
||||
|
||||
```
|
||||
vpp# lcp create TenGigabitEthernet3/0/0 host-if xe0-0
|
||||
vpp# lcp create TenGigabitEthernet3/0/1 host-if xe0-1
|
||||
vpp# lcp create TenGigabitEthernet130/0/0 host-if xe1-0
|
||||
vpp# lcp create TenGigabitEthernet130/0/1 host-if xe1-1
|
||||
vpp# set int state TenGigabitEthernet3/0/0 up
|
||||
vpp# set int state TenGigabitEthernet3/0/1 up
|
||||
vpp# set int state TenGigabitEthernet130/0/0 up
|
||||
vpp# set int state TenGigabitEthernet130/0/1 up
|
||||
vpp# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
|
||||
vpp# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
|
||||
vpp# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
|
||||
vpp# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
|
||||
vpp# ip route add 16.0.0.0/24 via 100.64.0.0
|
||||
vpp# ip route add 48.0.0.0/24 via 100.64.1.0
|
||||
vpp# ip route add 16.0.2.0/24 via 100.64.4.0
|
||||
vpp# ip route add 48.0.2.0/24 via 100.64.5.0
|
||||
vpp# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
|
||||
vpp# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
|
||||
vpp# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
|
||||
vpp# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
|
||||
```
|
||||
|
||||
And finally, the Cisco T-Rex configuration looks like this:
|
||||
|
||||
```
|
||||
- version: 2
|
||||
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
|
||||
port_info:
|
||||
- src_mac: 00:1b:21:06:00:00
|
||||
dest_mac: 9c:69:b4:61:a1:dc
|
||||
- src_mac: 00:1b:21:06:00:01
|
||||
dest_mac: 9c:69:b4:61:a1:dd
|
||||
|
||||
- src_mac: 00:1b:21:83:00:00
|
||||
dest_mac: 00:1b:21:83:00:01
|
||||
- src_mac: 00:1b:21:83:00:01
|
||||
dest_mac: 00:1b:21:83:00:00
|
||||
|
||||
- src_mac: 00:1b:21:87:00:00
|
||||
dest_mac: 9c:69:b4:61:75:d0
|
||||
- src_mac: 00:1b:21:87:00:01
|
||||
dest_mac: 9c:69:b4:61:75:d1
|
||||
|
||||
- src_mac: 9c:69:b4:85:00:00
|
||||
dest_mac: 9c:69:b4:85:00:01
|
||||
- src_mac: 9c:69:b4:85:00:01
|
||||
dest_mac: 9c:69:b4:85:00:00
|
||||
```
|
||||
|
||||
A little note on the use of `ip neighbor` in VPP and specific `dest_mac` in T-Rex. In L2 mode,
|
||||
because the VPP interfaces will be in promiscuous mode and simply pass through any ethernet frame
|
||||
received on interface `Te3/0/2` and copy it out on `Te3/0/3` and vice-versa, there is no need to
|
||||
tinker with MAC addresses. But in L3 mode, the NIC will only accept ethernet frames addressed to its
|
||||
MAC address, so you can see that for the first port in T-Rex, I am setting `dest_mac:
|
||||
9c:69:b4:61:a1:dc` which is the MAC address of `Te3/0/0` on VPP. And then on the way out, if VPP
|
||||
wants to send traffic back to T-Rex, I'll give it a static ARP entry with `ip neighbor .. static`.
|
||||
|
||||
With that said, I can start a baseline loadtest like so:
|
||||
{{< image width="100%" src="/assets/sflow/trex-baseline.png" alt="Cisco T-Rex: baseline" >}}
|
||||
|
||||
T-Rex is sending 10Gbps out on all eight interfaces (four of which are L3 routing, and four of which
|
||||
are L2 xconnecting), using a packet size of 1514 bytes. This amounts to roughly 813Kpps per port, or a
|
||||
cool 6.51Mpps in total. And I can see that in this baseline configuration, the VPP router is happy to
|
||||
do the work.
|
||||
|
||||
I then enable sFlow on the second set of four ports, using a 1:1000 sampling rate:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000
|
||||
```
|
||||
|
||||
This should yield about 3'250 or so samples per second, and things look pretty great:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show err
|
||||
Count Node Reason Severity
|
||||
5034508 sflow sflow packets processed error
|
||||
4908 sflow sflow packets sampled error
|
||||
5034508 sflow sflow packets processed error
|
||||
5111 sflow sflow packets sampled error
|
||||
5034516 l2-output L2 output packets error
|
||||
5034516 l2-input L2 input packets error
|
||||
5034404 sflow sflow packets processed error
|
||||
4948 sflow sflow packets sampled error
|
||||
5034404 l2-output L2 output packets error
|
||||
5034404 l2-input L2 input packets error
|
||||
5034404 sflow sflow packets processed error
|
||||
4928 sflow sflow packets sampled error
|
||||
5034404 l2-output L2 output packets error
|
||||
5034404 l2-input L2 input packets error
|
||||
5034516 l2-output L2 output packets error
|
||||
5034516 l2-input L2 input packets error
|
||||
```
|
||||
|
||||
I can see that the `sflow packets sampled` is roughly 0.1% of the `sflow packets processed` which
|
||||
checks out. I can also see in `psampletest` a flurry of activity, so I'm happy:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ sudo psampletest
|
||||
...
|
||||
pstest: grp=1 in=9 out=0 n=1000 seq=63388 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=8 out=0 n=1000 seq=63389 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=11 out=0 n=1000 seq=63390 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=10 out=0 n=1000 seq=63391 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=11 out=0 n=1000 seq=63392 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
```
|
||||
|
||||
I confirm that all four `in` interfaces (8, 9, 10 and 11) are sending samples, and those indexes
|
||||
correctly correspond to the VPP dataplane's `sw_if_index` for `TenGig130/0/0 - 3`. Sweet! On this
|
||||
machine, each TenGig network interface has its own dedicated VPP worker thread. Considering I
|
||||
turned on sFlow sampling on four interfaces, I should see the cost I'm paying for the feature:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 3908218 14350684 0 9.05e1 3.67
|
||||
sflow active 3913266 14350680 0 1.11e2 3.67
|
||||
sflow active 3910828 14350687 0 1.08e2 3.67
|
||||
sflow active 3909274 14350692 0 5.66e1 3.67
|
||||
```
|
||||
|
||||
Alright, so for the 999 packets that went through and the one packet that got sampled, on average
|
||||
VPP is spending between 90 and 111 CPU cycles per packet, and the loadtest looks squeaky clean on
|
||||
T-Rex.
|
||||
|
||||
### VPP sFlow: Cost of passthru
|
||||
|
||||
I decide to take a look at two edge cases. What if there are no samples being taken at all, and the
|
||||
`sflow` node is merely passing through all packets to `ethernet-input`? To simulate this, I will set
|
||||
up a bizarrely high sampling rate, say one in ten million. I'll also make the T-Rex loadtester use
|
||||
only four ports, in other words, a unidirectional loadtest, and I'll make it go much faster by
|
||||
sending smaller packets, say 128 bytes:
|
||||
|
||||
```
|
||||
tui>start -f stl/ipng.py -p 0 2 4 6 -m 99% -t size=128
|
||||
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000 disable
|
||||
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10000000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10000000
|
||||
```
|
||||
|
||||
The loadtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the
|
||||
`sFlow` plugin is not sampling many packets:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show err
|
||||
Count Node Reason Severity
|
||||
59777084 sflow sflow packets processed error
|
||||
6 sflow sflow packets sampled error
|
||||
59777152 l2-output L2 output packets error
|
||||
59777152 l2-input L2 input packets error
|
||||
59777104 sflow sflow packets processed error
|
||||
6 sflow sflow packets sampled error
|
||||
59777104 l2-output L2 output packets error
|
||||
59777104 l2-input L2 input packets error
|
||||
|
||||
pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 8186642 369674664 0 1.35e1 45.16
|
||||
sflow active 25173660 369674696 0 1.97e1 14.68
|
||||
```
|
||||
Two observations:
|
||||
|
||||
1. One of these is busier than the other. Without looking further, I can already predict that the
|
||||
top one (doing 45.16 vectors/call) is the L3 thread. Reasoning: the L3 code path through the
|
||||
dataplane is a lot more expensive than 'merely' L2 XConnect. As such, the packets will spend more
|
||||
time, and therefore the iterations of the `dpdk-input` loop will be further apart in time. And
|
||||
because of that, it'll end up consuming more packets on each subsequent iteration, in order to catch
|
||||
up. The L2 path, on the other hand, is quicker and therefore will have fewer packets waiting on
|
||||
subsequent iterations of `dpdk-input`.
|
||||
|
||||
2. The `sflow` plugin spends between 13.5 and 19.7 CPU cycles shoveling the packets into
|
||||
`ethernet-input` without doing anything to them. That's pretty low! And the L3 path is a little bit
|
||||
more efficient per packet, which is very likely because it gets to amortize its L1/L2 CPU instruction
|
||||
cache over 45 packets each time it runs, while the L2 path can only amortize its instruction cache over
|
||||
15 or so packets each time it runs.
|
||||
|
||||
I let the loadtest run overnight, and the proof is in the pudding: sFlow enabled but not sampling
|
||||
works just fine:
|
||||
|
||||
{{< image width="100%" src="/assets/sflow/trex-passthru.png" alt="Cisco T-Rex: passthru" >}}
|
||||
|
||||
### VPP sFlow: Cost of sampling
|
||||
|
||||
The other interesting case is to figure out how much CPU it takes to execute the code path
|
||||
with the actual sampling. This one turns out a bit trickier to measure. While leaving the previous
|
||||
loadtest running at 33.5Mpps, I disable sFlow and then re-enable it at an abnormally _high_ ratio of
|
||||
1:10 packets:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10
|
||||
```
|
||||
|
||||
The T-Rex view immediately reveals that VPP is not doing very well, as the throughput went from
|
||||
33.5Mpps all the way down to 7.5Mpps. Ouch! Looking at the dataplane:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show err | grep sflow
|
||||
340502528 sflow sflow packets processed error
|
||||
12254462 sflow sflow packets dropped error
|
||||
22611461 sflow sflow packets sampled error
|
||||
422527140 sflow sflow packets processed error
|
||||
8533855 sflow sflow packets dropped error
|
||||
34235952 sflow sflow packets sampled error
|
||||
```
|
||||
|
||||
Ha, this new safeguard popped up: remember all the way at the beginning, I explained how there's a
|
||||
safety net in the `sflow` plugin that will pre-emptively drop the sample if the RPC channel towards
|
||||
the main thread is seeing too many outstanding RPCs? That's happening right now, under the moniker
|
||||
`sflow packets dropped`, and it's roughly *half* of the samples.
|
||||
|
||||
My first attempt is to back off the loadtester to roughly 1.5Mpps per port (so 6Mpps in total, under the
|
||||
current limit of 7.5Mpps), but I'm disappointed: the VPP instance is now returning 665Kpps per port
|
||||
only, which is horrible, and it's still dropping samples.
|
||||
|
||||
My second attempt is to turn off all ports but the last pair (the L2XC port), which returns 930Kpps from
|
||||
the offered 1.5Mpps. VPP is clearly not having a good time here.
|
||||
|
||||
Finally, as a validation, I turn off all ports but the first pair (the L3 port, without sFlow), and
|
||||
ramp up the traffic to 8Mpps. Success (unsurprising to me). I also ramp up the second pair (the L2XC
|
||||
port, without sFlow), VPP forwards all 16Mpps and is happy again.
|
||||
|
||||
Once I turn on the third pair (the L3 port, _with_ sFlow), even at 1Mpps, the whole situation
|
||||
regresses again: First two ports go down from 8Mpps to 5.2Mpps each; the third (offending) port
|
||||
delivers 740Kpps out of 1Mpps. Clearly, there's some work to do under high load situations!
|
||||
|
||||
#### Reasoning about the bottle neck
|
||||
|
||||
But how expensive is sending samples, really? To try to get at least some pseudo-scientific answer I
|
||||
turn off all ports again, and ramp up the one port pair with (L3 + sFlow at 1:10 ratio) to full line
|
||||
rate: that is 64 byte packets at 14.88Mpps:
|
||||
|
||||
```
|
||||
tui>stop
|
||||
tui>start -f stl/ipng.py -m 100% -p 4 -t size=64
|
||||
```
|
||||
|
||||
VPP is now on the struggle bus and is returning 3.16Mpps or 21% of that. But, I think it'll give me
|
||||
some reasonable data to try to feel out where the bottleneck is.
|
||||
|
||||
```
|
||||
Thread 2 vpp_wk_1 (lcore 3)
|
||||
Time 6.3, 10 sec internal node vector rate 256.00 loops/sec 27310.73
|
||||
vector rates in 3.1607e6, out 3.1607e6, drop 0.0000e0, punt 0.0000e0
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
TenGigabitEthernet130/0/1-outp active 77906 19943936 0 5.79e0 256.00
|
||||
TenGigabitEthernet130/0/1-tx active 77906 19943936 0 6.88e1 256.00
|
||||
dpdk-input polling 77906 19943936 0 4.41e1 256.00
|
||||
ethernet-input active 77906 19943936 0 2.21e1 256.00
|
||||
ip4-input active 77906 19943936 0 2.05e1 256.00
|
||||
ip4-load-balance active 77906 19943936 0 1.07e1 256.00
|
||||
ip4-lookup active 77906 19943936 0 1.98e1 256.00
|
||||
ip4-rewrite active 77906 19943936 0 1.97e1 256.00
|
||||
sflow active 77906 19943936 0 6.14e1 256.00
|
||||
|
||||
pim@hvn6-lab:pim# vppctl show err | grep sflow
|
||||
551357440 sflow sflow packets processed error
|
||||
19829380 sflow sflow packets dropped error
|
||||
36613544 sflow sflow packets sampled error
|
||||
```
|
||||
|
||||
OK, the `sflow` plugin saw 551M packets, selected 36.6M of them for sampling, but ultimately only
|
||||
sent RPCs to the main thread for 16.8M samples after having dropped 19.8M of them. There are three
|
||||
code paths, each one extending the other:
|
||||
|
||||
1. Super cheap: pass through. I already learned that it takes about X=13.5 CPU cycles to pass
|
||||
through a packet.
|
||||
1. Very cheap: select sample and construct the RPC, but toss it, costing Y CPU cycles.
|
||||
1. Expensive: select sample, and send the RPC. Z CPU cycles in worker, and another amount in main.
|
||||
|
||||
Now I don't know what Y is, but seeing as the selection only copies some data from the VPP buffer
|
||||
into a new `sflow_sample_t`, and it uses `clib_memcpy_fast()` for the sample header, I'm going to
|
||||
assume it's not _drastically_ more expensive than the super cheap case, so for simplicity I'll
|
||||
guesstimate that it takes Y=20 CPU cycles.
|
||||
|
||||
With that guess out of the way, I can see what the `sflow` plugin is consuming for the third case:
|
||||
|
||||
```
|
||||
AvgClocks = (Total * X + Sampled * Y + RPCSent * Z) / Total
|
||||
|
||||
61.4 = ( 551357440 * 13.5 + 36613544 * 20 + (36613544-19829380) * Z ) / 551357440
|
||||
61.4 = ( 7443325440 + 732270880 + 16784164 * Z ) / 551357440
|
||||
33853346816 = 7443325440 + 732270880 + 16784164 * Z
|
||||
25677750496 = 16784164 * Z
|
||||
Z = 1529
|
||||
```
|
||||
|
||||
Good to know! I find spending O(1500) cycles to send the sample pretty reasonable. However, for a
|
||||
dataplane that is trying to do 10Mpps per core, with the core running at 2.2GHz, there are really only 220
|
||||
CPU cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets
|
||||
feels dangerous to me.
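
To make that concrete with the numbers from this test (the 1:10 sampling ratio used here, and the
Z≈1529 cycles derived above):

```
Budget at 10Mpps on a 2.2GHz core:  2.2e9 / 10e6  = 220 cycles/packet
Amortized sampling cost at 1:10:    1529  / 10    ≈ 153 cycles/packet
```

In other words, the sample hand-off alone would consume roughly 70% of the per-packet budget, before
any actual forwarding work is done.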
|
||||
|
||||
Here's where I start my conjecture. If I count the CPU cycles spent in the table above, I will see
|
||||
273 CPU cycles spent on average per packet. The CPU in the VPP router is an `E5-2696 v4 @ 2.20GHz`,
|
||||
which means it should be able to do `2.2e9/273 = 8.06Mpps` per thread, more than double what I
|
||||
observe (3.16Mpps)! But, for all the `vector rates in` (3.1607e6), it also managed to emit the
|
||||
packets back out (same number: 3.1607e6).
|
||||
|
||||
So why isn't VPP getting more packets from DPDK? I poke around a bit and find an important clue:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed; \
|
||||
sleep 10; vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed
|
||||
rx missed 4065539464
|
||||
rx missed 4182788310
|
||||
```
|
||||
|
||||
In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. I already measured that it
|
||||
forwarded 3.16Mpps and you know what? 11.7 + 3.16 is precisely 14.88Mpps. All packets are accounted
|
||||
for! It's just, DPDK never managed to read them from the hardware: `sad-trombone.wav`
|
||||
|
||||
|
||||
As a validation, I turned off sFlow while keeping that one port at 14.88Mpps. Now, 10.8Mpps were
|
||||
delivered:
|
||||
|
||||
```
|
||||
Thread 2 vpp_wk_1 (lcore 3)
|
||||
Time 14.7, 10 sec internal node vector rate 256.00 loops/sec 40622.64
|
||||
vector rates in 1.0794e7, out 1.0794e7, drop 0.0000e0, punt 0.0000e0
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
TenGigabitEthernet130/0/1-outp active 620012 158723072 0 5.66e0 256.00
|
||||
TenGigabitEthernet130/0/1-tx active 620012 158723072 0 7.01e1 256.00
|
||||
dpdk-input polling 620012 158723072 0 4.39e1 256.00
|
||||
ethernet-input active 620012 158723072 0 1.56e1 256.00
|
||||
ip4-input-no-checksum active 620012 158723072 0 1.43e1 256.00
|
||||
ip4-load-balance active 620012 158723072 0 1.11e1 256.00
|
||||
ip4-lookup active 620012 158723072 0 2.00e1 256.00
|
||||
ip4-rewrite active 620012 158723072 0 2.02e1 256.00
|
||||
```
|
||||
|
||||
Total Clocks: 201 per packet; 2.2GHz/201 = 10.9Mpps, and I am observing 10.8Mpps. As [[North of the
|
||||
Border](https://www.youtube.com/c/NorthoftheBorder)] would say: "That's not just good, it's good
|
||||
_enough_!"
|
||||
|
||||
For completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps 🥰), and saw that
|
||||
about 29Mpps of that made it through. Interestingly, what was 3.16Mpps in the single-port line rate
|
||||
loadtest, went up slightly to 3.44Mpps now. What puzzles me even more is that the non-sFlow worker
|
||||
threads are also impacted. I spent some time thinking about this and poking around, but I did not
|
||||
find a good explanation why port pair 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted.
|
||||
Here's a screenshot of VPP on the struggle bus:
|
||||
|
||||
{{< image width="100%" src="/assets/sflow/trex-overload.png" alt="Cisco T-Rex: overload at line rate" >}}
|
||||
|
||||
**Hypothesis**: Due to the _spinlock_ in `vl_api_rpc_call_main_thread()`, the worker CPU is pegged
|
||||
for a longer time, during which the `dpdk-input` PMD can't run, so it misses out on these sweet
|
||||
sweet packets that the network card had dutifully received for it, resulting in the `rx-miss`
|
||||
situation. While VPP's performance measurement shows 273 CPU cycles per packet and 3.16Mpps, this
|
||||
accounts only for 862M cycles, while the thread has 2200M cycles, leaving a whopping 60% of CPU
|
||||
cycles unused in the dataplane. I still don't understand why _other_ worker threads are impacted,
|
||||
though.
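
Spelled out, the accounting behind that last claim:

```
Accounted for: 273 cycles/packet * 3.16e6 packets/sec ≈ 862e6 cycles/sec
Available:     2.2e9 cycles/sec
Unaccounted:   (2200e6 - 862e6) / 2200e6              ≈ 61%
```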
|
||||
|
||||
## What's Next
|
||||
|
||||
I'll continue to work with the folks in the sFlow and VPP communities and iterate on the plugin and
|
||||
other **sFlow Agent** machinery. In an upcoming article, I hope to share more details on how to tie
|
||||
the VPP plugin in to the `hsflowd` host sflow daemon in a way that the interface indexes, counters
|
||||
and packet lengths are all correct. Of course, the main improvement that we can make is to allow for
|
||||
the system to work better under load, which will take some thinking.
|
||||
|
||||
I should do a few more tests with a debug binary and profiling turned on. I quickly ran a `perf`
|
||||
over the VPP (release / optimized) binary running on the bench, but it merely said that with sFlow
enabled, 80% of the time was spent in `libvlib`, versus 54% in the baseline (sFlow turned off).
|
||||
|
||||
```
|
||||
root@hvn6-lab:/home/pim# perf record -p 1752441 sleep 10
|
||||
root@hvn6-lab:/home/pim# perf report --stdio --sort=dso
|
||||
# Overhead Shared Object (sFlow) Overhead Shared Object (baseline)
|
||||
# ........ ...................... ........ ........................
|
||||
#
|
||||
79.02% libvlib.so.24.10 54.27% libvlib.so.24.10
|
||||
12.82% libvnet.so.24.10 33.91% libvnet.so.24.10
|
||||
3.77% dpdk_plugin.so 10.87% dpdk_plugin.so
|
||||
3.21% [kernel.kallsyms] 0.81% [kernel.kallsyms]
|
||||
0.29% sflow_plugin.so 0.09% ld-linux-x86-64.so.2
|
||||
0.28% libvppinfra.so.24.10 0.03% libc.so.6
|
||||
0.21% libc.so.6 0.01% libvppinfra.so.24.10
|
||||
0.17% libvlibapi.so.24.10 0.00% libvlibmemory.so.24.10
|
||||
0.15% libvlibmemory.so.24.10
|
||||
0.07% ld-linux-x86-64.so.2
|
||||
0.00% vpp
|
||||
0.00% [vdso]
|
||||
0.00% libsvm.so.24.10
|
||||
```
|
||||
|
||||
Unfortunately, I'm not much of a profiler expert, being merely a network engineer :) so I may have
|
||||
to ask for help. Of course, if you're reading this, you may also _offer_ help! There's lots of
|
||||
interesting work to do on this `sflow` plugin, with matching ifIndex for consumers like `hsflowd`,
|
||||
reading interface counters from the dataplane (or from the Prometheus Exporter), and most
|
||||
importantly, ensuring it works well, or fails gracefully, under stringent load.
|
||||
|
||||
From the _cray-cray_ ideas department, what if we:
|
||||
1. In the worker thread, produce the sample, but instead of sending an RPC to main and taking the
|
||||
lock, append it to a producer sample queue and move on. This way, no locks are needed, and each
|
||||
worker thread will have its own producer queue.
|
||||
|
||||
1. Create a separate worker (or even pool of workers), running on possibly a different CPU (or in
|
||||
main), that runs a loop iterating on all sflow sample queues consuming the samples and sending them
|
||||
in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too many coming in.
|
||||
|
||||
I'm reminded that this pattern exists already -- async crypto workers create a `crypto-dispatch`
|
||||
node that acts as poller for inbound crypto, and it hands off the result back into the worker
|
||||
thread: lockless at the expense of some complexity!
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
The plugin I am testing here is a prototype written by Neil McKee of inMon. I also wanted to say
|
||||
thanks to Pavel Odintsov of FastNetMon and Ciprian Balaceanu for showing an interest in this plugin,
|
||||
and Peter Phaal for facilitating a get-together last year.
|
||||
|
||||
Who's up for making this thing a reality?!
|
||||
547
content/articles/2024-10-06-sflow-2.md
Normal file
@@ -0,0 +1,547 @@
|
||||
---
|
||||
date: "2024-10-06T07:51:23Z"
|
||||
title: 'VPP with sFlow - Part 2'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
|
||||
|
||||
Last month, I picked up a project together with Neil McKee of [[inMon](https://inmon.com/)], the
|
||||
caretakers of [[sFlow](https://sflow.org)]: an industry standard technology for monitoring high speed switched
|
||||
networks. `sFlow` gives complete visibility into the use of networks enabling performance optimization,
|
||||
accounting/billing for usage, and defense against security threats.
|
||||
|
||||
The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
|
||||
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
|
||||
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for a small
|
||||
portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but also in the
|
||||
VPP software dataplane, and then _transmit_ these samples using a Linux kernel feature called
|
||||
[[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)]. This greatly
|
||||
reduces the complexity of code to be implemented in the forwarding path, while at the same time
|
||||
bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business logic for
|
||||
the more complex state keeping, packet marshalling and transmission from the _Agent_ to a central
|
||||
_Collector_.
|
||||
|
||||
Last month, Neil and I discussed the proof of concept [[ref](https://github.com/sflow/vpp-sflow/)]
|
||||
and I described this in a [[first article]({{< ref 2024-09-08-sflow-1.md >}})]. Then, we iterated on
|
||||
the VPP plugin, playing with a few different approaches to strike a balance between performance, code
|
||||
complexity, and agent features. This article describes our journey.
|
||||
|
||||
## VPP: an sFlow plugin
|
||||
|
||||
There are three things Neil and I specifically take a look at:
|
||||
|
||||
1. If `sFlow` is not enabled on a given interface, there should not be a regression on other
|
||||
interfaces.
|
||||
1. If `sFlow` _is_ enabled, but a packet is not sampled, the overhead should be as small as
|
||||
possible, targeting single digit CPU cycles per packet of overhead.
|
||||
1. If `sFlow` actually selects a packet for sampling, it should be moved out of the dataplane as
|
||||
quickly as possible, targeting double digit CPU cycles per sample.
|
||||
|
||||
For all of this validation and loadtesting, I use a bare metal VPP machine which is receiving load from
|
||||
a T-Rex loadtester on eight TenGig ports. I have configured VPP and T-Rex as follows.
|
||||
|
||||
**1. RX Queue Placement**
|
||||
|
||||
It's important that the network card that is receiving the traffic gets serviced by a worker thread
|
||||
on the same NUMA domain. Since my machine has two processors (and thus, two NUMA nodes), I will
|
||||
align the NIC with the correct processor, like so:
|
||||
|
||||
```
|
||||
set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
|
||||
set interface rx-placement TenGigabitEthernet3/0/1 queue 0 worker 2
|
||||
set interface rx-placement TenGigabitEthernet3/0/2 queue 0 worker 4
|
||||
set interface rx-placement TenGigabitEthernet3/0/3 queue 0 worker 6
|
||||
|
||||
set interface rx-placement TenGigabitEthernet130/0/0 queue 0 worker 1
|
||||
set interface rx-placement TenGigabitEthernet130/0/1 queue 0 worker 3
|
||||
set interface rx-placement TenGigabitEthernet130/0/2 queue 0 worker 5
|
||||
set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7
|
||||
```
|
||||
|
||||
**2. L3 IPv4/MPLS interfaces**
|
||||
|
||||
I will take two pairs of interfaces, one on NUMA0, and the other on NUMA1, so that I can make a
|
||||
comparison with L3 IPv4 or MPLS running _without_ `sFlow` (these are TenGig3/0/*, which I will call
|
||||
the _baseline_ pairs) and two which are running _with_ `sFlow` (these are TenGig130/0/*, which I'll
|
||||
call the _experiment_ pairs).
|
||||
|
||||
```
|
||||
comment { L3: IPv4 interfaces }
|
||||
set int state TenGigabitEthernet3/0/0 up
|
||||
set int state TenGigabitEthernet3/0/1 up
|
||||
set int state TenGigabitEthernet130/0/0 up
|
||||
set int state TenGigabitEthernet130/0/1 up
|
||||
set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
|
||||
set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
|
||||
set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
|
||||
set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
|
||||
ip route add 16.0.0.0/24 via 100.64.0.0
|
||||
ip route add 48.0.0.0/24 via 100.64.1.0
|
||||
ip route add 16.0.2.0/24 via 100.64.4.0
|
||||
ip route add 48.0.2.0/24 via 100.64.5.0
|
||||
ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
|
||||
ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
|
||||
ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
|
||||
ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
|
||||
```
|
||||
|
||||
Here, the only specific trick worth mentioning is the use of `ip neighbor` to pre-populate the L2
|
||||
adjacency for the T-Rex loadtester. This way, VPP knows which MAC address to send traffic to, in
|
||||
case a packet has to be forwarded to 100.64.0.0 or 100.64.5.0. It saves VPP from having to use ARP
|
||||
resolution.
|
||||
|
||||
The configuration for an MPLS label switching router (_LSR_, also called a _P-Router_) is added:
|
||||
|
||||
```
|
||||
comment { MPLS interfaces }
|
||||
mpls table add 0
|
||||
set interface mpls TenGigabitEthernet3/0/0 enable
|
||||
set interface mpls TenGigabitEthernet3/0/1 enable
|
||||
set interface mpls TenGigabitEthernet130/0/0 enable
|
||||
set interface mpls TenGigabitEthernet130/0/1 enable
|
||||
mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
|
||||
mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
|
||||
mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
|
||||
mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
|
||||
```
|
||||
|
||||
**3. L2 CrossConnect interfaces**
|
||||
|
||||
Here, I will also use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
|
||||
interfaces on NUMA1 as my experiment (`sFlow` enabled) pair. This way, I can both make a comparison
|
||||
of the performance impact of enabling `sFlow`, and also assert whether any regression occurs in the
|
||||
_baseline_ pair when I enable a feature in the _experiment_ pair, which should really never happen.
|
||||
|
||||
```
|
||||
comment { L2 xconnected interfaces }
|
||||
set int state TenGigabitEthernet3/0/2 up
|
||||
set int state TenGigabitEthernet3/0/3 up
|
||||
set int state TenGigabitEthernet130/0/2 up
|
||||
set int state TenGigabitEthernet130/0/3 up
|
||||
set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
|
||||
set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
|
||||
set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
|
||||
set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
|
||||
```
|
||||
|
||||
**4. T-Rex Configuration**
|
||||
|
||||
The Cisco T-Rex loadtester is running on another machine in the same rack. Physically, it has eight
|
||||
ports which are connected to a LAB switch, a cool Mellanox SN2700 running Debian [[ref]({{< ref
|
||||
2023-11-11-mellanox-sn2700.md >}})]. From there, eight ports go to my VPP machine. The LAB switch
|
||||
just has VLANs with two ports in each: VLAN 100 takes T-Rex port0 and connects it to TenGig3/0/0,
|
||||
VLAN 101 takes port1 and connects it to TenGig3/0/1, and so on. In total, sixteen ports and eight
|
||||
VLANs are used.
|
||||
|
||||
The configuration for T-Rex then becomes:
|
||||
|
||||
```
|
||||
- version: 2
|
||||
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
|
||||
port_info:
|
||||
- src_mac: 00:1b:21:06:00:00
|
||||
dest_mac: 9c:69:b4:61:a1:dc
|
||||
- src_mac: 00:1b:21:06:00:01
|
||||
dest_mac: 9c:69:b4:61:a1:dd
|
||||
|
||||
- src_mac: 00:1b:21:83:00:00
|
||||
dest_mac: 00:1b:21:83:00:01
|
||||
- src_mac: 00:1b:21:83:00:01
|
||||
dest_mac: 00:1b:21:83:00:00
|
||||
|
||||
- src_mac: 00:1b:21:87:00:00
|
||||
dest_mac: 9c:69:b4:61:75:d0
|
||||
- src_mac: 00:1b:21:87:00:01
|
||||
dest_mac: 9c:69:b4:61:75:d1
|
||||
|
||||
- src_mac: 9c:69:b4:85:00:00
|
||||
dest_mac: 9c:69:b4:85:00:01
|
||||
- src_mac: 9c:69:b4:85:00:01
|
||||
dest_mac: 9c:69:b4:85:00:00
|
||||
```
|
||||
|
||||
Do you see how the first pair sends from `src_mac` 00:1b:21:06:00:00? That's the T-Rex side, and it
|
||||
encodes the PCI device `06:00.0` in the MAC address. It sends traffic to `dest_mac`
|
||||
9c:69:b4:61:a1:dc, which is the MAC address of VPP's TenGig3/0/0 interface. Looking back at the `ip
|
||||
neighbor` VPP config above, it becomes much easier to see who is sending traffic to whom.
|
||||
|
||||
For L2XC, the MAC addresses don't matter. VPP will set the NIC in _promiscuous_ mode which means
|
||||
it'll accept any ethernet frame, not only those sent to the NIC's own MAC address. Therefore, in
|
||||
L2XC modes (the second and fourth pair), I just use the MAC addresses from T-Rex. I find debugging
|
||||
connections and looking up FDB entries on the Mellanox switch much, much easier this way.
|
||||
|
||||
With all config in place, but with `sFlow` disabled, I run a quick bidirectional loadtest using 256b
|
||||
packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS,
|
||||
IPv4, and L2XC. Neat!
|
||||
|
||||
{{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}
|
||||
|
||||
The name of the game is now to do a loadtest that shows the packet throughput and CPU cycles spent
|
||||
for each of the plugin iterations, comparing their performance on ports with and without `sFlow`
|
||||
enabled. For each iteration, I will use exactly the same VPP configuration, I will generate
|
||||
unidirectional 4x14.88Mpps of traffic with T-Rex, and I will report on VPP's performance in
|
||||
_baseline_ and with a somewhat unfavorable 1:100 sampling rate.
|
||||
|
||||
Ready? Here I go!
|
||||
|
||||
### v1: Workers send RPC to main
|
||||
|
||||
***TL/DR***: _13 cycles/packet on passthrough, 4.68Mpps L2, 3.26Mpps L3, with severe regression in
|
||||
baseline_
|
||||
|
||||
The first iteration goes all the way back to a proof of concept from last year. It's described in
|
||||
detail in my [[first post]({{< ref 2024-09-08-sflow-1.md >}})]. The performance results are not
|
||||
stellar:
|
||||
* ☢ When slamming a single sFlow enabled interface, _all interfaces_ regress. When sending 8Mpps
|
||||
of IPv4 traffic through a _baseline_ interface, that is, an interface _without_ sFlow enabled, only
|
||||
5.2Mpps get through. This is considered a mortal sin in VPP-land.
|
||||
* ✅ Passing through packets without sampling them, costs about 13 CPU cycles, not bad.
|
||||
* ❌ Sampling a packet, specifically at higher rates (say, 1:100 or worse, 1:10) completely
|
||||
destroys throughput. When sending 4x14.88Mpps of traffic, only one third makes it through.
|
||||
|
||||
Here's the bloodbath as seen from T-Rex:
|
||||
|
||||
{{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}
|
||||
|
||||
**Debrief**: When we talked through these issues, we sort of drew the conclusion that it would be much
|
||||
faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the
|
||||
spinlock, the worker appends the sample to a producer queue and moves on. This way, no locks
|
||||
are needed, and each worker thread will have its own producer queue.
|
||||
|
||||
Then, we can create a separate thread (or even pool of threads), scheduling on possibly a different
|
||||
CPU (or in main), that runs a loop iterating on all sflow sample queues, consuming the samples and
|
||||
sending them in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too
|
||||
many coming in.
|
||||
|
||||
### v2: Workers send PSAMPLE directly
|
||||
|
||||
**TL/DR**: _7.21Mpps IPv4 L3, 9.45Mpps L2XC, 87 cycles/packet, no impact on disabled interfaces_
|
||||
|
||||
But before we do that, we have one curiosity itch to scratch - what if we sent the sample directly
|
||||
from the worker? With such a model, if it works, we will need no RPCs or sample queue at all. Of
|
||||
course, in this model any sample will have to be rewritten into a PSAMPLE packet and written via the
|
||||
netlink socket. It would be less complex, but not as efficient as it could be. One thing is pretty
|
||||
certain, though: it should be much faster than sending an RPC to the main thread.
|
||||
|
||||
After a short refactor, Neil commits [[d278273](https://github.com/sflow/vpp-sflow/commit/d278273)],
|
||||
which adds compiler macros `SFLOW_SEND_FROM_WORKER` (v2) and `SFLOW_SEND_VIA_MAIN` (v1). When
|
||||
workers send directly, they will invoke `sflow_send_sample_from_worker()` instead of sending an RPC
|
||||
with `vl_api_rpc_call_main_thread()` in the previous version.
|
||||
|
||||
The code currently uses `clib_warning()` to print stats from the dataplane, which is pretty
|
||||
expensive. We should be using the VPP logging framework, but for the time being, I add a few CPU
|
||||
counters so we can more accurately count the cumulative time spent for each part of the calls, see
|
||||
[[6ca61d2](https://github.com/sflow/vpp-sflow/commit/6ca61d2)]. I can now see these with `vppctl show
|
||||
err` instead.
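
For readers who want the gist of what such a counter looks like, here's a tiny standalone C sketch
of the idea, not the plugin's actual code: wrap the expensive call, accumulate elapsed CPU cycles,
and report the average. The names are made up; in VPP the timestamp would typically come from
`clib_cpu_time_now()`, and the totals surface as entries in `show errors`, like the `CPU cycles in
sent samples` counter above.

```
/* Illustrative only: count cumulative CPU cycles spent in an expensive call.
 * x86-specific (__rdtsc); VPP itself would use clib_cpu_time_now(). */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static uint64_t cycles_in_sent_samples; /* cumulative cycles, like the error counter */
static uint64_t samples_sent;

static void send_sample(const void *sample, int len) { (void)sample; (void)len; /* stand-in */ }

static void send_sample_counted(const void *sample, int len) {
  uint64_t t0 = __rdtsc();
  send_sample(sample, len);
  cycles_in_sent_samples += __rdtsc() - t0;
  samples_sent++;
}

int main(void) {
  char dummy[128] = {0};
  for (int i = 0; i < 1000; i++)
    send_sample_counted(dummy, sizeof(dummy));
  printf("avg cycles per sent sample: %llu\n",
         (unsigned long long)(cycles_in_sent_samples / samples_sent));
  return 0;
}
```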
|
||||
|
||||
When loadtesting this, the deadly sin of impacting performance of interfaces that did not have
|
||||
`sFlow` enabled is gone. The throughput is not great, though. Instead of showing screenshots of
|
||||
T-Rex, I can also take a look at the throughput as measured by VPP itself. In its `show runtime`
|
||||
statistics, each worker thread shows both CPU cycles spent, as well as how many packets/sec it
|
||||
received and how many it transmitted:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ export C="v2-100"; vppctl clear run; vppctl clear err; sleep 30; \
|
||||
vppctl show run > $C-runtime.txt; vppctl show err > $C-err.txt
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v2-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0909e7, out 1.0909e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 7.2078e6, out 7.2078e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.4476e6, out 9.4476e6, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep 'sflow' v2-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 844916 216298496 0 8.69e1 256.00
|
||||
sflow active 1107466 283511296 0 8.26e1 256.00
|
||||
pim@hvn6-lab:~$ grep -i sflow v2-100-err.txt
|
||||
217929472 sflow sflow packets processed error
|
||||
1614519 sflow sflow packets sampled error
|
||||
2606893106 sflow CPU cycles in sent samples error
|
||||
280697344 sflow sflow packets processed error
|
||||
2078203 sflow sflow packets sampled error
|
||||
1844674406 sflow CPU cycles in sent samples error
|
||||
```
|
||||
|
||||
At a glance, I can see in the first `grep`, the in and out vector (==packet) rates for each worker
|
||||
thread that is doing meaningful work (ie. has more than 0pps of input). Remember that I pinned the
|
||||
RX queues to worker threads, and this now pays dividends: worker thread 0 is servicing TenGig3/0/0
|
||||
(as _even_ worker thread numbers are on NUMA domain 0), worker thread 1 is servicing TenGig130/0/0.
|
||||
What's cool about this, is it gives me an easy way to compare baseline L3 (10.9Mpps) with experiment
|
||||
L3 (7.21Mpps). Equally, L2XC comes in at 14.88Mpps in baseline and 9.45Mpps in experiment.
|
||||
|
||||
Looking at the output of `vppctl show error`, I can learn another interesting detail. See how there
|
||||
are 1614519 sampled packets out of 217929472 processed packets (ie. a roughly 1:100 rate)? I added a
|
||||
CPU clock cycle counter that counts cumulative clocks spent once samples are taken. I can see that
|
||||
VPP spent 2606893106 CPU cycles sending these samples. That's **1615 CPU cycles** per sent sample,
|
||||
which is pretty terrible.
|
||||
|
||||
**Debrief**: We both understand that assembling and `send()`ing the netlink messages from within the
|
||||
dataplane is a pretty bad idea. But it's great to see that removing the use of RPCs immediately
|
||||
improves performance on non-enabled interfaces, and we learned what the cost is of sending those
|
||||
samples. An easy step forward from here is to create a producer/consumer queue, where the workers
|
||||
can just copy the packet into a queue or ring buffer, and have an external `pthread` consume from
|
||||
the queue/ring in another thread that won't block the dataplane.
|
||||
|
||||
### v3: SVM FIFO from workers, dedicated PSAMPLE pthread
|
||||
|
||||
**TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages_
|
||||
|
||||
Neil checks in after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
|
||||
that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty
|
||||
elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment
|
||||
called `fifo-sflow-worker`, to which the worker can write its samples in the dataplane node. A new
|
||||
thread called `spt_process_samples` can then call `svm_fifo_dequeue()` from all workers' queues and
|
||||
pump those into Netlink.
|
||||
|
||||
The overhead of copying the samples onto a VPP native `svm_fifo` seems to be two orders of magnitude
|
||||
lower than writing directly to Netlink, even though the `svm_fifo` library code has many bells and
|
||||
whistles that we don't need. But, perhaps due to these bells and whistles, we may be holding it
|
||||
wrong, as invariably after a short while the Netlink writes return _Message too long_ errors.
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v3fifo-sflow-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0783e7, out 1.0783e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.3499e6, out 9.3499e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4728e7, out 1.4728e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.3516e7, out 1.3516e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 1096132 280609792 0 1.63e1 256.00
|
||||
sflow active 1584577 405651712 0 1.46e1 256.00
|
||||
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-err.txt
|
||||
280635904 sflow sflow packets processed error
|
||||
2079194 sflow sflow packets sampled error
|
||||
733447310 sflow CPU cycles in sent samples error
|
||||
405689856 sflow sflow packets processed error
|
||||
3004118 sflow sflow packets sampled error
|
||||
1844674407 sflow CPU cycles in sent samples error
|
||||
```
|
||||
|
||||
Two things of note here. Firstly, the average clocks spent in the `sFlow` node have gone down from
|
||||
86 CPU cycles/packet to 16.3 CPU cycles. But even more importantly, the amount of time spent after
|
||||
the sample is taken is hugely reduced, from 1600+ cycles in v2 to a much more favorable 352 cycles
|
||||
in this version. Also, any risk of a Netlink write stalling the dataplane has been eliminated,
because the writing is now offloaded to a different thread entirely.
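
For the record, that per-sample figure follows directly from the first worker's counters above:

```
733447310 CPU cycles / 2079194 samples ≈ 352 cycles per enqueued sample
```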
|
||||
|
||||
**Debrief**: It's not great that we created a new linux `pthread` for the consumer of the samples.
|
||||
VPP has an elaborate thread management system, and collaborative multitasking in its threading
|
||||
model, which adds introspection like clock counters, names, `show runtime`, `show threads` and so
|
||||
on. I can't help but wonder: wouldn't we just be able to move the `spt_process_samples()` thread
|
||||
into a VPP process node instead?
|
||||
|
||||
### v3bis: SVM FIFO, PSAMPLE process in Main
|
||||
|
||||
**TL/DR:** _9.68Mpps L3, 14.10Mpps L2XC, 14.2 cycles/packet, still with corrupted FIFO queue messages_
|
||||
|
||||
Neil agrees that there's no good reason to keep this out of main, and conjures up
|
||||
[[df2dab8d](https://github.com/vpp/sflow-vpp/df2dab8d)] which rewrites the thread to an
|
||||
`sflow_process_samples()` function, using `VLIB_REGISTER_NODE` to add it to VPP in an idiomatic way.
|
||||
As a really nice benefit, we can now count how many CPU cycles are spent, in _main_, each time this
|
||||
_process_ wakes up and does some work. It's a widely used pattern in VPP.
|
||||
|
||||
Because of the FIFO queue message corruption, Netlink messages are failing to send at an alarming
|
||||
rate, which is causing lots of `clib_warning()` messages to be spewed on console. I replace those
|
||||
with a counter of Failed Netlink messages instead, and commit refactor
|
||||
[[6ba4715](https://github.com/sflow/vpp-sflow/6ba4715d050f76cfc582055958d50bf4cc8a0ad1)].
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v3bis-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0976e7, out 1.0976e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.6743e6, out 9.6743e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4052e7, out 1.4052e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v3bis-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 28052 4.66e4 0.00
|
||||
sflow active 1134102 290330112 0 1.42e1 256.00
|
||||
sflow active 1647240 421693440 0 1.32e1 256.00
|
||||
pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt
|
||||
77945 sflow sflow PSAMPLE sent error
|
||||
863 sflow sflow PSAMPLE send failed error
|
||||
290376960 sflow sflow packets processed error
|
||||
2151184 sflow sflow packets sampled error
|
||||
421761024 sflow sflow packets processed error
|
||||
3119625 sflow sflow packets sampled error
|
||||
```
|
||||
|
||||
With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up
|
||||
and informs me that, when handling the samples from the worker FIFO queues, the _process_ is using
|
||||
about 4.66e4 CPU cycles. Secondly, the replacement of `clib_warning()` with the `sflow PSAMPLE send failed`
|
||||
counter reduced time from 16.3 to 14.2 cycles on average in the dataplane. Nice.
|
||||
|
||||
**Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All
|
||||
these send failures and corrupt packets are really messing things up. So while the provided FIFO
|
||||
implementation in `svm/fifo_segment.h` is idiomatic, it is also much more complex than we thought,
|
||||
and we're fearing that it may not be safe to read from another thread.
|
||||
|
||||
### v4: Custom lockless FIFO, PSAMPLE process in Main
|
||||
|
||||
**TL/DR:** _9.56Mpps L3, 13.69Mpps L2XC, 15.6 cycles/packet, corruption fixed!_
|
||||
|
||||
After reading around a bit in DPDK's
|
||||
[[kni_fifo](https://doc.dpdk.org/api-18.11/rte__kni__fifo_8h_source.html)], Neil produces a gem of a
|
||||
commit in
|
||||
[[42bbb64](https://github.com/sflow/vpp-sflow/commit/42bbb643b1f11e8498428d3f7d20cde4de8ee201)],
|
||||
where he introduces a tiny multiple-writer, single-consumer FIFO with two simple functions:
|
||||
`sflow_fifo_enqueue()` to be called in the workers, and `sflow_fifo_dequeue()` to be called in the
|
||||
main thread's `sflow-process-samples` process. He then makes this thread-safe by doing what I
|
||||
consider black magic, in commit
|
||||
[[dd8af17](https://github.com/sflow/vpp-sflow/commit/dd8af1722d579adc9d08656cd7ec8cf8b9ac11d6)],
|
||||
which makes use of `clib_atomic_load_acq_n()` and `clib_atomic_store_rel_n()` macros from VPP's
|
||||
`vppinfra/atomics.h`.
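
To make the shape of this concrete, here is a minimal, self-contained C11 sketch of the same idea.
This is illustrative only, not the plugin's code: the names are made up, it is shown as a
single-producer/single-consumer ring for simplicity, and where the real plugin uses VPP's
`clib_atomic_store_rel_n()` / `clib_atomic_load_acq_n()` macros, the sketch uses C11 `stdatomic.h`.
The key property is the same: the worker publishes a sample with a release store and drops on
overflow instead of ever blocking, while the consumer observes with an acquire load.

```
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SFLOW_FIFO_DEPTH 4                  /* must be a power of two */
#define SFLOW_FIFO_MASK  (SFLOW_FIFO_DEPTH - 1)

typedef struct {
  uint32_t if_index;                        /* which interface the sample came from */
  uint8_t  header[128];                     /* truncated copy of the packet header */
} sample_t;

typedef struct {
  sample_t slot[SFLOW_FIFO_DEPTH];
  _Atomic uint32_t head;                    /* advanced by the consumer (main) */
  _Atomic uint32_t tail;                    /* advanced by the producer (worker) */
} sample_fifo_t;

/* Worker side: returns false (sample dropped) when the FIFO is full. */
static bool fifo_enqueue(sample_fifo_t *f, const sample_t *s) {
  uint32_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
  uint32_t head = atomic_load_explicit(&f->head, memory_order_acquire);
  if (tail - head == SFLOW_FIFO_DEPTH)
    return false;                           /* full: drop, never block the dataplane */
  f->slot[tail & SFLOW_FIFO_MASK] = *s;
  atomic_store_explicit(&f->tail, tail + 1, memory_order_release);
  return true;
}

/* Main side: returns false when there is nothing to dequeue. */
static bool fifo_dequeue(sample_fifo_t *f, sample_t *out) {
  uint32_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
  uint32_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);
  if (head == tail)
    return false;                           /* empty */
  *out = f->slot[head & SFLOW_FIFO_MASK];
  atomic_store_explicit(&f->head, head + 1, memory_order_release);
  return true;
}

int main(void) {
  sample_fifo_t fifo = {0};
  sample_t s = { .if_index = 1 };
  for (int i = 0; i < 6; i++)               /* six enqueues against depth 4: last two drop */
    printf("enqueue %d: %s\n", i, fifo_enqueue(&fifo, &s) ? "ok" : "dropped");
  sample_t d;
  while (fifo_dequeue(&fifo, &d))
    printf("dequeued sample for if_index %u\n", d.if_index);
  return 0;
}
```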
|
||||
|
||||
What I really like about this change is that it introduces a FIFO implementation in about twenty
|
||||
lines of code, which means the sampling code path in the dataplane becomes really easy to follow,
|
||||
and will be even faster than it was before. I take it out for a loadtest:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v4-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0958e7, out 1.0958e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.5633e6, out 9.5633e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.3697e7, out 1.3697e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v4-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 17767 1.52e6 0.00
|
||||
sflow active 1121156 287015936 0 1.56e1 256.00
|
||||
sflow active 1605772 411077632 0 1.53e1 256.00
|
||||
pim@hvn6-lab:~$ grep sflow v4-100-err.txt
|
||||
3553600 sflow sflow PSAMPLE sent error
|
||||
287101184 sflow sflow packets processed error
|
||||
2127024 sflow sflow packets sampled error
|
||||
350224 sflow sflow packets dropped error
|
||||
411199744 sflow sflow packets processed error
|
||||
3043693 sflow sflow packets sampled error
|
||||
1266893 sflow sflow packets dropped error
|
||||
```
|
||||
|
||||
|
||||
This is starting to be a very nice implementation! With this iteration of the plugin, all the
|
||||
corruption is gone, there is a slight regression (because we're now actually _sending_ the
|
||||
messages). With the v3bis variant, only a tiny fraction of the samples made it through to netlink.
|
||||
With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen
|
||||
FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying
|
||||
to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken,
|
||||
350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense in depth!
|
||||
|
||||
Doing the math, both workers can enqueue 1776800 samples in 30 seconds, which is 59k/s per
|
||||
interface. I can also see that the second interface, which is doing L2XC and hits a much larger
|
||||
packets/sec throughput, is dropping more samples because it receives an equal amount of time from main
|
||||
reading samples from its queue. In other words: in an overload scenario, one interface cannot crowd
|
||||
out another. Slick.
|
||||
|
||||
Finally, completing my math, each worker has enqueued 1776800 samples to their FIFOs, and I see that
|
||||
main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so
|
||||
the `sflow PSAMPLE send failed` counter remains zero.
|
||||
|
||||
{{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}
|
||||
|
||||
**Debrief**: In the meantime, Neil has been working on the `host-sflow` daemon changes to pick up
|
||||
these netlink messages. There's also a bit of work to do with retrieving the packet and byte
|
||||
counters of the VPP interfaces, so he is creating a module in `host-sflow` that can consume some
|
||||
messages from VPP. He will call this `mod_vpp`, and he mails a screenshot of his work in progress.
|
||||
I'll discuss the end-to-end changes with `hsflowd` in a followup article, and focus my efforts here
|
||||
on documenting the VPP parts only. But, as a teaser, here's a screenshot of a validated
|
||||
`sflow-tool` output of a VPP instance using our `sFlow` plugin and his pending `host-sflow` changes
|
||||
to integrate the rest of the business logic outside of the VPP dataplane, where it's arguably
|
||||
expensive to make mistakes.
|
||||
|
||||
Neil admits to an itch that he has been meaning to scratch all this time. In VPP's
|
||||
`plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Here, really
|
||||
most of the time the plugin is just shoveling the ethernet packets through to `ethernet-input`. To
|
||||
make use of some CPU instruction cache affinity, the loop that does this shovelling can do it one
|
||||
packet at a time, two packets at a time, or even four packets at a time. Although the code is super
|
||||
repetitive and somewhat ugly, it does actually speed up processing in terms of CPU cycles spent per
|
||||
packet, if you shovel four of them at a time.
|
||||
|
||||
### v5: Quad Bucket Brigade in worker
|
||||
|
||||
**TL/DR:** _9.68Mpps L3, 14.0Mpps L2XC, 11 CPU cycles/packet, 1.28e5 CPU cycles in main_
|
||||
|
||||
Neil calls this the _Quad Bucket Brigade_, and one last finishing touch is to move from his default
|
||||
2-packet to a 4-packet shoveling. In commit
|
||||
[[285d8a0](https://github.com/sflow/vpp-sflow/commit/285d8a097b74bb38eeb14a922a1e8c1115da2ef2)], he
|
||||
extends a common pattern in VPP dataplane nodes: each time the node iterates, it'll now pre-fetch up
|
||||
to eight packets (`p0-p7`) if the vector is long enough, and handle them four at a time (`b0-b3`).
|
||||
He also adds a few compiler hints with branch prediction: almost no packets will have a trace
|
||||
enabled, so he can use `PREDICT_FALSE()` macros to allow the compiler to further optimize the code.
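
To show the pattern in isolation, here's a generic C sketch, not the plugin's code: `pkt_t`,
`should_sample()`, `take_sample()` and `forward()` are made-up stand-ins, and `unlikely()` plays the
role that `PREDICT_FALSE()` plays in VPP. The loop handles four packets per iteration and prefetches
the next four, so their data is (hopefully) already in cache by the time it is touched.

```
#include <stddef.h>
#include <stdint.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

typedef struct { uint8_t data[64]; } pkt_t;

static inline int  should_sample(const pkt_t *p) { (void)p; return 0; } /* ~1:N sampler stand-in */
static inline void take_sample(const pkt_t *p)   { (void)p; }           /* e.g. enqueue to a FIFO */
static inline void forward(const pkt_t *p)       { (void)p; }           /* hand to the next node  */

void process_vector(pkt_t **pkts, size_t n) {
  size_t i = 0;

  /* Quad loop: four packets per iteration, prefetching the next group of four. */
  while (i + 8 <= n) {
    __builtin_prefetch(pkts[i + 4]);
    __builtin_prefetch(pkts[i + 5]);
    __builtin_prefetch(pkts[i + 6]);
    __builtin_prefetch(pkts[i + 7]);

    for (size_t j = 0; j < 4; j++) {
      pkt_t *p = pkts[i + j];
      if (unlikely(should_sample(p)))
        take_sample(p);                     /* rare branch, kept off the hot path */
      forward(p);
    }
    i += 4;
  }

  /* Tail: whatever is left over, one packet at a time. */
  for (; i < n; i++) {
    pkt_t *p = pkts[i];
    if (unlikely(should_sample(p)))
      take_sample(p);
    forward(p);
  }
}
```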
|
||||
|
||||
Reading the dataplane code, I find it incredibly ugly, but that's the price to pay for ultra-fast
throughput. So how do we see the effect? My low-tech proposal is to enable sampling at a very
|
||||
high rate, say 1:10'000'000, so that the code path that grabs and enqueues the sample into the FIFO
|
||||
is almost never called. Then, what's left for the `sFlow` dataplane node, really is to shovel the
|
||||
packets from `device-input` into `ethernet-input`.
|
||||
|
||||
To measure the relative improvement, I do one test with, and one without commit
|
||||
[[285d8a09](https://github.com/sflow/vpp-sflow/commit/285d8a09)].
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v5-10M-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0981e7, out 1.0981e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.8806e6, out 9.8806e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4328e7, out 1.4328e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v5-10M-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 28467 9.36e3 0.00
|
||||
sflow active 1158325 296531200 0 1.09e1 256.00
|
||||
sflow active 1679742 430013952 0 1.11e1 256.00
|
||||
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v5-noquadbrigade-10M-runtime.txt | grep -v in\ 0
|
||||
vector rates in 1.0959e7, out 1.0959e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.7046e6, out 9.7046e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4008e7, out 1.4008e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v5-noquadbrigade-10M-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 28462 9.57e3 0.00
|
||||
sflow active 1137571 291218176 0 1.26e1 256.00
|
||||
sflow active 1641991 420349696 0 1.20e1 256.00
|
||||
```
|
||||
|
||||
Would you look at that, this optimization actually works as advertised! There is a meaningful
|
||||
_progression_ from v5-noquadbrigade (9.70Mpps L3, 14.00Mpps L2XC) to v5 (9.88Mpps L3, 14.32Mpps
|
||||
L2XC). So at the expense of adding 63 lines of code, there is a 2.8% increase in throughput.
|
||||
**Quad-Bucket-Brigade, yaay!**
|
||||
|
||||
I'll leave you with a beautiful screenshot of the current code at HEAD, as it is sampling 1:100
|
||||
packets (!) on four interfaces, while forwarding 8x10G of 256 byte packets at line rate. You'll
|
||||
recall at the beginning of this article I did an acceptance loadtest with sFlow disabled, but this
|
||||
is the exact same result **with sFlow** enabled:
|
||||
|
||||
{{< image src="/assets/sflow/trex-sflow-acceptance.png" alt="T-Rex sFlow Acceptance Loadtest" >}}
|
||||
|
||||
This picture says it all: 79.98 Gbps in, 79.98 Gbps out; 36.22Mpps in, 36.22Mpps out. Also: 176k
|
||||
samples/sec taken from the dataplane, with correct rate limiting due to a per-worker FIFO depth
|
||||
limit, yielding 25k samples/sec sent to Netlink.
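
As a rough sanity check on those sample rates (back-of-the-envelope, assuming the offered load is
spread evenly over the eight ports, so about half of it enters via the four sFlow-enabled ones):

```
36.22e6 / 2   ≈ 18.1e6 packets/sec through sFlow-enabled ports
18.1e6  / 100 ≈ 181e3 samples/sec taken      (observed: 176k/sec)
```

The per-worker FIFO depth limit then throttles what main actually dequeues and writes to Netlink
down to the reported 25k samples/sec.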
|
||||
|
||||
## What's Next
|
||||
|
||||
Checking in on the three main things we wanted to ensure with the plugin:
|
||||
|
||||
1. ✅ If `sFlow` _is not_ enabled on a given interface, there is no regression on other interfaces.
|
||||
1. ✅ If `sFlow` _is_ enabled, copying packets costs 11 CPU cycles on average
|
||||
1. ✅ If `sFlow` takes a sample, it takes only marginally more CPU time to enqueue.
|
||||
* No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput,
|
||||
* 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput,
|
||||
* and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.
|
||||
|
||||
The hard part is finished, but we're not entirely done yet. What's left is to implement a set of
|
||||
packet and byte counters, and send this information along with possible Linux CP data (such as the
|
||||
TAP interface ID on the Linux side), and to add the module for VPP in `hsflowd`. I'll write about
|
||||
that part in a followup article.
|
||||
|
||||
Neil has introduced vpp-dev@ to this plugin, and so far there were no objections. But he has pointed
|
||||
folks to an out-of-tree GitHub repo, and I may add a Gerrit instead so it becomes part of the
|
||||
ecosystem. Our work so far is captured in Gerrit [[41680](https://gerrit.fd.io/r/c/vpp/+/41680)],
|
||||
which ends up being just over 2600 lines all-up. I do think we need to refactor a bit, add some
|
||||
VPP-specific tidbits like `FEATURE.yaml` and `*.rst` documentation, but this should be in reasonable
|
||||
shape.
|
||||
|
||||
### Acknowledgements
|
||||
|
||||
I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
|
||||
finer details such as logging, error handling, API specifications, and documentation. He has been a
|
||||
true pleasure to work with and learn from.
|
||||
778
content/articles/2024-10-21-freeix-2.md
Normal file
@@ -0,0 +1,778 @@
|
||||
---
|
||||
date: "2024-10-21T10:52:11Z"
|
||||
title: "FreeIX Remote - Part 2"
|
||||
---
|
||||
|
||||
{{< image width="18em" float="right" src="/assets/freeix/freeix-artist-rendering.png" alt="FreeIX, Artists Rendering" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
A few months ago, I wrote about [[an idea]({{< ref 2024-04-27-freeix-1.md >}})] to help boost the
|
||||
value of small Internet Exchange Points (_IXPs_). When such an exchange doesn't have many members,
|
||||
then the operational costs of connecting to it (cross connects, router ports, finding peers, etc)
|
||||
are not very favorable.
|
||||
|
||||
Clearly, the benefit of using an Internet Exchange is to reduce the portion of an ISP’s (and CDN’s)
|
||||
traffic that must be delivered via their upstream transit providers, thereby reducing the average
|
||||
per-bit delivery cost, as well as reducing the end-to-end latency as seen by their users or
|
||||
customers. Furthermore, the increased number of paths available through the IXP improves routing
|
||||
efficiency and fault-tolerance, and at the same time it avoids traffic going the scenic route to a
|
||||
large hub like Frankfurt, London, Amsterdam, Paris or Rome, if it could very well remain local.
|
||||
|
||||
## Refresher: FreeIX Remote
|
||||
|
||||
{{< image width="20em" float="right" src="/assets/freeix/Free IX Remote.svg" alt="FreeIX Remote" >}}
|
||||
|
||||
Let's take for example the [[Free IX in Greece](https://free-ix.gr/)] that was announced at GRNOG16
|
||||
in Athens on April 19th, 2024. This exchange initially targets Athens and Thessaloniki, with 2x100G
|
||||
between the two cities. Members can connect to either site for the cost of only a cross connect.
|
||||
The 1G/10G/25G ports will be _Gratis_, so please make sure to apply if you're in this region! I
|
||||
myself have connected one very special router to Free IX Greece, which will be offering an outreach
|
||||
infrastructure by connecting to _other_ Internet Exchange Points in Amsterdam, and allowing all FreeIX
|
||||
Greece members to benefit from that in the following way:
|
||||
|
||||
1. FreeIX Remote uses AS50869 to peer with any network operator (or routeserver) available at public
|
||||
Internet Exchange Points or using private interconnects. For these peers, it looks like a completely
|
||||
normal service provider in this regard. It will connect to internet exchange points, and learn a bunch of
|
||||
routes and announce other routes.
|
||||
|
||||
1. FreeIX Remote _members_ can join the program, after which they are granted certain propagation
|
||||
permissions by FreeIX Remote at the point where they have a BGP session with AS50869. The prefixes
|
||||
learned on these _member_ sessions are marked as such, and will be allowed to propagate. Members
|
||||
will receive some or all learned prefixes from AS50869.
|
||||
|
||||
1. FreeIX _members_ can set fine grained BGP communities to determine which of their prefixes are
|
||||
propagated to and from which locations, by router, country or Internet Exchange Point.
|
||||
|
||||
Members at smaller internet exchange points greatly benefit from this type of outreach, by receiving large
|
||||
portions of the public internet directly at their preferred peering location. The _Free IX Remote_
|
||||
routers will carry member traffic to and from these remote Internet Exchange Points. My [[previous
|
||||
article]({{< ref 2024-04-27-freeix-1.md >}})] went into a good amount of detail on the principles of
|
||||
operation, but back then I made a promise to come back to the actual _implementation_ of such a
|
||||
complex routing topology. As a starting point, I work with the structure I shared in [[IPng's
|
||||
Routing Policy]({{< ref 2021-11-14-routing-policy.md >}})]. If you haven't read that yet, I think
|
||||
it may make sense to take a look as many of the structural elements and concepts will be similar.
|
||||
|
||||
## Implementation
|
||||
|
||||
The routing policy calls for three classes of (large) BGP communities: informational, permission and
|
||||
inhibit. It also defines a few classic BGP communities, but I'll skip over those as they are not
|
||||
very interesting. Firstly, I will use the _informational_ communities to tag which prefixes were
|
||||
learned by which _router_, in which _country_ and at which internet exchange point, which I will call a
|
||||
_group_.
|
||||
|
||||
Then, I will use the same structure to grant members _permissions_, that is to say, when AS50869
|
||||
learns their prefixes, they will get tagged with specific action communities that enable propagation
|
||||
to other places. I will call this 'Member-to-IXP'. Sometimes, I'd like to be able to _inhibit_
|
||||
propagation of 'Member-to-IXP', so there will be a third set of communities that perform this
|
||||
function. Finally, matching on the informational communities in a clever way will enable a symmetric
|
||||
'IXP-to-Member' propagation.
|
||||
|
||||
To structure this implementation, it helps if I think about it in
|
||||
the following way:
|
||||
|
||||
Let's say, AS50869 is connected to IXP1, IXP2, IXP3 and IXP4. AS50869 has a _member_ called M1 at
|
||||
IXP1, and that member is 'permitted' to reach IXP2 and IXP3, but it is 'inhibited' from reaching
|
||||
IXP4. My _FreeIX Remote_ implementation now has to satisfy three main requirements:
|
||||
|
||||
1. **Ingress**: learn prefixes (from peers and members alike) at internet exchange points or
|
||||
private network interconnects, and 'tag' them with the correct informational communities.
|
||||
1. **Egress: Member-to-IXP**: Announce M1's prefixes to IXP2 and IXP3, but not to IXP4.
|
||||
1. **Egress: IXP-to-Member**: Announce IXP2's and IXP3's prefixes to M1, but not IXP4's.
|
||||
|
||||
### Defining Countries and Routers
|
||||
|
||||
I'll start by giving each country which has at least one router a unique _country_id_ in a YAML
|
||||
file, leaving the value 0 to mean 'all' countries:
|
||||
|
||||
```
|
||||
$ cat config/common/countries.yaml
|
||||
country:
|
||||
all: 0
|
||||
CH: 1
|
||||
NL: 2
|
||||
GR: 3
|
||||
IT: 4
|
||||
```
|
||||
|
||||
Each router has its own configuration file, and at the top, I'll define some metadata which
|
||||
includes things like the country in which it operates, and its own unique _router_id_, like so:
|
||||
|
||||
```
|
||||
$ cat config/chrma0.net.free-ix.net.yaml
|
||||
device:
|
||||
id: 1
|
||||
hostname: chrma0.free-ix.net
|
||||
shortname: chrma0
|
||||
country: CH
|
||||
loopbacks:
|
||||
ipv4: 194.126.235.16
|
||||
ipv6: "2a0b:dd80:3101::"
|
||||
location: "Hofwiesenstrasse, Ruemlang, Zurich, Switzerland"
|
||||
...
|
||||
```
|
||||
|
||||
### Defining communities
|
||||
|
||||
Next, I define the BGP communities in `class` and `subclass` types, in the following YAML structure:
|
||||
|
||||
```
|
||||
ebgp:
|
||||
community:
|
||||
legacy:
|
||||
noannounce: 0
|
||||
blackhole: 666
|
||||
inhibit: 3000
|
||||
prepend1: 3100
|
||||
prepend2: 3200
|
||||
prepend3: 3300
|
||||
large:
|
||||
class:
|
||||
informational: 1000
|
||||
permission: 2000
|
||||
inhibit: 3000
|
||||
prepend1: 3100
|
||||
prepend2: 3200
|
||||
prepend3: 3300
|
||||
subclass:
|
||||
all: 0
|
||||
router: 10
|
||||
country: 20
|
||||
group: 30
|
||||
asn: 40
|
||||
```
|
||||
|
||||
### Defining Members
|
||||
|
||||
In order to keep this system manageable, I have to rely on automation. I intend to leverage the
|
||||
BGP community _subclasses_ in a simple ACL system consisting of the following YAML, taking my buddy
|
||||
Antonios' network as an example:
|
||||
|
||||
```
|
||||
$ cat config/common/members.yaml
|
||||
member:
|
||||
210312:
|
||||
description: DaKnObNET
|
||||
prefix_filter: AS-SET-DNET
|
||||
permission: [ router:chrma0 ]
|
||||
inhibit: [ group:chix ]
|
||||
...
|
||||
```
|
||||
|
||||
The syntax of the `permission` and `inhibit` fields is identical. They are lists of key:value pairs
|
||||
where the key must be one of the _subclasses_ (eg. 'router', 'country', 'group', 'asn'), and the
|
||||
value appropriate for that type. In this example, AS50869 is being asked to grant permissions for
|
||||
Antonios' prefixes to any peer connected to `router:chrma0`, but inhibit propagation to/from the
|
||||
exchange point called `group:chix`. I could extend this list, for example by adding a permission to
|
||||
`country:NL` or an inhibit to `router:grskg0` and so on.
|
||||
|
||||
I decide that sensible defaults are to give permissions to all, and keep inhibit empty. In other
|
||||
words: be very liberal in propagation, to maximize the value that FreeIX Remote can provide its
|
||||
members.
|
||||
|
||||
### Ingress: Learning Prefixes
|
||||
|
||||
With what I've defined so far, I can start to set informational BGP communities:
|
||||
* The prefixes learned on subclass **router** for `chrma0` will have value of device.id=1:
|
||||
`(50869,1010,1)`
|
||||
* The prefixes learned on subclass **country** for `chrma0` take device.country=CH and look it up
in `countries['CH']`, which yields value 1: `(50869,1020,1)`
|
||||
* When learning prefixes from a given internet exchange, Kees already knows its PeeringDB
|
||||
_ixp_id_, which is a unique value for each exchange point. Thus, subclass **group** for `chrma0` at
|
||||
[[CommunityIX](https://www.peeringdb.com/ix/2013)] is ixp_id=2013: `(50869,1030,2013)`
|
||||
|
||||
#### Ingress: Learning from members
|
||||
|
||||
I need to make sure that members send only the prefixes that I expect from them. To do this, I'll
|
||||
make use of a common tool called [[bgpq4](https://github.com/bgp/bgpq4)] which cobbles together the
|
||||
prefixes belonging to an AS-SET by referencing one or more IRR databases.
|
||||
|
||||
In Python, I'll prepare the Jinja context by generating the prefix filter lists like so:
|
||||
|
||||
```
|
||||
if session["type"] == "member":
|
||||
session = {**session, **data["member"][asn]}
|
||||
|
||||
pf = ebgp_merge_value(data["ebgp"], group, session, "prefix_filter", None)
|
||||
if pf:
|
||||
ctx["prefix_filter"] = {}
|
||||
pfn = pf
|
||||
pfn = pfn.replace("-", "_")
|
||||
pfn = pfn.replace(":", "_")
|
||||
|
||||
for af in [4, 6]:
|
||||
filter_name = "%s_%s_IPV%d" % (groupname.upper(), pfn, af)
|
||||
filter_contents = fetch_bgpq(filter_name, pf, af, allow_morespecifics=True)
|
||||
if "[" in filter_contents:
|
||||
ctx["prefix_filter"][filter_name] = { "str": filter_contents, "af": af }
|
||||
ctx["prefix_filter_ipv%d" % af] = True
|
||||
else:
|
||||
log.warning(f"Filter {filter_name} is empty!")
|
||||
ctx["prefix_filter_ipv%d" % af] = False
|
||||
```
|
||||
|
||||
First, if a given BGP session is of type _member_, I'll merge the `member[asn]` dictionary
|
||||
into the `ebgp.group.session[asn]`. I've left out error handling for brevity, but in case the member
|
||||
YAML file doesn't have an entry for the given ASN, it'll just revert back to being of type _peer_.
|
||||
|
||||
I'll use a helper function `ebgp_merge_value()` to walk the YAML hierarchy from the member-data
|
||||
enriched _session_ to the _group_ and finally to the _ebgp_ scope, looking for the existence of a
|
||||
key called _prefix_filter_ and defaulting to None in case none was found. With the value of
|
||||
_prefix_filter_ in hand (in this case `AS-SET-DNET`), I shell out to `bgpq4` for IPv4 and IPv6
|
||||
respectively. Sometimes, there are no IPv6 prefixes (why must you be like this?!) and sometimes
|
||||
there are no IPv4 prefixes (welcome to the Internet, kid!)
|
||||
|
||||
All of this context, including the session and group information, are then fed as context to a
|
||||
Jinja renderer, where I can use them in an _import_ filter like so:
|
||||
|
||||
```
|
||||
{% for plname, pl in (prefix_filter | default({})).items() %}
|
||||
{{pl.str}}
|
||||
{% endfor %}
|
||||
|
||||
filter ebgp_{{group_name}}_{{their_asn}}_import {
|
||||
{% if not prefix_filter_ipv4 | default(True) %}
|
||||
# WARNING: No IPv4 prefix filter found
|
||||
if (net.type = NET_IP4) then reject;
|
||||
{% endif %}
|
||||
{% if not prefix_filter_ipv6 | default(True) %}
|
||||
# WARNING: No IPv6 prefix filter found
|
||||
if (net.type = NET_IP6) then reject;
|
||||
{% endif %}
|
||||
{% for plname, pl in (prefix_filter | default({})).items() %}
|
||||
{% if pl.af == 4 %}
|
||||
if (net.type = NET_IP4 && ! (net ~ {{plname}})) then reject;
|
||||
{% elif pl.af == 6 %}
|
||||
if (net.type = NET_IP6 && ! (net ~ {{plname}})) then reject;
|
||||
{% endif %}
|
||||
{% endfor %}
|
||||
{% if session_type is defined %}
|
||||
if ! ebgp_import_{{session_type}}({{their_asn}}) then reject;
|
||||
{% endif %}
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.router}},{{device.id}})); ## informational.router = {{ device.hostname }}
|
||||
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.country}},{{country[device.country]}})); ## informational.country = {{ device.country }}
|
||||
{% if group.peeringdb_ix.id %}
|
||||
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.group}},{{group.peeringdb_ix.id}})); ## informational.group = {{ group_name }}
|
||||
{% endif %}
|
||||
|
||||
## NOTE(pim): More comes here, see Member-to-IXP below
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
Let me explain what's going on here, as the Jinja templating language that my generator uses is a bit
|
||||
... chatty. The first block will print the dictionary of zero or more `prefix_filter` entries. If
|
||||
the `prefix_filter` context variable doesn't exist, assume it's the empty dictionary and thus,
|
||||
print no prefix lists.
|
||||
|
||||
Then, I create a Bird2 filter and these must each have a globally unique name. I satisfy this
|
||||
requirement by giving it a name with the tuple of {group, their_asn}. The first thing this filter
|
||||
does, is inspect `prefix_filter_ipv4` and `prefix_filter_ipv6`, and if they are explicitly set to
|
||||
False (for example, if a member doesn't have any IRR prefixes associated with their AS-SET), then
|
||||
I'll reject any prefixes from them. Then, I'll match the prefixes with the `prefix_filter`, if
|
||||
provided, and reject any prefixes that aren't in the list I'm expecting on this session. Assuming
|
||||
we're still good to go, I'll hand this prefix off to a function called `ebgp_import_peer()` for
|
||||
peers and `ebgp_import_member()` for members, both of which ensure BGP communities are scrubbed.
|
||||
|
||||
```
|
||||
function ebgp_import_peer(int remote_as) -> bool
|
||||
{
|
||||
# Scrub BGP Communities (RFC 7454 Section 11)
|
||||
bgp_community.delete([(50869, *)]);
|
||||
bgp_large_community.delete([(50869, *, *)]);
|
||||
|
||||
# Scrub BLACKHOLE community
|
||||
bgp_community.delete((65535, 666));
|
||||
|
||||
return ebgp_import(remote_as);
|
||||
}
|
||||
|
||||
function ebgp_import_member(int remote_as) -> bool
|
||||
{
|
||||
# We scrub only our own (informational, permissions) BGP Communities for members
|
||||
bgp_large_community.delete([(50869,1000..2999,*)]);
|
||||
|
||||
return ebgp_import(remote_as);
|
||||
}
|
||||
```
|
||||
|
||||
After scrubbing the communities (peers are not allowed to set _any_ communities, and members are not
|
||||
allowed to set their own informational or permissions communities, but they are allowed to inhibit
|
||||
themselves or prepend, if they wish), one last check is performed by calling the underlying
|
||||
`ebgp_import()`:
|
||||
|
||||
```
|
||||
function ebgp_import(int remote_as) -> bool
|
||||
{
|
||||
if aspath_bogon() then return false;
|
||||
if (net.type = NET_IP4 && ipv4_bogon()) then return false;
|
||||
if (net.type = NET_IP6 && ipv6_bogon()) then return false;
|
||||
|
||||
if (net.type = NET_IP4 && ipv4_rpki_invalid()) then return false;
|
||||
if (net.type = NET_IP6 && ipv6_rpki_invalid()) then return false;
|
||||
|
||||
# Graceful Shutdown (https://www.rfc-editor.org/rfc/rfc8326.html)
|
||||
if (65535, 0) ~ bgp_community then bgp_local_pref = 0;
|
||||
|
||||
return true;
|
||||
}
|
||||
```
|
||||
|
||||
Here, belt-and-suspenders checks are performed, notably bogon AS Paths, IPv4/IPv6 prefixes and RPKI
|
||||
invalids are filtered out. If the prefix carries the well-known community for [[BGP Graceful
|
||||
Shutdown](https://www.rfc-editor.org/rfc/rfc8326.html)], I honor it and set the local preference to
|
||||
zero (making sure to prefer any other available path).
|
||||
|
||||
OK, after all these checks are done, I am finally ready to accept the prefix from this peer or
|
||||
member. It's time to add the informational communities based on the _router_id_, the router's
|
||||
_country_id_ and (if this is a session at a public internet exchange point documented in PeeringDB),
|
||||
the group's _ixp_id_.
|
||||
|
||||
#### Ingress Example: member
|
||||
|
||||
Here's what the rendered template looks like for Antonios' member session at CHIX:
|
||||
|
||||
```
|
||||
# bgpq4 -Ab4 -R 32 -l 'define CHIX_AS_SET_DNET_IPV4' AS-SET-DNET
|
||||
define CHIX_AS_SET_DNET_IPV4 = [
|
||||
44.31.27.0/24{24,32}, 44.154.130.0/24{24,32}, 44.154.132.0/24{24,32},
|
||||
147.189.216.0/21{21,32}, 193.5.16.0/22{22,32}, 212.46.55.0/24{24,32}
|
||||
];
|
||||
|
||||
# bgpq4 -Ab6 -R 128 -l 'define CHIX_AS_SET_DNET_IPV6' AS-SET-DNET
|
||||
define CHIX_AS_SET_DNET_IPV6 = [
|
||||
2001:678:f5c::/48{48,128}, 2a05:dfc1:9174::/48{48,128}, 2a06:9f81:2500::/40{40,128},
|
||||
2a06:9f81:2600::/40{40,128}, 2a0a:6044:7100::/40{40,128}, 2a0c:2f04:100::/40{40,128},
|
||||
2a0d:3dc0::/29{29,128}, 2a12:bc0::/29{29,128}
|
||||
];
|
||||
|
||||
filter ebgp_chix_210312_import {
|
||||
if (net.type = NET_IP4 && ! (net ~ CHIX_AS_SET_DNET_IPV4)) then reject;
|
||||
if (net.type = NET_IP6 && ! (net ~ CHIX_AS_SET_DNET_IPV6)) then reject;
|
||||
if ! ebgp_import_member(210312) then reject;
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
|
||||
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
|
||||
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
|
||||
|
||||
## NOTE(pim): More comes here, see Member-to-IXP below
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
#### Ingress Example: peer
|
||||
|
||||
For completeness, here's a regular peer, Cloudflare, at CHIX, and I hope you agree that the Jinja
|
||||
template renders down to something waaaay more readable now:
|
||||
|
||||
```
|
||||
filter ebgp_chix_13335_import {
|
||||
if ! ebgp_import_peer(13335) then reject;
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
|
||||
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
|
||||
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
Most sessions will actually look like this one: just learning prefixes, scrubbing inbound
|
||||
communities that are nobody's business to be setting but mine, tossing weird prefixes like bogons
|
||||
and then typically setting the three informational communities. I now know exactly which prefixes
|
||||
are picked up at group CHIX, which ones in country Switzerland, and which ones on router `chrma0`.
|
||||
|
||||
### Egress: Propagating Prefixes
|
||||
|
||||
And with that, I've completed the 'learning' part. Let me move to the 'propagating' part. A design
|
||||
goal of FreeIX Remote is to have _symmetric_ propagation. In my example above, member M1 should have
|
||||
its prefixes announced at IXP2 and IXP3, and all prefixes learned at IXP2 and IXP3 should be
|
||||
announced to member M1.
|
||||
|
||||
First, let me create a helper function in the generator. Its job is to take the symbolic
|
||||
`member.*.permission` and `member.*.inhibit` lists and resolve them into a structure of numeric
|
||||
values suitable for BGP community list adding and matching. It's a bit of a beast, but I've
|
||||
simplified it here. Notably, I've removed all the error and exception handling for brevity:
|
||||
|
||||
```
def parse_member_communities(data, asn, type):
    myasn = data["ebgp"]["asn"]
    cls = data["ebgp"]["community"]["large"]["class"]
    sub = data["ebgp"]["community"]["large"]["subclass"]

    bgp_cl = []
    member = data["member"][asn]
    perms = member.get(type, [])  # the member's 'permission' or 'inhibit' list from members.yaml

    for perm in perms:
        if perm == "all":
            el = { "class": int(cls[type]), "subclass": int(sub["all"]),
                   "value": 0, "description": f"{type}.all" }
            return [el]
        k, v = perm.split(":")
        if k == "country":
            country_id = data["country"][v]
            el = { "class": int(cls[type]), "subclass": int(sub["country"]),
                   "value": int(country_id), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        elif k == "asn":
            el = { "class": int(cls[type]), "subclass": int(sub["asn"]),
                   "value": int(v), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        elif k == "router":
            device_id = data["_devices"][v]["id"]
            el = { "class": int(cls[type]), "subclass": int(sub["router"]),
                   "value": int(device_id), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        elif k == "group":
            group = data["ebgp"]["groups"][v]
            if isinstance(group["peeringdb_ix"], dict):
                ix_id = group["peeringdb_ix"]["id"]
            else:
                ix_id = group["peeringdb_ix"]
            el = { "class": int(cls[type]), "subclass": int(sub["group"]),
                   "value": int(ix_id), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        else:
            log.warning(f"No implementation for {type} subclass '{k}' for member AS{asn}, skipping")

    return bgp_cl
```
|
||||
|
||||
The essence of this function is to take a human readable list of symbols, like 'router:chrma0' and
|
||||
look up what subclass is called 'router' and what router_id is 'chrma0'. It does this for keywords
|
||||
'router', 'country', 'group' and 'asn' and for a special keyword called 'all' as well.
|
||||
|
||||
Running this function on Antonios' member data above would reveal the following:
|
||||
```
|
||||
Member 210312 has permissions:
|
||||
[{'class': 2000, 'subclass': 10, 'value': 1, 'description': 'permission.router = chrma0'}]
|
||||
Member 210312 has inhibits:
|
||||
[{'class': 3000, 'subclass': 30, 'value': 2365, 'description': 'inhibit.group = chix'}]
|
||||
```
|
||||
|
||||
The neat thing about this is that this data will come in handy for _both_ types of propagation, and
|
||||
the `parse_member_communities()` helper function returns pretty readable data, which will help in
|
||||
debugging and further understanding the ultimately generated configuration.
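
One more detail worth spelling out: the class and subclass compose into the middle field of a large
BGP community by simple addition, which is exactly what the Jinja templates below do with
`el.class+el.subclass`. A tiny illustration using the values from the output above:

```python
MY_ASN = 50869

def to_large_community(el):
    # (my_asn, class + subclass, value), as rendered by the Jinja templates below.
    return (MY_ASN, el["class"] + el["subclass"], el["value"])

permission = {"class": 2000, "subclass": 10, "value": 1, "description": "permission.router = chrma0"}
inhibit = {"class": 3000, "subclass": 30, "value": 2365, "description": "inhibit.group = chix"}
print(to_large_community(permission))  # (50869, 2010, 1)
print(to_large_community(inhibit))     # (50869, 3030, 2365)
```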
|
||||
|
||||
#### Egress: Member-to-IXP
|
||||
|
||||
OK, when I learned Antonios' prefixes, I instructed the system to propagate them to all
|
||||
sessions on router `chrma0`, except sessions on group `chix`. This means that in the direction of
|
||||
_from AS50869 to others_, I can do the following:
|
||||
|
||||
**1. Tag permissions and inhibits on ingress**
|
||||
|
||||
I add a tiny bit of logic using this data structure I just created above. In the import filter,
|
||||
remember I added `NOTE(pim): More comes here`? After setting the informational communities, I also
|
||||
add these:
|
||||
|
||||
```
|
||||
{% if session_type == "member" %}
|
||||
{% if permissions %}
|
||||
|
||||
# Add FreeIX Remote: Permission
|
||||
{% for el in permissions %}
|
||||
bgp_large_community.add(({{my_asn}},{{el.class+el.subclass}},{{el.value}})); ## {{ el.description
|
||||
}}
|
||||
{% endfor %}
|
||||
{% endif %}
|
||||
{% if inhibits %}
|
||||
|
||||
# Add FreeIX Remote: Inhibit
|
||||
{% for el in inhibits %}
|
||||
bgp_large_community.add(({{my_asn}},{{el.class+el.subclass}},{{el.value}})); ## {{ el.description
|
||||
}}
|
||||
{% endfor %}
|
||||
{% endif %}
|
||||
{% endif %}
|
||||
```
|
||||
|
||||
Seeing as this block only gets rendered if the session type is _member_, let me show you what
|
||||
Antonios' import filter looks like in its full glory:
|
||||
|
||||
```
|
||||
filter ebgp_chix_210312_import {
|
||||
if (net.type = NET_IP4 && ! (net ~ CHIX_AS_SET_DNET_IPV4)) then reject;
|
||||
if (net.type = NET_IP6 && ! (net ~ CHIX_AS_SET_DNET_IPV6)) then reject;
|
||||
if ! ebgp_import_member(210312) then reject;
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
|
||||
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
|
||||
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
|
||||
|
||||
# Add FreeIX Remote: Permission
|
||||
bgp_large_community.add((50869,2010,1)); ## permission.router = chrma0
|
||||
|
||||
# Add FreeIX Remote: Inhibit
|
||||
bgp_large_community.add((50869,3030,2365)); ## inhibit.group = chix
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
Remember, the `ebgp_import_member()` helper will strip any informational (the 1000s) and permissions
|
||||
(the 2000s), but it would allow Antonios to set inhibits and prepends (the 3000s) so these BGP
|
||||
communities will still be allowed in. In other words, Antonios can't give himself propagation rights
|
||||
(sorry, buddy!) but if he would like to make AS50869 stop sending his prefixes to, say, CommunityIX,
|
||||
he could simply add the BGP community `(50869,3030,2013)` on his announcements, and that will get
|
||||
honored. If he'd like AS50869 to prepend itself twice before announcing to peer AS8298, he could set
|
||||
`(50869,3200,8298)` and that will also get picked up.
|
||||
|
||||
**2. Match permissions and inhibits on egress**
|
||||
|
||||
Now that all of Antonios' prefixes are tagged with permissions and inhibits, I can reveal how I
|
||||
implemented the export filters for AS50869:
|
||||
|
||||
```
|
||||
function member_prefix(int group) -> bool
|
||||
{
|
||||
bool permitted = false;
|
||||
|
||||
if (({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.all}}, 0) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.router}}, {{ device.id }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.country}}, {{ country[device.country] }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.group}}, group) ~ bgp_large_community) then {
|
||||
permitted = true;
|
||||
}
|
||||
if (({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.all}}, 0) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.router}}, {{ device.id }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.country}}, {{ country[device.country] }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.group}}, group) ~ bgp_large_community) then {
|
||||
permitted = false;
|
||||
}
|
||||
return (permitted);
|
||||
}
|
||||
|
||||
function valid_prefix(int group) -> bool
|
||||
{
|
||||
return (source_prefix() || member_prefix(group));
|
||||
}
|
||||
|
||||
function ebgp_export_peer(int remote_as; int group) -> bool
|
||||
{
|
||||
if (source != RTS_BGP && source != RTS_STATIC) then return false;
|
||||
if !valid_prefix(group) then return false;
|
||||
|
||||
bgp_community.delete([(50869, *)]);
|
||||
bgp_large_community.delete([(50869, *, *)]);
|
||||
|
||||
return ebgp_export(remote_as);
|
||||
}
|
||||
```
|
||||
|
||||
From the bottom, the function `ebgp_export_peer()` is invoked on each peering session, and it gets
|
||||
as arguments the remote AS (for example 13335 for Cloudflare) and the group (for example 2365
|
||||
for CHIX). The function ensures that it's either a _static_ route or a _BGP_ route. Then it makes
|
||||
sure it's a `valid_prefix()` for the group.
|
||||
|
||||
The `valid_prefix()` function first checks if it's one of our own (as in: AS50869's own) prefixes,
|
||||
which it does by calling `source_prefix()`, which I've omitted here as it would be a distraction.
|
||||
All it does is check if the prefix is in a static prefix list generated with `bgpq4` for AS50869
|
||||
itself. The more interesting observation is that to be eligible, the prefix needs to be either
|
||||
`source_prefix()` **or** `member_prefix(group)`.
|
||||
|
||||
The propagation decision for 'Member-to-IXP' actually happens in that `member_prefix()` function. It
|
||||
starts off by assuming the prefix is not permitted. Then it scans all relevant _permissions_
|
||||
communities which may be present in the RIB for this prefix:
|
||||
- is the `all` permissions community `(50869,2000,0)` set?
|
||||
- what about the `router` permission `(50869,2010,R)` for my _router_id_?
|
||||
- perhaps the `country` permission `(50869,2020,C)` for my _country_id_?
|
||||
- or maybe the `group` permission `(50869,2030,G)` for the _ixp_id_ that this session lives on?
|
||||
|
||||
If any of these conditions are true, then this prefix _might_ be permitted, so I set the variable to
|
||||
True. Next, I check and see if any of the _inhibit_ communities are set, either by me (in
|
||||
`members.yaml`) or by the member on the live BGP session. If any one of them matches, then I flip
|
||||
the variable to False again. Once the verdict is known, I can return True or False here, which
|
||||
makes its way all the way up the call stack and ultimately announces the member prefix on the BGP
|
||||
session, or not. Slick!
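
To make the permit-then-inhibit ordering easy to reason about, here is a small Python model of the
same decision. It is only a sketch of the logic just described, not the Bird code itself; the
subclass offsets (0, 10, 20, 30) are the ones used throughout this article:

```python
MY_ASN = 50869
PERMISSION, INHIBIT = 2000, 3000
SUBCLASSES = {"all": 0, "router": 10, "country": 20, "group": 30}

def member_prefix(tags, router_id, country_id, group_id):
    """Permissions may switch the verdict to True; inhibits always run last and flip it back."""
    scope = {"all": 0, "router": router_id, "country": country_id, "group": group_id}
    permitted = False
    for name, sub in SUBCLASSES.items():
        if (MY_ASN, PERMISSION + sub, scope[name]) in tags:
            permitted = True
    for name, sub in SUBCLASSES.items():
        if (MY_ASN, INHIBIT + sub, scope[name]) in tags:
            permitted = False
    return permitted

# Antonios' prefixes: permitted on router chrma0 (id 1), inhibited on group chix (ixp_id 2365).
tags = {(50869, 2010, 1), (50869, 3030, 2365)}
print(member_prefix(tags, router_id=1, country_id=1, group_id=2365))  # False: not announced at CHIX
print(member_prefix(tags, router_id=1, country_id=1, group_id=2013))  # True: announced at CommunityIX
```

The second loop always runs, which is what makes inhibits win over permissions.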
|
||||
|
||||
#### Egress: IXP-to-Member
|
||||
|
||||
At this point, members' prefixes get announced at the correct internet exchange points, but I need to
|
||||
satisfy one more requirement: the prefixes picked up at those IXPs, should _also_ be announced to
|
||||
members. For this, the helper dictionary with permissions and inhibits can be used in a clever way.
|
||||
What if I held them against the informational communities? For example, if I have _permitted_
|
||||
Antonios to be announced at any IXP connected to router `chrma0`, then all prefixes I learned at
|
||||
`chrma0` are fair game, right? But, I configured an _inhibit_ for Antonios' prefixes at CHIX. No
|
||||
problem, I have an informational community for all prefixes I learned from the CHIX group!
|
||||
|
||||
I come to the realization that IXP-to-Member simply adds to the Member-to-IXP logic. Everything that
|
||||
I would announce to a peer, I will also announce to a member. Off I go, adding one last helper
|
||||
function to the BGP session Jinja template:
|
||||
|
||||
```
|
||||
{% if session_type == "member" %}
|
||||
function ebgp_export_{{group_name}}_{{their_asn}}(int remote_as; int group) -> bool
|
||||
{
|
||||
bool permitted = false;
|
||||
|
||||
if (source != RTS_BGP && source != RTS_STATIC) then return false;
|
||||
if valid_prefix(group) then return ebgp_export(remote_as);
|
||||
|
||||
{% for el in permissions | default([]) %}
|
||||
if (bgp_large_community ~ [({{ my_asn }},{{ 1000+el.subclass}},{% if el.value == 0%}*{% else %}{{el.value}}{% endif %})]) then permitted=true; ## {{el.description}}
|
||||
{% endfor %}
|
||||
{% for el in inhibits | default([]) %}
|
||||
if (bgp_large_community ~ [({{ my_asn }},{{ 1000+el.subclass}},{% if el.value == 0%}*{% else %}{{el.value}}{% endif %})]) then permitted=false; ## {{el.description}}
|
||||
{% endfor %}
|
||||
|
||||
if (permitted) then return ebgp_export(remote_as);
|
||||
return false;
|
||||
}
|
||||
{% endif %}
|
||||
```
|
||||
|
||||
Note that in essence, this new function still calls `valid_prefix()`, which in turn calls
|
||||
`source_prefix()` **or** `member_prefix(group)`, so it announces the same prefixes that are also
|
||||
announced to sessions of type 'peer'. But then, I'll also inspect the _informational_ communities,
|
||||
where the value of `0` is replaced with a wildcard, because 'permit or inhibit all' would mean
|
||||
'match any of these BGP communities'. This template renders as follows for Antonios at CHIX:
|
||||
|
||||
```
function ebgp_export_chix_210312(int remote_as; int group) -> bool
{
  bool permitted = false;

  if (source != RTS_BGP && source != RTS_STATIC) then return false;
  if valid_prefix(group) then return ebgp_export(remote_as);

  if (bgp_large_community ~ [(50869,1010,1)]) then permitted=true;     ## permission.router = chrma0
  if (bgp_large_community ~ [(50869,1030,2365)]) then permitted=false; ## inhibit.group = chix

  if (permitted) then return ebgp_export(remote_as);
  return false;
}
```
|
||||
|
||||
## Results
|
||||
|
||||
With this, the propagation logic is complete. Announcements are _symmetric_, that is to say the function
|
||||
`ebgp_export_chix_210312()` sees to it that Antonios gets the prefixes learned at router `chrma0`
|
||||
but not those learned at group `CHIX`. Similarly, the `ebgp_export_peer()` ensures that Antonios'
|
||||
prefixes are propagated to any session at router `chrma0` except those sessions at group `CHIX`.
|
||||
|
||||
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
|
||||
|
||||
I have installed VPP with [[OSPFv3]({{< ref 2024-06-22-vpp-ospf-2.md >}})] unnumbered interfaces,
|
||||
so each router has exactly one IPv4 and IPv6 loopback address. The router in Rümlang has been
|
||||
operational for a while, the ones in Amsterdam (nlams0.free-ix.net) and Thessaloniki
|
||||
(grskg0.free-ix.net) have been deployed and are connecting to IXPs now, and the one in Milan
|
||||
(itmil0.free-ix.net) has been installed but is pending physical deployment at Caldara.
|
||||
|
||||
I deployed a test setup with a few permissions and inhibits on the Rümlang router, with many thanks
|
||||
to Jurrian, Sam and Antonios for allowing me to guinea-pig-ize their member sessions. With the
|
||||
following test configuration:
|
||||
|
||||
```
member:
  35202:
    description: OnTheGo (Sam Aschwanden)
    prefix_filter: AS-OTG
    permission: [ router:chrma0 ]
    inhibit: [ group:comix ]
  210312:
    description: DaKnObNET
    prefix_filter: AS-SET-DNET
    permission: [ router:chrma0 ]
    inhibit: [ group:chix ]
  212635:
    description: Jurrian van Iersel
    prefix_filter: AS212635:AS-212635
    permission: [ router:chrma0 ]
    inhibit: [ group:chix, group:fogixp ]
```
|
||||
|
||||
I can see the following prefix learn/announce counts towards _members_:
|
||||
|
||||
```
|
||||
pim@chrma0:~$ for i in $(birdc show protocol | grep member | cut -f1 -d' '); do echo -n $i\ ; birdc
|
||||
show protocol all $i | grep Routes; done
|
||||
chix_member_35202_ipv4_1 2 imported, 0 filtered, 159984 exported, 0 preferred
|
||||
chix_member_35202_ipv6_1 2 imported, 0 filtered, 61730 exported, 0 preferred
|
||||
chix_member_210312_ipv4_1 3 imported, 0 filtered, 3518 exported, 3 preferred
|
||||
chix_member_210312_ipv6_1 2 imported, 0 filtered, 1251 exported, 2 preferred
|
||||
comix_member_35202_ipv4_1 2 imported, 0 filtered, 159981 exported, 2 preferred
|
||||
comix_member_35202_ipv4_2 2 imported, 0 filtered, 159981 exported, 1 preferred
|
||||
comix_member_35202_ipv6_1 2 imported, 0 filtered, 61727 exported, 2 preferred
|
||||
comix_member_35202_ipv6_2 2 imported, 0 filtered, 61727 exported, 1 preferred
|
||||
fogixp_member_212635_ipv4_1 1 imported, 0 filtered, 442 exported, 1 preferred
|
||||
fogixp_member_212635_ipv6_1 14 imported, 0 filtered, 181 exported, 14 preferred
|
||||
freeix_ch_member_210312_ipv4_1 3 imported, 0 filtered, 3521 exported, 0 preferred
|
||||
freeix_ch_member_210312_ipv6_1 2 imported, 0 filtered, 1253 exported, 0 preferred
|
||||
```
|
||||
|
||||
Let me make a few observations:
|
||||
* Hurricane Electric AS6939 is present at CHIX, and they tend to announce a very large number of
|
||||
prefixes. So every member who is permitted (and not inhibited) at CHIX will see all of those: Sam's
|
||||
AS35202 is inhibited on CommunityIX but not on CHIX, and he's permitted on both. That explains why
|
||||
he is seeing the routes on both sessions.
|
||||
* I've inhibited Jurrian's AS212635 to/from both CHIX and FogIXP, which means he will be seeing
|
||||
CommunityIX (~245 IPv4, 85 IPv6 prefixes), and FreeIX CH (~173 IPv4 and ~60 IPv6). We also send him
|
||||
the member prefixes, which is about 35 or so additional prefixes. This explains why Jurrian is
|
||||
receiving from us ~440 IPv4 and ~180 IPv6.
|
||||
* Antonios' AS210312, the exemplar in this article, is receiving all-but-CHIX. FogIXP yields 3077
|
||||
or so IPv4 and 1056 IPv6 prefixes, while I've already added up FreeIX, CommunityIX, and our members
|
||||
(this is what we're sending Jurrian!), at roughly 330 and 180 respectively, so Antonios should be getting about 3500 IPv4
|
||||
prefixes and 1250 IPv6 prefixes; a quick sanity check of this arithmetic follows below.
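
Adding those up (all figures are approximate and taken from the counts above):

```python
fogixp_v4, fogixp_v6 = 3077, 1056   # learned at FogIXP
rest_v4, rest_v6 = 330, 180         # FreeIX CH + CommunityIX + member prefixes (what Jurrian receives)
print(fogixp_v4 + rest_v4, fogixp_v6 + rest_v6)   # 3407 1236, close to the 3518/1251 exported to AS210312
```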
|
||||
|
||||
In the other direction, I would expect to be announcing to _peers_ only prefixes belonging to either
|
||||
AS50869 itself, or those of our members:
|
||||
|
||||
```
|
||||
pim@chrma0:~$ for i in $(birdc show protocol | grep peer.*_1 | cut -f1 -d' '); do echo -n $i\ ; birdc
|
||||
show protocol all $i | grep Routes || echo; done
|
||||
chix_peer_212100_ipv4_1 57618 imported, 0 filtered, 24 exported, 778 preferred
|
||||
chix_peer_212100_ipv6_1 21979 imported, 1 filtered, 37 exported, 7186 preferred
|
||||
chix_peer_13335_ipv4_1 4767 imported, 9 filtered, 24 exported, 4765 preferred
|
||||
chix_peer_13335_ipv6_1 371 imported, 1 filtered, 37 exported, 369 preferred
|
||||
chix_peer_6939_ipv4_1 151787 imported, 27 filtered, 24 exported, 133943 preferred
|
||||
chix_peer_6939_ipv6_1 61191 imported, 6 filtered, 37 exported, 16223 preferred
|
||||
comix_peer_44596_ipv4_1 594 imported, 0 filtered, 25 exported, 10 preferred
|
||||
comix_peer_44596_ipv6_1 1147 imported, 0 filtered, 50 exported, 0 preferred
|
||||
comix_peer_8298_ipv4_1 23 imported, 0 filtered, 25 exported, 0 preferred
|
||||
comix_peer_8298_ipv6_1 34 imported, 0 filtered, 50 exported, 0 preferred
|
||||
fogixp_peer_47498_ipv4_1 3286 imported, 1 filtered, 27 exported, 3077 preferred
|
||||
fogixp_peer_47498_ipv6_1 1838 imported, 0 filtered, 39 exported, 1056 preferred
|
||||
freeix_ch_peer_51530_ipv4_1 355 imported, 0 filtered, 28 exported, 0 preferred
|
||||
freeix_ch_peer_51530_ipv6_1 143 imported, 0 filtered, 53 exported, 0 preferred
|
||||
```
|
||||
|
||||
Some observations:
|
||||
|
||||
* Nobody is inhibited at FreeIX Switzerland. It stands to reason therefore, that it has the most
|
||||
exported prefixes: 28 for IPv4 and 53 for IPv6.
|
||||
* Two members are inhibited at CHIX, which gives it the lowest number of exported prefixes:
|
||||
24 for IPv4 and 27 for IPv6.
|
||||
* All peers at each exchange (group) will be announced the same number of prefixes. I can confirm that
|
||||
at CHIX, all three peers have the same number of announced prefixes. Similarly, at CommunityIX, all
|
||||
peers have the same number.
|
||||
* If Antonios, Sam or Jurrian would add an outgoing announcement to AS50869 with an additional inhibit
|
||||
BGP community (eg `(50869,3020,1)` to inhibit country Switzerland), they could tweak these numbers.
|
||||
|
||||
## What's next
|
||||
|
||||
This all adds up. I'd like to test the waters with my friendly neighborhood canaries a little bit,
|
||||
to make sure that announcements are as expected, and traffic flows where appropriate. In the meantime,
|
||||
I'll chase the deployment of LSIX, FrysIX, SpeedIX and possibly a few others in Amsterdam. And of
|
||||
course FreeIX Greece in Thessaloniki. I'll try to get the Milano VPP router deployed (it's already
|
||||
installed and configured, but currently powered off) and connected to PCIX, MIX and a few others.
|
||||
|
||||
## How can you help?
|
||||
|
||||
If you're willing to participate with a VPP router and connect it to either multiple local internet
|
||||
exchanges (like I've demonstrated in Zurich), or better yet, to one or more of the other existing
|
||||
routers, I would welcome your contribution. [[Contact]({{< ref contact.md >}})] me for details.
|
||||
|
||||
A bit further down the pike, a connection from Amsterdam to Zurich, from Zurich to Milan and from
|
||||
Milan to Thessaloniki is on the horizon. If you are willing and able to donate some bandwidth (point
|
||||
to point VPWS, VLL, L2VPN) and your transport network is capable of at least 2026 bytes of _inner_
|
||||
payload, please also [[reach out]({{< ref contact.md >}})] as I'm sure many small network operators
|
||||
would be thrilled.
|
||||
content/articles/2025-02-08-sflow-3.md (new file, +857 lines)
|
||||
---
|
||||
date: "2025-02-08T07:51:23Z"
|
||||
title: 'VPP with sFlow - Part 3'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width="12em" >}}
|
||||
|
||||
In the second half of last year, I picked up a project together with Neil McKee of
|
||||
[[inMon](https://inmon.com/)], the caretakers of [[sFlow](https://sflow.org)]: an industry standard
|
||||
technology for monitoring high speed networks. `sFlow` gives complete visibility into the
|
||||
use of networks enabling performance optimization, accounting/billing for usage, and defense against
|
||||
security threats.
|
||||
|
||||
The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
|
||||
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
|
||||
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
|
||||
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for
|
||||
a small portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but
|
||||
also in the VPP software dataplane. The agent then _transmits_ these samples using a Linux kernel
|
||||
feature called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)].
|
||||
This greatly reduces the complexity of code to be implemented in the forwarding path, while at the
|
||||
same time bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business
|
||||
logic for the more complex state keeping, packet marshalling and transmission from the _Agent_ to a
|
||||
central _Collector_.
|
||||
|
||||
In this third article, I wanted to spend some time discussing how samples make their way out of the
|
||||
VPP dataplane, and into higher level tools.
|
||||
|
||||
## Recap: sFlow
|
||||
|
||||
{{< image float="left" src="/assets/sflow/sflow-overview.png" alt="sFlow Overview" width="14em" >}}
|
||||
|
||||
sFlow describes a method for Monitoring Traffic in Switched/Routed Networks, originally described in
|
||||
[[RFC3176](https://datatracker.ietf.org/doc/html/rfc3176)]. The current specification is version 5
|
||||
and is homed on the sFlow.org website [[ref](https://sflow.org/sflow_version_5.txt)]. Typically, a
|
||||
Switching ASIC in the dataplane (seen at the bottom of the diagram to the left) is asked to copy
|
||||
1-in-N packets to the local sFlow Agent.
|
||||
|
||||
**Sampling**: The agent will copy the first N bytes (typically 128) of the packet into a sample. As
|
||||
the ASIC knows which interface the packet was received on, the `inIfIndex` will be added. After a
|
||||
routing decision is made, the nexthop and its L2 address and interface become known. The ASIC might
|
||||
annotate the sample with this `outIfIndex` and `DstMAC` metadata as well.
|
||||
|
||||
**Drop Monitoring**: There's one rather clever insight that sFlow gives: what if the packet _was
|
||||
not_ routed or switched, but rather discarded? For this, sFlow is able to describe the reason for
|
||||
the drop. For example, the ASIC receive queue could have been overfull, or it did not find a
|
||||
destination to forward the packet to (no FIB entry), perhaps it was instructed by an ACL to drop the
|
||||
packet or maybe even tried to transmit the packet but the physical datalink layer had to abandon the
|
||||
transmission for whatever reason (link down, TX queue full, link saturation, and so on). It's hard
|
||||
to overstate how important it is to have this so-called _drop monitoring_, as operators often spend
|
||||
hours and hours figuring out _why_ packets are lost in their network or datacenter switching fabric.
|
||||
|
||||
**Metadata**: The agent may have other metadata as well, such as which prefix was the source and
|
||||
destination of the packet, what additional RIB information is available (AS path, BGP communities,
|
||||
and so on). This may be added to the sample record as well.
|
||||
|
||||
**Counters**: Since sFlow is sampling 1:N packets, the system can estimate total traffic in a
|
||||
reasonably accurate way. Peter and Sonia wrote a succinct
|
||||
[[paper](https://sflow.org/packetSamplingBasics/)] about the math, so I won't get into that here.
|
||||
Mostly because I am but a software engineer, not a statistician... :) However, I will say this: if a
|
||||
fraction of the traffic is sampled but the _Agent_ knows how many bytes and packets were forwarded
|
||||
in total, it can provide an overview with a quantifiable accuracy. This is why the _Agent_ will
|
||||
periodically get the interface counters from the ASIC.
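
As a rough illustration of that math: scaling up is just multiplication by the sampling rate, and
the error bound quoted in the linked paper is roughly 196*sqrt(1/c) percent at 95% confidence for c
samples of a given class. Treat the formula as the paper's, not mine; this is only a
back-of-the-envelope helper:

```python
import math

def estimate_total(samples_seen, sampling_N):
    """Scale a sampled packet count back up, with the approximate 95% error bound
    from the packet sampling paper referenced above."""
    total = samples_seen * sampling_N
    pct_error = 196.0 * math.sqrt(1.0 / samples_seen) if samples_seen else float("inf")
    return total, pct_error

total, err = estimate_total(samples_seen=400, sampling_N=10_000)
print(f"~{total} packets, accurate to within about {err:.1f}%")  # ~4000000 packets, within about 9.8%
```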
|
||||
|
||||
**Collector**: One or more samples can be concatenated into UDP messages that go from the _sFlow
|
||||
Agent_ to a central _sFlow Collector_. The heavy lifting in analysis is done upstream from the
|
||||
switch or router, which is great for performance. Many thousands or even tens of thousands of
|
||||
agents can forward their samples and interface counters to a single central collector, which in turn
|
||||
can be used to draw up a near real time picture of the state of traffic through even the largest of
|
||||
ISP networks or datacenter switch fabrics.
|
||||
|
||||
In sFlow parlance [[VPP](https://fd.io/)] and its companion
|
||||
[[hsflowd](https://github.com/sflow/host-sflow)] together form an _Agent_ (it sends the UDP packets
|
||||
over the network), and for example the commandline tool `sflowtool` could be a _Collector_ (it
|
||||
receives the UDP packets).
|
||||
|
||||
## Recap: sFlow in VPP
|
||||
|
||||
First, I have some pretty good news to report - our work on this plugin was
|
||||
[[merged](https://gerrit.fd.io/r/c/vpp/+/41680)] and will be included in the VPP 25.02 release in a
|
||||
few weeks! Last weekend, I gave a lightning talk at
|
||||
[[FOSDEM](https://fosdem.org/2025/schedule/event/fosdem-2025-4196-vpp-monitoring-100gbps-with-sflow/)]
|
||||
in Brussels, Belgium, and caught up with a lot of community members and network- and software
|
||||
engineers. I had a great time.
|
||||
|
||||
In trying to keep the amount of code as small as possible, and therefore the probability of bugs that
|
||||
might impact VPP's dataplane stability low, the architecture of the end to end solution consists of
|
||||
three distinct parts, each with their own risk and performance profile:
|
||||
|
||||
{{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}}
|
||||
|
||||
**1. sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves
|
||||
packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin
|
||||
will inspect 1-in-N, taking a sample for further processing. Here, we don't try to be clever, simply
|
||||
copy the `inIfIndex` and the first bytes of the ethernet frame, and append them to a
|
||||
[[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] queue. If too many samples
|
||||
arrive, samples are dropped at the tail, and a counter incremented. This way, I can tell when the
|
||||
dataplane is congested. Bounded FIFOs also provide fairness: it allows for each VPP worker thread to
|
||||
get their fair share of samples into the Agent's hands.
|
||||
|
||||
**2. sFlow main process**: There's a function running on the _main thread_, which shifts further
|
||||
processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it
|
||||
consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones
|
||||
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
|
||||
configurable), it'll grab all interface counters from those interfaces for which I have sFlow
|
||||
turned on. VPP produces _Netlink_ messages and sends them to the kernel.
|
||||
|
||||
**3. Host sFlow daemon**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_
|
||||
messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on
|
||||
hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy,
|
||||
this module already exists. But Neil implemented a _mod_vpp_ which can grab interface names and their
|
||||
`ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages alongside
|
||||
the PSAMPLEs.
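
As an aside, the tail-drop behavior of the per-worker FIFO from step 1 is easy to model. This toy
Python sketch is purely illustrative (the real queue is C inside the plugin); it shows why a bounded
queue with a drop counter both protects the dataplane and tells you when you are oversampling:

```python
from collections import deque

class SampleFifo:
    """Toy model of a per-worker sample FIFO: bounded, tail-drop, with a drop counter."""
    def __init__(self, depth):
        self.queue, self.depth, self.dropped = deque(), depth, 0

    def push(self, if_index, header):
        if len(self.queue) >= self.depth:
            self.dropped += 1          # worker produces faster than the main loop consumes
            return False
        self.queue.append((if_index, header))
        return True

fifo = SampleFifo(depth=2)
for _ in range(3):
    fifo.push(1, b"\x00" * 128)
print(len(fifo.queue), fifo.dropped)   # 2 1
```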
|
||||
|
||||
|
||||
By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref
|
||||
2021-08-25-vpp-4 >}})] plugin. It's a mechanism for programs running in userspace to share
|
||||
information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from
|
||||
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
|
||||
producer/subscriber relationship and nothing precludes one userspace process (`vpp`) to be the
|
||||
producer while another userspace process (`hsflowd`) acts as the consumer!
|
||||
|
||||
Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the rest,
|
||||
giving correctness and upstream interoperability pretty much for free. That's slick!
|
||||
|
||||
### VPP: sFlow Configuration
|
||||
|
||||
The solution that I offer is based on two moving parts. First, the VPP plugin configuration, which
|
||||
turns on sampling at a given rate on physical devices, also known as _hardware-interfaces_. Second,
|
||||
the open source component [[host-sflow](https://github.com/sflow/host-sflow/releases)] can be
|
||||
configured as of release v2.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
|
||||
|
||||
I will show how to configure VPP in three ways:
|
||||
|
||||
***1. VPP Configuration via CLI***
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ vppctl
|
||||
vpp0-0# sflow sampling-rate 100
|
||||
vpp0-0# sflow polling-interval 10
|
||||
vpp0-0# sflow header-bytes 128
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/0
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/0 disable
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/2
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/3
|
||||
```
|
||||
|
||||
The first three commands set the global defaults - in my case I'm going to be sampling at 1:100
|
||||
which is an unusually high rate. A production setup may take 1-in-_linkspeed-in-megabits_ so for a
|
||||
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
|
||||
appropriate, depending on link load. The second command sets the interface stats polling interval.
|
||||
The default is to gather these statistics every 20 seconds, but I set it to 10s here.
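
That rule of thumb is easy to capture in a tiny helper; this is just the heuristic from the
paragraph above, not something the plugin enforces:

```python
def suggested_sampling_N(link_speed_mbps):
    """Sample roughly 1-in-<link speed in Mbit/s>, per the rule of thumb above."""
    return max(1, int(link_speed_mbps))

print(suggested_sampling_N(1_000))     # 1 Gbps  -> 1:1000
print(suggested_sampling_N(100_000))   # 100 GbE -> 1:100000
```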
|
||||
|
||||
Next, I tell the plugin how many bytes of the sampled ethernet frame should be taken. Common
|
||||
values are 64 and 128 but it doesn't have to be a power of two. I want enough data to see the
|
||||
headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but the contents of
|
||||
the payload are rarely interesting for
|
||||
statistics purposes.
|
||||
|
||||
Finally, I can turn on the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP,
|
||||
an idiomatic way to turn on and off things is to have an enabler/disabler. It feels a bit clunky
|
||||
maybe to write `sflow enable $iface disable` but it makes more logical sense if you parse that as
|
||||
"enable-disable" with the default being the "enable" operation, and the alternate being the
|
||||
"disable" operation.
|
||||
|
||||
***2. VPP Configuration via API***
|
||||
|
||||
I implemented a few API methods for the most common operations. Here's a snippet that obtains the
|
||||
same config as what I typed on the CLI above, but using these Python API calls:
|
||||
|
||||
```python
|
||||
from vpp_papi import VPPApiClient, VPPApiJSONFiles
|
||||
import sys
|
||||
|
||||
vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
|
||||
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
|
||||
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
|
||||
vpp.connect("sflow-api-client")
|
||||
print(vpp.api.show_version().version)
|
||||
# Output: 25.06-rc0~14-g9b1c16039
|
||||
|
||||
vpp.api.sflow_sampling_rate_set(sampling_N=100)
|
||||
print(vpp.api.sflow_sampling_rate_get())
|
||||
# Output: sflow_sampling_rate_get_reply(_0=655, context=3, sampling_N=100)
|
||||
|
||||
vpp.api.sflow_polling_interval_set(polling_S=10)
|
||||
print(vpp.api.sflow_polling_interval_get())
|
||||
# Output: sflow_polling_interval_get_reply(_0=661, context=5, polling_S=10)
|
||||
|
||||
vpp.api.sflow_header_bytes_set(header_B=128)
|
||||
print(vpp.api.sflow_header_bytes_get())
|
||||
# Output: sflow_header_bytes_get_reply(_0=665, context=7, header_B=128)
|
||||
|
||||
vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True)
|
||||
vpp.api.sflow_enable_disable(hw_if_index=2, enable_disable=True)
|
||||
print(vpp.api.sflow_interface_dump())
|
||||
# Output: [ sflow_interface_details(_0=667, context=8, hw_if_index=1),
|
||||
# sflow_interface_details(_0=667, context=8, hw_if_index=2) ]
|
||||
|
||||
print(vpp.api.sflow_interface_dump(hw_if_index=2))
|
||||
# Output: [ sflow_interface_details(_0=667, context=9, hw_if_index=2) ]
|
||||
|
||||
print(vpp.api.sflow_interface_dump(hw_if_index=1234)) ## Invalid hw_if_index
|
||||
# Output: []
|
||||
|
||||
vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=False)
|
||||
print(vpp.api.sflow_interface_dump())
|
||||
# Output: [ sflow_interface_details(_0=667, context=10, hw_if_index=2) ]
|
||||
```
|
||||
|
||||
This short program toys around a bit with the sFlow API. I first set the sampling to 1:100 and get
|
||||
the current value. Then I set the polling interval to 10s and retrieve the current value again.
|
||||
Finally, I set the header bytes to 128, and retrieve the value again.
|
||||
|
||||
Enabling and disabling sFlow on interfaces shows the idiom I mentioned before - the API being an
|
||||
`*_enable_disable()` call of sorts, and typically taking a boolean argument if the operator wants to
|
||||
enable (the default), or disable sFlow on the interface. Getting the list of enabled interfaces can
|
||||
be done with the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details`
|
||||
messages.
|
||||
|
||||
I demonstrated VPP's Python API and how it works in a fair amount of detail in a [[previous
|
||||
article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you.
|
||||
|
||||
***3. VPPCfg YAML Configuration***
|
||||
|
||||
Writing on the CLI and calling the API is good and all, but many users of VPP have noticed that it
|
||||
does not have any form of configuration persistence and that's deliberate. VPP's goal is to be a
|
||||
programmable dataplane, and explicitly has left the programming and configuration as an exercise for
|
||||
integrators. I have written a Python project that takes a YAML file as input and uses it to
|
||||
configure (and reconfigure, on the fly) the dataplane automatically, called
|
||||
[[VPPcfg](https://git.ipng.ch/ipng/vppcfg.git)]. Previously, I wrote some implementation thoughts
|
||||
on its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2
|
||||
>}})] so I won't repeat that here. Instead, I will just show the configuration:
|
||||
|
||||
```
pim@vpp0-0:~$ cat << EOF > vppcfg.yaml
interfaces:
  GigabitEthernet10/0/0:
    sflow: true
  GigabitEthernet10/0/1:
    sflow: true
  GigabitEthernet10/0/2:
    sflow: true
  GigabitEthernet10/0/3:
    sflow: true

sflow:
  sampling-rate: 100
  polling-interval: 10
  header-bytes: 128
EOF
pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Loading configfile vppcfg.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.reconciler.write: Wrote 13 lines to /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Planning succeeded
pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp
```
|
||||
|
||||
The nifty thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to
|
||||
1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg
|
||||
apply` stages and the VPP dataplane will be reprogrammed to reflect the newly declared configuration.
|
||||
|
||||
### hsflowd: Configuration
|
||||
|
||||
When sFlow is enabled, VPP will start to emit _Netlink_ messages of type PSAMPLE with packet samples
|
||||
and of type USERSOCK with the custom messages containing interface names and counters. These latter
|
||||
custom messages have to be decoded, which is done by the _mod_vpp_ module in `hsflowd`, starting
|
||||
from release v2.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
|
||||
|
||||
Here's a minimalist configuration:
|
||||
|
||||
```
pim@vpp0-0:~$ cat /etc/hsflowd.conf
sflow {
  collector { ip=127.0.0.1 udpport=16343 }
  collector { ip=192.0.2.1 namespace=dataplane }
  psample { group=1 }
  vpp { osIndex=off }
}
```
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
There are two important details that can be confusing at first: \
|
||||
**1.** kernel network namespaces \
|
||||
**2.** interface index namespaces
|
||||
|
||||
#### hsflowd: Network namespace
|
||||
|
||||
Network namespaces virtualize Linux's network stack. Upon creation, a network namespace contains only
|
||||
a loopback interface, and subsequently interfaces can be moved between namespaces. Each network
|
||||
namespace will have its own set of IP addresses, its own routing table, socket listing, connection
|
||||
tracking table, firewall, and other network-related resources. When started by systemd, `hsflowd`
|
||||
and VPP will normally both run in the _default_ network namespace.
|
||||
|
||||
Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will
|
||||
naturally do this in the network namespace that its VPP process is running in (the _default_
|
||||
namespace, normally). It is therefore important that the recipient of these Netlink messages,
|
||||
notably `hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them together in
|
||||
a different namespace (eg. a container in Kubernetes or Docker), as long as they can see each other.
|
||||
|
||||
It might pose a problem if the network connectivity lives in a different namespace than the default
|
||||
one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface
|
||||
pairs, _LIPs_, in a dataplane namespace. The main reason for doing this is to allow something like
|
||||
FRR or Bird to completely govern the routing table in the kernel and keep it in-sync with the FIB in
|
||||
VPP. In such a _dataplane_ network namespace, typically every interface is owned by VPP.
|
||||
|
||||
Luckily, `hsflowd` can attach to one (default) namespace to get the PSAMPLEs, but create a socket in
|
||||
a _different_ (dataplane) namespace to send packets to a collector. This explains the second
|
||||
_collector_ entry in the config-file above. Here, `hsflowd` will send UDP packets to 192.0.2.1:6343
|
||||
from within the (VPP) dataplane namespace, and to 127.0.0.1:16343 in the default namespace.
|
||||
|
||||
#### hsflowd: osIndex
|
||||
|
||||
I hope the previous section made some sense, because this one will be a tad more esoteric. When
|
||||
creating a network namespace, each interface will get its own uint32 interface index that identifies
|
||||
it, and such an ID is typically called an `ifIndex`. It's important to note that the same number can
|
||||
(and will!) occur multiple times, once for each namespace. Let me give you an example:
|
||||
|
||||
```
|
||||
pim@summer:~$ ip link
|
||||
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
|
||||
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
||||
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ipng-sl state UP ...
|
||||
link/ether 00:22:19:6a:46:2e brd ff:ff:ff:ff:ff:ff
|
||||
altname enp1s0f0
|
||||
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 900 qdisc mq master ipng-sl state DOWN ...
|
||||
link/ether 00:22:19:6a:46:30 brd ff:ff:ff:ff:ff:ff
|
||||
altname enp1s0f1
|
||||
|
||||
pim@summer:~$ ip netns exec dataplane ip link
|
||||
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
|
||||
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
||||
2: loop0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
|
||||
link/ether de:ad:00:00:00:00 brd ff:ff:ff:ff:ff:ff
|
||||
3: xe1-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
|
||||
link/ether 00:1b:21:bd:c7:18 brd ff:ff:ff:ff:ff:ff
|
||||
```
|
||||
|
||||
I want to draw your attention to the number at the beginning of the line. In the _default_
|
||||
namespace, `ifIndex=3` corresponds to `ifName=eno2` (which has no link, it's marked `DOWN`). But in
|
||||
the _dataplane_ namespace, that index corresponds to a completely different interface called
|
||||
`ifName=xe1-0` (which is link `UP`).
|
||||
|
||||
Now, let me show you the interfaces in VPP:
|
||||
|
||||
```
|
||||
pim@summer:~$ vppctl show int | egrep 'Name|loop0|tap0|Gigabit'
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS)
|
||||
GigabitEthernet4/0/0 1 up 9000/0/0/0
|
||||
GigabitEthernet4/0/1 2 down 9000/0/0/0
|
||||
GigabitEthernet4/0/2 3 down 9000/0/0/0
|
||||
GigabitEthernet4/0/3 4 down 9000/0/0/0
|
||||
TenGigabitEthernet5/0/0 5 up 9216/0/0/0
|
||||
TenGigabitEthernet5/0/1 6 up 9216/0/0/0
|
||||
loop0 7 up 9216/0/0/0
|
||||
tap0 19 up 9216/0/0/0
|
||||
```
|
||||
|
||||
Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_
|
||||
(the software interface index, as opposed to hardware index). Here, `ifIndex=3` corresponds to
|
||||
`ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace!
|
||||
|
||||
It turns out that there are three (relevant) types of namespaces at play here:
|
||||
1. ***Linux network*** namespace; here using `dataplane` and `default` each with their own unique
|
||||
(and overlapping) numbering.
|
||||
1. ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP
|
||||
first attaches to or creates network interfaces like the ones from DPDK or RDMA, these will
|
||||
create an _hw_if_index_ in a list.
|
||||
1. ***VPP software*** interface namespace. All interfaces (including hardware ones!) will
|
||||
receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on
|
||||
GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available
|
||||
software index (in this example, `sw_if_index=7`).
|
||||
|
||||
In Linux CP, I can see a mapping from one to the other, just look at this:
|
||||
|
||||
```
|
||||
pim@summer:~$ vppctl show lcp
|
||||
lcp default netns dataplane
|
||||
lcp lcp-auto-subint off
|
||||
lcp lcp-sync on
|
||||
lcp lcp-sync-unnumbered on
|
||||
itf-pair: [0] loop0 tap0 loop0 2 type tap netns dataplane
|
||||
itf-pair: [1] TenGigabitEthernet5/0/0 tap1 xe1-0 3 type tap netns dataplane
|
||||
itf-pair: [2] TenGigabitEthernet5/0/1 tap2 xe1-1 4 type tap netns dataplane
|
||||
itf-pair: [3] TenGigabitEthernet5/0/0.20 tap1.20 xe1-0.20 5 type tap netns dataplane
|
||||
```
|
||||
|
||||
Those `itf-pair` describe our _LIPs_, and they have the coordinates to three things. 1) The VPP
|
||||
software interface (VPP `ifName=loop0` with `sw_if_index=7`), which 2) Linux CP will mirror into the
|
||||
Linux kernel using a TAP device (VPP `ifName=tap0` with `sw_if_index=19`). That TAP has one leg in
|
||||
VPP (`tap0`), and another in 3) Linux (with `ifName=loop0` and `ifIndex=2` in namespace `dataplane`).
|
||||
|
||||
> So the tuple that fully describes a _LIP_ is `{7, 19,'dataplane', 2}`
|
||||
|
||||
Climbing back out of that rabbit hole, I am now finally ready to explain the feature. When sFlow in
|
||||
VPP takes its sample, it will be doing this on a PHY, that is a given interface with a specific
|
||||
_hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has a
|
||||
choice: should it share with the world the representation of *its* namespace, or should it try to be
|
||||
smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the
|
||||
plugin will first resolve the _sw_if_index_ belonging to that PHY, and using that, try to look up a
|
||||
_LIP_ with it. If it finds one, it'll know both the namespace in which it lives as well as the
|
||||
osIndex in that namespace. If it doesn't find a _LIP_, it will at least have the _sw_if_index_ at
|
||||
hand, so it'll annotate the USERSOCK counter messages with this information instead.
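
In pseudo-Python, the lookup described above goes roughly like this. The table names and shapes are
illustrative only; the actual resolution happens inside the plugin with help from the Linux CP
plugin:

```python
def resolve_ifindex(hw_if_index, hw_to_sw, lips):
    """Prefer the Linux (namespace, ifIndex) of a LIP if one exists for this PHY,
    otherwise fall back to VPP's own sw_if_index."""
    sw_if_index = hw_to_sw[hw_if_index]
    if sw_if_index in lips:
        netns, os_index = lips[sw_if_index]
        return {"netns": netns, "ifindex": os_index, "source": "linux-cp"}
    return {"netns": None, "ifindex": sw_if_index, "source": "vpp"}

# TenGigabitEthernet5/0/0 (sw_if_index 5) has a LIP whose Linux side is xe1-0,
# ifIndex 3 in the 'dataplane' namespace; the hw_if_index value here is hypothetical.
hw_to_sw = {1: 5}
lips = {5: ("dataplane", 3)}
print(resolve_ifindex(1, hw_to_sw, lips))  # {'netns': 'dataplane', 'ifindex': 3, 'source': 'linux-cp'}
```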
|
||||
|
||||
Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an
|
||||
implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases
|
||||
relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on`
|
||||
(use Linux ifIndex) or `off` (use VPP _sw_if_index_).
|
||||
|
||||
### hsflowd: Host Counters
|
||||
|
||||
Now that I understand the configuration parts of VPP and `hsflowd`, I decide to configure everything
|
||||
but without enabling sFlow on any interfaces yet in VPP. Once I start the daemon, I can see that
|
||||
it sends a UDP packet every 30 seconds to the configured _collector_:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n
|
||||
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
|
||||
listening on lo, link-type EN10MB (Ethernet), snapshot length 9000 bytes
|
||||
15:34:19.695042 IP 127.0.0.1.48753 > 127.0.0.1.6343: sFlowv5,
|
||||
IPv4 agent 198.19.5.16, agent-id 100000, length 716
|
||||
```
|
||||
|
||||
The `tcpdump` I have on my Debian bookworm machines doesn't know how to decode the contents of these
|
||||
sFlow packets. Actually, neither does Wireshark. I've attached a file of these mysterious packets
|
||||
[[sflow-host.pcap](/assets/sflow/sflow-host.pcap)] in case you want to take a look.
|
||||
Neil however gives me a tip. A full message decoder and otherwise handy Swiss army knife lives in
|
||||
[[sflowtool](https://github.com/sflow/sflowtool)].
|
||||
|
||||
I can offer this pcap file to `sflowtool`, or let it just listen on the UDP port directly, and
|
||||
it'll tell me what it finds:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ sflowtool -p 6343
|
||||
startDatagram =================================
|
||||
datagramSourceIP 127.0.0.1
|
||||
datagramSize 716
|
||||
unixSecondsUTC 1739112018
|
||||
localtime 2025-02-09T15:40:18+0100
|
||||
datagramVersion 5
|
||||
agentSubId 100000
|
||||
agent 198.19.5.16
|
||||
packetSequenceNo 57
|
||||
sysUpTime 987398
|
||||
samplesInPacket 1
|
||||
startSample ----------------------
|
||||
sampleType_tag 0:4
|
||||
sampleType COUNTERSSAMPLE
|
||||
sampleSequenceNo 33
|
||||
sourceId 2:1
|
||||
counterBlock_tag 0:2001
|
||||
adaptor_0_ifIndex 2
|
||||
adaptor_0_MACs 1
|
||||
adaptor_0_MAC_0 525400f00100
|
||||
counterBlock_tag 0:2010
|
||||
udpInDatagrams 123904
|
||||
udpNoPorts 23132459
|
||||
udpInErrors 0
|
||||
udpOutDatagrams 46480629
|
||||
udpRcvbufErrors 0
|
||||
udpSndbufErrors 0
|
||||
udpInCsumErrors 0
|
||||
counterBlock_tag 0:2009
|
||||
tcpRtoAlgorithm 1
|
||||
tcpRtoMin 200
|
||||
tcpRtoMax 120000
|
||||
tcpMaxConn 4294967295
|
||||
tcpActiveOpens 0
|
||||
tcpPassiveOpens 30
|
||||
tcpAttemptFails 0
|
||||
tcpEstabResets 0
|
||||
tcpCurrEstab 1
|
||||
tcpInSegs 89120
|
||||
tcpOutSegs 86961
|
||||
tcpRetransSegs 59
|
||||
tcpInErrs 0
|
||||
tcpOutRsts 4
|
||||
tcpInCsumErrors 0
|
||||
counterBlock_tag 0:2008
|
||||
icmpInMsgs 23129314
|
||||
icmpInErrors 32
|
||||
icmpInDestUnreachs 0
|
||||
icmpInTimeExcds 23129282
|
||||
icmpInParamProbs 0
|
||||
icmpInSrcQuenchs 0
|
||||
icmpInRedirects 0
|
||||
icmpInEchos 0
|
||||
icmpInEchoReps 32
|
||||
icmpInTimestamps 0
|
||||
icmpInAddrMasks 0
|
||||
icmpInAddrMaskReps 0
|
||||
icmpOutMsgs 0
|
||||
icmpOutErrors 0
|
||||
icmpOutDestUnreachs 23132467
|
||||
icmpOutTimeExcds 0
|
||||
icmpOutParamProbs 23132467
|
||||
icmpOutSrcQuenchs 0
|
||||
icmpOutRedirects 0
|
||||
icmpOutEchos 0
|
||||
icmpOutEchoReps 0
|
||||
icmpOutTimestamps 0
|
||||
icmpOutTimestampReps 0
|
||||
icmpOutAddrMasks 0
|
||||
icmpOutAddrMaskReps 0
|
||||
counterBlock_tag 0:2007
|
||||
ipForwarding 2
|
||||
ipDefaultTTL 64
|
||||
ipInReceives 46590552
|
||||
ipInHdrErrors 0
|
||||
ipInAddrErrors 0
|
||||
ipForwDatagrams 0
|
||||
ipInUnknownProtos 0
|
||||
ipInDiscards 0
|
||||
ipInDelivers 46402357
|
||||
ipOutRequests 69613096
|
||||
ipOutDiscards 0
|
||||
ipOutNoRoutes 80
|
||||
ipReasmTimeout 0
|
||||
ipReasmReqds 0
|
||||
ipReasmOKs 0
|
||||
ipReasmFails 0
|
||||
ipFragOKs 0
|
||||
ipFragFails 0
|
||||
ipFragCreates 0
|
||||
counterBlock_tag 0:2005
|
||||
disk_total 6253608960
|
||||
disk_free 2719039488
|
||||
disk_partition_max_used 56.52
|
||||
disk_reads 11512
|
||||
disk_bytes_read 626214912
|
||||
disk_read_time 48469
|
||||
disk_writes 1058955
|
||||
disk_bytes_written 8924332032
|
||||
disk_write_time 7954804
|
||||
counterBlock_tag 0:2004
|
||||
mem_total 8326963200
|
||||
mem_free 5063872512
|
||||
mem_shared 0
|
||||
mem_buffers 86425600
|
||||
mem_cached 827752448
|
||||
swap_total 0
|
||||
swap_free 0
|
||||
page_in 306365
|
||||
page_out 4357584
|
||||
swap_in 0
|
||||
swap_out 0
|
||||
counterBlock_tag 0:2003
|
||||
cpu_load_one 0.030
|
||||
cpu_load_five 0.050
|
||||
cpu_load_fifteen 0.040
|
||||
cpu_proc_run 1
|
||||
cpu_proc_total 138
|
||||
cpu_num 2
|
||||
cpu_speed 1699
|
||||
cpu_uptime 1699306
|
||||
cpu_user 64269210
|
||||
cpu_nice 1810
|
||||
cpu_system 34690140
|
||||
cpu_idle 3234293560
|
||||
cpu_wio 3568580
|
||||
cpuintr 0
|
||||
cpu_sintr 5687680
|
||||
cpuinterrupts 1596621688
|
||||
cpu_contexts 3246142972
|
||||
cpu_steal 329520
|
||||
cpu_guest 0
|
||||
cpu_guest_nice 0
|
||||
counterBlock_tag 0:2006
|
||||
nio_bytes_in 250283
|
||||
nio_pkts_in 2931
|
||||
nio_errs_in 0
|
||||
nio_drops_in 0
|
||||
nio_bytes_out 370244
|
||||
nio_pkts_out 1640
|
||||
nio_errs_out 0
|
||||
nio_drops_out 0
|
||||
counterBlock_tag 0:2000
|
||||
hostname vpp0-0
|
||||
UUID ec933791-d6af-7a93-3b8d-aab1a46d6faa
|
||||
machine_type 3
|
||||
os_name 2
|
||||
os_release 6.1.0-26-amd64
|
||||
endSample ----------------------
|
||||
endDatagram =================================
|
||||
```

If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
this form of heartbeat from the _agent_. There are a lot of vital signs in this packet, including
some non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel
version information. It's super dope!

### hsflowd: Interface Counters

Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to
something very coarse (1 in 100M, so that packet samples stay out of the way for now), and the
interface polling-interval to every 10 seconds, roughly as sketched below. And indeed, every ten
seconds or so I get a few packets, which I captured in
[[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Most of the packets contain only one
counter record, while some contain more than one (in the PCAP, packet #9 has two). If I update the
polling-interval to every second, I can see that most of the packets have all four counters.

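On the VPP side, that configuration is just a handful of CLI statements. I'm reproducing them from
memory here, so take the exact verbs as an assumption and check `vppctl sflow ?` on your build for
the authoritative syntax:

```
pim@vpp0-0:~$ sudo vppctl sflow sampling-rate 100000000   # 1-in-100M: effectively counters only
pim@vpp0-0:~$ sudo vppctl sflow polling-interval 10       # interface counters every 10 seconds
pim@vpp0-0:~$ sudo vppctl sflow enable GigabitEthernet10/0/0
pim@vpp0-0:~$ sudo vppctl sflow enable GigabitEthernet10/0/1
pim@vpp0-0:~$ sudo vppctl sflow enable GigabitEthernet10/0/2
pim@vpp0-0:~$ sudo vppctl sflow enable GigabitEthernet10/0/3
```
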
Those interface counters, as decoded by `sflowtool`, look like this:

```
|
||||
pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \
|
||||
awk '/startSample/ { on=1 } { if (on) { print $0 } } /endSample/ { on=0 }'
|
||||
startSample ----------------------
|
||||
sampleType_tag 0:4
|
||||
sampleType COUNTERSSAMPLE
|
||||
sampleSequenceNo 745
|
||||
sourceId 0:3
|
||||
counterBlock_tag 0:1005
|
||||
ifName GigabitEthernet10/0/2
|
||||
counterBlock_tag 0:1
|
||||
ifIndex 3
|
||||
networkType 6
|
||||
ifSpeed 0
|
||||
ifDirection 1
|
||||
ifStatus 3
|
||||
ifInOctets 858282015
|
||||
ifInUcastPkts 780540
|
||||
ifInMulticastPkts 0
|
||||
ifInBroadcastPkts 0
|
||||
ifInDiscards 0
|
||||
ifInErrors 0
|
||||
ifInUnknownProtos 0
|
||||
ifOutOctets 1246716016
|
||||
ifOutUcastPkts 975772
|
||||
ifOutMulticastPkts 0
|
||||
ifOutBroadcastPkts 0
|
||||
ifOutDiscards 127
|
||||
ifOutErrors 28
|
||||
ifPromiscuousMode 0
|
||||
endSample ----------------------
|
||||
```

What I find particularly cool about it is that sFlow provides an automatic mapping between the
`ifName=GigabitEthernet10/0/2` (tag 0:1005) and an object (tag 0:1) which contains the
`ifIndex=3`, plus lots of packet and octet counters in both the ingress and egress direction. This is
super useful for upstream _collectors_, as they can now find the hostname, agent name and address,
and the correlation between interface names and their indexes. Noice!

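If you want to pull that name-to-index mapping out programmatically, the plain-text output of
`sflowtool` makes it a one-liner; the sketch below prints one `ifIndex -> ifName` line per counter
record found in the pcap:

```
pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \
    awk '$1=="ifName" { name=$2 } $1=="ifIndex" { print $2, "->", name }'
```
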
#### hsflowd: Packet Samples

Now it's time to ratchet up the packet sampling, so I move it from 1:100M to 1:1000, while keeping
the interface polling-interval at 10 seconds, and I ask VPP to sample 64 bytes of each packet that it
inspects. On either side of my pet VPP instance, I start an `iperf3` run to generate some traffic. I
now see a healthy stream of sFlow packets coming in on port 6343. They still contain a host counter
every 30 seconds or so, and every 10 seconds a set of interface counters comes by, but mostly
these UDP packets are showing me samples. I've captured a few minutes of these in
[[sflow-all.pcap](/assets/sflow/sflow-all.pcap)].
Although Wireshark doesn't know how to interpret the sFlow counter messages, it _does_ know how to
interpret the sFlow sample messages, and it reveals one of them like this:

{{< image width="100%" src="/assets/sflow/sflow-wireshark.png" alt="sFlow Wireshark" >}}

Let me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1:48753
to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as
having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to
send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It
then shows that sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those
are sampled. Finally, the first sampled packet starts at the blue line. It shows the SrcMAC and
DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my running
`iperf3`, booyah!

### VPP: sFlow Performance

{{< image float="right" src="/assets/sflow/sflow-lab.png" alt="sFlow Lab" width="20em" >}}

One question I get a lot about this plugin is: what is the performance impact when using
sFlow? I spent a considerable amount of time tinkering with this and, together with Neil, bringing
the plugin to what we both agree is the most efficient use of CPU. We could have gone a bit further,
but that would require somewhat intrusive changes to VPP's internals and as _North of the Border_
(and the Simpsons!) would say: what we have isn't just good, it's good enough!

I've built a small testbed based on two Dell R730 machines. On the left, I have a Debian machine
running Cisco T-Rex using four quad-tengig network cards, the classic Intel X710-DA4. On the right,
I have my VPP machine called _Hippo_ (because it's always hungry for packets), with the same
hardware. I'll build two halves. On the top NIC (Te3/0/0-3 in VPP), I will install IPv4 and MPLS
forwarding on the purple circuit, and a simple Layer2 cross connect on the cyan circuit. On all four
interfaces, I will enable sFlow. Then, I will mirror this configuration on the bottom NIC
(Te130/0/0-3) in the red and green circuits, for which I will leave sFlow turned off.

To help you reproduce my results, and under the assumption that this is your jam, here's the
configuration for all of the kit:

***0. Cisco T-Rex***
```
|
||||
pim@trex:~ $ cat /srv/trex/8x10.yaml
|
||||
- version: 2
|
||||
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
|
||||
port_info:
|
||||
- src_mac: 00:1b:21:06:00:00
|
||||
dest_mac: 9c:69:b4:61:a1:dc # Connected to Hippo Te3/0/0, purple
|
||||
- src_mac: 00:1b:21:06:00:01
|
||||
dest_mac: 9c:69:b4:61:a1:dd # Connected to Hippo Te3/0/1, purple
|
||||
- src_mac: 00:1b:21:83:00:00
|
||||
dest_mac: 00:1b:21:83:00:01 # L2XC via Hippo Te3/0/2, cyan
|
||||
- src_mac: 00:1b:21:83:00:01
|
||||
dest_mac: 00:1b:21:83:00:00 # L2XC via Hippo Te3/0/3, cyan
|
||||
|
||||
- src_mac: 00:1b:21:87:00:00
|
||||
dest_mac: 9c:69:b4:61:75:d0 # Connected to Hippo Te130/0/0, red
|
||||
- src_mac: 00:1b:21:87:00:01
|
||||
dest_mac: 9c:69:b4:61:75:d1 # Connected to Hippo Te130/0/1, red
|
||||
- src_mac: 9c:69:b4:85:00:00
|
||||
dest_mac: 9c:69:b4:85:00:01 # L2XC via Hippo Te130/0/2, green
|
||||
- src_mac: 9c:69:b4:85:00:01
|
||||
dest_mac: 9c:69:b4:85:00:00 # L2XC via Hippo Te130/0/3, green
|
||||
pim@trex:~ $ sudo t-rex-64 -i -c 4 --cfg /srv/trex/8x10.yaml
|
||||
```

When constructing the T-Rex configuration, I specifically set the destination MAC address for L3
circuits (the purple and red ones) using Hippo's interface MAC address, which I can find with
`vppctl show hardware-interfaces`. This way, T-Rex does not have to ARP for the VPP endpoint. On
L2XC circuits (the cyan and green ones), VPP does not concern itself with the MAC addressing at
all. It puts its interface in _promiscuous_ mode, and simply writes any ethernet frame it receives
directly out on the egress interface.

***1. IPv4***
```
|
||||
hippo# set int state TenGigabitEthernet3/0/0 up
|
||||
hippo# set int state TenGigabitEthernet3/0/1 up
|
||||
hippo# set int state TenGigabitEthernet130/0/0 up
|
||||
hippo# set int state TenGigabitEthernet130/0/1 up
|
||||
hippo# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
|
||||
hippo# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
|
||||
hippo# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
|
||||
hippo# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
|
||||
hippo# ip route add 16.0.0.0/24 via 100.64.0.0
|
||||
hippo# ip route add 48.0.0.0/24 via 100.64.1.0
|
||||
hippo# ip route add 16.0.2.0/24 via 100.64.4.0
|
||||
hippo# ip route add 48.0.2.0/24 via 100.64.5.0
|
||||
hippo# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
|
||||
hippo# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
|
||||
hippo# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
|
||||
hippo# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
|
||||
```

By the way, one note on this last piece: I'm setting static IPv4 neighbors so that Cisco T-Rex
as well as VPP do not have to use ARP to resolve each other. You'll see above that the T-Rex
configuration also uses MAC addresses exclusively. Setting the `ip neighbor` like this allows VPP
to know where to send return traffic.

***2. MPLS***
```
|
||||
hippo# mpls table add 0
|
||||
hippo# set interface mpls TenGigabitEthernet3/0/0 enable
|
||||
hippo# set interface mpls TenGigabitEthernet3/0/1 enable
|
||||
hippo# set interface mpls TenGigabitEthernet130/0/0 enable
|
||||
hippo# set interface mpls TenGigabitEthernet130/0/1 enable
|
||||
hippo# mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
|
||||
hippo# mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
|
||||
hippo# mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
|
||||
hippo# mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
|
||||
```

Here, the MPLS configuration implements a simple P-router, where incoming MPLS packets with label 16
will be sent back to T-Rex on Te3/0/1 to the specified IPv4 nexthop (for which I already know the
MAC address), with label 16 removed and a new label 17 imposed - in other words, a SWAP operation.

***3. L2XC***
```
|
||||
hippo# set int state TenGigabitEthernet3/0/2 up
|
||||
hippo# set int state TenGigabitEthernet3/0/3 up
|
||||
hippo# set int state TenGigabitEthernet130/0/2 up
|
||||
hippo# set int state TenGigabitEthernet130/0/3 up
|
||||
hippo# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
|
||||
hippo# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
|
||||
hippo# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
|
||||
hippo# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
|
||||
```

I've added a layer2 cross connect as well because it's computationally very cheap for VPP to receive
an L2 (ethernet) datagram and immediately transmit it on another interface. There's no FIB lookup
and not even an L2 nexthop lookup involved: VPP is just shoveling ethernet packets in-and-out as
fast as it can!

Here's what a loadtest looks like when sending 80Gbps at 192b packets on all eight interfaces:

{{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}}

The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p2 <-> p3 are sending ethernet back
and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These
four ports are my experiment, to show the CPU use of sFlow. Then, ports p4 <-> p5 and p6 <-> p7
respectively have sFlow turned off but otherwise the same configuration. They are my control, showing
the CPU use without sFlow.

**First conclusion**: This stuff works a treat. There is absolutely no impact on throughput at
80Gbps with 47.6Mpps either _with_ or _without_ sFlow turned on. That's wonderful news, as it shows
that the dataplane has more CPU available than is needed for any combination of functionality.

But what _is_ the limit? For this, I'll take a deeper look at the runtime statistics, comparing the
CPU time spent and the maximum throughput achievable on a single VPP worker, thus using a single CPU
thread on this Hippo machine that has 44 cores and 44 hyperthreads. I switch the loadtester to emit
64 byte ethernet packets, the smallest I'm allowed to send.

| Loadtest | no sFlow | 1:1'000'000 | 1:10'000 | 1:1'000 | 1:100 |
|-------------|-----------|-----------|-----------|-----------|-----------|
| L2XC | 14.88Mpps | 14.32Mpps | 14.31Mpps | 14.27Mpps | 14.15Mpps |
| IPv4 | 10.89Mpps | 9.88Mpps | 9.88Mpps | 9.84Mpps | 9.73Mpps |
| MPLS | 10.11Mpps | 9.52Mpps | 9.52Mpps | 9.51Mpps | 9.45Mpps |
| ***sFlow Packets*** / 10sec | N/A | 337.42M total | 337.39M total | 336.48M total | 333.64M total |
| .. Sampled | | 328 | 33.8k | 336k | 3.34M |
| .. Sent | | 328 | 33.8k | 336k | 1.53M |
| .. Dropped | | 0 | 0 | 0 | 1.81M |

Here I can make a few important observations.

**Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned
off, which implies that it has a little bit of CPU left over to do other work, if needed. With IPv4,
I can see that the throughput is actually CPU limited: 10.89Mpps can be handled by one worker. I
know that MPLS is a little bit more expensive computationally than IPv4, and that checks out: the
total capacity is 10.11Mpps for one worker, when sFlow is turned off.

**Overhead**: When I turn on sFlow on the interface, VPP will insert the _sflow-node_ into the
forwarding graph between `device-input` and `ethernet-input`. It means that the sFlow node will see
_every single_ packet, and it will have to move all of these into the next node, which costs about
9.5 CPU cycles per packet. The regression on L2XC is 3.8%, but I have to note that VPP was not CPU
bound on the L2XC path, so it first used up the CPU cycles that were still available before
regressing throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS, only to
shuffle the packets through the graph.

**Sampling Cost**: When then sampling at higher rates, the further regression is not _that_
terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the
worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%. The
regressions for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8% (MPLS).
Of course, by using multiple hardware receive queues and multiple RX workers per interface, the cost
can be kept well in hand.

**Overload Protection**: At 1:1'000 and an effective rate of 33.65Mpps across all ports, I correctly
observe 336k samples taken and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, and they
do not all fit through the FIFO, so the plugin drops samples to protect the downstream
`sflow-main` thread and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M
samples made it through. By the way, this means VPP is happily sending a whopping 153K samples/sec
to the collector!

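To make those numbers concrete, here is the back-of-the-envelope arithmetic, taken straight from the
table above (each measurement window is 10 seconds):

```
IPv4 node overhead : (10.89 - 9.88) / 10.89  ≈ 9.3% regression just for inserting the sflow-node
L2XC sampling cost : (14.32 - 14.15) / 14.32 ≈ 1.2% going from 1:1'000'000 to 1:100 sampling
1:1'000 samples    : 336.48M pkts / 1'000    ≈ 336k sampled, all of them sent to PSAMPLE
1:100 samples      : 333.64M pkts / 100      ≈ 3.34M sampled = 1.53M sent + 1.81M dropped (FIFO)
Collector load     : 1.53M sent / 10 seconds ≈ 153k samples/sec towards hsflowd and the collector
```
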
## What's Next

Now that I've seen the UDP packets from our agent to a collector on the wire, and also how
incredibly efficient the sFlow sampling implementation turned out, I'm super motivated to
continue the journey with higher level collector receivers like ntopng, sflow-rt or Akvorado. In an
upcoming article, I'll describe how I rolled out Akvorado at IPng, and what types of changes would
make the user experience even better (or simpler to understand, at least).

### Acknowledgements

I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
finer details such as logging, error handling, API specifications, and documentation. He has been a
true pleasure to work with and learn from. Also, thank you to the VPP developer community, notably
Benoit, Florin, Damjan, Dave and Matt, for helping with the review and getting this thing merged in
time for the 25.02 release.

793
content/articles/2025-04-09-frysix-evpn.md
Normal file
@@ -0,0 +1,793 @@

---
date: "2025-04-09T07:51:23Z"
title: 'FrysIX eVPN: think different'
---

{{< image float="right" src="/assets/frys-ix/frysix-logo-small.png" alt="FrysIX Logo" width="12em" >}}

# Introduction

Somewhere in the far north of the Netherlands, the country where I was born, a town called Jubbega
is the home of the Frysian Internet Exchange called [[Frys-IX](https://frys-ix.net/)]. Back in 2021,
a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHEF facility, one of
the most densely populated facilities in western Europe. He was looking for a few launching
customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on
my [[bucketlist]({{< ref 2021-07-26-bucketlist.md >}})]. Arend and his IT company
[[ERITAP](https://www.eritap.com/)] took delivery of that rack in May of 2021, and this is when the
internet exchange with _Frysian roots_ was born.

In the years from 2021 until now, Arend and I have been operating the exchange with reasonable
success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs
with about ten switches in six datacenters across the Amsterdam metro area. It's shifting a cool
800Gbit of traffic or so. It's dope, and very rewarding to be able to contribute to this community!

## Frys-IX is growing

We have several members with a 2x100G LAG, and even though all inter-datacenter links are either dark
fiber or WDM, we're starting to feel the growing pains as we set our sights on the next step of
growth. You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. When it did
133.7Gbit of traffic, Arend organized an even bigger barbecue. Obviously, the next step is 1337Gbit
and joining the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're
on our way!

It became clear that we will not be able to keep a dependable peering platform if FrysIX remains a
single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be
operationally expensive (think of all the dark fiber or WDM waves!) and brittle (think of LACP and
balancing traffic over those ports). We need to modernize in order to stay ahead of the growth
curve.

## Hello Nokia

{{< image float="right" src="/assets/frys-ix/nokia-7220-d4.png" alt="Nokia 7220-D4" width="20em" >}}

The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration,
high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity
to your data center networks and peering network environments. These devices are built around the
Broadcom _Trident_ chipset; in the case of the "D4" platform, this is a Trident4 with 28x100G and
8x400G ports. Whoot!

{{< image float="right" src="/assets/frys-ix/IXR-7220-D3.jpg" alt="Nokia 7220-D3" width="20em" >}}

What I find particularly awesome about the Trident series is their speed (a total bandwidth of
12.8Tbps _per router_), low power use (without optics, the IXR-7220-D4 consumes about 150W) and
a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern
approaches to scale-out networking such as VXLAN-based EVPN. At the FrysIX barbecue in September of
2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right.
That's a 32x100G router.

ERITAP has bought two (new in box) IXR-7220-D4 (8x400G, 28x100G) routers, and has also acquired two
IXR-7220-D2 (48x25G, 8x100G) routers. So in total, FrysIX is now the proud owner of five of these
beautiful Nokia devices. If you haven't yet, you should definitely read about these versatile
routers on the [[Nokia](https://onestore.nokia.com/asset/207599)] website, and some details of the
_merchant silicon_ switch chips in use on the
[[Broadcom](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56880-series)]
website.

### eVPN: A small rant

{{< image float="right" src="/assets/frys-ix/FrysIX_ Topology (concept).svg" alt="Topology Concept" width="50%" >}}

First, I need to get something off my chest. Consider a topology for an internet exchange platform,
taking into account the available equipment, rackspace, power, and cross connects. Somehow, almost
every design or reference architecture I can find on the Internet assumes folks want to build a
[[Clos network](https://en.wikipedia.org/wiki/Clos_network)], which has a topology consisting of leaf
and spine switches. The _spine_ switches have a different set of features than the _leaf_ ones;
notably, they don't have to do provider edge functionality like VXLAN encap and decapsulation.
Almost all of these designs show how one might build a leaf-spine network for hyperscale.

**Critique 1**: my 'spine' (IXR-7220-D4 routers) must also be provider edge. Practically speaking,
in the picture above I have these beautiful Nokia IXR-7220-D4 routers, using two 400G ports to
connect between the facilities, and six 100G ports to connect the smaller breakout switches. That
would leave a _massive_ amount of capacity unused: 22x100G and 6x400G ports, to be exact.

**Critique 2**: the 'leaf' devices (either IXR-7220-D2 routers or Arista switches) can't realistically
connect to both 'spines'. Our devices are spread out over two (and in practice, more like six)
datacenters, and it's prohibitively expensive to get 100G waves or dark fiber to create a full mesh.
It's much more economical to create a star-topology that minimizes cross-datacenter fiber spans.

**Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway
protocol is eBGP in what they call the _underlay_, and on top of that, some secondary eBGP that's
called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume
hundreds of switches, in which case making use of one AS number per switch could make sense, as iBGP
needs either a 'full mesh', or external route reflectors.

**Critique 4**: These reference designs also assume that all fiber is local and that, while
optics and links can fail, it will be relatively rare to _drain_ a link. However, in
cross-datacenter networks, draining links for maintenance is very common, for example if the dark
fiber provider needs to perform repairs on a span that was damaged. With these eBGP-over-eBGP
connections, traffic engineering is more difficult than simply raising the OSPF (or IS-IS) cost of a
link to reroute traffic.

Setting aside eVPN for a second, if I were to build an IP transport network, like I did when I built
[[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a much more intuitive
and simple (I would even dare say elegant) design:

1. Take a classic IGP like [[OSPF](https://en.wikipedia.org/wiki/Open_Shortest_Path_First)], or
perhaps [[IS-IS](https://en.wikipedia.org/wiki/IS-IS)]. There is no benefit, to me at least, to
using BGP as an IGP.
1. I would give each of the links between the switches an IPv4 /31 and enable link-local, and give
each switch a loopback address with a /32 IPv4 and a /128 IPv6.
1. If I had multiple links between two given switches, I would probably just use ECMP if my devices
supported it, and fall back to a LACP-signaled bundle-ethernet otherwise.
1. If I were to need to use BGP (and for eVPN, this need exists), taking the ISP mindset (as opposed
to the datacenter fabric mindset), I would simply install iBGP against two or three route
reflectors, and exchange routing information within the same single AS number.

### eVPN: A demo topology

{{< image float="right" src="/assets/frys-ix/Nokia Arista VXLAN.svg" alt="Demo topology" width="50%" >}}

So, that's exactly how I'm going to approach the FrysIX eVPN design: OSPF for the underlay and iBGP
for the overlay! I have a feeling that some folks will despise me for being contrarian, but you can
leave your comments below, and don't forget to like-and-subscribe :-)

Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two
400G-capable routers and connects them. Then he takes an Arista DCS-7060CX switch, which is eVPN
capable, with 32x100G ports, based on the Broadcom Tomahawk chipset, and a smaller Nokia
IXR-7220-D2 with 48x25G and 8x100G ports, based on the Trident3 chipset. He wires all of this up
to look like the picture on the right.

#### Underlay: Nokia's SR Linux

We boot up the equipment, verify that all the optics and links are up, and connect the management
ports to an OOB network that I can remotely log in to. This is the first time that either of us has
worked on Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek.

```
|
||||
[pim@nikhef ~]$ sr_cli
|
||||
--{ running }--[ ]--
|
||||
A:pim@nikhef# enter candidate
|
||||
--{ candidate shared default }--[ ]--
|
||||
A:pim@nikhef# set / interface lo0 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
|
||||
A:pim@nikhef# commit stay
|
||||
```

There, my first config snippet! This creates a _loopback_ interface, and similar to JunOS, a
_subinterface_ (which Juniper calls a _unit_) which enables IPv4 and gives it a /32 address. In SR
Linux, any interface has to be associated with a _network-instance_; think of those as routing
domains or VRFs. There's a conveniently named _default_ network-instance, to which I'll add this and
the point-to-point interface between the two 400G routers:

```
|
||||
A:pim@nikhef# info flat interface ethernet-1/29
|
||||
set / interface ethernet-1/29 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
|
||||
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
|
||||
|
||||
A:pim@nikhef# set / network-instance default type default
|
||||
A:pim@nikhef# set / network-instance default admin-state enable
|
||||
A:pim@nikhef# set / network-instance default interface ethernet-1/29.0
|
||||
A:pim@nikhef# set / network-instance default interface lo0.0
|
||||
A:pim@nikhef# commit stay
|
||||
```

Cool. Assuming I now also do this on the other IXR-7220-D4 router, called _equinix_ (which gets the
loopback address 198.19.16.0/32 and the point-to-point on the 400G interface of 198.19.17.0/31), I
should be able to do my first jumboframe ping:

```
|
||||
A:pim@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do
|
||||
Using network instance default
|
||||
PING 198.19.17.1 (198.19.17.1) 9162(9190) bytes of data.
|
||||
9170 bytes from 198.19.17.1: icmp_seq=1 ttl=64 time=0.466 ms
|
||||
9170 bytes from 198.19.17.1: icmp_seq=2 ttl=64 time=0.477 ms
|
||||
9170 bytes from 198.19.17.1: icmp_seq=3 ttl=64 time=0.547 ms
|
||||
```

#### Underlay: SR Linux OSPF

OK, let's get these two Nokia routers to speak OSPF, so that they can reach each other's loopback.
It's really easy:

```
|
||||
A:pim@nikhef# / network-instance default protocols ospf instance default
|
||||
--{ candidate shared default }--[ network-instance default protocols ospf instance default ]--
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set version ospf-v2
|
||||
A:pim@nikhef# set router-id 198.19.16.1
|
||||
A:pim@nikhef# set area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
|
||||
A:pim@nikhef# set area 0.0.0.0 interface lo0.0 passive true
|
||||
A:pim@nikhef# commit stay
|
||||
```

Similar to JunOS, I can descend into a configuration scope: the first line goes into the
_network-instance_ called `default`, then the _protocols_ called `ospf`, and then the _instance_
called `default`. Subsequent `set` commands operate at this scope. Once I commit this configuration
(on the _nikhef_ router and also the _equinix_ router, with its own unique router-id), OSPF quickly
springs into action:

```
|
||||
A:pim@nikhef# show network-instance default protocols ospf neighbor
|
||||
=========================================================================================
|
||||
Net-Inst default OSPFv2 Instance default Neighbors
|
||||
=========================================================================================
|
||||
+---------------------------------------------------------------------------------------+
|
||||
| Interface-Name Rtr Id State Pri RetxQ Time Before Dead |
|
||||
+=======================================================================================+
|
||||
| ethernet-1/29.0 198.19.16.0 full 1 0 36 |
|
||||
+---------------------------------------------------------------------------------------+
|
||||
-----------------------------------------------------------------------------------------
|
||||
No. of Neighbors: 1
|
||||
=========================================================================================
|
||||
|
||||
A:pim@nikhef# show network-instance default route-table all | more
|
||||
IPv4 unicast route table of network instance default
|
||||
+------------------+-----+------------+--------------+--------+----------+--------+------+-------------+-----------------+
|
||||
| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop | Next-hop |
|
||||
| | | | | | Network | | | (Type) | Interface |
|
||||
| | | | | | Instance | | | | |
|
||||
+==================+=====+============+==============+========+==========+========+======+=============+=================+
|
||||
| 198.19.16.0/32 | 0 | ospfv2 | ospf_mgr | True | default | 1 | 10 | 198.19.17.0 | ethernet-1/29.0 |
|
||||
| | | | | | | | | (direct) | |
|
||||
| 198.19.16.1/32 | 7 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
|
||||
| 198.19.17.0/31 | 6 | local | net_inst_mgr | True | default | 0 | 0 | 198.19.17.1 | ethernet-1/29.0 |
|
||||
| | | | | | | | | (direct) | |
|
||||
| 198.19.17.1/32 | 6 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
|
||||
+==================+=====+============+==============+========+==========+========+======+=============+=================+
|
||||
|
||||
A:pim@nikhef# ping network-instance default 198.19.16.0
|
||||
Using network instance default
|
||||
PING 198.19.16.0 (198.19.16.0) 56(84) bytes of data.
|
||||
64 bytes from 198.19.16.0: icmp_seq=1 ttl=64 time=0.484 ms
|
||||
64 bytes from 198.19.16.0: icmp_seq=2 ttl=64 time=0.663 ms
|
||||
```

Delicious! OSPF has learned the loopback, and it is now reachable. As with most things, going from 0
to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. Going
from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on,
going from 2 to N is easy. In my case: enabling several other point-to-point /31 transit networks on
the _nikhef_ router, using `ethernet-1/1.0` through `ethernet-1/4.0` with the correct MTU and
turning on OSPF for these (roughly as sketched below), makes the whole network shoot to life. Slick!

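For one of those additional links, the recipe is exactly the same as before. A sketch for
`ethernet-1/1` on the _nikhef_ router, reusing the 198.19.17.2/31 address that later shows up as the
OSPF neighbor address on the Arista's `Ethernet31/1` (so treat the addressing as my assumption):

```
A:pim@nikhef# set / interface ethernet-1/1 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ip-mtu 9190
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ipv4 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ipv4 address 198.19.17.2/31
A:pim@nikhef# set / network-instance default interface ethernet-1/1.0
A:pim@nikhef# / network-instance default protocols ospf instance default
A:pim@nikhef# set area 0.0.0.0 interface ethernet-1/1.0 interface-type point-to-point
A:pim@nikhef# commit stay
```
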
#### Underlay: Arista

I'll point out that one of the devices in this topology is an Arista. We have several of these ready
for deployment at FrysIX. They are a lot more affordable and easier to find on the second-hand /
refurbished market. These switches come with 32x100G ports, and are really good at packet slinging
because they're based on the Broadcom _Tomahawk_ chipset. They pack a few fewer features than the
_Trident_ chipset that powers the Nokia, but they happen to have all the features we need to run our
internet exchange. So I turn my attention to the Arista in the topology. I am much more
comfortable configuring the whole thing here, as it's not my first time touching these devices:

```
|
||||
arista-leaf#show run int loop0
|
||||
interface Loopback0
|
||||
ip address 198.19.16.2/32
|
||||
ip ospf area 0.0.0.0
|
||||
arista-leaf#show run int Ethernet32/1
|
||||
interface Ethernet32/1
|
||||
description Core: Connected to nikhef:ethernet-1/2
|
||||
load-interval 1
|
||||
mtu 9190
|
||||
no switchport
|
||||
ip address 198.19.17.5/31
|
||||
ip ospf cost 1000
|
||||
ip ospf network point-to-point
|
||||
ip ospf area 0.0.0.0
|
||||
arista-leaf#show run section router ospf
|
||||
router ospf 65500
|
||||
router-id 198.19.16.2
|
||||
redistribute connected
|
||||
network 198.19.0.0/16 area 0.0.0.0
|
||||
max-lsa 12000
|
||||
```

I complete the configuration for the other two interfaces on this Arista: port Eth31/1 also connects
to the _nikhef_ IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to
the _nokia-leaf_ IXR-7220-D2 with a cost of 10, roughly as sketched below.

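Following the same pattern as Eth32/1 above, the lower-cost leg towards _nokia-leaf_ would look
something like this - the /31 address is my assumption, derived from the 198.19.17.11 neighbor that
shows up in the OSPF neighbor table below:

```
arista-leaf#conf t
interface Ethernet30/1
   description Core: Connected to nokia-leaf
   load-interval 1
   mtu 9190
   no switchport
   ip address 198.19.17.10/31
   ip ospf cost 10
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
```
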
It's nice to see OSPF in action: there are two equal (but high) cost OSPF paths via
router-id 198.19.16.1 (nikhef), and there's one lower cost path via router-id 198.19.16.3
(_nokia-leaf_). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nikhef ->
equinix). Dope!

```
|
||||
arista-leaf#show ip ospf nei
|
||||
Neighbor ID Instance VRF Pri State Dead Time Address Interface
|
||||
198.19.16.1 65500 default 1 FULL 00:00:36 198.19.17.4 Ethernet32/1
|
||||
198.19.16.3 65500 default 1 FULL 00:00:31 198.19.17.11 Ethernet30/1
|
||||
198.19.16.1 65500 default 1 FULL 00:00:35 198.19.17.2 Ethernet31/1
|
||||
|
||||
arista-leaf#traceroute 198.19.16.0
|
||||
traceroute to 198.19.16.0 (198.19.16.0), 30 hops max, 60 byte packets
|
||||
1 198.19.17.11 (198.19.17.11) 0.220 ms 0.150 ms 0.206 ms
|
||||
2 198.19.17.6 (198.19.17.6) 0.169 ms 0.107 ms 0.099 ms
|
||||
3 198.19.16.0 (198.19.16.0) 0.434 ms 0.346 ms 0.303 ms
|
||||
```

So far, so good! The _underlay_ is up, every router can reach every other router on its loopback,
and all OSPF adjacencies are formed. I'll leave the 2x100G between _nikhef_ and _arista-leaf_ at
high cost for now.

#### Overlay EVPN: SR Linux

The big-picture idea here is to use iBGP with the same private AS number, and because there are two
main facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as
route-reflectors for the others. It means that they will have an iBGP session amongst themselves
(198.19.16.0 <-> 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the
198.19.16.0/24 subnet. This way, I don't have to configure any more than strictly necessary on the
core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core
routers. I proceed to configure BGP on the Nokias like this:

```
|
||||
A:pim@nikhef# / network-instance default protocols bgp
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set autonomous-system 65500
|
||||
A:pim@nikhef# set router-id 198.19.16.1
|
||||
A:pim@nikhef# set dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
|
||||
A:pim@nikhef# set afi-safi evpn admin-state enable
|
||||
A:pim@nikhef# set preference ibgp 170
|
||||
A:pim@nikhef# set route-advertisement rapid-withdrawal true
|
||||
A:pim@nikhef# set route-advertisement wait-for-fib-install false
|
||||
A:pim@nikhef# set group overlay peer-as 65500
|
||||
A:pim@nikhef# set group overlay afi-safi evpn admin-state enable
|
||||
A:pim@nikhef# set group overlay afi-safi ipv4-unicast admin-state disable
|
||||
A:pim@nikhef# set group overlay afi-safi ipv6-unicast admin-state disable
|
||||
A:pim@nikhef# set group overlay local-as as-number 65500
|
||||
A:pim@nikhef# set group overlay route-reflector client true
|
||||
A:pim@nikhef# set group overlay transport local-address 198.19.16.1
|
||||
A:pim@nikhef# set neighbor 198.19.16.0 admin-state enable
|
||||
A:pim@nikhef# set neighbor 198.19.16.0 peer-group overlay
|
||||
A:pim@nikhef# commit stay
|
||||
```

I can see that iBGP sessions establish between all the devices:

```
|
||||
A:pim@nikhef# show network-instance default protocols bgp neighbor
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
BGP neighbor summary for network-instance "default"
|
||||
Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
|
||||
| Net-Inst | Peer | Group | Flags | Peer-AS | State | Uptime | AFI/SAFI | [Rx/Active/Tx] |
|
||||
+=============+=============+==========+=======+==========+=============+===============+============+====================+
|
||||
| default | 198.19.16.0 | overlay | S | 65500 | established | 0d:0h:2m:32s | evpn | [0/0/0] |
|
||||
| default | 198.19.16.2 | overlay | D | 65500 | established | 0d:0h:2m:27s | evpn | [0/0/0] |
|
||||
| default | 198.19.16.3 | overlay | D | 65500 | established | 0d:0h:2m:41s | evpn | [0/0/0] |
|
||||
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
Summary:
|
||||
1 configured neighbors, 1 configured sessions are established, 0 disabled peers
|
||||
2 dynamic peers
|
||||
```

A few things to note here - there is one _configured_ neighbor (this is the other IXR-7220-D4
router), and two _dynamic_ peers: the Arista and the smaller IXR-7220-D2 router. The only address
family that they are exchanging information for is the _evpn_ family, and no prefixes have been
learned or sent yet, shown by the `[0/0/0]` designation in the last column.

#### Overlay EVPN: Arista

The Arista is also remarkably straightforward to configure. Here, I'll simply enable the iBGP
session as follows:

```
|
||||
arista-leaf#show run section bgp
|
||||
router bgp 65500
|
||||
neighbor evpn peer group
|
||||
neighbor evpn remote-as 65500
|
||||
neighbor evpn update-source Loopback0
|
||||
neighbor evpn ebgp-multihop 3
|
||||
neighbor evpn send-community extended
|
||||
neighbor evpn maximum-routes 12000 warning-only
|
||||
neighbor 198.19.16.0 peer group evpn
|
||||
neighbor 198.19.16.1 peer group evpn
|
||||
!
|
||||
address-family evpn
|
||||
neighbor evpn activate
|
||||
|
||||
arista-leaf#show bgp summary
|
||||
BGP summary information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Neighbor AS Session State AFI/SAFI AFI/SAFI State NLRI Rcd NLRI Acc
|
||||
----------- ----------- ------------- ----------------------- -------------- ---------- ----------
|
||||
198.19.16.0 65500 Established IPv4 Unicast Advertised 0 0
|
||||
198.19.16.0 65500 Established L2VPN EVPN Negotiated 0 0
|
||||
198.19.16.1 65500 Established IPv4 Unicast Advertised 0 0
|
||||
198.19.16.1 65500 Established L2VPN EVPN Negotiated 0 0
|
||||
```

On this leaf node, I'll have redundant iBGP sessions with the two core nodes. Since those core
nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No
matter how many additional Arista (or Nokia) devices I add to the network, all they'll have to do is
enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both core routers.
Voila!

#### VXLAN EVPN: SR Linux

Nokia documentation informs me that SR Linux uses a special interface called _system0_ to source its
VXLAN traffic from, and that this interface should be added to the _default_ network-instance. So
it's a matter of defining that interface and associating a VXLAN interface with it, like so:

```
|
||||
A:pim@nikhef# set / interface system0 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
|
||||
A:pim@nikhef# set / network-instance default interface system0.0
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
|
||||
A:pim@nikhef# commit stay
|
||||
```

This creates the plumbing for a VXLAN sub-interface called `vxlan1.2604` which will accept/send
traffic using VNI 2604 (this happens to be the VLAN id we use at FrysIX for our production Peering
LAN), and it'll use the `system0.0` address to source that traffic from.

The second part is to create what SR Linux calls a MAC-VRF and put some interface(s) in it:

```
|
||||
A:pim@nikhef# set / interface ethernet-1/9 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
|
||||
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 vlan-tagging true
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 type bridged
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
|
||||
|
||||
A:pim@nikhef# / network-instance peeringlan
|
||||
A:pim@nikhef# set type mac-vrf
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set interface ethernet-1/9/3.0
|
||||
A:pim@nikhef# set vxlan-interface vxlan1.2604
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 admin-state enable
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 evi 2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
|
||||
A:pim@nikhef# commit stay
|
||||
```

In the first block here, Arend took what is a 100G port called `ethernet-1/9` and split it into 4x25G
ports. Arend forced the port speed to 10G because he has taken a 40G-4x10G DAC, and it happens that
the third lane is plugged into the Debian machine. So on `ethernet-1/9/3` I'll create a
sub-interface, make it type _bridged_ (which I've also done on `vxlan1.2604`!) and allow any
untagged traffic to enter it.

{{< image width="5em" float="left" src="/assets/shared/brain.png" alt="brain" >}}

If you, like me, are used to either VPP or IOS/XR, this type of sub-interface stuff should feel very
natural to you. I've written about the sub-interface logic of Cisco's IOS/XR and VPP approach in a
previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})], which my buddy Fred lovingly calls
_VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read!

The second block creates a new _network-instance_ which I'll name `peeringlan`. It associates
the newly created untagged sub-interface `ethernet-1/9/3.0` with the VXLAN interface, and starts an
eVPN protocol instance that instructs traffic in and out of this network-instance to use EVI 2604 on
the VXLAN sub-interface, and signals all learned MAC addresses with the specified
route-distinguisher and import/export route-targets. For simplicity I've just used the same for
each: 65500:2604.

I continue by adding an interface to the `peeringlan` _network-instance_ on the other two Nokia
routers: `ethernet-1/9/3.0` on the _equinix_ router and `ethernet-1/9.0` on the _nokia-leaf_ router.
Each of these goes to a 10Gbps port on a Debian machine.

#### VXLAN EVPN: Arista

At this point I'm feeling pretty bullish about the whole project. Arista does not make it very
difficult for me to configure it for L2 EVPN (which is called MAC-VRF here also):

```
|
||||
arista-leaf#conf t
|
||||
vlan 2604
|
||||
name v-peeringlan
|
||||
interface Ethernet9/3
|
||||
speed forced 10000full
|
||||
switchport access vlan 2604
|
||||
|
||||
interface Loopback1
|
||||
ip address 198.19.18.2/32
|
||||
interface Vxlan1
|
||||
vxlan source-interface Loopback1
|
||||
vxlan udp-port 4789
|
||||
vxlan vlan 2604 vni 2604
|
||||
```

After creating VLAN 2604 and making port Eth9/3 an access port in that VLAN, I'll add a VTEP endpoint
called `Loopback1`, and a VXLAN interface that uses that to source its traffic. Here, I'll associate
local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias
previously.

Finally, it's a matter of tying these together by announcing the MAC addresses into the EVPN iBGP
sessions:
```
|
||||
arista-leaf#conf t
|
||||
router bgp 65500
|
||||
vlan 2604
|
||||
rd 65500:2604
|
||||
route-target both 65500:2604
|
||||
redistribute learned
|
||||
!
|
||||
```

### Results

To validate the configurations, I learn a cool trick from my buddy Andy on the SR Linux discord
server. In EOS, I can ask it to check for any obvious mistakes in two places:

```
|
||||
arista-leaf#show vxlan config-sanity detail
|
||||
Category Result Detail
|
||||
---------------------------------- -------- --------------------------------------------------
|
||||
Local VTEP Configuration Check OK
|
||||
Loopback IP Address OK
|
||||
VLAN-VNI Map OK
|
||||
Flood List OK
|
||||
Routing OK
|
||||
VNI VRF ACL OK
|
||||
Decap VRF-VNI Map OK
|
||||
VRF-VNI Dynamic VLAN OK
|
||||
Remote VTEP Configuration Check OK
|
||||
Remote VTEP OK
|
||||
Platform Dependent Check OK
|
||||
VXLAN Bridging OK
|
||||
VXLAN Routing OK VXLAN Routing not enabled
|
||||
CVX Configuration Check OK
|
||||
CVX Server OK Not in controller client mode
|
||||
MLAG Configuration Check OK Run 'show mlag config-sanity' to verify MLAG config
|
||||
Peer VTEP IP OK MLAG peer is not connected
|
||||
MLAG VTEP IP OK
|
||||
Peer VLAN-VNI OK
|
||||
Virtual VTEP IP OK
|
||||
MLAG Inactive State OK
|
||||
|
||||
arista-leaf#show bgp evpn sanity detail
|
||||
Category Check Status Detail
|
||||
-------- -------------------- ------ ------
|
||||
General Send community OK
|
||||
General Multi-agent mode OK
|
||||
General Neighbor established OK
|
||||
L2 MAC-VRF route-target OK
|
||||
import and export
|
||||
L2 MAC-VRF OK
|
||||
route-distinguisher
|
||||
L2 MAC-VRF redistribute OK
|
||||
L2 MAC-VRF overlapping OK
|
||||
VLAN
|
||||
L2 Suppressed MAC OK
|
||||
VXLAN VLAN to VNI map for OK
|
||||
MAC-VRF
|
||||
VXLAN VRF to VNI map for OK
|
||||
IP-VRF
|
||||
```
|

#### Results: Arista view

Inspecting the MAC addresses learned from all four of the client ports on the Debian machine is
easy:

|
||||
arista-leaf#show bgp evpn summary
|
||||
BGP summary information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Neighbor Status Codes: m - Under maintenance
|
||||
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
|
||||
198.19.16.0 4 65500 3311 3867 0 0 18:06:28 Estab 7 7
|
||||
198.19.16.1 4 65500 3308 3873 0 0 18:06:28 Estab 7 7
|
||||
|
||||
arista-leaf#show bgp evpn vni 2604 next-hop 198.19.18.3
|
||||
BGP routing table information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Route status codes: * - valid, > - active, S - Stale, E - ECMP head, e - ECMP
|
||||
c - Contributing to ECMP, % - Pending BGP convergence
|
||||
Origin codes: i - IGP, e - EGP, ? - incomplete
|
||||
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop
|
||||
|
||||
Network Next Hop Metric LocPref Weight Path
|
||||
* >Ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
|
||||
* ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
|
||||
* >Ec RD: 65500:2604 imet 198.19.18.3
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
|
||||
* ec RD: 65500:2604 imet 198.19.18.3
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
|
||||
```
There's a lot to unpack here! The Arista is seeing that, for the _route-distinguisher_ I configured
on all the sessions, it is learning one MAC address on neighbor 198.19.18.3 (this is the VTEP for
the _nokia-leaf_ router) from both iBGP sessions. The MAC address is learned from originator
198.19.16.3 (the loopback of the _nokia-leaf_ router), from two cluster members: the active one on
iBGP speaker 198.19.16.1 (_nikhef_) and a backup member on 198.19.16.0 (_equinix_).

I can also see that there are a bunch of `imet` route entries, and Andy explained these to me. They
are a signal from a VTEP participant that it is interested in seeing multicast traffic (like neighbor
discovery or ARP requests) flooded to it. Every router participating in this L2VPN will raise such
an `imet` route, which I'll see in duplicate as well (one from each iBGP session). This checks out.

#### Results: SR Linux view

The Nokia IXR-7220-D4 router called _equinix_ has also learned a bunch of EVPN routing entries,
which I can inspect as follows:

```
|
||||
A:pim@equinix# show network-instance default protocols bgp routes evpn route-type summary
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Show report for the BGP route table of network-instance "default"
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Status codes: u=used, *=valid, >=best, x=stale, b=backup
|
||||
Origin codes: i=IGP, e=EGP, ?=incomplete
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
BGP Router ID: 198.19.16.0 AS: 65500 Local AS: 65500
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Type 2 MAC-IP Advertisement Routes
|
||||
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
|
||||
| Status | Route- | Tag-ID | MAC-address | IP-address | neighbor | Path-| Next-Hop | Label | ESI | MAC Mobility |
|
||||
| | distinguisher | | | | | id | | | | |
|
||||
+========+===============+========+===================+============+=============+======+============-+========+================================+==================+
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:57 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.1 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.2 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.3 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Type 3 Inclusive Multicast Ethernet Tag Routes
|
||||
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
|
||||
| Status | Route-distinguisher | Tag-ID | Originator-IP | neighbor | Path- | Next-Hop |
|
||||
| | | | | | id | |
|
||||
+========+=============================+========+=====================+=================+========+=======================+
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.1 | 198.19.16.1 | 0 | 198.19.18.1 |
|
||||
| * | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.1 | 0 | 198.19.18.2 |
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.2 | 0 | 198.19.18.2 |
|
||||
| * | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.1 | 0 | 198.19.18.3 |
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.3 | 0 | 198.19.18.3 |
|
||||
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
|
||||
--------------------------------------------------------------------------------------------------------------------------
|
||||
0 Ethernet Auto-Discovery routes 0 used, 0 valid
|
||||
5 MAC-IP Advertisement routes 3 used, 5 valid
|
||||
5 Inclusive Multicast Ethernet Tag routes 3 used, 5 valid
|
||||
0 Ethernet Segment routes 0 used, 0 valid
|
||||
0 IP Prefix routes 0 used, 0 valid
|
||||
0 Selective Multicast Ethernet Tag routes 0 used, 0 valid
|
||||
0 Selective Multicast Membership Report Sync routes 0 used, 0 valid
|
||||
0 Selective Multicast Leave Sync routes 0 used, 0 valid
|
||||
--------------------------------------------------------------------------------------------------------------------------
|
||||
```
|
||||
|
||||
I have to say, SR Linux output is incredibly verbose! But, I can see all the relevant bits and bobs
|
||||
here. Each MAC-IP entry is accounted for, I can see several nexthops pointing at the nikhef switch,
|
||||
one pointing at the nokia-leaf router and one pointing at the Arista switch. I also see the `imet`
|
||||
entries. One thing to note -- the SR Linux implementation renders the empty IP field of these type-2 routes as a
0.0.0.0 IPv4 address, while the Arista (in my opinion, more correctly) leaves it as NULL
|
||||
(unspecified). But, everything looks great!
|
||||
|
||||
#### Results: Debian view
|
||||
|
||||
There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. As I said,
|
||||
Arend hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+
|
||||
connections. This network card is a regular in my AS8298 network, as it has excellent DPDK support
|
||||
and can easily pump 40Mpps with VPP. IPng 🥰 Intel X710!
|
||||
|
||||
```
|
||||
root@debian:~ # ip netns add nikhef
|
||||
root@debian:~ # ip link set enp1s0f0 netns nikhef
|
||||
root@debian:~ # ip netns exec nikhef ip link set enp1s0f0 up mtu 9000
|
||||
root@debian:~ # ip netns exec nikhef ip addr add 192.0.2.10/24 dev enp1s0f0
|
||||
root@debian:~ # ip netns exec nikhef ip addr add 2001:db8::10/64 dev enp1s0f0
|
||||
|
||||
root@debian:~ # ip netns add arista-leaf
|
||||
root@debian:~ # ip link set enp1s0f1 netns arista-leaf
|
||||
root@debian:~ # ip netns exec arista-leaf ip link set enp1s0f1 up mtu 9000
|
||||
root@debian:~ # ip netns exec arista-leaf ip addr add 192.0.2.11/24 dev enp1s0f1
|
||||
root@debian:~ # ip netns exec arista-leaf ip addr add 2001:db8::11/64 dev enp1s0f1
|
||||
|
||||
root@debian:~ # ip netns add nokia-leaf
|
||||
root@debian:~ # ip link set enp1s0f2 netns nokia-leaf
|
||||
root@debian:~ # ip netns exec nokia-leaf ip link set enp1s0f2 up mtu 9000
|
||||
root@debian:~ # ip netns exec nokia-leaf ip addr add 192.0.2.12/24 dev enp1s0f2
|
||||
root@debian:~ # ip netns exec nokia-leaf ip addr add 2001:db8::12/64 dev enp1s0f2
|
||||
|
||||
root@debian:~ # ip netns add equinix
|
||||
root@debian:~ # ip link set enp1s0f3 netns equinix
|
||||
root@debian:~ # ip netns exec equinix ip link set enp1s0f3 up mtu 9000
|
||||
root@debian:~ # ip netns exec equinix ip addr add 192.0.2.13/24 dev enp1s0f3
|
||||
root@debian:~ # ip netns exec equinix ip addr add 2001:db8::13/64 dev enp1s0f3
|
||||
|
||||
root@debian:~# ip netns exec nikhef fping -g 192.0.2.8/29
|
||||
192.0.2.10 is alive
|
||||
192.0.2.11 is alive
|
||||
192.0.2.12 is alive
|
||||
192.0.2.13 is alive
|
||||
|
||||
root@debian:~# ip netns exec arista-leaf fping 2001:db8::10 2001:db8::11 2001:db8::12 2001:db8::13
|
||||
2001:db8::10 is alive
|
||||
2001:db8::11 is alive
|
||||
2001:db8::12 is alive
|
||||
2001:db8::13 is alive
|
||||
|
||||
root@debian:~# ip netns exec equinix ip nei
|
||||
192.0.2.10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
192.0.2.11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
192.0.2.12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
fe80::e63a:6eff:fe5f:c57 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
fe80::e63a:6eff:fe5f:c58 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
fe80::e63a:6eff:fe5f:c59 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
2001:db8::10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
2001:db8::11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
2001:db8::12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
```
|
||||
|
||||
The Debian machine puts each network card into its own network namespace, and gives each of them an IPv4
|
||||
and an IPv6 address. I can then enter the `nikhef` network namespace, which has its NIC connected to
|
||||
the IXR-7220-D4 router called _nikhef_, and ping all four endpoints. Similarly, I can enter the
|
||||
`arista-leaf` namespace and ping6 all four endpoints. Finally, I take a look at the IPv6 and IPv4
|
||||
neighbor table on the network card that is connected to the _equinix_ router. All three MAC addresses are
|
||||
seen. This proves end to end connectivity across the EVPN VXLAN, and full interoperability. Booyah!
|
||||
|
||||
Performance? We got that! I'm not worried as these Nokia routers are rated for 12.8Tbps of VXLAN....
|
||||
```
|
||||
root@debian:~# ip netns exec equinix iperf3 -c 192.0.2.12
|
||||
Connecting to host 192.0.2.12, port 5201
|
||||
[ 5] local 192.0.2.10 port 34598 connected to 192.0.2.12 port 5201
|
||||
[ ID] Interval Transfer Bitrate Retr Cwnd
|
||||
[ 5] 0.00-1.00 sec 1.15 GBytes 9.91 Gbits/sec 19 1.52 MBytes
|
||||
[ 5] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 3 1.54 MBytes
|
||||
[ 5] 2.00-3.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
|
||||
[ 5] 3.00-4.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
|
||||
[ 5] 4.00-5.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 5.00-6.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 6.00-7.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 7.00-8.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 8.00-9.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 9.00-10.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
- - - - - - - - - - - - - - - - - - - - - - - - -
|
||||
[ ID] Interval Transfer Bitrate Retr
|
||||
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec 24 sender
|
||||
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec receiver
|
||||
|
||||
iperf Done.
|
||||
```
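For completeness: the receiving end of this iperf3 test isn't shown above. Assuming defaults, it
would have been nothing more than a server process in the namespace that owns 192.0.2.12, started
along these lines:

```
root@debian:~# ip netns exec nokia-leaf iperf3 -s
```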
|
||||
|
||||
## What's Next
|
||||
|
||||
There's a few improvements I can make before deploying this architecture to the internet exchange.
|
||||
Notably:
|
||||
* the functional equivalent of _port security_, that is to say only allowing one or two MAC
|
||||
addresses per member port. FrysIX has a strict one-port-one-member-one-MAC rule, and having port
|
||||
security will greatly improve our resilience.
|
||||
* SR Linux has the ability to suppress ARP, _even on L2 MAC-VRF_! It's relatively well known for
|
||||
IRB based setups, but adding this to transparent bridge-domains is possible in Nokia
|
||||
[[ref](https://documentation.nokia.com/srlinux/22-6/SR_Linux_Book_Files/EVPN-VXLAN_Guide/services-evpn-vxlan-l2.html#configuring_evpn_learning_for_proxy_arp)],
|
||||
using the syntax of `protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise
|
||||
true`. This will glean the IP addresses based on intercepted ARP requests, and reduce the need for
|
||||
BUM flooding.
|
||||
* Andy informs me that Arista also has this feature. By setting `router l2-vpn` and `arp learning bridged`,
|
||||
the suppression of ARP requests/replies also works in the same way. This greatly reduces cross-router
|
||||
BUM flooding. If DE-CIX can do it, so can FrysIX :)
|
||||
* some automation - although configuring the MAC-VRF across Arista and SR Linux is definitely not
|
||||
as difficult as I thought, having some automation in place will avoid errors and mistakes. It
|
||||
would suck if the IXP collapsed because I botched a link drain or PNI configuration!
|
||||
|
||||
### Acknowledgements
|
||||
|
||||
I am relatively new to EVPN configurations, and wanted to give a shoutout to Andy Whitaker who
|
||||
jumped in very quickly when I asked a question on the SR Linux Discord. He was gracious with his
|
||||
time and spent a few hours on a video call with me, explaining EVPN in great detail both for Arista
|
||||
as well as SR Linux. In particular, I want to give him a big "Thank you!" for helping me understand
|
||||
symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at
|
||||
Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure
|
||||
gold!
|
||||
|
||||
I also want to thank Niek for helping me take my first baby steps onto this platform and patiently
|
||||
answering my nerdly questions about the platform, the switch chip, and the configuration philosophy.
|
||||
Learning a new NOS is always a fun task, and it was made super fun because Niek spent an hour with
|
||||
Arend and me on a video call, giving a bunch of operational tips and tricks along the way.
|
||||
|
||||
Finally, Arend and ERITAP are an absolute joy to work with. We took turns hacking on the lab, which
|
||||
Arend made available for me while I am traveling to Mississippi this week. Thanks for the kWh and
|
||||
OOB access, and for brainstorming the config with me!
|
||||
|
||||
### Reference configurations
|
||||
|
||||
Here's the configs for all machines in this demonstration:
|
||||
[[nikhef](/assets/frys-ix/nikhef.conf)] | [[equinix](/assets/frys-ix/equinix.conf)] | [[nokia-leaf](/assets/frys-ix/nokia-leaf.conf)] | [[arista-leaf](/assets/frys-ix/arista-leaf.conf)]
|
||||
464
content/articles/2025-05-03-containerlab-1.md
Normal file
@@ -0,0 +1,464 @@
|
||||
---
|
||||
date: "2025-05-03T15:07:23Z"
|
||||
title: 'VPP in Containerlab - Part 1'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
|
||||
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
|
||||
However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines
|
||||
like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to
|
||||
allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
|
||||
performs almost the same as on bare metal. But did you know that VPP can also run in Docker?
|
||||
|
||||
The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
|
||||
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
|
||||
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
|
||||
container-based networking labs. It starts the containers, builds a virtual wiring between them to
|
||||
create lab topologies of the user's choice, and manages the lab lifecycle.
|
||||
|
||||
Quite regularly I am asked 'when will you add VPP to Containerlab?', but at ZANOG I made a promise
|
||||
to actually add it. Here I go, on a journey to integrate VPP into Containerlab!
|
||||
|
||||
## Containerized VPP
|
||||
|
||||
The folks at [[Tigera](https://www.tigera.io/project-calico/)] maintain a project called _Calico_,
|
||||
which accelerates Kubernetes CNI (Container Network Interface) by using [[FD.io](https://fd.io)]
|
||||
VPP. Since the origins of Kubernetes are to run containers in a Docker environment, it stands to
|
||||
reason that it should be possible to run a containerized VPP. I start by reading up on how they
|
||||
create their Docker image, and I learn a lot.
|
||||
|
||||
### Docker Build
|
||||
|
||||
Considering IPng runs bare metal Debian (currently Bookworm) machines, my Docker image will be based
|
||||
on `debian:bookworm` as well. The build starts off quite modest:
|
||||
|
||||
```
|
||||
pim@summer:~$ mkdir -p src/vpp-containerlab
|
||||
pim@summer:~/src/vpp-containerlab$ cat << 'EOF' > Dockerfile.bookworm
|
||||
FROM debian:bookworm
|
||||
ARG DEBIAN_FRONTEND=noninteractive
|
||||
ARG VPP_INSTALL_SKIP_SYSCTL=true
|
||||
ARG REPO=release
|
||||
RUN apt-get update && apt-get -y install curl procps && apt-get clean
|
||||
|
||||
# Install VPP
|
||||
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
|
||||
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean
|
||||
|
||||
CMD ["/usr/bin/vpp","-c","/etc/vpp/startup.conf"]
|
||||
EOF
|
||||
pim@summer:~/src/vpp-containerlab$ docker build -f Dockerfile.bookworm . -t pimvanpelt/vpp-containerlab
|
||||
```
|
||||
|
||||
One gotcha - when I install the upstream VPP Debian packages, they generate a `sysctl` file which the
postinst script then tries to apply. However, I can't set sysctls in the container, so the build fails. I take a look
|
||||
at the VPP source code and find `src/pkg/debian/vpp.postinst` which helpfully contains a means to
|
||||
override setting the sysctl's, using an environment variable called `VPP_INSTALL_SKIP_SYSCTL`.
|
||||
|
||||
### Running VPP in Docker
|
||||
|
||||
With the Docker image built, I need to tweak the VPP startup configuration a little bit, to allow it
|
||||
to run well in a Docker environment. There are a few things I make note of:
|
||||
1. We may not have huge pages on the host machine, so I'll set all the page sizes to the
|
||||
linux-default 4kB rather than 2MB or 1GB hugepages. This creates a performance regression, but
|
||||
in the case of Containerlab, we're not here to build high performance stuff, but rather users
|
||||
will be doing functional testing.
|
||||
1. DPDK requires either UIO or VFIO kernel drivers, so that it can bind its so-called _poll mode
|
||||
driver_ to the network cards. It also requires huge pages. Since my first version will be
|
||||
using only virtual ethernet interfaces, I'll disable DPDK and VFIO altogether.
|
||||
1. VPP can run any number of CPU worker threads. In its simplest form, I can also run it with only
|
||||
one thread. Of course, this will not be a high performance setup, but since I'm already not
|
||||
using hugepages, I'll use only 1 thread.
|
||||
|
||||
The VPP `startup.conf` configuration file I came up with:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat << EOF > clab-startup.conf
|
||||
unix {
|
||||
interactive
|
||||
log /var/log/vpp/vpp.log
|
||||
full-coredump
|
||||
cli-listen /run/vpp/cli.sock
|
||||
cli-prompt vpp-clab#
|
||||
cli-no-pager
|
||||
poll-sleep-usec 100
|
||||
}
|
||||
|
||||
api-trace {
|
||||
on
|
||||
}
|
||||
|
||||
memory {
|
||||
main-heap-size 512M
|
||||
main-heap-page-size 4k
|
||||
}
|
||||
buffers {
|
||||
buffers-per-numa 16000
|
||||
default data-size 2048
|
||||
page-size 4k
|
||||
}
|
||||
|
||||
statseg {
|
||||
size 64M
|
||||
page-size 4k
|
||||
per-node-counters on
|
||||
}
|
||||
|
||||
plugins {
|
||||
plugin default { enable }
|
||||
plugin dpdk_plugin.so { disable }
|
||||
}
|
||||
EOF
|
||||
```
|
||||
|
||||
Just a couple of notes for those who are running VPP in production. Each of the `*-page-size` config
|
||||
settings takes the normal Linux pagesize of 4kB, which effectively prevents VPP from using any
|
||||
hugepages. Then, I'll specifically disable the DPDK plugin, although I didn't install it in the
|
||||
Dockerfile build, as it lives in its own dedicated Debian package called `vpp-plugin-dpdk`. Finally,
|
||||
I'll make VPP use less CPU by telling it to sleep for 100 microseconds between each poll iteration.
|
||||
In production environments, VPP will use 100% of the CPUs it's assigned, but in this lab, it will
|
||||
not be quite as hungry. By the way, even in this sleepy mode, it'll still easily handle a gigabit
|
||||
of traffic!
|
||||
|
||||
Now, VPP wants to run as root and it needs a few host features, notably tuntap devices and vhost,
|
||||
and a few capabilities, notably NET_ADMIN, SYS_NICE and SYS_PTRACE. I take a look at the
|
||||
[[manpage](https://man7.org/linux/man-pages/man7/capabilities.7.html)]:
|
||||
* ***CAP_SYS_NICE***: allows to set real-time scheduling, CPU affinity, I/O scheduling class, and
|
||||
to migrate and move memory pages.
|
||||
* ***CAP_NET_ADMIN***: allows to perform various network-related operations like interface
|
||||
configs, routing tables, nested network namespaces, multicast, set promiscuous mode, and so on.
|
||||
* ***CAP_SYS_PTRACE***: allows to trace arbitrary processes using `ptrace(2)`, and a few related
|
||||
kernel system calls.
|
||||
|
||||
Being a networking dataplane implementation, VPP wants to be able to tinker with network devices.
|
||||
This is not typically allowed in Docker containers, although the Docker developers did make some
|
||||
concessions for those containers that need just that little bit more access. They described it in
|
||||
their
|
||||
[[docs](https://docs.docker.com/engine/containers/run/#runtime-privilege-and-linux-capabilities)] as
|
||||
follows:
|
||||
|
||||
| The --privileged flag gives all capabilities to the container. When the operator executes docker
|
||||
| run --privileged, Docker enables access to all devices on the host, and reconfigures AppArmor or
|
||||
| SELinux to allow the container nearly all the same access to the host as processes running outside
|
||||
| containers on the host. Use this flag with caution. For more information about the --privileged
|
||||
| flag, see the docker run reference.
|
||||
|
||||
{{< image width="4em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
At this point, I feel I should point out that running a Docker container with the `--privileged` flag
|
||||
set does give it _a lot_ of privileges. A container with `--privileged` is not a securely sandboxed
|
||||
process. Containers in this mode can get a root shell on the host and take control over the system.
|
||||
|
||||
With that little fineprint warning out of the way, I am going to Yolo like a boss:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker run --name clab-pim \
|
||||
--cap-add=NET_ADMIN --cap-add=SYS_NICE --cap-add=SYS_PTRACE \
|
||||
--device=/dev/net/tun:/dev/net/tun --device=/dev/vhost-net:/dev/vhost-net \
|
||||
--privileged -v $(pwd)/clab-startup.conf:/etc/vpp/startup.conf:ro \
|
||||
docker.io/pimvanpelt/vpp-containerlab
|
||||
clab-pim
|
||||
```
|
||||
|
||||
### Configuring VPP in Docker
|
||||
|
||||
And with that, the Docker container is running! I post a screenshot on
|
||||
[[Mastodon](https://ublog.tech/@IPngNetworks/114392852468494211)] and my buddy John responds with a
|
||||
polite but firm insistence that I explain myself. Here you go, buddy :)
|
||||
|
||||
In another terminal, I can play around with this VPP instance a little bit:
|
||||
```
|
||||
pim@summer:~$ docker exec -it clab-pim bash
|
||||
root@d57c3716eee9:/# ip -br l
|
||||
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
|
||||
eth0@if530566 UP 02:42:ac:11:00:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
|
||||
root@d57c3716eee9:/# ps auxw
|
||||
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
|
||||
root 1 2.2 0.2 17498852 160300 ? Rs 15:11 0:00 /usr/bin/vpp -c /etc/vpp/startup.conf
|
||||
root 10 0.0 0.0 4192 3388 pts/0 Ss 15:11 0:00 bash
|
||||
root 18 0.0 0.0 8104 4056 pts/0 R+ 15:12 0:00 ps auxw
|
||||
|
||||
root@d57c3716eee9:/# vppctl
|
||||
_______ _ _ _____ ___
|
||||
__/ __/ _ \ (_)__ | | / / _ \/ _ \
|
||||
_/ _// // / / / _ \ | |/ / ___/ ___/
|
||||
/_/ /____(_)_/\___/ |___/_/ /_/
|
||||
|
||||
vpp-clab# show version
|
||||
vpp v25.02-release built by root on d5cd2c304b7f at 2025-02-26T13:58:32
|
||||
vpp-clab# show interfaces
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
|
||||
local0 0 down 0/0/0/0
|
||||
```
|
||||
|
||||
Slick! I can see that the container has an `eth0` device, which Docker has connected to the main
|
||||
bridged network. For now, there's only one process running, pid 1 proudly shows VPP (as in Docker,
|
||||
the `CMD` field simply replaces `init`). Later on, I can imagine running a few more daemons like
|
||||
SSH and so on, but for now, I'm happy.
|
||||
|
||||
Looking at VPP itself, it has no network interfaces yet, except for the default `local0` interface.
|
||||
|
||||
### Adding Interfaces in Docker
|
||||
|
||||
But if I don't have DPDK, how will I add interfaces? Enter `veth(4)`. From the
|
||||
[[manpage](https://man7.org/linux/man-pages/man4/veth.4.html)], I learn that veth devices are
|
||||
virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to
|
||||
a physical network device in another namespace, but can also be used as standalone network devices.
|
||||
veth devices are always created in interconnected pairs.
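To make that a little more tangible, here's a quick aside (not needed for the rest of this article,
run as root) showing how such a pair can be created by hand with `iproute2`; the names `veth-host`,
`veth-ns` and the namespace `test` are made up for this illustration:

```
# Create a veth pair; frames sent into one end pop out of the other.
ip link add veth-host type veth peer name veth-ns
# Move one end into a fresh network namespace and bring both ends up.
ip netns add test
ip link set veth-ns netns test
ip link set veth-host up
ip netns exec test ip link set veth-ns up
```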
|
||||
|
||||
Of course, Docker users will recognize this. It's like bread and butter for containers to
|
||||
communicate with one another - and with the host they're running on. I can simply create a Docker
|
||||
network and attach one half of it to a running container, like so:
|
||||
|
||||
```
|
||||
pim@summer:~$ docker network create --driver=bridge clab-network \
|
||||
--subnet 192.0.2.0/24 --ipv6 --subnet 2001:db8::/64
|
||||
5711b95c6c32ac0ed185a54f39e5af4b499677171ff3d00f99497034e09320d2
|
||||
pim@summer:~$ docker network connect clab-network clab-pim --ip '' --ip6 ''
|
||||
```
|
||||
|
||||
The first command here creates a new network called `clab-network` in Docker. As a result, a new
|
||||
bridge called `br-5711b95c6c32` shows up on the host. The bridge name is chosen from the UUID of the
|
||||
Docker object. Seeing as I added an IPv4 and IPv6 subnet to the bridge, it gets configured with the
|
||||
first address in both:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ brctl show br-5711b95c6c32
|
||||
bridge name bridge id STP enabled interfaces
|
||||
br-5711b95c6c32 8000.0242099728c6 no veth021e363
|
||||
|
||||
|
||||
pim@summer:~/src/vpp-containerlab$ ip -br a show dev br-5711b95c6c32
|
||||
br-5711b95c6c32 UP 192.0.2.1/24 2001:db8::1/64 fe80::42:9ff:fe97:28c6/64 fe80::1/64
|
||||
```
|
||||
|
||||
The second command creates a `veth` pair and puts one half of it in the bridge; this interface
|
||||
is called `veth021e363` above. The other half of it pops up as `eth1` in the Docker container:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker exec -it clab-pim bash
|
||||
root@d57c3716eee9:/# ip -br l
|
||||
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
|
||||
eth0@if530566 UP 02:42:ac:11:00:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
eth1@if530577 UP 02:42:c0:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
```
|
||||
|
||||
One of the many awesome features of VPP is its ability to attach to these `veth` devices by means of
|
||||
its `af-packet` driver, by reusing the same MAC address (in this case `02:42:c0:00:02:02`). I first
|
||||
take a look at the linux [[manpage](https://man7.org/linux/man-pages/man7/packet.7.html)] for it,
|
||||
and then read up on the VPP
|
||||
[[documentation](https://fd.io/docs/vpp/v2101/gettingstarted/progressivevpp/interface)] on the
|
||||
topic.
|
||||
|
||||
|
||||
However, my attention is drawn to Docker assigning an IPv4 and IPv6 address to the container:
|
||||
```
|
||||
root@d57c3716eee9:/# ip -br a
|
||||
lo UNKNOWN 127.0.0.1/8 ::1/128
|
||||
eth0@if530566 UP 172.17.0.2/16
|
||||
eth1@if530577 UP 192.0.2.2/24 2001:db8::2/64 fe80::42:c0ff:fe00:202/64
|
||||
root@d57c3716eee9:/# ip addr del 192.0.2.2/24 dev eth1
|
||||
root@d57c3716eee9:/# ip addr del 2001:db8::2/64 dev eth1
|
||||
```
|
||||
|
||||
I decide to remove them from here, as in the end, `eth1` will be owned by VPP so _it_ should be
|
||||
setting the IPv4 and IPv6 addresses. For the life of me, I don't see how I can stop Docker from
assigning IPv4 and IPv6 addresses to this container ... and the
|
||||
[[docs](https://docs.docker.com/engine/network/)] seem to be off as well, as they suggest I can pass
|
||||
a flag `--ipv4=False` but that flag doesn't exist, at least not on my Bookworm Docker variant. I
|
||||
make a mental note to discuss this with the folks in the Containerlab community.
|
||||
|
||||
|
||||
Anyway, armed with this knowledge I can bind the container-side veth pair called `eth1` to VPP, like
|
||||
so:
|
||||
|
||||
```
|
||||
root@d57c3716eee9:/# vppctl
|
||||
_______ _ _ _____ ___
|
||||
__/ __/ _ \ (_)__ | | / / _ \/ _ \
|
||||
_/ _// // / / / _ \ | |/ / ___/ ___/
|
||||
/_/ /____(_)_/\___/ |___/_/ /_/
|
||||
|
||||
vpp-clab# create host-interface name eth1 hw-addr 02:42:c0:00:02:02
|
||||
vpp-clab# set interface name host-eth1 eth1
|
||||
vpp-clab# set interface mtu 1500 eth1
|
||||
vpp-clab# set interface ip address eth1 192.0.2.2/24
|
||||
vpp-clab# set interface ip address eth1 2001:db8::2/64
|
||||
vpp-clab# set interface state eth1 up
|
||||
vpp-clab# show int addr
|
||||
eth1 (up):
|
||||
L3 192.0.2.2/24
|
||||
L3 2001:db8::2/64
|
||||
local0 (dn):
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
After all this work, I've successfully created a Docker image based on Debian Bookworm and VPP 25.02
|
||||
(the current stable release version), started a container with it, added a network bridge in Docker,
|
||||
which binds the host `summer` to the container. Proof, as they say, is in the ping-pudding:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ ping -c5 2001:db8::2
|
||||
PING 2001:db8::2(2001:db8::2) 56 data bytes
|
||||
64 bytes from 2001:db8::2: icmp_seq=1 ttl=64 time=0.113 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=2 ttl=64 time=0.056 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=3 ttl=64 time=0.202 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=4 ttl=64 time=0.102 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=5 ttl=64 time=0.100 ms
|
||||
|
||||
--- 2001:db8::2 ping statistics ---
|
||||
5 packets transmitted, 5 received, 0% packet loss, time 4098ms
|
||||
rtt min/avg/max/mdev = 0.056/0.114/0.202/0.047 ms
|
||||
pim@summer:~/src/vpp-containerlab$ ping -c5 192.0.2.2
|
||||
PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data.
|
||||
64 bytes from 192.0.2.2: icmp_seq=1 ttl=64 time=0.043 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=2 ttl=64 time=0.032 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=3 ttl=64 time=0.019 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=4 ttl=64 time=0.041 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=5 ttl=64 time=0.027 ms
|
||||
|
||||
--- 192.0.2.2 ping statistics ---
|
||||
5 packets transmitted, 5 received, 0% packet loss, time 4063ms
|
||||
rtt min/avg/max/mdev = 0.019/0.032/0.043/0.008 ms
|
||||
```
|
||||
|
||||
And in case that simple ping-test wasn't enough to get you excited, here's a packet trace from VPP
|
||||
itself, while I'm performing this ping:
|
||||
|
||||
```
|
||||
vpp-clab# trace add af-packet-input 100
|
||||
vpp-clab# wait 3
|
||||
vpp-clab# show trace
|
||||
------------------- Start of thread 0 vpp_main -------------------
|
||||
Packet 1
|
||||
|
||||
00:07:03:979275: af-packet-input
|
||||
af_packet: hw_if_index 1 rx-queue 0 next-index 4
|
||||
block 47:
|
||||
address 0x7fbf23b7d000 version 2 seq_num 48 pkt_num 0
|
||||
tpacket3_hdr:
|
||||
status 0x20000001 len 98 snaplen 98 mac 92 net 106
|
||||
sec 0x68164381 nsec 0x258e7659 vlan 0 vlan_tpid 0
|
||||
vnet-hdr:
|
||||
flags 0x00 gso_type 0x00 hdr_len 0
|
||||
gso_size 0 csum_start 0 csum_offset 0
|
||||
00:07:03:979293: ethernet-input
|
||||
IP4: 02:42:09:97:28:c6 -> 02:42:c0:00:02:02
|
||||
00:07:03:979306: ip4-input
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979315: ip4-lookup
|
||||
fib 0 dpo-idx 9 flow hash: 0x00000000
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979322: ip4-receive
|
||||
fib:0 adj:9 flow:0x00000000
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979323: ip4-icmp-input
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979323: ip4-icmp-echo-request
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979326: ip4-load-balance
|
||||
fib 0 dpo-idx 5 flow hash: 0x00000000
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
00:07:03:979325: ip4-rewrite
|
||||
tx_sw_if_index 1 dpo-idx 5 : ipv4 via 192.0.2.1 eth1: mtu:1500 next:3 flags:[] 0242099728c60242c00002020800 flow hash: 0x00000000
|
||||
00000000: 0242099728c60242c00002020800450000542dc44000400188e1c0000202c000
|
||||
00000020: 02010000141652cd00018143166800000000399d0900000000001011
|
||||
00:07:03:979326: eth1-output
|
||||
eth1 flags 0x02180005
|
||||
IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
00:07:03:979327: eth1-tx
|
||||
af_packet: hw_if_index 1 tx-queue 0
|
||||
tpacket3_hdr:
|
||||
status 0x1 len 108 snaplen 108 mac 0 net 0
|
||||
sec 0x0 nsec 0x0 vlan 0 vlan_tpid 0
|
||||
vnet-hdr:
|
||||
flags 0x00 gso_type 0x00 hdr_len 0
|
||||
gso_size 0 csum_start 0 csum_offset 0
|
||||
buffer 0xf97c4:
|
||||
current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
|
||||
local l2-hdr-offset 0 l3-hdr-offset 14
|
||||
IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
```
|
||||
|
||||
Well, that's a mouthful, isn't it! Here, I get to show you VPP in action. After receiving the
|
||||
packet on its `af-packet-input` node from 192.0.2.1 (Summer, who is pinging us) to 192.0.2.2 (the
|
||||
VPP container), the packet traverses the dataplane graph. It goes through `ethernet-input`, then
|
||||
`ip4-input`, which sees it's destined to an IPv4 address configured, so the packet is handed to
|
||||
`ip4-receive`. That one sees that the IP protocol is ICMP, so it hands the packet to
|
||||
`ip4-icmp-input` which notices that the packet is an ICMP echo request, so off to
|
||||
`ip4-icmp-echo-request` our little packet goes. The ICMP plugin in VPP now answers by
|
||||
`ip4-rewrite`'ing the packet, sending the return to 192.0.2.1 at MAC address `02:42:09:97:28:c6`
|
||||
(this is Summer, the host doing the pinging!), after which the newly created ICMP echo-reply is
|
||||
handed to `eth1-output` which marshalls it back into the kernel's AF_PACKET interface using
|
||||
`eth1-tx`.
|
||||
|
||||
Boom. I could not be more pleased.
|
||||
|
||||
## What's Next
|
||||
|
||||
This was a nice exercise for me! I'm going in this direction because the
|
||||
[[Containerlab](https://containerlab.dev)] framework will start containers with given NOS images,
|
||||
not too dissimilar from the one I just made, and then attaches `veth` pairs between the containers.
|
||||
I started dabbling with a [[pull-request](https://github.com/srl-labs/containerlab/pull/2571)], but
|
||||
I got stuck with a part of the Containerlab code that pre-deploys config files into the containers.
|
||||
You see, I will need to generate two files:
|
||||
|
||||
1. A `startup.conf` file that is specific to each Containerlab Docker container. I'd like them to
|
||||
each set their own hostname so that the CLI has a unique prompt. I can do this by setting `unix
|
||||
{ cli-prompt {{ .ShortName }}# }` in the template renderer.
|
||||
1. Containerlab will know all of the veth pairs that are planned to be created into each VPP
|
||||
container. I'll need it to then write a little snippet of config that does the `create
|
||||
host-interface` spiel, to attach these `veth` pairs to the VPP dataplane.
|
||||
|
||||
I reached out to Roman from Nokia, who is one of the authors and current maintainer of Containerlab.
|
||||
Roman was keen to help out, and seeing as he knows the Containerlab stuff well, and I know the VPP
|
||||
stuff well, this is a reasonable partnership! Soon, he and I plan to have a bare-bones setup that
|
||||
will connect a few VPP containers together with an SR Linux node in a lab. Stand by!
|
||||
|
||||
Once we have that, there's still quite some work for me to do. Notably:
|
||||
* Configuration persistence. `clab` allows you to save the running config. For that, I'll need to
|
||||
introduce [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] and a means to invoke it when
|
||||
the lab operator wants to save their config, and then reconfigure VPP when the container
|
||||
restarts.
|
||||
* I'll need to have a few files from `clab` shared with the host, notably the `startup.conf` and
|
||||
`vppcfg.yaml`, as well as some manual pre- and post-flight configuration for the more esoteric
|
||||
stuff. Building the plumbing for this is a TODO for now.
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
I wanted to give a shout-out to Nardus le Roux who inspired me to contribute this Containerlab VPP
|
||||
node type, and to Roman Dodin for his help getting the Containerlab parts squared away when I got a
|
||||
little bit stuck.
|
||||
|
||||
First order of business: get it to ping at all ... it'll go faster from there on out :)
|
||||
373
content/articles/2025-05-04-containerlab-2.md
Normal file
@@ -0,0 +1,373 @@
|
||||
---
|
||||
date: "2025-05-04T15:07:23Z"
|
||||
title: 'VPP in Containerlab - Part 2'
|
||||
params:
|
||||
asciinema: true
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
|
||||
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
|
||||
However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines
|
||||
like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to
|
||||
allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
|
||||
performs almost the same as on bare metal. But did you know that VPP can also run in Docker?
|
||||
|
||||
The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
|
||||
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
|
||||
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
|
||||
container-based networking labs. It starts the containers, builds virtual wiring between them to
|
||||
create lab topologies of users' choice and manages the lab lifecycle.
|
||||
|
||||
Quite regularly I am asked 'when will you add VPP to Containerlab?', but at ZANOG I made a promise
|
||||
to actually add it. In my previous [[article]({{< ref 2025-05-03-containerlab-1.md >}})], I took
|
||||
a good look at VPP as a dockerized container. In this article, I'll explore how to make such a
|
||||
container run in Containerlab!
|
||||
|
||||
## Completing the Docker container
|
||||
|
||||
Just having VPP running by itself in a container is not super useful (although it _is_ cool!). I
|
||||
decide first to add a few bits and bobs that will come in handy in the `Dockerfile`:
|
||||
|
||||
```
|
||||
FROM debian:bookworm
|
||||
ARG DEBIAN_FRONTEND=noninteractive
|
||||
ARG VPP_INSTALL_SKIP_SYSCTL=true
|
||||
ARG REPO=release
|
||||
EXPOSE 22/tcp
|
||||
RUN apt-get update && apt-get -y install curl procps tcpdump iproute2 iptables \
|
||||
iputils-ping net-tools git python3 python3-pip vim-tiny openssh-server bird2 \
|
||||
mtr-tiny traceroute && apt-get clean
|
||||
|
||||
# Install VPP
|
||||
RUN mkdir -p /var/log/vpp /root/.ssh/
|
||||
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
|
||||
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean
|
||||
|
||||
# Build vppcfg
|
||||
RUN pip install --break-system-packages build netaddr yamale argparse pyyaml ipaddress
|
||||
RUN git clone https://git.ipng.ch/ipng/vppcfg.git && cd vppcfg && python3 -m build && \
|
||||
pip install --break-system-packages dist/vppcfg-*-py3-none-any.whl
|
||||
|
||||
# Config files
|
||||
COPY files/etc/vpp/* /etc/vpp/
|
||||
COPY files/etc/bird/* /etc/bird/
|
||||
COPY files/init-container.sh /sbin/
|
||||
RUN chmod 755 /sbin/init-container.sh
|
||||
CMD ["/sbin/init-container.sh"]
|
||||
```
|
||||
|
||||
A few notable additions:
|
||||
* ***vppcfg*** is a handy utility I wrote and discussed in a previous [[article]({{< ref
|
||||
2022-04-02-vppcfg-2 >}})]. Its purpose is to take a YAML file that describes the configuration of
|
||||
the dataplane (like which interfaces, sub-interfaces, MTU, IP addresses and so on), and then
|
||||
apply this safely to a running dataplane. You can check it out in my
|
||||
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] git repository.
|
||||
* ***openssh-server*** will come in handy to log in to the container, in addition to the already
|
||||
available `docker exec`.
|
||||
* ***bird2*** which will be my controlplane of choice. At a future date, I might also add FRR,
|
||||
which may be a good alternative for some. VPP works well with both. You can check out Bird on
|
||||
the nic.cz [[website](https://bird.network.cz/?get_doc&f=bird.html&v=20)].
|
||||
|
||||
I'll add a couple of default config files for Bird and VPP, and replace the CMD with a generic
|
||||
`/sbin/init-container.sh` in which I can do any late binding stuff before launching VPP.
|
||||
|
||||
### Initializing the Container
|
||||
|
||||
#### VPP Containerlab: NetNS
|
||||
|
||||
VPP's Linux Control Plane plugin wants to run in its own network namespace. So the first order of
|
||||
business of `/sbin/init-container.sh` is to create it:
|
||||
|
||||
```
|
||||
NETNS=${NETNS:="dataplane"}
|
||||
|
||||
echo "Creating dataplane namespace"
|
||||
/usr/bin/mkdir -p /etc/netns/$NETNS
|
||||
/usr/bin/touch /etc/netns/$NETNS/resolv.conf
|
||||
/usr/sbin/ip netns add $NETNS
|
||||
```
|
||||
|
||||
#### VPP Containerlab: SSH
|
||||
|
||||
Then, I'll set the root password (which is `vpp` by the way), and start an SSH daemon which allows
|
||||
for password-less logins:
|
||||
|
||||
```
|
||||
echo "Starting SSH, with credentials root:vpp"
|
||||
sed -i -e 's,^#PermitRootLogin prohibit-password,PermitRootLogin yes,' /etc/ssh/sshd_config
|
||||
sed -i -e 's,^root:.*,root:$y$j9T$kG8pyZEVmwLXEtXekQCRK.$9iJxq/bEx5buni1hrC8VmvkDHRy7ZMsw9wYvwrzexID:20211::::::,' /etc/shadow
|
||||
/etc/init.d/ssh start
|
||||
```
|
||||
|
||||
#### VPP Containerlab: Bird2
|
||||
|
||||
I can already predict that Bird2 won't be the only option for a controlplane, even though I'm a huge
|
||||
fan of it. Therefore, I'll make it configurable to leave the door open for other controlplane
|
||||
implementations in the future:
|
||||
|
||||
```
|
||||
BIRD_ENABLED=${BIRD_ENABLED:="true"}
|
||||
|
||||
if [ "$BIRD_ENABLED" == "true" ]; then
|
||||
echo "Starting Bird in $NETNS"
|
||||
mkdir -p /run/bird /var/log/bird
|
||||
chown bird:bird /var/log/bird
|
||||
ROUTERID=$(ip -br a show eth0 | awk '{ print $3 }' | cut -f1 -d/)
|
||||
sed -i -e "s,.*router id .*,router id $ROUTERID; # Set by container-init.sh," /etc/bird/bird.conf
|
||||
/usr/bin/nsenter --net=/var/run/netns/$NETNS /usr/sbin/bird -u bird -g bird
|
||||
fi
|
||||
```
|
||||
|
||||
I am reminded that Bird won't start if it cannot determine its _router id_. When I start it in the
|
||||
`dataplane` namespace, it will immediately exit, because there will be no IP addresses configured
|
||||
yet. But luckily, it logs its complaint and it's easily addressed. I decide to take the management
|
||||
IPv4 address from `eth0` and write that into the `bird.conf` file, which otherwise does some basic
|
||||
initialization that I described in a previous [[article]({{< ref 2021-09-02-vpp-5 >}})], so I'll
|
||||
skip that here. However, I do include an empty file called `/etc/bird/bird-local.conf` for users to
|
||||
further configure Bird2.
|
||||
|
||||
#### VPP Containerlab: Binding veth pairs
|
||||
|
||||
When Containerlab starts the VPP container, it'll offer it a set of `veth` ports that connect this
|
||||
container to other nodes in the lab. This is done by the `links` list in the topology file
|
||||
[[ref](https://containerlab.dev/manual/network/)]. It's my goal to take all of the interfaces
|
||||
that are of type `veth`, and generate a little snippet to grab them and bind them into VPP while
|
||||
setting their MTU to 9216 to allow for jumbo frames:
|
||||
|
||||
```
|
||||
CLAB_VPP_FILE=${CLAB_VPP_FILE:=/etc/vpp/clab.vpp}
|
||||
|
||||
echo "Generating $CLAB_VPP_FILE"
|
||||
: > $CLAB_VPP_FILE
|
||||
MTU=9216
|
||||
for IFNAME in $(ip -br link show type veth | cut -f1 -d@ | grep -v '^eth0$' | sort); do
|
||||
MAC=$(ip -br link show dev $IFNAME | awk '{ print $3 }')
|
||||
echo " * $IFNAME hw-addr $MAC mtu $MTU"
|
||||
ip link set $IFNAME up mtu $MTU
|
||||
cat << EOF >> $CLAB_VPP_FILE
|
||||
create host-interface name $IFNAME hw-addr $MAC
|
||||
set interface name host-$IFNAME $IFNAME
|
||||
set interface mtu $MTU $IFNAME
|
||||
set interface state $IFNAME up
|
||||
|
||||
EOF
|
||||
done
|
||||
```
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
One thing I realized is that VPP will assign a random MAC address on its copy of the `veth` port,
|
||||
which is not great. I'll explicitly configure it with the same MAC address as the `veth` interface
|
||||
itself, otherwise I'd have to put the interface into promiscuous mode.
|
||||
|
||||
#### VPP Containerlab: VPPcfg
|
||||
|
||||
I'm almost ready, but I have one more detail. The user will be able to offer a
|
||||
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] YAML file to configure the interfaces and so on. If such
|
||||
a file exists, I'll apply it to the dataplane upon startup:
|
||||
|
||||
```
|
||||
VPPCFG_VPP_FILE=${VPPCFG_VPP_FILE:=/etc/vpp/vppcfg.vpp}
|
||||
|
||||
echo "Generating $VPPCFG_VPP_FILE"
|
||||
: > $VPPCFG_VPP_FILE
|
||||
if [ -r /etc/vpp/vppcfg.yaml ]; then
|
||||
vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml -o $VPPCFG_VPP_FILE
|
||||
fi
|
||||
```
|
||||
|
||||
Once the VPP process starts, it'll execute `/etc/vpp/bootstrap.vpp`, which in turn executes these
|
||||
newly generated `/etc/vpp/clab.vpp` to grab the `veth` interfaces, and then `/etc/vpp/vppcfg.vpp` to
|
||||
further configure the dataplane. Easy peasy!
|
||||
|
||||
### Adding VPP to Containerlab
|
||||
|
||||
Roman points out a previous integration for the 6WIND VSR in
|
||||
[[PR#2540](https://github.com/srl-labs/containerlab/pull/2540)]. This serves as a useful guide to
|
||||
get me started. I fork the repo, create a branch so that Roman can also add a few commits, and
|
||||
together we start hacking in [[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)].
|
||||
|
||||
First, I add the documentation skeleton in `docs/manual/kinds/fdio_vpp.md`, which links in from a
|
||||
few other places, and will be where the end-user facing documentation will live. That's about half
|
||||
the contributed LOC, right there!
|
||||
|
||||
Next, I'll create a Go module in `nodes/fdio_vpp/fdio_vpp.go` which doesn't do much other than
|
||||
creating the `struct`, and its required `Register` and `Init` functions. The `Init` function ensures
|
||||
the right capabilities are set in Docker, and the right devices are bound for the container.
|
||||
|
||||
I notice that Containerlab rewrites the Dockerfile `CMD` string and prepends an `if-wait.sh` script
|
||||
to it. This is because when Containerlab starts the container, it'll still be busy adding these
|
||||
`link` interfaces to it, and if a container starts too quickly, it may not see all the interfaces.
|
||||
Containerlab therefore informs the container of the expected count using an environment variable called `CLAB_INTFS`, and this
script simply sleeps until that exact number of interfaces is present. Ok, cool beans.
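I haven't copied `if-wait.sh` here, but the idea boils down to something like this sketch (the real
script in the Containerlab repository may count and time things differently):

```
#!/bin/sh
# Sketch only: wait until this container has as many veth interfaces as
# Containerlab promised via CLAB_INTFS, then start the real entrypoint.
EXPECTED="${CLAB_INTFS:-0}"
while [ "$(ip -br link show type veth | grep -c -v '^eth0')" -lt "$EXPECTED" ]; do
  sleep 1
done
exec /sbin/init-container.sh
```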
|
||||
|
||||
Roman helps me a bit with Go templating. You see, I think it'll be slick to have the CLI prompt for
|
||||
the VPP containers to reflect their hostname, because normally, VPP will assign `vpp# `. I add the
|
||||
template in `nodes/fdio_vpp/vpp_startup_config.go.tpl` and it only has one variable expansion: `unix
|
||||
{ cli-prompt {{ .ShortName }}# }`. But I totally think it's worth it, because when running many VPP
|
||||
containers in the lab, it could otherwise get confusing.
|
||||
|
||||
Roman also shows me a trick in the function `PostDeploy()`, which will write the user's SSH pubkeys
|
||||
to `/root/.ssh/authorized_keys`. This allows users to log in without having to use password
|
||||
authentication.
|
||||
|
||||
Collectively, we decide to punt on the `SaveConfig` function until we're a bit further along. I have
|
||||
an idea how this would work, basically along the lines of calling `vppcfg dump` and bind-mounting
|
||||
that file into the lab directory somewhere. This way, upon restarting, the YAML file can be re-read
|
||||
and the dataplane initialized. But it'll be for another day.
|
||||
|
||||
After the main module is finished, all I have to do is add it to `clab/register.go` and that's just
|
||||
about it. In about 170 lines of code, 50 lines of Go template, and 170 lines of Markdown, this
|
||||
contribution is about ready to ship!
|
||||
|
||||
### Containerlab: Demo
|
||||
|
||||
After I finish writing the documentation, I decide to include a demo with a quickstart to help folks
|
||||
along. A simple lab showing two VPP instances and two Alpine Linux clients can be found on
|
||||
[[git.ipng.ch/ipng/vpp-containerlab](https://git.ipng.ch/ipng/vpp-containerlab)]. Simply check out the
|
||||
repo and start the lab, like so:
|
||||
|
||||
```
|
||||
$ git clone https://git.ipng.ch/ipng/vpp-containerlab.git
|
||||
$ cd vpp-containerlab
|
||||
$ containerlab deploy --topo vpp.clab.yml
|
||||
```
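Later, the same topology file can be used to look at, or tear down, the lab again. These are
standard Containerlab subcommands (double-check `containerlab --help` on your version):

```
$ containerlab inspect --topo vpp.clab.yml    # list the nodes, their state and management addresses
$ containerlab destroy --topo vpp.clab.yml    # clean the lab up again
```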
|
||||
|
||||
#### Containerlab: configs
|
||||
|
||||
The file `vpp.clab.yml` contains an example topology consisting of two VPP instances, each connected to
one Alpine Linux container, as shown below:
|
||||
|
||||
{{< image src="/assets/containerlab/learn-vpp.png" alt="Containerlab Topo" width="100%" >}}
|
||||
|
||||
Two relevant files for each VPP router are included in this
|
||||
[[repository](https://git.ipng.ch/ipng/vpp-containerlab)]:
|
||||
1. `config/vpp*/vppcfg.yaml` configures the dataplane interfaces, including a loopback address.
|
||||
1. `config/vpp*/bird-local.conf` configures the controlplane to enable BFD and OSPF.
|
||||
|
||||
To illustrate these files, let me take a closer look at node `vpp1`. Its VPP dataplane
|
||||
configuration looks like this:
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/vppcfg.yaml
|
||||
interfaces:
|
||||
eth1:
|
||||
description: 'To client1'
|
||||
mtu: 1500
|
||||
lcp: eth1
|
||||
addresses: [ 10.82.98.65/28, 2001:db8:8298:101::1/64 ]
|
||||
eth2:
|
||||
description: 'To vpp2'
|
||||
mtu: 9216
|
||||
lcp: eth2
|
||||
addresses: [ 10.82.98.16/31, 2001:db8:8298:1::1/64 ]
|
||||
loopbacks:
|
||||
loop0:
|
||||
description: 'vpp1'
|
||||
lcp: loop0
|
||||
addresses: [ 10.82.98.0/32, 2001:db8:8298::/128 ]
|
||||
```
|
||||
|
||||
Then, I enable BFD, OSPF and OSPFv3 on `eth2` and `loop0` on both of the VPP routers:
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/bird-local.conf
|
||||
protocol bfd bfd1 {
|
||||
interface "eth2" { interval 100 ms; multiplier 30; };
|
||||
}
|
||||
|
||||
protocol ospf v2 ospf4 {
|
||||
ipv4 { import all; export all; };
|
||||
area 0 {
|
||||
interface "loop0" { stub yes; };
|
||||
interface "eth2" { type pointopoint; cost 10; bfd on; };
|
||||
};
|
||||
}
|
||||
|
||||
protocol ospf v3 ospf6 {
|
||||
ipv6 { import all; export all; };
|
||||
area 0 {
|
||||
interface "loop0" { stub yes; };
|
||||
interface "eth2" { type pointopoint; cost 10; bfd on; };
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
#### Containerlab: playtime!
|
||||
|
||||
Once the lab comes up, I can SSH to the VPP containers (`vpp1` and `vpp2`) which have my SSH pubkeys
|
||||
installed thanks to Roman's work. Barring that, I could still log in as user `root` using
|
||||
password `vpp`. VPP runs in its own network namespace called `dataplane`, which is very similar to SR
Linux's default `network-instance`. I can join that namespace to take a closer look:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ ssh root@vpp1
|
||||
root@vpp1:~# nsenter --net=/var/run/netns/dataplane
|
||||
root@vpp1:~# ip -br a
|
||||
lo DOWN
|
||||
loop0 UP 10.82.98.0/32 2001:db8:8298::/128 fe80::dcad:ff:fe00:0/64
|
||||
eth1 UNKNOWN 10.82.98.65/28 2001:db8:8298:101::1/64 fe80::a8c1:abff:fe77:acb9/64
|
||||
eth2 UNKNOWN 10.82.98.16/31 2001:db8:8298:1::1/64 fe80::a8c1:abff:fef0:7125/64
|
||||
|
||||
root@vpp1:~# ping 10.82.98.1
|
||||
PING 10.82.98.1 (10.82.98.1) 56(84) bytes of data.
|
||||
64 bytes from 10.82.98.1: icmp_seq=1 ttl=64 time=9.53 ms
|
||||
64 bytes from 10.82.98.1: icmp_seq=2 ttl=64 time=15.9 ms
|
||||
^C
|
||||
--- 10.82.98.1 ping statistics ---
|
||||
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
|
||||
rtt min/avg/max/mdev = 9.530/12.735/15.941/3.205 ms
|
||||
```
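As an aside, the OSPF state can also be queried from Bird directly inside that same namespace;
`ospf4` and `ospf6` are the protocol names from `bird-local.conf` above (output omitted here):

```
root@vpp1:~# birdc show protocols
root@vpp1:~# birdc show ospf neighbors ospf4
root@vpp1:~# birdc show ospf neighbors ospf6
```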
|
||||
|
||||
From `vpp1`, I can tell that Bird2's OSPF adjacency has formed, because I can ping the `loop0`
|
||||
address of `vpp2` router on 10.82.98.1. Nice! The two client nodes are running a minimalistic Alpine
|
||||
Linux container, which doesn't ship with SSH by default. But of course I can still enter the
|
||||
containers using `docker exec`, like so:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker exec -it client1 sh
|
||||
/ # ip addr show dev eth1
|
||||
531235: eth1@if531234: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 9500 qdisc noqueue state UP
|
||||
link/ether 00:c1:ab:00:00:01 brd ff:ff:ff:ff:ff:ff
|
||||
inet 10.82.98.66/28 scope global eth1
|
||||
valid_lft forever preferred_lft forever
|
||||
inet6 2001:db8:8298:101::2/64 scope global
|
||||
valid_lft forever preferred_lft forever
|
||||
inet6 fe80::2c1:abff:fe00:1/64 scope link
|
||||
valid_lft forever preferred_lft forever
|
||||
/ # traceroute 10.82.98.82
|
||||
traceroute to 10.82.98.82 (10.82.98.82), 30 hops max, 46 byte packets
|
||||
1 10.82.98.65 (10.82.98.65) 5.906 ms 7.086 ms 7.868 ms
|
||||
2 10.82.98.17 (10.82.98.17) 24.007 ms 23.349 ms 15.933 ms
|
||||
3 10.82.98.82 (10.82.98.82) 39.978 ms 31.127 ms 31.854 ms
|
||||
|
||||
/ # traceroute 2001:db8:8298:102::2
|
||||
traceroute to 2001:db8:8298:102::2 (2001:db8:8298:102::2), 30 hops max, 72 byte packets
|
||||
1 2001:db8:8298:101::1 (2001:db8:8298:101::1) 0.701 ms 7.144 ms 7.900 ms
|
||||
2 2001:db8:8298:1::2 (2001:db8:8298:1::2) 23.909 ms 22.943 ms 23.893 ms
|
||||
3 2001:db8:8298:102::2 (2001:db8:8298:102::2) 31.964 ms 30.814 ms 32.000 ms
|
||||
```
|
||||
|
||||
From the vantage point of `client1`, the first hop represents the `vpp1` node, which forwards to
|
||||
`vpp2`, which finally forwards to `client2`, which shows that both VPP routers are passing traffic.
|
||||
Dope!
|
||||
|
||||
## Results
|
||||
|
||||
After all of this deep-diving, all that's left is for me to demonstrate the Containerlab setup by means of
|
||||
this little screencast [[asciinema](/assets/containerlab/vpp-containerlab.cast)]. I hope you enjoy
|
||||
it as much as I enjoyed creating it:
|
||||
|
||||
{{< asciinema src="/assets/containerlab/vpp-containerlab.cast" >}}
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
I wanted to give a shout-out to Roman Dodin for his help getting the Containerlab parts squared away
when I got a little bit stuck. He took the time to explain the internals and idioms of the Containerlab
|
||||
project, which really saved me a tonne of time. He also pair-programmed the
|
||||
[[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)] with me over the span of two
|
||||
evenings.
|
||||
|
||||
Collaborative open source rocks!
|
||||
713
content/articles/2025-05-28-minio-1.md
Normal file
@@ -0,0 +1,713 @@
|
||||
---
|
||||
date: "2025-05-28T22:07:23Z"
|
||||
title: 'Case Study: Minio S3 - Part 1'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/minio/minio-logo.png" alt="MinIO Logo" width="6em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading
|
||||
scalability, data availability, security, and performance. Millions of customers of all sizes and
|
||||
industries store, manage, analyze, and protect any amount of data for virtually any use case, such
|
||||
as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and
|
||||
easy-to-use management features, you can optimize costs, organize and analyze data, and configure
|
||||
fine-tuned access controls to meet specific business and compliance requirements.
|
||||
|
||||
Amazon's S3 became the _de facto_ standard object storage system, and there exist several fully open
|
||||
source implementations of the protocol. One of them is MinIO: designed to allow enterprises to
|
||||
consolidate all of their data on a single, private cloud namespace. Architected using the same
|
||||
principles as the hyperscalers, AIStor delivers performance at scale at a fraction of the cost
|
||||
compared to the public cloud.
|
||||
|
||||
IPng Networks is an Internet Service Provider, but I also dabble in self-hosting things, for
|
||||
example [[PeerTube](https://video.ipng.ch/)], [[Mastodon](https://ublog.tech/)],
|
||||
[[Immich](https://photos.ipng.ch/)], [[Pixelfed](https://pix.ublog.tech/)] and of course
|
||||
[[Hugo](https://ipng.ch/)]. These services all have one thing in common: they tend to use lots of
|
||||
storage when they grow. At IPng Networks, all hypervisors ship with enterprise SAS flash drives,
|
||||
mostly 1.92TB and 3.84TB. Scaling up each of these services, and backing them up safely, can be
|
||||
quite the headache.
|
||||
|
||||
This article is for the storage buffs. I'll set up a set of distributed MinIO nodes from scratch.
|
||||
|
||||
## Physical
|
||||
|
||||
{{< image float="right" src="/assets/minio/disks.png" alt="MinIO Disks" width="16em" >}}
|
||||
|
||||
I'll start with the basics. I still have a few Dell R720 servers laying around, they are getting a
|
||||
bit older but still have 24 cores and 64GB of memory. First I need to get me some disks. I order
|
||||
36 pcs of 16TB enterprise SATA disks, a mixture of Seagate EXOS and Toshiba MG series drives. I once
learned (the hard way) that buying a big stack of disks from one production run is a risk - so I'll
|
||||
mix and match the drives.
|
||||
|
||||
Three trays of caddies and a melted credit card later, I have 576TB of SATA disks safely in hand.
|
||||
Each machine will carry 192TB of raw storage. The nice thing about this chassis is that Dell can
|
||||
ship them with 12x 3.5" SAS slots in the front, and 2x 2.5" SAS slots in the rear of the chassis.
|
||||
|
||||
So I'll install Debian Bookworm on one small 480G SSD in software RAID1.
|
||||
|
||||
### Cloning an install
|
||||
|
||||
I have three identical machines so in total I'll want six of these SSDs. I temporarily screw the
|
||||
other five in 3.5" drive caddies and plug them into the first installed Dell, which I've called
|
||||
`minio-proto`:
|
||||
|
||||
|
||||
```
|
||||
pim@minio-proto:~$ for i in b c d e f; do
  sudo dd if=/dev/sda of=/dev/sd${i} bs=512 count=1;
  sudo mdadm --manage /dev/md0 --add /dev/sd${i}1
done
pim@minio-proto:~$ sudo mdadm --grow /dev/md0 --raid-devices=6
pim@minio-proto:~$ watch cat /proc/mdstat
pim@minio-proto:~$ for i in a b c d e f; do
  sudo grub-install /dev/sd$i
done
|
||||
```
|
||||
|
||||
{{< image float="right" src="/assets/minio/rack.png" alt="MinIO Rack" width="16em" >}}
|
||||
|
||||
The first command takes my installed disk, `/dev/sda`, and copies the first sector over to the other
|
||||
five. This will give them the same partition table. Next, I'll add the first partition of each disk
|
||||
to the raidset. Then, I'll expand the raidset to have six members, after which the kernel starts a
|
||||
recovery process that syncs the newly added partitions to `/dev/md0` (by copying from `/dev/sda` to
|
||||
all other disks at once). Finally, I'll watch this exciting movie and grab a cup of tea.
|
||||
|
||||
|
||||
Once the disks are fully copied, I'll shut down the machine and distribute the disks to their
|
||||
respective Dell R720, two each. Once they boot they will all be identical. I'll need to make sure
|
||||
their hostnames, and machine/host-id are unique, otherwise things like bridges will have overlapping
|
||||
MAC addresses - ask me how I know:
|
||||
|
||||
```
|
||||
pim@minio-proto:~$ sudo mdadm --grow /dev/md0 --raid-devices=2
pim@minio-proto:~$ sudo rm /etc/ssh/ssh_host*
pim@minio-proto:~$ sudo hostnamectl set-hostname minio0-chbtl0
pim@minio-proto:~$ sudo dpkg-reconfigure openssh-server
pim@minio-proto:~$ sudo dd if=/dev/random of=/etc/hostid bs=4 count=1
pim@minio-proto:~$ /usr/bin/dbus-uuidgen | sudo tee /etc/machine-id
pim@minio-proto:~$ sudo reboot
|
||||
```
|
||||
|
||||
After which I have three beautiful and unique machines:
|
||||
* `minio0.chbtl0.net.ipng.ch`: which will go into my server rack at the IPng office.
|
||||
* `minio0.ddln0.net.ipng.ch`: which will go to [[Daedalean]({{< ref
|
||||
2022-02-24-colo >}})], doing AI since before it was all about vibe coding.
|
||||
* `minio0.chrma0.net.ipng.ch`: which will go to [[IP-Max](https://ip-max.net/)], one of the best
|
||||
ISPs on the planet. 🥰
|
||||
|
||||
|
||||
## Deploying Minio
|
||||
|
||||
The user guide that MinIO provides
|
||||
[[ref](https://min.io/docs/minio/linux/operations/installation.html)] is super good, arguably one of
|
||||
the best documented open source projects I've ever seen. It shows me that I can do three types of
|
||||
install. A 'Standalone' with one disk, a 'Standalone Multi-Drive', and a 'Distributed' deployment.
|
||||
I decide to make three independent standalone multi-drive installs. This way, I have less shared
|
||||
fate, and will be immune to network partitions (as these are going to be in three different
|
||||
physical locations). I've also read about per-bucket _replication_, which will be an excellent way
|
||||
to get geographical distribution and active/active instances to work together.
|
||||
|
||||
I feel good about the single-machine multi-drive decision. I follow the install guide
|
||||
[[ref](https://min.io/docs/minio/linux/operations/install-deploy-manage/deploy-minio-single-node-multi-drive.html#minio-snmd)]
|
||||
for this deployment type.
|
||||
|
||||
### IPng Frontends
|
||||
|
||||
At IPng I use a private IPv4/IPv6/MPLS network that is not connected to the internet. I call this
|
||||
network [[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})]. But how will users reach my Minio
|
||||
install? I have four redundantly and geographically deployed frontends, two in the Netherlands and
|
||||
two in Switzerland. I've described the frontend setup in a [[previous article]({{< ref
|
||||
2023-03-17-ipng-frontends >}})] and the certificate management in [[this article]({{< ref
|
||||
2023-03-24-lego-dns01 >}})].
|
||||
|
||||
I've decided to run the service on these three regionalized endpoints:
|
||||
1. `s3.chbtl0.ipng.ch` which will back into `minio0.chbtl0.net.ipng.ch`
|
||||
1. `s3.ddln0.ipng.ch` which will back into `minio0.ddln0.net.ipng.ch`
|
||||
1. `s3.chrma0.ipng.ch` which will back into `minio0.chrma0.net.ipng.ch`
|
||||
|
||||
The first thing I take note of is that S3 buckets can be either addressed _by path_, in other words
|
||||
something like `s3.chbtl0.ipng.ch/my-bucket/README.md`, but they can also be addressed by virtual
|
||||
host, like so: `my-bucket.s3.chbtl0.ipng.ch/README.md`. A subtle difference, but from the docs I
|
||||
understand that Minio needs to have control of the whole space under its main domain.
|
||||
|
||||
There's a small implication to this requirement -- the Web Console that ships with MinIO (eh, well,
|
||||
maybe that's going to change, more on that later), will want to have its own domain-name, so I
|
||||
choose something simple: `cons0-s3.chbtl0.ipng.ch` and so on. This way, somebody might still be able
|
||||
to have a bucket name called `cons0` :)
|
||||
|
||||
#### Let's Encrypt Certificates
|
||||
|
||||
Alright, so I will be kneading nine domains into this new certificate, which I'll simply call
`s3.ipng.ch`. I configure it in Ansible:
|
||||
|
||||
```
|
||||
certbot:
|
||||
certs:
|
||||
...
|
||||
s3.ipng.ch:
|
||||
groups: [ 'nginx', 'minio' ]
|
||||
altnames:
|
||||
- 's3.chbtl0.ipng.ch'
|
||||
- 'cons0-s3.chbtl0.ipng.ch'
|
||||
- '*.s3.chbtl0.ipng.ch'
|
||||
- 's3.ddln0.ipng.ch'
|
||||
- 'cons0-s3.ddln0.ipng.ch'
|
||||
- '*.s3.ddln0.ipng.ch'
|
||||
- 's3.chrma0.ipng.ch'
|
||||
- 'cons0-s3.chrma0.ipng.ch'
|
||||
- '*.s3.chrma0.ipng.ch'
|
||||
```
|
||||
|
||||
I run the `certbot` playbook and it does two things:
|
||||
1. On the machines from group `nginx` and `minio`, it will ensure there exists a user `lego` with
|
||||
an SSH key and write permissions to `/etc/lego/`; this is where the automation will write (and
|
||||
update) the certificate keys.
|
||||
1. On the `lego` machine, it'll create two files. One is the certificate requestor, and the other
|
||||
is a certificate distribution script that will copy the cert to the right machine(s) when it
|
||||
renews.
|
||||
|
||||
On the `lego` machine, I'll run the cert request for the first time:
|
||||
|
||||
```
|
||||
lego@lego:~$ bin/certbot:s3.ipng.ch
|
||||
lego@lego:~$ RENEWED_LINEAGE=/home/lego/acme-dns/live/s3.ipng.ch bin/certbot-distribute
|
||||
```
|
||||
|
||||
The first script asks me to add the `_acme-challenge` DNS entries, which I'll do, for example on the
`s3.chbtl0.ipng.ch` instance (and similarly for the `ddln0` and `chrma0` ones):
|
||||
|
||||
```
|
||||
$ORIGIN chbtl0.ipng.ch.
|
||||
_acme-challenge.s3 CNAME 51f16fd0-8eb6-455c-b5cd-96fad12ef8fd.auth.ipng.ch.
|
||||
_acme-challenge.cons0-s3 CNAME 450477b8-74c9-4b9e-bbeb-de49c3f95379.auth.ipng.ch.
|
||||
s3 CNAME nginx0.ipng.ch.
|
||||
*.s3 CNAME nginx0.ipng.ch.
|
||||
cons0-s3 CNAME nginx0.ipng.ch.
|
||||
```
|
||||
|
||||
I push and reload the `ipng.ch` zonefile with these changes after which the certificate gets
|
||||
requested and a cronjob added to check for renewals. The second script will copy the newly created
|
||||
cert to all three `minio` machines, and all four `nginx` machines. From now on, every 90 days, a new
|
||||
cert will be automatically generated and distributed. Slick!
|
||||
|
||||
#### NGINX Configs
|
||||
|
||||
With the LE wildcard certs in hand, I can create an NGINX frontend for these minio deployments.
|
||||
|
||||
First, a simple redirector service that punts people on port 80 to port 443:
|
||||
|
||||
```
|
||||
server {
|
||||
listen [::]:80;
|
||||
listen 0.0.0.0:80;
|
||||
|
||||
server_name cons0-s3.chbtl0.ipng.ch s3.chbtl0.ipng.ch *.s3.chbtl0.ipng.ch;
|
||||
access_log /var/log/nginx/s3.chbtl0.ipng.ch-access.log;
|
||||
include /etc/nginx/conf.d/ipng-headers.inc;
|
||||
|
||||
location / {
|
||||
return 301 https://$server_name$request_uri;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Next, the Minio API service itself which runs on port 9000, with a configuration snippet inspired by
|
||||
the MinIO [[docs](https://min.io/docs/minio/linux/integrations/setup-nginx-proxy-with-minio.html)]:
|
||||
|
||||
```
|
||||
server {
|
||||
listen [::]:443 ssl http2;
|
||||
listen 0.0.0.0:443 ssl http2;
|
||||
ssl_certificate /etc/certs/s3.ipng.ch/fullchain.pem;
|
||||
ssl_certificate_key /etc/certs/s3.ipng.ch/privkey.pem;
|
||||
include /etc/nginx/conf.d/options-ssl-nginx.inc;
|
||||
ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
|
||||
|
||||
server_name s3.chbtl0.ipng.ch *.s3.chbtl0.ipng.ch;
|
||||
access_log /var/log/nginx/s3.chbtl0.ipng.ch-access.log upstream;
|
||||
include /etc/nginx/conf.d/ipng-headers.inc;
|
||||
|
||||
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
|
||||
|
||||
ignore_invalid_headers off;
|
||||
client_max_body_size 0;
|
||||
# Disable buffering
|
||||
proxy_buffering off;
|
||||
proxy_request_buffering off;
|
||||
|
||||
location / {
|
||||
proxy_set_header Host $http_host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
|
||||
proxy_connect_timeout 300;
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Connection "";
|
||||
chunked_transfer_encoding off;
|
||||
|
||||
proxy_pass http://minio0.chbtl0.net.ipng.ch:9000;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Finally, the Minio Console service which runs on port 9090:
|
||||
|
||||
```
|
||||
include /etc/nginx/conf.d/geo-ipng-trusted.inc;
|
||||
|
||||
server {
|
||||
listen [::]:443 ssl http2;
|
||||
listen 0.0.0.0:443 ssl http2;
|
||||
ssl_certificate /etc/certs/s3.ipng.ch/fullchain.pem;
|
||||
ssl_certificate_key /etc/certs/s3.ipng.ch/privkey.pem;
|
||||
include /etc/nginx/conf.d/options-ssl-nginx.inc;
|
||||
ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
|
||||
|
||||
server_name cons0-s3.chbtl0.ipng.ch;
|
||||
access_log /var/log/nginx/cons0-s3.chbtl0.ipng.ch-access.log upstream;
|
||||
include /etc/nginx/conf.d/ipng-headers.inc;
|
||||
|
||||
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
|
||||
|
||||
ignore_invalid_headers off;
|
||||
client_max_body_size 0;
|
||||
# Disable buffering
|
||||
proxy_buffering off;
|
||||
proxy_request_buffering off;
|
||||
|
||||
location / {
|
||||
if ($geo_ipng_trusted = 0) { rewrite ^ https://ipng.ch/ break; }
|
||||
proxy_set_header Host $http_host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
proxy_set_header X-NginX-Proxy true;
|
||||
|
||||
real_ip_header X-Real-IP;
|
||||
proxy_connect_timeout 300;
|
||||
chunked_transfer_encoding off;
|
||||
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Upgrade $http_upgrade;
|
||||
proxy_set_header Connection "upgrade";
|
||||
|
||||
proxy_pass http://minio0.chbtl0.net.ipng.ch:9090;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This last one has an NGINX trick. It will only allow users in if they are in the map called
|
||||
`geo_ipng_trusted`, which contains a set of IPv4 and IPv6 prefixes. Visitors who are not in this map
|
||||
will receive an HTTP redirect back to the [[IPng.ch](https://ipng.ch/)] homepage instead.
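
The map itself lives in `/etc/nginx/conf.d/geo-ipng-trusted.inc`, which isn't shown in this article.
As a sketch of what such an include might look like (the prefixes below are documentation examples,
not IPng's real ranges):

```
geo $geo_ipng_trusted {
    default         0;
    192.0.2.0/24    1;   # example: a trusted IPv4 prefix
    2001:db8::/32   1;   # example: a trusted IPv6 prefix
}
```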
|
||||
|
||||
I run the Ansible Playbook which contains the NGINX changes to all frontends, but of course nothing
|
||||
runs yet, because I haven't yet started MinIO backends.
|
||||
|
||||
### MinIO Backends
|
||||
|
||||
The first thing I need to do is get those disks mounted. MinIO likes using XFS, so I'll install that
|
||||
and prepare the disks as follows:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ sudo apt install xfsprogs
|
||||
pim@minio0-chbtl0:~$ sudo modprobe xfs
|
||||
pim@minio0-chbtl0:~$ echo xfs | sudo tee -a /etc/modules
|
||||
pim@minio0-chbtl0:~$ sudo update-initramfs -k all -u
|
||||
pim@minio0-chbtl0:~$ for i in a b c d e f g h i j k l; do sudo mkfs.xfs /dev/sd$i; done
|
||||
pim@minio0-chbtl0:~$ blkid | awk 'BEGIN {i=1} /TYPE="xfs"/ {
|
||||
printf "%s /minio/disk%d xfs defaults 0 2\n",$2,i; i++;
|
||||
}' | sudo tee -a /etc/fstab
|
||||
pim@minio0-chbtl0:~$ for i in `seq 1 12`; do sudo mkdir -p /minio/disk$i; done
|
||||
pim@minio0-chbtl0:~$ sudo mount -t xfs -a
|
||||
pim@minio0-chbtl0:~$ sudo chown -R minio-user: /minio/
|
||||
```
|
||||
|
||||
From the top: I'll install `xfsprogs` which contains the things I need to manipulate XFS filesystems
|
||||
in Debian. Then I'll install the `xfs` kernel module, and make sure it gets inserted upon subsequent
|
||||
startup by adding it to `/etc/modules` and regenerating the initrd for the installed kernels.
|
||||
|
||||
Next, I'll format all twelve 16TB disks (which are `/dev/sda` - `/dev/sdl` on these machines), and
|
||||
add their resulting blockdevice id's to `/etc/fstab` so they get persistently mounted on reboot.
|
||||
|
||||
Finally, I'll create their mountpoints, mount all XFS filesystems, and chown them to the user that
|
||||
MinIO is running as. End result:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ df -T
|
||||
Filesystem Type 1K-blocks Used Available Use% Mounted on
|
||||
udev devtmpfs 32950856 0 32950856 0% /dev
|
||||
tmpfs tmpfs 6595340 1508 6593832 1% /run
|
||||
/dev/md0 ext4 114695308 5423976 103398948 5% /
|
||||
tmpfs tmpfs 32976680 0 32976680 0% /dev/shm
|
||||
tmpfs tmpfs 5120 4 5116 1% /run/lock
|
||||
/dev/sda xfs 15623792640 121505936 15502286704 1% /minio/disk1
|
||||
/dev/sde xfs 15623792640 121505968 15502286672 1% /minio/disk12
|
||||
/dev/sdi xfs 15623792640 121505968 15502286672 1% /minio/disk11
|
||||
/dev/sdl xfs 15623792640 121505904 15502286736 1% /minio/disk10
|
||||
/dev/sdd xfs 15623792640 121505936 15502286704 1% /minio/disk4
|
||||
/dev/sdb xfs 15623792640 121505968 15502286672 1% /minio/disk3
|
||||
/dev/sdk xfs 15623792640 121505936 15502286704 1% /minio/disk5
|
||||
/dev/sdc xfs 15623792640 121505936 15502286704 1% /minio/disk9
|
||||
/dev/sdf xfs 15623792640 121506000 15502286640 1% /minio/disk2
|
||||
/dev/sdj xfs 15623792640 121505968 15502286672 1% /minio/disk7
|
||||
/dev/sdg xfs 15623792640 121506000 15502286640 1% /minio/disk8
|
||||
/dev/sdh xfs 15623792640 121505968 15502286672 1% /minio/disk6
|
||||
tmpfs tmpfs 6595336 0 6595336 0% /run/user/0
|
||||
```
|
||||
|
||||
MinIO likes to be configured using environment variables - and this is likely because it's a popular
|
||||
thing to run in a containerized environment like Kubernetes. The maintainers ship it also as a
|
||||
Debian package, which will read its environment from `/etc/default/minio`, and I'll prepare that
|
||||
file as follows:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/default/minio
|
||||
MINIO_DOMAIN="s3.chbtl0.ipng.ch,minio0.chbtl0.net.ipng.ch"
|
||||
MINIO_ROOT_USER="XXX"
|
||||
MINIO_ROOT_PASSWORD="YYY"
|
||||
MINIO_VOLUMES="/minio/disk{1...12}"
|
||||
MINIO_OPTS="--console-address :9001"
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ sudo systemctl enable --now minio
|
||||
pim@minio0-chbtl0:~$ sudo journalctl -u minio
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: MinIO Object Storage Server
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: Copyright: 2015-2025 MinIO, Inc.
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: Version: RELEASE.2025-05-24T17-08-30Z (go1.24.3 linux/amd64)
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: API: http://198.19.4.11:9000 http://127.0.0.1:9000
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: WebUI: https://cons0-s3.chbtl0.ipng.ch/
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: Docs: https://docs.min.io
|
||||
|
||||
pim@minio0-chbtl0:~$ sudo ipmitool sensor | grep Watts
|
||||
Pwr Consumption | 154.000 | Watts
|
||||
```
|
||||
|
||||
Incidentally - I am pretty pleased with this 192TB disk tank, sporting 24 cores, 64GB memory and
|
||||
2x10G network, casually hanging out at 154 Watts of power all up. Slick!
|
||||
|
||||
{{< image float="right" src="/assets/minio/minio-ec.svg" alt="MinIO Erasure Coding" width="22em" >}}
|
||||
|
||||
MinIO implements _erasure coding_ as a core component in providing availability and resiliency
|
||||
during drive or node-level failure events. MinIO partitions each object into data and parity shards
|
||||
and distributes those shards across a single so-called _erasure set_. Under the hood, it uses
|
||||
a [[Reed-Solomon](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction)] erasure coding
implementation and partitions the object for distribution. From the MinIO website, I'll borrow a
diagram (shown to the right) of what this looks like on a single node like mine.
|
||||
|
||||
Anyway, MinIO detects 12 disks and installs an erasure set with 8 data disks and 4 parity disks,
|
||||
which it calls `EC:4` encoding, also known in the industry as `RS8.4`.
|
||||
Just like that, the thing shoots to life. Awesome!
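
A quick back-of-the-envelope sketch of what `EC:4` costs me in capacity (MinIO's own accounting comes
out slightly lower, since it reports in TiB and reserves a little space for itself):

```
pim@summer:~$ python3 -c '
raw    = 12 * 16           # twelve 16 TB drives
usable = raw * 8 / 12      # 8 data shards out of 12 (EC:4 = 4 parity shards)
print(f"raw: {raw} TB, usable: {usable:.0f} TB ({usable*1e12/2**40:.0f} TiB)")'
raw: 192 TB, usable: 128 TB (116 TiB)
```

That 116 TiB lines up nicely with what `mc admin info` reports below.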
|
||||
|
||||
### MinIO Client
|
||||
|
||||
On Summer, I'll install the MinIO Client called `mc`. This is easy because the maintainers ship a
|
||||
Linux binary which I can just download. On OpenBSD, they don't do that. Not a problem though, on
|
||||
Squanchy, Pencilvester and Glootie, I will just `go install` the client. Using the `mc` commandline,
|
||||
I can call any of the S3 APIs on my new MinIO instance:
|
||||
|
||||
```
|
||||
pim@summer:~$ set +o history
|
||||
pim@summer:~$ mc alias set chbtl0 https://s3.chbtl0.ipng.ch/ <rootuser> <rootpass>
|
||||
pim@summer:~$ set -o history
|
||||
pim@summer:~$ mc admin info chbtl0/
|
||||
● s3.chbtl0.ipng.ch
|
||||
Uptime: 22 hours
|
||||
Version: 2025-05-24T17:08:30Z
|
||||
Network: 1/1 OK
|
||||
Drives: 12/12 OK
|
||||
Pool: 1
|
||||
|
||||
┌──────┬───────────────────────┬─────────────────────┬──────────────┐
|
||||
│ Pool │ Drives Usage │ Erasure stripe size │ Erasure sets │
|
||||
│ 1st │ 0.8% (total: 116 TiB) │ 12 │ 1 │
|
||||
└──────┴───────────────────────┴─────────────────────┴──────────────┘
|
||||
|
||||
95 GiB Used, 5 Buckets, 5,859 Objects, 318 Versions, 1 Delete Marker
|
||||
12 drives online, 0 drives offline, EC:4
|
||||
|
||||
```
|
||||
|
||||
Cool beans. I think I should get rid of this root account though, I've installed those credentials
|
||||
into the `/etc/default/minio` environment file, but I don't want to keep them out in the open. So
|
||||
I'll make an account for myself and assign me reasonable privileges, called `consoleAdmin` in the
|
||||
default install:
|
||||
|
||||
```
|
||||
pim@summer:~$ set +o history
|
||||
pim@summer:~$ mc admin user add chbtl0/ <someuser> <somepass>
|
||||
pim@summer:~$ mc admin policy info chbtl0 consoleAdmin
|
||||
pim@summer:~$ mc admin policy attach chbtl0 consoleAdmin --user=<someuser>
|
||||
pim@summer:~$ mc alias set chbtl0 https://s3.chbtl0.ipng.ch/ <someuser> <somepass>
|
||||
pim@summer:~$ set -o history
|
||||
```
|
||||
|
||||
OK, I feel less gross now that I'm not operating as root on the MinIO deployment. Using my new
|
||||
user-powers, let me set some metadata on my new minio server:
|
||||
|
||||
```
|
||||
pim@summer:~$ mc admin config set chbtl0/ site name=chbtl0 region=switzerland
|
||||
Successfully applied new settings.
|
||||
Please restart your server 'mc admin service restart chbtl0/'.
|
||||
pim@summer:~$ mc admin service restart chbtl0/
|
||||
Service status: ▰▰▱ [DONE]
|
||||
Summary:
|
||||
┌───────────────┬─────────────────────────────┐
|
||||
│ Servers: │ 1 online, 0 offline, 0 hung │
|
||||
│ Restart Time: │ 61.322886ms │
|
||||
└───────────────┴─────────────────────────────┘
|
||||
pim@summer:~$ mc admin config get chbtl0/ site
|
||||
site name=chbtl0 region=switzerland
|
||||
```
|
||||
|
||||
By the way, what's really cool about these open standards is that both the Amazon `aws` client works
|
||||
with MinIO, but `mc` also works with AWS!
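
For example, pointing the `aws` CLI at this deployment is just a matter of overriding the endpoint
(a sketch, assuming the `aws` CLI is installed and the same access key pair has been set up with
`aws configure`):

```
pim@summer:~$ aws --endpoint-url https://s3.chbtl0.ipng.ch s3 ls
```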
|
||||
### MinIO Console
|
||||
|
||||
Although I'm pretty good with APIs and command line tools, there's some benefit also in using a
|
||||
Graphical User Interface. MinIO ships with one, but there was a bit of a kerfuffle in the MinIO
|
||||
community. Unfortunately, these are pretty common -- Redis (an open source key/value storage system)
|
||||
changed their offering abruptly. Terraform (an open source infrastructure-as-code tool) changed
|
||||
their licensing at some point. Ansible (an open source machine management tool) changed their
|
||||
offering also. MinIO developers decided to strip their console of ~all features recently. The gnarly
|
||||
bits are discussed on
|
||||
[[reddit](https://www.reddit.com/r/selfhosted/comments/1kva3pw/avoid_minio_developers_introduce_trojan_horse/)],
but suffice to say: the same thing that happened in literally 100% of the other cases also happened
|
||||
here. Somebody decided to simply fork the code from before it was changed.
|
||||
|
||||
Enter OpenMaxIO. A cringe-worthy name, but it gets the job done. Reading up on the
|
||||
[[GitHub](https://github.com/OpenMaxIO/openmaxio-object-browser/issues/5)], reviving the fully
|
||||
working console is pretty straightforward -- that is, once somebody spent a few days figuring it
|
||||
out. Thank you `icesvz` for this excellent pointer. With this, I can create a systemd service for
|
||||
the console and start it:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee -a /etc/default/minio
|
||||
## NOTE(pim): For openmaxio console service
|
||||
CONSOLE_MINIO_SERVER="http://localhost:9000"
|
||||
MINIO_BROWSER_REDIRECT_URL="https://cons0-s3.chbtl0.ipng.ch/"
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /lib/systemd/system/minio-console.service
|
||||
[Unit]
|
||||
Description=OpenMaxIO Console Service
|
||||
Wants=network-online.target
|
||||
After=network-online.target
|
||||
AssertFileIsExecutable=/usr/local/bin/minio-console
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
|
||||
WorkingDirectory=/usr/local
|
||||
|
||||
User=minio-user
|
||||
Group=minio-user
|
||||
ProtectProc=invisible
|
||||
|
||||
EnvironmentFile=-/etc/default/minio
|
||||
ExecStart=/usr/local/bin/minio-console server
|
||||
Restart=always
|
||||
LimitNOFILE=1048576
|
||||
MemoryAccounting=no
|
||||
TasksMax=infinity
|
||||
TimeoutSec=infinity
|
||||
OOMScoreAdjust=-1000
|
||||
SendSIGKILL=no
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ sudo systemctl enable --now minio-console
|
||||
pim@minio0-chbtl0:~$ sudo systemctl restart minio
|
||||
```
|
||||
|
||||
The first snippet is an update to the MinIO configuration that instructs it to redirect users who
|
||||
are not trying to use the API to the console endpoint on `cons0-s3.chbtl0.ipng.ch`, and then the
|
||||
console-server needs to know where to find the API, which from its vantage point is running on
|
||||
`localhost:9000`. Hello, beautiful fully featured console:
|
||||
|
||||
{{< image src="/assets/minio/console-1.png" alt="MinIO Console" >}}
|
||||
|
||||
### MinIO Prometheus
|
||||
|
||||
MinIO ships with a prometheus metrics endpoint, and I notice on its console that it has a nice
|
||||
metrics tab, which is fully greyed out. This is most likely because, well, I don't have a Prometheus
|
||||
install here yet. I decide to keep the storage nodes self-contained and start a Prometheus server on
|
||||
the local machine. I can always plumb that to IPng's Grafana instance later.
|
||||
|
||||
For now, I'll install Prometheus as follows:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee -a /etc/default/minio
|
||||
## NOTE(pim): Metrics for minio-console
|
||||
MINIO_PROMETHEUS_AUTH_TYPE="public"
|
||||
CONSOLE_PROMETHEUS_URL="http://localhost:19090/"
|
||||
CONSOLE_PROMETHEUS_JOB_ID="minio-job"
|
||||
EOF
|
||||
|
||||
pim@minio0-chbtl0:~$ sudo apt install prometheus
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/default/prometheus
|
||||
ARGS="--web.listen-address='[::]:19090' --storage.tsdb.retention.size=16GB"
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/prometheus/prometheus.yml
|
||||
global:
|
||||
scrape_interval: 60s
|
||||
|
||||
scrape_configs:
|
||||
- job_name: minio-job
|
||||
metrics_path: /minio/v2/metrics/cluster
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: minio-job-node
|
||||
metrics_path: /minio/v2/metrics/node
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: minio-job-bucket
|
||||
metrics_path: /minio/v2/metrics/bucket
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: minio-job-resource
|
||||
metrics_path: /minio/v2/metrics/resource
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: node
|
||||
static_configs:
|
||||
- targets: ['localhost:9100']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
pim@minio0-chbtl0:~$ sudo systemctl restart minio prometheus
|
||||
```
|
||||
|
||||
In the first snippet, I'll tell MinIO where it should find its Prometheus instance. Since the MinIO
|
||||
console service is running on port 9090, and this is also the default port for Prometheus, I will
|
||||
run Prometheus on port 19090 instead. From reading the MinIO docs, I can see that normally MinIO will
|
||||
want prometheus to authenticate to it before it'll allow the endpoints to be scraped. I'll turn that
|
||||
off by making these public. On the IPng Frontends, I can always remove access to /minio/v2 and
|
||||
simply use the IPng Site Local access for local Prometheus scrapers instead.
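
If I ever do want to close that off at the edge, it would be a one-line addition to the API server
block on the frontends (a sketch, not something I've deployed here):

```
    # inside the s3.chbtl0.ipng.ch server block, before the catch-all 'location /'
    location /minio/v2/ { return 403; }
```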
|
||||
|
||||
After telling Prometheus its runtime arguments (in `/etc/default/prometheus`) and its scraping
|
||||
endpoints (in `/etc/prometheus/prometheus.yml`), I can restart minio and prometheus. A few minutes
|
||||
later, I can see the _Metrics_ tab in the console come to life.
|
||||
|
||||
But now that I have this prometheus running on the MinIO node, I can also add it to IPng's Grafana
|
||||
configuration, by adding a new data source on `minio0.chbtl0.net.ipng.ch:19090` and pointing the
|
||||
default Grafana [[Dashboard](https://grafana.com/grafana/dashboards/13502-minio-dashboard/)] at it:
|
||||
|
||||
{{< image src="/assets/minio/console-2.png" alt="Grafana Dashboard" >}}
|
||||
|
||||
A two-for-one: I will both be able to see metrics directly in the console, but also I will be able
|
||||
to hook up these per-node prometheus instances into IPng's alertmanager also, and I've read some
|
||||
[[docs](https://min.io/docs/minio/linux/operations/monitoring/collect-minio-metrics-using-prometheus.html)]
|
||||
on the concepts. I'm really liking the experience so far!
|
||||
|
||||
### MinIO Nagios
|
||||
|
||||
Prometheus is fancy and all, but at IPng Networks, I've been doing monitoring for a while now. As a
|
||||
dinosaur, I still have an active [[Nagios](https://www.nagios.org/)] install, which autogenerates
|
||||
all of its configuration using the Ansible repository I have. So for the new Ansible group called
|
||||
`minio`, I will autogenerate the following snippet:
|
||||
|
||||
```
|
||||
define command {
|
||||
command_name ipng_check_minio
|
||||
command_line $USER1$/check_http -E -H $HOSTALIAS$ -I $ARG1$ -p $ARG2$ -u $ARG3$ -r '$ARG4$'
|
||||
}
|
||||
|
||||
define service {
|
||||
hostgroup_name ipng:minio:ipv6
|
||||
service_description minio6:api
|
||||
check_command ipng_check_minio!$_HOSTADDRESS6$!9000!/minio/health/cluster!
|
||||
use ipng-service-fast
|
||||
notification_interval 0 ; set > 0 if you want to be renotified
|
||||
}
|
||||
|
||||
define service {
|
||||
hostgroup_name ipng:minio:ipv6
|
||||
service_description minio6:prom
|
||||
check_command ipng_check_minio!$_HOSTADDRESS6$!19090!/classic/targets!minio-job
|
||||
use ipng-service-fast
|
||||
notification_interval 0 ; set > 0 if you want to be renotified
|
||||
}
|
||||
|
||||
define service {
|
||||
hostgroup_name ipng:minio:ipv6
|
||||
service_description minio6:console
|
||||
check_command ipng_check_minio!$_HOSTADDRESS6$!9090!/!MinIO Console
|
||||
use ipng-service-fast
|
||||
notification_interval 0 ; set > 0 if you want to be renotified
|
||||
}
|
||||
```
|
||||
|
||||
I've shown the snippet for IPv6 but I also have three services defined for legacy IP in the
|
||||
hostgroup `ipng:minio:ipv4`. The check command here uses `-I` which has the IPv4 or IPv6 address to
|
||||
talk to, `-p` for the port to consult, `-u` for the URI to hit and an option `-r` for a regular
expression to expect in the output. For the Nagios aficionados out there: my Ansible `groups`
|
||||
correspond one to one with autogenerated Nagios `hostgroups`. This allows me to add arbitrary checks
|
||||
by group-type, like above in the `ipng:minio` group for IPv4 and IPv6.
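
To make the macro soup a bit more concrete, the `minio6:prom` service above expands to roughly the
following plugin invocation (a sketch; the plugin path and the IPv6 address are examples):

```
/usr/lib/nagios/plugins/check_http -E -H minio0.chbtl0.net.ipng.ch \
    -I 2001:db8:8298::11 -p 19090 -u /classic/targets -r 'minio-job'
```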
|
||||
|
||||
In the MinIO [[docs](https://min.io/docs/minio/linux/operations/monitoring/healthcheck-probe.html)]
|
||||
I read up on the Healthcheck API. I choose to monitor the _Cluster Write Quorum_ on my minio
|
||||
deployments. For Prometheus, I decide to hit the `targets` endpoint and expect the `minio-job` to be
|
||||
among them. Finally, for the MinIO Console, I expect to see a login screen with the words `MinIO
|
||||
Console` in the returned page. I guessed right, because Nagios is all green:
|
||||
|
||||
{{< image src="/assets/minio/nagios.png" alt="Nagios Dashboard" >}}
|
||||
|
||||
## My First Bucket
|
||||
|
||||
The IPng website is a statically generated Hugo site, and whenever I submit a change to my Git
|
||||
repo, a CI/CD runner (called [[Drone](https://www.drone.io/)]), picks up the change. It re-builds
|
||||
the static website, and copies it to four redundant NGINX servers.
|
||||
|
||||
But IPng's website has amassed quite a few extra files (like VM images and VPP packages that I
|
||||
publish), which are copied separately using a simple push script I have in my home directory. This
|
||||
avoids all those big media files from cluttering the Git repository. I decide to move this stuff
|
||||
into S3:
|
||||
|
||||
```
|
||||
pim@summer:~/src/ipng-web-assets$ echo 'Gruezi World.' > ipng.ch/media/README.md
|
||||
pim@summer:~/src/ipng-web-assets$ mc mb chbtl0/ipng-web-assets
|
||||
pim@summer:~/src/ipng-web-assets$ mc mirror . chbtl0/ipng-web-assets/
|
||||
...ch/media/README.md: 6.50 GiB / 6.50 GiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 236.38 MiB/s 28s
|
||||
pim@summer:~/src/ipng-web-assets$ mc anonymous set download chbtl0/ipng-web-assets/
|
||||
```
|
||||
|
||||
OK, two things that immediately jump out at me. This stuff is **fast**: Summer is connected with a
|
||||
2.5GbE network card, and she's running hard, copying the 6.5GB of data that are in these web assets
|
||||
essentially at line rate. It doesn't really surprise me because Summer is running off of Gen4 NVME,
|
||||
while MinIO has 12 spinning disks which each can write about 160MB/s or so sustained
|
||||
[[ref](https://www.seagate.com/www-content/datasheets/pdfs/exos-x16-DS2011-1-1904US-en_US.pdf)],
|
||||
with 24 CPUs to tend to the NIC (2x10G) and disks (2x SSD, 12x LFF). Should be plenty!
|
||||
|
||||
The second is that MinIO allows for buckets to be publicly shared in three ways: 1) read-only by
|
||||
setting `download`; 2) write-only by setting `upload`, and 3) read-write by setting `public`.
|
||||
I set `download` here, which means I should be able to fetch an asset now publicly:
|
||||
|
||||
```
|
||||
pim@summer:~$ curl https://s3.chbtl0.ipng.ch/ipng-web-assets/ipng.ch/media/README.md
|
||||
Gruezi World.
|
||||
pim@summer:~$ curl https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/README.md
|
||||
Gruezi World.
|
||||
```
|
||||
|
||||
The first `curl` here shows the path-based access, while the second one shows an equivalent
|
||||
virtual-host based access. Both retrieve the file I just pushed via the public Internet. Whoot!
|
||||
|
||||
# What's Next
|
||||
|
||||
I'm going to be moving [[Restic](https://restic.net/)] backups from IPng's ZFS storage pool to this
|
||||
S3 service over the next few days. I'll also migrate PeerTube and possibly Mastodon from NVME based
|
||||
storage to replicated S3 buckets as well. Finally, the IPng website media that I mentioned above,
|
||||
should make for a nice followup article. Stay tuned!
|
||||
475
content/articles/2025-06-01-minio-2.md
Normal file
@@ -0,0 +1,475 @@
|
||||
---
|
||||
date: "2025-06-01T10:07:23Z"
|
||||
title: 'Case Study: Minio S3 - Part 2'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/minio/minio-logo.png" alt="MinIO Logo" width="6em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading
|
||||
scalability, data availability, security, and performance. Millions of customers of all sizes and
|
||||
industries store, manage, analyze, and protect any amount of data for virtually any use case, such
|
||||
as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and
|
||||
easy-to-use management features, you can optimize costs, organize and analyze data, and configure
|
||||
fine-tuned access controls to meet specific business and compliance requirements.
|
||||
|
||||
Amazon's S3 became the _de facto_ standard object storage system, and there exist several fully open
|
||||
source implementations of the protocol. One of them is MinIO: designed to allow enterprises to
|
||||
consolidate all of their data on a single, private cloud namespace. Architected using the same
|
||||
principles as the hyperscalers, AIStor delivers performance at scale at a fraction of the cost
|
||||
compared to the public cloud.
|
||||
|
||||
IPng Networks is an Internet Service Provider, but I also dabble in self-hosting things, for
|
||||
example [[PeerTube](https://video.ipng.ch/)], [[Mastodon](https://ublog.tech/)],
|
||||
[[Immich](https://photos.ipng.ch/)], [[Pixelfed](https://pix.ublog.tech/)] and of course
|
||||
[[Hugo](https://ipng.ch/)]. These services all have one thing in common: they tend to use lots of
|
||||
storage when they grow. At IPng Networks, all hypervisors ship with enterprise SAS flash drives,
|
||||
mostly 1.92TB and 3.84TB. Scaling up each of these services, and backing them up safely, can be
|
||||
quite the headache.
|
||||
|
||||
In a [[previous article]({{< ref 2025-05-28-minio-1 >}})], I talked through the install of a
|
||||
redundant set of three Minio machines. In this article, I'll start putting them to good use.
|
||||
|
||||
## Use Case: Restic
|
||||
|
||||
{{< image float="right" src="/assets/minio/restic-logo.png" alt="Restic Logo" width="12em" >}}
|
||||
|
||||
[[Restic](https://restic.org/)] is a modern backup program that can back up your files from multiple
|
||||
host OS, to many different storage types, easily, effectively, securely, verifiably and freely. With
|
||||
a sales pitch like that, what's not to love? Actually, I am a long-time
|
||||
[[BorgBackup](https://www.borgbackup.org/)] user, and I think I'll keep that running. However, for
|
||||
resilience, and because I've heard only good things about Restic, I'll make a second backup of the
|
||||
routers, hypervisors, and virtual machines using Restic.
|
||||
|
||||
Restic can use S3 buckets out of the box (incidentally, so can BorgBackup). To configure it, I use
|
||||
a mixture of environment variables and flags. But first, let me create a bucket for the backups.
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc mb chbtl0/ipng-restic
|
||||
pim@glootie:~$ mc admin user add chbtl0/ <key> <secret>
|
||||
pim@glootie:~$ cat << EOF | tee ipng-restic-access.json
|
||||
{
|
||||
"PolicyName": "ipng-restic-access",
|
||||
"Policy": {
|
||||
"Version": "2012-10-17",
|
||||
"Statement": [
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:DeleteObject", "s3:GetObject", "s3:ListBucket", "s3:PutObject" ],
|
||||
"Resource": [ "arn:aws:s3:::ipng-restic", "arn:aws:s3:::ipng-restic/*" ]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
EOF
|
||||
pim@glootie:~$ mc admin policy create chbtl0/ ipng-restic-access ipng-restic-access.json
|
||||
pim@glootie:~$ mc admin policy attach chbtl0/ ipng-restic-access --user <key>
|
||||
```
|
||||
|
||||
First, I'll create a bucket called `ipng-restic`. Then, I'll create a _user_ with a given secret
|
||||
_key_. To protect the innocent, and my backups, I'll not disclose them. Next, I'll create an
|
||||
IAM policy that allows for Get/List/Put/Delete to be performed on the bucket and its contents, and
|
||||
finally I'll attach this policy to the user I just created.
|
||||
|
||||
To run a Restic backup, I'll first have to create a so-called _repository_. The repository has a
|
||||
location and a password, which Restic uses to encrypt the data. Because I'm using S3, I'll also need
|
||||
to specify the key and secret:
|
||||
|
||||
```
|
||||
root@glootie:~# RESTIC_PASSWORD="changeme"
|
||||
root@glootie:~# RESTIC_REPOSITORY="s3:https://s3.chbtl0.ipng.ch/ipng-restic/$(hostname)/"
|
||||
root@glootie:~# AWS_ACCESS_KEY_ID="<key>"
|
||||
root@glootie:~# AWS_SECRET_ACCESS_KEY="<secret>"
|
||||
root@glootie:~# export RESTIC_PASSWORD RESTIC_REPOSITORY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
|
||||
root@glootie:~# restic init
|
||||
created restic repository 807cf25e85 at s3:https://s3.chbtl0.ipng.ch/ipng-restic/glootie.ipng.ch/
|
||||
```
|
||||
|
||||
Restic prints the fingerprint of the repository it just created. Taking a
look at the MinIO install:
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc stat chbtl0/ipng-restic/glootie.ipng.ch/
|
||||
Name : config
|
||||
Date : 2025-06-01 12:01:43 UTC
|
||||
Size : 155 B
|
||||
ETag : 661a43f72c43080649712e45da14da3a
|
||||
Type : file
|
||||
Metadata :
|
||||
Content-Type: application/octet-stream
|
||||
|
||||
Name : keys/
|
||||
Date : 2025-06-01 12:03:33 UTC
|
||||
Type : folder
|
||||
```
|
||||
|
||||
Cool. Now I'm ready to make my first full backup:
|
||||
|
||||
```
|
||||
root@glootie:~# ARGS="--exclude /proc --exclude /sys --exclude /dev --exclude /run"
|
||||
root@glootie:~# ARGS="$ARGS --exclude-if-present .nobackup"
|
||||
root@glootie:~# restic backup $ARGS /
|
||||
...
|
||||
processed 1141426 files, 131.111 GiB in 15:12
|
||||
snapshot 34476c74 saved
|
||||
```
|
||||
|
||||
Once the backup completes, the Restic authors advise me to also do a check of the repository, and to
|
||||
prune it so that it keeps a finite amount of daily, weekly and monthly backups. My further journey
|
||||
for Restic looks a bit like this:
|
||||
|
||||
```
|
||||
root@glootie:~# restic check
|
||||
using temporary cache in /tmp/restic-check-cache-2712250731
|
||||
create exclusive lock for repository
|
||||
load indexes
|
||||
check all packs
|
||||
check snapshots, trees and blobs
|
||||
[0:04] 100.00% 1 / 1 snapshots
|
||||
|
||||
no errors were found
|
||||
|
||||
root@glootie:~# restic forget --prune --keep-daily 8 --keep-weekly 5 --keep-monthly 6
|
||||
repository 34476c74 opened (version 2, compression level auto)
|
||||
Applying Policy: keep 8 daily, 5 weekly, 6 monthly snapshots
|
||||
keep 1 snapshots:
|
||||
ID Time Host Tags Reasons Paths
|
||||
---------------------------------------------------------------------------------
|
||||
34476c74 2025-06-01 12:18:54 glootie.ipng.ch daily snapshot /
|
||||
weekly snapshot
|
||||
monthly snapshot
|
||||
----------------------------------------------------------------------------------
|
||||
1 snapshots
|
||||
```
|
||||
|
||||
Right on! I proceed to update the Ansible configs at IPng to roll this out against the entire fleet
|
||||
of 152 hosts at IPng Networks. I do this in a little tool called `bitcron`, which I wrote for a
|
||||
previous company I worked at: [[BIT](https://bit.nl)] in the Netherlands. Bitcron allows me to
|
||||
create relatively elegant cronjobs that can raise warnings, errors and fatal issues. If no issues
|
||||
are found, an e-mail can be sent to a bitbucket address, but if warnings or errors are found, a
|
||||
different _monitored_ address will be used. Bitcron is kind of cool, and I wrote it in 2001. Maybe
|
||||
I'll write about it, for old time's sake. I wonder if the folks at BIT still use it?
|
||||
|
||||
## Use Case: NGINX
|
||||
|
||||
{{< image float="right" src="/assets/minio/nginx-logo.png" alt="NGINX Logo" width="11em" >}}
|
||||
|
||||
OK, with the first use case out of the way, I turn my attention to a second - in my opinion more
|
||||
interesting - use case. In the [[previous article]({{< ref 2025-05-28-minio-1 >}})], I created a
|
||||
public bucket called `ipng-web-assets` in which I stored 6.50GB of website data belonging to the
|
||||
IPng website, and some material I posted when I was on my
|
||||
[[Sabbatical](https://sabbatical.ipng.nl/)] last year.
|
||||
|
||||
### MinIO: Bucket Replication
|
||||
|
||||
First things first: redundancy. These web assets are currently pushed to all four nginx machines,
|
||||
and statically served. If I were to replace them with a single S3 bucket, I would create a single
|
||||
point of failure, and that's _no bueno_!
|
||||
|
||||
Off I go, creating a replicated bucket using two MinIO instances (`chbtl0` and `ddln0`):
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc mb ddln0/ipng-web-assets
|
||||
pim@glootie:~$ mc anonymous set download ddln0/ipng-web-assets
|
||||
pim@glootie:~$ mc admin user add ddln0/ <replkey> <replsecret>
|
||||
pim@glootie:~$ cat << EOF | tee ipng-web-assets-access.json
|
||||
{
|
||||
"PolicyName": "ipng-web-assets-access",
|
||||
"Policy": {
|
||||
"Version": "2012-10-17",
|
||||
"Statement": [
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:DeleteObject", "s3:GetObject", "s3:ListBucket", "s3:PutObject" ],
|
||||
"Resource": [ "arn:aws:s3:::ipng-web-assets", "arn:aws:s3:::ipng-web-assets/*" ]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
EOF
|
||||
pim@glootie:~$ mc admin policy create ddln0/ ipng-web-assets-access ipng-web-assets-access.json
|
||||
pim@glootie:~$ mc admin policy attach ddln0/ ipng-web-assets-access --user <replkey>
|
||||
pim@glootie:~$ mc replicate add chbtl0/ipng-web-assets \
|
||||
--remote-bucket https://<key>:<secret>@s3.ddln0.ipng.ch/ipng-web-assets
|
||||
```
|
||||
|
||||
What happens next is pure magic. I've told `chbtl0` that I want it to replicate all existing and
|
||||
future changes to that bucket to its neighbor `ddln0`. Only minutes later, I check the replication
|
||||
status, just to see that it's _already done_:
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc replicate status chbtl0/ipng-web-assets
|
||||
Replication status since 1 hour
|
||||
s3.ddln0.ipng.ch
|
||||
Replicated: 142 objects (6.5 GiB)
|
||||
Queued: ● 0 objects, 0 B (avg: 4 objects, 915 MiB ; max: 0 objects, 0 B)
|
||||
Workers: 0 (avg: 0; max: 0)
|
||||
Transfer Rate: 15 kB/s (avg: 88 MB/s; max: 719 MB/s
|
||||
Latency: 3ms (avg: 3ms; max: 7ms)
|
||||
Link: ● online (total downtime: 0 milliseconds)
|
||||
Errors: 0 in last 1 minute; 0 in last 1hr; 0 since uptime
|
||||
Configured Max Bandwidth (Bps): 644 GB/s Current Bandwidth (Bps): 975 B/s
|
||||
pim@summer:~/src/ipng-web-assets$ mc ls ddln0/ipng-web-assets/
|
||||
[2025-06-01 12:42:22 CEST] 0B ipng.ch/
|
||||
[2025-06-01 12:42:22 CEST] 0B sabbatical.ipng.nl/
|
||||
```
|
||||
|
||||
MinIO has pumped the data from bucket `ipng-web-assets` to the other machine at an average of 88MB/s
|
||||
with a peak throughput of 719MB/s (probably for the larger VM images). And indeed, looking at the
|
||||
remote machine, it is fully caught up after the push, within only a minute or so with a completely
|
||||
fresh copy. Nice!
|
||||
|
||||
### MinIO: Missing directory index
|
||||
|
||||
I take a look at what I just built, on the following URL:
|
||||
* [https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/IMG_0406_0.mp4](https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/IMG_0406_0.mp4)
|
||||
|
||||
That checks out, and I can see the mess that was my room when I first went on sabbatical. By the
|
||||
way, I totally cleaned it up, see
|
||||
[[here](https://sabbatical.ipng.nl/blog/2024/08/01/thursday-basement-done/)] for proof. I can't,
|
||||
however, see the directory listing:
|
||||
|
||||
```
|
||||
pim@glootie:~$ curl https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<Error>
|
||||
<Code>NoSuchKey</Code>
|
||||
<Message>The specified key does not exist.</Message>
|
||||
<Key>sabbatical.ipng.nl/media/vdo/</Key>
|
||||
<BucketName>ipng-web-assets</BucketName>
|
||||
<Resource>/sabbatical.ipng.nl/media/vdo/</Resource>
|
||||
<RequestId>1844EC0CFEBF3C5F</RequestId>
|
||||
<HostId>dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8</HostId>
|
||||
</Error>
|
||||
```
|
||||
|
||||
That's unfortunate, because some of the IPng articles link to a directory full of files, which I'd
|
||||
like to be shown so that my readers can navigate through the directories. Surely I'm not the first
|
||||
to encounter this? And sure enough, I'm not: I find this
[[ref](https://github.com/glowinthedark/index-html-generator)] by user `glowinthedark`, who wrote a
little Python script that generates `index.html` files for their Caddy file server. I'll take me
some of that Python, thank you!
|
||||
|
||||
With the following little script, my setup is complete:
|
||||
|
||||
```
|
||||
pim@glootie:~/src/ipng-web-assets$ cat push.sh
|
||||
#!/usr/bin/env bash
|
||||
|
||||
echo "Generating index.html files ..."
|
||||
for D in */media; do
|
||||
echo "* Directory $D"
|
||||
./genindex.py -r $D
|
||||
done
|
||||
echo "Done (genindex)"
|
||||
echo ""
|
||||
|
||||
echo "Mirroring directoro to S3 Bucket"
|
||||
mc mirror --remove --overwrite . chbtl0/ipng-web-assets/
|
||||
echo "Done (mc mirror)"
|
||||
echo ""
|
||||
pim@glootie:~/src/ipng-web-assets$ ./push.sh
|
||||
```
|
||||
|
||||
Only a few seconds after I run `./push.sh`, the replication is complete and I have two identical
|
||||
copies of my media:
|
||||
|
||||
1. [https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/](https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/index.html)
|
||||
1. [https://ipng-web-assets.s3.ddln0.ipng.ch/ipng.ch/media/](https://ipng-web-assets.s3.ddln0.ipng.ch/ipng.ch/media/index.html)
|
||||
|
||||
|
||||
### NGINX: Proxy to Minio
|
||||
|
||||
Before moving to S3 storage, my NGINX frontends all kept a copy of the IPng media on local NVME
|
||||
disk. That's great for reliability, as each NGINX instance is completely hermetic and standalone.
|
||||
However, it's not great for scaling: the current NGINX instances only have 16GB of local storage,
|
||||
and I'd rather not have my static web asset data outgrow that filesystem. From before, I already had
|
||||
an NGINX config that served the Hugo static data from `/var/www/ipng.ch/` and the `/media`
subdirectory from a different directory, `/var/www/ipng-web-assets/ipng.ch/media`.
|
||||
|
||||
Moving to redundant S3 storage backends is straightforward:
|
||||
|
||||
```
|
||||
upstream minio_ipng {
|
||||
least_conn;
|
||||
server minio0.chbtl0.net.ipng.ch:9000;
|
||||
server minio0.ddln0.net.ipng.ch:9000;
|
||||
}
|
||||
|
||||
server {
|
||||
...
|
||||
location / {
|
||||
root /var/www/ipng.ch/;
|
||||
}
|
||||
|
||||
location /media {
|
||||
proxy_set_header Host $http_host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
|
||||
proxy_connect_timeout 300;
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Connection "";
|
||||
chunked_transfer_encoding off;
|
||||
|
||||
rewrite (.*)/$ $1/index.html;
|
||||
|
||||
proxy_pass http://minio_ipng/ipng-web-assets/ipng.ch/media;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
I want to make note of a few things:
|
||||
1. The `upstream` definition here uses IPng Site Local entrypoints, considering the NGINX servers
|
||||
all have direct MTU=9000 access to the MinIO instances. I'll put both in there, in a
|
||||
round-robin configuration favoring the replica with _least connections_.
|
||||
1. Deeplinking to directory names without the trailing `/index.html` would serve a 404 from the
   backend, so I'll intercept these and rewrite directory requests to always include `/index.html`.
|
||||
1. The upstream endpoint used is _path-based_, that is to say it has the bucket name and website name
|
||||
included. This whole location used to be simply `root /var/www/ipng-web-assets/ipng.ch/media/`
|
||||
so the mental change is quite small.
|
||||
|
||||
### NGINX: Caching
|
||||
|
||||
|
||||
After deploying the S3 upstream on all IPng websites, I can delete the old
|
||||
`/var/www/ipng-web-assets/` directory and reclaim about 7GB of diskspace. This gives me an idea ...
|
||||
|
||||
{{< image width="8em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
On the one hand it's great that I will pull these assets from Minio and all, but at the same time,
|
||||
it's a tad inefficient to retrieve them from, say, Zurich to Amsterdam just to serve them onto the
|
||||
internet again. If at any time something on the IPng website goes viral, it'd be nice to be able to
|
||||
serve them directly from the edge, right?
|
||||
|
||||
A webcache. What could _possibly_ go wrong :)
|
||||
|
||||
NGINX is really really good at caching content. It has a powerful engine to store, scan, revalidate
|
||||
and match any content and upstream headers. It's also very well documented, so I take a look at the
|
||||
proxy module's documentation [[here](https://nginx.org/en/docs/http/ngx_http_proxy_module.html)] and
|
||||
in particular a useful [[blog](https://blog.nginx.org/blog/nginx-caching-guide)] on their website.
|
||||
|
||||
The first thing I need to do is create what is called a _key zone_, which is a region of memory in
|
||||
which URL keys are stored with some metadata. Having a copy of the keys in memory enables NGINX to
|
||||
quickly determine if a request is a HIT or a MISS without having to go to disk, greatly speeding up
|
||||
the check.
|
||||
|
||||
In `/etc/nginx/conf.d/ipng-cache.conf` I add the following NGINX cache:
|
||||
|
||||
```
|
||||
proxy_cache_path /var/www/nginx-cache levels=1:2 keys_zone=ipng_cache:10m max_size=8g
|
||||
inactive=24h use_temp_path=off;
|
||||
```
|
||||
|
||||
With this statement, I'll create a 2-level subdirectory, and allocate 10MB of space, which should
|
||||
hold on the order of 100K entries. The maximum size I'll allow the cache to grow to is 8GB, and I'll
|
||||
mark any object inactive if it's not been referenced for 24 hours. I learn that inactive is
|
||||
different to expired content. If a cache element has expired, but NGINX can't reach the upstream
|
||||
for a new copy, it can be configured to serve an inactive (stale) copy from the cache. That's dope,
|
||||
as it serves as an extra layer of defence in case the network or all available S3 replicas take the
|
||||
day off. I'll ask NGINX to avoid writing objects first to a tmp directory and then moving them into
|
||||
the `/var/www/nginx-cache` directory. These are recommendations I grab from the manual.
|
||||
|
||||
Within the `location` block I configured above, I'm now ready to enable this cache. I'll do that by
|
||||
adding a few include files, which I'll reference in all sites that I want to have make use of this
|
||||
cache:
|
||||
|
||||
First, to enable the cache, I write the following snippet:
|
||||
```
|
||||
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-cache.inc
|
||||
proxy_cache ipng_cache;
|
||||
proxy_ignore_headers Cache-Control;
|
||||
proxy_cache_valid any 1h;
|
||||
proxy_cache_revalidate on;
|
||||
proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
|
||||
proxy_cache_background_update on;
|
||||
```
|
||||
|
||||
Then, I find it useful to emit a few debugging HTTP headers, and at the same time I see that Minio
|
||||
emits a bunch of HTTP headers that may not be safe for me to propagate, so I pen two more snippets:
|
||||
|
||||
```
|
||||
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-strip-minio-headers.inc
|
||||
proxy_hide_header x-minio-deployment-id;
|
||||
proxy_hide_header x-amz-request-id;
|
||||
proxy_hide_header x-amz-id-2;
|
||||
proxy_hide_header x-amz-replication-status;
|
||||
proxy_hide_header x-amz-version-id;
|
||||
|
||||
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-add-upstream-headers.inc
|
||||
add_header X-IPng-Frontend $hostname always;
|
||||
add_header X-IPng-Upstream $upstream_addr always;
|
||||
add_header X-IPng-Upstream-Status $upstream_status always;
|
||||
add_header X-IPng-Cache-Status $upstream_cache_status;
|
||||
```
|
||||
|
||||
With that, I am ready to enable caching of the IPng `/media` location:
|
||||
|
||||
```
|
||||
location /media {
|
||||
...
|
||||
include /etc/nginx/conf.d/ipng-strip-minio-headers.inc;
|
||||
include /etc/nginx/conf.d/ipng-add-upstream-headers.inc;
|
||||
include /etc/nginx/conf.d/ipng-cache.inc;
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
I run the Ansible playbook for the NGINX cluster and take a look at the replica at Coloclue in
|
||||
Amsterdam, called `nginx0.nlams1.ipng.ch`. Notably, it'll have to retrieve the file from a MinIO
|
||||
replica in Zurich (12ms away), so it's expected to take a little while.
|
||||
|
||||
The first attempt:
|
||||
|
||||
```
|
||||
pim@nginx0-nlams1:~$ curl -v -o /dev/null --connect-to ipng.ch:443:localhost:443 \
|
||||
https://ipng.ch/media/vpp-proto/vpp-proto-bookworm.qcow2.lrz
|
||||
...
|
||||
< last-modified: Sun, 01 Jun 2025 12:37:52 GMT
|
||||
< x-ipng-frontend: nginx0-nlams1
|
||||
< x-ipng-cache-status: MISS
|
||||
< x-ipng-upstream: [2001:678:d78:503::b]:9000
|
||||
< x-ipng-upstream-status: 200
|
||||
|
||||
100 711M 100 711M 0 0 26.2M 0 0:00:27 0:00:27 --:--:-- 26.6M
|
||||
```
|
||||
|
||||
|
||||
OK, that's respectable, I've read the file at 26MB/s. Of course I just turned on the cache, so the
|
||||
NGINX fetches the file from Zurich while handing it over to my `curl` here. It notifies me by means
|
||||
of an HTTP header that the cache was a `MISS`, and then which upstream server it contacted to
|
||||
retrieve the object.
|
||||
|
||||
But look at what happens the _second_ time I run the same command:
|
||||
|
||||
```
|
||||
pim@nginx0-nlams1:~$ curl -v -o /dev/null --connect-to ipng.ch:443:localhost:443 \
|
||||
https://ipng.ch/media/vpp-proto/vpp-proto-bookworm.qcow2.lrz
|
||||
< last-modified: Sun, 01 Jun 2025 12:37:52 GMT
|
||||
< x-ipng-frontend: nginx0-nlams1
|
||||
< x-ipng-cache-status: HIT
|
||||
|
||||
100 711M 100 711M 0 0 436M 0 0:00:01 0:00:01 --:--:-- 437M
|
||||
```
|
||||
|
||||
|
||||
Holy moly! First I see the object has the same _Last-Modified_ header, but I now also see that the
|
||||
_Cache-Status_ was a `HIT`, and there is no mention of any upstream server. I do however see the
|
||||
file come in at a whopping 437MB/s which is 16x faster than over the network!! Nice work, NGINX!
|
||||
|
||||
{{< image float="right" src="/assets/minio/rack-2.png" alt="Rack-o-Minio" width="12em" >}}
|
||||
|
||||
# What's Next
|
||||
|
||||
I'm going to deploy the third MinIO replica in Rümlang once the disks arrive. I'll release the
|
||||
~4TB of disk currently used for the fleet's Restic backups, and put that ZFS capacity to other use.
|
||||
Now, creating services like PeerTube, Mastodon, Pixelfed, Loops, NextCloud and what-have-you will
|
||||
become much easier for me. And with the per-bucket replication between MinIO deployments, I also
|
||||
think this is a great way to auto-backup important data. First off, it'll be RS8.4 on the MinIO node
|
||||
itself, and secondly, user data will be copied automatically to a neighboring facility.
|
||||
|
||||
I've convinced myself that S3 storage is a great service to operate, and that MinIO is awesome.
|
||||
375
content/articles/2025-07-12-vpp-evpn-1.md
Normal file
@@ -0,0 +1,375 @@
|
||||
---
|
||||
date: "2025-07-12T08:07:23Z"
|
||||
title: 'VPP and eVPN/VxLAN - Part 1'
|
||||
---
|
||||
|
||||
{{< image width="6em" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I'm
|
||||
the very last person on the planet to learn about something cool. My latest "A-Ha!"-moment was when I was
|
||||
configuring the eVPN fabric for [[Frys-IX](https://frys-ix.net/)], and I wrote up an article about
|
||||
it [[here]({{< ref 2025-04-09-frysix-evpn >}})] back in April.
|
||||
|
||||
I can build the equivalent of Virtual Private Wire Services (VPWS), also called L2VPN or Virtual Leased
|
||||
Lines, and these are straightforward because they typically only have two endpoints. A "regular"
|
||||
VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a
|
||||
look at an article on [[L2 Gymnastics]({{< ref 2022-01-12-vpp-l2 >}})] for that. But the real kicker
|
||||
is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS), also
|
||||
called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And *that* is a whole other
|
||||
level of awesome.
|
||||
|
||||
## Recap: VPP today
|
||||
|
||||
### VPP: VxLAN
|
||||
|
||||
The current VPP VxLAN tunnel plugin does point to point tunnels; that is, they are configured with a
|
||||
source address, destination address, destination port and VNI. As I mentioned, a point to point
|
||||
ethernet transport is configured very easily:
|
||||
|
||||
```
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298 instance 0
|
||||
vpp0# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/0
|
||||
vpp0# set int l2 xconnect HundredGigabitEthernet10/0/0 vxlan_tunnel0
|
||||
vpp0# set int state vxlan_tunnel0 up
|
||||
vpp0# set int state HundredGigabitEthernet10/0/0 up
|
||||
|
||||
vpp1# create vxlan tunnel src 192.0.2.254 dst 192.0.2.1 vni 8298 instance 0
|
||||
vpp1# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/1
|
||||
vpp1# set int l2 xconnect HundredGigabitEthernet10/0/1 vxlan_tunnel0
|
||||
vpp1# set int state vxlan_tunnel0 up
|
||||
vpp1# set int state HundredGigabitEthernet10/0/1 up
|
||||
```
|
||||
|
||||
And with that, `vpp0:Hu10/0/0` is cross connected with `vpp1:Hu10/0/1` and ethernet flows between
|
||||
the two.
|
||||
|
||||
### VPP: Bridge Domains
|
||||
|
||||
Now consider a VPLS with five different routers. It's possible to create a bridge-domain and add
|
||||
some local ports and four other VxLAN tunnels:
|
||||
|
||||
```
|
||||
vpp0# create bridge-domain 8298
|
||||
vpp0# set int l2 bridge HundredGigabitEthernet10/0/1 8298
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.2 vni 8298 instance 0
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.3 vni 8298 instance 1
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.4 vni 8298 instance 2
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.5 vni 8298 instance 3
|
||||
vpp0# set int l2 bridge vxlan_tunnel0 8298
|
||||
vpp0# set int l2 bridge vxlan_tunnel1 8298
|
||||
vpp0# set int l2 bridge vxlan_tunnel2 8298
|
||||
vpp0# set int l2 bridge vxlan_tunnel3 8298
|
||||
```
|
||||
|
||||
To make this work, I will have to replicate this configuration to all other `vpp1`-`vpp4` routers.
|
||||
While it does work, it's really not very practical. When other VPP instances get added to a VPLS,
|
||||
every other router will have to have a new VxLAN tunnel created and added to its local bridge
|
||||
domain. Consider 1000s of VPLS instances on 100s of routers: that would yield ~100'000 VxLAN tunnels
|
||||
on every router, yikes!
|
||||
|
||||
Such a configuration reminds me in a way of iBGP in a large network: the naive approach is to have a
|
||||
full mesh of all routers speaking to all other routers, but that quickly becomes a maintenance
|
||||
headache. The canonical solution for this is to create iBGP _Route Reflectors_ to which every router
|
||||
connects, and their job is to redistribute routing information between the fleet of routers. This
|
||||
turns the iBGP problem from an O(N^2) to an O(N) problem: all 1'000 routers connect to, say, three
|
||||
regional route reflectors for a total of 3'000 BGP connections, which is much better than ~1'000'000
|
||||
BGP connections in the naive approach.
|
||||
|
||||
## Recap: eVPN Moving parts
|
||||
|
||||
The reason why I got so enthusiastic when I was playing with Arista and Nokia's eVPN stuff, is
|
||||
because it requires very little dataplane configuration, and a relatively intuitive controlplane
|
||||
configuration:
|
||||
|
||||
1. **Dataplane**: For each L2 broadcast domain (be it a L2XC or a Bridge Domain), really all I
|
||||
need is a single VxLAN interface with a given VNI, which should be able to send encapsulated
|
||||
ethernet frames to one or more other speakers in the same domain.
|
||||
1. **Controlplane**: I will need to learn MAC addresses locally, and inform some BGP eVPN
|
||||
implementation of who-lives-where. Other VxLAN speakers learn of the MAC addresses I own, and
|
||||
will send me encapsulated ethernet for those addresses.
|
||||
1. **Dataplane**: For unknown layer2 destinations, like _Broadcast_, _Unknown Unicast_, and
|
||||
_Multicast_ (BUM) traffic, I will want to keep track of which other VxLAN speakers these
|
||||
packets should be flooded to. I note that this is not that different from flooding the packets
|
||||
to local interfaces, except here it'd be flooding them to remote VxLAN endpoints.
|
||||
1. **ControlPlane**: Flooding L2 traffic across wide area networks is typically considered icky,
|
||||
so a few tricks might be optionally deployed. Since the controlplane already knows which MAC
|
||||
lives where, it may as well also make note of any local IPv4 ARP and IPv6 neighbor discovery
|
||||
replies and teach its peers which IPv4/IPv6 addresses live where: a distributed neighbor table.
|
||||
|
||||
{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
For the controlplane parts, [[FRRouting](https://frrouting.org/)] has a working implementation for
|
||||
L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [[Bird](https://bird.nic.cz/)], is slowly catching up, and
|
||||
has a few of these controlplane parts already working (mostly MAC-VRF). Commercial vendors like Arista,
|
||||
Nokia, Juniper, Cisco are ready to go. If we want VPP to inter-operate, we may need to make a few
|
||||
changes.
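
To make the controlplane half a bit more tangible: the BGP side of eVPN is pleasantly small. A
minimal FRRouting sketch, assuming iBGP in AS 65000 towards a route reflector at 192.0.2.254 (note
that FRR's `advertise-all-vni` today learns its VNIs from the Linux kernel, which is exactly the
integration gap a VPP dataplane would need to fill):

```
router bgp 65000
 neighbor 192.0.2.254 remote-as 65000
 !
 address-family l2vpn evpn
  neighbor 192.0.2.254 activate
  advertise-all-vni
 exit-address-family
```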
|
||||
|
||||
## VPP: Changes needed
|
||||
|
||||
### Dynamic VxLAN
|
||||
|
||||
I propose two changes to the VxLAN plugin, or perhaps a new plugin that changes the behavior, so that
|
||||
we don't have to break any performance or functional promises to existing users. This new VxLAN
|
||||
interface behavior changes in the following ways:
|
||||
|
||||
1. Each VxLAN interface has a local L2FIB attached to it, the keys are MAC address and the
|
||||
values are remote VTEPs. In its simplest form, the values would be just IPv4 or IPv6 addresses,
|
||||
because I can re-use the VNI and port information from the tunnel definition itself.
|
||||
|
||||
1. Each VxLAN interface has a local flood-list attached to it. This list contains remote VTEPs
|
||||
that I am supposed to send 'flood' packets to. Similar to the Bridge Domain, when packets are marked
|
||||
for flooding, I will need to prepare and replicate them, sending them to each VTEP.
|
||||
|
||||
|
||||
A set of APIs will be needed to manipulate these:
|
||||
* ***Interface***: I will need to have an interface create, delete and list call, which will
|
||||
be able to maintain the interfaces, their metadata like source address, source/destination port,
|
||||
VNI and such.
|
||||
* ***L2FIB***: I will need to add, replace, delete, and list which MAC addresses go where.
|
||||
With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the
|
||||
dst_addr can be written into the packet.
|
||||
* ***Flooding***: For those packets that are not unicast (BUM), I will need to be able to add,
|
||||
remove and list which VTEPs should receive this packet.
|
||||
|
||||
It would be pretty dope if the configuration looked something like this:
|
||||
```
|
||||
vpp# create evpn-vxlan src <v46address> dst-port <port> vni <vni> instance <id>
|
||||
vpp# evpn-vxlan l2fib <iface> mac <mac> dst <v46address> [del]
|
||||
vpp# evpn-vxlan flood <iface> dst <v46address> [del]
|
||||
```
|
||||
|
||||
The VxLAN underlay transport can be either IPv4 or IPv6. Of course manipulating L2FIB or Flood
|
||||
destinations must match the address family of an interface of type evpn-vxlan. A practical example
|
||||
might be:
|
||||
|
||||
```
|
||||
vpp# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 6
|
||||
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2
|
||||
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3
|
||||
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::2
|
||||
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::3
|
||||
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::4
|
||||
```
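
For the API side, here is a purely hypothetical sketch of what the L2FIB and flood-list messages
could look like in VPP's `.api` IDL. Nothing like this exists upstream today; the message names and
fields are mine, and the real thing would import the usual interface and IP type definitions:

```
/* Hypothetical sketch only, not an existing API */
autoreply define evpn_vxlan_l2fib_add_del
{
  u32 client_index;
  u32 context;
  bool is_add;
  vl_api_interface_index_t sw_if_index;
  vl_api_mac_address_t mac;
  vl_api_address_t dst;
};

autoreply define evpn_vxlan_flood_add_del
{
  u32 client_index;
  u32 context;
  bool is_add;
  vl_api_interface_index_t sw_if_index;
  vl_api_address_t dst;
};
```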
|
||||
|
||||
By the way, while this _could_ be a new plugin, it could also just be added to the existing VxLAN
|
||||
plugin. One way in which I might do this when creating a normal vxlan tunnel is to allow for its
|
||||
destination address to be either 0.0.0.0 for IPv4 or :: for IPv6. That would signal 'dynamic'
|
||||
tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN packet by
|
||||
the time it takes to call `ip46_address_is_zero()`, which is only a handful of clocks.
|
||||
|
||||
### Bridge Domain
|
||||
|
||||
{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
It's important to understand that L2 learning is **required** for eVPN to function. Each router
|
||||
needs to be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This
|
||||
rules out the simple case of L2XC because there, no learning is performed. The corollary is that a
|
||||
bridge-domain is required for any form of eVPN.
|
||||
|
||||
The L2 code in VPP already does most of what I'd need. It maintains an L2FIB in `vnet/l2/l2_fib.c`,
|
||||
which is keyed by bridge-id and MAC address, and its values are a 64 bit structure that points
|
||||
essentially to a `sw_if_index` output interface. The L2FIB of the eVPN needs a bit more information
|
||||
though, notably an `ip46address` struct to know which VTEP to send to. It's tempting to add this
|
||||
extra data to the bridge domain code. I would recommend against it, because other implementations,
|
||||
for example MPLS, GENEVE or Carrier Pigeon IP may need more than just the destination address. Even
|
||||
the VxLAN implementation I'm thinking about might want to be able to override other things like the
|
||||
destination port for a given VTEP, or even the VNI. Putting all of this stuff in the bridge-domain
|
||||
code will just clutter it, for all users, not just those users who might want eVPN.
|
||||
|
||||
Similarly, one might argue it is tempting to re-use/extend the behavior in `vnet/l2/l2_flood.c`,
|
||||
because if it's already replicating BUM traffic, why not replicate it many times over the flood list
|
||||
for any member interface that happens to be a dynamic VxLAN interface? This would be a bad idea
|
||||
for a few reasons. Firstly, it is not guaranteed that the VxLAN plugin is loaded, and in
|
||||
doing this, I would leak internal details of VxLAN into the bridge-domain code. Secondly, the
|
||||
`l2_flood.c` code would potentially get messy if other types were added (like the MPLS and GENEVE
|
||||
above).
|
||||
|
||||
A reasonable request is to mark such BUM frames once in the existing L2 code and when handing the
|
||||
replicated packet into the VxLAN node, to see the `is_bum` marker and once again replicate -- in the
|
||||
vxlan plugin -- these packets to the VTEPs in our local flood-list. Although a bit more work, this
|
||||
approach only requires a tiny change in the `l2_flood.c` code (the marking), and will keep
|
||||
all the logic tucked away where it is relevant, derisking the VPP vnet codebase.
|
||||
|
||||
Fundamentally, I think the cleanest design is to keep the dynamic VxLAN interface fully
|
||||
self-contained, and it would therefore maintain its own L2FIB and Flooding logic. The only thing I
|
||||
would add to the L2 codebase is some form of BUM marker to allow for efficient flooding.
|
||||
|
||||
### Control Plane
|
||||
|
||||
There are a few things the control plane has to do. Some external agent, like FRR or Bird, will be
|
||||
receiving a few types of eVPN messages. The ones I'm interested in are:
|
||||
|
||||
* ***Type 2***: MAC/IP Advertisement Route
|
||||
- On the way in, these should be fed to the VxLAN L2FIB belonging to the bridge-domain.
|
||||
- On the way out, learned addresses should be advertised to peers.
|
||||
- Regarding IPv4/IPv6 addresses, that is the ARP / ND tables: we can talk about those later.
|
||||
* ***Type 3***: Inclusive Multicast Ethernet Tag Route
|
||||
- On the way in, these will populate the VxLAN Flood list belonging to the bridge-domain
|
||||
- On the way out, each bridge-domain should advertise itself as IMET to peers.
|
||||
* ***Type 5***: IP Prefix Route
|
||||
- Similar to IP information in Type 2, we can talk about those later once L3VPN/eVPN is
|
||||
needed.
|
||||
|
||||
The 'on the way in' stuff can be easily done with my proposed APIs in the Dynamic VxLAN (or a new
|
||||
eVPN VxLAN) plugin. Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is
|
||||
concerned. It's just that the controlplane implementation needs to somehow _feed_ the API, so an
|
||||
external program may be needed, or alternatively the Linux Control Plane netlink plugin might be used
|
||||
to consume this information.
|
||||
|
||||
The 'on the way out' stuff is a bit trickier. I will need to listen to creation of new broadcast
|
||||
domains and associate them with the right IMET announcements, and for each MAC address learned, pick
|
||||
them up and advertise them into eVPN. Later, if ever ARP and ND proxying becomes important, I'll
|
||||
have to revisit the bridge-domain feature to do IPv4 ARP and IPv6 Neighbor Discovery, and replace it
|
||||
with some code that populates the IPv4/IPv6 parts of the Type 2 messages on the way out, and
|
||||
similarly on the way in, populates an L3 neighbor cache for the bridge domain, so ARP and ND replies
|
||||
can be synthesized based on what we've learned in eVPN.
|
||||
|
||||
# Demonstration
|
||||
|
||||
### VPP: Current VxLAN
|
||||
|
||||
I'll build a small demo environment on Summer to show how the interaction of VxLAN and Bridge
|
||||
Domain works today:
|
||||
|
||||
```
|
||||
vpp# create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24
|
||||
vpp# set int state tap0 up
|
||||
vpp# set int ip address tap0 192.0.2.1/24
|
||||
vpp# set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static
|
||||
vpp# set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static
|
||||
vpp# set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static
|
||||
|
||||
vpp# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298
|
||||
vpp# set int state vxlan_tunnel0 up
|
||||
|
||||
vpp# create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82
|
||||
vpp# set int state tap1 up
|
||||
|
||||
vpp# create bridge-domain 8298
|
||||
vpp# set int l2 bridge tap1 8298
|
||||
vpp# set int l2 bridge vxlan_tunnel0 8298
|
||||
```
|
||||
|
||||
I've created a tap device called `dummy0` and given it an IPv4 address. Normally, I would use some
|
||||
DPDK or RDMA interface like `TenGigabitEthernet10/0/0`. Then I'll populate some static ARP entries.
|
||||
Again, normally this would just be 'use normal routing'. However, for the purposes of this
|
||||
demonstration, it helps to use a TAP device, as any packets I make VPP send to 192.0.2.254 and
|
||||
so on can be captured with `tcpdump` in Linux, in addition to `trace add` in VPP.
|
||||
|
||||
Then, I create a VxLAN tunnel with a default destination of 192.0.2.254 and the given VNI.
|
||||
Next, I create a TAP interface called `vpptap0` with the given MAC address.
|
||||
Finally, I bind these two interfaces together in a bridge-domain.
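
As an aside, besides `tcpdump` on the Linux side, VPP's own packet tracer shows the same packets as
they traverse the graph. Since TAP interfaces are serviced by the `virtio-input` node, something
like this should capture the frames I am about to send:

```
vpp# trace add virtio-input 10
vpp# show trace
vpp# clear trace
```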
|
||||
|
||||
I proceed to write a small ScaPY program:
|
||||
|
||||
```python
|
||||
#!/usr/bin/env python3
|
||||
|
||||
from scapy.all import Ether, IP, UDP, Raw, sendp
|
||||
|
||||
pkt = Ether(dst="01:02:03:04:05:02", src="02:fe:64:dc:1b:82", type=0x0800)
|
||||
/ IP(src="192.168.1.1", dst="192.168.1.2")
|
||||
/ UDP(sport=8298, dport=7) / Raw(load=b"ping")
|
||||
print(pkt)
|
||||
sendp(pkt, iface="vpptap0")
|
||||
|
||||
pkt = Ether(dst="01:02:03:04:05:03", src="02:fe:64:dc:1b:82", type=0x0800)
|
||||
/ IP(src="192.168.1.1", dst="192.168.1.3")
|
||||
/ UDP(sport=8298, dport=7) / Raw(load=b"ping")
|
||||
print(pkt)
|
||||
sendp(pkt, iface="vpptap0")
|
||||
```
|
||||
|
||||
What will happen is, the ScaPY program will emit these frames into device `vpptap0` which is in
|
||||
bridge-domain 8298. The bridge will learn our src MAC `02:fe:64:dc:1b:82`, and look up the dst MAC
|
||||
`01:02:03:04:05:02`, and because there hasn't been traffic yet, it'll flood to all member ports, one
|
||||
of which is the VxLAN tunnel. VxLAN will then encapsulate the packets to the other side of the
|
||||
tunnel.
|
||||
|
||||
```
|
||||
pim@summer:~$ sudo ./vxlan-test.py
|
||||
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.2:echo / Raw
|
||||
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.3:echo / Raw
|
||||
|
||||
pim@summer:~$ sudo tcpdump -evni dummy0
|
||||
10:50:35.310620 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
|
||||
192.0.2.1.6345 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
|
||||
10:50:35.362552 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
|
||||
192.0.2.1.23916 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
|
||||
```
|
||||
|
||||
I want to point out that nothing, so far, is special. All of this works with upstream VPP just fine.
|
||||
I can see two VxLAN encapsulated packets, both destined to `192.0.2.254:4789`. Cool.
|
||||
|
||||
### Dynamic VPP VxLAN
|
||||
|
||||
I wrote a prototype for a Dynamic VxLAN tunnel in [[43433](https://gerrit.fd.io/r/c/vpp/+/43433)].
|
||||
The good news is, this works. The bad news is, I think I'll want to discuss my proposal (this
|
||||
article) with the community before going further down a potential rabbit hole.
|
||||
|
||||
With my gerrit patched in, I can do the following:
|
||||
|
||||
```
|
||||
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:02 dst 192.0.2.2
|
||||
Added VXLAN dynamic destination for 01:02:03:04:05:02 on vxlan_tunnel0 dst 192.0.2.2
|
||||
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:03 dst 192.0.2.3
|
||||
Added VXLAN dynamic destination for 01:02:03:04:05:03 on vxlan_tunnel0 dst 192.0.2.3
|
||||
|
||||
vpp# show vxlan l2fib
|
||||
VXLAN Dynamic L2FIB entries:
|
||||
MAC Interface Destination Port VNI
|
||||
01:02:03:04:05:02 vxlan_tunnel0 192.0.2.2 4789 8298
|
||||
01:02:03:04:05:03 vxlan_tunnel0 192.0.2.3 4789 8298
|
||||
Dynamic L2FIB entries: 2
|
||||
```
|
||||
|
||||
I've instructed the VxLAN tunnel to change the tunnel destination based on the destination MAC.
|
||||
|
||||
|
||||
I run the script and tcpdump again:
|
||||
|
||||
```
|
||||
pim@summer:~$ sudo tcpdump -evni dummy0
|
||||
11:16:53.834619 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3945 (->3997)!)
|
||||
192.0.2.1.6345 > 192.0.2.2.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
|
||||
11:16:53.882554 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3944 (->3996)!)
|
||||
192.0.2.1.23916 > 192.0.2.3.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
|
||||
```
|
||||
|
||||
Two important notes. Firstly, this works! For the MAC address ending in `:02`, the packet is sent to
|
||||
`192.0.2.2` instead of the default of `192.0.2.254`. Same for the `:03` MAC which now goes to
|
||||
`192.0.2.3`. Nice! But secondly, the IPv4 header of the VxLAN packets was changed, so there needs to
|
||||
be a call to `ip4_header_checksum()` inserted somewhere. That's an easy fix.
|
||||
|
||||
# What's next
|
||||
|
||||
I want to discuss a few things, perhaps at an upcoming VPP Community meeting. Notably:
|
||||
1. Is the VPP Developer community supportive of adding eVPN support? Does anybody want to help
|
||||
write it with me?
|
||||
1. Is changing the existing VxLAN plugin appropriate, or should I make a new plugin which adds
|
||||
dynamic endpoints, L2FIB and Flood lists for BUM traffic?
|
||||
1. Is it acceptable for me to add a BUM marker in `l2_flood.c` so that I can reuse all the logic
|
||||
from bridge-domain flooding as I extend to also do VTEP flooding?
|
||||
1. (perhaps later) VxLAN is the canonical underlay, but is there an appetite to extend also to,
|
||||
say, GENEVE or MPLS?
|
||||
1. (perhaps later) What's a good way to tie in a controlplane like FRRouting or Bird2 into the
|
||||
dataplane (perhaps using a sidecar controller, or perhaps using Linux CP Netlink messages)?
|
||||
|
||||
701
content/articles/2025-07-26-ctlog-1.md
Normal file
@@ -0,0 +1,701 @@
|
||||
---
|
||||
date: "2025-07-26T22:07:23Z"
|
||||
title: 'Certificate Transparency - Part 1 - TesseraCT'
|
||||
aliases:
|
||||
- /s/articles/2025/07/26/certificate-transparency-part-1/
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)]. As the
|
||||
name suggests, it was a form of _digital notary_, and they were in the business of issuing security
|
||||
certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and
|
||||
subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for
|
||||
man-in-the-middle attacks on Iranian Gmail users. Not cool.
|
||||
|
||||
Google launched a project called **Certificate Transparency**, because it was becoming increasingly clear
|
||||
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
|
||||
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
|
||||
the Web Public Key Infrastructure. It led to the creation of this ambitious
|
||||
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
|
||||
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
|
||||
and _TLS_ (Transport Layer Security).
|
||||
|
||||
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
|
||||
describes an experimental protocol for publicly logging the existence of Transport Layer Security
|
||||
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
|
||||
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
|
||||
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
|
||||
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
|
||||
the logs.
|
||||
|
||||
This series explores and documents how IPng Networks will be running two Static CT _Logs_ with two
|
||||
different implementations. One will be [[Sunlight](https://sunlight.dev/)], and the other will be
|
||||
[[TesseraCT](https://github.com/transparency-dev/tesseract)].
|
||||
|
||||
## Static Certificate Transparency
|
||||
|
||||
In this context, _Logs_ are network services that implement the protocol operations for submissions
|
||||
and queries that are defined in a specification that builds on the previous RFC. A few years ago,
|
||||
my buddy Antonis asked me if I would be willing to run a log, but operationally they were very
|
||||
complex and expensive to run. However, over the years, the concept of _Static Logs_ has put running one
|
||||
in reach. This [[Static CT API](https://github.com/C2SP/C2SP/blob/main/static-ct-api.md)] defines a
|
||||
read-path HTTP static asset hierarchy (for monitoring) to be implemented alongside the write-path
|
||||
RFC 6962 endpoints (for submission).
|
||||
|
||||
Aside from the different read endpoints, a log that implements the Static API is a regular CT log
|
||||
that can work alongside RFC 6962 logs and that fulfills the same purpose. In particular, it requires
|
||||
no modification to submitters and TLS clients.
|
||||
|
||||
If you only read one document about Static CT, read Filippo Valsorda's excellent
|
||||
[[paper](https://filippo.io/a-different-CT-log)]. It describes a radically cheaper and easier to
|
||||
operate [[Certificate Transparency](https://certificate.transparency.dev/)] log that is backed by a
|
||||
consistent object storage, and can scale to 30x the current issuance rate for 2-10% of the costs
|
||||
with no merge delay.
|
||||
|
||||
## Scalable, Cheap, Reliable: choose two
|
||||
|
||||
{{< image width="18em" float="right" src="/assets/ctlog/MPLS Backbone - CTLog.svg" alt="ctlog at ipng" >}}
|
||||
|
||||
In the diagram, I've drawn an overview of IPng's network. In {{< boldcolor color="red" >}}red{{<
|
||||
/boldcolor >}}, a European backbone network is provided by a [[BGP Free Core
|
||||
network]({{< ref 2022-12-09-oem-switch-2 >}})]. It operates a private IPv4, IPv6, and MPLS network, called
|
||||
_IPng Site Local_, which is not connected to the internet. On top of that, IPng offers L2 and L3
|
||||
services, for example using [[VPP]({{< ref 2021-02-27-network >}})].
|
||||
|
||||
In {{< boldcolor color="lightgreen" >}}green{{< /boldcolor >}} I built a cluster of replicated
|
||||
NGINX frontends. They connect into _IPng Site Local_ and can reach all hypervisors, VMs, and storage
|
||||
systems. They also connect to the Internet with a single IPv4 and IPv6 address. One might say that
|
||||
SSL is _added and removed here :-)_ [[ref](/assets/ctlog/nsa_slide.jpg)].
|
||||
|
||||
Then in {{< boldcolor color="orange" >}}orange{{< /boldcolor >}} I built a set of [[MinIO]({{< ref
|
||||
2025-05-28-minio-1 >}})] S3 storage pools. Amongst others, I serve the static content from the IPng
|
||||
website from these pools, providing fancy redundancy and caching. I wrote about its design in [[this
|
||||
article]({{< ref 2025-06-01-minio-2 >}})].
|
||||
|
||||
Finally, I turn my attention to the {{< boldcolor color="blue" >}}blue{{< /boldcolor >}} which is
|
||||
two hypervisors, one run by [[IPng](https://ipng.ch/)] and the other by [[Massar](https://massars.net/)]. Each
|
||||
of them will be running one of the _Log_ implementations. IPng provides two large ZFS storage tanks
|
||||
for offsite backup, in case a hypervisor decides to check out, and daily backups to an S3 bucket
|
||||
using Restic.
|
||||
|
||||
Having explained all of this, I am well aware that end to end reliability will be coming from the
|
||||
fact that there are many independent _Log_ operators, and folks wanting to validate certificates can
|
||||
simply monitor many. If there is a gap in coverage, say due to any given _Log_'s downtime, this will
|
||||
not necessarily be problematic. It does mean that I may have to suppress the SRE in me...
|
||||
|
||||
## MinIO
|
||||
|
||||
My first instinct is to leverage the distributed storage IPng has, but as I'll show in the rest of
|
||||
this article, maybe a simpler, more elegant design could be superior, precisely because individual
|
||||
log reliability is not _as important_ as having many available log _instances_ to choose from.
|
||||
|
||||
From operators in the field I understand that the world-wide generation of certificates is roughly
|
||||
17M/day, which amounts to some 200-250qps of writes. Antonis explains that certs with a validity
|
||||
of 180 days or less will need two CT log entries, while certs with a validity of more than 180d will
|
||||
need three CT log entries. So the write rate is roughly 2.2x that, as an upper bound.
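
Doing the napkin math on those numbers:

```
17'000'000 certs/day / 86'400 sec/day  ~ 197 writes/sec
197 writes/sec * 2.2 log entries/cert  ~ 430 log writes/sec, as an upper bound
```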
|
||||
|
||||
My first thought is to see how fast my open source S3 machines can go, really. I'm curious also as
|
||||
to the difference between SSD and spinning disks.
|
||||
|
||||
I boot two Dell R630s in the Lab. These machines have two Xeon E5-2640 v4 CPUs for a total of 20
|
||||
cores and 40 threads, and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I
|
||||
place 6pcs 1.2TB SAS3 disks (HPE part number EG1200JEHMC), and in the second machine I place 6pcs
|
||||
of 1.92TB enterprise storage (Samsung part number P1633N19).
|
||||
|
||||
I spin up a 6-device MinIO cluster on both and take them out for a spin using [[S3
|
||||
Benchmark](https://github.com/wasabi-tech/s3-benchmark.git)] from Wasabi Tech.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/s3-benchmark$ for dev in disk ssd; do \
|
||||
for t in 1 8 32; do \
|
||||
for z in 4M 1M 8k 4k; do \
|
||||
./s3-benchmark -a $KEY -s $SECRET -u http://minio-$dev:9000 -t $t -z $z \
|
||||
| tee -a minio-results.txt; \
|
||||
done; \
|
||||
done; \
|
||||
done
|
||||
```
|
||||
|
||||
The loadtest above does a bunch of runs with varying parameters. First it tries to read and write
|
||||
object sizes of 4MB, 1MB, 8kB and 4kB respectively. Then it tries to do this with either 1 thread, 8
|
||||
threads or 32 threads. Finally it tests both the disk-based variant as well as the SSD based one.
|
||||
The loadtest runs from a third machine, so that the Dell R630 disk tanks can stay completely
|
||||
dedicated to their task of running MinIO.
|
||||
|
||||
{{< image width="100%" src="/assets/ctlog/minio_8kb_performance.png" alt="MinIO 8kb disk vs SSD" >}}
|
||||
|
||||
The left-hand side graph feels pretty natural to me. With one thread, uploading 8kB objects will
|
||||
quickly hit the IOPS rate of the disks, each of which has to participate in the write due to EC:3
|
||||
encoding when using six disks, and it tops out at ~56 PUT/s. The single thread hitting SSDs will not
|
||||
hit that limit, and has ~371 PUT/s which I found a bit underwhelming. But, when performing the
|
||||
loadtest with either 8 or 32 write threads, the hard disks become only marginally faster (topping
|
||||
out at 240 PUT/s), while the SSDs really start to shine, with 3850 PUT/s. Pretty good performance.
|
||||
|
||||
On the read-side, I am pleasantly surprised that there's not really that much of a difference
|
||||
between disks and SSDs. This is likely because the host filesystem cache is playing a large role, so
|
||||
the 1-thread performance is equivalent (765 GET/s for disks, 677 GET/s for SSDs), and the 32-thread
|
||||
performance is also equivalent (at 7624 GET/s for disks with 7261 GET/s for SSDs). I do wonder why
|
||||
the hard disks consistently outperform the SSDs with all the other variables (OS, MinIO version,
|
||||
hardware) the same.
|
||||
|
||||
## Sidequest: SeaweedFS
|
||||
|
||||
Something that has long caught my attention is the way in which
|
||||
[[SeaweedFS](https://github.com/seaweedfs/seaweedfs)] approaches blob storage. Many operators have
|
||||
great success with many small file writes in SeaweedFS compared to MinIO and even AWS S3 storage.
|
||||
This is because writes with SeaweedFS are not broken into erasure-sets, which would require every disk
|
||||
to write a small part or checksum of the data, but rather files are replicated within the cluster in
|
||||
their entirety on different disks, racks or datacenters. I won't bore you with the details of
|
||||
SeaweedFS but I'll tack on a docker [[compose file](/assets/ctlog/seaweedfs.docker-compose.yml)]
|
||||
that I used at the end of this article, if you're curious.
|
||||
|
||||
{{< image width="100%" src="/assets/ctlog/size_comparison_8t.png" alt="MinIO vs SeaWeedFS" >}}
|
||||
|
||||
In the write-path, SeaweedFS dominates in all cases, due to its different way of achieving durable
|
||||
storage (per-file replication in SeaweedFS versus all-disk erasure-sets in MinIO):
|
||||
* 4k: 3,384 ops/sec vs MinIO's 111 ops/sec (30x faster!)
|
||||
* 8k: 3,332 ops/sec vs MinIO's 111 ops/sec (30x faster!)
|
||||
* 1M: 383 ops/sec vs MinIO's 44 ops/sec (9x faster)
|
||||
* 4M: 104 ops/sec vs MinIO's 32 ops/sec (4x faster)
|
||||
|
||||
For the read-path, in GET operations MinIO is better at small objects, and really dominates the
|
||||
large objects:
|
||||
* 4k: 7,411 ops/sec vs SeaweedFS 5,014 ops/sec
|
||||
* 8k: 7,666 ops/sec vs SeaweedFS 5,165 ops/sec
|
||||
* 1M: 5,466 ops/sec vs SeaweedFS 2,212 ops/sec
|
||||
* 4M: 3,084 ops/sec vs SeaweedFS 646 ops/sec
|
||||
|
||||
This makes me draw an interesting conclusion: seeing as CT Logs are read/write heavy (every couple
|
||||
of seconds, the Merkle tree is recomputed which is reasonably disk-intensive), SeaweedFS might be a
|
||||
slightly better choice. IPng Networks has three MinIO deployments, but no SeaweedFS deployments. Yet.
|
||||
|
||||
# Tessera
|
||||
|
||||
[[Tessera](https://github.com/transparency-dev/tessera.git)] is a Go library for building tile-based
|
||||
transparency logs (tlogs) [[ref](https://github.com/C2SP/C2SP/blob/main/tlog-tiles.md)]. It is the
|
||||
logical successor to the approach that Google took when building and operating _Logs_ using its
|
||||
predecessor called [[Trillian](https://github.com/google/trillian)]. The implementation and its APIs
|
||||
bake-in current best-practices based on the lessons learned over the past decade of building and
|
||||
operating transparency logs in production environments and at scale.
|
||||
|
||||
Tessera was introduced at the Transparency.Dev summit in October 2024. I first watched Al and Martin
|
||||
[[introduce](https://www.youtube.com/watch?v=9j_8FbQ9qSc)] it at last year's summit. At a high
|
||||
level, it wraps what used to be a whole kubernetes cluster full of components, into a single library
|
||||
that can be used with Cloud based services, such as AWS S3 with an RDS database, or GCP's GCS
|
||||
storage and Spanner database. However, Google also made it easy to use a regular POSIX filesystem
|
||||
implementation.
|
||||
|
||||
## TesseraCT
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="tesseract logo" >}}
|
||||
|
||||
While Tessera is a library, a CT log implementation comes from its sibling GitHub repository called
|
||||
[[TesseraCT](https://github.com/transparency-dev/tesseract)]. Because it leverages Tessera under the
|
||||
hood, TesseraCT can run on GCP, AWS, POSIX-compliant filesystems, or S3-compatible systems alongside a MySQL
|
||||
database. In order to provide ecosystem agility and to control the growth of CT Log sizes, new CT
|
||||
Logs must be temporally sharded, defining a certificate expiry range denoted in the form of two
|
||||
dates: `[rangeBegin, rangeEnd)`. The certificate expiry range allows a Log to reject otherwise valid
|
||||
logging submissions for certificates that expire before or after this defined range, thus
|
||||
partitioning the set of publicly-trusted certificates that each Log will accept. I will be expected
|
||||
to keep logs for an extended period of time, say 3-5 years.
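
As a purely hypothetical illustration of such temporal sharding (these shard names and dates are
made up for the example, not my actual logs):

```
ctlog.example.ipng.ch/2026h1   accepts certificates expiring in [2026-01-01, 2026-07-01)
ctlog.example.ipng.ch/2026h2   accepts certificates expiring in [2026-07-01, 2027-01-01)
```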
|
||||
|
||||
It's time for me to figure out what this TesseraCT thing can do .. are you ready? Let's go!
|
||||
|
||||
### TesseraCT: S3 and SQL
|
||||
|
||||
TesseraCT comes with a few so-called _personalities_. Those are an implementation of the underlying
|
||||
storage infrastructure in an opinionated way. The first personality I look at is the `aws` one in
|
||||
`cmd/tesseract/aws`. I notice that this personality does make hard assumptions about the use of AWS
|
||||
which is unfortunate, as the documentation says '.. or self-hosted S3 and MySQL database'. Specifically,
|
||||
the `aws` personality assumes the AWS Secrets Manager in order to fetch its signing key. Before I
|
||||
can be successful, I need to detangle that.
|
||||
|
||||
#### TesseraCT: AWS and Local Signer
|
||||
|
||||
First, I change `cmd/tesseract/aws/main.go` to add two new flags:
|
||||
|
||||
* ***-signer_public_key_file***: a path to the public key for checkpoints and SCT signer
|
||||
* ***-signer_private_key_file***: a path to the private key for checkpoints and SCT signer
|
||||
|
||||
I then change the program to assume that, if these flags are both set, the user will want a
|
||||
_NewLocalSigner_ instead of a _NewSecretsManagerSigner_. Now all I have to do is implement the
|
||||
signer interface in a new file `local_signer.go`. There, the function _NewLocalSigner()_ will read the
|
||||
public and private PEM from file, decode them, and create an _ECDSAWithSHA256Signer_ with them. A
|
||||
simple example to show what I mean:
|
||||
|
||||
```
|
||||
// NewLocalSigner creates a new signer that uses the ECDSA P-256 key pair from
|
||||
// local disk files for signing digests.
|
||||
func NewLocalSigner(publicKeyFile, privateKeyFile string) (*ECDSAWithSHA256Signer, error) {
|
||||
// Read public key
|
||||
publicKeyPEM, err := os.ReadFile(publicKeyFile)
|
||||
publicPemBlock, rest := pem.Decode(publicKeyPEM)
|
||||
|
||||
var publicKey crypto.PublicKey
|
||||
publicKey, err = x509.ParsePKIXPublicKey(publicPemBlock.Bytes)
|
||||
ecdsaPublicKey, ok := publicKey.(*ecdsa.PublicKey)
|
||||
|
||||
// Read private key
|
||||
privateKeyPEM, err := os.ReadFile(privateKeyFile)
|
||||
privatePemBlock, rest := pem.Decode(privateKeyPEM)
|
||||
|
||||
var ecdsaPrivateKey *ecdsa.PrivateKey
|
||||
ecdsaPrivateKey, err = x509.ParseECPrivateKey(privatePemBlock.Bytes)
|
||||
|
||||
// Verify the correctness of the signer key pair
|
||||
if !ecdsaPrivateKey.PublicKey.Equal(ecdsaPublicKey) {
|
||||
return nil, errors.New("signer key pair doesn't match")
|
||||
}
|
||||
|
||||
return &ECDSAWithSHA256Signer{
|
||||
publicKey: ecdsaPublicKey,
|
||||
privateKey: ecdsaPrivateKey,
|
||||
}, nil
|
||||
}
|
||||
```
|
||||
|
||||
In the snippet above I omitted all of the error handling, but the local signer logic itself is
|
||||
hopefully clear. And with that, I am liberated from Amazon's Cloud offering and can run this thing
|
||||
all by myself!
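
For completeness: the struct only becomes useful once it satisfies the signer interface the
personality expects, which for an ECDSA key like this is essentially the standard `crypto.Signer`
contract. A minimal sketch of the two remaining methods (assuming the digest handed in is already a
SHA-256 hash, and adding the `io` import):

```
// Public returns the public half of the key pair, as required by crypto.Signer.
func (s *ECDSAWithSHA256Signer) Public() crypto.PublicKey {
	return s.publicKey
}

// Sign signs an already-hashed digest with the ECDSA P-256 private key.
func (s *ECDSAWithSHA256Signer) Sign(rand io.Reader, digest []byte, _ crypto.SignerOpts) ([]byte, error) {
	return ecdsa.SignASN1(rand, s.privateKey, digest)
}
```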
|
||||
|
||||
#### TesseraCT: Running with S3, MySQL, and Local Signer
|
||||
|
||||
First, I need to create a suitable ECDSA key:
|
||||
```
|
||||
pim@ctlog-test:~$ openssl ecparam -name prime256v1 -genkey -noout -out /tmp/private_key.pem
|
||||
pim@ctlog-test:~$ openssl ec -in /tmp/private_key.pem -pubout -out /tmp/public_key.pem
|
||||
```
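
As a side note for later: the `hammer` loadtester further down wants this public key as a
base64-encoded DER blob rather than a PEM file. Assuming the same key, one way to produce that
string is:

```
pim@ctlog-test:~$ openssl ec -in /tmp/private_key.pem -pubout -outform DER | openssl base64 -A
```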
|
||||
|
||||
Then, I'll install the MySQL server and create the databases:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo apt install default-mysql-server
|
||||
pim@ctlog-test:~$ sudo mysql -u root
|
||||
|
||||
CREATE USER 'tesseract'@'localhost' IDENTIFIED BY '<db_passwd>';
|
||||
CREATE DATABASE tesseract;
|
||||
CREATE DATABASE tesseract_antispam;
|
||||
GRANT ALL PRIVILEGES ON tesseract.* TO 'tesseract'@'localhost';
|
||||
GRANT ALL PRIVILEGES ON tesseract_antispam.* TO 'tesseract'@'localhost';
|
||||
```
|
||||
|
||||
Finally, I use the SSD MinIO lab-machine that I just loadtested to create an S3 bucket.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ mc mb minio-ssd/tesseract-test
|
||||
pim@ctlog-test:~$ cat << EOF > /tmp/minio-access.json
|
||||
{ "Version": "2012-10-17", "Statement": [ {
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ],
|
||||
"Resource": [ "arn:aws:s3:::tesseract-test/*", "arn:aws:s3:::tesseract-test" ]
|
||||
} ]
|
||||
}
|
||||
EOF
|
||||
pim@ctlog-test:~$ mc admin user add minio-ssd <user> <secret>
|
||||
pim@ctlog-test:~$ mc admin policy create minio-ssd tesseract-test-access /tmp/minio-access.json
|
||||
pim@ctlog-test:~$ mc admin policy attach minio-ssd tesseract-test-access --user <user>
|
||||
pim@ctlog-test:~$ mc anonymous set public minio-ssd/tesseract-test
|
||||
```
|
||||
|
||||
{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
After some fiddling, I understand that the AWS software development kit makes some assumptions that
|
||||
you'll be using .. _quelle surprise_ .. AWS services. But you can also use local S3 services by
|
||||
setting a few key environment variables. I had heard of the S3 access and secret key environment
|
||||
variables before, but I now need to also use a different S3 endpoint. That little detour into the
|
||||
codebase only took me .. several hours.
|
||||
|
||||
Armed with that knowledge, I can build and finally start my TesseraCT instance:
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract/cmd/tesseract/aws$ go build -o ~/aws .
|
||||
pim@ctlog-test:~$ export AWS_DEFAULT_REGION="us-east-1"
|
||||
pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID="<user>"
|
||||
pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY="<secret>"
|
||||
pim@ctlog-test:~$ export AWS_ENDPOINT_URL_S3="http://minio-ssd.lab.ipng.ch:9000/"
|
||||
pim@ctlog-test:~$ ./aws --http_endpoint='[::]:6962' \
|
||||
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
|
||||
--bucket=tesseract-test \
|
||||
--db_host=ctlog-test.lab.ipng.ch \
|
||||
--db_user=tesseract \
|
||||
--db_password=<db_passwd> \
|
||||
--db_name=tesseract \
|
||||
--antispam_db_name=tesseract_antispam \
|
||||
--signer_public_key_file=/tmp/public_key.pem \
|
||||
--signer_private_key_file=/tmp/private_key.pem \
|
||||
--roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
|
||||
|
||||
I0727 15:13:04.666056 337461 main.go:128] **** CT HTTP Server Starting ****
|
||||
```
|
||||
|
||||
Hah! I think most of the command line flags and environment variables should make sense, but I was
|
||||
struggling for a while with the `--roots_pem_file` and the `--origin` flags, so I phoned a friend
|
||||
(Al Cutter, Googler extraordinaire and an expert in Tessera/CT). He explained to me that the Log is
|
||||
actually an open endpoint to which anybody might POST data. However, to avoid folks abusing the log
|
||||
infrastructure, each POST is expected to come from one of the certificate authorities listed in the
|
||||
`--roots_pem_file`. OK, that makes sense.
|
||||
|
||||
Then, the `--origin` flag designates how my log calls itself. In the resulting `checkpoint` file it
|
||||
will enumerate a hash of the latest merged and published Merkle tree. In case a server serves
|
||||
multiple logs, it uses the `--origin` flag to make the distinction of which checksum belongs to which log.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ curl http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint
|
||||
ctlog-test.lab.ipng.ch/test-ecdsa
|
||||
0
|
||||
JGPitKWWI0aGuCfC2k1n/p9xdWAYPm5RZPNDXkCEVUU=
|
||||
|
||||
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMCONUBAMARjBEAiA/nc9dig6U//vPg7SoTHjt9bxP5K+x3w4MYKpIRn4ULQIgUY5zijRK8qyuJGvZaItDEmP1gohCt+wI+sESBnhkuqo=
|
||||
```
|
||||
|
||||
When creating the bucket above, I used `mc anonymous set public`, which made the S3 bucket
|
||||
world-readable. I can now execute the whole read-path simply by hitting the S3 service. Check.
|
||||
|
||||
#### TesseraCT: Loadtesting S3/MySQL
|
||||
|
||||
{{< image width="12em" float="right" src="/assets/ctlog/stop-hammer-time.jpg" alt="Stop, hammer time" >}}
|
||||
|
||||
The write path is a server on `[::]:6962`. I should be able to write a log to it, but how? Here's
|
||||
where I am grateful to find a tool in the TesseraCT GitHub repository called `hammer`. This hammer
|
||||
sets up read and write traffic to a Static CT API log to test correctness and performance under
|
||||
load. The traffic is sent according to the [[Static CT API](https://c2sp.org/static-ct-api)] spec.
|
||||
Slick!
|
||||
|
||||
The tool starts a text-based UI (my favorite! also when using the Cisco T-Rex loadtester) in the terminal
|
||||
that shows the current status, logs, and supports increasing/decreasing read and write traffic. This
|
||||
TUI allows for a level of interactivity when probing a new configuration of a log in order to find
|
||||
any cliffs where performance degrades. For real load-testing applications, especially headless runs
|
||||
as part of a CI pipeline, it is recommended to run the tool with `-show_ui=false` in order to disable
|
||||
the UI.
|
||||
|
||||
I'm a bit lost in the somewhat terse
|
||||
[[README.md](https://github.com/transparency-dev/tesseract/tree/main/internal/hammer)], but my buddy
|
||||
Al comes to my rescue and explains the flags to me. First of all, the loadtester wants to hit the
|
||||
same `--origin` that I configured the write-path to accept. In my case this is
|
||||
`ctlog-test.lab.ipng.ch/test-ecdsa`. Then, it needs the public key for that _Log_, which I can find
|
||||
in `/tmp/public_key.pem`. The text there is the _DER_ (Distinguished Encoding Rules), stored as a
|
||||
base64 encoded string. What follows next was the most difficult for me to understand, as I was
|
||||
thinking the hammer would read some log from the internet somewhere and replay it locally. Al
|
||||
explains that actually, the `hammer` tool synthetically creates all of these entries itself, and it
|
||||
regularly reads the `checkpoint` from the `--log_url` place, while it writes its certificates to
|
||||
`--write_log_url`. The last few flags just inform the `hammer` how many read and write ops/sec it
|
||||
should generate, and with that explanation my brain plays _tadaa.wav_ and I am ready to go.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer \
|
||||
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEucHtDWe9GYNicPnuGWbEX8rJg/VnDcXs8z40KdoNidBKy6/ZXw2u+NW1XAUnGpXcZozxufsgOMhijsWb25r7jw== \
|
||||
--log_url=http://tesseract-test.minio-ssd.lab.ipng.ch:9000/ \
|
||||
--write_log_url=http://localhost:6962/ctlog-test.lab.ipng.ch/test-ecdsa/ \
|
||||
--max_read_ops=0 \
|
||||
--num_writers=5000 \
|
||||
--max_write_ops=100
|
||||
```
|
||||
|
||||
{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest1.png" alt="S3/MySQL Loadtest 100qps" >}}
|
||||
|
||||
Cool! It seems that the loadtest is happily chugging along at 100qps. The log is consuming them in
|
||||
the HTTP write-path by accepting POST requests to
|
||||
`/ctlog-test.lab.ipng.ch/test-ecdsa/ct/v1/add-chain`, where hammer is offering them at a rate of
|
||||
100qps, with a configured probability of duplicates set at 10%. What that means is that every now
|
||||
and again, it'll repeat a previous request. The purpose of this is to stress test the so-called
|
||||
`antispam` implementation. When `hammer` sends its requests, it signs them with a certificate that
|
||||
was issued by the CA described in `internal/hammer/testdata/test_root_ca_cert.pem`, which is why
|
||||
TesseraCT accepts them.
|
||||
|
||||
I raise the write load by using the '>' key a few times. I notice things are great at 500qps, which
|
||||
is nice because that's double what we expect. But I start seeing a bit more noise at 600qps.
|
||||
When I raise the write-rate to 1000qps, all hell breaks loose on the logs of the server (and similar
|
||||
logs in the `hammer` loadtester):
|
||||
|
||||
```
|
||||
W0727 15:54:33.419881 348475 handlers.go:168] ctlog-test.lab.ipng.ch/test-ecdsa: AddChain handler error: couldn't store the leaf: failed to fetch entry bundle at index 0: failed to fetch resource: getObject: failed to create reader for object "tile/data/000" in bucket "tesseract-test": operation error S3: GetObject, context deadline exceeded
|
||||
W0727 15:55:02.727962 348475 aws.go:345] GarbageCollect failed: failed to delete one or more objects: failed to delete objects: operation error S3: DeleteObjects, https response error StatusCode: 400, RequestID: 1856202CA3C4B83F, HostID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8, api error MalformedXML: The XML you provided was not well-formed or did not validate against our published schema.
|
||||
E0727 15:55:10.448973 348475 append_lifecycle.go:293] followerStats: follower "AWS antispam" EntriesProcessed(): failed to read follow coordination info: Error 1040: Too many connections
|
||||
```
|
||||
|
||||
I see on the MinIO instance that it's doing about 150/s of GETs and 15/s of PUTs, which is totally
|
||||
reasonable:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ mc admin trace --stats ssd
|
||||
Duration: 6m9s ▰▱▱
|
||||
RX Rate:↑ 34 MiB/m
|
||||
TX Rate:↓ 2.3 GiB/m
|
||||
RPM : 10588.1
|
||||
-------------
|
||||
Call Count RPM Avg Time Min Time Max Time Avg TTFB Max TTFB Avg Size Rate /min
|
||||
s3.GetObject 60558 (92.9%) 9837.2 4.3ms 708µs 48.1ms 3.9ms 47.8ms ↑144B ↓246K ↑1.4M ↓2.3G
|
||||
s3.PutObject 2199 (3.4%) 357.2 5.3ms 2.4ms 32.7ms 5.3ms 32.7ms ↑92K ↑32M
|
||||
s3.DeleteMultipleObjects 1212 (1.9%) 196.9 877µs 290µs 41.1ms 850µs 41.1ms ↑230B ↓369B ↑44K ↓71K
|
||||
s3.ListObjectsV2 1212 (1.9%) 196.9 18.4ms 999µs 52.8ms 18.3ms 52.7ms ↑131B ↓261B ↑25K ↓50K
|
||||
```
|
||||
|
||||
Another nice way to see what makes it through is this oneliner, which reads the `checkpoint` every
|
||||
second, and once it changes, shows the delta in seconds and how many certs were written:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
|
||||
N=$(curl -sS http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E '^[0-9]+$'); \
|
||||
if [ "$N" -eq "$O" ]; then \
|
||||
echo -n .; \
|
||||
else \
|
||||
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
|
||||
fi; \
|
||||
T=$((T+1)); sleep 1; done
|
||||
1012905 .... 5 seconds 2081 certs
|
||||
1014986 .... 5 seconds 2126 certs
|
||||
1017112 .... 5 seconds 1913 certs
|
||||
1019025 .... 5 seconds 2588 certs
|
||||
1021613 .... 5 seconds 2591 certs
|
||||
1024204 .... 5 seconds 2197 certs
|
||||
```
|
||||
|
||||
So I can see that the checkpoint is refreshed every 5 seconds and between 1913 and 2591 certs are
|
||||
written each time. And indeed, at 400/s there are no errors or warnings at all. At this write rate,
|
||||
TesseraCT is using about 2.9 CPUs/s, with MariaDB using 0.3 CPUs/s, but the hammer is using 6.0
|
||||
CPUs/s. Overall, the machine is perfectly happy serving for a few hours under this load test.
|
||||
|
||||
***Conclusion: a write-rate of 400/s should be safe with S3+MySQL***
|
||||
|
||||
### TesseraCT: POSIX
|
||||
|
||||
I have been playing with this idea of having a reliable read-path by having the S3 cluster be
|
||||
redundant, or by replicating the S3 bucket. But Al asks: why not use our experimental POSIX?
|
||||
We discuss two very important benefits, but also two drawbacks:
|
||||
|
||||
* On the plus side:
|
||||
1. There is no need for S3 storage, read/writing to a local ZFS raidz2 pool instead.
|
||||
1. There is no need for MySQL, as the POSIX implementation can use a local badger instance
|
||||
also on the local filesystem.
|
||||
* On the drawbacks:
|
||||
1. There is a SPOF in the read-path, as the single VM must handle both reads and writes. The write-path always
|
||||
has a SPOF on the TesseraCT VM.
|
||||
1. Local storage is more expensive than S3 storage, and can be used only for the purposes of
|
||||
one application (and at best, shared with other VMs on the same hypervisor).
|
||||
|
||||
Come to think of it, this is maybe not such a bad tradeoff. I do kind of like having a single-VM
|
||||
with a single-binary and no other moving parts. It greatly simplifies the architecture, and for the
|
||||
read-path I can (and will) still use multiple upstream NGINX machines in IPng's network.
|
||||
|
||||
I consider myself nerd-sniped, and take a look at the POSIX variant. I have a few SAS3
|
||||
solid state drives (NetApp part number X447_S1633800AMD), which I plug into the `ctlog-test`
|
||||
machine.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo zpool create -o ashift=12 -o autotrim=on ssd-vol0 mirror \
|
||||
/dev/disk/by-id/wwn-0x5002538a0???????
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/tesseract-test
|
||||
pim@ctlog-test:~$ sudo chown pim:pim /ssd-vol0/tesseract-test
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./cmd/experimental/posix --http_endpoint='[::]:6962' \
|
||||
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
|
||||
--private_key=/tmp/private_key.pem \
|
||||
--storage_dir=/ssd-vol0/tesseract-test \
|
||||
--roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
|
||||
badger 2025/07/27 16:29:15 INFO: All 0 tables opened in 0s
|
||||
badger 2025/07/27 16:29:15 INFO: Discard stats nextEmptySlot: 0
|
||||
badger 2025/07/27 16:29:15 INFO: Set nextTxnTs to 0
|
||||
I0727 16:29:15.032845 363156 files.go:502] Initializing directory for POSIX log at "/ssd-vol0/tesseract-test" (this should only happen ONCE per log!)
|
||||
I0727 16:29:15.034101 363156 main.go:97] **** CT HTTP Server Starting ****
|
||||
|
||||
pim@ctlog-test:~/src/tesseract$ cat /ssd-vol0/tesseract-test/checkpoint
|
||||
ctlog-test.lab.ipng.ch/test-ecdsa
|
||||
0
|
||||
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
|
||||
|
||||
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMSgC8BAMARzBFAiBjT5zdkniKlryqlUlx/gLHOtVK26zuWwrc4BlyTVzCWgIhAJ0GIrlrP7YGzRaHjzdB5tnS5rpP3LeOsPbpLateaiFc
|
||||
```
|
||||
|
||||
Alright, I can see the log started and created an empty checkpoint file. Nice!
|
||||
|
||||
Before I can loadtest it, I will need to make the read-path visible. The `hammer` can read
|
||||
a checkpoint from local `file:///` prefixes, but I'll have to serve them over the network eventually
|
||||
anyway, so I create the following NGINX config for it:
|
||||
|
||||
```
|
||||
server {
|
||||
listen 80 default_server backlog=4096;
|
||||
listen [::]:80 default_server backlog=4096;
|
||||
root /ssd-vol0/tesseract-test/;
|
||||
index index.html index.htm index.nginx-debian.html;
|
||||
|
||||
server_name _;
|
||||
|
||||
access_log /var/log/nginx/access.log combined buffer=512k flush=5s;
|
||||
|
||||
location / {
|
||||
try_files $uri $uri/ =404;
|
||||
tcp_nopush on;
|
||||
sendfile on;
|
||||
tcp_nodelay on;
|
||||
keepalive_timeout 65;
|
||||
keepalive_requests 1000;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Just a couple of small thoughts on this configuration. I'm using buffered access logs, to avoid
|
||||
excessive disk writes in the read-path. Then, I'm using kernel `sendfile()` which will instruct the
|
||||
kernel to serve the static objects directly, so that NGINX can move on. Further, I'll allow for a
|
||||
long keepalive in HTTP 1.1, so that future requests can use the same TCP connection, and I'll set
|
||||
the flag `tcp_nodelay` and `tcp_nopush` to just blast the data out without waiting.
|
||||
|
||||
Without much ado:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ curl -sS ctlog-test.lab.ipng.ch/checkpoint
|
||||
ctlog-test.lab.ipng.ch/test-ecdsa
|
||||
0
|
||||
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
|
||||
|
||||
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMTfksBAMASDBGAiEAqADLH0P/SRVloF6G1ezlWG3Exf+sTzPIY5u6VjAKLqACIQCkJO2N0dZQuDHvkbnzL8Hd91oyU41bVqfD3vs5EwUouA==
|
||||
```
|
||||
|
||||
#### TesseraCT: Loadtesting POSIX
|
||||
|
||||
The loadtesting is roughly the same. I start the `hammer` with the same 500qps of write rate, which
|
||||
was roughly where the S3+MySQL variant topped. My checkpoint tracker shows the following:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
|
||||
N=$(curl -sS http://localhost/checkpoint | grep -E '^[0-9]+$'); \
|
||||
if [ "$N" -eq "$O" ]; then \
|
||||
echo -n .; \
|
||||
else \
|
||||
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
|
||||
fi; \
|
||||
T=$((T+1)); sleep 1; done
|
||||
59250 ......... 10 seconds 5244 certs
|
||||
64494 ......... 10 seconds 5000 certs
|
||||
69494 ......... 10 seconds 5000 certs
|
||||
74494 ......... 10 seconds 5000 certs
|
||||
79494 ......... 10 seconds 5256 certs
|
||||
79494 ......... 10 seconds 5256 certs
|
||||
84750 ......... 10 seconds 5244 certs
|
||||
89994 ......... 10 seconds 5256 certs
|
||||
95250 ......... 10 seconds 5000 certs
|
||||
100250 ......... 10 seconds 5000 certs
|
||||
105250 ......... 10 seconds 5000 certs
|
||||
```
|
||||
|
||||
I learn two things. First, the checkpoint interval in this `posix` variant is 10 seconds, compared
|
||||
to the 5 seconds of the `aws` variant I tested before. I dive into the code, because there doesn't
|
||||
seem to be a `--checkpoint_interval` flag. In the `tessera` library, I find
|
||||
`DefaultCheckpointInterval` which is set to 10 seconds. I change it to be 2 seconds instead, and
|
||||
restart the `posix` binary:
|
||||
|
||||
```
|
||||
238250 . 2 seconds 1000 certs
|
||||
239250 . 2 seconds 1000 certs
|
||||
240250 . 2 seconds 1000 certs
|
||||
241250 . 2 seconds 1000 certs
|
||||
242250 . 2 seconds 1000 certs
|
||||
243250 . 2 seconds 1000 certs
|
||||
244250 . 2 seconds 1000 certs
|
||||
```
|
||||
|
||||
{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest2.png" alt="Posix Loadtest 5000qps" >}}
|
||||
|
||||
Very nice! Maybe I can write a few more certs? I restart the `hammer` with 5000/s, which somewhat to my
|
||||
surprise, ends up serving!
|
||||
|
||||
```
|
||||
642608 . 2 seconds 6155 certs
|
||||
648763 . 2 seconds 10256 certs
|
||||
659019 . 2 seconds 9237 certs
|
||||
668256 . 2 seconds 8800 certs
|
||||
677056 . 2 seconds 8729 certs
|
||||
685785 . 2 seconds 8237 certs
|
||||
694022 . 2 seconds 7487 certs
|
||||
701509 . 2 seconds 8572 certs
|
||||
710081 . 2 seconds 7413 certs
|
||||
```
|
||||
|
||||
The throughput is highly variable though, seemingly between 3700/sec and 5100/sec, and I quickly
|
||||
find out that the `hammer` is completely saturating the CPU on the machine, leaving very little room
|
||||
for the `posix` TesseraCT to serve. I'm going to need more machines!
|
||||
|
||||
So I start a `hammer` loadtester on the two now-idle MinIO servers, and run them at about 6000qps
|
||||
**each**, for a total of 12000 certs/sec. And my little `posix` binary is keeping up like a champ:
|
||||
|
||||
```
|
||||
2987169 . 2 seconds 23040 certs
|
||||
3010209 . 2 seconds 23040 certs
|
||||
3033249 . 2 seconds 21760 certs
|
||||
3055009 . 2 seconds 21504 certs
|
||||
3076513 . 2 seconds 23808 certs
|
||||
3100321 . 2 seconds 22528 certs
|
||||
```
|
||||
|
||||
One thing is reasonably clear, the `posix` TesseraCT is CPU bound, not disk bound. The CPU is now
|
||||
running at about 18.5 CPUs/s (with 20 cores), which is pretty much all this Dell has to offer. The
|
||||
NetAPP enterprise solid state drives are not impressed:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ zpool iostat -v ssd-vol0 10 100
|
||||
capacity operations bandwidth
|
||||
pool alloc free read write read write
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 11.4G 733G 0 3.13K 0 117M
|
||||
mirror-0 11.4G 733G 0 3.13K 0 117M
|
||||
wwn-0x5002538a05302930 - - 0 1.04K 0 39.1M
|
||||
wwn-0x5002538a053069f0 - - 0 1.06K 0 39.1M
|
||||
wwn-0x5002538a06313ed0 - - 0 1.02K 0 39.1M
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
|
||||
pim@ctlog-test:~/src/tesseract$ zpool iostat -l ssd-vol0 10
|
||||
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim
|
||||
pool alloc free read write read write read write read write read write read write wait wait
|
||||
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 14.0G 730G 0 1.48K 0 35.4M - 2ms - 535us - 1us - 3ms - 50ms
|
||||
ssd-vol0 14.0G 730G 0 1.12K 0 23.0M - 1ms - 733us - 2us - 1ms - 44ms
|
||||
ssd-vol0 14.1G 730G 0 1.42K 0 45.3M - 508us - 122us - 914ns - 2ms - 41ms
|
||||
ssd-vol0 14.2G 730G 0 678 0 21.0M - 863us - 144us - 2us - 2ms - -
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
OK, that kind of seals the deal for me. The write path needs about 250 certs/sec and I'm hammering
|
||||
now with 12'000 certs/sec, with room to spare. But what about the read path? The cool thing about
|
||||
the static log is that reads are all entirely done by NGINX. The only file that isn't cacheable is
|
||||
the `checkpoint` file which gets updated every two seconds (or ten seconds in the default `tessera`
|
||||
settings).
|
||||
|
||||
So I start yet another `hammer` whose job it is to read back from the static filesystem:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ curl localhost/nginx_status; sleep 60; curl localhost/nginx_status
|
||||
Active connections: 10556
|
||||
server accepts handled requests
|
||||
25302 25302 1492918
|
||||
Reading: 0 Writing: 1 Waiting: 10555
|
||||
Active connections: 7791
|
||||
server accepts handled requests
|
||||
25764 25764 1727631
|
||||
Reading: 0 Writing: 1 Waiting: 7790
|
||||
```
|
||||
|
||||
And I can see that it's keeping up quite nicely. In one minute, it handled (1727631-1492918) or
|
||||
234713 requests, which is a cool 3911 requests/sec. All these read/write hammers are kind of
|
||||
saturating the `ctlog-test` machine though:
|
||||
|
||||
{{< image width="100%" src="/assets/ctlog/ctlog-loadtest3.png" alt="Posix Loadtest 8000qps write, 4000qps read" >}}
|
||||
|
||||
But after a little bit of fiddling, I can assert my conclusion:
|
||||
|
||||
***Conclusion: a write-rate of 8'000/s alongside a read-rate of 4'000/s should be safe with POSIX***
|
||||
|
||||
## What's Next
|
||||
|
||||
I am going to offer such a machine in production together with Antonis Chariton, and Jeroen Massar.
|
||||
I plan to do a few additional things:
|
||||
|
||||
* Test Sunlight as well on the same hardware. It would be nice to see a comparison between write
|
||||
rates of the two implementations.
|
||||
* Work with Al Cutter and the Transparency Dev team to close a few small gaps (like the
|
||||
`local_signer.go` and some Prometheus monitoring of the `posix` binary.
|
||||
* Install and launch both under `*.ct.ipng.ch`, which in itself deserves its own report, showing
|
||||
how I intend to do log cycling and care/feeding, as well as report on the real production
|
||||
experience running these CT Logs.
|
||||
666
content/articles/2025-08-10-ctlog-2.md
Normal file
666
content/articles/2025-08-10-ctlog-2.md
Normal file
@@ -0,0 +1,666 @@
|
||||
---
|
||||
date: "2025-08-10T12:07:23Z"
|
||||
title: 'Certificate Transparency - Part 2 - Sunlight'
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)], as the
|
||||
name suggests it was a form of _digital notary_, and they were in the business of issuing security
|
||||
certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and
|
||||
subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for
|
||||
man-in-the-middle attacks on Iranian Gmail users. Not cool.
|
||||
|
||||
Google launched a project called **Certificate Transparency**, because it was becoming more common
|
||||
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
|
||||
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
|
||||
the Web Public Key Infrastructure. It led to the creation of this ambitious
|
||||
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
|
||||
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
|
||||
and _TLS_ (Transport Layer Security).
|
||||
|
||||
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
|
||||
describes an experimental protocol for publicly logging the existence of Transport Layer Security
|
||||
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
|
||||
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
|
||||
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
|
||||
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
|
||||
the logs.
|
||||
|
||||
In a [[previous article]({{< ref 2025-07-26-ctlog-1 >}})], I took a deep dive into an upcoming
|
||||
open source implementation of Static CT Logs made by Google. There is however a very competent
|
||||
alternative called [[Sunlight](https://sunlight.dev/)], which deserves some attention to get to know
|
||||
its look and feel, as well as its performance characteristics.
|
||||
|
||||
## Sunlight
|
||||
|
||||
I start by reading up on the project website, and learn:
|
||||
|
||||
> _Sunlight is a [[Certificate Transparency](https://certificate.transparency.dev/)] log implementation
|
||||
> and monitoring API designed for scalability, ease of operation, and reduced cost. What started as
|
||||
> the Sunlight API is now the [[Static CT API](https://c2sp.org/static-ct-api)] and is allowed by the
|
||||
> CT log policies of the major browsers._
|
||||
>
|
||||
> _Sunlight was designed by Filippo Valsorda for the needs of the WebPKI community, through the
|
||||
> feedback of many of its members, and in particular of the Sigsum, Google TrustFabric, and ISRG
|
||||
> teams. It is partially based on the Go Checksum Database. Sunlight's development was sponsored by
|
||||
> Let's Encrypt._
|
||||
|
||||
I have a chat with Filippo and think I'm addressing an Elephant by asking him which of the two
|
||||
implementations, TesseraCT or Sunlight, he thinks would be a good fit. One thing he says really sticks
|
||||
with me: "The community needs _any_ static log operator, so if Google thinks TesseraCT is ready, by
|
||||
all means use that. The diversity will do us good!".
|
||||
|
||||
To find out if one or the other is 'ready' is partly on the software, but importantly also on the
|
||||
operator. So I carefully take Sunlight out of its cardboard box, and put it onto the same Dell R630
|
||||
that I used in my previous tests: two Xeon E5-2640 v4 CPUs for a total of 20 cores and 40 threads,
|
||||
and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I place 6 pcs 1.2TB SAS3
|
||||
drives (HPE part number EG1200JEHMC), and in the second machine I place 6pcs of 1.92TB enterprise
|
||||
storage (Samsung part number P1633N19).
|
||||
|
||||
### Sunlight: setup
|
||||
|
||||
I download the source from GitHub, which, one of these days, will have an IPv6 address. Building the
|
||||
tools is easy enough, there are three main tools:
|
||||
1. ***sunlight***: Which serves the write-path. Certification authorities add their certs here.
|
||||
1. ***sunlight-keygen***: A helper tool to create the so-called `seed` file (key material) for a
|
||||
log.
|
||||
1. ***skylight***: Which serves the read-path. `/checkpoint` and things like `/tile` and `/issuer`
|
||||
are served here in a spec-compliant way.
|
||||
|
||||
The YAML configuration file is straightforward, and can define and handle multiple logs in one
|
||||
instance, which sets it apart from TesseraCT which can only handle one log per instance. There's a
|
||||
`submissionprefix` which `sunlight` will use to accept writes, and a `monitoringprefix` which
|
||||
`skylight` will use for reads.
|
||||
|
||||
I stumble across a small issue - I haven't created multiple DNS hostnames for the test machine. So I
|
||||
decide to use a different port for one versus the other. The write path will use TLS on port 1443
|
||||
while Sunlight will point to a normal HTTP port 1080. And considering I don't have a certificate for
|
||||
`*.lab.ipng.ch`, I will use a self-signed one instead:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ openssl genrsa -out ca.key 2048
|
||||
pim@ctlog-test:/etc/sunlight$ openssl req -new -x509 -days 365 -key ca.key \
|
||||
-subj "/C=CH/ST=ZH/L=Bruttisellen/O=IPng Networks GmbH/CN=IPng Root CA" -out ca.crt
|
||||
pim@ctlog-test:/etc/sunlight$ openssl req -newkey rsa:2048 -nodes -keyout sunlight-key.pem \
|
||||
-subj "/C=CH/ST=ZH/L=Bruttisellen/O=IPng Networks GmbH/CN=*.lab.ipng.ch" -out sunlight.csr
|
||||
pim@ctlog-test:/etc/sunlight# openssl x509 -req -extfile \
|
||||
<(printf "subjectAltName=DNS:ctlog-test.lab.ipng.ch,DNS:ctlog-test.lab.ipng.ch") -days 365 \
|
||||
-in sunlight.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out sunlight.pem
|
||||
ln -s sunlight.pem skylight.pem
|
||||
ln -s sunlight-key.pem skylight-key.pem
|
||||
```
|
||||
|
||||
This little snippet yields `sunlight.pem` (the certificate) and `sunlight-key.pem` (the private
|
||||
key), and symlinks them to `skylight.pem` and `skylight-key.pem` for simplicity. With these in hand,
|
||||
I can start the rest of the show. First I will prepare the NVME storage with a few datasets in
|
||||
which Sunlight will store its data:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/shared
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/logs
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/logs/sunlight-test
|
||||
pim@ctlog-test:~$ sudo chown -R pim:pim /ssd-vol0/sunlight-test
|
||||
```
|
||||
|
||||
Then I'll create the Sunlight configuration:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ sunlight-keygen -f sunlight-test.seed.bin
|
||||
Log ID: IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=
|
||||
ECDSA public key:
|
||||
-----BEGIN PUBLIC KEY-----
|
||||
MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHR
|
||||
wRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ==
|
||||
-----END PUBLIC KEY-----
|
||||
Ed25519 public key:
|
||||
-----BEGIN PUBLIC KEY-----
|
||||
0pHg7KptAxmb4o67m9xNM1Ku3YH4bjjXbyIgXn2R2bk=
|
||||
-----END PUBLIC KEY-----
|
||||
```
|
||||
|
||||
The first block creates key material for the log, and I get a fun surprise: the Log ID starts
|
||||
precisely with the string IPng... what are the odds that that would happen!? I should tell Antonis
|
||||
about this, it's dope!
|
||||
|
||||
As a safety precaution, Sunlight requires the operator to make the `checkpoints.db` by hand, which
|
||||
I'll also do:
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ sqlite3 /ssd-vol0/sunlight-test/shared/checkpoints.db \
|
||||
"CREATE TABLE checkpoints (logID BLOB PRIMARY KEY, body TEXT)"
|
||||
```
|
||||
|
||||
And with that, I'm ready to create my first log!
|
||||
|
||||
### Sunlight: Setting up S3
|
||||
|
||||
When learning about [[Tessera]({{< ref 2025-07-26-ctlog-1 >}})], I already kind of drew the
|
||||
conclusion that, for our case at IPng at least, running the fully cloud-native version with S3
|
||||
storage and MySQL database, gave both poorer performance, but also more operational complexity. But
|
||||
I find it interesting to compare behavior and performance, so I'll start by creating a Sunlight log
|
||||
using backing MinIO SSD storage.
|
||||
|
||||
I'll first create the bucket and a user account to access it:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID="<some user>"
|
||||
pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY="<some password>"
|
||||
pim@ctlog-test:~$ export S3_BUCKET=sunlight-test
|
||||
|
||||
pim@ctlog-test:~$ mc mb ssd/${S3_BUCKET}
|
||||
pim@ctlog-test:~$ cat << EOF > /tmp/minio-access.json
|
||||
{ "Version": "2012-10-17", "Statement": [ {
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ],
|
||||
"Resource": [ "arn:aws:s3:::${S3_BUCKET}/*", "arn:aws:s3:::${S3_BUCKET}" ]
|
||||
} ]
|
||||
}
|
||||
EOF
|
||||
pim@ctlog-test:~$ mc admin user add ssd ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
|
||||
pim@ctlog-test:~$ mc admin policy create ssd ${S3_BUCKET}-access /tmp/minio-access.json
|
||||
pim@ctlog-test:~$ mc admin policy attach ssd ${S3_BUCKET}-access --user ${AWS_ACCESS_KEY_ID}
|
||||
pim@ctlog-test:~$ mc anonymous set public ssd/${S3_BUCKET}
|
||||
```
|
||||
|
||||
After setting up the S3 environment, all I must do is wire it up to the Sunlight configuration
|
||||
file:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ cat << EOF > sunlight-s3.yaml
|
||||
listen:
|
||||
- "[::]:1443"
|
||||
checkpoints: /ssd-vol0/sunlight-test/shared/checkpoints.db
|
||||
logs:
|
||||
- shortname: sunlight-test
|
||||
inception: 2025-08-10
|
||||
submissionprefix: https://ctlog-test.lab.ipng.ch:1443/
|
||||
monitoringprefix: http://sunlight-test.minio-ssd.lab.ipng.ch:9000/
|
||||
secret: /etc/sunlight/sunlight-test.seed.bin
|
||||
cache: /ssd-vol0/sunlight-test/logs/sunlight-test/cache.db
|
||||
s3region: eu-schweiz-1
|
||||
s3bucket: sunlight-test
|
||||
s3endpoint: http://minio-ssd.lab.ipng.ch:9000/
|
||||
roots: /etc/sunlight/roots.pem
|
||||
period: 200
|
||||
poolsize: 15000
|
||||
notafterstart: 2024-01-01T00:00:00Z
|
||||
notafterlimit: 2025-01-01T00:00:00Z
|
||||
EOF
|
||||
```
|
||||
|
||||
The one thing of note here is the use of `roots:` file which contains the Root CA for the TesseraCT
|
||||
loadtester which I'll be using. In production, Sunlight can grab the approved roots from the
|
||||
so-called _Common CA Database_ or CCADB. But you can also specify either all roots using the `roots`
|
||||
field, or additional roots on top of the `ccadbroots` field, using the `extraroots` field. That's a
|
||||
handy trick! You can find more info on the [[CCADB](https://www.ccadb.org/)] homepage.
|
||||
|
||||
I can then start Sunlight just like this:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ sunlight -testcert -c /etc/sunlight/sunlight-s3.yaml {"time":"2025-08-10T13:49:36.091384532+02:00","level":"INFO","source":{"function":"main.main.func1","file":"/home/pim/src/sunlight/cmd/sunlight/sunlig
|
||||
ht.go","line":341},"msg":"debug server listening","addr":{"IP":"127.0.0.1","Port":37477,"Zone":""}}
|
||||
time=2025-08-10T13:49:36.091+02:00 level=INFO msg="debug server listening" addr=127.0.0.1:37477 {"time":"2025-08-10T13:49:36.100471647+02:00","level":"INFO","source":{"function":"main.main","file":"/home/pim/src/sunlight/cmd/sunlight/sunlight.go"
|
||||
,"line":542},"msg":"today is the Inception date, creating log","log":"sunlight-test"} time=2025-08-10T13:49:36.100+02:00 level=INFO msg="today is the Inception date, creating log" log=sunlight-test
|
||||
{"time":"2025-08-10T13:49:36.119529208+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.CreateLog","file":"/home/pim/src
|
||||
/sunlight/internal/ctlog/ctlog.go","line":159},"msg":"created log","log":"sunlight-test","timestamp":1754826576111,"logID":"IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E="}
|
||||
time=2025-08-10T13:49:36.119+02:00 level=INFO msg="created log" log=sunlight-test timestamp=1754826576111 logID="IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E="
|
||||
{"time":"2025-08-10T13:49:36.127702166+02:00","level":"WARN","source":{"function":"filippo.io/sunlight/internal/ctlog.LoadLog","file":"/home/pim/src/s
|
||||
unlight/internal/ctlog/ctlog.go","line":296},"msg":"failed to parse previously trusted roots","log":"sunlight-test","roots":""} time=2025-08-10T13:49:36.127+02:00 level=WARN msg="failed to parse previously trusted roots" log=sunlight-test roots=""
|
||||
{"time":"2025-08-10T13:49:36.127766452+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.LoadLog","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":301},"msg":"loaded log","log":"sunlight-test","logID":"IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=","size":0,
|
||||
"timestamp":1754826576111}
|
||||
time=2025-08-10T13:49:36.127+02:00 level=INFO msg="loaded log" log=sunlight-test logID="IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=" size=0 timestamp=1754826576111
|
||||
{"time":"2025-08-10T13:49:36.540297532+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.(*Log).sequencePool","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":972},"msg":"sequenced pool","log":"sunlight-test","old_tree_size":0,"entries":0,"start":"2025-08-1
|
||||
0T13:49:36.534500633+02:00","tree_size":0,"tiles":0,"timestamp":1754826576534,"elapsed":5788099}
|
||||
time=2025-08-10T13:49:36.540+02:00 level=INFO msg="sequenced pool" log=sunlight-test old_tree_size=0 entries=0 start=2025-08-10T13:49:36.534+02:00 tree_size=0 tiles=0 timestamp=1754826576534 elapsed=5.788099ms
|
||||
...
|
||||
```
|
||||
|
||||
Although that looks pretty good, I see that something is not quite right. When Sunlight comes up, it shares
|
||||
with me a few links, in the `get-roots` and `json` fields on the homepage, but neither of them work:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ curl -k https://ctlog-test.lab.ipng.ch:1443/ct/v1/get-roots
|
||||
404 page not found
|
||||
pim@ctlog-test:~$ curl -k https://ctlog-test.lab.ipng.ch:1443/log.v3.json
|
||||
404 page not found
|
||||
```
|
||||
|
||||
I'm starting to think that using a non-standard listen port won't work, or more precisely, adding
|
||||
a port in the `monitoringprefix` won't work. I notice that the logname is called
|
||||
`ctlog-test.lab.ipng.ch:1443` which I don't think is supposed to have a portname in it. So instead,
|
||||
I make Sunlight `listen` on port 443 and omit the port in the `submissionprefix`, and give it and
|
||||
its companion Skylight the needed privileges to bind the privileged port like so:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/sunlight
|
||||
pim@ctlog-test:~$ sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/skylight
|
||||
pim@ctlog-test:~$ sunlight -testcert -c /etc/sunlight/sunlight-s3.yaml
|
||||
```
|
||||
|
||||
{{< image width="60%" src="/assets/ctlog/sunlight-test-s3.png" alt="Sunlight testlog / S3" >}}
|
||||
|
||||
And with that, Sunlight reports for duty and the links work. Hoi!
|
||||
|
||||
#### Sunlight: Loadtesting S3
|
||||
|
||||
I have some good experience loadtesting from the [[TesseraCT article]({{< ref 2025-07-26-ctlog-1
|
||||
>}})]. One important difference is that Sunlight wants to use SSL for the submission and monitoring
|
||||
paths, and I've created a snakeoil self-signed cert. CT Hammer does not accept that out of the box,
|
||||
so I need to make a tiny change to the Hammer:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ git diff
|
||||
diff --git a/internal/hammer/hammer.go b/internal/hammer/hammer.go
|
||||
index 3828fbd..1dfd895 100644
|
||||
--- a/internal/hammer/hammer.go
|
||||
+++ b/internal/hammer/hammer.go
|
||||
@@ -104,6 +104,9 @@ func main() {
|
||||
MaxIdleConns: *numWriters + *numReadersFull + *numReadersRandom,
|
||||
MaxIdleConnsPerHost: *numWriters + *numReadersFull + *numReadersRandom,
|
||||
DisableKeepAlives: false,
|
||||
+ TLSClientConfig: &tls.Config{
|
||||
+ InsecureSkipVerify: true,
|
||||
+ },
|
||||
},
|
||||
Timeout: *httpTimeout,
|
||||
}
|
||||
```
|
||||
|
||||
With that small bit of insecurity out of the way, Sunlight makes it otherwise pretty easy for me to
|
||||
construct the CT Hammer commandline:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=http://sunlight-test.minio-ssd.lab.ipng.ch:9000/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=100
|
||||
|
||||
pim@ctlog-test:/etc/sunlight$ T=0; O=0; while :; do \
|
||||
N=$(curl -sS http://sunlight-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E '^[0-9]+$'); \
|
||||
if [ "$N" -eq "$O" ]; then \
|
||||
echo -n .; \
|
||||
else \
|
||||
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
|
||||
fi; \
|
||||
T=$((T+1)); sleep 1; done
|
||||
24915 1 seconds 96 certs
|
||||
25011 1 seconds 92 certs
|
||||
25103 1 seconds 93 certs
|
||||
25196 1 seconds 87 certs
|
||||
```
|
||||
|
||||
On the first commandline I'll start the loadtest at 100 writes/sec with the standard duplication
|
||||
probability of 10%, which allows me to test Sunlights ability to avoid writing duplicates. This
|
||||
means I should see on average a growth of the tree at about 90/s. Check. I raise the write-load to
|
||||
500/s:
|
||||
|
||||
```
|
||||
39421 1 seconds 443 certs
|
||||
39864 1 seconds 442 certs
|
||||
40306 1 seconds 441 certs
|
||||
40747 1 seconds 447 certs
|
||||
41194 1 seconds 448 certs
|
||||
```
|
||||
|
||||
.. and to 1'000/s:
|
||||
```
|
||||
57941 1 seconds 945 certs
|
||||
58886 1 seconds 970 certs
|
||||
59856 1 seconds 948 certs
|
||||
60804 1 seconds 965 certs
|
||||
61769 1 seconds 955 certs
|
||||
```
|
||||
|
||||
After a few minutes I see a few errors from CT Hammer:
|
||||
```
|
||||
W0810 14:55:29.660710 1398779 analysis.go:134] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 14:55:30.496603 1398779 analysis.go:124] (1 x) failed to create request: write leaf was not OK. Status code: 500. Body: "failed to read body: read tcp 127.0.1.1:443->127.0.0.1:44908: i/o timeout\n"
|
||||
```
|
||||
|
||||
I raise the Hammer load to 5'000/sec (which means 4'500/s unique certs and 500 duplicates), and find
|
||||
the max committed writes/sec to max out at around 4'200/s:
|
||||
```
|
||||
879637 1 seconds 4213 certs
|
||||
883850 1 seconds 4207 certs
|
||||
888057 1 seconds 4211 certs
|
||||
892268 1 seconds 4249 certs
|
||||
896517 1 seconds 4216 certs
|
||||
```
|
||||
|
||||
The error rate is a steady stream of errors like the one before:
|
||||
```
|
||||
W0810 14:59:48.499274 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 14:59:49.034194 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 15:00:05.496459 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 15:00:07.187181 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
```
|
||||
|
||||
At this load of 4'200/s, MinIO is not very impressed. Remember in the [[other article]({{< ref
|
||||
2025-07-26-ctlog-1 >}})] I loadtested it to about 7'500 ops/sec and the statistics below are about
|
||||
50 ops/sec (2'800/min). I conclude that MinIO is, in fact, bored of this whole activity:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ mc admin trace --stats ssd
|
||||
Duration: 18m58s ▱▱▱
|
||||
RX Rate:↑ 115 MiB/m
|
||||
TX Rate:↓ 2.4 MiB/m
|
||||
RPM : 2821.3
|
||||
-------------
|
||||
Call Count RPM Avg Time Min Time Max Time Avg TTFB Max TTFB Avg Size Rate /min Errors
|
||||
s3.PutObject 37602 (70.3%) 1982.2 6.2ms 785µs 86.7ms 6.1ms 86.6ms ↑59K ↓0B ↑115M ↓1.4K 0
|
||||
s3.GetObject 15918 (29.7%) 839.1 996µs 670µs 51.3ms 912µs 51.2ms ↑46B ↓3.0K ↑38K ↓2.4M 0
|
||||
```
|
||||
|
||||
Sunlight still keeps its certificate cache on local disk. At a rate of 4'200/s, the ZFS pool has a
|
||||
write rate of about 105MB/s with about 877 ZFS writes per second.
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ zpool iostat -v ssd-vol0 10
|
||||
capacity operations bandwidth
|
||||
pool alloc free read write read write
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 59.1G 685G 0 2.55K 0 312M
|
||||
mirror-0 59.1G 685G 0 2.55K 0 312M
|
||||
wwn-0x5002538a05302930 - - 0 877 0 104M
|
||||
wwn-0x5002538a053069f0 - - 0 871 0 104M
|
||||
wwn-0x5002538a06313ed0 - - 0 866 0 104M
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
|
||||
pim@ctlog-test:/etc/sunlight$ zpool iostat -l ssd-vol0 10
|
||||
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim
|
||||
pool alloc free read write read write read write read write read write read write wait wait
|
||||
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 59.0G 685G 0 3.19K 0 388M - 8ms - 628us - 990us - 10ms - 88ms
|
||||
ssd-vol0 59.2G 685G 0 2.49K 0 296M - 5ms - 557us - 163us - 8ms - -
|
||||
ssd-vol0 59.6G 684G 0 2.04K 0 253M - 2ms - 704us - 296us - 4ms - -
|
||||
ssd-vol0 58.8G 685G 0 2.72K 0 328M - 6ms - 783us - 701us - 9ms - 68ms
|
||||
|
||||
```
|
||||
|
||||
A few interesting observations:
|
||||
* Sunlight still uses a local sqlite3 database for the certificate tracking, which is more
|
||||
efficient than MariaDB/MySQL, let alone AWS RDS, so it has one less runtime dependency.
|
||||
* The write rate to ZFS is significantly higher with Sunlight than TesseraCT (about 8:1). This is
|
||||
likely explained because the sqlite3 database lives on ZFS here, while TesseraCT uses MariaDB
|
||||
running on a different filesystem.
|
||||
* The MinIO usage is a lot lighter. As I reduce the load to 1'000/s, as was the case in the TesseraCT
|
||||
test, I can see the ratio of Get:Put was 93:4 in TesseraCT, while it's 70:30 here. TesseraCT as
|
||||
also consuming more IOPS, running at about 10.5k requests/minute, while Sunlight is
|
||||
significantly calmer at 2.8k requests/minute (almost 4x less!)
|
||||
* The burst capacity of Sunlight is a fair bit higher than TesseraCT, likely due to its more
|
||||
efficient use of S3 backends.
|
||||
|
||||
***Conclusion***: Sunlight S3+MinIO can handle 1'000/s reliably, and can spike to 4'200/s with only
|
||||
few errors.
|
||||
|
||||
#### Sunlight: Loadtesting POSIX
|
||||
|
||||
When I took a closer look at TesseraCT a few weeks ago, it struck me that while making a
|
||||
cloud-native setup, with S3 storage would allow for a cool way to enable storage scaling and
|
||||
read-path redundancy, by creating synchronously replicated buckets, it does come at a significant
|
||||
operational overhead and complexity. My main concern is the amount of different moving parts, and
|
||||
Sunlight really has one very appealing property: it can run entirely on one machine without the need
|
||||
for any other moving parts - even the SQL database is linked in. That's pretty slick.
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ cat << EOF > sunlight.yaml
|
||||
listen:
|
||||
- "[::]:443"
|
||||
checkpoints: /ssd-vol0/sunlight-test/shared/checkpoints.db
|
||||
logs:
|
||||
- shortname: sunlight-test
|
||||
inception: 2025-08-10
|
||||
submissionprefix: https://ctlog-test.lab.ipng.ch/
|
||||
monitoringprefix: https://ctlog-test.lab.ipng.ch:1443/
|
||||
secret: /etc/sunlight/sunlight-test.seed.bin
|
||||
cache: /ssd-vol0/sunlight-test/logs/sunlight-test/cache.db
|
||||
localdirectory: /ssd-vol0/sunlight-test/logs/sunlight-test/data
|
||||
roots: /etc/sunlight/roots.pem
|
||||
period: 200
|
||||
poolsize: 15000
|
||||
notafterstart: 2024-01-01T00:00:00Z
|
||||
notafterlimit: 2025-01-01T00:00:00Z
|
||||
EOF
|
||||
pim@ctlog-test:/etc/sunlight$ sunlight -testcert -c sunlight.yaml
|
||||
pim@ctlog-test:/etc/sunlight$ skylight -testcert -c skylight.yaml
|
||||
```
|
||||
|
||||
First I'll start a hello-world loadtest at 100/s and take a look at the number of leaves in the
|
||||
checkpoint after a few minutes, I would expect about three minutes worth at 100/s with a duplicate
|
||||
probability of 10% to yield about 16'200 unique certificates in total.
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
10086
|
||||
15518
|
||||
20920
|
||||
26339
|
||||
```
|
||||
|
||||
And would you look at that? `(26339-10086)` is right on the dot! One thing that I find particularly
|
||||
cool about Sunlight is its baked in Prometheus metrics. This allows me some pretty solid insight on
|
||||
its performance. Take a look for example at the write path latency tail (99th ptile):
|
||||
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 0.207285993
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.001409719
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.002227985
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000224969
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} 8.3003e-05
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.042118751
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 0.2259605
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 0.108987393
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.014922489
|
||||
```
|
||||
|
||||
I'm seeing here that at a load of 100/s (with 90/s of unique certificates), the 99th percentile
|
||||
add-chain latency is 207ms, which makes sense because the `period` configuration field is set to
|
||||
200ms. The filesystem operations (discard, fetch, upload) are _de minimis_ and the sequencing
|
||||
duration is at 109ms. Excellent!
|
||||
|
||||
But can this thing go really fast? I do remember that the CT Hammer uses more CPU than TesseraCT,
|
||||
and I've seen it above also when running my 5'000/s loadtest that's about all the hammer can take on
|
||||
a single Dell R630. So, as I did with the TesseraCT test, I'll use the MinIO SSD and MinIO Disk
|
||||
machines to generate the load.
|
||||
|
||||
I boot them, so that I can hammer, or shall I say jackhammer away:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000
|
||||
|
||||
pim@minio-ssd:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000 --serial_offset=1000000
|
||||
|
||||
pim@minio-disk:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000 --serial_offset=2000000
|
||||
```
|
||||
|
||||
This will generate 15'000/s of load, which I note does bring Sunlight to its knees, although it does
|
||||
remain stable (yaay!) with a somewhat more bursty checkpoint interval:
|
||||
|
||||
```
|
||||
5504780 1 seconds 4039 certs
|
||||
5508819 1 seconds 10000 certs
|
||||
5518819 . 2 seconds 7976 certs
|
||||
5526795 1 seconds 2022 certs
|
||||
5528817 1 seconds 9782 certs
|
||||
5538599 1 seconds 217 certs
|
||||
5538816 1 seconds 3114 certs
|
||||
5541930 1 seconds 6818 certs
|
||||
```
|
||||
|
||||
So what I do instead is a somewhat simpler measurement of certificates per minute:
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
6008831
|
||||
6296255
|
||||
6576712
|
||||
```
|
||||
|
||||
This rate boils down to `(6576712-6008831)/120` or 4'700/s of written certs, which at a duplication
|
||||
ratio of 10% means approximately 5'200/s of total accepted certs. This rate, Sunlight is consuming
|
||||
about 10.3 CPUs/s, while Skylight is at 0.1 CPUs/s and the CT Hammer is at 11.1 CPUs/s; Given the 40
|
||||
threads on this machine, I am not saturating the CPU, but I'm curious as this rate is significantly
|
||||
lower than TesseraCT. I briefly turn off the hammer on `ctlog-test` to allow Sunlight to monopolize
|
||||
the entire machine. The CPU use does reduce to about 9.3 CPUs/s suggesting that indeed, the bottleneck
|
||||
is not strictly CPU:
|
||||
|
||||
{{< image width="90%" src="/assets/ctlog/btop-sunlight.png" alt="Sunlight btop" >}}
|
||||
|
||||
When using only two CT Hammers (on `minio-ssd.lab.ipng.ch` and `minio-disk.lab.ipng.ch`), the CPU
|
||||
use on the `ctlog-test.lab.ipng.ch` machine definitely goes down (CT Hammer is kind of a CPU hog....),
|
||||
but the resulting throughput doesn't change that much:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
7985648
|
||||
8302421
|
||||
8528122
|
||||
8772758
|
||||
```
|
||||
|
||||
What I find particularly interesting is that the total rate stays approximately 4'400/s
|
||||
(`(8772758-7985648)/180`), while the checkpoint latency varies considerably. One really cool thing I
|
||||
learned earlier is that Sunlight comes with baked in Prometheus metrics, which I can take a look at
|
||||
while keeping it under this load of ~10'000/sec:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 1.889983538
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.000148819
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.837981208
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000433179
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} NaN
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.067494558
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 1.86894666
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 1.111400223
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.016859223
|
||||
```
|
||||
|
||||
Comparing the throughput at 4'400/s with that first test of 100/s, I expect and can confirm a
|
||||
significant increase in all of these metrics. The 99th percentile addchain is now 1889ms (up from
|
||||
207ms) and the sequencing duration is now 1111ms (up from 109ms).
|
||||
|
||||
#### Sunlight: Effect of period
|
||||
|
||||
I fiddle a little bit with Sunlight's configuration file, notably the `period` and `poolsize`.
|
||||
First I set `period:2000` and `poolsize:15000`, which yields pretty much the same throughput:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
701850
|
||||
1001424
|
||||
1295508
|
||||
1575789
|
||||
```
|
||||
|
||||
With a generated load of 10'000/sec with a 10% duplication rate, I am offering roughly 9'000/sec of
|
||||
unique certificates, and I'm seeing `(1575789 - 701850)/180` or about 4'855/sec come through. Just
|
||||
for reference, at this rate and with `period:2000`, the latency tail looks like this:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 3.203510079
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.000108613
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.950453973
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.00046192
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} NaN
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.049007693
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 3.570709413
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 1.5968609040000001
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.010847308
|
||||
```
|
||||
|
||||
Then I also set a `period:100` and `poolsize:15000`, which does improve a bit:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
560654
|
||||
950524
|
||||
1324645
|
||||
1720362
|
||||
```
|
||||
|
||||
With the same generated load of 10'000/sec with a 10% duplication rate, I am still offering roughly
|
||||
9'000/sec of unique certificates, and I'm seeing `(1720362 - 560654)/180` or about 6'440/sec come
|
||||
through, which is a fair bit better, at the expense of more disk activity. At this rate and with
|
||||
`period:100`, the latency tail looks like this:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 1.616046445
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 7.5123e-05
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.534935803
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000377273
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} 4.8893e-05
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.054685991
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 1.946445877
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 0.980602185
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.018385831
|
||||
```
|
||||
|
||||
***Conclusion***: Sunlight on POSIX can reliably handle 4'400/s (with a duplicate rate of 10%) on
|
||||
this setup.
|
||||
|
||||
## Wrapup - Observations
|
||||
|
||||
From an operators point of view, TesseraCT and Sunlight handle quite differently. Both are easily up
|
||||
to the task of serving the current write-load (which is about 250/s).
|
||||
|
||||
* ***S3***: When using the S3 backend, TesseraCT became quite unhappy above 800/s while Sunlight
|
||||
went all the way up to 4'200/s and sent significantly less requests to MinIO (about 4x less),
|
||||
while showing good telemetry on the use of S3 backends. In this mode, TesseraCT uses MySQL (in
|
||||
my case, MariaDB) which was not on the ZFS pool, but on the boot-disk.
|
||||
|
||||
* ***POSIX***: When using normal filesystem, Sunlight seems to peak at 4'800/s while TesseraCT
|
||||
went all the way to 12'000/s. When doing so, Disk IO was quite similar between the two
|
||||
solutions, taking into account that TesseraCT runs BadgerDB, while Sunlight uses sqlite3,
|
||||
both are using their respective ZFS pool.
|
||||
|
||||
***Notable***: Sunlight POSIX and S3 performance is roughly identical (both handle about
|
||||
5'000/sec), while TesseraCT POSIX performance (12'000/s) is significantly better than its S3
|
||||
(800/s). Some other observations:
|
||||
|
||||
* Sunlight has a very opinionated configuration, and can run multiple logs with one configuration
|
||||
file and one binary. Its configuration was a bit constraining though, as I could not manage to
|
||||
use `monitoringprefix` or `submissionprefix` with `http://` prefix - a likely security
|
||||
precaution - but also using ports in those prefixes (other than the standard 443) rendered
|
||||
Sunlight and Skylight unusable for me.
|
||||
|
||||
* Skylight only serves from local directory, it does not have support for S3. For operators using S3,
|
||||
an alternative could be to use NGINX in the serving path, similar to TesseraCT. Skylight does have
|
||||
a few things to teach me though, notably on proper compression, content type and other headers.
|
||||
|
||||
* TesseraCT does not have a configuration file, and will run exactly one log per binary
|
||||
instance. It uses flags to construct the environment, and is much more forgiving for creative
|
||||
`origin` (log name), and submission- and monitoring URLs. It's happy to use regular 'http://'
|
||||
for both, which comes in handy in those architectures where the system is serving behind a
|
||||
reversed proxy.
|
||||
|
||||
* The TesseraCT Hammer tool then again does not like using self-signed certificates, and needs
|
||||
to be told to skip certificate validation in the case of Sunlight loadtests while it is
|
||||
running with the `-testcert` commandline.
|
||||
|
||||
I consider all of these small and mostly cosmetic issues, because in production there will be proper
|
||||
TLS certificates issued and normal https:// serving ports with unique monitoring and submission
|
||||
hostnames.
|
||||
|
||||
## What's Next
|
||||
|
||||
Together with Antonis Chariton and Jeroen Massar, IPng Networks will be offering both TesseraCT and
|
||||
Sunlight logs on the public internet. One final step is to productionize both logs, and file the
|
||||
paperwork for them in the community. Although at this point our Sunlight log is already running,
|
||||
I'll wait a few weeks to gather any additional intel, before wrapping up in a final article.
|
||||
|
||||
515
content/articles/2025-08-24-ctlog-3.md
Normal file
515
content/articles/2025-08-24-ctlog-3.md
Normal file
@@ -0,0 +1,515 @@
|
||||
---
|
||||
date: "2025-08-24T12:07:23Z"
|
||||
title: 'Certificate Transparency - Part 3 - Operations'
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)], as the
|
||||
name suggests it was a form of _digital notary_, and they were in the business of issuing security
|
||||
certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and
|
||||
subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for
|
||||
man-in-the-middle attacks on Iranian Gmail users. Not cool.
|
||||
|
||||
Google launched a project called **Certificate Transparency**, because it was becoming more common
|
||||
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
|
||||
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
|
||||
the Web Public Key Infrastructure. It led to the creation of this ambitious
|
||||
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
|
||||
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
|
||||
and _TLS_ (Transport Layer Security).
|
||||
|
||||
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
|
||||
describes an experimental protocol for publicly logging the existence of Transport Layer Security
|
||||
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
|
||||
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
|
||||
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
|
||||
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
|
||||
the logs.
|
||||
|
||||
In the first two articles of this series, I explored [[Sunlight]({{< ref 2025-07-26-ctlog-1 >}})]
|
||||
and [[TesseraCT]({{< ref 2025-08-10-ctlog-2 >}})], two open source implementations of the Static CT
|
||||
protocol. In this final article, I'll share the details on how I created the environment and
|
||||
production instances for four logs that IPng will be providing: Rennet and Lipase are two
|
||||
ingredients to make cheese and will serve as our staging/testing logs. Gouda and Halloumi are two
|
||||
delicious cheeses that pay homage to our heritage, Jeroen and I being Dutch and Antonis being
|
||||
Greek.
|
||||
|
||||
## Hardware
|
||||
|
||||
At IPng Networks, all hypervisors are from the same brand: Dell's Poweredge line. In this project,
|
||||
Jeroen is also contributing a server, and it so happens that he also has a Dell Poweredge. We're
|
||||
both running Debian on our hypervisor, so we install a fresh VM with Debian 13.0, codenamed
|
||||
_Trixie_, and give the machine 16GB of memory, 8 vCPU and a 16GB boot disk. Boot disks are placed on
|
||||
the hypervisor's ZFS pool, and a blockdevice snapshot is taken every 6hrs. This allows the boot disk
|
||||
to be rolled back to a last known good point in case an upgrade goes south. If you haven't seen it
|
||||
yet, take a look at [[zrepl](https://zrepl.github.io/)], a one-stop, integrated solution for ZFS
|
||||
replication. This tool is incredibly powerful, and can do snapshot management, sourcing / sinking
|
||||
to remote hosts, of course using incremental snapshots as they are native to ZFS.
|
||||
|
||||
Once the machine is up, we pass four enterprise-class storage drives, in our case 3.84TB Kioxia
|
||||
NVMe, model _KXD51RUE3T84_ which are PCIe 3.1 x4 lanes, and NVMe 1.2.1 specification with a good
|
||||
durability and reasonable (albeit not stellar) read throughput of ~2700MB/s, write throughput of
|
||||
~800MB/s with 240 kIOPS random read and 21 kIOPS random write. My attention is also drawn to a
|
||||
specific specification point: these drives allow for 1.0 DWPD, which stands for _Drive Writes Per
|
||||
Day_, in other words they are not going to run themselves off a cliff after a few petabytes of
|
||||
writes, and I am reminded that a CT Log wants to write to disk a lot during normal operation.
|
||||
|
||||
The point of these logs is to **keep them safe**, and the most important aspects of the compute
|
||||
environment are the use of ECC memory to detect single bit errors, and dependable storage. Toshiba
|
||||
makes a great product.
|
||||
|
||||
```
|
||||
ctlog1:~$ sudo zpool create -f -o ashift=12 -o autotrim=on -O atime=off -O xattr=sa \
|
||||
ssd-vol0 raidz2 /dev/disk/by-id/nvme-KXD51RUE3T84_TOSHIBA_*M
|
||||
ctlog1:~$ sudo zfs create -o encryption=on -o keyformat=passphrase ssd-vol0/enc
|
||||
ctlog1:~$ sudo zfs create ssd-vol0/logs
|
||||
ctlog1:~$ for log in lipase; do \
|
||||
for shard in 2025h2 2026h1 2026h2 2027h1 2027h2; do \
|
||||
sudo zfs create ssd-vol0/logs/${log}${shard} \
|
||||
done \
|
||||
done
|
||||
```
|
||||
|
||||
The hypervisor will use PCI passthrough for the NVMe drives, and we'll handle ZFS directly on the
|
||||
VM. The first command creates a ZFS raidz2 pool using 4kB blocks, turns of _atime_ (which avoids one
|
||||
metadata write for each read!), and turns on SSD trimming in ZFS, a very useful feature.
|
||||
|
||||
Then I'll create an encrypted volume for the configuration and key material. This way, if the
|
||||
machine is ever physically transported, the keys will be safe in transit. Finally, I'll create the
|
||||
temporal log shards starting at 2025h2, all the way through to 2027h2 for our testing log called
|
||||
_Lipase_ and our production log called _Halloumi_ on Jeroen's machine. On my own machine, it'll be
|
||||
_Rennet_ for the testing log and _Gouda_ for the production log.
|
||||
|
||||
## Sunlight

{{< image width="10em" float="right" src="/assets/ctlog/sunlight-logo.png" alt="Sunlight logo" >}}

I set up Sunlight first, as its authors have extensive operational notes, both in terms of the
[[config](https://config.sunlight.geomys.org/)] of Geomys' _Tuscolo_ log, as well as on the
[[Sunlight](https://sunlight.dev)] homepage. I really appreciate that Filippo added some
[[Gists](https://gist.github.com/FiloSottile/989338e6ba8e03f2c699590ce83f537b)] and
[[Doc](https://docs.google.com/document/d/1ID8dX5VuvvrgJrM0Re-jt6Wjhx1eZp-trbpSIYtOhRE/edit?tab=t.0#heading=h.y3yghdo4mdij)]
with pretty much all I need to know to run one too. Our Rennet and Gouda logs use a very similar
approach for their configuration, with one notable exception: the VMs do not have a public IP
address, and are tucked away in a private network called IPng Site Local. I'll get back to that
later.

```
ctlog@ctlog0:/ssd-vol0/enc/sunlight$ cat << EOF | tee sunlight-staging.yaml
listen:
  - "[::]:16420"
checkpoints: /ssd-vol0/shared/checkpoints.db
logs:
  - shortname: rennet2025h2
    inception: 2025-07-28
    period: 200
    poolsize: 750
    submissionprefix: https://rennet2025h2.log.ct.ipng.ch
    monitoringprefix: https://rennet2025h2.mon.ct.ipng.ch
    ccadbroots: testing
    extraroots: /ssd-vol0/enc/sunlight/extra-roots-staging.pem
    secret: /ssd-vol0/enc/sunlight/keys/rennet2025h2.seed.bin
    cache: /ssd-vol0/logs/rennet2025h2/cache.db
    localdirectory: /ssd-vol0/logs/rennet2025h2/data
    notafterstart: 2025-07-01T00:00:00Z
    notafterlimit: 2026-01-01T00:00:00Z
...
EOF
ctlog@ctlog0:/ssd-vol0/enc/sunlight$ cat << EOF | tee skylight-staging.yaml
listen:
  - "[::]:16421"
homeredirect: https://ipng.ch/s/ct/
logs:
  - shortname: rennet2025h2
    monitoringprefix: https://rennet2025h2.mon.ct.ipng.ch
    localdirectory: /ssd-vol0/logs/rennet2025h2/data
    staging: true
...
EOF
```

In the first configuration file, I'll tell _Sunlight_ (the write-path component) to listen on port
`:16420`, and I'll tell _Skylight_ (the read-path component) to listen on port `:16421`. I've disabled
the automatic certificate renewals, and will handle SSL upstream. A few notes on this:

1. Most importantly, I will be using a common frontend pool with a wildcard certificate for
   `*.ct.ipng.ch`. I wrote about [[DNS-01]({{< ref 2023-03-24-lego-dns01 >}})] before; it's a very
   convenient way for IPng to do certificate pool management. I will be sharing this one certificate
   across all log types (a sketch of the renewal follows below this list).
1. ACME/HTTP-01 could be made to work with a bit of effort, by plumbing the `/.well-known/`
   URIs through on the frontend and pointing them at these instances. But then the cert would have to
   be copied from Sunlight back to the frontends.
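For reference, the DNS-01 renewal boils down to something like the following. The DNS provider, the
account e-mail and the credential handling here are placeholders, not IPng's actual setup:

```
# Hypothetical lego invocation for the wildcard certificate; provider name,
# e-mail and the API token in the environment are illustrative only.
lego@lego:~$ CLOUDFLARE_DNS_API_TOKEN=... lego --accept-tos --email noc@ipng.ch \
    --dns cloudflare --domains '*.ct.ipng.ch' --path /etc/lego run
```
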

I've noticed that when the log doesn't exist yet, I can start Sunlight and it'll create the bits and
pieces on the local filesystem and start writing checkpoints. But if the log already exists, I am
required to have the _monitoringprefix_ active, otherwise Sunlight won't start up. It's a small
thing, as I will have the read path operational in a few simple steps. Anyway, all five log shards
for Rennet, and a few days later for Gouda, are operational this way.

Skylight provides all the things I need to serve the data back, which is a huge help. The [[Static
Log Spec](https://github.com/C2SP/C2SP/blob/main/static-ct-api.md)] is very clear on things like
compression, content-type, cache-control and other headers. Skylight makes this a breeze, as it reads
a configuration file very similar to the Sunlight write-path one, and takes care of it all for me.

## TesseraCT

{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="TesseraCT logo" >}}

Good news came to our community on August 14th, when Google's TrustFabric team announced their Alpha
milestone of [[TesseraCT](https://blog.transparency.dev/introducing-tesseract)]. This release
also promoted the POSIX variant out of experimental, alongside the already further-along GCP and AWS
personalities. After playing around with it with Al and the team, I think I've learned enough to get
us going with a public `tesseract-posix` instance.

One thing I liked about Sunlight is its compact YAML file that describes the pertinent bits of the
system, and that I can serve any number of logs with the same process. TesseraCT, on the other hand,
can serve only one log per process. Both approaches have pros and cons: notably, if a poisonous
submission were offered, Sunlight might take down all logs, while TesseraCT would only take down the
log receiving the offensive submission. On the other hand, maintaining separate processes is
cumbersome, and all log instances need to be meticulously configured.


### TesseraCT genconf

I decide to automate this by vibing a little tool called `tesseract-genconf`, which I've published on
[[Gitea](https://git.ipng.ch/certificate-transparency/cheese)]. What it does is take a YAML file
describing the logs, and output the bits and pieces needed to operate multiple separate processes
that together form the sharded static log. I've attempted to stay mostly compatible with the
Sunlight YAML configuration, and came up with a variant like this one:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat << EOF | tee tesseract-staging.yaml
listen:
  - "[::]:8080"
roots: /ssd-vol0/enc/tesseract/roots.pem
logs:
  - shortname: lipase2025h2
    listen: "[::]:16900"
    submissionprefix: https://lipase2025h2.log.ct.ipng.ch
    monitoringprefix: https://lipase2025h2.mon.ct.ipng.ch
    extraroots: /ssd-vol0/enc/tesseract/extra-roots-staging.pem
    secret: /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
    localdirectory: /ssd-vol0/logs/lipase2025h2/data
    notafterstart: 2025-07-01T00:00:00Z
    notafterlimit: 2026-01-01T00:00:00Z
...
EOF
```

With this snippet, I have all the information I need. Here are the steps I take to construct the log
itself:

***1. Generate keys***

The keys are `prime256v1`, and the format that TesseraCT accepts has changed since I wrote up my first
[[deep dive]({{< ref 2025-07-26-ctlog-1 >}})] a few weeks ago. Now, the tool accepts a `PEM` format
private key, from which the _Log ID_ and _Public Key_ can be derived. So off I go:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-key
Creating /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2026h1.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2026h2.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2027h1.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2027h2.pem
```

Of course, if a file already exists at that location, it'll just print a warning like:
```
Key already exists: /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem (skipped)
```

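For the curious, the key handling itself is nothing exotic. A hedged sketch of what `gen-key` amounts
to, using plain `openssl` (the Log ID is the SHA-256 hash of the DER-encoded public key, per RFC 6962;
the exact internals of `gen-key` may differ):

```
# Generate a prime256v1 (P-256) private key in PEM format -- what gen-key writes out.
ctlog1:~$ openssl ecparam -name prime256v1 -genkey -noout -out lipase2025h2.pem

# Derive the public key, and the Log ID as the base64'd SHA-256 over the DER-encoded SPKI.
ctlog1:~$ openssl ec -in lipase2025h2.pem -pubout -out lipase2025h2.pub.pem
ctlog1:~$ openssl ec -in lipase2025h2.pem -pubout -outform DER | openssl dgst -sha256 -binary | base64
```
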
***2. Generate JSON/HTML***

I will be operating the read-path with NGINX. Log operators have started speaking about their log
metadata in terms of a small JSON file called `log.v3.json`, and Skylight does a good job of
exposing that one, alongside all the other pertinent metadata. So I'll generate these files for each
of the logs:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-html
Creating /ssd-vol0/logs/lipase2025h2/data/index.html
Creating /ssd-vol0/logs/lipase2025h2/data/log.v3.json
Creating /ssd-vol0/logs/lipase2026h1/data/index.html
Creating /ssd-vol0/logs/lipase2026h1/data/log.v3.json
Creating /ssd-vol0/logs/lipase2026h2/data/index.html
Creating /ssd-vol0/logs/lipase2026h2/data/log.v3.json
Creating /ssd-vol0/logs/lipase2027h1/data/index.html
Creating /ssd-vol0/logs/lipase2027h1/data/log.v3.json
Creating /ssd-vol0/logs/lipase2027h2/data/index.html
Creating /ssd-vol0/logs/lipase2027h2/data/log.v3.json
```

{{< image width="60%" src="/assets/ctlog/lipase.png" alt="TesseraCT Lipase Log" >}}

It's nice to see a familiar look-and-feel for these logs appear in those `index.html` pages, which
all cross-link to each other within the logs specified in `tesseract-staging.yaml`. Which is dope.

***3. Generate Roots***

Antonis had seen this before (thanks for the explanation!): TesseraCT does not natively implement
fetching of the [[CCADB](https://www.ccadb.org/)] roots. But, he points out, you can just get them
from any other running log instance, so I'll implement a `gen-roots` command:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf gen-roots \
    --source https://tuscolo2027h1.sunlight.geomys.org --output production-roots.pem
Fetching roots from: https://tuscolo2027h1.sunlight.geomys.org/ct/v1/get-roots
2025/08/25 08:24:58 Warning: Failed to parse certificate,carefully skipping: x509: negative serial number
Creating production-roots.pem
Successfully wrote 248 certificates to tusc.pem (out of 249 total)

ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf gen-roots \
    --source https://navigli2027h1.sunlight.geomys.org --output testing-roots.pem
Fetching roots from: https://navigli2027h1.sunlight.geomys.org/ct/v1/get-roots
Creating testing-roots.pem
Successfully wrote 82 certificates to tusc.pem (out of 82 total)
```

I can do this regularly, say daily, in a cronjob, and restart the TesseraCT processes if the files
change. It's not ideal (because the restart might be briefly disruptive), but it's a reasonable
option for the time being.

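Under the hood there's not much to it: RFC 6962's `get-roots` endpoint returns a JSON object with a
`certificates` array of base64-encoded DER certificates. A minimal shell sketch of the same
conversion, assuming `jq` is available (this is not the actual `gen-roots` implementation):

```
curl -s https://tuscolo2027h1.sunlight.geomys.org/ct/v1/get-roots \
  | jq -r '.certificates[]' \
  | while read -r der; do
      # Wrap each base64 DER blob in PEM armor, 64 characters per line.
      echo "-----BEGIN CERTIFICATE-----"
      echo "${der}" | fold -w 64
      echo "-----END CERTIFICATE-----"
    done > production-roots.pem
```
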
***4. Generate TesseraCT cmdline***

I will be running TesseraCT as a _templated unit_ in systemd. These are unit files that take an
argument; they have an `@` in their name, like so:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat << EOF | sudo tee /lib/systemd/system/tesseract@.service
[Unit]
Description=Tesseract CT Log service for %i
ConditionFileExists=/ssd-vol0/logs/%i/data/.env
After=network.target

[Service]
# The %i here refers to the instance name, e.g., "lipase2025h2"
# This path should point to where your instance-specific .env files are located
EnvironmentFile=/ssd-vol0/logs/%i/data/.env
ExecStart=/home/ctlog/bin/tesseract-posix $TESSERACT_ARGS
User=ctlog
Group=ctlog
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
```

I can now implement a `gen-env` command for my tool:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-env
Creating /ssd-vol0/logs/lipase2025h2/data/roots.pem
Creating /ssd-vol0/logs/lipase2025h2/data/.env
Creating /ssd-vol0/logs/lipase2026h1/data/roots.pem
Creating /ssd-vol0/logs/lipase2026h1/data/.env
Creating /ssd-vol0/logs/lipase2026h2/data/roots.pem
Creating /ssd-vol0/logs/lipase2026h2/data/.env
Creating /ssd-vol0/logs/lipase2027h1/data/roots.pem
Creating /ssd-vol0/logs/lipase2027h1/data/.env
Creating /ssd-vol0/logs/lipase2027h2/data/roots.pem
Creating /ssd-vol0/logs/lipase2027h2/data/.env
```

Looking at one of those `.env` files, I can show the exact command line I'll be feeding to the
`tesseract-posix` binary:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat /ssd-vol0/logs/lipase2025h2/data/.env
TESSERACT_ARGS="--private_key=/ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
--origin=lipase2025h2.log.ct.ipng.ch --storage_dir=/ssd-vol0/logs/lipase2025h2/data
--roots_pem_file=/ssd-vol0/logs/lipase2025h2/data/roots.pem --http_endpoint=[::]:16900
--not_after_start=2025-07-01T00:00:00Z --not_after_limit=2026-01-01T00:00:00Z"
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```

{{< image width="7em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
A quick operational note on OpenTelemetry (also often referred to as OTel): Al and the TrustFabric
team added OpenTelemetry to the TesseraCT personalities, as it was mostly already implemented in
the underlying Tessera library. By default, it'll try to send its telemetry to localhost using
`https`, which makes sense in those cases where the collector is on a different machine. In my case,
I'll keep `otelcol` (the collector) on the same machine. Its job is to consume the OTel telemetry
stream, and turn it back into a Prometheus `/metrics` endpoint on port `:9464`.

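The collector side of that is a small configuration file. The exact file I run isn't shown here, but
a minimal `otelcol` config that receives OTLP over HTTP on `:4318` and re-exposes Prometheus metrics
on `:9464` would look roughly like this (depending on the distribution, the `prometheus` exporter may
require the contrib build):

```
receivers:
  otlp:
    protocols:
      http:
        endpoint: "localhost:4318"

exporters:
  prometheus:
    # Must be reachable from the frontends, so not bound to localhost only.
    endpoint: "0.0.0.0:9464"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```
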
The `gen-env` command also assembles the per-instance `roots.pem` file. It takes the file pointed to
by the `roots:` key, and appends any per-log `extraroots:` files. For me, these extraroots are empty,
and the main roots file points at either the testing roots that came from _Rennet_ (our Sunlight
staging log), or the production roots that came from _Gouda_. A job well done!

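With the unit file and the per-shard `.env` files in place, bringing the shards up is one `systemctl`
invocation per instance, for example:

```
ctlog@ctlog1:~$ for shard in 2025h2 2026h1 2026h2 2027h1 2027h2; do \
    sudo systemctl enable --now tesseract@lipase${shard}; \
  done
ctlog@ctlog1:~$ systemctl status tesseract@lipase2025h2 --no-pager
```
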
***5. Generate NGINX***

When I first ran my tests, I noticed that the log check tool called `ct-fsck` threw errors on my
read path. Filippo explained that the HTTP headers matter in the Static CT specification: tiles,
issuers, and checkpoints must all have specific caching and content-type headers set. This is what
makes Skylight such a gem - I get to read it (and the spec!) to see what I'm supposed to be serving.

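A quick way to see whether a read path is doing the right thing is to look at the headers a live
monitoring host returns for the checkpoint and for a tile (the paths below follow the static-ct-api
layout; the values to expect are in the spec linked above):

```
# Inspect response headers on the read path; hostname is one of the Lipase shards.
curl -sI https://lipase2025h2.mon.ct.ipng.ch/checkpoint \
  | grep -iE '^(content-type|cache-control)'
curl -sI https://lipase2025h2.mon.ct.ipng.ch/tile/0/000 \
  | grep -iE '^(content-type|cache-control|content-encoding)'
```
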
And thus, the `gen-nginx` command is born; the vhost configs it writes take care of those headers
and listen on port `:8080` for read-path requests:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-nginx
Creating nginx config: /ssd-vol0/logs/lipase2025h2/data/lipase2025h2.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2026h1/data/lipase2026h1.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2026h2/data/lipase2026h2.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2027h1/data/lipase2027h1.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2027h2/data/lipase2027h2.mon.ct.ipng.ch.conf
```

All that's left for me to do is symlink these from `/etc/nginx/sites-enabled/`, and the read-path is
off to the races. With these commands in the `tesseract-genconf` tool, I am hoping that future
travelers have an easy time setting up their static log. Please let me know if you'd like to use, or
contribute to, the tool. You can find me in the Transparency Dev Slack, in #ct and also #cheese.

## IPng Frontends

{{< image width="18em" float="right" src="/assets/ctlog/MPLS Backbone - CTLog.svg" alt="ctlog at ipng" >}}

IPng Networks has a private internal network called [[IPng Site Local]({{< ref 2023-03-11-mpls-core
>}})], which is not routed on the internet. Our [[Frontends]({{< ref 2023-03-17-ipng-frontends >}})]
are the only things that have public IPv4 and IPv6 addresses. This allows for things like anycasted
webservers and loadbalancing with
[[Maglev](https://research.google/pubs/maglev-a-fast-and-reliable-software-network-load-balancer/)].

The IPng Site Local network kind of looks like the picture to the right. The hypervisors running the
Sunlight and TesseraCT logs are at NTT Zurich1 in Rümlang, Switzerland. The IPng frontends are
in green, and the sweet thing is, some of them run in IPng's own ISP network (AS8298), while others
run in partner networks (like IP-Max AS25091 and Coloclue AS8283). This means that I will benefit
from some pretty solid connectivity redundancy.

The frontends are provisioned with Ansible. There are two aspects to them. Firstly, a _certbot_
instance maintains the Let's Encrypt wildcard certificates for `*.ct.ipng.ch`: there's a machine
tucked away somewhere called `lego.net.ipng.ch` -- again, not exposed on the internet -- and its job
is to renew certificates and copy them to the machines that need them. Next, a cluster of NGINX
servers uses these certificates to expose IPng and customer services to the Internet.

I can tie it all together with a snippet like so, for which I apologize in advance - it's quite a
wall of text:

```
map $http_user_agent $no_cache_ctlog_lipase {
    "~*TesseraCT fsck"  1;
    default             0;
}

server {
    listen [::]:443 ssl http2;
    listen 0.0.0.0:443 ssl http2;
    ssl_certificate /etc/certs/ct.ipng.ch/fullchain.pem;
    ssl_certificate_key /etc/certs/ct.ipng.ch/privkey.pem;
    include /etc/nginx/conf.d/options-ssl-nginx.inc;
    ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;

    server_name lipase2025h2.log.ct.ipng.ch;
    access_log /nginx/logs/lipase2025h2.log.ct.ipng.ch-access.log upstream buffer=512k flush=5s;
    include /etc/nginx/conf.d/ipng-headers.inc;

    location = / {
        proxy_http_version 1.1;
        proxy_set_header Host lipase2025h2.mon.ct.ipng.ch;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_pass http://ctlog1.net.ipng.ch:8080/index.html;
    }

    location = /metrics {
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_pass http://ctlog1.net.ipng.ch:9464;
    }

    location / {
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_pass http://ctlog1.net.ipng.ch:16900;
    }
}

server {
    listen [::]:443 ssl http2;
    listen 0.0.0.0:443 ssl http2;
    ssl_certificate /etc/certs/ct.ipng.ch/fullchain.pem;
    ssl_certificate_key /etc/certs/ct.ipng.ch/privkey.pem;
    include /etc/nginx/conf.d/options-ssl-nginx.inc;
    ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;

    server_name lipase2025h2.mon.ct.ipng.ch;
    access_log /nginx/logs/lipase2025h2.mon.ct.ipng.ch-access.log upstream buffer=512k flush=5s;
    include /etc/nginx/conf.d/ipng-headers.inc;

    location = /checkpoint {
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        proxy_pass http://ctlog1.net.ipng.ch:8080;
    }

    location / {
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        include /etc/nginx/conf.d/ipng-upstream-headers.inc;
        proxy_cache ipng_cache;
        proxy_cache_key "$scheme://$host$request_uri";
        proxy_cache_valid 200 24h;
        proxy_cache_revalidate off;
        proxy_cache_bypass $no_cache_ctlog_lipase;
        proxy_no_cache $no_cache_ctlog_lipase;

        proxy_pass http://ctlog1.net.ipng.ch:8080;
    }
}
```

Taking _Lipase_ shard 2025h2 as an example: the submission path (on `*.log.ct.ipng.ch`) will show
the same `index.html` as the monitoring path (on `*.mon.ct.ipng.ch`), to provide some consistency
with Sunlight logs. Otherwise, the `/metrics` endpoint is forwarded to the `otelcol` collector
listening on port `:9464`, and the rest (`/ct/v1/` and so on) is sent to port `:16900`, where this
shard's TesseraCT process listens.

Then, the read path makes a special case of the `/checkpoint` endpoint, which it does not cache. That
request (like all others) is forwarded to port `:8080`, which is where NGINX is running. Other
requests (notably `/tile` and `/issuer`) are cacheable, so I'll cache these on the upstream NGINX
servers, both for resilience and for performance. Having four of these NGINX upstreams will allow
the Static CT logs (regardless of whether they run Sunlight or TesseraCT) to serve very high read
rates.

## What's Next

I need to spend a little bit of time thinking about rate limits, specifically write rate limits. I
think I'll use a request limiter in upstream NGINX, allowing each IP (or each /24 or /48 subnet) to
send only a fixed number of requests per second. I'll probably keep that part private though, as
it's a good rule of thumb to never offer information to attackers.

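For illustration only (the zone name, rate and burst below are placeholders, not what IPng will
actually deploy), the NGINX building blocks for this are `limit_req_zone` and `limit_req`; grouping
clients by /24 or /48 would additionally need a `map` on the client address:

```
# In the http{} block: one shared-memory zone keyed on the client address.
limit_req_zone $binary_remote_addr zone=ctlog_write:10m rate=5r/s;

# In the submission vhost: apply the limiter to the write path only.
location /ct/v1/ {
    limit_req zone=ctlog_write burst=20 nodelay;
    proxy_pass http://ctlog1.net.ipng.ch:16900;
}
```
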

Together with Antonis Chariton and Jeroen Massar, IPng Networks will be offering both TesseraCT and
Sunlight logs on the public internet. One final step is to productionize both logs and file the
paperwork for them in the community. At this point our Sunlight log has been running for a month or
so, and we've filed the paperwork for it to be included at Apple and Google.

I'm going to have folks poke at _Lipase_ as well, after which I'll run a few `ct-fsck` passes to
make sure the logs are sane, before offering them into the inclusion program as well. Wish us luck!

73
content/ctlog.md
Normal file
73
content/ctlog.md
Normal file
@@ -0,0 +1,73 @@
|
||||
---
|
||||
title: 'Certificate Transparency'
|
||||
date: 2025-07-30
|
||||
url: /s/ct
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
Certificate Transparency logs are "append-only" and publicly-auditable ledgers of certificates being
|
||||
created, updated, and expired. This is the homepage for IPng Networks' Certificate Transparency
|
||||
project.
|
||||
|
||||
Certificate Transparency [[CT](https://certificate.transparency.dev)] is a system for logging and
|
||||
monitoring certificate issuance. It greatly enhances everyone’s ability to monitor and study
|
||||
certificate issuance, and these capabilities have led to numerous improvements to the CA ecosystem
|
||||
and Web security. As a result, it is rapidly becoming critical Internet infrastructure. Originally
|
||||
developed by Google, the concept is now being adopted by many _Certification Authorities_ who log
|
||||
their certificates, and professional _Monitoring_ companies who observe the certificates and
|
||||
report anomalies.
|
||||
|
||||
IPng Networks runs our logs under the domain `ct.ipng.ch`, split into a `*.log.ct.ipng.ch` for the
|
||||
write-path, and `*.mon.ct.ipng.ch` for the read-path.
|
||||
|
||||
We are submitting our log for inclusion in the approved log lists for Google Chrome and Apple
|
||||
Safari. Following 90 days of successful monitoring, we anticipate our log will be added to these
|
||||
trusted lists and that change will propagate to people’s browsers with subsequent browser version
|
||||
releases.
|
||||
|
||||
We operate two popular implementations of Static Certificate Transparency software.
|
||||
|
||||
## Sunlight
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/sunlight-logo.png" alt="sunlight logo" >}}
|
||||
|
||||
[[Sunlight](https://sunlight.dev)] was designed by Filippo Valsorda for the needs of the WebPKI
|
||||
community, through the feedback of many of its members, and in particular of the Sigsum, Google
|
||||
TrustFabric, and ISRG teams. It is partially based on the Go Checksum Database. Sunlight's
|
||||
development was sponsored by Let's Encrypt.
|
||||
|
||||
Our Sunlight logs:
|
||||
* A staging log called [[Rennet](https://rennet2025h2.log.ct.ipng.ch/)], incepted 2025-07-28,
|
||||
starting from temporal shard `rennet2025h2`.
|
||||
* A production log called [[Gouda](https://gouda2025h2.log.ct.ipng.ch/)], incepted 2025-07-30,
|
||||
starting from temporal shard `gouda2025h2`.
|
||||
|
||||
## TesseraCT
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="tesseract logo" >}}
|
||||
|
||||
[[TesseraCT](https://github.com/transparency-dev/tesseract)] is a Certificate Transparency (CT) log
|
||||
implementation by the TrustFabric team at Google. It was built to allow log operators to run
|
||||
production static-ct-api CT logs starting with temporal shards covering 2026 onwards, as the
|
||||
successor to Trillian's CTFE.
|
||||
|
||||
Our TesseraCT logs:
|
||||
* A staging log called [[Lipase](https://lipase2025h2.log.ct.ipng.ch/)], incepted 2025-08-22,
|
||||
starting from temporal shard `lipase2025h2`.
|
||||
* A production log called [[Halloumi](https://halloumi2025h2.log.ct.ipng.ch/)], incepted 2025-08-24,
|
||||
starting from temporal shard `halloumi2025h2`.
|
||||
* Shard `halloumi2026h2` incorporated incorrect data into its Merkle Tree at entry 4357956 and
|
||||
4552365, due to a [[TesseraCT bug](https://github.com/transparency-dev/tesseract/issues/553)]
|
||||
and was retired on 2025-09-08, to be replaced by temporal shard `halloumi2026h2a`.
|
||||
|
||||
## Operational Details
|
||||
|
||||
You can read more details about our infrastructure on:
|
||||
* **[[TesseraCT]({{< ref 2025-07-26-ctlog-1 >}})]** - published on 2025-07-26.
|
||||
* **[[Sunlight]({{< ref 2025-08-10-ctlog-2 >}})]** - published on 2025-08-10.
|
||||
* **[[Operations]({{< ref 2025-08-24-ctlog-3 >}})]** - published on 2025-08-24.
|
||||
|
||||
The operators of this infrastructure are **Antonis Chariton**, **Jeroen Massar** and **Pim van Pelt**. \
|
||||
You can reach us via e-mail at [[<ct-ops@ipng.ch>](mailto:ct-ops@ipng.ch)].
|
||||
|
||||
36
hugo.toml
36
hugo.toml
@@ -1,36 +0,0 @@
|
||||
baseURL = 'https://ipng.ch/'
|
||||
languageCode = 'en-us'
|
||||
title = "IPng Networks"
|
||||
theme = 'hugo-theme-ipng'
|
||||
|
||||
mainSections = ["articles"]
|
||||
# disqusShortname = "example"
|
||||
paginate = 4
|
||||
|
||||
[params]
|
||||
author = "IPng Networks GmbH"
|
||||
siteHeading = "IPng Networks"
|
||||
favicon = "favicon.ico" # Adds a small icon next to the page title in a tab
|
||||
showBlogLatest = false
|
||||
mainSections = ["articles"]
|
||||
showTaxonomyLinks = false
|
||||
nBlogLatest = 14 # number of blog post om the home page
|
||||
Paginate = 30
|
||||
blogLatestHeading = "Latest Dabblings"
|
||||
footer = "Copyright 2021- IPng Networks GmbH, all rights reserved"
|
||||
|
||||
[params.social]
|
||||
email = "info+www@ipng.ch"
|
||||
mastodon = "IPngNetworks"
|
||||
twitter = "IPngNetworks"
|
||||
linkedin = "pimvanpelt"
|
||||
instagram = "IPngNetworks"
|
||||
|
||||
[taxonomies]
|
||||
year = "year"
|
||||
month = "month"
|
||||
tags = "tags"
|
||||
categories = "categories"
|
||||
|
||||
[permalinks]
|
||||
articles = "/s/articles/:year/:month/:day/:slug"
|
||||
38
hugo.yaml
Normal file
38
hugo.yaml
Normal file
@@ -0,0 +1,38 @@
|
||||
baseURL: 'https://ipng.ch/'
|
||||
languageCode: 'en-us'
|
||||
title: "IPng Networks"
|
||||
theme: 'hugo-theme-ipng'
|
||||
|
||||
mainSections: ["articles"]
|
||||
|
||||
params:
|
||||
author: "IPng Networks GmbH"
|
||||
siteHeading: "IPng Networks"
|
||||
favicon: "favicon.ico"
|
||||
showBlogLatest: false
|
||||
mainSections: ["articles"]
|
||||
showTaxonomyLinks: false
|
||||
nBlogLatest: 14 # number of blog posts on the home page
|
||||
Paginate: 30
|
||||
blogLatestHeading: "Latest Dabblings"
|
||||
footer: "Copyright 2021- IPng Networks GmbH, all rights reserved"
|
||||
|
||||
social:
|
||||
email: "info+www@ipng.ch"
|
||||
mastodon: "@IPngNetworks"
|
||||
twitter: "IPngNetworks"
|
||||
linkedin: "pimvanpelt"
|
||||
github: "pimvanpelt"
|
||||
instagram: "IPngNetworks"
|
||||
rss: true
|
||||
|
||||
taxonomies:
|
||||
year: "year"
|
||||
month: "month"
|
||||
tags: "tags"
|
||||
categories: "categories"
|
||||
|
||||
permalinks:
|
||||
articles: "/s/articles/:year/:month/:day/:slug"
|
||||
|
||||
ignoreLogs: [ "warning-goldmark-raw-html" ]
|
||||
5
static/.well-known/security.txt
Normal file
5
static/.well-known/security.txt
Normal file
@@ -0,0 +1,5 @@
|
||||
Canonical: https://ipng.ch/.well-known/security.txt
|
||||
Expires: 2026-01-01T00:00:00.000Z
|
||||
Contact: mailto:info@ipng.ch
|
||||
Contact: https://ipng.ch/s/contact/
|
||||
Preferred-Languages: en, nl, de
|
||||
55
static/app/go/index.html
Normal file
55
static/app/go/index.html
Normal file
@@ -0,0 +1,55 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-us">
|
||||
<head>
|
||||
<title>Javascript Redirector for RFID / NFC / nTAG</title>
|
||||
<meta name="robots" content="noindex,nofollow">
|
||||
<meta charset="utf-8">
|
||||
<script type="text/JavaScript">
|
||||
|
||||
const ntag_list = [
|
||||
"/s/articles/2021/09/21/vpp-linux-cp-part7/",
|
||||
"/s/articles/2021/12/23/vpp-linux-cp-virtual-machine-playground/",
|
||||
"/s/articles/2022/01/12/case-study-virtual-leased-line-vll-in-vpp/",
|
||||
"/s/articles/2022/02/14/case-study-vlan-gymnastics-with-vpp/",
|
||||
"/s/articles/2022/03/27/vpp-configuration-part1/",
|
||||
"/s/articles/2022/10/14/vpp-lab-setup/",
|
||||
"/s/articles/2023/03/11/case-study-centec-mpls-core/",
|
||||
"/s/articles/2023/04/09/vpp-monitoring/",
|
||||
"/s/articles/2023/05/28/vpp-mpls-part-4/",
|
||||
"/s/articles/2023/11/11/debian-on-mellanox-sn2700-32x100g/",
|
||||
"/s/articles/2023/12/17/debian-on-ipngs-vpp-routers/",
|
||||
"/s/articles/2024/01/27/vpp-python-api/",
|
||||
"/s/articles/2024/02/10/vpp-on-freebsd-part-1/",
|
||||
"/s/articles/2024/03/06/vpp-with-babel-part-1/",
|
||||
"/s/articles/2024/04/06/vpp-with-loopback-only-ospfv3-part-1/",
|
||||
"/s/articles/2024/04/27/freeix-remote/"
|
||||
];
|
||||
|
||||
var redir_url = "https://ipng.ch/";
|
||||
var key = window.location.hash.slice(1);
|
||||
if (key.startsWith("ntag")) {
|
||||
let week = Math.round(new Date().getTime() / 1000 / (7*24*3600)); // seconds per week
|
||||
let num = parseInt(key.slice(-2));
|
||||
let idx = (num + week) % ntag_list.length;
|
||||
console.log("(ntag " + num + " + week number " + week + ") % " + ntag_list.length + " = " + idx);
|
||||
redir_url = ntag_list[idx];
|
||||
}
|
||||
|
||||
console.log("Redirecting to " + redir_url + " - off you go!");
|
||||
window.location = redir_url;
|
||||
</script>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<pre>
|
||||
Usage: https://ipng.ch/app/go/#<key>
|
||||
Example: <a href="/app/go/#ntag00">#ntag00</a>
|
||||
|
||||
Also, this page requires javascript.
|
||||
|
||||
Love,
|
||||
IPng Networks.
|
||||
</pre>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
1
static/assets/containerlab/containerlab.svg
Normal file
1
static/assets/containerlab/containerlab.svg
Normal file
File diff suppressed because one or more lines are too long
|
After Width: | Height: | Size: 21 KiB |
BIN
static/assets/containerlab/learn-vpp.png
LFS
Normal file
BIN
static/assets/containerlab/learn-vpp.png
LFS
Normal file
Binary file not shown.
1270
static/assets/containerlab/vpp-containerlab.cast
Normal file
1270
static/assets/containerlab/vpp-containerlab.cast
Normal file
File diff suppressed because it is too large
Load Diff
1
static/assets/ctlog/MPLS Backbone - CTLog.svg
Normal file
1
static/assets/ctlog/MPLS Backbone - CTLog.svg
Normal file
File diff suppressed because one or more lines are too long
|
After Width: | Height: | Size: 147 KiB |
BIN
static/assets/ctlog/btop-sunlight.png
LFS
Normal file
BIN
static/assets/ctlog/btop-sunlight.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-loadtest1.png
LFS
Normal file
BIN
static/assets/ctlog/ctlog-loadtest1.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-loadtest2.png
LFS
Normal file
BIN
static/assets/ctlog/ctlog-loadtest2.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-loadtest3.png
LFS
Normal file
BIN
static/assets/ctlog/ctlog-loadtest3.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-logo-ipng.png
LFS
Normal file
BIN
static/assets/ctlog/ctlog-logo-ipng.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/lipase.png
LFS
Normal file
BIN
static/assets/ctlog/lipase.png
LFS
Normal file
Binary file not shown.
164
static/assets/ctlog/minio-results.txt
Normal file
164
static/assets/ctlog/minio-results.txt
Normal file
@@ -0,0 +1,164 @@
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4M
|
||||
Loop 1: PUT time 60.0 secs, objects = 813, speed = 54.2MB/sec, 13.5 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 23168, speed = 1.5GB/sec, 386.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.2 secs, 371.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
2025/07/20 16:07:25 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FACEBAC4D052, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 1221, speed = 20.3MB/sec, 20.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 31000, speed = 516.7MB/sec, 516.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 3.2 secs, 376.5 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
2025/07/20 16:09:29 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FAEB70060604, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 3353, speed = 447KB/sec, 55.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 45913, speed = 6MB/sec, 765.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 9.3 secs, 361.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4k
|
||||
2025/07/20 16:11:38 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB098B162788, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 3404, speed = 226.9KB/sec, 56.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 45230, speed = 2.9MB/sec, 753.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 9.4 secs, 362.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
|
||||
2025/07/20 16:13:47 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB27AE890E75, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 1898, speed = 126.4MB/sec, 31.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 185034, speed = 12GB/sec, 3083.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.4 secs, 4267.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
2025/07/20 16:15:48 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB43C0386015, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.2 secs, objects = 2627, speed = 43.7MB/sec, 43.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 327959, speed = 5.3GB/sec, 5465.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.6 secs, 4045.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
|
||||
2025/07/20 16:17:49 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB5FE2012590, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 6663, speed = 887.7KB/sec, 111.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 459962, speed = 59.9MB/sec, 7666.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 3890.9 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
|
||||
2025/07/20 16:19:50 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB7C3CF0FFCA, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 6673, speed = 444.4KB/sec, 111.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 444637, speed = 28.9MB/sec, 7410.5 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.5 secs, 4411.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
|
||||
2025/07/20 16:21:52 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB988DB60881, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.2 secs, objects = 3093, speed = 205.5MB/sec, 51.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 168750, speed = 11GB/sec, 2811.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.3 secs, 9112.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=1M
|
||||
2025/07/20 16:23:53 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FBB4A1E534DE, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.2 secs, objects = 4652, speed = 77.2MB/sec, 77.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 351187, speed = 5.7GB/sec, 5852.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.6 secs, 8141.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=8k
|
||||
2025/07/20 16:25:54 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FBD0C4764C64, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 14497, speed = 1.9MB/sec, 241.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 457437, speed = 59.6MB/sec, 7623.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 8353.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
|
||||
2025/07/20 16:27:55 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FBED210B0792, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 14459, speed = 962.6KB/sec, 240.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 466680, speed = 30.4MB/sec, 7777.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 8605.3 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4M
|
||||
Loop 1: PUT time 60.0 secs, objects = 1866, speed = 124.4MB/sec, 31.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 16400, speed = 1.1GB/sec, 273.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 5.1 secs, 369.3 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
2025/07/20 16:32:02 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC25AE815718, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 5459, speed = 91MB/sec, 91.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 25090, speed = 418.2MB/sec, 418.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 14.8 secs, 369.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
2025/07/20 16:34:17 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC4514A78873, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 22278, speed = 2.9MB/sec, 371.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 40626, speed = 5.3MB/sec, 677.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 61.6 secs, 361.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4k
|
||||
2025/07/20 16:37:19 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC6F629ACFAC, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 23394, speed = 1.5MB/sec, 389.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 39249, speed = 2.6MB/sec, 654.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 64.5 secs, 363.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
|
||||
2025/07/20 16:40:23 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC9A5D101971, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 10564, speed = 704.1MB/sec, 176.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 20682, speed = 1.3GB/sec, 344.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.5 secs, 4178.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
2025/07/20 16:42:26 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FCB6EB0A45D9, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 26550, speed = 442.4MB/sec, 442.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 124810, speed = 2GB/sec, 2080.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 6.6 secs, 4049.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
|
||||
2025/07/20 16:44:32 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FCD4684A110E, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 129363, speed = 16.8MB/sec, 2155.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 423956, speed = 55.2MB/sec, 7065.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 32.4 secs, 3992.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
|
||||
2025/07/20 16:47:05 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FCF7EA4857CF, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 123067, speed = 8MB/sec, 2051.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 357694, speed = 23.3MB/sec, 5961.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 30.9 secs, 3986.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
|
||||
2025/07/20 16:49:36 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD1B12EFDEBC, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 13131, speed = 873.3MB/sec, 218.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.1 secs, objects = 18630, speed = 1.2GB/sec, 310.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 7787.5 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=1M
|
||||
2025/07/20 16:51:38 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD3779E97644, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 40226, speed = 669.8MB/sec, 669.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 85692, speed = 1.4GB/sec, 1427.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 4.7 secs, 8610.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=8k
|
||||
2025/07/20 16:53:42 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD5489FB2F1F, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 230985, speed = 30.1MB/sec, 3849.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 435703, speed = 56.7MB/sec, 7261.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 25.8 secs, 8945.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
|
||||
2025/07/20 16:56:08 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD7683B9BB96, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 228647, speed = 14.9MB/sec, 3810.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 452412, speed = 29.5MB/sec, 7539.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 27.2 secs, 8418.0 deletes/sec. Slowdowns = 0
|
||||
BIN
static/assets/ctlog/minio_8kb_performance.png
LFS
Normal file
BIN
static/assets/ctlog/minio_8kb_performance.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/nsa_slide.jpg
LFS
Normal file
BIN
static/assets/ctlog/nsa_slide.jpg
LFS
Normal file
Binary file not shown.
80
static/assets/ctlog/seaweedfs-results.txt
Normal file
80
static/assets/ctlog/seaweedfs-results.txt
Normal file
@@ -0,0 +1,80 @@
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
Loop 1: PUT time 60.0 secs, objects = 1994, speed = 33.2MB/sec, 33.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 29243, speed = 487.4MB/sec, 487.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.8 secs, 701.4 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
Loop 1: PUT time 60.0 secs, objects = 13634, speed = 1.8MB/sec, 227.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 32284, speed = 4.2MB/sec, 538.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 18.7 secs, 727.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
Loop 1: PUT time 62.0 secs, objects = 23733, speed = 382.8MB/sec, 382.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 132708, speed = 2.2GB/sec, 2211.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 3.7 secs, 6490.1 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
|
||||
Loop 1: PUT time 60.0 secs, objects = 199925, speed = 26MB/sec, 3331.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 309937, speed = 40.4MB/sec, 5165.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 31.2 secs, 6406.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
Loop 1: PUT time 60.0 secs, objects = 1975, speed = 32.9MB/sec, 32.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 29898, speed = 498.3MB/sec, 498.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.7 secs, 726.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
Loop 1: PUT time 60.0 secs, objects = 13662, speed = 1.8MB/sec, 227.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 31865, speed = 4.1MB/sec, 531.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 18.8 secs, 726.9 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
Loop 1: PUT time 60.0 secs, objects = 26622, speed = 443.6MB/sec, 443.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 117688, speed = 1.9GB/sec, 1961.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 4.1 secs, 6499.5 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
Loop 1: PUT time 60.0 secs, objects = 198238, speed = 25.8MB/sec, 3303.9 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 312868, speed = 40.7MB/sec, 5214.3 operations/sec. Slowdowns = 0
Loop 1: DELETE time 30.8 secs, 6432.7 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
Loop 1: PUT time 60.1 secs, objects = 6220, speed = 414.2MB/sec, 103.6 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 38773, speed = 2.5GB/sec, 646.1 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.9 secs, 6693.3 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 203033, speed = 13.2MB/sec, 3383.8 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 300824, speed = 19.6MB/sec, 5013.6 operations/sec. Slowdowns = 0
Loop 1: DELETE time 31.1 secs, 6528.6 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
Loop 1: PUT time 60.3 secs, objects = 13181, speed = 874.2MB/sec, 218.6 operations/sec. Slowdowns = 0
Loop 1: GET time 60.1 secs, objects = 18575, speed = 1.2GB/sec, 309.3 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.8 secs, 17547.2 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 495006, speed = 32.2MB/sec, 8249.5 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 465947, speed = 30.3MB/sec, 7765.4 operations/sec. Slowdowns = 0
Loop 1: DELETE time 41.4 secs, 11961.3 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
Loop 1: PUT time 60.1 secs, objects = 7073, speed = 471MB/sec, 117.8 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 31248, speed = 2GB/sec, 520.7 operations/sec. Slowdowns = 0
Loop 1: DELETE time 1.1 secs, 6576.1 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 214387, speed = 14MB/sec, 3573.0 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 297586, speed = 19.4MB/sec, 4959.7 operations/sec. Slowdowns = 0
Loop 1: DELETE time 32.9 secs, 6519.8 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
Loop 1: PUT time 60.1 secs, objects = 14365, speed = 956MB/sec, 239.0 operations/sec. Slowdowns = 0
Loop 1: GET time 60.1 secs, objects = 18113, speed = 1.2GB/sec, 301.6 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.8 secs, 18655.8 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 489736, speed = 31.9MB/sec, 8161.8 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 460296, speed = 30MB/sec, 7671.2 operations/sec. Slowdowns = 0
Loop 1: DELETE time 41.0 secs, 11957.6 deletes/sec. Slowdowns = 0
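A quick sanity check on the numbers above: the reported MB/sec is simply operations/sec multiplied by the object size of that run. A minimal sketch in Python (values copied from the 8-thread, size=8k PUT run against minio-ssd; the 1024 divisor assumes the tool reports binary megabytes):

    # Cross-check: throughput = operations/sec * object size
    ops_per_sec = 3303.9        # reported operations/sec for the size=8k PUT run
    object_size_kib = 8         # size=8k, i.e. 8 KiB per object
    mb_per_sec = ops_per_sec * object_size_kib / 1024
    print(f"{mb_per_sec:.1f} MB/sec")   # prints 25.8 MB/sec, lining up with the reported speed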
116   static/assets/ctlog/seaweedfs.docker-compose.yml   Normal file
@@ -0,0 +1,116 @@
# Test Setup for SeaweedFS with 6 disks, a Filer and an S3 API
#
# Use with the following .env file
# root@minio-ssd:~# cat /opt/seaweedfs/.env
# AWS_ACCESS_KEY_ID="hottentotten"
# AWS_SECRET_ACCESS_KEY="tentententoonstelling"

services:
  # Master
  master0:
    image: chrislusf/seaweedfs
    ports:
      - 9333:9333
      - 19333:19333
    command: "-v=1 master -volumeSizeLimitMB 100 -resumeState=false -ip=master0 -ip.bind=0.0.0.0 -port=9333 -mdir=/var/lib/seaweedfs/master"
    volumes:
      - ./data/master0:/var/lib/seaweedfs/master
    restart: unless-stopped

  # Volume Server 1
  volume1:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8081 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume1'
    volumes:
      - /data/disk1:/var/lib/seaweedfs/volume1
    depends_on:
      - master0
    restart: unless-stopped

  # Volume Server 2
  volume2:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8082 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume2'
    volumes:
      - /data/disk2:/var/lib/seaweedfs/volume2
    depends_on:
      - master0
    restart: unless-stopped

  # Volume Server 3
  volume3:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8083 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume3'
    volumes:
      - /data/disk3:/var/lib/seaweedfs/volume3
    depends_on:
      - master0
    restart: unless-stopped

  # Volume Server 4
  volume4:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8084 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume4'
    volumes:
      - /data/disk4:/var/lib/seaweedfs/volume4
    depends_on:
      - master0
    restart: unless-stopped

  # Volume Server 5
  volume5:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8085 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume5'
    volumes:
      - /data/disk5:/var/lib/seaweedfs/volume5
    depends_on:
      - master0
    restart: unless-stopped

  # Volume Server 6
  volume6:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8086 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume6'
    volumes:
      - /data/disk6:/var/lib/seaweedfs/volume6
    depends_on:
      - master0
    restart: unless-stopped

  # Filer
  filer:
    image: chrislusf/seaweedfs
    ports:
      - 8888:8888
      - 18888:18888
    command: 'filer -defaultReplicaPlacement=002 -iam -master="master0:9333"'
    volumes:
      - ./data/filer:/data
    depends_on:
      - master0
      - volume1
      - volume2
      - volume3
      - volume4
      - volume5
      - volume6
    restart: unless-stopped

  # S3 API
  s3:
    image: chrislusf/seaweedfs
    ports:
      - 8333:8333
    command: 's3 -filer="filer:8888" -ip.bind=0.0.0.0'
    env_file:
      - .env
    depends_on:
      - master0
      - volume1
      - volume2
      - volume3
      - volume4
      - volume5
      - volume6
      - filer
    restart: unless-stopped
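With the stack above running, the S3 gateway listens on port 8333 using the credentials from the example .env file. A minimal sketch of exercising it from Python with boto3 (the endpoint host matches the benchmark URLs above; the bucket and object names are made up for illustration):

    import boto3

    # Ordinary S3 client pointed at the SeaweedFS S3 gateway from the compose file above.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio-ssd:8333",
        aws_access_key_id="hottentotten",
        aws_secret_access_key="tentententoonstelling",
        region_name="us-east-1",
    )

    s3.create_bucket(Bucket="test-bucket")
    s3.put_object(Bucket="test-bucket", Key="hello.txt", Body=b"hello, seaweedfs")
    print(s3.get_object(Bucket="test-bucket", Key="hello.txt")["Body"].read())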
BIN   static/assets/ctlog/size_comparison_8t.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/ctlog/stop-hammer-time.jpg   LFS   Normal file   Binary file not shown.
BIN   static/assets/ctlog/sunlight-logo.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/ctlog/sunlight-test-s3.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/ctlog/tesseract-logo.png   LFS   Normal file   Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
BIN   static/assets/freeix/freeix-artist-rendering.png   LFS   Normal file   Binary file not shown.
1   static/assets/frys-ix/FrysIX_ Topology (concept).svg   Normal file   File diff suppressed because one or more lines are too long   After: 90 KiB
BIN   static/assets/frys-ix/IXR-7220-D3.jpg   LFS   Normal file   Binary file not shown.
1   static/assets/frys-ix/Nokia Arista VXLAN.svg   Normal file   File diff suppressed because one or more lines are too long   After: 166 KiB
169   static/assets/frys-ix/arista-leaf.conf   Normal file
@@ -0,0 +1,169 @@
no aaa root
!
hardware counter feature vtep decap
hardware counter feature vtep encap
!
service routing protocols model multi-agent
!
hostname arista-leaf
!
router l2-vpn
   arp learning bridged
!
spanning-tree mode mstp
!
system l1
   unsupported speed action error
   unsupported error-correction action error
!
vlan 2604
   name v-peeringlan
!
interface Ethernet1/1
!
interface Ethernet2/1
!
interface Ethernet3/1
!
interface Ethernet4/1
!
interface Ethernet5/1
!
interface Ethernet6/1
!
interface Ethernet7/1
!
interface Ethernet8/1
!
interface Ethernet9/1
   shutdown
   speed forced 10000full
!
interface Ethernet9/2
   shutdown
!
interface Ethernet9/3
   speed forced 10000full
   switchport access vlan 2604
!
interface Ethernet9/4
   shutdown
!
interface Ethernet10/1
!
interface Ethernet10/2
   shutdown
!
interface Ethernet10/4
   shutdown
!
interface Ethernet11/1
!
interface Ethernet12/1
!
interface Ethernet13/1
!
interface Ethernet14/1
!
interface Ethernet15/1
!
interface Ethernet16/1
!
interface Ethernet17/1
!
interface Ethernet18/1
!
interface Ethernet19/1
!
interface Ethernet20/1
!
interface Ethernet21/1
!
interface Ethernet22/1
!
interface Ethernet23/1
!
interface Ethernet24/1
!
interface Ethernet25/1
!
interface Ethernet26/1
!
interface Ethernet27/1
!
interface Ethernet28/1
!
interface Ethernet29/1
   no switchport
!
interface Ethernet30/1
   load-interval 1
   mtu 9190
   no switchport
   ip address 198.19.17.10/31
   ip ospf cost 10
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet31/1
   load-interval 1
   mtu 9190
   no switchport
   ip address 198.19.17.3/31
   ip ospf cost 1000
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet32/1
   load-interval 1
   mtu 9190
   no switchport
   ip address 198.19.17.5/31
   ip ospf cost 1000
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Loopback0
   ip address 198.19.16.2/32
   ip ospf area 0.0.0.0
!
interface Loopback1
   ip address 198.19.18.2/32
!
interface Management1
   ip address dhcp
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   vxlan vlan 2604 vni 2604
!
ip routing
!
ip route 0.0.0.0/0 Management1 10.75.8.1
!
router bgp 65500
   neighbor evpn peer group
   neighbor evpn remote-as 65500
   neighbor evpn update-source Loopback0
   neighbor evpn ebgp-multihop 3
   neighbor evpn send-community extended
   neighbor evpn maximum-routes 12000 warning-only
   neighbor 198.19.16.0 peer group evpn
   neighbor 198.19.16.1 peer group evpn
   !
   vlan 2604
      rd 65500:2604
      route-target both 65500:2604
      redistribute learned
   !
   address-family evpn
      neighbor evpn activate
!
router ospf 65500
   router-id 198.19.16.2
   redistribute connected
   network 198.19.0.0/16 area 0.0.0.0
   max-lsa 12000
!
end
90   static/assets/frys-ix/equinix.conf   Normal file
@@ -0,0 +1,90 @@
set / interface ethernet-1/1 admin-state disable
set / interface ethernet-1/9 admin-state enable
set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
set / interface ethernet-1/9/3 admin-state enable
set / interface ethernet-1/9/3 vlan-tagging true
set / interface ethernet-1/9/3 subinterface 0 type bridged
set / interface ethernet-1/9/3 subinterface 0 admin-state enable
set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
set / interface ethernet-1/29 admin-state enable
set / interface ethernet-1/29 subinterface 0 type routed
set / interface ethernet-1/29 subinterface 0 admin-state enable
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.0/31
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
set / interface lo0 admin-state enable
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.0/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.0/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc2 role: spine"
set / network-instance default router-id 198.19.16.0
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/29.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.0
set / network-instance default protocols bgp dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay route-reflector client true
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.0
set / network-instance default protocols bgp neighbor 198.19.16.1 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.1 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.0
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9/3.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
BIN   static/assets/frys-ix/frysix-logo-small.png   LFS   Normal file   Binary file not shown.
132   static/assets/frys-ix/nikhef.conf   Normal file
@@ -0,0 +1,132 @@
set / interface ethernet-1/1 admin-state enable
set / interface ethernet-1/1 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/1 subinterface 0 type routed
set / interface ethernet-1/1 subinterface 0 admin-state enable
set / interface ethernet-1/1 subinterface 0 ip-mtu 9190
set / interface ethernet-1/1 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/1 subinterface 0 ipv4 address 198.19.17.2/31
set / interface ethernet-1/1 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/2 admin-state enable
set / interface ethernet-1/2 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/2 subinterface 0 type routed
set / interface ethernet-1/2 subinterface 0 admin-state enable
set / interface ethernet-1/2 subinterface 0 ip-mtu 9190
set / interface ethernet-1/2 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/2 subinterface 0 ipv4 address 198.19.17.4/31
set / interface ethernet-1/2 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/3 admin-state enable
set / interface ethernet-1/3 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/3 subinterface 0 type routed
set / interface ethernet-1/3 subinterface 0 admin-state enable
set / interface ethernet-1/3 subinterface 0 ip-mtu 9190
set / interface ethernet-1/3 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/3 subinterface 0 ipv4 address 198.19.17.6/31
set / interface ethernet-1/3 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/4 admin-state enable
set / interface ethernet-1/4 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/4 subinterface 0 type routed
set / interface ethernet-1/4 subinterface 0 admin-state enable
set / interface ethernet-1/4 subinterface 0 ip-mtu 9190
set / interface ethernet-1/4 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/4 subinterface 0 ipv4 address 198.19.17.8/31
set / interface ethernet-1/4 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/9 admin-state enable
set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
set / interface ethernet-1/9/1 admin-state disable
set / interface ethernet-1/9/2 admin-state disable
set / interface ethernet-1/9/3 admin-state enable
set / interface ethernet-1/9/3 vlan-tagging true
set / interface ethernet-1/9/3 subinterface 0 type bridged
set / interface ethernet-1/9/3 subinterface 0 admin-state enable
set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
set / interface ethernet-1/9/4 admin-state disable
set / interface ethernet-1/29 admin-state enable
set / interface ethernet-1/29 subinterface 0 type routed
set / interface ethernet-1/29 subinterface 0 admin-state enable
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
set / interface lo0 admin-state enable
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc1 role: spine"
set / network-instance default router-id 198.19.16.1
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/1.0
set / network-instance default interface ethernet-1/2.0
set / network-instance default interface ethernet-1/29.0
set / network-instance default interface ethernet-1/3.0
set / network-instance default interface ethernet-1/4.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.1
set / network-instance default protocols bgp dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay route-reflector client true
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.1
set / network-instance default protocols bgp neighbor 198.19.16.0 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.0 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.1
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/1.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/2.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/3.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/4.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0 passive true
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9/3.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
BIN   static/assets/frys-ix/nokia-7220-d2.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/frys-ix/nokia-7220-d4.png   LFS   Normal file   Binary file not shown.
105   static/assets/frys-ix/nokia-leaf.conf   Normal file
@@ -0,0 +1,105 @@
set / interface ethernet-1/9 admin-state enable
set / interface ethernet-1/9 vlan-tagging true
set / interface ethernet-1/9 ethernet port-speed 10G
set / interface ethernet-1/9 subinterface 0 type bridged
set / interface ethernet-1/9 subinterface 0 admin-state enable
set / interface ethernet-1/9 subinterface 0 vlan encap untagged
set / interface ethernet-1/53 admin-state enable
set / interface ethernet-1/53 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/53 subinterface 0 admin-state enable
set / interface ethernet-1/53 subinterface 0 ip-mtu 9190
set / interface ethernet-1/53 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/53 subinterface 0 ipv4 address 198.19.17.11/31
set / interface ethernet-1/53 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/55 admin-state enable
set / interface ethernet-1/55 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/55 subinterface 0 admin-state enable
set / interface ethernet-1/55 subinterface 0 ip-mtu 9190
set / interface ethernet-1/55 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/55 subinterface 0 ipv4 address 198.19.17.7/31
set / interface ethernet-1/55 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/56 admin-state enable
set / interface ethernet-1/56 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/56 subinterface 0 admin-state enable
set / interface ethernet-1/56 subinterface 0 ip-mtu 9190
set / interface ethernet-1/56 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/56 subinterface 0 ipv4 address 198.19.17.9/31
set / interface ethernet-1/56 subinterface 0 ipv6 admin-state enable
set / interface lo0 admin-state enable
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.3/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.3/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc1 role: leaf"
set / network-instance default router-id 198.19.16.3
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/53.0
set / network-instance default interface ethernet-1/55.0
set / network-instance default interface ethernet-1/56.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.3
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.3
set / network-instance default protocols bgp neighbor 198.19.16.0 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.0 peer-group overlay
set / network-instance default protocols bgp neighbor 198.19.16.1 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.1 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.3
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/53.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/55.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/56.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0 passive true
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
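The equinix.conf, nikhef.conf and nokia-leaf.conf files above are flat lists of SR Linux "set /" commands, which makes them easy to compare mechanically. A minimal sketch in Python (the file names are simply the examples from this diff, assumed to be saved locally):

    # Structural diff of two flat 'set /' configs, e.g. the two spines above.
    def load_set_lines(path: str) -> set[str]:
        with open(path) as f:
            return {line.strip() for line in f if line.strip().startswith("set / ")}

    if __name__ == "__main__":
        a = load_set_lines("nikhef.conf")    # dc1 spine
        b = load_set_lines("equinix.conf")   # dc2 spine
        for line in sorted(a - b):
            print("only in nikhef.conf: ", line)
        for line in sorted(b - a):
            print("only in equinix.conf:", line)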
BIN   static/assets/jekyll-hugo/before.png   LFS   Normal file   Binary file not shown.
7   static/assets/jekyll-hugo/hugo-logo-wide.svg   Normal file
@@ -0,0 +1,7 @@
<svg xmlns="http://www.w3.org/2000/svg" fill-rule="evenodd" stroke-width="27" aria-label="Logo" viewBox="0 0 1493 391">
<path fill="#ebb951" stroke="#fcd804" d="M1345.211 24.704l112.262 64.305a43 43 0 0 1 21.627 37.312v142.237a40 40 0 0 1-20.702 35.037l-120.886 66.584a42 42 0 0 1-41.216-.389l-106.242-61.155a57 57 0 0 1-28.564-49.4V138.71a64 64 0 0 1 31.172-54.939l98.01-58.564a54 54 0 0 1 54.54-.503z"/>
<path fill="#33ba91" stroke="#00a88a" d="M958.07 22.82l117.31 66.78a41 41 0 0 1 20.72 35.64v139.5a45 45 0 0 1-23.1 39.32L955.68 369.4a44 44 0 0 1-43.54-.41l-105.82-61.6a56 56 0 0 1-27.83-48.4V140.07a68 68 0 0 1 33.23-58.44l98.06-58.35a48 48 0 0 1 48.3-.46z"/>
<path fill="#0594cb" stroke="#0083c0" d="M575.26 20.97l117.23 68.9a40 40 0 0 1 19.73 34.27l.73 138.67a48 48 0 0 1-24.64 42.2l-115.13 64.11a45 45 0 0 1-44.53-.42l-105.83-61.6a55 55 0 0 1-27.33-47.53V136.52a63 63 0 0 1 29.87-53.59l99.3-61.4a49 49 0 0 1 50.6-.56z"/>
<path fill="#ff4088" stroke="#c9177e" d="M195.81 24.13l114.41 66.54a44 44 0 0 1 21.88 38.04v136.43a48 48 0 0 1-24.45 41.82L194.1 370.9a49 49 0 0 1-48.48-.23L41.05 310.48a53 53 0 0 1-26.56-45.93V135.08a55 55 0 0 1 26.1-46.8l102.8-63.46a51 51 0 0 1 52.42-.69z"/>
<path fill="#fff" d="M1320.72 89.15c58.79 0 106.52 47.73 106.52 106.51 0 58.8-47.73 106.52-106.52 106.52-58.78 0-106.52-47.73-106.52-106.52 0-58.78 47.74-106.51 106.52-106.51zm0 39.57c36.95 0 66.94 30 66.94 66.94a66.97 66.97 0 0 1-66.94 66.94c-36.95 0-66.94-29.99-66.94-66.94a66.97 66.97 0 0 1 66.93-66.94h.01zm-283.8 65.31c0 47.18-8.94 60.93-26.81 80.58-17.87 19.65-41.57 27.57-71.1 27.57-27 0-48.75-9.58-67.61-26.23-20.88-18.45-36.08-47.04-36.08-78.95 0-31.37 11.72-58.48 32.49-78.67 18.22-17.67 45.34-29.18 73.3-29.18 33.77 0 68.83 15.98 90.44 47.53l-31.73 26.82c-13.45-25.03-32.94-33.46-60.82-34.26-30.83-.88-64.77 28.53-62.25 67.75 1.4 21.94 11.65 59.65 60.96 66.57 25.9 3.63 55.36-24.02 55.36-39.04H944.4v-37.5h92.5V194l.02.03zm-562.6-94.65h42.29v112.17c0 17.8.49 29.33 1.47 34.61 1.69 8.48 4.81 14.37 11.17 19.5 6.37 5.13 13.8 6.59 24.84 6.59 11.2 0 14.96-1.74 20.66-6.6 5.69-4.85 9.12-9.46 10.28-16.53 1.15-7.07 3.07-18.8 3.07-35.18V99.38h42.28v108.78c0 24.86-1.07 42.43-3.21 52.69-2.14 10.27-6.08 18.93-11.82 26-5.74 7.06-13.42 12.69-23.03 16.88-9.62 4.19-22.16 6.28-37.65 6.28-18.7 0-32.87-2.28-42.52-6.85-9.66-4.57-17.3-10.5-22.9-17.8-5.61-7.3-9.3-14.95-11.08-22.96-2.58-11.86-3.88-29.38-3.88-52.55V99.38h.03zM93.91 299.92V92.7h43.35v75.48h71.92V92.7h43.48v207.22h-43.48v-90.61h-71.92v90.61z"/>
</svg>
After: 2.5 KiB
BIN   static/assets/jekyll-hugo/jekyll-logo.png   LFS   Normal file   Binary file not shown.
83   static/assets/logo/logo-red.svg   Normal file   File diff suppressed because one or more lines are too long   After: 16 KiB
BIN   static/assets/logo/logo-white-1000px.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/logo/logo-white-100px.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/logo/logo-white-2000px.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/logo/logo-white-200px.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/logo/logo-white-400px.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/minio/console-1.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/minio/console-2.png   LFS   Normal file   Binary file not shown.
Some files were not shown because too many files have changed in this diff.