Compare commits

...

44 Commits

Author SHA1 Message Date
Pim van Pelt
512cfd75dc Retire halloumi2026h2
All checks were successful
continuous-integration/drone/push Build is passing
2025-09-08 22:10:24 +00:00
Pim van Pelt
8683d570a1 Add alias for renamed article
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-26 09:27:06 +00:00
Pim van Pelt
a1a98ad3c6 Erratum: Tesseract/POSIX uses BadgerDB, not MariaDB, h/t alcutter@
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-26 09:19:41 +00:00
Pim van Pelt
26ae98d977 Add start/limit flags to .env, h/t philippe
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-26 08:25:12 +00:00
Pim van Pelt
619a1dfdf2 A few typo fixes, h/t claude
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-25 13:01:55 +00:00
Pim van Pelt
a9e978effb A few typo changes
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-25 10:29:59 +00:00
Pim van Pelt
825335cef9 Typo fixes
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-25 10:25:43 +00:00
Pim van Pelt
a97115593c Typo and readability fixes
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-25 09:55:40 +00:00
Pim van Pelt
3dd0d8a656 ctlog-3
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-25 10:59:07 +02:00
Pim van Pelt
f137326339 Newline between logo and text
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-24 15:55:00 +02:00
Pim van Pelt
51098ed43c Update ctlog landing page. Add logos, reized to 300x300
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-24 15:41:14 +02:00
Pim van Pelt
6b337e1167 Fix ctlog article link
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-24 15:30:22 +02:00
Pim van Pelt
bbf36f5a4e Merge branch 'main' of git.ipng.ch:ipng/ipng.ch
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-24 15:23:47 +02:00
Pim van Pelt
b324d71b3f Publish Lipase and Halloumi 2025-08-24 15:21:26 +02:00
Pim van Pelt
2681861e4b typo fix, h/t jeroen@
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-10 22:06:14 +02:00
Pim van Pelt
4f0188abeb A few readability edits
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-10 18:50:04 +02:00
Pim van Pelt
f4ed332b18 Add note on Skylight and S3
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-10 17:50:05 +02:00
Pim van Pelt
d9066aa241 Publish
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-10 17:37:38 +02:00
Pim van Pelt
c68799703b Add period:100 reporting
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-10 17:35:57 +02:00
Pim van Pelt
c32d1779f8 Add Sunlight article
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-10 17:21:02 +02:00
Pim van Pelt
eda80e7e66 Add name
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-10 11:57:36 +02:00
Pim van Pelt
d13da5608d Mark unfinished ASR9001 loadtest notes as draft, h/t stapelberg
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-09 18:49:58 +02:00
Pim van Pelt
d47261a3b7 Bump hugo to 0.148.2
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-09 16:56:29 +02:00
Pim van Pelt
383a598fc7 Move drone to new target location
All checks were successful
continuous-integration/drone/push Build is passing
2025-08-07 10:57:02 +00:00
Pim van Pelt
8afa2ff944 Add logo
All checks were successful
continuous-integration/drone/push Build is passing
2025-07-30 22:23:14 +02:00
Pim van Pelt
fe1207ee78 Add ctlog landing page
All checks were successful
continuous-integration/drone/push Build is passing
2025-07-30 22:14:08 +02:00
Pim van Pelt
6a59b7d7e6 Typo fixes h/t alcutter@ jeroen@
All checks were successful
continuous-integration/drone/push Build is passing
2025-07-27 17:30:24 +00:00
Pim van Pelt
bc2a9bb352 CTLog part 1
All checks were successful
continuous-integration/drone/push Build is passing
2025-07-27 17:27:50 +02:00
Pim van Pelt
5d02b6466c Bump timestamp for publication
All checks were successful
continuous-integration/drone/push Build is passing
2025-07-12 11:48:56 +02:00
Pim van Pelt
b6b419471d Typo and formatting fixes
All checks were successful
continuous-integration/drone/push Build is passing
2025-07-12 11:38:35 +02:00
Pim van Pelt
85b41ba4e0 Add a proposal article for eVPN/VxLAN in VPP
All checks were successful
continuous-integration/drone/push Build is passing
2025-07-12 11:27:33 +02:00
Pim van Pelt
ebbb0f8e24 typo fix, h/t tim427
All checks were successful
continuous-integration/drone/push Build is passing
2025-06-23 16:18:43 +00:00
Pim van Pelt
218ee84d5f Typo fix, h/t tim
All checks were successful
continuous-integration/drone/push Build is passing
2025-06-23 16:11:31 +00:00
Pim van Pelt
c476fa56fb Bump playground to 25.10-rc0~49-g90d92196
All checks were successful
continuous-integration/drone/push Build is passing
2025-06-07 19:05:24 +00:00
Pim van Pelt
a76abc331f A few typo fixes, h/t jeroen
All checks were successful
continuous-integration/drone/push Build is passing
2025-06-05 20:06:34 +00:00
Pim van Pelt
44deb34685 Typo fixes, h/t Jeroen
All checks were successful
continuous-integration/drone/push Build is passing
2025-06-05 20:04:11 +00:00
Pim van Pelt
ca46bcf6d5 Add Minio #2
All checks were successful
continuous-integration/drone/push Build is passing
2025-06-01 16:39:48 +02:00
Pim van Pelt
5042f822ef Minio Article #1
All checks were successful
continuous-integration/drone/push Build is passing
2025-06-01 12:53:16 +02:00
Pim van Pelt
fdb77838b8 Rewrite github.com to git.ipng.ch for popular repos
All checks were successful
continuous-integration/drone/push Build is passing
2025-05-04 21:54:16 +02:00
Pim van Pelt
6d3f4ac206 Some readability changes 2025-05-04 21:50:07 +02:00
Pim van Pelt
baa3e78045 Update MTU to 9216
All checks were successful
continuous-integration/drone/push Build is passing
2025-05-04 20:15:24 +02:00
Pim van Pelt
0972cf4aa1 A few readability fixes
All checks were successful
continuous-integration/drone/push Build is passing
2025-05-04 17:30:04 +02:00
Pim van Pelt
4f81d377a0 Article #2, Containerlab is up and running
All checks were successful
continuous-integration/drone/push Build is passing
2025-05-04 17:11:58 +02:00
Pim van Pelt
153048eda4 Update git repo 2025-05-04 17:11:58 +02:00
63 changed files with 7521 additions and 58 deletions

View File

@@ -8,9 +8,9 @@ steps:
- git lfs install
- git lfs pull
- name: build
image: git.ipng.ch/ipng/drone-hugo:release-0.145.1
image: git.ipng.ch/ipng/drone-hugo:release-0.148.2
settings:
hugo_version: 0.145.0
hugo_version: 0.148.2
extended: true
- name: rsync
image: drillster/drone-rsync
@@ -26,7 +26,7 @@ steps:
port: 22
args: '-6u --delete-after'
source: public/
target: /var/www/ipng.ch/
target: /nginx/sites/ipng.ch/
recursive: true
secrets: [ drone_sshkey ]

View File

@@ -8,7 +8,7 @@ Historical context - todo, but notes for now
1. started with stack.nl (when it was still stack.urc.tue.nl), 6bone and watching NASA multicast video in 1997.
2. founded ipng.nl project, first IPv6 in NL that was usable outside of NREN.
3. attacted attention of the first few IPv6 partitipants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day
3. attracted attention of the first few IPv6 participants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day
4. launched IPv6 at AMS-IX, first IXP prefix allocated 2001:768:1::/48
> My Brilliant Idea Of The Day -- encode AS number in leetspeak: `::AS01:2859:1`, because who would've thought we would ever run out of 16 bit AS numbers :)
5. IPng rearchitected to SixXS, and became a very large scale deployment of IPv6 tunnelbroker; our main central provisioning system moved around a few times between ISPs (Intouch, Concepts ICT, BIT, IP Man)

View File

@@ -185,7 +185,7 @@ function is_coloclue_beacon()
}
```
Then, I ran the configuration again with one IPv4 beacon set on dcg-1, and still all the bird configs on both IPv4 and IPv6 for all routers parsed correctly, and the generated function on the dcg-1 IPv4 filters file was popupated:
Then, I ran the configuration again with one IPv4 beacon set on dcg-1, and still all the bird configs on both IPv4 and IPv6 for all routers parsed correctly, and the generated function on the dcg-1 IPv4 filters file was populated:
```
function is_coloclue_beacon()
{

View File

@@ -89,7 +89,7 @@ lcp lcp-sync off
```
The prep work for the rest of the interface syncer starts with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.
### Change interface: state
@@ -120,7 +120,7 @@ the state it was. I did notice that you can't bring up a sub-interface if its pa
is down, which I found counterintuitive, but that's neither here nor there.
All of this is to say that we have to be careful when copying state forward, because as
this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
shows, issuing `set int state ... up` on an interface, won't touch its sub-interfaces in VPP, but
the subsequent netlink message to bring the _LIP_ for that interface up, **will** update the
children, thus desynchronising Linux and VPP: Linux will have interface **and all its
@@ -128,7 +128,7 @@ sub-interfaces** up unconditionally; VPP will have the interface up and its sub-
whatever state they were before.
To address this, a second
[[commit](https://github.com/pimvanpelt/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
needed. I'm not too sure I want to keep this behavior, but for now, it results in an intuitive
end-state, which is that all interfaces states are exactly the same between Linux and VPP.
@@ -157,7 +157,7 @@ DBGvpp# set int state TenGigabitEthernet3/0/0 up
### Change interface: MTU
Finally, a straight forward
[[commit](https://github.com/pimvanpelt/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
[[commit](https://git.ipng.ch/ipng/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
so I thought. When the MTU changes in VPP (with `set interface mtu packet N <int>`), there is
a callback that can be registered which copies this into the _LIP_. I did notice a specific corner
case: In VPP, a sub-interface can have a larger MTU than its parent. In Linux, this cannot happen,
@@ -179,7 +179,7 @@ higher than that, perhaps logging an error explaining why. This means two things
1. Any change in VPP of a parent MTU should ensure all children are clamped to at most that.
I addressed the issue in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].
### Change interface: IP Addresses
@@ -199,7 +199,7 @@ VPP into the companion Linux devices:
_LIP_ with `lcp_itf_set_interface_addr()`.
This means with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
[[commit](https://git.ipng.ch/ipng/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
any time a new _LIP_ is created, the IPv4 and IPv6 address on the VPP interface are fully copied
over by the third change, while at runtime, new addresses can be set/removed as well by the first
and second change.

View File

@@ -100,7 +100,7 @@ linux-cp {
Based on this config, I set the startup default in `lcp_set_lcp_auto_subint()`, but I realize that
an administrator may want to turn it on/off at runtime, too, so I add a CLI getter/setter that
interacts with the flag in this [[commit](https://github.com/pimvanpelt/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:
interacts with the flag in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:
```
DBGvpp# show lcp
@@ -116,11 +116,11 @@ lcp lcp-sync off
```
The prep work for the rest of the interface syncer starts with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.
The code for the configuration toggle is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
### Auto create/delete sub-interfaces
@@ -145,7 +145,7 @@ I noticed that interface deletion had a bug (one that I fell victim to as well:
remove the netlink device in the correct network namespace), which I fixed.
The code for the auto create/delete and the bugfix is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
### Further Work

View File

@@ -154,7 +154,7 @@ For now, `lcp_nl_dispatch()` just throws the message away after logging it with
a function that will come in very useful as I start to explore all the different Netlink message types.
The code that forms the basis of our Netlink Listener lives in [[this
commit](https://github.com/pimvanpelt/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
commit](https://git.ipng.ch/ipng/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
specifically, here I want to call out I was not the primary author, I worked off of Matt and Neale's
awesome work in this pending [Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122).
@@ -182,7 +182,7 @@ Linux interface VPP is not aware of. But, if I can find the _LIP_, I can convert
add or remove the ip4/ip6 neighbor adjacency.
The code for this first Netlink message handler lives in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
[[commit](https://git.ipng.ch/ipng/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
ironic insight is that after writing the code, I don't think any of it will be necessary, because
the interface plugin will already copy ARP and IPv6 ND packets back and forth and itself update its
neighbor adjacency tables; but I'm leaving the code in for now.
@@ -197,7 +197,7 @@ it or remove it, and if there are no link-local addresses left, disable IPv6 on
There's also a few multicast routes to add (notably 224.0.0.0/24 and ff00::/8, all-local-subnet).
The code for IP address handling is in this
[[commit]](https://github.com/pimvanpelt/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
[[commit]](https://git.ipng.ch/ipng/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
when I took it out for a spin, I noticed something curious, looking at the log lines that are
generated for the following sequence:
@@ -236,7 +236,7 @@ interface and directly connected route addition/deletion is slightly different i
So, I decide to take a little shortcut -- if an addition returns "already there", or a deletion returns
"no such entry", I'll just consider it a successful addition and deletion respectively, saving my eyes
from being screamed at by this red error message. I changed that in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
[[commit](https://git.ipng.ch/ipng/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
turning this situation into a friendly green notice instead.
### Netlink: Link (existing)
@@ -267,7 +267,7 @@ To avoid this loop, I temporarily turn off `lcp-sync` just before handling a bat
turn it back to its original state when I'm done with that.
The code for add/del of existing links is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].
### Netlink: Link (new)
@@ -276,7 +276,7 @@ doesn't have a _LIP_ for, but specifically describes a VLAN interface? Well, th
is trying to create a new sub-interface. And supporting that operation would be super cool, so let's go!
Using the earlier placeholder hint in `lcp_nl_link_add()` (see the previous
[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
I know that I've gotten a NEWLINK request but the Linux ifindex doesn't have a _LIP_. This could be
because the interface is entirely foreign to VPP, for example somebody created a dummy interface or
a VLAN sub-interface on one:
@@ -331,7 +331,7 @@ a boring `<phy>.<subid>` name.
Alright, without further ado, the code for the main innovation here, the implementation of
`lcp_nl_link_add_vlan()`, is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].
## Results

View File

@@ -118,7 +118,7 @@ or Virtual Routing/Forwarding domains). So first, I need to add these:
All of this code was heavily inspired by the pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)]
but a few finishing touches were added, and wrapped up in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].
### Deletion
@@ -459,7 +459,7 @@ it as 'unreachable' rather than deleting it. These are *additions* which have a
but with an interface index of 1 (which, in Netlink, is 'lo'). This makes VPP intermittently crash, so I
currently commented this out, while I gain better understanding. Result: blackhole/unreachable/prohibit
specials can not be set using the plugin. Beware!
(disabled in this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).
(disabled in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).
## Credits

View File

@@ -88,7 +88,7 @@ stat['/if/rx-miss'][:, 1].sum() - returns the sum of packet counters for
```
Alright, so let's grab that file and refactor it into a small library for me to use, I do
this in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
this in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
### VPP's API
@@ -159,7 +159,7 @@ idx=19 name=tap4 mac=02:fe:17:06:fc:af mtu=9000 flags=3
So I added a little abstraction with some error handling and one main function
to return interfaces as a Python dictionary of those `sw_interface_details`
tuples in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
tuples in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
### AgentX
@@ -207,9 +207,9 @@ once asked with `GetPDU` or `GetNextPDU` requests, by issuing a corresponding `R
to the SNMP server -- it takes care of all the rest!
The resulting code is in [[this
commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
but you can also check out the whole thing on
[[Github](https://github.com/pimvanpelt/vpp-snmp-agent)].
[[Github](https://git.ipng.ch/ipng/vpp-snmp-agent)].
### Building

View File

@@ -480,7 +480,7 @@ is to say, those packets which were destined to any IP address configured on the
plane. Any traffic going _through_ VPP will never be seen by Linux! So, I'll have to be
clever and count this traffic by polling VPP instead. This was the topic of my previous
[VPP Part 6]({{< ref "2021-09-10-vpp-6" >}}) about the SNMP Agent. All of that code
was released to [Github](https://github.com/pimvanpelt/vpp-snmp-agent), notably there's
was released to [Github](https://git.ipng.ch/ipng/vpp-snmp-agent), notably there's
a hint there for an `snmpd-dataplane.service` and a `vpp-snmp-agent.service`, including
the compiled binary that reads from VPP and feeds this to SNMP.

View File

@@ -30,9 +30,9 @@ virtual machine running in Qemu/KVM into a working setup with both [Free Range R
and [Bird](https://bird.network.cz/) installed side by side.
**NOTE**: If you're just interested in the resulting image, here's the most pertinent information:
> * ***vpp-proto.qcow2.lrz [[Download](https://ipng.ch/media/vpp-proto/vpp-proto-bookworm-20231015.qcow2.lrz)]***
> * ***SHA256*** `bff03a80ccd1c0094d867d1eb1b669720a1838330c0a5a526439ecb1a2457309`
> * ***Debian Bookworm (12.4)*** and ***VPP 24.02-rc0~46-ga16463610e***
> * ***vpp-proto.qcow2.lrz*** [[Download](https://ipng.ch/media/vpp-proto/vpp-proto-bookworm-20250607.qcow2.lrz)]
> * ***SHA256*** `a5fdf157c03f2d202dcccdf6ed97db49c8aa5fdb6b9ca83a1da958a8a24780ab`
> * ***Debian Bookworm (12.11)*** and ***VPP 25.10-rc0~49-g90d92196***
> * ***CPU*** Make sure the (virtualized) CPU supports AVX
> * ***RAM*** The image needs at least 4GB of RAM, and the hypervisor should support hugepages and AVX
> * ***Username***: `ipng` with ***password***: `ipng loves vpp` and is sudo-enabled
@@ -62,7 +62,7 @@ plugins:
or route, or the system receiving ARP or IPv6 neighbor request/reply from neighbors), and applying
these events to the VPP dataplane.
I've published the code on [Github](https://github.com/pimvanpelt/lcpng/) and I am targeting a release
I've published the code on [Github](https://git.ipng.ch/ipng/lcpng/) and I am targeting a release
in upstream VPP, hoping to make the upcoming 22.02 release in February 2022. I have a lot of ground to
cover, but I will note that the plugin has been running in production in [AS8298]({{< ref "2021-02-27-network" >}})
since Sep'21 and no crashes related to LinuxCP have been observed.
@@ -195,7 +195,7 @@ So grab a cup of tea, while we let Rhino stretch its legs, ehh, CPUs ...
pim@rhino:~$ mkdir -p ~/src
pim@rhino:~$ cd ~/src
pim@rhino:~/src$ sudo apt install libmnl-dev
pim@rhino:~/src$ git clone https://github.com/pimvanpelt/lcpng.git
pim@rhino:~/src$ git clone https://git.ipng.ch/ipng/lcpng.git
pim@rhino:~/src$ git clone https://gerrit.fd.io/r/vpp
pim@rhino:~/src$ ln -s ~/src/lcpng ~/src/vpp/src/plugins/lcpng
pim@rhino:~/src$ cd ~/src/vpp

View File

@@ -33,7 +33,7 @@ In this first post, let's take a look at tablestakes: writing a YAML specificati
configuration elements of VPP, and then ensures that the YAML file is both syntactically as well as
semantically correct.
**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).
@@ -348,7 +348,7 @@ to mess up my (or your!) VPP router by feeding it garbage, so the lions' share o
has been to assert the YAML file is both syntactically and semantically valid.
In the mean time, you can take a look at my code on [GitHub](https://github.com/pimvanpelt/vppcfg), but to
In the mean time, you can take a look at my code on [GitHub](https://git.ipng.ch/ipng/vppcfg), but to
whet your appetite, here's a hefty configuration that demonstrates all implemented types:
```

View File

@@ -32,7 +32,7 @@ the configuration to the dataplane. Welcome to `vppcfg`!
In this second post of the series, I want to talk a little bit about how planning a path from a running
configuration to a desired new configuration might look like.
**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).

View File

@@ -171,12 +171,12 @@ GigabitEthernet1/0/0 1 up GigabitEthernet1/0/0
After this exploratory exercise, I have learned enough about the hardware to be able to take the
Fitlet2 out for a spin. To configure the VPP instance, I turn to
[[vppcfg](https://github.com/pimvanpelt/vppcfg)], which can take a YAML configuration file
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)], which can take a YAML configuration file
describing the desired VPP configuration, and apply it safely to the running dataplane using the VPP
API. I've written a few more posts on how it does that, notably on its [[syntax]({{< ref "2022-03-27-vppcfg-1" >}})]
and its [[planner]({{< ref "2022-04-02-vppcfg-2" >}})]. A complete
configuration guide on vppcfg can be found
[[here](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md)].
[[here](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md)].
```
pim@fitlet:~$ sudo dpkg -i {lib,}vpp*23.06*deb

View File

@@ -185,7 +185,7 @@ forgetful chipmunk-sized brain!), so here, I'll only recap what's already writte
**1. BUILD:** For the first step, the build is straight forward, and yields a VPP instance based on
`vpp-ext-deps_23.06-1` at version `23.06-rc0~71-g182d2b466`, which contains my
[[LCPng](https://github.com/pimvanpelt/lcpng.git)] plugin. I then copy the packages to the router.
[[LCPng](https://git.ipng.ch/ipng/lcpng.git)] plugin. I then copy the packages to the router.
The router has an E-2286G CPU @ 4.00GHz with 6 cores and 6 hyperthreads. There's a really handy tool
called `likwid-topology` that can show how the L1, L2 and L3 cache lines up with respect to CPU
cores. Here I learn that CPU (0+6) and (1+7) share L1 and L2 cache -- so I can conclude that 0-5 are
@@ -351,7 +351,7 @@ in `vppcfg`:
* When I create the initial `--novpp` config, there's a bug in `vppcfg` where I incorrectly
reference a dataplane object which I haven't initialized (because with `--novpp` the tool
will not contact the dataplane at all. That one was easy to fix, which I did in [[this
commit](https://github.com/pimvanpelt/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).
commit](https://git.ipng.ch/ipng/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).
After that small detour, I can now proceed to configure the dataplane by offering the resulting
VPP commands, like so:
@@ -573,7 +573,7 @@ see is that which is destined to the controlplane (eg, to one of the IPv4 or IPv
multicast/broadcast groups that they are participating in), so things like tcpdump or SNMP won't
really work.
However, due to my [[vpp-snmp-agent](https://github.com/pimvanpelt/vpp-snmp-agent.git)], which is
However, due to my [[vpp-snmp-agent](https://git.ipng.ch/ipng/vpp-snmp-agent.git)], which is
feeding as an AgentX behind an snmpd that in turn is running in the `dataplane` namespace, SNMP scrapes
work as they did before, albeit with a few different interface names.

View File

@@ -14,7 +14,7 @@ performance and versatility. For those of us who have used Cisco IOS/XR devices,
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
I've been working on the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)], which you
I've been working on the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)], which you
can read all about in my series on VPP back in 2021:
[![DENOG14](/assets/vpp-stats/denog14-thumbnail.png){: style="width:300px; float: right; margin-left: 1em;"}](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)
@@ -70,7 +70,7 @@ answered by a Response PDU.
Using parts of a Python Agentx library written by GitHub user hosthvo
[[ref](https://github.com/hosthvo/pyagentx)], I tried my hands at writing one of these AgentX's.
The resulting source code is on [[GitHub](https://github.com/pimvanpelt/vpp-snmp-agent)]. That's the
The resulting source code is on [[GitHub](https://git.ipng.ch/ipng/vpp-snmp-agent)]. That's the
one that's running in production ever since I started running VPP routers at IPng Networks AS8298.
After the _AgentX_ exposes the dataplane interfaces and their statistics into _SNMP_, an open source
monitoring tool such as LibreNMS [[ref](https://librenms.org/)] can discover the routers and draw
@@ -126,7 +126,7 @@ for any interface created in the dataplane.
I wish I were good at Go, but I never really took to the language. I'm pretty good at Python, but
sorting through the stats segment isn't super quick as I've already noticed in the Python3 based
[[VPP SNMP Agent](https://github.com/pimvanpelt/vpp-snmp-agent)]. I'm probably the world's least
[[VPP SNMP Agent](https://git.ipng.ch/ipng/vpp-snmp-agent)]. I'm probably the world's least
terrible C programmer, so maybe I can take a look at the VPP Stats Client and make sense of it. Luckily,
there's an example already in `src/vpp/app/vpp_get_stats.c` and it reveals the following pattern:

View File

@@ -19,7 +19,7 @@ same time keep an IPng Site Local network with IPv4 and IPv6 that is separate fr
based on hardware/silicon based forwarding at line rate and high availability. You can read all
about my Centec MPLS shenanigans in [[this article]({{< ref "2023-03-11-mpls-core" >}})].
Ever since the release of the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)]
Ever since the release of the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)]
plugin in VPP, folks have asked "What about MPLS?" -- I have never really felt the need to go this
rabbit hole, because I figured that in this day and age, higher level IP protocols that do tunneling
are just as performant, and a little bit less of an 'art' to get right. For example, the Centec

View File

@@ -459,6 +459,6 @@ and VPP, and the overall implementation before attempting to use in production.
we got at least some of this right, but testing and runtime experience will tell.
I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!

View File

@@ -385,5 +385,5 @@ and VPP, and the overall implementation before attempting to use in production.
we got at least some of this right, but testing and runtime experience will tell.
I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!

View File

@@ -304,7 +304,7 @@ Gateway, just to show a few of the more advanced features of VPP. For me, this t
line of thinking: classifiers. This extract/match/act pattern can be used in policers, ACLs and
arbitrary traffic redirection through VPP's directed graph (eg. selecting a next node for
processing). I'm going to deep-dive into this classifier behavior in an upcoming article, and see
how I might add this to [[vppcfg](https://github.com/pimvanpelt/vppcfg.git)], because I think it
how I might add this to [[vppcfg](https://git.ipng.ch/ipng/vppcfg.git)], because I think it
would be super powerful to abstract away the rather complex underlying API into something a little
bit more ... user friendly. Stay tuned! :)

View File

@@ -359,7 +359,7 @@ does not have an IPv4 address. Except -- I'm bending the rules a little bit by d
There's an internal function `ip4_sw_interface_enable_disable()` which is called to enable IPv4
processing on an interface once the first IPv4 address is added. So my first fix is to force this to
be enabled for any interface that is exposed via Linux Control Plane, notably in `lcp_itf_pair_create()`
[[here](https://github.com/pimvanpelt/lcpng/blob/main/lcpng_interface.c#L777)].
[[here](https://git.ipng.ch/ipng/lcpng/blob/main/lcpng_interface.c#L777)].
This approach is partially effective:
@@ -500,7 +500,7 @@ which is unnumbered. Because I don't know for sure if everybody would find this
I make sure to guard the behavior behind a backwards compatible configuration option.
If you're curious, please take a look at the change in my [[GitHub
repo](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
repo](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
which I:
1. add a new configuration option, `lcp-sync-unnumbered`, which defaults to `on`. That would be
what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.

View File

@@ -147,7 +147,7 @@ With all of that, I am ready to demonstrate two working solutions now. I first c
Ondrej's [[commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)].
Then, I compile VPP with my pending [[gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. Finally,
to demonstrate how `update_loopback_addr()` might work, I compile `lcpng` with my previous
[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
which allows me to inhibit copying forward addresses from VPP to Linux, when using _unnumbered_
interfaces.

View File

@@ -250,10 +250,10 @@ remove the IPv4 and IPv6 addresses from the <span style='color:red;font-weight:b
routers in Br&uuml;ttisellen. They are directly connected, and if anything goes wrong, I can walk
over and rescue them. Sounds like a safe way to start!
I quickly add the ability for [[vppcfg](https://github.com/pimvanpelt/vppcfg)] to configure
I quickly add the ability for [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] to configure
_unnumbered_ interfaces. In VPP, these are interfaces that don't have an IPv4 or IPv6 address of
their own, but they borrow one from another interface. If you're curious, you can take a look at the
[[User Guide](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
[[User Guide](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
GitHub.
Looking at their `vppcfg` files, the change is actually very easy, taking as an example the
@@ -291,7 +291,7 @@ interface.
In the article, you'll see that discussed as _Solution 2_, and it includes a bit of rationale why I
find this better. I implemented it in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
case you're curious, and the commandline keyword is `lcp lcp-sync-unnumbered off` (the default is
_on_).

View File

@@ -0,0 +1,238 @@
---
date: "2024-09-03T13:07:54Z"
title: Loadtest notes, ASR9001
draft: true
---
### L2 point-to-point (L2XC) config
```
interface TenGigE0/0/0/0
mtu 9216
load-interval 30
l2transport
!
!
interface TenGigE0/0/0/1
mtu 9216
load-interval 30
l2transport
!
!
interface TenGigE0/0/0/2
mtu 9216
load-interval 30
l2transport
!
!
interface TenGigE0/0/0/3
mtu 9216
load-interval 30
l2transport
!
!
...
l2vpn
load-balancing flow src-dst-ip
logging
bridge-domain
pseudowire
!
xconnect group LoadTest
p2p pair0
interface TenGigE0/0/2/0
interface TenGigE0/0/2/1
!
p2p pair1
interface TenGigE0/0/2/2
interface TenGigE0/0/2/3
!
...
```
### L2 Bridge-Domain
```
l2vpn
bridge group LoadTestp
bridge-domain bd0
interface TenGigE0/0/0/0
!
interface TenGigE0/0/0/1
!
!
bridge-domain bd1
interface TenGigE0/0/0/2
!
interface TenGigE0/0/0/3
!
!
...
```
```
RP/0/RSP0/CPU0:micro-fridge#show l2vpn forwarding bridge-domain mac-address location 0/0/CPU0
Sat Aug 31 12:09:08.957 UTC
Mac Address Type Learned from/Filtered on LC learned Resync Age Mapped to
--------------------------------------------------------------------------------
9c69.b461.fcf2 dynamic Te0/0/0/0 0/0/CPU0 0d 0h 0m 14s N/A
9c69.b461.fcf3 dynamic Te0/0/0/1 0/0/CPU0 0d 0h 0m 2s N/A
001b.2155.1f11 dynamic Te0/0/0/2 0/0/CPU0 0d 0h 0m 0s N/A
001b.2155.1f10 dynamic Te0/0/0/3 0/0/CPU0 0d 0h 0m 15s N/A
001b.21bc.47a4 dynamic Te0/0/1/0 0/0/CPU0 0d 0h 0m 6s N/A
001b.21bc.47a5 dynamic Te0/0/1/1 0/0/CPU0 0d 0h 0m 21s N/A
9c69.b461.ff41 dynamic Te0/0/1/2 0/0/CPU0 0d 0h 0m 16s N/A
9c69.b461.ff40 dynamic Te0/0/1/3 0/0/CPU0 0d 0h 0m 10s N/A
001b.2155.1d1d dynamic Te0/0/2/0 0/0/CPU0 0d 0h 0m 9s N/A
001b.2155.1d1c dynamic Te0/0/2/1 0/0/CPU0 0d 0h 0m 16s N/A
001b.2155.1e08 dynamic Te0/0/2/2 0/0/CPU0 0d 0h 0m 4s N/A
001b.2155.1e09 dynamic Te0/0/2/3 0/0/CPU0 0d 0h 0m 11s N/A
```
Interesting finding: after a bridge-domain overload occurs, forwarding pretty much stops:
```
Te0/0/0/0:
30 second input rate 6931755000 bits/sec, 14441158 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec
Te0/0/0/1:
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 19492000 bits/sec, 40609 packets/sec
Te0/0/0/2:
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 19720000 bits/sec, 41084 packets/sec
Te0/0/0/3:
30 second input rate 6931728000 bits/sec, 14441100 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec
... and so on
30 second input rate 6931558000 bits/sec, 14440748 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 12627000 bits/sec, 26307 packets/sec
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 12710000 bits/sec, 26479 packets/sec
30 second input rate 6931542000 bits/sec, 14440712 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 19196000 bits/sec, 39992 packets/sec
30 second input rate 6931651000 bits/sec, 14440938 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec
30 second input rate 6931658000 bits/sec, 14440958 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 13167000 bits/sec, 27431 packets/sec
```
MPLS enabled test:
```
arp vrf default 100.64.0.2 001b.2155.1e08 ARPA
arp vrf default 100.64.1.2 001b.2155.1e09 ARPA
arp vrf default 100.64.2.2 001b.2155.1d1c ARPA
arp vrf default 100.64.3.2 001b.2155.1d1d ARPA
arp vrf default 100.64.4.2 001b.21bc.47a4 ARPA
arp vrf default 100.64.5.2 001b.21bc.47a5 ARPA
arp vrf default 100.64.6.2 9c69.b461.fcf2 ARPA
arp vrf default 100.64.7.2 9c69.b461.fcf3 ARPA
arp vrf default 100.64.8.2 001b.2155.1f10 ARPA
arp vrf default 100.64.9.2 001b.2155.1f11 ARPA
arp vrf default 100.64.10.2 9c69.b461.ff40 ARPA
arp vrf default 100.64.11.2 9c69.b461.ff41 ARPA
router static
address-family ipv4 unicast
0.0.0.0/0 198.19.5.1
16.0.0.0/24 100.64.0.2
16.0.1.0/24 100.64.2.2
16.0.2.0/24 100.64.4.2
16.0.3.0/24 100.64.6.2
16.0.4.0/24 100.64.8.2
16.0.5.0/24 100.64.10.2
48.0.0.0/24 100.64.1.2
48.0.1.0/24 100.64.3.2
48.0.2.0/24 100.64.5.2
48.0.3.0/24 100.64.7.2
48.0.4.0/24 100.64.9.2
48.0.5.0/24 100.64.11.2
!
!
mpls static
interface TenGigE0/0/0/0
interface TenGigE0/0/0/1
interface TenGigE0/0/0/2
interface TenGigE0/0/0/3
interface TenGigE0/0/1/0
interface TenGigE0/0/1/1
interface TenGigE0/0/1/2
interface TenGigE0/0/1/3
interface TenGigE0/0/2/0
interface TenGigE0/0/2/1
interface TenGigE0/0/2/2
interface TenGigE0/0/2/3
address-family ipv4 unicast
local-label 16 allocate
forward
path 1 nexthop TenGigE0/0/2/3 100.64.1.2 out-label 17
!
!
local-label 17 allocate
forward
path 1 nexthop TenGigE0/0/2/2 100.64.0.2 out-label 16
!
!
local-label 18 allocate
forward
path 1 nexthop TenGigE0/0/2/0 100.64.3.2 out-label 19
!
!
local-label 19 allocate
forward
path 1 nexthop TenGigE0/0/2/1 100.64.2.2 out-label 18
!
!
local-label 20 allocate
forward
path 1 nexthop TenGigE0/0/1/1 100.64.5.2 out-label 21
!
!
local-label 21 allocate
forward
path 1 nexthop TenGigE0/0/1/0 100.64.4.2 out-label 20
!
!
local-label 22 allocate
forward
path 1 nexthop TenGigE0/0/0/1 100.64.7.2 out-label 23
!
!
local-label 23 allocate
forward
path 1 nexthop TenGigE0/0/0/0 100.64.6.2 out-label 22
!
!
local-label 24 allocate
forward
path 1 nexthop TenGigE0/0/0/2 100.64.9.2 out-label 25
!
!
local-label 25 allocate
forward
path 1 nexthop TenGigE0/0/0/3 100.64.8.2 out-label 24
!
!
local-label 26 allocate
forward
path 1 nexthop TenGigE0/0/1/2 100.64.11.2 out-label 27
!
!
local-label 27 allocate
forward
path 1 nexthop TenGigE0/0/1/3 100.64.10.2 out-label 26
!
!
!
!
```

View File

@@ -230,7 +230,7 @@ does not have any form of configuration persistence and that's deliberate. VPP's
programmable dataplane, and explicitly has left the programming and configuration as an exercise for
integrators. I have written a Python project that takes a YAML file as input and uses it to
configure (and reconfigure, on the fly) the dataplane automatically, called
[[VPPcfg](https://github.com/pimvanpelt/vppcfg.git)]. Previously, I wrote some implementation thoughts
[[VPPcfg](https://git.ipng.ch/ipng/vppcfg.git)]. Previously, I wrote some implementation thoughts
on its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2
>}})] so I won't repeat that here. Instead, I will just show the configuration:

View File

@@ -430,7 +430,7 @@ Boom. I could not be more pleased.
This was a nice exercise for me! I'm going this direction because the
[[Containerlab](https://containerlab.dev)] framework will start containers with given NOS images,
not too dissimilar from the one I just made, and then attaches `veth` pairs between the containers.
I started dabbling with a [[pull-request](https://github.com/srl-labs/containerlab/pull/2569)], but
I started dabbling with a [[pull-request](https://github.com/srl-labs/containerlab/pull/2571)], but
I got stuck with a part of the Containerlab code that pre-deploys config files into the containers.
You see, I will need to generate two files:
@@ -448,7 +448,7 @@ will connect a few VPP containers together with an SR Linux node in a lab. Stand
Once we have that, there's still quite some work for me to do. Notably:
* Configuration persistence. `clab` allows you to save the running config. For that, I'll need to
introduce [[vppcfg](https://github.com/pimvanpelt/vppcfg.git)] and a means to invoke it when
introduce [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] and a means to invoke it when
the lab operator wants to save their config, and then reconfigure VPP when the container
restarts.
* I'll need to have a few files from `clab` shared with the host, notably the `startup.conf` and

View File

@@ -0,0 +1,373 @@
---
date: "2025-05-04T15:07:23Z"
title: 'VPP in Containerlab - Part 2'
params:
asciinema: true
---
{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}
# Introduction
From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines
like Qemu/KVM or VMWare. I can pass PCI devices directly through to the guest, and use CPU pinning to
allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
performance is almost the same as on bare metal. But did you know that VPP can also run in Docker?
The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
container-based networking labs. It starts the containers, builds virtual wiring between them to
create lab topologies of users' choice and manages the lab lifecycle.
Quite regularly I am asked 'when will you add VPP to Containerlab?', but at ZANOG I made a promise
to actually add it. In my previous [[article]({{< ref 2025-05-03-containerlab-1.md >}})], I took
a good look at VPP as a dockerized container. In this article, I'll explore how to make such a
container run in Containerlab!
## Completing the Docker container
Just having VPP running by itself in a container is not super useful (although it _is_ cool!). I
decide first to add a few bits and bobs that will come in handy in the `Dockerfile`:
```
FROM debian:bookworm
ARG DEBIAN_FRONTEND=noninteractive
ARG VPP_INSTALL_SKIP_SYSCTL=true
ARG REPO=release
EXPOSE 22/tcp
RUN apt-get update && apt-get -y install curl procps tcpdump iproute2 iptables \
iputils-ping net-tools git python3 python3-pip vim-tiny openssh-server bird2 \
mtr-tiny traceroute && apt-get clean
# Install VPP
RUN mkdir -p /var/log/vpp /root/.ssh/
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean
# Build vppcfg
RUN pip install --break-system-packages build netaddr yamale argparse pyyaml ipaddress
RUN git clone https://git.ipng.ch/ipng/vppcfg.git && cd vppcfg && python3 -m build && \
pip install --break-system-packages dist/vppcfg-*-py3-none-any.whl
# Config files
COPY files/etc/vpp/* /etc/vpp/
COPY files/etc/bird/* /etc/bird/
COPY files/init-container.sh /sbin/
RUN chmod 755 /sbin/init-container.sh
CMD ["/sbin/init-container.sh"]
```
A few notable additions:
* ***vppcfg*** is a handy utility I wrote and discussed in a previous [[article]({{< ref
2022-04-02-vppcfg-2 >}})]. Its purpose is to take a YAML file that describes the configuration of
the dataplane (like which interfaces, sub-interfaces, MTU, IP addresses and so on), and then
apply this safely to a running dataplane. You can check it out in my
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] git repository.
* ***openssh-server*** will come in handy to log in to the container, in addition to the already
available `docker exec`.
* ***bird2*** which will be my controlplane of choice. At a future date, I might also add FRR,
which may be a good alternative for some. VPP works well with both. You can check out Bird on
the nic.cz [[website](https://bird.network.cz/?get_doc&f=bird.html&v=20)].
I'll add a couple of default config files for Bird and VPP, and replace the CMD with a generic
`/sbin/init-container.sh` in which I can do any late binding stuff before launching VPP.
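With that, building the image locally is a quick one-liner. The tag below is just a local name I'm using for this sketch; `REPO` can be pointed at a different fd.io package repository if needed:
```
$ docker build --build-arg REPO=release -t vpp-containerlab:local .
```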
### Initializing the Container
#### VPP Containerlab: NetNS
VPP's Linux Control Plane plugin wants to run in its own network namespace. So the first order of
business of `/sbin/init-container.sh` is to create it:
```
NETNS=${NETNS:="dataplane"}
echo "Creating dataplane namespace"
/usr/bin/mkdir -p /etc/netns/$NETNS
/usr/bin/touch /etc/netns/$NETNS/resolv.conf
/usr/sbin/ip netns add $NETNS
```
#### VPP Containerlab: SSH
Then, I'll set the root password (which is `vpp` by the way), and start an SSH daemon which allows
for password-less logins:
```
echo "Starting SSH, with credentials root:vpp"
sed -i -e 's,^#PermitRootLogin prohibit-password,PermitRootLogin yes,' /etc/ssh/sshd_config
sed -i -e 's,^root:.*,root:$y$j9T$kG8pyZEVmwLXEtXekQCRK.$9iJxq/bEx5buni1hrC8VmvkDHRy7ZMsw9wYvwrzexID:20211::::::,' /etc/shadow
/etc/init.d/ssh start
```
#### VPP Containerlab: Bird2
I can already predict that Bird2 won't be the only option for a controlplane, even though I'm a huge
fan of it. Therefore, I'll make it configurable to leave the door open for other controlplane
implementations in the future:
```
BIRD_ENABLED=${BIRD_ENABLED:="true"}
if [ "$BIRD_ENABLED" == "true" ]; then
echo "Starting Bird in $NETNS"
mkdir -p /run/bird /var/log/bird
chown bird:bird /var/log/bird
ROUTERID=$(ip -br a show eth0 | awk '{ print $3 }' | cut -f1 -d/)
sed -i -e "s,.*router id .*,router id $ROUTERID; # Set by container-init.sh," /etc/bird/bird.conf
/usr/bin/nsenter --net=/var/run/netns/$NETNS /usr/sbin/bird -u bird -g bird
fi
```
I am reminded that Bird won't start if it cannot determine its _router id_. When I start it in the
`dataplane` namespace, it will immediately exit, because there will be no IP addresses configured
yet. But luckily, it logs its complaint and it's easily addressed. I decide to take the management
IPv4 address from `eth0` and write that into the `bird.conf` file, which otherwise does some basic
initialization that I described in a previous [[article]({{< ref 2021-09-02-vpp-5 >}})], so I'll
skip that here. However, I do include an empty file called `/etc/bird/bird-local.conf` for users to
further configure Bird2.
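The net effect of that `sed` is a single rewritten line in `/etc/bird/bird.conf`. With a management address of, say, 172.20.20.11 on `eth0`, it ends up reading:
```
router id 172.20.20.11; # Set by container-init.sh
```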
#### VPP Containerlab: Binding veth pairs
When Containerlab starts the VPP container, it'll offer it a set of `veth` ports that connect this
container to other nodes in the lab. This is done by the `links` list in the topology file
[[ref](https://containerlab.dev/manual/network/)]. It's my goal to take all of the interfaces
that are of type `veth`, and generate a little snippet to grab them and bind them into VPP while
setting their MTU to 9216 to allow for jumbo frames:
```
CLAB_VPP_FILE=${CLAB_VPP_FILE:=/etc/vpp/clab.vpp}
echo "Generating $CLAB_VPP_FILE"
: > $CLAB_VPP_FILE
MTU=9216
for IFNAME in $(ip -br link show type veth | cut -f1 -d@ | grep -v '^eth0$' | sort); do
MAC=$(ip -br link show dev $IFNAME | awk '{ print $3 }')
echo " * $IFNAME hw-addr $MAC mtu $MTU"
ip link set $IFNAME up mtu $MTU
cat << EOF >> $CLAB_VPP_FILE
create host-interface name $IFNAME hw-addr $MAC
set interface name host-$IFNAME $IFNAME
set interface mtu $MTU $IFNAME
set interface state $IFNAME up
EOF
done
```
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
One thing I realized is that VPP will assign a random MAC address on its copy of the `veth` port,
which is not great. I'll explicitly configure it with the same MAC address as the `veth` interface
itself, otherwise I'd have to put the interface into promiscuous mode.
#### VPP Containerlab: VPPcfg
I'm almost ready, but I have one more detail. The user will be able to offer a
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] YAML file to configure the interfaces and so on. If such
a file exists, I'll apply it to the dataplane upon startup:
```
VPPCFG_VPP_FILE=${VPPCFG_VPP_FILE:=/etc/vpp/vppcfg.vpp}
echo "Generating $VPPCFG_VPP_FILE"
: > $VPPCFG_VPP_FILE
if [ -r /etc/vpp/vppcfg.yaml ]; then
vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml -o $VPPCFG_VPP_FILE
fi
```
Once the VPP process starts, it'll execute `/etc/vpp/bootstrap.vpp`, which in turn executes these
newly generated `/etc/vpp/clab.vpp` to grab the `veth` interfaces, and then `/etc/vpp/vppcfg.vpp` to
further configure the dataplane. Easy peasy!
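I won't reproduce `/etc/vpp/bootstrap.vpp` in full here, but a minimal sketch of what it could contain, assuming the two generated files are simply chained with VPP's `exec` CLI command, looks like this:
```
comment { bootstrap.vpp -- illustrative sketch, not necessarily the shipped file }
exec /etc/vpp/clab.vpp
exec /etc/vpp/vppcfg.vpp
```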
### Adding VPP to Containerlab
Roman points out a previous integration for the 6WIND VSR in
[[PR#2540](https://github.com/srl-labs/containerlab/pull/2540)]. This serves as a useful guide to
get me started. I fork the repo, create a branch so that Roman can also add a few commits, and
together we start hacking in [[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)].
First, I add the documentation skeleton in `docs/manual/kinds/fdio_vpp.md`, which links in from a
few other places, and will be where the end-user facing documentation will live. That's about half
the contributed LOC, right there!
Next, I'll create a Go module in `nodes/fdio_vpp/fdio_vpp.go` which doesn't do much other than
creating the `struct`, and its required `Register` and `Init` functions. The `Init` function ensures
the right capabilities are set in Docker, and the right devices are bound for the container.
I notice that Containerlab rewrites the Dockerfile `CMD` string and prepends an `if-wait.sh` script
to it. This is because when Containerlab starts the container, it'll still be busy adding these
`link` interfaces to it, and if a container starts too quickly, it may not see all the interfaces.
Containerlab therefore informs the container of the expected count using an environment variable called `CLAB_INTFS`, and this
script simply sleeps until exactly that number of interfaces is present. Ok, cool beans.
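Conceptually, such a wait loop is tiny. Here's an illustrative shell sketch of the idea; this is my own reconstruction, not the actual `if-wait.sh` that Containerlab ships, and whether `eth0` counts towards `CLAB_INTFS` is an assumption:
```
#!/bin/sh
# Illustrative sketch only -- not the real if-wait.sh shipped by Containerlab.
# CLAB_INTFS is set by Containerlab to the number of lab links this node should see.
EXPECTED=${CLAB_INTFS:-0}
while [ "$(ip -br link show type veth | grep -cv '^eth0')" -lt "$EXPECTED" ]; do
  echo "Waiting for $EXPECTED veth interfaces to appear..."
  sleep 1
done
exec "$@"   # then hand control to the original container CMD
```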
Roman helps me a bit with Go templating. You see, I think it'll be slick to have the CLI prompt for
the VPP containers to reflect their hostname, because normally, VPP will assign `vpp# `. I add the
template in `nodes/fdio_vpp/vpp_startup_config.go.tpl` and it only has one variable expansion: `unix
{ cli-prompt {{ .ShortName }}# }`. But I totally think it's worth it, because when running many VPP
containers in the lab, it could otherwise get confusing.
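Rendered for a node whose short name is `vpp1`, that template simply becomes:
```
unix { cli-prompt vpp1# }
```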
Roman also shows me a trick in the function `PostDeploy()`, which will write the user's SSH pubkeys
to `/root/.ssh/authorized_keys`. This allows users to log in without having to use password
authentication.
Collectively, we decide to punt on the `SaveConfig` function until we're a bit further along. I have
an idea how this would work, basically along the lines of calling `vppcfg dump` and bind-mounting
that file into the lab directory somewhere. This way, upon restarting, the YAML file can be re-read
and the dataplane initialized. But it'll be for another day.
After the main module is finished, all I have to do is add it to `clab/register.go` and that's just
about it. In about 170 lines of code, 50 lines of Go template, and 170 lines of Markdown, this
contribution is about ready to ship!
### Containerlab: Demo
After I finish writing the documentation, I decide to include a demo with a quickstart to help folks
along. A simple lab showing two VPP instances and two Alpine Linux clients can be found on
[[git.ipng.ch/ipng/vpp-containerlab](https://git.ipng.ch/ipng/vpp-containerlab)]. Simply check out the
repo and start the lab, like so:
```
$ git clone https://git.ipng.ch/ipng/vpp-containerlab.git
$ cd vpp-containerlab
$ containerlab deploy --topo vpp.clab.yml
```
#### Containerlab: configs
The file `vpp.clab.yml` contains an example topology consisting of two VPP instances, each connected to
one Alpine Linux container, in the following topology:
{{< image src="/assets/containerlab/learn-vpp.png" alt="Containerlab Topo" width="100%" >}}
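To give an idea of its shape, a sketch of such a two-router, two-client `vpp.clab.yml` could look roughly like this; the image names and bind paths are illustrative guesses on my part, the repository has the real file:
```
name: vpp
topology:
  nodes:
    vpp1:
      kind: fdio_vpp
      image: git.ipng.ch/ipng/vpp-containerlab:latest   # illustrative image name
      binds:
        - config/vpp1/vppcfg.yaml:/etc/vpp/vppcfg.yaml
        - config/vpp1/bird-local.conf:/etc/bird/bird-local.conf
    vpp2:
      kind: fdio_vpp
      image: git.ipng.ch/ipng/vpp-containerlab:latest   # illustrative image name
      binds:
        - config/vpp2/vppcfg.yaml:/etc/vpp/vppcfg.yaml
        - config/vpp2/bird-local.conf:/etc/bird/bird-local.conf
    client1:
      kind: linux
      image: alpine:latest
    client2:
      kind: linux
      image: alpine:latest
  links:
    - endpoints: ["client1:eth1", "vpp1:eth1"]
    - endpoints: ["vpp1:eth2", "vpp2:eth2"]
    - endpoints: ["vpp2:eth1", "client2:eth1"]
```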
Two relevant files for each VPP router are included in this
[[repository](https://git.ipng.ch/ipng/vpp-containerlab)]:
1. `config/vpp*/vppcfg.yaml` configures the dataplane interfaces, including a loopback address.
1. `config/vpp*/bird-local.conf` configures the controlplane to enable BFD and OSPF.
To illustrate these files, let me take a closer look at node `vpp1`. Its VPP dataplane
configuration looks like this:
```
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/vppcfg.yaml
interfaces:
eth1:
description: 'To client1'
mtu: 1500
lcp: eth1
addresses: [ 10.82.98.65/28, 2001:db8:8298:101::1/64 ]
eth2:
description: 'To vpp2'
mtu: 9216
lcp: eth2
addresses: [ 10.82.98.16/31, 2001:db8:8298:1::1/64 ]
loopbacks:
loop0:
description: 'vpp1'
lcp: loop0
addresses: [ 10.82.98.0/32, 2001:db8:8298::/128 ]
```
Then, I enable BFD, OSPF and OSPFv3 on `eth2` and `loop0` on both of the VPP routers:
```
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/bird-local.conf
protocol bfd bfd1 {
interface "eth2" { interval 100 ms; multiplier 30; };
}
protocol ospf v2 ospf4 {
ipv4 { import all; export all; };
area 0 {
interface "loop0" { stub yes; };
interface "eth2" { type pointopoint; cost 10; bfd on; };
};
}
protocol ospf v3 ospf6 {
ipv6 { import all; export all; };
area 0 {
interface "loop0" { stub yes; };
interface "eth2" { type pointopoint; cost 10; bfd on; };
};
}
```
#### Containerlab: playtime!
Once the lab comes up, I can SSH to the VPP containers (`vpp1` and `vpp2`) which have my SSH pubkeys
installed thanks to Roman's work. Barring that, I could still log in as user `root` using
password `vpp`. VPP runs in its own network namespace called `dataplane`, which is very similar to SR
Linux's default `network-instance`. I can join that namespace to take a closer look:
```
pim@summer:~/src/vpp-containerlab$ ssh root@vpp1
root@vpp1:~# nsenter --net=/var/run/netns/dataplane
root@vpp1:~# ip -br a
lo               DOWN
loop0            UP             10.82.98.0/32 2001:db8:8298::/128 fe80::dcad:ff:fe00:0/64
eth1             UNKNOWN        10.82.98.65/28 2001:db8:8298:101::1/64 fe80::a8c1:abff:fe77:acb9/64
eth2             UNKNOWN        10.82.98.16/31 2001:db8:8298:1::1/64 fe80::a8c1:abff:fef0:7125/64
root@vpp1:~# ping 10.82.98.1
PING 10.82.98.1 (10.82.98.1) 56(84) bytes of data.
64 bytes from 10.82.98.1: icmp_seq=1 ttl=64 time=9.53 ms
64 bytes from 10.82.98.1: icmp_seq=2 ttl=64 time=15.9 ms
^C
--- 10.82.98.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 9.530/12.735/15.941/3.205 ms
```
From `vpp1`, I can tell that Bird2's OSPF adjacency has formed, because I can ping the `loop0`
address of the `vpp2` router at 10.82.98.1. Nice! The two client nodes are running a minimalistic Alpine
Linux container, which doesn't ship with SSH by default. But of course I can still enter the
containers using `docker exec`, like so:
```
pim@summer:~/src/vpp-containerlab$ docker exec -it client1 sh
/ # ip addr show dev eth1
531235: eth1@if531234: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 9500 qdisc noqueue state UP
    link/ether 00:c1:ab:00:00:01 brd ff:ff:ff:ff:ff:ff
    inet 10.82.98.66/28 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 2001:db8:8298:101::2/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::2c1:abff:fe00:1/64 scope link
       valid_lft forever preferred_lft forever
/ # traceroute 10.82.98.82
traceroute to 10.82.98.82 (10.82.98.82), 30 hops max, 46 byte packets
1 10.82.98.65 (10.82.98.65) 5.906 ms 7.086 ms 7.868 ms
2 10.82.98.17 (10.82.98.17) 24.007 ms 23.349 ms 15.933 ms
3 10.82.98.82 (10.82.98.82) 39.978 ms 31.127 ms 31.854 ms
/ # traceroute 2001:db8:8298:102::2
traceroute to 2001:db8:8298:102::2 (2001:db8:8298:102::2), 30 hops max, 72 byte packets
1 2001:db8:8298:101::1 (2001:db8:8298:101::1) 0.701 ms 7.144 ms 7.900 ms
2 2001:db8:8298:1::2 (2001:db8:8298:1::2) 23.909 ms 22.943 ms 23.893 ms
3 2001:db8:8298:102::2 (2001:db8:8298:102::2) 31.964 ms 30.814 ms 32.000 ms
```
From the vantage point of `client1`, the first hop is the `vpp1` node, which forwards to `vpp2`,
which in turn forwards to `client2`. This shows that both VPP routers are passing traffic.
Dope!
## Results
After all of this deep-diving, all that's left is for me to demonstrate the Containerlab setup by means
of this little screencast [[asciinema](/assets/containerlab/vpp-containerlab.cast)]. I hope you enjoy
it as much as I enjoyed creating it:
{{< asciinema src="/assets/containerlab/vpp-containerlab.cast" >}}
## Acknowledgements
I wanted to give a shout-out to Roman Dodin for his help getting the Containerlab parts squared away
when I got a little bit stuck. He took the time to explain the internals and idioms of the Containerlab
project, which really saved me a tonne of time. He also pair-programmed
[[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)] with me over the span of two
evenings.
Collaborative open source rocks!

---
date: "2025-05-28T22:07:23Z"
title: 'Case Study: Minio S3 - Part 1'
---
{{< image float="right" src="/assets/minio/minio-logo.png" alt="MinIO Logo" width="6em" >}}
# Introduction
Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading
scalability, data availability, security, and performance. Millions of customers of all sizes and
industries store, manage, analyze, and protect any amount of data for virtually any use case, such
as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and
easy-to-use management features, you can optimize costs, organize and analyze data, and configure
fine-tuned access controls to meet specific business and compliance requirements.
Amazon's S3 became the _de facto_ standard object storage system, and there exist several fully open
source implementations of the protocol. One of them is MinIO: designed to allow enterprises to
consolidate all of their data on a single, private cloud namespace. Architected using the same
principles as the hyperscalers, AIStor delivers performance at scale at a fraction of the cost
compared to the public cloud.
IPng Networks is an Internet Service Provider, but I also dabble in self-hosting things, for
example [[PeerTube](https://video.ipng.ch/)], [[Mastodon](https://ublog.tech/)],
[[Immich](https://photos.ipng.ch/)], [[Pixelfed](https://pix.ublog.tech/)] and of course
[[Hugo](https://ipng.ch/)]. These services all have one thing in common: they tend to use lots of
storage when they grow. At IPng Networks, all hypervisors ship with enterprise SAS flash drives,
mostly 1.92TB and 3.84TB. Scaling up each of these services, and backing them up safely, can be
quite the headache.
This article is for the storage buffs. I'll set up a set of distributed MinIO nodes from scratch.
## Physical
{{< image float="right" src="/assets/minio/disks.png" alt="MinIO Disks" width="16em" >}}
I'll start with the basics. I still have a few Dell R720 servers lying around; they are getting a
bit older but still have 24 cores and 64GB of memory. First I need to get me some disks. I order
36 pieces of 16TB enterprise SATA disk, a mixture of Seagate EXOS and Toshiba MG series drives. I once
learned (the hard way) that buying a big stack of disks from one production run is a risk - so I'll
mix and match the drives.
Three trays of caddies and a melted credit card later, I have 576TB of SATA disks safely in hand.
Each machine will carry 192TB of raw storage. The nice thing about this chassis is that Dell can
ship them with 12x 3.5" SAS slots in the front, and 2x 2.5" SAS slots in the rear of the chassis.
So I'll install Debian Bookworm on one small 480G SSD in software RAID1.
### Cloning an install
I have three identical machines so in total I'll want six of these SSDs. I temporarily screw the
other five in 3.5" drive caddies and plug them into the first installed Dell, which I've called
`minio-proto`:
```
pim@minio-proto:~$ for i in b c d e f; do
  sudo dd if=/dev/sda of=/dev/sd${i} bs=512 count=1
  sudo mdadm --manage /dev/md0 --add /dev/sd${i}1
done
pim@minio-proto:~$ sudo mdadm --grow /dev/md0 --raid-devices=6
pim@minio-proto:~$ watch cat /proc/mdstat
pim@minio-proto:~$ for i in a b c d e f; do
  sudo grub-install /dev/sd$i
done
```
{{< image float="right" src="/assets/minio/rack.png" alt="MinIO Rack" width="16em" >}}
The first command takes my installed disk, `/dev/sda`, and copies the first sector over to the other
five. This will give them the same partition table. Next, I'll add the first partition of each disk
to the raidset. Then, I'll expand the raidset to have six members, after which the kernel starts a
recovery process that syncs the newly added partitions to `/dev/md0` (by copying from `/dev/sda` to
all other disks at once). Finally, I'll watch this exciting movie and grab a cup of tea.
Once the disks are fully copied, I'll shut down the machine and distribute the disks to their
respective Dell R720, two each. Once they boot, they will all be identical. I'll need to make sure
their hostnames and machine/host-ids are unique, otherwise things like bridges will have overlapping
MAC addresses - ask me how I know:
```
pim@minio-proto:~$ sudo mdadm --grow /dev/md0 --raid-devices=2
pim@minio-proto:~$ sudo rm /etc/ssh/ssh_host*
pim@minio-proto:~$ sudo hostname minio0-chbtl0
pim@minio-proto:~$ sudo dpkg-reconfigure openssh-server
pim@minio-proto:~$ sudo dd if=/dev/random of=/etc/hostid bs=4 count=1
pim@minio-proto:~$ /usr/bin/dbus-uuidgen | sudo tee /etc/machine-id
pim@minio-proto:~$ sudo reboot
```
After which I have three beautiful and unique machines:
* `minio0.chbtl0.net.ipng.ch`: which will go into my server rack at the IPng office.
* `minio0.ddln0.net.ipng.ch`: which will go to [[Daedalean]({{< ref
2022-02-24-colo >}})], doing AI since before it was all about vibe coding.
* `minio0.chrma0.net.ipng.ch`: which will go to [[IP-Max](https://ip-max.net/)], one of the best
ISPs on the planet. 🥰
## Deploying Minio
The user guide that MinIO provides
[[ref](https://min.io/docs/minio/linux/operations/installation.html)] is super good, arguably one of
the best documented open source projects I've ever seen. It shows me that I can do three types of
install: a 'Standalone' with one disk, a 'Standalone Multi-Drive', and a 'Distributed' deployment.
I decide to make three independent standalone multi-drive installs. This way, I have less shared
fate, and will be immune to network partitions (as these are going to be in three different
physical locations). I've also read about per-bucket _replication_, which will be an excellent way
to get geographical distribution and active/active instances to work together.
I feel good about the single-machine multi-drive decision. I follow the install guide
[[ref](https://min.io/docs/minio/linux/operations/install-deploy-manage/deploy-minio-single-node-multi-drive.html#minio-snmd)]
for this deployment type.
### IPng Frontends
At IPng I use a private IPv4/IPv6/MPLS network that is not connected to the internet. I call this
network [[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})]. But how will users reach my Minio
install? I have four redundantly and geographically deployed frontends, two in the Netherlands and
two in Switzerland. I've described the frontend setup in a [[previous article]({{< ref
2023-03-17-ipng-frontends >}})] and the certificate management in [[this article]({{< ref
2023-03-24-lego-dns01 >}})].
I've decided to run the service on these three regionalized endpoints:
1. `s3.chbtl0.ipng.ch` which will back into `minio0.chbtl0.net.ipng.ch`
1. `s3.ddln0.ipng.ch` which will back into `minio0.ddln0.net.ipng.ch`
1. `s3.chrma0.ipng.ch` which will back into `minio0.chrma0.net.ipng.ch`
The first thing I take note of is that S3 buckets can be either addressed _by path_, in other words
something like `s3.chbtl0.ipng.ch/my-bucket/README.md`, but they can also be addressed by virtual
host, like so: `my-bucket.s3.chbtl0.ipng.ch/README.md`. A subtle difference, but from the docs I
understand that Minio needs to have control of the whole space under its main domain.
There's a small implication to this requirement -- the Web Console that ships with MinIO (eh, well,
maybe that's going to change, more on that later), will want to have its own domain-name, so I
choose something simple: `cons0-s3.chbtl0.ipng.ch` and so on. This way, somebody might still be able
to have a bucket name called `cons0` :)
#### Let's Encrypt Certificates
Alright, so I will be kneading nine domains into this new certificate, which I'll simply call
`s3.ipng.ch`. I configure it in Ansible:
```
certbot:
  certs:
    ...
    s3.ipng.ch:
      groups: [ 'nginx', 'minio' ]
      altnames:
        - 's3.chbtl0.ipng.ch'
        - 'cons0-s3.chbtl0.ipng.ch'
        - '*.s3.chbtl0.ipng.ch'
        - 's3.ddln0.ipng.ch'
        - 'cons0-s3.ddln0.ipng.ch'
        - '*.s3.ddln0.ipng.ch'
        - 's3.chrma0.ipng.ch'
        - 'cons0-s3.chrma0.ipng.ch'
        - '*.s3.chrma0.ipng.ch'
```
I run the `certbot` playbook and it does two things:
1. On the machines from group `nginx` and `minio`, it will ensure there exists a user `lego` with
an SSH key and write permissions to `/etc/lego/`; this is where the automation will write (and
update) the certificate keys.
1. On the `lego` machine, it'll create two files. One is the certificate requestor, and the other
is a certificate distribution script that will copy the cert to the right machine(s) when it
renews.
On the `lego` machine, I'll run the cert request for the first time:
```
lego@lego:~$ bin/certbot:s3.ipng.ch
lego@lego:~$ RENEWED_LINEAGE=/home/lego/acme-dns/live/s3.ipng.ch bin/certbot-distribute
```
The first script asks me to add the `_acme-challenge` DNS entries, which I'll do, for example on the
`s3.chbtl0.ipng.ch` instance (and similarly for the `ddln0` and `chrma0` ones):
```
$ORIGIN chbtl0.ipng.ch.
_acme-challenge.s3 CNAME 51f16fd0-8eb6-455c-b5cd-96fad12ef8fd.auth.ipng.ch.
_acme-challenge.cons0-s3 CNAME 450477b8-74c9-4b9e-bbeb-de49c3f95379.auth.ipng.ch.
s3 CNAME nginx0.ipng.ch.
*.s3 CNAME nginx0.ipng.ch.
cons0-s3 CNAME nginx0.ipng.ch.
```
I push and reload the `ipng.ch` zonefile with these changes, after which the certificate gets
requested and a cronjob is added to check for renewals. The second script will copy the newly created
cert to all three `minio` machines, and all four `nginx` machines. From now on, every 90 days, a new
cert will be automatically generated and distributed. Slick!
#### NGINX Configs
With the LE wildcard certs in hand, I can create an NGINX frontend for these minio deployments.
First, a simple redirector service that punts people on port 80 to port 443:
```
server {
  listen [::]:80;
  listen 0.0.0.0:80;
  server_name cons0-s3.chbtl0.ipng.ch s3.chbtl0.ipng.ch *.s3.chbtl0.ipng.ch;
  access_log /var/log/nginx/s3.chbtl0.ipng.ch-access.log;
  include /etc/nginx/conf.d/ipng-headers.inc;
  location / {
    return 301 https://$server_name$request_uri;
  }
}
```
Next, the Minio API service itself which runs on port 9000, with a configuration snippet inspired by
the MinIO [[docs](https://min.io/docs/minio/linux/integrations/setup-nginx-proxy-with-minio.html)]:
```
server {
  listen [::]:443 ssl http2;
  listen 0.0.0.0:443 ssl http2;
  ssl_certificate /etc/certs/s3.ipng.ch/fullchain.pem;
  ssl_certificate_key /etc/certs/s3.ipng.ch/privkey.pem;
  include /etc/nginx/conf.d/options-ssl-nginx.inc;
  ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
  server_name s3.chbtl0.ipng.ch *.s3.chbtl0.ipng.ch;
  access_log /var/log/nginx/s3.chbtl0.ipng.ch-access.log upstream;
  include /etc/nginx/conf.d/ipng-headers.inc;
  add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
  ignore_invalid_headers off;
  client_max_body_size 0;
  # Disable buffering
  proxy_buffering off;
  proxy_request_buffering off;
  location / {
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_connect_timeout 300;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    chunked_transfer_encoding off;
    proxy_pass http://minio0.chbtl0.net.ipng.ch:9000;
  }
}
```
Finally, the Minio Console service which runs on port 9090:
```
include /etc/nginx/conf.d/geo-ipng-trusted.inc;
server {
  listen [::]:443 ssl http2;
  listen 0.0.0.0:443 ssl http2;
  ssl_certificate /etc/certs/s3.ipng.ch/fullchain.pem;
  ssl_certificate_key /etc/certs/s3.ipng.ch/privkey.pem;
  include /etc/nginx/conf.d/options-ssl-nginx.inc;
  ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
  server_name cons0-s3.chbtl0.ipng.ch;
  access_log /var/log/nginx/cons0-s3.chbtl0.ipng.ch-access.log upstream;
  include /etc/nginx/conf.d/ipng-headers.inc;
  add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
  ignore_invalid_headers off;
  client_max_body_size 0;
  # Disable buffering
  proxy_buffering off;
  proxy_request_buffering off;
  location / {
    if ($geo_ipng_trusted = 0) { rewrite ^ https://ipng.ch/ break; }
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-NginX-Proxy true;
    real_ip_header X-Real-IP;
    proxy_connect_timeout 300;
    chunked_transfer_encoding off;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_pass http://minio0.chbtl0.net.ipng.ch:9090;
  }
}
```
This last one has an NGINX trick. It will only allow users in if they are in the map called
`geo_ipng_trusted`, which contains a set of IPv4 and IPv6 prefixes. Visitors who are not in this map
will receive an HTTP redirect back to the [[IPng.ch](https://ipng.ch/)] homepage instead.
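For reference, such a map is built with the stock NGINX `geo` module. The snippet below is a minimal sketch of what the included `geo-ipng-trusted.inc` could look like; the prefixes shown are documentation examples, not IPng's actual trusted ranges:
```
## Sketch of /etc/nginx/conf.d/geo-ipng-trusted.inc -- example prefixes only.
geo $geo_ipng_trusted {
  default        0;   # everybody else: not trusted, redirected away
  192.0.2.0/24   1;   # example IPv4 management range
  2001:db8::/32  1;   # example IPv6 range
}
```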
I run the Ansible Playbook which contains the NGINX changes to all frontends, but of course nothing
runs yet, because I haven't yet started MinIO backends.
### MinIO Backends
The first thing I need to do is get those disks mounted. MinIO likes using XFS, so I'll install that
and prepare the disks as follows:
```
pim@minio0-chbtl0:~$ sudo apt install xfsprogs
pim@minio0-chbtl0:~$ sudo modprobe xfs
pim@minio0-chbtl0:~$ echo xfs | sudo tee -a /etc/modules
pim@minio0-chbtl0:~$ sudo update-initramfs -k all -u
pim@minio0-chbtl0:~$ for i in a b c d e f g h i j k l; do sudo mkfs.xfs /dev/sd$i; done
pim@minio0-chbtl0:~$ blkid | awk 'BEGIN {i=1} /TYPE="xfs"/ {
printf "%s /minio/disk%d xfs defaults 0 2\n",$2,i; i++;
}' | sudo tee -a /etc/fstab
pim@minio0-chbtl0:~$ for i in `seq 1 12`; do sudo mkdir -p /minio/disk$i; done
pim@minio0-chbtl0:~$ sudo mount -t xfs -a
pim@minio0-chbtl0:~$ sudo chown -R minio-user: /minio/
```
From the top: I'll install `xfsprogs`, which contains the tools I need to manipulate XFS filesystems
on Debian. Then I'll load the `xfs` kernel module, and make sure it gets loaded on subsequent
startups by adding it to `/etc/modules` and regenerating the initrd for the installed kernels.
Next, I'll format all twelve 16TB disks (which are `/dev/sda` - `/dev/sdl` on these machines), and
add their resulting block device IDs to `/etc/fstab` so they get persistently mounted on reboot.
Finally, I'll create their mountpoints, mount all XFS filesystems, and chown them to the user that
MinIO runs as. End result:
```
pim@minio0-chbtl0:~$ df -T
Filesystem Type 1K-blocks Used Available Use% Mounted on
udev devtmpfs 32950856 0 32950856 0% /dev
tmpfs tmpfs 6595340 1508 6593832 1% /run
/dev/md0 ext4 114695308 5423976 103398948 5% /
tmpfs tmpfs 32976680 0 32976680 0% /dev/shm
tmpfs tmpfs 5120 4 5116 1% /run/lock
/dev/sda xfs 15623792640 121505936 15502286704 1% /minio/disk1
/dev/sde xfs 15623792640 121505968 15502286672 1% /minio/disk12
/dev/sdi xfs 15623792640 121505968 15502286672 1% /minio/disk11
/dev/sdl xfs 15623792640 121505904 15502286736 1% /minio/disk10
/dev/sdd xfs 15623792640 121505936 15502286704 1% /minio/disk4
/dev/sdb xfs 15623792640 121505968 15502286672 1% /minio/disk3
/dev/sdk xfs 15623792640 121505936 15502286704 1% /minio/disk5
/dev/sdc xfs 15623792640 121505936 15502286704 1% /minio/disk9
/dev/sdf xfs 15623792640 121506000 15502286640 1% /minio/disk2
/dev/sdj xfs 15623792640 121505968 15502286672 1% /minio/disk7
/dev/sdg xfs 15623792640 121506000 15502286640 1% /minio/disk8
/dev/sdh xfs 15623792640 121505968 15502286672 1% /minio/disk6
tmpfs tmpfs 6595336 0 6595336 0% /run/user/0
```
MinIO likes to be configured using environment variables - likely because it's commonly run in a
containerized environment like Kubernetes. The maintainers also ship it as a
Debian package, which reads its environment from `/etc/default/minio`, and I'll prepare that
file as follows:
```
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/default/minio
MINIO_DOMAIN="s3.chbtl0.ipng.ch,minio0.chbtl0.net.ipng.ch"
MINIO_ROOT_USER="XXX"
MINIO_ROOT_PASSWORD="YYY"
MINIO_VOLUMES="/minio/disk{1...12}"
MINIO_OPTS="--console-address :9001"
EOF
pim@minio0-chbtl0:~$ sudo systemctl enable --now minio
pim@minio0-chbtl0:~$ sudo journalctl -u minio
May 31 10:44:11 minio0-chbtl0 minio[690420]: MinIO Object Storage Server
May 31 10:44:11 minio0-chbtl0 minio[690420]: Copyright: 2015-2025 MinIO, Inc.
May 31 10:44:11 minio0-chbtl0 minio[690420]: License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
May 31 10:44:11 minio0-chbtl0 minio[690420]: Version: RELEASE.2025-05-24T17-08-30Z (go1.24.3 linux/amd64)
May 31 10:44:11 minio0-chbtl0 minio[690420]: API: http://198.19.4.11:9000 http://127.0.0.1:9000
May 31 10:44:11 minio0-chbtl0 minio[690420]: WebUI: https://cons0-s3.chbtl0.ipng.ch/
May 31 10:44:11 minio0-chbtl0 minio[690420]: Docs: https://docs.min.io
pim@minio0-chbtl0:~$ sudo ipmitool sensor | grep Watts
Pwr Consumption | 154.000 | Watts
```
Incidentally - I am pretty pleased with this 192TB disk tank, sporting 24 cores, 64GB memory and
2x10G network, casually hanging out at 154 Watts of power all up. Slick!
{{< image float="right" src="/assets/minio/minio-ec.svg" alt="MinIO Erasure Coding" width="22em" >}}
MinIO implements _erasure coding_ as a core component in providing availability and resiliency
during drive or node-level failure events. MinIO partitions each object into data and parity shards
and distributes those shards across a single so-called _erasure set_. Under the hood, it uses
[[Reed-Solomon](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction)] erasure coding
to partition each object for distribution. From the MinIO website, I'll borrow a
diagram (shown to the right) of what this looks like on a single node like mine.
Anyway, MinIO detects 12 disks and installs an erasure set with 8 data disks and 4 parity disks,
which it calls `EC:4` encoding, also known in the industry as `RS8.4`.
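A quick back-of-the-envelope check of what `EC:4` means for usable capacity on this node, assuming the marketing terabyte of 10^12 bytes: 8 of every 12 drives hold data, so 12x 16TB of raw disk ends up at roughly the 116 TiB that `mc admin info` reports further down:
```
pim@summer:~$ echo '12 * 16 * (8/12)' | bc -l       # usable capacity in TB with 8 data + 4 parity
128.00000000000000000000
pim@summer:~$ echo '(128 * 10^12) / 2^40' | bc -l   # the same, expressed in TiB
116.41532182693481445312
```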
Just like that, the thing shoots to life. Awesome!
### MinIO Client
On Summer, I'll install the MinIO client, called `mc`. This is easy because the maintainers ship a
Linux binary which I can just download. On OpenBSD, they don't do that. Not a problem though: on
Squanchy, Pencilvester and Glootie, I will just `go install` the client. Using the `mc` commandline,
I can call any of the S3 APIs on my new MinIO instance:
```
pim@summer:~$ set +o history
pim@summer:~$ mc alias set chbtl0 https://s3.chbtl0.ipng.ch/ <rootuser> <rootpass>
pim@summer:~$ set -o history
pim@summer:~$ mc admin info chbtl0/
● s3.chbtl0.ipng.ch
Uptime: 22 hours
Version: 2025-05-24T17:08:30Z
Network: 1/1 OK
Drives: 12/12 OK
Pool: 1
┌──────┬───────────────────────┬─────────────────────┬──────────────┐
│ Pool │ Drives Usage │ Erasure stripe size │ Erasure sets │
│ 1st │ 0.8% (total: 116 TiB) │ 12 │ 1 │
└──────┴───────────────────────┴─────────────────────┴──────────────┘
95 GiB Used, 5 Buckets, 5,859 Objects, 318 Versions, 1 Delete Marker
12 drives online, 0 drives offline, EC:4
```
Cool beans. I think I should get rid of this root account though. I've installed those credentials
into the `/etc/default/minio` environment file, but I don't want to keep using them out in the open.
So I'll make an account for myself and assign it reasonable privileges, using the policy called
`consoleAdmin` in the default install:
```
pim@summer:~$ set +o history
pim@summer:~$ mc admin user add chbtl0/ <someuser> <somepass>
pim@summer:~$ mc admin policy info chbtl0 consoleAdmin
pim@summer:~$ mc admin policy attach chbtl0 consoleAdmin --user=<someuser>
pim@summer:~$ mc alias set chbtl0 https://s3.chbtl0.ipng.ch/ <someuser> <somepass>
pim@summer:~$ set -o history
```
OK, I feel less gross now that I'm not operating as root on the MinIO deployment. Using my new
user-powers, let me set some metadata on my new minio server:
```
pim@summer:~$ mc admin config set chbtl0/ site name=chbtl0 region=switzerland
Successfully applied new settings.
Please restart your server 'mc admin service restart chbtl0/'.
pim@summer:~$ mc admin service restart chbtl0/
Service status: ▰▰▱ [DONE]
Summary:
┌───────────────┬─────────────────────────────┐
│ Servers: │ 1 online, 0 offline, 0 hung │
│ Restart Time: │ 61.322886ms │
└───────────────┴─────────────────────────────┘
pim@summer:~$ mc admin config get chbtl0/ site
site name=chbtl0 region=switzerland
```
By the way, what's really cool about these open standards is that not only does the Amazon `aws`
client work with MinIO, but `mc` also works with AWS!
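As a quick illustration of that interoperability, the stock `aws` CLI can be pointed at the MinIO endpoint with `--endpoint-url`; this assumes the access key and secret from above have been fed to `aws configure`, and the bucket name here is just a placeholder:
```
pim@summer:~$ aws configure        # enter the MinIO access key, secret and region when prompted
pim@summer:~$ aws --endpoint-url https://s3.chbtl0.ipng.ch s3 ls                      # list buckets
pim@summer:~$ aws --endpoint-url https://s3.chbtl0.ipng.ch s3 cp README.md s3://some-bucket/
```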
### MinIO Console
Although I'm pretty good with APIs and command line tools, there's also some benefit in using a
Graphical User Interface. MinIO ships with one, but there was a bit of a kerfuffle in the MinIO
community. Unfortunately, these are pretty common -- Redis (an open source key/value storage system)
changed their offering abruptly. Terraform (an open source infrastructure-as-code tool) changed
their licensing at some point. Ansible (an open source machine management tool) changed their
offering also. The MinIO developers recently decided to strip their console of ~all features. The
gnarly bits are discussed on
[[reddit](https://www.reddit.com/r/selfhosted/comments/1kva3pw/avoid_minio_developers_introduce_trojan_horse/)],
but suffice to say: the same thing that happened in literally 100% of the other cases also happened
here. Somebody decided to simply fork the code from before it was changed.
Enter OpenMaxIO. A cringe-worthy name, but it gets the job done. Reading up on the
[[GitHub](https://github.com/OpenMaxIO/openmaxio-object-browser/issues/5)], reviving the fully
working console is pretty straightforward -- that is, once somebody has spent a few days figuring it
out. Thank you `icesvz` for this excellent pointer. With this, I can create a systemd service for
the console and start it:
```
pim@minio0-chbtl0:~$ cat << EOF | sudo tee -a /etc/default/minio
## NOTE(pim): For openmaxio console service
CONSOLE_MINIO_SERVER="http://localhost:9000"
MINIO_BROWSER_REDIRECT_URL="https://cons0-s3.chbtl0.ipng.ch/"
EOF
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /lib/systemd/system/minio-console.service
[Unit]
Description=OpenMaxIO Console Service
Wants=network-online.target
After=network-online.target
AssertFileIsExecutable=/usr/local/bin/minio-console
[Service]
Type=simple
WorkingDirectory=/usr/local
User=minio-user
Group=minio-user
ProtectProc=invisible
EnvironmentFile=-/etc/default/minio
ExecStart=/usr/local/bin/minio-console server
Restart=always
LimitNOFILE=1048576
MemoryAccounting=no
TasksMax=infinity
TimeoutSec=infinity
OOMScoreAdjust=-1000
SendSIGKILL=no
[Install]
WantedBy=multi-user.target
EOF
pim@minio0-chbtl0:~$ sudo systemctl enable --now minio-console
pim@minio0-chbtl0:~$ sudo systemctl restart minio
```
The first snippet is an update to the MinIO configuration that instructs it to redirect users who
are not trying to use the API to the console endpoint on `cons0-s3.chbtl0.ipng.ch`. The console
server in turn needs to know where to find the API, which from its vantage point is running on
`localhost:9000`. Hello, beautiful fully featured console:
{{< image src="/assets/minio/console-1.png" alt="MinIO Console" >}}
### MinIO Prometheus
MinIO ships with a prometheus metrics endpoint, and I notice on its console that it has a nice
metrics tab, which is fully greyed out. This is most likely because, well, I don't have a Prometheus
install here yet. I decide to keep the storage nodes self-contained and start a Prometheus server on
the local machine. I can always plumb that to IPng's Grafana instance later.
For now, I'll install Prometheus as follows:
```
pim@minio0-chbtl0:~$ cat << EOF | sudo tee -a /etc/default/minio
## NOTE(pim): Metrics for minio-console
MINIO_PROMETHEUS_AUTH_TYPE="public"
CONSOLE_PROMETHEUS_URL="http://localhost:19090/"
CONSOLE_PROMETHEUS_JOB_ID="minio-job"
EOF
pim@minio0-chbtl0:~$ sudo apt install prometheus
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/default/prometheus
ARGS="--web.listen-address='[::]:19090' --storage.tsdb.retention.size=16GB"
EOF
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/prometheus/prometheus.yml
global:
  scrape_interval: 60s
scrape_configs:
  - job_name: minio-job
    metrics_path: /minio/v2/metrics/cluster
    static_configs:
      - targets: ['localhost:9000']
        labels:
          cluster: minio0-chbtl0
  - job_name: minio-job-node
    metrics_path: /minio/v2/metrics/node
    static_configs:
      - targets: ['localhost:9000']
        labels:
          cluster: minio0-chbtl0
  - job_name: minio-job-bucket
    metrics_path: /minio/v2/metrics/bucket
    static_configs:
      - targets: ['localhost:9000']
        labels:
          cluster: minio0-chbtl0
  - job_name: minio-job-resource
    metrics_path: /minio/v2/metrics/resource
    static_configs:
      - targets: ['localhost:9000']
        labels:
          cluster: minio0-chbtl0
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
        labels:
          cluster: minio0-chbtl0
EOF
pim@minio0-chbtl0:~$ sudo systemctl restart minio prometheus
```
In the first snippet, I tell MinIO where it should find its Prometheus instance. Since the MinIO
console service is running on port 9090, and this is also the default port for Prometheus, I will
run Prometheus on port 19090 instead. From reading the MinIO docs, I can see that normally MinIO
will want Prometheus to authenticate to it before it allows the endpoints to be scraped. I'll turn
that off by making these public. On the IPng frontends, I can always remove access to `/minio/v2`
and simply use IPng Site Local access for local Prometheus scrapers instead.
After telling Prometheus its runtime arguments (in `/etc/default/prometheus`) and its scraping
endpoints (in `/etc/prometheus/prometheus.yml`), I can restart minio and prometheus. A few minutes
later, I can see the _Metrics_ tab in the console come to life.
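A quick sanity check of both halves, using the endpoints configured above, goes a long way; the exact metric names and JSON output differ per MinIO and Prometheus release, but the first command should return Prometheus-formatted metrics and the second should mention the `minio-job` targets at least once:
```
pim@minio0-chbtl0:~$ curl -s http://localhost:9000/minio/v2/metrics/cluster | head -3
pim@minio0-chbtl0:~$ curl -s http://localhost:19090/api/v1/targets | grep -c minio-job
```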
But now that I have this prometheus running on the MinIO node, I can also add it to IPng's Grafana
configuration, by adding a new data source on `minio0.chbtl0.net.ipng.ch:19090` and pointing the
default Grafana [[Dashboard](https://grafana.com/grafana/dashboards/13502-minio-dashboard/)] at it:
{{< image src="/assets/minio/console-2.png" alt="Grafana Dashboard" >}}
A two-for-one: not only will I be able to see metrics directly in the console, I will also be able
to hook these per-node Prometheus instances into IPng's alertmanager later. I've read some
[[docs](https://min.io/docs/minio/linux/operations/monitoring/collect-minio-metrics-using-prometheus.html)]
on the concepts, and I'm really liking the experience so far!
### MinIO Nagios
Prometheus is fancy and all, but at IPng Networks, I've been doing monitoring for a while now. As a
dinosaur, I still have an active [[Nagios](https://www.nagios.org/)] install, which autogenerates
all of its configuration using the Ansible repository I have. So for the new Ansible group called
`minio`, I will autogenerate the following snippet:
```
define command {
  command_name ipng_check_minio
  command_line $USER1$/check_http -E -H $HOSTALIAS$ -I $ARG1$ -p $ARG2$ -u $ARG3$ -r '$ARG4$'
}
define service {
  hostgroup_name        ipng:minio:ipv6
  service_description   minio6:api
  check_command         ipng_check_minio!$_HOSTADDRESS6$!9000!/minio/health/cluster!
  use                   ipng-service-fast
  notification_interval 0 ; set > 0 if you want to be renotified
}
define service {
  hostgroup_name        ipng:minio:ipv6
  service_description   minio6:prom
  check_command         ipng_check_minio!$_HOSTADDRESS6$!19090!/classic/targets!minio-job
  use                   ipng-service-fast
  notification_interval 0 ; set > 0 if you want to be renotified
}
define service {
  hostgroup_name        ipng:minio:ipv6
  service_description   minio6:console
  check_command         ipng_check_minio!$_HOSTADDRESS6$!9090!/!MinIO Console
  use                   ipng-service-fast
  notification_interval 0 ; set > 0 if you want to be renotified
}
I've shown the snippet for IPv6, but I also have three services defined for legacy IP in the
hostgroup `ipng:minio:ipv4`. The check command here uses `-I` for the IPv4 or IPv6 address to
talk to, `-p` for the port to consult, `-u` for the URI to hit, and an optional `-r` for a regular
expression to expect in the output. For the Nagios aficionados out there: my Ansible `groups`
correspond one-to-one with autogenerated Nagios `hostgroups`. This allows me to add arbitrary checks
by group-type, like above in the `ipng:minio` group for IPv4 and IPv6.
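When a check misbehaves, I find it helpful to run the plugin by hand, the same way Nagios expands the command definition above; the plugin path and the use of a hostname instead of a literal address for `-I` are local assumptions here:
```
$ /usr/lib/nagios/plugins/check_http -E -H minio0.chbtl0.net.ipng.ch \
    -I minio0.chbtl0.net.ipng.ch -p 9000 -u /minio/health/cluster
$ /usr/lib/nagios/plugins/check_http -E -H minio0.chbtl0.net.ipng.ch \
    -I minio0.chbtl0.net.ipng.ch -p 19090 -u /classic/targets -r 'minio-job'
```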
In the MinIO [[docs](https://min.io/docs/minio/linux/operations/monitoring/healthcheck-probe.html)]
I read up on the Healthcheck API. I choose to monitor the _Cluster Write Quorum_ on my minio
deployments. For Prometheus, I decide to hit the `targets` endpoint and expect the `minio-job` to be
among them. Finally, for the MinIO Console, I expect to see a login screen with the words `MinIO
Console` in the returned page. I guessed right, because Nagios is all green:
{{< image src="/assets/minio/nagios.png" alt="Nagios Dashboard" >}}
## My First Bucket
The IPng website is a statically generated Hugo site, and whenever I submit a change to my Git
repo, a CI/CD runner (called [[Drone](https://www.drone.io/)]) picks up the change. It re-builds
the static website, and copies it to four redundant NGINX servers.
But IPng's website has amassed quite a few extra files (like VM images and VPP packages that I
publish), which are copied separately using a simple push script I have in my home directory. This
keeps all those big media files from cluttering the Git repository. I decide to move this stuff
into S3:
```
pim@summer:~/src/ipng-web-assets$ echo 'Gruezi World.' > ipng.ch/media/README.md
pim@summer:~/src/ipng-web-assets$ mc mb chbtl0/ipng-web-assets
pim@summer:~/src/ipng-web-assets$ mc mirror . chbtl0/ipng-web-assets/
...ch/media/README.md: 6.50 GiB / 6.50 GiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 236.38 MiB/s 28s
pim@summer:~/src/ipng-web-assets$ mc anonymous set download chbtl0/ipng-web-assets/
```
OK, two things immediately jump out at me. This stuff is **fast**: Summer is connected with a
2.5GbE network card, and she's running hard, copying the 6.5GB of data that are in these web assets
essentially at line rate. It doesn't really surprise me because Summer is running off of Gen4 NVME,
while MinIO has 12 spinning disks which each can write about 160MB/s or so sustained
[[ref](https://www.seagate.com/www-content/datasheets/pdfs/exos-x16-DS2011-1-1904US-en_US.pdf)],
with 24 CPUs to tend to the NIC (2x10G) and disks (2x SSD, 12x LFF). Should be plenty!
The second is that MinIO allows for buckets to be publicly shared in three ways: 1) read-only by
setting `download`; 2) write-only by setting `upload`, and 3) read-write by setting `public`.
I set `download` here, which means I should be able to fetch an asset now publicly:
```
pim@summer:~$ curl https://s3.chbtl0.ipng.ch/ipng-web-assets/ipng.ch/media/README.md
Gruezi World.
pim@summer:~$ curl https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/README.md
Gruezi World.
```
The first `curl` here shows the path-based access, while the second one shows an equivalent
virtual-host based access. Both retrieve the file I just pushed via the public Internet. Whoot!
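For completeness, this is what the three anonymous-access modes look like with `mc`; the bucket name below is just a placeholder:
```
pim@summer:~$ mc anonymous set download chbtl0/some-bucket   # anonymous read-only
pim@summer:~$ mc anonymous set upload chbtl0/some-bucket     # anonymous write-only
pim@summer:~$ mc anonymous set public chbtl0/some-bucket     # anonymous read-write
pim@summer:~$ mc anonymous get chbtl0/some-bucket            # show the current policy
```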
# What's Next
I'm going to be moving [[Restic](https://restic.net/)] backups from IPng's ZFS storage pool to this
S3 service over the next few days. I'll also migrate PeerTube and possibly Mastodon from NVMe-based
storage to replicated S3 buckets. Finally, the IPng website media that I mentioned above should
make for a nice followup article. Stay tuned!

---
date: "2025-06-01T10:07:23Z"
title: 'Case Study: Minio S3 - Part 2'
---
{{< image float="right" src="/assets/minio/minio-logo.png" alt="MinIO Logo" width="6em" >}}
# Introduction
Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading
scalability, data availability, security, and performance. Millions of customers of all sizes and
industries store, manage, analyze, and protect any amount of data for virtually any use case, such
as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and
easy-to-use management features, you can optimize costs, organize and analyze data, and configure
fine-tuned access controls to meet specific business and compliance requirements.
Amazon's S3 became the _de facto_ standard object storage system, and there exist several fully open
source implementations of the protocol. One of them is MinIO: designed to allow enterprises to
consolidate all of their data on a single, private cloud namespace. Architected using the same
principles as the hyperscalers, AIStor delivers performance at scale at a fraction of the cost
compared to the public cloud.
IPng Networks is an Internet Service Provider, but I also dabble in self-hosting things, for
example [[PeerTube](https://video.ipng.ch/)], [[Mastodon](https://ublog.tech/)],
[[Immich](https://photos.ipng.ch/)], [[Pixelfed](https://pix.ublog.tech/)] and of course
[[Hugo](https://ipng.ch/)]. These services all have one thing in common: they tend to use lots of
storage when they grow. At IPng Networks, all hypervisors ship with enterprise SAS flash drives,
mostly 1.92TB and 3.84TB. Scaling up each of these services, and backing them up safely, can be
quite the headache.
In a [[previous article]({{< ref 2025-05-28-minio-1 >}})], I talked through the install of a
redundant set of three Minio machines. In this article, I'll start putting them to good use.
## Use Case: Restic
{{< image float="right" src="/assets/minio/restic-logo.png" alt="Restic Logo" width="12em" >}}
[[Restic](https://restic.org/)] is a modern backup program that can back up your files from multiple
host OSes, to many different storage types, easily, effectively, securely, verifiably and freely. With
a sales pitch like that, what's not to love? Actually, I am a long-time
[[BorgBackup](https://www.borgbackup.org/)] user, and I think I'll keep that running. However, for
resilience, and because I've heard only good things about Restic, I'll make a second backup of the
routers, hypervisors, and virtual machines using Restic.
Restic can use S3 buckets out of the box (incidentally, so can BorgBackup). To configure it, I use
a mixture of environment variables and flags. But first, let me create a bucket for the backups.
```
pim@glootie:~$ mc mb chbtl0/ipng-restic
pim@glootie:~$ mc admin user add chbtl0/ <key> <secret>
pim@glootie:~$ cat << EOF | tee ipng-restic-access.json
{
  "PolicyName": "ipng-restic-access",
  "Policy": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [ "s3:DeleteObject", "s3:GetObject", "s3:ListBucket", "s3:PutObject" ],
        "Resource": [ "arn:aws:s3:::ipng-restic", "arn:aws:s3:::ipng-restic/*" ]
      }
    ]
  }
}
EOF
pim@glootie:~$ mc admin policy create chbtl0/ ipng-restic-access.json
pim@glootie:~$ mc admin policy attach chbtl0/ ipng-restic-access --user <key>
```
First, I'll create a bucket called `ipng-restic`. Then, I'll create a _user_ with a given secret
_key_. To protect the innocent, and my backups, I'll not disclose them. Next, I'll create an
IAM policy that allows for Get/List/Put/Delete to be performed on the bucket and its contents, and
finally I'll attach this policy to the user I just created.
To run a Restic backup, I'll first have to create a so-called _repository_. The repository has a
location and a password, which Restic uses to encrypt the data. Because I'm using S3, I'll also need
to specify the key and secret:
```
root@glootie:~# RESTIC_PASSWORD="changeme"
root@glootie:~# RESTIC_REPOSITORY="s3:https://s3.chbtl0.ipng.ch/ipng-restic/$(hostname)/"
root@glootie:~# AWS_ACCESS_KEY_ID="<key>"
root@glootie:~# AWS_SECRET_ACCESS_KEY="<secret>"
root@glootie:~# export RESTIC_PASSWORD RESTIC_REPOSITORY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
root@glootie:~# restic init
created restic repository 807cf25e85 at s3:https://s3.chbtl0.ipng.ch/ipng-restic/glootie.ipng.ch/
```
Restic prints the identifier of the repository it just created. Taking a look at the MinIO
install:
```
pim@glootie:~$ mc stat chbtl0/ipng-restic/glootie.ipng.ch/
Name : config
Date : 2025-06-01 12:01:43 UTC
Size : 155 B
ETag : 661a43f72c43080649712e45da14da3a
Type : file
Metadata :
Content-Type: application/octet-stream
Name : keys/
Date : 2025-06-01 12:03:33 UTC
Type : folder
```
Cool. Now I'm ready to make my first full backup:
```
root@glootie:~# ARGS="--exclude /proc --exclude /sys --exclude /dev --exclude /run"
root@glootie:~# ARGS="$ARGS --exclude-if-present .nobackup"
root@glootie:~# restic backup $ARGS /
...
processed 1141426 files, 131.111 GiB in 15:12
snapshot 34476c74 saved
```
Once the backup completes, the Restic authors advise me to also run a check of the repository, and
to prune it so that it keeps a finite number of daily, weekly and monthly backups. My further
journey for Restic looks a bit like this:
```
root@glootie:~# restic check
using temporary cache in /tmp/restic-check-cache-2712250731
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
[0:04] 100.00% 1 / 1 snapshots
no errors were found
root@glootie:~# restic forget --prune --keep-daily 8 --keep-weekly 5 --keep-monthly 6
repository 34476c74 opened (version 2, compression level auto)
Applying Policy: keep 8 daily, 5 weekly, 6 monthly snapshots
keep 1 snapshots:
ID Time Host Tags Reasons Paths
---------------------------------------------------------------------------------
34476c74 2025-06-01 12:18:54 glootie.ipng.ch daily snapshot /
weekly snapshot
monthly snapshot
----------------------------------------------------------------------------------
1 snapshots
```
Right on! I proceed to update the Ansible configs at IPng to roll this out against the entire fleet
of 152 hosts at IPng Networks. I do this in a little tool called `bitcron`, which I wrote for a
previous company I worked at: [[BIT](https://bit.nl)] in the Netherlands. Bitcron allows me to
create relatively elegant cronjobs that can raise warnings, errors and fatal issues. If no issues
are found, an e-mail can be sent to a bitbucket address, but if warnings or errors are found, a
different _monitored_ address will be used. Bitcron is kind of cool, and I wrote it in 2001. Maybe
I'll write about it, for old time's sake. I wonder if the folks at BIT still use it?
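The fleet rollout itself is just Ansible templating a nightly job per host. As a sketch of what each host ends up running (not the actual bitcron wrapper, and with `/etc/restic.env` as an assumed location for the `RESTIC_*` and `AWS_*` variables shown earlier), it boils down to something like:
```
#!/bin/sh
## Sketch of a nightly per-host backup job; /etc/restic.env is an assumed file
## holding the RESTIC_* and AWS_* variables shown earlier.
set -e
. /etc/restic.env
restic backup --exclude /proc --exclude /sys --exclude /dev --exclude /run \
       --exclude-if-present .nobackup /
restic forget --prune --keep-daily 8 --keep-weekly 5 --keep-monthly 6
restic check
```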
## Use Case: NGINX
{{< image float="right" src="/assets/minio/nginx-logo.png" alt="NGINX Logo" width="11em" >}}
OK, with the first use case out of the way, I turn my attention to a second - in my opinion more
interesting - use case. In the [[previous article]({{< ref 2025-05-28-minio-1 >}})], I created a
public bucket called `ipng-web-assets` in which I stored 6.50GB of website data belonging to the
IPng website, and some material I posted when I was on my
[[Sabbatical](https://sabbatical.ipng.nl/)] last year.
### MinIO: Bucket Replication
First things first: redundancy. These web assets are currently pushed to all four nginx machines,
and statically served. If I were to replace them with a single S3 bucket, I would create a single
point of failure, and that's _no bueno_!
Off I go, creating a replicated bucket using two MinIO instances (`chbtl0` and `ddln0`):
```
pim@glootie:~$ mc mb ddln0/ipng-web-assets
pim@glootie:~$ mc anonymous set download ddln0/ipng-web-assets
pim@glootie:~$ mc admin user add ddln0/ <replkey> <replsecret>
pim@glootie:~$ cat << EOF | tee ipng-web-assets-access.json
{
  "PolicyName": "ipng-web-assets-access",
  "Policy": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [ "s3:DeleteObject", "s3:GetObject", "s3:ListBucket", "s3:PutObject" ],
        "Resource": [ "arn:aws:s3:::ipng-web-assets", "arn:aws:s3:::ipng-web-assets/*" ]
      }
    ]
  }
}
EOF
pim@glootie:~$ mc admin policy create ddln0/ ipng-web-assets-access.json
pim@glootie:~$ mc admin policy attach ddln0/ ipng-web-assets-access --user <replkey>
pim@glootie:~$ mc replicate add chbtl0/ipng-web-assets \
--remote-bucket https://<key>:<secret>@s3.ddln0.ipng.ch/ipng-web-assets
```
What happens next is pure magic. I've told `chbtl0` that I want it to replicate all existing and
future changes to that bucket to its neighbor `ddln0`. Only minutes later, I check the replication
status, just to see that it's _already done_:
```
pim@glootie:~$ mc replicate status chbtl0/ipng-web-assets
Replication status since 1 hour
s3.ddln0.ipng.ch
Replicated: 142 objects (6.5 GiB)
Queued: ● 0 objects, 0 B (avg: 4 objects, 915 MiB ; max: 0 objects, 0 B)
Workers: 0 (avg: 0; max: 0)
Transfer Rate: 15 kB/s (avg: 88 MB/s; max: 719 MB/s
Latency: 3ms (avg: 3ms; max: 7ms)
Link: ● online (total downtime: 0 milliseconds)
Errors: 0 in last 1 minute; 0 in last 1hr; 0 since uptime
Configured Max Bandwidth (Bps): 644 GB/s Current Bandwidth (Bps): 975 B/s
pim@summer:~/src/ipng-web-assets$ mc ls ddln0/ipng-web-assets/
[2025-06-01 12:42:22 CEST] 0B ipng.ch/
[2025-06-01 12:42:22 CEST] 0B sabbatical.ipng.nl/
```
MinIO has pumped the data from bucket `ipng-web-assets` to the other machine at an average of 88MB/s
with a peak throughput of 719MB/s (probably for the larger VM images). And indeed, looking at the
remote machine, it is fully caught up after the push, within only a minute or so with a completely
fresh copy. Nice!
### MinIO: Missing directory index
I take a look at what I just built, on the following URL:
* [https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/IMG_0406_0.mp4](https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/IMG_0406_0.mp4)
That checks out, and I can see the mess that was my room when I first went on sabbatical. By the
way, I totally cleaned it up, see
[[here](https://sabbatical.ipng.nl/blog/2024/08/01/thursday-basement-done/)] for proof. I can't,
however, see the directory listing:
```
pim@glootie:~$ curl https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Key>sabbatical.ipng.nl/media/vdo/</Key>
<BucketName>ipng-web-assets</BucketName>
<Resource>/sabbatical.ipng.nl/media/vdo/</Resource>
<RequestId>1844EC0CFEBF3C5F</RequestId>
<HostId>dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8</HostId>
</Error>
```
That's unfortunate, because some of the IPng articles link to a directory full of files, which I'd
like to be shown so that my readers can navigate through the directories. Surely I'm not the first
to encounter this? And sure enough, I'm not: I find this
[[ref](https://github.com/glowinthedark/index-html-generator)] by user `glowinthedark`, who wrote a
little Python script that generates `index.html` files for their Caddy file server. I'll take me
some of that Python, thank you!
With the following little script, my setup is complete:
```
pim@glootie:~/src/ipng-web-assets$ cat push.sh
#!/usr/bin/env bash
echo "Generating index.html files ..."
for D in */media; do
  echo "* Directory $D"
  ./genindex.py -r $D
done
echo "Done (genindex)"
echo ""
echo "Mirroring directory to S3 Bucket"
mc mirror --remove --overwrite . chbtl0/ipng-web-assets/
echo "Done (mc mirror)"
echo ""
pim@glootie:~/src/ipng-web-assets$ ./push.sh
```
Only a few seconds after I run `./push.sh`, the replication is complete and I have two identical
copies of my media:
1. [https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/](https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/index.html)
1. [https://ipng-web-assets.s3.ddln0.ipng.ch/ipng.ch/media/](https://ipng-web-assets.s3.ddln0.ipng.ch/ipng.ch/media/index.html)
### NGINX: Proxy to Minio
Before moving to S3 storage, my NGINX frontends all kept a copy of the IPng media on local NVMe
disk. That's great for reliability, as each NGINX instance is completely hermetic and standalone.
However, it's not great for scaling: the current NGINX instances only have 16GB of local storage,
and I'd rather not have my static web asset data outgrow that filesystem. From before, I already had
an NGINX config that served the Hugo static data from `/var/www/ipng.ch/` and the `/media`
subdirectory from a different directory, `/var/www/ipng-web-assets/ipng.ch/media`.
Moving to a redundant S3 storage backend is straightforward:
```
upstream minio_ipng {
  least_conn;
  server minio0.chbtl0.net.ipng.ch:9000;
  server minio0.ddln0.net.ipng.ch:9000;
}
server {
  ...
  location / {
    root /var/www/ipng.ch/;
  }
  location /media {
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_connect_timeout 300;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    chunked_transfer_encoding off;
    rewrite (.*)/$ $1/index.html;
    proxy_pass http://minio_ipng/ipng-web-assets/ipng.ch/media;
  }
}
```
I want to make note of a few things:
1. The `upstream` definition here uses IPng Site Local entrypoints, considering the NGINX servers
   all have direct MTU=9000 access to the MinIO instances. I'll put both in there, in a
   round-robin configuration favoring the replica with the _least connections_.
1. Deeplinking to directory names without the trailing `/index.html` would serve a 404 from the
   backend, so I'll intercept these and rewrite directory URLs to always include `/index.html`.
1. The upstream endpoint used is _path-based_, that is to say it has the bucket name and website
   name included. This whole location used to be simply `root /var/www/ipng-web-assets/ipng.ch/media/`,
   so the mental change is quite small.
### NGINX: Caching
After deploying the S3 upstream on all IPng websites, I can delete the old
`/var/www/ipng-web-assets/` directory and reclaim about 7GB of diskspace. This gives me an idea ...
{{< image width="8em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
On the one hand it's great that I will pull these assets from Minio and all, but at the same time,
it's a tad inefficient to retrieve them from, say, Zurich to Amsterdam just to serve them onto the
internet again. If at any time something on the IPng website goes viral, it'd be nice to be able to
serve them directly from the edge, right?
A webcache. What could _possibly_ go wrong :)
NGINX is really really good at caching content. It has a powerful engine to store, scan, revalidate
and match any content and upstream headers. It's also very well documented, so I take a look at the
proxy module's documentation [[here](https://nginx.org/en/docs/http/ngx_http_proxy_module.html)] and
in particular a useful [[blog](https://blog.nginx.org/blog/nginx-caching-guide)] on their website.
The first thing I need to do is create what is called a _key zone_, which is a region of memory in
which URL keys are stored with some metadata. Having a copy of the keys in memory enables NGINX to
quickly determine if a request is a HIT or a MISS without having to go to disk, greatly speeding up
the check.
In `/etc/nginx/conf.d/ipng-cache.conf` I add the following NGINX cache:
```
proxy_cache_path /var/www/nginx-cache levels=1:2 keys_zone=ipng_cache:10m max_size=8g
inactive=24h use_temp_path=off;
```
With this statement, I'll create a 2-level subdirectory structure, and allocate 10MB of key space,
which should hold on the order of 100K entries. The maximum size I'll allow the cache to grow to is
8GB, and I'll mark any object inactive if it's not been referenced for 24 hours. I learn that
inactive is different from expired content. If a cache element has expired, but NGINX can't reach
the upstream for a new copy, it can be configured to serve an inactive (stale) copy from the cache.
That's dope, as it serves as an extra layer of defence in case the network or all available S3
replicas take the day off. I'll also ask NGINX to avoid writing objects first to a tmp directory and
then moving them into the `/var/www/nginx-cache` directory. These are recommendations I grab from
the manual.
Within the `location` block I configured above, I'm now ready to enable this cache. I'll do that by
adding a few include files, which I'll reference in all sites that I want to make use of this
cache:
First, to enable the cache, I write the following snippet:
```
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-cache.inc
proxy_cache ipng_cache;
proxy_ignore_headers Cache-Control;
proxy_cache_valid any 1h;
proxy_cache_revalidate on;
proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
proxy_cache_background_update on;
```
Then, I find it useful to emit a few debugging HTTP headers, and at the same time I see that Minio
emits a bunch of HTTP headers that may not be safe for me to propagate, so I pen two more snippets:
```
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-strip-minio-headers.inc
proxy_hide_header x-minio-deployment-id;
proxy_hide_header x-amz-request-id;
proxy_hide_header x-amz-id-2;
proxy_hide_header x-amz-replication-status;
proxy_hide_header x-amz-version-id;
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-add-upstream-headers.inc
add_header X-IPng-Frontend $hostname always;
add_header X-IPng-Upstream $upstream_addr always;
add_header X-IPng-Upstream-Status $upstream_status always;
add_header X-IPng-Cache-Status $upstream_cache_status;
```
With that, I am ready to enable caching of the IPng `/media` location:
```
location /media {
...
include /etc/nginx/conf.d/ipng-strip-minio-headers.inc;
include /etc/nginx/conf.d/ipng-add-upstream-headers.inc;
include /etc/nginx/conf.d/ipng-cache.inc;
...
}
```
## Results
I run the Ansible playbook for the NGINX cluster and take a look at the replica at Coloclue in
Amsterdam, called `nginx0.nlams1.ipng.ch`. Notably, it'll have to retrieve the file from a MinIO
replica in Zurich (12ms away), so it's expected to take a little while.
The first attempt:
```
pim@nginx0-nlams1:~$ curl -v -o /dev/null --connect-to ipng.ch:443:localhost:443 \
https://ipng.ch/media/vpp-proto/vpp-proto-bookworm.qcow2.lrz
...
< last-modified: Sun, 01 Jun 2025 12:37:52 GMT
< x-ipng-frontend: nginx0-nlams1
< x-ipng-cache-status: MISS
< x-ipng-upstream: [2001:678:d78:503::b]:9000
< x-ipng-upstream-status: 200
100 711M 100 711M 0 0 26.2M 0 0:00:27 0:00:27 --:--:-- 26.6M
```
OK, that's respectable, I've read the file at 26MB/s. Of course I just turned on the cache, so the
NGINX fetches the file from Zurich while handing it over to my `curl` here. It notifies me by means
of an HTTP header that the cache was a `MISS`, and also tells me which upstream server it contacted
to retrieve the object.
But look at what happens the _second_ time I run the same command:
```
pim@nginx0-nlams1:~$ curl -v -o /dev/null --connect-to ipng.ch:443:localhost:443 \
https://ipng.ch/media/vpp-proto/vpp-proto-bookworm.qcow2.lrz
< last-modified: Sun, 01 Jun 2025 12:37:52 GMT
< x-ipng-frontend: nginx0-nlams1
< x-ipng-cache-status: HIT
100 711M 100 711M 0 0 436M 0 0:00:01 0:00:01 --:--:-- 437M
```
Holy moly! First I see the object has the same _Last-Modified_ header, but I now also see that the
_Cache-Status_ was a `HIT`, and there is no mention of any upstream server. I do however see the
file come in at a whopping 437MB/s which is 16x faster than over the network!! Nice work, NGINX!
{{< image float="right" src="/assets/minio/rack-2.png" alt="Rack-o-Minio" width="12em" >}}
# What's Next
I'm going to deploy the third MinIO replica in R&uuml;mlang once the disks arrive. I'll release the
~4TB of disk currently used for Restic backups of the fleet, and put that ZFS capacity to other use.
Now, creating services like PeerTube, Mastodon, Pixelfed, Loops, NextCloud and what-have-you, will
become much easier for me. And with the per-bucket replication between MinIO deployments, I also
think this is a great way to auto-backup important data. First off, it'll be RS8.4 on the MinIO node
itself, and secondly, user data will be copied automatically to a neighboring facility.
I've convinced myself that S3 storage is a great service to operate, and that MinIO is awesome.

---
date: "2025-07-12T08:07:23Z"
title: 'VPP and eVPN/VxLAN - Part 1'
---
{{< image width="6em" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
# Introduction
You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I'm
the very last on the planet to learn about something cool. My latest "A-Ha!"-moment was when I was
configuring the eVPN fabric for [[Frys-IX](https://frys-ix.net/)], and I wrote up an article about
it [[here]({{< ref 2025-04-09-frysix-evpn >}})] back in April.
I can build the equivalent of Virtual Private Wires (VPWS), also called L2VPN or Virtual Leased
Lines, and these are straightforward because they typically only have two endpoints. A "regular"
VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a
look at an article on [[L2 Gymnastics]({{< ref 2022-01-12-vpp-l2 >}})] for that. But the real kicker
is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS), also
called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And *that* is a whole other
level of awesome.
## Recap: VPP today
### VPP: VxLAN
The current VPP VxLAN tunnel plugin does point-to-point tunnels; that is, they are configured with a
source address, destination address, destination port and VNI. As I mentioned, a point-to-point
ethernet transport is configured very easily:
```
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298 instance 0
vpp0# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/0
vpp0# set int l2 xconnect HundredGigabitEthernet10/0/0 vxlan_tunnel0
vpp0# set int state vxlan_tunnel0 up
vpp0# set int state HundredGigabitEthernet10/0/0 up
vpp1# create vxlan tunnel src 192.0.2.254 dst 192.0.2.1 vni 8298 instance 0
vpp1# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/1
vpp1# set int l2 xconnect HundredGigabitEthernet10/0/1 vxlan_tunnel0
vpp1# set int state vxlan_tunnel0 up
vpp1# set int state HundredGigabitEthernet10/0/1 up
```
And with that, `vpp0:Hu10/0/0` is cross connected with `vpp1:Hu10/0/1` and ethernet flows between
the two.
### VPP: Bridge Domains
Now consider a VPLS with five different routers. It is possible to create a bridge-domain and add
some local ports and four VxLAN tunnels, one to each of the other routers:
```
vpp0# create bridge-domain 8298
vpp0# set int l2 bridge HundredGigabitEthernet10/0/1 8298
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.2 vni 8298 instance 0
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.3 vni 8298 instance 1
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.4 vni 8298 instance 2
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.5 vni 8298 instance 3
vpp0# set int l2 bridge vxlan_tunnel0 8298
vpp0# set int l2 bridge vxlan_tunnel1 8298
vpp0# set int l2 bridge vxlan_tunnel2 8298
vpp0# set int l2 bridge vxlan_tunnel3 8298
```
To make this work, I will have to replicate this configuration to all other `vpp1`-`vpp4` routers.
While it does work, it's really not very practical. When other VPP instances get added to a VPLS,
every other router will have to have a new VxLAN tunnel created and added to its local bridge
domain. Consider 1000s of VPLS instances on 100s of routers: that would yield ~100'000 VxLAN tunnels
on every router, yikes!
Such a configuration reminds me in a way of iBGP in a large network: the naive approach is to have a
full mesh of all routers speaking to all other routers, but that quickly becomes a maintenance
headache. The canonical solution for this is to create iBGP _Route Reflectors_ to which every router
connects, and their job is to redistribute routing information between the fleet of routers. This
turns the iBGP problem from an O(N^2) to an O(N) problem: all 1'000 routers connect to, say, three
regional route reflectors for a total of 3'000 BGP connections, which is much better than ~1'000'000
BGP connections in the naive approach.
## Recap: eVPN Moving parts
The reason I got so enthusiastic when playing with Arista's and Nokia's eVPN stuff is that it
requires very little dataplane configuration, and a relatively intuitive controlplane
configuration:
1. **Dataplane**: For each L2 broadcast domain (be it a L2XC or a Bridge Domain), really all I
need is a single VxLAN interface with a given VNI, which should be able to send encapsulated
ethernet frames to one or more other speakers in the same domain.
1. **Controlplane**: I will need to learn MAC addresses locally, and inform some BGP eVPN
implementation of who-lives-where. Other VxLAN speakers learn of the MAC addresses I own, and
will send me encapsulated ethernet for those addresses.
1. **Dataplane**: For unknown layer2 destinations, like _Broadcast_, _Unknown Unicast_, and
_Multicast_ (BUM) traffic, I will want to keep track of which other VxLAN speakers these
packets should be flooded to. I note that this is not that different from flooding the packets
to local interfaces, except that here it'd be flooding them to remote VxLAN endpoints.
1. **Controlplane**: Flooding L2 traffic across wide area networks is typically considered icky,
so a few tricks might optionally be deployed. Since the controlplane already knows which MAC
lives where, it may as well also make note of any local IPv4 ARP and IPv6 neighbor discovery
replies and teach its peers which IPv4/IPv6 addresses live where: a distributed neighbor table.
{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
For the controlplane parts, [[FRRouting](https://frrouting.org/)] has a working implementation for
L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [[Bird](https://bird.nic.cz/)], is slowly catching up, and
has a few of these controlplane parts already working (mostly MAC-VRF). Commercial vendors like Arista,
Nokia, Juniper and Cisco are ready to go. If we want VPP to interoperate, we may need to make a few
changes.
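To give a sense of how small the controlplane configuration can be, here is a minimal FRR-style
sketch of an eVPN speaker peering with a route reflector. The AS number and neighbor address are
made up, and how the learned routes are fed into a dataplane is a separate topic:
```
router bgp 65500
 neighbor 192.0.2.254 remote-as 65500
 !
 address-family l2vpn evpn
  neighbor 192.0.2.254 activate
  advertise-all-vni
 exit-address-family
```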
## VPP: Changes needed
### Dynamic VxLAN
I propose two changes to the VxLAN plugin, or perhaps a new plugin, so that we don't have to break
any performance or functional promises to existing users. The new VxLAN interface behavior changes
in the following ways:
1. Each VxLAN interface has a local L2FIB attached to it; the keys are MAC addresses and the
values are remote VTEPs. In its simplest form, the values would be just IPv4 or IPv6 addresses,
because I can re-use the VNI and port information from the tunnel definition itself.
1. Each VxLAN interface has a local flood-list attached to it. This list contains remote VTEPs
that I am supposed to send 'flood' packets to. Similar to the Bridge Domain, when packets are marked
for flooding, I will need to prepare and replicate them, sending them to each VTEP.
A set of APIs will be needed to manipulate these:
* ***Interface***: I will need to have an interface create, delete and list call, which will
be able to maintain the interfaces, their metadata like source address, source/destination port,
VNI and such.
* ***L2FIB***: I will need to add, replace, delete, and list which MAC addresses go where.
With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the
dst_addr can be written into the packet.
* ***Flooding***: For those packets that are not unicast (BUM), I will need to be able to add,
remove and list which VTEPs should receive this packet.
It would be pretty dope if the configuration looked something like this:
```
vpp# create evpn-vxlan src <v46address> dst-port <port> vni <vni> instance <id>
vpp# evpn-vxlan l2fib <iface> mac <mac> dst <v46address> [del]
vpp# evpn-vxlan flood <iface> dst <v46address> [del]
```
The VxLAN underlay transport can be either IPv4 or IPv6. Of course manipulating L2FIB or Flood
destinations must match the address family of an interface of type evpn-vxlan. A practical example
might be:
```
vpp# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 6
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::2
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::3
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::4
```
By the way, while this _could_ be a new plugin, it could also just be added to the existing VxLAN
plugin. One way in which I might do this when creating a normal vxlan tunnel is to allow for its
destination address to be either 0.0.0.0 for IPv4 or :: for IPv6. That would signal 'dynamic'
tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN packet by
the time it takes to call `ip46_address_is_zero()`, which is only a handful of clocks.
### Bridge Domain
{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
It's important to understand that L2 learning is **required** for eVPN to function. Each router
needs to be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This
rules out the simple case of L2XC because there, no learning is performed. The corollary is that a
bridge-domain is required for any form of eVPN.
The L2 code in VPP already does most of what I'd need. It maintains an L2FIB in `vnet/l2/l2_fib.c`,
which is keyed by bridge-id and MAC address, and its values are a 64 bit structure that points
essentially to a `sw_if_index` output interface. The eVPN L2FIB needs a bit more information
though, notably an `ip46address` struct to know which VTEP to send to. It's tempting to add this
extra data to the bridge domain code. I would recommend against it, because other implementations,
for example MPLS, GENEVE or Carrier Pigeon IP may need more than just the destination address. Even
the VxLAN implementation I'm thinking about might want to be able to override other things like the
destination port for a given VTEP, or even the VNI. Putting all of this stuff in the bridge-domain
code will just clutter it, for all users, not just those users who might want eVPN.
Similarly, one might argue it is tempting to re-use/extend the behavior in `vnet/l2/l2_flood.c`,
because if it's already replicating BUM traffic, why not replicate it many times over the flood list
for any member interface that happens to be a dynamic VxLAN interface? This would be a bad idea
for a few reasons. Firstly, it is not guaranteed that the VxLAN plugin is loaded, and in
doing this, I would leak internal details of VxLAN into the bridge-domain code. Secondly, the
`l2_flood.c` code would potentially get messy if other types were added (like the MPLS and GENEVE
above).
A reasonable approach is to mark such BUM frames once in the existing L2 code, and when handing the
replicated packet to the VxLAN node, to look at the `is_bum` marker and replicate these packets once
again -- in the VxLAN plugin -- to the VTEPs in our local flood-list. Although a bit more effort, this
approach only requires a tiny amount of work in the `l2_flood.c` code (the marking), and will keep
all the logic tucked away where it is relevant, derisking the VPP vnet codebase.
Fundamentally, I think the cleanest design is to keep the dynamic VxLAN interface fully
self-contained: it would therefore maintain its own L2FIB and flooding logic. The only thing I
would add to the L2 codebase is some form of BUM marker to allow for efficient flooding.
### Control Plane
There's a few things the control plane has to do. Some external agent, like FRR or Bird, will be
receiving a few types of eVPN messages. The ones I'm interested in are:
* ***Type 2***: MAC/IP Advertisement Route
- On the way in, these should be fed to the VxLAN L2FIB belonging to the bridge-domain.
- On the way out, learned addresses should be advertised to peers.
- Regarding IPv4/IPv6 addresses, that is the ARP / ND tables: we can talk about those later.
* ***Type 3***: Inclusive Multicast Ethernet Tag Route
- On the way in, these will populate the VxLAN Flood list belonging to the bridge-domain.
- On the way out, each bridge-domain should advertise itself as IMET to peers.
* ***Type 5***: IP Prefix Route
- Similar to IP information in Type 2, we can talk about those later once L3VPN/eVPN is
needed.
The 'on the way in' stuff can be easily done with my proposed APIs in the Dynamic VxLAN (or a new
eVPN VxLAN) plugin. Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is
concerned. It's just that the controlplane implementation needs to somehow _feed_ the API, so an
external program may be needed, or alternatively the Linux Control Plane netlink plugin might be used
to consume this information.
The 'on the way out' stuff is a bit trickier. I will need to listen to the creation of new broadcast
domains and associate them with the right IMET announcements, and for each MAC address learned, pick
it up and advertise it into eVPN. Later, if ARP and ND proxying ever becomes important, I'll
have to revisit the bridge-domain feature to do IPv4 ARP and IPv6 Neighbor Discovery, and replace it
with some code that populates the IPv4/IPv6 parts of the Type2 messages on the way out, and
similarly on the way in, populates an L3 neighbor cache for the bridge domain, so ARP and ND replies
can be synthesized based on what we've learned in eVPN.
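Conceptually, the state that the controlplane has to keep synchronized into the dataplane per
bridge-domain boils down to the two tables from my proposal above. Reusing the addresses from the
earlier example, it would look roughly like this:
```
L2FIB (from Type 2 routes):      00:01:02:82:98:02 -> VTEP 2001:db8::2  (vni 8298)
                                 00:01:02:82:98:03 -> VTEP 2001:db8::3  (vni 8298)
Flood list (from Type 3 routes): 2001:db8::2, 2001:db8::3, 2001:db8::4
```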
# Demonstration
### VPP: Current VxLAN
I'll build a small demo environment on Summer to show how the interaction of VxLAN and Bridge
Domain works today:
```
vpp# create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24
vpp# set int state tap0 up
vpp# set int ip address tap0 192.0.2.1/24
vpp# set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static
vpp# set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static
vpp# set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static
vpp# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298
vpp# set int state vxlan_tunnel0 up
vpp# create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82
vpp# set int state tap1 up
vpp# create bridge-domain 8298
vpp# set int l2 bridge tap1 8298
vpp# set int l2 bridge vxlan_tunnel0 8298
```
I've created a tap device called `dummy0` and given it an IPv4 address. Normally, I would use some
DPDK or RDMA interface like `TenGigabitEthernet10/0/0`. Then I'll populate some static ARP entries.
Again, normally this would just be 'use normal routing'. However, for the purposes of this
demonstration, it helps to use a TAP device, as any packets VPP sends to 192.0.2.254 and friends
can be captured with `tcpdump` in Linux in addition to `trace add` in VPP.
Then, I create a VxLAN tunnel with a default destination of 192.0.2.254 and the given VNI.
Next, I create a TAP interface called `vpptap0` with the given MAC address.
Finally, I bind these two interfaces together in a bridge-domain.
I proceed to write a small ScaPY program:
```python
#!/usr/bin/env python3
from scapy.all import Ether, IP, UDP, Raw, sendp
pkt = Ether(dst="01:02:03:04:05:02", src="02:fe:64:dc:1b:82", type=0x0800)
/ IP(src="192.168.1.1", dst="192.168.1.2")
/ UDP(sport=8298, dport=7) / Raw(load=b"ping")
print(pkt)
sendp(pkt, iface="vpptap0")
pkt = Ether(dst="01:02:03:04:05:03", src="02:fe:64:dc:1b:82", type=0x0800)
/ IP(src="192.168.1.1", dst="192.168.1.3")
/ UDP(sport=8298, dport=7) / Raw(load=b"ping")
print(pkt)
sendp(pkt, iface="vpptap0")
```
What will happen is, the ScaPY program will emit these frames into device `vpptap0` which is in
bridge-domain 8298. The bridge will learn our src MAC `02:fe:64:dc:1b:82`, and look up the dst MAC
`01:02:03:04:05:02`, and because there hasn't been traffic yet, it'll flood to all member ports, one
of which is the VxLAN tunnel. VxLAN will then encapsulate the packets to the other side of the
tunnel.
```
pim@summer:~$ sudo ./vxlan-test.py
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.2:echo / Raw
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.3:echo / Raw
pim@summer:~$ sudo tcpdump -evni dummy0
10:50:35.310620 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
192.0.2.1.6345 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
10:50:35.362552 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
192.0.2.1.23916 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
```
I want to point out that nothing, so far, is special. All of this works with upstream VPP just fine.
I can see two VxLAN encapsulated packets, both destined to `192.0.2.254:4789`. Cool.
### Dynamic VPP VxLAN
I wrote a prototype for a Dynamic VxLAN tunnel in [[43433](https://gerrit.fd.io/r/c/vpp/+/43433)].
The good news is, this works. The bad news is, I think I'll want to discuss my proposal (this
article) with the community before going further down a potential rabbit hole.
With my gerrit patched in, I can do the following:
```
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:02 dst 192.0.2.2
Added VXLAN dynamic destination for 01:02:03:04:05:02 on vxlan_tunnel0 dst 192.0.2.2
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:03 dst 192.0.2.3
Added VXLAN dynamic destination for 01:02:03:04:05:03 on vxlan_tunnel0 dst 192.0.2.3
vpp# show vxlan l2fib
VXLAN Dynamic L2FIB entries:
MAC Interface Destination Port VNI
01:02:03:04:05:02 vxlan_tunnel0 192.0.2.2 4789 8298
01:02:03:04:05:03 vxlan_tunnel0 192.0.2.3 4789 8298
Dynamic L2FIB entries: 2
```
I've instructed the VxLAN tunnel to change the tunnel destination based on the destination MAC.
I run the script and tcpdump again:
```
pim@summer:~$ sudo tcpdump -evni dummy0
11:16:53.834619 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3945 (->3997)!)
192.0.2.1.6345 > 192.0.2.2.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
11:16:53.882554 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3944 (->3996)!)
192.0.2.1.23916 > 192.0.2.3.4789: VXLAN, flags [I] (0x08), vni 8298
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
```
Two important notes. Firstly, this works! For the MAC address ending in `:02`, the packet is sent to
`192.0.2.2` instead of the default of `192.0.2.254`. Same for the `:03` MAC, which now goes to
`192.0.2.3`. Nice! But secondly, the IPv4 header of the VxLAN packets was changed, so a call to
`ip4_header_checksum()` needs to be inserted somewhere. That's an easy fix.
# What's next
I want to discuss a few things, perhaps at an upcoming VPP Community meeting. Notably:
1. Is the VPP Developer community supportive of adding eVPN support? Does anybody want to help
write it with me?
1. Is changing the existing VxLAN plugin appropriate, or should I make a new plugin which adds
dynamic endpoints, L2FIB and Flood lists for BUM traffic?
1. Is it acceptable for me to add a BUM marker in `l2_flood.c` so that I can reuse all the logic
from bridge-domain flooding as I extend to also do VTEP flooding?
1. (perhaps later) VxLAN is the canonical underlay, but is there an appetite to extend also to,
say, GENEVE or MPLS?
1. (perhaps later) What's a good way to tie in a controlplane like FRRouting or Bird2 into the
dataplane (perhaps using a sidecar controller, or perhaps using Linux CP Netlink messages)?

View File

@@ -0,0 +1,701 @@
---
date: "2025-07-26T22:07:23Z"
title: 'Certificate Transparency - Part 1 - TesseraCT'
aliases:
- /s/articles/2025/07/26/certificate-transparency-part-1/
---
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
# Introduction
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)], as the
name suggests it was a form of _digital notary_, and they were in the business of issuing security
certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and
subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for
man-in-the-middle attacks on Iranian Gmail users. Not cool.
Google launched a project called **Certificate Transparency**, because it was becoming more common
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
the Web Public Key Infrastructure. It led to the creation of this ambitious
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
and _TLS_ (Transport Layer Security).
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
describes an experimental protocol for publicly logging the existence of Transport Layer Security
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
the logs.
This series explores and documents how IPng Networks will be running two Static CT _Logs_ with two
different implementations. One will be [[Sunlight](https://sunlight.dev/)], and the other will be
[[TesseraCT](https://github.com/transparency-dev/tesseract)].
## Static Certificate Transparency
In this context, _Logs_ are network services that implement the protocol operations for submissions
and queries that are defined in a specification that builds on the previous RFC. A few years ago,
my buddy Antonis asked me if I would be willing to run a log, but operationally these were very
complex and expensive to run. However, over the years, the concept of _Static Logs_ has put running one
within reach. This [[Static CT API](https://github.com/C2SP/C2SP/blob/main/static-ct-api.md)] defines a
read-path HTTP static asset hierarchy (for monitoring) to be implemented alongside the write-path
RFC 6962 endpoints (for submission).
Aside from the different read endpoints, a log that implements the Static API is a regular CT log
that can work alongside RFC 6962 logs and that fulfills the same purpose. In particular, it requires
no modification to submitters and TLS clients.
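In practice, the read-path is just a small hierarchy of static objects, roughly as follows; the
placeholder names are mine, and the exact paths are defined in the Static CT API spec:
```
/checkpoint             # the signed tree head; the only object that changes frequently
/tile/<level>/<index>   # Merkle tree hash tiles
/tile/data/<index>      # data tiles containing the logged (pre-)certificates
/issuer/<fingerprint>   # issuer certificates referenced by the entries
```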
If you only read one document about Static CT, read Filippo Valsorda's excellent
[[paper](https://filippo.io/a-different-CT-log)]. It describes a radically cheaper and easier to
operate [[Certificate Transparency](https://certificate.transparency.dev/)] log that is backed by a
consistent object storage, and can scale to 30x the current issuance rate for 2-10% of the costs
with no merge delay.
## Scalable, Cheap, Reliable: choose two
{{< image width="18em" float="right" src="/assets/ctlog/MPLS Backbone - CTLog.svg" alt="ctlog at ipng" >}}
In the diagram, I've drawn an overview of IPng's network. In {{< boldcolor color="red" >}}red{{<
/boldcolor >}} a European backbone network is provided by a [[BGP Free Core
network]({{< ref 2022-12-09-oem-switch-2 >}})]. It operates a private IPv4, IPv6, and MPLS network, called
_IPng Site Local_, which is not connected to the internet. On top of that, IPng offers L2 and L3
services, for example using [[VPP]({{< ref 2021-02-27-network >}})].
In {{< boldcolor color="lightgreen" >}}green{{< /boldcolor >}} I built a cluster of replicated
NGINX frontends. They connect into _IPng Site Local_ and can reach all hypervisors, VMs, and storage
systems. They also connect to the Internet with a single IPv4 and IPv6 address. One might say that
SSL is _added and removed here :-)_ [[ref](/assets/ctlog/nsa_slide.jpg)].
Then in {{< boldcolor color="orange" >}}orange{{< /boldcolor >}} I built a set of [[MinIO]({{< ref
2025-05-28-minio-1 >}})] S3 storage pools. Amongst others, I serve the static content from the IPng
website from these pools, providing fancy redundancy and caching. I wrote about its design in [[this
article]({{< ref 2025-06-01-minio-2 >}})].
Finally, I turn my attention to the {{< boldcolor color="blue" >}}blue{{< /boldcolor >}} which is
two hypervisors, one run by [[IPng](https://ipng.ch/)] and the other by [[Massar](https://massars.net/)]. Each
of them will be running one of the _Log_ implementations. IPng provides two large ZFS storage tanks
for offsite backup, in case a hypervisor decides to check out, and daily backups to an S3 bucket
using Restic.
Having explained all of this, I am well aware that end to end reliability will be coming from the
fact that there are many independent _Log_ operators, and folks wanting to validate certificates can
simply monitor many. If there is a gap in coverage, say due to any given _Log_'s downtime, this will
not necessarily be problematic. It does mean that I may have to suppress the SRE in me...
## MinIO
My first instinct is to leverage the distributed storage IPng has, but as I'll show in the rest of
this article, maybe a simpler, more elegant design could be superior, precisely because individual
log reliability is not _as important_ as having many available log _instances_ to choose from.
From operators in the field I understand that the world-wide generation of certificates is roughly
17M/day, which amounts to some 200-250qps of writes. Antonis explains that certs with a validity
of 180 days or less will need two CT log entries, while certs with a validity of more than 180 days
will need three CT log entries. So the write rate is roughly 2.2x that, as an upper bound.
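As a quick back-of-the-envelope check using those numbers:
```
17'000'000 certs/day ÷ 86'400 s/day ≈ 200 submissions/s
200-250 submissions/s × ~2.2 log entries each ≈ 440-550 writes/s as an upper bound
```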
My first thought is to see how fast my open source S3 machines can go, really. I'm curious also as
to the difference between SSD and spinning disks.
I boot two Dell R630s in the Lab. These machines have two Xeon E5-2640 v4 CPUs for a total of 20
cores and 40 threads, and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I
place 6pcs 1.2TB SAS3 disks (HPE part number EG1200JEHMC), and in the second machine I place 6pcs
of 1.92TB enterprise storage (Samsung part number P1633N19).
I spin up a 6-device MinIO cluster on both and take them out for a spin using [[S3
Benchmark](https://github.com/wasabi-tech/s3-benchmark.git)] from Wasabi Tech.
```
pim@ctlog-test:~/src/s3-benchmark$ for dev in disk ssd; do \
for t in 1 8 32; do \
for z in 4M 1M 8k 4k; do \
./s3-benchmark -a $KEY -s $SECRET -u http://minio-$dev:9000 -t $t -z $z \
| tee -a minio-results.txt; \
done; \
done; \
done
```
The loadtest above does a bunch of runs with varying parameters. First it tries to read and write
object sizes of 4MB, 1MB, 8kB and 4kB respectively. Then it tries to do this with either 1 thread, 8
threads or 32 threads. Finally it tests both the disk-based variant as well as the SSD based one.
The loadtest runs from a third machine, so that the Dell R630 disk tanks can stay completely
dedicated to their task of running MinIO.
{{< image width="100%" src="/assets/ctlog/minio_8kb_performance.png" alt="MinIO 8kb disk vs SSD" >}}
The left-hand side graph feels pretty natural to me. With one thread, uploading 8kB objects will
quickly hit the IOPS rate of the disks, each of which has to participate in the write due to EC:3
encoding when using six disks, and it tops out at ~56 PUT/s. The single thread hitting SSDs will not
hit that limit, and has ~371 PUT/s which I found a bit underwhelming. But, when performing the
loadtest with either 8 or 32 write threads, the hard disks become only marginally faster (topping
out at 240 PUT/s), while the SSDs really start to shine, with 3850 PUT/s. Pretty good performance.
On the read-side, I am pleasantly surprised that there's not really that much of a difference
between disks and SSDs. This is likely because the host filesystem cache is playing a large role, so
the 1-thread performance is equivalent (765 GET/s for disks, 677 GET/s for SSDs), and the 32-thread
performance is also equivalent (at 7624 GET/s for disks with 7261 GET/s for SSDs). I do wonder why
the hard disks consistently outperform the SSDs with all the other variables (OS, MinIO version,
hardware) the same.
## Sidequest: SeaweedFS
Something that has long caught my attention is the way in which
[[SeaweedFS](https://github.com/seaweedfs/seaweedfs)] approaches blob storage. Many operators have
great success with many small file writes in SeaweedFS compared to MinIO and even AWS S3 storage.
This is because writes with SeaweedFS are not broken into erasure-sets, which would require every disk
to write a small part or checksum of the data; rather, files are replicated within the cluster in
their entirety on different disks, racks or datacenters. I won't bore you with the details of
SeaweedFS, but if you're curious, I'll tack on at the end of this article the docker
[[compose file](/assets/ctlog/seaweedfs.docker-compose.yml)] that I used.
{{< image width="100%" src="/assets/ctlog/size_comparison_8t.png" alt="MinIO vs SeaWeedFS" >}}
In the write-path, SeaweedFS dominates in all cases, due to its different way of achieving durable
storage (per-file replication in SeaweedFS versus all-disk erasure-sets in MinIO):
* 4k: 3,384 ops/sec vs MinIO's 111 ops/sec (30x faster!)
* 8k: 3,332 ops/sec vs MinIO's 111 ops/sec (30x faster!)
* 1M: 383 ops/sec vs MinIO's 44 ops/sec (9x faster)
* 4M: 104 ops/sec vs MinIO's 32 ops/sec (4x faster)
For the read-path, in GET operations MinIO is better at small objects, and really dominates at
large objects:
* 4k: 7,411 ops/sec vs SeaweedFS 5,014 ops/sec
* 8k: 7,666 ops/sec vs SeaweedFS 5,165 ops/sec
* 1M: 5,466 ops/sec vs SeaweedFS 2,212 ops/sec
* 4M: 3,084 ops/sec vs SeaweedFS 646 ops/sec
This makes me draw an interesting conclusion: seeing as CT Logs are read/write heavy (every couple
of seconds, the Merkle tree is recomputed, which is reasonably disk-intensive), SeaweedFS might be a
slightly better choice. IPng Networks has three MinIO deployments, but no SeaweedFS deployments. Yet.
# Tessera
[[Tessera](https://github.com/transparency-dev/tessera.git)] is a Go library for building tile-based
transparency logs (tlogs) [[ref](https://github.com/C2SP/C2SP/blob/main/tlog-tiles.md)]. It is the
logical successor to the approach that Google took when building and operating _Logs_ using its
predecessor called [[Trillian](https://github.com/google/trillian)]. The implementation and its APIs
bake-in current best-practices based on the lessons learned over the past decade of building and
operating transparency logs in production environments and at scale.
Tessera was introduced at the Transparency.Dev summit in October 2024. I first watched Al and Martin
[[introduce](https://www.youtube.com/watch?v=9j_8FbQ9qSc)] it at last year's summit. At a high
level, it wraps what used to be a whole kubernetes cluster full of components into a single library
that can be used with Cloud services, either with AWS S3 and an RDS database, or with GCP's GCS
storage and a Spanner database. However, Google also made it easy to use a regular POSIX filesystem
implementation.
## TesseraCT
{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="tesseract logo" >}}
While Tessera is a library, a CT log implementation comes from its sibling GitHub repository called
[[TesseraCT](https://github.com/transparency-dev/tesseract)]. Because it leverages Tessera under the
hood, TesseraCT can run on GCP, AWS, POSIX-compliant filesystems, or S3-compatible systems alongside a MySQL
database. In order to provide ecosystem agility and to control the growth of CT Log sizes, new CT
Logs must be temporally sharded, defining a certificate expiry range denoted in the form of two
dates: `[rangeBegin, rangeEnd)`. The certificate expiry range allows a Log to reject otherwise valid
logging submissions for certificates that expire before or after this defined range, thus
partitioning the set of publicly-trusted certificates that each Log will accept. I will be expected
to keep logs for an extended period of time, say 3-5 years.
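For example, a shard covering the first half of 2026 would carry a range like the one below and
reject submissions for certificates that expire outside of it (dates purely illustrative):
```
rangeBegin: 2026-01-01T00:00:00Z   (inclusive)
rangeEnd:   2026-07-01T00:00:00Z   (exclusive)
```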
It's time for me to figure out what this TesseraCT thing can do .. are you ready? Let's go!
### TesseraCT: S3 and SQL
TesseraCT comes with a few so-called _personalities_. Those are an implementation of the underlying
storage infrastructure in an opinionated way. The first personality I look at is the `aws` one in
`cmd/tesseract/aws`. I notice that this personality makes hard assumptions about the use of AWS,
which is unfortunate, as the documentation says '.. or self-hosted S3 and MySQL database'. In
particular, the `aws` personality assumes the AWS Secrets Manager in order to fetch its signing key.
Before I can be successful, I need to detangle that.
#### TesseraCT: AWS and Local Signer
First, I change `cmd/tesseract/aws/main.go` to add two new flags:
* ***-signer_public_key_file***: a path to the public key for checkpoints and SCT signer
* ***-signer_private_key_file***: a path to the private key for checkpoints and SCT signer
I then change the program to assume if these flags are both set, the user will want a
_NewLocalSigner_ instead of a _NewSecretsManagerSigner_. Now all I have to do is implement the
signer interface in a package `local_signer.go`. There, function _NewLocalSigner()_ will read the
public and private PEM from file, decode them, and create an _ECDSAWithSHA256Signer_ with them, a
simple example to show what I mean:
```
// NewLocalSigner creates a new signer that uses the ECDSA P-256 key pair from
// local disk files for signing digests.
func NewLocalSigner(publicKeyFile, privateKeyFile string) (*ECDSAWithSHA256Signer, error) {
// Read and parse the public key (error handling omitted for brevity)
publicKeyPEM, _ := os.ReadFile(publicKeyFile)
publicPemBlock, _ := pem.Decode(publicKeyPEM)
publicKey, _ := x509.ParsePKIXPublicKey(publicPemBlock.Bytes)
ecdsaPublicKey, _ := publicKey.(*ecdsa.PublicKey)
// Read and parse the private key (error handling omitted for brevity)
privateKeyPEM, _ := os.ReadFile(privateKeyFile)
privatePemBlock, _ := pem.Decode(privateKeyPEM)
ecdsaPrivateKey, _ := x509.ParseECPrivateKey(privatePemBlock.Bytes)
// Verify the correctness of the signer key pair
if !ecdsaPrivateKey.PublicKey.Equal(ecdsaPublicKey) {
return nil, errors.New("signer key pair doesn't match")
}
return &ECDSAWithSHA256Signer{
publicKey: ecdsaPublicKey,
privateKey: ecdsaPrivateKey,
}, nil
}
```
In the snippet above I omitted all of the error handling, but the local signer logic itself is
hopefully clear. And with that, I am liberated from Amazon's Cloud offering and can run this thing
all by myself!
#### TesseraCT: Running with S3, MySQL, and Local Signer
First, I need to create a suitable ECDSA key:
```
pim@ctlog-test:~$ openssl ecparam -name prime256v1 -genkey -noout -out /tmp/private_key.pem
pim@ctlog-test:~$ openssl ec -in /tmp/private_key.pem -pubout -out /tmp/public_key.pem
```
Then, I'll install the MySQL server and create the databases:
```
pim@ctlog-test:~$ sudo apt install default-mysql-server
pim@ctlog-test:~$ sudo mysql -u root
CREATE USER 'tesseract'@'localhost' IDENTIFIED BY '<db_passwd>';
CREATE DATABASE tesseract;
CREATE DATABASE tesseract_antispam;
GRANT ALL PRIVILEGES ON tesseract.* TO 'tesseract'@'localhost';
GRANT ALL PRIVILEGES ON tesseract_antispam.* TO 'tesseract'@'localhost';
```
Finally, I use the SSD MinIO lab-machine that I just loadtested to create an S3 bucket.
```
pim@ctlog-test:~$ mc mb minio-ssd/tesseract-test
pim@ctlog-test:~$ cat << EOF > /tmp/minio-access.json
{ "Version": "2012-10-17", "Statement": [ {
"Effect": "Allow",
"Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ],
"Resource": [ "arn:aws:s3:::tesseract-test/*", "arn:aws:s3:::tesseract-test" ]
} ]
}
EOF
pim@ctlog-test:~$ mc admin user add minio-ssd <user> <secret>
pim@ctlog-test:~$ mc admin policy create minio-ssd tesseract-test-access /tmp/minio-access.json
pim@ctlog-test:~$ mc admin policy attach minio-ssd tesseract-test-access --user <user>
pim@ctlog-test:~$ mc anonymous set public minio-ssd/tesseract-test
```
{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
After some fiddling, I understand that the AWS software development kit makes some assumptions that
you'll be using .. _quelle surprise_ .. AWS services. But you can also use local S3 services by
setting a few key environment variables. I had heard of the S3 access and secret key environment
variables before, but I now need to also use a different S3 endpoint. That little detour into the
codebase only took me .. several hours.
Armed with that knowledge, I can build and finally start my TesseraCT instance:
```
pim@ctlog-test:~/src/tesseract/cmd/tesseract/aws$ go build -o ~/aws .
pim@ctlog-test:~$ export AWS_DEFAULT_REGION="us-east-1"
pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID="<user>"
pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY="<secret>"
pim@ctlog-test:~$ export AWS_ENDPOINT_URL_S3="http://minio-ssd.lab.ipng.ch:9000/"
pim@ctlog-test:~$ ./aws --http_endpoint='[::]:6962' \
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
--bucket=tesseract-test \
--db_host=ctlog-test.lab.ipng.ch \
--db_user=tesseract \
--db_password=<db_passwd> \
--db_name=tesseract \
--antispam_db_name=tesseract_antispam \
--signer_public_key_file=/tmp/public_key.pem \
--signer_private_key_file=/tmp/private_key.pem \
--roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
I0727 15:13:04.666056 337461 main.go:128] **** CT HTTP Server Starting ****
```
Hah! I think most of the command line flags and environment variables should make sense, but I was
struggling for a while with the `--roots_pem_file` and the `--origin` flags, so I phoned a friend
(Al Cutter, Googler extraordinaire and an expert in Tessera/CT). He explained to me that the Log is
actually an open endpoint to which anybody might POST data. However, to avoid folks abusing the log
infrastructure, each POST is expected to come from one of the certificate authorities listed in the
`--roots_pem_file`. OK, that makes sense.
Then, the `--origin` flag designates how my log calls itself. In the resulting `checkpoint` file, it
will enumerate a hash of the latest merged and published Merkle tree. In case a server serves
multiple logs, it uses the `--origin` flag to distinguish which checksum belongs to which log.
```
pim@ctlog-test:~/src/tesseract$ curl http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint
ctlog-test.lab.ipng.ch/test-ecdsa
0
JGPitKWWI0aGuCfC2k1n/p9xdWAYPm5RZPNDXkCEVUU=
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMCONUBAMARjBEAiA/nc9dig6U//vPg7SoTHjt9bxP5K+x3w4MYKpIRn4ULQIgUY5zijRK8qyuJGvZaItDEmP1gohCt+wI+sESBnhkuqo=
```
When creating the bucket above, I used `mc anonymous set public`, which made the S3 bucket
world-readable. I can now execute the whole read-path simply by hitting the S3 service. Check.
#### TesseraCT: Loadtesting S3/MySQL
{{< image width="12em" float="right" src="/assets/ctlog/stop-hammer-time.jpg" alt="Stop, hammer time" >}}
The write path is a server on `[::]:6962`. I should be able to write a log to it, but how? Here's
where I am grateful to find a tool in the TesseraCT GitHub repository called `hammer`. This hammer
sets up read and write traffic to a Static CT API log to test correctness and performance under
load. The traffic is sent according to the [[Static CT API](https://c2sp.org/static-ct-api)] spec.
Slick!
The tool starts a text-based UI in the terminal (my favorite! also seen with the Cisco T-Rex loadtester)
that shows the current status, logs, and supports increasing/decreasing read and write traffic. This
TUI allows for a level of interactivity when probing a new configuration of a log in order to find
any cliffs where performance degrades. For real load-testing applications, especially headless runs
as part of a CI pipeline, it is recommended to run the tool with `-show_ui=false` in order to disable
the UI.
I'm a bit lost in the somewhat terse
[[README.md](https://github.com/transparency-dev/tesseract/tree/main/internal/hammer)], but my buddy
Al comes to my rescue and explains the flags to me. First of all, the loadtester wants to hit the
same `--origin` that I configured the write-path to accept. In my case this is
`ctlog-test.lab.ipng.ch/test-ecdsa`. Then, it needs the public key for that _Log_, which I can find
in `/tmp/public_key.pem`. The text there is the _DER_ (Distinguished Encoding Rules), stored as a
base64 encoded string. What follows next was the most difficult for me to understand, as I was
thinking the hammer would read some log from the internet somewhere and replay it locally. Al
explains that actually, the `hammer` tool synthetically creates all of these entries itself, and it
regularly reads the `checkpoint` from the `--log_url` place, while it writes its certificates to
`--write_log_url`. The last few flags just inform the `hammer` how many read and write ops/sec it
should generate, and with that explanation my brain plays _tadaa.wav_ and I am ready to go.
```
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer \
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEucHtDWe9GYNicPnuGWbEX8rJg/VnDcXs8z40KdoNidBKy6/ZXw2u+NW1XAUnGpXcZozxufsgOMhijsWb25r7jw== \
--log_url=http://tesseract-test.minio-ssd.lab.ipng.ch:9000/ \
--write_log_url=http://localhost:6962/ctlog-test.lab.ipng.ch/test-ecdsa/ \
--max_read_ops=0 \
--num_writers=5000 \
--max_write_ops=100
```
{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest1.png" alt="S3/MySQL Loadtest 100qps" >}}
Cool! It seems that the loadtest is happily chugging along at 100qps. The log is consuming them in
the HTTP write-path by accepting POST requests to
`/ctlog-test.lab.ipng.ch/test-ecdsa/ct/v1/add-chain`, where hammer is offering them at a rate of
100qps, with a configured probability of duplicates set at 10%. What that means is that every now
and again, it'll repeat a previous request. The purpose of this is to stress test the so-called
`antispam` implementation. When `hammer` sends its requests, it signs them with a certificate that
was issued by the CA described in `internal/hammer/testdata/test_root_ca_cert.pem`, which is why
TesseraCT accepts them.
I raise the write load by using the '>' key a few times. I notice things are great at 500qps, which
is nice because that's double what we expect to need. But I start seeing a bit more noise at 600qps.
When I raise the write-rate to 1000qps, all hell breaks loose in the logs of the server (and similar
logs appear in the `hammer` loadtester):
```
W0727 15:54:33.419881 348475 handlers.go:168] ctlog-test.lab.ipng.ch/test-ecdsa: AddChain handler error: couldn't store the leaf: failed to fetch entry bundle at index 0: failed to fetch resource: getObject: failed to create reader for object "tile/data/000" in bucket "tesseract-test": operation error S3: GetObject, context deadline exceeded
W0727 15:55:02.727962 348475 aws.go:345] GarbageCollect failed: failed to delete one or more objects: failed to delete objects: operation error S3: DeleteObjects, https response error StatusCode: 400, RequestID: 1856202CA3C4B83F, HostID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8, api error MalformedXML: The XML you provided was not well-formed or did not validate against our published schema.
E0727 15:55:10.448973 348475 append_lifecycle.go:293] followerStats: follower "AWS antispam" EntriesProcessed(): failed to read follow coordination info: Error 1040: Too many connections
```
I see on the MinIO instance that it's doing about 150/s of GETs and 15/s of PUTs, which is totally
reasonable:
```
pim@ctlog-test:~/src/tesseract$ mc admin trace --stats ssd
Duration: 6m9s ▰▱▱
RX Rate:↑ 34 MiB/m
TX Rate:↓ 2.3 GiB/m
RPM : 10588.1
-------------
Call Count RPM Avg Time Min Time Max Time Avg TTFB Max TTFB Avg Size Rate /min
s3.GetObject 60558 (92.9%) 9837.2 4.3ms 708µs 48.1ms 3.9ms 47.8ms ↑144B ↓246K ↑1.4M ↓2.3G
s3.PutObject 2199 (3.4%) 357.2 5.3ms 2.4ms 32.7ms 5.3ms 32.7ms ↑92K ↑32M
s3.DeleteMultipleObjects 1212 (1.9%) 196.9 877µs 290µs 41.1ms 850µs 41.1ms ↑230B ↓369B ↑44K ↓71K
s3.ListObjectsV2 1212 (1.9%) 196.9 18.4ms 999µs 52.8ms 18.3ms 52.7ms ↑131B ↓261B ↑25K ↓50K
```
Another nice way to see what makes it through is this oneliner, which reads the `checkpoint` every
second, and once it changes, shows the delta in seconds and how many certs were written:
```
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
N=$(curl -sS http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E '^[0-9]+$'); \
if [ "$N" -eq "$O" ]; then \
echo -n .; \
else \
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
fi; \
T=$((T+1)); sleep 1; done
1012905 .... 5 seconds 2081 certs
1014986 .... 5 seconds 2126 certs
1017112 .... 5 seconds 1913 certs
1019025 .... 5 seconds 2588 certs
1021613 .... 5 seconds 2591 certs
1024204 .... 5 seconds 2197 certs
```
So I can see that the checkpoint is refreshed every 5 seconds and between 1913 and 2591 certs are
written each time. And indeed, at 400/s there are no errors or warnings at all. At this write rate,
TesseraCT is using about 2.9 CPUs/s, with MariaDB using 0.3 CPUs/s, but the hammer is using 6.0
CPUs/s. Overall, the machine serves perfectly happily for a few hours under this load test.
***Conclusion: a write-rate of 400/s should be safe with S3+MySQL***
### TesseraCT: POSIX
I have been playing with this idea of having a reliable read-path by having the S3 cluster be
redundant, or by replicating the S3 bucket. But Al asks: why not use the experimental POSIX personality?
We discuss two very important benefits, but also two drawbacks:
* On the plus side:
1. There is no need for S3 storage, read/writing to a local ZFS raidz2 pool instead.
1. There is no need for MySQL, as the POSIX implementation can use a local badger instance
also on the local filesystem.
* On the drawbacks:
1. There is a SPOF in the read-path, as the single VM must handle both. The write-path always
has a SPOF on the TesseraCT VM.
1. Local storage is more expensive than S3 storage, and can be used only for the purposes of
one application (and at best, shared with other VMs on the same hypervisor).
Come to think of it, this is maybe not such a bad tradeoff. I do kind of like having a single-VM
with a single-binary and no other moving parts. It greatly simplifies the architecture, and for the
read-path I can (and will) still use multiple upstream NGINX machines in IPng's network.
I consider myself nerd-sniped, and take a look at the POSIX variant. I have a few SAS3
solid state drives (NetApp part number X447_S1633800AMD), which I plug into the `ctlog-test`
machine.
```
pim@ctlog-test:~$ sudo zpool create -o ashift=12 -o autotrim=on ssd-vol0 mirror \
/dev/disk/by-id/wwn-0x5002538a0???????
pim@ctlog-test:~$ sudo zfs create ssd-vol0/tesseract-test
pim@ctlog-test:~$ sudo chown pim:pim /ssd-vol0/tesseract-test
pim@ctlog-test:~/src/tesseract$ go run ./cmd/experimental/posix --http_endpoint='[::]:6962' \
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
--private_key=/tmp/private_key.pem \
--storage_dir=/ssd-vol0/tesseract-test \
--roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
badger 2025/07/27 16:29:15 INFO: All 0 tables opened in 0s
badger 2025/07/27 16:29:15 INFO: Discard stats nextEmptySlot: 0
badger 2025/07/27 16:29:15 INFO: Set nextTxnTs to 0
I0727 16:29:15.032845 363156 files.go:502] Initializing directory for POSIX log at "/ssd-vol0/tesseract-test" (this should only happen ONCE per log!)
I0727 16:29:15.034101 363156 main.go:97] **** CT HTTP Server Starting ****
pim@ctlog-test:~/src/tesseract$ cat /ssd-vol0/tesseract-test/checkpoint
ctlog-test.lab.ipng.ch/test-ecdsa
0
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMSgC8BAMARzBFAiBjT5zdkniKlryqlUlx/gLHOtVK26zuWwrc4BlyTVzCWgIhAJ0GIrlrP7YGzRaHjzdB5tnS5rpP3LeOsPbpLateaiFc
```
Alright, I can see the log started and created an empty checkpoint file. Nice!
Before I can loadtest it, I will need to make the read-path visible. The `hammer` can read
a checkpoint from local `file:///` prefixes, but I'll have to serve them over the network eventually
anyway, so I create the following NGINX config for it:
```
server {
listen 80 default_server backlog=4096;
listen [::]:80 default_server backlog=4096;
root /ssd-vol0/tesseract-test/;
index index.html index.htm index.nginx-debian.html;
server_name _;
access_log /var/log/nginx/access.log combined buffer=512k flush=5s;
location / {
try_files $uri $uri/ =404;
tcp_nopush on;
sendfile on;
tcp_nodelay on;
keepalive_timeout 65;
keepalive_requests 1000;
}
}
```
Just a couple of small thoughts on this configuration. I'm using buffered access logs, to avoid
excessive disk writes in the read-path. Then, I'm using kernel `sendfile()` which will instruct the
kernel to serve the static objects directly, so that NGINX can move on. Further, I'll allow for a
long keepalive in HTTP/1.1, so that future requests can use the same TCP connection, and I'll set
the flags `tcp_nodelay` and `tcp_nopush` to just blast the data out without waiting.
Without much ado:
```
pim@ctlog-test:~/src/tesseract$ curl -sS ctlog-test.lab.ipng.ch/checkpoint
ctlog-test.lab.ipng.ch/test-ecdsa
0
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMTfksBAMASDBGAiEAqADLH0P/SRVloF6G1ezlWG3Exf+sTzPIY5u6VjAKLqACIQCkJO2N0dZQuDHvkbnzL8Hd91oyU41bVqfD3vs5EwUouA==
```
#### TesseraCT: Loadtesting POSIX
The loadtesting is roughly the same. I start the `hammer` with the same 500qps of write rate, which
was roughly where the S3+MySQL variant topped out. My checkpoint tracker shows the following:
```
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
N=$(curl -sS http://localhost/checkpoint | grep -E '^[0-9]+$'); \
if [ "$N" -eq "$O" ]; then \
echo -n .; \
else \
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
fi; \
T=$((T+1)); sleep 1; done
59250 ......... 10 seconds 5244 certs
64494 ......... 10 seconds 5000 certs
69494 ......... 10 seconds 5000 certs
74494 ......... 10 seconds 5000 certs
79494 ......... 10 seconds 5256 certs
79494 ......... 10 seconds 5256 certs
84750 ......... 10 seconds 5244 certs
89994 ......... 10 seconds 5256 certs
95250 ......... 10 seconds 5000 certs
100250 ......... 10 seconds 5000 certs
105250 ......... 10 seconds 5000 certs
```
I learn two things. First, the checkpoint interval in this `posix` variant is 10 seconds, compared
to the 5 seconds of the `aws` variant I tested before. I dive into the code, because there doesn't
seem to be a `--checkpoint_interval` flag. In the `tessera` library, I find
`DefaultCheckpointInterval` which is set to 10 seconds. I change it to be 2 seconds instead, and
restart the `posix` binary:
```
238250 . 2 seconds 1000 certs
239250 . 2 seconds 1000 certs
240250 . 2 seconds 1000 certs
241250 . 2 seconds 1000 certs
242250 . 2 seconds 1000 certs
243250 . 2 seconds 1000 certs
244250 . 2 seconds 1000 certs
```
{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest2.png" alt="Posix Loadtest 5000qps" >}}
Very nice! Maybe I can write a few more certs? I restart the `hammer` at 5000/s which, somewhat to my
surprise, the log ends up serving!
```
642608 . 2 seconds 6155 certs
648763 . 2 seconds 10256 certs
659019 . 2 seconds 9237 certs
668256 . 2 seconds 8800 certs
677056 . 2 seconds 8729 certs
685785 . 2 seconds 8237 certs
694022 . 2 seconds 7487 certs
701509 . 2 seconds 8572 certs
710081 . 2 seconds 7413 certs
```
The throughput is highly variable though, seemingly between 3700/sec and 5100/sec, and I quickly
find out that the `hammer` is completely saturating the CPU on the machine, leaving very little room
for the `posix` TesseraCT to serve. I'm going to need more machines!
So I start a `hammer` loadtester on the two now-idle MinIO servers, and run them at about 6000qps
**each**, for a total of 12000 certs/sec. And my little `posix` binary is keeping up like a champ:
```
2987169 . 2 seconds 23040 certs
3010209 . 2 seconds 23040 certs
3033249 . 2 seconds 21760 certs
3055009 . 2 seconds 21504 certs
3076513 . 2 seconds 23808 certs
3100321 . 2 seconds 22528 certs
```
One thing is reasonably clear: the `posix` TesseraCT is CPU bound, not disk bound. The CPU is now
running at about 18.5 CPUs/s (with 20 cores), which is pretty much all this Dell has to offer. The
NetAPP enterprise solid state drives are not impressed:
```
pim@ctlog-test:~/src/tesseract$ zpool iostat -v ssd-vol0 10 100
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
ssd-vol0 11.4G 733G 0 3.13K 0 117M
mirror-0 11.4G 733G 0 3.13K 0 117M
wwn-0x5002538a05302930 - - 0 1.04K 0 39.1M
wwn-0x5002538a053069f0 - - 0 1.06K 0 39.1M
wwn-0x5002538a06313ed0 - - 0 1.02K 0 39.1M
-------------------------- ----- ----- ----- ----- ----- -----
pim@ctlog-test:~/src/tesseract$ zpool iostat -l ssd-vol0 10
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim
pool alloc free read write read write read write read write read write read write wait wait
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
ssd-vol0 14.0G 730G 0 1.48K 0 35.4M - 2ms - 535us - 1us - 3ms - 50ms
ssd-vol0 14.0G 730G 0 1.12K 0 23.0M - 1ms - 733us - 2us - 1ms - 44ms
ssd-vol0 14.1G 730G 0 1.42K 0 45.3M - 508us - 122us - 914ns - 2ms - 41ms
ssd-vol0 14.2G 730G 0 678 0 21.0M - 863us - 144us - 2us - 2ms - -
```
## Results
OK, that kind of seals the deal for me. The write path needs about 250 certs/sec and I'm hammering
now with 12'000 certs/sec, with room to spare. But what about the read path? The cool thing about
the static log is that reads are all entirely done by NGINX. The only file that isn't cacheable is
the `checkpoint` file which gets updated every two seconds (or ten seconds in the default `tessera`
settings).
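That suggests an easy optimization for the read-path: everything under `/tile/` is immutable and can
be cached aggressively, while `/checkpoint` should not be. A hedged sketch of what that could look
like in NGINX, not the configuration I used for this test:
```
location = /checkpoint {
    add_header Cache-Control "no-cache";
}
location /tile/ {
    add_header Cache-Control "public, max-age=31536000, immutable";
}
```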
So I start yet another `hammer` whose job it is to read back from the static filesystem:
```
pim@ctlog-test:~/src/tesseract$ curl localhost/nginx_status; sleep 60; curl localhost/nginx_status
Active connections: 10556
server accepts handled requests
25302 25302 1492918
Reading: 0 Writing: 1 Waiting: 10555
Active connections: 7791
server accepts handled requests
25764 25764 1727631
Reading: 0 Writing: 1 Waiting: 7790
```
And I can see that it's keeping up quite nicely. In one minute, it handled (1727631-1492918) or
234713 requests, which is a cool 3911 requests/sec. All these read/write hammers are kind of
saturating the `ctlog-test` machine though:
{{< image width="100%" src="/assets/ctlog/ctlog-loadtest3.png" alt="Posix Loadtest 8000qps write, 4000qps read" >}}
But after a little bit of fiddling, I can assert my conclusion:
***Conclusion: a write-rate of 8'000/s alongside a read-rate of 4'000/s should be safe with POSIX***
## What's Next
I am going to offer such a machine in production together with Antonis Chariton and Jeroen Massar.
I plan to do a few additional things:
* Test Sunlight as well on the same hardware. It would be nice to see a comparison between write
rates of the two implementations.
* Work with Al Cutter and the Transparency Dev team to close a few small gaps (like the
`local_signer.go` and some Prometheus monitoring of the `posix` binary).
* Install and launch both under `*.ct.ipng.ch`, which in itself deserves its own report, showing
how I intend to do log cycling and care/feeding, as well as report on the real production
experience running these CT Logs.

View File

@@ -0,0 +1,666 @@
---
date: "2025-08-10T12:07:23Z"
title: 'Certificate Transparency - Part 2 - Sunlight'
---
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
# Introduction
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)], as the
name suggests it was a form of _digital notary_, and they were in the business of issuing security
certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and
subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for
man-in-the-middle attacks on Iranian Gmail users. Not cool.
Google launched a project called **Certificate Transparency**, because it was becoming more common
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
the Web Public Key Infrastructure. It led to the creation of this ambitious
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
and _TLS_ (Transport Layer Security).
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
describes an experimental protocol for publicly logging the existence of Transport Layer Security
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
the logs.
In a [[previous article]({{< ref 2025-07-26-ctlog-1 >}})], I took a deep dive into an upcoming
open source implementation of Static CT Logs made by Google. There is however a very competent
alternative called [[Sunlight](https://sunlight.dev/)], which deserves some attention to get to know
its look and feel, as well as its performance characteristics.
## Sunlight
I start by reading up on the project website, and learn:
> _Sunlight is a [[Certificate Transparency](https://certificate.transparency.dev/)] log implementation
> and monitoring API designed for scalability, ease of operation, and reduced cost. What started as
> the Sunlight API is now the [[Static CT API](https://c2sp.org/static-ct-api)] and is allowed by the
> CT log policies of the major browsers._
>
> _Sunlight was designed by Filippo Valsorda for the needs of the WebPKI community, through the
> feedback of many of its members, and in particular of the Sigsum, Google TrustFabric, and ISRG
> teams. It is partially based on the Go Checksum Database. Sunlight's development was sponsored by
> Let's Encrypt._
I have a chat with Filippo and think I'm addressing the elephant in the room by asking him which of the two
implementations, TesseraCT or Sunlight, he thinks would be a good fit. One thing he says really sticks
with me: "The community needs _any_ static log operator, so if Google thinks TesseraCT is ready, by
all means use that. The diversity will do us good!".
Whether one or the other is 'ready' depends partly on the software, but importantly also on the
operator. So I carefully take Sunlight out of its cardboard box, and put it onto the same Dell R630s
that I used in my previous tests: two Xeon E5-2640 v4 CPUs for a total of 20 cores and 40 threads,
and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I place six 1.2TB SAS3
drives (HPE part number EG1200JEHMC), and in the second machine I place six 1.92TB enterprise
storage drives (Samsung part number P1633N19).
### Sunlight: setup
I download the source from GitHub, which, one of these days, will have an IPv6 address. Building the
tools is easy enough, there are three main tools:
1. ***sunlight***: Which serves the write-path. Certification authorities add their certs here.
1. ***sunlight-keygen***: A helper tool to create the so-called `seed` file (key material) for a
log.
1. ***skylight***: Which serves the read-path. `/checkpoint` and things like `/tile` and `/issuer`
are served here in a spec-compliant way.
The YAML configuration file is straightforward, and can define and handle multiple logs in one
instance, which sets it apart from TesseraCT, which can only handle one log per instance. There's a
`submissionprefix` which `sunlight` will use to accept writes, and a `monitoringprefix` which
`skylight` will use for reads.
I stumble across a small issue - I haven't created multiple DNS hostnames for the test machine. So I
decide to use a different port for one versus the other. The write path will use TLS on port 1443
while the read path will use plain HTTP on port 1080. And considering I don't have a certificate for
`*.lab.ipng.ch`, I will use a self-signed one instead:
```
pim@ctlog-test:/etc/sunlight$ openssl genrsa -out ca.key 2048
pim@ctlog-test:/etc/sunlight$ openssl req -new -x509 -days 365 -key ca.key \
-subj "/C=CH/ST=ZH/L=Bruttisellen/O=IPng Networks GmbH/CN=IPng Root CA" -out ca.crt
pim@ctlog-test:/etc/sunlight$ openssl req -newkey rsa:2048 -nodes -keyout sunlight-key.pem \
-subj "/C=CH/ST=ZH/L=Bruttisellen/O=IPng Networks GmbH/CN=*.lab.ipng.ch" -out sunlight.csr
pim@ctlog-test:/etc/sunlight# openssl x509 -req -extfile \
<(printf "subjectAltName=DNS:ctlog-test.lab.ipng.ch,DNS:ctlog-test.lab.ipng.ch") -days 365 \
-in sunlight.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out sunlight.pem
ln -s sunlight.pem skylight.pem
ln -s sunlight-key.pem skylight-key.pem
```
This little snippet yields `sunlight.pem` (the certificate) and `sunlight-key.pem` (the private
key), and symlinks them to `skylight.pem` and `skylight-key.pem` for simplicity. With these in hand,
I can start the rest of the show. First I will prepare the NVME storage with a few datasets in
which Sunlight will store its data:
```
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/shared
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/logs
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/logs/sunlight-test
pim@ctlog-test:~$ sudo chown -R pim:pim /ssd-vol0/sunlight-test
```
Then I'll create the log's key material:
```
pim@ctlog-test:/etc/sunlight$ sunlight-keygen -f sunlight-test.seed.bin
Log ID: IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=
ECDSA public key:
-----BEGIN PUBLIC KEY-----
MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHR
wRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ==
-----END PUBLIC KEY-----
Ed25519 public key:
-----BEGIN PUBLIC KEY-----
0pHg7KptAxmb4o67m9xNM1Ku3YH4bjjXbyIgXn2R2bk=
-----END PUBLIC KEY-----
```
This creates the key material for the log, and I get a fun surprise: the Log ID starts
precisely with the string `IPng`... what are the odds that that would happen!? I should tell Antonis
about this, it's dope!
As a safety precaution, Sunlight requires the operator to make the `checkpoints.db` by hand, which
I'll also do:
```
pim@ctlog-test:/etc/sunlight$ sqlite3 /ssd-vol0/sunlight-test/shared/checkpoints.db \
"CREATE TABLE checkpoints (logID BLOB PRIMARY KEY, body TEXT)"
```
And with that, I'm ready to create my first log!
### Sunlight: Setting up S3
When learning about [[Tessera]({{< ref 2025-07-26-ctlog-1 >}})], I already kind of drew the
conclusion that, for our case at IPng at least, running the fully cloud-native version with S3
storage and a MySQL database gave both poorer performance and more operational complexity. But
I find it interesting to compare behavior and performance, so I'll start by creating a Sunlight log
backed by MinIO SSD storage.
I'll first create the bucket and a user account to access it:
```
pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID="<some user>"
pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY="<some password>"
pim@ctlog-test:~$ export S3_BUCKET=sunlight-test
pim@ctlog-test:~$ mc mb ssd/${S3_BUCKET}
pim@ctlog-test:~$ cat << EOF > /tmp/minio-access.json
{ "Version": "2012-10-17", "Statement": [ {
"Effect": "Allow",
"Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ],
"Resource": [ "arn:aws:s3:::${S3_BUCKET}/*", "arn:aws:s3:::${S3_BUCKET}" ]
} ]
}
EOF
pim@ctlog-test:~$ mc admin user add ssd ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
pim@ctlog-test:~$ mc admin policy create ssd ${S3_BUCKET}-access /tmp/minio-access.json
pim@ctlog-test:~$ mc admin policy attach ssd ${S3_BUCKET}-access --user ${AWS_ACCESS_KEY_ID}
pim@ctlog-test:~$ mc anonymous set public ssd/${S3_BUCKET}
```
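A quick sanity check, purely optional, that the user, policy and bucket all line up:
```
pim@ctlog-test:~$ mc admin user info ssd ${AWS_ACCESS_KEY_ID}
pim@ctlog-test:~$ mc admin policy info ssd ${S3_BUCKET}-access
pim@ctlog-test:~$ mc ls ssd/${S3_BUCKET}
```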
After setting up the S3 environment, all I must do is wire it up to the Sunlight configuration
file:
```
pim@ctlog-test:/etc/sunlight$ cat << EOF > sunlight-s3.yaml
listen:
  - "[::]:1443"
checkpoints: /ssd-vol0/sunlight-test/shared/checkpoints.db
logs:
  - shortname: sunlight-test
    inception: 2025-08-10
    submissionprefix: https://ctlog-test.lab.ipng.ch:1443/
    monitoringprefix: http://sunlight-test.minio-ssd.lab.ipng.ch:9000/
    secret: /etc/sunlight/sunlight-test.seed.bin
    cache: /ssd-vol0/sunlight-test/logs/sunlight-test/cache.db
    s3region: eu-schweiz-1
    s3bucket: sunlight-test
    s3endpoint: http://minio-ssd.lab.ipng.ch:9000/
    roots: /etc/sunlight/roots.pem
    period: 200
    poolsize: 15000
    notafterstart: 2024-01-01T00:00:00Z
    notafterlimit: 2025-01-01T00:00:00Z
EOF
```
The one thing of note here is the `roots:` file, which contains the Root CA for the TesseraCT
loadtester I'll be using. In production, Sunlight can grab the approved roots from the
so-called _Common CA Database_ or CCADB: you either specify all roots explicitly with the `roots`
field, or use the `ccadbroots` field and add any additional roots on top with the `extraroots`
field. That's a handy trick! You can find more info on the [[CCADB](https://www.ccadb.org/)] homepage.
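For illustration, a CCADB-backed log entry would carry these two fields in place of the single
`roots:` file used here (the path is hypothetical; the field names and the `testing` value are
Sunlight's):
```
    ccadbroots: testing                        # let Sunlight fetch the CCADB root set
    extraroots: /etc/sunlight/extra-roots.pem  # and append any additional roots on top
```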
I can then start Sunlight just like this:
```
pim@ctlog-test:/etc/sunlight$ sunlight -testcert -c /etc/sunlight/sunlight-s3.yaml
{"time":"2025-08-10T13:49:36.091384532+02:00","level":"INFO","source":{"function":"main.main.func1","file":"/home/pim/src/sunlight/cmd/sunlight/sunlight.go","line":341},"msg":"debug server listening","addr":{"IP":"127.0.0.1","Port":37477,"Zone":""}}
time=2025-08-10T13:49:36.091+02:00 level=INFO msg="debug server listening" addr=127.0.0.1:37477
{"time":"2025-08-10T13:49:36.100471647+02:00","level":"INFO","source":{"function":"main.main","file":"/home/pim/src/sunlight/cmd/sunlight/sunlight.go","line":542},"msg":"today is the Inception date, creating log","log":"sunlight-test"}
time=2025-08-10T13:49:36.100+02:00 level=INFO msg="today is the Inception date, creating log" log=sunlight-test
{"time":"2025-08-10T13:49:36.119529208+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.CreateLog","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":159},"msg":"created log","log":"sunlight-test","timestamp":1754826576111,"logID":"IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E="}
time=2025-08-10T13:49:36.119+02:00 level=INFO msg="created log" log=sunlight-test timestamp=1754826576111 logID="IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E="
{"time":"2025-08-10T13:49:36.127702166+02:00","level":"WARN","source":{"function":"filippo.io/sunlight/internal/ctlog.LoadLog","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":296},"msg":"failed to parse previously trusted roots","log":"sunlight-test","roots":""}
time=2025-08-10T13:49:36.127+02:00 level=WARN msg="failed to parse previously trusted roots" log=sunlight-test roots=""
{"time":"2025-08-10T13:49:36.127766452+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.LoadLog","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":301},"msg":"loaded log","log":"sunlight-test","logID":"IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=","size":0,"timestamp":1754826576111}
time=2025-08-10T13:49:36.127+02:00 level=INFO msg="loaded log" log=sunlight-test logID="IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=" size=0 timestamp=1754826576111
{"time":"2025-08-10T13:49:36.540297532+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.(*Log).sequencePool","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":972},"msg":"sequenced pool","log":"sunlight-test","old_tree_size":0,"entries":0,"start":"2025-08-10T13:49:36.534500633+02:00","tree_size":0,"tiles":0,"timestamp":1754826576534,"elapsed":5788099}
time=2025-08-10T13:49:36.540+02:00 level=INFO msg="sequenced pool" log=sunlight-test old_tree_size=0 entries=0 start=2025-08-10T13:49:36.534+02:00 tree_size=0 tiles=0 timestamp=1754826576534 elapsed=5.788099ms
...
```
Although that looks pretty good, I see that something is not quite right. When Sunlight comes up, it shares
a few links with me, in the `get-roots` and `json` fields on the homepage, but neither of them works:
```
pim@ctlog-test:~$ curl -k https://ctlog-test.lab.ipng.ch:1443/ct/v1/get-roots
404 page not found
pim@ctlog-test:~$ curl -k https://ctlog-test.lab.ipng.ch:1443/log.v3.json
404 page not found
```
I'm starting to think that using a non-standard listen port won't work, or more precisely, that
adding a port to the `submissionprefix` won't work. I notice that the log name has become
`ctlog-test.lab.ipng.ch:1443`, which I don't think is supposed to have a port number in it. So instead,
I make Sunlight `listen` on port 443 and omit the port in the `submissionprefix`, and give it and
its companion Skylight the needed privileges to bind the privileged port like so:
```
pim@ctlog-test:~$ sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/sunlight
pim@ctlog-test:~$ sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/skylight
pim@ctlog-test:~$ sunlight -testcert -c /etc/sunlight/sunlight-s3.yaml
```
{{< image width="60%" src="/assets/ctlog/sunlight-test-s3.png" alt="Sunlight testlog / S3" >}}
And with that, Sunlight reports for duty and the links work. Hoi!
#### Sunlight: Loadtesting S3
I gained some loadtesting experience in the [[TesseraCT article]({{< ref 2025-07-26-ctlog-1
>}})]. One important difference is that Sunlight wants to use SSL for the submission and monitoring
paths, and I've created a snakeoil self-signed cert. CT Hammer does not accept that out of the box,
so I need to make a tiny change to the Hammer:
```
pim@ctlog-test:~/src/tesseract$ git diff
diff --git a/internal/hammer/hammer.go b/internal/hammer/hammer.go
index 3828fbd..1dfd895 100644
--- a/internal/hammer/hammer.go
+++ b/internal/hammer/hammer.go
@@ -104,6 +104,9 @@ func main() {
MaxIdleConns: *numWriters + *numReadersFull + *numReadersRandom,
MaxIdleConnsPerHost: *numWriters + *numReadersFull + *numReadersRandom,
DisableKeepAlives: false,
+ TLSClientConfig: &tls.Config{
+ InsecureSkipVerify: true,
+ },
},
Timeout: *httpTimeout,
}
```
With that small bit of insecurity out of the way, Sunlight makes it otherwise pretty easy for me to
construct the CT Hammer commandline:
```
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
--log_url=http://sunlight-test.minio-ssd.lab.ipng.ch:9000/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
--max_read_ops=0 --num_writers=5000 --max_write_ops=100
pim@ctlog-test:/etc/sunlight$ T=0; O=0; while :; do \
N=$(curl -sS http://sunlight-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E '^[0-9]+$'); \
if [ "$N" -eq "$O" ]; then \
echo -n .; \
else \
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
fi; \
T=$((T+1)); sleep 1; done
24915 1 seconds 96 certs
25011 1 seconds 92 certs
25103 1 seconds 93 certs
25196 1 seconds 87 certs
```
On the first commandline I start the loadtest at 100 writes/sec with the standard duplication
probability of 10%, which allows me to test Sunlight's ability to avoid writing duplicates. This
means I should see the tree grow at about 90/s on average. Check. I raise the write-load to
500/s:
```
39421 1 seconds 443 certs
39864 1 seconds 442 certs
40306 1 seconds 441 certs
40747 1 seconds 447 certs
41194 1 seconds 448 certs
```
... and then to 1'000/s:
```
57941 1 seconds 945 certs
58886 1 seconds 970 certs
59856 1 seconds 948 certs
60804 1 seconds 965 certs
61769 1 seconds 955 certs
```
After a few minutes I see a few errors from CT Hammer:
```
W0810 14:55:29.660710 1398779 analysis.go:134] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
W0810 14:55:30.496603 1398779 analysis.go:124] (1 x) failed to create request: write leaf was not OK. Status code: 500. Body: "failed to read body: read tcp 127.0.1.1:443->127.0.0.1:44908: i/o timeout\n"
```
I raise the Hammer load to 5'000/sec (which means 4'500/s unique certs and 500 duplicates), and find
that the committed writes max out at around 4'200/s:
```
879637 1 seconds 4213 certs
883850 1 seconds 4207 certs
888057 1 seconds 4211 certs
892268 1 seconds 4249 certs
896517 1 seconds 4216 certs
```
The errors keep coming in a steady stream, much like the ones before:
```
W0810 14:59:48.499274 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
W0810 14:59:49.034194 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
W0810 15:00:05.496459 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
W0810 15:00:07.187181 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
```
At this load of 4'200/s, MinIO is not very impressed. Remember that in the [[other article]({{< ref
2025-07-26-ctlog-1 >}})] I loadtested it to about 7'500 ops/sec, and the statistics below show only
about 50 ops/sec (2'800/min). I conclude that MinIO is, in fact, bored of this whole activity:
```
pim@ctlog-test:/etc/sunlight$ mc admin trace --stats ssd
Duration: 18m58s ▱▱▱
RX Rate:↑ 115 MiB/m
TX Rate:↓ 2.4 MiB/m
RPM : 2821.3
-------------
Call Count RPM Avg Time Min Time Max Time Avg TTFB Max TTFB Avg Size Rate /min Errors
s3.PutObject 37602 (70.3%) 1982.2 6.2ms 785µs 86.7ms 6.1ms 86.6ms ↑59K ↓0B ↑115M ↓1.4K 0
s3.GetObject 15918 (29.7%) 839.1 996µs 670µs 51.3ms 912µs 51.2ms ↑46B ↓3.0K ↑38K ↓2.4M 0
```
Sunlight still keeps its certificate cache on local disk. At a rate of 4'200/s, each mirror member of
the ZFS pool sees a write rate of about 105MB/s, at about 877 ZFS writes per second.
```
pim@ctlog-test:/etc/sunlight$ zpool iostat -v ssd-vol0 10
capacity operations bandwidth
pool alloc free read write read write
-------------------------- ----- ----- ----- ----- ----- -----
ssd-vol0 59.1G 685G 0 2.55K 0 312M
mirror-0 59.1G 685G 0 2.55K 0 312M
wwn-0x5002538a05302930 - - 0 877 0 104M
wwn-0x5002538a053069f0 - - 0 871 0 104M
wwn-0x5002538a06313ed0 - - 0 866 0 104M
-------------------------- ----- ----- ----- ----- ----- -----
pim@ctlog-test:/etc/sunlight$ zpool iostat -l ssd-vol0 10
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim
pool alloc free read write read write read write read write read write read write wait wait
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
ssd-vol0 59.0G 685G 0 3.19K 0 388M - 8ms - 628us - 990us - 10ms - 88ms
ssd-vol0 59.2G 685G 0 2.49K 0 296M - 5ms - 557us - 163us - 8ms - -
ssd-vol0 59.6G 684G 0 2.04K 0 253M - 2ms - 704us - 296us - 4ms - -
ssd-vol0 58.8G 685G 0 2.72K 0 328M - 6ms - 783us - 701us - 9ms - 68ms
```
A few interesting observations:
* Sunlight still uses a local sqlite3 database for the certificate tracking, which is more
efficient than MariaDB/MySQL, let alone AWS RDS, so it has one less runtime dependency.
* The write rate to ZFS is significantly higher with Sunlight than with TesseraCT (about 8:1). This is
likely because the sqlite3 database lives on ZFS here, while TesseraCT uses MariaDB
running on a different filesystem.
* The MinIO usage is a lot lighter. As I reduce the load to 1'000/s, as was the case in the TesseraCT
test, I can see that the Get:Put ratio was 93:4 in TesseraCT, while it's 70:30 here. TesseraCT was
also consuming more IOPS, running at about 10.5k requests/minute, while Sunlight is
significantly calmer at 2.8k requests/minute (almost 4x less!)
* The burst capacity of Sunlight is a fair bit higher than TesseraCT, likely due to its more
efficient use of S3 backends.
***Conclusion***: Sunlight S3+MinIO can handle 1'000/s reliably, and can spike to 4'200/s with only
a few errors.
#### Sunlight: Loadtesting POSIX
When I took a closer look at TesseraCT a few weeks ago, it struck me that while a cloud-native
setup with S3 storage allows for a cool way to scale storage and make the read-path redundant, for
example by creating synchronously replicated buckets, it does come at a significant
operational overhead and complexity. My main concern is the number of different moving parts, and
Sunlight really has one very appealing property: it can run entirely on one machine without the need
for any other moving parts - even the SQL database is linked in. That's pretty slick.
```
pim@ctlog-test:/etc/sunlight$ cat << EOF > sunlight.yaml
listen:
  - "[::]:443"
checkpoints: /ssd-vol0/sunlight-test/shared/checkpoints.db
logs:
  - shortname: sunlight-test
    inception: 2025-08-10
    submissionprefix: https://ctlog-test.lab.ipng.ch/
    monitoringprefix: https://ctlog-test.lab.ipng.ch:1443/
    secret: /etc/sunlight/sunlight-test.seed.bin
    cache: /ssd-vol0/sunlight-test/logs/sunlight-test/cache.db
    localdirectory: /ssd-vol0/sunlight-test/logs/sunlight-test/data
    roots: /etc/sunlight/roots.pem
    period: 200
    poolsize: 15000
    notafterstart: 2024-01-01T00:00:00Z
    notafterlimit: 2025-01-01T00:00:00Z
EOF
pim@ctlog-test:/etc/sunlight$ sunlight -testcert -c sunlight.yaml
pim@ctlog-test:/etc/sunlight$ skylight -testcert -c skylight.yaml
```
First I'll start a hello-world loadtest at 100/s and take a look at the number of leaves in the
checkpoint after a few minutes. I would expect three minutes' worth at 100/s with a duplicate
probability of 10% to yield about 16'200 unique certificates in total.
```
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
10086
15518
20920
26339
```
And would you look at that? `(26339-10086)` is right on the dot! One thing that I find particularly
cool about Sunlight is its baked-in Prometheus metrics. This gives me some pretty solid insight into
its performance. Take a look, for example, at the write-path latency tail (99th percentile):
```
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 0.207285993
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.001409719
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.002227985
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000224969
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} 8.3003e-05
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.042118751
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 0.2259605
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 0.108987393
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.014922489
```
I'm seeing here that at a load of 100/s (with 90/s of unique certificates), the 99th percentile
add-chain latency is 207ms, which makes sense because the `period` configuration field is set to
200ms. The filesystem operations (discard, fetch, upload) are _de minimis_ and the sequencing
duration is at 109ms. Excellent!
But can this thing go really fast? I remember that the CT Hammer uses more CPU than TesseraCT,
and I saw above, when running my 5'000/s loadtest, that this is about all the hammer can take on
a single Dell R630. So, as I did with the TesseraCT test, I'll use the MinIO SSD and MinIO Disk
machines to generate the load.
I boot them, so that I can hammer, or shall I say jackhammer away:
```
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000
pim@minio-ssd:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000 --serial_offset=1000000
pim@minio-disk:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000 --serial_offset=2000000
```
This will generate 15'000/s of load, which I note does bring Sunlight to its knees, although it does
remain stable (yaay!) with a somewhat more bursty checkpoint interval:
```
5504780 1 seconds 4039 certs
5508819 1 seconds 10000 certs
5518819 . 2 seconds 7976 certs
5526795 1 seconds 2022 certs
5528817 1 seconds 9782 certs
5538599 1 seconds 217 certs
5538816 1 seconds 3114 certs
5541930 1 seconds 6818 certs
```
So what I do instead is a somewhat simpler measurement of certificates per minute:
```
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
6008831
6296255
6576712
```
This rate boils down to `(6576712-6008831)/120` or 4'700/s of written certs, which at a duplication
ratio of 10% means approximately 5'200/s of total accepted certs. At this rate, Sunlight is consuming
about 10.3 CPUs/s, while Skylight is at 0.1 CPUs/s and the CT Hammer is at 11.1 CPUs/s. Given the 40
threads on this machine, I am not saturating the CPU, but I'm curious, as this rate is significantly
lower than TesseraCT's. I briefly turn off the hammer on `ctlog-test` to allow Sunlight to monopolize
the entire machine. The CPU use does reduce to about 9.3 CPUs/s, suggesting that indeed, the bottleneck
is not strictly CPU:
{{< image width="90%" src="/assets/ctlog/btop-sunlight.png" alt="Sunlight btop" >}}
When using only two CT Hammers (on `minio-ssd.lab.ipng.ch` and `minio-disk.lab.ipng.ch`), the CPU
use on the `ctlog-test.lab.ipng.ch` machine definitely goes down (CT Hammer is kind of a CPU hog...),
but the resulting throughput doesn't change that much:
```
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
7985648
8302421
8528122
8772758
```
What I find particularly interesting is that the total rate stays at approximately 4'400/s
(`(8772758-7985648)/180`), while the checkpoint latency varies considerably. As I noted earlier,
Sunlight comes with baked-in Prometheus metrics, which I can take a look at
while keeping it under this load of ~10'000/sec:
```
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 1.889983538
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.000148819
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.837981208
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000433179
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} NaN
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.067494558
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 1.86894666
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 1.111400223
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.016859223
```
Comparing the throughput at 4'400/s with that first test of 100/s, I expect and can confirm a
significant increase in all of these metrics. The 99th percentile add-chain latency is now 1889ms (up
from 207ms) and the sequencing duration is now 1111ms (up from 109ms).
#### Sunlight: Effect of period
I fiddle a little bit with Sunlight's configuration file, notably the `period` and `poolsize`.
First I set `period:2000` and `poolsize:15000`, which yields pretty much the same throughput:
```
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
701850
1001424
1295508
1575789
```
With a generated load of 10'000/sec with a 10% duplication rate, I am offering roughly 9'000/sec of
unique certificates, and I'm seeing `(1575789 - 701850)/180` or about 4'855/sec come through. Just
for reference, at this rate and with `period:2000`, the latency tail looks like this:
```
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 3.203510079
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.000108613
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.950453973
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.00046192
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} NaN
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.049007693
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 3.570709413
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 1.5968609040000001
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.010847308
```
Then I set `period:100` and `poolsize:15000`, which does improve things a bit:
```
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
560654
950524
1324645
1720362
```
With the same generated load of 10'000/sec with a 10% duplication rate, I am still offering roughly
9'000/sec of unique certificates, and I'm seeing `(1720362 - 560654)/180` or about 6'440/sec come
through, which is a fair bit better, at the expense of more disk activity. At this rate and with
`period:100`, the latency tail looks like this:
```
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 1.616046445
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 7.5123e-05
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.534935803
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000377273
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} 4.8893e-05
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.054685991
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 1.946445877
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 0.980602185
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.018385831
```
***Conclusion***: Sunlight on POSIX can reliably handle 4'400/s (with a duplicate rate of 10%) on
this setup.
## Wrapup - Observations
From an operator's point of view, TesseraCT and Sunlight handle quite differently. Both are easily up
to the task of serving the current write-load (which is about 250/s).
* ***S3***: When using the S3 backend, TesseraCT became quite unhappy above 800/s, while Sunlight
went all the way up to 4'200/s and sent significantly fewer requests to MinIO (about 4x fewer),
while showing good telemetry on the use of S3 backends. In this mode, TesseraCT uses MySQL (in
my case, MariaDB), which was not on the ZFS pool, but on the boot disk.
* ***POSIX***: When using a normal filesystem, Sunlight seems to peak at 4'800/s while TesseraCT
went all the way to 12'000/s. When doing so, disk IO was quite similar between the two
solutions, taking into account that TesseraCT runs BadgerDB while Sunlight uses sqlite3,
both on their respective ZFS pools.
***Notable***: Sunlight's POSIX and S3 performance is roughly identical (both handle about
5'000/sec), while TesseraCT's POSIX performance (12'000/s) is significantly better than its S3
performance (800/s). Some other observations:
* Sunlight has a very opinionated configuration, and can run multiple logs with one configuration
file and one binary. Its configuration was a bit constraining though: I could not manage to
use `monitoringprefix` or `submissionprefix` with an `http://` prefix - a likely security
precaution - and using ports in those prefixes (other than the standard 443) rendered
Sunlight and Skylight unusable for me.
* Skylight only serves from a local directory; it does not have support for S3. For operators using S3,
an alternative could be to use NGINX in the serving path, similar to TesseraCT. Skylight does have
a few things to teach me though, notably on proper compression, content-type and other headers.
* TesseraCT does not have a configuration file, and will run exactly one log per binary
instance. It uses flags to construct the environment, and is much more forgiving of creative
`origin` (log name), submission- and monitoring-URLs. It's happy to use regular 'http://'
for both, which comes in handy in those architectures where the system is serving behind a
reverse proxy.
* The TesseraCT Hammer tool, on the other hand, does not like self-signed certificates, and needs
to be told to skip certificate validation when loadtesting a Sunlight that runs
with the `-testcert` commandline.
I consider all of these small and mostly cosmetic issues, because in production there will be proper
TLS certificates issued and normal https:// serving ports with unique monitoring and submission
hostnames.
## What's Next
Together with Antonis Chariton and Jeroen Massar, IPng Networks will be offering both TesseraCT and
Sunlight logs on the public internet. One final step is to productionize both logs, and file the
paperwork for them in the community. Although at this point our Sunlight log is already running,
I'll wait a few weeks to gather any additional intel, before wrapping up in a final article.


@@ -0,0 +1,515 @@
---
date: "2025-08-24T12:07:23Z"
title: 'Certificate Transparency - Part 3 - Operations'
---
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
# Introduction
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)], as the
name suggests it was a form of _digital notary_, and they were in the business of issuing security
certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and
subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for
man-in-the-middle attacks on Iranian Gmail users. Not cool.
Google launched a project called **Certificate Transparency**, because it was becoming more common
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
the Web Public Key Infrastructure. It led to the creation of this ambitious
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
and _TLS_ (Transport Layer Security).
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
describes an experimental protocol for publicly logging the existence of Transport Layer Security
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
the logs.
In the first two articles of this series, I explored [[TesseraCT]({{< ref 2025-07-26-ctlog-1 >}})]
and [[Sunlight]({{< ref 2025-08-10-ctlog-2 >}})], two open source implementations of the Static CT
protocol. In this final article, I'll share the details on how I created the environment and
production instances for four logs that IPng will be providing: Rennet and Lipase are two
ingredients to make cheese and will serve as our staging/testing logs. Gouda and Halloumi are two
delicious cheeses that pay homage to our heritage, Jeroen and I being Dutch and Antonis being
Greek.
## Hardware
At IPng Networks, all hypervisors are from the same brand: Dell's PowerEdge line. In this project,
Jeroen is also contributing a server, and it so happens that he also has a Dell PowerEdge. We're
both running Debian on our hypervisors, so we install a fresh VM with Debian 13.0, codenamed
_Trixie_, and give the machine 16GB of memory, 8 vCPUs and a 16GB boot disk. Boot disks are placed on
the hypervisor's ZFS pool, and a blockdevice snapshot is taken every 6hrs. This allows the boot disk
to be rolled back to a last known good point in case an upgrade goes south. If you haven't seen it
yet, take a look at [[zrepl](https://zrepl.github.io/)], a one-stop, integrated solution for ZFS
replication. This tool is incredibly powerful, and can do snapshot management, sourcing / sinking
to remote hosts, of course using incremental snapshots as they are native to ZFS.
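As a sketch of what those 6-hourly snapshots buy us (the dataset and snapshot names here are
hypothetical):
```
hyp0:~$ sudo zfs snapshot ssd-vol0/vms/ctlog1-boot@zrepl-20250824-0600
hyp0:~$ zfs list -t snapshot -o name,used,creation ssd-vol0/vms/ctlog1-boot
hyp0:~$ # after a botched upgrade: roll back to the last known good snapshot
hyp0:~$ # (-r also destroys any snapshots taken after it)
hyp0:~$ sudo zfs rollback -r ssd-vol0/vms/ctlog1-boot@zrepl-20250824-0600
```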
Once the machine is up, we pass through four enterprise-class storage drives, in our case 3.84TB Kioxia
NVMe drives, model _KXD51RUE3T84_. These are PCIe 3.1 x4, NVMe 1.2.1 drives with good
durability and a reasonable (albeit not stellar) read throughput of ~2700MB/s, write throughput of
~800MB/s, 240 kIOPS random read and 21 kIOPS random write. My attention is also drawn to one
specific point on the spec sheet: these drives allow for 1.0 DWPD, which stands for _Drive Writes Per
Day_, in other words they are not going to run themselves off a cliff after a few petabytes of
writes, and I am reminded that a CT Log wants to write to disk a lot during normal operation.
The point of these logs is to **keep them safe**, and the most important aspects of the compute
environment are the use of ECC memory to detect single bit errors, and dependable storage. Toshiba
makes a great product.
```
ctlog1:~$ sudo zpool create -f -o ashift=12 -o autotrim=on -O atime=off -O xattr=sa \
            ssd-vol0 raidz2 /dev/disk/by-id/nvme-KXD51RUE3T84_TOSHIBA_*M
ctlog1:~$ sudo zfs create -o encryption=on -o keyformat=passphrase ssd-vol0/enc
ctlog1:~$ sudo zfs create ssd-vol0/logs
ctlog1:~$ for log in lipase; do \
            for shard in 2025h2 2026h1 2026h2 2027h1 2027h2; do \
              sudo zfs create ssd-vol0/logs/${log}${shard}; \
            done; \
          done
```
The hypervisor will use PCI passthrough for the NVMe drives, and we'll handle ZFS directly in the
VM. The first command creates a ZFS raidz2 pool using 4kB blocks, turns off _atime_ (which avoids one
metadata write for each read!), and turns on SSD trimming in ZFS, a very useful feature.
Then I'll create an encrypted volume for the configuration and key material. This way, if the
machine is ever physically transported, the keys will be safe in transit. Finally, I'll create the
temporal log shards starting at 2025h2, all the way through to 2027h2 for our testing log called
_Lipase_ and our production log called _Halloumi_ on Jeroen's machine. On my own machine, it'll be
_Rennet_ for the testing log and _Gouda_ for the production log.
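One operational consequence of the passphrase-encrypted dataset is worth writing down: after a reboot
or a physical move, somebody has to load the key before anything under `ssd-vol0/enc` becomes
readable. Roughly:
```
ctlog1:~$ sudo zfs load-key ssd-vol0/enc     # prompts for the passphrase
ctlog1:~$ sudo zfs mount ssd-vol0/enc
ctlog1:~$ zfs get -r keystatus ssd-vol0/enc  # should now show 'available'
```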
## Sunlight
{{< image width="10em" float="right" src="/assets/ctlog/sunlight-logo.png" alt="Sunlight logo" >}}
I set up Sunlight first, as its authors have extensive operational notes, both in the form of the
[[config](https://config.sunlight.geomys.org/)] of Geomys' _Tuscolo_ log, as well as on the
[[Sunlight](https://sunlight.dev)] homepage. I really appreciate that Filippo added some
[[Gists](https://gist.github.com/FiloSottile/989338e6ba8e03f2c699590ce83f537b)] and
[[Doc](https://docs.google.com/document/d/1ID8dX5VuvvrgJrM0Re-jt6Wjhx1eZp-trbpSIYtOhRE/edit?tab=t.0#heading=h.y3yghdo4mdij)]
with pretty much all I need to know to run one too. Our Rennet and Gouda logs use a very similar
approach for their configuration, with one notable exception: the VMs do not have a public IP
address, and are tucked away in a private network called IPng Site Local. I'll get back to that
later.
```
ctlog@ctlog0:/ssd-vol0/enc/sunlight$ cat << EOF | tee sunlight-staging.yaml
listen:
  - "[::]:16420"
checkpoints: /ssd-vol0/shared/checkpoints.db
logs:
  - shortname: rennet2025h2
    inception: 2025-07-28
    period: 200
    poolsize: 750
    submissionprefix: https://rennet2025h2.log.ct.ipng.ch
    monitoringprefix: https://rennet2025h2.mon.ct.ipng.ch
    ccadbroots: testing
    extraroots: /ssd-vol0/enc/sunlight/extra-roots-staging.pem
    secret: /ssd-vol0/enc/sunlight/keys/rennet2025h2.seed.bin
    cache: /ssd-vol0/logs/rennet2025h2/cache.db
    localdirectory: /ssd-vol0/logs/rennet2025h2/data
    notafterstart: 2025-07-01T00:00:00Z
    notafterlimit: 2026-01-01T00:00:00Z
  ...
EOF
ctlog@ctlog0:/ssd-vol0/enc/sunlight$ cat << EOF | tee skylight-staging.yaml
listen:
  - "[::]:16421"
homeredirect: https://ipng.ch/s/ct/
logs:
  - shortname: rennet2025h2
    monitoringprefix: https://rennet2025h2.mon.ct.ipng.ch
    localdirectory: /ssd-vol0/logs/rennet2025h2/data
    staging: true
  ...
EOF
```
In the first configuration file, I'll tell _Sunlight_ (the write path component) to listen on port
`:16420` and I'll tell _Skylight_ (the read path component) to listen on port `:16421`. I've disabled
the automatic certificate renewals, and will handle SSL upstream. A few notes on this:
1. Most importantly, I will be using a common frontend pool with a wildcard certificate for
`*.ct.ipng.ch`. I wrote about [[DNS-01]({{< ref 2023-03-24-lego-dns01 >}})] before; it's a very
convenient way for IPng to do certificate pool management. I will be serving all log
types under this one certificate.
1. ACME/HTTP-01 could be made to work with a bit of effort: plumbing through the `/.well-known/`
URIs on the frontend and pointing them to these instances. But then the cert would have to be copied
from Sunlight back to the frontends.
I've noticed that when the log doesn't exist yet, I can start Sunlight and it'll create the bits and
pieces on the local filesystem and start writing checkpoints. But if the log already exists, I am
required to have the _monitoringprefix_ active, otherwise Sunlight won't start up. It's a small
thing, as I will have the read path operational in a few simple steps. Anyway, all five log shards
for Rennet, and a few days later for Gouda, are operational this way.
Skylight provides all the things I need to serve the data back, which is a huge help. The [[Static
Log Spec](https://github.com/C2SP/C2SP/blob/main/static-ct-api.md)] is very clear on things like
compression, content-type, cache-control and other headers. Skylight makes this a breeze, as it reads
a configuration file very similar to the Sunlight write-path one, and takes care of it all for me.
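A quick way to spot-check what's being served is to look at the response headers, here for the
checkpoint of one of the Rennet shards (tiles and issuers can be checked the same way):
```
$ curl -sSD - -o /dev/null https://rennet2025h2.mon.ct.ipng.ch/checkpoint \
    | egrep -i '^(content-type|cache-control|content-encoding)'
```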
## TesseraCT
{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="TesseraCT logo" >}}
Good news came to our community on August 14th, when Google's TrustFabric team announced the Alpha
milestone of [[TesseraCT](https://blog.transparency.dev/introducing-tesseract)]. This release
also promoted the POSIX variant out of experimental status, alongside the already further-along GCP
and AWS personalities. After playing around with it with Al and the team, I think I've learned enough
to get us going with a public `tesseract-posix` instance.
One thing I liked about Sunlight is its compact YAML file that describes the pertinent bits of the
system, and that I can serve any number of logs with the same process. On the other hand, TesseraCT
can serve only one log per process. Both have pros and cons: notably, if a poisonous submission
were offered, Sunlight might take down all logs, while TesseraCT would only take down the log
receiving the offensive submission. Then again, maintaining separate processes is cumbersome,
and all log instances need to be meticulously configured.
### TesseraCT genconf
I decide to automate this by vibe-coding a little tool called `tesseract-genconf`, which I've published on
[[Gitea](https://git.ipng.ch/certificate-transparency/cheese)]. It takes a YAML file
describing the logs, and outputs the bits and pieces needed to operate multiple separate processes
that together form the sharded static log. I've attempted to stay mostly compatible with the
Sunlight YAML configuration, and came up with a variant like this one:
```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat << EOF | tee tesseract-staging.yaml
listen:
  - "[::]:8080"
roots: /ssd-vol0/enc/tesseract/roots.pem
logs:
  - shortname: lipase2025h2
    listen: "[::]:16900"
    submissionprefix: https://lipase2025h2.log.ct.ipng.ch
    monitoringprefix: https://lipase2025h2.mon.ct.ipng.ch
    extraroots: /ssd-vol0/enc/tesseract/extra-roots-staging.pem
    secret: /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
    localdirectory: /ssd-vol0/logs/lipase2025h2/data
    notafterstart: 2025-07-01T00:00:00Z
    notafterlimit: 2026-01-01T00:00:00Z
  ...
EOF
```
With this snippet, I have all the information I need. Here are the steps I take to construct the log
itself:
***1. Generate keys***
The keys are `prime256v1`, and the format that TesseraCT accepts has changed since I wrote up my first
[[deep dive]({{< ref 2025-07-26-ctlog-1 >}})] a few weeks ago. Now, the tool accepts a `PEM` format
private key, from which the _Log ID_ and _Public Key_ can be derived. So off I go:
```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-key
Creating /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2026h1.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2026h2.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2027h1.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2027h2.pem
```
Of course, if a file already exists at that location, it'll just print a warning like:
```
Key already exists: /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem (skipped)
```
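For the curious, roughly the same key material can be made by hand with `openssl`, and the RFC 6962
_Log ID_ is simply the SHA-256 over the DER-encoded public key. This is a sketch of the idea, not
necessarily the tool's exact code path:
```
ctlog@ctlog1:~$ openssl ecparam -name prime256v1 -genkey -noout -out lipase2025h2.pem
ctlog@ctlog1:~$ openssl ec -in lipase2025h2.pem -pubout -out lipase2025h2.pub.pem
ctlog@ctlog1:~$ # Log ID: SHA-256 over the DER-encoded SubjectPublicKeyInfo, base64 encoded
ctlog@ctlog1:~$ openssl ec -in lipase2025h2.pem -pubout -outform DER \
                  | openssl dgst -sha256 -binary | base64
```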
***2. Generate JSON/HTML***
I will be operating the read-path with NGINX. Log operators have started speaking about their log
metadata in terms of a small JSON file called `log.v3.json`, and Skylight does a good job of
exposing that one, alongside all the other pertinent metadata. So I'll generate these files for each
of the logs:
```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-html
Creating /ssd-vol0/logs/lipase2025h2/data/index.html
Creating /ssd-vol0/logs/lipase2025h2/data/log.v3.json
Creating /ssd-vol0/logs/lipase2026h1/data/index.html
Creating /ssd-vol0/logs/lipase2026h1/data/log.v3.json
Creating /ssd-vol0/logs/lipase2026h2/data/index.html
Creating /ssd-vol0/logs/lipase2026h2/data/log.v3.json
Creating /ssd-vol0/logs/lipase2027h1/data/index.html
Creating /ssd-vol0/logs/lipase2027h1/data/log.v3.json
Creating /ssd-vol0/logs/lipase2027h2/data/index.html
Creating /ssd-vol0/logs/lipase2027h2/data/log.v3.json
```
{{< image width="60%" src="/assets/ctlog/lipase.png" alt="TesseraCT Lipase Log" >}}
It's nice to see a familiar look-and-feel for these logs appear in those `index.html` files (which all
cross-link to each other within the logs specified in `tesseract-staging.yaml`), which is dope.
***3. Generate Roots***
Antonis had seen this before (thanks for the explanation!) but TesseraCT does not natively implement
fetching of the [[CCADB](https://www.ccadb.org/)] roots. But, he points out, you can just get them
from any other running log instance, so I'll implement a `gen-roots` command:
```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf gen-roots \
--source https://tuscolo2027h1.sunlight.geomys.org --output production-roots.pem
Fetching roots from: https://tuscolo2027h1.sunlight.geomys.org/ct/v1/get-roots
2025/08/25 08:24:58 Warning: Failed to parse certificate,carefully skipping: x509: negative serial number
Creating production-roots.pem
Successfully wrote 248 certificates to tusc.pem (out of 249 total)
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf gen-roots \
--source https://navigli2027h1.sunlight.geomys.org --output testing-roots.pem
Fetching roots from: https://navigli2027h1.sunlight.geomys.org/ct/v1/get-roots
Creating testing-roots.pem
Successfully wrote 82 certificates to tusc.pem (out of 82 total)
```
I can do this regularly, say daily, in a cronjob, and if the files change, restart the
TesseraCT processes. It's not ideal (because the restart might be briefly disruptive), but it's a
reasonable option for the time being.
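A sketch of what that could look like; the helper script and its location are hypothetical, but the
`tesseract-genconf` invocations are the ones shown above:
```
ctlog@ctlog1:~$ crontab -l
30 4 * * * /home/ctlog/bin/refresh-roots.sh
ctlog@ctlog1:~$ cat /home/ctlog/bin/refresh-roots.sh
#!/bin/sh
# Hypothetical helper, not part of tesseract-genconf: refresh the roots, and only
# restart the shards (via the tesseract@ template unit set up in the next step) if they changed.
set -e
cd /ssd-vol0/enc/tesseract
tesseract-genconf gen-roots --source https://navigli2027h1.sunlight.geomys.org --output roots.pem.new
if cmp -s roots.pem roots.pem.new; then
  rm roots.pem.new          # nothing changed, nothing to do
else
  mv roots.pem.new roots.pem
  tesseract-genconf -c tesseract-staging.yaml gen-env   # re-assembles the per-instance roots.pem files
  for shard in 2025h2 2026h1 2026h2 2027h1 2027h2; do
    sudo systemctl restart tesseract@lipase${shard}     # assumes ctlog may restart these via sudo
  done
fi
```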
***4. Generate TesseraCT cmdline***
I will be running TesseraCT as a _templated unit_ in systemd. These are systemd unit files that take
an argument; they have an `@` in their name, like so:
```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat << EOF | sudo tee /lib/systemd/system/tesseract@.service
[Unit]
Description=Tesseract CT Log service for %i
ConditionFileExists=/ssd-vol0/logs/%i/data/.env
After=network.target
[Service]
# The %i here refers to the instance name, e.g., "lipase2025h2"
# This path should point to where your instance-specific .env files are located
EnvironmentFile=/ssd-vol0/logs/%i/data/.env
ExecStart=/home/ctlog/bin/tesseract-posix $TESSERACT_ARGS
User=ctlog
Group=ctlog
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
```
I can now implement a `gen-env` command for my tool:
```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-env
Creating /ssd-vol0/logs/lipase2025h2/data/roots.pem
Creating /ssd-vol0/logs/lipase2025h2/data/.env
Creating /ssd-vol0/logs/lipase2026h1/data/roots.pem
Creating /ssd-vol0/logs/lipase2026h1/data/.env
Creating /ssd-vol0/logs/lipase2026h2/data/roots.pem
Creating /ssd-vol0/logs/lipase2026h2/data/.env
Creating /ssd-vol0/logs/lipase2027h1/data/roots.pem
Creating /ssd-vol0/logs/lipase2027h1/data/.env
Creating /ssd-vol0/logs/lipase2027h2/data/roots.pem
Creating /ssd-vol0/logs/lipase2027h2/data/.env
```
Looking at one of those .env files, I can show the exact commandline I'll be feeding to the
`tesseract-posix` binary:
```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat /ssd-vol0/logs/lipase2025h2/data/.env
TESSERACT_ARGS="--private_key=/ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
--origin=lipase2025h2.log.ct.ipng.ch --storage_dir=/ssd-vol0/logs/lipase2025h2/data
--roots_pem_file=/ssd-vol0/logs/lipase2025h2/data/roots.pem --http_endpoint=[::]:16900
--not_after_start=2025-07-01T00:00:00Z --not_after_limit=2026-01-01T00:00:00Z"
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```
{{< image width="7em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
A quick operational note on OpenTelemetry (also often referred to as OTel): Al and the TrustFabric
team added OpenTelemetry support to the TesseraCT personalities, as it was mostly already implemented in
the underlying Tessera library. By default, it'll try to send its telemetry to localhost using
`https`, which makes sense in those cases where the collector is on a different machine. In my case,
I'll keep `otelcol` (the collector) on the same machine. Its job is to consume the OTel telemetry
stream, and turn it back into a Prometheus `/metrics` endpoint on port `:9464`.
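For reference, the `otelcol` side of this is only a handful of lines. A minimal sketch, assuming a
collector build that ships the `prometheus` exporter (for example `otelcol-contrib`) and the Debian
package's default config path and service name:
```
ctlog@ctlog1:~$ cat << EOF | sudo tee /etc/otelcol/config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: "localhost:4318"
exporters:
  prometheus:
    endpoint: "[::]:9464"
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
EOF
ctlog@ctlog1:~$ sudo systemctl restart otelcol
```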
The `gen-env` command also assembles the per-instance `roots.pem` file. For staging logs, it'll take
the file pointed to by the `roots:` key, and append any per-log `extraroots:` files. For me, these
extraroots are empty and the main roots file points at either the testing roots that came from
_Rennet_ (our Sunlight staging log), or the production roots that came from _Gouda_. A job well done!
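With the template unit and the per-shard `.env` files in place, each log shard becomes one systemd
instance. Bringing up the Lipase shards then looks roughly like this:
```
ctlog@ctlog1:~$ sudo systemctl daemon-reload
ctlog@ctlog1:~$ for shard in 2025h2 2026h1 2026h2 2027h1 2027h2; do \
    sudo systemctl enable --now tesseract@lipase${shard}; \
  done
ctlog@ctlog1:~$ systemctl status tesseract@lipase2025h2
```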
***5. Generate NGINX***
When I first ran my tests, I noticed that the log check tool called `ct-fsck` threw errors on my
read path. Filippo explained that the HTTP headers matter in the Static CT specification. Tiles,
Issuers, and Checkpoint must all have specific caching and content type headers set. This is what
makes Skylight such a gem - I get to read it (and the spec!) to see what I'm supposed to be serving.
And thus the `gen-nginx` command is born; the NGINX it drives listens on port `:8080` for requests:
```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-nginx
Creating nginx config: /ssd-vol0/logs/lipase2025h2/data/lipase2025h2.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2026h1/data/lipase2026h1.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2026h2/data/lipase2026h2.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2027h1/data/lipase2027h1.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2027h2/data/lipase2027h2.mon.ct.ipng.ch.conf
```
All that's left for me to do is symlink these from `/etc/nginx/sites-enabled/` and the read-path is
off to the races. With these commands in the `tesseract-genconf` tool, I am hoping that future
travelers have an easy time setting up their static log. Please let me know if you'd like to use, or
contribute to, the tool. You can find me in the Transparency Dev Slack, in #ct and also #cheese.
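To make that symlink step concrete, globbing over the Lipase shards:
```
ctlog@ctlog1:~$ for conf in /ssd-vol0/logs/lipase*/data/*.mon.ct.ipng.ch.conf; do \
    sudo ln -sf ${conf} /etc/nginx/sites-enabled/; \
  done
ctlog@ctlog1:~$ sudo nginx -t && sudo systemctl reload nginx
```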
## IPng Frontends
{{< image width="18em" float="right" src="/assets/ctlog/MPLS Backbone - CTLog.svg" alt="ctlog at ipng" >}}
IPng Networks has a private internal network called [[IPng Site Local]({{< ref 2023-03-11-mpls-core
>}})], which is not routed on the internet. Our [[Frontends]({{< ref 2023-03-17-ipng-frontends >}})]
are the only things that have public IPv4 and IPv6 addresses. It allows for things like anycasted
webservers and loadbalancing with
[[Maglev](https://research.google/pubs/maglev-a-fast-and-reliable-software-network-load-balancer/)].
The IPng Site Local network kind of looks like the picture to the right. The hypervisors running the
Sunlight and TesseraCT logs are at NTT Zurich1 in R&uuml;mlang, Switzerland. The IPng frontends are
in green, and the sweet thing is, some of them run in IPng's own ISP network (AS8298), while others
run in partner networks (like IP-Max AS25091, and Coloclue AS8283). This means that I will benefit
from some pretty solid connectivity redundancy.
The frontends are provisioned with Ansible. There are two aspects to them - firstly, a _certbot_
instance maintains the Let's Encrypt wildcard certificates for `*.ct.ipng.ch`. There's a machine
tucked away somewhere called `lego.net.ipng.ch` -- again, not exposed on the internet -- and its job
is to renew certificates and copy them to the machines that need them. Next, a cluster of NGINX
servers uses these certificates to expose IPng and customer services to the Internet.
I can tie it all together with a snippet like so, for which I apologize in advance - it's quite a
wall of text:
```
map $http_user_agent $no_cache_ctlog_lipase {
  "~*TesseraCT fsck" 1;
  default 0;
}
server {
  listen [::]:443 ssl http2;
  listen 0.0.0.0:443 ssl http2;
  ssl_certificate /etc/certs/ct.ipng.ch/fullchain.pem;
  ssl_certificate_key /etc/certs/ct.ipng.ch/privkey.pem;
  include /etc/nginx/conf.d/options-ssl-nginx.inc;
  ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
  server_name lipase2025h2.log.ct.ipng.ch;
  access_log /nginx/logs/lipase2025h2.log.ct.ipng.ch-access.log upstream buffer=512k flush=5s;
  include /etc/nginx/conf.d/ipng-headers.inc;
  location = / {
    proxy_http_version 1.1;
    proxy_set_header Host lipase2025h2.mon.ct.ipng.ch;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_pass http://ctlog1.net.ipng.ch:8080/index.html;
  }
  location = /metrics {
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_pass http://ctlog1.net.ipng.ch:9464;
  }
  location / {
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_pass http://ctlog1.net.ipng.ch:16900;
  }
}
server {
  listen [::]:443 ssl http2;
  listen 0.0.0.0:443 ssl http2;
  ssl_certificate /etc/certs/ct.ipng.ch/fullchain.pem;
  ssl_certificate_key /etc/certs/ct.ipng.ch/privkey.pem;
  include /etc/nginx/conf.d/options-ssl-nginx.inc;
  ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
  server_name lipase2025h2.mon.ct.ipng.ch;
  access_log /nginx/logs/lipase2025h2.mon.ct.ipng.ch-access.log upstream buffer=512k flush=5s;
  include /etc/nginx/conf.d/ipng-headers.inc;
  location = /checkpoint {
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_pass http://ctlog1.net.ipng.ch:8080;
  }
  location / {
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    include /etc/nginx/conf.d/ipng-upstream-headers.inc;
    proxy_cache ipng_cache;
    proxy_cache_key "$scheme://$host$request_uri";
    proxy_cache_valid 200 24h;
    proxy_cache_revalidate off;
    proxy_cache_bypass $no_cache_ctlog_lipase;
    proxy_no_cache $no_cache_ctlog_lipase;
    proxy_pass http://ctlog1.net.ipng.ch:8080;
  }
}
```
Taking _Lipase_ shard 2025h2 as an example, the submission path (on `*.log.ct.ipng.ch`) will show
the same `index.html` as the monitoring path (on `*.mon.ct.ipng.ch`), to provide some consistency
with Sunlight logs. Otherwise, the `/metrics` endpoint is forwarded to the `otelcol` running on port
`:9464`, and the rest (the `/ct/v1/` endpoints and so on) is sent to the TesseraCT instance on port
`:16900`.
The read-path then makes a special case of the `/checkpoint` endpoint, which it does not cache. That
request (like all others) is forwarded to port `:8080`, which is where the local NGINX is running. Other
requests (notably `/tile` and `/issuer`) are cacheable, so I'll cache these on the frontend NGINX
servers, both for resilience as well as for performance. Having four of these NGINX frontends will
allow the Static CT logs (regardless of being Sunlight or TesseraCT) to serve very high read-rates.
## What's Next
I need to spend a little bit of time thinking about rate limits, specifically write rate-limits. I
think I'll use a request limiter in upstream NGINX, allowing each IP, /24, or /48 subnet to send
only a fixed number of requests/sec. I'll probably keep that part private though, as it's a good
rule of thumb to never offer information to attackers.
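As a rough sketch (with placeholder zone names, rates and bursts, keyed per client IP; grouping per
/24 or /48 would need an additional `map`), such a limiter could look like this:
```
# Hypothetical write-path rate limiter; names and numbers are placeholders,
# not the values IPng uses. limit_req_zone lives in the http{} context.
limit_req_zone $binary_remote_addr zone=ctlog_write:10m rate=10r/s;

# Inside the submission server block, e.g. lipase2025h2.log.ct.ipng.ch:
location /ct/v1/ {
    limit_req zone=ctlog_write burst=20 nodelay;
    limit_req_status 429;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_pass http://ctlog1.net.ipng.ch:16900;
}
```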
Together with Antonis Chariton and Jeroen Massar, IPng Networks will be offering both TesseraCT and
Sunlight logs on the public internet. One final step is to productionize both logs and file the
paperwork for them with the community. At this point our Sunlight log has been running for a month
or so, and we've filed the paperwork for it to be included at Apple and Google.
I'm going to have folks poke at _Lipase_ as well, after which I'll run a few `ct-fsck` passes to
make sure the logs are sane, before submitting them to the inclusion programs too. Wish us luck!

content/ctlog.md Normal file

@@ -0,0 +1,73 @@
---
title: 'Certificate Transparency'
date: 2025-07-30
url: /s/ct
---
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
Certificate Transparency logs are "append-only" and publicly-auditable ledgers of certificates being
created, updated, and expired. This is the homepage for IPng Networks' Certificate Transparency
project.
Certificate Transparency [[CT](https://certificate.transparency.dev)] is a system for logging and
monitoring certificate issuance. It greatly enhances everyone's ability to monitor and study
certificate issuance, and these capabilities have led to numerous improvements to the CA ecosystem
and Web security. As a result, it is rapidly becoming critical Internet infrastructure. Originally
developed by Google, the concept is now being adopted by many _Certification Authorities_ who log
their certificates, and professional _Monitoring_ companies who observe the certificates and
report anomalies.
IPng Networks runs its logs under the domain `ct.ipng.ch`, split into `*.log.ct.ipng.ch` for the
write-path, and `*.mon.ct.ipng.ch` for the read-path.
We are submitting our log for inclusion in the approved log lists for Google Chrome and Apple
Safari. Following 90 days of successful monitoring, we anticipate our log will be added to these
trusted lists, and that change will propagate to people's browsers with subsequent browser version
releases.
We operate two popular implementations of Static Certificate Transparency software.
## Sunlight
{{< image width="10em" float="right" src="/assets/ctlog/sunlight-logo.png" alt="sunlight logo" >}}
[[Sunlight](https://sunlight.dev)] was designed by Filippo Valsorda for the needs of the WebPKI
community, through the feedback of many of its members, and in particular of the Sigsum, Google
TrustFabric, and ISRG teams. It is partially based on the Go Checksum Database. Sunlight's
development was sponsored by Let's Encrypt.
Our Sunlight logs:
* A staging log called [[Rennet](https://rennet2025h2.log.ct.ipng.ch/)], incepted 2025-07-28,
starting from temporal shard `rennet2025h2`.
* A production log called [[Gouda](https://gouda2025h2.log.ct.ipng.ch/)], incepted 2025-07-30,
starting from temporal shard `gouda2025h2`.
## TesseraCT
{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="tesseract logo" >}}
[[TesseraCT](https://github.com/transparency-dev/tesseract)] is a Certificate Transparency (CT) log
implementation by the TrustFabric team at Google. It was built to allow log operators to run
production static-ct-api CT logs starting with temporal shards covering 2026 onwards, as the
successor to Trillian's CTFE.
Our TesseraCT logs:
* A staging log called [[Lipase](https://lipase2025h2.log.ct.ipng.ch/)], incepted 2025-08-22,
starting from temporal shard `lipase2025h2`.
* A production log called [[Halloumi](https://halloumi2025h2.log.ct.ipng.ch/)], incepted 2025-08-24,
starting from temporal shard `halloumi2025h2`.
* Log `halloumi2026h2` incorporated incorrect data into its Merkle Tree at entries 4357956 and
4552365, due to a [[TesseraCT bug](https://github.com/transparency-dev/tesseract/issues/553)],
and was retired on 2025-09-08, to be replaced by temporal shard `halloumi2026h2a`.
## Operational Details
You can read more details about our infrastructure on:
* **[[TesseraCT]({{< ref 2025-07-26-ctlog-1 >}})]** - published on 2025-07-26.
* **[[Sunlight]({{< ref 2025-08-10-ctlog-2 >}})]** - published on 2025-08-10.
* **[[Operations]({{< ref 2025-08-24-ctlog-3 >}})]** - published on 2025-08-24.
The operators of this infrastructure are **Antonis Chariton**, **Jeroen Massar** and **Pim van Pelt**. \
You can reach us via e-mail at [[<ct-ops@ipng.ch>](mailto:ct-ops@ipng.ch)].

BIN static/assets/containerlab/learn-vpp.png (Stored with Git LFS) Normal file

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long


BIN static/assets/ctlog/btop-sunlight.png (Stored with Git LFS) Normal file
BIN static/assets/ctlog/ctlog-loadtest1.png (Stored with Git LFS) Normal file
BIN static/assets/ctlog/ctlog-loadtest2.png (Stored with Git LFS) Normal file
BIN static/assets/ctlog/ctlog-loadtest3.png (Stored with Git LFS) Normal file
BIN static/assets/ctlog/ctlog-logo-ipng.png (Stored with Git LFS) Normal file
BIN static/assets/ctlog/lipase.png (Stored with Git LFS) Normal file


@@ -0,0 +1,164 @@
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4M
Loop 1: PUT time 60.0 secs, objects = 813, speed = 54.2MB/sec, 13.5 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 23168, speed = 1.5GB/sec, 386.1 operations/sec. Slowdowns = 0
Loop 1: DELETE time 2.2 secs, 371.2 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
2025/07/20 16:07:25 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FACEBAC4D052, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 1221, speed = 20.3MB/sec, 20.3 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 31000, speed = 516.7MB/sec, 516.7 operations/sec. Slowdowns = 0
Loop 1: DELETE time 3.2 secs, 376.5 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
2025/07/20 16:09:29 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FAEB70060604, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 3353, speed = 447KB/sec, 55.9 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 45913, speed = 6MB/sec, 765.2 operations/sec. Slowdowns = 0
Loop 1: DELETE time 9.3 secs, 361.6 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4k
2025/07/20 16:11:38 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FB098B162788, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 3404, speed = 226.9KB/sec, 56.7 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 45230, speed = 2.9MB/sec, 753.8 operations/sec. Slowdowns = 0
Loop 1: DELETE time 9.4 secs, 362.6 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
2025/07/20 16:13:47 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FB27AE890E75, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.1 secs, objects = 1898, speed = 126.4MB/sec, 31.6 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 185034, speed = 12GB/sec, 3083.9 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.4 secs, 4267.8 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
2025/07/20 16:15:48 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FB43C0386015, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.2 secs, objects = 2627, speed = 43.7MB/sec, 43.7 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 327959, speed = 5.3GB/sec, 5465.9 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.6 secs, 4045.6 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
2025/07/20 16:17:49 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FB5FE2012590, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 6663, speed = 887.7KB/sec, 111.0 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 459962, speed = 59.9MB/sec, 7666.0 operations/sec. Slowdowns = 0
Loop 1: DELETE time 1.7 secs, 3890.9 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
2025/07/20 16:19:50 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FB7C3CF0FFCA, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.1 secs, objects = 6673, speed = 444.4KB/sec, 111.1 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 444637, speed = 28.9MB/sec, 7410.5 operations/sec. Slowdowns = 0
Loop 1: DELETE time 1.5 secs, 4411.8 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
2025/07/20 16:21:52 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FB988DB60881, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.2 secs, objects = 3093, speed = 205.5MB/sec, 51.4 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 168750, speed = 11GB/sec, 2811.4 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.3 secs, 9112.2 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=1M
2025/07/20 16:23:53 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FBB4A1E534DE, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.2 secs, objects = 4652, speed = 77.2MB/sec, 77.2 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 351187, speed = 5.7GB/sec, 5852.8 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.6 secs, 8141.6 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=8k
2025/07/20 16:25:54 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FBD0C4764C64, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.1 secs, objects = 14497, speed = 1.9MB/sec, 241.4 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 457437, speed = 59.6MB/sec, 7623.7 operations/sec. Slowdowns = 0
Loop 1: DELETE time 1.7 secs, 8353.6 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
2025/07/20 16:27:55 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FBED210B0792, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.1 secs, objects = 14459, speed = 962.6KB/sec, 240.7 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 466680, speed = 30.4MB/sec, 7777.7 operations/sec. Slowdowns = 0
Loop 1: DELETE time 1.7 secs, 8605.3 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4M
Loop 1: PUT time 60.0 secs, objects = 1866, speed = 124.4MB/sec, 31.1 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 16400, speed = 1.1GB/sec, 273.3 operations/sec. Slowdowns = 0
Loop 1: DELETE time 5.1 secs, 369.3 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
2025/07/20 16:32:02 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FC25AE815718, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 5459, speed = 91MB/sec, 91.0 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 25090, speed = 418.2MB/sec, 418.2 operations/sec. Slowdowns = 0
Loop 1: DELETE time 14.8 secs, 369.8 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
2025/07/20 16:34:17 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FC4514A78873, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 22278, speed = 2.9MB/sec, 371.3 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 40626, speed = 5.3MB/sec, 677.1 operations/sec. Slowdowns = 0
Loop 1: DELETE time 61.6 secs, 361.8 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4k
2025/07/20 16:37:19 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FC6F629ACFAC, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 23394, speed = 1.5MB/sec, 389.9 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 39249, speed = 2.6MB/sec, 654.1 operations/sec. Slowdowns = 0
Loop 1: DELETE time 64.5 secs, 363.0 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
2025/07/20 16:40:23 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FC9A5D101971, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 10564, speed = 704.1MB/sec, 176.0 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 20682, speed = 1.3GB/sec, 344.6 operations/sec. Slowdowns = 0
Loop 1: DELETE time 2.5 secs, 4178.8 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
2025/07/20 16:42:26 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FCB6EB0A45D9, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 26550, speed = 442.4MB/sec, 442.4 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 124810, speed = 2GB/sec, 2080.1 operations/sec. Slowdowns = 0
Loop 1: DELETE time 6.6 secs, 4049.2 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
2025/07/20 16:44:32 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FCD4684A110E, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 129363, speed = 16.8MB/sec, 2155.9 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 423956, speed = 55.2MB/sec, 7065.8 operations/sec. Slowdowns = 0
Loop 1: DELETE time 32.4 secs, 3992.0 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
2025/07/20 16:47:05 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FCF7EA4857CF, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 123067, speed = 8MB/sec, 2051.0 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 357694, speed = 23.3MB/sec, 5961.4 operations/sec. Slowdowns = 0
Loop 1: DELETE time 30.9 secs, 3986.0 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
2025/07/20 16:49:36 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FD1B12EFDEBC, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.1 secs, objects = 13131, speed = 873.3MB/sec, 218.3 operations/sec. Slowdowns = 0
Loop 1: GET time 60.1 secs, objects = 18630, speed = 1.2GB/sec, 310.2 operations/sec. Slowdowns = 0
Loop 1: DELETE time 1.7 secs, 7787.5 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=1M
2025/07/20 16:51:38 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FD3779E97644, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.1 secs, objects = 40226, speed = 669.8MB/sec, 669.8 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 85692, speed = 1.4GB/sec, 1427.8 operations/sec. Slowdowns = 0
Loop 1: DELETE time 4.7 secs, 8610.2 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=8k
2025/07/20 16:53:42 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FD5489FB2F1F, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 230985, speed = 30.1MB/sec, 3849.3 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 435703, speed = 56.7MB/sec, 7261.1 operations/sec. Slowdowns = 0
Loop 1: DELETE time 25.8 secs, 8945.8 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
2025/07/20 16:56:08 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
status code: 409, request id: 1853FD7683B9BB96, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
Loop 1: PUT time 60.0 secs, objects = 228647, speed = 14.9MB/sec, 3810.4 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 452412, speed = 29.5MB/sec, 7539.9 operations/sec. Slowdowns = 0
Loop 1: DELETE time 27.2 secs, 8418.0 deletes/sec. Slowdowns = 0

BIN static/assets/ctlog/minio_8kb_performance.png (Stored with Git LFS) Normal file
BIN static/assets/ctlog/nsa_slide.jpg (Stored with Git LFS) Normal file


@@ -0,0 +1,80 @@
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
Loop 1: PUT time 60.0 secs, objects = 1994, speed = 33.2MB/sec, 33.2 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 29243, speed = 487.4MB/sec, 487.4 operations/sec. Slowdowns = 0
Loop 1: DELETE time 2.8 secs, 701.4 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
Loop 1: PUT time 60.0 secs, objects = 13634, speed = 1.8MB/sec, 227.2 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 32284, speed = 4.2MB/sec, 538.1 operations/sec. Slowdowns = 0
Loop 1: DELETE time 18.7 secs, 727.8 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
Loop 1: PUT time 62.0 secs, objects = 23733, speed = 382.8MB/sec, 382.8 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 132708, speed = 2.2GB/sec, 2211.7 operations/sec. Slowdowns = 0
Loop 1: DELETE time 3.7 secs, 6490.1 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
Loop 1: PUT time 60.0 secs, objects = 199925, speed = 26MB/sec, 3331.9 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 309937, speed = 40.4MB/sec, 5165.3 operations/sec. Slowdowns = 0
Loop 1: DELETE time 31.2 secs, 6406.0 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
Loop 1: PUT time 60.0 secs, objects = 1975, speed = 32.9MB/sec, 32.9 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 29898, speed = 498.3MB/sec, 498.3 operations/sec. Slowdowns = 0
Loop 1: DELETE time 2.7 secs, 726.6 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
Loop 1: PUT time 60.0 secs, objects = 13662, speed = 1.8MB/sec, 227.7 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 31865, speed = 4.1MB/sec, 531.1 operations/sec. Slowdowns = 0
Loop 1: DELETE time 18.8 secs, 726.9 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
Loop 1: PUT time 60.0 secs, objects = 26622, speed = 443.6MB/sec, 443.6 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 117688, speed = 1.9GB/sec, 1961.3 operations/sec. Slowdowns = 0
Loop 1: DELETE time 4.1 secs, 6499.5 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
Loop 1: PUT time 60.0 secs, objects = 198238, speed = 25.8MB/sec, 3303.9 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 312868, speed = 40.7MB/sec, 5214.3 operations/sec. Slowdowns = 0
Loop 1: DELETE time 30.8 secs, 6432.7 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
Loop 1: PUT time 60.1 secs, objects = 6220, speed = 414.2MB/sec, 103.6 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 38773, speed = 2.5GB/sec, 646.1 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.9 secs, 6693.3 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 203033, speed = 13.2MB/sec, 3383.8 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 300824, speed = 19.6MB/sec, 5013.6 operations/sec. Slowdowns = 0
Loop 1: DELETE time 31.1 secs, 6528.6 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
Loop 1: PUT time 60.3 secs, objects = 13181, speed = 874.2MB/sec, 218.6 operations/sec. Slowdowns = 0
Loop 1: GET time 60.1 secs, objects = 18575, speed = 1.2GB/sec, 309.3 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.8 secs, 17547.2 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 495006, speed = 32.2MB/sec, 8249.5 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 465947, speed = 30.3MB/sec, 7765.4 operations/sec. Slowdowns = 0
Loop 1: DELETE time 41.4 secs, 11961.3 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
Loop 1: PUT time 60.1 secs, objects = 7073, speed = 471MB/sec, 117.8 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 31248, speed = 2GB/sec, 520.7 operations/sec. Slowdowns = 0
Loop 1: DELETE time 1.1 secs, 6576.1 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 214387, speed = 14MB/sec, 3573.0 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 297586, speed = 19.4MB/sec, 4959.7 operations/sec. Slowdowns = 0
Loop 1: DELETE time 32.9 secs, 6519.8 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
Loop 1: PUT time 60.1 secs, objects = 14365, speed = 956MB/sec, 239.0 operations/sec. Slowdowns = 0
Loop 1: GET time 60.1 secs, objects = 18113, speed = 1.2GB/sec, 301.6 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.8 secs, 18655.8 deletes/sec. Slowdowns = 0
Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 489736, speed = 31.9MB/sec, 8161.8 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 460296, speed = 30MB/sec, 7671.2 operations/sec. Slowdowns = 0
Loop 1: DELETE time 41.0 secs, 11957.6 deletes/sec. Slowdowns = 0


@@ -0,0 +1,116 @@
# Test Setup for SeaweedFS with 6 disks, a Filer and an S3 API
#
# Use with the following .env file
# root@minio-ssd:~# cat /opt/seaweedfs/.env
# AWS_ACCESS_KEY_ID="hottentotten"
# AWS_SECRET_ACCESS_KEY="tentententoonstelling"
services:
# Master
master0:
image: chrislusf/seaweedfs
ports:
- 9333:9333
- 19333:19333
command: "-v=1 master -volumeSizeLimitMB 100 -resumeState=false -ip=master0 -ip.bind=0.0.0.0 -port=9333 -mdir=/var/lib/seaweedfs/master"
volumes:
- ./data/master0:/var/lib/seaweedfs/master
restart: unless-stopped
# Volume Server 1
volume1:
image: chrislusf/seaweedfs
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8081 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume1'
volumes:
- /data/disk1:/var/lib/seaweedfs/volume1
depends_on:
- master0
restart: unless-stopped
# Volume Server 2
volume2:
image: chrislusf/seaweedfs
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8082 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume2'
volumes:
- /data/disk2:/var/lib/seaweedfs/volume2
depends_on:
- master0
restart: unless-stopped
# Volume Server 3
volume3:
image: chrislusf/seaweedfs
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8083 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume3'
volumes:
- /data/disk3:/var/lib/seaweedfs/volume3
depends_on:
- master0
restart: unless-stopped
# Volume Server 4
volume4:
image: chrislusf/seaweedfs
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8084 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume4'
volumes:
- /data/disk4:/var/lib/seaweedfs/volume4
depends_on:
- master0
restart: unless-stopped
# Volume Server 5
volume5:
image: chrislusf/seaweedfs
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8085 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume5'
volumes:
- /data/disk5:/var/lib/seaweedfs/volume5
depends_on:
- master0
restart: unless-stopped
# Volume Server 6
volume6:
image: chrislusf/seaweedfs
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8086 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume6'
volumes:
- /data/disk6:/var/lib/seaweedfs/volume6
depends_on:
- master0
restart: unless-stopped
# Filer
filer:
image: chrislusf/seaweedfs
ports:
- 8888:8888
- 18888:18888
command: 'filer -defaultReplicaPlacement=002 -iam -master="master0:9333"'
volumes:
- ./data/filer:/data
depends_on:
- master0
- volume1
- volume2
- volume3
- volume4
- volume5
- volume6
restart: unless-stopped
# S3 API
s3:
image: chrislusf/seaweedfs
ports:
- 8333:8333
command: 's3 -filer="filer:8888" -ip.bind=0.0.0.0'
env_file:
- .env
depends_on:
- master0
- volume1
- volume2
- volume3
- volume4
- volume5
- volume6
- filer
restart: unless-stopped

BIN static/assets/ctlog/size_comparison_8t.png (Stored with Git LFS) Normal file
BIN static/assets/ctlog/stop-hammer-time.jpg (Stored with Git LFS) Normal file
BIN static/assets/ctlog/sunlight-logo.png (Stored with Git LFS) Normal file
BIN static/assets/ctlog/sunlight-test-s3.png (Stored with Git LFS) Normal file
BIN static/assets/ctlog/tesseract-logo.png (Stored with Git LFS) Normal file
BIN static/assets/minio/console-1.png (Stored with Git LFS) Normal file
BIN static/assets/minio/console-2.png (Stored with Git LFS) Normal file
BIN static/assets/minio/disks.png (Stored with Git LFS) Normal file

File diff suppressed because it is too large


BIN static/assets/minio/minio-logo.png (Stored with Git LFS) Normal file
BIN static/assets/minio/nagios.png (Stored with Git LFS) Normal file
BIN static/assets/minio/nginx-logo.png (Stored with Git LFS) Normal file
BIN static/assets/minio/rack-2.png (Stored with Git LFS) Normal file
BIN static/assets/minio/rack.png (Stored with Git LFS) Normal file
BIN static/assets/minio/restic-logo.png (Stored with Git LFS) Normal file


@@ -0,0 +1 @@
<span style="color: {{ .Get "color" }}; font-weight: bold;">{{ .Inner }}</span>