--- date: "2025-07-26T22:07:23Z" title: 'Certificate Transparency - Part 1' --- {{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}} # Introduction There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)], as the name suggests it was a form of _digital notary_, and they were in the business of issuing security certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for man-in-the-middle attacks on Iranian Gmail users. Not cool. Google launched a project called **Certificate Transparency**, because it was becoming more common that the root of trust given to _Certificate Authorities_ could no longer be unilateraly trusted. These attacks showed that the lack of transparency in the way CAs operated was a significant risk to the Web Public Key Infrastructure. It led to the creation of this ambitious [[project](https://certificate.transparency.dev/)] to improve security online by bringing accountability to the system that protects our online services with _SSL_ (Secure Socket Layer) and _TLS_ (Transport Layer Security). In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It describes an experimental protocol for publicly logging the existence of Transport Layer Security (TLS) certificates as they are issued or observed, in a manner that allows anyone to audit certificate authority (CA) activity and notice the issuance of suspect certificates as well as to audit the certificate logs themselves. The intent is that eventually clients would refuse to honor certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to the logs. This series explores and documents how IPng Networks will be running two Static CT _Logs_ with two different implementations. One will be [[Sunlight](https://sunlight.dev/)], and the other will be [[TesseraCT](https://github.com/transparency-dev/tesseract)]. ## Static Certificate Transparency In this context, _Logs_ are network services that implement the protocol operations for submissions and queries that are defined in this RFC. A few years ago, my buddy Antonis asked me if I would be willing to run a log, but operationally they were very complex and expensive to run. However, over the years, the concept of _Static Logs_ put running on in reach. The [[Static CT API](https://github.com/C2SP/C2SP/blob/main/static-ct-api.md)] defines a read-path HTTP static asset hierarchy (for monitoring) to be implemented alongside the write-path RFC 6962 endpoints (for submission). Aside from the different read endpoints, a log that implements the Static API is a regular CT log that can work alongside RFC 6962 logs and that fulfills the same purpose. In particular, it requires no modification to submitters and TLS clients. If you only read one document about Static CT, read Filippo Valsorda's excellent [[paper](https://filippo.io/a-different-CT-log)]. It describes a radically cheaper and easier to operate [[Certificate Transparency](https://certificate.transparency.dev/)] log that is backed by a consistent object storage, and can scale to 30x the current issuance rate for 2-10% of the costs with no merge delay. 
## Scalable, Cheap, Reliable: choose two

{{< image width="18em" float="right" src="/assets/ctlog/MPLS Backbone - CTLog.svg" alt="ctlog at ipng" >}}

In the diagram, I've drawn an overview of IPng's network. In {{< boldcolor color="red" >}}red{{< /boldcolor >}}, a European backbone network is provided by a [[BGP Free Core network](2022-12-09-oem-switch-2.md)]. It operates a private IPv4, IPv6 and MPLS network, called _IPng Site Local_, which is not connected to the internet. On top of that, IPng offers L2 and L3 services, for example using [[VPP]({{< ref 2021-02-27-network >}})].

In {{< boldcolor color="lightgreen" >}}green{{< /boldcolor >}}, I built a cluster of replicated NGINX frontends. They connect into _IPng Site Local_ and can reach all hypervisors, VMs, and storage systems. They also connect to the Internet with a single IPv4 and IPv6 address. One might say that SSL is _added and removed here :-)_ [[ref](/assets/ctlog/nsa_slide.jpg)].

Then in {{< boldcolor color="orange" >}}orange{{< /boldcolor >}}, I built a set of [[Minio]({{< ref 2025-05-28-minio-1 >}})] S3 storage pools. Amongst other things, I serve the static content of the IPng website from these pools, providing fancy redundancy and caching. I wrote about their design in [[this article]({{< ref 2025-06-01-minio-2 >}})].

Finally, I turn my attention to the {{< boldcolor color="blue" >}}blue{{< /boldcolor >}} part: two hypervisors, one run by [[IPng](https://ipng.ch/)] and the other by [[Massar](https://massars.net/)]. Each of them will be running one of the _Log_ implementations. IPng provides two large ZFS storage tanks for offsite backup, in case a hypervisor decides to check out, and daily backups to an S3 bucket using Restic.

Having explained all of this, I am well aware that end to end reliability will come from the fact that there are many independent _Log_ operators, and folks wanting to validate certificates can simply monitor many. If there is a gap in coverage, say due to any given _Log_'s downtime, this will not necessarily be problematic. It does mean that I may have to suppress the SRE in me...

## Minio

My first instinct is to leverage the distributed storage IPng has, but as I'll show in the rest of this article, maybe a simpler, more elegant design could be superior, precisely because individual log reliability is not _as important_ as having many available log _instances_ to choose from.

From operators in the field I understand that the world-wide issuance of certificates is roughly 17M/day, which amounts to some 200-250 qps of writes. My first thought is to see how fast my open source S3 machines can go, really. I'm also curious about the difference between SSD and spinning disks. I boot two Dell R630s in the Lab. These machines have two Xeon E5-2640 v4 CPUs for a total of 20 cores and 40 threads, and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I place 6pcs 1.2TB SAS3 disks (HPE part number EG1200JEHMC), and in the second machine I place 6pcs of 1.92TB enterprise SSDs (Samsung part number P1633N19). I spin up a 6-device Minio cluster on both and take them for a spin using [[S3 Benchmark](https://github.com/wasabi-tech/s3-benchmark.git)] from Wasabi Tech.

```
pim@ctlog-test:~/src/s3-benchmark$ for dev in disk ssd; do \
    for t in 1 8 32; do \
      for z in 4M 1M 8k 4k; do \
        ./s3-benchmark -a $KEY -s $SECRET -u http://minio-$dev:9000 -t $t -z $z \
          | tee -a minio-results.txt; \
      done; \
    done; \
  done
```

The loadtest above does a bunch of runs with varying parameters. First it tries to read and write object sizes of 4MB, 1MB, 8kB and 4kB respectively. Then it tries to do this with either 1 thread, 8 threads or 32 threads. Finally, it tests both the disk-based variant as well as the SSD based one. The loadtest runs from a third machine, so that the Dell R630 disk tanks can stay completely dedicated to their task of running Minio.
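For readers who would rather poke at an S3 endpoint from code than via the `s3-benchmark` binary, here is a minimal Go sketch of the same idea: a handful of goroutines PUT fixed-size objects and we count the rate. It assumes the `minio-go` v7 client, placeholder credentials, and a pre-created bucket called `loadtest`; it is not the tool I used for the graphs below, just an illustration of the shape of such a probe.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Endpoint, credentials and bucket name are placeholders for the lab setup.
	client, err := minio.New("minio-ssd:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("KEY", "SECRET", ""),
		Secure: false,
	})
	if err != nil {
		panic(err)
	}

	const (
		threads  = 8
		objSize  = 8 * 1024
		duration = 30 * time.Second
	)
	payload := bytes.Repeat([]byte("x"), objSize)
	var puts atomic.Int64

	ctx, cancel := context.WithTimeout(context.Background(), duration)
	defer cancel()

	var wg sync.WaitGroup
	for t := 0; t < threads; t++ {
		wg.Add(1)
		go func(t int) {
			defer wg.Done()
			// Keep writing 8kB objects until the context deadline fires.
			for i := 0; ctx.Err() == nil; i++ {
				name := fmt.Sprintf("bench/%d-%d", t, i)
				_, err := client.PutObject(ctx, "loadtest", name,
					bytes.NewReader(payload), int64(len(payload)), minio.PutObjectOptions{})
				if err != nil {
					return
				}
				puts.Add(1)
			}
		}(t)
	}
	wg.Wait()
	fmt.Printf("%d PUTs in %s (%.0f PUT/s)\n", puts.Load(), duration,
		float64(puts.Load())/duration.Seconds())
}
```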
{{< image width="100%" src="/assets/ctlog/minio_8kb_performance.png" alt="MinIO 8kb disk vs SSD" >}}

The left-hand side graph feels pretty natural to me. With one thread, uploading 8kB objects will quickly hit the IOPS rate of the disks, each of which has to participate in the write due to EC:3 encoding when using six disks, and it tops out at ~56 PUT/s. The single thread hitting SSDs will not hit that limit, and manages ~371 PUT/s, which I found a bit underwhelming. But when performing the loadtest with either 8 or 32 write threads, the hard disks become only marginally faster (topping out at 240 PUT/s), while the SSDs really start to shine, with 3850 PUT/s. Pretty good performance.

On the read side, I am pleasantly surprised that there's not really that much of a difference between disks and SSDs. This is likely because the host filesystem cache is playing a large role, so the 1-thread performance is equivalent (765 GET/s for disks, 677 GET/s for SSDs), and the 32-thread performance is also equivalent (7624 GET/s for disks, 7261 GET/s for SSDs). I do wonder why the hard disks consistently outperform the SSDs with all the other variables (OS, MinIO version, hardware) the same.

## Sidequest: SeaweedFS

Something that has long caught my attention is the way in which [[SeaweedFS](https://github.com/seaweedfs/seaweedfs)] approaches blob storage. Many operators have great success with many small file writes in SeaweedFS compared to MinIO and even AWS S3 storage. This is because writes with SeaweedFS are not broken into erasure-sets, which would require every disk to write a small part or checksum of the data; rather, files are replicated within the cluster in their entirety on different disks, racks or datacenters. I won't bore you with the details of SeaweedFS, but I'll tack on a docker [[compose file](/assets/ctlog/seaweedfs.docker-compose.yml)] that I used at the end of this article, if you're curious.

{{< image width="100%" src="/assets/ctlog/size_comparison_8t.png" alt="MinIO vs SeaWeedFS" >}}

In the write-path, SeaweedFS dominates in all cases, due to its different way of achieving durable storage (per-file replication in SeaweedFS versus all-disk erasure-sets in MinIO):

* 4k: 3,384 ops/sec vs MinIO's 111 ops/sec (30x faster!)
* 8k: 3,332 ops/sec vs MinIO's 111 ops/sec (30x faster!)
* 1M: 383 ops/sec vs MinIO's 44 ops/sec (9x faster)
* 4M: 104 ops/sec vs MinIO's 32 ops/sec (4x faster)

For the read-path, in GET operations MinIO is better at small objects, and really dominates at large objects:

* 4k: 7,411 ops/sec vs SeaweedFS 5,014 ops/sec
* 8k: 7,666 ops/sec vs SeaweedFS 5,165 ops/sec
* 1M: 5,466 ops/sec vs SeaweedFS 2,212 ops/sec
* 4M: 3,084 ops/sec vs SeaweedFS 646 ops/sec

This makes me draw an interesting conclusion: seeing as CT Logs are read/write heavy (every couple of seconds, the Merkle tree is recomputed, which is reasonably disk-intensive), SeaweedFS might be a slightly better choice. IPng Networks has three Minio deployments, but no SeaweedFS deployments. Yet.
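That parenthetical about the Merkle tree deserves a tiny illustration. The root hash that ends up in a log's `checkpoint` is the RFC 6962 Merkle Tree Hash over all entries: leaves are hashed with a `0x00` prefix, interior nodes with a `0x01` prefix, splitting at the largest power of two smaller than `n`. The sketch below computes it recursively over a few placeholder byte strings, purely to show the hashing rules; the entries are made up, not real Static CT data tiles.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// leafHash computes the RFC 6962 leaf hash: SHA-256(0x00 || entry).
func leafHash(entry []byte) [32]byte {
	return sha256.Sum256(append([]byte{0x00}, entry...))
}

// nodeHash computes the RFC 6962 interior node hash: SHA-256(0x01 || left || right).
func nodeHash(left, right [32]byte) [32]byte {
	b := append([]byte{0x01}, left[:]...)
	b = append(b, right[:]...)
	return sha256.Sum256(b)
}

// merkleTreeHash computes MTH(D[n]) over the given leaf hashes, splitting at the
// largest power of two smaller than n, as described in RFC 6962 section 2.1.
func merkleTreeHash(leaves [][32]byte) [32]byte {
	n := len(leaves)
	if n == 0 {
		return sha256.Sum256(nil)
	}
	if n == 1 {
		return leaves[0]
	}
	k := 1
	for k*2 < n {
		k *= 2
	}
	return nodeHash(merkleTreeHash(leaves[:k]), merkleTreeHash(leaves[k:]))
}

func main() {
	entries := [][]byte{[]byte("cert-1"), []byte("cert-2"), []byte("cert-3")}
	hashes := make([][32]byte, len(entries))
	for i, e := range entries {
		hashes[i] = leafHash(e)
	}
	fmt.Printf("tree size %d, root %x\n", len(entries), merkleTreeHash(hashes))
}
```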
# Tessera

Tessera is a Go library for building tile-based transparency logs (tlogs) [[ref](https://github.com/C2SP/C2SP/blob/main/tlog-tiles.md)]. It is the logical successor to the approach that Google took when building and operating _Logs_ using its predecessor called [[Trillian](https://github.com/google/trillian)]. The implementation and its APIs bake in current best practices based on the lessons learned over the past decade of building and operating transparency logs in production environments and at scale.

Tessera was introduced at the Transparency.Dev summit in October 2024. I first watched Al and Martin [[introduce](https://www.youtube.com/watch?v=9j_8FbQ9qSc)] it at last year's summit. At a high level, it wraps what used to be a whole Kubernetes cluster full of components into a single library that can be used with Cloud based services, like AWS S3 and an RDS database, or GCP's GCS storage and a Spanner database. However, Google also made it easy to use a regular POSIX filesystem implementation.

## TesseraCT

{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="tesseract logo" >}}

While Tessera is a library, a CT log implementation comes from its sibling GitHub repository called [[TesseraCT](https://github.com/transparency-dev/tesseract)]. Because it leverages Tessera under the hood, TesseraCT can run on GCP, AWS, POSIX-compliant filesystems, or on S3-compatible systems alongside a MySQL database.

In order to provide ecosystem agility and to control the growth of CT Log sizes, new CT Logs must be temporally sharded, defining a certificate expiry range denoted in the form of two dates: `[rangeBegin, rangeEnd)`. The certificate expiry range allows a Log to reject otherwise valid logging submissions for certificates that expire before or after this defined range, thus partitioning the set of publicly-trusted certificates that each Log will accept.

I will be expected to keep logs for an extended period of time, say 3-5 years. It's time for me to figure out what this TesseraCT thing can do .. are you ready? Let's go!
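Before diving in, a quick illustration of that temporal sharding rule. A sharded log accepts a submission only if the leaf certificate's `NotAfter` date falls inside the half-open range `[rangeBegin, rangeEnd)`. The sketch below shows that check in isolation; the shard dates and the PEM path are made up for illustration, and TesseraCT's actual validation lives in the personality code rather than in anything this simple.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

// acceptedByShard reports whether a certificate's expiry (NotAfter) falls in
// the temporal shard's half-open range [rangeBegin, rangeEnd).
func acceptedByShard(cert *x509.Certificate, rangeBegin, rangeEnd time.Time) bool {
	return !cert.NotAfter.Before(rangeBegin) && cert.NotAfter.Before(rangeEnd)
}

func main() {
	// Hypothetical 2026 shard; adjust to your own log's range.
	begin := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	end := time.Date(2027, 1, 1, 0, 0, 0, 0, time.UTC)

	pemBytes, err := os.ReadFile("/tmp/leaf.pem")
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		fmt.Println("no PEM block found")
		return
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		fmt.Println("parse:", err)
		return
	}
	fmt.Printf("NotAfter=%s accepted=%v\n", cert.NotAfter, acceptedByShard(cert, begin, end))
}
```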
### TesseraCT: S3 and SQL

TesseraCT comes with a few so-called _personalities_. Those are an implementation of the underlying storage infrastructure in an opinionated way. The first personality I look at is the `aws` one in `cmd/tesseract/aws`. I notice that this personality does make hard assumptions about the use of AWS, which is unfortunate as the documentation says '.. or self-hosted S3 and MySQL database'. However, the `aws` personality assumes the AWS Secrets Manager in order to fetch its signing key. Before I can be successful, I need to detangle that.

#### TesseraCT: AWS and Local Signer

First, I change `cmd/tesseract/aws/main.go` to add two new flags:

* ***-signer_public_key_file***: a path to the public key for checkpoints and SCT signer
* ***-signer_private_key_file***: a path to the private key for checkpoints and SCT signer

I then change the program to assume that if these flags are both set, the user will want a _NewLocalSigner_ instead of a _NewSecretsManagerSigner_. Now all I have to do is implement the signer interface in a package `local_signer.go`. There, function _NewLocalSigner()_ will read the public and private PEM from file, decode them, and create an _ECDSAWithSHA256Signer_ with them. A simple example to show what I mean:

```
// NewLocalSigner creates a new signer that uses the ECDSA P-256 key pair from
// local disk files for signing digests.
func NewLocalSigner(publicKeyFile, privateKeyFile string) (*ECDSAWithSHA256Signer, error) {
	// Read the public key, error handling omitted for brevity.
	publicKeyPEM, _ := os.ReadFile(publicKeyFile)
	publicPemBlock, _ := pem.Decode(publicKeyPEM)
	publicKey, _ := x509.ParsePKIXPublicKey(publicPemBlock.Bytes)
	ecdsaPublicKey, ok := publicKey.(*ecdsa.PublicKey)
	if !ok {
		return nil, errors.New("public key is not an ECDSA key")
	}

	// Read the private key, error handling omitted for brevity.
	privateKeyPEM, _ := os.ReadFile(privateKeyFile)
	privatePemBlock, _ := pem.Decode(privateKeyPEM)
	ecdsaPrivateKey, _ := x509.ParseECPrivateKey(privatePemBlock.Bytes)

	// Verify the correctness of the signer key pair.
	if !ecdsaPrivateKey.PublicKey.Equal(ecdsaPublicKey) {
		return nil, errors.New("signer key pair doesn't match")
	}

	return &ECDSAWithSHA256Signer{
		publicKey:  ecdsaPublicKey,
		privateKey: ecdsaPrivateKey,
	}, nil
}
```

In the snippet above I omitted the error handling, but the local signer logic itself is hopefully clear. And with that, I am liberated from Amazon's Cloud offering and can run this thing all by myself!

#### TesseraCT: Running with S3, MySQL, and Local Signer

First, I need to create a suitable ECDSA key:

```
pim@ctlog-test:~$ openssl ecparam -name prime256v1 -genkey -noout -out /tmp/private_key.pem
pim@ctlog-test:~$ openssl ec -in /tmp/private_key.pem -pubout -out /tmp/public_key.pem
```

Then, I'll install the MySQL server and create the databases:

```
pim@ctlog-test:~$ sudo apt install default-mysql-server
pim@ctlog-test:~$ sudo mysql -u root
CREATE USER 'tesseract'@'localhost' IDENTIFIED BY '';
CREATE DATABASE tesseract;
CREATE DATABASE tesseract_antispam;
GRANT ALL PRIVILEGES ON tesseract.* TO 'tesseract'@'localhost';
GRANT ALL PRIVILEGES ON tesseract_antispam.* TO 'tesseract'@'localhost';
```

Finally, I use the SSD Minio lab-machine that I just loadtested to create an S3 bucket:

```
pim@ctlog-test:~$ mc mb minio-ssd/tesseract-test
pim@ctlog-test:~$ cat << EOF > /tmp/minio-access.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ],
      "Resource": [ "arn:aws:s3:::tesseract-test/*", "arn:aws:s3:::tesseract-test" ]
    }
  ]
}
EOF
pim@ctlog-test:~$ mc admin user add minio-ssd
pim@ctlog-test:~$ mc admin policy create minio-ssd tesseract-test-access /tmp/minio-access.json
pim@ctlog-test:~$ mc admin policy attach minio-ssd tesseract-test-access --user
pim@ctlog-test:~$ mc anonymous set public minio-ssd/tesseract-test
```

{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}

After some fiddling, I understand that the AWS software development kit makes some assumptions that you'll be using .. _quelle surprise_ .. AWS services. But you can also use local S3 services by setting a few key environment variables. I had heard of the S3 access and secret key environment variables before, but I now need to also use a different S3 endpoint. That little detour into the codebase only took me .. several hours.

Armed with that knowledge, I can build and finally start my TesseraCT instance:

```
pim@ctlog-test:~/src/tesseract/cmd/tesseract/aws$ go build -o ~/aws .
pim@ctlog-test:~$ export AWS_DEFAULT_REGION="us-east-1"
pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID=""
pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY=""
pim@ctlog-test:~$ export AWS_ENDPOINT_URL_S3="http://minio-ssd.lab.ipng.ch:9000/"
pim@ctlog-test:~$ ./aws --http_endpoint='[::]:6962' \
    --origin=ctlog-test.lab.ipng.ch/test-ecdsa \
    --bucket=tesseract-test \
    --db_host=ctlog-test.lab.ipng.ch \
    --db_user=tesseract \
    --db_password= \
    --db_name=tesseract \
    --antispam_db_name=tesseract_antispam \
    --signer_public_key_file=/tmp/public_key.pem \
    --signer_private_key_file=/tmp/private_key.pem \
    --roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
I0727 15:13:04.666056  337461 main.go:128] **** CT HTTP Server Starting ****
```

Hah! I think most of the command line flags and environment variables should make sense, but I was struggling for a while with the `--roots_pem_file` and the `--origin` flags, so I phoned a friend (Al Cutter, Googler extraordinaire and an expert in Tessera/CT). He explained to me that the Log is actually an open endpoint to which anybody might POST data. However, to avoid folks abusing the log infrastructure, each POST is expected to come from one of the certificate authorities listed in the `--roots_pem_file`. OK, that makes sense.

Then, the `--origin` flag designates how my log calls itself. In the resulting `checkpoint` file it will enumerate a hash of the latest merged and published Merkle tree. In case a server serves multiple logs, the `--origin` line is used to distinguish which checkpoint belongs to which log.

```
pim@ctlog-test:~/src/tesseract$ curl http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint
ctlog-test.lab.ipng.ch/test-ecdsa
0
JGPitKWWI0aGuCfC2k1n/p9xdWAYPm5RZPNDXkCEVUU=

— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMCONUBAMARjBEAiA/nc9dig6U//vPg7SoTHjt9bxP5K+x3w4MYKpIRn4ULQIgUY5zijRK8qyuJGvZaItDEmP1gohCt+wI+sESBnhkuqo=
```

When creating the bucket above, I used `mc anonymous set public`, which made the S3 bucket world-readable. I can now execute the whole read-path simply by hitting the S3 service. Check.
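Since the checkpoint is just a static file, a monitor needs nothing more than an HTTP GET to follow the log. Here's a small Go sketch that fetches it and pulls out the origin line, the tree size and the root hash; verifying the note signature on the lines below the blank line is left out of this sketch, and the URL is simply the lab bucket from above.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
	"strings"
)

// fetchCheckpoint grabs the checkpoint from a Static CT monitoring prefix and
// returns the origin line, the tree size and the base64 root hash.
func fetchCheckpoint(url string) (origin string, size uint64, rootHash string, err error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", 0, "", err
	}
	lines := strings.Split(string(body), "\n")
	if len(lines) < 3 {
		return "", 0, "", fmt.Errorf("short checkpoint: %q", body)
	}
	size, err = strconv.ParseUint(lines[1], 10, 64)
	if err != nil {
		return "", 0, "", fmt.Errorf("bad tree size %q: %v", lines[1], err)
	}
	return lines[0], size, lines[2], nil
}

func main() {
	// The lab bucket used earlier in this article; adjust to your own log.
	origin, size, root, err := fetchCheckpoint("http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("origin=%s size=%d root=%s\n", origin, size, root)
}
```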
#### TesseraCT: Loadtesting S3/MySQL

{{< image width="12em" float="right" src="/assets/ctlog/stop-hammer-time.jpg" alt="Stop, hammer time" >}}

The write path is a server on `[::]:6962`. I should be able to write a log to it, but how? Here's where I am grateful to find a tool in the TesseraCT GitHub repository called `hammer`. This hammer sets up read and write traffic to a Static CT API log to test correctness and performance under load. The traffic is sent according to the [[Static CT API](https://c2sp.org/static-ct-api)] spec. Slick!

The tool starts a text-based UI (my favorite! also when using the Cisco T-Rex loadtester) in the terminal that shows the current status and logs, and supports increasing/decreasing read and write traffic. This TUI allows for a level of interactivity when probing a new configuration of a log, in order to find any cliffs where performance degrades. For real load-testing applications, especially headless runs as part of a CI pipeline, it is recommended to run the tool with `-show_ui=false` in order to disable the UI.

I'm a bit lost in the somewhat terse [[README.md](https://github.com/transparency-dev/tesseract/tree/main/internal/hammer)], but my buddy Al comes to my rescue and explains the flags to me. First of all, the loadtester wants to hit the same `--origin` that I configured the write-path to accept. In my case this is `ctlog-test.lab.ipng.ch/test-ecdsa`. Then, it needs the public key for that _Log_, which I can find in `/tmp/public_key.pem`. The contents are the _DER_ (Distinguished Encoding Rules) encoding of the key, stored as a base64 encoded string. What follows next was the most difficult for me to understand, as I was thinking the hammer would read some log from the internet somewhere and replay it locally. Al explains that actually, the `hammer` tool synthetically creates all of these log entries itself, and it regularly reads the `checkpoint` from the `--log_url` place, while it writes its certificates to `--write_log_url`. The last few flags just inform the `hammer` how many read and write ops/sec it should generate, and with that explanation my brain plays _tadaa.wav_ and I am ready to go.

```
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer \
    --origin=ctlog-test.lab.ipng.ch/test-ecdsa \
    --log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEucHtDWe9GYNicPnuGWbEX8rJg/VnDcXs8z40KdoNidBKy6/ZXw2u+NW1XAUnGpXcZozxufsgOMhijsWb25r7jw== \
    --log_url=http://tesseract-test.minio-ssd.lab.ipng.ch:9000/ \
    --write_log_url=http://localhost:6962/ctlog-test.lab.ipng.ch/test-ecdsa/ \
    --max_read_ops=0 \
    --num_writers=5000 \
    --max_write_ops=100
```

{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest1.png" alt="S3/MySQL Loadtest 100qps" >}}

Cool! It seems that the loadtest is happily chugging along at 100qps. The log is consuming them in the HTTP write-path by accepting POST requests to `/ctlog-test.lab.ipng.ch/test-ecdsa/ct/v1/add-chain`, where hammer is offering them at a rate of 100qps, with a configured probability of duplicates set at 10%. What that means is that every now and again, it'll repeat a previous request. The purpose of this is to stress test the so-called `antispam` implementation. When `hammer` sends its requests, it signs them with a certificate that was issued by the CA described in `internal/hammer/testdata/test_root_ca_cert.pem`, which is why TesseraCT accepts them.
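For completeness, this is roughly what one of those `add-chain` submissions looks like on the wire: an RFC 6962 JSON body with the base64 DER certificates, leaf first, POSTed to the log's `ct/v1/add-chain` endpoint, which answers with a signed certificate timestamp (SCT). The chain file path here is hypothetical, and the chain must lead up to the test CA mentioned above or TesseraCT will reject it.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"encoding/pem"
	"fmt"
	"net/http"
	"os"
)

// addChainRequest is the RFC 6962 add-chain body: base64 DER certificates,
// leaf first, followed by its chain towards one of the accepted roots.
type addChainRequest struct {
	Chain []string `json:"chain"`
}

func main() {
	// Hypothetical PEM bundle containing the leaf and its intermediates.
	pemBytes, err := os.ReadFile("/tmp/chain.pem")
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	var req addChainRequest
	for block, rest := pem.Decode(pemBytes); block != nil; block, rest = pem.Decode(rest) {
		req.Chain = append(req.Chain, base64.StdEncoding.EncodeToString(block.Bytes))
	}
	body, _ := json.Marshal(req)

	url := "http://localhost:6962/ctlog-test.lab.ipng.ch/test-ecdsa/ct/v1/add-chain"
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("post:", err)
		return
	}
	defer resp.Body.Close()

	// The SCT response; only a couple of its fields are printed here.
	var sct struct {
		SCTVersion int    `json:"sct_version"`
		Timestamp  uint64 `json:"timestamp"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&sct); err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Printf("status=%s sct_version=%d timestamp=%d\n", resp.Status, sct.SCTVersion, sct.Timestamp)
}
```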
I raise the write load by using the '>' key a few times. I notice things are great at 500qps, which is nice because that's double the rate we expect. But I start seeing a bit more noise at 600qps. When I raise the write-rate to 1000qps, all hell breaks loose in the logs of the server (and similar logs appear in the `hammer` loadtester):

```
W0727 15:54:33.419881  348475 handlers.go:168] ctlog-test.lab.ipng.ch/test-ecdsa: AddChain handler error: couldn't store the leaf: failed to fetch entry bundle at index 0: failed to fetch resource: getObject: failed to create reader for object "tile/data/000" in bucket "tesseract-test": operation error S3: GetObject, context deadline exceeded
W0727 15:55:02.727962  348475 aws.go:345] GarbageCollect failed: failed to delete one or more objects: failed to delete objects: operation error S3: DeleteObjects, https response error StatusCode: 400, RequestID: 1856202CA3C4B83F, HostID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8, api error MalformedXML: The XML you provided was not well-formed or did not validate against our published schema.
E0727 15:55:10.448973  348475 append_lifecycle.go:293] followerStats: follower "AWS antispam" EntriesProcessed(): failed to read follow coordination info: Error 1040: Too many connections
```

I see on the Minio instance that it's doing about 150/s of GETs and 15/s of PUTs, which is totally reasonable:

```
pim@ctlog-test:~/src/tesseract$ mc admin trace --stats ssd
Duration: 6m9s ▰▱▱    RX Rate:↑ 34 MiB/m   TX Rate:↓ 2.3 GiB/m   RPM: 10588.1
-------------
Call                      Count           RPM     Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min
s3.GetObject              60558 (92.9%)   9837.2  4.3ms     708µs     48.1ms    3.9ms     47.8ms    ↑144B ↓246K  ↑1.4M ↓2.3G
s3.PutObject              2199 (3.4%)     357.2   5.3ms     2.4ms     32.7ms    5.3ms     32.7ms    ↑92K         ↑32M
s3.DeleteMultipleObjects  1212 (1.9%)     196.9   877µs     290µs     41.1ms    850µs     41.1ms    ↑230B ↓369B  ↑44K ↓71K
s3.ListObjectsV2          1212 (1.9%)     196.9   18.4ms    999µs     52.8ms    18.3ms    52.7ms    ↑131B ↓261B  ↑25K ↓50K
```

Another nice way to see what makes it through is this oneliner, which reads the `checkpoint` every second and, once it changes, shows the delta in seconds and how many certs were written:

```
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
    N=$(curl -sS http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E '^[0-9]+$'); \
    if [ "$N" -eq "$O" ]; then \
      echo -n .; \
    else \
      echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ; \
    fi; \
    T=$((T+1)); sleep 1; done
1012905 .... 5 seconds 2081 certs
1014986 .... 5 seconds 2126 certs
1017112 .... 5 seconds 1913 certs
1019025 .... 5 seconds 2588 certs
1021613 .... 5 seconds 2591 certs
1024204 .... 5 seconds 2197 certs
```

So I can see that the checkpoint is refreshed every 5 seconds and between 1913 and 2591 certs are written each time. And indeed, at 400/s there are no errors or warnings at all. At this write rate, TesseraCT is using about 2.9 CPUs/s, with MariaDB using 0.3 CPUs/s, but the hammer is using 6.0 CPUs/s. Overall, the machine is perfectly happy serving for a few hours under this load test.

***Conclusion: a write-rate of 400/s should be safe with S3+MySQL***

### TesseraCT: POSIX

I have been playing with this idea of having a reliable read-path by having the S3 cluster be redundant, or by replicating the S3 bucket. But Al asks: why not use our experimental POSIX? We discuss two very important benefits, but also two drawbacks:

* On the plus side:
   1. There is no need for S3 storage, reading and writing to a local ZFS raidz2 pool instead.
   1. There is no need for MySQL, as the POSIX implementation can use a local badger instance, also on the local filesystem.
* On the drawbacks:
   1. There is a SPOF in the read-path, as the single VM must serve both the read- and the write-path. The write-path always has a SPOF on the TesseraCT VM.
   1. Local storage is more expensive than S3 storage, and can be used only for the purposes of one application (and at best, shared with other VMs on the same hypervisor).

Come to think of it, this is maybe not such a bad tradeoff. I do kind of like having a single VM with a single binary and no other moving parts. It greatly simplifies the architecture, and for the read-path I can (and will) still use multiple upstream NGINX machines in IPng's network.

I consider myself nerd-sniped, and take a look at the POSIX variant. I have a few SAS3 solid state drives (NetAPP part number X447_S1633800AMD), which I plug into the `ctlog-test` machine.

```
pim@ctlog-test:~$ sudo zpool create -o ashift=12 -o autotrim=on ssd-vol0 mirror \
    /dev/disk/by-id/wwn-0x5002538a0???????
pim@ctlog-test:~$ sudo zfs create ssd-vol0/tesseract-test
pim@ctlog-test:~$ sudo chown pim:pim /ssd-vol0/tesseract-test

pim@ctlog-test:~/src/tesseract$ go run ./cmd/experimental/posix --http_endpoint='[::]:6962' \
    --origin=ctlog-test.lab.ipng.ch/test-ecdsa \
    --private_key=/tmp/private_key.pem \
    --storage_dir=/ssd-vol0/tesseract-test \
    --roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
badger 2025/07/27 16:29:15 INFO: All 0 tables opened in 0s
badger 2025/07/27 16:29:15 INFO: Discard stats nextEmptySlot: 0
badger 2025/07/27 16:29:15 INFO: Set nextTxnTs to 0
I0727 16:29:15.032845  363156 files.go:502] Initializing directory for POSIX log at "/ssd-vol0/tesseract-test" (this should only happen ONCE per log!)
I0727 16:29:15.034101  363156 main.go:97] **** CT HTTP Server Starting ****

pim@ctlog-test:~/src/tesseract$ cat /ssd-vol0/tesseract-test/checkpoint
ctlog-test.lab.ipng.ch/test-ecdsa
0
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=

— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMSgC8BAMARzBFAiBjT5zdkniKlryqlUlx/gLHOtVK26zuWwrc4BlyTVzCWgIhAJ0GIrlrP7YGzRaHjzdB5tnS5rpP3LeOsPbpLateaiFc
```

Alright, I can see the log started and created an empty checkpoint file. Nice! Before I can loadtest it, I will need to make the read-path visible. The `hammer` can read a checkpoint from local `file:///` prefixes, but I'll have to serve them over the network eventually anyway, so I create the following NGINX config for it:

```
server {
    listen 80 default_server backlog=4096;
    listen [::]:80 default_server backlog=4096;

    root /ssd-vol0/tesseract-test/;
    index index.html index.htm index.nginx-debian.html;
    server_name _;

    access_log /var/log/nginx/access.log combined buffer=512k flush=5s;

    location / {
        try_files $uri $uri/ =404;
        tcp_nopush on;
        sendfile on;
        tcp_nodelay on;
        keepalive_timeout 65;
        keepalive_requests 1000;
    }
}
```

Just a couple of small thoughts on this configuration. I'm using buffered access logs, to avoid excessive disk writes in the read-path. Then, I'm using kernel `sendfile()`, which will instruct the kernel to serve the static objects directly, so that NGINX can move on. Further, I'll allow for a long keepalive in HTTP/1.1, so that future requests can reuse the same TCP connection, and I'll set the `tcp_nodelay` and `tcp_nopush` flags to just blast the data out without waiting. Without much ado:

```
pim@ctlog-test:~/src/tesseract$ curl -sS ctlog-test.lab.ipng.ch/checkpoint
ctlog-test.lab.ipng.ch/test-ecdsa
0
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=

— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMTfksBAMASDBGAiEAqADLH0P/SRVloF6G1ezlWG3Exf+sTzPIY5u6VjAKLqACIQCkJO2N0dZQuDHvkbnzL8Hd91oyU41bVqfD3vs5EwUouA==
```

#### TesseraCT: Loadtesting POSIX

The loadtesting is roughly the same. I start the `hammer` with the same 500qps of write rate, which was roughly where the S3+MySQL variant topped out. My checkpoint tracker shows the following:

```
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
    N=$(curl -sS http://localhost/checkpoint | grep -E '^[0-9]+$'); \
    if [ "$N" -eq "$O" ]; then \
      echo -n .; \
    else \
      echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ; \
    fi; \
    T=$((T+1)); sleep 1; done
59250 ......... 10 seconds 5244 certs
64494 ......... 10 seconds 5000 certs
69494 ......... 10 seconds 5000 certs
74494 ......... 10 seconds 5000 certs
79494 ......... 10 seconds 5256 certs
84750 ......... 10 seconds 5244 certs
89994 ......... 10 seconds 5256 certs
95250 ......... 10 seconds 5000 certs
100250 ......... 10 seconds 5000 certs
105250 ......... 10 seconds 5000 certs
```

I learn two things. First, the checkpoint interval in this `posix` variant is 10 seconds, compared to the 5 seconds of the `aws` variant I tested before. I dive into the code, because there doesn't seem to be a `--checkpoint_interval` flag. In the `tessera` library, I find `DefaultCheckpointInterval`, which is set to 10 seconds. I change it to 2 seconds instead, and restart the `posix` binary:

```
238250 . 2 seconds 1000 certs
239250 . 2 seconds 1000 certs
240250 . 2 seconds 1000 certs
241250 . 2 seconds 1000 certs
242250 . 2 seconds 1000 certs
243250 . 2 seconds 1000 certs
244250 . 2 seconds 1000 certs
```

{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest2.png" alt="Posix Loadtest 5000qps" >}}

Very nice! Maybe I can write a few more certs? I restart the `hammer` with 5000/s, which, somewhat to my surprise, it ends up serving!

```
642608 . 2 seconds 6155 certs
648763 . 2 seconds 10256 certs
659019 . 2 seconds 9237 certs
668256 . 2 seconds 8800 certs
677056 . 2 seconds 8729 certs
685785 . 2 seconds 8237 certs
694022 . 2 seconds 7487 certs
701509 . 2 seconds 8572 certs
710081 . 2 seconds 7413 certs
```

The throughput is highly variable though, seemingly between 3700/sec and 5100/sec, and I quickly find out that the `hammer` is completely saturating the CPU on the machine, leaving very little room for the `posix` TesseraCT to serve. I'm going to need more machines!

So I start a `hammer` loadtester on the two now-idle Minio servers, and run them at about 6000qps **each**, for a total of 12000 certs/sec. And my little `posix` binary is keeping up like a champ:

```
2987169 . 2 seconds 23040 certs
3010209 . 2 seconds 23040 certs
3033249 . 2 seconds 21760 certs
3055009 . 2 seconds 21504 certs
3076513 . 2 seconds 23808 certs
3100321 . 2 seconds 22528 certs
```

One thing is reasonably clear: the `posix` TesseraCT is CPU bound, not disk bound. The CPU is now running at about 18.5 CPUs/s (with 20 cores), which is pretty much all this Dell has to offer. The NetAPP enterprise solid state drives are not impressed:

```
pim@ctlog-test:~/src/tesseract$ zpool iostat -v ssd-vol0 10 100
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
ssd-vol0                    11.4G   733G      0  3.13K      0   117M
  mirror-0                  11.4G   733G      0  3.13K      0   117M
    wwn-0x5002538a05302930      -      -      0  1.04K      0  39.1M
    wwn-0x5002538a053069f0      -      -      0  1.06K      0  39.1M
    wwn-0x5002538a06313ed0      -      -      0  1.02K      0  39.1M
--------------------------  -----  -----  -----  -----  -----  -----

pim@ctlog-test:~/src/tesseract$ zpool iostat -l ssd-vol0 10
              capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool        alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
ssd-vol0    14.0G   730G      0  1.48K      0  35.4M      -    2ms      -  535us      -    1us      -    3ms      -   50ms
ssd-vol0    14.0G   730G      0  1.12K      0  23.0M      -    1ms      -  733us      -    2us      -    1ms      -   44ms
ssd-vol0    14.1G   730G      0  1.42K      0  45.3M      -  508us      -  122us      -  914ns      -    2ms      -   41ms
ssd-vol0    14.2G   730G      0    678      0  21.0M      -  863us      -  144us      -    2us      -    2ms      -      -
```

## Results

OK, that kind of seals the deal for me. The write path needs about 250 certs/sec and I'm hammering now with 12'000 certs/sec, with room to spare. But what about the read path? The cool thing about the static log is that reads are all entirely done by NGINX.
The only file that isn't cacheable is the `checkpoint` file, which gets updated every two seconds (or ten seconds with the default `tessera` settings). So I start yet another `hammer` whose job it is to read back from the static filesystem:

```
pim@ctlog-test:~/src/tesseract$ curl localhost/nginx_status; sleep 60; curl localhost/nginx_status
Active connections: 10556
server accepts handled requests
 25302 25302 1492918
Reading: 0 Writing: 1 Waiting: 10555

Active connections: 7791
server accepts handled requests
 25764 25764 1727631
Reading: 0 Writing: 1 Waiting: 7790
```

And I can see that it's keeping up quite nicely. In one minute, it handled (1727631-1492918) or 234713 requests, which is a cool 3911 requests/sec. All these read/write hammers are kind of saturating the `ctlog-test` machine, though:

{{< image width="100%" src="/assets/ctlog/ctlog-loadtest3.png" alt="Posix Loadtest 8000qps write, 4000qps read" >}}

But after a little bit of fiddling, I can assert my conclusion:

***Conclusion: a write-rate of 8'000/s alongside a read-rate of 4'000/s should be safe with POSIX***

## What's Next

I am going to offer such a machine in production together with Antonis Chariton and Jeroen Massar. I plan to do a few additional things:

* Test Sunlight as well on the same hardware. It would be nice to see a comparison between write rates of the two implementations.
* Work with Al Cutter and the Transparency Dev team to close a few small gaps (like the `local_signer.go` above and some Prometheus monitoring of the `posix` binary).
* Install and launch both under `*.ct.ipng.ch`, which in itself deserves its own report, showing how I intend to do log cycling and care/feeding, as well as report on the real production experience running these CT Logs.