--- date: "2025-07-26T22:07:23Z" title: 'Certificate Transparency - Part 1' --- {{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}} # Introduction There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)], as the name suggests it was a form of _digital notary_, and they were in the business of issuing security certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for man-in-the-middle attacks on Iranian Gmail users. Not cool. Google launched a project called **Certificate Transparency**, because it was becoming more common that the root of trust given to _Certificate Authorities_ could no longer be unilateraly trusted. These attacks showed that the lack of transparency in the way CAs operated was a significant risk to the Web Public Key Infrastructure. It led to the creation of this ambitious [[project](https://certificate.transparency.dev/)] to improve security online by bringing accountability to the system that protects our online services with _SSL_ (Secure Socket Layer) and _TLS_ (Transport Layer Security). In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It describes an experimental protocol for publicly logging the existence of Transport Layer Security (TLS) certificates as they are issued or observed, in a manner that allows anyone to audit certificate authority (CA) activity and notice the issuance of suspect certificates as well as to audit the certificate logs themselves. The intent is that eventually clients would refuse to honor certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to the logs. This series explores and documents how IPng Networks will be running two Static CT _Logs_ with two different implementations. One will be [[Sunlight](https://sunlight.dev/)], and the other will be [[TesseraCT](https://github.com/transparency-dev/tesseract)]. ## Static Certificate Transparency In this context, _Logs_ are network services that implement the protocol operations for submissions and queries that are defined in this RFC. A few years ago, my buddy Antonis asked me if I would be willing to run a log, but operationally they were very complex and expensive to run. However, over the years, the concept of _Static Logs_ put running on in reach. The [[Static CT API](https://github.com/C2SP/C2SP/blob/main/static-ct-api.md)] defines a read-path HTTP static asset hierarchy (for monitoring) to be implemented alongside the write-path RFC 6962 endpoints (for submission). Aside from the different read endpoints, a log that implements the Static API is a regular CT log that can work alongside RFC 6962 logs and that fulfills the same purpose. In particular, it requires no modification to submitters and TLS clients. If you only read one document about Static CT, read Filippo Valsorda's excellent [[paper](https://filippo.io/a-different-CT-log)]. It describes a radically cheaper and easier to operate [[Certificate Transparency](https://certificate.transparency.dev/)] log that is backed by a consistent object storage, and can scale to 30x the current issuance rate for 2-10% of the costs with no merge delay. 
## Scalable, Cheap, Reliable: choose two

{{< image width="18em" float="right" src="/assets/ctlog/MPLS Backbone - CTLog.svg" alt="ctlog at ipng" >}}

In the diagram, I've drawn an overview of IPng's network. In {{< boldcolor color="red" >}}red{{< /boldcolor >}}, a European backbone network is provided by a [[BGP Free Core network](2022-12-09-oem-switch-2.md)]. It operates a private IPv4, IPv6 and MPLS network, called _IPng Site Local_, which is not connected to the internet. On top of that, IPng offers L2 and L3 services, for example using [[VPP]({{< ref 2021-02-27-network >}})].

In {{< boldcolor color="lightgreen" >}}green{{< /boldcolor >}}, I built a cluster of replicated NGINX frontends. They connect into _IPng Site Local_ and can reach all hypervisors, VMs, and storage systems. They also connect to the Internet with a single IPv4 and IPv6 address. One might say that SSL is _added and removed here :-)_ [[ref](/assets/ctlog/nsa_slide.jpg)].

Then in {{< boldcolor color="orange" >}}orange{{< /boldcolor >}}, I built a set of [[Minio]({{< ref 2025-05-28-minio-1 >}})] S3 storage pools. Amongst other things, I serve the static content of the IPng website from these pools, providing fancy redundancy and caching. I wrote about their design in [[this article]({{< ref 2025-06-01-minio-2 >}})].

Finally, I turn my attention to the {{< boldcolor color="blue" >}}blue{{< /boldcolor >}} part: two hypervisors, one run by [[IPng](https://ipng.ch/)] and the other by [[Massar](https://massars.net/)]. Each of them will be running one of the _Log_ implementations. IPng provides two large ZFS storage tanks for offsite backup, in case a hypervisor decides to check out, and daily backups to an S3 bucket using Restic.

Having explained all of this, I am well aware that end to end reliability will come from the fact that there are many independent _Log_ operators, and folks wanting to validate certificates can simply monitor many. If there is a gap in coverage, say due to any given _Log_'s downtime, this will not necessarily be problematic. It does mean that I may have to suppress the SRE in me...

## Minio

My first instinct is to leverage the distributed storage IPng has, but as I'll show in the rest of this article, maybe a simpler, more elegant design could be superior, precisely because individual log reliability is not _as important_ as having many available log _instances_ to choose from.

From operators in the field I understand that the world-wide issuance of certificates is roughly 17M/day, which amounts to some 200-250 qps of writes. My first thought is to see how fast my open source S3 machines can go, really. I'm also curious about the difference between SSD and spinning disks. I boot two Dell R630s in the Lab. These machines have two Xeon E5-2640 v4 CPUs for a total of 20 cores and 40 threads, and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I place 6pcs 1.2TB SAS3 disks (HPE part number EG1200JEHMC), and in the second machine I place 6pcs of 1.92TB enterprise SSDs (Samsung part number P1633N19). I spin up a 6-device Minio cluster on both and take them for a spin using [[S3 Benchmark](https://github.com/wasabi-tech/s3-benchmark.git)] from Wasabi Tech.

```
pim@ctlog-test:~/src/s3-benchmark$ for dev in disk ssd; do \
    for t in 1 8 32; do \
      for z in 4M 1M 8k 4k; do \
        ./s3-benchmark -a $KEY -s $SECRET -u http://minio-$dev:9000 -t $t -z $z \
          | tee -a minio-results.txt; \
      done; \
    done; \
  done
```

The loadtest above does a bunch of runs with varying parameters. First it tries to read and write object sizes of 4MB, 1MB, 8kB and 4kB respectively. Then it tries to do this with either 1 thread, 8 threads or 32 threads. Finally, it tests both the disk-based variant as well as the SSD based one. The loadtest runs from a third machine, so that the Dell R630 disk tanks can stay completely dedicated to their task of running Minio.
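For readers who would rather poke at an S3 endpoint from code than via the `s3-benchmark` binary, here is a minimal Go sketch of the same idea: a handful of goroutines PUT fixed-size objects and we count the rate. It assumes the `minio-go` v7 client, placeholder credentials, and a pre-created bucket called `loadtest`; it is not the tool I used for the graphs below, just an illustration of the shape of such a probe.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Endpoint, credentials and bucket name are placeholders for the lab setup.
	client, err := minio.New("minio-ssd:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("KEY", "SECRET", ""),
		Secure: false,
	})
	if err != nil {
		panic(err)
	}

	const (
		threads  = 8
		objSize  = 8 * 1024
		duration = 30 * time.Second
	)
	payload := bytes.Repeat([]byte("x"), objSize)
	var puts atomic.Int64

	ctx, cancel := context.WithTimeout(context.Background(), duration)
	defer cancel()

	var wg sync.WaitGroup
	for t := 0; t < threads; t++ {
		wg.Add(1)
		go func(t int) {
			defer wg.Done()
			// Keep writing 8kB objects until the context deadline fires.
			for i := 0; ctx.Err() == nil; i++ {
				name := fmt.Sprintf("bench/%d-%d", t, i)
				_, err := client.PutObject(ctx, "loadtest", name,
					bytes.NewReader(payload), int64(len(payload)), minio.PutObjectOptions{})
				if err != nil {
					return
				}
				puts.Add(1)
			}
		}(t)
	}
	wg.Wait()
	fmt.Printf("%d PUTs in %s (%.0f PUT/s)\n", puts.Load(), duration,
		float64(puts.Load())/duration.Seconds())
}
```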
{{< image width="100%" src="/assets/ctlog/minio_8kb_performance.png" alt="MinIO 8kb disk vs SSD" >}}

The left-hand side graph feels pretty natural to me. With one thread, uploading 8kB objects will quickly hit the IOPS rate of the disks, each of which has to participate in the write due to EC:3 encoding when using six disks, and it tops out at ~56 PUT/s. The single thread hitting SSDs will not hit that limit, and manages ~371 PUT/s, which I found a bit underwhelming. But when performing the loadtest with either 8 or 32 write threads, the hard disks become only marginally faster (topping out at 240 PUT/s), while the SSDs really start to shine, with 3850 PUT/s. Pretty good performance.

On the read side, I am pleasantly surprised that there's not really that much of a difference between disks and SSDs. This is likely because the host filesystem cache is playing a large role, so the 1-thread performance is equivalent (765 GET/s for disks, 677 GET/s for SSDs), and the 32-thread performance is also equivalent (7624 GET/s for disks, 7261 GET/s for SSDs). I do wonder why the hard disks consistently outperform the SSDs with all the other variables (OS, MinIO version, hardware) the same.

## Sidequest: SeaweedFS

Something that has long caught my attention is the way in which [[SeaweedFS](https://github.com/seaweedfs/seaweedfs)] approaches blob storage. Many operators have great success with many small file writes in SeaweedFS compared to MinIO and even AWS S3 storage. This is because writes with SeaweedFS are not broken into erasure-sets, which would require every disk to write a small part or checksum of the data; rather, files are replicated within the cluster in their entirety on different disks, racks or datacenters. I won't bore you with the details of SeaweedFS, but I'll tack on a docker [[compose file](/assets/ctlog/seaweedfs.docker-compose.yml)] that I used at the end of this article, if you're curious.

{{< image width="100%" src="/assets/ctlog/size_comparison_8t.png" alt="MinIO vs SeaWeedFS" >}}

In the write-path, SeaweedFS dominates in all cases, due to its different way of achieving durable storage (per-file replication in SeaweedFS versus all-disk erasure-sets in MinIO):

* 4k: 3,384 ops/sec vs MinIO's 111 ops/sec (30x faster!)
* 8k: 3,332 ops/sec vs MinIO's 111 ops/sec (30x faster!)
* 1M: 383 ops/sec vs MinIO's 44 ops/sec (9x faster)
* 4M: 104 ops/sec vs MinIO's 32 ops/sec (4x faster)

For the read-path, in GET operations MinIO is better at small objects, and really dominates at large objects:

* 4k: 7,411 ops/sec vs SeaweedFS 5,014 ops/sec
* 8k: 7,666 ops/sec vs SeaweedFS 5,165 ops/sec
* 1M: 5,466 ops/sec vs SeaweedFS 2,212 ops/sec
* 4M: 3,084 ops/sec vs SeaweedFS 646 ops/sec

This makes me draw an interesting conclusion: seeing as CT Logs are read/write heavy (every couple of seconds, the Merkle tree is recomputed, which is reasonably disk-intensive), SeaweedFS might be a slightly better choice. IPng Networks has three Minio deployments, but no SeaweedFS deployments. Yet.
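That parenthetical about the Merkle tree deserves a tiny illustration. The root hash that ends up in a log's `checkpoint` is the RFC 6962 Merkle Tree Hash over all entries: leaves are hashed with a `0x00` prefix, interior nodes with a `0x01` prefix, splitting at the largest power of two smaller than `n`. The sketch below computes it recursively over a few placeholder byte strings, purely to show the hashing rules; the entries are made up, not real Static CT data tiles.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// leafHash computes the RFC 6962 leaf hash: SHA-256(0x00 || entry).
func leafHash(entry []byte) [32]byte {
	return sha256.Sum256(append([]byte{0x00}, entry...))
}

// nodeHash computes the RFC 6962 interior node hash: SHA-256(0x01 || left || right).
func nodeHash(left, right [32]byte) [32]byte {
	b := append([]byte{0x01}, left[:]...)
	b = append(b, right[:]...)
	return sha256.Sum256(b)
}

// merkleTreeHash computes MTH(D[n]) over the given leaf hashes, splitting at the
// largest power of two smaller than n, as described in RFC 6962 section 2.1.
func merkleTreeHash(leaves [][32]byte) [32]byte {
	n := len(leaves)
	if n == 0 {
		return sha256.Sum256(nil)
	}
	if n == 1 {
		return leaves[0]
	}
	k := 1
	for k*2 < n {
		k *= 2
	}
	return nodeHash(merkleTreeHash(leaves[:k]), merkleTreeHash(leaves[k:]))
}

func main() {
	entries := [][]byte{[]byte("cert-1"), []byte("cert-2"), []byte("cert-3")}
	hashes := make([][32]byte, len(entries))
	for i, e := range entries {
		hashes[i] = leafHash(e)
	}
	fmt.Printf("tree size %d, root %x\n", len(entries), merkleTreeHash(hashes))
}
```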
# Tessera

Tessera is a Go library for building tile-based transparency logs (tlogs) [[ref](https://github.com/C2SP/C2SP/blob/main/tlog-tiles.md)]. It is the logical successor to the approach that Google took when building and operating _Logs_ using its predecessor called [[Trillian](https://github.com/google/trillian)]. The implementation and its APIs bake in current best practices based on the lessons learned over the past decade of building and operating transparency logs in production environments and at scale.

Tessera was introduced at the Transparency.Dev summit in October 2024. I first watched Al and Martin [[introduce](https://www.youtube.com/watch?v=9j_8FbQ9qSc)] it at last year's summit. At a high level, it wraps what used to be a whole Kubernetes cluster full of components into a single library that can be used with Cloud based services, like AWS S3 and an RDS database, or GCP's GCS storage and a Spanner database. However, Google also made it easy to use a regular POSIX filesystem implementation.

## TesseraCT

{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="tesseract logo" >}}

While Tessera is a library, a CT log implementation comes from its sibling GitHub repository called [[TesseraCT](https://github.com/transparency-dev/tesseract)]. Because it leverages Tessera under the hood, TesseraCT can run on GCP, AWS, POSIX-compliant filesystems, or on S3-compatible systems alongside a MySQL database.

In order to provide ecosystem agility and to control the growth of CT Log sizes, new CT Logs must be temporally sharded, defining a certificate expiry range denoted in the form of two dates: `[rangeBegin, rangeEnd)`. The certificate expiry range allows a Log to reject otherwise valid logging submissions for certificates that expire before or after this defined range, thus partitioning the set of publicly-trusted certificates that each Log will accept.

I will be expected to keep logs for an extended period of time, say 3-5 years. It's time for me to figure out what this TesseraCT thing can do .. are you ready? Let's go!
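Before diving in, a quick illustration of that temporal sharding rule. A sharded log accepts a submission only if the leaf certificate's `NotAfter` date falls inside the half-open range `[rangeBegin, rangeEnd)`. The sketch below shows that check in isolation; the shard dates and the PEM path are made up for illustration, and TesseraCT's actual validation lives in the personality code rather than in anything this simple.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

// acceptedByShard reports whether a certificate's expiry (NotAfter) falls in
// the temporal shard's half-open range [rangeBegin, rangeEnd).
func acceptedByShard(cert *x509.Certificate, rangeBegin, rangeEnd time.Time) bool {
	return !cert.NotAfter.Before(rangeBegin) && cert.NotAfter.Before(rangeEnd)
}

func main() {
	// Hypothetical 2026 shard; adjust to your own log's range.
	begin := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	end := time.Date(2027, 1, 1, 0, 0, 0, 0, time.UTC)

	pemBytes, err := os.ReadFile("/tmp/leaf.pem")
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		fmt.Println("no PEM block found")
		return
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		fmt.Println("parse:", err)
		return
	}
	fmt.Printf("NotAfter=%s accepted=%v\n", cert.NotAfter, acceptedByShard(cert, begin, end))
}
```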
### TesseraCT: S3 and SQL

TesseraCT comes with a few so-called _personalities_. Those are an implementation of the underlying storage infrastructure in an opinionated way. The first personality I look at is the `aws` one in `cmd/tesseract/aws`. I notice that this personality does make hard assumptions about the use of AWS, which is unfortunate as the documentation says '.. or self-hosted S3 and MySQL database'. However, the `aws` personality assumes the AWS Secrets Manager in order to fetch its signing key. Before I can be successful, I need to detangle that.

#### TesseraCT: AWS and Local Signer

First, I change `cmd/tesseract/aws/main.go` to add two new flags:

* ***-signer_public_key_file***: a path to the public key for checkpoints and SCT signer
* ***-signer_private_key_file***: a path to the private key for checkpoints and SCT signer

I then change the program to assume that if these flags are both set, the user will want a _NewLocalSigner_ instead of a _NewSecretsManagerSigner_. Now all I have to do is implement the signer interface in a package `local_signer.go`. There, function _NewLocalSigner()_ will read the public and private PEM from file, decode them, and create an _ECDSAWithSHA256Signer_ with them. A simple example to show what I mean:

```
// NewLocalSigner creates a new signer that uses the ECDSA P-256 key pair from
// local disk files for signing digests.
func NewLocalSigner(publicKeyFile, privateKeyFile string) (*ECDSAWithSHA256Signer, error) {
	// Read the public key, error handling omitted for brevity.
	publicKeyPEM, _ := os.ReadFile(publicKeyFile)
	publicPemBlock, _ := pem.Decode(publicKeyPEM)
	publicKey, _ := x509.ParsePKIXPublicKey(publicPemBlock.Bytes)
	ecdsaPublicKey, ok := publicKey.(*ecdsa.PublicKey)
	if !ok {
		return nil, errors.New("public key is not an ECDSA key")
	}

	// Read the private key, error handling omitted for brevity.
	privateKeyPEM, _ := os.ReadFile(privateKeyFile)
	privatePemBlock, _ := pem.Decode(privateKeyPEM)
	ecdsaPrivateKey, _ := x509.ParseECPrivateKey(privatePemBlock.Bytes)

	// Verify the correctness of the signer key pair.
	if !ecdsaPrivateKey.PublicKey.Equal(ecdsaPublicKey) {
		return nil, errors.New("signer key pair doesn't match")
	}

	return &ECDSAWithSHA256Signer{
		publicKey:  ecdsaPublicKey,
		privateKey: ecdsaPrivateKey,
	}, nil
}
```

In the snippet above I omitted the error handling, but the local signer logic itself is hopefully clear. And with that, I am liberated from Amazon's Cloud offering and can run this thing all by myself!

#### TesseraCT: Running with S3, MySQL, and Local Signer

First, I need to create a suitable ECDSA key:

```
pim@ctlog-test:~$ openssl ecparam -name prime256v1 -genkey -noout -out /tmp/private_key.pem
pim@ctlog-test:~$ openssl ec -in /tmp/private_key.pem -pubout -out /tmp/public_key.pem
```

Then, I'll install the MySQL server and create the databases:

```
pim@ctlog-test:~$ sudo apt install default-mysql-server
pim@ctlog-test:~$ sudo mysql -u root
CREATE USER 'tesseract'@'localhost' IDENTIFIED BY '';
CREATE DATABASE tesseract;
CREATE DATABASE tesseract_antispam;
GRANT ALL PRIVILEGES ON tesseract.* TO 'tesseract'@'localhost';
GRANT ALL PRIVILEGES ON tesseract_antispam.* TO 'tesseract'@'localhost';
```

Finally, I use the SSD Minio lab-machine that I just loadtested to create an S3 bucket:

```
pim@ctlog-test:~$ mc mb minio-ssd/tesseract-test
pim@ctlog-test:~$ cat << EOF > /tmp/minio-access.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ],
      "Resource": [ "arn:aws:s3:::tesseract-test/*", "arn:aws:s3:::tesseract-test" ]
    }
  ]
}
EOF
pim@ctlog-test:~$ mc admin user add minio-ssd
pim@ctlog-test:~$ mc admin policy create minio-ssd tesseract-test-access /tmp/minio-access.json
pim@ctlog-test:~$ mc admin policy attach minio-ssd tesseract-test-access --user
pim@ctlog-test:~$ mc anonymous set public minio-ssd/tesseract-test
```

{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}

After some fiddling, I understand that the AWS software development kit makes some assumptions that you'll be using .. _quelle surprise_ .. AWS services. But you can also use local S3 services by setting a few key environment variables. I had heard of the S3 access and secret key environment variables before, but I now need to also use a different S3 endpoint. That little detour into the codebase only took me .. several hours.

Armed with that knowledge, I can build and finally start my TesseraCT instance:

```
pim@ctlog-test:~/src/tesseract/cmd/tesseract/aws$ go build -o ~/aws .
pim@ctlog-test:~$ export AWS_DEFAULT_REGION="us-east-1"
pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID=""
pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY=""
pim@ctlog-test:~$ export AWS_ENDPOINT_URL_S3="http://minio-ssd.lab.ipng.ch:9000/"
pim@ctlog-test:~$ ./aws --http_endpoint='[::]:6962' \
    --origin=ctlog-test.lab.ipng.ch/test-ecdsa \
    --bucket=tesseract-test \
    --db_host=ctlog-test.lab.ipng.ch \
    --db_user=tesseract \
    --db_password= \
    --db_name=tesseract \
    --antispam_db_name=tesseract_antispam \
    --signer_public_key_file=/tmp/public_key.pem \
    --signer_private_key_file=/tmp/private_key.pem \
    --roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
I0727 15:13:04.666056  337461 main.go:128] **** CT HTTP Server Starting ****
```

Hah! I think most of the command line flags and environment variables should make sense, but I was struggling for a while with the `--roots_pem_file` and the `--origin` flags, so I phoned a friend (Al Cutter, Googler extraordinaire and an expert in Tessera/CT). He explained to me that the Log is actually an open endpoint to which anybody might POST data. However, to avoid folks abusing the log infrastructure, each POST is expected to come from one of the certificate authorities listed in the `--roots_pem_file`. OK, that makes sense.

Then, the `--origin` flag designates how my log calls itself. In the resulting `checkpoint` file it will enumerate a hash of the latest merged and published Merkle tree. In case a server serves multiple logs, the `--origin` line is used to distinguish which checkpoint belongs to which log.

```
pim@ctlog-test:~/src/tesseract$ curl http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint
ctlog-test.lab.ipng.ch/test-ecdsa
0
JGPitKWWI0aGuCfC2k1n/p9xdWAYPm5RZPNDXkCEVUU=

— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMCONUBAMARjBEAiA/nc9dig6U//vPg7SoTHjt9bxP5K+x3w4MYKpIRn4ULQIgUY5zijRK8qyuJGvZaItDEmP1gohCt+wI+sESBnhkuqo=
```

When creating the bucket above, I used `mc anonymous set public`, which made the S3 bucket world-readable. I can now execute the whole read-path simply by hitting the S3 service. Check.
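Since the checkpoint is just a static file, a monitor needs nothing more than an HTTP GET to follow the log. Here's a small Go sketch that fetches it and pulls out the origin line, the tree size and the root hash; verifying the note signature on the lines below the blank line is left out of this sketch, and the URL is simply the lab bucket from above.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
	"strings"
)

// fetchCheckpoint grabs the checkpoint from a Static CT monitoring prefix and
// returns the origin line, the tree size and the base64 root hash.
func fetchCheckpoint(url string) (origin string, size uint64, rootHash string, err error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", 0, "", err
	}
	lines := strings.Split(string(body), "\n")
	if len(lines) < 3 {
		return "", 0, "", fmt.Errorf("short checkpoint: %q", body)
	}
	size, err = strconv.ParseUint(lines[1], 10, 64)
	if err != nil {
		return "", 0, "", fmt.Errorf("bad tree size %q: %v", lines[1], err)
	}
	return lines[0], size, lines[2], nil
}

func main() {
	// The lab bucket used earlier in this article; adjust to your own log.
	origin, size, root, err := fetchCheckpoint("http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("origin=%s size=%d root=%s\n", origin, size, root)
}
```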
#### TesseraCT: Loadtesting S3/MySQL

{{< image width="12em" float="right" src="/assets/ctlog/stop-hammer-time.jpg" alt="Stop, hammer time" >}}

The write path is a server on `[::]:6962`. I should be able to write a log to it, but how? Here's where I am grateful to find a tool in the TesseraCT GitHub repository called `hammer`. This hammer sets up read and write traffic to a Static CT API log to test correctness and performance under load. The traffic is sent according to the [[Static CT API](https://c2sp.org/static-ct-api)] spec. Slick!

The tool starts a text-based UI (my favorite! also when using the Cisco T-Rex loadtester) in the terminal that shows the current status and logs, and supports increasing/decreasing read and write traffic. This TUI allows for a level of interactivity when probing a new configuration of a log, in order to find any cliffs where performance degrades. For real load-testing applications, especially headless runs as part of a CI pipeline, it is recommended to run the tool with `-show_ui=false` in order to disable the UI.

I'm a bit lost in the somewhat terse [[README.md](https://github.com/transparency-dev/tesseract/tree/main/internal/hammer)], but my buddy Al comes to my rescue and explains the flags to me. First of all, the loadtester wants to hit the same `--origin` that I configured the write-path to accept. In my case this is `ctlog-test.lab.ipng.ch/test-ecdsa`. Then, it needs the public key for that _Log_, which I can find in `/tmp/public_key.pem`. The contents are the _DER_ (Distinguished Encoding Rules) encoding of the key, stored as a base64 encoded string. What follows next was the most difficult for me to understand, as I was thinking the hammer would read some log from the internet somewhere and replay it locally. Al explains that actually, the `hammer` tool synthetically creates all of these log entries itself, and it regularly reads the `checkpoint` from the `--log_url` place, while it writes its certificates to `--write_log_url`. The last few flags just inform the `hammer` how many read and write ops/sec it should generate, and with that explanation my brain plays _tadaa.wav_ and I am ready to go.

```
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer \
    --origin=ctlog-test.lab.ipng.ch/test-ecdsa \
    --log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEucHtDWe9GYNicPnuGWbEX8rJg/VnDcXs8z40KdoNidBKy6/ZXw2u+NW1XAUnGpXcZozxufsgOMhijsWb25r7jw== \
    --log_url=http://tesseract-test.minio-ssd.lab.ipng.ch:9000/ \
    --write_log_url=http://localhost:6962/ctlog-test.lab.ipng.ch/test-ecdsa/ \
    --max_read_ops=0 \
    --num_writers=5000 \
    --max_write_ops=100
```

{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest1.png" alt="S3/MySQL Loadtest 100qps" >}}

Cool! It seems that the loadtest is happily chugging along at 100qps. The log is consuming them in the HTTP write-path by accepting POST requests to `/ctlog-test.lab.ipng.ch/test-ecdsa/ct/v1/add-chain`, where hammer is offering them at a rate of 100qps, with a configured probability of duplicates set at 10%. What that means is that every now and again, it'll repeat a previous request. The purpose of this is to stress test the so-called `antispam` implementation. When `hammer` sends its requests, it signs them with a certificate that was issued by the CA described in `internal/hammer/testdata/test_root_ca_cert.pem`, which is why TesseraCT accepts them.
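For completeness, this is roughly what one of those `add-chain` submissions looks like on the wire: an RFC 6962 JSON body with the base64 DER certificates, leaf first, POSTed to the log's `ct/v1/add-chain` endpoint, which answers with a signed certificate timestamp (SCT). The chain file path here is hypothetical, and the chain must lead up to the test CA mentioned above or TesseraCT will reject it.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"encoding/pem"
	"fmt"
	"net/http"
	"os"
)

// addChainRequest is the RFC 6962 add-chain body: base64 DER certificates,
// leaf first, followed by its chain towards one of the accepted roots.
type addChainRequest struct {
	Chain []string `json:"chain"`
}

func main() {
	// Hypothetical PEM bundle containing the leaf and its intermediates.
	pemBytes, err := os.ReadFile("/tmp/chain.pem")
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	var req addChainRequest
	for block, rest := pem.Decode(pemBytes); block != nil; block, rest = pem.Decode(rest) {
		req.Chain = append(req.Chain, base64.StdEncoding.EncodeToString(block.Bytes))
	}
	body, _ := json.Marshal(req)

	url := "http://localhost:6962/ctlog-test.lab.ipng.ch/test-ecdsa/ct/v1/add-chain"
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("post:", err)
		return
	}
	defer resp.Body.Close()

	// The SCT response; only a couple of its fields are printed here.
	var sct struct {
		SCTVersion int    `json:"sct_version"`
		Timestamp  uint64 `json:"timestamp"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&sct); err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Printf("status=%s sct_version=%d timestamp=%d\n", resp.Status, sct.SCTVersion, sct.Timestamp)
}
```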
I raise the write load by using the '>' key a few times. I notice things are great at 500qps, which is nice because that's double the rate we expect. But I start seeing a bit more noise at 600qps. When I raise the write-rate to 1000qps, all hell breaks loose in the logs of the server (and similar logs appear in the `hammer` loadtester):

```
W0727 15:54:33.419881  348475 handlers.go:168] ctlog-test.lab.ipng.ch/test-ecdsa: AddChain handler error: couldn't store the leaf: failed to fetch entry bundle at index 0: failed to fetch resource: getObject: failed to create reader for object "tile/data/000" in bucket "tesseract-test": operation error S3: GetObject, context deadline exceeded
W0727 15:55:02.727962  348475 aws.go:345] GarbageCollect failed: failed to delete one or more objects: failed to delete objects: operation error S3: DeleteObjects, https response error StatusCode: 400, RequestID: 1856202CA3C4B83F, HostID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8, api error MalformedXML: The XML you provided was not well-formed or did not validate against our published schema.
E0727 15:55:10.448973  348475 append_lifecycle.go:293] followerStats: follower "AWS antispam" EntriesProcessed(): failed to read follow coordination info: Error 1040: Too many connections
```

I see on the Minio instance that it's doing about 150/s of GETs and 15/s of PUTs, which is totally reasonable:

```
pim@ctlog-test:~/src/tesseract$ mc admin trace --stats ssd
Duration: 6m9s ▰▱▱    RX Rate:↑ 34 MiB/m   TX Rate:↓ 2.3 GiB/m   RPM: 10588.1
-------------
Call                      Count           RPM     Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min
s3.GetObject              60558 (92.9%)   9837.2  4.3ms     708µs     48.1ms    3.9ms     47.8ms    ↑144B ↓246K  ↑1.4M ↓2.3G
s3.PutObject              2199 (3.4%)     357.2   5.3ms     2.4ms     32.7ms    5.3ms     32.7ms    ↑92K         ↑32M
s3.DeleteMultipleObjects  1212 (1.9%)     196.9   877µs     290µs     41.1ms    850µs     41.1ms    ↑230B ↓369B  ↑44K ↓71K
s3.ListObjectsV2          1212 (1.9%)     196.9   18.4ms    999µs     52.8ms    18.3ms    52.7ms    ↑131B ↓261B  ↑25K ↓50K
```

Another nice way to see what makes it through is this oneliner, which reads the `checkpoint` every second and, once it changes, shows the delta in seconds and how many certs were written:

```
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
    N=$(curl -sS http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E '^[0-9]+$'); \
    if [ "$N" -eq "$O" ]; then \
      echo -n .; \
    else \
      echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ; \
    fi; \
    T=$((T+1)); sleep 1; done
1012905 .... 5 seconds 2081 certs
1014986 .... 5 seconds 2126 certs
1017112 .... 5 seconds 1913 certs
1019025 .... 5 seconds 2588 certs
1021613 .... 5 seconds 2591 certs
1024204 .... 5 seconds 2197 certs
```

So I can see that the checkpoint is refreshed every 5 seconds and between 1913 and 2591 certs are written each time. And indeed, at 400/s there are no errors or warnings at all. At this write rate, TesseraCT is using about 2.9 CPUs/s, with MariaDB using 0.3 CPUs/s, but the hammer is using 6.0 CPUs/s. Overall, the machine is perfectly happy serving for a few hours under this load test.

***Conclusion: a write-rate of 400/s should be safe with S3+MySQL***

### TesseraCT: POSIX

I have been playing with this idea of having a reliable read-path by having the S3 cluster be redundant, or by replicating the S3 bucket. But Al asks: why not use our experimental POSIX? We discuss two very important benefits, but also two drawbacks:

* On the plus side:
   1. There is no need for S3 storage, reading and writing to a local ZFS raidz2 pool instead.
   1. There is no need for MySQL, as the POSIX implementation can use a local badger instance, also on the local filesystem.
* On the drawbacks:
   1. There is a SPOF in the read-path, as the single VM must serve both the read- and the write-path. The write-path always has a SPOF on the TesseraCT VM.
   1. Local storage is more expensive than S3 storage, and can be used only for the purposes of one application (and at best, shared with other VMs on the same hypervisor).

Come to think of it, this is maybe not such a bad tradeoff. I do kind of like having a single VM with a single binary and no other moving parts. It greatly simplifies the architecture, and for the read-path I can (and will) still use multiple upstream NGINX machines in IPng's network.

I consider myself nerd-sniped, and take a look at the POSIX variant. I have a few SAS3 solid state drives (NetAPP part number X447_S1633800AMD), which I plug into the `ctlog-test` machine.

```
pim@ctlog-test:~$ sudo zpool create -o ashift=12 -o autotrim=on ssd-vol0 mirror \
    /dev/disk/by-id/wwn-0x5002538a0???????
pim@ctlog-test:~$ sudo zfs create ssd-vol0/tesseract-test
pim@ctlog-test:~$ sudo chown pim:pim /ssd-vol0/tesseract-test

pim@ctlog-test:~/src/tesseract$ go run ./cmd/experimental/posix --http_endpoint='[::]:6962' \
    --origin=ctlog-test.lab.ipng.ch/test-ecdsa \
    --private_key=/tmp/private_key.pem \
    --storage_dir=/ssd-vol0/tesseract-test \
    --roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
badger 2025/07/27 16:29:15 INFO: All 0 tables opened in 0s
badger 2025/07/27 16:29:15 INFO: Discard stats nextEmptySlot: 0
badger 2025/07/27 16:29:15 INFO: Set nextTxnTs to 0
I0727 16:29:15.032845  363156 files.go:502] Initializing directory for POSIX log at "/ssd-vol0/tesseract-test" (this should only happen ONCE per log!)
I0727 16:29:15.034101  363156 main.go:97] **** CT HTTP Server Starting ****

pim@ctlog-test:~/src/tesseract$ cat /ssd-vol0/tesseract-test/checkpoint
ctlog-test.lab.ipng.ch/test-ecdsa
0
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=

— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMSgC8BAMARzBFAiBjT5zdkniKlryqlUlx/gLHOtVK26zuWwrc4BlyTVzCWgIhAJ0GIrlrP7YGzRaHjzdB5tnS5rpP3LeOsPbpLateaiFc
```

Alright, I can see the log started and created an empty checkpoint file. Nice! Before I can loadtest it, I will need to make the read-path visible. The `hammer` can read a checkpoint from local `file:///` prefixes, but I'll have to serve them over the network eventually anyway, so I create the following NGINX config for it:

```
server {
    listen 80 default_server backlog=4096;
    listen [::]:80 default_server backlog=4096;

    root /ssd-vol0/tesseract-test/;
    index index.html index.htm index.nginx-debian.html;
    server_name _;

    access_log /var/log/nginx/access.log combined buffer=512k flush=5s;

    location / {
        try_files $uri $uri/ =404;
        tcp_nopush on;
        sendfile on;
        tcp_nodelay on;
        keepalive_timeout 65;
        keepalive_requests 1000;
    }
}
```

Just a couple of small thoughts on this configuration. I'm using buffered access logs, to avoid excessive disk writes in the read-path. Then, I'm using kernel `sendfile()`, which will instruct the kernel to serve the static objects directly, so that NGINX can move on. Further, I'll allow for a long keepalive in HTTP/1.1, so that future requests can reuse the same TCP connection, and I'll set the `tcp_nodelay` and `tcp_nopush` flags to just blast the data out without waiting. Without much ado:

```
pim@ctlog-test:~/src/tesseract$ curl -sS ctlog-test.lab.ipng.ch/checkpoint
ctlog-test.lab.ipng.ch/test-ecdsa
0
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=

— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMTfksBAMASDBGAiEAqADLH0P/SRVloF6G1ezlWG3Exf+sTzPIY5u6VjAKLqACIQCkJO2N0dZQuDHvkbnzL8Hd91oyU41bVqfD3vs5EwUouA==
```

#### TesseraCT: Loadtesting POSIX

The loadtesting is roughly the same. I start the `hammer` with the same 500qps of write rate, which was roughly where the S3+MySQL variant topped out. My checkpoint tracker shows the following:

```
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
    N=$(curl -sS http://localhost/checkpoint | grep -E '^[0-9]+$'); \
    if [ "$N" -eq "$O" ]; then \
      echo -n .; \
    else \
      echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ; \
    fi; \
    T=$((T+1)); sleep 1; done
59250 ......... 10 seconds 5244 certs
64494 ......... 10 seconds 5000 certs
69494 ......... 10 seconds 5000 certs
74494 ......... 10 seconds 5000 certs
79494 ......... 10 seconds 5256 certs
84750 ......... 10 seconds 5244 certs
89994 ......... 10 seconds 5256 certs
95250 ......... 10 seconds 5000 certs
100250 ......... 10 seconds 5000 certs
105250 ......... 10 seconds 5000 certs
```

I learn two things. First, the checkpoint interval in this `posix` variant is 10 seconds, compared to the 5 seconds of the `aws` variant I tested before. I dive into the code, because there doesn't seem to be a `--checkpoint_interval` flag. In the `tessera` library, I find `DefaultCheckpointInterval`, which is set to 10 seconds. I change it to 2 seconds instead, and restart the `posix` binary:

```
238250 . 2 seconds 1000 certs
239250 . 2 seconds 1000 certs
240250 . 2 seconds 1000 certs
241250 . 2 seconds 1000 certs
242250 . 2 seconds 1000 certs
243250 . 2 seconds 1000 certs
244250 . 2 seconds 1000 certs
```

{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest2.png" alt="Posix Loadtest 5000qps" >}}

Very nice! Maybe I can write a few more certs? I restart the `hammer` with 5000/s, which, somewhat to my surprise, it ends up serving!

```
642608 . 2 seconds 6155 certs
648763 . 2 seconds 10256 certs
659019 . 2 seconds 9237 certs
668256 . 2 seconds 8800 certs
677056 . 2 seconds 8729 certs
685785 . 2 seconds 8237 certs
694022 . 2 seconds 7487 certs
701509 . 2 seconds 8572 certs
710081 . 2 seconds 7413 certs
```

The throughput is highly variable though, seemingly between 3700/sec and 5100/sec, and I quickly find out that the `hammer` is completely saturating the CPU on the machine, leaving very little room for the `posix` TesseraCT to serve. I'm going to need more machines!

So I start a `hammer` loadtester on the two now-idle Minio servers, and run them at about 6000qps **each**, for a total of 12000 certs/sec. And my little `posix` binary is keeping up like a champ:

```
2987169 . 2 seconds 23040 certs
3010209 . 2 seconds 23040 certs
3033249 . 2 seconds 21760 certs
3055009 . 2 seconds 21504 certs
3076513 . 2 seconds 23808 certs
3100321 . 2 seconds 22528 certs
```

One thing is reasonably clear: the `posix` TesseraCT is CPU bound, not disk bound. The CPU is now running at about 18.5 CPUs/s (with 20 cores), which is pretty much all this Dell has to offer. The NetAPP enterprise solid state drives are not impressed:

```
pim@ctlog-test:~/src/tesseract$ zpool iostat -v ssd-vol0 10 100
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
ssd-vol0                    11.4G   733G      0  3.13K      0   117M
  mirror-0                  11.4G   733G      0  3.13K      0   117M
    wwn-0x5002538a05302930      -      -      0  1.04K      0  39.1M
    wwn-0x5002538a053069f0      -      -      0  1.06K      0  39.1M
    wwn-0x5002538a06313ed0      -      -      0  1.02K      0  39.1M
--------------------------  -----  -----  -----  -----  -----  -----

pim@ctlog-test:~/src/tesseract$ zpool iostat -l ssd-vol0 10
              capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool        alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
ssd-vol0    14.0G   730G      0  1.48K      0  35.4M      -    2ms      -  535us      -    1us      -    3ms      -   50ms
ssd-vol0    14.0G   730G      0  1.12K      0  23.0M      -    1ms      -  733us      -    2us      -    1ms      -   44ms
ssd-vol0    14.1G   730G      0  1.42K      0  45.3M      -  508us      -  122us      -  914ns      -    2ms      -   41ms
ssd-vol0    14.2G   730G      0    678      0  21.0M      -  863us      -  144us      -    2us      -    2ms      -      -
```

## Results

OK, that kind of seals the deal for me. The write path needs about 250 certs/sec and I'm hammering now with 12'000 certs/sec, with room to spare. But what about the read path? The cool thing about the static log is that reads are all entirely done by NGINX.
The only file that isn't cacheable is the `checkpoint` file, which gets updated every two seconds (or ten seconds with the default `tessera` settings). So I start yet another `hammer` whose job it is to read back from the static filesystem:

```
pim@ctlog-test:~/src/tesseract$ curl localhost/nginx_status; sleep 60; curl localhost/nginx_status
Active connections: 10556
server accepts handled requests
 25302 25302 1492918
Reading: 0 Writing: 1 Waiting: 10555

Active connections: 7791
server accepts handled requests
 25764 25764 1727631
Reading: 0 Writing: 1 Waiting: 7790
```

And I can see that it's keeping up quite nicely. In one minute, it handled (1727631-1492918) or 234713 requests, which is a cool 3911 requests/sec. All these read/write hammers are kind of saturating the `ctlog-test` machine, though:

{{< image width="100%" src="/assets/ctlog/ctlog-loadtest3.png" alt="Posix Loadtest 8000qps write, 4000qps read" >}}

But after a little bit of fiddling, I can assert my conclusion:

***Conclusion: a write-rate of 8'000/s alongside a read-rate of 4'000/s should be safe with POSIX***

## What's Next

I am going to offer such a machine in production together with Antonis Chariton and Jeroen Massar. I plan to do a few additional things:

* Test Sunlight as well on the same hardware. It would be nice to see a comparison between write rates of the two implementations.
* Work with Al Cutter and the Transparency Dev team to close a few small gaps (like the `local_signer.go` above and some Prometheus monitoring of the `posix` binary).
* Install and launch both under `*.ct.ipng.ch`, which in itself deserves its own report, showing how I intend to do log cycling and care/feeding, as well as report on the real production experience running these CT Logs.