Compare commits: 26397d69c6...main

66 Commits:
512cfd75dc, 8683d570a1, a1a98ad3c6, 26ae98d977, 619a1dfdf2, a9e978effb, 825335cef9, a97115593c,
3dd0d8a656, f137326339, 51098ed43c, 6b337e1167, bbf36f5a4e, b324d71b3f, 2681861e4b, 4f0188abeb,
f4ed332b18, d9066aa241, c68799703b, c32d1779f8, eda80e7e66, d13da5608d, d47261a3b7, 383a598fc7,
8afa2ff944, fe1207ee78, 6a59b7d7e6, bc2a9bb352, 5d02b6466c, b6b419471d, 85b41ba4e0, ebbb0f8e24,
218ee84d5f, c476fa56fb, a76abc331f, 44deb34685, ca46bcf6d5, 5042f822ef, fdb77838b8, 6d3f4ac206,
baa3e78045, 0972cf4aa1, 4f81d377a0, 153048eda4, 4aa5745d06, 7d3f617966, 8918821413, 9783c7d39c,
af68c1ec3b, 0baadb5089, 3b7e576d20, d0a7cdbe38, ed087f3fc6, 51e6c0e1c2, 8a991bee47, d9e2f407e7,
01820776af, d5d4f7ff55, 2a61bdc028, c2b8eef4f4, 533cca0108, 4ac8c47127, bcbb119b20, ce6e6cde22,
610835925b, 16ac42bad9
@@ -8,9 +8,9 @@ steps:
- git lfs install
- git lfs pull
- name: build
image: git.ipng.ch/ipng/drone-hugo:release-0.134.3
image: git.ipng.ch/ipng/drone-hugo:release-0.148.2
settings:
hugo_version: 0.134.3
hugo_version: 0.148.2
extended: true
- name: rsync
image: drillster/drone-rsync

@@ -26,7 +26,7 @@ steps:
port: 22
args: '-6u --delete-after'
source: public/
target: /var/www/ipng.ch/
target: /nginx/sites/ipng.ch/
recursive: true
secrets: [ drone_sshkey ]
@@ -8,7 +8,7 @@ Historical context - todo, but notes for now

1. started with stack.nl (when it was still stack.urc.tue.nl), 6bone and watching NASA multicast video in 1997.
2. founded ipng.nl project, first IPv6 in NL that was usable outside of NREN.
3. attacted attention of the first few IPv6 partitipants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day
3. attracted attention of the first few IPv6 participants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day
4. launched IPv6 at AMS-IX, first IXP prefix allocated 2001:768:1::/48
> My Brilliant Idea Of The Day -- encode AS number in leetspeak: `::AS01:2859:1`, because who would've thought we would ever run out of 16 bit AS numbers :)
5. IPng rearchitected to SixXS, and became a very large scale deployment of IPv6 tunnelbroker; our main central provisioning system moved around a few times between ISPs (Intouch, Concepts ICT, BIT, IP Man)
@@ -185,7 +185,7 @@ function is_coloclue_beacon()
}
```

Then, I ran the configuration again with one IPv4 beacon set on dcg-1, and still all the bird configs on both IPv4 and IPv6 for all routers parsed correctly, and the generated function on the dcg-1 IPv4 filters file was popupated:
Then, I ran the configuration again with one IPv4 beacon set on dcg-1, and still all the bird configs on both IPv4 and IPv6 for all routers parsed correctly, and the generated function on the dcg-1 IPv4 filters file was populated:
```
function is_coloclue_beacon()
{
@@ -89,7 +89,7 @@ lcp lcp-sync off
```

The prep work for the rest of the interface syncer starts with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.

### Change interface: state

@@ -120,7 +120,7 @@ the state it was. I did notice that you can't bring up a sub-interface if its pa
is down, which I found counterintuitive, but that's neither here nor there.

All of this is to say that we have to be careful when copying state forward, because as
this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
shows, issuing `set int state ... up` on an interface, won't touch its sub-interfaces in VPP, but
the subsequent netlink message to bring the _LIP_ for that interface up, **will** update the
children, thus desynchronising Linux and VPP: Linux will have interface **and all its

@@ -128,7 +128,7 @@ sub-interfaces** up unconditionally; VPP will have the interface up and its sub-
whatever state they were before.

To address this, a second
[[commit](https://github.com/pimvanpelt/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
needed. I'm not too sure I want to keep this behavior, but for now, it results in an intuitive
end-state, which is that all interfaces states are exactly the same between Linux and VPP.
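Since the intended end state is "identical admin state on both sides", here is a small spot-check sketch. This is my own illustration, not lcpng code; it assumes `vpp_papi` is installed, the API socket lives at `/run/vpp/api.sock`, the Linux companion devices share the VPP interface names, and that bit 0 of `sw_interface_details.flags` means admin-up.

```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles

IFF_UP = 0x1

def linux_admin_up(ifname):
    # Read the Linux admin flag straight from sysfs; None when there is
    # no companion device with this name.
    try:
        with open(f"/sys/class/net/{ifname}/flags") as f:
            return bool(int(f.read().strip(), 16) & IFF_UP)
    except FileNotFoundError:
        return None

apidir = VPPApiJSONFiles.find_api_dir([])
vpp = VPPApiClient(apifiles=VPPApiJSONFiles.find_api_files(api_dir=apidir),
                   server_address="/run/vpp/api.sock")
vpp.connect("state-compare")
for i in vpp.api.sw_interface_dump():
    vpp_up = bool(i.flags & 1)   # assumption: bit 0 of flags is admin-up
    linux_up = linux_admin_up(i.interface_name)
    if linux_up is not None and linux_up != vpp_up:
        print(f"{i.interface_name}: VPP={'up' if vpp_up else 'down'} "
              f"Linux={'up' if linux_up else 'down'}")
vpp.disconnect()
```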
@@ -157,7 +157,7 @@ DBGvpp# set int state TenGigabitEthernet3/0/0 up
### Change interface: MTU

Finally, a straight forward
[[commit](https://github.com/pimvanpelt/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
[[commit](https://git.ipng.ch/ipng/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
so I thought. When the MTU changes in VPP (with `set interface mtu packet N <int>`), there is
callback that can be registered which copies this into the _LIP_. I did notice a specific corner
case: In VPP, a sub-interface can have a larger MTU than its parent. In Linux, this cannot happen,

@@ -179,7 +179,7 @@ higher than that, perhaps logging an error explaining why. This means two things:
1. Any change in VPP of a parent MTU should ensure all children are clamped to at most that.

I addressed the issue in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].
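To make that clamping rule concrete, here is a tiny illustrative helper (my own sketch, not the plugin's C implementation): when the parent MTU changes, no child may exceed it.

```python
def clamp_children_mtu(parent_mtu: int, children_mtu: dict) -> dict:
    """Return the children's MTUs, clamped to at most the parent's MTU."""
    return {name: min(mtu, parent_mtu) for name, mtu in children_mtu.items()}

# Example: parent drops to 1500; a child at 9000 gets clamped, 1400 is left alone.
print(clamp_children_mtu(1500, {"eth0.100": 9000, "eth0.200": 1400}))
# {'eth0.100': 1500, 'eth0.200': 1400}
```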
### Change interface: IP Addresses

@@ -199,7 +199,7 @@ VPP into the companion Linux devices:
_LIP_ with `lcp_itf_set_interface_addr()`.

This means with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
[[commit](https://git.ipng.ch/ipng/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
any time a new _LIP_ is created, the IPv4 and IPv6 address on the VPP interface are fully copied
over by the third change, while at runtime, new addresses can be set/removed as well by the first
and second change.
@@ -100,7 +100,7 @@ linux-cp {

Based on this config, I set the startup default in `lcp_set_lcp_auto_subint()`, but I realize that
an administrator may want to turn it on/off at runtime, too, so I add a CLI getter/setter that
interacts with the flag in this [[commit](https://github.com/pimvanpelt/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:
interacts with the flag in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:

```
DBGvpp# show lcp

@@ -116,11 +116,11 @@ lcp lcp-sync off
```

The prep work for the rest of the interface syncer starts with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.

The code for the configuration toggle is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].

### Auto create/delete sub-interfaces

@@ -145,7 +145,7 @@ I noticed that interface deletion had a bug (one that I fell victim to as well:
remove the netlink device in the correct network namespace), which I fixed.

The code for the auto create/delete and the bugfix is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].

### Further Work
@@ -154,7 +154,7 @@ For now, `lcp_nl_dispatch()` just throws the message away after logging it with
a function that will come in very useful as I start to explore all the different Netlink message types.

The code that forms the basis of our Netlink Listener lives in [[this
commit](https://github.com/pimvanpelt/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
commit](https://git.ipng.ch/ipng/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
specifically, here I want to call out I was not the primary author, I worked off of Matt and Neale's
awesome work in this pending [Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122).

@@ -182,7 +182,7 @@ Linux interface VPP is not aware of. But, if I can find the _LIP_, I can convert
add or remove the ip4/ip6 neighbor adjacency.

The code for this first Netlink message handler lives in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
[[commit](https://git.ipng.ch/ipng/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
ironic insight is that after writing the code, I don't think any of it will be necessary, because
the interface plugin will already copy ARP and IPv6 ND packets back and forth and itself update its
neighbor adjacency tables; but I'm leaving the code in for now.
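For readers who want to see what those neighbor updates look like from userspace, here is a hedged sketch using the `pyroute2` library (my own choice for illustration; the plugin itself is C inside VPP and does not use this library). It subscribes to the kernel's rtnetlink broadcasts and prints neighbor add/delete events.

```python
# Illustrative only: watch RTM_NEWNEIGH / RTM_DELNEIGH messages from the kernel.
from pyroute2 import IPRoute

ipr = IPRoute()
ipr.bind()  # subscribe to kernel multicast groups (RTM_* notifications)
while True:
    for msg in ipr.get():
        if msg["event"] in ("RTM_NEWNEIGH", "RTM_DELNEIGH"):
            dst = msg.get_attr("NDA_DST")        # neighbor IP address
            lladdr = msg.get_attr("NDA_LLADDR")  # neighbor MAC address
            print(msg["event"], "ifindex", msg["ifindex"], dst, lladdr)
```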
@@ -197,7 +197,7 @@ it or remove it, and if there are no link-local addresses left, disable IPv6 on
There's also a few multicast routes to add (notably 224.0.0.0/24 and ff00::/8, all-local-subnet).

The code for IP address handling is in this
[[commit]](https://github.com/pimvanpelt/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
[[commit]](https://git.ipng.ch/ipng/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
when I took it out for a spin, I noticed something curious, looking at the log lines that are
generated for the following sequence:

@@ -236,7 +236,7 @@ interface and directly connected route addition/deletion is slightly different i
So, I decide to take a little shortcut -- if an addition returns "already there", or a deletion returns
"no such entry", I'll just consider it a successful addition and deletion respectively, saving my eyes
from being screamed at by this red error message. I changed that in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
[[commit](https://git.ipng.ch/ipng/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
turning this situation in a friendly green notice instead.

### Netlink: Link (existing)

@@ -267,7 +267,7 @@ To avoid this loop, I temporarily turn off `lcp-sync` just before handling a bat
turn it back to its original state when I'm done with that.

The code for all/del of existing links is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].

### Netlink: Link (new)

@@ -276,7 +276,7 @@ doesn't have a _LIP_ for, but specifically describes a VLAN interface? Well, th
is trying to create a new sub-interface. And supporting that operation would be super cool, so let's go!

Using the earlier placeholder hint in `lcp_nl_link_add()` (see the previous
[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
I know that I've gotten a NEWLINK request but the Linux ifindex doesn't have a _LIP_. This could be
because the interface is entirely foreign to VPP, for example somebody created a dummy interface or
a VLAN sub-interface on one:

@@ -331,7 +331,7 @@ a boring `<phy>.<subid>` name.

Alright, without further ado, the code for the main innovation here, the implementation of
`lcp_nl_link_add_vlan()`, is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].

## Results
@@ -118,7 +118,7 @@ or Virtual Routing/Forwarding domains). So first, I need to add these:

All of this code was heavily inspired by the pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)]
but a few finishing touches were added, and wrapped up in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].

### Deletion

@@ -459,7 +459,7 @@ it as 'unreachable' rather than deleting it. These are *additions* which have a
but with an interface index of 1 (which, in Netlink, is 'lo'). This makes VPP intermittently crash, so I
currently commented this out, while I gain better understanding. Result: blackhole/unreachable/prohibit
specials can not be set using the plugin. Beware!
(disabled in this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).
(disabled in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).

## Credits
@@ -88,7 +88,7 @@ stat['/if/rx-miss'][:, 1].sum() - returns the sum of packet counters for
```

Alright, so let's grab that file and refactor it into a small library for me to use, I do
this in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
this in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
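As a hedged sketch of what such a small stats helper boils down to (assuming `vpp_papi` is installed and the stats socket sits at its default `/run/vpp/stats.sock`; the names here are mine):

```python
from vpp_papi.vpp_stats import VPPStats

stats = VPPStats("/run/vpp/stats.sock")
stats.connect()
# Mirrors the pattern quoted above: sum the rx-miss packet counters for the
# interface with sw_if_index 1, across all worker threads.
print(stats["/if/rx-miss"][:, 1].sum())
stats.disconnect()
```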
### VPP's API

@@ -159,7 +159,7 @@ idx=19 name=tap4 mac=02:fe:17:06:fc:af mtu=9000 flags=3

So I added a little abstration with some error handling and one main function
to return interfaces as a Python dictionary of those `sw_interface_details`
tuples in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
tuples in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
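The abstraction being described amounts to something like the following sketch (my own illustration with hypothetical naming; it assumes `vpp_papi` and the default API socket path):

```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles

def get_ifaces(sock="/run/vpp/api.sock"):
    """Return sw_interface_details messages keyed by interface name."""
    apidir = VPPApiJSONFiles.find_api_dir([])
    vpp = VPPApiClient(apifiles=VPPApiJSONFiles.find_api_files(api_dir=apidir),
                       server_address=sock)
    vpp.connect("ifdump-sketch")
    try:
        return {i.interface_name: i for i in vpp.api.sw_interface_dump()}
    finally:
        vpp.disconnect()

for name, detail in get_ifaces().items():
    print(f"idx={detail.sw_if_index} name={name}")
```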
### AgentX

@@ -207,9 +207,9 @@ once asked with `GetPDU` or `GetNextPDU` requests, by issuing a corresponding `R
to the SNMP server -- it takes care of all the rest!

The resulting code is in [[this
commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
but you can also check out the whole thing on
[[Github](https://github.com/pimvanpelt/vpp-snmp-agent)].
[[Github](https://git.ipng.ch/ipng/vpp-snmp-agent)].
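To give a flavour of the register-and-update flow described here, a bare-bones AgentX following the pyagentx README pattern might look like the sketch below. The OID and value are placeholders of mine, not the agent's real MIB layout.

```python
import pyagentx

class IfDescrUpdater(pyagentx.Updater):
    def update(self):
        # Re-read periodically; the master SNMP daemon answers Get/GetNext
        # requests for the registered subtree from this cached value.
        self.set_OCTETSTRING("1", "TenGigabitEthernet3/0/0")

class SketchAgent(pyagentx.Agent):
    def setup(self):
        # Registers the subtree with the master agent (the Register PDU).
        self.register("1.3.6.1.2.1.2.2.1.2", IfDescrUpdater)

pyagentx.setup_logging()
SketchAgent().start()
```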
### Building

@@ -480,7 +480,7 @@ is to say, those packets which were destined to any IP address configured on the
plane. Any traffic going _through_ VPP will never be seen by Linux! So, I'll have to be
clever and count this traffic by polling VPP instead. This was the topic of my previous
[VPP Part 6]({{< ref "2021-09-10-vpp-6" >}}) about the SNMP Agent. All of that code
was released to [Github](https://github.com/pimvanpelt/vpp-snmp-agent), notably there's
was released to [Github](https://git.ipng.ch/ipng/vpp-snmp-agent), notably there's
a hint there for an `snmpd-dataplane.service` and a `vpp-snmp-agent.service`, including
the compiled binary that reads from VPP and feeds this to SNMP.
@@ -30,9 +30,9 @@ virtual machine running in Qemu/KVM into a working setup with both [Free Range R
and [Bird](https://bird.network.cz/) installed side by side.

**NOTE**: If you're just interested in the resulting image, here's the most pertinent information:
> * ***vpp-proto.qcow2.lrz [[Download](https://ipng.ch/media/vpp-proto/vpp-proto-bookworm-20231015.qcow2.lrz)]***
> * ***SHA256*** `bff03a80ccd1c0094d867d1eb1b669720a1838330c0a5a526439ecb1a2457309`
> * ***Debian Bookworm (12.4)*** and ***VPP 24.02-rc0~46-ga16463610e***
> * ***vpp-proto.qcow2.lrz*** [[Download](https://ipng.ch/media/vpp-proto/vpp-proto-bookworm-20250607.qcow2.lrz)]
> * ***SHA256*** `a5fdf157c03f2d202dcccdf6ed97db49c8aa5fdb6b9ca83a1da958a8a24780ab`
> * ***Debian Bookworm (12.11)*** and ***VPP 25.10-rc0~49-g90d92196***
> * ***CPU*** Make sure the (virtualized) CPU supports AVX
> * ***RAM*** The image needs at least 4GB of RAM, and the hypervisor should support hugepages and AVX
> * ***Username***: `ipng` with ***password***: `ipng loves vpp` and is sudo-enabled
@@ -62,7 +62,7 @@ plugins:
or route, or the system receiving ARP or IPv6 neighbor request/reply from neighbors), and applying
these events to the VPP dataplane.

I've published the code on [Github](https://github.com/pimvanpelt/lcpng/) and I am targeting a release
I've published the code on [Github](https://git.ipng.ch/ipng/lcpng/) and I am targeting a release
in upstream VPP, hoping to make the upcoming 22.02 release in February 2022. I have a lot of ground to
cover, but I will note that the plugin has been running in production in [AS8298]({{< ref "2021-02-27-network" >}})
since Sep'21 and no crashes related to LinuxCP have been observed.

@@ -195,7 +195,7 @@ So grab a cup of tea, while we let Rhino stretch its legs, ehh, CPUs ...
pim@rhino:~$ mkdir -p ~/src
pim@rhino:~$ cd ~/src
pim@rhino:~/src$ sudo apt install libmnl-dev
pim@rhino:~/src$ git clone https://github.com/pimvanpelt/lcpng.git
pim@rhino:~/src$ git clone https://git.ipng.ch/ipng/lcpng.git
pim@rhino:~/src$ git clone https://gerrit.fd.io/r/vpp
pim@rhino:~/src$ ln -s ~/src/lcpng ~/src/vpp/src/plugins/lcpng
pim@rhino:~/src$ cd ~/src/vpp
@@ -33,7 +33,7 @@ In this first post, let's take a look at tablestakes: writing a YAML specificati
configuration elements of VPP, and then ensures that the YAML file is both syntactically as well as
semantically correct.

**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).
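As a toy illustration of the "syntactically and then semantically correct" idea (the schema fragment and the single rule below are my own examples, not vppcfg's actual schema):

```python
import yaml

# Syntactic stage: PyYAML raises on malformed YAML.
doc = yaml.safe_load("""
interfaces:
  GigabitEthernet3/0/0:
    mtu: 9000
    sub-interfaces:
      100: { mtu: 9216 }
""")

# Semantic stage: one example rule -- a sub-interface MTU may not exceed its parent's.
errors = []
for ifname, iface in doc.get("interfaces", {}).items():
    parent_mtu = iface.get("mtu", 1500)
    for subid, sub in iface.get("sub-interfaces", {}).items():
        if sub.get("mtu", parent_mtu) > parent_mtu:
            errors.append(f"{ifname}.{subid}: MTU exceeds parent MTU {parent_mtu}")

print(errors or "OK")
# ['GigabitEthernet3/0/0.100: MTU exceeds parent MTU 9000']
```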
@@ -348,7 +348,7 @@ to mess up my (or your!) VPP router by feeding it garbage, so the lions' share o
has been to assert the YAML file is both syntactically and semantically valid.

In the mean time, you can take a look at my code on [GitHub](https://github.com/pimvanpelt/vppcfg), but to
In the mean time, you can take a look at my code on [GitHub](https://git.ipng.ch/ipng/vppcfg), but to
whet your appetite, here's a hefty configuration that demonstrates all implemented types:

```

@@ -32,7 +32,7 @@ the configuration to the dataplane. Welcome to `vppcfg`!

In this second post of the series, I want to talk a little bit about how planning a path from a running
configuration to a desired new configuration might look like.

**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).
@@ -171,12 +171,12 @@ GigabitEthernet1/0/0 1 up GigabitEthernet1/0/0

After this exploratory exercise, I have learned enough about the hardware to be able to take the
Fitlet2 out for a spin. To configure the VPP instance, I turn to
[[vppcfg](https://github.com/pimvanpelt/vppcfg)], which can take a YAML configuration file
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)], which can take a YAML configuration file
describing the desired VPP configuration, and apply it safely to the running dataplane using the VPP
API. I've written a few more posts on how it does that, notably on its [[syntax]({{< ref "2022-03-27-vppcfg-1" >}})]
and its [[planner]({{< ref "2022-04-02-vppcfg-2" >}})]. A complete
configuration guide on vppcfg can be found
[[here](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md)].
[[here](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md)].

```
pim@fitlet:~$ sudo dpkg -i {lib,}vpp*23.06*deb
@@ -185,7 +185,7 @@ forgetful chipmunk-sized brain!), so here, I'll only recap what's already writte

**1. BUILD:** For the first step, the build is straight forward, and yields a VPP instance based on
`vpp-ext-deps_23.06-1` at version `23.06-rc0~71-g182d2b466`, which contains my
[[LCPng](https://github.com/pimvanpelt/lcpng.git)] plugin. I then copy the packages to the router.
[[LCPng](https://git.ipng.ch/ipng/lcpng.git)] plugin. I then copy the packages to the router.
The router has an E-2286G CPU @ 4.00GHz with 6 cores and 6 hyperthreads. There's a really handy tool
called `likwid-topology` that can show how the L1, L2 and L3 cache lines up with respect to CPU
cores. Here I learn that CPU (0+6) and (1+7) share L1 and L2 cache -- so I can conclude that 0-5 are

@@ -351,7 +351,7 @@ in `vppcfg`:

* When I create the initial `--novpp` config, there's a bug in `vppcfg` where I incorrectly
reference a dataplane object which I haven't initialized (because with `--novpp` the tool
will not contact the dataplane at all. That one was easy to fix, which I did in [[this
commit](https://github.com/pimvanpelt/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).
commit](https://git.ipng.ch/ipng/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).

After that small detour, I can now proceed to configure the dataplane by offering the resulting
VPP commands, like so:

@@ -573,7 +573,7 @@ see is that which is destined to the controlplane (eg, to one of the IPv4 or IPv
multicast/broadcast groups that they are participating in), so things like tcpdump or SNMP won't
really work.

However, due to my [[vpp-snmp-agent](https://github.com/pimvanpelt/vpp-snmp-agent.git)], which is
However, due to my [[vpp-snmp-agent](https://git.ipng.ch/ipng/vpp-snmp-agent.git)], which is
feeding as an AgentX behind an snmpd that in turn is running in the `dataplane` namespace, SNMP scrapes
work as they did before, albeit with a few different interface names.
@@ -14,7 +14,7 @@ performance and versatility. For those of us who have used Cisco IOS/XR devices,
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

I've been working on the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)], which you
I've been working on the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)], which you
can read all about in my series on VPP back in 2021:

[{: style="width:300px; float: right; margin-left: 1em;"}](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)

@@ -70,7 +70,7 @@ answered by a Response PDU.

Using parts of a Python Agentx library written by GitHub user hosthvo
[[ref](https://github.com/hosthvo/pyagentx)], I tried my hands at writing one of these AgentX's.
The resulting source code is on [[GitHub](https://github.com/pimvanpelt/vpp-snmp-agent)]. That's the
The resulting source code is on [[GitHub](https://git.ipng.ch/ipng/vpp-snmp-agent)]. That's the
one that's running in production ever since I started running VPP routers at IPng Networks AS8298.
After the _AgentX_ exposes the dataplane interfaces and their statistics into _SNMP_, an open source
monitoring tool such as LibreNMS [[ref](https://librenms.org/)] can discover the routers and draw

@@ -126,7 +126,7 @@ for any interface created in the dataplane.

I wish I were good at Go, but I never really took to the language. I'm pretty good at Python, but
sorting through the stats segment isn't super quick as I've already noticed in the Python3 based
[[VPP SNMP Agent](https://github.com/pimvanpelt/vpp-snmp-agent)]. I'm probably the world's least
[[VPP SNMP Agent](https://git.ipng.ch/ipng/vpp-snmp-agent)]. I'm probably the world's least
terrible C programmer, so maybe I can take a look at the VPP Stats Client and make sense of it. Luckily,
there's an example already in `src/vpp/app/vpp_get_stats.c` and it reveals the following pattern:
@@ -19,7 +19,7 @@ same time keep an IPng Site Local network with IPv4 and IPv6 that is separate fr
based on hardware/silicon based forwarding at line rate and high availability. You can read all
about my Centec MPLS shenanigans in [[this article]({{< ref "2023-03-11-mpls-core" >}})].

Ever since the release of the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)]
Ever since the release of the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)]
plugin in VPP, folks have asked "What about MPLS?" -- I have never really felt the need to go this
rabbit hole, because I figured that in this day and age, higher level IP protocols that do tunneling
are just as performant, and a little bit less of an 'art' to get right. For example, the Centec

@@ -459,6 +459,6 @@ and VPP, and the overall implementation before attempting to use in production.
we got at least some of this right, but testing and runtime experience will tell.

I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!
@@ -385,5 +385,5 @@ and VPP, and the overall implementation before attempting to use in production.
we got at least some of this right, but testing and runtime experience will tell.

I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!
@@ -304,7 +304,7 @@ Gateway, just to show a few of the more advanced features of VPP. For me, this t
line of thinking: classifiers. This extract/match/act pattern can be used in policers, ACLs and
arbitrary traffic redirection through VPP's directed graph (eg. selecting a next node for
processing). I'm going to deep-dive into this classifier behavior in an upcoming article, and see
how I might add this to [[vppcfg](https://github.com/pimvanpelt/vppcfg.git)], because I think it
how I might add this to [[vppcfg](https://git.ipng.ch/ipng/vppcfg.git)], because I think it
would be super powerful to abstract away the rather complex underlying API into something a little
bit more ... user friendly. Stay tuned! :)
@@ -359,7 +359,7 @@ does not have an IPv4 address. Except -- I'm bending the rules a little bit by d
There's an internal function `ip4_sw_interface_enable_disable()` which is called to enable IPv4
processing on an interface once the first IPv4 address is added. So my first fix is to force this to
be enabled for any interface that is exposed via Linux Control Plane, notably in `lcp_itf_pair_create()`
[[here](https://github.com/pimvanpelt/lcpng/blob/main/lcpng_interface.c#L777)].
[[here](https://git.ipng.ch/ipng/lcpng/blob/main/lcpng_interface.c#L777)].

This approach is partially effective:

@@ -500,7 +500,7 @@ which is unnumbered. Because I don't know for sure if everybody would find this
I make sure to guard the behavior behind a backwards compatible configuration option.

If you're curious, please take a look at the change in my [[GitHub
repo](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
repo](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
which I:
1. add a new configuration option, `lcp-sync-unnumbered`, which defaults to `on`. That would be
what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.
@@ -147,7 +147,7 @@ With all of that, I am ready to demonstrate two working solutions now. I first c
Ondrej's [[commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)].
Then, I compile VPP with my pending [[gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. Finally,
to demonstrate how `update_loopback_addr()` might work, I compile `lcpng` with my previous
[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
which allows me to inhibit copying forward addresses from VPP to Linux, when using _unnumbered_
interfaces.
@@ -1,8 +1,9 @@
---
date: "2024-04-27T10:52:11Z"
title: FreeIX - Remote
title: "FreeIX Remote - Part 1"
aliases:
- /s/articles/2024/04/27/freeix-1.html
- /s/articles/2024/04/27/freeix-remote/
---

# Introduction
@@ -250,10 +250,10 @@ remove the IPv4 and IPv6 addresses from the <span style='color:red;font-weight:b
routers in Brüttisellen. They are directly connected, and if anything goes wrong, I can walk
over and rescue them. Sounds like a safe way to start!

I quickly add the ability for [[vppcfg](https://github.com/pimvanpelt/vppcfg)] to configure
I quickly add the ability for [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] to configure
_unnumbered_ interfaces. In VPP, these are interfaces that don't have an IPv4 or IPv6 address of
their own, but they borrow one from another interface. If you're curious, you can take a look at the
[[User Guide](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
[[User Guide](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
GitHub.

Looking at their `vppcfg` files, the change is actually very easy, taking as an example the

@@ -291,7 +291,7 @@ interface.

In the article, you'll see that discussed as _Solution 2_, and it includes a bit of rationale why I
find this better. I implemented it in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
case you're curious, and the commandline keyword is `lcp lcp-sync-unnumbered off` (the default is
_on_).
content/articles/2024-09-03-asr9001.md (new file, 238 lines)
@@ -0,0 +1,238 @@
---
date: "2024-09-03T13:07:54Z"
title: Loadtest notes, ASR9001
draft: true
---

### L2 point-to-point (L2XC) config

```
interface TenGigE0/0/0/0
mtu 9216
load-interval 30
l2transport
!
!
interface TenGigE0/0/0/1
mtu 9216
load-interval 30
l2transport
!
!
interface TenGigE0/0/0/2
mtu 9216
load-interval 30
l2transport
!
!
interface TenGigE0/0/0/3
mtu 9216
load-interval 30
l2transport
!
!

...
l2vpn
load-balancing flow src-dst-ip
logging
bridge-domain
pseudowire
!
xconnect group LoadTest
p2p pair0
interface TenGigE0/0/2/0
interface TenGigE0/0/2/1
!
p2p pair1
interface TenGigE0/0/2/2
interface TenGigE0/0/2/3
!
...
```

### L2 Bridge-Domain

```
l2vpn
bridge group LoadTestp
bridge-domain bd0
interface TenGigE0/0/0/0
!
interface TenGigE0/0/0/1
!
!
bridge-domain bd1
interface TenGigE0/0/0/2
!
interface TenGigE0/0/0/3
!
!
...
```

```
RP/0/RSP0/CPU0:micro-fridge#show l2vpn forwarding bridge-domain mac-address location 0/0/CPU0
Sat Aug 31 12:09:08.957 UTC
Mac Address Type Learned from/Filtered on LC learned Resync Age Mapped to
--------------------------------------------------------------------------------
9c69.b461.fcf2 dynamic Te0/0/0/0 0/0/CPU0 0d 0h 0m 14s N/A
9c69.b461.fcf3 dynamic Te0/0/0/1 0/0/CPU0 0d 0h 0m 2s N/A
001b.2155.1f11 dynamic Te0/0/0/2 0/0/CPU0 0d 0h 0m 0s N/A
001b.2155.1f10 dynamic Te0/0/0/3 0/0/CPU0 0d 0h 0m 15s N/A
001b.21bc.47a4 dynamic Te0/0/1/0 0/0/CPU0 0d 0h 0m 6s N/A
001b.21bc.47a5 dynamic Te0/0/1/1 0/0/CPU0 0d 0h 0m 21s N/A
9c69.b461.ff41 dynamic Te0/0/1/2 0/0/CPU0 0d 0h 0m 16s N/A
9c69.b461.ff40 dynamic Te0/0/1/3 0/0/CPU0 0d 0h 0m 10s N/A
001b.2155.1d1d dynamic Te0/0/2/0 0/0/CPU0 0d 0h 0m 9s N/A
001b.2155.1d1c dynamic Te0/0/2/1 0/0/CPU0 0d 0h 0m 16s N/A
001b.2155.1e08 dynamic Te0/0/2/2 0/0/CPU0 0d 0h 0m 4s N/A
001b.2155.1e09 dynamic Te0/0/2/3 0/0/CPU0 0d 0h 0m 11s N/A
```

Interesting finding, after a bridge-domain overload occurs, forwarding pretty much stops
```
Te0/0/0/0:
30 second input rate 6931755000 bits/sec, 14441158 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec
Te0/0/0/1:
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 19492000 bits/sec, 40609 packets/sec

Te0/0/0/2:
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 19720000 bits/sec, 41084 packets/sec
Te0/0/0/3:
30 second input rate 6931728000 bits/sec, 14441100 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec

... and so on

30 second input rate 6931558000 bits/sec, 14440748 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 12627000 bits/sec, 26307 packets/sec
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 12710000 bits/sec, 26479 packets/sec
30 second input rate 6931542000 bits/sec, 14440712 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 19196000 bits/sec, 39992 packets/sec
30 second input rate 6931651000 bits/sec, 14440938 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec
30 second input rate 6931658000 bits/sec, 14440958 packets/sec
30 second output rate 0 bits/sec, 0 packets/sec
30 second input rate 0 bits/sec, 0 packets/sec
30 second output rate 13167000 bits/sec, 27431 packets/sec
```

MPLS enabled test:
```
arp vrf default 100.64.0.2 001b.2155.1e08 ARPA
arp vrf default 100.64.1.2 001b.2155.1e09 ARPA
arp vrf default 100.64.2.2 001b.2155.1d1c ARPA
arp vrf default 100.64.3.2 001b.2155.1d1d ARPA
arp vrf default 100.64.4.2 001b.21bc.47a4 ARPA
arp vrf default 100.64.5.2 001b.21bc.47a5 ARPA
arp vrf default 100.64.6.2 9c69.b461.fcf2 ARPA
arp vrf default 100.64.7.2 9c69.b461.fcf3 ARPA
arp vrf default 100.64.8.2 001b.2155.1f10 ARPA
arp vrf default 100.64.9.2 001b.2155.1f11 ARPA
arp vrf default 100.64.10.2 9c69.b461.ff40 ARPA
arp vrf default 100.64.11.2 9c69.b461.ff41 ARPA

router static
address-family ipv4 unicast
0.0.0.0/0 198.19.5.1
16.0.0.0/24 100.64.0.2
16.0.1.0/24 100.64.2.2
16.0.2.0/24 100.64.4.2
16.0.3.0/24 100.64.6.2
16.0.4.0/24 100.64.8.2
16.0.5.0/24 100.64.10.2
48.0.0.0/24 100.64.1.2
48.0.1.0/24 100.64.3.2
48.0.2.0/24 100.64.5.2
48.0.3.0/24 100.64.7.2
48.0.4.0/24 100.64.9.2
48.0.5.0/24 100.64.11.2
!
!

mpls static
interface TenGigE0/0/0/0
interface TenGigE0/0/0/1
interface TenGigE0/0/0/2
interface TenGigE0/0/0/3
interface TenGigE0/0/1/0
interface TenGigE0/0/1/1
interface TenGigE0/0/1/2
interface TenGigE0/0/1/3
interface TenGigE0/0/2/0
interface TenGigE0/0/2/1
interface TenGigE0/0/2/2
interface TenGigE0/0/2/3
address-family ipv4 unicast
local-label 16 allocate
forward
path 1 nexthop TenGigE0/0/2/3 100.64.1.2 out-label 17
!
!
local-label 17 allocate
forward
path 1 nexthop TenGigE0/0/2/2 100.64.0.2 out-label 16
!
!
local-label 18 allocate
forward
path 1 nexthop TenGigE0/0/2/0 100.64.3.2 out-label 19
!
!
local-label 19 allocate
forward
path 1 nexthop TenGigE0/0/2/1 100.64.2.2 out-label 18
!
!
local-label 20 allocate
forward
path 1 nexthop TenGigE0/0/1/1 100.64.5.2 out-label 21
!
!
local-label 21 allocate
forward
path 1 nexthop TenGigE0/0/1/0 100.64.4.2 out-label 20
!
!
local-label 22 allocate
forward
path 1 nexthop TenGigE0/0/0/1 100.64.7.2 out-label 23
!
!
local-label 23 allocate
forward
path 1 nexthop TenGigE0/0/0/0 100.64.6.2 out-label 22
!
!
local-label 24 allocate
forward
path 1 nexthop TenGigE0/0/0/2 100.64.9.2 out-label 25
!
!
local-label 25 allocate
forward
path 1 nexthop TenGigE0/0/0/3 100.64.8.2 out-label 24
!
!
local-label 26 allocate
forward
path 1 nexthop TenGigE0/0/1/2 100.64.11.2 out-label 27
!
!
local-label 27 allocate
forward
path 1 nexthop TenGigE0/0/1/3 100.64.10.2 out-label 26
!
!
!
!
```
@@ -1,6 +1,6 @@
---
date: "2024-10-21T10:52:11Z"
title: "FreeIX - Remote, part 2"
title: "FreeIX Remote - Part 2"
---

{{< image width="18em" float="right" src="/assets/freeix/freeix-artist-rendering.png" alt="FreeIX, Artists Rendering" >}}

@@ -8,7 +8,7 @@ title: "FreeIX - Remote, part 2"
# Introduction

A few months ago, I wrote about [[an idea]({{< ref 2024-04-27-freeix-1.md >}})] to help boost the
value of small Internet Exchange Points (_IXPs). When such an exchange doesn't have many members,
value of small Internet Exchange Points (_IXPs_). When such an exchange doesn't have many members,
then the operational costs of connecting to it (cross connects, router ports, finding peers, etc)
are not very favorable.
content/articles/2025-02-08-sflow-3.md (new file, 857 lines)
@@ -0,0 +1,857 @@
---
date: "2025-02-08T07:51:23Z"
title: 'VPP with sFlow - Part 3'
---

# Introduction

{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width="12em" >}}

In the second half of last year, I picked up a project together with Neil McKee of
[[inMon](https://inmon.com/)], the caretakers of [[sFlow](https://sflow.org)]: an industry standard
technology for monitoring high speed networks. `sFlow` gives complete visibility into the
use of networks, enabling performance optimization, accounting/billing for usage, and defense against
security threats.

The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for
a small portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but
also in the VPP software dataplane. The agent then _transmits_ these samples using a Linux kernel
feature called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)].
This greatly reduces the complexity of code to be implemented in the forwarding path, while at the
same time bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business
logic for the more complex state keeping, packet marshalling and transmission from the _Agent_ to a
central _Collector_.

In this third article, I wanted to spend some time discussing how samples make their way out of the
VPP dataplane, and into higher level tools.

## Recap: sFlow

{{< image float="left" src="/assets/sflow/sflow-overview.png" alt="sFlow Overview" width="14em" >}}

sFlow describes a method for Monitoring Traffic in Switched/Routed Networks, originally described in
[[RFC3176](https://datatracker.ietf.org/doc/html/rfc3176)]. The current specification is version 5
and is homed on the sFlow.org website [[ref](https://sflow.org/sflow_version_5.txt)]. Typically, a
Switching ASIC in the dataplane (seen at the bottom of the diagram to the left) is asked to copy
1-in-N packets to the local sFlow Agent.

**Sampling**: The agent will copy the first N bytes (typically 128) of the packet into a sample. As
the ASIC knows which interface the packet was received on, the `inIfIndex` will be added. After a
routing decision is made, the nexthop and its L2 address and interface become known. The ASIC might
annotate the sample with this `outIfIndex` and `DstMAC` metadata as well.

**Drop Monitoring**: There's one rather clever insight that sFlow gives: what if the packet _was
not_ routed or switched, but rather discarded? For this, sFlow is able to describe the reason for
the drop. For example, the ASIC receive queue could have been overfull, or it did not find a
destination to forward the packet to (no FIB entry), perhaps it was instructed by an ACL to drop the
packet or maybe even tried to transmit the packet but the physical datalink layer had to abandon the
transmission for whatever reason (link down, TX queue full, link saturation, and so on). It's hard
to overstate how important it is to have this so-called _drop monitoring_, as operators often spend
hours and hours figuring out _why_ packets are lost in their network or datacenter switching fabric.

**Metadata**: The agent may have other metadata as well, such as which prefix was the source and
destination of the packet, what additional RIB information is available (AS path, BGP communities,
and so on). This may be added to the sample record as well.

**Counters**: Since sFlow is sampling 1:N packets, the system can estimate total traffic in a
reasonably accurate way. Peter and Sonia wrote a succinct
[[paper](https://sflow.org/packetSamplingBasics/)] about the math, so I won't get into that here.
Mostly because I am but a software engineer, not a statistician... :) However, I will say this: if a
fraction of the traffic is sampled but the _Agent_ knows how many bytes and packets were forwarded
in total, it can provide an overview with a quantifiable accuracy. This is why the _Agent_ will
periodically get the interface counters from the ASIC.
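The scaling intuition can be captured in a few lines (my own back-of-the-envelope sketch, not the paper's math): with 1-in-N sampling, each sample stands in for roughly N packets.

```python
def estimate_totals(num_samples, sampling_n, avg_sampled_len):
    """Scale sampled observations back up: each sample represents ~sampling_n packets."""
    est_packets = num_samples * sampling_n
    est_bytes = est_packets * avg_sampled_len
    return est_packets, est_bytes

print(estimate_totals(num_samples=1_000, sampling_n=100, avg_sampled_len=700))
# -> (100000, 70000000): roughly 100k packets and 70 MB seen on the wire
```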
**Collector**: One or more samples can be concatenated into UDP messages that go from the _sFlow
Agent_ to a central _sFlow Collector_. The heavy lifting in analysis is done upstream from the
switch or router, which is great for performance. Many thousands or even tens of thousands of
agents can forward their samples and interface counters to a single central collector, which in turn
can be used to draw up a near real time picture of the state of traffic through even the largest of
ISP networks or datacenter switch fabrics.

In sFlow parlance [[VPP](https://fd.io/)] and its companion
[[hsflowd](https://github.com/sflow/host-sflow)] together form an _Agent_ (it sends the UDP packets
over the network), and for example the commandline tool `sflowtool` could be a _Collector_ (it
receives the UDP packets).
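To make that transport concrete, here is a toy stand-in for a collector (my own sketch; a real collector like `sflowtool` actually decodes the sFlow datagrams, which arrive on UDP port 6343 by default). It only counts and sizes the incoming datagrams.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 6343))          # default sFlow collector port
while True:
    datagram, (src, _port) = sock.recvfrom(65535)
    print(f"received {len(datagram)} bytes of sFlow from {src}")
```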
## Recap: sFlow in VPP

First, I have some pretty good news to report - our work on this plugin was
[[merged](https://gerrit.fd.io/r/c/vpp/+/41680)] and will be included in the VPP 25.02 release in a
few weeks! Last weekend, I gave a lightning talk at
[[FOSDEM](https://fosdem.org/2025/schedule/event/fosdem-2025-4196-vpp-monitoring-100gbps-with-sflow/)]
in Brussels, Belgium, and caught up with a lot of community members and network- and software
engineers. I had a great time.

In trying to keep the amount of code as small as possible, and therefore the probability of bugs that
might impact VPP's dataplane stability low, the architecture of the end to end solution consists of
three distinct parts, each with their own risk and performance profile:

{{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}}

**1. sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves
packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin
will inspect 1-in-N, taking a sample for further processing. Here, we don't try to be clever, simply
copy the `inIfIndex` and the first bytes of the ethernet frame, and append them to a
[[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] queue. If too many samples
arrive, samples are dropped at the tail, and a counter incremented. This way, I can tell when the
dataplane is congested. Bounded FIFOs also provide fairness: each VPP worker thread gets its fair
share of samples into the Agent's hands (a toy model of this bounded FIFO follows after this
overview).

**2. sFlow main process**: There's a function running on the _main thread_, which shifts further
processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it
consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
configurable), it'll grab all interface counters from those interfaces for which I have sFlow
turned on. VPP produces _Netlink_ messages and sends them to the kernel.

**3. Host sFlow daemon**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_
messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on
hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy,
this module already exists. But Neil implemented a _mod_vpp_ which can grab interface names and their
`ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages alongside
the PSAMPLEs.
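Here is the toy model of the bounded, tail-dropping FIFO promised above (my own illustration in Python; the plugin itself implements this in C inside the dataplane):

```python
from collections import deque

class SampleFifo:
    """Bounded per-worker sample queue with tail drop and a drop counter."""

    def __init__(self, depth=1024):
        self.q = deque()
        self.depth = depth
        self.dropped = 0

    def push(self, sample: bytes) -> bool:
        if len(self.q) >= self.depth:
            self.dropped += 1      # tail drop: the newest sample is discarded
            return False
        self.q.append(sample)
        return True

    def pop_all(self):
        while self.q:
            yield self.q.popleft()

fifo = SampleFifo(depth=2)
for pkt in (b"a", b"b", b"c"):
    fifo.push(pkt)
print(list(fifo.pop_all()), "dropped:", fifo.dropped)   # [b'a', b'b'] dropped: 1
```

Tail-dropping keeps the samples that are already queued and makes the overflow visible through the counter, which is exactly the signal used to tell that the dataplane is congested.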
By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref
2021-08-25-vpp-4 >}})] plugin. It's a mechanism for programs running in userspace to share
information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
producer/subscriber relationship and nothing precludes one userspace process (`vpp`) from being the
producer while another userspace process (`hsflowd`) acts as the consumer!

Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the rest,
giving correctness and upstream interoperability pretty much for free. That's slick!

### VPP: sFlow Configuration

The solution that I offer is based on two moving parts. First, the VPP plugin configuration, which
turns on sampling at a given rate on physical devices, also known as _hardware-interfaces_. Second,
the open source component [[host-sflow](https://github.com/sflow/host-sflow/releases)] can be
configured as of release v2.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].

I will show how to configure VPP in three ways:

***1. VPP Configuration via CLI***

```
pim@vpp0-0:~$ vppctl
vpp0-0# sflow sampling-rate 100
vpp0-0# sflow polling-interval 10
vpp0-0# sflow header-bytes 128
vpp0-0# sflow enable GigabitEthernet10/0/0
vpp0-0# sflow enable GigabitEthernet10/0/0 disable
vpp0-0# sflow enable GigabitEthernet10/0/2
vpp0-0# sflow enable GigabitEthernet10/0/3
```

The first three commands set the global defaults - in my case I'm going to be sampling at 1:100
which is an unusually high rate. A production setup may take 1-in-_linkspeed-in-megabits_ so for a
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
appropriate, depending on link load. The second command sets the interface stats polling interval.
The default is to gather these statistics every 20 seconds, but I set it to 10s here.

Next, I tell the plugin how many bytes of the sampled ethernet frame should be taken. Common
values are 64 and 128 but it doesn't have to be a power of two. I want enough data to see the
headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but the contents of
the payload are rarely interesting for statistics purposes.

Finally, I can turn on the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP,
an idiomatic way to turn on and off things is to have an enabler/disabler. It feels a bit clunky
maybe to write `sflow enable $iface disable` but it makes more logical sense if you parse that as
"enable-disable" with the default being the "enable" operation, and the alternate being the
"disable" operation.
***2. VPP Configuration via API***

I implemented a few API methods for the most common operations. Here's a snippet that obtains the
same config as what I typed on the CLI above, but using these Python API calls:

```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles
import sys

vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
vpp.connect("sflow-api-client")
print(vpp.api.show_version().version)
# Output: 25.06-rc0~14-g9b1c16039

vpp.api.sflow_sampling_rate_set(sampling_N=100)
print(vpp.api.sflow_sampling_rate_get())
# Output: sflow_sampling_rate_get_reply(_0=655, context=3, sampling_N=100)

vpp.api.sflow_polling_interval_set(polling_S=10)
print(vpp.api.sflow_polling_interval_get())
# Output: sflow_polling_interval_get_reply(_0=661, context=5, polling_S=10)

vpp.api.sflow_header_bytes_set(header_B=128)
print(vpp.api.sflow_header_bytes_get())
# Output: sflow_header_bytes_get_reply(_0=665, context=7, header_B=128)

vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True)
vpp.api.sflow_enable_disable(hw_if_index=2, enable_disable=True)
print(vpp.api.sflow_interface_dump())
# Output: [ sflow_interface_details(_0=667, context=8, hw_if_index=1),
#           sflow_interface_details(_0=667, context=8, hw_if_index=2) ]

print(vpp.api.sflow_interface_dump(hw_if_index=2))
# Output: [ sflow_interface_details(_0=667, context=9, hw_if_index=2) ]

print(vpp.api.sflow_interface_dump(hw_if_index=1234)) ## Invalid hw_if_index
# Output: []

vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=False)
print(vpp.api.sflow_interface_dump())
# Output: [ sflow_interface_details(_0=667, context=10, hw_if_index=2) ]
```

This short program toys around a bit with the sFlow API. I first set the sampling to 1:100 and get
the current value. Then I set the polling interval to 10s and retrieve the current value again.
Finally, I set the header bytes to 128, and retrieve the value again.

Enabling and disabling sFlow on interfaces shows the idiom I mentioned before - the API being an
`*_enable_disable()` call of sorts, and typically taking a boolean argument if the operator wants to
enable (the default), or disable sFlow on the interface. Getting the list of enabled interfaces can
be done with the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details`
messages.

I demonstrated VPP's Python API and how it works in a fair amount of detail in a [[previous
article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you.
***3. VPPCfg YAML Configuration***
|
||||
|
||||
Writing on the CLI and calling the API is good and all, but many users of VPP have noticed that it
|
||||
does not have any form of configuration persistence and that's deliberate. VPP's goal is to be a
|
||||
programmable dataplane, and explicitly has left the programming and configuration as an exercise for
|
||||
integrators. I have written a Python project that takes a YAML file as input and uses it to
|
||||
configure (and reconfigure, on the fly) the dataplane automatically, called
|
||||
[[VPPcfg](https://git.ipng.ch/ipng/vppcfg.git)]. Previously, I wrote some implementation thoughts
|
||||
on its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2
|
||||
>}})] so I won't repeat that here. Instead, I will just show the configuration:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ cat << EOF > vppcfg.yaml
|
||||
interfaces:
|
||||
GigabitEthernet10/0/0:
|
||||
sflow: true
|
||||
GigabitEthernet10/0/1:
|
||||
sflow: true
|
||||
GigabitEthernet10/0/2:
|
||||
sflow: true
|
||||
GigabitEthernet10/0/3:
|
||||
sflow: true
|
||||
|
||||
sflow:
|
||||
sampling-rate: 100
|
||||
polling-interval: 10
|
||||
header-bytes: 128
|
||||
EOF
|
||||
pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp
|
||||
[INFO ] root.main: Loading configfile vppcfg.yaml
|
||||
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
|
||||
[INFO ] root.main: Configuration is valid
|
||||
[INFO ] vppcfg.reconciler.write: Wrote 13 lines to /etc/vpp/config/vppcfg.vpp
|
||||
[INFO ] root.main: Planning succeeded
|
||||
pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp
|
||||
```
|
||||
|
||||
The nifty thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to
|
||||
1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg
|
||||
apply` stages and the VPP dataplane will be reprogrammed to reflect the newly declared configuration.
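As a concrete sketch of that loop, using the `vppcfg.yaml` from above: edit the YAML, re-plan, and feed the
resulting CLI script to VPP (the `sed` one-liner is just an example edit). `vppcfg` also has an `apply` stage
that programs the dataplane over the API directly, but the plan-and-exec flow is the one shown here:

```
pim@vpp0-0:~$ sed -i -e 's/sampling-rate: 100/sampling-rate: 1000/' vppcfg.yaml
pim@vpp0-0:~$ vi vppcfg.yaml    # and, say, remove 'sflow: true' from GigabitEthernet10/0/0
pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp
pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp
```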
|
||||
|
||||
### hsflowd: Configuration
|
||||
|
||||
When sFlow is enabled, VPP will start to emit _Netlink_ messages of type PSAMPLE with packet samples
|
||||
and of type USERSOCK with the custom messages containing interface names and counters. These latter
|
||||
custom messages have to be decoded, which is done by the _mod_vpp_ module in `hsflowd`, starting
|
||||
from release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
|
||||
|
||||
Here's a minimalist configuration:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ cat /etc/hsflowd.conf
|
||||
sflow {
|
||||
collector { ip=127.0.0.1 udpport=16343 }
|
||||
collector { ip=192.0.2.1 namespace=dataplane }
|
||||
psample { group=1 }
|
||||
vpp { osIndex=off }
|
||||
}
|
||||
```
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
There are two important details that can be confusing at first: \
|
||||
**1.** kernel network namespaces \
|
||||
**2.** interface index namespaces
|
||||
|
||||
#### hsflowd: Network namespace
|
||||
|
||||
Network namespaces virtualize Linux's network stack. Upon creation, a network namespace contains only
|
||||
a loopback interface, and subsequently interfaces can be moved between namespaces. Each network
|
||||
namespace will have its own set of IP addresses, its own routing table, socket listing, connection
|
||||
tracking table, firewall, and other network-related resources. When started by systemd, `hsflowd`
|
||||
and VPP will normally both run in the _default_ network namespace.
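A quick way to sanity-check this, assuming both daemons are already running, is to compare the network
namespace of the two processes. If the two `net` symlinks point at the same inode, they share a namespace:

```
pim@vpp0-0:~$ sudo readlink /proc/$(pidof vpp)/ns/net /proc/$(pidof hsflowd)/ns/net
```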
|
||||
|
||||
Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will
|
||||
naturally do this in the network namespace that its VPP process is running in (the _default_
|
||||
namespace, normally). It is therefore important that the recipient of these Netlink messages,
|
||||
notably `hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them together in
|
||||
a different namespace (eg. a container in Kubernetes or Docker), as long as they can see each other.
|
||||
|
||||
It might pose a problem if the network connectivity lives in a different namespace than the default
|
||||
one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface
|
||||
pairs, _LIPs_, in a dataplane namespace. The main reason for doing this is to allow something like
|
||||
FRR or Bird to completely govern the routing table in the kernel and keep it in-sync with the FIB in
|
||||
VPP. In such a _dataplane_ network namespace, typically every interface is owned by VPP.
|
||||
|
||||
Luckily, `hsflowd` can attach to one (default) namespace to get the PSAMPLEs, but create a socket in
|
||||
a _different_ (dataplane) namespace to send packets to a collector. This explains the second
|
||||
_collector_ entry in the config-file above. Here, `hsflowd` will send UDP packets to 192.0.2.1:6343
|
||||
from within the (VPP) dataplane namespace, and to 127.0.0.1:16343 in the default namespace.
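To see both exporters in action, I can watch each namespace separately. This is a sketch, assuming the
dataplane namespace is called `dataplane` as in the `hsflowd.conf` above:

```
pim@vpp0-0:~$ sudo tcpdump -ni lo udp port 16343
pim@vpp0-0:~$ sudo ip netns exec dataplane tcpdump -ni any udp port 6343
```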
|
||||
|
||||
#### hsflowd: osIndex
|
||||
|
||||
I hope the previous section made some sense, because this one will be a tad more esoteric. When
|
||||
creating a network namespace, each interface will get its own uint32 interface index that identifies
|
||||
it, and such an ID is typically called an `ifIndex`. It's important to note that the same number can
|
||||
(and will!) occur multiple times, once for each namespace. Let me give you an example:
|
||||
|
||||
```
|
||||
pim@summer:~$ ip link
|
||||
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
|
||||
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
||||
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ipng-sl state UP ...
|
||||
link/ether 00:22:19:6a:46:2e brd ff:ff:ff:ff:ff:ff
|
||||
altname enp1s0f0
|
||||
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 900 qdisc mq master ipng-sl state DOWN ...
|
||||
link/ether 00:22:19:6a:46:30 brd ff:ff:ff:ff:ff:ff
|
||||
altname enp1s0f1
|
||||
|
||||
pim@summer:~$ ip netns exec dataplane ip link
|
||||
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
|
||||
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
||||
2: loop0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
|
||||
link/ether de:ad:00:00:00:00 brd ff:ff:ff:ff:ff:ff
|
||||
3: xe1-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
|
||||
link/ether 00:1b:21:bd:c7:18 brd ff:ff:ff:ff:ff:ff
|
||||
```
|
||||
|
||||
I want to draw your attention to the number at the beginning of the line. In the _default_
|
||||
namespace, `ifIndex=3` corresponds to `ifName=eno2` (which has no link, it's marked `DOWN`). But in
|
||||
the _dataplane_ namespace, that index corresponds to a completely different interface called
|
||||
`ifName=xe1-0` (which is link `UP`).
|
||||
|
||||
Now, let me show you the interfaces in VPP:
|
||||
|
||||
```
|
||||
pim@summer:~$ vppctl show int | egrep 'Name|loop0|tap0|Gigabit'
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS)
|
||||
GigabitEthernet4/0/0 1 up 9000/0/0/0
|
||||
GigabitEthernet4/0/1 2 down 9000/0/0/0
|
||||
GigabitEthernet4/0/2 3 down 9000/0/0/0
|
||||
GigabitEthernet4/0/3 4 down 9000/0/0/0
|
||||
TenGigabitEthernet5/0/0 5 up 9216/0/0/0
|
||||
TenGigabitEthernet5/0/1 6 up 9216/0/0/0
|
||||
loop0 7 up 9216/0/0/0
|
||||
tap0 19 up 9216/0/0/0
|
||||
```
|
||||
|
||||
Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_
|
||||
(the software interface index, as opposed to hardware index). Here, `ifIndex=3` corresponds to
|
||||
`ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace!
|
||||
|
||||
It turns out that there are three (relevant) types of namespaces at play here:
|
||||
1. ***Linux network*** namespace; here using `dataplane` and `default` each with their own unique
|
||||
(and overlapping) numbering.
|
||||
1. ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP
|
||||
first attaches to or creates network interfaces like the ones from DPDK or RDMA, these will
|
||||
create an _hw_if_index_ in a list.
|
||||
1. ***VPP software*** interface namespace. All interfaces (including hardware ones!) will
|
||||
receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on
|
||||
GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available
|
||||
software index (in this example, `sw_if_index=7`).
|
||||
|
||||
In Linux CP, I can see a mapping from one to the other, just look at this:
|
||||
|
||||
```
|
||||
pim@summer:~$ vppctl show lcp
|
||||
lcp default netns dataplane
|
||||
lcp lcp-auto-subint off
|
||||
lcp lcp-sync on
|
||||
lcp lcp-sync-unnumbered on
|
||||
itf-pair: [0] loop0 tap0 loop0 2 type tap netns dataplane
|
||||
itf-pair: [1] TenGigabitEthernet5/0/0 tap1 xe1-0 3 type tap netns dataplane
|
||||
itf-pair: [2] TenGigabitEthernet5/0/1 tap2 xe1-1 4 type tap netns dataplane
|
||||
itf-pair: [3] TenGigabitEthernet5/0/0.20 tap1.20 xe1-0.20 5 type tap netns dataplane
|
||||
```
|
||||
|
||||
Those `itf-pair` describe our _LIPs_, and they have the coordinates to three things. 1) The VPP
|
||||
software interface (VPP `ifName=loop0` with `sw_if_index=7`), which 2) Linux CP will mirror into the
|
||||
Linux kernel using a TAP device (VPP `ifName=tap0` with `sw_if_index=19`). That TAP has one leg in
|
||||
VPP (`tap0`), and another in 3) Linux (with `ifName=loop0` and `ifIndex=2` in namespace `dataplane`).
|
||||
|
||||
> So the tuple that fully describes a _LIP_ is `{7, 19,'dataplane', 2}`
|
||||
|
||||
Climbing back out of that rabbit hole, I am now finally ready to explain the feature. When sFlow in
|
||||
VPP takes its sample, it will be doing this on a PHY, that is a given interface with a specific
|
||||
_hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has a
|
||||
choice: should it share with the world the representation of *its* namespace, or should it try to be
|
||||
smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the
|
||||
plugin will first resolve the _sw_if_index_ belonging to that PHY, and using that, try to look up a
|
||||
_LIP_ with it. If it finds one, it'll know both the namespace in which it lives as well as the
|
||||
osIndex in that namespace. If it doesn't find a _LIP_, it will at least have the _sw_if_index_ at
|
||||
hand, so it'll annotate the USERSOCK counter messages with this information instead.
|
||||
|
||||
Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an
|
||||
implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases
|
||||
relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on`
|
||||
(use Linux ifIndex) or `off` (use VPP _sw_if_index_).
|
||||
|
||||
### hsflowd: Host Counters
|
||||
|
||||
Now that I understand the configuration parts of VPP and `hsflowd`, I decide to configure everything
|
||||
but without enabling sFlow on any interfaces yet in VPP. Once I start the daemon, I can see that
|
||||
it sends a UDP packet every 30 seconds to the configured _collector_:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n
|
||||
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
|
||||
listening on lo, link-type EN10MB (Ethernet), snapshot length 9000 bytes
|
||||
15:34:19.695042 IP 127.0.0.1.48753 > 127.0.0.1.6343: sFlowv5,
|
||||
IPv4 agent 198.19.5.16, agent-id 100000, length 716
|
||||
```
|
||||
|
||||
The `tcpdump` I have on my Debian bookworm machines doesn't know how to decode the contents of these
|
||||
sFlow packets. Actually, neither does Wireshark. I've attached a file of these mysterious packets
|
||||
[[sflow-host.pcap](/assets/sflow/sflow-host.pcap)] in case you want to take a look.
|
||||
Neil, however, gives me a tip: a full message decoder and otherwise handy Swiss army knife lives in
|
||||
[[sflowtool](https://github.com/sflow/sflowtool)].
|
||||
|
||||
I can offer this pcap file to `sflowtool`, or let it just listen on the UDP port directly, and
|
||||
it'll tell me what it finds:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ sflowtool -p 6343
|
||||
startDatagram =================================
|
||||
datagramSourceIP 127.0.0.1
|
||||
datagramSize 716
|
||||
unixSecondsUTC 1739112018
|
||||
localtime 2025-02-09T15:40:18+0100
|
||||
datagramVersion 5
|
||||
agentSubId 100000
|
||||
agent 198.19.5.16
|
||||
packetSequenceNo 57
|
||||
sysUpTime 987398
|
||||
samplesInPacket 1
|
||||
startSample ----------------------
|
||||
sampleType_tag 0:4
|
||||
sampleType COUNTERSSAMPLE
|
||||
sampleSequenceNo 33
|
||||
sourceId 2:1
|
||||
counterBlock_tag 0:2001
|
||||
adaptor_0_ifIndex 2
|
||||
adaptor_0_MACs 1
|
||||
adaptor_0_MAC_0 525400f00100
|
||||
counterBlock_tag 0:2010
|
||||
udpInDatagrams 123904
|
||||
udpNoPorts 23132459
|
||||
udpInErrors 0
|
||||
udpOutDatagrams 46480629
|
||||
udpRcvbufErrors 0
|
||||
udpSndbufErrors 0
|
||||
udpInCsumErrors 0
|
||||
counterBlock_tag 0:2009
|
||||
tcpRtoAlgorithm 1
|
||||
tcpRtoMin 200
|
||||
tcpRtoMax 120000
|
||||
tcpMaxConn 4294967295
|
||||
tcpActiveOpens 0
|
||||
tcpPassiveOpens 30
|
||||
tcpAttemptFails 0
|
||||
tcpEstabResets 0
|
||||
tcpCurrEstab 1
|
||||
tcpInSegs 89120
|
||||
tcpOutSegs 86961
|
||||
tcpRetransSegs 59
|
||||
tcpInErrs 0
|
||||
tcpOutRsts 4
|
||||
tcpInCsumErrors 0
|
||||
counterBlock_tag 0:2008
|
||||
icmpInMsgs 23129314
|
||||
icmpInErrors 32
|
||||
icmpInDestUnreachs 0
|
||||
icmpInTimeExcds 23129282
|
||||
icmpInParamProbs 0
|
||||
icmpInSrcQuenchs 0
|
||||
icmpInRedirects 0
|
||||
icmpInEchos 0
|
||||
icmpInEchoReps 32
|
||||
icmpInTimestamps 0
|
||||
icmpInAddrMasks 0
|
||||
icmpInAddrMaskReps 0
|
||||
icmpOutMsgs 0
|
||||
icmpOutErrors 0
|
||||
icmpOutDestUnreachs 23132467
|
||||
icmpOutTimeExcds 0
|
||||
icmpOutParamProbs 23132467
|
||||
icmpOutSrcQuenchs 0
|
||||
icmpOutRedirects 0
|
||||
icmpOutEchos 0
|
||||
icmpOutEchoReps 0
|
||||
icmpOutTimestamps 0
|
||||
icmpOutTimestampReps 0
|
||||
icmpOutAddrMasks 0
|
||||
icmpOutAddrMaskReps 0
|
||||
counterBlock_tag 0:2007
|
||||
ipForwarding 2
|
||||
ipDefaultTTL 64
|
||||
ipInReceives 46590552
|
||||
ipInHdrErrors 0
|
||||
ipInAddrErrors 0
|
||||
ipForwDatagrams 0
|
||||
ipInUnknownProtos 0
|
||||
ipInDiscards 0
|
||||
ipInDelivers 46402357
|
||||
ipOutRequests 69613096
|
||||
ipOutDiscards 0
|
||||
ipOutNoRoutes 80
|
||||
ipReasmTimeout 0
|
||||
ipReasmReqds 0
|
||||
ipReasmOKs 0
|
||||
ipReasmFails 0
|
||||
ipFragOKs 0
|
||||
ipFragFails 0
|
||||
ipFragCreates 0
|
||||
counterBlock_tag 0:2005
|
||||
disk_total 6253608960
|
||||
disk_free 2719039488
|
||||
disk_partition_max_used 56.52
|
||||
disk_reads 11512
|
||||
disk_bytes_read 626214912
|
||||
disk_read_time 48469
|
||||
disk_writes 1058955
|
||||
disk_bytes_written 8924332032
|
||||
disk_write_time 7954804
|
||||
counterBlock_tag 0:2004
|
||||
mem_total 8326963200
|
||||
mem_free 5063872512
|
||||
mem_shared 0
|
||||
mem_buffers 86425600
|
||||
mem_cached 827752448
|
||||
swap_total 0
|
||||
swap_free 0
|
||||
page_in 306365
|
||||
page_out 4357584
|
||||
swap_in 0
|
||||
swap_out 0
|
||||
counterBlock_tag 0:2003
|
||||
cpu_load_one 0.030
|
||||
cpu_load_five 0.050
|
||||
cpu_load_fifteen 0.040
|
||||
cpu_proc_run 1
|
||||
cpu_proc_total 138
|
||||
cpu_num 2
|
||||
cpu_speed 1699
|
||||
cpu_uptime 1699306
|
||||
cpu_user 64269210
|
||||
cpu_nice 1810
|
||||
cpu_system 34690140
|
||||
cpu_idle 3234293560
|
||||
cpu_wio 3568580
|
||||
cpuintr 0
|
||||
cpu_sintr 5687680
|
||||
cpuinterrupts 1596621688
|
||||
cpu_contexts 3246142972
|
||||
cpu_steal 329520
|
||||
cpu_guest 0
|
||||
cpu_guest_nice 0
|
||||
counterBlock_tag 0:2006
|
||||
nio_bytes_in 250283
|
||||
nio_pkts_in 2931
|
||||
nio_errs_in 0
|
||||
nio_drops_in 0
|
||||
nio_bytes_out 370244
|
||||
nio_pkts_out 1640
|
||||
nio_errs_out 0
|
||||
nio_drops_out 0
|
||||
counterBlock_tag 0:2000
|
||||
hostname vpp0-0
|
||||
UUID ec933791-d6af-7a93-3b8d-aab1a46d6faa
|
||||
machine_type 3
|
||||
os_name 2
|
||||
os_release 6.1.0-26-amd64
|
||||
endSample ----------------------
|
||||
endDatagram =================================
|
||||
```
|
||||
|
||||
If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might
|
||||
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
|
||||
this form of heartbeat from the _agent_. There are a lot of vital signs in this packet, including some
|
||||
non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel version
|
||||
information. It's super dope!
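If you'd rather skip the wall of text, `sflowtool` happily reads the pcap directly, and the field names
shown above are stable enough to grep for. For example, to pull out just a few of those vital signs:

```
pim@vpp0-0:~$ sflowtool -r sflow-host.pcap | \
    egrep 'hostname|os_release|cpu_load_one|mem_free|disk_free'
```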
|
||||
|
||||
### hsflowd: Interface Counters
|
||||
|
||||
Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to
|
||||
something very high (1 in 100M), and the interface polling-interval to every 10 seconds. And indeed,
|
||||
every ten seconds or so I get a few packets, which I captured in
|
||||
[[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Most of the packets contain only one
|
||||
counter record, while some contain more than one (in the PCAP, packet #9 has two). If I update the
|
||||
polling-interval to every second, I can see that most of the packets have all four counters.
|
||||
|
||||
Those interface counters, as decoded by `sflowtool`, look like this:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \
|
||||
awk '/startSample/ { on=1 } { if (on) { print $0 } } /endSample/ { on=0 }'
|
||||
startSample ----------------------
|
||||
sampleType_tag 0:4
|
||||
sampleType COUNTERSSAMPLE
|
||||
sampleSequenceNo 745
|
||||
sourceId 0:3
|
||||
counterBlock_tag 0:1005
|
||||
ifName GigabitEthernet10/0/2
|
||||
counterBlock_tag 0:1
|
||||
ifIndex 3
|
||||
networkType 6
|
||||
ifSpeed 0
|
||||
ifDirection 1
|
||||
ifStatus 3
|
||||
ifInOctets 858282015
|
||||
ifInUcastPkts 780540
|
||||
ifInMulticastPkts 0
|
||||
ifInBroadcastPkts 0
|
||||
ifInDiscards 0
|
||||
ifInErrors 0
|
||||
ifInUnknownProtos 0
|
||||
ifOutOctets 1246716016
|
||||
ifOutUcastPkts 975772
|
||||
ifOutMulticastPkts 0
|
||||
ifOutBroadcastPkts 0
|
||||
ifOutDiscards 127
|
||||
ifOutErrors 28
|
||||
ifPromiscuousMode 0
|
||||
endSample ----------------------
|
||||
```
|
||||
|
||||
What I find particularly cool about it is that sFlow provides an automatic mapping between the
|
||||
`ifName=GigabitEthernet10/0/2` (tag 0:1005) and an object (tag 0:1), which contains the
|
||||
`ifIndex=3`, and lots of packet and octet counters both in the ingress and egress direction. This is
|
||||
super useful for upstream _collectors_, as they can now find the hostname, agent name and address,
|
||||
and the correlation between interface names and their indexes. Noice!
|
||||
|
||||
#### hsflowd: Packet Samples
|
||||
|
||||
Now it's time to ratchet up the packet sampling, so I move it from 1:100M to 1:1000, while keeping
|
||||
the interface polling-interval at 10 seconds and I ask VPP to sample 64 bytes of each packet that it
|
||||
inspects. On either side of my pet VPP instance, I start an `iperf3` run to generate some traffic. I
|
||||
now see a healthy stream of sFlow packets coming in on port 6343. Every 30 seconds or so a host
counter still comes by, and every 10 seconds a set of interface counters, but mostly
|
||||
these UDP packets are showing me samples. I've captured a few minutes of these in
|
||||
[[sflow-all.pcap](/assets/sflow/sflow-all.pcap)].
|
||||
Although Wireshark doesn't know how to interpret the sFlow counter messages, it _does_ know how to
|
||||
interpret the sFlow sample messages, and it reveals one of them like this:
|
||||
|
||||
{{< image width="100%" src="/assets/sflow/sflow-wireshark.png" alt="sFlow Wireshark" >}}
|
||||
|
||||
Let me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1:48753
|
||||
to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as
|
||||
having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to
|
||||
send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It
|
||||
then shows that sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those
|
||||
are sampled. Finally, the first sampled packet starts at the blue line. It shows the SrcMAC and
|
||||
DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my running
|
||||
`iperf3`, booyah!
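For scripting against these samples, `sflowtool` also has a line-oriented output mode (`-l`) which prints
one comma-separated `FLOW` record per packet sample and one `CNTR` record per counter sample, which is a
lot friendlier to grep and awk than the verbose form. A small sketch:

```
pim@vpp0-0:~$ sflowtool -r sflow-all.pcap -l | grep ^FLOW | head -3
```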
|
||||
|
||||
### VPP: sFlow Performance
|
||||
|
||||
{{< image float="right" src="/assets/sflow/sflow-lab.png" alt="sFlow Lab" width="20em" >}}
|
||||
|
||||
One question I get a lot about this plugin is: what is the performance impact when using
|
||||
sFlow? I spent a considerable amount of time tinkering with this, and together with Neil brought
|
||||
the plugin to what we both agree is the most efficient use of CPU. We could have gone a bit further,
|
||||
but that would require somewhat intrusive changes to VPP's internals and as _North of the Border_
|
||||
(and the Simpsons!) would say: what we have isn't just good, it's good enough!
|
||||
|
||||
I've built a small testbed based on two Dell R730 machines. On the left, I have a Debian machine
|
||||
running Cisco T-Rex using four quad-tengig network cards, the classic Intel X710-DA4. On the right,
|
||||
I have my VPP machine called _Hippo_ (because it's always hungry for packets), with the same
|
||||
hardware. I'll build two halves. On the top NIC (Te3/0/0-3 in VPP), I will install IPv4 and MPLS
|
||||
forwarding on the purple circuit, and a simple Layer2 cross connect on the cyan circuit. On all four
|
||||
interfaces, I will enable sFlow. Then, I will mirror this configuration on the bottom NIC
|
||||
(Te130/0/0-3) in the red and green circuits, for which I will leave sFlow turned off.
|
||||
|
||||
To help you reproduce my results, and under the assumption that this is your jam, here's the
|
||||
configuration for all of the kit:
|
||||
|
||||
***0. Cisco T-Rex***
|
||||
```
|
||||
pim@trex:~ $ cat /srv/trex/8x10.yaml
|
||||
- version: 2
|
||||
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
|
||||
port_info:
|
||||
- src_mac: 00:1b:21:06:00:00
|
||||
dest_mac: 9c:69:b4:61:a1:dc # Connected to Hippo Te3/0/0, purple
|
||||
- src_mac: 00:1b:21:06:00:01
|
||||
dest_mac: 9c:69:b4:61:a1:dd # Connected to Hippo Te3/0/1, purple
|
||||
- src_mac: 00:1b:21:83:00:00
|
||||
dest_mac: 00:1b:21:83:00:01 # L2XC via Hippo Te3/0/2, cyan
|
||||
- src_mac: 00:1b:21:83:00:01
|
||||
dest_mac: 00:1b:21:83:00:00 # L2XC via Hippo Te3/0/3, cyan
|
||||
|
||||
- src_mac: 00:1b:21:87:00:00
|
||||
dest_mac: 9c:69:b4:61:75:d0 # Connected to Hippo Te130/0/0, red
|
||||
- src_mac: 00:1b:21:87:00:01
|
||||
dest_mac: 9c:69:b4:61:75:d1 # Connected to Hippo Te130/0/1, red
|
||||
- src_mac: 9c:69:b4:85:00:00
|
||||
dest_mac: 9c:69:b4:85:00:01 # L2XC via Hippo Te130/0/2, green
|
||||
- src_mac: 9c:69:b4:85:00:01
|
||||
dest_mac: 9c:69:b4:85:00:00 # L2XC via Hippo Te130/0/3, green
|
||||
pim@trex:~ $ sudo t-rex-64 -i -c 4 --cfg /srv/trex/8x10.yaml
|
||||
```
|
||||
|
||||
When constructing the T-Rex configuration, I specifically set the destination MAC address for L3
|
||||
circuits (the purple and red ones) using Hippo's interface MAC address, which I can find with
|
||||
`vppctl show hardware-interfaces`. This way, T-Rex does not have to ARP for the VPP endpoint. On
|
||||
L2XC circuits (the cyan and green ones), VPP does not concern itself with the MAC addressing at
|
||||
all. It puts its interface in _promiscuous_ mode, and simply writes out any ethernet frame received,
|
||||
directly to the egress interface.
|
||||
|
||||
***1. IPv4***
|
||||
```
|
||||
hippo# set int state TenGigabitEthernet3/0/0 up
|
||||
hippo# set int state TenGigabitEthernet3/0/1 up
|
||||
hippo# set int state TenGigabitEthernet130/0/0 up
|
||||
hippo# set int state TenGigabitEthernet130/0/1 up
|
||||
hippo# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
|
||||
hippo# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
|
||||
hippo# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
|
||||
hippo# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
|
||||
hippo# ip route add 16.0.0.0/24 via 100.64.0.0
|
||||
hippo# ip route add 48.0.0.0/24 via 100.64.1.0
|
||||
hippo# ip route add 16.0.2.0/24 via 100.64.4.0
|
||||
hippo# ip route add 48.0.2.0/24 via 100.64.5.0
|
||||
hippo# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
|
||||
hippo# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
|
||||
hippo# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
|
||||
hippo# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
|
||||
```
|
||||
|
||||
By the way, one note on this last piece: I'm setting static IPv4 neighbors so that Cisco T-Rex
|
||||
as well as VPP do not have to use ARP to resolve each other. You'll see above that the T-Rex
|
||||
configuration also uses MAC addresses exclusively. Setting the `ip neighbor` like this allows VPP
|
||||
to know where to send return traffic.
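To double-check that the static entries took, I can ask VPP for its neighbor table; a quick verification
sketch:

```
pim@hippo:~$ vppctl show ip neighbors | grep 100.64
```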
|
||||
|
||||
***2. MPLS***
|
||||
```
|
||||
hippo# mpls table add 0
|
||||
hippo# set interface mpls TenGigabitEthernet3/0/0 enable
|
||||
hippo# set interface mpls TenGigabitEthernet3/0/1 enable
|
||||
hippo# set interface mpls TenGigabitEthernet130/0/0 enable
|
||||
hippo# set interface mpls TenGigabitEthernet130/0/1 enable
|
||||
hippo# mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
|
||||
hippo# mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
|
||||
hippo# mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
|
||||
hippo# mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
|
||||
```
|
||||
|
||||
Here, the MPLS configuration implements a simple P-router, where incoming MPLS packets with label 16
|
||||
will be sent back to T-Rex on Te3/0/1 to the specified IPv4 nexthop (for which I already know the
|
||||
MAC address), and with label 16 removed and new label 17 imposed, in other words a SWAP operation.
|
||||
|
||||
***3. L2XC***
|
||||
```
|
||||
hippo# set int state TenGigabitEthernet3/0/2 up
|
||||
hippo# set int state TenGigabitEthernet3/0/3 up
|
||||
hippo# set int state TenGigabitEthernet130/0/2 up
|
||||
hippo# set int state TenGigabitEthernet130/0/3 up
|
||||
hippo# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
|
||||
hippo# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
|
||||
hippo# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
|
||||
hippo# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
|
||||
```
|
||||
|
||||
I've added a layer2 cross connect as well because it's computationally very cheap for VPP to receive
|
||||
an L2 (ethernet) datagram, and immediately transmit it on another interface. There's no FIB lookup
|
||||
and not even an L2 nexthop lookup involved; VPP is just shoveling ethernet packets in-and-out as
|
||||
fast as it can!
|
||||
|
||||
Here's what a loadtest looks like when sending 80Gbps of 192-byte packets on all eight interfaces:
|
||||
|
||||
{{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}}
|
||||
|
||||
The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p2 <-> p3 are sending ethernet back
|
||||
and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These
|
||||
four ports are my experiment, to show the CPU use of sFlow. Then, ports p4 <-> p5 and p6 <-> p7
|
||||
respectively have sFlow turned off but with the same configuration. They are my control, showing
|
||||
the CPU use without sFlow.
|
||||
|
||||
**First conclusion**: This stuff works a treat. There is absolutely no impact on throughput at
|
||||
80Gbps with 47.6Mpps either _with_, or _without_ sFlow turned on. That's wonderful news, as it shows
|
||||
that the dataplane has more CPU available than is needed for any combination of functionality.
|
||||
|
||||
But what _is_ the limit? For this, I'll take a deeper look at the runtime statistics by varying the
|
||||
CPU time spent and maximum throughput achievable on a single VPP worker, thus using a single CPU
|
||||
thread on this Hippo machine that has 44 cores and 44 hyperthreads. I switch the loadtester to emit
|
||||
64 byte ethernet packets, the smallest I'm allowed to send.
|
||||
|
||||
| Loadtest | no sFlow | 1:1'000'000 | 1:10'000 | 1:1'000 | 1:100 |
|
||||
|-------------|-----------|-----------|-----------|-----------|-----------|
|
||||
| L2XC | 14.88Mpps | 14.32Mpps | 14.31Mpps | 14.27Mpps | 14.15Mpps |
|
||||
| IPv4 | 10.89Mpps | 9.88Mpps | 9.88Mpps | 9.84Mpps | 9.73Mpps |
|
||||
| MPLS | 10.11Mpps | 9.52Mpps | 9.52Mpps | 9.51Mpps | 9.45Mpps |
|
||||
| ***sFlow Packets*** / 10sec | N/A | 337.42M total | 337.39M total | 336.48M total | 333.64M total |
|
||||
| .. Sampled | | 328 | 33.8k | 336k | 3.34M |
|
||||
| .. Sent | | 328 | 33.8k | 336k | 1.53M |
|
||||
| .. Dropped | | 0 | 0 | 0 | 1.81M |
|
||||
|
||||
Here I can make a few important observations.
|
||||
|
||||
**Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned off, which
|
||||
implies that it has a little bit of CPU left over to do other work, if needed. With IPv4, I can see
|
||||
that the throughput is actually CPU limited: 10.89Mpps can be handled by one worker (thus, one CPU thread). I
|
||||
know that MPLS is a little bit more expensive computationally than IPv4, and that checks out. The
|
||||
total capacity is 10.11Mpps for one worker, when sFlow is turned off.
|
||||
|
||||
**Overhead**: When I turn on sFlow on the interface, VPP will insert the _sflow-node_ into the
|
||||
forwarding graph between `device-input` and `ethernet-input`. It means that the sFlow node will see
|
||||
_every single_ packet, and it will have to move all of these into the next node, which costs about
|
||||
9.5 CPU cycles per packet. The regression on L2XC is 3.8% but I have to note that VPP was not CPU
|
||||
bound on the L2XC so it used some CPU cycles which were still available, before regressing
|
||||
throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS, only to shuffle the
|
||||
packets through the graph.
|
||||
|
||||
**Sampling Cost**: When doing higher rates of sampling, the further regression is not _that_
|
||||
terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the
|
||||
worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%. The
|
||||
regressions for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8% (MPLS).
|
||||
Of course, by using multiple hardware receive queues and multiple RX workers per interface, the cost
|
||||
can be kept well in hand.
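For completeness, here's roughly what that looks like in VPP's `startup.conf`; the PCI address and core
list are hypothetical and need to match your own machine:

```
cpu {
  main-core 0
  corelist-workers 1-4
}
dpdk {
  dev 0000:03:00.0 { num-rx-queues 4 }   # hypothetical PCI address; one RX queue per worker
}
```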
|
||||
|
||||
**Overload Protection**: At 1:1'000 and an effective rate of 33.65Mpps across all ports, I correctly
|
||||
observe 336k samples taken, and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, but
|
||||
they are not fitting through the FIFO, so the plugin is dropping samples to protect downstream
|
||||
`sflow-main` thread and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M
|
||||
samples made it through. By the way, this means VPP is happily sending a whopping 153K samples/sec
|
||||
to the collector!
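If you want to see this behavior on your own system, the per-node error and runtime counters are a good
place to look; a sketch, where the exact counter names depend on the plugin version:

```
pim@hippo:~$ vppctl show errors | grep -i sflow
pim@hippo:~$ vppctl show runtime | grep -i sflow
```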
|
||||
|
||||
## What's Next
|
||||
|
||||
Now that I've seen the UDP packets from our agent to a collector on the wire, and also how
|
||||
incredibly efficient the sFlow sampling implementation turned out to be, I'm super motivated to
|
||||
continue the journey with higher level collector receivers like ntopng, sflow-rt or Akvorado. In an
|
||||
upcoming article, I'll describe how I rolled out Akvorado at IPng, and what types of changes would
|
||||
make the user experience even better (or simpler to understand, at least).
|
||||
|
||||
### Acknowledgements
|
||||
|
||||
I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
|
||||
finer details such as logging, error handling, API specifications, and documentation. He has been a
|
||||
true pleasure to work with and learn from. Also, thank you to the VPP developer community, notably
|
||||
Benoit, Florin, Damjan, Dave and Matt, for helping with the review and getting this thing merged in
|
||||
time for the 25.02 release.
|
content/articles/2025-04-09-frysix-evpn.md (new file, 793 lines)
@@ -0,0 +1,793 @@
|
||||
---
|
||||
date: "2025-04-09T07:51:23Z"
|
||||
title: 'FrysIX eVPN: think different'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/frys-ix/frysix-logo-small.png" alt="FrysIX Logo" width="12em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
Somewhere in the far north of the Netherlands, the country where I was born, a town called Jubbega
|
||||
is the home of the Frysian Internet Exchange called [[Frys-IX](https://frys-ix.net/)]. Back in 2021,
|
||||
a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHEF facility, one of
|
||||
the most densely populated facilities in western Europe. He was looking for a few launching
|
||||
customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on
|
||||
my [[bucketlist]({{< ref 2021-07-26-bucketlist.md >}})]. Arend and his IT company
|
||||
[[ERITAP](https://www.eritap.com/)], took delivery of that rack in May of 2021, and this is when the
|
||||
internet exchange with _Frysian roots_ was born.
|
||||
|
||||
In the years from 2021 until now, Arend and I have been operating the exchange with reasonable
|
||||
success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs
|
||||
with about ten switches in six datacenters across the Amsterdam metro area. It's shifting a cool
|
||||
800Gbit of traffic or so. It's dope, and very rewarding to be able to contribute to this community!
|
||||
|
||||
## Frys-IX is growing
|
||||
|
||||
We have several members with a 2x100G LAG and even though all inter-datacenter links are either dark
|
||||
fiber or WDM, we're starting to feel the growing pains as we set our sights on the next step of growth.
|
||||
You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. When it did 133.7Gbit of
|
||||
traffic, Arend organized an even bigger barbecue. Obviously, the next step is 1337Gbit and joining
|
||||
the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're on our
|
||||
way!
|
||||
|
||||
It became clear that we will not be able to keep a dependable peering platform if FrysIX remains a
|
||||
single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be
|
||||
operationally expensive (think of all the dark fiber or WDM waves!), and brittle (think of LACP and
|
||||
balancing traffic over those ports). We need to modernize in order to stay ahead of the growth
|
||||
curve.
|
||||
|
||||
## Hello Nokia
|
||||
|
||||
{{< image float="right" src="/assets/frys-ix/nokia-7220-d4.png" alt="Nokia 7220-D4" width="20em" >}}
|
||||
|
||||
The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration,
|
||||
high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity
|
||||
to your data center networks and peering network environments. These devices are built around the
|
||||
Broadcom _Trident_ chipset, in the case of the "D4" platform, this is a Trident4 with 28x100G and
|
||||
8x400G ports. Whoot!
|
||||
|
||||
{{< image float="right" src="/assets/frys-ix/IXR-7220-D3.jpg" alt="Nokia 7220-D3" width="20em" >}}
|
||||
|
||||
What I find particularly awesome about the Trident series is their speed (total bandwidth of
|
||||
12.8Tbps _per router_), low power use (without optics, the IXR-7220-D4 consumes about 150W) and
|
||||
a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern
|
||||
approaches to scale-out networking such as VXLAN based EVPN. At the FrysIX barbecue in September of
|
||||
2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right.
|
||||
That's a 32x100G router.
|
||||
|
||||
ERITAP has bought two (new in box) IXR-7220-D4 (8x400G,28x100G) routers, and has also acquired two
|
||||
IXR-7220-D2 (48x25G,8x100G) routers. So in total, FrysIX is now the proud owner of five of these
|
||||
beautiful Nokia devices. If you haven't yet, you should definitely read about these versatile
|
||||
routers on the [[Nokia](https://onestore.nokia.com/asset/207599)] website, and some details of the
|
||||
_merchant silicon_ switch chips in use on the
|
||||
[[Broadcom](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56880-series)]
|
||||
website.
|
||||
|
||||
### eVPN: A small rant
|
||||
|
||||
{{< image float="right" src="/assets/frys-ix/FrysIX_ Topology (concept).svg" alt="Topology Concept" width="50%" >}}
|
||||
|
||||
First, I need to get something off my chest. Consider a topology for an internet exchange platform,
|
||||
taking into account the available equipment, rackspace, power, and cross connects. Somehow, almost
|
||||
every design or reference architecture I can find on the Internet assumes folks want to build a
|
||||
[[Clos network](https://en.wikipedia.org/wiki/Clos_network)], which has a topology consisting of leaf
|
||||
and spine switches. The _spine_ switches have a different set of features than the _leaf_ ones,
|
||||
notably they don't have to do provider edge functionality like VXLAN encap and decapsulation.
|
||||
Almost all of these designs are showing how one might build a leaf-spine network for hyperscale.
|
||||
|
||||
**Critique 1**: my 'spine' (IXR-7220-D4 routers) must also be provider edge. Practically speaking,
|
||||
in the picture above I have these beautiful Nokia IXR-7220-D4 routers, using two 400G ports to
|
||||
connect between the facilities, and six 100G ports to connect the smaller breakout switches. That
|
||||
would leave a _massive_ amount of capacity unused: 22x 100G and 6x400G ports, to be exact.
|
||||
|
||||
**Critique 2**: all 'leaf' (either IXR-7220-D2 routers or Arista switches) can't realistically
|
||||
connect to both 'spines'. Our devices are spread out over two (and in practice, more like six)
|
||||
datacenters, and it's prohibitively expensive to get 100G waves or dark fiber to create a full mesh.
|
||||
It's much more economical to create a star-topology that minimizes cross-datacenter fiber spans.
|
||||
|
||||
**Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway
|
||||
protocol is eBGP in what they call the _underlay_, and on top of that, some secondary eBGP that's
|
||||
called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume
|
||||
hundreds of switches, in which case making use of one AS number per switch could make sense, as iBGP
|
||||
needs either a 'full mesh', or external route reflectors.
|
||||
|
||||
**Critique 4**: These reference designs also make an assumption that all fiber is local and while
|
||||
optics and links can fail, it will be relatively rare to _drain_ a link. However, in
|
||||
cross-datacenter networks, draining links for maintenance is very common, for example if the dark
|
||||
fiber provider needs to perform repairs on a span that was damaged. With these eBGP-over-eBGP
|
||||
connections, traffic engineering is more difficult than simply raising the OSPF (or IS-IS) cost of a
|
||||
link, to reroute traffic.
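To make that concrete: with a link-state IGP, draining a span ahead of fiber maintenance is a two-line
change on the device. Here's a sketch on an Arista leaf like the one used later in this article; the
cost value is arbitrary, it just has to exceed the alternative paths:

```
arista-leaf#configure
arista-leaf(config)#interface Ethernet32/1
arista-leaf(config-if-Et32/1)#ip ospf cost 65000
```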
|
||||
|
||||
Setting aside eVPN for a second, if I were to build an IP transport network, like I did when I built
|
||||
[[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a much more intuitive
|
||||
and simple (I would even dare say elegant) design:
|
||||
|
||||
1. Take a classic IGP like [[OSPF](https://en.wikipedia.org/wiki/Open_Shortest_Path_First)], or
|
||||
perhaps [[IS-IS](https://en.wikipedia.org/wiki/IS-IS)]. There is no benefit, to me at least, to use
|
||||
BGP as an IGP.
|
||||
1. I would give each of the links between the switches an IPv4 /31 and enable link-local, and give
|
||||
each switch a loopback address with a /32 IPv4 and a /128 IPv6.
|
||||
1. If I had multiple links between two given switches, I would probably just use ECMP if my devices
|
||||
supported it, and fall back to a LACP signaled bundle-ethernet otherwise.
|
||||
1. If I were to need to use BGP (and for eVPN, this need exists), taking the ISP mindset (as opposed
|
||||
to the datacenter fabric mindset), I would simply install iBGP against two or three route
|
||||
reflectors, and exchange routing information within the same single AS number.
|
||||
|
||||
### eVPN: A demo topology
|
||||
|
||||
{{< image float="right" src="/assets/frys-ix/Nokia Arista VXLAN.svg" alt="Demo topology" width="50%" >}}
|
||||
|
||||
So, that's exactly how I'm going to approach the FrysIX eVPN design: OSPF for the underlay and iBGP
|
||||
for the overlay! I have a feeling that some folks will despise me for being contrarian, but you can
|
||||
leave your comments below, and don't forget to like-and-subscribe :-)
|
||||
|
||||
Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two
|
||||
400G-capable routers and connects them. Then he takes an Arista DCS-7060CX switch, which is eVPN
|
||||
capable, with 32x100G ports, based on the Broadcom Tomahawk chipset, and a smaller Nokia
|
||||
IXR-7220-D2 with 48x25G and 8x100G ports, based on the Trident3 chipset. He wires all of this up
|
||||
to look like the picture on the right.
|
||||
|
||||
#### Underlay: Nokia's SR Linux
|
||||
|
||||
We boot up the equipment, verify that all the optics and links are up, and connect the management
|
||||
ports to an OOB network that I can remotely log in to. This is the first time that either of us has worked
|
||||
on Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek.
|
||||
|
||||
```
|
||||
[pim@nikhef ~]$ sr_cli
|
||||
--{ running }--[ ]--
|
||||
A:pim@nikhef# enter candidate
|
||||
--{ candidate shared default }--[ ]--
|
||||
A:pim@nikhef# set / interface lo0 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
There, my first config snippet! This creates a _loopback_ interface, and similar to JunOS, a
|
||||
_subinterface_ (which Juniper calls a _unit_) which enables IPv4 and gives it a /32 address. In SR
Linux, any interface has to be associated with a _network-instance_; think of those as routing
|
||||
domains or VRFs. There's a conveniently named _default_ network-instance, which I'll add this and
|
||||
the point-to-point interface between the two 400G routers to:
|
||||
|
||||
```
|
||||
A:pim@nikhef# info flat interface ethernet-1/29
|
||||
set / interface ethernet-1/29 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
|
||||
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
|
||||
|
||||
A:pim@nikhef# set / network-instance default type default
|
||||
A:pim@nikhef# set / network-instance default admin-state enable
|
||||
A:pim@nikhef# set / network-instance default interface ethernet-1/29.0
|
||||
A:pim@nikhef# set / network-instance default interface lo0.0
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
Cool. Assuming I now also do this on the other IXR-7220-D4 router, called _equinix_ (which gets the
|
||||
loopback address 198.19.16.0/32 and the point-to-point on the 400G interface of 198.19.17.0/31), I
|
||||
should be able to do my first jumboframe ping:
|
||||
|
||||
```
|
||||
A:pim@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do
|
||||
Using network instance default
|
||||
PING 198.19.17.1 (198.19.17.1) 9162(9190) bytes of data.
|
||||
9170 bytes from 198.19.17.1: icmp_seq=1 ttl=64 time=0.466 ms
|
||||
9170 bytes from 198.19.17.1: icmp_seq=2 ttl=64 time=0.477 ms
|
||||
9170 bytes from 198.19.17.1: icmp_seq=3 ttl=64 time=0.547 ms
|
||||
```
|
||||
|
||||
#### Underlay: SR Linux OSPF
|
||||
|
||||
OK, let's get these two Nokia routers to speak OSPF, so that they can reach each other's loopback.
|
||||
It's really easy:
|
||||
|
||||
```
|
||||
A:pim@nikhef# / network-instance default protocols ospf instance default
|
||||
--{ candidate shared default }--[ network-instance default protocols ospf instance default ]--
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set version ospf-v2
|
||||
A:pim@nikhef# set router-id 198.19.16.1
|
||||
A:pim@nikhef# set area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
|
||||
A:pim@nikhef# set area 0.0.0.0 interface lo0.0 passive true
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
Similar to JunOS, I can descend into a configuration scope: the first line goes into the
|
||||
_network-instance_ called `default` and then the _protocols_ called `ospf`, and then the _instance_
|
||||
called `default`. Subsequent `set` commands operate at this scope. Once I commit this configuration
|
||||
(on the _nikhef_ router and also the _equinix_ router, with its own unique router-id), OSPF quickly
|
||||
shoots into action:
|
||||
|
||||
```
|
||||
A:pim@nikhef# show network-instance default protocols ospf neighbor
|
||||
=========================================================================================
|
||||
Net-Inst default OSPFv2 Instance default Neighbors
|
||||
=========================================================================================
|
||||
+---------------------------------------------------------------------------------------+
|
||||
| Interface-Name Rtr Id State Pri RetxQ Time Before Dead |
|
||||
+=======================================================================================+
|
||||
| ethernet-1/29.0 198.19.16.0 full 1 0 36 |
|
||||
+---------------------------------------------------------------------------------------+
|
||||
-----------------------------------------------------------------------------------------
|
||||
No. of Neighbors: 1
|
||||
=========================================================================================
|
||||
|
||||
A:pim@nikhef# show network-instance default route-table all | more
|
||||
IPv4 unicast route table of network instance default
|
||||
+------------------+-----+------------+--------------+--------+----------+--------+------+-------------+-----------------+
|
||||
| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop | Next-hop |
|
||||
| | | | | | Network | | | (Type) | Interface |
|
||||
| | | | | | Instance | | | | |
|
||||
+==================+=====+============+==============+========+==========+========+======+=============+=================+
|
||||
| 198.19.16.0/32 | 0 | ospfv2 | ospf_mgr | True | default | 1 | 10 | 198.19.17.0 | ethernet-1/29.0 |
|
||||
| | | | | | | | | (direct) | |
|
||||
| 198.19.16.1/32 | 7 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
|
||||
| 198.19.17.0/31 | 6 | local | net_inst_mgr | True | default | 0 | 0 | 198.19.17.1 | ethernet-1/29.0 |
|
||||
| | | | | | | | | (direct) | |
|
||||
| 198.19.17.1/32 | 6 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
|
||||
+==================+=====+============+==============+========+==========+========+======+=============+=================+
|
||||
|
||||
A:pim@nikhef# ping network-instance default 198.19.16.0
|
||||
Using network instance default
|
||||
PING 198.19.16.0 (198.19.16.0) 56(84) bytes of data.
|
||||
64 bytes from 198.19.16.0: icmp_seq=1 ttl=64 time=0.484 ms
|
||||
64 bytes from 198.19.16.0: icmp_seq=2 ttl=64 time=0.663 ms
|
||||
```
|
||||
|
||||
Delicious! OSPF has learned the loopback, and it is now reachable. As with most things, going from 0
|
||||
to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. Then going
|
||||
from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on,
|
||||
going from 2 to N is easy (in my case: enabling several other point-to-point /31 transit networks on
|
||||
the _nikhef_ router, using `ethernet-1/1.0` through `ethernet-1/4.0` with the correct MTU and
|
||||
turning on OSPF for these), and that makes the whole network shoot to life. Slick!
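For reference, one of those additional point-to-point interfaces on _nikhef_ looks like this. It's a
sketch that assumes `ethernet-1/1` faces the Arista's Eth31/1 and carries 198.19.17.2/31, the address
that shows up later in the Arista's OSPF neighbor table:

```
A:pim@nikhef# set / interface ethernet-1/1 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ip-mtu 9190
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ipv4 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ipv4 address 198.19.17.2/31
A:pim@nikhef# set / network-instance default interface ethernet-1/1.0
A:pim@nikhef# set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/1.0 interface-type point-to-point
A:pim@nikhef# commit stay
```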
|
||||
|
||||
#### Underlay: Arista
|
||||
|
||||
I'll point out that one of the devices in this topology is an Arista. We have several of these ready
|
||||
for deployment at FrysIX. They are a lot more affordable and easier to find on the second-hand /
|
||||
refurbished market. These switches come with 32x100G ports, and are really good at packet slinging
|
||||
because they're based on the Broadcom _Tomahawk_ chipset. They pack a few less features than the
|
||||
_Trident_ chipset that powers the Nokia, but they happen to have all the features we need to run our
|
||||
internet exchange. So I turn my attention to the Arista in the topology. I am much more
|
||||
comfortable configuring the whole thing here, as it's not my first time touching these devices:
|
||||
|
||||
```
|
||||
arista-leaf#show run int loop0
|
||||
interface Loopback0
|
||||
ip address 198.19.16.2/32
|
||||
ip ospf area 0.0.0.0
|
||||
arista-leaf#show run int Ethernet32/1
|
||||
interface Ethernet32/1
|
||||
description Core: Connected to nikhef:ethernet-1/2
|
||||
load-interval 1
|
||||
mtu 9190
|
||||
no switchport
|
||||
ip address 198.19.17.5/31
|
||||
ip ospf cost 1000
|
||||
ip ospf network point-to-point
|
||||
ip ospf area 0.0.0.0
|
||||
arista-leaf#show run section router ospf
|
||||
router ospf 65500
|
||||
router-id 198.19.16.2
|
||||
redistribute connected
|
||||
network 198.19.0.0/16 area 0.0.0.0
|
||||
max-lsa 12000
|
||||
```
|
||||
|
||||
I complete the configuration for the other two interfaces on this Arista, port Eth31/1 connects also
|
||||
to the _nikhef_ IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to
|
||||
the _nokia-leaf_ IXR-7220-D2 with a cost of 10.
|
||||
It's nice to see OSPF in action - there are two equal-cost (but high-cost) OSPF paths via
|
||||
router-id 198.19.16.1 (nikhef), and there's one lower cost path via router-id 198.19.16.3
|
||||
(_nokia-leaf_). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nikhef ->
|
||||
equinix). Dope!
|
||||
|
||||
```
|
||||
arista-leaf#show ip ospf nei
|
||||
Neighbor ID Instance VRF Pri State Dead Time Address Interface
|
||||
198.19.16.1 65500 default 1 FULL 00:00:36 198.19.17.4 Ethernet32/1
|
||||
198.19.16.3 65500 default 1 FULL 00:00:31 198.19.17.11 Ethernet30/1
|
||||
198.19.16.1 65500 default 1 FULL 00:00:35 198.19.17.2 Ethernet31/1
|
||||
|
||||
arista-leaf#traceroute 198.19.16.0
|
||||
traceroute to 198.19.16.0 (198.19.16.0), 30 hops max, 60 byte packets
|
||||
1 198.19.17.11 (198.19.17.11) 0.220 ms 0.150 ms 0.206 ms
|
||||
2 198.19.17.6 (198.19.17.6) 0.169 ms 0.107 ms 0.099 ms
|
||||
3 198.19.16.0 (198.19.16.0) 0.434 ms 0.346 ms 0.303 ms
|
||||
```
|
||||
|
||||
So far, so good! The _underlay_ is up, every router can reach every other router on its loopback,
|
||||
and all OSPF adjacencies are formed. I'll leave the 2x100G between _nikhef_ and _arista-leaf_ at
|
||||
high cost for now.
|
||||
|
||||
#### Overlay EVPN: SR Linux
|
||||
|
||||
The big-picture idea here is to use iBGP with the same private AS number, and because there are two
|
||||
main facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as
|
||||
route-reflectors for others. It means that they will have an iBGP session amongst themselves
|
||||
(198.19.16.0 <-> 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the
|
||||
198.19.16.0/24 subnet. This way, I don't have to configure any more than strictly necessary on the
|
||||
core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core
|
||||
routers. I proceed to configure BGP on the Nokia's like this:
|
||||
|
||||
```
|
||||
A:pim@nikhef# / network-instance default protocols bgp
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set autonomous-system 65500
|
||||
A:pim@nikhef# set router-id 198.19.16.1
|
||||
A:pim@nikhef# set dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
|
||||
A:pim@nikhef# set afi-safi evpn admin-state enable
|
||||
A:pim@nikhef# set preference ibgp 170
|
||||
A:pim@nikhef# set route-advertisement rapid-withdrawal true
|
||||
A:pim@nikhef# set route-advertisement wait-for-fib-install false
|
||||
A:pim@nikhef# set group overlay peer-as 65500
|
||||
A:pim@nikhef# set group overlay afi-safi evpn admin-state enable
|
||||
A:pim@nikhef# set group overlay afi-safi ipv4-unicast admin-state disable
|
||||
A:pim@nikhef# set group overlay afi-safi ipv6-unicast admin-state disable
|
||||
A:pim@nikhef# set group overlay local-as as-number 65500
|
||||
A:pim@nikhef# set group overlay route-reflector client true
|
||||
A:pim@nikhef# set group overlay transport local-address 198.19.16.1
|
||||
A:pim@nikhef# set neighbor 198.19.16.0 admin-state enable
|
||||
A:pim@nikhef# set neighbor 198.19.16.0 peer-group overlay
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
I can see that iBGP sessions establish between all the devices:
|
||||
|
||||
```
|
||||
A:pim@nikhef# show network-instance default protocols bgp neighbor
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
BGP neighbor summary for network-instance "default"
|
||||
Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
|
||||
| Net-Inst | Peer | Group | Flags | Peer-AS | State | Uptime | AFI/SAFI | [Rx/Active/Tx] |
|
||||
+=============+=============+==========+=======+==========+=============+===============+============+====================+
|
||||
| default | 198.19.16.0 | overlay | S | 65500 | established | 0d:0h:2m:32s | evpn | [0/0/0] |
|
||||
| default | 198.19.16.2 | overlay | D | 65500 | established | 0d:0h:2m:27s | evpn | [0/0/0] |
|
||||
| default | 198.19.16.3 | overlay | D | 65500 | established | 0d:0h:2m:41s | evpn | [0/0/0] |
|
||||
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
Summary:
|
||||
1 configured neighbors, 1 configured sessions are established, 0 disabled peers
|
||||
2 dynamic peers
|
||||
```
|
||||
|
||||
A few things to note here - there is one _configured_ neighbor (this is the other IXR-7220-D4 router),
|
||||
and two _dynamic_ peers, these are the Arista and the smaller IXR-7220-D2 router. The only address
|
||||
family that they are exchanging information for is the _evpn_ family, and no prefixes have been
|
||||
learned or sent yet, shown by the `[0/0/0]` designation in the last column.
|
||||
|
||||
#### Overlay EVPN: Arista
|
||||
|
||||
The Arista is also remarkably straightforward to configure. Here, I'll simply enable the iBGP
|
||||
session as follows:
|
||||
|
||||
```
|
||||
arista-leaf#show run section bgp
|
||||
router bgp 65500
|
||||
neighbor evpn peer group
|
||||
neighbor evpn remote-as 65500
|
||||
neighbor evpn update-source Loopback0
|
||||
neighbor evpn ebgp-multihop 3
|
||||
neighbor evpn send-community extended
|
||||
neighbor evpn maximum-routes 12000 warning-only
|
||||
neighbor 198.19.16.0 peer group evpn
|
||||
neighbor 198.19.16.1 peer group evpn
|
||||
!
|
||||
address-family evpn
|
||||
neighbor evpn activate
|
||||
|
||||
arista-leaf#show bgp summary
|
||||
BGP summary information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Neighbor AS Session State AFI/SAFI AFI/SAFI State NLRI Rcd NLRI Acc
|
||||
----------- ----------- ------------- ----------------------- -------------- ---------- ----------
|
||||
198.19.16.0 65500 Established IPv4 Unicast Advertised 0 0
|
||||
198.19.16.0 65500 Established L2VPN EVPN Negotiated 0 0
|
||||
198.19.16.1 65500 Established IPv4 Unicast Advertised 0 0
|
||||
198.19.16.1 65500 Established L2VPN EVPN Negotiated 0 0
|
||||
```
|
||||
|
||||
On this leaf node, I'll have redundant iBGP sessions with the two core nodes. Since those core
|
||||
nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No
|
||||
matter how many additional Arista (or Nokia) devices I add to the network, all they'll have to do is
|
||||
enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both core routers.
|
||||
Voila!
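
To make that concrete, here's a sketch of what yet another Arista leaf would need -- nothing more
than the same peer-group shown above, pointed at the two core loopbacks. The loopback address
198.19.16.4 is made up for this example, and I'm assuming OSPF reachability to the cores is already
in place:

```
interface Loopback0
   ip address 198.19.16.4/32
!
router bgp 65500
   neighbor evpn peer group
   neighbor evpn remote-as 65500
   neighbor evpn update-source Loopback0
   neighbor evpn send-community extended
   neighbor 198.19.16.0 peer group evpn
   neighbor 198.19.16.1 peer group evpn
   !
   address-family evpn
      neighbor evpn activate
```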
|
||||
|
||||
#### VXLAN EVPN: SR Linux
|
||||
|
||||
Nokia documentation informs me that SR Linux uses a special interface called _system0_ to source its
|
||||
VXLAN traffic from, and to add this interface to the _default_ network-instance. So it's a matter of
|
||||
defining that interface and associating a VXLAN interface with it, like so:
|
||||
|
||||
```
|
||||
A:pim@nikhef# set / interface system0 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
|
||||
A:pim@nikhef# set / network-instance default interface system0.0
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
This creates the plumbing for a VXLAN sub-interface called `vxlan1.2604` which will accept/send
|
||||
traffic using VNI 2604 (this happens to be the VLAN id we use at FrysIX for our production Peering
|
||||
LAN), and it'll use the `system0.0` address to source that traffic from.
|
||||
|
||||
The second part is to create what SR Linux calls a MAC-VRF and put some interface(s) in it:
|
||||
|
||||
```
|
||||
A:pim@nikhef# set / interface ethernet-1/9 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
|
||||
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 vlan-tagging true
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 type bridged
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
|
||||
|
||||
A:pim@nikhef# / network-instance peeringlan
|
||||
A:pim@nikhef# set type mac-vrf
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set interface ethernet-1/9/3.0
|
||||
A:pim@nikhef# set vxlan-interface vxlan1.2604
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 admin-state enable
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 evi 2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
In the first block here, Arend took what is a 100G port called `ethernet-1/9` and split it into 4x25G
|
||||
ports. Arend forced the port speed to 10G because he used a 40G-4x10G DAC, and it happens that
|
||||
the third lane is plugged into the Debian machine. So on `ethernet-1/9/3` I'll create a
|
||||
sub-interface, make it type _bridged_ (which I've also done on `vxlan1.2604`!) and allow any
|
||||
untagged traffic to enter it.
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
If you, like me, are used to either VPP or IOS/XR, this type of sub-interface stuff should feel very
|
||||
natural to you. I've written about the sub-interfaces logic on Cisco's IOS/XR and VPP approach in a
|
||||
previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})] which my buddy Fred lovingly calls
|
||||
_VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read!
|
||||
|
||||
The second block creates a new _network-instance_ which I'll name `peeringlan`, and it associates
|
||||
the newly created untagged sub-interface `ethernet-1/9/3.0` with the VXLAN interface, and starts a
|
||||
protocol for EVPN instructing traffic in and out of this network-instance to use EVI 2604 on the
|
||||
VXLAN sub-interface, and signalling of all MAC addresses learned to use the specified
|
||||
route-distinguisher and import/export route-targets. For simplicity I've just used the same for
|
||||
each: 65500:2604.
|
||||
|
||||
I continue to add an interface to the `peeringlan` _network-instance_ on the other two Nokia
|
||||
routers: `ethernet-1/9/3.0` on the _equinix_ router and `ethernet-1/9.0` on the _nokia-leaf_ router.
|
||||
Each of these goes to a 10Gbps port on a Debian machine.
|
||||
|
||||
#### VXLAN EVPN: Arista
|
||||
|
||||
At this point I'm feeling pretty bullish about the whole project. Arista does not make it very
|
||||
difficult for me to configure it for L2 EVPN (which is called MAC-VRF here also):
|
||||
|
||||
```
|
||||
arista-leaf#conf t
|
||||
vlan 2604
|
||||
name v-peeringlan
|
||||
interface Ethernet9/3
|
||||
speed forced 10000full
|
||||
switchport access vlan 2604
|
||||
|
||||
interface Loopback1
|
||||
ip address 198.19.18.2/32
|
||||
interface Vxlan1
|
||||
vxlan source-interface Loopback1
|
||||
vxlan udp-port 4789
|
||||
vxlan vlan 2604 vni 2604
|
||||
```
|
||||
|
||||
After creating VLAN 2604 and making port Eth9/3 an access port in that VLAN, I'll add a VTEP endpoint
|
||||
called `Loopback1`, and a VXLAN interface that uses that to source its traffic. Here, I'll associate
|
||||
local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias
|
||||
previously.
|
||||
|
||||
Finally, it's a matter of tying these together by announcing the MAC addresses into the EVPN iBGP
|
||||
sessions:
|
||||
```
|
||||
arista-leaf#conf t
|
||||
router bgp 65500
|
||||
vlan 2604
|
||||
rd 65500:2604
|
||||
route-target both 65500:2604
|
||||
redistribute learned
|
||||
!
|
||||
```
|
||||
|
||||
### Results
|
||||
|
||||
To validate the configurations, I learn a cool trick from my buddy Andy on the SR Linux Discord
|
||||
server. In EOS, I can ask it to check for any obvious mistakes in two places:
|
||||
|
||||
```
|
||||
arista-leaf#show vxlan config-sanity detail
|
||||
Category Result Detail
|
||||
---------------------------------- -------- --------------------------------------------------
|
||||
Local VTEP Configuration Check OK
|
||||
Loopback IP Address OK
|
||||
VLAN-VNI Map OK
|
||||
Flood List OK
|
||||
Routing OK
|
||||
VNI VRF ACL OK
|
||||
Decap VRF-VNI Map OK
|
||||
VRF-VNI Dynamic VLAN OK
|
||||
Remote VTEP Configuration Check OK
|
||||
Remote VTEP OK
|
||||
Platform Dependent Check OK
|
||||
VXLAN Bridging OK
|
||||
VXLAN Routing OK VXLAN Routing not enabled
|
||||
CVX Configuration Check OK
|
||||
CVX Server OK Not in controller client mode
|
||||
MLAG Configuration Check OK Run 'show mlag config-sanity' to verify MLAG config
|
||||
Peer VTEP IP OK MLAG peer is not connected
|
||||
MLAG VTEP IP OK
|
||||
Peer VLAN-VNI OK
|
||||
Virtual VTEP IP OK
|
||||
MLAG Inactive State OK
|
||||
|
||||
arista-leaf#show bgp evpn sanity detail
|
||||
Category Check Status Detail
|
||||
-------- -------------------- ------ ------
|
||||
General Send community OK
|
||||
General Multi-agent mode OK
|
||||
General Neighbor established OK
|
||||
L2 MAC-VRF route-target OK
|
||||
import and export
|
||||
L2 MAC-VRF OK
|
||||
route-distinguisher
|
||||
L2 MAC-VRF redistribute OK
|
||||
L2 MAC-VRF overlapping OK
|
||||
VLAN
|
||||
L2 Suppressed MAC OK
|
||||
VXLAN VLAN to VNI map for OK
|
||||
MAC-VRF
|
||||
VXLAN VRF to VNI map for OK
|
||||
IP-VRF
|
||||
```
|
||||
|
||||
#### Results: Arista view
|
||||
|
||||
Inspecting the MAC addresses learned from all four of the client ports on the Debian machine is
|
||||
easy:
|
||||
|
||||
```
|
||||
arista-leaf#show bgp evpn summary
|
||||
BGP summary information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Neighbor Status Codes: m - Under maintenance
|
||||
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
|
||||
198.19.16.0 4 65500 3311 3867 0 0 18:06:28 Estab 7 7
|
||||
198.19.16.1 4 65500 3308 3873 0 0 18:06:28 Estab 7 7
|
||||
|
||||
arista-leaf#show bgp evpn vni 2604 next-hop 198.19.18.3
|
||||
BGP routing table information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Route status codes: * - valid, > - active, S - Stale, E - ECMP head, e - ECMP
|
||||
c - Contributing to ECMP, % - Pending BGP convergence
|
||||
Origin codes: i - IGP, e - EGP, ? - incomplete
|
||||
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop
|
||||
|
||||
Network Next Hop Metric LocPref Weight Path
|
||||
* >Ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
|
||||
* ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
|
||||
* >Ec RD: 65500:2604 imet 198.19.18.3
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
|
||||
* ec RD: 65500:2604 imet 198.19.18.3
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
|
||||
```
|
||||
There's a lot to unpack here! The Arista sees, under the _route-distinguisher_ I configured
|
||||
on all the sessions, that it is learning one MAC address with next-hop 198.19.18.3 (this is the VTEP for
|
||||
the _nokia-leaf_ router) from both iBGP sessions. The MAC address is learned from originator
|
||||
198.19.16.3 (the loopback of the _nokia-leaf_ router), from two cluster members, the active one on
|
||||
iBGP speaker 198.19.16.1 (_nikhef_) and a backup member on 198.19.16.0 (_equinix_).
|
||||
|
||||
I can also see that there's a bunch of `imet` route entries, and Andy explained these to me. They are
|
||||
a signal from a VTEP participant that they are interested in seeing multicast traffic (like neighbor
|
||||
discovery or ARP requests) flooded to them. Every router participating in this L2VPN will raise such
|
||||
an `imet` route, which I'll see in duplicates as well (one from each iBGP session). This checks out.
|
||||
|
||||
#### Results: SR Linux view
|
||||
|
||||
The Nokia IXR-7220-D4 router called _equinix_ has also learned a bunch of EVPN routing entries,
|
||||
which I can inspect as follows:
|
||||
|
||||
```
|
||||
A:pim@equinix# show network-instance default protocols bgp routes evpn route-type summary
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Show report for the BGP route table of network-instance "default"
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Status codes: u=used, *=valid, >=best, x=stale, b=backup
|
||||
Origin codes: i=IGP, e=EGP, ?=incomplete
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
BGP Router ID: 198.19.16.0 AS: 65500 Local AS: 65500
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Type 2 MAC-IP Advertisement Routes
|
||||
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
|
||||
| Status | Route- | Tag-ID | MAC-address | IP-address | neighbor | Path-| Next-Hop | Label | ESI | MAC Mobility |
|
||||
| | distinguisher | | | | | id | | | | |
|
||||
+========+===============+========+===================+============+=============+======+============-+========+================================+==================+
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:57 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.1 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.2 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.3 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Type 3 Inclusive Multicast Ethernet Tag Routes
|
||||
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
|
||||
| Status | Route-distinguisher | Tag-ID | Originator-IP | neighbor | Path- | Next-Hop |
|
||||
| | | | | | id | |
|
||||
+========+=============================+========+=====================+=================+========+=======================+
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.1 | 198.19.16.1 | 0 | 198.19.18.1 |
|
||||
| * | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.1 | 0 | 198.19.18.2 |
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.2 | 0 | 198.19.18.2 |
|
||||
| * | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.1 | 0 | 198.19.18.3 |
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.3 | 0 | 198.19.18.3 |
|
||||
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
|
||||
--------------------------------------------------------------------------------------------------------------------------
|
||||
0 Ethernet Auto-Discovery routes 0 used, 0 valid
|
||||
5 MAC-IP Advertisement routes 3 used, 5 valid
|
||||
5 Inclusive Multicast Ethernet Tag routes 3 used, 5 valid
|
||||
0 Ethernet Segment routes 0 used, 0 valid
|
||||
0 IP Prefix routes 0 used, 0 valid
|
||||
0 Selective Multicast Ethernet Tag routes 0 used, 0 valid
|
||||
0 Selective Multicast Membership Report Sync routes 0 used, 0 valid
|
||||
0 Selective Multicast Leave Sync routes 0 used, 0 valid
|
||||
--------------------------------------------------------------------------------------------------------------------------
|
||||
```
|
||||
|
||||
I have to say, SR Linux output is incredibly verbose! But, I can see all the relevant bits and bobs
|
||||
here. Each MAC-IP entry is accounted for, I can see several nexthops pointing at the nikhef switch,
|
||||
one pointing at the nokia-leaf router and one pointing at the Arista switch. I also see the `imet`
|
||||
entries. One thing to note -- the SR Linux implementation fills the type-2 routes with a
|
||||
0.0.0.0 IPv4 address, while the Arista (in my opinion, more correctly) leaves them as NULL
|
||||
(unspecified). But, everything looks great!
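
For an extra sanity check of the bridge domain itself, the MAC table of the `peeringlan` instance
can be inspected too. I believe the following is the right incantation on SR Linux, but treat it as
a hint rather than gospel and verify it against your release:

```
A:pim@equinix# show network-instance peeringlan bridge-table mac-table all
```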
|
||||
|
||||
#### Results: Debian view
|
||||
|
||||
There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. As I said,
|
||||
Arend hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+
|
||||
connections. This network card is a regular in my AS8298 network, as it has excellent DPDK support
|
||||
and can easily pump 40Mpps with VPP. IPng 🥰 Intel X710!
|
||||
|
||||
```
|
||||
root@debian:~ # ip netns add nikhef
|
||||
root@debian:~ # ip link set enp1s0f0 netns nikhef
|
||||
root@debian:~ # ip netns exec nikhef ip link set enp1s0f0 up mtu 9000
|
||||
root@debian:~ # ip netns exec nikhef ip addr add 192.0.2.10/24 dev enp1s0f0
|
||||
root@debian:~ # ip netns exec nikhef ip addr add 2001:db8::10/64 dev enp1s0f0
|
||||
|
||||
root@debian:~ # ip netns add arista-leaf
|
||||
root@debian:~ # ip link set enp1s0f1 netns arista-leaf
|
||||
root@debian:~ # ip netns exec arista-leaf ip link set enp1s0f1 up mtu 9000
|
||||
root@debian:~ # ip netns exec arista-leaf ip addr add 192.0.2.11/24 dev enp1s0f1
|
||||
root@debian:~ # ip netns exec arista-leaf ip addr add 2001:db8::11/64 dev enp1s0f1
|
||||
|
||||
root@debian:~ # ip netns add nokia-leaf
|
||||
root@debian:~ # ip link set enp1s0f2 netns nokia-leaf
|
||||
root@debian:~ # ip netns exec nokia-leaf ip link set enp1s0f2 up mtu 9000
|
||||
root@debian:~ # ip netns exec nokia-leaf ip addr add 192.0.2.12/24 dev enp1s0f2
|
||||
root@debian:~ # ip netns exec nokia-leaf ip addr add 2001:db8::12/64 dev enp1s0f2
|
||||
|
||||
root@debian:~ # ip netns add equinix
|
||||
root@debian:~ # ip link set enp1s0f3 netns equinix
|
||||
root@debian:~ # ip netns exec equinix ip link set enp1s0f3 up mtu 9000
|
||||
root@debian:~ # ip netns exec equinix ip addr add 192.0.2.13/24 dev enp1s0f3
|
||||
root@debian:~ # ip netns exec equinix ip addr add 2001:db8::13/64 dev enp1s0f3
|
||||
|
||||
root@debian:~# ip netns exec nikhef fping -g 192.0.2.8/29
|
||||
192.0.2.10 is alive
|
||||
192.0.2.11 is alive
|
||||
192.0.2.12 is alive
|
||||
192.0.2.13 is alive
|
||||
|
||||
root@debian:~# ip netns exec arista-leaf fping 2001:db8::10 2001:db8::11 2001:db8::12 2001:db8::13
|
||||
2001:db8::10 is alive
|
||||
2001:db8::11 is alive
|
||||
2001:db8::12 is alive
|
||||
2001:db8::13 is alive
|
||||
|
||||
root@debian:~# ip netns exec equinix ip nei
|
||||
192.0.2.10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
192.0.2.11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
192.0.2.12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
fe80::e63a:6eff:fe5f:c57 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
fe80::e63a:6eff:fe5f:c58 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
fe80::e63a:6eff:fe5f:c59 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
2001:db8::10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
2001:db8::11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
2001:db8::12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
```
|
||||
|
||||
The Debian machine puts each of the X710's four ports into its own network namespace, and gives each an IPv4
|
||||
and an IPv6 address. I can then enter the `nikhef` network namespace, which has its NIC connected to
|
||||
the IXR-7220-D4 router called _nikhef_, and ping all four endpoints. Similarly, I can enter the
|
||||
`arista-leaf` namespace and ping6 all four endpoints. Finally, I take a look at the IPv6 and IPv4
|
||||
neighbor table on the network card that is connected to the _equinix_ router. All three MAC addresses are
|
||||
seen. This proves end to end connectivity across the EVPN VXLAN, and full interoperability. Booyah!
|
||||
|
||||
Performance? We got that! I'm not worried as these Nokia routers are rated for 12.8Tbps of VXLAN....
|
||||
```
|
||||
root@debian:~# ip netns exec equinix iperf3 -c 192.0.2.12
|
||||
Connecting to host 192.0.2.12, port 5201
|
||||
[ 5] local 192.0.2.10 port 34598 connected to 192.0.2.12 port 5201
|
||||
[ ID] Interval Transfer Bitrate Retr Cwnd
|
||||
[ 5] 0.00-1.00 sec 1.15 GBytes 9.91 Gbits/sec 19 1.52 MBytes
|
||||
[ 5] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 3 1.54 MBytes
|
||||
[ 5] 2.00-3.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
|
||||
[ 5] 3.00-4.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
|
||||
[ 5] 4.00-5.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 5.00-6.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 6.00-7.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 7.00-8.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 8.00-9.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 9.00-10.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
- - - - - - - - - - - - - - - - - - - - - - - - -
|
||||
[ ID] Interval Transfer Bitrate Retr
|
||||
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec 24 sender
|
||||
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec receiver
|
||||
|
||||
iperf Done.
|
||||
```
|
||||
|
||||
## What's Next
|
||||
|
||||
There are a few improvements I can make before deploying this architecture to the internet exchange.
|
||||
Notably:
|
||||
* the functional equivalent of _port security_, that is to say only allowing one or two MAC
|
||||
addresses per member port. FrysIX has a strict one-port-one-member-one-MAC rule, and having port
|
||||
security will greatly improve our resilience.
|
||||
* SR Linux has the ability to suppress ARP, _even on L2 MAC-VRF_! It's relatively well known for
|
||||
IRB based setups, but adding this to transparent bridge-domains is possible in Nokia
|
||||
[[ref](https://documentation.nokia.com/srlinux/22-6/SR_Linux_Book_Files/EVPN-VXLAN_Guide/services-evpn-vxlan-l2.html#configuring_evpn_learning_for_proxy_arp)],
|
||||
using the syntax of `protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise
|
||||
true`. This will glean the IP addresses based on intercepted ARP requests, and reduce the need for
|
||||
BUM flooding.
|
||||
* Andy informs me that Arista also has this feature. By setting `router l2-vpn` and `arp learning bridged`,
|
||||
the suppression of ARP requests/replies also works in the same way. This greatly reduces cross-router
|
||||
BUM flooding; both ARP-suppression snippets are sketched right after this list. If DE-CIX can do it, so can FrysIX :)
|
||||
* some automation - although configuring the MAC-VRF across Arista and SR Linux is definitely not
|
||||
as difficult as I thought, having some automation in place will avoid errors and mistakes. It
|
||||
would suck if the IXP collapsed because I botched a link drain or PNI configuration!
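
For reference, here's an untested sketch of those two ARP suppression knobs, assembled from the
syntax mentioned in the list above. The network-instance and VLAN names match this lab; double-check
the exact paths against the vendor documentation before copying:

```
# SR Linux, on the peeringlan MAC-VRF
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true

# Arista EOS
router l2-vpn
   arp learning bridged
```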
|
||||
|
||||
### Acknowledgements
|
||||
|
||||
I am relatively new to EVPN configurations, and wanted to give a shoutout to Andy Whitaker who
|
||||
jumped in very quickly when I asked a question on the SR Linux Discord. He was gracious with his
|
||||
time and spent a few hours on a video call with me, explaining EVPN in great detail both for Arista
|
||||
as well as SR Linux. In particular, I want to say a big "Thank you!" for helping me understand
|
||||
symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at
|
||||
Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure
|
||||
gold!
|
||||
|
||||
I also want to thank Niek for helping me take my first baby steps onto this platform and patiently
|
||||
answering my nerdly questions about the platform, the switch chip, and the configuration philosophy.
|
||||
Learning a new NOS is always a fun task, and it was made super fun because Niek spent an hour with
|
||||
Arend and me on a video call, giving a bunch of operational tips and tricks along the way.
|
||||
|
||||
Finally, Arend and ERITAP are an absolute joy to work with. We took turns hacking on the lab, which
|
||||
Arend made available for me while I am traveling to Mississippi this week. Thanks for the kWh and
|
||||
OOB access, and for brainstorming the config with me!
|
||||
|
||||
### Reference configurations
|
||||
|
||||
Here's the configs for all machines in this demonstration:
|
||||
[[nikhef](/assets/frys-ix/nikhef.conf)] | [[equinix](/assets/frys-ix/equinix.conf)] | [[nokia-leaf](/assets/frys-ix/nokia-leaf.conf)] | [[arista-leaf](/assets/frys-ix/arista-leaf.conf)]
|
464 content/articles/2025-05-03-containerlab-1.md Normal file
@@ -0,0 +1,464 @@
|
||||
---
|
||||
date: "2025-05-03T15:07:23Z"
|
||||
title: 'VPP in Containerlab - Part 1'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
|
||||
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
|
||||
However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines
|
||||
like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to
|
||||
allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
|
||||
performs almost the same as on bare metal. But did you know that VPP can also run in Docker?
|
||||
|
||||
The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
|
||||
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
|
||||
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
|
||||
container-based networking labs. It starts the containers, builds virtual wiring between them to
|
||||
create lab topologies of the user's choice, and manages the lab lifecycle.
|
||||
|
||||
Quite regularly I am asked 'when will you add VPP to Containerlab?', but at ZANOG I made a promise
|
||||
to actually add it. Here I go, on a journey to integrate VPP into Containerlab!
|
||||
|
||||
## Containerized VPP
|
||||
|
||||
The folks at [[Tigera](https://www.tigera.io/project-calico/)] maintain a project called _Calico_,
|
||||
which accelerates Kubernetes CNI (Container Network Interface) by using [[FD.io](https://fd.io)]
|
||||
VPP. Since the origins of Kubernetes are to run containers in a Docker environment, it stands to
|
||||
reason that it should be possible to run a containerized VPP. I start by reading up on how they
|
||||
create their Docker image, and I learn a lot.
|
||||
|
||||
### Docker Build
|
||||
|
||||
Considering IPng runs bare metal Debian (currently Bookworm) machines, my Docker image will be based
|
||||
on `debian:bookworm` as well. The build starts off quite modest:
|
||||
|
||||
```
|
||||
pim@summer:~$ mkdir -p src/vpp-containerlab
|
||||
pim@summer:~/src/vpp-containerlab$ cat << EOF > Dockerfile.bookworm
|
||||
FROM debian:bookworm
|
||||
ARG DEBIAN_FRONTEND=noninteractive
|
||||
ARG VPP_INSTALL_SKIP_SYSCTL=true
|
||||
ARG REPO=release
|
||||
RUN apt-get update && apt-get -y install curl procps && apt-get clean
|
||||
|
||||
# Install VPP
|
||||
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
|
||||
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean
|
||||
|
||||
CMD ["/usr/bin/vpp","-c","/etc/vpp/startup.conf"]
|
||||
EOF
|
||||
pim@summer:~/src/vpp-containerlab$ docker build -f Dockerfile.bookworm . -t pimvanpelt/vpp-containerlab
|
||||
```
|
||||
|
||||
One gotcha - when I install the upstream VPP debian packages, they generate a `sysctl` file which the
|
||||
postinst script tries to apply. However, I can't set sysctls in the container, so the build fails. I take a look
|
||||
at the VPP source code and find `src/pkg/debian/vpp.postinst` which helpfully contains a means to
|
||||
override setting the sysctls, using an environment variable called `VPP_INSTALL_SKIP_SYSCTL`.
|
||||
|
||||
### Running VPP in Docker
|
||||
|
||||
With the Docker image built, I need to tweak the VPP startup configuration a little bit, to allow it
|
||||
to run well in a Docker environment. There are a few things I make note of:
|
||||
1. We may not have huge pages on the host machine, so I'll set all the page sizes to the
|
||||
linux-default 4kB rather than 2MB or 1GB hugepages. This creates a performance regression, but
|
||||
in the case of Containerlab, we're not here to build high performance stuff, but rather users
|
||||
will be doing functional testing.
|
||||
1. DPDK requires either UIO or VFIO kernel drivers, so that it can bind its so-called _poll mode
|
||||
driver_ to the network cards. It also requires huge pages. Since my first version will be
|
||||
   using only virtual ethernet interfaces, I'll disable DPDK and VFIO altogether.
|
||||
1. VPP can run any number of CPU worker threads. In its simplest form, I can also run it with only
|
||||
one thread. Of course, this will not be a high performance setup, but since I'm already not
|
||||
using hugepages, I'll use only 1 thread.
|
||||
|
||||
The VPP `startup.conf` configuration file I came up with:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat << EOF > clab-startup.conf
|
||||
unix {
|
||||
interactive
|
||||
log /var/log/vpp/vpp.log
|
||||
full-coredump
|
||||
cli-listen /run/vpp/cli.sock
|
||||
cli-prompt vpp-clab#
|
||||
cli-no-pager
|
||||
poll-sleep-usec 100
|
||||
}
|
||||
|
||||
api-trace {
|
||||
on
|
||||
}
|
||||
|
||||
memory {
|
||||
main-heap-size 512M
|
||||
main-heap-page-size 4k
|
||||
}
|
||||
buffers {
|
||||
buffers-per-numa 16000
|
||||
default data-size 2048
|
||||
page-size 4k
|
||||
}
|
||||
|
||||
statseg {
|
||||
size 64M
|
||||
page-size 4k
|
||||
per-node-counters on
|
||||
}
|
||||
|
||||
plugins {
|
||||
plugin default { enable }
|
||||
plugin dpdk_plugin.so { disable }
|
||||
}
|
||||
EOF
|
||||
```
|
||||
|
||||
Just a couple of notes for those who are running VPP in production. Each of the `*-page-size` config
|
||||
settings take the normal Linux pagesize of 4kB, which effectively keeps VPP from using any
|
||||
hugepages. Then, I'll specifically disable the DPDK plugin, although I didn't install it in the
|
||||
Dockerfile build, as it lives in its own dedicated Debian package called `vpp-plugin-dpdk`. Finally,
|
||||
I'll make VPP use less CPU by telling it to sleep for 100 microseconds between each poll iteration.
|
||||
In production environments, VPP will use 100% of the CPUs it's assigned, but in this lab, it will
|
||||
not be quite as hungry. By the way, even in this sleepy mode, it'll still easily handle a gigabit
|
||||
of traffic!
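
For contrast, a production-style `startup.conf` would leave the hugepage defaults alone, enable
DPDK against a real NIC and pin a few worker threads. A rough sketch, where the PCI address and the
core numbers are of course made up:

```
cpu {
  main-core 1
  corelist-workers 2-3
}

dpdk {
  dev 0000:0b:00.0 { num-rx-queues 2 }
}
```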
|
||||
|
||||
Now, VPP wants to run as root and it needs a few host features, notably tuntap devices and vhost,
|
||||
and a few capabilities, notably NET_ADMIN and SYS_PTRACE. I take a look at the
|
||||
[[manpage](https://man7.org/linux/man-pages/man7/capabilities.7.html)]:
|
||||
* ***CAP_SYS_NICE***: allows to set real-time scheduling, CPU affinity, I/O scheduling class, and
|
||||
to migrate and move memory pages.
|
||||
* ***CAP_NET_ADMIN***: allows to perform various network-related operations like interface
|
||||
configs, routing tables, nested network namespaces, multicast, set promiscuous mode, and so on.
|
||||
* ***CAP_SYS_PTRACE***: allows to trace arbitrary processes using `ptrace(2)`, and a few related
|
||||
kernel system calls.
|
||||
|
||||
Being a networking dataplane implementation, VPP wants to be able to tinker with network devices.
|
||||
This is not typically allowed in Docker containers, although the Docker developers did make some
|
||||
concessions for those containers that need just that little bit more access. They described it in
|
||||
their
|
||||
[[docs](https://docs.docker.com/engine/containers/run/#runtime-privilege-and-linux-capabilities)] as
|
||||
follows:
|
||||
|
||||
| The --privileged flag gives all capabilities to the container. When the operator executes docker
|
||||
| run --privileged, Docker enables access to all devices on the host, and reconfigures AppArmor or
|
||||
| SELinux to allow the container nearly all the same access to the host as processes running outside
|
||||
| containers on the host. Use this flag with caution. For more information about the --privileged
|
||||
| flag, see the docker run reference.
|
||||
|
||||
{{< image width="4em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
In this moment, I feel I should point out that running a Docker container with the `--privileged` flag
|
||||
set does give it _a lot_ of privileges. A container with `--privileged` is not a securely sandboxed
|
||||
process. Containers in this mode can get a root shell on the host and take control over the system.
|
||||
|
||||
With that little fineprint warning out of the way, I am going to Yolo like a boss:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker run --name clab-pim \
|
||||
--cap-add=NET_ADMIN --cap-add=SYS_NICE --cap-add=SYS_PTRACE \
|
||||
--device=/dev/net/tun:/dev/net/tun --device=/dev/vhost-net:/dev/vhost-net \
|
||||
--privileged -v $(pwd)/clab-startup.conf:/etc/vpp/startup.conf:ro \
|
||||
docker.io/pimvanpelt/vpp-containerlab
|
||||
clab-pim
|
||||
```
|
||||
|
||||
### Configuring VPP in Docker
|
||||
|
||||
And with that, the Docker container is running! I post a screenshot on
|
||||
[[Mastodon](https://ublog.tech/@IPngNetworks/114392852468494211)] and my buddy John responds with a
|
||||
polite but firm insistence that I explain myself. Here you go, buddy :)
|
||||
|
||||
In another terminal, I can play around with this VPP instance a little bit:
|
||||
```
|
||||
pim@summer:~$ docker exec -it clab-pim bash
|
||||
root@d57c3716eee9:/# ip -br l
|
||||
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
|
||||
eth0@if530566 UP 02:42:ac:11:00:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
|
||||
root@d57c3716eee9:/# ps auxw
|
||||
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
|
||||
root 1 2.2 0.2 17498852 160300 ? Rs 15:11 0:00 /usr/bin/vpp -c /etc/vpp/startup.conf
|
||||
root 10 0.0 0.0 4192 3388 pts/0 Ss 15:11 0:00 bash
|
||||
root 18 0.0 0.0 8104 4056 pts/0 R+ 15:12 0:00 ps auxw
|
||||
|
||||
root@d57c3716eee9:/# vppctl
|
||||
_______ _ _ _____ ___
|
||||
__/ __/ _ \ (_)__ | | / / _ \/ _ \
|
||||
_/ _// // / / / _ \ | |/ / ___/ ___/
|
||||
/_/ /____(_)_/\___/ |___/_/ /_/
|
||||
|
||||
vpp-clab# show version
|
||||
vpp v25.02-release built by root on d5cd2c304b7f at 2025-02-26T13:58:32
|
||||
vpp-clab# show interfaces
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
|
||||
local0 0 down 0/0/0/0
|
||||
```
|
||||
|
||||
Slick! I can see that the container has an `eth0` device, which Docker has connected to the main
|
||||
bridged network. For now, there's only one process running, pid 1 proudly shows VPP (as in Docker,
|
||||
the `CMD` field will simply replace `init`). Later on, I can imagine running a few more daemons like
|
||||
SSH and so on, but for now, I'm happy.
|
||||
|
||||
Looking at VPP itself, it has no network interfaces yet, except for the default `local0` interface.
|
||||
|
||||
### Adding Interfaces in Docker
|
||||
|
||||
But if I don't have DPDK, how will I add interfaces? Enter `veth(4)`. From the
|
||||
[[manpage](https://man7.org/linux/man-pages/man4/veth.4.html)], I learn that veth devices are
|
||||
virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to
|
||||
a physical network device in another namespace, but can also be used as standalone network devices.
|
||||
veth devices are always created in interconnected pairs.
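
As a quick illustration of that last sentence, this is all it takes to create a veth pair by hand
and move one end into a freshly created namespace (the names `lab`, `veth-host` and `veth-lab` are
made up for this example):

```
ip netns add lab
ip link add veth-host type veth peer name veth-lab
ip link set veth-lab netns lab
ip link set veth-host up
ip netns exec lab ip link set veth-lab up
```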
|
||||
|
||||
Of course, Docker users will recognize this. It's like bread and butter for containers to
|
||||
communicate with one another - and with the host they're running on. I can simply create a Docker
|
||||
network and attach one half of it to a running container, like so:
|
||||
|
||||
```
|
||||
pim@summer:~$ docker network create --driver=bridge clab-network \
|
||||
--subnet 192.0.2.0/24 --ipv6 --subnet 2001:db8::/64
|
||||
5711b95c6c32ac0ed185a54f39e5af4b499677171ff3d00f99497034e09320d2
|
||||
pim@summer:~$ docker network connect clab-network clab-pim --ip '' --ip6 ''
|
||||
```
|
||||
|
||||
The first command here creates a new network called `clab-network` in Docker. As a result, a new
|
||||
bridge called `br-5711b95c6c32` shows up on the host. The bridge name is chosen from the UUID of the
|
||||
Docker object. Seeing as I added an IPv4 and IPv6 subnet to the bridge, it gets configured with the
|
||||
first address in both:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ brctl show br-5711b95c6c32
|
||||
bridge name bridge id STP enabled interfaces
|
||||
br-5711b95c6c32 8000.0242099728c6 no veth021e363
|
||||
|
||||
|
||||
pim@summer:~/src/vpp-containerlab$ ip -br a show dev br-5711b95c6c32
|
||||
br-5711b95c6c32 UP 192.0.2.1/24 2001:db8::1/64 fe80::42:9ff:fe97:28c6/64 fe80::1/64
|
||||
```
|
||||
|
||||
The second command creates a `veth` pair, and puts one half of it in the bridge, and this interface
|
||||
is called `veth021e363` above. The other half of it pops up as `eth1` in the Docker container:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker exec -it clab-pim bash
|
||||
root@d57c3716eee9:/# ip -br l
|
||||
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
|
||||
eth0@if530566 UP 02:42:ac:11:00:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
eth1@if530577 UP 02:42:c0:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
```
|
||||
|
||||
One of the many awesome features of VPP is its ability to attach to these `veth` devices by means of
|
||||
its `af-packet` driver, by reusing the same MAC address (in this case `02:42:c0:00:02:02`). I first
|
||||
take a look at the Linux [[manpage](https://man7.org/linux/man-pages/man7/packet.7.html)] for it,
|
||||
and then read up on the VPP
|
||||
[[documentation](https://fd.io/docs/vpp/v2101/gettingstarted/progressivevpp/interface)] on the
|
||||
topic.
|
||||
|
||||
|
||||
However, my attention is drawn to Docker assigning an IPv4 and IPv6 address to the container:
|
||||
```
|
||||
root@d57c3716eee9:/# ip -br a
|
||||
lo UNKNOWN 127.0.0.1/8 ::1/128
|
||||
eth0@if530566 UP 172.17.0.2/16
|
||||
eth1@if530577 UP 192.0.2.2/24 2001:db8::2/64 fe80::42:c0ff:fe00:202/64
|
||||
root@d57c3716eee9:/# ip addr del 192.0.2.2/24 dev eth1
|
||||
root@d57c3716eee9:/# ip addr del 2001:db8::2/64 dev eth1
|
||||
```
|
||||
|
||||
I decide to remove them from here, as in the end, `eth1` will be owned by VPP so _it_ should be
|
||||
setting the IPv4 and IPv6 addresses. For the life of me, I don't see how I can keep Docker from
|
||||
assigning IPv4 and IPv6 addresses to this container ... and the
|
||||
[[docs](https://docs.docker.com/engine/network/)] seem to be off as well, as they suggest I can pass
|
||||
a flag `--ipv4=False`, but that flag doesn't exist, at least not on my Bookworm Docker variant. I
|
||||
make a mental note to discuss this with the folks in the Containerlab community.
|
||||
|
||||
|
||||
Anyway, armed with this knowledge I can bind the container-side half of the veth pair, called `eth1`, to VPP, like
|
||||
so:
|
||||
|
||||
```
|
||||
root@d57c3716eee9:/# vppctl
|
||||
_______ _ _ _____ ___
|
||||
__/ __/ _ \ (_)__ | | / / _ \/ _ \
|
||||
_/ _// // / / / _ \ | |/ / ___/ ___/
|
||||
/_/ /____(_)_/\___/ |___/_/ /_/
|
||||
|
||||
vpp-clab# create host-interface name eth1 hw-addr 02:42:c0:00:02:02
|
||||
vpp-clab# set interface name host-eth1 eth1
|
||||
vpp-clab# set interface mtu 1500 eth1
|
||||
vpp-clab# set interface ip address eth1 192.0.2.2/24
|
||||
vpp-clab# set interface ip address eth1 2001:db8::2/64
|
||||
vpp-clab# set interface state eth1 up
|
||||
vpp-clab# show int addr
|
||||
eth1 (up):
|
||||
L3 192.0.2.2/24
|
||||
L3 2001:db8::2/64
|
||||
local0 (dn):
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
After all this work, I've successfully created a Docker image based on Debian Bookworm and VPP 25.02
|
||||
(the current stable release version), started a container with it, added a network bridge in Docker,
|
||||
which binds the host `summer` to the container. Proof, as they say, is in the ping-pudding:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ ping -c5 2001:db8::2
|
||||
PING 2001:db8::2(2001:db8::2) 56 data bytes
|
||||
64 bytes from 2001:db8::2: icmp_seq=1 ttl=64 time=0.113 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=2 ttl=64 time=0.056 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=3 ttl=64 time=0.202 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=4 ttl=64 time=0.102 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=5 ttl=64 time=0.100 ms
|
||||
|
||||
--- 2001:db8::2 ping statistics ---
|
||||
5 packets transmitted, 5 received, 0% packet loss, time 4098ms
|
||||
rtt min/avg/max/mdev = 0.056/0.114/0.202/0.047 ms
|
||||
pim@summer:~/src/vpp-containerlab$ ping -c5 192.0.2.2
|
||||
PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data.
|
||||
64 bytes from 192.0.2.2: icmp_seq=1 ttl=64 time=0.043 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=2 ttl=64 time=0.032 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=3 ttl=64 time=0.019 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=4 ttl=64 time=0.041 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=5 ttl=64 time=0.027 ms
|
||||
|
||||
--- 192.0.2.2 ping statistics ---
|
||||
5 packets transmitted, 5 received, 0% packet loss, time 4063ms
|
||||
rtt min/avg/max/mdev = 0.019/0.032/0.043/0.008 ms
|
||||
```
|
||||
|
||||
And in case that simple ping-test wasn't enough to get you excited, here's a packet trace from VPP
|
||||
itself, while I'm performing this ping:
|
||||
|
||||
```
|
||||
vpp-clab# trace add af-packet-input 100
|
||||
vpp-clab# wait 3
|
||||
vpp-clab# show trace
|
||||
------------------- Start of thread 0 vpp_main -------------------
|
||||
Packet 1
|
||||
|
||||
00:07:03:979275: af-packet-input
|
||||
af_packet: hw_if_index 1 rx-queue 0 next-index 4
|
||||
block 47:
|
||||
address 0x7fbf23b7d000 version 2 seq_num 48 pkt_num 0
|
||||
tpacket3_hdr:
|
||||
status 0x20000001 len 98 snaplen 98 mac 92 net 106
|
||||
sec 0x68164381 nsec 0x258e7659 vlan 0 vlan_tpid 0
|
||||
vnet-hdr:
|
||||
flags 0x00 gso_type 0x00 hdr_len 0
|
||||
gso_size 0 csum_start 0 csum_offset 0
|
||||
00:07:03:979293: ethernet-input
|
||||
IP4: 02:42:09:97:28:c6 -> 02:42:c0:00:02:02
|
||||
00:07:03:979306: ip4-input
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979315: ip4-lookup
|
||||
fib 0 dpo-idx 9 flow hash: 0x00000000
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979322: ip4-receive
|
||||
fib:0 adj:9 flow:0x00000000
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979323: ip4-icmp-input
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979323: ip4-icmp-echo-request
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979326: ip4-load-balance
|
||||
fib 0 dpo-idx 5 flow hash: 0x00000000
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
00:07:03:979325: ip4-rewrite
|
||||
tx_sw_if_index 1 dpo-idx 5 : ipv4 via 192.0.2.1 eth1: mtu:1500 next:3 flags:[] 0242099728c60242c00002020800 flow hash: 0x00000000
|
||||
00000000: 0242099728c60242c00002020800450000542dc44000400188e1c0000202c000
|
||||
00000020: 02010000141652cd00018143166800000000399d0900000000001011
|
||||
00:07:03:979326: eth1-output
|
||||
eth1 flags 0x02180005
|
||||
IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
00:07:03:979327: eth1-tx
|
||||
af_packet: hw_if_index 1 tx-queue 0
|
||||
tpacket3_hdr:
|
||||
status 0x1 len 108 snaplen 108 mac 0 net 0
|
||||
sec 0x0 nsec 0x0 vlan 0 vlan_tpid 0
|
||||
vnet-hdr:
|
||||
flags 0x00 gso_type 0x00 hdr_len 0
|
||||
gso_size 0 csum_start 0 csum_offset 0
|
||||
buffer 0xf97c4:
|
||||
current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
|
||||
local l2-hdr-offset 0 l3-hdr-offset 14
|
||||
IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
```
|
||||
|
||||
Well, that's a mouthful, isn't it! Here, I get to show you VPP in action. After receiving the
|
||||
packet on its `af-packet-input` node from 192.0.2.1 (Summer, who is pinging us) to 192.0.2.2 (the
|
||||
VPP container), the packet traverses the dataplane graph. It goes through `ethernet-input`, then
|
||||
`ip4-input`, which sees it's destined to a locally configured IPv4 address, so the packet is handed to
|
||||
`ip4-receive`. That one sees that the IP protocol is ICMP, so it hands the packet to
|
||||
`ip4-icmp-input` which notices that the packet is an ICMP echo request, so off to
|
||||
`ip4-icmp-echo-request` our little packet goes. The ICMP plugin in VPP now answers by
|
||||
`ip4-rewrite`'ing the packet, sending the return to 192.0.2.1 at MAC address `02:42:09:97:28:c6`
|
||||
(this is Summer, the host doing the pinging!), after which the newly created ICMP echo-reply is
|
||||
handed to `eth1-output` which marshals it back into the kernel's AF_PACKET interface using
|
||||
`eth1-tx`.
|
||||
|
||||
Boom. I could not be more pleased.
|
||||
|
||||
## What's Next
|
||||
|
||||
This was a nice exercise for me! I'm going this direction because the
|
||||
[[Containerlab](https://containerlab.dev)] framework will start containers with given NOS images,
|
||||
not too dissimilar from the one I just made, and then attaches `veth` pairs between the containers.
|
||||
I started dabbling with a [[pull-request](https://github.com/srl-labs/containerlab/pull/2571)], but
|
||||
I got stuck with a part of the Containerlab code that pre-deploys config files into the containers.
|
||||
You see, I will need to generate two files:
|
||||
|
||||
1. A `startup.conf` file that is specific to the containerlab Docker container. I'd like them to
|
||||
each set their own hostname so that the CLI has a unique prompt. I can do this by setting `unix
|
||||
{ cli-prompt {{ .ShortName }}# }` in the template renderer.
|
||||
1. Containerlab will know all of the veth pairs that are planned to be created into each VPP
|
||||
container. I'll need it to then write a little snippet of config that does the `create
|
||||
host-interface` spiel, to attach these `veth` pairs to the VPP dataplane.
|
||||
|
||||
I reached out to Roman from Nokia, who is one of the authors and current maintainer of Containerlab.
|
||||
Roman was keen to help out, and seeing as he knows the Containerlab stuff well, and I know the VPP
|
||||
stuff well, this is a reasonable partnership! Soon, he and I plan to have a bare-bones setup that
|
||||
will connect a few VPP containers together with an SR Linux node in a lab. Stand by!
|
||||
|
||||
Once we have that, there's still quite some work for me to do. Notably:
|
||||
* Configuration persistence. `clab` allows you to save the running config. For that, I'll need to
|
||||
introduce [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] and a means to invoke it when
|
||||
the lab operator wants to save their config, and then reconfigure VPP when the container
|
||||
restarts.
|
||||
* I'll need to have a few files from `clab` shared with the host, notably the `startup.conf` and
|
||||
`vppcfg.yaml`, as well as some manual pre- and post-flight configuration for the more esoteric
|
||||
stuff. Building the plumbing for this is a TODO for now.
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
I wanted to give a shout-out to Nardus le Roux who inspired me to contribute this Containerlab VPP
|
||||
node type, and to Roman Dodin for his help getting the Containerlab parts squared away when I got a
|
||||
little bit stuck.
|
||||
|
||||
First order of business: get it to ping at all ... it'll go faster from there on out :)
|
373 content/articles/2025-05-04-containerlab-2.md Normal file
@@ -0,0 +1,373 @@
|
||||
---
|
||||
date: "2025-05-04T15:07:23Z"
|
||||
title: 'VPP in Containerlab - Part 2'
|
||||
params:
|
||||
asciinema: true
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
|
||||
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
|
||||
However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines
|
||||
like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to
|
||||
allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
|
||||
performs almost the same as on bare metal. But did you know that VPP can also run in Docker?
|
||||
|
||||
The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
|
||||
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
|
||||
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
|
||||
container-based networking labs. It starts the containers, builds virtual wiring between them to
|
||||
create lab topologies of users' choice and manages the lab lifecycle.
|
||||
|
||||
Quite regularly I am asked 'when will you add VPP to Containerlab?', but at ZANOG I made a promise
|
||||
to actually add it. In my previous [[article]({{< ref 2025-05-03-containerlab-1.md >}})], I took
|
||||
a good look at VPP as a dockerized container. In this article, I'll explore how to make such a
|
||||
container run in Containerlab!
|
||||
|
||||
## Completing the Docker container
|
||||
|
||||
Just having VPP running by itself in a container is not super useful (although it _is_ cool!). I
|
||||
decide first to add a few bits and bobs that will come in handy in the `Dockerfile`:
|
||||
|
||||
```
|
||||
FROM debian:bookworm
|
||||
ARG DEBIAN_FRONTEND=noninteractive
|
||||
ARG VPP_INSTALL_SKIP_SYSCTL=true
|
||||
ARG REPO=release
|
||||
EXPOSE 22/tcp
|
||||
RUN apt-get update && apt-get -y install curl procps tcpdump iproute2 iptables \
|
||||
iputils-ping net-tools git python3 python3-pip vim-tiny openssh-server bird2 \
|
||||
mtr-tiny traceroute && apt-get clean
|
||||
|
||||
# Install VPP
|
||||
RUN mkdir -p /var/log/vpp /root/.ssh/
|
||||
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
|
||||
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean
|
||||
|
||||
# Build vppcfg
|
||||
RUN pip install --break-system-packages build netaddr yamale argparse pyyaml ipaddress
|
||||
RUN git clone https://git.ipng.ch/ipng/vppcfg.git && cd vppcfg && python3 -m build && \
|
||||
pip install --break-system-packages dist/vppcfg-*-py3-none-any.whl
|
||||
|
||||
# Config files
|
||||
COPY files/etc/vpp/* /etc/vpp/
|
||||
COPY files/etc/bird/* /etc/bird/
|
||||
COPY files/init-container.sh /sbin/
|
||||
RUN chmod 755 /sbin/init-container.sh
|
||||
CMD ["/sbin/init-container.sh"]
|
||||
```
|
||||
|
||||
A few notable additions:
|
||||
* ***vppcfg*** is a handy utility I wrote and discussed in a previous [[article]({{< ref
|
||||
2022-04-02-vppcfg-2 >}})]. Its purpose is to take a YAML file that describes the configuration of
|
||||
the dataplane (like which interfaces, sub-interfaces, MTU, IP addresses and so on), and then
|
||||
apply this safely to a running dataplane. You can check it out in my
|
||||
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] git repository.
|
||||
* ***openssh-server*** will come in handy to log in to the container, in addition to the already
|
||||
available `docker exec`.
|
||||
* ***bird2*** which will be my controlplane of choice. At a future date, I might also add FRR,
|
||||
which may be a good alternative for some. VPP works well with both. You can check out Bird on
|
||||
the nic.cz [[website](https://bird.network.cz/?get_doc&f=bird.html&v=20)].
|
||||
|
||||
I'll add a couple of default config files for Bird and VPP, and replace the CMD with a generic
|
||||
`/sbin/init-container.sh` in which I can do any late binding stuff before launching VPP.
|
||||
|
||||
### Initializing the Container
|
||||
|
||||
#### VPP Containerlab: NetNS
|
||||
|
||||
VPP's Linux Control Plane plugin wants to run in its own network namespace. So the first order of
|
||||
business of `/sbin/init-container.sh` is to create it:
|
||||
|
||||
```
|
||||
NETNS=${NETNS:="dataplane"}
|
||||
|
||||
echo "Creating dataplane namespace"
|
||||
/usr/bin/mkdir -p /etc/netns/$NETNS
|
||||
/usr/bin/touch /etc/netns/$NETNS/resolv.conf
|
||||
/usr/sbin/ip netns add $NETNS
|
||||
```
|
||||
|
||||
#### VPP Containerlab: SSH
|
||||
|
||||
Then, I'll set the root password (which is `vpp` by the way), and start an SSH daemon which allows
|
||||
for password-less logins:
|
||||
|
||||
```
|
||||
echo "Starting SSH, with credentials root:vpp"
|
||||
sed -i -e 's,^#PermitRootLogin prohibit-password,PermitRootLogin yes,' /etc/ssh/sshd_config
|
||||
sed -i -e 's,^root:.*,root:$y$j9T$kG8pyZEVmwLXEtXekQCRK.$9iJxq/bEx5buni1hrC8VmvkDHRy7ZMsw9wYvwrzexID:20211::::::,' /etc/shadow
|
||||
/etc/init.d/ssh start
|
||||
```
|
||||
|
||||
#### VPP Containerlab: Bird2
|
||||
|
||||
I can already predict that Bird2 won't be the only option for a controlplane, even though I'm a huge
|
||||
fan of it. Therefore, I'll make it configurable to leave the door open for other controlplane
|
||||
implementations in the future:
|
||||
|
||||
```
|
||||
BIRD_ENABLED=${BIRD_ENABLED:="true"}
|
||||
|
||||
if [ "$BIRD_ENABLED" == "true" ]; then
|
||||
echo "Starting Bird in $NETNS"
|
||||
mkdir -p /run/bird /var/log/bird
|
||||
chown bird:bird /var/log/bird
|
||||
ROUTERID=$(ip -br a show eth0 | awk '{ print $3 }' | cut -f1 -d/)
|
||||
sed -i -e "s,.*router id .*,router id $ROUTERID; # Set by container-init.sh," /etc/bird/bird.conf
|
||||
/usr/bin/nsenter --net=/var/run/netns/$NETNS /usr/sbin/bird -u bird -g bird
|
||||
fi
|
||||
```
|
||||
|
||||
I am reminded that Bird won't start if it cannot determine its _router id_. When I start it in the
|
||||
`dataplane` namespace, it will immediately exit, because there will be no IP addresses configured
|
||||
yet. But luckily, it logs its complaint and it's easily addressed. I decide to take the management
|
||||
IPv4 address from `eth0` and write that into the `bird.conf` file, which otherwise does some basic
|
||||
initialization that I described in a previous [[article]({{< ref 2021-09-02-vpp-5 >}})], so I'll
|
||||
skip that here. However, I do include an empty file called `/etc/bird/bird-local.conf` for users to
|
||||
further configure Bird2.
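
As an example of the kind of thing a user might drop into that file -- this is not shipped with the
container, just a minimal Bird2 snippet to show that the hook works:

```
# /etc/bird/bird-local.conf (example)
protocol static mystatic {
  ipv4;
  route 203.0.113.0/24 unreachable;
}
```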
|
||||
|
||||
#### VPP Containerlab: Binding veth pairs
|
||||
|
||||
When Containerlab starts the VPP container, it'll offer it a set of `veth` ports that connect this
|
||||
container to other nodes in the lab. This is done by the `links` list in the topology file
|
||||
[[ref](https://containerlab.dev/manual/network/)]. It's my goal to take all of the interfaces
|
||||
that are of type `veth`, and generate a little snippet to grab them and bind them into VPP while
|
||||
setting their MTU to 9216 to allow for jumbo frames:
|
||||
|
||||
```
|
||||
CLAB_VPP_FILE=${CLAB_VPP_FILE:=/etc/vpp/clab.vpp}
|
||||
|
||||
echo "Generating $CLAB_VPP_FILE"
|
||||
: > $CLAB_VPP_FILE
|
||||
MTU=9216
|
||||
for IFNAME in $(ip -br link show type veth | cut -f1 -d@ | grep -v '^eth0$' | sort); do
|
||||
MAC=$(ip -br link show dev $IFNAME | awk '{ print $3 }')
|
||||
echo " * $IFNAME hw-addr $MAC mtu $MTU"
|
||||
ip link set $IFNAME up mtu $MTU
|
||||
cat << EOF >> $CLAB_VPP_FILE
|
||||
create host-interface name $IFNAME hw-addr $MAC
|
||||
set interface name host-$IFNAME $IFNAME
|
||||
set interface mtu $MTU $IFNAME
|
||||
set interface state $IFNAME up
|
||||
|
||||
EOF
|
||||
done
|
||||
```
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
One thing I realized is that VPP will assign a random MAC address on its copy of the `veth` port,
|
||||
which is not great. I'll explicitly configure it with the same MAC address as the `veth` interface
|
||||
itself, otherwise I'd have to put the interface into promiscuous mode.
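To double-check that this worked, I can compare the MAC that VPP reports against the kernel's view of the `veth` -- a quick sanity check, and the exact output will of course differ per lab:

```
vppctl show hardware-interfaces    # the 'Ethernet address' lines should match the veth MACs
ip -br link                        # the kernel's view of the same veth interfaces
```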
|
||||
|
||||
#### VPP Containerlab: VPPcfg
|
||||
|
||||
I'm almost ready, but I have one more detail. The user will be able to offer a
|
||||
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] YAML file to configure the interfaces and so on. If such
|
||||
a file exists, I'll apply it to the dataplane upon startup:
|
||||
|
||||
```
|
||||
VPPCFG_VPP_FILE=${VPPCFG_VPP_FILE:=/etc/vpp/vppcfg.vpp}
|
||||
|
||||
echo "Generating $VPPCFG_VPP_FILE"
|
||||
: > $VPPCFG_VPP_FILE
|
||||
if [ -r /etc/vpp/vppcfg.yaml ]; then
|
||||
vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml -o $VPPCFG_VPP_FILE
|
||||
fi
|
||||
```
|
||||
|
||||
Once the VPP process starts, it'll execute `/etc/vpp/bootstrap.vpp`, which in turn executes these
|
||||
newly generated `/etc/vpp/clab.vpp` to grab the `veth` interfaces, and then `/etc/vpp/vppcfg.vpp` to
|
||||
further configure the dataplane. Easy peasy!
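The bootstrap file itself can stay tiny. A minimal sketch of what `/etc/vpp/bootstrap.vpp` could look like, assuming it simply chains the two generated files using VPP's `exec` command:

```
comment { bootstrap.vpp -- sketch: source the generated snippets }
exec /etc/vpp/clab.vpp
exec /etc/vpp/vppcfg.vpp
```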
|
||||
|
||||
### Adding VPP to Containerlab
|
||||
|
||||
Roman points out a previous integration for the 6WIND VSR in
|
||||
[[PR#2540](https://github.com/srl-labs/containerlab/pull/2540)]. This serves as a useful guide to
|
||||
get me started. I fork the repo, create a branch so that Roman can also add a few commits, and
|
||||
together we start hacking in [[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)].
|
||||
|
||||
First, I add the documentation skeleton in `docs/manual/kinds/fdio_vpp.md`, which is linked from a
|
||||
few other places, and will be where the end-user facing documentation will live. That's about half
|
||||
the contributed LOC, right there!
|
||||
|
||||
Next, I'll create a Go module in `nodes/fdio_vpp/fdio_vpp.go` which doesn't do much other than
|
||||
creating the `struct`, and its required `Register` and `Init` functions. The `Init` function ensures
|
||||
the right capabilities are set in Docker, and the right devices are bound for the container.
|
||||
|
||||
I notice that Containerlab rewrites the Dockerfile `CMD` string and prepends an `if-wait.sh` script
|
||||
to it. This is because when Containerlab starts the container, it'll still be busy adding these
|
||||
`link` interfaces to it, and if a container starts too quickly, it may not see all the interfaces.
|
||||
So, Containerlab informs the container via an environment variable called `CLAB_INTFS`, and this
|
||||
script simply sleeps until that exact number of interfaces is present. OK, cool beans.
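I won't reproduce Containerlab's actual `if-wait.sh` here, but the idea fits in a few lines of shell -- a sketch only, counting non-loopback links until `CLAB_INTFS` of them have appeared:

```
#!/bin/sh
# Sketch only: block until Containerlab has attached all expected interfaces.
EXPECTED=${CLAB_INTFS:-0}
while [ "$(ip -o link show | grep -cv ': lo:')" -lt "$EXPECTED" ]; do
  echo "waiting for $EXPECTED interfaces to appear..."
  sleep 1
done
exec "$@"    # hand over to the original CMD
```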
|
||||
|
||||
Roman helps me a bit with Go templating. You see, I think it'll be slick to have the CLI prompt for
|
||||
the VPP containers reflect their hostname, because normally VPP will assign `vpp# `. I add the
|
||||
template in `nodes/fdio_vpp/vpp_startup_config.go.tpl` and it only has one variable expansion: `unix
|
||||
{ cli-prompt {{ .ShortName }}# }`. But I totally think it's worth it, because when running many VPP
|
||||
containers in the lab, it could otherwise get confusing.
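For a node named `vpp1`, the rendered startup-config fragment then comes out as something like:

```
unix { cli-prompt vpp1# }
```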
|
||||
|
||||
Roman also shows me a trick in the function `PostDeploy()`, which will write the user's SSH pubkeys
|
||||
to `/root/.ssh/authorized_keys`. This allows users to log in without having to use password
|
||||
authentication.
|
||||
|
||||
Collectively, we decide to punt on the `SaveConfig` function until we're a bit further along. I have
|
||||
an idea how this would work, basically along the lines of calling `vppcfg dump` and bind-mounting
|
||||
that file into the lab directory somewhere. This way, upon restarting, the YAML file can be re-read
|
||||
and the dataplane initialized. But it'll be for another day.
|
||||
|
||||
After the main module is finished, all I have to do is add it to `clab/register.go` and that's just
|
||||
about it. In about 170 lines of code, 50 lines of Go template, and 170 lines of Markdown, this
|
||||
contribution is about ready to ship!
|
||||
|
||||
### Containerlab: Demo
|
||||
|
||||
After I finish writing the documentation, I decide to include a demo with a quickstart to help folks
|
||||
along. A simple lab showing two VPP instances and two Alpine Linux clients can be found on
|
||||
[[git.ipng.ch/ipng/vpp-containerlab](https://git.ipng.ch/ipng/vpp-containerlab)]. Simply check out the
|
||||
repo and start the lab, like so:
|
||||
|
||||
```
|
||||
$ git clone https://git.ipng.ch/ipng/vpp-containerlab.git
|
||||
$ cd vpp-containerlab
|
||||
$ containerlab deploy --topo vpp.clab.yml
|
||||
```
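Once you're done with the lab, the usual Containerlab housekeeping applies; these are stock `containerlab` subcommands, nothing specific to the VPP kind:

```
$ containerlab inspect --topo vpp.clab.yml   # list the nodes, their state and management IPs
$ containerlab destroy --topo vpp.clab.yml   # tear the lab down again
```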
|
||||
|
||||
#### Containerlab: configs
|
||||
|
||||
The file `vpp.clab.yml` contains an example topology consisting of two VPP instances, each connected to an Alpine Linux container, in the following topology:
|
||||
|
||||
{{< image src="/assets/containerlab/learn-vpp.png" alt="Containerlab Topo" width="100%" >}}
|
||||
|
||||
Two relevant files for each VPP router are included in this
|
||||
[[repository](https://git.ipng.ch/ipng/vpp-containerlab)]:
|
||||
1. `config/vpp*/vppcfg.yaml` configures the dataplane interfaces, including a loopback address.
|
||||
1. `config/vpp*/bird-local.conf` configures the controlplane to enable BFD and OSPF.
|
||||
|
||||
To illustrate these files, let me take a closer look at node `vpp1`. Its VPP dataplane
|
||||
configuration looks like this:
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/vppcfg.yaml
|
||||
interfaces:
|
||||
eth1:
|
||||
description: 'To client1'
|
||||
mtu: 1500
|
||||
lcp: eth1
|
||||
addresses: [ 10.82.98.65/28, 2001:db8:8298:101::1/64 ]
|
||||
eth2:
|
||||
description: 'To vpp2'
|
||||
mtu: 9216
|
||||
lcp: eth2
|
||||
addresses: [ 10.82.98.16/31, 2001:db8:8298:1::1/64 ]
|
||||
loopbacks:
|
||||
loop0:
|
||||
description: 'vpp1'
|
||||
lcp: loop0
|
||||
addresses: [ 10.82.98.0/32, 2001:db8:8298::/128 ]
|
||||
```
|
||||
|
||||
Then, I enable BFD, OSPF and OSPFv3 on `eth2` and `loop0` on both of the VPP routers:
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/bird-local.conf
|
||||
protocol bfd bfd1 {
|
||||
interface "eth2" { interval 100 ms; multiplier 30; };
|
||||
}
|
||||
|
||||
protocol ospf v2 ospf4 {
|
||||
ipv4 { import all; export all; };
|
||||
area 0 {
|
||||
interface "loop0" { stub yes; };
|
||||
interface "eth2" { type pointopoint; cost 10; bfd on; };
|
||||
};
|
||||
}
|
||||
|
||||
protocol ospf v3 ospf6 {
|
||||
ipv6 { import all; export all; };
|
||||
area 0 {
|
||||
interface "loop0" { stub yes; };
|
||||
interface "eth2" { type pointopoint; cost 10; bfd on; };
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
#### Containerlab: playtime!
|
||||
|
||||
Once the lab comes up, I can SSH to the VPP containers (`vpp1` and `vpp2`) which have my SSH pubkeys
|
||||
installed thanks to Roman's work. Failing that, I can still log in as user `root` using
|
||||
password `vpp`. VPP runs its own network namespace called `dataplane`, which is very similar to SR
|
||||
Linux's default `network-instance`. I can join that namespace to take a closer look:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ ssh root@vpp1
|
||||
root@vpp1:~# nsenter --net=/var/run/netns/dataplane
|
||||
root@vpp1:~# ip -br a
|
||||
lo DOWN
|
||||
loop0 UP 10.82.98.0/32 2001:db8:8298::/128 fe80::dcad:ff:fe00:0/64
|
||||
eth1 UNKNOWN 10.82.98.65/28 2001:db8:8298:101::1/64 fe80::a8c1:abff:fe77:acb9/64
|
||||
eth2 UNKNOWN 10.82.98.16/31 2001:db8:8298:1::1/64 fe80::a8c1:abff:fef0:7125/64
|
||||
|
||||
root@vpp1:~# ping 10.82.98.1
|
||||
PING 10.82.98.1 (10.82.98.1) 56(84) bytes of data.
|
||||
64 bytes from 10.82.98.1: icmp_seq=1 ttl=64 time=9.53 ms
|
||||
64 bytes from 10.82.98.1: icmp_seq=2 ttl=64 time=15.9 ms
|
||||
^C
|
||||
--- 10.82.98.1 ping statistics ---
|
||||
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
|
||||
rtt min/avg/max/mdev = 9.530/12.735/15.941/3.205 ms
|
||||
```
|
||||
|
||||
From `vpp1`, I can tell that Bird2's OSPF adjacency has formed, because I can ping the `loop0`
|
||||
address of the `vpp2` router at 10.82.98.1. Nice! The two client nodes are running a minimalistic Alpine
|
||||
Linux container, which doesn't ship with SSH by default. But of course I can still enter the
|
||||
containers using `docker exec`, like so:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker exec -it client1 sh
|
||||
/ # ip addr show dev eth1
|
||||
531235: eth1@if531234: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 9500 qdisc noqueue state UP
|
||||
link/ether 00:c1:ab:00:00:01 brd ff:ff:ff:ff:ff:ff
|
||||
inet 10.82.98.66/28 scope global eth1
|
||||
valid_lft forever preferred_lft forever
|
||||
inet6 2001:db8:8298:101::2/64 scope global
|
||||
valid_lft forever preferred_lft forever
|
||||
inet6 fe80::2c1:abff:fe00:1/64 scope link
|
||||
valid_lft forever preferred_lft forever
|
||||
/ # traceroute 10.82.98.82
|
||||
traceroute to 10.82.98.82 (10.82.98.82), 30 hops max, 46 byte packets
|
||||
1 10.82.98.65 (10.82.98.65) 5.906 ms 7.086 ms 7.868 ms
|
||||
2 10.82.98.17 (10.82.98.17) 24.007 ms 23.349 ms 15.933 ms
|
||||
3 10.82.98.82 (10.82.98.82) 39.978 ms 31.127 ms 31.854 ms
|
||||
|
||||
/ # traceroute 2001:db8:8298:102::2
|
||||
traceroute to 2001:db8:8298:102::2 (2001:db8:8298:102::2), 30 hops max, 72 byte packets
|
||||
1 2001:db8:8298:101::1 (2001:db8:8298:101::1) 0.701 ms 7.144 ms 7.900 ms
|
||||
2 2001:db8:8298:1::2 (2001:db8:8298:1::2) 23.909 ms 22.943 ms 23.893 ms
|
||||
3 2001:db8:8298:102::2 (2001:db8:8298:102::2) 31.964 ms 30.814 ms 32.000 ms
|
||||
```
|
||||
|
||||
From the vantage point of `client1`, the first hop represents the `vpp1` node, which forwards to
|
||||
`vpp2`, which finally forwards to `client2`, which shows that both VPP routers are passing traffic.
|
||||
Dope!
|
||||
|
||||
## Results
|
||||
|
||||
After all of this deep-diving, all that's left is for me to demonstrate the Containerlab by means of
|
||||
this little screencast [[asciinema](/assets/containerlab/vpp-containerlab.cast)]. I hope you enjoy
|
||||
it as much as I enjoyed creating it:
|
||||
|
||||
{{< asciinema src="/assets/containerlab/vpp-containerlab.cast" >}}
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
I wanted to give a shout-out to Roman Dodin for his help getting the Containerlab parts squared away
|
||||
when I got a little bit stuck. He took the time to explain the internals and idioms of the Containerlab
|
||||
project, which really saved me a tonne of time. He also pair-programmed the
|
||||
[[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)] with me over the span of two
|
||||
evenings.
|
||||
|
||||
Collaborative open source rocks!
|
713
content/articles/2025-05-28-minio-1.md
Normal file
@@ -0,0 +1,713 @@
|
||||
---
|
||||
date: "2025-05-28T22:07:23Z"
|
||||
title: 'Case Study: Minio S3 - Part 1'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/minio/minio-logo.png" alt="MinIO Logo" width="6em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading
|
||||
scalability, data availability, security, and performance. Millions of customers of all sizes and
|
||||
industries store, manage, analyze, and protect any amount of data for virtually any use case, such
|
||||
as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and
|
||||
easy-to-use management features, you can optimize costs, organize and analyze data, and configure
|
||||
fine-tuned access controls to meet specific business and compliance requirements.
|
||||
|
||||
Amazon's S3 became the _de facto_ standard object storage system, and there exist several fully open
|
||||
source implementations of the protocol. One of them is MinIO: designed to allow enterprises to
|
||||
consolidate all of their data on a single, private cloud namespace. Architected using the same
|
||||
principles as the hyperscalers, AIStor delivers performance at scale at a fraction of the cost
|
||||
compared to the public cloud.
|
||||
|
||||
IPng Networks is an Internet Service Provider, but I also dabble in self-hosting things, for
|
||||
example [[PeerTube](https://video.ipng.ch/)], [[Mastodon](https://ublog.tech/)],
|
||||
[[Immich](https://photos.ipng.ch/)], [[Pixelfed](https://pix.ublog.tech/)] and of course
|
||||
[[Hugo](https://ipng.ch/)]. These services all have one thing in common: they tend to use lots of
|
||||
storage when they grow. At IPng Networks, all hypervisors ship with enterprise SAS flash drives,
|
||||
mostly 1.92TB and 3.84TB. Scaling up each of these services, and backing them up safely, can be
|
||||
quite the headache.
|
||||
|
||||
This article is for the storage buffs. I'll set up a set of distributed MinIO nodes from scratch.
|
||||
|
||||
## Physical
|
||||
|
||||
{{< image float="right" src="/assets/minio/disks.png" alt="MinIO Disks" width="16em" >}}
|
||||
|
||||
I'll start with the basics. I still have a few Dell R720 servers laying around, they are getting a
|
||||
bit older but still have 24 cores and 64GB of memory. First I need to get me some disks. I order
|
||||
36 pcs of 16TB SATA enterprise disks, a mixture of Seagate EXOS and Toshiba MG series drives. I once learned (the hard way) that buying a big stack of disks from one production run is a risk, so I'll
|
||||
mix and match the drives.
|
||||
|
||||
Three trays of caddies and a melted credit card later, I have 576TB of SATA disks safely in hand.
|
||||
Each machine will carry 192TB of raw storage. The nice thing about this chassis is that Dell can
|
||||
ship them with 12x 3.5" SAS slots in the front, and 2x 2.5" SAS slots in the rear of the chassis.
|
||||
|
||||
So I'll install Debian Bookworm on one small 480G SSD in software RAID1.
|
||||
|
||||
### Cloning an install
|
||||
|
||||
I have three identical machines so in total I'll want six of these SSDs. I temporarily screw the
|
||||
other five in 3.5" drive caddies and plug them into the first installed Dell, which I've called
|
||||
`minio-proto`:
|
||||
|
||||
|
||||
```
|
||||
pim@minio-proto:~$ for i in b c d e f; do
|
||||
sudo dd if=/dev/sda of=/dev/sd${i} bs=512 count=1;
|
||||
sudo mdadm --manage /dev/md0 --add /dev/sd${i}1
|
||||
done
|
||||
pim@minio-proto:~$ sudo mdadm --grow /dev/md0 --raid-devices=6
|
||||
pim@minio-proto:~$ watch cat /proc/mdstat
|
||||
pim@minio-proto:~$ for i in a b c d e f; do
|
||||
sudo grub-install /dev/sd$i
|
||||
done
|
||||
```
|
||||
|
||||
{{< image float="right" src="/assets/minio/rack.png" alt="MinIO Rack" width="16em" >}}
|
||||
|
||||
The first command takes my installed disk, `/dev/sda`, and copies the first sector over to the other
|
||||
five. This will give them the same partition table. Next, I'll add the first partition of each disk
|
||||
to the raidset. Then, I'll expand the raidset to have six members, after which the kernel starts a
|
||||
recovery process that syncs the newly added partitions into `/dev/md0` (by copying from `/dev/sda` to
|
||||
all other disks at once). Finally, I'll watch this exciting movie and grab a cup of tea.
|
||||
|
||||
|
||||
Once the disks are fully copied, I'll shut down the machine and distribute the disks to their
|
||||
respective Dell R720, two each. Once they boot they will all be identical. I'll need to make sure
|
||||
their hostnames and machine/host-ids are unique, otherwise things like bridges will have overlapping
|
||||
MAC addresses - ask me how I know:
|
||||
|
||||
```
|
||||
pim@minio-proto:~$ sudo mdadm --grow /dev/md0 --raid-devices=2
|
||||
pim@minio-proto:~$ sudo rm /etc/ssh/ssh_host*
|
||||
pim@minio-proto:~$ sudo hostname minio0-chbtl0
|
||||
pim@minio-proto:~$ sudo dpkg-reconfigure openssh-server
|
||||
pim@minio-proto:~$ sudo dd if=/dev/random of=/etc/hostid bs=4 count=1
|
||||
pim@minio-proto:~$ /usr/bin/dbus-uuidgen | sudo tee /etc/machine-id
|
||||
pim@minio-proto:~$ sudo reboot
|
||||
```
|
||||
|
||||
After which I have three beautiful and unique machines:
|
||||
* `minio0.chbtl0.net.ipng.ch`: which will go into my server rack at the IPng office.
|
||||
* `minio0.ddln0.net.ipng.ch`: which will go to [[Daedalean]({{< ref
|
||||
2022-02-24-colo >}})], doing AI since before it was all about vibe coding.
|
||||
* `minio0.chrma0.net.ipng.ch`: which will go to [[IP-Max](https://ip-max.net/)], one of the best
|
||||
ISPs on the planet. 🥰
|
||||
|
||||
|
||||
## Deploying Minio
|
||||
|
||||
The user guide that MinIO provides
|
||||
[[ref](https://min.io/docs/minio/linux/operations/installation.html)] is super good, arguably one of
|
||||
the best-documented open source projects I've ever seen. It shows me that I can do three types of
|
||||
install. A 'Standalone' with one disk, a 'Standalone Multi-Drive', and a 'Distributed' deployment.
|
||||
I decide to make three independent standalone multi-drive installs. This way, I have less shared
|
||||
fate, and will be immune to network partitions (as these are going to be in three different
|
||||
physical locations). I've also read about per-bucket _replication_, which will be an excellent way
|
||||
to get geographical distribution and active/active instances to work together.
|
||||
|
||||
I feel good about the single-machine multi-drive decision. I follow the install guide
|
||||
[[ref](https://min.io/docs/minio/linux/operations/install-deploy-manage/deploy-minio-single-node-multi-drive.html#minio-snmd)]
|
||||
for this deployment type.
|
||||
|
||||
### IPng Frontends
|
||||
|
||||
At IPng I use a private IPv4/IPv6/MPLS network that is not connected to the internet. I call this
|
||||
network [[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})]. But how will users reach my Minio
|
||||
install? I have four redundantly and geographically deployed frontends, two in the Netherlands and
|
||||
two in Switzerland. I've described the frontend setup in a [[previous article]({{< ref
|
||||
2023-03-17-ipng-frontends >}})] and the certificate management in [[this article]({{< ref
|
||||
2023-03-24-lego-dns01 >}})].
|
||||
|
||||
I've decided to run the service on these three regionalized endpoints:
|
||||
1. `s3.chbtl0.ipng.ch` which will back into `minio0.chbtl0.net.ipng.ch`
|
||||
1. `s3.ddln0.ipng.ch` which will back into `minio0.ddln0.net.ipng.ch`
|
||||
1. `s3.chrma0.ipng.ch` which will back into `minio0.chrma0.net.ipng.ch`
|
||||
|
||||
The first thing I take note of is that S3 buckets can be either addressed _by path_, in other words
|
||||
something like `s3.chbtl0.ipng.ch/my-bucket/README.md`, but they can also be addressed by virtual
|
||||
host, like so: `my-bucket.s3.chbtl0.ipng.ch/README.md`. A subtle difference, but from the docs I
|
||||
understand that Minio needs to have control of the whole space under its main domain.
|
||||
|
||||
There's a small implication to this requirement -- the Web Console that ships with MinIO (eh, well,
|
||||
maybe that's going to change, more on that later), will want to have its own domain-name, so I
|
||||
choose something simple: `cons0-s3.chbtl0.ipng.ch` and so on. This way, somebody might still be able
|
||||
to have a bucket name called `cons0` :)
|
||||
|
||||
#### Let's Encrypt Certificates
|
||||
|
||||
Alright, so I will be kneading nine domains into this new certificate, which I'll simply call
|
||||
`s3.ipng.ch`. I configure it in Ansible:
|
||||
|
||||
```
|
||||
certbot:
|
||||
certs:
|
||||
...
|
||||
s3.ipng.ch:
|
||||
groups: [ 'nginx', 'minio' ]
|
||||
altnames:
|
||||
- 's3.chbtl0.ipng.ch'
|
||||
- 'cons0-s3.chbtl0.ipng.ch'
|
||||
- '*.s3.chbtl0.ipng.ch'
|
||||
- 's3.ddln0.ipng.ch'
|
||||
- 'cons0-s3.ddln0.ipng.ch'
|
||||
- '*.s3.ddln0.ipng.ch'
|
||||
- 's3.chrma0.ipng.ch'
|
||||
- 'cons0-s3.chrma0.ipng.ch'
|
||||
- '*.s3.chrma0.ipng.ch'
|
||||
```
|
||||
|
||||
I run the `certbot` playbook and it does two things:
|
||||
1. On the machines from group `nginx` and `minio`, it will ensure there exists a user `lego` with
|
||||
an SSH key and write permissions to `/etc/lego/`; this is where the automation will write (and
|
||||
update) the certificate keys.
|
||||
1. On the `lego` machine, it'll create two files. One is the certificate requestor, and the other
|
||||
is a certificate distribution script that will copy the cert to the right machine(s) when it
|
||||
renews.
|
||||
|
||||
On the `lego` machine, I'll run the cert request for the first time:
|
||||
|
||||
```
|
||||
lego@lego:~$ bin/certbot:s3.ipng.ch
|
||||
lego@lego:~$ RENEWED_LINEAGE=/home/lego/acme-dns/live/s3.ipng.ch bin/certbot-distribute
|
||||
```
|
||||
|
||||
The first script asks me to add the `_acme-challenge` DNS entries, which I'll do, for example on the
|
||||
`s3.chbtl0.ipng.ch` instance (and similarly for the `ddln0` and `chrma0` ones):
|
||||
|
||||
```
|
||||
$ORIGIN chbtl0.ipng.ch.
|
||||
_acme-challenge.s3 CNAME 51f16fd0-8eb6-455c-b5cd-96fad12ef8fd.auth.ipng.ch.
|
||||
_acme-challenge.cons0-s3 CNAME 450477b8-74c9-4b9e-bbeb-de49c3f95379.auth.ipng.ch.
|
||||
s3 CNAME nginx0.ipng.ch.
|
||||
*.s3 CNAME nginx0.ipng.ch.
|
||||
cons0-s3 CNAME nginx0.ipng.ch.
|
||||
```
|
||||
|
||||
I push and reload the `ipng.ch` zonefile with these changes after which the certificate gets
|
||||
requested and a cronjob added to check for renewals. The second script will copy the newly created
|
||||
cert to all three `minio` machines, and all four `nginx` machines. From now on, every 90 days, a new
|
||||
cert will be automatically generated and distributed. Slick!
|
||||
|
||||
#### NGINX Configs
|
||||
|
||||
With the LE wildcard certs in hand, I can create an NGINX frontend for these minio deployments.
|
||||
|
||||
First, a simple redirector service that punts people on port 80 to port 443:
|
||||
|
||||
```
|
||||
server {
|
||||
listen [::]:80;
|
||||
listen 0.0.0.0:80;
|
||||
|
||||
server_name cons0-s3.chbtl0.ipng.ch s3.chbtl0.ipng.ch *.s3.chbtl0.ipng.ch;
|
||||
access_log /var/log/nginx/s3.chbtl0.ipng.ch-access.log;
|
||||
include /etc/nginx/conf.d/ipng-headers.inc;
|
||||
|
||||
location / {
|
||||
return 301 https://$server_name$request_uri;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Next, the Minio API service itself which runs on port 9000, with a configuration snippet inspired by
|
||||
the MinIO [[docs](https://min.io/docs/minio/linux/integrations/setup-nginx-proxy-with-minio.html)]:
|
||||
|
||||
```
|
||||
server {
|
||||
listen [::]:443 ssl http2;
|
||||
listen 0.0.0.0:443 ssl http2;
|
||||
ssl_certificate /etc/certs/s3.ipng.ch/fullchain.pem;
|
||||
ssl_certificate_key /etc/certs/s3.ipng.ch/privkey.pem;
|
||||
include /etc/nginx/conf.d/options-ssl-nginx.inc;
|
||||
ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
|
||||
|
||||
server_name s3.chbtl0.ipng.ch *.s3.chbtl0.ipng.ch;
|
||||
access_log /var/log/nginx/s3.chbtl0.ipng.ch-access.log upstream;
|
||||
include /etc/nginx/conf.d/ipng-headers.inc;
|
||||
|
||||
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
|
||||
|
||||
ignore_invalid_headers off;
|
||||
client_max_body_size 0;
|
||||
# Disable buffering
|
||||
proxy_buffering off;
|
||||
proxy_request_buffering off;
|
||||
|
||||
location / {
|
||||
proxy_set_header Host $http_host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
|
||||
proxy_connect_timeout 300;
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Connection "";
|
||||
chunked_transfer_encoding off;
|
||||
|
||||
proxy_pass http://minio0.chbtl0.net.ipng.ch:9000;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Finally, the Minio Console service which runs on port 9090:
|
||||
|
||||
```
|
||||
include /etc/nginx/conf.d/geo-ipng-trusted.inc;
|
||||
|
||||
server {
|
||||
listen [::]:443 ssl http2;
|
||||
listen 0.0.0.0:443 ssl http2;
|
||||
ssl_certificate /etc/certs/s3.ipng.ch/fullchain.pem;
|
||||
ssl_certificate_key /etc/certs/s3.ipng.ch/privkey.pem;
|
||||
include /etc/nginx/conf.d/options-ssl-nginx.inc;
|
||||
ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
|
||||
|
||||
server_name cons0-s3.chbtl0.ipng.ch;
|
||||
access_log /var/log/nginx/cons0-s3.chbtl0.ipng.ch-access.log upstream;
|
||||
include /etc/nginx/conf.d/ipng-headers.inc;
|
||||
|
||||
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
|
||||
|
||||
ignore_invalid_headers off;
|
||||
client_max_body_size 0;
|
||||
# Disable buffering
|
||||
proxy_buffering off;
|
||||
proxy_request_buffering off;
|
||||
|
||||
location / {
|
||||
if ($geo_ipng_trusted = 0) { rewrite ^ https://ipng.ch/ break; }
|
||||
proxy_set_header Host $http_host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
proxy_set_header X-NginX-Proxy true;
|
||||
|
||||
real_ip_header X-Real-IP;
|
||||
proxy_connect_timeout 300;
|
||||
chunked_transfer_encoding off;
|
||||
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Upgrade $http_upgrade;
|
||||
proxy_set_header Connection "upgrade";
|
||||
|
||||
proxy_pass http://minio0.chbtl0.net.ipng.ch:9090;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This last one has an NGINX trick. It will only allow users in if they are in the map called
|
||||
`geo_ipng_trusted`, which contains a set of IPv4 and IPv6 prefixes. Visitors who are not in this map
|
||||
will receive an HTTP redirect back to the [[IPng.ch](https://ipng.ch/)] homepage instead.
|
||||
|
||||
I run the Ansible Playbook which contains the NGINX changes to all frontends, but of course nothing
|
||||
runs yet, because I haven't yet started MinIO backends.
|
||||
|
||||
### MinIO Backends
|
||||
|
||||
The first thing I need to do is get those disks mounted. MinIO likes using XFS, so I'll install that
|
||||
and prepare the disks as follows:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ sudo apt install xfsprogs
|
||||
pim@minio0-chbtl0:~$ sudo modprobe xfs
|
||||
pim@minio0-chbtl0:~$ echo xfs | sudo tee -a /etc/modules
|
||||
pim@minio0-chbtl0:~$ sudo update-initramfs -k all -u
|
||||
pim@minio0-chbtl0:~$ for i in a b c d e f g h i j k l; do sudo mkfs.xfs /dev/sd$i; done
|
||||
pim@minio0-chbtl0:~$ blkid | awk 'BEGIN {i=1} /TYPE="xfs"/ {
|
||||
printf "%s /minio/disk%d xfs defaults 0 2\n",$2,i; i++;
|
||||
}' | sudo tee -a /etc/fstab
|
||||
pim@minio0-chbtl0:~$ for i in `seq 1 12`; do sudo mkdir -p /minio/disk$i; done
|
||||
pim@minio0-chbtl0:~$ sudo mount -t xfs -a
|
||||
pim@minio0-chbtl0:~$ sudo chown -R minio-user: /minio/
|
||||
```
|
||||
|
||||
From the top: I'll install `xfsprogs` which contains the things I need to manipulate XFS filesystems
|
||||
in Debian. Then I'll load the `xfs` kernel module, and make sure it gets loaded again upon subsequent
|
||||
startup by adding it to `/etc/modules` and regenerating the initrd for the installed kernels.
|
||||
|
||||
Next, I'll format all twelve 16TB disks (which are `/dev/sda` - `/dev/sdl` on these machines), and
|
||||
add their resulting blockdevice id's to `/etc/fstab` so they get persistently mounted on reboot.
|
||||
|
||||
Finally, I'll create their mountpoints, mount all XFS filesystems, and chown them to the user that
|
||||
MinIO is running as. End result:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ df -T
|
||||
Filesystem Type 1K-blocks Used Available Use% Mounted on
|
||||
udev devtmpfs 32950856 0 32950856 0% /dev
|
||||
tmpfs tmpfs 6595340 1508 6593832 1% /run
|
||||
/dev/md0 ext4 114695308 5423976 103398948 5% /
|
||||
tmpfs tmpfs 32976680 0 32976680 0% /dev/shm
|
||||
tmpfs tmpfs 5120 4 5116 1% /run/lock
|
||||
/dev/sda xfs 15623792640 121505936 15502286704 1% /minio/disk1
|
||||
/dev/sde xfs 15623792640 121505968 15502286672 1% /minio/disk12
|
||||
/dev/sdi xfs 15623792640 121505968 15502286672 1% /minio/disk11
|
||||
/dev/sdl xfs 15623792640 121505904 15502286736 1% /minio/disk10
|
||||
/dev/sdd xfs 15623792640 121505936 15502286704 1% /minio/disk4
|
||||
/dev/sdb xfs 15623792640 121505968 15502286672 1% /minio/disk3
|
||||
/dev/sdk xfs 15623792640 121505936 15502286704 1% /minio/disk5
|
||||
/dev/sdc xfs 15623792640 121505936 15502286704 1% /minio/disk9
|
||||
/dev/sdf xfs 15623792640 121506000 15502286640 1% /minio/disk2
|
||||
/dev/sdj xfs 15623792640 121505968 15502286672 1% /minio/disk7
|
||||
/dev/sdg xfs 15623792640 121506000 15502286640 1% /minio/disk8
|
||||
/dev/sdh xfs 15623792640 121505968 15502286672 1% /minio/disk6
|
||||
tmpfs tmpfs 6595336 0 6595336 0% /run/user/0
|
||||
```
|
||||
|
||||
MinIO likes to be configured using environment variables - and this is likely because it's a popular
|
||||
thing to run in a containerized environment like Kubernetes. The maintainers ship it also as a
|
||||
Debian package, which will read its environment from `/etc/default/minio`, and I'll prepare that
|
||||
file as follows:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/default/minio
|
||||
MINIO_DOMAIN="s3.chbtl0.ipng.ch,minio0.chbtl0.net.ipng.ch"
|
||||
MINIO_ROOT_USER="XXX"
|
||||
MINIO_ROOT_PASSWORD="YYY"
|
||||
MINIO_VOLUMES="/minio/disk{1...12}"
|
||||
MINIO_OPTS="--console-address :9001"
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ sudo systemctl enable --now minio
|
||||
pim@minio0-chbtl0:~$ sudo journalctl -u minio
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: MinIO Object Storage Server
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: Copyright: 2015-2025 MinIO, Inc.
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: Version: RELEASE.2025-05-24T17-08-30Z (go1.24.3 linux/amd64)
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: API: http://198.19.4.11:9000 http://127.0.0.1:9000
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: WebUI: https://cons0-s3.chbtl0.ipng.ch/
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: Docs: https://docs.min.io
|
||||
|
||||
pim@minio0-chbtl0:~$ sudo ipmitool sensor | grep Watts
|
||||
Pwr Consumption | 154.000 | Watts
|
||||
```
|
||||
|
||||
Incidentally - I am pretty pleased with this 192TB disk tank, sporting 24 cores, 64GB memory and
|
||||
2x10G network, casually hanging out at 154 Watts of power all up. Slick!
|
||||
|
||||
{{< image float="right" src="/assets/minio/minio-ec.svg" alt="MinIO Erasure Coding" width="22em" >}}
|
||||
|
||||
MinIO implements _erasure coding_ as a core component in providing availability and resiliency
|
||||
during drive or node-level failure events. MinIO partitions each object into data and parity shards
|
||||
and distributes those shards across a single so-called _erasure set_. Under the hood, it uses a
|
||||
[[Reed-Solomon](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction)] erasure coding
|
||||
implementation and partitions the object for distribution. From the MinIO website, I'll borrow a
|
||||
diagram to show how this looks on a single node like mine, to the right.
|
||||
|
||||
Anyway, MinIO detects 12 disks and installs an erasure set with 8 data disks and 4 parity disks,
|
||||
which it calls `EC:4` encoding, also known in the industry as `RS8.4`.
|
||||
Just like that, the thing shoots to life. Awesome!
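As a quick sanity check on what EC:4 costs me in capacity -- this is just arithmetic, not a measurement:

```
12 drives x 16 TB            = 192 TB raw
usable = raw x data/(data+parity) = 192 TB x 8/12 = 128 TB ≈ 116 TiB
```

which matches the 116 TiB total that `mc admin info` reports below.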
|
||||
|
||||
### MinIO Client
|
||||
|
||||
On Summer, I'll install the MinIO Client called `mc`. This is easy because the maintainers ship a
|
||||
Linux binary which I can just download. On OpenBSD, they don't do that. Not a problem though, on
|
||||
Squanchy, Pencilvester and Glootie, I will just `go install` the client. Using the `mc` commandline,
|
||||
I can call any of the S3 APIs on my new MinIO instance:
|
||||
|
||||
```
|
||||
pim@summer:~$ set +o history
|
||||
pim@summer:~$ mc alias set chbtl0 https://s3.chbtl0.ipng.ch/ <rootuser> <rootpass>
|
||||
pim@summer:~$ set -o history
|
||||
pim@summer:~$ mc admin info chbtl0/
|
||||
● s3.chbtl0.ipng.ch
|
||||
Uptime: 22 hours
|
||||
Version: 2025-05-24T17:08:30Z
|
||||
Network: 1/1 OK
|
||||
Drives: 12/12 OK
|
||||
Pool: 1
|
||||
|
||||
┌──────┬───────────────────────┬─────────────────────┬──────────────┐
|
||||
│ Pool │ Drives Usage │ Erasure stripe size │ Erasure sets │
|
||||
│ 1st │ 0.8% (total: 116 TiB) │ 12 │ 1 │
|
||||
└──────┴───────────────────────┴─────────────────────┴──────────────┘
|
||||
|
||||
95 GiB Used, 5 Buckets, 5,859 Objects, 318 Versions, 1 Delete Marker
|
||||
12 drives online, 0 drives offline, EC:4
|
||||
|
||||
```
|
||||
|
||||
Cool beans. I think I should get rid of this root account though, I've installed those credentials
|
||||
into the `/etc/default/minio` environment file, but I don't want to keep them out in the open. So
|
||||
I'll make an account for myself and assign me reasonable privileges, called `consoleAdmin` in the
|
||||
default install:
|
||||
|
||||
```
|
||||
pim@summer:~$ set +o history
|
||||
pim@summer:~$ mc admin user add chbtl0/ <someuser> <somepass>
|
||||
pim@summer:~$ mc admin policy info chbtl0 consoleAdmin
|
||||
pim@summer:~$ mc admin policy attach chbtl0 consoleAdmin --user=<someuser>
|
||||
pim@summer:~$ mc alias set chbtl0 https://s3.chbtl0.ipng.ch/ <someuser> <somepass>
|
||||
pim@summer:~$ set -o history
|
||||
```
|
||||
|
||||
OK, I feel less gross now that I'm not operating as root on the MinIO deployment. Using my new
|
||||
user-powers, let me set some metadata on my new minio server:
|
||||
|
||||
```
|
||||
pim@summer:~$ mc admin config set chbtl0/ site name=chbtl0 region=switzerland
|
||||
Successfully applied new settings.
|
||||
Please restart your server 'mc admin service restart chbtl0/'.
|
||||
pim@summer:~$ mc admin service restart chbtl0/
|
||||
Service status: ▰▰▱ [DONE]
|
||||
Summary:
|
||||
┌───────────────┬─────────────────────────────┐
|
||||
│ Servers: │ 1 online, 0 offline, 0 hung │
|
||||
│ Restart Time: │ 61.322886ms │
|
||||
└───────────────┴─────────────────────────────┘
|
||||
pim@summer:~$ mc admin config get chbtl0/ site
|
||||
site name=chbtl0 region=switzerland
|
||||
```
|
||||
|
||||
By the way, what's really cool about these open standards is that not only does the Amazon `aws` client work with MinIO, but `mc` also works with AWS!
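As a quick sketch of that interoperability -- assuming the AWS CLI is installed and configured with the same access key and secret, and using a hypothetical bucket name -- pointing `aws` at MinIO is just one extra flag:

```
pim@summer:~$ aws --endpoint-url https://s3.chbtl0.ipng.ch s3 ls
pim@summer:~$ aws --endpoint-url https://s3.chbtl0.ipng.ch s3 cp README.md s3://some-bucket/
```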
|
||||
### MinIO Console
|
||||
|
||||
Although I'm pretty good with APIs and command line tools, there's some benefit also in using a
|
||||
Graphical User Interface. MinIO ships with one, but there was a bit of a kerfuffle in the MinIO
|
||||
community. Unfortunately, these are pretty common -- Redis (an open source key/value storage system)
|
||||
changed their offering abruptly. Terraform (an open source infrastructure-as-code tool) changed
|
||||
their licensing at some point. Ansible (an open source machine management tool) changed their
|
||||
offering also. MinIO developers decided to strip their console of ~all features recently. The gnarly
|
||||
bits are discussed on
|
||||
[[reddit](https://www.reddit.com/r/selfhosted/comments/1kva3pw/avoid_minio_developers_introduce_trojan_horse/)].
|
||||
But suffice it to say: the same thing that happened in literally 100% of the other cases also happened
|
||||
here. Somebody decided to simply fork the code from before it was changed.
|
||||
|
||||
Enter OpenMaxIO. A cringe-worthy name, but it gets the job done. Reading up on the
|
||||
[[GitHub](https://github.com/OpenMaxIO/openmaxio-object-browser/issues/5)], reviving the fully
|
||||
working console is pretty straightforward -- that is, once somebody spent a few days figuring it
|
||||
out. Thank you `icesvz` for this excellent pointer. With this, I can create a systemd service for
|
||||
the console and start it:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee -a /etc/default/minio
|
||||
## NOTE(pim): For openmaxio console service
|
||||
CONSOLE_MINIO_SERVER="http://localhost:9000"
|
||||
MINIO_BROWSER_REDIRECT_URL="https://cons0-s3.chbtl0.ipng.ch/"
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /lib/systemd/system/minio-console.service
|
||||
[Unit]
|
||||
Description=OpenMaxIO Console Service
|
||||
Wants=network-online.target
|
||||
After=network-online.target
|
||||
AssertFileIsExecutable=/usr/local/bin/minio-console
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
|
||||
WorkingDirectory=/usr/local
|
||||
|
||||
User=minio-user
|
||||
Group=minio-user
|
||||
ProtectProc=invisible
|
||||
|
||||
EnvironmentFile=-/etc/default/minio
|
||||
ExecStart=/usr/local/bin/minio-console server
|
||||
Restart=always
|
||||
LimitNOFILE=1048576
|
||||
MemoryAccounting=no
|
||||
TasksMax=infinity
|
||||
TimeoutSec=infinity
|
||||
OOMScoreAdjust=-1000
|
||||
SendSIGKILL=no
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ sudo systemctl enable --now minio-console
|
||||
pim@minio0-chbtl0:~$ sudo systemctl restart minio
|
||||
```
|
||||
|
||||
The first snippet is an update to the MinIO configuration that instructs it to redirect users who
|
||||
are not trying to use the API to the console endpoint on `cons0-s3.chbtl0.ipng.ch`, and then the
|
||||
console-server needs to know where to find the API, which from its vantage point is running on
|
||||
`localhost:9000`. Hello, beautiful fully featured console:
|
||||
|
||||
{{< image src="/assets/minio/console-1.png" alt="MinIO Console" >}}
|
||||
|
||||
### MinIO Prometheus
|
||||
|
||||
MinIO ships with a prometheus metrics endpoint, and I notice on its console that it has a nice
|
||||
metrics tab, which is fully greyed out. This is most likely because, well, I don't have a Prometheus
|
||||
install here yet. I decide to keep the storage nodes self-contained and start a Prometheus server on
|
||||
the local machine. I can always plumb that to IPng's Grafana instance later.
|
||||
|
||||
For now, I'll install Prometheus as follows:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee -a /etc/default/minio
|
||||
## NOTE(pim): Metrics for minio-console
|
||||
MINIO_PROMETHEUS_AUTH_TYPE="public"
|
||||
CONSOLE_PROMETHEUS_URL="http://localhost:19090/"
|
||||
CONSOLE_PROMETHEUS_JOB_ID="minio-job"
|
||||
EOF
|
||||
|
||||
pim@minio0-chbtl0:~$ sudo apt install prometheus
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/default/prometheus
|
||||
ARGS="--web.listen-address='[::]:19090' --storage.tsdb.retention.size=16GB"
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/prometheus/prometheus.yml
|
||||
global:
|
||||
scrape_interval: 60s
|
||||
|
||||
scrape_configs:
|
||||
- job_name: minio-job
|
||||
metrics_path: /minio/v2/metrics/cluster
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: minio-job-node
|
||||
metrics_path: /minio/v2/metrics/node
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: minio-job-bucket
|
||||
metrics_path: /minio/v2/metrics/bucket
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: minio-job-resource
|
||||
metrics_path: /minio/v2/metrics/resource
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: node
|
||||
static_configs:
|
||||
- targets: ['localhost:9100']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
pim@minio0-chbtl0:~$ sudo systemctl restart minio prometheus
|
||||
```
|
||||
|
||||
In the first snippet, I'll tell MinIO where it should find its Prometheus instance. Since the MinIO
|
||||
console service is running on port 9090, and this is also the default port for Prometheus, I will
|
||||
run Prometheus on port 19090 instead. From reading the MinIO docs, I can see that normally MinIO will
|
||||
want prometheus to authenticate to it before it'll allow the endpoints to be scraped. I'll turn that
|
||||
off by making these public. On the IPng Frontends, I can always remove access to /minio/v2 and
|
||||
simply use the IPng Site Local access for local Prometheus scrapers instead.
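A minimal sketch of what such a shield could look like on the frontends -- an extra `location` block next to the catch-all proxy in the API server above, not something I've deployed verbatim here:

```
    location /minio/v2/ {
        return 403;
    }
```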
|
||||
|
||||
After telling Prometheus its runtime arguments (in `/etc/default/prometheus`) and its scraping
|
||||
endpoints (in `/etc/prometheus/prometheus.yml`), I can restart minio and prometheus. A few minutes
|
||||
later, I can see the _Metrics_ tab in the console come to life.
|
||||
|
||||
But now that I have this prometheus running on the MinIO node, I can also add it to IPng's Grafana
|
||||
configuration, by adding a new data source on `minio0.chbtl0.net.ipng.ch:19090` and pointing the
|
||||
default Grafana [[Dashboard](https://grafana.com/grafana/dashboards/13502-minio-dashboard/)] at it:
|
||||
|
||||
{{< image src="/assets/minio/console-2.png" alt="Grafana Dashboard" >}}
|
||||
|
||||
A two-for-one: I will be able to see metrics directly in the console, and I will also be able to hook these per-node Prometheus instances into IPng's Alertmanager. I've read some
|
||||
[[docs](https://min.io/docs/minio/linux/operations/monitoring/collect-minio-metrics-using-prometheus.html)]
|
||||
on the concepts. I'm really liking the experience so far!
|
||||
|
||||
### MinIO Nagios
|
||||
|
||||
Prometheus is fancy and all, but at IPng Networks, I've been doing monitoring for a while now. As a
|
||||
dinosaur, I still have an active [[Nagios](https://www.nagios.org/)] install, which autogenerates
|
||||
all of its configuration using the Ansible repository I have. So for the new Ansible group called
|
||||
`minio`, I will autogenerate the following snippet:
|
||||
|
||||
```
|
||||
define command {
|
||||
command_name ipng_check_minio
|
||||
command_line $USER1$/check_http -E -H $HOSTALIAS$ -I $ARG1$ -p $ARG2$ -u $ARG3$ -r '$ARG4$'
|
||||
}
|
||||
|
||||
define service {
|
||||
hostgroup_name ipng:minio:ipv6
|
||||
service_description minio6:api
|
||||
check_command ipng_check_minio!$_HOSTADDRESS6$!9000!/minio/health/cluster!
|
||||
use ipng-service-fast
|
||||
notification_interval 0 ; set > 0 if you want to be renotified
|
||||
}
|
||||
|
||||
define service {
|
||||
hostgroup_name ipng:minio:ipv6
|
||||
service_description minio6:prom
|
||||
check_command ipng_check_minio!$_HOSTADDRESS6$!19090!/classic/targets!minio-job
|
||||
use ipng-service-fast
|
||||
notification_interval 0 ; set > 0 if you want to be renotified
|
||||
}
|
||||
|
||||
define service {
|
||||
hostgroup_name ipng:minio:ipv6
|
||||
service_description minio6:console
|
||||
check_command ipng_check_minio!$_HOSTADDRESS6$!9090!/!MinIO Console
|
||||
use ipng-service-fast
|
||||
notification_interval 0 ; set > 0 if you want to be renotified
|
||||
}
|
||||
```
|
||||
|
||||
I've shown the snippet for IPv6 but I also have three services defined for legacy IP in the
|
||||
hostgroup `ipng:minio:ipv4`. The check command here uses `-I` which has the IPv4 or IPv6 address to
|
||||
talk to, `-p` for the port to consule, `-u` for the URI to hit and an option `-r` for a regular
|
||||
expression to expect in the output. For the Nagios afficianados out there: my Ansible `groups`
|
||||
correspond one to one with autogenerated Nagios `hostgroups`. This allows me to add arbitrary checks
|
||||
by group-type, like above in the `ipng:minio` group for IPv4 and IPv6.
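Written out, one of these expands to a plain `check_http` invocation. Assuming `$USER1$` points at the usual plugin directory and filling in a made-up IPv6 address for `$_HOSTADDRESS6$`, the Prometheus check would run roughly as:

```
/usr/lib/nagios/plugins/check_http -E -H minio0.chbtl0.net.ipng.ch -I 2001:db8::11 \
    -p 19090 -u /classic/targets -r 'minio-job'
```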
|
||||
|
||||
In the MinIO [[docs](https://min.io/docs/minio/linux/operations/monitoring/healthcheck-probe.html)]
|
||||
I read up on the Healthcheck API. I choose to monitor the _Cluster Write Quorum_ on my minio
|
||||
deployments. For Prometheus, I decide to hit the `targets` endpoint and expect the `minio-job` to be
|
||||
among them. Finally, for the MinIO Console, I expect to see a login screen with the words `MinIO
|
||||
Console` in the returned page. I guessed right, because Nagios is all green:
|
||||
|
||||
{{< image src="/assets/minio/nagios.png" alt="Nagios Dashboard" >}}
|
||||
|
||||
## My First Bucket
|
||||
|
||||
The IPng website is a statically generated Hugo site, and whenever I submit a change to my Git
|
||||
repo, a CI/CD runner (called [[Drone](https://www.drone.io/)]), picks up the change. It re-builds
|
||||
the static website, and copies it to four redundant NGINX servers.
|
||||
|
||||
But IPng's website has amassed quite a few extra files (like VM images and VPP packages that I
|
||||
publish), which are copied separately using a simple push script I have in my home directory. This
|
||||
avoids all those big media files from cluttering the Git repository. I decide to move this stuff
|
||||
into S3:
|
||||
|
||||
```
|
||||
pim@summer:~/src/ipng-web-assets$ echo 'Gruezi World.' > ipng.ch/media/README.md
|
||||
pim@summer:~/src/ipng-web-assets$ mc mb chbtl0/ipng-web-assets
|
||||
pim@summer:~/src/ipng-web-assets$ mc mirror . chbtl0/ipng-web-assets/
|
||||
...ch/media/README.md: 6.50 GiB / 6.50 GiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 236.38 MiB/s 28s
|
||||
pim@summer:~/src/ipng-web-assets$ mc anonymous set download chbtl0/ipng-web-assets/
|
||||
```
|
||||
|
||||
OK, two things that immediately jump out at me. This stuff is **fast**: Summer is connected with a
|
||||
2.5GbE network card, and she's running hard, copying the 6.5GB of data that are in these web assets
|
||||
essentially at line rate. It doesn't really surprise me because Summer is running off of Gen4 NVME,
|
||||
while MinIO has 12 spinning disks which each can write about 160MB/s or so sustained
|
||||
[[ref](https://www.seagate.com/www-content/datasheets/pdfs/exos-x16-DS2011-1-1904US-en_US.pdf)],
|
||||
with 24 CPUs to tend to the NIC (2x10G) and disks (2x SSD, 12x LFF). Should be plenty!
|
||||
|
||||
The second is that MinIO allows for buckets to be publicly shared in three ways: 1) read-only by
|
||||
setting `download`; 2) write-only by setting `upload`, and 3) read-write by setting `public`.
|
||||
I set `download` here, which means I should be able to fetch an asset now publicly:
|
||||
|
||||
```
|
||||
pim@summer:~$ curl https://s3.chbtl0.ipng.ch/ipng-web-assets/ipng.ch/media/README.md
|
||||
Gruezi World.
|
||||
pim@summer:~$ curl https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/README.md
|
||||
Gruezi World.
|
||||
```
|
||||
|
||||
The first `curl` here shows the path-based access, while the second one shows an equivalent
|
||||
virtual-host based access. Both retrieve the file I just pushed via the public Internet. Whoot!
|
||||
|
||||
# What's Next
|
||||
|
||||
I'm going to be moving [[Restic](https://restic.net/)] backups from IPng's ZFS storage pool to this
|
||||
S3 service over the next few days. I'll also migrate PeerTube and possibly Mastodon from NVME based
|
||||
storage to replicated S3 buckets as well. Finally, the IPng website media that I mentioned above,
|
||||
should make for a nice followup article. Stay tuned!
|
475
content/articles/2025-06-01-minio-2.md
Normal file
@@ -0,0 +1,475 @@
|
||||
---
|
||||
date: "2025-06-01T10:07:23Z"
|
||||
title: 'Case Study: Minio S3 - Part 2'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/minio/minio-logo.png" alt="MinIO Logo" width="6em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading
|
||||
scalability, data availability, security, and performance. Millions of customers of all sizes and
|
||||
industries store, manage, analyze, and protect any amount of data for virtually any use case, such
|
||||
as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and
|
||||
easy-to-use management features, you can optimize costs, organize and analyze data, and configure
|
||||
fine-tuned access controls to meet specific business and compliance requirements.
|
||||
|
||||
Amazon's S3 became the _de facto_ standard object storage system, and there exist several fully open
|
||||
source implementations of the protocol. One of them is MinIO: designed to allow enterprises to
|
||||
consolidate all of their data on a single, private cloud namespace. Architected using the same
|
||||
principles as the hyperscalers, AIStor delivers performance at scale at a fraction of the cost
|
||||
compared to the public cloud.
|
||||
|
||||
IPng Networks is an Internet Service Provider, but I also dabble in self-hosting things, for
|
||||
example [[PeerTube](https://video.ipng.ch/)], [[Mastodon](https://ublog.tech/)],
|
||||
[[Immich](https://photos.ipng.ch/)], [[Pixelfed](https://pix.ublog.tech/)] and of course
|
||||
[[Hugo](https://ipng.ch/)]. These services all have one thing in common: they tend to use lots of
|
||||
storage when they grow. At IPng Networks, all hypervisors ship with enterprise SAS flash drives,
|
||||
mostly 1.92TB and 3.84TB. Scaling up each of these services, and backing them up safely, can be
|
||||
quite the headache.
|
||||
|
||||
In a [[previous article]({{< ref 2025-05-28-minio-1 >}})], I talked through the install of a
|
||||
redundant set of three Minio machines. In this article, I'll start putting them to good use.
|
||||
|
||||
## Use Case: Restic
|
||||
|
||||
{{< image float="right" src="/assets/minio/restic-logo.png" alt="Restic Logo" width="12em" >}}
|
||||
|
||||
[[Restic](https://restic.org/)] is a modern backup program that can back up your files from multiple
|
||||
host OS, to many different storage types, easily, effectively, securely, verifiably and freely. With
|
||||
a sales pitch like that, what's not to love? Actually, I am a long-time
|
||||
[[BorgBackup](https://www.borgbackup.org/)] user, and I think I'll keep that running. However, for
|
||||
resilience, and because I've heard only good things about Restic, I'll make a second backup of the
|
||||
routers, hypervisors, and virtual machines using Restic.
|
||||
|
||||
Restic can use S3 buckets out of the box (incidentally, so can BorgBackup). To configure it, I use
|
||||
a mixture of environment variables and flags. But first, let me create a bucket for the backups.
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc mb chbtl0/ipng-restic
|
||||
pim@glootie:~$ mc admin user add chbtl0/ <key> <secret>
|
||||
pim@glootie:~$ cat << EOF | tee ipng-restic-access.json
|
||||
{
|
||||
"PolicyName": "ipng-restic-access",
|
||||
"Policy": {
|
||||
"Version": "2012-10-17",
|
||||
"Statement": [
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:DeleteObject", "s3:GetObject", "s3:ListBucket", "s3:PutObject" ],
|
||||
"Resource": [ "arn:aws:s3:::ipng-restic", "arn:aws:s3:::ipng-restic/*" ]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
EOF
|
||||
pim@glootie:~$ mc admin policy create chbtl0/ ipng-restic-access.json
|
||||
pim@glootie:~$ mc admin policy attach chbtl0/ ipng-restic-access --user <key>
|
||||
```
|
||||
|
||||
First, I'll create a bucket called `ipng-restic`. Then, I'll create a _user_ with a given secret
|
||||
_key_. To protect the innocent, and my backups, I'll not disclose them. Next, I'll create an
|
||||
IAM policy that allows for Get/List/Put/Delete to be performed on the bucket and its contents, and
|
||||
finally I'll attach this policy to the user I just created.
|
||||
|
||||
To run a Restic backup, I'll first have to create a so-called _repository_. The repository has a
|
||||
location and a password, which Restic uses to encrypt the data. Because I'm using S3, I'll also need
|
||||
to specify the key and secret:
|
||||
|
||||
```
|
||||
root@glootie:~# RESTIC_PASSWORD="changeme"
|
||||
root@glootie:~# RESTIC_REPOSITORY="s3:https://s3.chbtl0.ipng.ch/ipng-restic/$(hostname)/"
|
||||
root@glootie:~# AWS_ACCESS_KEY_ID="<key>"
|
||||
root@glootie:~# AWS_SECRET_ACCESS_KEY="<secret>"
|
||||
root@glootie:~# export RESTIC_PASSWORD RESTIC_REPOSITORY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
|
||||
root@glootie:~# restic init
|
||||
created restic repository 807cf25e85 at s3:https://s3.chbtl0.ipng.ch/ipng-restic/glootie.ipng.ch/
|
||||
```
|
||||
|
||||
Restic prints out the identifier of the repository it just created. Taking a
|
||||
look at the MinIO install:
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc stat chbtl0/ipng-restic/glootie.ipng.ch/
|
||||
Name : config
|
||||
Date : 2025-06-01 12:01:43 UTC
|
||||
Size : 155 B
|
||||
ETag : 661a43f72c43080649712e45da14da3a
|
||||
Type : file
|
||||
Metadata :
|
||||
Content-Type: application/octet-stream
|
||||
|
||||
Name : keys/
|
||||
Date : 2025-06-01 12:03:33 UTC
|
||||
Type : folder
|
||||
```
|
||||
|
||||
Cool. Now I'm ready to make my first full backup:
|
||||
|
||||
```
|
||||
root@glootie:~# ARGS="--exclude /proc --exclude /sys --exclude /dev --exclude /run"
|
||||
root@glootie:~# ARGS="$ARGS --exclude-if-present .nobackup"
|
||||
root@glootie:~# restic backup $ARGS /
|
||||
...
|
||||
processed 1141426 files, 131.111 GiB in 15:12
|
||||
snapshot 34476c74 saved
|
||||
```
|
||||
|
||||
Once the backup completes, the Restic authors advise me to also do a check of the repository, and to
|
||||
prune it so that it keeps a finite amount of daily, weekly and monthly backups. My further journey
|
||||
for Restic looks a bit like this:
|
||||
|
||||
```
|
||||
root@glootie:~# restic check
|
||||
using temporary cache in /tmp/restic-check-cache-2712250731
|
||||
create exclusive lock for repository
|
||||
load indexes
|
||||
check all packs
|
||||
check snapshots, trees and blobs
|
||||
[0:04] 100.00% 1 / 1 snapshots
|
||||
|
||||
no errors were found
|
||||
|
||||
root@glootie:~# restic forget --prune --keep-daily 8 --keep-weekly 5 --keep-monthly 6
|
||||
repository 34476c74 opened (version 2, compression level auto)
|
||||
Applying Policy: keep 8 daily, 5 weekly, 6 monthly snapshots
|
||||
keep 1 snapshots:
|
||||
ID Time Host Tags Reasons Paths
|
||||
---------------------------------------------------------------------------------
|
||||
34476c74 2025-06-01 12:18:54 glootie.ipng.ch daily snapshot /
|
||||
weekly snapshot
|
||||
monthly snapshot
|
||||
----------------------------------------------------------------------------------
|
||||
1 snapshots
|
||||
```
|
||||
|
||||
Right on! I proceed to update the Ansible configs at IPng to roll this out against the entire fleet
|
||||
of 152 hosts at IPng Networks. I do this in a little tool called `bitcron`, which I wrote for a
|
||||
previous company I worked at: [[BIT](https://bit.nl)] in the Netherlands. Bitcron allows me to
|
||||
create relatively elegant cronjobs that can raise warnings, errors and fatal issues. If no issues
|
||||
are found, an e-mail can be sent to a bitbucket address, but if warnings or errors are found, a
|
||||
different _monitored_ address will be used. Bitcron is kind of cool, and I wrote it in 2001. Maybe
|
||||
I'll write about it, for old time's sake. I wonder if the folks at BIT still use it?
|
||||
|
||||
## Use Case: NGINX
|
||||
|
||||
{{< image float="right" src="/assets/minio/nginx-logo.png" alt="NGINX Logo" width="11em" >}}
|
||||
|
||||
OK, with the first use case out of the way, I turn my attention to a second - in my opinion more
|
||||
interesting - use case. In the [[previous article]({{< ref 2025-05-28-minio-1 >}})], I created a
|
||||
public bucket called `ipng-web-assets` in which I stored 6.50GB of website data belonging to the
|
||||
IPng website, and some material I posted when I was on my
|
||||
[[Sabbatical](https://sabbatical.ipng.nl/)] last year.
|
||||
|
||||
### MinIO: Bucket Replication
|
||||
|
||||
First things first: redundancy. These web assets are currently pushed to all four nginx machines,
|
||||
and statically served. If I were to replace them with a single S3 bucket, I would create a single
|
||||
point of failure, and that's _no bueno_!
|
||||
|
||||
Off I go, creating a replicated bucket using two MinIO instances (`chbtl0` and `ddln0`):
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc mb ddln0/ipng-web-assets
|
||||
pim@glootie:~$ mc anonymous set download ddln0/ipng-web-assets
|
||||
pim@glootie:~$ mc admin user add ddln0/ <replkey> <replsecret>
|
||||
pim@glootie:~$ cat << EOF | tee ipng-web-assets-access.json
|
||||
{
|
||||
"PolicyName": "ipng-web-assets-access",
|
||||
"Policy": {
|
||||
"Version": "2012-10-17",
|
||||
"Statement": [
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:DeleteObject", "s3:GetObject", "s3:ListBucket", "s3:PutObject" ],
|
||||
"Resource": [ "arn:aws:s3:::ipng-web-assets", "arn:aws:s3:::ipng-web-assets/*" ]
|
||||
}
|
||||
]
|
||||
},
|
||||
}
|
||||
EOF
|
||||
pim@glootie:~$ mc admin policy create ddln0/ ipng-web-assets-access.json
|
||||
pim@glootie:~$ mc admin policy attach ddln0/ ipng-web-assets-access --user <replkey>
|
||||
pim@glootie:~$ mc replicate add chbtl0/ipng-web-assets \
|
||||
--remote-bucket https://<key>:<secret>@s3.ddln0.ipng.ch/ipng-web-assets
|
||||
```
|
||||
|
||||
What happens next is pure magic. I've told `chbtl0` that I want it to replicate all existing and
|
||||
future changes to that bucket to its neighbor `ddln0`. Only minutes later, I check the replication
|
||||
status, just to see that it's _already done_:
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc replicate status chbtl0/ipng-web-assets
|
||||
Replication status since 1 hour
|
||||
s3.ddln0.ipng.ch
|
||||
Replicated: 142 objects (6.5 GiB)
|
||||
Queued: ● 0 objects, 0 B (avg: 4 objects, 915 MiB ; max: 0 objects, 0 B)
|
||||
Workers: 0 (avg: 0; max: 0)
|
||||
Transfer Rate: 15 kB/s (avg: 88 MB/s; max: 719 MB/s)
|
||||
Latency: 3ms (avg: 3ms; max: 7ms)
|
||||
Link: ● online (total downtime: 0 milliseconds)
|
||||
Errors: 0 in last 1 minute; 0 in last 1hr; 0 since uptime
|
||||
Configured Max Bandwidth (Bps): 644 GB/s Current Bandwidth (Bps): 975 B/s
|
||||
pim@summer:~/src/ipng-web-assets$ mc ls ddln0/ipng-web-assets/
|
||||
[2025-06-01 12:42:22 CEST] 0B ipng.ch/
|
||||
[2025-06-01 12:42:22 CEST] 0B sabbatical.ipng.nl/
|
||||
```
|
||||
|
||||
MinIO has pumped the data from bucket `ipng-web-assets` to the other machine at an average of 88MB/s
|
||||
with a peak throughput of 719MB/s (probably for the larger VM images). And indeed, looking at the
|
||||
remote machine, it is fully caught up after the push: within only a minute or so it holds a completely
fresh copy. Nice!
|
||||
|
||||
### MinIO: Missing directory index
|
||||
|
||||
I take a look at what I just built, on the following URL:
|
||||
* [https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/IMG_0406_0.mp4](https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/IMG_0406_0.mp4)
|
||||
|
||||
That checks out, and I can see the mess that was my room when I first went on sabbatical. By the
|
||||
way, I totally cleaned it up, see
|
||||
[[here](https://sabbatical.ipng.nl/blog/2024/08/01/thursday-basement-done/)] for proof. I can't,
|
||||
however, see the directory listing:
|
||||
|
||||
```
|
||||
pim@glootie:~$ curl https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<Error>
|
||||
<Code>NoSuchKey</Code>
|
||||
<Message>The specified key does not exist.</Message>
|
||||
<Key>sabbatical.ipng.nl/media/vdo/</Key>
|
||||
<BucketName>ipng-web-assets</BucketName>
|
||||
<Resource>/sabbatical.ipng.nl/media/vdo/</Resource>
|
||||
<RequestId>1844EC0CFEBF3C5F</RequestId>
|
||||
<HostId>dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8</HostId>
|
||||
</Error>
|
||||
```
|
||||
|
||||
That's unfortunate, because some of the IPng articles link to a directory full of files, which I'd
|
||||
like to be shown so that my readers can navigate through the directories. Surely I'm not the first
|
||||
to encounter this? And sure enough, I'm not: I find this
[[ref](https://github.com/glowinthedark/index-html-generator)] by user `glowinthedark`, who wrote a
little Python script that generates `index.html` files for their Caddy file server. I'll take me
|
||||
some of that Python, thank you!
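
The real script lives in the `index-html-generator` repository linked above; as a rough idea of what
such a generator does, here is a minimal sketch of my own. The `-r` flag taking the root directory is
my assumption; the real tool's flags and HTML output differ.

```python
#!/usr/bin/env python3
"""Minimal sketch of an index.html generator; the real genindex.py from the
index-html-generator repository is more featureful (styling, sizes, dates)."""
import argparse
import html
import os

def write_index(dirpath: str) -> None:
    # List the directory and emit a bare-bones index.html with one link per entry.
    items = []
    for name in sorted(os.listdir(dirpath)):
        if name == "index.html":
            continue
        href = name + "/" if os.path.isdir(os.path.join(dirpath, name)) else name
        items.append(f'<li><a href="{html.escape(href)}">{html.escape(name)}</a></li>')
    with open(os.path.join(dirpath, "index.html"), "w") as fh:
        fh.write("<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>\n")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-r", "--root", required=True, help="directory tree to index")
    args = parser.parse_args()
    for dirpath, _dirs, _files in os.walk(args.root):
        write_index(dirpath)
```

Running a generator like this over each `*/media` directory before `mc mirror`, as the `push.sh`
below does, means the `index.html` files get uploaded along with everything else.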
|
||||
|
||||
With the following little script, my setup is complete:
|
||||
|
||||
```
|
||||
pim@glootie:~/src/ipng-web-assets$ cat push.sh
|
||||
#!/usr/bin/env bash
|
||||
|
||||
echo "Generating index.html files ..."
|
||||
for D in */media; do
|
||||
echo "* Directory $D"
|
||||
./genindex.py -r $D
|
||||
done
|
||||
echo "Done (genindex)"
|
||||
echo ""
|
||||
|
||||
echo "Mirroring directory to S3 Bucket"
|
||||
mc mirror --remove --overwrite . chbtl0/ipng-web-assets/
|
||||
echo "Done (mc mirror)"
|
||||
echo ""
|
||||
pim@glootie:~/src/ipng-web-assets$ ./push.sh
|
||||
```
|
||||
|
||||
Only a few seconds after I run `./push.sh`, the replication is complete and I have two identical
|
||||
copies of my media:
|
||||
|
||||
1. [https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/](https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/index.html)
|
||||
1. [https://ipng-web-assets.s3.ddln0.ipng.ch/ipng.ch/media/](https://ipng-web-assets.s3.ddln0.ipng.ch/ipng.ch/media/index.html)
|
||||
|
||||
|
||||
### NGINX: Proxy to Minio
|
||||
|
||||
Before moving to S3 storage, my NGINX frontends all kept a copy of the IPng media on local NVME
|
||||
disk. That's great for reliability, as each NGINX instance is completely hermetic and standalone.
|
||||
However, it's not great for scaling: the current NGINX instances only have 16GB of local storage,
|
||||
and I'd rather not have my static web asset data outgrow that filesystem. From before, I already had
|
||||
an NGINX config that served the Hugo static data from `/var/www/ipng.ch/` and the `/media`
|
||||
subdirectory from a different directory in `/var/www/ipng-web-assets/ipng.ch/media`.
|
||||
|
||||
Moving to a redundant S3 storage backend is straightforward:
|
||||
|
||||
```
|
||||
upstream minio_ipng {
|
||||
least_conn;
|
||||
server minio0.chbtl0.net.ipng.ch:9000;
|
||||
server minio0.ddln0.net.ipng.ch:9000;
|
||||
}
|
||||
|
||||
server {
|
||||
...
|
||||
location / {
|
||||
root /var/www/ipng.ch/;
|
||||
}
|
||||
|
||||
location /media {
|
||||
proxy_set_header Host $http_host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
|
||||
proxy_connect_timeout 300;
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Connection "";
|
||||
chunked_transfer_encoding off;
|
||||
|
||||
rewrite (.*)/$ $1/index.html;
|
||||
|
||||
proxy_pass http://minio_ipng/ipng-web-assets/ipng.ch/media;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
I want to make note of a few things:
|
||||
1. The `upstream` definition here uses IPng Site Local entrypoints, considering the NGINX servers
|
||||
all have direct MTU=9000 access to the MinIO instances. I'll put both in there, in a
|
||||
load-balancing configuration favoring the replica with the _least connections_.
|
||||
1. Deeplinking to directory names without the trailing `/index.html` would serve a 404 from the
|
||||
backend, so I'll intercept these and rewrite directory requests to always include the `/index.html`.
|
||||
1. The upstream endpoint is _path-based_, that is to say it has the bucket name and website name
|
||||
included. This whole location used to be simply `root /var/www/ipng-web-assets/ipng.ch/media/`
|
||||
so the mental change is quite small.
|
||||
|
||||
### NGINX: Caching
|
||||
|
||||
|
||||
After deploying the S3 upstream on all IPng websites, I can delete the old
|
||||
`/var/www/ipng-web-assets/` directory and reclaim about 7GB of diskspace. This gives me an idea ...
|
||||
|
||||
{{< image width="8em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
On the one hand it's great that I will pull these assets from Minio and all, but at the same time,
|
||||
it's a tad inefficient to retrieve them from, say, Zurich to Amsterdam just to serve them onto the
|
||||
internet again. If at any time something on the IPng website goes viral, it'd be nice to be able to
|
||||
serve them directly from the edge, right?
|
||||
|
||||
A webcache. What could _possibly_ go wrong :)
|
||||
|
||||
NGINX is really really good at caching content. It has a powerful engine to store, scan, revalidate
|
||||
and match any content and upstream headers. It's also very well documented, so I take a look at the
|
||||
proxy module's documentation [[here](https://nginx.org/en/docs/http/ngx_http_proxy_module.html)] and
|
||||
in particular a useful [[blog](https://blog.nginx.org/blog/nginx-caching-guide)] on their website.
|
||||
|
||||
The first thing I need to do is create what is called a _key zone_, which is a region of memory in
|
||||
which URL keys are stored with some metadata. Having a copy of the keys in memory enables NGINX to
|
||||
quickly determine if a request is a HIT or a MISS without having to go to disk, greatly speeding up
|
||||
the check.
|
||||
|
||||
In `/etc/nginx/conf.d/ipng-cache.conf` I add the following NGINX cache:
|
||||
|
||||
```
|
||||
proxy_cache_path /var/www/nginx-cache levels=1:2 keys_zone=ipng_cache:10m max_size=8g
|
||||
inactive=24h use_temp_path=off;
|
||||
```
|
||||
|
||||
With this statement, I'll create a 2-level subdirectory, and allocate 10MB of space, which should
|
||||
hold on the order of 100K entries. The maximum size I'll allow the cache to grow to is 8GB, and I'll
|
||||
mark any object inactive if it's not been referenced for 24 hours. I learn that inactive is
|
||||
different to expired content. If a cache element has expired, but NGINX can't reach the upstream
|
||||
for a new copy, it can be configured to serve an inactive (stale) copy from the cache. That's dope,
|
||||
as it serves as an extra layer of defence in case the network or all available S3 replicas take the
|
||||
day off. I'll ask NGINX to avoid writing objects first to a tmp directory and then moving them into
|
||||
the `/var/www/nginx-cache` directory. These are recommendations I grab from the manual.
|
||||
|
||||
Within the `location` block I configured above, I'm now ready to enable this cache. I'll do that by
|
||||
adding two include files, which I'll reference in all sites that I want to have make use of this
|
||||
cache:
|
||||
|
||||
First, to enable the cache, I write the following snippet:
|
||||
```
|
||||
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-cache.inc
|
||||
proxy_cache ipng_cache;
|
||||
proxy_ignore_headers Cache-Control;
|
||||
proxy_cache_valid any 1h;
|
||||
proxy_cache_revalidate on;
|
||||
proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
|
||||
proxy_cache_background_update on;
|
||||
```
|
||||
|
||||
Then, I find it useful to emit a few debugging HTTP headers, and at the same time I see that Minio
|
||||
emits a bunch of HTTP headers that may not be safe for me to propagate, so I pen two more snippets:
|
||||
|
||||
```
|
||||
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-strip-minio-headers.inc
|
||||
proxy_hide_header x-minio-deployment-id;
|
||||
proxy_hide_header x-amz-request-id;
|
||||
proxy_hide_header x-amz-id-2;
|
||||
proxy_hide_header x-amz-replication-status;
|
||||
proxy_hide_header x-amz-version-id;
|
||||
|
||||
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-add-upstream-headers.inc
|
||||
add_header X-IPng-Frontend $hostname always;
|
||||
add_header X-IPng-Upstream $upstream_addr always;
|
||||
add_header X-IPng-Upstream-Status $upstream_status always;
|
||||
add_header X-IPng-Cache-Status $upstream_cache_status;
|
||||
```
|
||||
|
||||
With that, I am ready to enable caching of the IPng `/media` location:
|
||||
|
||||
```
|
||||
location /media {
|
||||
...
|
||||
include /etc/nginx/conf.d/ipng-strip-minio-headers.inc;
|
||||
include /etc/nginx/conf.d/ipng-add-upstream-headers.inc;
|
||||
include /etc/nginx/conf.d/ipng-cache.inc;
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
I run the Ansible playbook for the NGINX cluster and take a look at the replica at Coloclue in
|
||||
Amsterdam, called `nginx0.nlams1.ipng.ch`. Notably, it'll have to retrieve the file from a MinIO
|
||||
replica in Zurich (12ms away), so it's expected to take a little while.
|
||||
|
||||
The first attempt:
|
||||
|
||||
```
|
||||
pim@nginx0-nlams1:~$ curl -v -o /dev/null --connect-to ipng.ch:443:localhost:443 \
|
||||
https://ipng.ch/media/vpp-proto/vpp-proto-bookworm.qcow2.lrz
|
||||
...
|
||||
< last-modified: Sun, 01 Jun 2025 12:37:52 GMT
|
||||
< x-ipng-frontend: nginx0-nlams1
|
||||
< x-ipng-cache-status: MISS
|
||||
< x-ipng-upstream: [2001:678:d78:503::b]:9000
|
||||
< x-ipng-upstream-status: 200
|
||||
|
||||
100 711M 100 711M 0 0 26.2M 0 0:00:27 0:00:27 --:--:-- 26.6M
|
||||
```
|
||||
|
||||
|
||||
OK, that's respectable, I've read the file at 26MB/s. Of course I just turned on the cache, so the
|
||||
NGINX fetches the file from Zurich while handing it over to my `curl` here. It notifies me by means
|
||||
of an HTTP header that the cache was a `MISS`, and then which upstream server it contacted to
|
||||
retrieve the object.
|
||||
|
||||
But look at what happens the _second_ time I run the same command:
|
||||
|
||||
```
|
||||
pim@nginx0-nlams1:~$ curl -v -o /dev/null --connect-to ipng.ch:443:localhost:443 \
|
||||
https://ipng.ch/media/vpp-proto/vpp-proto-bookworm.qcow2.lrz
|
||||
< last-modified: Sun, 01 Jun 2025 12:37:52 GMT
|
||||
< x-ipng-frontend: nginx0-nlams1
|
||||
< x-ipng-cache-status: HIT
|
||||
|
||||
100 711M 100 711M 0 0 436M 0 0:00:01 0:00:01 --:--:-- 437M
|
||||
```
|
||||
|
||||
|
||||
Holy moly! First I see the object has the same _Last-Modified_ header, but I now also see that the
|
||||
_Cache-Status_ was a `HIT`, and there is no mention of any upstream server. I do however see the
|
||||
file come in at a whopping 437MB/s which is 16x faster than over the network!! Nice work, NGINX!
|
||||
|
||||
{{< image float="right" src="/assets/minio/rack-2.png" alt="Rack-o-Minio" width="12em" >}}
|
||||
|
||||
# What's Next
|
||||
|
||||
I'm going to deploy the third MinIO replica in Rümlang once the disks arrive. I'll release the
|
||||
~4TB of disk used currently in Restic backups for the fleet, and put that ZFS capacity to other use.
|
||||
Now, creating services like PeerTube, Mastodon, Pixelfed, Loops, NextCloud and what-have-you, will
|
||||
become much easier for me. And with the per-bucket replication between MinIO deployments, I also
|
||||
think this is a great way to auto-backup important data. First off, it'll be RS8.4 on the MinIO node
|
||||
itself, and secondly, user data will be copied automatically to a neighboring facility.
|
||||
|
||||
I've convinced myself that S3 storage is a great service to operate, and that MinIO is awesome.
|
375
content/articles/2025-07-12-vpp-evpn-1.md
Normal file
@@ -0,0 +1,375 @@
|
||||
---
|
||||
date: "2025-07-12T08:07:23Z"
|
||||
title: 'VPP and eVPN/VxLAN - Part 1'
|
||||
---
|
||||
|
||||
{{< image width="6em" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I'm
|
||||
the very last on the planet to learn about something cool. My latest "A-Ha!"-moment was when I was
|
||||
configuring the eVPN fabric for [[Frys-IX](https://frys-ix.net/)], and I wrote up an article about
|
||||
it [[here]({{< ref 2025-04-09-frysix-evpn >}})] back in April.
|
||||
|
||||
I can build the equivalent of Virtual Private Wires (VPWS), also called L2VPN or Virtual Leased
|
||||
Lines, and these are straightforward because they typically only have two endpoints. A "regular"
|
||||
VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a
|
||||
look at an article on [[L2 Gymnastics]({{< ref 2022-01-12-vpp-l2 >}})] for that. But the real kicker
|
||||
is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS), also
|
||||
called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And *that* is a whole other
|
||||
level of awesome.
|
||||
|
||||
## Recap: VPP today
|
||||
|
||||
### VPP: VxLAN
|
||||
|
||||
The current VPP VxLAN tunnel plugin does point-to-point tunnels; that is, they are configured with a
|
||||
source address, destination address, destination port and VNI. As I mentioned, a point to point
|
||||
ethernet transport is configured very easily:
|
||||
|
||||
```
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298 instance 0
|
||||
vpp0# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/0
|
||||
vpp0# set int l2 xconnect HundredGigabitEthernet10/0/0 vxlan_tunnel0
|
||||
vpp0# set int state vxlan_tunnel0 up
|
||||
vpp0# set int state HundredGigabitEthernet10/0/0 up
|
||||
|
||||
vpp1# create vxlan tunnel src 192.0.2.254 dst 192.0.2.1 vni 8298 instance 0
|
||||
vpp1# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/1
|
||||
vpp1# set int l2 xconnect HundredGigabitEthernet10/0/1 vxlan_tunnel0
|
||||
vpp1# set int state vxlan_tunnel0 up
|
||||
vpp1# set int state HundredGigabitEthernet10/0/1 up
|
||||
```
|
||||
|
||||
And with that, `vpp0:Hu10/0/0` is cross connected with `vpp1:Hu10/0/1` and ethernet flows between
|
||||
the two.
|
||||
|
||||
### VPP: Bridge Domains
|
||||
|
||||
Now consider a VPLS with five different routers. While it's possible to create a bridge-domain and add
|
||||
some local ports and four other VxLAN tunnels:
|
||||
|
||||
```
|
||||
vpp0# create bridge-domain 8298
|
||||
vpp0# set int l2 bridge HundredGigabitEthernet10/0/1 8298
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.2 vni 8298 instance 0
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.3 vni 8298 instance 1
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.4 vni 8298 instance 2
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.5 vni 8298 instance 3
|
||||
vpp0# set int l2 bridge vxlan_tunnel0 8298
|
||||
vpp0# set int l2 bridge vxlan_tunnel1 8298
|
||||
vpp0# set int l2 bridge vxlan_tunnel2 8298
|
||||
vpp0# set int l2 bridge vxlan_tunnel3 8298
|
||||
```
|
||||
|
||||
To make this work, I will have to replicate this configuration to all other `vpp1`-`vpp4` routers.
|
||||
While it does work, it's really not very practical. When other VPP instances get added to a VPLS,
|
||||
every other router will have to have a new VxLAN tunnel created and added to its local bridge
|
||||
domain. With 1000s of VPLS instances on 100s of routers, that would yield ~100'000 VxLAN tunnels
|
||||
on every router, yikes!
|
||||
|
||||
Such a configuration reminds me in a way of iBGP in a large network: the naive approach is to have a
|
||||
full mesh of all routers speaking to all other routers, but that quickly becomes a maintenance
|
||||
headache. The canonical solution for this is to create iBGP _Route Reflectors_ to which every router
|
||||
connects, and their job is to redistribute routing information between the fleet of routers. This
|
||||
turns the iBGP problem from an O(N^2) to an O(N) problem: all 1'000 routers connect to, say, three
|
||||
regional route reflectors for a total of 3'000 BGP connections, which is much better than ~1'000'000
|
||||
BGP connections in the naive approach.
|
||||
|
||||
## Recap: eVPN Moving parts
|
||||
|
||||
The reason why I got so enthusiastic when I was playing with Arista and Nokia's eVPN stuff, is
|
||||
because it requires very little dataplane configuration, and a relatively intuitive controlplane
|
||||
configuration:
|
||||
|
||||
1. **Dataplane**: For each L2 broadcast domain (be it a L2XC or a Bridge Domain), really all I
|
||||
need is a single VxLAN interface with a given VNI, which should be able to send encapsulated
|
||||
ethernet frames to one or more other speakers in the same domain.
|
||||
1. **Controlplane**: I will need to learn MAC addresses locally, and inform some BGP eVPN
|
||||
implementation of who-lives-where. Other VxLAN speakers learn of the MAC addresses I own, and
|
||||
will send me encapsulated ethernet for those addresses
|
||||
1. **Dataplane**: For unknown layer2 destinations, like _Broadcast_, _Unknown Unicast_, and
|
||||
_Multicast_ (BUM) traffic, I will want to keep track of which other VxLAN speakers these
|
||||
packets should be flooded to. I make note that this is not that different from flooding the packets
|
||||
to local interfaces, except here it'd be flooding them to remote VxLAN endpoints.
|
||||
1. **ControlPlane**: Flooding L2 traffic across wide area networks is typically considered icky,
|
||||
so a few tricks might be optionally deployed. Since the controlplane already knows which MAC
|
||||
lives where, it may as well also make note of any local IPv4 ARP and IPv6 neighbor discovery
|
||||
replies and teach its peers which IPv4/IPv6 addresses live where: a distributed neighbor table.
|
||||
|
||||
{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
For the controlplane parts, [[FRRouting](https://frrouting.org/)] has a working implementation for
|
||||
L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [[Bird](https://birg.nic.cz/)], is slowly catching up, and
|
||||
has a few of these controlplane parts already working (mostly MAC-VRF). Commercial vendors like Arista,
|
||||
Nokia, Juniper, Cisco are ready to go. If we want VPP to inter-operate, we may need to make a few
|
||||
changes.
|
||||
|
||||
## VPP: Changes needed
|
||||
|
||||
### Dynamic VxLAN
|
||||
|
||||
I propose two changes to the VxLAN plugin, or perhaps, a new plugin that changes the behavior so that
|
||||
we don't have to break any performance or functional promises to existing users. This new VxLAN
|
||||
interface behavior changes in the following ways:
|
||||
|
||||
1. Each VxLAN interface has a local L2FIB attached to it, the keys are MAC address and the
|
||||
values are remote VTEPs. In its simplest form, the values would be just IPv4 or IPv6 addresses,
|
||||
because I can re-use the VNI and port information from the tunnel definition itself.
|
||||
|
||||
1. Each VxLAN interface has a local flood-list attached to it. This list contains remote VTEPs
|
||||
that I am supposed to send 'flood' packets to. Similar to the Bridge Domain, when packets are marked
|
||||
for flooding, I will need to prepare and replicate them, sending them to each VTEP.
|
||||
|
||||
|
||||
A set of APIs will be needed to manipulate these:
|
||||
* ***Interface***: I will need to have an interface create, delete and list call, which will
|
||||
be able to maintain the interfaces, their metadata like source address, source/destination port,
|
||||
VNI and such.
|
||||
* ***L2FIB***: I will need to add, replace, delete, and list which MAC addresses go where,
|
||||
With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the
|
||||
dst_addr can be written into the packet.
|
||||
* ***Flooding***: For those packets that are not unicast (BUM), I will need to be able to add,
|
||||
remove and list which VTEPs should receive this packet.
|
||||
|
||||
It would be pretty dope if the configuration looked something like this:
|
||||
```
|
||||
vpp# create evpn-vxlan src <v46address> dst-port <port> vni <vni> instance <id>
|
||||
vpp# evpn-vxlan l2fib <iface> mac <mac> dst <v46address> [del]
|
||||
vpp# evpn-vxlan flood <iface> dst <v46address> [del]
|
||||
```
|
||||
|
||||
The VxLAN underlay transport can be either IPv4 or IPv6. Of course manipulating L2FIB or Flood
|
||||
destinations must match the address family of an interface of type evpn-vxlan. A practical example
|
||||
might be:
|
||||
|
||||
```
|
||||
vpp# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 6
|
||||
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2
|
||||
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3
|
||||
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::2
|
||||
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::3
|
||||
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::4
|
||||
```
|
||||
|
||||
By the way, while this _could_ be a new plugin, it could also just be added to the existing VxLAN
|
||||
plugin. One way in which I might do this when creating a normal vxlan tunnel is to allow for its
|
||||
destination address to be either 0.0.0.0 for IPv4 or :: for IPv6. That would signal 'dynamic'
|
||||
tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN packet by
|
||||
the time it takes to call `ip46_address_is_zero()`, which is only a handful of clocks.
|
||||
|
||||
### Bridge Domain
|
||||
|
||||
{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
It's important to understand that L2 learning is **required** for eVPN to function. Each router
|
||||
needs to be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This
|
||||
rules out the simple case of L2XC because there, no learning is performed. The corollary is that a
|
||||
bridge-domain is required for any form of eVPN.
|
||||
|
||||
The L2 code in VPP already does most of what I'd need. It maintains an L2FIB in `vnet/l2/l2_fib.c`,
|
||||
which is keyed by bridge-id and MAC address, and its values are a 64 bit structure that points
|
||||
essentially to a `sw_if_index` output interface. The L2FIB of the eVPN needs a bit more information
|
||||
though, notably a `ip46address` struct to know which VTEP to send to. It's tempting to add this
|
||||
extra data to the bridge domain code. I would recommend against it, because other implementations,
|
||||
for example MPLS, GENEVE or Carrier Pigeon IP may need more than just the destination address. Even
|
||||
the VxLAN implementation I'm thinking about might want to be able to override other things like the
|
||||
destination port for a given VTEP, or even the VNI. Putting all of this stuff in the bridge-domain
|
||||
code will just clutter it, for all users, not just those users who might want eVPN.
|
||||
|
||||
Similarly, one might argue it is tempting to re-use/extend the behavior in `vnet/l2/l2_flood.c`,
|
||||
because if it's already replicating BUM traffic, why not replicate it many times over the flood list
|
||||
for any member interface that happens to be a dynamic VxLAN interface? This would be a bad idea
|
||||
because of a few reasons. Firstly, it is not guaranteed that the VxLAN plugin is loaded, and in
|
||||
doing this, I would leak internal details of VxLAN into the bridge-domain code. Secondly, the
|
||||
`l2_flood.c` code would potentially get messy if other types were added (like the MPLS and GENEVE
|
||||
above).
|
||||
|
||||
A reasonable request is to mark such BUM frames once in the existing L2 code and when handing the
|
||||
replicated packet into the VxLAN node, to see the `is_bum` marker and once again replicate -- in the
|
||||
vxlan plugin -- these packets to the VTEPs in our local flood-list. Although a bit more work, this
|
||||
approach only requires a tiny amount of work in the `l2_flood.c` code (the marking), and will keep
|
||||
all the logic tucked away where it is relevant, derisking the VPP vnet codebase.
|
||||
|
||||
Fundamentally, I think the cleanest design is to keep the dynamic VxLAN interface fully
|
||||
self-contained, and it would therefore maintain its own L2FIB and Flooding logic. The only thing I
|
||||
would add to the L2 codebase is some form of BUM marker to allow for efficient flooding.
|
||||
|
||||
### Control Plane
|
||||
|
||||
There's a few things the control plane has to do. Some external agent, like FRR or Bird, will be
|
||||
receiving a few types of eVPN messages. The ones I'm interested in are:
|
||||
|
||||
* ***Type 2***: MAC/IP Advertisement Route
|
||||
- On the way in, these should be fed to the VxLAN L2FIB belonging to the bridge-domain.
|
||||
- On the way out, learned addresses should be advertised to peers.
|
||||
- Regarding IPv4/IPv6 addresses, that is the ARP / ND tables: we can talk about those later.
|
||||
* ***Type 3***: Inclusive Multicast Ethernet Tag Route
|
||||
- On the way in, these will populate the VxLAN Flood list belonging to the bridge-domain
|
||||
- On the way out, each bridge-domain should advertise itself as IMET to peers.
|
||||
* ***Type 5***: IP Prefix Route
|
||||
- Similar to IP information in Type 2, we can talk about those later once L3VPN/eVPN is
|
||||
needed.
|
||||
|
||||
The 'on the way in' stuff can be easily done with my proposed APIs in the Dynamic VxLAN (or a new
|
||||
eVPN VxLAN) plugin. Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is
|
||||
concerned. It's just that the controlplane implementation needs to somehow _feed_ the API, so an
|
||||
external program may be needed, or alternatively the Linux Control Plane netlink plugin might be used
|
||||
to consume this information.
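
To make the 'feed the API' part concrete, here is a hypothetical sketch of such a sidecar, using the
proposed `evpn-vxlan` CLI from above via `vppctl`. The function names and the route source are my own
invention, the CLI itself is only a proposal and does not exist in upstream VPP, and a real
controller would use the binary API rather than shelling out:

```python
#!/usr/bin/env python3
"""Hypothetical sidecar sketch: translate learned eVPN routes into the proposed
'evpn-vxlan' CLI calls from this article. Not real VPP functionality."""
import subprocess

def vppctl(*args: str) -> None:
    # Shell out to the VPP debug CLI; a real controller would use the binary API.
    subprocess.run(["vppctl", *args], check=True)

def on_type2_route(iface: str, mac: str, vtep: str, withdraw: bool = False) -> None:
    # Type 2 (MAC/IP Advertisement): program the per-interface L2FIB.
    cmd = ["evpn-vxlan", "l2fib", iface, "mac", mac, "dst", vtep]
    if withdraw:
        cmd.append("del")
    vppctl(*cmd)

def on_type3_route(iface: str, vtep: str, withdraw: bool = False) -> None:
    # Type 3 (Inclusive Multicast Ethernet Tag): maintain the flood list.
    cmd = ["evpn-vxlan", "flood", iface, "dst", vtep]
    if withdraw:
        cmd.append("del")
    vppctl(*cmd)

if __name__ == "__main__":
    # Example: a peer at 2001:db8::2 advertised a MAC route and its IMET route.
    on_type2_route("evpn-vxlan0", "00:01:02:82:98:02", "2001:db8::2")
    on_type3_route("evpn-vxlan0", "2001:db8::2")
```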
|
||||
|
||||
The 'on the way out' stuff is a bit trickier. I will need to listen to creation of new broadcast
|
||||
domains and associate them with the right IMET announcements, and for each MAC address learned, pick
|
||||
them up and advertise them into eVPN. Later, if ever ARP and ND proxying becomes important, I'll
|
||||
have to revisit the bridge-domain feature to do IPv4 ARP and IPv6 Neighbor Discovery, and replace it
|
||||
with some code that populates the IPv4/IPv6 parts of the Type2 messages on the way out, and
|
||||
similarly on the way in, populates an L3 neighbor cache for the bridge domain, so ARP and ND replies
|
||||
can be synthesized based on what we've learned in eVPN.
|
||||
|
||||
# Demonstration
|
||||
|
||||
### VPP: Current VxLAN
|
||||
|
||||
I'll build a small demo environment on Summer to show how the interaction of VxLAN and Bridge
|
||||
Domain works today:
|
||||
|
||||
```
|
||||
vpp# create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24
|
||||
vpp# set int state tap0 up
|
||||
vpp# set int ip address tap0 192.0.2.1/24
|
||||
vpp# set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static
|
||||
vpp# set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static
|
||||
vpp# set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static
|
||||
|
||||
vpp# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298
|
||||
vpp# set int state vxlan_tunnel0 up
|
||||
|
||||
vpp# create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82
|
||||
vpp# set int state tap1 up
|
||||
|
||||
vpp# create bridge-domain 8298
|
||||
vpp# set int l2 bridge tap1 8298
|
||||
vpp# set int l2 bridge vxlan_tunnel0 8298
|
||||
```
|
||||
|
||||
I've created a tap device called `dummy0` and gave it an IPv4 address. Normally, I would use some
|
||||
DPDK or RDMA interface like `TenGigabitEthernet10/0/0`. Then I'll populate some static ARP entries.
|
||||
Again, normally this would just be 'use normal routing'. However, for the purposes of this
|
||||
demonstration, it helps to use a TAP device, as any packets I make VPP send to those 192.0.2.254 and
|
||||
so on, can be captured with `tcpdump` in Linux in addition to `trace add` in VPP.
|
||||
|
||||
Then, I create a VxLAN tunnel with a default destination of 192.0.2.254 and the given VNI.
|
||||
Next, I create a TAP interface called `vpptap0` with the given MAC address.
|
||||
Finally, I bind these two interfaces together in a bridge-domain.
|
||||
|
||||
I proceed to write a small ScaPY program:
|
||||
|
||||
```python
|
||||
#!/usr/bin/env python3
|
||||
|
||||
from scapy.all import Ether, IP, UDP, Raw, sendp
|
||||
|
||||
pkt = Ether(dst="01:02:03:04:05:02", src="02:fe:64:dc:1b:82", type=0x0800) \
    / IP(src="192.168.1.1", dst="192.168.1.2") \
    / UDP(sport=8298, dport=7) / Raw(load=b"ping")
|
||||
print(pkt)
|
||||
sendp(pkt, iface="vpptap0")
|
||||
|
||||
pkt = Ether(dst="01:02:03:04:05:03", src="02:fe:64:dc:1b:82", type=0x0800) \
    / IP(src="192.168.1.1", dst="192.168.1.3") \
    / UDP(sport=8298, dport=7) / Raw(load=b"ping")
|
||||
print(pkt)
|
||||
sendp(pkt, iface="vpptap0")
|
||||
```
|
||||
|
||||
What will happen is, the ScaPY program will emit these frames into device `vpptap0` which is in
|
||||
bridge-domain 8298. The bridge will learn our src MAC `02:fe:64:dc:1b:82`, and look up the dst MAC
|
||||
`01:02:03:04:05:02`, and because there hasn't been traffic yet, it'll flood to all member ports, one
|
||||
of which is the VxLAN tunnel. VxLAN will then encapsulate the packets to the other side of the
|
||||
tunnel.
|
||||
|
||||
```
|
||||
pim@summer:~$ sudo ./vxlan-test.py
|
||||
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.2:echo / Raw
|
||||
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.3:echo / Raw
|
||||
|
||||
pim@summer:~$ sudo tcpdump -evni dummy0
|
||||
10:50:35.310620 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
|
||||
192.0.2.1.6345 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
|
||||
10:50:35.362552 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
|
||||
192.0.2.1.23916 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
|
||||
```
|
||||
|
||||
I want to point out that nothing, so far, is special. All of this works with upstream VPP just fine.
|
||||
I can see two VxLAN encapsulated packets, both destined to `192.0.2.254:4789`. Cool.
|
||||
|
||||
### Dynamic VPP VxLAN
|
||||
|
||||
I wrote a prototype for a Dynamic VxLAN tunnel in [[43433](https://gerrit.fd.io/r/c/vpp/+/43433)].
|
||||
The good news is, this works. The bad news is, I think I'll want to discuss my proposal (this
|
||||
article) with the community before going further down a potential rabbit hole.
|
||||
|
||||
With my gerrit patched in, I can do the following:
|
||||
|
||||
```
|
||||
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:02 dst 192.0.2.2
|
||||
Added VXLAN dynamic destination for 01:02:03:04:05:02 on vxlan_tunnel0 dst 192.0.2.2
|
||||
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:03 dst 192.0.2.3
|
||||
Added VXLAN dynamic destination for 01:02:03:04:05:03 on vxlan_tunnel0 dst 192.0.2.3
|
||||
|
||||
vpp# show vxlan l2fib
|
||||
VXLAN Dynamic L2FIB entries:
|
||||
MAC Interface Destination Port VNI
|
||||
01:02:03:04:05:02 vxlan_tunnel0 192.0.2.2 4789 8298
|
||||
01:02:03:04:05:03 vxlan_tunnel0 192.0.2.3 4789 8298
|
||||
Dynamic L2FIB entries: 2
|
||||
```
|
||||
|
||||
I've instructed the VxLAN tunnel to change the tunnel destination based on the destination MAC.
|
||||
|
||||
|
||||
I run the script and tcpdump again:
|
||||
|
||||
```
|
||||
pim@summer:~$ sudo tcpdump -evni dummy0
|
||||
11:16:53.834619 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3945 (->3997)!)
|
||||
192.0.2.1.6345 > 192.0.2.2.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
|
||||
11:16:53.882554 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3944 (->3996)!)
|
||||
192.0.2.1.23916 > 192.0.2.3.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
|
||||
```
|
||||
|
||||
Two important notes: Firstly, this works! For the MAC address ending in `:02`, send the packet to
|
||||
`192.0.2.2` instead of the default of `192.0.2.254`. Same for the `:03` MAC which now goes to
|
||||
`192.0.2.3`. Nice! But secondly, the IPv4 header of the VxLAN packets was changed, so there needs to
|
||||
be a call to `ip4_header_checksum()` inserted somewhere. That's an easy fix.
|
||||
|
||||
# What's next
|
||||
|
||||
I want to discuss a few things, perhaps at an upcoming VPP Community meeting. Notably:
|
||||
1. Is the VPP Developer community supportive of adding eVPN support? Does anybody want to help
|
||||
write it with me?
|
||||
1. Is changing the existing VxLAN plugin appropriate, or should I make a new plugin which adds
|
||||
dynamic endpoints, L2FIB and Flood lists for BUM traffic?
|
||||
1. Is it acceptable for me to add a BUM marker in `l2_flood.c` so that I can reuse all the logic
|
||||
from bridge-domain flooding as I extend to also do VTEP flooding?
|
||||
1. (perhaps later) VxLAN is the canonical underlay, but is there an appetite to extend also to,
|
||||
say, GENEVE or MPLS?
|
||||
1. (perhaps later) What's a good way to tie in a controlplane like FRRouting or Bird2 into the
|
||||
dataplane (perhaps using a sidecar controller, or perhaps using Linux CP Netlink messages)?
|
||||
|
701
content/articles/2025-07-26-ctlog-1.md
Normal file
@@ -0,0 +1,701 @@
|
||||
---
|
||||
date: "2025-07-26T22:07:23Z"
|
||||
title: 'Certificate Transparency - Part 1 - TesseraCT'
|
||||
aliases:
|
||||
- /s/articles/2025/07/26/certificate-transparency-part-1/
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)], as the
|
||||
name suggests it was a form of _digital notary_, and they were in the business of issuing security
|
||||
certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and
|
||||
subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for
|
||||
man-in-the-middle attacks on Iranian Gmail users. Not cool.
|
||||
|
||||
Google launched a project called **Certificate Transparency**, because it was becoming more common
|
||||
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
|
||||
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
|
||||
the Web Public Key Infrastructure. It led to the creation of this ambitious
|
||||
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
|
||||
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
|
||||
and _TLS_ (Transport Layer Security).
|
||||
|
||||
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
|
||||
describes an experimental protocol for publicly logging the existence of Transport Layer Security
|
||||
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
|
||||
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
|
||||
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
|
||||
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
|
||||
the logs.
|
||||
|
||||
This series explores and documents how IPng Networks will be running two Static CT _Logs_ with two
|
||||
different implementations. One will be [[Sunlight](https://sunlight.dev/)], and the other will be
|
||||
[[TesseraCT](https://github.com/transparency-dev/tesseract)].
|
||||
|
||||
## Static Certificate Transparency
|
||||
|
||||
In this context, _Logs_ are network services that implement the protocol operations for submissions
|
||||
and queries that are defined in a specification that builds on the previous RFC. A few years ago,
|
||||
my buddy Antonis asked me if I would be willing to run a log, but operationally they were very
|
||||
complex and expensive to run. However, over the years, the concept of _Static Logs_ put running one
|
||||
in reach. This [[Static CT API](https://github.com/C2SP/C2SP/blob/main/static-ct-api.md)] defines a
|
||||
read-path HTTP static asset hierarchy (for monitoring) to be implemented alongside the write-path
|
||||
RFC 6962 endpoints (for submission).
|
||||
|
||||
Aside from the different read endpoints, a log that implements the Static API is a regular CT log
|
||||
that can work alongside RFC 6962 logs and that fulfills the same purpose. In particular, it requires
|
||||
no modification to submitters and TLS clients.
|
||||
|
||||
If you only read one document about Static CT, read Filippo Valsorda's excellent
|
||||
[[paper](https://filippo.io/a-different-CT-log)]. It describes a radically cheaper and easier to
|
||||
operate [[Certificate Transparency](https://certificate.transparency.dev/)] log that is backed by a
|
||||
consistent object storage, and can scale to 30x the current issuance rate for 2-10% of the costs
|
||||
with no merge delay.
|
||||
|
||||
## Scalable, Cheap, Reliable: choose two
|
||||
|
||||
{{< image width="18em" float="right" src="/assets/ctlog/MPLS Backbone - CTLog.svg" alt="ctlog at ipng" >}}
|
||||
|
||||
In the diagram, I've drawn an overview of IPng's network. In {{< boldcolor color="red" >}}red{{<
|
||||
/boldcolor >}} a european backbone network is provided by a [[BGP Free Core
|
||||
network]({{< ref 2022-12-09-oem-switch-2 >}})]. It operates a private IPv4, IPv6, and MPLS network, called
|
||||
_IPng Site Local_, which is not connected to the internet. On top of that, IPng offers L2 and L3
|
||||
services, for example using [[VPP]({{< ref 2021-02-27-network >}})].
|
||||
|
||||
In {{< boldcolor color="lightgreen" >}}green{{< /boldcolor >}} I built a cluster of replicated
|
||||
NGINX frontends. They connect into _IPng Site Local_ and can reach all hypervisors, VMs, and storage
|
||||
systems. They also connect to the Internet with a single IPv4 and IPv6 address. One might say that
|
||||
SSL is _added and removed here :-)_ [[ref](/assets/ctlog/nsa_slide.jpg)].
|
||||
|
||||
Then in {{< boldcolor color="orange" >}}orange{{< /boldcolor >}} I built a set of [[MinIO]({{< ref
|
||||
2025-05-28-minio-1 >}})] S3 storage pools. Amongst others, I serve the static content from the IPng
|
||||
website from these pools, providing fancy redundancy and caching. I wrote about its design in [[this
|
||||
article]({{< ref 2025-06-01-minio-2 >}})].
|
||||
|
||||
Finally, I turn my attention to the {{< boldcolor color="blue" >}}blue{{< /boldcolor >}} which is
|
||||
two hypervisors, one run by [[IPng](https://ipng.ch/)] and the other by [[Massar](https://massars.net/)]. Each
|
||||
of them will be running one of the _Log_ implementations. IPng provides two large ZFS storage tanks
|
||||
for offsite backup, in case a hypervisor decides to check out, and daily backups to an S3 bucket
|
||||
using Restic.
|
||||
|
||||
Having explained all of this, I am well aware that end to end reliability will be coming from the
|
||||
fact that there are many independent _Log_ operators, and folks wanting to validate certificates can
|
||||
simply monitor many. If there is a gap in coverage, say due to any given _Log_'s downtime, this will
|
||||
not necessarily be problematic. It does mean that I may have to suppress the SRE in me...
|
||||
|
||||
## MinIO
|
||||
|
||||
My first instinct is to leverage the distributed storage IPng has, but as I'll show in the rest of
|
||||
this article, maybe a simpler, more elegant design could be superior, precisely because individual
|
||||
log reliability is not _as important_ as having many available log _instances_ to choose from.
|
||||
|
||||
From operators in the field I understand that the world-wide generation of certificates is roughly
|
||||
17M/day, which amounts to some 200-250qps of writes. Antonis explains that certs with a validity
of 180 days or less will need two CT log entries, while certs with a validity of more than 180d will
|
||||
need three CT log entries. So the write rate is roughly 2.2x that, as an upper bound.
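
A quick back-of-the-envelope check of those numbers (the 17M/day and 2.2x figures are the estimates
quoted above):

```python
# Back-of-the-envelope write rate for a CT log, using the figures quoted above.
certs_per_day = 17_000_000            # world-wide certificate issuance, estimate
base_qps = certs_per_day / 86_400     # ~197 submissions per second
entry_multiplier = 2.2                # <=180d validity: 2 entries, >180d: 3 entries
write_qps = base_qps * entry_multiplier

print(f"~{base_qps:.0f} submissions/s, ~{write_qps:.0f} log writes/s upper bound")
# prints: ~197 submissions/s, ~433 log writes/s upper bound
```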
|
||||
|
||||
My first thought is to see how fast my open source S3 machines can go, really. I'm curious also as
|
||||
to the difference between SSD and spinning disks.
|
||||
|
||||
I boot two Dell R630s in the Lab. These machines have two Xeon E5-2640 v4 CPUs for a total of 20
|
||||
cores and 40 threads, and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I
|
||||
place 6pcs 1.2TB SAS3 disks (HPE part number EG1200JEHMC), and in the second machine I place 6pcs
|
||||
of 1.92TB enterprise storage (Samsung part number P1633N19).
|
||||
|
||||
I spin up a 6-device MinIO cluster on both and take them out for a spin using [[S3
|
||||
Benchmark](https://github.com/wasabi-tech/s3-benchmark.git)] from Wasabi Tech.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/s3-benchmark$ for dev in disk ssd; do \
|
||||
for t in 1 8 32; do \
|
||||
for z in 4M 1M 8k 4k; do \
|
||||
./s3-benchmark -a $KEY -s $SECRET -u http://minio-$dev:9000 -t $t -z $z \
|
||||
| tee -a minio-results.txt; \
|
||||
done; \
|
||||
done; \
|
||||
done
|
||||
```
|
||||
|
||||
The loadtest above does a bunch of runs with varying parameters. First it tries to read and write
|
||||
object sizes of 4MB, 1MB, 8kB and 4kB respectively. Then it tries to do this with either 1 thread, 8
|
||||
threads or 32 threads. Finally it tests both the disk-based variant as well as the SSD based one.
|
||||
The loadtest runs from a third machine, so that the Dell R630 disk tanks can stay completely
|
||||
dedicated to their task of running MinIO.
|
||||
|
||||
{{< image width="100%" src="/assets/ctlog/minio_8kb_performance.png" alt="MinIO 8kb disk vs SSD" >}}
|
||||
|
||||
The left-hand side graph feels pretty natural to me. With one thread, uploading 8kB objects will
|
||||
quickly hit the IOPS rate of the disks, each of which has to participate in the write due to EC:3
|
||||
encoding when using six disks, and it tops out at ~56 PUT/s. The single thread hitting SSDs will not
|
||||
hit that limit, and has ~371 PUT/s which I found a bit underwhelming. But, when performing the
|
||||
loadtest with either 8 or 32 write threads, the hard disks become only marginally faster (topping
|
||||
out at 240 PUT/s), while the SSDs really start to shine, with 3850 PUT/s. Pretty good performance.
|
||||
|
||||
On the read-side, I am pleasantly surprised that there's not really that much of a difference
|
||||
between disks and SSDs. This is likely because the host filesystem cache is playing a large role, so
|
||||
the 1-thread performance is equivalent (765 GET/s for disks, 677 GET/s for SSDs), and the 32-thread
|
||||
performance is also equivalent (at 7624 GET/s for disks with 7261 GET/s for SSDs). I do wonder why
|
||||
the hard disks consistently outperform the SSDs with all the other variables (OS, MinIO version,
|
||||
hardware) the same.
|
||||
|
||||
## Sidequest: SeaweedFS
|
||||
|
||||
Something that has long caught my attention is the way in which
|
||||
[[SeaweedFS](https://github.com/seaweedfs/seaweedfs)] approaches blob storage. Many operators have
|
||||
great success with many small file writes in SeaweedFS compared to MinIO and even AWS S3 storage.
|
||||
This is because writes with WeedFS are not broken into erasure-sets, which would require every disk
|
||||
to write a small part or checksum of the data, but rather files are replicated within the cluster in
|
||||
their entirety on different disks, racks or datacenters. I won't bore you with the details of
|
||||
SeaweedFS but I'll tack on a docker [[compose file](/assets/ctlog/seaweedfs.docker-compose.yml)]
|
||||
that I used at the end of this article, if you're curious.
|
||||
|
||||
{{< image width="100%" src="/assets/ctlog/size_comparison_8t.png" alt="MinIO vs SeaWeedFS" >}}
|
||||
|
||||
In the write-path, SeaweedFS dominates in all cases, due to its different way of achieving durable
|
||||
storage (per-file replication in SeaweedFS versus all-disk erasure-sets in MinIO):
|
||||
* 4k: 3,384 ops/sec vs MinIO's 111 ops/sec (30x faster!)
|
||||
* 8k: 3,332 ops/sec vs MinIO's 111 ops/sec (30x faster!)
|
||||
* 1M: 383 ops/sec vs MinIO's 44 ops/sec (9x faster)
|
||||
* 4M: 104 ops/sec vs MinIO's 32 ops/sec (4x faster)
|
||||
|
||||
For the read-path, in GET operations MinIO is better at small objects, and really dominates the
|
||||
large objects:
|
||||
* 4k: 7,411 ops/sec vs SeaweedFS 5,014 ops/sec
|
||||
* 8k: 7,666 ops/sec vs SeaweedFS 5,165 ops/sec
|
||||
* 1M: 5,466 ops/sec vs SeaweedFS 2,212 ops/sec
|
||||
* 4M: 3,084 ops/sec vs SeaweedFS 646 ops/sec
|
||||
|
||||
This makes me draw an interesting conclusion: seeing as CT Logs are read/write heavy (every couple
|
||||
of seconds, the Merkle tree is recomputed which is reasonably disk-intensive), SeaweedFS might be a
|
||||
slightly better choice. IPng Networks has three MinIO deployments, but no SeaweedFS deployments. Yet.
|
||||
|
||||
# Tessera
|
||||
|
||||
[[Tessera](https://github.com/transparency-dev/tessera.git)] is a Go library for building tile-based
|
||||
transparency logs (tlogs) [[ref](https://github.com/C2SP/C2SP/blob/main/tlog-tiles.md)]. It is the
|
||||
logical successor to the approach that Google took when building and operating _Logs_ using its
|
||||
predecessor called [[Trillian](https://github.com/google/trillian)]. The implementation and its APIs
|
||||
bake-in current best-practices based on the lessons learned over the past decade of building and
|
||||
operating transparency logs in production environments and at scale.
|
||||
|
||||
Tessera was introduced at the Transparency.Dev summit in October 2024. I first watched Al and Martin
|
||||
[[introduce](https://www.youtube.com/watch?v=9j_8FbQ9qSc)] it at last year's summit. At a high
|
||||
level, it wraps what used to be a whole kubernetes cluster full of components, into a single library
|
||||
that can be used with Cloud based services, either like AWS S3 and RDS database, or like GCP's GCS
|
||||
storage and Spanner database. However, Google also made it easy to use a regular POSIX filesystem
|
||||
implementation.
|
||||
|
||||
## TesseraCT
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="tesseract logo" >}}
|
||||
|
||||
While Tessera is a library, a CT log implementation comes from its sibling GitHub repository called
|
||||
[[TesseraCT](https://github.com/transparency-dev/tesseract)]. Because it leverages Tessera under the
|
||||
hood, TesseraCT can run on GCP, AWS, POSIX-compliant, or on S3-compatible systems alongside a MySQL
|
||||
database. In order to provide ecosystem agility and to control the growth of CT Log sizes, new CT
|
||||
Logs must be temporally sharded, defining a certificate expiry range denoted in the form of two
|
||||
dates: `[rangeBegin, rangeEnd)`. The certificate expiry range allows a Log to reject otherwise valid
|
||||
logging submissions for certificates that expire before or after this defined range, thus
|
||||
partitioning the set of publicly-trusted certificates that each Log will accept. I will be expected
|
||||
to keep logs for an extended period of time, say 3-5 years.
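
To illustrate the half-open `[rangeBegin, rangeEnd)` semantics, here is a tiny sketch of the
acceptance check a temporally sharded log performs on a certificate's expiry date; the function name
and code are mine, not TesseraCT's:

```python
from datetime import datetime, timezone

# Hypothetical helper (not TesseraCT code): a temporally sharded log only accepts
# certificates whose notAfter falls inside [range_begin, range_end).
def shard_accepts(not_after: datetime, range_begin: datetime, range_end: datetime) -> bool:
    return range_begin <= not_after < range_end

begin = datetime(2026, 1, 1, tzinfo=timezone.utc)
end = datetime(2027, 1, 1, tzinfo=timezone.utc)
print(shard_accepts(datetime(2026, 6, 30, tzinfo=timezone.utc), begin, end))  # True
print(shard_accepts(datetime(2027, 1, 1, tzinfo=timezone.utc), begin, end))   # False: rangeEnd is exclusive
```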
|
||||
|
||||
It's time for me to figure out what this TesseraCT thing can do .. are you ready? Let's go!
|
||||
|
||||
### TesseraCT: S3 and SQL
|
||||
|
||||
TesseraCT comes with a few so-called _personalities_. Those are an implementation of the underlying
|
||||
storage infrastructure in an opinionated way. The first personality I look at is the `aws` one in
|
||||
`cmd/tesseract/aws`. I notice that this personality does make hard assumptions about the use of AWS
|
||||
which is unfortunate as the documentation says '.. or self-hosted S3 and MySQL database'. However,
|
||||
the `aws` personality assumes the AWS Secrets Manager in order to fetch its signing key. Before I
|
||||
can be successful, I need to detangle that.
|
||||
|
||||
#### TesseraCT: AWS and Local Signer
|
||||
|
||||
First, I change `cmd/tesseract/aws/main.go` to add two new flags:
|
||||
|
||||
* ***-signer_public_key_file***: a path to the public key for checkpoints and SCT signer
|
||||
* ***-signer_private_key_file***: a path to the private key for checkpoints and SCT signer
|
||||
|
||||
I then change the program to assume if these flags are both set, the user will want a
|
||||
_NewLocalSigner_ instead of a _NewSecretsManagerSigner_. Now all I have to do is implement the
|
||||
signer interface in a package `local_signer.go`. There, function _NewLocalSigner()_ will read the
|
||||
public and private PEM from file, decode them, and create an _ECDSAWithSHA256Signer_ with them, a
|
||||
simple example to show what I mean:
|
||||
|
||||
```
|
||||
// NewLocalSigner creates a new signer that uses the ECDSA P-256 key pair from
|
||||
// local disk files for signing digests.
|
||||
func NewLocalSigner(publicKeyFile, privateKeyFile string) (*ECDSAWithSHA256Signer, error) {
|
||||
// Read public key
|
||||
publicKeyPEM, err := os.ReadFile(publicKeyFile)
|
||||
publicPemBlock, rest := pem.Decode(publicKeyPEM)
|
||||
|
||||
var publicKey crypto.PublicKey
|
||||
publicKey, err = x509.ParsePKIXPublicKey(publicPemBlock.Bytes)
|
||||
ecdsaPublicKey, ok := publicKey.(*ecdsa.PublicKey)
|
||||
|
||||
// Read private key
|
||||
privateKeyPEM, err := os.ReadFile(privateKeyFile)
|
||||
privatePemBlock, rest := pem.Decode(privateKeyPEM)
|
||||
|
||||
var ecdsaPrivateKey *ecdsa.PrivateKey
|
||||
ecdsaPrivateKey, err = x509.ParseECPrivateKey(privatePemBlock.Bytes)
|
||||
|
||||
// Verify the correctness of the signer key pair
|
||||
if !ecdsaPrivateKey.PublicKey.Equal(ecdsaPublicKey) {
|
||||
return nil, errors.New("signer key pair doesn't match")
|
||||
}
|
||||
|
||||
return &ECDSAWithSHA256Signer{
|
||||
publicKey: ecdsaPublicKey,
|
||||
privateKey: ecdsaPrivateKey,
|
||||
}, nil
|
||||
}
|
||||
```
|
||||
|
||||
In the snippet above I omitted all of the error handling, but the local signer logic itself is
|
||||
hopefully clear. And with that, I am liberated from Amazon's Cloud offering and can run this thing
|
||||
all by myself!
|
||||
|
||||
#### TesseraCT: Running with S3, MySQL, and Local Signer
|
||||
|
||||
First, I need to create a suitable ECDSA key:
|
||||
```
|
||||
pim@ctlog-test:~$ openssl ecparam -name prime256v1 -genkey -noout -out /tmp/private_key.pem
|
||||
pim@ctlog-test:~$ openssl ec -in /tmp/private_key.pem -pubout -out /tmp/public_key.pem
|
||||
```
|
||||
|
||||
Then, I'll install the MySQL server and create the databases:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo apt install default-mysql-server
|
||||
pim@ctlog-test:~$ sudo mysql -u root
|
||||
|
||||
CREATE USER 'tesseract'@'localhost' IDENTIFIED BY '<db_passwd>';
|
||||
CREATE DATABASE tesseract;
|
||||
CREATE DATABASE tesseract_antispam;
|
||||
GRANT ALL PRIVILEGES ON tesseract.* TO 'tesseract'@'localhost';
|
||||
GRANT ALL PRIVILEGES ON tesseract_antispam.* TO 'tesseract'@'localhost';
|
||||
```
|
||||
|
||||
Finally, I use the SSD MinIO lab-machine that I just loadtested to create an S3 bucket.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ mc mb minio-ssd/tesseract-test
|
||||
pim@ctlog-test:~$ cat << EOF > /tmp/minio-access.json
|
||||
{ "Version": "2012-10-17", "Statement": [ {
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ],
|
||||
"Resource": [ "arn:aws:s3:::tesseract-test/*", "arn:aws:s3:::tesseract-test" ]
|
||||
} ]
|
||||
}
|
||||
EOF
|
||||
pim@ctlog-test:~$ mc admin user add minio-ssd <user> <secret>
|
||||
pim@ctlog-test:~$ mc admin policy create minio-ssd tesseract-test-access /tmp/minio-access.json
|
||||
pim@ctlog-test:~$ mc admin policy attach minio-ssd tesseract-test-access --user <user>
|
||||
pim@ctlog-test:~$ mc anonymous set public minio-ssd/tesseract-test
|
||||
```
|
||||
|
||||
{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
After some fiddling, I understand that the AWS software development kit makes some assumptions that
|
||||
you'll be using .. _quelle surprise_ .. AWS services. But you can also use local S3 services by
|
||||
setting a few key environment variables. I had heard of the S3 access and secret key environment
|
||||
variables before, but I now need to also use a different S3 endpoint. That little detour into the
|
||||
codebase only took me .. several hours.
|
||||
|
||||
Armed with that knowledge, I can build and finally start my TesseraCT instance:
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract/cmd/tesseract/aws$ go build -o ~/aws .
|
||||
pim@ctlog-test:~$ export AWS_DEFAULT_REGION="us-east-1"
|
||||
pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID="<user>"
|
||||
pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY="<secret>"
|
||||
pim@ctlog-test:~$ export AWS_ENDPOINT_URL_S3="http://minio-ssd.lab.ipng.ch:9000/"
|
||||
pim@ctlog-test:~$ ./aws --http_endpoint='[::]:6962' \
|
||||
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
|
||||
--bucket=tesseract-test \
|
||||
--db_host=ctlog-test.lab.ipng.ch \
|
||||
--db_user=tesseract \
|
||||
--db_password=<db_passwd> \
|
||||
--db_name=tesseract \
|
||||
--antispam_db_name=tesseract_antispam \
|
||||
--signer_public_key_file=/tmp/public_key.pem \
|
||||
--signer_private_key_file=/tmp/private_key.pem \
|
||||
--roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
|
||||
|
||||
I0727 15:13:04.666056 337461 main.go:128] **** CT HTTP Server Starting ****
|
||||
```
|
||||
|
||||
Hah! I think most of the command line flags and environment variables should make sense, but I was
|
||||
struggling for a while with the `--roots_pem_file` and the `--origin` flags, so I phoned a friend
|
||||
(Al Cutter, Googler extraordinaire and an expert in Tessera/CT). He explained to me that the Log is
|
||||
actually an open endpoint to which anybody might POST data. However, to avoid folks abusing the log
infrastructure, each POSTed certificate chain is expected to chain up to one of the certificate
authorities listed in the `--roots_pem_file`. OK, that makes sense.
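
Conceptually, that check boils down to loading the PEM file into a certificate pool and verifying
that every submitted chain terminates in one of those roots. The sketch below is not TesseraCT's
actual code, just a minimal illustration with Go's standard `crypto/x509` (a real CT log is also
more lenient, for example around expired certificates, than a plain `Verify` is by default):

```
package main

import (
	"crypto/x509"
	"fmt"
	"log"
	"os"
)

// acceptedRoots loads the CA certificates a log is willing to accept
// submissions from, e.g. the file passed via --roots_pem_file.
func acceptedRoots(path string) (*x509.CertPool, error) {
	pemBytes, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, fmt.Errorf("no usable certificates in %q", path)
	}
	return pool, nil
}

// chainsToAcceptedRoot reports whether leaf verifies against the accepted
// roots, optionally through the supplied intermediates.
func chainsToAcceptedRoot(leaf *x509.Certificate, intermediates, roots *x509.CertPool) bool {
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:         roots,
		Intermediates: intermediates,
		KeyUsages:     []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err == nil
}

func main() {
	roots, err := acceptedRoots("internal/hammer/testdata/test_root_ca_cert.pem")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("accepted roots pool loaded:", roots != nil)
}
```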
|
||||
|
||||
Then, the `--origin` flag designates how my log identifies itself. The resulting `checkpoint` file
enumerates a hash of the latest merged and published Merkle tree. In case a server serves multiple
logs, it uses the `--origin` flag to make the distinction of which checksum belongs to which log.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ curl http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint
|
||||
ctlog-test.lab.ipng.ch/test-ecdsa
|
||||
0
|
||||
JGPitKWWI0aGuCfC2k1n/p9xdWAYPm5RZPNDXkCEVUU=
|
||||
|
||||
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMCONUBAMARjBEAiA/nc9dig6U//vPg7SoTHjt9bxP5K+x3w4MYKpIRn4ULQIgUY5zijRK8qyuJGvZaItDEmP1gohCt+wI+sESBnhkuqo=
|
||||
```
|
||||
|
||||
When creating the bucket above, I used `mc anonymous set public`, which made the S3 bucket
|
||||
world-readable. I can now execute the whole read-path simply by hitting the S3 service. Check.
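
Since the checkpoint is just a small signed note, it's also easy to keep an eye on it
programmatically. Here's a little Go sketch that fetches a checkpoint and pulls the origin, tree
size, and root hash out of its first three lines; the URL is the test bucket from above, and
verifying the note signature at the bottom is left out:

```
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strconv"
	"strings"
)

// checkpointSummary holds the first three lines of a checkpoint:
// origin, tree size, and base64-encoded root hash.
type checkpointSummary struct {
	Origin   string
	TreeSize uint64
	RootHash string
}

func fetchCheckpoint(url string) (*checkpointSummary, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	lines := strings.Split(string(body), "\n")
	if len(lines) < 3 {
		return nil, fmt.Errorf("checkpoint too short: %d lines", len(lines))
	}
	size, err := strconv.ParseUint(lines[1], 10, 64)
	if err != nil {
		return nil, fmt.Errorf("parsing tree size %q: %w", lines[1], err)
	}
	return &checkpointSummary{Origin: lines[0], TreeSize: size, RootHash: lines[2]}, nil
}

func main() {
	cp, err := fetchCheckpoint("http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s has %d entries, root hash %s\n", cp.Origin, cp.TreeSize, cp.RootHash)
}
```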
|
||||
|
||||
#### TesseraCT: Loadtesting S3/MySQL
|
||||
|
||||
{{< image width="12em" float="right" src="/assets/ctlog/stop-hammer-time.jpg" alt="Stop, hammer time" >}}
|
||||
|
||||
The write path is a server on `[::]:6962`. I should be able to write a log to it, but how? Here's
|
||||
where I am grateful to find a tool in the TesseraCT GitHub repository called `hammer`. This hammer
|
||||
sets up read and write traffic to a Static CT API log to test correctness and performance under
|
||||
load. The traffic is sent according to the [[Static CT API](https://c2sp.org/static-ct-api)] spec.
|
||||
Slick!
|
||||
|
||||
The tool starts a text-based UI (my favorite! also when using the Cisco T-Rex loadtester) in the
terminal that shows the current status and logs, and supports increasing/decreasing read and write
traffic. This TUI allows for a level of interactivity when probing a new configuration of a log, in
order to find any cliffs where performance degrades. For real load-testing applications, especially
headless runs as part of a CI pipeline, it is recommended to run the tool with `-show_ui=false` to
disable the UI.
|
||||
|
||||
I'm a bit lost in the somewhat terse
|
||||
[[README.md](https://github.com/transparency-dev/tesseract/tree/main/internal/hammer)], but my buddy
|
||||
Al comes to my rescue and explains the flags to me. First of all, the loadtester wants to hit the
|
||||
same `--origin` that I configured the write-path to accept. In my case this is
|
||||
`ctlog-test.lab.ipng.ch/test-ecdsa`. Then, it needs the public key for that _Log_, which I can find
|
||||
in `/tmp/public_key.pem`. The text there is the _DER_ (Distinguished Encoding Rules), stored as a
|
||||
base64 encoded string. What follows next was the most difficult for me to understand, as I was
|
||||
thinking the hammer would read some log from the internet somewhere and replay it locally. Al
|
||||
explains that actually, the `hammer` tool synthetically creates all of these entries itself, and it
|
||||
regularly reads the `checkpoint` from the `--log_url` place, while it writes its certificates to
|
||||
`--write_log_url`. The last few flags just inform the `hammer` how many read and write ops/sec it
|
||||
should generate, and with that explanation my brain plays _tadaa.wav_ and I am ready to go.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer \
|
||||
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEucHtDWe9GYNicPnuGWbEX8rJg/VnDcXs8z40KdoNidBKy6/ZXw2u+NW1XAUnGpXcZozxufsgOMhijsWb25r7jw== \
|
||||
--log_url=http://tesseract-test.minio-ssd.lab.ipng.ch:9000/ \
|
||||
--write_log_url=http://localhost:6962/ctlog-test.lab.ipng.ch/test-ecdsa/ \
|
||||
--max_read_ops=0 \
|
||||
--num_writers=5000 \
|
||||
--max_write_ops=100
|
||||
```
|
||||
|
||||
{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest1.png" alt="S3/MySQL Loadtest 100qps" >}}
|
||||
|
||||
Cool! It seems that the loadtest is happily chugging along at 100qps. The log is consuming them in
|
||||
the HTTP write-path by accepting POST requests to
|
||||
`/ctlog-test.lab.ipng.ch/test-ecdsa/ct/v1/add-chain`, where hammer is offering them at a rate of
|
||||
100qps, with a configured probability of duplicates set at 10%. What that means is that every now
|
||||
and again, it'll repeat a previous request. The purpose of this is to stress test the so-called
|
||||
`antispam` implementation. When `hammer` sends its requests, it signs them with a certificate that
|
||||
was issued by the CA described in `internal/hammer/testdata/test_root_ca_cert.pem`, which is why
|
||||
TesseraCT accepts them.
|
||||
|
||||
I raise the write load by pressing the '>' key a few times. I notice things are great at 500qps,
which is nice because that's double what we expect to need. But I start seeing a bit more noise at
600qps. When I raise the write-rate to 1000qps, all hell breaks loose in the logs of the server (and
in similar logs of the `hammer` loadtester):
|
||||
|
||||
```
|
||||
W0727 15:54:33.419881 348475 handlers.go:168] ctlog-test.lab.ipng.ch/test-ecdsa: AddChain handler error: couldn't store the leaf: failed to fetch entry bundle at index 0: failed to fetch resource: getObject: failed to create reader for object "tile/data/000" in bucket "tesseract-test": operation error S3: GetObject, context deadline exceeded
|
||||
W0727 15:55:02.727962 348475 aws.go:345] GarbageCollect failed: failed to delete one or more objects: failed to delete objects: operation error S3: DeleteObjects, https response error StatusCode: 400, RequestID: 1856202CA3C4B83F, HostID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8, api error MalformedXML: The XML you provided was not well-formed or did not validate against our published schema.
|
||||
E0727 15:55:10.448973 348475 append_lifecycle.go:293] followerStats: follower "AWS antispam" EntriesProcessed(): failed to read follow coordination info: Error 1040: Too many connections
|
||||
```
|
||||
|
||||
I see on the MinIO instance that it's doing about 150/s of GETs and 15/s of PUTs, which is totally
|
||||
reasonable:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ mc admin trace --stats ssd
|
||||
Duration: 6m9s ▰▱▱
|
||||
RX Rate:↑ 34 MiB/m
|
||||
TX Rate:↓ 2.3 GiB/m
|
||||
RPM : 10588.1
|
||||
-------------
|
||||
Call Count RPM Avg Time Min Time Max Time Avg TTFB Max TTFB Avg Size Rate /min
|
||||
s3.GetObject 60558 (92.9%) 9837.2 4.3ms 708µs 48.1ms 3.9ms 47.8ms ↑144B ↓246K ↑1.4M ↓2.3G
|
||||
s3.PutObject 2199 (3.4%) 357.2 5.3ms 2.4ms 32.7ms 5.3ms 32.7ms ↑92K ↑32M
|
||||
s3.DeleteMultipleObjects 1212 (1.9%) 196.9 877µs 290µs 41.1ms 850µs 41.1ms ↑230B ↓369B ↑44K ↓71K
|
||||
s3.ListObjectsV2 1212 (1.9%) 196.9 18.4ms 999µs 52.8ms 18.3ms 52.7ms ↑131B ↓261B ↑25K ↓50K
|
||||
```
|
||||
|
||||
Another nice way to see what makes it through is this oneliner, which reads the `checkpoint` every
|
||||
second, and once it changes, shows the delta in seconds and how many certs were written:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
|
||||
N=$(curl -sS http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E '^[0-9]+$'); \
|
||||
if [ "$N" -eq "$O" ]; then \
|
||||
echo -n .; \
|
||||
else \
|
||||
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
|
||||
fi; \
|
||||
T=$((T+1)); sleep 1; done
|
||||
1012905 .... 5 seconds 2081 certs
|
||||
1014986 .... 5 seconds 2126 certs
|
||||
1017112 .... 5 seconds 1913 certs
|
||||
1019025 .... 5 seconds 2588 certs
|
||||
1021613 .... 5 seconds 2591 certs
|
||||
1024204 .... 5 seconds 2197 certs
|
||||
```
|
||||
|
||||
So I can see that the checkpoint is refreshed every 5 seconds and that between 1913 and 2591 certs
are written each time. And indeed, at 400/s there are no errors or warnings at all. At this write
rate, TesseraCT is using about 2.9 CPUs/s and MariaDB about 0.3 CPUs/s, but the hammer is using 6.0
CPUs/s. Overall, the machine is perfectly happy serving for a few hours under this load test.
|
||||
|
||||
***Conclusion: a write-rate of 400/s should be safe with S3+MySQL***
|
||||
|
||||
### TesseraCT: POSIX
|
||||
|
||||
I have been playing with the idea of getting a reliable read-path by making the S3 cluster
redundant, or by replicating the S3 bucket. But Al asks: why not use our experimental POSIX backend?
We discuss two very important benefits, but also two drawbacks:
|
||||
|
||||
* On the plus side:
|
||||
1. There is no need for S3 storage, read/writing to a local ZFS raidz2 pool instead.
|
||||
1. There is no need for MySQL, as the POSIX implementation can use a local badger instance
|
||||
also on the local filesystem.
|
||||
* On the drawbacks:
|
||||
1. There is a SPOF in the read-path, as the single VM must handle both. The write-path always
|
||||
has a SPOF on the TesseraCT VM.
|
||||
1. Local storage is more expensive than S3 storage, and can be used only for the purposes of
|
||||
one application (and at best, shared with other VMs on the same hypervisor).
|
||||
|
||||
Come to think of it, this is maybe not such a bad tradeoff. I do kind of like having a single VM
with a single binary and no other moving parts. It greatly simplifies the architecture, and for the
read-path I can (and will) still use multiple upstream NGINX machines in IPng's network.
|
||||
|
||||
I consider myself nerd-sniped, and take a look at the POSIX variant. I have a few SAS3 solid state
drives (NetAPP part number X447_S1633800AMD), which I plug into the `ctlog-test` machine.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo zpool create -o ashift=12 -o autotrim=on ssd-vol0 mirror \
    /dev/disk/by-id/wwn-0x5002538a0???????
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/tesseract-test
|
||||
pim@ctlog-test:~$ sudo chown pim:pim /ssd-vol0/tesseract-test
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./cmd/experimental/posix --http_endpoint='[::]:6962' \
|
||||
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
|
||||
--private_key=/tmp/private_key.pem \
|
||||
--storage_dir=/ssd-vol0/tesseract-test \
|
||||
--roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
|
||||
badger 2025/07/27 16:29:15 INFO: All 0 tables opened in 0s
|
||||
badger 2025/07/27 16:29:15 INFO: Discard stats nextEmptySlot: 0
|
||||
badger 2025/07/27 16:29:15 INFO: Set nextTxnTs to 0
|
||||
I0727 16:29:15.032845 363156 files.go:502] Initializing directory for POSIX log at "/ssd-vol0/tesseract-test" (this should only happen ONCE per log!)
|
||||
I0727 16:29:15.034101 363156 main.go:97] **** CT HTTP Server Starting ****
|
||||
|
||||
pim@ctlog-test:~/src/tesseract$ cat /ssd-vol0/tesseract-test/checkpoint
|
||||
ctlog-test.lab.ipng.ch/test-ecdsa
|
||||
0
|
||||
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
|
||||
|
||||
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMSgC8BAMARzBFAiBjT5zdkniKlryqlUlx/gLHOtVK26zuWwrc4BlyTVzCWgIhAJ0GIrlrP7YGzRaHjzdB5tnS5rpP3LeOsPbpLateaiFc
|
||||
```
|
||||
|
||||
Alright, I can see the log started and created an empty checkpoint file. Nice!
|
||||
|
||||
Before I can loadtest it, I will need to make the read-path reachable. The `hammer` can read a
checkpoint from local `file:///` prefixes, but I'll have to serve the files over the network
eventually anyway, so I create the following NGINX config for it:
|
||||
|
||||
```
|
||||
server {
|
||||
listen 80 default_server backlog=4096;
|
||||
listen [::]:80 default_server backlog=4096;
|
||||
root /ssd-vol0/tesseract-test/;
|
||||
index index.html index.htm index.nginx-debian.html;
|
||||
|
||||
server_name _;
|
||||
|
||||
access_log /var/log/nginx/access.log combined buffer=512k flush=5s;
|
||||
|
||||
location / {
|
||||
try_files $uri $uri/ =404;
|
||||
tcp_nopush on;
|
||||
sendfile on;
|
||||
tcp_nodelay on;
|
||||
keepalive_timeout 65;
|
||||
keepalive_requests 1000;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Just a couple of small thoughts on this configuration. I'm using buffered access logs, to avoid
excessive disk writes in the read-path. Then, I'm using kernel `sendfile()`, which instructs the
kernel to serve the static objects directly so that NGINX can move on. Further, I'll allow for a
long keepalive in HTTP/1.1, so that future requests can reuse the same TCP connection, and I'll set
the flags `tcp_nodelay` and `tcp_nopush` to just blast the data out without waiting.
|
||||
|
||||
Without much ado:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ curl -sS ctlog-test.lab.ipng.ch/checkpoint
|
||||
ctlog-test.lab.ipng.ch/test-ecdsa
|
||||
0
|
||||
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
|
||||
|
||||
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMTfksBAMASDBGAiEAqADLH0P/SRVloF6G1ezlWG3Exf+sTzPIY5u6VjAKLqACIQCkJO2N0dZQuDHvkbnzL8Hd91oyU41bVqfD3vs5EwUouA==
|
||||
```
|
||||
|
||||
#### TesseraCT: Loadtesting POSIX
|
||||
|
||||
The loadtesting is roughly the same. I start the `hammer` with the same 500qps write rate, which
was roughly where the S3+MySQL variant topped out. My checkpoint tracker shows the following:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
|
||||
N=$(curl -sS http://localhost/checkpoint | grep -E '^[0-9]+$'); \
|
||||
if [ "$N" -eq "$O" ]; then \
|
||||
echo -n .; \
|
||||
else \
|
||||
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
|
||||
fi; \
|
||||
T=$((T+1)); sleep 1; done
|
||||
59250 ......... 10 seconds 5244 certs
|
||||
64494 ......... 10 seconds 5000 certs
|
||||
69494 ......... 10 seconds 5000 certs
|
||||
74494 ......... 10 seconds 5000 certs
|
||||
79494 ......... 10 seconds 5256 certs
|
||||
79494 ......... 10 seconds 5256 certs
|
||||
84750 ......... 10 seconds 5244 certs
|
||||
89994 ......... 10 seconds 5256 certs
|
||||
95250 ......... 10 seconds 5000 certs
|
||||
100250 ......... 10 seconds 5000 certs
|
||||
105250 ......... 10 seconds 5000 certs
|
||||
```
|
||||
|
||||
I learn two things. First, the checkpoint interval in this `posix` variant is 10 seconds, compared
|
||||
to the 5 seconds of the `aws` variant I tested before. I dive into the code, because there doesn't
|
||||
seem to be a `--checkpoint_interval` flag. In the `tessera` library, I find
|
||||
`DefaultCheckpointInterval` which is set to 10 seconds. I change it to be 2 seconds instead, and
|
||||
restart the `posix` binary:
|
||||
|
||||
```
|
||||
238250 . 2 seconds 1000 certs
|
||||
239250 . 2 seconds 1000 certs
|
||||
240250 . 2 seconds 1000 certs
|
||||
241250 . 2 seconds 1000 certs
|
||||
242250 . 2 seconds 1000 certs
|
||||
243250 . 2 seconds 1000 certs
|
||||
244250 . 2 seconds 1000 certs
|
||||
```
|
||||
|
||||
{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest2.png" alt="Posix Loadtest 5000qps" >}}
|
||||
|
||||
Very nice! Maybe I can write a few more certs? I restart the `hammer` at 5000/s which, somewhat to
my surprise, the log ends up serving!
|
||||
|
||||
```
|
||||
642608 . 2 seconds 6155 certs
|
||||
648763 . 2 seconds 10256 certs
|
||||
659019 . 2 seconds 9237 certs
|
||||
668256 . 2 seconds 8800 certs
|
||||
677056 . 2 seconds 8729 certs
|
||||
685785 . 2 seconds 8237 certs
|
||||
694022 . 2 seconds 7487 certs
|
||||
701509 . 2 seconds 8572 certs
|
||||
710081 . 2 seconds 7413 certs
|
||||
```
|
||||
|
||||
The throughput is highly variable though, seemingly between 3700/sec and 5100/sec, and I quickly
|
||||
find out that the `hammer` is completely saturating the CPU on the machine, leaving very little room
|
||||
for the `posix` TesseraCT to serve. I'm going to need more machines!
|
||||
|
||||
So I start a `hammer` loadtester on the two now-idle MinIO servers, and run them at about 6000qps
|
||||
**each**, for a total of 12000 certs/sec. And my little `posix` binary is keeping up like a champ:
|
||||
|
||||
```
|
||||
2987169 . 2 seconds 23040 certs
|
||||
3010209 . 2 seconds 23040 certs
|
||||
3033249 . 2 seconds 21760 certs
|
||||
3055009 . 2 seconds 21504 certs
|
||||
3076513 . 2 seconds 23808 certs
|
||||
3100321 . 2 seconds 22528 certs
|
||||
```
|
||||
|
||||
One thing is reasonably clear: the `posix` TesseraCT is CPU bound, not disk bound. The CPU is now
running at about 18.5 CPUs/s (with 20 cores), which is pretty much all this Dell has to offer. The
NetAPP enterprise solid state drives are not impressed:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ zpool iostat -v ssd-vol0 10 100
|
||||
capacity operations bandwidth
|
||||
pool alloc free read write read write
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 11.4G 733G 0 3.13K 0 117M
|
||||
mirror-0 11.4G 733G 0 3.13K 0 117M
|
||||
wwn-0x5002538a05302930 - - 0 1.04K 0 39.1M
|
||||
wwn-0x5002538a053069f0 - - 0 1.06K 0 39.1M
|
||||
wwn-0x5002538a06313ed0 - - 0 1.02K 0 39.1M
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
|
||||
pim@ctlog-test:~/src/tesseract$ zpool iostat -l ssd-vol0 10
|
||||
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim
|
||||
pool alloc free read write read write read write read write read write read write wait wait
|
||||
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 14.0G 730G 0 1.48K 0 35.4M - 2ms - 535us - 1us - 3ms - 50ms
|
||||
ssd-vol0 14.0G 730G 0 1.12K 0 23.0M - 1ms - 733us - 2us - 1ms - 44ms
|
||||
ssd-vol0 14.1G 730G 0 1.42K 0 45.3M - 508us - 122us - 914ns - 2ms - 41ms
|
||||
ssd-vol0 14.2G 730G 0 678 0 21.0M - 863us - 144us - 2us - 2ms - -
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
OK, that kind of seals the deal for me. The write path needs about 250 certs/sec and I'm now
hammering with 12'000 certs/sec, with room to spare. But what about the read path? The cool thing
about the static log is that reads are entirely done by NGINX. The only file that isn't cacheable is
the `checkpoint` file, which gets updated every two seconds (or ten seconds with the default
`tessera` settings).
|
||||
|
||||
So I start yet another `hammer` whose job it is to read back from the static filesystem:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ curl localhost/nginx_status; sleep 60; curl localhost/nginx_status
|
||||
Active connections: 10556
|
||||
server accepts handled requests
|
||||
25302 25302 1492918
|
||||
Reading: 0 Writing: 1 Waiting: 10555
|
||||
Active connections: 7791
|
||||
server accepts handled requests
|
||||
25764 25764 1727631
|
||||
Reading: 0 Writing: 1 Waiting: 7790
|
||||
```
|
||||
|
||||
And I can see that it's keeping up quite nicely. In one minute, it handled (1727631-1492918) or
|
||||
234713 requests, which is a cool 3911 requests/sec. All these read/write hammers are kind of
|
||||
saturating the `ctlog-test` machine though:
|
||||
|
||||
{{< image width="100%" src="/assets/ctlog/ctlog-loadtest3.png" alt="Posix Loadtest 8000qps write, 4000qps read" >}}
|
||||
|
||||
But after a little bit of fiddling, I can assert my conclusion:
|
||||
|
||||
***Conclusion: a write-rate of 8'000/s alongside a read-rate of 4'000/s should be safe with POSIX***
|
||||
|
||||
## What's Next
|
||||
|
||||
I am going to offer such a machine in production together with Antonis Chariton and Jeroen Massar.
|
||||
I plan to do a few additional things:
|
||||
|
||||
* Test Sunlight as well on the same hardware. It would be nice to see a comparison between write
|
||||
rates of the two implementations.
|
||||
* Work with Al Cutter and the Transparency Dev team to close a few small gaps (like the
  `local_signer.go` and some Prometheus monitoring of the `posix` binary).
|
||||
* Install and launch both under `*.ct.ipng.ch`, which in itself deserves its own report, showing
|
||||
how I intend to do log cycling and care/feeding, as well as report on the real production
|
||||
experience running these CT Logs.
|
|
||||
---
|
||||
date: "2025-08-10T12:07:23Z"
|
||||
title: 'Certificate Transparency - Part 2 - Sunlight'
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)]. As the
name suggests, it was a form of _digital notary_, and it was in the business of issuing security
certificates. Unfortunately, in June of 2011 its IT infrastructure was compromised, and it
subsequently issued hundreds of fraudulent SSL certificates, some of which were used for
man-in-the-middle attacks on Iranian Gmail users. Not cool.
|
||||
|
||||
Google launched a project called **Certificate Transparency**, because it was becoming more common
|
||||
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
|
||||
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
|
||||
the Web Public Key Infrastructure. It led to the creation of this ambitious
|
||||
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
|
||||
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
|
||||
and _TLS_ (Transport Layer Security).
|
||||
|
||||
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
|
||||
describes an experimental protocol for publicly logging the existence of Transport Layer Security
|
||||
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
|
||||
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
|
||||
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
|
||||
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
|
||||
the logs.
|
||||
|
||||
In a [[previous article]({{< ref 2025-07-26-ctlog-1 >}})], I took a deep dive into an upcoming
|
||||
open source implementation of Static CT Logs made by Google. There is however a very competent
|
||||
alternative called [[Sunlight](https://sunlight.dev/)], which deserves some attention to get to know
|
||||
its look and feel, as well as its performance characteristics.
|
||||
|
||||
## Sunlight
|
||||
|
||||
I start by reading up on the project website, and learn:
|
||||
|
||||
> _Sunlight is a [[Certificate Transparency](https://certificate.transparency.dev/)] log implementation
|
||||
> and monitoring API designed for scalability, ease of operation, and reduced cost. What started as
|
||||
> the Sunlight API is now the [[Static CT API](https://c2sp.org/static-ct-api)] and is allowed by the
|
||||
> CT log policies of the major browsers._
|
||||
>
|
||||
> _Sunlight was designed by Filippo Valsorda for the needs of the WebPKI community, through the
|
||||
> feedback of many of its members, and in particular of the Sigsum, Google TrustFabric, and ISRG
|
||||
> teams. It is partially based on the Go Checksum Database. Sunlight's development was sponsored by
|
||||
> Let's Encrypt._
|
||||
|
||||
I have a chat with Filippo and think I'm addressing the elephant in the room by asking him which of
the two implementations, TesseraCT or Sunlight, he thinks would be a good fit. One thing he says
really sticks with me: "The community needs _any_ static log operator, so if Google thinks TesseraCT
is ready, by all means use that. The diversity will do us good!"
|
||||
|
||||
Whether one or the other is 'ready' is partly down to the software, but importantly also to the
operator. So I carefully take Sunlight out of its cardboard box, and put it onto the same Dell R630s
that I used in my previous tests: two Xeon E5-2640 v4 CPUs for a total of 20 cores and 40 threads,
and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I place six 1.2TB SAS3
drives (HPE part number EG1200JEHMC), and in the second machine I place six 1.92TB enterprise
storage drives (Samsung part number P1633N19).
|
||||
|
||||
### Sunlight: setup
|
||||
|
||||
I download the source from GitHub, which, one of these days, will have an IPv6 address. Building
the tools is easy enough; there are three main ones:
|
||||
1. ***sunlight***: Which serves the write-path. Certification authorities add their certs here.
|
||||
1. ***sunlight-keygen***: A helper tool to create the so-called `seed` file (key material) for a
|
||||
log.
|
||||
1. ***skylight***: Which serves the read-path. `/checkpoint` and things like `/tile` and `/issuer`
|
||||
are served here in a spec-compliant way.
|
||||
|
||||
The YAML configuration file is straightforward, and can define and handle multiple logs in one
|
||||
instance, which sets it apart from TesseraCT which can only handle one log per instance. There's a
|
||||
`submissionprefix` which `sunlight` will use to accept writes, and a `monitoringprefix` which
|
||||
`skylight` will use for reads.
|
||||
|
||||
I stumble across a small issue - I haven't created multiple DNS hostnames for the test machine. So I
|
||||
decide to use a different port for one versus the other. The write path will use TLS on port 1443
|
||||
while Sunlight will point to a normal HTTP port 1080. And considering I don't have a certificate for
|
||||
`*.lab.ipng.ch`, I will use a self-signed one instead:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ openssl genrsa -out ca.key 2048
|
||||
pim@ctlog-test:/etc/sunlight$ openssl req -new -x509 -days 365 -key ca.key \
|
||||
-subj "/C=CH/ST=ZH/L=Bruttisellen/O=IPng Networks GmbH/CN=IPng Root CA" -out ca.crt
|
||||
pim@ctlog-test:/etc/sunlight$ openssl req -newkey rsa:2048 -nodes -keyout sunlight-key.pem \
|
||||
-subj "/C=CH/ST=ZH/L=Bruttisellen/O=IPng Networks GmbH/CN=*.lab.ipng.ch" -out sunlight.csr
|
||||
pim@ctlog-test:/etc/sunlight# openssl x509 -req -extfile \
|
||||
<(printf "subjectAltName=DNS:ctlog-test.lab.ipng.ch,DNS:ctlog-test.lab.ipng.ch") -days 365 \
|
||||
-in sunlight.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out sunlight.pem
|
||||
ln -s sunlight.pem skylight.pem
|
||||
ln -s sunlight-key.pem skylight-key.pem
|
||||
```
|
||||
|
||||
This little snippet yields `sunlight.pem` (the certificate) and `sunlight-key.pem` (the private
|
||||
key), and symlinks them to `skylight.pem` and `skylight-key.pem` for simplicity. With these in hand,
|
||||
I can start the rest of the show. First I will prepare the NVME storage with a few datasets in
|
||||
which Sunlight will store its data:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/shared
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/logs
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/logs/sunlight-test
|
||||
pim@ctlog-test:~$ sudo chown -R pim:pim /ssd-vol0/sunlight-test
|
||||
```
|
||||
|
||||
Then I'll create the Sunlight configuration:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ sunlight-keygen -f sunlight-test.seed.bin
|
||||
Log ID: IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=
|
||||
ECDSA public key:
|
||||
-----BEGIN PUBLIC KEY-----
|
||||
MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHR
|
||||
wRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ==
|
||||
-----END PUBLIC KEY-----
|
||||
Ed25519 public key:
|
||||
-----BEGIN PUBLIC KEY-----
|
||||
0pHg7KptAxmb4o67m9xNM1Ku3YH4bjjXbyIgXn2R2bk=
|
||||
-----END PUBLIC KEY-----
|
||||
```
|
||||
|
||||
The first block creates key material for the log, and I get a fun surprise: the Log ID starts
|
||||
precisely with the string IPng... what are the odds that that would happen!? I should tell Antonis
|
||||
about this, it's dope!
|
||||
|
||||
As a safety precaution, Sunlight requires the operator to make the `checkpoints.db` by hand, which
|
||||
I'll also do:
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ sqlite3 /ssd-vol0/sunlight-test/shared/checkpoints.db \
|
||||
"CREATE TABLE checkpoints (logID BLOB PRIMARY KEY, body TEXT)"
|
||||
```
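
As an aside, that table is also a convenient place to peek at what Sunlight later stores as the
most recent checkpoint for each log. A small sketch with Go's `database/sql` and the
`mattn/go-sqlite3` driver (any sqlite driver would do; the path is the one created above) could
look like this:

```
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

func main() {
	db, err := sql.Open("sqlite3", "/ssd-vol0/sunlight-test/shared/checkpoints.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The schema is the one created by hand above:
	// CREATE TABLE checkpoints (logID BLOB PRIMARY KEY, body TEXT)
	rows, err := db.Query("SELECT logID, body FROM checkpoints")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var logID []byte
		var body string
		if err := rows.Scan(&logID, &body); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("logID=%x\n%s\n", logID, body)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```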
|
||||
|
||||
And with that, I'm ready to create my first log!
|
||||
|
||||
### Sunlight: Setting up S3
|
||||
|
||||
When learning about [[Tessera]({{< ref 2025-07-26-ctlog-1 >}})], I already kind of drew the
conclusion that, for our case at IPng at least, running the fully cloud-native version with S3
storage and a MySQL database gave both poorer performance and more operational complexity. But I
find it interesting to compare behavior and performance, so I'll start by creating a Sunlight log
backed by MinIO SSD storage.
|
||||
|
||||
I'll first create the bucket and a user account to access it:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID="<some user>"
|
||||
pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY="<some password>"
|
||||
pim@ctlog-test:~$ export S3_BUCKET=sunlight-test
|
||||
|
||||
pim@ctlog-test:~$ mc mb ssd/${S3_BUCKET}
|
||||
pim@ctlog-test:~$ cat << EOF > /tmp/minio-access.json
|
||||
{ "Version": "2012-10-17", "Statement": [ {
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ],
|
||||
"Resource": [ "arn:aws:s3:::${S3_BUCKET}/*", "arn:aws:s3:::${S3_BUCKET}" ]
|
||||
} ]
|
||||
}
|
||||
EOF
|
||||
pim@ctlog-test:~$ mc admin user add ssd ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
|
||||
pim@ctlog-test:~$ mc admin policy create ssd ${S3_BUCKET}-access /tmp/minio-access.json
|
||||
pim@ctlog-test:~$ mc admin policy attach ssd ${S3_BUCKET}-access --user ${AWS_ACCESS_KEY_ID}
|
||||
pim@ctlog-test:~$ mc anonymous set public ssd/${S3_BUCKET}
|
||||
```
|
||||
|
||||
After setting up the S3 environment, all I must do is wire it up to the Sunlight configuration
|
||||
file:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ cat << EOF > sunlight-s3.yaml
|
||||
listen:
|
||||
- "[::]:1443"
|
||||
checkpoints: /ssd-vol0/sunlight-test/shared/checkpoints.db
|
||||
logs:
|
||||
- shortname: sunlight-test
|
||||
inception: 2025-08-10
|
||||
submissionprefix: https://ctlog-test.lab.ipng.ch:1443/
|
||||
monitoringprefix: http://sunlight-test.minio-ssd.lab.ipng.ch:9000/
|
||||
secret: /etc/sunlight/sunlight-test.seed.bin
|
||||
cache: /ssd-vol0/sunlight-test/logs/sunlight-test/cache.db
|
||||
s3region: eu-schweiz-1
|
||||
s3bucket: sunlight-test
|
||||
s3endpoint: http://minio-ssd.lab.ipng.ch:9000/
|
||||
roots: /etc/sunlight/roots.pem
|
||||
period: 200
|
||||
poolsize: 15000
|
||||
notafterstart: 2024-01-01T00:00:00Z
|
||||
notafterlimit: 2025-01-01T00:00:00Z
|
||||
EOF
|
||||
```
|
||||
|
||||
The one thing of note here is the `roots:` file, which contains the Root CA for the TesseraCT
loadtester that I'll be using. In production, Sunlight can grab the approved roots from the
so-called _Common CA Database_ or CCADB. But you can also specify either all roots using the `roots`
field, or additional roots on top of `ccadbroots` using the `extraroots` field. That's a handy
trick! You can find more info on the [[CCADB](https://www.ccadb.org/)] homepage.
|
||||
|
||||
I can then start Sunlight just like this:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ sunlight -testcert -c /etc/sunlight/sunlight-s3.yaml
{"time":"2025-08-10T13:49:36.091384532+02:00","level":"INFO","source":{"function":"main.main.func1","file":"/home/pim/src/sunlight/cmd/sunlight/sunlight.go","line":341},"msg":"debug server listening","addr":{"IP":"127.0.0.1","Port":37477,"Zone":""}}
time=2025-08-10T13:49:36.091+02:00 level=INFO msg="debug server listening" addr=127.0.0.1:37477
{"time":"2025-08-10T13:49:36.100471647+02:00","level":"INFO","source":{"function":"main.main","file":"/home/pim/src/sunlight/cmd/sunlight/sunlight.go","line":542},"msg":"today is the Inception date, creating log","log":"sunlight-test"}
time=2025-08-10T13:49:36.100+02:00 level=INFO msg="today is the Inception date, creating log" log=sunlight-test
{"time":"2025-08-10T13:49:36.119529208+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.CreateLog","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":159},"msg":"created log","log":"sunlight-test","timestamp":1754826576111,"logID":"IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E="}
time=2025-08-10T13:49:36.119+02:00 level=INFO msg="created log" log=sunlight-test timestamp=1754826576111 logID="IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E="
{"time":"2025-08-10T13:49:36.127702166+02:00","level":"WARN","source":{"function":"filippo.io/sunlight/internal/ctlog.LoadLog","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":296},"msg":"failed to parse previously trusted roots","log":"sunlight-test","roots":""}
time=2025-08-10T13:49:36.127+02:00 level=WARN msg="failed to parse previously trusted roots" log=sunlight-test roots=""
{"time":"2025-08-10T13:49:36.127766452+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.LoadLog","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":301},"msg":"loaded log","log":"sunlight-test","logID":"IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=","size":0,"timestamp":1754826576111}
time=2025-08-10T13:49:36.127+02:00 level=INFO msg="loaded log" log=sunlight-test logID="IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=" size=0 timestamp=1754826576111
{"time":"2025-08-10T13:49:36.540297532+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.(*Log).sequencePool","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":972},"msg":"sequenced pool","log":"sunlight-test","old_tree_size":0,"entries":0,"start":"2025-08-10T13:49:36.534500633+02:00","tree_size":0,"tiles":0,"timestamp":1754826576534,"elapsed":5788099}
time=2025-08-10T13:49:36.540+02:00 level=INFO msg="sequenced pool" log=sunlight-test old_tree_size=0 entries=0 start=2025-08-10T13:49:36.534+02:00 tree_size=0 tiles=0 timestamp=1754826576534 elapsed=5.788099ms
...
```
|
||||
|
||||
Although that looks pretty good, I see that something is not quite right. When Sunlight comes up,
it shares a few links with me, in the `get-roots` and `json` fields on the homepage, but neither of
them works:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ curl -k https://ctlog-test.lab.ipng.ch:1443/ct/v1/get-roots
|
||||
404 page not found
|
||||
pim@ctlog-test:~$ curl -k https://ctlog-test.lab.ipng.ch:1443/log.v3.json
|
||||
404 page not found
|
||||
```
|
||||
|
||||
I'm starting to think that using a non-standard listen port won't work, or more precisely, that
adding a port to the `monitoringprefix` won't work. I notice that the log name comes out as
`ctlog-test.lab.ipng.ch:1443`, which I don't think is supposed to have a port number in it. So
instead, I make Sunlight `listen` on port 443 and omit the port in the `submissionprefix`, and give
it and its companion Skylight the needed privileges to bind the privileged port like so:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/sunlight
|
||||
pim@ctlog-test:~$ sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/skylight
|
||||
pim@ctlog-test:~$ sunlight -testcert -c /etc/sunlight/sunlight-s3.yaml
|
||||
```
|
||||
|
||||
{{< image width="60%" src="/assets/ctlog/sunlight-test-s3.png" alt="Sunlight testlog / S3" >}}
|
||||
|
||||
And with that, Sunlight reports for duty and the links work. Hoi!
|
||||
|
||||
#### Sunlight: Loadtesting S3
|
||||
|
||||
I have some good experience loadtesting from the [[TesseraCT article]({{< ref 2025-07-26-ctlog-1
|
||||
>}})]. One important difference is that Sunlight wants to use SSL for the submission and monitoring
|
||||
paths, and I've created a snakeoil self-signed cert. CT Hammer does not accept that out of the box,
|
||||
so I need to make a tiny change to the Hammer:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ git diff
|
||||
diff --git a/internal/hammer/hammer.go b/internal/hammer/hammer.go
|
||||
index 3828fbd..1dfd895 100644
|
||||
--- a/internal/hammer/hammer.go
|
||||
+++ b/internal/hammer/hammer.go
|
||||
@@ -104,6 +104,9 @@ func main() {
|
||||
MaxIdleConns: *numWriters + *numReadersFull + *numReadersRandom,
|
||||
MaxIdleConnsPerHost: *numWriters + *numReadersFull + *numReadersRandom,
|
||||
DisableKeepAlives: false,
|
||||
+ TLSClientConfig: &tls.Config{
|
||||
+ InsecureSkipVerify: true,
|
||||
+ },
|
||||
},
|
||||
Timeout: *httpTimeout,
|
||||
}
|
||||
```
|
||||
|
||||
With that small bit of insecurity out of the way, Sunlight makes it otherwise pretty easy for me to
|
||||
construct the CT Hammer commandline:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=http://sunlight-test.minio-ssd.lab.ipng.ch:9000/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=100
|
||||
|
||||
pim@ctlog-test:/etc/sunlight$ T=0; O=0; while :; do \
|
||||
N=$(curl -sS http://sunlight-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E '^[0-9]+$'); \
|
||||
if [ "$N" -eq "$O" ]; then \
|
||||
echo -n .; \
|
||||
else \
|
||||
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
|
||||
fi; \
|
||||
T=$((T+1)); sleep 1; done
|
||||
24915 1 seconds 96 certs
|
||||
25011 1 seconds 92 certs
|
||||
25103 1 seconds 93 certs
|
||||
25196 1 seconds 87 certs
|
||||
```
|
||||
|
||||
On the first commandline I start the loadtest at 100 writes/sec with the standard duplication
probability of 10%, which allows me to test Sunlight's ability to avoid writing duplicates. This
means I should see the tree grow at about 90/s on average. Check. I raise the write-load to 500/s:
|
||||
|
||||
```
|
||||
39421 1 seconds 443 certs
|
||||
39864 1 seconds 442 certs
|
||||
40306 1 seconds 441 certs
|
||||
40747 1 seconds 447 certs
|
||||
41194 1 seconds 448 certs
|
||||
```
|
||||
|
||||
.. and to 1'000/s:
|
||||
```
|
||||
57941 1 seconds 945 certs
|
||||
58886 1 seconds 970 certs
|
||||
59856 1 seconds 948 certs
|
||||
60804 1 seconds 965 certs
|
||||
61769 1 seconds 955 certs
|
||||
```
|
||||
|
||||
After a few minutes I see a few errors from CT Hammer:
|
||||
```
|
||||
W0810 14:55:29.660710 1398779 analysis.go:134] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 14:55:30.496603 1398779 analysis.go:124] (1 x) failed to create request: write leaf was not OK. Status code: 500. Body: "failed to read body: read tcp 127.0.1.1:443->127.0.0.1:44908: i/o timeout\n"
|
||||
```
|
||||
|
||||
I raise the Hammer load to 5'000/sec (which means 4'500/s unique certs and 500 duplicates), and
find that the committed writes max out at around 4'200/s:
|
||||
```
|
||||
879637 1 seconds 4213 certs
|
||||
883850 1 seconds 4207 certs
|
||||
888057 1 seconds 4211 certs
|
||||
892268 1 seconds 4249 certs
|
||||
896517 1 seconds 4216 certs
|
||||
```
|
||||
|
||||
The error rate is a steady stream of errors like the one before:
|
||||
```
|
||||
W0810 14:59:48.499274 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 14:59:49.034194 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 15:00:05.496459 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 15:00:07.187181 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
```
|
||||
|
||||
At this load of 4'200/s, MinIO is not very impressed. Remember, in the [[other article]({{< ref
2025-07-26-ctlog-1 >}})] I loadtested it to about 7'500 ops/sec, and the statistics below show only
about 50 ops/sec (2'800/min). I conclude that MinIO is, in fact, bored of this whole activity:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ mc admin trace --stats ssd
|
||||
Duration: 18m58s ▱▱▱
|
||||
RX Rate:↑ 115 MiB/m
|
||||
TX Rate:↓ 2.4 MiB/m
|
||||
RPM : 2821.3
|
||||
-------------
|
||||
Call Count RPM Avg Time Min Time Max Time Avg TTFB Max TTFB Avg Size Rate /min Errors
|
||||
s3.PutObject 37602 (70.3%) 1982.2 6.2ms 785µs 86.7ms 6.1ms 86.6ms ↑59K ↓0B ↑115M ↓1.4K 0
|
||||
s3.GetObject 15918 (29.7%) 839.1 996µs 670µs 51.3ms 912µs 51.2ms ↑46B ↓3.0K ↑38K ↓2.4M 0
|
||||
```
|
||||
|
||||
Sunlight still keeps its certificate cache on local disk. At a rate of 4'200/s, the ZFS pool sees a
write rate of about 105MB/s per mirror member, with about 877 ZFS writes per second on each drive.
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ zpool iostat -v ssd-vol0 10
|
||||
capacity operations bandwidth
|
||||
pool alloc free read write read write
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 59.1G 685G 0 2.55K 0 312M
|
||||
mirror-0 59.1G 685G 0 2.55K 0 312M
|
||||
wwn-0x5002538a05302930 - - 0 877 0 104M
|
||||
wwn-0x5002538a053069f0 - - 0 871 0 104M
|
||||
wwn-0x5002538a06313ed0 - - 0 866 0 104M
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
|
||||
pim@ctlog-test:/etc/sunlight$ zpool iostat -l ssd-vol0 10
|
||||
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim
|
||||
pool alloc free read write read write read write read write read write read write wait wait
|
||||
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 59.0G 685G 0 3.19K 0 388M - 8ms - 628us - 990us - 10ms - 88ms
|
||||
ssd-vol0 59.2G 685G 0 2.49K 0 296M - 5ms - 557us - 163us - 8ms - -
|
||||
ssd-vol0 59.6G 684G 0 2.04K 0 253M - 2ms - 704us - 296us - 4ms - -
|
||||
ssd-vol0 58.8G 685G 0 2.72K 0 328M - 6ms - 783us - 701us - 9ms - 68ms
|
||||
|
||||
```
|
||||
|
||||
A few interesting observations:
|
||||
* Sunlight still uses a local sqlite3 database for certificate tracking, which is more efficient
  than MariaDB/MySQL, let alone AWS RDS, so it has one fewer runtime dependency.
* The write rate to ZFS is significantly higher with Sunlight than with TesseraCT (about 8:1). This
  is likely because the sqlite3 database lives on ZFS here, while TesseraCT uses MariaDB running on
  a different filesystem.
* The MinIO usage is a lot lighter. When I reduce the load to 1'000/s, as was the case in the
  TesseraCT test, the Get:Put ratio was roughly 93:4 in TesseraCT, while it's 70:30 here. TesseraCT
  was also consuming more IOPS, running at about 10.5k requests/minute, while Sunlight is
  significantly calmer at 2.8k requests/minute (almost 4x less!).
* The burst capacity of Sunlight is a fair bit higher than TesseraCT's, likely due to its more
  efficient use of the S3 backend.
|
||||
|
||||
***Conclusion***: Sunlight with S3+MinIO can handle 1'000/s reliably, and can spike to 4'200/s with
only a few errors.
|
||||
|
||||
#### Sunlight: Loadtesting POSIX
|
||||
|
||||
When I took a closer look at TesseraCT a few weeks ago, it struck me that while a cloud-native
setup with S3 storage allows for a cool way to scale storage and make the read-path redundant (by
creating synchronously replicated buckets), it does come with significant operational overhead and
complexity. My main concern is the number of different moving parts, and Sunlight really has one
very appealing property: it can run entirely on one machine without any other moving parts - even
the SQL database is linked in. That's pretty slick.
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ cat << EOF > sunlight.yaml
|
||||
listen:
|
||||
- "[::]:443"
|
||||
checkpoints: /ssd-vol0/sunlight-test/shared/checkpoints.db
|
||||
logs:
|
||||
- shortname: sunlight-test
|
||||
inception: 2025-08-10
|
||||
submissionprefix: https://ctlog-test.lab.ipng.ch/
|
||||
monitoringprefix: https://ctlog-test.lab.ipng.ch:1443/
|
||||
secret: /etc/sunlight/sunlight-test.seed.bin
|
||||
cache: /ssd-vol0/sunlight-test/logs/sunlight-test/cache.db
|
||||
localdirectory: /ssd-vol0/sunlight-test/logs/sunlight-test/data
|
||||
roots: /etc/sunlight/roots.pem
|
||||
period: 200
|
||||
poolsize: 15000
|
||||
notafterstart: 2024-01-01T00:00:00Z
|
||||
notafterlimit: 2025-01-01T00:00:00Z
|
||||
EOF
|
||||
pim@ctlog-test:/etc/sunlight$ sunlight -testcert -c sunlight.yaml
|
||||
pim@ctlog-test:/etc/sunlight$ skylight -testcert -c skylight.yaml
|
||||
```
|
||||
|
||||
First I'll start a hello-world loadtest at 100/s and take a look at the number of leaves in the
checkpoint after a few minutes. I would expect three minutes' worth at 100/s with a duplicate
probability of 10% to yield about 16'200 unique certificates in total (180s × 100/s × 0.9 = 16'200).
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
10086
|
||||
15518
|
||||
20920
|
||||
26339
|
||||
```
|
||||
|
||||
And would you look at that? `(26339-10086)` is right on the dot! One thing that I find particularly
cool about Sunlight is its baked-in Prometheus metrics. This gives me some pretty solid insight into
its performance. Take a look, for example, at the write path latency tail (99th percentile):
|
||||
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 0.207285993
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.001409719
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.002227985
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000224969
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} 8.3003e-05
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.042118751
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 0.2259605
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 0.108987393
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.014922489
|
||||
```
|
||||
|
||||
I'm seeing here that at a load of 100/s (with 90/s of unique certificates), the 99th percentile
|
||||
add-chain latency is 207ms, which makes sense because the `period` configuration field is set to
|
||||
200ms. The filesystem operations (discard, fetch, upload) are _de minimis_ and the sequencing
|
||||
duration is at 109ms. Excellent!
|
||||
|
||||
But can this thing go really fast? I do remember that the CT Hammer uses more CPU than TesseraCT,
and I've also seen above, when running my 5'000/s loadtest, that this is about all the hammer can
manage on a single Dell R630. So, as I did with the TesseraCT test, I'll use the MinIO SSD and MinIO
Disk machines to generate the load.
|
||||
|
||||
I boot them, so that I can hammer, or shall I say jackhammer away:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000
|
||||
|
||||
pim@minio-ssd:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000 --serial_offset=1000000
|
||||
|
||||
pim@minio-disk:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000 --serial_offset=2000000
|
||||
```
|
||||
|
||||
This will generate 15'000/s of load, which I note does bring Sunlight to its knees, although it does
|
||||
remain stable (yaay!) with a somewhat more bursty checkpoint interval:
|
||||
|
||||
```
|
||||
5504780 1 seconds 4039 certs
|
||||
5508819 1 seconds 10000 certs
|
||||
5518819 . 2 seconds 7976 certs
|
||||
5526795 1 seconds 2022 certs
|
||||
5528817 1 seconds 9782 certs
|
||||
5538599 1 seconds 217 certs
|
||||
5538816 1 seconds 3114 certs
|
||||
5541930 1 seconds 6818 certs
|
||||
```
|
||||
|
||||
So what I do instead is a somewhat simpler measurement of certificates per minute:
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
6008831
|
||||
6296255
|
||||
6576712
|
||||
```
|
||||
|
||||
This rate boils down to `(6576712-6008831)/120` or 4'700/s of written certs, which at a duplication
ratio of 10% means approximately 5'200/s of total accepted certs. At this rate, Sunlight is
consuming about 10.3 CPUs/s, while Skylight is at 0.1 CPUs/s and the CT Hammer is at 11.1 CPUs/s.
Given the 40 threads on this machine, I am not saturating the CPU, but I'm curious as this rate is
significantly lower than TesseraCT's. I briefly turn off the hammer on `ctlog-test` to allow
Sunlight to monopolize the entire machine. The CPU use does reduce to about 9.3 CPUs/s, suggesting
that indeed, the bottleneck is not strictly CPU:
|
||||
|
||||
{{< image width="90%" src="/assets/ctlog/btop-sunlight.png" alt="Sunlight btop" >}}
|
||||
|
||||
When using only two CT Hammers (on `minio-ssd.lab.ipng.ch` and `minio-disk.lab.ipng.ch`), the CPU
|
||||
use on the `ctlog-test.lab.ipng.ch` machine definitely goes down (CT Hammer is kind of a CPU hog....),
|
||||
but the resulting throughput doesn't change that much:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
7985648
|
||||
8302421
|
||||
8528122
|
||||
8772758
|
||||
```
|
||||
|
||||
What I find particularly interesting is that the total rate stays at approximately 4'400/s
(`(8772758-7985648)/180`), while the checkpoint latency varies considerably. As I noted earlier,
Sunlight comes with baked-in Prometheus metrics, which I can take a look at while keeping it under
this offered load of ~10'000/sec:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 1.889983538
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.000148819
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.837981208
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000433179
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} NaN
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.067494558
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 1.86894666
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 1.111400223
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.016859223
|
||||
```
|
||||
|
||||
Comparing the throughput at 4'400/s with that first test of 100/s, I expect and can confirm a
|
||||
significant increase in all of these metrics. The 99th percentile addchain is now 1889ms (up from
|
||||
207ms) and the sequencing duration is now 1111ms (up from 109ms).
|
||||
|
||||
#### Sunlight: Effect of period
|
||||
|
||||
I fiddle a little bit with Sunlight's configuration file, notably the `period` and `poolsize`.
|
||||
First I set `period:2000` and `poolsize:15000`, which yields pretty much the same throughput:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
701850
|
||||
1001424
|
||||
1295508
|
||||
1575789
|
||||
```
|
||||
|
||||
With a generated load of 10'000/sec with a 10% duplication rate, I am offering roughly 9'000/sec of
|
||||
unique certificates, and I'm seeing `(1575789 - 701850)/180` or about 4'855/sec come through. Just
|
||||
for reference, at this rate and with `period:2000`, the latency tail looks like this:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 3.203510079
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.000108613
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.950453973
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.00046192
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} NaN
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.049007693
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 3.570709413
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 1.5968609040000001
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.010847308
|
||||
```
|
||||
|
||||
Then I set `period:100` and `poolsize:15000`, which does improve things a bit:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
560654
|
||||
950524
|
||||
1324645
|
||||
1720362
|
||||
```
|
||||
|
||||
With the same generated load of 10'000/sec with a 10% duplication rate, I am still offering roughly
|
||||
9'000/sec of unique certificates, and I'm seeing `(1720362 - 560654)/180` or about 6'440/sec come
|
||||
through, which is a fair bit better, at the expense of more disk activity. At this rate and with
|
||||
`period:100`, the latency tail looks like this:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 1.616046445
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 7.5123e-05
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.534935803
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000377273
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} 4.8893e-05
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.054685991
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 1.946445877
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 0.980602185
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.018385831
|
||||
```
|
||||
|
||||
***Conclusion***: Sunlight on POSIX can reliably handle 4'400/s (with a duplicate rate of 10%) on
|
||||
this setup.
|
||||
|
||||
## Wrapup - Observations
|
||||
|
||||
From an operators point of view, TesseraCT and Sunlight handle quite differently. Both are easily up
|
||||
to the task of serving the current write-load (which is about 250/s).
|
||||
|
||||
* ***S3***: When using the S3 backend, TesseraCT became quite unhappy above 800/s while Sunlight
|
||||
went all the way up to 4'200/s and sent significantly less requests to MinIO (about 4x less),
|
||||
while showing good telemetry on the use of S3 backends. In this mode, TesseraCT uses MySQL (in
|
||||
my case, MariaDB) which was not on the ZFS pool, but on the boot-disk.
|
||||
|
||||
* ***POSIX***: When using normal filesystem, Sunlight seems to peak at 4'800/s while TesseraCT
|
||||
went all the way to 12'000/s. When doing so, Disk IO was quite similar between the two
|
||||
solutions, taking into account that TesseraCT runs BadgerDB, while Sunlight uses sqlite3,
|
||||
both are using their respective ZFS pool.
|
||||
|
||||
***Notable***: Sunlight POSIX and S3 performance is roughly identical (both handle about
|
||||
5'000/sec), while TesseraCT POSIX performance (12'000/s) is significantly better than its S3
|
||||
(800/s). Some other observations:
|
||||
|
||||
* Sunlight has a very opinionated configuration, and can run multiple logs with one configuration
|
||||
file and one binary. Its configuration was a bit constraining though, as I could not manage to
|
||||
use `monitoringprefix` or `submissionprefix` with `http://` prefix - a likely security
|
||||
precaution - but also using ports in those prefixes (other than the standard 443) rendered
|
||||
Sunlight and Skylight unusable for me.
|
||||
|
||||
* Skylight only serves from local directory, it does not have support for S3. For operators using S3,
|
||||
an alternative could be to use NGINX in the serving path, similar to TesseraCT. Skylight does have
|
||||
a few things to teach me though, notably on proper compression, content type and other headers.
|
||||
|
||||
* TesseraCT does not have a configuration file, and will run exactly one log per binary
|
||||
instance. It uses flags to construct the environment, and is much more forgiving for creative
|
||||
`origin` (log name), and submission- and monitoring URLs. It's happy to use regular 'http://'
|
||||
for both, which comes in handy in those architectures where the system is serving behind a
|
||||
reversed proxy.
|
||||
|
||||
* The TesseraCT Hammer tool then again does not like using self-signed certificates, and needs
|
||||
to be told to skip certificate validation in the case of Sunlight loadtests while it is
|
||||
running with the `-testcert` commandline.
|
||||
|
||||
I consider all of these small and mostly cosmetic issues, because in production there will be proper
|
||||
TLS certificates issued and normal https:// serving ports with unique monitoring and submission
|
||||
hostnames.
|
||||
|
||||
## What's Next
|
||||
|
||||
Together with Antonis Chariton and Jeroen Massar, IPng Networks will be offering both TesseraCT and
|
||||
Sunlight logs on the public internet. One final step is to productionize both logs, and file the
|
||||
paperwork for them in the community. Although at this point our Sunlight log is already running,
|
||||
I'll wait a few weeks to gather any additional intel, before wrapping up in a final article.
|
||||
|
515
content/articles/2025-08-24-ctlog-3.md
Normal file
515
content/articles/2025-08-24-ctlog-3.md
Normal file
@@ -0,0 +1,515 @@
|
||||
---
|
||||
date: "2025-08-24T12:07:23Z"
|
||||
title: 'Certificate Transparency - Part 3 - Operations'
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)], as the
|
||||
name suggests it was a form of _digital notary_, and they were in the business of issuing security
|
||||
certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and
|
||||
subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for
|
||||
man-in-the-middle attacks on Iranian Gmail users. Not cool.
|
||||
|
||||
Google launched a project called **Certificate Transparency**, because it was becoming more common
|
||||
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
|
||||
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
|
||||
the Web Public Key Infrastructure. It led to the creation of this ambitious
|
||||
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
|
||||
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
|
||||
and _TLS_ (Transport Layer Security).
|
||||
|
||||
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
|
||||
describes an experimental protocol for publicly logging the existence of Transport Layer Security
|
||||
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
|
||||
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
|
||||
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
|
||||
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
|
||||
the logs.
|
||||
|
||||
In the first two articles of this series, I explored [[Sunlight]({{< ref 2025-07-26-ctlog-1 >}})]
|
||||
and [[TesseraCT]({{< ref 2025-08-10-ctlog-2 >}})], two open source implementations of the Static CT
|
||||
protocol. In this final article, I'll share the details on how I created the environment and
|
||||
production instances for four logs that IPng will be providing: Rennet and Lipase are two
|
||||
ingredients to make cheese and will serve as our staging/testing logs. Gouda and Halloumi are two
|
||||
delicious cheeses that pay homage to our heritage, Jeroen and I being Dutch and Antonis being
|
||||
Greek.
|
||||
|
||||
## Hardware
|
||||
|
||||
At IPng Networks, all hypervisors are from the same brand: Dell's Poweredge line. In this project,
|
||||
Jeroen is also contributing a server, and it so happens that he also has a Dell Poweredge. We're
|
||||
both running Debian on our hypervisor, so we install a fresh VM with Debian 13.0, codenamed
|
||||
_Trixie_, and give the machine 16GB of memory, 8 vCPU and a 16GB boot disk. Boot disks are placed on
|
||||
the hypervisor's ZFS pool, and a blockdevice snapshot is taken every 6hrs. This allows the boot disk
|
||||
to be rolled back to a last known good point in case an upgrade goes south. If you haven't seen it
|
||||
yet, take a look at [[zrepl](https://zrepl.github.io/)], a one-stop, integrated solution for ZFS
|
||||
replication. This tool is incredibly powerful, and can do snapshot management, sourcing / sinking
|
||||
to remote hosts, of course using incremental snapshots as they are native to ZFS.
|
||||
|
||||
Once the machine is up, we pass four enterprise-class storage drives, in our case 3.84TB Kioxia
|
||||
NVMe, model _KXD51RUE3T84_ which are PCIe 3.1 x4 lanes, and NVMe 1.2.1 specification with a good
|
||||
durability and reasonable (albeit not stellar) read throughput of ~2700MB/s, write throughput of
|
||||
~800MB/s with 240 kIOPS random read and 21 kIOPS random write. My attention is also drawn to a
|
||||
specific specification point: these drives allow for 1.0 DWPD, which stands for _Drive Writes Per
|
||||
Day_, in other words they are not going to run themselves off a cliff after a few petabytes of
|
||||
writes, and I am reminded that a CT Log wants to write to disk a lot during normal operation.
|
||||
|
||||
The point of these logs is to **keep them safe**, and the most important aspects of the compute
|
||||
environment are the use of ECC memory to detect single bit errors, and dependable storage. Toshiba
|
||||
makes a great product.
|
||||
|
||||
```
|
||||
ctlog1:~$ sudo zpool create -f -o ashift=12 -o autotrim=on -O atime=off -O xattr=sa \
|
||||
ssd-vol0 raidz2 /dev/disk/by-id/nvme-KXD51RUE3T84_TOSHIBA_*M
|
||||
ctlog1:~$ sudo zfs create -o encryption=on -o keyformat=passphrase ssd-vol0/enc
|
||||
ctlog1:~$ sudo zfs create ssd-vol0/logs
|
||||
ctlog1:~$ for log in lipase; do \
|
||||
for shard in 2025h2 2026h1 2026h2 2027h1 2027h2; do \
|
||||
sudo zfs create ssd-vol0/logs/${log}${shard} \
|
||||
done \
|
||||
done
|
||||
```
|
||||
|
||||
The hypervisor will use PCI passthrough for the NVMe drives, and we'll handle ZFS directly on the
|
||||
VM. The first command creates a ZFS raidz2 pool using 4kB blocks, turns of _atime_ (which avoids one
|
||||
metadata write for each read!), and turns on SSD trimming in ZFS, a very useful feature.
|
||||
|
||||
Then I'll create an encrypted volume for the configuration and key material. This way, if the
|
||||
machine is ever physically transported, the keys will be safe in transit. Finally, I'll create the
|
||||
temporal log shards starting at 2025h2, all the way through to 2027h2 for our testing log called
|
||||
_Lipase_ and our production log called _Halloumi_ on Jeroen's machine. On my own machine, it'll be
|
||||
_Rennet_ for the testing log and _Gouda_ for the production log.
|
||||
|
||||
## Sunlight
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/sunlight-logo.png" alt="Sunlight logo" >}}
|
||||
|
||||
I set up Sunlight first. as its authors have extensive operational notes both in terms of the
|
||||
[[config](https://config.sunlight.geomys.org/)] of Geomys' _Tuscolo_ log, as well as on the
|
||||
[[Sunlight](https://sunlight.dev)] homepage. I really appreciate that Filippo added some
|
||||
[[Gists](https://gist.github.com/FiloSottile/989338e6ba8e03f2c699590ce83f537b)] and
|
||||
[[Doc](https://docs.google.com/document/d/1ID8dX5VuvvrgJrM0Re-jt6Wjhx1eZp-trbpSIYtOhRE/edit?tab=t.0#heading=h.y3yghdo4mdij)]
|
||||
with pretty much all I need to know to run one too. Our Rennet and Gouda logs use very similar
|
||||
approach for their configuration, with one notable exception: the VMs do not have a public IP
|
||||
address, and are tucked away in a private network called IPng Site Local. I'll get back to that
|
||||
later.
|
||||
|
||||
```
|
||||
ctlog@ctlog0:/ssd-vol0/enc/sunlight$ cat << EOF | tee sunlight-staging.yaml
|
||||
listen:
|
||||
- "[::]:16420"
|
||||
checkpoints: /ssd-vol0/shared/checkpoints.db
|
||||
logs:
|
||||
- shortname: rennet2025h2
|
||||
inception: 2025-07-28
|
||||
period: 200
|
||||
poolsize: 750
|
||||
submissionprefix: https://rennet2025h2.log.ct.ipng.ch
|
||||
monitoringprefix: https://rennet2025h2.mon.ct.ipng.ch
|
||||
ccadbroots: testing
|
||||
extraroots: /ssd-vol0/enc/sunlight/extra-roots-staging.pem
|
||||
secret: /ssd-vol0/enc/sunlight/keys/rennet2025h2.seed.bin
|
||||
cache: /ssd-vol0/logs/rennet2025h2/cache.db
|
||||
localdirectory: /ssd-vol0/logs/rennet2025h2/data
|
||||
notafterstart: 2025-07-01T00:00:00Z
|
||||
notafterlimit: 2026-01-01T00:00:00Z
|
||||
...
|
||||
EOF
|
||||
ctlog@ctlog0:/ssd-vol0/enc/sunlight$ cat << EOF | tee skylight-staging.yaml
|
||||
listen:
|
||||
- "[::]:16421"
|
||||
homeredirect: https://ipng.ch/s/ct/
|
||||
logs:
|
||||
- shortname: rennet2025h2
|
||||
monitoringprefix: https://rennet2025h2.mon.ct.ipng.ch
|
||||
localdirectory: /ssd-vol0/logs/rennet2025h2/data
|
||||
staging: true
|
||||
...
|
||||
```
|
||||
|
||||
In the first configuration file, I'll tell _Sunlight_ (the write path component) to listen on port
|
||||
`:16420` and I'll tell _Skylight_ (the read path component) to listen on port `:16421`. I've disabled
|
||||
the automatic certificate renewals, and will handle SSL upstream. A few notes on this:
|
||||
|
||||
1. Most importantly, I will be using a common frontend pool with a wildcard certificate for
|
||||
`*.ct.ipng.ch`. I wrote about [[DNS-01]({{< ref 2023-03-24-lego-dns01 >}})] before, it's a very
|
||||
convenient way for IPng to do certificate pool management. I will be sharing certificate for all log
|
||||
types under this certificate.
|
||||
1. ACME/HTTP-01 could be made to work with a bit of effort; plumbing through the `/.well-known/`
|
||||
URIs on the frontend and pointing them to these instances. But then the cert would have to be copied
|
||||
from Sunlight back to the frontends.
|
||||
|
||||
I've noticed that when the log doesn't exist yet, I can start Sunlight and it'll create the bits and
|
||||
pieces on the local filesystem and start writing checkpoints. But if the log already exists, I am
|
||||
required to have the _monitoringprefix_ active, otherwise Sunlight won't start up. It's a small
|
||||
thing, as I will have the read path operational in a few simple steps. Anyway, all five logshards
|
||||
for Rennet, and a few days later, for Gouda, are operational this way.
|
||||
|
||||
Skylight provides all the things I need to serve the data back, which is a huge help. The [[Static
|
||||
Log Spec](https://github.com/C2SP/C2SP/blob/main/static-ct-api.md)] is very clear on things like
|
||||
compression, content-type, cache-control and other headers. Skylight makes this a breeze, as it reads
|
||||
a configuration file very similar to the Sunlight write-path one, and takes care of it all for me.
|
||||
|
||||
## TesseraCT
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="TesseraCT logo" >}}
|
||||
|
||||
Good news came to our community on August 14th, when Google's TrustFabric team announced their Alpha
|
||||
milestone of [[TesseraCT](https://blog.transparency.dev/introducing-tesseract)]. This release
|
||||
also moved the POSIX variant from experimental alongside the already further along GCP and AWS
|
||||
personalities. After playing around with it with Al and the team, I think I've learned enough to get
|
||||
us going in a public `tesseract-posix` instance.
|
||||
|
||||
One thing I liked about Sunlight is its compact YAML file that described the pertinent bits of the
|
||||
system, and that I can serve any number of logs with the same process. On the other hand, TesseraCT
|
||||
can serve only one log per process. Both have pro's and con's, notably if any poisonous submission
|
||||
would be offered, Sunlight might take down all logs, while TesseraCT would only take down the log
|
||||
receiving the offensive submission. On the other hand, maintaining separate processes is cumbersome,
|
||||
and all log instances need to be meticulously configured.
|
||||
|
||||
|
||||
### TesseraCT genconf
|
||||
|
||||
I decide to automate this by vibing a little tool called `tesseract-genconf`, which I've published on
|
||||
[[Gitea](https://git.ipng.ch/certificate-transparency/cheese)]. What it does is take a YAML file
|
||||
describing the logs, and outputs the bits and pieces needed to operate multiple separate processes
|
||||
that together form the sharded static log. I've attempted to stay mostly compatible with the
|
||||
Sunlight YAML configuration, and came up with a variant like this one:
|
||||
|
||||
```
|
||||
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat << EOF | tee tesseract-staging.yaml
|
||||
listen:
|
||||
- "[::]:8080"
|
||||
roots: /ssd-vol0/enc/tesseract/roots.pem
|
||||
logs:
|
||||
- shortname: lipase2025h2
|
||||
listen: "[::]:16900"
|
||||
submissionprefix: https://lipase2025h2.log.ct.ipng.ch
|
||||
monitoringprefix: https://lipase2025h2.mon.ct.ipng.ch
|
||||
extraroots: /ssd-vol0/enc/tesseract/extra-roots-staging.pem
|
||||
secret: /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
|
||||
localdirectory: /ssd-vol0/logs/lipase2025h2/data
|
||||
notafterstart: 2025-07-01T00:00:00Z
|
||||
notafterlimit: 2026-01-01T00:00:00Z
|
||||
...
|
||||
EOF
|
||||
```
|
||||
|
||||
With this snippet, I have all the information I need. Here's the steps I take to construct the log
|
||||
itself:
|
||||
|
||||
***1. Generate keys***
|
||||
|
||||
The keys are `prime256v1` and the format that TesseraCT accepts did change since I wrote up my first
|
||||
[[deep dive]({{< ref 2025-07-26-ctlog-1 >}})] a few weeks ago. Now, the tool accepts a `PEM` format
|
||||
private key, from which the _Log ID_ and _Public Key_ can be derived. So off I go:
|
||||
|
||||
```
|
||||
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-key
|
||||
Creating /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
|
||||
Creating /ssd-vol0/enc/tesseract/keys/lipase2026h1.pem
|
||||
Creating /ssd-vol0/enc/tesseract/keys/lipase2026h2.pem
|
||||
Creating /ssd-vol0/enc/tesseract/keys/lipase2027h1.pem
|
||||
Creating /ssd-vol0/enc/tesseract/keys/lipase2027h2.pem
|
||||
```
|
||||
|
||||
Of course, if a file already exists at that location, it'll just print a warning like:
|
||||
```
|
||||
Key already exists: /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem (skipped)
|
||||
```
|
||||
|
||||
***2. Generate JSON/HTML***
|
||||
|
||||
I will be operating the read-path with NGINX. Log operators have started speaking about their log
|
||||
metadata in terms of a small JSON file called `log.v3.json`, and Skylight does a good job of
|
||||
exposing that one, alongside all the other pertinent metadata. So I'll generate these files for each
|
||||
of the logs:
|
||||
|
||||
```
|
||||
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-html
|
||||
Creating /ssd-vol0/logs/lipase2025h2/data/index.html
|
||||
Creating /ssd-vol0/logs/lipase2025h2/data/log.v3.json
|
||||
Creating /ssd-vol0/logs/lipase2026h1/data/index.html
|
||||
Creating /ssd-vol0/logs/lipase2026h1/data/log.v3.json
|
||||
Creating /ssd-vol0/logs/lipase2026h2/data/index.html
|
||||
Creating /ssd-vol0/logs/lipase2026h2/data/log.v3.json
|
||||
Creating /ssd-vol0/logs/lipase2027h1/data/index.html
|
||||
Creating /ssd-vol0/logs/lipase2027h1/data/log.v3.json
|
||||
Creating /ssd-vol0/logs/lipase2027h2/data/index.html
|
||||
Creating /ssd-vol0/logs/lipase2027h2/data/log.v3.json
|
||||
```
|
||||
|
||||
{{< image width="60%" src="/assets/ctlog/lipase.png" alt="TesseraCT Lipase Log" >}}
|
||||
|
||||
It's nice to see a familiar look-and-feel for these logs appear in those `index.html` (which all
|
||||
cross-link to each other within the logs specificied in `tesseract-staging.yaml`, which is dope.
|
||||
|
||||
***3. Generate Roots***
|
||||
|
||||
Antonis had seen this before (thanks for the explanation!) but TesseraCT does not natively implement
|
||||
fetching of the [[CCADB](https://www.ccadb.org/)] roots. But, he points out, you can just get them
|
||||
from any other running log instance, so I'll implement a `gen-roots` command:
|
||||
|
||||
```
|
||||
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf gen-roots \
|
||||
--source https://tuscolo2027h1.sunlight.geomys.org --output production-roots.pem
|
||||
Fetching roots from: https://tuscolo2027h1.sunlight.geomys.org/ct/v1/get-roots
|
||||
2025/08/25 08:24:58 Warning: Failed to parse certificate,carefully skipping: x509: negative serial number
|
||||
Creating production-roots.pem
|
||||
Successfully wrote 248 certificates to tusc.pem (out of 249 total)
|
||||
|
||||
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf gen-roots \
|
||||
--source https://navigli2027h1.sunlight.geomys.org --output testing-roots.pem
|
||||
Fetching roots from: https://navigli2027h1.sunlight.geomys.org/ct/v1/get-roots
|
||||
Creating testing-roots.pem
|
||||
Successfully wrote 82 certificates to tusc.pem (out of 82 total)
|
||||
```
|
||||
|
||||
I can do this regularly, say daily, in a cronjob and if the files were to change, restart the
|
||||
TesseraCT processes. It's not ideal (because the restart might be briefly disruptive), but it's a
|
||||
reasonable option for the time being.
|
||||
|
||||
***4. Generate TesseraCT cmdline***
|
||||
|
||||
I will be running TesseraCT as a _templated unit_ in systemd. These are system unit files that have
|
||||
an argument, they will have an @ in their name, like so:
|
||||
|
||||
```
|
||||
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat << EOF | sudo tee /lib/systemd/system/tesseract@.service
|
||||
[Unit]
|
||||
Description=Tesseract CT Log service for %i
|
||||
ConditionFileExists=/ssd-vol0/logs/%i/data/.env
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
# The %i here refers to the instance name, e.g., "lipase2025h2"
|
||||
# This path should point to where your instance-specific .env files are located
|
||||
EnvironmentFile=/ssd-vol0/logs/%i/data/.env
|
||||
ExecStart=/home/ctlog/bin/tesseract-posix $TESSERACT_ARGS
|
||||
User=ctlog
|
||||
Group=ctlog
|
||||
Restart=on-failure
|
||||
RestartSec=5
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
EOF
|
||||
```
|
||||
|
||||
I can now implement a `gen-env` command for my tool:
|
||||
|
||||
```
|
||||
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-env
|
||||
Creating /ssd-vol0/logs/lipase2025h2/data/roots.pem
|
||||
Creating /ssd-vol0/logs/lipase2025h2/data/.env
|
||||
Creating /ssd-vol0/logs/lipase2026h1/data/roots.pem
|
||||
Creating /ssd-vol0/logs/lipase2026h1/data/.env
|
||||
Creating /ssd-vol0/logs/lipase2026h2/data/roots.pem
|
||||
Creating /ssd-vol0/logs/lipase2026h2/data/.env
|
||||
Creating /ssd-vol0/logs/lipase2027h1/data/roots.pem
|
||||
Creating /ssd-vol0/logs/lipase2027h1/data/.env
|
||||
Creating /ssd-vol0/logs/lipase2027h2/data/roots.pem
|
||||
Creating /ssd-vol0/logs/lipase2027h2/data/.env
|
||||
```
|
||||
|
||||
Looking at one of those .env files, I can show the exact commandline I'll be feeding to the
|
||||
`tesseract-posix` binary:
|
||||
|
||||
```
|
||||
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat /ssd-vol0/logs/lipase2025h2/data/.env
|
||||
TESSERACT_ARGS="--private_key=/ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
|
||||
--origin=lipase2025h2.log.ct.ipng.ch --storage_dir=/ssd-vol0/logs/lipase2025h2/data
|
||||
--roots_pem_file=/ssd-vol0/logs/lipase2025h2/data/roots.pem --http_endpoint=[::]:16900
|
||||
--not_after_start=2025-07-01T00:00:00Z --not_after_limit=2026-01-01T00:00:00Z"
|
||||
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
|
||||
```
|
||||
|
||||
{{< image width="7em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
A quick operational note on OpenTelemetry (also often referred to as Otel): Al and the TrustFabric
|
||||
team added open telemetry to the TesseraCT personalities, as it was mostly already implemented in
|
||||
the underlying Tessera library. By default, it'll try to send its telemetry to localhost using
|
||||
`https`, which makes sense in those cases where the collector is on a different machine. In my case,
|
||||
I'll keep `otelcol` (the collector) on the same machine. Its job is to consume the Otel telemetry
|
||||
stream, and turn those back into Prometheus `/metrics` endpoint on port `:9464`.
|
||||
|
||||
The `gen-env` command also assembles the per-instance `roots.pem` file. For staging logs, it'll take
|
||||
the file pointed to by the `roots:` key, and append any per-log `extraroots:` files. For me, these
|
||||
extraroots are empty and the main roots file points at either the testing roots that came from
|
||||
_Rennet_ (our Sunlight staging log), or the production roots that came from _Gouda_. A job well done!
|
||||
|
||||
***5. Generate NGINX***
|
||||
|
||||
When I first ran my tests, I noticed that the log check tool called `ct-fsck` threw errors on my
|
||||
read path. Filippo explained that the HTTP headers matter in the Static CT specification. Tiles,
|
||||
Issuers, and Checkpoint must all have specific caching and content type headers set. This is what
|
||||
makes Skylight such a gem - I get to read it (and the spec!) to see what I'm supposed to be serving.
|
||||
|
||||
And thus, `gen-nginx` command is born, and listens on port `:8080` for requests:
|
||||
|
||||
```
|
||||
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-nginx
|
||||
Creating nginx config: /ssd-vol0/logs/lipase2025h2/data/lipase2025h2.mon.ct.ipng.ch.conf
|
||||
Creating nginx config: /ssd-vol0/logs/lipase2026h1/data/lipase2026h1.mon.ct.ipng.ch.conf
|
||||
Creating nginx config: /ssd-vol0/logs/lipase2026h2/data/lipase2026h2.mon.ct.ipng.ch.conf
|
||||
Creating nginx config: /ssd-vol0/logs/lipase2027h1/data/lipase2027h1.mon.ct.ipng.ch.conf
|
||||
Creating nginx config: /ssd-vol0/logs/lipase2027h2/data/lipase2027h2.mon.ct.ipng.ch.conf
|
||||
```
|
||||
|
||||
All that's left for me to do is symlink these from `/etc/nginx/sites-enabled/` and the read-path is
|
||||
off to the races. With these commands in the `tesseract-genconf` tool, I am hoping that future
|
||||
travelers have an easy time setting up their static log. Please let me know if you'd like to use, or
|
||||
contribute, to the tool. You can find me in the Transparency Dev Slack, in #ct and also #cheese.
|
||||
|
||||
|
||||
## IPng Frontends
|
||||
|
||||
{{< image width="18em" float="right" src="/assets/ctlog/MPLS Backbone - CTLog.svg" alt="ctlog at ipng" >}}
|
||||
|
||||
IPng Networks has a private internal network called [[IPng Site Local]({{< ref 2023-03-11-mpls-core
|
||||
>}})], which is not routed on the internet. Our [[Frontends]({{< ref 2023-03-17-ipng-frontends >}})]
|
||||
are the only things that have public IPv4 and IPv6 addresses. It allows for things like anycasted
|
||||
webservers and loadbalancing with
|
||||
[[Maglev](https://research.google/pubs/maglev-a-fast-and-reliable-software-network-load-balancer/)].
|
||||
|
||||
The IPng Site Local network kind of looks like the picture to the right. The hypervisors running the
|
||||
Sunlight and TesseraCT logs are at NTT Zurich1 in Rümlang, Switzerland. The IPng frontends are
|
||||
in green, and the sweet thing is, some of them run in IPng's own ISP network (AS8298), while others
|
||||
run in partner networks (like IP-Max AS25091, and Coloclue AS8283). This means that I will benefit
|
||||
from some pretty solid connectivity redundancy.
|
||||
|
||||
The frontends are provisioned with Ansible. There are two aspects to them - firstly, a _certbot_
|
||||
instance maintains the Let's Encrypt wildcard certificates for `*.ct.ipng.ch`. There's a machine
|
||||
tucked away somewhere called `lego.net.ipng.ch` -- again, not exposed on the internet -- and its job
|
||||
is to renew certificates and copy them to the machines that need them. Next, a cluster of NGINX
|
||||
servers uses these certificates to expose IPng and customer services to the Internet.
|
||||
|
||||
I can tie it all together with a snippet like so, for which I apologize in advance - it's quite a
|
||||
wall of text:
|
||||
|
||||
```
|
||||
map $http_user_agent $no_cache_ctlog_lipase {
|
||||
"~*TesseraCT fsck" 1;
|
||||
default 0;
|
||||
}
|
||||
|
||||
server {
|
||||
listen [::]:443 ssl http2;
|
||||
listen 0.0.0.0:443 ssl http2;
|
||||
ssl_certificate /etc/certs/ct.ipng.ch/fullchain.pem;
|
||||
ssl_certificate_key /etc/certs/ct.ipng.ch/privkey.pem;
|
||||
include /etc/nginx/conf.d/options-ssl-nginx.inc;
|
||||
ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
|
||||
|
||||
server_name lipase2025h2.log.ct.ipng.ch;
|
||||
access_log /nginx/logs/lipase2025h2.log.ct.ipng.ch-access.log upstream buffer=512k flush=5s;
|
||||
include /etc/nginx/conf.d/ipng-headers.inc;
|
||||
|
||||
location = / {
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Host lipase2025h2.mon.ct.ipng.ch;
|
||||
proxy_set_header Upgrade $http_upgrade;
|
||||
proxy_set_header Connection "upgrade";
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
proxy_pass http://ctlog1.net.ipng.ch:8080/index.html;
|
||||
}
|
||||
|
||||
location = /metrics {
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Host $host;
|
||||
proxy_set_header Upgrade $http_upgrade;
|
||||
proxy_set_header Connection "upgrade";
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
proxy_pass http://ctlog1.net.ipng.ch:9464;
|
||||
}
|
||||
|
||||
location / {
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Host $host;
|
||||
proxy_set_header Upgrade $http_upgrade;
|
||||
proxy_set_header Connection "upgrade";
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
proxy_pass http://ctlog1.net.ipng.ch:16900;
|
||||
}
|
||||
}
|
||||
|
||||
server {
|
||||
listen [::]:443 ssl http2;
|
||||
listen 0.0.0.0:443 ssl http2;
|
||||
ssl_certificate /etc/certs/ct.ipng.ch/fullchain.pem;
|
||||
ssl_certificate_key /etc/certs/ct.ipng.ch/privkey.pem;
|
||||
include /etc/nginx/conf.d/options-ssl-nginx.inc;
|
||||
ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
|
||||
|
||||
server_name lipase2025h2.mon.ct.ipng.ch;
|
||||
access_log /nginx/logs/lipase2025h2.mon.ct.ipng.ch-access.log upstream buffer=512k flush=5s;
|
||||
include /etc/nginx/conf.d/ipng-headers.inc;
|
||||
|
||||
location = /checkpoint {
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Host $host;
|
||||
proxy_set_header Upgrade $http_upgrade;
|
||||
proxy_set_header Connection "upgrade";
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
|
||||
proxy_pass http://ctlog1.net.ipng.ch:8080;
|
||||
}
|
||||
|
||||
location / {
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Host $host;
|
||||
proxy_set_header Upgrade $http_upgrade;
|
||||
proxy_set_header Connection "upgrade";
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
|
||||
include /etc/nginx/conf.d/ipng-upstream-headers.inc;
|
||||
proxy_cache ipng_cache;
|
||||
proxy_cache_key "$scheme://$host$request_uri";
|
||||
proxy_cache_valid 200 24h;
|
||||
proxy_cache_revalidate off;
|
||||
proxy_cache_bypass $no_cache_ctlog_lipase;
|
||||
proxy_no_cache $no_cache_ctlog_lipase;
|
||||
|
||||
proxy_pass http://ctlog1.net.ipng.ch:8080;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Taking _Lipase_ shard 2025h2 as an example, The submission path (on `*.log.ct.ipng.ch`) will show
|
||||
the same `index.html` as the monitoring path (on `*.mon.ct.ipng.ch`), to provide some consistency
|
||||
with Sunlight logs. Otherwise, the `/metrics` endpoint is forwarded to the `otelcol` running on port
|
||||
`:9464`, and the rest (the `/ct/v1/` and so on) are sent to the first port `:16900` of the
|
||||
TesseraCT.
|
||||
|
||||
Then the read-path makes a special-case of the `/checkpoint` endpoint, which it does not cache. That
|
||||
request (as all others) are forwarded to port `:8080` which is where NGINX is running. Other
|
||||
requests (notably `/tile` and `/issuer`) are cacheable, so I'll cache these on the upstream NGINX
|
||||
servers, both for resilience as well as for performance. Having four of these NGINX upstream will
|
||||
allow the Static CT logs (regardless of being Sunlight or TesseraCT) to serve very high read-rates.
|
||||
|
||||
## What's Next
|
||||
|
||||
I need to spend a little bit of time thinking about rate limits, specifically write-ratelimits. I
|
||||
think I'll use a request limiter in upstream NGINX, to allow for each IP or /24 or /48 subnet to
|
||||
only send a fixed number of requests/sec. I'll probably keep that part private though, as it's a
|
||||
good rule of thumb to never offer information to attackers.
|
||||
|
||||
Together with Antonis Chariton and Jeroen Massar, IPng Networks will be offering both TesseraCT and
|
||||
Sunlight logs on the public internet. One final step is to productionize both logs, and file the
|
||||
paperwork for them in the community. At this point our Sunlight log has been running for a month or
|
||||
so, and we've filed the paperwork for it to be included at Apple and Google.
|
||||
|
||||
I'm going to have folks poke at _Lipase_ as well, after which I'll try to run a few `ct-fsck` to
|
||||
make sure the logs are sane, before offering them into the inclusion program as well. Wish us luck!
|
73
content/ctlog.md
Normal file
73
content/ctlog.md
Normal file
@@ -0,0 +1,73 @@
|
||||
---
|
||||
title: 'Certificate Transparency'
|
||||
date: 2025-07-30
|
||||
url: /s/ct
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
Certificate Transparency logs are "append-only" and publicly-auditable ledgers of certificates being
|
||||
created, updated, and expired. This is the homepage for IPng Networks' Certificate Transparency
|
||||
project.
|
||||
|
||||
Certificate Transparency [[CT](https://certificate.transparency.dev)] is a system for logging and
|
||||
monitoring certificate issuance. It greatly enhances everyone’s ability to monitor and study
|
||||
certificate issuance, and these capabilities have led to numerous improvements to the CA ecosystem
|
||||
and Web security. As a result, it is rapidly becoming critical Internet infrastructure. Originally
|
||||
developed by Google, the concept is now being adopted by many _Certification Authories_ who log
|
||||
their certificates, and professional _Monitoring_ companies who observe the certificates and
|
||||
report anomalies.
|
||||
|
||||
IPng Networks runs our logs under the domain `ct.ipng.ch`, split into a `*.log.ct.ipng.ch` for the
|
||||
write-path, and `*.mon.ct.ipng.ch` for the read-path.
|
||||
|
||||
We are submitting our log for inclusion in the approved log lists for Google Chrome and Apple
|
||||
Safari. Following 90 days of successful monitoring, we anticipate our log will be added to these
|
||||
trusted lists and that change will propagate to people’s browsers with subsequent browser version
|
||||
releases.
|
||||
|
||||
We operate two popular implementations of Static Certificate Transparency software.
|
||||
|
||||
## Sunlight
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/sunlight-logo.png" alt="sunlight logo" >}}
|
||||
|
||||
[[Sunlight](https://sunlight.dev)] was designed by Filippo Valsorda for the needs of the WebPKI
|
||||
community, through the feedback of many of its members, and in particular of the Sigsum, Google
|
||||
TrustFabric, and ISRG teams. It is partially based on the Go Checksum Database. Sunlight's
|
||||
development was sponsored by Let's Encrypt.
|
||||
|
||||
Our Sunlight logs:
|
||||
* A staging log called [[Rennet](https://rennet2025h2.log.ct.ipng.ch/)], incepted 2025-07-28,
|
||||
starting from temporal shard `rennet2025h2`.
|
||||
* A production log called [[Gouda](https://gouda2025h2.log.ct.ipng.ch/)], incepted 2025-07-30,
|
||||
starting from temporal shard `gouda2025h2`.
|
||||
|
||||
## TesseraCT
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="tesseract logo" >}}
|
||||
|
||||
[[TesseraCT](https://github.com/transparency-dev/tesseract)] is a Certificate Transparency (CT) log
|
||||
implementation by the TrustFabric team at Google. It was built to allow log operators to run
|
||||
production static-ct-api CT logs starting with temporal shards covering 2026 onwards, as the
|
||||
successor to Trillian's CTFE.
|
||||
|
||||
Our TesseraCT logs:
|
||||
* A staging log called [[Lipase](https://lipase2025h2.log.ct.ipng.ch/)], incepted 2025-08-22,
|
||||
starting from temporal shared `lipase2025h2`.
|
||||
* A production log called [[Halloumi](https://halloumi2025h2.log.ct.ipng.ch/)], incepted 2025-08-24,
|
||||
starting from temporal shared `halloumi2025h2`.
|
||||
* Log `halloumi2026h2` incorporated incorrect data into its Merkle Tree at entry 4357956 and
|
||||
4552365, due to a [[TesseraCT bug](https://github.com/transparency-dev/tesseract/issues/553)]
|
||||
and was retired on 2025-09-08, to be replaced by temporal shard `halloumi2026h2a`.
|
||||
|
||||
## Operational Details
|
||||
|
||||
You can read more details about our infrastructure on:
|
||||
* **[[TesseraCT]({{< ref 2025-07-26-ctlog-1 >}})]** - published on 2025-07-26.
|
||||
* **[[Sunlight]({{< ref 2025-08-10-ctlog-2 >}})]** - published on 2025-08-10.
|
||||
* **[[Operations]({{< ref 2025-08-24-ctlog-3 >}})]** - published on 2025-08-24.
|
||||
|
||||
The operators of this infrastructure are **Antonis Chariton**, **Jeroen Massar** and **Pim van Pelt**. \
|
||||
You can reach us via e-mail at [[<ct-ops@ipng.ch>](mailto:ct-ops@ipng.ch)].
|
||||
|
@@ -34,3 +34,5 @@ taxonomies:
|
||||
|
||||
permalinks:
|
||||
articles: "/s/articles/:year/:month/:day/:slug"
|
||||
|
||||
ignoreLogs: [ "warning-goldmark-raw-html" ]
|
||||
|
1
static/assets/containerlab/containerlab.svg
Normal file
1
static/assets/containerlab/containerlab.svg
Normal file
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 21 KiB |
BIN
static/assets/containerlab/learn-vpp.png
(Stored with Git LFS)
Normal file
BIN
static/assets/containerlab/learn-vpp.png
(Stored with Git LFS)
Normal file
Binary file not shown.
1270
static/assets/containerlab/vpp-containerlab.cast
Normal file
1270
static/assets/containerlab/vpp-containerlab.cast
Normal file
File diff suppressed because it is too large
Load Diff
1
static/assets/ctlog/MPLS Backbone - CTLog.svg
Normal file
1
static/assets/ctlog/MPLS Backbone - CTLog.svg
Normal file
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 147 KiB |
BIN
static/assets/ctlog/btop-sunlight.png
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/btop-sunlight.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-loadtest1.png
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/ctlog-loadtest1.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-loadtest2.png
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/ctlog-loadtest2.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-loadtest3.png
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/ctlog-loadtest3.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-logo-ipng.png
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/ctlog-logo-ipng.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/ctlog/lipase.png
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/lipase.png
(Stored with Git LFS)
Normal file
Binary file not shown.
164
static/assets/ctlog/minio-results.txt
Normal file
164
static/assets/ctlog/minio-results.txt
Normal file
@@ -0,0 +1,164 @@
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4M
|
||||
Loop 1: PUT time 60.0 secs, objects = 813, speed = 54.2MB/sec, 13.5 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 23168, speed = 1.5GB/sec, 386.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.2 secs, 371.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
2025/07/20 16:07:25 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FACEBAC4D052, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 1221, speed = 20.3MB/sec, 20.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 31000, speed = 516.7MB/sec, 516.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 3.2 secs, 376.5 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
2025/07/20 16:09:29 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FAEB70060604, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 3353, speed = 447KB/sec, 55.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 45913, speed = 6MB/sec, 765.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 9.3 secs, 361.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4k
|
||||
2025/07/20 16:11:38 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB098B162788, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 3404, speed = 226.9KB/sec, 56.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 45230, speed = 2.9MB/sec, 753.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 9.4 secs, 362.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
|
||||
2025/07/20 16:13:47 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB27AE890E75, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 1898, speed = 126.4MB/sec, 31.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 185034, speed = 12GB/sec, 3083.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.4 secs, 4267.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
2025/07/20 16:15:48 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB43C0386015, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.2 secs, objects = 2627, speed = 43.7MB/sec, 43.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 327959, speed = 5.3GB/sec, 5465.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.6 secs, 4045.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
|
||||
2025/07/20 16:17:49 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB5FE2012590, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 6663, speed = 887.7KB/sec, 111.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 459962, speed = 59.9MB/sec, 7666.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 3890.9 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
|
||||
2025/07/20 16:19:50 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB7C3CF0FFCA, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 6673, speed = 444.4KB/sec, 111.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 444637, speed = 28.9MB/sec, 7410.5 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.5 secs, 4411.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
|
||||
2025/07/20 16:21:52 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB988DB60881, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.2 secs, objects = 3093, speed = 205.5MB/sec, 51.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 168750, speed = 11GB/sec, 2811.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.3 secs, 9112.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=1M
|
||||
2025/07/20 16:23:53 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FBB4A1E534DE, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.2 secs, objects = 4652, speed = 77.2MB/sec, 77.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 351187, speed = 5.7GB/sec, 5852.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.6 secs, 8141.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=8k
|
||||
2025/07/20 16:25:54 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FBD0C4764C64, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 14497, speed = 1.9MB/sec, 241.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 457437, speed = 59.6MB/sec, 7623.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 8353.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
|
||||
2025/07/20 16:27:55 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FBED210B0792, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 14459, speed = 962.6KB/sec, 240.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 466680, speed = 30.4MB/sec, 7777.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 8605.3 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4M
|
||||
Loop 1: PUT time 60.0 secs, objects = 1866, speed = 124.4MB/sec, 31.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 16400, speed = 1.1GB/sec, 273.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 5.1 secs, 369.3 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
2025/07/20 16:32:02 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC25AE815718, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 5459, speed = 91MB/sec, 91.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 25090, speed = 418.2MB/sec, 418.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 14.8 secs, 369.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
2025/07/20 16:34:17 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC4514A78873, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 22278, speed = 2.9MB/sec, 371.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 40626, speed = 5.3MB/sec, 677.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 61.6 secs, 361.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4k
|
||||
2025/07/20 16:37:19 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC6F629ACFAC, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 23394, speed = 1.5MB/sec, 389.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 39249, speed = 2.6MB/sec, 654.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 64.5 secs, 363.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
|
||||
2025/07/20 16:40:23 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC9A5D101971, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 10564, speed = 704.1MB/sec, 176.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 20682, speed = 1.3GB/sec, 344.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.5 secs, 4178.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
2025/07/20 16:42:26 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FCB6EB0A45D9, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 26550, speed = 442.4MB/sec, 442.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 124810, speed = 2GB/sec, 2080.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 6.6 secs, 4049.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
|
||||
2025/07/20 16:44:32 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FCD4684A110E, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 129363, speed = 16.8MB/sec, 2155.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 423956, speed = 55.2MB/sec, 7065.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 32.4 secs, 3992.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
|
||||
2025/07/20 16:47:05 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FCF7EA4857CF, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 123067, speed = 8MB/sec, 2051.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 357694, speed = 23.3MB/sec, 5961.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 30.9 secs, 3986.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
|
||||
2025/07/20 16:49:36 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD1B12EFDEBC, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 13131, speed = 873.3MB/sec, 218.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.1 secs, objects = 18630, speed = 1.2GB/sec, 310.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 7787.5 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=1M
|
||||
2025/07/20 16:51:38 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD3779E97644, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 40226, speed = 669.8MB/sec, 669.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 85692, speed = 1.4GB/sec, 1427.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 4.7 secs, 8610.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=8k
|
||||
2025/07/20 16:53:42 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD5489FB2F1F, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 230985, speed = 30.1MB/sec, 3849.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 435703, speed = 56.7MB/sec, 7261.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 25.8 secs, 8945.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
|
||||
2025/07/20 16:56:08 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD7683B9BB96, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 228647, speed = 14.9MB/sec, 3810.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 452412, speed = 29.5MB/sec, 7539.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 27.2 secs, 8418.0 deletes/sec. Slowdowns = 0
|
BIN
static/assets/ctlog/minio_8kb_performance.png
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/minio_8kb_performance.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/ctlog/nsa_slide.jpg
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/nsa_slide.jpg
(Stored with Git LFS)
Normal file
Binary file not shown.
80
static/assets/ctlog/seaweedfs-results.txt
Normal file
80
static/assets/ctlog/seaweedfs-results.txt
Normal file
@@ -0,0 +1,80 @@
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
Loop 1: PUT time 60.0 secs, objects = 1994, speed = 33.2MB/sec, 33.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 29243, speed = 487.4MB/sec, 487.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.8 secs, 701.4 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
Loop 1: PUT time 60.0 secs, objects = 13634, speed = 1.8MB/sec, 227.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 32284, speed = 4.2MB/sec, 538.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 18.7 secs, 727.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
Loop 1: PUT time 62.0 secs, objects = 23733, speed = 382.8MB/sec, 382.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 132708, speed = 2.2GB/sec, 2211.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 3.7 secs, 6490.1 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
|
||||
Loop 1: PUT time 60.0 secs, objects = 199925, speed = 26MB/sec, 3331.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 309937, speed = 40.4MB/sec, 5165.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 31.2 secs, 6406.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
Loop 1: PUT time 60.0 secs, objects = 1975, speed = 32.9MB/sec, 32.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 29898, speed = 498.3MB/sec, 498.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.7 secs, 726.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
Loop 1: PUT time 60.0 secs, objects = 13662, speed = 1.8MB/sec, 227.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 31865, speed = 4.1MB/sec, 531.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 18.8 secs, 726.9 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
Loop 1: PUT time 60.0 secs, objects = 26622, speed = 443.6MB/sec, 443.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 117688, speed = 1.9GB/sec, 1961.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 4.1 secs, 6499.5 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
|
||||
Loop 1: PUT time 60.0 secs, objects = 198238, speed = 25.8MB/sec, 3303.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 312868, speed = 40.7MB/sec, 5214.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 30.8 secs, 6432.7 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
|
||||
Loop 1: PUT time 60.1 secs, objects = 6220, speed = 414.2MB/sec, 103.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 38773, speed = 2.5GB/sec, 646.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.9 secs, 6693.3 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
|
||||
Loop 1: PUT time 60.0 secs, objects = 203033, speed = 13.2MB/sec, 3383.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 300824, speed = 19.6MB/sec, 5013.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 31.1 secs, 6528.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
|
||||
Loop 1: PUT time 60.3 secs, objects = 13181, speed = 874.2MB/sec, 218.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.1 secs, objects = 18575, speed = 1.2GB/sec, 309.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.8 secs, 17547.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
|
||||
Loop 1: PUT time 60.0 secs, objects = 495006, speed = 32.2MB/sec, 8249.5 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 465947, speed = 30.3MB/sec, 7765.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 41.4 secs, 11961.3 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
|
||||
Loop 1: PUT time 60.1 secs, objects = 7073, speed = 471MB/sec, 117.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 31248, speed = 2GB/sec, 520.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.1 secs, 6576.1 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
|
||||
Loop 1: PUT time 60.0 secs, objects = 214387, speed = 14MB/sec, 3573.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 297586, speed = 19.4MB/sec, 4959.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 32.9 secs, 6519.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
|
||||
Loop 1: PUT time 60.1 secs, objects = 14365, speed = 956MB/sec, 239.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.1 secs, objects = 18113, speed = 1.2GB/sec, 301.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.8 secs, 18655.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
|
||||
Loop 1: PUT time 60.0 secs, objects = 489736, speed = 31.9MB/sec, 8161.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 460296, speed = 30MB/sec, 7671.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 41.0 secs, 11957.6 deletes/sec. Slowdowns = 0
|
116
static/assets/ctlog/seaweedfs.docker-compose.yml
Normal file
116
static/assets/ctlog/seaweedfs.docker-compose.yml
Normal file
@@ -0,0 +1,116 @@
|
||||
# Test Setup for SeaweedFS with 6 disks, a Filer an an S3 API
|
||||
#
|
||||
# Use with the following .env file
|
||||
# root@minio-ssd:~# cat /opt/seaweedfs/.env
|
||||
# AWS_ACCESS_KEY_ID="hottentotten"
|
||||
# AWS_SECRET_ACCESS_KEY="tentententoonstelling"
|
||||
|
||||
services:
|
||||
# Master
|
||||
master0:
|
||||
image: chrislusf/seaweedfs
|
||||
ports:
|
||||
- 9333:9333
|
||||
- 19333:19333
|
||||
command: "-v=1 master -volumeSizeLimitMB 100 -resumeState=false -ip=master0 -ip.bind=0.0.0.0 -port=9333 -mdir=/var/lib/seaweedfs/master"
|
||||
volumes:
|
||||
- ./data/master0:/var/lib/seaweedfs/master
|
||||
restart: unless-stopped
|
||||
|
||||
# Volume Server 1
|
||||
volume1:
|
||||
image: chrislusf/seaweedfs
|
||||
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8081 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume1'
|
||||
volumes:
|
||||
- /data/disk1:/var/lib/seaweedfs/volume1
|
||||
depends_on:
|
||||
- master0
|
||||
restart: unless-stopped
|
||||
|
||||
# Volume Server 2
|
||||
volume2:
|
||||
image: chrislusf/seaweedfs
|
||||
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8082 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume2'
|
||||
volumes:
|
||||
- /data/disk2:/var/lib/seaweedfs/volume2
|
||||
depends_on:
|
||||
- master0
|
||||
restart: unless-stopped
|
||||
|
||||
# Volume Server 3
|
||||
volume3:
|
||||
image: chrislusf/seaweedfs
|
||||
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8083 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume3'
|
||||
volumes:
|
||||
- /data/disk3:/var/lib/seaweedfs/volume3
|
||||
depends_on:
|
||||
- master0
|
||||
restart: unless-stopped
|
||||
|
||||
# Volume Server 4
|
||||
volume4:
|
||||
image: chrislusf/seaweedfs
|
||||
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8084 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume4'
|
||||
volumes:
|
||||
- /data/disk4:/var/lib/seaweedfs/volume4
|
||||
depends_on:
|
||||
- master0
|
||||
restart: unless-stopped
|
||||
|
||||
# Volume Server 5
|
||||
volume5:
|
||||
image: chrislusf/seaweedfs
|
||||
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8085 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume5'
|
||||
volumes:
|
||||
- /data/disk5:/var/lib/seaweedfs/volume5
|
||||
depends_on:
|
||||
- master0
|
||||
restart: unless-stopped
|
||||
|
||||
# Volume Server 6
|
||||
volume6:
|
||||
image: chrislusf/seaweedfs
|
||||
command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8086 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume6'
|
||||
volumes:
|
||||
- /data/disk6:/var/lib/seaweedfs/volume6
|
||||
depends_on:
|
||||
- master0
|
||||
restart: unless-stopped
|
||||
|
||||
# Filer
|
||||
filer:
|
||||
image: chrislusf/seaweedfs
|
||||
ports:
|
||||
- 8888:8888
|
||||
- 18888:18888
|
||||
command: 'filer -defaultReplicaPlacement=002 -iam -master="master0:9333"'
|
||||
volumes:
|
||||
- ./data/filer:/data
|
||||
depends_on:
|
||||
- master0
|
||||
- volume1
|
||||
- volume2
|
||||
- volume3
|
||||
- volume4
|
||||
- volume5
|
||||
- volume6
|
||||
restart: unless-stopped
|
||||
|
||||
# S3 API
|
||||
s3:
|
||||
image: chrislusf/seaweedfs
|
||||
ports:
|
||||
- 8333:8333
|
||||
command: 's3 -filer="filer:8888" -ip.bind=0.0.0.0'
|
||||
env_file:
|
||||
- .env
|
||||
depends_on:
|
||||
- master0
|
||||
- volume1
|
||||
- volume2
|
||||
- volume3
|
||||
- volume4
|
||||
- volume5
|
||||
- volume6
|
||||
- filer
|
||||
restart: unless-stopped
|
BIN
static/assets/ctlog/size_comparison_8t.png
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/size_comparison_8t.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/ctlog/stop-hammer-time.jpg
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/stop-hammer-time.jpg
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/ctlog/sunlight-logo.png
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/sunlight-logo.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/ctlog/sunlight-test-s3.png
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/sunlight-test-s3.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/ctlog/tesseract-logo.png
(Stored with Git LFS)
Normal file
BIN
static/assets/ctlog/tesseract-logo.png
(Stored with Git LFS)
Normal file
Binary file not shown.
1
static/assets/frys-ix/FrysIX_ Topology (concept).svg
Normal file
1
static/assets/frys-ix/FrysIX_ Topology (concept).svg
Normal file
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 90 KiB |
BIN
static/assets/frys-ix/IXR-7220-D3.jpg
(Stored with Git LFS)
Normal file
BIN
static/assets/frys-ix/IXR-7220-D3.jpg
(Stored with Git LFS)
Normal file
Binary file not shown.
1
static/assets/frys-ix/Nokia Arista VXLAN.svg
Normal file
1
static/assets/frys-ix/Nokia Arista VXLAN.svg
Normal file
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 166 KiB |
169
static/assets/frys-ix/arista-leaf.conf
Normal file
169
static/assets/frys-ix/arista-leaf.conf
Normal file
@@ -0,0 +1,169 @@
|
||||
no aaa root
|
||||
!
|
||||
hardware counter feature vtep decap
|
||||
hardware counter feature vtep encap
|
||||
!
|
||||
service routing protocols model multi-agent
|
||||
!
|
||||
hostname arista-leaf
|
||||
!
|
||||
router l2-vpn
|
||||
arp learning bridged
|
||||
!
|
||||
spanning-tree mode mstp
|
||||
!
|
||||
system l1
|
||||
unsupported speed action error
|
||||
unsupported error-correction action error
|
||||
!
|
||||
vlan 2604
|
||||
name v-peeringlan
|
||||
!
|
||||
interface Ethernet1/1
|
||||
!
|
||||
interface Ethernet2/1
|
||||
!
|
||||
interface Ethernet3/1
|
||||
!
|
||||
interface Ethernet4/1
|
||||
!
|
||||
interface Ethernet5/1
|
||||
!
|
||||
interface Ethernet6/1
|
||||
!
|
||||
interface Ethernet7/1
|
||||
!
|
||||
interface Ethernet8/1
|
||||
!
|
||||
interface Ethernet9/1
|
||||
shutdown
|
||||
speed forced 10000full
|
||||
!
|
||||
interface Ethernet9/2
|
||||
shutdown
|
||||
!
|
||||
interface Ethernet9/3
|
||||
speed forced 10000full
|
||||
switchport access vlan 2604
|
||||
!
|
||||
interface Ethernet9/4
|
||||
shutdown
|
||||
!
|
||||
interface Ethernet10/1
|
||||
!
|
||||
interface Ethernet10/2
|
||||
shutdown
|
||||
!
|
||||
interface Ethernet10/4
|
||||
shutdown
|
||||
!
|
||||
interface Ethernet11/1
|
||||
!
|
||||
interface Ethernet12/1
|
||||
!
|
||||
interface Ethernet13/1
|
||||
!
|
||||
interface Ethernet14/1
|
||||
!
|
||||
interface Ethernet15/1
|
||||
!
|
||||
interface Ethernet16/1
|
||||
!
|
||||
interface Ethernet17/1
|
||||
!
|
||||
interface Ethernet18/1
|
||||
!
|
||||
interface Ethernet19/1
|
||||
!
|
||||
interface Ethernet20/1
|
||||
!
|
||||
interface Ethernet21/1
|
||||
!
|
||||
interface Ethernet22/1
|
||||
!
|
||||
interface Ethernet23/1
|
||||
!
|
||||
interface Ethernet24/1
|
||||
!
|
||||
interface Ethernet25/1
|
||||
!
|
||||
interface Ethernet26/1
|
||||
!
|
||||
interface Ethernet27/1
|
||||
!
|
||||
interface Ethernet28/1
|
||||
!
|
||||
interface Ethernet29/1
|
||||
no switchport
|
||||
!
|
||||
interface Ethernet30/1
|
||||
load-interval 1
|
||||
mtu 9190
|
||||
no switchport
|
||||
ip address 198.19.17.10/31
|
||||
ip ospf cost 10
|
||||
ip ospf network point-to-point
|
||||
ip ospf area 0.0.0.0
|
||||
!
|
||||
interface Ethernet31/1
|
||||
load-interval 1
|
||||
mtu 9190
|
||||
no switchport
|
||||
ip address 198.19.17.3/31
|
||||
ip ospf cost 1000
|
||||
ip ospf network point-to-point
|
||||
ip ospf area 0.0.0.0
|
||||
!
|
||||
interface Ethernet32/1
|
||||
load-interval 1
|
||||
mtu 9190
|
||||
no switchport
|
||||
ip address 198.19.17.5/31
|
||||
ip ospf cost 1000
|
||||
ip ospf network point-to-point
|
||||
ip ospf area 0.0.0.0
|
||||
!
|
||||
interface Loopback0
|
||||
ip address 198.19.16.2/32
|
||||
ip ospf area 0.0.0.0
|
||||
!
|
||||
interface Loopback1
|
||||
ip address 198.19.18.2/32
|
||||
!
|
||||
interface Management1
|
||||
ip address dhcp
|
||||
!
|
||||
interface Vxlan1
|
||||
vxlan source-interface Loopback1
|
||||
vxlan udp-port 4789
|
||||
vxlan vlan 2604 vni 2604
|
||||
!
|
||||
ip routing
|
||||
!
|
||||
ip route 0.0.0.0/0 Management1 10.75.8.1
|
||||
!
|
||||
router bgp 65500
|
||||
neighbor evpn peer group
|
||||
neighbor evpn remote-as 65500
|
||||
neighbor evpn update-source Loopback0
|
||||
neighbor evpn ebgp-multihop 3
|
||||
neighbor evpn send-community extended
|
||||
neighbor evpn maximum-routes 12000 warning-only
|
||||
neighbor 198.19.16.0 peer group evpn
|
||||
neighbor 198.19.16.1 peer group evpn
|
||||
!
|
||||
vlan 2604
|
||||
rd 65500:2604
|
||||
route-target both 65500:2604
|
||||
redistribute learned
|
||||
!
|
||||
address-family evpn
|
||||
neighbor evpn activate
|
||||
!
|
||||
router ospf 65500
|
||||
router-id 198.19.16.2
|
||||
redistribute connected
|
||||
network 198.19.0.0/16 area 0.0.0.0
|
||||
max-lsa 12000
|
||||
!
|
||||
end
|
90
static/assets/frys-ix/equinix.conf
Normal file
90
static/assets/frys-ix/equinix.conf
Normal file
@@ -0,0 +1,90 @@
|
||||
set / interface ethernet-1/1 admin-state disable
|
||||
set / interface ethernet-1/9 admin-state enable
|
||||
set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
|
||||
set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
|
||||
set / interface ethernet-1/9/3 admin-state enable
|
||||
set / interface ethernet-1/9/3 vlan-tagging true
|
||||
set / interface ethernet-1/9/3 subinterface 0 type bridged
|
||||
set / interface ethernet-1/9/3 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
|
||||
set / interface ethernet-1/29 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 type routed
|
||||
set / interface ethernet-1/29 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.0/31
|
||||
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
|
||||
set / interface lo0 admin-state enable
|
||||
set / interface lo0 subinterface 0 admin-state enable
|
||||
set / interface lo0 subinterface 0 ipv4 admin-state enable
|
||||
set / interface lo0 subinterface 0 ipv4 address 198.19.16.0/32
|
||||
set / interface mgmt0 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
|
||||
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
|
||||
set / interface system0 admin-state enable
|
||||
set / interface system0 subinterface 0 admin-state enable
|
||||
set / interface system0 subinterface 0 ipv4 admin-state enable
|
||||
set / interface system0 subinterface 0 ipv4 address 198.19.18.0/32
|
||||
set / network-instance default type default
|
||||
set / network-instance default admin-state enable
|
||||
set / network-instance default description "fabric: dc2 role: spine"
|
||||
set / network-instance default router-id 198.19.16.0
|
||||
set / network-instance default ip-forwarding receive-ipv4-check false
|
||||
set / network-instance default interface ethernet-1/29.0
|
||||
set / network-instance default interface lo0.0
|
||||
set / network-instance default interface system0.0
|
||||
set / network-instance default protocols bgp admin-state enable
|
||||
set / network-instance default protocols bgp autonomous-system 65500
|
||||
set / network-instance default protocols bgp router-id 198.19.16.0
|
||||
set / network-instance default protocols bgp dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
|
||||
set / network-instance default protocols bgp afi-safi evpn admin-state enable
|
||||
set / network-instance default protocols bgp preference ibgp 170
|
||||
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
|
||||
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
|
||||
set / network-instance default protocols bgp group overlay peer-as 65500
|
||||
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
|
||||
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
|
||||
set / network-instance default protocols bgp group overlay local-as as-number 65500
|
||||
set / network-instance default protocols bgp group overlay route-reflector client true
|
||||
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.0
|
||||
set / network-instance default protocols bgp neighbor 198.19.16.1 admin-state enable
|
||||
set / network-instance default protocols bgp neighbor 198.19.16.1 peer-group overlay
|
||||
set / network-instance default protocols ospf instance default admin-state enable
|
||||
set / network-instance default protocols ospf instance default version ospf-v2
|
||||
set / network-instance default protocols ospf instance default router-id 198.19.16.0
|
||||
set / network-instance default protocols ospf instance default export-policy ospf
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
|
||||
set / network-instance mgmt type ip-vrf
|
||||
set / network-instance mgmt admin-state enable
|
||||
set / network-instance mgmt description "Management network instance"
|
||||
set / network-instance mgmt interface mgmt0.0
|
||||
set / network-instance mgmt protocols linux import-routes true
|
||||
set / network-instance mgmt protocols linux export-routes true
|
||||
set / network-instance mgmt protocols linux export-neighbors true
|
||||
set / network-instance peeringlan type mac-vrf
|
||||
set / network-instance peeringlan admin-state enable
|
||||
set / network-instance peeringlan interface ethernet-1/9/3.0
|
||||
set / network-instance peeringlan vxlan-interface vxlan1.2604
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
|
||||
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
|
||||
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
|
||||
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
|
||||
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
|
||||
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
|
||||
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
|
||||
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
|
||||
set / routing-policy policy ospf statement 100 match protocol host
|
||||
set / routing-policy policy ospf statement 100 action policy-result accept
|
||||
set / routing-policy policy ospf statement 200 match protocol ospfv2
|
||||
set / routing-policy policy ospf statement 200 action policy-result accept
|
||||
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
|
||||
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
|
||||
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
|
BIN
static/assets/frys-ix/frysix-logo-small.png
(Stored with Git LFS)
Normal file
BIN
static/assets/frys-ix/frysix-logo-small.png
(Stored with Git LFS)
Normal file
Binary file not shown.
132
static/assets/frys-ix/nikhef.conf
Normal file
132
static/assets/frys-ix/nikhef.conf
Normal file
@@ -0,0 +1,132 @@
|
||||
set / interface ethernet-1/1 admin-state enable
|
||||
set / interface ethernet-1/1 ethernet forward-error-correction fec-option rs-528
|
||||
set / interface ethernet-1/1 subinterface 0 type routed
|
||||
set / interface ethernet-1/1 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/1 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/1 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/1 subinterface 0 ipv4 address 198.19.17.2/31
|
||||
set / interface ethernet-1/1 subinterface 0 ipv6 admin-state enable
|
||||
set / interface ethernet-1/2 admin-state enable
|
||||
set / interface ethernet-1/2 ethernet forward-error-correction fec-option rs-528
|
||||
set / interface ethernet-1/2 subinterface 0 type routed
|
||||
set / interface ethernet-1/2 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/2 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/2 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/2 subinterface 0 ipv4 address 198.19.17.4/31
|
||||
set / interface ethernet-1/2 subinterface 0 ipv6 admin-state enable
|
||||
set / interface ethernet-1/3 admin-state enable
|
||||
set / interface ethernet-1/3 ethernet forward-error-correction fec-option rs-528
|
||||
set / interface ethernet-1/3 subinterface 0 type routed
|
||||
set / interface ethernet-1/3 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/3 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/3 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/3 subinterface 0 ipv4 address 198.19.17.6/31
|
||||
set / interface ethernet-1/3 subinterface 0 ipv6 admin-state enable
|
||||
set / interface ethernet-1/4 admin-state enable
|
||||
set / interface ethernet-1/4 ethernet forward-error-correction fec-option rs-528
|
||||
set / interface ethernet-1/4 subinterface 0 type routed
|
||||
set / interface ethernet-1/4 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/4 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/4 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/4 subinterface 0 ipv4 address 198.19.17.8/31
|
||||
set / interface ethernet-1/4 subinterface 0 ipv6 admin-state enable
|
||||
set / interface ethernet-1/9 admin-state enable
|
||||
set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
|
||||
set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
|
||||
set / interface ethernet-1/9/1 admin-state disable
|
||||
set / interface ethernet-1/9/2 admin-state disable
|
||||
set / interface ethernet-1/9/3 admin-state enable
|
||||
set / interface ethernet-1/9/3 vlan-tagging true
|
||||
set / interface ethernet-1/9/3 subinterface 0 type bridged
|
||||
set / interface ethernet-1/9/3 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
|
||||
set / interface ethernet-1/9/4 admin-state disable
|
||||
set / interface ethernet-1/29 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 type routed
|
||||
set / interface ethernet-1/29 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
|
||||
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
|
||||
set / interface lo0 admin-state enable
|
||||
set / interface lo0 subinterface 0 admin-state enable
|
||||
set / interface lo0 subinterface 0 ipv4 admin-state enable
|
||||
set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
|
||||
set / interface mgmt0 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
|
||||
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
|
||||
set / interface system0 admin-state enable
|
||||
set / interface system0 subinterface 0 admin-state enable
|
||||
set / interface system0 subinterface 0 ipv4 admin-state enable
|
||||
set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
|
||||
set / network-instance default type default
|
||||
set / network-instance default admin-state enable
|
||||
set / network-instance default description "fabric: dc1 role: spine"
|
||||
set / network-instance default router-id 198.19.16.1
|
||||
set / network-instance default ip-forwarding receive-ipv4-check false
|
||||
set / network-instance default interface ethernet-1/1.0
|
||||
set / network-instance default interface ethernet-1/2.0
|
||||
set / network-instance default interface ethernet-1/29.0
|
||||
set / network-instance default interface ethernet-1/3.0
|
||||
set / network-instance default interface ethernet-1/4.0
|
||||
set / network-instance default interface lo0.0
|
||||
set / network-instance default interface system0.0
|
||||
set / network-instance default protocols bgp admin-state enable
|
||||
set / network-instance default protocols bgp autonomous-system 65500
|
||||
set / network-instance default protocols bgp router-id 198.19.16.1
|
||||
set / network-instance default protocols bgp dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
|
||||
set / network-instance default protocols bgp afi-safi evpn admin-state enable
|
||||
set / network-instance default protocols bgp preference ibgp 170
|
||||
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
|
||||
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
|
||||
set / network-instance default protocols bgp group overlay peer-as 65500
|
||||
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
|
||||
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
|
||||
set / network-instance default protocols bgp group overlay local-as as-number 65500
|
||||
set / network-instance default protocols bgp group overlay route-reflector client true
|
||||
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.1
|
||||
set / network-instance default protocols bgp neighbor 198.19.16.0 admin-state enable
|
||||
set / network-instance default protocols bgp neighbor 198.19.16.0 peer-group overlay
|
||||
set / network-instance default protocols ospf instance default admin-state enable
|
||||
set / network-instance default protocols ospf instance default version ospf-v2
|
||||
set / network-instance default protocols ospf instance default router-id 198.19.16.1
|
||||
set / network-instance default protocols ospf instance default export-policy ospf
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/1.0 interface-type point-to-point
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/2.0 interface-type point-to-point
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/3.0 interface-type point-to-point
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/4.0 interface-type point-to-point
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0 passive true
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
|
||||
set / network-instance mgmt type ip-vrf
|
||||
set / network-instance mgmt admin-state enable
|
||||
set / network-instance mgmt description "Management network instance"
|
||||
set / network-instance mgmt interface mgmt0.0
|
||||
set / network-instance mgmt protocols linux import-routes true
|
||||
set / network-instance mgmt protocols linux export-routes true
|
||||
set / network-instance mgmt protocols linux export-neighbors true
|
||||
set / network-instance peeringlan type mac-vrf
|
||||
set / network-instance peeringlan admin-state enable
|
||||
set / network-instance peeringlan interface ethernet-1/9/3.0
|
||||
set / network-instance peeringlan vxlan-interface vxlan1.2604
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
|
||||
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
|
||||
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
|
||||
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
|
||||
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
|
||||
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
|
||||
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
|
||||
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
|
||||
set / routing-policy policy ospf statement 100 match protocol host
|
||||
set / routing-policy policy ospf statement 100 action policy-result accept
|
||||
set / routing-policy policy ospf statement 200 match protocol ospfv2
|
||||
set / routing-policy policy ospf statement 200 action policy-result accept
|
||||
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
|
||||
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
|
||||
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
|
BIN
static/assets/frys-ix/nokia-7220-d2.png
(Stored with Git LFS)
Normal file
BIN
static/assets/frys-ix/nokia-7220-d2.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/frys-ix/nokia-7220-d4.png
(Stored with Git LFS)
Normal file
BIN
static/assets/frys-ix/nokia-7220-d4.png
(Stored with Git LFS)
Normal file
Binary file not shown.
105
static/assets/frys-ix/nokia-leaf.conf
Normal file
105
static/assets/frys-ix/nokia-leaf.conf
Normal file
@@ -0,0 +1,105 @@
|
||||
set / interface ethernet-1/9 admin-state enable
|
||||
set / interface ethernet-1/9 vlan-tagging true
|
||||
set / interface ethernet-1/9 ethernet port-speed 10G
|
||||
set / interface ethernet-1/9 subinterface 0 type bridged
|
||||
set / interface ethernet-1/9 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/9 subinterface 0 vlan encap untagged
|
||||
set / interface ethernet-1/53 admin-state enable
|
||||
set / interface ethernet-1/53 ethernet forward-error-correction fec-option rs-528
|
||||
set / interface ethernet-1/53 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/53 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/53 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/53 subinterface 0 ipv4 address 198.19.17.11/31
|
||||
set / interface ethernet-1/53 subinterface 0 ipv6 admin-state enable
|
||||
set / interface ethernet-1/55 admin-state enable
|
||||
set / interface ethernet-1/55 ethernet forward-error-correction fec-option rs-528
|
||||
set / interface ethernet-1/55 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/55 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/55 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/55 subinterface 0 ipv4 address 198.19.17.7/31
|
||||
set / interface ethernet-1/55 subinterface 0 ipv6 admin-state enable
|
||||
set / interface ethernet-1/56 admin-state enable
|
||||
set / interface ethernet-1/56 ethernet forward-error-correction fec-option rs-528
|
||||
set / interface ethernet-1/56 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/56 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/56 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/56 subinterface 0 ipv4 address 198.19.17.9/31
|
||||
set / interface ethernet-1/56 subinterface 0 ipv6 admin-state enable
|
||||
set / interface lo0 admin-state enable
|
||||
set / interface lo0 subinterface 0 admin-state enable
|
||||
set / interface lo0 subinterface 0 ipv4 admin-state enable
|
||||
set / interface lo0 subinterface 0 ipv4 address 198.19.16.3/32
|
||||
set / interface mgmt0 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
|
||||
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
|
||||
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
|
||||
set / interface system0 admin-state enable
|
||||
set / interface system0 subinterface 0 admin-state enable
|
||||
set / interface system0 subinterface 0 ipv4 admin-state enable
|
||||
set / interface system0 subinterface 0 ipv4 address 198.19.18.3/32
|
||||
set / network-instance default type default
|
||||
set / network-instance default admin-state enable
|
||||
set / network-instance default description "fabric: dc1 role: leaf"
|
||||
set / network-instance default router-id 198.19.16.3
|
||||
set / network-instance default ip-forwarding receive-ipv4-check false
|
||||
set / network-instance default interface ethernet-1/53.0
|
||||
set / network-instance default interface ethernet-1/55.0
|
||||
set / network-instance default interface ethernet-1/56.0
|
||||
set / network-instance default interface lo0.0
|
||||
set / network-instance default interface system0.0
|
||||
set / network-instance default protocols bgp admin-state enable
|
||||
set / network-instance default protocols bgp autonomous-system 65500
|
||||
set / network-instance default protocols bgp router-id 198.19.16.3
|
||||
set / network-instance default protocols bgp afi-safi evpn admin-state enable
|
||||
set / network-instance default protocols bgp preference ibgp 170
|
||||
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
|
||||
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
|
||||
set / network-instance default protocols bgp group overlay peer-as 65500
|
||||
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
|
||||
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
|
||||
set / network-instance default protocols bgp group overlay local-as as-number 65500
|
||||
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.3
|
||||
set / network-instance default protocols bgp neighbor 198.19.16.0 admin-state enable
|
||||
set / network-instance default protocols bgp neighbor 198.19.16.0 peer-group overlay
|
||||
set / network-instance default protocols bgp neighbor 198.19.16.1 admin-state enable
|
||||
set / network-instance default protocols bgp neighbor 198.19.16.1 peer-group overlay
|
||||
set / network-instance default protocols ospf instance default admin-state enable
|
||||
set / network-instance default protocols ospf instance default version ospf-v2
|
||||
set / network-instance default protocols ospf instance default router-id 198.19.16.3
|
||||
set / network-instance default protocols ospf instance default export-policy ospf
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/53.0 interface-type point-to-point
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/55.0 interface-type point-to-point
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/56.0 interface-type point-to-point
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0 passive true
|
||||
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
|
||||
set / network-instance mgmt type ip-vrf
|
||||
set / network-instance mgmt admin-state enable
|
||||
set / network-instance mgmt description "Management network instance"
|
||||
set / network-instance mgmt interface mgmt0.0
|
||||
set / network-instance mgmt protocols linux import-routes true
|
||||
set / network-instance mgmt protocols linux export-routes true
|
||||
set / network-instance mgmt protocols linux export-neighbors true
|
||||
set / network-instance peeringlan type mac-vrf
|
||||
set / network-instance peeringlan admin-state enable
|
||||
set / network-instance peeringlan interface ethernet-1/9.0
|
||||
set / network-instance peeringlan vxlan-interface vxlan1.2604
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
|
||||
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
|
||||
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
|
||||
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
|
||||
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
|
||||
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
|
||||
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
|
||||
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
|
||||
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
|
||||
set / routing-policy policy ospf statement 100 match protocol host
|
||||
set / routing-policy policy ospf statement 100 action policy-result accept
|
||||
set / routing-policy policy ospf statement 200 match protocol ospfv2
|
||||
set / routing-policy policy ospf statement 200 action policy-result accept
|
||||
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
|
||||
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
|
||||
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
|
BIN
static/assets/minio/console-1.png
(Stored with Git LFS)
Normal file
BIN
static/assets/minio/console-1.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/minio/console-2.png
(Stored with Git LFS)
Normal file
BIN
static/assets/minio/console-2.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/minio/disks.png
(Stored with Git LFS)
Normal file
BIN
static/assets/minio/disks.png
(Stored with Git LFS)
Normal file
Binary file not shown.
1633
static/assets/minio/minio-ec.svg
Normal file
1633
static/assets/minio/minio-ec.svg
Normal file
File diff suppressed because it is too large
Load Diff
After Width: | Height: | Size: 90 KiB |
BIN
static/assets/minio/minio-logo.png
(Stored with Git LFS)
Normal file
BIN
static/assets/minio/minio-logo.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/minio/nagios.png
(Stored with Git LFS)
Normal file
BIN
static/assets/minio/nagios.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/minio/nginx-logo.png
(Stored with Git LFS)
Normal file
BIN
static/assets/minio/nginx-logo.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/minio/rack-2.png
(Stored with Git LFS)
Normal file
BIN
static/assets/minio/rack-2.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/minio/rack.png
(Stored with Git LFS)
Normal file
BIN
static/assets/minio/rack.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/minio/restic-logo.png
(Stored with Git LFS)
Normal file
BIN
static/assets/minio/restic-logo.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/sflow/sflow-all.pcap
Normal file
BIN
static/assets/sflow/sflow-all.pcap
Normal file
Binary file not shown.
BIN
static/assets/sflow/sflow-host.pcap
Normal file
BIN
static/assets/sflow/sflow-host.pcap
Normal file
Binary file not shown.
BIN
static/assets/sflow/sflow-interface.pcap
Normal file
BIN
static/assets/sflow/sflow-interface.pcap
Normal file
Binary file not shown.
BIN
static/assets/sflow/sflow-lab-trex.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/sflow-lab-trex.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/sflow/sflow-lab.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/sflow-lab.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/sflow/sflow-overview.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/sflow-overview.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/sflow/sflow-vpp-overview.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/sflow-vpp-overview.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
static/assets/sflow/sflow-wireshark.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/sflow-wireshark.png
(Stored with Git LFS)
Normal file
Binary file not shown.
@@ -3,9 +3,9 @@
|
||||
# OpenBSD bastion
|
||||
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAXMfDOJtI3JztcPJ1DZMXzILZzMilMvodvMIfqqa1qr pim+openbsd@ipng.ch
|
||||
|
||||
# Macbook M2 Air (Secretive)
|
||||
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBOqcEzDb0ZmHl3s++rnxjOcoAeZKy5EkVU6WdChXLj8SuthjCinOTSMXy7k0PnxWejSST1KHxJ3nBbvpboGMwH8= pim+m2air@ipng.ch
|
||||
|
||||
# Mac Studio (Secretive)
|
||||
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBMtJZgTDWxBEbQ2vPYtOw4L0s4VRKUUjpu6aFPVx3CpqrjLpyJIxzBWTfb/VnEp95VfgM8IUAYYM8w7xoLd7QZc= pim+studio@ipng.ch
|
||||
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBMtJZgTDWxBEbQ2vPYtOw4L0s4VRKUUjpu6aFPVx3CpqrjLpyJIxzBWTfb/VnEp95VfgM8IUAYYM8w7xoLd7QZc= pim+jessica+secretive@ipng.ch
|
||||
|
||||
# Macbook Air M4 (Secretive)
|
||||
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBASymGKXfKkfsYbo7UDrIBxl1F6X7LmVPQ3XOFOKp8tLI6zLyCYs5zgRNs/qksHOgKUK+fE/TzJ4XJsuSbYNMB0= pim+tammy+secretive@ipng.ch
|
||||
|
||||
|
@@ -17,6 +17,7 @@ $text-very-light: #767676;
|
||||
$medium-light-text: #4f4a5f;
|
||||
$code-background: #f3f3f3;
|
||||
$codeblock-background: #f6f8fa;
|
||||
$codeblock-text: #99a;
|
||||
$code-text: #f8f8f2;
|
||||
$ipng-orange: #f46524;
|
||||
$ipng-darkorange: #8c1919;
|
||||
@@ -142,7 +143,7 @@ pre {
|
||||
|
||||
code {
|
||||
background-color: transparent;
|
||||
color: #444;
|
||||
color: $codeblock-text;
|
||||
}
|
||||
}
|
||||
|
||||
|
1
themes/hugo-theme-ipng/layouts/shortcodes/boldcolor.html
Normal file
1
themes/hugo-theme-ipng/layouts/shortcodes/boldcolor.html
Normal file
@@ -0,0 +1 @@
|
||||
<span style="color: {{ .Get "color" }}; font-weight: bold;">{{ .Inner }}</span>
|
Reference in New Issue
Block a user