Compare commits: 7ea57ba11e ... main
125 commits
| SHA1 |
| --- |
| 1b95e25331 |
| 512cfd75dc |
| 8683d570a1 |
| a1a98ad3c6 |
| 26ae98d977 |
| 619a1dfdf2 |
| a9e978effb |
| 825335cef9 |
| a97115593c |
| 3dd0d8a656 |
| f137326339 |
| 51098ed43c |
| 6b337e1167 |
| bbf36f5a4e |
| b324d71b3f |
| 2681861e4b |
| 4f0188abeb |
| f4ed332b18 |
| d9066aa241 |
| c68799703b |
| c32d1779f8 |
| eda80e7e66 |
| d13da5608d |
| d47261a3b7 |
| 383a598fc7 |
| 8afa2ff944 |
| fe1207ee78 |
| 6a59b7d7e6 |
| bc2a9bb352 |
| 5d02b6466c |
| b6b419471d |
| 85b41ba4e0 |
| ebbb0f8e24 |
| 218ee84d5f |
| c476fa56fb |
| a76abc331f |
| 44deb34685 |
| ca46bcf6d5 |
| 5042f822ef |
| fdb77838b8 |
| 6d3f4ac206 |
| baa3e78045 |
| 0972cf4aa1 |
| 4f81d377a0 |
| 153048eda4 |
| 4aa5745d06 |
| 7d3f617966 |
| 8918821413 |
| 9783c7d39c |
| af68c1ec3b |
| 0baadb5089 |
| 3b7e576d20 |
| d0a7cdbe38 |
| ed087f3fc6 |
| 51e6c0e1c2 |
| 8a991bee47 |
| d9e2f407e7 |
| 01820776af |
| d5d4f7ff55 |
| 2a61bdc028 |
| c2b8eef4f4 |
| 533cca0108 |
| 4ac8c47127 |
| bcbb119b20 |
| ce6e6cde22 |
| 610835925b |
| 16ac42bad9 |
| 26397d69c6 |
| 388293baef |
| b2129702ae |
| ba068c1c52 |
| 3c69130cea |
| 255d3905d7 |
| 4cd42b9824 |
| f12247d278 |
| 36b422ce08 |
| 2e1bb69772 |
| ceb16714b6 |
| 72b99b20c6 |
| 4b5bd40fce |
| 1379c77181 |
| 08d55e6ac0 |
| 3feb217aa8 |
| 2f63fc0ebb |
| 4113615096 |
| 52cba49c90 |
| b5c0819bfa |
| ea05b39ddf |
| 27ab370dc4 |
| 1e5e965572 |
| d8c36e5077 |
| 8b23bba61d |
| 5dc5a17f40 |
| 52d3606b1b |
| d017f1c2cf |
| e867f75a34 |
| 7da66c5f35 |
| f201aeb596 |
| ee4534c23a |
| 6ef9a21206 |
| a4884a28d9 |
| 5b0f1acbf6 |
| 9727d065b8 |
| ef83fd569d |
| bf9a070ea5 |
| 090cf21170 |
| f23a5ace77 |
| 3db7156652 |
| 9b47359318 |
| ecb0062105 |
| 44a854dc8e |
| 7fc65b87df |
| 413498e4c1 |
| b576a15a30 |
| 7f73540fd7 |
| 0b5ed8683c |
| 20022b77dd |
| 0542c1e2d9 |
| 4aa6f0bf10 |
| 4210f97c9d |
| 34981afe2e |
| de61265f82 |
| b09f7437b2 |
| 005add2b74 |
| 4122f50cb1 |
.drone.yml (Normal file, 34 lines added)
@@ -0,0 +1,34 @@
kind: pipeline
name: default

steps:
  - name: git-lfs
    image: alpine/git
    commands:
      - git lfs install
      - git lfs pull
  - name: build
    image: git.ipng.ch/ipng/drone-hugo:release-0.148.2
    settings:
      hugo_version: 0.148.2
      extended: true
  - name: rsync
    image: drillster/drone-rsync
    settings:
      user: drone
      key:
        from_secret: drone_sshkey
      hosts:
        - nginx0.chrma0.net.ipng.ch
        - nginx0.chplo0.net.ipng.ch
        - nginx0.nlams1.net.ipng.ch
        - nginx0.nlams2.net.ipng.ch
      port: 22
      args: '-6u --delete-after'
      source: public/
      target: /nginx/sites/ipng.ch/
      recursive: true
    secrets: [ drone_sshkey ]

image_pull_secrets:
  - git_ipng_ch_docker
.gitignore (vendored, 1 line added)
@@ -1,3 +1,4 @@
.hugo*
public/
resources/_gen/
+.DS_Store
@@ -8,7 +8,7 @@ Historical context - todo, but notes for now

1. started with stack.nl (when it was still stack.urc.tue.nl), 6bone and watching NASA multicast video in 1997.
2. founded ipng.nl project, first IPv6 in NL that was usable outside of NREN.
-3. attacted attention of the first few IPv6 partitipants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day
+3. attracted attention of the first few IPv6 participants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day
4. launched IPv6 at AMS-IX, first IXP prefix allocated 2001:768:1::/48
> My Brilliant Idea Of The Day -- encode AS number in leetspeak: `::AS01:2859:1`, because who would've thought we would ever run out of 16 bit AS numbers :)
5. IPng rearchitected to SixXS, and became a very large scale deployment of IPv6 tunnelbroker; our main central provisioning system moved around a few times between ISPs (Intouch, Concepts ICT, BIT, IP Man)

@@ -185,7 +185,7 @@ function is_coloclue_beacon()
}
```

-Then, I ran the configuration again with one IPv4 beacon set on dcg-1, and still all the bird configs on both IPv4 and IPv6 for all routers parsed correctly, and the generated function on the dcg-1 IPv4 filters file was popupated:
+Then, I ran the configuration again with one IPv4 beacon set on dcg-1, and still all the bird configs on both IPv4 and IPv6 for all routers parsed correctly, and the generated function on the dcg-1 IPv4 filters file was populated:
```
function is_coloclue_beacon()
{
@@ -89,7 +89,7 @@ lcp lcp-sync off
```

The prep work for the rest of the interface syncer starts with this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.

### Change interface: state

@@ -120,7 +120,7 @@ the state it was. I did notice that you can't bring up a sub-interface if its pa
is down, which I found counterintuitive, but that's neither here nor there.

All of this is to say that we have to be careful when copying state forward, because as
-this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
+this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
shows, issuing `set int state ... up` on an interface, won't touch its sub-interfaces in VPP, but
the subsequent netlink message to bring the _LIP_ for that interface up, **will** update the
children, thus desynchronising Linux and VPP: Linux will have interface **and all its

@@ -128,7 +128,7 @@ sub-interfaces** up unconditionally; VPP will have the interface up and its sub-
whatever state they were before.

To address this, a second
-[[commit](https://github.com/pimvanpelt/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
needed. I'm not too sure I want to keep this behavior, but for now, it results in an intuitive
end-state, which is that all interfaces states are exactly the same between Linux and VPP.

@@ -157,7 +157,7 @@ DBGvpp# set int state TenGigabitEthernet3/0/0 up
### Change interface: MTU

Finally, a straight forward
-[[commit](https://github.com/pimvanpelt/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
so I thought. When the MTU changes in VPP (with `set interface mtu packet N <int>`), there is
callback that can be registered which copies this into the _LIP_. I did notice a specific corner
case: In VPP, a sub-interface can have a larger MTU than its parent. In Linux, this cannot happen,

@@ -179,7 +179,7 @@ higher than that, perhaps logging an error explaining why. This means two things
1. Any change in VPP of a parent MTU should ensure all children are clamped to at most that.

I addressed the issue in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].

### Change interface: IP Addresses

@@ -199,7 +199,7 @@ VPP into the companion Linux devices:
_LIP_ with `lcp_itf_set_interface_addr()`.

This means with this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
any time a new _LIP_ is created, the IPv4 and IPv6 address on the VPP interface are fully copied
over by the third change, while at runtime, new addresses can be set/removed as well by the first
and second change.
@@ -100,7 +100,7 @@ linux-cp {

Based on this config, I set the startup default in `lcp_set_lcp_auto_subint()`, but I realize that
an administrator may want to turn it on/off at runtime, too, so I add a CLI getter/setter that
-interacts with the flag in this [[commit](https://github.com/pimvanpelt/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:
+interacts with the flag in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:

```
DBGvpp# show lcp

@@ -116,11 +116,11 @@ lcp lcp-sync off
```

The prep work for the rest of the interface syncer starts with this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.

The code for the configuration toggle is in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].

### Auto create/delete sub-interfaces

@@ -145,7 +145,7 @@ I noticed that interface deletion had a bug (one that I fell victim to as well:
remove the netlink device in the correct network namespace), which I fixed.

The code for the auto create/delete and the bugfix is in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].

### Further Work

@@ -154,7 +154,7 @@ For now, `lcp_nl_dispatch()` just throws the message away after logging it with
a function that will come in very useful as I start to explore all the different Netlink message types.

The code that forms the basis of our Netlink Listener lives in [[this
-commit](https://github.com/pimvanpelt/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
+commit](https://git.ipng.ch/ipng/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
specifically, here I want to call out I was not the primary author, I worked off of Matt and Neale's
awesome work in this pending [Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122).

@@ -182,7 +182,7 @@ Linux interface VPP is not aware of. But, if I can find the _LIP_, I can convert
add or remove the ip4/ip6 neighbor adjacency.

The code for this first Netlink message handler lives in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
ironic insight is that after writing the code, I don't think any of it will be necessary, because
the interface plugin will already copy ARP and IPv6 ND packets back and forth and itself update its
neighbor adjacency tables; but I'm leaving the code in for now.
@@ -197,7 +197,7 @@ it or remove it, and if there are no link-local addresses left, disable IPv6 on
There's also a few multicast routes to add (notably 224.0.0.0/24 and ff00::/8, all-local-subnet).

The code for IP address handling is in this
-[[commit]](https://github.com/pimvanpelt/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
+[[commit]](https://git.ipng.ch/ipng/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
when I took it out for a spin, I noticed something curious, looking at the log lines that are
generated for the following sequence:

@@ -236,7 +236,7 @@ interface and directly connected route addition/deletion is slightly different i
So, I decide to take a little shortcut -- if an addition returns "already there", or a deletion returns
"no such entry", I'll just consider it a successful addition and deletion respectively, saving my eyes
from being screamed at by this red error message. I changed that in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
turning this situation in a friendly green notice instead.

### Netlink: Link (existing)

@@ -267,7 +267,7 @@ To avoid this loop, I temporarily turn off `lcp-sync` just before handling a bat
turn it back to its original state when I'm done with that.

The code for all/del of existing links is in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].

### Netlink: Link (new)

@@ -276,7 +276,7 @@ doesn't have a _LIP_ for, but specifically describes a VLAN interface? Well, th
is trying to create a new sub-interface. And supporting that operation would be super cool, so let's go!

Using the earlier placeholder hint in `lcp_nl_link_add()` (see the previous
-[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
I know that I've gotten a NEWLINK request but the Linux ifindex doesn't have a _LIP_. This could be
because the interface is entirely foreign to VPP, for example somebody created a dummy interface or
a VLAN sub-interface on one:

@@ -331,7 +331,7 @@ a boring `<phy>.<subid>` name.

Alright, without further ado, the code for the main innovation here, the implementation of
`lcp_nl_link_add_vlan()`, is in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].

## Results

@@ -118,7 +118,7 @@ or Virtual Routing/Forwarding domains). So first, I need to add these:

All of this code was heavily inspired by the pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)]
but a few finishing touches were added, and wrapped up in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].

### Deletion

@@ -459,7 +459,7 @@ it as 'unreachable' rather than deleting it. These are *additions* which have a
but with an interface index of 1 (which, in Netlink, is 'lo'). This makes VPP intermittently crash, so I
currently commented this out, while I gain better understanding. Result: blackhole/unreachable/prohibit
specials can not be set using the plugin. Beware!
-(disabled in this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).
+(disabled in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).

## Credits

@@ -88,7 +88,7 @@ stat['/if/rx-miss'][:, 1].sum() - returns the sum of packet counters for
```

Alright, so let's grab that file and refactor it into a small library for me to use, I do
-this in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
+this in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].

### VPP's API

@@ -159,7 +159,7 @@ idx=19 name=tap4 mac=02:fe:17:06:fc:af mtu=9000 flags=3

So I added a little abstration with some error handling and one main function
to return interfaces as a Python dictionary of those `sw_interface_details`
-tuples in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
+tuples in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].

### AgentX

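As an aside: the hunk above describes one main function that returns the interfaces as a Python dictionary of `sw_interface_details` tuples. A minimal sketch of that idea, assuming an already-connected `vpp_papi` client object is passed in (the helper name and its argument are illustrative, not taken from the repository), might look like this:

```
# Sketch only: turn the replies of sw_interface_dump() into a dictionary
# keyed by interface name, which is the shape described in the hunk above.
# 'vpp' is assumed to be an already-connected vpp_papi client.
def get_interfaces(vpp):
    interfaces = {}
    for detail in vpp.api.sw_interface_dump():
        interfaces[detail.interface_name] = detail
    return interfaces
```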
@@ -207,9 +207,9 @@ once asked with `GetPDU` or `GetNextPDU` requests, by issuing a corresponding `R
to the SNMP server -- it takes care of all the rest!

The resulting code is in [[this
-commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
+commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
but you can also check out the whole thing on
-[[Github](https://github.com/pimvanpelt/vpp-snmp-agent)].
+[[Github](https://git.ipng.ch/ipng/vpp-snmp-agent)].

### Building

@@ -480,7 +480,7 @@ is to say, those packets which were destined to any IP address configured on the
plane. Any traffic going _through_ VPP will never be seen by Linux! So, I'll have to be
clever and count this traffic by polling VPP instead. This was the topic of my previous
[VPP Part 6]({{< ref "2021-09-10-vpp-6" >}}) about the SNMP Agent. All of that code
-was released to [Github](https://github.com/pimvanpelt/vpp-snmp-agent), notably there's
+was released to [Github](https://git.ipng.ch/ipng/vpp-snmp-agent), notably there's
a hint there for an `snmpd-dataplane.service` and a `vpp-snmp-agent.service`, including
the compiled binary that reads from VPP and feeds this to SNMP.

@@ -30,9 +30,9 @@ virtual machine running in Qemu/KVM into a working setup with both [Free Range R
and [Bird](https://bird.network.cz/) installed side by side.

**NOTE**: If you're just interested in the resulting image, here's the most pertinent information:
-> * ***vpp-proto.qcow2.lrz [[Download](https://ipng.ch/media/vpp-proto/vpp-proto-bookworm-20231015.qcow2.lrz)]***
-> * ***SHA256*** `bff03a80ccd1c0094d867d1eb1b669720a1838330c0a5a526439ecb1a2457309`
-> * ***Debian Bookworm (12.4)*** and ***VPP 24.02-rc0~46-ga16463610e***
+> * ***vpp-proto.qcow2.lrz*** [[Download](https://ipng.ch/media/vpp-proto/vpp-proto-bookworm-20250607.qcow2.lrz)]
+> * ***SHA256*** `a5fdf157c03f2d202dcccdf6ed97db49c8aa5fdb6b9ca83a1da958a8a24780ab`
+> * ***Debian Bookworm (12.11)*** and ***VPP 25.10-rc0~49-g90d92196***
> * ***CPU*** Make sure the (virtualized) CPU supports AVX
> * ***RAM*** The image needs at least 4GB of RAM, and the hypervisor should support hugepages and AVX
> * ***Username***: `ipng` with ***password***: `ipng loves vpp` and is sudo-enabled

@@ -62,7 +62,7 @@ plugins:
or route, or the system receiving ARP or IPv6 neighbor request/reply from neighbors), and applying
these events to the VPP dataplane.

-I've published the code on [Github](https://github.com/pimvanpelt/lcpng/) and I am targeting a release
+I've published the code on [Github](https://git.ipng.ch/ipng/lcpng/) and I am targeting a release
in upstream VPP, hoping to make the upcoming 22.02 release in February 2022. I have a lot of ground to
cover, but I will note that the plugin has been running in production in [AS8298]({{< ref "2021-02-27-network" >}})
since Sep'21 and no crashes related to LinuxCP have been observed.

@@ -195,7 +195,7 @@ So grab a cup of tea, while we let Rhino stretch its legs, ehh, CPUs ...
pim@rhino:~$ mkdir -p ~/src
pim@rhino:~$ cd ~/src
pim@rhino:~/src$ sudo apt install libmnl-dev
-pim@rhino:~/src$ git clone https://github.com/pimvanpelt/lcpng.git
+pim@rhino:~/src$ git clone https://git.ipng.ch/ipng/lcpng.git
pim@rhino:~/src$ git clone https://gerrit.fd.io/r/vpp
pim@rhino:~/src$ ln -s ~/src/lcpng ~/src/vpp/src/plugins/lcpng
pim@rhino:~/src$ cd ~/src/vpp
@@ -33,7 +33,7 @@ In this first post, let's take a look at tablestakes: writing a YAML specificati
configuration elements of VPP, and then ensures that the YAML file is both syntactically as well as
semantically correct.

-**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
+**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).

@@ -348,7 +348,7 @@ to mess up my (or your!) VPP router by feeding it garbage, so the lions' share o
has been to assert the YAML file is both syntactically and semantically valid.


-In the mean time, you can take a look at my code on [GitHub](https://github.com/pimvanpelt/vppcfg), but to
+In the mean time, you can take a look at my code on [GitHub](https://git.ipng.ch/ipng/vppcfg), but to
whet your appetite, here's a hefty configuration that demonstrates all implemented types:

```

@@ -32,7 +32,7 @@ the configuration to the dataplane. Welcome to `vppcfg`!
In this second post of the series, I want to talk a little bit about how planning a path from a running
configuration to a desired new configuration might look like.

-**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
+**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).

@@ -275,7 +275,6 @@ that will point at an `unbound` running on `lab.ipng.ch` itself.
I can now create any file I'd like which may use variable substition and other jinja2 style templating. Take
for example these two files:

-{% raw %}
```
pim@lab:~/src/lab$ cat overlays/bird/common/etc/netplan/01-netcfg.yaml.j2
network:

@@ -292,13 +291,12 @@ network:

pim@lab:~/src/lab$ cat overlays/bird/common/etc/netns/dataplane/resolv.conf.j2
domain lab.ipng.ch
-search{% for domain in lab.nameserver.search %} {{domain}}{%endfor %}
+search{% for domain in lab.nameserver.search %} {{ domain }}{% endfor %}

{% for resolver in lab.nameserver.addresses %}
-nameserver {{resolver}}
-{%endfor%}
+nameserver {{ resolver }}
+{% endfor %}
```
-{% endraw %}

The first file is a [[NetPlan.io](https://netplan.io/)] configuration that substitutes the correct management
IPv4 and IPv6 addresses and gateways. The second one enumerates a set of search domains and nameservers, so that
@@ -578,7 +578,7 @@ the inner payload carries the `vlan 30` tag, neat! The `VNI` there is `0xca986`
VLAN10 traffic (showing that multiple VLANs can be transported across the same tunnel, distinguished
by VNI).

-{{< image width="90px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
+{{< image width="90px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

At this point I make an important observation. VxLAN and GENEVE both have this really cool feature
that they can hash their _inner_ payload (ie. the IPv4/IPv6 address and ports if available) and use

@@ -171,12 +171,12 @@ GigabitEthernet1/0/0 1 up GigabitEthernet1/0/0

After this exploratory exercise, I have learned enough about the hardware to be able to take the
Fitlet2 out for a spin. To configure the VPP instance, I turn to
-[[vppcfg](https://github.com/pimvanpelt/vppcfg)], which can take a YAML configuration file
+[[vppcfg](https://git.ipng.ch/ipng/vppcfg)], which can take a YAML configuration file
describing the desired VPP configuration, and apply it safely to the running dataplane using the VPP
API. I've written a few more posts on how it does that, notably on its [[syntax]({{< ref "2022-03-27-vppcfg-1" >}})]
and its [[planner]({{< ref "2022-04-02-vppcfg-2" >}})]. A complete
configuration guide on vppcfg can be found
-[[here](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md)].
+[[here](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md)].

```
pim@fitlet:~$ sudo dpkg -i {lib,}vpp*23.06*deb

@@ -185,7 +185,7 @@ forgetful chipmunk-sized brain!), so here, I'll only recap what's already writte

**1. BUILD:** For the first step, the build is straight forward, and yields a VPP instance based on
`vpp-ext-deps_23.06-1` at version `23.06-rc0~71-g182d2b466`, which contains my
-[[LCPng](https://github.com/pimvanpelt/lcpng.git)] plugin. I then copy the packages to the router.
+[[LCPng](https://git.ipng.ch/ipng/lcpng.git)] plugin. I then copy the packages to the router.
The router has an E-2286G CPU @ 4.00GHz with 6 cores and 6 hyperthreads. There's a really handy tool
called `likwid-topology` that can show how the L1, L2 and L3 cache lines up with respect to CPU
cores. Here I learn that CPU (0+6) and (1+7) share L1 and L2 cache -- so I can conclude that 0-5 are

@@ -351,7 +351,7 @@ in `vppcfg`:

* When I create the initial `--novpp` config, there's a bug in `vppcfg` where I incorrectly
reference a dataplane object which I haven't initialized (because with `--novpp` the tool
will not contact the dataplane at all. That one was easy to fix, which I did in [[this
-commit](https://github.com/pimvanpelt/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).
+commit](https://git.ipng.ch/ipng/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).

After that small detour, I can now proceed to configure the dataplane by offering the resulting
VPP commands, like so:

@@ -573,7 +573,7 @@ see is that which is destined to the controlplane (eg, to one of the IPv4 or IPv
multicast/broadcast groups that they are participating in), so things like tcpdump or SNMP won't
really work.

-However, due to my [[vpp-snmp-agent](https://github.com/pimvanpelt/vpp-snmp-agent.git)], which is
+However, due to my [[vpp-snmp-agent](https://git.ipng.ch/ipng/vpp-snmp-agent.git)], which is
feeding as an AgentX behind an snmpd that in turn is running in the `dataplane` namespace, SNMP scrapes
work as they did before, albeit with a few different interface names.
@@ -14,7 +14,7 @@ performance and versatility. For those of us who have used Cisco IOS/XR devices,
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.

-I've been working on the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)], which you
+I've been working on the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)], which you
can read all about in my series on VPP back in 2021:

[{: style="width:300px; float: right; margin-left: 1em;"}](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)

@@ -70,7 +70,7 @@ answered by a Response PDU.

Using parts of a Python Agentx library written by GitHub user hosthvo
[[ref](https://github.com/hosthvo/pyagentx)], I tried my hands at writing one of these AgentX's.
-The resulting source code is on [[GitHub](https://github.com/pimvanpelt/vpp-snmp-agent)]. That's the
+The resulting source code is on [[GitHub](https://git.ipng.ch/ipng/vpp-snmp-agent)]. That's the
one that's running in production ever since I started running VPP routers at IPng Networks AS8298.
After the _AgentX_ exposes the dataplane interfaces and their statistics into _SNMP_, an open source
monitoring tool such as LibreNMS [[ref](https://librenms.org/)] can discover the routers and draw

@@ -126,7 +126,7 @@ for any interface created in the dataplane.

I wish I were good at Go, but I never really took to the language. I'm pretty good at Python, but
sorting through the stats segment isn't super quick as I've already noticed in the Python3 based
-[[VPP SNMP Agent](https://github.com/pimvanpelt/vpp-snmp-agent)]. I'm probably the world's least
+[[VPP SNMP Agent](https://git.ipng.ch/ipng/vpp-snmp-agent)]. I'm probably the world's least
terrible C programmer, so maybe I can take a look at the VPP Stats Client and make sense of it. Luckily,
there's an example already in `src/vpp/app/vpp_get_stats.c` and it reveals the following pattern:

@@ -19,7 +19,7 @@ same time keep an IPng Site Local network with IPv4 and IPv6 that is separate fr
based on hardware/silicon based forwarding at line rate and high availability. You can read all
about my Centec MPLS shenanigans in [[this article]({{< ref "2023-03-11-mpls-core" >}})].

-Ever since the release of the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)]
+Ever since the release of the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)]
plugin in VPP, folks have asked "What about MPLS?" -- I have never really felt the need to go this
rabbit hole, because I figured that in this day and age, higher level IP protocols that do tunneling
are just as performant, and a little bit less of an 'art' to get right. For example, the Centec

@@ -459,6 +459,6 @@ and VPP, and the overall implementation before attempting to use in production.
we got at least some of this right, but testing and runtime experience will tell.

I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
-[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
+[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!
@@ -187,7 +187,7 @@ MPLS-VRF:0, fib_index:0 locks:[interface:4, CLI:1, lcp-rt:1, ]
[@1]: mpls via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254000210015254000310008847
```

-{{< image width="80px" float="left" src="/assets/vpp-mpls/lightbulb.svg" alt="Lightbulb" >}}
+{{< image width="80px" float="left" src="/assets/shared/lightbulb.svg" alt="Lightbulb" >}}

Haha, I love it when the brain-ligutbulb goes to the _on_ position. What's happening is that when we
turned on the MPLS feature on the VPP `tap` that is connected to `e0`, and VPP saw an MPLS packet,

@@ -385,5 +385,5 @@ and VPP, and the overall implementation before attempting to use in production.
we got at least some of this right, but testing and runtime experience will tell.

I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
-[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
+[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!

@@ -304,7 +304,7 @@ Gateway, just to show a few of the more advanced features of VPP. For me, this t
line of thinking: classifiers. This extract/match/act pattern can be used in policers, ACLs and
arbitrary traffic redirection through VPP's directed graph (eg. selecting a next node for
processing). I'm going to deep-dive into this classifier behavior in an upcoming article, and see
-how I might add this to [[vppcfg](https://github.com/pimvanpelt/vppcfg.git)], because I think it
+how I might add this to [[vppcfg](https://git.ipng.ch/ipng/vppcfg.git)], because I think it
would be super powerful to abstract away the rather complex underlying API into something a little
bit more ... user friendly. Stay tuned! :)

@@ -543,7 +543,7 @@ Whoa, what just happened here? The switch took the port defined by `pci/0000:03:
it is _splittable_ and has four lanes, and split it into four NEW ports called `swp1s0`-`swp1s3`,
and the resulting ports are 25G, 10G or 1G.

-{{< image width="100px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
+{{< image width="100px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

However, I make an important observation. When splitting `swp1` in 4, the switch also removed port
`swp2`, and remember at the beginning of this article I mentioned that the MAC addresses seemed to

@@ -243,7 +243,7 @@ any prefixes, for example this session in Düsseldorf:
};
```

-{{< image width="80px" float="left" src="/assets/debian-vpp/warning.png" alt="Warning" >}}
+{{< image width="80px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

This is where it's a good idea to grab some tea. Quite a few internet providers have
incredibly slow convergence, so just by stopping the announcment of `AS8298:AS-IPNG` prefixes at

@@ -548,7 +548,7 @@ for table in api_reply:
print(str)
```

-{{< image width="50px" float="left" src="/assets/vpp-papi/warning.png" alt="Warning" >}}
+{{< image width="50px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

Funny detail - it took me almost two years to discover `VppEnum`, which contains all of these
symbols. If you end up reading this after a Bing, Yahoo or DuckDuckGo search, feel free to buy
@@ -47,7 +47,7 @@ we'll use for performance testing, notably to compare the FreeBSD kernel routing
like `netmap`, and of course VPP itself. I do intend to do some side-by-side comparisons between
Debian and FreeBSD when they run VPP.

-{{< image width="100px" float="left" src="/assets/freebsd-vpp/brain.png" alt="brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="brain" >}}

If you know me a little bit, you'll know that I typically forget how I did a thing, so I'm using
this article for others as well as myself in case I want to reproduce this whole thing 5 years down

@@ -163,7 +163,7 @@ interfaces a bit. They need to be:
075.810547 main [301] Ready to go, ixl0 0x0/4 <-> ixl1 0x0/4.
```

-{{< image width="80px" float="left" src="/assets/freebsd-vpp/warning.png" alt="Warning" >}}
+{{< image width="80px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

I start my first loadtest, which pretty immediately fails. It's an interesting behavior pattern which
I've not seen before. After staring at the problem, and reading the code of `bridge.c`, which is a

@@ -63,7 +63,7 @@ Let me discuss these two purposes in more detail:

### 1. IPv4 ARP, née IPv6 NDP

-{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

One really neat trick is simply replacing ARP resolution by something that can resolve the
link-layer MAC address in a different way. As it turns out, IPv6 has an equivalent that's

@@ -359,7 +359,7 @@ does not have an IPv4 address. Except -- I'm bending the rules a little bit by d
There's an internal function `ip4_sw_interface_enable_disable()` which is called to enable IPv4
processing on an interface once the first IPv4 address is added. So my first fix is to force this to
be enabled for any interface that is exposed via Linux Control Plane, notably in `lcp_itf_pair_create()`
-[[here](https://github.com/pimvanpelt/lcpng/blob/main/lcpng_interface.c#L777)].
+[[here](https://git.ipng.ch/ipng/lcpng/blob/main/lcpng_interface.c#L777)].

This approach is partially effective:

@@ -500,7 +500,7 @@ which is unnumbered. Because I don't know for sure if everybody would find this
I make sure to guard the behavior behind a backwards compatible configuration option.

If you're curious, please take a look at the change in my [[GitHub
-repo](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
+repo](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
which I:
1. add a new configuration option, `lcp-sync-unnumbered`, which defaults to `on`. That would be
what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.
@@ -147,7 +147,7 @@ With all of that, I am ready to demonstrate two working solutions now. I first c
Ondrej's [[commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)].
Then, I compile VPP with my pending [[gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. Finally,
to demonstrate how `update_loopback_addr()` might work, I compile `lcpng` with my previous
-[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
which allows me to inhibit copying forward addresses from VPP to Linux, when using _unnumbered_
interfaces.

@@ -242,7 +242,7 @@ even if the interface link stays up. It's described in detail in
[[RFC5880](https://www.rfc-editor.org/rfc/rfc5880.txt)], and I use it at IPng Networks all over the
place.

-{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

Then I'll configure two OSPF protocols, one for IPv4 called `ospf4` and another for IPv6 called
`ospf6`. It's easy to overlook, but while usually the IPv4 protocol is OSPFv2 and the IPv6 protocol

@@ -1,8 +1,9 @@
---
date: "2024-04-27T10:52:11Z"
-title: FreeIX - Remote
+title: "FreeIX Remote - Part 1"
aliases:
- /s/articles/2024/04/27/freeix-1.html
+- /s/articles/2024/04/27/freeix-remote/
---

# Introduction

@@ -91,7 +92,7 @@ their traffic to these remote internet exchanges.
There are two types of BGP neighbor adjacency:

1. ***Members***: these are {ip-address,AS}-tuples which FreeIX has explicitly configured. Learned prefixes are added
-to as-set AS50869:AS-MEMBERS. Members receive _all_ prefixes from FreeIX, each annotated with BGP **informational**
+to as-set AS50869:AS-MEMBERS. Members receive _some or all_ prefixes from FreeIX, each annotated with BGP **informational**
communities, and members can drive certain behavior with BGP **action** communities.

1. ***Peers***: these are all other entities with whom FreeIX has an adjacency at public internet exchanges or private
@@ -195,12 +196,12 @@ network interconnects:
* `(50869,3020,1)`: Inhibit Action (30XX), Country (3020), Switzerland (1)
* `(50869,3030,1308)`: Inhibit Action (30XX), IXP (3030), PeeringDB IXP for LS-IX (1308)

-Further actions can be placed on a per-remote-neighbor basis:
+Four actions can be placed on a per-remote-asn basis:

* `(50869,3040,13030)`: Inhibit Action (30XX), AS (3040), Init7 (AS13030)
-* `(50869,3041,6939)`: Prepend Action (30XX), Prepend Once (3041), Hurricane Electric (AS6939)
-* `(50869,3042,12859)`: Prepend Action (30XX), Prepend Twice (3042), BIT BV (AS12859)
-* `(50869,3043,8283)`: Prepend Action (30XX), Prepend Three Times (3043), Coloclue (AS8283)
+* `(50869,3100,6939)`: Prepend Once Action (3100), Hurricane Electric (AS6939)
+* `(50869,3200,12859)`: Prepend Twice Action (3200), BIT BV (AS12859)
+* `(50869,3300,8283)`: Prepend Thice Action (3300), Coloclue (AS8283)

Peers cannot set these actions, as all action communities will be stripped on ingress. Members can set these action
communities on their sessions with FreeIX routers, however in some cases they may also be set by FreeIX operators when

@@ -58,7 +58,8 @@ argument of resistance? Nerd-snipe accepted!

Let me first introduce the mail^W main characters of my story:

-| {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} |
+| {{< image src="/assets/smtp/postfix_logo.png" width="8em" >}} | {{< image src="/assets/smtp/dovecot_logo.png" width="8em" >}} | {{< image src="/assets/smtp/nginx_logo.png" width="8em" >}} | {{< image src="/assets/smtp/rspamd_logo.png" width="8em" >}} | {{< image src="/assets/smtp/unbound_logo.png" width="8em" >}} | {{< image src="/assets/smtp/roundcube_logo.png" width="8em" >}} |
| ---- | ---- | ---- | ---- | ---- | ---- |

* ***Postfix***: is Wietse Venema's mail server that started life at IBM research as an
alternative to the widely-used Sendmail program. After eight years at Google, Wietse continues

@@ -444,7 +445,7 @@ pim@squanchy:~$ sudo cat /etc/mail/secrets
ipng bastion:<haha-made-you-look>
```

-{{< image width="120px" float="left" src="/assets/smtp/lightbulb.svg" alt="Lightbulb" >}}
+{{< image width="120px" float="left" src="/assets/shared/lightbulb.svg" alt="Lightbulb" >}}

What happens here is, every time this server `squanchy` wants to send an e-mail, it will use an SMTP
session with TLS, on port 587, of the machine called `smtp-out.ipng.ch`, and it'll authenticate
@@ -101,6 +101,7 @@ IPv6 network and access the internet via a shared IPv6 address.
I will assign a pool of four public IPv4 addresses and eight IPv6 addresses to each border gateway:

| **Machine** | **IPv4 pool** | **IPv6 pool** |
| ----------- | ------------- | ------------- |
| border0.chbtl0.net.ipng.ch | <span style='color:green;'>194.126.235.0/30</span> | <span style='color:blue;'>2001:678:d78::3:0:0/125</span> |
| border0.chrma0.net.ipng.ch | <span style='color:green;'>194.126.235.4/30</span> | <span style='color:blue;'>2001:678:d78::3:1:0/125</span> |
| border0.chplo0.net.ipng.ch | <span style='color:green;'>194.126.235.8/30</span> | <span style='color:blue;'>2001:678:d78::3:2:0/125</span> |

@@ -305,7 +306,7 @@ switches, I will announce:
towards DNS64-rewritten destinations, for example 2001:678:d78:564::8c52:7903 as DNS64 representation
of github.com, which is reachable only at legacy address 140.82.121.3.

-{{< image width="100px" float="left" src="/assets/nat64/brain.png" alt="Brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

I have to be careful with the announcements into OSPF. The cost of E1 routes is the cost of the
external metric **in addition to** the internal cost within OSPF to reach that network. The cost

@@ -250,10 +250,10 @@ remove the IPv4 and IPv6 addresses from the <span style='color:red;font-weight:b
routers in Brüttisellen. They are directly connected, and if anything goes wrong, I can walk
over and rescue them. Sounds like a safe way to start!

-I quickly add the ability for [[vppcfg](https://github.com/pimvanpelt/vppcfg)] to configure
+I quickly add the ability for [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] to configure
_unnumbered_ interfaces. In VPP, these are interfaces that don't have an IPv4 or IPv6 address of
their own, but they borrow one from another interface. If you're curious, you can take a look at the
-[[User Guide](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
+[[User Guide](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
GitHub.

Looking at their `vppcfg` files, the change is actually very easy, taking as an example the

@@ -280,7 +280,7 @@ By commenting out the `addresses` field, and replacing it with `unnumbered: loop
vppcfg to make Te6/0/0, which in Linux is called `xe1-0`, borrow its addresses from the loopback
interface `loop0`.

-{{< image width="100px" float="left" src="/assets/freebsd-vpp/brain.png" alt="brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="brain" >}}

Planning and applying this is straight forward, but there's one detail I should
mention. In my [[previous article]({{< ref "2024-04-06-vpp-ospf" >}})] I asked myself a question:

@@ -291,7 +291,7 @@ interface.

In the article, you'll see that discussed as _Solution 2_, and it includes a bit of rationale why I
find this better. I implemented it in this
-[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
+[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
case you're curious, and the commandline keyword is `lcp lcp-sync-unnumbered off` (the default is
_on_).
@@ -292,7 +292,7 @@ transmitting, or performing both receiving *and* transmitting.

### Intel X520 (10GbE)

-{{< image width="100px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
+{{< image width="100px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}

This network card is based on the classic Intel _Niantic_ chipset, also known as the 82599ES chip,
first released in 2009. It's super reliable, but there is one downside. It's a PCIe v2.0 device

@@ -462,7 +462,7 @@ ip4-rewrite active 14845221 35913927 0 8.9
unix-epoll-input polling 22551 0 0 1.37e3 0.00
```

-{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

I kind of wonder why that is. Is the Mellanox Connect-X3 such a poor performer? Or does it not like
small packets? I've read online that Mellanox cards do some form of message compression on the PCI

@@ -407,7 +407,7 @@ loadtest:

{{< image src="/assets/gowin-n305/cx5-cpu-rdma1q.png" alt="Cx5 CPU with 1Q" >}}

-{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
+{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}

Here I can clearly see that the one CPU thread (in yellow for unidirectional) and the two CPU
therads (one for each of the bidirectional flows) jump up to 100% and stay there. This means that
452
content/articles/2024-08-12-jekyll-hugo.md
Normal file
452
content/articles/2024-08-12-jekyll-hugo.md
Normal file
@@ -0,0 +1,452 @@
|
||||
---
|
||||
date: "2024-08-12T09:01:23Z"
|
||||
title: 'Case Study: From Jekyll to Hugo'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image width="16em" float="right" src="/assets/jekyll-hugo/before.png" alt="ipng.nl before" >}}
|
||||
|
||||
In the _before-days_, I had a very modest personal website running on [[ipng.nl](https://ipng.nl)]
|
||||
and [[ipng.ch](https://ipng.ch/)]. Over the years I've had quite a few different designs, and
|
||||
although one of them was hosted (on Google Sites) for a brief moment, they were mostly very much web
|
||||
1.0, "The 90s called, they wanted their website back!" style.
|
||||
|
||||
The site didn't have much other than a little blurb on a few open source projects of mine, and a
|
||||
gallery hosted on PicasaWeb [which Google subsequently turned down], and a mostly empty Blogger
|
||||
page. Would you imagine that I hand-typed the XHTML and CSS for this website, where the menu at the
|
||||
top (thinks like `Home` - `Resume` - `History` - `Articles`) would just have a HTML page which
|
||||
meticulously linked to the other HTML pages. It was the way of the world, in the 1990s.
|
||||
|
||||
## Jekyll
|
||||
|
||||
{{< image width="9em" float="right" src="/assets/jekyll-hugo/jekyll-logo.png" alt="Jekyll" >}}
|
||||
|
||||
My buddy Michal suggested in May of 2021 that, if I was going to write all of the HTML skeleton by
|
||||
hand, I may as well switch to a static website generator. He's fluent in Ruby, and suggested I take
|
||||
a look at [[Jekyll](https://jekyllrb.com/)], a static site generator. It takes text written in
|
||||
your favorite markup language and uses layouts to create a static website. You can tweak the site’s
|
||||
look and feel, URLs, the data displayed on the page, and more.
|
||||
|
||||
I immediately fell in love! As an experiment, I moved [[IPng.ch](https://ipng.ch)] to a new
|
||||
webserver, and kept my personal website on [[IPng.nl](https://ipng.nl)]. I had always wanted to
|
||||
write a little bit more about technology, and since I was working on an interesting project [[Linux
|
||||
Control Plane]({{< ref 2021-08-12-vpp-1 >}})] in VPP, I thought it'd be nice to write a little bit
|
||||
about it, but certainly not while hand-crafting all of the HTML exoskeleton. I just wanted to write
|
||||
Markdown, and this is precisely the _raison d'être_ of Jekyll!
|
||||
|
||||
Since April 2021, I wrote in total 67 articles with Jekyll. Some of them proved to become quite
|
||||
popular, and (_humblebrag_) my website is widely considered one of the best resources for Vector
|
||||
Packet Processing, with my [[VPP]({{< ref 2021-09-21-vpp-7 >}})] series, [[MPLS]({{< ref
|
||||
2023-05-07-vpp-mpls-1 >}})] series and a few others like the [[Mastodon]({{< ref
|
||||
2022-11-20-mastodon-1 >}})] series being amongst some of the top visited articles, with ~7.5-8K
|
||||
monthly unique visitors.
|
||||
|
||||
## The catalyst
|
||||
|
||||
There were two distinct events that lead up to this. Firstly, I started a side project called [[Free
|
||||
IX](https://free-ix.ch/)], which I also created in Jekyll. When I did that, I branched the
|
||||
[[IPng.ch](https://ipng.ch)] site, but the build faild with Ruby errors. My buddy Antonios fixed
|
||||
those, and we were underway. Secondly, later on I attempted to upgrade the IPng website to the same
|
||||
fixes that Antonios had provided for Free-IX, and all hell broke loose (luckily, only in staging
|
||||
environment). I spent several hours pulling my hear out re-assembling the dependencies, downgrading
|
||||
Jekyll, pulling new `gems`, downgrading `ruby`. Finally, I got it to work again, only to see after
|
||||
my first production build, that the build immediately failed because the Docker container that does
|
||||
the build no longer liked what I had put in the `Gemfile` and `_config.yml`. It was something to do
|
||||
with `sass-embedded` gem, and I spent waaaay too long fixing this incredibly frustrating breakage.
|
||||
|
||||
## Hugo
|
||||
|
||||
{{< image width="9em" float="right" src="/assets/jekyll-hugo/hugo-logo-wide.svg" alt="Hugo" >}}
|
||||
|
||||
When I made my roadtrip from Zurich to the North Cape with my buddy Paul, we took extensive notes on
|
||||
our daily travels, and put them on a [[2022roadtripnose](https://2022roadtripnose.weirdnet.nl/)]
|
||||
website. At the time, I was looking for a photo caroussel for Jekyll, and while I found a few, none
|
||||
of them really worked in the way I wanted them to. I stumbled across [[Hugo](https://gohugo.io)],
|
||||
which says on its website that it is one of the most popular open-source static site generators.
|
||||
With its amazing speed and flexibility, Hugo makes building websites fun again. So I dabbled a bit
|
||||
and liked what I saw. I used the [[notrack](https://github.com/gevhaz/hugo-theme-notrack)] theme from
|
||||
GitHub user `@gevhaz`, as they had made a really nice gallery widget (called a `shortcode` in Hugo).
|
||||
|
||||
The main reason for me to move to Hugo is that it is a **standalone Go** program, with no runtime or
|
||||
build time dependencies. The Hugo [[GitHub](https://github.com/gohugoio/hugo)] delivers ready to go
|
||||
build artifacts, tests amd releases regularly, and has a vibrant user community.
|
||||
|
||||
### Migrating
|
||||
|
||||
I have only a few strong requirements if I am to move my website:
|
||||
|
||||
1. The site's URL namespace MUST be *identical* (not just similar) to Jekyll. I do not want to
|
||||
lose my precious ranking on popular search engines.
|
||||
1. MUST be built in a CI/CD tool like Drone or Jenkins, and autodeploy
|
||||
1. Code MUST be _hermetic_, not pulling in external dependencies, neither in the build system (eg.
|
||||
Hugo itself) nor the website (eg. dependencies, themes, etc).
|
||||
1. Theme MUST support images, videos and SHOULD support asciinema.
|
||||
1. Theme SHOULD try to look very similar to the current Jekyll `minima` theme.
|
||||
|
||||
|
||||
#### Attempt 1: Auto import ❌
|
||||
|
||||
With that in mind, I notice that Hugo has a site _importer_, that can import a site from Jekyll! I
|
||||
run it, but it produces completely broken code, and Hugo doesn't even want to compile the site. This
|
||||
turns out to be a _theme_ issue, so I take Hugo's advice and install the recommended theme. The site
|
||||
comes up, but is pretty screwed up. I now realize that the `hugo import jekyll` imports the markdown
|
||||
as-is, and only rewrites the _frontmatter_ (the little blurb of YAML metadata at the top of each
|
||||
file). Two notable problems:
|
||||
|
||||
**1. images** - I make liberal use of Markdown images, which in Jekyll can be decorated with CSS
|
||||
styling, like so:
|
||||
```
|
||||
{: style="width:200px; float: right; margin: 1em;"}
|
||||
```
|
||||
|
||||
**2. post_url** - Another widely used feature is cross-linking my own articles, using Jekyll
|
||||
template expansion, like so:
|
||||
```
|
||||
.. Remember in my [[VPP Babel]({% post_url 2024-03-06-vpp-babel-1 %})] ..
|
||||
```
|
||||
|
||||
I do some grepping, and have 246 such Jekyll template expansions, and 272 images OK, that's a dud.
|
||||
|
||||
#### Attempt 2: Skeleton ✅
|
||||
|
||||
I decide to do this one step at a time. First, I create a completely new website `hugo new site
|
||||
ipng.ch`, download the `notrack` theme, and add only the front page `index.md` from the
|
||||
original IPng site. OK, that renders.
|
||||
|
||||
Now comes a fun part: going over the `notrack` theme's SCSS to adjust it to look and feel similar to
|
||||
the Jekyll `minima` theme. I change a bunch of stuff in the skeleton of the website:
|
||||
|
||||
First, I take a look at the site media breakpoints, so that they feel correct on desktop, tablet
|
||||
and iPhone/Android screens. Then, I inspect the font family, size and H1/H2/H3...
|
||||
magnifications, also scaling them with media size. Finally I notice the footer, which in `notrack`
|
||||
spans the whole width of the browser. I change it to be as wide as the header and main page.
|
||||
|
||||
I go one by one on the site's main pages and, just as on the Jekyll site, I make them into menu
|
||||
items at the top of the page. The [[Services]({{< ref services >}})] page serves as my proof of
|
||||
concept, as it has both the `image` and the `post_url` pattern in Jekyll. It references six articles
|
||||
and has two images which float on the right side of the canvas. If I can figure out how to rewrite
|
||||
these to fit the Hugo variants of the same pattern, I should be home free.
|
||||
|
||||
### Hugo: image
|
||||
|
||||
The idiomatic way in `notrack` is an `image` shortcode. I hope you know where to find the curly
|
||||
braces on your keyboard - because geez, Hugo templating sure does like them!
|
||||
|
||||
```
|
||||
<figure class="image-shortcode{{ with .Get "class" }} {{ . }}{{ end }}
|
||||
{{- with .Get "wide" }}{{- if eq . "true" }} wide{{ end -}}{{ end -}}
|
||||
{{- with .Get "frame" }}{{- if eq . "true" }} frame{{ end -}}{{ end -}}
|
||||
{{- with .Get "float" }} {{ . }}{{ end -}}"
|
||||
style="
|
||||
{{- with .Get "width" }}width: {{ . }};{{ end -}}
|
||||
{{- with .Get "height" }}height: {{ . }};{{ end -}}">
|
||||
{{- if .Get "link" -}}
|
||||
<a href="{{ .Get "link" }}"{{ with .Get "target" }} target="{{ . }}"{{ end -}}
|
||||
{{- with .Get "rel" }} rel="{{ . }}"{{ end }}>
|
||||
{{- end }}
|
||||
<img src="{{ .Get "src" | relURL }}"
|
||||
{{- if or (.Get "alt") (.Get "caption") }}
|
||||
alt="{{ with .Get "alt" }}{{ replace . "'" "'" }}{{ else -}}
|
||||
{{- .Get "caption" | markdownify| plainify }}{{ end }}"
|
||||
{{- end -}}
|
||||
/> <!-- Closing img tag -->
|
||||
{{- if .Get "link" }}</a>{{ end -}}
|
||||
{{- if or (or (.Get "title") (.Get "caption")) (.Get "attr") -}}
|
||||
<figcaption>
|
||||
{{ with (.Get "title") -}}
|
||||
<h4>{{ . }}</h4>
|
||||
{{- end -}}
|
||||
{{- if or (.Get "caption") (.Get "attr") -}}<p>
|
||||
{{- .Get "caption" | markdownify -}}
|
||||
{{- with .Get "attrlink" }}
|
||||
<a href="{{ . }}">
|
||||
{{- end -}}
|
||||
{{- .Get "attr" | markdownify -}}
|
||||
{{- if .Get "attrlink" }}</a>{{ end }}</p>
|
||||
{{- end }}
|
||||
</figcaption>
|
||||
{{- end }}
|
||||
</figure>
|
||||
```
|
||||
|
||||
From the top - Hugo creates a figure with a certain set of classes, the default `image-shortcode`
|
||||
but also classes for `frame`, `wide` and `float` to further decorate the image. Then it applies
|
||||
direct styling for `width` and `height`, optionally inserts a link (something I had missed out on in
|
||||
Jekyll), then inlines the `<img>` tag with an `alt` or (markdown based!) `caption`. It then reuses
|
||||
the `caption` or `title` or `attr` variables to assemble a `<figcaption>` block. I absolutely love it!
|
||||
|
||||
I've rather consistently placed my images by themselves, on a single line, and they all have at
|
||||
least one style (be it `width`, or `float`), so it's really straightforward to rewrite this with a
|
||||
little bit of Python:
|
||||
|
||||
```
|
||||
import re
import sys

def convert_image(line):
|
||||
p = re.compile(r'^!\[(.+)\]\((.+)\){:\s*(.*)}')
|
||||
m = p.match(line)
|
||||
if not m:
|
||||
return False
|
||||
|
||||
alt=m.group(1)
|
||||
src=m.group(2)
|
||||
style=m.group(3)
|
||||
|
||||
image_line = "{{</* image "
|
||||
if sm := re.search(r'width:\s*(\d+px)', style):
|
||||
image_line += f'width="{sm.group(1)}" '
|
||||
if sm := re.search(r'float:\s*(\w+)', style):
|
||||
image_line += f'float="{sm.group(1)}" '
|
||||
image_line += f'src="{src}" alt="{alt}" */>}}}}'
|
||||
|
||||
print(image_line)
|
||||
return True
|
||||
|
||||
with open(sys.argv[1], "r", encoding="utf-8") as file_handle:
|
||||
for line in file_handle.readlines():
|
||||
if not convert_image(line):
|
||||
print(line.rstrip())
|
||||
```
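To make that concrete, a hypothetical article line before and after conversion (path and styling invented for the example, but this is exactly the rewrite the script performs):

```
Jekyll: ![Frontends](/assets/frontends/nginx.png){: style="width:400px; float: right; margin: 1em;"}
Hugo:   {{</* image width="400px" float="right" src="/assets/frontends/nginx.png" alt="Frontends" */>}}
```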
|
||||
|
||||
### Hugo: ref
|
||||
|
||||
In Hugo, the idiomatic way to reference another document in the corpus is with the builtin `ref`
|
||||
shortcode, requiring a single argument: the path to a content document, with or without a file
|
||||
extension, with or without an anchor. Paths without a leading / are first resolved relative to the
|
||||
current page, then to the remainder of the site. This is super cool, because I can essentially
|
||||
reference any file by just its name!
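For example, assuming an article file named `2024-03-06-vpp-babel-1.md` under `content/articles/`, both of these resolve to the same page:

```
[[VPP Babel]({{</* ref 2024-03-06-vpp-babel-1 */>}})]
[[VPP Babel]({{</* ref "/articles/2024-03-06-vpp-babel-1.md" */>}})]
```

Converting all of the Jekyll `post_url` expansions then boils down to a one-liner: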
|
||||
|
||||
```
|
||||
for fn in $(find content/ -name \*.md); do
|
||||
sed -i -r 's/{%[ ]?post_url (.*)[ ]?%}/{{</* ref \1 */>}}/' $fn
|
||||
done
|
||||
```
|
||||
|
||||
And with that, the converted markdown from Jekyll renders perfectly in Hugo. Of course, other sites
|
||||
may use other templating commands, but for [[IPng.ch](https://ipng.ch)], these were the only two
|
||||
special cases.
|
||||
|
||||
### Hugo: URL redirects
|
||||
|
||||
It is a hard requirement for me to keep the same URLs that I had from Jekyll. Luckily, this is a
|
||||
trivial matter for Hugo, as it supports URL aliases in the _frontmatter_. Jekyll will add a file
|
||||
extension to the article _slugs_, while Hugo uses only the directory and serves an `index.html` from
|
||||
it. Also, the default for Hugo is to put content in a different directory.
|
||||
|
||||
The first change I make is to the main `hugo.toml` config file:
|
||||
|
||||
```
|
||||
[permalinks]
|
||||
articles = "/s/articles/:year/:month/:day/:slug"
|
||||
```
|
||||
|
||||
That solves the main directory problem, as back then, I chose `s/articles/` in Jekyll. Then, adding
|
||||
the URL redirect is a simple matter of looking up which filename Jekyll ultimately used, and adding
|
||||
a little frontmatter at the top of each article, for example my [[VPP #1]({{< ref
|
||||
2021-08-12-vpp-1 >}})] article would get this addition:
|
||||
|
||||
```
|
||||
---
|
||||
date: "2021-08-12T11:17:54Z"
|
||||
title: VPP Linux CP - Part1
|
||||
aliases:
|
||||
- /s/articles/2021/08/12/vpp-1.html
|
||||
---
|
||||
```
|
||||
|
||||
Hugo by default renders it in `/s/articles/2021/08/12/vpp-linux-cp-part1/index.html` but the
|
||||
addition of the `alias` makes it also generate a drop-in placeholder HTML page that offers a
|
||||
permanent redirect, cleverly setting `noindex` for web crawlers and offering the `canonical` link
|
||||
to the new location:
|
||||
|
||||
```
|
||||
$ curl https://ipng.ch/s/articles/2021/08/12/vpp-1.html
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-us">
|
||||
<head>
|
||||
<title>https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/</title>
|
||||
<link rel="canonical" href="https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/">
|
||||
<meta name="robots" content="noindex">
|
||||
<meta charset="utf-8">
|
||||
<meta http-equiv="refresh" content="0; url=https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/">
|
||||
</head>
|
||||
</html>
|
||||
```
|
||||
|
||||
### Hugo: Asciinema
|
||||
|
||||
One thing that I always wanted to add is the ability to inline [[Asciinema](https://asciinema.org)]
|
||||
screen recordings. First, I take a look at what is needed to serve Asciinema: One Javascript file,
|
||||
and one CSS file, followed by a named `<div>` which invokes the Javascript. Armed with that
|
||||
knowledge, I dive into the `shortcode` language a little bit:
|
||||
|
||||
```
|
||||
$ cat themes/hugo-theme-ipng/layouts/shortcodes/asciinema.html
|
||||
<div id='{{ .Get "src" | replaceRE "[[:^alnum:]]" "" }}'></div>
|
||||
<script>
|
||||
AsciinemaPlayer.create("{{ .Get "src" }}",
|
||||
document.getElementById('{{ .Get "src" | replaceRE "[[:^alnum:]]" "" }}'));
|
||||
</script>
|
||||
```
|
||||
|
||||
This file creates the `id` of the `<div>` by means of stripping all non-alphanumeric characters from
|
||||
the `src` argument of the _shortcode_. So if I were to create an `{{</* asciinema
|
||||
src='/casts/my.cast' */>}}`, the resulting DIV will be uniquely called `castsmycast`. This way, I
|
||||
can add multiple screencasts in the same document, which is dope.
|
||||
|
||||
But, as I now know, I need to load some CSS and JS so that the `AsciinemaPlayer` class becomes
|
||||
available. For this, I use a relatively new feature in Hugo, which allows for `params` to be set in
|
||||
the frontmatter, for example in the [[VPP OSPF #2]({{< ref 2024-06-22-vpp-ospf-2 >}})] article:
|
||||
|
||||
```
|
||||
---
|
||||
date: "2024-06-22T09:17:54Z"
|
||||
title: VPP with loopback-only OSPFv3 - Part 2
|
||||
aliases:
|
||||
- /s/articles/2024/06/22/vpp-ospf-2.html
|
||||
params:
|
||||
asciinema: true
|
||||
---
|
||||
```
|
||||
|
||||
The presence of that `params.asciinema` can be used in any page, including the HTML skeleton of the
|
||||
theme, like so:
|
||||
|
||||
```
|
||||
$ cat themes/hugo-theme-ipng/layouts/partials/head.html
|
||||
<head>
|
||||
...
|
||||
{{ if eq .Params.asciinema true -}}
|
||||
<link rel="stylesheet" type="text/css" href="{{ "css/asciinema-player.css" | relURL }}" />
|
||||
<script src="{{ "js/asciinema-player.min.js" | relURL }}"></script>
|
||||
{{- end }}
|
||||
</head>
|
||||
```
|
||||
|
||||
Now all that's left for me to do is drop the two Asciinema player files in their respective theme
|
||||
directories, and for each article that wants to use an Asciinema, set the `param` and it'll ship the
|
||||
CSS and Javascript to the browser. I think I'm going to have a good relationship with Hugo :)
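For completeness, the two player files simply go into the theme's static asset directories that `head.html` references above, assuming the usual Hugo `static/` layout:

```
$ cp asciinema-player.css themes/hugo-theme-ipng/static/css/
$ cp asciinema-player.min.js themes/hugo-theme-ipng/static/js/
```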
|
||||
|
||||
### Gitea: Large File Support
|
||||
|
||||
One mistake I made with the old Jekyll-based website is that I checked in all of the images and
|
||||
binary files directly into Git. This bloats the repository and is otherwise completely unnecessary.
|
||||
For this new repository, I enable [[Git LFS](https://git-lfs.com/)], which is available for OpenBSD
|
||||
(packages), Debian (apt) and MacOS (homebrew). Turning this on is very simple:
|
||||
|
||||
```
|
||||
$ brew install git-lfs
|
||||
$ cd ipng.ch
|
||||
$ git lfs install
|
||||
$ for i in gz png gif jpg jpeg tgz zip; do \
|
||||
    git lfs track "*.$i"; \
|
||||
    git lfs migrate import --everything --include="*.$i"; \
|
||||
done
|
||||
$ git push --force --all
|
||||
```
|
||||
|
||||
The `force` push rewrites the history of the repo to reference the binary blobs in LFS instead of
|
||||
directly in the repo. As a result, the size of the repository greatly shrinks, and handling it
|
||||
becomes easier once it grows. A really nice feature!
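A quick way to sanity-check the result afterwards (output omitted, it will differ per repository):

```
$ git lfs ls-files | head -3
$ git count-objects -vH
```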
|
||||
|
||||
### Gitea: CI/CD with Drone
|
||||
|
||||
At IPng, I run a [[Gitea](https://gitea.io)] server, which is one of the coolest pieces of open
|
||||
source that I use on a daily basis. There's a very clean integration of a continuous integration
|
||||
tool called [[Drone](https://drone.io/)] and these two tools are literally made for each other.
|
||||
Drone can be enabled for any Git repo in Gitea, and given the presence of a `.drone.yml` file,
|
||||
execute a set of steps upon repository events, called _triggers_. It can then run a sequence of
|
||||
steps, hermetically in a Docker container called a _drone-runner_, which first checks out the
|
||||
repository at the latest commit, and then does whatever I'd like with it. I'd like to build and
|
||||
distribute a Hugo website, please!
|
||||
|
||||
As it turns out, there is a [[Drone Hugo](https://plugins.drone.io/plugins/hugo)] plugin available,
|
||||
but it seems to be very outdated. Luckily, this being open source and all, I can download the source
|
||||
on [[GitHub](https://github.com/drone-plugins/drone-hugo)], and in the `Dockerfile`, bump the Alpine
|
||||
version, the Go version and build the latest Hugo release, which is 0.130.1 at the moment. I really
|
||||
do need this version, because the `params` feature was introduced in 0.123 and the upstream package
|
||||
is still for 0.77 -- which is about four years old. Ouch!
|
||||
|
||||
I build a Docker image and upload it to my private repo at IPng, which is also hosted on Gitea, by
|
||||
the way. As I said, it really is a great piece of kit! In case anybody else would like to give it a
|
||||
whirl, ping me on Mastodon or e-mail and I'll upload one to public Docker Hub as well.
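The build-and-push itself is nothing special, roughly this with my own registry path and tag (the interesting part is the Dockerfile version bumps described above):

```
$ git clone https://github.com/drone-plugins/drone-hugo && cd drone-hugo
$ docker build -t git.ipng.ch/ipng/drone-hugo:release-0.130.0 .
$ docker push git.ipng.ch/ipng/drone-hugo:release-0.130.0
```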
|
||||
|
||||
### Putting it all together
|
||||
|
||||
With Drone activated for this repo, and the Drone Hugo plugin built with a new version, I can submit
|
||||
the following file to the root directory of the `ipng.ch` repository:
|
||||
|
||||
|
||||
```
|
||||
$ cat .drone.yml
|
||||
kind: pipeline
|
||||
name: default
|
||||
|
||||
steps:
|
||||
- name: git-lfs
|
||||
image: alpine/git
|
||||
commands:
|
||||
- git lfs install
|
||||
- git lfs pull
|
||||
- name: build
|
||||
image: git.ipng.ch/ipng/drone-hugo:release-0.130.0
|
||||
settings:
|
||||
hugo_version: 0.130.0
|
||||
extended: true
|
||||
- name: rsync
|
||||
image: drillster/drone-rsync
|
||||
settings:
|
||||
user: drone
|
||||
key:
|
||||
from_secret: drone_sshkey
|
||||
hosts:
|
||||
- nginx0.chrma0.net.ipng.ch
|
||||
- nginx0.chplo0.net.ipng.ch
|
||||
- nginx0.nlams1.net.ipng.ch
|
||||
- nginx0.nlams2.net.ipng.ch
|
||||
port: 22
|
||||
args: '-6u --delete-after'
|
||||
source: public/
|
||||
target: /var/www/ipng.ch/
|
||||
recursive: true
|
||||
secrets: [ drone_sshkey ]
|
||||
|
||||
image_pull_secrets:
|
||||
- git_ipng_ch_docker
|
||||
```
|
||||
|
||||
The file is relatively self-explanatory. Before my first step runs, Drone already checks out the
|
||||
repo in the current working directory of the Docker container. I then use the `alpine/git` image
|
||||
and run the `git lfs install` and `git lfs pull` commands to resolve the LFS symlinks into actual
|
||||
files by pulling those objects that are referenced (and, notably, not all historical versions of any
|
||||
binary file ever added to the repo).
|
||||
|
||||
Then, I run a step called `build` which invokes the Hugo Drone package that I created before.
|
||||
|
||||
Finally, I run a step called `rsync` which uses the `drillster/drone-rsync` image to rsync-over-ssh
|
||||
the files to the four NGINX servers running at IPng: two in Amsterdam, one in Geneva and one in
|
||||
Zurich.
|
||||
|
||||
One really cool feature is the use of so called _Drone Secrets_ which are references to locked
|
||||
secrets such as the SSH key, and, notably, the Docker Repository credentials, because Gitea at IPng
|
||||
does not run a public docker repo. Using secrets is nifty, because it allows me to safely check in the
|
||||
`.drone.yml` configuration file without leaking any specifics.
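For the record, adding such a secret with the Drone CLI looks roughly like this (repository slug and key path are illustrative, and the exact flags may differ per Drone version):

```
$ drone secret add --repository ipng/ipng.ch --name drone_sshkey --data @/home/drone/.ssh/id_ed25519
```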
|
||||
|
||||
### NGINX and SSL
|
||||
|
||||
Now that the website is automatically built and rsync'd to the webservers upon every `git merge`,
|
||||
all that's left for me to do is arm the webservers with SSL certificates. I actually wrote a whole
|
||||
story about specifically that, as for `*.ipng.ch` and `*.ipng.nl` and a bunch of others,
|
||||
periodically there is a background task that retrieves multiple wildcard certificates with Let's
|
||||
Encrypt, and distributes them to any server that needs them (like the NGINX cluster, or the Postfix
|
||||
cluster). I wrote about the [[Frontends]({{< ref 2023-03-17-ipng-frontends >}})], the spiffy
|
||||
[[DNS-01]({{< ref 2023-03-24-lego-dns01.md >}})] certificate subsystem, and the internal network
|
||||
called [[IPng Site Local]({{< ref 2023-03-11-mpls-core >}})] each in their own articles, so I won't
|
||||
repeat that information here.
|
||||
|
||||
## The Results
|
||||
|
||||
The results are really cool, as I'll demonstrate in this video. I can just submit and merge this
|
||||
change, and it'll automatically kick off a build and push. Take a look at this video which was
|
||||
performed in real time as I pushed this very article live:
|
||||
|
||||
{{< video src="https://ipng.ch/media/vdo/hugo-drone.mp4" >}}
|
||||
238
content/articles/2024-09-03-asr9001.md
Normal file
@@ -0,0 +1,238 @@
|
||||
---
|
||||
date: "2024-09-03T13:07:54Z"
|
||||
title: Loadtest notes, ASR9001
|
||||
draft: true
|
||||
---
|
||||
|
||||
### L2 point-to-point (L2XC) config
|
||||
|
||||
```
|
||||
interface TenGigE0/0/0/0
|
||||
mtu 9216
|
||||
load-interval 30
|
||||
l2transport
|
||||
!
|
||||
!
|
||||
interface TenGigE0/0/0/1
|
||||
mtu 9216
|
||||
load-interval 30
|
||||
l2transport
|
||||
!
|
||||
!
|
||||
interface TenGigE0/0/0/2
|
||||
mtu 9216
|
||||
load-interval 30
|
||||
l2transport
|
||||
!
|
||||
!
|
||||
interface TenGigE0/0/0/3
|
||||
mtu 9216
|
||||
load-interval 30
|
||||
l2transport
|
||||
!
|
||||
!
|
||||
|
||||
|
||||
...
|
||||
l2vpn
|
||||
load-balancing flow src-dst-ip
|
||||
logging
|
||||
bridge-domain
|
||||
pseudowire
|
||||
!
|
||||
xconnect group LoadTest
|
||||
p2p pair0
|
||||
interface TenGigE0/0/2/0
|
||||
interface TenGigE0/0/2/1
|
||||
!
|
||||
p2p pair1
|
||||
interface TenGigE0/0/2/2
|
||||
interface TenGigE0/0/2/3
|
||||
!
|
||||
...
|
||||
```
|
||||
|
||||
|
||||
### L2 Bridge-Domain
|
||||
|
||||
```
|
||||
l2vpn
|
||||
bridge group LoadTestp
|
||||
bridge-domain bd0
|
||||
interface TenGigE0/0/0/0
|
||||
!
|
||||
interface TenGigE0/0/0/1
|
||||
!
|
||||
!
|
||||
bridge-domain bd1
|
||||
interface TenGigE0/0/0/2
|
||||
!
|
||||
interface TenGigE0/0/0/3
|
||||
!
|
||||
!
|
||||
...
|
||||
```
|
||||
```
RP/0/RSP0/CPU0:micro-fridge#show l2vpn forwarding bridge-domain mac-address location 0/0/CPU0
|
||||
Sat Aug 31 12:09:08.957 UTC
|
||||
Mac Address Type Learned from/Filtered on LC learned Resync Age Mapped to
|
||||
--------------------------------------------------------------------------------
|
||||
9c69.b461.fcf2 dynamic Te0/0/0/0 0/0/CPU0 0d 0h 0m 14s N/A
|
||||
9c69.b461.fcf3 dynamic Te0/0/0/1 0/0/CPU0 0d 0h 0m 2s N/A
|
||||
001b.2155.1f11 dynamic Te0/0/0/2 0/0/CPU0 0d 0h 0m 0s N/A
|
||||
001b.2155.1f10 dynamic Te0/0/0/3 0/0/CPU0 0d 0h 0m 15s N/A
|
||||
001b.21bc.47a4 dynamic Te0/0/1/0 0/0/CPU0 0d 0h 0m 6s N/A
|
||||
001b.21bc.47a5 dynamic Te0/0/1/1 0/0/CPU0 0d 0h 0m 21s N/A
|
||||
9c69.b461.ff41 dynamic Te0/0/1/2 0/0/CPU0 0d 0h 0m 16s N/A
|
||||
9c69.b461.ff40 dynamic Te0/0/1/3 0/0/CPU0 0d 0h 0m 10s N/A
|
||||
001b.2155.1d1d dynamic Te0/0/2/0 0/0/CPU0 0d 0h 0m 9s N/A
|
||||
001b.2155.1d1c dynamic Te0/0/2/1 0/0/CPU0 0d 0h 0m 16s N/A
|
||||
001b.2155.1e08 dynamic Te0/0/2/2 0/0/CPU0 0d 0h 0m 4s N/A
|
||||
001b.2155.1e09 dynamic Te0/0/2/3 0/0/CPU0 0d 0h 0m 11s N/A
|
||||
```
|
||||
|
||||
Interesting finding: after a bridge-domain overload occurs, forwarding pretty much stops:
|
||||
```
|
||||
Te0/0/0/0:
|
||||
30 second input rate 6931755000 bits/sec, 14441158 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
Te0/0/0/1:
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 19492000 bits/sec, 40609 packets/sec
|
||||
|
||||
Te0/0/0/2:
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 19720000 bits/sec, 41084 packets/sec
|
||||
Te0/0/0/3:
|
||||
30 second input rate 6931728000 bits/sec, 14441100 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
|
||||
... and so on
|
||||
|
||||
30 second input rate 6931558000 bits/sec, 14440748 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 12627000 bits/sec, 26307 packets/sec
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 12710000 bits/sec, 26479 packets/sec
|
||||
30 second input rate 6931542000 bits/sec, 14440712 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 19196000 bits/sec, 39992 packets/sec
|
||||
30 second input rate 6931651000 bits/sec, 14440938 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
30 second input rate 6931658000 bits/sec, 14440958 packets/sec
|
||||
30 second output rate 0 bits/sec, 0 packets/sec
|
||||
30 second input rate 0 bits/sec, 0 packets/sec
|
||||
30 second output rate 13167000 bits/sec, 27431 packets/sec
|
||||
```
|
||||
|
||||
MPLS enabled test:
|
||||
```
|
||||
arp vrf default 100.64.0.2 001b.2155.1e08 ARPA
|
||||
arp vrf default 100.64.1.2 001b.2155.1e09 ARPA
|
||||
arp vrf default 100.64.2.2 001b.2155.1d1c ARPA
|
||||
arp vrf default 100.64.3.2 001b.2155.1d1d ARPA
|
||||
arp vrf default 100.64.4.2 001b.21bc.47a4 ARPA
|
||||
arp vrf default 100.64.5.2 001b.21bc.47a5 ARPA
|
||||
arp vrf default 100.64.6.2 9c69.b461.fcf2 ARPA
|
||||
arp vrf default 100.64.7.2 9c69.b461.fcf3 ARPA
|
||||
arp vrf default 100.64.8.2 001b.2155.1f10 ARPA
|
||||
arp vrf default 100.64.9.2 001b.2155.1f11 ARPA
|
||||
arp vrf default 100.64.10.2 9c69.b461.ff40 ARPA
|
||||
arp vrf default 100.64.11.2 9c69.b461.ff41 ARPA
|
||||
|
||||
router static
|
||||
address-family ipv4 unicast
|
||||
0.0.0.0/0 198.19.5.1
|
||||
16.0.0.0/24 100.64.0.2
|
||||
16.0.1.0/24 100.64.2.2
|
||||
16.0.2.0/24 100.64.4.2
|
||||
16.0.3.0/24 100.64.6.2
|
||||
16.0.4.0/24 100.64.8.2
|
||||
16.0.5.0/24 100.64.10.2
|
||||
48.0.0.0/24 100.64.1.2
|
||||
48.0.1.0/24 100.64.3.2
|
||||
48.0.2.0/24 100.64.5.2
|
||||
48.0.3.0/24 100.64.7.2
|
||||
48.0.4.0/24 100.64.9.2
|
||||
48.0.5.0/24 100.64.11.2
|
||||
!
|
||||
!
|
||||
|
||||
mpls static
|
||||
interface TenGigE0/0/0/0
|
||||
interface TenGigE0/0/0/1
|
||||
interface TenGigE0/0/0/2
|
||||
interface TenGigE0/0/0/3
|
||||
interface TenGigE0/0/1/0
|
||||
interface TenGigE0/0/1/1
|
||||
interface TenGigE0/0/1/2
|
||||
interface TenGigE0/0/1/3
|
||||
interface TenGigE0/0/2/0
|
||||
interface TenGigE0/0/2/1
|
||||
interface TenGigE0/0/2/2
|
||||
interface TenGigE0/0/2/3
|
||||
address-family ipv4 unicast
|
||||
local-label 16 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/2/3 100.64.1.2 out-label 17
|
||||
!
|
||||
!
|
||||
local-label 17 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/2/2 100.64.0.2 out-label 16
|
||||
!
|
||||
!
|
||||
local-label 18 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/2/0 100.64.3.2 out-label 19
|
||||
!
|
||||
!
|
||||
local-label 19 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/2/1 100.64.2.2 out-label 18
|
||||
!
|
||||
!
|
||||
local-label 20 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/1/1 100.64.5.2 out-label 21
|
||||
!
|
||||
!
|
||||
local-label 21 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/1/0 100.64.4.2 out-label 20
|
||||
!
|
||||
!
|
||||
local-label 22 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/0/1 100.64.7.2 out-label 23
|
||||
!
|
||||
!
|
||||
local-label 23 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/0/0 100.64.6.2 out-label 22
|
||||
!
|
||||
!
|
||||
local-label 24 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/0/2 100.64.9.2 out-label 25
|
||||
!
|
||||
!
|
||||
local-label 25 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/0/3 100.64.8.2 out-label 24
|
||||
!
|
||||
!
|
||||
local-label 26 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/1/2 100.64.11.2 out-label 27
|
||||
!
|
||||
!
|
||||
local-label 27 allocate
|
||||
forward
|
||||
path 1 nexthop TenGigE0/0/1/3 100.64.10.2 out-label 26
|
||||
!
|
||||
!
|
||||
!
|
||||
!
|
||||
```
|
||||
725
content/articles/2024-09-08-sflow-1.md
Normal file
@@ -0,0 +1,725 @@
|
||||
---
|
||||
date: "2024-09-08T12:51:23Z"
|
||||
title: 'VPP with sFlow - Part 1'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
|
||||
|
||||
In January of 2023, an uncomfortably long time ago at this point, an acquaintance of mine called
|
||||
Ciprian reached out to me after seeing my [[DENOG
|
||||
#14](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)] presentation. He was interested to learn about
|
||||
IPFIX and was asking if sFlow would be an option. At the time, there was a plugin in VPP called
|
||||
[[flowprobe](https://s3-docs.fd.io/vpp/24.10/cli-reference/clis/clicmd_src_plugins_flowprobe.html)]
|
||||
which is able to emit IPFIX records. Unfortunately I never really got it to work well in my tests,
|
||||
as either the records were corrupted, sub-interfaces didn't work, or the plugin would just crash the
|
||||
dataplane entirely. In the meantime, the folks at [[Netgate](https://netgate.com/)] submitted quite
|
||||
a few fixes to flowprobe, but it remains an expensive operation computationally. Wouldn't copying
|
||||
one in a thousand or ten thousand packet headers with flow _sampling_ be just as good?
|
||||
|
||||
In the months that followed, I discussed the feature with the incredible folks at
|
||||
[[inMon](https://inmon.com/)], the original designers and maintainers of the sFlow protocol and
|
||||
toolkit. Neil from inMon wrote a prototype and put it on [[GitHub](https://github.com/sflow/vpp)]
|
||||
but for lack of time I didn't manage to get it to work, which was largely my fault by the way.
|
||||
|
||||
However, I have a bit of time on my hands in September and October, and just a few weeks ago,
|
||||
my buddy Pavel from [[FastNetMon](https://fastnetmon.com/)] pinged that very dormant thread about
|
||||
sFlow being a potentially useful tool for anti DDoS protection using VPP. And I very much agree!
|
||||
|
||||
## sFlow: Protocol
|
||||
|
||||
Maintenance of the protocol is performed by the [[sFlow.org](https://sflow.org/)] consortium, the
|
||||
authoritative source of the sFlow protocol specifications. The current version of sFlow is v5.
|
||||
|
||||
sFlow, short for _sampled Flow_, works at the ethernet layer of the stack, where it inspects one in
|
||||
N datagrams (typically 1:1000 or 1:10000) going through the physical network interfaces of a device.
|
||||
On the device, an **sFlow Agent** does the sampling. For each sample the Agent takes, the first M
|
||||
bytes (typically 128) are copied into an sFlow Datagram. Sampling metadata is added, such as
|
||||
the ingress (or egress) interface and sampling process parameters. The Agent can then optionally add
|
||||
forwarding information (such as router source- and destination prefix, MPLS LSP information, BGP
|
||||
communities, and what-not). Finally the Agent will periodically read the octet and packet counters of
|
||||
physical network interface(s). Ultimately, the Agent will send the samples and additional
|
||||
information over the network as a UDP datagram, to an **sFlow Collector** for further processing.
|
||||
|
||||
sFlow has been specifically designed to take advantages of the statistical properties of packet
|
||||
sampling and can be modeled using statistical sampling theory. This means that the sFlow traffic
|
||||
monitoring system will always produce statistically quantifiable measurements. You can read more
|
||||
about it in Peter Phaal and Sonia Panchen's
|
||||
[[paper](https://sflow.org/packetSamplingBasics/index.htm)], I certainly did and my head spun a
|
||||
little bit at the math :)
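The rule of thumb I took away from it, quoting from memory so do double-check the paper: the relative error of a traffic class depends only on the number of samples c taken of that class, roughly:

```
error% ≈ 196 * sqrt(1/c)      (at 95% confidence)
c = 10'000 samples  ->  error ≈ 2%
```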
|
||||
|
||||
### sFlow: Netlink PSAMPLE
|
||||
|
||||
sFlow is meant to be a very _lightweight_ operation for the sampling equipment. It can typically be
|
||||
done in hardware, but there also exist several software implementations. One very clever thing, I
|
||||
think, is decoupling the sampler from the rest of the Agent. The Linux kernel has a packet sampling
|
||||
API called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)], which
|
||||
allows _producers_ to send samples to a certain _group_, and then allows _consumers_ to subscribe to
|
||||
samples of a certain _group_. The PSAMPLE API uses
|
||||
[[NetLink](https://docs.kernel.org/userspace-api/netlink/intro.html)] under the covers. The cool
|
||||
thing, for me anyway, is that I have a little bit of experience with Netlink due to my work on VPP's
|
||||
[[Linux Control Plane]({{< ref 2021-08-25-vpp-4 >}})] plugin.
|
||||
|
||||
The idea here is that some **sFlow Agent**, notably a VPP plugin, will be taking periodic samples
|
||||
from the physical network interfaces, and producing Netlink messages. Then, some other program,
|
||||
notably outside of VPP, can consume these messages and further handle them, creating UDP packets
|
||||
with sFlow samples and counters and other information, and sending them to an **sFlow Collector**
|
||||
somewhere else on the network.
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Warning" >}}
|
||||
|
||||
There's a handy utility called [[psampletest](https://github.com/sflow/psampletest)] which can
|
||||
subscribe to these PSAMPLE netlink groups and retrieve the samples. The first time I used all of
|
||||
this stuff, I wasn't aware of this utility and I kept on getting errors. It turns out, there's a
|
||||
kernel module that needs to be loaded: `modprobe psample` and `psampletest` helpfully does that for
|
||||
you [[ref](https://github.com/sflow/psampletest/blob/main/psampletest.c#L799)], so just make sure
|
||||
the module is loaded and added to `/etc/modules` before you spend as many hours as I did pulling out
|
||||
hair.
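In other words, on the box that will consume the samples, something along these lines before anything else:

```
$ sudo modprobe psample
$ echo psample | sudo tee -a /etc/modules
$ lsmod | grep psample
```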
|
||||
|
||||
## VPP: sFlow Plugin
|
||||
|
||||
For the purposes of my initial testing, I'll simply take a look at Neil's prototype on
|
||||
[[GitHub](https://github.com/sflow/vpp)] and see what I learn in terms of functionality and
|
||||
performance.
|
||||
|
||||
### sFlow Plugin: Anatomy
|
||||
|
||||
The design is purposefully minimal, to do all of the heavy lifting outside of the VPP dataplane. The
|
||||
plugin will create a new VPP _graph node_ called `sflow`, which the operator can insert after
|
||||
`device-input`, in other words, if enabled, the plugin will get a copy of all packets that are read
|
||||
from an input provider, such as `dpdk-input` or `rdma-input`. The plugin's job is to process the
|
||||
packet, and if it's not selected for sampling, just move it onwards to the next node, typically
|
||||
`ethernet-input`. Almost all of the interesting action is in `node.c`
|
||||
|
||||
The kicker is, that one in N packets will be selected to sample, after which:
|
||||
1. the ethernet header (`*en`) is extracted from the packet
|
||||
1. the input interface (`hw_if_index`) is extracted from the VPP buffer. Remember, sFlow works
|
||||
with physical network interfaces!
|
||||
1. if there are too many samples from this worker thread being worked on, it is discarded and an
|
||||
error counter is incremented. This protects the main thread from being slammed with samples if
|
||||
there are simply too many being fished out of the dataplane.
|
||||
1. Otherwise:
|
||||
* a new `sflow_sample_t` is created, with all the sampling process metadata filled in
|
||||
* the first 128 bytes of the packet are copied into the sample
|
||||
* an RPC is dispatched to the main thread, which will send the sample to the PSAMPLE channel
|
||||
|
||||
Both a debug CLI command and API call are added:
|
||||
|
||||
```
|
||||
sflow enable-disable <interface-name> [<sampling_N>]|[disable]
|
||||
```
|
||||
|
||||
Some observations:
|
||||
|
||||
First off, the sampling_N in Neil's demo is a global rather than per-interface setting. It would
|
||||
make sense to make this be per-interface, as routers typically have a mixture of 1G/10G and faster
|
||||
100G network cards available. It was a surprise when I set one interface to 1:1000 and the other to
|
||||
1:10000 and then saw the first interface change its sampling rate also. It's a small thing, and
|
||||
will not be an issue to change.
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
Secondly, sending the RPC to main uses `vl_api_rpc_call_main_thread()`, which
|
||||
requires a _spinlock_ in `src/vlibmemory/memclnt_api.c:649`. I'm somewhat worried that when many
|
||||
samples are sent from many threads, there will be lock contention and performance will suffer.
|
||||
|
||||
### sFlow Plugin: Functional
|
||||
|
||||
I boot up the [[IPng Lab]({{< ref 2022-10-14-lab-1 >}})] and install a bunch of sFlow tools on it,
|
||||
make sure the `psample` kernel module is loaded. In this first test I'll take a look at
|
||||
tablestakes. I compile VPP with the sFlow plugin, and enable that plugin in `startup.conf` on each
|
||||
of the four VPP routers. For reference, the Lab looks like this:
|
||||
|
||||
{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
|
||||
|
||||
What I'll do is start an `iperf3` server on `vpp0-3` and then hit it from `vpp0-0`, to generate
|
||||
a few TCP traffic streams back and forth, which will be traversing `vpp0-2` and `vpp0-1`, like so:
|
||||
|
||||
```
|
||||
pim@vpp0-3:~ $ iperf3 -s -D
|
||||
pim@vpp0-0:~ $ iperf3 -c vpp0-3.lab.ipng.ch -t 86400 -P 10 -b 10M
|
||||
```
|
||||
|
||||
### Configuring VPP for sFlow
|
||||
|
||||
While this `iperf3` is running, I'll log on to `vpp0-2` to take a closer look. The first thing I do,
|
||||
is turn on packet sampling on `vpp0-2`'s interface that points at `vpp0-3`, which is `Gi10/0/1`, and
|
||||
the interface that points at `vpp0-0`, which is `Gi10/0/0`. That's easy enough, and I will use a
|
||||
sampling rate of 1:1000 as these interfaces are GigabitEthernet:
|
||||
|
||||
```
|
||||
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/0 1000
|
||||
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/1 1000
|
||||
root@vpp0-2:~# vppctl show run | egrep '(Name|sflow)'
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 5656 24168 0 9.01e2 4.27
|
||||
```
|
||||
|
||||
Nice! VPP inserted the `sflow` node between `dpdk-input` and `ethernet-input` where it can do its
|
||||
business. But is it sending data? To answer this question, I can first take a look at the
|
||||
`psampletest` tool:
|
||||
|
||||
```
|
||||
root@vpp0-2:~# psampletest
|
||||
pstest: modprobe psample returned 0
|
||||
pstest: netlink socket number = 1637
|
||||
pstest: getFamily
|
||||
pstest: generic netlink CMD = 1
|
||||
pstest: generic family name: psample
|
||||
pstest: generic family id: 32
|
||||
pstest: psample attr type: 4 (nested=0) len: 8
|
||||
pstest: psample attr type: 5 (nested=0) len: 8
|
||||
pstest: psample attr type: 6 (nested=0) len: 24
|
||||
pstest: psample multicast group id: 9
|
||||
pstest: psample multicast group: config
|
||||
pstest: psample multicast group id: 10
|
||||
pstest: psample multicast group: packets
|
||||
pstest: psample found group packets=10
|
||||
pstest: joinGroup 10
|
||||
pstest: received Netlink ACK
|
||||
pstest: joinGroup 10
|
||||
pstest: set headers...
|
||||
pstest: serialize...
|
||||
pstest: print before sending...
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=7 out=9 n=1000 seq=1 pktlen=1514 hdrlen=31 pkt=0x558c08ba4958 q=3 depth=33333333 delay=123456
|
||||
pstest: send...
|
||||
pstest: send_psample getuid=0 geteuid=0
|
||||
pstest: sendmsg returned 140
|
||||
pstest: free...
|
||||
pstest: start read loop...
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600320 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600321 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600322 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=2 out=0 n=1000 seq=600423 pktlen=66 hdrlen=70 pkt=0x7ffdb0d5a1e8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600324 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
```
|
||||
|
||||
I am amazed! The `psampletest` output shows a few packets, considering I'm asking `iperf3` to push
|
||||
100Mbit using 9000 byte jumboframes (which would be something like 1400 packets/second), I can
|
||||
expect two or three samples per second. I immediately notice a few things:
|
||||
|
||||
***1. Network Namespace***: The Netlink sampling channel belongs to a network _namespace_. The VPP
|
||||
process is running in the _default_ netns, so its PSAMPLE netlink messages will be in that namespace.
|
||||
Thus, the `psampletest` and other tools must also run in that namespace. I mention this because in
|
||||
Linux CP, often times the controlplane interfaces are created in a dedicated `dataplane` network
|
||||
namespace.
|
||||
|
||||
***2. pktlen and hdrlen***: The pktlen is wrong, and this is a bug. In VPP, packets are put into
|
||||
buffers of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for
|
||||
the same packet. The packet length here ought to be 9000 in one direction. Looking at the `in=2`
|
||||
packet with length 66, that looks like a legitimate ACK packet on the way back. But why is the
|
||||
hdrlen set to 70 there? I'm going to want to ask Neil about that.
|
||||
|
||||
***3. ingress and egress***: The `in=1` and one packet with `in=2` represent the input `hw_if_index`
|
||||
which is the ifIndex that VPP assigns to its devices. And looking at `show interfaces`, indeed
|
||||
number 1 corresponds with `GigabitEthernet10/0/0` and 2 is `GigabitEthernet10/0/1`, which checks
|
||||
out:
|
||||
```
|
||||
root@vpp0-2:~# vppctl show int
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
|
||||
GigabitEthernet10/0/0 1 up 9000/0/0/0 rx packets 469552764
|
||||
rx bytes 4218754400233
|
||||
tx packets 133717230
|
||||
tx bytes 8887341013
|
||||
drops 6050
|
||||
ip4 469321635
|
||||
ip6 225164
|
||||
GigabitEthernet10/0/1 2 up 9000/0/0/0 rx packets 133527636
|
||||
rx bytes 8816920909
|
||||
tx packets 469353481
|
||||
tx bytes 4218736200819
|
||||
drops 6060
|
||||
ip4 133489925
|
||||
ip6 29139
|
||||
|
||||
```
|
||||
|
||||
***4. ifIndexes are orthogonal***: These `in=1` or `in=2` ifIndex numbers are constructs of the VPP
|
||||
dataplane. Notably, VPP's numbering of interface index is strictly _orthogonal_ to Linux, and it's
|
||||
not guaranteed that there even _exists_ an interface in Linux for the PHY upon which the sampling is
|
||||
happening. Said differently, `in=1` here is meant to reference VPP's `GigabitEthernet10/0/0`
|
||||
interface, but in Linux, `ifIndex=1` is a completely different interface (`lo`) in the default
|
||||
network namespace. Similarly `in=2` for VPP's `Gi10/0/1` interface corresponds to interface `enp1s0`
|
||||
in Linux:
|
||||
|
||||
```
|
||||
root@vpp0-2:~# ip link
|
||||
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
|
||||
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
||||
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
|
||||
link/ether 52:54:00:f0:01:20 brd ff:ff:ff:ff:ff:ff
|
||||
```
|
||||
|
||||
***5. Counters***: sFlow periodically polls the interface counters for all interfaces. It will
|
||||
normally use `/proc/net/` entries for that, but there are two problems with this:
|
||||
|
||||
1. There may not exist a Linux representation of the interface, for example if it's only doing L2
|
||||
bridging or cross connects in the VPP dataplane, and it does not have a Linux Control Plane
|
||||
interface, or `linux-cp` is not used at all.
|
||||
|
||||
1. Even if it does exist and it's the "correct" ifIndex in Linux, for example if the _Linux
|
||||
Interface Pair_'s tuntap `host_vif_index` index is used, even then the statistics counters in the
|
||||
Linux representation will only count packets and octets of _punted_ packets, that is to say, the
|
||||
stuff that LinuxCP has decided need to go to the Linux kernel through the TUN/TAP device. Important
|
||||
to note that east-west traffic that goes _through_ the dataplane, is never punted to Linux, and as
|
||||
such, the counters will be undershooting: only counting traffic _to_ the router, not _through_ the
|
||||
router.
|
||||
|
||||
### VPP sFlow: Performance
|
||||
|
||||
Now that I've shown that Neil's proof of concept works, I will take a better look at the performance
|
||||
of the plugin. I've made a mental note that the plugin sends RPCs from worker threads to the main
|
||||
thread to marshall the PSAMPLE messages out. I'd like to see how expensive that is, in general. So,
|
||||
I boot two Dell R730 machines in IPng's Lab and put them to work. The first machine will run
|
||||
Cisco's T-Rex loadtester with 8x 10Gbps ports (4x dual-port Intel 82599), while the second (identical)
|
||||
machine will run VPP, also with 8x 10Gbps ports (2x Intel X710-DA4).
|
||||
|
||||
I will test a bunch of things in parallel. First off, I'll test L2 (xconnect) and L3 (IPv4 routing),
|
||||
and secondly I'll test that with and without sFlow turned on. This gives me 8 ports to configure,
|
||||
and I'll start with the L2 configuration, as follows:
|
||||
|
||||
```
|
||||
vpp# set int state TenGigabitEthernet3/0/2 up
|
||||
vpp# set int state TenGigabitEthernet3/0/3 up
|
||||
vpp# set int state TenGigabitEthernet130/0/2 up
|
||||
vpp# set int state TenGigabitEthernet130/0/3 up
|
||||
vpp# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
|
||||
vpp# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
|
||||
vpp# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
|
||||
vpp# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
|
||||
```
|
||||
|
||||
Then, the L3 configuration looks like this:
|
||||
|
||||
```
|
||||
vpp# lcp create TenGigabitEthernet3/0/0 host-if xe0-0
|
||||
vpp# lcp create TenGigabitEthernet3/0/1 host-if xe0-1
|
||||
vpp# lcp create TenGigabitEthernet130/0/0 host-if xe1-0
|
||||
vpp# lcp create TenGigabitEthernet130/0/1 host-if xe1-1
|
||||
vpp# set int state TenGigabitEthernet3/0/0 up
|
||||
vpp# set int state TenGigabitEthernet3/0/1 up
|
||||
vpp# set int state TenGigabitEthernet130/0/0 up
|
||||
vpp# set int state TenGigabitEthernet130/0/1 up
|
||||
vpp# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
|
||||
vpp# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
|
||||
vpp# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
|
||||
vpp# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
|
||||
vpp# ip route add 16.0.0.0/24 via 100.64.0.0
|
||||
vpp# ip route add 48.0.0.0/24 via 100.64.1.0
|
||||
vpp# ip route add 16.0.2.0/24 via 100.64.4.0
|
||||
vpp# ip route add 48.0.2.0/24 via 100.64.5.0
|
||||
vpp# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
|
||||
vpp# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
|
||||
vpp# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
|
||||
vpp# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
|
||||
```
|
||||
|
||||
And finally, the Cisco T-Rex configuration looks like this:
|
||||
|
||||
```
|
||||
- version: 2
|
||||
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
|
||||
port_info:
|
||||
- src_mac: 00:1b:21:06:00:00
|
||||
dest_mac: 9c:69:b4:61:a1:dc
|
||||
- src_mac: 00:1b:21:06:00:01
|
||||
dest_mac: 9c:69:b4:61:a1:dd
|
||||
|
||||
- src_mac: 00:1b:21:83:00:00
|
||||
dest_mac: 00:1b:21:83:00:01
|
||||
- src_mac: 00:1b:21:83:00:01
|
||||
dest_mac: 00:1b:21:83:00:00
|
||||
|
||||
- src_mac: 00:1b:21:87:00:00
|
||||
dest_mac: 9c:69:b4:61:75:d0
|
||||
- src_mac: 00:1b:21:87:00:01
|
||||
dest_mac: 9c:69:b4:61:75:d1
|
||||
|
||||
- src_mac: 9c:69:b4:85:00:00
|
||||
dest_mac: 9c:69:b4:85:00:01
|
||||
- src_mac: 9c:69:b4:85:00:01
|
||||
dest_mac: 9c:69:b4:85:00:00
|
||||
```
|
||||
|
||||
A little note on the use of `ip neighbor` in VPP and specific `dest_mac` in T-Rex. In L2 mode,
|
||||
because the VPP interfaces will be in promiscuous mode and simply pass through any ethernet frame
|
||||
received on interface `Te3/0/2` and copy it out on `Te3/0/3` and vice-versa, there is no need to
|
||||
tinker with MAC addresses. But in L3 mode, the NIC will only accept ethernet frames addressed to its
|
||||
MAC address, so you can see that for the first port in T-Rex, I am setting `dest_mac:
|
||||
9c:69:b4:61:a1:dc` which is the MAC address of `Te3/0/0` on VPP. And then on the way out, if VPP
|
||||
wants to send traffic back to T-Rex, I'll give it a static ARP entry with `ip neighbor .. static`.
|
||||
|
||||
With that said, I can start a baseline loadtest like so:
|
||||
{{< image width="100%" src="/assets/sflow/trex-baseline.png" alt="Cisco T-Rex: baseline" >}}
|
||||
|
||||
T-Rex is sending 10Gbps out on all eight interfaces (four of which are L3 routing, and four of which
|
||||
are L2 xconnecting), using a packet size of 1514 bytes. This amounts to roughly 813Kpps per port, or a
|
||||
cool 6.51Mpps in total. And I can see that in this baseline configuration, the VPP router is happy to
|
||||
do the work.
|
||||
|
||||
I then enable sFlow on the second set of four ports, using a 1:1000 sampling rate:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000
|
||||
```
|
||||
|
||||
This should yield about 3'250 or so samples per second, and things look pretty great:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show err
|
||||
Count Node Reason Severity
|
||||
5034508 sflow sflow packets processed error
|
||||
4908 sflow sflow packets sampled error
|
||||
5034508 sflow sflow packets processed error
|
||||
5111 sflow sflow packets sampled error
|
||||
5034516 l2-output L2 output packets error
|
||||
5034516 l2-input L2 input packets error
|
||||
5034404 sflow sflow packets processed error
|
||||
4948 sflow sflow packets sampled error
|
||||
5034404 l2-output L2 output packets error
|
||||
5034404 l2-input L2 input packets error
|
||||
5034404 sflow sflow packets processed error
|
||||
4928 sflow sflow packets sampled error
|
||||
5034404 l2-output L2 output packets error
|
||||
5034404 l2-input L2 input packets error
|
||||
5034516 l2-output L2 output packets error
|
||||
5034516 l2-input L2 input packets error
|
||||
```
|
||||
|
||||
I can see that the `sflow packets sampled` is roughly 0.1% of the `sflow packets processed` which
|
||||
checks out. I can also see in `psampletest` a flurry of activity, so I'm happy:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ sudo psampletest
|
||||
...
|
||||
pstest: grp=1 in=9 out=0 n=1000 seq=63388 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=8 out=0 n=1000 seq=63389 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=11 out=0 n=1000 seq=63390 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=10 out=0 n=1000 seq=63391 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=11 out=0 n=1000 seq=63392 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
```
|
||||
|
||||
I confirm that all four `in` interfaces (8, 9, 10 and 11) are sending samples, and those indexes
|
||||
correctly correspond to the VPP dataplane's `sw_if_index` for `TenGig130/0/0 - 3`. Sweet! On this
|
||||
machine, each TenGig network interface has its own dedicated VPP worker thread. Considering I
|
||||
turned on sFlow sampling on four interfaces, I should see the cost I'm paying for the feature:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 3908218 14350684 0 9.05e1 3.67
|
||||
sflow active 3913266 14350680 0 1.11e2 3.67
|
||||
sflow active 3910828 14350687 0 1.08e2 3.67
|
||||
sflow active 3909274 14350692 0 5.66e1 3.67
|
||||
```
|
||||
|
||||
Alright, so for the 999 packets that went through and the one packet that got sampled, on average
|
||||
VPP is spending between 90 and 111 CPU cycles per packet, and the loadtest looks squeaky clean on
|
||||
T-Rex.
|
||||
|
||||
### VPP sFlow: Cost of passthru
|
||||
|
||||
I decide to take a look at two edge cases. What if there are no samples being taken at all, and the
|
||||
`sflow` node is merely passing through all packets to `ethernet-input`? To simulate this, I will set
|
||||
up a bizarrely high sampling rate, say one in ten million. I'll also make the T-Rex loadtester use
|
||||
only four ports, in other words, a unidirectional loadtest, and I'll make it go much faster by
|
||||
sending smaller packets, say 128 bytes:
|
||||
|
||||
```
|
||||
tui>start -f stl/ipng.py -p 0 2 4 6 -m 99% -t size=128
|
||||
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000 disable
|
||||
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10000000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10000000
|
||||
```
|
||||
|
||||
The loadtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the
|
||||
`sFlow` plugin is not sampling many packets:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show err
|
||||
Count Node Reason Severity
|
||||
59777084 sflow sflow packets processed error
|
||||
6 sflow sflow packets sampled error
|
||||
59777152 l2-output L2 output packets error
|
||||
59777152 l2-input L2 input packets error
|
||||
59777104 sflow sflow packets processed error
|
||||
6 sflow sflow packets sampled error
|
||||
59777104 l2-output L2 output packets error
|
||||
59777104 l2-input L2 input packets error
|
||||
|
||||
pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 8186642 369674664 0 1.35e1 45.16
|
||||
sflow active 25173660 369674696 0 1.97e1 14.68
|
||||
```
|
||||
Two observations:
|
||||
|
||||
1. One of these is busier than the other. Without looking further, I can already predict that the
|
||||
top one (doing 45.16 vectors/call) is the L3 thread. Reasoning: the L3 code path through the
|
||||
dataplane is a lot more expensive than 'merely' L2 XConnect. As such, the packets will spend more
|
||||
time, and therefore the iterations of the `dpdk-input` loop will be further apart in time. And
|
||||
because of that, it'll end up consuming more packets on each subsequent iteration, in order to catch
|
||||
up. The L2 path, on the other hand, is quicker and therefore will have fewer packets waiting on
|
||||
subsequent iterations of `dpdk-input`.
|
||||
|
||||
2. The `sflow` plugin spends between 13.5 and 19.7 CPU cycles shoveling the packets into
|
||||
`ethernet-input` without doing anything to them. That's pretty low! And the L3 path is a little bit
|
||||
more efficient per packet, which is very likely because it gets to amortize its L1/L2 CPU instruction
|
||||
cache over 45 packets each time it runs, while the L2 path can only amortize its instruction cache over
|
||||
15 or so packets each time it runs.
|
||||
|
||||
I let the loadtest run overnight, and the proof is in the pudding: sFlow enabled but not sampling
|
||||
works just fine:
|
||||
|
||||
{{< image width="100%" src="/assets/sflow/trex-passthru.png" alt="Cisco T-Rex: passthru" >}}
|
||||
|
||||
### VPP sFlow: Cost of sampling
|
||||
|
||||
The other interesting case is to figure out how much CPU it takes to execute the code path
|
||||
with the actual sampling. This one turns out a bit trickier to measure. While leaving the previous
|
||||
loadtest running at 33.5Mpps, I disable sFlow and then re-enable it at an abnormally _high_ ratio of
|
||||
1:10 packets:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10
|
||||
```
|
||||
|
||||
The T-Rex view immediately reveals that VPP is not doing very well, as the throughput went from
|
||||
33.5Mpps all the way down to 7.5Mpps. Ouch! Looking at the dataplane:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show err | grep sflow
|
||||
340502528 sflow sflow packets processed error
|
||||
12254462 sflow sflow packets dropped error
|
||||
22611461 sflow sflow packets sampled error
|
||||
422527140 sflow sflow packets processed error
|
||||
8533855 sflow sflow packets dropped error
|
||||
34235952 sflow sflow packets sampled error
|
||||
```
|
||||
|
||||
Ha, this new safeguard popped up: remember all the way at the beginning, I explained how there's a
|
||||
safety net in the `sflow` plugin that will pre-emptively drop the sample if the RPC channel towards
|
||||
the main thread is seeing too many outstanding RPCs? That's happening right now, under the moniker
|
||||
`sflow packets dropped`, and it's roughly *half* of the samples.
|
||||
|
||||
My first attempt is to back off the loadtester to roughly 1.5Mpps per port (so 6Mpps in total, under the
|
||||
current limit of 7.5Mpps), but I'm disappointed: the VPP instance is now returning 665Kpps per port
|
||||
only, which is horrible, and it's still dropping samples.
|
||||
|
||||
My second attempt is to turn off all ports but the last pair (the L2XC port), which returns 930Kpps from
|
||||
the offered 1.5Mpps. VPP is clearly not having a good time here.
|
||||
|
||||
Finally, as a validation, I turn off all ports but the first pair (the L3 port, without sFlow), and
|
||||
ramp up the traffic to 8Mpps. Success (unsurprising to me). I also ramp up the second pair (the L2XC
|
||||
port, without sFlow), VPP forwards all 16Mpps and is happy again.
|
||||
|
||||
Once I turn on the third pair (the L3 port, _with_ sFlow), even at 1Mpps, the whole situation
|
||||
regresses again: First two ports go down from 8Mpps to 5.2Mpps each; the third (offending) port
|
||||
delivers 740Kpps out of 1Mpps. Clearly, there's some work to do under high load situations!
|
||||
|
||||
#### Reasoning about the bottle neck
|
||||
|
||||
But how expensive is sending samples, really? To try to get at least some pseudo-scientific answer I
|
||||
turn off all ports again, and ramp up the one port pair with (L3 + sFlow at 1:10 ratio) to full line
|
||||
rate: that is 64 byte packets at 14.88Mpps:
|
||||
|
||||
```
|
||||
tui>stop
|
||||
tui>start -f stl/ipng.py -m 100% -p 4 -t size=64
|
||||
```
|
||||
|
||||
VPP is now on the struggle bus and is returning 3.16Mpps or 21% of that. But, I think it'll give me
|
||||
some reasonable data to try to feel out where the bottleneck is.
|
||||
|
||||
```
|
||||
Thread 2 vpp_wk_1 (lcore 3)
|
||||
Time 6.3, 10 sec internal node vector rate 256.00 loops/sec 27310.73
|
||||
vector rates in 3.1607e6, out 3.1607e6, drop 0.0000e0, punt 0.0000e0
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
TenGigabitEthernet130/0/1-outp active 77906 19943936 0 5.79e0 256.00
|
||||
TenGigabitEthernet130/0/1-tx active 77906 19943936 0 6.88e1 256.00
|
||||
dpdk-input polling 77906 19943936 0 4.41e1 256.00
|
||||
ethernet-input active 77906 19943936 0 2.21e1 256.00
|
||||
ip4-input active 77906 19943936 0 2.05e1 256.00
|
||||
ip4-load-balance active 77906 19943936 0 1.07e1 256.00
|
||||
ip4-lookup active 77906 19943936 0 1.98e1 256.00
|
||||
ip4-rewrite active 77906 19943936 0 1.97e1 256.00
|
||||
sflow active 77906 19943936 0 6.14e1 256.00
|
||||
|
||||
pim@hvn6-lab:pim# vppctl show err | grep sflow
|
||||
551357440 sflow sflow packets processed error
|
||||
19829380 sflow sflow packets dropped error
|
||||
36613544 sflow sflow packets sampled error
|
||||
```
|
||||
|
||||
OK, the `sflow` plugin saw 551M packets, selected 36.6M of them for sampling, but ultimately only
|
||||
sent RPCs to the main thread for 16.8M samples after having dropped 19.8M of them. There are three
|
||||
code paths, each one extending the other:
|
||||
|
||||
1. Super cheap: pass through. I already learned that it takes about X=13.5 CPU cycles to pass
|
||||
through a packet.
|
||||
1. Very cheap: select sample and construct the RPC, but toss it, costing Y CPU cycles.
|
||||
1. Expensive: select sample, and send the RPC. Z CPU cycles in worker, and another amount in main.
|
||||
|
||||
Now I don't know what Y is, but seeing as the selection only copies some data from the VPP buffer
|
||||
into a new `sflow_sample_t`, and it uses `clib_memcpy_fast()` for the sample header, I'm going to
|
||||
assume it's not _drastically_ more expensive than the super cheap case, so for simplicity I'll
|
||||
guesstimate that it takes Y=20 CPU cycles.
|
||||
|
||||
With that guess out of the way, I can see what the `sflow` plugin is consuming for the third case:
|
||||
|
||||
```
|
||||
AvgClocks = (Total * X + Sampled * Y + RPCSent * Z) / Total
|
||||
|
||||
61.4 = ( 551357440 * 13.5 + 36613544 * 20 + (36613544-19829380) * Z ) / 551357440
|
||||
61.4 = ( 7443325440 + 732270880 + 16784164 * Z ) / 551357440
|
||||
33853346816 = 7443325440 + 732270880 + 16784164 * Z
|
||||
25677750496 = 16784164 * Z
|
||||
Z = 1529
|
||||
```
|
||||
|
||||
Good to know! I find spending O(1500) cycles to send the sample pretty reasonable. However, for a
|
||||
dataplane that is trying to do 10Mpps per core, with the core running at 2.2GHz, there are really only 220
|
||||
CPU cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets
|
||||
feels dangerous to me.
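
To make that concrete with the numbers from this test (the 1:10 sampling ratio used here, and the
Z≈1529 cycles derived above):

```
Budget at 10Mpps on a 2.2GHz core:  2.2e9 / 10e6  = 220 cycles/packet
Amortized sampling cost at 1:10:    1529  / 10    ≈ 153 cycles/packet
```

In other words, the sample hand-off alone would consume roughly 70% of the per-packet budget, before
any actual forwarding work is done.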
|
||||
|
||||
Here's where I start my conjecture. If I count the CPU cycles spent in the table above, I will see
|
||||
273 CPU cycles spent on average per packet. The CPU in the VPP router is an `E5-2696 v4 @ 2.20GHz`,
|
||||
which means it should be able to do `2.2e9/273 = 8.06Mpps` per thread, more than double what I
|
||||
observe (3.16Mpps)! But, for all the `vector rates in` (3.1607e6), it also managed to emit the
|
||||
packets back out (same number: 3.1607e6).
|
||||
|
||||
So why isn't VPP getting more packets from DPDK? I poke around a bit and find an important clue:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed; \
|
||||
sleep 10; vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed
|
||||
rx missed 4065539464
|
||||
rx missed 4182788310
|
||||
```
|
||||
|
||||
In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. I already measured that it
|
||||
forwarded 3.16Mpps and you know what? 11.7 + 3.16 is precisely 14.88Mpps. All packets are accounted
|
||||
for! It's just, DPDK never managed to read them from the hardware: `sad-trombone.wav`
|
||||
|
||||
|
||||
As a validation, I turned off sFlow while keeping that one port at 14.88Mpps. Now, 10.8Mpps were
|
||||
delivered:
|
||||
|
||||
```
|
||||
Thread 2 vpp_wk_1 (lcore 3)
|
||||
Time 14.7, 10 sec internal node vector rate 256.00 loops/sec 40622.64
|
||||
vector rates in 1.0794e7, out 1.0794e7, drop 0.0000e0, punt 0.0000e0
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
TenGigabitEthernet130/0/1-outp active 620012 158723072 0 5.66e0 256.00
|
||||
TenGigabitEthernet130/0/1-tx active 620012 158723072 0 7.01e1 256.00
|
||||
dpdk-input polling 620012 158723072 0 4.39e1 256.00
|
||||
ethernet-input active 620012 158723072 0 1.56e1 256.00
|
||||
ip4-input-no-checksum active 620012 158723072 0 1.43e1 256.00
|
||||
ip4-load-balance active 620012 158723072 0 1.11e1 256.00
|
||||
ip4-lookup active 620012 158723072 0 2.00e1 256.00
|
||||
ip4-rewrite active 620012 158723072 0 2.02e1 256.00
|
||||
```
|
||||
|
||||
Total Clocks: 201 per packet; 2.2GHz/201 = 10.9Mpps, and I am observing 10.8Mpps. As [[North of the
|
||||
Border](https://www.youtube.com/c/NorthoftheBorder)] would say: "That's not just good, it's good
|
||||
_enough_!"
|
||||
|
||||
For completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps 🥰), and saw that
|
||||
about 29Mpps of that made it through. Interestingly, what was 3.16Mpps in the single-port line rate
|
||||
loadtest, went up slightly to 3.44Mpps now. What puzzles me even more is that the non-sFlow worker
|
||||
threads are also impacted. I spent some time thinking about this and poking around, but I did not
|
||||
find a good explanation why port pair 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted.
|
||||
Here's a screenshot of VPP on the struggle bus:
|
||||
|
||||
{{< image width="100%" src="/assets/sflow/trex-overload.png" alt="Cisco T-Rex: overload at line rate" >}}
|
||||
|
||||
**Hypothesis**: Due to the _spinlock_ in `vl_api_rpc_call_main_thread()`, the worker CPU is pegged
|
||||
for a longer time, during which the `dpdk-input` PMD can't run, so it misses out on these sweet
|
||||
sweet packets that the network card had dutifully received for it, resulting in the `rx-miss`
|
||||
situation. While VPP's performance measurement shows 273 CPU cycles per packet and 3.16Mpps, this
|
||||
accounts only for 862M cycles, while the thread has 2200M cycles, leaving a whopping 60% of CPU
|
||||
cycles unused in the dataplane. I still don't understand why _other_ worker threads are impacted,
|
||||
though.
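
Spelled out, the accounting behind that last claim:

```
Accounted for: 273 cycles/packet * 3.16e6 packets/sec ≈ 862e6 cycles/sec
Available:     2.2e9 cycles/sec
Unaccounted:   (2200e6 - 862e6) / 2200e6              ≈ 61%
```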
|
||||
|
||||
## What's Next
|
||||
|
||||
I'll continue to work with the folks in the sFlow and VPP communities and iterate on the plugin and
|
||||
other **sFlow Agent** machinery. In an upcoming article, I hope to share more details on how to tie
|
||||
the VPP plugin in to the `hsflowd` host sflow daemon in a way that the interface indexes, counters
|
||||
and packet lengths are all correct. Of course, the main improvement that we can make is to allow for
|
||||
the system to work better under load, which will take some thinking.
|
||||
|
||||
I should do a few more tests with a debug binary and profiling turned on. I quickly ran a `perf`
|
||||
over the VPP (release / optimized) binary running on the bench, but it merely said that with sFlow
enabled, 80% of the time was spent in `libvlib`, versus 54% in the baseline (sFlow turned off).
|
||||
|
||||
```
|
||||
root@hvn6-lab:/home/pim# perf record -p 1752441 sleep 10
|
||||
root@hvn6-lab:/home/pim# perf report --stdio --sort=dso
|
||||
# Overhead Shared Object (sFlow) Overhead Shared Object (baseline)
|
||||
# ........ ...................... ........ ........................
|
||||
#
|
||||
79.02% libvlib.so.24.10 54.27% libvlib.so.24.10
|
||||
12.82% libvnet.so.24.10 33.91% libvnet.so.24.10
|
||||
3.77% dpdk_plugin.so 10.87% dpdk_plugin.so
|
||||
3.21% [kernel.kallsyms] 0.81% [kernel.kallsyms]
|
||||
0.29% sflow_plugin.so 0.09% ld-linux-x86-64.so.2
|
||||
0.28% libvppinfra.so.24.10 0.03% libc.so.6
|
||||
0.21% libc.so.6 0.01% libvppinfra.so.24.10
|
||||
0.17% libvlibapi.so.24.10 0.00% libvlibmemory.so.24.10
|
||||
0.15% libvlibmemory.so.24.10
|
||||
0.07% ld-linux-x86-64.so.2
|
||||
0.00% vpp
|
||||
0.00% [vdso]
|
||||
0.00% libsvm.so.24.10
|
||||
```
|
||||
|
||||
Unfortunately, I'm not much of a profiler expert, being merely a network engineer :) so I may have
|
||||
to ask for help. Of course, if you're reading this, you may also _offer_ help! There's lots of
|
||||
interesting work to do on this `sflow` plugin, with matching ifIndex for consumers like `hsflowd`,
|
||||
reading interface counters from the dataplane (or from the Prometheus Exporter), and most
|
||||
importantly, ensuring it works well, or fails gracefully, under stringent load.
|
||||
|
||||
From the _cray-cray_ ideas department, what if we:
|
||||
1. In the worker thread, produce the sample, but instead of sending an RPC to main and taking the
|
||||
lock, append it to a producer sample queue and move on. This way, no locks are needed, and each
|
||||
worker thread will have its own producer queue.
|
||||
|
||||
1. Create a separate worker (or even pool of workers), running on possibly a different CPU (or in
|
||||
main), that runs a loop iterating on all sflow sample queues consuming the samples and sending them
|
||||
in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too many coming in.
|
||||
|
||||
I'm reminded that this pattern exists already -- async crypto workers create a `crypto-dispatch`
|
||||
node that acts as poller for inbound crypto, and it hands off the result back into the worker
|
||||
thread: lockless at the expense of some complexity!
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
The plugin I am testing here is a prototype written by Neil McKee of inMon. I also wanted to say
|
||||
thanks to Pavel Odintsov of FastNetMon and Ciprian Balaceanu for showing an interest in this plugin,
|
||||
and Peter Phaal for facilitating a get-together last year.
|
||||
|
||||
Who's up for making this thing a reality?!
|
||||
547
content/articles/2024-10-06-sflow-2.md
Normal file
@@ -0,0 +1,547 @@
|
||||
---
|
||||
date: "2024-10-06T07:51:23Z"
|
||||
title: 'VPP with sFlow - Part 2'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
|
||||
|
||||
Last month, I picked up a project together with Neil McKee of [[inMon](https://inmon.com/)], the
|
||||
caretakers of [[sFlow](https://sflow.org)]: an industry standard technology for monitoring high speed switched
|
||||
networks. `sFlow` gives complete visibility into the use of networks enabling performance optimization,
|
||||
accounting/billing for usage, and defense against security threats.
|
||||
|
||||
The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
|
||||
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
|
||||
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for a small
|
||||
portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but also in the
|
||||
VPP software dataplane, and then _transmit_ these samples using a Linux kernel feature called
|
||||
[[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)]. This greatly
|
||||
reduces the complexity of code to be implemented in the forwarding path, while at the same time
|
||||
bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business logic for
|
||||
the more complex state keeping, packet marshalling and transmission from the _Agent_ to a central
|
||||
_Collector_.
|
||||
|
||||
Last month, Neil and I discussed the proof of concept [[ref](https://github.com/sflow/vpp-sflow/)]
|
||||
and I described this in a [[first article]({{< ref 2024-09-08-sflow-1.md >}})]. Then, we iterated on
|
||||
the VPP plugin, playing with a few different approaches to strike a balance between performance, code
|
||||
complexity, and agent features. This article describes our journey.
|
||||
|
||||
## VPP: an sFlow plugin
|
||||
|
||||
There are three things Neil and I specifically take a look at:
|
||||
|
||||
1. If `sFlow` is not enabled on a given interface, there should not be a regression on other
|
||||
interfaces.
|
||||
1. If `sFlow` _is_ enabled, but a packet is not sampled, the overhead should be as small as
|
||||
possible, targeting single digit CPU cycles per packet of overhead.
|
||||
1. If `sFlow` actually selects a packet for sampling, it should be moved out of the dataplane as
|
||||
quickly as possible, targeting double digit CPU cycles per sample.
|
||||
|
||||
For all of this validation and loadtesting, I use a bare metal VPP machine which is receiving load from
|
||||
a T-Rex loadtester on eight TenGig ports. I have configured VPP and T-Rex as follows.
|
||||
|
||||
**1. RX Queue Placement**
|
||||
|
||||
It's important that the network card that is receiving the traffic gets serviced by a worker thread
|
||||
on the same NUMA domain. Since my machine has two processors (and thus, two NUMA nodes), I will
|
||||
align the NIC with the correct processor, like so:
|
||||
|
||||
```
|
||||
set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
|
||||
set interface rx-placement TenGigabitEthernet3/0/1 queue 0 worker 2
|
||||
set interface rx-placement TenGigabitEthernet3/0/2 queue 0 worker 4
|
||||
set interface rx-placement TenGigabitEthernet3/0/3 queue 0 worker 6
|
||||
|
||||
set interface rx-placement TenGigabitEthernet130/0/0 queue 0 worker 1
|
||||
set interface rx-placement TenGigabitEthernet130/0/1 queue 0 worker 3
|
||||
set interface rx-placement TenGigabitEthernet130/0/2 queue 0 worker 5
|
||||
set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7
|
||||
```
|
||||
|
||||
**2. L3 IPv4/MPLS interfaces**
|
||||
|
||||
I will take two pairs of interfaces, one on NUMA0, and the other on NUMA1, so that I can make a
|
||||
comparison with L3 IPv4 or MPLS running _without_ `sFlow` (these are TenGig3/0/*, which I will call
|
||||
the _baseline_ pairs) and two which are running _with_ `sFlow` (these are TenGig130/0/*, which I'll
|
||||
call the _experiment_ pairs).
|
||||
|
||||
```
|
||||
comment { L3: IPv4 interfaces }
|
||||
set int state TenGigabitEthernet3/0/0 up
|
||||
set int state TenGigabitEthernet3/0/1 up
|
||||
set int state TenGigabitEthernet130/0/0 up
|
||||
set int state TenGigabitEthernet130/0/1 up
|
||||
set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
|
||||
set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
|
||||
set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
|
||||
set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
|
||||
ip route add 16.0.0.0/24 via 100.64.0.0
|
||||
ip route add 48.0.0.0/24 via 100.64.1.0
|
||||
ip route add 16.0.2.0/24 via 100.64.4.0
|
||||
ip route add 48.0.2.0/24 via 100.64.5.0
|
||||
ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
|
||||
ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
|
||||
ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
|
||||
ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
|
||||
```
|
||||
|
||||
Here, the only specific trick worth mentioning is the use of `ip neighbor` to pre-populate the L2
|
||||
adjacency for the T-Rex loadtester. This way, VPP knows which MAC address to send traffic to, in
|
||||
case a packet has to be forwarded to 100.64.0.0 or 100.64.5.0. It saves VPP from having to use ARP
|
||||
resolution.
|
||||
|
||||
The configuration for an MPLS label switching router (_LSR_, also called a _P-Router_) is added:
|
||||
|
||||
```
|
||||
comment { MPLS interfaces }
|
||||
mpls table add 0
|
||||
set interface mpls TenGigabitEthernet3/0/0 enable
|
||||
set interface mpls TenGigabitEthernet3/0/1 enable
|
||||
set interface mpls TenGigabitEthernet130/0/0 enable
|
||||
set interface mpls TenGigabitEthernet130/0/1 enable
|
||||
mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
|
||||
mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
|
||||
mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
|
||||
mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
|
||||
```
|
||||
|
||||
**3. L2 CrossConnect interfaces**
|
||||
|
||||
Here, I will also use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
|
||||
interfaces on NUMA1 as my experiment (`sFlow` enabled) pair. This way, I can both make a comparison
|
||||
of the performance impact of enabling `sFlow`, and also assert whether any regression occurs in the
|
||||
_baseline_ pair when I enable a feature in the _experiment_ pair, which should really never happen.
|
||||
|
||||
```
|
||||
comment { L2 xconnected interfaces }
|
||||
set int state TenGigabitEthernet3/0/2 up
|
||||
set int state TenGigabitEthernet3/0/3 up
|
||||
set int state TenGigabitEthernet130/0/2 up
|
||||
set int state TenGigabitEthernet130/0/3 up
|
||||
set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
|
||||
set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
|
||||
set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
|
||||
set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
|
||||
```
|
||||
|
||||
**4. T-Rex Configuration**
|
||||
|
||||
The Cisco T-Rex loadtester is running on another machine in the same rack. Physically, it has eight
|
||||
ports which are connected to a LAB switch, a cool Mellanox SN2700 running Debian [[ref]({{< ref
|
||||
2023-11-11-mellanox-sn2700.md >}})]. From there, eight ports go to my VPP machine. The LAB switch
|
||||
just has VLANs with two ports in each: VLAN 100 takes T-Rex port0 and connects it to TenGig3/0/0,
|
||||
VLAN 101 takes port1 and connects it to TenGig3/0/1, and so on. In total, sixteen ports and eight
|
||||
VLANs are used.
|
||||
|
||||
The configuration for T-Rex then becomes:
|
||||
|
||||
```
|
||||
- version: 2
|
||||
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
|
||||
port_info:
|
||||
- src_mac: 00:1b:21:06:00:00
|
||||
dest_mac: 9c:69:b4:61:a1:dc
|
||||
- src_mac: 00:1b:21:06:00:01
|
||||
dest_mac: 9c:69:b4:61:a1:dd
|
||||
|
||||
- src_mac: 00:1b:21:83:00:00
|
||||
dest_mac: 00:1b:21:83:00:01
|
||||
- src_mac: 00:1b:21:83:00:01
|
||||
dest_mac: 00:1b:21:83:00:00
|
||||
|
||||
- src_mac: 00:1b:21:87:00:00
|
||||
dest_mac: 9c:69:b4:61:75:d0
|
||||
- src_mac: 00:1b:21:87:00:01
|
||||
dest_mac: 9c:69:b4:61:75:d1
|
||||
|
||||
- src_mac: 9c:69:b4:85:00:00
|
||||
dest_mac: 9c:69:b4:85:00:01
|
||||
- src_mac: 9c:69:b4:85:00:01
|
||||
dest_mac: 9c:69:b4:85:00:00
|
||||
```
|
||||
|
||||
Do you see how the first pair sends from `src_mac` 00:1b:21:06:00:00? That's the T-Rex side, and it
|
||||
encodes the PCI device `06:00.0` in the MAC address. It sends traffic to `dest_mac`
|
||||
9c:69:b4:61:a1:dc, which is the MAC address of VPP's TenGig3/0/0 interface. Looking back at the `ip
|
||||
neighbor` VPP config above, it becomes much easier to see who is sending traffic to whom.
|
||||
|
||||
For L2XC, the MAC addresses don't matter. VPP will set the NIC in _promiscuous_ mode which means
|
||||
it'll accept any ethernet frame, not only those sent to the NIC's own MAC address. Therefore, in
|
||||
L2XC modes (the second and fourth pair), I just use the MAC addresses from T-Rex. I find debugging
|
||||
connections and looking up FDB entries on the Mellanox switch much, much easier this way.
|
||||
|
||||
With all config in place, but with `sFlow` disabled, I run a quick bidirectional loadtest using 256b
|
||||
packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS,
|
||||
IPv4, and L2XC. Neat!
|
||||
|
||||
{{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}
|
||||
|
||||
The name of the game is now to do a loadtest that shows the packet throughput and CPU cycles spent
|
||||
for each of the plugin iterations, comparing their performance on ports with and without `sFlow`
|
||||
enabled. For each iteration, I will use exactly the same VPP configuration, I will generate
|
||||
unidirectional 4x14.88Mpps of traffic with T-Rex, and I will report on VPP's performance in
|
||||
_baseline_ and with a somewhat unfavorable 1:100 sampling rate.
|
||||
|
||||
Ready? Here I go!
|
||||
|
||||
### v1: Workers send RPC to main
|
||||
|
||||
***TL/DR***: _13 cycles/packet on passthrough, 4.68Mpps L2, 3.26Mpps L3, with severe regression in
|
||||
baseline_
|
||||
|
||||
The first iteration goes all the way back to a proof of concept from last year. It's described in
|
||||
detail in my [[first post]({{< ref 2024-09-08-sflow-1.md >}})]. The performance results are not
|
||||
stellar:
|
||||
* ☢ When slamming a single sFlow enabled interface, _all interfaces_ regress. When sending 8Mpps
|
||||
of IPv4 traffic through a _baseline_ interface, that is, an interface _without_ sFlow enabled, only
|
||||
5.2Mpps get through. This is considered a mortal sin in VPP-land.
|
||||
* ✅ Passing through packets without sampling them, costs about 13 CPU cycles, not bad.
|
||||
* ❌ Sampling a packet, specifically at higher rates (say, 1:100 or worse, 1:10) completely
|
||||
destroys throughput. When sending 4x14.88Mpps of traffic, only one third makes it through.
|
||||
|
||||
Here's the bloodbath as seen from T-Rex:
|
||||
|
||||
{{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}
|
||||
|
||||
**Debrief**: When we talked through these issues, we sort of drew the conclusion that it would be much
|
||||
faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the
|
||||
spinlock, the worker appends the sample to a producer queue and moves on. This way, no locks
|
||||
are needed, and each worker thread will have its own producer queue.
|
||||
|
||||
Then, we can create a separate thread (or even pool of threads), scheduling on possibly a different
|
||||
CPU (or in main), that runs a loop iterating on all sflow sample queues, consuming the samples and
|
||||
sending them in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too
|
||||
many coming in.
|
||||
|
||||
### v2: Workers send PSAMPLE directly
|
||||
|
||||
**TL/DR**: _7.21Mpps IPv4 L3, 9.45Mpps L2XC, 87 cycles/packet, no impact on disabled interfaces_
|
||||
|
||||
But before we do that, we have one curiosity itch to scratch - what if we sent the sample directly
|
||||
from the worker? With such a model, if it works, we will need no RPCs or sample queue at all. Of
|
||||
course, in this model any sample will have to be rewritten into a PSAMPLE packet and written via the
|
||||
netlink socket. It would be less complex, but not as efficient as it could be. One thing is pretty
|
||||
certain, though: it should be much faster than sending an RPC to the main thread.
|
||||
|
||||
After a short refactor, Neil commits [[d278273](https://github.com/sflow/vpp-sflow/commit/d278273)],
|
||||
which adds compiler macros `SFLOW_SEND_FROM_WORKER` (v2) and `SFLOW_SEND_VIA_MAIN` (v1). When
|
||||
workers send directly, they will invoke `sflow_send_sample_from_worker()` instead of sending an RPC
|
||||
with `vl_api_rpc_call_main_thread()` in the previous version.
|
||||
|
||||
The code currently uses `clib_warning()` to print stats from the dataplane, which is pretty
|
||||
expensive. We should be using the VPP logging framework, but for the time being, I add a few CPU
|
||||
counters so we can more accurately count the cumulative time spent for each part of the calls, see
|
||||
[[6ca61d2](https://github.com/sflow/vpp-sflow/commit/6ca61d2)]. I can now see these with `vppctl show
|
||||
err` instead.
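
For readers who want the gist of what such a counter looks like, here's a tiny standalone C sketch
of the idea, not the plugin's actual code: wrap the expensive call, accumulate elapsed CPU cycles,
and report the average. The names are made up; in VPP the timestamp would typically come from
`clib_cpu_time_now()`, and the totals surface as entries in `show errors`, like the `CPU cycles in
sent samples` counter above.

```
/* Illustrative only: count cumulative CPU cycles spent in an expensive call.
 * x86-specific (__rdtsc); VPP itself would use clib_cpu_time_now(). */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static uint64_t cycles_in_sent_samples; /* cumulative cycles, like the error counter */
static uint64_t samples_sent;

static void send_sample(const void *sample, int len) { (void)sample; (void)len; /* stand-in */ }

static void send_sample_counted(const void *sample, int len) {
  uint64_t t0 = __rdtsc();
  send_sample(sample, len);
  cycles_in_sent_samples += __rdtsc() - t0;
  samples_sent++;
}

int main(void) {
  char dummy[128] = {0};
  for (int i = 0; i < 1000; i++)
    send_sample_counted(dummy, sizeof(dummy));
  printf("avg cycles per sent sample: %llu\n",
         (unsigned long long)(cycles_in_sent_samples / samples_sent));
  return 0;
}
```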
|
||||
|
||||
When loadtesting this, the deadly sin of impacting performance of interfaces that did not have
|
||||
`sFlow` enabled is gone. The throughput is not great, though. Instead of showing screenshots of
|
||||
T-Rex, I can also take a look at the throughput as measured by VPP itself. In its `show runtime`
|
||||
statistics, each worker thread shows both CPU cycles spent, as well as how many packets/sec it
|
||||
received and how many it transmitted:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ export C="v2-100"; vppctl clear run; vppctl clear err; sleep 30; \
|
||||
vppctl show run > $C-runtime.txt; vppctl show err > $C-err.txt
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v2-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0909e7, out 1.0909e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 7.2078e6, out 7.2078e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.4476e6, out 9.4476e6, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep 'sflow' v2-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 844916 216298496 0 8.69e1 256.00
|
||||
sflow active 1107466 283511296 0 8.26e1 256.00
|
||||
pim@hvn6-lab:~$ grep -i sflow v2-100-err.txt
|
||||
217929472 sflow sflow packets processed error
|
||||
1614519 sflow sflow packets sampled error
|
||||
2606893106 sflow CPU cycles in sent samples error
|
||||
280697344 sflow sflow packets processed error
|
||||
2078203 sflow sflow packets sampled error
|
||||
1844674406 sflow CPU cycles in sent samples error
|
||||
```
|
||||
|
||||
At a glance, I can see in the first `grep`, the in and out vector (==packet) rates for each worker
|
||||
thread that is doing meaningful work (ie. has more than 0pps of input). Remember that I pinned the
|
||||
RX queues to worker threads, and this now pays dividends: worker thread 0 is servicing TenGig3/0/0
|
||||
(as _even_ worker thread numbers are on NUMA domain 0), worker thread 1 is servicing TenGig130/0/0.
|
||||
What's cool about this, is it gives me an easy way to compare baseline L3 (10.9Mpps) with experiment
|
||||
L3 (7.21Mpps). Equally, L2XC comes in at 14.88Mpps in baseline and 9.45Mpps in experiment.
|
||||
|
||||
Looking at the output of `vppctl show error`, I can learn another interesting detail. See how there
|
||||
are 1614519 sampled packets out of 217929472 processed packets (ie. a roughly 1:100 rate)? I added a
|
||||
CPU clock cycle counter that counts cumulative clocks spent once samples are taken. I can see that
|
||||
VPP spent 2606893106 CPU cycles sending these samples. That's **1615 CPU cycles** per sent sample,
|
||||
which is pretty terrible.
|
||||
|
||||
**Debrief**: We both understand that assembling and `send()`ing the netlink messages from within the
|
||||
dataplane is a pretty bad idea. But it's great to see that removing the use of RPCs immediately
|
||||
improves performance on non-enabled interfaces, and we learned what the cost is of sending those
|
||||
samples. An easy step forward from here is to create a producer/consumer queue, where the workers
|
||||
can just copy the packet into a queue or ring buffer, and have an external `pthread` consume from
|
||||
the queue/ring in another thread that won't block the dataplane.
|
||||
|
||||
### v3: SVM FIFO from workers, dedicated PSAMPLE pthread
|
||||
|
||||
**TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages_
|
||||
|
||||
Neil checks in after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
|
||||
that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty
|
||||
elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment
|
||||
called `fifo-sflow-worker`, to which the worker can write its samples in the dataplane node. A new
|
||||
thread called `spt_process_samples` can then call `svm_fifo_dequeue()` from all workers' queues and
|
||||
pump those into Netlink.
|
||||
|
||||
The overhead of copying the samples onto a VPP native `svm_fifo` seems to be two orders of magnitude
|
||||
lower than writing directly to Netlink, even though the `svm_fifo` library code has many bells and
|
||||
whistles that we don't need. But, perhaps due to these bells and whistles, we may be holding it
|
||||
wrong, as invariably after a short while the Netlink writes return _Message too long_ errors.
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v3fifo-sflow-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0783e7, out 1.0783e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.3499e6, out 9.3499e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4728e7, out 1.4728e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.3516e7, out 1.3516e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 1096132 280609792 0 1.63e1 256.00
|
||||
sflow active 1584577 405651712 0 1.46e1 256.00
|
||||
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-err.txt
|
||||
280635904 sflow sflow packets processed error
|
||||
2079194 sflow sflow packets sampled error
|
||||
733447310 sflow CPU cycles in sent samples error
|
||||
405689856 sflow sflow packets processed error
|
||||
3004118 sflow sflow packets sampled error
|
||||
1844674407 sflow CPU cycles in sent samples error
|
||||
```
|
||||
|
||||
Two things of note here. Firstly, the average clocks spent in the `sFlow` node have gone down from
|
||||
86 CPU cycles/packet to 16.3 CPU cycles. But even more importantly, the amount of time spent after
|
||||
the sample is taken is hugely reduced, from 1600+ cycles in v2 to a much more favorable 352 cycles
|
||||
in this version. Also, any risk of a Netlink write stalling the dataplane has been eliminated,
because the writing is now offloaded to a different thread entirely.
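
For the record, that per-sample figure follows directly from the first worker's counters above:

```
733447310 CPU cycles / 2079194 samples ≈ 352 cycles per enqueued sample
```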
|
||||
|
||||
**Debrief**: It's not great that we created a new linux `pthread` for the consumer of the samples.
|
||||
VPP has an elaborate thread management system, and collaborative multitasking in its threading
|
||||
model, which adds introspection like clock counters, names, `show runtime`, `show threads` and so
|
||||
on. I can't help but wonder: wouldn't we just be able to move the `spt_process_samples()` thread
|
||||
into a VPP process node instead?
|
||||
|
||||
### v3bis: SVM FIFO, PSAMPLE process in Main
|
||||
|
||||
**TL/DR:** _9.68Mpps L3, 14.10Mpps L2XC, 14.2 cycles/packet, still with corrupted FIFO queue messages_
|
||||
|
||||
Neil agrees that there's no good reason to keep this out of main, and conjures up
|
||||
[[df2dab8d](https://github.com/vpp/sflow-vpp/df2dab8d)] which rewrites the thread to an
|
||||
`sflow_process_samples()` function, using `VLIB_REGISTER_NODE` to add it to VPP in an idiomatic way.
|
||||
As a really nice benefit, we can now count how many CPU cycles are spent, in _main_, each time this
|
||||
_process_ wakes up and does some work. It's a widely used pattern in VPP.
|
||||
|
||||
Because of the FIFO queue message corruption, Netlink messages are failing to send at an alarming
|
||||
rate, which is causing lots of `clib_warning()` messages to be spewed on console. I replace those
|
||||
with a counter of Failed Netlink messages instead, and commit refactor
|
||||
[[6ba4715](https://github.com/sflow/vpp-sflow/6ba4715d050f76cfc582055958d50bf4cc8a0ad1)].
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v3bis-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0976e7, out 1.0976e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.6743e6, out 9.6743e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4052e7, out 1.4052e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v3bis-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 28052 4.66e4 0.00
|
||||
sflow active 1134102 290330112 0 1.42e1 256.00
|
||||
sflow active 1647240 421693440 0 1.32e1 256.00
|
||||
pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt
|
||||
77945 sflow sflow PSAMPLE sent error
|
||||
863 sflow sflow PSAMPLE send failed error
|
||||
290376960 sflow sflow packets processed error
|
||||
2151184 sflow sflow packets sampled error
|
||||
421761024 sflow sflow packets processed error
|
||||
3119625 sflow sflow packets sampled error
|
||||
```
|
||||
|
||||
With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up
|
||||
and informs me that, when handling the samples from the worker FIFO queues, the _process_ is using
|
||||
about 4.66e4 CPU cycles. Secondly, the replacement of `clib_warning()` with the `sflow PSAMPLE send failed`
|
||||
counter reduced time from 16.3 to 14.2 cycles on average in the dataplane. Nice.
|
||||
|
||||
**Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All
|
||||
these send failures and corrupt packets are really messing things up. So while the provided FIFO
|
||||
implementation in `svm/fifo_segment.h` is idiomatic, it is also much more complex than we thought,
|
||||
and we're fearing that it may not be safe to read from another thread.
|
||||
|
||||
### v4: Custom lockless FIFO, PSAMPLE process in Main
|
||||
|
||||
**TL/DR:** _9.56Mpps L3, 13.69Mpps L2XC, 15.6 cycles/packet, corruption fixed!_
|
||||
|
||||
After reading around a bit in DPDK's
|
||||
[[kni_fifo](https://doc.dpdk.org/api-18.11/rte__kni__fifo_8h_source.html)], Neil produces a gem of a
|
||||
commit in
|
||||
[[42bbb64](https://github.com/sflow/vpp-sflow/commit/42bbb643b1f11e8498428d3f7d20cde4de8ee201)],
|
||||
where he introduces a tiny multiple-writer, single-consumer FIFO with two simple functions:
|
||||
`sflow_fifo_enqueue()` to be called in the workers, and `sflow_fifo_dequeue()` to be called in the
|
||||
main thread's `sflow-process-samples` process. He then makes this thread-safe by doing what I
|
||||
consider black magic, in commit
|
||||
[[dd8af17](https://github.com/sflow/vpp-sflow/commit/dd8af1722d579adc9d08656cd7ec8cf8b9ac11d6)],
|
||||
which makes use of `clib_atomic_load_acq_n()` and `clib_atomic_store_rel_n()` macros from VPP's
|
||||
`vppinfra/atomics.h`.
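
To make the shape of this concrete, here is a minimal, self-contained C11 sketch of the same idea.
This is illustrative only, not the plugin's code: the names are made up, it is shown as a
single-producer/single-consumer ring for simplicity, and where the real plugin uses VPP's
`clib_atomic_store_rel_n()` / `clib_atomic_load_acq_n()` macros, the sketch uses C11 `stdatomic.h`.
The key property is the same: the worker publishes a sample with a release store and drops on
overflow instead of ever blocking, while the consumer observes with an acquire load.

```
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SFLOW_FIFO_DEPTH 4                  /* must be a power of two */
#define SFLOW_FIFO_MASK  (SFLOW_FIFO_DEPTH - 1)

typedef struct {
  uint32_t if_index;                        /* which interface the sample came from */
  uint8_t  header[128];                     /* truncated copy of the packet header */
} sample_t;

typedef struct {
  sample_t slot[SFLOW_FIFO_DEPTH];
  _Atomic uint32_t head;                    /* advanced by the consumer (main) */
  _Atomic uint32_t tail;                    /* advanced by the producer (worker) */
} sample_fifo_t;

/* Worker side: returns false (sample dropped) when the FIFO is full. */
static bool fifo_enqueue(sample_fifo_t *f, const sample_t *s) {
  uint32_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
  uint32_t head = atomic_load_explicit(&f->head, memory_order_acquire);
  if (tail - head == SFLOW_FIFO_DEPTH)
    return false;                           /* full: drop, never block the dataplane */
  f->slot[tail & SFLOW_FIFO_MASK] = *s;
  atomic_store_explicit(&f->tail, tail + 1, memory_order_release);
  return true;
}

/* Main side: returns false when there is nothing to dequeue. */
static bool fifo_dequeue(sample_fifo_t *f, sample_t *out) {
  uint32_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
  uint32_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);
  if (head == tail)
    return false;                           /* empty */
  *out = f->slot[head & SFLOW_FIFO_MASK];
  atomic_store_explicit(&f->head, head + 1, memory_order_release);
  return true;
}

int main(void) {
  sample_fifo_t fifo = {0};
  sample_t s = { .if_index = 1 };
  for (int i = 0; i < 6; i++)               /* six enqueues against depth 4: last two drop */
    printf("enqueue %d: %s\n", i, fifo_enqueue(&fifo, &s) ? "ok" : "dropped");
  sample_t d;
  while (fifo_dequeue(&fifo, &d))
    printf("dequeued sample for if_index %u\n", d.if_index);
  return 0;
}
```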
|
||||
|
||||
What I really like about this change is that it introduces a FIFO implementation in about twenty
|
||||
lines of code, which means the sampling code path in the dataplane becomes really easy to follow,
|
||||
and will be even faster than it was before. I take it out for a loadtest:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v4-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0958e7, out 1.0958e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.5633e6, out 9.5633e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.3697e7, out 1.3697e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v4-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 17767 1.52e6 0.00
|
||||
sflow active 1121156 287015936 0 1.56e1 256.00
|
||||
sflow active 1605772 411077632 0 1.53e1 256.00
|
||||
pim@hvn6-lab:~$ grep sflow v4-100-err.txt
|
||||
3553600 sflow sflow PSAMPLE sent error
|
||||
287101184 sflow sflow packets processed error
|
||||
2127024 sflow sflow packets sampled error
|
||||
350224 sflow sflow packets dropped error
|
||||
411199744 sflow sflow packets processed error
|
||||
3043693 sflow sflow packets sampled error
|
||||
1266893 sflow sflow packets dropped error
|
||||
```
|
||||
|
||||
|
||||
This is starting to be a very nice implementation! With this iteration of the plugin, all the
|
||||
corruption is gone, there is a slight regression (because we're now actually _sending_ the
|
||||
messages). With the v3bis variant, only a tiny fraction of the samples made it through to netlink.
|
||||
With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen
|
||||
FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying
|
||||
to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken,
|
||||
350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense in depth!
|
||||
|
||||
Doing the math, both workers can enqueue 1776800 samples in 30 seconds, which is 59k/s per
|
||||
interface. I can also see that the second interface, which is doing L2XC and hits a much larger
|
||||
packets/sec throughput, is dropping more samples because it receives an equal amount of time from main
|
||||
reading samples from its queue. In other words: in an overload scenario, one interface cannot crowd
|
||||
out another. Slick.
|
||||
|
||||
Finally, completing my math, each worker has enqueued 1776800 samples to their FIFOs, and I see that
|
||||
main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so
|
||||
the `sflow PSAMPLE send failed` counter remains zero.
|
||||
|
||||
{{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}
|
||||
|
||||
**Debrief**: In the meantime, Neil has been working on the `host-sflow` daemon changes to pick up
|
||||
these netlink messages. There's also a bit of work to do with retrieving the packet and byte
|
||||
counters of the VPP interfaces, so he is creating a module in `host-sflow` that can consume some
|
||||
messages from VPP. He will call this `mod_vpp`, and he mails a screenshot of his work in progress.
|
||||
I'll discuss the end-to-end changes with `hsflowd` in a followup article, and focus my efforts here
|
||||
on documenting the VPP parts only. But, as a teaser, here's a screenshot of a validated
|
||||
`sflow-tool` output of a VPP instance using our `sFlow` plugin and his pending `host-sflow` changes
|
||||
to integrate the rest of the business logic outside of the VPP dataplane, where it's arguably
|
||||
expensive to make mistakes.
|
||||
|
||||
Neil admits to an itch that he has been meaning to scratch all this time. In VPP's
|
||||
`plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Here, really
|
||||
most of the time the plugin is just shoveling the ethernet packets through to `ethernet-input`. To
|
||||
make use of some CPU instruction cache affinity, the loop that does this shovelling can do it one
|
||||
packet at a time, two packets at a time, or even four packets at a time. Although the code is super
|
||||
repetitive and somewhat ugly, it does actually speed up processing in terms of CPU cycles spent per
|
||||
packet, if you shovel four of them at a time.
|
||||
|
||||
### v5: Quad Bucket Brigade in worker
|
||||
|
||||
**TL/DR:** _9.68Mpps L3, 14.0Mpps L2XC, 11 CPU cycles/packet, 1.28e5 CPU cycles in main_
|
||||
|
||||
Neil calls this the _Quad Bucket Brigade_, and one last finishing touch is to move from his default
|
||||
2-packet to a 4-packet shoveling. In commit
|
||||
[[285d8a0](https://github.com/sflow/vpp-sflow/commit/285d8a097b74bb38eeb14a922a1e8c1115da2ef2)], he
|
||||
extends a common pattern in VPP dataplane nodes: each time the node iterates, it'll now pre-fetch up
|
||||
to eight packets (`p0-p7`) if the vector is long enough, and handle them four at a time (`b0-b3`).
|
||||
He also adds a few compiler hints with branch prediction: almost no packets will have a trace
|
||||
enabled, so he can use `PREDICT_FALSE()` macros to allow the compiler to further optimize the code.
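
To show the pattern in isolation, here's a generic C sketch, not the plugin's code: `pkt_t`,
`should_sample()`, `take_sample()` and `forward()` are made-up stand-ins, and `unlikely()` plays the
role that `PREDICT_FALSE()` plays in VPP. The loop handles four packets per iteration and prefetches
the next four, so their data is (hopefully) already in cache by the time it is touched.

```
#include <stddef.h>
#include <stdint.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

typedef struct { uint8_t data[64]; } pkt_t;

static inline int  should_sample(const pkt_t *p) { (void)p; return 0; } /* ~1:N sampler stand-in */
static inline void take_sample(const pkt_t *p)   { (void)p; }           /* e.g. enqueue to a FIFO */
static inline void forward(const pkt_t *p)       { (void)p; }           /* hand to the next node  */

void process_vector(pkt_t **pkts, size_t n) {
  size_t i = 0;

  /* Quad loop: four packets per iteration, prefetching the next group of four. */
  while (i + 8 <= n) {
    __builtin_prefetch(pkts[i + 4]);
    __builtin_prefetch(pkts[i + 5]);
    __builtin_prefetch(pkts[i + 6]);
    __builtin_prefetch(pkts[i + 7]);

    for (size_t j = 0; j < 4; j++) {
      pkt_t *p = pkts[i + j];
      if (unlikely(should_sample(p)))
        take_sample(p);                     /* rare branch, kept off the hot path */
      forward(p);
    }
    i += 4;
  }

  /* Tail: whatever is left over, one packet at a time. */
  for (; i < n; i++) {
    pkt_t *p = pkts[i];
    if (unlikely(should_sample(p)))
      take_sample(p);
    forward(p);
  }
}
```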
|
||||
|
||||
Reading the dataplane code, I find it incredibly ugly, but that's the price to pay for ultra-fast
throughput. So how do we see the effect? My low-tech proposal is to enable sampling at a very
|
||||
high rate, say 1:10'000'000, so that the code path that grabs and enqueues the sample into the FIFO
|
||||
is almost never called. Then, what's left for the `sFlow` dataplane node, really is to shovel the
|
||||
packets from `device-input` into `ethernet-input`.
|
||||
|
||||
To measure the relative improvement, I do one test with, and one without commit
|
||||
[[285d8a09](https://github.com/sflow/vpp-sflow/commit/285d8a09)].
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v5-10M-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0981e7, out 1.0981e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.8806e6, out 9.8806e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4328e7, out 1.4328e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v5-10M-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 28467 9.36e3 0.00
|
||||
sflow active 1158325 296531200 0 1.09e1 256.00
|
||||
sflow active 1679742 430013952 0 1.11e1 256.00
|
||||
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v5-noquadbrigade-10M-runtime.txt | grep -v in\ 0
|
||||
vector rates in 1.0959e7, out 1.0959e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.7046e6, out 9.7046e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4008e7, out 1.4008e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v5-noquadbrigade-10M-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 28462 9.57e3 0.00
|
||||
sflow active 1137571 291218176 0 1.26e1 256.00
|
||||
sflow active 1641991 420349696 0 1.20e1 256.00
|
||||
```
|
||||
|
||||
Would you look at that, this optimization actually works as advertised! There is a meaningful
|
||||
_progression_ from v5-noquadbrigade (9.70Mpps L3, 14.00Mpps L2XC) to v5 (9.88Mpps L3, 14.32Mpps
|
||||
L2XC). So at the expense of adding 63 lines of code, there is a 2.8% increase in throughput.
|
||||
**Quad-Bucket-Brigade, yaay!**
|
||||
|
||||
I'll leave you with a beautiful screenshot of the current code at HEAD, as it is sampling 1:100
|
||||
packets (!) on four interfaces, while forwarding 8x10G of 256 byte packets at line rate. You'll
|
||||
recall at the beginning of this article I did an acceptance loadtest with sFlow disabled, but this
|
||||
is the exact same result **with sFlow** enabled:
|
||||
|
||||
{{< image src="/assets/sflow/trex-sflow-acceptance.png" alt="T-Rex sFlow Acceptance Loadtest" >}}
|
||||
|
||||
This picture says it all: 79.98 Gbps in, 79.98 Gbps out; 36.22Mpps in, 36.22Mpps out. Also: 176k
|
||||
samples/sec taken from the dataplane, with correct rate limiting due to a per-worker FIFO depth
|
||||
limit, yielding 25k samples/sec sent to Netlink.
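
As a rough sanity check on those sample rates (back-of-the-envelope, assuming the offered load is
spread evenly over the eight ports, so about half of it enters via the four sFlow-enabled ones):

```
36.22e6 / 2   ≈ 18.1e6 packets/sec through sFlow-enabled ports
18.1e6  / 100 ≈ 181e3 samples/sec taken      (observed: 176k/sec)
```

The per-worker FIFO depth limit then throttles what main actually dequeues and writes to Netlink
down to the reported 25k samples/sec.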
|
||||
|
||||
## What's Next
|
||||
|
||||
Checking in on the three main things we wanted to ensure with the plugin:
|
||||
|
||||
1. ✅ If `sFlow` _is not_ enabled on a given interface, there is no regression on other interfaces.
|
||||
1. ✅ If `sFlow` _is_ enabled, copying packets costs 11 CPU cycles on average
|
||||
1. ✅ If `sFlow` takes a sample, it takes only marginally more CPU time to enqueue.
|
||||
* No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput,
|
||||
* 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput,
|
||||
* and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.
|
||||
|
||||
The hard part is finished, but we're not entirely done yet. What's left is to implement a set of
|
||||
packet and byte counters, and send this information along with possible Linux CP data (such as the
|
||||
TAP interface ID on the Linux side), and to add the module for VPP in `hsflowd`. I'll write about
|
||||
that part in a followup article.
|
||||
|
||||
Neil has introduced vpp-dev@ to this plugin, and so far there were no objections. But he has pointed
|
||||
folks to an out-of-tree GitHub repo, and I may add a Gerrit instead so it becomes part of the
|
||||
ecosystem. Our work so far is captured in Gerrit [[41680](https://gerrit.fd.io/r/c/vpp/+/41680)],
|
||||
which ends up being just over 2600 lines all-up. I do think we need to refactor a bit, add some
|
||||
VPP-specific tidbits like `FEATURE.yaml` and `*.rst` documentation, but this should be in reasonable
|
||||
shape.
|
||||
|
||||
### Acknowledgements
|
||||
|
||||
I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
|
||||
finer details such as logging, error handling, API specifications, and documentation. He has been a
|
||||
true pleasure to work with and learn from.
|
||||
778
content/articles/2024-10-21-freeix-2.md
Normal file
@@ -0,0 +1,778 @@
|
||||
---
|
||||
date: "2024-10-21T10:52:11Z"
|
||||
title: "FreeIX Remote - Part 2"
|
||||
---
|
||||
|
||||
{{< image width="18em" float="right" src="/assets/freeix/freeix-artist-rendering.png" alt="FreeIX, Artists Rendering" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
A few months ago, I wrote about [[an idea]({{< ref 2024-04-27-freeix-1.md >}})] to help boost the
|
||||
value of small Internet Exchange Points (_IXPs_). When such an exchange doesn't have many members,
|
||||
then the operational costs of connecting to it (cross connects, router ports, finding peers, etc)
|
||||
are not very favorable.
|
||||
|
||||
Clearly, the benefit of using an Internet Exchange is to reduce the portion of an ISP’s (and CDN’s)
|
||||
traffic that must be delivered via their upstream transit providers, thereby reducing the average
|
||||
per-bit delivery cost, as well as reducing the end-to-end latency as seen by their users or
|
||||
customers. Furthermore, the increased number of paths available through the IXP improves routing
|
||||
efficiency and fault-tolerance, and at the same time it avoids traffic going the scenic route to a
|
||||
large hub like Frankfurt, London, Amsterdam, Paris or Rome, if it could very well remain local.
|
||||
|
||||
## Refresher: FreeIX Remote
|
||||
|
||||
{{< image width="20em" float="right" src="/assets/freeix/Free IX Remote.svg" alt="FreeIX Remote" >}}
|
||||
|
||||
Let's take for example the [[Free IX in Greece](https://free-ix.gr/)] that was announced at GRNOG16
|
||||
in Athens on April 19th, 2024. This exchange initially targets Athens and Thessaloniki, with 2x100G
|
||||
between the two cities. Members can connect to either site for the cost of only a cross connect.
|
||||
The 1G/10G/25G ports will be _Gratis_, so please make sure to apply if you're in this region! I
|
||||
myself have connected one very special router to Free IX Greece, which will be offering an outreach
|
||||
infrastructure by connecting to _other_ Internet Exchange Points in Amsterdam, and allowing all FreeIX
|
||||
Greece members to benefit from that in the following way:
|
||||
|
||||
1. FreeIX Remote uses AS50869 to peer with any network operator (or routeserver) available at public
|
||||
Internet Exchange Points or using private interconnects. For these peers, it looks like a completely
|
||||
normal service provider in this regard. It will connect to internet exchange points, and learn a bunch of
|
||||
routes and announce other routes.
|
||||
|
||||
1. FreeIX Remote _members_ can join the program, after which they are granted certain propagation
|
||||
permissions by FreeIX Remote at the point where they have a BGP session with AS50869. The prefixes
|
||||
learned on these _member_ sessions are marked as such, and will be allowed to propagate. Members
|
||||
will receive some or all learned prefixes from AS50869.
|
||||
|
||||
1. FreeIX _members_ can set fine grained BGP communities to determine which of their prefixes are
|
||||
propagated to and from which locations, by router, country or Internet Exchange Point.
|
||||
|
||||
Members at smaller internet exchange points greatly benefit from this type of outreach, by receiving large
|
||||
portions of the public internet directly at their preferred peering location. The _Free IX Remote_
|
||||
routers will carry member traffic to and from these remote Internet Exchange Points. My [[previous
|
||||
article]({{< ref 2024-04-27-freeix-1.md >}})] went into a good amount of detail on the principles of
|
||||
operation, but back then I made a promise to come back to the actual _implementation_ of such a
|
||||
complex routing topology. As a starting point, I work with the structure I shared in [[IPng's
|
||||
Routing Policy]({{< ref 2021-11-14-routing-policy.md >}})]. If you haven't read that yet, I think
|
||||
it may make sense to take a look as many of the structural elements and concepts will be similar.
|
||||
|
||||
## Implementation
|
||||
|
||||
The routing policy calls for three classes of (large) BGP communities: informational, permission and
|
||||
inhibit. It also defines a few classic BGP communities, but I'll skip over those as they are not
|
||||
very interesting. Firstly, I will use the _informational_ communities to tag which prefixes were
|
||||
learned by which _router_, in which _country_ and at which internet exchange point, which I will call a
|
||||
_group_.
|
||||
|
||||
Then, I will use the same structure to grant members _permissions_, that is to say, when AS50869
|
||||
learns their prefixes, they will get tagged with specific action communities that enable propagation
|
||||
to other places. I will call this 'Member-to-IXP'. Sometimes, I'd like to be able to _inhibit_
|
||||
propagation of 'Member-to-IXP', so there will be a third set of communities that perform this
|
||||
function. Finally, matching on the informational communities in a clever way will enable a symmetric
|
||||
'IXP-to-Member' propagation.
|
||||
|
||||
To structure this implementation, it helps if I think about it in
|
||||
the following way:
|
||||
|
||||
Let's say, AS50869 is connected to IXP1, IXP2, IXP3 and IXP4. AS50869 has a _member_ called M1 at
|
||||
IXP1, and that member is 'permitted' to reach IXP2 and IXP3, but it is 'inhibited' from reaching
|
||||
IXP4. My _FreeIX Remote_ implementation now has to satisfy three main requirements:
|
||||
|
||||
1. **Ingress**: learn prefixes (from peers and members alike) at internet exchange points or
|
||||
private network interconnects, and 'tag' them with the correct informational communities.
|
||||
1. **Egress: Member-to-IXP**: Announce M1's prefixes to IXP2 and IXP3, but not to IXP4.
|
||||
1. **Egress: IXP-to-Member**: Announce IXP2's and IXP3's prefixes to M1, but not IXP4's.
|
||||
|
||||
### Defining Countries and Routers
|
||||
|
||||
I'll start by giving each country which has at least one router a unique _country_id_ in a YAML
|
||||
file, leaving the value 0 to mean 'all' countries:
|
||||
|
||||
```
|
||||
$ cat config/common/countries.yaml
|
||||
country:
|
||||
all: 0
|
||||
CH: 1
|
||||
NL: 2
|
||||
GR: 3
|
||||
IT: 4
|
||||
```
|
||||
|
||||
Each router has its own configuration file, and at the top, I'll define some metadata which
|
||||
includes things like the country in which it operates, and its own unique _router_id_, like so:
|
||||
|
||||
```
|
||||
$ cat config/chrma0.net.free-ix.net.yaml
|
||||
device:
|
||||
id: 1
|
||||
hostname: chrma0.free-ix.net
|
||||
shortname: chrma0
|
||||
country: CH
|
||||
loopbacks:
|
||||
ipv4: 194.126.235.16
|
||||
ipv6: "2a0b:dd80:3101::"
|
||||
location: "Hofwiesenstrasse, Ruemlang, Zurich, Switzerland"
|
||||
...
|
||||
```
|
||||
|
||||
### Defining communities
|
||||
|
||||
Next, I define the BGP communities in `class` and `subclass` types, in the following YAML structure:
|
||||
|
||||
```
|
||||
ebgp:
|
||||
community:
|
||||
legacy:
|
||||
noannounce: 0
|
||||
blackhole: 666
|
||||
inhibit: 3000
|
||||
prepend1: 3100
|
||||
prepend2: 3200
|
||||
prepend3: 3300
|
||||
large:
|
||||
class:
|
||||
informational: 1000
|
||||
permission: 2000
|
||||
inhibit: 3000
|
||||
prepend1: 3100
|
||||
prepend2: 3200
|
||||
prepend3: 3300
|
||||
subclass:
|
||||
all: 0
|
||||
router: 10
|
||||
country: 20
|
||||
group: 30
|
||||
asn: 40
|
||||
```
|
||||
|
||||
### Defining Members
|
||||
|
||||
In order to keep this system manageable, I have to rely on automation. I intend to leverage the
|
||||
BGP community _subclasses_ in a simple ACL system consisting of the following YAML, taking my buddy
|
||||
Antonios' network as an example:
|
||||
|
||||
```
|
||||
$ cat config/common/members.yaml
|
||||
member:
|
||||
210312:
|
||||
description: DaKnObNET
|
||||
prefix_filter: AS-SET-DNET
|
||||
permission: [ router:chrma0 ]
|
||||
inhibit: [ group:chix ]
|
||||
...
|
||||
```
|
||||
|
||||
The syntax of the `permission` and `inhibit` fields is identical. They are lists of key:value pairs
|
||||
where the key must be one of the _subclasses_ (eg. 'router', 'country', 'group', 'asn'), and the
|
||||
value appropriate for that type. In this example, AS50869 is being asked to grant permissions for
|
||||
Antonios' prefixes to any peer connected to `router:chrma0`, but inhibit propagation to/from the
|
||||
exchange point called `group:chix`. I could extend this list, for example by adding a permission to
|
||||
`country:NL` or an inhibit to `router:grskg0` and so on.
|
||||
|
||||
I decide that sensible defaults are to give permissions to all, and keep inhibit empty. In other
|
||||
words: be very liberal in propagation, to maximize the value that FreeIX Remote can provide its
|
||||
members.
|
||||
|
||||
### Ingress: Learning Prefixes
|
||||
|
||||
With what I've defined so far, I can start to set informational BGP communities:
|
||||
* The prefixes learned on subclass **router** for `chrma0` will have value of device.id=1:
|
||||
`(50869,1010,1)`
|
||||
* The prefixes learned on subclass **country** for `chrma0` take device.country=CH and look it up
in `countries['CH']`, which yields value 1: `(50869,1020,1)`
|
||||
* When learning prefixes from a given internet exchange, Kees already knows its PeeringDB
|
||||
_ixp_id_, which is a unique value for each exchange point. Thus, subclass **group** for `chrma0` at
|
||||
[[CommunityIX](https://www.peeringdb.com/ix/2013)] is ixp_id=2013: `(50869,1030,2013)`
|
||||
|
||||
#### Ingress: Learning from members
|
||||
|
||||
I need to make sure that members send only the prefixes that I expect from them. To do this, I'll
|
||||
make use of a common tool called [[bgpq4](https://github.com/bgp/bgpq4)] which cobbles together the
|
||||
prefixes belonging to an AS-SET by referencing one or more IRR databases.
|
||||
|
||||
In Python, I'll prepare the Jinja context by generating the prefix filter lists like so:
|
||||
|
||||
```
|
||||
if session["type"] == "member":
|
||||
session = {**session, **data["member"][asn]}
|
||||
|
||||
pf = ebgp_merge_value(data["ebgp"], group, session, "prefix_filter", None)
|
||||
if pf:
|
||||
ctx["prefix_filter"] = {}
|
||||
pfn = pf
|
||||
pfn = pfn.replace("-", "_")
|
||||
pfn = pfn.replace(":", "_")
|
||||
|
||||
for af in [4, 6]:
|
||||
filter_name = "%s_%s_IPV%d" % (groupname.upper(), pfn, af)
|
||||
filter_contents = fetch_bgpq(filter_name, pf, af, allow_morespecifics=True)
|
||||
if "[" in filter_contents:
|
||||
ctx["prefix_filter"][filter_name] = { "str": filter_contents, "af": af }
|
||||
ctx["prefix_filter_ipv%d" % af] = True
|
||||
else:
|
||||
log.warning(f"Filter {filter_name} is empty!")
|
||||
ctx["prefix_filter_ipv%d" % af] = False
|
||||
```
|
||||
|
||||
First, if a given BGP session is of type _member_, I'll merge the `member[asn]` dictionary
|
||||
into the `ebgp.group.session[asn]`. I've left out error handling for brevity, but in case the member
|
||||
YAML file doesn't have an entry for the given ASN, it'll just revert back to being of type _peer_.
|
||||
|
||||
I'll use a helper function `ebgp_merge_value()` to walk the YAML hierarchy from the member-data
|
||||
enriched _session_ to the _group_ and finally to the _ebgp_ scope, looking for the existence of a
|
||||
key called _prefix_filter_ and defaulting to None in case none was found. With the value of
|
||||
_prefix_filter_ in hand (in this case `AS-SET-DNET`), I shell out to `bgpq4` for IPv4 and IPv6
|
||||
respectively. Sometimes, there are no IPv6 prefixes (why must you be like this?!) and sometimes
|
||||
there are no IPv4 prefixes (welcome to the Internet, kid!)
|
||||
|
||||
All of this context, including the session and group information, are then fed as context to a
|
||||
Jinja renderer, where I can use them in an _import_ filter like so:
|
||||
|
||||
```
|
||||
{% for plname, pl in (prefix_filter | default({})).items() %}
|
||||
{{pl.str}}
|
||||
{% endfor %}
|
||||
|
||||
filter ebgp_{{group_name}}_{{their_asn}}_import {
|
||||
{% if not prefix_filter_ipv4 | default(True) %}
|
||||
# WARNING: No IPv4 prefix filter found
|
||||
if (net.type = NET_IP4) then reject;
|
||||
{% endif %}
|
||||
{% if not prefix_filter_ipv6 | default(True) %}
|
||||
# WARNING: No IPv6 prefix filter found
|
||||
if (net.type = NET_IP6) then reject;
|
||||
{% endif %}
|
||||
{% for plname, pl in (prefix_filter | default({})).items() %}
|
||||
{% if pl.af == 4 %}
|
||||
if (net.type = NET_IP4 && ! (net ~ {{plname}})) then reject;
|
||||
{% elif pl.af == 6 %}
|
||||
if (net.type = NET_IP6 && ! (net ~ {{plname}})) then reject;
|
||||
{% endif %}
|
||||
{% endfor %}
|
||||
{% if session_type is defined %}
|
||||
if ! ebgp_import_{{session_type}}({{their_asn}}) then reject;
|
||||
{% endif %}
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.router}},{{device.id}})); ## informational.router = {{ device.hostname }}
|
||||
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.country}},{{country[device.country]}})); ## informational.country = {{ device.country }}
|
||||
{% if group.peeringdb_ix.id %}
|
||||
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.group}},{{group.peeringdb_ix.id}})); ## informational.group = {{ group_name }}
|
||||
{% endif %}
|
||||
|
||||
## NOTE(pim): More comes here, see Member-to-IXP below
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
Let me explain what's going on here, as the Jinja templating language that my generator uses is a bit
|
||||
... chatty. The first block will print the dictionary of zero or more `prefix_filter` entries. If
|
||||
the `prefix_filter` context variable doesn't exist, assume it's the empty dictionary and thus,
|
||||
print no prefix lists.
|
||||
|
||||
Then, I create a Bird2 filter and these must each have a globally unique name. I satisfy this
|
||||
requirement by giving it a name with the tuple of {group, their_asn}. The first thing this filter
|
||||
does, is inspect `prefix_filter_ipv4` and `prefix_filter_ipv6`, and if they are explicitly set to
|
||||
False (for example, if a member doesn't have any IRR prefixes associated with their AS-SET), then
|
||||
I'll reject any prefixes from them. Then, I'll match the prefixes with the `prefix_filter`, if
|
||||
provided, and reject any prefixes that aren't in the list I'm expecting on this session. Assuming
|
||||
we're still good to go, I'll hand this prefix off to a function called `ebgp_import_peer()` for
|
||||
peers and `ebgp_import_member()` for members, both of which ensure BGP communities are scrubbed.
|
||||
|
||||
```
|
||||
function ebgp_import_peer(int remote_as) -> bool
|
||||
{
|
||||
# Scrub BGP Communities (RFC 7454 Section 11)
|
||||
bgp_community.delete([(50869, *)]);
|
||||
bgp_large_community.delete([(50869, *, *)]);
|
||||
|
||||
# Scrub BLACKHOLE community
|
||||
bgp_community.delete((65535, 666));
|
||||
|
||||
return ebgp_import(remote_as);
|
||||
}
|
||||
|
||||
function ebgp_import_member(int remote_as) -> bool
|
||||
{
|
||||
# We scrub only our own (informational, permissions) BGP Communities for members
|
||||
bgp_large_community.delete([(50869,1000..2999,*)]);
|
||||
|
||||
return ebgp_import(remote_as);
|
||||
}
|
||||
```
|
||||
|
||||
After scrubbing the communities (peers are not allowed to set _any_ communities, and members are not
|
||||
allowed to set their own informational or permissions communities, but they are allowed to inhibit
|
||||
themselves or prepend, if they wish), one last check is performed by calling the underlying
|
||||
`ebgp_import()`:
|
||||
|
||||
```
|
||||
function ebgp_import(int remote_as) -> bool
|
||||
{
|
||||
if aspath_bogon() then return false;
|
||||
if (net.type = NET_IP4 && ipv4_bogon()) then return false;
|
||||
if (net.type = NET_IP6 && ipv6_bogon()) then return false;
|
||||
|
||||
if (net.type = NET_IP4 && ipv4_rpki_invalid()) then return false;
|
||||
if (net.type = NET_IP6 && ipv6_rpki_invalid()) then return false;
|
||||
|
||||
# Graceful Shutdown (https://www.rfc-editor.org/rfc/rfc8326.html)
|
||||
if (65535, 0) ~ bgp_community then bgp_local_pref = 0;
|
||||
|
||||
return true;
|
||||
}
|
||||
```
|
||||
|
||||
Here, belt-and-suspenders checks are performed, notably bogon AS Paths, IPv4/IPv6 prefixes and RPKI
|
||||
invalids are filtered out. If the prefix carries the well-known community for [[BGP Graceful
|
||||
Shutdown](https://www.rfc-editor.org/rfc/rfc8326.html)], I honor it and set the local preference to
|
||||
zero (making sure to prefer any other available path).
|
||||
|
||||
OK, after all these checks are done, I am finally ready to accept the prefix from this peer or
|
||||
member. It's time to add the informational communities based on the _router_id_, the router's
|
||||
_country_id_ and (if this is a session at a public internet exchange point documented in PeeringDB),
|
||||
the group's _ixp_id_.
|
||||
|
||||
#### Ingress Example: member
|
||||
|
||||
Here's what the rendered template looks like for Antonios' member session at CHIX:
|
||||
|
||||
```
|
||||
# bgpq4 -Ab4 -R 32 -l 'define CHIX_AS_SET_DNET_IPV4' AS-SET-DNET
|
||||
define CHIX_AS_SET_DNET_IPV4 = [
|
||||
44.31.27.0/24{24,32}, 44.154.130.0/24{24,32}, 44.154.132.0/24{24,32},
|
||||
147.189.216.0/21{21,32}, 193.5.16.0/22{22,32}, 212.46.55.0/24{24,32}
|
||||
];
|
||||
|
||||
# bgpq4 -Ab6 -R 128 -l 'define CHIX_AS_SET_DNET_IPV6' AS-SET-DNET
|
||||
define CHIX_AS_SET_DNET_IPV6 = [
|
||||
2001:678:f5c::/48{48,128}, 2a05:dfc1:9174::/48{48,128}, 2a06:9f81:2500::/40{40,128},
|
||||
2a06:9f81:2600::/40{40,128}, 2a0a:6044:7100::/40{40,128}, 2a0c:2f04:100::/40{40,128},
|
||||
2a0d:3dc0::/29{29,128}, 2a12:bc0::/29{29,128}
|
||||
];
|
||||
|
||||
filter ebgp_chix_210312_import {
|
||||
if (net.type = NET_IP4 && ! (net ~ CHIX_AS_SET_DNET_IPV4)) then reject;
|
||||
if (net.type = NET_IP6 && ! (net ~ CHIX_AS_SET_DNET_IPV6)) then reject;
|
||||
if ! ebgp_import_member(210312) then reject;
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
|
||||
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
|
||||
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
|
||||
|
||||
## NOTE(pim): More comes here, see Member-to-IXP below
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
#### Ingress Example: peer
|
||||
|
||||
For completeness, here's a regular peer, Cloudflare, at CHIX, and I hope you agree that the Jinja
|
||||
template renders down to something waaaay more readable now:
|
||||
|
||||
```
|
||||
filter ebgp_chix_13335_import {
|
||||
if ! ebgp_import_peer(13335) then reject;
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
|
||||
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
|
||||
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
Most sessions will actually look like this one: just learning prefixes, scrubbing inbound
|
||||
communities that are nobody's business to be setting but mine, tossing weird prefixes like bogons
|
||||
and then typically setting the three informational communities. I now know exactly which prefixes
|
||||
are picked up at group CHIX, which ones in country Switzerland, and which ones on router `chrma0`.
|
||||
|
||||
### Egress: Propagating Prefixes
|
||||
|
||||
And with that, I've completed the 'learning' part. Let me move to the 'propagating' part. A design
|
||||
goal of FreeIX Remote is to have _symmetric_ propagation. In my example above, member M1 should have
|
||||
its prefixes announced at IXP2 and IXP3, and all prefixes learned at IXP2 and IXP3 should be
|
||||
announced to member M1.
|
||||
|
||||
First, let me create a helper function in the generator. Its job is to take the symbolic
|
||||
`member.*.permission` and `member.*.inhibit` lists and resolve them into a structure of numeric
|
||||
values suitable for BGP community list adding and matching. It's a bit of a beast, but I've
|
||||
simplified it here. Notably, I've removed all the error and exception handling for brevity:
|
||||
|
||||
```
def parse_member_communities(data, asn, type):
    myasn = data["ebgp"]["asn"]
    cls = data["ebgp"]["community"]["large"]["class"]
    sub = data["ebgp"]["community"]["large"]["subclass"]

    bgp_cl = []
    member = data["member"][asn]
    perms = member.get(type, [])  # the member's 'permission' or 'inhibit' list from members.yaml

    for perm in perms:
        if perm == "all":
            el = { "class": int(cls[type]), "subclass": int(sub["all"]),
                   "value": 0, "description": f"{type}.all" }
            return [el]
        k, v = perm.split(":")
        if k == "country":
            country_id = data["country"][v]
            el = { "class": int(cls[type]), "subclass": int(sub["country"]),
                   "value": int(country_id), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        elif k == "asn":
            el = { "class": int(cls[type]), "subclass": int(sub["asn"]),
                   "value": int(v), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        elif k == "router":
            device_id = data["_devices"][v]["id"]
            el = { "class": int(cls[type]), "subclass": int(sub["router"]),
                   "value": int(device_id), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        elif k == "group":
            group = data["ebgp"]["groups"][v]
            if isinstance(group["peeringdb_ix"], dict):
                ix_id = group["peeringdb_ix"]["id"]
            else:
                ix_id = group["peeringdb_ix"]
            el = { "class": int(cls[type]), "subclass": int(sub["group"]),
                   "value": int(ix_id), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        else:
            log.warning(f"No implementation for {type} subclass '{k}' for member AS{asn}, skipping")

    return bgp_cl
```
|
||||
|
||||
The essence of this function is to take a human readable list of symbols, like 'router:chrma0' and
|
||||
look up what subclass is called 'router' and what router_id is 'chrma0'. It does this for keywords
|
||||
'router', 'country', 'group' and 'asn' and for a special keyword called 'all' as well.
|
||||
|
||||
Running this function on Antonios' member data above would reveal the following:
|
||||
```
|
||||
Member 210312 has permissions:
|
||||
[{'class': 2000, 'subclass': 10, 'value': 1, 'description': 'permission.router = chrma0'}]
|
||||
Member 210312 has inhibits:
|
||||
[{'class': 3000, 'subclass': 30, 'value': 2365, 'description': 'inhibit.group = chix'}]
|
||||
```
|
||||
|
||||
The neat thing about this is that this data will come in handy for _both_ types of propagation, and
|
||||
the `parse_member_communities()` helper function returns pretty readable data, which will help in
|
||||
debugging and further understanding the ultimately generated configuration.
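
One more detail worth spelling out: the class and subclass compose into the middle field of a large
BGP community by simple addition, which is exactly what the Jinja templates below do with
`el.class+el.subclass`. A tiny illustration using the values from the output above:

```python
MY_ASN = 50869

def to_large_community(el):
    # (my_asn, class + subclass, value), as rendered by the Jinja templates below.
    return (MY_ASN, el["class"] + el["subclass"], el["value"])

permission = {"class": 2000, "subclass": 10, "value": 1, "description": "permission.router = chrma0"}
inhibit = {"class": 3000, "subclass": 30, "value": 2365, "description": "inhibit.group = chix"}
print(to_large_community(permission))  # (50869, 2010, 1)
print(to_large_community(inhibit))     # (50869, 3030, 2365)
```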
|
||||
|
||||
#### Egress: Member-to-IXP
|
||||
|
||||
OK, when I learned Antonios' prefixes, I instructed the system to propagate them to all
|
||||
sessions on router `chrma0`, except sessions on group `chix`. This means that in the direction of
|
||||
_from AS50869 to others_, I can do the following:
|
||||
|
||||
**1. Tag permissions and inhibits on ingress**
|
||||
|
||||
I add a tiny bit of logic using this data structure I just created above. In the import filter,
|
||||
remember I added `NOTE(pim): More comes here`? After setting the informational communities, I also
|
||||
add these:
|
||||
|
||||
```
|
||||
{% if session_type == "member" %}
|
||||
{% if permissions %}
|
||||
|
||||
# Add FreeIX Remote: Permission
|
||||
{% for el in permissions %}
|
||||
bgp_large_community.add(({{my_asn}},{{el.class+el.subclass}},{{el.value}})); ## {{ el.description
|
||||
}}
|
||||
{% endfor %}
|
||||
{% endif %}
|
||||
{% if inhibits %}
|
||||
|
||||
# Add FreeIX Remote: Inhibit
|
||||
{% for el in inhibits %}
|
||||
bgp_large_community.add(({{my_asn}},{{el.class+el.subclass}},{{el.value}})); ## {{ el.description
|
||||
}}
|
||||
{% endfor %}
|
||||
{% endif %}
|
||||
{% endif %}
|
||||
```
|
||||
|
||||
Seeing as this block only gets rendered if the session type is _member_, let me show you what
|
||||
Antonios' import filter looks like in its full glory:
|
||||
|
||||
```
|
||||
filter ebgp_chix_210312_import {
|
||||
if (net.type = NET_IP4 && ! (net ~ CHIX_AS_SET_DNET_IPV4)) then reject;
|
||||
if (net.type = NET_IP6 && ! (net ~ CHIX_AS_SET_DNET_IPV6)) then reject;
|
||||
if ! ebgp_import_member(210312) then reject;
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
|
||||
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
|
||||
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
|
||||
|
||||
# Add FreeIX Remote: Permission
|
||||
bgp_large_community.add((50869,2010,1)); ## permission.router = chrma0
|
||||
|
||||
# Add FreeIX Remote: Inhibit
|
||||
bgp_large_community.add((50869,3030,2365)); ## inhibit.group = chix
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
Remember, the `ebgp_import_member()` helper will strip any informational (the 1000s) and permissions
|
||||
(the 2000s), but it would allow Antonios to set inhibits and prepends (the 3000s) so these BGP
|
||||
communities will still be allowed in. In other words, Antonios can't give himself propagation rights
|
||||
(sorry, buddy!) but if he would like to make AS50869 stop sending his prefixes to, say, CommunityIX,
|
||||
he could simply add the BGP community `(50869,3030,2013)` on his announcements, and that will get
|
||||
honored. If he'd like AS50869 to prepend itself twice before announcing to peer AS8298, he could set
|
||||
`(50869,3200,8298)` and that will also get picked up.
|
||||
|
||||
**2. Match permissions and inhibits on egress**
|
||||
|
||||
Now that all of Antonios' prefixes are tagged with permissions and inhibits, I can reveal how I
|
||||
implemented the export filters for AS50869:
|
||||
|
||||
```
|
||||
function member_prefix(int group) -> bool
|
||||
{
|
||||
bool permitted = false;
|
||||
|
||||
if (({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.all}}, 0) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.router}}, {{ device.id }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.country}}, {{ country[device.country] }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.group}}, group) ~ bgp_large_community) then {
|
||||
permitted = true;
|
||||
}
|
||||
if (({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.all}}, 0) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.router}}, {{ device.id }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.country}}, {{ country[device.country] }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.group}}, group) ~ bgp_large_community) then {
|
||||
permitted = false;
|
||||
}
|
||||
return (permitted);
|
||||
}
|
||||
|
||||
function valid_prefix(int group) -> bool
|
||||
{
|
||||
return (source_prefix() || member_prefix(group));
|
||||
}
|
||||
|
||||
function ebgp_export_peer(int remote_as; int group) -> bool
|
||||
{
|
||||
if (source != RTS_BGP && source != RTS_STATIC) then return false;
|
||||
if !valid_prefix(group) then return false;
|
||||
|
||||
bgp_community.delete([(50869, *)]);
|
||||
bgp_large_community.delete([(50869, *, *)]);
|
||||
|
||||
return ebgp_export(remote_as);
|
||||
}
|
||||
```
|
||||
|
||||
From the bottom, the function `ebgp_export_peer()` is invoked on each peering session, and it gets
|
||||
as arguments the remote AS (for example 13335 for Cloudflare) and the group (for example 2365
|
||||
for CHIX). The function ensures that it's either a _static_ route or a _BGP_ route. Then it makes
|
||||
sure it's a `valid_prefix()` for the group.
|
||||
|
||||
The `valid_prefix()` function first checks if it's one of our own (as in: AS50869's own) prefixes,
|
||||
which it does by calling `source_prefix()`, which I've omitted here as it would be a distraction.
|
||||
All it does is check if the prefix is in a static prefix list generated with `bgpq4` for AS50869
|
||||
itself. The more interesting observation is that to be eligible, the prefix needs to be either
|
||||
`source_prefix()` **or** `member_prefix(group)`.
|
||||
|
||||
The propagation decision for 'Member-to-IXP' actually happens in that `member_prefix()` function. It
|
||||
starts off by assuming the prefix is not permitted. Then it scans all relevant _permissions_
|
||||
communities which may be present in the RIB for this prefix:
|
||||
- is the `all` permissions community `(50869,2000,0)` set?
|
||||
- what about the `router` permission `(50869,2010,R)` for my _router_id_?
|
||||
- perhaps the `country` permission `(50869,2020,C)` for my _country_id_?
|
||||
- or maybe the `group` permission `(50869,2030,G)` for the _ixp_id_ that this session lives on?
|
||||
|
||||
If any of these conditions are true, then this prefix _might_ be permitted, so I set the variable to
|
||||
True. Next, I check and see if any of the _inhibit_ communities are set, either by me (in
|
||||
`members.yaml`) or by the member on the live BGP session. If any one of them matches, then I flip
|
||||
the variable to False again. Once the verdict is known, I can return True or False here, which
|
||||
makes its way all the way up the call stack and ultimately announces the member prefix on the BGP
|
||||
session, or not. Slick!
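
To make the permit-then-inhibit ordering easy to reason about, here is a small Python model of the
same decision. It is only a sketch of the logic just described, not the Bird code itself; the
subclass offsets (0, 10, 20, 30) are the ones used throughout this article:

```python
MY_ASN = 50869
PERMISSION, INHIBIT = 2000, 3000
SUBCLASSES = {"all": 0, "router": 10, "country": 20, "group": 30}

def member_prefix(tags, router_id, country_id, group_id):
    """Permissions may switch the verdict to True; inhibits always run last and flip it back."""
    scope = {"all": 0, "router": router_id, "country": country_id, "group": group_id}
    permitted = False
    for name, sub in SUBCLASSES.items():
        if (MY_ASN, PERMISSION + sub, scope[name]) in tags:
            permitted = True
    for name, sub in SUBCLASSES.items():
        if (MY_ASN, INHIBIT + sub, scope[name]) in tags:
            permitted = False
    return permitted

# Antonios' prefixes: permitted on router chrma0 (id 1), inhibited on group chix (ixp_id 2365).
tags = {(50869, 2010, 1), (50869, 3030, 2365)}
print(member_prefix(tags, router_id=1, country_id=1, group_id=2365))  # False: not announced at CHIX
print(member_prefix(tags, router_id=1, country_id=1, group_id=2013))  # True: announced at CommunityIX
```

The second loop always runs, which is what makes inhibits win over permissions.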
|
||||
|
||||
#### Egress: IXP-to-Member
|
||||
|
||||
At this point, members' prefixes get announced at the correct internet exchange points, but I need to
|
||||
satisfy one more requirement: the prefixes picked up at those IXPs, should _also_ be announced to
|
||||
members. For this, the helper dictionary with permissions and inhibits can be used in a clever way.
|
||||
What if I held them against the informational communities? For example, if I have _permitted_
|
||||
Antonios to be announced at any IXP connected to router `chrma0`, then all prefixes I learned at
|
||||
`chrma0` are fair game, right? But, I configured an _inhibit_ for Antonios' prefixes at CHIX. No
|
||||
problem, I have an informational community for all prefixes I learned from the CHIX group!
|
||||
|
||||
I come to the realization that IXP-to-Member simply adds to the Member-to-IXP logic. Everything that
|
||||
I would announce to a peer, I will also announce to a member. Off I go, adding one last helper
|
||||
function to the BGP session Jinja template:
|
||||
|
||||
```
|
||||
{% if session_type == "member" %}
|
||||
function ebgp_export_{{group_name}}_{{their_asn}}(int remote_as; int group) -> bool
|
||||
{
|
||||
bool permitted = false;
|
||||
|
||||
if (source != RTS_BGP && source != RTS_STATIC) then return false;
|
||||
if valid_prefix(group) then return ebgp_export(remote_as);
|
||||
|
||||
{% for el in permissions | default([]) %}
|
||||
if (bgp_large_community ~ [({{ my_asn }},{{ 1000+el.subclass}},{% if el.value == 0%}*{% else %}{{el.value}}{% endif %})]) then permitted=true; ## {{el.description}}
|
||||
{% endfor %}
|
||||
{% for el in inhibits | default([]) %}
|
||||
if (bgp_large_community ~ [({{ my_asn }},{{ 1000+el.subclass}},{% if el.value == 0%}*{% else %}{{el.value}}{% endif %})]) then permitted=false; ## {{el.description}}
|
||||
{% endfor %}
|
||||
|
||||
if (permitted) then return ebgp_export(remote_as);
|
||||
return false;
|
||||
}
|
||||
{% endif %}
|
||||
```
|
||||
|
||||
Note that in essence, this new function still calls `valid_prefix()`, which in turn calls
|
||||
`source_prefix()` **or** `member_prefix(group)`, so it announces the same prefixes that are also
|
||||
announced to sessions of type 'peer'. But then, I'll also inspect the _informational_ communities,
|
||||
where the value of `0` is replaced with a wildcard, because 'permit or inhibit all' would mean
|
||||
'match any of these BGP communities'. This template renders as follows for Antonios at CHIX:
|
||||
|
||||
```
function ebgp_export_chix_210312(int remote_as; int group) -> bool
{
  bool permitted = false;

  if (source != RTS_BGP && source != RTS_STATIC) then return false;
  if valid_prefix(group) then return ebgp_export(remote_as);

  if (bgp_large_community ~ [(50869,1010,1)]) then permitted=true;     ## permission.router = chrma0
  if (bgp_large_community ~ [(50869,1030,2365)]) then permitted=false; ## inhibit.group = chix

  if (permitted) then return ebgp_export(remote_as);
  return false;
}
```
|
||||
|
||||
## Results
|
||||
|
||||
With this, the propagation logic is complete. Announcements are _symmetric_, that is to say the function
|
||||
`ebgp_export_chix_210312()` sees to it that Antonios gets the prefixes learned at router `chrma0`
|
||||
but not those learned at group `CHIX`. Similarly, the `ebgp_export_peer()` ensures that Antonios'
|
||||
prefixes are propagated to any session at router `chrma0` except those sessions at group `CHIX`.
|
||||
|
||||
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
|
||||
|
||||
I have installed VPP with [[OSPFv3]({{< ref 2024-06-22-vpp-ospf-2.md >}})] unnumbered interfaces,
|
||||
so each router has exactly one IPv4 and IPv6 loopback address. The router in Rümlang has been
|
||||
operational for a while, the ones in Amsterdam (nlams0.free-ix.net) and Thessaloniki
|
||||
(grskg0.free-ix.net) have been deployed and are connecting to IXPs now, and the one in Milan
|
||||
(itmil0.free-ix.net) has been installed but is pending physical deployment at Caldara.
|
||||
|
||||
I deployed a test setup with a few permissions and inhibits on the Rümlang router, with many thanks
|
||||
to Jurrian, Sam and Antonios for allowing me to guinea-pig-ize their member sessions. With the
|
||||
following test configuration:
|
||||
|
||||
```
member:
  35202:
    description: OnTheGo (Sam Aschwanden)
    prefix_filter: AS-OTG
    permission: [ router:chrma0 ]
    inhibit: [ group:comix ]
  210312:
    description: DaKnObNET
    prefix_filter: AS-SET-DNET
    permission: [ router:chrma0 ]
    inhibit: [ group:chix ]
  212635:
    description: Jurrian van Iersel
    prefix_filter: AS212635:AS-212635
    permission: [ router:chrma0 ]
    inhibit: [ group:chix, group:fogixp ]
```
|
||||
|
||||
I can see the following prefix learn/announce counts towards _members_:
|
||||
|
||||
```
|
||||
pim@chrma0:~$ for i in $(birdc show protocol | grep member | cut -f1 -d' '); do echo -n $i\ ; birdc
|
||||
show protocol all $i | grep Routes; done
|
||||
chix_member_35202_ipv4_1 2 imported, 0 filtered, 159984 exported, 0 preferred
|
||||
chix_member_35202_ipv6_1 2 imported, 0 filtered, 61730 exported, 0 preferred
|
||||
chix_member_210312_ipv4_1 3 imported, 0 filtered, 3518 exported, 3 preferred
|
||||
chix_member_210312_ipv6_1 2 imported, 0 filtered, 1251 exported, 2 preferred
|
||||
comix_member_35202_ipv4_1 2 imported, 0 filtered, 159981 exported, 2 preferred
|
||||
comix_member_35202_ipv4_2 2 imported, 0 filtered, 159981 exported, 1 preferred
|
||||
comix_member_35202_ipv6_1 2 imported, 0 filtered, 61727 exported, 2 preferred
|
||||
comix_member_35202_ipv6_2 2 imported, 0 filtered, 61727 exported, 1 preferred
|
||||
fogixp_member_212635_ipv4_1 1 imported, 0 filtered, 442 exported, 1 preferred
|
||||
fogixp_member_212635_ipv6_1 14 imported, 0 filtered, 181 exported, 14 preferred
|
||||
freeix_ch_member_210312_ipv4_1 3 imported, 0 filtered, 3521 exported, 0 preferred
|
||||
freeix_ch_member_210312_ipv6_1 2 imported, 0 filtered, 1253 exported, 0 preferred
|
||||
```
|
||||
|
||||
Let me make a few observations:
|
||||
* Hurricane Electric AS6939 is present at CHIX, and they tend to announce a very large number of
|
||||
prefixes. So every member who is permitted (and not inhibited) at CHIX will see all of those: Sam's
|
||||
AS35202 is inhibited on CommunityIX but not on CHIX, and he's permitted on both. That explains why
|
||||
he is seeing the routes on both sessions.
|
||||
* I've inhibited Jurrian's AS212635 to/from both CHIX and FogIXP, which means he will be seeing
|
||||
CommunityIX (~245 IPv4, 85 IPv6 prefixes), and FreeIX CH (~173 IPv4 and ~60 IPv6). We also send him
|
||||
the member prefixes, which is about 35 or so additional prefixes. This explains why Jurrian is
|
||||
receiving from us ~440 IPv4 and ~180 IPv6.
|
||||
* Antonios' AS210312, the exemplar in this article, is receiving all-but-CHIX. FogIXP yields 3077
|
||||
or so IPv4 and 1056 IPv6 prefixes, while I've already added up FreeIX, CommunityIX, and our members
|
||||
(this is what we're sending Jurrian!), at roughly 330 and 180 respectively, so Antonios should be getting about 3500 IPv4
|
||||
prefixes and 1250 IPv6 prefixes; a quick sanity check of this arithmetic follows below.
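
Adding those up (all figures are approximate and taken from the counts above):

```python
fogixp_v4, fogixp_v6 = 3077, 1056   # learned at FogIXP
rest_v4, rest_v6 = 330, 180         # FreeIX CH + CommunityIX + member prefixes (what Jurrian receives)
print(fogixp_v4 + rest_v4, fogixp_v6 + rest_v6)   # 3407 1236, close to the 3518/1251 exported to AS210312
```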
|
||||
|
||||
In the other direction, I would expect to be announcing to _peers_ only prefixes belonging to either
|
||||
AS50869 itself, or those of our members:
|
||||
|
||||
```
|
||||
pim@chrma0:~$ for i in $(birdc show protocol | grep peer.*_1 | cut -f1 -d' '); do echo -n $i\ ; birdc
|
||||
show protocol all $i | grep Routes || echo; done
|
||||
chix_peer_212100_ipv4_1 57618 imported, 0 filtered, 24 exported, 778 preferred
|
||||
chix_peer_212100_ipv6_1 21979 imported, 1 filtered, 37 exported, 7186 preferred
|
||||
chix_peer_13335_ipv4_1 4767 imported, 9 filtered, 24 exported, 4765 preferred
|
||||
chix_peer_13335_ipv6_1 371 imported, 1 filtered, 37 exported, 369 preferred
|
||||
chix_peer_6939_ipv4_1 151787 imported, 27 filtered, 24 exported, 133943 preferred
|
||||
chix_peer_6939_ipv6_1 61191 imported, 6 filtered, 37 exported, 16223 preferred
|
||||
comix_peer_44596_ipv4_1 594 imported, 0 filtered, 25 exported, 10 preferred
|
||||
comix_peer_44596_ipv6_1 1147 imported, 0 filtered, 50 exported, 0 preferred
|
||||
comix_peer_8298_ipv4_1 23 imported, 0 filtered, 25 exported, 0 preferred
|
||||
comix_peer_8298_ipv6_1 34 imported, 0 filtered, 50 exported, 0 preferred
|
||||
fogixp_peer_47498_ipv4_1 3286 imported, 1 filtered, 27 exported, 3077 preferred
|
||||
fogixp_peer_47498_ipv6_1 1838 imported, 0 filtered, 39 exported, 1056 preferred
|
||||
freeix_ch_peer_51530_ipv4_1 355 imported, 0 filtered, 28 exported, 0 preferred
|
||||
freeix_ch_peer_51530_ipv6_1 143 imported, 0 filtered, 53 exported, 0 preferred
|
||||
```
|
||||
|
||||
Some observations:
|
||||
|
||||
* Nobody is inhibited at FreeIX Switzerland. It stands to reason therefore, that it has the most
|
||||
exported prefixes: 28 for IPv4 and 53 for IPv6.
|
||||
* Two members are inhibited at CHIX, which gives it the lowest number of exported prefixes:
|
||||
24 for IPv4 and 27 for IPv6.
|
||||
* All peers at each exchange (group) will be announced the same number of prefixes. I can confirm that
|
||||
at CHIX, all three peers have the same number of announced prefixes. Similarly, at CommunityIX, all
|
||||
peers have the same number.
|
||||
* If Antonios, Sam or Jurrian would add an outgoing announcement to AS50869 with an additional inhibit
|
||||
BGP community (eg `(50869,3020,1)` to inhibit country Switzerland), they could tweak these numbers.
|
||||
|
||||
## What's next
|
||||
|
||||
This all adds up. I'd like to test the waters with my friendly neighborhood canaries a little bit,
|
||||
to make sure that announcements are as expected, and traffic flows where appropriate. In the meantime,
|
||||
I'll chase the deployment of LSIX, FrysIX, SpeedIX and possibly a few others in Amsterdam. And of
|
||||
course FreeIX Greece in Thessaloniki. I'll try to get the Milano VPP router deployed (it's already
|
||||
installed and configured, but currently powered off) and connected to PCIX, MIX and a few others.
|
||||
|
||||
## How can you help?
|
||||
|
||||
If you're willing to participate with a VPP router and connect it to either multiple local internet
|
||||
exchanges (like I've demonstrated in Zurich), or better yet, to one or more of the other existing
|
||||
routers, I would welcome your contribution. [[Contact]({{< ref contact.md >}})] me for details.
|
||||
|
||||
A bit further down the pike, a connection from Amsterdam to Zurich, from Zurich to Milan and from
|
||||
Milan to Thessaloniki is on the horizon. If you are willing and able to donate some bandwidth (point
|
||||
to point VPWS, VLL, L2VPN) and your transport network is capable of at least 2026 bytes of _inner_
|
||||
payload, please also [[reach out]({{< ref contact.md >}})] as I'm sure many small network operators
|
||||
would be thrilled.
|
||||
content/articles/2025-02-08-sflow-3.md (new file, +857 lines)
|
||||
---
|
||||
date: "2025-02-08T07:51:23Z"
|
||||
title: 'VPP with sFlow - Part 3'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width="12em" >}}
|
||||
|
||||
In the second half of last year, I picked up a project together with Neil McKee of
|
||||
[[inMon](https://inmon.com/)], the caretakers of [[sFlow](https://sflow.org)]: an industry standard
|
||||
technology for monitoring high speed networks. `sFlow` gives complete visibility into the
|
||||
use of networks enabling performance optimization, accounting/billing for usage, and defense against
|
||||
security threats.
|
||||
|
||||
The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
|
||||
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
|
||||
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
|
||||
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for
|
||||
a small portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but
|
||||
also in the VPP software dataplane. The agent then _transmits_ these samples using a Linux kernel
|
||||
feature called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)].
|
||||
This greatly reduces the complexity of code to be implemented in the forwarding path, while at the
|
||||
same time bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business
|
||||
logic for the more complex state keeping, packet marshalling and transmission from the _Agent_ to a
|
||||
central _Collector_.
|
||||
|
||||
In this third article, I wanted to spend some time discussing how samples make their way out of the
|
||||
VPP dataplane, and into higher level tools.
|
||||
|
||||
## Recap: sFlow
|
||||
|
||||
{{< image float="left" src="/assets/sflow/sflow-overview.png" alt="sFlow Overview" width="14em" >}}
|
||||
|
||||
sFlow describes a method for Monitoring Traffic in Switched/Routed Networks, originally described in
|
||||
[[RFC3176](https://datatracker.ietf.org/doc/html/rfc3176)]. The current specification is version 5
|
||||
and is homed on the sFlow.org website [[ref](https://sflow.org/sflow_version_5.txt)]. Typically, a
|
||||
Switching ASIC in the dataplane (seen at the bottom of the diagram to the left) is asked to copy
|
||||
1-in-N packets to the local sFlow Agent.
|
||||
|
||||
**Sampling**: The agent will copy the first N bytes (typically 128) of the packet into a sample. As
|
||||
the ASIC knows which interface the packet was received on, the `inIfIndex` will be added. After a
|
||||
routing decision is made, the nexthop and its L2 address and interface become known. The ASIC might
|
||||
annotate the sample with this `outIfIndex` and `DstMAC` metadata as well.
|
||||
|
||||
**Drop Monitoring**: There's one rather clever insight that sFlow gives: what if the packet _was
|
||||
not_ routed or switched, but rather discarded? For this, sFlow is able to describe the reason for
|
||||
the drop. For example, the ASIC receive queue could have been overfull, or it did not find a
|
||||
destination to forward the packet to (no FIB entry), perhaps it was instructed by an ACL to drop the
|
||||
packet or maybe even tried to transmit the packet but the physical datalink layer had to abandon the
|
||||
transmission for whatever reason (link down, TX queue full, link saturation, and so on). It's hard
|
||||
to overstate how important it is to have this so-called _drop monitoring_, as operators often spend
|
||||
hours and hours figuring out _why_ packets are lost in their network or datacenter switching fabric.
|
||||
|
||||
**Metadata**: The agent may have other metadata as well, such as which prefix was the source and
|
||||
destination of the packet, what additional RIB information is available (AS path, BGP communities,
|
||||
and so on). This may be added to the sample record as well.
|
||||
|
||||
**Counters**: Since sFlow is sampling 1:N packets, the system can estimate total traffic in a
|
||||
reasonably accurate way. Peter and Sonia wrote a succinct
|
||||
[[paper](https://sflow.org/packetSamplingBasics/)] about the math, so I won't get into that here.
|
||||
Mostly because I am but a software engineer, not a statistician... :) However, I will say this: if a
|
||||
fraction of the traffic is sampled but the _Agent_ knows how many bytes and packets were forwarded
|
||||
in total, it can provide an overview with a quantifiable accuracy. This is why the _Agent_ will
|
||||
periodically get the interface counters from the ASIC.
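
As a rough illustration of that math: scaling up is just multiplication by the sampling rate, and
the error bound quoted in the linked paper is roughly 196*sqrt(1/c) percent at 95% confidence for c
samples of a given class. Treat the formula as the paper's, not mine; this is only a
back-of-the-envelope helper:

```python
import math

def estimate_total(samples_seen, sampling_N):
    """Scale a sampled packet count back up, with the approximate 95% error bound
    from the packet sampling paper referenced above."""
    total = samples_seen * sampling_N
    pct_error = 196.0 * math.sqrt(1.0 / samples_seen) if samples_seen else float("inf")
    return total, pct_error

total, err = estimate_total(samples_seen=400, sampling_N=10_000)
print(f"~{total} packets, accurate to within about {err:.1f}%")  # ~4000000 packets, within about 9.8%
```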
|
||||
|
||||
**Collector**: One or more samples can be concatenated into UDP messages that go from the _sFlow
|
||||
Agent_ to a central _sFlow Collector_. The heavy lifting in analysis is done upstream from the
|
||||
switch or router, which is great for performance. Many thousands or even tens of thousands of
|
||||
agents can forward their samples and interface counters to a single central collector, which in turn
|
||||
can be used to draw up a near real time picture of the state of traffic through even the largest of
|
||||
ISP networks or datacenter switch fabrics.
|
||||
|
||||
In sFlow parlance [[VPP](https://fd.io/)] and its companion
|
||||
[[hsflowd](https://github.com/sflow/host-sflow)] together form an _Agent_ (it sends the UDP packets
|
||||
over the network), and for example the commandline tool `sflowtool` could be a _Collector_ (it
|
||||
receives the UDP packets).
|
||||
|
||||
## Recap: sFlow in VPP
|
||||
|
||||
First, I have some pretty good news to report - our work on this plugin was
|
||||
[[merged](https://gerrit.fd.io/r/c/vpp/+/41680)] and will be included in the VPP 25.02 release in a
|
||||
few weeks! Last weekend, I gave a lightning talk at
|
||||
[[FOSDEM](https://fosdem.org/2025/schedule/event/fosdem-2025-4196-vpp-monitoring-100gbps-with-sflow/)]
|
||||
in Brussels, Belgium, and caught up with a lot of community members and network- and software
|
||||
engineers. I had a great time.
|
||||
|
||||
In trying to keep the amount of code as small as possible, and therefore the probability of bugs that
|
||||
might impact VPP's dataplane stability low, the architecture of the end to end solution consists of
|
||||
three distinct parts, each with their own risk and performance profile:
|
||||
|
||||
{{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}}
|
||||
|
||||
**1. sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves
|
||||
packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin
|
||||
will inspect 1-in-N, taking a sample for further processing. Here, we don't try to be clever, simply
|
||||
copy the `inIfIndex` and the first bytes of the ethernet frame, and append them to a
|
||||
[[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] queue. If too many samples
|
||||
arrive, samples are dropped at the tail, and a counter incremented. This way, I can tell when the
|
||||
dataplane is congested. Bounded FIFOs also provide fairness: it allows for each VPP worker thread to
|
||||
get their fair share of samples into the Agent's hands.
|
||||
|
||||
**2. sFlow main process**: There's a function running on the _main thread_, which shifts further
|
||||
processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it
|
||||
consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones
|
||||
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
|
||||
configurable), it'll grab all interface counters from those interfaces for which I have sFlow
|
||||
turned on. VPP produces _Netlink_ messages and sends them to the kernel.
|
||||
|
||||
**3. Host sFlow daemon**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_
|
||||
messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on
|
||||
hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy,
|
||||
this module already exists. But Neil implemented a _mod_vpp_ which can grab interface names and their
|
||||
`ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages alongside
|
||||
the PSAMPLEs.
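
As an aside, the tail-drop behavior of the per-worker FIFO from step 1 is easy to model. This toy
Python sketch is purely illustrative (the real queue is C inside the plugin); it shows why a bounded
queue with a drop counter both protects the dataplane and tells you when you are oversampling:

```python
from collections import deque

class SampleFifo:
    """Toy model of a per-worker sample FIFO: bounded, tail-drop, with a drop counter."""
    def __init__(self, depth):
        self.queue, self.depth, self.dropped = deque(), depth, 0

    def push(self, if_index, header):
        if len(self.queue) >= self.depth:
            self.dropped += 1          # worker produces faster than the main loop consumes
            return False
        self.queue.append((if_index, header))
        return True

fifo = SampleFifo(depth=2)
for _ in range(3):
    fifo.push(1, b"\x00" * 128)
print(len(fifo.queue), fifo.dropped)   # 2 1
```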
|
||||
|
||||
|
||||
By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref
|
||||
2021-08-25-vpp-4 >}})] plugin. It's a mechanism for programs running in userspace to share
|
||||
information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from
|
||||
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
|
||||
producer/subscriber relationship and nothing precludes one userspace process (`vpp`) to be the
|
||||
producer while another userspace process (`hsflowd`) acts as the consumer!
|
||||
|
||||
Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the rest,
|
||||
giving correctness and upstream interoperability pretty much for free. That's slick!
|
||||
|
||||
### VPP: sFlow Configuration
|
||||
|
||||
The solution that I offer is based on two moving parts. First, the VPP plugin configuration, which
|
||||
turns on sampling at a given rate on physical devices, also known as _hardware-interfaces_. Second,
|
||||
the open source component [[host-sflow](https://github.com/sflow/host-sflow/releases)] can be
|
||||
configured as of release v2.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
|
||||
|
||||
I will show how to configure VPP in three ways:
|
||||
|
||||
***1. VPP Configuration via CLI***
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ vppctl
|
||||
vpp0-0# sflow sampling-rate 100
|
||||
vpp0-0# sflow polling-interval 10
|
||||
vpp0-0# sflow header-bytes 128
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/0
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/0 disable
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/2
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/3
|
||||
```
|
||||
|
||||
The first three commands set the global defaults - in my case I'm going to be sampling at 1:100
|
||||
which is an unusually high rate. A production setup may take 1-in-_linkspeed-in-megabits_ so for a
|
||||
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
|
||||
appropriate, depending on link load. The second command sets the interface stats polling interval.
|
||||
The default is to gather these statistics every 20 seconds, but I set it to 10s here.
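
That rule of thumb is easy to capture in a tiny helper; this is just the heuristic from the
paragraph above, not something the plugin enforces:

```python
def suggested_sampling_N(link_speed_mbps):
    """Sample roughly 1-in-<link speed in Mbit/s>, per the rule of thumb above."""
    return max(1, int(link_speed_mbps))

print(suggested_sampling_N(1_000))     # 1 Gbps  -> 1:1000
print(suggested_sampling_N(100_000))   # 100 GbE -> 1:100000
```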
|
||||
|
||||
Next, I tell the plugin how many bytes of the sampled ethernet frame should be taken. Common
|
||||
values are 64 and 128 but it doesn't have to be a power of two. I want enough data to see the
|
||||
headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but the contents of
|
||||
the payload are rarely interesting for
|
||||
statistics purposes.
|
||||
|
||||
Finally, I can turn on the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP,
|
||||
an idiomatic way to turn on and off things is to have an enabler/disabler. It feels a bit clunky
|
||||
maybe to write `sflow enable $iface disable` but it makes more logical sense if you parse that as
|
||||
"enable-disable" with the default being the "enable" operation, and the alternate being the
|
||||
"disable" operation.
|
||||
|
||||
***2. VPP Configuration via API***
|
||||
|
||||
I implemented a few API methods for the most common operations. Here's a snippet that obtains the
|
||||
same config as what I typed on the CLI above, but using these Python API calls:
|
||||
|
||||
```python
|
||||
from vpp_papi import VPPApiClient, VPPApiJSONFiles
|
||||
import sys
|
||||
|
||||
vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
|
||||
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
|
||||
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
|
||||
vpp.connect("sflow-api-client")
|
||||
print(vpp.api.show_version().version)
|
||||
# Output: 25.06-rc0~14-g9b1c16039
|
||||
|
||||
vpp.api.sflow_sampling_rate_set(sampling_N=100)
|
||||
print(vpp.api.sflow_sampling_rate_get())
|
||||
# Output: sflow_sampling_rate_get_reply(_0=655, context=3, sampling_N=100)
|
||||
|
||||
vpp.api.sflow_polling_interval_set(polling_S=10)
|
||||
print(vpp.api.sflow_polling_interval_get())
|
||||
# Output: sflow_polling_interval_get_reply(_0=661, context=5, polling_S=10)
|
||||
|
||||
vpp.api.sflow_header_bytes_set(header_B=128)
|
||||
print(vpp.api.sflow_header_bytes_get())
|
||||
# Output: sflow_header_bytes_get_reply(_0=665, context=7, header_B=128)
|
||||
|
||||
vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True)
|
||||
vpp.api.sflow_enable_disable(hw_if_index=2, enable_disable=True)
|
||||
print(vpp.api.sflow_interface_dump())
|
||||
# Output: [ sflow_interface_details(_0=667, context=8, hw_if_index=1),
|
||||
# sflow_interface_details(_0=667, context=8, hw_if_index=2) ]
|
||||
|
||||
print(vpp.api.sflow_interface_dump(hw_if_index=2))
|
||||
# Output: [ sflow_interface_details(_0=667, context=9, hw_if_index=2) ]
|
||||
|
||||
print(vpp.api.sflow_interface_dump(hw_if_index=1234)) ## Invalid hw_if_index
|
||||
# Output: []
|
||||
|
||||
vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=False)
|
||||
print(vpp.api.sflow_interface_dump())
|
||||
# Output: [ sflow_interface_details(_0=667, context=10, hw_if_index=2) ]
|
||||
```
|
||||
|
||||
This short program toys around a bit with the sFlow API. I first set the sampling to 1:100 and get
|
||||
the current value. Then I set the polling interval to 10s and retrieve the current value again.
|
||||
Finally, I set the header bytes to 128, and retrieve the value again.
|
||||
|
||||
Enabling and disabling sFlow on interfaces shows the idiom I mentioned before - the API being an
|
||||
`*_enable_disable()` call of sorts, and typically taking a boolean argument if the operator wants to
|
||||
enable (the default), or disable sFlow on the interface. Getting the list of enabled interfaces can
|
||||
be done with the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details`
|
||||
messages.
|
||||
|
||||
I demonstrated VPP's Python API and how it works in a fair amount of detail in a [[previous
|
||||
article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you.
|
||||
|
||||
***3. VPPCfg YAML Configuration***
|
||||
|
||||
Writing on the CLI and calling the API is good and all, but many users of VPP have noticed that it
|
||||
does not have any form of configuration persistence and that's deliberate. VPP's goal is to be a
|
||||
programmable dataplane, and explicitly has left the programming and configuration as an exercise for
|
||||
integrators. I have written a Python project that takes a YAML file as input and uses it to
|
||||
configure (and reconfigure, on the fly) the dataplane automatically, called
|
||||
[[VPPcfg](https://git.ipng.ch/ipng/vppcfg.git)]. Previously, I wrote some implementation thoughts
|
||||
on its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2
|
||||
>}})] so I won't repeat that here. Instead, I will just show the configuration:
|
||||
|
||||
```
pim@vpp0-0:~$ cat << EOF > vppcfg.yaml
interfaces:
  GigabitEthernet10/0/0:
    sflow: true
  GigabitEthernet10/0/1:
    sflow: true
  GigabitEthernet10/0/2:
    sflow: true
  GigabitEthernet10/0/3:
    sflow: true

sflow:
  sampling-rate: 100
  polling-interval: 10
  header-bytes: 128
EOF
pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Loading configfile vppcfg.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.reconciler.write: Wrote 13 lines to /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Planning succeeded
pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp
```
|
||||
|
||||
The nifty thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to
|
||||
1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg
|
||||
apply` stages and the VPP dataplane will be reprogrammed to reflect the newly declared configuration.
|
||||
|
||||
### hsflowd: Configuration
|
||||
|
||||
When sFlow is enabled, VPP will start to emit _Netlink_ messages of type PSAMPLE with packet samples
|
||||
and of type USERSOCK with the custom messages containing interface names and counters. These latter
|
||||
custom messages have to be decoded, which is done by the _mod_vpp_ module in `hsflowd`, starting
|
||||
from release v2.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
|
||||
|
||||
Here's a minimalist configuration:
|
||||
|
||||
```
pim@vpp0-0:~$ cat /etc/hsflowd.conf
sflow {
  collector { ip=127.0.0.1 udpport=16343 }
  collector { ip=192.0.2.1 namespace=dataplane }
  psample { group=1 }
  vpp { osIndex=off }
}
```
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
There are two important details that can be confusing at first: \
|
||||
**1.** kernel network namespaces \
|
||||
**2.** interface index namespaces
|
||||
|
||||
#### hsflowd: Network namespace
|
||||
|
||||
Network namespaces virtualize Linux's network stack. Upon creation, a network namespace contains only
|
||||
a loopback interface, and subsequently interfaces can be moved between namespaces. Each network
|
||||
namespace will have its own set of IP addresses, its own routing table, socket listing, connection
|
||||
tracking table, firewall, and other network-related resources. When started by systemd, `hsflowd`
|
||||
and VPP will normally both run in the _default_ network namespace.
|
||||
|
||||
Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will
|
||||
naturally do this in the network namespace that its VPP process is running in (the _default_
|
||||
namespace, normally). It is therefore important that the recipient of these Netlink messages,
|
||||
notably `hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them together in
|
||||
a different namespace (eg. a container in Kubernetes or Docker), as long as they can see each other.
|
||||
|
||||
It might pose a problem if the network connectivity lives in a different namespace than the default
|
||||
one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface
|
||||
pairs, _LIPs_, in a dataplane namespace. The main reason for doing this is to allow something like
|
||||
FRR or Bird to completely govern the routing table in the kernel and keep it in-sync with the FIB in
|
||||
VPP. In such a _dataplane_ network namespace, typically every interface is owned by VPP.
|
||||
|
||||
Luckily, `hsflowd` can attach to one (default) namespace to get the PSAMPLEs, but create a socket in
|
||||
a _different_ (dataplane) namespace to send packets to a collector. This explains the second
|
||||
_collector_ entry in the config-file above. Here, `hsflowd` will send UDP packets to 192.0.2.1:6343
|
||||
from within the (VPP) dataplane namespace, and to 127.0.0.1:16343 in the default namespace.
|
||||
|
||||
#### hsflowd: osIndex
|
||||
|
||||
I hope the previous section made some sense, because this one will be a tad more esoteric. When
|
||||
creating a network namespace, each interface will get its own uint32 interface index that identifies
|
||||
it, and such an ID is typically called an `ifIndex`. It's important to note that the same number can
|
||||
(and will!) occur multiple times, once for each namespace. Let me give you an example:
|
||||
|
||||
```
|
||||
pim@summer:~$ ip link
|
||||
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
|
||||
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
||||
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ipng-sl state UP ...
|
||||
link/ether 00:22:19:6a:46:2e brd ff:ff:ff:ff:ff:ff
|
||||
altname enp1s0f0
|
||||
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 900 qdisc mq master ipng-sl state DOWN ...
|
||||
link/ether 00:22:19:6a:46:30 brd ff:ff:ff:ff:ff:ff
|
||||
altname enp1s0f1
|
||||
|
||||
pim@summer:~$ ip netns exec dataplane ip link
|
||||
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
|
||||
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
||||
2: loop0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
|
||||
link/ether de:ad:00:00:00:00 brd ff:ff:ff:ff:ff:ff
|
||||
3: xe1-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
|
||||
link/ether 00:1b:21:bd:c7:18 brd ff:ff:ff:ff:ff:ff
|
||||
```
|
||||
|
||||
I want to draw your attention to the number at the beginning of the line. In the _default_
|
||||
namespace, `ifIndex=3` corresponds to `ifName=eno2` (which has no link, it's marked `DOWN`). But in
|
||||
the _dataplane_ namespace, that index corresponds to a completely different interface called
|
||||
`ifName=xe1-0` (which is link `UP`).
|
||||
|
||||
Now, let me show you the interfaces in VPP:
|
||||
|
||||
```
|
||||
pim@summer:~$ vppctl show int | egrep 'Name|loop0|tap0|Gigabit'
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS)
|
||||
GigabitEthernet4/0/0 1 up 9000/0/0/0
|
||||
GigabitEthernet4/0/1 2 down 9000/0/0/0
|
||||
GigabitEthernet4/0/2 3 down 9000/0/0/0
|
||||
GigabitEthernet4/0/3 4 down 9000/0/0/0
|
||||
TenGigabitEthernet5/0/0 5 up 9216/0/0/0
|
||||
TenGigabitEthernet5/0/1 6 up 9216/0/0/0
|
||||
loop0 7 up 9216/0/0/0
|
||||
tap0 19 up 9216/0/0/0
|
||||
```
|
||||
|
||||
Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_
|
||||
(the software interface index, as opposed to hardware index). Here, `ifIndex=3` corresponds to
|
||||
`ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace!
|
||||
|
||||
It turns out that there are three (relevant) types of namespaces at play here:
|
||||
1. ***Linux network*** namespace; here using `dataplane` and `default` each with their own unique
|
||||
(and overlapping) numbering.
|
||||
1. ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP
|
||||
first attaches to or creates network interfaces like the ones from DPDK or RDMA, these will
|
||||
create an _hw_if_index_ in a list.
|
||||
1. ***VPP software*** interface namespace. All interfaces (including hardware ones!) will
|
||||
receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on
|
||||
GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available
|
||||
software index (in this example, `sw_if_index=7`).
|
||||
|
||||
In Linux CP, I can see a mapping from one to the other, just look at this:
|
||||
|
||||
```
|
||||
pim@summer:~$ vppctl show lcp
|
||||
lcp default netns dataplane
|
||||
lcp lcp-auto-subint off
|
||||
lcp lcp-sync on
|
||||
lcp lcp-sync-unnumbered on
|
||||
itf-pair: [0] loop0 tap0 loop0 2 type tap netns dataplane
|
||||
itf-pair: [1] TenGigabitEthernet5/0/0 tap1 xe1-0 3 type tap netns dataplane
|
||||
itf-pair: [2] TenGigabitEthernet5/0/1 tap2 xe1-1 4 type tap netns dataplane
|
||||
itf-pair: [3] TenGigabitEthernet5/0/0.20 tap1.20 xe1-0.20 5 type tap netns dataplane
|
||||
```
|
||||
|
||||
Those `itf-pair` describe our _LIPs_, and they have the coordinates to three things. 1) The VPP
|
||||
software interface (VPP `ifName=loop0` with `sw_if_index=7`), which 2) Linux CP will mirror into the
|
||||
Linux kernel using a TAP device (VPP `ifName=tap0` with `sw_if_index=19`). That TAP has one leg in
|
||||
VPP (`tap0`), and another in 3) Linux (with `ifName=loop0` and `ifIndex=2` in namespace `dataplane`).
|
||||
|
||||
> So the tuple that fully describes a _LIP_ is `{7, 19,'dataplane', 2}`
|
||||
|
||||
Climbing back out of that rabbit hole, I am now finally ready to explain the feature. When sFlow in
|
||||
VPP takes its sample, it will be doing this on a PHY, that is a given interface with a specific
|
||||
_hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has a
|
||||
choice: should it share with the world the representation of *its* namespace, or should it try to be
|
||||
smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the
|
||||
plugin will first resolve the _sw_if_index_ belonging to that PHY, and using that, try to look up a
|
||||
_LIP_ with it. If it finds one, it'll know both the namespace in which it lives as well as the
|
||||
osIndex in that namespace. If it doesn't find a _LIP_, it will at least have the _sw_if_index_ at
|
||||
hand, so it'll annotate the USERSOCK counter messages with this information instead.
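
In pseudo-Python, the lookup described above goes roughly like this. The table names and shapes are
illustrative only; the actual resolution happens inside the plugin with help from the Linux CP
plugin:

```python
def resolve_ifindex(hw_if_index, hw_to_sw, lips):
    """Prefer the Linux (namespace, ifIndex) of a LIP if one exists for this PHY,
    otherwise fall back to VPP's own sw_if_index."""
    sw_if_index = hw_to_sw[hw_if_index]
    if sw_if_index in lips:
        netns, os_index = lips[sw_if_index]
        return {"netns": netns, "ifindex": os_index, "source": "linux-cp"}
    return {"netns": None, "ifindex": sw_if_index, "source": "vpp"}

# TenGigabitEthernet5/0/0 (sw_if_index 5) has a LIP whose Linux side is xe1-0,
# ifIndex 3 in the 'dataplane' namespace; the hw_if_index value here is hypothetical.
hw_to_sw = {1: 5}
lips = {5: ("dataplane", 3)}
print(resolve_ifindex(1, hw_to_sw, lips))  # {'netns': 'dataplane', 'ifindex': 3, 'source': 'linux-cp'}
```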
|
||||
|
||||
Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an
|
||||
implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases
|
||||
relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on`
|
||||
(use Linux ifIndex) or `off` (use VPP _sw_if_index_).
|
||||
|
||||
### hsflowd: Host Counters
|
||||
|
||||
Now that I understand the configuration parts of VPP and `hsflowd`, I decide to configure everything
|
||||
but without enabling sFlow on any interfaces yet in VPP. Once I start the daemon, I can see that
|
||||
it sends a UDP packet every 30 seconds to the configured _collector_:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n
|
||||
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
|
||||
listening on lo, link-type EN10MB (Ethernet), snapshot length 9000 bytes
|
||||
15:34:19.695042 IP 127.0.0.1.48753 > 127.0.0.1.6343: sFlowv5,
|
||||
IPv4 agent 198.19.5.16, agent-id 100000, length 716
|
||||
```
|
||||
|
||||
The `tcpdump` I have on my Debian bookworm machines doesn't know how to decode the contents of these
|
||||
sFlow packets. Actually, neither does Wireshark. I've attached a file of these mysterious packets
|
||||
[[sflow-host.pcap](/assets/sflow/sflow-host.pcap)] in case you want to take a look.
|
||||
Neil however gives me a tip. A full message decoder and otherwise handy Swiss army knife lives in
|
||||
[[sflowtool](https://github.com/sflow/sflowtool)].
|
||||
|
||||
I can offer this pcap file to `sflowtool`, or let it just listen on the UDP port directly, and
|
||||
it'll tell me what it finds:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ sflowtool -p 6343
|
||||
startDatagram =================================
|
||||
datagramSourceIP 127.0.0.1
|
||||
datagramSize 716
|
||||
unixSecondsUTC 1739112018
|
||||
localtime 2025-02-09T15:40:18+0100
|
||||
datagramVersion 5
|
||||
agentSubId 100000
|
||||
agent 198.19.5.16
|
||||
packetSequenceNo 57
|
||||
sysUpTime 987398
|
||||
samplesInPacket 1
|
||||
startSample ----------------------
|
||||
sampleType_tag 0:4
|
||||
sampleType COUNTERSSAMPLE
|
||||
sampleSequenceNo 33
|
||||
sourceId 2:1
|
||||
counterBlock_tag 0:2001
|
||||
adaptor_0_ifIndex 2
|
||||
adaptor_0_MACs 1
|
||||
adaptor_0_MAC_0 525400f00100
|
||||
counterBlock_tag 0:2010
|
||||
udpInDatagrams 123904
|
||||
udpNoPorts 23132459
|
||||
udpInErrors 0
|
||||
udpOutDatagrams 46480629
|
||||
udpRcvbufErrors 0
|
||||
udpSndbufErrors 0
|
||||
udpInCsumErrors 0
|
||||
counterBlock_tag 0:2009
|
||||
tcpRtoAlgorithm 1
|
||||
tcpRtoMin 200
|
||||
tcpRtoMax 120000
|
||||
tcpMaxConn 4294967295
|
||||
tcpActiveOpens 0
|
||||
tcpPassiveOpens 30
|
||||
tcpAttemptFails 0
|
||||
tcpEstabResets 0
|
||||
tcpCurrEstab 1
|
||||
tcpInSegs 89120
|
||||
tcpOutSegs 86961
|
||||
tcpRetransSegs 59
|
||||
tcpInErrs 0
|
||||
tcpOutRsts 4
|
||||
tcpInCsumErrors 0
|
||||
counterBlock_tag 0:2008
|
||||
icmpInMsgs 23129314
|
||||
icmpInErrors 32
|
||||
icmpInDestUnreachs 0
|
||||
icmpInTimeExcds 23129282
|
||||
icmpInParamProbs 0
|
||||
icmpInSrcQuenchs 0
|
||||
icmpInRedirects 0
|
||||
icmpInEchos 0
|
||||
icmpInEchoReps 32
|
||||
icmpInTimestamps 0
|
||||
icmpInAddrMasks 0
|
||||
icmpInAddrMaskReps 0
|
||||
icmpOutMsgs 0
|
||||
icmpOutErrors 0
|
||||
icmpOutDestUnreachs 23132467
|
||||
icmpOutTimeExcds 0
|
||||
icmpOutParamProbs 23132467
|
||||
icmpOutSrcQuenchs 0
|
||||
icmpOutRedirects 0
|
||||
icmpOutEchos 0
|
||||
icmpOutEchoReps 0
|
||||
icmpOutTimestamps 0
|
||||
icmpOutTimestampReps 0
|
||||
icmpOutAddrMasks 0
|
||||
icmpOutAddrMaskReps 0
|
||||
counterBlock_tag 0:2007
|
||||
ipForwarding 2
|
||||
ipDefaultTTL 64
|
||||
ipInReceives 46590552
|
||||
ipInHdrErrors 0
|
||||
ipInAddrErrors 0
|
||||
ipForwDatagrams 0
|
||||
ipInUnknownProtos 0
|
||||
ipInDiscards 0
|
||||
ipInDelivers 46402357
|
||||
ipOutRequests 69613096
|
||||
ipOutDiscards 0
|
||||
ipOutNoRoutes 80
|
||||
ipReasmTimeout 0
|
||||
ipReasmReqds 0
|
||||
ipReasmOKs 0
|
||||
ipReasmFails 0
|
||||
ipFragOKs 0
|
||||
ipFragFails 0
|
||||
ipFragCreates 0
|
||||
counterBlock_tag 0:2005
|
||||
disk_total 6253608960
|
||||
disk_free 2719039488
|
||||
disk_partition_max_used 56.52
|
||||
disk_reads 11512
|
||||
disk_bytes_read 626214912
|
||||
disk_read_time 48469
|
||||
disk_writes 1058955
|
||||
disk_bytes_written 8924332032
|
||||
disk_write_time 7954804
|
||||
counterBlock_tag 0:2004
|
||||
mem_total 8326963200
|
||||
mem_free 5063872512
|
||||
mem_shared 0
|
||||
mem_buffers 86425600
|
||||
mem_cached 827752448
|
||||
swap_total 0
|
||||
swap_free 0
|
||||
page_in 306365
|
||||
page_out 4357584
|
||||
swap_in 0
|
||||
swap_out 0
|
||||
counterBlock_tag 0:2003
|
||||
cpu_load_one 0.030
|
||||
cpu_load_five 0.050
|
||||
cpu_load_fifteen 0.040
|
||||
cpu_proc_run 1
|
||||
cpu_proc_total 138
|
||||
cpu_num 2
|
||||
cpu_speed 1699
|
||||
cpu_uptime 1699306
|
||||
cpu_user 64269210
|
||||
cpu_nice 1810
|
||||
cpu_system 34690140
|
||||
cpu_idle 3234293560
|
||||
cpu_wio 3568580
|
||||
cpuintr 0
|
||||
cpu_sintr 5687680
|
||||
cpuinterrupts 1596621688
|
||||
cpu_contexts 3246142972
|
||||
cpu_steal 329520
|
||||
cpu_guest 0
|
||||
cpu_guest_nice 0
|
||||
counterBlock_tag 0:2006
|
||||
nio_bytes_in 250283
|
||||
nio_pkts_in 2931
|
||||
nio_errs_in 0
|
||||
nio_drops_in 0
|
||||
nio_bytes_out 370244
|
||||
nio_pkts_out 1640
|
||||
nio_errs_out 0
|
||||
nio_drops_out 0
|
||||
counterBlock_tag 0:2000
|
||||
hostname vpp0-0
|
||||
UUID ec933791-d6af-7a93-3b8d-aab1a46d6faa
|
||||
machine_type 3
|
||||
os_name 2
|
||||
os_release 6.1.0-26-amd64
|
||||
endSample ----------------------
|
||||
endDatagram =================================
|
||||
```

If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
this form of heartbeat from the _agent_. There are a lot of vital signs in this packet, including
some non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel
version information. It's super dope!

### hsflowd: Interface Counters

Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to
something very coarse (1 in 100M, so that packet samples stay out of the way for now), and the
interface polling-interval to every 10 seconds, roughly as sketched below. And indeed, every ten
seconds or so I get a few packets, which I captured in
[[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Most of the packets contain only one
counter record, while some contain more than one (in the PCAP, packet #9 has two). If I update the
polling-interval to every second, I can see that most of the packets have all four counters.

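On the VPP side, that configuration is just a handful of CLI statements. I'm reproducing them from
memory here, so take the exact verbs as an assumption and check `vppctl sflow ?` on your build for
the authoritative syntax:

```
pim@vpp0-0:~$ sudo vppctl sflow sampling-rate 100000000   # 1-in-100M: effectively counters only
pim@vpp0-0:~$ sudo vppctl sflow polling-interval 10       # interface counters every 10 seconds
pim@vpp0-0:~$ sudo vppctl sflow enable GigabitEthernet10/0/0
pim@vpp0-0:~$ sudo vppctl sflow enable GigabitEthernet10/0/1
pim@vpp0-0:~$ sudo vppctl sflow enable GigabitEthernet10/0/2
pim@vpp0-0:~$ sudo vppctl sflow enable GigabitEthernet10/0/3
```
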
Those interface counters, as decoded by `sflowtool`, look like this:

```
|
||||
pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \
|
||||
awk '/startSample/ { on=1 } { if (on) { print $0 } } /endSample/ { on=0 }'
|
||||
startSample ----------------------
|
||||
sampleType_tag 0:4
|
||||
sampleType COUNTERSSAMPLE
|
||||
sampleSequenceNo 745
|
||||
sourceId 0:3
|
||||
counterBlock_tag 0:1005
|
||||
ifName GigabitEthernet10/0/2
|
||||
counterBlock_tag 0:1
|
||||
ifIndex 3
|
||||
networkType 6
|
||||
ifSpeed 0
|
||||
ifDirection 1
|
||||
ifStatus 3
|
||||
ifInOctets 858282015
|
||||
ifInUcastPkts 780540
|
||||
ifInMulticastPkts 0
|
||||
ifInBroadcastPkts 0
|
||||
ifInDiscards 0
|
||||
ifInErrors 0
|
||||
ifInUnknownProtos 0
|
||||
ifOutOctets 1246716016
|
||||
ifOutUcastPkts 975772
|
||||
ifOutMulticastPkts 0
|
||||
ifOutBroadcastPkts 0
|
||||
ifOutDiscards 127
|
||||
ifOutErrors 28
|
||||
ifPromiscuousMode 0
|
||||
endSample ----------------------
|
||||
```

What I find particularly cool about it is that sFlow provides an automatic mapping between the
`ifName=GigabitEthernet10/0/2` (tag 0:1005) and an object (tag 0:1) which contains the
`ifIndex=3`, plus lots of packet and octet counters in both the ingress and egress direction. This is
super useful for upstream _collectors_, as they can now find the hostname, agent name and address,
and the correlation between interface names and their indexes. Noice!

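If you want to pull that name-to-index mapping out programmatically, the plain-text output of
`sflowtool` makes it a one-liner; the sketch below prints one `ifIndex -> ifName` line per counter
record found in the pcap:

```
pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \
    awk '$1=="ifName" { name=$2 } $1=="ifIndex" { print $2, "->", name }'
```
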
#### hsflowd: Packet Samples

Now it's time to ratchet up the packet sampling, so I move it from 1:100M to 1:1000, while keeping
the interface polling-interval at 10 seconds, and I ask VPP to sample 64 bytes of each packet that it
inspects. On either side of my pet VPP instance, I start an `iperf3` run to generate some traffic. I
now see a healthy stream of sFlow packets coming in on port 6343. They still contain a host counter
every 30 seconds or so, and every 10 seconds a set of interface counters comes by, but mostly
these UDP packets are showing me samples. I've captured a few minutes of these in
[[sflow-all.pcap](/assets/sflow/sflow-all.pcap)].
Although Wireshark doesn't know how to interpret the sFlow counter messages, it _does_ know how to
interpret the sFlow sample messages, and it reveals one of them like this:

{{< image width="100%" src="/assets/sflow/sflow-wireshark.png" alt="sFlow Wireshark" >}}

Let me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1:48753
to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as
having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to
send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It
then shows that sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those
are sampled. Finally, the first sampled packet starts at the blue line. It shows the SrcMAC and
DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my running
`iperf3`, booyah!

### VPP: sFlow Performance

{{< image float="right" src="/assets/sflow/sflow-lab.png" alt="sFlow Lab" width="20em" >}}

One question I get a lot about this plugin is: what is the performance impact when using
sFlow? I spent a considerable amount of time tinkering with this and, together with Neil, bringing
the plugin to what we both agree is the most efficient use of CPU. We could have gone a bit further,
but that would require somewhat intrusive changes to VPP's internals and as _North of the Border_
(and the Simpsons!) would say: what we have isn't just good, it's good enough!

I've built a small testbed based on two Dell R730 machines. On the left, I have a Debian machine
running Cisco T-Rex using four quad-tengig network cards, the classic Intel X710-DA4. On the right,
I have my VPP machine called _Hippo_ (because it's always hungry for packets), with the same
hardware. I'll build two halves. On the top NIC (Te3/0/0-3 in VPP), I will install IPv4 and MPLS
forwarding on the purple circuit, and a simple Layer2 cross connect on the cyan circuit. On all four
interfaces, I will enable sFlow. Then, I will mirror this configuration on the bottom NIC
(Te130/0/0-3) in the red and green circuits, for which I will leave sFlow turned off.

To help you reproduce my results, and under the assumption that this is your jam, here's the
configuration for all of the kit:

***0. Cisco T-Rex***
```
|
||||
pim@trex:~ $ cat /srv/trex/8x10.yaml
|
||||
- version: 2
|
||||
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
|
||||
port_info:
|
||||
- src_mac: 00:1b:21:06:00:00
|
||||
dest_mac: 9c:69:b4:61:a1:dc # Connected to Hippo Te3/0/0, purple
|
||||
- src_mac: 00:1b:21:06:00:01
|
||||
dest_mac: 9c:69:b4:61:a1:dd # Connected to Hippo Te3/0/1, purple
|
||||
- src_mac: 00:1b:21:83:00:00
|
||||
dest_mac: 00:1b:21:83:00:01 # L2XC via Hippo Te3/0/2, cyan
|
||||
- src_mac: 00:1b:21:83:00:01
|
||||
dest_mac: 00:1b:21:83:00:00 # L2XC via Hippo Te3/0/3, cyan
|
||||
|
||||
- src_mac: 00:1b:21:87:00:00
|
||||
dest_mac: 9c:69:b4:61:75:d0 # Connected to Hippo Te130/0/0, red
|
||||
- src_mac: 00:1b:21:87:00:01
|
||||
dest_mac: 9c:69:b4:61:75:d1 # Connected to Hippo Te130/0/1, red
|
||||
- src_mac: 9c:69:b4:85:00:00
|
||||
dest_mac: 9c:69:b4:85:00:01 # L2XC via Hippo Te130/0/2, green
|
||||
- src_mac: 9c:69:b4:85:00:01
|
||||
dest_mac: 9c:69:b4:85:00:00 # L2XC via Hippo Te130/0/3, green
|
||||
pim@trex:~ $ sudo t-rex-64 -i -c 4 --cfg /srv/trex/8x10.yaml
|
||||
```

When constructing the T-Rex configuration, I specifically set the destination MAC address for L3
circuits (the purple and red ones) using Hippo's interface MAC address, which I can find with
`vppctl show hardware-interfaces`. This way, T-Rex does not have to ARP for the VPP endpoint. On
L2XC circuits (the cyan and green ones), VPP does not concern itself with the MAC addressing at
all. It puts its interface in _promiscuous_ mode, and simply writes any ethernet frame it receives
directly out on the egress interface.

***1. IPv4***
```
|
||||
hippo# set int state TenGigabitEthernet3/0/0 up
|
||||
hippo# set int state TenGigabitEthernet3/0/1 up
|
||||
hippo# set int state TenGigabitEthernet130/0/0 up
|
||||
hippo# set int state TenGigabitEthernet130/0/1 up
|
||||
hippo# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
|
||||
hippo# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
|
||||
hippo# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
|
||||
hippo# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
|
||||
hippo# ip route add 16.0.0.0/24 via 100.64.0.0
|
||||
hippo# ip route add 48.0.0.0/24 via 100.64.1.0
|
||||
hippo# ip route add 16.0.2.0/24 via 100.64.4.0
|
||||
hippo# ip route add 48.0.2.0/24 via 100.64.5.0
|
||||
hippo# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
|
||||
hippo# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
|
||||
hippo# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
|
||||
hippo# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
|
||||
```

By the way, one note on this last piece: I'm setting static IPv4 neighbors so that Cisco T-Rex
as well as VPP do not have to use ARP to resolve each other. You'll see above that the T-Rex
configuration also uses MAC addresses exclusively. Setting the `ip neighbor` like this allows VPP
to know where to send return traffic.

***2. MPLS***
```
|
||||
hippo# mpls table add 0
|
||||
hippo# set interface mpls TenGigabitEthernet3/0/0 enable
|
||||
hippo# set interface mpls TenGigabitEthernet3/0/1 enable
|
||||
hippo# set interface mpls TenGigabitEthernet130/0/0 enable
|
||||
hippo# set interface mpls TenGigabitEthernet130/0/1 enable
|
||||
hippo# mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
|
||||
hippo# mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
|
||||
hippo# mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
|
||||
hippo# mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
|
||||
```

Here, the MPLS configuration implements a simple P-router, where incoming MPLS packets with label 16
will be sent back to T-Rex on Te3/0/1 to the specified IPv4 nexthop (for which I already know the
MAC address), with label 16 removed and a new label 17 imposed - in other words, a SWAP operation.

***3. L2XC***
```
|
||||
hippo# set int state TenGigabitEthernet3/0/2 up
|
||||
hippo# set int state TenGigabitEthernet3/0/3 up
|
||||
hippo# set int state TenGigabitEthernet130/0/2 up
|
||||
hippo# set int state TenGigabitEthernet130/0/3 up
|
||||
hippo# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
|
||||
hippo# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
|
||||
hippo# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
|
||||
hippo# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
|
||||
```

I've added a layer2 cross connect as well because it's computationally very cheap for VPP to receive
an L2 (ethernet) datagram and immediately transmit it on another interface. There's no FIB lookup
and not even an L2 nexthop lookup involved: VPP is just shoveling ethernet packets in-and-out as
fast as it can!

Here's what a loadtest looks like when sending 80Gbps at 192b packets on all eight interfaces:

{{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}}

The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p2 <-> p3 are sending ethernet back
and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These
four ports are my experiment, to show the CPU use of sFlow. Then, ports p4 <-> p5 and p6 <-> p7
respectively have sFlow turned off but otherwise the same configuration. They are my control, showing
the CPU use without sFlow.

**First conclusion**: This stuff works a treat. There is absolutely no impact on throughput at
80Gbps with 47.6Mpps either _with_ or _without_ sFlow turned on. That's wonderful news, as it shows
that the dataplane has more CPU available than is needed for any combination of functionality.

But what _is_ the limit? For this, I'll take a deeper look at the runtime statistics, comparing the
CPU time spent and the maximum throughput achievable on a single VPP worker, thus using a single CPU
thread on this Hippo machine that has 44 cores and 44 hyperthreads. I switch the loadtester to emit
64 byte ethernet packets, the smallest I'm allowed to send.

| Loadtest | no sFlow | 1:1'000'000 | 1:10'000 | 1:1'000 | 1:100 |
|-------------|-----------|-----------|-----------|-----------|-----------|
| L2XC | 14.88Mpps | 14.32Mpps | 14.31Mpps | 14.27Mpps | 14.15Mpps |
| IPv4 | 10.89Mpps | 9.88Mpps | 9.88Mpps | 9.84Mpps | 9.73Mpps |
| MPLS | 10.11Mpps | 9.52Mpps | 9.52Mpps | 9.51Mpps | 9.45Mpps |
| ***sFlow Packets*** / 10sec | N/A | 337.42M total | 337.39M total | 336.48M total | 333.64M total |
| .. Sampled | | 328 | 33.8k | 336k | 3.34M |
| .. Sent | | 328 | 33.8k | 336k | 1.53M |
| .. Dropped | | 0 | 0 | 0 | 1.81M |

Here I can make a few important observations.

**Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned
off, which implies that it has a little bit of CPU left over to do other work, if needed. With IPv4,
I can see that the throughput is actually CPU limited: 10.89Mpps can be handled by one worker. I
know that MPLS is a little bit more expensive computationally than IPv4, and that checks out: the
total capacity is 10.11Mpps for one worker, when sFlow is turned off.

**Overhead**: When I turn on sFlow on the interface, VPP will insert the _sflow-node_ into the
forwarding graph between `device-input` and `ethernet-input`. It means that the sFlow node will see
_every single_ packet, and it will have to move all of these into the next node, which costs about
9.5 CPU cycles per packet. The regression on L2XC is 3.8%, but I have to note that VPP was not CPU
bound on the L2XC path, so it first used up the CPU cycles that were still available before
regressing throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS, only to
shuffle the packets through the graph.

**Sampling Cost**: When then sampling at higher rates, the further regression is not _that_
terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the
worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%. The
regressions for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8% (MPLS).
Of course, by using multiple hardware receive queues and multiple RX workers per interface, the cost
can be kept well in hand.

**Overload Protection**: At 1:1'000 and an effective rate of 33.65Mpps across all ports, I correctly
observe 336k samples taken and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, and they
do not all fit through the FIFO, so the plugin drops samples to protect the downstream
`sflow-main` thread and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M
samples made it through. By the way, this means VPP is happily sending a whopping 153K samples/sec
to the collector!

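To make those numbers concrete, here is the back-of-the-envelope arithmetic, taken straight from the
table above (each measurement window is 10 seconds):

```
IPv4 node overhead : (10.89 - 9.88) / 10.89  ≈ 9.3% regression just for inserting the sflow-node
L2XC sampling cost : (14.32 - 14.15) / 14.32 ≈ 1.2% going from 1:1'000'000 to 1:100 sampling
1:1'000 samples    : 336.48M pkts / 1'000    ≈ 336k sampled, all of them sent to PSAMPLE
1:100 samples      : 333.64M pkts / 100      ≈ 3.34M sampled = 1.53M sent + 1.81M dropped (FIFO)
Collector load     : 1.53M sent / 10 seconds ≈ 153k samples/sec towards hsflowd and the collector
```
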
## What's Next

Now that I've seen the UDP packets from our agent to a collector on the wire, and also how
incredibly efficient the sFlow sampling implementation turned out, I'm super motivated to
continue the journey with higher level collector receivers like ntopng, sflow-rt or Akvorado. In an
upcoming article, I'll describe how I rolled out Akvorado at IPng, and what types of changes would
make the user experience even better (or simpler to understand, at least).

### Acknowledgements

I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
finer details such as logging, error handling, API specifications, and documentation. He has been a
true pleasure to work with and learn from. Also, thank you to the VPP developer community, notably
Benoit, Florin, Damjan, Dave and Matt, for helping with the review and getting this thing merged in
time for the 25.02 release.

793
content/articles/2025-04-09-frysix-evpn.md
Normal file
@@ -0,0 +1,793 @@

---
date: "2025-04-09T07:51:23Z"
title: 'FrysIX eVPN: think different'
---

{{< image float="right" src="/assets/frys-ix/frysix-logo-small.png" alt="FrysIX Logo" width="12em" >}}

# Introduction

Somewhere in the far north of the Netherlands, the country where I was born, a town called Jubbega
is the home of the Frysian Internet Exchange called [[Frys-IX](https://frys-ix.net/)]. Back in 2021,
a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHEF facility, one of
the most densely populated facilities in western Europe. He was looking for a few launching
customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on
my [[bucketlist]({{< ref 2021-07-26-bucketlist.md >}})]. Arend and his IT company
[[ERITAP](https://www.eritap.com/)] took delivery of that rack in May of 2021, and this is when the
internet exchange with _Frysian roots_ was born.

In the years from 2021 until now, Arend and I have been operating the exchange with reasonable
success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs
with about ten switches in six datacenters across the Amsterdam metro area. It's shifting a cool
800Gbit of traffic or so. It's dope, and very rewarding to be able to contribute to this community!

## Frys-IX is growing

We have several members with a 2x100G LAG, and even though all inter-datacenter links are either dark
fiber or WDM, we're starting to feel the growing pains as we set our sights on the next step of
growth. You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. When it did
133.7Gbit of traffic, Arend organized an even bigger barbecue. Obviously, the next step is 1337Gbit
and joining the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're
on our way!

It became clear that we will not be able to keep a dependable peering platform if FrysIX remains a
single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be
operationally expensive (think of all the dark fiber or WDM waves!) and brittle (think of LACP and
balancing traffic over those ports). We need to modernize in order to stay ahead of the growth
curve.

## Hello Nokia

{{< image float="right" src="/assets/frys-ix/nokia-7220-d4.png" alt="Nokia 7220-D4" width="20em" >}}

The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration,
high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity
to your data center networks and peering network environments. These devices are built around the
Broadcom _Trident_ chipset; in the case of the "D4" platform, this is a Trident4 with 28x100G and
8x400G ports. Whoot!

{{< image float="right" src="/assets/frys-ix/IXR-7220-D3.jpg" alt="Nokia 7220-D3" width="20em" >}}

What I find particularly awesome about the Trident series is their speed (a total bandwidth of
12.8Tbps _per router_), low power use (without optics, the IXR-7220-D4 consumes about 150W) and
a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern
approaches to scale-out networking such as VXLAN-based EVPN. At the FrysIX barbecue in September of
2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right.
That's a 32x100G router.

ERITAP has bought two (new in box) IXR-7220-D4 (8x400G, 28x100G) routers, and has also acquired two
IXR-7220-D2 (48x25G, 8x100G) routers. So in total, FrysIX is now the proud owner of five of these
beautiful Nokia devices. If you haven't yet, you should definitely read about these versatile
routers on the [[Nokia](https://onestore.nokia.com/asset/207599)] website, and some details of the
_merchant silicon_ switch chips in use on the
[[Broadcom](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56880-series)]
website.

### eVPN: A small rant

{{< image float="right" src="/assets/frys-ix/FrysIX_ Topology (concept).svg" alt="Topology Concept" width="50%" >}}

First, I need to get something off my chest. Consider a topology for an internet exchange platform,
taking into account the available equipment, rackspace, power, and cross connects. Somehow, almost
every design or reference architecture I can find on the Internet assumes folks want to build a
[[Clos network](https://en.wikipedia.org/wiki/Clos_network)], which has a topology consisting of leaf
and spine switches. The _spine_ switches have a different set of features than the _leaf_ ones;
notably, they don't have to do provider edge functionality like VXLAN encap and decapsulation.
Almost all of these designs show how one might build a leaf-spine network for hyperscale.

**Critique 1**: my 'spine' (IXR-7220-D4 routers) must also be provider edge. Practically speaking,
in the picture above I have these beautiful Nokia IXR-7220-D4 routers, using two 400G ports to
connect between the facilities, and six 100G ports to connect the smaller breakout switches. That
would leave a _massive_ amount of capacity unused: 22x100G and 6x400G ports, to be exact.

**Critique 2**: the 'leaf' devices (either IXR-7220-D2 routers or Arista switches) can't realistically
connect to both 'spines'. Our devices are spread out over two (and in practice, more like six)
datacenters, and it's prohibitively expensive to get 100G waves or dark fiber to create a full mesh.
It's much more economical to create a star-topology that minimizes cross-datacenter fiber spans.

**Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway
protocol is eBGP in what they call the _underlay_, and on top of that, some secondary eBGP that's
called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume
hundreds of switches, in which case making use of one AS number per switch could make sense, as iBGP
needs either a 'full mesh', or external route reflectors.

**Critique 4**: These reference designs also assume that all fiber is local and that, while
optics and links can fail, it will be relatively rare to _drain_ a link. However, in
cross-datacenter networks, draining links for maintenance is very common, for example if the dark
fiber provider needs to perform repairs on a span that was damaged. With these eBGP-over-eBGP
connections, traffic engineering is more difficult than simply raising the OSPF (or IS-IS) cost of a
link to reroute traffic.

Setting aside eVPN for a second, if I were to build an IP transport network, like I did when I built
[[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a much more intuitive
and simple (I would even dare say elegant) design:

1. Take a classic IGP like [[OSPF](https://en.wikipedia.org/wiki/Open_Shortest_Path_First)], or
perhaps [[IS-IS](https://en.wikipedia.org/wiki/IS-IS)]. There is no benefit, to me at least, to
using BGP as an IGP.
1. I would give each of the links between the switches an IPv4 /31 and enable link-local, and give
each switch a loopback address with a /32 IPv4 and a /128 IPv6.
1. If I had multiple links between two given switches, I would probably just use ECMP if my devices
supported it, and fall back to a LACP-signaled bundle-ethernet otherwise.
1. If I were to need to use BGP (and for eVPN, this need exists), taking the ISP mindset (as opposed
to the datacenter fabric mindset), I would simply install iBGP against two or three route
reflectors, and exchange routing information within the same single AS number.

### eVPN: A demo topology

{{< image float="right" src="/assets/frys-ix/Nokia Arista VXLAN.svg" alt="Demo topology" width="50%" >}}

So, that's exactly how I'm going to approach the FrysIX eVPN design: OSPF for the underlay and iBGP
for the overlay! I have a feeling that some folks will despise me for being contrarian, but you can
leave your comments below, and don't forget to like-and-subscribe :-)

Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two
400G-capable routers and connects them. Then he takes an Arista DCS-7060CX switch, which is eVPN
capable, with 32x100G ports, based on the Broadcom Tomahawk chipset, and a smaller Nokia
IXR-7220-D2 with 48x25G and 8x100G ports, based on the Trident3 chipset. He wires all of this up
to look like the picture on the right.

#### Underlay: Nokia's SR Linux

We boot up the equipment, verify that all the optics and links are up, and connect the management
ports to an OOB network that I can remotely log in to. This is the first time that either of us has
worked on Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek.

```
|
||||
[pim@nikhef ~]$ sr_cli
|
||||
--{ running }--[ ]--
|
||||
A:pim@nikhef# enter candidate
|
||||
--{ candidate shared default }--[ ]--
|
||||
A:pim@nikhef# set / interface lo0 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
|
||||
A:pim@nikhef# commit stay
|
||||
```

There, my first config snippet! This creates a _loopback_ interface, and similar to JunOS, a
_subinterface_ (which Juniper calls a _unit_) which enables IPv4 and gives it a /32 address. In SR
Linux, any interface has to be associated with a _network-instance_; think of those as routing
domains or VRFs. There's a conveniently named _default_ network-instance, to which I'll add this and
the point-to-point interface between the two 400G routers:

```
|
||||
A:pim@nikhef# info flat interface ethernet-1/29
|
||||
set / interface ethernet-1/29 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
|
||||
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
|
||||
|
||||
A:pim@nikhef# set / network-instance default type default
|
||||
A:pim@nikhef# set / network-instance default admin-state enable
|
||||
A:pim@nikhef# set / network-instance default interface ethernet-1/29.0
|
||||
A:pim@nikhef# set / network-instance default interface lo0.0
|
||||
A:pim@nikhef# commit stay
|
||||
```

Cool. Assuming I now also do this on the other IXR-7220-D4 router, called _equinix_ (which gets the
loopback address 198.19.16.0/32 and the point-to-point on the 400G interface of 198.19.17.0/31), I
should be able to do my first jumboframe ping:

```
|
||||
A:pim@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do
|
||||
Using network instance default
|
||||
PING 198.19.17.1 (198.19.17.1) 9162(9190) bytes of data.
|
||||
9170 bytes from 198.19.17.1: icmp_seq=1 ttl=64 time=0.466 ms
|
||||
9170 bytes from 198.19.17.1: icmp_seq=2 ttl=64 time=0.477 ms
|
||||
9170 bytes from 198.19.17.1: icmp_seq=3 ttl=64 time=0.547 ms
|
||||
```

#### Underlay: SR Linux OSPF

OK, let's get these two Nokia routers to speak OSPF, so that they can reach each other's loopback.
It's really easy:

```
|
||||
A:pim@nikhef# / network-instance default protocols ospf instance default
|
||||
--{ candidate shared default }--[ network-instance default protocols ospf instance default ]--
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set version ospf-v2
|
||||
A:pim@nikhef# set router-id 198.19.16.1
|
||||
A:pim@nikhef# set area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
|
||||
A:pim@nikhef# set area 0.0.0.0 interface lo0.0 passive true
|
||||
A:pim@nikhef# commit stay
|
||||
```

Similar to JunOS, I can descend into a configuration scope: the first line goes into the
_network-instance_ called `default`, then the _protocols_ called `ospf`, and then the _instance_
called `default`. Subsequent `set` commands operate at this scope. Once I commit this configuration
(on the _nikhef_ router and also the _equinix_ router, with its own unique router-id), OSPF quickly
springs into action:

```
|
||||
A:pim@nikhef# show network-instance default protocols ospf neighbor
|
||||
=========================================================================================
|
||||
Net-Inst default OSPFv2 Instance default Neighbors
|
||||
=========================================================================================
|
||||
+---------------------------------------------------------------------------------------+
|
||||
| Interface-Name Rtr Id State Pri RetxQ Time Before Dead |
|
||||
+=======================================================================================+
|
||||
| ethernet-1/29.0 198.19.16.0 full 1 0 36 |
|
||||
+---------------------------------------------------------------------------------------+
|
||||
-----------------------------------------------------------------------------------------
|
||||
No. of Neighbors: 1
|
||||
=========================================================================================
|
||||
|
||||
A:pim@nikhef# show network-instance default route-table all | more
|
||||
IPv4 unicast route table of network instance default
|
||||
+------------------+-----+------------+--------------+--------+----------+--------+------+-------------+-----------------+
|
||||
| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop | Next-hop |
|
||||
| | | | | | Network | | | (Type) | Interface |
|
||||
| | | | | | Instance | | | | |
|
||||
+==================+=====+============+==============+========+==========+========+======+=============+=================+
|
||||
| 198.19.16.0/32 | 0 | ospfv2 | ospf_mgr | True | default | 1 | 10 | 198.19.17.0 | ethernet-1/29.0 |
|
||||
| | | | | | | | | (direct) | |
|
||||
| 198.19.16.1/32 | 7 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
|
||||
| 198.19.17.0/31 | 6 | local | net_inst_mgr | True | default | 0 | 0 | 198.19.17.1 | ethernet-1/29.0 |
|
||||
| | | | | | | | | (direct) | |
|
||||
| 198.19.17.1/32 | 6 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
|
||||
+==================+=====+============+==============+========+==========+========+======+=============+=================+
|
||||
|
||||
A:pim@nikhef# ping network-instance default 198.19.16.0
|
||||
Using network instance default
|
||||
PING 198.19.16.0 (198.19.16.0) 56(84) bytes of data.
|
||||
64 bytes from 198.19.16.0: icmp_seq=1 ttl=64 time=0.484 ms
|
||||
64 bytes from 198.19.16.0: icmp_seq=2 ttl=64 time=0.663 ms
|
||||
```

Delicious! OSPF has learned the loopback, and it is now reachable. As with most things, going from 0
to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. Going
from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on,
going from 2 to N is easy. In my case: enabling several other point-to-point /31 transit networks on
the _nikhef_ router, using `ethernet-1/1.0` through `ethernet-1/4.0` with the correct MTU and
turning on OSPF for these (roughly as sketched below), makes the whole network shoot to life. Slick!

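For one of those additional links, the recipe is exactly the same as before. A sketch for
`ethernet-1/1` on the _nikhef_ router, reusing the 198.19.17.2/31 address that later shows up as the
OSPF neighbor address on the Arista's `Ethernet31/1` (so treat the addressing as my assumption):

```
A:pim@nikhef# set / interface ethernet-1/1 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ip-mtu 9190
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ipv4 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ipv4 address 198.19.17.2/31
A:pim@nikhef# set / network-instance default interface ethernet-1/1.0
A:pim@nikhef# / network-instance default protocols ospf instance default
A:pim@nikhef# set area 0.0.0.0 interface ethernet-1/1.0 interface-type point-to-point
A:pim@nikhef# commit stay
```
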
#### Underlay: Arista

I'll point out that one of the devices in this topology is an Arista. We have several of these ready
for deployment at FrysIX. They are a lot more affordable and easier to find on the second-hand /
refurbished market. These switches come with 32x100G ports, and are really good at packet slinging
because they're based on the Broadcom _Tomahawk_ chipset. They pack a few fewer features than the
_Trident_ chipset that powers the Nokia, but they happen to have all the features we need to run our
internet exchange. So I turn my attention to the Arista in the topology. I am much more
comfortable configuring the whole thing here, as it's not my first time touching these devices:

```
|
||||
arista-leaf#show run int loop0
|
||||
interface Loopback0
|
||||
ip address 198.19.16.2/32
|
||||
ip ospf area 0.0.0.0
|
||||
arista-leaf#show run int Ethernet32/1
|
||||
interface Ethernet32/1
|
||||
description Core: Connected to nikhef:ethernet-1/2
|
||||
load-interval 1
|
||||
mtu 9190
|
||||
no switchport
|
||||
ip address 198.19.17.5/31
|
||||
ip ospf cost 1000
|
||||
ip ospf network point-to-point
|
||||
ip ospf area 0.0.0.0
|
||||
arista-leaf#show run section router ospf
|
||||
router ospf 65500
|
||||
router-id 198.19.16.2
|
||||
redistribute connected
|
||||
network 198.19.0.0/16 area 0.0.0.0
|
||||
max-lsa 12000
|
||||
```

I complete the configuration for the other two interfaces on this Arista: port Eth31/1 also connects
to the _nikhef_ IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to
the _nokia-leaf_ IXR-7220-D2 with a cost of 10, roughly as sketched below.

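Following the same pattern as Eth32/1 above, the lower-cost leg towards _nokia-leaf_ would look
something like this - the /31 address is my assumption, derived from the 198.19.17.11 neighbor that
shows up in the OSPF neighbor table below:

```
arista-leaf#conf t
interface Ethernet30/1
   description Core: Connected to nokia-leaf
   load-interval 1
   mtu 9190
   no switchport
   ip address 198.19.17.10/31
   ip ospf cost 10
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
```
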
It's nice to see OSPF in action: there are two equal (but high) cost OSPF paths via
router-id 198.19.16.1 (nikhef), and there's one lower cost path via router-id 198.19.16.3
(_nokia-leaf_). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nikhef ->
equinix). Dope!

```
|
||||
arista-leaf#show ip ospf nei
|
||||
Neighbor ID Instance VRF Pri State Dead Time Address Interface
|
||||
198.19.16.1 65500 default 1 FULL 00:00:36 198.19.17.4 Ethernet32/1
|
||||
198.19.16.3 65500 default 1 FULL 00:00:31 198.19.17.11 Ethernet30/1
|
||||
198.19.16.1 65500 default 1 FULL 00:00:35 198.19.17.2 Ethernet31/1
|
||||
|
||||
arista-leaf#traceroute 198.19.16.0
|
||||
traceroute to 198.19.16.0 (198.19.16.0), 30 hops max, 60 byte packets
|
||||
1 198.19.17.11 (198.19.17.11) 0.220 ms 0.150 ms 0.206 ms
|
||||
2 198.19.17.6 (198.19.17.6) 0.169 ms 0.107 ms 0.099 ms
|
||||
3 198.19.16.0 (198.19.16.0) 0.434 ms 0.346 ms 0.303 ms
|
||||
```

So far, so good! The _underlay_ is up, every router can reach every other router on its loopback,
and all OSPF adjacencies are formed. I'll leave the 2x100G between _nikhef_ and _arista-leaf_ at
high cost for now.

#### Overlay EVPN: SR Linux

The big-picture idea here is to use iBGP with the same private AS number, and because there are two
main facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as
route-reflectors for the others. It means that they will have an iBGP session amongst themselves
(198.19.16.0 <-> 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the
198.19.16.0/24 subnet. This way, I don't have to configure any more than strictly necessary on the
core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core
routers. I proceed to configure BGP on the Nokias like this:

```
|
||||
A:pim@nikhef# / network-instance default protocols bgp
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set autonomous-system 65500
|
||||
A:pim@nikhef# set router-id 198.19.16.1
|
||||
A:pim@nikhef# set dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
|
||||
A:pim@nikhef# set afi-safi evpn admin-state enable
|
||||
A:pim@nikhef# set preference ibgp 170
|
||||
A:pim@nikhef# set route-advertisement rapid-withdrawal true
|
||||
A:pim@nikhef# set route-advertisement wait-for-fib-install false
|
||||
A:pim@nikhef# set group overlay peer-as 65500
|
||||
A:pim@nikhef# set group overlay afi-safi evpn admin-state enable
|
||||
A:pim@nikhef# set group overlay afi-safi ipv4-unicast admin-state disable
|
||||
A:pim@nikhef# set group overlay afi-safi ipv6-unicast admin-state disable
|
||||
A:pim@nikhef# set group overlay local-as as-number 65500
|
||||
A:pim@nikhef# set group overlay route-reflector client true
|
||||
A:pim@nikhef# set group overlay transport local-address 198.19.16.1
|
||||
A:pim@nikhef# set neighbor 198.19.16.0 admin-state enable
|
||||
A:pim@nikhef# set neighbor 198.19.16.0 peer-group overlay
|
||||
A:pim@nikhef# commit stay
|
||||
```

I can see that iBGP sessions establish between all the devices:

```
|
||||
A:pim@nikhef# show network-instance default protocols bgp neighbor
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
BGP neighbor summary for network-instance "default"
|
||||
Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
|
||||
| Net-Inst | Peer | Group | Flags | Peer-AS | State | Uptime | AFI/SAFI | [Rx/Active/Tx] |
|
||||
+=============+=============+==========+=======+==========+=============+===============+============+====================+
|
||||
| default | 198.19.16.0 | overlay | S | 65500 | established | 0d:0h:2m:32s | evpn | [0/0/0] |
|
||||
| default | 198.19.16.2 | overlay | D | 65500 | established | 0d:0h:2m:27s | evpn | [0/0/0] |
|
||||
| default | 198.19.16.3 | overlay | D | 65500 | established | 0d:0h:2m:41s | evpn | [0/0/0] |
|
||||
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
Summary:
|
||||
1 configured neighbors, 1 configured sessions are established, 0 disabled peers
|
||||
2 dynamic peers
|
||||
```

A few things to note here - there is one _configured_ neighbor (this is the other IXR-7220-D4
router), and two _dynamic_ peers: the Arista and the smaller IXR-7220-D2 router. The only address
family that they are exchanging information for is the _evpn_ family, and no prefixes have been
learned or sent yet, shown by the `[0/0/0]` designation in the last column.

#### Overlay EVPN: Arista

The Arista is also remarkably straightforward to configure. Here, I'll simply enable the iBGP
session as follows:

```
|
||||
arista-leaf#show run section bgp
|
||||
router bgp 65500
|
||||
neighbor evpn peer group
|
||||
neighbor evpn remote-as 65500
|
||||
neighbor evpn update-source Loopback0
|
||||
neighbor evpn ebgp-multihop 3
|
||||
neighbor evpn send-community extended
|
||||
neighbor evpn maximum-routes 12000 warning-only
|
||||
neighbor 198.19.16.0 peer group evpn
|
||||
neighbor 198.19.16.1 peer group evpn
|
||||
!
|
||||
address-family evpn
|
||||
neighbor evpn activate
|
||||
|
||||
arista-leaf#show bgp summary
|
||||
BGP summary information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Neighbor AS Session State AFI/SAFI AFI/SAFI State NLRI Rcd NLRI Acc
|
||||
----------- ----------- ------------- ----------------------- -------------- ---------- ----------
|
||||
198.19.16.0 65500 Established IPv4 Unicast Advertised 0 0
|
||||
198.19.16.0 65500 Established L2VPN EVPN Negotiated 0 0
|
||||
198.19.16.1 65500 Established IPv4 Unicast Advertised 0 0
|
||||
198.19.16.1 65500 Established L2VPN EVPN Negotiated 0 0
|
||||
```

On this leaf node, I'll have redundant iBGP sessions with the two core nodes. Since those core
nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No
matter how many additional Arista (or Nokia) devices I add to the network, all they'll have to do is
enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both core routers.
Voila!

#### VXLAN EVPN: SR Linux

Nokia documentation informs me that SR Linux uses a special interface called _system0_ to source its
VXLAN traffic from, and that this interface should be added to the _default_ network-instance. So
it's a matter of defining that interface and associating a VXLAN interface with it, like so:

```
|
||||
A:pim@nikhef# set / interface system0 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
|
||||
A:pim@nikhef# set / network-instance default interface system0.0
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
|
||||
A:pim@nikhef# commit stay
|
||||
```

This creates the plumbing for a VXLAN sub-interface called `vxlan1.2604` which will accept/send
traffic using VNI 2604 (this happens to be the VLAN id we use at FrysIX for our production Peering
LAN), and it'll use the `system0.0` address to source that traffic from.

The second part is to create what SR Linux calls a MAC-VRF and put some interface(s) in it:

```
|
||||
A:pim@nikhef# set / interface ethernet-1/9 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
|
||||
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 vlan-tagging true
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 type bridged
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
|
||||
|
||||
A:pim@nikhef# / network-instance peeringlan
|
||||
A:pim@nikhef# set type mac-vrf
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set interface ethernet-1/9/3.0
|
||||
A:pim@nikhef# set vxlan-interface vxlan1.2604
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 admin-state enable
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 evi 2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
|
||||
A:pim@nikhef# commit stay
|
||||
```

In the first block here, Arend took what is a 100G port called `ethernet-1/9` and split it into 4x25G
ports. Arend forced the port speed to 10G because he has taken a 40G-4x10G DAC, and it happens that
the third lane is plugged into the Debian machine. So on `ethernet-1/9/3` I'll create a
sub-interface, make it type _bridged_ (which I've also done on `vxlan1.2604`!) and allow any
untagged traffic to enter it.

{{< image width="5em" float="left" src="/assets/shared/brain.png" alt="brain" >}}

If you, like me, are used to either VPP or IOS/XR, this type of sub-interface stuff should feel very
natural to you. I've written about the sub-interface logic of Cisco's IOS/XR and VPP approach in a
previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})], which my buddy Fred lovingly calls
_VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read!

The second block creates a new _network-instance_ which I'll name `peeringlan`. It associates
the newly created untagged sub-interface `ethernet-1/9/3.0` with the VXLAN interface, and starts an
eVPN protocol instance that instructs traffic in and out of this network-instance to use EVI 2604 on
the VXLAN sub-interface, and signals all learned MAC addresses with the specified
route-distinguisher and import/export route-targets. For simplicity I've just used the same for
each: 65500:2604.

I continue by adding an interface to the `peeringlan` _network-instance_ on the other two Nokia
routers: `ethernet-1/9/3.0` on the _equinix_ router and `ethernet-1/9.0` on the _nokia-leaf_ router.
Each of these goes to a 10Gbps port on a Debian machine.

#### VXLAN EVPN: Arista

At this point I'm feeling pretty bullish about the whole project. Arista does not make it very
difficult for me to configure it for L2 EVPN (which is called MAC-VRF here also):

```
|
||||
arista-leaf#conf t
|
||||
vlan 2604
|
||||
name v-peeringlan
|
||||
interface Ethernet9/3
|
||||
speed forced 10000full
|
||||
switchport access vlan 2604
|
||||
|
||||
interface Loopback1
|
||||
ip address 198.19.18.2/32
|
||||
interface Vxlan1
|
||||
vxlan source-interface Loopback1
|
||||
vxlan udp-port 4789
|
||||
vxlan vlan 2604 vni 2604
|
||||
```

After creating VLAN 2604 and making port Eth9/3 an access port in that VLAN, I'll add a VTEP endpoint
called `Loopback1`, and a VXLAN interface that uses that to source its traffic. Here, I'll associate
local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias
previously.

Finally, it's a matter of tying these together by announcing the MAC addresses into the EVPN iBGP
sessions:
```
|
||||
arista-leaf#conf t
|
||||
router bgp 65500
|
||||
vlan 2604
|
||||
rd 65500:2604
|
||||
route-target both 65500:2604
|
||||
redistribute learned
|
||||
!
|
||||
```

### Results

To validate the configurations, I learn a cool trick from my buddy Andy on the SR Linux discord
server. In EOS, I can ask it to check for any obvious mistakes in two places:

```
|
||||
arista-leaf#show vxlan config-sanity detail
|
||||
Category Result Detail
|
||||
---------------------------------- -------- --------------------------------------------------
|
||||
Local VTEP Configuration Check OK
|
||||
Loopback IP Address OK
|
||||
VLAN-VNI Map OK
|
||||
Flood List OK
|
||||
Routing OK
|
||||
VNI VRF ACL OK
|
||||
Decap VRF-VNI Map OK
|
||||
VRF-VNI Dynamic VLAN OK
|
||||
Remote VTEP Configuration Check OK
|
||||
Remote VTEP OK
|
||||
Platform Dependent Check OK
|
||||
VXLAN Bridging OK
|
||||
VXLAN Routing OK VXLAN Routing not enabled
|
||||
CVX Configuration Check OK
|
||||
CVX Server OK Not in controller client mode
|
||||
MLAG Configuration Check OK Run 'show mlag config-sanity' to verify MLAG config
|
||||
Peer VTEP IP OK MLAG peer is not connected
|
||||
MLAG VTEP IP OK
|
||||
Peer VLAN-VNI OK
|
||||
Virtual VTEP IP OK
|
||||
MLAG Inactive State OK
|
||||
|
||||
arista-leaf#show bgp evpn sanity detail
|
||||
Category Check Status Detail
|
||||
-------- -------------------- ------ ------
|
||||
General Send community OK
|
||||
General Multi-agent mode OK
|
||||
General Neighbor established OK
|
||||
L2 MAC-VRF route-target OK
|
||||
import and export
|
||||
L2 MAC-VRF OK
|
||||
route-distinguisher
|
||||
L2 MAC-VRF redistribute OK
|
||||
L2 MAC-VRF overlapping OK
|
||||
VLAN
|
||||
L2 Suppressed MAC OK
|
||||
VXLAN VLAN to VNI map for OK
|
||||
MAC-VRF
|
||||
VXLAN VRF to VNI map for OK
|
||||
IP-VRF
|
||||
```
|

#### Results: Arista view

Inspecting the MAC addresses learned from all four of the client ports on the Debian machine is
easy:

|
||||
arista-leaf#show bgp evpn summary
|
||||
BGP summary information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Neighbor Status Codes: m - Under maintenance
|
||||
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
|
||||
198.19.16.0 4 65500 3311 3867 0 0 18:06:28 Estab 7 7
|
||||
198.19.16.1 4 65500 3308 3873 0 0 18:06:28 Estab 7 7
|
||||
|
||||
arista-leaf#show bgp evpn vni 2604 next-hop 198.19.18.3
|
||||
BGP routing table information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Route status codes: * - valid, > - active, S - Stale, E - ECMP head, e - ECMP
|
||||
c - Contributing to ECMP, % - Pending BGP convergence
|
||||
Origin codes: i - IGP, e - EGP, ? - incomplete
|
||||
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop
|
||||
|
||||
Network Next Hop Metric LocPref Weight Path
|
||||
* >Ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
|
||||
* ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
|
||||
* >Ec RD: 65500:2604 imet 198.19.18.3
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
|
||||
* ec RD: 65500:2604 imet 198.19.18.3
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
|
||||
```
There's a lot to unpack here! The Arista is seeing that, for the _route-distinguisher_ I configured
on all the sessions, it is learning one MAC address on neighbor 198.19.18.3 (this is the VTEP for
the _nokia-leaf_ router) from both iBGP sessions. The MAC address is learned from originator
198.19.16.3 (the loopback of the _nokia-leaf_ router), from two cluster members: the active one on
iBGP speaker 198.19.16.1 (_nikhef_) and a backup member on 198.19.16.0 (_equinix_).

I can also see that there are a bunch of `imet` route entries, and Andy explained these to me. They
are a signal from a VTEP participant that it is interested in seeing multicast traffic (like neighbor
discovery or ARP requests) flooded to it. Every router participating in this L2VPN will raise such
an `imet` route, which I'll see in duplicate as well (one from each iBGP session). This checks out.

#### Results: SR Linux view

The Nokia IXR-7220-D4 router called _equinix_ has also learned a bunch of EVPN routing entries,
which I can inspect as follows:

```
|
||||
A:pim@equinix# show network-instance default protocols bgp routes evpn route-type summary
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Show report for the BGP route table of network-instance "default"
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Status codes: u=used, *=valid, >=best, x=stale, b=backup
|
||||
Origin codes: i=IGP, e=EGP, ?=incomplete
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
BGP Router ID: 198.19.16.0 AS: 65500 Local AS: 65500
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Type 2 MAC-IP Advertisement Routes
|
||||
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
|
||||
| Status | Route- | Tag-ID | MAC-address | IP-address | neighbor | Path-| Next-Hop | Label | ESI | MAC Mobility |
|
||||
| | distinguisher | | | | | id | | | | |
|
||||
+========+===============+========+===================+============+=============+======+============-+========+================================+==================+
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:57 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.1 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.2 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.3 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Type 3 Inclusive Multicast Ethernet Tag Routes
|
||||
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
|
||||
| Status | Route-distinguisher | Tag-ID | Originator-IP | neighbor | Path- | Next-Hop |
|
||||
| | | | | | id | |
|
||||
+========+=============================+========+=====================+=================+========+=======================+
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.1 | 198.19.16.1 | 0 | 198.19.18.1 |
|
||||
| * | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.1 | 0 | 198.19.18.2 |
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.2 | 0 | 198.19.18.2 |
|
||||
| * | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.1 | 0 | 198.19.18.3 |
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.3 | 0 | 198.19.18.3 |
|
||||
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
|
||||
--------------------------------------------------------------------------------------------------------------------------
|
||||
0 Ethernet Auto-Discovery routes 0 used, 0 valid
|
||||
5 MAC-IP Advertisement routes 3 used, 5 valid
|
||||
5 Inclusive Multicast Ethernet Tag routes 3 used, 5 valid
|
||||
0 Ethernet Segment routes 0 used, 0 valid
|
||||
0 IP Prefix routes 0 used, 0 valid
|
||||
0 Selective Multicast Ethernet Tag routes 0 used, 0 valid
|
||||
0 Selective Multicast Membership Report Sync routes 0 used, 0 valid
|
||||
0 Selective Multicast Leave Sync routes 0 used, 0 valid
|
||||
--------------------------------------------------------------------------------------------------------------------------
|
||||
```
|
||||
|
||||
I have to say, SR Linux output is incredibly verbose! But, I can see all the relevant bits and bobs
|
||||
here. Each MAC-IP entry is accounted for, I can see several nexthops pointing at the nikhef switch,
|
||||
one pointing at the nokia-leaf router and one pointing at the Arista switch. I also see the `imet`
|
||||
entries. One thing to note -- the SR Linux implementation renders the empty IP field of these type-2 routes as a
0.0.0.0 IPv4 address, while the Arista (in my opinion, more correctly) leaves it as NULL
|
||||
(unspecified). But, everything looks great!
|
||||
|
||||
#### Results: Debian view
|
||||
|
||||
There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. As I said,
|
||||
Arend hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+
|
||||
connections. This network card is a regular in my AS8298 network, as it has excellent DPDK support
|
||||
and can easily pump 40Mpps with VPP. IPng 🥰 Intel X710!
|
||||
|
||||
```
|
||||
root@debian:~ # ip netns add nikhef
|
||||
root@debian:~ # ip link set enp1s0f0 netns nikhef
|
||||
root@debian:~ # ip netns exec nikhef ip link set enp1s0f0 up mtu 9000
|
||||
root@debian:~ # ip netns exec nikhef ip addr add 192.0.2.10/24 dev enp1s0f0
|
||||
root@debian:~ # ip netns exec nikhef ip addr add 2001:db8::10/64 dev enp1s0f0
|
||||
|
||||
root@debian:~ # ip netns add arista-leaf
|
||||
root@debian:~ # ip link set enp1s0f1 netns arista-leaf
|
||||
root@debian:~ # ip netns exec arista-leaf ip link set enp1s0f1 up mtu 9000
|
||||
root@debian:~ # ip netns exec arista-leaf ip addr add 192.0.2.11/24 dev enp1s0f1
|
||||
root@debian:~ # ip netns exec arista-leaf ip addr add 2001:db8::11/64 dev enp1s0f1
|
||||
|
||||
root@debian:~ # ip netns add nokia-leaf
|
||||
root@debian:~ # ip link set enp1s0f2 netns nokia-leaf
|
||||
root@debian:~ # ip netns exec nokia-leaf ip link set enp1s0f2 up mtu 9000
|
||||
root@debian:~ # ip netns exec nokia-leaf ip addr add 192.0.2.12/24 dev enp1s0f2
|
||||
root@debian:~ # ip netns exec nokia-leaf ip addr add 2001:db8::12/64 dev enp1s0f2
|
||||
|
||||
root@debian:~ # ip netns add equinix
|
||||
root@debian:~ # ip link set enp1s0f3 netns equinix
|
||||
root@debian:~ # ip netns exec equinix ip link set enp1s0f3 up mtu 9000
|
||||
root@debian:~ # ip netns exec equinix ip addr add 192.0.2.13/24 dev enp1s0f3
|
||||
root@debian:~ # ip netns exec equinix ip addr add 2001:db8::13/64 dev enp1s0f3
|
||||
|
||||
root@debian:~# ip netns exec nikhef fping -g 192.0.2.8/29
|
||||
192.0.2.10 is alive
|
||||
192.0.2.11 is alive
|
||||
192.0.2.12 is alive
|
||||
192.0.2.13 is alive
|
||||
|
||||
root@debian:~# ip netns exec arista-leaf fping 2001:db8::10 2001:db8::11 2001:db8::12 2001:db8::13
|
||||
2001:db8::10 is alive
|
||||
2001:db8::11 is alive
|
||||
2001:db8::12 is alive
|
||||
2001:db8::13 is alive
|
||||
|
||||
root@debian:~# ip netns exec equinix ip nei
|
||||
192.0.2.10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
192.0.2.11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
192.0.2.12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
fe80::e63a:6eff:fe5f:c57 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
fe80::e63a:6eff:fe5f:c58 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
fe80::e63a:6eff:fe5f:c59 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
2001:db8::10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
2001:db8::11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
2001:db8::12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
```
|
||||
|
||||
The Debian machine puts each network card into its own network namespace, and gives each of them an IPv4
|
||||
and an IPv6 address. I can then enter the `nikhef` network namespace, which has its NIC connected to
|
||||
the IXR-7220-D4 router called _nikhef_, and ping all four endpoints. Similarly, I can enter the
|
||||
`arista-leaf` namespace and ping6 all four endpoints. Finally, I take a look at the IPv6 and IPv4
|
||||
neighbor table on the network card that is connected to the _equinix_ router. All three MAC addresses are
|
||||
seen. This proves end to end connectivity across the EVPN VXLAN, and full interoperability. Booyah!
|
||||
|
||||
Performance? We got that! I'm not worried as these Nokia routers are rated for 12.8Tbps of VXLAN....
|
||||
```
|
||||
root@debian:~# ip netns exec equinix iperf3 -c 192.0.2.12
|
||||
Connecting to host 192.0.2.12, port 5201
|
||||
[ 5] local 192.0.2.10 port 34598 connected to 192.0.2.12 port 5201
|
||||
[ ID] Interval Transfer Bitrate Retr Cwnd
|
||||
[ 5] 0.00-1.00 sec 1.15 GBytes 9.91 Gbits/sec 19 1.52 MBytes
|
||||
[ 5] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 3 1.54 MBytes
|
||||
[ 5] 2.00-3.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
|
||||
[ 5] 3.00-4.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
|
||||
[ 5] 4.00-5.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 5.00-6.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 6.00-7.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 7.00-8.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 8.00-9.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 9.00-10.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
- - - - - - - - - - - - - - - - - - - - - - - - -
|
||||
[ ID] Interval Transfer Bitrate Retr
|
||||
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec 24 sender
|
||||
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec receiver
|
||||
|
||||
iperf Done.
|
||||
```
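For completeness: the receiving end of this iperf3 test isn't shown above. Assuming defaults, it
would have been nothing more than a server process in the namespace that owns 192.0.2.12, started
along these lines:

```
root@debian:~# ip netns exec nokia-leaf iperf3 -s
```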
|
||||
|
||||
## What's Next
|
||||
|
||||
There's a few improvements I can make before deploying this architecture to the internet exchange.
|
||||
Notably:
|
||||
* the functional equivalent of _port security_, that is to say only allowing one or two MAC
|
||||
addresses per member port. FrysIX has a strict one-port-one-member-one-MAC rule, and having port
|
||||
security will greatly improve our resilience.
|
||||
* SR Linux has the ability to suppress ARP, _even on L2 MAC-VRF_! It's relatively well known for
|
||||
IRB based setups, but adding this to transparent bridge-domains is possible in Nokia
|
||||
[[ref](https://documentation.nokia.com/srlinux/22-6/SR_Linux_Book_Files/EVPN-VXLAN_Guide/services-evpn-vxlan-l2.html#configuring_evpn_learning_for_proxy_arp)],
|
||||
using the syntax of `protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise
|
||||
true`. This will glean the IP addresses based on intercepted ARP requests, and reduce the need for
|
||||
BUM flooding.
|
||||
* Andy informs me that Arista also has this feature. By setting `router l2-vpn` and `arp learning bridged`,
|
||||
the suppression of ARP requests/replies also works in the same way. This greatly reduces cross-router
|
||||
BUM flooding. If DE-CIX can do it, so can FrysIX :)
|
||||
* some automation - although configuring the MAC-VRF across Arista and SR Linux is definitely not
|
||||
as difficult as I thought, having some automation in place will avoid errors and mistakes. It
|
||||
would suck if the IXP collapsed because I botched a link drain or PNI configuration!
|
||||
|
||||
### Acknowledgements
|
||||
|
||||
I am relatively new to EVPN configurations, and wanted to give a shoutout to Andy Whitaker who
|
||||
jumped in very quickly when I asked a question on the SR Linux Discord. He was gracious with his
|
||||
time and spent a few hours on a video call with me, explaining EVPN in great detail both for Arista
|
||||
as well as SR Linux. In particular, I want to give him a big "Thank you!" for helping me understand
|
||||
symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at
|
||||
Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure
|
||||
gold!
|
||||
|
||||
I also want to thank Niek for helping me take my first baby steps onto this platform and patiently
|
||||
answering my nerdly questions about the platform, the switch chip, and the configuration philosophy.
|
||||
Learning a new NOS is always a fun task, and it was made super fun because Niek spent an hour with
|
||||
Arend and me on a video call, giving a bunch of operational tips and tricks along the way.
|
||||
|
||||
Finally, Arend and ERITAP are an absolute joy to work with. We took turns hacking on the lab, which
|
||||
Arend made available for me while I am traveling to Mississippi this week. Thanks for the kWh and
|
||||
OOB access, and for brainstorming the config with me!
|
||||
|
||||
### Reference configurations
|
||||
|
||||
Here's the configs for all machines in this demonstration:
|
||||
[[nikhef](/assets/frys-ix/nikhef.conf)] | [[equinix](/assets/frys-ix/equinix.conf)] | [[nokia-leaf](/assets/frys-ix/nokia-leaf.conf)] | [[arista-leaf](/assets/frys-ix/arista-leaf.conf)]
|
||||
464
content/articles/2025-05-03-containerlab-1.md
Normal file
@@ -0,0 +1,464 @@
|
||||
---
|
||||
date: "2025-05-03T15:07:23Z"
|
||||
title: 'VPP in Containerlab - Part 1'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
|
||||
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
|
||||
However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines
|
||||
like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to
|
||||
allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
|
||||
performs almost the same as on bare metal. But did you know that VPP can also run in Docker?
|
||||
|
||||
The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
|
||||
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
|
||||
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
|
||||
container-based networking labs. It starts the containers, builds a virtual wiring between them to
|
||||
create lab topologies of the user's choice, and manages the lab lifecycle.
|
||||
|
||||
Quite regularly I am asked 'when will you add VPP to Containerlab?', but at ZANOG I made a promise
|
||||
to actually add it. Here I go, on a journey to integrate VPP into Containerlab!
|
||||
|
||||
## Containerized VPP
|
||||
|
||||
The folks at [[Tigera](https://www.tigera.io/project-calico/)] maintain a project called _Calico_,
|
||||
which accelerates Kubernetes CNI (Container Network Interface) by using [[FD.io](https://fd.io)]
|
||||
VPP. Since the origins of Kubernetes are to run containers in a Docker environment, it stands to
|
||||
reason that it should be possible to run a containerized VPP. I start by reading up on how they
|
||||
create their Docker image, and I learn a lot.
|
||||
|
||||
### Docker Build
|
||||
|
||||
Considering IPng runs bare metal Debian (currently Bookworm) machines, my Docker image will be based
|
||||
on `debian:bookworm` as well. The build starts off quite modest:
|
||||
|
||||
```
|
||||
pim@summer:~$ mkdir -p src/vpp-containerlab
|
||||
pim@summer:~/src/vpp-containerlab$ cat << 'EOF' > Dockerfile.bookworm
|
||||
FROM debian:bookworm
|
||||
ARG DEBIAN_FRONTEND=noninteractive
|
||||
ARG VPP_INSTALL_SKIP_SYSCTL=true
|
||||
ARG REPO=release
|
||||
RUN apt-get update && apt-get -y install curl procps && apt-get clean
|
||||
|
||||
# Install VPP
|
||||
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
|
||||
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean
|
||||
|
||||
CMD ["/usr/bin/vpp","-c","/etc/vpp/startup.conf"]
|
||||
EOF
|
||||
pim@summer:~/src/vpp-containerlab$ docker build -f Dockerfile.bookworm . -t pimvanpelt/vpp-containerlab
|
||||
```
|
||||
|
||||
One gotcha - when I install the upstream VPP Debian packages, they generate a `sysctl` file which the
postinst script then tries to apply. However, I can't set sysctls in the container, so the build fails. I take a look
|
||||
at the VPP source code and find `src/pkg/debian/vpp.postinst` which helpfully contains a means to
|
||||
override setting the sysctl's, using an environment variable called `VPP_INSTALL_SKIP_SYSCTL`.
|
||||
|
||||
### Running VPP in Docker
|
||||
|
||||
With the Docker image built, I need to tweak the VPP startup configuration a little bit, to allow it
|
||||
to run well in a Docker environment. There are a few things I make note of:
|
||||
1. We may not have huge pages on the host machine, so I'll set all the page sizes to the
|
||||
linux-default 4kB rather than 2MB or 1GB hugepages. This creates a performance regression, but
|
||||
in the case of Containerlab, we're not here to build high performance stuff, but rather users
|
||||
will be doing functional testing.
|
||||
1. DPDK requires either UIO or VFIO kernel drivers, so that it can bind its so-called _poll mode
|
||||
driver_ to the network cards. It also requires huge pages. Since my first version will be
|
||||
using only virtual ethernet interfaces, I'll disable DPDK and VFIO altogether.
|
||||
1. VPP can run any number of CPU worker threads. In its simplest form, I can also run it with only
|
||||
one thread. Of course, this will not be a high performance setup, but since I'm already not
|
||||
using hugepages, I'll use only 1 thread.
|
||||
|
||||
The VPP `startup.conf` configuration file I came up with:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat << EOF > clab-startup.conf
|
||||
unix {
|
||||
interactive
|
||||
log /var/log/vpp/vpp.log
|
||||
full-coredump
|
||||
cli-listen /run/vpp/cli.sock
|
||||
cli-prompt vpp-clab#
|
||||
cli-no-pager
|
||||
poll-sleep-usec 100
|
||||
}
|
||||
|
||||
api-trace {
|
||||
on
|
||||
}
|
||||
|
||||
memory {
|
||||
main-heap-size 512M
|
||||
main-heap-page-size 4k
|
||||
}
|
||||
buffers {
|
||||
buffers-per-numa 16000
|
||||
default data-size 2048
|
||||
page-size 4k
|
||||
}
|
||||
|
||||
statseg {
|
||||
size 64M
|
||||
page-size 4k
|
||||
per-node-counters on
|
||||
}
|
||||
|
||||
plugins {
|
||||
plugin default { enable }
|
||||
plugin dpdk_plugin.so { disable }
|
||||
}
|
||||
EOF
|
||||
```
|
||||
|
||||
Just a couple of notes for those who are running VPP in production. Each of the `*-page-size` config
|
||||
settings takes the normal Linux pagesize of 4kB, which effectively prevents VPP from using any
|
||||
hugepages. Then, I'll specifically disable the DPDK plugin, although I didn't install it in the
|
||||
Dockerfile build, as it lives in its own dedicated Debian package called `vpp-plugin-dpdk`. Finally,
|
||||
I'll make VPP use less CPU by telling it to sleep for 100 microseconds between each poll iteration.
|
||||
In production environments, VPP will use 100% of the CPUs it's assigned, but in this lab, it will
|
||||
not be quite as hungry. By the way, even in this sleepy mode, it'll still easily handle a gigabit
|
||||
of traffic!
|
||||
|
||||
Now, VPP wants to run as root and it needs a few host features, notably tuntap devices and vhost,
|
||||
and a few capabilities, notably NET_ADMIN, SYS_NICE and SYS_PTRACE. I take a look at the
|
||||
[[manpage](https://man7.org/linux/man-pages/man7/capabilities.7.html)]:
|
||||
* ***CAP_SYS_NICE***: allows to set real-time scheduling, CPU affinity, I/O scheduling class, and
|
||||
to migrate and move memory pages.
|
||||
* ***CAP_NET_ADMIN***: allows to perform various network-related operations like interface
|
||||
configs, routing tables, nested network namespaces, multicast, set promiscuous mode, and so on.
|
||||
* ***CAP_SYS_PTRACE***: allows to trace arbitrary processes using `ptrace(2)`, and a few related
|
||||
kernel system calls.
|
||||
|
||||
Being a networking dataplane implementation, VPP wants to be able to tinker with network devices.
|
||||
This is not typically allowed in Docker containers, although the Docker developers did make some
|
||||
concessions for those containers that need just that little bit more access. They described it in
|
||||
their
|
||||
[[docs](https://docs.docker.com/engine/containers/run/#runtime-privilege-and-linux-capabilities)] as
|
||||
follows:
|
||||
|
||||
| The --privileged flag gives all capabilities to the container. When the operator executes docker
|
||||
| run --privileged, Docker enables access to all devices on the host, and reconfigures AppArmor or
|
||||
| SELinux to allow the container nearly all the same access to the host as processes running outside
|
||||
| containers on the host. Use this flag with caution. For more information about the --privileged
|
||||
| flag, see the docker run reference.
|
||||
|
||||
{{< image width="4em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
At this point, I feel I should point out that running a Docker container with the `--privileged` flag
|
||||
set does give it _a lot_ of privileges. A container with `--privileged` is not a securely sandboxed
|
||||
process. Containers in this mode can get a root shell on the host and take control over the system.
|
||||
|
||||
With that little fineprint warning out of the way, I am going to Yolo like a boss:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker run --name clab-pim \
|
||||
--cap-add=NET_ADMIN --cap-add=SYS_NICE --cap-add=SYS_PTRACE \
|
||||
--device=/dev/net/tun:/dev/net/tun --device=/dev/vhost-net:/dev/vhost-net \
|
||||
--privileged -v $(pwd)/clab-startup.conf:/etc/vpp/startup.conf:ro \
|
||||
docker.io/pimvanpelt/vpp-containerlab
|
||||
clab-pim
|
||||
```
|
||||
|
||||
### Configuring VPP in Docker
|
||||
|
||||
And with that, the Docker container is running! I post a screenshot on
|
||||
[[Mastodon](https://ublog.tech/@IPngNetworks/114392852468494211)] and my buddy John responds with a
|
||||
polite but firm insistence that I explain myself. Here you go, buddy :)
|
||||
|
||||
In another terminal, I can play around with this VPP instance a little bit:
|
||||
```
|
||||
pim@summer:~$ docker exec -it clab-pim bash
|
||||
root@d57c3716eee9:/# ip -br l
|
||||
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
|
||||
eth0@if530566 UP 02:42:ac:11:00:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
|
||||
root@d57c3716eee9:/# ps auxw
|
||||
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
|
||||
root 1 2.2 0.2 17498852 160300 ? Rs 15:11 0:00 /usr/bin/vpp -c /etc/vpp/startup.conf
|
||||
root 10 0.0 0.0 4192 3388 pts/0 Ss 15:11 0:00 bash
|
||||
root 18 0.0 0.0 8104 4056 pts/0 R+ 15:12 0:00 ps auxw
|
||||
|
||||
root@d57c3716eee9:/# vppctl
|
||||
_______ _ _ _____ ___
|
||||
__/ __/ _ \ (_)__ | | / / _ \/ _ \
|
||||
_/ _// // / / / _ \ | |/ / ___/ ___/
|
||||
/_/ /____(_)_/\___/ |___/_/ /_/
|
||||
|
||||
vpp-clab# show version
|
||||
vpp v25.02-release built by root on d5cd2c304b7f at 2025-02-26T13:58:32
|
||||
vpp-clab# show interfaces
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
|
||||
local0 0 down 0/0/0/0
|
||||
```
|
||||
|
||||
Slick! I can see that the container has an `eth0` device, which Docker has connected to the main
|
||||
bridged network. For now, there's only one process running, pid 1 proudly shows VPP (as in Docker,
|
||||
the `CMD` field simply replaces `init`). Later on, I can imagine running a few more daemons like
|
||||
SSH and so on, but for now, I'm happy.
|
||||
|
||||
Looking at VPP itself, it has no network interfaces yet, except for the default `local0` interface.
|
||||
|
||||
### Adding Interfaces in Docker
|
||||
|
||||
But if I don't have DPDK, how will I add interfaces? Enter `veth(4)`. From the
|
||||
[[manpage](https://man7.org/linux/man-pages/man4/veth.4.html)], I learn that veth devices are
|
||||
virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to
|
||||
a physical network device in another namespace, but can also be used as standalone network devices.
|
||||
veth devices are always created in interconnected pairs.
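To make that a little more tangible, here's a quick aside (not needed for the rest of this article,
run as root) showing how such a pair can be created by hand with `iproute2`; the names `veth-host`,
`veth-ns` and the namespace `test` are made up for this illustration:

```
# Create a veth pair; frames sent into one end pop out of the other.
ip link add veth-host type veth peer name veth-ns
# Move one end into a fresh network namespace and bring both ends up.
ip netns add test
ip link set veth-ns netns test
ip link set veth-host up
ip netns exec test ip link set veth-ns up
```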
|
||||
|
||||
Of course, Docker users will recognize this. It's like bread and butter for containers to
|
||||
communicate with one another - and with the host they're running on. I can simply create a Docker
|
||||
network and attach one half of it to a running container, like so:
|
||||
|
||||
```
|
||||
pim@summer:~$ docker network create --driver=bridge clab-network \
|
||||
--subnet 192.0.2.0/24 --ipv6 --subnet 2001:db8::/64
|
||||
5711b95c6c32ac0ed185a54f39e5af4b499677171ff3d00f99497034e09320d2
|
||||
pim@summer:~$ docker network connect clab-network clab-pim --ip '' --ip6 ''
|
||||
```
|
||||
|
||||
The first command here creates a new network called `clab-network` in Docker. As a result, a new
|
||||
bridge called `br-5711b95c6c32` shows up on the host. The bridge name is chosen from the UUID of the
|
||||
Docker object. Seeing as I added an IPv4 and IPv6 subnet to the bridge, it gets configured with the
|
||||
first address in both:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ brctl show br-5711b95c6c32
|
||||
bridge name bridge id STP enabled interfaces
|
||||
br-5711b95c6c32 8000.0242099728c6 no veth021e363
|
||||
|
||||
|
||||
pim@summer:~/src/vpp-containerlab$ ip -br a show dev br-5711b95c6c32
|
||||
br-5711b95c6c32 UP 192.0.2.1/24 2001:db8::1/64 fe80::42:9ff:fe97:28c6/64 fe80::1/64
|
||||
```
|
||||
|
||||
The second command creates a `veth` pair and puts one half of it in the bridge; this interface
|
||||
is called `veth021e363` above. The other half of it pops up as `eth1` in the Docker container:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker exec -it clab-pim bash
|
||||
root@d57c3716eee9:/# ip -br l
|
||||
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
|
||||
eth0@if530566 UP 02:42:ac:11:00:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
eth1@if530577 UP 02:42:c0:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
```
|
||||
|
||||
One of the many awesome features of VPP is its ability to attach to these `veth` devices by means of
|
||||
its `af-packet` driver, by reusing the same MAC address (in this case `02:42:c0:00:02:02`). I first
|
||||
take a look at the linux [[manpage](https://man7.org/linux/man-pages/man7/packet.7.html)] for it,
|
||||
and then read up on the VPP
|
||||
[[documentation](https://fd.io/docs/vpp/v2101/gettingstarted/progressivevpp/interface)] on the
|
||||
topic.
|
||||
|
||||
|
||||
However, my attention is drawn to Docker assigning an IPv4 and IPv6 address to the container:
|
||||
```
|
||||
root@d57c3716eee9:/# ip -br a
|
||||
lo UNKNOWN 127.0.0.1/8 ::1/128
|
||||
eth0@if530566 UP 172.17.0.2/16
|
||||
eth1@if530577 UP 192.0.2.2/24 2001:db8::2/64 fe80::42:c0ff:fe00:202/64
|
||||
root@d57c3716eee9:/# ip addr del 192.0.2.2/24 dev eth1
|
||||
root@d57c3716eee9:/# ip addr del 2001:db8::2/64 dev eth1
|
||||
```
|
||||
|
||||
I decide to remove them from here, as in the end, `eth1` will be owned by VPP so _it_ should be
|
||||
setting the IPv4 and IPv6 addresses. For the life of me, I don't see how I can stop Docker from
assigning IPv4 and IPv6 addresses to this container ... and the
|
||||
[[docs](https://docs.docker.com/engine/network/)] seem to be off as well, as they suggest I can pass
|
||||
a flag `--ipv4=False` but that flag doesn't exist, at least not on my Bookworm Docker variant. I
|
||||
make a mental note to discuss this with the folks in the Containerlab community.
|
||||
|
||||
|
||||
Anyway, armed with this knowledge I can bind the container-side veth pair called `eth1` to VPP, like
|
||||
so:
|
||||
|
||||
```
|
||||
root@d57c3716eee9:/# vppctl
|
||||
_______ _ _ _____ ___
|
||||
__/ __/ _ \ (_)__ | | / / _ \/ _ \
|
||||
_/ _// // / / / _ \ | |/ / ___/ ___/
|
||||
/_/ /____(_)_/\___/ |___/_/ /_/
|
||||
|
||||
vpp-clab# create host-interface name eth1 hw-addr 02:42:c0:00:02:02
|
||||
vpp-clab# set interface name host-eth1 eth1
|
||||
vpp-clab# set interface mtu 1500 eth1
|
||||
vpp-clab# set interface ip address eth1 192.0.2.2/24
|
||||
vpp-clab# set interface ip address eth1 2001:db8::2/64
|
||||
vpp-clab# set interface state eth1 up
|
||||
vpp-clab# show int addr
|
||||
eth1 (up):
|
||||
L3 192.0.2.2/24
|
||||
L3 2001:db8::2/64
|
||||
local0 (dn):
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
After all this work, I've successfully created a Docker image based on Debian Bookworm and VPP 25.02
|
||||
(the current stable release version), started a container with it, added a network bridge in Docker,
|
||||
which binds the host `summer` to the container. Proof, as they say, is in the ping-pudding:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ ping -c5 2001:db8::2
|
||||
PING 2001:db8::2(2001:db8::2) 56 data bytes
|
||||
64 bytes from 2001:db8::2: icmp_seq=1 ttl=64 time=0.113 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=2 ttl=64 time=0.056 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=3 ttl=64 time=0.202 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=4 ttl=64 time=0.102 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=5 ttl=64 time=0.100 ms
|
||||
|
||||
--- 2001:db8::2 ping statistics ---
|
||||
5 packets transmitted, 5 received, 0% packet loss, time 4098ms
|
||||
rtt min/avg/max/mdev = 0.056/0.114/0.202/0.047 ms
|
||||
pim@summer:~/src/vpp-containerlab$ ping -c5 192.0.2.2
|
||||
PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data.
|
||||
64 bytes from 192.0.2.2: icmp_seq=1 ttl=64 time=0.043 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=2 ttl=64 time=0.032 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=3 ttl=64 time=0.019 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=4 ttl=64 time=0.041 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=5 ttl=64 time=0.027 ms
|
||||
|
||||
--- 192.0.2.2 ping statistics ---
|
||||
5 packets transmitted, 5 received, 0% packet loss, time 4063ms
|
||||
rtt min/avg/max/mdev = 0.019/0.032/0.043/0.008 ms
|
||||
```
|
||||
|
||||
And in case that simple ping-test wasn't enough to get you excited, here's a packet trace from VPP
|
||||
itself, while I'm performing this ping:
|
||||
|
||||
```
|
||||
vpp-clab# trace add af-packet-input 100
|
||||
vpp-clab# wait 3
|
||||
vpp-clab# show trace
|
||||
------------------- Start of thread 0 vpp_main -------------------
|
||||
Packet 1
|
||||
|
||||
00:07:03:979275: af-packet-input
|
||||
af_packet: hw_if_index 1 rx-queue 0 next-index 4
|
||||
block 47:
|
||||
address 0x7fbf23b7d000 version 2 seq_num 48 pkt_num 0
|
||||
tpacket3_hdr:
|
||||
status 0x20000001 len 98 snaplen 98 mac 92 net 106
|
||||
sec 0x68164381 nsec 0x258e7659 vlan 0 vlan_tpid 0
|
||||
vnet-hdr:
|
||||
flags 0x00 gso_type 0x00 hdr_len 0
|
||||
gso_size 0 csum_start 0 csum_offset 0
|
||||
00:07:03:979293: ethernet-input
|
||||
IP4: 02:42:09:97:28:c6 -> 02:42:c0:00:02:02
|
||||
00:07:03:979306: ip4-input
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979315: ip4-lookup
|
||||
fib 0 dpo-idx 9 flow hash: 0x00000000
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979322: ip4-receive
|
||||
fib:0 adj:9 flow:0x00000000
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979323: ip4-icmp-input
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979323: ip4-icmp-echo-request
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979326: ip4-load-balance
|
||||
fib 0 dpo-idx 5 flow hash: 0x00000000
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
00:07:03:979325: ip4-rewrite
|
||||
tx_sw_if_index 1 dpo-idx 5 : ipv4 via 192.0.2.1 eth1: mtu:1500 next:3 flags:[] 0242099728c60242c00002020800 flow hash: 0x00000000
|
||||
00000000: 0242099728c60242c00002020800450000542dc44000400188e1c0000202c000
|
||||
00000020: 02010000141652cd00018143166800000000399d0900000000001011
|
||||
00:07:03:979326: eth1-output
|
||||
eth1 flags 0x02180005
|
||||
IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
00:07:03:979327: eth1-tx
|
||||
af_packet: hw_if_index 1 tx-queue 0
|
||||
tpacket3_hdr:
|
||||
status 0x1 len 108 snaplen 108 mac 0 net 0
|
||||
sec 0x0 nsec 0x0 vlan 0 vlan_tpid 0
|
||||
vnet-hdr:
|
||||
flags 0x00 gso_type 0x00 hdr_len 0
|
||||
gso_size 0 csum_start 0 csum_offset 0
|
||||
buffer 0xf97c4:
|
||||
current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
|
||||
local l2-hdr-offset 0 l3-hdr-offset 14
|
||||
IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
```
|
||||
|
||||
Well, that's a mouthful, isn't it! Here, I get to show you VPP in action. After receiving the
|
||||
packet on its `af-packet-input` node from 192.0.2.1 (Summer, who is pinging us) to 192.0.2.2 (the
|
||||
VPP container), the packet traverses the dataplane graph. It goes through `ethernet-input`, then
|
||||
`ip4-input`, which sees it's destined to an IPv4 address configured, so the packet is handed to
|
||||
`ip4-receive`. That one sees that the IP protocol is ICMP, so it hands the packet to
|
||||
`ip4-icmp-input` which notices that the packet is an ICMP echo request, so off to
|
||||
`ip4-icmp-echo-request` our little packet goes. The ICMP plugin in VPP now answers by
|
||||
`ip4-rewrite`'ing the packet, sending the return to 192.0.2.1 at MAC address `02:42:09:97:28:c6`
|
||||
(this is Summer, the host doing the pinging!), after which the newly created ICMP echo-reply is
|
||||
handed to `eth1-output` which marshalls it back into the kernel's AF_PACKET interface using
|
||||
`eth1-tx`.
|
||||
|
||||
Boom. I could not be more pleased.
|
||||
|
||||
## What's Next
|
||||
|
||||
This was a nice exercise for me! I'm going in this direction because the
|
||||
[[Containerlab](https://containerlab.dev)] framework will start containers with given NOS images,
|
||||
not too dissimilar from the one I just made, and then attaches `veth` pairs between the containers.
|
||||
I started dabbling with a [[pull-request](https://github.com/srl-labs/containerlab/pull/2571)], but
|
||||
I got stuck with a part of the Containerlab code that pre-deploys config files into the containers.
|
||||
You see, I will need to generate two files:
|
||||
|
||||
1. A `startup.conf` file that is specific to each Containerlab Docker container. I'd like them to
|
||||
each set their own hostname so that the CLI has a unique prompt. I can do this by setting `unix
|
||||
{ cli-prompt {{ .ShortName }}# }` in the template renderer.
|
||||
1. Containerlab will know all of the veth pairs that are planned to be created into each VPP
|
||||
container. I'll need it to then write a little snippet of config that does the `create
|
||||
host-interface` spiel, to attach these `veth` pairs to the VPP dataplane.
|
||||
|
||||
I reached out to Roman from Nokia, who is one of the authors and current maintainer of Containerlab.
|
||||
Roman was keen to help out, and seeing as he knows the Containerlab stuff well, and I know the VPP
|
||||
stuff well, this is a reasonable partnership! Soon, he and I plan to have a bare-bones setup that
|
||||
will connect a few VPP containers together with an SR Linux node in a lab. Stand by!
|
||||
|
||||
Once we have that, there's still quite some work for me to do. Notably:
|
||||
* Configuration persistence. `clab` allows you to save the running config. For that, I'll need to
|
||||
introduce [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] and a means to invoke it when
|
||||
the lab operator wants to save their config, and then reconfigure VPP when the container
|
||||
restarts.
|
||||
* I'll need to have a few files from `clab` shared with the host, notably the `startup.conf` and
|
||||
`vppcfg.yaml`, as well as some manual pre- and post-flight configuration for the more esoteric
|
||||
stuff. Building the plumbing for this is a TODO for now.
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
I wanted to give a shout-out to Nardus le Roux who inspired me to contribute this Containerlab VPP
|
||||
node type, and to Roman Dodin for his help getting the Containerlab parts squared away when I got a
|
||||
little bit stuck.
|
||||
|
||||
First order of business: get it to ping at all ... it'll go faster from there on out :)
|
||||
373
content/articles/2025-05-04-containerlab-2.md
Normal file
@@ -0,0 +1,373 @@
|
||||
---
|
||||
date: "2025-05-04T15:07:23Z"
|
||||
title: 'VPP in Containerlab - Part 2'
|
||||
params:
|
||||
asciinema: true
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
|
||||
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
|
||||
However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines
|
||||
like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to
|
||||
allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
|
||||
performs almost the same as on bare metal. But did you know that VPP can also run in Docker?
|
||||
|
||||
The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
|
||||
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
|
||||
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
|
||||
container-based networking labs. It starts the containers, builds virtual wiring between them to
|
||||
create lab topologies of users' choice and manages the lab lifecycle.
|
||||
|
||||
Quite regularly I am asked 'when will you add VPP to Containerlab?', but at ZANOG I made a promise
|
||||
to actually add it. In my previous [[article]({{< ref 2025-05-03-containerlab-1.md >}})], I took
|
||||
a good look at VPP as a dockerized container. In this article, I'll explore how to make such a
|
||||
container run in Containerlab!
|
||||
|
||||
## Completing the Docker container
|
||||
|
||||
Just having VPP running by itself in a container is not super useful (although it _is_ cool!). I
|
||||
decide first to add a few bits and bobs that will come in handy in the `Dockerfile`:
|
||||
|
||||
```
|
||||
FROM debian:bookworm
|
||||
ARG DEBIAN_FRONTEND=noninteractive
|
||||
ARG VPP_INSTALL_SKIP_SYSCTL=true
|
||||
ARG REPO=release
|
||||
EXPOSE 22/tcp
|
||||
RUN apt-get update && apt-get -y install curl procps tcpdump iproute2 iptables \
|
||||
iputils-ping net-tools git python3 python3-pip vim-tiny openssh-server bird2 \
|
||||
mtr-tiny traceroute && apt-get clean
|
||||
|
||||
# Install VPP
|
||||
RUN mkdir -p /var/log/vpp /root/.ssh/
|
||||
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
|
||||
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean
|
||||
|
||||
# Build vppcfg
|
||||
RUN pip install --break-system-packages build netaddr yamale argparse pyyaml ipaddress
|
||||
RUN git clone https://git.ipng.ch/ipng/vppcfg.git && cd vppcfg && python3 -m build && \
|
||||
pip install --break-system-packages dist/vppcfg-*-py3-none-any.whl
|
||||
|
||||
# Config files
|
||||
COPY files/etc/vpp/* /etc/vpp/
|
||||
COPY files/etc/bird/* /etc/bird/
|
||||
COPY files/init-container.sh /sbin/
|
||||
RUN chmod 755 /sbin/init-container.sh
|
||||
CMD ["/sbin/init-container.sh"]
|
||||
```
|
||||
|
||||
A few notable additions:
|
||||
* ***vppcfg*** is a handy utility I wrote and discussed in a previous [[article]({{< ref
|
||||
2022-04-02-vppcfg-2 >}})]. Its purpose is to take a YAML file that describes the configuration of
|
||||
the dataplane (like which interfaces, sub-interfaces, MTU, IP addresses and so on), and then
|
||||
apply this safely to a running dataplane. You can check it out in my
|
||||
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] git repository.
|
||||
* ***openssh-server*** will come in handy to log in to the container, in addition to the already
|
||||
available `docker exec`.
|
||||
* ***bird2*** which will be my controlplane of choice. At a future date, I might also add FRR,
|
||||
which may be a good alternative for some. VPP works well with both. You can check out Bird on
|
||||
the nic.cz [[website](https://bird.network.cz/?get_doc&f=bird.html&v=20)].
|
||||
|
||||
I'll add a couple of default config files for Bird and VPP, and replace the CMD with a generic
|
||||
`/sbin/init-container.sh` in which I can do any late binding stuff before launching VPP.
|
||||
|
||||
### Initializing the Container
|
||||
|
||||
#### VPP Containerlab: NetNS
|
||||
|
||||
VPP's Linux Control Plane plugin wants to run in its own network namespace. So the first order of
|
||||
business of `/sbin/init-container.sh` is to create it:
|
||||
|
||||
```
|
||||
NETNS=${NETNS:="dataplane"}
|
||||
|
||||
echo "Creating dataplane namespace"
|
||||
/usr/bin/mkdir -p /etc/netns/$NETNS
|
||||
/usr/bin/touch /etc/netns/$NETNS/resolv.conf
|
||||
/usr/sbin/ip netns add $NETNS
|
||||
```
|
||||
|
||||
#### VPP Containerlab: SSH
|
||||
|
||||
Then, I'll set the root password (which is `vpp` by the way), and start an SSH daemon which allows
|
||||
for password-less logins:
|
||||
|
||||
```
|
||||
echo "Starting SSH, with credentials root:vpp"
|
||||
sed -i -e 's,^#PermitRootLogin prohibit-password,PermitRootLogin yes,' /etc/ssh/sshd_config
|
||||
sed -i -e 's,^root:.*,root:$y$j9T$kG8pyZEVmwLXEtXekQCRK.$9iJxq/bEx5buni1hrC8VmvkDHRy7ZMsw9wYvwrzexID:20211::::::,' /etc/shadow
|
||||
/etc/init.d/ssh start
|
||||
```
|
||||
|
||||
#### VPP Containerlab: Bird2
|
||||
|
||||
I can already predict that Bird2 won't be the only option for a controlplane, even though I'm a huge
|
||||
fan of it. Therefore, I'll make it configurable to leave the door open for other controlplane
|
||||
implementations in the future:
|
||||
|
||||
```
|
||||
BIRD_ENABLED=${BIRD_ENABLED:="true"}
|
||||
|
||||
if [ "$BIRD_ENABLED" == "true" ]; then
|
||||
echo "Starting Bird in $NETNS"
|
||||
mkdir -p /run/bird /var/log/bird
|
||||
chown bird:bird /var/log/bird
|
||||
ROUTERID=$(ip -br a show eth0 | awk '{ print $3 }' | cut -f1 -d/)
|
||||
sed -i -e "s,.*router id .*,router id $ROUTERID; # Set by container-init.sh," /etc/bird/bird.conf
|
||||
/usr/bin/nsenter --net=/var/run/netns/$NETNS /usr/sbin/bird -u bird -g bird
|
||||
fi
|
||||
```
|
||||
|
||||
I am reminded that Bird won't start if it cannot determine its _router id_. When I start it in the
|
||||
`dataplane` namespace, it will immediately exit, because there will be no IP addresses configured
|
||||
yet. But luckily, it logs its complaint and it's easily addressed. I decide to take the management
|
||||
IPv4 address from `eth0` and write that into the `bird.conf` file, which otherwise does some basic
|
||||
initialization that I described in a previous [[article]({{< ref 2021-09-02-vpp-5 >}})], so I'll
|
||||
skip that here. However, I do include an empty file called `/etc/bird/bird-local.conf` for users to
|
||||
further configure Bird2.
|
||||
|
||||
#### VPP Containerlab: Binding veth pairs
|
||||
|
||||
When Containerlab starts the VPP container, it'll offer it a set of `veth` ports that connect this
|
||||
container to other nodes in the lab. This is done by the `links` list in the topology file
|
||||
[[ref](https://containerlab.dev/manual/network/)]. It's my goal to take all of the interfaces
|
||||
that are of type `veth`, and generate a little snippet to grab them and bind them into VPP while
|
||||
setting their MTU to 9216 to allow for jumbo frames:
|
||||
|
||||
```
|
||||
CLAB_VPP_FILE=${CLAB_VPP_FILE:=/etc/vpp/clab.vpp}
|
||||
|
||||
echo "Generating $CLAB_VPP_FILE"
|
||||
: > $CLAB_VPP_FILE
|
||||
MTU=9216
|
||||
for IFNAME in $(ip -br link show type veth | cut -f1 -d@ | grep -v '^eth0$' | sort); do
|
||||
MAC=$(ip -br link show dev $IFNAME | awk '{ print $3 }')
|
||||
echo " * $IFNAME hw-addr $MAC mtu $MTU"
|
||||
ip link set $IFNAME up mtu $MTU
|
||||
cat << EOF >> $CLAB_VPP_FILE
|
||||
create host-interface name $IFNAME hw-addr $MAC
|
||||
set interface name host-$IFNAME $IFNAME
|
||||
set interface mtu $MTU $IFNAME
|
||||
set interface state $IFNAME up
|
||||
|
||||
EOF
|
||||
done
|
||||
```
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
One thing I realized is that VPP will assign a random MAC address on its copy of the `veth` port,
|
||||
which is not great. I'll explicitly configure it with the same MAC address as the `veth` interface
|
||||
itself, otherwise I'd have to put the interface into promiscuous mode.
|
||||
|
||||
#### VPP Containerlab: VPPcfg
|
||||
|
||||
I'm almost ready, but I have one more detail. The user will be able to offer a
|
||||
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] YAML file to configure the interfaces and so on. If such
|
||||
a file exists, I'll apply it to the dataplane upon startup:
|
||||
|
||||
```
|
||||
VPPCFG_VPP_FILE=${VPPCFG_VPP_FILE:=/etc/vpp/vppcfg.vpp}
|
||||
|
||||
echo "Generating $VPPCFG_VPP_FILE"
|
||||
: > $VPPCFG_VPP_FILE
|
||||
if [ -r /etc/vpp/vppcfg.yaml ]; then
|
||||
vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml -o $VPPCFG_VPP_FILE
|
||||
fi
|
||||
```
|
||||
|
||||
Once the VPP process starts, it'll execute `/etc/vpp/bootstrap.vpp`, which in turn executes these
|
||||
newly generated `/etc/vpp/clab.vpp` to grab the `veth` interfaces, and then `/etc/vpp/vppcfg.vpp` to
|
||||
further configure the dataplane. Easy peasy!
|
||||
|
||||
### Adding VPP to Containerlab
|
||||
|
||||
Roman points out a previous integration for the 6WIND VSR in
|
||||
[[PR#2540](https://github.com/srl-labs/containerlab/pull/2540)]. This serves as a useful guide to
|
||||
get me started. I fork the repo, create a branch so that Roman can also add a few commits, and
|
||||
together we start hacking in [[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)].
|
||||
|
||||
First, I add the documentation skeleton in `docs/manual/kinds/fdio_vpp.md`, which links in from a
|
||||
few other places, and will be where the end-user facing documentation will live. That's about half
|
||||
the contributed LOC, right there!
|
||||
|
||||
Next, I'll create a Go module in `nodes/fdio_vpp/fdio_vpp.go` which doesn't do much other than
|
||||
creating the `struct`, and its required `Register` and `Init` functions. The `Init` function ensures
|
||||
the right capabilities are set in Docker, and the right devices are bound for the container.
|
||||
|
||||
I notice that Containerlab rewrites the Dockerfile `CMD` string and prepends an `if-wait.sh` script
|
||||
to it. This is because when Containerlab starts the container, it'll still be busy adding these
|
||||
`link` interfaces to it, and if a container starts too quickly, it may not see all the interfaces.
|
||||
Containerlab therefore informs the container of the expected count using an environment variable called `CLAB_INTFS`, and this
script simply sleeps until that exact number of interfaces is present. Ok, cool beans.
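I haven't copied `if-wait.sh` here, but the idea boils down to something like this sketch (the real
script in the Containerlab repository may count and time things differently):

```
#!/bin/sh
# Sketch only: wait until this container has as many veth interfaces as
# Containerlab promised via CLAB_INTFS, then start the real entrypoint.
EXPECTED="${CLAB_INTFS:-0}"
while [ "$(ip -br link show type veth | grep -c -v '^eth0')" -lt "$EXPECTED" ]; do
  sleep 1
done
exec /sbin/init-container.sh
```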
|
||||
|
||||
Roman helps me a bit with Go templating. You see, I think it'll be slick to have the CLI prompt for
|
||||
the VPP containers to reflect their hostname, because normally, VPP will assign `vpp# `. I add the
|
||||
template in `nodes/fdio_vpp/vpp_startup_config.go.tpl` and it only has one variable expansion: `unix
|
||||
{ cli-prompt {{ .ShortName }}# }`. But I totally think it's worth it, because when running many VPP
|
||||
containers in the lab, it could otherwise get confusing.
|
||||
|
||||
Roman also shows me a trick in the function `PostDeploy()`, which will write the user's SSH pubkeys
|
||||
to `/root/.ssh/authorized_keys`. This allows users to log in without having to use password
|
||||
authentication.
|
||||
|
||||
Collectively, we decide to punt on the `SaveConfig` function until we're a bit further along. I have
|
||||
an idea how this would work, basically along the lines of calling `vppcfg dump` and bind-mounting
|
||||
that file into the lab directory somewhere. This way, upon restarting, the YAML file can be re-read
|
||||
and the dataplane initialized. But it'll be for another day.
|
||||
|
||||
After the main module is finished, all I have to do is add it to `clab/register.go` and that's just
|
||||
about it. In about 170 lines of code, 50 lines of Go template, and 170 lines of Markdown, this
|
||||
contribution is about ready to ship!
|
||||
|
||||
### Containerlab: Demo
|
||||
|
||||
After I finish writing the documentation, I decide to include a demo with a quickstart to help folks
|
||||
along. A simple lab showing two VPP instances and two Alpine Linux clients can be found on
|
||||
[[git.ipng.ch/ipng/vpp-containerlab](https://git.ipng.ch/ipng/vpp-containerlab)]. Simply check out the
|
||||
repo and start the lab, like so:
|
||||
|
||||
```
|
||||
$ git clone https://git.ipng.ch/ipng/vpp-containerlab.git
|
||||
$ cd vpp-containerlab
|
||||
$ containerlab deploy --topo vpp.clab.yml
|
||||
```
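Later, the same topology file can be used to look at, or tear down, the lab again. These are
standard Containerlab subcommands (double-check `containerlab --help` on your version):

```
$ containerlab inspect --topo vpp.clab.yml    # list the nodes, their state and management addresses
$ containerlab destroy --topo vpp.clab.yml    # clean the lab up again
```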
|
||||
|
||||
#### Containerlab: configs
|
||||
|
||||
The file `vpp.clab.yml` contains an example topology consisting of two VPP instances, each connected to
one Alpine Linux container, as shown below:
|
||||
|
||||
{{< image src="/assets/containerlab/learn-vpp.png" alt="Containerlab Topo" width="100%" >}}
|
||||
|
||||
Two relevant files for each VPP router are included in this
|
||||
[[repository](https://git.ipng.ch/ipng/vpp-containerlab)]:
|
||||
1. `config/vpp*/vppcfg.yaml` configures the dataplane interfaces, including a loopback address.
|
||||
1. `config/vpp*/bird-local.conf` configures the controlplane to enable BFD and OSPF.
|
||||
|
||||
To illustrate these files, let me take a closer look at node `vpp1`. Its VPP dataplane
|
||||
configuration looks like this:
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/vppcfg.yaml
|
||||
interfaces:
|
||||
eth1:
|
||||
description: 'To client1'
|
||||
mtu: 1500
|
||||
lcp: eth1
|
||||
addresses: [ 10.82.98.65/28, 2001:db8:8298:101::1/64 ]
|
||||
eth2:
|
||||
description: 'To vpp2'
|
||||
mtu: 9216
|
||||
lcp: eth2
|
||||
addresses: [ 10.82.98.16/31, 2001:db8:8298:1::1/64 ]
|
||||
loopbacks:
|
||||
loop0:
|
||||
description: 'vpp1'
|
||||
lcp: loop0
|
||||
addresses: [ 10.82.98.0/32, 2001:db8:8298::/128 ]
|
||||
```
|
||||
|
||||
Then, I enable BFD, OSPF and OSPFv3 on `eth2` and `loop0` on both of the VPP routers:
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/bird-local.conf
|
||||
protocol bfd bfd1 {
|
||||
interface "eth2" { interval 100 ms; multiplier 30; };
|
||||
}
|
||||
|
||||
protocol ospf v2 ospf4 {
|
||||
ipv4 { import all; export all; };
|
||||
area 0 {
|
||||
interface "loop0" { stub yes; };
|
||||
interface "eth2" { type pointopoint; cost 10; bfd on; };
|
||||
};
|
||||
}
|
||||
|
||||
protocol ospf v3 ospf6 {
|
||||
ipv6 { import all; export all; };
|
||||
area 0 {
|
||||
interface "loop0" { stub yes; };
|
||||
interface "eth2" { type pointopoint; cost 10; bfd on; };
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
#### Containerlab: playtime!
|
||||
|
||||
Once the lab comes up, I can SSH to the VPP containers (`vpp1` and `vpp2`) which have my SSH pubkeys
|
||||
installed thanks to Roman's work. Barring that, I could still log in as user `root` using
|
||||
password `vpp`. VPP runs in its own network namespace called `dataplane`, which is very similar to SR
Linux's default `network-instance`. I can join that namespace to take a closer look:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ ssh root@vpp1
|
||||
root@vpp1:~# nsenter --net=/var/run/netns/dataplane
|
||||
root@vpp1:~# ip -br a
|
||||
lo DOWN
|
||||
loop0 UP 10.82.98.0/32 2001:db8:8298::/128 fe80::dcad:ff:fe00:0/64
|
||||
eth1 UNKNOWN 10.82.98.65/28 2001:db8:8298:101::1/64 fe80::a8c1:abff:fe77:acb9/64
|
||||
eth2 UNKNOWN 10.82.98.16/31 2001:db8:8298:1::1/64 fe80::a8c1:abff:fef0:7125/64
|
||||
|
||||
root@vpp1:~# ping 10.82.98.1
|
||||
PING 10.82.98.1 (10.82.98.1) 56(84) bytes of data.
|
||||
64 bytes from 10.82.98.1: icmp_seq=1 ttl=64 time=9.53 ms
|
||||
64 bytes from 10.82.98.1: icmp_seq=2 ttl=64 time=15.9 ms
|
||||
^C
|
||||
--- 10.82.98.1 ping statistics ---
|
||||
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
|
||||
rtt min/avg/max/mdev = 9.530/12.735/15.941/3.205 ms
|
||||
```
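As an aside, the OSPF state can also be queried from Bird directly inside that same namespace;
`ospf4` and `ospf6` are the protocol names from `bird-local.conf` above (output omitted here):

```
root@vpp1:~# birdc show protocols
root@vpp1:~# birdc show ospf neighbors ospf4
root@vpp1:~# birdc show ospf neighbors ospf6
```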
|
||||
|
||||
From `vpp1`, I can tell that Bird2's OSPF adjacency has formed, because I can ping the `loop0`
|
||||
address of `vpp2` router on 10.82.98.1. Nice! The two client nodes are running a minimalistic Alpine
|
||||
Linux container, which doesn't ship with SSH by default. But of course I can still enter the
|
||||
containers using `docker exec`, like so:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker exec -it client1 sh
|
||||
/ # ip addr show dev eth1
|
||||
531235: eth1@if531234: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 9500 qdisc noqueue state UP
|
||||
link/ether 00:c1:ab:00:00:01 brd ff:ff:ff:ff:ff:ff
|
||||
inet 10.82.98.66/28 scope global eth1
|
||||
valid_lft forever preferred_lft forever
|
||||
inet6 2001:db8:8298:101::2/64 scope global
|
||||
valid_lft forever preferred_lft forever
|
||||
inet6 fe80::2c1:abff:fe00:1/64 scope link
|
||||
valid_lft forever preferred_lft forever
|
||||
/ # traceroute 10.82.98.82
|
||||
traceroute to 10.82.98.82 (10.82.98.82), 30 hops max, 46 byte packets
|
||||
1 10.82.98.65 (10.82.98.65) 5.906 ms 7.086 ms 7.868 ms
|
||||
2 10.82.98.17 (10.82.98.17) 24.007 ms 23.349 ms 15.933 ms
|
||||
3 10.82.98.82 (10.82.98.82) 39.978 ms 31.127 ms 31.854 ms
|
||||
|
||||
/ # traceroute 2001:db8:8298:102::2
|
||||
traceroute to 2001:db8:8298:102::2 (2001:db8:8298:102::2), 30 hops max, 72 byte packets
|
||||
1 2001:db8:8298:101::1 (2001:db8:8298:101::1) 0.701 ms 7.144 ms 7.900 ms
|
||||
2 2001:db8:8298:1::2 (2001:db8:8298:1::2) 23.909 ms 22.943 ms 23.893 ms
|
||||
3 2001:db8:8298:102::2 (2001:db8:8298:102::2) 31.964 ms 30.814 ms 32.000 ms
|
||||
```
|
||||
|
||||
From the vantage point of `client1`, the first hop represents the `vpp1` node, which forwards to
|
||||
`vpp2`, which finally forwards to `client2`, which shows that both VPP routers are passing traffic.
|
||||
Dope!
|
||||
|
||||
## Results
|
||||
|
||||
After all of this deep-diving, all that's left is for me to demonstrate the Containerlab setup by means of
|
||||
this little screencast [[asciinema](/assets/containerlab/vpp-containerlab.cast)]. I hope you enjoy
|
||||
it as much as I enjoyed creating it:
|
||||
|
||||
{{< asciinema src="/assets/containerlab/vpp-containerlab.cast" >}}
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
I wanted to give a shout-out to Roman Dodin for his help getting the Containerlab parts squared away
when I got a little bit stuck. He took the time to explain the internals and idioms of the Containerlab
|
||||
project, which really saved me a tonne of time. He also pair-programmed the
|
||||
[[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)] with me over the span of two
|
||||
evenings.
|
||||
|
||||
Collaborative open source rocks!
|
||||
713
content/articles/2025-05-28-minio-1.md
Normal file
@@ -0,0 +1,713 @@
|
||||
---
|
||||
date: "2025-05-28T22:07:23Z"
|
||||
title: 'Case Study: Minio S3 - Part 1'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/minio/minio-logo.png" alt="MinIO Logo" width="6em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading
|
||||
scalability, data availability, security, and performance. Millions of customers of all sizes and
|
||||
industries store, manage, analyze, and protect any amount of data for virtually any use case, such
|
||||
as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and
|
||||
easy-to-use management features, you can optimize costs, organize and analyze data, and configure
|
||||
fine-tuned access controls to meet specific business and compliance requirements.
|
||||
|
||||
Amazon's S3 became the _de facto_ standard object storage system, and there exist several fully open
|
||||
source implementations of the protocol. One of them is MinIO: designed to allow enterprises to
|
||||
consolidate all of their data on a single, private cloud namespace. Architected using the same
|
||||
principles as the hyperscalers, AIStor delivers performance at scale at a fraction of the cost
|
||||
compared to the public cloud.
|
||||
|
||||
IPng Networks is an Internet Service Provider, but I also dabble in self-hosting things, for
|
||||
example [[PeerTube](https://video.ipng.ch/)], [[Mastodon](https://ublog.tech/)],
|
||||
[[Immich](https://photos.ipng.ch/)], [[Pixelfed](https://pix.ublog.tech/)] and of course
|
||||
[[Hugo](https://ipng.ch/)]. These services all have one thing in common: they tend to use lots of
|
||||
storage when they grow. At IPng Networks, all hypervisors ship with enterprise SAS flash drives,
|
||||
mostly 1.92TB and 3.84TB. Scaling up each of these services, and backing them up safely, can be
|
||||
quite the headache.
|
||||
|
||||
This article is for the storage buffs. I'll set up a set of distributed MinIO nodes from scratch.
|
||||
|
||||
## Physical
|
||||
|
||||
{{< image float="right" src="/assets/minio/disks.png" alt="MinIO Disks" width="16em" >}}
|
||||
|
||||
I'll start with the basics. I still have a few Dell R720 servers laying around, they are getting a
|
||||
bit older but still have 24 cores and 64GB of memory. First I need to get me some disks. I order
|
||||
36 pcs of 16TB enterprise SATA disks, a mixture of Seagate EXOS and Toshiba MG series drives. I once
learned (the hard way) that buying a big stack of disks from one production run is a risk - so I'll
|
||||
mix and match the drives.
|
||||
|
||||
Three trays of caddies and a melted credit card later, I have 576TB of SATA disks safely in hand.
|
||||
Each machine will carry 192TB of raw storage. The nice thing about this chassis is that Dell can
|
||||
ship them with 12x 3.5" SAS slots in the front, and 2x 2.5" SAS slots in the rear of the chassis.
|
||||
|
||||
So I'll install Debian Bookworm on one small 480G SSD in software RAID1.
|
||||
|
||||
### Cloning an install
|
||||
|
||||
I have three identical machines so in total I'll want six of these SSDs. I temporarily screw the
|
||||
other five in 3.5" drive caddies and plug them into the first installed Dell, which I've called
|
||||
`minio-proto`:
|
||||
|
||||
|
||||
```
|
||||
pim@minio-proto:~$ for i in b c d e f; do
  sudo dd if=/dev/sda of=/dev/sd${i} bs=512 count=1;
  sudo mdadm --manage /dev/md0 --add /dev/sd${i}1
done
pim@minio-proto:~$ sudo mdadm --grow /dev/md0 --raid-devices=6
pim@minio-proto:~$ watch cat /proc/mdstat
pim@minio-proto:~$ for i in a b c d e f; do
  sudo grub-install /dev/sd$i
done
|
||||
```
|
||||
|
||||
{{< image float="right" src="/assets/minio/rack.png" alt="MinIO Rack" width="16em" >}}
|
||||
|
||||
The first command takes my installed disk, `/dev/sda`, and copies the first sector over to the other
|
||||
five. This will give them the same partition table. Next, I'll add the first partition of each disk
|
||||
to the raidset. Then, I'll expand the raidset to have six members, after which the kernel starts a
|
||||
recovery process that syncs the newly added partitions to `/dev/md0` (by copying from `/dev/sda` to
|
||||
all other disks at once). Finally, I'll watch this exciting movie and grab a cup of tea.
|
||||
|
||||
|
||||
Once the disks are fully copied, I'll shut down the machine and distribute the disks to their
|
||||
respective Dell R720, two each. Once they boot they will all be identical. I'll need to make sure
|
||||
their hostnames, and machine/host-id are unique, otherwise things like bridges will have overlapping
|
||||
MAC addresses - ask me how I know:
|
||||
|
||||
```
|
||||
pim@minio-proto:~$ sudo mdadm --grow /dev/md0 --raid-devices=2
pim@minio-proto:~$ sudo rm /etc/ssh/ssh_host*
pim@minio-proto:~$ sudo hostnamectl set-hostname minio0-chbtl0
pim@minio-proto:~$ sudo dpkg-reconfigure openssh-server
pim@minio-proto:~$ sudo dd if=/dev/random of=/etc/hostid bs=4 count=1
pim@minio-proto:~$ /usr/bin/dbus-uuidgen | sudo tee /etc/machine-id
pim@minio-proto:~$ sudo reboot
|
||||
```
|
||||
|
||||
After which I have three beautiful and unique machines:
|
||||
* `minio0.chbtl0.net.ipng.ch`: which will go into my server rack at the IPng office.
|
||||
* `minio0.ddln0.net.ipng.ch`: which will go to [[Daedalean]({{< ref
|
||||
2022-02-24-colo >}})], doing AI since before it was all about vibe coding.
|
||||
* `minio0.chrma0.net.ipng.ch`: which will go to [[IP-Max](https://ip-max.net/)], one of the best
|
||||
ISPs on the planet. 🥰
|
||||
|
||||
|
||||
## Deploying Minio
|
||||
|
||||
The user guide that MinIO provides
|
||||
[[ref](https://min.io/docs/minio/linux/operations/installation.html)] is super good, arguably one of
|
||||
the best documented open source projects I've ever seen. It shows me that I can do three types of
|
||||
install. A 'Standalone' with one disk, a 'Standalone Multi-Drive', and a 'Distributed' deployment.
|
||||
I decide to make three independent standalone multi-drive installs. This way, I have less shared
|
||||
fate, and will be immune to network partitions (as these are going to be in three different
|
||||
physical locations). I've also read about per-bucket _replication_, which will be an excellent way
|
||||
to get geographical distribution and active/active instances to work together.
|
||||
|
||||
I feel good about the single-machine multi-drive decision. I follow the install guide
|
||||
[[ref](https://min.io/docs/minio/linux/operations/install-deploy-manage/deploy-minio-single-node-multi-drive.html#minio-snmd)]
|
||||
for this deployment type.
|
||||
|
||||
### IPng Frontends
|
||||
|
||||
At IPng I use a private IPv4/IPv6/MPLS network that is not connected to the internet. I call this
|
||||
network [[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})]. But how will users reach my Minio
|
||||
install? I have four redundantly and geographically deployed frontends, two in the Netherlands and
|
||||
two in Switzerland. I've described the frontend setup in a [[previous article]({{< ref
|
||||
2023-03-17-ipng-frontends >}})] and the certificate management in [[this article]({{< ref
|
||||
2023-03-24-lego-dns01 >}})].
|
||||
|
||||
I've decided to run the service on these three regionalized endpoints:
|
||||
1. `s3.chbtl0.ipng.ch` which will back into `minio0.chbtl0.net.ipng.ch`
|
||||
1. `s3.ddln0.ipng.ch` which will back into `minio0.ddln0.net.ipng.ch`
|
||||
1. `s3.chrma0.ipng.ch` which will back into `minio0.chrma0.net.ipng.ch`
|
||||
|
||||
The first thing I take note of is that S3 buckets can be either addressed _by path_, in other words
|
||||
something like `s3.chbtl0.ipng.ch/my-bucket/README.md`, but they can also be addressed by virtual
|
||||
host, like so: `my-bucket.s3.chbtl0.ipng.ch/README.md`. A subtle difference, but from the docs I
|
||||
understand that Minio needs to have control of the whole space under its main domain.
|
||||
|
||||
There's a small implication to this requirement -- the Web Console that ships with MinIO (eh, well,
|
||||
maybe that's going to change, more on that later), will want to have its own domain-name, so I
|
||||
choose something simple: `cons0-s3.chbtl0.ipng.ch` and so on. This way, somebody might still be able
|
||||
to have a bucket name called `cons0` :)
|
||||
|
||||
#### Let's Encrypt Certificates
|
||||
|
||||
Alright, so I will be kneading nine domains into this new certificate, which I'll simply call
`s3.ipng.ch`. I configure it in Ansible:
|
||||
|
||||
```
|
||||
certbot:
|
||||
certs:
|
||||
...
|
||||
s3.ipng.ch:
|
||||
groups: [ 'nginx', 'minio' ]
|
||||
altnames:
|
||||
- 's3.chbtl0.ipng.ch'
|
||||
- 'cons0-s3.chbtl0.ipng.ch'
|
||||
- '*.s3.chbtl0.ipng.ch'
|
||||
- 's3.ddln0.ipng.ch'
|
||||
- 'cons0-s3.ddln0.ipng.ch'
|
||||
- '*.s3.ddln0.ipng.ch'
|
||||
- 's3.chrma0.ipng.ch'
|
||||
- 'cons0-s3.chrma0.ipng.ch'
|
||||
- '*.s3.chrma0.ipng.ch'
|
||||
```
|
||||
|
||||
I run the `certbot` playbook and it does two things:
|
||||
1. On the machines from group `nginx` and `minio`, it will ensure there exists a user `lego` with
|
||||
an SSH key and write permissions to `/etc/lego/`; this is where the automation will write (and
|
||||
update) the certificate keys.
|
||||
1. On the `lego` machine, it'll create two files. One is the certificate requestor, and the other
|
||||
is a certificate distribution script that will copy the cert to the right machine(s) when it
|
||||
renews.
|
||||
|
||||
On the `lego` machine, I'll run the cert request for the first time:
|
||||
|
||||
```
|
||||
lego@lego:~$ bin/certbot:s3.ipng.ch
|
||||
lego@lego:~$ RENEWED_LINEAGE=/home/lego/acme-dns/live/s3.ipng.ch bin/certbot-distribute
|
||||
```
|
||||
|
||||
The first script asks me to add the `_acme-challenge` DNS entries, which I'll do, for example on the
`s3.chbtl0.ipng.ch` instance (and similarly for the `ddln0` and `chrma0` ones):
|
||||
|
||||
```
|
||||
$ORIGIN chbtl0.ipng.ch.
|
||||
_acme-challenge.s3 CNAME 51f16fd0-8eb6-455c-b5cd-96fad12ef8fd.auth.ipng.ch.
|
||||
_acme-challenge.cons0-s3 CNAME 450477b8-74c9-4b9e-bbeb-de49c3f95379.auth.ipng.ch.
|
||||
s3 CNAME nginx0.ipng.ch.
|
||||
*.s3 CNAME nginx0.ipng.ch.
|
||||
cons0-s3 CNAME nginx0.ipng.ch.
|
||||
```
|
||||
|
||||
I push and reload the `ipng.ch` zonefile with these changes after which the certificate gets
|
||||
requested and a cronjob added to check for renewals. The second script will copy the newly created
|
||||
cert to all three `minio` machines, and all four `nginx` machines. From now on, every 90 days, a new
|
||||
cert will be automatically generated and distributed. Slick!
|
||||
|
||||
#### NGINX Configs
|
||||
|
||||
With the LE wildcard certs in hand, I can create an NGINX frontend for these minio deployments.
|
||||
|
||||
First, a simple redirector service that punts people on port 80 to port 443:
|
||||
|
||||
```
|
||||
server {
|
||||
listen [::]:80;
|
||||
listen 0.0.0.0:80;
|
||||
|
||||
server_name cons0-s3.chbtl0.ipng.ch s3.chbtl0.ipng.ch *.s3.chbtl0.ipng.ch;
|
||||
access_log /var/log/nginx/s3.chbtl0.ipng.ch-access.log;
|
||||
include /etc/nginx/conf.d/ipng-headers.inc;
|
||||
|
||||
location / {
|
||||
return 301 https://$server_name$request_uri;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Next, the Minio API service itself which runs on port 9000, with a configuration snippet inspired by
|
||||
the MinIO [[docs](https://min.io/docs/minio/linux/integrations/setup-nginx-proxy-with-minio.html)]:
|
||||
|
||||
```
|
||||
server {
|
||||
listen [::]:443 ssl http2;
|
||||
listen 0.0.0.0:443 ssl http2;
|
||||
ssl_certificate /etc/certs/s3.ipng.ch/fullchain.pem;
|
||||
ssl_certificate_key /etc/certs/s3.ipng.ch/privkey.pem;
|
||||
include /etc/nginx/conf.d/options-ssl-nginx.inc;
|
||||
ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
|
||||
|
||||
server_name s3.chbtl0.ipng.ch *.s3.chbtl0.ipng.ch;
|
||||
access_log /var/log/nginx/s3.chbtl0.ipng.ch-access.log upstream;
|
||||
include /etc/nginx/conf.d/ipng-headers.inc;
|
||||
|
||||
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
|
||||
|
||||
ignore_invalid_headers off;
|
||||
client_max_body_size 0;
|
||||
# Disable buffering
|
||||
proxy_buffering off;
|
||||
proxy_request_buffering off;
|
||||
|
||||
location / {
|
||||
proxy_set_header Host $http_host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
|
||||
proxy_connect_timeout 300;
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Connection "";
|
||||
chunked_transfer_encoding off;
|
||||
|
||||
proxy_pass http://minio0.chbtl0.net.ipng.ch:9000;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Finally, the Minio Console service which runs on port 9090:
|
||||
|
||||
```
|
||||
include /etc/nginx/conf.d/geo-ipng-trusted.inc;
|
||||
|
||||
server {
|
||||
listen [::]:443 ssl http2;
|
||||
listen 0.0.0.0:443 ssl http2;
|
||||
ssl_certificate /etc/certs/s3.ipng.ch/fullchain.pem;
|
||||
ssl_certificate_key /etc/certs/s3.ipng.ch/privkey.pem;
|
||||
include /etc/nginx/conf.d/options-ssl-nginx.inc;
|
||||
ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;
|
||||
|
||||
server_name cons0-s3.chbtl0.ipng.ch;
|
||||
access_log /var/log/nginx/cons0-s3.chbtl0.ipng.ch-access.log upstream;
|
||||
include /etc/nginx/conf.d/ipng-headers.inc;
|
||||
|
||||
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
|
||||
|
||||
ignore_invalid_headers off;
|
||||
client_max_body_size 0;
|
||||
# Disable buffering
|
||||
proxy_buffering off;
|
||||
proxy_request_buffering off;
|
||||
|
||||
location / {
|
||||
if ($geo_ipng_trusted = 0) { rewrite ^ https://ipng.ch/ break; }
|
||||
proxy_set_header Host $http_host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
proxy_set_header X-NginX-Proxy true;
|
||||
|
||||
real_ip_header X-Real-IP;
|
||||
proxy_connect_timeout 300;
|
||||
chunked_transfer_encoding off;
|
||||
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Upgrade $http_upgrade;
|
||||
proxy_set_header Connection "upgrade";
|
||||
|
||||
proxy_pass http://minio0.chbtl0.net.ipng.ch:9090;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This last one has an NGINX trick. It will only allow users in if they are in the map called
|
||||
`geo_ipng_trusted`, which contains a set of IPv4 and IPv6 prefixes. Visitors who are not in this map
|
||||
will receive an HTTP redirect back to the [[IPng.ch](https://ipng.ch/)] homepage instead.
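
The map itself lives in `/etc/nginx/conf.d/geo-ipng-trusted.inc`, which isn't shown in this article.
As a sketch of what such an include might look like (the prefixes below are documentation examples,
not IPng's real ranges):

```
geo $geo_ipng_trusted {
    default         0;
    192.0.2.0/24    1;   # example: a trusted IPv4 prefix
    2001:db8::/32   1;   # example: a trusted IPv6 prefix
}
```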
|
||||
|
||||
I run the Ansible Playbook which contains the NGINX changes to all frontends, but of course nothing
|
||||
runs yet, because I haven't yet started MinIO backends.
|
||||
|
||||
### MinIO Backends
|
||||
|
||||
The first thing I need to do is get those disks mounted. MinIO likes using XFS, so I'll install that
|
||||
and prepare the disks as follows:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ sudo apt install xfsprogs
|
||||
pim@minio0-chbtl0:~$ sudo modprobe xfs
|
||||
pim@minio0-chbtl0:~$ echo xfs | sudo tee -a /etc/modules
|
||||
pim@minio0-chbtl0:~$ sudo update-initramfs -k all -u
|
||||
pim@minio0-chbtl0:~$ for i in a b c d e f g h i j k l; do sudo mkfs.xfs /dev/sd$i; done
|
||||
pim@minio0-chbtl0:~$ blkid | awk 'BEGIN {i=1} /TYPE="xfs"/ {
|
||||
printf "%s /minio/disk%d xfs defaults 0 2\n",$2,i; i++;
|
||||
}' | sudo tee -a /etc/fstab
|
||||
pim@minio0-chbtl0:~$ for i in `seq 1 12`; do sudo mkdir -p /minio/disk$i; done
|
||||
pim@minio0-chbtl0:~$ sudo mount -t xfs -a
|
||||
pim@minio0-chbtl0:~$ sudo chown -R minio-user: /minio/
|
||||
```
|
||||
|
||||
From the top: I'll install `xfsprogs` which contains the things I need to manipulate XFS filesystems
|
||||
in Debian. Then I'll install the `xfs` kernel module, and make sure it gets inserted upon subsequent
|
||||
startup by adding it to `/etc/modules` and regenerating the initrd for the installed kernels.
|
||||
|
||||
Next, I'll format all twelve 16TB disks (which are `/dev/sda` - `/dev/sdl` on these machines), and
|
||||
add their resulting blockdevice id's to `/etc/fstab` so they get persistently mounted on reboot.
|
||||
|
||||
Finally, I'll create their mountpoints, mount all XFS filesystems, and chown them to the user that
|
||||
MinIO is running as. End result:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ df -T
|
||||
Filesystem Type 1K-blocks Used Available Use% Mounted on
|
||||
udev devtmpfs 32950856 0 32950856 0% /dev
|
||||
tmpfs tmpfs 6595340 1508 6593832 1% /run
|
||||
/dev/md0 ext4 114695308 5423976 103398948 5% /
|
||||
tmpfs tmpfs 32976680 0 32976680 0% /dev/shm
|
||||
tmpfs tmpfs 5120 4 5116 1% /run/lock
|
||||
/dev/sda xfs 15623792640 121505936 15502286704 1% /minio/disk1
|
||||
/dev/sde xfs 15623792640 121505968 15502286672 1% /minio/disk12
|
||||
/dev/sdi xfs 15623792640 121505968 15502286672 1% /minio/disk11
|
||||
/dev/sdl xfs 15623792640 121505904 15502286736 1% /minio/disk10
|
||||
/dev/sdd xfs 15623792640 121505936 15502286704 1% /minio/disk4
|
||||
/dev/sdb xfs 15623792640 121505968 15502286672 1% /minio/disk3
|
||||
/dev/sdk xfs 15623792640 121505936 15502286704 1% /minio/disk5
|
||||
/dev/sdc xfs 15623792640 121505936 15502286704 1% /minio/disk9
|
||||
/dev/sdf xfs 15623792640 121506000 15502286640 1% /minio/disk2
|
||||
/dev/sdj xfs 15623792640 121505968 15502286672 1% /minio/disk7
|
||||
/dev/sdg xfs 15623792640 121506000 15502286640 1% /minio/disk8
|
||||
/dev/sdh xfs 15623792640 121505968 15502286672 1% /minio/disk6
|
||||
tmpfs tmpfs 6595336 0 6595336 0% /run/user/0
|
||||
```
|
||||
|
||||
MinIO likes to be configured using environment variables - and this is likely because it's a popular
|
||||
thing to run in a containerized environment like Kubernetes. The maintainers ship it also as a
|
||||
Debian package, which will read its environment from `/etc/default/minio`, and I'll prepare that
|
||||
file as follows:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/default/minio
|
||||
MINIO_DOMAIN="s3.chbtl0.ipng.ch,minio0.chbtl0.net.ipng.ch"
|
||||
MINIO_ROOT_USER="XXX"
|
||||
MINIO_ROOT_PASSWORD="YYY"
|
||||
MINIO_VOLUMES="/minio/disk{1...12}"
|
||||
MINIO_OPTS="--console-address :9001"
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ sudo systemctl enable --now minio
|
||||
pim@minio0-chbtl0:~$ sudo journalctl -u minio
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: MinIO Object Storage Server
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: Copyright: 2015-2025 MinIO, Inc.
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: Version: RELEASE.2025-05-24T17-08-30Z (go1.24.3 linux/amd64)
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: API: http://198.19.4.11:9000 http://127.0.0.1:9000
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: WebUI: https://cons0-s3.chbtl0.ipng.ch/
|
||||
May 31 10:44:11 minio0-chbtl0 minio[690420]: Docs: https://docs.min.io
|
||||
|
||||
pim@minio0-chbtl0:~$ sudo ipmitool sensor | grep Watts
|
||||
Pwr Consumption | 154.000 | Watts
|
||||
```
|
||||
|
||||
Incidentally - I am pretty pleased with this 192TB disk tank, sporting 24 cores, 64GB memory and
|
||||
2x10G network, casually hanging out at 154 Watts of power all up. Slick!
|
||||
|
||||
{{< image float="right" src="/assets/minio/minio-ec.svg" alt="MinIO Erasure Coding" width="22em" >}}
|
||||
|
||||
MinIO implements _erasure coding_ as a core component in providing availability and resiliency
|
||||
during drive or node-level failure events. MinIO partitions each object into data and parity shards
|
||||
and distributes those shards across a single so-called _erasure set_. Under the hood, it uses
|
||||
a [[Reed-Solomon](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction)] erasure coding
implementation and partitions the object for distribution. From the MinIO website, I'll borrow a
diagram (shown to the right) of what this looks like on a single node like mine.
|
||||
|
||||
Anyway, MinIO detects 12 disks and installs an erasure set with 8 data disks and 4 parity disks,
|
||||
which it calls `EC:4` encoding, also known in the industry as `RS8.4`.
|
||||
Just like that, the thing shoots to life. Awesome!
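
A quick back-of-the-envelope sketch of what `EC:4` costs me in capacity (MinIO's own accounting comes
out slightly lower, since it reports in TiB and reserves a little space for itself):

```
pim@summer:~$ python3 -c '
raw    = 12 * 16           # twelve 16 TB drives
usable = raw * 8 / 12      # 8 data shards out of 12 (EC:4 = 4 parity shards)
print(f"raw: {raw} TB, usable: {usable:.0f} TB ({usable*1e12/2**40:.0f} TiB)")'
raw: 192 TB, usable: 128 TB (116 TiB)
```

That 116 TiB lines up nicely with what `mc admin info` reports below.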
|
||||
|
||||
### MinIO Client
|
||||
|
||||
On Summer, I'll install the MinIO Client called `mc`. This is easy because the maintainers ship a
|
||||
Linux binary which I can just download. On OpenBSD, they don't do that. Not a problem though, on
|
||||
Squanchy, Pencilvester and Glootie, I will just `go install` the client. Using the `mc` commandline,
|
||||
I can call any of the S3 APIs on my new MinIO instance:
|
||||
|
||||
```
|
||||
pim@summer:~$ set +o history
|
||||
pim@summer:~$ mc alias set chbtl0 https://s3.chbtl0.ipng.ch/ <rootuser> <rootpass>
|
||||
pim@summer:~$ set -o history
|
||||
pim@summer:~$ mc admin info chbtl0/
|
||||
● s3.chbtl0.ipng.ch
|
||||
Uptime: 22 hours
|
||||
Version: 2025-05-24T17:08:30Z
|
||||
Network: 1/1 OK
|
||||
Drives: 12/12 OK
|
||||
Pool: 1
|
||||
|
||||
┌──────┬───────────────────────┬─────────────────────┬──────────────┐
|
||||
│ Pool │ Drives Usage │ Erasure stripe size │ Erasure sets │
|
||||
│ 1st │ 0.8% (total: 116 TiB) │ 12 │ 1 │
|
||||
└──────┴───────────────────────┴─────────────────────┴──────────────┘
|
||||
|
||||
95 GiB Used, 5 Buckets, 5,859 Objects, 318 Versions, 1 Delete Marker
|
||||
12 drives online, 0 drives offline, EC:4
|
||||
|
||||
```
|
||||
|
||||
Cool beans. I think I should get rid of this root account though, I've installed those credentials
|
||||
into the `/etc/default/minio` environment file, but I don't want to keep them out in the open. So
|
||||
I'll make an account for myself and assign me reasonable privileges, called `consoleAdmin` in the
|
||||
default install:
|
||||
|
||||
```
|
||||
pim@summer:~$ set +o history
|
||||
pim@summer:~$ mc admin user add chbtl0/ <someuser> <somepass>
|
||||
pim@summer:~$ mc admin policy info chbtl0 consoleAdmin
|
||||
pim@summer:~$ mc admin policy attach chbtl0 consoleAdmin --user=<someuser>
|
||||
pim@summer:~$ mc alias set chbtl0 https://s3.chbtl0.ipng.ch/ <someuser> <somepass>
|
||||
pim@summer:~$ set -o history
|
||||
```
|
||||
|
||||
OK, I feel less gross now that I'm not operating as root on the MinIO deployment. Using my new
|
||||
user-powers, let me set some metadata on my new minio server:
|
||||
|
||||
```
|
||||
pim@summer:~$ mc admin config set chbtl0/ site name=chbtl0 region=switzerland
|
||||
Successfully applied new settings.
|
||||
Please restart your server 'mc admin service restart chbtl0/'.
|
||||
pim@summer:~$ mc admin service restart chbtl0/
|
||||
Service status: ▰▰▱ [DONE]
|
||||
Summary:
|
||||
┌───────────────┬─────────────────────────────┐
|
||||
│ Servers: │ 1 online, 0 offline, 0 hung │
|
||||
│ Restart Time: │ 61.322886ms │
|
||||
└───────────────┴─────────────────────────────┘
|
||||
pim@summer:~$ mc admin config get chbtl0/ site
|
||||
site name=chbtl0 region=switzerland
|
||||
```
|
||||
|
||||
By the way, what's really cool about these open standards is that both the Amazon `aws` client works
|
||||
with MinIO, but `mc` also works with AWS!
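
For example, pointing the `aws` CLI at this deployment is just a matter of overriding the endpoint
(a sketch, assuming the `aws` CLI is installed and the same access key pair has been set up with
`aws configure`):

```
pim@summer:~$ aws --endpoint-url https://s3.chbtl0.ipng.ch s3 ls
```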
|
||||
### MinIO Console
|
||||
|
||||
Although I'm pretty good with APIs and command line tools, there's some benefit also in using a
|
||||
Graphical User Interface. MinIO ships with one, but there was a bit of a kerfuffle in the MinIO
|
||||
community. Unfortunately, these are pretty common -- Redis (an open source key/value storage system)
|
||||
changed their offering abruptly. Terraform (an open source infrastructure-as-code tool) changed
|
||||
their licensing at some point. Ansible (an open source machine management tool) changed their
|
||||
offering also. MinIO developers decided to strip their console of ~all features recently. The gnarly
|
||||
bits are discussed on
|
||||
[[reddit](https://www.reddit.com/r/selfhosted/comments/1kva3pw/avoid_minio_developers_introduce_trojan_horse/)],
but suffice to say: the same thing that happened in literally 100% of the other cases also happened
|
||||
here. Somebody decided to simply fork the code from before it was changed.
|
||||
|
||||
Enter OpenMaxIO. A cringe-worthy name, but it gets the job done. Reading up on the
|
||||
[[GitHub](https://github.com/OpenMaxIO/openmaxio-object-browser/issues/5)], reviving the fully
|
||||
working console is pretty straightforward -- that is, once somebody spent a few days figuring it
|
||||
out. Thank you `icesvz` for this excellent pointer. With this, I can create a systemd service for
|
||||
the console and start it:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee -a /etc/default/minio
|
||||
## NOTE(pim): For openmaxio console service
|
||||
CONSOLE_MINIO_SERVER="http://localhost:9000"
|
||||
MINIO_BROWSER_REDIRECT_URL="https://cons0-s3.chbtl0.ipng.ch/"
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /lib/systemd/system/minio-console.service
|
||||
[Unit]
|
||||
Description=OpenMaxIO Console Service
|
||||
Wants=network-online.target
|
||||
After=network-online.target
|
||||
AssertFileIsExecutable=/usr/local/bin/minio-console
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
|
||||
WorkingDirectory=/usr/local
|
||||
|
||||
User=minio-user
|
||||
Group=minio-user
|
||||
ProtectProc=invisible
|
||||
|
||||
EnvironmentFile=-/etc/default/minio
|
||||
ExecStart=/usr/local/bin/minio-console server
|
||||
Restart=always
|
||||
LimitNOFILE=1048576
|
||||
MemoryAccounting=no
|
||||
TasksMax=infinity
|
||||
TimeoutSec=infinity
|
||||
OOMScoreAdjust=-1000
|
||||
SendSIGKILL=no
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ sudo systemctl enable --now minio-console
|
||||
pim@minio0-chbtl0:~$ sudo systemctl restart minio
|
||||
```
|
||||
|
||||
The first snippet is an update to the MinIO configuration that instructs it to redirect users who
|
||||
are not trying to use the API to the console endpoint on `cons0-s3.chbtl0.ipng.ch`, and then the
|
||||
console-server needs to know where to find the API, which from its vantage point is running on
|
||||
`localhost:9000`. Hello, beautiful fully featured console:
|
||||
|
||||
{{< image src="/assets/minio/console-1.png" alt="MinIO Console" >}}
|
||||
|
||||
### MinIO Prometheus
|
||||
|
||||
MinIO ships with a prometheus metrics endpoint, and I notice on its console that it has a nice
|
||||
metrics tab, which is fully greyed out. This is most likely because, well, I don't have a Prometheus
|
||||
install here yet. I decide to keep the storage nodes self-contained and start a Prometheus server on
|
||||
the local machine. I can always plumb that to IPng's Grafana instance later.
|
||||
|
||||
For now, I'll install Prometheus as follows:
|
||||
|
||||
```
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee -a /etc/default/minio
|
||||
## NOTE(pim): Metrics for minio-console
|
||||
MINIO_PROMETHEUS_AUTH_TYPE="public"
|
||||
CONSOLE_PROMETHEUS_URL="http://localhost:19090/"
|
||||
CONSOLE_PROMETHEUS_JOB_ID="minio-job"
|
||||
EOF
|
||||
|
||||
pim@minio0-chbtl0:~$ sudo apt install prometheus
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/default/prometheus
|
||||
ARGS="--web.listen-address='[::]:19090' --storage.tsdb.retention.size=16GB"
|
||||
EOF
|
||||
pim@minio0-chbtl0:~$ cat << EOF | sudo tee /etc/prometheus/prometheus.yml
|
||||
global:
|
||||
scrape_interval: 60s
|
||||
|
||||
scrape_configs:
|
||||
- job_name: minio-job
|
||||
metrics_path: /minio/v2/metrics/cluster
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: minio-job-node
|
||||
metrics_path: /minio/v2/metrics/node
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: minio-job-bucket
|
||||
metrics_path: /minio/v2/metrics/bucket
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: minio-job-resource
|
||||
metrics_path: /minio/v2/metrics/resource
|
||||
static_configs:
|
||||
- targets: ['localhost:9000']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
|
||||
- job_name: node
|
||||
static_configs:
|
||||
- targets: ['localhost:9100']
|
||||
labels:
|
||||
cluster: minio0-chbtl0
|
||||
pim@minio0-chbtl0:~$ sudo systemctl restart minio prometheus
|
||||
```
|
||||
|
||||
In the first snippet, I'll tell MinIO where it should find its Prometheus instance. Since the MinIO
|
||||
console service is running on port 9090, and this is also the default port for Prometheus, I will
|
||||
run Prometheus on port 19090 instead. From reading the MinIO docs, I can see that normally MinIO will
|
||||
want prometheus to authenticate to it before it'll allow the endpoints to be scraped. I'll turn that
|
||||
off by making these public. On the IPng Frontends, I can always remove access to /minio/v2 and
|
||||
simply use the IPng Site Local access for local Prometheus scrapers instead.
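
If I ever do want to close that off at the edge, it would be a one-line addition to the API server
block on the frontends (a sketch, not something I've deployed here):

```
    # inside the s3.chbtl0.ipng.ch server block, before the catch-all 'location /'
    location /minio/v2/ { return 403; }
```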
|
||||
|
||||
After telling Prometheus its runtime arguments (in `/etc/default/prometheus`) and its scraping
|
||||
endpoints (in `/etc/prometheus/prometheus.yml`), I can restart minio and prometheus. A few minutes
|
||||
later, I can see the _Metrics_ tab in the console come to life.
|
||||
|
||||
But now that I have this prometheus running on the MinIO node, I can also add it to IPng's Grafana
|
||||
configuration, by adding a new data source on `minio0.chbtl0.net.ipng.ch:19090` and pointing the
|
||||
default Grafana [[Dashboard](https://grafana.com/grafana/dashboards/13502-minio-dashboard/)] at it:
|
||||
|
||||
{{< image src="/assets/minio/console-2.png" alt="Grafana Dashboard" >}}
|
||||
|
||||
A two-for-one: I will both be able to see metrics directly in the console, but also I will be able
|
||||
to hook up these per-node prometheus instances into IPng's alertmanager also, and I've read some
|
||||
[[docs](https://min.io/docs/minio/linux/operations/monitoring/collect-minio-metrics-using-prometheus.html)]
|
||||
on the concepts. I'm really liking the experience so far!
|
||||
|
||||
### MinIO Nagios
|
||||
|
||||
Prometheus is fancy and all, but at IPng Networks, I've been doing monitoring for a while now. As a
|
||||
dinosaur, I still have an active [[Nagios](https://www.nagios.org/)] install, which autogenerates
|
||||
all of its configuration using the Ansible repository I have. So for the new Ansible group called
|
||||
`minio`, I will autogenerate the following snippet:
|
||||
|
||||
```
|
||||
define command {
|
||||
command_name ipng_check_minio
|
||||
command_line $USER1$/check_http -E -H $HOSTALIAS$ -I $ARG1$ -p $ARG2$ -u $ARG3$ -r '$ARG4$'
|
||||
}
|
||||
|
||||
define service {
|
||||
hostgroup_name ipng:minio:ipv6
|
||||
service_description minio6:api
|
||||
check_command ipng_check_minio!$_HOSTADDRESS6$!9000!/minio/health/cluster!
|
||||
use ipng-service-fast
|
||||
notification_interval 0 ; set > 0 if you want to be renotified
|
||||
}
|
||||
|
||||
define service {
|
||||
hostgroup_name ipng:minio:ipv6
|
||||
service_description minio6:prom
|
||||
check_command ipng_check_minio!$_HOSTADDRESS6$!19090!/classic/targets!minio-job
|
||||
use ipng-service-fast
|
||||
notification_interval 0 ; set > 0 if you want to be renotified
|
||||
}
|
||||
|
||||
define service {
|
||||
hostgroup_name ipng:minio:ipv6
|
||||
service_description minio6:console
|
||||
check_command ipng_check_minio!$_HOSTADDRESS6$!9090!/!MinIO Console
|
||||
use ipng-service-fast
|
||||
notification_interval 0 ; set > 0 if you want to be renotified
|
||||
}
|
||||
```
|
||||
|
||||
I've shown the snippet for IPv6 but I also have three services defined for legacy IP in the
|
||||
hostgroup `ipng:minio:ipv4`. The check command here uses `-I` which has the IPv4 or IPv6 address to
|
||||
talk to, `-p` for the port to consult, `-u` for the URI to hit and an option `-r` for a regular
expression to expect in the output. For the Nagios aficionados out there: my Ansible `groups`
|
||||
correspond one to one with autogenerated Nagios `hostgroups`. This allows me to add arbitrary checks
|
||||
by group-type, like above in the `ipng:minio` group for IPv4 and IPv6.
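
To make the macro soup a bit more concrete, the `minio6:prom` service above expands to roughly the
following plugin invocation (a sketch; the plugin path and the IPv6 address are examples):

```
/usr/lib/nagios/plugins/check_http -E -H minio0.chbtl0.net.ipng.ch \
    -I 2001:db8:8298::11 -p 19090 -u /classic/targets -r 'minio-job'
```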
|
||||
|
||||
In the MinIO [[docs](https://min.io/docs/minio/linux/operations/monitoring/healthcheck-probe.html)]
|
||||
I read up on the Healthcheck API. I choose to monitor the _Cluster Write Quorum_ on my minio
|
||||
deployments. For Prometheus, I decide to hit the `targets` endpoint and expect the `minio-job` to be
|
||||
among them. Finally, for the MinIO Console, I expect to see a login screen with the words `MinIO
|
||||
Console` in the returned page. I guessed right, because Nagios is all green:
|
||||
|
||||
{{< image src="/assets/minio/nagios.png" alt="Nagios Dashboard" >}}
|
||||
|
||||
## My First Bucket
|
||||
|
||||
The IPng website is a statically generated Hugo site, and whenever I submit a change to my Git
|
||||
repo, a CI/CD runner (called [[Drone](https://www.drone.io/)]), picks up the change. It re-builds
|
||||
the static website, and copies it to four redundant NGINX servers.
|
||||
|
||||
But IPng's website has amassed quite a few extra files (like VM images and VPP packages that I
|
||||
publish), which are copied separately using a simple push script I have in my home directory. This
|
||||
avoids all those big media files from cluttering the Git repository. I decide to move this stuff
|
||||
into S3:
|
||||
|
||||
```
|
||||
pim@summer:~/src/ipng-web-assets$ echo 'Gruezi World.' > ipng.ch/media/README.md
|
||||
pim@summer:~/src/ipng-web-assets$ mc mb chbtl0/ipng-web-assets
|
||||
pim@summer:~/src/ipng-web-assets$ mc mirror . chbtl0/ipng-web-assets/
|
||||
...ch/media/README.md: 6.50 GiB / 6.50 GiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 236.38 MiB/s 28s
|
||||
pim@summer:~/src/ipng-web-assets$ mc anonymous set download chbtl0/ipng-web-assets/
|
||||
```
|
||||
|
||||
OK, two things that immediately jump out at me. This stuff is **fast**: Summer is connected with a
|
||||
2.5GbE network card, and she's running hard, copying the 6.5GB of data that are in these web assets
|
||||
essentially at line rate. It doesn't really surprise me because Summer is running off of Gen4 NVME,
|
||||
while MinIO has 12 spinning disks which each can write about 160MB/s or so sustained
|
||||
[[ref](https://www.seagate.com/www-content/datasheets/pdfs/exos-x16-DS2011-1-1904US-en_US.pdf)],
|
||||
with 24 CPUs to tend to the NIC (2x10G) and disks (2x SSD, 12x LFF). Should be plenty!
|
||||
|
||||
The second is that MinIO allows for buckets to be publicly shared in three ways: 1) read-only by
|
||||
setting `download`; 2) write-only by setting `upload`, and 3) read-write by setting `public`.
|
||||
I set `download` here, which means I should be able to fetch an asset now publicly:
|
||||
|
||||
```
|
||||
pim@summer:~$ curl https://s3.chbtl0.ipng.ch/ipng-web-assets/ipng.ch/media/README.md
|
||||
Gruezi World.
|
||||
pim@summer:~$ curl https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/README.md
|
||||
Gruezi World.
|
||||
```
|
||||
|
||||
The first `curl` here shows the path-based access, while the second one shows an equivalent
|
||||
virtual-host based access. Both retrieve the file I just pushed via the public Internet. Whoot!
|
||||
|
||||
# What's Next
|
||||
|
||||
I'm going to be moving [[Restic](https://restic.net/)] backups from IPng's ZFS storage pool to this
|
||||
S3 service over the next few days. I'll also migrate PeerTube and possibly Mastodon from NVME based
|
||||
storage to replicated S3 buckets as well. Finally, the IPng website media that I mentioned above,
|
||||
should make for a nice followup article. Stay tuned!
|
||||
475
content/articles/2025-06-01-minio-2.md
Normal file
@@ -0,0 +1,475 @@
|
||||
---
|
||||
date: "2025-06-01T10:07:23Z"
|
||||
title: 'Case Study: Minio S3 - Part 2'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/minio/minio-logo.png" alt="MinIO Logo" width="6em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading
|
||||
scalability, data availability, security, and performance. Millions of customers of all sizes and
|
||||
industries store, manage, analyze, and protect any amount of data for virtually any use case, such
|
||||
as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and
|
||||
easy-to-use management features, you can optimize costs, organize and analyze data, and configure
|
||||
fine-tuned access controls to meet specific business and compliance requirements.
|
||||
|
||||
Amazon's S3 became the _de facto_ standard object storage system, and there exist several fully open
|
||||
source implementations of the protocol. One of them is MinIO: designed to allow enterprises to
|
||||
consolidate all of their data on a single, private cloud namespace. Architected using the same
|
||||
principles as the hyperscalers, AIStor delivers performance at scale at a fraction of the cost
|
||||
compared to the public cloud.
|
||||
|
||||
IPng Networks is an Internet Service Provider, but I also dabble in self-hosting things, for
|
||||
example [[PeerTube](https://video.ipng.ch/)], [[Mastodon](https://ublog.tech/)],
|
||||
[[Immich](https://photos.ipng.ch/)], [[Pixelfed](https://pix.ublog.tech/)] and of course
|
||||
[[Hugo](https://ipng.ch/)]. These services all have one thing in common: they tend to use lots of
|
||||
storage when they grow. At IPng Networks, all hypervisors ship with enterprise SAS flash drives,
|
||||
mostly 1.92TB and 3.84TB. Scaling up each of these services, and backing them up safely, can be
|
||||
quite the headache.
|
||||
|
||||
In a [[previous article]({{< ref 2025-05-28-minio-1 >}})], I talked through the install of a
|
||||
redundant set of three Minio machines. In this article, I'll start putting them to good use.
|
||||
|
||||
## Use Case: Restic
|
||||
|
||||
{{< image float="right" src="/assets/minio/restic-logo.png" alt="Restic Logo" width="12em" >}}
|
||||
|
||||
[[Restic](https://restic.org/)] is a modern backup program that can back up your files from multiple
|
||||
host OS, to many different storage types, easily, effectively, securely, verifiably and freely. With
|
||||
a sales pitch like that, what's not to love? Actually, I am a long-time
|
||||
[[BorgBackup](https://www.borgbackup.org/)] user, and I think I'll keep that running. However, for
|
||||
resilience, and because I've heard only good things about Restic, I'll make a second backup of the
|
||||
routers, hypervisors, and virtual machines using Restic.
|
||||
|
||||
Restic can use S3 buckets out of the box (incidentally, so can BorgBackup). To configure it, I use
|
||||
a mixture of environment variables and flags. But first, let me create a bucket for the backups.
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc mb chbtl0/ipng-restic
|
||||
pim@glootie:~$ mc admin user add chbtl0/ <key> <secret>
|
||||
pim@glootie:~$ cat << EOF | tee ipng-restic-access.json
|
||||
{
|
||||
"PolicyName": "ipng-restic-access",
|
||||
"Policy": {
|
||||
"Version": "2012-10-17",
|
||||
"Statement": [
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:DeleteObject", "s3:GetObject", "s3:ListBucket", "s3:PutObject" ],
|
||||
"Resource": [ "arn:aws:s3:::ipng-restic", "arn:aws:s3:::ipng-restic/*" ]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
EOF
|
||||
pim@glootie:~$ mc admin policy create chbtl0/ ipng-restic-access ipng-restic-access.json
|
||||
pim@glootie:~$ mc admin policy attach chbtl0/ ipng-restic-access --user <key>
|
||||
```
|
||||
|
||||
First, I'll create a bucket called `ipng-restic`. Then, I'll create a _user_ with a given secret
|
||||
_key_. To protect the innocent, and my backups, I'll not disclose them. Next, I'll create an
|
||||
IAM policy that allows for Get/List/Put/Delete to be performed on the bucket and its contents, and
|
||||
finally I'll attach this policy to the user I just created.
|
||||
|
||||
To run a Restic backup, I'll first have to create a so-called _repository_. The repository has a
|
||||
location and a password, which Restic uses to encrypt the data. Because I'm using S3, I'll also need
|
||||
to specify the key and secret:
|
||||
|
||||
```
|
||||
root@glootie:~# RESTIC_PASSWORD="changeme"
|
||||
root@glootie:~# RESTIC_REPOSITORY="s3:https://s3.chbtl0.ipng.ch/ipng-restic/$(hostname)/"
|
||||
root@glootie:~# AWS_ACCESS_KEY_ID="<key>"
|
||||
root@glootie:~# AWS_SECRET_ACCESS_KEY="<secret>"
|
||||
root@glootie:~# export RESTIC_PASSWORD RESTIC_REPOSITORY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
|
||||
root@glootie:~# restic init
|
||||
created restic repository 807cf25e85 at s3:https://s3.chbtl0.ipng.ch/ipng-restic/glootie.ipng.ch/
|
||||
```
|
||||
|
||||
Restic prints the fingerprint of the repository it just created. Taking a
look at the MinIO install:
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc stat chbtl0/ipng-restic/glootie.ipng.ch/
|
||||
Name : config
|
||||
Date : 2025-06-01 12:01:43 UTC
|
||||
Size : 155 B
|
||||
ETag : 661a43f72c43080649712e45da14da3a
|
||||
Type : file
|
||||
Metadata :
|
||||
Content-Type: application/octet-stream
|
||||
|
||||
Name : keys/
|
||||
Date : 2025-06-01 12:03:33 UTC
|
||||
Type : folder
|
||||
```
|
||||
|
||||
Cool. Now I'm ready to make my first full backup:
|
||||
|
||||
```
|
||||
root@glootie:~# ARGS="--exclude /proc --exclude /sys --exclude /dev --exclude /run"
|
||||
root@glootie:~# ARGS="$ARGS --exclude-if-present .nobackup"
|
||||
root@glootie:~# restic backup $ARGS /
|
||||
...
|
||||
processed 1141426 files, 131.111 GiB in 15:12
|
||||
snapshot 34476c74 saved
|
||||
```
|
||||
|
||||
Once the backup completes, the Restic authors advise me to also do a check of the repository, and to
|
||||
prune it so that it keeps a finite amount of daily, weekly and monthly backups. My further journey
|
||||
for Restic looks a bit like this:
|
||||
|
||||
```
|
||||
root@glootie:~# restic check
|
||||
using temporary cache in /tmp/restic-check-cache-2712250731
|
||||
create exclusive lock for repository
|
||||
load indexes
|
||||
check all packs
|
||||
check snapshots, trees and blobs
|
||||
[0:04] 100.00% 1 / 1 snapshots
|
||||
|
||||
no errors were found
|
||||
|
||||
root@glootie:~# restic forget --prune --keep-daily 8 --keep-weekly 5 --keep-monthly 6
|
||||
repository 34476c74 opened (version 2, compression level auto)
|
||||
Applying Policy: keep 8 daily, 5 weekly, 6 monthly snapshots
|
||||
keep 1 snapshots:
|
||||
ID Time Host Tags Reasons Paths
|
||||
---------------------------------------------------------------------------------
|
||||
34476c74 2025-06-01 12:18:54 glootie.ipng.ch daily snapshot /
|
||||
weekly snapshot
|
||||
monthly snapshot
|
||||
----------------------------------------------------------------------------------
|
||||
1 snapshots
|
||||
```
|
||||
|
||||
Right on! I proceed to update the Ansible configs at IPng to roll this out against the entire fleet
|
||||
of 152 hosts at IPng Networks. I do this in a little tool called `bitcron`, which I wrote for a
|
||||
previous company I worked at: [[BIT](https://bit.nl)] in the Netherlands. Bitcron allows me to
|
||||
create relatively elegant cronjobs that can raise warnings, errors and fatal issues. If no issues
|
||||
are found, an e-mail can be sent to a bitbucket address, but if warnings or errors are found, a
|
||||
different _monitored_ address will be used. Bitcron is kind of cool, and I wrote it in 2001. Maybe
|
||||
I'll write about it, for old time's sake. I wonder if the folks at BIT still use it?
|
||||
|
||||
## Use Case: NGINX
|
||||
|
||||
{{< image float="right" src="/assets/minio/nginx-logo.png" alt="NGINX Logo" width="11em" >}}
|
||||
|
||||
OK, with the first use case out of the way, I turn my attention to a second - in my opinion more
|
||||
interesting - use case. In the [[previous article]({{< ref 2025-05-28-minio-1 >}})], I created a
|
||||
public bucket called `ipng-web-assets` in which I stored 6.50GB of website data belonging to the
|
||||
IPng website, and some material I posted when I was on my
|
||||
[[Sabbatical](https://sabbatical.ipng.nl/)] last year.
|
||||
|
||||
### MinIO: Bucket Replication
|
||||
|
||||
First things first: redundancy. These web assets are currently pushed to all four nginx machines,
|
||||
and statically served. If I were to replace them with a single S3 bucket, I would create a single
|
||||
point of failure, and that's _no bueno_!
|
||||
|
||||
Off I go, creating a replicated bucket using two MinIO instances (`chbtl0` and `ddln0`):
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc mb ddln0/ipng-web-assets
|
||||
pim@glootie:~$ mc anonymous set download ddln0/ipng-web-assets
|
||||
pim@glootie:~$ mc admin user add ddln0/ <replkey> <replsecret>
|
||||
pim@glootie:~$ cat << EOF | tee ipng-web-assets-access.json
|
||||
{
|
||||
"PolicyName": "ipng-web-assets-access",
|
||||
"Policy": {
|
||||
"Version": "2012-10-17",
|
||||
"Statement": [
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:DeleteObject", "s3:GetObject", "s3:ListBucket", "s3:PutObject" ],
|
||||
"Resource": [ "arn:aws:s3:::ipng-web-assets", "arn:aws:s3:::ipng-web-assets/*" ]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
EOF
|
||||
pim@glootie:~$ mc admin policy create ddln0/ ipng-web-assets-access ipng-web-assets-access.json
|
||||
pim@glootie:~$ mc admin policy attach ddln0/ ipng-web-assets-access --user <replkey>
|
||||
pim@glootie:~$ mc replicate add chbtl0/ipng-web-assets \
|
||||
--remote-bucket https://<key>:<secret>@s3.ddln0.ipng.ch/ipng-web-assets
|
||||
```
|
||||
|
||||
What happens next is pure magic. I've told `chbtl0` that I want it to replicate all existing and
|
||||
future changes to that bucket to its neighbor `ddln0`. Only minutes later, I check the replication
|
||||
status, just to see that it's _already done_:
|
||||
|
||||
```
|
||||
pim@glootie:~$ mc replicate status chbtl0/ipng-web-assets
|
||||
Replication status since 1 hour
|
||||
s3.ddln0.ipng.ch
|
||||
Replicated: 142 objects (6.5 GiB)
|
||||
Queued: ● 0 objects, 0 B (avg: 4 objects, 915 MiB ; max: 0 objects, 0 B)
|
||||
Workers: 0 (avg: 0; max: 0)
|
||||
Transfer Rate: 15 kB/s (avg: 88 MB/s; max: 719 MB/s
|
||||
Latency: 3ms (avg: 3ms; max: 7ms)
|
||||
Link: ● online (total downtime: 0 milliseconds)
|
||||
Errors: 0 in last 1 minute; 0 in last 1hr; 0 since uptime
|
||||
Configured Max Bandwidth (Bps): 644 GB/s Current Bandwidth (Bps): 975 B/s
|
||||
pim@summer:~/src/ipng-web-assets$ mc ls ddln0/ipng-web-assets/
|
||||
[2025-06-01 12:42:22 CEST] 0B ipng.ch/
|
||||
[2025-06-01 12:42:22 CEST] 0B sabbatical.ipng.nl/
|
||||
```
|
||||
|
||||
MinIO has pumped the data from bucket `ipng-web-assets` to the other machine at an average of 88MB/s
|
||||
with a peak throughput of 719MB/s (probably for the larger VM images). And indeed, looking at the
|
||||
remote machine, it is fully caught up after the push, within only a minute or so with a completely
|
||||
fresh copy. Nice!
|
||||
|
||||
### MinIO: Missing directory index
|
||||
|
||||
I take a look at what I just built, on the following URL:
|
||||
* [https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/IMG_0406_0.mp4](https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/IMG_0406_0.mp4)
|
||||
|
||||
That checks out, and I can see the mess that was my room when I first went on sabbatical. By the
|
||||
way, I totally cleaned it up, see
|
||||
[[here](https://sabbatical.ipng.nl/blog/2024/08/01/thursday-basement-done/)] for proof. I can't,
|
||||
however, see the directory listing:
|
||||
|
||||
```
|
||||
pim@glootie:~$ curl https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<Error>
|
||||
<Code>NoSuchKey</Code>
|
||||
<Message>The specified key does not exist.</Message>
|
||||
<Key>sabbatical.ipng.nl/media/vdo/</Key>
|
||||
<BucketName>ipng-web-assets</BucketName>
|
||||
<Resource>/sabbatical.ipng.nl/media/vdo/</Resource>
|
||||
<RequestId>1844EC0CFEBF3C5F</RequestId>
|
||||
<HostId>dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8</HostId>
|
||||
</Error>
|
||||
```
|
||||
|
||||
That's unfortunate, because some of the IPng articles link to a directory full of files, which I'd
|
||||
like to be shown so that my readers can navigate through the directories. Surely I'm not the first
|
||||
to encounter this? And sure enough, I'm not: I find this
[[ref](https://github.com/glowinthedark/index-html-generator)] by user `glowinthedark`, who wrote a
little Python script that generates `index.html` files for their Caddy file server. I'll take me
some of that Python, thank you!
|
||||
|
||||
With the following little script, my setup is complete:
|
||||
|
||||
```
|
||||
pim@glootie:~/src/ipng-web-assets$ cat push.sh
|
||||
#!/usr/bin/env bash
|
||||
|
||||
echo "Generating index.html files ..."
|
||||
for D in */media; do
|
||||
echo "* Directory $D"
|
||||
./genindex.py -r $D
|
||||
done
|
||||
echo "Done (genindex)"
|
||||
echo ""
|
||||
|
||||
echo "Mirroring directoro to S3 Bucket"
|
||||
mc mirror --remove --overwrite . chbtl0/ipng-web-assets/
|
||||
echo "Done (mc mirror)"
|
||||
echo ""
|
||||
pim@glootie:~/src/ipng-web-assets$ ./push.sh
|
||||
```
|
||||
|
||||
Only a few seconds after I run `./push.sh`, the replication is complete and I have two identical
|
||||
copies of my media:
|
||||
|
||||
1. [https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/](https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/index.html)
|
||||
1. [https://ipng-web-assets.s3.ddln0.ipng.ch/ipng.ch/media/](https://ipng-web-assets.s3.ddln0.ipng.ch/ipng.ch/media/index.html)
|
||||
|
||||
|
||||
### NGINX: Proxy to Minio
|
||||
|
||||
Before moving to S3 storage, my NGINX frontends all kept a copy of the IPng media on local NVME
|
||||
disk. That's great for reliability, as each NGINX instance is completely hermetic and standalone.
|
||||
However, it's not great for scaling: the current NGINX instances only have 16GB of local storage,
|
||||
and I'd rather not have my static web asset data outgrow that filesystem. From before, I already had
|
||||
an NGINX config that served the Hugo static data from `/var/www/ipng.ch/` and the `/media`
subdirectory from a different directory, `/var/www/ipng-web-assets/ipng.ch/media`.
|
||||
|
||||
Moving to redundant S3 storage backends is straightforward:
|
||||
|
||||
```
|
||||
upstream minio_ipng {
|
||||
least_conn;
|
||||
server minio0.chbtl0.net.ipng.ch:9000;
|
||||
server minio0.ddln0.net.ipng.ch:9000;
|
||||
}
|
||||
|
||||
server {
|
||||
...
|
||||
location / {
|
||||
root /var/www/ipng.ch/;
|
||||
}
|
||||
|
||||
location /media {
|
||||
proxy_set_header Host $http_host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
|
||||
proxy_connect_timeout 300;
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Connection "";
|
||||
chunked_transfer_encoding off;
|
||||
|
||||
rewrite (.*)/$ $1/index.html;
|
||||
|
||||
proxy_pass http://minio_ipng/ipng-web-assets/ipng.ch/media;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
I want to make note of a few things:
|
||||
1. The `upstream` definition here uses IPng Site Local entrypoints, considering the NGINX servers
|
||||
all have direct MTU=9000 access to the MinIO instances. I'll put both in there, in a
|
||||
round-robin configuration favoring the replica with _least connections_.
|
||||
1. Deeplinking to directory names without the trailing `/index.html` would serve a 404 from the
   backend, so I'll intercept these and rewrite directory requests to always include `/index.html`.
|
||||
1. The upstream endpoint used is _path-based_, that is to say it has the bucket name and website name
|
||||
included. This whole location used to be simply `root /var/www/ipng-web-assets/ipng.ch/media/`
|
||||
so the mental change is quite small.
|
||||
|
||||
### NGINX: Caching
|
||||
|
||||
|
||||
After deploying the S3 upstream on all IPng websites, I can delete the old
|
||||
`/var/www/ipng-web-assets/` directory and reclaim about 7GB of diskspace. This gives me an idea ...
|
||||
|
||||
{{< image width="8em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
On the one hand it's great that I will pull these assets from Minio and all, but at the same time,
|
||||
it's a tad inefficient to retrieve them from, say, Zurich to Amsterdam just to serve them onto the
|
||||
internet again. If at any time something on the IPng website goes viral, it'd be nice to be able to
|
||||
serve them directly from the edge, right?
|
||||
|
||||
A webcache. What could _possibly_ go wrong :)
|
||||
|
||||
NGINX is really really good at caching content. It has a powerful engine to store, scan, revalidate
|
||||
and match any content and upstream headers. It's also very well documented, so I take a look at the
|
||||
proxy module's documentation [[here](https://nginx.org/en/docs/http/ngx_http_proxy_module.html)] and
|
||||
in particular a useful [[blog](https://blog.nginx.org/blog/nginx-caching-guide)] on their website.
|
||||
|
||||
The first thing I need to do is create what is called a _key zone_, which is a region of memory in
|
||||
which URL keys are stored with some metadata. Having a copy of the keys in memory enables NGINX to
|
||||
quickly determine if a request is a HIT or a MISS without having to go to disk, greatly speeding up
|
||||
the check.
|
||||
|
||||
In `/etc/nginx/conf.d/ipng-cache.conf` I add the following NGINX cache:
|
||||
|
||||
```
|
||||
proxy_cache_path /var/www/nginx-cache levels=1:2 keys_zone=ipng_cache:10m max_size=8g
|
||||
inactive=24h use_temp_path=off;
|
||||
```
|
||||
|
||||
With this statement, I'll create a 2-level subdirectory, and allocate 10MB of space, which should
|
||||
hold on the order of 100K entries. The maximum size I'll allow the cache to grow to is 8GB, and I'll
|
||||
mark any object inactive if it's not been referenced for 24 hours. I learn that inactive is
|
||||
different to expired content. If a cache element has expired, but NGINX can't reach the upstream
|
||||
for a new copy, it can be configured to serve an inactive (stale) copy from the cache. That's dope,
|
||||
as it serves as an extra layer of defence in case the network or all available S3 replicas take the
|
||||
day off. I'll ask NGINX to avoid writing objects first to a tmp directory and then moving them into
|
||||
the `/var/www/nginx-cache` directory. These are recommendations I grab from the manual.
|
||||
|
||||
Within the `location` block I configured above, I'm now ready to enable this cache. I'll do that by
|
||||
adding a few include files, which I'll reference in all sites that I want to have make use of this
|
||||
cache:
|
||||
|
||||
First, to enable the cache, I write the following snippet:
|
||||
```
|
||||
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-cache.inc
|
||||
proxy_cache ipng_cache;
|
||||
proxy_ignore_headers Cache-Control;
|
||||
proxy_cache_valid any 1h;
|
||||
proxy_cache_revalidate on;
|
||||
proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
|
||||
proxy_cache_background_update on;
|
||||
```
|
||||
|
||||
Then, I find it useful to emit a few debugging HTTP headers, and at the same time I see that Minio
|
||||
emits a bunch of HTTP headers that may not be safe for me to propagate, so I pen two more snippets:
|
||||
|
||||
```
|
||||
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-strip-minio-headers.inc
|
||||
proxy_hide_header x-minio-deployment-id;
|
||||
proxy_hide_header x-amz-request-id;
|
||||
proxy_hide_header x-amz-id-2;
|
||||
proxy_hide_header x-amz-replication-status;
|
||||
proxy_hide_header x-amz-version-id;
|
||||
|
||||
pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-add-upstream-headers.inc
|
||||
add_header X-IPng-Frontend $hostname always;
|
||||
add_header X-IPng-Upstream $upstream_addr always;
|
||||
add_header X-IPng-Upstream-Status $upstream_status always;
|
||||
add_header X-IPng-Cache-Status $upstream_cache_status;
|
||||
```
|
||||
|
||||
With that, I am ready to enable caching of the IPng `/media` location:
|
||||
|
||||
```
|
||||
location /media {
|
||||
...
|
||||
include /etc/nginx/conf.d/ipng-strip-minio-headers.inc;
|
||||
include /etc/nginx/conf.d/ipng-add-upstream-headers.inc;
|
||||
include /etc/nginx/conf.d/ipng-cache.inc;
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
I run the Ansible playbook for the NGINX cluster and take a look at the replica at Coloclue in
|
||||
Amsterdam, called `nginx0.nlams1.ipng.ch`. Notably, it'll have to retrieve the file from a MinIO
|
||||
replica in Zurich (12ms away), so it's expected to take a little while.
|
||||
|
||||
The first attempt:
|
||||
|
||||
```
|
||||
pim@nginx0-nlams1:~$ curl -v -o /dev/null --connect-to ipng.ch:443:localhost:443 \
|
||||
https://ipng.ch/media/vpp-proto/vpp-proto-bookworm.qcow2.lrz
|
||||
...
|
||||
< last-modified: Sun, 01 Jun 2025 12:37:52 GMT
|
||||
< x-ipng-frontend: nginx0-nlams1
|
||||
< x-ipng-cache-status: MISS
|
||||
< x-ipng-upstream: [2001:678:d78:503::b]:9000
|
||||
< x-ipng-upstream-status: 200
|
||||
|
||||
100 711M 100 711M 0 0 26.2M 0 0:00:27 0:00:27 --:--:-- 26.6M
|
||||
```
|
||||
|
||||
|
||||
OK, that's respectable, I've read the file at 26MB/s. Of course I just turned on the cache, so the
|
||||
NGINX fetches the file from Zurich while handing it over to my `curl` here. It notifies me by means
|
||||
of an HTTP header that the cache was a `MISS`, and then which upstream server it contacted to
|
||||
retrieve the object.
|
||||
|
||||
But look at what happens the _second_ time I run the same command:
|
||||
|
||||
```
|
||||
pim@nginx0-nlams1:~$ curl -v -o /dev/null --connect-to ipng.ch:443:localhost:443 \
|
||||
https://ipng.ch/media/vpp-proto/vpp-proto-bookworm.qcow2.lrz
|
||||
< last-modified: Sun, 01 Jun 2025 12:37:52 GMT
|
||||
< x-ipng-frontend: nginx0-nlams1
|
||||
< x-ipng-cache-status: HIT
|
||||
|
||||
100 711M 100 711M 0 0 436M 0 0:00:01 0:00:01 --:--:-- 437M
|
||||
```
|
||||
|
||||
|
||||
Holy moly! First I see the object has the same _Last-Modified_ header, but I now also see that the
|
||||
_Cache-Status_ was a `HIT`, and there is no mention of any upstream server. I do however see the
|
||||
file come in at a whopping 437MB/s which is 16x faster than over the network!! Nice work, NGINX!
|
||||
|
||||
{{< image float="right" src="/assets/minio/rack-2.png" alt="Rack-o-Minio" width="12em" >}}
|
||||
|
||||
# What's Next
|
||||
|
||||
I'm going to deploy the third MinIO replica in Rümlang once the disks arrive. I'll release the
|
||||
~4TB of disk currently used for the fleet's Restic backups, and put that ZFS capacity to other use.
|
||||
Now, creating services like PeerTube, Mastodon, Pixelfed, Loops, NextCloud and what-have-you will
|
||||
become much easier for me. And with the per-bucket replication between MinIO deployments, I also
|
||||
think this is a great way to auto-backup important data. First off, it'll be RS8.4 on the MinIO node
|
||||
itself, and secondly, user data will be copied automatically to a neighboring facility.
|
||||
|
||||
I've convinced myself that S3 storage is a great service to operate, and that MinIO is awesome.
|
||||
375
content/articles/2025-07-12-vpp-evpn-1.md
Normal file
@@ -0,0 +1,375 @@
|
||||
---
|
||||
date: "2025-07-12T08:07:23Z"
|
||||
title: 'VPP and eVPN/VxLAN - Part 1'
|
||||
---
|
||||
|
||||
{{< image width="6em" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I'm
|
||||
the very last person on the planet to learn about something cool. My latest "A-Ha!"-moment was when I was
|
||||
configuring the eVPN fabric for [[Frys-IX](https://frys-ix.net/)], and I wrote up an article about
|
||||
it [[here]({{< ref 2025-04-09-frysix-evpn >}})] back in April.
|
||||
|
||||
I can build the equivalent of Virtual Private Wire Services (VPWS), also called L2VPN or Virtual Leased
|
||||
Lines, and these are straightforward because they typically only have two endpoints. A "regular"
|
||||
VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a
|
||||
look at an article on [[L2 Gymnastics]({{< ref 2022-01-12-vpp-l2 >}})] for that. But the real kicker
|
||||
is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS), also
|
||||
called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And *that* is a whole other
|
||||
level of awesome.
|
||||
|
||||
## Recap: VPP today
|
||||
|
||||
### VPP: VxLAN
|
||||
|
||||
The current VPP VxLAN tunnel plugin does point to point tunnels; that is, they are configured with a
|
||||
source address, destination address, destination port and VNI. As I mentioned, a point to point
|
||||
ethernet transport is configured very easily:
|
||||
|
||||
```
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298 instance 0
|
||||
vpp0# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/0
|
||||
vpp0# set int l2 xconnect HundredGigabitEthernet10/0/0 vxlan_tunnel0
|
||||
vpp0# set int state vxlan_tunnel0 up
|
||||
vpp0# set int state HundredGigabitEthernet10/0/0 up
|
||||
|
||||
vpp1# create vxlan tunnel src 192.0.2.254 dst 192.0.2.1 vni 8298 instance 0
|
||||
vpp1# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/1
|
||||
vpp1# set int l2 xconnect HundredGigabitEthernet10/0/1 vxlan_tunnel0
|
||||
vpp1# set int state vxlan_tunnel0 up
|
||||
vpp1# set int state HundredGigabitEthernet10/0/1 up
|
||||
```
|
||||
|
||||
And with that, `vpp0:Hu10/0/0` is cross connected with `vpp1:Hu10/0/1` and ethernet flows between
|
||||
the two.
|
||||
|
||||
### VPP: Bridge Domains
|
||||
|
||||
Now consider a VPLS with five different routers. It's possible to create a bridge-domain and add
|
||||
some local ports and four other VxLAN tunnels:
|
||||
|
||||
```
|
||||
vpp0# create bridge-domain 8298
|
||||
vpp0# set int l2 bridge HundredGigabitEthernet10/0/1 8298
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.2 vni 8298 instance 0
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.3 vni 8298 instance 1
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.4 vni 8298 instance 2
|
||||
vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.5 vni 8298 instance 3
|
||||
vpp0# set int l2 bridge vxlan_tunnel0 8298
|
||||
vpp0# set int l2 bridge vxlan_tunnel1 8298
|
||||
vpp0# set int l2 bridge vxlan_tunnel2 8298
|
||||
vpp0# set int l2 bridge vxlan_tunnel3 8298
|
||||
```
|
||||
|
||||
To make this work, I will have to replicate this configuration to all other `vpp1`-`vpp4` routers.
|
||||
While it does work, it's really not very practical. When other VPP instances get added to a VPLS,
|
||||
every other router will have to have a new VxLAN tunnel created and added to its local bridge
|
||||
domain. Consider 1000s of VPLS instances on 100s of routers: that would yield ~100'000 VxLAN tunnels
|
||||
on every router, yikes!
|
||||
|
||||
Such a configuration reminds me in a way of iBGP in a large network: the naive approach is to have a
|
||||
full mesh of all routers speaking to all other routers, but that quickly becomes a maintenance
|
||||
headache. The canonical solution for this is to create iBGP _Route Reflectors_ to which every router
|
||||
connects, and their job is to redistribute routing information between the fleet of routers. This
|
||||
turns the iBGP problem from an O(N^2) to an O(N) problem: all 1'000 routers connect to, say, three
|
||||
regional route reflectors for a total of 3'000 BGP connections, which is much better than ~1'000'000
|
||||
BGP connections in the naive approach.
|
||||
|
||||
## Recap: eVPN Moving parts
|
||||
|
||||
The reason why I got so enthusiastic when I was playing with Arista and Nokia's eVPN stuff, is
|
||||
because it requires very little dataplane configuration, and a relatively intuitive controlplane
|
||||
configuration:
|
||||
|
||||
1. **Dataplane**: For each L2 broadcast domain (be it a L2XC or a Bridge Domain), really all I
|
||||
need is a single VxLAN interface with a given VNI, which should be able to send encapsulated
|
||||
ethernet frames to one or more other speakers in the same domain.
|
||||
1. **Controlplane**: I will need to learn MAC addresses locally, and inform some BGP eVPN
|
||||
implementation of who-lives-where. Other VxLAN speakers learn of the MAC addresses I own, and
|
||||
will send me encapsulated ethernet for those addresses.
|
||||
1. **Dataplane**: For unknown layer2 destinations, like _Broadcast_, _Unknown Unicast_, and
|
||||
_Multicast_ (BUM) traffic, I will want to keep track of which other VxLAN speakers these
|
||||
packets should be flooded to. I note that this is not that different from flooding the packets
|
||||
to local interfaces, except here it'd be flooding them to remote VxLAN endpoints.
|
||||
1. **ControlPlane**: Flooding L2 traffic across wide area networks is typically considered icky,
|
||||
so a few tricks might be optionally deployed. Since the controlplane already knows which MAC
|
||||
lives where, it may as well also make note of any local IPv4 ARP and IPv6 neighbor discovery
|
||||
replies and teach its peers which IPv4/IPv6 addresses live where: a distributed neighbor table.
|
||||
|
||||
{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
For the controlplane parts, [[FRRouting](https://frrouting.org/)] has a working implementation for
|
||||
L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [[Bird](https://bird.nic.cz/)], is slowly catching up, and
|
||||
has a few of these controlplane parts already working (mostly MAC-VRF). Commercial vendors like Arista,
|
||||
Nokia, Juniper, Cisco are ready to go. If we want VPP to inter-operate, we may need to make a few
|
||||
changes.
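
To make the controlplane half a bit more tangible: the BGP side of eVPN is pleasantly small. A
minimal FRRouting sketch, assuming iBGP in AS 65000 towards a route reflector at 192.0.2.254 (note
that FRR's `advertise-all-vni` today learns its VNIs from the Linux kernel, which is exactly the
integration gap a VPP dataplane would need to fill):

```
router bgp 65000
 neighbor 192.0.2.254 remote-as 65000
 !
 address-family l2vpn evpn
  neighbor 192.0.2.254 activate
  advertise-all-vni
 exit-address-family
```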
|
||||
|
||||
## VPP: Changes needed
|
||||
|
||||
### Dynamic VxLAN
|
||||
|
||||
I propose two changes to the VxLAN plugin, or perhaps a new plugin that changes the behavior, so that
|
||||
we don't have to break any performance or functional promises to existing users. This new VxLAN
|
||||
interface behavior changes in the following ways:
|
||||
|
||||
1. Each VxLAN interface has a local L2FIB attached to it, the keys are MAC address and the
|
||||
values are remote VTEPs. In its simplest form, the values would be just IPv4 or IPv6 addresses,
|
||||
because I can re-use the VNI and port information from the tunnel definition itself.
|
||||
|
||||
1. Each VxLAN interface has a local flood-list attached to it. This list contains remote VTEPs
|
||||
that I am supposed to send 'flood' packets to. Similar to the Bridge Domain, when packets are marked
|
||||
for flooding, I will need to prepare and replicate them, sending them to each VTEP.
|
||||
|
||||
|
||||
A set of APIs will be needed to manipulate these:
|
||||
* ***Interface***: I will need to have an interface create, delete and list call, which will
|
||||
be able to maintain the interfaces, their metadata like source address, source/destination port,
|
||||
VNI and such.
|
||||
* ***L2FIB***: I will need to add, replace, delete, and list which MAC addresses go where.
|
||||
With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the
|
||||
dst_addr can be written into the packet.
|
||||
* ***Flooding***: For those packets that are not unicast (BUM), I will need to be able to add,
|
||||
remove and list which VTEPs should receive this packet.
|
||||
|
||||
It would be pretty dope if the configuration looked something like this:
|
||||
```
|
||||
vpp# create evpn-vxlan src <v46address> dst-port <port> vni <vni> instance <id>
|
||||
vpp# evpn-vxlan l2fib <iface> mac <mac> dst <v46address> [del]
|
||||
vpp# evpn-vxlan flood <iface> dst <v46address> [del]
|
||||
```
|
||||
|
||||
The VxLAN underlay transport can be either IPv4 or IPv6. Of course manipulating L2FIB or Flood
|
||||
destinations must match the address family of an interface of type evpn-vxlan. A practical example
|
||||
might be:
|
||||
|
||||
```
|
||||
vpp# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 6
|
||||
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2
|
||||
vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3
|
||||
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::2
|
||||
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::3
|
||||
vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::4
|
||||
```
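
For the API side, here is a purely hypothetical sketch of what the L2FIB and flood-list messages
could look like in VPP's `.api` IDL. Nothing like this exists upstream today; the message names and
fields are mine, and the real thing would import the usual interface and IP type definitions:

```
/* Hypothetical sketch only, not an existing API */
autoreply define evpn_vxlan_l2fib_add_del
{
  u32 client_index;
  u32 context;
  bool is_add;
  vl_api_interface_index_t sw_if_index;
  vl_api_mac_address_t mac;
  vl_api_address_t dst;
};

autoreply define evpn_vxlan_flood_add_del
{
  u32 client_index;
  u32 context;
  bool is_add;
  vl_api_interface_index_t sw_if_index;
  vl_api_address_t dst;
};
```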
|
||||
|
||||
By the way, while this _could_ be a new plugin, it could also just be added to the existing VxLAN
|
||||
plugin. One way in which I might do this when creating a normal vxlan tunnel is to allow for its
|
||||
destination address to be either 0.0.0.0 for IPv4 or :: for IPv6. That would signal 'dynamic'
|
||||
tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN packet by
|
||||
the time it takes to call `ip46_address_is_zero()`, which is only a handful of clocks.
|
||||
|
||||
### Bridge Domain
|
||||
|
||||
{{< image width="6em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
It's important to understand that L2 learning is **required** for eVPN to function. Each router
|
||||
needs to be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This
|
||||
rules out the simple case of L2XC because there, no learning is performed. The corollary is that a
|
||||
bridge-domain is required for any form of eVPN.
|
||||
|
||||
The L2 code in VPP already does most of what I'd need. It maintains an L2FIB in `vnet/l2/l2_fib.c`,
|
||||
which is keyed by bridge-id and MAC address, and its values are a 64 bit structure that points
|
||||
essentially to a `sw_if_index` output interface. The L2FIB of the eVPN needs a bit more information
|
||||
though, notably an `ip46address` struct to know which VTEP to send to. It's tempting to add this
|
||||
extra data to the bridge domain code. I would recommend against it, because other implementations,
|
||||
for example MPLS, GENEVE or Carrier Pigeon IP may need more than just the destination address. Even
|
||||
the VxLAN implementation I'm thinking about might want to be able to override other things like the
|
||||
destination port for a given VTEP, or even the VNI. Putting all of this stuff in the bridge-domain
|
||||
code will just clutter it, for all users, not just those users who might want eVPN.
|
||||
|
||||
Similarly, one might argue it is tempting to re-use/extend the behavior in `vnet/l2/l2_flood.c`,
|
||||
because if it's already replicating BUM traffic, why not replicate it many times over the flood list
|
||||
for any member interface that happens to be a dynamic VxLAN interface? This would be a bad idea
|
||||
for a few reasons. Firstly, it is not guaranteed that the VxLAN plugin is loaded, and in
|
||||
doing this, I would leak internal details of VxLAN into the bridge-domain code. Secondly, the
|
||||
`l2_flood.c` code would potentially get messy if other types were added (like the MPLS and GENEVE
|
||||
above).
|
||||
|
||||
A reasonable request is to mark such BUM frames once in the existing L2 code and when handing the
|
||||
replicated packet into the VxLAN node, to see the `is_bum` marker and once again replicate -- in the
|
||||
vxlan plugin -- these packets to the VTEPs in our local flood-list. Although a bit more work, this
|
||||
approach only requires a tiny change in the `l2_flood.c` code (the marking), and will keep
|
||||
all the logic tucked away where it is relevant, derisking the VPP vnet codebase.
|
||||
|
||||
Fundamentally, I think the cleanest design is to keep the dynamic VxLAN interface fully
|
||||
self-contained, and it would therefore maintain its own L2FIB and Flooding logic. The only thing I
|
||||
would add to the L2 codebase is some form of BUM marker to allow for efficient flooding.
|
||||
|
||||
### Control Plane
|
||||
|
||||
There are a few things the control plane has to do. Some external agent, like FRR or Bird, will be
|
||||
receiving a few types of eVPN messages. The ones I'm interested in are:
|
||||
|
||||
* ***Type 2***: MAC/IP Advertisement Route
|
||||
- On the way in, these should be fed to the VxLAN L2FIB belonging to the bridge-domain.
|
||||
- On the way out, learned addresses should be advertised to peers.
|
||||
- Regarding IPv4/IPv6 addresses, that is the ARP / ND tables: we can talk about those later.
|
||||
* ***Type 3***: Inclusive Multicast Ethernet Tag Route
|
||||
- On the way in, these will populate the VxLAN Flood list belonging to the bridge-domain
|
||||
- On the way out, each bridge-domain should advertise itself as IMET to peers.
|
||||
* ***Type 5***: IP Prefix Route
|
||||
- Similar to IP information in Type 2, we can talk about those later once L3VPN/eVPN is
|
||||
needed.
|
||||
|
||||
The 'on the way in' stuff can be easily done with my proposed APIs in the Dynamic VxLAN (or a new
|
||||
eVPN VxLAN) plugin. Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is
|
||||
concerned. It's just that the controlplane implementation needs to somehow _feed_ the API, so an
|
||||
external program may be needed, or alternatively the Linux Control Plane netlink plugin might be used
|
||||
to consume this information.
|
||||
|
||||
The 'on the way out' stuff is a bit trickier. I will need to listen to creation of new broadcast
|
||||
domains and associate them with the right IMET announcements, and for each MAC address learned, pick
|
||||
them up and advertise them into eVPN. Later, if ever ARP and ND proxying becomes important, I'll
|
||||
have to revisit the bridge-domain feature to do IPv4 ARP and IPv6 Neighbor Discovery, and replace it
|
||||
with some code that populates the IPv4/IPv6 parts of the Type 2 messages on the way out, and
|
||||
similarly on the way in, populates an L3 neighbor cache for the bridge domain, so ARP and ND replies
|
||||
can be synthesized based on what we've learned in eVPN.
|
||||
|
||||
# Demonstration
|
||||
|
||||
### VPP: Current VxLAN
|
||||
|
||||
I'll build a small demo environment on Summer to show how the interaction of VxLAN and Bridge
|
||||
Domain works today:
|
||||
|
||||
```
|
||||
vpp# create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24
|
||||
vpp# set int state tap0 up
|
||||
vpp# set int ip address tap0 192.0.2.1/24
|
||||
vpp# set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static
|
||||
vpp# set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static
|
||||
vpp# set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static
|
||||
|
||||
vpp# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298
|
||||
vpp# set int state vxlan_tunnel0 up
|
||||
|
||||
vpp# create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82
|
||||
vpp# set int state tap1 up
|
||||
|
||||
vpp# create bridge-domain 8298
|
||||
vpp# set int l2 bridge tap1 8298
|
||||
vpp# set int l2 bridge vxlan_tunnel0 8298
|
||||
```
|
||||
|
||||
I've created a tap device called `dummy0` and given it an IPv4 address. Normally, I would use some
|
||||
DPDK or RDMA interface like `TenGigabitEthernet10/0/0`. Then I'll populate some static ARP entries.
|
||||
Again, normally this would just be 'use normal routing'. However, for the purposes of this
|
||||
demonstration, it helps to use a TAP device, as any packets I make VPP send to 192.0.2.254 and
|
||||
so on can be captured with `tcpdump` in Linux, in addition to `trace add` in VPP.
|
||||
|
||||
Then, I create a VxLAN tunnel with a default destination of 192.0.2.254 and the given VNI.
|
||||
Next, I create a TAP interface called `vpptap0` with the given MAC address.
|
||||
Finally, I bind these two interfaces together in a bridge-domain.
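
As an aside, besides `tcpdump` on the Linux side, VPP's own packet tracer shows the same packets as
they traverse the graph. Since TAP interfaces are serviced by the `virtio-input` node, something
like this should capture the frames I am about to send:

```
vpp# trace add virtio-input 10
vpp# show trace
vpp# clear trace
```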
|
||||
|
||||
I proceed to write a small ScaPY program:
|
||||
|
||||
```python
|
||||
#!/usr/bin/env python3
|
||||
|
||||
from scapy.all import Ether, IP, UDP, Raw, sendp
|
||||
|
||||
pkt = Ether(dst="01:02:03:04:05:02", src="02:fe:64:dc:1b:82", type=0x0800)
|
||||
/ IP(src="192.168.1.1", dst="192.168.1.2")
|
||||
/ UDP(sport=8298, dport=7) / Raw(load=b"ping")
|
||||
print(pkt)
|
||||
sendp(pkt, iface="vpptap0")
|
||||
|
||||
pkt = Ether(dst="01:02:03:04:05:03", src="02:fe:64:dc:1b:82", type=0x0800)
|
||||
/ IP(src="192.168.1.1", dst="192.168.1.3")
|
||||
/ UDP(sport=8298, dport=7) / Raw(load=b"ping")
|
||||
print(pkt)
|
||||
sendp(pkt, iface="vpptap0")
|
||||
```
|
||||
|
||||
What will happen is, the ScaPY program will emit these frames into device `vpptap0` which is in
|
||||
bridge-domain 8298. The bridge will learn our src MAC `02:fe:64:dc:1b:82`, and look up the dst MAC
|
||||
`01:02:03:04:05:02`, and because there hasn't been traffic yet, it'll flood to all member ports, one
|
||||
of which is the VxLAN tunnel. VxLAN will then encapsulate the packets to the other side of the
|
||||
tunnel.
|
||||
|
||||
```
|
||||
pim@summer:~$ sudo ./vxlan-test.py
|
||||
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.2:echo / Raw
|
||||
Ether / IP / UDP 192.168.1.1:8298 > 192.168.1.3:echo / Raw
|
||||
|
||||
pim@summer:~$ sudo tcpdump -evni dummy0
|
||||
10:50:35.310620 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
|
||||
192.0.2.1.6345 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
|
||||
10:50:35.362552 02:fe:72:52:38:53 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82)
|
||||
192.0.2.1.23916 > 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
|
||||
```
|
||||
|
||||
I want to point out that nothing, so far, is special. All of this works with upstream VPP just fine.
|
||||
I can see two VxLAN encapsulated packets, both destined to `192.0.2.254:4789`. Cool.
|
||||
|
||||
### Dynamic VPP VxLAN
|
||||
|
||||
I wrote a prototype for a Dynamic VxLAN tunnel in [[43433](https://gerrit.fd.io/r/c/vpp/+/43433)].
|
||||
The good news is, this works. The bad news is, I think I'll want to discuss my proposal (this
|
||||
article) with the community before going further down a potential rabbit hole.
|
||||
|
||||
With my gerrit patched in, I can do the following:
|
||||
|
||||
```
|
||||
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:02 dst 192.0.2.2
|
||||
Added VXLAN dynamic destination for 01:02:03:04:05:02 on vxlan_tunnel0 dst 192.0.2.2
|
||||
vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:03 dst 192.0.2.3
|
||||
Added VXLAN dynamic destination for 01:02:03:04:05:03 on vxlan_tunnel0 dst 192.0.2.3
|
||||
|
||||
vpp# show vxlan l2fib
|
||||
VXLAN Dynamic L2FIB entries:
|
||||
MAC Interface Destination Port VNI
|
||||
01:02:03:04:05:02 vxlan_tunnel0 192.0.2.2 4789 8298
|
||||
01:02:03:04:05:03 vxlan_tunnel0 192.0.2.3 4789 8298
|
||||
Dynamic L2FIB entries: 2
|
||||
```
|
||||
|
||||
I've instructed the VxLAN tunnel to change the tunnel destination based on the destination MAC.
|
||||
|
||||
|
||||
I run the script and tcpdump again:
|
||||
|
||||
```
|
||||
pim@summer:~$ sudo tcpdump -evni dummy0
|
||||
11:16:53.834619 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3945 (->3997)!)
|
||||
192.0.2.1.6345 > 192.0.2.2.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.2.7: UDP, length 4
|
||||
11:16:53.882554 02:fe:fe:ae:0d:a3 > 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96:
|
||||
(tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3944 (->3996)!)
|
||||
192.0.2.1.23916 > 192.0.2.3.4789: VXLAN, flags [I] (0x08), vni 8298
|
||||
02:fe:64:dc:1b:82 > 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46:
|
||||
(tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32)
|
||||
192.168.1.1.8298 > 192.168.1.3.7: UDP, length 4
|
||||
```
|
||||
|
||||
Two important notes. Firstly, this works! For the MAC address ending in `:02`, the packet is sent to
|
||||
`192.0.2.2` instead of the default of `192.0.2.254`. Same for the `:03` MAC which now goes to
|
||||
`192.0.2.3`. Nice! But secondly, the IPv4 header of the VxLAN packets was changed, so there needs to
|
||||
be a call to `ip4_header_checksum()` inserted somewhere. That's an easy fix.
|
||||
|
||||
# What's next
|
||||
|
||||
I want to discuss a few things, perhaps at an upcoming VPP Community meeting. Notably:
|
||||
1. Is the VPP Developer community supportive of adding eVPN support? Does anybody want to help
|
||||
write it with me?
|
||||
1. Is changing the existing VxLAN plugin appropriate, or should I make a new plugin which adds
|
||||
dynamic endpoints, L2FIB and Flood lists for BUM traffic?
|
||||
1. Is it acceptable for me to add a BUM marker in `l2_flood.c` so that I can reuse all the logic
|
||||
from bridge-domain flooding as I extend to also do VTEP flooding?
|
||||
1. (perhaps later) VxLAN is the canonical underlay, but is there an appetite to extend also to,
|
||||
say, GENEVE or MPLS?
|
||||
1. (perhaps later) What's a good way to tie in a controlplane like FRRouting or Bird2 into the
|
||||
dataplane (perhaps using a sidecar controller, or perhaps using Linux CP Netlink messages)?
|
||||
|
||||
701
content/articles/2025-07-26-ctlog-1.md
Normal file
@@ -0,0 +1,701 @@
|
||||
---
|
||||
date: "2025-07-26T22:07:23Z"
|
||||
title: 'Certificate Transparency - Part 1 - TesseraCT'
|
||||
aliases:
|
||||
- /s/articles/2025/07/26/certificate-transparency-part-1/
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)]. As the
|
||||
name suggests, it was a form of _digital notary_, and they were in the business of issuing security
|
||||
certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and
|
||||
subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for
|
||||
man-in-the-middle attacks on Iranian Gmail users. Not cool.
|
||||
|
||||
Google launched a project called **Certificate Transparency**, because it was becoming increasingly clear
|
||||
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
|
||||
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
|
||||
the Web Public Key Infrastructure. It led to the creation of this ambitious
|
||||
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
|
||||
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
|
||||
and _TLS_ (Transport Layer Security).
|
||||
|
||||
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
|
||||
describes an experimental protocol for publicly logging the existence of Transport Layer Security
|
||||
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
|
||||
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
|
||||
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
|
||||
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
|
||||
the logs.
|
||||
|
||||
This series explores and documents how IPng Networks will be running two Static CT _Logs_ with two
|
||||
different implementations. One will be [[Sunlight](https://sunlight.dev/)], and the other will be
|
||||
[[TesseraCT](https://github.com/transparency-dev/tesseract)].
|
||||
|
||||
## Static Certificate Transparency
|
||||
|
||||
In this context, _Logs_ are network services that implement the protocol operations for submissions
|
||||
and queries that are defined in a specification that builds on the previous RFC. A few years ago,
|
||||
my buddy Antonis asked me if I would be willing to run a log, but operationally they were very
|
||||
complex and expensive to run. However, over the years, the concept of _Static Logs_ has put running one
|
||||
in reach. This [[Static CT API](https://github.com/C2SP/C2SP/blob/main/static-ct-api.md)] defines a
|
||||
read-path HTTP static asset hierarchy (for monitoring) to be implemented alongside the write-path
|
||||
RFC 6962 endpoints (for submission).
|
||||
|
||||
Aside from the different read endpoints, a log that implements the Static API is a regular CT log
|
||||
that can work alongside RFC 6962 logs and that fulfills the same purpose. In particular, it requires
|
||||
no modification to submitters and TLS clients.
|
||||
|
||||
If you only read one document about Static CT, read Filippo Valsorda's excellent
|
||||
[[paper](https://filippo.io/a-different-CT-log)]. It describes a radically cheaper and easier to
|
||||
operate [[Certificate Transparency](https://certificate.transparency.dev/)] log that is backed by a
|
||||
consistent object storage, and can scale to 30x the current issuance rate for 2-10% of the costs
|
||||
with no merge delay.
|
||||
|
||||
## Scalable, Cheap, Reliable: choose two
|
||||
|
||||
{{< image width="18em" float="right" src="/assets/ctlog/MPLS Backbone - CTLog.svg" alt="ctlog at ipng" >}}
|
||||
|
||||
In the diagram, I've drawn an overview of IPng's network. In {{< boldcolor color="red" >}}red{{<
|
||||
/boldcolor >}}, a European backbone network is provided by a [[BGP Free Core
|
||||
network]({{< ref 2022-12-09-oem-switch-2 >}})]. It operates a private IPv4, IPv6, and MPLS network, called
|
||||
_IPng Site Local_, which is not connected to the internet. On top of that, IPng offers L2 and L3
|
||||
services, for example using [[VPP]({{< ref 2021-02-27-network >}})].
|
||||
|
||||
In {{< boldcolor color="lightgreen" >}}green{{< /boldcolor >}} I built a cluster of replicated
|
||||
NGINX frontends. They connect into _IPng Site Local_ and can reach all hypervisors, VMs, and storage
|
||||
systems. They also connect to the Internet with a single IPv4 and IPv6 address. One might say that
|
||||
SSL is _added and removed here :-)_ [[ref](/assets/ctlog/nsa_slide.jpg)].
|
||||
|
||||
Then in {{< boldcolor color="orange" >}}orange{{< /boldcolor >}} I built a set of [[MinIO]({{< ref
|
||||
2025-05-28-minio-1 >}})] S3 storage pools. Amongst others, I serve the static content from the IPng
|
||||
website from these pools, providing fancy redundancy and caching. I wrote about its design in [[this
|
||||
article]({{< ref 2025-06-01-minio-2 >}})].
|
||||
|
||||
Finally, I turn my attention to the {{< boldcolor color="blue" >}}blue{{< /boldcolor >}} which is
|
||||
two hypervisors, one run by [[IPng](https://ipng.ch/)] and the other by [[Massar](https://massars.net/)]. Each
|
||||
of them will be running one of the _Log_ implementations. IPng provides two large ZFS storage tanks
|
||||
for offsite backup, in case a hypervisor decides to check out, and daily backups to an S3 bucket
|
||||
using Restic.
|
||||
|
||||
Having explained all of this, I am well aware that end to end reliability will be coming from the
|
||||
fact that there are many independent _Log_ operators, and folks wanting to validate certificates can
|
||||
simply monitor many. If there is a gap in coverage, say due to any given _Log_'s downtime, this will
|
||||
not necessarily be problematic. It does mean that I may have to suppress the SRE in me...
|
||||
|
||||
## MinIO
|
||||
|
||||
My first instinct is to leverage the distributed storage IPng has, but as I'll show in the rest of
|
||||
this article, maybe a simpler, more elegant design could be superior, precisely because individual
|
||||
log reliability is not _as important_ as having many available log _instances_ to choose from.
|
||||
|
||||
From operators in the field I understand that the world-wide generation of certificates is roughly
|
||||
17M/day, which amounts to some 200-250qps of writes. Antonis explains that certs with a validity
|
||||
of 180 days or less will need two CT log entries, while certs with a validity of more than 180d will
|
||||
need three CT log entries. So the write rate is roughly 2.2x that, as an upper bound.
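
Doing the napkin math on those numbers:

```
17'000'000 certs/day / 86'400 sec/day  ~ 197 writes/sec
197 writes/sec * 2.2 log entries/cert  ~ 430 log writes/sec, as an upper bound
```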
|
||||
|
||||
My first thought is to see how fast my open source S3 machines can go, really. I'm curious also as
|
||||
to the difference between SSD and spinning disks.
|
||||
|
||||
I boot two Dell R630s in the Lab. These machines have two Xeon E5-2640 v4 CPUs for a total of 20
|
||||
cores and 40 threads, and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I
|
||||
place 6pcs 1.2TB SAS3 disks (HPE part number EG1200JEHMC), and in the second machine I place 6pcs
|
||||
of 1.92TB enterprise storage (Samsung part number P1633N19).
|
||||
|
||||
I spin up a 6-device MinIO cluster on both and take them out for a spin using [[S3
|
||||
Benchmark](https://github.com/wasabi-tech/s3-benchmark.git)] from Wasabi Tech.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/s3-benchmark$ for dev in disk ssd; do \
|
||||
for t in 1 8 32; do \
|
||||
for z in 4M 1M 8k 4k; do \
|
||||
./s3-benchmark -a $KEY -s $SECRET -u http://minio-$dev:9000 -t $t -z $z \
|
||||
| tee -a minio-results.txt; \
|
||||
done; \
|
||||
done; \
|
||||
done
|
||||
```
|
||||
|
||||
The loadtest above does a bunch of runs with varying parameters. First it tries to read and write
|
||||
object sizes of 4MB, 1MB, 8kB and 4kB respectively. Then it tries to do this with either 1 thread, 8
|
||||
threads or 32 threads. Finally it tests both the disk-based variant as well as the SSD based one.
|
||||
The loadtest runs from a third machine, so that the Dell R630 disk tanks can stay completely
|
||||
dedicated to their task of running MinIO.
|
||||
|
||||
{{< image width="100%" src="/assets/ctlog/minio_8kb_performance.png" alt="MinIO 8kb disk vs SSD" >}}
|
||||
|
||||
The left-hand side graph feels pretty natural to me. With one thread, uploading 8kB objects will
|
||||
quickly hit the IOPS rate of the disks, each of which has to participate in the write due to EC:3
|
||||
encoding when using six disks, and it tops out at ~56 PUT/s. The single thread hitting SSDs will not
|
||||
hit that limit, and has ~371 PUT/s which I found a bit underwhelming. But, when performing the
|
||||
loadtest with either 8 or 32 write threads, the hard disks become only marginally faster (topping
|
||||
out at 240 PUT/s), while the SSDs really start to shine, with 3850 PUT/s. Pretty good performance.
|
||||
|
||||
On the read-side, I am pleasantly surprised that there's not really that much of a difference
|
||||
between disks and SSDs. This is likely because the host filesystem cache is playing a large role, so
|
||||
the 1-thread performance is equivalent (765 GET/s for disks, 677 GET/s for SSDs), and the 32-thread
|
||||
performance is also equivalent (at 7624 GET/s for disks with 7261 GET/s for SSDs). I do wonder why
|
||||
the hard disks consistently outperform the SSDs with all the other variables (OS, MinIO version,
|
||||
hardware) the same.
|
||||
|
||||
## Sidequest: SeaweedFS
|
||||
|
||||
Something that has long caught my attention is the way in which
|
||||
[[SeaweedFS](https://github.com/seaweedfs/seaweedfs)] approaches blob storage. Many operators have
|
||||
great success with many small file writes in SeaweedFS compared to MinIO and even AWS S3 storage.
|
||||
This is because writes with SeaweedFS are not broken into erasure-sets, which would require every disk
|
||||
to write a small part or checksum of the data, but rather files are replicated within the cluster in
|
||||
their entirety on different disks, racks or datacenters. I won't bore you with the details of
|
||||
SeaweedFS but I'll tack on a docker [[compose file](/assets/ctlog/seaweedfs.docker-compose.yml)]
|
||||
that I used at the end of this article, if you're curious.
|
||||
|
||||
{{< image width="100%" src="/assets/ctlog/size_comparison_8t.png" alt="MinIO vs SeaWeedFS" >}}
|
||||
|
||||
In the write-path, SeaweedFS dominates in all cases, due to its different way of achieving durable
|
||||
storage (per-file replication in SeaweedFS versus all-disk erasure-sets in MinIO):
|
||||
* 4k: 3,384 ops/sec vs MinIO's 111 ops/sec (30x faster!)
|
||||
* 8k: 3,332 ops/sec vs MinIO's 111 ops/sec (30x faster!)
|
||||
* 1M: 383 ops/sec vs MinIO's 44 ops/sec (9x faster)
|
||||
* 4M: 104 ops/sec vs MinIO's 32 ops/sec (4x faster)
|
||||
|
||||
For the read-path, in GET operations MinIO is better at small objects, and really dominates the
|
||||
large objects:
|
||||
* 4k: 7,411 ops/sec vs SeaweedFS 5,014 ops/sec
|
||||
* 8k: 7,666 ops/sec vs SeaweedFS 5,165 ops/sec
|
||||
* 1M: 5,466 ops/sec vs SeaweedFS 2,212 ops/sec
|
||||
* 4M: 3,084 ops/sec vs SeaweedFS 646 ops/sec
|
||||
|
||||
This makes me draw an interesting conclusion: seeing as CT Logs are read/write heavy (every couple
|
||||
of seconds, the Merkle tree is recomputed which is reasonably disk-intensive), SeaweedFS might be a
|
||||
slightly better choice. IPng Networks has three MinIO deployments, but no SeaweedFS deployments. Yet.
|
||||
|
||||
# Tessera
|
||||
|
||||
[[Tessera](https://github.com/transparency-dev/tessera.git)] is a Go library for building tile-based
|
||||
transparency logs (tlogs) [[ref](https://github.com/C2SP/C2SP/blob/main/tlog-tiles.md)]. It is the
|
||||
logical successor to the approach that Google took when building and operating _Logs_ using its
|
||||
predecessor called [[Trillian](https://github.com/google/trillian)]. The implementation and its APIs
|
||||
bake-in current best-practices based on the lessons learned over the past decade of building and
|
||||
operating transparency logs in production environments and at scale.
|
||||
|
||||
Tessera was introduced at the Transparency.Dev summit in October 2024. I first watched Al and Martin
|
||||
[[introduce](https://www.youtube.com/watch?v=9j_8FbQ9qSc)] it at last year's summit. At a high
|
||||
level, it wraps what used to be a whole kubernetes cluster full of components, into a single library
|
||||
that can be used with Cloud based services, such as AWS S3 with an RDS database, or GCP's GCS
|
||||
storage and Spanner database. However, Google also made it easy to use a regular POSIX filesystem
|
||||
implementation.
|
||||
|
||||
## TesseraCT
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="tesseract logo" >}}
|
||||
|
||||
While Tessera is a library, a CT log implementation comes from its sibling GitHub repository called
|
||||
[[TesseraCT](https://github.com/transparency-dev/tesseract)]. Because it leverages Tessera under the
|
||||
hood, TesseraCT can run on GCP, AWS, POSIX-compliant filesystems, or S3-compatible systems alongside a MySQL
|
||||
database. In order to provide ecosystem agility and to control the growth of CT Log sizes, new CT
|
||||
Logs must be temporally sharded, defining a certificate expiry range denoted in the form of two
|
||||
dates: `[rangeBegin, rangeEnd)`. The certificate expiry range allows a Log to reject otherwise valid
|
||||
logging submissions for certificates that expire before or after this defined range, thus
|
||||
partitioning the set of publicly-trusted certificates that each Log will accept. I will be expected
|
||||
to keep logs for an extended period of time, say 3-5 years.
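
As a purely hypothetical illustration of such temporal sharding (these shard names and dates are
made up for the example, not my actual logs):

```
ctlog.example.ipng.ch/2026h1   accepts certificates expiring in [2026-01-01, 2026-07-01)
ctlog.example.ipng.ch/2026h2   accepts certificates expiring in [2026-07-01, 2027-01-01)
```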
|
||||
|
||||
It's time for me to figure out what this TesseraCT thing can do .. are you ready? Let's go!
|
||||
|
||||
### TesseraCT: S3 and SQL
|
||||
|
||||
TesseraCT comes with a few so-called _personalities_. Those are an implementation of the underlying
|
||||
storage infrastructure in an opinionated way. The first personality I look at is the `aws` one in
|
||||
`cmd/tesseract/aws`. I notice that this personality does make hard assumptions about the use of AWS
|
||||
which is unfortunate, as the documentation says '.. or self-hosted S3 and MySQL database'. Specifically,
|
||||
the `aws` personality assumes the AWS Secrets Manager in order to fetch its signing key. Before I
|
||||
can be successful, I need to detangle that.
|
||||
|
||||
#### TesseraCT: AWS and Local Signer
|
||||
|
||||
First, I change `cmd/tesseract/aws/main.go` to add two new flags:
|
||||
|
||||
* ***-signer_public_key_file***: a path to the public key for checkpoints and SCT signer
|
||||
* ***-signer_private_key_file***: a path to the private key for checkpoints and SCT signer
|
||||
|
||||
I then change the program to assume that, if these flags are both set, the user will want a
|
||||
_NewLocalSigner_ instead of a _NewSecretsManagerSigner_. Now all I have to do is implement the
|
||||
signer interface in a new file `local_signer.go`. There, the function _NewLocalSigner()_ will read the
|
||||
public and private PEM from file, decode them, and create an _ECDSAWithSHA256Signer_ with them. A
|
||||
simple example to show what I mean:
|
||||
|
||||
```
|
||||
// NewLocalSigner creates a new signer that uses the ECDSA P-256 key pair from
|
||||
// local disk files for signing digests.
|
||||
func NewLocalSigner(publicKeyFile, privateKeyFile string) (*ECDSAWithSHA256Signer, error) {
|
||||
// Read public key
|
||||
publicKeyPEM, err := os.ReadFile(publicKeyFile)
|
||||
publicPemBlock, rest := pem.Decode(publicKeyPEM)
|
||||
|
||||
var publicKey crypto.PublicKey
|
||||
publicKey, err = x509.ParsePKIXPublicKey(publicPemBlock.Bytes)
|
||||
ecdsaPublicKey, ok := publicKey.(*ecdsa.PublicKey)
|
||||
|
||||
// Read private key
|
||||
privateKeyPEM, err := os.ReadFile(privateKeyFile)
|
||||
privatePemBlock, rest := pem.Decode(privateKeyPEM)
|
||||
|
||||
var ecdsaPrivateKey *ecdsa.PrivateKey
|
||||
ecdsaPrivateKey, err = x509.ParseECPrivateKey(privatePemBlock.Bytes)
|
||||
|
||||
// Verify the correctness of the signer key pair
|
||||
if !ecdsaPrivateKey.PublicKey.Equal(ecdsaPublicKey) {
|
||||
return nil, errors.New("signer key pair doesn't match")
|
||||
}
|
||||
|
||||
return &ECDSAWithSHA256Signer{
|
||||
publicKey: ecdsaPublicKey,
|
||||
privateKey: ecdsaPrivateKey,
|
||||
}, nil
|
||||
}
|
||||
```
|
||||
|
||||
In the snippet above I omitted all of the error handling, but the local signer logic itself is
|
||||
hopefully clear. And with that, I am liberated from Amazon's Cloud offering and can run this thing
|
||||
all by myself!
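
For completeness: the struct only becomes useful once it satisfies the signer interface the
personality expects, which for an ECDSA key like this is essentially the standard `crypto.Signer`
contract. A minimal sketch of the two remaining methods (assuming the digest handed in is already a
SHA-256 hash, and adding the `io` import):

```
// Public returns the public half of the key pair, as required by crypto.Signer.
func (s *ECDSAWithSHA256Signer) Public() crypto.PublicKey {
	return s.publicKey
}

// Sign signs an already-hashed digest with the ECDSA P-256 private key.
func (s *ECDSAWithSHA256Signer) Sign(rand io.Reader, digest []byte, _ crypto.SignerOpts) ([]byte, error) {
	return ecdsa.SignASN1(rand, s.privateKey, digest)
}
```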
|
||||
|
||||
#### TesseraCT: Running with S3, MySQL, and Local Signer
|
||||
|
||||
First, I need to create a suitable ECDSA key:
|
||||
```
|
||||
pim@ctlog-test:~$ openssl ecparam -name prime256v1 -genkey -noout -out /tmp/private_key.pem
|
||||
pim@ctlog-test:~$ openssl ec -in /tmp/private_key.pem -pubout -out /tmp/public_key.pem
|
||||
```
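
As a side note for later: the `hammer` loadtester further down wants this public key as a
base64-encoded DER blob rather than a PEM file. Assuming the same key, one way to produce that
string is:

```
pim@ctlog-test:~$ openssl ec -in /tmp/private_key.pem -pubout -outform DER | openssl base64 -A
```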
|
||||
|
||||
Then, I'll install the MySQL server and create the databases:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo apt install default-mysql-server
|
||||
pim@ctlog-test:~$ sudo mysql -u root
|
||||
|
||||
CREATE USER 'tesseract'@'localhost' IDENTIFIED BY '<db_passwd>';
|
||||
CREATE DATABASE tesseract;
|
||||
CREATE DATABASE tesseract_antispam;
|
||||
GRANT ALL PRIVILEGES ON tesseract.* TO 'tesseract'@'localhost';
|
||||
GRANT ALL PRIVILEGES ON tesseract_antispam.* TO 'tesseract'@'localhost';
|
||||
```
|
||||
|
||||
Finally, I use the SSD MinIO lab-machine that I just loadtested to create an S3 bucket.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ mc mb minio-ssd/tesseract-test
|
||||
pim@ctlog-test:~$ cat << EOF > /tmp/minio-access.json
|
||||
{ "Version": "2012-10-17", "Statement": [ {
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ],
|
||||
"Resource": [ "arn:aws:s3:::tesseract-test/*", "arn:aws:s3:::tesseract-test" ]
|
||||
} ]
|
||||
}
|
||||
EOF
|
||||
pim@ctlog-test:~$ mc admin user add minio-ssd <user> <secret>
|
||||
pim@ctlog-test:~$ mc admin policy create minio-ssd tesseract-test-access /tmp/minio-access.json
|
||||
pim@ctlog-test:~$ mc admin policy attach minio-ssd tesseract-test-access --user <user>
|
||||
pim@ctlog-test:~$ mc anonymous set public minio-ssd/tesseract-test
|
||||
```
|
||||
|
||||
{{< image width="6em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
After some fiddling, I understand that the AWS software development kit makes some assumptions that
|
||||
you'll be using .. _quelle surprise_ .. AWS services. But you can also use local S3 services by
|
||||
setting a few key environment variables. I had heard of the S3 access and secret key environment
|
||||
variables before, but I now need to also use a different S3 endpoint. That little detour into the
|
||||
codebase only took me .. several hours.
|
||||
|
||||
Armed with that knowledge, I can build and finally start my TesseraCT instance:
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract/cmd/tesseract/aws$ go build -o ~/aws .
|
||||
pim@ctlog-test:~$ export AWS_DEFAULT_REGION="us-east-1"
|
||||
pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID="<user>"
|
||||
pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY="<secret>"
|
||||
pim@ctlog-test:~$ export AWS_ENDPOINT_URL_S3="http://minio-ssd.lab.ipng.ch:9000/"
|
||||
pim@ctlog-test:~$ ./aws --http_endpoint='[::]:6962' \
|
||||
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
|
||||
--bucket=tesseract-test \
|
||||
--db_host=ctlog-test.lab.ipng.ch \
|
||||
--db_user=tesseract \
|
||||
--db_password=<db_passwd> \
|
||||
--db_name=tesseract \
|
||||
--antispam_db_name=tesseract_antispam \
|
||||
--signer_public_key_file=/tmp/public_key.pem \
|
||||
--signer_private_key_file=/tmp/private_key.pem \
|
||||
--roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
|
||||
|
||||
I0727 15:13:04.666056 337461 main.go:128] **** CT HTTP Server Starting ****
|
||||
```
|
||||
|
||||
Hah! I think most of the command line flags and environment variables should make sense, but I was
|
||||
struggling for a while with the `--roots_pem_file` and the `--origin` flags, so I phoned a friend
|
||||
(Al Cutter, Googler extraordinaire and an expert in Tessera/CT). He explained to me that the Log is
|
||||
actually an open endpoint to which anybody might POST data. However, to avoid folks abusing the log
|
||||
infrastructure, each POST is expected to come from one of the certificate authorities listed in the
|
||||
`--roots_pem_file`. OK, that makes sense.
|
||||
|
||||
Then, the `--origin` flag designates how my log calls itself. In the resulting `checkpoint` file it
|
||||
will enumerate a hash of the latest merged and published Merkle tree. In case a server serves
|
||||
multiple logs, it uses the `--origin` flag to make the distinction of which checksum belongs to which log.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ curl http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint
|
||||
ctlog-test.lab.ipng.ch/test-ecdsa
|
||||
0
|
||||
JGPitKWWI0aGuCfC2k1n/p9xdWAYPm5RZPNDXkCEVUU=
|
||||
|
||||
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMCONUBAMARjBEAiA/nc9dig6U//vPg7SoTHjt9bxP5K+x3w4MYKpIRn4ULQIgUY5zijRK8qyuJGvZaItDEmP1gohCt+wI+sESBnhkuqo=
|
||||
```
|
||||
|
||||
When creating the bucket above, I used `mc anonymous set public`, which made the S3 bucket
|
||||
world-readable. I can now execute the whole read-path simply by hitting the S3 service. Check.
|
||||
|
||||
#### TesseraCT: Loadtesting S3/MySQL
|
||||
|
||||
{{< image width="12em" float="right" src="/assets/ctlog/stop-hammer-time.jpg" alt="Stop, hammer time" >}}
|
||||
|
||||
The write path is a server on `[::]:6962`. I should be able to write a log to it, but how? Here's
|
||||
where I am grateful to find a tool in the TesseraCT GitHub repository called `hammer`. This hammer
|
||||
sets up read and write traffic to a Static CT API log to test correctness and performance under
|
||||
load. The traffic is sent according to the [[Static CT API](https://c2sp.org/static-ct-api)] spec.
|
||||
Slick!
|
||||
|
||||
The tool starts a text-based UI (my favorite! also when using the Cisco T-Rex loadtester) in the terminal
|
||||
that shows the current status, logs, and supports increasing/decreasing read and write traffic. This
|
||||
TUI allows for a level of interactivity when probing a new configuration of a log in order to find
|
||||
any cliffs where performance degrades. For real load-testing applications, especially headless runs
|
||||
as part of a CI pipeline, it is recommended to run the tool with `-show_ui=false` in order to disable
|
||||
the UI.
|
||||
|
||||
I'm a bit lost in the somewhat terse
|
||||
[[README.md](https://github.com/transparency-dev/tesseract/tree/main/internal/hammer)], but my buddy
|
||||
Al comes to my rescue and explains the flags to me. First of all, the loadtester wants to hit the
|
||||
same `--origin` that I configured the write-path to accept. In my case this is
|
||||
`ctlog-test.lab.ipng.ch/test-ecdsa`. Then, it needs the public key for that _Log_, which I can find
|
||||
in `/tmp/public_key.pem`. The text there is the _DER_ (Distinguished Encoding Rules), stored as a
|
||||
base64 encoded string. What follows next was the most difficult for me to understand, as I was
|
||||
thinking the hammer would read some log from the internet somewhere and replay it locally. Al
|
||||
explains that actually, the `hammer` tool synthetically creates all of these entries itself, and it
|
||||
regularly reads the `checkpoint` from the `--log_url` place, while it writes its certificates to
|
||||
`--write_log_url`. The last few flags just inform the `hammer` how many read and write ops/sec it
|
||||
should generate, and with that explanation my brain plays _tadaa.wav_ and I am ready to go.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer \
|
||||
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEucHtDWe9GYNicPnuGWbEX8rJg/VnDcXs8z40KdoNidBKy6/ZXw2u+NW1XAUnGpXcZozxufsgOMhijsWb25r7jw== \
|
||||
--log_url=http://tesseract-test.minio-ssd.lab.ipng.ch:9000/ \
|
||||
--write_log_url=http://localhost:6962/ctlog-test.lab.ipng.ch/test-ecdsa/ \
|
||||
--max_read_ops=0 \
|
||||
--num_writers=5000 \
|
||||
--max_write_ops=100
|
||||
```
|
||||
|
||||
{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest1.png" alt="S3/MySQL Loadtest 100qps" >}}
|
||||
|
||||
Cool! It seems that the loadtest is happily chugging along at 100qps. The log is consuming them in
|
||||
the HTTP write-path by accepting POST requests to
|
||||
`/ctlog-test.lab.ipng.ch/test-ecdsa/ct/v1/add-chain`, where hammer is offering them at a rate of
|
||||
100qps, with a configured probability of duplicates set at 10%. What that means is that every now
|
||||
and again, it'll repeat a previous request. The purpose of this is to stress test the so-called
|
||||
`antispam` implementation. When `hammer` sends its requests, it signs them with a certificate that
|
||||
was issued by the CA described in `internal/hammer/testdata/test_root_ca_cert.pem`, which is why
|
||||
TesseraCT accepts them.
|
||||
|
||||
I raise the write load by using the '>' key a few times. I notice things are great at 500qps, which
|
||||
is nice because that's double what we expect. But I start seeing a bit more noise at 600qps.
|
||||
When I raise the write-rate to 1000qps, all hell breaks loose on the logs of the server (and similar
|
||||
logs in the `hammer` loadtester):
|
||||
|
||||
```
|
||||
W0727 15:54:33.419881 348475 handlers.go:168] ctlog-test.lab.ipng.ch/test-ecdsa: AddChain handler error: couldn't store the leaf: failed to fetch entry bundle at index 0: failed to fetch resource: getObject: failed to create reader for object "tile/data/000" in bucket "tesseract-test": operation error S3: GetObject, context deadline exceeded
|
||||
W0727 15:55:02.727962 348475 aws.go:345] GarbageCollect failed: failed to delete one or more objects: failed to delete objects: operation error S3: DeleteObjects, https response error StatusCode: 400, RequestID: 1856202CA3C4B83F, HostID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8, api error MalformedXML: The XML you provided was not well-formed or did not validate against our published schema.
|
||||
E0727 15:55:10.448973 348475 append_lifecycle.go:293] followerStats: follower "AWS antispam" EntriesProcessed(): failed to read follow coordination info: Error 1040: Too many connections
|
||||
```
|
||||
|
||||
I see on the MinIO instance that it's doing about 150/s of GETs and 15/s of PUTs, which is totally
|
||||
reasonable:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ mc admin trace --stats ssd
|
||||
Duration: 6m9s ▰▱▱
|
||||
RX Rate:↑ 34 MiB/m
|
||||
TX Rate:↓ 2.3 GiB/m
|
||||
RPM : 10588.1
|
||||
-------------
|
||||
Call Count RPM Avg Time Min Time Max Time Avg TTFB Max TTFB Avg Size Rate /min
|
||||
s3.GetObject 60558 (92.9%) 9837.2 4.3ms 708µs 48.1ms 3.9ms 47.8ms ↑144B ↓246K ↑1.4M ↓2.3G
|
||||
s3.PutObject 2199 (3.4%) 357.2 5.3ms 2.4ms 32.7ms 5.3ms 32.7ms ↑92K ↑32M
|
||||
s3.DeleteMultipleObjects 1212 (1.9%) 196.9 877µs 290µs 41.1ms 850µs 41.1ms ↑230B ↓369B ↑44K ↓71K
|
||||
s3.ListObjectsV2 1212 (1.9%) 196.9 18.4ms 999µs 52.8ms 18.3ms 52.7ms ↑131B ↓261B ↑25K ↓50K
|
||||
```
|
||||
|
||||
Another nice way to see what makes it through is this oneliner, which reads the `checkpoint` every
|
||||
second, and once it changes, shows the delta in seconds and how many certs were written:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
|
||||
N=$(curl -sS http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E '^[0-9]+$'); \
|
||||
if [ "$N" -eq "$O" ]; then \
|
||||
echo -n .; \
|
||||
else \
|
||||
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
|
||||
fi; \
|
||||
T=$((T+1)); sleep 1; done
|
||||
1012905 .... 5 seconds 2081 certs
|
||||
1014986 .... 5 seconds 2126 certs
|
||||
1017112 .... 5 seconds 1913 certs
|
||||
1019025 .... 5 seconds 2588 certs
|
||||
1021613 .... 5 seconds 2591 certs
|
||||
1024204 .... 5 seconds 2197 certs
|
||||
```
|
||||
|
||||
So I can see that the checkpoint is refreshed every 5 seconds and between 1913 and 2591 certs are
|
||||
written each time. And indeed, at 400/s there are no errors or warnings at all. At this write rate,
|
||||
TesseraCT is using about 2.9 CPUs/s, with MariaDB using 0.3 CPUs/s, but the hammer is using 6.0
|
||||
CPUs/s. Overall, the machine is perfectly happy serving for a few hours under this load test.
|
||||
|
||||
***Conclusion: a write-rate of 400/s should be safe with S3+MySQL***
|
||||
|
||||
### TesseraCT: POSIX
|
||||
|
||||
I have been playing with this idea of having a reliable read-path by having the S3 cluster be
|
||||
redundant, or by replicating the S3 bucket. But Al asks: why not use our experimental POSIX?
|
||||
We discuss two very important benefits, but also two drawbacks:
|
||||
|
||||
* On the plus side:
|
||||
1. There is no need for S3 storage, read/writing to a local ZFS raidz2 pool instead.
|
||||
1. There is no need for MySQL, as the POSIX implementation can use a local badger instance
|
||||
also on the local filesystem.
|
||||
* On the drawbacks:
|
||||
1. There is a SPOF in the read-path, as the single VM must handle both reads and writes. The write-path always
|
||||
has a SPOF on the TesseraCT VM.
|
||||
1. Local storage is more expensive than S3 storage, and can be used only for the purposes of
|
||||
one application (and at best, shared with other VMs on the same hypervisor).
|
||||
|
||||
Come to think of it, this is maybe not such a bad tradeoff. I do kind of like having a single-VM
|
||||
with a single-binary and no other moving parts. It greatly simplifies the architecture, and for the
|
||||
read-path I can (and will) still use multiple upstream NGINX machines in IPng's network.
|
||||
|
||||
I consider myself nerd-sniped, and take a look at the POSIX variant. I have a few SAS3
|
||||
solid state drives (NetApp part number X447_S1633800AMD), which I plug into the `ctlog-test`
|
||||
machine.
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo zpool create -o ashift=12 -o autotrim=on ssd-vol0 mirror \
|
||||
/dev/disk/by-id/wwn-0x5002538a0???????
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/tesseract-test
|
||||
pim@ctlog-test:~$ sudo chown pim:pim /ssd-vol0/tesseract-test
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./cmd/experimental/posix --http_endpoint='[::]:6962' \
|
||||
--origin=ctlog-test.lab.ipng.ch/test-ecdsa \
|
||||
--private_key=/tmp/private_key.pem \
|
||||
--storage_dir=/ssd-vol0/tesseract-test \
|
||||
--roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem
|
||||
badger 2025/07/27 16:29:15 INFO: All 0 tables opened in 0s
|
||||
badger 2025/07/27 16:29:15 INFO: Discard stats nextEmptySlot: 0
|
||||
badger 2025/07/27 16:29:15 INFO: Set nextTxnTs to 0
|
||||
I0727 16:29:15.032845 363156 files.go:502] Initializing directory for POSIX log at "/ssd-vol0/tesseract-test" (this should only happen ONCE per log!)
|
||||
I0727 16:29:15.034101 363156 main.go:97] **** CT HTTP Server Starting ****
|
||||
|
||||
pim@ctlog-test:~/src/tesseract$ cat /ssd-vol0/tesseract-test/checkpoint
|
||||
ctlog-test.lab.ipng.ch/test-ecdsa
|
||||
0
|
||||
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
|
||||
|
||||
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMSgC8BAMARzBFAiBjT5zdkniKlryqlUlx/gLHOtVK26zuWwrc4BlyTVzCWgIhAJ0GIrlrP7YGzRaHjzdB5tnS5rpP3LeOsPbpLateaiFc
|
||||
```
|
||||
|
||||
Alright, I can see the log started and created an empty checkpoint file. Nice!
|
||||
|
||||
Before I can loadtest it, I will need to make the read-path visible. The `hammer` can read
|
||||
a checkpoint from local `file:///` prefixes, but I'll have to serve them over the network eventually
|
||||
anyway, so I create the following NGINX config for it:
|
||||
|
||||
```
|
||||
server {
|
||||
listen 80 default_server backlog=4096;
|
||||
listen [::]:80 default_server backlog=4096;
|
||||
root /ssd-vol0/tesseract-test/;
|
||||
index index.html index.htm index.nginx-debian.html;
|
||||
|
||||
server_name _;
|
||||
|
||||
access_log /var/log/nginx/access.log combined buffer=512k flush=5s;
|
||||
|
||||
location / {
|
||||
try_files $uri $uri/ =404;
|
||||
tcp_nopush on;
|
||||
sendfile on;
|
||||
tcp_nodelay on;
|
||||
keepalive_timeout 65;
|
||||
keepalive_requests 1000;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Just a couple of small thoughts on this configuration. I'm using buffered access logs, to avoid
|
||||
excessive disk writes in the read-path. Then, I'm using kernel `sendfile()` which will instruct the
|
||||
kernel to serve the static objects directly, so that NGINX can move on. Further, I'll allow for a
|
||||
long keepalive in HTTP 1.1, so that future requests can use the same TCP connection, and I'll set
|
||||
the flag `tcp_nodelay` and `tcp_nopush` to just blast the data out without waiting.
|
||||
|
||||
Without much ado:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ curl -sS ctlog-test.lab.ipng.ch/checkpoint
|
||||
ctlog-test.lab.ipng.ch/test-ecdsa
|
||||
0
|
||||
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
|
||||
|
||||
— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMTfksBAMASDBGAiEAqADLH0P/SRVloF6G1ezlWG3Exf+sTzPIY5u6VjAKLqACIQCkJO2N0dZQuDHvkbnzL8Hd91oyU41bVqfD3vs5EwUouA==
|
||||
```
|
||||
|
||||
#### TesseraCT: Loadtesting POSIX
|
||||
|
||||
The loadtesting is roughly the same. I start the `hammer` with the same 500qps of write rate, which
|
||||
was roughly where the S3+MySQL variant topped. My checkpoint tracker shows the following:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
|
||||
N=$(curl -sS http://localhost/checkpoint | grep -E '^[0-9]+$'); \
|
||||
if [ "$N" -eq "$O" ]; then \
|
||||
echo -n .; \
|
||||
else \
|
||||
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
|
||||
fi; \
|
||||
T=$((T+1)); sleep 1; done
|
||||
59250 ......... 10 seconds 5244 certs
|
||||
64494 ......... 10 seconds 5000 certs
|
||||
69494 ......... 10 seconds 5000 certs
|
||||
74494 ......... 10 seconds 5000 certs
|
||||
79494 ......... 10 seconds 5256 certs
|
||||
79494 ......... 10 seconds 5256 certs
|
||||
84750 ......... 10 seconds 5244 certs
|
||||
89994 ......... 10 seconds 5256 certs
|
||||
95250 ......... 10 seconds 5000 certs
|
||||
100250 ......... 10 seconds 5000 certs
|
||||
105250 ......... 10 seconds 5000 certs
|
||||
```
|
||||
|
||||
I learn two things. First, the checkpoint interval in this `posix` variant is 10 seconds, compared
|
||||
to the 5 seconds of the `aws` variant I tested before. I dive into the code, because there doesn't
|
||||
seem to be a `--checkpoint_interval` flag. In the `tessera` library, I find
|
||||
`DefaultCheckpointInterval` which is set to 10 seconds. I change it to be 2 seconds instead, and
|
||||
restart the `posix` binary:
|
||||
|
||||
```
|
||||
238250 . 2 seconds 1000 certs
|
||||
239250 . 2 seconds 1000 certs
|
||||
240250 . 2 seconds 1000 certs
|
||||
241250 . 2 seconds 1000 certs
|
||||
242250 . 2 seconds 1000 certs
|
||||
243250 . 2 seconds 1000 certs
|
||||
244250 . 2 seconds 1000 certs
|
||||
```
|
||||
|
||||
{{< image width="30em" float="right" src="/assets/ctlog/ctlog-loadtest2.png" alt="Posix Loadtest 5000qps" >}}
|
||||
|
||||
Very nice! Maybe I can write a few more certs? I restart the `hammer` with 5000/s, which somewhat to my
|
||||
surprise, ends up serving!
|
||||
|
||||
```
|
||||
642608 . 2 seconds 6155 certs
|
||||
648763 . 2 seconds 10256 certs
|
||||
659019 . 2 seconds 9237 certs
|
||||
668256 . 2 seconds 8800 certs
|
||||
677056 . 2 seconds 8729 certs
|
||||
685785 . 2 seconds 8237 certs
|
||||
694022 . 2 seconds 7487 certs
|
||||
701509 . 2 seconds 8572 certs
|
||||
710081 . 2 seconds 7413 certs
|
||||
```
|
||||
|
||||
The throughput is highly variable though, seemingly between 3700/sec and 5100/sec, and I quickly
|
||||
find out that the `hammer` is completely saturating the CPU on the machine, leaving very little room
|
||||
for the `posix` TesseraCT to serve. I'm going to need more machines!
|
||||
|
||||
So I start a `hammer` loadtester on the two now-idle MinIO servers, and run them at about 6000qps
|
||||
**each**, for a total of 12000 certs/sec. And my little `posix` binary is keeping up like a champ:
|
||||
|
||||
```
|
||||
2987169 . 2 seconds 23040 certs
|
||||
3010209 . 2 seconds 23040 certs
|
||||
3033249 . 2 seconds 21760 certs
|
||||
3055009 . 2 seconds 21504 certs
|
||||
3076513 . 2 seconds 23808 certs
|
||||
3100321 . 2 seconds 22528 certs
|
||||
```
|
||||
|
||||
One thing is reasonably clear, the `posix` TesseraCT is CPU bound, not disk bound. The CPU is now
|
||||
running at about 18.5 CPUs/s (with 20 cores), which is pretty much all this Dell has to offer. The
|
||||
NetAPP enterprise solid state drives are not impressed:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ zpool iostat -v ssd-vol0 10 100
|
||||
capacity operations bandwidth
|
||||
pool alloc free read write read write
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 11.4G 733G 0 3.13K 0 117M
|
||||
mirror-0 11.4G 733G 0 3.13K 0 117M
|
||||
wwn-0x5002538a05302930 - - 0 1.04K 0 39.1M
|
||||
wwn-0x5002538a053069f0 - - 0 1.06K 0 39.1M
|
||||
wwn-0x5002538a06313ed0 - - 0 1.02K 0 39.1M
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
|
||||
pim@ctlog-test:~/src/tesseract$ zpool iostat -l ssd-vol0 10
|
||||
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim
|
||||
pool alloc free read write read write read write read write read write read write wait wait
|
||||
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 14.0G 730G 0 1.48K 0 35.4M - 2ms - 535us - 1us - 3ms - 50ms
|
||||
ssd-vol0 14.0G 730G 0 1.12K 0 23.0M - 1ms - 733us - 2us - 1ms - 44ms
|
||||
ssd-vol0 14.1G 730G 0 1.42K 0 45.3M - 508us - 122us - 914ns - 2ms - 41ms
|
||||
ssd-vol0 14.2G 730G 0 678 0 21.0M - 863us - 144us - 2us - 2ms - -
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
OK, that kind of seals the deal for me. The write path needs about 250 certs/sec and I'm hammering
|
||||
now with 12'000 certs/sec, with room to spare. But what about the read path? The cool thing about
|
||||
the static log is that reads are all entirely done by NGINX. The only file that isn't cacheable is
|
||||
the `checkpoint` file which gets updated every two seconds (or ten seconds in the default `tessera`
|
||||
settings).
|
||||
|
||||
So I start yet another `hammer` whose job it is to read back from the static filesystem:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ curl localhost/nginx_status; sleep 60; curl localhost/nginx_status
|
||||
Active connections: 10556
|
||||
server accepts handled requests
|
||||
25302 25302 1492918
|
||||
Reading: 0 Writing: 1 Waiting: 10555
|
||||
Active connections: 7791
|
||||
server accepts handled requests
|
||||
25764 25764 1727631
|
||||
Reading: 0 Writing: 1 Waiting: 7790
|
||||
```
|
||||
|
||||
And I can see that it's keeping up quite nicely. In one minute, it handled (1727631-1492918) or
|
||||
234713 requests, which is a cool 3911 requests/sec. All these read/write hammers are kind of
|
||||
saturating the `ctlog-test` machine though:
|
||||
|
||||
{{< image width="100%" src="/assets/ctlog/ctlog-loadtest3.png" alt="Posix Loadtest 8000qps write, 4000qps read" >}}
|
||||
|
||||
But after a little bit of fiddling, I can assert my conclusion:
|
||||
|
||||
***Conclusion: a write-rate of 8'000/s alongside a read-rate of 4'000/s should be safe with POSIX***
|
||||
|
||||
## What's Next
|
||||
|
||||
I am going to offer such a machine in production together with Antonis Chariton, and Jeroen Massar.
|
||||
I plan to do a few additional things:
|
||||
|
||||
* Test Sunlight as well on the same hardware. It would be nice to see a comparison between write
|
||||
rates of the two implementations.
|
||||
* Work with Al Cutter and the Transparency Dev team to close a few small gaps (like the
|
||||
`local_signer.go` and some Prometheus monitoring of the `posix` binary.
|
||||
* Install and launch both under `*.ct.ipng.ch`, which in itself deserves its own report, showing
|
||||
how I intend to do log cycling and care/feeding, as well as report on the real production
|
||||
experience running these CT Logs.
|
||||
666
content/articles/2025-08-10-ctlog-2.md
Normal file
666
content/articles/2025-08-10-ctlog-2.md
Normal file
@@ -0,0 +1,666 @@
|
||||
---
|
||||
date: "2025-08-10T12:07:23Z"
|
||||
title: 'Certificate Transparency - Part 2 - Sunlight'
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)], as the
|
||||
name suggests it was a form of _digital notary_, and they were in the business of issuing security
|
||||
certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and
|
||||
subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for
|
||||
man-in-the-middle attacks on Iranian Gmail users. Not cool.
|
||||
|
||||
Google launched a project called **Certificate Transparency**, because it was becoming more common
|
||||
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
|
||||
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
|
||||
the Web Public Key Infrastructure. It led to the creation of this ambitious
|
||||
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
|
||||
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
|
||||
and _TLS_ (Transport Layer Security).
|
||||
|
||||
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
|
||||
describes an experimental protocol for publicly logging the existence of Transport Layer Security
|
||||
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
|
||||
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
|
||||
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
|
||||
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
|
||||
the logs.
|
||||
|
||||
In a [[previous article]({{< ref 2025-07-26-ctlog-1 >}})], I took a deep dive into an upcoming
|
||||
open source implementation of Static CT Logs made by Google. There is however a very competent
|
||||
alternative called [[Sunlight](https://sunlight.dev/)], which deserves some attention to get to know
|
||||
its look and feel, as well as its performance characteristics.
|
||||
|
||||
## Sunlight
|
||||
|
||||
I start by reading up on the project website, and learn:
|
||||
|
||||
> _Sunlight is a [[Certificate Transparency](https://certificate.transparency.dev/)] log implementation
|
||||
> and monitoring API designed for scalability, ease of operation, and reduced cost. What started as
|
||||
> the Sunlight API is now the [[Static CT API](https://c2sp.org/static-ct-api)] and is allowed by the
|
||||
> CT log policies of the major browsers._
|
||||
>
|
||||
> _Sunlight was designed by Filippo Valsorda for the needs of the WebPKI community, through the
|
||||
> feedback of many of its members, and in particular of the Sigsum, Google TrustFabric, and ISRG
|
||||
> teams. It is partially based on the Go Checksum Database. Sunlight's development was sponsored by
|
||||
> Let's Encrypt._
|
||||
|
||||
I have a chat with Filippo and think I'm addressing an Elephant by asking him which of the two
|
||||
implementations, TesseraCT or Sunlight, he thinks would be a good fit. One thing he says really sticks
|
||||
with me: "The community needs _any_ static log operator, so if Google thinks TesseraCT is ready, by
|
||||
all means use that. The diversity will do us good!".
|
||||
|
||||
To find out if one or the other is 'ready' is partly on the software, but importantly also on the
|
||||
operator. So I carefully take Sunlight out of its cardboard box, and put it onto the same Dell R630
|
||||
that I used in my previous tests: two Xeon E5-2640 v4 CPUs for a total of 20 cores and 40 threads,
|
||||
and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I place 6 pcs 1.2TB SAS3
|
||||
drives (HPE part number EG1200JEHMC), and in the second machine I place 6pcs of 1.92TB enterprise
|
||||
storage (Samsung part number P1633N19).
|
||||
|
||||
### Sunlight: setup
|
||||
|
||||
I download the source from GitHub, which, one of these days, will have an IPv6 address. Building the
|
||||
tools is easy enough, there are three main tools:
|
||||
1. ***sunlight***: Which serves the write-path. Certification authorities add their certs here.
|
||||
1. ***sunlight-keygen***: A helper tool to create the so-called `seed` file (key material) for a
|
||||
log.
|
||||
1. ***skylight***: Which serves the read-path. `/checkpoint` and things like `/tile` and `/issuer`
|
||||
are served here in a spec-compliant way.
|
||||
|
||||
The YAML configuration file is straightforward, and can define and handle multiple logs in one
|
||||
instance, which sets it apart from TesseraCT which can only handle one log per instance. There's a
|
||||
`submissionprefix` which `sunlight` will use to accept writes, and a `monitoringprefix` which
|
||||
`skylight` will use for reads.
|
||||
|
||||
I stumble across a small issue - I haven't created multiple DNS hostnames for the test machine. So I
|
||||
decide to use a different port for one versus the other. The write path will use TLS on port 1443
|
||||
while Sunlight will point to a normal HTTP port 1080. And considering I don't have a certificate for
|
||||
`*.lab.ipng.ch`, I will use a self-signed one instead:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ openssl genrsa -out ca.key 2048
|
||||
pim@ctlog-test:/etc/sunlight$ openssl req -new -x509 -days 365 -key ca.key \
|
||||
-subj "/C=CH/ST=ZH/L=Bruttisellen/O=IPng Networks GmbH/CN=IPng Root CA" -out ca.crt
|
||||
pim@ctlog-test:/etc/sunlight$ openssl req -newkey rsa:2048 -nodes -keyout sunlight-key.pem \
|
||||
-subj "/C=CH/ST=ZH/L=Bruttisellen/O=IPng Networks GmbH/CN=*.lab.ipng.ch" -out sunlight.csr
|
||||
pim@ctlog-test:/etc/sunlight# openssl x509 -req -extfile \
|
||||
<(printf "subjectAltName=DNS:ctlog-test.lab.ipng.ch,DNS:ctlog-test.lab.ipng.ch") -days 365 \
|
||||
-in sunlight.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out sunlight.pem
|
||||
ln -s sunlight.pem skylight.pem
|
||||
ln -s sunlight-key.pem skylight-key.pem
|
||||
```
|
||||
|
||||
This little snippet yields `sunlight.pem` (the certificate) and `sunlight-key.pem` (the private
|
||||
key), and symlinks them to `skylight.pem` and `skylight-key.pem` for simplicity. With these in hand,
|
||||
I can start the rest of the show. First I will prepare the NVME storage with a few datasets in
|
||||
which Sunlight will store its data:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/shared
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/logs
|
||||
pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/logs/sunlight-test
|
||||
pim@ctlog-test:~$ sudo chown -R pim:pim /ssd-vol0/sunlight-test
|
||||
```
|
||||
|
||||
Then I'll create the Sunlight configuration:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ sunlight-keygen -f sunlight-test.seed.bin
|
||||
Log ID: IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=
|
||||
ECDSA public key:
|
||||
-----BEGIN PUBLIC KEY-----
|
||||
MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHR
|
||||
wRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ==
|
||||
-----END PUBLIC KEY-----
|
||||
Ed25519 public key:
|
||||
-----BEGIN PUBLIC KEY-----
|
||||
0pHg7KptAxmb4o67m9xNM1Ku3YH4bjjXbyIgXn2R2bk=
|
||||
-----END PUBLIC KEY-----
|
||||
```
|
||||
|
||||
The first block creates key material for the log, and I get a fun surprise: the Log ID starts
|
||||
precisely with the string IPng... what are the odds that that would happen!? I should tell Antonis
|
||||
about this, it's dope!
|
||||
|
||||
As a safety precaution, Sunlight requires the operator to make the `checkpoints.db` by hand, which
|
||||
I'll also do:
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ sqlite3 /ssd-vol0/sunlight-test/shared/checkpoints.db \
|
||||
"CREATE TABLE checkpoints (logID BLOB PRIMARY KEY, body TEXT)"
|
||||
```
|
||||
|
||||
And with that, I'm ready to create my first log!
|
||||
|
||||
### Sunlight: Setting up S3
|
||||
|
||||
When learning about [[Tessera]({{< ref 2025-07-26-ctlog-1 >}})], I already kind of drew the
|
||||
conclusion that, for our case at IPng at least, running the fully cloud-native version with S3
|
||||
storage and MySQL database, gave both poorer performance, but also more operational complexity. But
|
||||
I find it interesting to compare behavior and performance, so I'll start by creating a Sunlight log
|
||||
using backing MinIO SSD storage.
|
||||
|
||||
I'll first create the bucket and a user account to access it:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID="<some user>"
|
||||
pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY="<some password>"
|
||||
pim@ctlog-test:~$ export S3_BUCKET=sunlight-test
|
||||
|
||||
pim@ctlog-test:~$ mc mb ssd/${S3_BUCKET}
|
||||
pim@ctlog-test:~$ cat << EOF > /tmp/minio-access.json
|
||||
{ "Version": "2012-10-17", "Statement": [ {
|
||||
"Effect": "Allow",
|
||||
"Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ],
|
||||
"Resource": [ "arn:aws:s3:::${S3_BUCKET}/*", "arn:aws:s3:::${S3_BUCKET}" ]
|
||||
} ]
|
||||
}
|
||||
EOF
|
||||
pim@ctlog-test:~$ mc admin user add ssd ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
|
||||
pim@ctlog-test:~$ mc admin policy create ssd ${S3_BUCKET}-access /tmp/minio-access.json
|
||||
pim@ctlog-test:~$ mc admin policy attach ssd ${S3_BUCKET}-access --user ${AWS_ACCESS_KEY_ID}
|
||||
pim@ctlog-test:~$ mc anonymous set public ssd/${S3_BUCKET}
|
||||
```
|
||||
|
||||
After setting up the S3 environment, all I must do is wire it up to the Sunlight configuration
|
||||
file:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ cat << EOF > sunlight-s3.yaml
|
||||
listen:
|
||||
- "[::]:1443"
|
||||
checkpoints: /ssd-vol0/sunlight-test/shared/checkpoints.db
|
||||
logs:
|
||||
- shortname: sunlight-test
|
||||
inception: 2025-08-10
|
||||
submissionprefix: https://ctlog-test.lab.ipng.ch:1443/
|
||||
monitoringprefix: http://sunlight-test.minio-ssd.lab.ipng.ch:9000/
|
||||
secret: /etc/sunlight/sunlight-test.seed.bin
|
||||
cache: /ssd-vol0/sunlight-test/logs/sunlight-test/cache.db
|
||||
s3region: eu-schweiz-1
|
||||
s3bucket: sunlight-test
|
||||
s3endpoint: http://minio-ssd.lab.ipng.ch:9000/
|
||||
roots: /etc/sunlight/roots.pem
|
||||
period: 200
|
||||
poolsize: 15000
|
||||
notafterstart: 2024-01-01T00:00:00Z
|
||||
notafterlimit: 2025-01-01T00:00:00Z
|
||||
EOF
|
||||
```
|
||||
|
||||
The one thing of note here is the use of `roots:` file which contains the Root CA for the TesseraCT
|
||||
loadtester which I'll be using. In production, Sunlight can grab the approved roots from the
|
||||
so-called _Common CA Database_ or CCADB. But you can also specify either all roots using the `roots`
|
||||
field, or additional roots on top of the `ccadbroots` field, using the `extraroots` field. That's a
|
||||
handy trick! You can find more info on the [[CCADB](https://www.ccadb.org/)] homepage.
|
||||
|
||||
I can then start Sunlight just like this:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ sunlight -testcert -c /etc/sunlight/sunlight-s3.yaml {"time":"2025-08-10T13:49:36.091384532+02:00","level":"INFO","source":{"function":"main.main.func1","file":"/home/pim/src/sunlight/cmd/sunlight/sunlig
|
||||
ht.go","line":341},"msg":"debug server listening","addr":{"IP":"127.0.0.1","Port":37477,"Zone":""}}
|
||||
time=2025-08-10T13:49:36.091+02:00 level=INFO msg="debug server listening" addr=127.0.0.1:37477 {"time":"2025-08-10T13:49:36.100471647+02:00","level":"INFO","source":{"function":"main.main","file":"/home/pim/src/sunlight/cmd/sunlight/sunlight.go"
|
||||
,"line":542},"msg":"today is the Inception date, creating log","log":"sunlight-test"} time=2025-08-10T13:49:36.100+02:00 level=INFO msg="today is the Inception date, creating log" log=sunlight-test
|
||||
{"time":"2025-08-10T13:49:36.119529208+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.CreateLog","file":"/home/pim/src
|
||||
/sunlight/internal/ctlog/ctlog.go","line":159},"msg":"created log","log":"sunlight-test","timestamp":1754826576111,"logID":"IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E="}
|
||||
time=2025-08-10T13:49:36.119+02:00 level=INFO msg="created log" log=sunlight-test timestamp=1754826576111 logID="IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E="
|
||||
{"time":"2025-08-10T13:49:36.127702166+02:00","level":"WARN","source":{"function":"filippo.io/sunlight/internal/ctlog.LoadLog","file":"/home/pim/src/s
|
||||
unlight/internal/ctlog/ctlog.go","line":296},"msg":"failed to parse previously trusted roots","log":"sunlight-test","roots":""} time=2025-08-10T13:49:36.127+02:00 level=WARN msg="failed to parse previously trusted roots" log=sunlight-test roots=""
|
||||
{"time":"2025-08-10T13:49:36.127766452+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.LoadLog","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":301},"msg":"loaded log","log":"sunlight-test","logID":"IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=","size":0,
|
||||
"timestamp":1754826576111}
|
||||
time=2025-08-10T13:49:36.127+02:00 level=INFO msg="loaded log" log=sunlight-test logID="IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=" size=0 timestamp=1754826576111
|
||||
{"time":"2025-08-10T13:49:36.540297532+02:00","level":"INFO","source":{"function":"filippo.io/sunlight/internal/ctlog.(*Log).sequencePool","file":"/home/pim/src/sunlight/internal/ctlog/ctlog.go","line":972},"msg":"sequenced pool","log":"sunlight-test","old_tree_size":0,"entries":0,"start":"2025-08-1
|
||||
0T13:49:36.534500633+02:00","tree_size":0,"tiles":0,"timestamp":1754826576534,"elapsed":5788099}
|
||||
time=2025-08-10T13:49:36.540+02:00 level=INFO msg="sequenced pool" log=sunlight-test old_tree_size=0 entries=0 start=2025-08-10T13:49:36.534+02:00 tree_size=0 tiles=0 timestamp=1754826576534 elapsed=5.788099ms
|
||||
...
|
||||
```
|
||||
|
||||
Although that looks pretty good, I see that something is not quite right. When Sunlight comes up, it shares
|
||||
with me a few links, in the `get-roots` and `json` fields on the homepage, but neither of them work:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ curl -k https://ctlog-test.lab.ipng.ch:1443/ct/v1/get-roots
|
||||
404 page not found
|
||||
pim@ctlog-test:~$ curl -k https://ctlog-test.lab.ipng.ch:1443/log.v3.json
|
||||
404 page not found
|
||||
```
|
||||
|
||||
I'm starting to think that using a non-standard listen port won't work, or more precisely, adding
|
||||
a port in the `monitoringprefix` won't work. I notice that the logname is called
|
||||
`ctlog-test.lab.ipng.ch:1443` which I don't think is supposed to have a portname in it. So instead,
|
||||
I make Sunlight `listen` on port 443 and omit the port in the `submissionprefix`, and give it and
|
||||
its companion Skylight the needed privileges to bind the privileged port like so:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~$ sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/sunlight
|
||||
pim@ctlog-test:~$ sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/skylight
|
||||
pim@ctlog-test:~$ sunlight -testcert -c /etc/sunlight/sunlight-s3.yaml
|
||||
```
|
||||
|
||||
{{< image width="60%" src="/assets/ctlog/sunlight-test-s3.png" alt="Sunlight testlog / S3" >}}
|
||||
|
||||
And with that, Sunlight reports for duty and the links work. Hoi!
|
||||
|
||||
#### Sunlight: Loadtesting S3
|
||||
|
||||
I have some good experience loadtesting from the [[TesseraCT article]({{< ref 2025-07-26-ctlog-1
|
||||
>}})]. One important difference is that Sunlight wants to use SSL for the submission and monitoring
|
||||
paths, and I've created a snakeoil self-signed cert. CT Hammer does not accept that out of the box,
|
||||
so I need to make a tiny change to the Hammer:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ git diff
|
||||
diff --git a/internal/hammer/hammer.go b/internal/hammer/hammer.go
|
||||
index 3828fbd..1dfd895 100644
|
||||
--- a/internal/hammer/hammer.go
|
||||
+++ b/internal/hammer/hammer.go
|
||||
@@ -104,6 +104,9 @@ func main() {
|
||||
MaxIdleConns: *numWriters + *numReadersFull + *numReadersRandom,
|
||||
MaxIdleConnsPerHost: *numWriters + *numReadersFull + *numReadersRandom,
|
||||
DisableKeepAlives: false,
|
||||
+ TLSClientConfig: &tls.Config{
|
||||
+ InsecureSkipVerify: true,
|
||||
+ },
|
||||
},
|
||||
Timeout: *httpTimeout,
|
||||
}
|
||||
```
|
||||
|
||||
With that small bit of insecurity out of the way, Sunlight makes it otherwise pretty easy for me to
|
||||
construct the CT Hammer commandline:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=http://sunlight-test.minio-ssd.lab.ipng.ch:9000/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=100
|
||||
|
||||
pim@ctlog-test:/etc/sunlight$ T=0; O=0; while :; do \
|
||||
N=$(curl -sS http://sunlight-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E '^[0-9]+$'); \
|
||||
if [ "$N" -eq "$O" ]; then \
|
||||
echo -n .; \
|
||||
else \
|
||||
echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
|
||||
fi; \
|
||||
T=$((T+1)); sleep 1; done
|
||||
24915 1 seconds 96 certs
|
||||
25011 1 seconds 92 certs
|
||||
25103 1 seconds 93 certs
|
||||
25196 1 seconds 87 certs
|
||||
```
|
||||
|
||||
On the first commandline I'll start the loadtest at 100 writes/sec with the standard duplication
|
||||
probability of 10%, which allows me to test Sunlights ability to avoid writing duplicates. This
|
||||
means I should see on average a growth of the tree at about 90/s. Check. I raise the write-load to
|
||||
500/s:
|
||||
|
||||
```
|
||||
39421 1 seconds 443 certs
|
||||
39864 1 seconds 442 certs
|
||||
40306 1 seconds 441 certs
|
||||
40747 1 seconds 447 certs
|
||||
41194 1 seconds 448 certs
|
||||
```
|
||||
|
||||
.. and to 1'000/s:
|
||||
```
|
||||
57941 1 seconds 945 certs
|
||||
58886 1 seconds 970 certs
|
||||
59856 1 seconds 948 certs
|
||||
60804 1 seconds 965 certs
|
||||
61769 1 seconds 955 certs
|
||||
```
|
||||
|
||||
After a few minutes I see a few errors from CT Hammer:
|
||||
```
|
||||
W0810 14:55:29.660710 1398779 analysis.go:134] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 14:55:30.496603 1398779 analysis.go:124] (1 x) failed to create request: write leaf was not OK. Status code: 500. Body: "failed to read body: read tcp 127.0.1.1:443->127.0.0.1:44908: i/o timeout\n"
|
||||
```
|
||||
|
||||
I raise the Hammer load to 5'000/sec (which means 4'500/s unique certs and 500 duplicates), and find
|
||||
the max committed writes/sec to max out at around 4'200/s:
|
||||
```
|
||||
879637 1 seconds 4213 certs
|
||||
883850 1 seconds 4207 certs
|
||||
888057 1 seconds 4211 certs
|
||||
892268 1 seconds 4249 certs
|
||||
896517 1 seconds 4216 certs
|
||||
```
|
||||
|
||||
The error rate is a steady stream of errors like the one before:
|
||||
```
|
||||
W0810 14:59:48.499274 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 14:59:49.034194 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 15:00:05.496459 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
W0810 15:00:07.187181 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post "https://ctlog-test.lab.ipng.ch/ct/v1/add-chain": EOF
|
||||
```
|
||||
|
||||
At this load of 4'200/s, MinIO is not very impressed. Remember in the [[other article]({{< ref
|
||||
2025-07-26-ctlog-1 >}})] I loadtested it to about 7'500 ops/sec and the statistics below are about
|
||||
50 ops/sec (2'800/min). I conclude that MinIO is, in fact, bored of this whole activity:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ mc admin trace --stats ssd
|
||||
Duration: 18m58s ▱▱▱
|
||||
RX Rate:↑ 115 MiB/m
|
||||
TX Rate:↓ 2.4 MiB/m
|
||||
RPM : 2821.3
|
||||
-------------
|
||||
Call Count RPM Avg Time Min Time Max Time Avg TTFB Max TTFB Avg Size Rate /min Errors
|
||||
s3.PutObject 37602 (70.3%) 1982.2 6.2ms 785µs 86.7ms 6.1ms 86.6ms ↑59K ↓0B ↑115M ↓1.4K 0
|
||||
s3.GetObject 15918 (29.7%) 839.1 996µs 670µs 51.3ms 912µs 51.2ms ↑46B ↓3.0K ↑38K ↓2.4M 0
|
||||
```
|
||||
|
||||
Sunlight still keeps its certificate cache on local disk. At a rate of 4'200/s, the ZFS pool has a
|
||||
write rate of about 105MB/s with about 877 ZFS writes per second.
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ zpool iostat -v ssd-vol0 10
|
||||
capacity operations bandwidth
|
||||
pool alloc free read write read write
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 59.1G 685G 0 2.55K 0 312M
|
||||
mirror-0 59.1G 685G 0 2.55K 0 312M
|
||||
wwn-0x5002538a05302930 - - 0 877 0 104M
|
||||
wwn-0x5002538a053069f0 - - 0 871 0 104M
|
||||
wwn-0x5002538a06313ed0 - - 0 866 0 104M
|
||||
-------------------------- ----- ----- ----- ----- ----- -----
|
||||
|
||||
pim@ctlog-test:/etc/sunlight$ zpool iostat -l ssd-vol0 10
|
||||
capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim
|
||||
pool alloc free read write read write read write read write read write read write wait wait
|
||||
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
|
||||
ssd-vol0 59.0G 685G 0 3.19K 0 388M - 8ms - 628us - 990us - 10ms - 88ms
|
||||
ssd-vol0 59.2G 685G 0 2.49K 0 296M - 5ms - 557us - 163us - 8ms - -
|
||||
ssd-vol0 59.6G 684G 0 2.04K 0 253M - 2ms - 704us - 296us - 4ms - -
|
||||
ssd-vol0 58.8G 685G 0 2.72K 0 328M - 6ms - 783us - 701us - 9ms - 68ms
|
||||
|
||||
```
|
||||
|
||||
A few interesting observations:
|
||||
* Sunlight still uses a local sqlite3 database for the certificate tracking, which is more
|
||||
efficient than MariaDB/MySQL, let alone AWS RDS, so it has one less runtime dependency.
|
||||
* The write rate to ZFS is significantly higher with Sunlight than TesseraCT (about 8:1). This is
|
||||
likely explained because the sqlite3 database lives on ZFS here, while TesseraCT uses MariaDB
|
||||
running on a different filesystem.
|
||||
* The MinIO usage is a lot lighter. As I reduce the load to 1'000/s, as was the case in the TesseraCT
|
||||
test, I can see the ratio of Get:Put was 93:4 in TesseraCT, while it's 70:30 here. TesseraCT as
|
||||
also consuming more IOPS, running at about 10.5k requests/minute, while Sunlight is
|
||||
significantly calmer at 2.8k requests/minute (almost 4x less!)
|
||||
* The burst capacity of Sunlight is a fair bit higher than TesseraCT, likely due to its more
|
||||
efficient use of S3 backends.
|
||||
|
||||
***Conclusion***: Sunlight S3+MinIO can handle 1'000/s reliably, and can spike to 4'200/s with only
|
||||
few errors.
|
||||
|
||||
#### Sunlight: Loadtesting POSIX
|
||||
|
||||
When I took a closer look at TesseraCT a few weeks ago, it struck me that while making a
|
||||
cloud-native setup, with S3 storage would allow for a cool way to enable storage scaling and
|
||||
read-path redundancy, by creating synchronously replicated buckets, it does come at a significant
|
||||
operational overhead and complexity. My main concern is the amount of different moving parts, and
|
||||
Sunlight really has one very appealing property: it can run entirely on one machine without the need
|
||||
for any other moving parts - even the SQL database is linked in. That's pretty slick.
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ cat << EOF > sunlight.yaml
|
||||
listen:
|
||||
- "[::]:443"
|
||||
checkpoints: /ssd-vol0/sunlight-test/shared/checkpoints.db
|
||||
logs:
|
||||
- shortname: sunlight-test
|
||||
inception: 2025-08-10
|
||||
submissionprefix: https://ctlog-test.lab.ipng.ch/
|
||||
monitoringprefix: https://ctlog-test.lab.ipng.ch:1443/
|
||||
secret: /etc/sunlight/sunlight-test.seed.bin
|
||||
cache: /ssd-vol0/sunlight-test/logs/sunlight-test/cache.db
|
||||
localdirectory: /ssd-vol0/sunlight-test/logs/sunlight-test/data
|
||||
roots: /etc/sunlight/roots.pem
|
||||
period: 200
|
||||
poolsize: 15000
|
||||
notafterstart: 2024-01-01T00:00:00Z
|
||||
notafterlimit: 2025-01-01T00:00:00Z
|
||||
EOF
|
||||
pim@ctlog-test:/etc/sunlight$ sunlight -testcert -c sunlight.yaml
|
||||
pim@ctlog-test:/etc/sunlight$ skylight -testcert -c skylight.yaml
|
||||
```
|
||||
|
||||
First I'll start a hello-world loadtest at 100/s and take a look at the number of leaves in the
|
||||
checkpoint after a few minutes, I would expect about three minutes worth at 100/s with a duplicate
|
||||
probability of 10% to yield about 16'200 unique certificates in total.
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
10086
|
||||
15518
|
||||
20920
|
||||
26339
|
||||
```
|
||||
|
||||
And would you look at that? `(26339-10086)` is right on the dot! One thing that I find particularly
|
||||
cool about Sunlight is its baked in Prometheus metrics. This allows me some pretty solid insight on
|
||||
its performance. Take a look for example at the write path latency tail (99th ptile):
|
||||
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 0.207285993
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.001409719
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.002227985
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000224969
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} 8.3003e-05
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.042118751
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 0.2259605
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 0.108987393
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.014922489
|
||||
```
|
||||
|
||||
I'm seeing here that at a load of 100/s (with 90/s of unique certificates), the 99th percentile
|
||||
add-chain latency is 207ms, which makes sense because the `period` configuration field is set to
|
||||
200ms. The filesystem operations (discard, fetch, upload) are _de minimis_ and the sequencing
|
||||
duration is at 109ms. Excellent!
|
||||
|
||||
But can this thing go really fast? I do remember that the CT Hammer uses more CPU than TesseraCT,
|
||||
and I've seen it above also when running my 5'000/s loadtest that's about all the hammer can take on
|
||||
a single Dell R630. So, as I did with the TesseraCT test, I'll use the MinIO SSD and MinIO Disk
|
||||
machines to generate the load.
|
||||
|
||||
I boot them, so that I can hammer, or shall I say jackhammer away:
|
||||
|
||||
```
|
||||
pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000
|
||||
|
||||
pim@minio-ssd:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000 --serial_offset=1000000
|
||||
|
||||
pim@minio-disk:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \
|
||||
--log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \
|
||||
--log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \
|
||||
--max_read_ops=0 --num_writers=5000 --max_write_ops=5000 --serial_offset=2000000
|
||||
```
|
||||
|
||||
This will generate 15'000/s of load, which I note does bring Sunlight to its knees, although it does
|
||||
remain stable (yaay!) with a somewhat more bursty checkpoint interval:
|
||||
|
||||
```
|
||||
5504780 1 seconds 4039 certs
|
||||
5508819 1 seconds 10000 certs
|
||||
5518819 . 2 seconds 7976 certs
|
||||
5526795 1 seconds 2022 certs
|
||||
5528817 1 seconds 9782 certs
|
||||
5538599 1 seconds 217 certs
|
||||
5538816 1 seconds 3114 certs
|
||||
5541930 1 seconds 6818 certs
|
||||
```
|
||||
|
||||
So what I do instead is a somewhat simpler measurement of certificates per minute:
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
6008831
|
||||
6296255
|
||||
6576712
|
||||
```
|
||||
|
||||
This rate boils down to `(6576712-6008831)/120` or 4'700/s of written certs, which at a duplication
|
||||
ratio of 10% means approximately 5'200/s of total accepted certs. This rate, Sunlight is consuming
|
||||
about 10.3 CPUs/s, while Skylight is at 0.1 CPUs/s and the CT Hammer is at 11.1 CPUs/s; Given the 40
|
||||
threads on this machine, I am not saturating the CPU, but I'm curious as this rate is significantly
|
||||
lower than TesseraCT. I briefly turn off the hammer on `ctlog-test` to allow Sunlight to monopolize
|
||||
the entire machine. The CPU use does reduce to about 9.3 CPUs/s suggesting that indeed, the bottleneck
|
||||
is not strictly CPU:
|
||||
|
||||
{{< image width="90%" src="/assets/ctlog/btop-sunlight.png" alt="Sunlight btop" >}}
|
||||
|
||||
When using only two CT Hammers (on `minio-ssd.lab.ipng.ch` and `minio-disk.lab.ipng.ch`), the CPU
|
||||
use on the `ctlog-test.lab.ipng.ch` machine definitely goes down (CT Hammer is kind of a CPU hog....),
|
||||
but the resulting throughput doesn't change that much:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
7985648
|
||||
8302421
|
||||
8528122
|
||||
8772758
|
||||
```
|
||||
|
||||
What I find particularly interesting is that the total rate stays approximately 4'400/s
|
||||
(`(8772758-7985648)/180`), while the checkpoint latency varies considerably. One really cool thing I
|
||||
learned earlier is that Sunlight comes with baked in Prometheus metrics, which I can take a look at
|
||||
while keeping it under this load of ~10'000/sec:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 1.889983538
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.000148819
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.837981208
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000433179
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} NaN
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.067494558
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 1.86894666
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 1.111400223
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.016859223
|
||||
```
|
||||
|
||||
Comparing the throughput at 4'400/s with that first test of 100/s, I expect and can confirm a
|
||||
significant increase in all of these metrics. The 99th percentile addchain is now 1889ms (up from
|
||||
207ms) and the sequencing duration is now 1111ms (up from 109ms).
|
||||
|
||||
#### Sunlight: Effect of period
|
||||
|
||||
I fiddle a little bit with Sunlight's configuration file, notably the `period` and `poolsize`.
|
||||
First I set `period:2000` and `poolsize:15000`, which yields pretty much the same throughput:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
701850
|
||||
1001424
|
||||
1295508
|
||||
1575789
|
||||
```
|
||||
|
||||
With a generated load of 10'000/sec with a 10% duplication rate, I am offering roughly 9'000/sec of
|
||||
unique certificates, and I'm seeing `(1575789 - 701850)/180` or about 4'855/sec come through. Just
|
||||
for reference, at this rate and with `period:2000`, the latency tail looks like this:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 3.203510079
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 0.000108613
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.950453973
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.00046192
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} NaN
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.049007693
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 3.570709413
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 1.5968609040000001
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.010847308
|
||||
```
|
||||
|
||||
Then I also set a `period:100` and `poolsize:15000`, which does improve a bit:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E '^[0-9]+$'; sleep 60; done
|
||||
560654
|
||||
950524
|
||||
1324645
|
||||
1720362
|
||||
```
|
||||
|
||||
With the same generated load of 10'000/sec with a 10% duplication rate, I am still offering roughly
|
||||
9'000/sec of unique certificates, and I'm seeing `(1720362 - 560654)/180` or about 6'440/sec come
|
||||
through, which is a fair bit better, at the expense of more disk activity. At this rate and with
|
||||
`period:100`, the latency tail looks like this:
|
||||
|
||||
```
|
||||
pim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep 'seconds.*quantile=\"0.99\"'
|
||||
sunlight_addchain_wait_seconds{log="sunlight-test",quantile="0.99"} 1.616046445
|
||||
sunlight_cache_get_duration_seconds{log="sunlight-test",quantile="0.99"} 7.5123e-05
|
||||
sunlight_cache_put_duration_seconds{log="sunlight-test",quantile="0.99"} 0.534935803
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="discard",quantile="0.99"} 0.000377273
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="fetch",quantile="0.99"} 4.8893e-05
|
||||
sunlight_fs_op_duration_seconds{log="sunlight-test",method="upload",quantile="0.99"} 0.054685991
|
||||
sunlight_http_request_duration_seconds{endpoint="add-chain",log="sunlight-test",quantile="0.99"} 1.946445877
|
||||
sunlight_sequencing_duration_seconds{log="sunlight-test",quantile="0.99"} 0.980602185
|
||||
sunlight_sqlite_update_duration_seconds{quantile="0.99"} 0.018385831
|
||||
```
|
||||
|
||||
***Conclusion***: Sunlight on POSIX can reliably handle 4'400/s (with a duplicate rate of 10%) on
|
||||
this setup.
|
||||
|
||||
## Wrapup - Observations
|
||||
|
||||
From an operators point of view, TesseraCT and Sunlight handle quite differently. Both are easily up
|
||||
to the task of serving the current write-load (which is about 250/s).
|
||||
|
||||
* ***S3***: When using the S3 backend, TesseraCT became quite unhappy above 800/s while Sunlight
|
||||
went all the way up to 4'200/s and sent significantly less requests to MinIO (about 4x less),
|
||||
while showing good telemetry on the use of S3 backends. In this mode, TesseraCT uses MySQL (in
|
||||
my case, MariaDB) which was not on the ZFS pool, but on the boot-disk.
|
||||
|
||||
* ***POSIX***: When using normal filesystem, Sunlight seems to peak at 4'800/s while TesseraCT
|
||||
went all the way to 12'000/s. When doing so, Disk IO was quite similar between the two
|
||||
solutions, taking into account that TesseraCT runs BadgerDB, while Sunlight uses sqlite3,
|
||||
both are using their respective ZFS pool.
|
||||
|
||||
***Notable***: Sunlight POSIX and S3 performance is roughly identical (both handle about
|
||||
5'000/sec), while TesseraCT POSIX performance (12'000/s) is significantly better than its S3
|
||||
(800/s). Some other observations:
|
||||
|
||||
* Sunlight has a very opinionated configuration, and can run multiple logs with one configuration
|
||||
file and one binary. Its configuration was a bit constraining though, as I could not manage to
|
||||
use `monitoringprefix` or `submissionprefix` with `http://` prefix - a likely security
|
||||
precaution - but also using ports in those prefixes (other than the standard 443) rendered
|
||||
Sunlight and Skylight unusable for me.
|
||||
|
||||
* Skylight only serves from local directory, it does not have support for S3. For operators using S3,
|
||||
an alternative could be to use NGINX in the serving path, similar to TesseraCT. Skylight does have
|
||||
a few things to teach me though, notably on proper compression, content type and other headers.
|
||||
|
||||
* TesseraCT does not have a configuration file, and will run exactly one log per binary
|
||||
instance. It uses flags to construct the environment, and is much more forgiving for creative
|
||||
`origin` (log name), and submission- and monitoring URLs. It's happy to use regular 'http://'
|
||||
for both, which comes in handy in those architectures where the system is serving behind a
|
||||
reversed proxy.
|
||||
|
||||
* The TesseraCT Hammer tool then again does not like using self-signed certificates, and needs
|
||||
to be told to skip certificate validation in the case of Sunlight loadtests while it is
|
||||
running with the `-testcert` commandline.
|
||||
|
||||
I consider all of these small and mostly cosmetic issues, because in production there will be proper
|
||||
TLS certificates issued and normal https:// serving ports with unique monitoring and submission
|
||||
hostnames.
|
||||
|
||||
## What's Next
|
||||
|
||||
Together with Antonis Chariton and Jeroen Massar, IPng Networks will be offering both TesseraCT and
|
||||
Sunlight logs on the public internet. One final step is to productionize both logs, and file the
|
||||
paperwork for them in the community. Although at this point our Sunlight log is already running,
|
||||
I'll wait a few weeks to gather any additional intel, before wrapping up in a final article.
|
||||
|
||||
515
content/articles/2025-08-24-ctlog-3.md
Normal file
515
content/articles/2025-08-24-ctlog-3.md
Normal file
@@ -0,0 +1,515 @@
|
||||
---
|
||||
date: "2025-08-24T12:07:23Z"
|
||||
title: 'Certificate Transparency - Part 3 - Operations'
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
There once was a Dutch company called [[DigiNotar](https://en.wikipedia.org/wiki/DigiNotar)], as the
|
||||
name suggests it was a form of _digital notary_, and they were in the business of issuing security
|
||||
certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and
|
||||
subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for
|
||||
man-in-the-middle attacks on Iranian Gmail users. Not cool.
|
||||
|
||||
Google launched a project called **Certificate Transparency**, because it was becoming more common
|
||||
that the root of trust given to _Certification Authorities_ could no longer be unilaterally trusted.
|
||||
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to
|
||||
the Web Public Key Infrastructure. It led to the creation of this ambitious
|
||||
[[project](https://certificate.transparency.dev/)] to improve security online by bringing
|
||||
accountability to the system that protects our online services with _SSL_ (Secure Socket Layer)
|
||||
and _TLS_ (Transport Layer Security).
|
||||
|
||||
In 2013, [[RFC 6962](https://datatracker.ietf.org/doc/html/rfc6962)] was published by the IETF. It
|
||||
describes an experimental protocol for publicly logging the existence of Transport Layer Security
|
||||
(TLS) certificates as they are issued or observed, in a manner that allows anyone to audit
|
||||
certificate authority (CA) activity and notice the issuance of suspect certificates as well as to
|
||||
audit the certificate logs themselves. The intent is that eventually clients would refuse to honor
|
||||
certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to
|
||||
the logs.
|
||||
|
||||
In the first two articles of this series, I explored [[Sunlight]({{< ref 2025-07-26-ctlog-1 >}})]
|
||||
and [[TesseraCT]({{< ref 2025-08-10-ctlog-2 >}})], two open source implementations of the Static CT
|
||||
protocol. In this final article, I'll share the details on how I created the environment and
|
||||
production instances for four logs that IPng will be providing: Rennet and Lipase are two
|
||||
ingredients to make cheese and will serve as our staging/testing logs. Gouda and Halloumi are two
|
||||
delicious cheeses that pay homage to our heritage, Jeroen and I being Dutch and Antonis being
|
||||
Greek.
|
||||
|
||||
## Hardware
|
||||
|
||||
At IPng Networks, all hypervisors are from the same brand: Dell's Poweredge line. In this project,
|
||||
Jeroen is also contributing a server, and it so happens that he also has a Dell Poweredge. We're
|
||||
both running Debian on our hypervisor, so we install a fresh VM with Debian 13.0, codenamed
|
||||
_Trixie_, and give the machine 16GB of memory, 8 vCPU and a 16GB boot disk. Boot disks are placed on
|
||||
the hypervisor's ZFS pool, and a blockdevice snapshot is taken every 6hrs. This allows the boot disk
|
||||
to be rolled back to a last known good point in case an upgrade goes south. If you haven't seen it
|
||||
yet, take a look at [[zrepl](https://zrepl.github.io/)], a one-stop, integrated solution for ZFS
|
||||
replication. This tool is incredibly powerful, and can do snapshot management, sourcing / sinking
|
||||
to remote hosts, of course using incremental snapshots as they are native to ZFS.
|
||||
|
||||
Once the machine is up, we pass four enterprise-class storage drives, in our case 3.84TB Kioxia
|
||||
NVMe, model _KXD51RUE3T84_ which are PCIe 3.1 x4 lanes, and NVMe 1.2.1 specification with a good
|
||||
durability and reasonable (albeit not stellar) read throughput of ~2700MB/s, write throughput of
|
||||
~800MB/s with 240 kIOPS random read and 21 kIOPS random write. My attention is also drawn to a
|
||||
specific specification point: these drives allow for 1.0 DWPD, which stands for _Drive Writes Per
|
||||
Day_, in other words they are not going to run themselves off a cliff after a few petabytes of
|
||||
writes, and I am reminded that a CT Log wants to write to disk a lot during normal operation.
|
||||
|
||||
The point of these logs is to **keep them safe**, and the most important aspects of the compute
|
||||
environment are the use of ECC memory to detect single bit errors, and dependable storage. Toshiba
|
||||
makes a great product.
|
||||
|
||||
```
|
||||
ctlog1:~$ sudo zpool create -f -o ashift=12 -o autotrim=on -O atime=off -O xattr=sa \
|
||||
ssd-vol0 raidz2 /dev/disk/by-id/nvme-KXD51RUE3T84_TOSHIBA_*M
|
||||
ctlog1:~$ sudo zfs create -o encryption=on -o keyformat=passphrase ssd-vol0/enc
|
||||
ctlog1:~$ sudo zfs create ssd-vol0/logs
|
||||
ctlog1:~$ for log in lipase; do \
|
||||
for shard in 2025h2 2026h1 2026h2 2027h1 2027h2; do \
|
||||
sudo zfs create ssd-vol0/logs/${log}${shard} \
|
||||
done \
|
||||
done
|
||||
```
|
||||
|
||||
The hypervisor will use PCI passthrough for the NVMe drives, and we'll handle ZFS directly on the
|
||||
VM. The first command creates a ZFS raidz2 pool using 4kB blocks, turns of _atime_ (which avoids one
|
||||
metadata write for each read!), and turns on SSD trimming in ZFS, a very useful feature.
|
||||
|
||||
Then I'll create an encrypted volume for the configuration and key material. This way, if the
|
||||
machine is ever physically transported, the keys will be safe in transit. Finally, I'll create the
|
||||
temporal log shards starting at 2025h2, all the way through to 2027h2 for our testing log called
|
||||
_Lipase_ and our production log called _Halloumi_ on Jeroen's machine. On my own machine, it'll be
|
||||
_Rennet_ for the testing log and _Gouda_ for the production log.
|
||||
|
||||
## Sunlight

{{< image width="10em" float="right" src="/assets/ctlog/sunlight-logo.png" alt="Sunlight logo" >}}

I set up Sunlight first, as its authors have extensive operational notes, both in terms of the
[[config](https://config.sunlight.geomys.org/)] of Geomys' _Tuscolo_ log, as well as on the
[[Sunlight](https://sunlight.dev)] homepage. I really appreciate that Filippo added some
[[Gists](https://gist.github.com/FiloSottile/989338e6ba8e03f2c699590ce83f537b)] and
[[Doc](https://docs.google.com/document/d/1ID8dX5VuvvrgJrM0Re-jt6Wjhx1eZp-trbpSIYtOhRE/edit?tab=t.0#heading=h.y3yghdo4mdij)]
with pretty much all I need to know to run one too. Our Rennet and Gouda logs use a very similar
approach for their configuration, with one notable exception: the VMs do not have a public IP
address, and are tucked away in a private network called IPng Site Local. I'll get back to that
later.

```
ctlog@ctlog0:/ssd-vol0/enc/sunlight$ cat << EOF | tee sunlight-staging.yaml
listen:
  - "[::]:16420"
checkpoints: /ssd-vol0/shared/checkpoints.db
logs:
  - shortname: rennet2025h2
    inception: 2025-07-28
    period: 200
    poolsize: 750
    submissionprefix: https://rennet2025h2.log.ct.ipng.ch
    monitoringprefix: https://rennet2025h2.mon.ct.ipng.ch
    ccadbroots: testing
    extraroots: /ssd-vol0/enc/sunlight/extra-roots-staging.pem
    secret: /ssd-vol0/enc/sunlight/keys/rennet2025h2.seed.bin
    cache: /ssd-vol0/logs/rennet2025h2/cache.db
    localdirectory: /ssd-vol0/logs/rennet2025h2/data
    notafterstart: 2025-07-01T00:00:00Z
    notafterlimit: 2026-01-01T00:00:00Z
...
EOF
ctlog@ctlog0:/ssd-vol0/enc/sunlight$ cat << EOF | tee skylight-staging.yaml
listen:
  - "[::]:16421"
homeredirect: https://ipng.ch/s/ct/
logs:
  - shortname: rennet2025h2
    monitoringprefix: https://rennet2025h2.mon.ct.ipng.ch
    localdirectory: /ssd-vol0/logs/rennet2025h2/data
    staging: true
...
EOF
```

In the first configuration file, I'll tell _Sunlight_ (the write-path component) to listen on port
`:16420`, and I'll tell _Skylight_ (the read-path component) to listen on port `:16421`. I've disabled
the automatic certificate renewals, and will handle SSL upstream. A few notes on this:

1. Most importantly, I will be using a common frontend pool with a wildcard certificate for
   `*.ct.ipng.ch`. I wrote about [[DNS-01]({{< ref 2023-03-24-lego-dns01 >}})] before; it's a very
   convenient way for IPng to do certificate pool management. I will be sharing this one certificate
   across all log types (a sketch of the renewal follows below this list).
1. ACME/HTTP-01 could be made to work with a bit of effort, by plumbing the `/.well-known/`
   URIs through on the frontend and pointing them at these instances. But then the cert would have to
   be copied from Sunlight back to the frontends.
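For reference, the DNS-01 renewal boils down to something like the following. The DNS provider, the
account e-mail and the credential handling here are placeholders, not IPng's actual setup:

```
# Hypothetical lego invocation for the wildcard certificate; provider name,
# e-mail and the API token in the environment are illustrative only.
lego@lego:~$ CLOUDFLARE_DNS_API_TOKEN=... lego --accept-tos --email noc@ipng.ch \
    --dns cloudflare --domains '*.ct.ipng.ch' --path /etc/lego run
```
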

I've noticed that when the log doesn't exist yet, I can start Sunlight and it'll create the bits and
pieces on the local filesystem and start writing checkpoints. But if the log already exists, I am
required to have the _monitoringprefix_ active, otherwise Sunlight won't start up. It's a small
thing, as I will have the read path operational in a few simple steps. Anyway, all five log shards
for Rennet, and a few days later for Gouda, are operational this way.

Skylight provides all the things I need to serve the data back, which is a huge help. The [[Static
Log Spec](https://github.com/C2SP/C2SP/blob/main/static-ct-api.md)] is very clear on things like
compression, content-type, cache-control and other headers. Skylight makes this a breeze, as it reads
a configuration file very similar to the Sunlight write-path one, and takes care of it all for me.

## TesseraCT

{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="TesseraCT logo" >}}

Good news came to our community on August 14th, when Google's TrustFabric team announced their Alpha
milestone of [[TesseraCT](https://blog.transparency.dev/introducing-tesseract)]. This release
also promoted the POSIX variant out of experimental, alongside the already further-along GCP and AWS
personalities. After playing around with it with Al and the team, I think I've learned enough to get
us going with a public `tesseract-posix` instance.

One thing I liked about Sunlight is its compact YAML file that describes the pertinent bits of the
system, and that I can serve any number of logs with the same process. TesseraCT, on the other hand,
can serve only one log per process. Both approaches have pros and cons: notably, if a poisonous
submission were offered, Sunlight might take down all logs, while TesseraCT would only take down the
log receiving the offensive submission. On the other hand, maintaining separate processes is
cumbersome, and all log instances need to be meticulously configured.


### TesseraCT genconf

I decide to automate this by vibing a little tool called `tesseract-genconf`, which I've published on
[[Gitea](https://git.ipng.ch/certificate-transparency/cheese)]. What it does is take a YAML file
describing the logs, and output the bits and pieces needed to operate multiple separate processes
that together form the sharded static log. I've attempted to stay mostly compatible with the
Sunlight YAML configuration, and came up with a variant like this one:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat << EOF | tee tesseract-staging.yaml
listen:
  - "[::]:8080"
roots: /ssd-vol0/enc/tesseract/roots.pem
logs:
  - shortname: lipase2025h2
    listen: "[::]:16900"
    submissionprefix: https://lipase2025h2.log.ct.ipng.ch
    monitoringprefix: https://lipase2025h2.mon.ct.ipng.ch
    extraroots: /ssd-vol0/enc/tesseract/extra-roots-staging.pem
    secret: /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
    localdirectory: /ssd-vol0/logs/lipase2025h2/data
    notafterstart: 2025-07-01T00:00:00Z
    notafterlimit: 2026-01-01T00:00:00Z
...
EOF
```

With this snippet, I have all the information I need. Here are the steps I take to construct the log
itself:

***1. Generate keys***

The keys are `prime256v1`, and the format that TesseraCT accepts has changed since I wrote up my first
[[deep dive]({{< ref 2025-07-26-ctlog-1 >}})] a few weeks ago. Now, the tool accepts a `PEM` format
private key, from which the _Log ID_ and _Public Key_ can be derived. So off I go:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-key
Creating /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2026h1.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2026h2.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2027h1.pem
Creating /ssd-vol0/enc/tesseract/keys/lipase2027h2.pem
```

Of course, if a file already exists at that location, it'll just print a warning like:
```
Key already exists: /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem (skipped)
```

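For the curious, the key handling itself is nothing exotic. A hedged sketch of what `gen-key` amounts
to, using plain `openssl` (the Log ID is the SHA-256 hash of the DER-encoded public key, per RFC 6962;
the exact internals of `gen-key` may differ):

```
# Generate a prime256v1 (P-256) private key in PEM format -- what gen-key writes out.
ctlog1:~$ openssl ecparam -name prime256v1 -genkey -noout -out lipase2025h2.pem

# Derive the public key, and the Log ID as the base64'd SHA-256 over the DER-encoded SPKI.
ctlog1:~$ openssl ec -in lipase2025h2.pem -pubout -out lipase2025h2.pub.pem
ctlog1:~$ openssl ec -in lipase2025h2.pem -pubout -outform DER | openssl dgst -sha256 -binary | base64
```
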
***2. Generate JSON/HTML***

I will be operating the read-path with NGINX. Log operators have started speaking about their log
metadata in terms of a small JSON file called `log.v3.json`, and Skylight does a good job of
exposing that one, alongside all the other pertinent metadata. So I'll generate these files for each
of the logs:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-html
Creating /ssd-vol0/logs/lipase2025h2/data/index.html
Creating /ssd-vol0/logs/lipase2025h2/data/log.v3.json
Creating /ssd-vol0/logs/lipase2026h1/data/index.html
Creating /ssd-vol0/logs/lipase2026h1/data/log.v3.json
Creating /ssd-vol0/logs/lipase2026h2/data/index.html
Creating /ssd-vol0/logs/lipase2026h2/data/log.v3.json
Creating /ssd-vol0/logs/lipase2027h1/data/index.html
Creating /ssd-vol0/logs/lipase2027h1/data/log.v3.json
Creating /ssd-vol0/logs/lipase2027h2/data/index.html
Creating /ssd-vol0/logs/lipase2027h2/data/log.v3.json
```

{{< image width="60%" src="/assets/ctlog/lipase.png" alt="TesseraCT Lipase Log" >}}

It's nice to see a familiar look-and-feel for these logs appear in those `index.html` pages, which
all cross-link to each other within the logs specified in `tesseract-staging.yaml`. Which is dope.

***3. Generate Roots***

Antonis had seen this before (thanks for the explanation!): TesseraCT does not natively implement
fetching of the [[CCADB](https://www.ccadb.org/)] roots. But, he points out, you can just get them
from any other running log instance, so I'll implement a `gen-roots` command:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf gen-roots \
    --source https://tuscolo2027h1.sunlight.geomys.org --output production-roots.pem
Fetching roots from: https://tuscolo2027h1.sunlight.geomys.org/ct/v1/get-roots
2025/08/25 08:24:58 Warning: Failed to parse certificate,carefully skipping: x509: negative serial number
Creating production-roots.pem
Successfully wrote 248 certificates to tusc.pem (out of 249 total)

ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf gen-roots \
    --source https://navigli2027h1.sunlight.geomys.org --output testing-roots.pem
Fetching roots from: https://navigli2027h1.sunlight.geomys.org/ct/v1/get-roots
Creating testing-roots.pem
Successfully wrote 82 certificates to tusc.pem (out of 82 total)
```

I can do this regularly, say daily, in a cronjob, and restart the TesseraCT processes if the files
change. It's not ideal (because the restart might be briefly disruptive), but it's a reasonable
option for the time being.

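Under the hood there's not much to it: RFC 6962's `get-roots` endpoint returns a JSON object with a
`certificates` array of base64-encoded DER certificates. A minimal shell sketch of the same
conversion, assuming `jq` is available (this is not the actual `gen-roots` implementation):

```
curl -s https://tuscolo2027h1.sunlight.geomys.org/ct/v1/get-roots \
  | jq -r '.certificates[]' \
  | while read -r der; do
      # Wrap each base64 DER blob in PEM armor, 64 characters per line.
      echo "-----BEGIN CERTIFICATE-----"
      echo "${der}" | fold -w 64
      echo "-----END CERTIFICATE-----"
    done > production-roots.pem
```
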
***4. Generate TesseraCT cmdline***

I will be running TesseraCT as a _templated unit_ in systemd. These are unit files that take an
argument; they have an `@` in their name, like so:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat << EOF | sudo tee /lib/systemd/system/tesseract@.service
[Unit]
Description=Tesseract CT Log service for %i
ConditionFileExists=/ssd-vol0/logs/%i/data/.env
After=network.target

[Service]
# The %i here refers to the instance name, e.g., "lipase2025h2"
# This path should point to where your instance-specific .env files are located
EnvironmentFile=/ssd-vol0/logs/%i/data/.env
ExecStart=/home/ctlog/bin/tesseract-posix $TESSERACT_ARGS
User=ctlog
Group=ctlog
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
```

I can now implement a `gen-env` command for my tool:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-env
Creating /ssd-vol0/logs/lipase2025h2/data/roots.pem
Creating /ssd-vol0/logs/lipase2025h2/data/.env
Creating /ssd-vol0/logs/lipase2026h1/data/roots.pem
Creating /ssd-vol0/logs/lipase2026h1/data/.env
Creating /ssd-vol0/logs/lipase2026h2/data/roots.pem
Creating /ssd-vol0/logs/lipase2026h2/data/.env
Creating /ssd-vol0/logs/lipase2027h1/data/roots.pem
Creating /ssd-vol0/logs/lipase2027h1/data/.env
Creating /ssd-vol0/logs/lipase2027h2/data/roots.pem
Creating /ssd-vol0/logs/lipase2027h2/data/.env
```

Looking at one of those `.env` files, I can show the exact command line I'll be feeding to the
`tesseract-posix` binary:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat /ssd-vol0/logs/lipase2025h2/data/.env
TESSERACT_ARGS="--private_key=/ssd-vol0/enc/tesseract/keys/lipase2025h2.pem
--origin=lipase2025h2.log.ct.ipng.ch --storage_dir=/ssd-vol0/logs/lipase2025h2/data
--roots_pem_file=/ssd-vol0/logs/lipase2025h2/data/roots.pem --http_endpoint=[::]:16900
--not_after_start=2025-07-01T00:00:00Z --not_after_limit=2026-01-01T00:00:00Z"
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```

{{< image width="7em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
A quick operational note on OpenTelemetry (also often referred to as OTel): Al and the TrustFabric
team added OpenTelemetry to the TesseraCT personalities, as it was mostly already implemented in
the underlying Tessera library. By default, it'll try to send its telemetry to localhost using
`https`, which makes sense in those cases where the collector is on a different machine. In my case,
I'll keep `otelcol` (the collector) on the same machine. Its job is to consume the OTel telemetry
stream, and turn it back into a Prometheus `/metrics` endpoint on port `:9464`.

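The collector side of that is a small configuration file. The exact file I run isn't shown here, but
a minimal `otelcol` config that receives OTLP over HTTP on `:4318` and re-exposes Prometheus metrics
on `:9464` would look roughly like this (depending on the distribution, the `prometheus` exporter may
require the contrib build):

```
receivers:
  otlp:
    protocols:
      http:
        endpoint: "localhost:4318"

exporters:
  prometheus:
    # Must be reachable from the frontends, so not bound to localhost only.
    endpoint: "0.0.0.0:9464"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```
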
The `gen-env` command also assembles the per-instance `roots.pem` file. It takes the file pointed to
by the `roots:` key, and appends any per-log `extraroots:` files. For me, these extraroots are empty,
and the main roots file points at either the testing roots that came from _Rennet_ (our Sunlight
staging log), or the production roots that came from _Gouda_. A job well done!

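With the unit file and the per-shard `.env` files in place, bringing the shards up is one `systemctl`
invocation per instance, for example:

```
ctlog@ctlog1:~$ for shard in 2025h2 2026h1 2026h2 2027h1 2027h2; do \
    sudo systemctl enable --now tesseract@lipase${shard}; \
  done
ctlog@ctlog1:~$ systemctl status tesseract@lipase2025h2 --no-pager
```
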
***5. Generate NGINX***

When I first ran my tests, I noticed that the log check tool called `ct-fsck` threw errors on my
read path. Filippo explained that the HTTP headers matter in the Static CT specification: tiles,
issuers, and checkpoints must all have specific caching and content-type headers set. This is what
makes Skylight such a gem - I get to read it (and the spec!) to see what I'm supposed to be serving.

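A quick way to see whether a read path is doing the right thing is to look at the headers a live
monitoring host returns for the checkpoint and for a tile (the paths below follow the static-ct-api
layout; the values to expect are in the spec linked above):

```
# Inspect response headers on the read path; hostname is one of the Lipase shards.
curl -sI https://lipase2025h2.mon.ct.ipng.ch/checkpoint \
  | grep -iE '^(content-type|cache-control)'
curl -sI https://lipase2025h2.mon.ct.ipng.ch/tile/0/000 \
  | grep -iE '^(content-type|cache-control|content-encoding)'
```
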
And thus, the `gen-nginx` command is born; the vhost configs it writes take care of those headers
and listen on port `:8080` for read-path requests:

```
ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-nginx
Creating nginx config: /ssd-vol0/logs/lipase2025h2/data/lipase2025h2.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2026h1/data/lipase2026h1.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2026h2/data/lipase2026h2.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2027h1/data/lipase2027h1.mon.ct.ipng.ch.conf
Creating nginx config: /ssd-vol0/logs/lipase2027h2/data/lipase2027h2.mon.ct.ipng.ch.conf
```

All that's left for me to do is symlink these from `/etc/nginx/sites-enabled/`, and the read-path is
off to the races. With these commands in the `tesseract-genconf` tool, I am hoping that future
travelers have an easy time setting up their static log. Please let me know if you'd like to use, or
contribute to, the tool. You can find me in the Transparency Dev Slack, in #ct and also #cheese.

## IPng Frontends

{{< image width="18em" float="right" src="/assets/ctlog/MPLS Backbone - CTLog.svg" alt="ctlog at ipng" >}}

IPng Networks has a private internal network called [[IPng Site Local]({{< ref 2023-03-11-mpls-core
>}})], which is not routed on the internet. Our [[Frontends]({{< ref 2023-03-17-ipng-frontends >}})]
are the only things that have public IPv4 and IPv6 addresses. This allows for things like anycasted
webservers and loadbalancing with
[[Maglev](https://research.google/pubs/maglev-a-fast-and-reliable-software-network-load-balancer/)].

The IPng Site Local network kind of looks like the picture to the right. The hypervisors running the
Sunlight and TesseraCT logs are at NTT Zurich1 in Rümlang, Switzerland. The IPng frontends are
in green, and the sweet thing is, some of them run in IPng's own ISP network (AS8298), while others
run in partner networks (like IP-Max AS25091 and Coloclue AS8283). This means that I will benefit
from some pretty solid connectivity redundancy.

The frontends are provisioned with Ansible. There are two aspects to them. Firstly, a _certbot_
instance maintains the Let's Encrypt wildcard certificates for `*.ct.ipng.ch`: there's a machine
tucked away somewhere called `lego.net.ipng.ch` -- again, not exposed on the internet -- and its job
is to renew certificates and copy them to the machines that need them. Next, a cluster of NGINX
servers uses these certificates to expose IPng and customer services to the Internet.

I can tie it all together with a snippet like so, for which I apologize in advance - it's quite a
wall of text:

```
map $http_user_agent $no_cache_ctlog_lipase {
    "~*TesseraCT fsck"  1;
    default             0;
}

server {
    listen [::]:443 ssl http2;
    listen 0.0.0.0:443 ssl http2;
    ssl_certificate /etc/certs/ct.ipng.ch/fullchain.pem;
    ssl_certificate_key /etc/certs/ct.ipng.ch/privkey.pem;
    include /etc/nginx/conf.d/options-ssl-nginx.inc;
    ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;

    server_name lipase2025h2.log.ct.ipng.ch;
    access_log /nginx/logs/lipase2025h2.log.ct.ipng.ch-access.log upstream buffer=512k flush=5s;
    include /etc/nginx/conf.d/ipng-headers.inc;

    location = / {
        proxy_http_version 1.1;
        proxy_set_header Host lipase2025h2.mon.ct.ipng.ch;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_pass http://ctlog1.net.ipng.ch:8080/index.html;
    }

    location = /metrics {
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_pass http://ctlog1.net.ipng.ch:9464;
    }

    location / {
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_pass http://ctlog1.net.ipng.ch:16900;
    }
}

server {
    listen [::]:443 ssl http2;
    listen 0.0.0.0:443 ssl http2;
    ssl_certificate /etc/certs/ct.ipng.ch/fullchain.pem;
    ssl_certificate_key /etc/certs/ct.ipng.ch/privkey.pem;
    include /etc/nginx/conf.d/options-ssl-nginx.inc;
    ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc;

    server_name lipase2025h2.mon.ct.ipng.ch;
    access_log /nginx/logs/lipase2025h2.mon.ct.ipng.ch-access.log upstream buffer=512k flush=5s;
    include /etc/nginx/conf.d/ipng-headers.inc;

    location = /checkpoint {
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        proxy_pass http://ctlog1.net.ipng.ch:8080;
    }

    location / {
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        include /etc/nginx/conf.d/ipng-upstream-headers.inc;
        proxy_cache ipng_cache;
        proxy_cache_key "$scheme://$host$request_uri";
        proxy_cache_valid 200 24h;
        proxy_cache_revalidate off;
        proxy_cache_bypass $no_cache_ctlog_lipase;
        proxy_no_cache $no_cache_ctlog_lipase;

        proxy_pass http://ctlog1.net.ipng.ch:8080;
    }
}
```

Taking _Lipase_ shard 2025h2 as an example: the submission path (on `*.log.ct.ipng.ch`) will show
the same `index.html` as the monitoring path (on `*.mon.ct.ipng.ch`), to provide some consistency
with Sunlight logs. Otherwise, the `/metrics` endpoint is forwarded to the `otelcol` collector
listening on port `:9464`, and the rest (`/ct/v1/` and so on) is sent to port `:16900`, where this
shard's TesseraCT process listens.

Then, the read path makes a special case of the `/checkpoint` endpoint, which it does not cache. That
request (like all others) is forwarded to port `:8080`, which is where NGINX is running. Other
requests (notably `/tile` and `/issuer`) are cacheable, so I'll cache these on the upstream NGINX
servers, both for resilience and for performance. Having four of these NGINX upstreams will allow
the Static CT logs (regardless of whether they run Sunlight or TesseraCT) to serve very high read
rates.

## What's Next

I need to spend a little bit of time thinking about rate limits, specifically write rate limits. I
think I'll use a request limiter in upstream NGINX, allowing each IP (or each /24 or /48 subnet) to
send only a fixed number of requests per second. I'll probably keep that part private though, as
it's a good rule of thumb to never offer information to attackers.

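For illustration only (the zone name, rate and burst below are placeholders, not what IPng will
actually deploy), the NGINX building blocks for this are `limit_req_zone` and `limit_req`; grouping
clients by /24 or /48 would additionally need a `map` on the client address:

```
# In the http{} block: one shared-memory zone keyed on the client address.
limit_req_zone $binary_remote_addr zone=ctlog_write:10m rate=5r/s;

# In the submission vhost: apply the limiter to the write path only.
location /ct/v1/ {
    limit_req zone=ctlog_write burst=20 nodelay;
    proxy_pass http://ctlog1.net.ipng.ch:16900;
}
```
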

Together with Antonis Chariton and Jeroen Massar, IPng Networks will be offering both TesseraCT and
Sunlight logs on the public internet. One final step is to productionize both logs and file the
paperwork for them in the community. At this point our Sunlight log has been running for a month or
so, and we've filed the paperwork for it to be included at Apple and Google.

I'm going to have folks poke at _Lipase_ as well, after which I'll run a few `ct-fsck` passes to
make sure the logs are sane, before offering them into the inclusion program as well. Wish us luck!

73
content/ctlog.md
Normal file
73
content/ctlog.md
Normal file
@@ -0,0 +1,73 @@
|
||||
---
|
||||
title: 'Certificate Transparency'
|
||||
date: 2025-07-30
|
||||
url: /s/ct
|
||||
---
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/ctlog-logo-ipng.png" alt="ctlog logo" >}}
|
||||
|
||||
Certificate Transparency logs are "append-only" and publicly-auditable ledgers of certificates being
|
||||
created, updated, and expired. This is the homepage for IPng Networks' Certificate Transparency
|
||||
project.
|
||||
|
||||
Certificate Transparency [[CT](https://certificate.transparency.dev)] is a system for logging and
|
||||
monitoring certificate issuance. It greatly enhances everyone’s ability to monitor and study
|
||||
certificate issuance, and these capabilities have led to numerous improvements to the CA ecosystem
|
||||
and Web security. As a result, it is rapidly becoming critical Internet infrastructure. Originally
|
||||
developed by Google, the concept is now being adopted by many _Certification Authorities_ who log
|
||||
their certificates, and professional _Monitoring_ companies who observe the certificates and
|
||||
report anomalies.
|
||||
|
||||
IPng Networks runs our logs under the domain `ct.ipng.ch`, split into a `*.log.ct.ipng.ch` for the
|
||||
write-path, and `*.mon.ct.ipng.ch` for the read-path.
|
||||
|
||||
We are submitting our log for inclusion in the approved log lists for Google Chrome and Apple
|
||||
Safari. Following 90 days of successful monitoring, we anticipate our log will be added to these
|
||||
trusted lists and that change will propagate to people’s browsers with subsequent browser version
|
||||
releases.
|
||||
|
||||
We operate two popular implementations of Static Certificate Transparency software.
|
||||
|
||||
## Sunlight
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/sunlight-logo.png" alt="sunlight logo" >}}
|
||||
|
||||
[[Sunlight](https://sunlight.dev)] was designed by Filippo Valsorda for the needs of the WebPKI
|
||||
community, through the feedback of many of its members, and in particular of the Sigsum, Google
|
||||
TrustFabric, and ISRG teams. It is partially based on the Go Checksum Database. Sunlight's
|
||||
development was sponsored by Let's Encrypt.
|
||||
|
||||
Our Sunlight logs:
|
||||
* A staging log called [[Rennet](https://rennet2025h2.log.ct.ipng.ch/)], incepted 2025-07-28,
|
||||
starting from temporal shard `rennet2025h2`.
|
||||
* A production log called [[Gouda](https://gouda2025h2.log.ct.ipng.ch/)], incepted 2025-07-30,
|
||||
starting from temporal shard `gouda2025h2`.
|
||||
|
||||
## TesseraCT
|
||||
|
||||
{{< image width="10em" float="right" src="/assets/ctlog/tesseract-logo.png" alt="tesseract logo" >}}
|
||||
|
||||
[[TesseraCT](https://github.com/transparency-dev/tesseract)] is a Certificate Transparency (CT) log
|
||||
implementation by the TrustFabric team at Google. It was built to allow log operators to run
|
||||
production static-ct-api CT logs starting with temporal shards covering 2026 onwards, as the
|
||||
successor to Trillian's CTFE.
|
||||
|
||||
Our TesseraCT logs:
|
||||
* A staging log called [[Lipase](https://lipase2025h2.log.ct.ipng.ch/)], incepted 2025-08-22,
|
||||
starting from temporal shard `lipase2025h2`.
|
||||
* A production log called [[Halloumi](https://halloumi2025h2.log.ct.ipng.ch/)], incepted 2025-08-24,
|
||||
starting from temporal shard `halloumi2025h2`.
|
||||
* Shard `halloumi2026h2` incorporated incorrect data into its Merkle Tree at entry 4357956 and
|
||||
4552365, due to a [[TesseraCT bug](https://github.com/transparency-dev/tesseract/issues/553)]
|
||||
and was retired on 2025-09-08, to be replaced by temporal shard `halloumi2026h2a`.
|
||||
|
||||
## Operational Details
|
||||
|
||||
You can read more details about our infrastructure on:
|
||||
* **[[TesseraCT]({{< ref 2025-07-26-ctlog-1 >}})]** - published on 2025-07-26.
|
||||
* **[[Sunlight]({{< ref 2025-08-10-ctlog-2 >}})]** - published on 2025-08-10.
|
||||
* **[[Operations]({{< ref 2025-08-24-ctlog-3 >}})]** - published on 2025-08-24.
|
||||
|
||||
The operators of this infrastructure are **Antonis Chariton**, **Jeroen Massar** and **Pim van Pelt**. \
|
||||
You can reach us via e-mail at [[<ct-ops@ipng.ch>](mailto:ct-ops@ipng.ch)].
|
||||
|
||||
36
hugo.toml
36
hugo.toml
@@ -1,36 +0,0 @@
|
||||
baseURL = 'https://ipng.ch/'
|
||||
languageCode = 'en-us'
|
||||
title = "IPng Networks"
|
||||
theme = 'hugo-theme-ipng'
|
||||
|
||||
mainSections = ["articles"]
|
||||
# disqusShortname = "example"
|
||||
paginate = 4
|
||||
|
||||
[params]
|
||||
author = "IPng Networks GmbH"
|
||||
siteHeading = "IPng Networks"
|
||||
favicon = "favicon.ico" # Adds a small icon next to the page title in a tab
|
||||
showBlogLatest = false
|
||||
mainSections = ["articles"]
|
||||
showTaxonomyLinks = false
|
||||
nBlogLatest = 14 # number of blog post om the home page
|
||||
Paginate = 30
|
||||
blogLatestHeading = "Latest Dabblings"
|
||||
footer = "Copyright 2021- IPng Networks GmbH, all rights reserved"
|
||||
|
||||
[params.social]
|
||||
email = "info+www@ipng.ch"
|
||||
mastodon = "IPngNetworks"
|
||||
twitter = "IPngNetworks"
|
||||
linkedin = "pimvanpelt"
|
||||
instagram = "IPngNetworks"
|
||||
|
||||
[taxonomies]
|
||||
year = "year"
|
||||
month = "month"
|
||||
tags = "tags"
|
||||
categories = "categories"
|
||||
|
||||
[permalinks]
|
||||
articles = "/s/articles/:year/:month/:day/:slug"
|
||||
38
hugo.yaml
Normal file
38
hugo.yaml
Normal file
@@ -0,0 +1,38 @@
|
||||
baseURL: 'https://ipng.ch/'
|
||||
languageCode: 'en-us'
|
||||
title: "IPng Networks"
|
||||
theme: 'hugo-theme-ipng'
|
||||
|
||||
mainSections: ["articles"]
|
||||
|
||||
params:
|
||||
author: "IPng Networks GmbH"
|
||||
siteHeading: "IPng Networks"
|
||||
favicon: "favicon.ico"
|
||||
showBlogLatest: false
|
||||
mainSections: ["articles"]
|
||||
showTaxonomyLinks: false
|
||||
nBlogLatest: 14 # number of blog posts on the home page
|
||||
Paginate: 30
|
||||
blogLatestHeading: "Latest Dabblings"
|
||||
footer: "Copyright 2021- IPng Networks GmbH, all rights reserved"
|
||||
|
||||
social:
|
||||
email: "info+www@ipng.ch"
|
||||
mastodon: "@IPngNetworks"
|
||||
twitter: "IPngNetworks"
|
||||
linkedin: "pimvanpelt"
|
||||
github: "pimvanpelt"
|
||||
instagram: "IPngNetworks"
|
||||
rss: true
|
||||
|
||||
taxonomies:
|
||||
year: "year"
|
||||
month: "month"
|
||||
tags: "tags"
|
||||
categories: "categories"
|
||||
|
||||
permalinks:
|
||||
articles: "/s/articles/:year/:month/:day/:slug"
|
||||
|
||||
ignoreLogs: [ "warning-goldmark-raw-html" ]
|
||||
5
static/.well-known/security.txt
Normal file
5
static/.well-known/security.txt
Normal file
@@ -0,0 +1,5 @@
|
||||
Canonical: https://ipng.ch/.well-known/security.txt
|
||||
Expires: 2026-01-01T00:00:00.000Z
|
||||
Contact: mailto:info@ipng.ch
|
||||
Contact: https://ipng.ch/s/contact/
|
||||
Preferred-Languages: en, nl, de
|
||||
55
static/app/go/index.html
Normal file
55
static/app/go/index.html
Normal file
@@ -0,0 +1,55 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-us">
|
||||
<head>
|
||||
<title>Javascript Redirector for RFID / NFC / nTAG</title>
|
||||
<meta name="robots" content="noindex,nofollow">
|
||||
<meta charset="utf-8">
|
||||
<script type="text/JavaScript">
|
||||
|
||||
const ntag_list = [
|
||||
"/s/articles/2021/09/21/vpp-linux-cp-part7/",
|
||||
"/s/articles/2021/12/23/vpp-linux-cp-virtual-machine-playground/",
|
||||
"/s/articles/2022/01/12/case-study-virtual-leased-line-vll-in-vpp/",
|
||||
"/s/articles/2022/02/14/case-study-vlan-gymnastics-with-vpp/",
|
||||
"/s/articles/2022/03/27/vpp-configuration-part1/",
|
||||
"/s/articles/2022/10/14/vpp-lab-setup/",
|
||||
"/s/articles/2023/03/11/case-study-centec-mpls-core/",
|
||||
"/s/articles/2023/04/09/vpp-monitoring/",
|
||||
"/s/articles/2023/05/28/vpp-mpls-part-4/",
|
||||
"/s/articles/2023/11/11/debian-on-mellanox-sn2700-32x100g/",
|
||||
"/s/articles/2023/12/17/debian-on-ipngs-vpp-routers/",
|
||||
"/s/articles/2024/01/27/vpp-python-api/",
|
||||
"/s/articles/2024/02/10/vpp-on-freebsd-part-1/",
|
||||
"/s/articles/2024/03/06/vpp-with-babel-part-1/",
|
||||
"/s/articles/2024/04/06/vpp-with-loopback-only-ospfv3-part-1/",
|
||||
"/s/articles/2024/04/27/freeix-remote/"
|
||||
];
|
||||
|
||||
var redir_url = "https://ipng.ch/";
|
||||
var key = window.location.hash.slice(1);
|
||||
if (key.startsWith("ntag")) {
|
||||
let week = Math.round(new Date().getTime() / 1000 / (7*24*3600)); // seconds per week
|
||||
let num = parseInt(key.slice(-2));
|
||||
let idx = (num + week) % ntag_list.length;
|
||||
console.log("(ntag " + num + " + week number " + week + ") % " + ntag_list.length + " = " + idx);
|
||||
redir_url = ntag_list[idx];
|
||||
}
|
||||
|
||||
console.log("Redirecting to " + redir_url + " - off you go!");
|
||||
window.location = redir_url;
|
||||
</script>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<pre>
|
||||
Usage: https://ipng.ch/app/go/#<key>
|
||||
Example: <a href="/app/go/#ntag00">#ntag00</a>
|
||||
|
||||
Also, this page requires javascript.
|
||||
|
||||
Love,
|
||||
IPng Networks.
|
||||
</pre>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
1
static/assets/containerlab/containerlab.svg
Normal file
1
static/assets/containerlab/containerlab.svg
Normal file
File diff suppressed because one or more lines are too long
|
After Width: | Height: | Size: 21 KiB |
BIN
static/assets/containerlab/learn-vpp.png
LFS
Normal file
BIN
static/assets/containerlab/learn-vpp.png
LFS
Normal file
Binary file not shown.
1270
static/assets/containerlab/vpp-containerlab.cast
Normal file
1270
static/assets/containerlab/vpp-containerlab.cast
Normal file
File diff suppressed because it is too large
Load Diff
1
static/assets/ctlog/MPLS Backbone - CTLog.svg
Normal file
1
static/assets/ctlog/MPLS Backbone - CTLog.svg
Normal file
File diff suppressed because one or more lines are too long
|
After Width: | Height: | Size: 147 KiB |
BIN
static/assets/ctlog/btop-sunlight.png
LFS
Normal file
BIN
static/assets/ctlog/btop-sunlight.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-loadtest1.png
LFS
Normal file
BIN
static/assets/ctlog/ctlog-loadtest1.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-loadtest2.png
LFS
Normal file
BIN
static/assets/ctlog/ctlog-loadtest2.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-loadtest3.png
LFS
Normal file
BIN
static/assets/ctlog/ctlog-loadtest3.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/ctlog-logo-ipng.png
LFS
Normal file
BIN
static/assets/ctlog/ctlog-logo-ipng.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/lipase.png
LFS
Normal file
BIN
static/assets/ctlog/lipase.png
LFS
Normal file
Binary file not shown.
164
static/assets/ctlog/minio-results.txt
Normal file
164
static/assets/ctlog/minio-results.txt
Normal file
@@ -0,0 +1,164 @@
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4M
|
||||
Loop 1: PUT time 60.0 secs, objects = 813, speed = 54.2MB/sec, 13.5 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 23168, speed = 1.5GB/sec, 386.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.2 secs, 371.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
2025/07/20 16:07:25 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FACEBAC4D052, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 1221, speed = 20.3MB/sec, 20.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 31000, speed = 516.7MB/sec, 516.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 3.2 secs, 376.5 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
2025/07/20 16:09:29 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FAEB70060604, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 3353, speed = 447KB/sec, 55.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 45913, speed = 6MB/sec, 765.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 9.3 secs, 361.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4k
|
||||
2025/07/20 16:11:38 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB098B162788, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 3404, speed = 226.9KB/sec, 56.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 45230, speed = 2.9MB/sec, 753.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 9.4 secs, 362.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
|
||||
2025/07/20 16:13:47 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB27AE890E75, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 1898, speed = 126.4MB/sec, 31.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 185034, speed = 12GB/sec, 3083.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.4 secs, 4267.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
2025/07/20 16:15:48 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB43C0386015, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.2 secs, objects = 2627, speed = 43.7MB/sec, 43.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 327959, speed = 5.3GB/sec, 5465.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.6 secs, 4045.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
|
||||
2025/07/20 16:17:49 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB5FE2012590, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 6663, speed = 887.7KB/sec, 111.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 459962, speed = 59.9MB/sec, 7666.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 3890.9 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
|
||||
2025/07/20 16:19:50 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB7C3CF0FFCA, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 6673, speed = 444.4KB/sec, 111.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 444637, speed = 28.9MB/sec, 7410.5 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.5 secs, 4411.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
|
||||
2025/07/20 16:21:52 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FB988DB60881, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.2 secs, objects = 3093, speed = 205.5MB/sec, 51.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 168750, speed = 11GB/sec, 2811.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.3 secs, 9112.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=1M
|
||||
2025/07/20 16:23:53 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FBB4A1E534DE, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.2 secs, objects = 4652, speed = 77.2MB/sec, 77.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 351187, speed = 5.7GB/sec, 5852.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 0.6 secs, 8141.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=8k
|
||||
2025/07/20 16:25:54 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FBD0C4764C64, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 14497, speed = 1.9MB/sec, 241.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 457437, speed = 59.6MB/sec, 7623.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 8353.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
|
||||
2025/07/20 16:27:55 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FBED210B0792, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 14459, speed = 962.6KB/sec, 240.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 466680, speed = 30.4MB/sec, 7777.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 8605.3 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4M
|
||||
Loop 1: PUT time 60.0 secs, objects = 1866, speed = 124.4MB/sec, 31.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 16400, speed = 1.1GB/sec, 273.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 5.1 secs, 369.3 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
2025/07/20 16:32:02 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC25AE815718, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 5459, speed = 91MB/sec, 91.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 25090, speed = 418.2MB/sec, 418.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 14.8 secs, 369.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
2025/07/20 16:34:17 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC4514A78873, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 22278, speed = 2.9MB/sec, 371.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 40626, speed = 5.3MB/sec, 677.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 61.6 secs, 361.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=4k
|
||||
2025/07/20 16:37:19 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC6F629ACFAC, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 23394, speed = 1.5MB/sec, 389.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 39249, speed = 2.6MB/sec, 654.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 64.5 secs, 363.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
|
||||
2025/07/20 16:40:23 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FC9A5D101971, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 10564, speed = 704.1MB/sec, 176.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 20682, speed = 1.3GB/sec, 344.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.5 secs, 4178.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
2025/07/20 16:42:26 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FCB6EB0A45D9, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 26550, speed = 442.4MB/sec, 442.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 124810, speed = 2GB/sec, 2080.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 6.6 secs, 4049.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
|
||||
2025/07/20 16:44:32 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FCD4684A110E, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 129363, speed = 16.8MB/sec, 2155.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 423956, speed = 55.2MB/sec, 7065.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 32.4 secs, 3992.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
|
||||
2025/07/20 16:47:05 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FCF7EA4857CF, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 123067, speed = 8MB/sec, 2051.0 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 357694, speed = 23.3MB/sec, 5961.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 30.9 secs, 3986.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
|
||||
2025/07/20 16:49:36 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD1B12EFDEBC, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 13131, speed = 873.3MB/sec, 218.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.1 secs, objects = 18630, speed = 1.2GB/sec, 310.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 1.7 secs, 7787.5 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=1M
|
||||
2025/07/20 16:51:38 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD3779E97644, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.1 secs, objects = 40226, speed = 669.8MB/sec, 669.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 85692, speed = 1.4GB/sec, 1427.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 4.7 secs, 8610.2 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=8k
|
||||
2025/07/20 16:53:42 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD5489FB2F1F, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 230985, speed = 30.1MB/sec, 3849.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 435703, speed = 56.7MB/sec, 7261.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 25.8 secs, 8945.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:9000, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
|
||||
2025/07/20 16:56:08 WARNING: createBucket wasabi-benchmark-bucket error, ignoring BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
|
||||
status code: 409, request id: 1853FD7683B9BB96, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
|
||||
Loop 1: PUT time 60.0 secs, objects = 228647, speed = 14.9MB/sec, 3810.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 452412, speed = 29.5MB/sec, 7539.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 27.2 secs, 8418.0 deletes/sec. Slowdowns = 0
|
||||
BIN
static/assets/ctlog/minio_8kb_performance.png
LFS
Normal file
BIN
static/assets/ctlog/minio_8kb_performance.png
LFS
Normal file
Binary file not shown.
BIN
static/assets/ctlog/nsa_slide.jpg
LFS
Normal file
BIN
static/assets/ctlog/nsa_slide.jpg
LFS
Normal file
Binary file not shown.
80
static/assets/ctlog/seaweedfs-results.txt
Normal file
80
static/assets/ctlog/seaweedfs-results.txt
Normal file
@@ -0,0 +1,80 @@
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
Loop 1: PUT time 60.0 secs, objects = 1994, speed = 33.2MB/sec, 33.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 29243, speed = 487.4MB/sec, 487.4 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.8 secs, 701.4 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
Loop 1: PUT time 60.0 secs, objects = 13634, speed = 1.8MB/sec, 227.2 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 32284, speed = 4.2MB/sec, 538.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 18.7 secs, 727.8 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
Loop 1: PUT time 62.0 secs, objects = 23733, speed = 382.8MB/sec, 382.8 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 132708, speed = 2.2GB/sec, 2211.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 3.7 secs, 6490.1 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
|
||||
Loop 1: PUT time 60.0 secs, objects = 199925, speed = 26MB/sec, 3331.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 309937, speed = 40.4MB/sec, 5165.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 31.2 secs, 6406.0 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=1M
|
||||
Loop 1: PUT time 60.0 secs, objects = 1975, speed = 32.9MB/sec, 32.9 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 29898, speed = 498.3MB/sec, 498.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 2.7 secs, 726.6 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=1, loops=1, size=8k
|
||||
Loop 1: PUT time 60.0 secs, objects = 13662, speed = 1.8MB/sec, 227.7 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 31865, speed = 4.1MB/sec, 531.1 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 18.8 secs, 726.9 deletes/sec. Slowdowns = 0
|
||||
Wasabi benchmark program v2.0
|
||||
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=1M
|
||||
Loop 1: PUT time 60.0 secs, objects = 26622, speed = 443.6MB/sec, 443.6 operations/sec. Slowdowns = 0
|
||||
Loop 1: GET time 60.0 secs, objects = 117688, speed = 1.9GB/sec, 1961.3 operations/sec. Slowdowns = 0
|
||||
Loop 1: DELETE time 4.1 secs, 6499.5 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=8k
Loop 1: PUT time 60.0 secs, objects = 198238, speed = 25.8MB/sec, 3303.9 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 312868, speed = 40.7MB/sec, 5214.3 operations/sec. Slowdowns = 0
Loop 1: DELETE time 30.8 secs, 6432.7 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
Loop 1: PUT time 60.1 secs, objects = 6220, speed = 414.2MB/sec, 103.6 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 38773, speed = 2.5GB/sec, 646.1 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.9 secs, 6693.3 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 203033, speed = 13.2MB/sec, 3383.8 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 300824, speed = 19.6MB/sec, 5013.6 operations/sec. Slowdowns = 0
Loop 1: DELETE time 31.1 secs, 6528.6 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
Loop 1: PUT time 60.3 secs, objects = 13181, speed = 874.2MB/sec, 218.6 operations/sec. Slowdowns = 0
Loop 1: GET time 60.1 secs, objects = 18575, speed = 1.2GB/sec, 309.3 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.8 secs, 17547.2 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-disk:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 495006, speed = 32.2MB/sec, 8249.5 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 465947, speed = 30.3MB/sec, 7765.4 operations/sec. Slowdowns = 0
Loop 1: DELETE time 41.4 secs, 11961.3 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4M
Loop 1: PUT time 60.1 secs, objects = 7073, speed = 471MB/sec, 117.8 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 31248, speed = 2GB/sec, 520.7 operations/sec. Slowdowns = 0
Loop 1: DELETE time 1.1 secs, 6576.1 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=8, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 214387, speed = 14MB/sec, 3573.0 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 297586, speed = 19.4MB/sec, 4959.7 operations/sec. Slowdowns = 0
Loop 1: DELETE time 32.9 secs, 6519.8 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4M
Loop 1: PUT time 60.1 secs, objects = 14365, speed = 956MB/sec, 239.0 operations/sec. Slowdowns = 0
Loop 1: GET time 60.1 secs, objects = 18113, speed = 1.2GB/sec, 301.6 operations/sec. Slowdowns = 0
Loop 1: DELETE time 0.8 secs, 18655.8 deletes/sec. Slowdowns = 0

Wasabi benchmark program v2.0
Parameters: url=http://minio-ssd:8333, bucket=wasabi-benchmark-bucket, region=us-east-1, duration=60, threads=32, loops=1, size=4k
Loop 1: PUT time 60.0 secs, objects = 489736, speed = 31.9MB/sec, 8161.8 operations/sec. Slowdowns = 0
Loop 1: GET time 60.0 secs, objects = 460296, speed = 30MB/sec, 7671.2 operations/sec. Slowdowns = 0
Loop 1: DELETE time 41.0 secs, 11957.6 deletes/sec. Slowdowns = 0
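A quick sanity check on the numbers above: the reported MB/sec is simply operations/sec multiplied by the object size of that run. A minimal sketch in Python (values copied from the 8-thread, size=8k PUT run against minio-ssd; the 1024 divisor assumes the tool reports binary megabytes):

    # Cross-check: throughput = operations/sec * object size
    ops_per_sec = 3303.9        # reported operations/sec for the size=8k PUT run
    object_size_kib = 8         # size=8k, i.e. 8 KiB per object
    mb_per_sec = ops_per_sec * object_size_kib / 1024
    print(f"{mb_per_sec:.1f} MB/sec")   # prints 25.8 MB/sec, lining up with the reported speed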
116   static/assets/ctlog/seaweedfs.docker-compose.yml   Normal file
@@ -0,0 +1,116 @@
# Test Setup for SeaweedFS with 6 disks, a Filer and an S3 API
#
# Use with the following .env file
# root@minio-ssd:~# cat /opt/seaweedfs/.env
# AWS_ACCESS_KEY_ID="hottentotten"
# AWS_SECRET_ACCESS_KEY="tentententoonstelling"

services:
  # Master
  master0:
    image: chrislusf/seaweedfs
    ports:
      - 9333:9333
      - 19333:19333
    command: "-v=1 master -volumeSizeLimitMB 100 -resumeState=false -ip=master0 -ip.bind=0.0.0.0 -port=9333 -mdir=/var/lib/seaweedfs/master"
    volumes:
      - ./data/master0:/var/lib/seaweedfs/master
    restart: unless-stopped

  # Volume Server 1
  volume1:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8081 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume1'
    volumes:
      - /data/disk1:/var/lib/seaweedfs/volume1
    depends_on:
      - master0
    restart: unless-stopped

  # Volume Server 2
  volume2:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8082 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume2'
    volumes:
      - /data/disk2:/var/lib/seaweedfs/volume2
    depends_on:
      - master0
    restart: unless-stopped

  # Volume Server 3
  volume3:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8083 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume3'
    volumes:
      - /data/disk3:/var/lib/seaweedfs/volume3
    depends_on:
      - master0
    restart: unless-stopped

  # Volume Server 4
  volume4:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8084 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume4'
    volumes:
      - /data/disk4:/var/lib/seaweedfs/volume4
    depends_on:
      - master0
    restart: unless-stopped

  # Volume Server 5
  volume5:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8085 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume5'
    volumes:
      - /data/disk5:/var/lib/seaweedfs/volume5
    depends_on:
      - master0
    restart: unless-stopped

  # Volume Server 6
  volume6:
    image: chrislusf/seaweedfs
    command: 'volume -dataCenter=dc1 -rack=r1 -mserver="master0:9333" -port=8086 -preStopSeconds=1 -dir=/var/lib/seaweedfs/volume6'
    volumes:
      - /data/disk6:/var/lib/seaweedfs/volume6
    depends_on:
      - master0
    restart: unless-stopped

  # Filer
  filer:
    image: chrislusf/seaweedfs
    ports:
      - 8888:8888
      - 18888:18888
    command: 'filer -defaultReplicaPlacement=002 -iam -master="master0:9333"'
    volumes:
      - ./data/filer:/data
    depends_on:
      - master0
      - volume1
      - volume2
      - volume3
      - volume4
      - volume5
      - volume6
    restart: unless-stopped

  # S3 API
  s3:
    image: chrislusf/seaweedfs
    ports:
      - 8333:8333
    command: 's3 -filer="filer:8888" -ip.bind=0.0.0.0'
    env_file:
      - .env
    depends_on:
      - master0
      - volume1
      - volume2
      - volume3
      - volume4
      - volume5
      - volume6
      - filer
    restart: unless-stopped
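With the stack above running, the S3 gateway listens on port 8333 using the credentials from the example .env file. A minimal sketch of exercising it from Python with boto3 (the endpoint host matches the benchmark URLs above; the bucket and object names are made up for illustration):

    import boto3

    # Ordinary S3 client pointed at the SeaweedFS S3 gateway from the compose file above.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio-ssd:8333",
        aws_access_key_id="hottentotten",
        aws_secret_access_key="tentententoonstelling",
        region_name="us-east-1",
    )

    s3.create_bucket(Bucket="test-bucket")
    s3.put_object(Bucket="test-bucket", Key="hello.txt", Body=b"hello, seaweedfs")
    print(s3.get_object(Bucket="test-bucket", Key="hello.txt")["Body"].read())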
BIN   static/assets/ctlog/size_comparison_8t.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/ctlog/stop-hammer-time.jpg   LFS   Normal file   Binary file not shown.
BIN   static/assets/ctlog/sunlight-logo.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/ctlog/sunlight-test-s3.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/ctlog/tesseract-logo.png   LFS   Normal file   Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
BIN   static/assets/freeix/freeix-artist-rendering.png   LFS   Normal file   Binary file not shown.
1   static/assets/frys-ix/FrysIX_ Topology (concept).svg   Normal file   File diff suppressed because one or more lines are too long   After: 90 KiB
BIN   static/assets/frys-ix/IXR-7220-D3.jpg   LFS   Normal file   Binary file not shown.
1   static/assets/frys-ix/Nokia Arista VXLAN.svg   Normal file   File diff suppressed because one or more lines are too long   After: 166 KiB
169   static/assets/frys-ix/arista-leaf.conf   Normal file
@@ -0,0 +1,169 @@
no aaa root
!
hardware counter feature vtep decap
hardware counter feature vtep encap
!
service routing protocols model multi-agent
!
hostname arista-leaf
!
router l2-vpn
   arp learning bridged
!
spanning-tree mode mstp
!
system l1
   unsupported speed action error
   unsupported error-correction action error
!
vlan 2604
   name v-peeringlan
!
interface Ethernet1/1
!
interface Ethernet2/1
!
interface Ethernet3/1
!
interface Ethernet4/1
!
interface Ethernet5/1
!
interface Ethernet6/1
!
interface Ethernet7/1
!
interface Ethernet8/1
!
interface Ethernet9/1
   shutdown
   speed forced 10000full
!
interface Ethernet9/2
   shutdown
!
interface Ethernet9/3
   speed forced 10000full
   switchport access vlan 2604
!
interface Ethernet9/4
   shutdown
!
interface Ethernet10/1
!
interface Ethernet10/2
   shutdown
!
interface Ethernet10/4
   shutdown
!
interface Ethernet11/1
!
interface Ethernet12/1
!
interface Ethernet13/1
!
interface Ethernet14/1
!
interface Ethernet15/1
!
interface Ethernet16/1
!
interface Ethernet17/1
!
interface Ethernet18/1
!
interface Ethernet19/1
!
interface Ethernet20/1
!
interface Ethernet21/1
!
interface Ethernet22/1
!
interface Ethernet23/1
!
interface Ethernet24/1
!
interface Ethernet25/1
!
interface Ethernet26/1
!
interface Ethernet27/1
!
interface Ethernet28/1
!
interface Ethernet29/1
   no switchport
!
interface Ethernet30/1
   load-interval 1
   mtu 9190
   no switchport
   ip address 198.19.17.10/31
   ip ospf cost 10
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet31/1
   load-interval 1
   mtu 9190
   no switchport
   ip address 198.19.17.3/31
   ip ospf cost 1000
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet32/1
   load-interval 1
   mtu 9190
   no switchport
   ip address 198.19.17.5/31
   ip ospf cost 1000
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Loopback0
   ip address 198.19.16.2/32
   ip ospf area 0.0.0.0
!
interface Loopback1
   ip address 198.19.18.2/32
!
interface Management1
   ip address dhcp
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   vxlan vlan 2604 vni 2604
!
ip routing
!
ip route 0.0.0.0/0 Management1 10.75.8.1
!
router bgp 65500
   neighbor evpn peer group
   neighbor evpn remote-as 65500
   neighbor evpn update-source Loopback0
   neighbor evpn ebgp-multihop 3
   neighbor evpn send-community extended
   neighbor evpn maximum-routes 12000 warning-only
   neighbor 198.19.16.0 peer group evpn
   neighbor 198.19.16.1 peer group evpn
   !
   vlan 2604
      rd 65500:2604
      route-target both 65500:2604
      redistribute learned
   !
   address-family evpn
      neighbor evpn activate
!
router ospf 65500
   router-id 198.19.16.2
   redistribute connected
   network 198.19.0.0/16 area 0.0.0.0
   max-lsa 12000
!
end
90   static/assets/frys-ix/equinix.conf   Normal file
@@ -0,0 +1,90 @@
set / interface ethernet-1/1 admin-state disable
set / interface ethernet-1/9 admin-state enable
set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
set / interface ethernet-1/9/3 admin-state enable
set / interface ethernet-1/9/3 vlan-tagging true
set / interface ethernet-1/9/3 subinterface 0 type bridged
set / interface ethernet-1/9/3 subinterface 0 admin-state enable
set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
set / interface ethernet-1/29 admin-state enable
set / interface ethernet-1/29 subinterface 0 type routed
set / interface ethernet-1/29 subinterface 0 admin-state enable
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.0/31
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
set / interface lo0 admin-state enable
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.0/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.0/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc2 role: spine"
set / network-instance default router-id 198.19.16.0
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/29.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.0
set / network-instance default protocols bgp dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay route-reflector client true
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.0
set / network-instance default protocols bgp neighbor 198.19.16.1 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.1 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.0
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9/3.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
BIN   static/assets/frys-ix/frysix-logo-small.png   LFS   Normal file   Binary file not shown.
132   static/assets/frys-ix/nikhef.conf   Normal file
@@ -0,0 +1,132 @@
set / interface ethernet-1/1 admin-state enable
set / interface ethernet-1/1 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/1 subinterface 0 type routed
set / interface ethernet-1/1 subinterface 0 admin-state enable
set / interface ethernet-1/1 subinterface 0 ip-mtu 9190
set / interface ethernet-1/1 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/1 subinterface 0 ipv4 address 198.19.17.2/31
set / interface ethernet-1/1 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/2 admin-state enable
set / interface ethernet-1/2 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/2 subinterface 0 type routed
set / interface ethernet-1/2 subinterface 0 admin-state enable
set / interface ethernet-1/2 subinterface 0 ip-mtu 9190
set / interface ethernet-1/2 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/2 subinterface 0 ipv4 address 198.19.17.4/31
set / interface ethernet-1/2 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/3 admin-state enable
set / interface ethernet-1/3 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/3 subinterface 0 type routed
set / interface ethernet-1/3 subinterface 0 admin-state enable
set / interface ethernet-1/3 subinterface 0 ip-mtu 9190
set / interface ethernet-1/3 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/3 subinterface 0 ipv4 address 198.19.17.6/31
set / interface ethernet-1/3 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/4 admin-state enable
set / interface ethernet-1/4 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/4 subinterface 0 type routed
set / interface ethernet-1/4 subinterface 0 admin-state enable
set / interface ethernet-1/4 subinterface 0 ip-mtu 9190
set / interface ethernet-1/4 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/4 subinterface 0 ipv4 address 198.19.17.8/31
set / interface ethernet-1/4 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/9 admin-state enable
set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
set / interface ethernet-1/9/1 admin-state disable
set / interface ethernet-1/9/2 admin-state disable
set / interface ethernet-1/9/3 admin-state enable
set / interface ethernet-1/9/3 vlan-tagging true
set / interface ethernet-1/9/3 subinterface 0 type bridged
set / interface ethernet-1/9/3 subinterface 0 admin-state enable
set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
set / interface ethernet-1/9/4 admin-state disable
set / interface ethernet-1/29 admin-state enable
set / interface ethernet-1/29 subinterface 0 type routed
set / interface ethernet-1/29 subinterface 0 admin-state enable
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
set / interface lo0 admin-state enable
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc1 role: spine"
set / network-instance default router-id 198.19.16.1
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/1.0
set / network-instance default interface ethernet-1/2.0
set / network-instance default interface ethernet-1/29.0
set / network-instance default interface ethernet-1/3.0
set / network-instance default interface ethernet-1/4.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.1
set / network-instance default protocols bgp dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay route-reflector client true
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.1
set / network-instance default protocols bgp neighbor 198.19.16.0 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.0 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.1
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/1.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/2.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/3.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/4.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0 passive true
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9/3.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
BIN   static/assets/frys-ix/nokia-7220-d2.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/frys-ix/nokia-7220-d4.png   LFS   Normal file   Binary file not shown.
105   static/assets/frys-ix/nokia-leaf.conf   Normal file
@@ -0,0 +1,105 @@
set / interface ethernet-1/9 admin-state enable
set / interface ethernet-1/9 vlan-tagging true
set / interface ethernet-1/9 ethernet port-speed 10G
set / interface ethernet-1/9 subinterface 0 type bridged
set / interface ethernet-1/9 subinterface 0 admin-state enable
set / interface ethernet-1/9 subinterface 0 vlan encap untagged
set / interface ethernet-1/53 admin-state enable
set / interface ethernet-1/53 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/53 subinterface 0 admin-state enable
set / interface ethernet-1/53 subinterface 0 ip-mtu 9190
set / interface ethernet-1/53 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/53 subinterface 0 ipv4 address 198.19.17.11/31
set / interface ethernet-1/53 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/55 admin-state enable
set / interface ethernet-1/55 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/55 subinterface 0 admin-state enable
set / interface ethernet-1/55 subinterface 0 ip-mtu 9190
set / interface ethernet-1/55 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/55 subinterface 0 ipv4 address 198.19.17.7/31
set / interface ethernet-1/55 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/56 admin-state enable
set / interface ethernet-1/56 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/56 subinterface 0 admin-state enable
set / interface ethernet-1/56 subinterface 0 ip-mtu 9190
set / interface ethernet-1/56 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/56 subinterface 0 ipv4 address 198.19.17.9/31
set / interface ethernet-1/56 subinterface 0 ipv6 admin-state enable
set / interface lo0 admin-state enable
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.3/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.3/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc1 role: leaf"
set / network-instance default router-id 198.19.16.3
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/53.0
set / network-instance default interface ethernet-1/55.0
set / network-instance default interface ethernet-1/56.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.3
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.3
set / network-instance default protocols bgp neighbor 198.19.16.0 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.0 peer-group overlay
set / network-instance default protocols bgp neighbor 198.19.16.1 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.1 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.3
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/53.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/55.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/56.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0 passive true
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
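The equinix.conf, nikhef.conf and nokia-leaf.conf files above are flat lists of SR Linux "set /" commands, which makes them easy to compare mechanically. A minimal sketch in Python (the file names are simply the examples from this diff, assumed to be saved locally):

    # Structural diff of two flat 'set /' configs, e.g. the two spines above.
    def load_set_lines(path: str) -> set[str]:
        with open(path) as f:
            return {line.strip() for line in f if line.strip().startswith("set / ")}

    if __name__ == "__main__":
        a = load_set_lines("nikhef.conf")    # dc1 spine
        b = load_set_lines("equinix.conf")   # dc2 spine
        for line in sorted(a - b):
            print("only in nikhef.conf: ", line)
        for line in sorted(b - a):
            print("only in equinix.conf:", line)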
BIN   static/assets/jekyll-hugo/before.png   LFS   Normal file   Binary file not shown.
7   static/assets/jekyll-hugo/hugo-logo-wide.svg   Normal file
@@ -0,0 +1,7 @@
<svg xmlns="http://www.w3.org/2000/svg" fill-rule="evenodd" stroke-width="27" aria-label="Logo" viewBox="0 0 1493 391">
<path fill="#ebb951" stroke="#fcd804" d="M1345.211 24.704l112.262 64.305a43 43 0 0 1 21.627 37.312v142.237a40 40 0 0 1-20.702 35.037l-120.886 66.584a42 42 0 0 1-41.216-.389l-106.242-61.155a57 57 0 0 1-28.564-49.4V138.71a64 64 0 0 1 31.172-54.939l98.01-58.564a54 54 0 0 1 54.54-.503z"/>
<path fill="#33ba91" stroke="#00a88a" d="M958.07 22.82l117.31 66.78a41 41 0 0 1 20.72 35.64v139.5a45 45 0 0 1-23.1 39.32L955.68 369.4a44 44 0 0 1-43.54-.41l-105.82-61.6a56 56 0 0 1-27.83-48.4V140.07a68 68 0 0 1 33.23-58.44l98.06-58.35a48 48 0 0 1 48.3-.46z"/>
<path fill="#0594cb" stroke="#0083c0" d="M575.26 20.97l117.23 68.9a40 40 0 0 1 19.73 34.27l.73 138.67a48 48 0 0 1-24.64 42.2l-115.13 64.11a45 45 0 0 1-44.53-.42l-105.83-61.6a55 55 0 0 1-27.33-47.53V136.52a63 63 0 0 1 29.87-53.59l99.3-61.4a49 49 0 0 1 50.6-.56z"/>
<path fill="#ff4088" stroke="#c9177e" d="M195.81 24.13l114.41 66.54a44 44 0 0 1 21.88 38.04v136.43a48 48 0 0 1-24.45 41.82L194.1 370.9a49 49 0 0 1-48.48-.23L41.05 310.48a53 53 0 0 1-26.56-45.93V135.08a55 55 0 0 1 26.1-46.8l102.8-63.46a51 51 0 0 1 52.42-.69z"/>
<path fill="#fff" d="M1320.72 89.15c58.79 0 106.52 47.73 106.52 106.51 0 58.8-47.73 106.52-106.52 106.52-58.78 0-106.52-47.73-106.52-106.52 0-58.78 47.74-106.51 106.52-106.51zm0 39.57c36.95 0 66.94 30 66.94 66.94a66.97 66.97 0 0 1-66.94 66.94c-36.95 0-66.94-29.99-66.94-66.94a66.97 66.97 0 0 1 66.93-66.94h.01zm-283.8 65.31c0 47.18-8.94 60.93-26.81 80.58-17.87 19.65-41.57 27.57-71.1 27.57-27 0-48.75-9.58-67.61-26.23-20.88-18.45-36.08-47.04-36.08-78.95 0-31.37 11.72-58.48 32.49-78.67 18.22-17.67 45.34-29.18 73.3-29.18 33.77 0 68.83 15.98 90.44 47.53l-31.73 26.82c-13.45-25.03-32.94-33.46-60.82-34.26-30.83-.88-64.77 28.53-62.25 67.75 1.4 21.94 11.65 59.65 60.96 66.57 25.9 3.63 55.36-24.02 55.36-39.04H944.4v-37.5h92.5V194l.02.03zm-562.6-94.65h42.29v112.17c0 17.8.49 29.33 1.47 34.61 1.69 8.48 4.81 14.37 11.17 19.5 6.37 5.13 13.8 6.59 24.84 6.59 11.2 0 14.96-1.74 20.66-6.6 5.69-4.85 9.12-9.46 10.28-16.53 1.15-7.07 3.07-18.8 3.07-35.18V99.38h42.28v108.78c0 24.86-1.07 42.43-3.21 52.69-2.14 10.27-6.08 18.93-11.82 26-5.74 7.06-13.42 12.69-23.03 16.88-9.62 4.19-22.16 6.28-37.65 6.28-18.7 0-32.87-2.28-42.52-6.85-9.66-4.57-17.3-10.5-22.9-17.8-5.61-7.3-9.3-14.95-11.08-22.96-2.58-11.86-3.88-29.38-3.88-52.55V99.38h.03zM93.91 299.92V92.7h43.35v75.48h71.92V92.7h43.48v207.22h-43.48v-90.61h-71.92v90.61z"/>
</svg>
After: 2.5 KiB
BIN   static/assets/jekyll-hugo/jekyll-logo.png   LFS   Normal file   Binary file not shown.
83   static/assets/logo/logo-red.svg   Normal file   File diff suppressed because one or more lines are too long   After: 16 KiB
BIN   static/assets/logo/logo-white-1000px.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/logo/logo-white-100px.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/logo/logo-white-2000px.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/logo/logo-white-200px.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/logo/logo-white-400px.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/minio/console-1.png   LFS   Normal file   Binary file not shown.
BIN   static/assets/minio/console-2.png   LFS   Normal file   Binary file not shown.
Some files were not shown because too many files have changed in this diff.