Compare commits: 0542c1e2d9...main (78 commits)
.drone.yml (11 changed lines)

@@ -5,13 +5,12 @@ steps:
 - name: git-lfs
   image: alpine/git
   commands:
-  # - git submodule update --init --recursive --remote
   - git lfs install
   - git lfs pull
 - name: build
-  image: pimvanpelt/drone-hugo:release-0.130.0
+  image: git.ipng.ch/ipng/drone-hugo:release-0.145.1
   settings:
-    hugo_version: 0.130.0
+    hugo_version: 0.145.0
     extended: true
 - name: rsync
   image: drillster/drone-rsync
@@ -25,9 +24,11 @@ steps:
   - nginx0.nlams1.net.ipng.ch
   - nginx0.nlams2.net.ipng.ch
   port: 22
-  args: '-6'
+  args: '-6u --delete-after'
   source: public/
   target: /var/www/ipng.ch/
-  delete: true
   recursive: true
   secrets: [ drone_sshkey ]
+
+image_pull_secrets:
+- git_ipng_ch_docker
.gitignore (vendored, 1 changed line)

@@ -1,3 +1,4 @@
.hugo*
public/
resources/_gen/
.DS_Store
config.toml (36 lines deleted)

@@ -1,36 +0,0 @@
baseURL = 'https://ipng.ch/'
languageCode = 'en-us'
title = "IPng Networks"
theme = 'hugo-theme-ipng'

mainSections = ["articles"]
# disqusShortname = "example"
paginate = 4

[params]
author = "IPng Networks GmbH"
siteHeading = "IPng Networks"
favicon = "favicon.ico" # Adds a small icon next to the page title in a tab
showBlogLatest = false
mainSections = ["articles"]
showTaxonomyLinks = false
nBlogLatest = 14 # number of blog post om the home page
Paginate = 30
blogLatestHeading = "Latest Dabblings"
footer = "Copyright 2021- IPng Networks GmbH, all rights reserved"

[params.social]
email = "info+www@ipng.ch"
mastodon = "IPngNetworks"
twitter = "IPngNetworks"
linkedin = "pimvanpelt"
instagram = "IPngNetworks"

[taxonomies]
year = "year"
month = "month"
tags = "tags"
categories = "categories"

[permalinks]
articles = "/s/articles/:year/:month/:day/:slug"
@ -89,7 +89,7 @@ lcp lcp-sync off
|
||||
```
|
||||
|
||||
The prep work for the rest of the interface syncer starts with this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
|
||||
for the rest of this blog post, the behavior will be in the 'on' position.
|
||||
|
||||
### Change interface: state
|
||||
@ -120,7 +120,7 @@ the state it was. I did notice that you can't bring up a sub-interface if its pa
|
||||
is down, which I found counterintuitive, but that's neither here nor there.
|
||||
|
||||
All of this is to say that we have to be careful when copying state forward, because as
|
||||
this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
|
||||
this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
|
||||
shows, issuing `set int state ... up` on an interface, won't touch its sub-interfaces in VPP, but
|
||||
the subsequent netlink message to bring the _LIP_ for that interface up, **will** update the
|
||||
children, thus desynchronising Linux and VPP: Linux will have interface **and all its
|
||||
@ -128,7 +128,7 @@ sub-interfaces** up unconditionally; VPP will have the interface up and its sub-
|
||||
whatever state they were before.
|
||||
|
||||
To address this, a second
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
|
||||
needed. I'm not too sure I want to keep this behavior, but for now, it results in an intuitive
|
||||
end-state, which is that all interfaces states are exactly the same between Linux and VPP.
|
||||
|
||||
@ -157,7 +157,7 @@ DBGvpp# set int state TenGigabitEthernet3/0/0 up
|
||||
### Change interface: MTU
|
||||
|
||||
Finally, a straight forward
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
|
||||
so I thought. When the MTU changes in VPP (with `set interface mtu packet N <int>`), there is a
|
||||
callback that can be registered which copies this into the _LIP_. I did notice a specific corner
|
||||
case: In VPP, a sub-interface can have a larger MTU than its parent. In Linux, this cannot happen,
|
||||
@ -179,7 +179,7 @@ higher than that, perhaps logging an error explaining why. This means two things
|
||||
1. Any change in VPP of a parent MTU should ensure all children are clamped to at most that.
|
||||
|
||||
I addressed the issue in this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].
|
||||
|
||||
### Change interface: IP Addresses
|
||||
|
||||
@ -199,7 +199,7 @@ VPP into the companion Linux devices:
|
||||
_LIP_ with `lcp_itf_set_interface_addr()`.
|
||||
|
||||
This means with this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
|
||||
any time a new _LIP_ is created, the IPv4 and IPv6 address on the VPP interface are fully copied
|
||||
over by the third change, while at runtime, new addresses can be set/removed as well by the first
|
||||
and second change.
|
||||
|
@ -100,7 +100,7 @@ linux-cp {
|
||||
|
||||
Based on this config, I set the startup default in `lcp_set_lcp_auto_subint()`, but I realize that
|
||||
an administrator may want to turn it on/off at runtime, too, so I add a CLI getter/setter that
|
||||
interacts with the flag in this [[commit](https://github.com/pimvanpelt/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:
|
||||
interacts with the flag in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:
|
||||
|
||||
```
|
||||
DBGvpp# show lcp
|
||||
@ -116,11 +116,11 @@ lcp lcp-sync off
|
||||
```
|
||||
|
||||
The prep work for the rest of the interface syncer starts with this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
|
||||
for the rest of this blog post, the behavior will be in the 'on' position.
|
||||
|
||||
The code for the configuration toggle is in this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
|
||||
|
||||
### Auto create/delete sub-interfaces
|
||||
|
||||
@ -145,7 +145,7 @@ I noticed that interface deletion had a bug (one that I fell victim to as well:
|
||||
remove the netlink device in the correct network namespace), which I fixed.
|
||||
|
||||
The code for the auto create/delete and the bugfix is in this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
|
||||
|
||||
### Further Work
|
||||
|
||||
|
@ -154,7 +154,7 @@ For now, `lcp_nl_dispatch()` just throws the message away after logging it with
|
||||
a function that will come in very useful as I start to explore all the different Netlink message types.
|
||||
|
||||
The code that forms the basis of our Netlink Listener lives in [[this
|
||||
commit](https://github.com/pimvanpelt/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
|
||||
commit](https://git.ipng.ch/ipng/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
|
||||
specifically, I want to call out that I was not the primary author; I worked off of Matt and Neale's
|
||||
awesome work in this pending [Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122).
|
||||
|
||||
@ -182,7 +182,7 @@ Linux interface VPP is not aware of. But, if I can find the _LIP_, I can convert
|
||||
add or remove the ip4/ip6 neighbor adjacency.
|
||||
|
||||
The code for this first Netlink message handler lives in this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
|
||||
ironic insight is that after writing the code, I don't think any of it will be necessary, because
|
||||
the interface plugin will already copy ARP and IPv6 ND packets back and forth and itself update its
|
||||
neighbor adjacency tables; but I'm leaving the code in for now.
|
||||
@ -197,7 +197,7 @@ it or remove it, and if there are no link-local addresses left, disable IPv6 on
|
||||
There's also a few multicast routes to add (notably 224.0.0.0/24 and ff00::/8, all-local-subnet).
|
||||
|
||||
The code for IP address handling is in this
|
||||
[[commit]](https://github.com/pimvanpelt/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
|
||||
[[commit]](https://git.ipng.ch/ipng/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
|
||||
when I took it out for a spin, I noticed something curious, looking at the log lines that are
|
||||
generated for the following sequence:
|
||||
|
||||
@ -236,7 +236,7 @@ interface and directly connected route addition/deletion is slightly different i
|
||||
So, I decide to take a little shortcut -- if an addition returns "already there", or a deletion returns
|
||||
"no such entry", I'll just consider it a successful addition and deletion respectively, saving my eyes
|
||||
from being screamed at by this red error message. I changed that in this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
|
||||
turning this situation into a friendly green notice instead.
|
||||
|
||||
### Netlink: Link (existing)
|
||||
@ -267,7 +267,7 @@ To avoid this loop, I temporarily turn off `lcp-sync` just before handling a bat
|
||||
turn it back to its original state when I'm done with that.
|
||||
|
||||
The code for add/del of existing links is in this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].
|
||||
|
||||
### Netlink: Link (new)
|
||||
|
||||
@ -276,7 +276,7 @@ doesn't have a _LIP_ for, but specifically describes a VLAN interface? Well, th
|
||||
is trying to create a new sub-interface. And supporting that operation would be super cool, so let's go!
|
||||
|
||||
Using the earlier placeholder hint in `lcp_nl_link_add()` (see the previous
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
|
||||
I know that I've gotten a NEWLINK request but the Linux ifindex doesn't have a _LIP_. This could be
|
||||
because the interface is entirely foreign to VPP, for example somebody created a dummy interface or
|
||||
a VLAN sub-interface on one:
|
||||
@ -331,7 +331,7 @@ a boring `<phy>.<subid>` name.
|
||||
|
||||
Alright, without further ado, the code for the main innovation here, the implementation of
|
||||
`lcp_nl_link_add_vlan()`, is in this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].
|
||||
|
||||
## Results
|
||||
|
||||
|
@ -118,7 +118,7 @@ or Virtual Routing/Forwarding domains). So first, I need to add these:
|
||||
|
||||
All of this code was heavily inspired by the pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)]
|
||||
but a few finishing touches were added, and wrapped up in this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].
|
||||
|
||||
### Deletion
|
||||
|
||||
@ -459,7 +459,7 @@ it as 'unreachable' rather than deleting it. These are *additions* which have a
|
||||
but with an interface index of 1 (which, in Netlink, is 'lo'). This makes VPP intermittently crash, so I
|
||||
currently commented this out, while I gain better understanding. Result: blackhole/unreachable/prohibit
|
||||
specials can not be set using the plugin. Beware!
|
||||
(disabled in this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).
|
||||
(disabled in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).
|
||||
|
||||
## Credits
|
||||
|
||||
|
@ -88,7 +88,7 @@ stat['/if/rx-miss'][:, 1].sum() - returns the sum of packet counters for
|
||||
```
|
||||
|
||||
Alright, so let's grab that file and refactor it into a small library for me to use, I do
|
||||
this in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
|
||||
this in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
|
||||
|
||||
### VPP's API
|
||||
|
||||
@ -159,7 +159,7 @@ idx=19 name=tap4 mac=02:fe:17:06:fc:af mtu=9000 flags=3
|
||||
|
||||
So I added a little abstraction with some error handling and one main function
|
||||
to return interfaces as a Python dictionary of those `sw_interface_details`
|
||||
tuples in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
|
||||
tuples in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
|
||||
|
||||
### AgentX
|
||||
|
||||
@ -207,9 +207,9 @@ once asked with `GetPDU` or `GetNextPDU` requests, by issuing a corresponding `R
|
||||
to the SNMP server -- it takes care of all the rest!
|
||||
|
||||
The resulting code is in [[this
|
||||
commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
|
||||
commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
|
||||
but you can also check out the whole thing on
|
||||
[[Github](https://github.com/pimvanpelt/vpp-snmp-agent)].
|
||||
[[Github](https://git.ipng.ch/ipng/vpp-snmp-agent)].
|
||||
|
||||
### Building
|
||||
|
||||
|
@ -480,7 +480,7 @@ is to say, those packets which were destined to any IP address configured on the
|
||||
plane. Any traffic going _through_ VPP will never be seen by Linux! So, I'll have to be
|
||||
clever and count this traffic by polling VPP instead. This was the topic of my previous
|
||||
[VPP Part 6]({{< ref "2021-09-10-vpp-6" >}}) about the SNMP Agent. All of that code
|
||||
was released to [Github](https://github.com/pimvanpelt/vpp-snmp-agent), notably there's
|
||||
was released to [Github](https://git.ipng.ch/ipng/vpp-snmp-agent), notably there's
|
||||
a hint there for an `snmpd-dataplane.service` and a `vpp-snmp-agent.service`, including
|
||||
the compiled binary that reads from VPP and feeds this to SNMP.
|
||||
|
||||
|
@ -62,7 +62,7 @@ plugins:
|
||||
or route, or the system receiving ARP or IPv6 neighbor request/reply from neighbors), and applying
|
||||
these events to the VPP dataplane.
|
||||
|
||||
I've published the code on [Github](https://github.com/pimvanpelt/lcpng/) and I am targeting a release
|
||||
I've published the code on [Github](https://git.ipng.ch/ipng/lcpng/) and I am targeting a release
|
||||
in upstream VPP, hoping to make the upcoming 22.02 release in February 2022. I have a lot of ground to
|
||||
cover, but I will note that the plugin has been running in production in [AS8298]({{< ref "2021-02-27-network" >}})
|
||||
since Sep'21 and no crashes related to LinuxCP have been observed.
|
||||
@ -195,7 +195,7 @@ So grab a cup of tea, while we let Rhino stretch its legs, ehh, CPUs ...
|
||||
pim@rhino:~$ mkdir -p ~/src
|
||||
pim@rhino:~$ cd ~/src
|
||||
pim@rhino:~/src$ sudo apt install libmnl-dev
|
||||
pim@rhino:~/src$ git clone https://github.com/pimvanpelt/lcpng.git
|
||||
pim@rhino:~/src$ git clone https://git.ipng.ch/ipng/lcpng.git
|
||||
pim@rhino:~/src$ git clone https://gerrit.fd.io/r/vpp
|
||||
pim@rhino:~/src$ ln -s ~/src/lcpng ~/src/vpp/src/plugins/lcpng
|
||||
pim@rhino:~/src$ cd ~/src/vpp
|
||||
|
@ -33,7 +33,7 @@ In this first post, let's take a look at tablestakes: writing a YAML specificati
|
||||
configuration elements of VPP, and then ensures that the YAML file is both syntactically as well as
|
||||
semantically correct.
|
||||
|
||||
**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
|
||||
**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
|
||||
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
|
||||
or reach out by [contacting us](/s/contact/).
|
||||
|
||||
@ -348,7 +348,7 @@ to mess up my (or your!) VPP router by feeding it garbage, so the lions' share o
|
||||
has been to assert the YAML file is both syntactically and semantically valid.
|
||||
|
||||
|
||||
In the mean time, you can take a look at my code on [GitHub](https://github.com/pimvanpelt/vppcfg), but to
|
||||
In the mean time, you can take a look at my code on [GitHub](https://git.ipng.ch/ipng/vppcfg), but to
|
||||
whet your appetite, here's a hefty configuration that demonstrates all implemented types:
|
||||
|
||||
```
|
||||
|
@ -32,7 +32,7 @@ the configuration to the dataplane. Welcome to `vppcfg`!
|
||||
In this second post of the series, I want to talk a little bit about how planning a path from a running
|
||||
configuration to a desired new configuration might look like.
|
||||
|
||||
**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
|
||||
**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
|
||||
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
|
||||
or reach out by [contacting us](/s/contact/).
|
||||
|
||||
|
@ -275,7 +275,6 @@ that will point at an `unbound` running on `lab.ipng.ch` itself.
|
||||
I can now create any file I'd like which may use variable substition and other jinja2 style templating. Take
|
||||
for example these two files:
|
||||
|
||||
{% raw %}
|
||||
```
|
||||
pim@lab:~/src/lab$ cat overlays/bird/common/etc/netplan/01-netcfg.yaml.j2
|
||||
network:
|
||||
@ -292,13 +291,12 @@ network:
|
||||
|
||||
pim@lab:~/src/lab$ cat overlays/bird/common/etc/netns/dataplane/resolv.conf.j2
|
||||
domain lab.ipng.ch
|
||||
search{% for domain in lab.nameserver.search %} {{domain}}{%endfor %}
|
||||
search{% for domain in lab.nameserver.search %} {{ domain }}{% endfor %}
|
||||
|
||||
{% for resolver in lab.nameserver.addresses %}
|
||||
nameserver {{resolver}}
|
||||
{%endfor%}
|
||||
nameserver {{ resolver }}
|
||||
{% endfor %}
|
||||
```
|
||||
{% endraw %}
|
||||
|
||||
The first file is a [[NetPlan.io](https://netplan.io/)] configuration that substitutes the correct management
|
||||
IPv4 and IPv6 addresses and gateways. The second one enumerates a set of search domains and nameservers, so that
|
||||
|
@ -578,7 +578,7 @@ the inner payload carries the `vlan 30` tag, neat! The `VNI` there is `0xca986`
|
||||
VLAN10 traffic (showing that multiple VLANs can be transported across the same tunnel, distinguished
|
||||
by VNI).
|
||||
|
||||
{{< image width="90px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
|
||||
{{< image width="90px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
At this point I make an important observation. VxLAN and GENEVE both have this really cool feature
|
||||
that they can hash their _inner_ payload (ie. the IPv4/IPv6 address and ports if available) and use
|
||||
|
@ -171,12 +171,12 @@ GigabitEthernet1/0/0 1 up GigabitEthernet1/0/0
|
||||
|
||||
After this exploratory exercise, I have learned enough about the hardware to be able to take the
|
||||
Fitlet2 out for a spin. To configure the VPP instance, I turn to
|
||||
[[vppcfg](https://github.com/pimvanpelt/vppcfg)], which can take a YAML configuration file
|
||||
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)], which can take a YAML configuration file
|
||||
describing the desired VPP configuration, and apply it safely to the running dataplane using the VPP
|
||||
API. I've written a few more posts on how it does that, notably on its [[syntax]({{< ref "2022-03-27-vppcfg-1" >}})]
|
||||
and its [[planner]({{< ref "2022-04-02-vppcfg-2" >}})]. A complete
|
||||
configuration guide on vppcfg can be found
|
||||
[[here](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md)].
|
||||
[[here](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md)].
|
||||
|
||||
```
|
||||
pim@fitlet:~$ sudo dpkg -i {lib,}vpp*23.06*deb
|
||||
|
@ -185,7 +185,7 @@ forgetful chipmunk-sized brain!), so here, I'll only recap what's already writte
|
||||
|
||||
**1. BUILD:** For the first step, the build is straight forward, and yields a VPP instance based on
|
||||
`vpp-ext-deps_23.06-1` at version `23.06-rc0~71-g182d2b466`, which contains my
|
||||
[[LCPng](https://github.com/pimvanpelt/lcpng.git)] plugin. I then copy the packages to the router.
|
||||
[[LCPng](https://git.ipng.ch/ipng/lcpng.git)] plugin. I then copy the packages to the router.
|
||||
The router has an E-2286G CPU @ 4.00GHz with 6 cores and 6 hyperthreads. There's a really handy tool
|
||||
called `likwid-topology` that can show how the L1, L2 and L3 cache lines up with respect to CPU
|
||||
cores. Here I learn that CPU (0+6) and (1+7) share L1 and L2 cache -- so I can conclude that 0-5 are
|
||||
@ -351,7 +351,7 @@ in `vppcfg`:
|
||||
* When I create the initial `--novpp` config, there's a bug in `vppcfg` where I incorrectly
|
||||
reference a dataplane object which I haven't initialized (because with `--novpp` the tool
|
||||
will not contact the dataplane at all. That one was easy to fix, which I did in [[this
|
||||
commit](https://github.com/pimvanpelt/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).
|
||||
commit](https://git.ipng.ch/ipng/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).
|
||||
|
||||
After that small detour, I can now proceed to configure the dataplane by offering the resulting
|
||||
VPP commands, like so:
|
||||
@ -573,7 +573,7 @@ see is that which is destined to the controlplane (eg, to one of the IPv4 or IPv
|
||||
multicast/broadcast groups that they are participating in), so things like tcpdump or SNMP won't
|
||||
really work.
|
||||
|
||||
However, due to my [[vpp-snmp-agent](https://github.com/pimvanpelt/vpp-snmp-agent.git)], which is
|
||||
However, due to my [[vpp-snmp-agent](https://git.ipng.ch/ipng/vpp-snmp-agent.git)], which is
|
||||
feeding as an AgentX behind an snmpd that in turn is running in the `dataplane` namespace, SNMP scrapes
|
||||
work as they did before, albeit with a few different interface names.
|
||||
|
||||
|
@ -14,7 +14,7 @@ performance and versatility. For those of us who have used Cisco IOS/XR devices,
|
||||
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
|
||||
are shared between the two.
|
||||
|
||||
I've been working on the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)], which you
|
||||
I've been working on the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)], which you
|
||||
can read all about in my series on VPP back in 2021:
|
||||
|
||||
[{: style="width:300px; float: right; margin-left: 1em;"}](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)
|
||||
@ -70,7 +70,7 @@ answered by a Response PDU.
|
||||
|
||||
Using parts of a Python Agentx library written by GitHub user hosthvo
|
||||
[[ref](https://github.com/hosthvo/pyagentx)], I tried my hands at writing one of these AgentX's.
|
||||
The resulting source code is on [[GitHub](https://github.com/pimvanpelt/vpp-snmp-agent)]. That's the
|
||||
The resulting source code is on [[GitHub](https://git.ipng.ch/ipng/vpp-snmp-agent)]. That's the
|
||||
one that's running in production ever since I started running VPP routers at IPng Networks AS8298.
|
||||
After the _AgentX_ exposes the dataplane interfaces and their statistics into _SNMP_, an open source
|
||||
monitoring tool such as LibreNMS [[ref](https://librenms.org/)] can discover the routers and draw
|
||||
@ -126,7 +126,7 @@ for any interface created in the dataplane.
|
||||
|
||||
I wish I were good at Go, but I never really took to the language. I'm pretty good at Python, but
|
||||
sorting through the stats segment isn't super quick as I've already noticed in the Python3 based
|
||||
[[VPP SNMP Agent](https://github.com/pimvanpelt/vpp-snmp-agent)]. I'm probably the world's least
|
||||
[[VPP SNMP Agent](https://git.ipng.ch/ipng/vpp-snmp-agent)]. I'm probably the world's least
|
||||
terrible C programmer, so maybe I can take a look at the VPP Stats Client and make sense of it. Luckily,
|
||||
there's an example already in `src/vpp/app/vpp_get_stats.c` and it reveals the following pattern:
|
||||
|
||||
|
@ -19,7 +19,7 @@ same time keep an IPng Site Local network with IPv4 and IPv6 that is separate fr
|
||||
based on hardware/silicon based forwarding at line rate and high availability. You can read all
|
||||
about my Centec MPLS shenanigans in [[this article]({{< ref "2023-03-11-mpls-core" >}})].
|
||||
|
||||
Ever since the release of the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)]
|
||||
Ever since the release of the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)]
|
||||
plugin in VPP, folks have asked "What about MPLS?" -- I have never really felt the need to go this
|
||||
rabbit hole, because I figured that in this day and age, higher level IP protocols that do tunneling
|
||||
are just as performant, and a little bit less of an 'art' to get right. For example, the Centec
|
||||
|
@ -459,6 +459,6 @@ and VPP, and the overall implementation before attempting to use in production.
|
||||
we got at least some of this right, but testing and runtime experience will tell.
|
||||
|
||||
I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
|
||||
[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
|
||||
[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
|
||||
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!
|
||||
|
||||
|
@ -187,7 +187,7 @@ MPLS-VRF:0, fib_index:0 locks:[interface:4, CLI:1, lcp-rt:1, ]
|
||||
[@1]: mpls via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254000210015254000310008847
|
||||
```
|
||||
|
||||
{{< image width="80px" float="left" src="/assets/vpp-mpls/lightbulb.svg" alt="Lightbulb" >}}
|
||||
{{< image width="80px" float="left" src="/assets/shared/lightbulb.svg" alt="Lightbulb" >}}
|
||||
|
||||
Haha, I love it when the brain-lightbulb goes to the _on_ position. What's happening is that when we
|
||||
turned on the MPLS feature on the VPP `tap` that is connected to `e0`, and VPP saw an MPLS packet,
|
||||
@ -385,5 +385,5 @@ and VPP, and the overall implementation before attempting to use in production.
|
||||
we got at least some of this right, but testing and runtime experience will tell.
|
||||
|
||||
I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
|
||||
[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
|
||||
[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
|
||||
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!
|
||||
|
@ -304,7 +304,7 @@ Gateway, just to show a few of the more advanced features of VPP. For me, this t
|
||||
line of thinking: classifiers. This extract/match/act pattern can be used in policers, ACLs and
|
||||
arbitrary traffic redirection through VPP's directed graph (eg. selecting a next node for
|
||||
processing). I'm going to deep-dive into this classifier behavior in an upcoming article, and see
|
||||
how I might add this to [[vppcfg](https://github.com/pimvanpelt/vppcfg.git)], because I think it
|
||||
how I might add this to [[vppcfg](https://git.ipng.ch/ipng/vppcfg.git)], because I think it
|
||||
would be super powerful to abstract away the rather complex underlying API into something a little
|
||||
bit more ... user friendly. Stay tuned! :)
|
||||
|
||||
|
@ -543,7 +543,7 @@ Whoa, what just happened here? The switch took the port defined by `pci/0000:03:
|
||||
it is _splittable_ and has four lanes, and split it into four NEW ports called `swp1s0`-`swp1s3`,
|
||||
and the resulting ports are 25G, 10G or 1G.
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
|
||||
{{< image width="100px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
However, I make an important observation. When splitting `swp1` in 4, the switch also removed port
|
||||
`swp2`, and remember at the beginning of this article I mentioned that the MAC addresses seemed to
|
||||
|
@ -243,7 +243,7 @@ any prefixes, for example this session in Düsseldorf:
|
||||
};
|
||||
```
|
||||
|
||||
{{< image width="80px" float="left" src="/assets/debian-vpp/warning.png" alt="Warning" >}}
|
||||
{{< image width="80px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
This is where it's a good idea to grab some tea. Quite a few internet providers have
|
||||
incredibly slow convergence, so just by stopping the announcement of `AS8298:AS-IPNG` prefixes at
|
||||
|
@ -548,7 +548,7 @@ for table in api_reply:
|
||||
print(str)
|
||||
```
|
||||
|
||||
{{< image width="50px" float="left" src="/assets/vpp-papi/warning.png" alt="Warning" >}}
|
||||
{{< image width="50px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
Funny detail - it took me almost two years to discover `VppEnum`, which contains all of these
|
||||
symbols. If you end up reading this after a Bing, Yahoo or DuckDuckGo search, feel free to buy
|
||||
|
@ -47,7 +47,7 @@ we'll use for performance testing, notably to compare the FreeBSD kernel routing
|
||||
like `netmap`, and of course VPP itself. I do intend to do some side-by-side comparisons between
|
||||
Debian and FreeBSD when they run VPP.
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/freebsd-vpp/brain.png" alt="brain" >}}
|
||||
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
If you know me a little bit, you'll know that I typically forget how I did a thing, so I'm using
|
||||
this article for others as well as myself in case I want to reproduce this whole thing 5 years down
|
||||
|
@ -163,7 +163,7 @@ interfaces a bit. They need to be:
|
||||
075.810547 main [301] Ready to go, ixl0 0x0/4 <-> ixl1 0x0/4.
|
||||
```
|
||||
|
||||
{{< image width="80px" float="left" src="/assets/freebsd-vpp/warning.png" alt="Warning" >}}
|
||||
{{< image width="80px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
I start my first loadtest, which pretty immediately fails. It's an interesting behavior pattern which
|
||||
I've not seen before. After staring at the problem, and reading the code of `bridge.c`, which is a
|
||||
|
@ -63,7 +63,7 @@ Let me discuss these two purposes in more detail:
|
||||
|
||||
### 1. IPv4 ARP, née IPv6 NDP
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
|
||||
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}
|
||||
|
||||
One really neat trick is simply replacing ARP resolution by something that can resolve the
|
||||
link-layer MAC address in a different way. As it turns out, IPv6 has an equivalent that's
|
||||
@ -359,7 +359,7 @@ does not have an IPv4 address. Except -- I'm bending the rules a little bit by d
|
||||
There's an internal function `ip4_sw_interface_enable_disable()` which is called to enable IPv4
|
||||
processing on an interface once the first IPv4 address is added. So my first fix is to force this to
|
||||
be enabled for any interface that is exposed via Linux Control Plane, notably in `lcp_itf_pair_create()`
|
||||
[[here](https://github.com/pimvanpelt/lcpng/blob/main/lcpng_interface.c#L777)].
|
||||
[[here](https://git.ipng.ch/ipng/lcpng/blob/main/lcpng_interface.c#L777)].
|
||||
|
||||
This approach is partially effective:
|
||||
|
||||
@ -500,7 +500,7 @@ which is unnumbered. Because I don't know for sure if everybody would find this
|
||||
I make sure to guard the behavior behind a backwards compatible configuration option.
|
||||
|
||||
If you're curious, please take a look at the change in my [[GitHub
|
||||
repo](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
|
||||
repo](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
|
||||
which I:
|
||||
1. add a new configuration option, `lcp-sync-unnumbered`, which defaults to `on`. That would be
|
||||
what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.
|
||||
|
@ -147,7 +147,7 @@ With all of that, I am ready to demonstrate two working solutions now. I first c
|
||||
Ondrej's [[commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)].
|
||||
Then, I compile VPP with my pending [[gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. Finally,
|
||||
to demonstrate how `update_loopback_addr()` might work, I compile `lcpng` with my previous
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
|
||||
which allows me to inhibit copying forward addresses from VPP to Linux, when using _unnumbered_
|
||||
interfaces.
|
||||
|
||||
@ -242,7 +242,7 @@ even if the interface link stays up. It's described in detail in
|
||||
[[RFC5880](https://www.rfc-editor.org/rfc/rfc5880.txt)], and I use it at IPng Networks all over the
|
||||
place.
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
|
||||
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}
|
||||
|
||||
Then I'll configure two OSPF protocols, one for IPv4 called `ospf4` and another for IPv6 called
|
||||
`ospf6`. It's easy to overlook, but while usually the IPv4 protocol is OSPFv2 and the IPv6 protocol
|
||||
|
@ -1,8 +1,9 @@
|
||||
---
|
||||
date: "2024-04-27T10:52:11Z"
|
||||
title: FreeIX - Remote
|
||||
title: "FreeIX Remote - Part 1"
|
||||
aliases:
|
||||
- /s/articles/2024/04/27/freeix-1.html
|
||||
- /s/articles/2024/04/27/freeix-remote/
|
||||
---
|
||||
|
||||
# Introduction
|
||||
@ -91,7 +92,7 @@ their traffic to these remote internet exchanges.
|
||||
There are two types of BGP neighbor adjacency:
|
||||
|
||||
1. ***Members***: these are {ip-address,AS}-tuples which FreeIX has explicitly configured. Learned prefixes are added
|
||||
to as-set AS50869:AS-MEMBERS. Members receive _all_ prefixes from FreeIX, each annotated with BGP **informational**
|
||||
to as-set AS50869:AS-MEMBERS. Members receive _some or all_ prefixes from FreeIX, each annotated with BGP **informational**
|
||||
communities, and members can drive certain behavior with BGP **action** communities.
|
||||
|
||||
1. ***Peers***: these are all other entities with whom FreeIX has an adjacency at public internet exchanges or private
|
||||
@ -195,12 +196,12 @@ network interconnects:
|
||||
* `(50869,3020,1)`: Inhibit Action (30XX), Country (3020), Switzerland (1)
|
||||
* `(50869,3030,1308)`: Inhibit Action (30XX), IXP (3030), PeeringDB IXP for LS-IX (1308)
|
||||
|
||||
Further actions can be placed on a per-remote-neighbor basis:
|
||||
Four actions can be placed on a per-remote-asn basis:
|
||||
|
||||
* `(50869,3040,13030)`: Inhibit Action (30XX), AS (3040), Init7 (AS13030)
|
||||
* `(50869,3041,6939)`: Prepend Action (30XX), Prepend Once (3041), Hurricane Electric (AS6939)
|
||||
* `(50869,3042,12859)`: Prepend Action (30XX), Prepend Twice (3042), BIT BV (AS12859)
|
||||
* `(50869,3043,8283)`: Prepend Action (30XX), Prepend Three Times (3043), Coloclue (AS8283)
|
||||
* `(50869,3100,6939)`: Prepend Once Action (3100), Hurricane Electric (AS6939)
|
||||
* `(50869,3200,12859)`: Prepend Twice Action (3200), BIT BV (AS12859)
|
||||
* `(50869,3300,8283)`: Prepend Thice Action (3300), Coloclue (AS8283)
|
||||
|
||||
Peers cannot set these actions, as all action communities will be stripped on ingress. Members can set these action
|
||||
communities on their sessions with FreeIX routers, however in some cases they may also be set by FreeIX operators when
|
||||
|
@ -58,7 +58,8 @@ argument of resistance? Nerd-snipe accepted!
|
||||
|
||||
Let me first introduce the mail^W main characters of my story:
|
||||
|
||||
| {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} | {: style="width:100px; margin: 1em;"} |
|
||||
| {{< image src="/assets/smtp/postfix_logo.png" width="8em" >}} | {{< image src="/assets/smtp/dovecot_logo.png" width="8em" >}} | {{< image src="/assets/smtp/nginx_logo.png" width="8em" >}} | {{< image src="/assets/smtp/rspamd_logo.png" width="8em" >}} | {{< image src="/assets/smtp/unbound_logo.png" width="8em" >}} | {{< image src="/assets/smtp/roundcube_logo.png" width="8em" >}} |
|
||||
| ---- | ---- | ---- | ---- | ---- | ---- |
|
||||
|
||||
* ***Postfix***: is Wietse Venema's mail server that started life at IBM research as an
|
||||
alternative to the widely-used Sendmail program. After eight years at Google, Wietse continues
|
||||
@ -444,7 +445,7 @@ pim@squanchy:~$ sudo cat /etc/mail/secrets
|
||||
ipng bastion:<haha-made-you-look>
|
||||
```
|
||||
|
||||
{{< image width="120px" float="left" src="/assets/smtp/lightbulb.svg" alt="Lightbulb" >}}
|
||||
{{< image width="120px" float="left" src="/assets/shared/lightbulb.svg" alt="Lightbulb" >}}
|
||||
|
||||
What happens here is, every time this server `squanchy` wants to send an e-mail, it will use an SMTP
|
||||
session with TLS, on port 587, of the machine called `smtp-out.ipng.ch`, and it'll authenticate
|
||||
|
@ -101,6 +101,7 @@ IPv6 network and access the internet via a shared IPv6 address.
|
||||
I will assign a pool of four public IPv4 addresses and eight IPv6 addresses to each border gateway:
|
||||
|
||||
| **Machine** | **IPv4 pool** | **IPv6 pool** |
|
||||
| ----------- | ------------- | ------------- |
|
||||
| border0.chbtl0.net.ipng.ch | <span style='color:green;'>194.126.235.0/30</span> | <span style='color:blue;'>2001:678:d78::3:0:0/125</span> |
|
||||
| border0.chrma0.net.ipng.ch | <span style='color:green;'>194.126.235.4/30</span> | <span style='color:blue;'>2001:678:d78::3:1:0/125</span> |
|
||||
| border0.chplo0.net.ipng.ch | <span style='color:green;'>194.126.235.8/30</span> | <span style='color:blue;'>2001:678:d78::3:2:0/125</span> |
|
||||
@ -305,7 +306,7 @@ switches, I will announce:
|
||||
towards DNS64-rewritten destinations, for example 2001:678:d78:564::8c52:7903 as DNS64 representation
|
||||
of github.com, which is reachable only at legacy address 140.82.121.3.
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/nat64/brain.png" alt="Brain" >}}
|
||||
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}
|
||||
|
||||
I have to be careful with the announcements into OSPF. The cost of E1 routes is the cost of the
|
||||
external metric **in addition to** the internal cost within OSPF to reach that network. The cost
|
||||
|
@ -250,10 +250,10 @@ remove the IPv4 and IPv6 addresses from the <span style='color:red;font-weight:b
|
||||
routers in Brüttisellen. They are directly connected, and if anything goes wrong, I can walk
|
||||
over and rescue them. Sounds like a safe way to start!
|
||||
|
||||
I quickly add the ability for [[vppcfg](https://github.com/pimvanpelt/vppcfg)] to configure
|
||||
I quickly add the ability for [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] to configure
|
||||
_unnumbered_ interfaces. In VPP, these are interfaces that don't have an IPv4 or IPv6 address of
|
||||
their own, but they borrow one from another interface. If you're curious, you can take a look at the
|
||||
[[User Guide](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
|
||||
[[User Guide](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
|
||||
GitHub.
|
||||
|
||||
Looking at their `vppcfg` files, the change is actually very easy, taking as an example the
|
||||
@ -280,7 +280,7 @@ By commenting out the `addresses` field, and replacing it with `unnumbered: loop
|
||||
vppcfg to make Te6/0/0, which in Linux is called `xe1-0`, borrow its addresses from the loopback
|
||||
interface `loop0`.
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/freebsd-vpp/brain.png" alt="brain" >}}
|
||||
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
Planning and applying this is straight forward, but there's one detail I should
|
||||
mention. In my [[previous article]({{< ref "2024-04-06-vpp-ospf" >}})] I asked myself a question:
|
||||
@ -291,7 +291,7 @@ interface.
|
||||
|
||||
In the article, you'll see that discussed as _Solution 2_, and it includes a bit of rationale why I
|
||||
find this better. I implemented it in this
|
||||
[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
|
||||
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
|
||||
case you're curious, and the commandline keyword is `lcp lcp-sync-unnumbered off` (the default is
|
||||
_on_).
|
||||
|
||||
|
@ -292,7 +292,7 @@ transmitting, or performing both receiving *and* transmitting.
|
||||
|
||||
### Intel X520 (10GbE)
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/oem-switch/warning.png" alt="Warning" >}}
|
||||
{{< image width="100px" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
This network card is based on the classic Intel _Niantic_ chipset, also known as the 82599ES chip,
|
||||
first released in 2009. It's super reliable, but there is one downside. It's a PCIe v2.0 device
|
||||
@ -462,7 +462,7 @@ ip4-rewrite active 14845221 35913927 0 8.9
|
||||
unix-epoll-input polling 22551 0 0 1.37e3 0.00
|
||||
```
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
|
||||
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}
|
||||
|
||||
I kind of wonder why that is. Is the Mellanox Connect-X3 such a poor performer? Or does it not like
|
||||
small packets? I've read online that Mellanox cards do some form of message compression on the PCI
|
||||
|
@ -407,7 +407,7 @@ loadtest:
|
||||
|
||||
{{< image src="/assets/gowin-n305/cx5-cpu-rdma1q.png" alt="Cx5 CPU with 1Q" >}}
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/vpp-babel/brain.png" alt="Brain" >}}
|
||||
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Brain" >}}
|
||||
|
||||
Here I can clearly see that the one CPU thread (in yellow for unidirectional) and the two CPU
|
||||
threads (one for each of the bidirectional flows) jump up to 100% and stay there. This means that
|
||||
|
content/articles/2024-08-12-jekyll-hugo.md (new file, 452 lines)

@@ -0,0 +1,452 @@
|
||||
---
|
||||
date: "2024-08-12T09:01:23Z"
|
||||
title: 'Case Study: From Jekyll to Hugo'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image width="16em" float="right" src="/assets/jekyll-hugo/before.png" alt="ipng.nl before" >}}
|
||||
|
||||
In the _before-days_, I had a very modest personal website running on [[ipng.nl](https://ipng.nl)]
|
||||
and [[ipng.ch](https://ipng.ch/)]. Over the years I've had quite a few different designs, and
|
||||
although one of them was hosted (on Google Sites) for a brief moment, they were mostly very much web
|
||||
1.0, "The 90s called, they wanted their website back!" style.
|
||||
|
||||
The site didn't have much other than a little blurb on a few open source projects of mine, and a
|
||||
gallery hosted on PicasaWeb [which Google subsequently turned down], and a mostly empty Blogger
|
||||
page. Can you imagine that I hand-typed the XHTML and CSS for this website, where each menu entry at
the top (things like `Home` - `Resume` - `History` - `Articles`) was just an HTML page which
meticulously linked to the other HTML pages? It was the way of the world in the 1990s.
|
||||
|
||||
## Jekyll
|
||||
|
||||
{{< image width="9em" float="right" src="/assets/jekyll-hugo/jekyll-logo.png" alt="Jekyll" >}}
|
||||
|
||||
My buddy Michal suggested in May of 2021 that, if I was going to write all of the HTML skeleton by
|
||||
hand, I may as well switch to a static website generator. He's fluent in Ruby, and suggested I take
|
||||
a look at [[Jekyll](https://jekyllrb.com/)], a static site generator. It takes text written in
|
||||
your favorite markup language and uses layouts to create a static website. You can tweak the site’s
|
||||
look and feel, URLs, the data displayed on the page, and more.
|
||||
|
||||
I immediately fell in love! As an experiment, I moved [[IPng.ch](https://ipng.ch)] to a new
|
||||
webserver, and kept my personal website on [[IPng.nl](https://ipng.nl)]. I had always wanted to
|
||||
write a little bit more about technology, and since I was working on an interesting project [[Linux
|
||||
Control Plane]({{< ref 2021-08-12-vpp-1 >}})] in VPP, I thought it'd be nice to write a little bit
|
||||
about it, but certainly not while hand-crafting all of the HTML exoskeleton. I just wanted to write
|
||||
Markdown, and this is precisely the _raison d'être_ of Jekyll!
|
||||
|
||||
Since April 2021, I wrote in total 67 articles with Jekyll. Some of them proved to become quite
|
||||
popular, and (_humblebrag_) my website is widely considered one of the best resources for Vector
|
||||
Packet Processing, with my [[VPP]({{< ref 2021-09-21-vpp-7 >}})] series, [[MPLS]({{< ref
|
||||
2023-05-07-vpp-mpls-1 >}})] series and a few others like the [[Mastodon]({{< ref
|
||||
2022-11-20-mastodon-1 >}})] series being amongst some of the top visited articles, with ~7.5-8K
|
||||
monthly unique visitors.
|
||||
|
||||
## The catalyst
|
||||
|
||||
There were two distinct events that led up to this. Firstly, I started a side project called [[Free
|
||||
IX](https://free-ix.ch/)], which I also created in Jekyll. When I did that, I branched the
|
||||
[[IPng.ch](https://ipng.ch)] site, but the build failed with Ruby errors. My buddy Antonios fixed
|
||||
those, and we were underway. Secondly, later on I attempted to upgrade the IPng website to the same
|
||||
fixes that Antonios had provided for Free-IX, and all hell broke loose (luckily, only in staging
|
||||
environment). I spent several hours pulling my hair out re-assembling the dependencies, downgrading
|
||||
Jekyll, pulling new `gems`, downgrading `ruby`. Finally, I got it to work again, only to see after
|
||||
my first production build, that the build immediately failed because the Docker container that does
|
||||
the build no longer liked what I had put in the `Gemfile` and `_config.yml`. It was something to do
|
||||
with the `sass-embedded` gem, and I spent waaaay too long fixing this incredibly frustrating breakage.
|
||||
|
||||
## Hugo
|
||||
|
||||
{{< image width="9em" float="right" src="/assets/jekyll-hugo/hugo-logo-wide.svg" alt="Hugo" >}}
|
||||
|
||||
When I made my roadtrip from Zurich to the North Cape with my buddy Paul, we took extensive notes on
|
||||
our daily travels, and put them on a [[2022roadtripnose](https://2022roadtripnose.weirdnet.nl/)]
|
||||
website. At the time, I was looking for a photo carousel for Jekyll, and while I found a few, none
|
||||
of them really worked in the way I wanted them to. I stumbled across [[Hugo](https://gohugo.io)],
|
||||
which says on its website that it is one of the most popular open-source static site generators.
|
||||
With its amazing speed and flexibility, Hugo makes building websites fun again. So I dabbled a bit
|
||||
and liked what I saw. I used the [[notrack](https://github.com/gevhaz/hugo-theme-notrack)] theme from
|
||||
GitHub user `@gevhaz`, as they had made a really nice gallery widget (called a `shortcode` in Hugo).
|
||||
|
||||
The main reason for me to move to Hugo is that it is a **standalone Go** program, with no runtime or
|
||||
build time dependencies. The Hugo [[GitHub](https://github.com/gohugoio/hugo)] delivers ready to go
|
||||
build artifacts, tests and releases regularly, and has a vibrant user community.
|
||||
|
||||
### Migrating
|
||||
|
||||
I have only a few strong requirements if I am to move my website:
|
||||
|
||||
1. The site's URL namespace MUST be *identical* (not just similar) to Jekyll. I do not want to
|
||||
lose my precious ranking on popular search engines.
|
||||
1. MUST be built in a CI/CD tool like Drone or Jenkins, and autodeploy
|
||||
1. Code MUST be _hermetic_, not pulling in external dependencies, neither in the build system (eg.
|
||||
Hugo itself) nor the website (eg. dependencies, themes, etc).
|
||||
1. Theme MUST support images, videos and SHOULD support asciinema.
|
||||
1. Theme SHOULD try to look very similar to the current Jekyll `minima` theme.
|
||||
|
||||
|
||||
#### Attempt 1: Auto import ❌
|
||||
|
||||
With that in mind, I notice that Hugo has a site _importer_, that can import a site from Jekyll! I
|
||||
run it, but it produces completely broken code, and Hugo doesn't even want to compile the site. This
|
||||
turns out to be a _theme_ issue, so I take Hugo's advice and install the recommended theme. The site
|
||||
comes up, but is pretty screwed up. I now realize that `hugo import jekyll` imports the markdown
|
||||
as-is, and only rewrites the _frontmatter_ (the little blurb of YAML metadata at the top of each
|
||||
file). Two notable problems:
|
||||
|
||||
**1. images** - I make liberal use of Markdown images, which in Jekyll can be decorated with CSS
|
||||
styling, like so:
|
||||
```
|
||||
{: style="width:200px; float: right; margin: 1em;"}
|
||||
```
|
||||
|
||||
**2. post_url** - Another widely used feature is cross-linking my own articles, using Jekyll
|
||||
template expansion, like so:
|
||||
```
|
||||
.. Remember in my [[VPP Babel]({% post_url 2024-03-06-vpp-babel-1 %})] ..
|
||||
```
|
||||
|
||||
I do some grepping, and count 246 such Jekyll template expansions and 272 images. OK, that's a dud.
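For the record, the counting went roughly like this (a sketch; the exact regexes and the directory
are assumptions, the counts are the ones quoted above):

```
$ grep -rEo '\{% ?post_url ' . --include='*.md' | wc -l
246
$ grep -rEo '^!\[.*\]\(.*\)\{:' . --include='*.md' | wc -l
272
```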
|
||||
|
||||
#### Attempt 2: Skeleton ✅
|
||||
|
||||
I decide to do this one step at a time. First, I create a completely new website `hugo new site
|
||||
ipng.ch`, download the `notrack` theme, and add only the front page `index.md` from the
|
||||
original IPng site. OK, that renders.
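In shell terms, that skeleton looks roughly like this (the paths and the theme checkout location are
assumptions; per requirement 3 the theme gets vendored into the repository rather than fetched at
build time):

```
$ hugo new site ipng.ch && cd ipng.ch
$ git clone https://github.com/gevhaz/hugo-theme-notrack themes/notrack
$ echo "theme = 'notrack'" >> hugo.toml
$ cp ~/src/ipng-jekyll/index.md content/_index.md
$ hugo server
```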
|
||||
|
||||
Now comes a fun part: going over the `notrack` theme's SCSS to adjust it to look and feel similar to
|
||||
the Jekyll `minima` theme. I change a bunch of stuff in the skeleton of the website:
|
||||
|
||||
First, I take a look at the site media breakpoints, so they feel correct for desktop screen, tablet
|
||||
screen and iPhone/Android screens. Then, I inspect the font family, size and H1/H2/H3...
|
||||
magnifications, also scaling them with media size. Finally I notice the footer, which in `notrack`
|
||||
spans the whole width of the browser. I change it to be as wide as the header and main page.
|
||||
|
||||
I go over the site's main pages one by one and, just as on the Jekyll site, I make them into menu
|
||||
items at the top of the page. The [[Services]({{< ref services >}})] page serves as my proof of
|
||||
concept, as it has both the `image` and the `post_url` pattern in Jekyll. It references six articles
|
||||
and has two images which float on the right side of the canvas. If I can figure out how to rewrite
|
||||
these to fit the Hugo variants of the same pattern, I should be home free.
|
||||
|
||||
### Hugo: image
|
||||
|
||||
The idiomatic way in `notrack` is an `image` shortcode. I hope you know where to find the curly
|
||||
braces on your keyboard - because geez, Hugo templating sure does like them!
|
||||
|
||||
```
|
||||
<figure class="image-shortcode{{ with .Get "class" }} {{ . }}{{ end }}
|
||||
{{- with .Get "wide" }}{{- if eq . "true" }} wide{{ end -}}{{ end -}}
|
||||
{{- with .Get "frame" }}{{- if eq . "true" }} frame{{ end -}}{{ end -}}
|
||||
{{- with .Get "float" }} {{ . }}{{ end -}}"
|
||||
style="
|
||||
{{- with .Get "width" }}width: {{ . }};{{ end -}}
|
||||
{{- with .Get "height" }}height: {{ . }};{{ end -}}">
|
||||
{{- if .Get "link" -}}
|
||||
<a href="{{ .Get "link" }}"{{ with .Get "target" }} target="{{ . }}"{{ end -}}
|
||||
{{- with .Get "rel" }} rel="{{ . }}"{{ end }}>
|
||||
{{- end }}
|
||||
<img src="{{ .Get "src" | relURL }}"
|
||||
{{- if or (.Get "alt") (.Get "caption") }}
|
||||
alt="{{ with .Get "alt" }}{{ replace . "'" "'" }}{{ else -}}
|
||||
{{- .Get "caption" | markdownify| plainify }}{{ end }}"
|
||||
{{- end -}}
|
||||
/> <!-- Closing img tag -->
|
||||
{{- if .Get "link" }}</a>{{ end -}}
|
||||
{{- if or (or (.Get "title") (.Get "caption")) (.Get "attr") -}}
|
||||
<figcaption>
|
||||
{{ with (.Get "title") -}}
|
||||
<h4>{{ . }}</h4>
|
||||
{{- end -}}
|
||||
{{- if or (.Get "caption") (.Get "attr") -}}<p>
|
||||
{{- .Get "caption" | markdownify -}}
|
||||
{{- with .Get "attrlink" }}
|
||||
<a href="{{ . }}">
|
||||
{{- end -}}
|
||||
{{- .Get "attr" | markdownify -}}
|
||||
{{- if .Get "attrlink" }}</a>{{ end }}</p>
|
||||
{{- end }}
|
||||
</figcaption>
|
||||
{{- end }}
|
||||
</figure>
|
||||
```
|
||||
|
||||
From the top - Hugo creates a figure with a certain set of classes, the default `image-shortcode`
|
||||
but also classes for `frame`, `wide` and `float` to further decorate the image. Then it applies
|
||||
direct styling for `width` and `height`, optionally inserts a link (something I had missed out on in
|
||||
Jekyll), then inlines the `<img>` tag with an `alt` or (markdown based!) `caption`. It then reuses
|
||||
the `caption` or `title` or `attr` variables to assemble a `<figcaption>` block. I absolutely love it!
|
||||
|
||||
I've rather consistently placed my images by themselves, on a single line, and they all have at
|
||||
least one style (be it `width`, or `float`), so it's really straightforward to rewrite this with a
|
||||
little bit of Python:
|
||||
|
||||
```
|
||||
import re
import sys

def convert_image(line):
|
||||
p = re.compile(r'^!\[(.+)\]\((.+)\){:\s*(.*)}')
|
||||
m = p.match(line)
|
||||
if not m:
|
||||
return False
|
||||
|
||||
alt=m.group(1)
|
||||
src=m.group(2)
|
||||
style=m.group(3)
|
||||
|
||||
image_line = "{{</* image "
|
||||
if sm := re.search(r'width:\s*(\d+px)', style):
|
||||
image_line += f'width="{sm.group(1)}" '
|
||||
if sm := re.search(r'float:\s*(\w+)', style):
|
||||
image_line += f'float="{sm.group(1)}" '
|
||||
image_line += f'src="{src}" alt="{alt}" */>}}}}'
|
||||
|
||||
print(image_line)
|
||||
return True
|
||||
|
||||
with open(sys.argv[1], "r", encoding="utf-8") as file_handle:
|
||||
for line in file_handle.readlines():
|
||||
if not convert_image(line):
|
||||
print(line.rstrip())
|
||||
```
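
To make that concrete, here is an illustrative before/after (the input line is hypothetical, not
lifted from an actual article), showing what the script emits for a typical Jekyll image:

```
Before: ![Example diagram](/assets/example/diagram.png){: style="width:300px; float: right;"}
After:  {{</* image width="300px" float="right" src="/assets/example/diagram.png" alt="Example diagram" */>}}
```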
|
||||
|
||||
### Hugo: ref
|
||||
|
||||
In Hugo, the idiomatic way to reference another document in the corpus is with the builtin `ref`
|
||||
shortcode, requiring a single argument: the path to a content document, with or without a file
|
||||
extension, with or without an anchor. Paths without a leading / are first resolved relative to the
|
||||
current page, then to the remainder of the site. This is super cool, because I can essentially
|
||||
reference any file by just its name!
|
||||
|
||||
```
|
||||
for fn in $(find content/ -name \*.md); do
|
||||
sed -i -r 's/{%[ ]?post_url (.*)[ ]?%}/{{</* ref \1 */>}}/' $fn
|
||||
done
|
||||
```
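
For example, a Jekyll cross-reference like the following (illustrative line, not verbatim from the
site) ends up as the equivalent Hugo `ref` shortcode:

```
Before: [[VPP Linux CP - Part1]({% post_url 2021-08-12-vpp-1 %})]
After:  [[VPP Linux CP - Part1]({{</* ref 2021-08-12-vpp-1 */>}})]
```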
|
||||
|
||||
And with that, the converted markdown from Jekyll renders perfectly in Hugo. Of course, other sites
|
||||
may use other templating commands, but for [[IPng.ch](https://ipng.ch)], these were the only two
|
||||
special cases.
|
||||
|
||||
### Hugo: URL redirects
|
||||
|
||||
It is a hard requirement for me to keep the same URLs that I had from Jekyll. Luckily, this is a
|
||||
trivial matter for Hugo, as it supports URL aliases in the _frontmatter_. Jekyll will add a file
|
||||
extension to the article _slugs_, while Hugo uses only the directory and serves an `index.html` from
|
||||
it. Also, the default for Hugo is to put content in a different directory.
|
||||
|
||||
The first change I make is to the main `hugo.toml` config file:
|
||||
|
||||
```
|
||||
[permalinks]
|
||||
articles = "/s/articles/:year/:month/:day/:slug"
|
||||
```
|
||||
|
||||
That solves the main directory problem, as back then, I chose `s/articles/` in Jekyll. Then, adding
|
||||
the URL redirect is a simple matter of looking up which filename Jekyll ultimately used, and adding
|
||||
a little frontmatter at the top of each article, for example my [[VPP #1]({{< ref
|
||||
2021-08-12-vpp-1 >}})] article would get this addition:
|
||||
|
||||
```
|
||||
---
|
||||
date: "2021-08-12T11:17:54Z"
|
||||
title: VPP Linux CP - Part1
|
||||
aliases:
|
||||
- /s/articles/2021/08/12/vpp-1.html
|
||||
---
|
||||
```
|
||||
|
||||
Hugo by default renders it in `/s/articles/2021/08/12/vpp-linux-cp-part1/index.html` but the
|
||||
addition of the `alias` makes it also generate a drop-in placeholder HTML page that offers a
permanent redirect, cleverly setting `noindex` for web crawlers and pointing the `canonical` link
at the new location:
|
||||
|
||||
```
|
||||
$ curl https://ipng.ch/s/articles/2021/08/12/vpp-1.html
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-us">
|
||||
<head>
|
||||
<title>https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/</title>
|
||||
<link rel="canonical" href="https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/">
|
||||
<meta name="robots" content="noindex">
|
||||
<meta charset="utf-8">
|
||||
<meta http-equiv="refresh" content="0; url=https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/">
|
||||
</head>
|
||||
</html>
|
||||
```
|
||||
|
||||
### Hugo: Asciinema
|
||||
|
||||
One thing that I always wanted to add is the ability to inline [[Asciinema](https://asciinema.org)]
|
||||
screen recordings. First, I take a look at what is needed to serve Asciinema: One Javascript file,
|
||||
and one CSS file, followed by a named `<div>` which invokes the Javascript. Armed with that
|
||||
knowledge, I dive into the `shortcode` language a little bit:
|
||||
|
||||
```
|
||||
$ cat themes/hugo-theme-ipng/layouts/shortcodes/asciinema.html
|
||||
<div id='{{ .Get "src" | replaceRE "[[:^alnum:]]" "" }}'></div>
|
||||
<script>
|
||||
AsciinemaPlayer.create("{{ .Get "src" }}",
|
||||
document.getElementById('{{ .Get "src" | replaceRE "[[:^alnum:]]" "" }}'));
|
||||
</script>
|
||||
```
|
||||
|
||||
This file creates the `id` of the `<div>` by means of stripping all non-alphanumeric characters from
|
||||
the `src` argument of the _shortcode_. So if I were to create an `{{</* asciinema
|
||||
src='/casts/my.cast' */>}}`, the resulting DIV will be uniquely called `castsmycast`. This way, I
|
||||
can add multiple screencasts in the same document, which is dope.
|
||||
|
||||
But, as I now know, I need to load some CSS and JS so that the `AsciinemaPlayer` class becomes
|
||||
available. For this, I use a relatively new feature in Hugo, which allows for `params` to be set in
|
||||
the frontmatter, for example in the [[VPP OSPF #2]({{< ref 2024-06-22-vpp-ospf-2 >}})] article:
|
||||
|
||||
```
|
||||
---
|
||||
date: "2024-06-22T09:17:54Z"
|
||||
title: VPP with loopback-only OSPFv3 - Part 2
|
||||
aliases:
|
||||
- /s/articles/2024/06/22/vpp-ospf-2.html
|
||||
params:
|
||||
asciinema: true
|
||||
---
|
||||
```
|
||||
|
||||
The presence of that `params.asciinema` can be checked in any page, including the HTML skeleton of the
|
||||
theme, like so:
|
||||
|
||||
```
|
||||
$ cat themes/hugo-theme-ipng/layouts/partials/head.html
|
||||
<head>
|
||||
...
|
||||
{{ if eq .Params.asciinema true -}}
|
||||
<link rel="stylesheet" type="text/css" href="{{ "css/asciinema-player.css" | relURL }}" />
|
||||
<script src="{{ "js/asciinema-player.min.js" | relURL }}"></script>
|
||||
{{- end }}
|
||||
</head>
|
||||
```
|
||||
|
||||
Now all that's left for me to do is drop the two Asciinema player files in their respective theme
|
||||
directories, and for each article that wants to use an Asciinema, set the `param` and it'll ship the
|
||||
CSS and Javascript to the browser. I think I'm going to have a good relationship with Hugo :)
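
For the record, 'their respective theme directories' here means the theme's static asset folders,
which Hugo serves from the site root. Assuming the two player files have been downloaded into the
current directory, that is something like:

```
$ cp asciinema-player.css themes/hugo-theme-ipng/static/css/
$ cp asciinema-player.min.js themes/hugo-theme-ipng/static/js/
```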
|
||||
|
||||
### Gitea: Large File Support
|
||||
|
||||
One mistake I made with the old Jekyll-based website is that I checked in all of the images and
|
||||
binary files directly into Git. This bloats the repository and is otherwise completely unnecessary.
|
||||
For this new repository, I enable [[Git LFS](https://git-lfs.com/)], which is available for OpenBSD
|
||||
(packages), Debian (apt) and MacOS (homebrew). Turning this on is very simple:
|
||||
|
||||
```
|
||||
$ brew install git-lfs
|
||||
$ cd ipng.ch
|
||||
$ git lfs install
|
||||
$ for i in gz png gif jpg jpeg tgz zip; do
    git lfs track "*.$i"
    git lfs migrate import --everything --include="*.$i"
  done
|
||||
$ git push --force --all
|
||||
```
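
To double-check the result, `git lfs` can list both the tracked patterns and the objects that now
live in LFS:

```
$ git lfs track
$ git lfs ls-files | head
```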
|
||||
|
||||
The LFS import rewrites the history of the repo to reference the binary blobs in LFS instead of
storing them directly, and the `force` push publishes that rewritten history. As a result, the size
of the repository greatly shrinks, and handling it
|
||||
becomes easier once it grows. A really nice feature!
|
||||
|
||||
### Gitea: CI/CD with Drone
|
||||
|
||||
At IPng, I run a [[Gitea](https://gitea.io)] server, which is one of the coolest pieces of open
|
||||
source that I use on a daily basis. There's a very clean integration of a continuous integration
|
||||
tool called [[Drone](https://drone.io/)] and these two tools are literally made for each other.
|
||||
Drone can be enabled for any Git repo in Gitea, and given the presence of a `.drone.yml` file,
|
||||
execute a set of steps upon repository events, called _triggers_. It can then run a sequence of
|
||||
steps, hermetically in a Docker container called a _drone-runner_, which first checks out the
|
||||
repository at the latest commit, and then does whatever I'd like with it. I'd like to build and
|
||||
distribute a Hugo website, please!
|
||||
|
||||
As it turns out, there is a [[Drone Hugo](https://plugins.drone.io/plugins/hugo)] plugin available,
|
||||
but it seems to be very outdated. Luckily, this being open source and all, I can download the source
|
||||
on [[GitHub](https://github.com/drone-plugins/drone-hugo)], and in the `Dockerfile`, bump the Alpine
|
||||
version, the Go version and build the latest Hugo release, which is 0.130.1 at the moment. I really
|
||||
do need this version, because the `params` feature was introduced in 0.123 and the upstream package
|
||||
is still for 0.77 -- which is about four years old. Ouch!
|
||||
|
||||
I build a docker image and upload it to my private repo at IPng which is hosted as well on Gitea, by
|
||||
the way. As I said, it really is a great piece of kit! In case anybody else would like to give it a
|
||||
whirl, ping me on Mastodon or e-mail and I'll upload one to public Docker Hub as well.
|
||||
|
||||
### Putting it all together
|
||||
|
||||
With Drone activated for this repo, and the Drone Hugo plugin built with a new version, I can submit
|
||||
the following file to the root directory of the `ipng.ch` repository:
|
||||
|
||||
|
||||
```
|
||||
$ cat .drone.yml
|
||||
kind: pipeline
|
||||
name: default
|
||||
|
||||
steps:
|
||||
- name: git-lfs
|
||||
image: alpine/git
|
||||
commands:
|
||||
- git lfs install
|
||||
- git lfs pull
|
||||
- name: build
|
||||
image: git.ipng.ch/ipng/drone-hugo:release-0.130.0
|
||||
settings:
|
||||
hugo_version: 0.130.0
|
||||
extended: true
|
||||
- name: rsync
|
||||
image: drillster/drone-rsync
|
||||
settings:
|
||||
user: drone
|
||||
key:
|
||||
from_secret: drone_sshkey
|
||||
hosts:
|
||||
- nginx0.chrma0.net.ipng.ch
|
||||
- nginx0.chplo0.net.ipng.ch
|
||||
- nginx0.nlams1.net.ipng.ch
|
||||
- nginx0.nlams2.net.ipng.ch
|
||||
port: 22
|
||||
args: '-6u --delete-after'
|
||||
source: public/
|
||||
target: /var/www/ipng.ch/
|
||||
recursive: true
|
||||
secrets: [ drone_sshkey ]
|
||||
|
||||
image_pull_secrets:
|
||||
- git_ipng_ch_docker
|
||||
```
|
||||
|
||||
The file is relatively self-explanatory. Before my first step runs, Drone already checks out the
|
||||
repo in the current working directory of the Docker container. I then use the `alpine/git` image
|
||||
and run the `git lfs install` and `git lfs pull` commands to resolve the LFS symlinks into actual
|
||||
files by pulling those objects that are referenced (and, notably, not all historical versions of any
|
||||
binary file ever added to the repo).
|
||||
|
||||
Then, I run a step called `build` which invokes the Hugo Drone package that I created before.
|
||||
|
||||
Finally, I run a step called `rsync` which uses package `drillster/drone-rsync` to rsync-over-ssh
|
||||
the files to the four NGINX servers running at IPng: two in Amsterdam, one in Geneva and one in
|
||||
Zurich.
|
||||
|
||||
One really cool feature is the use of so called _Drone Secrets_ which are references to locked
|
||||
secrets such as the SSH key, and, notably, the Docker Repository credentials, because Gitea at IPng
|
||||
does not run a public docker repo. Using secrets is nifty, because it allows me to safely check in the
|
||||
`.drone.yml` configuration file without leaking any specifics.
|
||||
|
||||
### NGINX and SSL
|
||||
|
||||
Now that the website is automatically built and rsync'd to the webservers upon every `git merge`,
|
||||
all that's left for me to do is arm the webservers with SSL certificates. I actually wrote a whole
|
||||
story about specifically that, as for `*.ipng.ch` and `*.ipng.nl` and a bunch of others,
|
||||
periodically there is a background task that retrieves multiple wildcard certificates with Let's
|
||||
Encrypt, and distributes them to any server that needs them (like the NGINX cluster, or the Postfix
|
||||
cluster). I wrote about the [[Frontends]({{< ref 2023-03-17-ipng-frontends >}})], the spiffy
|
||||
[[DNS-01]({{< ref 2023-03-24-lego-dns01.md >}})] certificate subsystem, and the internal network
|
||||
called [[IPng Site Local]({{< ref 2023-03-11-mpls-core >}})] each in their own articles, so I won't
|
||||
repeat that information here.
|
||||
|
||||
## The Results
|
||||
|
||||
The results are really cool, as I'll demonstrate in this video. I can just submit and merge this
|
||||
change, and it'll automatically kick off a build and push. Take a look at this video which was
|
||||
performed in real time as I pushed this very article live:
|
||||
|
||||
{{< video src="https://ipng.ch/media/vdo/hugo-drone.mp4" >}}
|
725
content/articles/2024-09-08-sflow-1.md
Normal file
@ -0,0 +1,725 @@
|
||||
---
|
||||
date: "2024-09-08T12:51:23Z"
|
||||
title: 'VPP with sFlow - Part 1'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
|
||||
|
||||
In January of 2023, an uncomfortably long time ago at this point, an acquaintance of mine called
|
||||
Ciprian reached out to me after seeing my [[DENOG
|
||||
#14](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)] presentation. He was interested to learn about
|
||||
IPFIX and was asking if sFlow would be an option. At the time, there was a plugin in VPP called
|
||||
[[flowprobe](https://s3-docs.fd.io/vpp/24.10/cli-reference/clis/clicmd_src_plugins_flowprobe.html)]
|
||||
which is able to emit IPFIX records. Unfortunately, I never really got it to work well in my tests,
|
||||
as either the records were corrupted, sub-interfaces didn't work, or the plugin would just crash the
|
||||
dataplane entirely. In the meantime, the folks at [[Netgate](https://netgate.com/)] submitted quite
|
||||
a few fixes to flowprobe, but it remains an expensive operation computationally. Wouldn't copying
|
||||
one in a thousand or ten thousand packet headers with flow _sampling_ be just as good?
|
||||
|
||||
In the months that followed, I discussed the feature with the incredible folks at
|
||||
[[inMon](https://inmon.com/)], the original designers and maintainers of the sFlow protocol and
|
||||
toolkit. Neil from inMon wrote a prototype and put it on [[GitHub](https://github.com/sflow/vpp)]
|
||||
but for lack of time I didn't manage to get it to work, which was largely my fault by the way.
|
||||
|
||||
However, I have a bit of time on my hands in September and October, and just a few weeks ago,
|
||||
my buddy Pavel from [[FastNetMon](https://fastnetmon.com/)] pinged that very dormant thread about
|
||||
sFlow being a potentially useful tool for anti-DDoS protection using VPP. And I very much agree!
|
||||
|
||||
## sFlow: Protocol
|
||||
|
||||
Maintenance of the protocol is performed by the [[sFlow.org](https://sflow.org/)] consortium, the
|
||||
authoritative source of the sFlow protocol specifications. The current version of sFlow is v5.
|
||||
|
||||
sFlow, short for _sampled Flow_, works at the ethernet layer of the stack, where it inspects one in
|
||||
N datagrams (typically 1:1000 or 1:10000) going through the physical network interfaces of a device.
|
||||
On the device, an **sFlow Agent** does the sampling. For each sample the Agent takes, the first M
|
||||
bytes (typically 128) are copied into an sFlow Datagram. Sampling metadata is added, such as
|
||||
the ingress (or egress) interface and sampling process parameters. The Agent can then optionally add
|
||||
forwarding information (such as router source- and destination prefix, MPLS LSP information, BGP
|
||||
communities, and what-not). Finally, the Agent will periodically read the octet and packet counters of
|
||||
physical network interface(s). Ultimately, the Agent will send the samples and additional
|
||||
information over the network as a UDP datagram, to an **sFlow Collector** for further processing.
|
||||
|
||||
sFlow has been specifically designed to take advantages of the statistical properties of packet
|
||||
sampling and can be modeled using statistical sampling theory. This means that the sFlow traffic
|
||||
monitoring system will always produce statistically quantifiable measurements. You can read more
|
||||
about it in Peter Phaal and Sonia Panchen's
|
||||
[[paper](https://sflow.org/packetSamplingBasics/index.htm)], I certainly did and my head spun a
|
||||
little bit at the math :)
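
As a back-of-the-envelope illustration (my own, not taken from the paper): the relative error of a
traffic estimate built from `c` samples scales roughly with `1/sqrt(c)`, so the more samples fall
into a class of traffic, the tighter the estimate becomes:

```
c (samples in class)     ~relative error (1/sqrt(c))
                 100     ~10%
              10'000     ~1%
           1'000'000     ~0.1%
```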
|
||||
|
||||
### sFlow: Netlink PSAMPLE
|
||||
|
||||
sFlow is meant to be a very _lightweight_ operation for the sampling equipment. It can typically be
|
||||
done in hardware, but there also exist several software implementations. One very clever thing, I
|
||||
think, is decoupling the sampler from the rest of the Agent. The Linux kernel has a packet sampling
|
||||
API called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)], which
|
||||
allows _producers_ to send samples to a certain _group_, and then allows _consumers_ to subscribe to
|
||||
samples of a certain _group_. The PSAMPLE API uses
|
||||
[[NetLink](https://docs.kernel.org/userspace-api/netlink/intro.html)] under the covers. The cool
|
||||
thing, for me anyway, is that I have a little bit of experience with Netlink due to my work on VPP's
|
||||
[[Linux Control Plane]({{< ref 2021-08-25-vpp-4 >}})] plugin.
|
||||
|
||||
The idea here is that some **sFlow Agent**, notably a VPP plugin, will be taking periodic samples
|
||||
from the physical network interfaces, and producing Netlink messages. Then, some other program,
|
||||
notably outside of VPP, can consume these messages and further handle them, creating UDP packets
|
||||
with sFlow samples and counters and other information, and sending them to an **sFlow Collector**
|
||||
somewhere else on the network.
|
||||
|
||||
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Warning" >}}
|
||||
|
||||
There's a handy utility called [[psampletest](https://github.com/sflow/psampletest)] which can
|
||||
subscribe to these PSAMPLE netlink groups and retrieve the samples. The first time I used all of
|
||||
this stuff, I wasn't aware of this utility and I kept on getting errors. It turns out, there's a
|
||||
kernel module that needs to be loaded: `modprobe psample` and `psampletest` helpfully does that for
|
||||
you [[ref](https://github.com/sflow/psampletest/blob/main/psampletest.c#L799)], so just make sure
|
||||
the module is loaded and added to `/etc/modules` before you spend as many hours as I did pulling out
|
||||
hair.
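
In other words, something along these lines (Debian-style paths assumed) saves a lot of
head-scratching:

```
$ sudo modprobe psample
$ echo psample | sudo tee -a /etc/modules   # load it at boot, too
$ lsmod | grep psample
```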
|
||||
|
||||
## VPP: sFlow Plugin
|
||||
|
||||
For the purposes of my initial testing, I'll simply take a look at Neil's prototype on
|
||||
[[GitHub](https://github.com/sflow/vpp)] and see what I learn in terms of functionality and
|
||||
performance.
|
||||
|
||||
### sFlow Plugin: Anatomy
|
||||
|
||||
The design is purposefully minimal, to do all of the heavy lifting outside of the VPP dataplane. The
|
||||
plugin will create a new VPP _graph node_ called `sflow`, which the operator can insert after
|
||||
`device-input`, in other words, if enabled, the plugin will get a copy of all packets that are read
|
||||
from an input provider, such as `dpdk-input` or `rdma-input`. The plugin's job is to process the
|
||||
packet, and if it's not selected for sampling, just move it onwards to the next node, typically
|
||||
`ethernet-input`. Almost all of the interesting action is in `node.c`.
|
||||
|
||||
The kicker is that one in N packets will be selected for sampling, after which (see the sketch after this list):
|
||||
1. the ethernet header (`*en`) is extracted from the packet
|
||||
1. the input interface (`hw_if_index`) is extracted from the VPP buffer. Remember, sFlow works
|
||||
with physical network interfaces!
|
||||
1. if there are too many samples from this worker thread being worked on, it is discarded and an
|
||||
error counter is incremented. This protects the main thread from being slammed with samples if
|
||||
there are simply too many being fished out of the dataplane.
|
||||
1. Otherwise:
|
||||
* a new `sflow_sample_t` is created, with all the sampling process metadata filled in
|
||||
* the first 128 bytes of the packet are copied into the sample
|
||||
* an RPC is dispatched to the main thread, which will send the sample to the PSAMPLE channel
|
||||
|
||||
Both a debug CLI command and API call are added:
|
||||
|
||||
```
|
||||
sflow enable-disable <interface-name> [<sampling_N>]|[disable]
|
||||
```
|
||||
|
||||
Some observations:
|
||||
|
||||
First off, the sampling_N in Neil's demo is a global rather than per-interface setting. It would
|
||||
make sense to make this be per-interface, as routers typically have a mixture of 1G/10G and faster
|
||||
100G network cards available. It was a surprise when I set one interface to 1:1000 and the other to
|
||||
1:10000 and then saw the first interface change its sampling rate also. It's a small thing, and
|
||||
will not be an issue to change.
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
Secondly, sending the RPC to main uses `vl_api_rpc_call_main_thread()`, which
|
||||
requires a _spinlock_ in `src/vlibmemory/memclnt_api.c:649`. I'm somewhat worried that when many
|
||||
samples are sent from many threads, there will be lock contention and performance will suffer.
|
||||
|
||||
### sFlow Plugin: Functional
|
||||
|
||||
I boot up the [[IPng Lab]({{< ref 2022-10-14-lab-1 >}})] and install a bunch of sFlow tools on it,
|
||||
and make sure the `psample` kernel module is loaded. In this first test, I'll take a look at
table stakes. I compile VPP with the sFlow plugin, and enable that plugin in `startup.conf` on each
|
||||
of the four VPP routers. For reference, the Lab looks like this:
|
||||
|
||||
{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
|
||||
|
||||
What I'll do is start an `iperf3` server on `vpp0-3` and then hit it from `vpp0-0`, to generate
|
||||
a few TCP traffic streams back and forth, which will be traversing `vpp0-2` and `vpp0-1`, like so:
|
||||
|
||||
```
|
||||
pim@vpp0-3:~ $ iperf3 -s -D
|
||||
pim@vpp0-0:~ $ iperf3 -c vpp0-3.lab.ipng.ch -t 86400 -P 10 -b 10M
|
||||
```
|
||||
|
||||
### Configuring VPP for sFlow
|
||||
|
||||
While this `iperf3` is running, I'll log on to `vpp0-2` to take a closer look. The first thing I do
|
||||
is turn on packet sampling on `vpp0-2`'s interface that points at `vpp0-3`, which is `Gi10/0/1`, and
|
||||
the interface that points at `vpp0-0`, which is `Gi10/0/0`. That's easy enough, and I will use a
|
||||
sampling rate of 1:1000 as these interfaces are GigabitEthernet:
|
||||
|
||||
```
|
||||
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/0 1000
|
||||
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/1 1000
|
||||
root@vpp0-2:~# vppctl show run | egrep '(Name|sflow)'
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 5656 24168 0 9.01e2 4.27
|
||||
```
|
||||
|
||||
Nice! VPP inserted the `sflow` node between `dpdk-input` and `ethernet-input` where it can do its
|
||||
business. But is it sending data? To answer this question, I can first take a look at the
|
||||
`psampletest` tool:
|
||||
|
||||
```
|
||||
root@vpp0-2:~# psampletest
|
||||
pstest: modprobe psample returned 0
|
||||
pstest: netlink socket number = 1637
|
||||
pstest: getFamily
|
||||
pstest: generic netlink CMD = 1
|
||||
pstest: generic family name: psample
|
||||
pstest: generic family id: 32
|
||||
pstest: psample attr type: 4 (nested=0) len: 8
|
||||
pstest: psample attr type: 5 (nested=0) len: 8
|
||||
pstest: psample attr type: 6 (nested=0) len: 24
|
||||
pstest: psample multicast group id: 9
|
||||
pstest: psample multicast group: config
|
||||
pstest: psample multicast group id: 10
|
||||
pstest: psample multicast group: packets
|
||||
pstest: psample found group packets=10
|
||||
pstest: joinGroup 10
|
||||
pstest: received Netlink ACK
|
||||
pstest: joinGroup 10
|
||||
pstest: set headers...
|
||||
pstest: serialize...
|
||||
pstest: print before sending...
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=7 out=9 n=1000 seq=1 pktlen=1514 hdrlen=31 pkt=0x558c08ba4958 q=3 depth=33333333 delay=123456
|
||||
pstest: send...
|
||||
pstest: send_psample getuid=0 geteuid=0
|
||||
pstest: sendmsg returned 140
|
||||
pstest: free...
|
||||
pstest: start read loop...
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600320 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600321 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600322 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=2 out=0 n=1000 seq=600423 pktlen=66 hdrlen=70 pkt=0x7ffdb0d5a1e8 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=1 out=0 n=1000 seq=600324 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
|
||||
```
|
||||
|
||||
I am amazed! The `psampletest` output shows a few packets. Considering I'm asking `iperf3` to push
100Mbit using 9000-byte jumboframes (which would be something like 1400 packets/second), I can
|
||||
expect two or three samples per second. I immediately notice a few things:
|
||||
|
||||
***1. Network Namespace***: The Netlink sampling channel belongs to a network _namespace_. The VPP
|
||||
process is running in the _default_ netns, so its PSAMPLE netlink messages will be in that namespace.
|
||||
Thus, the `psampletest` and other tools must also run in that namespace. I mention this because in
|
||||
Linux CP, often times the controlplane interfaces are created in a dedicated `dataplane` network
|
||||
namespace.
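
A quick way to check which namespace a shell is in, and to run the consumer inside a specific one if
the agent were to live there (the `dataplane` name below is just the convention I use for Linux CP):

```
$ ip netns identify $$                         # empty output means the default namespace
$ sudo ip netns exec dataplane psampletest     # only needed if the agent ran in that netns
```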
|
||||
|
||||
***2. pktlen and hdrlen***: The pktlen is wrong, and this is a bug. In VPP, packets are put into
|
||||
buffers of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for
|
||||
the same packet. The packet length here ought to be 9000 in one direction. Looking at the `in=2`
|
||||
packet with length 66, that looks like a legitimate ACK packet on the way back. But why is the
|
||||
hdrlen set to 70 there? I'm going to want to ask Neil about that.
|
||||
|
||||
***3. ingress and egress***: The `in=1` and one packet with `in=2` represent the input `hw_if_index`
|
||||
which is the ifIndex that VPP assigns to its devices. And looking at `show interfaces`, indeed
|
||||
number 1 corresponds with `GigabitEthernet10/0/0` and 2 is `GigabitEthernet10/0/1`, which checks
|
||||
out:
|
||||
```
|
||||
root@vpp0-2:~# vppctl show int
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
|
||||
GigabitEthernet10/0/0 1 up 9000/0/0/0 rx packets 469552764
|
||||
rx bytes 4218754400233
|
||||
tx packets 133717230
|
||||
tx bytes 8887341013
|
||||
drops 6050
|
||||
ip4 469321635
|
||||
ip6 225164
|
||||
GigabitEthernet10/0/1 2 up 9000/0/0/0 rx packets 133527636
|
||||
rx bytes 8816920909
|
||||
tx packets 469353481
|
||||
tx bytes 4218736200819
|
||||
drops 6060
|
||||
ip4 133489925
|
||||
ip6 29139
|
||||
|
||||
```
|
||||
|
||||
***4. ifIndexes are orthogonal***: These `in=1` or `in=2` ifIndex numbers are constructs of the VPP
|
||||
dataplane. Notably, VPP's numbering of interface index is strictly _orthogonal_ to Linux, and it's
|
||||
not guaranteed that there even _exists_ an interface in Linux for the PHY upon which the sampling is
|
||||
happening. Said differently, `in=1` here is meant to reference VPP's `GigabitEthernet10/0/0`
|
||||
interface, but in Linux, `ifIndex=1` is a completely different interface (`lo`) in the default
|
||||
network namespace. Similarly `in=2` for VPP's `Gi10/0/1` interface corresponds to interface `enp1s0`
|
||||
in Linux:
|
||||
|
||||
```
|
||||
root@vpp0-2:~# ip link
|
||||
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
|
||||
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
||||
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
|
||||
link/ether 52:54:00:f0:01:20 brd ff:ff:ff:ff:ff:ff
|
||||
```
|
||||
|
||||
***5. Counters***: sFlow periodically polls the interface counters for all interfaces. It will
|
||||
normally use `/proc/net/` entries for that, but there are two problems with this:
|
||||
|
||||
1. There may not exist a Linux representation of the interface, for example if it's only doing L2
|
||||
bridging or cross connects in the VPP dataplane, and it does not have a Linux Control Plane
|
||||
interface, or `linux-cp` is not used at all.
|
||||
|
||||
1. Even if it does exist and it's the "correct" ifIndex in Linux, for example if the _Linux
|
||||
Interface Pair_'s tuntap `host_vif_index` index is used, even then the statistics counters in the
|
||||
Linux representation will only count packets and octets of _punted_ packets, that is to say, the
|
||||
stuff that LinuxCP has decided need to go to the Linux kernel through the TUN/TAP device. Important
|
||||
to note that east-west traffic that goes _through_ the dataplane, is never punted to Linux, and as
|
||||
such, the counters will be undershooting: only counting traffic _to_ the router, not _through_ the
|
||||
router.
|
||||
|
||||
### VPP sFlow: Performance
|
||||
|
||||
Now that I've shown that Neil's proof of concept works, I will take a better look at the performance
|
||||
of the plugin. I've made a mental note that the plugin sends RPCs from worker threads to the main
|
||||
thread to marshall the PSAMPLE messages out. I'd like to see how expensive that is, in general. So,
|
||||
I boot two Dell R730 machines in IPng's Lab and put them to work. The first machine will run
Cisco's T-Rex loadtester with 8x 10Gbps ports (4x dual-port Intel 82599), while the second (identical)
machine will run VPP, also with 8x 10Gbps ports (2x Intel X710-DA4).
|
||||
|
||||
I will test a bunch of things in parallel. First off, I'll test L2 (xconnect) and L3 (IPv4 routing),
|
||||
and secondly I'll test that with and without sFlow turned on. This gives me 8 ports to configure,
|
||||
and I'll start with the L2 configuration, as follows:
|
||||
|
||||
```
|
||||
vpp# set int state TenGigabitEthernet3/0/2 up
|
||||
vpp# set int state TenGigabitEthernet3/0/3 up
|
||||
vpp# set int state TenGigabitEthernet130/0/2 up
|
||||
vpp# set int state TenGigabitEthernet130/0/3 up
|
||||
vpp# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
|
||||
vpp# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
|
||||
vpp# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
|
||||
vpp# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
|
||||
```
|
||||
|
||||
Then, the L3 configuration looks like this:
|
||||
|
||||
```
|
||||
vpp# lcp create TenGigabitEthernet3/0/0 host-if xe0-0
|
||||
vpp# lcp create TenGigabitEthernet3/0/1 host-if xe0-1
|
||||
vpp# lcp create TenGigabitEthernet130/0/0 host-if xe1-0
|
||||
vpp# lcp create TenGigabitEthernet130/0/1 host-if xe1-1
|
||||
vpp# set int state TenGigabitEthernet3/0/0 up
|
||||
vpp# set int state TenGigabitEthernet3/0/1 up
|
||||
vpp# set int state TenGigabitEthernet130/0/0 up
|
||||
vpp# set int state TenGigabitEthernet130/0/1 up
|
||||
vpp# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
|
||||
vpp# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
|
||||
vpp# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
|
||||
vpp# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
|
||||
vpp# ip route add 16.0.0.0/24 via 100.64.0.0
|
||||
vpp# ip route add 48.0.0.0/24 via 100.64.1.0
|
||||
vpp# ip route add 16.0.2.0/24 via 100.64.4.0
|
||||
vpp# ip route add 48.0.2.0/24 via 100.64.5.0
|
||||
vpp# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
|
||||
vpp# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
|
||||
vpp# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
|
||||
vpp# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
|
||||
```
|
||||
|
||||
And finally, the Cisco T-Rex configuration looks like this:
|
||||
|
||||
```
|
||||
- version: 2
|
||||
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
|
||||
port_info:
|
||||
- src_mac: 00:1b:21:06:00:00
|
||||
dest_mac: 9c:69:b4:61:a1:dc
|
||||
- src_mac: 00:1b:21:06:00:01
|
||||
dest_mac: 9c:69:b4:61:a1:dd
|
||||
|
||||
- src_mac: 00:1b:21:83:00:00
|
||||
dest_mac: 00:1b:21:83:00:01
|
||||
- src_mac: 00:1b:21:83:00:01
|
||||
dest_mac: 00:1b:21:83:00:00
|
||||
|
||||
- src_mac: 00:1b:21:87:00:00
|
||||
dest_mac: 9c:69:b4:61:75:d0
|
||||
- src_mac: 00:1b:21:87:00:01
|
||||
dest_mac: 9c:69:b4:61:75:d1
|
||||
|
||||
- src_mac: 9c:69:b4:85:00:00
|
||||
dest_mac: 9c:69:b4:85:00:01
|
||||
- src_mac: 9c:69:b4:85:00:01
|
||||
dest_mac: 9c:69:b4:85:00:00
|
||||
```
|
||||
|
||||
A little note on the use of `ip neighbor` in VPP and specific `dest_mac` in T-Rex. In L2 mode,
|
||||
because the VPP interfaces will be in promiscuous mode and simply pass through any ethernet frame
|
||||
received on interface `Te3/0/2` and copy it out on `Te3/0/3` and vice-versa, there is no need to
|
||||
tinker with MAC addresses. But in L3 mode, the NIC will only accept ethernet frames addressed to its
|
||||
MAC address, so you can see that for the first port in T-Rex, I am setting `dest_mac:
|
||||
9c:69:b4:61:a1:dc` which is the MAC address of `Te3/0/0` on VPP. And then on the way out, if VPP
|
||||
wants to send traffic back to T-Rex, I'll give it a static ARP entry with `ip neighbor .. static`.
|
||||
|
||||
With that said, I can start a baseline loadtest like so:
|
||||
{{< image width="100%" src="/assets/sflow/trex-baseline.png" alt="Cisco T-Rex: baseline" >}}
|
||||
|
||||
T-Rex is sending 10Gbps out on all eight interfaces (four of which are L3 routing, and four of which
|
||||
are L2 xconnecting), using a packet size of 1514 bytes. This amounts to roughly 813Kpps per port, or a
|
||||
cool 6.51Mpps in total. And I can see that in this baseline configuration, the VPP router is happy to
|
||||
do the work.
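
For reference, the T-Rex invocation for this baseline is along these lines (reconstructed from
memory, `stl/ipng.py` being my own traffic profile):

```
tui>start -f stl/ipng.py -m 100% -t size=1514
```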
|
||||
|
||||
I then enable sFlow on the second set of four ports, using a 1:1000 sampling rate:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000
|
||||
```
|
||||
|
||||
This should yield about 3'250 or so samples per second, and things look pretty great:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show err
|
||||
Count Node Reason Severity
|
||||
5034508 sflow sflow packets processed error
|
||||
4908 sflow sflow packets sampled error
|
||||
5034508 sflow sflow packets processed error
|
||||
5111 sflow sflow packets sampled error
|
||||
5034516 l2-output L2 output packets error
|
||||
5034516 l2-input L2 input packets error
|
||||
5034404 sflow sflow packets processed error
|
||||
4948 sflow sflow packets sampled error
|
||||
5034404 l2-output L2 output packets error
|
||||
5034404 l2-input L2 input packets error
|
||||
5034404 sflow sflow packets processed error
|
||||
4928 sflow sflow packets sampled error
|
||||
5034404 l2-output L2 output packets error
|
||||
5034404 l2-input L2 input packets error
|
||||
5034516 l2-output L2 output packets error
|
||||
5034516 l2-input L2 input packets error
|
||||
```
|
||||
|
||||
I can see that the `sflow packets sampled` is roughly 0.1% of the `sflow packets processed` which
|
||||
checks out. I can also see in `psampletest` a flurry of activity, so I'm happy:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ sudo psampletest
|
||||
...
|
||||
pstest: grp=1 in=9 out=0 n=1000 seq=63388 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=8 out=0 n=1000 seq=63389 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=11 out=0 n=1000 seq=63390 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=10 out=0 n=1000 seq=63391 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
pstest: psample netlink (type=32) CMD = 0
|
||||
pstest: grp=1 in=11 out=0 n=1000 seq=63392 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
|
||||
```
|
||||
|
||||
I confirm that all four `in` interfaces (8, 9, 10 and 11) are sending samples, and those indexes
|
||||
correctly correspond to the VPP dataplane's `sw_if_index` for `TenGig130/0/0 - 3`. Sweet! On this
|
||||
machine, each TenGig network interface has its own dedicated VPP worker thread. Considering I
|
||||
turned on sFlow sampling on four interfaces, I should see the cost I'm paying for the feature:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 3908218 14350684 0 9.05e1 3.67
|
||||
sflow active 3913266 14350680 0 1.11e2 3.67
|
||||
sflow active 3910828 14350687 0 1.08e2 3.67
|
||||
sflow active 3909274 14350692 0 5.66e1 3.67
|
||||
```
|
||||
|
||||
Alright, so for the 999 packets that went through and the one packet that got sampled, on average
|
||||
VPP is spending between 90 and 111 CPU cycles per packet, and the loadtest looks squeaky clean on
|
||||
T-Rex.
|
||||
|
||||
### VPP sFlow: Cost of passthru
|
||||
|
||||
I decide to take a look at two edge cases. What if there are no samples being taken at all, and the
|
||||
`sflow` node is merely passing through all packets to `ethernet-input`? To simulate this, I will set
|
||||
up a bizarrely high sampling rate, say one in ten million. I'll also make the T-Rex loadtester use
|
||||
only four ports, in other words, a unidirectional loadtest, and I'll make it go much faster by
|
||||
sending smaller packets, say 128 bytes:
|
||||
|
||||
```
|
||||
tui>start -f stl/ipng.py -p 0 2 4 6 -m 99% -t size=128
|
||||
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000 disable
|
||||
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10000000
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10000000
|
||||
```
|
||||
|
||||
The loadtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the
|
||||
`sFlow` plugin is not sampling many packets:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show err
|
||||
Count Node Reason Severity
|
||||
59777084 sflow sflow packets processed error
|
||||
6 sflow sflow packets sampled error
|
||||
59777152 l2-output L2 output packets error
|
||||
59777152 l2-input L2 input packets error
|
||||
59777104 sflow sflow packets processed error
|
||||
6 sflow sflow packets sampled error
|
||||
59777104 l2-output L2 output packets error
|
||||
59777104 l2-input L2 input packets error
|
||||
|
||||
pim@hvn6-lab:~$ vppctl show run | grep -E '(Name|sflow)'
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 8186642 369674664 0 1.35e1 45.16
|
||||
sflow active 25173660 369674696 0 1.97e1 14.68
|
||||
```
|
||||
Two observations:
|
||||
|
||||
1. One of these is busier than the other. Without looking further, I can already predict that the
|
||||
top one (doing 45.16 vectors/call) is the L3 thread. Reasoning: the L3 code path through the
|
||||
dataplane is a lot more expensive than 'merely' L2 XConnect. As such, the packets will spend more
|
||||
time, and therefore the iterations of the `dpdk-input` loop will be further apart in time. And
|
||||
because of that, it'll end up consuming more packets on each subsequent iteration, in order to catch
|
||||
up. The L2 path on the other hand, is quicker and therefore will have less packets waiting on
|
||||
subsequent iterations of `dpdk-input`.
|
||||
|
||||
2. The `sflow` plugin spends between 13.5 and 19.7 CPU cycles shoveling the packets into
|
||||
`ethernet-input` without doing anything to them. That's pretty low! And the L3 path is a little bit
|
||||
more efficient per packet, which is very likely because it gets to amortize its L1/L2 CPU instruction
cache over 45 packets each time it runs, while the L2 path can only amortize its instruction cache over
|
||||
15 or so packets each time it runs.
|
||||
|
||||
I let the loadtest run overnight, and the proof is in the pudding: sFlow enabled but not sampling
|
||||
works just fine:
|
||||
|
||||
{{< image width="100%" src="/assets/sflow/trex-passthru.png" alt="Cisco T-Rex: passthru" >}}
|
||||
|
||||
### VPP sFlow: Cost of sampling
|
||||
|
||||
The other interesting case is to figure out how much CPU it takes to execute the code path
|
||||
with the actual sampling. This one turns out a bit trickier to measure. While leaving the previous
|
||||
loadtest running at 33.5Mpps, I disable sFlow and then re-enable it at an abnormally _high_ ratio of
|
||||
1:10 packets:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 disable
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10
|
||||
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10
|
||||
```
|
||||
|
||||
The T-Rex view immediately reveals that VPP is not doing very well, as the throughput went from
|
||||
33.5Mpps all the way down to 7.5Mpps. Ouch! Looking at the dataplane:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show err | grep sflow
|
||||
340502528 sflow sflow packets processed error
|
||||
12254462 sflow sflow packets dropped error
|
||||
22611461 sflow sflow packets sampled error
|
||||
422527140 sflow sflow packets processed error
|
||||
8533855 sflow sflow packets dropped error
|
||||
34235952 sflow sflow packets sampled error
|
||||
```
|
||||
|
||||
Ha, this new safeguard popped up: remember all the way at the beginning, I explained how there's a
|
||||
safety net in the `sflow` plugin that will pre-emptively drop the sample if the RPC channel towards
|
||||
the main thread is seeing too many outstanding RPCs? That's happening right now, under the moniker
|
||||
`sflow packets dropped`, and it's roughly *half* of the samples.
|
||||
|
||||
My first attempt is to back off the loadtester to roughly 1.5Mpps per port (so 6Mpps in total, under the
|
||||
current limit of 7.5Mpps), but I'm disappointed: the VPP instance is now returning 665Kpps per port
|
||||
only, which is horrible, and it's still dropping samples.
|
||||
|
||||
My second attempt is to turn off all ports but the last pair (the L2XC port), which returns 930Kpps from
|
||||
the offered 1.5Mpps. VPP is clearly not having a good time here.
|
||||
|
||||
Finally, as a validation, I turn off all ports but the first pair (the L3 port, without sFlow), and
|
||||
ramp up the traffic to 8Mpps. Success (unsurprising to me). I also ramp up the second pair (the L2XC
|
||||
port, without sFlow), VPP forwards all 16Mpps and is happy again.
|
||||
|
||||
Once I turn on the third pair (the L3 port, _with_ sFlow), even at 1Mpps, the whole situation
|
||||
regresses again: the first two ports go down from 8Mpps to 5.2Mpps each; the third (offending) port
|
||||
delivers 740Kpps out of 1Mpps. Clearly, there's some work to do under high load situations!
|
||||
|
||||
#### Reasoning about the bottle neck
|
||||
|
||||
But how expensive is sending samples, really? To try to get at least some pseudo-scientific answer I
|
||||
turn off all ports again, and ramp up the one port pair (L3, with sFlow at a 1:10 ratio) to full line
rate, that is, 64-byte packets at 14.88Mpps:
|
||||
|
||||
```
|
||||
tui>stop
|
||||
tui>start -f stl/ipng.py -m 100% -p 4 -t size=64
|
||||
```
|
||||
|
||||
VPP is now on the struggle bus and is returning 3.16Mpps or 21% of that. But, I think it'll give me
|
||||
some reasonable data to try to feel out where the bottleneck is.
|
||||
|
||||
```
|
||||
Thread 2 vpp_wk_1 (lcore 3)
|
||||
Time 6.3, 10 sec internal node vector rate 256.00 loops/sec 27310.73
|
||||
vector rates in 3.1607e6, out 3.1607e6, drop 0.0000e0, punt 0.0000e0
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
TenGigabitEthernet130/0/1-outp active 77906 19943936 0 5.79e0 256.00
|
||||
TenGigabitEthernet130/0/1-tx active 77906 19943936 0 6.88e1 256.00
|
||||
dpdk-input polling 77906 19943936 0 4.41e1 256.00
|
||||
ethernet-input active 77906 19943936 0 2.21e1 256.00
|
||||
ip4-input active 77906 19943936 0 2.05e1 256.00
|
||||
ip4-load-balance active 77906 19943936 0 1.07e1 256.00
|
||||
ip4-lookup active 77906 19943936 0 1.98e1 256.00
|
||||
ip4-rewrite active 77906 19943936 0 1.97e1 256.00
|
||||
sflow active 77906 19943936 0 6.14e1 256.00
|
||||
|
||||
pim@hvn6-lab:pim# vppctl show err | grep sflow
|
||||
551357440 sflow sflow packets processed error
|
||||
19829380 sflow sflow packets dropped error
|
||||
36613544 sflow sflow packets sampled error
|
||||
```
|
||||
|
||||
OK, the `sflow` plugin saw 551M packets, selected 36.6M of them for sampling, but ultimately only
|
||||
sent RPCs to the main thread for 16.8M samples after having dropped 19.8M of them. There are three
|
||||
code paths, each one extending the other:
|
||||
|
||||
1. Super cheap: pass through. I already learned that it takes about X=13.5 CPU cycles to pass
|
||||
through a packet.
|
||||
1. Very cheap: select sample and construct the RPC, but toss it, costing Y CPU cycles.
|
||||
1. Expensive: select sample, and send the RPC. Z CPU cycles in worker, and another amount in main.
|
||||
|
||||
Now I don't know what Y is, but seeing as the selection only copies some data from the VPP buffer
|
||||
into a new `sflow_sample_t`, and it uses `clib_memcpy_fast()` for the sample header, I'm going to
|
||||
assume it's not _drastically_ more expensive than the super cheap case, so for simplicity I'll
|
||||
guesstimate that it takes Y=20 CPU cycles.
|
||||
|
||||
With that guess out of the way, I can see what the `sflow` plugin is consuming for the third case:
|
||||
|
||||
```
|
||||
AvgClocks = (Total * X + Sampled * Y + RPCSent * Z) / Total
|
||||
|
||||
61.4 = ( 551357440 * 13.5 + 36613544 * 20 + (36613544-19829380) * Z ) / 551357440
|
||||
61.4 = ( 7443325440 + 732270880 + 16784164 * Z ) / 551357440
|
||||
33853346816 = 7443325440 + 732270880 + 16784164 * Z
|
||||
25677750496 = 16784164 * Z
|
||||
Z = 1529
|
||||
```
|
||||
|
||||
Good to know! I find spending O(1500) cycles to send the sample pretty reasonable. However, for a
|
||||
dataplane that is trying to do 10Mpps per core, and a core running 2.2GHz, there are really only 220
|
||||
CPU cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets
|
||||
feels dangerous to me.
|
||||
|
||||
Here's where I start my conjecture. If I count the CPU cycles spent in the table above, I will see
|
||||
273 CPU cycles spent on average per packet. The CPU in the VPP router is an `E5-2696 v4 @ 2.20GHz`,
|
||||
which means it should be able to do `2.2e9/273 = 8.06Mpps` per thread, more than double what I
|
||||
observe (3.16Mpps)! But, for all the `vector rates in` (3.1607e6), it also managed to emit the
|
||||
packets back out (same number: 3.1607e6).
|
||||
|
||||
So why isn't VPP getting more packets from DPDK? I poke around a bit and find an important clue:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed; \
|
||||
sleep 10; vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed
|
||||
rx missed 4065539464
|
||||
rx missed 4182788310
|
||||
```
|
||||
|
||||
In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. I already measured that it
|
||||
forwarded 3.16Mpps and you know what? 11.7 + 3.16 is precisely 14.88Mpps. All packets are accounted
|
||||
for! It's just, DPDK never managed to read them from the hardware: `sad-trombone.wav`
|
||||
|
||||
|
||||
As a validation, I turned off sFlow while keeping that one port at 14.88Mpps. Now, 10.8Mpps were
|
||||
delivered:
|
||||
|
||||
```
|
||||
Thread 2 vpp_wk_1 (lcore 3)
|
||||
Time 14.7, 10 sec internal node vector rate 256.00 loops/sec 40622.64
|
||||
vector rates in 1.0794e7, out 1.0794e7, drop 0.0000e0, punt 0.0000e0
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
TenGigabitEthernet130/0/1-outp active 620012 158723072 0 5.66e0 256.00
|
||||
TenGigabitEthernet130/0/1-tx active 620012 158723072 0 7.01e1 256.00
|
||||
dpdk-input polling 620012 158723072 0 4.39e1 256.00
|
||||
ethernet-input active 620012 158723072 0 1.56e1 256.00
|
||||
ip4-input-no-checksum active 620012 158723072 0 1.43e1 256.00
|
||||
ip4-load-balance active 620012 158723072 0 1.11e1 256.00
|
||||
ip4-lookup active 620012 158723072 0 2.00e1 256.00
|
||||
ip4-rewrite active 620012 158723072 0 2.02e1 256.00
|
||||
```
|
||||
|
||||
Total Clocks: 201 per packet; 2.2GHz/201 = 10.9Mpps, and I am observing 10.8Mpps. As [[North of the
|
||||
Border](https://www.youtube.com/c/NorthoftheBorder)] would say: "That's not just good, it's good
|
||||
_enough_!"
|
||||
|
||||
For completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps 🥰), and saw that
|
||||
about 29Mpps of that made it through. Interestingly, what was 3.16Mpps in the single-port line rate
|
||||
loadtest, went up slightly to 3.44Mpps now. What puzzles me even more is that the non-sFlow worker
|
||||
threads are also impacted. I spent some time thinking about this and poking around, but I did not
|
||||
find a good explanation why port pair 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted.
|
||||
Here's a screenshot of VPP on the struggle bus:
|
||||
|
||||
{{< image width="100%" src="/assets/sflow/trex-overload.png" alt="Cisco T-Rex: overload at line rate" >}}
|
||||
|
||||
**Hypothesis**: Due to the _spinlock_ in `vl_api_rpc_call_main_thread()`, the worker CPU is pegged
|
||||
for a longer time, during which the `dpdk-input` PMD can't run, so it misses out on these sweet
|
||||
sweet packets that the network card had dutifully received for it, resulting in the `rx-miss`
|
||||
situation. While VPP's performance measurement shows 273 CPU cycles per packet and 3.16Mpps, this
|
||||
accounts only for 862M cycles, while the thread has 2200M cycles, leaving a whopping 60% of CPU
|
||||
cycles unused in the dataplane. I still don't understand why _other_ worker threads are impacted,
|
||||
though.
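
Spelled out, in the same back-of-the-envelope style as before:

```
Cycles accounted for:  273 cycles/packet * 3.16Mpps = ~862 Mcycles/sec
Cycles available:      2.2GHz                       = 2200 Mcycles/sec
Unaccounted for:       (2200 - 862) / 2200          = ~61% of the worker thread
```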
|
||||
|
||||
## What's Next
|
||||
|
||||
I'll continue to work with the folks in the sFlow and VPP communities and iterate on the plugin and
|
||||
other **sFlow Agent** machinery. In an upcoming article, I hope to share more details on how to tie
|
||||
the VPP plugin into the `hsflowd` host sFlow daemon in a way that the interface indexes, counters
|
||||
and packet lengths are all correct. Of course, the main improvement that we can make is to allow for
|
||||
the system to work better under load, which will take some thinking.
|
||||
|
||||
I should do a few more tests with a debug binary and profiling turned on. I quickly ran a `perf`
|
||||
over the VPP (release / optimized) binary running on the bench, but it merely said 80% of time was
|
||||
spent in `libvlib` rather than `libvnet` in the baseline (sFlow turned off).
|
||||
|
||||
```
|
||||
root@hvn6-lab:/home/pim# perf record -p 1752441 sleep 10
|
||||
root@hvn6-lab:/home/pim# perf report --stdio --sort=dso
|
||||
# Overhead Shared Object (sFlow) Overhead Shared Object (baseline)
|
||||
# ........ ...................... ........ ........................
|
||||
#
|
||||
79.02% libvlib.so.24.10 54.27% libvlib.so.24.10
|
||||
12.82% libvnet.so.24.10 33.91% libvnet.so.24.10
|
||||
3.77% dpdk_plugin.so 10.87% dpdk_plugin.so
|
||||
3.21% [kernel.kallsyms] 0.81% [kernel.kallsyms]
|
||||
0.29% sflow_plugin.so 0.09% ld-linux-x86-64.so.2
|
||||
0.28% libvppinfra.so.24.10 0.03% libc.so.6
|
||||
0.21% libc.so.6 0.01% libvppinfra.so.24.10
|
||||
0.17% libvlibapi.so.24.10 0.00% libvlibmemory.so.24.10
|
||||
0.15% libvlibmemory.so.24.10
|
||||
0.07% ld-linux-x86-64.so.2
|
||||
0.00% vpp
|
||||
0.00% [vdso]
|
||||
0.00% libsvm.so.24.10
|
||||
```
|
||||
|
||||
Unfortunately, I'm not much of a profiler expert, being merely a network engineer :) so I may have
|
||||
to ask for help. Of course, if you're reading this, you may also _offer_ help! There's lots of
|
||||
interesting work to do on this `sflow` plugin, with matching ifIndex for consumers like `hsflowd`,
|
||||
reading interface counters from the dataplane (or from the Prometheus Exporter), and most
|
||||
importantly, ensuring it works well, or fails gracefully, under stringent load.
|
||||
|
||||
From the _cray-cray_ ideas department, what if we:
|
||||
1. In the worker thread, produce the sample, but instead of sending an RPC to main and taking the
lock, append it to a producer sample queue and move on (see the sketch after this list). This way,
no locks are needed, and each worker thread will have its own producer queue.
|
||||
|
||||
1. Create a separate worker (or even pool of workers), running on possibly a different CPU (or in
|
||||
main), that runs a loop iterating on all sflow sample queues consuming the samples and sending them
|
||||
in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too many coming in.
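
A minimal sketch of what such a per-worker, single-producer/single-consumer ring could look like;
again with illustrative names only (`sflow_sample_t`, `SFLOW_RING_SIZE`), not existing plugin code:

```
#define SFLOW_RING_SIZE 1024            /* power of two, one ring per worker thread */

typedef struct
{
  volatile u32 head;                    /* written only by the worker (producer)   */
  volatile u32 tail;                    /* written only by the consumer node       */
  sflow_sample_t samples[SFLOW_RING_SIZE];
} sflow_ring_t;

static inline int
sflow_ring_push (sflow_ring_t *r, const sflow_sample_t *s)
{
  u32 next = (r->head + 1) & (SFLOW_RING_SIZE - 1);
  if (next == r->tail)
    return 0;                           /* ring full: drop the sample, never block */
  r->samples[r->head] = *s;
  CLIB_MEMORY_STORE_BARRIER ();         /* publish the sample before bumping head  */
  r->head = next;
  return 1;
}
```

The consumer side would simply walk all rings, drain whatever is there, and batch it towards
PSAMPLE, which is essentially the dispatcher pattern mentioned below.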
|
||||
|
||||
I'm reminded that this pattern exists already -- async crypto workers create a `crypto-dispatch`
|
||||
node that acts as poller for inbound crypto, and it hands off the result back into the worker
|
||||
thread: lockless at the expense of some complexity!
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
The plugin I am testing here is a prototype written by Neil McKee of inMon. I also wanted to say
|
||||
thanks to Pavel Odintsov of FastNetMon and Ciprian Balaceanu for showing an interest in this plugin,
|
||||
and Peter Phaal for facilitating a get-together last year.
|
||||
|
||||
Who's up for making this thing a reality?!
|
547
content/articles/2024-10-06-sflow-2.md
Normal file
@ -0,0 +1,547 @@
|
||||
---
|
||||
date: "2024-10-06T07:51:23Z"
|
||||
title: 'VPP with sFlow - Part 2'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
|
||||
|
||||
Last month, I picked up a project together with Neil McKee of [[inMon](https://inmon.com/)], the
|
||||
caretakers of [[sFlow](https://sflow.org)]: an industry-standard technology for monitoring high-speed switched
|
||||
networks. `sFlow` gives complete visibility into the use of networks enabling performance optimization,
|
||||
accounting/billing for usage, and defense against security threats.
|
||||
|
||||
The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
|
||||
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
|
||||
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the so
|
||||
called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for a small
|
||||
portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but also in the
|
||||
VPP software dataplane, and then _transmit_ these samples using a Linux kernel feature called
|
||||
[[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)]. This greatly
|
||||
reduces the complexity of code to be implemented in the forwarding path, while at the same time
|
||||
bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business logic for
|
||||
the more complex state keeping, packet marshalling and transmission from the _Agent_ to a central
|
||||
_Collector_.
|
||||
|
||||
Last month, Neil and I discussed the proof of concept [[ref](https://github.com/sflow/vpp-sflow/)]
|
||||
and I described this in a [[first article]({{< ref 2024-09-08-sflow-1.md >}})]. Then, we iterated on
|
||||
the VPP plugin, playing with a few different approaches to strike a balance between performance, code
|
||||
complexity, and agent features. This article describes our journey.
|
||||
|
||||
## VPP: an sFlow plugin
|
||||
|
||||
There are three things Neil and I specifically take a look at:
|
||||
|
||||
1. If `sFlow` is not enabled on a given interface, there should not be a regression on other
|
||||
interfaces.
|
||||
1. If `sFlow` _is_ enabled, but a packet is not sampled, the overhead should be as small as
|
||||
possible, targeting single-digit CPU cycles per packet.
|
||||
1. If `sFlow` actually selects a packet for sampling, it should be moved out of the dataplane as
|
||||
quickly as possible, targeting double-digit CPU cycles per sample.
|
||||
|
||||
For all of this validation and loadtesting, I use a bare metal VPP machine which receives load from
|
||||
a T-Rex loadtester on eight TenGig ports. I have configured VPP and T-Rex as follows.
|
||||
|
||||
**1. RX Queue Placement**
|
||||
|
||||
It's important that the network card that is receiving the traffic gets serviced by a worker thread
|
||||
on the same NUMA domain. Since my machine has two processors (and thus, two NUMA nodes), I will
|
||||
align the NIC with the correct processor, like so:
|
||||
|
||||
```
|
||||
set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
|
||||
set interface rx-placement TenGigabitEthernet3/0/1 queue 0 worker 2
|
||||
set interface rx-placement TenGigabitEthernet3/0/2 queue 0 worker 4
|
||||
set interface rx-placement TenGigabitEthernet3/0/3 queue 0 worker 6
|
||||
|
||||
set interface rx-placement TenGigabitEthernet130/0/0 queue 0 worker 1
|
||||
set interface rx-placement TenGigabitEthernet130/0/1 queue 0 worker 3
|
||||
set interface rx-placement TenGigabitEthernet130/0/2 queue 0 worker 5
|
||||
set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7
|
||||
```
|
||||
|
||||
**2. L3 IPv4/MPLS interfaces**
|
||||
|
||||
I will take two pairs of interfaces, one pair on NUMA0 and the other on NUMA1, so that I can
compare L3 IPv4 or MPLS running _without_ `sFlow` (these are TenGig3/0/*, which I will call
the _baseline_ pairs) with the same setup running _with_ `sFlow` (these are TenGig130/0/*, which I'll
call the _experiment_ pairs).
|
||||
|
||||
```
|
||||
comment { L3: IPv4 interfaces }
|
||||
set int state TenGigabitEthernet3/0/0 up
|
||||
set int state TenGigabitEthernet3/0/1 up
|
||||
set int state TenGigabitEthernet130/0/0 up
|
||||
set int state TenGigabitEthernet130/0/1 up
|
||||
set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
|
||||
set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
|
||||
set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
|
||||
set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
|
||||
ip route add 16.0.0.0/24 via 100.64.0.0
|
||||
ip route add 48.0.0.0/24 via 100.64.1.0
|
||||
ip route add 16.0.2.0/24 via 100.64.4.0
|
||||
ip route add 48.0.2.0/24 via 100.64.5.0
|
||||
ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
|
||||
ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
|
||||
ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
|
||||
ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
|
||||
```
|
||||
|
||||
Here, the only specific trick worth mentioning is the use of `ip neighbor` to pre-populate the L2
|
||||
adjacency for the T-Rex loadtester. This way, VPP knows which MAC address to send traffic to, in
|
||||
case a packet has to be forwarded to 100.64.0.0 or 100.64.5.0. It saves VPP from having to use ARP
|
||||
resolution.
|
||||
|
||||
The configuration for an MPLS label switching router (_LSR_, also called a _P-Router_) is added:
|
||||
|
||||
```
|
||||
comment { MPLS interfaces }
|
||||
mpls table add 0
|
||||
set interface mpls TenGigabitEthernet3/0/0 enable
|
||||
set interface mpls TenGigabitEthernet3/0/1 enable
|
||||
set interface mpls TenGigabitEthernet130/0/0 enable
|
||||
set interface mpls TenGigabitEthernet130/0/1 enable
|
||||
mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
|
||||
mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
|
||||
mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
|
||||
mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
|
||||
```
|
||||
|
||||
**3. L2 CrossConnect interfaces**
|
||||
|
||||
Here, I will also use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
|
||||
interfaces on NUMA1 as my experiment (`sFlow` enabled) pair. This way, I can both make a comparison
|
||||
of the performance impact of enabling `sFlow`, and also assert whether any regression occurs in the
|
||||
_baseline_ pair if I enable a feature in the _experiment_ pair, which should really never happen.
|
||||
|
||||
```
|
||||
comment { L2 xconnected interfaces }
|
||||
set int state TenGigabitEthernet3/0/2 up
|
||||
set int state TenGigabitEthernet3/0/3 up
|
||||
set int state TenGigabitEthernet130/0/2 up
|
||||
set int state TenGigabitEthernet130/0/3 up
|
||||
set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
|
||||
set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
|
||||
set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
|
||||
set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
|
||||
```
|
||||
|
||||
**4. T-Rex Configuration**
|
||||
|
||||
The Cisco T-Rex loadtester is running on another machine in the same rack. Physically, it has eight
|
||||
ports which are connected to a LAB switch, a cool Mellanox SN2700 running Debian [[ref]({{< ref
|
||||
2023-11-11-mellanox-sn2700.md >}})]. From there, eight ports go to my VPP machine. The LAB switch
|
||||
just has VLANs with two ports in each: VLAN 100 takes T-Rex port0 and connects it to TenGig3/0/0,
|
||||
VLAN 101 takes port1 and connects it to TenGig3/0/1, and so on. In total, sixteen ports and eight
|
||||
VLANs are used.
|
||||
|
||||
The configuration for T-Rex then becomes:
|
||||
|
||||
```
|
||||
- version: 2
|
||||
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
|
||||
port_info:
|
||||
- src_mac: 00:1b:21:06:00:00
|
||||
dest_mac: 9c:69:b4:61:a1:dc
|
||||
- src_mac: 00:1b:21:06:00:01
|
||||
dest_mac: 9c:69:b4:61:a1:dd
|
||||
|
||||
- src_mac: 00:1b:21:83:00:00
|
||||
dest_mac: 00:1b:21:83:00:01
|
||||
- src_mac: 00:1b:21:83:00:01
|
||||
dest_mac: 00:1b:21:83:00:00
|
||||
|
||||
- src_mac: 00:1b:21:87:00:00
|
||||
dest_mac: 9c:69:b4:61:75:d0
|
||||
- src_mac: 00:1b:21:87:00:01
|
||||
dest_mac: 9c:69:b4:61:75:d1
|
||||
|
||||
- src_mac: 9c:69:b4:85:00:00
|
||||
dest_mac: 9c:69:b4:85:00:01
|
||||
- src_mac: 9c:69:b4:85:00:01
|
||||
dest_mac: 9c:69:b4:85:00:00
|
||||
```
|
||||
|
||||
Do you see how the first pair sends from `src_mac` 00:1b:21:06:00:00? That's the T-Rex side, and it
|
||||
encodes the PCI device `06:00.0` in the MAC address. It sends traffic to `dest_mac`
|
||||
9c:69:b4:61:a1:dc, which is the MAC address of VPP's TenGig3/0/0 interface. Looking back at the `ip
|
||||
neighbor` VPP config above, it becomes much easier to see who is sending traffic to whom.
|
||||
|
||||
For L2XC, the MAC addresses don't matter. VPP will set the NIC in _promiscuous_ mode which means
|
||||
it'll accept any ethernet frame, not only those sent to the NIC's own MAC address. Therefore, in
|
||||
L2XC modes (the second and fourth pair), I just use the MAC addresses from T-Rex. I find debugging
|
||||
connections and looking up FDB entries on the Mellanox switch much, much easier this way.
|
||||
|
||||
With all config in place, but with `sFlow` disabled, I run a quick bidirectional loadtest using 256b
|
||||
packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS,
|
||||
IPv4, and L2XC. Neat!
|
||||
|
||||
{{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}
|
||||
|
||||
The name of the game is now to do a loadtest that shows the packet throughput and CPU cycles spent
|
||||
for each of the plugin iterations, comparing their performance on ports with and without `sFlow`
|
||||
enabled. For each iteration, I will use exactly the same VPP configuration, I will generate
|
||||
unidirectional 4x14.88Mpps of traffic with T-Rex, and I will report on VPP's performance in
|
||||
_baseline_ and with a somewhat unfavorable 1:100 sampling rate.
|
||||
|
||||
Ready? Here I go!
|
||||
|
||||
### v1: Workers send RPC to main
|
||||
|
||||
***TL/DR***: _13 cycles/packet on passthrough, 4.68Mpps L2, 3.26Mpps L3, with severe regression in
|
||||
baseline_
|
||||
|
||||
The first iteration goes all the way back to a proof of concept from last year. It's described in
|
||||
detail in my [[first post]({{< ref 2024-09-08-sflow-1.md >}})]. The performance results are not
|
||||
stellar:
|
||||
* ☢ When slamming a single sFlow enabled interface, _all interfaces_ regress. When sending 8Mpps
|
||||
of IPv4 traffic through a _baseline_ interface, that is, an interface _without_ sFlow enabled, only
|
||||
5.2Mpps get through. This is considered a mortal sin in VPP-land.
|
||||
* ✅ Passing through packets without sampling them costs about 13 CPU cycles, not bad.
|
||||
* ❌ Sampling a packet, specifically at higher rates (say, 1:100 or worse, 1:10) completely
|
||||
destroys throughput. When sending 4x14.88Mpps of traffic, only one third makes it through.
|
||||
|
||||
Here's the bloodbath as seen from T-Rex:
|
||||
|
||||
{{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}
|
||||
|
||||
**Debrief**: When we talked through these issues, we sort of drew the conclusion that it would be much
|
||||
faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the
|
||||
spinlock, the worker would append the sample to a producer queue and move on. This way, no locks
|
||||
are needed, and each worker thread will have its own producer queue.
|
||||
|
||||
Then, we can create a separate thread (or even a pool of threads), possibly scheduled on a different
|
||||
CPU (or in main), that runs a loop iterating on all sflow sample queues, consuming the samples and
|
||||
sending them in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too
|
||||
many coming in.
|
||||
|
||||
### v2: Workers send PSAMPLE directly
|
||||
|
||||
**TL/DR**: _7.21Mpps IPv4 L3, 9.45Mpps L2XC, 87 cycles/packet, no impact on disabled interfaces_
|
||||
|
||||
But before we do that, we have one curiosity itch to scratch - what if we sent the sample directly
|
||||
from the worker? With such a model, if it works, we will need no RPCs or sample queue at all. Of
|
||||
course, in this model any sample will have to be rewritten into a PSAMPLE packet and written via the
|
||||
netlink socket. It would be less complex, but not as efficient as it could be. One thing is pretty
|
||||
certain, though: it should be much faster than sending an RPC to the main thread.
|
||||
|
||||
After a short refactor, Neil commits [[d278273](https://github.com/sflow/vpp-sflow/commit/d278273)],
|
||||
which adds compiler macros `SFLOW_SEND_FROM_WORKER` (v2) and `SFLOW_SEND_VIA_MAIN` (v1). When
|
||||
workers send directly, they will invoke `sflow_send_sample_from_worker()` instead of sending an RPC
|
||||
with `vl_api_rpc_call_main_thread()` as in the previous version.
|
||||
|
||||
The code currently uses `clib_warning()` to print stats from the dataplane, which is pretty
|
||||
expensive. We should be using the VPP logging framework, but for the time being, I add a few CPU
|
||||
counters so we can more accurately count the cumulative time spent for each part of the calls, see
|
||||
[[6ca61d2](https://github.com/sflow/vpp-sflow/commit/6ca61d2)]. I can now see these with `vppctl show
|
||||
err` instead.
|
||||
|
||||
When loadtesting this, the deadly sin of impacting performance of interfaces that did not have
|
||||
`sFlow` enabled is gone. The throughput is not great, though. Instead of showing screenshots of
|
||||
T-Rex, I can also take a look at the throughput as measured by VPP itself. In its `show runtime`
|
||||
statistics, each worker thread shows both CPU cycles spent, as well as how many packets/sec it
|
||||
received and how many it transmitted:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ export C="v2-100"; vppctl clear run; vppctl clear err; sleep 30; \
|
||||
vppctl show run > $C-runtime.txt; vppctl show err > $C-err.txt
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v2-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0909e7, out 1.0909e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 7.2078e6, out 7.2078e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.4476e6, out 9.4476e6, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep 'sflow' v2-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 844916 216298496 0 8.69e1 256.00
|
||||
sflow active 1107466 283511296 0 8.26e1 256.00
|
||||
pim@hvn6-lab:~$ grep -i sflow v2-100-err.txt
|
||||
217929472 sflow sflow packets processed error
|
||||
1614519 sflow sflow packets sampled error
|
||||
2606893106 sflow CPU cycles in sent samples error
|
||||
280697344 sflow sflow packets processed error
|
||||
2078203 sflow sflow packets sampled error
|
||||
1844674406 sflow CPU cycles in sent samples error
|
||||
```
|
||||
|
||||
At a glance, I can see in the first `grep`, the in and out vector (==packet) rates for each worker
|
||||
thread that is doing meaningful work (ie. has more than 0pps of input). Remember that I pinned the
|
||||
RX queues to worker threads, and this now pays dividends: worker thread 0 is servicing TenGig3/0/0
|
||||
(as _even_ worker thread numbers are on NUMA domain 0), worker thread 1 is servicing TenGig130/0/0.
|
||||
What's cool about this is that it gives me an easy way to compare baseline L3 (10.9Mpps) with experiment
|
||||
L3 (7.21Mpps). Equally, L2XC comes in at 14.88Mpps in baseline and 9.45Mpps in experiment.
|
||||
|
||||
Looking at the output of `vppctl show error`, I can learn another interesting detail. See how there
|
||||
are 1614519 sampled packets out of 217929472 processed packets (ie. a roughly 1:100 rate)? I added a
|
||||
CPU clock cycle counter that counts cumulative clocks spent once samples are taken. I can see that
|
||||
VPP spent 2606893106 CPU cycles sending these samples. That's **1615 CPU cycles** per sent sample,
|
||||
which is pretty terrible.
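
A quick back-of-the-envelope check (plain Python, simply re-doing the division from the error counters above) confirms that figure:

```
cycles_in_sent_samples = 2606893106   # 'sflow CPU cycles in sent samples', first worker
packets_sampled        = 1614519      # 'sflow packets sampled', first worker
print(cycles_in_sent_samples / packets_sampled)   # ~1614.7, so about 1615 CPU cycles per sample
```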
|
||||
|
||||
**Debrief**: We both understand that assembling and `send()`ing the netlink messages from within the
|
||||
dataplane is a pretty bad idea. But it's great to see that removing the use of RPCs immediately
|
||||
improves performance on non-enabled interfaces, and we learned what the cost is of sending those
|
||||
samples. An easy step forward from here is to create a producer/consumer queue, where the workers
|
||||
can just copy the packet into a queue or ring buffer, and have an external `pthread` consume from
|
||||
the queue/ring in another thread that won't block the dataplane.
|
||||
|
||||
### v3: SVM FIFO from workers, dedicated PSAMPLE pthread
|
||||
|
||||
**TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages_
|
||||
|
||||
Neil checks in after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
|
||||
that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty
|
||||
elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment
|
||||
called `fifo-sflow-worker`, to which the worker can write its samples in the dataplane node. A new
|
||||
thread called `spt_process_samples` can then call `svm_fifo_dequeue()` from all workers' queues and
|
||||
pump those into Netlink.
|
||||
|
||||
The overhead of copying the samples onto a VPP native `svm_fifo` seems to be two orders of magnitude
|
||||
lower than writing directly to Netlink, even though the `svm_fifo` library code has many bells and
|
||||
whistles that we don't need. But, perhaps due to these bells and whistles, we may be holding it
|
||||
wrong, as invariably after a short while the Netlink writes return _Message too long_ errors.
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v3fifo-sflow-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0783e7, out 1.0783e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.3499e6, out 9.3499e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4728e7, out 1.4728e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.3516e7, out 1.3516e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow active 1096132 280609792 0 1.63e1 256.00
|
||||
sflow active 1584577 405651712 0 1.46e1 256.00
|
||||
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-err.txt
|
||||
280635904 sflow sflow packets processed error
|
||||
2079194 sflow sflow packets sampled error
|
||||
733447310 sflow CPU cycles in sent samples error
|
||||
405689856 sflow sflow packets processed error
|
||||
3004118 sflow sflow packets sampled error
|
||||
1844674407 sflow CPU cycles in sent samples error
|
||||
```
|
||||
|
||||
Two things of note here. Firstly, the average clocks spent in the `sFlow` node have gone down from
|
||||
86 CPU cycles/packet to 16.3 CPU cycles. But even more importantly, the amount of time spent after
|
||||
the sample is taken is hugely reduced, from 1600+ cycles in v2 to a much more favorable 352 cycles
|
||||
in this version. Also, any risk of Netlink writes stalling the dataplane has been eliminated, because that's now
|
||||
offloaded to a different thread entirely.
|
||||
|
||||
**Debrief**: It's not great that we created a new Linux `pthread` for the consumer of the samples.
|
||||
VPP has an elaborate thread management system, and collaborative multitasking in its threading
|
||||
model, which adds introspection like clock counters, names, `show runtime`, `show threads` and so
|
||||
on. I can't help but wonder: wouldn't we just be able to move the `spt_process_samples()` thread
|
||||
into a VPP process node instead?
|
||||
|
||||
### v3bis: SVM FIFO, PSAMPLE process in Main
|
||||
|
||||
**TL/DR:** _9.68Mpps L3, 14.10Mpps L2XC, 14.2 cycles/packet, still with corrupted FIFO queue messages_
|
||||
|
||||
Neil agrees that there's no good reason to keep this out of main, and conjures up
|
||||
[[df2dab8d](https://github.com/sflow/vpp-sflow/commit/df2dab8d)], which rewrites the thread into an
|
||||
`sflow_process_samples()` function, using `VLIB_REGISTER_NODE` to add it to VPP in an idiomatic way.
|
||||
As a really nice benefit, we can now count how many CPU cycles are spent, in _main_, each time this
|
||||
_process_ wakes up and does some work. It's a widely used pattern in VPP.
|
||||
|
||||
Because of the FIFO queue message corruption, Netlink messages are failing to send at an alarming
rate, which is causing lots of `clib_warning()` messages to be spewed on the console. I replace those
with a counter of failed Netlink messages instead, and commit this refactor as
[[6ba4715](https://github.com/sflow/vpp-sflow/commit/6ba4715d050f76cfc582055958d50bf4cc8a0ad1)].
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v3bis-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0976e7, out 1.0976e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.6743e6, out 9.6743e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4052e7, out 1.4052e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v3bis-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 28052 4.66e4 0.00
|
||||
sflow active 1134102 290330112 0 1.42e1 256.00
|
||||
sflow active 1647240 421693440 0 1.32e1 256.00
|
||||
pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt
|
||||
77945 sflow sflow PSAMPLE sent error
|
||||
863 sflow sflow PSAMPLE send failed error
|
||||
290376960 sflow sflow packets processed error
|
||||
2151184 sflow sflow packets sampled error
|
||||
421761024 sflow sflow packets processed error
|
||||
3119625 sflow sflow packets sampled error
|
||||
```
|
||||
|
||||
With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up
|
||||
and informs me that, when handling the samples from the worker FIFO queues, the _process_ is using
about 4.66e4 CPU cycles. Secondly, the replacement of `clib_warning()` with the `sflow PSAMPLE send failed`
|
||||
counter reduced time from 16.3 to 14.2 cycles on average in the dataplane. Nice.
|
||||
|
||||
**Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All
|
||||
these send failures and corrupt packets are really messing things up. So while the provided FIFO
|
||||
implementation in `svm/fifo_segment.h` is idiomatic, it is also much more complex than we thought,
|
||||
and we fear that it may not be safe to read from another thread.
|
||||
|
||||
### v4: Custom lockless FIFO, PSAMPLE process in Main
|
||||
|
||||
**TL/DR:** _9.56Mpps L3, 13.69Mpps L2XC, 15.6 cycles/packet, corruption fixed!_
|
||||
|
||||
After reading around a bit in DPDK's
|
||||
[[kni_fifo](https://doc.dpdk.org/api-18.11/rte__kni__fifo_8h_source.html)], Neil produces a gem of a
|
||||
commit in
|
||||
[[42bbb64](https://github.com/sflow/vpp-sflow/commit/42bbb643b1f11e8498428d3f7d20cde4de8ee201)],
|
||||
where he introduces a tiny multiple-writer, single-consumer FIFO with two simple functions:
|
||||
`sflow_fifo_enqueue()` to be called in the workers, and `sflow_fifo_dequeue()` to be called in the
|
||||
main thread's `sflow-process-samples` process. He then makes this thread-safe by doing what I
|
||||
consider black magic, in commit
|
||||
[[dd8af17](https://github.com/sflow/vpp-sflow/commit/dd8af1722d579adc9d08656cd7ec8cf8b9ac11d6)],
|
||||
which makes use of `clib_atomic_load_acq_n()` and `clib_atomic_store_rel_n()` macros from VPP's
|
||||
`vppinfra/atomics.h`.
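
To make the mechanics a bit more tangible, here's a small toy model of such a fixed-depth, drop-on-full
FIFO. This is only an illustration in Python, not the plugin's actual code: the real
`sflow_fifo_enqueue()` and `sflow_fifo_dequeue()` are written in C and rely on the acquire/release
atomics mentioned above, rather than on an interpreter lock, to stay safe without spinlocks.

```
class SampleFifo:
    """Toy model: one producer (a worker thread) and one consumer (main)."""
    def __init__(self, depth=4):            # depth 4, as the v4 plugin uses
        self.depth = depth
        self.slots = [None] * depth
        self.head = 0                        # only ever written by the producer
        self.tail = 0                        # only ever written by the consumer

    def enqueue(self, sample) -> bool:
        if self.head - self.tail >= self.depth:
            return False                     # FIFO full: drop the sample, never block
        self.slots[self.head % self.depth] = sample
        self.head += 1                       # 'store-release' in the C version
        return True

    def dequeue(self):
        if self.tail == self.head:           # 'load-acquire' in the C version
            return None                      # empty
        sample = self.slots[self.tail % self.depth]
        self.tail += 1
        return sample
```

The important property is visible even in this toy: the producer never blocks. When the consumer falls
behind, `enqueue()` simply reports a drop, which is exactly what the `sflow packets dropped` counter
below keeps track of.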
|
||||
|
||||
What I really like about this change is that it introduces a FIFO implementation in about twenty
|
||||
lines of code, which means the sampling code path in the dataplane becomes really easy to follow,
|
||||
and will be even faster than it was before. I take it out for a loadtest:
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v4-100-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0958e7, out 1.0958e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.5633e6, out 9.5633e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.3697e7, out 1.3697e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v4-100-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 17767 1.52e6 0.00
|
||||
sflow active 1121156 287015936 0 1.56e1 256.00
|
||||
sflow active 1605772 411077632 0 1.53e1 256.00
|
||||
pim@hvn6-lab:~$ grep sflow v4-100-err.txt
|
||||
3553600 sflow sflow PSAMPLE sent error
|
||||
287101184 sflow sflow packets processed error
|
||||
2127024 sflow sflow packets sampled error
|
||||
350224 sflow sflow packets dropped error
|
||||
411199744 sflow sflow packets processed error
|
||||
3043693 sflow sflow packets sampled error
|
||||
1266893 sflow sflow packets dropped error
|
||||
```
|
||||
|
||||
|
||||
This is starting to be a very nice implementation! With this iteration of the plugin, all the
|
||||
corruption is gone, and there is only a slight regression (because we're now actually _sending_ the
|
||||
messages). With the v3bis variant, only a tiny fraction of the samples made it through to netlink.
|
||||
With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen
|
||||
FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying
|
||||
to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken,
|
||||
350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense in depth!
|
||||
|
||||
Doing the math, both workers can enqueue 1776800 samples in 30 seconds, which is 59k/s per
|
||||
interface. I can also see that the second interface, which is doing L2XC and hits a much larger
|
||||
packets/sec throughput, is dropping more samples because it receives an equal amount of time from main
|
||||
reading samples from its queue. In other words: in an overload scenario, one interface cannot crowd
|
||||
out another. Slick.
|
||||
|
||||
Finally, completing my math, each worker has enqueued 1776800 samples to its FIFO, and I see that
|
||||
main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so
|
||||
the `sflow PSAMPLE send failed` counter remains zero.
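
For completeness, here's where those numbers come from, pulled straight from the `show err` counters
above (a tiny Python check, just for illustration):

```
sampled  = [2127024, 3043693]   # 'sflow packets sampled' per worker
dropped  = [350224, 1266893]    # 'sflow packets dropped' per worker (FIFO full)
enqueued = [s - d for s, d in zip(sampled, dropped)]
print(enqueued)                 # [1776800, 1776800] -> about 59k samples/s per interface over 30s
print(sum(enqueued))            # 3553600, matching the 'sflow PSAMPLE sent' counter exactly
```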
|
||||
|
||||
{{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}
|
||||
|
||||
**Debrief**: In the meantime, Neil has been working on the `host-sflow` daemon changes to pick up
|
||||
these netlink messages. There's also a bit of work to do with retrieving the packet and byte
|
||||
counters of the VPP interfaces, so he is creating a module in `host-sflow` that can consume some
|
||||
messages from VPP. He will call this `mod_vpp`, and he mails a screenshot of his work in progress.
|
||||
I'll discuss the end-to-end changes with `hsflowd` in a followup article, and focus my efforts here
|
||||
on documenting the VPP parts only. But, as a teaser, here's a screenshot of a validated
|
||||
`sflow-tool` output of a VPP instance using our `sFlow` plugin and his pending `host-sflow` changes
|
||||
to integrate the rest of the business logic outside of the VPP dataplane, where it's arguably
|
||||
expensive to make mistakes.
|
||||
|
||||
Neil admits to an itch that he has been meaning to scratch all this time. In VPP's
|
||||
`plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Here, most
of the time, the plugin is really just shoveling the ethernet packets through to `ethernet-input`. To
|
||||
make use of some CPU instruction cache affinity, the loop that does this shovelling can do it one
|
||||
packet at a time, two packets at a time, or even four packets at a time. Although the code is super
|
||||
repetitive and somewhat ugly, it does actually speed up processing in terms of CPU cycles spent per
|
||||
packet, if you shovel four of them at a time.
|
||||
|
||||
### v5: Quad Bucket Brigade in worker
|
||||
|
||||
**TL/DR:** _9.68Mpps L3, 14.0Mpps L2XC, 11 CPU cycles/packet, 1.28e5 CPU cycles in main_
|
||||
|
||||
Neil calls this the _Quad Bucket Brigade_, and one last finishing touch is to move from his default
|
||||
2-packet to a 4-packet shoveling. In commit
|
||||
[[285d8a0](https://github.com/sflow/vpp-sflow/commit/285d8a097b74bb38eeb14a922a1e8c1115da2ef2)], he
|
||||
extends a common pattern in VPP dataplane nodes: each time the node iterates, it'll now pre-fetch up
|
||||
to eight packets (`p0-p7`) if the vector is long enough, and handle them four at a time (`b0-b3`).
|
||||
He also adds a few compiler hints with branch prediction: almost no packets will have a trace
|
||||
enabled, so he can use `PREDICT_FALSE()` macros to allow the compiler to further optimize the code.
|
||||
|
||||
Reading the dataplane code, I find it incredibly ugly. But that's the price to pay for ultra
fast throughput. But how do we see the effect? My low-tech proposal is to set a very sparse sampling
rate, say 1:10'000'000, so that the code path that grabs and enqueues the sample into the FIFO
|
||||
is almost never called. Then, what's left for the `sFlow` dataplane node, really is to shovel the
|
||||
packets from `device-input` into `ethernet-input`.
|
||||
|
||||
To measure the relative improvement, I do one test with, and one without commit
|
||||
[[285d8a09](https://github.com/sflow/vpp-sflow/commit/285d8a09)].
|
||||
|
||||
```
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v5-10M-runtime.txt | grep -v 'in 0'
|
||||
vector rates in 1.0981e7, out 1.0981e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.8806e6, out 9.8806e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4328e7, out 1.4328e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v5-10M-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 28467 9.36e3 0.00
|
||||
sflow active 1158325 296531200 0 1.09e1 256.00
|
||||
sflow active 1679742 430013952 0 1.11e1 256.00
|
||||
|
||||
pim@hvn6-lab:~$ grep 'vector rates' v5-noquadbrigade-10M-runtime.txt | grep -v in\ 0
|
||||
vector rates in 1.0959e7, out 1.0959e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 9.7046e6, out 9.7046e6, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
|
||||
vector rates in 1.4008e7, out 1.4008e7, drop 0.0000e0, punt 0.0000e0
|
||||
pim@hvn6-lab:~$ grep sflow v5-noquadbrigade-10M-runtime.txt
|
||||
Name State Calls Vectors Suspends Clocks Vectors/Call
|
||||
sflow-process-samples any wait 0 0 28462 9.57e3 0.00
|
||||
sflow active 1137571 291218176 0 1.26e1 256.00
|
||||
sflow active 1641991 420349696 0 1.20e1 256.00
|
||||
```
|
||||
|
||||
Would you look at that, this optimization actually works as advertised! There is a meaningful
|
||||
_progression_ from v5-noquadbrigade (9.70Mpps L3, 14.00Mpps L2XC) to v5 (9.88Mpps L3, 14.32Mpps
|
||||
L2XC). So at the expense of adding 63 lines of code, there is a 2.8% increase in throughput.
|
||||
**Quad-Bucket-Brigade, yaay!**
|
||||
|
||||
I'll leave you with a beautiful screenshot of the current code at HEAD, as it is sampling 1:100
|
||||
packets (!) on four interfaces, while forwarding 8x10G of 256 byte packets at line rate. You'll
|
||||
recall at the beginning of this article I did an acceptance loadtest with sFlow disabled, but this
|
||||
is the exact same result **with sFlow** enabled:
|
||||
|
||||
{{< image src="/assets/sflow/trex-sflow-acceptance.png" alt="T-Rex sFlow Acceptance Loadtest" >}}
|
||||
|
||||
This picture says it all: 79.98 Gbps in, 79.98 Gbps out; 36.22Mpps in, 36.22Mpps out. Also: 176k
|
||||
samples/sec taken from the dataplane, with correct rate limiting due to a per-worker FIFO depth
|
||||
limit, yielding 25k samples/sec sent to Netlink.
|
||||
|
||||
## What's Next
|
||||
|
||||
Checking in on the three main things we wanted to ensure with the plugin:
|
||||
|
||||
1. ✅ If `sFlow` _is not_ enabled on a given interface, there is no regression on other interfaces.
|
||||
1. ✅ If `sFlow` _is_ enabled, copying packets costs 11 CPU cycles on average
|
||||
1. ✅ If `sFlow` takes a sample, it takes only marginally more CPU time to enqueue.
|
||||
* No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput,
|
||||
* 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput,
|
||||
* and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.
|
||||
|
||||
The hard part is finished, but we're not entirely done yet. What's left is to implement a set of
|
||||
packet and byte counters, and send this information along with possible Linux CP data (such as the
|
||||
TAP interface ID on the Linux side), and to add the module for VPP in `hsflowd`. I'll write about
|
||||
that part in a followup article.
|
||||
|
||||
Neil has introduced vpp-dev@ to this plugin, and so far there have been no objections. But he has pointed
folks to an out-of-tree GitHub repo, and I may add a Gerrit instead so it becomes part of the
|
||||
ecosystem. Our work so far is captured in Gerrit [[41680](https://gerrit.fd.io/r/c/vpp/+/41680)],
|
||||
which ends up being just over 2600 lines all-up. I do think we need to refactor a bit, add some
|
||||
VPP-specific tidbits like `FEATURE.yaml` and `*.rst` documentation, but this should be in reasonable
|
||||
shape.
|
||||
|
||||
### Acknowledgements
|
||||
|
||||
I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
|
||||
finer details such as logging, error handling, API specifications, and documentation. He has been a
|
||||
true pleasure to work with and learn from.
|
778
content/articles/2024-10-21-freeix-2.md
Normal file
@ -0,0 +1,778 @@
|
||||
---
|
||||
date: "2024-10-21T10:52:11Z"
|
||||
title: "FreeIX Remote - Part 2"
|
||||
---
|
||||
|
||||
{{< image width="18em" float="right" src="/assets/freeix/freeix-artist-rendering.png" alt="FreeIX, Artists Rendering" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
A few months ago, I wrote about [[an idea]({{< ref 2024-04-27-freeix-1.md >}})] to help boost the
|
||||
value of small Internet Exchange Points (_IXPs_). When such an exchange doesn't have many members,
|
||||
the operational costs of connecting to it (cross connects, router ports, finding peers, and so on)
|
||||
are not very favorable.
|
||||
|
||||
Clearly, the benefit of using an Internet Exchange is to reduce the portion of an ISP’s (and CDN’s)
|
||||
traffic that must be delivered via their upstream transit providers, thereby reducing the average
|
||||
per-bit delivery cost, as well as reducing the end-to-end latency as seen by their users or
|
||||
customers. Furthermore, the increased number of paths available through the IXP improves routing
|
||||
efficiency and fault-tolerance, and at the same time it avoids traffic going the scenic route to a
|
||||
large hub like Frankfurt, London, Amsterdam, Paris or Rome, if it could very well remain local.
|
||||
|
||||
## Refresher: FreeIX Remote
|
||||
|
||||
{{< image width="20em" float="right" src="/assets/freeix/Free IX Remote.svg" alt="FreeIX Remote" >}}
|
||||
|
||||
Let's take for example the [[Free IX in Greece](https://free-ix.gr/)] that was announced at GRNOG16
|
||||
in Athens on April 19th, 2024. This exchange initially targets Athens and Thessaloniki, with 2x100G
|
||||
between the two cities. Members can connect to either site for the cost of only a cross connect.
|
||||
The 1G/10G/25G ports will be _Gratis_, so please make sure to apply if you're in this region! I
|
||||
myself have connected one very special router to Free IX Greece, which will be offering an outreach
|
||||
infrastructure by connecting to _other_ Internet Exchange Points in Amsterdam, and allowing all FreeIX
|
||||
Greece members to benefit from that in the following way:
|
||||
|
||||
1. FreeIX Remote uses AS50869 to peer with any network operator (or routeserver) available at public
|
||||
Internet Exchange Points or using private interconnects. For these peers, it looks like a completely
|
||||
normal service provider. It will connect to internet exchange points, learn a bunch of
|
||||
routes and announce other routes.
|
||||
|
||||
1. FreeIX Remote _members_ can join the program, after which they are granted certain propagation
|
||||
permissions by FreeIX Remote at the point where they have a BGP session with AS50869. The prefixes
|
||||
learned on these _member_ sessions are marked as such, and will be allowed to propagate. Members
|
||||
will receive some or all learned prefixes from AS50869.
|
||||
|
||||
1. FreeIX _members_ can set fine grained BGP communities to determine which of their prefixes are
|
||||
propagated to and from which locations, by router, country or Internet Exchange Point.
|
||||
|
||||
Members at smaller internet exchange points greatly benefit from this type of outreach, by receiving large
|
||||
portions of the public internet directly at their preferred peering location. The _Free IX Remote_
|
||||
routers will carry member traffic to and from these remote Internet Exchange Points. My [[previous
|
||||
article]({{< ref 2024-04-27-freeix-1.md >}})] went into a good amount of detail on the principles of
|
||||
operation, but back then I made a promise to come back to the actual _implementation_ of such a
|
||||
complex routing topology. As a starting point, I work with the structure I shared in [[IPng's
|
||||
Routing Policy]({{< ref 2021-11-14-routing-policy.md >}})]. If you haven't read that yet, I think
|
||||
it may make sense to take a look as many of the structural elements and concepts will be similar.
|
||||
|
||||
## Implementation
|
||||
|
||||
The routing policy calls for three classes of (large) BGP communities: informational, permission and
|
||||
inhibit. It also defines a few classic BGP communities, but I'll skip over those as they are not
|
||||
very interesting. Firstly, I will use the _informational_ communities to tag which prefixes were
|
||||
learned by which _router_, in which _country_ and at which internet exchange point, which I will call a
|
||||
_group_.
|
||||
|
||||
Then, I will use the same structure to grant members _permissions_, that is to say, when AS50869
|
||||
learns their prefixes, they will get tagged with specific action communities that enable propagation
|
||||
to other places. I will call this 'Member-to-IXP'. Sometimes, I'd like to be able to _inhibit_
|
||||
propagation of 'Member-to-IXP', so there will be a third set of communities that perform this
|
||||
function. Finally, matching on the informational communities in a clever way will enable a symmetric
|
||||
'IXP-to-Member' propagation.
|
||||
|
||||
To structure this implementation, it helps if I think about it in
|
||||
the following way:
|
||||
|
||||
Let's say, AS50869 is connected to IXP1, IXP2, IXP3 and IXP4. AS50869 has a _member_ called M1 at
|
||||
IXP1, and that member is 'permitted' to reach IXP2 and IXP3, but it is 'inhibited' from reaching
|
||||
IXP4. My _FreeIX Remote_ implementation now has to satisfy three main requirements:
|
||||
|
||||
1. **Ingress**: learn prefixes (from peers and members alike) at internet exchange points or
|
||||
private network interconnects, and 'tag' them with the correct informational communities.
|
||||
1. **Egress: Member-to-IXP**: Announce M1's prefixes to IXP2 and IXP3, but not to IXP4.
|
||||
1. **Egress: IXP-to-Member**: Announce IXP2's and IXP3's prefixes to M1, but not IXP4's.
|
||||
|
||||
### Defining Countries and Routers
|
||||
|
||||
I'll start by giving each country which has at least one router a unique _country_id_ in a YAML
|
||||
file, leaving the value 0 to mean 'all' countries:
|
||||
|
||||
```
|
||||
$ cat config/common/countries.yaml
|
||||
country:
|
||||
all: 0
|
||||
CH: 1
|
||||
NL: 2
|
||||
GR: 3
|
||||
IT: 4
|
||||
```
|
||||
|
||||
Each router has its own configuration file, and at the top, I'll define some metadata which
|
||||
includes things like the country in which it operates, and its own unique _router_id_, like so:
|
||||
|
||||
```
|
||||
$ cat config/chrma0.net.free-ix.net.yaml
|
||||
device:
|
||||
id: 1
|
||||
hostname: chrma0.free-ix.net
|
||||
shortname: chrma0
|
||||
country: CH
|
||||
loopbacks:
|
||||
ipv4: 194.126.235.16
|
||||
ipv6: "2a0b:dd80:3101::"
|
||||
location: "Hofwiesenstrasse, Ruemlang, Zurich, Switzerland"
|
||||
...
|
||||
```
|
||||
|
||||
### Defining communities
|
||||
|
||||
Next, I define the BGP communities in `class` and `subclass` types, in the following YAML structure:
|
||||
|
||||
```
|
||||
ebgp:
|
||||
community:
|
||||
legacy:
|
||||
noannounce: 0
|
||||
blackhole: 666
|
||||
inhibit: 3000
|
||||
prepend1: 3100
|
||||
prepend2: 3200
|
||||
prepend3: 3300
|
||||
large:
|
||||
class:
|
||||
informational: 1000
|
||||
permission: 2000
|
||||
inhibit: 3000
|
||||
prepend1: 3100
|
||||
prepend2: 3200
|
||||
prepend3: 3300
|
||||
subclass:
|
||||
all: 0
|
||||
router: 10
|
||||
country: 20
|
||||
group: 30
|
||||
asn: 40
|
||||
```
|
||||
|
||||
### Defining Members
|
||||
|
||||
In order to keep this system manageable, I have to rely on automation. I intend to leverage the
|
||||
BGP community _subclasses_ in a simple ACL system consisting of the following YAML, taking my buddy
|
||||
Antonios' network as an example:
|
||||
|
||||
```
|
||||
$ cat config/common/members.yaml
|
||||
member:
|
||||
210312:
|
||||
description: DaKnObNET
|
||||
prefix_filter: AS-SET-DNET
|
||||
permission: [ router:chrma0 ]
|
||||
inhibit: [ group:chix ]
|
||||
...
|
||||
```
|
||||
|
||||
The syntax of the `permission` and `inhibit` fields is identical. They are lists of key:value pairs
where the key must be one of the _subclasses_ (eg. 'router', 'country', 'group', 'asn'), and the
|
||||
value appropriate for that type. In this example, AS50869 is being asked to grant permissions for
|
||||
Antonios' prefixes to any peer connected to `router:chrma0`, but inhibit propagation to/from the
|
||||
exchange point called `group:chix`. I could extend this list, for example by adding a permission to
|
||||
`country:NL` or an inhibit to `router:grskg0` and so on.
|
||||
|
||||
I decide that sensible defaults are to give permissions to all, and keep inhibit empty. In other
|
||||
words: be very liberal in propagation, to maximize the value that FreeIX Remote can provide its
|
||||
members.
|
||||
|
||||
### Ingress: Learning Prefixes
|
||||
|
||||
With what I've defined so far, I can start to set informational BGP communities (a small worked example follows this list):
|
||||
* The prefixes learned on subclass **router** for `chrma0` will have value of device.id=1:
|
||||
`(50869,1010,1)`
|
||||
* The prefixes learned on subclass **country** for `chrma0` come from device.country=CH; looking that
up in `countries['CH']` shows this means value 1: `(50869,1020,1)`
|
||||
* When learning prefixes from a given internet exchange, Kees already knows its PeeringDB
|
||||
_ixp_id_, which is a unique value for each exchange point. Thus, subclass **group** for `chrma0` at
|
||||
[[CommunityIX](https://www.peeringdb.com/ix/2013)] is ixp_id=2013: `(50869,1030,2013)`
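
To make that arithmetic explicit, here's a tiny illustration (plain Python, mirroring the
`class`/`subclass` YAML above) of how these informational large communities are composed:

```
ASN = 50869
CLASS = {"informational": 1000, "permission": 2000, "inhibit": 3000}
SUBCLASS = {"all": 0, "router": 10, "country": 20, "group": 30, "asn": 40}

def community(cls, sub, value):
    # A large community is (my ASN, class + subclass, value)
    return (ASN, CLASS[cls] + SUBCLASS[sub], value)

print(community("informational", "router", 1))     # (50869, 1010, 1)    router chrma0
print(community("informational", "country", 1))    # (50869, 1020, 1)    country CH
print(community("informational", "group", 2013))   # (50869, 1030, 2013) group CommunityIX
```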
|
||||
|
||||
#### Ingress: Learning from members
|
||||
|
||||
I need to make sure that members send only the prefixes that I expect from them. To do this, I'll
|
||||
make use of a common tool called [[bgpq4](https://github.com/bgp/bgpq4)] which cobbles together the
|
||||
prefixes belonging to an AS-SET by referencing one or more IRR databases.
|
||||
|
||||
In Python, I'll prepare the Jinja context by generating the prefix filter lists like so:
|
||||
|
||||
```
|
||||
if session["type"] == "member":
|
||||
session = {**session, **data["member"][asn]}
|
||||
|
||||
pf = ebgp_merge_value(data["ebgp"], group, session, "prefix_filter", None)
|
||||
if pf:
|
||||
ctx["prefix_filter"] = {}
|
||||
pfn = pf
|
||||
pfn = pfn.replace("-", "_")
|
||||
pfn = pfn.replace(":", "_")
|
||||
|
||||
for af in [4, 6]:
|
||||
filter_name = "%s_%s_IPV%d" % (groupname.upper(), pfn, af)
|
||||
filter_contents = fetch_bgpq(filter_name, pf, af, allow_morespecifics=True)
|
||||
if "[" in filter_contents:
|
||||
ctx["prefix_filter"][filter_name] = { "str": filter_contents, "af": af }
|
||||
ctx["prefix_filter_ipv%d" % af] = True
|
||||
else:
|
||||
log.warning(f"Filter {filter_name} is empty!")
|
||||
ctx["prefix_filter_ipv%d" % af] = False
|
||||
```
|
||||
|
||||
First, if a given BGP session is of type _member_, I'll merge the `member[asn]` dictionary
|
||||
into the `ebgp.group.session[asn]`. I've left out error handling for brevity, but in case the member
|
||||
YAML file doesn't have an entry for the given ASN, it'll just revert back to being of type _peer_.
|
||||
|
||||
I'll use a helper function `ebgp_merge_value()` to walk the YAML hierarchy from the member-data
|
||||
enriched _session_ to the _group_ and finally to the _ebgp_ scope, looking for the existence of a
|
||||
key called _prefix_filter_ and defaulting to None in case none was found. With the value of
|
||||
_prefix_filter_ in hand (in this case `AS-SET-DNET`), I shell out to `bgpq4` for IPv4 and IPv6
|
||||
respectively. Sometimes, there are no IPv6 prefixes (why must you be like this?!) and sometimes
|
||||
there are no IPv4 prefixes (welcome to the Internet, kid!)
|
||||
|
||||
All of this, including the session and group information, is then fed as context to a
Jinja renderer, where I can use it in an _import_ filter like so:
|
||||
|
||||
```
|
||||
{% for plname, pl in (prefix_filter | default({})).items() %}
|
||||
{{pl.str}}
|
||||
{% endfor %}
|
||||
|
||||
filter ebgp_{{group_name}}_{{their_asn}}_import {
|
||||
{% if not prefix_filter_ipv4 | default(True) %}
|
||||
# WARNING: No IPv4 prefix filter found
|
||||
if (net.type = NET_IP4) then reject;
|
||||
{% endif %}
|
||||
{% if not prefix_filter_ipv6 | default(True) %}
|
||||
# WARNING: No IPv6 prefix filter found
|
||||
if (net.type = NET_IP6) then reject;
|
||||
{% endif %}
|
||||
{% for plname, pl in (prefix_filter | default({})).items() %}
|
||||
{% if pl.af == 4 %}
|
||||
if (net.type = NET_IP4 && ! (net ~ {{plname}})) then reject;
|
||||
{% elif pl.af == 6 %}
|
||||
if (net.type = NET_IP6 && ! (net ~ {{plname}})) then reject;
|
||||
{% endif %}
|
||||
{% endfor %}
|
||||
{% if session_type is defined %}
|
||||
if ! ebgp_import_{{session_type}}({{their_asn}}) then reject;
|
||||
{% endif %}
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.router}},{{device.id}})); ## informational.router = {{ device.hostname }}
|
||||
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.country}},{{country[device.country]}})); ## informational.country = {{ device.country }}
|
||||
{% if group.peeringdb_ix.id %}
|
||||
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.group}},{{group.peeringdb_ix.id}})); ## informational.group = {{ group_name }}
|
||||
{% endif %}
|
||||
|
||||
## NOTE(pim): More comes here, see Member-to-IXP below
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
Let me explain what's going on here, as the Jinja templating language that my generator uses is a bit
|
||||
... chatty. The first block will print the dictionary of zero or more `prefix_filter` entries. If
|
||||
the `prefix_filter` context variable doesn't exist, assume it's the empty dictionary and thus,
|
||||
print no prefix lists.
|
||||
|
||||
Then, I create a Bird2 filter; each of these must have a globally unique name. I satisfy this
|
||||
requirement by giving it a name with the tuple of {group, their_asn}. The first thing this filter
|
||||
does is inspect `prefix_filter_ipv4` and `prefix_filter_ipv6`, and if they are explicitly set to
|
||||
False (for example, if a member doesn't have any IRR prefixes associated with their AS-SET), then
|
||||
I'll reject any prefixes from them. Then, I'll match the prefixes with the `prefix_filter`, if
|
||||
provided, and reject any prefixes that aren't in the list I'm expecting on this session. Assuming
|
||||
we're still good to go, I'll hand this prefix off to a function called `ebgp_import_peer()` for
|
||||
peers and `ebgp_import_member()` for members, both of which ensure BGP communities are scrubbed.
|
||||
|
||||
```
|
||||
function ebgp_import_peer(int remote_as) -> bool
|
||||
{
|
||||
# Scrub BGP Communities (RFC 7454 Section 11)
|
||||
bgp_community.delete([(50869, *)]);
|
||||
bgp_large_community.delete([(50869, *, *)]);
|
||||
|
||||
# Scrub BLACKHOLE community
|
||||
bgp_community.delete((65535, 666));
|
||||
|
||||
return ebgp_import(remote_as);
|
||||
}
|
||||
|
||||
function ebgp_import_member(int remote_as) -> bool
|
||||
{
|
||||
# We scrub only our own (informational, permissions) BGP Communities for members
|
||||
bgp_large_community.delete([(50869,1000..2999,*)]);
|
||||
|
||||
return ebgp_import(remote_as);
|
||||
}
|
||||
```
|
||||
|
||||
After scrubbing the communities (peers are not allowed to set _any_ communities, and members are not
|
||||
allowed to set their own informational or permissions communities, but they are allowed to inhibit
|
||||
themselves or prepend, if they wish), one last check is performed by calling the underlying
|
||||
`ebgp_import()`:
|
||||
|
||||
```
|
||||
function ebgp_import(int remote_as) -> bool
|
||||
{
|
||||
if aspath_bogon() then return false;
|
||||
if (net.type = NET_IP4 && ipv4_bogon()) then return false;
|
||||
if (net.type = NET_IP6 && ipv6_bogon()) then return false;
|
||||
|
||||
if (net.type = NET_IP4 && ipv4_rpki_invalid()) then return false;
|
||||
if (net.type = NET_IP6 && ipv6_rpki_invalid()) then return false;
|
||||
|
||||
# Graceful Shutdown (https://www.rfc-editor.org/rfc/rfc8326.html)
|
||||
if (65535, 0) ~ bgp_community then bgp_local_pref = 0;
|
||||
|
||||
return true;
|
||||
}
|
||||
```
|
||||
|
||||
Here, belt-and-suspenders checks are performed, notably bogon AS Paths, IPv4/IPv6 prefixes and RPKI
|
||||
invalids are filtered out. If the prefix has the well-known community for [[BGP Graceful
|
||||
Shutdown](https://www.rfc-editor.org/rfc/rfc8326.html)], honor it and set the local preference to
|
||||
zero (making sure to prefer any other available path).
|
||||
|
||||
OK, after all these checks are done, I am finally ready to accept the prefix from this peer or
|
||||
member. It's time to add the informational communities based on the _router_id_, the router's
|
||||
_country_id_ and (if this is a session at a public internet exchange point documented in PeeringDB),
|
||||
the group's _ixp_id_.
|
||||
|
||||
#### Ingress Example: member
|
||||
|
||||
Here's what the rendered template looks like for Antonios' member session at CHIX:
|
||||
|
||||
```
|
||||
# bgpq4 -Ab4 -R 32 -l 'define CHIX_AS_SET_DNET_IPV4' AS-SET-DNET
|
||||
define CHIX_AS_SET_DNET_IPV4 = [
|
||||
44.31.27.0/24{24,32}, 44.154.130.0/24{24,32}, 44.154.132.0/24{24,32},
|
||||
147.189.216.0/21{21,32}, 193.5.16.0/22{22,32}, 212.46.55.0/24{24,32}
|
||||
];
|
||||
|
||||
# bgpq4 -Ab6 -R 128 -l 'define CHIX_AS_SET_DNET_IPV6' AS-SET-DNET
|
||||
define CHIX_AS_SET_DNET_IPV6 = [
|
||||
2001:678:f5c::/48{48,128}, 2a05:dfc1:9174::/48{48,128}, 2a06:9f81:2500::/40{40,128},
|
||||
2a06:9f81:2600::/40{40,128}, 2a0a:6044:7100::/40{40,128}, 2a0c:2f04:100::/40{40,128},
|
||||
2a0d:3dc0::/29{29,128}, 2a12:bc0::/29{29,128}
|
||||
];
|
||||
|
||||
filter ebgp_chix_210312_import {
|
||||
if (net.type = NET_IP4 && ! (net ~ CHIX_AS_SET_DNET_IPV4)) then reject;
|
||||
if (net.type = NET_IP6 && ! (net ~ CHIX_AS_SET_DNET_IPV6)) then reject;
|
||||
if ! ebgp_import_member(210312) then reject;
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
|
||||
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
|
||||
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
|
||||
|
||||
## NOTE(pim): More comes here, see Member-to-IXP below
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
#### Ingress Example: peer
|
||||
|
||||
For completeness, here's a regular peer, Cloudflare, at CHIX, and I hope you agree that the Jinja
|
||||
template renders down to something waaaay more readable now:
|
||||
|
||||
```
|
||||
filter ebgp_chix_13335_import {
|
||||
if ! ebgp_import_peer(13335) then reject;
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
|
||||
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
|
||||
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
Most sessions will actually look like this one: just learning prefixes, scrubbing inbound
|
||||
communities that are nobody's business to be setting but mine, tossing weird prefixes like bogons
|
||||
and then setting typically the three informational communities. I now know exactly which prefixes
|
||||
are picked up at group CHIX, which ones in country Switzerland, and which ones on router `chrma0`.
|
||||
|
||||
### Egress: Propagating Prefixes
|
||||
|
||||
And with that, I've completed the 'learning' part. Let me move to the 'propagating' part. A design
|
||||
goal of FreeIX Remote is to have _symmetric_ propagation. In my example above, member M1 should have
|
||||
its prefixes announced at IXP2 and IXP3, and all prefixes learned at IXP2 and IXP3 should be
|
||||
announced to member M1.
|
||||
|
||||
First, let me create a helper function in the generator. Its job is to take the symbolic
|
||||
`member.*.permissions` and `member.*.inhibit` lists and resolve them into a structure of numeric
|
||||
values suitable for BGP community list adding and matching. It's a bit of a beast, but I've
|
||||
simplified it a bit. Notably, I've removed all the error and exception handling for brevity:
|
||||
|
||||
```
|
||||
def parse_member_communities(data, asn, type):
|
||||
myasn = data["ebgp"]["asn"]
|
||||
cls = data["ebgp"]["community"]["large"]["class"]
|
||||
sub = data["ebgp"]["community"]["large"]["subclass"]
|
||||
|
||||
bgp_cl = []
|
||||
member = data["member"][asn]
perms = member.get(type, [])  # the member's 'permission' or 'inhibit' list (may be absent)
|
||||
|
||||
for perm in perms:
|
||||
if perm == "all":
|
||||
el = { "class": int(cls[type]), "subclass": int(sub["all"]),
|
||||
"value": 0, "description": f"{type}.all" }
|
||||
return [el]
|
||||
k, v = perm.split(":")
|
||||
if k == "country":
|
||||
country_id = data["country"][v]
|
||||
el = { "class": int(cls[type]), "subclass": int(sub["country"]),
|
||||
"value": int(country_id), "description": f"{type}.{k} = {v}" }
|
||||
bgp_cl.append(el)
|
||||
elif k == "asn":
|
||||
el = { "class": int(cls[type]), "subclass": int(sub["asn"]),
|
||||
"value": int(v), "description": f"{type}.{k} = {v}" }
|
||||
bgp_cl.append(el)
|
||||
elif k == "router":
|
||||
device_id = data["_devices"][v]["id"]
|
||||
el = { "class": int(cls[type]), "subclass": int(sub["router"]),
|
||||
"value": int(device_id), "description": f"{type}.{k} = {v}" }
|
||||
bgp_cl.append(el)
|
||||
elif k == "group":
|
||||
group = data["ebgp"]["groups"][v]
|
||||
if isinstance(group["peeringdb_ix"], dict):
|
||||
ix_id = group["peeringdb_ix"]["id"]
|
||||
else:
|
||||
ix_id = group["peeringdb_ix"]
|
||||
el = { "class": int(cls[type]), "subclass": int(sub["group"]),
|
||||
"value": int(ix_id), "description": f"{type}.{k} = {v}" }
|
||||
bgp_cl.append(el)
|
||||
else:
|
||||
log.warning (f"No implementation for {type} subclass '{k}' for member AS{asn}, skipping")
|
||||
|
||||
return bgp_cl
|
||||
|
||||
```
|
||||
|
||||
The essence of this function is to take a human readable list of symbols, like 'router:chrma0' and
|
||||
look up which subclass is meant by 'router' and which router_id belongs to 'chrma0'. It does this for keywords
|
||||
'router', 'country', 'group' and 'asn' and for a special keyword called 'all' as well.
|
||||
|
||||
Running this function on Antonios' member data above reveals the following:
|
||||
```
|
||||
Member 210312 has permissions:
|
||||
[{'class': 2000, 'subclass': 10, 'value': 1, 'description': 'permission.router = chrma0'}]
|
||||
Member 210312 has inhibits:
|
||||
[{'class': 3000, 'subclass': 30, 'value': 2365, 'description': 'inhibit.group = chix'}]
|
||||
```
|
||||
|
||||
The neat thing about this is that the data will come in handy for _both_ types of propagation, and
|
||||
the `parse_member_communities()` helper function returns pretty readable data, which will help in
|
||||
debugging and further understanding the ultimately generated configuration.
|
||||
|
||||
#### Egress: Member-to-IXP
|
||||
|
||||
OK, when learning Antonios' prefixes, I have instructed the system to propagate them to all
|
||||
sessions on router `chrma0`, except sessions on group `chix`. This means that in the direction of
|
||||
_from AS50869 to others_, I can do the following:
|
||||
|
||||
**1. Tag permissions and inhibits on ingress**
|
||||
|
||||
I add a tiny bit of logic using this data structure I just created above. In the import filter,
|
||||
remember I added `NOTE(pim): More comes here`? After setting the informational communities, I also
|
||||
add these:
|
||||
|
||||
```
{% if session_type == "member" %}
{% if permissions %}

# Add FreeIX Remote: Permission
{% for el in permissions %}
bgp_large_community.add(({{my_asn}},{{el.class+el.subclass}},{{el.value}})); ## {{ el.description }}
{% endfor %}
{% endif %}
{% if inhibits %}

# Add FreeIX Remote: Inhibit
{% for el in inhibits %}
bgp_large_community.add(({{my_asn}},{{el.class+el.subclass}},{{el.value}})); ## {{ el.description }}
{% endfor %}
{% endif %}
{% endif %}
```
|
||||
|
||||
Seeing as this block only gets rendered if the session type is _member_, let me show you what
Antonios' import filter looks like in its full glory:
|
||||
|
||||
```
|
||||
filter ebgp_chix_210312_import {
|
||||
if (net.type = NET_IP4 && ! (net ~ CHIX_AS_SET_DNET_IPV4)) then reject;
|
||||
if (net.type = NET_IP6 && ! (net ~ CHIX_AS_SET_DNET_IPV6)) then reject;
|
||||
if ! ebgp_import_member(210312) then reject;
|
||||
|
||||
# Add FreeIX Remote: Informational
|
||||
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
|
||||
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
|
||||
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
|
||||
|
||||
# Add FreeIX Remote: Permission
|
||||
bgp_large_community.add((50869,2010,1)); ## permission.router = chrma0
|
||||
|
||||
# Add FreeIX Remote: Inhibit
|
||||
bgp_large_community.add((50869,3030,2365)); ## inhibit.group = chix
|
||||
|
||||
accept;
|
||||
}
|
||||
```
|
||||
|
||||
Remember, the `ebgp_import_member()` helper will strip any informational (the 1000s) and permissions
|
||||
(the 2000s), but it would allow Antonios to set inhibits and prepends (the 3000s) so these BGP
|
||||
communities will still be allowed in. In other words, Antonios can't give himself propagation rights
|
||||
(sorry, buddy!) but if he would like to make AS50869 stop sending his prefixes to, say, CommunityIX,
|
||||
he could simply add the BGP community `(50869,3030,2013)` on his announcements, and that will get
|
||||
honored. If he'd like AS50869 to prepend itself twice before announcing to peer AS8298, he could set
|
||||
`(50869,3200,8298)` and that will also get picked up.
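
To make the rule concrete, here's a tiny Python sketch of what gets honored from a member's
announcements; this is an illustration of the policy described above, not the actual
`ebgp_import_member()` implementation:

```
def honored_member_communities(communities, my_asn=50869):
    """Strip informational (1000s) and permission (2000s) classes; keep the rest."""
    kept = []
    for asn, value, data in communities:
        if asn != my_asn:
            kept.append((asn, value, data))     # someone else's community: untouched
        elif 3000 <= value < 4000:
            kept.append((asn, value, data))     # inhibits and prepends: honored
        # informational (1000s) and permission (2000s) classes are dropped on import
    return kept
```
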
**2. Match permissions and inhibits on egress**
|
||||
|
||||
Now that all of Antonios' prefixes are tagged with permissions and inhibits, I can reveal how I
|
||||
implemented the export filters for AS50869:
|
||||
|
||||
```
|
||||
function member_prefix(int group) -> bool
|
||||
{
|
||||
bool permitted = false;
|
||||
|
||||
if (({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.all}}, 0) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.router}}, {{ device.id }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.country}}, {{ country[device.country] }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.group}}, group) ~ bgp_large_community) then {
|
||||
permitted = true;
|
||||
}
|
||||
if (({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.all}}, 0) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.router}}, {{ device.id }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.country}}, {{ country[device.country] }}) ~ bgp_large_community ||
|
||||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.group}}, group) ~ bgp_large_community) then {
|
||||
permitted = false;
|
||||
}
|
||||
return (permitted);
|
||||
}
|
||||
|
||||
function valid_prefix(int group) -> bool
|
||||
{
|
||||
return (source_prefix() || member_prefix(group));
|
||||
}
|
||||
|
||||
function ebgp_export_peer(int remote_as; int group) -> bool
|
||||
{
|
||||
if (source != RTS_BGP && source != RTS_STATIC) then return false;
|
||||
if !valid_prefix(group) then return false;
|
||||
|
||||
bgp_community.delete([(50869, *)]);
|
||||
bgp_large_community.delete([(50869, *, *)]);
|
||||
|
||||
return ebgp_export(remote_as);
|
||||
}
|
||||
```
|
||||
|
||||
From the bottom: the function `ebgp_export_peer()` is invoked on each peering session, and it gets
as arguments the remote AS (for example 13335 for Cloudflare) and the group (for example 2365
for CHIX). The function ensures that the route is either a _static_ route or a _BGP_ route. Then it
makes sure it's a `valid_prefix()` for the group.
|
||||
|
||||
The `valid_prefix()` function first checks if it's one of our own (as in: AS50869's own) prefixes,
which it does by calling `source_prefix()`, which I've omitted here as it would be a distraction.
All it does is check if the prefix is in a static prefix list generated with `bgpq4` for AS50869
itself. The more interesting observation is that to be eligible, the prefix needs to be either
`source_prefix()` **or** `member_prefix(group)`.
|
||||
|
||||
The propagation decision for 'Member-to-IXP' actually happens in that `member_prefix()` function. It
starts off by assuming the prefix is not permitted. Then it scans all relevant _permission_
communities which may be present in the RIB for this prefix:
- is the `all` permission community `(50869,2000,0)` set?
- what about the `router` permission `(50869,2010,R)` for my _router_id_?
- perhaps the `country` permission `(50869,2020,C)` for my _country_id_?
- or maybe the `group` permission `(50869,2030,G)` for the _ixp_id_ that this session lives on?

If any of these conditions are true, then this prefix _might_ be permitted, so I set the variable to
True. Next, I check and see if any of the _inhibit_ communities are set, either by me (in
`members.yaml`) or by the member on the live BGP session. If any one of them matches, then I flip
the variable to False again. Once the verdict is known, I can return True or False here, which
makes its way all the way up the call stack and ultimately announces the member prefix on the BGP
session, or not. Slick!
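
For readers who prefer pseudo-code over BIRD filter syntax: here is the same decision expressed as
a small Python sketch. The parameters stand in for the values the rendered filter has baked in, and
the community numbers are the ones used throughout this article:

```
def member_prefix(communities, my_asn, router_id, country_id, group_id):
    """Permissions first, then inhibits get the final say."""
    permits = {(my_asn, 2000, 0), (my_asn, 2010, router_id),
               (my_asn, 2020, country_id), (my_asn, 2030, group_id)}
    inhibits = {(my_asn, 3000, 0), (my_asn, 3010, router_id),
                (my_asn, 3020, country_id), (my_asn, 3030, group_id)}
    permitted = bool(communities & permits)
    if communities & inhibits:
        permitted = False
    return permitted

# Antonios' prefixes carry (50869,2010,1) and (50869,3030,2365): permitted on
# chrma0 sessions, except those in group 2365 (CHIX).
print(member_prefix({(50869, 2010, 1), (50869, 3030, 2365)}, 50869, 1, 1, 2365))  # False
print(member_prefix({(50869, 2010, 1), (50869, 3030, 2365)}, 50869, 1, 1, 2013))  # True
```
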
#### Egress: IXP-to-Member
|
||||
|
||||
At this point, members' prefixes get announced at the correct internet exchange points, but I need to
satisfy one more requirement: the prefixes picked up at those IXPs should _also_ be announced to
members. For this, the helper dictionary with permissions and inhibits can be used in a clever way.
What if I held them against the informational communities? For example, since I have _permitted_
Antonios to be announced at any IXP connected to router `chrma0`, all prefixes I learned at
`chrma0` are fair game, right? But, I configured an _inhibit_ for Antonios' prefixes at CHIX. No
problem, I have an informational community for all prefixes I learned from the CHIX group!
|
||||
|
||||
I come to the realization that IXP-to-Member simply adds to the Member-to-IXP logic. Everything that
|
||||
I would announce to a peer, I will also announce to a member. Off I go, adding one last helper
|
||||
function to the BGP session Jinja template:
|
||||
|
||||
```
|
||||
{% if session_type == "member" %}
|
||||
function ebgp_export_{{group_name}}_{{their_asn}}(int remote_as; int group) -> bool
|
||||
{
|
||||
bool permitted = false;
|
||||
|
||||
if (source != RTS_BGP && source != RTS_STATIC) then return false;
|
||||
if valid_prefix(group) then return ebgp_export(remote_as);
|
||||
|
||||
{% for el in permissions | default([]) %}
|
||||
if (bgp_large_community ~ [({{ my_asn }},{{ 1000+el.subclass}},{% if el.value == 0%}*{% else %}{{el.value}}{% endif %})]) then permitted=true; ## {{el.description}}
|
||||
{% endfor %}
|
||||
{% for el in inhibits | default([]) %}
|
||||
if (bgp_large_community ~ [({{ my_asn }},{{ 1000+el.subclass}},{% if el.value == 0%}*{% else %}{{el.value}}{% endif %})]) then permitted=false; ## {{el.description}}
|
||||
{% endfor %}
|
||||
|
||||
if (permitted) then return ebgp_export(remote_as);
|
||||
return false;
|
||||
}
|
||||
{% endif %}
|
||||
```
|
||||
|
||||
Note that in essence, this new function still calls `valid_prefix()`, which in turn calls
|
||||
`source_prefix()` **or** `member_prefix(group)`, so it announces the same prefixes that are also
|
||||
announced to sessions of type 'peer'. But then, I'll also inspect the _informational_ communities,
|
||||
where the value of `0` is replaced with a wildcard, because 'permit or inhibit all' would mean
|
||||
'match any of these BGP communities'. This template renders as follows for Antonios at CHIX:
|
||||
|
||||
```
function ebgp_export_chix_210312(int remote_as; int group) -> bool
{
bool permitted = false;

if (source != RTS_BGP && source != RTS_STATIC) then return false;
if valid_prefix(group) then return ebgp_export(remote_as);

if (bgp_large_community ~ [(50869,1010,1)]) then permitted=true; ## permission.router = chrma0
if (bgp_large_community ~ [(50869,1030,2365)]) then permitted=false; ## inhibit.group = chix

if (permitted) then return ebgp_export(remote_as);
return false;
}
```
|
||||
|
||||
## Results
|
||||
|
||||
With this, the propagation logic is complete. Announcements are _symmetric_, that is to say the function
|
||||
`ebgp_export_chix_210312()` sees to it that Antonios gets the prefixes learned at router `chrma0`
|
||||
but not those learned at group `CHIX`. Similarly, the `ebgp_export_peer()` ensures that Antonios'
|
||||
prefixes are propagated to any session at router `chrma0` except those sessions at group `CHIX`.
|
||||
|
||||
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
|
||||
|
||||
I have installed VPP with [[OSPFv3]({{< ref 2024-06-22-vpp-ospf-2.md >}})] unnumbered interfaces,
so each router has exactly one IPv4 and IPv6 loopback address. The router in Rümlang has been
operational for a while, the ones in Amsterdam (nlams0.free-ix.net) and Thessaloniki
(grskg0.free-ix.net) have been deployed and are connecting to IXPs now, and the one in Milan
(itmil0.free-ix.net) has been installed but is pending physical deployment at Caldara.
|
||||
|
||||
I deployed a test setup with a few permissions and inhibits on the Rümlang router, with many thanks
to Jurrian, Sam and Antonios for allowing me to guinea-pig-ize their member sessions. With the
following test configuration:
|
||||
|
||||
```
|
||||
member:
|
||||
35202:
|
||||
description: OnTheGo (Sam Aschwanden)
|
||||
prefix_filter: AS-OTG
|
||||
permission: [ router:chrma0 ]
|
||||
inhibit: [ group:comix ]
|
||||
210312:
|
||||
description: DaKnObNET
|
||||
prefix_filter: AS-SET-DNET
|
||||
permission: [ router:chrma0 ]
|
||||
inhibit: [ group:chix ]
|
||||
212635:
|
||||
description: Jurrian van Iersel
|
||||
prefix_filter: AS212635:AS-212635
|
||||
permission: [ router:chrma0 ]
|
||||
inhibit: [ group:chix, group:fogixp ]
|
||||
```
|
||||
|
||||
I can see the following prefix learn/announce counts towards _members_:
|
||||
|
||||
```
|
||||
pim@chrma0:~$ for i in $(birdc show protocol | grep member | cut -f1 -d' '); do echo -n $i\ ; birdc
|
||||
show protocol all $i | grep Routes; done
|
||||
chix_member_35202_ipv4_1 2 imported, 0 filtered, 159984 exported, 0 preferred
|
||||
chix_member_35202_ipv6_1 2 imported, 0 filtered, 61730 exported, 0 preferred
|
||||
chix_member_210312_ipv4_1 3 imported, 0 filtered, 3518 exported, 3 preferred
|
||||
chix_member_210312_ipv6_1 2 imported, 0 filtered, 1251 exported, 2 preferred
|
||||
comix_member_35202_ipv4_1 2 imported, 0 filtered, 159981 exported, 2 preferred
|
||||
comix_member_35202_ipv4_2 2 imported, 0 filtered, 159981 exported, 1 preferred
|
||||
comix_member_35202_ipv6_1 2 imported, 0 filtered, 61727 exported, 2 preferred
|
||||
comix_member_35202_ipv6_2 2 imported, 0 filtered, 61727 exported, 1 preferred
|
||||
fogixp_member_212635_ipv4_1 1 imported, 0 filtered, 442 exported, 1 preferred
|
||||
fogixp_member_212635_ipv6_1 14 imported, 0 filtered, 181 exported, 14 preferred
|
||||
freeix_ch_member_210312_ipv4_1 3 imported, 0 filtered, 3521 exported, 0 preferred
|
||||
freeix_ch_member_210312_ipv6_1 2 imported, 0 filtered, 1253 exported, 0 preferred
|
||||
```
|
||||
|
||||
Let me make a few observations:
|
||||
* Hurricane Electric AS6939 is present at CHIX, and they tend to announce a very large number of
|
||||
prefixes. So every member who is permitted (and not inhibited) at CHIX will see all of those: Sam's
|
||||
AS35202 is inhibited on CommunityIX but not on CHIX, and he's permitted on both. That explains why
|
||||
he is seeing the routes on both sessions.
|
||||
* I've inhibited Jurrian's AS212635 to/from both CHIX and FogIXP, which means he will be seeing
|
||||
CommunityIX (~245 IPv4, 85 IPv6 prefixes), and FreeIX CH (~173 IPv4 and ~60 IPv6). We also send him
|
||||
the member prefixes, which is about 35 or so additional prefixes. This explains why Jurrian is
|
||||
receiving from us ~440 IPv4 and ~180 IPv6.
|
||||
* Antonios' AS210312, the exemplar in this article, is receiving all-but-CHIX. FogIXP yields 3077
|
||||
or so IPv4 and 1056 IPv6 prefixes, while I've already added up FreeIX, CommunityIX, and our members
|
||||
(this is what we're sending Jurrian!), at 330 resp 180, so Antonios should be getting about 3500 IPv4
|
||||
prefixes and 1250 IPv6 prefixes.
|
||||
|
||||
In the other direction, I would expect to be announcing to _peers_ only prefixes belonging to either
|
||||
AS50869 itself, or those of our members:
|
||||
|
||||
```
|
||||
pim@chrma0:~$ for i in $(birdc show protocol | grep peer.*_1 | cut -f1 -d' '); do echo -n $i\ ; birdc
|
||||
show protocol all $i | grep Routes || echo; done
|
||||
chix_peer_212100_ipv4_1 57618 imported, 0 filtered, 24 exported, 778 preferred
|
||||
chix_peer_212100_ipv6_1 21979 imported, 1 filtered, 37 exported, 7186 preferred
|
||||
chix_peer_13335_ipv4_1 4767 imported, 9 filtered, 24 exported, 4765 preferred
|
||||
chix_peer_13335_ipv6_1 371 imported, 1 filtered, 37 exported, 369 preferred
|
||||
chix_peer_6939_ipv4_1 151787 imported, 27 filtered, 24 exported, 133943 preferred
|
||||
chix_peer_6939_ipv6_1 61191 imported, 6 filtered, 37 exported, 16223 preferred
|
||||
comix_peer_44596_ipv4_1 594 imported, 0 filtered, 25 exported, 10 preferred
|
||||
comix_peer_44596_ipv6_1 1147 imported, 0 filtered, 50 exported, 0 preferred
|
||||
comix_peer_8298_ipv4_1 23 imported, 0 filtered, 25 exported, 0 preferred
|
||||
comix_peer_8298_ipv6_1 34 imported, 0 filtered, 50 exported, 0 preferred
|
||||
fogixp_peer_47498_ipv4_1 3286 imported, 1 filtered, 27 exported, 3077 preferred
|
||||
fogixp_peer_47498_ipv6_1 1838 imported, 0 filtered, 39 exported, 1056 preferred
|
||||
freeix_ch_peer_51530_ipv4_1 355 imported, 0 filtered, 28 exported, 0 preferred
|
||||
freeix_ch_peer_51530_ipv6_1 143 imported, 0 filtered, 53 exported, 0 preferred
|
||||
```
|
||||
|
||||
Some observations:
|
||||
|
||||
* Nobody is inhibited at FreeIX Switzerland. It stands to reason, therefore, that it has the most
exported prefixes: 28 for IPv4 and 53 for IPv6.
* Two members are inhibited at CHIX, which makes it the group with the fewest exported prefixes:
24 for IPv4 and 27 for IPv6.
* All peers at each exchange (group) will see the same number of prefixes. I can confirm that
at CHIX, all three peers have the same number of announced prefixes. Similarly, at CommunityIX, all
peers have the same number.
* If Antonios, Sam or Jurrian were to tag their outgoing announcements to AS50869 with an additional inhibit
BGP community (eg `(50869,3020,1)` to inhibit country Switzerland), they could tweak these numbers.
|
||||
|
||||
## What's next
|
||||
|
||||
This all adds up. I'd like to test the waters with my friendly neighborhood canaries a little bit,
to make sure that announcements are as expected, and traffic flows where appropriate. In the meantime,
I'll chase the deployment of LSIX, FrysIX, SpeedIX and possibly a few others in Amsterdam. And of
course FreeIX Greece in Thessaloniki. I'll try to get the Milan VPP router deployed (it's already
installed and configured, but currently powered off) and connected to PCIX, MIX and a few others.
|
||||
|
||||
## How can you help?
|
||||
|
||||
If you're willing to participate with a VPP router and connect it to either multiple local internet
|
||||
exchanges (like I've demonstrated in Zurich), or better yet, to one or more of the other existing
|
||||
routers, I would welcome your contribution. [[Contact]({{< ref contact.md >}})] me for details.
|
||||
|
||||
A bit further down the pike, a connection from Amsterdam to Zurich, from Zurich to Milan and from
|
||||
Milan to Thessaloniki is on the horizon. If you are willing and able to donate some bandwidth (point
|
||||
to point VPWS, VLL, L2VPN) and your transport network is capable of at least 2026 bytes of _inner_
|
||||
payload, please also [[reach out]({{< ref contact.md >}})] as I'm sure many small network operators
|
||||
would be thrilled.
|
857
content/articles/2025-02-08-sflow-3.md
Normal file
@ -0,0 +1,857 @@
|
||||
---
|
||||
date: "2025-02-08T07:51:23Z"
|
||||
title: 'VPP with sFlow - Part 3'
|
||||
---
|
||||
|
||||
# Introduction
|
||||
|
||||
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width="12em" >}}
|
||||
|
||||
In the second half of last year, I picked up a project together with Neil McKee of
[[inMon](https://inmon.com/)], the caretakers of [[sFlow](https://sflow.org)]: an industry-standard
technology for monitoring high speed networks. `sFlow` gives complete visibility into the
use of networks, enabling performance optimization, accounting/billing for usage, and defense against
security threats.
|
||||
|
||||
The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for
a small portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but
also in the VPP software dataplane. The agent then _transmits_ these samples using a Linux kernel
feature called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)].
This greatly reduces the complexity of code to be implemented in the forwarding path, while at the
same time bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business
logic for the more complex state keeping, packet marshalling and transmission from the _Agent_ to a
central _Collector_.
|
||||
|
||||
In this third article, I wanted to spend some time discussing how samples make their way out of the
|
||||
VPP dataplane, and into higher level tools.
|
||||
|
||||
## Recap: sFlow
|
||||
|
||||
{{< image float="left" src="/assets/sflow/sflow-overview.png" alt="sFlow Overview" width="14em" >}}
|
||||
|
||||
sFlow describes a method for Monitoring Traffic in Switched/Routed Networks, originally described in
[[RFC3176](https://datatracker.ietf.org/doc/html/rfc3176)]. The current specification is version 5
and is homed on the sFlow.org website [[ref](https://sflow.org/sflow_version_5.txt)]. Typically, a
Switching ASIC in the dataplane (seen at the bottom of the diagram to the left) is asked to copy
1-in-N packets to the local sFlow Agent.
|
||||
|
||||
**Sampling**: The agent will copy the first N bytes (typically 128) of the packet into a sample. As
|
||||
the ASIC knows which interface the packet was received on, the `inIfIndex` will be added. After a
|
||||
routing decision is made, the nexthop and its L2 address and interface become known. The ASIC might
|
||||
annotate the sample with this `outIfIndex` and `DstMAC` metadata as well.
|
||||
|
||||
**Drop Monitoring**: There's one rather clever insight that sFlow gives: what if the packet _was
|
||||
not_ routed or switched, but rather discarded? For this, sFlow is able to describe the reason for
|
||||
the drop. For example, the ASIC receive queue could have been overfull, or it did not find a
|
||||
destination to forward the packet to (no FIB entry), perhaps it was instructed by an ACL to drop the
|
||||
packet or maybe even tried to transmit the packet but the physical datalink layer had to abandon the
|
||||
transmission for whatever reason (link down, TX queue full, link saturation, and so on). It's hard
to overstate how important it is to have this so-called _drop monitoring_, as operators often spend
hours and hours figuring out _why_ packets are lost in their network or datacenter switching fabric.
|
||||
|
||||
**Metadata**: The agent may have other metadata as well, such as which prefix was the source and
|
||||
destination of the packet, what additional RIB information is available (AS path, BGP communities,
|
||||
and so on). This may be added to the sample record as well.
|
||||
|
||||
**Counters**: Since sFlow is sampling 1:N packets, the system can estimate total traffic in a
reasonably accurate way. Peter and Sonia wrote a succinct
[[paper](https://sflow.org/packetSamplingBasics/)] about the math, so I won't get into that here.
|
||||
Mostly because I am but a software engineer, not a statistician... :) However, I will say this: if a
|
||||
fraction of the traffic is sampled but the _Agent_ knows how many bytes and packets were forwarded
|
||||
in total, it can provide an overview with a quantifiable accuracy. This is why the _Agent_ will
|
||||
periodically get the interface counters from the ASIC.
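
As a back-of-the-envelope illustration of that scaling (the numbers below are made up, not taken
from the paper): with 1-in-N sampling, every sample stands in for roughly N packets, and the frame
lengths recorded in the samples scale the same way:

```python
sampling_N = 1000                  # 1-in-1000 sampling
samples_seen = 4321                # packet samples received in some interval
sampled_frame_bytes = 2_212_352    # sum of the original frame lengths in those samples

est_packets = samples_seen * sampling_N         # ~4.3 million packets forwarded
est_bytes = sampled_frame_bytes * sampling_N    # ~2.2 GB forwarded
```
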
**Collector**: One or more samples can be concatenated into UDP messages that go from the _sFlow
|
||||
Agent_ to a central _sFlow Collector_. The heavy lifting in analysis is done upstream from the
|
||||
switch or router, which is great for performance. Many thousands or even tens of thousands of
|
||||
agents can forward their samples and interface counters to a single central collector, which in turn
|
||||
can be used to draw up a near real time picture of the state of traffic through even the largest of
|
||||
ISP networks or datacenter switch fabrics.
|
||||
|
||||
In sFlow parlance [[VPP](https://fd.io/)] and its companion
|
||||
[[hsflowd](https://github.com/sflow/host-sflow)] together form an _Agent_ (it sends the UDP packets
|
||||
over the network), and for example the commandline tool `sflowtool` could be a _Collector_ (it
|
||||
receives the UDP packets).
|
||||
|
||||
## Recap: sFlow in VPP
|
||||
|
||||
First, I have some pretty good news to report - our work on this plugin was
|
||||
[[merged](https://gerrit.fd.io/r/c/vpp/+/41680)] and will be included in the VPP 25.02 release in a
|
||||
few weeks! Last weekend, I gave a lightning talk at
|
||||
[[FOSDEM](https://fosdem.org/2025/schedule/event/fosdem-2025-4196-vpp-monitoring-100gbps-with-sflow/)]
|
||||
in Brussels, Belgium, and caught up with a lot of community members and network- and software
|
||||
engineers. I had a great time.
|
||||
|
||||
In trying to keep the amount of code as small as possible, and therefore the probability of bugs that
|
||||
might impact VPP's dataplane stability low, the architecture of the end to end solution consists of
|
||||
three distinct parts, each with their own risk and performance profile:
|
||||
|
||||
{{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}}
|
||||
|
||||
**1. sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves
|
||||
packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin
|
||||
will inspect 1-in-N, taking a sample for further processing. Here, we don't try to be clever, simply
|
||||
copy the `inIfIndex` and the first bytes of the ethernet frame, and append them to a
|
||||
[[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] queue. If too many samples
|
||||
arrive, samples are dropped at the tail, and a counter incremented. This way, I can tell when the
|
||||
dataplane is congested. Bounded FIFOs also provide fairness: they allow each VPP worker thread to
get its fair share of samples into the Agent's hands.
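
In Python terms, the per-worker queue behaves roughly like the sketch below; the depth and the
names are mine, chosen only to illustrate the tail-drop-and-count behaviour described above:

```python
from collections import deque

FIFO_DEPTH = 1024           # illustrative bound, not VPP's actual value
fifo = deque()
tail_drops = 0

def enqueue_sample(sample):
    """Append a sample unless the FIFO is full; count drops otherwise."""
    global tail_drops
    if len(fifo) >= FIFO_DEPTH:
        tail_drops += 1     # congestion signal: the consumer can read this counter
        return False
    fifo.append(sample)
    return True
```
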
**2. sFlow main process**: There's a function running on the _main thread_, which shifts further
|
||||
processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it
|
||||
consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones
|
||||
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
|
||||
configurable), it'll grab all interface counters from those interfaces for which I have sFlow
|
||||
turned on. VPP produces _Netlink_ messages and sends them to the kernel.
|
||||
|
||||
**3. Host sFlow daemon**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_
|
||||
messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on
|
||||
hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy,
|
||||
this module already exists. But Neil implemented a _mod_vpp_ which can grab interface names and their
|
||||
`ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages alongside
|
||||
the PSAMPLEs.
|
||||
|
||||
|
||||
By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref
|
||||
2021-08-25-vpp-4 >}})] plugin. It's a mechanism for programs running in userspace to share
|
||||
information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from
|
||||
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
|
||||
producer/subscriber relationship and nothing precludes one userspace process (`vpp`) from being the
producer while another userspace process (`hsflowd`) acts as the consumer!
|
||||
|
||||
Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the rest,
|
||||
giving correctness and upstream interoperability pretty much for free. That's slick!
|
||||
|
||||
### VPP: sFlow Configuration
|
||||
|
||||
The solution that I offer is based on two moving parts. First, the VPP plugin configuration, which
|
||||
turns on sampling at a given rate on physical devices, also known as _hardware-interfaces_. Second,
|
||||
the open source component [[host-sflow](https://github.com/sflow/host-sflow/releases)] can be
|
||||
configured as of release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
|
||||
|
||||
I will show how to configure VPP in three ways:
|
||||
|
||||
***1. VPP Configuration via CLI***
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ vppctl
|
||||
vpp0-0# sflow sampling-rate 100
|
||||
vpp0-0# sflow polling-interval 10
|
||||
vpp0-0# sflow header-bytes 128
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/0
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/0 disable
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/2
|
||||
vpp0-0# sflow enable GigabitEthernet10/0/3
|
||||
```
|
||||
|
||||
The first three commands set the global defaults - in my case I'm going to be sampling at 1:100
|
||||
which is an unusually high rate. A production setup may take 1-in-_linkspeed-in-megabits_ so for a
|
||||
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
|
||||
appropriate, depending on link load. The second command sets the interface stats polling interval.
|
||||
The default is to gather these statistics every 20 seconds, but I set it to 10s here.
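
That 1-in-_linkspeed-in-megabits_ rule of thumb is easy to express in code; a tiny helper (mine,
purely illustrative) would be:

```python
def suggested_sampling_rate(link_speed_mbps: int) -> int:
    """1GE -> 1:1'000, 10GE -> 1:10'000, 100GE -> 1:100'000."""
    return max(1, link_speed_mbps)
```
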
Next, I tell the plugin how many bytes of the sampled ethernet frame should be taken. Common
|
||||
values are 64 and 128 but it doesn't have to be a power of two. I want enough data to see the
|
||||
headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but the contents of
|
||||
the payload are rarely interesting for
|
||||
statistics purposes.
|
||||
|
||||
Finally, I can turn on the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP,
|
||||
an idiomatic way to turn on and off things is to have an enabler/disabler. It feels a bit clunky
|
||||
maybe to write `sflow enable $iface disable` but it makes more logical sense if you parse that as
|
||||
"enable-disable" with the default being the "enable" operation, and the alternate being the
|
||||
"disable" operation.
|
||||
|
||||
***2. VPP Configuration via API***
|
||||
|
||||
I implemented a few API methods for the most common operations. Here's a snippet that obtains the
|
||||
same config as what I typed on the CLI above, but using these Python API calls:
|
||||
|
||||
```python
|
||||
from vpp_papi import VPPApiClient, VPPApiJSONFiles
|
||||
import sys
|
||||
|
||||
vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
|
||||
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
|
||||
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
|
||||
vpp.connect("sflow-api-client")
|
||||
print(vpp.api.show_version().version)
|
||||
# Output: 25.06-rc0~14-g9b1c16039
|
||||
|
||||
vpp.api.sflow_sampling_rate_set(sampling_N=100)
|
||||
print(vpp.api.sflow_sampling_rate_get())
|
||||
# Output: sflow_sampling_rate_get_reply(_0=655, context=3, sampling_N=100)
|
||||
|
||||
vpp.api.sflow_polling_interval_set(polling_S=10)
|
||||
print(vpp.api.sflow_polling_interval_get())
|
||||
# Output: sflow_polling_interval_get_reply(_0=661, context=5, polling_S=10)
|
||||
|
||||
vpp.api.sflow_header_bytes_set(header_B=128)
|
||||
print(vpp.api.sflow_header_bytes_get())
|
||||
# Output: sflow_header_bytes_get_reply(_0=665, context=7, header_B=128)
|
||||
|
||||
vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True)
|
||||
vpp.api.sflow_enable_disable(hw_if_index=2, enable_disable=True)
|
||||
print(vpp.api.sflow_interface_dump())
|
||||
# Output: [ sflow_interface_details(_0=667, context=8, hw_if_index=1),
|
||||
# sflow_interface_details(_0=667, context=8, hw_if_index=2) ]
|
||||
|
||||
print(vpp.api.sflow_interface_dump(hw_if_index=2))
|
||||
# Output: [ sflow_interface_details(_0=667, context=9, hw_if_index=2) ]
|
||||
|
||||
print(vpp.api.sflow_interface_dump(hw_if_index=1234)) ## Invalid hw_if_index
|
||||
# Output: []
|
||||
|
||||
vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=False)
|
||||
print(vpp.api.sflow_interface_dump())
|
||||
# Output: [ sflow_interface_details(_0=667, context=10, hw_if_index=2) ]
|
||||
```
|
||||
|
||||
This short program toys around a bit with the sFlow API. I first set the sampling to 1:100 and get
|
||||
the current value. Then I set the polling interval to 10s and retrieve the current value again.
|
||||
Finally, I set the header bytes to 128, and retrieve the value again.
|
||||
|
||||
Enabling and disabling sFlow on interfaces shows the idiom I mentioned before - the API being an
|
||||
`*_enable_disable()` call of sorts, and typically taking a boolean argument if the operator wants to
|
||||
enable (the default), or disable sFlow on the interface. Getting the list of enabled interfaces can
|
||||
be done with the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details`
|
||||
messages.
|
||||
|
||||
I demonstrated VPP's Python API and how it works in a fair amount of detail in a [[previous
|
||||
article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you.
|
||||
|
||||
***3. VPPCfg YAML Configuration***
|
||||
|
||||
Writing on the CLI and calling the API is good and all, but many users of VPP have noticed that it
|
||||
does not have any form of configuration persistence and that's deliberate. VPP's goal is to be a
|
||||
programmable dataplane, and explicitly has left the programming and configuration as an exercise for
|
||||
integrators. I have written a Python project that takes a YAML file as input and uses it to
|
||||
configure (and reconfigure, on the fly) the dataplane automatically, called
|
||||
[[VPPcfg](https://git.ipng.ch/ipng/vppcfg.git)]. Previously, I wrote some implementation thoughts
|
||||
on its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2
|
||||
>}})] so I won't repeat that here. Instead, I will just show the configuration:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ cat << EOF > vppcfg.yaml
|
||||
interfaces:
|
||||
GigabitEthernet10/0/0:
|
||||
sflow: true
|
||||
GigabitEthernet10/0/1:
|
||||
sflow: true
|
||||
GigabitEthernet10/0/2:
|
||||
sflow: true
|
||||
GigabitEthernet10/0/3:
|
||||
sflow: true
|
||||
|
||||
sflow:
|
||||
sampling-rate: 100
|
||||
polling-interval: 10
|
||||
header-bytes: 128
|
||||
EOF
|
||||
pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp
|
||||
[INFO ] root.main: Loading configfile vppcfg.yaml
|
||||
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
|
||||
[INFO ] root.main: Configuration is valid
|
||||
[INFO ] vppcfg.reconciler.write: Wrote 13 lines to /etc/vpp/config/vppcfg.vpp
|
||||
[INFO ] root.main: Planning succeeded
|
||||
pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp
|
||||
```
|
||||
|
||||
The nifty thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to
|
||||
1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg
|
||||
apply` stages and the VPP dataplane will be reprogrammed to reflect the newly declared configuration.
|
||||
|
||||
### hsflowd: Configuration
|
||||
|
||||
When sFlow is enabled, VPP will start to emit _Netlink_ messages of type PSAMPLE with packet samples
|
||||
and of type USERSOCK with the custom messages containing interface names and counters. These latter
|
||||
custom messages have to be decoded, which is done by the _mod_vpp_ module in `hsflowd`, starting
|
||||
from release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
|
||||
|
||||
Here's a minimalist configuration:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ cat /etc/hsflowd.conf
|
||||
sflow {
|
||||
collector { ip=127.0.0.1 udpport=16343 }
|
||||
collector { ip=192.0.2.1 namespace=dataplane }
|
||||
psample { group=1 }
|
||||
vpp { osIndex=off }
|
||||
}
|
||||
```
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
There are two important details that can be confusing at first: \
|
||||
**1.** kernel network namespaces \
|
||||
**2.** interface index namespaces
|
||||
|
||||
#### hsflowd: Network namespace
|
||||
|
||||
Network namespaces virtualize Linux's network stack. Upon creation, a network namespace contains only
|
||||
a loopback interface, and subsequently interfaces can be moved between namespaces. Each network
|
||||
namespace will have its own set of IP addresses, its own routing table, socket listing, connection
|
||||
tracking table, firewall, and other network-related resources. When started by systemd, `hsflowd`
|
||||
and VPP will normally both run in the _default_ network namespace.
|
||||
|
||||
Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will
|
||||
naturally do this in the network namespace that its VPP process is running in (the _default_
|
||||
namespace, normally). It is therefore important that the recipient of these Netlink messages,
|
||||
notably `hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them together in
|
||||
a different namespace (eg. a container in Kubernetes or Docker), as long as they can see each other.
|
||||
|
||||
It might pose a problem if the network connectivity lives in a different namespace than the default
|
||||
one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface
|
||||
pairs, _LIPs_, in a dataplane namespace. The main reason for doing this is to allow something like
|
||||
FRR or Bird to completely govern the routing table in the kernel and keep it in-sync with the FIB in
|
||||
VPP. In such a _dataplane_ network namespace, typically every interface is owned by VPP.
|
||||
|
||||
Luckily, `hsflowd` can attach to one (default) namespace to get the PSAMPLEs, but create a socket in
|
||||
a _different_ (dataplane) namespace to send packets to a collector. This explains the second
|
||||
_collector_ entry in the config-file above. Here, `hsflowd` will send UDP packets to 192.0.2.1:6343
|
||||
from within the (VPP) dataplane namespace, and to 127.0.0.1:16343 in the default namespace.
|
||||
|
||||
#### hsflowd: osIndex
|
||||
|
||||
I hope the previous section made some sense, because this one will be a tad more esoteric. When
|
||||
creating a network namespace, each interface will get its own uint32 interface index that identifies
|
||||
it, and such an ID is typically called an `ifIndex`. It's important to note that the same number can
|
||||
(and will!) occur multiple times, once for each namespace. Let me give you an example:
|
||||
|
||||
```
|
||||
pim@summer:~$ ip link
|
||||
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
|
||||
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
||||
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ipng-sl state UP ...
|
||||
link/ether 00:22:19:6a:46:2e brd ff:ff:ff:ff:ff:ff
|
||||
altname enp1s0f0
|
||||
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 900 qdisc mq master ipng-sl state DOWN ...
|
||||
link/ether 00:22:19:6a:46:30 brd ff:ff:ff:ff:ff:ff
|
||||
altname enp1s0f1
|
||||
|
||||
pim@summer:~$ ip netns exec dataplane ip link
|
||||
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
|
||||
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
|
||||
2: loop0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
|
||||
link/ether de:ad:00:00:00:00 brd ff:ff:ff:ff:ff:ff
|
||||
3: xe1-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
|
||||
link/ether 00:1b:21:bd:c7:18 brd ff:ff:ff:ff:ff:ff
|
||||
```
|
||||
|
||||
I want to draw your attention to the number at the beginning of the line. In the _default_
|
||||
namespace, `ifIndex=3` corresponds to `ifName=eno2` (which has no link, it's marked `DOWN`). But in
|
||||
the _dataplane_ namespace, that index corresponds to a completely different interface called
|
||||
`ifName=xe1-0` (which is link `UP`).
|
||||
|
||||
Now, let me show you the interfaces in VPP:
|
||||
|
||||
```
|
||||
pim@summer:~$ vppctl show int | grep Gigabit | egrep 'Name|loop0|tap0|Gigabit'
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS)
|
||||
GigabitEthernet4/0/0 1 up 9000/0/0/0
|
||||
GigabitEthernet4/0/1 2 down 9000/0/0/0
|
||||
GigabitEthernet4/0/2 3 down 9000/0/0/0
|
||||
GigabitEthernet4/0/3 4 down 9000/0/0/0
|
||||
TenGigabitEthernet5/0/0 5 up 9216/0/0/0
|
||||
TenGigabitEthernet5/0/1 6 up 9216/0/0/0
|
||||
loop0 7 up 9216/0/0/0
|
||||
tap0 19 up 9216/0/0/0
|
||||
```
|
||||
|
||||
Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_
|
||||
(the software interface index, as opposed to hardware index). Here, `ifIndex=3` corresponds to
|
||||
`ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace!
|
||||
|
||||
It turns out that there are three (relevant) types of namespaces at play here:
|
||||
1. ***Linux network*** namespace; here using `dataplane` and `default` each with their own unique
|
||||
(and overlapping) numbering.
|
||||
1. ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP
|
||||
first attaches to or creates network interfaces like the ones from DPDK or RDMA, these will
|
||||
create an _hw_if_index_ in a list.
|
||||
1. ***VPP software*** interface namespace. All interfaces (including hardware ones!) will
|
||||
receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on
|
||||
GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available
|
||||
software index (in this example, `sw_if_index=7`).
|
||||
|
||||
In Linux CP, I can see a mapping from one to the other, just look at this:
|
||||
|
||||
```
|
||||
pim@summer:~$ vppctl show lcp
|
||||
lcp default netns dataplane
|
||||
lcp lcp-auto-subint off
|
||||
lcp lcp-sync on
|
||||
lcp lcp-sync-unnumbered on
|
||||
itf-pair: [0] loop0 tap0 loop0 2 type tap netns dataplane
|
||||
itf-pair: [1] TenGigabitEthernet5/0/0 tap1 xe1-0 3 type tap netns dataplane
|
||||
itf-pair: [2] TenGigabitEthernet5/0/1 tap2 xe1-1 4 type tap netns dataplane
|
||||
itf-pair: [3] TenGigabitEthernet5/0/0.20 tap1.20 xe1-0.20 5 type tap netns dataplane
|
||||
```
|
||||
|
||||
Those `itf-pair` describe our _LIPs_, and they have the coordinates to three things. 1) The VPP
|
||||
software interface (VPP `ifName=loop0` with `sw_if_index=7`), which 2) Linux CP will mirror into the
|
||||
Linux kernel using a TAP device (VPP `ifName=tap0` with `sw_if_index=19`). That TAP has one leg in
|
||||
VPP (`tap0`), and another in 3) Linux (with `ifName=loop0` and `ifIndex=2` in namespace `dataplane`).
|
||||
|
||||
> So the tuple that fully describes a _LIP_ is `{7, 19,'dataplane', 2}`
|
||||
|
||||
Climbing back out of that rabbit hole, I am now finally ready to explain the feature. When sFlow in
|
||||
VPP takes its sample, it will be doing this on a PHY, that is a given interface with a specific
|
||||
_hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has a
|
||||
choice: should it share with the world the representation of *its* namespace, or should it try to be
|
||||
smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the
|
||||
plugin will first resolve the _sw_if_index_ belonging to that PHY, and using that, try to look up a
|
||||
_LIP_ with it. If it finds one, it'll know both the namespace in which it lives as well as the
|
||||
osIndex in that namespace. If it doesn't find a _LIP_, it will at least have the _sw_if_index_ at
|
||||
hand, so it'll annotate the USERSOCK counter messages with this information instead.
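
Spelled out as a sketch (all names here are mine, not the actual VPP or hsflowd structures), the
lookup order is:

```python
def resolve_reported_ifindex(hw_if_index, phy_to_sw, lips, os_index_on=True):
    """PHY -> sw_if_index -> (optionally) the Linux LIP's namespace + ifIndex."""
    sw_if_index = phy_to_sw[hw_if_index]       # the PHY's software interface index
    lip = lips.get(sw_if_index)                # LIP, if Linux CP mirrors this interface
    if os_index_on and lip is not None:
        return lip["netns"], lip["os_ifindex"] # e.g. ('dataplane', 2) for loop0
    return None, sw_if_index                   # otherwise report VPP's sw_if_index
```
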
Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an
|
||||
implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases
|
||||
relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on`
|
||||
(use Linux ifIndex) or `off` (use VPP _sw_if_index_).
|
||||
|
||||
### hsflowd: Host Counters
|
||||
|
||||
Now that I understand the configuration parts of VPP and `hsflowd`, I decide to configure everything
but without enabling sFlow on any interfaces yet in VPP. Once I start the daemon, I can see that
it sends a UDP packet every 30 seconds to the configured _collector_:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n
|
||||
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
|
||||
listening on lo, link-type EN10MB (Ethernet), snapshot length 9000 bytes
|
||||
15:34:19.695042 IP 127.0.0.1.48753 > 127.0.0.1.6343: sFlowv5,
|
||||
IPv4 agent 198.19.5.16, agent-id 100000, length 716
|
||||
```
|
||||
|
||||
The `tcpdump` I have on my Debian bookworm machines doesn't know how to decode the contents of these
|
||||
sFlow packets. Actually, neither does Wireshark. I've attached a file of these mysterious packets
|
||||
[[sflow-host.pcap](/assets/sflow/sflow-host.pcap)] in case you want to take a look.
|
||||
Neil however gives me a tip. A full message decoder and otherwise handy Swiss army knife lives in
|
||||
[[sflowtool](https://github.com/sflow/sflowtool)].
|
||||
|
||||
I can offer this pcap file to `sflowtool`, or let it just listen on the UDP port directly, and
|
||||
it'll tell me what it finds:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ sflowtool -p 6343
|
||||
startDatagram =================================
|
||||
datagramSourceIP 127.0.0.1
|
||||
datagramSize 716
|
||||
unixSecondsUTC 1739112018
|
||||
localtime 2025-02-09T15:40:18+0100
|
||||
datagramVersion 5
|
||||
agentSubId 100000
|
||||
agent 198.19.5.16
|
||||
packetSequenceNo 57
|
||||
sysUpTime 987398
|
||||
samplesInPacket 1
|
||||
startSample ----------------------
|
||||
sampleType_tag 0:4
|
||||
sampleType COUNTERSSAMPLE
|
||||
sampleSequenceNo 33
|
||||
sourceId 2:1
|
||||
counterBlock_tag 0:2001
|
||||
adaptor_0_ifIndex 2
|
||||
adaptor_0_MACs 1
|
||||
adaptor_0_MAC_0 525400f00100
|
||||
counterBlock_tag 0:2010
|
||||
udpInDatagrams 123904
|
||||
udpNoPorts 23132459
|
||||
udpInErrors 0
|
||||
udpOutDatagrams 46480629
|
||||
udpRcvbufErrors 0
|
||||
udpSndbufErrors 0
|
||||
udpInCsumErrors 0
|
||||
counterBlock_tag 0:2009
|
||||
tcpRtoAlgorithm 1
|
||||
tcpRtoMin 200
|
||||
tcpRtoMax 120000
|
||||
tcpMaxConn 4294967295
|
||||
tcpActiveOpens 0
|
||||
tcpPassiveOpens 30
|
||||
tcpAttemptFails 0
|
||||
tcpEstabResets 0
|
||||
tcpCurrEstab 1
|
||||
tcpInSegs 89120
|
||||
tcpOutSegs 86961
|
||||
tcpRetransSegs 59
|
||||
tcpInErrs 0
|
||||
tcpOutRsts 4
|
||||
tcpInCsumErrors 0
|
||||
counterBlock_tag 0:2008
|
||||
icmpInMsgs 23129314
|
||||
icmpInErrors 32
|
||||
icmpInDestUnreachs 0
|
||||
icmpInTimeExcds 23129282
|
||||
icmpInParamProbs 0
|
||||
icmpInSrcQuenchs 0
|
||||
icmpInRedirects 0
|
||||
icmpInEchos 0
|
||||
icmpInEchoReps 32
|
||||
icmpInTimestamps 0
|
||||
icmpInAddrMasks 0
|
||||
icmpInAddrMaskReps 0
|
||||
icmpOutMsgs 0
|
||||
icmpOutErrors 0
|
||||
icmpOutDestUnreachs 23132467
|
||||
icmpOutTimeExcds 0
|
||||
icmpOutParamProbs 23132467
|
||||
icmpOutSrcQuenchs 0
|
||||
icmpOutRedirects 0
|
||||
icmpOutEchos 0
|
||||
icmpOutEchoReps 0
|
||||
icmpOutTimestamps 0
|
||||
icmpOutTimestampReps 0
|
||||
icmpOutAddrMasks 0
|
||||
icmpOutAddrMaskReps 0
|
||||
counterBlock_tag 0:2007
|
||||
ipForwarding 2
|
||||
ipDefaultTTL 64
|
||||
ipInReceives 46590552
|
||||
ipInHdrErrors 0
|
||||
ipInAddrErrors 0
|
||||
ipForwDatagrams 0
|
||||
ipInUnknownProtos 0
|
||||
ipInDiscards 0
|
||||
ipInDelivers 46402357
|
||||
ipOutRequests 69613096
|
||||
ipOutDiscards 0
|
||||
ipOutNoRoutes 80
|
||||
ipReasmTimeout 0
|
||||
ipReasmReqds 0
|
||||
ipReasmOKs 0
|
||||
ipReasmFails 0
|
||||
ipFragOKs 0
|
||||
ipFragFails 0
|
||||
ipFragCreates 0
|
||||
counterBlock_tag 0:2005
|
||||
disk_total 6253608960
|
||||
disk_free 2719039488
|
||||
disk_partition_max_used 56.52
|
||||
disk_reads 11512
|
||||
disk_bytes_read 626214912
|
||||
disk_read_time 48469
|
||||
disk_writes 1058955
|
||||
disk_bytes_written 8924332032
|
||||
disk_write_time 7954804
|
||||
counterBlock_tag 0:2004
|
||||
mem_total 8326963200
|
||||
mem_free 5063872512
|
||||
mem_shared 0
|
||||
mem_buffers 86425600
|
||||
mem_cached 827752448
|
||||
swap_total 0
|
||||
swap_free 0
|
||||
page_in 306365
|
||||
page_out 4357584
|
||||
swap_in 0
|
||||
swap_out 0
|
||||
counterBlock_tag 0:2003
|
||||
cpu_load_one 0.030
|
||||
cpu_load_five 0.050
|
||||
cpu_load_fifteen 0.040
|
||||
cpu_proc_run 1
|
||||
cpu_proc_total 138
|
||||
cpu_num 2
|
||||
cpu_speed 1699
|
||||
cpu_uptime 1699306
|
||||
cpu_user 64269210
|
||||
cpu_nice 1810
|
||||
cpu_system 34690140
|
||||
cpu_idle 3234293560
|
||||
cpu_wio 3568580
|
||||
cpuintr 0
|
||||
cpu_sintr 5687680
|
||||
cpuinterrupts 1596621688
|
||||
cpu_contexts 3246142972
|
||||
cpu_steal 329520
|
||||
cpu_guest 0
|
||||
cpu_guest_nice 0
|
||||
counterBlock_tag 0:2006
|
||||
nio_bytes_in 250283
|
||||
nio_pkts_in 2931
|
||||
nio_errs_in 0
|
||||
nio_drops_in 0
|
||||
nio_bytes_out 370244
|
||||
nio_pkts_out 1640
|
||||
nio_errs_out 0
|
||||
nio_drops_out 0
|
||||
counterBlock_tag 0:2000
|
||||
hostname vpp0-0
|
||||
UUID ec933791-d6af-7a93-3b8d-aab1a46d6faa
|
||||
machine_type 3
|
||||
os_name 2
|
||||
os_release 6.1.0-26-amd64
|
||||
endSample ----------------------
|
||||
endDatagram =================================
|
||||
```
|
||||
|
||||
If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might
|
||||
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
|
||||
this form of heartbeat from the _agent_. There are a lot of vital signs in this packet, including some
|
||||
non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel version
|
||||
information. It's super dope!
|
||||
|
||||
### hsflowd: Interface Counters
|
||||
|
||||
Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to
1-in-100M (so that packet samples are exceedingly rare), and the interface polling-interval to every 10 seconds. And indeed,
|
||||
every ten seconds or so I get a few packets, which I captured in
|
||||
[[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Most of the packets contain only one
|
||||
counter record, while some contain more than one (in the PCAP, packet #9 has two). If I update the
|
||||
polling-interval to every second, I can see that most of the packets have all four counters.
|
||||
|
||||
Those interface counters, as decoded by `sflowtool`, look like this:
|
||||
|
||||
```
|
||||
pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \
|
||||
awk '/startSample/ { on=1 } { if (on) { print $0 } } /endSample/ { on=0 }'
|
||||
startSample ----------------------
|
||||
sampleType_tag 0:4
|
||||
sampleType COUNTERSSAMPLE
|
||||
sampleSequenceNo 745
|
||||
sourceId 0:3
|
||||
counterBlock_tag 0:1005
|
||||
ifName GigabitEthernet10/0/2
|
||||
counterBlock_tag 0:1
|
||||
ifIndex 3
|
||||
networkType 6
|
||||
ifSpeed 0
|
||||
ifDirection 1
|
||||
ifStatus 3
|
||||
ifInOctets 858282015
|
||||
ifInUcastPkts 780540
|
||||
ifInMulticastPkts 0
|
||||
ifInBroadcastPkts 0
|
||||
ifInDiscards 0
|
||||
ifInErrors 0
|
||||
ifInUnknownProtos 0
|
||||
ifOutOctets 1246716016
|
||||
ifOutUcastPkts 975772
|
||||
ifOutMulticastPkts 0
|
||||
ifOutBroadcastPkts 0
|
||||
ifOutDiscards 127
|
||||
ifOutErrors 28
|
||||
ifPromiscuousMode 0
|
||||
endSample ----------------------
|
||||
```
|
||||
|
||||
What I find particularly cool about it is that sFlow provides an automatic mapping between
`ifName=GigabitEthernet10/0/2` (tag 0:1005) and an object (tag 0:1) which contains the
`ifIndex=3`, together with lots of packet and octet counters in both the ingress and egress direction. This is
super useful for upstream _collectors_, as they can now find the hostname, agent name and address,
and the correlation between interface names and their indexes. Noice!
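
If you'd rather consume these in a script than read them by eye, a quick-and-dirty approach
(assuming `sflowtool` is installed and produces the tag/value output shown above) is to scrape its
stdout:

```python
import subprocess

# Run sflowtool as a collector on UDP port 6343 and pick out a few fields
# from its tag/value output (field names as seen in the dump above).
proc = subprocess.Popen(["sflowtool", "-p", "6343"],
                        stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    key, _, value = line.strip().partition(" ")
    if key in ("ifName", "ifInOctets", "ifOutOctets", "ifOutDiscards"):
        print(key, value)
```
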
#### hsflowd: Packet Samples
|
||||
|
||||
Now it's time to ratchet up the packet sampling, so I move it from 1:100M to 1:1000, while keeping
|
||||
the interface polling-interval at 10 seconds and I ask VPP to sample 64 bytes of each packet that it
|
||||
inspects. On either side of my pet VPP instance, I start an `iperf3` run to generate some traffic. I
|
||||
now see a healthy stream of sFlow packets coming in on port 6343. Every 30 seconds or so they still
contain a host counter record, and every 10 seconds a set of interface counters comes by, but mostly
these UDP packets are showing me packet samples. I've captured a few minutes of these in
[[sflow-all.pcap](/assets/sflow/sflow-all.pcap)].
Although Wireshark doesn't know how to interpret the sFlow counter messages, it _does_ know how to
interpret the sFlow sample messages, and it reveals one of them like this:
|
||||
|
||||
{{< image width="100%" src="/assets/sflow/sflow-wireshark.png" alt="sFlow Wireshark" >}}
|
||||
|
||||
Let me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1:48753
|
||||
to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as
|
||||
having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to
|
||||
send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It
|
||||
then shows that sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those
|
||||
are sampled. Finally, the first sampled packet starts at the blue line. It shows the SrcMAC and
|
||||
DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my running
|
||||
`iperf3`, booyah!
|
||||
|
||||
### VPP: sFlow Performance
|
||||
|
||||
{{< image float="right" src="/assets/sflow/sflow-lab.png" alt="sFlow Lab" width="20em" >}}
|
||||
|
||||
One question I get a lot about this plugin is: what is the performance impact when using
|
||||
sFlow? I spent a considerable amount of time tinkering with this and, together with Neil, brought
the plugin to what we both agree is the most efficient use of CPU. We could have gone a bit further,
|
||||
but that would require somewhat intrusive changes to VPP's internals and as _North of the Border_
|
||||
(and the Simpsons!) would say: what we have isn't just good, it's good enough!
|
||||
|
||||
I've built a small testbed based on two Dell R730 machines. On the left, I have a Debian machine
|
||||
running Cisco T-Rex using four quad-tengig network cards, the classic Intel X710-DA4. On the right,
|
||||
I have my VPP machine called _Hippo_ (because it's always hungry for packets), with the same
|
||||
hardware. I'll build two halves. On the top NIC (Te3/0/0-3 in VPP), I will install IPv4 and MPLS
|
||||
forwarding on the purple circuit, and a simple Layer2 cross connect on the cyan circuit. On all four
|
||||
interfaces, I will enable sFlow. Then, I will mirror this configuration on the bottom NIC
|
||||
(Te130/0/0-3) in the red and green circuits, for which I will leave sFlow turned off.
|
||||
|
||||
To help you reproduce my results, and under the assumption that this is your jam, here's the
|
||||
configuration for all of the kit:
|
||||
|
||||
***0. Cisco T-Rex***
|
||||
```
|
||||
pim@trex:~ $ cat /srv/trex/8x10.yaml
|
||||
- version: 2
|
||||
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
|
||||
port_info:
|
||||
- src_mac: 00:1b:21:06:00:00
|
||||
dest_mac: 9c:69:b4:61:a1:dc # Connected to Hippo Te3/0/0, purple
|
||||
- src_mac: 00:1b:21:06:00:01
|
||||
dest_mac: 9c:69:b4:61:a1:dd # Connected to Hippo Te3/0/1, purple
|
||||
- src_mac: 00:1b:21:83:00:00
|
||||
dest_mac: 00:1b:21:83:00:01 # L2XC via Hippo Te3/0/2, cyan
|
||||
- src_mac: 00:1b:21:83:00:01
|
||||
dest_mac: 00:1b:21:83:00:00 # L2XC via Hippo Te3/0/3, cyan
|
||||
|
||||
- src_mac: 00:1b:21:87:00:00
|
||||
dest_mac: 9c:69:b4:61:75:d0 # Connected to Hippo Te130/0/0, red
|
||||
- src_mac: 00:1b:21:87:00:01
|
||||
dest_mac: 9c:69:b4:61:75:d1 # Connected to Hippo Te130/0/1, red
|
||||
- src_mac: 9c:69:b4:85:00:00
|
||||
dest_mac: 9c:69:b4:85:00:01 # L2XC via Hippo Te130/0/2, green
|
||||
- src_mac: 9c:69:b4:85:00:01
|
||||
dest_mac: 9c:69:b4:85:00:00 # L2XC via Hippo Te130/0/3, green
|
||||
pim@trex:~ $ sudo t-rex-64 -i -c 4 --cfg /srv/trex/8x10.yaml
|
||||
```
|
||||
|
||||
When constructing the T-Rex configuration, I specifically set the destination MAC address for L3
|
||||
circuits (the purple and red ones) using Hippo's interface MAC address, which I can find with
|
||||
`vppctl show hardware-interfaces`. This way, T-Rex does not have to ARP for the VPP endpoint. On
|
||||
L2XC circuits (the cyan and green ones), VPP does not concern itself with the MAC addressing at
|
||||
all. It puts its interface in _promiscuous_ mode, and simply writes out any ethernet frame received,
|
||||
directly to the egress interface.
|
||||
|
||||
***1. IPv4***
|
||||
```
|
||||
hippo# set int state TenGigabitEthernet3/0/0 up
|
||||
hippo# set int state TenGigabitEthernet3/0/1 up
|
||||
hippo# set int state TenGigabitEthernet130/0/0 up
|
||||
hippo# set int state TenGigabitEthernet130/0/1 up
|
||||
hippo# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
|
||||
hippo# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
|
||||
hippo# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
|
||||
hippo# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
|
||||
hippo# ip route add 16.0.0.0/24 via 100.64.0.0
|
||||
hippo# ip route add 48.0.0.0/24 via 100.64.1.0
|
||||
hippo# ip route add 16.0.2.0/24 via 100.64.4.0
|
||||
hippo# ip route add 48.0.2.0/24 via 100.64.5.0
|
||||
hippo# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
|
||||
hippo# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
|
||||
hippo# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
|
||||
hippo# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
|
||||
```
|
||||
|
||||
By the way, one note on this last piece: I'm setting static IPv4 neighbors so that neither Cisco T-Rex nor VPP has to use ARP to resolve the other. You'll see above that the T-Rex
|
||||
configuration also uses MAC addresses exclusively. Setting the `ip neighbor` like this allows VPP
|
||||
to know where to send return traffic.
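Before sending traffic, I can double-check that these static entries and routes took hold with the usual VPP show commands; a quick sketch (output omitted here):

```
hippo# show ip neighbors
hippo# show ip fib 16.0.0.0/24
hippo# show ip fib 48.0.0.0/24
```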
|
||||
|
||||
***2. MPLS***
|
||||
```
|
||||
hippo# mpls table add 0
|
||||
hippo# set interface mpls TenGigabitEthernet3/0/0 enable
|
||||
hippo# set interface mpls TenGigabitEthernet3/0/1 enable
|
||||
hippo# set interface mpls TenGigabitEthernet130/0/0 enable
|
||||
hippo# set interface mpls TenGigabitEthernet130/0/1 enable
|
||||
hippo# mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
|
||||
hippo# mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
|
||||
hippo# mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
|
||||
hippo# mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
|
||||
```
|
||||
|
||||
Here, the MPLS configuration implements a simple P-router, where incoming MPLS packets with label 16
|
||||
will be sent back to T-Rex on Te3/0/1 to the specified IPv4 nexthop (for which I already know the
|
||||
MAC address), and with label 16 removed and new label 17 imposed, in other words a SWAP operation.
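To check that the label programming looks right, I can dump the MPLS FIB and confirm that the entries for labels 16, 17, 20 and 21 show the swap and point at the expected IPv4 nexthops; a quick sketch (output omitted):

```
hippo# show mpls fib
```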
|
||||
|
||||
***3. L2XC***
|
||||
```
|
||||
hippo# set int state TenGigabitEthernet3/0/2 up
|
||||
hippo# set int state TenGigabitEthernet3/0/3 up
|
||||
hippo# set int state TenGigabitEthernet130/0/2 up
|
||||
hippo# set int state TenGigabitEthernet130/0/3 up
|
||||
hippo# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
|
||||
hippo# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
|
||||
hippo# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
|
||||
hippo# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
|
||||
```
|
||||
|
||||
I've added a layer2 cross connect as well because it's computationally very cheap for VPP to receive
|
||||
an L2 (ethernet) datagram, and immediately transmit it on another interface. There's no FIB lookup
|
||||
and not even an L2 nexthop lookup involved; VPP is just shoveling ethernet packets in and out as fast as it can!
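For completeness: sFlow itself is switched on per interface, with a global sampling rate. I'm sketching the plugin CLI from memory here, so treat the exact command names below as an assumption and consult the sFlow plugin documentation for your VPP release:

```
hippo# sflow sampling-rate 10000
hippo# sflow enable TenGigabitEthernet3/0/0
hippo# sflow enable TenGigabitEthernet3/0/1
hippo# sflow enable TenGigabitEthernet3/0/2
hippo# sflow enable TenGigabitEthernet3/0/3
```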
|
||||
|
||||
Here's what a loadtest looks like when sending 80Gbps of 192b packets on all eight interfaces:
|
||||
|
||||
{{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}}
|
||||
|
||||
The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p2 <-> p3 are sending ethernet back and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These four ports are my experiment, to show the CPU use of sFlow. Then, ports p4 <-> p5 and p6 <-> p7 have the same configuration, but with sFlow turned off. They are my control, showing the CPU use without sFlow.
|
||||
|
||||
**First conclusion**: This stuff works a treat. There is absolutely no impact on throughput at 80Gbps with 47.6Mpps, either _with_ or _without_ sFlow turned on. That's wonderful news, as it shows
|
||||
that the dataplane has more CPU available than is needed for any combination of functionality.
|
||||
|
||||
But what _is_ the limit? To find out, I'll take a deeper look at the runtime statistics, comparing the CPU time spent and the maximum throughput achievable on a single VPP worker, thus using a single CPU thread on this Hippo machine that has 44 cores and 44 hyperthreads. I switch the loadtester to emit 64 byte ethernet packets, the smallest I'm allowed to send.
|
||||
|
||||
| Loadtest | no sFlow | 1:1'000'000 | 1:10'000 | 1:1'000 | 1:100 |
|
||||
|-------------|-----------|-----------|-----------|-----------|-----------|
|
||||
| L2XC | 14.88Mpps | 14.32Mpps | 14.31Mpps | 14.27Mpps | 14.15Mpps |
|
||||
| IPv4 | 10.89Mpps | 9.88Mpps | 9.88Mpps | 9.84Mpps | 9.73Mpps |
|
||||
| MPLS | 10.11Mpps | 9.52Mpps | 9.52Mpps | 9.51Mpps | 9.45Mpps |
|
||||
| ***sFlow Packets*** / 10sec | N/A | 337.42M total | 337.39M total | 336.48M total | 333.64M total |
|
||||
| .. Sampled | | 328 | 33.8k | 336k | 3.34M |
|
||||
| .. Sent | | 328 | 33.8k | 336k | 1.53M |
|
||||
| .. Dropped | | 0 | 0 | 0 | 1.81M |
|
||||
|
||||
Here I can make a few important observations.
|
||||
|
||||
**Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned off, which
|
||||
implies that it has a little bit of CPU left over to do other work, if needed. With IPv4, I can see
|
||||
that the throughput is actually CPU limited: 10.89Mpps can be handled by one worker (thus, one CPU thread). I
|
||||
know that MPLS is a little bit more expensive computationally than IPv4, and that checks out. The
|
||||
total capacity is 10.11Mpps for one worker, when sFlow is turned off.
|
||||
|
||||
**Overhead**: When I turn on sFlow on the interface, VPP will insert the _sflow-node_ into the
|
||||
forwarding graph between `device-input` and `ethernet-input`. It means that the sFlow node will see
|
||||
_every single_ packet, and it will have to move all of these into the next node, which costs about
|
||||
9.5 CPU cycles per packet. The regression on L2XC is 3.8%, although I should note that VPP was not CPU bound on L2XC, so it could draw on CPU cycles that were still available before throughput regressed. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS, just to shuffle the packets through the graph.
|
||||
|
||||
**Sampling Cost**: When sampling at higher rates, the further regression is not _that_ terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%. The regressions for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8% (MPLS).
|
||||
Of course, by using multiple hardware receive queues and multiple RX workers per interface, the cost
|
||||
can be kept well in hand.
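For reference, spreading the load over multiple receive queues and workers is done in VPP's `startup.conf`; a minimal sketch, where the PCI address and core counts are placeholders rather than Hippo's actual configuration:

```
cpu {
  main-core 0
  workers 4
}
dpdk {
  dev 0000:01:00.0 {
    num-rx-queues 4
  }
}
```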
|
||||
|
||||
**Overload Protection**: At 1:1'000 and an effective rate of 33.65Mpps across all ports, I correctly
|
||||
observe 336k samples taken, and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, but
|
||||
they do not fit through the FIFO, so the plugin drops samples to protect the downstream `sflow-main` thread and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M
|
||||
samples made it through. By the way, this means VPP is happily sending a whopping 153K samples/sec
|
||||
to the collector!
|
||||
|
||||
## What's Next
|
||||
|
||||
Now that I've seen the UDP packets from our agent to a collector on the wire, and also how
|
||||
incredibly efficient the sFlow sampling implementation turned out, I'm super motivated to
|
||||
continue the journey with higher level collector receivers like ntopng, sflow-rt or Akvorado. In an
|
||||
upcoming article, I'll describe how I rolled out Akvorado at IPng, and what types of changes would
|
||||
make the user experience even better (or simpler to understand, at least).
|
||||
|
||||
### Acknowledgements
|
||||
|
||||
I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
|
||||
finer details such as logging, error handling, API specifications, and documentation. He has been a
|
||||
true pleasure to work with and learn from. Also, thank you to the VPP developer community, notably
|
||||
Benoit, Florin, Damjan, Dave and Matt, for helping with the review and getting this thing merged in
|
||||
time for the 25.02 release.
|
793
content/articles/2025-04-09-frysix-evpn.md
Normal file
@ -0,0 +1,793 @@
|
||||
---
|
||||
date: "2025-04-09T07:51:23Z"
|
||||
title: 'FrysIX eVPN: think different'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/frys-ix/frysix-logo-small.png" alt="FrysIX Logo" width="12em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
Somewhere in the far north of the Netherlands, the country where I was born, a town called Jubbega
|
||||
is the home of the Frysian Internet Exchange called [[Frys-IX](https://frys-ix.net/)]. Back in 2021,
|
||||
a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHEF facility, one of
|
||||
the most densely populated facilities in western Europe. He was looking for a few launching
|
||||
customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on
|
||||
my [[bucketlist]({{< ref 2021-07-26-bucketlist.md >}})]. Arend and his IT company
|
||||
[[ERITAP](https://www.eritap.com/)] took delivery of that rack in May of 2021, and this is when the
|
||||
internet exchange with _Frysian roots_ was born.
|
||||
|
||||
In the years from 2021 until now, Arend and I have been operating the exchange with reasonable
|
||||
success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs
|
||||
with about ten switches in six datacenters across the Amsterdam metro area. It's shifting a cool
|
||||
800Gbit of traffic or so. It's dope, and very rewarding to be able to contribute to this community!
|
||||
|
||||
## Frys-IX is growing
|
||||
|
||||
We have several members with a 2x100G LAG and even though all inter-datacenter links are either dark
|
||||
fiber or WDM, we're starting to feel the growing pains as we set our sights on the next step of growth.
|
||||
You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. When it did 133.7Gbit of
|
||||
traffic, Arend organized an even bigger barbecue. Obviously, the next step is 1337Gbit and joining
|
||||
the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're on our
|
||||
way!
|
||||
|
||||
It became clear that we will not be able to keep a dependable peering platform if FrysIX remains a
|
||||
single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be
|
||||
operationally expensive (think of all the dark fiber or WDM waves!), and brittle (think of LACP and
|
||||
balancing traffic over those ports). We need to modernize in order to stay ahead of the growth
|
||||
curve.
|
||||
|
||||
## Hello Nokia
|
||||
|
||||
{{< image float="right" src="/assets/frys-ix/nokia-7220-d4.png" alt="Nokia 7220-D4" width="20em" >}}
|
||||
|
||||
The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration,
|
||||
high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity
|
||||
to your data center networks and peering network environments. These devices are built around the
|
||||
Broadcom _Trident_ chipset, in the case of the "D4" platform, this is a Trident4 with 28x100G and
|
||||
8x400G ports. Whoot!
|
||||
|
||||
{{< image float="right" src="/assets/frys-ix/IXR-7220-D3.jpg" alt="Nokia 7220-D3" width="20em" >}}
|
||||
|
||||
What I find particularly awesome about the Trident series is the speed (total bandwidth of
|
||||
12.8Tbps _per router_), low power use (without optics, the IXR-7220-D4 consumes about 150W) and
|
||||
a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern
|
||||
approaches to scale-out networking such as VXLAN based EVPN. At the FrysIX barbecue in September of
|
||||
2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right.
|
||||
That's a 32x100G router.
|
||||
|
||||
ERITAP has bought two (new in box) IXR-7220-D4 (8x400G,28x100G) routers, and has also acquired two
|
||||
IXR-7220-D2 (48x25G,8x100G) routers. So in total, FrysIX is now the proud owner of five of these
|
||||
beautiful Nokia devices. If you haven't yet, you should definitely read about these versatile
|
||||
routers on the [[Nokia](https://onestore.nokia.com/asset/207599)] website, and some details of the
|
||||
_merchant silicon_ switch chips in use on the
|
||||
[[Broadcom](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56880-series)]
|
||||
website.
|
||||
|
||||
### eVPN: A small rant
|
||||
|
||||
{{< image float="right" src="/assets/frys-ix/FrysIX_ Topology (concept).svg" alt="Topology Concept" width="50%" >}}
|
||||
|
||||
First, I need to get something off my chest. Consider a topology for an internet exchange platform,
|
||||
taking into account the available equipment, rackspace, power, and cross connects. Somehow, almost
|
||||
every design or reference architecture I can find on the Internet assumes folks want to build a [[Clos network](https://en.wikipedia.org/wiki/Clos_network)], which has a topology consisting of leaf and spine switches. The _spine_ switches have a different set of features than the _leaf_ ones; notably, they don't have to do provider edge functionality like VXLAN encapsulation and decapsulation. Almost all of these designs show how one might build a leaf-spine network for hyperscale.
|
||||
|
||||
**Critique 1**: my 'spine' (IXR-7220-D4 routers) must also be provider edge. Practically speaking,
|
||||
in the picture above I have these beautiful Nokia IXR-7220-D4 routers, using two 400G ports to
|
||||
connect between the facilities, and six 100G ports to connect the smaller breakout switches. That
|
||||
would leave a _massive_ amount of capacity unused: 22x 100G and 6x400G ports, to be exact.
|
||||
|
||||
**Critique 2**: all 'leaf' (either IXR-7220-D2 routers or Arista switches) can't realistically
|
||||
connect to both 'spines'. Our devices are spread out over two (and in practice, more like six)
|
||||
datacenters, and it's prohibitively expensive to get 100G waves or dark fiber to create a full mesh.
|
||||
It's much more economical to create a star-topology that minimizes cross-datacenter fiber spans.
|
||||
|
||||
**Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway
|
||||
protocol is eBGP in what they call the _underlay_, and on top of that, some secondary eBGP that's
|
||||
called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume
|
||||
hundreds of switches, in which case making use of one AS number per switch could make sense, as iBGP
|
||||
needs either a 'full mesh', or external route reflectors.
|
||||
|
||||
**Critique 4**: These reference designs also make an assumption that all fiber is local and while
|
||||
optics and links can fail, it will be relatively rare to _drain_ a link. However, in
|
||||
cross-datacenter networks, draining links for maintenance is very common, for example if the dark
|
||||
fiber provider needs to perform repairs on a span that was damaged. With these eBGP-over-eBGP
|
||||
connections, traffic engineering is more difficult than simply raising the OSPF (or IS-IS) cost of a
|
||||
link, to reroute traffic.
|
||||
|
||||
Setting aside eVPN for a second, if I were to build an IP transport network, like I did when I built
|
||||
[[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a much more intuitive
|
||||
and simple (I would even dare say elegant) design:
|
||||
|
||||
1. Take a classic IGP like [[OSPF](https://en.wikipedia.org/wiki/Open_Shortest_Path_First)], or
|
||||
perhaps [[IS-IS](https://en.wikipedia.org/wiki/IS-IS)]. There is no benefit, to me at least, to use
|
||||
BGP as an IGP.
|
||||
1. I would give each of the links between the switches an IPv4 /31 and enable link-local, and give
|
||||
each switch a loopback address with a /32 IPv4 and a /128 IPv6.
|
||||
1. If I had multiple links between two given switches, I would probably just use ECMP if my devices
|
||||
supported it, and fall back to a LACP signaled bundle-ethernet otherwise.
|
||||
1. If I were to need to use BGP (and for eVPN, this need exists), taking the ISP mindset (as opposed
|
||||
to the datacenter fabric mindset), I would simply install iBGP against two or three route
|
||||
reflectors, and exchange routing information within the same single AS number.
|
||||
|
||||
### eVPN: A demo topology
|
||||
|
||||
{{< image float="right" src="/assets/frys-ix/Nokia Arista VXLAN.svg" alt="Demo topology" width="50%" >}}
|
||||
|
||||
So, that's exactly how I'm going to approach the FrysIX eVPN design: OSPF for the underlay and iBGP
|
||||
for the overlay! I have a feeling that some folks will despise me for being contrarian, but you can
|
||||
leave your comments below, and don't forget to like-and-subscribe :-)
|
||||
|
||||
Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two
|
||||
400G-capable routers and connects them. Then he takes an Arista DCS-7060CX switch, which is eVPN
|
||||
capable, with 32x100G ports, based on the Broadcom Tomahawk chipset, and a smaller Nokia
|
||||
IXR-7220-D2 with 48x25G and 8x100G ports, based on the Trident3 chipset. He wires all of this up
|
||||
to look like the picture on the right.
|
||||
|
||||
#### Underlay: Nokia's SR Linux
|
||||
|
||||
We boot up the equipment, verify that all the optics and links are up, and connect the management
|
||||
ports to an OOB network that I can remotely log in to. This is the first time that either of us has worked on Nokia gear, but I find it reasonably intuitive once I get a few tips and tricks from Niek.
|
||||
|
||||
```
|
||||
[pim@nikhef ~]$ sr_cli
|
||||
--{ running }--[ ]--
|
||||
A:pim@nikhef# enter candidate
|
||||
--{ candidate shared default }--[ ]--
|
||||
A:pim@nikhef# set / interface lo0 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 admin-state enable
|
||||
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
There, my first config snippet! This creates a _loopback_ interface, and similar to JunOS, a
|
||||
_subinterface_ (which Juniper calls a _unit_) that enables IPv4 and gives it a /32 address. In SR
|
||||
Linux, any interface has to be associated with a _network-instance_, think of those as routing
|
||||
domains or VRFs. There's a conveniently named _default_ network-instance, which I'll add this and
|
||||
the point-to-point interface between the two 400G routers to:
|
||||
|
||||
```
|
||||
A:pim@nikhef# info flat interface ethernet-1/29
|
||||
set / interface ethernet-1/29 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
|
||||
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
|
||||
|
||||
A:pim@nikhef# set / network-instance default type default
|
||||
A:pim@nikhef# set / network-instance default admin-state enable
|
||||
A:pim@nikhef# set / network-instance default interface ethernet-1/29.0
|
||||
A:pim@nikhef# set / network-instance default interface lo0.0
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
Cool. Assuming I now also do this on the other IXR-7220-D4 router, called _equinix_ (which gets the
|
||||
loopback address 198.19.16.0/32 and the point-to-point on the 400G interface of 198.19.17.0/31), I
|
||||
should be able to do my first jumboframe ping:
|
||||
|
||||
```
|
||||
A:pim@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do
|
||||
Using network instance default
|
||||
PING 198.19.17.1 (198.19.17.1) 9162(9190) bytes of data.
|
||||
9170 bytes from 198.19.17.1: icmp_seq=1 ttl=64 time=0.466 ms
|
||||
9170 bytes from 198.19.17.1: icmp_seq=2 ttl=64 time=0.477 ms
|
||||
9170 bytes from 198.19.17.1: icmp_seq=3 ttl=64 time=0.547 ms
|
||||
```
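For reference, the mirrored configuration on _equinix_ uses the same commands with the addresses swapped; a sketch showing only the differing values (admin-state enables omitted, and I'm assuming the 400G port carries the same `ethernet-1/29` name on that side):

```
A:pim@equinix# set / interface lo0 subinterface 0 ipv4 address 198.19.16.0/32
A:pim@equinix# set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
A:pim@equinix# set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.0/31
A:pim@equinix# set / network-instance default interface ethernet-1/29.0
A:pim@equinix# set / network-instance default interface lo0.0
A:pim@equinix# commit stay
```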
|
||||
|
||||
#### Underlay: SR Linux OSPF
|
||||
|
||||
OK, let's get these two Nokia routers to speak OSPF, so that they can reach each other's loopback.
|
||||
It's really easy:
|
||||
|
||||
```
|
||||
A:pim@nikhef# / network-instance default protocols ospf instance default
|
||||
--{ candidate shared default }--[ network-instance default protocols ospf instance default ]--
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set version ospf-v2
|
||||
A:pim@nikhef# set router-id 198.19.16.1
|
||||
A:pim@nikhef# set area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
|
||||
A:pim@nikhef# set area 0.0.0.0 interface lo0.0 passive true
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
Similar to JunOS, I can descend into a configuration scope: the first line goes into the _network-instance_ called `default`, then the _protocols_ called `ospf`, and then the _instance_ called `default`. Subsequent `set` commands operate at this scope. Once I commit this configuration (on the _nikhef_ router and also on the _equinix_ router, with its own unique router-id), OSPF quickly springs into action:
|
||||
|
||||
```
|
||||
A:pim@nikhef# show network-instance default protocols ospf neighbor
|
||||
=========================================================================================
|
||||
Net-Inst default OSPFv2 Instance default Neighbors
|
||||
=========================================================================================
|
||||
+---------------------------------------------------------------------------------------+
|
||||
| Interface-Name Rtr Id State Pri RetxQ Time Before Dead |
|
||||
+=======================================================================================+
|
||||
| ethernet-1/29.0 198.19.16.0 full 1 0 36 |
|
||||
+---------------------------------------------------------------------------------------+
|
||||
-----------------------------------------------------------------------------------------
|
||||
No. of Neighbors: 1
|
||||
=========================================================================================
|
||||
|
||||
A:pim@nikhef# show network-instance default route-table all | more
|
||||
IPv4 unicast route table of network instance default
|
||||
+------------------+-----+------------+--------------+--------+----------+--------+------+-------------+-----------------+
|
||||
| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop | Next-hop |
|
||||
| | | | | | Network | | | (Type) | Interface |
|
||||
| | | | | | Instance | | | | |
|
||||
+==================+=====+============+==============+========+==========+========+======+=============+=================+
|
||||
| 198.19.16.0/32 | 0 | ospfv2 | ospf_mgr | True | default | 1 | 10 | 198.19.17.0 | ethernet-1/29.0 |
|
||||
| | | | | | | | | (direct) | |
|
||||
| 198.19.16.1/32 | 7 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
|
||||
| 198.19.17.0/31 | 6 | local | net_inst_mgr | True | default | 0 | 0 | 198.19.17.1 | ethernet-1/29.0 |
|
||||
| | | | | | | | | (direct) | |
|
||||
| 198.19.17.1/32 | 6 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
|
||||
+==================+=====+============+==============+========+==========+========+======+=============+=================+
|
||||
|
||||
A:pim@nikhef# ping network-instance default 198.19.16.0
|
||||
Using network instance default
|
||||
PING 198.19.16.0 (198.19.16.0) 56(84) bytes of data.
|
||||
64 bytes from 198.19.16.0: icmp_seq=1 ttl=64 time=0.484 ms
|
||||
64 bytes from 198.19.16.0: icmp_seq=2 ttl=64 time=0.663 ms
|
||||
```
|
||||
|
||||
Delicious! OSPF has learned the loopback, and it is now reachable. As with most things, going from 0 to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. Going from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on, going from 2 to N is easy. In my case, enabling several other point-to-point /31 transit networks on the _nikhef_ router, using `ethernet-1/1.0` through `ethernet-1/4.0` with the correct MTU and turning on OSPF for these, makes the whole network spring to life. Slick!
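As an example, one of those additional transit interfaces on _nikhef_ looks just like the 400G one; a sketch, where the /31 and the interface-to-neighbor mapping are illustrative assumptions rather than the real addressing plan:

```
A:pim@nikhef# set / interface ethernet-1/1 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ip-mtu 9190
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ipv4 admin-state enable
A:pim@nikhef# set / interface ethernet-1/1 subinterface 0 ipv4 address 198.19.17.2/31
A:pim@nikhef# set / network-instance default interface ethernet-1/1.0
A:pim@nikhef# set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/1.0 interface-type point-to-point
A:pim@nikhef# commit stay
```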
|
||||
|
||||
#### Underlay: Arista
|
||||
|
||||
I'll point out that one of the devices in this topology is an Arista. We have several of these ready
|
||||
for deployment at FrysIX. They are a lot more affordable and easy to find on the second hand /
|
||||
refurbished market. These switches come with 32x100G ports, and are really good at packet slinging
|
||||
because they're based on the Broadcom _Tomahawk_ chipset. They pack a few less features than the
|
||||
_Trident_ chipset that powers the Nokia, but they happen to have all the features we need to run our
|
||||
internet exchange. So I turn my attention to the Arista in the topology. I am much more
|
||||
comfortable configuring the whole thing here, as it's not my first time touching these devices:
|
||||
|
||||
```
|
||||
arista-leaf#show run int loop0
|
||||
interface Loopback0
|
||||
ip address 198.19.16.2/32
|
||||
ip ospf area 0.0.0.0
|
||||
arista-leaf#show run int Ethernet32/1
|
||||
interface Ethernet32/1
|
||||
description Core: Connected to nikhef:ethernet-1/2
|
||||
load-interval 1
|
||||
mtu 9190
|
||||
no switchport
|
||||
ip address 198.19.17.5/31
|
||||
ip ospf cost 1000
|
||||
ip ospf network point-to-point
|
||||
ip ospf area 0.0.0.0
|
||||
arista-leaf#show run section router ospf
|
||||
router ospf 65500
|
||||
router-id 198.19.16.2
|
||||
redistribute connected
|
||||
network 198.19.0.0/16 area 0.0.0.0
|
||||
max-lsa 12000
|
||||
```
|
||||
|
||||
I complete the configuration for the other two interfaces on this Arista: port Eth31/1 also connects to the _nikhef_ IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects with a single 100G link to the _nokia-leaf_ IXR-7220-D2 at a cost of 10.
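Here's a sketch of those two interfaces, mirroring the Eth32/1 stanza above; the /31 addresses are inferred from the OSPF neighbor output further down, so double-check them against the real addressing plan:

```
interface Ethernet31/1
description Core: Connected to nikhef
mtu 9190
no switchport
ip address 198.19.17.3/31
ip ospf cost 1000
ip ospf network point-to-point
ip ospf area 0.0.0.0
interface Ethernet30/1
description Core: Connected to nokia-leaf
mtu 9190
no switchport
ip address 198.19.17.10/31
ip ospf cost 10
ip ospf network point-to-point
ip ospf area 0.0.0.0
```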
|
||||
It's nice to see OSPF in action - there are two equal (but high-cost) OSPF paths via router-id 198.19.16.1 (nikhef), and there's one lower-cost path via router-id 198.19.16.3 (_nokia-leaf_). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nikhef -> equinix). Dope!
|
||||
|
||||
```
|
||||
arista-leaf#show ip ospf nei
|
||||
Neighbor ID Instance VRF Pri State Dead Time Address Interface
|
||||
198.19.16.1 65500 default 1 FULL 00:00:36 198.19.17.4 Ethernet32/1
|
||||
198.19.16.3 65500 default 1 FULL 00:00:31 198.19.17.11 Ethernet30/1
|
||||
198.19.16.1 65500 default 1 FULL 00:00:35 198.19.17.2 Ethernet31/1
|
||||
|
||||
arista-leaf#traceroute 198.19.16.0
|
||||
traceroute to 198.19.16.0 (198.19.16.0), 30 hops max, 60 byte packets
|
||||
1 198.19.17.11 (198.19.17.11) 0.220 ms 0.150 ms 0.206 ms
|
||||
2 198.19.17.6 (198.19.17.6) 0.169 ms 0.107 ms 0.099 ms
|
||||
3 198.19.16.0 (198.19.16.0) 0.434 ms 0.346 ms 0.303 ms
|
||||
```
|
||||
|
||||
So far, so good! The _underlay_ is up, every router can reach every other router on its loopback,
|
||||
and all OSPF adjacencies are formed. I'll leave the 2x100G between _nikhef_ and _arista-leaf_ at
|
||||
high cost for now.
|
||||
|
||||
#### Overlay EVPN: SR Linux
|
||||
|
||||
The big-picture idea here is to use iBGP with the same private AS number, and because there are two
|
||||
main facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as
|
||||
route-reflectors for others. It means that they will have an iBGP session amongst themselves
|
||||
(198.19.16.0 <-> 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the
|
||||
198.19.16.0/24 subnet. This way, I don't have to configure any more than strictly necessary on the
|
||||
core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core
|
||||
routers. I proceed to configure BGP on the Nokia's like this:
|
||||
|
||||
```
|
||||
A:pim@nikhef# / network-instance default protocols bgp
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set autonomous-system 65500
|
||||
A:pim@nikhef# set router-id 198.19.16.1
|
||||
A:pim@nikhef# set dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
|
||||
A:pim@nikhef# set afi-safi evpn admin-state enable
|
||||
A:pim@nikhef# set preference ibgp 170
|
||||
A:pim@nikhef# set route-advertisement rapid-withdrawal true
|
||||
A:pim@nikhef# set route-advertisement wait-for-fib-install false
|
||||
A:pim@nikhef# set group overlay peer-as 65500
|
||||
A:pim@nikhef# set group overlay afi-safi evpn admin-state enable
|
||||
A:pim@nikhef# set group overlay afi-safi ipv4-unicast admin-state disable
|
||||
A:pim@nikhef# set group overlay afi-safi ipv6-unicast admin-state disable
|
||||
A:pim@nikhef# set group overlay local-as as-number 65500
|
||||
A:pim@nikhef# set group overlay route-reflector client true
|
||||
A:pim@nikhef# set group overlay transport local-address 198.19.16.1
|
||||
A:pim@nikhef# set neighbor 198.19.16.0 admin-state enable
|
||||
A:pim@nikhef# set neighbor 198.19.16.0 peer-group overlay
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
I can see that iBGP sessions establish between all the devices:
|
||||
|
||||
```
|
||||
A:pim@nikhef# show network-instance default protocols bgp neighbor
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
BGP neighbor summary for network-instance "default"
|
||||
Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
|
||||
| Net-Inst | Peer | Group | Flags | Peer-AS | State | Uptime | AFI/SAFI | [Rx/Active/Tx] |
|
||||
+=============+=============+==========+=======+==========+=============+===============+============+====================+
|
||||
| default | 198.19.16.0 | overlay | S | 65500 | established | 0d:0h:2m:32s | evpn | [0/0/0] |
|
||||
| default | 198.19.16.2 | overlay | D | 65500 | established | 0d:0h:2m:27s | evpn | [0/0/0] |
|
||||
| default | 198.19.16.3 | overlay | D | 65500 | established | 0d:0h:2m:41s | evpn | [0/0/0] |
|
||||
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
|
||||
---------------------------------------------------------------------------------------------------------------------------
|
||||
Summary:
|
||||
1 configured neighbors, 1 configured sessions are established, 0 disabled peers
|
||||
2 dynamic peers
|
||||
```
|
||||
|
||||
A few things to note here - there is one _configured_ neighbor (this is the other IXR-7220-D4 router) and two _dynamic_ peers: the Arista and the smaller IXR-7220-D2 router. The only address family that they are exchanging information for is the _evpn_ family, and no prefixes have been learned or sent yet, as shown by the `[0/0/0]` designation in the last column.
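The _nokia-leaf_ and _arista-leaf_ devices only need the client side of this. On SR Linux that's the same configuration minus the dynamic-neighbor and route-reflector statements, with both core loopbacks as configured neighbors; a sketch for _nokia-leaf_ (hostname and prompt assumed):

```
A:pim@nokia-leaf# / network-instance default protocols bgp
A:pim@nokia-leaf# set admin-state enable
A:pim@nokia-leaf# set autonomous-system 65500
A:pim@nokia-leaf# set router-id 198.19.16.3
A:pim@nokia-leaf# set afi-safi evpn admin-state enable
A:pim@nokia-leaf# set group overlay peer-as 65500
A:pim@nokia-leaf# set group overlay afi-safi evpn admin-state enable
A:pim@nokia-leaf# set group overlay local-as as-number 65500
A:pim@nokia-leaf# set group overlay transport local-address 198.19.16.3
A:pim@nokia-leaf# set neighbor 198.19.16.0 admin-state enable
A:pim@nokia-leaf# set neighbor 198.19.16.0 peer-group overlay
A:pim@nokia-leaf# set neighbor 198.19.16.1 admin-state enable
A:pim@nokia-leaf# set neighbor 198.19.16.1 peer-group overlay
A:pim@nokia-leaf# commit stay
```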
|
||||
|
||||
#### Overlay EVPN: Arista
|
||||
|
||||
The Arista is also remarkably straightforward to configure. Here, I'll simply enable the iBGP
|
||||
session as follows:
|
||||
|
||||
```
|
||||
arista-leaf#show run section bgp
|
||||
router bgp 65500
|
||||
neighbor evpn peer group
|
||||
neighbor evpn remote-as 65500
|
||||
neighbor evpn update-source Loopback0
|
||||
neighbor evpn ebgp-multihop 3
|
||||
neighbor evpn send-community extended
|
||||
neighbor evpn maximum-routes 12000 warning-only
|
||||
neighbor 198.19.16.0 peer group evpn
|
||||
neighbor 198.19.16.1 peer group evpn
|
||||
!
|
||||
address-family evpn
|
||||
neighbor evpn activate
|
||||
|
||||
arista-leaf#show bgp summary
|
||||
BGP summary information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Neighbor AS Session State AFI/SAFI AFI/SAFI State NLRI Rcd NLRI Acc
|
||||
----------- ----------- ------------- ----------------------- -------------- ---------- ----------
|
||||
198.19.16.0 65500 Established IPv4 Unicast Advertised 0 0
|
||||
198.19.16.0 65500 Established L2VPN EVPN Negotiated 0 0
|
||||
198.19.16.1 65500 Established IPv4 Unicast Advertised 0 0
|
||||
198.19.16.1 65500 Established L2VPN EVPN Negotiated 0 0
|
||||
```
|
||||
|
||||
On this leaf node, I'll have redundant iBGP sessions with the two core nodes. Since those core
|
||||
nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No
|
||||
matter how many additional Arista (or Nokia) devices I add to the network, all they'll have to do is
|
||||
enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both core routers.
|
||||
Voila!
|
||||
|
||||
#### VXLAN EVPN: SR Linux
|
||||
|
||||
Nokia documentation informs me that SR Linux uses a special interface called _system0_ to source its
|
||||
VXLAN traffic from, and that this interface should be added to the _default_ network-instance. So it's a matter of defining that interface and associating a VXLAN interface with it, like so:
|
||||
|
||||
```
|
||||
A:pim@nikhef# set / interface system0 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 admin-state enable
|
||||
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
|
||||
A:pim@nikhef# set / network-instance default interface system0.0
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
|
||||
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
This creates the plumbing for a VXLAN sub-interface called `vxlan1.2604` which will accept/send
|
||||
traffic using VNI 2604 (this happens to be the VLAN id we use at FrysIX for our production Peering
|
||||
LAN), and it'll use the `system0.0` address to source that traffic from.
|
||||
|
||||
The second part is to create what SR Linux calls a MAC-VRF and put some interface(s) in it:
|
||||
|
||||
```
|
||||
A:pim@nikhef# set / interface ethernet-1/9 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
|
||||
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 vlan-tagging true
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 type bridged
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 admin-state enable
|
||||
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
|
||||
|
||||
A:pim@nikhef# / network-instance peeringlan
|
||||
A:pim@nikhef# set type mac-vrf
|
||||
A:pim@nikhef# set admin-state enable
|
||||
A:pim@nikhef# set interface ethernet-1/9/3.0
|
||||
A:pim@nikhef# set vxlan-interface vxlan1.2604
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 admin-state enable
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
|
||||
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 evi 2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
|
||||
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
|
||||
A:pim@nikhef# commit stay
|
||||
```
|
||||
|
||||
In the first block here, Arend took the 100G port called `ethernet-1/9` and split it into four breakout ports. He forced the breakout port speed to 10G because he used a 40G-4x10G DAC, and it happens that the third lane is plugged into the Debian machine. So on `ethernet-1/9/3` I'll create a
|
||||
sub-interface, make it type _bridged_ (which I've also done on `vxlan1.2604`!) and allow any
|
||||
untagged traffic to enter it.
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
|
||||
|
||||
If you, like me, are used to either VPP or IOS/XR, this type of sub-interface stuff should feel very
|
||||
natural to you. I've written about the sub-interfaces logic on Cisco's IOS/XR and VPP approach in a
|
||||
previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})] which my buddy Fred lovingly calls
|
||||
_VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read!
|
||||
|
||||
The second block creates a new _network-instance_ which I'll name `peeringlan`. It associates the newly created untagged sub-interface `ethernet-1/9/3.0` and the VXLAN interface with this instance, and starts a BGP-EVPN protocol instance that tells traffic in and out of this network-instance to use EVI 2604 on the VXLAN sub-interface, signalling all learned MAC addresses with the specified route-distinguisher and import/export route-targets. For simplicity I've just used the same value for each: 65500:2604.
|
||||
|
||||
I continue to add an interface to the `peeringlan` _network-instance_ on the other two Nokia
|
||||
routers: `ethernet-1/9/3.0` on the _equinix_ router and `ethernet-1/9.0` on the _nokia-leaf_ router.
|
||||
Each of these goes to a 10Gbps port on a Debian machine.
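On _nokia-leaf_, that follows the same pattern as on _nikhef_; a sketch of just the values that differ (the `tunnel-interface vxlan1` and the `bgp-evpn`/`bgp-vpn` settings are identical, admin-state enables are omitted, and 198.19.18.3 matches the VTEP address that shows up in the route tables below):

```
A:pim@nokia-leaf# set / interface system0 subinterface 0 ipv4 address 198.19.18.3/32
A:pim@nokia-leaf# set / network-instance default interface system0.0
A:pim@nokia-leaf# set / interface ethernet-1/9 subinterface 0 type bridged
A:pim@nokia-leaf# set / network-instance peeringlan interface ethernet-1/9.0
A:pim@nokia-leaf# set / network-instance peeringlan vxlan-interface vxlan1.2604
A:pim@nokia-leaf# commit stay
```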
|
||||
|
||||
#### VXLAN EVPN: Arista
|
||||
|
||||
At this point I'm feeling pretty bullish about the whole project. Arista does not make it very difficult for me to configure L2 EVPN (which is called a MAC-VRF here as well):
|
||||
|
||||
```
|
||||
arista-leaf#conf t
|
||||
vlan 2604
|
||||
name v-peeringlan
|
||||
interface Ethernet9/3
|
||||
speed forced 10000full
|
||||
switchport access vlan 2604
|
||||
|
||||
interface Loopback1
|
||||
ip address 198.19.18.2/32
|
||||
interface Vxlan1
|
||||
vxlan source-interface Loopback1
|
||||
vxlan udp-port 4789
|
||||
vxlan vlan 2604 vni 2604
|
||||
```
|
||||
|
||||
After creating VLAN 2604 and making port Eth9/3 an access port in that VLAN, I'll add a VTEP endpoint
|
||||
called `Loopback1`, and a VXLAN interface that uses that to source its traffic. Here, I'll associate
|
||||
local VLAN 2604 with the `Vxlan1` and its VNI 2604, to match up with how I configured the Nokias
|
||||
previously.
|
||||
|
||||
Finally, it's a matter of tying these together by announcing the MAC addresses into the EVPN iBGP
|
||||
sessions:
|
||||
```
|
||||
arista-leaf#conf t
|
||||
router bgp 65500
|
||||
vlan 2604
|
||||
rd 65500:2604
|
||||
route-target both 65500:2604
|
||||
redistribute learned
|
||||
!
|
||||
```
|
||||
|
||||
### Results
|
||||
|
||||
To validate the configurations, I learn a cool trick from my buddy Andy on the SR Linux discord
|
||||
server. In EOS, I can ask it to check for any obvious mistakes in two places:
|
||||
|
||||
```
|
||||
arista-leaf#show vxlan config-sanity detail
|
||||
Category Result Detail
|
||||
---------------------------------- -------- --------------------------------------------------
|
||||
Local VTEP Configuration Check OK
|
||||
Loopback IP Address OK
|
||||
VLAN-VNI Map OK
|
||||
Flood List OK
|
||||
Routing OK
|
||||
VNI VRF ACL OK
|
||||
Decap VRF-VNI Map OK
|
||||
VRF-VNI Dynamic VLAN OK
|
||||
Remote VTEP Configuration Check OK
|
||||
Remote VTEP OK
|
||||
Platform Dependent Check OK
|
||||
VXLAN Bridging OK
|
||||
VXLAN Routing OK VXLAN Routing not enabled
|
||||
CVX Configuration Check OK
|
||||
CVX Server OK Not in controller client mode
|
||||
MLAG Configuration Check OK Run 'show mlag config-sanity' to verify MLAG config
|
||||
Peer VTEP IP OK MLAG peer is not connected
|
||||
MLAG VTEP IP OK
|
||||
Peer VLAN-VNI OK
|
||||
Virtual VTEP IP OK
|
||||
MLAG Inactive State OK
|
||||
|
||||
arista-leaf#show bgp evpn sanity detail
|
||||
Category Check Status Detail
|
||||
-------- -------------------- ------ ------
|
||||
General Send community OK
|
||||
General Multi-agent mode OK
|
||||
General Neighbor established OK
|
||||
L2 MAC-VRF route-target OK
|
||||
import and export
|
||||
L2 MAC-VRF OK
|
||||
route-distinguisher
|
||||
L2 MAC-VRF redistribute OK
|
||||
L2 MAC-VRF overlapping OK
|
||||
VLAN
|
||||
L2 Suppressed MAC OK
|
||||
VXLAN VLAN to VNI map for OK
|
||||
MAC-VRF
|
||||
VXLAN VRF to VNI map for OK
|
||||
IP-VRF
|
||||
```
|
||||
|
||||
#### Results: Arista view
|
||||
|
||||
Inspecting the MAC addresses learned from all four of the client ports on the Debian machine is
|
||||
easy:
|
||||
|
||||
```
|
||||
arista-leaf#show bgp evpn summary
|
||||
BGP summary information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Neighbor Status Codes: m - Under maintenance
|
||||
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
|
||||
198.19.16.0 4 65500 3311 3867 0 0 18:06:28 Estab 7 7
|
||||
198.19.16.1 4 65500 3308 3873 0 0 18:06:28 Estab 7 7
|
||||
|
||||
arista-leaf#show bgp evpn vni 2604 next-hop 198.19.18.3
|
||||
BGP routing table information for VRF default
|
||||
Router identifier 198.19.16.2, local AS number 65500
|
||||
Route status codes: * - valid, > - active, S - Stale, E - ECMP head, e - ECMP
|
||||
c - Contributing to ECMP, % - Pending BGP convergence
|
||||
Origin codes: i - IGP, e - EGP, ? - incomplete
|
||||
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop
|
||||
|
||||
Network Next Hop Metric LocPref Weight Path
|
||||
* >Ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
|
||||
* ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
|
||||
* >Ec RD: 65500:2604 imet 198.19.18.3
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
|
||||
* ec RD: 65500:2604 imet 198.19.18.3
|
||||
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
|
||||
```
|
||||
There's a lot to unpack here! Under the _route-distinguisher_ I configured on all the sessions, the Arista is learning one MAC address behind next-hop 198.19.18.3 (this is the VTEP of the _nokia-leaf_ router), via both iBGP sessions. The MAC address is learned from originator 198.19.16.3 (the loopback of the _nokia-leaf_ router), through two cluster members: the active one on iBGP speaker 198.19.16.1 (_nikhef_) and a backup on 198.19.16.0 (_equinix_).
|
||||
|
||||
I can also see that there's a bunch of `imet` route entries, and Andy explained these to me. They are
|
||||
a signal from a VTEP participant that they are interested in seeing multicast traffic (like neighbor
|
||||
discovery or ARP requests) flooded to them. Every router participating in this L2VPN will raise such
|
||||
an `imet` route, which I'll see in duplicates as well (one from each iBGP session). This checks out.
|
||||
|
||||
#### Results: SR Linux view
|
||||
|
||||
The Nokia IXR-7220-D4 router called _equinix_ has also learned a bunch of EVPN routing entries,
|
||||
which I can inspect as follows:
|
||||
|
||||
```
|
||||
A:pim@equinix# show network-instance default protocols bgp routes evpn route-type summary
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Show report for the BGP route table of network-instance "default"
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Status codes: u=used, *=valid, >=best, x=stale, b=backup
|
||||
Origin codes: i=IGP, e=EGP, ?=incomplete
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
BGP Router ID: 198.19.16.0 AS: 65500 Local AS: 65500
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Type 2 MAC-IP Advertisement Routes
|
||||
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
|
||||
| Status | Route- | Tag-ID | MAC-address | IP-address | neighbor | Path-| Next-Hop | Label | ESI | MAC Mobility |
|
||||
| | distinguisher | | | | | id | | | | |
|
||||
+========+===============+========+===================+============+=============+======+============-+========+================================+==================+
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:57 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.1 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.2 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.3 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
|
||||
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
|
||||
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
Type 3 Inclusive Multicast Ethernet Tag Routes
|
||||
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
|
||||
| Status | Route-distinguisher | Tag-ID | Originator-IP | neighbor | Path- | Next-Hop |
|
||||
| | | | | | id | |
|
||||
+========+=============================+========+=====================+=================+========+=======================+
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.1 | 198.19.16.1 | 0 | 198.19.18.1 |
|
||||
| * | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.1 | 0 | 198.19.18.2 |
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.2 | 0 | 198.19.18.2 |
|
||||
| * | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.1 | 0 | 198.19.18.3 |
|
||||
| u*> | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.3 | 0 | 198.19.18.3 |
|
||||
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
|
||||
--------------------------------------------------------------------------------------------------------------------------
|
||||
0 Ethernet Auto-Discovery routes 0 used, 0 valid
|
||||
5 MAC-IP Advertisement routes 3 used, 5 valid
|
||||
5 Inclusive Multicast Ethernet Tag routes 3 used, 5 valid
|
||||
0 Ethernet Segment routes 0 used, 0 valid
|
||||
0 IP Prefix routes 0 used, 0 valid
|
||||
0 Selective Multicast Ethernet Tag routes 0 used, 0 valid
|
||||
0 Selective Multicast Membership Report Sync routes 0 used, 0 valid
|
||||
0 Selective Multicast Leave Sync routes 0 used, 0 valid
|
||||
--------------------------------------------------------------------------------------------------------------------------
|
||||
```
|
||||
|
||||
I have to say, SR Linux output is incredibly verbose! But, I can see all the relevant bits and bobs
|
||||
here. Each MAC-IP entry is accounted for: I can see several nexthops pointing at the nikhef switch, one pointing at the nokia-leaf router and one pointing at the Arista switch. I also see the `imet`
|
||||
entries. One thing to note -- the SR Linux implementation leaves the type-2 routes empty with a
|
||||
0.0.0.0 IPv4 address, while the Arista (in my opinion, more correctly) leaves them as NULL
|
||||
(unspecified). But, everything looks great!
|
||||
|
||||
#### Results: Debian view
|
||||
|
||||
There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. As I said,
|
||||
Arend hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+
|
||||
connections. This network card is a regular in my AS8298 network, as it has excellent DPDK support
|
||||
and can easily pump 40Mpps with VPP. IPng 🥰 Intel X710!
|
||||
|
||||
```
|
||||
root@debian:~ # ip netns add nikhef
|
||||
root@debian:~ # ip link set enp1s0f0 netns nikhef
|
||||
root@debian:~ # ip netns exec nikhef ip link set enp1s0f0 up mtu 9000
|
||||
root@debian:~ # ip netns exec nikhef ip addr add 192.0.2.10/24 dev enp1s0f0
|
||||
root@debian:~ # ip netns exec nikhef ip addr add 2001:db8::10/64 dev enp1s0f0
|
||||
|
||||
root@debian:~ # ip netns add arista-leaf
|
||||
root@debian:~ # ip link set enp1s0f1 netns arista-leaf
|
||||
root@debian:~ # ip netns exec arista-leaf ip link set enp1s0f1 up mtu 9000
|
||||
root@debian:~ # ip netns exec arista-leaf ip addr add 192.0.2.11/24 dev enp1s0f1
|
||||
root@debian:~ # ip netns exec arista-leaf ip addr add 2001:db8::11/64 dev enp1s0f1
|
||||
|
||||
root@debian:~ # ip netns add nokia-leaf
|
||||
root@debian:~ # ip link set enp1s0f2 netns nokia-leaf
|
||||
root@debian:~ # ip netns exec nokia-leaf ip link set enp1s0f2 up mtu 9000
|
||||
root@debian:~ # ip netns exec nokia-leaf ip addr add 192.0.2.12/24 dev enp1s0f2
|
||||
root@debian:~ # ip netns exec nokia-leaf ip addr add 2001:db8::12/64 dev enp1s0f2
|
||||
|
||||
root@debian:~ # ip netns add equinix
|
||||
root@debian:~ # ip link set enp1s0f3 netns equinix
|
||||
root@debian:~ # ip netns exec equinix ip link set enp1s0f3 up mtu 9000
|
||||
root@debian:~ # ip netns exec equinix ip addr add 192.0.2.13/24 dev enp1s0f3
|
||||
root@debian:~ # ip netns exec equinix ip addr add 2001:db8::13/64 dev enp1s0f3
|
||||
|
||||
root@debian:~# ip netns exec nikhef fping -g 192.0.2.8/29
|
||||
192.0.2.10 is alive
|
||||
192.0.2.11 is alive
|
||||
192.0.2.12 is alive
|
||||
192.0.2.13 is alive
|
||||
|
||||
root@debian:~# ip netns exec arista-leaf fping 2001:db8::10 2001:db8::11 2001:db8::12 2001:db8::13
|
||||
2001:db8::10 is alive
|
||||
2001:db8::11 is alive
|
||||
2001:db8::12 is alive
|
||||
2001:db8::13 is alive
|
||||
|
||||
root@debian:~# ip netns exec equinix ip nei
|
||||
192.0.2.10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
192.0.2.11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
192.0.2.12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
fe80::e63a:6eff:fe5f:c57 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
fe80::e63a:6eff:fe5f:c58 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
fe80::e63a:6eff:fe5f:c59 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
2001:db8::10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
|
||||
2001:db8::11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
|
||||
2001:db8::12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
|
||||
```
|
||||
|
||||
The Debian machine puts each network port into its own network namespace, and gives each one an IPv4 and an IPv6 address. I can then enter the `nikhef` network namespace, which has its NIC connected to
|
||||
the IXR-7220-D4 router called _nikhef_, and ping all four endpoints. Similarly, I can enter the
|
||||
`arista-leaf` namespace and ping6 all four endpoints. Finally, I take a look at the IPv6 and IPv4
|
||||
neighbor table on the network card that is connected to the _equinix_ router. All three MAC addresses are
|
||||
seen. This proves end to end connectivity across the EVPN VXLAN, and full interoperability. Booyah!
|
||||
|
||||
Performance? We got that! I'm not worried as these Nokia routers are rated for 12.8Tbps of VXLAN....
|
||||
```
|
||||
root@debian:~# ip netns exec equinix iperf3 -c 192.0.2.12
|
||||
Connecting to host 192.0.2.12, port 5201
|
||||
[ 5] local 192.0.2.10 port 34598 connected to 192.0.2.12 port 5201
|
||||
[ ID] Interval Transfer Bitrate Retr Cwnd
|
||||
[ 5] 0.00-1.00 sec 1.15 GBytes 9.91 Gbits/sec 19 1.52 MBytes
|
||||
[ 5] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 3 1.54 MBytes
|
||||
[ 5] 2.00-3.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
|
||||
[ 5] 3.00-4.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
|
||||
[ 5] 4.00-5.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 5.00-6.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 6.00-7.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 7.00-8.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 8.00-9.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
[ 5] 9.00-10.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
|
||||
- - - - - - - - - - - - - - - - - - - - - - - - -
|
||||
[ ID] Interval Transfer Bitrate Retr
|
||||
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec 24 sender
|
||||
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec receiver
|
||||
|
||||
iperf Done.
|
||||
```
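The matching server side runs in the namespace that owns 192.0.2.12; a sketch:

```
root@debian:~# ip netns exec nokia-leaf iperf3 -s
```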
|
||||
|
||||
## What's Next
|
||||
|
||||
There are a few improvements I can make before deploying this architecture to the internet exchange.
|
||||
Notably:
|
||||
* the functional equivalent of _port security_, that is to say only allowing one or two MAC
|
||||
addresses per member port. FrysIX has a strict one-port-one-member-one-MAC rule, and having port
|
||||
security will greatly improve our resilience.
|
||||
* SR Linux has the ability to suppress ARP, _even on L2 MAC-VRF_! It's relatively well known for
|
||||
IRB based setups, but adding this to transparent bridge-domains is possible in Nokia
|
||||
[[ref](https://documentation.nokia.com/srlinux/22-6/SR_Linux_Book_Files/EVPN-VXLAN_Guide/services-evpn-vxlan-l2.html#configuring_evpn_learning_for_proxy_arp)],
|
||||
using the syntax of `protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise
|
||||
true`. This will glean the IP addresses based on intercepted ARP requests, and reduce the need for
|
||||
BUM flooding.
|
||||
* Andy informs me that Arista also has this feature. By setting `router l2-vpn` and `arp learning bridged`,
|
||||
the suppression of ARP requests/replies also works in the same way. This greatly reduces cross-router
|
||||
BUM flooding. If DE-CIX can do it, so can FrysIX :)
|
||||
* some automation - although configuring the MAC-VRF across Arista and SR Linux is definitely not
|
||||
as difficult as I thought, having some automation in place will avoid errors and mistakes. It
|
||||
would suck if the IXP collapsed because I botched a link drain or PNI configuration!
|
||||
|
||||
### Acknowledgements
|
||||
|
||||
I am relatively new to EVPN configurations, and wanted to give a shoutout to Andy Whitaker who
|
||||
jumped in very quickly when I asked a question on the SR Linux Discord. He was gracious with his time and spent a few hours on a video call with me, explaining EVPN in great detail for both Arista and SR Linux. In particular, I want to give him a big "Thank you!" for helping me understand symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at
|
||||
Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure
|
||||
gold!
|
||||
|
||||
I also want to thank Niek for helping me take my first baby steps onto this platform and patiently
|
||||
answering my nerdly questions about the platform, the switch chip, and the configuration philosophy.
|
||||
Learning a new NOS is always a fun task, and it was made super fun because Niek spent an hour with Arend and me on a video call, giving a bunch of operational tips and tricks along the way.
|
||||
|
||||
Finally, Arend and ERITAP are an absolute joy to work with. We took turns hacking on the lab, which
|
||||
Arend made available for me while I am traveling to Mississippi this week. Thanks for the kWh and
|
||||
OOB access, and for brainstorming the config with me!
|
||||
|
||||
### Reference configurations
|
||||
|
||||
Here's the configs for all machines in this demonstration:
|
||||
[[nikhef](/assets/frys-ix/nikhef.conf)] | [[equinix](/assets/frys-ix/equinix.conf)] | [[nokia-leaf](/assets/frys-ix/nokia-leaf.conf)] | [[arista-leaf](/assets/frys-ix/arista-leaf.conf)]
|
464
content/articles/2025-05-03-containerlab-1.md
Normal file
@ -0,0 +1,464 @@
|
||||
---
|
||||
date: "2025-05-03T15:07:23Z"
|
||||
title: 'VPP in Containerlab - Part 1'
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
|
||||
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
|
||||
However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines
|
||||
like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to
|
||||
allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
performs almost the same as on bare metal. But did you know that VPP can also run in Docker?
|
||||
|
||||
The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
|
||||
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
|
||||
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
|
||||
container-based networking labs. It starts the containers, builds virtual wiring between them to
create lab topologies of the user's choice, and manages the lab lifecycle.
|
||||
|
||||
Quite regularly I am asked 'when will you add VPP to Containerlab?', but at ZANOG I made a promise
to actually add it. Here I go, on a journey to integrate VPP into Containerlab!
|
||||
|
||||
## Containerized VPP
|
||||
|
||||
The folks at [[Tigera](https://www.tigera.io/project-calico/)] maintain a project called _Calico_,
|
||||
which accelerates Kubernetes CNI (Container Network Interface) by using [[FD.io](https://fd.io)]
|
||||
VPP. Since the origins of Kubernetes are to run containers in a Docker environment, it stands to
|
||||
reason that it should be possible to run a containerized VPP. I start by reading up on how they
|
||||
create their Docker image, and I learn a lot.
|
||||
|
||||
### Docker Build
|
||||
|
||||
Considering IPng runs bare metal Debian (currently Bookworm) machines, my Docker image will be based
|
||||
on `debian:bookworm` as well. The build starts off quite modest:
|
||||
|
||||
```
|
||||
pim@summer:~$ mkdir -p src/vpp-containerlab
|
||||
pim@summer:~/src/vpp-containerlab$ cat << 'EOF' > Dockerfile.bookworm
|
||||
FROM debian:bookworm
|
||||
ARG DEBIAN_FRONTEND=noninteractive
|
||||
ARG VPP_INSTALL_SKIP_SYSCTL=true
|
||||
ARG REPO=release
|
||||
RUN apt-get update && apt-get -y install curl procps && apt-get clean
|
||||
|
||||
# Install VPP
|
||||
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
|
||||
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean
|
||||
|
||||
CMD ["/usr/bin/vpp","-c","/etc/vpp/startup.conf"]
|
||||
EOF
|
||||
pim@summer:~/src/vpp-containerlab$ docker build -f Dockerfile.bookworm . -t pimvanpelt/vpp-containerlab
|
||||
```
|
||||
|
||||
One gotcha - when I install the upstream VPP Debian packages, they generate a `sysctl` file which
the postinstall script tries to apply. However, I can't set sysctls in the container, so the build
fails. I take a look at the VPP source code and find `src/pkg/debian/vpp.postinst`, which helpfully
contains a means to skip setting the sysctls, using an environment variable called
`VPP_INSTALL_SKIP_SYSCTL`.
|
||||
|
||||
### Running VPP in Docker
|
||||
|
||||
With the Docker image built, I need to tweak the VPP startup configuration a little bit, to allow it
|
||||
to run well in a Docker environment. There are a few things I make note of:
|
||||
1. We may not have huge pages on the host machine, so I'll set all the page sizes to the
|
||||
linux-default 4kB rather than 2MB or 1GB hugepages. This creates a performance regression, but
|
||||
in the case of Containerlab, we're not here to build high performance stuff, but rather users
|
||||
will be doing functional testing.
|
||||
1. DPDK requires either UIO or VFIO kernel drivers, so that it can bind its so-called _poll mode
   driver_ to the network cards. It also requires huge pages. Since my first version will be
   using only virtual ethernet interfaces, I'll disable DPDK and VFIO altogether.
|
||||
1. VPP can run any number of CPU worker threads. In its simplest form, I can also run it with only
|
||||
one thread. Of course, this will not be a high performance setup, but since I'm already not
|
||||
using hugepages, I'll use only 1 thread.
|
||||
|
||||
The VPP `startup.conf` configuration file I came up with:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat << EOF > clab-startup.conf
|
||||
unix {
|
||||
interactive
|
||||
log /var/log/vpp/vpp.log
|
||||
full-coredump
|
||||
cli-listen /run/vpp/cli.sock
|
||||
cli-prompt vpp-clab#
|
||||
cli-no-pager
|
||||
poll-sleep-usec 100
|
||||
}
|
||||
|
||||
api-trace {
|
||||
on
|
||||
}
|
||||
|
||||
memory {
|
||||
main-heap-size 512M
|
||||
main-heap-page-size 4k
|
||||
}
|
||||
buffers {
|
||||
buffers-per-numa 16000
|
||||
default data-size 2048
|
||||
page-size 4k
|
||||
}
|
||||
|
||||
statseg {
|
||||
size 64M
|
||||
page-size 4k
|
||||
per-node-counters on
|
||||
}
|
||||
|
||||
plugins {
|
||||
plugin default { enable }
|
||||
plugin dpdk_plugin.so { disable }
|
||||
}
|
||||
EOF
|
||||
```
|
||||
|
||||
Just a couple of notes for those who are running VPP in production. Each of the `*-page-size` config
settings takes the normal Linux pagesize of 4kB, which effectively keeps VPP from using any
hugepages. Then, I'll specifically disable the DPDK plugin, although I didn't install it in the
Dockerfile build anyway, as it lives in its own dedicated Debian package called `vpp-plugin-dpdk`.
Finally, I'll make VPP use less CPU by telling it to sleep for 100 microseconds between each poll
iteration. In production environments, VPP will use 100% of the CPUs it's assigned, but in this lab,
it will not be quite as hungry. By the way, even in this sleepy mode, it'll still easily handle a
gigabit of traffic!
|
||||
|
||||
Now, VPP wants to run as root and it needs a few host features, notably tuntap devices and vhost,
and a few capabilities, notably NET_ADMIN, SYS_NICE and SYS_PTRACE. I take a look at the
[[manpage](https://man7.org/linux/man-pages/man7/capabilities.7.html)]:
|
||||
* ***CAP_SYS_NICE***: allows setting real-time scheduling, CPU affinity and I/O scheduling class, and
  migrating and moving memory pages.
* ***CAP_NET_ADMIN***: allows performing various network-related operations like interface
  configuration, routing tables, nested network namespaces, multicast, promiscuous mode, and so on.
* ***CAP_SYS_PTRACE***: allows tracing arbitrary processes using `ptrace(2)`, and a few related
  kernel system calls.
|
||||
|
||||
Being a networking dataplane implementation, VPP wants to be able to tinker with network devices.
This is not typically allowed in Docker containers, although the Docker developers did make some
concessions for those containers that need just that little bit more access. They describe it in
their
[[docs](https://docs.docker.com/engine/containers/run/#runtime-privilege-and-linux-capabilities)] as
follows:
|
||||
|
||||
| The --privileged flag gives all capabilities to the container. When the operator executes docker
|
||||
| run --privileged, Docker enables access to all devices on the host, and reconfigures AppArmor or
|
||||
| SELinux to allow the container nearly all the same access to the host as processes running outside
|
||||
| containers on the host. Use this flag with caution. For more information about the --privileged
|
||||
| flag, see the docker run reference.
|
||||
|
||||
{{< image width="4em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
In this moment, I feel I should point out that running a Docker container with `--privileged` flag
|
||||
set does give it _a lot_ of privileges. A container with `--privileged` is not a securely sandboxed
|
||||
process. Containers in this mode can get a root shell on the host and take control over the system.
|
||||
|
||||
With that little fine-print warning out of the way, I am going to Yolo like a boss:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker run --name clab-pim \
|
||||
--cap-add=NET_ADMIN --cap-add=SYS_NICE --cap-add=SYS_PTRACE \
|
||||
--device=/dev/net/tun:/dev/net/tun --device=/dev/vhost-net:/dev/vhost-net \
|
||||
--privileged -v $(pwd)/clab-startup.conf:/etc/vpp/startup.conf:ro \
|
||||
docker.io/pimvanpelt/vpp-containerlab
|
||||
clab-pim
|
||||
```
|
||||
|
||||
### Configuring VPP in Docker
|
||||
|
||||
And with that, the Docker container is running! I post a screenshot on
|
||||
[[Mastodon](https://ublog.tech/@IPngNetworks/114392852468494211)] and my buddy John responds with a
|
||||
polite but firm insistence that I explain myself. Here you go, buddy :)
|
||||
|
||||
In another terminal, I can play around with this VPP instance a little bit:
|
||||
```
|
||||
pim@summer:~$ docker exec -it clab-pim bash
|
||||
root@d57c3716eee9:/# ip -br l
|
||||
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
|
||||
eth0@if530566 UP 02:42:ac:11:00:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
|
||||
root@d57c3716eee9:/# ps auxw
|
||||
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
|
||||
root 1 2.2 0.2 17498852 160300 ? Rs 15:11 0:00 /usr/bin/vpp -c /etc/vpp/startup.conf
|
||||
root 10 0.0 0.0 4192 3388 pts/0 Ss 15:11 0:00 bash
|
||||
root 18 0.0 0.0 8104 4056 pts/0 R+ 15:12 0:00 ps auxw
|
||||
|
||||
root@d57c3716eee9:/# vppctl
|
||||
_______ _ _ _____ ___
|
||||
__/ __/ _ \ (_)__ | | / / _ \/ _ \
|
||||
_/ _// // / / / _ \ | |/ / ___/ ___/
|
||||
/_/ /____(_)_/\___/ |___/_/ /_/
|
||||
|
||||
vpp-clab# show version
|
||||
vpp v25.02-release built by root on d5cd2c304b7f at 2025-02-26T13:58:32
|
||||
vpp-clab# show interfaces
|
||||
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
|
||||
local0 0 down 0/0/0/0
|
||||
```
|
||||
|
||||
Slick! I can see that the container has an `eth0` device, which Docker has connected to the main
bridged network. For now, there's only one process running: pid 1 proudly shows VPP (as in Docker,
the `CMD` field simply replaces `init`). Later on, I can imagine running a few more daemons like
SSH and so on, but for now, I'm happy.
|
||||
|
||||
Looking at VPP itself, it has no network interfaces yet, except for the default `local0` interface.
|
||||
|
||||
### Adding Interfaces in Docker
|
||||
|
||||
But if I don't have DPDK, how will I add interfaces? Enter `veth(4)`. From the
|
||||
[[manpage](https://man7.org/linux/man-pages/man4/veth.4.html)], I learn that veth devices are
|
||||
virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to
|
||||
a physical network device in another namespace, but can also be used as standalone network devices.
|
||||
veth devices are always created in interconnected pairs.
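To make that a bit more concrete, here's a tiny sketch, completely outside of Docker, of creating a
veth pair by hand and pushing one end into a network namespace. The names `veth-host`, `veth-ns` and
the namespace `demo` are made up for this illustration:

```
pim@summer:~$ sudo ip netns add demo
pim@summer:~$ sudo ip link add veth-host type veth peer name veth-ns
pim@summer:~$ sudo ip link set veth-ns netns demo
pim@summer:~$ sudo ip addr add 198.51.100.1/24 dev veth-host
pim@summer:~$ sudo ip link set veth-host up
pim@summer:~$ sudo ip -n demo addr add 198.51.100.2/24 dev veth-ns
pim@summer:~$ sudo ip -n demo link set veth-ns up
pim@summer:~$ ping -c1 198.51.100.2
```

One half of the pair lives in the host, the other half in the `demo` namespace, and packets sent
into one end come out of the other.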
|
||||
|
||||
Of course, Docker users will recognize this. It's like bread and butter for containers to
|
||||
communicate with one another - and with the host they're running on. I can simply create a Docker
|
||||
network and attach one half of it to a running container, like so:
|
||||
|
||||
```
|
||||
pim@summer:~$ docker network create --driver=bridge clab-network \
|
||||
--subnet 192.0.2.0/24 --ipv6 --subnet 2001:db8::/64
|
||||
5711b95c6c32ac0ed185a54f39e5af4b499677171ff3d00f99497034e09320d2
|
||||
pim@summer:~$ docker network connect clab-network clab-pim --ip '' --ip6 ''
|
||||
```
|
||||
|
||||
The first command here creates a new network called `clab-network` in Docker. As a result, a new
|
||||
bridge called `br-5711b95c6c32` shows up on the host. The bridge name is chosen from the UUID of the
|
||||
Docker object. Seeing as I added an IPv4 and IPv6 subnet to the bridge, it gets configured with the
|
||||
first address in both:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ brctl show br-5711b95c6c32
|
||||
bridge name bridge id STP enabled interfaces
|
||||
br-5711b95c6c32 8000.0242099728c6 no veth021e363
|
||||
|
||||
|
||||
pim@summer:~/src/vpp-containerlab$ ip -br a show dev br-5711b95c6c32
|
||||
br-5711b95c6c32 UP 192.0.2.1/24 2001:db8::1/64 fe80::42:9ff:fe97:28c6/64 fe80::1/64
|
||||
```
|
||||
|
||||
The second command creates a `veth` pair, and puts one half of it in the bridge, and this interface
|
||||
is called `veth021e363` above. The other half of it pops up as `eth1` in the Docker container:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker exec -it clab-pim bash
|
||||
root@d57c3716eee9:/# ip -br l
|
||||
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
|
||||
eth0@if530566 UP 02:42:ac:11:00:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
eth1@if530577 UP 02:42:c0:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
|
||||
```
|
||||
|
||||
One of the many awesome features of VPP is its ability to attach to these `veth` devices by means of
|
||||
its `af-packet` driver, by reusing the same MAC address (in this case `02:42:c0:00:02:02`). I first
|
||||
take a look at the Linux [[manpage](https://man7.org/linux/man-pages/man7/packet.7.html)] for it,
|
||||
and then read up on the VPP
|
||||
[[documentation](https://fd.io/docs/vpp/v2101/gettingstarted/progressivevpp/interface)] on the
|
||||
topic.
|
||||
|
||||
|
||||
However, my attention is drawn to Docker assigning an IPv4 and IPv6 address to the container:
|
||||
```
|
||||
root@d57c3716eee9:/# ip -br a
|
||||
lo UNKNOWN 127.0.0.1/8 ::1/128
|
||||
eth0@if530566 UP 172.17.0.2/16
|
||||
eth1@if530577 UP 192.0.2.2/24 2001:db8::2/64 fe80::42:c0ff:fe00:202/64
|
||||
root@d57c3716eee9:/# ip addr del 192.0.2.2/24 dev eth1
|
||||
root@d57c3716eee9:/# ip addr del 2001:db8::2/64 dev eth1
|
||||
```
|
||||
|
||||
I decide to remove them from here, as in the end, `eth1` will be owned by VPP so _it_ should be
setting the IPv4 and IPv6 addresses. For the life of me, I don't see how I can stop Docker from
assigning IPv4 and IPv6 addresses to this container ... and the
[[docs](https://docs.docker.com/engine/network/)] seem to be off as well, as they suggest I can pass
a flag `--ipv4=False`, but that flag doesn't exist, at least not on my Bookworm Docker variant. I
make a mental note to discuss this with the folks in the Containerlab community.
|
||||
|
||||
|
||||
Anyway, armed with this knowledge I can bind the container-side veth pair called `eth1` to VPP, like
|
||||
so:
|
||||
|
||||
```
|
||||
root@d57c3716eee9:/# vppctl
|
||||
_______ _ _ _____ ___
|
||||
__/ __/ _ \ (_)__ | | / / _ \/ _ \
|
||||
_/ _// // / / / _ \ | |/ / ___/ ___/
|
||||
/_/ /____(_)_/\___/ |___/_/ /_/
|
||||
|
||||
vpp-clab# create host-interface name eth1 hw-addr 02:42:c0:00:02:02
|
||||
vpp-clab# set interface name host-eth1 eth1
|
||||
vpp-clab# set interface mtu 1500 eth1
|
||||
vpp-clab# set interface ip address eth1 192.0.2.2/24
|
||||
vpp-clab# set interface ip address eth1 2001:db8::2/64
|
||||
vpp-clab# set interface state eth1 up
|
||||
vpp-clab# show int addr
|
||||
eth1 (up):
|
||||
L3 192.0.2.2/24
|
||||
L3 2001:db8::2/64
|
||||
local0 (dn):
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
After all this work, I've successfully created a Docker image based on Debian Bookworm and VPP 25.02
(the current stable release version), started a container with it, and added a network bridge in
Docker which connects the host `summer` to the container. Proof, as they say, is in the ping-pudding:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ ping -c5 2001:db8::2
|
||||
PING 2001:db8::2(2001:db8::2) 56 data bytes
|
||||
64 bytes from 2001:db8::2: icmp_seq=1 ttl=64 time=0.113 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=2 ttl=64 time=0.056 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=3 ttl=64 time=0.202 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=4 ttl=64 time=0.102 ms
|
||||
64 bytes from 2001:db8::2: icmp_seq=5 ttl=64 time=0.100 ms
|
||||
|
||||
--- 2001:db8::2 ping statistics ---
|
||||
5 packets transmitted, 5 received, 0% packet loss, time 4098ms
|
||||
rtt min/avg/max/mdev = 0.056/0.114/0.202/0.047 ms
|
||||
pim@summer:~/src/vpp-containerlab$ ping -c5 192.0.2.2
|
||||
PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data.
|
||||
64 bytes from 192.0.2.2: icmp_seq=1 ttl=64 time=0.043 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=2 ttl=64 time=0.032 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=3 ttl=64 time=0.019 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=4 ttl=64 time=0.041 ms
|
||||
64 bytes from 192.0.2.2: icmp_seq=5 ttl=64 time=0.027 ms
|
||||
|
||||
--- 192.0.2.2 ping statistics ---
|
||||
5 packets transmitted, 5 received, 0% packet loss, time 4063ms
|
||||
rtt min/avg/max/mdev = 0.019/0.032/0.043/0.008 ms
|
||||
```
|
||||
|
||||
And in case that simple ping-test wasn't enough to get you excited, here's a packet trace from VPP
|
||||
itself, while I'm performing this ping:
|
||||
|
||||
```
|
||||
vpp-clab# trace add af-packet-input 100
|
||||
vpp-clab# wait 3
|
||||
vpp-clab# show trace
|
||||
------------------- Start of thread 0 vpp_main -------------------
|
||||
Packet 1
|
||||
|
||||
00:07:03:979275: af-packet-input
|
||||
af_packet: hw_if_index 1 rx-queue 0 next-index 4
|
||||
block 47:
|
||||
address 0x7fbf23b7d000 version 2 seq_num 48 pkt_num 0
|
||||
tpacket3_hdr:
|
||||
status 0x20000001 len 98 snaplen 98 mac 92 net 106
|
||||
sec 0x68164381 nsec 0x258e7659 vlan 0 vlan_tpid 0
|
||||
vnet-hdr:
|
||||
flags 0x00 gso_type 0x00 hdr_len 0
|
||||
gso_size 0 csum_start 0 csum_offset 0
|
||||
00:07:03:979293: ethernet-input
|
||||
IP4: 02:42:09:97:28:c6 -> 02:42:c0:00:02:02
|
||||
00:07:03:979306: ip4-input
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979315: ip4-lookup
|
||||
fib 0 dpo-idx 9 flow hash: 0x00000000
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979322: ip4-receive
|
||||
fib:0 adj:9 flow:0x00000000
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979323: ip4-icmp-input
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979323: ip4-icmp-echo-request
|
||||
ICMP: 192.0.2.1 -> 192.0.2.2
|
||||
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x5813, flags DONT_FRAGMENT
|
||||
ICMP echo_request checksum 0xc16 id 21197
|
||||
00:07:03:979326: ip4-load-balance
|
||||
fib 0 dpo-idx 5 flow hash: 0x00000000
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
00:07:03:979325: ip4-rewrite
|
||||
tx_sw_if_index 1 dpo-idx 5 : ipv4 via 192.0.2.1 eth1: mtu:1500 next:3 flags:[] 0242099728c60242c00002020800 flow hash: 0x00000000
|
||||
00000000: 0242099728c60242c00002020800450000542dc44000400188e1c0000202c000
|
||||
00000020: 02010000141652cd00018143166800000000399d0900000000001011
|
||||
00:07:03:979326: eth1-output
|
||||
eth1 flags 0x02180005
|
||||
IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
00:07:03:979327: eth1-tx
|
||||
af_packet: hw_if_index 1 tx-queue 0
|
||||
tpacket3_hdr:
|
||||
status 0x1 len 108 snaplen 108 mac 0 net 0
|
||||
sec 0x0 nsec 0x0 vlan 0 vlan_tpid 0
|
||||
vnet-hdr:
|
||||
flags 0x00 gso_type 0x00 hdr_len 0
|
||||
gso_size 0 csum_start 0 csum_offset 0
|
||||
buffer 0xf97c4:
|
||||
current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
|
||||
local l2-hdr-offset 0 l3-hdr-offset 14
|
||||
IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
|
||||
ICMP: 192.0.2.2 -> 192.0.2.1
|
||||
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
|
||||
fragment id 0x2dc4, flags DONT_FRAGMENT
|
||||
ICMP echo_reply checksum 0x1416 id 21197
|
||||
```
|
||||
|
||||
Well, that's a mouthful, isn't it! Here, I get to show you VPP in action. After receiving the
packet on its `af-packet-input` node from 192.0.2.1 (Summer, who is pinging us) to 192.0.2.2 (the
VPP container), the packet traverses the dataplane graph. It goes through `ethernet-input`, then
`ip4-input` and `ip4-lookup`, which see that it's destined to a locally configured IPv4 address, so
the packet is handed to `ip4-receive`. That one sees that the IP protocol is ICMP, so it hands the
packet to `ip4-icmp-input`, which notices that the packet is an ICMP echo request, so off to
`ip4-icmp-echo-request` our little packet goes. The ICMP plugin in VPP now answers by
`ip4-rewrite`'ing the packet, sending the return to 192.0.2.1 at MAC address `02:42:09:97:28:c6`
(this is Summer, the host doing the pinging!), after which the newly created ICMP echo-reply is
handed to `eth1-output`, which marshals it back into the kernel's AF_PACKET interface using
`eth1-tx`.
|
||||
|
||||
Boom. I could not be more pleased.
|
||||
|
||||
## What's Next
|
||||
|
||||
This was a nice exercise for me! I'm going this direction because the
|
||||
[[Containerlab](https://containerlab.dev)] framework will start containers with given NOS images,
|
||||
not too dissimilar from the one I just made, and then attaches `veth` pairs between the containers.
|
||||
I started dabbling with a [[pull-request](https://github.com/srl-labs/containerlab/pull/2571)], but
|
||||
I got stuck with a part of the Containerlab code that pre-deploys config files into the containers.
|
||||
You see, I will need to generate two files:
|
||||
|
||||
1. A `startup.conf` file that is specific to the containerlab Docker container. I'd like them to
|
||||
each set their own hostname so that the CLI has a unique prompt. I can do this by setting `unix
|
||||
{ cli-prompt {{ .ShortName }}# }` in the template renderer.
|
||||
1. Containerlab will know all of the veth pairs that are planned to be created into each VPP
|
||||
container. I'll need it to then write a little snippet of config that does the `create
|
||||
host-interface` spiel, to attach these `veth` pairs to the VPP dataplane.
|
||||
|
||||
I reached out to Roman from Nokia, who is one of the authors and the current maintainer of Containerlab.
Roman was keen to help out, and seeing as he knows the Containerlab stuff well, and I know the VPP
stuff well, this is a reasonable partnership! Soon, he and I plan to have a bare-bones setup that
will connect a few VPP containers together with an SR Linux node in a lab. Stand by!
|
||||
|
||||
Once we have that, there's still quite some work for me to do. Notably:
|
||||
* Configuration persistence. `clab` allows you to save the running config. For that, I'll need to
|
||||
introduce [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] and a means to invoke it when
|
||||
the lab operator wants to save their config, and then reconfigure VPP when the container
|
||||
restarts.
|
||||
* I'll need to have a few files from `clab` shared with the host, notably the `startup.conf` and
|
||||
`vppcfg.yaml`, as well as some manual pre- and post-flight configuration for the more esoteric
|
||||
stuff. Building the plumbing for this is a TODO for now.
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
I wanted to give a shout-out to Nardus le Roux who inspired me to contribute this Containerlab VPP
|
||||
node type, and to Roman Dodin for his help getting the Containerlab parts squared away when I got a
|
||||
little bit stuck.
|
||||
|
||||
First order of business: get it to ping at all ... it'll go faster from there on out :)
|
373
content/articles/2025-05-04-containerlab-2.md
Normal file
@ -0,0 +1,373 @@
|
||||
---
|
||||
date: "2025-05-04T15:07:23Z"
|
||||
title: 'VPP in Containerlab - Part 2'
|
||||
params:
|
||||
asciinema: true
|
||||
---
|
||||
|
||||
{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}
|
||||
|
||||
# Introduction
|
||||
|
||||
From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
|
||||
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
|
||||
However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines
|
||||
like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to
|
||||
allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
performs almost the same as on bare metal. But did you know that VPP can also run in Docker?
|
||||
|
||||
The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
|
||||
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
|
||||
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
|
||||
container-based networking labs. It starts the containers, builds virtual wiring between them to
|
||||
create lab topologies of users' choice and manages the lab lifecycle.
|
||||
|
||||
Quite regularly I am asked 'when will you add VPP to Containerlab?', but at ZANOG I made a promise
|
||||
to actually add it. In my previous [[article]({{< ref 2025-05-03-containerlab-1.md >}})], I took
|
||||
a good look at VPP as a dockerized container. In this article, I'll explore how to make such a
|
||||
container run in Containerlab!
|
||||
|
||||
## Completing the Docker container
|
||||
|
||||
Just having VPP running by itself in a container is not super useful (although it _is_ cool!). I
|
||||
decide first to add a few bits and bobs that will come in handy in the `Dockerfile`:
|
||||
|
||||
```
|
||||
FROM debian:bookworm
|
||||
ARG DEBIAN_FRONTEND=noninteractive
|
||||
ARG VPP_INSTALL_SKIP_SYSCTL=true
|
||||
ARG REPO=release
|
||||
EXPOSE 22/tcp
|
||||
RUN apt-get update && apt-get -y install curl procps tcpdump iproute2 iptables \
|
||||
iputils-ping net-tools git python3 python3-pip vim-tiny openssh-server bird2 \
|
||||
mtr-tiny traceroute && apt-get clean
|
||||
|
||||
# Install VPP
|
||||
RUN mkdir -p /var/log/vpp /root/.ssh/
|
||||
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
|
||||
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean
|
||||
|
||||
# Build vppcfg
|
||||
RUN pip install --break-system-packages build netaddr yamale argparse pyyaml ipaddress
|
||||
RUN git clone https://git.ipng.ch/ipng/vppcfg.git && cd vppcfg && python3 -m build && \
|
||||
pip install --break-system-packages dist/vppcfg-*-py3-none-any.whl
|
||||
|
||||
# Config files
|
||||
COPY files/etc/vpp/* /etc/vpp/
|
||||
COPY files/etc/bird/* /etc/bird/
|
||||
COPY files/init-container.sh /sbin/
|
||||
RUN chmod 755 /sbin/init-container.sh
|
||||
CMD ["/sbin/init-container.sh"]
|
||||
```
|
||||
|
||||
A few notable additions:
|
||||
* ***vppcfg*** is a handy utility I wrote and discussed in a previous [[article]({{< ref
  2022-04-02-vppcfg-2 >}})]. Its purpose is to take a YAML file that describes the configuration of
  the dataplane (like which interfaces, sub-interfaces, MTU, IP addresses and so on), and then
|
||||
apply this safely to a running dataplane. You can check it out in my
|
||||
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] git repository.
|
||||
* ***openssh-server*** will come in handy to log in to the container, in addition to the already
|
||||
available `docker exec`.
|
||||
* ***bird2***, which will be my controlplane of choice. At a future date, I might also add FRR,
  which may be a good alternative for some. VPP works well with both. You can check out Bird on
|
||||
the nic.cz [[website](https://bird.network.cz/?get_doc&f=bird.html&v=20)].
|
||||
|
||||
I'll add a couple of default config files for Bird and VPP, and replace the CMD with a generic
|
||||
`/sbin/init-container.sh` in which I can do any late binding stuff before launching VPP.
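Before I walk through the individual pieces, here's the rough shape I have in mind for that script.
This is just a sketch to set the scene; the real snippets follow in the sections below, and the
final line (how VPP itself ends up being launched) is my assumption rather than a verbatim copy:

```
#!/bin/sh
# /sbin/init-container.sh -- outline only
# 1. Create the 'dataplane' network namespace for Linux Control Plane
# 2. Set the root password and start sshd
# 3. Optionally start Bird2 inside the dataplane namespace
# 4. Generate /etc/vpp/clab.vpp (veth bindings) and /etc/vpp/vppcfg.vpp
# 5. Hand over to VPP
exec /usr/bin/vpp -c /etc/vpp/startup.conf
```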
|
||||
|
||||
### Initializing the Container
|
||||
|
||||
#### VPP Containerlab: NetNS
|
||||
|
||||
VPP's Linux Control Plane plugin wants to run in its own network namespace. So the first order of
|
||||
business of `/sbin/init-container.sh` is to create it:
|
||||
|
||||
```
|
||||
NETNS=${NETNS:="dataplane"}
|
||||
|
||||
echo "Creating dataplane namespace"
|
||||
/usr/bin/mkdir -p /etc/netns/$NETNS
|
||||
/usr/bin/touch /etc/netns/$NETNS/resolv.conf
|
||||
/usr/sbin/ip netns add $NETNS
|
||||
```
|
||||
|
||||
#### VPP Containerlab: SSH
|
||||
|
||||
Then, I'll set the root password (which is `vpp` by the way), and start an SSH daemon which allows
root logins:
|
||||
|
||||
```
|
||||
echo "Starting SSH, with credentials root:vpp"
|
||||
sed -i -e 's,^#PermitRootLogin prohibit-password,PermitRootLogin yes,' /etc/ssh/sshd_config
|
||||
sed -i -e 's,^root:.*,root:$y$j9T$kG8pyZEVmwLXEtXekQCRK.$9iJxq/bEx5buni1hrC8VmvkDHRy7ZMsw9wYvwrzexID:20211::::::,' /etc/shadow
|
||||
/etc/init.d/ssh start
|
||||
```
|
||||
|
||||
#### VPP Containerlab: Bird2
|
||||
|
||||
I can already predict that Bird2 won't be the only option for a controlplane, even though I'm a huge
|
||||
fan of it. Therefore, I'll make it configurable to leave the door open for other controlplane
|
||||
implementations in the future:
|
||||
|
||||
```
|
||||
BIRD_ENABLED=${BIRD_ENABLED:="true"}
|
||||
|
||||
if [ "$BIRD_ENABLED" == "true" ]; then
|
||||
echo "Starting Bird in $NETNS"
|
||||
mkdir -p /run/bird /var/log/bird
|
||||
chown bird:bird /var/log/bird
|
||||
ROUTERID=$(ip -br a show eth0 | awk '{ print $3 }' | cut -f1 -d/)
|
||||
sed -i -e "s,.*router id .*,router id $ROUTERID; # Set by container-init.sh," /etc/bird/bird.conf
|
||||
/usr/bin/nsenter --net=/var/run/netns/$NETNS /usr/sbin/bird -u bird -g bird
|
||||
fi
|
||||
```
|
||||
|
||||
I am reminded that Bird won't start if it cannot determine its _router id_. When I start it in the
|
||||
`dataplane` namespace, it will immediately exit, because there will be no IP addresses configured
|
||||
yet. But luckily, it logs its complaint and it's easily addressed. I decide to take the management
|
||||
IPv4 address from `eth0` and write that into the `bird.conf` file, which otherwise does some basic
|
||||
initialization that I described in a previous [[article]({{< ref 2021-09-02-vpp-5 >}})], so I'll
|
||||
skip that here. However, I do include an empty file called `/etc/bird/bird-local.conf` for users to
|
||||
further configure Bird2.
|
||||
|
||||
#### VPP Containerlab: Binding veth pairs
|
||||
|
||||
When Containerlab starts the VPP container, it'll offer it a set of `veth` ports that connect this
|
||||
container to other nodes in the lab. This is done by the `links` list in the topology file
|
||||
[[ref](https://containerlab.dev/manual/network/)]. It's my goal to take all of the interfaces
|
||||
that are of type `veth`, and generate a little snippet to grab them and bind them into VPP while
|
||||
setting their MTU to 9216 to allow for jumbo frames:
|
||||
|
||||
```
|
||||
CLAB_VPP_FILE=${CLAB_VPP_FILE:=/etc/vpp/clab.vpp}
|
||||
|
||||
echo "Generating $CLAB_VPP_FILE"
|
||||
: > $CLAB_VPP_FILE
|
||||
MTU=9216
|
||||
for IFNAME in $(ip -br link show type veth | cut -f1 -d@ | grep -v '^eth0$' | sort); do
|
||||
MAC=$(ip -br link show dev $IFNAME | awk '{ print $3 }')
|
||||
echo " * $IFNAME hw-addr $MAC mtu $MTU"
|
||||
ip link set $IFNAME up mtu $MTU
|
||||
cat << EOF >> $CLAB_VPP_FILE
|
||||
create host-interface name $IFNAME hw-addr $MAC
|
||||
set interface name host-$IFNAME $IFNAME
|
||||
set interface mtu $MTU $IFNAME
|
||||
set interface state $IFNAME up
|
||||
|
||||
EOF
|
||||
done
|
||||
```
|
||||
|
||||
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
|
||||
|
||||
One thing I realized is that VPP will assign a random MAC address on its copy of the `veth` port,
|
||||
which is not great. I'll explicitly configure it with the same MAC address as the `veth` interface
|
||||
itself, otherwise I'd have to put the interface into promiscuous mode.
|
||||
|
||||
#### VPP Containerlab: VPPcfg
|
||||
|
||||
I'm almost ready, but I have one more detail. The user will be able to offer a
|
||||
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] YAML file to configure the interfaces and so on. If such
|
||||
a file exists, I'll apply it to the dataplane upon startup:
|
||||
|
||||
```
|
||||
VPPCFG_VPP_FILE=${VPPCFG_VPP_FILE:=/etc/vpp/vppcfg.vpp}
|
||||
|
||||
echo "Generating $VPPCFG_VPP_FILE"
|
||||
: > $VPPCFG_VPP_FILE
|
||||
if [ -r /etc/vpp/vppcfg.yaml ]; then
|
||||
vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml -o $VPPCFG_VPP_FILE
|
||||
fi
|
||||
```
|
||||
|
||||
Once the VPP process starts, it'll execute `/etc/vpp/bootstrap.vpp`, which in turn executes the
newly generated `/etc/vpp/clab.vpp` to grab the `veth` interfaces, and then `/etc/vpp/vppcfg.vpp` to
further configure the dataplane. Easy peasy!
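For completeness: I'd expect that glue file to be little more than two `exec` statements, along
these lines (a sketch of mine, not a verbatim copy of what ships in the image):

```
pim@summer:~/src/vpp-containerlab$ cat files/etc/vpp/bootstrap.vpp
exec /etc/vpp/clab.vpp
exec /etc/vpp/vppcfg.vpp
```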
|
||||
|
||||
### Adding VPP to Containerlab
|
||||
|
||||
Roman points out a previous integration for the 6WIND VSR in
|
||||
[[PR#2540](https://github.com/srl-labs/containerlab/pull/2540)]. This serves as a useful guide to
|
||||
get me started. I fork the repo, create a branch so that Roman can also add a few commits, and
|
||||
together we start hacking in [[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)].
|
||||
|
||||
First, I add the documentation skeleton in `docs/manual/kinds/fdio_vpp.md`, which is linked from a
few other places, and will be where the end-user facing documentation lives. That's about half
the contributed LOC, right there!
|
||||
|
||||
Next, I'll create a Go module in `nodes/fdio_vpp/fdio_vpp.go` which doesn't do much other than
|
||||
creating the `struct`, and its required `Register` and `Init` functions. The `Init` function ensures
|
||||
the right capabilities are set in Docker, and the right devices are bound for the container.
|
||||
|
||||
I notice that Containerlab rewrites the Dockerfile `CMD` string and prepends an `if-wait.sh` script
to it. This is because when Containerlab starts the container, it'll still be busy adding these
`link` interfaces to it, and if a container starts too quickly, it may not see all the interfaces.
So, Containerlab informs the container using an environment variable called `CLAB_INTFS`, and this
script simply sleeps for a while until that exact number of interfaces is present. Ok, cool beans.
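The concept behind that wait loop is tiny. A sketch of the idea (my own illustration, not the
actual `if-wait.sh` that Containerlab ships) might look like this:

```
#!/bin/sh
# CLAB_INTFS holds the number of lab interfaces Containerlab will plumb in.
EXPECTED=${CLAB_INTFS:-0}
while [ "$(ls /sys/class/net | grep -cv -e '^lo$' -e '^eth0$')" -lt "$EXPECTED" ]; do
  sleep 1
done
exec "$@"    # then run the original CMD
```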
|
||||
|
||||
Roman helps me a bit with Go templating. You see, I think it'll be slick to have the CLI prompt for
|
||||
the VPP containers to reflect their hostname, because normally, VPP will assign `vpp# `. I add the
|
||||
template in `nodes/fdio_vpp/vpp_startup_config.go.tpl` and it only has one variable expansion: `unix
|
||||
{ cli-prompt {{ .ShortName }}# }`. But I totally think it's worth it, because when running many VPP
|
||||
containers in the lab, it could otherwise get confusing.
|
||||
|
||||
Roman also shows me a trick in the function `PostDeploy()`, which will write the user's SSH pubkeys
|
||||
to `/root/.ssh/authorized_keys`. This allows users to log in without having to use password
|
||||
authentication.
|
||||
|
||||
Collectively, we decide to punt on the `SaveConfig` function until we're a bit further along. I have
|
||||
an idea how this would work, basically along the lines of calling `vppcfg dump` and bind-mounting
|
||||
that file into the lab directory somewhere. This way, upon restarting, the YAML file can be re-read
|
||||
and the dataplane initialized. But it'll be for another day.
|
||||
|
||||
After the main module is finished, all I have to do is add it to `clab/register.go` and that's just
|
||||
about it. In about 170 lines of code, 50 lines of Go template, and 170 lines of Markdown, this
|
||||
contribution is about ready to ship!
|
||||
|
||||
### Containerlab: Demo
|
||||
|
||||
After I finish writing the documentation, I decide to include a demo with a quickstart to help folks
|
||||
along. A simple lab showing two VPP instances and two Alpine Linux clients can be found on
|
||||
[[git.ipng.ch/ipng/vpp-containerlab](https://git.ipng.ch/ipng/vpp-containerlab)]. Simply check out the
|
||||
repo and start the lab, like so:
|
||||
|
||||
```
|
||||
$ git clone https://git.ipng.ch/ipng/vpp-containerlab.git
|
||||
$ cd vpp-containerlab
|
||||
$ containerlab deploy --topo vpp.clab.yml
|
||||
```
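Once the lab is deployed, `containerlab inspect` gives a quick overview of the running nodes and
their management addresses, and `containerlab destroy` tears the whole thing down again; the exact
output columns vary a bit between Containerlab releases, so I'll spare you a paste:

```
$ containerlab inspect --topo vpp.clab.yml
$ containerlab destroy --topo vpp.clab.yml
```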
|
||||
|
||||
#### Containerlab: configs
|
||||
|
||||
The file `vpp.clab.yml` contains an example topology consisting of two VPP instances, each connected to
one Alpine Linux container, in the following topology:
|
||||
|
||||
{{< image src="/assets/containerlab/learn-vpp.png" alt="Containerlab Topo" width="100%" >}}
|
||||
|
||||
Two relevant files for each VPP router are included in this
|
||||
[[repository](https://git.ipng.ch/ipng/vpp-containerlab)]:
|
||||
1. `config/vpp*/vppcfg.yaml` configures the dataplane interfaces, including a loopback address.
|
||||
1. `config/vpp*/bird-local.conf` configures the controlplane to enable BFD and OSPF.
|
||||
|
||||
To illustrate these files, let me take a closer look at node `vpp1`. Its VPP dataplane
configuration looks like this:
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/vppcfg.yaml
|
||||
interfaces:
|
||||
eth1:
|
||||
description: 'To client1'
|
||||
mtu: 1500
|
||||
lcp: eth1
|
||||
addresses: [ 10.82.98.65/28, 2001:db8:8298:101::1/64 ]
|
||||
eth2:
|
||||
description: 'To vpp2'
|
||||
mtu: 9216
|
||||
lcp: eth2
|
||||
addresses: [ 10.82.98.16/31, 2001:db8:8298:1::1/64 ]
|
||||
loopbacks:
|
||||
loop0:
|
||||
description: 'vpp1'
|
||||
lcp: loop0
|
||||
addresses: [ 10.82.98.0/32, 2001:db8:8298::/128 ]
|
||||
```
|
||||
|
||||
Then, I enable BFD, OSPF and OSPFv3 on `eth2` and `loop0` on both of the VPP routers:
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/bird-local.conf
|
||||
protocol bfd bfd1 {
|
||||
interface "eth2" { interval 100 ms; multiplier 30; };
|
||||
}
|
||||
|
||||
protocol ospf v2 ospf4 {
|
||||
ipv4 { import all; export all; };
|
||||
area 0 {
|
||||
interface "loop0" { stub yes; };
|
||||
interface "eth2" { type pointopoint; cost 10; bfd on; };
|
||||
};
|
||||
}
|
||||
|
||||
protocol ospf v3 ospf6 {
|
||||
ipv6 { import all; export all; };
|
||||
area 0 {
|
||||
interface "loop0" { stub yes; };
|
||||
interface "eth2" { type pointopoint; cost 10; bfd on; };
|
||||
};
|
||||
}
|
||||
```
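By the way, since Bird2 is running inside the VPP containers, its CLI is available there too. Once
the lab is up, something along these lines should confirm the OSPF adjacencies from the
controlplane's point of view (I'm omitting the output here, it differs per deployment):

```
root@vpp1:~# birdc show protocols
root@vpp1:~# birdc show ospf neighbors ospf4
```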
|
||||
|
||||
#### Containerlab: playtime!
|
||||
|
||||
Once the lab comes up, I can SSH to the VPP containers (`vpp1` and `vpp2`), which have my SSH pubkeys
installed thanks to Roman's work. Barring that, I could still log in as user `root` using
password `vpp`. VPP runs its own network namespace called `dataplane`, which is very similar to SR
Linux's default `network-instance`. I can join that namespace to take a closer look:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ ssh root@vpp1
|
||||
root@vpp1:~# nsenter --net=/var/run/netns/dataplane
|
||||
root@vpp1:~# ip -br a
|
||||
lo DOWN
|
||||
loop0 UP 10.82.98.0/32 2001:db8:8298::/128 fe80::dcad:ff:fe00:0/64
|
||||
eth1 UNKNOWN 10.82.98.65/28 2001:db8:8298:101::1/64 fe80::a8c1:abff:fe77:acb9/64
|
||||
eth2 UNKNOWN 10.82.98.16/31 2001:db8:8298:1::1/64 fe80::a8c1:abff:fef0:7125/64
|
||||
|
||||
root@vpp1:~# ping 10.82.98.1
|
||||
PING 10.82.98.1 (10.82.98.1) 56(84) bytes of data.
|
||||
64 bytes from 10.82.98.1: icmp_seq=1 ttl=64 time=9.53 ms
|
||||
64 bytes from 10.82.98.1: icmp_seq=2 ttl=64 time=15.9 ms
|
||||
^C
|
||||
--- 10.82.98.1 ping statistics ---
|
||||
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
|
||||
rtt min/avg/max/mdev = 9.530/12.735/15.941/3.205 ms
|
||||
```
|
||||
|
||||
From `vpp1`, I can tell that Bird2's OSPF adjacency has formed, because I can ping the `loop0`
address of the `vpp2` router at 10.82.98.1. Nice! The two client nodes are running a minimalistic Alpine
Linux container, which doesn't ship with SSH by default. But of course I can still enter the
containers using `docker exec`, like so:
|
||||
|
||||
```
|
||||
pim@summer:~/src/vpp-containerlab$ docker exec -it client1 sh
|
||||
/ # ip addr show dev eth1
|
||||
531235: eth1@if531234: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 9500 qdisc noqueue state UP
|
||||
link/ether 00:c1:ab:00:00:01 brd ff:ff:ff:ff:ff:ff
|
||||
inet 10.82.98.66/28 scope global eth1
|
||||
valid_lft forever preferred_lft forever
|
||||
inet6 2001:db8:8298:101::2/64 scope global
|
||||
valid_lft forever preferred_lft forever
|
||||
inet6 fe80::2c1:abff:fe00:1/64 scope link
|
||||
valid_lft forever preferred_lft forever
|
||||
/ # traceroute 10.82.98.82
|
||||
traceroute to 10.82.98.82 (10.82.98.82), 30 hops max, 46 byte packets
|
||||
1 10.82.98.65 (10.82.98.65) 5.906 ms 7.086 ms 7.868 ms
|
||||
2 10.82.98.17 (10.82.98.17) 24.007 ms 23.349 ms 15.933 ms
|
||||
3 10.82.98.82 (10.82.98.82) 39.978 ms 31.127 ms 31.854 ms
|
||||
|
||||
/ # traceroute 2001:db8:8298:102::2
|
||||
traceroute to 2001:db8:8298:102::2 (2001:db8:8298:102::2), 30 hops max, 72 byte packets
|
||||
1 2001:db8:8298:101::1 (2001:db8:8298:101::1) 0.701 ms 7.144 ms 7.900 ms
|
||||
2 2001:db8:8298:1::2 (2001:db8:8298:1::2) 23.909 ms 22.943 ms 23.893 ms
|
||||
3 2001:db8:8298:102::2 (2001:db8:8298:102::2) 31.964 ms 30.814 ms 32.000 ms
|
||||
```
|
||||
|
||||
From the vantage point of `client1`, the first hop represents the `vpp1` node, which forwards to
|
||||
`vpp2`, which finally forwards to `client2`, which shows that both VPP routers are passing traffic.
|
||||
Dope!
|
||||
|
||||
## Results
|
||||
|
||||
After all of this deep-diving, all that's left is for me to demonstrate the Containerlab setup by
means of this little screencast [[asciinema](/assets/containerlab/vpp-containerlab.cast)]. I hope
you enjoy it as much as I enjoyed creating it:
|
||||
|
||||
{{< asciinema src="/assets/containerlab/vpp-containerlab.cast" >}}
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
I wanted to give a shout-out to Roman Dodin for his help getting the Containerlab parts squared away
when I got a little bit stuck. He took the time to explain the internals and idioms of the
Containerlab project, which really saved me a tonne of time. He also pair-programmed
[[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)] with me over the span of two
evenings.
|
||||
|
||||
Collaborative open source rocks!
|
38
hugo.yaml
Normal file
@ -0,0 +1,38 @@
|
||||
baseURL: 'https://ipng.ch/'
|
||||
languageCode: 'en-us'
|
||||
title: "IPng Networks"
|
||||
theme: 'hugo-theme-ipng'
|
||||
|
||||
mainSections: ["articles"]
|
||||
|
||||
params:
|
||||
author: "IPng Networks GmbH"
|
||||
siteHeading: "IPng Networks"
|
||||
favicon: "favicon.ico"
|
||||
showBlogLatest: false
|
||||
mainSections: ["articles"]
|
||||
showTaxonomyLinks: false
|
||||
nBlogLatest: 14 # number of blog posts on the home page
|
||||
Paginate: 30
|
||||
blogLatestHeading: "Latest Dabblings"
|
||||
footer: "Copyright 2021- IPng Networks GmbH, all rights reserved"
|
||||
|
||||
social:
|
||||
email: "info+www@ipng.ch"
|
||||
mastodon: "@IPngNetworks"
|
||||
twitter: "IPngNetworks"
|
||||
linkedin: "pimvanpelt"
|
||||
github: "pimvanpelt"
|
||||
instagram: "IPngNetworks"
|
||||
rss: true
|
||||
|
||||
taxonomies:
|
||||
year: "year"
|
||||
month: "month"
|
||||
tags: "tags"
|
||||
categories: "categories"
|
||||
|
||||
permalinks:
|
||||
articles: "/s/articles/:year/:month/:day/:slug"
|
||||
|
||||
ignoreLogs: [ "warning-goldmark-raw-html" ]
|
5
static/.well-known/security.txt
Normal file
@ -0,0 +1,5 @@
|
||||
Canonical: https://ipng.ch/.well-known/security.txt
|
||||
Expires: 2026-01-01T00:00:00.000Z
|
||||
Contact: mailto:info@ipng.ch
|
||||
Contact: https://ipng.ch/s/contact/
|
||||
Preferred-Languages: en, nl, de
|
55
static/app/go/index.html
Normal file
@ -0,0 +1,55 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en-us">
|
||||
<head>
|
||||
<title>Javascript Redirector for RFID / NFC / nTAG</title>
|
||||
<meta name="robots" content="noindex,nofollow">
|
||||
<meta charset="utf-8">
|
||||
<script type="text/JavaScript">
|
||||
|
||||
const ntag_list = [
|
||||
"/s/articles/2021/09/21/vpp-linux-cp-part7/",
|
||||
"/s/articles/2021/12/23/vpp-linux-cp-virtual-machine-playground/",
|
||||
"/s/articles/2022/01/12/case-study-virtual-leased-line-vll-in-vpp/",
|
||||
"/s/articles/2022/02/14/case-study-vlan-gymnastics-with-vpp/",
|
||||
"/s/articles/2022/03/27/vpp-configuration-part1/",
|
||||
"/s/articles/2022/10/14/vpp-lab-setup/",
|
||||
"/s/articles/2023/03/11/case-study-centec-mpls-core/",
|
||||
"/s/articles/2023/04/09/vpp-monitoring/",
|
||||
"/s/articles/2023/05/28/vpp-mpls-part-4/",
|
||||
"/s/articles/2023/11/11/debian-on-mellanox-sn2700-32x100g/",
|
||||
"/s/articles/2023/12/17/debian-on-ipngs-vpp-routers/",
|
||||
"/s/articles/2024/01/27/vpp-python-api/",
|
||||
"/s/articles/2024/02/10/vpp-on-freebsd-part-1/",
|
||||
"/s/articles/2024/03/06/vpp-with-babel-part-1/",
|
||||
"/s/articles/2024/04/06/vpp-with-loopback-only-ospfv3-part-1/",
|
||||
"/s/articles/2024/04/27/freeix-remote/"
|
||||
];
|
||||
|
||||
var redir_url = "https://ipng.ch/";
|
||||
var key = window.location.hash.slice(1);
|
||||
if (key.startsWith("ntag")) {
|
||||
let week = Math.round(new Date().getTime() / 1000 / (7*24*3600)); // seconds per week
|
||||
let num = parseInt(key.slice(-2));
|
||||
let idx = (num + week) % ntag_list.length;
|
||||
console.log("(ntag " + num + " + week number " + week + ") % " + ntag_list.length + " = " + idx);
|
||||
redir_url = ntag_list[idx];
|
||||
}
|
||||
|
||||
console.log("Redirecting to " + redir_url + " - off you go!");
|
||||
window.location = redir_url;
|
||||
</script>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<pre>
|
||||
Usage: https://ipng.ch/app/go/#<key>
|
||||
Example: <a href="/app/go/#ntag00">#ntag00</a>
|
||||
|
||||
Also, this page requires javascript.
|
||||
|
||||
Love,
|
||||
IPng Networks.
|
||||
</pre>
|
||||
|
||||
</body>
|
||||
</html>
|
1
static/assets/containerlab/containerlab.svg
Normal file
After Width: | Height: | Size: 21 KiB |
BIN
static/assets/containerlab/learn-vpp.png
(Stored with Git LFS)
Normal file
1270
static/assets/containerlab/vpp-containerlab.cast
Normal file
BIN
static/assets/debian-vpp/warning.png
(Stored with Git LFS)
BIN
static/assets/freebsd-vpp/brain.png
(Stored with Git LFS)
BIN
static/assets/freebsd-vpp/warning.png
(Stored with Git LFS)
BIN
static/assets/freeix/freeix-artist-rendering.png
(Stored with Git LFS)
Normal file
1
static/assets/frys-ix/FrysIX_ Topology (concept).svg
Normal file
After Width: | Height: | Size: 90 KiB |
BIN
static/assets/frys-ix/IXR-7220-D3.jpg
(Stored with Git LFS)
Normal file
1
static/assets/frys-ix/Nokia Arista VXLAN.svg
Normal file
After Width: | Height: | Size: 166 KiB |
169
static/assets/frys-ix/arista-leaf.conf
Normal file
@ -0,0 +1,169 @@
|
||||
no aaa root
|
||||
!
|
||||
hardware counter feature vtep decap
|
||||
hardware counter feature vtep encap
|
||||
!
|
||||
service routing protocols model multi-agent
|
||||
!
|
||||
hostname arista-leaf
|
||||
!
|
||||
router l2-vpn
|
||||
arp learning bridged
|
||||
!
|
||||
spanning-tree mode mstp
|
||||
!
|
||||
system l1
|
||||
unsupported speed action error
|
||||
unsupported error-correction action error
|
||||
!
|
||||
vlan 2604
|
||||
name v-peeringlan
|
||||
!
|
||||
interface Ethernet1/1
|
||||
!
|
||||
interface Ethernet2/1
|
||||
!
|
||||
interface Ethernet3/1
|
||||
!
|
||||
interface Ethernet4/1
|
||||
!
|
||||
interface Ethernet5/1
|
||||
!
|
||||
interface Ethernet6/1
|
||||
!
|
||||
interface Ethernet7/1
|
||||
!
|
||||
interface Ethernet8/1
|
||||
!
|
||||
interface Ethernet9/1
|
||||
shutdown
|
||||
speed forced 10000full
|
||||
!
|
||||
interface Ethernet9/2
|
||||
shutdown
|
||||
!
|
||||
interface Ethernet9/3
|
||||
speed forced 10000full
|
||||
switchport access vlan 2604
|
||||
!
|
||||
interface Ethernet9/4
|
||||
shutdown
|
||||
!
|
||||
interface Ethernet10/1
|
||||
!
|
||||
interface Ethernet10/2
|
||||
shutdown
|
||||
!
|
||||
interface Ethernet10/4
|
||||
shutdown
|
||||
!
|
||||
interface Ethernet11/1
|
||||
!
|
||||
interface Ethernet12/1
|
||||
!
|
||||
interface Ethernet13/1
|
||||
!
|
||||
interface Ethernet14/1
|
||||
!
|
||||
interface Ethernet15/1
|
||||
!
|
||||
interface Ethernet16/1
|
||||
!
|
||||
interface Ethernet17/1
|
||||
!
|
||||
interface Ethernet18/1
|
||||
!
|
||||
interface Ethernet19/1
|
||||
!
|
||||
interface Ethernet20/1
|
||||
!
|
||||
interface Ethernet21/1
|
||||
!
|
||||
interface Ethernet22/1
|
||||
!
|
||||
interface Ethernet23/1
|
||||
!
|
||||
interface Ethernet24/1
|
||||
!
|
||||
interface Ethernet25/1
|
||||
!
|
||||
interface Ethernet26/1
|
||||
!
|
||||
interface Ethernet27/1
|
||||
!
|
||||
interface Ethernet28/1
|
||||
!
|
||||
interface Ethernet29/1
|
||||
no switchport
|
||||
!
|
||||
interface Ethernet30/1
|
||||
load-interval 1
|
||||
mtu 9190
|
||||
no switchport
|
||||
ip address 198.19.17.10/31
|
||||
ip ospf cost 10
|
||||
ip ospf network point-to-point
|
||||
ip ospf area 0.0.0.0
|
||||
!
|
||||
interface Ethernet31/1
|
||||
load-interval 1
|
||||
mtu 9190
|
||||
no switchport
|
||||
ip address 198.19.17.3/31
|
||||
ip ospf cost 1000
|
||||
ip ospf network point-to-point
|
||||
ip ospf area 0.0.0.0
|
||||
!
|
||||
interface Ethernet32/1
|
||||
load-interval 1
|
||||
mtu 9190
|
||||
no switchport
|
||||
ip address 198.19.17.5/31
|
||||
ip ospf cost 1000
|
||||
ip ospf network point-to-point
|
||||
ip ospf area 0.0.0.0
|
||||
!
|
||||
interface Loopback0
|
||||
ip address 198.19.16.2/32
|
||||
ip ospf area 0.0.0.0
|
||||
!
|
||||
interface Loopback1
|
||||
ip address 198.19.18.2/32
|
||||
!
|
||||
interface Management1
|
||||
ip address dhcp
|
||||
!
|
||||
interface Vxlan1
|
||||
vxlan source-interface Loopback1
|
||||
vxlan udp-port 4789
|
||||
vxlan vlan 2604 vni 2604
|
||||
!
|
||||
ip routing
|
||||
!
|
||||
ip route 0.0.0.0/0 Management1 10.75.8.1
|
||||
!
|
||||
router bgp 65500
|
||||
neighbor evpn peer group
|
||||
neighbor evpn remote-as 65500
|
||||
neighbor evpn update-source Loopback0
|
||||
neighbor evpn ebgp-multihop 3
|
||||
neighbor evpn send-community extended
|
||||
neighbor evpn maximum-routes 12000 warning-only
|
||||
neighbor 198.19.16.0 peer group evpn
|
||||
neighbor 198.19.16.1 peer group evpn
|
||||
!
|
||||
vlan 2604
|
||||
rd 65500:2604
|
||||
route-target both 65500:2604
|
||||
redistribute learned
|
||||
!
|
||||
address-family evpn
|
||||
neighbor evpn activate
|
||||
!
|
||||
router ospf 65500
|
||||
router-id 198.19.16.2
|
||||
redistribute connected
|
||||
network 198.19.0.0/16 area 0.0.0.0
|
||||
max-lsa 12000
|
||||
!
|
||||
end
|
90
static/assets/frys-ix/equinix.conf
Normal file
@ -0,0 +1,90 @@
|
||||
set / interface ethernet-1/1 admin-state disable
|
||||
set / interface ethernet-1/9 admin-state enable
|
||||
set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
|
||||
set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
|
||||
set / interface ethernet-1/9/3 admin-state enable
|
||||
set / interface ethernet-1/9/3 vlan-tagging true
|
||||
set / interface ethernet-1/9/3 subinterface 0 type bridged
|
||||
set / interface ethernet-1/9/3 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
|
||||
set / interface ethernet-1/29 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 type routed
|
||||
set / interface ethernet-1/29 subinterface 0 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
|
||||
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.0/31
|
||||
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
|
||||
set / interface lo0 admin-state enable
|
||||
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.0/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.0/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc2 role: spine"
set / network-instance default router-id 198.19.16.0
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/29.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.0
set / network-instance default protocols bgp dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay route-reflector client true
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.0
set / network-instance default protocols bgp neighbor 198.19.16.1 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.1 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.0
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9/3.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
BIN
static/assets/frys-ix/frysix-logo-small.png
(Stored with Git LFS)
Normal file
132
static/assets/frys-ix/nikhef.conf
Normal file
@ -0,0 +1,132 @@
set / interface ethernet-1/1 admin-state enable
set / interface ethernet-1/1 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/1 subinterface 0 type routed
set / interface ethernet-1/1 subinterface 0 admin-state enable
set / interface ethernet-1/1 subinterface 0 ip-mtu 9190
set / interface ethernet-1/1 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/1 subinterface 0 ipv4 address 198.19.17.2/31
set / interface ethernet-1/1 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/2 admin-state enable
set / interface ethernet-1/2 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/2 subinterface 0 type routed
set / interface ethernet-1/2 subinterface 0 admin-state enable
set / interface ethernet-1/2 subinterface 0 ip-mtu 9190
set / interface ethernet-1/2 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/2 subinterface 0 ipv4 address 198.19.17.4/31
set / interface ethernet-1/2 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/3 admin-state enable
set / interface ethernet-1/3 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/3 subinterface 0 type routed
set / interface ethernet-1/3 subinterface 0 admin-state enable
set / interface ethernet-1/3 subinterface 0 ip-mtu 9190
set / interface ethernet-1/3 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/3 subinterface 0 ipv4 address 198.19.17.6/31
set / interface ethernet-1/3 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/4 admin-state enable
set / interface ethernet-1/4 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/4 subinterface 0 type routed
set / interface ethernet-1/4 subinterface 0 admin-state enable
set / interface ethernet-1/4 subinterface 0 ip-mtu 9190
set / interface ethernet-1/4 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/4 subinterface 0 ipv4 address 198.19.17.8/31
set / interface ethernet-1/4 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/9 admin-state enable
set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
set / interface ethernet-1/9/1 admin-state disable
set / interface ethernet-1/9/2 admin-state disable
set / interface ethernet-1/9/3 admin-state enable
set / interface ethernet-1/9/3 vlan-tagging true
set / interface ethernet-1/9/3 subinterface 0 type bridged
set / interface ethernet-1/9/3 subinterface 0 admin-state enable
set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
set / interface ethernet-1/9/4 admin-state disable
set / interface ethernet-1/29 admin-state enable
set / interface ethernet-1/29 subinterface 0 type routed
set / interface ethernet-1/29 subinterface 0 admin-state enable
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
set / interface lo0 admin-state enable
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc1 role: spine"
set / network-instance default router-id 198.19.16.1
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/1.0
set / network-instance default interface ethernet-1/2.0
set / network-instance default interface ethernet-1/29.0
set / network-instance default interface ethernet-1/3.0
set / network-instance default interface ethernet-1/4.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.1
set / network-instance default protocols bgp dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay route-reflector client true
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.1
set / network-instance default protocols bgp neighbor 198.19.16.0 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.0 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.1
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/1.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/2.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/3.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/4.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0 passive true
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9/3.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
BIN
static/assets/frys-ix/nokia-7220-d2.png
(Stored with Git LFS)
Normal file
BIN
static/assets/frys-ix/nokia-7220-d4.png
(Stored with Git LFS)
Normal file
105
static/assets/frys-ix/nokia-leaf.conf
Normal file
@ -0,0 +1,105 @@
set / interface ethernet-1/9 admin-state enable
set / interface ethernet-1/9 vlan-tagging true
set / interface ethernet-1/9 ethernet port-speed 10G
set / interface ethernet-1/9 subinterface 0 type bridged
set / interface ethernet-1/9 subinterface 0 admin-state enable
set / interface ethernet-1/9 subinterface 0 vlan encap untagged
set / interface ethernet-1/53 admin-state enable
set / interface ethernet-1/53 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/53 subinterface 0 admin-state enable
set / interface ethernet-1/53 subinterface 0 ip-mtu 9190
set / interface ethernet-1/53 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/53 subinterface 0 ipv4 address 198.19.17.11/31
set / interface ethernet-1/53 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/55 admin-state enable
set / interface ethernet-1/55 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/55 subinterface 0 admin-state enable
set / interface ethernet-1/55 subinterface 0 ip-mtu 9190
set / interface ethernet-1/55 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/55 subinterface 0 ipv4 address 198.19.17.7/31
set / interface ethernet-1/55 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/56 admin-state enable
set / interface ethernet-1/56 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/56 subinterface 0 admin-state enable
set / interface ethernet-1/56 subinterface 0 ip-mtu 9190
set / interface ethernet-1/56 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/56 subinterface 0 ipv4 address 198.19.17.9/31
set / interface ethernet-1/56 subinterface 0 ipv6 admin-state enable
set / interface lo0 admin-state enable
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.3/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.3/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc1 role: leaf"
set / network-instance default router-id 198.19.16.3
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/53.0
set / network-instance default interface ethernet-1/55.0
set / network-instance default interface ethernet-1/56.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.3
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.3
set / network-instance default protocols bgp neighbor 198.19.16.0 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.0 peer-group overlay
set / network-instance default protocols bgp neighbor 198.19.16.1 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.1 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.3
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/53.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/55.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/56.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0 passive true
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
BIN
static/assets/jekyll-hugo/before.png
(Stored with Git LFS)
Normal file
7
static/assets/jekyll-hugo/hugo-logo-wide.svg
Normal file
@ -0,0 +1,7 @@
<svg xmlns="http://www.w3.org/2000/svg" fill-rule="evenodd" stroke-width="27" aria-label="Logo" viewBox="0 0 1493 391">
<path fill="#ebb951" stroke="#fcd804" d="M1345.211 24.704l112.262 64.305a43 43 0 0 1 21.627 37.312v142.237a40 40 0 0 1-20.702 35.037l-120.886 66.584a42 42 0 0 1-41.216-.389l-106.242-61.155a57 57 0 0 1-28.564-49.4V138.71a64 64 0 0 1 31.172-54.939l98.01-58.564a54 54 0 0 1 54.54-.503z"/>
<path fill="#33ba91" stroke="#00a88a" d="M958.07 22.82l117.31 66.78a41 41 0 0 1 20.72 35.64v139.5a45 45 0 0 1-23.1 39.32L955.68 369.4a44 44 0 0 1-43.54-.41l-105.82-61.6a56 56 0 0 1-27.83-48.4V140.07a68 68 0 0 1 33.23-58.44l98.06-58.35a48 48 0 0 1 48.3-.46z"/>
<path fill="#0594cb" stroke="#0083c0" d="M575.26 20.97l117.23 68.9a40 40 0 0 1 19.73 34.27l.73 138.67a48 48 0 0 1-24.64 42.2l-115.13 64.11a45 45 0 0 1-44.53-.42l-105.83-61.6a55 55 0 0 1-27.33-47.53V136.52a63 63 0 0 1 29.87-53.59l99.3-61.4a49 49 0 0 1 50.6-.56z"/>
<path fill="#ff4088" stroke="#c9177e" d="M195.81 24.13l114.41 66.54a44 44 0 0 1 21.88 38.04v136.43a48 48 0 0 1-24.45 41.82L194.1 370.9a49 49 0 0 1-48.48-.23L41.05 310.48a53 53 0 0 1-26.56-45.93V135.08a55 55 0 0 1 26.1-46.8l102.8-63.46a51 51 0 0 1 52.42-.69z"/>
<path fill="#fff" d="M1320.72 89.15c58.79 0 106.52 47.73 106.52 106.51 0 58.8-47.73 106.52-106.52 106.52-58.78 0-106.52-47.73-106.52-106.52 0-58.78 47.74-106.51 106.52-106.51zm0 39.57c36.95 0 66.94 30 66.94 66.94a66.97 66.97 0 0 1-66.94 66.94c-36.95 0-66.94-29.99-66.94-66.94a66.97 66.97 0 0 1 66.93-66.94h.01zm-283.8 65.31c0 47.18-8.94 60.93-26.81 80.58-17.87 19.65-41.57 27.57-71.1 27.57-27 0-48.75-9.58-67.61-26.23-20.88-18.45-36.08-47.04-36.08-78.95 0-31.37 11.72-58.48 32.49-78.67 18.22-17.67 45.34-29.18 73.3-29.18 33.77 0 68.83 15.98 90.44 47.53l-31.73 26.82c-13.45-25.03-32.94-33.46-60.82-34.26-30.83-.88-64.77 28.53-62.25 67.75 1.4 21.94 11.65 59.65 60.96 66.57 25.9 3.63 55.36-24.02 55.36-39.04H944.4v-37.5h92.5V194l.02.03zm-562.6-94.65h42.29v112.17c0 17.8.49 29.33 1.47 34.61 1.69 8.48 4.81 14.37 11.17 19.5 6.37 5.13 13.8 6.59 24.84 6.59 11.2 0 14.96-1.74 20.66-6.6 5.69-4.85 9.12-9.46 10.28-16.53 1.15-7.07 3.07-18.8 3.07-35.18V99.38h42.28v108.78c0 24.86-1.07 42.43-3.21 52.69-2.14 10.27-6.08 18.93-11.82 26-5.74 7.06-13.42 12.69-23.03 16.88-9.62 4.19-22.16 6.28-37.65 6.28-18.7 0-32.87-2.28-42.52-6.85-9.66-4.57-17.3-10.5-22.9-17.8-5.61-7.3-9.3-14.95-11.08-22.96-2.58-11.86-3.88-29.38-3.88-52.55V99.38h.03zM93.91 299.92V92.7h43.35v75.48h71.92V92.7h43.48v207.22h-43.48v-90.61h-71.92v90.61z"/>
</svg>
After: Size 2.5 KiB
BIN
static/assets/jekyll-hugo/jekyll-logo.png
(Stored with Git LFS)
Normal file
83
static/assets/logo/logo-red.svg
Normal file
After: Size 16 KiB
BIN
static/assets/logo/logo-white-1000px.png
(Stored with Git LFS)
Normal file
BIN
static/assets/logo/logo-white-100px.png
(Stored with Git LFS)
Normal file
BIN
static/assets/logo/logo-white-2000px.png
(Stored with Git LFS)
Normal file
BIN
static/assets/logo/logo-white-200px.png
(Stored with Git LFS)
Normal file
BIN
static/assets/logo/logo-white-400px.png
(Stored with Git LFS)
Normal file
BIN
static/assets/nat64/brain.png
(Stored with Git LFS)
BIN
static/assets/oem-switch/warning.png
(Stored with Git LFS)
BIN
static/assets/sflow/hsflowd-demo.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/sflow-all.pcap
Normal file
BIN
static/assets/sflow/sflow-host.pcap
Normal file
BIN
static/assets/sflow/sflow-interface.pcap
Normal file
BIN
static/assets/sflow/sflow-lab-trex.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/sflow-lab.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/sflow-overview.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/sflow-vpp-overview.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/sflow-wireshark.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/sflow.gif
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/trex-acceptance.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/trex-baseline.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/trex-overload.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/trex-passthru.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/trex-sflow-acceptance.png
(Stored with Git LFS)
Normal file
BIN
static/assets/sflow/trex-v1.png
(Stored with Git LFS)
Normal file
BIN
static/assets/shared/brain.png
(Stored with Git LFS)
Normal file
Before: Size 4.9 KiB, After: Size 4.9 KiB
BIN
static/assets/shared/warning.png
(Stored with Git LFS)
Normal file
BIN
static/assets/smtp/nginx_logo.png
(Stored with Git LFS)
BIN
static/assets/smtp/postfix_logo.png
(Stored with Git LFS)
BIN
static/assets/smtp/roundcube_logo.png
(Stored with Git LFS)
BIN
static/assets/smtp/unbound_logo.png
(Stored with Git LFS)
BIN
static/assets/vpp-babel/brain.png
(Stored with Git LFS)
BIN
static/assets/vpp-babel/warning.png
(Stored with Git LFS)
@ -1,81 +0,0 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
version="1.1"
x="0px"
y="0px"
viewBox="0 0 93.533997 110"
xml:space="preserve"
id="svg30"
sodipodi:docname="noun_1263005_cc.svg"
width="93.533997"
height="110"
inkscape:version="0.92.2 5c3e80d, 2017-08-06"><metadata
id="metadata36"><rdf:RDF><cc:Work
rdf:about=""><dc:format>image/svg+xml</dc:format><dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" /><dc:title></dc:title></cc:Work></rdf:RDF></metadata><defs
id="defs34" /><sodipodi:namedview
pagecolor="#ffffff"
bordercolor="#666666"
borderopacity="1"
objecttolerance="10"
gridtolerance="10"
guidetolerance="10"
inkscape:pageopacity="0"
inkscape:pageshadow="2"
inkscape:window-width="640"
inkscape:window-height="480"
id="namedview32"
showgrid="false"
fit-margin-top="10"
fit-margin-right="10"
fit-margin-bottom="10"
fit-margin-left="10"
inkscape:zoom="1.888"
inkscape:cx="46.767"
inkscape:cy="42.5"
inkscape:window-x="0"
inkscape:window-y="0"
inkscape:window-maximized="0"
inkscape:current-layer="svg30" /><g
id="g24"
transform="translate(-3.233,5)"><path
d="m 59.632,75.107 v -2.822 c 0,-4.96 2.015,-9.725 5.529,-13.073 4.396,-4.189 6.817,-9.839 6.817,-15.909 0,-12.119 -9.859,-21.978 -21.978,-21.978 -0.827,0 -1.667,0.046 -2.496,0.138 -10.5,1.161 -18.86,10.013 -19.447,20.591 -0.354,6.379 2.066,12.583 6.64,17.02 3.604,3.496 5.671,8.31 5.671,13.208 v 2.824 c 0,2.999 2.439,5.438 5.438,5.438 h 8.387 c 2.999,0.002 5.439,-2.438 5.439,-5.437 z m -4,0 c 0,0.793 -0.646,1.438 -1.438,1.438 h -8.387 c -0.793,0 -1.438,-0.646 -1.438,-1.438 v -2.824 c 0,-5.973 -2.51,-11.833 -6.886,-16.079 -3.741,-3.629 -5.721,-8.706 -5.431,-13.927 0.48,-8.65 7.312,-15.888 15.893,-16.837 0.684,-0.076 1.376,-0.114 2.057,-0.114 9.913,0 17.978,8.064 17.978,17.978 0,4.965 -1.98,9.586 -5.576,13.013 -4.302,4.099 -6.77,9.919 -6.77,15.969 v 2.821 z"
id="path2"
inkscape:connector-curvature="0" /><path
d="M 56.509,82.521 H 43.491 c -0.829,0 -1.5,0.671 -1.5,1.5 0,0.829 0.671,1.5 1.5,1.5 H 56.51 c 0.829,0 1.5,-0.671 1.5,-1.5 0,-0.829 -0.672,-1.5 -1.501,-1.5 z"
id="path4"
inkscape:connector-curvature="0" /><path
d="m 58.009,88.761 c 0,-0.829 -0.671,-1.5 -1.5,-1.5 H 43.491 c -0.829,0 -1.5,0.671 -1.5,1.5 0,0.829 0.671,1.5 1.5,1.5 H 56.51 c 0.828,0 1.499,-0.672 1.499,-1.5 z"
id="path6"
inkscape:connector-curvature="0" /><path
d="m 14.733,43.267 h 8.643 c 0.829,0 1.5,-0.671 1.5,-1.5 0,-0.829 -0.671,-1.5 -1.5,-1.5 h -8.643 c -0.829,0 -1.5,0.671 -1.5,1.5 0,0.829 0.671,1.5 1.5,1.5 z"
id="path8"
inkscape:connector-curvature="0" /><path
d="m 86.767,41.767 c 0,-0.829 -0.671,-1.5 -1.5,-1.5 h -8.643 c -0.829,0 -1.5,0.671 -1.5,1.5 0,0.829 0.671,1.5 1.5,1.5 h 8.643 c 0.829,0 1.5,-0.671 1.5,-1.5 z"
id="path10"
inkscape:connector-curvature="0" /><path
d="m 48.5,6.5 v 8.643 c 0,0.829 0.671,1.5 1.5,1.5 0.829,0 1.5,-0.671 1.5,-1.5 V 6.5 C 51.5,5.671 50.829,5 50,5 49.171,5 48.5,5.671 48.5,6.5 Z"
id="path12"
inkscape:connector-curvature="0" /><path
d="m 73.877,15.769 -6.111,6.111 c -0.586,0.585 -0.586,1.536 0,2.121 0.293,0.293 0.677,0.439 1.061,0.439 0.384,0 0.768,-0.146 1.061,-0.439 l 6.111,-6.111 c 0.586,-0.585 0.586,-1.536 0,-2.121 -0.587,-0.586 -1.536,-0.586 -2.122,0 z"
id="path14"
inkscape:connector-curvature="0" /><path
d="m 32.234,59.533 c -0.586,-0.586 -1.535,-0.586 -2.121,0 l -6.111,6.111 c -0.586,0.585 -0.586,1.535 0,2.121 0.293,0.293 0.677,0.439 1.061,0.439 0.384,0 0.768,-0.146 1.061,-0.439 l 6.111,-6.111 c 0.585,-0.585 0.585,-1.535 -10e-4,-2.121 z"
id="path16"
inkscape:connector-curvature="0" /><path
d="m 30.113,24.001 c 0.293,0.293 0.677,0.439 1.061,0.439 0.384,0 0.768,-0.146 1.061,-0.439 0.586,-0.585 0.586,-1.536 0,-2.121 l -6.111,-6.111 c -0.586,-0.586 -1.535,-0.586 -2.121,0 -0.586,0.585 -0.586,1.536 0,2.121 z"
id="path18"
inkscape:connector-curvature="0" /><path
d="m 73.877,67.765 c 0.293,0.293 0.677,0.439 1.061,0.439 0.384,0 0.768,-0.146 1.061,-0.439 0.586,-0.586 0.586,-1.536 0,-2.121 l -6.111,-6.111 c -0.586,-0.586 -1.535,-0.586 -2.121,0 -0.586,0.586 -0.586,1.536 0,2.121 z"
id="path20"
inkscape:connector-curvature="0" /><path
d="m 54.754,93.5 c 0,-0.829 -0.671,-1.5 -1.5,-1.5 h -6.509 c -0.829,0 -1.5,0.671 -1.5,1.5 0,0.829 0.671,1.5 1.5,1.5 h 6.509 c 0.829,0 1.5,-0.671 1.5,-1.5 z"
id="path22"
inkscape:connector-curvature="0" /></g></svg>
Before: Size 4.9 KiB