Compare commits

...

50 Commits

| SHA1 | Message | Date |
| ---- | ------- | ---- |
| fdb77838b8 | Rewrite github.com to git.ipng.ch for popular repos | 2025-05-04 21:54:16 +02:00 |
| 6d3f4ac206 | Some readability changes | 2025-05-04 21:50:07 +02:00 |
| baa3e78045 | Update MTU to 9216 | 2025-05-04 20:15:24 +02:00 |
| 0972cf4aa1 | A few readability fixes | 2025-05-04 17:30:04 +02:00 |
| 4f81d377a0 | Article #2, Containerlab is up and running | 2025-05-04 17:11:58 +02:00 |
| 153048eda4 | Update git repo | 2025-05-04 17:11:58 +02:00 |
| 4aa5745d06 | Merge branch 'main' of git.ipng.ch:ipng/ipng.ch | 2025-05-04 08:50:32 +02:00 |
| 7d3f617966 | Add a note on MAC addresses and an af-packet trace to show end to end | 2025-05-03 18:45:20 +02:00 |
| 8918821413 | Add clab part 1 | 2025-05-03 18:15:39 +02:00 |
| 9783c7d39c | Correct the TH version for 7060CX | 2025-04-25 16:22:56 +02:00 |
| af68c1ec3b | One more typo fix | 2025-04-24 15:49:11 +02:00 |
| 0baadb5089 | Typo fixes, h/t Michael Bear | 2025-04-24 15:46:56 +02:00 |
| 3b7e576d20 | Typo and readability | 2025-04-10 01:16:19 -05:00 |
| d0a7cdbe38 | Rename linuxadmin to pim | 2025-04-10 00:04:36 -05:00 |
| ed087f3fc6 | Add configs | 2025-04-09 23:17:57 -05:00 |
| 51e6c0e1c2 | Add note on l2-vpn BUM suppression on Arista | 2025-04-09 23:03:33 -05:00 |
| 8a991bee47 | A few typo and readability fixes | 2025-04-09 22:57:24 -05:00 |
| d9e2f407e7 | Add article on SR Linux + Arista EVPN | 2025-04-09 22:25:51 -05:00 |
| 01820776af | Use release-0.145.1 as a re-build of 0.145.0 with the correct Hugo branch | 2025-03-29 20:44:59 -05:00 |
| d5d4f7ff55 | Bump Go/Hugo package | 2025-03-29 20:27:12 -05:00 |
| 2a61bdc028 | Update keys - add new Tammy laptop | 2025-03-29 14:13:01 -05:00 |
| c2b8eef4f4 | Remove m2air key - laptop retired | 2025-03-26 22:34:32 +00:00 |
| 533cca0108 | Readability edits | 2025-02-09 22:21:18 +01:00 |
| 4ac8c47127 | Some readability fixes | 2025-02-09 18:30:38 +01:00 |
| bcbb119b20 | Add article sflow-3 | 2025-02-09 17:51:05 +01:00 |
| ce6e6cde22 | Add ignoreLogs for raw HTML | 2025-02-09 17:50:13 +01:00 |
| 610835925b | Give rendered codeblock a bit more contrast | 2024-12-20 13:19:30 +01:00 |
| 16ac42bad9 | Create consistent title for both articles | 2024-10-21 19:50:50 +02:00 |
| 26397d69c6 | Readability pass, ready for publication | 2024-10-21 18:58:27 +02:00 |
| 388293baef | Add FreeIX #2 article | 2024-10-21 18:10:02 +02:00 |
| b2129702ae | remove unnecessary raw/endraw tags from Jekyll. h/t Luiz Amaral | 2024-10-13 15:55:47 +02:00 |
| ba068c1c52 | Some typo and readability fixes | 2024-10-06 19:11:42 +02:00 |
| 3c69130cea | Add sflow part 2 | 2024-10-06 18:07:22 +02:00 |
| 255d3905d7 | Fix table | 2024-09-26 17:00:09 +02:00 |
| 4cd42b9824 | Make the ntags cycle every week | 2024-09-26 14:37:52 +02:00 |
| f12247d278 | Remove unused 'paginate' variable | 2024-09-23 22:11:59 +02:00 |
| 36b422ce08 | Bump to Hugo 0.134.3 | 2024-09-23 22:09:29 +02:00 |
| 2e1bb69772 | A few typo fixes - h/t jeroen@ | 2024-09-12 15:40:37 +02:00 |
| ceb16714b6 | Fix links, h/t ChrisPL | 2024-09-11 08:19:45 +02:00 |
| 72b99b20c6 | Add silly security.txt | 2024-09-09 22:55:16 +02:00 |
| 4b5bd40fce | Add an idea, and another set of typo fixes | 2024-09-09 11:09:58 +02:00 |
| 1379c77181 | A few typo fixes and clarifications | 2024-09-09 11:00:22 +02:00 |
| 08d55e6ac0 | sFlow, part 1 | 2024-09-09 10:26:15 +02:00 |
| 3feb217aa8 | Add static resources: MTA-STS, Prefixes, SSH Keys | 2024-09-02 11:17:19 +02:00 |
| 2f63fc0ebb | Fix page-footer issue, by creating defaults for mq-mini or smaller | 2024-09-02 10:49:45 +02:00 |
| 4113615096 | Remove unnecessary semi-colon | 2024-08-21 00:45:36 +02:00 |
| 52cba49c90 | Add redirector javascript on /app/go/ | 2024-08-21 00:42:31 +02:00 |
| b5c0819bfa | Trailing slash on void elements has no effect and interacts badly with unquoted attribute values. | 2024-08-13 02:20:33 +02:00 |
| ea05b39ddf | Remove spurious whitespace in head section | 2024-08-13 02:15:02 +02:00 |
| 27ab370dc4 | Move to hugo.yaml config format | 2024-08-13 02:00:30 +02:00 |
71 changed files with 6576 additions and 116 deletions

View File

@ -8,9 +8,9 @@ steps:
- git lfs install
- git lfs pull
- name: build
image: git.ipng.ch/ipng/drone-hugo:release-0.130.0
image: git.ipng.ch/ipng/drone-hugo:release-0.145.1
settings:
hugo_version: 0.130.0
hugo_version: 0.145.0
extended: true
- name: rsync
image: drillster/drone-rsync

View File

@ -1,38 +0,0 @@
baseURL = 'https://ipng.ch/'
languageCode = 'en-us'
title = "IPng Networks"
theme = 'hugo-theme-ipng'
mainSections = ["articles"]
# disqusShortname = "example"
paginate = 4
[params]
author = "IPng Networks GmbH"
siteHeading = "IPng Networks"
favicon = "favicon.ico" # Adds a small icon next to the page title in a tab
showBlogLatest = false
mainSections = ["articles"]
showTaxonomyLinks = false
nBlogLatest = 14 # number of blog post om the home page
Paginate = 30
blogLatestHeading = "Latest Dabblings"
footer = "Copyright 2021- IPng Networks GmbH, all rights reserved"
[params.social]
email = "info+www@ipng.ch"
mastodon = "@IPngNetworks"
twitter = "IPngNetworks"
linkedin = "pimvanpelt"
github = "pimvanpelt"
instagram = "IPngNetworks"
rss = true
[taxonomies]
year = "year"
month = "month"
tags = "tags"
categories = "categories"
[permalinks]
articles = "/s/articles/:year/:month/:day/:slug"

View File

@ -89,7 +89,7 @@ lcp lcp-sync off
```
The prep work for the rest of the interface syncer starts with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.
### Change interface: state
@ -120,7 +120,7 @@ the state it was. I did notice that you can't bring up a sub-interface if its pa
is down, which I found counterintuitive, but that's neither here nor there.
All of this is to say that we have to be careful when copying state forward, because as
this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c15c84f6c4739860a85c599779c199cb9efef03)]
shows, issuing `set int state ... up` on an interface, won't touch its sub-interfaces in VPP, but
the subsequent netlink message to bring the _LIP_ for that interface up, **will** update the
children, thus desynchronising Linux and VPP: Linux will have interface **and all its
@ -128,7 +128,7 @@ sub-interfaces** up unconditionally; VPP will have the interface up and its sub-
whatever state they were before.
To address this, a second
[[commit](https://github.com/pimvanpelt/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a3dc56c01461bdffcac8193ead654ae79225220f)] was
needed. I'm not too sure I want to keep this behavior, but for now, it results in an intuitive
end-state, which is that all interfaces states are exactly the same between Linux and VPP.
@ -157,7 +157,7 @@ DBGvpp# set int state TenGigabitEthernet3/0/0 up
### Change interface: MTU
Finally, a straight forward
[[commit](https://github.com/pimvanpelt/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
[[commit](https://git.ipng.ch/ipng/lcpng/commit/39bfa1615fd1cafe5df6d8fc9d34528e8d3906e2)], or
so I thought. When the MTU changes in VPP (with `set interface mtu packet N <int>`), there is
callback that can be registered which copies this into the _LIP_. I did notice a specific corner
case: In VPP, a sub-interface can have a larger MTU than its parent. In Linux, this cannot happen,
@ -179,7 +179,7 @@ higher than that, perhaps logging an error explaining why. This means two things
1. Any change in VPP of a parent MTU should ensure all children are clamped to at most that.
I addressed the issue in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/79a395b3c9f0dae9a23e6fbf10c5f284b1facb85)].
### Change interface: IP Addresses
@ -199,7 +199,7 @@ VPP into the companion Linux devices:
_LIP_ with `lcp_itf_set_interface_addr()`.
This means with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
[[commit](https://git.ipng.ch/ipng/lcpng/commit/f7e1bb951d648a63dfa27d04ded0b6261b9e39fe)], at
any time a new _LIP_ is created, the IPv4 and IPv6 address on the VPP interface are fully copied
over by the third change, while at runtime, new addresses can be set/removed as well by the first
and second change.

View File

@ -100,7 +100,7 @@ linux-cp {
Based on this config, I set the startup default in `lcp_set_lcp_auto_subint()`, but I realize that
an administrator may want to turn it on/off at runtime, too, so I add a CLI getter/setter that
interacts with the flag in this [[commit](https://github.com/pimvanpelt/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:
interacts with the flag in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/d23aab2d95aabcf24efb9f7aecaf15b513633ab7)]:
```
DBGvpp# show lcp
@ -116,11 +116,11 @@ lcp lcp-sync off
```
The prep work for the rest of the interface syncer starts with this
[[commit](https://github.com/pimvanpelt/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
[[commit](https://git.ipng.ch/ipng/lcpng/commit/2d00de080bd26d80ce69441b1043de37e0326e0a)], and
for the rest of this blog post, the behavior will be in the 'on' position.
The code for the configuration toggle is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
### Auto create/delete sub-interfaces
@ -145,7 +145,7 @@ I noticed that interface deletion had a bug (one that I fell victim to as well:
remove the netlink device in the correct network namespace), which I fixed.
The code for the auto create/delete and the bugfix is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/934446dcd97f51c82ddf133ad45b61b3aae14b2d)].
### Further Work

View File

@ -154,7 +154,7 @@ For now, `lcp_nl_dispatch()` just throws the message away after logging it with
a function that will come in very useful as I start to explore all the different Netlink message types.
The code that forms the basis of our Netlink Listener lives in [[this
commit](https://github.com/pimvanpelt/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
commit](https://git.ipng.ch/ipng/lcpng/commit/c4e3043ea143d703915239b2390c55f7b6a9b0b1)] and
specifically, here I want to call out I was not the primary author, I worked off of Matt and Neale's
awesome work in this pending [Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122).
@ -182,7 +182,7 @@ Linux interface VPP is not aware of. But, if I can find the _LIP_, I can convert
add or remove the ip4/ip6 neighbor adjacency.
The code for this first Netlink message handler lives in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
[[commit](https://git.ipng.ch/ipng/lcpng/commit/30bab1d3f9ab06670fbef2c7c6a658e7b77f7738)]. An
ironic insight is that after writing the code, I don't think any of it will be necessary, because
the interface plugin will already copy ARP and IPv6 ND packets back and forth and itself update its
neighbor adjacency tables; but I'm leaving the code in for now.
@ -197,7 +197,7 @@ it or remove it, and if there are no link-local addresses left, disable IPv6 on
There's also a few multicast routes to add (notably 224.0.0.0/24 and ff00::/8, all-local-subnet).
The code for IP address handling is in this
[[commit]](https://github.com/pimvanpelt/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
[[commit]](https://git.ipng.ch/ipng/lcpng/commit/87742b4f541d389e745f0297d134e34f17b5b485), but
when I took it out for a spin, I noticed something curious, looking at the log lines that are
generated for the following sequence:
@ -236,7 +236,7 @@ interface and directly connected route addition/deletion is slightly different i
So, I decide to take a little shortcut -- if an addition returns "already there", or a deletion returns
"no such entry", I'll just consider it a successful addition and deletion respectively, saving my eyes
from being screamed at by this red error message. I changed that in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
[[commit](https://git.ipng.ch/ipng/lcpng/commit/d63fbd8a9a612d038aa385e79a57198785d409ca)],
turning this situation in a friendly green notice instead.
### Netlink: Link (existing)
@ -267,7 +267,7 @@ To avoid this loop, I temporarily turn off `lcp-sync` just before handling a bat
turn it back to its original state when I'm done with that.
The code for all/del of existing links is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)].
### Netlink: Link (new)
@ -276,7 +276,7 @@ doesn't have a _LIP_ for, but specifically describes a VLAN interface? Well, th
is trying to create a new sub-interface. And supporting that operation would be super cool, so let's go!
Using the earlier placeholder hint in `lcp_nl_link_add()` (see the previous
[[commit](https://github.com/pimvanpelt/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
[[commit](https://git.ipng.ch/ipng/lcpng/commit/e604dd34784e029b41a47baa3179296d15b0632e)]),
I know that I've gotten a NEWLINK request but the Linux ifindex doesn't have a _LIP_. This could be
because the interface is entirely foreign to VPP, for example somebody created a dummy interface or
a VLAN sub-interface on one:
@ -331,7 +331,7 @@ a boring `<phy>.<subid>` name.
Alright, without further ado, the code for the main innovation here, the implementation of
`lcp_nl_link_add_vlan()`, is in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/45f408865688eb7ea0cdbf23aa6f8a973be49d1a)].
## Results

View File

@ -118,7 +118,7 @@ or Virtual Routing/Forwarding domains). So first, I need to add these:
All of this code was heavily inspired by the pending [[Gerrit](https://gerrit.fd.io/r/c/vpp/+/31122)]
but a few finishing touches were added, and wrapped up in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].
[[commit](https://git.ipng.ch/ipng/lcpng/commit/7a76498277edc43beaa680e91e3a0c1787319106)].
### Deletion
@ -459,7 +459,7 @@ it as 'unreachable' rather than deleting it. These are *additions* which have a
but with an interface index of 1 (which, in Netlink, is 'lo'). This makes VPP intermittently crash, so I
currently commented this out, while I gain better understanding. Result: blackhole/unreachable/prohibit
specials can not be set using the plugin. Beware!
(disabled in this [[commit](https://github.com/pimvanpelt/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).
(disabled in this [[commit](https://git.ipng.ch/ipng/lcpng/commit/7c864ed099821f62c5be8cbe9ed3f4dd34000a42)]).
## Credits

View File

@ -88,7 +88,7 @@ stat['/if/rx-miss'][:, 1].sum() - returns the sum of packet counters for
```
Alright, so let's grab that file and refactor it into a small library for me to use, I do
this in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
this in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
### VPP's API
@ -159,7 +159,7 @@ idx=19 name=tap4 mac=02:fe:17:06:fc:af mtu=9000 flags=3
So I added a little abstration with some error handling and one main function
to return interfaces as a Python dictionary of those `sw_interface_details`
tuples in [[this commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
tuples in [[this commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/51eee915bf0f6267911da596b41a4475feaf212e)].
### AgentX
@ -207,9 +207,9 @@ once asked with `GetPDU` or `GetNextPDU` requests, by issuing a corresponding `R
to the SNMP server -- it takes care of all the rest!
The resulting code is in [[this
commit](https://github.com/pimvanpelt/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
commit](https://git.ipng.ch/ipng/vpp-snmp-agent/commit/8c9c1e2b4aa1d40a981f17581f92bba133dd2c29)]
but you can also check out the whole thing on
[[Github](https://github.com/pimvanpelt/vpp-snmp-agent)].
[[Github](https://git.ipng.ch/ipng/vpp-snmp-agent)].
### Building

View File

@ -480,7 +480,7 @@ is to say, those packets which were destined to any IP address configured on the
plane. Any traffic going _through_ VPP will never be seen by Linux! So, I'll have to be
clever and count this traffic by polling VPP instead. This was the topic of my previous
[VPP Part 6]({{< ref "2021-09-10-vpp-6" >}}) about the SNMP Agent. All of that code
was released to [Github](https://github.com/pimvanpelt/vpp-snmp-agent), notably there's
was released to [Github](https://git.ipng.ch/ipng/vpp-snmp-agent), notably there's
a hint there for an `snmpd-dataplane.service` and a `vpp-snmp-agent.service`, including
the compiled binary that reads from VPP and feeds this to SNMP.

View File

@ -62,7 +62,7 @@ plugins:
or route, or the system receiving ARP or IPv6 neighbor request/reply from neighbors), and applying
these events to the VPP dataplane.
I've published the code on [Github](https://github.com/pimvanpelt/lcpng/) and I am targeting a release
I've published the code on [Github](https://git.ipng.ch/ipng/lcpng/) and I am targeting a release
in upstream VPP, hoping to make the upcoming 22.02 release in February 2022. I have a lot of ground to
cover, but I will note that the plugin has been running in production in [AS8298]({{< ref "2021-02-27-network" >}})
since Sep'21 and no crashes related to LinuxCP have been observed.
@ -195,7 +195,7 @@ So grab a cup of tea, while we let Rhino stretch its legs, ehh, CPUs ...
pim@rhino:~$ mkdir -p ~/src
pim@rhino:~$ cd ~/src
pim@rhino:~/src$ sudo apt install libmnl-dev
pim@rhino:~/src$ git clone https://github.com/pimvanpelt/lcpng.git
pim@rhino:~/src$ git clone https://git.ipng.ch/ipng/lcpng.git
pim@rhino:~/src$ git clone https://gerrit.fd.io/r/vpp
pim@rhino:~/src$ ln -s ~/src/lcpng ~/src/vpp/src/plugins/lcpng
pim@rhino:~/src$ cd ~/src/vpp

View File

@ -33,7 +33,7 @@ In this first post, let's take a look at tablestakes: writing a YAML specificati
configuration elements of VPP, and then ensures that the YAML file is both syntactically as well as
semantically correct.
**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).
@ -348,7 +348,7 @@ to mess up my (or your!) VPP router by feeding it garbage, so the lions' share o
has been to assert the YAML file is both syntactically and semantically valid.
In the mean time, you can take a look at my code on [GitHub](https://github.com/pimvanpelt/vppcfg), but to
In the mean time, you can take a look at my code on [GitHub](https://git.ipng.ch/ipng/vppcfg), but to
whet your appetite, here's a hefty configuration that demonstrates all implemented types:
```

View File

@ -32,7 +32,7 @@ the configuration to the dataplane. Welcome to `vppcfg`!
In this second post of the series, I want to talk a little bit about how planning a path from a running
configuration to a desired new configuration might look like.
**Note**: Code is on [my Github](https://github.com/pimvanpelt/vppcfg), but it's not quite ready for
**Note**: Code is on [my Github](https://git.ipng.ch/ipng/vppcfg), but it's not quite ready for
prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves)
or reach out by [contacting us](/s/contact/).

View File

@ -275,7 +275,6 @@ that will point at an `unbound` running on `lab.ipng.ch` itself.
I can now create any file I'd like which may use variable substition and other jinja2 style templating. Take
for example these two files:
{% raw %}
```
pim@lab:~/src/lab$ cat overlays/bird/common/etc/netplan/01-netcfg.yaml.j2
network:
@ -292,13 +291,12 @@ network:
pim@lab:~/src/lab$ cat overlays/bird/common/etc/netns/dataplane/resolv.conf.j2
domain lab.ipng.ch
search{% for domain in lab.nameserver.search %} {{domain}}{%endfor %}
search{% for domain in lab.nameserver.search %} {{ domain }}{% endfor %}
{% for resolver in lab.nameserver.addresses %}
nameserver {{resolver}}
{%endfor%}
nameserver {{ resolver }}
{% endfor %}
```
{% endraw %}
The first file is a [[NetPlan.io](https://netplan.io/)] configuration that substitutes the correct management
IPv4 and IPv6 addresses and gateways. The second one enumerates a set of search domains and nameservers, so that

View File

@ -171,12 +171,12 @@ GigabitEthernet1/0/0 1 up GigabitEthernet1/0/0
After this exploratory exercise, I have learned enough about the hardware to be able to take the
Fitlet2 out for a spin. To configure the VPP instance, I turn to
[[vppcfg](https://github.com/pimvanpelt/vppcfg)], which can take a YAML configuration file
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)], which can take a YAML configuration file
describing the desired VPP configuration, and apply it safely to the running dataplane using the VPP
API. I've written a few more posts on how it does that, notably on its [[syntax]({{< ref "2022-03-27-vppcfg-1" >}})]
and its [[planner]({{< ref "2022-04-02-vppcfg-2" >}})]. A complete
configuration guide on vppcfg can be found
[[here](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md)].
[[here](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md)].
```
pim@fitlet:~$ sudo dpkg -i {lib,}vpp*23.06*deb

View File

@ -185,7 +185,7 @@ forgetful chipmunk-sized brain!), so here, I'll only recap what's already writte
**1. BUILD:** For the first step, the build is straight forward, and yields a VPP instance based on
`vpp-ext-deps_23.06-1` at version `23.06-rc0~71-g182d2b466`, which contains my
[[LCPng](https://github.com/pimvanpelt/lcpng.git)] plugin. I then copy the packages to the router.
[[LCPng](https://git.ipng.ch/ipng/lcpng.git)] plugin. I then copy the packages to the router.
The router has an E-2286G CPU @ 4.00GHz with 6 cores and 6 hyperthreads. There's a really handy tool
called `likwid-topology` that can show how the L1, L2 and L3 cache lines up with respect to CPU
cores. Here I learn that CPU (0+6) and (1+7) share L1 and L2 cache -- so I can conclude that 0-5 are
@ -351,7 +351,7 @@ in `vppcfg`:
* When I create the initial `--novpp` config, there's a bug in `vppcfg` where I incorrectly
reference a dataplane object which I haven't initialized (because with `--novpp` the tool
will not contact the dataplane at all. That one was easy to fix, which I did in [[this
commit](https://github.com/pimvanpelt/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).
commit](https://git.ipng.ch/ipng/vppcfg/commit/0a0413927a0be6ed3a292a8c336deab8b86f5eee)]).
After that small detour, I can now proceed to configure the dataplane by offering the resulting
VPP commands, like so:
@ -573,7 +573,7 @@ see is that which is destined to the controlplane (eg, to one of the IPv4 or IPv
multicast/broadcast groups that they are participating in), so things like tcpdump or SNMP won't
really work.
However, due to my [[vpp-snmp-agent](https://github.com/pimvanpelt/vpp-snmp-agent.git)], which is
However, due to my [[vpp-snmp-agent](https://git.ipng.ch/ipng/vpp-snmp-agent.git)], which is
feeding as an AgentX behind an snmpd that in turn is running in the `dataplane` namespace, SNMP scrapes
work as they did before, albeit with a few different interface names.

View File

@ -14,7 +14,7 @@ performance and versatility. For those of us who have used Cisco IOS/XR devices,
_ASR_ (aggregation service router), VPP will look and feel quite familiar as many of the approaches
are shared between the two.
I've been working on the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)], which you
I've been working on the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)], which you
can read all about in my series on VPP back in 2021:
[![DENOG14](/assets/vpp-stats/denog14-thumbnail.png){: style="width:300px; float: right; margin-left: 1em;"}](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)
@ -70,7 +70,7 @@ answered by a Response PDU.
Using parts of a Python Agentx library written by GitHub user hosthvo
[[ref](https://github.com/hosthvo/pyagentx)], I tried my hands at writing one of these AgentX's.
The resulting source code is on [[GitHub](https://github.com/pimvanpelt/vpp-snmp-agent)]. That's the
The resulting source code is on [[GitHub](https://git.ipng.ch/ipng/vpp-snmp-agent)]. That's the
one that's running in production ever since I started running VPP routers at IPng Networks AS8298.
After the _AgentX_ exposes the dataplane interfaces and their statistics into _SNMP_, an open source
monitoring tool such as LibreNMS [[ref](https://librenms.org/)] can discover the routers and draw
@ -126,7 +126,7 @@ for any interface created in the dataplane.
I wish I were good at Go, but I never really took to the language. I'm pretty good at Python, but
sorting through the stats segment isn't super quick as I've already noticed in the Python3 based
[[VPP SNMP Agent](https://github.com/pimvanpelt/vpp-snmp-agent)]. I'm probably the world's least
[[VPP SNMP Agent](https://git.ipng.ch/ipng/vpp-snmp-agent)]. I'm probably the world's least
terrible C programmer, so maybe I can take a look at the VPP Stats Client and make sense of it. Luckily,
there's an example already in `src/vpp/app/vpp_get_stats.c` and it reveals the following pattern:

View File

@ -19,7 +19,7 @@ same time keep an IPng Site Local network with IPv4 and IPv6 that is separate fr
based on hardware/silicon based forwarding at line rate and high availability. You can read all
about my Centec MPLS shenanigans in [[this article]({{< ref "2023-03-11-mpls-core" >}})].
Ever since the release of the Linux Control Plane [[ref](https://github.com/pimvanpelt/lcpng)]
Ever since the release of the Linux Control Plane [[ref](https://git.ipng.ch/ipng/lcpng)]
plugin in VPP, folks have asked "What about MPLS?" -- I have never really felt the need to go this
rabbit hole, because I figured that in this day and age, higher level IP protocols that do tunneling
are just as performant, and a little bit less of an 'art' to get right. For example, the Centec

View File

@ -459,6 +459,6 @@ and VPP, and the overall implementation before attempting to use in production.
we got at least some of this right, but testing and runtime experience will tell.
I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!

View File

@ -385,5 +385,5 @@ and VPP, and the overall implementation before attempting to use in production.
we got at least some of this right, but testing and runtime experience will tell.
I will be silently porting the change into my own copy of the Linux Controlplane called lcpng on
[[GitHub](https://github.com/pimvanpelt/lcpng.git)]. If you'd like to test this - reach out to the VPP
[[GitHub](https://git.ipng.ch/ipng/lcpng.git)]. If you'd like to test this - reach out to the VPP
Developer [[mailinglist](mailto:vpp-dev@lists.fd.io)] any time!

View File

@ -304,7 +304,7 @@ Gateway, just to show a few of the more advanced features of VPP. For me, this t
line of thinking: classifiers. This extract/match/act pattern can be used in policers, ACLs and
arbitrary traffic redirection through VPP's directed graph (eg. selecting a next node for
processing). I'm going to deep-dive into this classifier behavior in an upcoming article, and see
how I might add this to [[vppcfg](https://github.com/pimvanpelt/vppcfg.git)], because I think it
how I might add this to [[vppcfg](https://git.ipng.ch/ipng/vppcfg.git)], because I think it
would be super powerful to abstract away the rather complex underlying API into something a little
bit more ... user friendly. Stay tuned! :)

View File

@ -359,7 +359,7 @@ does not have an IPv4 address. Except -- I'm bending the rules a little bit by d
There's an internal function `ip4_sw_interface_enable_disable()` which is called to enable IPv4
processing on an interface once the first IPv4 address is added. So my first fix is to force this to
be enabled for any interface that is exposed via Linux Control Plane, notably in `lcp_itf_pair_create()`
[[here](https://github.com/pimvanpelt/lcpng/blob/main/lcpng_interface.c#L777)].
[[here](https://git.ipng.ch/ipng/lcpng/blob/main/lcpng_interface.c#L777)].
This approach is partially effective:
@ -500,7 +500,7 @@ which is unnumbered. Because I don't know for sure if everybody would find this
I make sure to guard the behavior behind a backwards compatible configuration option.
If you're curious, please take a look at the change in my [[GitHub
repo](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
repo](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
which I:
1. add a new configuration option, `lcp-sync-unnumbered`, which defaults to `on`. That would be
what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.

View File

@ -147,7 +147,7 @@ With all of that, I am ready to demonstrate two working solutions now. I first c
Ondrej's [[commit](https://gitlab.nic.cz/labs/bird/-/commit/280daed57d061eb1ebc89013637c683fe23465e8)].
Then, I compile VPP with my pending [[gerrit](https://gerrit.fd.io/r/c/vpp/+/40482)]. Finally,
to demonstrate how `update_loopback_addr()` might work, I compile `lcpng` with my previous
[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)],
which allows me to inhibit copying forward addresses from VPP to Linux, when using _unnumbered_
interfaces.

View File

@ -1,8 +1,9 @@
---
date: "2024-04-27T10:52:11Z"
title: FreeIX - Remote
title: "FreeIX Remote - Part 1"
aliases:
- /s/articles/2024/04/27/freeix-1.html
- /s/articles/2024/04/27/freeix-remote/
---
# Introduction
@ -91,7 +92,7 @@ their traffic to these remote internet exchanges.
There are two types of BGP neighbor adjacency:
1. ***Members***: these are {ip-address,AS}-tuples which FreeIX has explicitly configured. Learned prefixes are added
to as-set AS50869:AS-MEMBERS. Members receive _all_ prefixes from FreeIX, each annotated with BGP **informational**
to as-set AS50869:AS-MEMBERS. Members receive _some or all_ prefixes from FreeIX, each annotated with BGP **informational**
communities, and members can drive certain behavior with BGP **action** communities.
1. ***Peers***: these are all other entities with whom FreeIX has an adjacency at public internet exchanges or private
@ -195,12 +196,12 @@ network interconnects:
* `(50869,3020,1)`: Inhibit Action (30XX), Country (3020), Switzerland (1)
* `(50869,3030,1308)`: Inhibit Action (30XX), IXP (3030), PeeringDB IXP for LS-IX (1308)
Further actions can be placed on a per-remote-neighbor basis:
Four actions can be placed on a per-remote-asn basis:
* `(50869,3040,13030)`: Inhibit Action (30XX), AS (3040), Init7 (AS13030)
* `(50869,3041,6939)`: Prepend Action (30XX), Prepend Once (3041), Hurricane Electric (AS6939)
* `(50869,3042,12859)`: Prepend Action (30XX), Prepend Twice (3042), BIT BV (AS12859)
* `(50869,3043,8283)`: Prepend Action (30XX), Prepend Three Times (3043), Coloclue (AS8283)
* `(50869,3100,6939)`: Prepend Once Action (3100), Hurricane Electric (AS6939)
* `(50869,3200,12859)`: Prepend Twice Action (3200), BIT BV (AS12859)
* `(50869,3300,8283)`: Prepend Thrice Action (3300), Coloclue (AS8283)
Peers cannot set these actions, as all action communities will be stripped on ingress. Members can set these action
communities on their sessions with FreeIX routers, however in some cases they may also be set by FreeIX operators when

View File

@ -101,6 +101,7 @@ IPv6 network and access the internet via a shared IPv6 address.
I will assign a pool of four public IPv4 addresses and eight IPv6 addresses to each border gateway:
| **Machine** | **IPv4 pool** | **IPv6 pool** |
| ----------- | ------------- | ------------- |
| border0.chbtl0.net.ipng.ch | <span style='color:green;'>194.126.235.0/30</span> | <span style='color:blue;'>2001:678:d78::3:0:0/125</span> |
| border0.chrma0.net.ipng.ch | <span style='color:green;'>194.126.235.4/30</span> | <span style='color:blue;'>2001:678:d78::3:1:0/125</span> |
| border0.chplo0.net.ipng.ch | <span style='color:green;'>194.126.235.8/30</span> | <span style='color:blue;'>2001:678:d78::3:2:0/125</span> |

View File

@ -250,10 +250,10 @@ remove the IPv4 and IPv6 addresses from the <span style='color:red;font-weight:b
routers in Br&uuml;ttisellen. They are directly connected, and if anything goes wrong, I can walk
over and rescue them. Sounds like a safe way to start!
I quickly add the ability for [[vppcfg](https://github.com/pimvanpelt/vppcfg)] to configure
I quickly add the ability for [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] to configure
_unnumbered_ interfaces. In VPP, these are interfaces that don't have an IPv4 or IPv6 address of
their own, but they borrow one from another interface. If you're curious, you can take a look at the
[[User Guide](https://github.com/pimvanpelt/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
[[User Guide](https://git.ipng.ch/ipng/vppcfg/blob/main/docs/config-guide.md#interfaces)] on
GitHub.
Looking at their `vppcfg` files, the change is actually very easy, taking as an example the
@ -291,7 +291,7 @@ interface.
In the article, you'll see that discussed as _Solution 2_, and it includes a bit of rationale why I
find this better. I implemented it in this
[[commit](https://github.com/pimvanpelt/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
[[commit](https://git.ipng.ch/ipng/lcpng/commit/a960d64a87849d312b32d9432ffb722672c14878)], in
case you're curious, and the commandline keyword is `lcp lcp-sync-unnumbered off` (the default is
_on_).

View File

@ -0,0 +1,725 @@
---
date: "2024-09-08T12:51:23Z"
title: 'VPP with sFlow - Part 1'
---
# Introduction
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
In January of 2023, an uncomfortably long time ago at this point, an acquaintance of mine called
Ciprian reached out to me after seeing my [[DENOG
#14](https://video.ipng.ch/w/erc9sAofrSZ22qjPwmv6H4)] presentation. He was interested to learn about
IPFIX and was asking if sFlow would be an option. At the time, there was a plugin in VPP called
[[flowprobe](https://s3-docs.fd.io/vpp/24.10/cli-reference/clis/clicmd_src_plugins_flowprobe.html)]
which is able to emit IPFIX records. Unfortunately, I never really got it to work well in my tests,
as either the records were corrupted, sub-interfaces didn't work, or the plugin would just crash the
dataplane entirely. In the meantime, the folks at [[Netgate](https://netgate.com/)] submitted quite
a few fixes to flowprobe, but it remains a computationally expensive operation. Wouldn't copying
one in a thousand or ten thousand packet headers with flow _sampling_ be just as good?
In the months that followed, I discussed the feature with the incredible folks at
[[inMon](https://inmon.com/)], the original designers and maintainers of the sFlow protocol and
toolkit. Neil from inMon wrote a prototype and put it on [[GitHub](https://github.com/sflow/vpp)]
but for lack of time I didn't manage to get it to work, which was largely my fault by the way.
However, I have a bit of time on my hands in September and October, and just a few weeks ago,
my buddy Pavel from [[FastNetMon](https://fastnetmon.com/)] pinged that very dormant thread about
sFlow being a potentially useful tool for anti-DDoS protection using VPP. And I very much agree!
## sFlow: Protocol
Maintenance of the protocol is performed by the [[sFlow.org](https://sflow.org/)] consortium, the
authoritative source of the sFlow protocol specifications. The current version of sFlow is v5.
sFlow, short for _sampled Flow_, works at the ethernet layer of the stack, where it inspects one in
N datagrams (typically 1:1000 or 1:10000) going through the physical network interfaces of a device.
On the device, an **sFlow Agent** does the sampling. For each sample the Agent takes, the first M
bytes (typically 128) are copied into an sFlow Datagram. Sampling metadata is added, such as
the ingress (or egress) interface and sampling process parameters. The Agent can then optionally add
forwarding information (such as router source- and destination prefix, MPLS LSP information, BGP
communities, and what-not). Finally, the Agent will periodically read the octet and packet counters of
physical network interface(s). Ultimately, the Agent will send the samples and additional
information over the network as a UDP datagram, to an **sFlow Collector** for further processing.
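Since the export leg is just UDP, a quick sanity check on whichever machine will act as the Collector is to watch for datagrams arriving on the sFlow default port, 6343. Something like this should do (hostname and interface are only examples):
```
pim@collector:~$ sudo tcpdump -ni eth0 udp port 6343
```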
sFlow has been specifically designed to take advantage of the statistical properties of packet
sampling and can be modeled using statistical sampling theory. This means that the sFlow traffic
monitoring system will always produce statistically quantifiable measurements. You can read more
about it in Peter Phaal and Sonia Panchen's
[[paper](https://sflow.org/packetSamplingBasics/index.htm)], I certainly did and my head spun a
little bit at the math :)
### sFlow: Netlink PSAMPLE
sFlow is meant to be a very _lightweight_ operation for the sampling equipment. It can typically be
done in hardware, but there also exist several software implementations. One very clever thing, I
think, is decoupling the sampler from the rest of the Agent. The Linux kernel has a packet sampling
API called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)], which
allows _producers_ to send samples to a certain _group_, and then allows _consumers_ to subscribe to
samples of a certain _group_. The PSAMPLE API uses
[[NetLink](https://docs.kernel.org/userspace-api/netlink/intro.html)] under the covers. The cool
thing, for me anyway, is that I have a little bit of experience with Netlink due to my work on VPP's
[[Linux Control Plane]({{< ref 2021-08-25-vpp-4 >}})] plugin.
The idea here is that some **sFlow Agent**, notably a VPP plugin, will be taking periodic samples
from the physical network interfaces, and producing Netlink messages. Then, some other program,
notably outside of VPP, can consume these messages and further handle them, creating UDP packets
with sFlow samples and counters and other information, and sending them to an **sFlow Collector**
somewhere else on the network.
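One way to see that the PSAMPLE machinery is present on a given machine is to look for its generic netlink family, assuming iproute2's `genl` tool is installed (the family only shows up once the kernel module is loaded, more on that below):
```
pim@vpp0-2:~$ genl ctrl list | grep -B1 -A3 psample
```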
{{< image width="100px" float="left" src="/assets/shared/brain.png" alt="Warning" >}}
There's a handy utility called [[psampletest](https://github.com/sflow/psampletest)] which can
subscribe to these PSAMPLE netlink groups and retrieve the samples. The first time I used all of
this stuff, I wasn't aware of this utility and I kept on getting errors. It turns out there's a
kernel module that needs to be loaded (`modprobe psample`), and `psampletest` helpfully does that for
you [[ref](https://github.com/sflow/psampletest/blob/main/psampletest.c#L799)], so just make sure
the module is loaded and added to `/etc/modules` before you spend as many hours as I did pulling out
hair.
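For reference, loading the module by hand and making it stick across reboots looks something like this (Debian-style `/etc/modules` shown; other distributions differ):
```
pim@vpp0-2:~$ sudo modprobe psample
pim@vpp0-2:~$ echo psample | sudo tee -a /etc/modules
pim@vpp0-2:~$ lsmod | grep psample
```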
## VPP: sFlow Plugin
For the purposes of my initial testing, I'll simply take a look at Neil's prototype on
[[GitHub](https://github.com/sflow/vpp)] and see what I learn in terms of functionality and
performance.
### sFlow Plugin: Anatomy
The design is purposefully minimal, to do all of the heavy lifting outside of the VPP dataplane. The
plugin will create a new VPP _graph node_ called `sflow`, which the operator can insert after
`device-input`. In other words, if enabled, the plugin will get a copy of all packets that are read
from an input provider, such as `dpdk-input` or `rdma-input`. The plugin's job is to process the
packet, and if it's not selected for sampling, just move it onwards to the next node, typically
`ethernet-input`. Almost all of the interesting action is in `node.c`.
The kicker is that one in N packets will be selected for sampling, after which:
1. the ethernet header (`*en`) is extracted from the packet
1. the input interface (`hw_if_index`) is extracted from the VPP buffer. Remember, sFlow works
with physical network interfaces!
1. if there are too many samples from this worker thread being worked on, it is discarded and an
error counter is incremented. This protects the main thread from being slammed with samples if
there are simply too many being fished out of the dataplane.
1. Otherwise:
* a new `sflow_sample_t` is created, with all the sampling process metadata filled in
* the first 128 bytes of the packet are copied into the sample
* an RPC is dispatched to the main thread, which will send the sample to the PSAMPLE channel
Both a debug CLI command and API call are added:
```
sflow enable-disable <interface-name> [<sampling_N>]|[disable]
```
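So, as an example, sampling one in a thousand packets on an interface, and turning it off again later, would look something like this:
```
vpp# sflow enable-disable GigabitEthernet10/0/0 1000
vpp# sflow enable-disable GigabitEthernet10/0/0 disable
```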
Some observations:
First off, the sampling_N in Neil's demo is a global rather than a per-interface setting. It would
make sense to make this per-interface, as routers typically have a mixture of 1G/10G and faster
100G network cards available. It was a surprise when I set one interface to 1:1000 and the other to
1:10000 and then saw the first interface change its sampling rate also. It's a small thing, and
will not be an issue to change.
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
Secondly, sending the RPC to main uses `vl_api_rpc_call_main_thread()`, which
requires a _spinlock_ in `src/vlibmemory/memclnt_api.c:649`. I'm somewhat worried that when many
samples are sent from many threads, there will be lock contention and performance will suffer.
### sFlow Plugin: Functional
I boot up the [[IPng Lab]({{< ref 2022-10-14-lab-1 >}})] and install a bunch of sFlow tools on it,
and make sure the `psample` kernel module is loaded. In this first test, I'll take a look at
table stakes. I compile VPP with the sFlow plugin, and enable that plugin in `startup.conf` on each
of the four VPP routers. For reference, the Lab looks like this:
{{< image src="/assets/vpp-mpls/LAB v2.svg" alt="Lab Setup" >}}
What I'll do is start an `iperf3` server on `vpp0-3` and then hit it from `vpp0-0`, to generate
a few TCP traffic streams back and forth, which will be traversing `vpp0-2` and `vpp0-1`, like so:
```
pim@vpp0-3:~ $ iperf3 -s -D
pim@vpp0-0:~ $ iperf3 -c vpp0-3.lab.ipng.ch -t 86400 -P 10 -b 10M
```
### Configuring VPP for sFlow
While this `iperf3` is running, I'll log on to `vpp0-2` to take a closer look. The first thing I do
is turn on packet sampling on `vpp0-2`'s interface that points at `vpp0-3`, which is `Gi10/0/1`, and
the interface that points at `vpp0-0`, which is `Gi10/0/0`. That's easy enough, and I will use a
sampling rate of 1:1000 as these interfaces are GigabitEthernet:
```
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/0 1000
root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/1 1000
root@vpp0-2:~# vppctl show run | egrep '(Name|sflow)'
Name State Calls Vectors Suspends Clocks Vectors/Call
sflow active 5656 24168 0 9.01e2 4.27
```
Nice! VPP inserted the `sflow` node between `dpdk-input` and `ethernet-input` where it can do its
business. But is it sending data? To answer this question, I can first take a look at the
`psampletest` tool:
```
root@vpp0-2:~# psampletest
pstest: modprobe psample returned 0
pstest: netlink socket number = 1637
pstest: getFamily
pstest: generic netlink CMD = 1
pstest: generic family name: psample
pstest: generic family id: 32
pstest: psample attr type: 4 (nested=0) len: 8
pstest: psample attr type: 5 (nested=0) len: 8
pstest: psample attr type: 6 (nested=0) len: 24
pstest: psample multicast group id: 9
pstest: psample multicast group: config
pstest: psample multicast group id: 10
pstest: psample multicast group: packets
pstest: psample found group packets=10
pstest: joinGroup 10
pstest: received Netlink ACK
pstest: joinGroup 10
pstest: set headers...
pstest: serialize...
pstest: print before sending...
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=7 out=9 n=1000 seq=1 pktlen=1514 hdrlen=31 pkt=0x558c08ba4958 q=3 depth=33333333 delay=123456
pstest: send...
pstest: send_psample getuid=0 geteuid=0
pstest: sendmsg returned 140
pstest: free...
pstest: start read loop...
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600320 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600321 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600322 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=2 out=0 n=1000 seq=600423 pktlen=66 hdrlen=70 pkt=0x7ffdb0d5a1e8 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=1 out=0 n=1000 seq=600324 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0
```
I am amazed! The `psampletest` output shows a few packets. Considering I'm asking `iperf3` to push
100Mbit using 9000 byte jumboframes (which would be something like 1400 packets/second), I can
expect two or three samples per second. I immediately notice a few things:
***1. Network Namespace***: The Netlink sampling channel belongs to a network _namespace_. The VPP
process is running in the _default_ netns, so its PSAMPLE netlink messages will be in that namespace.
Thus, the `psampletest` and other tools must also run in that namespace. I mention this because in
Linux CP, oftentimes the controlplane interfaces are created in a dedicated `dataplane` network
namespace.
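If the sampler did live in another netns, the remedy is simply to run the consumer in that same namespace, for example (assuming a namespace called `dataplane`):
```
pim@vpp0-2:~$ sudo ip netns exec dataplane psampletest
```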
***2. pktlen and hdrlen***: The pktlen is wrong, and this is a bug. In VPP, packets are put into
buffers of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for
the same packet. The packet length here ought to be 9000 in one direction. Looking at the `in=2`
packet with length 66, that looks like a legitimate ACK packet on the way back. But why is the
hdrlen set to 70 there? I'm going to want to ask Neil about that.
***3. ingress and egress***: The `in=1` entries and the one packet with `in=2` represent the input `hw_if_index`,
which is the ifIndex that VPP assigns to its devices. And looking at `show interfaces`, indeed
number 1 corresponds with `GigabitEthernet10/0/0` and 2 is `GigabitEthernet10/0/1`, which checks
out:
```
root@vpp0-2:~# vppctl show int
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
GigabitEthernet10/0/0 1 up 9000/0/0/0 rx packets 469552764
rx bytes 4218754400233
tx packets 133717230
tx bytes 8887341013
drops 6050
ip4 469321635
ip6 225164
GigabitEthernet10/0/1 2 up 9000/0/0/0 rx packets 133527636
rx bytes 8816920909
tx packets 469353481
tx bytes 4218736200819
drops 6060
ip4 133489925
ip6 29139
```
***4. ifIndexes are orthogonal***: These `in=1` or `in=2` ifIndex numbers are constructs of the VPP
dataplane. Notably, VPP's numbering of interface index is strictly _orthogonal_ to Linux, and it's
not guaranteed that there even _exists_ an interface in Linux for the PHY upon which the sampling is
happening. Said differently, `in=1` here is meant to reference VPP's `GigabitEthernet10/0/0`
interface, but in Linux, `ifIndex=1` is a completely different interface (`lo`) in the default
network namespace. Similarly `in=2` for VPP's `Gi10/0/1` interface corresponds to interface `enp1s0`
in Linux:
```
root@vpp0-2:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:f0:01:20 brd ff:ff:ff:ff:ff:ff
```
***5. Counters***: sFlow periodically polls the interface counters for all interfaces. It will
normally use `/proc/net/` entries for that, but there are two problems with this:
1. There may not exist a Linux representation of the interface, for example if it's only doing L2
bridging or cross connects in the VPP dataplane, and it does not have a Linux Control Plane
interface, or `linux-cp` is not used at all.
1. Even if it does exist and it's the "correct" ifIndex in Linux, for example if the _Linux
Interface Pair_'s tuntap `host_vif_index` index is used, even then the statistics counters in the
Linux representation will only count packets and octets of _punted_ packets, that is to say, the
stuff that LinuxCP has decided needs to go to the Linux kernel through the TUN/TAP device. It's important
to note that east-west traffic that goes _through_ the dataplane is never punted to Linux, and as
such, the counters will be undershooting: only counting traffic _to_ the router, not _through_ the
router.
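The takeaway is that an sFlow Agent for VPP should read the dataplane's own counters rather than the Linux ones. Those are always available from VPP itself, for example via the CLI or the stats segment (default stats socket assumed):
```
pim@vpp0-2:~$ vppctl show interface
pim@vpp0-2:~$ vpp_get_stats dump /if/rx | head
```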
### VPP sFlow: Performance
Now that I've shown that Neil's proof of concept works, I will take a better look at the performance
of the plugin. I've made a mental note that the plugin sends RPCs from worker threads to the main
thread to marshall the PSAMPLE messages out. I'd like to see how expensive that is, in general. So,
I boot two Dell R730 machines in IPng's Lab and put them to work. The first machine will run
Cisco's T-Rex loadtester with 8x 10Gbps ports (4x dual Intel 58299), while the second (identical)
machine will run VPP, also with 8x 10Gbps ports (2x Intel i710-DA4).
I will test a bunch of things in parallel. First off, I'll test L2 (xconnect) and L3 (IPv4 routing),
and secondly I'll test that with and without sFlow turned on. This gives me 8 ports to configure,
and I'll start with the L2 configuration, as follows:
```
vpp# set int state TenGigabitEthernet3/0/2 up
vpp# set int state TenGigabitEthernet3/0/3 up
vpp# set int state TenGigabitEthernet130/0/2 up
vpp# set int state TenGigabitEthernet130/0/3 up
vpp# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
vpp# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
vpp# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
vpp# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
```
Then, the L3 configuration looks like this:
```
vpp# lcp create TenGigabitEthernet3/0/0 host-if xe0-0
vpp# lcp create TenGigabitEthernet3/0/1 host-if xe0-1
vpp# lcp create TenGigabitEthernet130/0/0 host-if xe1-0
vpp# lcp create TenGigabitEthernet130/0/1 host-if xe1-1
vpp# set int state TenGigabitEthernet3/0/0 up
vpp# set int state TenGigabitEthernet3/0/1 up
vpp# set int state TenGigabitEthernet130/0/0 up
vpp# set int state TenGigabitEthernet130/0/1 up
vpp# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
vpp# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
vpp# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
vpp# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
vpp# ip route add 16.0.0.0/24 via 100.64.0.0
vpp# ip route add 48.0.0.0/24 via 100.64.1.0
vpp# ip route add 16.0.2.0/24 via 100.64.4.0
vpp# ip route add 48.0.2.0/24 via 100.64.5.0
vpp# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
vpp# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
vpp# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
vpp# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
```
And finally, the Cisco T-Rex configuration looks like this:
```
- version: 2
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
port_info:
- src_mac: 00:1b:21:06:00:00
dest_mac: 9c:69:b4:61:a1:dc
- src_mac: 00:1b:21:06:00:01
dest_mac: 9c:69:b4:61:a1:dd
- src_mac: 00:1b:21:83:00:00
dest_mac: 00:1b:21:83:00:01
- src_mac: 00:1b:21:83:00:01
dest_mac: 00:1b:21:83:00:00
- src_mac: 00:1b:21:87:00:00
dest_mac: 9c:69:b4:61:75:d0
- src_mac: 00:1b:21:87:00:01
dest_mac: 9c:69:b4:61:75:d1
- src_mac: 9c:69:b4:85:00:00
dest_mac: 9c:69:b4:85:00:01
- src_mac: 9c:69:b4:85:00:01
dest_mac: 9c:69:b4:85:00:00
```
A little note on the use of `ip neighbor` in VPP and specific `dest_mac` in T-Rex. In L2 mode,
because the VPP interfaces will be in promiscuous mode and simply pass through any ethernet frame
received on interface `Te3/0/2` and copy it out on `Te3/0/3` and vice-versa, there is no need to
tinker with MAC addresses. But in L3 mode, the NIC will only accept ethernet frames addressed to its
MAC address, so you can see that for the first port in T-Rex, I am setting `dest_mac:
9c:69:b4:61:a1:dc` which is the MAC address of `Te3/0/0` on VPP. And then on the way out, if VPP
wants to send traffic back to T-Rex, I'll give it a static ARP entry with `ip neighbor .. static`.
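Before starting the loadtest, it doesn't hurt to double-check that those static neighbors actually made it into the dataplane, something like:
```
vpp# show ip neighbors
```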
With that said, I can start a baseline loadtest like so:
{{< image width="100%" src="/assets/sflow/trex-baseline.png" alt="Cisco T-Rex: baseline" >}}
T-Rex is sending 10Gbps out on all eight interfaces (four of which are L3 routing, and four of which
are L2 xconnecting), using a packet size of 1514 bytes. This amounts to roughly 813Kpps per port, or a
cool 6.51Mpps in total. And I can see that, in this baseline configuration, the VPP router is happy to
do the work.
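As an aside, the 813Kpps figure follows directly from the packet size. Here's a quick
back-of-the-envelope in Python, assuming the 1514 bytes exclude the 4-byte FCS and counting the 20
bytes of preamble and inter-frame gap that each frame occupies on the wire:
```
# 10GbE line-rate calculator: how many packets/sec fit at a given size?
# Assumption: the 1514-byte T-Rex packet size excludes the 4-byte FCS;
# each frame additionally occupies 8B preamble + 12B inter-frame gap.
LINE_RATE_BPS = 10e9
FCS, PREAMBLE, IFG = 4, 8, 12

def pps(size_bytes):
    wire_bytes = size_bytes + FCS + PREAMBLE + IFG
    return LINE_RATE_BPS / (wire_bytes * 8)

per_port = pps(1514)
print(f"{per_port / 1e3:.0f} Kpps per port")        # ~813 Kpps
print(f"{8 * per_port / 1e6:.2f} Mpps on 8 ports")  # ~6.5 Mpps
```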
I then enable sFlow on the second set of four ports, using a 1:1000 sampling rate:
```
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000
```
This should yield about 3'250 samples per second, and things look pretty great:
```
pim@hvn6-lab:~$ vppctl show err
Count Node Reason Severity
5034508 sflow sflow packets processed error
4908 sflow sflow packets sampled error
5034508 sflow sflow packets processed error
5111 sflow sflow packets sampled error
5034516 l2-output L2 output packets error
5034516 l2-input L2 input packets error
5034404 sflow sflow packets processed error
4948 sflow sflow packets sampled error
5034404 l2-output L2 output packets error
5034404 l2-input L2 input packets error
5034404 sflow sflow packets processed error
4928 sflow sflow packets sampled error
5034404 l2-output L2 output packets error
5034404 l2-input L2 input packets error
5034516 l2-output L2 output packets error
5034516 l2-input L2 input packets error
```
I can see that `sflow packets sampled` is roughly 0.1% of `sflow packets processed`, which checks
out. I can also see in `psampletest` a flurry of activity, so I'm happy:
```
pim@hvn6-lab:~$ sudo psampletest
...
pstest: grp=1 in=9 out=0 n=1000 seq=63388 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=8 out=0 n=1000 seq=63389 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=11 out=0 n=1000 seq=63390 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=10 out=0 n=1000 seq=63391 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
pstest: psample netlink (type=32) CMD = 0
pstest: grp=1 in=11 out=0 n=1000 seq=63392 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0
```
I confirm that all four `in` interfaces (8, 9, 10 and 11) are sending samples, and those indexes
correctly correspond to the VPP dataplane's `sw_if_index` for `TenGig130/0/0 - 3`. Sweet! On this
machine, each TenGig network interface has its own dedicated VPP worker thread. Considering I
turned on sFlow sampling on four interfaces, I should see the cost I'm paying for the feature:
```
pim@hvn6-lab:~$ vppctl show run | grep -E 'Name|sflow'
Name State Calls Vectors Suspends Clocks Vectors/Call
sflow active 3908218 14350684 0 9.05e1 3.67
sflow active 3913266 14350680 0 1.11e2 3.67
sflow active 3910828 14350687 0 1.08e2 3.67
sflow active 3909274 14350692 0 5.66e1 3.67
```
Alright, so for the 999 packets that went through and the one packet that got sampled, on average
VPP is spending between 90 and 111 CPU cycles per packet, and the loadtest looks squeaky clean on
T-Rex.
### VPP sFlow: Cost of passthru
I decide to take a look at two edge cases. What if there are no samples being taken at all, and the
`sflow` node is merely passing through all packets to `ethernet-input`? To simulate this, I will set
up a bizarrely high sampling rate, say one in ten million. I'll also make the T-Rex loadtester use
only four ports, in other words, a unidirectional loadtest, and I'll make it go much faster by
sending smaller packets, say 128 bytes:
```
tui>start -f stl/ipng.py -p 0 2 4 6 -m 99% -t size=128
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10000000
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10000000
```
The loadtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the
`sFlow` plugin is not sampling many packets:
```
pim@hvn6-lab:~$ vppctl show err
Count Node Reason Severity
59777084 sflow sflow packets processed error
6 sflow sflow packets sampled error
59777152 l2-output L2 output packets error
59777152 l2-input L2 input packets error
59777104 sflow sflow packets processed error
6 sflow sflow packets sampled error
59777104 l2-output L2 output packets error
59777104 l2-input L2 input packets error
pim@hvn6-lab:~$ vppctl show run | grep -E 'Name|sflow'
Name State Calls Vectors Suspends Clocks Vectors/Call
sflow active 8186642 369674664 0 1.35e1 45.16
sflow active 25173660 369674696 0 1.97e1 14.68
```
Two observations:
1. One of these is busier than the other. Without looking further, I can already predict that the
top one (doing 45.16 vectors/call) is the L3 thread. Reasoning: the L3 code path through the
dataplane is a lot more expensive than 'merely' L2 XConnect. As such, the packets will spend more
time, and therefore the iterations of the `dpdk-input` loop will be further apart in time. And
because of that, it'll end up consuming more packets on each subsequent iteration, in order to catch
up. The L2 path, on the other hand, is quicker and therefore will have fewer packets waiting on
subsequent iterations of `dpdk-input`.
2. The `sflow` plugin spends between 13.5 and 19.7 CPU cycles shoveling the packets into
`ethernet-input` without doing anything to them. That's pretty low! And the L3 path is a little bit
more efficient per packet, which is very likely because it gets to amortize its L1/L2 CPU instruction
cache over 45 packets each time it runs, while the L2 path can only amortize its instruction cache over
15 or so packets each time it runs.
I let the loadtest run overnight, and the proof is in the pudding: sFlow enabled but not sampling
works just fine:
{{< image width="100%" src="/assets/sflow/trex-passthru.png" alt="Cisco T-Rex: passthru" >}}
### VPP sFlow: Cost of sampling
The other interesting case is to figure out how much CPU it takes to execute the code path
with the actual sampling. This one turns out to be a bit trickier to measure. While leaving the previous
loadtest running at 33.5Mpps, I disable sFlow and then re-enable it at an abnormally _high_ ratio of
1:10 packets:
```
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 disable
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10
pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10
```
The T-Rex view immediately reveals that VPP is not doing very well, as the throughput went from
33.5Mpps all the way down to 7.5Mpps. Ouch! Looking at the dataplane:
```
pim@hvn6-lab:~$ vppctl show err | grep sflow
340502528 sflow sflow packets processed error
12254462 sflow sflow packets dropped error
22611461 sflow sflow packets sampled error
422527140 sflow sflow packets processed error
8533855 sflow sflow packets dropped error
34235952 sflow sflow packets sampled error
```
Ha, this new safeguard popped up: remember all the way at the beginning, I explained how there's a
safety net in the `sflow` plugin that will pre-emptively drop the sample if the RPC channel towards
the main thread is seeing too many outstanding RPCs? That's happening right now, under the moniker
`sflow packets dropped`, and it's roughly *half* of the samples.
My first attempt is to back off the loadtester to roughly 1.5Mpps per port (so 6Mpps in total, under the
current limit of 7.5Mpps), but I'm disappointed: the VPP instance is now returning only 665Kpps per
port, which is horrible, and it's still dropping samples.
My second attempt is to turn off all ports but the last pair (the L2XC port), which returns 930Kpps from
the offered 1.5Mpps. VPP is clearly not having a good time here.
Finally, as a validation, I turn off all ports but the first pair (the L3 port, without sFlow), and
ramp up the traffic to 8Mpps. Success (unsurprising to me). I also ramp up the second pair (the L2XC
port, without sFlow), VPP forwards all 16Mpps and is happy again.
Once I turn on the third pair (the L3 port, _with_ sFlow), even at 1Mpps, the whole situation
regresses again: First two ports go down from 8Mpps to 5.2Mpps each; the third (offending) port
delivers 740Kpps out of 1Mpps. Clearly, there's some work to do under high load situations!
#### Reasoning about the bottleneck
But how expensive is sending samples, really? To try to get at least some pseudo-scientific answer, I
turn off all ports again, and ramp up the one remaining port pair (L3, with sFlow at a 1:10 ratio) to
full line rate: that is, 64 byte packets at 14.88Mpps:
```
tui>stop
tui>start -f stl/ipng.py -m 100% -p 4 -t size=64
```
VPP is now on the struggle bus and is returning 3.16Mpps or 21% of that. But, I think it'll give me
some reasonable data to try to feel out where the bottleneck is.
```
Thread 2 vpp_wk_1 (lcore 3)
Time 6.3, 10 sec internal node vector rate 256.00 loops/sec 27310.73
vector rates in 3.1607e6, out 3.1607e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TenGigabitEthernet130/0/1-outp active 77906 19943936 0 5.79e0 256.00
TenGigabitEthernet130/0/1-tx active 77906 19943936 0 6.88e1 256.00
dpdk-input polling 77906 19943936 0 4.41e1 256.00
ethernet-input active 77906 19943936 0 2.21e1 256.00
ip4-input active 77906 19943936 0 2.05e1 256.00
ip4-load-balance active 77906 19943936 0 1.07e1 256.00
ip4-lookup active 77906 19943936 0 1.98e1 256.00
ip4-rewrite active 77906 19943936 0 1.97e1 256.00
sflow active 77906 19943936 0 6.14e1 256.00
pim@hvn6-lab:pim# vppctl show err | grep sflow
551357440 sflow sflow packets processed error
19829380 sflow sflow packets dropped error
36613544 sflow sflow packets sampled error
```
OK, the `sflow` plugin saw 551M packets, selected 36.6M of them for sampling, but ultimately only
sent RPCs to the main thread for 16.8M samples after having dropped 19.8M of them. There are three
code paths, each one extending the other:
1. Super cheap: pass through. I already learned that it takes about X=13.5 CPU cycles to pass
through a packet.
1. Very cheap: select sample and construct the RPC, but toss it, costing Y CPU cycles.
1. Expensive: select sample, and send the RPC. Z CPU cycles in worker, and another amount in main.
Now I don't know what Y is, but seeing as the selection only copies some data from the VPP buffer
into a new `sflow_sample_t`, and it uses `clib_memcpy_fast()` for the sample header, I'm going to
assume it's not _drastically_ more expensive than the super cheap case, so for simplicity I'll
guesstimate that it takes Y=20 CPU cycles.
With that guess out of the way, I can see what the `sflow` plugin is consuming for the third case:
```
AvgClocks = (Total * X + Sampled * Y + RPCSent * Z) / Total
61.4 = ( 551357440 * 13.5 + 36613544 * 20 + (36613544-19829380) * Z ) / 551357440
61.4 = ( 7443325440 + 732270880 + 16784164 * Z ) / 551357440
33853346816 = 7443325440 + 732270880 + 16784164 * Z
25677750496 = 16784164 * Z
Z = 1529
```
Good to know! I find spending O(1500) cycles to send the sample pretty reasonable. However, for a
dataplane that is trying to do 10Mpps per core, and a core running 2.2GHz, there are really only 220
CPU cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets
feels dangerous to me.
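To make that a bit more tangible, here's the same back-of-the-envelope in a few lines of Python,
solving for Z from the counters above and then showing what it adds per forwarded packet, amortized
over a few sampling rates. The 13.5 and 20 cycle figures are the estimate and guesstimate from the
list above:
```
# Solve AvgClocks = (Total*X + Sampled*Y + RPCSent*Z) / Total for Z,
# using the 'show err' and 'show run' counters from this loadtest.
total    = 551357440            # sflow packets processed
sampled  = 36613544             # sflow packets sampled
dropped  = 19829380             # sflow packets dropped (RPC never sent)
rpc_sent = sampled - dropped
avg_clocks, X, Y = 61.4, 13.5, 20.0

Z = (avg_clocks * total - X * total - Y * sampled) / rpc_sent
print(f"Z ~= {Z:.0f} cycles per sent sample")   # ~1530, the O(1500) above

# Amortized extra cost per forwarded packet at various sampling rates:
for rate in (10, 100, 1000):
    print(f"1:{rate}: ~{Z / rate:.1f} extra cycles/packet on average")
```
At 1:1000 the amortized overhead is negligible, but at 1:10 it eats up most of a 220-cycle budget,
which matches what I'm seeing.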
Here's where I start my conjecture. If I count the CPU cycles spent in the table above, I will see
273 CPU cycles spent on average per packet. The CPU in the VPP router is an `E5-2696 v4 @ 2.20GHz`,
which means it should be able to do `2.2e9/273 = 8.06Mpps` per thread, more than double what I
observe (3.16Mpps)! But, for all the `vector rates in` (3.1607e6), it also managed to emit the
packets back out (same number: 3.1607e6).
So why isn't VPP getting more packets from DPDK? I poke around a bit and find an important clue:
```
pim@hvn6-lab:~$ vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed; \
sleep 10; vppctl show hard TenGigabitEthernet130/0/0 | grep rx\ missed
rx missed 4065539464
rx missed 4182788310
```
In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. I already measured that it
forwarded 3.16Mpps and you know what? 11.72 + 3.16 is precisely 14.88Mpps. All packets are accounted
for! It's just, DPDK never managed to read them from the hardware: `sad-trombone.wav`
As a validation, I turned off sFlow while keeping that one port at 14.88Mpps. Now, 10.8Mpps were
delivered:
```
Thread 2 vpp_wk_1 (lcore 3)
Time 14.7, 10 sec internal node vector rate 256.00 loops/sec 40622.64
vector rates in 1.0794e7, out 1.0794e7, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TenGigabitEthernet130/0/1-outp active 620012 158723072 0 5.66e0 256.00
TenGigabitEthernet130/0/1-tx active 620012 158723072 0 7.01e1 256.00
dpdk-input polling 620012 158723072 0 4.39e1 256.00
ethernet-input active 620012 158723072 0 1.56e1 256.00
ip4-input-no-checksum active 620012 158723072 0 1.43e1 256.00
ip4-load-balance active 620012 158723072 0 1.11e1 256.00
ip4-lookup active 620012 158723072 0 2.00e1 256.00
ip4-rewrite active 620012 158723072 0 2.02e1 256.00
```
Total Clocks: 201 per packet; 2.2GHz/201 = 10.9Mpps, and I am observing 10.8Mpps. As [[North of the
Border](https://www.youtube.com/c/NorthoftheBorder)] would say: "That's not just good, it's good
_enough_!"
For completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps 🥰), and saw that
about 29Mpps of that made it through. Interestingly, what was 3.16Mpps in the single-port line rate
loadtest went up slightly to 3.44Mpps now. What puzzles me even more is that the non-sFlow worker
threads are also impacted. I spent some time thinking about this and poking around, but I did not
find a good explanation why port pair 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted.
Here's a screenshot of VPP on the struggle bus:
{{< image width="100%" src="/assets/sflow/trex-overload.png" alt="Cisco T-Rex: overload at line rate" >}}
**Hypothesis**: Due to the _spinlock_ in `vl_api_rpc_call_main_thread()`, the worker CPU is pegged
for a longer time, during which the `dpdk-input` PMD can't run, so it misses out on these sweet
sweet packets that the network card had dutifully received for it, resulting in the `rx-miss`
situation. While VPP's performance measurement shows 273 CPU cycles per packet and 3.16Mpps, this
accounts only for 862M cycles, while the thread has 2200M cycles, leaving a whopping 60% of CPU
cycles unused in the dataplane. I still don't understand why _other_ worker threads are impacted,
though.
## What's Next
I'll continue to work with the folks in the sFlow and VPP communities and iterate on the plugin and
other **sFlow Agent** machinery. In an upcoming article, I hope to share more details on how to tie
the VPP plugin in to the `hsflowd` host sflow daemon in a way that the interface indexes, counters
and packet lengths are all correct. Of course, the main improvement that we can make is to allow for
the system to work better under load, which will take some thinking.
I should do a few more tests with a debug binary and profiling turned on. I quickly ran `perf`
over the VPP (release / optimized) binary running on the bench, but it merely said that with sFlow
enabled, 80% of the time is spent in `libvlib`, whereas the baseline (sFlow turned off) spends a much
larger share in `libvnet`:
```
root@hvn6-lab:/home/pim# perf record -p 1752441 sleep 10
root@hvn6-lab:/home/pim# perf report --stdio --sort=dso
# Overhead Shared Object (sFlow) Overhead Shared Object (baseline)
# ........ ...................... ........ ........................
#
79.02% libvlib.so.24.10 54.27% libvlib.so.24.10
12.82% libvnet.so.24.10 33.91% libvnet.so.24.10
3.77% dpdk_plugin.so 10.87% dpdk_plugin.so
3.21% [kernel.kallsyms] 0.81% [kernel.kallsyms]
0.29% sflow_plugin.so 0.09% ld-linux-x86-64.so.2
0.28% libvppinfra.so.24.10 0.03% libc.so.6
0.21% libc.so.6 0.01% libvppinfra.so.24.10
0.17% libvlibapi.so.24.10 0.00% libvlibmemory.so.24.10
0.15% libvlibmemory.so.24.10
0.07% ld-linux-x86-64.so.2
0.00% vpp
0.00% [vdso]
0.00% libsvm.so.24.10
```
Unfortunately, I'm not much of a profiler expert, being merely a network engineer :) so I may have
to ask for help. Of course, if you're reading this, you may also _offer_ help! There's lots of
interesting work to do on this `sflow` plugin, with matching ifIndex for consumers like `hsflowd`,
reading interface counters from the dataplane (or from the Prometheus Exporter), and most
importantly, ensuring it works well, or fails gracefully, under stringent load.
From the _cray-cray_ ideas department (sketched in Python below), what if we:
1. In the worker thread, produce the sample but, instead of sending an RPC to main and taking the
lock, append it to a per-worker producer sample queue and move on. This way, no locks are needed, and each
worker thread will have its own producer queue.
1. Create a separate worker (or even pool of workers), running on possibly a different CPU (or in
main), that runs a loop iterating on all sflow sample queues consuming the samples and sending them
in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too many coming in.
I'm reminded that this pattern exists already -- async crypto workers create a `crypto-dispatch`
node that acts as poller for inbound crypto, and it hands off the result back into the worker
thread: lockless at the expense of some complexity!
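To make the idea a little more concrete, here's a toy Python model of such a per-worker producer
queue: bounded depth, drop on overflow, and a single consumer that visits every worker's queue in
turn. It's purely illustrative; the names, the depth of 4 and the consumer loop are invented here,
and the real thing would have to be lock-free C inside the dataplane:
```
# Toy model of the proposed per-worker sample queue (illustrative only).
from collections import deque

class WorkerSampleQueue:
    def __init__(self, depth=4):
        self.fifo = deque()
        self.depth = depth
        self.dropped = 0

    def enqueue(self, sample):
        """Called from the worker: never block, drop when full."""
        if len(self.fifo) >= self.depth:
            self.dropped += 1
            return False
        self.fifo.append(sample)
        return True

    def dequeue(self):
        """Called from the consumer side only."""
        return self.fifo.popleft() if self.fifo else None

def consumer_pass(queues):
    """One round of the consumer: at most one sample per worker, so a
    busy interface cannot crowd out the others; batch goes to PSAMPLE."""
    return [s for s in (q.dequeue() for q in queues) if s is not None]

# Example: worker 0 is much busier than worker 1.
queues = [WorkerSampleQueue(), WorkerSampleQueue()]
for i in range(10):
    queues[0].enqueue(("worker0", i))
queues[1].enqueue(("worker1", 0))
print(consumer_pass(queues), "dropped:", [q.dropped for q in queues])
```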
## Acknowledgements
The plugin I am testing here is a prototype written by Neil McKee of inMon. I also wanted to say
thanks to Pavel Odintsov of FastNetMon and Ciprian Balaceanu for showing an interest in this plugin,
and Peter Phaal for facilitating a get-together last year.
Who's up for making this thing a reality?!


---
date: "2024-10-06T07:51:23Z"
title: 'VPP with sFlow - Part 2'
---
# Introduction
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width='12em' >}}
Last month, I picked up a project together with Neil McKee of [[inMon](https://inmon.com/)], the
caretakers of [[sFlow](https://sflow.org)]: an industry standard technology for monitoring high speed switched
networks. `sFlow` gives complete visibility into the use of networks enabling performance optimization,
accounting/billing for usage, and defense against security threats.
The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice is the so
called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)], which allows a small
portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but also in the
VPP software dataplane, and then _transmit_ these samples using a Linux kernel feature called
[[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)]. This greatly
reduces the complexity of code to be implemented in the forwarding path, while at the same time
bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business logic for
the more complex state keeping, packet marshalling and transmission from the _Agent_ to a central
_Collector_.
Last month, Neil and I discussed the proof of concept [[ref](https://github.com/sflow/vpp-sflow/)]
and I described this in a [[first article]({{< ref 2024-09-08-sflow-1.md >}})]. Then, we iterated on
the VPP plugin, playing with a few different approaches to strike a balance between performance, code
complexity, and agent features. This article describes our journey.
## VPP: an sFlow plugin
There are three things Neil and I specifically take a look at:
1. If `sFlow` is not enabled on a given interface, there should not be a regression on other
interfaces.
1. If `sFlow` _is_ enabled, but a packet is not sampled, the overhead should be as small as
possible, targeting single digit CPU cycles per packet in overhead.
1. If `sFlow` actually selects a packet for sampling, it should be moved out of the dataplane as
quickly as possible, targeting double digit CPU cycles per sample.
For all of these validations and loadtests, I use a bare metal VPP machine which is receiving load from
a T-Rex loadtester on eight TenGig ports. I have configured VPP and T-Rex as follows.
**1. RX Queue Placement**
It's important that the network card that is receiving the traffic gets serviced by a worker thread
on the same NUMA domain. Since my machine has two processors (and thus, two NUMA nodes), I will
align the NIC with the correct processor, like so:
```
set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
set interface rx-placement TenGigabitEthernet3/0/1 queue 0 worker 2
set interface rx-placement TenGigabitEthernet3/0/2 queue 0 worker 4
set interface rx-placement TenGigabitEthernet3/0/3 queue 0 worker 6
set interface rx-placement TenGigabitEthernet130/0/0 queue 0 worker 1
set interface rx-placement TenGigabitEthernet130/0/1 queue 0 worker 3
set interface rx-placement TenGigabitEthernet130/0/2 queue 0 worker 5
set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7
```
**2. L3 IPv4/MPLS interfaces**
I will take two pairs of interfaces, one on NUMA0, and the other on NUMA1, so that I can make a
comparison with L3 IPv4 or MPLS running _without_ `sFlow` (these are TenGig3/0/*, which I will call
the _baseline_ pairs) and two which are running _with_ `sFlow` (these are TenGig130/0/*, which I'll
call the _experiment_ pairs).
```
comment { L3: IPv4 interfaces }
set int state TenGigabitEthernet3/0/0 up
set int state TenGigabitEthernet3/0/1 up
set int state TenGigabitEthernet130/0/0 up
set int state TenGigabitEthernet130/0/1 up
set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
ip route add 16.0.0.0/24 via 100.64.0.0
ip route add 48.0.0.0/24 via 100.64.1.0
ip route add 16.0.2.0/24 via 100.64.4.0
ip route add 48.0.2.0/24 via 100.64.5.0
ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
```
Here, the only specific trick worth mentioning is the use of `ip neighbor` to pre-populate the L2
adjacency for the T-Rex loadtester. This way, VPP knows which MAC address to send traffic to, in
case a packet has to be forwarded to 100.64.0.0 or 100.64.5.0. It saves VPP from having to do ARP
resolution.
The configuration for an MPLS label switching router (_LSR_, also called a _P-Router_) is added next:
```
comment { MPLS interfaces }
mpls table add 0
set interface mpls TenGigabitEthernet3/0/0 enable
set interface mpls TenGigabitEthernet3/0/1 enable
set interface mpls TenGigabitEthernet130/0/0 enable
set interface mpls TenGigabitEthernet130/0/1 enable
mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
```
**3. L2 CrossConnect interfaces**
Here, I will also use NUMA0 as my baseline (`sFlow` disabled) pair, and an equivalent pair of TenGig
interfaces on NUMA1 as my experiment (`sFlow` enabled) pair. This way, I can both compare the
performance impact of enabling `sFlow`, and also check whether any regression occurs in the
_baseline_ pair if I enable a feature in the _experiment_ pair, which should really never happen.
```
comment { L2 xconnected interfaces }
set int state TenGigabitEthernet3/0/2 up
set int state TenGigabitEthernet3/0/3 up
set int state TenGigabitEthernet130/0/2 up
set int state TenGigabitEthernet130/0/3 up
set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
```
**4. T-Rex Configuration**
The Cisco T-Rex loadtester is running on another machine in the same rack. Physically, it has eight
ports which are connected to a LAB switch, a cool Mellanox SN2700 running Debian [[ref]({{< ref
2023-11-11-mellanox-sn2700.md >}})]. From there, eight ports go to my VPP machine. The LAB switch
just has VLANs with two ports in each: VLAN 100 takes T-Rex port0 and connects it to TenGig3/0/0,
VLAN 101 takes port1 and connects it to TenGig3/0/1, and so on. In total, sixteen ports and eight
VLANs are used.
The configuration for T-Rex then becomes:
```
- version: 2
interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
port_info:
- src_mac: 00:1b:21:06:00:00
dest_mac: 9c:69:b4:61:a1:dc
- src_mac: 00:1b:21:06:00:01
dest_mac: 9c:69:b4:61:a1:dd
- src_mac: 00:1b:21:83:00:00
dest_mac: 00:1b:21:83:00:01
- src_mac: 00:1b:21:83:00:01
dest_mac: 00:1b:21:83:00:00
- src_mac: 00:1b:21:87:00:00
dest_mac: 9c:69:b4:61:75:d0
- src_mac: 00:1b:21:87:00:01
dest_mac: 9c:69:b4:61:75:d1
- src_mac: 9c:69:b4:85:00:00
dest_mac: 9c:69:b4:85:00:01
- src_mac: 9c:69:b4:85:00:01
dest_mac: 9c:69:b4:85:00:00
```
Do you see how the first pair sends from `src_mac` 00:1b:21:06:00:00? That's the T-Rex side, and it
encodes the PCI device `06:00.0` in the MAC address. It sends traffic to `dest_mac`
9c:69:b4:61:a1:dc, which is the MAC address of VPP's TenGig3/0/0 interface. Looking back at the `ip
neighbor` VPP config above, it becomes much easier to see who is sending traffic to whom.
For L2XC, the MAC addresses don't matter. VPP will set the NIC in _promiscuous_ mode which means
it'll accept any ethernet frame, not only those sent to the NIC's own MAC address. Therefore, in
L2XC modes (the second and fourth pair), I just use the MAC addresses from T-Rex. I find debugging
connections and looking up FDB entries on the Mellanox switch much, much easier this way.
With all config in place, but with `sFlow` disabled, I run a quick bidirectional loadtest using 256b
packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS,
IPv4, and L2XC. Neat!
{{< image src="/assets/sflow/trex-acceptance.png" alt="T-Rex Acceptance Loadtest" >}}
The name of the game is now to do a loadtest that shows the packet throughput and CPU cycles spent
for each of the plugin iterations, comparing their performance on ports with and without `sFlow`
enabled. For each iteration, I will use exactly the same VPP configuration, I will generate
unidirectional 4x14.88Mpps of traffic with T-Rex, and I will report on VPP's performance both in
_baseline_ and at a somewhat unfavorable 1:100 sampling rate.
Ready? Here I go!
### v1: Workers send RPC to main
***TL/DR***: _13 cycles/packet on passthrough, 4.68Mpps L2, 3.26Mpps L3, with severe regression in
baseline_
The first iteration goes all the way back to a proof of concept from last year. It's described in
detail in my [[first post]({{< ref 2024-09-08-sflow-1.md >}})]. The performance results are not
stellar:
* ☢ When slamming a single sFlow enabled interface, _all interfaces_ regress. When sending 8Mpps
of IPv4 traffic through a _baseline_ interface, that is an interface _without_ sFlow enabled, only
5.2Mpps get through. This is considered a mortal sin in VPP-land.
* ✅ Passing through packets without sampling them, costs about 13 CPU cycles, not bad.
* ❌ Sampling a packet, specifically at higher rates (say, 1:100 or worse, 1:10) completely
destroys throughput. When sending 4x14.88Mpps of traffic, only one third makes it through.
Here's the bloodbath as seen from T-Rex:
{{< image src="/assets/sflow/trex-v1.png" alt="T-Rex Loadtest for v1" >}}
**Debrief**: When we talked through these issues, we sort of drew the conclusion that it would be much
faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the
spinlock, the worker appends the sample to a producer queue and moves on. This way, no locks
are needed, and each worker thread will have its own producer queue.
Then, we can create a separate thread (or even pool of threads), scheduling on possibly a different
CPU (or in main), that runs a loop iterating on all sflow sample queues, consuming the samples and
sending them in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too
many coming in.
### v2: Workers send PSAMPLE directly
**TL/DR**: _7.21Mpps IPv4 L3, 9.45Mpps L2XC, 87 cycles/packet, no impact on disabled interfaces_
But before we do that, we have one curiosity itch to scratch - what if we sent the sample directly
from the worker? With such a model, if it works, we will need no RPCs or sample queue at all. Of
course, in this model any sample will have to be rewritten into a PSAMPLE packet and written via the
netlink socket. It would be less complex, but not as efficient as it could be. One thing is pretty
certain, though: it should be much faster than sending an RPC to the main thread.
After a short refactor, Neil commits [[d278273](https://github.com/sflow/vpp-sflow/commit/d278273)],
which adds compiler macros `SFLOW_SEND_FROM_WORKER` (v2) and `SFLOW_SEND_VIA_MAIN` (v1). When
workers send directly, they will invoke `sflow_send_sample_from_worker()` instead of sending an RPC
with `vl_api_rpc_call_main_thread()` in the previous version.
The code currently uses `clib_warning()` to print stats from the dataplane, which is pretty
expensive. We should be using the VPP logging framework, but for the time being, I add a few CPU
counters so we can more accurately count the cumulative time spent for each part of the calls, see
[[6ca61d2](https://github.com/sflow/vpp-sflow/commit/6ca61d2)]. I can now see these with `vppctl show
err` instead.
When loadtesting this, the deadly sin of impacting performance of interfaces that did not have
`sFlow` enabled is gone. The throughput is not great, though. Instead of showing screenshots of
T-Rex, I can also take a look at the throughput as measured by VPP itself. In its `show runtime`
statistics, each worker thread shows both CPU cycles spent, as well as how many packets/sec it
received and how many it transmitted:
```
pim@hvn6-lab:~$ export C="v2-100"; vppctl clear run; vppctl clear err; sleep 30; \
vppctl show run > $C-runtime.txt; vppctl show err > $C-err.txt
pim@hvn6-lab:~$ grep 'vector rates' v2-100-runtime.txt | grep -v 'in 0'
vector rates in 1.0909e7, out 1.0909e7, drop 0.0000e0, punt 0.0000e0
vector rates in 7.2078e6, out 7.2078e6, drop 0.0000e0, punt 0.0000e0
vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
vector rates in 9.4476e6, out 9.4476e6, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep 'sflow' v2-100-runtime.txt
Name State Calls Vectors Suspends Clocks Vectors/Call
sflow active 844916 216298496 0 8.69e1 256.00
sflow active 1107466 283511296 0 8.26e1 256.00
pim@hvn6-lab:~$ grep -i sflow v2-100-err.txt
217929472 sflow sflow packets processed error
1614519 sflow sflow packets sampled error
2606893106 sflow CPU cycles in sent samples error
280697344 sflow sflow packets processed error
2078203 sflow sflow packets sampled error
1844674406 sflow CPU cycles in sent samples error
```
At a glance, I can see in the first `grep` the in and out vector (==packet) rates for each worker
thread that is doing meaningful work (ie. has more than 0pps of input). Remember that I pinned the
RX queues to worker threads, and this now pays dividends: worker thread 0 is servicing TenGig3/0/0
(as _even_ worker thread numbers are on NUMA domain 0), worker thread 1 is servicing TenGig130/0/0.
What's cool about this, is it gives me an easy way to compare baseline L3 (10.9Mpps) with experiment
L3 (7.21Mpps). Equally, L2XC comes in at 14.88Mpps in baseline and 9.45Mpps in experiment.
Looking at the output of `vppctl show error`, I can learn another interesting detail. See how there
are 1614519 sampled packets out of 217929472 processed packets (ie. a roughly 1:100 rate)? I added a
CPU clock cycle counter that counts cumulative clocks spent once samples are taken. I can see that
VPP spent 2606893106 CPU cycles sending these samples. That's **1615 CPU cycles** per sent sample,
which is pretty terrible.
**Debrief**: We both understand that assembling and `send()`ing the netlink messages from within the
dataplane is a pretty bad idea. But it's great to see that removing the use of RPCs immediately
improves performance on non-enabled interfaces, and we learned what the cost is of sending those
samples. An easy step forward from here is to create a producer/consumer queue, where the workers
can just copy the packet into a queue or ring buffer, and have an external `pthread` consume from
the queue/ring in another thread that won't block the dataplane.
### v3: SVM FIFO from workers, dedicated PSAMPLE pthread
**TL/DR:** _9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages_
Neil checks in after committing [[7a78e05](https://github.com/sflow/vpp-sflow/commit/7a78e05)]
that he has introduced a macro `SFLOW_SEND_FIFO` which tries this new approach. There's a pretty
elaborate FIFO queue implementation in `svm/fifo_segment.h`. Neil uses this to create a segment
called `fifo-sflow-worker`, to which the worker can write its samples in the dataplane node. A new
thread called `spt_process_samples` can then call `svm_fifo_dequeue()` from all workers' queues and
pump those into Netlink.
The overhead of copying the samples onto a VPP native `svm_fifo` seems to be two orders of magnitude
lower than writing directly to Netlink, even though the `svm_fifo` library code has many bells and
whistles that we don't need. But, perhaps due to these bells and whistles, we may be holding it
wrong, as invariably after a short while the Netlink writes return _Message too long_ errors.
```
pim@hvn6-lab:~$ grep 'vector rates' v3fifo-sflow-100-runtime.txt | grep -v 'in 0'
vector rates in 1.0783e7, out 1.0783e7, drop 0.0000e0, punt 0.0000e0
vector rates in 9.3499e6, out 9.3499e6, drop 0.0000e0, punt 0.0000e0
vector rates in 1.4728e7, out 1.4728e7, drop 0.0000e0, punt 0.0000e0
vector rates in 1.3516e7, out 1.3516e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-runtime.txt
Name State Calls Vectors Suspends Clocks Vectors/Call
sflow active 1096132 280609792 0 1.63e1 256.00
sflow active 1584577 405651712 0 1.46e1 256.00
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-err.txt
280635904 sflow sflow packets processed error
2079194 sflow sflow packets sampled error
733447310 sflow CPU cycles in sent samples error
405689856 sflow sflow packets processed error
3004118 sflow sflow packets sampled error
1844674407 sflow CPU cycles in sent samples error
```
Two things of note here. Firstly, the average clocks spent in the `sFlow` node have gone down from
86 CPU cycles/packet to 16.3 CPU cycles. But even more importantly, the amount of time spent after
the sample is taken is hugely reduced, from 1600+ cycles in v2 to a much more favorable 352 cycles
in this version. Also, any risk of Netlink writes failing has been eliminated, because that's now
offloaded to a different thread entirely.
**Debrief**: It's not great that we created a new linux `pthread` for the consumer of the samples.
VPP has an elaborate thread management system, and collaborative multitasking in its threading
model, which adds introspection like clock counters, names, `show runtime`, `show threads` and so
on. I can't help but wonder: wouldn't we just be able to move the `spt_process_samples()` thread
into a VPP process node instead?
### v3bis: SVM FIFO, PSAMPLE process in Main
**TL/DR:** _9.68Mpps L3, 14.10Mpps L2XC, 14.2 cycles/packet, still with corrupted FIFO queue messages_
Neil agrees that there's no good reason to keep this out of main, and conjures up
[[df2dab8d](https://github.com/vpp/sflow-vpp/df2dab8d)] which rewrites the thread to an
`sflow_process_samples()` function, using `VLIB_REGISTER_NODE` to add it to VPP in an idiomatic way.
As a really nice benefit, we can now count how many CPU cycles are spent, in _main_, each time this
_process_ wakes up and does some work. It's a widely used pattern in VPP.
Because of the FIFO queue message corruption, Netlink messages are failing to send at an alarming
rate, which is causing lots of `clib_warning()` messages to be spewed on the console. I replace those
with a counter of failed Netlink messages instead, and commit the refactor as
[[6ba4715](https://github.com/sflow/vpp-sflow/6ba4715d050f76cfc582055958d50bf4cc8a0ad1)].
```
pim@hvn6-lab:~$ grep 'vector rates' v3bis-100-runtime.txt | grep -v 'in 0'
vector rates in 1.0976e7, out 1.0976e7, drop 0.0000e0, punt 0.0000e0
vector rates in 9.6743e6, out 9.6743e6, drop 0.0000e0, punt 0.0000e0
vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
vector rates in 1.4052e7, out 1.4052e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v3bis-100-runtime.txt
Name State Calls Vectors Suspends Clocks Vectors/Call
sflow-process-samples any wait 0 0 28052 4.66e4 0.00
sflow active 1134102 290330112 0 1.42e1 256.00
sflow active 1647240 421693440 0 1.32e1 256.00
pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt
77945 sflow sflow PSAMPLE sent error
863 sflow sflow PSAMPLE send failed error
290376960 sflow sflow packets processed error
2151184 sflow sflow packets sampled error
421761024 sflow sflow packets processed error
3119625 sflow sflow packets sampled error
```
With this iteration, I make a few observations. Firstly, the `sflow-process-samples` node shows up
and informs me that, when handling the samples from the worker FIFO queues, the _process_ is using
about 46'600 CPU cycles each time it wakes up. Secondly, the replacement of `clib_warning()` with the `sflow PSAMPLE send failed`
counter reduced time from 16.3 to 14.2 cycles on average in the dataplane. Nice.
**Debrief**: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All
these send failures and corrupt packets are really messing things up. So while the provided FIFO
implementation in `svm/fifo_segment.h` is idiomatic, it is also much more complex than we thought,
and we fear that it may not be safe to read from another thread.
### v4: Custom lockless FIFO, PSAMPLE process in Main
**TL/DR:** _9.56Mpps L3, 13.69Mpps L2XC, 15.6 cycles/packet, corruption fixed!_
After reading around a bit in DPDK's
[[kni_fifo](https://doc.dpdk.org/api-18.11/rte__kni__fifo_8h_source.html)], Neil produces a gem of a
commit in
[[42bbb64](https://github.com/sflow/vpp-sflow/commit/42bbb643b1f11e8498428d3f7d20cde4de8ee201)],
where he introduces a tiny multiple-writer, single-consumer FIFO with two simple functions:
`sflow_fifo_enqueue()` to be called in the workers, and `sflow_fifo_dequeue()` to be called in the
main thread's `sflow-process-samples` process. He then makes this thread-safe by doing what I
consider black magic, in commit
[[dd8af17](https://github.com/sflow/vpp-sflow/commit/dd8af1722d579adc9d08656cd7ec8cf8b9ac11d6)],
which makes use of `clib_atomic_load_acq_n()` and `clib_atomic_store_rel_n()` macros from VPP's
`vppinfra/atomics.h`.
What I really like about this change is that it introduces a FIFO implementation in about twenty
lines of code, which means the sampling code path in the dataplane becomes really easy to follow,
and will be even faster than it was before. I take it out for a loadtest:
```
pim@hvn6-lab:~$ grep 'vector rates' v4-100-runtime.txt | grep -v 'in 0'
vector rates in 1.0958e7, out 1.0958e7, drop 0.0000e0, punt 0.0000e0
vector rates in 9.5633e6, out 9.5633e6, drop 0.0000e0, punt 0.0000e0
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
vector rates in 1.3697e7, out 1.3697e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v4-100-runtime.txt
Name State Calls Vectors Suspends Clocks Vectors/Call
sflow-process-samples any wait 0 0 17767 1.52e6 0.00
sflow active 1121156 287015936 0 1.56e1 256.00
sflow active 1605772 411077632 0 1.53e1 256.00
pim@hvn6-lab:~$ grep sflow v4-100-err.txt
3553600 sflow sflow PSAMPLE sent error
287101184 sflow sflow packets processed error
2127024 sflow sflow packets sampled error
350224 sflow sflow packets dropped error
411199744 sflow sflow packets processed error
3043693 sflow sflow packets sampled error
1266893 sflow sflow packets dropped error
```
This is starting to be a very nice implementation! With this iteration of the plugin, all the
corruption is gone, and there is only a slight regression (because we're now actually _sending_ the
messages). With the v3bis variant, only a tiny fraction of the samples made it through to netlink.
With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen
FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying
to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken,
350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense in depth!
Doing the math, both workers can enqueue 1776800 samples in 30 seconds, which is 59k/s per
interface. I can also see that the second interface, which is doing L2XC and hits a much larger
packets/sec throughput, is dropping more samples because it receives an equal amount of time from main
reading samples from its queue. In other words: in an overload scenario, one interface cannot crowd
out another. Slick.
Finally, completing my math, each worker has enqueued 1776800 samples to their FIFOs, and I see that
main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so
the `sflow PSAMPLE send failed` counter remains zero.
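For the skeptical reader, the whole accounting fits in a few lines of Python, using the counters
from the 30-second run above:
```
# Re-doing the v4 sample accounting for the 30 second measurement window.
duration = 30  # seconds between 'clear err' and 'show err'
workers = [
    # (sampled, dropped) per sFlow-enabled worker
    (2127024,  350224),   # L3 interface
    (3043693, 1266893),   # L2XC interface
]
enqueued = [s - d for s, d in workers]
print(enqueued)                           # [1776800, 1776800] -- identical!
print([e // duration for e in enqueued])  # ~59k samples/sec per interface
print(sum(enqueued))                      # 3553600 == 'sflow PSAMPLE sent'
```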
{{< image float="right" width="20em" src="/assets/sflow/hsflowd-demo.png" >}}
**Debrief**: In the meantime, Neil has been working on the `host-sflow` daemon changes to pick up
these netlink messages. There's also a bit of work to do with retrieving the packet and byte
counters of the VPP interfaces, so he is creating a module in `host-sflow` that can consume some
messages from VPP. He will call this `mod_vpp`, and he mails a screenshot of his work in progress.
I'll discuss the end-to-end changes with `hsflowd` in a followup article, and focus my efforts here
on documenting the VPP parts only. But, as a teaser, here's a screenshot of a validated
`sflow-tool` output of a VPP instance using our `sFlow` plugin and his pending `host-sflow` changes
to integrate the rest of the business logic outside of the VPP dataplane, where it's arguably
expensive to make mistakes.
Neil admits to an itch that he has been meaning to scratch all this time. In VPP's
`plugins/sflow/node.c`, we insert the node between `device-input` and `ethernet-input`. Here, most of
the time the plugin is just shoveling the ethernet packets through to `ethernet-input`. To
make use of some CPU instruction cache affinity, the loop that does this shovelling can do it one
packet at a time, two packets at a time, or even four packets at a time. Although the code is super
repetitive and somewhat ugly, it does actually speed up processing in terms of CPU cycles spent per
packet, if you shovel four of them at a time.
### v5: Quad Bucket Brigade in worker
**TL/DR:** _9.68Mpps L3, 14.0Mpps L2XC, 11 CPU cycles/packet, 1.28e5 CPU cycles in main_
Neil calls this the _Quad Bucket Brigade_, and one last finishing touch is to move from his default
2-packet to a 4-packet shoveling. In commit
[[285d8a0](https://github.com/sflow/vpp-sflow/commit/285d8a097b74bb38eeb14a922a1e8c1115da2ef2)], he
extends a common pattern in VPP dataplane nodes: each time the node iterates, it'll now pre-fetch up
to eight packets (`p0-p7`) if the vector is long enough, and handle them four at a time (`b0-b3`).
He also adds a few compiler hints with branch prediction: almost no packets will have a trace
enabled, so he can use `PREDICT_FALSE()` macros to allow the compiler to further optimize the code.
Reading the dataplane code, I find it incredibly ugly. But that's the price to pay for ultra
fast throughput. So how do we see the effect? My low-tech proposal is to enable sampling at a very
high rate, say 1:10'000'000, so that the code path that grabs and enqueues the sample into the FIFO
is almost never called. Then, what's left for the `sFlow` dataplane node, really, is to shovel the
packets from `device-input` into `ethernet-input`.
To measure the relative improvement, I do one test with, and one without commit
[[285d8a09](https://github.com/sflow/vpp-sflow/commit/285d8a09)].
```
pim@hvn6-lab:~$ grep 'vector rates' v5-10M-runtime.txt | grep -v 'in 0'
vector rates in 1.0981e7, out 1.0981e7, drop 0.0000e0, punt 0.0000e0
vector rates in 9.8806e6, out 9.8806e6, drop 0.0000e0, punt 0.0000e0
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
vector rates in 1.4328e7, out 1.4328e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v5-10M-runtime.txt
Name State Calls Vectors Suspends Clocks Vectors/Call
sflow-process-samples any wait 0 0 28467 9.36e3 0.00
sflow active 1158325 296531200 0 1.09e1 256.00
sflow active 1679742 430013952 0 1.11e1 256.00
pim@hvn6-lab:~$ grep 'vector rates' v5-noquadbrigade-10M-runtime.txt | grep -v in\ 0
vector rates in 1.0959e7, out 1.0959e7, drop 0.0000e0, punt 0.0000e0
vector rates in 9.7046e6, out 9.7046e6, drop 0.0000e0, punt 0.0000e0
vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
vector rates in 1.4008e7, out 1.4008e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v5-noquadbrigade-10M-runtime.txt
Name State Calls Vectors Suspends Clocks Vectors/Call
sflow-process-samples any wait 0 0 28462 9.57e3 0.00
sflow active 1137571 291218176 0 1.26e1 256.00
sflow active 1641991 420349696 0 1.20e1 256.00
```
Would you look at that, this optimization actually works as advertised! There is a meaningful
_progression_ from v5-noquadbrigade (9.70Mpps L3, 14.00Mpps L2XC) to v5 (9.88Mpps L3, 14.32Mpps
L2XC). So at the expense of adding 63 lines of code, there is a 2.8% increase in throughput.
**Quad-Bucket-Brigade, yaay!**
I'll leave you with a beautiful screenshot of the current code at HEAD, as it is sampling 1:100
packets (!) on four interfaces, while forwarding 8x10G of 256 byte packets at line rate. You'll
recall at the beginning of this article I did an acceptance loadtest with sFlow disabled, but this
is the exact same result **with sFlow** enabled:
{{< image src="/assets/sflow/trex-sflow-acceptance.png" alt="T-Rex sFlow Acceptance Loadtest" >}}
This picture says it all: 79.98 Gbps in, 79.98 Gbps out; 36.22Mpps in, 36.22Mpps out. Also: 176k
samples/sec taken from the dataplane, with correct rate limiting due to a per-worker FIFO depth
limit, yielding 25k samples/sec sent to Netlink.
## What's Next
Checking in on the three main things we wanted to ensure with the plugin:
1. ✅ If `sFlow` _is not_ enabled on a given interface, there is no regression on other interfaces.
1. ✅ If `sFlow` _is_ enabled, copying packets costs 11 CPU cycles on average
1. ✅ If `sFlow` takes a sample, it takes only marginally more CPU time to enqueue.
* No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput,
* 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput,
* and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.
The hard part is finished, but we're not entirely done yet. What's left is to implement a set of
packet and byte counters, and send this information along with possible Linux CP data (such as the
TAP interface ID on the Linux side), and to add the module for VPP in `hsflowd`. I'll write about
that part in a followup article.
Neil has introduced vpp-dev@ to this plugin, and so far there have been no objections. But he has pointed
folks to an out-of-tree GitHub repo, and I may add a Gerrit instead so it becomes part of the
ecosystem. Our work so far is captured in Gerrit [[41680](https://gerrit.fd.io/r/c/vpp/+/41680)],
which ends up being just over 2600 lines all-up. I do think we need to refactor a bit, add some
VPP-specific tidbits like `FEATURE.yaml` and `*.rst` documentation, but this should be in reasonable
shape.
### Acknowledgements
I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
finer details such as logging, error handling, API specifications, and documentation. He has been a
true pleasure to work with and learn from.


---
date: "2024-10-21T10:52:11Z"
title: "FreeIX Remote - Part 2"
---
{{< image width="18em" float="right" src="/assets/freeix/freeix-artist-rendering.png" alt="FreeIX, Artists Rendering" >}}
# Introduction
A few months ago, I wrote about [[an idea]({{< ref 2024-04-27-freeix-1.md >}})] to help boost the
value of small Internet Exchange Points (_IXPs_). When such an exchange doesn't have many members,
the operational costs of connecting to it (cross connects, router ports, finding peers, etc)
are not very favorable.
Clearly, the benefit of using an Internet Exchange is to reduce the portion of an ISP's (and CDN's)
traffic that must be delivered via their upstream transit providers, thereby reducing the average
per-bit delivery cost as well as the end-to-end latency as seen by their users or
customers. Furthermore, the increased number of paths available through the IXP improves routing
efficiency and fault-tolerance, and at the same time it avoids traffic going the scenic route to a
large hub like Frankfurt, London, Amsterdam, Paris or Rome, if it could very well remain local.
## Refresher: FreeIX Remote
{{< image width="20em" float="right" src="/assets/freeix/Free IX Remote.svg" alt="FreeIX Remote" >}}
Let's take for example the [[Free IX in Greece](https://free-ix.gr/)] that was announced at GRNOG16
in Athens on April 19th, 2024. This exchange initially targets Athens and Thessaloniki, with 2x100G
between the two cities. Members can connect to either site for the cost of only a cross connect.
The 1G/10G/25G ports will be _Gratis_, so please make sure to apply if you're in this region! I
myself have connected one very special router to Free IX Greece, which will be offering an outreach
infrastructure by connecting to _other_ Internet Exchange Points in Amsterdam, and allowing all FreeIX
Greece members to benefit from that in the following way:
1. FreeIX Remote uses AS50869 to peer with any network operator (or routeserver) available at public
Internet Exchange Points or using private interconnects. To these peers, it looks like a completely
normal service provider. It will connect to internet exchange points, and learn a bunch of
routes and announce other routes.
1. FreeIX Remote _members_ can join the program, after which they are granted certain propagation
permissions by FreeIX Remote at the point where they have a BGP session with AS50869. The prefixes
learned on these _member_ sessions are marked as such, and will be allowed to propagate. Members
will receive some or all learned prefixes from AS50869.
1. FreeIX _members_ can set fine grained BGP communities to determine which of their prefixes are
propagated to and from which locations, by router, country or Internet Exchange Point.
Members at smaller internet exchange points greatly benefit from this type of outreach, by receiving large
portions of the public internet directly at their preferred peering location. The _Free IX Remote_
routers will carry member traffic to and from these remote Internet Exchange Points. My [[previous
article]({{< ref 2024-04-27-freeix-1.md >}})] went into a good amount of detail on the principles of
operation, but back then I made a promise to come back to the actual _implementation_ of such a
complex routing topology. As a starting point, I work with the structure I shared in [[IPng's
Routing Policy]({{< ref 2021-11-14-routing-policy.md >}})]. If you haven't read that yet, I think
it may make sense to take a look as many of the structural elements and concepts will be similar.
## Implementation
The routing policy calls for three classes of (large) BGP communities: informational, permission and
inhibit. It also defines a few classic BGP communities, but I'll skip over those as they are not
very interesting. Firstly, I will use the _informational_ communities to tag which prefixes were
learned by which _router_, in which _country_ and at which internet exchange point, which I will call a
_group_.
Then, I will use the same structure to grant members _permissions_, that is to say, when AS50869
learns their prefixes, they will get tagged with specific action communities that enable propagation
to other places. I will call this 'Member-to-IXP'. Sometimes, I'd like to be able to _inhibit_
propagation of 'Member-to-IXP', so there will be a third set of communities that perform this
function. Finally, matching on the informational communities in a clever way will enable a symmetric
'IXP-to-Member' propagation.
To structure this implementation, it helps to think about it in the following way:
Let's say, AS50869 is connected to IXP1, IXP2, IXP3 and IXP4. AS50869 has a _member_ called M1 at
IXP1, and that member is 'permitted' to reach IXP2 and IXP3, but it is 'inhibited' from reaching
IXP4. My _FreeIX Remote_ implementation now has to satisfy three main requirements (sketched in code right after this list):
1. **Ingress**: learn prefixes (from peers and members alike) at internet exchange points or
private network interconnects, and 'tag' them with the correct informational communities.
1. **Egress: Member-to-IXP**: Announce M1's prefixes to IXP2 and IXP3, but not to IXP4.
1. **Egress: IXP-to-Member**: Announce IXP2's and IXP3's prefixes to M1, but not IXP4's.
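Here's that sketch: a toy Python model of the egress decision, not the actual BIRD filter, just the
way I reason about it. A member prefix carries _permission_ and _inhibit_ tags, every session knows
its own router, country and group identity, and a prefix is propagated only if at least one
permission matches and no inhibit matches:
```
# Toy model of the 'Member-to-IXP' egress decision (illustrative only).
def matches(tag, session):
    subclass, value = tag
    return subclass == "all" or session.get(subclass) == value

def propagate(permissions, inhibits, session):
    permitted = any(matches(t, session) for t in permissions)
    inhibited = any(matches(t, session) for t in inhibits)
    return permitted and not inhibited

# The example above: member M1 may reach IXP2 and IXP3, but not IXP4.
perms    = [("group", "IXP2"), ("group", "IXP3")]
inhibits = [("group", "IXP4")]
for ixp in ("IXP2", "IXP3", "IXP4"):
    session = {"router": "chrma0", "country": "CH", "group": ixp}
    print(ixp, propagate(perms, inhibits, session))  # True, True, False
```
The 'IXP-to-Member' direction is symmetric: there, the informational communities on the prefixes
learned at the IXPs are matched against the member's permissions and inhibits.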
### Defining Countries and Routers
I'll start by giving each country which has at least one router a unique _country_id_ in a YAML
file, leaving the value 0 to mean 'all' countries:
```
$ cat config/common/countries.yaml
country:
all: 0
CH: 1
NL: 2
GR: 3
IT: 4
```
Each router has its own configuration file, and at the top, I'll define some metadata which
includes things like the country in which it operates, and its own unique _router_id_, like so:
```
$ cat config/chrma0.net.free-ix.net.yaml
device:
id: 1
hostname: chrma0.free-ix.net
shortname: chrma0
country: CH
loopbacks:
ipv4: 194.126.235.16
ipv6: "2a0b:dd80:3101::"
location: "Hofwiesenstrasse, Ruemlang, Zurich, Switzerland"
...
```
### Defining communities
Next, I define the BGP communities in `class` and `subclass` types, in the following YAML structure:
```
ebgp:
community:
legacy:
noannounce: 0
blackhole: 666
inhibit: 3000
prepend1: 3100
prepend2: 3200
prepend3: 3300
large:
class:
informational: 1000
permission: 2000
inhibit: 3000
prepend1: 3100
prepend2: 3200
prepend3: 3300
subclass:
all: 0
router: 10
country: 20
group: 30
asn: 40
```
### Defining Members
In order to keep this system manageable, I have to rely on automation. I intend to leverage the
BGP community _subclasses_ in a simple ACL system consisting of the following YAML, taking my buddy
Antonios' network as an example:
```
$ cat config/common/members.yaml
member:
210312:
description: DaKnObNET
prefix_filter: AS-SET-DNET
permission: [ router:chrma0 ]
inhibit: [ group:chix ]
...
```
The syntax of the `permission` and `inhibit` fields is identical. They are lists of key:value pairs
where the key must be one of the _subclasses_ (eg. 'router', 'country', 'group', 'asn'), and the
value appropriate for that type. In this example, AS50869 is being asked to grant permissions for
Antonios' prefixes to any peer connected to `router:chrma0`, but inhibit propagation to/from the
exchange point called `group:chix`. I could extend this list, for example by adding a permission to
`country:NL` or an inhibit to `router:grskg0` and so on.
I decide that sensible defaults are to give permissions to all, and keep inhibit empty. In other
words: be very liberal in propagation, to maximize the value that FreeIX Remote can provide its
members.
### Ingress: Learning Prefixes
With what I've defined so far, I can start to set informational BGP communities (see the snippet after this list):
* The prefixes learned on subclass **router** for `chrma0` will have a value of device.id=1:
`(50869,1010,1)`
* The prefixes learned on subclass **country** for `chrma0` will take device.country=CH and look it
up in `countries['CH']`, which yields value 1: `(50869,1020,1)`
* When learning prefixes from a given internet exchange, Kees already knows its PeeringDB
_ixp_id_, which is a unique value for each exchange point. Thus, subclass **group** for `chrma0` at
[[CommunityIX](https://www.peeringdb.com/ix/2013)] is ixp_id=2013: `(50869,1030,2013)`
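Here's the promised sketch: composing an informational community is just class plus subclass, with
the looked-up value (the helper is mine for illustration; the numbers match the bullets above):
```python
MY_ASN = 50869

def informational(subclass: int, value: int) -> tuple:
    """Compose a large community (ASN, class+subclass, value) for class 1000."""
    return (MY_ASN, 1000 + subclass, value)

print(informational(10, 1))     # router chrma0      -> (50869, 1010, 1)
print(informational(20, 1))     # country CH         -> (50869, 1020, 1)
print(informational(30, 2013))  # group CommunityIX  -> (50869, 1030, 2013)
```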
#### Ingress: Learning from members
I need to make sure that members send only the prefixes that I expect from them. To do this, I'll
make use of a common tool called [[bgpq4](https://github.com/bgp/bgpq4)] which cobbles together the
prefixes belonging to an AS-SET by referencing one or more IRR databases.
In Python, I'll prepare the Jinja context by generating the prefix filter lists like so:
```
if session["type"] == "member":
    session = {**session, **data["member"][asn]}
    pf = ebgp_merge_value(data["ebgp"], group, session, "prefix_filter", None)
    if pf:
        ctx["prefix_filter"] = {}
        pfn = pf
        pfn = pfn.replace("-", "_")
        pfn = pfn.replace(":", "_")
        for af in [4, 6]:
            filter_name = "%s_%s_IPV%d" % (groupname.upper(), pfn, af)
            filter_contents = fetch_bgpq(filter_name, pf, af, allow_morespecifics=True)
            if "[" in filter_contents:
                ctx["prefix_filter"][filter_name] = { "str": filter_contents, "af": af }
                ctx["prefix_filter_ipv%d" % af] = True
            else:
                log.warning(f"Filter {filter_name} is empty!")
                ctx["prefix_filter_ipv%d" % af] = False
```
First, if a given BGP session is of type _member_, I'll merge the `member[asn]` dictionary
into the `ebgp.group.session[asn]`. I've left out error handling for brevity, but in case the member
YAML file doesn't have an entry for the given ASN, it'll just revert to being of type _peer_.
I'll use a helper function `ebgp_merge_value()` to walk the YAML hierarchy from the member-data
enriched _session_ to the _group_ and finally to the _ebgp_ scope, looking for the existence of a
key called _prefix_filter_ and defaulting to None in case none was found. With the value of
_prefix_filter_ in hand (in this case `AS-SET-DNET`), I shell out to `bgpq4` for IPv4 and IPv6
respectively. Sometimes, there are no IPv6 prefixes (why must you be like this?!) and sometimes
there are no IPv4 prefixes (welcome to the Internet, kid!)
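The `fetch_bgpq()` helper itself isn't shown in this article; a minimal sketch of what it could
look like, based on the `bgpq4` invocations visible in the rendered example further down, might be:
```python
import subprocess

def fetch_bgpq(filter_name, as_set, af, allow_morespecifics=False):
    """Hypothetical sketch: shell out to bgpq4 and return a Bird2
    'define <filter_name> = [ ... ];' block for the given AS-SET and AF."""
    cmd = ["bgpq4", f"-Ab{af}", "-l", f"define {filter_name}", as_set]
    if allow_morespecifics:
        # allow more-specifics down to host routes, as in the rendered example
        cmd[2:2] = ["-R", "32" if af == 4 else "128"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```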
All of this, including the session and group information, is then fed as context to a
Jinja renderer, where I can use them in an _import_ filter like so:
```
{% for plname, pl in (prefix_filter | default({})).items() %}
{{pl.str}}
{% endfor %}
filter ebgp_{{group_name}}_{{their_asn}}_import {
{% if not prefix_filter_ipv4 | default(True) %}
# WARNING: No IPv4 prefix filter found
if (net.type = NET_IP4) then reject;
{% endif %}
{% if not prefix_filter_ipv6 | default(True) %}
# WARNING: No IPv6 prefix filter found
if (net.type = NET_IP6) then reject;
{% endif %}
{% for plname, pl in (prefix_filter | default({})).items() %}
{% if pl.af == 4 %}
if (net.type = NET_IP4 && ! (net ~ {{plname}})) then reject;
{% elif pl.af == 6 %}
if (net.type = NET_IP6 && ! (net ~ {{plname}})) then reject;
{% endif %}
{% endfor %}
{% if session_type is defined %}
if ! ebgp_import_{{session_type}}({{their_asn}}) then reject;
{% endif %}
# Add FreeIX Remote: Informational
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.router}},{{device.id}})); ## informational.router = {{ device.hostname }}
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.country}},{{country[device.country]}})); ## informational.country = {{ device.country }}
{% if group.peeringdb_ix.id %}
bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.group}},{{group.peeringdb_ix.id}})); ## informational.group = {{ group_name }}
{% endif %}
## NOTE(pim): More comes here, see Member-to-IXP below
accept;
}
```
Let me explain what's going on here, as the Jinja templating language that my generator uses is a bit
... chatty. The first block will print the dictionary of zero or more `prefix_filter` entries. If
the `prefix_filter` context variable doesn't exist, assume it's the empty dictionary and thus,
print no prefix lists.
Then, I create a Bird2 filter and these must each have a globally unique name. I satisfy this
requirement by giving it a name with the tuple of {group, their_asn}. The first thing this filter
does, is inspect `prefix_filter_ipv4` and `prefix_filter_ipv6`, and if they are explicitly set to
False (for example, if a member doesn't have any IRR prefixes associated with their AS-SET), then
I'll reject any prefixes from them. Then, I'll match the prefixes with the `prefix_filter`, if
provided, and reject any prefixes that aren't in the list I'm expecting on this session. Assuming
we're still good to go, I'll hand this prefix off to a function called `ebgp_import_peer()` for
peers and `ebgp_import_member()` for members, both of which ensure BGP communities are scrubbed.
```
function ebgp_import_peer(int remote_as) -> bool
{
# Scrub BGP Communities (RFC 7454 Section 11)
bgp_community.delete([(50869, *)]);
bgp_large_community.delete([(50869, *, *)]);
# Scrub BLACKHOLE community
bgp_community.delete((65535, 666));
return ebgp_import(remote_as);
}
function ebgp_import_member(int remote_as) -> bool
{
# We scrub only our own (informational, permissions) BGP Communities for members
bgp_large_community.delete([(50869,1000..2999,*)]);
return ebgp_import(remote_as);
}
```
After scrubbing the communities (peers are not allowed to set _any_ communities, and members are not
allowed to set their own informational or permissions communities, but they are allowed to inhibit
themselves or prepend, if they wish), one last check is performed by calling the underlying
`ebgp_import()`:
```
function ebgp_import(int remote_as) -> bool
{
if aspath_bogon() then return false;
if (net.type = NET_IP4 && ipv4_bogon()) then return false;
if (net.type = NET_IP6 && ipv6_bogon()) then return false;
if (net.type = NET_IP4 && ipv4_rpki_invalid()) then return false;
if (net.type = NET_IP6 && ipv6_rpki_invalid()) then return false;
# Graceful Shutdown (https://www.rfc-editor.org/rfc/rfc8326.html)
if (65535, 0) ~ bgp_community then bgp_local_pref = 0;
return true;
}
```
Here, belt-and-suspenders checks are performed: bogon AS paths, bogon IPv4/IPv6 prefixes and RPKI
invalids are filtered out. If the prefix carries the well-known community for [[BGP Graceful
Shutdown](https://www.rfc-editor.org/rfc/rfc8326.html)], I honor it and set the local preference to
zero (making sure any other available path is preferred).
OK, after all these checks are done, I am finally ready to accept the prefix from this peer or
member. It's time to add the informational communities based on the _router_id_, the router's
_country_id_ and (if this is a session at a public internet exchange point documented in PeeringDB),
the group's _ixp_id_.
#### Ingress Example: member
Here's what the rendered template looks like for Antonios' member session at CHIX:
```
# bgpq4 -Ab4 -R 32 -l 'define CHIX_AS_SET_DNET_IPV4' AS-SET-DNET
define CHIX_AS_SET_DNET_IPV4 = [
44.31.27.0/24{24,32}, 44.154.130.0/24{24,32}, 44.154.132.0/24{24,32},
147.189.216.0/21{21,32}, 193.5.16.0/22{22,32}, 212.46.55.0/24{24,32}
];
# bgpq4 -Ab6 -R 128 -l 'define CHIX_AS_SET_DNET_IPV6' AS-SET-DNET
define CHIX_AS_SET_DNET_IPV6 = [
2001:678:f5c::/48{48,128}, 2a05:dfc1:9174::/48{48,128}, 2a06:9f81:2500::/40{40,128},
2a06:9f81:2600::/40{40,128}, 2a0a:6044:7100::/40{40,128}, 2a0c:2f04:100::/40{40,128},
2a0d:3dc0::/29{29,128}, 2a12:bc0::/29{29,128}
];
filter ebgp_chix_210312_import {
if (net.type = NET_IP4 && ! (net ~ CHIX_AS_SET_DNET_IPV4)) then reject;
if (net.type = NET_IP6 && ! (net ~ CHIX_AS_SET_DNET_IPV6)) then reject;
if ! ebgp_import_member(210312) then reject;
# Add FreeIX Remote: Informational
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
## NOTE(pim): More comes here, see Member-to-IXP below
accept;
}
```
#### Ingress Example: peer
For completeness, here's a regular peer, Cloudflare, at CHIX, and I hope you agree that the Jinja
template renders down to something waaaay more readable now:
```
filter ebgp_chix_13335_import {
if ! ebgp_import_peer(13335) then reject;
# Add FreeIX Remote: Informational
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
accept;
}
```
Most sessions will actually look like this one: just learning prefixes, scrubbing inbound
communities that are nobody's business to be setting but mine, tossing weird prefixes like bogons
and then setting typically the three informational communities. I now know exactly which prefixes
are picked up at group CHIX, which ones in country Switzerland, and which ones on router `chrma0`.
### Egress: Propagating Prefixes
And with that, I've completed the 'learning' part. Let me move to the 'propagating' part. A design
goal of FreeIX Remote is to have _symmetric_ propagation. In my example above, member M1 should have
its prefixes announced at IXP2 and IXP3, and all prefixes learned at IXP2 and IXP3 should be
announced to member M1.
First, let me create a helper function in the generator. Its job is to take the symbolic
`member.*.permission` and `member.*.inhibit` lists and resolve them into a structure of numeric
values suitable for adding and matching BGP community lists. It's a bit of a beast, but I've
simplified it here; notably, I've removed all the error and exception handling for brevity:
```
def parse_member_communities(data, asn, type):
    myasn = data["ebgp"]["asn"]
    cls = data["ebgp"]["community"]["large"]["class"]
    sub = data["ebgp"]["community"]["large"]["subclass"]
    bgp_cl = []
    member = data["member"][asn]
    perms = member.get(type, [])  # e.g. [ "router:chrma0" ] for type "permission"
    for perm in perms:
        if perm == "all":
            el = { "class": int(cls[type]), "subclass": int(sub["all"]),
                   "value": 0, "description": f"{type}.all" }
            return [el]
        k, v = perm.split(":")
        if k == "country":
            country_id = data["country"][v]
            el = { "class": int(cls[type]), "subclass": int(sub["country"]),
                   "value": int(country_id), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        elif k == "asn":
            el = { "class": int(cls[type]), "subclass": int(sub["asn"]),
                   "value": int(v), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        elif k == "router":
            device_id = data["_devices"][v]["id"]
            el = { "class": int(cls[type]), "subclass": int(sub["router"]),
                   "value": int(device_id), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        elif k == "group":
            group = data["ebgp"]["groups"][v]
            if isinstance(group["peeringdb_ix"], dict):
                ix_id = group["peeringdb_ix"]["id"]
            else:
                ix_id = group["peeringdb_ix"]
            el = { "class": int(cls[type]), "subclass": int(sub["group"]),
                   "value": int(ix_id), "description": f"{type}.{k} = {v}" }
            bgp_cl.append(el)
        else:
            log.warning(f"No implementation for {type} subclass '{k}' for member AS{asn}, skipping")
    return bgp_cl
```
The essence of this function is to take a human-readable list of symbols, like 'router:chrma0', and
look up which subclass is called 'router' and which router_id belongs to 'chrma0'. It does this for
the keywords 'router', 'country', 'group' and 'asn', and for a special keyword called 'all' as well.
Running this function on Antonios' member data above reveals the following:
```
Member 210312 has permissions:
[{'class': 2000, 'subclass': 10, 'value': 1, 'description': 'permission.router = chrma0'}]
Member 210312 has inhibits:
[{'class': 3000, 'subclass': 30, 'value': 2365, 'description': 'inhibit.group = chix'}]
```
The neat thing about this is that this data will come in handy for _both_ types of propagation, and
the `parse_member_communities()` helper function returns pretty readable data, which will help in
debugging and further understanding the ultimately generated configuration.
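As a usage sketch (assuming `data` holds the fully merged YAML and the function above is in scope):
```python
permissions = parse_member_communities(data, 210312, "permission")
inhibits = parse_member_communities(data, 210312, "inhibit")
for el in permissions + inhibits:
    print(el["description"], "->", (50869, el["class"] + el["subclass"], el["value"]))
# permission.router = chrma0 -> (50869, 2010, 1)
# inhibit.group = chix -> (50869, 3030, 2365)
```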
#### Egress: Member-to-IXP
OK, when learning Antonios' prefixes, I've instructed the system to propagate them to all
sessions on router `chrma0`, except sessions on group `chix`. This means that in the direction
_from AS50869 to others_, I can do the following:
**1. Tag permissions and inhibits on ingress**
I add a tiny bit of logic using this data structure I just created above. In the import filter,
remember I added `NOTE(pim): More comes here`? After setting the informational communities, I also
add these:
```
{% if session_type == "member" %}
{% if permissions %}
# Add FreeIX Remote: Permission
{% for el in permissions %}
bgp_large_community.add(({{my_asn}},{{el.class+el.subclass}},{{el.value}})); ## {{ el.description }}
{% endfor %}
{% endif %}
{% if inhibits %}
# Add FreeIX Remote: Inhibit
{% for el in inhibits %}
bgp_large_community.add(({{my_asn}},{{el.class+el.subclass}},{{el.value}})); ## {{ el.description }}
{% endfor %}
{% endif %}
{% endif %}
```
Seeing as this block only gets rendered if the session type is _member_, let me show you what
Antonios' import filter looks like in its full glory:
```
filter ebgp_chix_210312_import {
if (net.type = NET_IP4 && ! (net ~ CHIX_AS_SET_DNET_IPV4)) then reject;
if (net.type = NET_IP6 && ! (net ~ CHIX_AS_SET_DNET_IPV6)) then reject;
if ! ebgp_import_member(210312) then reject;
# Add FreeIX Remote: Informational
bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net
bgp_large_community.add((50869,1020,1)); ## informational.country = CH
bgp_large_community.add((50869,1030,2365)); ## informational.group = chix
# Add FreeIX Remote: Permission
bgp_large_community.add((50869,2010,1)); ## permission.router = chrma0
# Add FreeIX Remote: Inhibit
bgp_large_community.add((50869,3030,2365)); ## inhibit.group = chix
accept;
}
```
Remember, the `ebgp_import_member()` helper will strip any informational (the 1000s) and permissions
(the 2000s), but it would allow Antonios to set inhibits and prepends (the 3000s) so these BGP
communities will still be allowed in. In other words, Antonios can't give himself propagation rights
(sorry, buddy!) but if he would like to make AS50869 stop sending his prefixes to, say, CommunityIX,
he could simply add the BGP community `(50869,3030,2013)` on his announcements, and that will get
honored. If he'd like AS50869 to prepend itself twice before announcing to peer AS8298, he could set
`(50869,3200,8298)` and that will also get picked up.
**2. Match permissions and inhibits on egress**
Now that all of Antonios' prefixes are tagged with permissions and inhibits, I can reveal how I
implemented the export filters for AS50869:
```
function member_prefix(int group) -> bool
{
bool permitted = false;
if (({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.all}}, 0) ~ bgp_large_community ||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.router}}, {{ device.id }}) ~ bgp_large_community ||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.country}}, {{ country[device.country] }}) ~ bgp_large_community ||
({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.group}}, group) ~ bgp_large_community) then {
permitted = true;
}
if (({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.all}}, 0) ~ bgp_large_community ||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.router}}, {{ device.id }}) ~ bgp_large_community ||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.country}}, {{ country[device.country] }}) ~ bgp_large_community ||
({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.group}}, group) ~ bgp_large_community) then {
permitted = false;
}
return (permitted);
}
function valid_prefix(int group) -> bool
{
return (source_prefix() || member_prefix(group));
}
function ebgp_export_peer(int remote_as; int group) -> bool
{
if (source != RTS_BGP && source != RTS_STATIC) then return false;
if !valid_prefix(group) then return false;
bgp_community.delete([(50869, *)]);
bgp_large_community.delete([(50869, *, *)]);
return ebgp_export(remote_as);
}
```
Starting from the bottom: the function `ebgp_export_peer()` is invoked on each peering session, and
it takes as arguments the remote AS (for example 13335 for Cloudflare) and the group (for example
2365 for CHIX). The function ensures that it's either a _static_ route or a _BGP_ route. Then it makes
sure it's a `valid_prefix()` for the group.
The `valid_prefix()` function first checks if it's one of our own (as in: AS50869's own) prefixes,
which it does by calling `source_prefix()`, which I've omitted here as it would be a distraction.
All it does is check if the prefix is in a static prefix list generated with `bgpq4` for AS50869
itself. The more interesting observation is that to be eligible, the prefix needs to be either
`source_prefix()` **or** `member_prefix(group)`.
The propagation decision for 'Member-to-IXP' actually happens in that `member_prefix()` function. It
starts off by assuming the prefix is not permitted. Then it scans all relevant _permissions_
communities which may be present in the RIB for this prefix:
- is the `all` permissions community `(50869,2000,0)` set?
- what about the `router` permission `(50869,2010,R)` for my _router_id_?
- perhaps the `country` permission `(50869,2020,C)` for my _country_id_?
- or maybe the `group` permission `(50869,2030,G)` for the _ixp_id_ that this session lives on?
If any of these conditions are true, then this prefix _might_ be permitted, so I set the variable to
True. Next, I check and see if any of the _inhibit_ communities are set, either by me (in
`members.yaml`) or by the member on the live BGP session. If any one of them matches, then I flip
the variable to False again. Once the verdict is known, I can return True or False here, which
makes its way all the way up the call stack and ultimately announces the member prefix on the BGP
session, or not. Slick!
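To make that decision logic concrete, here's a toy Python model of the Bird2 `member_prefix()`
function above (not part of the generator; the router, country and group IDs are passed in as plain
integers):
```python
def member_prefix(communities, device_id, country_id, group_id, my_asn=50869):
    """Model of the permit-then-inhibit decision over a prefix's large communities."""
    permits = {(my_asn, 2000, 0), (my_asn, 2010, device_id),
               (my_asn, 2020, country_id), (my_asn, 2030, group_id)}
    inhibits = {(my_asn, 3000, 0), (my_asn, 3010, device_id),
                (my_asn, 3020, country_id), (my_asn, 3030, group_id)}
    permitted = bool(permits & set(communities))
    if inhibits & set(communities):
        permitted = False
    return permitted

# Antonios' prefixes at chrma0 (device 1, CH=1): permitted at CommunityIX (2013),
# but not at CHIX (2365) because of the inhibit community.
antonios = [(50869, 2010, 1), (50869, 3030, 2365)]
print(member_prefix(antonios, device_id=1, country_id=1, group_id=2013))  # True
print(member_prefix(antonios, device_id=1, country_id=1, group_id=2365))  # False
```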
#### Egress: IXP-to-Member
At this point, members' prefixes get announced at the correct internet exchange points, but I need to
satisfy one more requirement: the prefixes picked up at those IXPs, should _also_ be announced to
members. For this, the helper dictionary with permissions and inhibits can be used in a clever way.
What if I held them against the informational communities? For example, since I have _permitted_
Antonios to be announced at any IXP connected to router `chrma0`, all prefixes I learned at
`chrma0` are fair game, right? But, I configured an _inhibit_ for Antonios' prefixes at CHIX. No
problem, I have an informational community for all prefixes I learned from the CHIX group!
I come to the realization that IXP-to-Member simply adds to the Member-to-IXP logic. Everything that
I would announce to a peer, I will also announce to a member. Off I go, adding one last helper
function to the BGP session Jinja template:
```
{% if session_type == "member" %}
function ebgp_export_{{group_name}}_{{their_asn}}(int remote_as; int group) -> bool
{
bool permitted = false;
if (source != RTS_BGP && source != RTS_STATIC) then return false;
if valid_prefix(group) then return ebgp_export(remote_as);
{% for el in permissions | default([]) %}
if (bgp_large_community ~ [({{ my_asn }},{{ 1000+el.subclass}},{% if el.value == 0%}*{% else %}{{el.value}}{% endif %})]) then permitted=true; ## {{el.description}}
{% endfor %}
{% for el in inhibits | default([]) %}
if (bgp_large_community ~ [({{ my_asn }},{{ 1000+el.subclass}},{% if el.value == 0%}*{% else %}{{el.value}}{% endif %})]) then permitted=false; ## {{el.description}}
{% endfor %}
if (permitted) then return ebgp_export(remote_as);
return false;
}
{% endif %}
```
Note that in essence, this new function still calls `valid_prefix()`, which in turn calls
`source_prefix()` **or** `member_prefix(group)`, so it announces the same prefixes that are also
announced to sessions of type 'peer'. But then, I'll also inspect the _informational_ communities,
where the value of `0` is replaced with a wildcard, because 'permit or inhibit all' would mean
'match any of these BGP communities'. This template renders as follows for Antonios at CHIX:
```
function ebgp_export_chix_210312(int remote_as; int group) -> bool
{
bool permitted = false;
if (source != RTS_BGP && source != RTS_STATIC) then return false;
if valid_prefix(group) then return ebgp_export(remote_as);
if (bgp_large_community ~ [(50869,1010,1)]) then permitted=true; ## permission.router = chrma0
if (bgp_large_community ~ [(50869,1030,2365)]) then permitted=false; ## inhibit.group = chix
if (permitted) then return ebgp_export(remote_as);
return false;
}
```
## Results
With this, the propagation logic is complete. Announcements are _symmetric_, that is to say the function
`ebgp_export_chix_210312()` sees to it that Antonios gets the prefixes learned at router `chrma0`
but not those learned at group `CHIX`. Similarly, the `ebgp_export_peer()` ensures that Antonios'
prefixes are propagated to any session at router `chrma0` except those sessions at group `CHIX`.
{{< image width="200px" float="right" src="/assets/vpp/fdio-color.svg" alt="VPP" >}}
I have installed VPP with [[OSPFv3]({{< ref 2024-06-22-vpp-ospf-2.md >}})] unnumbered interfaces,
so each router has exactly one IPv4 and IPv6 loopback address. The router in R&uuml;mlang has been
operational for a while, the ones in Amsterdam (nlams0.free-ix.net) and Thessaloniki
(grskg0.free-ix.net) have been deployed and are connecting to IXPs now, and the one in Milan
(itmil0.free-ix.net) has been installed but is pending physical deployment at Caldara.
I deployed a test setup with a few permissions and inhibits on the R&uuml;mlang router, with many thanks
to Jurrian, Sam and Antonios for allowing me to guinea-pig-ize their member sessions. With the
following test configuration:
```
member:
  35202:
    description: OnTheGo (Sam Aschwanden)
    prefix_filter: AS-OTG
    permission: [ router:chrma0 ]
    inhibit: [ group:comix ]
  210312:
    description: DaKnObNET
    prefix_filter: AS-SET-DNET
    permission: [ router:chrma0 ]
    inhibit: [ group:chix ]
  212635:
    description: Jurrian van Iersel
    prefix_filter: AS212635:AS-212635
    permission: [ router:chrma0 ]
    inhibit: [ group:chix, group:fogixp ]
```
I can see the following prefix learn/announce counts towards _members_:
```
pim@chrma0:~$ for i in $(birdc show protocol | grep member | cut -f1 -d' '); do echo -n $i\ ; birdc \
    show protocol all $i | grep Routes; done
chix_member_35202_ipv4_1 2 imported, 0 filtered, 159984 exported, 0 preferred
chix_member_35202_ipv6_1 2 imported, 0 filtered, 61730 exported, 0 preferred
chix_member_210312_ipv4_1 3 imported, 0 filtered, 3518 exported, 3 preferred
chix_member_210312_ipv6_1 2 imported, 0 filtered, 1251 exported, 2 preferred
comix_member_35202_ipv4_1 2 imported, 0 filtered, 159981 exported, 2 preferred
comix_member_35202_ipv4_2 2 imported, 0 filtered, 159981 exported, 1 preferred
comix_member_35202_ipv6_1 2 imported, 0 filtered, 61727 exported, 2 preferred
comix_member_35202_ipv6_2 2 imported, 0 filtered, 61727 exported, 1 preferred
fogixp_member_212635_ipv4_1 1 imported, 0 filtered, 442 exported, 1 preferred
fogixp_member_212635_ipv6_1 14 imported, 0 filtered, 181 exported, 14 preferred
freeix_ch_member_210312_ipv4_1 3 imported, 0 filtered, 3521 exported, 0 preferred
freeix_ch_member_210312_ipv6_1 2 imported, 0 filtered, 1253 exported, 0 preferred
```
Let me make a few observations:
* Hurricane Electric AS6939 is present at CHIX, and they tend to announce a very large number of
prefixes. So every member who is permitted (and not inhibited) at CHIX will see all of those: Sam's
AS35202 is inhibited on CommunityIX but not on CHIX, and he's permitted on both. That explains why
he is seeing the routes on both sessions.
* I've inhibited Jurrian's AS212635 to/from both CHIX and FogIXP, which means he will be seeing
CommunityIX (~245 IPv4, 85 IPv6 prefixes), and FreeIX CH (~173 IPv4 and ~60 IPv6). We also send him
the member prefixes, which is about 35 or so additional prefixes. This explains why Jurrian is
receiving from us ~440 IPv4 and ~180 IPv6.
* Antonios' AS210312, the exemplar in this article, is receiving all-but-CHIX. FogIXP yields 3077
or so IPv4 and 1056 IPv6 prefixes, while I've already added up FreeIX, CommunityIX, and our members
(this is what we're sending Jurrian!), at 330 resp 180, so Antonios should be getting about 3500 IPv4
prefixes and 1250 IPv6 prefixes.
In the other direction, I would expect to be announcing to _peers_ only prefixes belonging to either
AS50869 itself, or those of our members:
```
pim@chrma0:~$ for i in $(birdc show protocol | grep peer.*_1 | cut -f1 -d' '); do echo -n $i\ ; birdc \
    show protocol all $i | grep Routes || echo; done
chix_peer_212100_ipv4_1 57618 imported, 0 filtered, 24 exported, 778 preferred
chix_peer_212100_ipv6_1 21979 imported, 1 filtered, 37 exported, 7186 preferred
chix_peer_13335_ipv4_1 4767 imported, 9 filtered, 24 exported, 4765 preferred
chix_peer_13335_ipv6_1 371 imported, 1 filtered, 37 exported, 369 preferred
chix_peer_6939_ipv4_1 151787 imported, 27 filtered, 24 exported, 133943 preferred
chix_peer_6939_ipv6_1 61191 imported, 6 filtered, 37 exported, 16223 preferred
comix_peer_44596_ipv4_1 594 imported, 0 filtered, 25 exported, 10 preferred
comix_peer_44596_ipv6_1 1147 imported, 0 filtered, 50 exported, 0 preferred
comix_peer_8298_ipv4_1 23 imported, 0 filtered, 25 exported, 0 preferred
comix_peer_8298_ipv6_1 34 imported, 0 filtered, 50 exported, 0 preferred
fogixp_peer_47498_ipv4_1 3286 imported, 1 filtered, 27 exported, 3077 preferred
fogixp_peer_47498_ipv6_1 1838 imported, 0 filtered, 39 exported, 1056 preferred
freeix_ch_peer_51530_ipv4_1 355 imported, 0 filtered, 28 exported, 0 preferred
freeix_ch_peer_51530_ipv6_1 143 imported, 0 filtered, 53 exported, 0 preferred
```
Some observations:
* Nobody is inhibited at FreeIX Switzerland. It stands to reason, therefore, that it has the most
exported prefixes: 28 for IPv4 and 53 for IPv6.
* Two members are inhibited at CHIX, which gives it the lowest number of exported prefixes:
24 for IPv4 and 37 for IPv6.
* All peers at each exchange (group) will receive the same number of prefixes. I can confirm that
at CHIX, all three peers have the same number of announced prefixes. Similarly, at CommunityIX, all
peers have the same number.
* If Antonios, Sam or Jurrian were to add an additional inhibit BGP community to their announcements
towards AS50869 (e.g. `(50869,3020,1)` to inhibit country Switzerland), they could tweak these numbers.
## What's next
This all adds up. I'd like to test the waters with my friendly neighborhood canaries a little bit,
to make sure that announcements are as expected, and traffic flows where appropriate. In the meantime,
I'll chase the deployment of LSIX, FrysIX, SpeedIX and possibly a few others in Amsterdam. And of
course FreeIX Greece in Thessaloniki. I'll try to get the Milano VPP router deployed (it's already
installed and configured, but currently powered off) and connected to PCIX, MIX and a few others.
## How can you help?
If you're willing to participate with a VPP router and connect it to either multiple local internet
exchanges (like I've demonstrated in Zurich), or better yet, to one or more of the other existing
routers, I would welcome your contribution. [[Contact]({{< ref contact.md >}})] me for details.
A bit further down the pike, a connection from Amsterdam to Zurich, from Zurich to Milan and from
Milan to Thessaloniki is on the horizon. If you are willing and able to donate some bandwidth (point
to point VPWS, VLL, L2VPN) and your transport network is capable of at least 2026 bytes of _inner_
payload, please also [[reach out]({{< ref contact.md >}})] as I'm sure many small network operators
would be thrilled.
---
date: "2025-02-08T07:51:23Z"
title: 'VPP with sFlow - Part 3'
---
# Introduction
{{< image float="right" src="/assets/sflow/sflow.gif" alt="sFlow Logo" width="12em" >}}
In the second half of last year, I picked up a project together with Neil McKee of
[[inMon](https://inmon.com/)], the caretakers of [[sFlow](https://sflow.org)]: an industry standard
technology for monitoring high speed networks. `sFlow` gives complete visibility into the
use of networks enabling performance optimization, accounting/billing for usage, and defense against
security threats.
The open source software dataplane [[VPP](https://fd.io)] is a perfect match for sampling, as it
forwards packets at very high rates using underlying libraries like [[DPDK](https://dpdk.org/)] and
[[RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access)]. A clever design choice in the
so-called _Host sFlow Daemon_ [[host-sflow](https://github.com/sflow/host-sflow)] allows for
a small portion of code to _grab_ the samples, for example in a merchant silicon ASIC or FPGA, but
also in the VPP software dataplane. The agent then _transmits_ these samples using a Linux kernel
feature called [[PSAMPLE](https://github.com/torvalds/linux/blob/master/net/psample/psample.c)].
This greatly reduces the complexity of code to be implemented in the forwarding path, while at the
same time bringing consistency to the `sFlow` delivery pipeline by (re)using the `hsflowd` business
logic for the more complex state keeping, packet marshalling and transmission from the _Agent_ to a
central _Collector_.
In this third article, I wanted to spend some time discussing how samples make their way out of the
VPP dataplane, and into higher level tools.
## Recap: sFlow
{{< image float="left" src="/assets/sflow/sflow-overview.png" alt="sFlow Overview" width="14em" >}}
sFlow describes a method for Monitoring Traffic in Switched/Routed Networks, originally described in
[[RFC3176](https://datatracker.ietf.org/doc/html/rfc3176)]. The current specification is version 5
and is homed on the sFlow.org website [[ref](https://sflow.org/sflow_version_5.txt)]. Typically, a
Switching ASIC in the dataplane (seen at the bottom of the diagram to the left) is asked to copy
1-in-N packets to a local sFlow Agent.
**Sampling**: The agent will copy the first N bytes (typically 128) of the packet into a sample. As
the ASIC knows which interface the packet was received on, the `inIfIndex` will be added. After a
routing decision is made, the nexthop and its L2 address and interface become known. The ASIC might
annotate the sample with this `outIfIndex` and `DstMAC` metadata as well.
**Drop Monitoring**: There's one rather clever insight that sFlow gives: what if the packet _was
not_ routed or switched, but rather discarded? For this, sFlow is able to describe the reason for
the drop. For example, the ASIC receive queue could have been overfull, or it did not find a
destination to forward the packet to (no FIB entry), perhaps it was instructed by an ACL to drop the
packet or maybe even tried to transmit the packet but the physical datalink layer had to abandon the
transmission for whatever reason (link down, TX queue full, link saturation, and so on). It's hard
to overstate how important it is to have this so-called _drop monitoring_, as operators often spend
hours and hours figuring out _why_ packets are lost in their network or datacenter switching fabric.
**Metadata**: The agent may have other metadata as well, such as which prefix was the source and
destination of the packet, what additional RIB information is available (AS path, BGP communities,
and so on). This may be added to the sample record as well.
**Counters**: Since sFlow is sampling 1:N packets, the system can estimate total traffic in a
reasonably accurate way. Peter and Sonia wrote a succinct
[[paper](https://sflow.org/packetSamplingBasics/)] about the math, so I won't get into that here.
Mostly because I am but a software engineer, not a statistician... :) However, I will say this: if a
fraction of the traffic is sampled but the _Agent_ knows how many bytes and packets were forwarded
in total, it can provide an overview with a quantifiable accuracy. This is why the _Agent_ will
periodically get the interface counters from the ASIC.
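As a back-of-the-envelope illustration (see the paper above for the proper statistics), scaling
samples back up is just multiplication by the sampling rate N:
```python
def estimate_totals(sampled_frame_lengths, sampling_n):
    """Scale 1-in-N samples up to an estimate of total packets and bytes."""
    est_packets = len(sampled_frame_lengths) * sampling_n
    est_bytes = sum(sampled_frame_lengths) * sampling_n
    return est_packets, est_bytes

# 1'000 samples of 1518-byte frames at 1:10'000 sampling:
print(estimate_totals([1518] * 1000, 10_000))  # (10000000, 15180000000)
```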
**Collector**: One or more samples can be concatenated into UDP messages that go from the _sFlow
Agent_ to a central _sFlow Collector_. The heavy lifting in analysis is done upstream from the
switch or router, which is great for performance. Many thousands or even tens of thousands of
agents can forward their samples and interface counters to a single central collector, which in turn
can be used to draw up a near real time picture of the state of traffic through even the largest of
ISP networks or datacenter switch fabrics.
In sFlow parlance [[VPP](https://fd.io/)] and its companion
[[hsflowd](https://github.com/sflow/host-sflow)] together form an _Agent_ (it sends the UDP packets
over the network), and for example the commandline tool `sflowtool` could be a _Collector_ (it
receives the UDP packets).
## Recap: sFlow in VPP
First, I have some pretty good news to report - our work on this plugin was
[[merged](https://gerrit.fd.io/r/c/vpp/+/41680)] and will be included in the VPP 25.02 release in a
few weeks! Last weekend, I gave a lightning talk at
[[FOSDEM](https://fosdem.org/2025/schedule/event/fosdem-2025-4196-vpp-monitoring-100gbps-with-sflow/)]
in Brussels, Belgium, and caught up with a lot of community members and network- and software
engineers. I had a great time.
In trying to keep the amount of code as small as possible, and therefore the probability of bugs that
might impact VPP's dataplane stability low, the architecture of the end-to-end solution consists of
three distinct parts, each with its own risk and performance profile:
{{< image float="left" src="/assets/sflow/sflow-vpp-overview.png" alt="sFlow VPP Overview" width="18em" >}}
**1. sFlow worker node**: Its job is to do what the ASIC does in the hardware case. As VPP moves
packets from `device-input` to the `ethernet-input` nodes in its forwarding graph, the sFlow plugin
will inspect 1-in-N, taking a sample for further processing. Here, we don't try to be clever, simply
copy the `inIfIndex` and the first bytes of the ethernet frame, and append them to a
[[FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics))] queue. If too many samples
arrive, samples are dropped at the tail, and a counter incremented. This way, I can tell when the
dataplane is congested. Bounded FIFOs also provide fairness: they allow each VPP worker thread to
get its fair share of samples into the Agent's hands.
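To make the FIFO behaviour concrete, here's a toy Python model (the queue depth is a made-up number,
not the plugin's actual value):
```python
from collections import deque

class SampleFifo:
    """Bounded sample queue with tail-drop and a drop counter, as described above."""
    def __init__(self, maxlen=1024):   # the real depth is a plugin implementation detail
        self.q = deque()
        self.maxlen = maxlen
        self.dropped = 0

    def push(self, sample) -> bool:
        if len(self.q) >= self.maxlen:
            self.dropped += 1          # tail drop: congestion becomes visible
            return False
        self.q.append(sample)
        return True

    def pop(self):
        return self.q.popleft() if self.q else None
```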
**2. sFlow main process**: There's a function running on the _main thread_, which shifts further
processing time _away_ from the dataplane. This _sflow-process_ does two things. Firstly, it
consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones
in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is
configurable), it'll grab all interface counters from those interfaces for which I have sFlow
turned on. VPP produces _Netlink_ messages and sends them to the kernel.
**3. Host sFlow daemon**: The third component is external to VPP: `hsflowd` subscribes to the _Netlink_
messages. It goes without saying that `hsflowd` is a battle-hardened implementation running on
hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy,
this module already exists. But Neil implemented a _mod_vpp_ which can grab interface names and their
`ifIndex`, and counter statistics. VPP emits this data as _Netlink_ `USERSOCK` messages alongside
the PSAMPLEs.
By the way, I've written about _Netlink_ before when discussing the [[Linux Control Plane]({{< ref
2021-08-25-vpp-4 >}})] plugin. It's a mechanism for programs running in userspace to share
information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from
kernel to userspace using a _PSAMPLE_ Netlink channel. However, the pattern is that of a message
producer/subscriber relationship and nothing precludes one userspace process (`vpp`) to be the
producer while another userspace process (`hsflowd`) acts as the consumer!
Assuming the sFlow plugin in VPP produces samples and counters properly, `hsflowd` will do the rest,
giving correctness and upstream interoperability pretty much for free. That's slick!
### VPP: sFlow Configuration
The solution that I offer is based on two moving parts. First, the VPP plugin configuration, which
turns on sampling at a given rate on physical devices, also known as _hardware-interfaces_. Second,
the open source component [[host-sflow](https://github.com/sflow/host-sflow/releases)] can be
configured as of release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
I will show how to configure VPP in three ways:
***1. VPP Configuration via CLI***
```
pim@vpp0-0:~$ vppctl
vpp0-0# sflow sampling-rate 100
vpp0-0# sflow polling-interval 10
vpp0-0# sflow header-bytes 128
vpp0-0# sflow enable GigabitEthernet10/0/0
vpp0-0# sflow enable GigabitEthernet10/0/0 disable
vpp0-0# sflow enable GigabitEthernet10/0/2
vpp0-0# sflow enable GigabitEthernet10/0/3
```
The first three commands set the global defaults - in my case I'm going to be sampling at 1:100
which is an unusually high rate. A production setup may take 1-in-_linkspeed-in-megabits_ so for a
1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more
appropriate, depending on link load. The second command sets the interface stats polling interval.
The default is to gather these statistics every 20 seconds, but I set it to 10s here.
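Expressed as a tiny helper (purely illustrative, not part of VPP or `hsflowd`):
```python
def suggested_sampling_n(link_speed_mbps: int) -> int:
    """Rule of thumb from above: sample roughly 1-in-<linkspeed in Mbps>."""
    return max(1, int(link_speed_mbps))

print(suggested_sampling_n(1_000))    # 1 Gbps   -> 1:1'000
print(suggested_sampling_n(100_000))  # 100 Gbps -> 1:100'000
```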
Next, I tell the plugin how many bytes of the sampled ethernet frame should be taken. Common
values are 64 and 128 but it doesn't have to be a power of two. I want enough data to see the
headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but the contents of
the payload are rarely interesting for
statistics purposes.
Finally, I can turn on the sFlow plugin on an interface with the `sflow enable-disable` CLI. In VPP,
an idiomatic way to turn on and off things is to have an enabler/disabler. It feels a bit clunky
maybe to write `sflow enable $iface disable` but it makes more logical sense if you parse that as
"enable-disable" with the default being the "enable" operation, and the alternate being the
"disable" operation.
***2. VPP Configuration via API***
I implemented a few API methods for the most common operations. Here's a snippet that obtains the
same config as what I typed on the CLI above, but using these Python API calls:
```python
from vpp_papi import VPPApiClient, VPPApiJSONFiles
import sys
vpp_api_dir = VPPApiJSONFiles.find_api_dir([])
vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir)
vpp = VPPApiClient(apifiles=vpp_api_files, server_address="/run/vpp/api.sock")
vpp.connect("sflow-api-client")
print(vpp.api.show_version().version)
# Output: 25.06-rc0~14-g9b1c16039
vpp.api.sflow_sampling_rate_set(sampling_N=100)
print(vpp.api.sflow_sampling_rate_get())
# Output: sflow_sampling_rate_get_reply(_0=655, context=3, sampling_N=100)
vpp.api.sflow_polling_interval_set(polling_S=10)
print(vpp.api.sflow_polling_interval_get())
# Output: sflow_polling_interval_get_reply(_0=661, context=5, polling_S=10)
vpp.api.sflow_header_bytes_set(header_B=128)
print(vpp.api.sflow_header_bytes_get())
# Output: sflow_header_bytes_get_reply(_0=665, context=7, header_B=128)
vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True)
vpp.api.sflow_enable_disable(hw_if_index=2, enable_disable=True)
print(vpp.api.sflow_interface_dump())
# Output: [ sflow_interface_details(_0=667, context=8, hw_if_index=1),
# sflow_interface_details(_0=667, context=8, hw_if_index=2) ]
print(vpp.api.sflow_interface_dump(hw_if_index=2))
# Output: [ sflow_interface_details(_0=667, context=9, hw_if_index=2) ]
print(vpp.api.sflow_interface_dump(hw_if_index=1234)) ## Invalid hw_if_index
# Output: []
vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=False)
print(vpp.api.sflow_interface_dump())
# Output: [ sflow_interface_details(_0=667, context=10, hw_if_index=2) ]
```
This short program toys around a bit with the sFlow API. I first set the sampling to 1:100 and get
the current value. Then I set the polling interval to 10s and retrieve the current value again.
Finally, I set the header bytes to 128, and retrieve the value again.
Enabling and disabling sFlow on interfaces shows the idiom I mentioned before - the API being an
`*_enable_disable()` call of sorts, and typically taking a boolean argument if the operator wants to
enable (the default), or disable sFlow on the interface. Getting the list of enabled interfaces can
be done with the `sflow_interface_dump()` call, which returns a list of `sflow_interface_details`
messages.
I demonstrated VPP's Python API and how it works in a fair amount of detail in a [[previous
article]({{< ref 2024-01-27-vpp-papi >}})], in case this type of stuff interests you.
***3. VPPCfg YAML Configuration***
Writing on the CLI and calling the API is good and all, but many users of VPP have noticed that it
does not have any form of configuration persistence and that's deliberate. VPP's goal is to be a
programmable dataplane, and explicitly has left the programming and configuration as an exercise for
integrators. I have written a Python project that takes a YAML file as input and uses it to
configure (and reconfigure, on the fly) the dataplane automatically, called
[[VPPcfg](https://git.ipng.ch/ipng/vppcfg.git)]. Previously, I wrote some implementation thoughts
on its [[datamodel]({{< ref 2022-03-27-vppcfg-1 >}})] and its [[operations]({{< ref 2022-04-02-vppcfg-2
>}})] so I won't repeat that here. Instead, I will just show the configuration:
```
pim@vpp0-0:~$ cat << EOF > vppcfg.yaml
interfaces:
  GigabitEthernet10/0/0:
    sflow: true
  GigabitEthernet10/0/1:
    sflow: true
  GigabitEthernet10/0/2:
    sflow: true
  GigabitEthernet10/0/3:
    sflow: true
sflow:
  sampling-rate: 100
  polling-interval: 10
  header-bytes: 128
EOF
pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Loading configfile vppcfg.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.reconciler.write: Wrote 13 lines to /etc/vpp/config/vppcfg.vpp
[INFO ] root.main: Planning succeeded
pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp
```
The nifty thing about `vppcfg` is that if I were to change, say, the sampling-rate (setting it to
1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the `vppcfg plan` and `vppcfg
apply` stages and the VPP dataplane will be reprogrammed to reflect the newly declared configuration.
### hsflowd: Configuration
When sFlow is enabled, VPP will start to emit _Netlink_ messages of type PSAMPLE with packet samples
and of type USERSOCK with the custom messages containing interface names and counters. These latter
custom messages have to be decoded, which is done by the _mod_vpp_ module in `hsflowd`, starting
from release v2.1.11-5 [[ref](https://github.com/sflow/host-sflow/tree/v2.1.11-5)].
Here's a minimalist configuration:
```
pim@vpp0-0:~$ cat /etc/hsflowd.conf
sflow {
collector { ip=127.0.0.1 udpport=16343 }
collector { ip=192.0.2.1 namespace=dataplane }
psample { group=1 }
vpp { osIndex=off }
}
```
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
There are two important details that can be confusing at first: \
**1.** kernel network namespaces \
**2.** interface index namespaces
#### hsflowd: Network namespace
Network namespaces virtualize Linux's network stack. Upon creation, a network namespace contains only
a loopback interface, and subsequently interfaces can be moved between namespaces. Each network
namespace will have its own set of IP addresses, its own routing table, socket listing, connection
tracking table, firewall, and other network-related resources. When started by systemd, `hsflowd`
and VPP will normally both run in the _default_ network namespace.
Given this, I can conclude that when the sFlow plugin opens a Netlink channel, it will
naturally do this in the network namespace that its VPP process is running in (the _default_
namespace, normally). It is therefore important that the recipient of these Netlink messages,
notably `hsflowd`, runs in the ***same*** namespace as VPP. It's totally fine to run them together in
a different namespace (eg. a container in Kubernetes or Docker), as long as they can see each other.
It might pose a problem if the network connectivity lives in a different namespace than the default
one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface
pairs, _LIPs_, in a dataplane namespace. The main reason for doing this is to allow something like
FRR or Bird to completely govern the routing table in the kernel and keep it in-sync with the FIB in
VPP. In such a _dataplane_ network namespace, typically every interface is owned by VPP.
Luckily, `hsflowd` can attach to one (default) namespace to get the PSAMPLEs, but create a socket in
a _different_ (dataplane) namespace to send packets to a collector. This explains the second
_collector_ entry in the config-file above. Here, `hsflowd` will send UDP packets to 192.0.2.1:6343
from within the (VPP) dataplane namespace, and to 127.0.0.1:16343 in the default namespace.
#### hsflowd: osIndex
I hope the previous section made some sense, because this one will be a tad more esoteric. When
creating a network namespace, each interface will get its own uint32 interface index that identifies
it, and such an ID is typically called an `ifIndex`. It's important to note that the same number can
(and will!) occur multiple times, once for each namespace. Let me give you an example:
```
pim@summer:~$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ipng-sl state UP ...
link/ether 00:22:19:6a:46:2e brd ff:ff:ff:ff:ff:ff
altname enp1s0f0
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 900 qdisc mq master ipng-sl state DOWN ...
link/ether 00:22:19:6a:46:30 brd ff:ff:ff:ff:ff:ff
altname enp1s0f1
pim@summer:~$ ip netns exec dataplane ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN ...
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: loop0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
link/ether de:ad:00:00:00:00 brd ff:ff:ff:ff:ff:ff
3: xe1-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP ...
link/ether 00:1b:21:bd:c7:18 brd ff:ff:ff:ff:ff:ff
```
I want to draw your attention to the number at the beginning of the line. In the _default_
namespace, `ifIndex=3` corresponds to `ifName=eno2` (which has no link, it's marked `DOWN`). But in
the _dataplane_ namespace, that index corresponds to a completely different interface called
`ifName=xe1-0` (which is link `UP`).
Now, let me show you the interfaces in VPP:
```
pim@summer:~$ vppctl show int | egrep 'Name|loop0|tap0|Gigabit'
Name Idx State MTU (L3/IP4/IP6/MPLS)
GigabitEthernet4/0/0 1 up 9000/0/0/0
GigabitEthernet4/0/1 2 down 9000/0/0/0
GigabitEthernet4/0/2 3 down 9000/0/0/0
GigabitEthernet4/0/3 4 down 9000/0/0/0
TenGigabitEthernet5/0/0 5 up 9216/0/0/0
TenGigabitEthernet5/0/1 6 up 9216/0/0/0
loop0 7 up 9216/0/0/0
tap0 19 up 9216/0/0/0
```
Here, I want you to look at the second column `Idx`, which shows what VPP calls the _sw_if_index_
(the software interface index, as opposed to hardware index). Here, `ifIndex=3` corresponds to
`ifName=GigabitEthernet4/0/2`, which is neither `eno2` nor `xe1-0`. Oh my, yet _another_ namespace!
It turns out that there are three (relevant) types of namespaces at play here:
1. ***Linux network*** namespace; here using `dataplane` and `default` each with their own unique
(and overlapping) numbering.
1. ***VPP hardware*** interface namespace, also called PHYs (for physical interfaces). When VPP
first attaches to or creates network interfaces like the ones from DPDK or RDMA, these will
create an _hw_if_index_ in a list.
1. ***VPP software*** interface namespace. All interfaces (including hardware ones!) will
receive a _sw_if_index_ in VPP. A good example is sub-interfaces: if I create a sub-int on
GigabitEthernet4/0/2, it will NOT get a hardware index, but it _will_ get the next available
software index (in this example, `sw_if_index=7`).
In Linux CP, I can see a mapping from one to the other, just look at this:
```
pim@summer:~$ vppctl show lcp
lcp default netns dataplane
lcp lcp-auto-subint off
lcp lcp-sync on
lcp lcp-sync-unnumbered on
itf-pair: [0] loop0 tap0 loop0 2 type tap netns dataplane
itf-pair: [1] TenGigabitEthernet5/0/0 tap1 xe1-0 3 type tap netns dataplane
itf-pair: [2] TenGigabitEthernet5/0/1 tap2 xe1-1 4 type tap netns dataplane
itf-pair: [3] TenGigabitEthernet5/0/0.20 tap1.20 xe1-0.20 5 type tap netns dataplane
```
Those `itf-pair` describe our _LIPs_, and they have the coordinates to three things. 1) The VPP
software interface (VPP `ifName=loop0` with `sw_if_index=7`), which 2) Linux CP will mirror into the
Linux kernel using a TAP device (VPP `ifName=tap0` with `sw_if_index=19`). That TAP has one leg in
VPP (`tap0`), and another in 3) Linux (with `ifName=loop0` and `ifIndex=2` in namespace `dataplane`).
> So the tuple that fully describes a _LIP_ is `{7, 19, 'dataplane', 2}`
Climbing back out of that rabbit hole, I am now finally ready to explain the feature. When sFlow in
VPP takes its sample, it will be doing this on a PHY, that is a given interface with a specific
_hw_if_index_. When it polls the counters, it'll do it for that specific _hw_if_index_. It now has a
choice: should it share with the world the representation of *its* namespace, or should it try to be
smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the
plugin will first resolve the _sw_if_index_ belonging to that PHY, and using that, try to look up a
_LIP_ with it. If it finds one, it'll know both the namespace in which it lives as well as the
osIndex in that namespace. If it doesn't find a _LIP_, it will at least have the _sw_if_index_ at
hand, so it'll annotate the USERSOCK counter messages with this information instead.
Now, `hsflowd` has a choice to make: does it share the Linux representation and hide VPP as an
implementation detail? Or does it share the VPP dataplane _sw_if_index_? There are use cases
relevant to both, so the decision was to let the operator decide, by setting `osIndex` either `on`
(use Linux ifIndex) or `off` (use VPP _sw_if_index_).
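A toy model of that lookup chain (all of the data structures here are hypothetical stand-ins for
VPP's internal state):
```python
def reported_interface(hw_if_index, hw_to_sw, lips, os_index=True):
    """Toy model: resolve a PHY to what gets reported. If a LIP exists and
    osIndex is on, report the Linux (netns, ifIndex); else the VPP sw_if_index."""
    sw = hw_to_sw[hw_if_index]
    lip = lips.get(sw)                      # e.g. {5: ("dataplane", 3)} for xe1-0
    if os_index and lip:
        netns, os_ifindex = lip
        return ("linux", netns, os_ifindex)
    return ("vpp", None, sw)

# Hypothetical indices loosely modelled on the outputs above:
print(reported_interface(5, {5: 5}, {5: ("dataplane", 3)}))                   # ('linux', 'dataplane', 3)
print(reported_interface(5, {5: 5}, {5: ("dataplane", 3)}, os_index=False))   # ('vpp', None, 5)
```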
### hsflowd: Host Counters
Now that I understand the configuration parts of VPP and `hsflowd`, I decide to configure everything,
but without enabling sFlow on any interfaces in VPP yet. Once I start the daemon, I can see that
it sends a UDP packet every 30 seconds to the configured _collector_:
```
pim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 9000 bytes
15:34:19.695042 IP 127.0.0.1.48753 > 127.0.0.1.6343: sFlowv5,
IPv4 agent 198.19.5.16, agent-id 100000, length 716
```
The `tcpdump` I have on my Debian bookworm machines doesn't know how to decode the contents of these
sFlow packets. Actually, neither does Wireshark. I've attached a file of these mysterious packets
[[sflow-host.pcap](/assets/sflow/sflow-host.pcap)] in case you want to take a look.
Neil however gives me a tip. A full message decoder and otherwise handy Swiss army knife lives in
[[sflowtool](https://github.com/sflow/sflowtool)].
I can offer this pcap file to `sflowtool`, or let it just listen on the UDP port directly, and
it'll tell me what it finds:
```
pim@vpp0-0:~$ sflowtool -p 6343
startDatagram =================================
datagramSourceIP 127.0.0.1
datagramSize 716
unixSecondsUTC 1739112018
localtime 2025-02-09T15:40:18+0100
datagramVersion 5
agentSubId 100000
agent 198.19.5.16
packetSequenceNo 57
sysUpTime 987398
samplesInPacket 1
startSample ----------------------
sampleType_tag 0:4
sampleType COUNTERSSAMPLE
sampleSequenceNo 33
sourceId 2:1
counterBlock_tag 0:2001
adaptor_0_ifIndex 2
adaptor_0_MACs 1
adaptor_0_MAC_0 525400f00100
counterBlock_tag 0:2010
udpInDatagrams 123904
udpNoPorts 23132459
udpInErrors 0
udpOutDatagrams 46480629
udpRcvbufErrors 0
udpSndbufErrors 0
udpInCsumErrors 0
counterBlock_tag 0:2009
tcpRtoAlgorithm 1
tcpRtoMin 200
tcpRtoMax 120000
tcpMaxConn 4294967295
tcpActiveOpens 0
tcpPassiveOpens 30
tcpAttemptFails 0
tcpEstabResets 0
tcpCurrEstab 1
tcpInSegs 89120
tcpOutSegs 86961
tcpRetransSegs 59
tcpInErrs 0
tcpOutRsts 4
tcpInCsumErrors 0
counterBlock_tag 0:2008
icmpInMsgs 23129314
icmpInErrors 32
icmpInDestUnreachs 0
icmpInTimeExcds 23129282
icmpInParamProbs 0
icmpInSrcQuenchs 0
icmpInRedirects 0
icmpInEchos 0
icmpInEchoReps 32
icmpInTimestamps 0
icmpInAddrMasks 0
icmpInAddrMaskReps 0
icmpOutMsgs 0
icmpOutErrors 0
icmpOutDestUnreachs 23132467
icmpOutTimeExcds 0
icmpOutParamProbs 23132467
icmpOutSrcQuenchs 0
icmpOutRedirects 0
icmpOutEchos 0
icmpOutEchoReps 0
icmpOutTimestamps 0
icmpOutTimestampReps 0
icmpOutAddrMasks 0
icmpOutAddrMaskReps 0
counterBlock_tag 0:2007
ipForwarding 2
ipDefaultTTL 64
ipInReceives 46590552
ipInHdrErrors 0
ipInAddrErrors 0
ipForwDatagrams 0
ipInUnknownProtos 0
ipInDiscards 0
ipInDelivers 46402357
ipOutRequests 69613096
ipOutDiscards 0
ipOutNoRoutes 80
ipReasmTimeout 0
ipReasmReqds 0
ipReasmOKs 0
ipReasmFails 0
ipFragOKs 0
ipFragFails 0
ipFragCreates 0
counterBlock_tag 0:2005
disk_total 6253608960
disk_free 2719039488
disk_partition_max_used 56.52
disk_reads 11512
disk_bytes_read 626214912
disk_read_time 48469
disk_writes 1058955
disk_bytes_written 8924332032
disk_write_time 7954804
counterBlock_tag 0:2004
mem_total 8326963200
mem_free 5063872512
mem_shared 0
mem_buffers 86425600
mem_cached 827752448
swap_total 0
swap_free 0
page_in 306365
page_out 4357584
swap_in 0
swap_out 0
counterBlock_tag 0:2003
cpu_load_one 0.030
cpu_load_five 0.050
cpu_load_fifteen 0.040
cpu_proc_run 1
cpu_proc_total 138
cpu_num 2
cpu_speed 1699
cpu_uptime 1699306
cpu_user 64269210
cpu_nice 1810
cpu_system 34690140
cpu_idle 3234293560
cpu_wio 3568580
cpuintr 0
cpu_sintr 5687680
cpuinterrupts 1596621688
cpu_contexts 3246142972
cpu_steal 329520
cpu_guest 0
cpu_guest_nice 0
counterBlock_tag 0:2006
nio_bytes_in 250283
nio_pkts_in 2931
nio_errs_in 0
nio_drops_in 0
nio_bytes_out 370244
nio_pkts_out 1640
nio_errs_out 0
nio_drops_out 0
counterBlock_tag 0:2000
hostname vpp0-0
UUID ec933791-d6af-7a93-3b8d-aab1a46d6faa
machine_type 3
os_name 2
os_release 6.1.0-26-amd64
endSample ----------------------
endDatagram =================================
```
If you thought: "What an obnoxiously long paste!", then my slightly RSI-induced mouse-hand might
agree with you. But it is really cool to see that every 30 seconds, the _collector_ will receive
this form of heartbeat from the _agent_. There are a lot of vital signs in this packet, including some
non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel version
information. It's super dope!
### hsflowd: Interface Counters
Next, I'll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to
something very high (1 in 100M), and the interface polling-interval to every 10 seconds. And indeed,
every ten seconds or so I get a few packets, which I captured in
[[sflow-interface.pcap](/assets/sflow/sflow-interface.pcap)]. Most of the packets contain only one
counter record, while some contain more than one (in the PCAP, packet #9 has two). If I update the
polling-interval to every second, I can see that most of the packets have all four counters.
Those interface counters, as decoded by `sflowtool`, look like this:
```
pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \
awk '/startSample/ { on=1 } { if (on) { print $0 } } /endSample/ { on=0 }'
startSample ----------------------
sampleType_tag 0:4
sampleType COUNTERSSAMPLE
sampleSequenceNo 745
sourceId 0:3
counterBlock_tag 0:1005
ifName GigabitEthernet10/0/2
counterBlock_tag 0:1
ifIndex 3
networkType 6
ifSpeed 0
ifDirection 1
ifStatus 3
ifInOctets 858282015
ifInUcastPkts 780540
ifInMulticastPkts 0
ifInBroadcastPkts 0
ifInDiscards 0
ifInErrors 0
ifInUnknownProtos 0
ifOutOctets 1246716016
ifOutUcastPkts 975772
ifOutMulticastPkts 0
ifOutBroadcastPkts 0
ifOutDiscards 127
ifOutErrors 28
ifPromiscuousMode 0
endSample ----------------------
```
What I find particularly cool about it is that sFlow provides an automatic mapping between the
`ifName=GigabitEthernet10/0/2` (tag 0:1005) and an object (tag 0:1) which contains the
`ifIndex=3`, along with lots of packet and octet counters in both the ingress and egress direction.
This is super useful for upstream _collectors_, as they can now find the hostname, agent name and
address, and the correlation between interface names and their indexes. Noice!
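As a small example of that correlation, the name-to-index mapping can be pulled straight out of the
counter records with the same `sflowtool` and a bit of awk. This is a sketch; it relies only on the
`ifName` and `ifIndex` fields shown above, and the other three interfaces show up in the same way:
```
pim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \
  awk '/^ifName/ { name=$2 } /^ifIndex/ { print name, "-> ifIndex", $2 }' | sort -u
GigabitEthernet10/0/2 -> ifIndex 3
```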
### hsflowd: Packet Samples
Now it's time to ratchet up the packet sampling: I move it from 1:100M to 1:1000, while keeping
the interface polling-interval at 10 seconds, and I ask VPP to sample 64 bytes of each packet that it
inspects. On either side of my pet VPP instance, I start an `iperf3` run to generate some traffic. I
now see a healthy stream of sFlow packets coming in on port 6343. Every 30 seconds or so they still
contain a host counter record, and every 10 seconds a set of interface counters comes by, but mostly
these UDP packets are carrying packet samples. I've captured a few minutes of these in
[[sflow-all.pcap](/assets/sflow/sflow-all.pcap)].
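Before diving into Wireshark, a quick sanity check on the capture is to tally the different sample
types that `sflowtool` decodes; I'd expect mostly flow samples, with a counter sample every so often.
A sketch, relying only on the `sampleType` lines that sflowtool prints:
```
pim@vpp0-0:~$ sflowtool -r sflow-all.pcap | \
  awk '/^sampleType / { n[$2]++ } END { for (t in n) print t, n[t] }'
```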
Although Wireshark doesn't know how to interpret the sFlow counter messages, it _does_ know how to
interpret the sFlow sample messages, and it reveals one of them like this:
{{< image width="100%" src="/assets/sflow/sflow-wireshark.png" alt="sFlow Wireshark" >}}
Let me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1:48753
to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as
having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it's going to
send 9 samples, the first of which says it's from ifIndex=2 and at a sampling rate of 1:1000. It
then shows that sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those
are sampled. Finally, the first sampled packet starts at the blue line. It shows the SrcMAC and
DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my running
`iperf3`, booyah!
### VPP: sFlow Performance
{{< image float="right" src="/assets/sflow/sflow-lab.png" alt="sFlow Lab" width="20em" >}}
One question I get a lot about this plugin is: what is the performance impact of using sFlow? I
spent a considerable amount of time tinkering with this, and together with Neil brought the plugin
to what we both agree is the most efficient use of CPU. We could have gone a bit further, but that
would require somewhat intrusive changes to VPP's internals, and as _North of the Border_
(and the Simpsons!) would say: what we have isn't just good, it's good enough!
I've built a small testbed based on two Dell R730 machines. On the left, I have a Debian machine
running Cisco T-Rex using four quad-tengig network cards, the classic Intel X710-DA4. On the right,
I have my VPP machine called _Hippo_ (because it's always hungry for packets), with the same
hardware. I'll build two halves. On the top NIC (Te3/0/0-3 in VPP), I will install IPv4 and MPLS
forwarding on the purple circuit, and a simple Layer2 cross connect on the cyan circuit. On all four
interfaces, I will enable sFlow. Then, I will mirror this configuration on the bottom NIC
(Te130/0/0-3) in the red and green circuits, for which I will leave sFlow turned off.
To help you reproduce my results, and under the assumption that this is your jam, here's the
configuration for all of the kit:
***0. Cisco T-Rex***
```
pim@trex:~ $ cat /srv/trex/8x10.yaml
- version: 2
  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
  port_info:
    - src_mac: 00:1b:21:06:00:00
      dest_mac: 9c:69:b4:61:a1:dc   # Connected to Hippo Te3/0/0, purple
    - src_mac: 00:1b:21:06:00:01
      dest_mac: 9c:69:b4:61:a1:dd   # Connected to Hippo Te3/0/1, purple
    - src_mac: 00:1b:21:83:00:00
      dest_mac: 00:1b:21:83:00:01   # L2XC via Hippo Te3/0/2, cyan
    - src_mac: 00:1b:21:83:00:01
      dest_mac: 00:1b:21:83:00:00   # L2XC via Hippo Te3/0/3, cyan
    - src_mac: 00:1b:21:87:00:00
      dest_mac: 9c:69:b4:61:75:d0   # Connected to Hippo Te130/0/0, red
    - src_mac: 00:1b:21:87:00:01
      dest_mac: 9c:69:b4:61:75:d1   # Connected to Hippo Te130/0/1, red
    - src_mac: 9c:69:b4:85:00:00
      dest_mac: 9c:69:b4:85:00:01   # L2XC via Hippo Te130/0/2, green
    - src_mac: 9c:69:b4:85:00:01
      dest_mac: 9c:69:b4:85:00:00   # L2XC via Hippo Te130/0/3, green
pim@trex:~ $ sudo t-rex-64 -i -c 4 --cfg /srv/trex/8x10.yaml
```
When constructing the T-Rex configuration, I specifically set the destination MAC address for the L3
circuits (the purple and red ones) to Hippo's interface MAC addresses, which I can find with
`vppctl show hardware-interfaces`. This way, T-Rex does not have to ARP for the VPP endpoint. On the
L2XC circuits (the cyan and green ones), VPP does not concern itself with MAC addressing at all: it
puts its interfaces in _promiscuous_ mode and simply writes any ethernet frame it receives directly
out on the egress interface.
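To illustrate where those MAC addresses come from, this is the kind of lookup I do on Hippo; the
`Ethernet address` lines in the output are what end up as `dest_mac` in the T-Rex config above
(output omitted here, and the grep is just to cut down the noise):
```
pim@hippo:~$ vppctl show hardware-interfaces | grep -E 'TenGigabitEthernet|Ethernet address'
```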
***1. IPv4***
```
hippo# set int state TenGigabitEthernet3/0/0 up
hippo# set int state TenGigabitEthernet3/0/1 up
hippo# set int state TenGigabitEthernet130/0/0 up
hippo# set int state TenGigabitEthernet130/0/1 up
hippo# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
hippo# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
hippo# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
hippo# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
hippo# ip route add 16.0.0.0/24 via 100.64.0.0
hippo# ip route add 48.0.0.0/24 via 100.64.1.0
hippo# ip route add 16.0.2.0/24 via 100.64.4.0
hippo# ip route add 48.0.2.0/24 via 100.64.5.0
hippo# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
hippo# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
hippo# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
hippo# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static
```
By the way, one note on this last piece: I'm setting static IPv4 neighbors so that Cisco T-Rex
as well as VPP do not have to use ARP to resolve each other. You'll see above that the T-Rex
configuration also uses MAC addresses exclusively. Setting the `ip neighbor` like this allows VPP
to know where to send return traffic.
***2. MPLS***
```
hippo# mpls table add 0
hippo# set interface mpls TenGigabitEthernet3/0/0 enable
hippo# set interface mpls TenGigabitEthernet3/0/1 enable
hippo# set interface mpls TenGigabitEthernet130/0/0 enable
hippo# set interface mpls TenGigabitEthernet130/0/1 enable
hippo# mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
hippo# mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
hippo# mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
hippo# mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20
```
Here, the MPLS configuration implements a simple P-router: incoming MPLS packets with label 16
will be sent back to T-Rex on Te3/0/1 towards the specified IPv4 nexthop (for which I already know
the MAC address), with label 16 removed and new label 17 imposed, in other words a SWAP operation.
***3. L2XC***
```
hippo# set int state TenGigabitEthernet3/0/2 up
hippo# set int state TenGigabitEthernet3/0/3 up
hippo# set int state TenGigabitEthernet130/0/2 up
hippo# set int state TenGigabitEthernet130/0/3 up
hippo# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
hippo# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
hippo# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
hippo# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2
```
I've added a layer2 cross connect as well because it's computationally very cheap for VPP to receive
an L2 (ethernet) datagram and immediately transmit it on another interface. There's no FIB lookup
and not even an L2 nexthop lookup involved: VPP is just shoveling ethernet packets in-and-out as
fast as it can!
Here's what a loadtest looks like when sending 80Gbps at 192b packets on all eight interfaces:
{{< image src="/assets/sflow/sflow-lab-trex.png" alt="sFlow T-Rex" width="100%" >}}
The leftmost ports p0 <-> p1 are sending IPv4+MPLS, while ports p2 <-> p3 are sending ethernet back
and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These
four ports are my experiment, to show the CPU use of sFlow. Then, ports p4 <-> p5 and p6 <-> p7
respectively have sFlow turned off but carry the same configuration. They are my control, showing
the CPU use without sFlow.
**First conclusion**: This stuff works a treat. There is absolutely no impact on throughput at
80Gbps with 47.6Mpps, either _with_ or _without_ sFlow turned on. That's wonderful news, as it shows
that the dataplane has more CPU available than is needed for any combination of functionality.
But what _is_ the limit? For this, I'll take a deeper look at the runtime statistics, comparing the
CPU time spent and the maximum throughput achievable on a single VPP worker, thus using a single CPU
thread on this Hippo machine that has 44 cores and 44 hyperthreads. I switch the loadtester to emit
64 byte ethernet packets, the smallest I'm allowed to send.
| Loadtest | no sFlow | 1:1'000'000 | 1:10'000 | 1:1'000 | 1:100 |
|-------------|-----------|-----------|-----------|-----------|-----------|
| L2XC | 14.88Mpps | 14.32Mpps | 14.31Mpps | 14.27Mpps | 14.15Mpps |
| IPv4 | 10.89Mpps | 9.88Mpps | 9.88Mpps | 9.84Mpps | 9.73Mpps |
| MPLS | 10.11Mpps | 9.52Mpps | 9.52Mpps | 9.51Mpps | 9.45Mpps |
| ***sFlow Packets*** / 10sec | N/A | 337.42M total | 337.39M total | 336.48M total | 333.64M total |
| .. Sampled | &nbsp; | 328 | 33.8k | 336k | 3.34M |
| .. Sent | &nbsp; | 328 | 33.8k | 336k | 1.53M |
| .. Dropped | &nbsp; | 0 | 0 | 0 | 1.81M |
Here I can make a few important observations.
**Baseline**: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned off,
which implies that it has a little bit of CPU left over to do other work, if needed. With IPv4, I can
see that the throughput is actually CPU limited: 10.89Mpps can be handled by one worker. I know that
MPLS is a little bit more expensive computationally than IPv4, and that checks out: the total capacity
is 10.11Mpps for one worker when sFlow is turned off.
**Overhead**: When I turn on sFlow on an interface, VPP will insert the _sflow-node_ into the
forwarding graph between `device-input` and `ethernet-input`. This means that the sFlow node will see
_every single_ packet, and it will have to move all of these into the next node, which costs about
9.5 CPU cycles per packet. The regression on L2XC is 3.8%, but I have to note that VPP was not CPU
bound on the L2XC test, so it could spend CPU cycles that were still available before regressing
throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS, only to shuffle the
packets through the graph.
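The way I attribute those cycles, roughly, is by comparing VPP's per-node runtime counters with and
without sFlow enabled; the average clocks-per-packet column for the sFlow node is where the ~9.5
cycles show up. A sketch of the commands (the exact node name to look for depends on the plugin):
```
hippo# clear runtime
hippo# show runtime
```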
**Sampling Cost**: When I then turn up the sampling rate, the further regression is not _that_
terrible. Between 1:1'000'000 and 1:10'000, there's barely a noticeable difference. Even in the
worst case of 1:100, the regressions are all very modest: from 14.32Mpps to 14.15Mpps on L2XC, which
is only 1.2%, with 1.6% on IPv4 and 0.8% on MPLS. Of course, by using multiple hardware receive
queues and multiple RX workers per interface, the cost can be kept well in hand.
**Overload Protection**: At 1:1'000 and an effective rate of 33.65Mpps across all ports, I correctly
observe 336k samples taken and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, and they
do not all fit through the FIFO, so the plugin drops samples to protect the downstream `sflow-main`
thread and `hsflowd`. I can see that here, 1.81M samples have been dropped, while 1.53M samples made
it through. By the way, this means VPP is happily sending a whopping 153K samples/sec to the collector!
## What's Next
Now that I've seen the UDP packets from our agent to a collector on the wire, and also how
incredibly efficient the sFlow sampling implementation turned out to be, I'm super motivated to
continue the journey with higher level collector receivers like ntopng, sflow-rt or Akvorado. In an
upcoming article, I'll describe how I rolled out Akvorado at IPng, and what types of changes would
make the user experience even better (or simpler to understand, at least).
### Acknowledgements
I'd like to thank Neil McKee from inMon for his dedication to getting things right, including the
finer details such as logging, error handling, API specifications, and documentation. He has been a
true pleasure to work with and learn from. Also, thank you to the VPP developer community, notably
Benoit, Florin, Damjan, Dave and Matt, for helping with the review and getting this thing merged in
time for the 25.02 release.

View File

@ -0,0 +1,793 @@
---
date: "2025-04-09T07:51:23Z"
title: 'FrysIX eVPN: think different'
---
{{< image float="right" src="/assets/frys-ix/frysix-logo-small.png" alt="FrysIX Logo" width="12em" >}}
# Introduction
Somewhere in the far north of the Netherlands, the country where I was born, a town called Jubbega
is the home of the Frysian Internet Exchange called [[Frys-IX](https://frys-ix.net/)]. Back in 2021,
a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHEF facility, one of
the most densely populated facilities in western Europe. He was looking for a few launching
customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on
my [[bucketlist]({{< ref 2021-07-26-bucketlist.md >}})]. Arend and his IT company
[[ERITAP](https://www.eritap.com/)] took delivery of that rack in May of 2021, and this is when the
internet exchange with _Frysian roots_ was born.
In the years from 2021 until now, Arend and I have been operating the exchange with reasonable
success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs
with about ten switches in six datacenters across the Amsterdam metro area. It's shifting a cool
800Gbit of traffic or so. It's dope, and very rewarding to be able to contribute to this community!
## Frys-IX is growing
We have several members with a 2x100G LAG and, even though all inter-datacenter links are either dark
fiber or WDM, we're starting to feel the growing pains as we set our sights on the next step of growth.
You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. When it did 133.7Gbit of
traffic, Arend organized an even bigger barbecue. Obviously, the next step is 1337Gbit and joining
the infamous [[One TeraBit Club](https://github.com/tking/OneTeraBitClub)]. Thomas: we're on our
way!
It became clear that we will not be able to keep a dependable peering platform if FrysIX remains a
single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be
operationally expensive (think of all the dark fiber or WDM waves!), and brittle (think of LACP and
balancing traffic over those ports). We need to modernize in order to stay ahead of the growth
curve.
## Hello Nokia
{{< image float="right" src="/assets/frys-ix/nokia-7220-d4.png" alt="Nokia 7220-D4" width="20em" >}}
The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration,
high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity
to your data center networks and peering network environments. These devices are built around the
Broadcom _Trident_ chipset; in the case of the "D4" platform, this is a Trident4 with 28x100G and
8x400G ports. Whoot!
{{< image float="right" src="/assets/frys-ix/IXR-7220-D3.jpg" alt="Nokia 7220-D3" width="20em" >}}
What I find particularly awesome about the Trident series is their speed (total bandwidth of
12.8Tbps _per router_), low power use (without optics, the IXR-7220-D4 consumes about 150W) and
a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern
approaches to scale-out networking such as VXLAN based EVPN. At the FrysIX barbecue in September of
2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right.
That's a 32x100G router.
ERITAP has bought two (new in box) IXR-7220-D4 (8x400G,28x100G) routers, and has also acquired two
IXR-7220-D2 (48x25G,8x100G) routers. So in total, FrysIX is now the proud owner of five of these
beautiful Nokia devices. If you haven't yet, you should definitely read about these versatile
routers on the [[Nokia](https://onestore.nokia.com/asset/207599)] website, and some details of the
_merchant silicon_ switch chips in use on the
[[Broadcom](https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56880-series)]
website.
### eVPN: A small rant
{{< image float="right" src="/assets/frys-ix/FrysIX_ Topology (concept).svg" alt="Topology Concept" width="50%" >}}
First, I need to get something off my chest. Consider a topology for an internet exchange platform,
taking into account the available equipment, rackspace, power, and cross connects. Somehow, almost
every design or reference architecture I can find on the Internet, assumes folks want to build a
[[Clos network](https://en.wikipedia.org/wiki/Clos_network)], which has a topology consisting of leaf
and spine switches. The _spine_ switches have a different set of features than the _leaf_ ones;
notably, they don't have to do provider edge functionality like VXLAN encapsulation and decapsulation.
Almost all of these designs are showing how one might build a leaf-spine network for hyperscale.
**Critique 1**: my 'spine' (IXR-7220-D4 routers) must also be provider edge. Practically speaking,
in the picture above I have these beautiful Nokia IXR-7220-D4 routers, using two 400G ports to
connect between the facilities, and six 100G ports to connect the smaller breakout switches. That
would leave a _massive_ amount of capacity unused: 22x 100G and 6x400G ports, to be exact.
**Critique 2**: not all 'leaf' devices (either IXR-7220-D2 routers or Arista switches) can realistically
connect to both 'spines'. Our devices are spread out over two (and in practice, more like six)
datacenters, and it's prohibitively expensive to get 100G waves or dark fiber to create a full mesh.
It's much more economical to create a star-topology that minimizes cross-datacenter fiber spans.
**Critique 3**: Most of these 'spine-leaf' reference architectures assume that the interior gateway
protocol is eBGP in what they call the _underlay_, and on top of that, some secondary eBGP that's
called the _overlay_. Frankly, such a design makes my head spin a little bit. These designs assume
hundreds of switches, in which case making use of one AS number per switch could make sense, as iBGP
needs either a 'full mesh', or external route reflectors.
**Critique 4**: These reference designs also make an assumption that all fiber is local and while
optics and links can fail, it will be relatively rare to _drain_ a link. However, in
cross-datacenter networks, draining links for maintenance is very common, for example if the dark
fiber provider needs to perform repairs on a span that was damaged. With these eBGP-over-eBGP
connections, traffic engineering is more difficult than simply raising the OSPF (or IS-IS) cost of a
link, to reroute traffic.
Setting aside eVPN for a second, if I were to build an IP transport network, like I did when I built
[[IPng Site Local]({{< ref 2023-03-11-mpls-core.md >}})], I would use a much more intuitive
and simple (I would even dare say elegant) design:
1. Take a classic IGP like [[OSPF](https://en.wikipedia.org/wiki/Open_Shortest_Path_First)], or
perhaps [[IS-IS](https://en.wikipedia.org/wiki/IS-IS)]. There is no benefit, to me at least, to use
BGP as an IGP.
1. I would give each of the links between the switches an IPv4 /31 and enable link-local, and give
each switch a loopback address with a /32 IPv4 and a /128 IPv6.
1. If I had multiple links between two given switches, I would probably just use ECMP if my devices
supported it, and fall back to a LACP signaled bundle-ethernet otherwise.
1. If I were to need to use BGP (and for eVPN, this need exists), taking the ISP mindset (as opposed
to the datacenter fabric mindset), I would simply install iBGP against two or three route
reflectors, and exchange routing information within the same single AS number.
### eVPN: A demo topology
{{< image float="right" src="/assets/frys-ix/Nokia Arista VXLAN.svg" alt="Demo topology" width="50%" >}}
So, that's exactly how I'm going to approach the FrysIX eVPN design: OSPF for the underlay and iBGP
for the overlay! I have a feeling that some folks will despise me for being contrarian, but you can
leave your comments below, and don't forget to like-and-subscribe :-)
Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two
400G-capable routers and connects them. Then he takes an Arista DCS-7060CX switch, which is eVPN
capable, with 32x100G ports, based on the Broadcom Tomahawk chipset, and a smaller Nokia
IXR-7220-D2 with 48x25G and 8x100G ports, based on the Trident3 chipset. He wires all of this up
to look like the picture on the right.
#### Underlay: Nokia's SR Linux
We boot up the equipment, verify that all the optics and links are up, and connect the management
ports to an OOB network that I can remotely log in to. This is the first time that either of us has
worked on Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek.
```
[pim@nikhef ~]$ sr_cli
--{ running }--[ ]--
A:pim@nikhef# enter candidate
--{ candidate shared default }--[ ]--
A:pim@nikhef# set / interface lo0 admin-state enable
A:pim@nikhef# set / interface lo0 subinterface 0 admin-state enable
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 admin-state enable
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
A:pim@nikhef# commit stay
```
There, my first config snippet! This creates a _loopback_ interface and, similar to JunOS, a
_subinterface_ (which Juniper calls a _unit_) which enables IPv4 and gives it a /32 address. In SR
Linux, any interface has to be associated with a _network-instance_, think of those as routing
domains or VRFs. There's a conveniently named _default_ network-instance, which I'll add this and
the point-to-point interface between the two 400G routers to:
```
A:pim@nikhef# info flat interface ethernet-1/29
set / interface ethernet-1/29 admin-state enable
set / interface ethernet-1/29 subinterface 0 admin-state enable
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
A:pim@nikhef# set / network-instance default type default
A:pim@nikhef# set / network-instance default admin-state enable
A:pim@nikhef# set / network-instance default interface ethernet-1/29.0
A:pim@nikhef# set / network-instance default interface lo0.0
A:pim@nikhef# commit stay
```
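Mirroring this on the second IXR-7220-D4, called _equinix_, is a near copy. Here's a sketch, assuming
its 400G link also lands on `ethernet-1/29` on that side:
```
A:pim@equinix# set / interface lo0 admin-state enable
A:pim@equinix# set / interface lo0 subinterface 0 admin-state enable
A:pim@equinix# set / interface lo0 subinterface 0 ipv4 admin-state enable
A:pim@equinix# set / interface lo0 subinterface 0 ipv4 address 198.19.16.0/32
A:pim@equinix# set / interface ethernet-1/29 admin-state enable
A:pim@equinix# set / interface ethernet-1/29 subinterface 0 admin-state enable
A:pim@equinix# set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
A:pim@equinix# set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
A:pim@equinix# set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.0/31
A:pim@equinix# set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
A:pim@equinix# set / network-instance default type default
A:pim@equinix# set / network-instance default admin-state enable
A:pim@equinix# set / network-instance default interface ethernet-1/29.0
A:pim@equinix# set / network-instance default interface lo0.0
A:pim@equinix# commit stay
```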
Cool. With the _equinix_ router configured along those lines (loopback address 198.19.16.0/32 and
198.19.17.0/31 on the 400G point-to-point), I should be able to do my first jumboframe ping:
```
A:pim@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do
Using network instance default
PING 198.19.17.1 (198.19.17.1) 9162(9190) bytes of data.
9170 bytes from 198.19.17.1: icmp_seq=1 ttl=64 time=0.466 ms
9170 bytes from 198.19.17.1: icmp_seq=2 ttl=64 time=0.477 ms
9170 bytes from 198.19.17.1: icmp_seq=3 ttl=64 time=0.547 ms
```
#### Underlay: SR Linux OSPF
OK, let's get these two Nokia routers to speak OSPF, so that they can reach each other's loopback.
It's really easy:
```
A:pim@nikhef# / network-instance default protocols ospf instance default
--{ candidate shared default }--[ network-instance default protocols ospf instance default ]--
A:pim@nikhef# set admin-state enable
A:pim@nikhef# set version ospf-v2
A:pim@nikhef# set router-id 198.19.16.1
A:pim@nikhef# set area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
A:pim@nikhef# set area 0.0.0.0 interface lo0.0 passive true
A:pim@nikhef# commit stay
```
Similar to JunOS, I can descend into a configuration scope: the first line goes into the
_network-instance_ called `default`, then the _protocols_ called `ospf`, and then the _instance_
called `default`. Subsequent `set` commands operate at this scope. Once I commit this configuration
(on the _nikhef_ router and also on the _equinix_ router, with its own unique router-id), OSPF quickly
springs into action:
```
A:pim@nikhef# show network-instance default protocols ospf neighbor
=========================================================================================
Net-Inst default OSPFv2 Instance default Neighbors
=========================================================================================
+---------------------------------------------------------------------------------------+
| Interface-Name Rtr Id State Pri RetxQ Time Before Dead |
+=======================================================================================+
| ethernet-1/29.0 198.19.16.0 full 1 0 36 |
+---------------------------------------------------------------------------------------+
-----------------------------------------------------------------------------------------
No. of Neighbors: 1
=========================================================================================
A:pim@nikhef# show network-instance default route-table all | more
IPv4 unicast route table of network instance default
+------------------+-----+------------+--------------+--------+----------+--------+------+-------------+-----------------+
| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop | Next-hop |
| | | | | | Network | | | (Type) | Interface |
| | | | | | Instance | | | | |
+==================+=====+============+==============+========+==========+========+======+=============+=================+
| 198.19.16.0/32 | 0 | ospfv2 | ospf_mgr | True | default | 1 | 10 | 198.19.17.0 | ethernet-1/29.0 |
| | | | | | | | | (direct) | |
| 198.19.16.1/32 | 7 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
| 198.19.17.0/31 | 6 | local | net_inst_mgr | True | default | 0 | 0 | 198.19.17.1 | ethernet-1/29.0 |
| | | | | | | | | (direct) | |
| 198.19.17.1/32 | 6 | host | net_inst_mgr | True | default | 0 | 0 | None | None |
+==================+=====+============+==============+========+==========+========+======+=============+=================+
A:pim@nikhef# ping network-instance default 198.19.16.0
Using network instance default
PING 198.19.16.0 (198.19.16.0) 56(84) bytes of data.
64 bytes from 198.19.16.0: icmp_seq=1 ttl=64 time=0.484 ms
64 bytes from 198.19.16.0: icmp_seq=2 ttl=64 time=0.663 ms
```
Delicious! OSPF has learned the loopback, and it is now reachable. As with most things, going from 0
to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. Going from
1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on, going
from 2 to N is easy. In my case, enabling several other point-to-point /31 transit networks on the
_nikhef_ router, using `ethernet-1/1.0` through `ethernet-1/4.0` with the correct MTU and turning on
OSPF for these, makes the whole network shoot to life. Slick!
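Each of those additional legs is a repeat of the same recipe. As a sketch, here's the one towards
the Arista, which (as its own config will show further down) sits on nikhef's `ethernet-1/2` with
198.19.17.4/31:
```
A:pim@nikhef# set / interface ethernet-1/2 admin-state enable
A:pim@nikhef# set / interface ethernet-1/2 subinterface 0 admin-state enable
A:pim@nikhef# set / interface ethernet-1/2 subinterface 0 ip-mtu 9190
A:pim@nikhef# set / interface ethernet-1/2 subinterface 0 ipv4 admin-state enable
A:pim@nikhef# set / interface ethernet-1/2 subinterface 0 ipv4 address 198.19.17.4/31
A:pim@nikhef# set / network-instance default interface ethernet-1/2.0
A:pim@nikhef# set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/2.0 interface-type point-to-point
A:pim@nikhef# commit stay
```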
#### Underlay: Arista
I'll point out that one of the devices in this topology is an Arista. We have several of these ready
for deployment at FrysIX. They are a lot more affordable and easier to find on the second hand /
refurbished market. These switches come with 32x100G ports, and are really good at packet slinging
because they're based on the Broadcom _Tomahawk_ chipset. They pack a few fewer features than the
_Trident_ chipset that powers the Nokia, but they happen to have all the features we need to run our
internet exchange. So I turn my attention to the Arista in the topology. I am much more
comfortable configuring the whole thing here, as it's not my first time touching these devices:
```
arista-leaf#show run int loop0
interface Loopback0
ip address 198.19.16.2/32
ip ospf area 0.0.0.0
arista-leaf#show run int Ethernet32/1
interface Ethernet32/1
description Core: Connected to nikhef:ethernet-1/2
load-interval 1
mtu 9190
no switchport
ip address 198.19.17.5/31
ip ospf cost 1000
ip ospf network point-to-point
ip ospf area 0.0.0.0
arista-leaf#show run section router ospf
router ospf 65500
router-id 198.19.16.2
redistribute connected
network 198.19.0.0/16 area 0.0.0.0
max-lsa 12000
```
I complete the configuration for the other two interfaces on this Arista: port Eth31/1 also connects
to the _nikhef_ IXR-7220-D4 and gets a high cost of 1000, while Eth30/1 connects 1x100G to
the _nokia-leaf_ IXR-7220-D2 with a cost of 10.
It's nice to see OSPF in action: there are two equal-cost (but high-cost) OSPF paths via
router-id 198.19.16.1 (nikhef), and there's one lower cost path via router-id 198.19.16.3
(_nokia-leaf_). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nikhef ->
equinix). Dope!
```
arista-leaf#show ip ospf nei
Neighbor ID Instance VRF Pri State Dead Time Address Interface
198.19.16.1 65500 default 1 FULL 00:00:36 198.19.17.4 Ethernet32/1
198.19.16.3 65500 default 1 FULL 00:00:31 198.19.17.11 Ethernet30/1
198.19.16.1 65500 default 1 FULL 00:00:35 198.19.17.2 Ethernet31/1
arista-leaf#traceroute 198.19.16.0
traceroute to 198.19.16.0 (198.19.16.0), 30 hops max, 60 byte packets
1 198.19.17.11 (198.19.17.11) 0.220 ms 0.150 ms 0.206 ms
2 198.19.17.6 (198.19.17.6) 0.169 ms 0.107 ms 0.099 ms
3 198.19.16.0 (198.19.16.0) 0.434 ms 0.346 ms 0.303 ms
```
So far, so good! The _underlay_ is up, every router can reach every other router on its loopback,
and all OSPF adjacencies are formed. I'll leave the 2x100G between _nikhef_ and _arista-leaf_ at
high cost for now.
#### Overlay EVPN: SR Linux
The big-picture idea here is to use iBGP with the same private AS number and, because there are two
main facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as a
route-reflector for the others. It means that they will have an iBGP session amongst themselves
(198.19.16.0 <-> 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the
198.19.16.0/24 subnet. This way, I don't have to configure any more than strictly necessary on the
core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core
routers. I proceed to configure BGP on the Nokias like this:
```
A:pim@nikhef# / network-instance default protocols bgp
A:pim@nikhef# set admin-state enable
A:pim@nikhef# set autonomous-system 65500
A:pim@nikhef# set router-id 198.19.16.1
A:pim@nikhef# set dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
A:pim@nikhef# set afi-safi evpn admin-state enable
A:pim@nikhef# set preference ibgp 170
A:pim@nikhef# set route-advertisement rapid-withdrawal true
A:pim@nikhef# set route-advertisement wait-for-fib-install false
A:pim@nikhef# set group overlay peer-as 65500
A:pim@nikhef# set group overlay afi-safi evpn admin-state enable
A:pim@nikhef# set group overlay afi-safi ipv4-unicast admin-state disable
A:pim@nikhef# set group overlay afi-safi ipv6-unicast admin-state disable
A:pim@nikhef# set group overlay local-as as-number 65500
A:pim@nikhef# set group overlay route-reflector client true
A:pim@nikhef# set group overlay transport local-address 198.19.16.1
A:pim@nikhef# set neighbor 198.19.16.0 admin-state enable
A:pim@nikhef# set neighbor 198.19.16.0 peer-group overlay
A:pim@nikhef# commit stay
```
I can see that iBGP sessions establish between all the devices:
```
A:pim@nikhef# show network-instance default protocols bgp neighbor
---------------------------------------------------------------------------------------------------------------------------
BGP neighbor summary for network-instance "default"
Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow
---------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
| Net-Inst | Peer | Group | Flags | Peer-AS | State | Uptime | AFI/SAFI | [Rx/Active/Tx] |
+=============+=============+==========+=======+==========+=============+===============+============+====================+
| default | 198.19.16.0 | overlay | S | 65500 | established | 0d:0h:2m:32s | evpn | [0/0/0] |
| default | 198.19.16.2 | overlay | D | 65500 | established | 0d:0h:2m:27s | evpn | [0/0/0] |
| default | 198.19.16.3 | overlay | D | 65500 | established | 0d:0h:2m:41s | evpn | [0/0/0] |
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
---------------------------------------------------------------------------------------------------------------------------
Summary:
1 configured neighbors, 1 configured sessions are established, 0 disabled peers
2 dynamic peers
```
A few things to note here: there is one _configured_ neighbor (this is the other IXR-7220-D4 router)
and two _dynamic_ peers, which are the Arista and the smaller IXR-7220-D2 router. The only address
family that they are exchanging information for is the _evpn_ family, and no prefixes have been
learned or sent yet, shown by the `[0/0/0]` designation in the last column.
#### Overlay EVPN: Arista
The Arista is also remarkably straightforward to configure. Here, I'll simply enable the iBGP
session as follows:
```
arista-leaf#show run section bgp
router bgp 65500
neighbor evpn peer group
neighbor evpn remote-as 65500
neighbor evpn update-source Loopback0
neighbor evpn ebgp-multihop 3
neighbor evpn send-community extended
neighbor evpn maximum-routes 12000 warning-only
neighbor 198.19.16.0 peer group evpn
neighbor 198.19.16.1 peer group evpn
!
address-family evpn
neighbor evpn activate
arista-leaf#show bgp summary
BGP summary information for VRF default
Router identifier 198.19.16.2, local AS number 65500
Neighbor AS Session State AFI/SAFI AFI/SAFI State NLRI Rcd NLRI Acc
----------- ----------- ------------- ----------------------- -------------- ---------- ----------
198.19.16.0 65500 Established IPv4 Unicast Advertised 0 0
198.19.16.0 65500 Established L2VPN EVPN Negotiated 0 0
198.19.16.1 65500 Established IPv4 Unicast Advertised 0 0
198.19.16.1 65500 Established L2VPN EVPN Negotiated 0 0
```
On this leaf node, I'll have redundant iBGP sessions with the two core nodes. Since those core
nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No
matter how many additional Arista (or Nokia) devices I add to the network, all they'll have to do is
enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both core routers.
Voila!
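For completeness, here's a sketch of what that looks like on the _nokia-leaf_ router, reusing the
syntax from the core routers but with two plain neighbors instead of the route-reflector and
dynamic-neighbor parts (the full configurations are linked at the end of this article):
```
A:pim@nokia-leaf# / network-instance default protocols bgp
A:pim@nokia-leaf# set admin-state enable
A:pim@nokia-leaf# set autonomous-system 65500
A:pim@nokia-leaf# set router-id 198.19.16.3
A:pim@nokia-leaf# set afi-safi evpn admin-state enable
A:pim@nokia-leaf# set group overlay peer-as 65500
A:pim@nokia-leaf# set group overlay afi-safi evpn admin-state enable
A:pim@nokia-leaf# set group overlay local-as as-number 65500
A:pim@nokia-leaf# set group overlay transport local-address 198.19.16.3
A:pim@nokia-leaf# set neighbor 198.19.16.0 admin-state enable
A:pim@nokia-leaf# set neighbor 198.19.16.0 peer-group overlay
A:pim@nokia-leaf# set neighbor 198.19.16.1 admin-state enable
A:pim@nokia-leaf# set neighbor 198.19.16.1 peer-group overlay
A:pim@nokia-leaf# commit stay
```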
#### VXLAN EVPN: SR Linux
Nokia documentation informs me that SR Linux uses a special interface called _system0_ to source its
VXLAN traffic from, and that this interface should be added to the _default_ network-instance. So
it's a matter of defining that interface and associating a VXLAN interface with it, like so:
```
A:pim@nikhef# set / interface system0 admin-state enable
A:pim@nikhef# set / interface system0 subinterface 0 admin-state enable
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 admin-state enable
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
A:pim@nikhef# set / network-instance default interface system0.0
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
A:pim@nikhef# commit stay
```
This creates the plumbing for a VXLAN sub-interface called `vxlan1.2604` which will accept/send
traffic using VNI 2604 (this happens to be the VLAN id we use at FrysIX for our production Peering
LAN), and it'll use the `system0.0` address to source that traffic from.
The second part is to create what SR Linux calls a MAC-VRF and put some interface(s) in it:
```
A:pim@nikhef# set / interface ethernet-1/9 admin-state enable
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
A:pim@nikhef# set / interface ethernet-1/9/3 admin-state enable
A:pim@nikhef# set / interface ethernet-1/9/3 vlan-tagging true
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 type bridged
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 admin-state enable
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
A:pim@nikhef# / network-instance peeringlan
A:pim@nikhef# set type mac-vrf
A:pim@nikhef# set admin-state enable
A:pim@nikhef# set interface ethernet-1/9/3.0
A:pim@nikhef# set vxlan-interface vxlan1.2604
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 admin-state enable
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 evi 2604
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
A:pim@nikhef# commit stay
```
In the first block here, Arend took what is a 100G port called `ethernet-1/9` and broke it out into
four ports. He forced the port speed to 10G because he used a 40G-to-4x10G DAC, and it happens that
the third lane is plugged into the Debian machine. So on `ethernet-1/9/3` I'll create a
sub-interface, make it type _bridged_ (which I've also done on `vxlan1.2604`!) and allow any
untagged traffic to enter it.
{{< image width="5em" float="left" src="/assets/shared/brain.png" alt="brain" >}}
If you, like me, are used to either VPP or IOS/XR, this type of sub-interface stuff should feel very
natural to you. I've written about the sub-interfaces logic on Cisco's IOS/XR and VPP approach in a
previous [[article]({{< ref 2022-02-14-vpp-vlan-gym.md >}})] which my buddy Fred lovingly calls
_VLAN Gymnastics_ because the ports are just so damn flexible. Worth a read!
The second block creates a new _network-instance_ which I'll name `peeringlan`. It associates the
newly created untagged sub-interface `ethernet-1/9/3.0` with the VXLAN interface, and it starts a
bgp-evpn protocol instance which instructs traffic in and out of this network-instance to use EVI
2604 on the VXLAN sub-interface, and which signals all learned MAC addresses using the specified
route-distinguisher and import/export route-targets. For simplicity I've just used the same value
for each: 65500:2604.
I continue to add an interface to the `peeringlan` _network-instance_ on the other two Nokia
routers: `ethernet-1/9/3.0` on the _equinix_ router and `ethernet-1/9.0` on the _nokia-leaf_ router.
Each of these goes to a 10Gbps port on a Debian machine.
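On the _nokia-leaf_ router, which doesn't need the breakout, that boils down to roughly the
following sketch. The `vxlan1.2604` tunnel-interface, `system0` (with its own address, 198.19.18.3)
and the `peeringlan` bgp-evpn/bgp-vpn settings are the same as shown for _nikhef_ above:
```
A:pim@nokia-leaf# set / interface ethernet-1/9 admin-state enable
A:pim@nokia-leaf# set / interface ethernet-1/9 vlan-tagging true
A:pim@nokia-leaf# set / interface ethernet-1/9 subinterface 0 type bridged
A:pim@nokia-leaf# set / interface ethernet-1/9 subinterface 0 admin-state enable
A:pim@nokia-leaf# set / interface ethernet-1/9 subinterface 0 vlan encap untagged
A:pim@nokia-leaf# set / network-instance peeringlan interface ethernet-1/9.0
A:pim@nokia-leaf# commit stay
```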
#### VXLAN EVPN: Arista
At this point I'm feeling pretty bullish about the whole project. Arista does not make it very
difficult for me to configure it for L2 EVPN (which is called MAC-VRF here as well):
```
arista-leaf#conf t
vlan 2604
name v-peeringlan
interface Ethernet9/3
speed forced 10000full
switchport access vlan 2604
interface Loopback1
ip address 198.19.18.2/32
interface Vxlan1
vxlan source-interface Loopback1
vxlan udp-port 4789
vxlan vlan 2604 vni 2604
```
After creating VLAN 2604 and making port Eth9/3 an access port in that VLAN, I'll add a VTEP source
interface called `Loopback1`, and a VXLAN interface that uses it to source its traffic. Here, I'll
associate local VLAN 2604 with `Vxlan1` and its VNI 2604, to match up with how I configured the
Nokias previously.
Finally, it's a matter of tying these together by announcing the MAC addresses into the EVPN iBGP
sessions:
```
arista-leaf#conf t
router bgp 65500
vlan 2604
rd 65500:2604
route-target both 65500:2604
redistribute learned
!
```
### Results
To validate the configurations, I learn a cool trick from my buddy Andy on the SR Linux discord
server. In EOS, I can ask it to check for any obvious mistakes in two places:
```
arista-leaf#show vxlan config-sanity detail
Category Result Detail
---------------------------------- -------- --------------------------------------------------
Local VTEP Configuration Check OK
Loopback IP Address OK
VLAN-VNI Map OK
Flood List OK
Routing OK
VNI VRF ACL OK
Decap VRF-VNI Map OK
VRF-VNI Dynamic VLAN OK
Remote VTEP Configuration Check OK
Remote VTEP OK
Platform Dependent Check OK
VXLAN Bridging OK
VXLAN Routing OK VXLAN Routing not enabled
CVX Configuration Check OK
CVX Server OK Not in controller client mode
MLAG Configuration Check OK Run 'show mlag config-sanity' to verify MLAG config
Peer VTEP IP OK MLAG peer is not connected
MLAG VTEP IP OK
Peer VLAN-VNI OK
Virtual VTEP IP OK
MLAG Inactive State OK
arista-leaf#show bgp evpn sanity detail
Category Check Status Detail
-------- -------------------- ------ ------
General Send community OK
General Multi-agent mode OK
General Neighbor established OK
L2 MAC-VRF route-target OK
import and export
L2 MAC-VRF OK
route-distinguisher
L2 MAC-VRF redistribute OK
L2 MAC-VRF overlapping OK
VLAN
L2 Suppressed MAC OK
VXLAN VLAN to VNI map for OK
MAC-VRF
VXLAN VRF to VNI map for OK
IP-VRF
```
#### Results: Arista view
Inspecting the MAC addresses learned from all four of the client ports on the Debian machine is
easy:
```
arista-leaf#show bgp evpn summary
BGP summary information for VRF default
Router identifier 198.19.16.2, local AS number 65500
Neighbor Status Codes: m - Under maintenance
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
198.19.16.0 4 65500 3311 3867 0 0 18:06:28 Estab 7 7
198.19.16.1 4 65500 3308 3873 0 0 18:06:28 Estab 7 7
arista-leaf#show bgp evpn vni 2604 next-hop 198.19.18.3
BGP routing table information for VRF default
Router identifier 198.19.16.2, local AS number 65500
Route status codes: * - valid, > - active, S - Stale, E - ECMP head, e - ECMP
c - Contributing to ECMP, % - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop
Network Next Hop Metric LocPref Weight Path
* >Ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
* ec RD: 65500:2604 mac-ip e43a.6e5f.0c59
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
* >Ec RD: 65500:2604 imet 198.19.18.3
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1
* ec RD: 65500:2604 imet 198.19.18.3
198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0
```
There's a lot to unpack here! Using the _route-distinguisher_ I configured on all the sessions, the
Arista is learning one MAC address behind neighbor 198.19.18.3 (this is the VTEP for the
_nokia-leaf_ router), from both iBGP sessions. The MAC address is learned from originator
198.19.16.3 (the loopback of the _nokia-leaf_ router), via two cluster members: the active one on
iBGP speaker 198.19.16.1 (_nikhef_) and a backup on 198.19.16.0 (_equinix_).
I can also see that there are a bunch of `imet` route entries, and Andy explained these to me. They
are a signal from a VTEP participant that they are interested in seeing multicast traffic (like
neighbor discovery or ARP requests) flooded to them. Every router participating in this L2VPN will
raise such an `imet` route, which I'll see in duplicate as well (one from each iBGP session). This
checks out.
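To see how those routes land in the dataplane, EOS can also show the remote VTEPs it has discovered
and the MAC addresses it has learned behind them. I'd use something like the following (output
omitted; command availability may vary a bit by EOS version):
```
arista-leaf#show vxlan vtep
arista-leaf#show vxlan address-table
```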
#### Results: SR Linux view
The Nokia IXR-7220-D4 router called _equinix_ has also learned a bunch of EVPN routing entries,
which I can inspect as follows:
```
A:pim@equinix# show network-instance default protocols bgp routes evpn route-type summary
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Show report for the BGP route table of network-instance "default"
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Status codes: u=used, *=valid, >=best, x=stale, b=backup
Origin codes: i=IGP, e=EGP, ?=incomplete
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
BGP Router ID: 198.19.16.0 AS: 65500 Local AS: 65500
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Type 2 MAC-IP Advertisement Routes
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
| Status | Route- | Tag-ID | MAC-address | IP-address | neighbor | Path-| Next-Hop | Label | ESI | MAC Mobility |
| | distinguisher | | | | | id | | | | |
+========+===============+========+===================+============+=============+======+============-+========+================================+==================+
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:57 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.1 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.2 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
| * | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
| u*> | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.3 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - |
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Type 3 Inclusive Multicast Ethernet Tag Routes
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
| Status | Route-distinguisher | Tag-ID | Originator-IP | neighbor | Path- | Next-Hop |
| | | | | | id | |
+========+=============================+========+=====================+=================+========+=======================+
| u*> | 65500:2604 | 0 | 198.19.18.1 | 198.19.16.1 | 0 | 198.19.18.1 |
| * | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.1 | 0 | 198.19.18.2 |
| u*> | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.2 | 0 | 198.19.18.2 |
| * | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.1 | 0 | 198.19.18.3 |
| u*> | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.3 | 0 | 198.19.18.3 |
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
--------------------------------------------------------------------------------------------------------------------------
0 Ethernet Auto-Discovery routes 0 used, 0 valid
5 MAC-IP Advertisement routes 3 used, 5 valid
5 Inclusive Multicast Ethernet Tag routes 3 used, 5 valid
0 Ethernet Segment routes 0 used, 0 valid
0 IP Prefix routes 0 used, 0 valid
0 Selective Multicast Ethernet Tag routes 0 used, 0 valid
0 Selective Multicast Membership Report Sync routes 0 used, 0 valid
0 Selective Multicast Leave Sync routes 0 used, 0 valid
--------------------------------------------------------------------------------------------------------------------------
```
I have to say, SR Linux output is incredibly verbose! But I can see all the relevant bits and bobs
here. Each MAC-IP entry is accounted for: I can see nexthops pointing at the nikhef router, at the
nokia-leaf router and at the Arista switch. I also see the `imet` entries. One thing to note -- the
SR Linux implementation fills the IP field of the type-2 routes with 0.0.0.0, while the Arista (in
my opinion, more correctly) leaves it as NULL (unspecified). But, everything looks great!
#### Results: Debian view
There's one more thing to show, and that's kind of the 'proof is in the pudding' moment. As I said,
Arend hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+
connections. This network card is a regular in my AS8298 network, as it has excellent DPDK support
and can easily pump 40Mpps with VPP. IPng 🥰 Intel X710!
```
root@debian:~ # ip netns add nikhef
root@debian:~ # ip link set enp1s0f0 netns nikhef
root@debian:~ # ip netns exec nikhef ip link set enp1s0f0 up mtu 9000
root@debian:~ # ip netns exec nikhef ip addr add 192.0.2.10/24 dev enp1s0f0
root@debian:~ # ip netns exec nikhef ip addr add 2001:db8::10/64 dev enp1s0f0
root@debian:~ # ip netns add arista-leaf
root@debian:~ # ip link set enp1s0f1 netns arista-leaf
root@debian:~ # ip netns exec arista-leaf ip link set enp1s0f1 up mtu 9000
root@debian:~ # ip netns exec arista-leaf ip addr add 192.0.2.11/24 dev enp1s0f1
root@debian:~ # ip netns exec arista-leaf ip addr add 2001:db8::11/64 dev enp1s0f1
root@debian:~ # ip netns add nokia-leaf
root@debian:~ # ip link set enp1s0f2 netns nokia-leaf
root@debian:~ # ip netns exec nokia-leaf ip link set enp1s0f2 up mtu 9000
root@debian:~ # ip netns exec nokia-leaf ip addr add 192.0.2.12/24 dev enp1s0f2
root@debian:~ # ip netns exec nokia-leaf ip addr add 2001:db8::12/64 dev enp1s0f2
root@debian:~ # ip netns add equinix
root@debian:~ # ip link set enp1s0f3 netns equinix
root@debian:~ # ip netns exec equinix ip link set enp1s0f3 up mtu 9000
root@debian:~ # ip netns exec equinix ip addr add 192.0.2.13/24 dev enp1s0f3
root@debian:~ # ip netns exec equinix ip addr add 2001:db8::13/64 dev enp1s0f3
root@debian:~# ip netns exec nikhef fping -g 192.0.2.8/29
192.0.2.10 is alive
192.0.2.11 is alive
192.0.2.12 is alive
192.0.2.13 is alive
root@debian:~# ip netns exec arista-leaf fping 2001:db8::10 2001:db8::11 2001:db8::12 2001:db8::13
2001:db8::10 is alive
2001:db8::11 is alive
2001:db8::12 is alive
2001:db8::13 is alive
root@debian:~# ip netns exec equinix ip nei
192.0.2.10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
192.0.2.11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
192.0.2.12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
fe80::e63a:6eff:fe5f:c57 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
fe80::e63a:6eff:fe5f:c58 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
fe80::e63a:6eff:fe5f:c59 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
2001:db8::10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE
2001:db8::11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE
2001:db8::12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE
```
The Debian machine puts each port of the network card into its own network namespace, and gives each
one an IPv4 and an IPv6 address. I can then enter the `nikhef` network namespace, which has its NIC
connected to the IXR-7220-D4 router called _nikhef_, and ping all four endpoints. Similarly, I can
enter the `arista-leaf` namespace and ping6 all four endpoints. Finally, I take a look at the IPv6
and IPv4 neighbor tables on the port that is connected to the _equinix_ router. All three other MAC
addresses are seen. This proves end to end connectivity across the EVPN VXLAN, and full
interoperability. Booyah!
Performance? We got that! I'm not worried as these Nokia routers are rated for 12.8Tbps of VXLAN....
```
root@debian:~# ip netns exec equinix iperf3 -c 192.0.2.12
Connecting to host 192.0.2.12, port 5201
[ 5] local 192.0.2.10 port 34598 connected to 192.0.2.12 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.15 GBytes 9.91 Gbits/sec 19 1.52 MBytes
[ 5] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 3 1.54 MBytes
[ 5] 2.00-3.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
[ 5] 3.00-4.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes
[ 5] 4.00-5.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
[ 5] 5.00-6.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
[ 5] 6.00-7.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
[ 5] 7.00-8.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
[ 5] 8.00-9.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
[ 5] 9.00-10.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec 24 sender
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec receiver
iperf Done.
```
## What's Next
There's a few improvements I can make before deploying this architecture to the internet exchange.
Notably:
* the functional equivalent of _port security_, that is to say only allowing one or two MAC
addresses per member port. FrysIX has a strict one-port-one-member-one-MAC rule, and having port
security will greatly improve our resilience.
* SR Linux has the ability to suppress ARP, _even on L2 MAC-VRF_! It's relatively well known for
IRB based setups, but adding this to transparent bridge-domains is possible in Nokia
[[ref](https://documentation.nokia.com/srlinux/22-6/SR_Linux_Book_Files/EVPN-VXLAN_Guide/services-evpn-vxlan-l2.html#configuring_evpn_learning_for_proxy_arp)],
using the syntax of `protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise
true`. This will glean the IP addresses based on intercepted ARP requests, and reduce the need for
BUM flooding.
* Andy informs me that Arista also has this feature. By setting `router l2-vpn` and `arp learning bridged`
(see the sketch right after this list), the suppression of ARP requests/replies works in the same way.
This greatly reduces cross-router BUM flooding. If DE-CIX can do it, so can FrysIX :)
* some automation - although configuring the MAC-VRF across Arista and SR Linux is definitely not
as difficult as I thought, having some automation in place will avoid errors and mistakes. It
would suck if the IXP collapsed because I botched a link drain or PNI configuration!
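As a sketch of that last Arista knob, using the commands named above (I haven't verified this against
FrysIX's EOS version yet, so consider it a starting point rather than a tested config):
```
arista-leaf#conf t
router l2-vpn
   arp learning bridged
```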
### Acknowledgements
I am relatively new to EVPN configurations, and wanted to give a shoutout to Andy Whitaker, who
jumped in very quickly when I asked a question on the SR Linux Discord. He was gracious with his
time and spent a few hours on a video call with me, explaining EVPN in great detail both for Arista
and for SR Linux. In particular, I want to give a big "Thank you!" for helping me understand
symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at
Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure
gold!
I also want to thank Niek for helping me take my first baby steps onto this platform and patiently
answering my nerdly questions about the platform, the switch chip, and the configuration philosophy.
Learning a new NOS is always a fun task, and it was made super fun because Niek spent an hour with
Arend and me on a video call, giving a bunch of operational tips and tricks along the way.
Finally, Arend and ERITAP are an absolute joy to work with. We took turns hacking on the lab, which
Arend made available for me while I am traveling to Mississippi this week. Thanks for the kWh and
OOB access, and for brainstorming the config with me!
### Reference configurations
Here's the configs for all machines in this demonstration:
[[nikhef](/assets/frys-ix/nikhef.conf)] | [[equinix](/assets/frys-ix/equinix.conf)] | [[nokia-leaf](/assets/frys-ix/nokia-leaf.conf)] | [[arista-leaf](/assets/frys-ix/arista-leaf.conf)]

View File

@ -0,0 +1,464 @@
---
date: "2025-05-03T15:07:23Z"
title: 'VPP in Containerlab - Part 1'
---
{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}
# Introduction
From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
However, VPP is quite friendly in virtualization. Notably, it runs really well in virtual machines
like Qemu/KVM or VMWare. I can pass through PCI devices directly to the guest, and use CPU pinning to
give the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
performs almost the same as on bare metal. But did you know that VPP can also run in Docker?
The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
container-based networking labs. It starts the containers, builds virtual wiring between them to
create lab topologies of the user's choice, and manages the lab lifecycle.
Quite regularly I am asked 'when will you add VPP to Containerlab?', but at ZANOG I made a promise
to actually add it. Here I go, on a journey to integrate VPP into Containerlab!
## Containerized VPP
The folks at [[Tigera](https://www.tigera.io/project-calico/)] maintain a project called _Calico_,
which accelerates Kubernetes CNI (Container Network Interface) by using [[FD.io](https://fd.io)]
VPP. Since the origins of Kubernetes are to run containers in a Docker environment, it stands to
reason that it should be possible to run a containerized VPP. I start by reading up on how they
create their Docker image, and I learn a lot.
### Docker Build
Considering IPng runs bare metal Debian (currently Bookworm) machines, my Docker image will be based
on `debian:bookworm` as well. The build starts off quite modest:
```
pim@summer:~$ mkdir -p src/vpp-containerlab
pim@summer:~/src/vpp-containerlab$ cat << EOF > Dockerfile.bookworm
FROM debian:bookworm
ARG DEBIAN_FRONTEND=noninteractive
ARG VPP_INSTALL_SKIP_SYSCTL=true
ARG REPO=release
RUN apt-get update && apt-get -y install curl procps && apt-get clean
# Install VPP
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean
CMD ["/usr/bin/vpp","-c","/etc/vpp/startup.conf"]
EOF
pim@summer:~/src/vpp-containerlab$ docker build -f Dockerfile.bookworm . -t pimvanpelt/vpp-containerlab
```
One gotcha: when I install the upstream VPP Debian packages, their post-install script generates a
`sysctl` file and tries to apply it. However, I can't set sysctls in the container, so the build
fails. I take a look at the VPP source code and find `src/pkg/debian/vpp.postinst`, which helpfully
contains a means to skip setting the sysctls, using an environment variable called
`VPP_INSTALL_SKIP_SYSCTL`.
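Since that variable is declared as an `ARG` in the Dockerfile above, it can also be set at build
time without editing the file, for example:
```
pim@summer:~/src/vpp-containerlab$ docker build -f Dockerfile.bookworm . \
    -t pimvanpelt/vpp-containerlab --build-arg VPP_INSTALL_SKIP_SYSCTL=true
```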
### Running VPP in Docker
With the Docker image built, I need to tweak the VPP startup configuration a little bit, to allow it
to run well in a Docker environment. There are a few things I make note of:
1. We may not have huge pages on the host machine, so I'll set all the page sizes to the
linux-default 4kB rather than 2MB or 1GB hugepages. This creates a performance regression, but
in the case of Containerlab, we're not here to build high performance stuff, but rather users
will be doing functional testing.
1. DPDK requires either UIO or VFIO kernel drivers, so that it can bind its so-called _poll mode
driver_ to the network cards. It also requires huge pages. Since my first version will be
using only virtual ethernet interfaces, I'll disable DPDK and VFIO altogether.
1. VPP can run any number of CPU worker threads. In its simplest form, I can also run it with only
one thread. Of course, this will not be a high performance setup, but since I'm already not
using hugepages, I'll use only 1 thread.
The VPP `startup.conf` configuration file I came up with:
```
pim@summer:~/src/vpp-containerlab$ cat << EOF > clab-startup.conf
unix {
interactive
log /var/log/vpp/vpp.log
full-coredump
cli-listen /run/vpp/cli.sock
cli-prompt vpp-clab#
cli-no-pager
poll-sleep-usec 100
}
api-trace {
on
}
memory {
main-heap-size 512M
main-heap-page-size 4k
}
buffers {
buffers-per-numa 16000
default data-size 2048
page-size 4k
}
statseg {
size 64M
page-size 4k
per-node-counters on
}
plugins {
plugin default { enable }
plugin dpdk_plugin.so { disable }
}
EOF
```
Just a couple of notes for those who are used to running VPP in production. Each of the
`*-page-size` config settings takes the normal Linux page size of 4kB, which effectively prevents
VPP from using any hugepages. Then, I specifically disable the DPDK plugin, even though I didn't
install it in the Dockerfile build anyway: it lives in its own dedicated Debian package called
`vpp-plugin-dpdk`. Finally, I make VPP use less CPU by telling it to sleep for 100 microseconds
between each poll iteration. In production environments, VPP will use 100% of the CPUs it's
assigned, but in this lab, it will not be quite as hungry. By the way, even in this sleepy mode,
it'll still easily handle a gigabit of traffic!
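Once the container (which I'll start below as `clab-pim`) is running, it's easy to convince
yourself that this sleepy mode really is cheap on CPU. A quick sanity check, commands only:
```
# The container should hover near idle rather than pegging a core:
docker stats --no-stream clab-pim

# And VPP's own per-node counters show how little work it is doing:
docker exec -it clab-pim vppctl show runtime
```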
Now, VPP wants to run as root and it needs a few host features, notably tuntap devices and vhost,
and a few capabilities, notably NET_ADMIN, SYS_NICE and SYS_PTRACE. I take a look at the
[[manpage](https://man7.org/linux/man-pages/man7/capabilities.7.html)]:
* ***CAP_SYS_NICE***: allows setting real-time scheduling, CPU affinity and I/O scheduling class,
and migrating and moving memory pages.
* ***CAP_NET_ADMIN***: allows performing various network-related operations such as interface
configuration, routing tables, nested network namespaces, multicast, promiscuous mode, and so on.
* ***CAP_SYS_PTRACE***: allows tracing arbitrary processes using `ptrace(2)`, and a few related
kernel system calls.
Being a networking dataplane implementation, VPP wants to be able to tinker with network devices.
This is not typically allowed in Docker containers, although the Docker developers did make some
concessions for those containers that need just that little bit more access. They described it in
their
[[docs](https://docs.docker.com/engine/containers/run/#runtime-privilege-and-linux-capabilities)] as
follows:
> The --privileged flag gives all capabilities to the container. When the operator executes docker
> run --privileged, Docker enables access to all devices on the host, and reconfigures AppArmor or
> SELinux to allow the container nearly all the same access to the host as processes running outside
> containers on the host. Use this flag with caution. For more information about the --privileged
> flag, see the docker run reference.
{{< image width="4em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
At this point, I feel I should point out that running a Docker container with the `--privileged`
flag set does give it _a lot_ of privileges. A container with `--privileged` is not a securely
sandboxed process; containers in this mode can get a root shell on the host and take control of the
system. With that little fine-print warning out of the way, I am going to YOLO like a boss:
```
pim@summer:~/src/vpp-containerlab$ docker run --name clab-pim \
--cap-add=NET_ADMIN --cap-add=SYS_NICE --cap-add=SYS_PTRACE \
--device=/dev/net/tun:/dev/net/tun --device=/dev/vhost-net:/dev/vhost-net \
--privileged -v $(pwd)/clab-startup.conf:/etc/vpp/startup.conf:ro \
docker.io/pimvanpelt/vpp-containerlab
clab-pim
```
### Configuring VPP in Docker
And with that, the Docker container is running! I post a screenshot on
[[Mastodon](https://ublog.tech/@IPngNetworks/114392852468494211)] and my buddy John responds with a
polite but firm insistence that I explain myself. Here you go, buddy :)
In another terminal, I can play around with this VPP instance a little bit:
```
pim@summer:~$ docker exec -it clab-pim bash
root@d57c3716eee9:/# ip -br l
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eth0@if530566 UP 02:42:ac:11:00:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
root@d57c3716eee9:/# ps auxw
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 2.2 0.2 17498852 160300 ? Rs 15:11 0:00 /usr/bin/vpp -c /etc/vpp/startup.conf
root 10 0.0 0.0 4192 3388 pts/0 Ss 15:11 0:00 bash
root 18 0.0 0.0 8104 4056 pts/0 R+ 15:12 0:00 ps auxw
root@d57c3716eee9:/# vppctl
_______ _ _ _____ ___
__/ __/ _ \ (_)__ | | / / _ \/ _ \
_/ _// // / / / _ \ | |/ / ___/ ___/
/_/ /____(_)_/\___/ |___/_/ /_/
vpp-clab# show version
vpp v25.02-release built by root on d5cd2c304b7f at 2025-02-26T13:58:32
vpp-clab# show interfaces
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
local0 0 down 0/0/0/0
```
Slick! I can see that the container has an `eth0` device, which Docker has connected to the main
bridged network. For now, there's only one process running: pid 1 proudly shows VPP (as in Docker,
the `CMD` entry simply replaces `init`). Later on, I can imagine running a few more daemons like
SSH and so on, but for now, I'm happy.
Looking at VPP itself, it has no network interfaces yet, except for the default `local0` interface.
### Adding Interfaces in Docker
But if I don't have DPDK, how will I add interfaces? Enter `veth(4)`. From the
[[manpage](https://man7.org/linux/man-pages/man4/veth.4.html)], I learn that veth devices are
virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to
a physical network device in another namespace, but can also be used as standalone network devices.
veth devices are always created in interconnected pairs.
Of course, Docker users will recognize this. It's like bread and butter for containers to
communicate with one another - and with the host they're running on. I can simply create a Docker
network and attach one half of it to a running container, like so:
```
pim@summer:~$ docker network create --driver=bridge clab-network \
--subnet 192.0.2.0/24 --ipv6 --subnet 2001:db8::/64
5711b95c6c32ac0ed185a54f39e5af4b499677171ff3d00f99497034e09320d2
pim@summer:~$ docker network connect clab-network clab-pim --ip '' --ip6 ''
```
The first command here creates a new network called `clab-network` in Docker. As a result, a new
bridge called `br-5711b95c6c32` shows up on the host. The bridge name is chosen from the UUID of the
Docker object. Seeing as I added an IPv4 and IPv6 subnet to the bridge, it gets configured with the
first address in both:
```
pim@summer:~/src/vpp-containerlab$ brctl show br-5711b95c6c32
bridge name bridge id STP enabled interfaces
br-5711b95c6c32 8000.0242099728c6 no veth021e363
pim@summer:~/src/vpp-containerlab$ ip -br a show dev br-5711b95c6c32
br-5711b95c6c32 UP 192.0.2.1/24 2001:db8::1/64 fe80::42:9ff:fe97:28c6/64 fe80::1/64
```
The second command creates a `veth` pair, and puts one half of it in the bridge, and this interface
is called `veth021e363` above. The other half of it pops up as `eth1` in the Docker container:
```
pim@summer:~/src/vpp-containerlab$ docker exec -it clab-pim bash
root@d57c3716eee9:/# ip -br l
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eth0@if530566 UP 02:42:ac:11:00:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
eth1@if530577 UP 02:42:c0:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
```
One of the many awesome features of VPP is its ability to attach to these `veth` devices by means of
its `af-packet` driver, by reusing the same MAC address (in this case `02:42:c0:00:02:02`). I first
take a look at the linux [[manpage](https://man7.org/linux/man-pages/man7/packet.7.html)] for it,
and then read up on the VPP
[[documentation](https://fd.io/docs/vpp/v2101/gettingstarted/progressivevpp/interface)] on the
topic.
However, my attention is drawn to Docker assigning an IPv4 and IPv6 address to the container:
```
root@d57c3716eee9:/# ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
eth0@if530566 UP 172.17.0.2/16
eth1@if530577 UP 192.0.2.2/24 2001:db8::2/64 fe80::42:c0ff:fe00:202/64
root@d57c3716eee9:/# ip addr del 192.0.2.2/24 dev eth1
root@d57c3716eee9:/# ip addr del 2001:db8::2/64 dev eth1
```
I decide to remove them here, as in the end `eth1` will be owned by VPP, so _it_ should be
setting the IPv4 and IPv6 addresses. For the life of me, I don't see how I can prevent Docker from
assigning IPv4 and IPv6 addresses to this container ... and the
[[docs](https://docs.docker.com/engine/network/)] seem to be off as well, as they suggest I can pass
a flag `--ipv4=False`, but that flag doesn't exist, at least not on my Bookworm Docker variant. I
make a mental note to discuss this with the folks in the Containerlab community.
Anyway, armed with this knowledge I can bind the container-side veth pair called `eth1` to VPP, like
so:
```
root@d57c3716eee9:/# vppctl
_______ _ _ _____ ___
__/ __/ _ \ (_)__ | | / / _ \/ _ \
_/ _// // / / / _ \ | |/ / ___/ ___/
/_/ /____(_)_/\___/ |___/_/ /_/
vpp-clab# create host-interface name eth1 hw-addr 02:42:c0:00:02:02
vpp-clab# set interface name host-eth1 eth1
vpp-clab# set interface mtu 1500 eth1
vpp-clab# set interface ip address eth1 192.0.2.2/24
vpp-clab# set interface ip address eth1 2001:db8::2/64
vpp-clab# set interface state eth1 up
vpp-clab# show int addr
eth1 (up):
L3 192.0.2.2/24
L3 2001:db8::2/64
local0 (dn):
```
## Results
After all this work, I've successfully created a Docker image based on Debian Bookworm and VPP 25.02
(the current stable release version), started a container with it, and added a network bridge in
Docker which connects the host `summer` to the container. Proof, as they say, is in the ping-pudding:
```
pim@summer:~/src/vpp-containerlab$ ping -c5 2001:db8::2
PING 2001:db8::2(2001:db8::2) 56 data bytes
64 bytes from 2001:db8::2: icmp_seq=1 ttl=64 time=0.113 ms
64 bytes from 2001:db8::2: icmp_seq=2 ttl=64 time=0.056 ms
64 bytes from 2001:db8::2: icmp_seq=3 ttl=64 time=0.202 ms
64 bytes from 2001:db8::2: icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from 2001:db8::2: icmp_seq=5 ttl=64 time=0.100 ms
--- 2001:db8::2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4098ms
rtt min/avg/max/mdev = 0.056/0.114/0.202/0.047 ms
pim@summer:~/src/vpp-containerlab$ ping -c5 192.0.2.2
PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data.
64 bytes from 192.0.2.2: icmp_seq=1 ttl=64 time=0.043 ms
64 bytes from 192.0.2.2: icmp_seq=2 ttl=64 time=0.032 ms
64 bytes from 192.0.2.2: icmp_seq=3 ttl=64 time=0.019 ms
64 bytes from 192.0.2.2: icmp_seq=4 ttl=64 time=0.041 ms
64 bytes from 192.0.2.2: icmp_seq=5 ttl=64 time=0.027 ms
--- 192.0.2.2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4063ms
rtt min/avg/max/mdev = 0.019/0.032/0.043/0.008 ms
```
And in case that simple ping-test wasn't enough to get you excited, here's a packet trace from VPP
itself, while I'm performing this ping:
```
vpp-clab# trace add af-packet-input 100
vpp-clab# wait 3
vpp-clab# show trace
------------------- Start of thread 0 vpp_main -------------------
Packet 1
00:07:03:979275: af-packet-input
af_packet: hw_if_index 1 rx-queue 0 next-index 4
block 47:
address 0x7fbf23b7d000 version 2 seq_num 48 pkt_num 0
tpacket3_hdr:
status 0x20000001 len 98 snaplen 98 mac 92 net 106
sec 0x68164381 nsec 0x258e7659 vlan 0 vlan_tpid 0
vnet-hdr:
flags 0x00 gso_type 0x00 hdr_len 0
gso_size 0 csum_start 0 csum_offset 0
00:07:03:979293: ethernet-input
IP4: 02:42:09:97:28:c6 -> 02:42:c0:00:02:02
00:07:03:979306: ip4-input
ICMP: 192.0.2.1 -> 192.0.2.2
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
fragment id 0x5813, flags DONT_FRAGMENT
ICMP echo_request checksum 0xc16 id 21197
00:07:03:979315: ip4-lookup
fib 0 dpo-idx 9 flow hash: 0x00000000
ICMP: 192.0.2.1 -> 192.0.2.2
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
fragment id 0x5813, flags DONT_FRAGMENT
ICMP echo_request checksum 0xc16 id 21197
00:07:03:979322: ip4-receive
fib:0 adj:9 flow:0x00000000
ICMP: 192.0.2.1 -> 192.0.2.2
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
fragment id 0x5813, flags DONT_FRAGMENT
ICMP echo_request checksum 0xc16 id 21197
00:07:03:979323: ip4-icmp-input
ICMP: 192.0.2.1 -> 192.0.2.2
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
fragment id 0x5813, flags DONT_FRAGMENT
ICMP echo_request checksum 0xc16 id 21197
00:07:03:979323: ip4-icmp-echo-request
ICMP: 192.0.2.1 -> 192.0.2.2
tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN
fragment id 0x5813, flags DONT_FRAGMENT
ICMP echo_request checksum 0xc16 id 21197
00:07:03:979326: ip4-load-balance
fib 0 dpo-idx 5 flow hash: 0x00000000
ICMP: 192.0.2.2 -> 192.0.2.1
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
fragment id 0x2dc4, flags DONT_FRAGMENT
ICMP echo_reply checksum 0x1416 id 21197
00:07:03:979325: ip4-rewrite
tx_sw_if_index 1 dpo-idx 5 : ipv4 via 192.0.2.1 eth1: mtu:1500 next:3 flags:[] 0242099728c60242c00002020800 flow hash: 0x00000000
00000000: 0242099728c60242c00002020800450000542dc44000400188e1c0000202c000
00000020: 02010000141652cd00018143166800000000399d0900000000001011
00:07:03:979326: eth1-output
eth1 flags 0x02180005
IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
ICMP: 192.0.2.2 -> 192.0.2.1
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
fragment id 0x2dc4, flags DONT_FRAGMENT
ICMP echo_reply checksum 0x1416 id 21197
00:07:03:979327: eth1-tx
af_packet: hw_if_index 1 tx-queue 0
tpacket3_hdr:
status 0x1 len 108 snaplen 108 mac 0 net 0
sec 0x0 nsec 0x0 vlan 0 vlan_tpid 0
vnet-hdr:
flags 0x00 gso_type 0x00 hdr_len 0
gso_size 0 csum_start 0 csum_offset 0
buffer 0xf97c4:
current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
local l2-hdr-offset 0 l3-hdr-offset 14
IP4: 02:42:c0:00:02:02 -> 02:42:09:97:28:c6
ICMP: 192.0.2.2 -> 192.0.2.1
tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN
fragment id 0x2dc4, flags DONT_FRAGMENT
ICMP echo_reply checksum 0x1416 id 21197
```
Well, that's a mouthful, isn't it! Here, I get to show you VPP in action. After receiving the
packet on its `af-packet-input` node from 192.0.2.1 (Summer, who is pinging us) to 192.0.2.2 (the
VPP container), the packet traverses the dataplane graph. It goes through `ethernet-input`, then
`ip4-input`, which sees that it's destined to a locally configured IPv4 address, so the packet is
handed to `ip4-receive`. That one sees that the IP protocol is ICMP, so it hands the packet to
`ip4-icmp-input`, which notices that the packet is an ICMP echo request, so off to
`ip4-icmp-echo-request` our little packet goes. The ICMP plugin in VPP now answers by
`ip4-rewrite`'ing the packet, addressing the reply to 192.0.2.1 at MAC address `02:42:09:97:28:c6`
(this is Summer, the host doing the pinging!), after which the newly created ICMP echo-reply is
handed to `eth1-output`, which marshals it back into the kernel's AF_PACKET interface using
`eth1-tx`.
Boom. I could not be more pleased.
## What's Next
This was a nice exercise for me! I'm going in this direction because the
[[Containerlab](https://containerlab.dev)] framework starts containers from given NOS images,
not too dissimilar from the one I just made, and then attaches `veth` pairs between the containers.
I started dabbling with a [[pull-request](https://github.com/srl-labs/containerlab/pull/2571)], but
I got stuck with a part of the Containerlab code that pre-deploys config files into the containers.
You see, I will need to generate two files:
1. A `startup.conf` file that is specific to the containerlab Docker container. I'd like them to
each set their own hostname so that the CLI has a unique prompt. I can do this by setting `unix
{ cli-prompt {{ .ShortName }}# }` in the template renderer.
1. Containerlab will know all of the `veth` pairs that it plans to create for each VPP
container. I'll need it to then write a little snippet of config that does the `create
host-interface` spiel, to attach these `veth` pairs to the VPP dataplane (a sketch of such a
snippet follows this list).
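As a sketch: for a node that gets a single `veth` called `eth1`, that generated snippet would
contain roughly the same commands I typed by hand earlier, with the MAC address taken from the
`veth` itself:
```
create host-interface name eth1 hw-addr 02:42:c0:00:02:02
set interface name host-eth1 eth1
set interface state eth1 up
```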
I reached out to Roman from Nokia, who is one of the authors and current maintainer of Containerlab.
Roman was keen to help out, and seeing as he knows the Containerlab stuff well and I know the VPP
stuff well, this is a reasonable partnership! Soon, he and I plan to have a bare-bones setup that
will connect a few VPP containers together with an SR Linux node in a lab. Stand by!
Once we have that, there's still quite some work for me to do. Notably:
* Configuration persistence. `clab` allows you to save the running config. For that, I'll need to
introduce [[vppcfg](https://git.ipng.ch/ipng/vppcfg)] and a means to invoke it when
the lab operator wants to save their config, and then reconfigure VPP when the container
restarts.
* I'll need to have a few files from `clab` shared with the host, notably the `startup.conf` and
`vppcfg.yaml`, as well as some manual pre- and post-flight configuration for the more esoteric
stuff. Building the plumbing for this is a TODO for now.
## Acknowledgements
I wanted to give a shout-out to Nardus le Roux who inspired me to contribute this Containerlab VPP
node type, and to Roman Dodin for his help getting the Containerlab parts squared away when I got a
little bit stuck.
First order of business: get it to ping at all ... it'll go faster from there on out :)


@ -0,0 +1,373 @@
---
date: "2025-05-04T15:07:23Z"
title: 'VPP in Containerlab - Part 2'
params:
asciinema: true
---
{{< image float="right" src="/assets/containerlab/containerlab.svg" alt="Containerlab Logo" width="12em" >}}
# Introduction
From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in
AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance.
However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines
like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to
allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP
performs almost the same as on bare metal. But did you know that VPP can also run in Docker?
The other day I joined the [[ZANOG'25](https://nog.net.za/event1/zanog25/)] in Durban, South Africa.
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called
[[Containerlab](https://containerlab.dev/)], which provides a CLI for orchestrating and managing
container-based networking labs. It starts the containers, builds virtual wiring between them to
create lab topologies of the user's choice and manages the lab lifecycle.
Quite regularly I am asked 'when will you add VPP to Containerlab?', but at ZANOG I made a promise
to actually add it. In my previous [[article]({{< ref 2025-05-03-containerlab-1.md >}})], I took
a good look at VPP as a dockerized container. In this article, I'll explore how to make such a
container run in Containerlab!
## Completing the Docker container
Just having VPP running by itself in a container is not super useful (although it _is_ cool!). I
decide to first add a few bits and bobs to the `Dockerfile` that will come in handy:
```
FROM debian:bookworm
ARG DEBIAN_FRONTEND=noninteractive
ARG VPP_INSTALL_SKIP_SYSCTL=true
ARG REPO=release
EXPOSE 22/tcp
RUN apt-get update && apt-get -y install curl procps tcpdump iproute2 iptables \
iputils-ping net-tools git python3 python3-pip vim-tiny openssh-server bird2 \
mtr-tiny traceroute && apt-get clean
# Install VPP
RUN mkdir -p /var/log/vpp /root/.ssh/
RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash
RUN apt-get update && apt-get -y install vpp vpp-plugin-core && apt-get clean
# Build vppcfg
RUN pip install --break-system-packages build netaddr yamale argparse pyyaml ipaddress
RUN git clone https://git.ipng.ch/ipng/vppcfg.git && cd vppcfg && python3 -m build && \
pip install --break-system-packages dist/vppcfg-*-py3-none-any.whl
# Config files
COPY files/etc/vpp/* /etc/vpp/
COPY files/etc/bird/* /etc/bird/
COPY files/init-container.sh /sbin/
RUN chmod 755 /sbin/init-container.sh
CMD ["/sbin/init-container.sh"]
```
A few notable additions:
* ***vppcfg*** is a handy utility I wrote and discussed in a previous [[article]({{< ref
2022-04-02-vppcfg-2 >}})]. Its purpose is to take a YAML file that describes the configuration of
the dataplane (like which interfaces, sub-interfaces, MTU, IP addresses and so on), and then
apply this safely to a running dataplane. You can check it out in my
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] git repository.
* ***openssh-server*** will come in handy to log in to the container, in addition to the already
available `docker exec`.
* ***bird2*** which will be my controlplane of choice. At a future date, I might also add FRR,
which may be a good alternative for some. VPP works well with both. You can check out Bird on
the nic.cz [[website](https://bird.network.cz/?get_doc&f=bird.html&v=20)].
I'll add a couple of default config files for Bird and VPP, and replace the CMD with a generic
`/sbin/init-container.sh` in which I can do any late binding stuff before launching VPP.
### Initializing the Container
#### VPP Containerlab: NetNS
VPP's Linux Control Plane plugin wants to run in its own network namespace. So the first order of
business of `/sbin/init-container.sh` is to create it:
```
NETNS=${NETNS:="dataplane"}
echo "Creating dataplane namespace"
/usr/bin/mkdir -p /etc/netns/$NETNS
/usr/bin/touch /etc/netns/$NETNS/resolv.conf
/usr/sbin/ip netns add $NETNS
```
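A quick way to verify this from inside the container, commands only:
```
ip netns list                        # should print 'dataplane'
ip netns exec dataplane ip -br link  # initially just 'lo'; Linux CP interfaces appear here later
```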
#### VPP Containerlab: SSH
Then, I'll set the root password (which is `vpp`, by the way), and start an SSH daemon that permits
root logins with that password:
```
echo "Starting SSH, with credentials root:vpp"
sed -i -e 's,^#PermitRootLogin prohibit-password,PermitRootLogin yes,' /etc/ssh/sshd_config
sed -i -e 's,^root:.*,root:$y$j9T$kG8pyZEVmwLXEtXekQCRK.$9iJxq/bEx5buni1hrC8VmvkDHRy7ZMsw9wYvwrzexID:20211::::::,' /etc/shadow
/etc/init.d/ssh start
```
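If you'd rather not ship the default `root:vpp` credentials, the yescrypt hash in that `sed` line
can be regenerated with a different password; a small sketch, assuming the `mkpasswd` utility from
Debian's `whois` package:
```
# Generate a yescrypt hash for a new root password, then paste the result into
# the sed invocation above (or bake it into the image at build time).
mkpasswd --method=yescrypt 'my-new-root-password'
```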
#### VPP Containerlab: Bird2
I can already predict that Bird2 won't be the only option for a controlplane, even though I'm a huge
fan of it. Therefore, I'll make it configurable to leave the door open for other controlplane
implementations in the future:
```
BIRD_ENABLED=${BIRD_ENABLED:="true"}
if [ "$BIRD_ENABLED" == "true" ]; then
echo "Starting Bird in $NETNS"
mkdir -p /run/bird /var/log/bird
chown bird:bird /var/log/bird
ROUTERID=$(ip -br a show eth0 | awk '{ print $3 }' | cut -f1 -d/)
sed -i -e "s,.*router id .*,router id $ROUTERID; # Set by container-init.sh," /etc/bird/bird.conf
/usr/bin/nsenter --net=/var/run/netns/$NETNS /usr/sbin/bird -u bird -g bird
fi
```
I am reminded that Bird won't start if it cannot determine its _router id_. When I start it in the
`dataplane` namespace, it will immediately exit, because there will be no IP addresses configured
yet. But luckily, it logs its complaint and it's easily addressed. I decide to take the management
IPv4 address from `eth0` and write that into the `bird.conf` file, which otherwise does some basic
initialization that I described in a previous [[article]({{< ref 2021-09-02-vpp-5 >}})], so I'll
skip that here. However, I do include an empty file called `/etc/bird/bird-local.conf` for users to
further configure Bird2.
#### VPP Containerlab: Binding veth pairs
When Containerlab starts the VPP container, it'll offer it a set of `veth` ports that connect this
container to other nodes in the lab. This is done by the `links` list in the topology file
[[ref](https://containerlab.dev/manual/network/)]. It's my goal to take all of the interfaces
that are of type `veth`, and generate a little snippet to grab them and bind them into VPP while
setting their MTU to 9216 to allow for jumbo frames:
```
CLAB_VPP_FILE=${CLAB_VPP_FILE:=/etc/vpp/clab.vpp}
echo "Generating $CLAB_VPP_FILE"
: > $CLAB_VPP_FILE
MTU=9216
for IFNAME in $(ip -br link show type veth | cut -f1 -d@ | grep -v '^eth0$' | sort); do
MAC=$(ip -br link show dev $IFNAME | awk '{ print $3 }')
echo " * $IFNAME hw-addr $MAC mtu $MTU"
ip link set $IFNAME up mtu $MTU
cat << EOF >> $CLAB_VPP_FILE
create host-interface name $IFNAME hw-addr $MAC
set interface name host-$IFNAME $IFNAME
set interface mtu $MTU $IFNAME
set interface state $IFNAME up
EOF
done
```
{{< image width="5em" float="left" src="/assets/shared/warning.png" alt="Warning" >}}
One thing I realized is that VPP will assign a random MAC address on its copy of the `veth` port,
which is not great. I'll explicitly configure it with the same MAC address as the `veth` interface
itself, otherwise I'd have to put the interface into promiscuous mode.
#### VPP Containerlab: VPPcfg
I'm almost ready, but I have one more detail. The user will be able to offer a
[[vppcfg](https://git.ipng.ch/ipng/vppcfg)] YAML file to configure the interfaces and so on. If such
a file exists, I'll apply it to the dataplane upon startup:
```
VPPCFG_VPP_FILE=${VPPCFG_VPP_FILE:=/etc/vpp/vppcfg.vpp}
echo "Generating $VPPCFG_VPP_FILE"
: > $VPPCFG_VPP_FILE
if [ -r /etc/vpp/vppcfg.yaml ]; then
vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml -o $VPPCFG_VPP_FILE
fi
```
Once the VPP process starts, it'll execute `/etc/vpp/bootstrap.vpp`, which in turn executes the
newly generated `/etc/vpp/clab.vpp` to grab the `veth` interfaces, and then `/etc/vpp/vppcfg.vpp` to
further configure the dataplane. Easy peasy!
### Adding VPP to Containerlab
Roman points out a previous integration for the 6WIND VSR in
[[PR#2540](https://github.com/srl-labs/containerlab/pull/2540)]. This serves as a useful guide to
get me started. I fork the repo, create a branch so that Roman can also add a few commits, and
together we start hacking in [[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)].
First, I add the documentation skeleton in `docs/manual/kinds/fdio_vpp.md`, which links in from a
few other places, and will be where the end-user facing documentation will live. That's about half
the contributed LOC, right there!
Next, I'll create a Go module in `nodes/fdio_vpp/fdio_vpp.go` which doesn't do much other than
creating the `struct`, and its required `Register` and `Init` functions. The `Init` function ensures
the right capabilities are set in Docker, and the right devices are bound for the container.
I notice that Containerlab rewrites the Dockerfile `CMD` string and prepends an `if-wait.sh` script
to it. This is because when Containerlab starts the container, it'll still be busy adding the
`link` interfaces to it, and if a container starts too quickly, it may not see all the interfaces.
So, Containerlab informs the container of the expected count using an environment variable called
`CLAB_INTFS`, and this script simply sleeps until exactly that number of interfaces is present. Ok,
cool beans.
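I haven't reproduced Containerlab's actual `if-wait.sh` here, but the idea behind it is roughly
this sketch:
```
#!/bin/sh
# Sketch only: CLAB_INTFS is set by Containerlab to the number of lab links that
# will be attached. Wait until that many non-eth0 veth interfaces exist, then
# hand over to the original entrypoint.
EXPECTED=${CLAB_INTFS:-0}
count() { ip -br link show type veth | grep -vc '^eth0@'; }
while [ "$(count)" -lt "$EXPECTED" ]; do
  echo "Waiting for interfaces: $(count)/$EXPECTED present"
  sleep 1
done
exec "$@"
```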
Roman helps me a bit with Go templating. You see, I think it'll be slick to have the CLI prompt for
the VPP containers to reflect their hostname, because normally, VPP will assign `vpp# `. I add the
template in `nodes/fdio_vpp/vpp_startup_config.go.tpl` and it only has one variable expansion: `unix
{ cli-prompt {{ .ShortName }}# }`. But I totally think it's worth it, because when running many VPP
containers in the lab, it could otherwise get confusing.
Roman also shows me a trick in the function `PostDeploy()`, which will write the user's SSH pubkeys
to `/root/.ssh/authorized_keys`. This allows users to log in without having to use password
authentication.
Collectively, we decide to punt on the `SaveConfig` function until we're a bit further along. I have
an idea how this would work, basically along the lines of calling `vppcfg dump` and bind-mounting
that file into the lab directory somewhere. This way, upon restarting, the YAML file can be re-read
and the dataplane initialized. But it'll be for another day.
After the main module is finished, all I have to do is add it to `clab/register.go` and that's just
about it. In about 170 lines of code, 50 lines of Go template, and 170 lines of Markdown, this
contribution is about ready to ship!
### Containerlab: Demo
After I finish writing the documentation, I decide to include a demo with a quickstart to help folks
along. A simple lab showing two VPP instances and two Alpine Linux clients can be found on
[[git.ipng.ch/ipng/vpp-containerlab](https://git.ipng.ch/ipng/vpp-containerlab)]. Simply check out the
repo and start the lab, like so:
```
$ git clone https://git.ipng.ch/ipng/vpp-containerlab.git
$ cd vpp-containerlab
$ containerlab deploy --topo vpp.clab.yml
```
#### Containerlab: configs
The file `vpp.clab.yml` contains an example topology consisting of two VPP instances, each connected
to one Alpine Linux container, in the following topology:
{{< image src="/assets/containerlab/learn-vpp.png" alt="Containerlab Topo" width="100%" >}}
Two relevant files for each VPP router are included in this
[[repository](https://git.ipng.ch/ipng/vpp-containerlab)]:
1. `config/vpp*/vppcfg.yaml` configures the dataplane interfaces, including a loopback address.
1. `config/vpp*/bird-local.conf` configures the controlplane to enable BFD and OSPF.
To illustrate these files, let me take a closer look at node `vpp1`. Its VPP dataplane
configuration looks like this:
```
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/vppcfg.yaml
interfaces:
eth1:
description: 'To client1'
mtu: 1500
lcp: eth1
addresses: [ 10.82.98.65/28, 2001:db8:8298:101::1/64 ]
eth2:
description: 'To vpp2'
mtu: 9216
lcp: eth2
addresses: [ 10.82.98.16/31, 2001:db8:8298:1::1/64 ]
loopbacks:
loop0:
description: 'vpp1'
lcp: loop0
addresses: [ 10.82.98.0/32, 2001:db8:8298::/128 ]
```
Then, I enable BFD, OSPF and OSPFv3 on `eth2` and `loop0` on both of the VPP routers:
```
pim@summer:~/src/vpp-containerlab$ cat config/vpp1/bird-local.conf
protocol bfd bfd1 {
interface "eth2" { interval 100 ms; multiplier 30; };
}
protocol ospf v2 ospf4 {
ipv4 { import all; export all; };
area 0 {
interface "loop0" { stub yes; };
interface "eth2" { type pointopoint; cost 10; bfd on; };
};
}
protocol ospf v3 ospf6 {
ipv6 { import all; export all; };
area 0 {
interface "loop0" { stub yes; };
interface "eth2" { type pointopoint; cost 10; bfd on; };
};
}
```
#### Containerlab: playtime!
Once the lab comes up, I can SSH to the VPP containers (`vpp1` and `vpp2`), which have my SSH pubkeys
installed thanks to Roman's work. Failing that, I could still log in as user `root` using
password `vpp`. VPP runs in its own network namespace called `dataplane`, which is very similar to SR
Linux's default `network-instance`. I can join that namespace to take a closer look:
```
pim@summer:~/src/vpp-containerlab$ ssh root@vpp1
root@vpp1:~# nsenter --net=/var/run/netns/dataplane
root@vpp1:~# ip -br a
lo DOWN
loop0 UP 10.82.98.0/32 2001:db8:8298::/128 fe80::dcad:ff:fe00:0/64
eth1 UNKNOWN 10.82.98.65/28 2001:db8:8298:101::1/64 fe80::a8c1:abff:fe77:acb9/64
eth2 UNKNOWN 10.82.98.16/31 2001:db8:8298:1::1/64 fe80::a8c1:abff:fef0:7125/64
root@vpp1:~# ping 10.82.98.1
PING 10.82.98.1 (10.82.98.1) 56(84) bytes of data.
64 bytes from 10.82.98.1: icmp_seq=1 ttl=64 time=9.53 ms
64 bytes from 10.82.98.1: icmp_seq=2 ttl=64 time=15.9 ms
^C
--- 10.82.98.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 9.530/12.735/15.941/3.205 ms
```
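Bird2's own CLI can also show whether the adjacency has formed (output omitted here); the protocol
names are the ones configured in `bird-local.conf`:
```
birdc show protocols
birdc show ospf neighbors ospf4
birdc show ospf neighbors ospf6
```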
From `vpp1`, I can tell that Bird2's OSPF adjacency has formed, because I can ping the `loop0`
address of the `vpp2` router at 10.82.98.1. Nice! The two client nodes are running a minimalistic Alpine
Linux container, which doesn't ship with SSH by default. But of course I can still enter the
containers using `docker exec`, like so:
```
pim@summer:~/src/vpp-containerlab$ docker exec -it client1 sh
/ # ip addr show dev eth1
531235: eth1@if531234: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 9500 qdisc noqueue state UP
link/ether 00:c1:ab:00:00:01 brd ff:ff:ff:ff:ff:ff
inet 10.82.98.66/28 scope global eth1
valid_lft forever preferred_lft forever
inet6 2001:db8:8298:101::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::2c1:abff:fe00:1/64 scope link
valid_lft forever preferred_lft forever
/ # traceroute 10.82.98.82
traceroute to 10.82.98.82 (10.82.98.82), 30 hops max, 46 byte packets
1 10.82.98.65 (10.82.98.65) 5.906 ms 7.086 ms 7.868 ms
2 10.82.98.17 (10.82.98.17) 24.007 ms 23.349 ms 15.933 ms
3 10.82.98.82 (10.82.98.82) 39.978 ms 31.127 ms 31.854 ms
/ # traceroute 2001:db8:8298:102::2
traceroute to 2001:db8:8298:102::2 (2001:db8:8298:102::2), 30 hops max, 72 byte packets
1 2001:db8:8298:101::1 (2001:db8:8298:101::1) 0.701 ms 7.144 ms 7.900 ms
2 2001:db8:8298:1::2 (2001:db8:8298:1::2) 23.909 ms 22.943 ms 23.893 ms
3 2001:db8:8298:102::2 (2001:db8:8298:102::2) 31.964 ms 30.814 ms 32.000 ms
```
From the vantage point of `client1`, the first hop represents the `vpp1` node, which forwards to
`vpp2`, which finally forwards to `client2`, which shows that both VPP routers are passing traffic.
Dope!
## Results
After all of this deep-diving, all that's left is for me to demonstrate the lab by means of
this little screencast [[asciinema](/assets/containerlab/vpp-containerlab.cast)]. I hope you enjoy
it as much as I enjoyed creating it:
{{< asciinema src="/assets/containerlab/vpp-containerlab.cast" >}}
## Acknowledgements
I wanted to give a shout-out to Roman Dodin for his help getting the Containerlab parts squared away
when I got a little bit stuck. He took the time to explain the internals and idioms of the
Containerlab project, which really saved me a tonne of time. He also pair-programmed
[[PR#2571](https://github.com/srl-labs/containerlab/pull/2571)] with me over the span of two
evenings.
Collaborative open source rocks!

hugo.yaml

@ -0,0 +1,38 @@
baseURL: 'https://ipng.ch/'
languageCode: 'en-us'
title: "IPng Networks"
theme: 'hugo-theme-ipng'
mainSections: ["articles"]
params:
author: "IPng Networks GmbH"
siteHeading: "IPng Networks"
favicon: "favicon.ico"
showBlogLatest: false
mainSections: ["articles"]
showTaxonomyLinks: false
nBlogLatest: 14 # number of blog posts on the home page
Paginate: 30
blogLatestHeading: "Latest Dabblings"
footer: "Copyright 2021- IPng Networks GmbH, all rights reserved"
social:
email: "info+www@ipng.ch"
mastodon: "@IPngNetworks"
twitter: "IPngNetworks"
linkedin: "pimvanpelt"
github: "pimvanpelt"
instagram: "IPngNetworks"
rss: true
taxonomies:
year: "year"
month: "month"
tags: "tags"
categories: "categories"
permalinks:
articles: "/s/articles/:year/:month/:day/:slug"
ignoreLogs: [ "warning-goldmark-raw-html" ]


@ -0,0 +1,5 @@
Canonical: https://ipng.ch/.well-known/security.txt
Expires: 2026-01-01T00:00:00.000Z
Contact: mailto:info@ipng.ch
Contact: https://ipng.ch/s/contact/
Preferred-Languages: en, nl, de

static/app/go/index.html

@ -0,0 +1,55 @@
<!DOCTYPE html>
<html lang="en-us">
<head>
<title>Javascript Redirector for RFID / NFC / nTAG</title>
<meta name="robots" content="noindex,nofollow">
<meta charset="utf-8">
<script type="text/JavaScript">
const ntag_list = [
"/s/articles/2021/09/21/vpp-linux-cp-part7/",
"/s/articles/2021/12/23/vpp-linux-cp-virtual-machine-playground/",
"/s/articles/2022/01/12/case-study-virtual-leased-line-vll-in-vpp/",
"/s/articles/2022/02/14/case-study-vlan-gymnastics-with-vpp/",
"/s/articles/2022/03/27/vpp-configuration-part1/",
"/s/articles/2022/10/14/vpp-lab-setup/",
"/s/articles/2023/03/11/case-study-centec-mpls-core/",
"/s/articles/2023/04/09/vpp-monitoring/",
"/s/articles/2023/05/28/vpp-mpls-part-4/",
"/s/articles/2023/11/11/debian-on-mellanox-sn2700-32x100g/",
"/s/articles/2023/12/17/debian-on-ipngs-vpp-routers/",
"/s/articles/2024/01/27/vpp-python-api/",
"/s/articles/2024/02/10/vpp-on-freebsd-part-1/",
"/s/articles/2024/03/06/vpp-with-babel-part-1/",
"/s/articles/2024/04/06/vpp-with-loopback-only-ospfv3-part-1/",
"/s/articles/2024/04/27/freeix-remote/"
];
var redir_url = "https://ipng.ch/";
var key = window.location.hash.slice(1);
if (key.startsWith("ntag")) {
let week = Math.round(new Date().getTime() / 1000 / (7*24*3600));
let num = parseInt(key.slice(-2));
let idx = (num + week) % ntag_list.length;
console.log("(ntag " + num + " + week number " + week + ") % " + ntag_list.length + " = " + idx);
redir_url = ntag_list[idx];
}
console.log("Redirecting to " + redir_url + " - off you go!");
window.location = redir_url;
</script>
</head>
<body>
<pre>
Usage: https://ipng.ch/app/go/#&lt;key&gt;
Example: <a href="/app/go/#ntag00">#ntag00</a>
Also, this page requires javascript.
Love,
IPng Networks.
</pre>
</body>
</html>

static/assets/containerlab/learn-vpp.png (Stored with Git LFS)
static/assets/freeix/freeix-artist-rendering.png (Stored with Git LFS)
static/assets/frys-ix/IXR-7220-D3.jpg (Stored with Git LFS)

@ -0,0 +1,169 @@
no aaa root
!
hardware counter feature vtep decap
hardware counter feature vtep encap
!
service routing protocols model multi-agent
!
hostname arista-leaf
!
router l2-vpn
arp learning bridged
!
spanning-tree mode mstp
!
system l1
unsupported speed action error
unsupported error-correction action error
!
vlan 2604
name v-peeringlan
!
interface Ethernet1/1
!
interface Ethernet2/1
!
interface Ethernet3/1
!
interface Ethernet4/1
!
interface Ethernet5/1
!
interface Ethernet6/1
!
interface Ethernet7/1
!
interface Ethernet8/1
!
interface Ethernet9/1
shutdown
speed forced 10000full
!
interface Ethernet9/2
shutdown
!
interface Ethernet9/3
speed forced 10000full
switchport access vlan 2604
!
interface Ethernet9/4
shutdown
!
interface Ethernet10/1
!
interface Ethernet10/2
shutdown
!
interface Ethernet10/4
shutdown
!
interface Ethernet11/1
!
interface Ethernet12/1
!
interface Ethernet13/1
!
interface Ethernet14/1
!
interface Ethernet15/1
!
interface Ethernet16/1
!
interface Ethernet17/1
!
interface Ethernet18/1
!
interface Ethernet19/1
!
interface Ethernet20/1
!
interface Ethernet21/1
!
interface Ethernet22/1
!
interface Ethernet23/1
!
interface Ethernet24/1
!
interface Ethernet25/1
!
interface Ethernet26/1
!
interface Ethernet27/1
!
interface Ethernet28/1
!
interface Ethernet29/1
no switchport
!
interface Ethernet30/1
load-interval 1
mtu 9190
no switchport
ip address 198.19.17.10/31
ip ospf cost 10
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface Ethernet31/1
load-interval 1
mtu 9190
no switchport
ip address 198.19.17.3/31
ip ospf cost 1000
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface Ethernet32/1
load-interval 1
mtu 9190
no switchport
ip address 198.19.17.5/31
ip ospf cost 1000
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface Loopback0
ip address 198.19.16.2/32
ip ospf area 0.0.0.0
!
interface Loopback1
ip address 198.19.18.2/32
!
interface Management1
ip address dhcp
!
interface Vxlan1
vxlan source-interface Loopback1
vxlan udp-port 4789
vxlan vlan 2604 vni 2604
!
ip routing
!
ip route 0.0.0.0/0 Management1 10.75.8.1
!
router bgp 65500
neighbor evpn peer group
neighbor evpn remote-as 65500
neighbor evpn update-source Loopback0
neighbor evpn ebgp-multihop 3
neighbor evpn send-community extended
neighbor evpn maximum-routes 12000 warning-only
neighbor 198.19.16.0 peer group evpn
neighbor 198.19.16.1 peer group evpn
!
vlan 2604
rd 65500:2604
route-target both 65500:2604
redistribute learned
!
address-family evpn
neighbor evpn activate
!
router ospf 65500
router-id 198.19.16.2
redistribute connected
network 198.19.0.0/16 area 0.0.0.0
max-lsa 12000
!
end


@ -0,0 +1,90 @@
set / interface ethernet-1/1 admin-state disable
set / interface ethernet-1/9 admin-state enable
set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
set / interface ethernet-1/9/3 admin-state enable
set / interface ethernet-1/9/3 vlan-tagging true
set / interface ethernet-1/9/3 subinterface 0 type bridged
set / interface ethernet-1/9/3 subinterface 0 admin-state enable
set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
set / interface ethernet-1/29 admin-state enable
set / interface ethernet-1/29 subinterface 0 type routed
set / interface ethernet-1/29 subinterface 0 admin-state enable
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.0/31
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
set / interface lo0 admin-state enable
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.0/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.0/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc2 role: spine"
set / network-instance default router-id 198.19.16.0
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/29.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.0
set / network-instance default protocols bgp dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay route-reflector client true
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.0
set / network-instance default protocols bgp neighbor 198.19.16.1 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.1 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.0
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9/3.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address

static/assets/frys-ix/frysix-logo-small.png (Stored with Git LFS)

@ -0,0 +1,132 @@
set / interface ethernet-1/1 admin-state enable
set / interface ethernet-1/1 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/1 subinterface 0 type routed
set / interface ethernet-1/1 subinterface 0 admin-state enable
set / interface ethernet-1/1 subinterface 0 ip-mtu 9190
set / interface ethernet-1/1 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/1 subinterface 0 ipv4 address 198.19.17.2/31
set / interface ethernet-1/1 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/2 admin-state enable
set / interface ethernet-1/2 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/2 subinterface 0 type routed
set / interface ethernet-1/2 subinterface 0 admin-state enable
set / interface ethernet-1/2 subinterface 0 ip-mtu 9190
set / interface ethernet-1/2 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/2 subinterface 0 ipv4 address 198.19.17.4/31
set / interface ethernet-1/2 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/3 admin-state enable
set / interface ethernet-1/3 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/3 subinterface 0 type routed
set / interface ethernet-1/3 subinterface 0 admin-state enable
set / interface ethernet-1/3 subinterface 0 ip-mtu 9190
set / interface ethernet-1/3 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/3 subinterface 0 ipv4 address 198.19.17.6/31
set / interface ethernet-1/3 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/4 admin-state enable
set / interface ethernet-1/4 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/4 subinterface 0 type routed
set / interface ethernet-1/4 subinterface 0 admin-state enable
set / interface ethernet-1/4 subinterface 0 ip-mtu 9190
set / interface ethernet-1/4 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/4 subinterface 0 ipv4 address 198.19.17.8/31
set / interface ethernet-1/4 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/9 admin-state enable
set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
set / interface ethernet-1/9/1 admin-state disable
set / interface ethernet-1/9/2 admin-state disable
set / interface ethernet-1/9/3 admin-state enable
set / interface ethernet-1/9/3 vlan-tagging true
set / interface ethernet-1/9/3 subinterface 0 type bridged
set / interface ethernet-1/9/3 subinterface 0 admin-state enable
set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged
set / interface ethernet-1/9/4 admin-state disable
set / interface ethernet-1/29 admin-state enable
set / interface ethernet-1/29 subinterface 0 type routed
set / interface ethernet-1/29 subinterface 0 admin-state enable
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable
set / interface lo0 admin-state enable
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc1 role: spine"
set / network-instance default router-id 198.19.16.1
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/1.0
set / network-instance default interface ethernet-1/2.0
set / network-instance default interface ethernet-1/29.0
set / network-instance default interface ethernet-1/3.0
set / network-instance default interface ethernet-1/4.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.1
set / network-instance default protocols bgp dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay route-reflector client true
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.1
set / network-instance default protocols bgp neighbor 198.19.16.0 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.0 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.1
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/1.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/2.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/3.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/4.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0 passive true
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9/3.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
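
Taken together, this spine builds an OSPF point-to-point underlay across the fabric ports (with the loopback passive) and an iBGP EVPN overlay in AS 65500, acting as route reflector and accepting dynamic neighbors from 198.19.16.0/24. A minimal sanity check from the SR Linux CLI could look roughly like the following sketch; output formatting differs per release, and availability of these show paths on the deployed version is assumed:

show network-instance default protocols ospf neighbor
show network-instance default protocols bgp neighbor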

BIN
static/assets/frys-ix/nokia-7220-d2.png (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/frys-ix/nokia-7220-d4.png (Stored with Git LFS) Normal file

Binary file not shown.

View File

@@ -0,0 +1,105 @@
set / interface ethernet-1/9 admin-state enable
set / interface ethernet-1/9 vlan-tagging true
set / interface ethernet-1/9 ethernet port-speed 10G
set / interface ethernet-1/9 subinterface 0 type bridged
set / interface ethernet-1/9 subinterface 0 admin-state enable
set / interface ethernet-1/9 subinterface 0 vlan encap untagged
set / interface ethernet-1/53 admin-state enable
set / interface ethernet-1/53 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/53 subinterface 0 admin-state enable
set / interface ethernet-1/53 subinterface 0 ip-mtu 9190
set / interface ethernet-1/53 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/53 subinterface 0 ipv4 address 198.19.17.11/31
set / interface ethernet-1/53 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/55 admin-state enable
set / interface ethernet-1/55 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/55 subinterface 0 admin-state enable
set / interface ethernet-1/55 subinterface 0 ip-mtu 9190
set / interface ethernet-1/55 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/55 subinterface 0 ipv4 address 198.19.17.7/31
set / interface ethernet-1/55 subinterface 0 ipv6 admin-state enable
set / interface ethernet-1/56 admin-state enable
set / interface ethernet-1/56 ethernet forward-error-correction fec-option rs-528
set / interface ethernet-1/56 subinterface 0 admin-state enable
set / interface ethernet-1/56 subinterface 0 ip-mtu 9190
set / interface ethernet-1/56 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/56 subinterface 0 ipv4 address 198.19.17.9/31
set / interface ethernet-1/56 subinterface 0 ipv6 admin-state enable
set / interface lo0 admin-state enable
set / interface lo0 subinterface 0 admin-state enable
set / interface lo0 subinterface 0 ipv4 admin-state enable
set / interface lo0 subinterface 0 ipv4 address 198.19.16.3/32
set / interface mgmt0 admin-state enable
set / interface mgmt0 subinterface 0 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 admin-state enable
set / interface mgmt0 subinterface 0 ipv4 dhcp-client
set / interface mgmt0 subinterface 0 ipv6 admin-state enable
set / interface mgmt0 subinterface 0 ipv6 dhcp-client
set / interface system0 admin-state enable
set / interface system0 subinterface 0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 198.19.18.3/32
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default description "fabric: dc1 role: leaf"
set / network-instance default router-id 198.19.16.3
set / network-instance default ip-forwarding receive-ipv4-check false
set / network-instance default interface ethernet-1/53.0
set / network-instance default interface ethernet-1/55.0
set / network-instance default interface ethernet-1/56.0
set / network-instance default interface lo0.0
set / network-instance default interface system0.0
set / network-instance default protocols bgp admin-state enable
set / network-instance default protocols bgp autonomous-system 65500
set / network-instance default protocols bgp router-id 198.19.16.3
set / network-instance default protocols bgp afi-safi evpn admin-state enable
set / network-instance default protocols bgp preference ibgp 170
set / network-instance default protocols bgp route-advertisement rapid-withdrawal true
set / network-instance default protocols bgp route-advertisement wait-for-fib-install false
set / network-instance default protocols bgp group overlay peer-as 65500
set / network-instance default protocols bgp group overlay afi-safi evpn admin-state enable
set / network-instance default protocols bgp group overlay afi-safi ipv4-unicast admin-state disable
set / network-instance default protocols bgp group overlay local-as as-number 65500
set / network-instance default protocols bgp group overlay transport local-address 198.19.16.3
set / network-instance default protocols bgp neighbor 198.19.16.0 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.0 peer-group overlay
set / network-instance default protocols bgp neighbor 198.19.16.1 admin-state enable
set / network-instance default protocols bgp neighbor 198.19.16.1 peer-group overlay
set / network-instance default protocols ospf instance default admin-state enable
set / network-instance default protocols ospf instance default version ospf-v2
set / network-instance default protocols ospf instance default router-id 198.19.16.3
set / network-instance default protocols ospf instance default export-policy ospf
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/53.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/55.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface ethernet-1/56.0 interface-type point-to-point
set / network-instance default protocols ospf instance default area 0.0.0.0 interface lo0.0 passive true
set / network-instance default protocols ospf instance default area 0.0.0.0 interface system0.0
set / network-instance mgmt type ip-vrf
set / network-instance mgmt admin-state enable
set / network-instance mgmt description "Management network instance"
set / network-instance mgmt interface mgmt0.0
set / network-instance mgmt protocols linux import-routes true
set / network-instance mgmt protocols linux export-routes true
set / network-instance mgmt protocols linux export-neighbors true
set / network-instance peeringlan type mac-vrf
set / network-instance peeringlan admin-state enable
set / network-instance peeringlan interface ethernet-1/9.0
set / network-instance peeringlan vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 evi 2604
set / network-instance peeringlan protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
set / network-instance peeringlan protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
set / network-instance peeringlan bridge-table proxy-arp admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning admin-state enable
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning age-time 600
set / network-instance peeringlan bridge-table proxy-arp dynamic-learning send-refresh 180
set / routing-policy policy ospf statement 100 match protocol host
set / routing-policy policy ospf statement 100 action policy-result accept
set / routing-policy policy ospf statement 200 match protocol ospfv2
set / routing-policy policy ospf statement 200 action policy-result accept
set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
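
Compared to the spine, this leaf is not a route reflector: it peers explicitly with both spine loopbacks (198.19.16.0 and 198.19.16.1) and stitches the untagged access subinterface ethernet-1/9.0 into the peeringlan mac-vrf, which maps to VNI 2604 via vxlan1.2604. Once the overlay sessions are up, EVPN advertisements and learned MAC addresses could be inspected roughly as follows (a sketch; these commands are assumed to exist in recent SR Linux releases):

show network-instance default protocols bgp routes evpn route-type summary
show network-instance peeringlan bridge-table mac-table all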

BIN
static/assets/sflow/hsflowd-demo.png (Stored with Git LFS) Normal file

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

BIN
static/assets/sflow/sflow-lab-trex.png (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/sflow/sflow-lab.png (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/sflow/sflow-overview.png (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/sflow/sflow-vpp-overview.png (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/sflow/sflow-wireshark.png (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/sflow/sflow.gif (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/sflow/trex-acceptance.png (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/sflow/trex-baseline.png (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/sflow/trex-overload.png (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/sflow/trex-passthru.png (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/sflow/trex-sflow-acceptance.png (Stored with Git LFS) Normal file

Binary file not shown.

BIN
static/assets/sflow/trex-v1.png (Stored with Git LFS) Normal file

Binary file not shown.

4
static/mta-sts.txt Normal file
View File

@@ -0,0 +1,4 @@
version: STSv1
mode: enforce
mx: smtp.ipng.ch
max_age: 86400
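
For this policy to be honored, RFC 8461 also expects the file to be served at https://mta-sts.ipng.ch/.well-known/mta-sts.txt and the domain to announce the current policy via a TXT record, roughly like this (the id value below is an arbitrary placeholder, not taken from the actual zone):

_mta-sts.ipng.ch. IN TXT "v=STSv1; id=20250504"  ; id is a placeholder; change it whenever the policy changes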

21
static/prefixes.txt Normal file
View File

@@ -0,0 +1,21 @@
# Source: https://ipng.ch/prefixes.txt
#
194.1.163.64/27 # AS8298 IPng Networks
2001:678:d78:3::/64 # AS8298 IPng Networks
94.142.241.184/29 # AS8283 Coloclue
2a02:898:146::/64 # AS8283 Coloclue
46.20.246.112/28 # AS25091 IP-Max
2a02:2528:ff00::/64 # AS25091 IP-Max
193.109.122.0/26 # AS12859 BIT
2001:7b8:3:1e::/64 # AS12859 BIT
2001:678:d78:300::/56 # IPng Wireguard
46.20.243.179/32 # IPng Trusted border0.nlams3
2a02:2528:ff02::179/128 # IPng Trusted border0.nlams3
194.1.163.153/32 # IPng Trusted border0.chplo0
2001:678:d78:7::2/128 # IPng Trusted border0.chplo0
194.1.163.190/32 # IPng Trusted border0.chrma0
2001:678:d78:b::2/128 # IPng Trusted border0.chrma0
194.1.163.68/32 # IPng Trusted border0.chbtl0
2001:678:d78:3::4/128 # IPng Trusted border0.chbtl0

11
static/sshkeys.txt Normal file
View File

@@ -0,0 +1,11 @@
# Source: https://ipng.ch/sshkeys.txt
#
# OpenBSD bastion
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAXMfDOJtI3JztcPJ1DZMXzILZzMilMvodvMIfqqa1qr pim+openbsd@ipng.ch
# Mac Studio (Secretive)
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBMtJZgTDWxBEbQ2vPYtOw4L0s4VRKUUjpu6aFPVx3CpqrjLpyJIxzBWTfb/VnEp95VfgM8IUAYYM8w7xoLd7QZc= pim+jessica+secretive@ipng.ch
# Macbook Air M4 (Secretive)
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBASymGKXfKkfsYbo7UDrIBxl1F6X7LmVPQ3XOFOKp8tLI6zLyCYs5zgRNs/qksHOgKUK+fE/TzJ4XJsuSbYNMB0= pim+tammy+secretive@ipng.ch

View File

@@ -17,6 +17,7 @@ $text-very-light: #767676;
$medium-light-text: #4f4a5f;
$code-background: #f3f3f3;
$codeblock-background: #f6f8fa;
$codeblock-text: #99a;
$code-text: #f8f8f2;
$ipng-orange: #f46524;
$ipng-darkorange: #8c1919;
@@ -63,7 +64,7 @@ body {
line-height: $base-line-height * 1.0;
font-family: BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol";
@media #{$mq-mini} { font-size: $base-font-size * 0.8; }
/* $mq-mini or smaller */ font-size: $base-font-size * 0.8;
@media #{$mq-small} { font-size: $base-font-size * 0.9; }
@media #{$mq-medium} { font-size: $base-font-size * 1.0; }
@media #{$mq-large} { font-size: $base-font-size * 1.0; }
@@ -78,8 +79,8 @@ main {
padding: 1em 1em;
padding-bottom: 5em;
@media #{$mq-mini} { margin: 0 2%; }
@media #{$mq-small} { margin: 0 2%; }
/* $mq-mini or smaller */ margin: 0 2%;
@media #{$mq-small} { margin: 0 5%; }
@media #{$mq-medium} { margin: 0 17%; }
@media #{$mq-large} { margin: 0 21%; }
@media #{$mq-xlarge} { margin: 0 24%; }
@@ -142,7 +143,7 @@ pre {
code {
background-color: transparent;
color: #444;
color: $codeblock-text;
}
}
@@ -341,7 +342,7 @@ nav li:hover {
display: flex;
flex-flow: row wrap;
@media #{$mq-mini} { margin: 0; width: 100%; font-size: $base-font-size * 0.8; }
/* $mq-mini or smaller: */ margin: 0; width: 100%; font-size: $base-font-size * 0.8;
@media #{$mq-small} { margin: 0 5%; width: 90%; font-size: $base-font-size * 0.9; }
@media #{$mq-medium} { margin: 0 17%; width: 66%; font-size: $base-font-size * 1.0; }
@media #{$mq-large} { margin: 0 21%; width: 58%; font-size: $base-font-size * 1.0; }
@@ -367,7 +368,7 @@ nav li:hover {
color: $text-light;
background-color: #f7f7f7;
@media #{$mq-mini} { margin: 0; width: 100%; font-size: $base-font-size * 0.8; }
/* $mq-mini or smaller: */ margin: 0; width: 100%; font-size: $base-font-size * 0.8;
@media #{$mq-small} { margin: 0 5%; width: 90%; font-size: $base-font-size * 0.9; }
@media #{$mq-medium} { margin: 0 17%; width: 66%; font-size: $base-font-size * 1.0; }
@media #{$mq-large} { margin: 0 21%; width: 58%; font-size: $base-font-size * 1.0; }

View File

@@ -1,5 +1,5 @@
<!DOCTYPE html>
<html>
<html lang="en">
{{- partial "head.html" . -}}
<body>
{{- partial "header.html" . -}}

View File

@@ -1,26 +1,26 @@
<head>
<title>{{ .Site.Title }} {{ with .Title }}- {{ . }} {{ end }}</title>
<link rel="stylesheet" type="text/css" href="{{ "css/fonts.css" | relURL }}" />
<link rel="stylesheet" type="text/css" href="{{ "css/fontawesome.css" | relURL }}" />
<link rel="stylesheet" type="text/css" href="{{ "css/fonts.css" | relURL }}">
<link rel="stylesheet" type="text/css" href="{{ "css/fontawesome.css" | relURL }}">
{{ $options := dict "transpiler" "libsass" "targetPath" "css/styles.css" -}}
{{ $style := resources.Get "styles.scss" | toCSS $options | minify | fingerprint -}}
<link rel="stylesheet" type="text/css" href="{{ $style.RelPermalink }}">
{{ with resources.Get "css/userstyles.css" }}
<link rel="stylesheet" type="text/css" href="{{ .Permalink }}">
{{ end -}}
<link rel="icon" href="/assets/logo/favicon/favicon.ico" type="image/x-icon" sizes="any" />
<link rel="apple-touch-icon" href="/assets/logo/favicon/apple-touch-icon.png" />
<link rel="manifest" href="/assets/logo/favicon/icon.manifest" />
<link rel="icon" href="/assets/logo/favicon/favicon.ico" type="image/x-icon" sizes="any">
<link rel="apple-touch-icon" href="/assets/logo/favicon/apple-touch-icon.png">
<link rel="manifest" href="/assets/logo/favicon/icon.manifest">
<meta charset="UTF-8">
<meta name="author" content="{{ .Site.Params.Author }}">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
{{ range .AlternativeOutputFormats -}}
{{ printf `<link rel="%s" type="%s" href="%s" title="%s" />` .Rel .MediaType.Type .Permalink $.Site.Title | safeHTML }}
{{ end -}}
{{- range .AlternativeOutputFormats }}
{{ printf `<link rel="%s" type="%s" href="%s" title="%s">` .Rel .MediaType.Type .Permalink $.Site.Title | safeHTML }}
{{- end }}
<script defer data-domain="ipng.ch" data-api="/api/event" src="/js/script.js"></script>
{{ if eq .Params.asciinema true -}}
<link rel="stylesheet" type="text/css" href="{{ "css/asciinema-player.css" | relURL }}" />
{{- if eq .Params.asciinema true }}
<link rel="stylesheet" type="text/css" href="{{ "css/asciinema-player.css" | relURL }}">
<script src="{{ "js/asciinema-player.min.js" | relURL }}"></script>
{{- end }}
</head>

View File

@@ -10,7 +10,7 @@
<img src="{{ .Get "src" | relURL }}"
{{- if or (.Get "alt") (.Get "caption") }} alt="{{ with .Get "alt" }}{{ replace . "'" "&#39;" }}{{ else }}{{ .Get "caption" | markdownify| plainify }}{{ end }}"
{{- end -}}
/> <!-- Closing img tag -->
> <!-- Closing img tag -->
{{- if .Get "link" }}</a>{{ end -}}
{{- if or (or (.Get "title") (.Get "caption")) (.Get "attr") }}
<figcaption>