---
date: "2021-06-28T10:59:42Z"
title: Launch of AS112
aliases:
- /s/articles/2021/06/28/as112.html
---

I'm one of those people who is a fan of low-latency, high-performance
distributed service architectures. After building out the IPng Network across
Europe, I noticed a rather stark difference in the presence of one particular
service: **AS112** anycast nameservers. In particular, I have only one Internet
Exchange in common with a direct AS112 presence: FCIX in California.
Big-up to the kind folks in Fremont who operate [www.as112.net](https://www.as112.net).

## The Problem

Looking around Switzerland, I found that no internet exchange actually has AS112
as a direct member, and as such you'll find the service tucked away behind
several ISPs, with AS paths such as `13030 29670 112`, `6939 112` and `34019 112`.
A traceroute from a popular Swiss ISP, [Init7](https://init7.net/), goes to
Germany, at a round-trip latency of 18.9ms. My own latency is 146ms, as my
queries are served from FCIX:

```
pim@spongebob:~$ traceroute prisoner.iana.org
traceroute to prisoner.iana.org (192.175.48.1), 64 hops max, 40 byte packets
 1  fiber7.xe8.chbtl0.ipng.ch (194.126.235.33)  2.658 ms  0.754 ms  0.523 ms
 2  1790bre1.fiber7.init7.net (81.6.42.1)  1.132 ms  1.077 ms  3.621 ms
 3  780eff1.fiber7.init7.net (109.202.193.44)  1.238 ms  1.162 ms  1.188 ms
 4  r1win12.core.init7.net (77.109.181.155)  2.096 ms  2.1 ms  2.1 ms
 5  r1zrh6.core.init7.net (82.197.168.222)  2.086 ms  3.904 ms  2.183 ms
 6  r1glb1.core.init7.net (5.180.135.134)  2.043 ms  3.621 ms  2.088 ms
 7  r2zrh2.core.init7.net (82.197.163.213)  2.353 ms  2.522 ms  2.289 ms
 8  r2zrh2.core.init7.net (5.180.135.156)  2.08 ms  2.299 ms  2.202 ms
 9  r1fra3.core.init7.net (5.180.135.173)  7.65 ms  7.582 ms  7.546 ms
10  r1fra2.core.init7.net (5.180.135.126)  7.928 ms  7.831 ms  7.997 ms
11  r1ber1.core.init7.net (77.109.129.8)  19.395 ms  19.287 ms  19.558 ms
12  octalus.in-berlin.a36.community-ix.de (185.1.74.3)  18.839 ms  18.717 ms  29.615 ms
13  prisoner.iana.org (192.175.48.1)  18.536 ms  18.613 ms  18.766 ms

pim@chumbucket:~$ traceroute blackhole-1.iana.org
traceroute to blackhole-1.iana.org (192.175.48.6), 30 hops max, 60 byte packets
 1  chbtl1.ipng.ch (194.1.163.67)  0.247 ms  0.158 ms  0.107 ms
 2  chgtg0.ipng.ch (194.1.163.19)  0.514 ms  0.474 ms  0.419 ms
 3  usfmt0.ipng.ch (194.1.163.23)  146.451 ms  146.406 ms  146.364 ms
 4  blackhole-1.iana.org (192.175.48.6)  146.323 ms  146.281 ms  146.239 ms
```

This path goes to FCIX because it's the only place where AS50869 picks up
AS112 directly at an internet exchange, and therefore local preference
makes this route win. But that's a long way to go for my DNS queries!

I think we can do better.

## Introduction

Taken from [RFC7534](https://datatracker.ietf.org/doc/html/rfc7534):

> Many sites connected to the Internet make use of IPv4 addresses that
> are not globally unique. Examples are the addresses designated in
> RFC 1918 for private use within individual sites.
>
> Devices in such environments may occasionally originate Domain Name
> System (DNS) queries (so-called "reverse lookups") corresponding to
> those private-use addresses. Since the addresses concerned have only
> local significance, it is good practice for site administrators to
> ensure that such queries are answered locally. However, it is not
> uncommon for such queries to follow the normal delegation path in the
> public DNS instead of being answered within the site.
>
> It is not possible for public DNS servers to give useful answers to
> such queries. In addition, due to the wide deployment of private-use
> addresses and the continuing growth of the Internet, the volume of
> such queries is large and growing. The AS112 project aims to provide
> a distributed sink for such queries in order to reduce the load on
> the corresponding authoritative servers. The AS112 project is named
> after the Autonomous System Number (ASN) that was assigned to it.

## Deployment

It's actually quite straightforward; the deployment consists of roughly
three steps:

1. Procure hardware to run the instances of the nameserver on.
1. Configure the nameserver to serve the zonefiles.
1. Announce the anycast service locally/regionally.

Let's discuss each in turn.

### Hardware

For the hardware, I've decided to use existing server platforms at IP-Max
and IPng Networks. There are two types of hardware, both tried and tested:
one is an HP ProLiant DL380 Gen9, and the other is an older Dell PowerEdge R610.

Because each vendor ships specific parts and each machine is different, many
appliance vendors choose to virtualize their environment so that the guest
operating system finds a very homogeneous configuration. For my purposes,
the virtualization platform is Xen and the guest is a (para)virtualized
Debian.

I will be starting with three nodes: one in Geneva and one in Zurich, hosted
on hypervisors of [IP-Max](https://www.ip-max.net/), and one in Amsterdam,
hosted on a hypervisor of [IPng](https://ipng.ch/). I have a feeling a few
more places will follow.

#### Install the OS

Xen makes this repeatable and straightforward. Other systems, such as KVM,
have very similar installers; VMBuilder, for example, is popular. Both work
roughly the same way, and install a guest in a matter of minutes.

I'll install to an LVM volume group on all machines, backed by pairs of SSDs
for throughput and redundancy. We'll give the guest 4GB of memory and 4
CPUs. I love how the machine boots using [PyGrub](https://wiki.debian.org/PyGrub),
fully on serial, and is fully booted and running in 20 seconds.

```
sudo xen-create-image --hostname as112-1.free-ix.net --ip 46.20.249.197 \
  --vcpus 4 --pygrub --dist buster --lvm=vg1_hvn04_gva20
sudo xl create -c as112-1.free-ix.net.cfg
```

After logging in, the following additional software was installed. We'll be
using [Bird2](https://bird.network.cz/), which is available in Debian Buster's
backports. Otherwise, we're pretty vanilla:

```
$ cat << EOF | sudo tee -a /etc/apt/sources.list
#
# Backports
#
deb http://deb.debian.org/debian buster-backports main
EOF

$ sudo apt update
$ sudo apt install tcpdump sudo net-tools bridge-utils nsd bird2 \
    netplan.io traceroute ufw curl bind9-dnsutils
$ sudo apt purge ifupdown
```

I removed the `/etc/network/interfaces` approach and configured Netplan,
a personal choice which aligns these machines more closely with other servers
in the IPng fleet. The only trick is to ensure that the anycast IP addresses
are available for the nameserver to listen on, so at the top of Netplan's
configuration file, we add them like so:

```
network:
  version: 2
  renderer: networkd
  ethernets:
    lo:
      addresses:
        - 127.0.0.1/8
        - ::1/128
        - 192.175.48.1/32      # prisoner.iana.org (anycast)
        - 2620:4f:8000::1/128  # prisoner.iana.org (anycast)
        - 192.175.48.6/32      # blackhole-1.iana.org (anycast)
        - 2620:4f:8000::6/128  # blackhole-1.iana.org (anycast)
        - 192.175.48.42/32     # blackhole-2.iana.org (anycast)
        - 2620:4f:8000::42/128 # blackhole-2.iana.org (anycast)
        - 192.31.196.1/32      # blackhole.as112.arpa (anycast)
        - 2001:4:112::1/128    # blackhole.as112.arpa (anycast)
```

### Nameserver

My nameserver of choice is [NSD](https://www.nlnetlabs.nl/projects/nsd/about/),
and its configuration is similar to that of BIND, which is described in RFC7534.
In fact, the zone files are identical, so all we have to do is create a few
listen statements and load up the zones:

```
$ cat << EOF | sudo tee /etc/nsd/nsd.conf.d/listen.conf
server:
    ip-address: 127.0.0.1
    ip-address: ::1
    ip-address: 46.20.249.197
    ip-address: 2a02:2528:a04:202::197

    ip-address: 192.175.48.1      # prisoner.iana.org (anycast)
    ip-address: 2620:4f:8000::1   # prisoner.iana.org (anycast)

    ip-address: 192.175.48.6      # blackhole-1.iana.org (anycast)
    ip-address: 2620:4f:8000::6   # blackhole-1.iana.org (anycast)

    ip-address: 192.175.48.42     # blackhole-2.iana.org (anycast)
    ip-address: 2620:4f:8000::42  # blackhole-2.iana.org (anycast)

    ip-address: 192.31.196.1      # blackhole.as112.arpa (anycast)
    ip-address: 2001:4:112::1     # blackhole.as112.arpa (anycast)

    server-count: 4
EOF

$ cat << EOF | sudo tee /etc/nsd/nsd.conf.d/as112.conf
zone:
    name: "hostname.as112.net"
    zonefile: "/etc/nsd/master/db.hostname.as112.net"

zone:
    name: "hostname.as112.arpa"
    zonefile: "/etc/nsd/master/db.hostname.as112.arpa"

zone:
    name: "10.in-addr.arpa"
    zonefile: "/etc/nsd/master/db.dd-empty"

# etcetera
EOF
```

While all of the zones are backed by `db.dd-empty` or `db.dr-empty`, which
can be found in the RFC text, I'll note that the top two are special, as they
are specific to the instance. For example, on our Geneva instance:

```
$ cat << EOF | sudo tee /etc/nsd/master/db.hostname.as112.arpa
$TTL 1W
@  SOA  chplo01.paphosting.net. noc.ipng.ch. (
        1    ; serial number
        1W   ; refresh
        1M   ; retry
        1W   ; expire
        1W ) ; negative caching TTL
   NS   blackhole.as112.arpa.
   TXT  "AS112 hosted by IPng Networks" "Geneva, Switzerland"
   TXT  "See https://www.as112.net/ for more information."
   TXT  "See https://free-ix.net/ for local information."
   TXT  "Unique IP: 194.1.163.147"
   TXT  "Unique IP: [2001:678:d78:7::147]"
   LOC  46 9 55.501 N 6 6 25.870 E 407.00m 10m 100m 10m
EOF
```

This is super helpful to users who want to know which server, exactly,
is serving their request. Not all operators add the `Unique IP` details,
but I found them useful when launching the service, as several anycast nodes
quickly become confusing otherwise :-)

After this is all done, the nameserver can be started. I rebooted the guest
for good measure, and about 19 seconds later (a fact that continues to
amaze me), the server was up and serving queries, albeit only from localhost,
because there is no way to reach the server on the network yet.

To validate that things work, we can do a few SOA or TXT queries, like this one:
```
pim@nlams01:~$ ping -c5 -q prisoner.iana.org
PING prisoner.iana.org(prisoner.iana.org (2620:4f:8000::1)) 56 data bytes

--- prisoner.iana.org ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 34ms
rtt min/avg/max/mdev = 0.041/0.045/0.053/0.004 ms

pim@nlams01:~$ dig @prisoner.iana.org hostname.as112.net TXT +short +norec
"AS112 hosted by IPng Networks" "Amsterdam, The Netherlands"
"See http://www.as112.net/ for more information."
"Unique IP: 94.142.241.187"
"Unique IP: [2a02:898:146::2]"
```

### Network

Now comes the fun part! We're running these instances of the nameservers
in a few locations, and to ensure we don't route traffic to the incorrect
location, we'll announce them using BGP, as recommended by RFC7534.

My routing suite of choice is [Bird2](https://bird.network.cz/), which comes
with a lot of extensibility and programmatic validation of routing policies.

We'll only be using the `static` and `BGP` routing protocols for Bird, so the
configuration is relatively straightforward. First we create a routing table
export for IPv4 and IPv6; then we define some static _Nullroutes_, which ensure
that our prefixes are always present in the RIB (otherwise BGP will not export
them); then we create some filter functions (one for routeserver sessions,
one for peering sessions, and one for transit sessions); and finally we include
a few specific configuration files, one per environment where we'll be active.

```
$ cat << EOF | sudo tee /etc/bird/bird.conf
router id 46.20.249.197;

protocol kernel fib4 {
  ipv4 { export all; };
  scan time 60;
}

protocol kernel fib6 {
  ipv6 { export all; };
  scan time 60;
}

protocol static static_as112_ipv4 {
  ipv4;
  route 192.175.48.0/24 blackhole;
  route 192.31.196.0/24 blackhole;
}

protocol static static_as112_ipv6 {
  ipv6;
  route 2620:4f:8000::/48 blackhole;
  route 2001:4:112::/48 blackhole;
}

include "bgp-freeix.conf";
include "bgp-ipng.conf";
include "bgp-ipmax.conf";
EOF
```

The configuration file per environment, say `bgp-freeix.conf`, can (and will)
be autogenerated, but the pattern is of the following form:

```
$ cat << EOF | tee /etc/bird/bgp-freeix.conf
#
# Bird AS112 configuration for FreeIX
#
define my_ipv4 = 185.1.205.252;
define my_ipv6 = 2001:7f8:111:42::70:1;

protocol bgp freeix_as51530_1_ipv4 {
  description "FreeIX - AS51530 - Routeserver #1";
  local as 112;
  source address my_ipv4;
  neighbor 185.1.205.254 as 51530;
  ipv4 {
    import where fn_import_routeserver( 51530 );
    export where proto = "static_as112_ipv4";
    import limit 120000 action restart;
  };
}

protocol bgp freeix_as51530_1_ipv6 {
  description "FreeIX - AS51530 - Routeserver #1";
  local as 112;
  source address my_ipv6;
  neighbor 2001:7f8:111:42::c94a:1 as 51530;
  ipv6 {
    import where fn_import_routeserver( 51530 );
    export where proto = "static_as112_ipv6";
    import limit 120000 action restart;
  };
}

# etcetera
EOF
```

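Since each stanza differs only in a handful of fields, autogenerating these files is mostly string templating. As a minimal sketch, assuming a hypothetical `gen_bgp_v4` helper (in practice these files would be rendered from IXPManager's member data, not by hand), one could write:

```shell
#!/bin/sh
# Hypothetical generator: emit one IPv4 BGP protocol stanza per route
# server. Name, ASN and neighbor address are the only varying inputs.
gen_bgp_v4() {
  name=$1; peer_as=$2; neighbor=$3
  cat <<EOF
protocol bgp ${name} {
  description "Routeserver session to AS${peer_as}";
  local as 112;
  source address my_ipv4;
  neighbor ${neighbor} as ${peer_as};
  ipv4 {
    import where fn_import_routeserver( ${peer_as} );
    export where proto = "static_as112_ipv4";
    import limit 120000 action restart;
  };
}
EOF
}

gen_bgp_v4 freeix_as51530_1_ipv4 51530 185.1.205.254
```

The IPv6 variant is analogous, swapping the address family and source address.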
If you've seen IXPManager's approach to routeserver configuration generation,
you'll notice I borrowed the `fn_import()` function and its dependents from
there. This allows imports to be restricted by prefix list and AS path, and
ensures some _Belts and Braces_ checks are in place (no invalid or Tier-1 ASN
in the path, a valid next hop, no tricks with AS path truncation, and so on).

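I won't reproduce the whole filter here, but a simplified, hypothetical rendition of the route server variant gives the flavour of those checks. The ASN list, path-length limit, and prefix set below are illustrative only, not the actual IXPManager code:

```
function fn_import_routeserver(int peer_as)
{
  # Reject implausibly long AS paths.
  if bgp_path.len > 64 then return false;

  # Never accept a path containing a Tier-1 carrier ASN; a route
  # server peer should not be transiting one of those.
  if bgp_path ~ [174, 1299, 2914, 3257, 3356] then return false;

  # Drop private-use and otherwise bogus prefixes (IPv4 examples).
  if net.type = NET_IP4 && net ~ [ 0.0.0.0/8+, 10.0.0.0/8+,
      100.64.0.0/10+, 172.16.0.0/12+, 192.168.0.0/16+ ] then return false;

  return true;
}
```
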
After bringing up the service, the prefixes make their way into the
routeserver and get distributed to the FreeIX participants:

```
$ sudo systemctl start bird
$ sudo birdc show protocol
BIRD 2.0.7 ready.
Name                   Proto      Table      State  Since         Info
fib4                   Kernel     master4    up     2021-06-28 11:01:35
fib6                   Kernel     master6    up     2021-06-28 11:01:35
device1                Device     ---        up     2021-06-28 11:01:35
static_as112_ipv4      Static     master4    up     2021-06-28 11:01:35
static_as112_ipv6      Static     master6    up     2021-06-28 11:01:35
freeix_as51530_1_ipv4  BGP        ---        up     2021-06-28 11:01:17  Established
freeix_as51530_1_ipv6  BGP        ---        up     2021-06-28 11:01:19  Established
freeix_as51530_2_ipv4  BGP        ---        up     2021-06-28 11:01:32  Established
freeix_as51530_2_ipv6  BGP        ---        up     2021-06-28 11:01:37  Established
```

#### Internet Exchanges

Having one configuration file per group helps a lot with integration of
[IXPManager](https://www.ixpmanager.org/), where we might autogenerate the IXP
versions of these files and install them periodically. That way, when members
enable the `AS112` peering checkmark, the servers will automatically download
and set up those sessions without human involvement -- typically this is the
best way to avoid outages: never tinker with production config files by hand.
We'll test this out with [FreeIX](https://free-ix.net/), but hope to offer our
service to other internet exchanges as well, notably SwissIX and CIXP.

One of the huge benefits of operating within the [IP-Max](https://ip-max.net/)
network is their ability to do L2VPN transport from any place on-net to any
other router. As such, connecting these virtual machines to other places,
like SwissIX, CIXP, CHIX-CH, Community-IX or other, further away places,
is a piece of cake. All we must do is create an L2VPN and offer it to the
hypervisor (which is usually connected via a LACP _BundleEthernet_) on some
VLAN, after which we can bridge that into the guest OS by creating a new
virtio NIC. This is how, in the example above, our AS112 machines were
introduced to FreeIX. This scales very well, requiring only one guest reboot
per internet exchange, and greatly simplifies operations.

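On the Xen side, this amounts to one extra `vif` entry in the guest's configuration; a hypothetical fragment (the bridge names here are illustrative, not our actual ones) could look like:

```
# /etc/xen/as112-1.free-ix.net.cfg (fragment)
# First NIC: the regular unicast network; second NIC: bridged into
# the VLAN carrying the L2VPN towards the IXP peering LAN.
vif = [ 'ip=46.20.249.197,bridge=xenbr0',
        'bridge=xenbr-freeix' ]
```

An `xl shutdown` / `xl create` cycle then brings the guest up with the new NIC, which is the one reboot per internet exchange mentioned above.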
### Monitoring

Of course, one would not want to run a production service, certainly not
on the public internet, without a bit of introspection and monitoring.

There are four things we might want to ensure:
1. Is the machine up and healthy? For this we use NAGIOS.
1. Is NSD serving? For this we use the NSD Exporter and Prometheus/Grafana.
1. Is NSD reachable? For this we use CloudProber.
1. If there is an issue, can we alert an operator? For this we use Telegram.

In a followup post, I'll demonstrate how these things come together into
a comprehensive anycast monitoring and alerting solution. As a fringe benefit,
we can show contemporary graphs and dashboards. But seeing as the service
hasn't yet gotten a lot of mileage, that deserves its own followup post, some
time in August.

## The results

First things first - latency went waaaay down:
```
pim@chumbucket:~$ traceroute blackhole-1.iana.org
traceroute to blackhole-1.iana.org (192.175.48.6), 30 hops max, 60 byte packets
 1  chbtl1.ipng.ch (194.1.163.67)  0.257 ms  0.199 ms  0.159 ms
 2  chgtg0.ipng.ch (194.1.163.19)  0.468 ms  0.430 ms  0.430 ms
 3  chrma0.ipng.ch (194.1.163.8)  0.648 ms  0.611 ms  0.597 ms
 4  blackhole-1.iana.org (192.175.48.6)  1.272 ms  1.236 ms  1.201 ms

pim@chumbucket:~$ dig -6 @prisoner.iana.org hostname.as112.net txt +short +norec +tcp
"Free-IX hosted by IP-Max SA" "Zurich, Switzerland"
"See https://www.as112.net/ for more information."
"See https://free-ix.net/ for local information."
"Unique IP: 46.20.246.67"
"Unique IP: [2a02:2528:1703::67]"
```

And this demonstrates why it's super useful to have the `hostname.as112.net`
entries populated well. If I'm in Amsterdam, I'll be served by the local node there:

```
pim@gripe:~$ traceroute6 blackhole-2.iana.org
traceroute6 to blackhole-2.iana.org (2620:4f:8000::42), 64 hops max, 60 byte packets
 1  nlams0.ipng.ch (2a02:898:146::1)  0.744 ms  0.879 ms  0.818 ms
 2  blackhole-2.iana.org (2620:4f:8000::42)  1.104 ms  1.064 ms  1.035 ms

pim@gripe:~$ dig -4 @prisoner.iana.org hostname.as112.net txt +short +norec +tcp
"Hosted by IPng Networks" "Amsterdam, The Netherlands"
"See http://www.as112.net/ for more information."
"Unique IP: 94.142.241.187"
"Unique IP: [2a02:898:146::2]"
```

Of course, due to anycast, and me being in Zurich, I will be served primarily
by the Zurich node. If it were to go down for maintenance or a hardware failure,
BGP would immediately converge on alternate paths; there are currently three
to choose from:

```
pim@chrma0:~$ show protocols bgp ipv4 unicast 192.31.196.0/24
BGP routing table entry for 192.31.196.0/24
Paths: (10 available, best #2, table default)
  Advertised to non peer-group peers:
  185.1.205.251 194.1.163.1 [...]
  112
    194.1.163.32 (metric 137) from 194.1.163.32 (194.1.163.32)
      Origin IGP, localpref 400, valid, internal
      Community: 50869:3500 50869:4099 50869:5055
      Last update: Mon Jun 28 11:13:14 2021
  112
    185.1.205.251 from 185.1.205.251 (46.20.246.67)
      Origin IGP, localpref 400, valid, external, bestpath-from-AS 112, best (Local Pref)
      Community: 50869:3500 50869:4099 50869:5000 50869:5020 50869:5060
      Last update: Mon Jun 28 11:00:45 2021
  112
    185.1.205.251 from 185.1.205.253 (185.1.205.253)
      Origin IGP, localpref 200, valid, external
      Community: 50869:1061
      Last update: Mon Jun 28 11:00:20 2021

(and more)
```

I am expecting a few more direct paths to come as I harden this service and
offer it to other Swiss internet exchange points in the future. But mostly, my
mission of reducing the round-trip time from 146ms to 1ms from my desktop at
home was accomplished.