---
date: "2021-11-14T22:49:09Z"
title: Case Study - BGP Routing Policy
aliases:
- /s/articles/2021/11/14/routing-policy.html
---

# Introduction

BGP routing policy is a very interesting topic. I get asked about it formally
and informally all the time. I have to admit, there are lots of ways to organize
an autonomous system. Vendors have unique features and templating / procedural
functions, but in the end, BGP routing policy all boils down to two+two things:

1. Not accepting the prefixes you don't want (inbound)
   * For those prefixes accepted, ensure they have correct attributes.
1. Not announcing prefixes to folks who shouldn't see them (outbound)
   * For those prefixes announced, ensure they have correct attributes.

At IPng Networks, I've cycled through a few iterations and landed on a specific
setup that works well for me. It provides sufficient information to enable our
downstreams (customers) to make good decisions on what they should accept from
us, as well as enough expressivity for them to determine which prefixes we
should propagate for them, where, and how.

This article describes one approach to a relatively feature-rich routing policy
which is in use at IPng Networks (AS8298). It uses the [Bird2](https://bird.network.cz/) configuration
language, although the concepts would be implementable in ~any modern routing
suite (e.g. FRR, Cisco, Juniper, Arista, Extreme, et cetera).

Interested in one operator's opinion? Read on!

## 1. Concepts

There are three basic pieces of routing filtering, which I'll describe briefly.

### Prefix Lists

A prefix list (also sometimes referred to as an access-list in older software)
is a list of IPv4 or IPv6 prefixes, often with a prefix length boundary, that
determines if a given prefix is "in" or "out".

An example could be: `2001:db8::/32{32,48}` which describes any prefix in the
supernet `2001:db8::/32` that has a prefix length of anywhere between /32 and
/48, inclusive.

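
As a rough illustration (Python, not part of the Bird configuration), the matching rule behind `2001:db8::/32{32,48}` can be sketched with the standard `ipaddress` module. The helper name `prefix_matches` is made up for this example:

```python
import ipaddress

def prefix_matches(candidate, supernet, min_len, max_len):
    """True if candidate lies inside supernet and its length is in [min_len, max_len]."""
    cand = ipaddress.ip_network(candidate)
    sup = ipaddress.ip_network(supernet)
    if cand.version != sup.version:
        return False
    return cand.subnet_of(sup) and min_len <= cand.prefixlen <= max_len

# Models the prefix list entry 2001:db8::/32{32,48}
print(prefix_matches("2001:db8:1234::/48", "2001:db8::/32", 32, 48))       # True
print(prefix_matches("2001:db8:1234:5678::/64", "2001:db8::/32", 32, 48))  # False: too long
print(prefix_matches("2001:db9::/32", "2001:db8::/32", 32, 48))            # False: outside the supernet
```
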
### AS Paths

In BGP, each prefix learned comes with an AS path describing how to reach it. If my router
learns a prefix from a peer with AS number `65520`, every prefix that peer sends will
carry a list of AS numbers starting with 65520. With AS Paths, the very first
one in the list is the one the router directly learned the prefix from, and the very
last one is the origin of the prefix. Often the AS path is written as a regular
expression, starting with `^` and ending with `$`, and to help readability,
spaces are often written as `_`.

Examples: `^25091_1299_3301$` and `^58299_174_1299_3301$`

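
To make the "first is the neighbor, last is the origin" rule concrete, here is a small Python sketch that parses the notation used in the examples above. The function name `parse_as_path` is an assumption for illustration only:

```python
def parse_as_path(expr):
    """Turn an anchored AS path such as '^25091_1299_3301$' into a list of integers."""
    body = expr.strip().lstrip("^").rstrip("$")
    return [int(asn) for asn in body.split("_") if asn]

path = parse_as_path("^58299_174_1299_3301$")
neighbor, origin = path[0], path[-1]
print(neighbor, origin, len(path))  # 58299 3301 4
```
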
### BGP Communities

When learning (or originating) a prefix in BGP, zero or more so-called `communities`
can be added to it along the way. The _Routing Information Base_ or _RIB_ carries
these communities and can share them between peering sessions. Communities can be
added, removed and modified. Some communities have special meaning (which is
agreed upon by everyone), and some have local meaning (agreed upon by only
one or a small set of operators).

There are three types of communities: _normal_ communities are a pair of 16-bit
integers; _extended_ communities are 8 bytes, split into one 16-bit integer
and an additional 48-bit value; and finally _large_ communities consist of
a triplet of 32-bit values.

Examples: `(8298, 1234)` (normal), or `(8298, 3, 212323)` (large)

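
The size difference between normal and large communities is easy to see when packing them into bytes. The following Python sketch assumes the layouts of RFC 1997 (two 16-bit fields) and RFC 8092 (three 32-bit fields); the function names are invented for this example:

```python
import struct

def encode_community(asn, value):
    """Normal community: two 16-bit fields in network byte order (4 bytes)."""
    return struct.pack("!HH", asn, value)

def encode_large_community(global_admin, local1, local2):
    """Large community: three 32-bit fields in network byte order (12 bytes)."""
    return struct.pack("!III", global_admin, local1, local2)

print(encode_community(8298, 1234).hex())            # 206a04d2
print(len(encode_large_community(8298, 3, 212323)))  # 12
```

Note that the last field of `(8298, 3, 212323)` would not fit in a normal community at all, because 212323 exceeds 16 bits. That is the practical reason for carrying 32-bit AS numbers in large communities.
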
# Routing Policy

Now that I've explained a little bit about the ingredients we have to work with,
let me share an observation that took me a few decades to make: BGP sessions are
really all the same. As such, every single one of the BGP sessions at IPng Networks
is generated with one template. What makes the difference between 'Transit', 'Customer',
'Peer' and 'Private Interconnect' really all boils down to what types of filtering
are applied on in- and outbound updates. I will demonstrate this by means of two main
functions in Bird: `ebgp_import()`, discussed first in the section ***Inbound: Learning
Routes***, and `ebgp_export()` in the section ***Outbound: Announcing Routes***.

## 2. Inbound: Learning Routes

Let's consider this function:

```
function ebgp_import(int remote_as) {
  if aspath_bogon() then return false;
  if (net.type = NET_IP4 && ipv4_bogon()) then return false;
  if (net.type = NET_IP6 && ipv6_bogon()) then return false;

  if (net.type = NET_IP4 && ipv4_rpki_invalid()) then return false;
  if (net.type = NET_IP6 && ipv6_rpki_invalid()) then return false;

  # Demote certain AS nexthops to lower pref
  if (bgp_path.first ~ AS_LOCALPREF50 && bgp_path.len > 1) then bgp_local_pref = 50;
  if (bgp_path.first ~ AS_LOCALPREF30 && bgp_path.len > 1) then bgp_local_pref = 30;
  if (bgp_path.first ~ AS_LOCALPREF10 && bgp_path.len > 1) then bgp_local_pref = 10;

  # Graceful Shutdown (RFC8326)
  if (65535, 0) ~ bgp_community then bgp_local_pref = 0;

  # Scrub BLACKHOLE community
  bgp_community.delete((65535, 666));

  return true;
}
```

The function works by process of elimination -- each prefix that is offered on the
session will either be rejected (by means of returning `false`), or modified (by means
of setting attributes like `bgp_local_pref`) and then accepted (by means of returning
`true`).

***AS-Path Bogon*** filtering is a way to remove prefixes that have an invalid AS
number in their path. The main examples of this are the documentation, private-use and
reserved AS numbers (64496-131071) and the 32-bit private-use range (4200000000-4294967295).
In case you haven't come across this yet, AS number 23456 (`AS_TRANS`) is also magic,
see [RFC4893](https://datatracker.ietf.org/doc/html/rfc4893), since obsoleted by RFC 6793,
for details:
```
function aspath_bogon() {
  return bgp_path ~ [0, 23456, 64496..131071, 4200000000..4294967295];
}
```

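
For readers less familiar with Bird's set syntax, the same check can be sketched in Python. This is only a model of the logic, with the ranges copied from the function above; the names `BOGON_ASN_RANGES`, `is_bogon_asn` and the Python `aspath_bogon` are illustrative:

```python
# Ranges copied from the Bird function: 0, AS_TRANS, and the two bogon blocks.
BOGON_ASN_RANGES = [
    (0, 0),
    (23456, 23456),
    (64496, 131071),
    (4200000000, 4294967295),
]

def is_bogon_asn(asn):
    return any(low <= asn <= high for low, high in BOGON_ASN_RANGES)

def aspath_bogon(as_path):
    """True if any AS number anywhere in the path is a bogon."""
    return any(is_bogon_asn(asn) for asn in as_path)

print(aspath_bogon([25091, 1299, 3301]))   # False
print(aspath_bogon([25091, 64512, 3301]))  # True: private ASN in the middle
```

The point to notice is that a single bogon anywhere in the path is enough to reject the prefix, not only at the first or last position.
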
***Prefix Bogon*** filtering comes next, as certain prefixes are not publicly routable (you
know, such as [RFC1918](https://datatracker.ietf.org/doc/html/rfc1918), but there are
many others). They look different for IPv4 and IPv6:
```
function ipv4_bogon() {
  return net ~ [
    0.0.0.0/0,          # Default
    0.0.0.0/32-,        # RFC 5735 Special Use IPv4 Addresses
    0.0.0.0/0{0,7},     # RFC 1122 Requirements for Internet Hosts -- Communication Layers 3.2.1.3
    10.0.0.0/8+,        # RFC 1918 Address Allocation for Private Internets
    100.64.0.0/10+,     # RFC 6598 IANA-Reserved IPv4 Prefix for Shared Address Space
    127.0.0.0/8+,       # RFC 1122 Requirements for Internet Hosts -- Communication Layers 3.2.1.3
    169.254.0.0/16+,    # RFC 3927 Dynamic Configuration of IPv4 Link-Local Addresses
    172.16.0.0/12+,     # RFC 1918 Address Allocation for Private Internets
    192.0.0.0/24+,      # RFC 6890 Special-Purpose Address Registries
    192.0.2.0/24+,      # RFC 5737 IPv4 Address Blocks Reserved for Documentation
    192.168.0.0/16+,    # RFC 1918 Address Allocation for Private Internets
    198.18.0.0/15+,     # RFC 2544 Benchmarking Methodology for Network Interconnect Devices
    198.51.100.0/24+,   # RFC 5737 IPv4 Address Blocks Reserved for Documentation
    203.0.113.0/24+,    # RFC 5737 IPv4 Address Blocks Reserved for Documentation
    224.0.0.0/4+,       # RFC 1112 Host Extensions for IP Multicasting
    240.0.0.0/4+        # RFC 6890 Special-Purpose Address Registries
  ];
}

function ipv6_bogon() {
  return net ~ [
    ::/0,                 # Default
    ::/96,                # IPv4-compatible IPv6 address - deprecated by RFC4291
    ::/128,               # Unspecified address
    ::1/128,              # Local host loopback address
    ::ffff:0.0.0.0/96+,   # IPv4-mapped addresses
    ::224.0.0.0/100+,     # Compatible address (IPv4 format)
    ::127.0.0.0/104+,     # Compatible address (IPv4 format)
    ::0.0.0.0/104+,       # Compatible address (IPv4 format)
    ::255.0.0.0/104+,     # Compatible address (IPv4 format)
    0000::/8+,            # Pool used for unspecified, loopback and embedded IPv4 addresses
    0100::/8+,            # RFC 6666 - reserved for Discard-Only Address Block
    0200::/7+,            # OSI NSAP-mapped prefix set (RFC4548) - deprecated by RFC4048
    0400::/6+,            # RFC 4291 - Reserved by IETF
    0800::/5+,            # RFC 4291 - Reserved by IETF
    1000::/4+,            # RFC 4291 - Reserved by IETF
    2001:10::/28+,        # RFC 4843 - Deprecated (previously ORCHID)
    2001:20::/28+,        # RFC 7343 - ORCHIDv2
    2001:db8::/32+,       # Reserved by IANA for special purposes and documentation
    2002:e000::/20+,      # Invalid 6to4 packets (IPv4 multicast)
    2002:7f00::/24+,      # Invalid 6to4 packets (IPv4 loopback)
    2002:0000::/24+,      # Invalid 6to4 packets (IPv4 default)
    2002:ff00::/24+,      # Invalid 6to4 packets
    2002:0a00::/24+,      # Invalid 6to4 packets (IPv4 private 10.0.0.0/8 network)
    2002:ac10::/28+,      # Invalid 6to4 packets (IPv4 private 172.16.0.0/12 network)
    2002:c0a8::/32+,      # Invalid 6to4 packets (IPv4 private 192.168.0.0/16 network)
    3ffe::/16+,           # Former 6bone, now decommissioned
    4000::/3+,            # RFC 4291 - Reserved by IETF
    5f00::/8+,            # RFC 5156 - used for the 6bone but was returned
    6000::/3+,            # RFC 4291 - Reserved by IETF
    8000::/3+,            # RFC 4291 - Reserved by IETF
    a000::/3+,            # RFC 4291 - Reserved by IETF
    c000::/3+,            # RFC 4291 - Reserved by IETF
    e000::/4+,            # RFC 4291 - Reserved by IETF
    f000::/5+,            # RFC 4291 - Reserved by IETF
    f800::/6+,            # RFC 4291 - Reserved by IETF
    fc00::/7+,            # Unicast Unique Local Addresses (ULA) - RFC 4193
    fe80::/10+,           # Link-local Unicast
    fec0::/10+,           # Site-local Unicast - deprecated by RFC 3879 (replaced by ULA)
    ff00::/8+             # Multicast
  ];
}
```

That's a long list!! But operators on the _DFZ_ should really never be accepting any
of these, and we should all collectively yell at those who propagate them.

***RPKI Filtering*** is a fantastic routing security feature, described in [RFC6810](https://datatracker.ietf.org/doc/html/rfc6810)
(the RPKI-to-Router protocol) and [RFC6811](https://datatracker.ietf.org/doc/html/rfc6811)
(prefix origin validation), and relatively straightforward to implement. For each
_originating_ AS number, we can check in a table of known `<origin,prefix>` mappings
whether it is the correct ISP to originate the prefix. The lookup can match (which makes
the prefix RPKI valid), it can fail because the prefix is missing from the table (which
makes the prefix RPKI unknown), or it can specifically mismatch (which makes the prefix
RPKI invalid). Operators are encouraged to flag and drop _invalid_ prefixes:

```
function ipv4_rpki_invalid() {
  return roa_check(t_roa4, net, bgp_path.last) = ROA_INVALID;
}

function ipv6_rpki_invalid() {
  return roa_check(t_roa6, net, bgp_path.last) = ROA_INVALID;
}
```

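
The three outcomes (valid, unknown, invalid) follow a simple rule, which this Python sketch tries to capture. It is a simplified model of origin validation and not how Bird implements `roa_check()`. The `Roa` tuple, the function name and the sample table (documentation prefixes and AS numbers) are all assumptions for illustration:

```python
import ipaddress
from typing import NamedTuple

class Roa(NamedTuple):
    prefix: str       # prefix covered by the ROA
    max_length: int   # longest prefix length the origin may announce
    origin_as: int

def roa_state(roas, prefix, origin_as):
    """Return 'valid', 'invalid' or 'unknown' for a (prefix, origin) pair."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True  # at least one ROA covers this prefix
        if roa.origin_as == origin_as and net.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "unknown"

# Hypothetical ROA table using documentation resources.
table = [Roa("198.51.100.0/24", 24, 64500)]
print(roa_state(table, "198.51.100.0/24", 64500))    # valid
print(roa_state(table, "198.51.100.0/24", 64501))    # invalid: wrong origin
print(roa_state(table, "198.51.100.128/25", 64500))  # invalid: longer than max_length
print(roa_state(table, "203.0.113.0/24", 64500))     # unknown: no covering ROA
```
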
***NOTE***: In NLNOG my post sparked a bit of debate on the use of `bgp_path.last_nonaggregated`
versus simply `bgp_path.last`. Job Snijders did some spelunking and offered [this post](https://bird.network.cz/pipermail/bird-users/2019-September/013805.html) and a reference to [RFC6907](https://datatracker.ietf.org/doc/html/rfc6907) for details, and
Tijn confirmed that Coloclue (on which many of my approaches have been modeled) indeed uses
`bgp_path.last`. I've updated my configs, with many thanks for the discussion.

Alright, now that I've determined the as-path and prefix are kosher, and that it
is not known to be hijacked (i.e. is either `ROA_VALID` or `ROA_UNKNOWN`), I'm ready
to set a few attributes, notably:

* ***AS_LOCALPREF*** If the peer I learned this prefix from is in the given list, set
  the BGP local preference to either 50, 30 or 10 respectively (a lower localpref means
  the prefix is less likely to be selected). Some internet providers send lots of
  prefixes, but have poor network connectivity to the place I learned the routes from
  (a few examples of this: 6939 is often oversubscribed in Amsterdam, 39533 was
  for a while connected via a tunnel (!) to Zurich, and several hobby/amateur IXPs are
  on a VXLAN bridged domain rather than a physical switch).

* ***Graceful Shutdown*** described in [RFC8326](https://datatracker.ietf.org/doc/html/rfc8326),
  shows a way to allow operators to pre-announce their downtime by setting a special
  BGP community that informs their peers to deselect that path by setting the local
  preference to the lowest possible value. This one-liner matching on `(65535,0)`
  implements that behavior.

* ***Blackhole Community*** described in [RFC7999](https://datatracker.ietf.org/doc/html/rfc7999),
  is another special BGP community of `(65535,666)` which signals the need to stop sending
  traffic to the prefix at hand. I haven't yet implemented the blackhole routing (this has
  to do with an intricacy of the VPP Linux-CP code that I wrote), so for now I'll just remove
  the community.

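
The three attribute steps above can be modelled in a few lines of Python. This is a sketch of the logic only: the demotion lists below contain placeholder documentation AS numbers (the real `AS_LOCALPREF50/30/10` contents are not shown in this article), and the starting local preference of 100 is an assumption:

```python
from dataclasses import dataclass, field

GRACEFUL_SHUTDOWN = (65535, 0)
BLACKHOLE = (65535, 666)

# Placeholder demotion lists, keyed by the localpref they assign.
AS_LOCALPREF = {50: {64500}, 30: {64501}, 10: set()}

@dataclass
class Route:
    as_path: list
    local_pref: int = 100  # assumed starting value for this sketch
    communities: set = field(default_factory=set)

def apply_import_attributes(route):
    neighbor = route.as_path[0] if route.as_path else None
    # Demote only routes that transit the neighbor (path length > 1).
    for pref in (50, 30, 10):
        if neighbor in AS_LOCALPREF[pref] and len(route.as_path) > 1:
            route.local_pref = pref
    # Graceful shutdown overrides any earlier choice.
    if GRACEFUL_SHUTDOWN in route.communities:
        route.local_pref = 0
    # Blackhole community is scrubbed, not acted upon.
    route.communities.discard(BLACKHOLE)
    return route

r = apply_import_attributes(Route([64500, 3301], communities={BLACKHOLE}))
print(r.local_pref, r.communities)  # 50 set()
```

Because of the `bgp_path.len > 1` condition, the neighbor's own prefixes (path length of one) keep their original preference; only the routes it carries for others are demoted.
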
Alright, based on this one template, I'm now ready to implement all three types of
BGP session: ***Peer***, ***Upstream***, and ***Downstream***.

### Peers

```
function ebgp_import_peer(int remote_as) {
  # Scrub BGP Communities (RFC 7454 Section 11)
  bgp_community.delete([(8298, *)]);
  bgp_large_community.delete([(8298, *, *)]);

  return ebgp_import(remote_as);
}
```

It's dangerous to accept communities for my own AS8298 from peers. This is because
several of them can actively change the behavior of route propagation (these types
of communities are commonly called _action_ communities). So with peering
relationships, I'll just toss them all.

Now, working my way up to the actual BGP peering session, taking for example a
peer that I'm connecting to at LSIX (the routeserver, in fact) in Amsterdam:

```
filter ebgp_lsix_49917_import {
  if ! ebgp_import_peer(49917) then reject;

  # Add IXP Communities
  bgp_community.add((8298,1036));
  bgp_large_community.add((8298,1,1036));

  accept;
}

protocol bgp lsix_49917_ipv4_1 {
  description "LSIX IX Route Servers (LSIX)";
  local as 8298;
  source address 185.1.32.74;
  neighbor 185.1.32.254 as 49917;
  default bgp_med 0;
  default bgp_local_pref 200;
  ipv4 {
    import keep filtered;
    import filter ebgp_lsix_49917_import;
    export filter ebgp_lsix_49917_export;
    receive limit 100000 action restart;
    next hop self on;
  };
};
```

Parsing this through: the ipv4 import filter is called `ebgp_lsix_49917_import` and its
job is to run the whole kit and caboodle of filtering I described above, and then if the
`ebgp_import_peer()` function returns false, to simply drop the prefix. But if it is
accepted, I'll tag it with a few communities. As I'll show later, any other peer will receive
these communities if I decide to propagate the prefix to them. This is specifically
useful for downstreams (customers), who can decide to accept/deny the prefix based on a
well-known set of communities we tag.

***IXP Community***: If the prefix is learned at an IXP, I'll add a large community
`(8298,1,*)` and a backwards compatible normal community `(8298,10XX)`.

One last thing I'll note, and this is a matter of taste: peering prefixes
picked up at internet exchanges (like LSIX) are typically much cheaper per megabit than
the transit routes, so I will set a default `bgp_local_pref` of 200 (higher localpref
is more likely to be selected as the active route).

### Upstream

An interesting observation: from Peers and from Upstreams I typically am happy to take
all the prefixes I can get (but see the epilog below for an important note on this). For a
Peer, this is mostly "their own prefixes" and for a Transit, this is mostly "all prefixes",
but there are things in the middle, say partial transit of "all prefixes learned at IXP A, B
and C". Really, all inbound sessions are very similar:

```
function ebgp_import_upstream(int remote_as) {
  # Scrub BGP Communities (RFC 7454 Section 11)
  bgp_community.delete([(8298, *)]);
  bgp_large_community.delete([(8298, *, *)]);

  return ebgp_import(remote_as);
}
```

... is in fact identical to the `ebgp_import_peer()` function above, so I'll not discuss
it further. But for the sessions to upstream (==transit) providers, it can make sense
to use slightly different BGP community tags and a lower localpref:

```
filter ebgp_ipmax_25091_import {
  if ! ebgp_import_upstream(25091) then reject;

  # Add BGP Large Communities
  bgp_large_community.add((8298,2,25091));

  # Add BGP Communities
  bgp_community.add((8298,2000));

  accept;
}

protocol bgp ipmax_25091_ipv4_1 {
  description "IP-Max Transit";
  local as 8298;
  source address 46.20.242.210;
  neighbor 46.20.242.209 as 25091;
  default bgp_med 0;
  default bgp_local_pref 50;
  ipv4 {
    import keep filtered;
    import filter ebgp_ipmax_25091_import;
    export filter ebgp_ipmax_25091_export;
    next hop self on;
  };
};
```

Again, a very similar pattern; the only material difference is that the inbound prefixes
are tagged with an ***Upstream Community*** which is of the form `(8298,2,*)` and backwards
compatible `(8298,20XX)`. Downstream customers can use this, if they wish, to select or
reject routes (maybe they don't like routes coming from AS25091, although they should know
better because IP-Max rocks!).

The other slight change here is that the `bgp_local_pref` is set to 50, which implies that
the route will be used only if there are no alternatives in the _RIB_ with a higher localpref,
or with the same localpref but a shorter as-path, or many other scenarios which I won't get
into here, because BGP selection criteria 101 is a whole blogpost of its own.

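
The first two tie-breakers mentioned above can be expressed as a sort key. This Python sketch deliberately stops there and ignores the rest of BGP best-path selection (origin, MED, eBGP over iBGP, router ID and so on). The candidate routes are made up, with placeholder AS numbers on the IXP path:

```python
def best_route(candidates):
    """Highest local_pref wins; among equals, the shortest AS path wins."""
    return max(candidates, key=lambda r: (r["local_pref"], -len(r["as_path"])))

candidates = [
    {"via": "transit", "local_pref": 50,  "as_path": [25091, 1299, 3301]},
    {"via": "ixp",     "local_pref": 200, "as_path": [64500, 64501, 64502, 3301]},
]
print(best_route(candidates)["via"])  # ixp
```

Note that the IXP route wins here even though its AS path is longer: local preference is compared before path length.
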
### Downstream

That brings us to the third type of BGP sessions -- commonly referred to as customers except
that not everybody pays :) so I just call them _downstreams_:

```
function ebgp_import_downstream(int remote_as) {
  # We do not scrub BGP Communities (RFC 7454 Section 11) for customers
  return ebgp_import(remote_as);
}
```

Here, I have a special relationship with the `remote_as`, and I do not scrub the communities,
letting the downstream operator set whichever they like. As I'll demonstrate in the next
section, they can use these communities to drive certain types of behavior.

Here's how I use this `ebgp_import_downstream()` function in the full filter for a downstream:

```
# bgpq4 -Ab4 -R 24 -m 24 -l 'define AS201723_IPV4' AS201723
define AS201723_IPV4 = [
  185.54.95.0/24
];

# bgpq4 -Ab6 -R 48 -m 48 -l 'define AS201723_IPV6' AS201723
define AS201723_IPV6 = [
  2001:678:3d4::/48,
  2001:67c:6bc::/48
];

filter ebgp_raymon_201723_import {
  if (net.type = NET_IP4 && ! (net ~ AS201723_IPV4)) then reject;
  if (net.type = NET_IP6 && ! (net ~ AS201723_IPV6)) then reject;
  if ! ebgp_import_downstream(201723) then reject;

  # Add BGP Large Communities
  bgp_large_community.add((8298,3,201723));

  # Add BGP Communities
  bgp_community.add((8298,3500));

  accept;
}

protocol bgp raymon_201723_ipv4_1 {
  local as 8298;
  source address 185.54.95.250;
  neighbor 185.54.95.251 as 201723;
  default bgp_med 0;
  default bgp_local_pref 400;
  ipv4 {
    import keep filtered;
    import filter ebgp_raymon_201723_import;
    export filter ebgp_raymon_201723_export;
    receive limit 94 action restart;
    next hop self on;
  };
};
```

OK, so this is a mouthful, but the one thing that I really need to do with customers is
ensure that I only accept prefixes from them that they're supposed to send me. I do this
with a `prefix-list` for IPv4 and IPv6, and in the importer, I simply reject any prefixes
that are not in the list. From then on, it looks very much like a peer, with identical
filtering and tagging, except now I'm using yet another ***Customer Community*** which
has the form `(8298,3,*)` and a vanilla `(8298,3500)` community. Anybody who wishes to
can act on the presence of these communities to know that it's a downstream of IPng Networks
AS8298.

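
With all three tagging communities introduced, a downstream could recover where a prefix was learned from the large communities alone. The following Python sketch shows one way to do that; the function name and the label strings are invented for this example:

```python
LOCAL_AS = 8298

# Second field of the large community, as described in the text.
SOURCE_LABELS = {1: "ixp", 2: "upstream", 3: "downstream"}

def classify(large_communities):
    """Return sorted (label, value) pairs for the IPng tagging communities."""
    return sorted(
        (SOURCE_LABELS[function], value)
        for (asn, function, value) in large_communities
        if asn == LOCAL_AS and function in SOURCE_LABELS
    )

print(classify({(8298, 3, 201723)}))              # [('downstream', 201723)]
print(classify({(8298, 1, 1036), (64500, 1, 1)})) # [('ixp', 1036)]
```
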
***A note on Peers and Downstreams***:

Some ISPs will not peer with their customers (as in: once you become a transit customer
they will terminate all BGP sessions at public internet exchanges), and I find that silly.
However, for me the situation becomes a little bit more complex if I were to have AS201723
both as a Downstream (as shown here) as well as a Peer (which in fact, I do, at multiple Amsterdam
based internet exchanges). Note how the `bgp_local_pref` is 400 on this session, and it
will always be lower on other types of sessions. The implication is that the path in the _RIB_
which carries `(8298,3,201723)` will be selected. The ones I learn from LSIX will
carry `(8298,1,*)` and the ones I learn from A2B (a transit provider) will carry `(8298,2,51088)`,
and neither of those will be selected because they have a lower localpref. As I'll demonstrate below,
I can make smart use of these communities when announcing prefixes to my own peers and upstreams,
... read on :)

## 3. Outbound: Announcing Routes

Alright, the _RIB_ is now filled with lots of prefixes that have the right localpref and
communities, for example from having been learned at an IXP, from an Upstream, or from a
Downstream. Now let's consider the following generic exporter:

```
function ebgp_export(int remote_as) {
  # Remove private ASNs
  bgp_path.delete([64512..65535, 4200000000..4294967295]);

  # Well known BGP Large Communities
  if (8298, 0, remote_as) ~ bgp_large_community then return false;
  if (8298, 0, 0) ~ bgp_large_community then return false;

  # Well known BGP Communities
  if (0, 8298) ~ bgp_community then return false;
  if (remote_as < 65536 && (0, remote_as) ~ bgp_community) then return false;

  # AS path prepending
  if ((8298, 103, remote_as) ~ bgp_large_community ||
      (8298, 103, 0) ~ bgp_large_community) then {
    bgp_path.prepend( bgp_path.first );
    bgp_path.prepend( bgp_path.first );
    bgp_path.prepend( bgp_path.first );
  } else if ((8298, 102, remote_as) ~ bgp_large_community ||
             (8298, 102, 0) ~ bgp_large_community) then {
    bgp_path.prepend( bgp_path.first );
    bgp_path.prepend( bgp_path.first );
  } else if ((8298, 101, remote_as) ~ bgp_large_community ||
             (8298, 101, 0) ~ bgp_large_community) then {
    bgp_path.prepend( bgp_path.first );
  }

  return true;
}
```

Oh, wow! There's some really cool stuff to unpack here. As a belt-and-braces type safety,
I will remove any private AS numbers from the as-path - this avoids my own announcements
from tripping any as-path bogon filtering. But then, there are a few well-known communities
that help determine if the announcement is made or not, and there are three-and-a-half
ways of doing this:
1. `(8298,0,remote_as)`
1. `(8298,0,0)`
1. `(0,8298)`
1. `(0,remote_as)` but only if the remote_as is 16 bits.

All four of these methods will tell the router to refuse announcing the prefix on this
session. Note that downstreams are allowed to set `(8298,*,*)` and `(8298,*)` communities
(and they're the only ones who are allowed to do so). So here is where some of the cool
magic starts to happen.

Then, to drive prepending of the prefix on this session, I'll again match certain
communities: `(8298, 103, *)` will prepend the customer's AS number three times, using
`102` will prepend twice, and `101` will prepend once. If the third field is `0`, then
any session with this filter will prepend. If the third field is an AS number, then
only sessions to this AS number will be prepended.

Using these types of communities allows downstreams (customers) incredibly fine-grained
propagation actions, at the per-IPng-session level. Not many ISPs offer this functionality!

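
The export decision described above can be modelled in Python to see how the action communities interact. This is a sketch of the logic, not of Bird itself: it returns the resulting AS path or `None` when the announcement is suppressed, and it leaves out the router adding its own AS number on an eBGP session. The function and variable names are assumptions:

```python
LOCAL_AS = 8298

def is_private(asn):
    return 64512 <= asn <= 65535 or 4200000000 <= asn <= 4294967295

def ebgp_export(as_path, communities, large_communities, remote_as):
    """Return the AS path to announce to remote_as, or None to suppress."""
    path = [asn for asn in as_path if not is_private(asn)]

    # "Do not announce" action communities.
    if (LOCAL_AS, 0, remote_as) in large_communities or (LOCAL_AS, 0, 0) in large_communities:
        return None
    if (0, LOCAL_AS) in communities:
        return None
    if remote_as < 65536 and (0, remote_as) in communities:
        return None

    # Prepend actions: 103 -> three times, 102 -> twice, 101 -> once.
    for function, count in ((103, 3), (102, 2), (101, 1)):
        if ((LOCAL_AS, function, remote_as) in large_communities
                or (LOCAL_AS, function, 0) in large_communities):
            if path:
                path = [path[0]] * count + path
            break
    return path

print(ebgp_export([201723], set(), {(8298, 102, 25091)}, 25091))  # [201723, 201723, 201723]
print(ebgp_export([201723], set(), {(8298, 0, 25091)}, 25091))    # None
```

A downstream that tags its prefix with `(8298, 102, 25091)` is therefore prepended twice towards AS25091 only, while the same prefix goes out unmodified on every other session.
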
### Peers

Exporting to peers, I really need to make sure that I don't send too many prefixes. Most
of us have at some point gone through the embarrassing motions of being told by a fellow
operator "hey you're sending a full table". It is paramount to good peering hygiene
that I do not leak. So I'll define a healthy set of _defense in depth_ principles here:

```
# bgpq4 -A4b -R 24 -m 24 -l 'define AS8298_IPV4' AS8298
define AS8298_IPV4 = [ 92.119.38.0/24, 194.1.163.0/24, 194.126.235.0/24 ];

# bgpq4 -A6bR 48 -m 48 -l 'define AS8298_IPV6' AS8298
define AS8298_IPV6 = [ 2001:678:d78::/48, 2a0b:dd80::/29{29,48} ];

# bgpq4 -A4b -R 24 -m 24 -l 'define AS_IPNG_IPV4' AS-IPNG
define AS_IPNG_IPV4 = [ ... ## Removed for brevity ];

# bgpq4 -A6bR 48 -m 48 -l 'define AS_IPNG_IPV6' AS-IPNG
define AS_IPNG_IPV6 = [ .. ## Removed for brevity ];

# bgpq4 -t4b -l 'define AS_IPNG' AS-IPNG
define AS_IPNG = [112, 8298, 50869, 57777, 60557, 201723, 212323, 212855];

function aspath_first_valid() {
  return (bgp_path.len = 0 || bgp_path.first ~ AS_IPNG);
}

# A list of well-known tier1 transit providers
function aspath_contains_tier1() {
  return bgp_path ~ [
    174,     # Cogent
    209,     # Qwest (HE carries this on IXPs IPv6 (Jul 12 2018))
    701,     # UUNET
    702,     # UUNET
    1239,    # Sprint
    1299,    # Telia
    2914,    # NTT Communications
    3257,    # GTT Backbone
    3320,    # Deutsche Telekom AG (DTAG)
    3356,    # Level3
    3549,    # Level3
    3561,    # Savvis / CenturyLink
    4134,    # Chinanet
    5511,    # Orange opentransit
    6453,    # Tata Communications
    6762,    # Seabone / Telecom Italia
    7018 ];  # AT&T
}

# The list of our own uplink (transit) providers
# Note: This list is autogenerated by our automation.
function aspath_contains_upstream() {
  return bgp_path ~ [ 8283,25091,34549,51088,58299 ];
}

function ipv4_prefix_valid() {
  # Our (locally sourced) prefixes
  if (net ~ AS8298_IPV4) then return true;

  # Customer prefixes in AS-IPNG must be tagged with customer community
  if (net ~ AS_IPNG_IPV4 &&
    (bgp_large_community ~ [(8298, 3, *)] || bgp_community ~ [(8298, 3500)])
  ) then return true;

  return false;
}
function ipv6_prefix_valid() {
  # Our (locally sourced) prefixes
  if (net ~ AS8298_IPV6) then return true;

  # Customer prefixes in AS-IPNG must be tagged with customer community
  if (net ~ AS_IPNG_IPV6 &&
    (bgp_large_community ~ [(8298, 3, *)] || bgp_community ~ [(8298, 3500)])
  ) then return true;

  return false;
}
function prefix_valid() {
  # as-path based filtering
  if !aspath_first_valid() then return false;
  if aspath_contains_tier1() then return false;
  if aspath_contains_upstream() then return false;

  # prefix (and BGP community) based filtering
  if (net.type = NET_IP4 && !ipv4_prefix_valid()) then return false;
  if (net.type = NET_IP6 && !ipv6_prefix_valid()) then return false;
  return true;
}

function ebgp_export_peer(int remote_as) {
  if !prefix_valid() then return false;
  return ebgp_export(remote_as);
}
```

Wow, alrighty then!! All I'm doing here is checking if the call to `prefix_valid()`
returns true. That function isn't very complex. It takes a look at three as-path based
filters and then a prefix-list based filter. Let's go over them in turn:

***aspath_first_valid()*** takes a look at the first hop in the as-path. I need to
make sure that I've received this prefix from an actual downstream, and those are
collected in a RIPE `as-set` called `AS-IPNG`. So if the first BGP hop in the path is
not one of these, I'll refuse to announce the prefix. An empty as-path is also fine,
as that is what my own locally originated prefixes look like.

***aspath_contains_tier1()*** is a belt-and-braces style check. How on earth would
I provide transit for any prefix for which there's already a global _Tier1_ provider
in the path? I mean, in no universe would AS174 or AS1299 need me to reach any of
their customers, or indeed, any place in the world. So this filter helps me never
announce the prefix, if it has one of these ISPs in the path.

***aspath_contains_upstream()*** similarly, if I am receiving a full table from an
upstream provider, I should not be passing this prefix along - I would for similar
reasons never be a transit provider for A2B or IP-Max or Meerfarbig. My buddy Erik
kindly pointed out a bug in my configuration around this, so hat-tip
to him for the intelligence.

***ipv[46]_prefix_valid()*** is the main thrust of prefix-based filtering. At this
point we've already established that the as-path is clean, but it could be that
the downstream is sending prefixes they should not (possibly leaking a full table)
so let's take a look at a good way to avoid this.
* First, we look at locally sourced routes from `AS8298`, that is the ones that I
  myself originate at IPng Networks. These are always OK. The list is carefully
  curated.
* Alternatively, the prefix needs to be from the as-set `AS-IPNG` (which contains
  both my prefixes and all `route` and `route6` objects belonging to any AS number
  that I consider a downstream),
* Finally, if the prefix is from `AS-IPNG`, I'll still add one additional check to
  ensure that there is a so-called _customer community_ attached. Remember that I
  discussed this specifically up in the ***Inbound - Downstream*** section.

So before I announce anything on such a session, all _four_ of as-path,
inbound prefix-list, outbound prefix-list and bgp-community are checked. This
makes it incredibly unlikely that AS8298 ever leaks prefixes -- knock on wood!

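
To see how these layers stack, here is a Python model of the `prefix_valid()` logic. The AS number sets are taken from the configuration above (with the tier1 list abbreviated), but the two prefix lists use documentation prefixes as placeholders, since the real `AS_IPNG_IPV4` contents are elided in this article. Everything else is an illustrative assumption:

```python
import ipaddress

AS_IPNG = {112, 8298, 50869, 57777, 60557, 201723, 212323, 212855}
TIER1 = {174, 1299, 2914, 3356}          # abbreviated from the list above
UPSTREAMS = {8283, 25091, 34549, 51088, 58299}

# Placeholder prefix lists (documentation ranges), exact matches only.
OWN_PREFIXES = {ipaddress.ip_network("198.51.100.0/24")}
CUSTOMER_PREFIXES = {ipaddress.ip_network("203.0.113.0/24")}

def prefix_valid(prefix, as_path, communities, large_communities):
    # as-path based filtering
    if as_path and as_path[0] not in AS_IPNG:
        return False
    if any(asn in TIER1 or asn in UPSTREAMS for asn in as_path):
        return False

    # prefix (and BGP community) based filtering
    net = ipaddress.ip_network(prefix)
    if net in OWN_PREFIXES:
        return True
    tagged = (8298, 3500) in communities or any(
        asn == 8298 and function == 3 for (asn, function, _value) in large_communities
    )
    return net in CUSTOMER_PREFIXES and tagged

print(prefix_valid("198.51.100.0/24", [], set(), set()))                          # True
print(prefix_valid("203.0.113.0/24", [201723], set(), {(8298, 3, 201723)}))       # True
print(prefix_valid("203.0.113.0/24", [201723], set(), set()))                     # False: no customer tag
print(prefix_valid("203.0.113.0/24", [201723, 1299], set(), {(8298, 3, 201723)})) # False: tier1 in path
```

Each check on its own would catch most leaks; the value is in requiring all of them to pass at once.
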
### Upstream

Interestingly and if you think about it, unsurprisingly, an upstream configuration
is identical to a peer:

```
function ebgp_export_upstream(int remote_as) {
  if !prefix_valid() then return false;
  return ebgp_export(remote_as);
}
```

Alright, nothing to see here, moving on ...

### Downstream

Now the difference between a Peer and an Upstream on the one hand, and a Downstream
on the other, is that the former two will only see a very limited set of prefixes,
heavily guarded by all of that filtering I described. But a downstream typically
has the luxury of getting to learn every prefix I've learned:

```
function ipv4_acceptable_size() {
  if net.len < 8 then return false;
  if net.len > 24 then return false;
  return true;
}
function ipv6_acceptable_size() {
  if net.len < 12 then return false;
  if net.len > 48 then return false;
  return true;
}
function ebgp_export_downstream(int remote_as) {
  if (source != RTS_BGP && source != RTS_STATIC) then return false;
  if (net.type = NET_IP4 && ! ipv4_acceptable_size()) then return false;
  if (net.type = NET_IP6 && ! ipv6_acceptable_size()) then return false;

  return ebgp_export(remote_as);
}
```

So here I'll assert that the prefix has to be either from the `RTS_BGP` source, or
from the `RTS_STATIC` source. This latter source is what Bird uses for locally
generated routes (i.e. the ones in AS8298 itself). Locally generated routes are not
known from BGP, but known instead because they are blackholed / null-routed on the
router itself. And from these routes, I further deselect those prefixes that are
too short or too long, with limits that differ slightly by address family (IPv4
is anywhere between /8 and /24, and IPv6 is anywhere between /12 and /48).

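
The size limits are simple enough to check by hand, but a short Python sketch makes the boundaries explicit. The names `SIZE_LIMITS` and `acceptable_size` are made up for this example:

```python
import ipaddress

# (shortest, longest) acceptable prefix length per address family.
SIZE_LIMITS = {4: (8, 24), 6: (12, 48)}

def acceptable_size(prefix):
    net = ipaddress.ip_network(prefix)
    shortest, longest = SIZE_LIMITS[net.version]
    return shortest <= net.prefixlen <= longest

print(acceptable_size("192.0.2.0/24"))   # True
print(acceptable_size("192.0.2.0/25"))   # False: longer than /24
print(acceptable_size("2001:db8::/64"))  # False: longer than /48
```
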
Now, I will note that I've seen many operators who inject OSPF or connected or
static routes into BGP, and all of those folks will have to maintain elaborate egress
"bogon" route filters, for example for those IXP prefixes that they picked up due to
them being directly connected. If those operators would simply not propagate directly
connected routes, their life would be so much simpler .. but I digress and it's time
for me to wrap up.

## Epilog

I hope this little dissertation proves useful for other Bird enthusiasts out there.
I myself had to fiddle a bit over the years with the idiosyncrasies (and bugs) of
Bird and Bird2. I wanted to make a few comments:

1. Thanks to the crew at [Coloclue](https://coloclue.net/) for having a really phenomenal
   routing setup, with a lot of thoughtful documentation, action communities, and strict
   ingress and egress filtering. It's also fully automated and I've derived, although
   completely rewritten, my own automation based off of [Kees](https://github.com/coloclue/kees).
1. I understand that the main distinction between inbound Peer and Upstream is that for Peers
   many folks will want to do strict filtering. I've considered this for a long time and
   ultimately decided against it, because a combination of max prefix, tier1 as-path filtering
   and RPKI filtering would take care of the most egregious mistakes and otherwise, I'm actually
   happy to get more prefixes via IXPs rather than fewer.