---
date: "2023-03-11T11:56:54Z"
title: 'Case Study: Centec MPLS Core'
aliases:
- /s/articles/2023/03/11/mpls-core.html
---

After receiving an e-mail from [[Starry Networks](https://starry-networks.com)], I had a chat with their
founder and learned that the combination of switch silicon and software may be a good match for IPng Networks.

I got pretty enthusiastic when this new vendor claimed VxLAN, GENEVE, MPLS and GRE at 56 ports and
line rate, on a really affordable budget ($4'200,- for the 56 port; and $1'650,- for the 26 port
switch). This reseller is using a lesser-known silicon vendor called
[[Centec](https://www.centec.com/silicon)], who have a lineup of Ethernet chipsets. In this device,
the CTC8096 (GoldenGate) is used for cost-effective, high-density 10GbE/40GbE applications, paired
with 4x100GbE uplink capability. This is Centec's fourth generation chip, and the CTC8096 inherits
its predecessors' feature set, from L2/L3 switching to advanced datacenter and metro Ethernet
features. The switch chip provides up to 96x10GbE, 24x40GbE, or 80x10GbE + 4x100GbE ports, and
supports a variety of features including L2, L3, MPLS, VxLAN, MPLS SR, and OAM/APS. Highlights
include telemetry, programmability, security, traffic management, and network time synchronization.

{{< image width="450px" float="left" src="/assets/oem-switch/S5624X-front.png" alt="S5624X Front" >}}

{{< image width="450px" float="right" src="/assets/oem-switch/S5648X-front.png" alt="S5648X Front" >}}

<br /><br />

After discussing basic L2, L3 and overlay functionality in my [[first post]({{< ref "2022-12-05-oem-switch-1" >}})], and exploring the functionality and performance of MPLS and VPLS in my
[[second post]({{< ref "2022-12-09-oem-switch-2" >}})], I convinced myself and committed to a bunch
of these switches for IPng Networks. I'm now ready to roll them out and create a BGP-free core
network for IPng Networks. If this kind of thing tickles your fancy, by all means read on :)

## Overview

You may be wondering what folks mean when they talk about a [[BGP Free
Core](https://bgphelp.com/2017/02/12/bgp-free-core/)], and you may also ask yourself why I would
decide to retrofit this into our network. For the most part, operating this way leaves very little
room for outages to occur in the L2 (Ethernet and MPLS) transport network, because it's relatively
simple in design and implementation. Some advantages worth mentioning:

* Transport devices do not need to be capable of supporting a large number of IPv4/IPv6 routes, either
  in the RIB or FIB, allowing them to be much cheaper.
* As there is no eBGP, transport devices will not be impacted by BGP-related issues, such as high CPU
  utilization during massive BGP re-convergence.
* Also, without eBGP, some of the attack vectors in ISPs (loopback DDoS or ARP storms on public
  internet exchanges, to take two common examples) can be eliminated. If a new BGP security
  vulnerability were to be discovered, transport devices aren't impacted.
* Operator errors (the #1 reason for outages in our industry) associated with BGP configuration and
  the use of large RIBs (e.g. leaking into IGP, flapping transit sessions, etc) can be eradicated.
* New transport services such as MPLS point-to-point virtual leased lines, SR-MPLS, VPLS clouds, and
  eVPN can all be introduced without modifying the routing core.

If deployed correctly, this type of transport-only network can be kept entirely isolated from the Internet,
making DDoS and hacking attacks against transport elements impossible, and it also opens up possibilities
for relatively safe sharing of infrastructure resources between ISPs (think of things like dark fibers
between locations, rackspace, power, cross connects).

For smaller clubs (like IPng Networks), being able to share a 100G wave with others significantly reduces
the price per Megabit! So if you're in Zurich, Switzerland, or Europe and find this an interesting avenue to
expand your reach in a co-op style environment, [[reach out](/s/contact)] to us, any time!

### Hybrid Design

I've decided to make this the direction of IPng's core network -- I know that the specs of the
Centec switches I've bought allow for a modest, but not huge, number of routes in the hardware
forwarding tables. I loadtested them in [[a previous article]({{< ref "2022-12-05-oem-switch-1" >}})] at line rate (well, at least 8x10G at 64b packets and around 110Mpps), where they were forwarding
both IPv4 and MPLS traffic effortlessly, and at 45 Watts I might add! However, they clearly cannot
operate in the DFZ, for two main reasons:

1. The FIB is limited to 12K IPv4 and 2K IPv6 entries, so they can't hold a full table
1. The CPU is a bit wimpy, so it won't be happy doing large BGP reconvergence operations

IPng Networks has three (3) /24 IPv4 networks, which means we're not swimming in IPv4 addresses.
But, I'm possibly the world's okayest systems engineer, and I happen to know that most things don't
really need an IPv4 address anymore. There's all sorts of fancy loadbalancers like
[[NGINX](https://nginx.org)] and [[HAProxy](https://www.haproxy.org/)] which can take traffic (UDP,
TCP or higher level constructs like HTTP traffic), provide SSL offloading, and then talk to one or
more loadbalanced backends to retrieve the actual content.
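
To make this less abstract, here's a minimal sketch of such a frontend in NGINX. The hostname,
certificate paths and backend addresses are all made up for illustration, not copied from IPng's
actual configuration:

```
# Terminate TLS on a public dual-stack frontend, and proxy to backends
# which live only in private IPv4 (198.19.0.0/16) or IPv6 space.
upstream video_backends {
    server 198.19.4.10:8080;              # private IPv4 backend
    server [2001:678:d78:503::10]:8080;   # IPv6-only backend
}

server {
    listen 443 ssl;
    listen [::]:443 ssl;
    server_name video.example.net;

    ssl_certificate     /etc/nginx/certs/video.example.net.pem;
    ssl_certificate_key /etc/nginx/certs/video.example.net.key;

    location / {
        proxy_pass http://video_backends;
    }
}
```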

#### IPv4 versus IPv6

{{< image float="right" src="/assets/mpls-core/go-for-it-90s.gif" alt="The 90s" >}}

Most modern operating systems can operate in IPv6-only mode; certainly the Debian and Ubuntu and
Apple machines that are common in my network are happily dual-stack, and probably mono-stack as well.
And I've been running IPv6 since, eh, the 90s (my first allocation was on the 6bone in
1996, and I ran [[SixXS](https://sixxs.net/)] for longer than I can remember!).

You might be inclined to argue that I should be able to advance the core of my serverpark to
IPv6-only ... but unfortunately that's not only up to me: it has been mentioned to me a number of times
that my [[Videos](https://video.ipng.ch/)] are not reachable -- which of course they are, but only if
your computer speaks IPv6.

In addition to my stuff needing _legacy_ reachability, some external websites, including pretty big
ones (I'm looking at you, [[GitHub](https://github.com/)] and [[Cisco
T-Rex](https://trex-tgn.cisco.com/)]) are still IPv4 only, and some network gear still hasn't
really caught on to the IPv6 control- and management plane scene (for example, SNMP traps or
scraping, BFD, LDP, and a few others, even in a modern switch like the Centecs that I'm about to
deploy).

#### AS8298 BGP-Free Core

I have a few options -- I could be stubborn and do NAT64 for an IPv6-only internal network. But if I'm going
to be doing NAT anyway, I decide to make a compromise and deploy my new network using private IPv4
space alongside public IPv6 space, and to deploy a few strategically placed border gateways that can
do the translation and frontending for me.

There are quite a few private/reserved IPv4 ranges on the internet, which the current LIRs on the
RIPE [[Waiting List](https://www.ripe.net/manage-ips-and-asns/ipv4/ipv4-waiting-list)] are salivating
all over, gross. However, there are a few beyond the canonical [[RFC1918](https://www.rfc-editor.org/rfc/rfc1918)]
ranges that are quite frequently used in enterprise networking, for example by large Cloud providers, to build
what is called a _Virtual Private Cloud_ or
[[VPC](https://www.cloudflare.com/learning/cloud/what-is-a-virtual-private-cloud/)]. And if they can
do it, so can I!

#### Numberplan

Let me draw your attention to [[RFC5735](https://www.rfc-editor.org/rfc/rfc5735)], which describes
special use IPv4 addresses. One of these is **198.18.0.0/15**: this block has been allocated for use
in benchmark tests of network interconnect devices. What I found interesting is that
[[RFC2544](https://www.rfc-editor.org/rfc/rfc2544)] explains that this range was assigned to minimize the
chance of conflict in case a testing device were to be accidentally connected to part of the Internet.
Packets with source addresses from this range are not meant to be forwarded across the Internet.
But, they can _totally_ be used to build a pan-European private network that is not directly connected
to the internet. I grab my free private Class-B, like so:

* For IPv4, I take the second /16 from that to use as my IPv4 block: **198.19.0.0/16**.
* For IPv6, I carve out a small part of IPng's own IPv6 PI block: **2001:678:d78:500::/56**

First order of business is to create a simple numberplan that's not totally broken:

Purpose | IPv4 Prefix | IPv6 Prefix
--------------|-------------------|------------------
Loopbacks | 198.19.0.0/24 (size /32) | 2001:678:d78:500::/64 (size /128)
P2P Networks | 198.19.2.0/23 (size /31) | 2001:678:d78:501::/64 (size /112)
Site Local Networks | 198.19.4.0/22 (size /27) | 2001:678:d78:502::/56 (size /64)

This simple start leaves most of the IPv4 space allocatable for the future, while giving me lots of
IPv4 and IPv6 addresses to retrofit this network in all sites where IPng is present, which is
[[quite a few](https://as8298.peeringdb.com/)]. All of **198.19.1.0/24** (reserved either for P2P
networks or for loopbacks, whichever I'll need first), **198.19.8.0/21**, **198.19.16.0/20**,
**198.19.32.0/19**, **198.19.64.0/18** and **198.19.128.0/17** will be ready for me to use in the
future, and they are all nicely tucked away under one **19.198.in-addr.arpa** reverse zone, which
I stub out on IPng's resolvers. Winner!
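
As a sketch of what that stub looks like: if the resolver were Unbound, something like the snippet
below would do the trick. The internal nameserver address is made up, and depending on your
resolver's defaults for special-use reverse zones (RFC 6303), the `nodefault` and `domain-insecure`
knobs may or may not be needed:

```
server:
  # Don't serve the benchmarking reverse zone empty by default ...
  local-zone: "19.198.in-addr.arpa." nodefault
  # ... and don't expect DNSSEC signatures from it.
  domain-insecure: "19.198.in-addr.arpa."

stub-zone:
  name: "19.198.in-addr.arpa."
  stub-addr: 198.19.0.2    # hypothetical internal authoritative server
```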

### Inserting MPLS Under AS8298

I am currently running [[VPP](https://fd.io)] based on my own deployment [[article]({{< ref "2021-09-21-vpp-7" >}})], and this network has a bunch of routers connected back-to-back with one another using
either crossconnects (if there are multiple routers in the same location), a CWDM/DWDM wave over
dark fiber (if they are in adjacent buildings and I have found a provider willing to share their
dark fiber with me), or a Carrier Ethernet virtual leased line (L2VPN, provided by folks like
[[Init7](https://init7.net)] in Switzerland, or [[IP-Max](https://ip-max.net)] throughout Europe in
our [[backbone]({{< ref "2021-02-27-network" >}})]).

{{< image width="350px" float="right" src="/assets/mpls-core/before.svg" alt="Before" >}}

Most of these links are actually "just" point-to-point ethernet links, which I can use untagged (e.g.
`xe1-0`), or add any _dot1q_ sub-interfaces to (e.g. `xe1-0.10`). In some cases, the ISP will deliver the
circuit to me with an additional _outer_ tag, in which case I can still use that interface (e.g.
`xe1-0.400`) and create _qinq_ sub-interfaces (e.g. `xe1-0.400.10`).
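
On VPP, creating those sub-interfaces looks roughly like this -- the interface name and sub-interface
IDs here are examples, not the production config:

```
vpp# create sub-interfaces TenGigabitEthernet3/0/0 10 dot1q 10 exact-match
vpp# create sub-interfaces TenGigabitEthernet3/0/0 400 dot1q 400 exact-match
vpp# create sub-interfaces TenGigabitEthernet3/0/0 410 dot1q 400 inner-dot1q 10 exact-match
```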

In January 2023, my Zurich metro deployment looks a bit like the top drawing to the right. Of
course, these routers connect to all sorts of other things, like internet exchange points
([[SwissIX](https://swissix.net/)], [[CHIX](https://ch-ix.ch/)],
[[CommunityIX](https://communityrack.org/)], and [[FreeIX](https://free-ix.net/)]), IP transit
upstreams (in Zurich mainly [[IP-Max](https://as25091.peeringdb.com/)] and
[[Openfactory](https://as58299.peeringdb.com/)]), and customer downstreams, colocation space,
private network interconnects with others, and so on.

I want to draw your attention to the four _main_ links between these routers:

1. Orange (bottom): chbtl0 and chbtl1 are at our offices in Brüttisellen; they're in two
   separate racks, and have 24 fibers between them. Here, the two routers connect back-to-back with a
   25G SFP28 optic at 1310nm.
1. Blue (top): Between chrma0 (at the NTT datacenter in Rümlang) and chgtg0 (at the Interxion datacenter
   in Glattbrugg), IPng rents a CWDM wave from Openfactory, so the two routers here also connect back to
   back, albeit over 4.2km of dark fiber between the two datacenters, with a 25G SFP28 optic at 1610nm.
1. Red (left): Between chbtl0 and chrma0, Init7 provides a 10G L2VPN over MPLS ethernet circuit,
   starting in our offices with a BiDi 10G optic, and delivered at NTT on a BiDi 10G optic as well (we
   did this so that the cross connect between our racks might in the future be able to use the other
   fiber). Init7 delivers both ports tagged VLAN 400.
1. Green (right): Between chbtl1 and chgtg0, Openfactory provides a 10G VLAN ethernet circuit,
   starting in our offices with a BiDi 10G optic to the local telco, and then transported over dark
   fiber by UPC to Interxion. Openfactory delivers both sides tagged VLAN 200-209 to us.

This is a super fun puzzle! I am running a live network, with customers, and I want to retrofit this
MPLS network _underneath_ my existing network. After thinking about it for a while, I see how I
can do it.

{{< image width="350px" float="right" src="/assets/mpls-core/after.svg" alt="After" >}}

To avoid using the link, I raise the OSPF cost for chbtl0-chrma0, the _red_ link in the graph.
Traffic will now flow via chgtg0 and through chbtl1. After I've taken the link out of service, I
make a few important changes:

1. First, I move the interface on both VPP routers from its _dot1q_ tagged `xe1-0.400` to a double
   tagged `xe1-0.400.10`. Init7 will pass this through for me, and after I make the change, I can ping
   both sides again (with a subtle loss of 4 bytes of MTU because of the second tag).
1. Next, I unplug the Init7 link on both sides and plug it into a TenGig port on a Centec switch
   that I deployed in both sites, and I take a second TenGig port and plug that into the router. I
   make both ports a _trunk_ mode switchport, and allow VLAN 400 tagged on it (a sketch of that
   switchport config follows this list).
1. Finally, on the switch I create interface `vlan400` on both sides, and the two switches can now see
   each other directly connected on the single-tagged interface, while the two routers can see each
   other directly connected on the double-tagged interface.
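
The switchport part of those changes is a short sketch like this (port numbers and descriptions are
illustrative; the Centec CLI is Cisco-like):

```
interface eth-0-1
 description Transport: Init7 L2VPN (outer tag 400)
 switchport mode trunk
 switchport trunk allowed vlan add 400
!
interface eth-0-2
 description Core: xe1-0 on the local VPP router
 switchport mode trunk
 switchport trunk allowed vlan add 400
```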

With the _red_ leg taken care of, I ask the kind folks from Openfactory if they would mind if I use
a second wavelength for the duration of my migration, which they kindly agree to. So, I plug a new
CWDM 25G optic in on another channel (1270nm), and bring the network to Glattbrugg, where I deploy a
Centec switch.

With the _blue_/_purple_ leg taken care of, all I have to do is undrain the _red_ link (lowering its
OSPF cost) while draining the _green_ link (raising its OSPF cost). Traffic now flips back from chgtg0
through chrma0 and into chbtl0. I can rinse and repeat on the green leg, moving the interfaces on the
routers to a double-tagged `xe1-0.200.10` on both sides, inserting and moving the _green_ link from
the routers into the switches, and connecting them in turn to the routers.

## Configuration

And just like that, I've inserted a triangle of Centec switches without disrupting any traffic,
would you look at that! They are, however, still "just" switches, each with two ports sharing
the _red_ VLAN 400 and the _green_ VLAN 200, and doing ... decidedly nothing on the _purple_ leg,
as those ports aren't even switchports!

Next up: configuring these switches to become, you guessed it, routers!

### Interfaces

I will take the switch at NTT Rümlang as an example, but the switches really are all very
similar. First, I define the loopback addresses and transit networks to chbtl0 (_red_ link) and
chgtg0 (_purple_ link).

```
interface loopback0
 description Core: msw0.chrma0.net.ipng.ch
 ip address 198.19.0.2/32
 ipv6 address 2001:678:d78:500::2/128
!
interface vlan400
 description Core: msw0.chbtl0.net.ipng.ch (Init7)
 mtu 9172
 ip address 198.19.2.1/31
 ipv6 address 2001:678:d78:501::2/112
!
interface eth-0-38
 description Core: msw0.chgtg0.net.ipng.ch (UPC 1270nm)
 mtu 9216
 ip address 198.19.2.4/31
 ipv6 address 2001:678:d78:501::2:1/112
```

I need to make sure that the MTU is correct on both sides (this will be important later when OSPF is
turned on), and I ensure that the underlay has sufficient MTU (in the case of Init7 that is, as the
_purple_ interface goes over dark fiber with no active equipment in between!). I issue a set of ping
commands ensuring that the dont-fragment bit is set and the size of the resulting IP packet is
exactly that which my MTU claims I should allow, and validate that indeed, we're good.
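
From a Linux host attached at either end (the switch's own ping command has similar knobs), such a
test might look like the sketch below, with illustrative far-end addresses. For a 9172 byte IP MTU,
the ICMP payload is 9172-20-8=9144 bytes, and for ICMPv6 it's 9172-40-8=9124 bytes:

```
# IPv4: set don't-fragment, size the payload so the IP packet is exactly 9172 bytes
$ ping -4 -M do -s 9144 -c 3 198.19.2.0
# IPv6: 40 byte IPv6 header + 8 byte ICMPv6 header
$ ping -6 -M do -s 9124 -c 3 2001:678:d78:501::1
```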

### OSPF, LDP, MPLS

For OSPF, I am certain that this network should never carry or propagate anything other than the
**198.19.0.0/16** and **2001:678:d78:500::/56** networks that I have assigned to it, even if it were
to be connected to other things (like an out-of-band connection, or even AS8298), so as belt-and-braces
style protection I take the following baseline configuration:

```
ip prefix-list pl-ospf seq 5 permit 198.19.0.0/16 le 32
ipv6 prefix-list pl-ospf seq 5 permit 2001:678:d78:500::/56 le 128
!
route-map ospf-export permit 10
 match ipv6 address prefix-list pl-ospf
route-map ospf-export permit 20
 match ip address prefix-list pl-ospf
route-map ospf-export deny 9999
!
router ospf
 router-id 198.19.0.2
 redistribute connected route-map ospf-export
 redistribute static route-map ospf-export
 network 198.19.0.0/16 area 0
!
router ipv6 ospf
 router-id 198.19.0.2
 redistribute connected route-map ospf-export
 redistribute static route-map ospf-export
!
ip route 198.19.0.0/16 null0
ipv6 route 2001:678:d78:500::/56 null0
```

I also set a static discard by means of a _nullroute_ for the space belonging to the private
network. This way, packets will not loop around if there is no more specific route for them in OSPF.
The route-map ensures that I'll only be advertising _our space_, even if the switches eventually get
connected to other networks, for example some out-of-band access mechanism.

Next up: enabling LDP and MPLS, which is very straightforward. On my interfaces, I'll add the
***label-switching*** and ***enable-ldp*** keywords, as well as ensure that the OSPF and OSPFv3
speakers on these interfaces know that they are in __point-to-point__ mode. I will set the link
cost in tenths of milliseconds of latency; in other words, if the latency between chbtl0 and
chrma0 is 0.8ms, I will set the cost to 8:

```
interface vlan400
 description Core: msw0.chbtl0.net.ipng.ch (Init7)
 mtu 9172
 label-switching
 ip address 198.19.2.1/31
 ipv6 address 2001:678:d78:501::2/112
 ip ospf network point-to-point
 ip ospf cost 8
 ip ospf bfd
 ipv6 ospf network point-to-point
 ipv6 ospf cost 8
 ipv6 router ospf area 0
 enable-ldp
!
router ldp
 router-id 198.19.0.2
 transport-address 198.19.0.2
!
```

The rest is really just rinse-and-repeat. I loop around all relevant interfaces, and see all of the
OSPF, OSPFv3, and LDP adjacencies form:

```
msw0.chrma0# show ip ospf nei
OSPF process 0:
Neighbor ID     Pri   State        Dead Time   Address         Interface
198.19.0.0        1   Full/ -      00:00:35    198.19.2.0      vlan400
198.19.0.3        1   Full/ -      00:00:39    198.19.2.5      eth-0-38

msw0.chrma0# show ipv6 ospf nei
OSPFv3 Process (0)
Neighbor ID     Pri   State        Dead Time   Interface  Instance ID
198.19.0.0        1   Full/ -      00:00:37    vlan400    0
198.19.0.3        1   Full/ -      00:00:39    eth-0-38   0

msw0.chrma0# show ldp session
Peer IP Address      IF Name    My Role   State         KeepAlive
198.19.0.0           vlan400    Active    OPERATIONAL   30
198.19.0.3           eth-0-38   Active    OPERATIONAL   30
```

### Connectivity

{{< image float="right" src="/assets/mpls-core/backbone.svg" alt="Backbone" >}}

And after I'm done with this heavy lifting, I can now build MPLS services (like L2VPN and VPLS) on
these three switches. But as you may remember, IPng is in a few more sites than just
Brüttisellen, Rümlang and Glattbrugg. While a lot of work, retrofitting every
site in exactly the same way is not mentally challenging, so I'm not going to spend a lot of words
describing it. Wax on, wax off.

Once I'm done though, the (MPLS) network looks a little bit like this. What's really cool about it,
is that it's a fully capable IPv4 and IPv6 network running OSPF and OSPFv3, LDP and MPLS services,
albeit one that's not connected to the internet, yet. This means that I've successfully created a
completely private network that spans all sites where we have active equipment, and did so without
standing in the way of our public facing (VPP) routers in AS8298. Customers haven't noticed a single
thing, except now they can benefit from L2 services (using MPLS tunnels or VPLS clouds) from any
of our sites. Neat!

Our VPP routers are connected through the switches, (carrier) L2VPN and WDM waves just as they were
before, but carried transparently by the Centec switches. Performance-wise, there is no regression,
because the switches do line rate L2/MPLS switching and L3 forwarding. This means that the VPP
routers, except for taking a little detour in-and-out of the switch for their long haul links, have
the same throughput as they had before.

I will deploy three additional features, to make this new private network a fair bit more powerful:

**1. Site Local Connectivity**

Each switch gets what is called an IPng Site Local (or _ipng-sl_) interface. This is a /27 IPv4 and
a /64 IPv6 that is bound on a local VLAN on each switch in our private network. Remember: the links
_between_ sites are no longer switched, they are _routed_ and pass ethernet frames only using MPLS.
I can connect for example all of the fleet's hypervisors to this internal network. I have given our
three bastion _jumphosts_ (Squanchy, Glootie and Pencilvester) an address on this internal network
as well; just look at this beautiful result:

```
pim@hvn0-ddln0:~$ traceroute hvn0.nlams3.net.ipng.ch
traceroute to hvn0.nlams3.net.ipng.ch (198.19.4.98), 64 hops max, 40 byte packets
 1  msw0.ddln0.net.ipng.ch (198.19.4.129)  1.488 ms  1.233 ms  1.102 ms
 2  msw0.chrma0.net.ipng.ch (198.19.2.1)  2.138 ms  2.04 ms  1.949 ms
 3  msw0.defra0.net.ipng.ch (198.19.2.13)  6.207 ms  6.288 ms  7.862 ms
 4  msw0.nlams0.net.ipng.ch (198.19.2.14)  13.424 ms  13.459 ms  13.513 ms
 5  hvn0.nlams3.net.ipng.ch (198.19.4.98)  12.221 ms  12.131 ms  12.161 ms

pim@hvn0-ddln0:~$ iperf3 -6 -c hvn0.nlams3.net.ipng.ch -P 10
Connecting to host hvn0.nlams3, port 5201
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   9.00-10.00  sec  60.0 MBytes   503 Mbits/sec    0   1.47 MBytes
[  7]   9.00-10.00  sec  71.2 MBytes   598 Mbits/sec    0   1.73 MBytes
[  9]   9.00-10.00  sec  61.2 MBytes   530 Mbits/sec    0   1.30 MBytes
[ 11]   9.00-10.00  sec  91.2 MBytes   765 Mbits/sec    0   2.16 MBytes
[ 13]   9.00-10.00  sec  88.8 MBytes   744 Mbits/sec    0   2.13 MBytes
[ 15]   9.00-10.00  sec  62.5 MBytes   524 Mbits/sec    0   1.57 MBytes
[ 17]   9.00-10.00  sec  60.0 MBytes   503 Mbits/sec    0   1.47 MBytes
[ 19]   9.00-10.00  sec  65.0 MBytes   561 Mbits/sec    0   1.39 MBytes
[ 21]   9.00-10.00  sec  61.2 MBytes   530 Mbits/sec    0   1.24 MBytes
[ 23]   9.00-10.00  sec  63.8 MBytes   535 Mbits/sec    0   1.58 MBytes
[SUM]   9.00-10.00  sec   685 MBytes  5.79 Gbits/sec    0
...
[SUM]   0.00-10.00  sec  7.38 GBytes  6.34 Gbits/sec  177             sender
[SUM]   0.00-10.02  sec  7.37 GBytes  6.32 Gbits/sec                  receiver
```

**2. Egress Connectivity**

Having a private network is great, as it allows me to run the entire internal environment with 9000
byte jumboframes, mix IPv4 and IPv6, segment off background tasks such as ZFS replication and
borgbackup between physical sites, and employ monitoring with Prometheus and LibreNMS and log in
safely with SSH or IPMI, without ever needing to leave the safety of the walled garden that is
**198.19.0.0/16**.
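
As a small example of the kind of background task that now never leaves the walled garden: a
hypothetical ZFS replication job between two hypervisors, where the dataset and snapshot names are
made up, but the hostname resolves to a 198.19.0.0/16 address only:

```
# Take today's snapshot and send it incrementally over the private network.
zfs snapshot ssd-vol0/libvirt@replica-20230311
zfs send -i @replica-20230310 ssd-vol0/libvirt@replica-20230311 \
  | ssh hvn0.ddln0.net.ipng.ch zfs recv ssd-vol0/libvirt
```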

Hypervisors will now typically get a management interface _only_ in this network, and for them to be
able to do things like run _apt upgrade_, some remote repositories will need to be reachable over
IPv4 as well. For this, I decide to add three internet gateways, which will have one leg into the
private network, and one leg out into the world. For IPv4 they'll provide NAT, and for IPv6 they'll
ensure only _trusted_ traffic can enter the private network.

These gateways will:

* Connect to the internal network with OSPF and OSPFv3:
  * They will learn 198.19.0.0/16, 2001:678:d78:500::/56 and their more specifics from it
  * They will inject a default route for 0.0.0.0/0 and ::/0 into it
* Connect to AS8298 with BGP:
  * They will receive a default IPv4 and IPv6 route from AS8298
  * They will announce the two aggregate prefixes to it with the **no-export** community set
* Provide a WireGuard endpoint to allow remote management:
  * Clients will be put in 192.168.6.0/24 and 2001:678:d78:300::/56
  * These ranges will be announced both to AS8298 externally and to OSPF internally

This provides dynamic routing at its best. If the gateway, the physical connection to the internal
network, or the OSPF adjacency is down, AS8298 will not learn the routes into the internal network
at this node. If the gateway, the physical connection to the external network, or the BGP adjacency
is down, the Centec switch will not pick up the default routes, and no traffic will be sent through
it. By having three such nodes geographically separated (one in Brüttisellen, one in
Plan-les-Ouates and one in Amsterdam), I am very likely to have stable and resilient connectivity.

At the same time, these three machines serve as WireGuard endpoints to be able to remotely manage
the network. For this purpose, I've carved out **192.168.6.0/26** and **2001:678:d78:300::/56** and
will hand out IP addresses from those to clients. I'd like these two networks to have access to the
internal private network as well.
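
The WireGuard side of such a gateway is unspectacular; a sketch, with made-up keys, addresses and
one roadwarrior peer, might look like this:

```
# /etc/wireguard/wg0.conf -- illustrative only
[Interface]
Address    = 192.168.6.1/26, 2001:678:d78:300::1/64
ListenPort = 51820
PrivateKey = <gateway-private-key>

[Peer]
# A laptop allowed to use exactly one IPv4 and one IPv6 address
PublicKey  = <client-public-key>
AllowedIPs = 192.168.6.10/32, 2001:678:d78:300::10/128
```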

The Bird2 OSPF configuration for one of the nodes (in Brüttisellen) looks like this:

```
filter ospf_export {
  if (net.type = NET_IP4 && net ~ [ 0.0.0.0/0, 192.168.6.0/26 ]) then accept;
  if (net.type = NET_IP6 && net ~ [ ::/0, 2001:678:d78:300::/64 ]) then accept;
  if (source = RTS_DEVICE) then accept;
  reject;
}

filter ospf_import {
  if (net.type = NET_IP4 && net ~ [ 198.19.0.0/16 ]) then accept;
  if (net.type = NET_IP6 && net ~ [ 2001:678:d78:500::/56 ]) then accept;
  reject;
}

protocol ospf v2 ospf4 {
  debug { events };
  ipv4 { export filter ospf_export; import filter ospf_import; };
  area 0 {
    interface "lo" { stub yes; };
    interface "wg0" { stub yes; };
    interface "ipng-sl" { type broadcast; cost 15; bfd on; };
  };
}

protocol ospf v3 ospf6 {
  debug { events };
  ipv6 { export filter ospf_export; import filter ospf_import; };
  area 0 {
    interface "lo" { stub yes; };
    interface "wg0" { stub yes; };
    interface "ipng-sl" { type broadcast; cost 15; bfd off; };
  };
}
```

The ***ospf_export*** filter is what we're announcing to the Centec switches. Here, precisely the
default route and the WireGuard space are announced, in addition to connected routes. The
***ospf_import*** filter is what we're willing to learn from the Centec switches, and here we will
accept exactly the aggregate **198.19.0.0/16** and **2001:678:d78:500::/56** prefixes belonging to
the private internal network.

The Bird2 BGP configuration for this gateway then looks like this:

```
filter bgp_export {
  if (net.type = NET_IP4 && ! (net ~ [ 198.19.0.0/16, 192.168.6.0/26 ])) then reject;
  if (net.type = NET_IP6 && ! (net ~ [ 2001:678:d78:500::/56, 2001:678:d78:300::/64 ])) then reject;

  # Add BGP Wellknown community no-export (FFFF:FF01)
  bgp_community.add((65535,65281));
  accept;
}

template bgp T_GW4 {
  local as 64512;
  source address 194.1.163.72;
  default bgp_med 0;
  default bgp_local_pref 400;
  ipv4 { import all; export filter bgp_export; next hop self on; };
}

template bgp T_GW6 {
  local as 64512;
  source address 2001:678:d78:3::72;
  default bgp_med 0;
  default bgp_local_pref 400;
  ipv6 { import all; export filter bgp_export; next hop self on; };
}

protocol bgp chbtl0_ipv4_1 from T_GW4 { neighbor 194.1.163.66 as 8298; };
protocol bgp chbtl1_ipv4_1 from T_GW4 { neighbor 194.1.163.67 as 8298; };
protocol bgp chbtl0_ipv6_1 from T_GW6 { neighbor 2001:678:d78:3::2 as 8298; };
protocol bgp chbtl1_ipv6_1 from T_GW6 { neighbor 2001:678:d78:3::3 as 8298; };
```

The ***bgp_export*** filter is where we restrict our announcements to only exactly the prefixes
we've learned from the Centec, and WireGuard. We'll set the _no-export_ BGP community on them, which
will allow the prefixes to live in AS8298 but never be announced to any eBGP peers. If any of
the machine, the BGP session, the WireGuard interface, or the default route were missing, the
prefixes would simply not be announced. In the other direction, if the Centec is not feeding the
gateway its prefixes via OSPF, the BGP session may be up, but it will not be propagating these
prefixes, and the gateway will not attract network traffic to it. There are two BGP uplinks to
AS8298 here, which also provides resilience in case one of them is down for maintenance or in a
fault condition. __N+k__ is a great rule to live by, when it comes to network engineering.
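
Both directions are easy to eyeball on the gateway with bird's CLI (these are standard bird2
commands; I'll spare you the output):

```
# Are the OSPF and BGP protocols up?
birdc show protocols
# Which prefixes are we exporting to one of the AS8298 sessions?
birdc show route export chbtl0_ipv4_1
# And which routes (hopefully a default) did we learn from it?
birdc show route protocol chbtl0_ipv4_1
```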

The last two things I should provide on each gateway are **(A)** a NAT translator from internal to
external, and **(B)** a firewall that ensures only authorized traffic gets passed to the Centec
network.

First, I'll provide an IPv4 NAT translation to the internet facing AS8298 (`ipng`), for traffic
that is coming from WireGuard or the private network, while allowing it to pass _between_ the two
networks without performing NAT. The first rule says to jump to __ACCEPT__ (skipping the NAT rules)
if the source is WireGuard. The second two rules say to provide NAT towards the internet for any
traffic coming from WireGuard or the private network. The fourth and last rule says to provide NAT
towards the _internal_ private network, so that anything trying to get into the network will be
coming from an address in **198.19.0.0/16** as well. Here they are:

```
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -o ipng-sl -j ACCEPT
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -o ipng -j MASQUERADE
iptables -t nat -A POSTROUTING -s 198.19.0.0/16 -o ipng -j MASQUERADE
iptables -t nat -A POSTROUTING -o ipng-sl -j MASQUERADE
```

**3. Ingress Connectivity**

For inbound traffic, the rules are similarly permissive for _trusted_ sources, but otherwise prohibit
any passing traffic. Prefixes are allowed to be forwarded from WireGuard, and from some (not
disclosed, cuz I'm not stoopid!) trusted prefixes for IPv4 and IPv6, but ultimately the forwarding
tables end in a default policy of __DROP__, which means no traffic will be passed into the WireGuard
or Centec internal networks unless explicitly allowed here:

```
iptables -P FORWARD DROP
ip6tables -P FORWARD DROP
for SRC4 in 192.168.6.0/24 ...; do
  iptables -I FORWARD -s $SRC4 -j ACCEPT
done
for SRC6 in 2001:678:d78:300::/56 ...; do
  ip6tables -I FORWARD -s $SRC6 -j ACCEPT
done
```

With that, any machine in the Centec (and WireGuard) private internal network will have full access
amongst each other, and they will be NATed to the internet through these three (N+2) gateways. If I
turn one of them off, things look like this:

```
pim@hvn0-ddln0:~$ traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  msw0.ddln0.net.ipng.ch (198.19.4.129)  0.733 ms  1.040 ms  1.340 ms
 2  msw0.chrma0.net.ipng.ch (198.19.2.6)  1.249 ms  1.555 ms  1.799 ms
 3  msw0.chbtl0.net.ipng.ch (198.19.2.0)  2.733 ms  2.840 ms  2.974 ms
 4  hvn0.chbtl0.net.ipng.ch (198.19.4.2)  1.447 ms  1.423 ms  1.402 ms
 5  chbtl0.ipng.ch (194.1.163.66)  1.672 ms  1.652 ms  1.632 ms
 6  chrma0.ipng.ch (194.1.163.17)  2.414 ms  2.431 ms  2.322 ms
 7  as15169.lup.swissix.ch (91.206.52.223)  2.353 ms  2.331 ms  2.311 ms
...

pim@hvn0-chbtl0:~$ sudo systemctl stop bird

pim@hvn0-ddln0:~$ traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  msw0.ddln0.net.ipng.ch (198.19.4.129)  0.770 ms  1.058 ms  1.311 ms
 2  msw0.chrma0.net.ipng.ch (198.19.2.6)  1.251 ms  1.662 ms  2.036 ms
 3  msw0.chplo0.net.ipng.ch (198.19.2.22)  5.828 ms  5.455 ms  6.064 ms
 4  hvn1.chplo0.net.ipng.ch (198.19.4.163)  4.901 ms  4.879 ms  4.858 ms
 5  chplo0.ipng.ch (194.1.163.145)  4.867 ms  4.958 ms  5.113 ms
 6  chrma0.ipng.ch (194.1.163.50)  9.274 ms  9.306 ms  9.313 ms
 7  as15169.lup.swissix.ch (91.206.52.223)  10.168 ms  10.127 ms  10.090 ms
...
```

{{< image width="200px" float="right" src="/assets/mpls-core/swedish_chef.jpg" alt="Chef's Kiss" >}}

**How cool is that :)** First I do a traceroute from the hypervisor pool in the DDLN colocation site, which
finds its closest default at `msw0.chbtl0.net.ipng.ch`, exiting via `hvn0.chbtl0` into the public
internet. Then, I shut down bird on that hypervisor/gateway, which means it won't be advertising the
default into the private network, nor will it be picking up traffic to/from it. About one second
later, the next default route is found at `msw0.chplo0.net.ipng.ch` over its hypervisor in
Geneva (note, 4ms down the line), after which the egress is performed at `hvn1.chplo0` into the
public internet. Of course, traffic is then sent back to Zurich to still find its way to Google at
SwissIX, but the only penalty is a scenic route: looping from Brüttisellen to Geneva and back
adds pretty much 8ms of end-to-end latency.

Just look at that beautiful resilience at play. Chef's kiss.

## What's next

The ring hasn't been fully deployed yet. I am waiting on a backorder of switches from Starry
Networks, due to arrive early April. The delivery of those will allow me to deploy in Paris and Lille,
hopefully in a cool roadtrip with Fred :)

But, I got pretty far, so what's next for me is the following few fun things:

1. Start offering EoMPLS / L2VPN / VPLS services to IPng customers. Who wants some?!
1. Move replication traffic from the current public internet towards the internal _private_ network.
   This can both leverage 9000 byte jumboframes and use wirespeed forwarding from the Centec
   network gear.
1. Move all unneeded IPv4 addresses into the _private_ network, such as maintenance and management
   / controlplane, route reflectors, backup servers, hypervisors, and so on.
1. Move frontends to be dual-homed as well: one leg towards AS8298 using public IPv4 and IPv6
   addresses, and then finding backend servers in the private network (think of it like an NGINX
   frontend that terminates the HTTP/HTTPS connection [_SSL is inserted and removed here :)_], and then
   has one or more backend servers in the private network). This can be useful for Mastodon, Peertube,
   and of course our own websites.