We invested considerable effort in providing state-of-the-art networking in the data center. This document describes the requirements, ideas and implementation details of the network topology that hosts metal-stack.
The document is separated into three main sections describing the constraints, theoretical ideas and implementation details.
The search for requirements for this greenfield project started with a handful of design parameters:
- Investigation of the idea of a layer-3 based infrastructure to overcome the drawbacks of traditional layer-2 architectures.
- Application of a routing technology that involves a single stand-alone protocol BGP for operational simplicity.
- Utilization of the overlay virtual network technology EVPN to support cost-effective scaling, efficient network information exchange and a manageable amount of administration effort.
- Applying the routing topology on top of a completely new physical infrastructure that is designed as a CLOS network topology.
Evaluation of those parameters led to more specific requirements:
- Physical Wiring:
- The data center is made of a leaf-spine CLOS topology containing:
- leaf switches
- spine switches
- exit switches
- management server
- management switch
- tenant servers
- tenant firewalls.
- Bare metal servers are dual-attached to leaf switches. The bare metal servers either become tenant servers or firewalls for a group of tenant servers.
- All network switches are connected to a management switch. A management server provides access to this management network.
- Network Operation Characteristics:
- IPv4 based network.
- No IPv6 deployment.
- Utilization of external BGP.
- Numbered BGP only for peerings at exit switches with third parties (Internet Service Provider).
- Overall BGP unnumbered.
- 4-byte private ASN instead of default 2-byte ASN for BGP.
- Network operation relies on Cumulus Linux.
- Bleeding edge Routing-to-the-Host/EVPN-to-the-Host with ordinary Linux distributions.
- Layer-3 routing using BGP and VXLAN/EVPN.
- Every VTEP acts as a layer-3 gateway and does routing. Routing is done on both the ingress and the egress VTEP (aka distributed symmetric routing).
- Tenant isolation is realized with VRF.
- Internet Access is implemented with route leak on the firewall servers and during the PXE-Process with route leak on the exit switches.
- MTU 9216 is used for VXLAN-facing interfaces, otherwise MTU 9000 is used.
Furthermore, requirements such as operational simplicity and network stability that a small group of people can effectively support were identified as a primary focus for building metal-stack.
The theoretical concept targets the aforementioned requirements. New technologies have been evaluated to apply the best solutions. The process was heavily inspired by the work of Dinesh G. Dutt regarding BGP (bgp-ebook) and EVPN (evpn-ebook).
External BGP together with network overlay concepts such as EVPN can address the essential demands. These concepts are part of the next evolutionary step in data center design. They overcome common issues of traditional layer-2 architectures (e.g. VLAN limitations, network visibility for operations, firewall requirements) by introducing a layer-3 based network topology.
A CLOS topology is named after Charles Clos, the pioneer who first formalized this approach. It defines a multistage network topology that is used today to improve performance and resilience while enabling cost-effective scalability. A CLOS topology comprises network switches aggregated into spine and leaf layers. Each leaf switch (short: leaf) is connected to all spine switches (short: spine), but there is no direct leaf-to-leaf or spine-to-spine connection (see Picture 1).
Picture 1: Fragment of CLOS to show leaf-spine layer.
This data center network architecture, based on a leaf-spine architecture, is also known as a "two-tier" CLOS topology.
Picture 2: Fragment to show a 3-stage, 2-layer CLOS topology.
Tenant servers are dual-attached to the leaf layer for redundancy and load balancing (Picture 2). The sets of leaves, spines and tenant servers define the stages. From the top down, each server is reachable within 3 hops (spine -> leaf -> server), which is why this design is called a 3-stage CLOS. Consistent latency throughout the data center is an outcome of this design.
A scalable and resilient infrastructure is not enough; planning and operation teams must be supported as well. Visibility within the network is highly important to them, and layer-3 routing provides better tooling for this than layer-2 bridging.
For routing, the Border Gateway Protocol (BGP), more specifically External BGP, was selected. Extensive testing and operational experience have shown that External BGP is well suited as a stand-alone routing protocol (see: RFC7938).
Not all tenant servers are connected to the same leaf. Instead, they can be distributed among any of the leaves of the data center. So that this detail does not restrict intra-tenant communication, those layer-2 domains must be interconnected. In the context of BGP, overlay networking with VXLAN/EVPN was evaluated to satisfy the needs of metal-stack.
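The wiring rule of the leaf-spine layer can be written down in a few lines. The following is only a sketch; the switch names are hypothetical and not taken from the actual deployment.

```python
from itertools import product


def leaf_spine_links(leaves, spines):
    """Return the leaf-to-spine links of a two-tier CLOS fabric.

    Every leaf is wired to every spine. Links between two leaves or
    between two spines are never created.
    """
    return list(product(leaves, spines))


leaves = ["leaf01", "leaf02"]    # hypothetical names
spines = ["spine01", "spine02"]  # hypothetical names
links = leaf_spine_links(leaves, spines)

# The number of fabric links is the product of both layer sizes.
assert len(links) == len(leaves) * len(spines)
print(links)
```

Adding a leaf therefore adds one link per spine, which is what keeps the hop count between any two servers constant as the fabric grows.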
Traditionally, each BGP peer-facing interface requires a separate IPv4 address, which consumes a lot of IP addresses. BGP unnumbered avoids this. It builds on RFC 5549 and uses an interface's IPv6 link-local address (LLA) to set up a BGP session with a peer. The IPv6 LLA of the remote end is discovered automatically via IPv6 Router Advertisements (RA). Important: This does not (!) mean that IPv6 must be deployed in the network. BGP uses RFC 5549 to encode IPv4 routes as reachable over an IPv6 next-hop, namely the LLA. Unnumbered interfaces also do not mean that no IPv4 address may be in place. It is good practice to assign an IP address to the never-failing and always-present local loopback interface (lo). This lo address is reachable over BGP from other peers because RFC 5549 provides an encoding scheme that allows a router to advertise IPv4 routes with an IPv6 next-hop. BGP unnumbered also has a security advantage: it removes IPv4 and global IPv6 addresses from router interfaces and thereby reduces the attack surface.
To sum it up:
- BGP unnumbered uses IPv6 next-hops to announce IPv4 routes.
- There is no IPv6 deployment in the network required.
- IPv6 just has to be enabled on the BGP peers to provide LLA and RA.
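To illustrate why no address planning is needed for the LLA, the sketch below derives a link-local address from a MAC address using the modified EUI-64 scheme. This is an assumption for illustration only: the MAC address is made up, and a Linux host may be configured to generate its LLA differently (for example with stable privacy addresses).

```python
import ipaddress


def eui64_link_local(mac: str) -> ipaddress.IPv6Address:
    """Derive an fe80::/64 address from a MAC using modified EUI-64."""
    octets = [int(part, 16) for part in mac.split(":")]
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC address.
    interface_id = octets[:3] + [0xFF, 0xFE] + octets[3:]
    prefix = [0xFE, 0x80] + [0] * 6
    return ipaddress.IPv6Address(bytes(prefix + interface_id))


lla = eui64_link_local("00:11:22:33:44:55")  # hypothetical MAC
print(lla)  # fe80::211:22ff:fe33:4455
assert lla.is_link_local
```

The address is valid only on the link itself, which is all a BGP session between two directly connected peers needs.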
In BGP, the autonomous system number (ASN) is how peers identify each other.
Within the data center each BGP router is identified by a private ASN. This ASN is used for internal communication. The default is a 2-byte ASN. To avoid having to find workarounds in case the ASN address space is exhausted, the 4-byte private ASN range, which provides roughly 95 million ASNs (4200000000–4294967294), is used from the beginning.
ASN numbering in a CLOS topology should follow a model that avoids routing problems (path hunting) caused by its redundant nature. Within a CLOS topology the following ASN numbering model is suggested to solve path hunting problems:
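The size of both private ranges can be checked with a quick calculation. The 4-byte range is the one quoted above; the 2-byte private range 64512 to 65534 is not stated in this document and is taken from RFC 6996.

```python
# Private ASN ranges (inclusive bounds).
PRIVATE_ASN_2BYTE = range(64512, 65534 + 1)            # per RFC 6996
PRIVATE_ASN_4BYTE = range(4200000000, 4294967294 + 1)  # range used here

print(len(PRIVATE_ASN_2BYTE))  # 1023
print(len(PRIVATE_ASN_4BYTE))  # 94967295
```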
- Leaves have unique ASN
- Spines share an ASN
- Exit switches share an ASN
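The numbering model can be sketched as a small helper. The function name, the switch names and the offsets are assumptions for illustration; the offsets merely mirror the ASNs that appear in the listings of this document (4200000011 and 4200000012 for leaves, 4200000020 for spines, 4200000031 for exit switches).

```python
def assign_asns(leaves, spines, exits, base=4200000000):
    """Assign unique ASNs to leaves and shared ASNs to spines and exits."""
    spine_asn = base + 20  # shared by all spines
    exit_asn = base + 31   # shared by all exit switches
    plan = {leaf: base + 11 + i for i, leaf in enumerate(leaves)}
    if {spine_asn, exit_asn} & set(plan.values()):
        raise ValueError("a leaf ASN collides with a shared ASN")
    plan.update({spine: spine_asn for spine in spines})
    plan.update({exit_switch: exit_asn for exit_switch in exits})
    return plan


plan = assign_asns(
    leaves=["leaf01", "leaf02"],    # hypothetical names
    spines=["spine01", "spine02"],  # hypothetical names
    exits=["exit01", "exit02"],     # hypothetical names
)
print(plan)
```

Because all spines share one ASN, BGP loop detection makes a spine drop any path that already passed through another spine, which is what suppresses path hunting.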
As stated, BGP is a multi-protocol routing protocol. Since it is planned to use IPv4 and overlay networks using EVPN/VXLAN several address-families have to be activated for the BGP sessions to use:
- IPv4 unicast address-family
- L2 EVPN address-family
Ethernet VPN (EVPN) is an overlay virtual network that connects layer-2 segments over layer-3 infrastructure. EVPN is an answer to common problems of entire layer-2 data centers.
Challenges such as large failure domains, spanning tree complexities, difficult troubleshooting and scaling issues are addressed by EVPN:
- administration: fewer routers are involved in configuration (with VLAN every switch on the routing paths needs VLAN awareness). The configuration is less error prone due to the nature of EVPN and the good support in FRR.
- scaling: EVPN overcomes scaling issues with traditional VLANs (max. 4094 VLANs).
- cost-effectiveness: EVPN is an overlay virtual network. Not every switch on the routing path needs EVPN awareness. This enables the use of standard routers (in contrast to traditional VLAN); e.g.: spine switches act only as EVPN information replicators and do not need to have knowledge of specific virtual networks.
- efficiency: EVPN information is exclusively exchanged via BGP (Multiprotocol BGP). Only a single eBGP session is needed to advertise layer-2 reachability. No other protocols besides BGP are involved and flood traffic is reduced to a minimum (no "flood-and-learn", no BUM traffic).
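The scaling argument comes down to the width of the identifier fields. The 12-bit VLAN ID explains the limit of 4094 VLANs mentioned above; the 24-bit width of the VXLAN Network Identifier (VNI) is not stated in this document and is taken from the VXLAN specification (RFC 7348).

```python
VLAN_ID_BITS = 12  # IEEE 802.1Q VLAN ID
VNI_BITS = 24      # VXLAN Network Identifier, per RFC 7348

usable_vlans = 2**VLAN_ID_BITS - 2  # IDs 0 and 4095 are reserved
vni_space = 2**VNI_BITS

print(usable_vlans)  # 4094
print(vni_space)     # 16777216
```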
Virtual routing permits multiple network paths without the need of multiple switches. Hence the servers are logically isolated by assigning their networks to dedicated virtual routers using virtual routing and forwarding (short: VRF).
EVPN (technology) is based on BGP as control plane protocol (underlay) and VXLAN as data plane protocol (overlay).
As EVPN is an overlay network, only the VXLAN Tunnel End Points (VTEPs) must be configured. In the case of two-tier CLOS networks leaf switches are tunnel endpoints.
In EVPN, routing is assumed to occur in the context of a VRF. VRF enables true multitenancy. Therefore, VRF is the first step of the EVPN configuration, and there is a 1:1 relationship between tenant and VRF.
To route between layer-2 networks, a special interface is needed. This interface is called Switched Virtual Interface (SVI). The SVI is realized with a VLAN and is part of a VRF (layer-3).
The VTEP configuration requires the setup of a VXLAN interface. A VLAN aware bridge interconnects the VXLAN interface and the SVI.
Required Interfaces to establish the EVPN control plane:
- VRF: because routing happens in the context of this interface.
- SVI: because remote host routes for symmetric routing are installed over this interface.
- VLAN-aware bridge: because router MAC addresses of remote VTEPs are installed over this interface.
- VXLAN Interface / VXLAN Tunnel Endpoint: because the VRF to layer-3 VNI mapping has to be consistent across all VTEPs.
Picture 3: Required interfaces on the switch to wire up the vrf to swp 1 connectivity with a given vxlan
Integrated routing and bridging (IRB) is the most complex part of EVPN. You can choose between centralized and distributed routing, and between asymmetric (routing on ingress only) and symmetric (routing on ingress and egress) routing. We expect a lot of traffic within the data center itself, which implies the need to avoid zigzag routing. This is why we go with the distributed routing model. Furthermore, the symmetric model is recommended since it fits most cases and has advantages in scalability (see "EVPN in the Data Center", Dinesh G. Dutt).
In a layer-3 network it is important to associate each interface with a proper Maximum Transmission Unit (MTU) to avoid fragmentation of IP packets. Typical modern networks do not fragment IP packets, and VXLAN adds an additional header to each packet, which must still not exceed the MTU. If the MTU is exceeded, VXLAN might just fail without error, as Cumulus consultants stated. This represents a difficult-to-diagnose connectivity issue that has to be avoided.
It is common practice to set the MTU for VXLAN-facing interfaces (e.g. inter-switch links) to 9216 to compensate for the additional VXLAN overhead, and to use an MTU of 9000 as the default for other interfaces (e.g. server-facing ports). The common MTU of 1500 is not sufficient for traffic inside a data center!
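The headroom between both MTU values can be verified with a short calculation. The sketch assumes an IPv4 underlay (the VTEP addresses in this document are IPv4) and untagged inner Ethernet frames; the header sizes are the standard ones and are not taken from this document.

```python
# Bytes added around the inner IP packet by VXLAN encapsulation.
INNER_ETHERNET = 14  # inner Ethernet header, untagged (assumption)
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4 = 20      # IPv4 underlay (assumption)


def outer_ip_packet_size(inner_mtu: int) -> int:
    """Size of the outer IP packet that carries a full-size inner packet."""
    return inner_mtu + INNER_ETHERNET + VXLAN_HEADER + UDP_HEADER + OUTER_IPV4


size = outer_ip_packet_size(9000)
print(size)          # 9050
print(size <= 9216)  # True: fits the VXLAN-facing MTU
print(outer_ip_packet_size(1500) <= 1500)  # False: needs a larger underlay MTU
```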
Routing is needed for communication between VXLAN tunnels or between a VXLAN tunnel and an external network. VXLAN routing supports layer-3 multi-tenancy. All routing occurs in the context of a VRF. There is a 1:1 relation of a VRF to a tenant. Picture 4 illustrates this. Servers A and B belong to the same VRF, VRF1. Server C is enslaved into VRF2. No communication is possible between members of VRF1 and those of VRF2.
Picture 4: Illustration of two distinct routing tables of VRF1 (enslaved: servers A and B) and VRF2 (enslaved: server C)
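The isolation shown in Picture 4 can be modelled as one routing table per VRF. This is a simplified sketch, not how the kernel implements VRFs; all IP addresses are made up for illustration.

```python
import ipaddress

# One routing table per VRF (prefix -> next hop). Addresses are hypothetical.
vrf_tables = {
    "VRF1": {"10.0.0.1/32": "server A", "10.0.0.2/32": "server B"},
    "VRF2": {"10.0.1.1/32": "server C"},
}


def lookup(vrf: str, destination: str):
    """Longest-prefix match that only consults the table of the given VRF."""
    address = ipaddress.ip_address(destination)
    matches = [
        ipaddress.ip_network(prefix)
        for prefix in vrf_tables[vrf]
        if address in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # unreachable from this VRF
    best = max(matches, key=lambda network: network.prefixlen)
    return vrf_tables[vrf][str(best)]


print(lookup("VRF1", "10.0.0.2"))  # server B
print(lookup("VRF1", "10.0.1.1"))  # None: server C is only known in VRF2
```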
To leverage the potential and power of BGP, VRF and EVPN/VXLAN without vendor lock-in, the implementation relies on hardware that is supported by an open network operating system: Cumulus Linux.
Implementation of the network operation requires the data center infrastructure to be in place. For the parts of the CLOS network to fulfill their function, all members must be wired accordingly.
Reference: See the CLOS overview picture in ./README.md.
|Component|Description|
|---|---|
|Tenant server (aka Machine)|Bare metal server that is associated with a tenant. Dual-connected to leaves.|
|Tenant firewall|Bare metal server that is associated with a tenant. Dual-connected to leaves.|
|Leaf|Network switch that interconnects tenant servers and firewalls. Connected to spines.|
|Spine|Network switch that interconnects leaves and exit switches.|
|Exit|Network switch that connects to spines and interconnects to external networks.|
|Management Server|Jump-host to access all network switches within the CLOS topology for administrative purposes.|
|Management Switch|Connected to the management port of each of the network switches.|
Tenant servers are organized into a layer called projects. In case those tenant servers require access to or from external networks, a new tenant server is created to function as a firewall. Leaf and spine switches form the foundation of the CLOS network to facilitate redundancy, resilience and scalability. Exit switches establish connectivity to or from external networks. Management Switch and Management Server are mandatory parts that build a management network to access the network switches for administration.
To operate the CLOS topology, software defined configuration to enable BGP, VRF, EVPN and VXLAN must be set up.
Cumulus Linux as the network operating system will be installed on all network switches (leaves, spines, exit switches) within the CLOS topology. Cumulus Linux cannot be installed on bare metal servers that require BGP/EVPN but do not have switching silicon.
Components without a switching silicon are:
- tenant servers
- tenant firewalls
- management server
There are two paradigms for using BGP and/or VXLAN/EVPN on non-switching bare metal servers: BGP-to-the-host and EVPN-to-the-host. Both describe a setup of Free Range Routing (FRR, see frrouting.org) and its configuration. FRR integrates seamlessly with the native Linux IP networking stack.
Starting with an explanation of the tenant server's BGP-to-the-Host helps to get an insight into the setup of the CLOS network from a bottom-up perspective.
Tenant servers are dual-connected to leaf switches. To communicate with other servers or reach out to external networks they must join a BGP session with each of the leaf switches. Thus, it is required to bring BGP to those hosts (aka BGP-to-the-Host). Each tenant server becomes a BGP router (aka BGP speaker).
BGP-to-the-Host is established by installing and configuring FRR. The required FRR configuration for tenant servers is limited to a basic setup to peer with BGP next-hops:
    # /etc/network/interfaces
    auto lo
    iface lo inet static
        address 10.0.0.1/32
    auto lan0
    iface lan0 inet6 auto
        mtu 9000
    auto lan1
    iface lan1 inet6 auto
        mtu 9000
Listing 1: Network interfaces of a tenant server.
Listing 1 shows the local interfaces configuration. lan0 and lan1 connect to the leaves. As described, there is no IPv4 address assigned to them (BGP unnumbered). The local loopback has an IPv4 address assigned that is announced by BGP.
The required BGP configuration:
    # /etc/frr/frr.conf
    frr version 7.0
    frr defaults datacenter
    log syslog debugging
    service integrated-vtysh-config
    !
    interface lan0
     ipv6 nd ra-interval 6
     no ipv6 nd suppress-ra
    !
    interface lan1
     ipv6 nd ra-interval 6
     no ipv6 nd suppress-ra
    !
    router bgp 4200000001
     bgp router-id 10.0.0.1
     bgp bestpath as-path multipath-relax
     neighbor TOR peer-group
     neighbor TOR remote-as external
     neighbor TOR timers 1 3
     neighbor lan0 interface peer-group TOR
     neighbor lan1 interface peer-group TOR
     neighbor LOCAL peer-group
     neighbor LOCAL remote-as internal
     neighbor LOCAL timers 1 3
     neighbor LOCAL route-map local-in in
     bgp listen range 10.244.0.0/16 peer-group LOCAL
     address-family ipv4 unicast
      redistribute connected
      neighbor TOR route-map only-self-out out
     exit-address-family
    !
    bgp as-path access-list SELF permit ^$
    !
    route-map local-in permit 10
     set weight 32768
    !
    route-map only-self-out permit 10
     match as-path SELF
    !
    route-map only-self-out deny 99
    !
Listing 2: FRR configuration of a tenant server.
The FRR configuration in Listing 2 starts with frr defaults datacenter. This marker enables compile-time settings that, for example, set specific values for BGP session timers. It is followed by service integrated-vtysh-config, a directive stating that a single frr.conf file is used instead of several configuration files for different purposes. The two interface-specific blocks starting with interface ... enable the RA mechanism that is required for BGP unnumbered peer discovery. The global BGP instance configuration router bgp 4200000001 sets the private ASN. Within it, bgp router-id 10.0.0.1 identifies the BGP speaker. This router id should be unique, and it is good practice to use the IPv4 address of the local loopback as router id. To apply the same configuration to several interfaces, a peer group named TOR is defined via neighbor TOR peer-group. remote-as external activates external BGP for this peer group. For fast convergence, the default timers are reduced with neighbor TOR timers 1 3. The two BGP-peer-facing interfaces are enslaved into the peer group to inherit its setup. Activation of the IPv4 unicast protocol is completed with address-family ipv4 unicast. To prevent a tenant server from announcing paths other than the one to its lo interface, a route-map only-self-out is defined. This route map is activated within the ipv4 address family: neighbor TOR route-map only-self-out out.
Applying the route map only-self-out ensures that only local IPs are announced. This prevents a tenant server from announcing paths to other servers (and thus from attracting unwanted traffic). To achieve this:
- the route-map named only-self-out permits only matches against an access list named SELF
- SELF permits only empty path announcements
- the path of the tenant server itself has no ASN. It is always empty (see line *> 10.0.0.2/32 0.0.0.0 0 32768 ?):
    root@machine:~# vtysh -c 'show bgp ipv4 unicast'
    BGP table version is 7, local router ID is 10.0.0.2, vrf id 0
    Default local pref 100, local AS 4200000002
    Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
                   i internal, r RIB-failure, S Stale, R Removed
    Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
    Origin codes:  i - IGP, e - EGP, ? - incomplete
       Network          Next Hop            Metric LocPrf Weight Path
    *= 0.0.0.0/0        lan1                                   0 4200000012 4200000040 i
    *>                  lan0                                   0 4200000011 4200000040 i
    *= 10.0.0.1/32      lan1                                   0 4200000012 4200000001 ?
    *>                  lan0                                   0 4200000011 4200000001 ?
    *> 10.0.0.2/32      0.0.0.0                  0         32768 ?
    *= 10.0.0.78/32     lan1                                   0 4200000012 4200000001 ?
    *>                  lan0                                   0 4200000011 4200000001 ?
    Displayed 4 routes and 7 total paths
That is why only the server's own IP (the loopback IP) is announced.
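The effect of the SELF access list can be reproduced with an ordinary regular expression. The sketch below uses the best paths from the routing table output above; Python's re module is only a stand-in for FRR's AS path matching, which behaves the same way for the pattern ^$ but is not identical in general.

```python
import re

SELF = re.compile(r"^$")  # same pattern as the SELF access list

# Prefix -> AS path of the best route, taken from the table above.
routes = {
    "0.0.0.0/0": "4200000011 4200000040",
    "10.0.0.1/32": "4200000011 4200000001",
    "10.0.0.2/32": "",  # locally originated: empty AS path
    "10.0.0.78/32": "4200000011 4200000001",
}


def only_self_out(table):
    """Keep only prefixes whose AS path is empty, like the route-map does."""
    return [prefix for prefix, as_path in table.items() if SELF.match(as_path)]


announced = only_self_out(routes)
print(announced)  # ['10.0.0.2/32']
```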
To allow for peering between FRR and other routing daemons on a tenant server, a listen range is specified to accept iBGP sessions on the network 10.244.0.0/16. This makes it possible for pods / containers like metal-lb with IPs from this range to peer with FRR.
This is the only place where we use iBGP in our topology. For local peering this has the advantage that we don't need an additional ASN that has to be handled / pruned in the AS path of routes. From the viewpoint of the leaves, routes coming from other routing daemons look as if they are configured on the tenant server's lo interface. BGP's best path algorithm handles iBGP routes differently from eBGP routes: generally, BGP prefers eBGP routes over iBGP routes (see 'eBGP over iBGP'), and it automatically assigns a weight based on the route type. To overcome this issue, we set the weight of iBGP routes to the same weight that eBGP routes have, namely 32768 (set weight 32768). Without this configuration we only get a single route to the IPs announced via iBGP, so this setting is essential for HA/failover!
Statistics of the established BGP session can be viewed locally from the tenant server via:
sudo vtysh -c 'show bgp ipv4 unicast'
To establish this BGP session a BGP setup is required on the leaves as well.
Every leaf switch is connected to every spine switch. Tenant servers can be distributed within the data center and thus be connected to different leaves. Routing for tenant servers is isolated in unique VRFs. These constraints imply several configuration requirements for the leaf and spine switches:
- leaves define tenant VRFs
- leaves terminate VXLAN tunnels (aka "VXLAN tunnel endpoint" = VTEP)
The leaf setup requires the definition of a tenant VRF that enslaves the tenant server facing interfaces:
    # /etc/network/interfaces
    # [...]
    iface vrf3981
        vrf-table auto
    iface swp1
        mtu 9000
        post-up sysctl -w net.ipv6.conf.swp1.disable_ipv6=0
        vrf vrf3981
    # [...]
Listing 3: Fragment that shows swp1 being member of vrf vrf3981.
There is a VRF definition
iface vrf3981 to create a distinct routing table and a section
vrf vrf3981 that enslaves swp1 (connects the tenant server) into the VRF. Those host facing ports are also called
Unfortunately, due to a kernel bug, IPv6 is not reliably enabled, so it is enforced explicitly via
post-up sysctl -w net.ipv6.conf.swp1.disable_ipv6=0. If this
post-up trigger is missing the LLA of the interface might be absent.
In addition to the VRF definition, the leaf must be configured to provide and connect a VXLAN interface to establish a VXLAN tunnel. This network virtualization begins at the leaves. Therefore, the leaves are also called Network Virtualization Edges (NVEs). The leaves encapsulate and decapsulate VXLAN packets.
    # /etc/network/interfaces
    # [...]
    iface bridge
        bridge-ports vni3981
        bridge-vids 1001
        bridge-vlan-aware yes
    iface vlan1001
        mtu 9000
        vlan-id 1001
        vlan-raw-device bridge
        vrf vrf3981
    iface vni3981
        mtu 9000
        bridge-access 1001
        bridge-arp-nd-suppress on
        bridge-learning off
        mstpctl-bpduguard yes
        mstpctl-portbpdufilter yes
        vxlan-id 3981
        vxlan-local-tunnelip 10.0.0.11
    # [...]
Listing 4: Fragment that shows VXLAN setup for vrf vrf3981.
All routing happens in the context of the tenant VRF. To send and receive packets of a VRF, several interfaces are in place.
A bridge is used to attach VXLAN interface
bridge-ports vni3981 and map its local VLAN to a VNI. Router MAC addresses of remote VTEPs are installed over this interface.
The Routed VLAN Interface or Switched Virtual Interface (SVI)
iface vlan1001 is configured corresponding to the per-tenant VXLAN interface. It is attached to the tenant VRF. Remote host routes are installed over this SVI. The
vlan-raw-device bridge is used to associate the SVI with the VLAN aware bridge. For a packet received from a locally attached host, the SVI corresponding to the VLAN determines the VRF.
The VXLAN interface
iface vni3981 defines a tunnel address that is used for the VXLAN tunnel header
vxlan-local-tunnelip 10.0.0.11. This VTEP IP address is typically the loopback device address of the switch. When EVPN is provisioned, data plane MAC learning for VXLAN interfaces must be disabled because the purpose of EVPN is to exchange MACs between VTEPs in the control plane:
bridge-learning off. EVPN is responsible for installing remote MACs.
bridge-arp-nd-suppress suppresses ARP flooding over VXLAN tunnels. Instead, a local proxy handles ARP requests received from locally attached hosts for remote hosts. ARP suppression is the implementation for IPv4; ND suppression is the implementation for IPv6. Cumulus recommends enabling ARP suppression on all VXLAN interfaces. Bridge Protocol Data Units (BPDUs) are not transmitted over VXLAN interfaces. So as a good practice bpduguard and portbpdufilter are enabled with
mstpctl-bpduguard yes and
mstpctl-portbpdufilter yes. These settings filter BPDU and guard the spanning tree topology from unauthorized switches affecting the forwarding path.
vxlan-id 3981 specifies the VXLAN Network Identifier (VNI). The type of VNI can either be layer-2 (L2) or layer-3 (L3). This is an implicit thing. A VNI is a L3 VNI (L3VNI) when a mapping exists that maps the VNI to a VRF (configured in
/etc/frr/frr.conf) otherwise it is a L2 VNI (L2VNI).
    # /etc/frr/frr.conf
    # [...]
    vrf vrf3981
     vni 3981
     exit-vrf
    #[...]
    router bgp 4200000011
     # [...]
     address-family ipv4 unicast
      redistribute connected route-map LOOPBACKS
     # [...]
     address-family l2vpn evpn
      neighbor FABRIC activate
      advertise-all-vni
     exit-address-family
    # [...]
    router bgp 4200000011 vrf vrf3981
     # [...]
     address-family ipv4 unicast
      redistribute connected
      neighbor MACHINE maximum-prefix 100
     exit-address-family
     !
     address-family l2vpn evpn
      advertise ipv4 unicast
     exit-address-family
    # [...]
    route-map LOOPBACKS permit 10
     match interface lo
    # [...]
Listing 5: Leaf FRR configuration.
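The implicit distinction between L2 and L3 VNIs described above can be sketched in a few lines: a VNI counts as an L3VNI exactly when a vrf block in frr.conf maps it to a VRF. The parser below is a simplified illustration and not part of FRR; the function names and the second VNI (4242) are hypothetical.

```python
import re

FRR_FRAGMENT = """\
vrf vrf3981
 vni 3981
 exit-vrf
"""


def l3_vni_map(frr_conf: str) -> dict:
    """Collect VNI -> VRF mappings from top-level 'vrf' blocks."""
    mapping = {}
    current_vrf = None
    for line in frr_conf.splitlines():
        stripped = line.strip()
        if line.startswith("vrf "):
            current_vrf = stripped.split()[1]
        elif stripped == "exit-vrf":
            current_vrf = None
        elif current_vrf and re.fullmatch(r"vni \d+", stripped):
            mapping[int(stripped.split()[1])] = current_vrf
    return mapping


def vni_type(vni: int, frr_conf: str) -> str:
    """A VNI with a VRF mapping is an L3VNI, any other VNI is an L2VNI."""
    return "L3VNI" if vni in l3_vni_map(frr_conf) else "L2VNI"


print(vni_type(3981, FRR_FRAGMENT))  # L3VNI
print(vni_type(4242, FRR_FRAGMENT))  # L2VNI (hypothetical VNI without mapping)
```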
Listing 5 shows the required FRR configuration of the BGP control plane. Only content not discussed so far is explained. The section
vrf vrf3981 contains the mapping from layer-3 VNI to VRF. This is required to be able to install EVPN IP prefix routes (type-5 routes) into the routing table. Further the file contains a global BGP instance
router bgp 4200000011 definition. A new setting
redistribute connected route-map LOOPBACKS is in place to filter the redistribution of routes that are not matching the local loopback interface. The route-map is defined with
route-map LOOPBACKS permit 10. With the configuration line
address-family l2vpn evpn, the EVPN address family is enabled between BGP neighbours.
advertise-all-vni makes the switch a VTEP and configures it so that all locally configured VNIs are advertised by the BGP control plane.
The second BGP instance configuration is specific to the tenant VRF
router bgp 4200000011 vrf vrf3981. This VRF BGP instance configures the l2vpn evpn address family with
advertise ipv4 unicast to announce IP prefixes in BGP's routing information base (RIB). This is required to apply learned routes to the routing tables of connected hosts. The Maximum-Prefix feature prevents a router from receiving more routes than its memory can hold. The maximum number of prefixes a tenant server is allowed to announce is limited with
neighbor MACHINE maximum-prefix 100.
On the spine switches the setup is quite simple.
/etc/network/interfaces contains the loopback interface definition to support BGP unnumbered and listings for connected switch ports to provide proper MTUs (Listing 6). For example, swp1 is configured with an MTU of 9216 as it is a VXLAN-facing interface.
    # /etc/network/interfaces
    # [...]
    iface swp1
        mtu 9216
Listing 6: Fragment of spine interface configuration.
The spines are important to forward EVPN routes and transport VXLAN packets between the VTEPs. They are not configured as VTEPs. The FRR configuration only contains the already known global BGP instance configuration
router bgp 4200000020 plus the activation of the l2vpn evpn address family
address-family l2vpn evpn to enable EVPN type-5 route forwarding (Listing 7).
    frr version 4.0+cl3u9
    frr defaults datacenter
    hostname spine01
    username cumulus nopassword
    !
    # [...]
    interface swp1
     ipv6 nd ra-interval 6
     no ipv6 nd suppress-ra
    !
    # [...]
    !
    router bgp 4200000020
     # [...]
     !
     address-family l2vpn evpn
      neighbor FABRIC activate
     exit-address-family
    !
    # [...]
Listing 7: Fragment of spine FRR configuration to show the activated L2VPN EVPN address-family.
In case a tenant server needs to reach external networks such as the Internet, a tenant firewall is provisioned. The firewall is a bare metal server without switching silicon, so Cumulus Linux is not installed on it. Instead, a standard Linux OS with a special configuration (FRR from branch dev/frr-7.1, iproute2 and a Linux kernel >= 5.0) provides the BGP/EVPN functionality known as EVPN-to-the-host. The firewall is configured as a VTEP and applies dynamic route-leaking to install routes of a foreign VRF. The set of leaked routes is restricted with route-maps.
As Listing 8 shows, the firewall is configured with VXLAN interfaces as known from the leaf setup. Additionally, a VXLAN setup for VRF
vrfInternet is added to provide Internet access. vrfInternet contains a route to the Internet that will be leaked into the tenant VRF.
Traffic that originates from the tenant network 10.0.0.0/22 will be masqueraded before leaving the interface vlanInternet towards the Internet.
    # /etc/network/interfaces
    # [...]
    iface bridge
    # [...]
    iface vlan1001
    # [...]
    iface vni3981
    # [...]
    iface vrf3981
    # [...]
    iface vlanInternet
        mtu 9000
        vlan-id 4009
        vlan-raw-device bridge
        vrf vrfInternet
        address 126.96.36.199
        post-up iptables -t nat -A POSTROUTING -s 10.0.0.0/22 -o vlanInternet -j MASQUERADE
        pre-down iptables -t nat -D POSTROUTING -s 10.0.0.0/22 -o vlanInternet -j MASQUERADE
    iface vniInternet
        mtu 9000
        bridge-access 4009
        mstpctl-bpduguard yes
        mstpctl-portbpdufilter yes
        vxlan-id 104009
        vxlan-local-tunnelip 10.0.0.40
    iface vrfInternet
        mtu 9000
        vrf-table auto
Listing 8: Interfaces configuration of firewall to show the VTEP interface configuration.
To install a default route into the routing table of tenant VRF vrf3981 a dynamic route leak is established for it (
import vrf vrfInternet). With the help of a route-map
import vrf route-map vrf3981-import-map only the default route will be leaked:
    root@firewall01:~# vtysh -c 'show ip route vrf vrf3981'
    # [...]
    VRF vrf3981:
    S>* 0.0.0.0/0 [1/0] is directly connected, vrfInternet(vrf vrfInternet), 03:19:26
    B>* 10.0.0.1/32 [20/0] via 10.0.0.12, vlan1001 onlink, 02:34:48
      *                    via 10.0.0.11, vlan1001 onlink, 02:34:48
    B>* 10.0.0.2/32 [20/0] via 10.0.0.12, vlan1001 onlink, 02:34:49
      *                    via 10.0.0.11, vlan1001 onlink, 02:34:49
To receive responses from vrfInternet in vrf3981 a route is leaked into vrfInternet as well (
import vrf vrf3981) restricted with the route-map
vrfInternet-import-map that allows leaking of the tenant routes as well as internet prefixes used on worker nodes of the tenant. To limit the prefixes that are announced from the firewall within the global BGP instance a route-map
only-self-out is defined and applied within the ipv4 and l2vpn evpn address family. Together with the definition of an as path access list
bgp as-path access-list it avoids the announcement of prefixes to non VRF BGP peers.
    # /etc/frr/frr.conf
    !
    vrf vrf3981
     vni 3981
    !
    vrf vrfInternet
     vni 104009
    !
    # [...]
    !
    router bgp 4200000040
     # [...]
     !
     address-family ipv4 unicast
      # [...]
      neighbor FABRIC route-map only-self-out out
     exit-address-family
    !
    !
    router bgp 4200000040 vrf vrf3981
     # [...]
     address-family ipv4 unicast
      redistribute connected
      import vrf vrfInternet
      import vrf route-map vrf3981-import-map
     # [...]
     address-family l2vpn evpn
      advertise ipv4 unicast
    # [...]
    router bgp 4200000040 vrf vrfInternet
     # [...]
     address-family ipv4 unicast
      redistribute connected
      import vrf vrf3981
      import vrf route-map vrfInternet-import-map
     # [...]
     address-family l2vpn evpn
      advertise ipv4 unicast
    # [...]
    bgp as-path access-list SELF permit ^$
    !
    route-map only-self-out permit 10
     match as-path SELF
    !
    route-map only-self-out deny 99
    !
    route-map LOOPBACKS permit 10
     match interface lo
    !
    ip prefix-list vrf3981-import-prefixes seq 100 permit 0.0.0.0/0
    !
    route-map vrf3981-import-map permit 10
     match ip address prefix-list vrf3981-import-prefixes
    !
    route-map vrf3981-import-map deny 99
    !
    ip prefix-list vrfInternet-import-prefixes seq 100 permit 10.0.0.0/22 le 32
    ip prefix-list vrfInternet-import-prefixes seq 101 permit 188.8.131.52/24 le 32
    ip prefix-list vrfInternet-import-prefixes seq 102 permit 184.108.40.206/27 le 32
    !
    route-map vrfInternet-import-map permit 10
     match ip address prefix-list vrfInternet-import-prefixes
    !
    route-map vrfInternet-import-map deny 99
    !
    line vty
    !
Listing 9: FRR configuration of a tenant firewall to show route leak and prefix announcement filtering.
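The matching rules used in Listing 9 can be illustrated outside of FRR. The following Python sketch is only a model of those rules, not FRR's implementation: prefix_list_permits and is_locally_originated are hypothetical helper names, and the AS path is represented as a plain string of AS numbers. It shows that an entry with le 32 admits every more specific route inside the prefix, that an entry without le matches only the exact prefix, and that the expression ^$ matches only an empty AS path, which is the case for routes originated by the firewall itself.

```python
import ipaddress
import re


def prefix_list_permits(route, entry, le=None):
    """Model one 'ip prefix-list ... permit <entry> [le <n>]' line."""
    route_net = ipaddress.ip_network(route)
    entry_net = ipaddress.ip_network(entry)
    if le is None:
        # Without ge/le the entry matches only the exact prefix.
        return route_net == entry_net
    # With 'le n' the route must lie inside the entry and its
    # prefix length must not exceed n.
    return route_net.subnet_of(entry_net) and route_net.prefixlen <= le


def is_locally_originated(as_path):
    """Model 'bgp as-path access-list SELF permit ^$' (empty AS path)."""
    return re.search(r"^$", as_path) is not None


# vrfInternet-import-prefixes seq 100: routes inside 10.0.0.0/22 may leak.
print(prefix_list_permits("10.0.1.5/32", "10.0.0.0/22", le=32))
print(prefix_list_permits("10.0.4.0/24", "10.0.0.0/22", le=32))
# vrf3981-import-prefixes seq 100: only the default route may leak.
print(prefix_list_permits("0.0.0.0/0", "0.0.0.0/0"))
print(prefix_list_permits("10.0.0.0/8", "0.0.0.0/0"))
# only-self-out: an empty AS path passes, a learned route does not.
print(is_locally_originated(""))
print(is_locally_originated("4200000031 4200000040"))
```

In this model the first call of each pair is permitted and the second is rejected, which mirrors the permit entry followed by the final deny 99 in each route-map.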
Traffic to external networks is routed via the firewalls to the exit switch. The exit switch, as an exception, connects to the Internet Service Provider using numbered BGP. Numbered BGP requires IPv4 addresses to be assigned to the network interfaces (see Listing 10, swp1). Interface swp1 is enslaved into vrf vrfInternet so that the port connected to the ISP belongs to the VRF that is expected to contain a route into the Internet. The exit switch is configured as a VTEP to terminate traffic coming from the firewall VRF.
    # /etc/network/interfaces
    # [...]
    iface swp1
        mtu 9000
        vrf vrfInternet
        address 220.127.116.11/30
    # [...]
    iface vlan4000
        mtu 9000
        address 10.0.0.71/24
        vlan-id 4000
        vlan-raw-device bridge
    # [...]
    iface vlanInternet
        # [...]
    iface vniInternet
        # [...]
    iface vrfInternet
        # [...]
Listing 10: Fragment of interfaces configuration of exit switch.
The FRR configuration is equivalent to the previously discussed ones. It contains a global BGP instance that enables the IPv4 unicast and l2vpn evpn address families. The vrfInternet BGP instance defines neighbor 18.104.22.168 peer-group INTERNET to peer over an "old style BGP" transit network.
    # [...]
    vrf vrfInternet
     vni 104009
    !
    # [...]
    router bgp 4200000031
     bgp router-id 10.0.0.31
     neighbor FABRIC peer-group
     neighbor FABRIC remote-as external
     neighbor FABRIC timers 1 3
     # [...]
     !
     address-family ipv4 unicast
      neighbor FABRIC activate
      redistribute connected route-map LOOPBACKS
     exit-address-family
     !
     address-family l2vpn evpn
      neighbor FABRIC activate
      advertise-all-vni
     exit-address-family
    !
    router bgp 4200000031 vrf vrfInternet
     bgp router-id 10.0.0.31
     bgp bestpath as-path multipath-relax
     neighbor INTERNET peer-group
     neighbor INTERNET remote-as external
     neighbor INTERNET timers 1 3
     neighbor 22.214.171.124 peer-group INTERNET
     !
     address-family ipv4 unicast
      neighbor INTERNET route-map PREPEND-PATH-TO-DISFAVOR-IN in
      neighbor INTERNET route-map PREPEND-PATH-TO-DISFAVOR-OUT out
     exit-address-family
     !
     address-family l2vpn evpn
      advertise ipv4 unicast
     exit-address-family
    !
    route-map LOOPBACKS permit 10
     match interface lo
    !
    route-map PREPEND-PATH-TO-DISFAVOR-IN permit 10
     set as-path prepend last-as 2
    !
    route-map PREPEND-PATH-TO-DISFAVOR-OUT permit 10
     set as-path prepend last-as 2
    !
    vrf mgmt
     ip route 10.0.0.0/24 10.0.0.71 nexthop-vrf default
     exit-vrf
    !
    ip route 0.0.0.0/0 192.168.0.254 nexthop-vrf mgmt
    !
    line vty
    !
Listing 11: Fragment of FRR configuration on exit switch to give an example for numbered BGP and route leak.
In addition to the standard BGP setup, the exit switches have static route leaks configured to support internet access during PXE. There is one route leak from the default VRF into the mgmt VRF, defined with ip route 0.0.0.0/0 192.168.0.254 nexthop-vrf mgmt, and another one from the mgmt VRF into the default VRF, defined with ip route 10.0.0.0/24 10.0.0.71 nexthop-vrf default. The first one adds a default route to the default VRF, and the second one routes traffic destined to the PXE network back from the mgmt VRF into the default VRF.
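The effect of the two static route leaks can be modeled with a small lookup. The Python sketch below is a reduced illustration, assuming that each VRF contains only the static route from Listing 11 (real routing tables hold many more entries). ROUTES and lookup are hypothetical names, and 198.51.100.7 is a documentation address standing in for an arbitrary internet host.

```python
import ipaddress

# Reduced model of the two static routes from Listing 11.
# Each entry: (destination prefix, next hop, VRF used to reach the next hop).
ROUTES = {
    "default": [("0.0.0.0/0", "192.168.0.254", "mgmt")],
    "mgmt": [("10.0.0.0/24", "10.0.0.71", "default")],
}


def lookup(vrf, destination):
    """Return (next hop, next-hop VRF) by longest prefix match, or None."""
    address = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), next_hop, next_hop_vrf)
        for prefix, next_hop, next_hop_vrf in ROUTES[vrf]
        if address in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda entry: entry[0].prefixlen)
    return best[1], best[2]


# Outbound: a PXE client's packet to the internet leaves via the mgmt VRF.
print(lookup("default", "198.51.100.7"))
# Return path: traffic for the PXE network is handed back to the default VRF.
print(lookup("mgmt", "10.0.0.42"))
```

The first lookup resolves through the mgmt VRF and the second through the default VRF, which corresponds to the two directions described above.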
To reach external networks, each of the exit nodes establishes a BGP session with a distinct external router. The latency to each of these routers differs. To favor routes via exit nodes with lower latency over exit nodes with higher latency, two route maps, PREPEND-PATH-TO-DISFAVOR-IN and PREPEND-PATH-TO-DISFAVOR-OUT, are added to the high latency exit nodes. These route maps lengthen the AS path of the incoming and outgoing routes. Because BGP prefers shorter AS paths when all earlier selection criteria are equal, paths via the other exit nodes are favored. It is important to know that within an address family only one route map (the last one configured) is applied per neighbor and direction. To apply more than one action, add the required entries to a single route-map.
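A short Python sketch can make the effect of the prepending visible. It is a simplified model, assuming that all best-path criteria evaluated before the AS path length are equal. The function names, the exit node names and the AS numbers (taken from the range reserved for documentation) are hypothetical. prepend_last_as mimics set as-path prepend last-as 2 by repeating the leftmost AS of the path.

```python
def prepend_last_as(as_path, count):
    """Repeat the leftmost AS of the path 'count' times in front of it."""
    if not as_path:
        return list(as_path)
    return [as_path[0]] * count + list(as_path)


def prefer_shortest(paths_by_exit):
    """Pick the exit node whose route has the shortest AS path."""
    return min(paths_by_exit, key=lambda node: len(paths_by_exit[node]))


# Both exit nodes learn the same destination with an AS path of equal length.
paths = {
    "exit01": [64496, 64500],
    # High latency exit node: the route map prepends the last AS twice.
    "exit02": prepend_last_as([64497, 64500], 2),
}

print(paths["exit02"])
print(prefer_shortest(paths))
```

With the prepended path being two hops longer, the selection falls on the exit node without the route map.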
With Cumulus Linux as the basic network operating system there are two possibilities on how to get configuration into the target devices:
- editing flat files and reloading required services
- utilizing Network Command Line Utility (NCLU)
The NCLU has two wrappers:
- CLI
- Ansible Module
As Cumulus states, the NCLU tool is intended for people who are not familiar with Linux. Apart from the general advantages of editing flat files, we found the following reasons to stick with flat files:
- full control on what is going on and when it is going on
- no exposure to bugs that are present in NCLU
- decoupling from additional layers (NCLU)
- avoid hybrid solutions since NCLU is not capable of solving all use cases completely (e.g. DHCP configuration)
- we have had good experiences with rendering plain configuration files in the past
Of course this decision comes with the challenge to render all required files correctly and reload the appropriate services accordingly.
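The render-and-reload challenge can be reduced to one question per file: did the rendered content change? The following Python sketch is one possible approach under stated assumptions, not the tooling used by metal-stack. render_if_changed and DHCP_DEFAULTS are hypothetical names, and the template holds a single line in the format of /etc/default/isc-dhcp-server.

```python
from pathlib import Path
from string import Template

# Hypothetical template for a one-line defaults file.
DHCP_DEFAULTS = Template('INTERFACES="$interface"\n')


def render_if_changed(path, template, **values):
    """Render the template and write it only if the content differs.

    Returns True when the file was written, which tells the caller
    that the service reading this file has to be reloaded.
    """
    content = template.substitute(**values)
    target = Path(path)
    if target.exists() and target.read_text() == content:
        return False
    target.write_text(content)
    return True
```

A caller could invoke render_if_changed("/etc/default/isc-dhcp-server", DHCP_DEFAULTS, interface="vlan4000") and reload the DHCP service only when the function returns True, which keeps repeated runs from restarting services unnecessarily.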
Reference: Operationalizing Cumulus Linux
Before a bare metal server can act as a tenant server or tenant firewall, it has to be provisioned. Within the Metal domain, this provisioning mode is called "PXE Mode" since it is based on the Preboot eXecution Environment (PXE). PXE relies on protocols like DHCP. This requires all bare metal servers that need provisioning to be located in a layer-2 domain where DHCP is available. This domain is the VLAN vlan4000. A DHCP server for PXE Mode is installed on the exit switches and serves this specific VLAN.
    # /etc/default/isc-dhcp-server
    INTERFACES="vlan4000"
Listing 12: DHCP server configuration of exit switches.
As shown in Listing 12, the PXE DHCP server is located on the exit switches and bound to interface vlan4000. This represents a layer-2 separation that allows only DHCP clients in the same VLAN to request IP addresses. Only unprovisioned bare metal servers are configured to be members of this VLAN. This prevents unwanted or accidental provisioning.
To make vlan4000 available on the leaves (that face the bare metal servers), the exit and leaf switches are configured as VTEPs and share an interface configuration that contains the required interfaces (Listing 13). Since no EVPN routing is in place, vni104000 is configured as an L2 VNI (no mapping is defined for this VNI).
    # /etc/network/interfaces
    # [...]
    iface bridge
        bridge-ports vni104000 [...]
        bridge-vids 4000 [...]
        bridge-vlan-aware yes
    iface vlan4000
        # [...]
    iface vni104000
        # [...]
Listing 13: Interfaces configuration on exit and leaf switches to show DHCP/PXE related fragments.
On the leaf switches, the ports facing the bare metal servers are configured as VLAN access ports that carry only the traffic of the PXE VLAN vlan4000 (Listing 14), to separate unprovisioned bare metal servers from the others.
    # /etc/network/interfaces
    # [...]
    auto swp1
    iface swp1
        mtu 9000
        bridge-access 4000
    # [...]
Listing 14: VLAN access setup for bare metal server facing ports on leaves.
Once a bare metal server is provisioned, it is removed from the PXE VLAN vlan4000 to avoid accidental or unwanted provisioning.
During provisioning bare metal servers get internet access via the management network of the exit switches. This is because the exit switches are announced as DHCP gateway to the DHCP clients.
To manage the network switches, a further management access is required besides the out-of-band system console access. For this purpose the concept of Management VRF is applied. The Management VRF is a subset of VRF. It provides a separation between the out-of-band management network and the in-band data plane network by introducing another routing table, mgmt. Cumulus Linux supports lan0 to be used as the management interface.
To enable and use the Management VRF, all switches have to be connected via their lan0 interface to a management switch. The management switch is connected to a management server. All access is established from within the management server. Once the Management VRF is enabled, logins to the switch are placed into the Management VRF context.