metal-stack · majst01 · Dec 9, 2024 · Oct 25, 2024 · Nov 29, 2024 · Dec 9, 2024
@@ -50,7 +50,7 @@ Furthermore, requirements such as *operational simplicity* and *network stabilit
 
 ## Concept
 
-The theoretical concept targets the aforementioned requirements. New technologies have been evaluated to apply the best solutions. The process was heavily inspired by the work of Dinesh G. Dutt regarding BGP ([bgp-ebook](https://www.nvidia.com/en-us/networking/border-gateway-protocol/)) and EVPN ([evpn-ebook](https://www.nvidia.com/en-us/networking/evpn-ebook/)).
+The theoretical concept targets the aforementioned requirements. New technologies have been evaluated to apply the best solutions. The process was heavily inspired by the work of Dinesh G. Dutt regarding BGP ([bgp-ebook](https://www.nvidia.com/en-us/networking/border-gateway-protocol/)), EVPN ([evpn-ebook](https://www.nvidia.com/en-us/networking/evpn-ebook/)) and his 2019 book "[Cloud Native Data Center Networking](https://www.oreilly.com/library/view/cloud-native-data/9781492045595/)" (O'Reilly), which covers many of the underlying basics.
 
 External BGP together with network overlay concepts as EVPN can address the essential demands. These revolutionary concepts are part of the next evolutionary step in data center design. It overcomes common issues of traditional layer 2 architectures (e.g. VLAN limitations, network visibility for operations, firewall requirements) by introducing a layer 3 based network topology.
 
@@ -80,26 +80,32 @@ Not all tenant servers are connected to the same leaf. Instead they can be distr
 
 #### BGP Unnumbered
 
-In BGP traditionally each BGP peer-facing interface requires a separate IPv4 address. This consumes a lot of IP addresses. RFC 5549 defines the BGP unnumbered standard. It allows to use interface's IPv6 link local address (LLA) to set up a BGP session with a peer. With BGP unnumbered the IPv6 LLA of the remote is automatically discovered via Router Advertisement (RA) protocol. Important: This does not (!) mean that IPv6 must be deployed in the network. BGP uses RFC 5549 to encode IPv4 routes as reachable over IPv6 next-hop using the LLA. Having unnumbered interfaces does not mean no IPv4 address may be in place. It is a good practice to configure an IP address to the never failing and always present local loopback interface (lo). This lo address is reachable over BGP from other peers because the RFC 5549 standard provides an encoding scheme to allow a router to advertise IPv4 routes with an IPv6 next-hop. BGP unnumbered also has an advantage from security perspective. It removes IPv4 and global IPv6 addresses from router interfaces, thus reducing the attack vector.
+Traditionally, each BGP peer-facing interface requires a separate IPv4 address, which consumes a lot of IP addresses. [RFC 5549](https://datatracker.ietf.org/doc/html/rfc5549) is the standard behind BGP unnumbered. It allows using an interface's IPv6 link-local address (LLA) to set up a BGP session with a peer. With BGP unnumbered, the IPv6 LLA of the remote side is discovered automatically via Router Advertisements (RA). Important: This does not (!) mean that IPv6 must be deployed in the network. BGP uses [RFC 5549](https://datatracker.ietf.org/doc/html/rfc5549) to encode IPv4 routes as reachable over an IPv6 next-hop using the LLA. Having unnumbered interfaces does not mean that no IPv4 address may be in place. It is good practice to configure an IP address on the never-failing and always-present local loopback interface (lo). This lo address is reachable over BGP from other peers because the [RFC 5549](https://datatracker.ietf.org/doc/html/rfc5549) standard provides an encoding scheme that allows a router to advertise IPv4 routes with an IPv6 next-hop. BGP unnumbered also has an advantage from a security perspective: it removes IPv4 and global IPv6 addresses from router interfaces, thus reducing the attack surface.
 
 To sum it up:
 
 - BGP unnumbered uses IPv6 next-hops to announce IPv4 routes.
 - There is no IPv6 deployment in the network required.
 - IPv6 just has to be enabled on the BGP peers to provide LLA and RA.
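
The encoding idea summarized above can be sketched in Python with the standard `ipaddress` module. This is only an illustrative model and not how FRR stores routes: the `UnnumberedRoute` class, the interface name `swp1` and all addresses are assumptions chosen for the example.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network, IPv6Address


@dataclass(frozen=True)
class UnnumberedRoute:
    """Hypothetical model of an IPv4 route learned over an unnumbered session."""

    prefix: IPv4Network    # IPv4 destination, e.g. the loopback of a peer
    next_hop: IPv6Address  # IPv6 link-local address (LLA) of the peer
    interface: str         # an LLA is only meaningful together with its interface

    def __post_init__(self) -> None:
        # The next hop must come from fe80::/10, no global IPv6 address is needed.
        if not self.next_hop.is_link_local:
            raise ValueError(f"{self.next_hop} is not a link-local address")


route = UnnumberedRoute(IPv4Network("10.0.0.1/32"), IPv6Address("fe80::1"), "swp1")
print(f"{route.prefix} via {route.next_hop} dev {route.interface}")
```

The route carries an IPv4 prefix, while the next hop is taken from the link-local range only, which is why no routed IPv6 deployment is required.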
 
-In BGP, ASN is how BGP peers know each other.
+*In External BGP, ASN is how BGP peers know each other.*
 
 #### ASN Numbering
 
-Within the data center each BGP router is identified by a private autonomous system number (ASN). This ASN is used for internal communication. The default is to have 2-byte ASN. To avoid having to find workarounds in case the ASN address space is exhausted, a 4-byte ASN that supports up to 95 million ASNs (4200000000–4294967294) is used from the beginning.
+Within the data center each BGP router is identified by a private autonomous system number (ASN). This ASN is used for internal communication. The default is to have 2-byte ASN. To avoid having to find workarounds in case the ASN address space is exhausted, a 4-byte ASN (see [RFC 6793](https://datatracker.ietf.org/doc/html/rfc6793)) that supports up to 95 million private ASNs (4200000000–4294967294, see [RFC 6996](https://www.rfc-editor.org/rfc/rfc6996.html)) is used from the beginning.
 
-ASN numbering in a CLOS topology should follow a model to avoid routing problems (path hunting) due to it's redundant nature. Within a CLOS topology the following ASN numbering model is suggested to solve path hunting problems:
+ASN numbering in a CLOS topology should follow a model that avoids routing problems (path hunting) caused by its redundant nature. Within a two-tier CLOS topology the following ASN numbering model is suggested to solve path hunting problems:
 
 - Leaves have unique ASN
 - Spines share an ASN
 - Exit switches share an ASN
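
The numbering model can be sketched as a small Python helper. This is a minimal sketch under assumptions: the function `assign_asns`, the switch names and the choice of the first private ASN as base are hypothetical and not taken from metal-stack.

```python
PRIVATE_ASN_FIRST = 4_200_000_000  # start of the 4-byte private range (RFC 6996)
PRIVATE_ASN_LAST = 4_294_967_294   # end of the 4-byte private range (RFC 6996)


def assign_asns(leaves, spines, exits, base=PRIVATE_ASN_FIRST):
    """Return {switch name: ASN}: spines share one ASN, exits share one, leaves are unique."""
    plan = {name: base for name in spines}            # shared spine ASN
    plan.update({name: base + 1 for name in exits})   # shared exit ASN
    for offset, name in enumerate(leaves, start=2):   # one ASN per leaf
        plan[name] = base + offset
    if any(not PRIVATE_ASN_FIRST <= asn <= PRIVATE_ASN_LAST for asn in plan.values()):
        raise ValueError("ASN outside the private 4-byte range")
    return plan


plan = assign_asns(
    leaves=["leaf01", "leaf02"],
    spines=["spine01", "spine02"],
    exits=["exit01", "exit02"],
)
print(PRIVATE_ASN_LAST - PRIVATE_ASN_FIRST + 1)  # size of the range: 94967295
for name, asn in sorted(plan.items()):
    print(name, asn)
```

The printed range size shows where the figure of roughly 95 million private ASNs comes from.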
 
+An illustrated explanation of the background to this architecture decision can be found in the chapter "BGP’s ASN Numbering Scheme" ("BGP’S PATH HUNTING PROBLEM") of the previously mentioned "Cloud Native Data Center Networking" book.
+
+In summary: every node learns the connection state of the other nodes in the network, but this information takes some time to propagate. A node may therefore hold routing information without knowing whether it is still up to date.
+A route to a node may no longer exist (because the node itself has failed, not just a single link to it), or the path may have changed. To determine whether and how a particular node can still be reached, its communication partners (the BGP routers) have to search through the remaining candidate paths.
+Essentially, sharing ASNs reduces the propagation of incorrect or outdated path information, which cuts down on path updates and calculations and thus saves resources.
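
The mechanism that makes shared ASNs effective is the AS-path loop prevention of eBGP: a router discards any update whose AS path already contains its own ASN. A minimal Python sketch, assuming hypothetical ASN values and a helper named `accepts`:

```python
from typing import Sequence


def accepts(local_asn: int, as_path: Sequence[int]) -> bool:
    """eBGP loop prevention: reject an update whose AS path contains the local ASN."""
    return local_asn not in as_path


SPINE_ASN = 4_200_000_000                      # hypothetical ASN shared by all spines
LEAF01, LEAF02 = 4_200_000_002, 4_200_000_003  # hypothetical unique leaf ASNs

# leaf01 originates a prefix. spine01 forwards it to leaf02, which
# re-advertises it to spine02 with the path [leaf02, spine, leaf01].
direct = [LEAF01]
detour = [LEAF02, SPINE_ASN, LEAF01]

print(accepts(SPINE_ASN, direct))  # True: learned straight from leaf01
print(accepts(SPINE_ASN, detour))  # False: spine02 finds its own ASN and drops the detour
```

If each spine had its own ASN, spine02 would keep the detour as a fallback and try it after leaf01 fails, which is exactly the kind of stale path that path hunting walks through.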
+
 #### Address-Families
 
 As stated, BGP is a multi-protocol routing protocol. Since it is planned to use IPv4 and overlay networks using EVPN/VXLAN several address-families have to be activated for the BGP sessions to use:
@@ -109,32 +115,34 @@ As stated, BGP is a multi-protocol routing protocol. Since it is planned to use
 
 ### EVPN
 
-Ethernet VPN (EVPN) is an overlay virtual network that connects layer-2 segments over layer-3 infrastructure. EVPN is an answer to common problems of entire layer-2 data centers.
+Ethernet VPN (EVPN, see [RFC 7432](https://www.rfc-editor.org/rfc/rfc7432.html)) is an overlay virtual network that connects layer-2 segments over layer-3 infrastructure. EVPN is an answer to common problems of entire layer-2 data centers.
 
-#### Why do we need EVPN
+#### The necessity of EVPN
 
 Challenges such as large failure domains, spanning tree complexities, difficult troubleshooting and scaling issues are addressed by EVPN:
 
 - **administration**: less routers are involved in configuration (with VLAN every switch on routing-paths needs VLAN awareness). The configuration is less error prone due to the nature of EVPN and the good support in FRR.
 - **scaling**: EVPN overcomes scaling issues with traditional VLANs (max. 4094 VLANs).
-- **cost-effectiveness**: EVPN is an overlay virtual network. Not every switch on the routing path needs EVPN awareness. This enables the use of standard routers (in contrast to traditional VLAN); e.g.: spine switches act only as evpn information replicator and do not need to have knowledge of specific virtual networks.
-- **efficiency**: EVPN information is exclusively exchanged via BGP (Multiprotocol BGP). Only a single eBGP session is needed to advertise layer-2 reachability. No other protocols beneath BGP are involved and flood traffic is reduced to a minimum (no "flood-and-learn", no BUM traffic).
+- **cost-effectiveness**: EVPN is an overlay virtual network. Not every switch on the routing path needs EVPN awareness. This enables the use of standard routers (in contrast to traditional VLAN); e.g.: spine switches act only as EVPN information replicator and do not need to have knowledge of specific virtual networks.
+- **efficiency**: EVPN information is exclusively exchanged via BGP (Multiprotocol BGP, see [RFC 4760](https://datatracker.ietf.org/doc/html/rfc4760)). Only a single eBGP session is needed to advertise layer-2 reachability. No other protocols beneath BGP are involved and flood traffic is reduced to a minimum (no "flood-and-learn", no BUM traffic).
 
-Virtual routing permits multiple network paths without the need of multiple switches. Hence the servers are logically isolated by assigning their networks to dedicated virtual routers using virtual routing and forwarding (short: **VRF**).
+Virtual routing permits multiple network paths without the need for multiple switches. Hence the servers are logically isolated by assigning their networks to dedicated virtual routers using virtual routing and forwarding (short: **VRF**; see [Linux Virtual Routing and Forwarding](https://docs.kernel.org/networking/vrf.html) and [SONiC VRF support](https://github.com/sonic-net/SONiC/blob/master/doc/vrf/sonic-vrf-hld.md)).
 
-#### How do we use EVPN
+#### The operation of EVPN
 
 EVPN (technology) is based on BGP as control plane protocol (underlay) and VXLAN as data plane protocol (overlay).
 
 As EVPN is an overlay network, only the VXLAN Tunnel End Points (VTEPs) must be configured. In the case of two-tier CLOS networks leaf switches are tunnel endpoints.
 
-In EVPN routing is assumed to occur in the context of a VRF. VRF enables true multitenancy. Therewith, VRF is the first step for EVPN configuration and there is a 1:1 relationship between tenant and VRF.
+As described earlier, a dedicated VRF is used for each new tenant. VRF enables true multi-tenancy by isolating routing tables. This is why the same IP addresses or networks can be used by different tenants without collisions or conflicts.
+
+In EVPN, routing is assumed to occur in the context of a VRF. The VRF is therefore the first step of the EVPN configuration, and there is a 1:1 relationship between tenant and VRF.
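
The isolation of routing tables can be illustrated with a small Python sketch. The `Vrf` class is a hypothetical stand-in for a per-tenant routing table, and `vrf3982` as well as the prefixes and next-hop names are assumptions for the example.

```python
from ipaddress import ip_address, ip_network


class Vrf:
    """Hypothetical stand-in for the routing table of one tenant."""

    def __init__(self, name):
        self.name = name
        self.routes = {}  # network -> next hop description

    def add_route(self, prefix, next_hop):
        self.routes[ip_network(prefix)] = next_hop

    def lookup(self, address):
        """Longest-prefix match restricted to the table of this VRF."""
        addr = ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


tenant_a = Vrf("vrf3981")
tenant_b = Vrf("vrf3982")
tenant_a.add_route("10.0.0.0/24", "server-a")
tenant_b.add_route("10.0.0.0/24", "server-b")  # same prefix, different tenant

print(tenant_a.lookup("10.0.0.5"))  # server-a
print(tenant_b.lookup("10.0.0.5"))  # server-b
```

Because every lookup is confined to one table, the identical prefix resolves to different destinations per tenant without any conflict.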
 
 To enable layer-2 connectivity, we need a special interface to route between layer-2 networks. This interface is called Switched VLAN Interface (SVI). The SVI is realized with a VLAN. It is part of a VRF (layer-3).
 
 The VTEP configuration requires the setup of a VXLAN interface. A VLAN aware bridge interconnects the VXLAN interface and the SVI.
 
-Required Interfaces to establish the EVPN control plane:
+Required resources to establish the EVPN control plane:
 
 - VRF: because routing happens in the context of this interface.
 - SVI: because remote host routes for symmetric routing are installed over this interface.
@@ -169,7 +177,7 @@ Implementation of the network operation requires the data center infrastructure
 
 ### Physical Wiring
 
-Reference: See the CLOS overview picture in ./README.md.
+Reference: See the [CLOS overview picture](#CLOS)
 
 | Name                        | Wiring                                                                                        |
 | :-------------------------- | :-------------------------------------------------------------------------------------------- |
@@ -181,6 +189,7 @@ Reference: See the CLOS overview picture in ./README.md.
 | Management Server           | Jump-host to access all network switches within the CLOS topology for administrative purpose. |
 | Management Switch           | Connected to the management port of each of the network switches.                             |
 
+
 Tenant servers are organized into a layer called projects. In case those tenant servers require access to or from external networks, a new tenant server to function as a firewall is created. Leaf and spine switches form the fundament of the CLOS network to facilitate redundancy, resilience and scalability. Exit switches establish connectivity to or from external networks. Management Switch and Management Server are mandatory parts that build a management network to access the network switches for administration.
 
 To operate the CLOS topology, software defined configuration to enable BGP, VRF, EVPN and VXLAN must be set up.
@@ -308,7 +317,7 @@ Application of the route map `only-self-out` enables to announce only local ip(s
 
 To allow for peering between FRR and other routing daemons on a tenant server a `listen range` is specified to accept iBGP sessions on the network `10.244.0.0/16`. Therewith it gets possible that pods / containers like metal-lb with IPs of this range may peer with FRR.
 
-This is the only place where we use iBGP in our topology. For local peering this has the advantage, that we don't need an additional ASN that has to be handled / pruned in the AS-path of routes. Routes coming from other routing daemons look as if they are configured on the tenant server's lo interface from the viewpoint of the leaves. iBGP routes are differently handled than eBGP routes in BGPs best path algorithm. Generally BGP has the rule to prefer eBGP routes over iBGP routes (s. ['eBGP over iBGP'](https://medium.com/netdevops/how-bgp-best-path-selection-works-80e6e7b2da2b) ). BGP adds automatically an weight based on the route type. To overcome this issue, we set the weight of iBGP routes to the same weight that eBGP routes have, namely 32768 (`set weight 32768`). Without this configuration we will only get a single route to the IPs announced via iBGP. So this setting is essential for HA/failover!
+This is the only place where we use iBGP in our topology. For local peering this has the advantage that we don't need an additional ASN that has to be handled / pruned in the AS-path of routes. Routes coming from other routing daemons look as if they are configured on the tenant server's lo interface from the viewpoint of the leaves. iBGP routes are handled differently from eBGP routes in BGP's best path algorithm. Generally, BGP prefers eBGP routes over iBGP routes (see ['eBGP over iBGP'](https://medium.com/netdevops/how-bgp-best-path-selection-works-80e6e7b2da2b)). BGP automatically adds a weight based on the route type. To overcome this issue, we set the weight of iBGP routes to the same weight that eBGP routes have, namely 32768 (`set weight 32768`). Without this configuration we will only get a single route to the IPs announced via iBGP. So this setting is essential for HA/failover!
 
 Statistics of the established BGP session can be viewed locally from the tenant server via: `sudo vtysh -c 'show bgp ipv4 unicast'`
 
@@ -343,8 +352,6 @@ iface swp1
 
 There is a VRF definition `iface vrf3981` to create a distinct routing table and a section `vrf vrf3981` that enslaves swp1 (connects the tenant server) into the VRF. Those host facing ports are also called `edge ports`.
 
-Unfortunately, due to a kernel bug, IPv6 is not reliably enabled, so it is enforced explicitly via `post-up sysctl -w net.ipv6.conf.swp1.disable_ipv6=0`. If this `post-up` trigger is missing the LLA of the interface might be absent.
-
 Additional to the VRF definition the leaf must be configured to provide and connect a VXLAN interface to establish a VXLAN tunnel. This network virtualization begins at the leaves. Therefore, the leaves are also called Network Virtualization Edges (NVEs). The leaves encapsulate and decapsulate VXLAN packets.
 
 ```bash