Load Balancers
Note 1: You will see references to "Mikrotik" in these configurations. This redesign started with a Mikrotik as the core router, but it turns out that the EdgeRouter does a better job at ECMP routing right out of the box. See below for more information about ECMP. I decided not to go through the effort of changing the label on all the configurations. 😉
Note 2: The load balancer configuration in the example is using RFC-1918 unroutable private address space. In a real-world example, these would be actual publicly routed addresses. However, since this demo is living in my basement at the end of consumer cable modem service, we are sticking with RFC-1918 addresses.
Configuration
MetalLB
Kubernetes is wonderful for orchestrating your workloads, but out of the box, exposing those workloads to the outside world currently isn't very elegant. There are two main ways to expose the workloads: load balancers and ingress controllers. On this page, we will discuss one load balancer solution.
MetalLB is a Kubernetes load balancer solution. In this demo, it was installed into MicroK8s via the MicroK8s addons. I also decided to leave the network addon at the MicroK8s default, which is Calico. MetalLB runs in two modes: Layer 2 and BGP (Layer 3). For this cluster, we will be running it in BGP mode, in which it advertises routes for the load balancer addresses to upstream BGP routers. In our case, we will be peering with the Ubiquiti EdgeRouter core router we discussed on another page.
Since the last iteration of this document, MetalLB has changed its configuration approach from config maps to CustomResourceDefinitions (CRDs). A CRD is a powerful Kubernetes feature that allows users to extend the Kubernetes API by defining their own custom resource types. MetalLB in our example is configured using the following six resources: L2Advertisement, BGPPeer, Community, IPAddressPool, BFDProfile, and BGPAdvertisement.
The first of these, L2Advertisement, is empty in this demo since we aren't using Layer 2 mode anywhere.
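For reference, if Layer 2 mode were in use, a minimal L2Advertisement might look something like the sketch below. The resource name is made up, and it assumes the mikrotik address pool defined further down:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-example             # illustrative name, not part of this demo
  namespace: metallb-system
spec:
  ipAddressPools:
  - mikrotik                   # announce addresses from this pool via Layer 2 (ARP/NDP)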
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: mikrotik
  namespace: metallb-system
spec:
  myASN: 64500
  peerASN: 64500
  peerAddress: 192.168.201.1
In this case, we are peering with a single upstream router. If we wanted more redundancy, we could add a second router in parallel (see the sketch below). This syntax is actually pretty compact: this one resource defines three peering sessions (one from each node in the cluster), which is much easier than maintaining a separate configuration for each node and having to modify it when nodes are added or removed. In this particular case, the peer address is the router's end of the point-to-point Ethernet links to the nodes. In a real production environment, it would likely point to an address on a loopback interface on the router, but since this configuration only has one router, the overhead of configuring everything for a loopback address wasn't worth it.
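Adding that second router would only take one more BGPPeer resource along these lines; the name and peer address here are hypothetical:

apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: second-router          # hypothetical name for a redundant upstream router
  namespace: metallb-system
spec:
  myASN: 64500
  peerASN: 64500
  peerAddress: 192.168.201.2   # hypothetical address of the second router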
The Community resource is also empty, due to the simplicity of this demo configuration. Here's a page explaining the use of community tags in a more complex BGP configuration.
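For reference, here is a sketch of what a populated Community resource might look like. The resource name and the alias are made up for illustration; the value is the well-known NO_EXPORT community:

apiVersion: metallb.io/v1beta1
kind: Community
metadata:
  name: communities-example    # illustrative name, not part of this demo
  namespace: metallb-system
spec:
  communities:
  - name: no-export            # alias a BGPAdvertisement could reference by name
    value: "65535:65281"       # well-known NO_EXPORT community value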
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: mikrotik
  namespace: metallb-system
spec:
  addresses:
  - 192.168.101.1-192.168.101.254
In this resource, we define an address pool named mikrotik containing the usable range of the 192.168.101.0/24 subnet.
The BFDProfile resource is empty in this demo as well. Bidirectional Forwarding Detection (BFD) is a network protocol designed to provide fast detection of failures in the forwarding path between two routers. I decided to pass on this configuration given the simplicity of the setup; I may come back and revisit it later.
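If BFD were enabled, the profile might look something like the sketch below. The name and timer values are illustrative, and a BGPPeer would opt in by referencing the profile via a bfdProfile field in its spec:

apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: fast-failover          # illustrative name, not part of this demo
  namespace: metallb-system
spec:
  receiveInterval: 300         # illustrative timers, in milliseconds
  transmitInterval: 300
  detectMultiplier: 3          # declare the session down after 3 missed packets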
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: mikrotik
  namespace: metallb-system
This is the minimal configuration for the BGPAdvertisement resource. In this case, it uses all of the BGP configuration defaults. It also uses all of the address pools defined in MetalLB. See IPAddressPool above.
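For comparison, here is a sketch of a more explicit BGPAdvertisement that overrides a couple of those defaults. The resource name and the localPref value are illustrative and not part of this demo:

apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: mikrotik-explicit      # illustrative name
  namespace: metallb-system
spec:
  ipAddressPools:
  - mikrotik                   # limit the advertisement to this pool
  localPref: 100               # local preference attached to the routes (iBGP only)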
Ubiquiti EdgeRouter
The EdgeRouter (EdgeOS) uses a configuration format that traces back to gated. Vyatta, VyOS, EdgeOS, and JunOS all have roots that trace back to gated, hence the commonality in the formats.
interfaces {
    switch switch0 {
        vif 201 {
            address 192.168.201.1/24
            description "VLAN 201"
            ip {
            }
        }
    }
}
protocols {
    bgp 64500 {
        maximum-paths {
            ibgp 8
        }
        neighbor 192.168.201.101 {
            remote-as 64500
        }
        neighbor 192.168.201.102 {
            remote-as 64500
        }
        neighbor 192.168.201.103 {
            remote-as 64500
        }
        parameters {
            router-id 192.168.201.1
        }
    }
}
This example only contains the parts of the configuration that specifically relate to the load balancer setup. This router has its last three ports configured as a switch chip, hence the "switch" stanza. The Kubernetes nodes are VMs living on 802.1Q VLAN 201, which is defined by the "vif" stanza. The protocols stanza sets the local AS to 64500 and defines the neighbors to peer with (each of the VMs hosting the Kubernetes nodes). Since each remote peer is also in AS 64500, we're running iBGP in this situation. Finally, the "maximum-paths" stanza declares that the router will maintain up to eight parallel active paths to a destination when their costs are equal; in our case, there will only be three.
Kubernetes Manifests
Although we have four workloads defined in this demo cluster, we only have two that are exposed via load balancers. The manifests that create load balancers for these two workloads are below.
whoami-service
apiVersion: v1
kind: Service
metadata:
  name: whoami-service
spec:
  selector:
    app: whoami
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer
The service definition above creates a load balancer for a service named "whoami-service" that maps to the application "whoami". This is a simple service that returns all of the information available about an incoming HTTP(S) request. The service runs on port 80 in the container and is exposed on port 80 on the load balancer.
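As an aside, MetalLB can also be told which pool to draw an address from on a per-service basis via an annotation. The sketch below shows the same service with that annotation added; it isn't part of this demo, and the annotation name is the one in MetalLB's documentation, so it's worth verifying against the installed version:

apiVersion: v1
kind: Service
metadata:
  name: whoami-service
  annotations:
    metallb.universe.tf/address-pool: mikrotik   # ask MetalLB for an address from this pool
spec:
  selector:
    app: whoami
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer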
gstreamer-service
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
  type: LoadBalancer
This service definition creates a load balancer for a service called "nginx" that maps to an app named "nginx". The TCP service (a web server) runs on port 8080 in the container and is exposed on port 8080 on the load balancer.
Results
ubnt@EdgeRouter-PoE-5-Port:~$ show ip route summary
IP routing table name is Default-IP-Routing-Table(0)
IP routing table maximum-paths : 8
Total number of IPv4 routes : 14
Total number of IPv4 paths : 16
Route Source    Networks
connected       4
static          1
ospf            7
bgp             2
Total           14
FIB             9
ECMP statistics:
---------------------------------
Total number of IPv4 ECMP routes : 1
Total number of IPv4 ECMP paths : 3
Number of routes with 3 ECMP paths: 1
The above output shows that the router is learning 2 BGP routes, which is what we expect given the 2 services defined. It also shows 1 ECMP route comprising 3 paths. That also checks out, since one of the services is backed by a DaemonSet and is running on all 3 nodes in the cluster.
ubnt@EdgeRouter-PoE-5-Port:~$ show ip bgp
BGP table version is 4, local router ID is 192.168.201.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, l - labeled
S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
    Network            Next Hop          Metric  LocPrf  Weight  Path
*>i 192.168.101.1/32   192.168.201.101   0       0       0       i
* i                    192.168.201.103   0       0       0       i
* i                    192.168.201.102   0       0       0       i
*>i 192.168.101.3/32   192.168.201.103   0       0       0       i
Total number of prefixes 2
The above output shows more specifics about the running BGP state. All of the routes learned are listed as internal; since the peers are all in the same ASN, the routes are being learned via internal BGP (iBGP). All the routes are also listed as "valid," meaning they are all available to be selected as a path to the destination network. The "Next Hop" column lists the paths for each of the networks being advertised. It shows 192.168.101.1 with 192.168.201.10[123] as the next hops, which correlates with the ECMP configuration and the fact that the service is available on all 3 nodes. It then shows 192.168.101.3 reachable via a single node. This is correct, since the workload being advertised is associated with the nginx container of the gstreamer service and is configured for only one replica.
Proper routing hygiene would have the router aggregate the upstream advertisements into a /24, but since I can currently count the number of /32 routes on one hand, I've left that configuration out. Probably something I will clean up on a second pass.
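An alternative (not used here) would be to let MetalLB roll the routes up itself: the BGPAdvertisement resource has an aggregationLength field for exactly this. A sketch, assuming the mikrotik pool above; the resource name is illustrative:

apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: mikrotik-aggregate     # illustrative name, not part of this demo
  namespace: metallb-system
spec:
  ipAddressPools:
  - mikrotik
  aggregationLength: 24        # advertise the covering /24 instead of per-service /32s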