Build your own load balancer for your cluster

Load Balancers

Note 1: You will see references to "Mikrotik" in these configurations.  This redesign started with a MikroTik as the core router, but it turned out that the EdgeRouter does a better job at ECMP routing right out of the box.  See below for more information about ECMP.  I decided not to go through the effort of changing the label in all the configurations. 😉

Note 2: The load balancer configuration in this example uses RFC 1918 private (unroutable) address space. In a real-world deployment, these would be actual publicly routed addresses. However, since this demo lives in my basement at the end of a consumer cable modem connection, we are sticking with RFC 1918 addresses.

Configuration

MetalLB

Kubernetes is wonderful for orchestrating your workloads, but out of the box, exposing those workloads to the outside world isn't particularly elegant. There are two main ways to expose workloads: load balancers and ingress controllers. On this page, we will discuss one load balancer solution.

MetalLB is a load balancer implementation for bare-metal Kubernetes clusters. In this demo, it was installed into MicroK8s via the MicroK8s addons. I also decided to leave the network addon at the MicroK8s default, which is Calico.  MetalLB runs in two modes: Layer 2 and BGP (Layer 3). For this cluster, we will be running it in BGP mode, in which it advertises routes for service addresses to upstream BGP routers. In our case, we will be peering with the Ubiquiti EdgeRouter core router discussed on another page.

Since the last iteration of this document, MetalLB has changed its configuration approach from ConfigMaps to CustomResourceDefinitions (CRDs). A CRD is a Kubernetes feature that lets users extend the Kubernetes API with their own resource types. MetalLB in our example is configured using the following six custom resources:

L2Advertisement 

In this demo, this resource is left empty since we aren't using Layer 2 mode anywhere.
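For reference, if I were using Layer 2 mode, this resource would tie an address pool to ARP/NDP announcement. The sketch below is just an illustration of what that could look like (reusing the pool defined later on this page), not something deployed in this demo:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-example
  namespace: metallb-system
spec:
  # Announce addresses from this pool via ARP/NDP instead of BGP
  ipAddressPools:
  - mikrotik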

BGPPeer 

apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: mikrotik
  namespace: metallb-system
spec:
  myASN: 64500
  peerASN: 64500
  peerAddress: 192.168.201.1

In this case, we are peering with a single upstream router. If we wanted more redundancy, we could add a second router in parallel. The syntax is quite compact: this single resource defines three peering sessions, one from each node in the cluster, which is much easier than maintaining a separate configuration per node and modifying it whenever nodes are added or removed. In this particular case, the peer address is the router's end of the point-to-point Ethernet links to the nodes. In a real production environment, it would more likely point to an address on a loopback interface on the router. However, since this configuration only has one router, the overhead of setting everything up for a loopback address wasn't worth it.
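If I did add that second router, it would just be a second BGPPeer resource alongside the first. The name and peer address below are made up for illustration:

apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: mikrotik-2          # hypothetical second core router
  namespace: metallb-system
spec:
  myASN: 64500
  peerASN: 64500
  peerAddress: 192.168.201.2   # assumed address of the second router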


Community

This resource is empty due to the simplicity of this demo configuration.  Here's a page explaining the use of community tags in a more complex BGP configuration.
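For illustration only, a Community resource in a larger setup just assigns friendly names to BGP community values so that a BGPAdvertisement can refer to them by name. The sketch below uses the well-known no-advertise community and is not part of this demo:

apiVersion: metallb.io/v1beta1
kind: Community
metadata:
  name: communities
  namespace: metallb-system
spec:
  communities:
  # "no-advertise" is the well-known 65535:65282 community
  - name: no-advertise
    value: 65535:65282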

 
IPAddressPool

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: mikrotik
  namespace: metallb-system
spec:
  addresses:
  - 192.168.101.1-192.168.101.254

In this resource we define an address pool labeled mikrotik covering 192.168.101.1 through 192.168.101.254, i.e. the usable addresses of the 192.168.101.0/24 subnet.
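The same pool could also be written in CIDR notation. This variant is a sketch of an equivalent configuration rather than what is deployed here; note that the bare /24 would include the .0 and .255 addresses, which is what the avoidBuggyIPs flag is for:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: mikrotik
  namespace: metallb-system
spec:
  addresses:
  - 192.168.101.0/24
  # Skip .0 and .255 so this matches the explicit range above
  avoidBuggyIPs: true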

 
BFDProfile

This resource is empty in this demo. Bidirectional Forwarding Detection (BFD) is a network protocol designed to provide fast detection of failures in the forwarding path between two routers. I decided to pass on this configuration given the simplicity of the setup. I may come back and revisit this later.
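If I do revisit it, a BFDProfile is its own resource that a BGPPeer then references by name through its bfdProfile field. The timers below are placeholders, not tested values:

apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: fast-failover
  namespace: metallb-system
spec:
  # Placeholder timers in milliseconds; tune for the real network
  receiveInterval: 300
  transmitInterval: 300
  detectMultiplier: 3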

 
BGPAdvertisement

apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: mikrotik
  namespace: metallb-system

This is the minimal configuration for the BGPAdvertisement resource. With no spec, it uses all of the BGP advertisement defaults and applies to every address pool defined in MetalLB. See IPAddressPool above.
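If I later needed to scope the advertisement, the same resource can name specific pools and peers and tweak BGP attributes. The sketch below is illustrative only; the peer list and localPref value are assumptions, not part of the demo:

apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: mikrotik-scoped
  namespace: metallb-system
spec:
  # Limit this advertisement to one pool and one peer
  ipAddressPools:
  - mikrotik
  peers:
  - mikrotik
  localPref: 100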

Ubiquiti EdgeRouter

The EdgeRouter (EdgeOS) uses a configuration format that traces back to gated. Vyatta, VyOS, EdgeOS, and JunOS all have roots in gated, hence the commonality in their formats.

interfaces {
    switch switch0 {
        vif 201 {
            address 192.168.201.1/24
            description "VLAN 201"
            ip {
            }
        }
    }
}
protocols {
    bgp 64500 {
        maximum-paths {
            ibgp 8
        }
        neighbor 192.168.201.101 {
            remote-as 64500
        }
        neighbor 192.168.201.102 {
            remote-as 64500
        }
        neighbor 192.168.201.103 {
            remote-as 64500
        }
        parameters {
            router-id 192.168.201.1
        }
    }
}

This example only contains the parts of the configuration that relate specifically to the load balancer setup. This router has its last three ports configured on a switch chip, hence the "switch" stanza. In addition, the Kubernetes nodes are VMs living in 802.1Q VLAN 201, which is defined by the "vif" stanza. The protocols stanza sets the local AS to 64500 and defines the neighbors to peer with (each of the VMs hosting the Kubernetes nodes). Since each remote peer is also defined as being in AS 64500, we are running iBGP in this situation. Finally, the "maximum-paths" stanza declares that the router will maintain up to eight parallel active paths to a destination when their costs are equal. In our case, there will only be three.

Kubernetes Manifests

Although we have four workloads defined in this demo cluster, we only have two that are exposed via load balancers. The manifests that create load balancers for these two workloads are below.

whoami-service

apiVersion: v1
kind: Service
metadata:
  name: whoami-service
spec:
  selector:
    app: whoami
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer

The service definition above creates a load balancer for a service named "whoami-service" that maps to the application "whoami" (pods labeled app: whoami). whoami is a simple service that returns all of the incoming information available for an HTTP(S) request. The service runs on port 80 in the container and is exposed on port 80 on the load balancer.
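As an aside (not deployed in this demo), if I wanted the client source IP preserved and the route announced only from nodes actually running a whoami pod, the standard Kubernetes externalTrafficPolicy field on this same Service would do it:

apiVersion: v1
kind: Service
metadata:
  name: whoami-service
spec:
  selector:
    app: whoami
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer
  # Only nodes with a local whoami pod answer, and the client IP is preserved
  externalTrafficPolicy: Local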

gstreamer-service

apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
  type: LoadBalancer


This service definition creates a load balancer for a service called "nginx" that maps to pods labeled app: nginx, the web front end of the gstreamer workload. The TCP service (web server) runs on port 8080 in the container and is exposed on port 8080 on the load balancer.

Results


ubnt@EdgeRouter-PoE-5-Port:~$ show ip route summary 
IP routing table name is Default-IP-Routing-Table(0)
IP routing table maximum-paths   : 8
Total number of IPv4 routes      : 14
Total number of IPv4 paths       : 16
Route Source    Networks
connected       4
static          1
ospf            7
bgp             2
Total           14
FIB             9

ECMP statistics:
---------------------------------
 Total number of IPv4 ECMP routes   : 1
 Total number of IPv4 ECMP paths    : 3
 Number of routes with  3 ECMP paths: 1

The above output shows the router learning 2 BGP routes, which is what we expect given the 2 services defined. It also shows 1 ECMP route with 3 paths. This also checks out, given that one of the services is backed by a DaemonSet and is therefore running on all 3 nodes in the cluster.


ubnt@EdgeRouter-PoE-5-Port:~$ show ip bgp
BGP table version is 4, local router ID is 192.168.201.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, l - labeled
              S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

    Network          Next Hop            Metric    LocPrf       Weight Path
*>i 192.168.101.1/32 192.168.201.101      0        0            0        i
* i                  192.168.201.103      0        0            0        i
* i                  192.168.201.102      0        0            0        i
*>i 192.168.101.3/32 192.168.201.103      0        0            0        i

Total number of prefixes 2

The above output shows more detail about the running BGP state. All the routes learned are listed as internal because the nodes are in the same ASN and are therefore learned via internal BGP (iBGP). All the routes are also marked "valid," meaning each is available to be selected as a path to the destination network. The "Next Hop" column lists the paths for each network being advertised: 192.168.101.1 is reachable via 192.168.201.101, .102, and .103, which correlates with the ECMP configuration and the fact that the service is available on all 3 nodes. 192.168.101.3, on the other hand, shows a single next hop. This is correct, since that workload is the nginx container of the gstreamer service and is configured for only one replica.

Proper routing hygiene would have the router aggregate the upstream advertisement into a /24, but since I can currently count the number of /32 routes on one hand, I've left that configuration out. Probably something I will clean up on a second pass.
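One way to do that cleanup on the MetalLB side rather than on the router would be the aggregationLength knob on a BGPAdvertisement, which advertises the covering prefix instead of per-service /32s. This is a sketch of the idea, not something running in the demo:

apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: mikrotik-aggregated
  namespace: metallb-system
spec:
  ipAddressPools:
  - mikrotik
  # Advertise the covering /24 instead of individual /32 service routes
  aggregationLength: 24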



