MetalLB the Cattle Way

So you want to deploy Kubernetes in your lab environment. You have your cluster stood up, and now it's time to deploy your ingress controllers. You get a warning that says “Waiting for Service IP” in your Helm deployment. You can go in and modify the values to expose the ingress as a NodePort rather than a LoadBalancer service, but this is less than ideal for a couple of reasons. First and foremost, you are shifting the failure domain to something outside the cluster that you have no control over. If you run keepalived you might be able to get past this hurdle, but you are still not embracing the true DevOps methodology of treating your nodes as cattle rather than pets.
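
On bare metal, that warning shows up as a Service of type LoadBalancer whose external IP never gets assigned. The output below is purely illustrative (the service name and addresses are made up), but the <pending> column is the symptom in question:

kubectl get svc my-ingress
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
my-ingress   LoadBalancer   10.96.112.30   <pending>     80:31380/TCP   12m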

How do we solve this issue? If you are with a cloud provider, you might just throw an Application Load Balancer in front of the service in EKS. Unfortunately, this is the state of Kubernetes: it's amazing for managing all the self-hosted applications you want to deploy, but unless you are on a cloud provider, you are a second-class citizen.

Enter MetalLB, but with some caveats.

Most of the deployments I've seen on the internet use the Layer 2 functionality, but this has some limitations, namely that it is not dynamic when hosts change; for contrast, a Layer 2 pool looks like the snippet below.
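
For reference, a Layer 2 pool in the same legacy config-map format looks something like this; the address range is just a placeholder, and everything else about the deployment stays the same:

address-pools:
  - name: default
    protocol: layer2
    addresses:
      - 192.168.1.240-192.168.1.250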

Enter MetalLB with BGP. This lets you have a truly dynamic setup, just like the big boy cloud providers. Unfortunately, in my research to get it deployed, I saw lots of guides on getting it to work with Unifi Edge Routers, but nothing really on getting it to work with a Cisco Nexus or an Arista switch.

Deploying MetalLB:

In this section, I am making some assumptions: You have an existing k8s cluster, you have a basic understanding of routing, and you know how to pull logs and pod statuses.

Deploying MetalLB to your Kubernetes cluster is relatively simple. In my case I'm using the kubectl method of install, with the configuration supplied by a config map.

In this setup the speakers, which are doing the actual work, pull their config off a config map that is applied to the cluster.

This is the map I'm using:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
      - peer-address: 10.4.4.1
        peer-asn: 65101
        my-asn: 65101
    address-pools:
      - name: default
        protocol: bgp
        addresses:
          - 10.8.4.0/23

Breaking it down, there are five fields that matter. peer-address is the router ID of the switch, detailed later; in this case it's 10.4.4.1. Next are the two ASNs; it is critical that we use ASNs delegated for private use, and 65101 works fine here. Finally, there is the definition of the address pool itself. My site has 10.4.0.0/14 allocated; to make life easier I set aside 10.8.0.0/16 as a supernet and use 10.8.4.0/23 as the pool for MetalLB to issue to Services and Ingresses.
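
To put the pool to use, a Service just needs to be of type LoadBalancer and MetalLB will hand it an address out of 10.8.4.0/23. The manifest below is only a sketch; the app name and ports are made up:

apiVersion: v1
kind: Service
metadata:
  name: my-app                 # placeholder name
spec:
  type: LoadBalancer           # MetalLB assigns the external IP for this type
  selector:
    app: my-app
  ports:
    - name: http
      port: 80
      targetPort: 8080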

Once you have tweaked the settings to your liking, apply it with your method of choice:

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.11.0/manifests/namespace.yaml

kubectl apply -f metallb-configmap.yaml

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.11.0/manifests/metallb.yaml

Breaking down what's happening here: we create the metallb-system namespace, then apply the config map, and finally deploy MetalLB itself, which gives us the controller and the speaker DaemonSet.
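
Before heading to the switch, it's worth confirming the pods came up. Assuming the stock manifests, you should see one controller pod plus one speaker pod per node:

kubectl get pods -n metallb-system -o wide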

On to setting up the switch…

Switch Side:

Switch I'm Using: Cisco Nexus 9000 C9396PX

I'm going to document the Nexus setup here, as in my attempts to get this working, it was the biggest hurdle to getting the deployment to come together. Again, I'm making some assumptions: you know the NX-OS command line well enough to break down config snippets, you are using your Nexus as your core router, and you have a basic understanding of routing as a whole.

If you haven't already, enable the BGP feature with feature bgp; this turns BGP support on for the switch.
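
From global configuration mode, that's just:

configure terminal
  feature bgp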

Now create a loopback interface; in my case it is loopback1 with an address of 10.4.4.1, which is the peer-address we set in the config map earlier.

interface loopback1
  description router_ID
  ip address 10.4.4.1/32

Now it's time to configure the actual BGP router on the switch...

router bgp 65101
  router-id 10.4.4.1
  timers bgp 15 45
  log-neighbor-changes
  address-family ipv4 unicast
    redistribute direct route-map allow
    maximum-paths 2
  neighbor 10.6.64.0/24
    remote-as 65101
    update-source loopback1
    address-family ipv4 unicast

This is the config I found that gets MetalLB to play ball with the Nexus. Let's break down what's happening. We declare a BGP router with an ASN of 65101, with the router-id coming from the address on the loopback interface we made earlier. The timers line tightens the keepalive and hold timers, and log-neighbor-changes logs peer state changes. Under the IPv4 unicast address family, redistribute direct route-map allow advertises the switch's directly connected networks into BGP (filtered through a route-map named allow), and maximum-paths 2 lets the switch install more than one equal-cost path to the same service address. We then instantiate the neighbor config; the thing that makes this truly dynamic is that instead of listing individual peers, we peer with the whole block my Kubernetes worker nodes live in, 10.6.64.0/24. This allows me to change the size of my cluster without ever touching the switch config. The other critical takeaway is that our peer ASN is the same as the ASN we instantiated the router with; this configuration is iBGP, meaning the announcements all take place inside a single ASN. Finally, update-source loopback1 sources the BGP sessions from our loopback interface, so they match the peer-address in the config map.
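
One thing the snippet above takes for granted is that a route-map named allow already exists on the switch; it is referenced by the redistribute line but not defined here. A minimal, wide-open version would look something like the line below, though in practice you may want match statements to control exactly what gets redistributed:

route-map allow permit 10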

If everything applied correctly, running sh ip bgp sum should give you output similar to this:

# sh ip bgp sum
BGP summary information for VRF default, address family IPv4 Unicast
BGP router identifier 10.4.4.1, local AS number 65101
BGP table version is 55, IPv4 Unicast config peers 10, capable peers 5
18 network entries and 26 paths using 4952 bytes of memory
BGP attribute entries [2/312], BGP AS path entries [0/0]
BGP community entries [0/0], BGP clusterlist entries [0/0]

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.6.64.11      4 65101    4201    4200       55    0    0 17:29:26 2         
10.6.64.12      4 65101    4207    4205       55    0    0 17:30:40 3         
10.6.64.13      4 65101    4205    4204       55    0    0 17:30:27 2         
10.6.64.14      4 65101    4208    4204       55    0    0 17:30:28 2         
10.6.64.15      4 65101    4205    4204       55    0    0 17:30:26 2         

If instead you have hosts stuck in Idle, go back and check for typos in your configuration; the commands below are a good place to start when figuring out which side is at fault.
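
On the cluster side, the speaker logs will usually tell you why a session isn't coming up, and on the switch side you can interrogate an individual neighbor. The label selector assumes the stock MetalLB manifests, and the neighbor address is one of the workers from the table above:

kubectl logs -n metallb-system -l component=speaker --tail=50

show bgp ipv4 unicast neighbors 10.6.64.11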

Conclusion

While BGP "load balancing" is not the same as what the big cloud providers are rolling out, it gets us close enough to enjoy the benefits of Kubernetes in a non-pet fashion. When I deploy MetalLB to the colo, I will update this post to include configs for VyOS and Arista.

Adam Smith
