
jacobcrawford for IT Minds


High availability Kubernetes cluster on bare metal - part 2

Last week we covered the theory of high availability in a bare-metal Kubernetes cluster, so this week is where the magic happens.

First of all, there are a few dependencies you need to have installed before you can initialize a Kubernetes cluster. Since this is not a guide on how to set up Kubernetes, I will assume that you have done this before; if not, you can use the same guide I used when installing Kubernetes for the first time: guide.

Whether or not you followed that guide, installing Kubernetes alongside Docker (or your favorite container runtime) also gives you the key Kubernetes tool kubeadm, which is what we will use to initialize the cluster. But first, we need to deal with the problems of high availability that we discussed last week.

The stable control plane IP

As mentioned, we will use a self-hosted solution where we set up a stable IP with HAProxy and Keepalived as pods inside the Kubernetes cluster. To achieve this, we will need to configure a few files for each master node:

  1. A keepalived configuration.
  2. A keepalived health check script.
  3. A manifest file for the keepalived static pod.
  4. An HAProxy configuration file.
  5. A manifest file for the HAProxy static pod.

Keepalived:

! /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
    router_id LVS_DEVEL
}
vrrp_script check_apiserver {
    script "/etc/keepalived/check_apiserver.sh"
    interval 3
    weight -2
    fall 10
    rise 2
}
vrrp_instance VI_1 {
    state ${STATE}
    interface ${INTERFACE}
    virtual_router_id ${ROUTER_ID}
    priority ${PRIORITY}
    authentication {
        auth_type PASS
        auth_pass ${AUTH_PASS}
    }
    virtual_ipaddress {
        ${APISERVER_VIP}
    }
    track_script {
        check_apiserver
    }
}

The configuration contains shell-style placeholders that we need to fill in, either manually or through scripting (a substitution sketch follows the list):

  1. STATE: MASTER for the node initializing the cluster, because it will also be the first one to host the virtual IP address of the control plane.
  2. INTERFACE: the network interface of the network the nodes use to communicate. For Ethernet connections this is often eth0, and it can be found with the command ifconfig on most Linux operating systems.
  3. ROUTER_ID: needs to be the same on all hosts. Often set to 51.
  4. PRIORITY: a unique number that decides which node should take over the virtual IP of the control plane in case the first MASTER node goes down. Often set to 100 for the node initializing the cluster, with decreasing values for the rest.
  5. AUTH_PASS: must be the same on all nodes. Often set to 42.
  6. APISERVER_VIP: the virtual IP for the control plane. Keepalived creates this address on whichever node currently holds the MASTER role.
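If you would rather script the substitution than edit by hand, a minimal sketch using envsubst (from GNU gettext) could look like this, assuming the configuration above is saved with its placeholders intact as keepalived.conf.template, a file name chosen just for this example:

# Example values for the node that initializes the cluster; adjust per node
export STATE=MASTER INTERFACE=eth0 ROUTER_ID=51 PRIORITY=100 \
       AUTH_PASS=42 APISERVER_VIP=192.168.1.50

# Substitute the ${...} placeholders and write the file keepalived expects
envsubst < keepalived.conf.template | sudo tee /etc/keepalived/keepalived.conf > /dev/null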

For the health check script we have the following:

#!/bin/sh

errorExit() {
    echo "*** $*" 1>&2
    exit 1
}

curl --silent --max-time 2 --insecure https://localhost:${APISERVER_DEST_PORT}/ -o /dev/null || errorExit "Error GET https://localhost:${APISERVER_DEST_PORT}/"
if ip addr | grep -q ${APISERVER_VIP}; then
    curl --silent --max-time 2 --insecure https://${APISERVER_VIP}:${APISERVER_DEST_PORT}/ -o /dev/null || errorExit "Error GET https://${APISERVER_VIP}:${APISERVER_DEST_PORT}/"
fi

We see the APISERVER_VIP placeholder again, which is the same as before. When variables repeat I will not repeat the explanation, so the only new variable is:

APISERVER_DEST_PORT, which is the front-end port on the virtual IP for the API server. This can be any unused port, e.g. 4200.
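The health check script can be rendered the same way; it also needs to be executable so keepalived can run it. As before, the template name check_apiserver.sh.template is just an example:

export APISERVER_VIP=192.168.1.50 APISERVER_DEST_PORT=4200
envsubst < check_apiserver.sh.template | sudo tee /etc/keepalived/check_apiserver.sh > /dev/null
sudo chmod +x /etc/keepalived/check_apiserver.sh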

Last, the manifest file for Keepalived:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: keepalived
  namespace: kube-system
spec:
  containers:
  - image: osixia/keepalived:1.3.5-1
    name: keepalived
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - NET_BROADCAST
        - NET_RAW
    volumeMounts:
    - mountPath: /usr/local/etc/keepalived/keepalived.conf
      name: config
    - mountPath: /etc/keepalived/check_apiserver.sh
      name: check
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/keepalived/keepalived.conf
    name: config
  - hostPath:
      path: /etc/keepalived/check_apiserver.sh
    name: check
status: {}

This creates a pod that uses the two configuration files.

HAProxy

We have one configuration file for HAProxy:

# /etc/haproxy/haproxy.cfg
#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    log /dev/log local0
    log /dev/log local1 notice
    daemon

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 1
    timeout http-request    10s
    timeout queue           20s
    timeout connect         5s
    timeout client          20s
    timeout server          20s
    timeout http-keep-alive 10s
    timeout check           10s

#---------------------------------------------------------------------
# apiserver frontend which proxys to the masters
#---------------------------------------------------------------------
frontend apiserver
    bind *:${APISERVER_DEST_PORT}
    mode tcp
    option tcplog
    default_backend apiserver

#---------------------------------------------------------------------
# round robin balancing for apiserver
#---------------------------------------------------------------------
backend apiserver
    option httpchk GET /healthz
    http-check expect status 200
    mode tcp
    option ssl-hello-chk
    balance roundrobin
    server ${HOST1_ID} ${HOST1_ADDRESS}:${APISERVER_SRC_PORT} check
    server ${HOST2_ID} ${HOST2_ADDRESS}:${APISERVER_SRC_PORT} check
    server ${HOST3_ID} ${HOST3_ADDRESS}:${APISERVER_SRC_PORT} check

Here, we plug in the control plane IPs. Assuming a 3-node cluster, we input a symbolic HOSTx_ID, which is just a unique name, for each node, as well as its HOSTx_ADDRESS. The APISERVER_SRC_PORT is by default 6443, the port where the API server listens for traffic.
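As an illustration, with three hypothetical hosts the server lines in the backend section might end up looking like this after substitution (the names and addresses are made up):

server master-1 192.168.1.141:6443 check
server master-2 192.168.1.142:6443 check
server master-3 192.168.1.143:6443 check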

The last file is the HAProxy manifest file:

apiVersion: v1
kind: Pod
metadata:
  name: haproxy
  namespace: kube-system
spec:
  containers:
  - image: haproxy:2.1.4
    name: haproxy
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: localhost
        path: /healthz
        port: ${APISERVER_DEST_PORT}
        scheme: HTTPS
    volumeMounts:
    - mountPath: /usr/local/etc/haproxy/haproxy.cfg
      name: haproxyconf
      readOnly: true
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/haproxy/haproxy.cfg
      type: FileOrCreate
    name: haproxyconf
status: {}

This is all we actually need to configure to get a cluster up and running. Some of these values are constants that must be the same on all three master nodes, some must vary between nodes, some you simply have to look up, and for some you have to make a decision.

Values sanity check

Let us do a quick sanity check of the variables and their typical values for each node.

Constants

ROUTER_ID=51
AUTH_PASS=42
APISERVER_SRC_PORT=6443

Variables to input

STATE
MASTER for the node that initializes the cluster, BACKUP for the two others.
PRIORITY
100 for the node that initializes the cluster, 99 and 98 for the two others.

Variables to retrieve

APISERVER_VIP
An IP within your network subnet. If your node has IP 192.168.1.140, this could be 192.168.1.50.
APISERVER_DEST_PORT
A port of your choosing. It must not conflict with other service ports.

INTERFACE
The network interface. Use ifconfig to find it.

HOSTX_ID
Any unique name for each of the 3 master nodes.

HOSTX_ADDRESS
The IP addresses of your machines, which can also be found with ifconfig on each machine (see the lookup sketch after this list).
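A quick way to look these up on each node, assuming a typical Linux install where either ip or ifconfig is available:

# Lists interfaces and their IPv4 addresses; the interface name (e.g. eth0)
# is your INTERFACE value and its address is that node's HOSTX_ADDRESS
ip -4 addr show
# or, on systems with net-tools installed
ifconfig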

Files

Now that the files are configured, they should be put in the right destinations so that kubeadm can find them when the cluster initializes.

The absolute file paths are:

/etc/keepalived/check_apiserver.sh
/etc/keepalived/keepalived.conf
/etc/haproxy/haproxy.cfg
/etc/kubernetes/manifests/keepalived.yaml
/etc/kubernetes/manifests/haproxy.yaml
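A minimal sketch for putting everything in place, assuming the rendered configuration files and the two manifests are sitting in your current directory:

sudo mkdir -p /etc/keepalived /etc/haproxy /etc/kubernetes/manifests
sudo cp keepalived.conf check_apiserver.sh /etc/keepalived/
sudo cp haproxy.cfg /etc/haproxy/
sudo cp keepalived.yaml haproxy.yaml /etc/kubernetes/manifests/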

Putting the manifest files into /etc/kubernetes/manifests/ is what does the magic here. Every manifest in this folder is started as a static pod by the kubelet when the cluster initializes. Even the control plane pods generated by kubeadm are placed here before the cluster comes up.
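Once the cluster is up, you can confirm that the kubelet picked up the two static pods; their names will carry the node's hostname as a suffix:

kubectl get pods -n kube-system | grep -E 'keepalived|haproxy'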

Initializing the cluster

When the files are in place, initializing the cluster is as simple as running the kubeadm init command with a few extra pieces of information.

kubeadm init --control-plane-endpoint "${APISERVER_VIP}:${APISERVER_DEST_PORT}" --upload-certs 

Will do the trick. The extra arguments tell the cluster that the control plane should not be contacted on any individual node's IP, but on the virtual IP address. When the other nodes join, this is what makes the cluster highly available: if the node currently hosting the virtual IP goes down, the virtual IP simply jumps to another available master node.
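You can watch this failover directly: only the node that currently holds the virtual IP will list it on its interface. Reusing the example interface and VIP from earlier:

# Run on each master node; only the current holder of the VIP prints a match
ip addr show eth0 | grep 192.168.1.50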

Last, join the other two nodes to the cluster with the join command output by kubeadm init.
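For reference, the control-plane join command printed by kubeadm init has roughly this shape; the token, CA certificate hash, and certificate key are specific to your cluster and must be copied from that output:

kubeadm join ${APISERVER_VIP}:${APISERVER_DEST_PORT} --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <key>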

If this even piqued your interest a little bit, you are in for a treat. The whole manual process is being eliminated in an open-source project right here. It is still a work in progress, but feel free to drop in and join the discussion.
