
I have set up a Kubernetes cluster with 2 master nodes (cp01 192.168.1.42, cp02 192.168.1.46) and 4 worker nodes, with haproxy and keepalived running as static pods in the cluster and a stacked (internal) etcd cluster. For some silly reason, I accidentally ran kubeadm reset -f on cp01. Now I am trying to rejoin the cluster using the kubeadm join command, but I keep getting dial tcp 192.168.1.49:8443: connect: connection refused, where 192.168.1.49 is the load-balancer IP. Please help! Below are the current configurations.

/etc/haproxy/haproxy.cfg on cp02

defaults
    timeout connect 10s
    timeout client 30s
    timeout server 30s

frontend apiserver
    bind *:8443
    mode tcp
    option tcplog
    default_backend apiserver

backend apiserver
    option httpchk GET /healthz
    http-check expect status 200
    mode tcp
    option ssl-hello-chk
    balance roundrobin
    default-server inter 10s downinter 5s rise 2 fall 2 slowstart 60s maxconn 250 maxqueue 256 weight 100
    #server master01 192.168.1.42:6443 check    # <-- the one I accidentally reset
    server master02 192.168.1.46:6443 check
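Side note: the config can be syntax-checked on cp02 before the static pod picks it up, with something like the following (having the haproxy binary on the host, and the image tag in the second variant, are assumptions on my part):

    # if the haproxy binary is installed on the host
    haproxy -c -f /etc/haproxy/haproxy.cfg

    # or via the official image (tag is just an example)
    docker run --rm -v /etc/haproxy/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro \
        haproxy:2.2 haproxy -c -f /usr/local/etc/haproxy/haproxy.cfg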

/etc/keepalived/keepalived.conf on cp02

global_defs {
    router_id LVS_DEVEL
    script_user root
    enable_script_security
    dynamic_interfaces
}
vrrp_script check_apiserver {
    script "/etc/keepalived/check_apiserver.sh"
    interval 3
    weight -2
    fall 10
    rise 2
}
vrrp_instance VI_l {
    state BACKUP
    interface ens192
    virtual_router_id 51
    priority 101
    authentication {
        auth_type PASS
        auth_pass ***
    }
    virtual_ipaddress {
        192.168.1.49/24
    }
    track_script {
        check_apiserver
    }
}
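For completeness, /etc/keepalived/check_apiserver.sh is essentially the standard script from the kubeadm HA guide with my VIP and port substituted (reproduced roughly from that guide, so it may differ slightly from what is actually on the node):

    #!/bin/sh
    # Exit non-zero so keepalived lowers this node's priority when the check fails.
    errorExit() {
        echo "*** $*" 1>&2
        exit 1
    }

    # Check the apiserver through the local haproxy frontend first...
    curl --silent --max-time 2 --insecure https://localhost:8443/ -o /dev/null \
        || errorExit "Error GET https://localhost:8443/"
    # ...and through the VIP if this node currently holds it.
    if ip addr | grep -q 192.168.1.49; then
        curl --silent --max-time 2 --insecure https://192.168.1.49:8443/ -o /dev/null \
            || errorExit "Error GET https://192.168.1.49:8443/"
    fi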

cluster kubeadm-config

apiVersion: v1
data:
  ClusterConfiguration: |
    apiServer:
      extraArgs:
        authorization-mode: Node,RBAC
      timeoutForControlPlane: 4m0s
    apiVersion: kubeadm.k8s.io/v1beta2
    certificatesDir: /etc/kubernetes/pki
    clusterName: kubernetes
    controlPlaneEndpoint: 192.168.1.49:8443
    controllerManager: {}
    dns:
      type: CoreDNS
    etcd:
      local:
        dataDir: /var/lib/etcd
    imageRepository: k8s.gcr.io
    kind: ClusterConfiguration
    kubernetesVersion: v1.19.2
    networking:
      dnsDomain: cluster.local
      podSubnet: 10.244.0.0/16
      serviceSubnet: 10.96.0.0/12
    scheduler: {}
  ClusterStatus: |
    apiEndpoints:
      cp02:
        advertiseAddress: 192.168.1.46
        bindPort: 6443
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: ClusterStatus
...
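For reference, the dump above was taken with something like:

    kubectl -n kube-system get configmap kubeadm-config -o yaml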

kubectl cluster-info

Kubernetes master is running at https://192.168.1.49:8443
KubeDNS is running at https://192.168.1.49:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

More Info

  1. The cluster was initialised with --upload-certs on cp01.

  2. I drained and deleted cp01 from the cluster (roughly the sequence sketched after this list).

  3. kubeadm join --token ... --discovery-token-ca-cert-hash ... --control-plane --certificate-key ... command returned:

    error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Get "https://192.168.1.49:8443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": dial tcp 192.168.1.49:8443: connect: connection refused 
  4. kubectl exec -n kube-system -it etcd-cp02 -- etcdctl --endpoints=https://192.168.1.46:2379 --key=/etc/kubernetes/pki/etcd/peer.key --cert=/etc/kubernetes/pki/etcd/peer.crt --cacert=/etc/kubernetes/pki/etcd/ca.crt member list returned:

    ..., started, cp02, https://192.168.1.46:2380, https://192.168.1.46:2379, false 
  5. kubectl describe pod/etcd-cp02 -n kube-system:

    ...
    Container ID:   docker://...
    Image:          k8s.gcr.io/etcd:3.4.13-0
    Image ID:       docker://...
    Port:           <none>
    Host Port:      <none>
    Command:
      etcd
      --advertise-client-urls=https://192.168.1.46:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --initial-advertise-peer-urls=https://192.168.1.46:2380
      --initial-cluster=cp01=https://192.168.1.42:2380,cp02=https://192.168.1.46:2380
      --initial-cluster-state=existing
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.46:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://192.168.1.46:2380
      --name=cp02
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    ...
  6. Tried copying the certs to cp01:/etc/kubernetes/pki before running kubeadm join 192.168.1.49:8443 --token ... --discovery-token-ca-cert-hash ..., but it returned the same error (the copy itself is sketched after this list).

    # files copied over to cp01
    ca.crt
    ca.key
    sa.key
    sa.pub
    front-proxy-ca.crt
    front-proxy-ca.key
    etcd/ca.crt
    etcd/ca.key
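The drain/delete in step 2 was roughly the following (reconstructed from memory, so the exact flags may have differed). The etcd member for cp01 was already gone by the time I ran the member list in step 4; if it had still been listed, removing it by ID would have been the next step:

    # run on cp02
    kubectl drain cp01 --ignore-daemonsets --delete-local-data
    kubectl delete node cp01

    # only needed if the old member still shows up in 'member list';
    # <MEMBER_ID> is the hex ID from that output (placeholder here)
    kubectl exec -n kube-system -it etcd-cp02 -- etcdctl \
      --endpoints=https://192.168.1.46:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/peer.crt \
      --key=/etc/kubernetes/pki/etcd/peer.key \
      member remove <MEMBER_ID>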
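The cert copy in step 6 was along these lines, run from cp02 (root SSH access between the nodes is an assumption here). Alternatively, re-running the upload-certs phase on cp02 prints a fresh --certificate-key so no manual copying is needed at all:

    # make sure the target directory exists on cp01 after the reset
    ssh root@192.168.1.42 mkdir -p /etc/kubernetes/pki/etcd

    # shared CA / service-account material
    scp /etc/kubernetes/pki/ca.crt /etc/kubernetes/pki/ca.key \
        /etc/kubernetes/pki/sa.key /etc/kubernetes/pki/sa.pub \
        /etc/kubernetes/pki/front-proxy-ca.crt /etc/kubernetes/pki/front-proxy-ca.key \
        root@192.168.1.42:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/etcd/ca.crt /etc/kubernetes/pki/etcd/ca.key \
        root@192.168.1.42:/etc/kubernetes/pki/etcd/

    # or instead: generate a new certificate key for kubeadm join --control-plane
    kubeadm init phase upload-certs --upload-certs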

Troubleshoot network

  1. Able to ping 192.168.1.49 on cp01

  2. nc -v 192.168.1.49 8443 on cp01 returned Ncat: Connection refused.

  3. curl -k https://192.168.1.49:8443/api/v1... works from cp02 and the worker nodes (returns HTTP 403, which should be normal for an anonymous request).

  4. /etc/cni/net.d/ is removed on cp01

  5. Manually cleared iptables rules containing 'KUBE' or 'cali' on cp01 (roughly as sketched after this list).

  6. firewalld is disabled on both cp01 and cp02.

  7. I tried joining with a new server cp03 192.168.1.48 and encountered the same dial tcp 192.168.1.49:8443: connect: connection refused error.

  8. netstat -tlnp | grep 8443 on cp02 returned:

    tcp        0      0 0.0.0.0:8443        0.0.0.0:*           LISTEN      27316/haproxy
  9. nc -v 192.168.1.46 6443 on cp01 and cp03 returns:

    Ncat: Connected to 192.168.1.46:6443 
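The iptables cleanup in step 5 was done on cp01 roughly like this (reconstructed, so the exact filtering may have differed):

    # drop every saved rule and chain referencing KUBE-* or cali-*
    iptables-save | grep -v KUBE | grep -v cali | iptables-restore
    # flush the standard tables as well
    iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X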

Any advice/guidance would be greatly appreciated as I am at a loss here. I'm thinking it might be due to the network rules on cp02 but I don't really know how to check this. Thank you!!

1 Answer

Figured out what the issue was when I ran ip a: ens192 on cp01 still had the secondary IP address 192.168.1.49 assigned, presumably left behind when kubeadm reset removed the keepalived static pod. Because cp01 held the VIP locally, its own connections to 192.168.1.49:8443 terminated on cp01 itself, where nothing listens on 8443, instead of reaching haproxy on cp02, hence the connection refused (and cp01 was likely still answering ARP for the VIP, which would explain cp03 hitting the same error).

Simply running ip addr del 192.168.1.49/24 dev ens192 followed by kubeadm join ... let cp01 rejoin the cluster successfully. Can't believe I missed that...
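In case it helps anyone else, the whole check and fix on cp01 boils down to this (interface name and VIP are from my setup; the elided join arguments are the usual token, CA cert hash, and certificate key):

    # the stale VIP was still assigned here
    ip a show ens192 | grep 192.168.1.49

    # remove it, then the control-plane join works again
    ip addr del 192.168.1.49/24 dev ens192
    kubeadm join 192.168.1.49:8443 --token ... --discovery-token-ca-cert-hash ... \
        --control-plane --certificate-key ...

Presumably the commented-out master01 backend in haproxy.cfg on cp02 should also be re-enabled once cp01 is healthy again, so it actually receives API traffic through the load balancer.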
