Saturday, April 24, 2021

Rough guide to upgrading k8s cluster w/ kubeadm

This is not the best way, just a way that works for me given the cluster topology I have (which was installed using kubeadm on Ubuntu, and includes a non-HA etcd running in-cluster).

On the control plane / master node:
1) Backup etcd (manually). You might need the info from the etcd pod (`kubectl -n kube-system describe po etcd-kmaster`) to find the various certs/keys/etc, but really they're probably just at /etc/kubernetes/pki/etcd/
kubectl exec -n kube-system etcd-kmaster -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --key=/etc/kubernetes/pki/etcd/server.key --cert=/etc/kubernetes/pki/etcd/server.crt snapshot save /var/lib/etcd/snapshot.db
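If you want to sanity-check the snapshot before moving on (a quick verification step, not something I originally bothered with), etcdctl can print its status; this assumes the etcd image's etcdctl defaults to the v3 API, which it does for the 3.4+ versions shipped with these k8s releases:
kubectl exec -n kube-system etcd-kmaster -- etcdctl snapshot status /var/lib/etcd/snapshot.db --write-out=table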
Backup important files locally (but really, these should also be backed up on a different server)
mkdir $HOME/backup
sudo cp -r /etc/kubernetes/pki/etcd $HOME/backup/
sudo cp /var/lib/etcd/snapshot.db $HOME/backup/$(date +%Y-%m-%d--%H-%M)-snapshot.db
sudo cp $HOME/kubeadm-init.yaml $HOME/backup
Figure out what we're going to upgrade to. Do NOT attempt to skip minor versions (i.e. go from 1.19 -> 1.20 -> 1.21, not straight from 1.19 -> 1.21)
sudo apt update
sudo apt-cache madison kubeadm
sudo kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:26:21Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}

I'm going to go from 1.19.6-00 to 1.20.6-00 because that's what's currently available (and then from 1.20.6-00 to 1.21.0-00)
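If you want to double-check that the target version actually shows up in apt before pinning it, a quick grep of the madison output does the trick (1.20.6 here is just my target):
sudo apt-cache madison kubeadm | grep 1.20.6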

Remove the hold on kubeadm, update it, then freeze it again.

sudo apt-mark unhold kubeadm
sudo apt-get install -y kubeadm=1.20.6-00
sudo apt-mark hold kubeadm
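To confirm the hold is back in place, apt-mark can list everything that's currently held:
sudo apt-mark showhold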
Make sure it worked
sudo kubeadm version

kubeadm version: &version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:26:21Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
Cordon and drain the master node (I've got a pod using local storage, so the extra --delete-local-data flag is necessary)
kubectl cordon kmaster
kubectl drain kmaster --ignore-daemonsets --delete-local-data
Check out the upgrade plan. I get two options: upgrade to the latest in the v1.19 series (1.19.10), or upgrade to the latest stable version (1.20.6)
sudo kubeadm upgrade plan
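If you're nervous about the apply step, kubeadm also has a dry-run mode that just prints what it would do without changing anything (I didn't use it, but it's handy to know about):
sudo kubeadm upgrade apply v1.20.6 --dry-run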
sudo kubeadm upgrade apply v1.20.6
Nothing else needed to be upgraded, so I saw
[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.20.6". Enjoy!
[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.
The nodes will still show v1.19.6, which is expected - the VERSION column reflects the kubelet, and we haven't upgraded that yet
kubectl get no
NAME        STATUS                     ROLES                  AGE    VERSION
kmaster     Ready,SchedulingDisabled   control-plane,master   128d   v1.19.6
kworker01   Ready                      <none>                 125d   v1.19.6
Now to upgrade kubelet and kubectl to the SAME version as kubeadm
sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y  kubelet=1.20.6-00 kubectl=1.20.6-00
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet.service
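A quick way to confirm the kubelet binary itself got bumped (the version in 'kubectl get no' can take a few seconds to catch up after the restart):
kubelet --version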
Now we should see the master node running the updated version
kubectl get no
NAME        STATUS                     ROLES                  AGE    VERSION
kmaster     Ready,SchedulingDisabled   control-plane,master   128d   v1.20.6
kworker01   Ready                      <none>                 125d   v1.19.6
Uncordon it, and make sure it shows 'Ready'.
Now drain the worker(s) and then repeat roughly the same process on the worker nodes (and yes, the --force is necessary because I'm running something that isn't set up correctly or playing nicely - I'm looking at you, operatorhub)
kubectl drain kworker01 --ignore-daemonsets --delete-local-data --force
On the worker node(s)
sudo apt-mark unhold kubeadm
sudo apt-get install -y kubeadm=1.20.6-00
sudo apt-mark hold kubeadm

sudo kubeadm upgrade node

sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y  kubelet=1.20.6-00 kubectl=1.20.6-00
sudo apt-mark hold kubelet kubectl

sudo systemctl daemon-reload
sudo systemctl restart kubelet.service
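I only have the one worker, but if you have several, the per-node steps are identical and script reasonably well. A rough sketch, assuming passwordless SSH and sudo to each worker and using hypothetical node names - just remember to drain each node from the master first and uncordon it afterwards:

# run the same worker upgrade steps on each node over ssh (kworker02 is a hypothetical second worker)
for node in kworker01 kworker02; do
  ssh "$node" '
    sudo apt-mark unhold kubeadm &&
    sudo apt-get install -y kubeadm=1.20.6-00 &&
    sudo apt-mark hold kubeadm &&
    sudo kubeadm upgrade node &&
    sudo apt-mark unhold kubelet kubectl &&
    sudo apt-get install -y kubelet=1.20.6-00 kubectl=1.20.6-00 &&
    sudo apt-mark hold kubelet kubectl &&
    sudo systemctl daemon-reload &&
    sudo systemctl restart kubelet.service
  '
done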
Back on the master node, we should be able to get the nodes and see that the worker is upgraded. Since it is, we can uncordon it, and it should switch to 'Ready'
kubectl get no
NAME        STATUS                     ROLES                  AGE    VERSION
kmaster     Ready                      control-plane,master   128d   v1.20.6
kworker01   Ready                      <none>                 125d   v1.20.6
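(I glossed over the uncordon commands above; for the record they're just these, using my node names)
kubectl uncordon kmaster
kubectl uncordon kworker01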
That's it! Rinse and repeat for 1.21 once the entire cluster is on 1.20

Thursday, April 1, 2021

Mysql connection error

This was a mildly interesting one. I run some applications on my laptop that talk to a k8s cluster in my office; that cluster includes a mysql instance. The main application started failing with the common "The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server" error. The app had been running earlier today. When debugging something like this, the first step is always the logs.
  kubectl logs mysql-57f577f4b9-gvtlz
Lo and behold, a bunch of suspicious errors:
2021-03-10T02:49:19.349769Z 149591 [ERROR] Disk is full writing './mysql-bin.000015' (Errcode: 15781392 - No space left on device). Waiting for someone to free space...
2021-03-10T02:49:19.349823Z 149591 [ERROR] Retry in 60 secs. Message reprinted in 600 secs
2021-03-10T02:58:46.658696Z 151120 [ERROR] Disk is full writing './mysql-bin.~rec~' (Errcode: 15781392 - No space left on device). Waiting for someone to free space...
2021-03-10T02:58:46.658728Z 151120 [ERROR] Retry in 60 secs. Message reprinted in 600 secs
2021-03-10T02:59:19.352777Z 149591 [ERROR] Disk is full writing './mysql-bin.000015' (Errcode: 15781392 - No space left on device). Waiting for someone to free space...
2021-03-10T02:59:19.354093Z 149591 [ERROR] Retry in 60 secs. Message reprinted in 600 secs
2021-03-10T03:04:46.886946Z 151120 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 61, event_type: 34
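Before blaming the volume outright, a quick df inside the pod confirms it (assuming the image has df, which the standard mysql images do):
  kubectl exec mysql-57f577f4b9-gvtlz -- df -h /var/lib/mysql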
Looks like the bin logs have finally filled up the volume. Unfortunately, I created that pod with a rather small PVC, and since I'm using OpenEBS, it won't easily resize. What to do? Log into the instance and clean out the logs...
  kubectl exec -it mysql-57f577f4b9-gvtlz -- /bin/sh
  rm /var/lib/mysql/mysql-bin*
Problem solved! (well, temporarily, until they fill up again)
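A slightly gentler option for next time would be to let MySQL purge its own binlogs rather than rm'ing the files out from under it - something like this (credentials are whatever your instance uses):
  kubectl exec -it mysql-57f577f4b9-gvtlz -- mysql -uroot -p -e "PURGE BINARY LOGS BEFORE NOW();"
Setting expire_logs_days (or binlog_expire_logs_seconds on MySQL 8.0) in the server config would keep them from piling up in the first place.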