Maintenance considerations
This section provides prerequisites and guidelines for draining Kubernetes workers dedicated to ThingPark Enterprise.
- ThingPark Enterprise 7.1.3 is required to correctly handle disruptions
- Backing up data before any maintenance operation is advised
Generic requirements
Maintenance on one node
- To maintain compute capacity, it is encouraged to scale up the worker group with a spare worker in the same Availability Zone as the node under maintenance.
- The spare worker must follow the sizing prerequisites.
Maintenance sequence example:
- Add a spare node to the cluster in the appropriate Availability Zone
- Add the taint and label to start scheduling ThingPark Enterprise workloads on the spare node
- Remove the `thingpark.enterprise.actility.com/nodegroup-name=tpe` label and drain the node
- Run maintenance operations on the node
- Re-taint and re-label the node once maintenance operations are done
- Remove the `thingpark.enterprise.actility.com/nodegroup-name=tpe` label, drain the spare worker, and finally remove it
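The sequence above can be sketched with kubectl. This is a non-authoritative sketch: the node names are placeholders, and the taint key/value is an assumption mirroring the label applied at installation time.

```shell
# Hypothetical sketch of the single-node maintenance sequence.
# <spareNode> and <nodeInMaintenance> are illustrative node names;
# the taint assumes the same key/value as the installation label.

# Dedicate the freshly added spare node to ThingPark Enterprise
kubectl taint node <spareNode> thingpark.enterprise.actility.com/nodegroup-name=tpe:NoSchedule
kubectl label node <spareNode> thingpark.enterprise.actility.com/nodegroup-name=tpe

# Remove the label from the node to maintain (dash suffix deletes a label), then drain it
kubectl label node <nodeInMaintenance> thingpark.enterprise.actility.com/nodegroup-name-
kubectl drain <nodeInMaintenance> --delete-emptydir-data --ignore-daemonsets

# ...run maintenance, re-taint/re-label the node, then apply the same
# label removal and drain to the spare worker before removing it
```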
Large maintenances
Some maintenance operations require a rolling upgrade of the workers running ThingPark Enterprise. The best approach is to move all workloads to a temporary spare worker group.
In the big picture:
- Create a new spare worker group with the same characteristics as the main one (CPU, RAM, Availability Zones).
- Add the taint and label to dedicate these new resources to ThingPark Enterprise. New tasks start to be scheduled on the spare worker group.
- Move all ThingPark Enterprise workloads by removing the `thingpark.enterprise.actility.com/nodegroup-name=tpe` label and draining the nodes of the worker group to put in maintenance. Caution: nodes have to be drained one by one to safely move workloads.
- Run maintenance operations
- Re-taint and re-label the up-to-date worker group
- Remove the label and drain the nodes of the spare worker group one by one
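For the workload-move step, a minimal per-node loop could look like the following sketch. The node names are placeholders, and the Deployment readiness wait is a simplified stand-in for checking that all workloads are healthy again before draining the next node.

```shell
# Hypothetical drain loop over the worker group under maintenance;
# nodes are drained strictly one by one.
for node in <node1> <node2> <node3>; do
  # Remove the ThingPark Enterprise label so nothing new is scheduled here
  kubectl label node "$node" thingpark.enterprise.actility.com/nodegroup-name-
  kubectl drain "$node" --delete-emptydir-data --ignore-daemonsets
  # Wait for Deployments to become fully available before the next node;
  # StatefulSets should be checked the same way (e.g. kubectl get sts)
  kubectl wait deployment --all --for=condition=Available --timeout=15m
done
```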
AKS specifics
Kubernetes cluster version upgrade
A Kubernetes cluster version upgrade follows the large-maintenance guidelines above.
Assumptions:
- The following procedure shows how to upgrade both the Kubernetes control plane and the workers from version 1.21 to 1.24 using the az CLI.
- ThingPark Enterprise is deployed on the default node pool
- The deployment uses an L segment sizing
- Start by upgrading the control plane to the latest 1.22 patch version. A 1.22 control plane is compatible with workers running versions 1.20 to 1.22.

az aks upgrade --resource-group <resourceGroupName> \
  --name <aksClusterName> \
  --kubernetes-version 1.22 \
  --control-plane-only

- Create the spare worker group following the same sizing as the main one
az aks nodepool add --cluster-name <aksClusterName> \
--name spare \
--resource-group <resourceGroupName> \
--kubernetes-version 1.22 \
--node-count 3 \
--zones 1 2 3 \
--node-vm-size Standard_D4s_v4 \
--node-osdisk-type Managed \
--node-osdisk-size 128 \
  --max-pods 50

- Apply the label (mandatory) and the taint (optional, for clusters dedicated to ThingPark Enterprise) to all spare nodes, following the Installation procedure
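Per node, this typically amounts to the following kubectl commands; the taint value and NoSchedule effect are assumptions for a cluster dedicated to ThingPark Enterprise, to be checked against the Installation procedure.

```shell
# <nodeID> is a placeholder; repeat for every node of the spare pool
kubectl label node <nodeID> thingpark.enterprise.actility.com/nodegroup-name=tpe
# Optional taint for dedicated clusters (assumed NoSchedule effect)
kubectl taint node <nodeID> thingpark.enterprise.actility.com/nodegroup-name=tpe:NoSchedule
```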
- Start moving workloads by draining the nodes of the default node pool. Drain the nodes one by one; for each node:

  - Drain the node and remove the label

    kubectl drain <nodeID> --delete-emptydir-data --ignore-daemonsets

  - Check that all Deployments and StatefulSets are back to fully healthy before draining the next node

    kubectl get sts
    kubectl get deploy
- At this point, the default node pool must be free of all ThingPark Enterprise workloads. It can be upgraded:
# Speed up the upgrade by allowing all nodes to be upgraded at the same time
az aks nodepool update --cluster-name <aksClusterName> \
--name default \
--resource-group <resourceGroupName> \
--max-surge 100%
# Upgrade to the latest 1.22 patch
az aks nodepool upgrade --cluster-name <aksClusterName> \
--kubernetes-version 1.22 \
--name default \
  --resource-group <resourceGroupName>

- Continue by upgrading both the control plane and the workers of the default node pool to the 1.23 release, and then to 1.24
# 1.23 upgrade
az aks upgrade --resource-group <resourceGroupName> \
--name <aksClusterName> \
--kubernetes-version 1.23 \
--control-plane-only
az aks nodepool upgrade --cluster-name <aksClusterName> \
--kubernetes-version 1.23 \
--name default \
--resource-group <resourceGroupName>
# 1.24 upgrade
az aks upgrade --resource-group <resourceGroupName> \
--name <aksClusterName> \
--kubernetes-version 1.24 \
--control-plane-only
az aks nodepool upgrade --cluster-name <aksClusterName> \
--kubernetes-version 1.24 \
--name default \
  --resource-group <resourceGroupName>

- Move workloads back onto the default node pool by re-tainting and re-labeling its nodes

- Drain the spare nodes and remove their labels. For each node:
  - Drain the node and remove the label

    kubectl drain <nodeID> --delete-emptydir-data --ignore-daemonsets

  - Check that all Deployments and StatefulSets are back to fully healthy before draining the next node

    kubectl get sts
    kubectl get deploy
- Delete the spare node pool
az aks nodepool delete --cluster-name <aksClusterName> \
--name spare \
  --resource-group <resourceGroupName>

- Check the cluster provisioning state
$ az aks show --resource-group <resourceGroupName> \
--name <aksClusterName> \
--output table
Name Location ResourceGroup KubernetesVersion CurrentKubernetesVersion ProvisioningState Fqdn
--------------- ---------- --------------- ------------------- -------------------------- ------------------- -------------------------------------------------
<aksClusterName> westeurope <resourceGroupName> 1.24.0 1.24.0 Succeeded <aksClusterName>-d1661175.hcp.westeurope.azmk8s.io
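For scripting the final check, the provisioning state alone can be extracted with a JMESPath query instead of reading the full table:

```shell
# Prints just the provisioning state of the cluster, e.g. Succeeded
az aks show --resource-group <resourceGroupName> \
  --name <aksClusterName> \
  --query provisioningState \
  --output tsv
```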