Maintenance considerations
This section provides prerequisites to follow when you have to drain Kubernetes workers dedicated to ThingPark Enterprise.
- ThingPark Enterprise 7.1.3 is required to correctly manage disruptions
- It is advised to back up data before any maintenance operation
Generic requirements
Maintenance on one node
- To maintain compute capacity, it is encouraged to scale up the worker group with a spare worker in the same Availability Zone as the node under maintenance.
- The spare worker must follow the sizing prerequisites.
Maintenance sequence example:
- Add a spare node to the cluster in the appropriate Availability Zone
- Add the taint and label to start scheduling ThingPark Enterprise workloads on the spare node
- Remove the `thingpark.enterprise.actility.com/nodegroup-name=tpe` label from the node under maintenance and drain it
- Run maintenance operations on the node
- Re-taint and re-label the node when maintenance operations are done
- Remove the `thingpark.enterprise.actility.com/nodegroup-name=tpe` label from the spare worker, drain it, and finally remove it
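The sequence above can be sketched with kubectl. This is a sketch only: the node names are placeholders, and the taint key and value are assumed to mirror the label; use the exact taint from your installation procedure.

```shell
# Assumed placeholders: <spareNode> is the added spare node,
# <nodeInMaintenance> is the node to be maintained.

# 1. Dedicate the spare node to ThingPark Enterprise
#    (the taint is optional on shared clusters)
kubectl label node <spareNode> thingpark.enterprise.actility.com/nodegroup-name=tpe
kubectl taint node <spareNode> thingpark.enterprise.actility.com/nodegroup-name=tpe:NoSchedule

# 2. Evict workloads from the node under maintenance
#    (the trailing "-" removes the label)
kubectl label node <nodeInMaintenance> thingpark.enterprise.actility.com/nodegroup-name-
kubectl drain <nodeInMaintenance> --delete-emptydir-data --ignore-daemonsets

# 3. After maintenance: re-label, re-taint, and make the node schedulable again
kubectl label node <nodeInMaintenance> thingpark.enterprise.actility.com/nodegroup-name=tpe
kubectl taint node <nodeInMaintenance> thingpark.enterprise.actility.com/nodegroup-name=tpe:NoSchedule
kubectl uncordon <nodeInMaintenance>

# 4. Decommission the spare: remove the label, drain, then remove the node
kubectl label node <spareNode> thingpark.enterprise.actility.com/nodegroup-name-
kubectl drain <spareNode> --delete-emptydir-data --ignore-daemonsets
```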
Large maintenance operations
Some maintenance operations require a rolling upgrade of the workers running ThingPark Enterprise. The best approach is to move all workloads to a temporary spare worker group.
At a high level:
- Create a new spare worker group with the same characteristics as the main one (CPU, RAM, Availability Zones).
- Add the taint and label to dedicate these new resources to ThingPark Enterprise. New pods start to be scheduled on the spare worker group.
- Move all ThingPark Enterprise workloads by removing the `thingpark.enterprise.actility.com/nodegroup-name=tpe` label and draining the nodes of the worker group to put in maintenance. Caution: nodes have to be drained one by one to safely move workloads.
- Run maintenance operations.
- Re-taint and re-label the up-to-date worker group.
- Remove the label and drain nodes one by one from the spare worker group.
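The one-by-one drain with a health check between nodes can be scripted as follows. This is a sketch: it assumes the current kubectl context and namespace target the ThingPark Enterprise deployment, and it takes the node names to drain as arguments.

```shell
#!/bin/sh
# Drain the given nodes one at a time, waiting for workloads to recover
# before moving on to the next node.
for node in "$@"; do
  # Remove the ThingPark label so pods reschedule on the other worker group
  kubectl label node "$node" thingpark.enterprise.actility.com/nodegroup-name-
  kubectl drain "$node" --delete-emptydir-data --ignore-daemonsets
  # Block until every Deployment and StatefulSet has finished rolling out
  for obj in $(kubectl get deploy,sts -o name); do
    kubectl rollout status "$obj" --timeout=10m
  done
done
```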
AKS specifics
Kubernetes cluster version upgrade
A Kubernetes cluster version upgrade follows the large maintenance guidelines.
Assumptions:
- The following procedure shows how to upgrade both the Kubernetes control plane and the workers from version 1.21 to 1.24 using the az CLI
- ThingPark Enterprise is deployed on the default node pool
- The deployment uses the L segment sizing
Start by upgrading the control plane to the latest 1.22 patch version. A 1.22 control plane is compatible with workers from version 1.20 to 1.22:
```shell
az aks upgrade --resource-group <resourceGroupName> \
  --name <aksClusterName> \
  --kubernetes-version 1.22 \
  --control-plane-only
```
Create the spare worker group following the same sizing as the main one:
```shell
az aks nodepool add --cluster-name <aksClusterName> \
  --name spare \
  --resource-group <resourceGroupName> \
  --kubernetes-version 1.22 \
  --node-count 3 \
  --zones 1 2 3 \
  --node-vm-size Standard_D4s_v4 \
  --node-osdisk-type Managed \
  --node-osdisk-size 128 \
  --max-pods 50
```
Apply the label (mandatory) and the taint (optional, for dedicated clusters) to all spare nodes, following the Installation procedure.
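For example, labeling every node of the spare pool could look like the sketch below. `kubernetes.azure.com/agentpool` is the standard AKS node label identifying pool membership; the taint key and value are an assumption, so take the exact values from the Installation procedure.

```shell
# Label (and optionally taint) every node of the "spare" AKS node pool
for node in $(kubectl get nodes -l kubernetes.azure.com/agentpool=spare -o name); do
  # Mandatory label so ThingPark Enterprise workloads can be scheduled here
  kubectl label "$node" thingpark.enterprise.actility.com/nodegroup-name=tpe
  # Optional taint, for clusters dedicated to ThingPark Enterprise
  kubectl taint "$node" thingpark.enterprise.actility.com/nodegroup-name=tpe:NoSchedule
done
```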
Start moving workloads by draining the nodes of the default node pool, one node at a time:
Drain the node and remove its label:
```shell
kubectl drain <nodeID> --delete-emptydir-data --ignore-daemonsets
```
Verify that all Deployments and StatefulSets are back to fully healthy before draining the next node:
```shell
kubectl get sts
kubectl get deploy
```
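The label removal mentioned in this step has no command shown; it can be done as follows, using the same `<nodeID>` placeholder as the drain command:

```shell
# The trailing "-" removes the label from the node
kubectl label node <nodeID> thingpark.enterprise.actility.com/nodegroup-name-
```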
At this point the default node pool must be free of all ThingPark Enterprise workloads. It can now be upgraded:
```shell
# Speed up the upgrade by allowing all nodes to be upgraded at the same time
az aks nodepool update --cluster-name <aksClusterName> \
  --name default \
  --resource-group <resourceGroupName> \
  --max-surge 100%
# Upgrade to the latest 1.22 patch
az aks nodepool upgrade --cluster-name <aksClusterName> \
  --kubernetes-version 1.22 \
  --name default \
  --resource-group <resourceGroupName>
```
Continue by upgrading both the control plane and the workers of the default node pool, first to the 1.23 release and then to 1.24:
```shell
# 1.23 upgrade
az aks upgrade --resource-group <resourceGroupName> \
  --name <aksClusterName> \
  --kubernetes-version 1.23 \
  --control-plane-only
az aks nodepool upgrade --cluster-name <aksClusterName> \
  --kubernetes-version 1.23 \
  --name default \
  --resource-group <resourceGroupName>
# 1.24 upgrade
az aks upgrade --resource-group <resourceGroupName> \
  --name <aksClusterName> \
  --kubernetes-version 1.24 \
  --control-plane-only
az aks nodepool upgrade --cluster-name <aksClusterName> \
  --kubernetes-version 1.24 \
  --name default \
  --resource-group <resourceGroupName>
```
Start moving workloads back to the default node pool by re-tainting and re-labeling its nodes.
Drain the spare nodes and remove their labels. For each node:
Drain the node and remove its label:
```shell
kubectl drain <nodeID> --delete-emptydir-data --ignore-daemonsets
```
Verify that all Deployments and StatefulSets are back to fully healthy before draining the next node:
```shell
kubectl get sts
kubectl get deploy
```
Delete the spare node pool:
```shell
az aks nodepool delete --cluster-name <aksClusterName> \
  --name spare \
  --resource-group <resourceGroupName>
```
Check the cluster provisioning state:
```shell
$ az aks show --resource-group <resourceGroupName> \
  --name <aksClusterName> \
  --output table
Name              Location    ResourceGroup        KubernetesVersion    CurrentKubernetesVersion    ProvisioningState    Fqdn
----------------  ----------  -------------------  -------------------  --------------------------  -------------------  --------------------------------------------------
<aksClusterName>  westeurope  <resourceGroupName>  1.24.0               1.24.0                      Succeeded            <aksClusterName>-d1661175.hcp.westeurope.azmk8s.io
```