Data stack
Galera cluster recovery
A major Kubernetes cluster outage or an accidental voluntary disruption can cause the Galera cluster to lose quorum and fail completely.
Follow the next steps to recover by re-bootstrapping the cluster.
Requirement: set up the workstation environment:
export RELEASE=<release name>
export CONFIG_REPO_BASEURL=https://raw.githubusercontent.com/actility/thingpark-enterprise-kubernetes/v$RELEASE
eval $(curl $CONFIG_REPO_BASEURL/VERSIONS)
# Set the deployment namespace as an environment variable
export NAMESPACE=thingpark-enterprise
# Value among s, m, l, xl, xxl
export SEGMENT=l
# Value among azure, amazon
export HOSTING=azure
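As a quick sanity check before going further, verify that none of these variables is empty; this assumes the sourced VERSIONS file exports THINGPARK_DATA_VERSION, which the helm commands below rely on:
# All values printed below must be non-empty before continuing
echo "RELEASE=$RELEASE NAMESPACE=$NAMESPACE SEGMENT=$SEGMENT HOSTING=$HOSTING"
echo "THINGPARK_DATA_VERSION=$THINGPARK_DATA_VERSION"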
Check the Galera statefulset pod state; it will report failure conditions:
kubectl get po -n $NAMESPACE -l app.kubernetes.io/name=mariadb-galera -o jsonpath='{.items[].status.containerStatuses[].ready}'
kubectl get po -n $NAMESPACE -l app.kubernetes.io/name=mariadb-galera -o jsonpath='{.items[].status.containerStatuses[].state}' | jq
{
"waiting": {
"message": "back-off 5m0s restarting failed container=mariadb-galera pod=tpe-mariadb-galera-0_thingpark-enterprise(6faab544-25fd-4e77-a6b9-185e058462dd)",
"reason": "CrashLoopBackOff"
}
}
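For more context than the container status, the logs of the last failed start usually show the quorum loss; a minimal sketch, assuming the pod naming visible in the example output above (tpe-mariadb-galera-0):
# Inspect the previous (crashed) container of the first Galera node
kubectl -n $NAMESPACE logs tpe-mariadb-galera-0 --previous | tail -n 50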
Stop the cluster by deleting the statefulset and stopping the SQL proxy:
kubectl -n $NAMESPACE delete statefulsets.apps mariadb-galera
kubectl -n $NAMESPACE scale deployment sql-proxy --replicas=0
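Before reading the data volumes, make sure no Galera pod is still running; this only reuses the label selector from the check above:
# Should return "No resources found" once the statefulset is deleted
kubectl -n $NAMESPACE get po -l app.kubernetes.io/name=mariadb-galera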
Retrieve the grastate.dat content of each node, using each volume claim name (data-mariadb-galera-0, data-mariadb-galera-1, data-mariadb-galera-2), for instance:
kubectl run --restart=Never -n $NAMESPACE -i --rm --tty volpod --overrides='
{
"apiVersion": "v1",
"kind": "Pod",
"metadata": {
"name": "volpod"
},
"spec": {
"containers": [{
"command": [
"cat",
"/mnt/data/grastate.dat"
],
"image": "bitnami/minideb",
"name": "mycontainer",
"volumeMounts": [{
"mountPath": "/mnt",
"name": "galeradata"
}]
}],
"restartPolicy": "Never",
"volumes": [{
"name": "galeradata",
"persistentVolumeClaim": {
"claimName": "data-mariadb-galera-0"
}
}]
}
}' --image="bitnami/minideb"
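Rather than editing claimName by hand for each of the three nodes, the same inspection can be scripted; a sketch that only parameterizes the command above over the claim names data-mariadb-galera-0/1/2 (the volpod-<n> pod names are arbitrary):
for i in 0 1 2; do
  echo "## Node $i"
  kubectl run --restart=Never -n $NAMESPACE -i --rm volpod-"$i" --image="bitnami/minideb" --overrides='
  {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "volpod-'"$i"'"},
    "spec": {
      "containers": [{
        "command": ["cat", "/mnt/data/grastate.dat"],
        "image": "bitnami/minideb",
        "name": "mycontainer",
        "volumeMounts": [{"mountPath": "/mnt", "name": "galeradata"}]
      }],
      "restartPolicy": "Never",
      "volumes": [{
        "name": "galeradata",
        "persistentVolumeClaim": {"claimName": "data-mariadb-galera-'"$i"'"}
      }]
    }
  }'
done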
As a result, you obtain each node's state, for instance (this situation reflects an improper cluster stop: all safe_to_bootstrap values equal 0):
## Node 0
# GALERA saved state
version: 2.1
uuid: f23062b8-3ed3-11eb-9979-0e1cb0f4f878
seqno: 14
safe_to_bootstrap: 0
pod "volpod" deleted
## Node 1
# GALERA saved state
version: 2.1
uuid: f23062b8-3ed3-11eb-9979-0e1cb0f4f878
seqno: 14
safe_to_bootstrap: 0
pod "volpod" deleted
## Node 2
# GALERA saved state
version: 2.1
uuid: f23062b8-3ed3-11eb-9979-0e1cb0f4f878
seqno: 14
safe_to_bootstrap: 0
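Only two fields drive the choice between the bootstrap options below: safe_to_bootstrap and seqno. If you capture each node's output into a file, a simple grep makes the comparison explicit (node-0.txt, node-1.txt and node-2.txt are hypothetical file names):
# Pick a node with safe_to_bootstrap: 1 if any; otherwise the highest seqno wins
grep -H -E 'seqno|safe_to_bootstrap' node-0.txt node-1.txt node-2.txt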
Bootstrap the cluster:
Option 1: one node has safe_to_bootstrap: 1:
# GALERA saved state
version: 2.1
uuid: f23062b8-3ed3-11eb-9979-0e1cb0f4f878
seqno: 14
safe_to_bootstrap: 1
pod "volpod" deleted
This node should be used to bootstrap the cluster, for instance with node 1:
helm -n $NAMESPACE upgrade -i tpe-data actility/thingpark-data \
--version $THINGPARK_DATA_VERSION --reuse-values \
--set mariadb-galera.podManagementPolicy=Parallel \
--set mariadb-galera.galera.bootstrap.forceBootstrap=true \
--set mariadb-galera.galera.bootstrap.bootstrapFromNode=1
Option 2: all nodes have safe_to_bootstrap: 0: the cluster should be bootstrapped from the node with the highest seqno:
helm -n $NAMESPACE upgrade -i tpe-data actility/thingpark-data \
--version $THINGPARK_DATA_VERSION --reuse-values \
--set mariadb-galera.podManagementPolicy=Parallel \
--set mariadb-galera.galera.bootstrap.forceSafeToBootstrap=true \
--set mariadb-galera.galera.bootstrap.forceBootstrap=true \
--set mariadb-galera.galera.bootstrap.bootstrapFromNode=1
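As a purely hypothetical variant: if node 2 (rather than node 1) had reported the highest seqno, only the bootstrap node index would change:
helm -n $NAMESPACE upgrade -i tpe-data actility/thingpark-data \
--version $THINGPARK_DATA_VERSION --reuse-values \
--set mariadb-galera.podManagementPolicy=Parallel \
--set mariadb-galera.galera.bootstrap.forceSafeToBootstrap=true \
--set mariadb-galera.galera.bootstrap.forceBootstrap=true \
--set mariadb-galera.galera.bootstrap.bootstrapFromNode=2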
Wait for the end of recovery, then reset the Helm release values by stopping the Galera cluster in the following way:
# Wait until all pods became READY
kubectl -n $NAMESPACE get statefulsets.apps mariadb-galera -w
# Gracefully scale down the mariadb-galera cluster (wait until pod deletion completes at each step)
kubectl -n $NAMESPACE scale statefulsets.apps mariadb-galera --replicas=2
kubectl -n $NAMESPACE scale statefulsets.apps mariadb-galera --replicas=1
# Delete the statefulset
kubectl -n $NAMESPACE delete statefulsets.apps mariadb-galera
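Before the final upgrade, it is worth confirming that the statefulset and its pods are completely gone; this reuses only names already present in this procedure:
# Both commands should report that nothing is found
kubectl -n $NAMESPACE get statefulsets.apps mariadb-galera
kubectl -n $NAMESPACE get po -l app.kubernetes.io/name=mariadb-galera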
Finally, upgrade the tpe-data release and restart the sql-proxy router deployment:
helm upgrade -i tpe-data -n $NAMESPACE \
actility/thingpark-data --version $THINGPARK_DATA_VERSION \
-f $CONFIG_REPO_BASEURL/configs/$HOSTING/values-$SEGMENT-segment.yaml \
-f custom-values.yaml
kubectl scale -n $NAMESPACE deployment sql-proxy --replicas=2
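To validate the recovery, wait for the Galera statefulset and the SQL proxy to report ready, then optionally check the replicated cluster size from inside a node; the pod name and the root password variable below are assumptions you may need to adapt to your deployment:
kubectl -n $NAMESPACE rollout status statefulset/mariadb-galera
kubectl -n $NAMESPACE rollout status deployment/sql-proxy
# Optional: a healthy 3-node Galera cluster reports wsrep_cluster_size = 3
kubectl -n $NAMESPACE exec mariadb-galera-0 -- \
mysql -uroot -p"$MARIADB_ROOT_PASSWORD" -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"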