Galera cluster recovery
TPE-HA: MariaDB/Galera is unable to bootstrap
SYMPTOM: The MariaDB cluster stays down after a disruption (all TPE nodes down), and the SQL container on tpe_node1 or tpe_node2 keeps restarting with logs like the following:
[support@tpe-node1 ~]$ docker logs -f $(docker ps -q --filter name=sql_node)
[...]
INFO: Reporting seqno: -1 to Zookeeper store.
[...]
ERROR: A unaivalable node have backuped a higher seqno, can't bootstrap.
SOLUTION:
Follow the steps below to recover by re-bootstrapping the cluster.
Retrieve the seqno for sql_node1 and sql_node2 using:
[support@tpe-node1 ~]$ docker exec $(docker ps -q -f 'name=zk_node') java -Xmx256m org.apache.zookeeper.ZooKeeperMain get /galera/tpe/nodes/sql_node1/seqno
Connecting to localhost:2181
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
1645
The seqno of sql_node1 is 1645.
And for sql_node2:
[support@tpe-node1 ~]$ docker exec $(docker ps -q -f 'name=zk_node') java -Xmx256m org.apache.zookeeper.ZooKeeperMain get /galera/tpe/nodes/sql_node2/seqno
Connecting to localhost:2181
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
1643
The seqno of sql_node2 is 1643.
Bootstrap the cluster from the node with the highest seqno.
If that node is sql_node1:
docker exec $(docker ps -q -f 'name=zk_node') java -Xmx256m org.apache.zookeeper.ZooKeeperMain create /galera/tpe/forceboot ""
docker exec $(docker ps -q -f 'name=zk_node') java -Xmx256m org.apache.zookeeper.ZooKeeperMain create /galera/tpe/forceboot/node sql_node1
Otherwise (sql_node2 has the highest seqno):
docker exec $(docker ps -q -f 'name=zk_node') java -Xmx256m org.apache.zookeeper.ZooKeeperMain create /galera/tpe/forceboot ""
docker exec $(docker ps -q -f 'name=zk_node') java -Xmx256m org.apache.zookeeper.ZooKeeperMain create /galera/tpe/forceboot/node sql_node2
The Galera cluster then restarts.
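The manual steps above (read both seqno values, compare them, create the forceboot entries for the winning node) can be sketched as a few shell functions. This is only an illustration, not part of the TPE tooling: the function names zk_cli, extract_seqno, read_seqno, pick_bootstrap_node and force_bootstrap are hypothetical. It assumes the seqno appears in the ZooKeeperMain output as a line containing only an integer, as in the transcripts above. On a tie it arbitrarily picks sql_node1, and it does not treat a seqno of -1 specially, so check the values by hand before forcing a bootstrap.

```shell
#!/bin/sh
# Sketch only: helper functions wrapping the commands shown above.
# All function names are hypothetical, not part of TPE.

# Run the ZooKeeper CLI inside the zk_node container (same command as above).
zk_cli() {
    docker exec "$(docker ps -q -f 'name=zk_node')" \
        java -Xmx256m org.apache.zookeeper.ZooKeeperMain "$@"
}

# Keep the last line of the CLI output that is a bare integer.
# Assumption: the seqno is printed alone on its own line.
extract_seqno() {
    grep -E '^-?[0-9]+$' | tail -n 1
}

# Print the seqno recorded in ZooKeeper for the given node name.
read_seqno() {
    zk_cli get "/galera/tpe/nodes/$1/seqno" 2>/dev/null | extract_seqno
}

# Print the node to bootstrap from, given the seqno of sql_node1 and sql_node2.
# Tie-break: sql_node1 (arbitrary choice made for this sketch).
pick_bootstrap_node() {
    if [ "$1" -ge "$2" ]; then
        echo sql_node1
    else
        echo sql_node2
    fi
}

# Create the forceboot entries for the given node (the two commands above).
force_bootstrap() {
    zk_cli create /galera/tpe/forceboot "" &&
        zk_cli create /galera/tpe/forceboot/node "$1"
}
```

Possible usage, after sourcing the functions on a TPE node: run `node=$(pick_bootstrap_node "$(read_seqno sql_node1)" "$(read_seqno sql_node2)")`, check the printed seqno values and the chosen node, then run `force_bootstrap "$node"`. Keeping the comparison in its own function makes it easy to verify the choice before anything is written to ZooKeeper.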
Update procedure fails with "current TPE image is not present anymore" message
SYMPTOM: The TPE update procedure fails with a "current TPE image is not present anymore" error.
NOTE: This error may occur when the current TPE image is no longer present on the TPE host.
For more details, please consult the TPE documentation.
SOLUTION:
Redeploy the cluster (TPE Service -> TPE Cluster operations -> Redeploy cluster).