
Galera cluster recovery

TPE-HA: MariaDB/Galera is unable to bootstrap

SYMPTOM: The MariaDB cluster stays down after a disruption (all TPE nodes down), and the SQL container on tpe_node1 or tpe_node2 keeps restarting with logs like the following:

[support@tpe-node1 ~]$ docker logs -f $(docker ps -q --filter name=sql_node)
[...]
INFO: Reporting seqno: -1 to Zookeeper store.
[...]
ERROR: A unaivalable node have backuped a higher seqno, can't bootstrap.
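
To check for this symptom without following the log stream by hand, the error can be searched for in the recent log output. This is a minimal sketch: the helper name has_bootstrap_error is hypothetical, and the match string is taken from the sample log above, so it may need adjusting if the message wording differs on your version.

```shell
#!/bin/sh
# Returns 0 when the log text on stdin contains the bootstrap-refusal
# error quoted above, and 1 otherwise.
has_bootstrap_error() {
    grep -q "can't bootstrap"
}

# Usage (assumes exactly one container matches the sql_node filter):
#   docker logs --tail 200 "$(docker ps -q --filter name=sql_node)" 2>&1 \
#       | has_bootstrap_error && echo "bootstrap is blocked"
```

The 2>&1 redirection is there because docker logs forwards the container's stderr on its own stderr.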

SOLUTION:

Follow the steps below to recover by re-bootstrapping the cluster.

Retrieve the seqno stored in ZooKeeper for sql_node1 and sql_node2. For sql_node1:

[support@tpe-node1 ~]$ docker exec -it $(docker ps -q --filter "name=zk_node") zkCli.sh get /galera/tpe/nodes/sql_node1/seqno
Connecting to localhost:2181
2023-01-10 19:17:23,870 [myid:] - INFO [main:Environment@100] - Client environment:zookeeper.version=3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
[...]
Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x10000016c1704e0, negotiated timeout = 30000


WATCHER::


WatchedEvent state:SyncConnected type:None path:null
1645
cZxid = 0x5700007423
ctime = <Date>
mZxid = 0x5900002efb
mtime = <Date>
pZxid = 0x5700007423
cversion = 0
dataVersion = 1850
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 4
numChildren = 0

The seqno for sql_node1 is 1645: it is the value printed on the line right after the WatchedEvent line.

Then, for sql_node2:

[support@tpe-node1 ~]$ docker exec -it $(docker ps -q --filter "name=zk_node") zkCli.sh get /galera/tpe/nodes/sql_node2/seqno
Connecting to localhost:2181
2023-01-10 19:17:23,870 [myid:] - INFO [main:Environment@100] - Client environment:zookeeper.version=3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
[...]
Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x10000016c1704e0, negotiated timeout = 30000


WATCHER::


WatchedEvent state:SyncConnected type:None path:null
1643
cZxid = 0x5700007423
ctime = <Date>
mZxid = 0x5900002efb
mtime = <Date>
pZxid = 0x5700007423
cversion = 0
dataVersion = 1850
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 4
numChildren = 0

The seqno for sql_node2 is 1643.
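
The seqno is buried in the zkCli.sh connection messages and znode metadata, so it can help to filter it out. The sketch below uses a hypothetical helper, parse_seqno, and assumes that the znode data is the only output line made up solely of an integer, as in the sample output above.

```shell
#!/bin/sh
# Extracts the seqno from `zkCli.sh get` output read on stdin.
# Assumption: the znode data is the only line consisting solely of an
# integer (possibly negative, such as -1). Carriage returns are removed
# first in case the output was captured from a terminal session.
parse_seqno() {
    tr -d '\r' | grep -E '^-?[0-9]+$' | head -n 1
}

# Usage (-it is dropped here because the output is piped, not shown on a TTY):
#   docker exec "$(docker ps -q --filter "name=zk_node")" \
#       zkCli.sh get /galera/tpe/nodes/sql_node1/seqno 2>&1 | parse_seqno
```

Matching on a whole-line integer rather than on a fixed line position keeps the filter independent of how many connection messages zkCli.sh prints first.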

Bootstrap the cluster from the node with the highest seqno (here sql_node1, with 1645 against 1643):

If this node is sql_node1:

docker exec -it $(docker ps -q --filter "name=zk_node") zkCli.sh create /galera/tpe/forceboot ""
docker exec -it $(docker ps -q --filter "name=zk_node") zkCli.sh create /galera/tpe/forceboot/node sql_node1

Otherwise (sql_node2 has the highest seqno):

docker exec -it $(docker ps -q --filter "name=zk_node") zkCli.sh create /galera/tpe/forceboot ""
docker exec -it $(docker ps -q --filter "name=zk_node") zkCli.sh create /galera/tpe/forceboot/node sql_node2

Once the forceboot node is created, the Galera cluster restarts.
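
The node selection and the forceboot commands above can be sketched as two small helpers. Both function names are hypothetical, and the tie-break in favour of sql_node1 when the seqnos are equal is an assumption, since the procedure does not cover that case. The commands are only printed, so they can be reviewed before being run.

```shell
#!/bin/sh
# Prints the name of the node to bootstrap from, given the two seqnos.
# Assumption: on equal seqnos, sql_node1 is chosen.
pick_bootstrap_node() {
    seqno1=$1
    seqno2=$2
    if [ "$seqno1" -ge "$seqno2" ]; then
        echo sql_node1
    else
        echo sql_node2
    fi
}

# Prints (does not run) the two forceboot commands for the given node.
# The $(...) part is kept literal on purpose, hence the single quotes.
print_forceboot_commands() {
    node=$1
    zk='docker exec -it $(docker ps -q --filter "name=zk_node") zkCli.sh'
    printf '%s create /galera/tpe/forceboot ""\n' "$zk"
    printf '%s create /galera/tpe/forceboot/node %s\n' "$zk" "$node"
}

# Example with the seqnos read in this procedure:
#   print_forceboot_commands "$(pick_bootstrap_node 1645 1643)"
```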

Update procedure fails with "current TPE image is not present anymore" message

SYMPTOM: The TPE update procedure fails with the following error:

NOTE : This error may append when the current TPE image is not present anymore on TPE host.
For more details, please consult the TPE documentation.

SOLUTION:

Redeploy the cluster (TPE Service -> TPE Cluster operations -> Redeploy cluster).