TPE installation and upgrade issues

Cockpit is unreachable

SYMPTOM: After customizing the Cockpit certificate during the first installation or by editing the infrastructure configuration, Cockpit is unreachable ("CONNECTION_REFUSED").

SOLUTION:

Check the status of the Cockpit systemd service.
```
systemctl status cockpit
```
If you see the error "The certificate and the given key do not match", run the following commands to remove the Cockpit certificate and key, then redeploy the cluster from the TPE Services module in Cockpit with a matching certificate and private key.
```
sudo rm /etc/cockpit/ws-certs.d/cockpit.cert
sudo rm /etc/cockpit/ws-certs.d/cockpit.key
sudo systemctl restart cockpit
```
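
Before redeploying, it can help to confirm that the new certificate and private key actually belong together. The sketch below compares the public key embedded in the certificate with the public key derived from the private key, using standard `openssl` subcommands. The function name `cert_matches_key` and the file names in the usage comment are hypothetical, and the sketch assumes PEM files and an unencrypted private key.

```shell
#!/usr/bin/env bash
# Sketch: check that a PEM certificate and a PEM private key form a pair.
# cert_matches_key is a hypothetical helper, not part of TPE or Cockpit.
# Exit status: 0 = match, 1 = mismatch, 2 = a file could not be read or parsed.
cert_matches_key() {
    local cert=$1 key=$2 cert_pub key_pub
    # Public key embedded in the certificate.
    cert_pub=$(openssl x509 -in "$cert" -noout -pubkey 2>/dev/null) || return 2
    # Public key derived from the private key.
    key_pub=$(openssl pkey -in "$key" -pubout 2>/dev/null) || return 2
    [ "$cert_pub" = "$key_pub" ]
}

# Usage (file names are placeholders for your own files):
#   cert_matches_key new-cockpit.crt new-cockpit.key && echo "certificate and key match"
```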

GUI is unreachable

SYMPTOM: After a first installation, the browser cannot load the page ("Loading... please wait") or displays a certificate error.

SOLUTION:

Check the system monitoring (for more information, see Listing Containers and Health Check).
If errors are raised when you try to access the TPE Portal GUI after the TPE installation, try reconfiguring the hostname of the TPE system to "actility.local". Save & Apply the change, then try to access the TPE Portal GUI at https://enterprise.actility.local/tpe. If this works, your certificate is incorrect.
If you stay on the "Loading... please wait" page when you try to access the TPE Portal GUI, the TLS certificate for HTTP traffic may have expired. When it has expired, Cockpit displays an error message at the top of the "TPE Configuration" page.

In this case, you must generate a new certificate and upload it.
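
If you have the certificate file at hand, its expiry can also be checked from a shell. The sketch below relies on the `-checkend` option of `openssl x509`; the function name `cert_expires_within` and the file name in the usage comment are hypothetical, and the certificate is assumed to be in PEM format.

```shell
#!/usr/bin/env bash
# Sketch: report whether a PEM certificate expires within a time window.
# cert_expires_within is a hypothetical helper, not part of TPE.
# Exit status: 0 = expired or expiring within the window,
#              1 = still valid after the window, 2 = unreadable certificate.
cert_expires_within() {
    local cert=$1 window_s=${2:-0}
    # Exit 2 when the file is missing or is not a parsable certificate.
    openssl x509 -in "$cert" -noout >/dev/null 2>&1 || return 2
    # -checkend exits 0 when the certificate is still valid after window_s seconds.
    if openssl x509 -in "$cert" -noout -checkend "$window_s" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Usage (tpe.crt is a placeholder for your own certificate file):
#   cert_expires_within tpe.crt 0 && echo "certificate has expired"
#   cert_expires_within tpe.crt $((30 * 86400)) && echo "expires within 30 days"
```

A window of 0 seconds tests for a certificate that has already expired; a larger window gives advance warning before the GUI becomes unreachable.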

GUI displays "Unexpected error" messages

SYMPTOM: After an HA upgrade, the GUI displays connection errors (pop-up messages such as "Unexpected error occurred" or "Unknown error").

SOLUTION:

Disconnect from the GUI and reconnect (up to 10 disconnect/reconnect iterations may be needed to flush the SQL connection pool).
If the issue persists, restart both SQL proxy containers and try connecting to the TPE GUI again.

YUM update fails

SYMPTOM: The TPE update procedure fails with the following error:

There are unfinished transactions remaining. You might consider running yum-complete-transaction, or "yum-complete-transaction --cleanup-only" and "yum history redo last", first to finish them. If those don't work you'll have to try removing/installing packages by hand (maybe package-cleanup can help).

SOLUTION:

Run the following command over SSH:
```
sudo yum-complete-transaction
```
Then retry the TPE update procedure.

Update procedure fails with "ansible_memtotal_mb is undefined" message

SYMPTOM: The TPE update procedure fails with the following error:

tpe_node1 failed | msg: The conditional check 'ansible_memtotal_mb < max_xs_host_memory_size_mb' failed. The error was: error while evaluating conditional (ansible_memtotal_mb < max_xs_host_memory_size_mb): 'ansible_memtotal_mb' is undefined

SOLUTION:

Retry the TPE update procedure.

Update procedure fails with "Not enough disk space to perform the upgrade" message

SYMPTOM: The TPE update procedure fails with the following error:

NOTE: There is not enough disk space to perform the upgrade (at least 10GB required).

SOLUTION:

Clean up unnecessary container images.
Free up disk space.
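
To verify that enough space has been freed before retrying, the available space can be compared against the 10 GB threshold from the error message. The sketch below uses `df -Pk`; the function name `has_free_space` is hypothetical, and the path to check is an assumption because the error message does not say which filesystem is measured.

```shell
#!/usr/bin/env bash
# Sketch: check that a filesystem has at least a given number of GB available.
# has_free_space is a hypothetical helper, not part of TPE.
# Exit status: 0 = enough space, 1 = not enough, 2 = path could not be inspected.
has_free_space() {
    local path=$1 min_gb=$2 avail_kb
    # df -P prints one line per filesystem; column 4 is the available space in KB.
    avail_kb=$(df -Pk "$path" 2>/dev/null | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] || return 2
    [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# Usage (checking "/" is an assumption; adapt the path to your partitioning):
#   has_free_space / 10 || echo "less than 10 GB available, free up more space"
```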

Service actility_post-upgrade remains in failed state after reboot

SYMPTOM: After a TPE instance reboot, the service actility_post-upgrade remains in failed state in Cockpit. This happens because the post-upgrade service is not restarted automatically after the reboot.

SOLUTION:

Redeploy the actility_post-upgrade service manually:

Connect to Cockpit.
Go to the TPE Services module.
Under the "others" directory, for the service "actility_post-upgrade", click the "Actions" button, then "Redeploy".

Container images provisioning failed

SYMPTOM: During an install, an upgrade or a cluster redeployment, the container images provisioning might fail and the procedure (install, upgrade or redeploy) stops with an error such as:

tpe_node1 -> localhost failed | item: {u'key': u'mongo', u'value': {u'image': u'tpe/mongo', u'version': u'5.2.0-4'}} | msg: Error pushing image registry1.actility.local:5000/tpe/mongo: dial tcp X.X.X.X:5000: connect: no route to host

tpe_node2 -> localhost failed | item: {u'key': u'twa_dev', u'value': {u'image': u'tpe/twa-dev', u'version': u'7.12.6-1'}} | msg: Error pushing image registry2.actility.local:5000/tpe/twa-dev: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.

Several reasons can explain this issue:

network connection issue to the TPE repository
local container images registry instability

SOLUTION:

Restart the procedure (install, upgrade or redeploy).
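
If the failure repeats, a quick TCP check of the registry endpoint named in the error can help tell a network problem from registry instability. The sketch below uses Bash's `/dev/tcp` redirection together with `timeout`; the function name `port_open` is hypothetical, and the host and port in the usage comment are taken from the sample error above, so replace them with the values from your own error message.

```shell
#!/usr/bin/env bash
# Sketch: test whether a TCP port accepts connections.
# port_open is a hypothetical helper, not part of TPE.
# Exit status: 0 = connection succeeded, non-zero = refused, unreachable or timed out.
port_open() {
    local host=$1 port=$2 wait_s=${3:-5}
    # The inner shell opens a TCP connection through /dev/tcp; timeout bounds the wait.
    timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage (host and port come from the sample error message):
#   port_open registry1.actility.local 5000 || echo "registry is not reachable"
```

A successful connection only shows that the port is open; it does not prove that the registry itself is healthy.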

Container images cleanup failed

SYMPTOM: During an upgrade, the container images cleanup might fail with the error:

An error occurs during container images cleaning.

Since this error is not blocking, the upgrade continues, but the container images that were not cleaned up keep taking up disk space.

SOLUTION:

After the upgrade is complete, run the following script to clean up container images:

/usr/bin/tpe-cleanup-container-images
