When starting a cluster, we very rarely see a node come up but fail to register. When we SSH into the node, the Docker containers have never come up, and /var/log/cloud-init-output.log reports the following error:
Error response from daemon: Get "https://742127912612.dkr.ecr.us-west-2.amazonaws.com/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2024-02-09 14:55:26,122 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2024-02-09 14:55:26,123 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.9/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
Cloud-init v. 22.2.2 finished at Fri, 09 Feb 2024 14:55:26 +0000. Datasource DataSourceEc2. Up 23.76 seconds
We need a way to check whether the containers are running and healthy (at startup and continually), and to reset the node if they are not.
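A rough sketch of what such a check could look like, assuming the containers define a Docker `HEALTHCHECK` (containers without one never report `unhealthy`). The function name `containers_ok` and the expected-count argument are hypothetical, not something that exists in the current setup:

```shell
# Hypothetical check: succeeds only if at least <expected> containers are
# running and none of them reports an "unhealthy" health status.
# Usage: containers_ok <expected_running_count>
containers_ok() {
  local expected="$1" running unhealthy
  # Count running containers (one container ID per line).
  running=$(docker ps -q --filter "status=running" | wc -l | tr -d ' ')
  # Count containers whose HEALTHCHECK currently reports unhealthy.
  unhealthy=$(docker ps -q --filter "health=unhealthy" | wc -l | tr -d ' ')
  [ "$running" -ge "$expected" ] && [ "$unhealthy" -eq 0 ]
}
```

Something like a systemd timer or cron job could call `containers_ok <count>` periodically and trigger whatever node reset action we settle on when it fails; the reset mechanism itself is left open here.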
Docker Compose should be doing this, but maybe something is needed to watch Docker Compose, or maybe the settings aren't correct. Or maybe it needs to be baked into the AMI so that it does not depend on the EC2 user data running to completion.
Docker Compose isn't running, which is why none of the containers are started.
Maybe we need to move the startup logic into a bash script that retries failed steps (such as the registry request that timed out in the log above), and have the user data just call this script.
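As a starting point, a minimal retry helper with exponential backoff might look like the following. This is only a sketch: the `retry` function, its argument order, and the attempt counts and delays in the usage comments are all assumptions, and the image name and compose file path are placeholders.

```shell
# Hypothetical helper: run a command, retrying with exponential backoff.
# Usage: retry <max_attempts> <initial_delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  # Keep running the command until it succeeds or attempts run out.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: '$*' failed after $attempt attempts" >&2
      return 1
    fi
    echo "retry: attempt $attempt failed; sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))  # double the wait before the next attempt
  done
}

# Example usage from the startup script (placeholders, not real values):
#   retry 5 5 docker pull "<registry>/<image>"
#   retry 5 5 docker compose -f /path/to/docker-compose.yml up -d
```

With this in place, a transient ECR timeout during boot would be retried a few times instead of failing the whole user data run on the first error.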