Docker containers fail to start on EC2 instances #374

Open
jpswinski opened this issue Feb 9, 2024 · 2 comments

Comments

@jpswinski
Member

Very rarely, when starting a cluster, a node comes up but does not register. When we SSH into the node, we see that the Docker containers never came up, and /var/log/cloud-init-output.log reports the following error:

Error response from daemon: Get "https://742127912612.dkr.ecr.us-west-2.amazonaws.com/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2024-02-09 14:55:26,122 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2024-02-09 14:55:26,123 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.9/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
Cloud-init v. 22.2.2 finished at Fri, 09 Feb 2024 14:55:26 +0000. Datasource DataSourceEc2.  Up 23.76 seconds
@jpswinski
Member Author

We need a way to check that the containers are running and healthy (at startup and continually), and to reset the node if they are not.

Docker Compose should be doing this, but maybe something is needed to watch Docker Compose, or maybe its settings aren't correct. Or maybe the startup needs to be baked into the AMI so that it does not depend on the EC2 user data running to completion.
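
A minimal sketch of the kind of check described above, in Python, assuming the expected container names are known up front. It parses the output of `docker ps` with a name/status format and reports which expected containers are missing or not healthy. The function names are hypothetical, and what to do with the result (for example, resetting the node) is left open.

```python
import subprocess

# Assumption: containers that define a HEALTHCHECK show "(healthy)",
# "(unhealthy)", or "(health: starting)" in the Status column of `docker ps`;
# containers without one show only "Up ...".
PS_COMMAND = ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"]


def find_unhealthy(expected, ps_output):
    """Return {name: reason} for expected containers that are not running and healthy."""
    status = {}
    for line in ps_output.splitlines():
        if "\t" not in line:
            continue  # skip blank or unexpected lines
        name, state = line.split("\t", 1)
        status[name.strip()] = state.strip()

    problems = {}
    for name in expected:
        state = status.get(name)
        if state is None:
            problems[name] = "not running"
        elif not state.startswith("Up"):
            problems[name] = state  # e.g. "Restarting (1) 5 seconds ago"
        elif "(unhealthy)" in state or "(health: starting)" in state:
            problems[name] = state
    return problems


def check_node(expected):
    """Run `docker ps` and return the problems found (an empty dict means all good)."""
    result = subprocess.run(PS_COMMAND, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        # The daemon itself did not answer, so every container counts as down.
        return {name: "docker ps failed" for name in expected}
    return find_unhealthy(expected, result.stdout)
```

Something like `check_node(["app", "proxy"])` (placeholder names) could run once at the end of startup and then periodically, for instance from a systemd timer, so the check does not depend on user data having completed.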

@jpswinski
Member Author

jpswinski commented Feb 15, 2024

Docker Compose isn't running, which is why none of the containers are started.

Maybe we need to move the startup logic into a bash script that retries failed steps, and then have the user data just call this script.
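
The retry idea can be sketched as follows. This uses Python rather than bash, purely to illustrate the logic, and it assumes that retrying with exponential backoff is a reasonable response to the registry timeout in the log above. The function name, attempt count, and delays are placeholders, not the project's actual startup steps.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=5, initial_delay=2.0, backoff=2.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits 0, waiting longer after each failure.

    Returns True on success, False if every attempt failed.
    `run` and `sleep` are injectable so the logic can be exercised without Docker.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return True
        print(f"attempt {attempt}/{attempts} failed (exit {result.returncode})")
        if attempt < attempts:
            sleep(delay)      # wait before the next attempt
            delay *= backoff  # back off: 2s, 4s, 8s, ...
    return False
```

A startup script could then wrap each network-dependent step, for example `run_with_retries(["docker", "compose", "up", "-d"])`, and exit non-zero if it returns `False` so the failure is visible in the cloud-init log.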
