Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

driver: add a bootstrap timeout #2126

Closed
wants to merge 1 commit into from

Conversation

jsternberg
Copy link
Collaborator

Drivers will now fail with a timeout if they take over 20 seconds to
boot. This process repeats itself 3 times for a total of a minute.

Fixes #2120.

Drivers will now fail with a timeout if they take over 20 seconds to
boot. This process repeats itself 3 times for a total of a minute.

Signed-off-by: Jonathan A. Sternberg <[email protected]>
Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should set a general limit for boot. Eg. if you are on slow connection maybe pulling buildkit image takes a lot of time, or who knows what may affect the boot time for k8s driver. I think we should concentrate on the initial report of detecting unresponsive remote server or server that is not buildkit at all.

@jsternberg
Copy link
Collaborator Author

We can put the cancellation on the Wait() call instead and make it driver specific so this would only apply to the remote. It might also be better to include it as an option to the driver instead of just a default parameter. That way, we're also not changing the existing behavior.

The main problem with this that we run into is that GRPC doesn't really tell us if the GRPC service is unavailable in a permanent or temporary way. We'd have to parse the error message explicitly. GRPC's reconnection mechanism also seems to only be for reconnections and doesn't apply to the initial connection.

Another option that might help might be to include a status message for the bootstrap process. Right now, it just kind of stalls and doesn't show any message about the progress of the bootstrap. It might help if we use the progress writer to at least say how long the process is taking.

@tonistiigi
Copy link
Member

We can put the cancellation on the Wait() call instead and make it driver specific so this would only apply to the remote.

If you mean the client.Wait() in boot then that seems fine.

It might help if we use the progress writer to at least say how long the process is taking.

That would mean collecting such messages from the grpc level I think. I'm not sure how possible it is to capture that, and keep it in the form that it is useful for the user.

@jsternberg
Copy link
Collaborator Author

I've created a PR for an alternative to this one: #2130

If the service is already up, it doesn't change the output at all. If the service is invalid like in the original PR, it looks like this:

[+] Building 20.0s (1/1) FINISHED
 => ERROR [internal] waiting for connection                                                                                 20.0s
------
 > [internal] waiting for connection:
------
ERROR: context deadline exceeded

For ctrl+c:

[+] Building 0.9s (1/1) FINISHED
 => CANCELED [internal] waiting for connection                                                                               0.9s
ERROR: context canceled

@jsternberg
Copy link
Collaborator Author

Closing this in favor of #2130.

@jsternberg jsternberg closed this Nov 17, 2023
@jsternberg jsternberg deleted the bootstrap-timeout branch November 17, 2023 23:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feature request: buildx should timeout if node never becomes healthy
2 participants