Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ctx on run image pull #5645

Merged
merged 1 commit into from
Nov 29, 2024
Merged

fix: ctx on run image pull #5645

merged 1 commit into from
Nov 29, 2024

Conversation

Benehiko
Copy link
Member

@Benehiko Benehiko commented Nov 25, 2024

Fixes #5639

After some digging I found that we don't actually explicitly return an error on the pullImage function here. Once the context is cancelled the response is closed since the API client uses http.NewRequestWithContext which then exits the jsonmessage.DisplayJSONMessagesToStream(responseBody, out, nil) without any errors.

The following API request to ContainerCreate then immediately fails since the context is closed, which then returns an error.

See this section of the code, it fails to find the image, then pulls the image, and continues to retry creating a container - on this second retry it only then returns with the context error.
https://github.com/docker/cli/blob/master/cli/command/container/create.go#L271-L289

Writing a test for this means that we are building into the test an assumption that we won't error on the pullImage function, but rather on the second call to ContainerCreate. I'll push what I have, but I honestly think that we will get tripped up by the pullImage function again since it's not context aware - actually more so the jsonmessage package it is using is actually the root cause found in Moby here. It's a bit of a hornets nest since it gets used in many places and by refactoring/improving this it could cause edge case bugs.

- What I did
Delay stripping the cancel from the function ctx so we can still cancel the createContainer function.

containerID, err := createContainer(ctx, dockerCli, containerCfg, &runOpts.createOptions)

- How I did it

- How to verify it
Unit test

- Description for the changelog

Fixed bug preventing image pulls from being cancelled during `docker run`.

- A picture of a cute animal (not mandatory but encouraged)

@codecov-commenter
Copy link

codecov-commenter commented Nov 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.50%. Comparing base (682cf57) to head (30a73ff).
Report is 4 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5645   +/-   ##
=======================================
  Coverage   59.49%   59.50%           
=======================================
  Files         346      346           
  Lines       29371    29369    -2     
=======================================
+ Hits        17474    17475    +1     
+ Misses      10922    10920    -2     
+ Partials      975      974    -1     

@Benehiko Benehiko requested a review from a team November 25, 2024 12:50
Copy link
Member

@laurazard laurazard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM, but these commits need a rebase.

There's no reason for b42b50a and 2da6244 to be separate, and their commit messages don't explain much of anything –

b42b50a:

fix: ctx on run image pull

and 2da6244:

chore: ensure we strip the main ctx cancel

Can you squash them and include a message that explains why this change was done?

@laurazard
Copy link
Member

laurazard commented Nov 25, 2024

For reference, check the Linux kernel's guidelines on commit messages. Specifically:

Describe your problem. Whether your patch is a one-line bug fix or 5000 lines of a new feature, there must be an underlying problem that motivated you to do this work. Convince the reviewer that there is a problem worth fixing and that it makes sense for them to read past the first paragraph.
Describe user-visible impact. Straight up crashes and lockups are pretty convincing, but not all bugs are that blatant.

Describing what the commit is actually changing is okay, but people can tell that from reading the changes. Describing the why is IMO a lot more useful.

@laurazard
Copy link
Member

laurazard commented Nov 25, 2024

Somewhat orthogonal, but I was thinking about this:

After some digging I found that we don't actually explicitly return an error on the pullImage function here. Once the context is cancelled the response is closed since the API client uses http.NewRequestWithContext which then exits the jsonmessage.DisplayJSONMessagesToStream(responseBody, out, nil) without any errors.

The following API request to ContainerCreate then immediately fails since the context is closed, which then returns an error.

The client could check if the context was cancelled after the jsonmessage.DisplayJSONMessagesStream returns (before the next request to ContainerCreate), and immediately return there if it is.

if err := pullImage(ctx, dockerCli, config.Image, options); err != nil {
	return err
}
select {
case <-ctx.Done():
	return ctx.Err()
default:
}
...

That way, we'd always do the right thing.

Furthermore, I'm not 100% convinced about introducing a ctx into jsonmessage.DisplayJSONMessagesStream to allow cancelling it – regardless of what happens, we always want to drain the response body, which means that even if a client were to cancel it, it would have to go ahead and read the rest of the stream anyway, so the benefits here are slimmer (but we might still want to do it to at least cancel printing the messages, I guess).

Another benefit of having it this way is that there's clearer separation of concerns – DisplayJSONMessagesStream is only concerned with processing the stream, and only returns an error if something went wrong while processing the stream, and the client (who knows it cancelled the connection) can handle that.

@vvoland vvoland added the kind/bugfix PR's that fix bugs label Nov 25, 2024
@vvoland vvoland added this to the 28.0.0 milestone Nov 25, 2024
@Benehiko
Copy link
Member Author

regardless of what happens, we always want to drain the response body, which means that even if a client were to cancel it, it would have to go ahead and read the rest of the stream anyway

AFAIK the response body would be closed when the context is cancelled since the API request is done with http.NewRequestWithContext. You can see this behavior in the CLI using this patch. Below I am running a postgres container, but I don't have the image pulled yet. Mid pull I cancel the operation with a ctrl+c. The output of the pull progress is immediately halted - even though the jsonmessage.DisplayJSONMessagesStream has no context awareness, it still knows that the response body has been closed, so it returns without an error. If it were true that it would still need to drain the response body, you would possibly see some output still happen after cancelling the process.

⋊> ~/G/cli-3 on fix-run-ctx ◦ ./build/docker run postgres:latest    10:10:41
Unable to find image 'postgres:latest' locally
latest: Pulling from library/postgres
2d429b9e73a6: Pulling fs layer
3234e936b543: Pulling fs layer
b5b68d2b7dfa: Pulling fs layer
2bc4b686b410: Waiting
0e741c16b01d: Waiting
febd2e801cbc: Waiting
abad6e2f102b: Waiting
fc0ed0630c16: Waiting
15ea73ccc174: Waiting
bc241f8dfdda: Waiting
e6b9724cd240: Waiting
87f78d636266: Waiting
4bd7a2dad750: Waiting
4c07baf06858: Waiting
^Cdocker: context canceled

Run 'docker run --help' for more information
⋊> ~/G/cli-3 on fix-run-ctx ◦

@laurazard
Copy link
Member

AFAIK the response body would be closed when the context is cancelled since the API request is done with http.NewRequestWithContext.
even though the jsonmessage.DisplayJSONMessagesStream has no context awareness, it still knows that the response body has been closed, so it returns without an error. If it were true that it would still need to drain the response body, you would possibly see some output still happen after cancelling the process.

This doesn't conflict with what I'm saying – the request is cancelled, so DisplayJSONMessagesStream gets an EOF while processing the stream.

The fact that you get no more output after cancelling also doesn't disprove this – as long as DisplayJSONMessagesStream is fast enough/caught up and already processed the last message the daemon sent, there would be no more messages after you cancel, cancelling could just be stopping the pull and stopping the engine from adding any additional messages, and `DisplayJSONMessagesStream reaches the end of the body since no more is sent.

However, you might be right that the response doesn't need to be drained regardless, since the request has been cancelled – I'd have to check. I'd be careful about not draining responses though – we've had lots of issues with fd's being left open due to things like this.

@laurazard
Copy link
Member

We could still explicitly check for context cancellation though as I suggested in #5645 (comment), no need to try the follow up request if the context has been cancelled.

@laurazard
Copy link
Member

For cross-referencing/reference:

(iiuc) This was introduced in #5247, released in v27.1 (#5250).

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at my last few comments, we should probably squash all the commits into a single one (or at least all test-changes to go into a single commit, but as they're testing the first commit, could be squashed with that one as well)

Comment on lines +157 to +159
ctx, cancelFun := context.WithCancel(context.WithoutCancel(ctx))
defer cancelFun()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this context is only used by the attachContainer (we use a separate one for the (status wait), would it make sense to move this to that location to make it be next to the comment describing why we remove the cancel;

// ctx should not be cancellable here, as this would kill the stream to the container
// and we want to keep the stream open until the process in the container exits or until
// the user forcefully terminates the CLI.
closeFn, err := attachContainer(ctx, dockerCli, containerID, &errCh, config, container.AttachOptions{

⚠️ if we do, do we want statusCtx to be inherited from that context (for tracing purposes), or is that not relevant?

👇 if we do want statusCtx to be a "child" of the context used for attach, it should probably be something like this to not shadow the original context;

var cancelFun context.CancelFunc
ctx, cancelFun = context.WithCancel(context.WithoutCancel(ctx))
defer cancelFun()

closeFn, err := attachContainer(ctx, dockerCli, containerID, &errCh, config, container.AttachOptions{
// ....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👇 if we do want statusCtx to be a "child" of the context used for attach, it should probably be something like this to not shadow the original context;

Yeah, I believe this is the reason why we have context.WithCancel(context.WithoutCancel(xx))

Your suggestion looks good!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggested change looks alright, although there are some reasons why I don't want to do anymore changes in this PR. The code has become quite difficult to follow (in-terms of which context is used where) since we've been patching out bugs introduced by the context changes. This makes me a bit afraid of doing too many code changes since I don't want to accidentally re-introduce an edge case bug.

By moving line 144

ctx, cancelFun := context.WithCancel(ctx)
defer cancelFun()

down to the attachContainer call on line 184

// ctx should not be cancellable here, as this would kill the stream to the container
// and we want to keep the stream open until the process in the container exits or until
// the user forcefully terminates the CLI.
closeFn, err := attachContainer(ctx, dockerCli, containerID, &errCh, config, container.AttachOptions{

we are making it harder to read the code since line 209 uses the cancelFun in some cases

if err := apiClient.ContainerStart(ctx, containerID, container.StartOptions{}); err != nil {
// If we have hijackedIOStreamer, we should notify
// hijackedIOStreamer we are going to exit and wait
// to avoid the terminal are not restored.
if attach {
cancelFun()
<-errCh
}

Line 204 uses the context (without original cancel) from line 144

if err := apiClient.ContainerStart(ctx, containerID, container.StartOptions{}); err != nil {

We cancelled the top-level context originally in this commit 991b130 to revert the runContainer function to a pre-context behavior since there might exist too many unknown edge case behaviors. For that reason I tried to keep as close to the pre-context behavior as possible, by only moving the code stripping the context a few lines down.

Due to the shared ctx between line 144 and 184 and 204 and the cancelFun used on line 209, it makes me worried that by moving the ctx code from line 144 to line 184 (above attachContainer) we might have some case where the cancelFun is used but is now nil, or where a future change would be more difficult due to more spaghetti usage of ctx and cancelFun from line 144.

If it is alright with you @thaJeztah, I would rather we have future PRs that focus on making some of the code more context aware and slowly split up the runContainer function so we have less context spaghetti.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Benehiko yeah, that's fine; we can move this in a follow-up. It was mostly because the comment described something that happened elsewhere, so I thought moving the code there could help understanding, but it's not critical for sure.

case cmdErr := <-cmdErrC:
assert.Equal(t, cmdErr, cli.StatusError{
StatusCode: 125,
Status: "docker: context canceled\n\nRun 'docker run --help' for more information",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: if we backport this PR, this line probably needs to be adjusted for the 27.x branch (as the "Run 'docker run --help'" part changed in master / v28)

cli/command/container/run_test.go Show resolved Hide resolved
Comment on lines +230 to +232
createContainerFunc: func(config *container.Config, hostConfig *container.HostConfig, networkingConfig *network.NetworkingConfig,
platform *specs.Platform, containerName string,
) (container.CreateResponse, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be touching up the earlier commit (test: pull on run should be ctx cancellable); better to combine it with that one.

@@ -25,7 +25,7 @@ type fakeClient struct {
platform *specs.Platform,
containerName string) (container.CreateResponse, error)
containerStartFunc func(containerID string, options container.StartOptions) error
imageCreateFunc func(parentReference string, options image.CreateOptions) (io.ReadCloser, error)
imageCreateFunc func(ctx context.Context, parentReference string, options image.CreateOptions) (io.ReadCloser, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be combined with the previous commit, because the previous commit doesn't build;

./run_test.go:241:20: cannot use func(ctx context.Context, parentReference string, options image.CreateOptions) (io.ReadCloser, error) {…} (value of type func(ctx "context".Context, parentReference string, options "github.com/docker/docker/api/types/image".CreateOptions) (io.ReadCloser, error)) as func(parentReference string, options "github.com/docker/docker/api/types/image".CreateOptions) (io.ReadCloser, error) value in struct literal

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Going back to the other conversation about commits, another rule I generally try to follow is that
every commit should be self-contained and buildable – if we need to revert a commit, it's always best if we can just revert a single commit than to have to remember to also revert other commits.

This patch fixes the context cancellation
behaviour for the `runContainer` function,
specifically the `createContainer` function
introduced in this commit docker@991b130.

It delays stripping the `cancel` from the context
passed into the `runContainer` function so that
the `createContainer` function can be cancelled
gracefully by a SIGTERM/SIGINT.

This is especially true when the requested image
does not exist and `docker run` needs to `pull`
the image before creating the container.

Although this patch does gracefully cancel
the `runContainer` function it does not address
the root cause. Some functions in the call path
are not context aware, such as `pullImage`.

Future work would still be necessary to ensure
a consistent behaviour in the CLI.

Signed-off-by: Alano Terblanche <[email protected]>
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@laurazard laurazard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ✅

I'd like to see a larger refactor + more tests here, but this fixes the issue!

@laurazard laurazard merged commit 2369935 into docker:master Nov 29, 2024
89 checks passed
@thaJeztah
Copy link
Member

@Benehiko can you open a cherry-pick / backport for the 27.x branch?

@Benehiko
Copy link
Member Author

@thaJeztah Done #5654

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pulling image with docker run doesn't cancel gracefully
5 participants