-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: ctx on run image pull #5645
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #5645 +/- ##
=======================================
Coverage 59.49% 59.50%
=======================================
Files 346 346
Lines 29371 29369 -2
=======================================
+ Hits 17474 17475 +1
+ Misses 10922 10920 -2
+ Partials 975 974 -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM, but these commits need a rebase.
There's no reason for b42b50a and 2da6244 to be separate, and their commit messages don't explain much of anything –
fix: ctx on run image pull
and 2da6244:
chore: ensure we strip the main ctx cancel
Can you squash them and include a message that explains why this change was done?
b526678
to
6579e37
Compare
For reference, check the Linux kernel's guidelines on commit messages. Specifically:
Describing what the commit is actually changing is okay, but people can tell that from reading the changes. Describing the why is IMO a lot more useful. |
Somewhat orthogonal, but I was thinking about this:
The client could check if the context was cancelled after the if err := pullImage(ctx, dockerCli, config.Image, options); err != nil {
return err
}
select {
case <-ctx.Done():
return ctx.Err()
default:
}
... That way, we'd always do the right thing. Furthermore, I'm not 100% convinced about introducing a Another benefit of having it this way is that there's clearer separation of concerns – |
6579e37
to
3c1151f
Compare
AFAIK the response body would be closed when the context is cancelled since the API request is done with ⋊> ~/G/cli-3 on fix-run-ctx ◦ ./build/docker run postgres:latest 10:10:41
Unable to find image 'postgres:latest' locally
latest: Pulling from library/postgres
2d429b9e73a6: Pulling fs layer
3234e936b543: Pulling fs layer
b5b68d2b7dfa: Pulling fs layer
2bc4b686b410: Waiting
0e741c16b01d: Waiting
febd2e801cbc: Waiting
abad6e2f102b: Waiting
fc0ed0630c16: Waiting
15ea73ccc174: Waiting
bc241f8dfdda: Waiting
e6b9724cd240: Waiting
87f78d636266: Waiting
4bd7a2dad750: Waiting
4c07baf06858: Waiting
^Cdocker: context canceled
Run 'docker run --help' for more information
⋊> ~/G/cli-3 on fix-run-ctx ◦ |
This doesn't conflict with what I'm saying – the request is cancelled, so The fact that you get no more output after cancelling also doesn't disprove this – as long as However, you might be right that the response doesn't need to be drained regardless, since the request has been cancelled – I'd have to check. I'd be careful about not draining responses though – we've had lots of issues with fd's being left open due to things like this. |
We could still explicitly check for context cancellation though as I suggested in #5645 (comment), no need to try the follow up request if the context has been cancelled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at my last few comments, we should probably squash all the commits into a single one (or at least all test-changes to go into a single commit, but as they're testing the first commit, could be squashed with that one as well)
ctx, cancelFun := context.WithCancel(context.WithoutCancel(ctx)) | ||
defer cancelFun() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this context is only used by the attachContainer
(we use a separate one for the (status wait), would it make sense to move this to that location to make it be next to the comment describing why we remove the cancel;
cli/cli/command/container/run.go
Lines 181 to 184 in ba1a154
// ctx should not be cancellable here, as this would kill the stream to the container | |
// and we want to keep the stream open until the process in the container exits or until | |
// the user forcefully terminates the CLI. | |
closeFn, err := attachContainer(ctx, dockerCli, containerID, &errCh, config, container.AttachOptions{ |
❓ statusCtx
to be inherited from that context (for tracing purposes), or is that not relevant?
👇 if we do want statusCtx
to be a "child" of the context used for attach, it should probably be something like this to not shadow the original context;
var cancelFun context.CancelFunc
ctx, cancelFun = context.WithCancel(context.WithoutCancel(ctx))
defer cancelFun()
closeFn, err := attachContainer(ctx, dockerCli, containerID, &errCh, config, container.AttachOptions{
// ....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👇 if we do want statusCtx to be a "child" of the context used for attach, it should probably be something like this to not shadow the original context;
Yeah, I believe this is the reason why we have context.WithCancel(context.WithoutCancel(xx))
Your suggestion looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggested change looks alright, although there are some reasons why I don't want to do anymore changes in this PR. The code has become quite difficult to follow (in-terms of which context is used where) since we've been patching out bugs introduced by the context changes. This makes me a bit afraid of doing too many code changes since I don't want to accidentally re-introduce an edge case bug.
By moving line 144
cli/cli/command/container/run.go
Lines 144 to 145 in ba1a154
ctx, cancelFun := context.WithCancel(ctx) | |
defer cancelFun() |
down to the attachContainer
call on line 184
cli/cli/command/container/run.go
Lines 181 to 184 in ba1a154
// ctx should not be cancellable here, as this would kill the stream to the container | |
// and we want to keep the stream open until the process in the container exits or until | |
// the user forcefully terminates the CLI. | |
closeFn, err := attachContainer(ctx, dockerCli, containerID, &errCh, config, container.AttachOptions{ |
we are making it harder to read the code since line 209 uses the cancelFun
in some cases
cli/cli/command/container/run.go
Lines 204 to 211 in ba1a154
if err := apiClient.ContainerStart(ctx, containerID, container.StartOptions{}); err != nil { | |
// If we have hijackedIOStreamer, we should notify | |
// hijackedIOStreamer we are going to exit and wait | |
// to avoid the terminal are not restored. | |
if attach { | |
cancelFun() | |
<-errCh | |
} |
Line 204 uses the context (without original cancel) from line 144
cli/cli/command/container/run.go
Line 204 in ba1a154
if err := apiClient.ContainerStart(ctx, containerID, container.StartOptions{}); err != nil { |
We cancelled the top-level context originally in this commit 991b130 to revert the runContainer
function to a pre-context behavior since there might exist too many unknown edge case behaviors. For that reason I tried to keep as close to the pre-context behavior as possible, by only moving the code stripping the context a few lines down.
Due to the shared ctx
between line 144 and 184 and 204 and the cancelFun
used on line 209, it makes me worried that by moving the ctx
code from line 144 to line 184 (above attachContainer
) we might have some case where the cancelFun
is used but is now nil, or where a future change would be more difficult due to more spaghetti usage of ctx
and cancelFun
from line 144.
If it is alright with you @thaJeztah, I would rather we have future PRs that focus on making some of the code more context aware and slowly split up the runContainer
function so we have less context
spaghetti.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Benehiko yeah, that's fine; we can move this in a follow-up. It was mostly because the comment described something that happened elsewhere, so I thought moving the code there could help understanding, but it's not critical for sure.
case cmdErr := <-cmdErrC: | ||
assert.Equal(t, cmdErr, cli.StatusError{ | ||
StatusCode: 125, | ||
Status: "docker: context canceled\n\nRun 'docker run --help' for more information", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note to self: if we backport this PR, this line probably needs to be adjusted for the 27.x branch (as the "Run 'docker run --help'" part changed in master / v28)
createContainerFunc: func(config *container.Config, hostConfig *container.HostConfig, networkingConfig *network.NetworkingConfig, | ||
platform *specs.Platform, containerName string, | ||
) (container.CreateResponse, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to be touching up the earlier commit (test: pull on run should be ctx cancellable
); better to combine it with that one.
@@ -25,7 +25,7 @@ type fakeClient struct { | |||
platform *specs.Platform, | |||
containerName string) (container.CreateResponse, error) | |||
containerStartFunc func(containerID string, options container.StartOptions) error | |||
imageCreateFunc func(parentReference string, options image.CreateOptions) (io.ReadCloser, error) | |||
imageCreateFunc func(ctx context.Context, parentReference string, options image.CreateOptions) (io.ReadCloser, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be combined with the previous commit, because the previous commit doesn't build;
./run_test.go:241:20: cannot use func(ctx context.Context, parentReference string, options image.CreateOptions) (io.ReadCloser, error) {…} (value of type func(ctx "context".Context, parentReference string, options "github.com/docker/docker/api/types/image".CreateOptions) (io.ReadCloser, error)) as func(parentReference string, options "github.com/docker/docker/api/types/image".CreateOptions) (io.ReadCloser, error) value in struct literal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Going back to the other conversation about commits, another rule I generally try to follow is that
every commit should be self-contained and buildable – if we need to revert a commit, it's always best if we can just revert a single commit than to have to remember to also revert other commits.
This patch fixes the context cancellation behaviour for the `runContainer` function, specifically the `createContainer` function introduced in this commit docker@991b130. It delays stripping the `cancel` from the context passed into the `runContainer` function so that the `createContainer` function can be cancelled gracefully by a SIGTERM/SIGINT. This is especially true when the requested image does not exist and `docker run` needs to `pull` the image before creating the container. Although this patch does gracefully cancel the `runContainer` function it does not address the root cause. Some functions in the call path are not context aware, such as `pullImage`. Future work would still be necessary to ensure a consistent behaviour in the CLI. Signed-off-by: Alano Terblanche <[email protected]>
3c1151f
to
30a73ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM ✅
I'd like to see a larger refactor + more tests here, but this fixes the issue!
@Benehiko can you open a cherry-pick / backport for the 27.x branch? |
@thaJeztah Done #5654 |
Fixes #5639
After some digging I found that we don't actually explicitly return an error on the
pullImage
function here. Once the context is cancelled the response is closed since the API client useshttp.NewRequestWithContext
which then exits thejsonmessage.DisplayJSONMessagesToStream(responseBody, out, nil)
without any errors.The following API request to
ContainerCreate
then immediately fails since the context is closed, which then returns an error.See this section of the code, it fails to find the image, then pulls the image, and continues to retry creating a container - on this second retry it only then returns with the context error.
https://github.com/docker/cli/blob/master/cli/command/container/create.go#L271-L289
Writing a test for this means that we are building into the test an assumption that we won't error on the
pullImage
function, but rather on the second call toContainerCreate
. I'll push what I have, but I honestly think that we will get tripped up by thepullImage
function again since it's not context aware - actually more so thejsonmessage
package it is using is actually the root cause found in Moby here. It's a bit of a hornets nest since it gets used in many places and by refactoring/improving this it could cause edge case bugs.- What I did
Delay stripping the
cancel
from the functionctx
so we can still cancel thecreateContainer
function.cli/cli/command/container/run.go
Line 142 in e4e0ea2
- How I did it
- How to verify it
Unit test
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)