[WIP] finalizer approach to fix race condition due to pruning in results watcher #703
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing the appropriate command in a comment.
Hi @ramessesii2. Thanks for your PR. I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command; once the patch is verified, the new status will be reflected accordingly. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Debugging showed a few anomalies in the send-log process. For example, this piece of code from the tknlog writer blocks function execution.
I've added comments on the other items as well.
@sayan-biswas @enarha FYI
@@ -342,11 +356,6 @@ func (r *Reconciler) sendLog(ctx context.Context, o results.Object) error {
zap.Error(err),
)
}
logger.Debugw("Streaming log completed",
This asynchronous function never reaches this point when logs stream successfully, but we do see it when there is an error while streaming logs.
/ok-to-test
The following is the coverage report on the affected files.
IsReadyForDeletionFunc: func(ctx context.Context, object results.Object) (bool, error) {
	if LogsFinalizerExist(object, LogFinalizer) {
I would think that this is not needed, as it is redundant with the finalizer pattern.
I would expect the resource to be marked as being deleted, but the deletion to be deferred by k8s until the last finalizer has been removed.
I could definitely be wrong though 😉
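For reference, a rough sketch of the generic finalizer lifecycle being described here; the package name, the LogFinalizer value, and the logsStored flag are illustrative assumptions, not the PR's actual code:

```go
// Illustrative sketch of the finalizer lifecycle only; not the PR's code.
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LogFinalizer is an assumed name used purely for this illustration.
const LogFinalizer = "results.tekton.dev/streaming-logs"

func hasFinalizer(obj metav1.Object, name string) bool {
	for _, f := range obj.GetFinalizers() {
		if f == name {
			return true
		}
	}
	return false
}

// reconcileFinalizer sketches the lifecycle: while the object is live we keep
// the finalizer on it so pruning waits for log streaming; once deletion is
// requested, k8s only sets metadata.deletionTimestamp and defers the actual
// removal until the finalizers list is empty.
func reconcileFinalizer(obj metav1.Object, logsStored bool) {
	if obj.GetDeletionTimestamp() == nil {
		if !hasFinalizer(obj, LogFinalizer) {
			obj.SetFinalizers(append(obj.GetFinalizers(), LogFinalizer))
		}
		return
	}
	// Being deleted: drop the finalizer only once the logs are safely stored,
	// which is what finally lets the API server remove the object.
	if logsStored && hasFinalizer(obj, LogFinalizer) {
		kept := []string{}
		for _, f := range obj.GetFinalizers() {
			if f != LogFinalizer {
				kept = append(kept, f)
			}
		}
		obj.SetFinalizers(kept)
	}
}
```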
You're right, deletion is deferred while the finalizer is present.
So, prior to that change, we could see the following message while debugging even when the resource had not actually been deleted yet (because of the finalizer), which is misleading; more importantly, we would still go on to execute AfterDeletion.
The child-context-with-timeout idea from @adambkaplan looks like what I expected ... just the moving of the CloseSend is still needed @ramessesii2
if gofuncerr != nil {
	logger.Error(gofuncerr)
}
if gofuncerr = logsClient.CloseSend(); gofuncerr != nil {
Per the comment from @adambkaplan in our internal discussion, @ramessesii2, we want to execute CloseSend() here regardless of which case in the select fires. That means it needs to be moved out of this case, and out of the select in fact, so that it is also executed when ctx.Done() is processed.
i.e. release the streaming resources regardless, in case the tkn client library is not properly reconciling the error channel used to block the goroutine from completing.
That would at least help slow things down ... cleaning up the threads (or moving to a different model) can become a separate concern based on what you find as you drill down on your suspicion.
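Roughly, a sketch of the restructuring being requested might look like the following; it assumes logsClient, logger, errChanRepeater, logctx, and o behave as in the fragments quoted in this thread, and it is not a verbatim copy of the PR:

```go
go func(ctx context.Context, echan <-chan error, o metav1.Object) {
	// CloseSend now runs regardless of which select case fires, so the
	// log stream is released even when the context is cancelled.
	defer func() {
		if err := logsClient.CloseSend(); err != nil {
			logger.Error(err)
		}
	}()

	select {
	case <-ctx.Done():
		// The write hung or timed out; abort and let the deferred
		// CloseSend free the streaming resources.
		logger.Warnw("log streaming cancelled or timed out",
			zap.String("namespace", o.GetNamespace()),
			zap.String("name", o.GetName()),
		)
	case writeErr := <-echan:
		if writeErr != nil {
			logger.Error(writeErr)
		}
		// Any flush of the buffered writer can stay in this case only,
		// per the discussion above.
	}
}(logctx, errChanRepeater, o)
```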
So we've confirmed in an internal Slack thread including @ramessesii2, @sayan-biswas, and myself that the streamLogs code in the "main" thread needs to close the channel the goroutine is blocked on.
OK @ramessesii2, your call to CloseSend is still only in the case writeErr := <-echan: block.
We need to call it as well if the case <-ctx.Done(): block fires first ... i.e. there was some weird hang on the write, so we can abort and free up the resources.
My initial assumption is that the flush is irrelevant in this error case and can stay where it is, but do let us know if your testing uncovers a different conclusion.
Please update your PR so we can close this thread and have @pmacik's next round of testing include this change.
Done. I've also done a bit more refactoring; hopefully this is what you're asking for here as well. Pushing the changes now.
Force-pushed from 1d1873d to b22378a
The following is the coverage report on the affected files.
// logctx is derived from ctx. Therefore, if ctx is cancelled (either explicitly through a call to its cancel
// function or when it reaches its deadline), logctx will be cancelled automatically.
logctx, _ := context.WithTimeout(ctx, 10*time.Minute)

go func(ctx context.Context, echan <-chan error, o metav1.Object) {
	defer close(errChanRepeater)
Good use of defer @ramessesii2 @sayan-biswas.
Where are we closing the original logChan and errChan from the reader.Read()?
The logChan and errChan channels are the receiving ends here. When the channels are no longer used, they will be garbage collected.
It's a new subject for me; this is my reference - https://stackoverflow.com/questions/8593645/is-it-ok-to-leave-a-channel-open
Yeah, there was a bit of back and forth on that thread @ramessesii2, but yes, I think that is the end interpretation.
Good find / research @ramessesii2.
That said, given the exchanges I just saw with @pmacik today in Slack, I should probably make a more thorough pass of the changes here ... admittedly I only focused on the mem leak bit initially.
Will do that today.
Thanks
OK @ramessesii2 @sayan-biswas, I think I looked at this too quickly yesterday.
The close of errChanRepeater should not be in the goroutine that blocks on it!!
Move the defer close(errChanRepeater) to immediately after the declaration of errChanRepeater at your current line 400.
The reason I'm deferring close(errChanRepeater) in this goroutine is that it's important to tell the receiving goroutine that all data has been sent, which happens here:
tknlog.NewWriter(logType, true).Write(&cli.Stream{
    Out: writer,
    Err: writer,
}, logChan, errChanRepeater)
Otherwise the tknlog.NewWriter(...).Write(...) call above never finishes executing if we don't close errChanRepeater. I confirmed this while debugging as well.
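To illustrate that point with generic Go (not tkn code): a consumer that drains an error channel only returns once the sender closes it, which is why the Write call above would otherwise block.

```go
package main

import "fmt"

// consume drains errs the way a log writer drains its error channel:
// the range loop only ends once the channel is closed.
func consume(errs <-chan error) {
	for err := range errs {
		fmt.Println("got:", err)
	}
	fmt.Println("consumer returned because the channel was closed")
}

func main() {
	errs := make(chan error)
	go func() {
		errs <- fmt.Errorf("example error")
		close(errs) // without this close, consume() would block forever
	}()
	consume(errs)
}
```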
yep we are good here per the thread below
I have more requests for changes.
Signed-off-by: Satyam Bhardwaj <[email protected]>
Force-pushed from b22378a to 52aafe7
The following is the coverage report on the affected files.
@ramessesii2: The following tests failed:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
zap.String("namespace", o.GetNamespace()), | ||
zap.String("name", o.GetName()), | ||
) | ||
case writeErr := <-echan: |
To close the loop @ramessesii2 on your comment https://github.com/tektoncd/results/pull/703/files#r1486049142: if you are only closing the channel in the goroutine, what unblocks this call?
Are we always hitting the case <-ctx.Done() path?
Hey Gabe,
We don't hit ctx.Done() most of the time. Actually, I've never encountered it, because of the 10-minute timeout.
As for case writeErr := <-echan and what unblocks this call: my understanding is that even if echan is closed by the tkn reader, receiving from a closed channel always succeeds, returning the zero value.
We could run into a problem if echan is nil. Receiving from or sending to a nil channel hangs forever. But since that check was not in the original Results code, I assume we expect a non-nil channel from tknReader.
I'll address that edge case in the next iteration; I missed it.
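For reference, a small self-contained illustration of the two behaviours described above (receiving from a closed channel versus a nil channel); this is generic Go, not Results code:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Receiving from a closed channel never blocks: it immediately
	// yields the zero value (nil for an error channel) and ok == false.
	closed := make(chan error)
	close(closed)
	err, ok := <-closed
	fmt.Println(err, ok) // prints: <nil> false

	// Receiving from a nil channel blocks forever; the timeout here is
	// only so this demo terminates.
	var nilChan chan error
	select {
	case <-nilChan:
		fmt.Println("never reached")
	case <-time.After(100 * time.Millisecond):
		fmt.Println("receive from nil channel blocked, as expected")
	}
}
```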
Neither the tekton reader nor the writer is closing this channel, based on the code I've seen @ramessesii2.
Can you provide a code link, other than this goroutine, where this channel is being closed?
nevermind I found it :-)
Sure. The flow is tknReader.read() -> readPipelineLog() -> finally landing in readAvailablePipelineLogs() for our case, since we only stream logs once PipelineRuns/TaskRuns have reached a specific closed state -> close echan (errC).
Not sure how the original echan could ever be nil, but in general defensive programming is good :-)
I'll wait for your update and then we can go from there - thanks
Hi @gabemontero @adambkaplan, with finalizers I've been able to fix the race condition, at least locally. There's a small caveat with the finalizer: because we add it so the PipelineRun does not get pruned until the streaming/sending of logs is done, the finalizer gets stored along with the PipelineRun object as well.
For me at least, I like using IsStored() instead as well @ramessesii2 @adambkaplan @sayan-biswas, assuming the watcher reconciler that handles pruning can watch for that and requeue if the record is not yet stored. I believe that is the case based on what I recall from the fix for handling cancelled pipeline/task runs. Putting metadata in the object we are storing seems more fragile to me in hindsight. What do you all think? Perhaps as part of breaking out the mem leak fix from this one, either create a separate PR or a separate commit in this PR so we can compare IsStored vs. finalizer.
FYI: #713 uses the Logs API to address the race condition.
@ramessesii2: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Changes
/kind bug
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep.
Release Notes