Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] finalizer approach to fix race condition due to pruning in results watcher #703

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ramessesii2
Copy link
Member

@ramessesii2 ramessesii2 commented Feb 5, 2024

Changes

/kind bug

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you review them:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Tested your changes locally (if this is a code change)
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user-facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contain the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

@tekton-robot tekton-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesnt merit a release note. labels Feb 5, 2024
@tekton-robot tekton-robot requested review from enarha and khrm February 5, 2024 12:43
@tekton-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign sayan-biswas after the PR has been reviewed.
You can assign the PR to them by writing /assign @sayan-biswas in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 5, 2024
@tekton-robot
Copy link

Hi @ramessesii2. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member Author

@ramessesii2 ramessesii2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debugging showed few anomalies in the send log process. For e.g. - this piece of code from tknlog writer is blocking function execution.
I've added comment on other items as well.

@sayan-biswas @enarha fyi

@@ -342,11 +356,6 @@ func (r *Reconciler) sendLog(ctx context.Context, o results.Object) error {
zap.Error(err),
)
}
logger.Debugw("Streaming log completed",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This asynchronous functions never reaches this point if there is a successful streaming of logs while we do see this when there an error while streaming logs

Copy link
Contributor

@khrm khrm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test

@tekton-robot tekton-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 5, 2024
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 60.8% -8.5

IsReadyForDeletionFunc: func(ctx context.Context, object results.Object) (bool, error) {
if LogsFinalizerExist(object, LogFinalizer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that this is not needed, as it is redundant with the finalizer pattern.

I would expect the resource to be marked as being deleted, but the deletion to be defered by k8s until the last finalizer has been removed.

I could definitely be wrong though 😉

Copy link
Member Author

@ramessesii2 ramessesii2 Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right deletion is deferred if the Finalizer is there.
So, prior to that change we could see the following message while debugging even if the resource is not deleted properly(due to Finalizer) which is misleading and more importantly we do go on to execute AfterDeletion

Copy link
Contributor

@gabemontero gabemontero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The child context with timeout bit idea from @adambkaplan looks like how I expected it .... just the moving of the CloseSend is needed @ramessesii2

if gofuncerr != nil {
logger.Error(gofuncerr)
}
if gofuncerr = logsClient.CloseSend(); gofuncerr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per comment from @adambkaplan in our internal discussion @ramessesii2 we want to execute the CloseSend() here regardless of which case in the select fires, meaning it needs to be moved out of this case, out of the select in fact, so that it is also executed when ctx.Done() is processed as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e release the streaming resources regardless if the tkn client library is not properly reconciling the error channel used to block the goroutine from completing

that at least would help slow things .... cleaning up the threads (or moving to a different model) could become a separate concern based on what you find as you drill down on your suspicion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we've confirmed in an internal slack thread including @ramessesii2 @sayan-biswas and myself that the streamLogs code in the "main" thread needs to close the channel the goroutine is blocked on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK @ramessesii2 your call to CloseSend minimally is still only in the case writeErr := <-echan: block

we need to call this as well if the case <-ctx.Done(): block fire first ... i.e. there was some weird hang on the write, so we can abort and free up the resources

my initial assumption is the flush is irrelevant in this error case and case stay where it is, but do let us know if your testing uncovers a different conclusion

please update your PR so we can close this thread and have @pmacik 's next round of testing include this change

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, I've done a bit more of refactoring. Hopefully, this is what you're asking as well here. Pushing the changes now.

@ramessesii2 ramessesii2 force-pushed the RAMESSESII2/pruner-bug-2 branch from 1d1873d to b22378a Compare February 8, 2024 14:23
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 60.2% -9.1


// logctx is derived from ctx. Therefore, if ctx is cancelled (either explicitly through a call to its cancel
// function or when it reaches its deadline), logctx will be cancelled automatically.
logctx, _ := context.WithTimeout(ctx, 10*time.Minute)

go func(ctx context.Context, echan <-chan error, o metav1.Object) {
defer close(errChanRepeater)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good use of defer @ramessesii2 @sayan-biswas

where are we closing the original logChan and errChan from the reader.Read() ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logChan and errChan channels are receiver ends. When the channels ae no longer used, they will be garbage collected.
It's new subject for me, this is my reference - https://stackoverflow.com/questions/8593645/is-it-ok-to-leave-a-channel-open

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah there was a bit of back and forth on that thread @ramessesii2 but yes I think that is the end interpretation.

good find / research @ramessesii2

that said, given the exchanges I just saw with @pmacik today in slack, I should probably make a more thorough pass of the changes here .... admittedly I only focused on the mem leak bit initially.

will do that today

thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok @ramessesii2 @sayan-biswas I think I looked at this too quickly yesterday

the close of the errChanRepeater should not be in the goroutine that blocks on it !!

move the defer close(errChanRepeater) to immediately after the declaration of errChanRepeater in your current line 400

Copy link
Member Author

@ramessesii2 ramessesii2 Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I'm deferring close(errChanRepeater) in this goroutine is because it's important to tell the receiving goroutines that all data have been sent which is here:

tknlog.NewWriter(logType, true).Write(&cli.Stream{
		Out: writer,
		Err: writer,
	}, logChan, errChanRepeater)

else the above code will not exit the execution if we don't close errChanRepeater before we do a tknlog.NewWriter()...... I confirmed this as well while debugging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep we are good here per the thread below

Copy link
Contributor

@gabemontero gabemontero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have more requests for changes

pkg/watcher/reconciler/dynamic/dynamic.go Show resolved Hide resolved
if gofuncerr != nil {
logger.Error(gofuncerr)
}
if gofuncerr = logsClient.CloseSend(); gofuncerr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK @ramessesii2 your call to CloseSend minimally is still only in the case writeErr := <-echan: block

we need to call this as well if the case <-ctx.Done(): block fire first ... i.e. there was some weird hang on the write, so we can abort and free up the resources

my initial assumption is the flush is irrelevant in this error case and case stay where it is, but do let us know if your testing uncovers a different conclusion

please update your PR so we can close this thread and have @pmacik 's next round of testing include this change


// logctx is derived from ctx. Therefore, if ctx is cancelled (either explicitly through a call to its cancel
// function or when it reaches its deadline), logctx will be cancelled automatically.
logctx, _ := context.WithTimeout(ctx, 10*time.Minute)

go func(ctx context.Context, echan <-chan error, o metav1.Object) {
defer close(errChanRepeater)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok @ramessesii2 @sayan-biswas I think I looked at this too quickly yesterday

the close of the errChanRepeater should not be in the goroutine that blocks on it !!

move the defer close(errChanRepeater) to immediately after the declaration of errChanRepeater in your current line 400

pkg/watcher/reconciler/dynamic/dynamic.go Outdated Show resolved Hide resolved
pkg/watcher/reconciler/dynamic/dynamic.go Outdated Show resolved Hide resolved
pkg/watcher/reconciler/dynamic/dynamic.go Outdated Show resolved Hide resolved
gabemontero and others added 4 commits February 12, 2024 12:47
rh-pre-commit.version: 2.1.0
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Satyam Bhardwaj <[email protected]>

rh-pre-commit.version: 2.1.0
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Satyam Bhardwaj <[email protected]>

rh-pre-commit.version: 2.1.0
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Satyam Bhardwaj <[email protected]>

rh-pre-commit.version: 2.1.0
rh-pre-commit.check-secrets: ENABLED
@ramessesii2 ramessesii2 force-pushed the RAMESSESII2/pruner-bug-2 branch from b22378a to 52aafe7 Compare February 12, 2024 11:52
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.0% -8.4

@tekton-robot
Copy link

@ramessesii2: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-tekton-results-build-tests 52aafe7 link true /test pull-tekton-results-build-tests
pull-tekton-results-integration-tests 52aafe7 link true /test pull-tekton-results-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

zap.String("namespace", o.GetNamespace()),
zap.String("name", o.GetName()),
)
case writeErr := <-echan:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to close the loop @ramessesii2 on you comment https://github.com/tektoncd/results/pull/703/files#r1486049142 if you are only closing the channel in the go routine, what unblocks this call?

are we always hitting the case <-ctx.Done() path ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Gabe,
We don't hit ctx.Done() most of the time. Actually, I've never encountered that bcs of 10 min of timeout.

case writeErr := <-echan, what unblocks this call?

My understanding is based on the fact that even if echan is closed by tkn reader, reading a closed channel, it will always be successful, returning the zero value.
We could run into problem if echan is a nil. Reading from or writing to a nil channel causes code to hang forever. But since the check was not there in the original Results code I assume we expect a non-nil channel from tknReader.
Although, in the next iteration, I'll address it. I missed this edge case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neither the tekton reader or writer are closing this channel based on the code I've seen @ramessesii2

can you provide the code link other than this goroutine where this channel is being closed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nevermind I found it :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, the flow is tknReader.read() -> readPipelineLog() -> finally landing to readAvailablePipelineLogs() for our case since we only stream longs once PipelineRuns/TRs have a specific closed state -> close echan (errC)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure how the original echan could ever be nil, but in general defensive programming is good :-)

I'll wait for you update and then we can go from there - thanks

@ramessesii2
Copy link
Member Author

hi @gabemontero @adambkaplan with finalizers, at least locally, I've been able to fix the race condition. There's a small caveat with finalizer. Bcs we add finalizer for the PR to not get pruned until the streaming/sending of logs is done. Finalizer along with the PipelineRun object is stored as well.
While that might not be a deal breaker (I'm not sure though), but I find it simpler to utilize this field from LogStatus to simply hold on to pruning until we find isStored: true.

@gabemontero
Copy link
Contributor

hi @gabemontero @adambkaplan with finalizers, at least locally, I've been able to fix the race condition. There's a small caveat with finalizer. Bcs we add finalizer for the PR to not get pruned until the streaming/sending of logs is done. Finalizer along with the PipelineRun object is stored as well. While that might not be a deal breaker (I'm not sure though), but I find it simpler to utilize this field from LogStatus to simply hold on to pruning until we find isStored: true.

for me at least I like using IsStored() instead as well @ramessesii2 @adambkaplan @ramessesii2 @sayan-biswas assuming the watcher reconciler that handles pruning can watch for that and requeue if it is not yet stored

I believe that is the case based on what I recall from the fix for handling cancelled pipeline/task runs

putting metadata i.e. in the object we are storing seems more fragile to me in hindsight

what do you all think?

perhaps as part of breaking out the mem leak fix from this one, either create a separate PR or a separate commit in this PR so we can compare IsStored vs. finalizer

@ramessesii2 ramessesii2 changed the title [WIP] fix race condition due to pruning in results watcher [WIP] finalizer approach to fix race condition due to pruning in results watcher Feb 16, 2024
@ramessesii2
Copy link
Member Author

FYI : #713 uses Logs API to address race condition

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2024
@tekton-robot
Copy link

@ramessesii2: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesnt merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants