Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flake: waiter_test is flaky #7226

Closed
Yongxuanzhang opened this issue Oct 17, 2023 · 4 comments · Fixed by #7227
Closed

Flake: waiter_test is flaky #7226

Yongxuanzhang opened this issue Oct 17, 2023 · 4 comments · Fixed by #7227
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flakey test

Comments

@tekton-robot tekton-robot added the kind/flake Categorizes issue or PR as related to a flakey test label Oct 17, 2023
@Yongxuanzhang
Copy link
Member Author

@chengjoey do you want to take a look since you have more contexts?

@chengjoey
Copy link
Member

/assign

@chengjoey
Copy link
Member

chengjoey commented Oct 18, 2023

because goroutine was added to entrypointer in #6511, the fakeWaiter append cancel file in the TestEntrypointer test is flaky.

// start a goroutine to listen for cancellation file
go func() {
if err := e.waitingCancellation(ctx, cancel); err != nil {
logger.Error("Error while waiting for cancellation", zap.Error(err))
}
}()
err = e.Runner.Run(ctx, e.Command...)

this make https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7193/pull-tekton-pipeline-unit-tests/1714303905107546112 test failed

@chengjoey
Copy link
Member

chengjoey commented Oct 18, 2023

Version 0.52x and before
in the for loop, the file will be detected directly without waiting for the first time.

func (rw *realWaiter) Wait(file string, expectContent bool, breakpointOnFailure bool) error {
if file == "" {
return nil
}
for ; ; time.Sleep(rw.waitPollingInterval) {
if info, err := os.Stat(file); err == nil {
if !expectContent || info.Size() > 0 {
return nil
}
} else if !os.IsNotExist(err) {
return fmt.Errorf("waiting for %q: %w", file, err)
}
// When a .err file is read by this step, it means that a previous step has failed
// We wouldn't want this step to stop executing because the previous step failed during debug
// That is counterproductive to debugging
// Hence we disable skipError here so that the other steps in the failed taskRun can continue
// executing if breakpointOnFailure is enabled for the taskRun
// TLDR: Do not return skipError when breakpointOnFailure is enabled as it breaks execution of the TaskRun
if _, err := os.Stat(file + ".err"); err == nil {
if breakpointOnFailure {
return nil
}
return skipError("error file present, bail and skip the step")
}
}
}

in latest version:
for the first time, need to wait for waitPollingInterval before file detection is performed. The select statement block should be moved to the bottom

func (rw *realWaiter) Wait(ctx context.Context, file string, expectContent bool, breakpointOnFailure bool) error {
if file == "" {
return nil
}
for {
select {
case <-ctx.Done():
if errors.Is(ctx.Err(), context.Canceled) {
return entrypoint.ErrContextCanceled
}
if errors.Is(ctx.Err(), context.DeadlineExceeded) {
return entrypoint.ErrContextDeadlineExceeded
}
return nil
case <-time.After(rw.waitPollingInterval):
}
if info, err := os.Stat(file); err == nil {
if !expectContent || info.Size() > 0 {
return nil
}
} else if !os.IsNotExist(err) {
return fmt.Errorf("waiting for %q: %w", file, err)
}
// When a .err file is read by this step, it means that a previous step has failed
// We wouldn't want this step to stop executing because the previous step failed during debug
// That is counterproductive to debugging
// Hence we disable skipError here so that the other steps in the failed taskRun can continue
// executing if breakpointOnFailure is enabled for the taskRun
// TLDR: Do not return skipError when breakpointOnFailure is enabled as it breaks execution of the TaskRun
if _, err := os.Stat(file + ".err"); err == nil {
if breakpointOnFailure {
return nil
}
return skipError("error file present, bail and skip the step")
}
}
}

related faild tests:
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7167/pull-tekton-pipeline-unit-tests/1714307892095488000
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7204/pull-tekton-pipeline-unit-tests/1714305917400387584
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7167/pull-tekton-pipeline-unit-tests/1714304663337046016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/flake Categorizes issue or PR as related to a flakey test
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants