Flake: waiter_test is flaky #7226

Yongxuanzhang · 2023-10-17T17:26:34Z

Since #6511 was merged, unit tests appear to be failing intermittently in many runs:
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7167/pull-tekton-pipeline-unit-tests/1714307892095488000
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7204/pull-tekton-pipeline-unit-tests/1714305917400387584
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7167/pull-tekton-pipeline-unit-tests/1714304663337046016
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7193/pull-tekton-pipeline-unit-tests/1714303905107546112

We shouldn't ignore a new flaky test by just re-running the tests. The PR itself also failed several times:
https://prow.tekton.dev/pr-history/?org=tektoncd&repo=pipeline&pr=6511

/kind flake

Yongxuanzhang · 2023-10-17T17:27:43Z

@chengjoey do you want to take a look, since you have more context?

chengjoey · 2023-10-18T01:33:45Z

/assign

chengjoey · 2023-10-18T02:55:02Z

Because #6511 added a goroutine to the entrypointer, the fakeWaiter in the TestEntrypointer test now also appends the cancel file from that goroutine, which makes the test flaky.

pipeline/pkg/entrypoint/entrypointer.go

Lines 175 to 181 in 6bb0513

    
    // start a goroutine to listen for cancellation file
    go func() {
    	if err := e.waitingCancellation(ctx, cancel); err != nil {
    		logger.Error("Error while waiting for cancellation", zap.Error(err))
    	}
    }()
    err = e.Runner.Run(ctx, e.Command...)

This caused the test failure in https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7193/pull-tekton-pipeline-unit-tests/1714303905107546112
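To illustrate why a test double that records calls becomes order-sensitive once a background goroutine also calls it, here is a minimal sketch. `recordingWaiter`, its one-argument `Wait`, and the `Waited` helper are hypothetical names invented for this example; they are not the fakeWaiter from the Tekton test suite, and this is not necessarily how the flake was fixed. The sketch guards the slice with a mutex and lets the assertion ignore the cancel file, so the result no longer depends on when the goroutine runs.

```go
package main

import (
	"fmt"
	"sync"
)

// recordingWaiter is a hypothetical test double (not Tekton's fakeWaiter).
// It records every file name passed to Wait.
type recordingWaiter struct {
	mu     sync.Mutex
	waited []string
}

// Wait records the file under a lock so that concurrent callers
// do not race on the slice.
func (w *recordingWaiter) Wait(file string) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.waited = append(w.waited, file)
	return nil
}

// Waited returns a copy of the recorded files, leaving out any file
// listed in ignore (for example, a cancel file polled in the background).
func (w *recordingWaiter) Waited(ignore ...string) []string {
	skip := make(map[string]bool, len(ignore))
	for _, f := range ignore {
		skip[f] = true
	}
	w.mu.Lock()
	defer w.mu.Unlock()
	out := []string{}
	for _, f := range w.waited {
		if !skip[f] {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	w := &recordingWaiter{}

	// Background goroutine, standing in for the cancellation listener.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		_ = w.Wait("cancel")
	}()

	// Foreground call, standing in for the step waiting on its own file.
	_ = w.Wait("step-0")

	// Filtering makes the result independent of goroutine scheduling;
	// wg.Wait is only here so the goroutine finishes before main exits.
	fmt.Println(w.Waited("cancel"))
	wg.Wait()
}
```

Without the filter, the recorded list could contain the cancel file before, after, or not at all relative to the foreground call, depending on scheduling, which is the kind of nondeterminism described above.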

chengjoey · 2023-10-18T03:04:11Z

In version 0.52.x and earlier, the for loop checks for the file immediately on the first iteration, without waiting first.

pipeline/cmd/entrypoint/waiter.go

Lines 50 to 75 in e7b5e58

    
    func (rw *realWaiter) Wait(file string, expectContent bool, breakpointOnFailure bool) error {
    	if file == "" {
    		return nil
    	}
    	for ; ; time.Sleep(rw.waitPollingInterval) {
    		if info, err := os.Stat(file); err == nil {
    			if !expectContent || info.Size() > 0 {
    				return nil
    			}
    		} else if !os.IsNotExist(err) {
    			return fmt.Errorf("waiting for %q: %w", file, err)
    		}
    		// When a .err file is read by this step, it means that a previous step has failed
    		// We wouldn't want this step to stop executing because the previous step failed during debug
    		// That is counterproductive to debugging
    		// Hence we disable skipError here so that the other steps in the failed taskRun can continue
    		// executing if breakpointOnFailure is enabled for the taskRun
    		// TLDR: Do not return skipError when breakpointOnFailure is enabled as it breaks execution of the TaskRun
    		if _, err := os.Stat(file + ".err"); err == nil {
    			if breakpointOnFailure {
    				return nil
    			}
    			return skipError("error file present, bail and skip the step")
    		}
    	}
    }

In the latest version, the first file check only happens after waiting for waitPollingInterval, because the select block runs at the top of the loop. The select block should be moved to the bottom of the loop.

pipeline/cmd/entrypoint/waiter.go

Lines 52 to 88 in 6bb0513

    
    func (rw *realWaiter) Wait(ctx context.Context, file string, expectContent bool, breakpointOnFailure bool) error {
    	if file == "" {
    		return nil
    	}
    	for {
    		select {
    		case <-ctx.Done():
    			if errors.Is(ctx.Err(), context.Canceled) {
    				return entrypoint.ErrContextCanceled
    			}
    			if errors.Is(ctx.Err(), context.DeadlineExceeded) {
    				return entrypoint.ErrContextDeadlineExceeded
    			}
    			return nil
    		case <-time.After(rw.waitPollingInterval):
    		}
    		if info, err := os.Stat(file); err == nil {
    			if !expectContent || info.Size() > 0 {
    				return nil
    			}
    		} else if !os.IsNotExist(err) {
    			return fmt.Errorf("waiting for %q: %w", file, err)
    		}
    		// When a .err file is read by this step, it means that a previous step has failed
    		// We wouldn't want this step to stop executing because the previous step failed during debug
    		// That is counterproductive to debugging
    		// Hence we disable skipError here so that the other steps in the failed taskRun can continue
    		// executing if breakpointOnFailure is enabled for the taskRun
    		// TLDR: Do not return skipError when breakpointOnFailure is enabled as it breaks execution of the TaskRun
    		if _, err := os.Stat(file + ".err"); err == nil {
    			if breakpointOnFailure {
    				return nil
    			}
    			return skipError("error file present, bail and skip the step")
    		}
    	}
    }

Related failed tests:
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7167/pull-tekton-pipeline-unit-tests/1714307892095488000
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7204/pull-tekton-pipeline-unit-tests/1714305917400387584
https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/7167/pull-tekton-pipeline-unit-tests/1714304663337046016

tekton-robot added the kind/flake label (categorizes issue or PR as related to a flaky test) Oct 17, 2023

tekton-robot assigned chengjoey Oct 18, 2023

chengjoey mentioned this issue Oct 18, 2023

fix waiter test is flaky #7227

Merged

7 tasks

tekton-robot closed this as completed in #7227 Oct 18, 2023
