More robust way to wait for the cluster to be healthy after a node is killed in simulation tests #17995

Open
BugenZhao opened this issue Aug 12, 2024 · 1 comment
Labels
component/meta Meta related issue. component/test Test related issue. no-issue-activity

Comments

@BugenZhao
Member

Currently, when killing a node in the cluster, we expect recovery to complete within a fixed interval of time (15 seconds, as shown below).

Some(tokio::spawn(async move {
    // Kill the node after a random delay of up to 1 second.
    let t = thread_rng().gen_range(Duration::default()..Duration::from_secs(1));
    tokio::time::sleep(t).await;
    cluster.kill_node(&opts).await;
    // Assume recovery completes within a fixed 15 seconds.
    tokio::time::sleep(Duration::from_secs(15)).await;
}))

This is fragile, of course. As an additional layer of defense, when running the next statement/query we check the error message and tolerate errors that are likely caused by a recovery still in progress.

if let Err(err) = tester
    .run_async(record.clone())
    .timed(|_res, elapsed| {
        tracing::debug!("Record {:?} finished in {:?}", record, elapsed)
    })
    .await
{
    let err_string = err.to_string();
    // The cluster may still be recovering if a node was killed earlier; retry if
    // the error matches `no reader for dml in table with id {}`.
    let should_retry = (err_string.contains("no reader for dml in table")
        || err_string
            .contains("error reading a body from connection: broken pipe"))
        || err_string.contains("failed to inject barrier") && i < MAX_RETRY;
    if !should_retry {
        panic!("{}", err);
    }
    tracing::error!("failed to run test: {err}\nretry after {delay:?}");
} else {
    break;
}

The problem is that this list of error patterns is hard to keep correctly maintained, given that we're actively reworking the barrier manager for partial checkpoint and its corresponding error-reporting mechanism. As a result, the test may turn out flaky.


To make this more robust, we can naturally leverage the system functions for checking the cluster status introduced in #17641. However, as suggested by @kwannoel, there is a possible race: the test may observe a "healthy" status before the recovery has actually been triggered. A possible solution is to attach a sequence number to the cluster's lifespan, incremented every time a recovery is triggered, so that we can tell the pre-kill and post-recovery states apart; see the sketch below.
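Below is a minimal sketch of what the wait could look like on the test side, assuming a hypothetical async status callback that reports the current recovery generation and a health flag (e.g. backed by the system functions from #17641; the callback name, signature, and polling interval are assumptions, not an existing API):

use std::time::Duration;

/// Wait until the cluster reports "healthy" under a recovery generation that is
/// newer than the one observed before the kill, so that a stale "healthy" reading
/// taken before recovery is triggered cannot end the wait prematurely.
/// `status` is a hypothetical callback returning (recovery_generation, is_healthy).
async fn wait_for_recovery<F, Fut>(
    mut status: F,
    generation_before_kill: u64,
    timeout: Duration,
) -> bool
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = (u64, bool)>,
{
    let deadline = tokio::time::Instant::now() + timeout;
    while tokio::time::Instant::now() < deadline {
        let (generation, healthy) = status().await;
        if generation > generation_before_kill && healthy {
            return true;
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
    false
}

The simulation test would then record the generation right before kill_node and call this helper afterwards, instead of sleeping for a fixed 15 seconds or pattern-matching error messages.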

@BugenZhao BugenZhao added component/test Test related issue. component/meta Meta related issue. labels Aug 12, 2024
@github-actions github-actions bot added this to the release-2.0 milestone Aug 12, 2024
@BugenZhao BugenZhao removed this from the release-2.0 milestone Aug 19, 2024
@github-actions
Contributor

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄
