reproduce: recovery test failure #14888

Draft
wants to merge 16 commits into
base: main

kwannoel
Contributor

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Reproduce a backfill test failure

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features; see "Sqlsmith: Sql feature generation", #7934.)
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

kwannoel changed the title feat(stream): reproduce: recovery test failure Jan 31, 2024
Contributor Author

kwannoel commented Feb 1, 2024

A brief description of how backfill persists its state:

  1. The pk is the vnode of the partition.
  2. On each epoch, backfill commits the pk together with the backfill state.

What happened in the failure below:

  1. We updated the row for vnode=0 with a new state.
  2. We then committed the epoch for the state table.
  3. In the next epoch, for the same actor and table id, the row for vnode=0 could not be found.
  4. In between, some nodes were killed and recovery was triggered, but it seems this compute node continued to execute.

The full logs are too large to upload here. You can fetch them from https://buildkite.com/risingwavelabs/pull-request/builds/41174#018d6278-bbc5-458b-9007-988194251714.
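
To make the expected invariant concrete, here is a minimal sketch of a vnode-keyed state table in which an update requires the previous row to exist. This is a toy model, not the RisingWave implementation: `ToyStateTable`, `Progress`, and `replay_symptom` are hypothetical names, and the step that clears the committed rows only simulates the symptom (a committed row no longer being visible); it is not a claim about the root cause.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for per-vnode backfill progress (not the real type).
#[derive(Clone, Debug, PartialEq)]
struct Progress {
    rows_backfilled: u64,
    finished: bool,
}

/// Toy state table keyed by vnode. Writes are staged and become
/// part of the committed rows when `commit` is called.
#[derive(Default)]
struct ToyStateTable {
    committed: BTreeMap<i16, Progress>,
    staged: BTreeMap<i16, Progress>,
    committed_epoch: u64,
}

impl ToyStateTable {
    /// Read a row, preferring a staged write over the committed value.
    fn get(&self, vnode: i16) -> Option<&Progress> {
        self.staged.get(&vnode).or_else(|| self.committed.get(&vnode))
    }

    /// First persist for a vnode: the row must not exist yet.
    fn insert(&mut self, vnode: i16, new: Progress) -> Result<(), String> {
        if self.get(vnode).is_some() {
            return Err(format!("row [vnode={vnode}] already exists"));
        }
        self.staged.insert(vnode, new);
        Ok(())
    }

    /// Later persists: the old row must be found, otherwise this is an error.
    fn update(&mut self, vnode: i16, new: Progress) -> Result<(), String> {
        if self.get(vnode).is_none() {
            return Err(format!("row [vnode={vnode}] not found"));
        }
        self.staged.insert(vnode, new);
        Ok(())
    }

    /// Move all staged writes into the committed rows for `epoch`.
    fn commit(&mut self, epoch: u64) {
        let staged = std::mem::take(&mut self.staged);
        self.committed.extend(staged);
        self.committed_epoch = epoch;
    }
}

/// Replays the sequence from the logs on the toy table.
/// Epoch values are copied from the summarized logs.
fn replay_symptom() -> Result<(), String> {
    const EPOCH_1: u64 = 3066966724575232;
    let mut table = ToyStateTable::default();

    // An earlier barrier wrote the initial row for vnode=0.
    table.insert(0, Progress { rows_backfilled: 0, finished: false })?;
    table.commit(EPOCH_1 - 1);

    // "update vnode: vnode=0", then "committing on epoch".
    table.update(0, Progress { rows_backfilled: 10, finished: false })?;
    table.commit(EPOCH_1);
    assert_eq!(table.committed_epoch, EPOCH_1);

    // ASSUMPTION: simulate the committed row becoming invisible.
    // The real cause is unknown; this only reproduces the symptom.
    table.committed.clear();

    // The next persist for the same vnode now fails.
    table.update(0, Progress { rows_backfilled: 20, finished: false })
}
```

In this model, calling `replay_symptom()` returns an error on the final update, which mirrors the `row [...] not found` panic: the update path assumes the row committed in the previous epoch is still readable.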

Summarized error logs:

--
2022-09-24T15:29:59.243730Z TRACE risingwave_stream::executor::backfill::arrangement_backfill: Persisting state on barrier at epoch EpochPair {
    curr: 3066966724575232,
    prev: 3066966659039232,
}, actor_id=176007, table_id=177014

# First we updated the vnode=0
2022-09-24T15:29:59.243731Z TRACE risingwave_stream::executor::backfill::utils: 
update vnode: vnode=0, actor_id=176007, table_id=177014

# We also updated the in-memory state as committed after updating the vnode
2022-09-24T15:29:59.243731Z  INFO risingwave_stream::executor::backfill::utils: 
mark_committed: vnode=0

# Afterwards, we committed the updated state
2022-09-24T15:29:59.243731Z TRACE risingwave_stream::executor::backfill::utils: 
committing on epoch, EpochPair {
    curr: 3066966724575232,
    prev: 3066966659039232,
}, actor_id=176007, table_id=177014

2022-09-24T15:29:59.243731Z TRACE risingwave_stream::executor::backfill::arrangement_backfill: 
barrier persisted actor=176007 barrier=Barrier { epoch: EpochPair { curr: 3066966724575232, prev: 3066966659039232 }, mutation: None, kind: Checkpoint, tracing_context: TracingContext(Context { entries: 0 }), passed_actors: [176002, 176007] }
--

# When we persist the state again,
2022-09-24T15:30:00.243730Z TRACE risingwave_stream::executor::backfill::arrangement_backfill: 
Persisting state on barrier at epoch EpochPair {
    curr: 3066966743908352,
    prev: 3066966724575232,
}, actor_id=176007, table_id=177014

# Suddenly vnode=0 is missing
thread '<unnamed>' panicked at /risingwave/src/stream/src/executor/backfill/utils.rs:767:25:
row [
    Some(
        Int16(
            0,
        ),
    ),
] not found
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: risingwave_stream::executor::backfill::utils::persist_state_per_vnode::{{closure}}
             at ./src/stream/src/executor/backfill/utils.rs:767:25
             at ./src/stream/src/executor/backfill/arrangement_backfill.rs:453:18
             at ./src/stream/src/executor/flow_control.rs:59:5
             at ./src/stream/src/executor/wrapper/trace.rs:126:10
             at ./src/stream/src/executor/wrapper/schema_check.rs:24:1
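
One thing the summarized logs do establish is that the failing barrier directly follows the one that committed vnode=0: the `prev` of the second epoch pair equals the `curr` of the first. A small sketch of that check, where `EpochPair` is a local stand-in mirroring the fields printed in the logs (not the RisingWave type) and `is_next_barrier` is a hypothetical helper:

```rust
/// Local stand-in mirroring the `EpochPair { curr, prev }` printed in the logs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct EpochPair {
    curr: u64,
    prev: u64,
}

/// Returns true when `later` is the barrier immediately after `earlier`,
/// i.e. its `prev` epoch is the `curr` epoch of the earlier barrier.
fn is_next_barrier(earlier: EpochPair, later: EpochPair) -> bool {
    later.prev == earlier.curr
}

/// The two epoch pairs copied from the summarized logs.
fn epochs_from_logs() -> (EpochPair, EpochPair) {
    let committed = EpochPair {
        curr: 3066966724575232,
        prev: 3066966659039232,
    };
    let failing = EpochPair {
        curr: 3066966743908352,
        prev: 3066966724575232,
    };
    (committed, failing)
}
```

So, at least in these excerpts, no other barrier for actor_id=176007 sits between the commit and the failed lookup.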

@wenym1 wenym1 force-pushed the kwannoel/debug-branch branch from cd79045 to 6cf9e3f Compare February 5, 2024 18:12
Contributor

github-actions bot commented Apr 7, 2024

This PR has been open for 60 days with no activity. Could you please update the status? Feel free to ping a reviewer if you are waiting for review.
