Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_backfill_tombstone failed in main-cron backfill tests: key UserKey { 1004, TableKey { 0000008558294d8b000000 } } epoch EpochWithGap(6017238017507328) >= prev epoch EpochWithGap(6017238017507328) #15324

Closed
kwannoel opened this issue Feb 28, 2024 · 5 comments · Fixed by #15365
Assignees
Milestone

Comments

@kwannoel
Copy link
Contributor

kwannoel commented Feb 28, 2024

buildkite: https://buildkite.com/risingwavelabs/main-cron/builds/1899#018deb4c-29e0-4ad8-a9a9-bc7c5ea713a4

Compute node logs:
Screenshot 2024-02-28 at 1 33 23 PM

You can run the tests locally, see ci/scripts/run-backfill-tests.sh.

The test which fails is:

e2e, test_backfill_tombstone

You can comment out the other tests at the bottom of the file while debugging.

@github-actions github-actions bot added this to the release-1.7 milestone Feb 28, 2024
@kwannoel kwannoel changed the title test_backfill_tombstone failed in main-cron backfill tests test_backfill_tombstone failed in main-cron backfill tests: key UserKey { 1004, TableKey { 0000008558294d8b000000 } } epoch EpochWithGap(6017238017507328) >= prev epoch EpochWithGap(6017238017507328) Feb 28, 2024
@kwannoel
Copy link
Contributor Author

2024-02-27T16:22:45.072413567Z  WARN risingwave_stream::executor::dml: txn_id=4294967296 Too many chunks for atomicity. Sent them to the downstream anyway.
thread 'rw-compaction' panicked at /risingwave/src/storage/hummock_sdk/src/key.rs:1047:21:
key UserKey { 1004, TableKey { 0000008558294d8b000000 } } epoch EpochWithGap(6017238017507328) >= prev epoch EpochWithGap(6017238017507328)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/e4c626dd9a17a23270bf8e7158e59cf2b9c04840/library/core/src/panicking.rs:72:14
   2: observe_multi_version<bytes::bytes::Bytes, bytes::bytes::Bytes, core::iter::adapters::map::Map<core::slice::iter::Iter<(risingwave_hummock_sdk::EpochWithGap, risingwave_storage::hummock::value::HummockValue<bytes::bytes::Bytes>)>, risingwave_storage::hummock::compactor::shared_buffer_compact::merge_imms_in_memory::{async_fn#0}::{closure_env#0}>>
   3: {async_fn#0}
             at ./src/storage/src/hummock/compactor/shared_buffer_compact.rs:335:38
   4: poll<risingwave_storage::hummock::compactor::shared_buffer_compact::merge_imms_in_memory::{async_fn_env#0}>

@hzxa21
Copy link
Collaborator

hzxa21 commented Feb 28, 2024

Is there any recent change to the test itself? I saw that the test can still pass yesterday. I checked the diff commits and didn't see any suspicious changes.

The assertion should never be hit regardless.

@kwannoel
Copy link
Contributor Author

kwannoel commented Feb 28, 2024

Is there any recent change to the test itself? I saw that the test can still pass yesterday. I checked the diff commits and didn't see any suspicious changes.

The assertion should never be hit regardless.

Last change to test seems to be 28 days ago. I think it may not happen every run perhaps. It depends on compaction which may not happen deterministically.

@hzxa21
Copy link
Collaborator

hzxa21 commented Feb 28, 2024

@hzxa21
Copy link
Collaborator

hzxa21 commented Feb 29, 2024

Root cause found:

  • In CI, we build risingwave_compaction_test and risingwavew_cmd_all together.
    cargo build \
    -p risingwave_cmd_all \
    -p risedev \
    -p risingwave_regress_test \
    -p risingwave_sqlsmith \
    -p risingwave_compaction_test \
    -p risingwave_e2e_extended_mode_test \
    $RISINGWAVE_FEATURE_FLAGS \
    --profile "$profile"
  • enable_test_epoch feature is enabled for risingwave_compaction_test in dependencies so it will be also included in risingwave_cmd_all due to the feature unification mechanism in rust.
    risingwave_hummock_sdk = { workspace = true, features = ["enable_test_epoch"] }
  • When enable_test_epoch feature is enabled, the spill_offset (the lower 16bit) will be ignored when constructing a epoch. When spill happens in the test (re-enabled by default a few days ago via chore: enable mem table spill #15240), there can be user key with the same epoch, triggering the assertion.
    EpochWithGap(epoch)

The reason why we cannot repro it locally is because build.sh won't be run in our local repro and only risingwave_cmd_all will be built. In this case, enable_test_epoch won't be enabled. I verified locally that enable_test_epoch is indeed enabled when running the exact same command in build.sh.

The usage of enable_test_epoch feature is hacky and prone to issues. @wcy-fdu has a PR to remove it. We will prioritize merging it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants