backfill test: Storage error: Hummock error: Expired Epoch: watermark 6934584757125120, epoch 6934584644927488 #17965

Closed
kwannoel opened this issue Aug 8, 2024 · 4 comments

kwannoel (Contributor) commented Aug 8, 2024

buildkite: https://buildkite.com/risingwavelabs/main-cron/builds/3046#01912d92-7eef-47eb-ae2a-9679177ef2a9

Error:

e2e_test/backfill/runtime/validate_rows_no_shuffle.slt       .. [FAILED]
failed to run `e2e_test/backfill/runtime/validate_rows_no_shuffle.slt`
Caused by:
    query failed: db error: ERROR: Failed to run the query
    Caused by:
      Storage error: Hummock error: Expired Epoch: watermark 6934584757125120, epoch 6934584644927488
    [SQL] select (select count(*) from no_shuffle_backfill) = (select count(*) from t);
    at e2e_test/backfill/runtime/validate_rows_no_shuffle.slt:1

This occurs when running the following batch query after no shuffle backfill:

query I
select (select count(*) from no_shuffle_backfill) = (select count(*) from t);
----
t
kwannoel added the type/bug and type/feature labels on Aug 8, 2024
github-actions bot added this to the release-2.0 milestone on Aug 8, 2024
hzxa21 (Collaborator) commented Aug 8, 2024

@wenym1 Can you help take a look? I checked the relevant code paths and didn't find any clue as to how this could happen.

meta: MV creation finishes at around 16:36:06

2024-08-07T16:35:47.098154218Z  INFO risingwave_meta::rpc::ddl_controller: finish stream job succeeded id=1
2024-08-07T16:36:00.777084324Z  INFO risingwave_meta::rpc::ddl_controller: finish stream job succeeded id=2
2024-08-07T16:36:06.114203651Z  INFO risingwave_meta::rpc::ddl_controller: finish stream job succeeded id=4

frontend: query fails at around 16:36:08

2024-08-07T16:36:08.819965654Z ERROR risingwave_frontend::scheduler::distributed::stage: Task TaskId { query_id: "ffaf7aeb-8a3e-4035-89c5-747bb7d5f707", stage_id: 1, task_id: 8 } failed, reason: "Storage error: Hummock error: Expired Epoch: watermark 6934584757125120, epoch 6934584644927488"
2024-08-07T16:36:08.820039093Z ERROR risingwave_frontend::scheduler::distributed::stage: Task TaskId { query_id: "ffaf7aeb-8a3e-4035-89c5-747bb7d5f707", stage_id: 2, task_id: 4 } failed, reason: "Storage error: Hummock error: Expired Epoch: watermark 6934584757125120, epoch 6934584644927488"
2024-08-07T16:36:08.820048701Z ERROR risingwave_frontend::scheduler::distributed::query: query stage failed error=Storage error: Hummock error: Expired Epoch: watermark 6934584757125120, epoch 6934584644927488 query_id=QueryId { id: "ffaf7aeb-8a3e-4035-89c5-747bb7d5f707" } stage_id=1
2024-08-07T16:36:08.820286629Z ERROR pgwire::pg_protocol: error when process message error=Failed to run the query: Storage error: Hummock error: Expired Epoch: watermark 6934584757125120, epoch 6934584644927488

compactor: the safe epoch is bumped to 6934584757125120 (past the query's read epoch 6934584644927488) by compact task 23, which finishes at 16:36:08

2024-08-07T16:36:07.022009424Z  INFO risingwave_storage::hummock::compactor::compactor_runner: Ready to handle task: 23 compact_task_statistics CompactTaskStatistics { total_file_count: 8, total_key_count: 674, total_file_size: 21770, total_uncompressed_file_size: 21698 } compression_algorithm 0  parallelism 1 task_memory_capacity_with_parallelism 17342096, enable fast runner: false, Compaction task id: 23, group-id: 2, type: Dynamic, target level: 0, target sub level: 6934584478662656 watermark: 6934584757125120, target_file_size: 33554432, splits: 1
Input: 
Level 0 ["[id: 157, obj_id: 138 0KB stale_ratio 0]", "[id: 158, obj_id: 140 0KB stale_ratio 0]"] 
Level 0 ["[id: 155, obj_id: 90 5KB stale_ratio 80]"] 
Level 0 ["[id: 147, obj_id: 75 2KB stale_ratio 0]", "[id: 148, obj_id: 77 0KB stale_ratio 0]", "[id: 150, obj_id: 96 2KB stale_ratio 0]", "[id: 151, obj_id: 98 0KB stale_ratio 0]", "[id: 153, obj_id: 88 7KB stale_ratio 66]"] 

2024-08-07T16:36:08.021001239Z  INFO risingwave_storage::hummock::compactor: running_parallelism_count=2 pull_task_ack=false pending_pull_task_count=4
2024-08-07T16:36:08.047053706Z  INFO risingwave_storage::hummock::compactor: Finish Task Some(23) split_index Some(0) sst count 1
2024-08-07T16:36:08.047124571Z  INFO risingwave_storage::hummock::compactor::compactor_runner: Finished compaction task in 1025.137239ms: Compaction task id: 23, group-id: 2, type: Dynamic, target level: 0, target sub level: 6934584478662656 watermark: 6934584757125120, target_file_size: 33554432, splits: 1, status: Success
Output: 
SstableInfo: object id=128, SST id=128, KeyRange=["0000000500000080000018a2f7c09b0000","0000000500ff0080ff0018a2f7c09b0000"], table_ids: [5], size=8KB, stale_ratio=0%, bloom_filter_kind Sstable
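
For context, "Expired Epoch" is Hummock rejecting a batch read whose epoch has fallen behind the compaction watermark (the safe epoch). A minimal sketch of that check, with made-up names rather than the actual Hummock code:

```rust
// Illustrative only: a read pinned to an epoch older than the watermark must
// be rejected, because compaction may already have reclaimed data below it.
#[derive(Debug)]
enum HummockError {
    ExpiredEpoch { watermark: u64, epoch: u64 },
}

fn validate_read_epoch(epoch: u64, watermark: u64) -> Result<(), HummockError> {
    if epoch < watermark {
        return Err(HummockError::ExpiredEpoch { watermark, epoch });
    }
    Ok(())
}

fn main() {
    // The values from the failing query in this issue.
    println!("{:?}", validate_read_epoch(6934584644927488, 6934584757125120));
}
```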

hzxa21 (Collaborator) commented Aug 8, 2024

It seems that after PR #5832, storage iterator creation was moved from new_boxed_executor to do_execute. The frontend unpins the epoch after the table scan stage becomes Scheduled. Does the table scan stage of a distributed query become Scheduled after new_boxed_executor or after do_execute? If it is the former, the epoch can be unpinned before the storage version is pinned, which happens during storage iterator creation. cc @BugenZhao @chenzl25
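
To illustrate the suspected window, here is a self-contained sketch with hypothetical types (not the actual scheduler or Hummock code): if the epoch is unpinned at Scheduled and nothing is pinned again until do_execute creates the iterator, compaction is free to move the watermark past the query's epoch in between.

```rust
// Hypothetical names only. While an epoch is pinned, compaction must not bump
// the safe epoch past it; the problem is the gap during which nothing is pinned.
struct HummockState {
    safe_epoch: u64,
    /// Epochs currently pinned by readers; compaction respects the minimum.
    pinned: Vec<u64>,
}

impl HummockState {
    /// Compaction may only raise the safe epoch up to the oldest pinned epoch.
    fn try_bump_safe_epoch(&mut self, target: u64) {
        let limit = self.pinned.iter().copied().min().unwrap_or(u64::MAX);
        self.safe_epoch = target.min(limit);
    }
}

fn main() {
    let query_epoch: u64 = 6934584644927488;
    let mut state = HummockState {
        safe_epoch: 0,
        pinned: vec![query_epoch],
    };

    // 1. The frontend unpins the epoch once the scan stage is `Scheduled`, but
    //    `new_boxed_executor` has not created any storage iterator yet, so no
    //    storage version is pinned either.
    state.pinned.clear();

    // 2. With nothing pinned, a compact task can move the watermark past the
    //    query's read epoch.
    state.try_bump_safe_epoch(6934584757125120);

    // 3. When `do_execute` finally creates the iterator, the epoch is expired.
    assert!(query_epoch < state.safe_epoch);
    println!(
        "Expired Epoch: watermark {}, epoch {}",
        state.safe_epoch, query_epoch
    );
}
```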

chenzl25 (Contributor) commented Aug 8, 2024

If I remember correctly, our SeqScanExecutor creates its iterators sequentially during execute(), which means we can't unpin the epoch right after executor creation. Instead, we should keep the snapshot pinned until the end of query execution, the same way we handle lookup join.
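
A minimal sketch of that suggestion, assuming an RAII-style guard whose lifetime spans the whole query rather than just executor creation (hypothetical types, not the actual RisingWave frontend API):

```rust
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct PinnedEpochs {
    epochs: Vec<u64>,
}

/// The epoch stays pinned for as long as this guard is alive.
struct SnapshotGuard {
    epoch: u64,
    registry: Arc<Mutex<PinnedEpochs>>,
}

impl SnapshotGuard {
    fn pin(registry: Arc<Mutex<PinnedEpochs>>, epoch: u64) -> Self {
        registry.lock().unwrap().epochs.push(epoch);
        SnapshotGuard { epoch, registry }
    }
}

impl Drop for SnapshotGuard {
    fn drop(&mut self) {
        let mut reg = self.registry.lock().unwrap();
        if let Some(pos) = reg.epochs.iter().position(|&e| e == self.epoch) {
            reg.epochs.swap_remove(pos);
        }
    }
}

fn run_distributed_query(registry: Arc<Mutex<PinnedEpochs>>, epoch: u64) {
    // Pin once, before scheduling, and hold the guard across all stages:
    // executors created in `new_boxed_executor`, iterators created lazily in
    // `execute()`, and result streaming.
    let _snapshot = SnapshotGuard::pin(registry, epoch);

    // ... schedule stages, run executors, drain results ...

    // `_snapshot` is dropped here, only after the query has fully finished,
    // so the epoch is never unpinned while iterators may still be created.
}

fn main() {
    let registry = Arc::new(Mutex::new(PinnedEpochs::default()));
    run_distributed_query(registry.clone(), 6934584644927488);
    assert!(registry.lock().unwrap().epochs.is_empty());
}
```

Dropping the guard only after the last stage completes mirrors the lookup join handling mentioned above, where the snapshot is kept for the duration of query execution.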

hzxa21 (Collaborator) commented Oct 17, 2024

I think this issue should be resolved on main now, because we have completely deprecated epoch pinning in the frontend. cc @wenym1 @zwang28 to confirm.
