bug: panicked at should be valid staging_sst.size #17111

Closed
MrCroxx opened this issue Jun 5, 2024 · 10 comments · Fixed by #17113
Labels
type/bug Something isn't working
Milestone

Comments

@MrCroxx
Contributor

MrCroxx commented Jun 5, 2024

Describe the bug


2024-06-05 01:27:34.139 | instance_id 490
2024-06-05 01:27:34.139 | local_imm_ids [594224, 594217],
2024-06-05 01:27:34.139 | staging_sst.epochs [6572398861746176],
2024-06-05 01:27:34.139 | staging_sst.imm_ids {476: [594179], 490: [594217], 494: [594190], 503: [594168], 474: [594175], 473: [594176], 497: [594167], 486: [594185], 485: [594187], 500: [594171], 493: [594180], 478: [594178], 496: [594188], 477: [594174], 482: [594193], 479: [594184], 498: [594166], 501: [594169], 488: [594219], 489: [594218], 484: [594192], 499: [594170], 480: [594182], 483: [594191], 504: [594165], 487: [594186], 492: [594177], 502: [594172], 495: [594189], 481: [594183], 475: [594173], 491: [594181]},
2024-06-05 01:27:34.139 | should be valid staging_sst.size 6914109,
2024-06-05 01:27:34.139 | thread 'rw-main' panicked at src/storage/src/hummock/store/version.rs:324:25:



Error message/log

https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22y4z%22:%7B%22datasource%22:%22PE59595AED52CF917%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22benchmark-xx-250723eeecc1658ea03389fab187e131eb445475%5C%22%7D%20%7C%3D%20%60panicked%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22PE59595AED52CF917%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221717522054139%22,%22to%22:%221717522054214%22%7D%7D%7D&orgId=1

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

@MrCroxx MrCroxx added the type/bug Something isn't working label Jun 5, 2024
@github-actions github-actions bot added this to the release-1.10 milestone Jun 5, 2024
@MrCroxx
Contributor Author

MrCroxx commented Jun 5, 2024

Related PR: #16725

@MrCroxx
Contributor Author

MrCroxx commented Jun 5, 2024

https://buildkite.com/risingwave-test/longevity-test/builds/1445#018fe3c5-0024-4f5f-b2e4-76e0b24167e2
https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%2289c%22:%7B%22datasource%22:%22PE59595AED52CF917%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22reglngvty-20240604-150219%5C%22%7D%20%7C%3D%20%60panicked%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22PE59595AED52CF917%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221717523016897%22,%22to%22:%221717523017015%22%7D%7D,%22quj%22:%7B%22datasource%22:%22PE59595AED52CF917%22,%22queries%22:%5B%7B%22expr%22:%22%7Bapp%3D%5C%22benchmark-risingwave-compute-c-1%5C%22,component%3D%5C%22compute%5C%22,container%3D%5C%22compute%5C%22,filename%3D%5C%22%2Fvar%2Flog%2Fpods%2Freglngvty-20240604-150219_benchmark-risingwave-compute-c-1_9ecb1e1a-7c4b-4a88-86a3-284b8f3d964d%2Fcompute%2F0.log%5C%22,job%3D%5C%22reglngvty-20240604-150219%2Fbenchmark-risingwave-compute-c-1%5C%22,namespace%3D%5C%22reglngvty-20240604-150219%5C%22,node_name%3D%5C%22ip-10-0-56-111.ec2.internal%5C%22,pod%3D%5C%22benchmark-risingwave-compute-c-1%5C%22,stream%3D%5C%22stderr%5C%22%7D%22,%22queryType%22:%22range%22,%22refId%22:%22log-row-context-query-_0.8209849975523145%22,%22maxLines%22:1000,%22direction%22:%22backward%22,%22datasource%22:%7B%22uid%22:%22PE59595AED52CF917%22,%22type%22:%22loki%22%7D%7D%5D,%22range%22:%7B%22from%22:%221717523016897%22,%22to%22:%221717523017015%22%7D,%22panelsState%22:%7B%22logs%22:%7B%22id%22:%22log-row-context-query-_0.8209849975523145_1717523016897449713_92a7ae2%22%7D%7D%7D%7D&orgId=1

Panics on the main daily longevity test.

@MrCroxx
Contributor Author

MrCroxx commented Jun 5, 2024

#16962

@wenym1
Contributor

wenym1 commented Jun 5, 2024

A sync failure happened right before the panic.


2024-06-05 01:27:34.136 | 2024-06-04T17:27:34.072833415Z ERROR risingwave_storage::hummock::event_handler::uploader: upload task

Suppose that in one instance the imm order is [epoch1: [imm1], epoch2: [imm2]]. When the sync of epoch1 fails, we simply report the failure but clear neither imm1 nor the uploader. When the ongoing epoch2 sync finishes, it tries to clear imm2 but finds that imm2 is not the oldest imm, which causes the panic.
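
The sequence described above can be replayed in a small standalone sketch. This is not the code in `version.rs`: `LocalImms`, `clear_synced`, and `replay_sync_failure` are hypothetical names, the imm ids 1 and 2 stand in for imm1 and imm2, and the sketch returns an error at the point where the issue reports a panic.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for one instance's pending imms, oldest first.
struct LocalImms {
    imm_ids: VecDeque<u64>,
}

impl LocalImms {
    fn new(ids: impl IntoIterator<Item = u64>) -> Self {
        Self { imm_ids: ids.into_iter().collect() }
    }

    /// Drop `synced` from the front of the queue. Returns an error (where the
    /// issue reports a panic) if `synced` is not exactly the oldest pending imms.
    fn clear_synced(&mut self, synced: &[u64]) -> Result<(), String> {
        let is_oldest_prefix = synced.len() <= self.imm_ids.len()
            && self.imm_ids.iter().zip(synced).all(|(local, s)| local == s);
        if !is_oldest_prefix {
            return Err(format!(
                "should be valid: synced {:?}, local_imm_ids {:?}",
                synced, self.imm_ids
            ));
        }
        for _ in 0..synced.len() {
            self.imm_ids.pop_front();
        }
        Ok(())
    }
}

/// Replays the sequence from the comment: epoch1 fails, then epoch2 succeeds.
fn replay_sync_failure() -> Result<(), String> {
    let mut imms = LocalImms::new([1, 2]); // imm1 (epoch1), imm2 (epoch2)
    // The epoch1 sync fails here, but imm1 is left in place.
    // The epoch2 sync then finishes and tries to clear imm2.
    imms.clear_synced(&[2])
}
```

Calling `replay_sync_failure()` yields an `Err`, because imm1 is still at the front of the queue when imm2 is cleared.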

#17113 fixes the panic: it clears the uploader state on sync error, and the uploader then does nothing until a recovery happens.
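
One way to picture that behavior is a two-state uploader. The sketch below is only an illustration of the idea and is not taken from #17113: `UploaderState`, `Uploader`, and all method names are assumptions made for this example.

```rust
/// Hypothetical uploader state; the variant names are illustrative only.
#[derive(Debug, PartialEq)]
enum UploaderState {
    Working { pending_imm_ids: Vec<u64> },
    /// Entered on sync error. All later events are ignored until recovery.
    Failed,
}

struct Uploader {
    state: UploaderState,
}

impl Uploader {
    fn new() -> Self {
        Self { state: UploaderState::Working { pending_imm_ids: Vec::new() } }
    }

    /// Track a new imm; ignored while in the failed state.
    fn add_imm(&mut self, imm_id: u64) {
        if let UploaderState::Working { pending_imm_ids } = &mut self.state {
            pending_imm_ids.push(imm_id);
        }
    }

    /// A sync error drops all pending state instead of only reporting the failure.
    fn on_sync_error(&mut self) {
        self.state = UploaderState::Failed;
    }

    /// Returns the imm ids that may be cleared. After an error this is empty,
    /// so a later sync result can no longer clear a non-oldest imm.
    fn on_sync_finished(&mut self, synced: &[u64]) -> Vec<u64> {
        match &mut self.state {
            UploaderState::Working { pending_imm_ids } => {
                pending_imm_ids.retain(|id| !synced.contains(id));
                synced.to_vec()
            }
            UploaderState::Failed => Vec::new(),
        }
    }

    /// Recovery starts again from a clean working state.
    fn recover(&mut self) {
        self.state = UploaderState::Working { pending_imm_ids: Vec::new() };
    }
}
```

In this sketch, a late `on_sync_finished` for epoch2 after `on_sync_error` returns nothing to clear, so the inconsistent state from the analysis above is never reached.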

@hzxa21
Collaborator

hzxa21 commented Jun 5, 2024

Based on the above analysis, the race existed before #16962, but we didn't see the panic in nightly-20240603, which includes #16962. Is it possible that some changes in nightly-20240604 increased the likelihood of sync errors?
da28570...b1c25c0

@MrCroxx
Contributor Author

MrCroxx commented Jun 5, 2024

Is #17113 ready? We can review and merge it and unlock the main branch.

@wenym1
Contributor

wenym1 commented Jun 5, 2024

> Is #17113 ready? We can review and merge it and unlock the main branch.

Yes. Both it and its prerequisite PR have been approved. We can merge them one by one.

@hzxa21
Collaborator

hzxa21 commented Jun 5, 2024

I have recently been running a branch based on a May 21 commit. In all the runs prior to today, I didn't see the panic reported in this issue, but for some reason I see it in today's run.

This means the bug has existed since at least May 21, but for some reason it has started to trigger frequently as of today.
