
bug(batch): distributed query holds hummock version/snapshot for long time #9732

Closed
Tracked by #6640
zwang28 opened this issue May 10, 2023 · 11 comments
Assignees
Labels
component/batch (Batch related issue) · priority/high · type/bug (Something isn't working)
Milestone

Comments

@zwang28
Contributor

zwang28 commented May 10, 2023

Describe the bug

We observed both the min pinned version and the min pinned snapshot being stuck for days, co-occurring with changes in distributed_running_query_num. Is it just a very slow query, or is something wrong with the execution?
[screenshot]

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@zwang28 zwang28 added the type/bug (Something isn't working) and component/batch (Batch related issue) labels May 10, 2023
@github-actions github-actions bot added this to the release-0.20 milestone May 10, 2023
@zwang28 zwang28 changed the title from "bug: failed distributed query seems not to release hummock version/snapshot" to "bug(batch): failed distributed query seems not to release hummock version/snapshot" May 10, 2023
@zwang28

This comment was marked as outdated.

@zwang28
Contributor Author

zwang28 commented May 10, 2023

Here is another case where the min pinned epoch resumes from the "stuck" state at the moment distributed_running_query_num goes down. However, the min pinned version doesn't resume.

Then the min pinned epoch got "stuck" again (for 1+ days, until the cluster was killed) when distributed_running_query_num went up.
The frontend didn't restart throughout the period.

[screenshot]

@zwang28 zwang28 changed the title from "bug(batch): failed distributed query seems not to release hummock version/snapshot" to "bug(batch): distributed query holds hummock version/snapshot for long time" May 10, 2023
@zwang28 zwang28 self-assigned this May 11, 2023
@zwang28
Contributor Author

zwang28 commented May 16, 2023

We do observe a potential batch task leak in the compute node: the running task count is not zero on the compute node, while the query count is zero according to the frontend. (But we cannot be sure yet; see #9841.)

[screenshot]

@liurenjie1024
Contributor

Yes, there is a potential task leak; could this cause the hummock version to stay pinned?
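For intuition only, here is a minimal hypothetical sketch (types and names are illustrative, not the actual RisingWave code) of how a leaked batch task could keep a version pinned: if the unpin happens in a guard's Drop and the guard is owned by the task, a task that is never dropped never releases the pin.

```rust
use std::sync::Arc;

// Hypothetical guard: unpinning happens when the guard is dropped.
struct PinnedVersionGuard {
    version_id: u64,
}

impl Drop for PinnedVersionGuard {
    fn drop(&mut self) {
        // In the real system this would be an unpin request to the meta node.
        println!("unpin version {}", self.version_id);
    }
}

// Hypothetical batch task that owns the pin for its whole lifetime.
struct BatchTask {
    _pinned_version: Arc<PinnedVersionGuard>,
}

fn main() {
    let task = BatchTask {
        _pinned_version: Arc::new(PinnedVersionGuard { version_id: 5_592_335 }),
    };
    // Simulate a leaked task: Drop never runs, so the version is never unpinned.
    std::mem::forget(task);
}
```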

@zwang28
Contributor Author

zwang28 commented May 16, 2023

this may cause pinning of hummock version?

Discussed offline. #9848

@zwang28
Contributor Author

zwang28 commented May 24, 2023

#9953

@zwang28 zwang28 closed this as completed May 24, 2023
@zwang28
Contributor Author

zwang28 commented Sep 6, 2023

We encountered another unexpected pinned version held by compute nodes:

  1. A long-running batch query held a hummock version. That's expected.
     [screenshot]

  2. We restarted the frontend, which subsequently caused the batch heartbeat workers in the compute nodes to finish. That's expected.
     [screenshot]

  3. However, although cancel_task had been called by the heartbeat workers, the compute nodes still held the hummock version. That's unexpected.
     Worker 144004 type COMPUTE_NODE min_pinned_version_id 5592335
     Worker 145005 type COMPUTE_NODE min_pinned_version_id 5592660
@liurenjie1024 @ZENOTME Let's dig into this.

@zwang28 zwang28 reopened this Sep 6, 2023
@liurenjie1024
Contributor

So this means that SeqScanExecutor is not dropped? You could add some logs to it to see whether it's dropped.
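A minimal sketch of that suggestion (the struct fields are illustrative, not the real SeqScanExecutor definition; it assumes the tracing crate is available for logging):

```rust
struct SeqScanExecutor {
    identity: String,
    // ...other fields elided...
}

impl Drop for SeqScanExecutor {
    fn drop(&mut self) {
        // One line per executor teardown; if a query finishes but this line
        // never shows up, the executor (and whatever it pins) likely leaked.
        tracing::info!(executor = %self.identity, "SeqScanExecutor dropped");
    }
}
```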

@liurenjie1024
Contributor

Did some investigation but found no clue. Added #13589 to make the metrics more accurate, so we can tell whether this is really caused by an MPP task.

@liurenjie1024
Contributor

Some more hints:

  1. The metrics show that there is a long-running MPP query, but it can't be detected by show process list (they have 3 frontend nodes). I think this may be possible.
  2. The query lasted for 5 hours.

@zwang28
Contributor Author

zwang28 commented Mar 6, 2024

No recent occurrences after introducing statement_timeout.

@zwang28 zwang28 closed this as completed Mar 6, 2024