
bug(batch): distributed query holds hummock version/snapshot for long time #9732

Closed
Tracked by #6640
zwang28 opened this issue May 10, 2023 · 11 comments
Assignees
Labels
component/batch (Batch related issue) · priority/high · type/bug (Something isn't working)
Milestone

Comments

@zwang28
Contributor

zwang28 commented May 10, 2023

Describe the bug

We observed both the min pinned version and the min pinned snapshot being stuck for days, co-occurring with changes in distributed_running_query_num. Is it just a very slow query, or is something wrong with the execution?
[screenshot]

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@zwang28 zwang28 added the type/bug (Something isn't working) and component/batch (Batch related issue) labels May 10, 2023
@github-actions github-actions bot added this to the release-0.20 milestone May 10, 2023
@zwang28 zwang28 changed the title from "bug: failed distributed query seems not to release hummock version/snapshot" to "bug(batch): failed distributed query seems not to release hummock version/snapshot" May 10, 2023
@zwang28

This comment was marked as outdated.

@zwang28
Contributor Author

zwang28 commented May 10, 2023

Here is another case where the min pinned epoch resumes from the "stuck" state at the moment distributed_running_query_num goes down. However, the min pinned version doesn't resume.

Then the min pinned epoch got "stuck" again (for 1+ days, until the cluster was killed) when distributed_running_query_num went up.
The frontend didn't restart throughout the period.

[screenshot]

@zwang28 zwang28 changed the title from "bug(batch): failed distributed query seems not to release hummock version/snapshot" to "bug(batch): distributed query holds hummock version/snapshot for long time" May 10, 2023
@zwang28 zwang28 self-assigned this May 11, 2023
@zwang28
Contributor Author

zwang28 commented May 16, 2023

We do observe a potential batch task leak in the compute node: the running task count is not zero on the compute node, while the query count is zero according to the frontend. (But we cannot be sure yet; see #9841.)

[screenshot]

@liurenjie1024
Contributor

Yes, there is a potential task leak; could this cause the hummock version to stay pinned?
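For intuition only, here is a minimal hypothetical sketch (types and names are illustrative, not the actual RisingWave code) of how a leaked batch task could keep a version pinned: if the unpin happens in a guard's Drop and the guard is owned by the task, a task that is never dropped never releases the pin.

```rust
use std::sync::Arc;

// Hypothetical guard: unpinning happens when the guard is dropped.
struct PinnedVersionGuard {
    version_id: u64,
}

impl Drop for PinnedVersionGuard {
    fn drop(&mut self) {
        // In the real system this would be an unpin request to the meta node.
        println!("unpin version {}", self.version_id);
    }
}

// Hypothetical batch task that owns the pin for its whole lifetime.
struct BatchTask {
    _pinned_version: Arc<PinnedVersionGuard>,
}

fn main() {
    let task = BatchTask {
        _pinned_version: Arc::new(PinnedVersionGuard { version_id: 5_592_335 }),
    };
    // Simulate a leaked task: Drop never runs, so the version is never unpinned.
    std::mem::forget(task);
}
```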

@zwang28
Contributor Author

zwang28 commented May 16, 2023

this may cause pinning of hummock version?

Discussed offline. #9848

@zwang28
Contributor Author

zwang28 commented May 24, 2023

#9953

@zwang28 zwang28 closed this as completed May 24, 2023
@zwang28
Contributor Author

zwang28 commented Sep 6, 2023

We encountered another unexpected pinned version held by compute nodes:

  1. A long-running batch query held a hummock version. That's expected.
     [screenshot]

  2. We restarted the frontend, which subsequently caused the batch heartbeat workers in the compute nodes to finish. That's expected.
     [screenshot]

  3. However, although cancel_task had been called by the heartbeat workers, the compute nodes still held the hummock version. That's unexpected.
     Worker 144004 type COMPUTE_NODE min_pinned_version_id 5592335
     Worker 145005 type COMPUTE_NODE min_pinned_version_id 5592660
@liurenjie1024 @ZENOTME Let's dig into this.

@zwang28 zwang28 reopened this Sep 6, 2023
@liurenjie1024
Contributor

So this means that SeqScanExecutor is not dropped? You could add some logs to it to see whether it's dropped.
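A minimal sketch of that suggestion (the struct fields are illustrative, not the real SeqScanExecutor definition; it assumes the tracing crate is available for logging):

```rust
struct SeqScanExecutor {
    identity: String,
    // ...other fields elided...
}

impl Drop for SeqScanExecutor {
    fn drop(&mut self) {
        // One line per executor teardown; if a query finishes but this line
        // never shows up, the executor (and whatever it pins) likely leaked.
        tracing::info!(executor = %self.identity, "SeqScanExecutor dropped");
    }
}
```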

@liurenjie1024
Contributor

Did some investigation but found no clue. Added #13589 to make the metrics more accurate, so we can tell whether this is really caused by an MPP task.

@liurenjie1024
Contributor

Some more hints:

  1. The metrics show that there is a long-running MPP query, but it can't be detected by show process list (they have 3 frontend nodes). I think this may be possible.
  2. The query lasted for 5 hours.

@zwang28
Contributor Author

zwang28 commented Mar 6, 2024

No recent occurrences after introducing statement_timeout.

@zwang28 zwang28 closed this as completed Mar 6, 2024