
Discussion: Compact task Heartbeat timeout should depend on object storage operation timeout #14327

Closed
Li0k opened this issue Jan 3, 2024 · 3 comments

Li0k commented Jan 3, 2024

Background

The Hummock manager may cancel compaction tasks in two situations:

  1. The compactor heartbeat (HB) reports the task, but the task makes no progress, so it is canceled after expire_time.
  2. The task is no longer in the compactor HB list, so it is canceled by the background thread after expire_time.

The current default expire_time is 1 min. In other words, whenever an object store operation takes more than 1 min, the task expires. That is unreasonable and causes the task to fall into a cycle of execution and cancellation.

Improvement

After #10584, MonitoredObjectStore ensures that the duration of a single operation does not exceed the timeout configured in the config. Consider inferring the HB timeout from the object store operation timeout config.
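One way this inference could be wired up (a minimal sketch; the struct, field, and constant names here are hypothetical, not RisingWave's actual configuration schema):

```rust
/// Hypothetical object store settings; in practice this would come from
/// the storage section of the config file.
struct ObjectStoreConfig {
    /// Upper bound (seconds) that MonitoredObjectStore enforces on a
    /// single object store operation (cf. #10584).
    operation_timeout_sec: u64,
}

/// Derive the compaction-task expiration from the operation timeout:
/// a task should only expire after at least one full object store
/// operation could have timed out, with some slack for a retry, and
/// never below the old 1-minute default.
fn infer_task_expiration_sec(cfg: &ObjectStoreConfig) -> u64 {
    const MIN_EXPIRATION_SEC: u64 = 60;
    const SLACK_FACTOR: u64 = 2;
    MIN_EXPIRATION_SEC.max(cfg.operation_timeout_sec.saturating_mul(SLACK_FACTOR))
}
```

With this shape, a short operation timeout falls back to the 1-minute floor, while a long one stretches the expiration accordingly, so a single slow-but-bounded object store call can no longer expire the task.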

Li0k commented Jan 4, 2024

The root cause of the problem described in this issue is that we do not distinguish between process and heartbeat timeouts:

  • process_timeout: the compactor reports the task HB, but the task makes no progress.
  • heartbeat_timeout: whether the compactor reports the task HB at all.

Once we distinguish the two concepts, we can decouple the two cancellation behaviors:

  1. The compactor HB path cancels only tasks that belong to it and have made no progress.
  2. The background thread cancels only tasks whose HB is no longer being reported.

After an offline discussion with Zheng, we think an additional HB timeout config can achieve the above. We can then derive a reasonable process timeout interval from the object store timeout config.
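The decoupling described above could look roughly like this (a sketch with illustrative names; timestamps are plain seconds to keep it self-contained):

```rust
/// Per-task liveness state, as the Hummock manager might track it:
/// `last_heartbeat_sec` advances on every HB report, while
/// `last_progress_sec` advances only when the reported counters move.
struct TaskLiveness {
    last_heartbeat_sec: u64,
    last_progress_sec: u64,
}

impl TaskLiveness {
    /// Checked on the compactor HB path: the task is still reporting,
    /// but has made no progress for `process_timeout_sec`.
    fn is_stalled(&self, now_sec: u64, process_timeout_sec: u64) -> bool {
        now_sec.saturating_sub(self.last_progress_sec) >= process_timeout_sec
    }

    /// Checked by the background thread: the task has stopped
    /// reporting HBs entirely for `heartbeat_timeout_sec`.
    fn is_orphaned(&self, now_sec: u64, heartbeat_timeout_sec: u64) -> bool {
        now_sec.saturating_sub(self.last_heartbeat_sec) >= heartbeat_timeout_sec
    }
}
```

Keeping the two checks on separate code paths means a task that still heartbeats but stalls is only ever canceled via `is_stalled`, while `is_orphaned` handles tasks whose compactor has gone away.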

Besides, after introducing SkipWatermarkIterator, batch-skipping keys that do not satisfy the watermark may leave num_process_key inaccurate, so we will use num_io to identify the progress of the compact task.
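A progress check based on I/O counters, as proposed above, might look like this (the counter names are illustrative, not the actual compactor report schema):

```rust
/// Counters a compactor might report with each heartbeat; per the
/// proposal, these replace num_process_key as the progress signal.
#[derive(Clone, Copy)]
struct IoCounters {
    num_read_io: u64,
    num_write_io: u64,
}

/// The task is considered to have made progress iff any I/O counter
/// advanced since the previous heartbeat.
fn made_progress(prev: IoCounters, curr: IoCounters) -> bool {
    curr.num_read_io > prev.num_read_io || curr.num_write_io > prev.num_write_io
}
```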

@Li0k Li0k self-assigned this Jan 4, 2024

hzxa21 commented Jan 8, 2024

> Besides, after introducing SkipWatermarkIterator, batch skipping operations that do not satisfy the watermark key may result in an inaccurate num_process_key, so we will use num_io to identify the progress of the compact task.

I think the problem here is not about using num_process_key but about not counting num_process_key correctly when a skip operation happens. By switching to num_io, do you mean tracking the real S3 I/O? If so, is it possible that read I/O remains unchanged for a while due to prefetch, and write I/O remains unchanged for a while due to skip operations?


Li0k commented Mar 6, 2024

#15194

@Li0k Li0k closed this as completed Mar 6, 2024