compaction task hangs without error #15209

Closed · hzxa21 opened this issue Feb 23, 2024 · 11 comments
hzxa21 (Collaborator) commented Feb 23, 2024

Recently in our testing pipeline, we found that some compaction tasks hang for a long time (>45 min) without any error. Although the meta-side task expiration mechanism ensures that a task is force-cancelled if it has made no progress for some time, it is worth investigating why the task can hang in the first place.


Note that the recent occurrences all happened in test runs where we were testing the switch to opendal S3:

  1. nexmark-kafka-benchmark-20240221-075859
  2. nexmark-kafka-benchmark-20240220-132426
  3. nexmark-kafka-benchmark-20240220-132306
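
The meta-side expiration mechanism mentioned above is essentially a progress check: if a compactor has not reported progress on a task within some deadline, meta force-cancels the task. A generic sketch of that idea (illustrative names only, not RisingWave's actual meta code):

// Illustrative sketch of a progress-based task expiration check, not the
// actual meta-side implementation. A task whose last reported progress is
// older than `expiration` is considered hung and becomes a cancellation
// candidate.
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct TaskTracker {
    // task id -> time of the last progress report from the compactor
    last_progress: HashMap<u64, Instant>,
    expiration: Duration,
}

impl TaskTracker {
    fn report_progress(&mut self, task_id: u64) {
        self.last_progress.insert(task_id, Instant::now());
    }

    /// Ids of tasks that have made no progress within `expiration`.
    fn expired_tasks(&self) -> Vec<u64> {
        self.last_progress
            .iter()
            .filter(|(_, last)| last.elapsed() > self.expiration)
            .map(|(id, _)| *id)
            .collect()
    }
}

fn main() {
    let mut tracker = TaskTracker {
        last_progress: HashMap::new(),
        expiration: Duration::from_secs(30 * 60),
    };
    tracker.report_progress(42);
    // A periodic check on meta would force-cancel everything returned here.
    let _to_cancel = tracker.expired_tasks();
}
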
hzxa21 (Collaborator, Author) commented Feb 23, 2024

cc @Li0k @wcy-fdu

github-actions bot added this to the release-1.7 milestone Feb 23, 2024
zwang28 (Contributor) commented Feb 26, 2024

Another issue where a compute node hangs while reading a block: #15239

Xuanwo (Contributor) commented Feb 26, 2024

I can't read the logs posted in the comments, so I don't know what happened yet.

Here are the ideas I have:

  • Enable keep_alive and other HTTP settings for opendal, as we do for the AWS S3 SDK:

// Connection tuning: apply optional keepalive, nodelay, and socket buffer
// settings from the object store config to the S3 SDK's HTTP connector.
if let Some(keepalive_ms) = config.s3.object_store_keepalive_ms.as_ref() {
    http.set_keepalive(Some(Duration::from_millis(*keepalive_ms)));
}
if let Some(nodelay) = config.s3.object_store_nodelay.as_ref() {
    http.set_nodelay(*nodelay);
}
if let Some(recv_buffer_size) = config.s3.object_store_recv_buffer_size.as_ref() {
    http.set_recv_buffer_size(Some(*recv_buffer_size));
}
if let Some(send_buffer_size) = config.s3.object_store_send_buffer_size.as_ref() {
    http.set_send_buffer_size(Some(*send_buffer_size));
}
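
For opendal, the closest counterparts would be set when building the underlying reqwest client. A minimal sketch of the mapping, assuming the client is constructed by hand; how it is then wired into the opendal operator depends on the opendal version and is not shown here:

use std::time::Duration;

// Sketch only: a reqwest client with roughly the same TCP tuning as the S3 SDK
// connector above. The keepalive value is illustrative.
// (reqwest's builder does not appear to expose recv/send socket buffer sizes.)
fn build_tuned_http_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // counterpart of object_store_keepalive_ms
        .tcp_keepalive(Some(Duration::from_millis(600_000)))
        // counterpart of object_store_nodelay
        .tcp_nodelay(true)
        // fail fast if a connection cannot even be established
        .connect_timeout(Duration::from_secs(10))
        .build()
}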

Little-Wallace (Contributor) commented Feb 26, 2024

It confuses me that the timeout does not work when the I/O hangs. @Xuanwo

        let future = async {
            self.inner
                .read(path, range)
                .verbose_instrument_await("object_store_read")
                .await
        };
        let res = match self.read_timeout.as_ref() {
            None => future.await,
            Some(read_timeout) => tokio::time::timeout(*read_timeout, future)
                .await
                .unwrap_or_else(|_| Err(ObjectError::internal("read timeout"))),
        };

See details in https://github.com/risingwavelabs/risingwave/blob/main/src/object_store/src/object/mod.rs#L465
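
For reference, this is how the wrapper is expected to behave; a self-contained toy example (not the RisingWave code): as long as the task can be polled, tokio::time::timeout returns an Elapsed error at the deadline even if the inner future never completes.

use std::time::Duration;

#[tokio::main]
async fn main() {
    // A future that never completes, standing in for a hung I/O request.
    let hung_read = std::future::pending::<()>();

    // Because the runtime can still poll this task, the timeout fires at the
    // deadline and we get Err(Elapsed) instead of hanging forever.
    let res = tokio::time::timeout(Duration::from_millis(100), hung_read).await;
    assert!(res.is_err());
    println!("timed out as expected: {:?}", res);
}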

hzxa21 (Collaborator, Author) commented Feb 26, 2024

(Quoting @Xuanwo's suggestion above: enable keep_alive and other HTTP settings for opendal, as is done for the AWS S3 SDK. The quoted code is identical to the snippet in that comment and is omitted here.)

The only opendal logs we saw in the testing are related to retries:

[screenshot of opendal retry logs]

However, I don't think the retry logs are relevant to this issue: a retry indicates that the I/O request is still making progress, whereas what we saw here is I/O requests that are stuck forever. As mentioned by @Little-Wallace, we do enforce a timeout on top of each I/O call by wrapping it in tokio::time::timeout (which is why we didn't enable opendal's TimeoutLayer), but the timeout didn't fire after the deadline passed. That suggests there is a deadlock or some blocking logic somewhere that prevents the timeout future from being polled.
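
To illustrate that hypothesis with a toy example (not the actual code path): tokio::time::timeout can only fire when the wrapped task yields back to the runtime, so a blocking call inside the future keeps the timeout from ever being observed.

use std::time::{Duration, Instant};

#[tokio::main]
async fn main() {
    let start = Instant::now();

    // The blocking sleep stands in for a deadlock or blocking call inside an
    // async path. The task never yields, so the 100ms timeout cannot fire; we
    // only get a result once the blocking call returns.
    let res = tokio::time::timeout(Duration::from_millis(100), async {
        std::thread::sleep(Duration::from_secs(5));
    })
    .await;

    // Prints something like "returned after 5.0s: Ok(())".
    println!("returned after {:?}: {:?}", start.elapsed(), res);
}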

Li0k (Contributor) commented Feb 26, 2024

--- Compactor Traces ---
>> Compaction Task 50708-0
Compaction Task 50708 Split 0  [1843.496s]
  compact [!!! 1843.496s]
    compact_and_build_sst [!!! 1843.496s]
      rewind [!!! 1843.496s]
        object_store_streaming_read_read_bytes [!!! 1843.456s]

>> Compaction Task 51832-0
Compaction Task 51832 Split 0  [1062.498s]
  compact [!!! 1062.498s]
    compact_and_build_sst [!!! 1062.498s]
      add_full_key [!!! 1061.768s]
        object_store_streaming_upload_write_bytes [!!! 1061.768s]

We hit blocking on both a read and an upload at the same time.

hzxa21 (Collaborator, Author) commented Feb 26, 2024

Root cause found:
After #13618, ObjectStoreConfig is only passed to MonitoredObjectStore when the backend is S3ObjectStore, so no timeout is applied for any other object store backend.
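
A boiled-down sketch of that bug pattern (simplified, hypothetical types, not the actual RisingWave code): when the config is only threaded through on the S3 path, every other backend ends up with read_timeout = None, so the tokio::time::timeout branch shown earlier is never taken.

use std::time::Duration;

// Simplified stand-in for MonitoredObjectStore: the timeout only exists if a
// config was passed in at construction time.
struct MonitoredStore {
    read_timeout: Option<Duration>,
}

impl MonitoredStore {
    fn new(read_timeout: Option<Duration>) -> Self {
        Self { read_timeout }
    }
}

fn main() {
    // S3 path: the config (and hence the timeout) is threaded through.
    let s3 = MonitoredStore::new(Some(Duration::from_secs(60)));
    // Other backends after #13618: no config, so no timeout, and a hung read
    // is never cancelled by the wrapper.
    let opendal_s3 = MonitoredStore::new(None);

    assert!(s3.read_timeout.is_some());
    assert!(opendal_s3.read_timeout.is_none());
}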

Li0k (Contributor) commented Mar 5, 2024

Can we close this issue?

Xuanwo (Contributor) commented Mar 5, 2024

Please ping me if this happens again.

hzxa21 closed this as completed Mar 6, 2024