compaction task hangs without error #15209

Closed · hzxa21 opened this issue Feb 23, 2024 · 11 comments
hzxa21 (Collaborator) commented Feb 23, 2024

Recently in our testing pipeline, we found that some compaction tasks hang for a long time (>45 min) without any error. Although the meta-side task expiration mechanism ensures that a task is force-cancelled if it has made no progress for some time, it is worth investigating why the task can hang in the first place.


Note that the recent occurrences all happened in test runs where we were testing the switch to opendal S3:

  1. nexmark-kafka-benchmark-20240221-075859
  2. nexmark-kafka-benchmark-20240220-132426
  3. nexmark-kafka-benchmark-20240220-132306
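
The meta-side expiration mechanism mentioned above is essentially a progress check: if a compactor has not reported progress on a task within some deadline, meta force-cancels the task. A generic sketch of that idea (illustrative names only, not RisingWave's actual meta code):

// Illustrative sketch of a progress-based task expiration check, not the
// actual meta-side implementation. A task whose last reported progress is
// older than `expiration` is considered hung and becomes a cancellation
// candidate.
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct TaskTracker {
    // task id -> time of the last progress report from the compactor
    last_progress: HashMap<u64, Instant>,
    expiration: Duration,
}

impl TaskTracker {
    fn report_progress(&mut self, task_id: u64) {
        self.last_progress.insert(task_id, Instant::now());
    }

    /// Ids of tasks that have made no progress within `expiration`.
    fn expired_tasks(&self) -> Vec<u64> {
        self.last_progress
            .iter()
            .filter(|(_, last)| last.elapsed() > self.expiration)
            .map(|(id, _)| *id)
            .collect()
    }
}

fn main() {
    let mut tracker = TaskTracker {
        last_progress: HashMap::new(),
        expiration: Duration::from_secs(30 * 60),
    };
    tracker.report_progress(42);
    // A periodic check on meta would force-cancel everything returned here.
    let _to_cancel = tracker.expired_tasks();
}
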
hzxa21 (Collaborator, Author) commented Feb 23, 2024

cc @Li0k @wcy-fdu

github-actions bot added this to the release-1.7 milestone Feb 23, 2024
zwang28 (Contributor) commented Feb 26, 2024

Another issue where a compute node hangs while reading a block: #15239

Xuanwo (Contributor) commented Feb 26, 2024

I can't read the logs posted in the comments, so I don't know what happened yet.

Here are the ideas I have:

  • Enable keep_alive and other HTTP settings for opendal, as we do for the AWS S3 SDK:

// Connection tuning: apply optional keepalive, nodelay, and socket buffer
// settings from the object store config to the S3 SDK's HTTP connector.
if let Some(keepalive_ms) = config.s3.object_store_keepalive_ms.as_ref() {
    http.set_keepalive(Some(Duration::from_millis(*keepalive_ms)));
}
if let Some(nodelay) = config.s3.object_store_nodelay.as_ref() {
    http.set_nodelay(*nodelay);
}
if let Some(recv_buffer_size) = config.s3.object_store_recv_buffer_size.as_ref() {
    http.set_recv_buffer_size(Some(*recv_buffer_size));
}
if let Some(send_buffer_size) = config.s3.object_store_send_buffer_size.as_ref() {
    http.set_send_buffer_size(Some(*send_buffer_size));
}
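
For opendal, the closest counterparts would be set when building the underlying reqwest client. A minimal sketch of the mapping, assuming the client is constructed by hand; how it is then wired into the opendal operator depends on the opendal version and is not shown here:

use std::time::Duration;

// Sketch only: a reqwest client with roughly the same TCP tuning as the S3 SDK
// connector above. The keepalive value is illustrative.
// (reqwest's builder does not appear to expose recv/send socket buffer sizes.)
fn build_tuned_http_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // counterpart of object_store_keepalive_ms
        .tcp_keepalive(Some(Duration::from_millis(600_000)))
        // counterpart of object_store_nodelay
        .tcp_nodelay(true)
        // fail fast if a connection cannot even be established
        .connect_timeout(Duration::from_secs(10))
        .build()
}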

Little-Wallace (Contributor) commented Feb 26, 2024

It confuses me that the timeout does not work when the I/O hangs. @Xuanwo

        let future = async {
            self.inner
                .read(path, range)
                .verbose_instrument_await("object_store_read")
                .await
        };
        let res = match self.read_timeout.as_ref() {
            None => future.await,
            Some(read_timeout) => tokio::time::timeout(*read_timeout, future)
                .await
                .unwrap_or_else(|_| Err(ObjectError::internal("read timeout"))),
        };

See details in https://github.com/risingwavelabs/risingwave/blob/main/src/object_store/src/object/mod.rs#L465
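
For reference, this is how the wrapper is expected to behave; a self-contained toy example (not the RisingWave code): as long as the task can be polled, tokio::time::timeout returns an Elapsed error at the deadline even if the inner future never completes.

use std::time::Duration;

#[tokio::main]
async fn main() {
    // A future that never completes, standing in for a hung I/O request.
    let hung_read = std::future::pending::<()>();

    // Because the runtime can still poll this task, the timeout fires at the
    // deadline and we get Err(Elapsed) instead of hanging forever.
    let res = tokio::time::timeout(Duration::from_millis(100), hung_read).await;
    assert!(res.is_err());
    println!("timed out as expected: {:?}", res);
}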

hzxa21 (Collaborator, Author) commented Feb 26, 2024

(Quoting @Xuanwo's suggestion above: enable keep_alive and other HTTP settings for opendal, as is done for the AWS S3 SDK. The quoted code is identical to the snippet in that comment and is omitted here.)

The only opendal logs we saw in the testing are related to retries:

[screenshot of opendal retry logs]

However, I don't think the retry logs are relevant to this issue: a retry indicates that the I/O request is still making progress, whereas what we saw here is I/O requests that are stuck forever. As mentioned by @Little-Wallace, we do enforce a timeout on top of each I/O call by wrapping it in tokio::time::timeout (which is why we didn't enable opendal's TimeoutLayer), but the timeout didn't fire after the deadline passed. That suggests there is a deadlock or some blocking logic somewhere that prevents the timeout future from being polled.
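
To illustrate that hypothesis with a toy example (not the actual code path): tokio::time::timeout can only fire when the wrapped task yields back to the runtime, so a blocking call inside the future keeps the timeout from ever being observed.

use std::time::{Duration, Instant};

#[tokio::main]
async fn main() {
    let start = Instant::now();

    // The blocking sleep stands in for a deadlock or blocking call inside an
    // async path. The task never yields, so the 100ms timeout cannot fire; we
    // only get a result once the blocking call returns.
    let res = tokio::time::timeout(Duration::from_millis(100), async {
        std::thread::sleep(Duration::from_secs(5));
    })
    .await;

    // Prints something like "returned after 5.0s: Ok(())".
    println!("returned after {:?}: {:?}", start.elapsed(), res);
}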

Li0k (Contributor) commented Feb 26, 2024

--- Compactor Traces ---
>> Compaction Task 50708-0
Compaction Task 50708 Split 0  [1843.496s]
  compact [!!! 1843.496s]
    compact_and_build_sst [!!! 1843.496s]
      rewind [!!! 1843.496s]
        object_store_streaming_read_read_bytes [!!! 1843.456s]

>> Compaction Task 51832-0
Compaction Task 51832 Split 0  [1062.498s]
  compact [!!! 1062.498s]
    compact_and_build_sst [!!! 1062.498s]
      add_full_key [!!! 1061.768s]
        object_store_streaming_upload_write_bytes [!!! 1061.768s]

We hit blocking on both a read and an upload at the same time.

hzxa21 (Collaborator, Author) commented Feb 26, 2024

Root cause found:
After #13618, ObjectStoreConfig is only passed to MonitoredObjectStore when the backend is S3ObjectStore, so no timeout is applied for any other object store backend.
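
A boiled-down sketch of that bug pattern (simplified, hypothetical types, not the actual RisingWave code): when the config is only threaded through on the S3 path, every other backend ends up with read_timeout = None, so the tokio::time::timeout branch shown earlier is never taken.

use std::time::Duration;

// Simplified stand-in for MonitoredObjectStore: the timeout only exists if a
// config was passed in at construction time.
struct MonitoredStore {
    read_timeout: Option<Duration>,
}

impl MonitoredStore {
    fn new(read_timeout: Option<Duration>) -> Self {
        Self { read_timeout }
    }
}

fn main() {
    // S3 path: the config (and hence the timeout) is threaded through.
    let s3 = MonitoredStore::new(Some(Duration::from_secs(60)));
    // Other backends after #13618: no config, so no timeout, and a hung read
    // is never cancelled by the wrapper.
    let opendal_s3 = MonitoredStore::new(None);

    assert!(s3.read_timeout.is_some());
    assert!(opendal_s3.read_timeout.is_none());
}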

Li0k (Contributor) commented Mar 5, 2024

Can we close this issue?

Xuanwo (Contributor) commented Mar 5, 2024

Please ping me if this happens again.

hzxa21 closed this as completed Mar 6, 2024