Occasional RuntimeError in cudf.read_parquet with kvikio backend with remote data #599

Open
TomAugspurger opened this issue Jan 27, 2025 · 1 comment


I occasionally see a RuntimeError when using cudf.read_parquet to read a Parquet file from S3.

I'll grab a full traceback next time I see one, but here's part of one:

    #   File "/home/ubuntu/miniforge3/envs/kvikio-env/lib/python3.12/site-packages/cudf/io/parquet.py", line 1280, in _read_parquet
    #     tbl_w_meta = plc.io.parquet.read_parquet(options)
    #                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    #   File "parquet.pyx", line 309, in pylibcudf.io.parquet.read_parquet
    #   File "parquet.pyx", line 324, in pylibcudf.io.parquet.read_parquet
    # RuntimeError: CUDF failure at:/opt/conda/conda-bld/work/cpp/src/io/parquet/reader_impl_preprocess.cu:590: Parquet header parsing failed with code(s) 0x5. With unsupported encodings found:

I've also seen

    # RuntimeError: CUDF failure at:/opt/conda/conda-bld/work/cpp/src/io/parquet/reader_impl_preprocess.cu:314: Parquet header parsing failed with code(s) while counting page headers 0x5

At first glance, this looks like an incomplete read from blob storage. Perhaps kvikio or cudf called .read(nbytes) and fewer than nbytes bytes were returned?
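One way to test the short-read hypothesis is to wrap the read in a loop that refuses to return until every requested byte has arrived. This is a minimal sketch, assuming a file-like object that exposes `readinto`; `read_exact` and `ShortReadError` are hypothetical names for illustration and are not part of kvikio or cudf.

```python
import io


class ShortReadError(IOError):
    """Raised when the source ends before the requested bytes arrive."""


def read_exact(source, nbytes):
    """Read exactly nbytes from source, looping over short reads."""
    buf = bytearray(nbytes)
    view = memoryview(buf)
    total = 0
    while total < nbytes:
        # readinto may legally fill less than the space it is given.
        n = source.readinto(view[total:])
        if not n:  # 0 or None: nothing more is available
            raise ShortReadError(f"expected {nbytes} bytes, got {total}")
        total += n
    return bytes(buf)


# Usage with an in-memory stream standing in for the remote file.
first_five = read_exact(io.BytesIO(b"hello world"), 5)
```

If a wrapper like this raises where the plain read silently succeeded, the corruption is coming from a truncated transfer rather than from the Parquet decoder.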

I'll try to get a more reproducible example and some more debug output.

TomAugspurger commented Jan 28, 2025

Here's another one from a .read() (not using cudf, but presumably cudf eventually calls read). It looks more like a DNS issue:

  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/s3.py", line 595, in read_one
    rf.read(buf)
^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/remote_file.py", line 182, in read
    return self.pread(buf, size, file_offset).get()
  ^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/cufile.py", line 54, in get
    return self._handle.get()
  ^^^^^^^^^^^^^^^^^
  File "future.pyx", line 33, in kvikio._lib.future.IOFuture.get
RuntimeError: curl_easy_perform() error near /opt/conda/conda-bld/work/cpp/src/remote_handle.cpp:353(Could not resolve host: kvikiobench-33622.s3.us-east-1.amazonaws.com)

(I realize now that this and #601 are closely related, since both will likely involve retries. I think #601 will rely on the HTTP status code once the HTTP request completes. This one might be more involved, since here the request never completes.)
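As a rough illustration of what a retry at the caller level could look like, here is a minimal sketch of exponential backoff keyed on the error message. The names `with_retries`, `is_transient`, and `TRANSIENT_MARKERS` are hypothetical, and treating "Could not resolve host" as retryable is an assumption drawn only from the traceback above, not from kvikio's behavior.

```python
import time

# Substring copied from the error message in this issue. Treating it as
# retryable is an assumption, not documented kvikio behavior.
TRANSIENT_MARKERS = ("Could not resolve host",)


def is_transient(exc):
    """Guess whether a RuntimeError looks like a transient network failure."""
    message = str(exc)
    return isinstance(exc, RuntimeError) and any(
        marker in message for marker in TRANSIENT_MARKERS
    )


def with_retries(func, *, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except RuntimeError as exc:
            # Give up on the last attempt or on errors we do not recognize.
            if attempt == attempts - 1 or not is_transient(exc):
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical call site, mirroring the frame in the traceback above:
#     with_retries(lambda: rf.read(buf))
```

Matching on message text is fragile. A retry inside the library could instead key on the curl error code, which is not visible from Python in the tracebacks shown here.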

And another one in read_parquet on a dask worker:

Exception: "RuntimeError('CUDF failure at: /opt/conda/conda-bld/work/cpp/src/io/parquet/reader_impl_chunking.cu:1044: Encountered malformed parquet page data (row count mismatch in page data)')"
Traceback: '  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask/dataframe/io/parquet/core.py", line 97, in __call__
    return read_parquet_part(
           ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask/dataframe/io/parquet/core.py", line 648, in read_parquet_part
    func(
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask_cudf/_legacy/io/parquet.py", line 280, in read_partition
    cls._read_paths(
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask_cudf/_legacy/io/parquet.py", line 124, in _read_paths
    raise err
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/dask_cudf/_legacy/io/parquet.py", line 94, in _read_paths
    df = cudf.read_parquet(
         ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/io/parquet.py", line 911, in read_parquet
    df = _parquet_to_frame(
         ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/io/parquet.py", line 1059, in _parquet_to_frame
    return _read_parquet(
           ^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.conda/envs/remote-io-benchmark/lib/python3.12/site-packages/cudf/io/parquet.py", line 1280, in _read_parquet
    tbl_w_meta = plc.io.parquet.read_parquet(options)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parquet.pyx", line 309, in pylibcudf.io.parquet.read_parquet
  File "parquet.pyx", line 324, in pylibcudf.io.parquet.read_parquet
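
When a whole file has been fetched into host memory, a cheap sanity check can help separate truncated transfers from decoder problems. The sketch below relies on the Parquet file layout (a 4-byte `PAR1` magic at both ends, with a 4-byte little-endian footer length just before the trailing magic) and assumes an unencrypted footer. `check_parquet_framing` is a hypothetical helper, not a cudf or kvikio function.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def check_parquet_framing(data):
    """Return the footer length if data is framed like a whole Parquet file.

    Raises ValueError if the magic bytes or the footer length are
    inconsistent with the buffer size.
    """
    view = memoryview(data)
    # Smallest possible file: leading magic + footer length + trailing magic.
    if len(view) < 12:
        raise ValueError(f"buffer too small for Parquet: {len(view)} bytes")
    if bytes(view[:4]) != PARQUET_MAGIC:
        raise ValueError("missing leading PAR1 magic")
    if bytes(view[-4:]) != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (footer_len,) = struct.unpack("<I", view[-8:-4])
    if footer_len > len(view) - 12:
        raise ValueError(f"footer length {footer_len} exceeds buffer size")
    return footer_len
```

This only catches truncation and gross framing damage. Corruption inside a page, such as the row count mismatch above, would pass this check and needs a byte-level comparison against a known-good copy.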

Another one:

Traceback (most recent call last):
  File "/opt/conda/envs/remote-io-benchmark/bin/kvikiobench", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/benchmark.py", line 159, in main
    asyncio.run(amain())
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/asyncio/base_events.py", line 686, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/benchmark.py", line 152, in amain
    results.append(repeat(func, config=config, n=parsed.n_iter))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/_framework.py", line 107, in repeat
    runs.append(func(config))
                ^^^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/s3.py", line 612, in time_many_large_binary_dask
    client.gather(futures)
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/distributed/client.py", line 2566, in gather
    return self.sync(
           ^^^^^^^^^^
  File "/home/rapids/remote-io-benchmark/remote_io_benchmark/s3.py", line 595, in read_one
    rf.read(buf)
^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/remote_file.py", line 182, in read
    return self.pread(buf, size, file_offset).get()
  ^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/remote-io-benchmark/lib/python3.12/site-packages/kvikio/cufile.py", line 54, in get
    return self._handle.get()
  ^^^^^^^^^^^^^^^^^
  File "future.pyx", line 33, in kvikio._lib.future.IOFuture.get
RuntimeError: curl_easy_perform() error near /opt/conda/conda-bld/work/cpp/src/remote_handle.cpp:353(OpenSSL SSL_read: OpenSSL/3.4.0: error:0A0000C6:SSL routines::packet length too long, errno 0)
