[QST] cudf and kvikio? (this is actually a question about implementing RDMA in another library) #16004

lgray · 2024-06-12T16:03:22Z

Hi! I'm an experimental high energy physicist interested in rapid data analysis, in high energy physics we predominantly use the ROOT which has a pleasant python implementation here: https://github.com/scikit-hep/uproot5

cudf's cuFile and nvcomp based file i/o is extremely powerful (recently benchmarked this ourselves). However, I noticed that https://github.com/rapidsai/kvikio implements this i/o and decompression layer more generally for other libraries, and it is not used in cudf.

Would there be any lost performance from using kvikio as opposed to a dedicated C++ implementation, since uproot is nominally a python only implementation of the ROOT i/o standard? i.e. are there any major performance reasons reasons cudf does not use kvikio?

@martindurant @jpivarski

bdice · 2024-06-12T16:07:06Z

cuDF does use KvikIO's C++ library libkvikio for GPUDirect Storage, in the libcudf C++ code that underpins cuDF's Python interface. Here are some pointers to the build system and docs referencing KvikIO functionality in cuDF:

cudf/cpp/CMakeLists.txt

Lines 209 to 210 in c0c2ad3

# find KvikIO

include(cmake/thirdparty/get_kvikio.cmake)
https://github.com/rapidsai/cudf/blob/c0c2ad355c720d4d168b48aea3d5564efcd890a7/cpp/cmake/thirdparty/get_kvikio.cmake
https://docs.rapids.ai/api/cudf/nightly/user_guide/io/io/#magnum-io-gpudirect-storage-integration

If you have any questions about KvikIO or how cuDF uses it, please feel free to ask!

lgray · 2024-06-12T16:14:12Z

Ahhh ok I missed that and only saw the stuff in the parquet library that has cuFile and nvcomp calls in C++. We'd like to start over in uproot by just working with local file access via DMA with kvikio to the GPU. Is there any conceptual issue we should be aware of there?

Thanks!

Also @nsmith-

bdice · 2024-06-13T02:08:38Z

@madsbk might be better suited to answer that question, but I think you should be fine to use the Python kvikio API or the C++ libkvikio API.

Also I don't know this source code very well but there is an example from the cuCIM library that shows how to do GPUDirect Storage reads of uncompressed TIFF images using the kvikio Python API. See: https://github.com/rapidsai/cucim/blob/branch-24.08/examples/python/gds_whole_slide/

madsbk · 2024-06-13T06:33:48Z

Is there any conceptual issue we should be aware of there?

Since uproot is pure Python, I guess you want to use KvikIO's Python API, which should look a lot like Python's regular file API: https://docs.rapids.ai/api/kvikio/nightly/quickstart/

Using CuFile.read/.write or their non-blocking counterparts, CuFile.pread/.pwrite, any array-like data (host or device memory) are supported.

lgray · 2024-06-13T12:21:55Z

OK - thanks @madsbk @bdice. I had read the quickstart, but became curious about performance penalties stemming from any conceptual issues with using the python bindings when I saw direct usage of cuFile and nvcomp in the C++/CUDA of cudf. This is why I am asking.

Is there anything we need to do in order to use fsspec effectively with this software stack? We do have a custom storage access protocol that we've glued into fsspec. I imagine local file access is fine, but is there any special care to be taken for remote s3, etc.?

madsbk · 2024-06-14T06:25:57Z

OK - thanks @madsbk @bdice. I had read the quickstart, but became curious about performance penalties stemming from any conceptual issues with using the python bindings when I saw direct usage of cuFile and nvcomp in the C++/CUDA of cudf. This is why I am asking.

Using the Python API in a Python application shouldn't introduce any extra overhead compared to writing your own bindings. cudf uses the C++ API because cudf is mostly written in C++.

Is there anything we need to do in order to use fsspec effectively with this software stack? We do have a custom storage access protocol that we've glued into fsspec. I imagine local file access is fine, but is there any special care to be taken for remote s3, etc.?

No, KvikIO only support local file access at the moment :/
It is something we are considering supporting, we are seeing a lot of interest in remote access lately.

lgray · 2024-06-14T12:36:34Z

Thanks for the clarification, we'll go ahead and give it a shot in uproot. Should be an interesting ride.

Are there any particular bits of care we need to give with respect to thread divergence and decompression? I imagine keeping chunks of data mostly the same size ~helps?

lgray · 2024-06-14T13:53:41Z

And, indeed, the remote file access will be super important for our use case. I'll talk to the nVidia contacts we have with Fermilab about getting some attention in this direction.

martindurant · 2024-06-14T13:56:04Z

Obviously, I have my own opinions on how remote file access should work, at least from the user API perspective, so if any group at rapids/nvidia/etc are working on this, I'd be happy to be part of the conversation.

bdice · 2024-06-14T15:55:38Z

cc: @rjzamora on the topic of remote filesystem access.

rjzamora · 2024-06-15T13:27:29Z

We are in the process of prioritizing remote-file performance in cudf, and it may make the most sense to leverage kvikio as a space to make the necessary improvements.

@martindurant - I'd love to get your thoughts in #15919

quasiben · 2024-06-17T16:17:20Z

If you are trying to do remote RDMA (GPU RDMA) you could use https://github.com/rapidsai/ucxx but you will need accelerated network devices as well. This is more common at HPC Labs so I would also ask about Infiniband or Slingshot. cc @pentschev

It looks like Wilson is being decommissioned soon but was architected with Infiniband NICs

lgray · 2024-06-17T16:31:15Z

Yes - we have an internal cluster that we're putting together testing this stuff on. We'll have to get accelerated NICs for it, which will come in time. We unfortunately don't have infiniband or slingshot on that cluster. Wilson's GPUs are quite old in any case, K40s, iirc.

The main thing I'd like to achieve with this is loading files from a remote storage system into GPU memory. While pretty obvious if it's mounted, with s3 or similar it wasn't clear the proper way forward. Will continue with kvikio and mounted filesystems in the mean time.

lgray · 2024-06-24T20:43:42Z

Is there any reason why I would start getting memory errors on a 40GB slice of an a100 when trying to allocate more than 10 GiB or so?

madsbk · 2024-06-25T06:32:35Z

Is there any reason why I would start getting memory errors on a 40GB slice of an a100 when trying to allocate more than 10 GiB or so?

Are you using RMM (if you are allocating through cudf, you are), you can try the new memory profiler in the nightly package: https://docs.rapids.ai/api/rmm/nightly/guide/#memory-profiler . It might show where and the size of the peak memory allocation.

lgray · 2024-06-25T12:59:53Z

@madsbk Great, thank you. I am indeed still using cudf so I'll try this out. awkward-array uses cupy as a backend, so in the fullness of time we'll have to understand if we're using rmm or some other manager.

bdice · 2024-06-25T13:03:16Z

CuPy does not use RMM by default but you can enable RMM as the memory manager for CuPy. Here are some references.

https://docs.rapids.ai/api/rmm/stable/guide/#using-rmm-with-third-party-libraries

https://docs.cupy.dev/en/stable/user_guide/interoperability.html#rmm

https://docs.cupy.dev/en/stable/user_guide/memory.html

lgray · 2024-06-25T15:55:20Z

Yes! I saw that, thanks for further docs. We'll have to figure out something in awkward that detects and properly uses RMM when it is present.

lgray · 2024-06-25T16:19:40Z

Indeed, it seems about 10GB of compressed input data corresponds to 41GB (and the MIG slice is 42GB).

I guess the peak allocation corresponds to intermediate decompression buffers and then the final array assembly buffer?

I imagine there's a way to do this where only pairs of from->to buffers are initialized at a time?
Or is that found to be slower and the correct usage pattern is to utilize the GPU on data that is 1/4th the memory size available? This does give us ample room to pack in calculations.

martindurant · 2024-06-25T16:43:47Z

For parquet, whether or not you can decompress bytes into their final storage depends on quite a few different factors. You generally want V2 pages and SIMPLE encoding; and even then, only some data types will work (not string!). I wouldn't be surprised if cudf doesn't even bother looking for this efficient path.

lgray · 2024-06-25T16:56:00Z

This is float32/int/bool data in either singly-nested ragged arrays or just flat arrays. No strings or anything.
Where should I root around in the parquet metadata for V2 pages, and these other doodads, to make sure I'm doing something optimal?

martindurant · 2024-06-25T17:07:56Z

Where should I root around in the parquet metadata for V2 pages

The encoding stats - how many pages of each type - are contained in the main file metadata (row_groups[].columns[].meta_data.encoding_stats). I think strictly they are optional, but I expect them to be present.

The pages themselves can be either V1 or V2 as you come to them. A "version: 2" in the global metadata guarantees nothing. In https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html , you see data_page_version as an option, V1 by default(!!), so if you were writing your own files, this is what you would change.

For INT, delta encoding would not allow decompress-to-target
For FLOAT, byte_stream_split likewise
Dictionary or RLE encoding would spoil any data type, but you really shouldn't use that for numbers, unless you have a very few unique but large values.

martindurant · 2024-06-25T17:08:22Z

Oh, and I don't know if you can see any of those details via pyarrow. I use fastparquet...

lgray · 2024-06-25T17:15:49Z

It looks like I can enforce these options by setting data_page_version="2.0" and column_encoding="PLAIN" in pyarrow. We'll regenerate the file and try again.

lgray · 2024-06-25T17:19:41Z

@fstrug you should follow this!

nsmith- · 2024-06-26T13:30:26Z

For INT, delta encoding would not allow decompress-to-target
For FLOAT, byte_stream_split likewise

These encodings are something we will want in our data, especially the split float (due to our use of mantissa truncation for reduced precision), so maybe not worth worrying too much about minimizing intermediate buffers and just operate on smaller numbers of row groups at a time?

lgray · 2024-06-26T14:18:58Z

@nsmith- We'll have to see how this balances with throughput performance. I'd be curious if we can load the next dataset in while finishing compute on the last. Probably some automated sizing can be done per-analysis.

lgray added the question Further information is requested label Jun 12, 2024

wence- mentioned this issue Jun 12, 2024

Bring source-line coverage for cudf-polars interpreter to 100% #15974

Closed

3 tasks

rjzamora mentioned this issue Jun 15, 2024

[FEA] Develop new approach for handling remote I/O #15919

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[QST] cudf and kvikio? (this is actually a question about implementing RDMA in another library) #16004

[QST] cudf and kvikio? (this is actually a question about implementing RDMA in another library) #16004

lgray commented Jun 12, 2024 •

edited

Loading

bdice commented Jun 12, 2024 •

edited

Loading

lgray commented Jun 12, 2024 •

edited

Loading

bdice commented Jun 13, 2024 •

edited

Loading

madsbk commented Jun 13, 2024

lgray commented Jun 13, 2024

madsbk commented Jun 14, 2024

lgray commented Jun 14, 2024

lgray commented Jun 14, 2024

martindurant commented Jun 14, 2024

bdice commented Jun 14, 2024

rjzamora commented Jun 15, 2024

quasiben commented Jun 17, 2024

lgray commented Jun 17, 2024

lgray commented Jun 24, 2024

madsbk commented Jun 25, 2024

lgray commented Jun 25, 2024

bdice commented Jun 25, 2024

lgray commented Jun 25, 2024

lgray commented Jun 25, 2024

martindurant commented Jun 25, 2024

lgray commented Jun 25, 2024

martindurant commented Jun 25, 2024

martindurant commented Jun 25, 2024

lgray commented Jun 25, 2024

lgray commented Jun 25, 2024

nsmith- commented Jun 26, 2024

lgray commented Jun 26, 2024

[QST] cudf and kvikio? (this is actually a question about implementing RDMA in another library) #16004

[QST] cudf and kvikio? (this is actually a question about implementing RDMA in another library) #16004

Comments

lgray commented Jun 12, 2024 • edited Loading

bdice commented Jun 12, 2024 • edited Loading

lgray commented Jun 12, 2024 • edited Loading

bdice commented Jun 13, 2024 • edited Loading

madsbk commented Jun 13, 2024

lgray commented Jun 13, 2024

madsbk commented Jun 14, 2024

lgray commented Jun 14, 2024

lgray commented Jun 14, 2024

martindurant commented Jun 14, 2024

bdice commented Jun 14, 2024

rjzamora commented Jun 15, 2024

quasiben commented Jun 17, 2024

lgray commented Jun 17, 2024

lgray commented Jun 24, 2024

madsbk commented Jun 25, 2024

lgray commented Jun 25, 2024

bdice commented Jun 25, 2024

lgray commented Jun 25, 2024

lgray commented Jun 25, 2024

martindurant commented Jun 25, 2024

lgray commented Jun 25, 2024

martindurant commented Jun 25, 2024

martindurant commented Jun 25, 2024

lgray commented Jun 25, 2024

lgray commented Jun 25, 2024

nsmith- commented Jun 26, 2024

lgray commented Jun 26, 2024

lgray commented Jun 12, 2024 •

edited

Loading

bdice commented Jun 12, 2024 •

edited

Loading

lgray commented Jun 12, 2024 •

edited

Loading

bdice commented Jun 13, 2024 •

edited

Loading