
[QST] cudf and kvikio? (this is actually a question about implementing RDMA in another library) #16004

Open
lgray opened this issue Jun 12, 2024 · 27 comments
Labels
question (Further information is requested)

Comments

@lgray

lgray commented Jun 12, 2024

Hi! I'm an experimental high energy physicist interested in rapid data analysis. In high energy physics we predominantly use the ROOT format, which has a pleasant Python implementation here: https://github.com/scikit-hep/uproot5

cudf's cuFile- and nvcomp-based file I/O is extremely powerful (we recently benchmarked this ourselves). However, I noticed that https://github.com/rapidsai/kvikio implements this I/O and decompression layer more generally for other libraries, and it does not appear to be used in cudf.

Would any performance be lost by using kvikio as opposed to a dedicated C++ implementation, given that uproot is nominally a Python-only implementation of the ROOT I/O standard? i.e., are there any major performance reasons cudf does not use kvikio?

@martindurant @jpivarski

lgray added the question (Further information is requested) label on Jun 12, 2024
@bdice
Contributor

bdice commented Jun 12, 2024

cuDF does use KvikIO's C++ library libkvikio for GPUDirect Storage, in the libcudf C++ code that underpins cuDF's Python interface. Here are some pointers to the build system and docs referencing KvikIO functionality in cuDF:

If you have any questions about KvikIO or how cuDF uses it, please feel free to ask!

@lgray
Author

lgray commented Jun 12, 2024

Ahh, OK, I missed that and only saw the code in the parquet library that makes cuFile and nvcomp calls in C++. We'd like to start in uproot by working with local file access via DMA to the GPU using kvikio. Is there any conceptual issue we should be aware of there?

Thanks!

Also @nsmith-

@bdice
Contributor

bdice commented Jun 13, 2024

@madsbk might be better suited to answer that question, but I think you should be fine to use the Python kvikio API or the C++ libkvikio API.

Also I don't know this source code very well but there is an example from the cuCIM library that shows how to do GPUDirect Storage reads of uncompressed TIFF images using the kvikio Python API. See: https://github.com/rapidsai/cucim/blob/branch-24.08/examples/python/gds_whole_slide/

@madsbk
Member

madsbk commented Jun 13, 2024

Is there any conceptual issue we should be aware of there?

Since uproot is pure Python, I guess you want to use KvikIO's Python API, which should look a lot like Python's regular file API: https://docs.rapids.ai/api/kvikio/nightly/quickstart/

With CuFile.read/.write or their non-blocking counterparts, CuFile.pread/.pwrite, any array-like data (host or device memory) is supported.
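
For example, a minimal sketch following the quickstart (the file path is a placeholder):

```python
# Minimal KvikIO sketch based on the quickstart; "/tmp/test-file" is a placeholder path.
import cupy
import kvikio

a = cupy.arange(100)
with kvikio.CuFile("/tmp/test-file", "w") as f:
    f.write(a)                      # write the whole array from device memory

b = cupy.empty_like(a)
with kvikio.CuFile("/tmp/test-file", "r") as f:
    f.read(b)                       # blocking read straight into GPU memory

    # Non-blocking variant: pread returns a future; .get() waits and
    # returns the number of bytes read.
    fut = f.pread(b, file_offset=0)
    fut.get()

assert (a == b).all()
```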

@lgray
Author

lgray commented Jun 13, 2024

OK - thanks @madsbk @bdice. I had read the quickstart, but became curious about performance penalties stemming from any conceptual issues with using the python bindings when I saw direct usage of cuFile and nvcomp in the C++/CUDA of cudf. This is why I am asking.

Is there anything we need to do in order to use fsspec effectively with this software stack? We do have a custom storage access protocol that we've glued into fsspec. I imagine local file access is fine, but is there any special care to be taken for remote s3, etc.?

@madsbk
Member

madsbk commented Jun 14, 2024

OK - thanks @madsbk @bdice. I had read the quickstart, but became curious about performance penalties stemming from any conceptual issues with using the python bindings when I saw direct usage of cuFile and nvcomp in the C++/CUDA of cudf. This is why I am asking.

Using the Python API in a Python application shouldn't introduce any extra overhead compared to writing your own bindings. cudf uses the C++ API because cudf is mostly written in C++.

Is there anything we need to do in order to use fsspec effectively with this software stack? We do have a custom storage access protocol that we've glued into fsspec. I imagine local file access is fine, but is there any special care to be taken for remote s3, etc.?

No, KvikIO only supports local file access at the moment :/
It is something we are considering, though; we have been seeing a lot of interest in remote access lately.
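
In the meantime, a plain-Python fallback for remote objects (staging through host memory, no GDS) could look something like this, using fsspec with a hypothetical S3 URL and dtype:

```python
# Fallback without GDS: fetch remote bytes through fsspec, then copy host -> device.
# "s3://bucket/file.bin" and the dtype are placeholders.
import cupy
import fsspec
import numpy as np

with fsspec.open("s3://bucket/file.bin", "rb") as f:
    raw = f.read()                          # bytes land in host memory

arr = cupy.asarray(np.frombuffer(raw, dtype=np.float32))  # explicit host-to-device copy
```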

@lgray
Author

lgray commented Jun 14, 2024

Thanks for the clarification, we'll go ahead and give it a shot in uproot. Should be an interesting ride.

Are there any particular things we need to be careful about with respect to thread divergence and decompression? I imagine keeping chunks of data roughly the same size helps?

@lgray
Author

lgray commented Jun 14, 2024

And, indeed, remote file access will be super important for our use case. I'll talk to the NVIDIA contacts we have with Fermilab about getting some attention in this direction.

@martindurant

Obviously, I have my own opinions on how remote file access should work, at least from the user-API perspective, so if any group at RAPIDS/NVIDIA/etc. is working on this, I'd be happy to be part of the conversation.

@bdice
Contributor

bdice commented Jun 14, 2024

cc: @rjzamora on the topic of remote filesystem access.

@rjzamora
Member

We are in the process of prioritizing remote-file performance in cudf, and it may make the most sense to leverage kvikio as a space to make the necessary improvements.

@martindurant - I'd love to get your thoughts in #15919

@quasiben
Member

If you are trying to do remote RDMA (GPUDirect RDMA), you could use https://github.com/rapidsai/ucxx, but you will need accelerated network devices as well. This is more common at HPC labs, so I would also ask about InfiniBand or Slingshot. cc @pentschev

It looks like Wilson is being decommissioned soon, but it was architected with InfiniBand NICs.

@lgray
Author

lgray commented Jun 17, 2024

Yes - we have an internal cluster that we're putting together for testing this stuff. We'll have to get accelerated NICs for it, which will come in time. We unfortunately don't have InfiniBand or Slingshot on that cluster. Wilson's GPUs are quite old in any case, K40s, IIRC.

The main thing I'd like to achieve is loading files from a remote storage system into GPU memory. While that's pretty obvious if the storage is mounted, with S3 or similar the proper way forward wasn't clear. We'll continue with kvikio and mounted filesystems in the meantime.

@lgray
Author

lgray commented Jun 24, 2024

Is there any reason why I would start getting memory errors on a 40 GB slice of an A100 when trying to allocate more than 10 GiB or so?

@madsbk
Member

madsbk commented Jun 25, 2024

Is there any reason why I would start getting memory errors on a 40 GB slice of an A100 when trying to allocate more than 10 GiB or so?

Are you using RMM? (If you are allocating through cudf, you are.) If so, you can try the new memory profiler in the nightly package: https://docs.rapids.ai/api/rmm/nightly/guide/#memory-profiler . It might show where the peak memory allocation happens and how large it is.
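
A rough sketch of enabling that profiler, with API names taken from the linked guide (adjust to your RMM version; the parquet path here is a placeholder):

```python
# Sketch of the RMM memory profiler per the linked guide; "data.parquet" is a placeholder.
import cudf
import rmm.statistics

# Wrap the current RMM memory resource so allocations are tracked.
rmm.statistics.enable_statistics()

@rmm.statistics.profiler()  # records bytes allocated (including the peak) inside the function
def load(path):
    return cudf.read_parquet(path)

df = load("data.parquet")
print(rmm.statistics.default_profiler_records.report())
```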

@lgray
Author

lgray commented Jun 25, 2024

@madsbk Great, thank you. I am indeed still using cudf, so I'll try this out. awkward-array uses CuPy as a backend, so in the fullness of time we'll have to understand whether we're using RMM or some other memory manager.
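
In case it's useful, the documented way to point CuPy at RMM (so CuPy-backed awkward arrays and cudf draw from the same memory resource) looks roughly like this:

```python
# Route CuPy allocations through RMM so cudf and CuPy share one memory resource.
import cupy
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

rmm.reinitialize(pool_allocator=True)        # optional: use a pooled allocator
cupy.cuda.set_allocator(rmm_cupy_allocator)

x = cupy.arange(10)                          # now allocated via RMM
```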

@bdice
Contributor

bdice commented Jun 25, 2024

@lgray
Author

lgray commented Jun 25, 2024

Yes! I saw that, thanks for further docs. We'll have to figure out something in awkward that detects and properly uses RMM when it is present.

@lgray
Author

lgray commented Jun 25, 2024

Indeed, it seems about 10 GB of compressed input data corresponds to a peak of about 41 GB of device memory (and the MIG slice is 42 GB).
[attached image]

I guess the peak allocation corresponds to the intermediate decompression buffers plus the final array-assembly buffer?

I imagine there's a way to do this where only pairs of from->to buffers are live at a time? Or has that been found to be slower, and the correct usage pattern is to use the GPU on data that is about a quarter of the available memory? That does give us ample room to pack in calculations.

@martindurant

For parquet, whether or not you can decompress bytes directly into their final storage depends on quite a few different factors. You generally want V2 pages and PLAIN encoding; and even then, only some data types will work (not strings!). I wouldn't be surprised if cudf doesn't even bother looking for this efficient path.

@lgray
Copy link
Author

lgray commented Jun 25, 2024

This is float32/int/bool data in either singly-nested ragged arrays or just flat arrays. No strings or anything.
Where should I root around in the parquet metadata for V2 pages, and these other doodads, to make sure I'm doing something optimal?

@martindurant

Where should I root around in the parquet metadata for V2 pages

The encoding stats - how many pages of each type - are contained in the main file metadata (row_groups[].columns[].meta_data.encoding_stats). I think strictly they are optional, but I expect them to be present.

The pages themselves can be either V1 or V2 as you come to them. A "version: 2" in the global metadata guarantees nothing. In https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html , you see data_page_version as an option, V1 by default(!!), so if you were writing your own files, this is what you would change.

For INT, delta encoding would not allow decompress-to-target.
For FLOAT, byte_stream_split likewise.
Dictionary or RLE encoding would spoil any data type, but you really shouldn't use those for numbers unless you have very few unique but large values.

@martindurant

Oh, and I don't know if you can see any of those details via pyarrow. I use fastparquet...
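
For example, a hedged sketch of digging those stats out with fastparquet (assuming it exposes the raw thrift metadata via .fmd; the file path is a placeholder):

```python
# Inspect per-column page encoding stats via fastparquet's raw metadata.
# Attribute names follow the thrift layout quoted above; "data.parquet" is a placeholder.
import fastparquet

pf = fastparquet.ParquetFile("data.parquet")
for rg in pf.fmd.row_groups:
    for col in rg.columns:
        md = col.meta_data
        # encoding_stats is optional; each entry carries page_type, encoding and count
        print(md.path_in_schema, md.encodings, md.encoding_stats)
```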

@lgray
Author

lgray commented Jun 25, 2024

It looks like I can enforce these options by setting data_page_version="2.0" and column_encoding="PLAIN" in pyarrow. We'll regenerate the file and try again.
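
For concreteness, a small sketch of that write call (pyarrow requires dictionary encoding to be disabled for column_encoding to take effect; the table and output path are placeholders):

```python
# Write parquet with V2 data pages and PLAIN-encoded, non-dictionary columns.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "x": pa.array([1.5, 2.5, 3.5], type=pa.float32()),
    "n": pa.array([1, 2, 3], type=pa.int32()),
})

pq.write_table(
    table,
    "plain_v2.parquet",        # placeholder output path
    data_page_version="2.0",   # V2 data pages
    use_dictionary=False,      # required for column_encoding to apply
    column_encoding="PLAIN",
)
```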

@lgray
Author

lgray commented Jun 25, 2024

@fstrug you should follow this!

@nsmith-

nsmith- commented Jun 26, 2024

For INT, delta encoding would not allow decompress-to-target
For FLOAT, byte_stream_split likewise

These encodings are something we will want in our data, especially byte_stream_split for floats (due to our use of mantissa truncation for reduced precision), so maybe it's not worth worrying too much about minimizing intermediate buffers, and we should just operate on smaller numbers of row groups at a time?
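
A rough sketch of that pattern, assuming cudf.read_parquet's row_groups argument (check your cudf version; the path and batch size are placeholders):

```python
# Bound peak device memory by reading a few row groups at a time.
import cudf
import pyarrow.parquet as pq

path = "events.parquet"                      # placeholder input file
n_rg = pq.ParquetFile(path).num_row_groups
batch = 4                                    # tune so decompression + assembly buffers fit

for start in range(0, n_rg, batch):
    groups = list(range(start, min(start + batch, n_rg)))
    df = cudf.read_parquet(path, row_groups=groups)
    # ... run the per-chunk computation here ...
    del df                                   # free buffers before the next batch
```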

@lgray
Author

lgray commented Jun 26, 2024

@nsmith- We'll have to see how this balances against throughput. I'd be curious whether we can load the next dataset while finishing compute on the last. Probably some automated sizing can be done per analysis.
