Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serializing CUDA objects in iterables #110

Closed
pentschev opened this issue Aug 9, 2019 · 4 comments
Closed

Serializing CUDA objects in iterables #110

pentschev opened this issue Aug 9, 2019 · 4 comments

Comments

@pentschev
Copy link
Member

We have improved serialization performance in #98, but it introduced an issue: serializing tuples (and potentially all other iterable types) with CUDA objects fails now. The issue is distributed.protocol.serialize will only serialize tuples when the serializer is pickle, for instance, serialize((np_array1, np_array2), serializers=["dask"]) will also fail.

I've found out about this issue when trying to run the sample from #57, more specifically, it happens during the df = df.sort_values(['A','B','C']) call, where it tries to serialize a tuple of (<cudf.DataFrame ncols=11 nrows=2759411 >, <cudf.DataFrame ncols=11 nrows=2759411 >).

@mrocklin maybe you have an idea on how to solve this? I thought of checking for iterables on device_to_host and serializing each object individually, returning perhaps an iterable of DeviceSerialized objects, but then we could have tuples of tuples, and so on, that would make things more complicated, so it doesn't seem like a resilient solution. Also, I'm not sure if the solution should be here or in dask-distributed, or maybe there's one already that I just don't know of.

@mrocklin
Copy link
Contributor

mrocklin commented Aug 9, 2019

For now I think that your idea of implementing cuda_serialize on standard collections (tuple, list, set, dict) makes sense. If they encounter no cuda-serialiable objects then they should raise NotImplementedError so that lower-priority serializers take over. See this code:

https://github.com/dask/distributed/blob/a55515569d4c5da734e5b14ae414cd342c37ed7b/distributed/protocol/serialize.py#L140-L150

@pentschev
Copy link
Member Author

But doing the way you're suggesting, the case I mentioned at the beginning (tuple of cuDF dataframes) would be serialized using pickle, which is slow. Is that what you're saying should happen?

Just to make it clear in case my initial comment wasn't, what I meant was a more complex handling that would serialize each of the elements of the tuple using the CUDA serializer, but it would be more complex and probably wouldn't solve for all possible object combinations (e.g., cuDF dataframes inside a tuple, inside another tuple).

@pentschev
Copy link
Member Author

Ignore my previous comment, I had misunderstood your suggestion. It does make sense to handle it the way you're saying, I'll try doing that.

@pentschev
Copy link
Member Author

This has been fixed in dask/distributed#2948 and #118, closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants