Just-in-time deserialization #353

Merged: 59 commits into rapidsai:branch-0.17 on Nov 23, 2020

Conversation

@madsbk (Member) commented Aug 6, 2020

This PR implements just-in-time deserialization of device memory by wrapping DeviceHostFile items in a proxy class: ProxyObject. Deserialization of the items is then delayed until they are accessed.

Fixes #342

Ref. #57

Use

In order to enable JIT deserialization, use the new jit_unspill argument when creating LocalCUDACluster, set --enable-jit-unspill when starting a CUDAWorker, or set the environment variable DASK_JIT_UNSPILL=True.
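
A minimal sketch of the first option (only the jit_unspill keyword comes from this PR's description; the cluster/client setup is standard Dask usage):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Enable JIT deserialization (un-spilling) for all workers in the cluster
cluster = LocalCUDACluster(jit_unspill=True)
client = Client(cluster)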

TODO

  • Implement ProxyObject (should come up with a better name)
  • Add some basic tests
  • Support Dask serialization of ProxyObject in order to allow communication of spilled data
  • Support CUDA serialization of ProxyObject in order to avoid re-spilling of data in the case where ProxyObject wraps un-spilled device data.
  • Improve the transparency of ProxyObject
  • Write documentation

@codecov-commenter commented Aug 6, 2020

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.17@25327eb).
The diff coverage is n/a.

@@              Coverage Diff               @@
##             branch-0.17     #353   +/-   ##
==============================================
  Coverage               ?   58.36%           
==============================================
  Files                  ?       19           
  Lines                  ?     1561           
  Branches               ?        0           
==============================================
  Hits                   ?      911           
  Misses                 ?      650           
  Partials               ?        0           


@madsbk changed the base branch from branch-0.15 to branch-0.16 on September 11, 2020 13:44
@pentschev (Member) commented:

@madsbk so it seems that the reason it fails to import pandas is the import of dask.dataframe, but Dask isn't pulling Pandas in. Is Pandas going to become a hard dependency here? If so, we will have to add pandas >=1.0,<1.2.0dev0 in https://github.com/rapidsai/dask-cuda/blob/branch-0.16/conda/recipes/dask-cuda/meta.yaml. The version above must match that from https://github.com/rapidsai/integration/blob/branch-0.16/conda/recipes/versions.yaml#L95.

@rapidsai deleted a comment from pentschev on Sep 14, 2020
@madsbk (Member, Author) commented Sep 14, 2020

@madsbk so it seems that the reason it fails to import pandas is in importing dask.dataframe, but Dask isn't pulling Pandas. Is Pandas going to become a hard dependency here? If so, we will have to add pandas: >=1.0,<1.2.0dev0' in https://github.com/rapidsai/dask-cuda/blob/branch-0.16/conda/recipes/dask-cuda/meta.yaml .

Makes sense, thanks @pentschev for investigating this!
I have added the Pandas dependency for now.

@madsbk force-pushed the jit_deserialization branch from edf503e to 40058cd on September 14, 2020 19:10
@jakirkham (Member) commented:

Well Pandas is required by dask here. We also include pandas in the RAPIDS build environment. Maybe there's something else going on?

@pentschev (Member) left a comment:

Mads, this is some great work you put in here; it's great to see the incredible speedup you've achieved!

Functionality-wise, I don't have much to comment on; some parts are well beyond my understanding, and it seems like there was enough testing by you and others with TPCx-BB. Therefore, my comments/requests revolve around style and documentation only.

dask_cuda/proxy_object.py (resolved review thread)
        return sys.getsizeof(self._obj_pxy_deserialize())

    def __len__(self):
        return len(self._obj_pxy_deserialize())
Member:

Will this deserialize the entire object just to check its length? If so, wouldn't it make sense to store an attribute with the length during serialization and just return it to avoid this?

@madsbk (Member, Author):

Yes, currently len(x) will deserialize x. What you suggest, we are already doing with x.name, which I added because the worker will access x.name before executing tasks.
My plan is to address this issue in a follow-up PR when we have some more experience using JIT deserialization. I suspect we will need to handle a range of attributes.

Member:

Makes sense, thanks for the continued work on it.
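
A rough, hypothetical sketch of the pattern described above (_ProxySketch and _fixed_attrs are illustrative names, not the actual ProxyObject internals): cheap metadata such as name is cached at proxy creation, while any other access triggers deserialization on demand.

class _ProxySketch:
    """Hypothetical simplification: only `name` is cached eagerly."""

    def __init__(self, obj):
        # Cheap metadata recorded up front, available without unspilling
        self._fixed_attrs = {"name": getattr(obj, "name", None)}
        self._spilled = obj  # stand-in for the serialized/spilled payload

    def _obj_pxy_deserialize(self):
        # A real implementation would unspill/deserialize here
        return self._spilled

    def __getattr__(self, item):
        if item in self._fixed_attrs:
            return self._fixed_attrs[item]  # no deserialization needed
        return getattr(self._obj_pxy_deserialize(), item)  # JIT unspill

    def __len__(self):
        return len(self._obj_pxy_deserialize())  # still unspills, as discussed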

        return complex(self._obj_pxy_deserialize())

    def __index__(self):
        return operator.index(self._obj_pxy_deserialize())
Member:

Are all these dunder method implementations here because they are going to be needed in practice, or are they here just to match a full implementation?

@madsbk (Member, Author) commented Nov 12, 2020:

Some are needed for ProxyObject to pass through cuDF and TPCx-BB workflows, but many of them are just here for completeness.

Member:

I see. Given they're already there I won't suggest we remove them, but for future reference I think we can try to keep the code shorter instead of solving all possible cases that are probably unnecessary; this helps with future maintainability.

@madsbk (Member, Author):

I agree, but since ProxyObject is exposed to the end user, I think it is reasonable to support the most common operations. Say a user writes a Dask task that uses NumPy arrays: the user should be able to use most, if not all, NumPy operations.

"subclass": subclass,
"serializers": serializers,
}
self._obj_pxy_lock = threading.RLock()
Member:

I'm curious, why isn't a simple threading.Lock sufficient here?

@madsbk (Member, Author) commented Nov 12, 2020:

It is to handle all the methods that acquire the lock before calling _obj_pxy_deserialize(), which also acquires the lock.
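
A simplified sketch of the situation described above (not the actual ProxyObject code): the outer method already holds the lock when _obj_pxy_deserialize() acquires it again, which would deadlock with a plain threading.Lock but works with an RLock.

import threading

class _LockSketch:
    def __init__(self, obj):
        self._obj = obj
        self._obj_pxy_lock = threading.RLock()

    def _obj_pxy_deserialize(self):
        with self._obj_pxy_lock:  # inner acquire by the same thread
            return self._obj

    def __len__(self):
        with self._obj_pxy_lock:  # outer acquire
            return len(self._obj_pxy_deserialize())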

        List of frames that makes up the serialized object
        """
        with self._obj_pxy_lock:
            assert serializers is not None
Member:

Should we replace runtime assertions with more appropriate exceptions? For example, here I think it would make more sense to raise a ValueError.

@madsbk (Member, Author):

I am not sure; the check is there to assert the internal logic of the class. The _obj_pxy_* methods are not supposed to be called by the user.

Member:

I understand that, but nevertheless I think of assert as a debug statement, so I'm not sure it makes much sense in runtime code. I know Python doesn't distinguish between "debug" and "release" builds, like we can with C, but regardless I think it's clearer to have errors be specific. One could argue that assertions could substitute for virtually any exception we might raise, but explicit exceptions let us be clearer about our intent.

In the interest of avoiding nitpicking, I will leave the final decision up to you, but I think there's real value in being more specific instead of using assertions, whenever we want to raise an exception if something goes wrong.

@madsbk (Member, Author):

I am not sure I agree completely. I like to use assert in internal logic, both to catch bugs while developing and to document the expected state of an object.
However, in this case I agree with you: since it can be triggered by user input, it is preferable to raise a ValueError.
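
A minimal, hypothetical sketch of the agreed change (method body and message text are illustrative, not the exact code):

class _SerializeSketch:
    """Illustrative fragment: the internal assert replaced by an explicit error."""

    def _obj_pxy_serialize(self, serializers):
        if serializers is None:
            # Was: assert serializers is not None
            raise ValueError("Please specify a list of serializers")
        # ... the actual serialization would follow here
        return serializers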

dask_cuda/proxy_object.py (outdated, resolved review thread)
@beckernick (Member) commented:

We'll give this a test in a fresh environment on TPCx-BB workloads on constrained-memory systems as well.

@madsbk (Member, Author) commented Nov 13, 2020

@pentschev thanks for the review!
I have addressed all of your suggestions, and it made me realize that we need a function unproxy() to access the proxied object directly in a clean way.
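
A rough sketch of what such a helper could look like (the exact unproxy() added in this PR may differ):

def unproxy(obj):
    """Return the wrapped object if obj is a ProxyObject, otherwise obj itself."""
    try:
        # ProxyObject exposes its payload via _obj_pxy_deserialize()
        return obj._obj_pxy_deserialize()
    except AttributeError:
        return obj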

@pentschev (Member) left a comment:

Thanks @madsbk, I've suggested some styling and typo fixes; feel free to ignore those you disagree with. It's a lot of minor things, but at least GH will make it easy for you to apply them. :)

Resolved (outdated) review threads: dask_cuda/proxy_object.py (9), dask_cuda/tests/test_proxy.py (1)
@pentschev (Member) commented:

Also, I think we can leave this open until EOD or Monday so people who are interested can still review or test it before merging, but otherwise it looks good to me! Thanks again @madsbk for the work you put into this, it's really great!

Co-authored-by: Peter Andreas Entschev <[email protected]>
@madsbk (Member, Author) commented Nov 13, 2020

I've suggested some styling and typo fixes, feel free to ignore those you disagree with. It's a lot of minor things, but at least GH will make it easy for you to apply them. :)

It is all in :)

Also, I think we can leave this open until EOD or Monday so people who are interested can still review or test it before merging, but otherwise it looks good to me!

Yes, no hurry. I'd also like to wait for a verdict from @beckernick :P

@beckernick (Member) commented:

I am running this currently :)

@beckernick (Member) commented Nov 19, 2020

In a fresh environment today, I ran tpcxbb q02 several times with the following configuration (the default):

  • 8 GPUs of a DGX-2 (GPUs 0-7)
  • DEVICE_MEMORY_LIMIT="15GB"
  • POOL_SIZE="30GB"
  • TCP communication
  • Reading parquet files comprising 2GB in-memory data chunks from the local /raid of the DGX-2

By default, q02 takes ~300 seconds with the above.

With DASK_JIT_UNSPILL=False, q02 also takes ~300 seconds, as expected.

With DASK_JIT_UNSPILL=True, q02 almost always runs out of memory. When it succeeds, it takes about 100-140 seconds (depending on hot vs cold), which is a huge boost.

With DASK_JIT_UNSPILL=False, the peak memory after each time q02 finishes is the following:
[Screenshot: peak GPU memory after run 2, DASK_JIT_UNSPILL=False]

Peak memory spiked on GPU 0 due to the client process, but went back down as expected after completion. After additional runs, the memory profile looks identical.

With DASK_JIT_UNSPILL=True, the peak memory after each time q02 finishes looks different, due to the OOM. When it succeeded, we had to allocate more memory, so the active memory does not go back down to 29352 MB on the GPUs that needed more memory.

[Screenshot: peak GPU memory after run 1, DASK_JIT_UNSPILL=True, OOM]

Going down to a device memory limit of 10GB doesn't stop the OOM, though runs can also sometimes succeed. When it succeeds, we still see the huge speed boost.

Finally, when it succeeds, we don't ever need to allocate more memory for the pool. It seems like if the workload needs to allocate additional memory due to DASK_JIT_UNSPILL=True, we will end up failing. This is a GPU memory screenshot from the middle of a q02 run that failed, which was kicked off after one that succeeded:

[Screenshot: GPU memory usage mid-run, 2020-11-19 1:51 PM]

As a result, I think this PR poses limited risk. When enabled, JIT unspilling provides significant performance gains but consistently increases memory pressure (and usually OOMs in these 8-GPU tests). If the memory increases could be avoided, I think it would be worth enabling this by default.

# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201119-jit:
cudf                      0.17.0a201119   cuda_10.2_py37_g1a80df96c4_285    rapidsai-nightly
cuml                      0.17.0a201119   cuda10.2_py37_gb205e8fd0_128    rapidsai-nightly
dask                      2.30.0+66.g439c4ab2          pypi_0    pypi
dask-cuda                 0.8.0a0+693.g7a73f35          pypi_0    pypi
dask-cudf                 0.17.0a201119   py37_g1a80df96c4_285    rapidsai-nightly
distributed               2.31.0.dev0+39.g7e2fb2ff          pypi_0    pypi
faiss-proc                1.0.0                      cuda    rapidsai-nightly
libcudf                   0.17.0a201119   cuda10.2_g1a80df96c4_285    rapidsai-nightly
libcuml                   0.17.0a201119   cuda10.2_gb205e8fd0_128    rapidsai-nightly
libcumlprims              0.17.0a201030   cuda10.2_g1fa28a5_8    rapidsai-nightly
librmm                    0.17.0a201106   cuda10.2_gb1ac445_43    rapidsai-nightly
rmm                       0.17.0a201106   cuda_10.2_py37_gb1ac445_43    rapidsai-nightly
ucx                       1.8.1+g6b29558       cuda10.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.17.0a201119   py37_g6b29558_20    rapidsai-nightly

@jakirkham (Member) commented:

Thanks for all of your work on this Mads! And everyone for the reviews! 😄

@madsbk (Member, Author) commented Nov 20, 2020

Thanks @beckernick, I don't know exactly why we see the memory spikes, but I will make a follow-up PR to address it when I find the time :)

@madsbk merged commit 1429b67 into rapidsai:branch-0.17 on Nov 23, 2020
Linked issue: [FEA] Allow communicating spilled data
8 participants