Just-in-time deserialization #353

Merged: 59 commits into rapidsai:branch-0.17 on Nov 23, 2020

Conversation

@madsbk (Member) commented Aug 6, 2020

This PR implements just-in-time deserialization of device memory by wrapping DeviceHostFile items in a proxy class: ProxyObject. Deserialization of the items is then delayed until they are accessed.

Fixes #342

Ref. #57

Use

In order to enable JIT deserialization, use the new jit_unspill argument when creating LocalCUDACluster, set --enable-jit-unspill when starting a CUDAWorker, or set the environment variable DASK_JIT_UNSPILL=True.
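
A minimal sketch of the first option (only the jit_unspill keyword comes from this PR's description; the cluster/client setup is standard Dask usage):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Enable JIT deserialization (un-spilling) for all workers in the cluster
cluster = LocalCUDACluster(jit_unspill=True)
client = Client(cluster)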

TODO

  • Implement ProxyObject (should come up with a better name)
  • Add some basic tests
  • Support Dask serialization of ProxyObject in order to allow communication of spilled data
  • Support CUDA serialization of ProxyObject in order to avoid re-spilling of data in the case where ProxyObject wraps un-spilled device data.
  • Improve the transparency of ProxyObject
  • Write documentation

@codecov-commenter commented Aug 6, 2020

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.17@25327eb).
The diff coverage is n/a.

@@              Coverage Diff               @@
##             branch-0.17     #353   +/-   ##
==============================================
  Coverage               ?   58.36%           
==============================================
  Files                  ?       19           
  Lines                  ?     1561           
  Branches               ?        0           
==============================================
  Hits                   ?      911           
  Misses                 ?      650           
  Partials               ?        0           


@madsbk changed the base branch from branch-0.15 to branch-0.16 on September 11, 2020 13:44
@pentschev (Member) commented:

@madsbk so it seems that the reason it fails to import pandas is the import of dask.dataframe, but Dask isn't pulling Pandas in. Is Pandas going to become a hard dependency here? If so, we will have to add pandas >=1.0,<1.2.0dev0 in https://github.com/rapidsai/dask-cuda/blob/branch-0.16/conda/recipes/dask-cuda/meta.yaml. The version above must match that from https://github.com/rapidsai/integration/blob/branch-0.16/conda/recipes/versions.yaml#L95.

@rapidsai deleted a comment from pentschev on Sep 14, 2020
@madsbk (Member, Author) commented Sep 14, 2020

@madsbk so it seems that the reason it fails to import pandas is in importing dask.dataframe, but Dask isn't pulling Pandas. Is Pandas going to become a hard dependency here? If so, we will have to add pandas: >=1.0,<1.2.0dev0' in https://github.com/rapidsai/dask-cuda/blob/branch-0.16/conda/recipes/dask-cuda/meta.yaml .

Makes sense, thanks @pentschev for investigating this!
I have added the Pandas dependency for now.

@madsbk force-pushed the jit_deserialization branch from edf503e to 40058cd on September 14, 2020 19:10
@jakirkham (Member) commented:

Well Pandas is required by dask here. We also include pandas in the RAPIDS build environment. Maybe there's something else going on?

@pentschev (Member) left a comment:

Mads, this is some great work you put in here; it's great to see the incredible speedup you've achieved!

Functionality-wise, I don't have much to comment on; some parts are well beyond my understanding, and it seems like there was enough testing by you and others with TPCx-BB. Therefore, my comments/requests revolve around style and documentation only.

dask_cuda/proxy_object.py (resolved review thread)
        return sys.getsizeof(self._obj_pxy_deserialize())

    def __len__(self):
        return len(self._obj_pxy_deserialize())
Member:

Will this deserialize the entire object just to check its length? If so, wouldn't it make sense to store an attribute with the length during serialization and just return it to avoid this?

@madsbk (Member, Author):

Yes, currently len(x) will deserialize x. What you suggest, we are already doing with x.name, which I added because the worker will access x.name before executing tasks.
My plan is to address this issue in a follow-up PR when we have some more experience using JIT deserialization. I suspect we will need to handle a range of attributes.

Member:

Makes sense, thanks for the continued work on it.
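
A rough, hypothetical sketch of the pattern described above (_ProxySketch and _fixed_attrs are illustrative names, not the actual ProxyObject internals): cheap metadata such as name is cached at proxy creation, while any other access triggers deserialization on demand.

class _ProxySketch:
    """Hypothetical simplification: only `name` is cached eagerly."""

    def __init__(self, obj):
        # Cheap metadata recorded up front, available without unspilling
        self._fixed_attrs = {"name": getattr(obj, "name", None)}
        self._spilled = obj  # stand-in for the serialized/spilled payload

    def _obj_pxy_deserialize(self):
        # A real implementation would unspill/deserialize here
        return self._spilled

    def __getattr__(self, item):
        if item in self._fixed_attrs:
            return self._fixed_attrs[item]  # no deserialization needed
        return getattr(self._obj_pxy_deserialize(), item)  # JIT unspill

    def __len__(self):
        return len(self._obj_pxy_deserialize())  # still unspills, as discussed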

        return complex(self._obj_pxy_deserialize())

    def __index__(self):
        return operator.index(self._obj_pxy_deserialize())
Member:

Are all these dunder method implementations here because they are going to be needed in practice, or are they here just to match a full implementation?

@madsbk (Member, Author) commented Nov 12, 2020:

Some are needed for ProxyObject to pass through cuDF and TPCx-BB workflows, but many of them are just here for completeness.

Member:

I see. Given they're already there I won't suggest we remove them, but for future reference I think we can try to keep the code shorter instead of solving all possible cases that are probably unnecessary; this helps with future maintainability.

@madsbk (Member, Author):

I agree, but since ProxyObject is exposed to the end user, I think it is reasonable to support the most common operations. Say a user writes a Dask task that uses NumPy arrays: the user should be able to use most, if not all, NumPy operations.

"subclass": subclass,
"serializers": serializers,
}
self._obj_pxy_lock = threading.RLock()
Member:

I'm curious, why isn't a simple threading.Lock sufficient here?

@madsbk (Member, Author) commented Nov 12, 2020:

It is to handle all the methods that acquire the lock before calling _obj_pxy_deserialize(), which also acquires the lock.
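
A simplified sketch of the situation described above (not the actual ProxyObject code): the outer method already holds the lock when _obj_pxy_deserialize() acquires it again, which would deadlock with a plain threading.Lock but works with an RLock.

import threading

class _LockSketch:
    def __init__(self, obj):
        self._obj = obj
        self._obj_pxy_lock = threading.RLock()

    def _obj_pxy_deserialize(self):
        with self._obj_pxy_lock:  # inner acquire by the same thread
            return self._obj

    def __len__(self):
        with self._obj_pxy_lock:  # outer acquire
            return len(self._obj_pxy_deserialize())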

        List of frames that makes up the serialized object
        """
        with self._obj_pxy_lock:
            assert serializers is not None
Member:

Should we replace runtime assertions with more appropriate exceptions? For example, here I think it would make more sense to raise a ValueError.

@madsbk (Member, Author):

I am not sure; the check is there to assert the internal logic of the class. The _obj_pxy_* methods are not supposed to be called by the user.

Member:

I understand that, but nevertheless I think of assert as a debug statement, so I'm not sure it makes much sense in runtime code. I know Python doesn't distinguish between "debug" and "release" builds, like we can with C, but regardless I think it's clearer to have errors be specific. One could argue that assertions could substitute for virtually any exception we might raise, but explicit exceptions let us be clearer about our intent.

In the interest of avoiding nitpicking, I will leave the final decision up to you, but I think there's real value in being more specific instead of using assertions, whenever we want to raise an exception if something goes wrong.

@madsbk (Member, Author):

I am not sure I agree completely. I like to use assert in internal logic, both to catch bugs while developing and to document the expected state of an object.
However, in this case I agree with you: since it can be triggered by user input, it is preferable to raise a ValueError.
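
A minimal, hypothetical sketch of the agreed change (method body and message text are illustrative, not the exact code):

class _SerializeSketch:
    """Illustrative fragment: the internal assert replaced by an explicit error."""

    def _obj_pxy_serialize(self, serializers):
        if serializers is None:
            # Was: assert serializers is not None
            raise ValueError("Please specify a list of serializers")
        # ... the actual serialization would follow here
        return serializers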

dask_cuda/proxy_object.py (outdated, resolved review thread)
@beckernick (Member) commented:

We'll give this a test in a fresh environment on TPCx-BB workloads on constrained-memory systems as well.

@madsbk (Member, Author) commented Nov 13, 2020

@pentschev thanks for the review!
I have addressed all of your suggestions, and it made me realize that we need a function unproxy() to access the proxied object directly in a clean way.
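
A rough sketch of what such a helper could look like (the exact unproxy() added in this PR may differ):

def unproxy(obj):
    """Return the wrapped object if obj is a ProxyObject, otherwise obj itself."""
    try:
        # ProxyObject exposes its payload via _obj_pxy_deserialize()
        return obj._obj_pxy_deserialize()
    except AttributeError:
        return obj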

@pentschev (Member) left a comment:

Thanks @madsbk, I've suggested some styling and typo fixes; feel free to ignore those you disagree with. It's a lot of minor things, but at least GH will make it easy for you to apply them. :)

Resolved (outdated) review threads: dask_cuda/proxy_object.py (9), dask_cuda/tests/test_proxy.py (1)
@pentschev (Member) commented:

Also, I think we can leave this open until EOD or Monday so people who are interested can still review or test it before merging, but otherwise it looks good to me! Thanks again @madsbk for the work you put into this, it's really great!

Co-authored-by: Peter Andreas Entschev <[email protected]>
@madsbk (Member, Author) commented Nov 13, 2020

I've suggested some styling and typo fixes, feel free to ignore those you disagree with. It's a lot of minor things, but at least GH will make it easy for you to apply them. :)

It is all in :)

Also, I think we can leave this open until EOD or Monday so people who are interested can still review or test it before merging, but otherwise it looks good to me!

Yes, no hurry. I'd also like to wait for a verdict from @beckernick :P

@beckernick (Member) commented:

I am running this currently :)

@beckernick (Member) commented Nov 19, 2020

In a fresh environment today, I ran tpcxbb q02 several times with the following configuration (the default):

  • 8 GPUs of a DGX-2 (GPUs 0-7)
  • DEVICE_MEMORY_LIMIT="15GB"
  • POOL_SIZE="30GB"
  • TCP communication
  • Reading parquet files comprising 2GB in-memory data chunks from the local /raid of the DGX-2

By default, q02 takes ~300 seconds with the above.

With DASK_JIT_UNSPILL=False, q02 also takes ~300 seconds, as expected.

With DASK_JIT_UNSPILL=True, q02 almost always runs out of memory. When it succeeds, it takes about 100-140 seconds (depending on hot vs cold), which is a huge boost.

With DASK_JIT_UNSPILL=False, the peak memory after each time q02 finishes is the following:
[Screenshot: peak GPU memory after run 2, DASK_JIT_UNSPILL=False]

Peak memory spiked on GPU 0 due to the client process, but went back down as expected after completion. After additional runs, the memory profile looks identical.

With DASK_JIT_UNSPILL=True, the peak memory after each time q02 finishes looks different, due to the OOM. When it succeeded, we had to allocate more memory, so the active memory does not go back down to 29352 MB on the GPUs that needed more memory.

[Screenshot: peak GPU memory after run 1, DASK_JIT_UNSPILL=True, OOM]

Going down to a device memory limit of 10GB doesn't stop the OOM, though runs can also sometimes succeed. When it succeeds, we still see the huge speed boost.

Finally, when it succeeds, we don't ever need to allocate more memory for the pool. It seems like if the workload needs to allocate additional memory due to DASK_JIT_UNSPILL=True, we will end up failing. This is a GPU memory screenshot from the middle of a q02 run that failed, which was kicked off after one that succeeded:

[Screenshot: GPU memory usage mid-run, 2020-11-19 1:51 PM]

As a result, I think this PR poses limited risk. When enabled, JIT unspilling provides significant performance gains but consistently increases memory pressure (and usually OOMs in these 8-GPU tests). If the memory increases could be avoided, I think it would be worth enabling this by default.

# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201119-jit:
cudf                      0.17.0a201119   cuda_10.2_py37_g1a80df96c4_285    rapidsai-nightly
cuml                      0.17.0a201119   cuda10.2_py37_gb205e8fd0_128    rapidsai-nightly
dask                      2.30.0+66.g439c4ab2          pypi_0    pypi
dask-cuda                 0.8.0a0+693.g7a73f35          pypi_0    pypi
dask-cudf                 0.17.0a201119   py37_g1a80df96c4_285    rapidsai-nightly
distributed               2.31.0.dev0+39.g7e2fb2ff          pypi_0    pypi
faiss-proc                1.0.0                      cuda    rapidsai-nightly
libcudf                   0.17.0a201119   cuda10.2_g1a80df96c4_285    rapidsai-nightly
libcuml                   0.17.0a201119   cuda10.2_gb205e8fd0_128    rapidsai-nightly
libcumlprims              0.17.0a201030   cuda10.2_g1fa28a5_8    rapidsai-nightly
librmm                    0.17.0a201106   cuda10.2_gb1ac445_43    rapidsai-nightly
rmm                       0.17.0a201106   cuda_10.2_py37_gb1ac445_43    rapidsai-nightly
ucx                       1.8.1+g6b29558       cuda10.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.17.0a201119   py37_g6b29558_20    rapidsai-nightly

@jakirkham (Member) commented:

Thanks for all of your work on this Mads! And everyone for the reviews! 😄

@madsbk (Member, Author) commented Nov 20, 2020

Thanks @beckernick, I don't know exactly why we see the memory spikes, but I will make a follow-up PR to address it when I find the time :)

@madsbk merged commit 1429b67 into rapidsai:branch-0.17 on Nov 23, 2020
Linked issue: [FEA] Allow communicating spilled data
8 participants