CUDA graphs #33
Hello! I had a look at this: it seems to require discarding most Python data types (like dicts, etc.). Currently this is what we use to pass input data to models (it's a bit tidier than specifying things explicitly). I don't think there are restrictions or issues with doing this; it just requires some work. But from my understanding, this provides speed-ups only if a model is limited by kernel launch time, which most of the models are not (they are limited by tensor product operations; I think only ANI is launch-bound). I have not done a proper benchmark, though. I'm not opposed to this, but I don't think I have the bandwidth to do it currently. If you (or anyone, really) would like to submit a PR, I'm happy to review and go on from there! Best,
Where are you seeing that you can't use dicts? CUDA graphs are implemented at the level of the driver. All they care about is the sequence of CUDA kernels that get executed and the inputs that are passed to them. That's both a strength (they don't care what the Python code that launches the kernels does) and a weakness (they can't capture any of the logic from the Python code). See https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs.
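To make the capture/replay model concrete, here is a minimal sketch using PyTorch's torch.cuda.graph API (not from the original thread; it assumes a CUDA device, and the tensor names are made up):

```python
import torch

# Static input/output tensors: a captured graph always reads from and
# writes to the same memory, so inputs are refreshed with copy_().
static_x = torch.randn(16, device="cuda")

# Warm up on a side stream first, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_y = static_x * 2 + 1
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = static_x * 2 + 1  # kernels are recorded here, not run

# Replay re-executes the recorded kernel sequence on the current
# contents of static_x, with no Python-side launch overhead.
static_x.copy_(torch.randn(16, device="cuda"))
g.replay()
torch.cuda.synchronize()
print(static_y)
```

Note that none of the Python logic around the kernel launches is captured, only the kernels themselves.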
Hello! Sorry for the radio silence. To start checking what needs to change in the code, you can try this script. It gets a batch from a featurised dataset (QM9), instantiates a model (EGNN), and then tries to make a CUDA graph. It currently fails at converting a tensor to an int (the number of graphs in the batch).
Let me know if this works for you! Best,
The first error we get is the tensor-to-int conversion: apparently it doesn't allow converting a tensor to a Python int inside a graph. The int gets passed on to later operations; I tried hardcoding it, but that led to a different error.
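A hypothetical sketch of that first failure mode (names made up, not the original script): calling .item() forces a device-to-host sync, which is not permitted while the current stream is capturing, so capture aborts:

```python
import torch

batch = torch.zeros(32, dtype=torch.long, device="cuda")

# Warm up the reduction kernel outside of capture.
_ = batch.max()
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
try:
    with torch.cuda.graph(g):
        # .item() synchronizes to move the value to the host, which is
        # illegal during stream capture.
        num_graphs = int(batch.max().item() + 1)
except RuntimeError as err:
    print("capture failed:", err)
```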
I'm not sure what's causing that. pytorch/pytorch#15623 discusses problems in PyTorch that caused graph capture to fail for backward passes, but they were supposedly fixed a few years ago. I found lots of other reports of that error message, but none that seemed relevant. It seems to be an error that can be caused by a lot of things.
Yeah, that's the one I saw too. The conversion to an int is required for TorchScript, and there is a further restriction along the same lines. So unless we figure out how to cast this tensor to an int, it seems like TorchScript and CUDA graphs are incompatible?
Does the following approach help?

```python
from warnings import warn

from torch.cuda import is_current_stream_capturing

# Outside of capture, compute the number of segments from the batch tensor
# and cache it on the module; during capture, reuse the cached value
# instead of calling .item(), which is forbidden while capturing.
is_capturing = x.is_cuda and is_current_stream_capturing()
if not x.is_cuda or not is_capturing:
    self.dim_size = int(batch.max().item() + 1)
if is_capturing:
    assert (
        self.dim_size > 0
    ), "Warming up is needed before capturing the model into a CUDA graph"
    warn(
        "CUDA graph capture will lock the batch to the current number of samples ({}). Changing this will result in a crash".format(
            self.dim_size
        )
    )
```

It requires that the batch size be constant over all batches. It retrieves the batch size during the warmup steps, then uses that value during graph capture instead of trying to look it up again. Could PhysicsML do the same thing? It would also be fine with me if it only supported CUDA graphs during inference, not training, and only for working with a single fixed molecule. My main concern is to speed up running simulations, not to speed up training.
It doesn't necessarily correspond to the batch size (again, because the batch tensor might not include the nodes of all the graphs), but we might be able to do something similar. I also looked at the previous error you got, and I think I might know what is happening. Here is a minimal failing example:
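(The original snippet was not preserved here; the following is a hypothetical reconstruction of the pattern being described, with made-up names and keys: the function adds its output tensor to the input dict in place.)

```python
# Reportedly breaks graph capture: the input dict is mutated in place.
def forward(data: dict) -> dict:
    data["y"] = data["x"] * 2  # adds a tensor to the argument itself
    return data
```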
But if you try something like the following instead,
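(again a hypothetical sketch of the working variant, with the same made-up names)

```python
# Builds a fresh dict instead of mutating the argument; only references
# to the existing tensors are copied, not the tensors themselves.
def forward(data: dict) -> dict:
    return {**data, "y": data["x"] * 2}
```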
then it works. It seems that CUDA graphs don't like expanding dictionaries (or objects in general, I guess). We would have to go through all instances of adding tensors to dicts and modify them in this way (I'm not sure how much overhead that would add).
Possibly what it objects to is that you're modifying an input argument? It wants Python objects passed as arguments to be immutable, but it's fine with creating and modifying new objects within the method. This doesn't seem like it should add significant overhead: you aren't copying the tensors, just putting references to the existing ones in a new dict. You could write it as, e.g., `data = {**data, "y": y}`.
Would it be possible to make PhysicsML support CUDA graphs? In some cases they can provide a large speed improvement.
The main requirements to be graph compatible are that tensor shapes have to be static, and that there is no dynamic control flow. These are also similar to the requirements for torch.export(), which will be an important mechanism to support once the API stabilizes.
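As an illustration (not from the original issue), here is a minimal torch.export sketch on a made-up toy module; unless dynamic shapes are declared explicitly, the exported program is specialized to the example input shapes, mirroring the static-shape requirement above:

```python
import torch

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.lin(x).sum()

# Shapes are taken from the example inputs and baked into the program.
ep = torch.export.export(Toy(), (torch.randn(4, 8),))
print(ep)
```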