`diffusers` relies heavily on `cpu_offload()` to implement `enable_sequential_cpu_offload()`. It keeps a model's modules on the CPU while they are idle and moves each one onto the GPU only when it is needed for computation.

The cost of these frequent transfers blocks the underlying computation, so latency increases considerably. On the other hand, offloading makes it possible to run very large models on consumer hardware, which matters a lot because a good diffusion pipeline is actually composed of multiple big models.
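To make the mechanism concrete, here is a minimal sketch of sequential offloading with forward hooks. This is a hypothetical illustration, not `cpu_offload()`'s actual implementation: each submodule's weights live on the CPU and are moved to the accelerator only for the duration of its own forward pass.

```python
import torch
import torch.nn as nn

def attach_offload_hooks(model: nn.Module, device: torch.device):
    """Keep each child module on CPU; move it to `device` only while it runs."""
    for module in model.children():
        module.to("cpu")

        def pre_hook(mod, args, dev=device):
            mod.to(dev)  # pop the weights onto the GPU just in time
            return tuple(a.to(dev) for a in args)

        def post_hook(mod, args, output):
            mod.to("cpu")  # evict the weights right after the forward
            return output

        module.register_forward_pre_hook(pre_hook)
        module.register_forward_hook(post_hook)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
attach_offload_hooks(model, device)
out = model(torch.randn(2, 8))
```

Because every transfer here is issued synchronously on the same stream as the compute, each layer's H2D copy stalls the forward pass, which is exactly the latency cost described above.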
So the question is: can we overlap communication with computation? https://gist.github.com/gau-nernst/9408e13c32d3c6e7025d92cce6cba140 implements a hook that leverages CUDA streams to provide the same functionality as `enable_sequential_cpu_offload()` but is significantly faster. See the results:
As @SunMarc and I were discussing, it would be extremely cool to have a similar hook supported in `accelerate` so that we can make diffusion models, in particular, more accessible without completely giving away speed.
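The stream-based idea can be sketched as follows. This is an illustration of the general technique, not the gist's exact code: while layer *i* computes on the default stream, layer *i + 1*'s weights are copied host-to-device on a side stream (from pinned memory, so the copy can be asynchronous), and an event makes the compute stream wait until the prefetched weights have arrived.

```python
import torch
import torch.nn as nn

def prefetch(layer, stream, device):
    """Copy a layer's weights H2D on a side stream; return an event to wait on."""
    with torch.cuda.stream(stream):
        layer.to(device, non_blocking=True)
    event = torch.cuda.Event()
    event.record(stream)
    return event

def run_with_prefetch(layers, x, device):
    if device.type != "cuda":
        # No CUDA streams available: fall back to plain sequential execution.
        for layer in layers:
            x = layer(x)
        return x

    copy_stream = torch.cuda.Stream()
    x = x.to(device)
    # Weights start on CPU in pinned memory so H2D copies can be async.
    for layer in layers:
        layer.to("cpu")
        for p in layer.parameters():
            p.data = p.data.pin_memory()

    ready = prefetch(layers[0], copy_stream, device)
    for i, layer in enumerate(layers):
        torch.cuda.current_stream().wait_event(ready)  # weights must be resident
        if i + 1 < len(layers):
            # Overlap: start copying the next layer while this one computes.
            ready = prefetch(layers[i + 1], copy_stream, device)
        x = layer(x)
        layer.to("cpu")  # evict after use
    return x

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
layers = [nn.Linear(8, 16), nn.Linear(16, 4)]
out = run_with_prefetch(layers, torch.randn(2, 8), device)
```

The key point is that the copy for layer *i + 1* and the matmul for layer *i* are enqueued on different streams, so the transfer no longer sits on the critical path; only the first layer's copy is unavoidably exposed.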
Cc: @DN6 @a-r-r-o-w as well.