`diffusers` relies heavily on `cpu_offload()` to implement `enable_sequential_cpu_offload()`. It keeps a model's modules on the CPU while they are idle and moves each one onto the GPU only when it is needed for computation.

The cost of these frequent transfers blocks the underlying computation, so latency increases considerably. On the other hand, offloading makes it possible to run very large models on consumer hardware, which matters a lot because a good diffusion pipeline is actually composed of multiple big models.
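To make the mechanism concrete, here is a minimal sketch of sequential offloading with forward hooks. This is a hypothetical illustration, not `cpu_offload()`'s actual implementation: each submodule's weights live on the CPU and are moved to the accelerator only for the duration of its own forward pass.

```python
import torch
import torch.nn as nn

def attach_offload_hooks(model: nn.Module, device: torch.device):
    """Keep each child module on CPU; move it to `device` only while it runs."""
    for module in model.children():
        module.to("cpu")

        def pre_hook(mod, args, dev=device):
            mod.to(dev)  # pop the weights onto the GPU just in time
            return tuple(a.to(dev) for a in args)

        def post_hook(mod, args, output):
            mod.to("cpu")  # evict the weights right after the forward
            return output

        module.register_forward_pre_hook(pre_hook)
        module.register_forward_hook(post_hook)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
attach_offload_hooks(model, device)
out = model(torch.randn(2, 8))
```

Because every transfer here is issued synchronously on the same stream as the compute, each layer's H2D copy stalls the forward pass, which is exactly the latency cost described above.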
So the question is: can we overlap communication with computation? https://gist.github.com/gau-nernst/9408e13c32d3c6e7025d92cce6cba140 implements a hook that leverages CUDA streams to provide the same functionality as `enable_sequential_cpu_offload()` but is significantly faster. See the results:
As @SunMarc and I were discussing, it would be extremely cool to have a similar hook supported in `accelerate` so that we can make diffusion models, in particular, more accessible without completely giving away speed.
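The stream-based idea can be sketched as follows. This is an illustration of the general technique, not the gist's exact code: while layer *i* computes on the default stream, layer *i + 1*'s weights are copied host-to-device on a side stream (from pinned memory, so the copy can be asynchronous), and an event makes the compute stream wait until the prefetched weights have arrived.

```python
import torch
import torch.nn as nn

def prefetch(layer, stream, device):
    """Copy a layer's weights H2D on a side stream; return an event to wait on."""
    with torch.cuda.stream(stream):
        layer.to(device, non_blocking=True)
    event = torch.cuda.Event()
    event.record(stream)
    return event

def run_with_prefetch(layers, x, device):
    if device.type != "cuda":
        # No CUDA streams available: fall back to plain sequential execution.
        for layer in layers:
            x = layer(x)
        return x

    copy_stream = torch.cuda.Stream()
    x = x.to(device)
    # Weights start on CPU in pinned memory so H2D copies can be async.
    for layer in layers:
        layer.to("cpu")
        for p in layer.parameters():
            p.data = p.data.pin_memory()

    ready = prefetch(layers[0], copy_stream, device)
    for i, layer in enumerate(layers):
        torch.cuda.current_stream().wait_event(ready)  # weights must be resident
        if i + 1 < len(layers):
            # Overlap: start copying the next layer while this one computes.
            ready = prefetch(layers[i + 1], copy_stream, device)
        x = layer(x)
        layer.to("cpu")  # evict after use
    return x

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
layers = [nn.Linear(8, 16), nn.Linear(16, 4)]
out = run_with_prefetch(layers, torch.randn(2, 8), device)
```

The key point is that the copy for layer *i + 1* and the matmul for layer *i* are enqueued on different streams, so the transfer no longer sits on the critical path; only the first layer's copy is unavoidably exposed.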
Cc: @DN6 @a-r-r-o-w as well.