[refactor] FreeInit #6644
Conversation
@DN6 @sayakpaul In light of #6571, I propose moving FreeInit and future quality/efficiency-related utilities into separate mixins (Sayak said something similar in his review comment on #6571). As this technique can be applied to any video pipeline for better temporal consistency (at the cost of a few more sampling iterations), I think it would be good to add support to the major ones. Moving it into a separate mixin would greatly reduce the number of copied lines; the only changes required would be abstracting the denoise loop and adding an if-statement. What do you think?
@a-r-r-o-w Works for me. You can define the mixin inside the animatediff pipeline module.
@DN6 What do you think about the refactor? To use FreeInit in other pipelines, the following changes would be required:
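Roughly, something like the following. This is a minimal sketch only; names such as FreeInitMixin, _apply_free_init, and _denoise are illustrative assumptions rather than the actual diffusers API:

import torch

class FreeInitMixin:
    """Illustrative mixin holding FreeInit state and the noise-reinitialization step."""

    def enable_free_init(self, num_iters: int = 3):
        self._free_init_num_iters = num_iters
        self._free_init_enabled = True

    def disable_free_init(self):
        self._free_init_enabled = False

    def _apply_free_init(self, latents: torch.Tensor, generator=None) -> torch.Tensor:
        # The frequency-domain mixing of re-diffused latents with fresh noise
        # (the mathematical part of FreeInit) would live here.
        raise NotImplementedError

Each video pipeline would then factor its denoising loop into a helper (here called _denoise) and wrap it in an outer loop guarded by an if-statement:

for it in range(self._free_init_num_iters if self._free_init_enabled else 1):
    if self._free_init_enabled and it > 0:
        latents = self._apply_free_init(latents, generator=generator)
    latents = self._denoise(latents, ...)  # the existing, abstracted denoise loop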
IMO we don't need to introduce a mixin that is used by just one model/pipeline. cc @patrickvonplaten here
The reason for adding the mixin is that the technique can be applied to AnimateDiff, TextToVideoSynth, ModelScope, and possibly other video models that are currently or will be supported by diffusers. At the cost of a few extra steps, FreeInit allows much more temporally coherent videos to be generated. I've been using it in ComfyUI for a while now, and I believe many others are too. Combined with sliding-window context support for AnimateDiff, you can generate really high-quality, temporally coherent, long videos without random jumpiness. The other way to add it to different pipelines, if it's okay on the diffusers team's end, would be to copy-paste the FreeInit-related logic, but IMO that would lead to a lot of code duplication (~400 lines per pipeline), which seems quite unnecessary.
@DN6
So FreeInit can be used with the following models that we're about to add (they're all based on AnimateDiff): #6289, #6328, and #6698. I haven't checked the results for T2VidSynth, but the results for AnimateDiff-based models are quite good. Here's an example of PIA generation with (left) and without (right). I believe I2VGenXL (#6665) is based on
does it make sense to make separate pipelines for freeinit? the denoising loop is different, no? this pipeline is a lot harder to read now compared to the last time I looked at it, before freeinit.
I'm also not a big fan of the pipeline becoming much less readable, but having a whole new pipeline might also be quite bad for discoverability. I think keeping it in the same pipeline is still the better of the (not so good) options. Is there any way we can keep it in the same pipeline while also keeping the code easy to understand / clean?
Agreed. As long as it's only in one pipeline, there's no need for a mixin. I'd say we should only do mixins if it applies to 5+ pipelines (or if we're sure it applies to every video pipeline).
I personally don't think it would be a good decision to maintain separate pipelines, as FreeInit is a technique/feature just like FreeU. It is quite general and can be applied to multiple video generation strategies. I can definitely vouch for FreeInit given the quality improvement it provides. At the moment, we can apply this technique directly to AnimateDiff (t2v, i2v, v2v, controlnet, PIA) and TextToVideoSynth. I haven't tested it with SVD yet.
I think we need to try and refactor this so all the mathematical parts are abstracted away and we are left with just readable code. There is not much to FreeInit:
- an outer loop that repeats the denoising process a few times, re-diffusing the result each iteration
- construction of spatio-temporal low-pass frequency filters
- noise reinitialization, which mixes the low-frequency component of the latents with the high-frequency component of freshly sampled noise
The latter two can easily be moved into a mixin that no one really needs to look into unless they want to understand FreeInit. What's left is finding a good way to incorporate the extra loop, which needs access to a few denoising arguments.
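To make the mathematical parts concrete, here is a minimal sketch of the frequency filtering and noise reinitialization, assuming a (batch, channels, frames, height, width) latent layout and a Gaussian low-pass filter (one of the filter choices described in the FreeInit paper); the function names are illustrative, not diffusers API:

import torch

def gaussian_low_pass_mask(shape, d_s=0.25, d_t=0.25):
    # Gaussian low-pass mask over the (frames, height, width) frequency axes,
    # equal to 1.0 at the center of the (shifted) spectrum and decaying outward.
    num_frames, height, width = shape[-3], shape[-2], shape[-1]
    f = torch.linspace(-1, 1, num_frames).view(num_frames, 1, 1)
    h = torch.linspace(-1, 1, height).view(1, height, 1)
    w = torch.linspace(-1, 1, width).view(1, 1, width)
    d2 = (f / d_t) ** 2 + (h / d_s) ** 2 + (w / d_s) ** 2
    return torch.exp(-0.5 * d2)

def reinitialize_noise(latents, noise, mask):
    # Keep the low frequencies of the (re-diffused) latents and take the
    # high frequencies from freshly sampled noise.
    dims = (-3, -2, -1)
    latents_freq = torch.fft.fftshift(torch.fft.fftn(latents, dim=dims), dim=dims)
    noise_freq = torch.fft.fftshift(torch.fft.fftn(noise, dim=dims), dim=dims)
    mixed = latents_freq * mask + noise_freq * (1 - mask)
    mixed = torch.fft.ifftshift(mixed, dim=dims)
    return torch.fft.ifftn(mixed, dim=dims).real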
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@yiyixuxu @patrickvonplaten Is the readability issue mainly due to the fact that there are two denoising loops that can run within the pipeline? I do think it's worth exploring a bit how FreeInit can be incorporated here because the quality gains are definitely noticeable.
personally, I found our pipelines to be very readable because you can always get a good idea of what's going on by quickly reading through the steps in the pipeline; that is just not the case here, though. I think it would be ok to have a nested denoising loop structure inside the pipeline itself
also, how much is freeinit used? i.e., it's slower, but would more people rather use it than go without it?
Would something like this work?

num_free_init_iters = self._free_init_num_iters if self.free_init_enabled else 1
for i in range(num_free_init_iters):
    if self.free_init_enabled:
        latents, num_warmup_steps = self.apply_free_init(...)
    with self.progress_bar(total=num_inference_steps) as progress_bar:
        ...
        # (our regular denoising loop)
        ...
This is how I had it originally in the PR that added FreeInit support, but we didn't like the idea at the time. I feel that nesting the loop like this might limit how changes/additions are made in the future. Hmm, I will think of a better way to refactor this.
I think @yiyixuxu's primary concern here is the readability of the video pipelines. I think we can revisit the idea of the nested loop. I opened PR #6874 with a proposal that takes into account both your concerns, @a-r-r-o-w and @yiyixuxu. LMK what you think.
In light of #6874, we can close this, as it mostly solves the problem. We can revisit the refactoring of the outer loop introduced by FreeInit if we have future changes that build on something similar, as mentioned by Yiyi in #6874's review. Thanks!
What does this PR do?
Removes the additional generator that was added for FreeInit and reduces the LoC a bit. I figured the second generator wasn't required and that you can still get consistent generation by just using the one passed for inference. Results are the same as before.
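As a minimal illustration of the point (arbitrary shapes and seed, not taken from the PR): fresh noise for each FreeInit iteration can be drawn from the same torch.Generator that was passed for inference, so runs stay reproducible without a dedicated second generator.

import torch

generator = torch.Generator(device="cpu").manual_seed(42)

# Initial latents, as prepared for inference
# (example shape: batch, channels, frames, height, width).
latents = torch.randn((1, 4, 16, 64, 64), generator=generator)

# Later, inside a FreeInit iteration: the same generator yields the fresh
# high-frequency noise, so the whole run stays deterministic for a given seed.
fresh_noise = torch.randn(latents.shape, generator=generator)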
Who can review?
@DN6