sliding window support for animatediff vid2vid pipeline #6521
Conversation
Before writing any code, I wanted to clarify the requirements, as my understanding of the sliding window technique in this context isn't clear. I've looked at a few implementations in:
Questions:
cc @rmasiso since they were looking into it too here. I will be referring to this implementation in this comment; other implementations are the same or similar. The overall idea is to accumulate all generated samples and then average them by dividing by the number of times each frame latent was processed. Different frames can be processed a different number of times because of how the voodoo-magic context_scheduler function works (it finally became understandable at my fourth glance).
I believe stride is necessary as it allows frames that are farther apart to remain temporally consistent. The code being referred to applies strides as powers of two, i.e. 2^(i - 1), I think. That is,
We don't pick any specific generated latent for each frame; instead, we accumulate all latents for every frame and take the average per frame. From my testing with the original code by ashen-sensored, this results in better generations than just taking any specific generation for a frame. The last sampled latent for each frame is also almost good enough (there is some jumpiness/flickering), but averaging works better. The high-level idea is mostly correct. Let's take a smaller example and understand what happens (I'm using num_frames=8, context_size=4 (aka max_motion_seq_length in config.json), overlap=2 and stride=2):

```python
latents = ...  # tensor of shape (batch_size, num_latent_channels, num_frames, height, width)
latents_accumulated = ...
count_num_process_times = [0] * num_frames

for context_indices in [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 0, 1], [0, 2, 4, 6]]:
    current_latents = latents[context_indices]
    processed_latents = process_animatediff(latents)
    latents_accumulated[context_indices] += processed_latents
    count_num_process_times[context_indices] += 1

final_latents = latents_accumulated / count_num_process_times
```

Notice there is a cyclic dependency between the [6, 7, 0, 1] frames. This can lead to some loss in quality, not too sure... but I've read that it could be bad, and it makes sense intuitively - why should later frames affect earlier ones? The linked code also looks really confusing and can be simplified into something that more people can easily understand at first glance by adding one or two for-loops (to handle stride without ordered_halving or other tricks) and good variable naming.
context_length would just be motion_adapter.config.max_motion_seq_length from config.json, if I understand correctly. I think what the diffusers team would like to have would be methods that can enable/disable long context generation and the
cc @DN6 @sayakpaul
Yeah, your understanding is correct. However, I will let @DN6 comment on it.
thanks for the really helpful context @a-r-r-o-w.
latents will just be some random tensor (for txt2vid) or image/video-encoded latents (for img2vid/vid2vid) of shape (batch_size, num_latent_channels, num_frames, height, width).
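As a minimal sketch of what that could look like (shapes follow the earlier example; `vae` and `frames` are placeholders here, so the vid2vid branch is commented out):

```python
import torch

batch_size, num_latent_channels, num_frames, height, width = 1, 4, 16, 64, 64

# txt2vid: start from pure noise
latents = torch.randn(batch_size, num_latent_channels, num_frames, height, width)

# vid2vid: encode the input frames with the VAE instead (and then add noise per the scheduler)
# frames: (batch_size, num_frames, 3, 8 * height, 8 * width), values in [-1, 1]
# latents = vae.encode(frames.flatten(0, 1)).latent_dist.sample() * vae.config.scaling_factor
# latents = latents.reshape(batch_size, num_frames, num_latent_channels, height, width).permute(0, 2, 1, 3, 4)
```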
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Cc @DN6
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
The variable passed to `process_animatediff` should be `current_latents` rather than `latents`, right?
@JosefKuchar My bad, typo. That is correct
Is this ready for a review?
Hi all. I took a look into some of the different approaches for longer video generation with AnimateDiff. The ones available in AnimateDiff-Evolved use the approach being discussed here: a sliding window, while averaging out the latents of overlapping frames. This seems to be inspired by the MultiDiffusion approach for generating panoramic images: diffusers/src/diffusers/pipelines/stable_diffusion_panorama/pipeline_stable_diffusion_panorama.py Lines 763 to 772 in 66f94ea
Except we apply it temporally rather than spatially. Another approach is FreeNoise, which also uses a sliding window, but applies it in the layers of the motion modules. FreeNoise seems like a more principled approach and avoids relying on a "magic" context scheduler. I haven't compared the quality of FreeNoise vs the context scheduler though. Additionally, the context scheduler approach can theoretically handle an infinitely long video sequence: a very long sequence of latents can be held in RAM, and only the context latents, with a fixed length, go through the forward pass of the model. With FreeNoise, breaking the long sequence up into context latents only happens in the motion modules, so the other UNet layers have to deal with the longer sequence. We could do some work on the UNet Motion blocks to enable chunked inference, e.g. something similar to
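To make the contrast concrete, here is a rough sketch of window-in-the-motion-module processing (this is not the actual FreeNoise code; `attn` stands in for a temporal attention layer, and plain averaging is used instead of FreeNoise's weighted fusion):

```python
import torch

def windowed_temporal_attention(hidden_states, attn, window_size=16, stride=4):
    # hidden_states: (batch * height * width, num_frames, channels) — the layout a
    # temporal attention layer in a motion module sees. Only short windows of frames
    # go through attention; overlapping outputs are averaged per frame.
    num_frames = hidden_states.shape[1]
    output = torch.zeros_like(hidden_states)
    counts = torch.zeros(num_frames, device=hidden_states.device)

    starts = list(range(0, max(num_frames - window_size, 0) + 1, stride))
    if starts[-1] + window_size < num_frames:
        starts.append(num_frames - window_size)  # make sure the tail frames are covered

    for start in starts:
        window = slice(start, start + window_size)
        output[:, window] += attn(hidden_states[:, window])
        counts[window] += 1
    return output / counts[None, :, None]
```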
Any update on the progress of this?
A better approach is to also go through the PRs sometimes to get a head start if a feature is already in motion to be shipped :-)
If @DN6 is busy with other things and does not have the bandwidth for this right now, I'll be happy to eventually pick this up on a weekend when I find time. But please feel free to open a PR if you'd like to take this up. AFAICT, there will not be too many code changes, and most of it can be adapted directly from their repo. It'd be preferable to have enable_() and disable_() methods for doing it. It's also something the community has been using for a while, so I think there should be discussions or improved implementations for this that you could try looking for.
Would like to work on this, but unfortunately blocked by #7378 (comment). I believe that in order to be able to work on this, I'd need to be able to decode + encode the latents independently.
Would love to see a FreeNoise implementation. I've hacked together an implementation for basic chunking above, but the results are not that great (only the OpenPose ControlNet, no vid2vid). Unfortunately I don't have the skills to port the original FreeNoise AnimateDiff implementation (changes here: arthur-qiu/FreeNoise-AnimateDiff@9abf5ed) to diffusers.
Ok, so I was able to port the FreeNoise AnimateDiff code to diffusers - results below (128 frames). Prompt: "Animated man in a suit on a beach", using the community AnimateDiff ControlNet pipeline with the AnimateDiff-Lightning 4-step version (5-step inference). result128.mp4 conditioning128.mp4
@JosefKuchar Not a maintainer here but I'd say please go for the PR ❤️ Supporting long context generation has been available in Comfy and A1111 for a long time, and it has been on our mind to add support for this within diffusers for many months now. Thank you so much for taking the initiative! The community has been generating short films using these methods with the best models out there and has nailed down many tricks for consistent and high-quality generation; something we could definitely write guides about (perhaps @asomoza would be a great help for this). I'm happy to help resolve any conflicts that may come up with supporting both FreeInit and FreeNoise. Chunked UNet inference could be a separate thing to look at in the near future, yep.
Hi all. Really nice to see the initiative here! I'll have bandwidth to take this up next week. @JosefKuchar since you've already started on FreeNoise, I'll leave you to it and look into sliding window. I'll probably just follow this reference. I believe @a-r-r-o-w had included it when originally proposing the AnimateDiff PR, but we weren't 100% sure about adding something that wasn't fully understood at the time. @JosefKuchar For chunking, you would need to look at the Resnet and Attention blocks in the MotionBlocks here. An example of chunking logic can be found here and here. If you feel it's a bit much to handle all at once, feel free to open a PR with just FreeNoise as is and we can work on chunking in a follow-up.
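As a rough sketch of that kind of chunking (purely illustrative; `block` stands in for a spatial Resnet/Attention block that treats frames as batch entries):

```python
import torch

def forward_in_frame_chunks(block, hidden_states, chunk_size=16):
    # hidden_states: (batch * num_frames, channels, height, width) — the layout the
    # spatial Resnet/Attention blocks see. Running the block chunk-by-chunk keeps
    # peak memory bounded even for very long frame sequences.
    outputs = [block(chunk) for chunk in hidden_states.split(chunk_size, dim=0)]
    return torch.cat(outputs, dim=0)
```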
New relevant work regarding long context generation: https://github.com/TMElyralab/MuseV/. Thought it might be interesting to share here since we're looking at similar things
Hi all. I have to prioritise some other work at the moment so will have to pause on working on sliding window for now. Will try to pick it up later, but if anyone wants to take a shot at it, feel free to do so and tag me for a review.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Cc: @a-r-r-o-w @DN6
I think this can be closed in favor of FreeNoise (#8948), which is a better method than sliding windows at maintaining quality. We could also explore tuning-free methods like FreeLong in the future for video extension. We can revisit these methods with modular diffusers (possibly as community add-ons), since that would not require monkey-patching our pipelines.
What does this PR do?
Who can review?
@a-r-r-o-w