
sliding window support for animatediff vid2vid pipeline #6521

Closed
wants to merge 1 commit

Conversation

skunkwerk

What does this PR do?

  • adds support for sliding window contexts to the animatediff video2video pipeline

Before submitting

Who can review?

@a-r-r-o-w

@skunkwerk
Author

skunkwerk commented Jan 10, 2024

Before writing any code, I wanted to clarify the requirements, as my understanding of the sliding window technique in this context isn't clear.

I've looked at a few implementations in:

Questions:

  • Is it necessary to support stride lengths? That seems to only be required if we're going to do frame interpolation later, correct?
  • Let's say we use a context length of 16, with an overlap of 4 frames, and a total input video length of 60 frames. Is this the high-level pseudocode?
  1. Generate a list of lists, with each inner list holding 16 frames and overlapping the previous inner list by 4 frames. The first inner list will be frames with indices [0...15], the second will be [12...15] (overlapping) plus [16...27], etc.
  2. Iterate through each inner list and call the vid2vid pipeline to generate the resulting frames.
  3. Collate the final video by taking all the unique frames generated (the overlapping frames will be generated twice - do we just pick any result for the overlapping frames, do we have to combine the results somehow, or are we supposed to do 2 passes over the overlapping frames, with the output of the first pass feeding in as an input to the second pass?).
  • Depending on the motion model, we should have different defaults for the context length, right (16 for SD1.5-based models, 32 for AnimateDiffXL)? Should I just do some introspection to get the motion model and have a mapping dictionary?

@skunkwerk skunkwerk mentioned this pull request Jan 10, 2024
@a-r-r-o-w
Member

a-r-r-o-w commented Jan 10, 2024

cc @rmasiso since they were looking into it too here.

I will be referring to this implementation in this comment. Other implementations are the same or similar. The overall idea is to accumulate all generated samples and then average them, dividing by the number of times each frame latent was processed. Different frames can be processed a different number of times due to how the voodoo-magic context_scheduler function works (it is finally understandable at my fourth glance).

is it necessary to support stride lengths? that seems to only be required if we're going to later do frame interpolation, correct?

I believe stride is necessary, as it allows frames that are farther apart to remain temporally consistent. The code being referred to applies stride values as powers of two, i.e. 2^(i - 1), I think (a small sketch of the resulting windows follows the list below). That is,

  • if stride is 1 (context_step allowed to be [1]), we get something like: [0, 1, 2, 3, 4, 5, 6, 7]
  • if stride is 2 (context_step allowed to be [1, 2]), we get the above as well as something like: [0, 2, 4, 6, 8, 10, 12, 14] (this would improve temporal consistency between these frames)
  • if stride is 3 (context_step allowed to be [1, 2, 4]), we get both of the above as well as something like: [0, 4, 8, 12, 16, ...]

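To make the windowing concrete, here is a minimal, self-contained sketch (not the referenced context_scheduler; the function name and signature are made up for illustration) of how overlapping, strided context windows could be built:

def sliding_windows(num_frames, context_length, overlap, stride=1):
    """Return lists of frame indices: windows of `context_length` frames spaced
    `stride` apart, sharing `overlap` frames between consecutive windows.
    Assumes num_frames >= (context_length - 1) * stride + 1; the last window is
    clamped so the final frames are always covered."""
    span = (context_length - 1) * stride + 1   # raw frames covered by one window
    step = (context_length - overlap) * stride
    windows, start = [], 0
    while True:
        end = min(start + span, num_frames)
        windows.append(list(range(end - span, end, stride)))
        if start + span >= num_frames:
            break
        start += step
    return windows

# stride 1 on the 60-frame example: [[0..15], [12..27], [24..39], [36..51], [44..59]]
print(sliding_windows(num_frames=60, context_length=16, overlap=4, stride=1))
# stride 2: same idea, but indices within each window are 2 apart
print(sliding_windows(num_frames=60, context_length=16, overlap=4, stride=2))
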
let's say we use a context length of 16, with an overlap of 4 frames, and a total input video length of 60 frames. Is this the high-level pseudocode?

do we just pick any result from the overlapping frames, or do we have to combine the results somehow, or are we supposed to do 2 passes with the overlapping frames,

We don't pick any specific generated latent for each frame; instead, we accumulate all latents for every frame and take the average per frame. From my testing with the original code by ashen-sensored, this results in better generations than just taking any single generation for a frame. The last sampled latent for each frame is almost good enough (there is some jumpiness/flickering), but averaging works better.

The high-level idea is mostly correct. Let's take a smaller example and understand what happens: (I'm using num_frames=8, context_size=2 (aka max_motion_seq_length in config.json), overlap=2 and stride=2)

latents = ... # tensor of shape (batch_size, num_latent_channels, num_frames, height, width)
latents_accumulated = ...
count_num_process_times = [0] * num_frames

for context_indices in [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 0, 1], [0, 2, 4, 6]]:
    current_latents = latents[context_indices]
    processed_latents = process_animatediff(latents)
    latents_accumulated[context_indices] += processed_latents
    count_num_process_times[context_indices] += 1

final_latents = latents_accumulated / count_num_process_times

Notice there is a cyclic dependency between the [6, 7, 0, 1] frames. This could lead to some loss in quality; I'm not too sure, but I've read that it can be bad, and it makes sense intuitively: why should later frames affect earlier ones? The linked code also looks really confusing and could be simplified into something more people can easily understand at first glance, by adding one or two for-loops (to handle stride without ordered_halving or other tricks) and good variable naming.
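
For what it's worth, a simplified, runnable version of that accumulate-and-average loop could look like the following sketch; denoise_window is just a placeholder (identity) standing in for one UNet denoising pass, and the windows are the ones from the example above:

import torch

def denoise_window(window_latents):
    # Placeholder for one AnimateDiff UNet denoising pass over a window of
    # frame latents; identity here so the sketch runs standalone.
    return window_latents

batch_size, num_channels, num_frames, height, width = 1, 4, 8, 32, 32
latents = torch.randn(batch_size, num_channels, num_frames, height, width)

accumulated = torch.zeros_like(latents)
counts = torch.zeros(num_frames)

for window in [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 0, 1], [0, 2, 4, 6]]:
    idx = torch.tensor(window)
    window_latents = latents.index_select(dim=2, index=idx)   # gather along the frame axis
    processed = denoise_window(window_latents)
    accumulated.index_add_(2, idx, processed)                 # accumulate results per frame
    counts.index_add_(0, idx, torch.ones(len(window)))        # how often each frame was processed

# average every frame latent by the number of windows it appeared in
final_latents = accumulated / counts.view(1, 1, num_frames, 1, 1)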

depending on the motion model, we should have different defaults for the context length, right (16 for SD1.5 based models, 32 for AnimateDiffXL)? Should I just do some introspection to get the motion model and have a mapping dictionary?

context_length would just be motion_adapter.config.max_motion_seq_length from config.json if I understand correctly.

I think what the diffusers team would like to have is methods that enable/disable long-context generation, with __call__ dispatching to the appropriate helper methods. Changing the implementation directly and adding extra parameters to __call__ would make it confusing for newer users, especially because this is a little confusing already. Also, the sliding window technique could be added to all AnimateDiff-related pipelines, not just vid2vid, I think.
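
To make the suggested structure concrete, here is a rough, hypothetical sketch of what such an enable/disable interface and __call__ dispatch could look like (the class, method, and attribute names are made up for illustration, not existing diffusers API):

class AnimateDiffVid2VidSketch:
    """Toy stand-in for the pipeline, only to illustrate the dispatch pattern."""

    def __init__(self):
        self._sliding_window_config = None

    def enable_sliding_window(self, context_length=16, overlap=4, stride=1):
        # Store the long-context settings; __call__ checks for them later.
        self._sliding_window_config = dict(
            context_length=context_length, overlap=overlap, stride=stride
        )

    def disable_sliding_window(self):
        self._sliding_window_config = None

    def __call__(self, *args, **kwargs):
        if self._sliding_window_config is not None:
            return self._denoise_with_sliding_window(*args, **kwargs)
        return self._denoise_default(*args, **kwargs)

    def _denoise_default(self, *args, **kwargs):
        return "regular denoising loop"

    def _denoise_with_sliding_window(self, *args, **kwargs):
        return f"long-context denoising with {self._sliding_window_config}"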

cc @DN6 @sayakpaul

@sayakpaul
Member

I think what the diffusers team would like to have is methods that enable/disable long-context generation, with __call__ dispatching to the appropriate helper methods. Changing the implementation directly and adding extra parameters to __call__ would make it confusing for newer users, especially because this is a little confusing already. Also, the sliding window technique could be added to all AnimateDiff-related pipelines, not just vid2vid, I think.

Yeah, your understanding is correct. However, I will let @DN6 comment on it.

@skunkwerk
Author

Thanks for the really helpful context, @a-r-r-o-w.

  • In your sample code, on the first line, would the 'latents' variable be the output of some previous process, i.e. the latents from a single pass over all the frames?
  • I see that the __call__ method has 'output_type' and 'latents' parameters which can be leveraged for this use case, so that makes sense.
  • I'll try to take a look at the stride length code first and get that going.

@a-r-r-o-w
Member

in your sample code, on the first line, would the 'latents' variable be the output of some previous process? the latents from a single pass of all the frames?

latents will just be some random tensor (for txt2vid) or image/video-encoded latents (for img2vid/vid2vid) of shape (batch_size, num_channels, num_frames, height // vae_scale_factor, width // vae_scale_factor). These latents will be denoised based on the context_indices, averaged, and decoded to obtain the final video.
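
For concreteness, a tiny sketch of what that starting latent tensor looks like for txt2vid (the concrete sizes below are just illustrative; SD1.5-based models use 4 latent channels and a VAE scale factor of 8):

import torch

batch_size, num_channels, num_frames = 1, 4, 60
height, width, vae_scale_factor = 512, 512, 8

# Random starting latents for txt2vid; for img2vid/vid2vid these would instead
# come from VAE-encoding the input frames (plus added noise).
latents = torch.randn(
    batch_size, num_channels, num_frames,
    height // vae_scale_factor, width // vae_scale_factor,
)
print(latents.shape)  # torch.Size([1, 4, 60, 64, 64])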


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Feb 10, 2024
@github-actions github-actions bot closed this Feb 18, 2024
@sayakpaul sayakpaul reopened this Feb 18, 2024
@sayakpaul
Member

Cc @DN6

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@JosefKuchar

The variable current_latents is not used anywhere. Do I understand correctly that the line should be processed_latents = process_animatediff(current_latents) instead of processed_latents = process_animatediff(latents)?

@a-r-r-o-w
Member

@JosefKuchar My bad, typo. That is correct.

@yiyixuxu
Collaborator

Is this ready for a review?

@DN6 DN6 removed the stale Issues that haven't received updates label Feb 24, 2024
@DN6
Collaborator

DN6 commented Feb 27, 2024

Hi all. I took a look into some of the different approaches for longer video generation with AnimateDiff. The ones available in AnimateDiff-Evolved use the approach being discussed here: a sliding window, averaging out the latents of overlapping frames. This seems to be inspired by the MultiDiffusion approach for generating panoramic images:

count.zero_()
value.zero_()
# generate views
# Here, we iterate through different spatial crops of the latents and denoise them. These
# denoised (latent) crops are then averaged to produce the final latent
# for the current timestep via MultiDiffusion. Please see Sec. 4.1 in the
# MultiDiffusion paper for more details: https://arxiv.org/abs/2302.08113
# Batch views denoise
for j, batch_view in enumerate(views_batch):

Except we apply it temporally rather than spatially.

Another approach is FreeNoise, which also uses a sliding window, but applies it in the layers of the motion modules
https://github.com/arthur-qiu/FreeNoise-AnimateDiff/blob/e01d82233c595ce22f1a5eba487911c345ce7b5b/animatediff/models/motion_module.py#L262-L280

FreeNoise seems like a more principled approach and avoids relying on a "magic" context scheduler. I haven't compared the quality of FreeNoise vs. the Context Scheduler, though.

Additionally, the Context Scheduler approach can theoretically handle an infinitely long video sequence. A very long sequence of latents can be held in RAM, and only the context latents, with a fixed length, go through the forward pass of the model.

With FreeNoise, splitting the long sequence into context latents only happens in the motion modules, so the other UNet layers have to deal with the longer sequence. We could do some work on the UNet motion blocks to enable chunked inference, e.g. something similar to unet.enable_feed_forward_chunking but for latent sequences. Or just enable chunked inference in the blocks by default.
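
As a rough illustration of the chunking idea over the frame/sequence dimension (a generic pattern sketched under my own names, not the diffusers implementation):

import torch
import torch.nn as nn

def chunked_over_frames(module, hidden_states, frame_dim, chunk_size):
    # Run `module` on smaller slices along the frame dimension and concatenate
    # the outputs, so peak memory scales with `chunk_size` rather than the full
    # video length.
    chunks = hidden_states.split(chunk_size, dim=frame_dim)
    return torch.cat([module(chunk) for chunk in chunks], dim=frame_dim)

# usage: a feed-forward applied to a (batch, frames, dim) sequence in chunks of 16 frames
ff = nn.Sequential(nn.Linear(320, 1280), nn.GELU(), nn.Linear(1280, 320))
hidden_states = torch.randn(1, 256, 320)   # 256-frame sequence
out = chunked_over_frames(ff, hidden_states, frame_dim=1, chunk_size=16)
print(out.shape)  # torch.Size([1, 256, 320])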

@yiyixuxu yiyixuxu added the video video generation label Feb 28, 2024
@AbhinavGopal
Contributor

Any update on the progress of this?

@sayakpaul
Member

A better approach is also to go through the PRs sometimes, to get a head start on whether a feature is already in motion to be shipped :-)

@a-r-r-o-w
Member

If @DN6 is busy with other things and does not have the bandwidth for this right now, I'll be happy to eventually pick this up on a weekend when I find time. But please feel free to open a PR if you'd like to take this up. AFAICT, there will not be too many code changes, and most of it can be adapted directly from their repo. It'd be preferable to have enable_() and disable_() methods for doing it. It's also something the community has been using for a while, so I think there should be discussions or improved implementations out there that you could look for.

@AbhinavGopal
Contributor

Would like to work on this, but unfortunately I'm blocked by #7378 (comment). I believe that in order to work on this, I'd need to be able to decode and encode the latents independently.

@JosefKuchar

JosefKuchar commented Mar 25, 2024

Would love to see a FreeNoise implementation. I've hacked together an implementation for basic chunking above, but the results are not that great (only openpose controlnet, no vid2vid). Unfortunately I don't have the skills to port the original FreeNoise AnimateDiff implementation (changes here: arthur-qiu/FreeNoise-AnimateDiff@9abf5ed) to diffusers.

@JosefKuchar

Ok, so I was able to port the FreeNoise AnimateDiff code to diffusers - results below (128 frames).
@DN6 @a-r-r-o-w Shall I open a separate PR?
Any hints for implementing chunked inference on the UNet motion blocks? 128 frames fit in 24 GB VRAM, but 256 frames overflow (I think that's for a separate PR anyway; I would love to implement it).

Prompt: "Animated man in a suit on a beach", using the community AnimateDiff ControlNet pipeline with the AnimateDiff-Lightning 4-step version (5-step inference).

result128.mp4
conditioning128.mp4

@a-r-r-o-w
Member

@JosefKuchar Not a maintainer here, but I'd say please go for the PR ❤️ Supporting long-context generation has been available in Comfy and A1111 for a long time, and adding support for it within diffusers has been on our minds for many months now. Thank you so much for taking the initiative! The community has been generating short films using these methods with the best models out there and has nailed down many tricks for consistent, high-quality generation; something we could definitely write guides about (perhaps @asomoza would be a great help for this). I'm happy to help resolve any conflicts that may come up with supporting both FreeInit and FreeNoise. Chunked UNet inference could be a separate thing to look at in the near future, yep.

@DN6
Collaborator

DN6 commented Mar 29, 2024

Hi all. Really nice to see the initiative here!

I'll have bandwidth to take this up next week. @JosefKuchar, since you've already started on FreeNoise, I'll leave you to it and look into the sliding window. I'll probably just follow this reference. I believe @a-r-r-o-w had included it when originally proposing the AnimateDiff PR, but we weren't 100% sure about adding something that wasn't fully understood at the time.

@JosefKuchar For chunking, you would need to look at the Resnet and Attention blocks in the MotionBlocks here. Examples of chunking logic can be found here and here. If you feel it's a bit much to handle all at once, feel free to open a PR with just FreeNoise as-is, and we can work on chunking in a follow-up.

@a-r-r-o-w
Member

New relevant work regarding long context generation: https://github.com/TMElyralab/MuseV/. Thought it might be interesting to share here since we're looking at similar things

@DN6
Collaborator

DN6 commented Apr 10, 2024

Hi all. I have to prioritise some other work at the moment, so I will have to pause working on the sliding window for now. I'll try to pick it up later, but if anyone wants to take a shot at it, feel free to do so and tag me for a review.


github-actions bot commented May 5, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label May 5, 2024
@DN6 DN6 removed the stale Issues that haven't received updates label May 6, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Sep 14, 2024
@yiyixuxu yiyixuxu removed the stale Issues that haven't received updates label Sep 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Oct 12, 2024
@yiyixuxu yiyixuxu removed the stale Issues that haven't received updates label Oct 15, 2024

github-actions bot commented Nov 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Nov 8, 2024
@sayakpaul
Member

Cc: @a-r-r-o-w @DN6

@sayakpaul sayakpaul removed the stale Issues that haven't received updates label Nov 8, 2024
@a-r-r-o-w
Member

I think this can be closed in favor of FreeNoise (#8948), which is a better method than sliding windows at maintaining quality. We could also explore tuning-free methods like FreeLong in the future for video extension.

We can revisit these methods with modular diffusers (possibly as community add-ons), since that would not require monkey-patching our pipelines.

@a-r-r-o-w a-r-r-o-w closed this Nov 8, 2024