
AnimateDiff Video to Video #6328

Merged: 59 commits, Jan 24, 2024

Conversation

@a-r-r-o-w (Member) commented Dec 25, 2023

What does this PR do?

Attempts to add img2video and video2video support to AnimateDiff. Fixes #6123.

Colab

Edit: img2vid has been moved to a community pipeline after the reviews below. Please check #6509.

Before submitting

Who can review?

@DN6 @sayakpaul @patrickvonplaten @jon-chuang

@a-r-r-o-w (Member, Author) commented Dec 25, 2023

Would be great to have an ImageToVideo and VideoToVideo version of AnimateDiff, as suggested by Jonathan in #6123.

@jon-chuang I need some help and your suggestions here. From the different implementations I've looked at, a few ideas have been used for the initial latent in img2video: repeating the image latent num_frames times, linearly interpolating between a random latent and the image latent, and variations of those. I've also tried spherical linear interpolation for fun. Everything works inference-wise, but the quality is quite bad, possibly because the noise isn't scaled correctly. I haven't had time to iron out the remaining bugs, but I think we almost have something ready. Would you mind reviewing the current code? Is there a better approach in your implementation that I may have missed?
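For readers following along, here is a minimal, self-contained sketch of the three initial-latent strategies mentioned above (repeat, lerp, slerp). The function and variable names are illustrative assumptions, not the actual code in this PR:

import torch

def lerp(v0: torch.Tensor, v1: torch.Tensor, alpha: float) -> torch.Tensor:
    # Linear interpolation: alpha = 0 keeps v0 (noise), alpha = 1 keeps v1 (image latent).
    return (1 - alpha) * v0 + alpha * v1

def slerp(v0: torch.Tensor, v1: torch.Tensor, alpha: float, eps: float = 1e-7) -> torch.Tensor:
    # Spherical linear interpolation over the flattened latents.
    dot = (v0.flatten() * v1.flatten()).sum() / (v0.flatten().norm() * v1.flatten().norm() + eps)
    theta = torch.acos(dot.clamp(-1 + eps, 1 - eps))
    return (torch.sin((1 - alpha) * theta) * v0 + torch.sin(alpha * theta) * v1) / torch.sin(theta)

def make_initial_latents(image_latent, noise, num_frames, method="lerp"):
    # image_latent: (B, C, H, W) from the VAE encoder; noise: (B, C, num_frames, H, W)
    latents = noise.clone()
    for i in range(num_frames):
        alpha = i / num_frames
        if method == "repeat":
            latents[:, :, i] = image_latent
        elif method == "lerp":
            latents[:, :, i] = lerp(latents[:, :, i], image_latent, alpha)
        else:  # "slerp"
            latents[:, :, i] = slerp(latents[:, :, i], image_latent, alpha)
    return latents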

Current results: the initial image and outputs for the repeat, lerp, and slerp strategies (images omitted).

@a-r-r-o-w (Member, Author):

@sayakpaul @patrickvonplaten @DN6 Would you be open to adding support for this to the AnimateDiff-related pipelines once we get it working? Also, I've added all the relevant code to the current pipeline rather than creating a separate class, since that would lead to quite a lot of duplication for something that shares so much common code. Let me know if this isn't ideal and we should instead have separate AnimateDiffImgToVidPipeline and AnimateDiffVidToVidPipeline pipelines.

@a-r-r-o-w (Member, Author) commented Dec 25, 2023

Seems to be working well for lerp and slerp after lowering the impact of the image on the initial latents by scaling alpha to be lower: alpha = i / num_frames / 8. I'm not sure why this works. For divisors lower than 8, quality starts to get worse. For divisors greater than 8, up to a threshold, quality seems to improve while still respecting the initial image; beyond that threshold the output just becomes random, which makes sense since the interpolation is then barely affected by the image latents and the initial latents are essentially just random noise.

Current results for repeat, lerp, and slerp (images omitted).

Maybe this scaling factor could be exposed as diversity or something similar? A higher value would lead to more deviation from the initial image, while a lower value would lead to closer resemblance.
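If that factor were exposed as an argument, it might look roughly like this (a sketch only; diversity is a hypothetical parameter name, not part of the pipeline API):

def frame_alpha(i: int, num_frames: int, diversity: float = 8.0) -> float:
    # Larger diversity -> smaller alpha -> the image latent contributes less,
    # so the generated frames drift further from the initial image.
    return i / num_frames / diversity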

@DN6 (Collaborator) commented Dec 26, 2023

@a-r-r-o-w These can be separate pipelines. See Diffusers Philosophy for reference.

@DN6 (Collaborator) commented Dec 26, 2023

Nice job figuring out a clean way to do Img2Vid/Vid2Vid btw 👍🏽

@jon-chuang:

by scaling alpha to be lower: alpha = i / num_frames / 8

I observed something similar. I had no reasonable explanation for it.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w (Member, Author) commented Dec 27, 2023

Sharing some results from both img2video and video2video pipelines. Updated usage and code can be found on this Colab notebook.

Image to Video (images omitted): results with lerp and slerp latent interpolation at strengths 0.75, 0.84, and 0.92.

Video to Video (videos omitted): input videos and results for prompts including "green algae floating in water, bioluminiscent garbage, litter, plastic waste on the shore, destruction of the planet by humanity, high quality", "birds flying in the sky", "a panda playing a guitar, sitting inside a boat floating in the ocean, high quality, realistic", "a racoon playing a trumpet, high quality", and "cyberpunk racoon".

Really like how VideoToVideo worked out! But I'm not very satisfied with the quality of ImageToVideo; there's a lot of room for improvement. It would be great if someone from the community could suggest improvements. Currently, img2vid fails if you provide a blank prompt. Ideally, I think img2vid should be able to animate the given image to some extent even with a blank prompt.

@a-r-r-o-w marked this pull request as ready for review on December 27, 2023.
@jon-chuang commented Dec 27, 2023

Seems like the input image strength can be adjusted for ImageToVideo

It's subjective, but perhaps stronger would be better... 🤔

@a-r-r-o-w (Member, Author):

Seems like the input image strength can be adjusted for ImageToVideo

It's subjective, but perhaps stronger would be better... 🤔

Yeah... Currently, the strength parameter must be set high to get decent results. Setting it to lower values leads to over-saturated results with almost no motion. I believe this is because there isn't enough noisiness in the latents. To try and fix this, I'm going to do something similar to vid2vid: use the scheduler to add noise and calculate the number of inference steps based on strength, so that at least the over-saturation issue doesn't happen.
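For context, a rough sketch of that standard img2img-style approach (assumed helper names; not the exact code in this PR; assumes scheduler.set_timesteps(...) has already been called):

import torch

def get_timesteps(scheduler, num_inference_steps, strength):
    # Keep only the last `strength` fraction of the denoising schedule.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = scheduler.timesteps[t_start * scheduler.order :]
    return timesteps, num_inference_steps - t_start

def noise_image_latents(scheduler, init_latents, timesteps, generator):
    # Noise the VAE-encoded image latents to the first timestep actually used,
    # so lower strength preserves more of the original content.
    noise = torch.randn(init_latents.shape, generator=generator, dtype=init_latents.dtype)
    return scheduler.add_noise(init_latents, noise.to(init_latents.device), timesteps[:1])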

@a-r-r-o-w (Member, Author) commented Dec 27, 2023

Actually, I'm not sure if I've implemented the prepare_latents() function correctly for img2vid. We have the following code:

...
            init_latents = init_latents.to(dtype)
            init_latents = self.vae.config.scaling_factor * init_latents
            latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
            latents = latents * self.scheduler.init_noise_sigma

            if latent_interpolation_method == "lerp":
                def latent_cls(v0, v1, index):
                    return lerp(v0, v1, index / num_frames * (1 - strength))
            elif latent_interpolation_method == "slerp":
                def latent_cls(v0, v1, index):
                    return slerp(v0, v1, index / num_frames * (1 - strength))
            else:
                latent_cls = latent_interpolation_method

            for i in range(num_frames):
                latents[:, :, i, :, :] = latent_cls(latents[:, :, i, :, :], init_latents, i)

In the case of lerp, we are essentially doing: (1 - alpha) * noisy_latents + alpha * image_latents.

This means that:

  • latents[:, :, 0, :, :] has the highest amount of noise
  • latents[:, :, num_frames-1, :, :] has the lowest amount of noise and is closest to the original image latent.

Shouldn't this be the reverse, since we want the initial condition to be the input image and the model should freely be able to fill in the future frames? 🤔

Edit: After fixing the logic based on the reasoning above, I'm getting terrible results again. I still don't think the current implementation is correct, but it does seem to work to an extent.
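To make the weight claim above concrete, here are the lerp weights the current code produces (plain arithmetic, shown for num_frames = 16 and strength = 0.5):

num_frames, strength = 16, 0.5
for i in (0, 8, 15):
    alpha = i / num_frames * (1 - strength)
    print(f"frame {i:2d}: noise weight = {1 - alpha:.3f}, image weight = {alpha:.3f}")
# frame  0: noise weight = 1.000, image weight = 0.000
# frame  8: noise weight = 0.750, image weight = 0.250
# frame 15: noise weight = 0.531, image weight = 0.469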

@jon-chuang:

Sharing some results from both img2video and video2video pipelines. Updated usage and code can be found on this Colab notebook.

Anyway, just IMO, I think the results you showed are good enough for an initial merge to make this available to the community (e.g. we have a use case that would benefit from this).

I think further improvements can be made over time, but to get this merged you'll have to refactor your code to fit the diffusers codebase style.

@a-r-r-o-w (Member, Author) commented Dec 30, 2023

I think further improvements can be made over time, but to get this merged you'll have to refactor your code to fit the diffusers codebase style.

Yep, sorry about the delay. I've been incredibly busy but I'll make it completely ready for a merge this weekend for sure.

@DN6 @patrickvonplaten @sayakpaul I've put it in as a core pipeline here but let me know if you'd like me to move it into community. I really think vid2vid would be great for core and img2vid could gradually be worked on and improved. What do you think?

@a-r-r-o-w (Member, Author) commented Dec 31, 2023

Here's some minimal code to test the pipelines:

Image To Video
import torch
from diffusers import AnimateDiffImg2VideoPipeline
from diffusers.models.unet_motion_model import MotionAdapter
from diffusers.schedulers import DDIMScheduler
from diffusers.utils import export_to_gif
from PIL import Image

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffImg2VideoPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
scheduler = DDIMScheduler.from_pretrained(
    model_id, beta_schedule="linear", subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler

# pipe.enable_vae_slicing()
pipe = pipe.to("cuda")

img = Image.open("0062.png")
output = pipe(
    image=img,
    prompt="A snail moving on the ground",
    negative_prompt="bad quality, worse quality",
    height=512,
    width=512,
    num_frames=16,
    guidance_scale=10,
    num_inference_steps=20,
    strength=0.8,
    generator=torch.Generator("cpu").manual_seed(42),
    latent_interpolation_method="slerp",
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
Video To Video
import imageio
import torch
from diffusers import AnimateDiffVideo2VideoPipeline
from diffusers.models.unet_motion_model import MotionAdapter
from diffusers.schedulers import DDIMScheduler
from diffusers.utils import export_to_gif
from PIL import Image

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffVideo2VideoPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
scheduler = DDIMScheduler.from_pretrained(
    model_id, beta_schedule="linear", subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler

# pipe.enable_vae_slicing()
pipe = pipe.to("cuda")

def load_video(file_path):
    images = []
    vid = imageio.get_reader(file_path)
    for i, frame in enumerate(vid):
        pil_image = Image.fromarray(frame)
        images.append(pil_image)
    return images

video = load_video("animation_fireworks.gif")
output = pipe(
    prompt="closeup of a pretty woman, harley quinn, margot robbie, fireworks in the background, realistic",
    negative_prompt="low quality",
    video=video,
    height=512,
    width=512,
    guidance_scale=7,
    num_inference_steps=20,
    strength=0.7,
    generator=torch.Generator().manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

Also, the updated Colab notebook.

@patrickvonplaten requested a review from DN6 on January 2, 2024.
@DN6 (Collaborator) commented Jan 2, 2024

@a-r-r-o-w I think we might be able to add vid2vid to core pipelines since it's essentially similar to img2img. Could you verify if the styling remains consistent over multiple frame batches? e.g. if you run vid2vid over 64 frames (4 batches of 16) do you observe abrupt changes across frames? I don't think it's a blocker to merge, but it would be good to know.

Since img2vid relies on some "magic" to make it work, it might be better suited to community pipelines for the moment. We might find that SparseCtrl is better suited to img2vid tasks.

@a-r-r-o-w (Member, Author):

Sure, that makes sense. I'll move img2vid into community pipelines and hopefully someone can find a better way to do it or, as you said, just use SparseCtrl.

As for the num_videos_per_prompt=1 restriction, I did it the same way AnimateDiff does: it only allows single-video generation and has that hardcoded at the moment. I'll get back after testing 64 frames shortly. I'm assuming you mean four same/different videos combined with same/different edit prompts, because breaking a single 64-frame video into four 16-frame parts and processing them independently will definitely lead to inconsistency over time, since there's no AnimateDiff sliding-window support yet (which I can take up soon, maybe).

@a-r-r-o-w (Member, Author):

@patrickvonplaten @DN6 Thanks! I believe I've made all the requested changes. There was a merge conflict with AnimateDiff after the FreeInit merge, and I'm hoping I resolved it correctly, but please do review. Do let me know if other changes are required.

@patrickvonplaten (Contributor) left a review comment:

Only one test failure to fix before we can merge:
tests/pipelines/animatediff/test_animatediff_video2video.py::AnimateDiffVideoToVideoPipelineFastTests::test_progress_bar - AssertionError: False is not true : Progress bar should be enabled and stopped at the max step

@a-r-r-o-w (Member, Author):

Only one test failure to fix before we can merge: tests/pipelines/animatediff/test_animatediff_video2video.py::AnimateDiffVideoToVideoPipelineFastTests::test_progress_bar - AssertionError: False is not true : Progress bar should be enabled and stopped at the max step

I think all tests are fixed now. The previous failure was because the progress bar wasn't being updated: the update happened inside the deprecated callback logic, which we removed. LGTM before something else breaks 🥲

@yiyixuxu added the video (video generation) label on Jan 24, 2024.
@DN6 (Collaborator) left a review comment:

Well done! 👍🏽

@DN6 merged commit a517f66 into huggingface:main on Jan 24, 2024 (14 checks passed).
@a-r-r-o-w (Member, Author):

Thanks for your time and the merge ❤️ Also, thanks @jon-chuang for proposing this addition and for sharing your thoughts!

I think we're very close to supporting most AnimateDiff features (as provided in the ComfyUI/A1111 extensions) once we have SDXL and SparseCtrl merged, along with long-context sliding-window support. Regarding SDXL, I've been a little busy with work/exams and haven't been able to give much time to that PR; I'll have more time soon and will complete it.

@a-r-r-o-w deleted the animatediff-img2video branch on January 24, 2024.
@DN6 mentioned this pull request on Jan 24, 2024.
@lea-lena:

@DN6 @sayakpaul @patrickvonplaten @jon-chuang @a-r-r-o-w Hi!!! Could you help me with an example of how to use the Video to Video code with ControlNet? I could not find anything about it in the documentation: https://huggingface.co/docs/diffusers/en/api/pipelines/animatediff
Actually, I have been using ComfyUI following a tutorial, and it was simple to use ControlNet there. But I have wanted to learn how to use this library for some time now. I have been able to get IP-Adapter working here, but I don't know how to do the same for ControlNet. Thanks in advance! ❤️

@a-r-r-o-w (Member, Author):

Hey @lea-lena. It is not possible to use ControlNet here because it was not implemented with this pipeline. There is, however, a community pipeline with a usage example here. It uses only a text prompt and a control video, but no input video. It shouldn't be too hard to modify that code to use strength and an input video, like it's done here, to create the initial latents instead of generating them randomly. @DN6, does it make sense to support an optional input video and strength directly in the community pipeline for similar video generation?
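For anyone exploring that modification, here is a rough sketch of how the initial latents could be built from an input video plus strength (prepare_video_latents and its signature are assumptions for illustration, not the community pipeline's actual API; assumes scheduler.set_timesteps(...) has already been called):

import torch

def prepare_video_latents(vae, scheduler, frames, strength, num_inference_steps, generator):
    # frames: preprocessed video tensor of shape (num_frames, 3, H, W) in [-1, 1]
    init_latents = vae.encode(frames).latent_dist.sample(generator) * vae.config.scaling_factor
    init_latents = init_latents.permute(1, 0, 2, 3).unsqueeze(0)  # -> (1, C, num_frames, h, w)

    # Keep only the last `strength` fraction of the denoising schedule.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = scheduler.timesteps[t_start * scheduler.order :]

    # Noise the video latents to the first timestep actually used.
    noise = torch.randn(init_latents.shape, generator=generator, dtype=init_latents.dtype)
    latents = scheduler.add_noise(init_latents, noise.to(init_latents.device), timesteps[:1])
    return latents, timesteps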

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
* begin animatediff img2video and video2video

* revert animatediff to original implementation

* add img2video as pipeline

* update

* add vid2vid pipeline

* update imports

* update

* remove copied from line for check_inputs

* update

* update examples

* add multi-batch support

* fix __init__.py files

* move img2vid to community

* update community readme and examples

* fix

* make fix-copies

* add vid2vid batch params

* apply suggestions from review

Co-Authored-By: Dhruv Nair <[email protected]>

* add test for animatediff vid2vid

* torch.stack -> torch.cat

Co-Authored-By: Dhruv Nair <[email protected]>

* make style

* docs for vid2vid

* update

* fix prepare_latents

* fix docs

* remove img2vid

* update README to :main

* remove slow test

* refactor pipeline output

* update docs

* update docs

* merge community readme from :main

* final fix i promise

* add support for url in animatediff example

* update example

* update callbacks to latest implementation

* Update src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py

Co-authored-by: Patrick von Platen <[email protected]>

* fix merge

* Apply suggestions from code review

* remove callback and callback_steps as suggested in review

* Update tests/pipelines/animatediff/test_animatediff_video2video.py

Co-authored-by: Patrick von Platen <[email protected]>

* fix import error caused due to unet refactor in huggingface#6630

* fix numpy import error after tensor2vid refactor in huggingface#6626

* make fix-copies

* fix numpy error

* fix progress bar test

---------

Co-authored-by: Dhruv Nair <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Labels: video (video generation)
May close: AnimateDiffPipeline: add ImageToVideo and VideoToVideo
8 participants