
AnimateDiff Video to Video #6328

Merged: 59 commits, Jan 24, 2024

Conversation

@a-r-r-o-w (Member) commented Dec 25, 2023

What does this PR do?

Attempts to add img2video and video2video support to AnimateDiff. Fixes #6123.

Colab

Edit: img2vid has been moved to a community pipeline after the reviews below. Please check #6509.

Before submitting

Who can review?

@DN6 @sayakpaul @patrickvonplaten @jon-chuang

@a-r-r-o-w (Member, Author) commented Dec 25, 2023

Would be great to have an ImageToVideo and VideoToVideo version of AnimateDiff, as suggested by Jonathan in #6123.

@jon-chuang I need some help and your suggestions here. From the different implementations I've looked at, a few ideas have been used for the initial latent in img2video: repeating the image latent num_frames times, linearly interpolating between a random latent and the image latent, and variations of those. I've also tried spherical linear interpolation for fun. Everything works inference-wise, but the quality is quite bad, possibly because the noise isn't scaled correctly. I haven't had time to iron out the remaining bugs, but I think we almost have something ready. Would you mind reviewing the current code? Is there a better approach in your implementation that I may have missed?
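For readers following along, here is a minimal, self-contained sketch of the three initial-latent strategies mentioned above (repeat, lerp, slerp). The function and variable names are illustrative assumptions, not the actual code in this PR:

import torch

def lerp(v0: torch.Tensor, v1: torch.Tensor, alpha: float) -> torch.Tensor:
    # Linear interpolation: alpha = 0 keeps v0 (noise), alpha = 1 keeps v1 (image latent).
    return (1 - alpha) * v0 + alpha * v1

def slerp(v0: torch.Tensor, v1: torch.Tensor, alpha: float, eps: float = 1e-7) -> torch.Tensor:
    # Spherical linear interpolation over the flattened latents.
    dot = (v0.flatten() * v1.flatten()).sum() / (v0.flatten().norm() * v1.flatten().norm() + eps)
    theta = torch.acos(dot.clamp(-1 + eps, 1 - eps))
    return (torch.sin((1 - alpha) * theta) * v0 + torch.sin(alpha * theta) * v1) / torch.sin(theta)

def make_initial_latents(image_latent, noise, num_frames, method="lerp"):
    # image_latent: (B, C, H, W) from the VAE encoder; noise: (B, C, num_frames, H, W)
    latents = noise.clone()
    for i in range(num_frames):
        alpha = i / num_frames
        if method == "repeat":
            latents[:, :, i] = image_latent
        elif method == "lerp":
            latents[:, :, i] = lerp(latents[:, :, i], image_latent, alpha)
        else:  # "slerp"
            latents[:, :, i] = slerp(latents[:, :, i], image_latent, alpha)
    return latents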

Current results: the initial image and outputs for the repeat, lerp, and slerp strategies (images omitted).

@a-r-r-o-w (Member, Author):

@sayakpaul @patrickvonplaten @DN6 Would you be open to adding support for this to the AnimateDiff-related pipelines once we get it working? Also, I've added all the relevant code to the current pipeline rather than creating a separate class, since that would lead to quite a lot of duplication for something that shares so much common code. Let me know if this isn't ideal and we should instead have separate AnimateDiffImgToVidPipeline and AnimateDiffVidToVidPipeline pipelines.

@a-r-r-o-w (Member, Author) commented Dec 25, 2023

Seems to be working well for lerp and slerp after lowering the impact of the image on the initial latents by scaling alpha to be lower: alpha = i / num_frames / 8. I'm not sure why this works. For divisors lower than 8, quality starts to get worse. For divisors greater than 8, up to a threshold, quality seems to improve while still respecting the initial image; beyond that threshold the output just becomes random, which makes sense since the interpolation is then barely affected by the image latents and the initial latents are essentially just random noise.

Current results for repeat, lerp, and slerp (images omitted).

Maybe this scaling factor could be exposed as diversity or something similar? A higher value would lead to more deviation from the initial image, while a lower value would lead to closer resemblance.
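If that factor were exposed as an argument, it might look roughly like this (a sketch only; diversity is a hypothetical parameter name, not part of the pipeline API):

def frame_alpha(i: int, num_frames: int, diversity: float = 8.0) -> float:
    # Larger diversity -> smaller alpha -> the image latent contributes less,
    # so the generated frames drift further from the initial image.
    return i / num_frames / diversity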

@DN6 (Collaborator) commented Dec 26, 2023

@a-r-r-o-w These can be separate pipelines. See Diffusers Philosophy for reference.

@DN6 (Collaborator) commented Dec 26, 2023

Nice job figuring out a clean way to do Img2Vid/Vid2Vid btw 👍🏽

@jon-chuang:

by scaling alpha to be lower: alpha = i / num_frames / 8

I observed something similar. I had no reasonable explanation for it.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w (Member, Author) commented Dec 27, 2023

Sharing some results from both img2video and video2video pipelines. Updated usage and code can be found on this Colab notebook.

Image to Video (images omitted): results with lerp and slerp latent interpolation at strengths 0.75, 0.84, and 0.92.

Video to Video (videos omitted): input videos and results for prompts including "green algae floating in water, bioluminiscent garbage, litter, plastic waste on the shore, destruction of the planet by humanity, high quality", "birds flying in the sky", "a panda playing a guitar, sitting inside a boat floating in the ocean, high quality, realistic", "a racoon playing a trumpet, high quality", and "cyberpunk racoon".

Really like how VideoToVideo worked out! But I'm not very satisfied with the quality of ImageToVideo; there's a lot of room for improvement. It would be great if someone from the community could suggest improvements. Currently, img2vid fails if you provide a blank prompt. Ideally, I think img2vid should be able to animate the given image to some extent even with a blank prompt.

@a-r-r-o-w marked this pull request as ready for review on December 27, 2023.
@jon-chuang commented Dec 27, 2023

Seems like the input image strength can be adjusted for ImageToVideo

It's subjective, but perhaps stronger would be better... 🤔

@a-r-r-o-w (Member, Author):

Seems like the input image strength can be adjusted for ImageToVideo

It's subjective, but perhaps stronger would be better... 🤔

Yeah... Currently, the strength parameter must be set high to get decent results. Setting it to lower values leads to over-saturated results with almost no motion. I believe this is because there isn't enough noisiness in the latents. To try and fix this, I'm going to do something similar to vid2vid: use the scheduler to add noise and calculate the number of inference steps based on strength, so that at least the over-saturation issue doesn't happen.
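For context, a rough sketch of that standard img2img-style approach (assumed helper names; not the exact code in this PR; assumes scheduler.set_timesteps(...) has already been called):

import torch

def get_timesteps(scheduler, num_inference_steps, strength):
    # Keep only the last `strength` fraction of the denoising schedule.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = scheduler.timesteps[t_start * scheduler.order :]
    return timesteps, num_inference_steps - t_start

def noise_image_latents(scheduler, init_latents, timesteps, generator):
    # Noise the VAE-encoded image latents to the first timestep actually used,
    # so lower strength preserves more of the original content.
    noise = torch.randn(init_latents.shape, generator=generator, dtype=init_latents.dtype)
    return scheduler.add_noise(init_latents, noise.to(init_latents.device), timesteps[:1])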

@a-r-r-o-w (Member, Author) commented Dec 27, 2023

Actually, I'm not sure if I've implemented the prepare_latents() function correctly for img2vid. We have the following code:

...
            init_latents = init_latents.to(dtype)
            init_latents = self.vae.config.scaling_factor * init_latents
            latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
            latents = latents * self.scheduler.init_noise_sigma

            if latent_interpolation_method == "lerp":
                def latent_cls(v0, v1, index):
                    return lerp(v0, v1, index / num_frames * (1 - strength))
            elif latent_interpolation_method == "slerp":
                def latent_cls(v0, v1, index):
                    return slerp(v0, v1, index / num_frames * (1 - strength))
            else:
                latent_cls = latent_interpolation_method

            for i in range(num_frames):
                latents[:, :, i, :, :] = latent_cls(latents[:, :, i, :, :], init_latents, i)

In the case of lerp, we are essentially doing: (1 - alpha) * noisy_latents + alpha * image_latents.

This means that:

  • latents[:, :, 0, :, :] has the highest amount of noise
  • latents[:, :, num_frames-1, :, :] has the lowest amount of noise and is closest to the original image latent.

Shouldn't this be the reverse, since we want the initial condition to be the input image and the model should freely be able to fill in the future frames? 🤔

Edit: After fixing the logic based on the reasoning above, I'm getting terrible results again. I still don't think the current implementation is correct, but it does seem to work to an extent.
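To make the weight claim above concrete, here are the lerp weights the current code produces (plain arithmetic, shown for num_frames = 16 and strength = 0.5):

num_frames, strength = 16, 0.5
for i in (0, 8, 15):
    alpha = i / num_frames * (1 - strength)
    print(f"frame {i:2d}: noise weight = {1 - alpha:.3f}, image weight = {alpha:.3f}")
# frame  0: noise weight = 1.000, image weight = 0.000
# frame  8: noise weight = 0.750, image weight = 0.250
# frame 15: noise weight = 0.531, image weight = 0.469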

@jon-chuang:

Sharing some results from both img2video and video2video pipelines. Updated usage and code can be found on this Colab notebook.

Anyway, just IMO, I think the results you showed are good enough for an initial merge to make this available to the community (e.g. we have a use case that would benefit from this).

I think further improvements can be made over time, but to get this merged you'll have to refactor your code to fit the diffusers codebase style.

@a-r-r-o-w (Member, Author) commented Dec 30, 2023

I think further improvements can be made over time, but to get this merged you'll have to refactor your code to fit the diffusers codebase style.

Yep, sorry about the delay. I've been incredibly busy but I'll make it completely ready for a merge this weekend for sure.

@DN6 @patrickvonplaten @sayakpaul I've put it in as a core pipeline here but let me know if you'd like me to move it into community. I really think vid2vid would be great for core and img2vid could gradually be worked on and improved. What do you think?

@a-r-r-o-w (Member, Author) commented Dec 31, 2023

Here's some minimal code to test the pipelines:

Image To Video
import torch
from diffusers import AnimateDiffImg2VideoPipeline
from diffusers.models.unet_motion_model import MotionAdapter
from diffusers.schedulers import DDIMScheduler
from diffusers.utils import export_to_gif
from PIL import Image

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffImg2VideoPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
scheduler = DDIMScheduler.from_pretrained(
    model_id, beta_schedule="linear", subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler

# pipe.enable_vae_slicing()
pipe = pipe.to("cuda")

img = Image.open("0062.png")
output = pipe(
    image=img,
    prompt="A snail moving on the ground",
    negative_prompt="bad quality, worse quality",
    height=512,
    width=512,
    num_frames=16,
    guidance_scale=10,
    num_inference_steps=20,
    strength=0.8,
    generator=torch.Generator("cpu").manual_seed(42),
    latent_interpolation_method="slerp",
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
Video To Video
import imageio
import torch
from diffusers import AnimateDiffVideo2VideoPipeline
from diffusers.models.unet_motion_model import MotionAdapter
from diffusers.schedulers import DDIMScheduler
from diffusers.utils import export_to_gif
from PIL import Image

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffVideo2VideoPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
scheduler = DDIMScheduler.from_pretrained(
    model_id, beta_schedule="linear", subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler

# pipe.enable_vae_slicing()
pipe = pipe.to("cuda")

def load_video(file_path):
    images = []
    vid = imageio.get_reader(file_path)
    for i, frame in enumerate(vid):
        pil_image = Image.fromarray(frame)
        images.append(pil_image)
    return images

video = load_video("animation_fireworks.gif")
output = pipe(
    prompt="closeup of a pretty woman, harley quinn, margot robbie, fireworks in the background, realistic",
    negative_prompt="low quality",
    video=video,
    height=512,
    width=512,
    guidance_scale=7,
    num_inference_steps=20,
    strength=0.7,
    generator=torch.Generator().manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

Also, the updated Colab notebook.

@patrickvonplaten requested a review from DN6 on January 2, 2024.
@DN6 (Collaborator) commented Jan 2, 2024

@a-r-r-o-w I think we might be able to add vid2vid to core pipelines since it's essentially similar to img2img. Could you verify if the styling remains consistent over multiple frame batches? e.g. if you run vid2vid over 64 frames (4 batches of 16) do you observe abrupt changes across frames? I don't think it's a blocker to merge, but it would be good to know.

Since img2vid relies on some "magic" to make it work, it might be better suited to community pipelines for the moment. We might find that SparseCtrl is better suited to img2vid tasks.

@a-r-r-o-w (Member, Author):

Sure, that makes sense. I'll move img2vid into community pipelines and hopefully someone can find a better way to do it or, as you said, just use SparseCtrl.

As for the num_videos_per_prompt=1 restriction, I did it the same way AnimateDiff does: it only allows single-video generation and has that hardcoded at the moment. I'll get back after testing 64 frames shortly. I'm assuming you mean four same/different videos combined with same/different edit prompts, because breaking a single 64-frame video into four 16-frame parts and processing them independently will definitely lead to inconsistency over time, since there's no AnimateDiff sliding-window support yet (which I can take up soon, maybe).

@a-r-r-o-w (Member, Author):

@patrickvonplaten @DN6 Thanks! I believe I've made all the requested changes. There was a merge conflict with AnimateDiff after the FreeInit merge, and I'm hoping I resolved it correctly, but please do review. Do let me know if other changes are required.

@patrickvonplaten (Contributor) left a review comment:

Only one test failure to fix before we can merge:
tests/pipelines/animatediff/test_animatediff_video2video.py::AnimateDiffVideoToVideoPipelineFastTests::test_progress_bar - AssertionError: False is not true : Progress bar should be enabled and stopped at the max step

@a-r-r-o-w (Member, Author):

Only one test failure to fix before we can merge: tests/pipelines/animatediff/test_animatediff_video2video.py::AnimateDiffVideoToVideoPipelineFastTests::test_progress_bar - AssertionError: False is not true : Progress bar should be enabled and stopped at the max step

I think all tests are fixed now. The previous failure was because the progress bar wasn't being updated: the update happened inside the deprecated callback logic, which we removed. LGTM before something else breaks 🥲

@yiyixuxu added the video (video generation) label on Jan 24, 2024.
@DN6 (Collaborator) left a review comment:

Well done! 👍🏽

@DN6 merged commit a517f66 into huggingface:main on Jan 24, 2024 (14 checks passed).
@a-r-r-o-w (Member, Author):

Thanks for your time and the merge ❤️ Also, thanks @jon-chuang for proposing this addition and for sharing your thoughts!

I think we're very close to supporting most AnimateDiff features (as provided in the ComfyUI/A1111 extensions) once we have SDXL and SparseCtrl merged, along with long-context sliding-window support. Regarding SDXL, I've been a little busy with work/exams and haven't been able to give much time to that PR; I'll have more time soon and will complete it.

@a-r-r-o-w deleted the animatediff-img2video branch on January 24, 2024.
@DN6 mentioned this pull request on Jan 24, 2024.
@lea-lena:

@DN6 @sayakpaul @patrickvonplaten @jon-chuang @a-r-r-o-w Hi!!! Could you help me with an example of how to use the Video to Video code with ControlNet? I could not find anything about it in the documentation: https://huggingface.co/docs/diffusers/en/api/pipelines/animatediff
Actually, I have been using ComfyUI following a tutorial, and it was simple to use ControlNet there. But I have wanted to learn how to use this library for some time now. I have been able to get IP-Adapter working here, but I don't know how to do the same for ControlNet. Thanks in advance! ❤️

@a-r-r-o-w (Member, Author):

Hey @lea-lena. It is not possible to use ControlNet here because it was not implemented with this pipeline. There is, however, a community pipeline with a usage example here. It uses only a text prompt and a control video, but no input video. It shouldn't be too hard to modify that code to use strength and an input video, like it's done here, to create the initial latents instead of generating them randomly. @DN6, does it make sense to support an optional input video and strength directly in the community pipeline for similar video generation?
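For anyone exploring that modification, here is a rough sketch of how the initial latents could be built from an input video plus strength (prepare_video_latents and its signature are assumptions for illustration, not the community pipeline's actual API; assumes scheduler.set_timesteps(...) has already been called):

import torch

def prepare_video_latents(vae, scheduler, frames, strength, num_inference_steps, generator):
    # frames: preprocessed video tensor of shape (num_frames, 3, H, W) in [-1, 1]
    init_latents = vae.encode(frames).latent_dist.sample(generator) * vae.config.scaling_factor
    init_latents = init_latents.permute(1, 0, 2, 3).unsqueeze(0)  # -> (1, C, num_frames, h, w)

    # Keep only the last `strength` fraction of the denoising schedule.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = scheduler.timesteps[t_start * scheduler.order :]

    # Noise the video latents to the first timestep actually used.
    noise = torch.randn(init_latents.shape, generator=generator, dtype=init_latents.dtype)
    latents = scheduler.add_noise(init_latents, noise.to(init_latents.device), timesteps[:1])
    return latents, timesteps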

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
* begin animatediff img2video and video2video

* revert animatediff to original implementation

* add img2video as pipeline

* update

* add vid2vid pipeline

* update imports

* update

* remove copied from line for check_inputs

* update

* update examples

* add multi-batch support

* fix __init__.py files

* move img2vid to community

* update community readme and examples

* fix

* make fix-copies

* add vid2vid batch params

* apply suggestions from review

Co-Authored-By: Dhruv Nair <[email protected]>

* add test for animatediff vid2vid

* torch.stack -> torch.cat

Co-Authored-By: Dhruv Nair <[email protected]>

* make style

* docs for vid2vid

* update

* fix prepare_latents

* fix docs

* remove img2vid

* update README to :main

* remove slow test

* refactor pipeline output

* update docs

* update docs

* merge community readme from :main

* final fix i promise

* add support for url in animatediff example

* update example

* update callbacks to latest implementation

* Update src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py

Co-authored-by: Patrick von Platen <[email protected]>

* fix merge

* Apply suggestions from code review

* remove callback and callback_steps as suggested in review

* Update tests/pipelines/animatediff/test_animatediff_video2video.py

Co-authored-by: Patrick von Platen <[email protected]>

* fix import error caused due to unet refactor in huggingface#6630

* fix numpy import error after tensor2vid refactor in huggingface#6626

* make fix-copies

* fix numpy error

* fix progress bar test

---------

Co-authored-by: Dhruv Nair <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Labels: video (video generation)
May close: AnimateDiffPipeline: add ImageToVideo and VideoToVideo
8 participants