AnimateDiff Video to Video #6328
Conversation
Would be great to have an ImageToVideo and VideoToVideo version of AnimateDiff, as suggested by Jonathan in #6123. @jon-chuang I need some help and your suggestions here. From what I was able to understand from different implementations, there are a few ideas that have been used for the initial latent in img2video - repeating the image latent, for instance.
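For illustration, the "repeat the image latent" idea could look roughly like the sketch below. This is only an illustrative sketch, not code from the PR; the function name, the noise mixing, and the noise_strength value are all assumptions.

import torch

# Illustrative sketch (not from this PR): tile a VAE-encoded image latent across the
# temporal axis and mix in noise so the motion module has something to animate.
def repeat_image_latent(image_latents: torch.Tensor, num_frames: int, noise_strength: float = 0.6) -> torch.Tensor:
    # image_latents: (batch, channels, height, width) -> video latents: (batch, channels, frames, height, width)
    video_latents = image_latents.unsqueeze(2).repeat(1, 1, num_frames, 1, 1)
    noise = torch.randn_like(video_latents)
    return (1 - noise_strength) * video_latents + noise_strength * noise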
@sayakpaul @patrickvonplaten @DN6 Would you be open to adding support for this to AnimateDiff-related pipelines once we get it working? Also, I've added all the relevant code to the current pipeline instead of creating a separate class, since that would lead to quite a lot of duplication for something that shares so much common code. Let me know if this is not ideal and we should have different pipelines for AnimateDiffImgToVidPipeline and AnimateDiffVidToVidPipeline.
Seems to be working well for lerp and slerp after lowering the impact of the image on the initial latents by scaling it down. Maybe this scaling factor could be exposed as a parameter.
@a-r-r-o-w These can be separate pipelines. See Diffusers Philosophy for reference.
Nice job figuring out a clean way to do Img2Vid/Vid2Vid btw 👍🏽
I observed something similar. I had no reasonable explanation for it.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Sharing some results from both img2video and video2video pipelines. Updated usage and code can be found in this Colab notebook.

Image to Video

Video To Video

Really like how VideoToVideo worked out! But I'm not very satisfied with the quality of ImageToVideo and there's a lot of room for improvement. Would be great if someone from the community could suggest improvements. Currently, img2vid would fail if you provide a blank prompt. Ideally, I think, even with a blank prompt, img2vid should be able to animate the given image to some extent.
Seems like the input image strength can be adjusted for ImageToVideo. It's subjective, but perhaps stronger would be better... 🤔
Yeah... Currently, the
Actually, I'm not sure if I've implemented the prepare_latents() function correctly for img2vid. We have the following code:

...
init_latents = init_latents.to(dtype)
init_latents = self.vae.config.scaling_factor * init_latents

latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
latents = latents * self.scheduler.init_noise_sigma

if latent_interpolation_method == "lerp":
    def latent_cls(v0, v1, index):
        return lerp(v0, v1, index / num_frames * (1 - strength))
elif latent_interpolation_method == "slerp":
    def latent_cls(v0, v1, index):
        return slerp(v0, v1, index / num_frames * (1 - strength))
else:
    latent_cls = latent_interpolation_method

for i in range(num_frames):
    latents[:, :, i, :, :] = latent_cls(latents[:, :, i, :, :], init_latents, i)

In the case of lerp, we are essentially doing:

latents[i] = lerp(noise[i], init_latents, i / num_frames * (1 - strength))

This means that the interpolation weight grows with the frame index: the first frame is left as (almost) pure noise, while later frames are pulled progressively closer to the image latent.
Shouldn't this be the reverse, since we want the initial condition to be the input image and the model should be free to fill in the future frames? 🤔

Edit: By fixing the logic based on the above comment, I'm getting terrible results again. I still don't think what exists currently is correct, but it seems to be working to an extent.
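For reference, the lerp and slerp helpers used in the snippet above are not shown; a minimal sketch of how such helpers are commonly implemented for torch tensors (the exact signatures in the PR may differ) could be:

import torch

def lerp(v0: torch.Tensor, v1: torch.Tensor, t: float) -> torch.Tensor:
    # Plain linear interpolation between two latent tensors.
    return (1 - t) * v0 + t * v1

def slerp(v0: torch.Tensor, v1: torch.Tensor, t: float, dot_threshold: float = 0.9995) -> torch.Tensor:
    # Spherical linear interpolation: move along the arc between v0 and v1.
    dot = torch.sum(v0.flatten() * v1.flatten()) / (v0.flatten().norm() * v1.flatten().norm())
    if torch.abs(dot) > dot_threshold:
        # Nearly parallel vectors: fall back to lerp for numerical stability.
        return lerp(v0, v1, t)
    theta = torch.acos(dot)
    return (torch.sin((1 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)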
Anyway, just IMO, I think the results you showed are good enough for an initial merge to get this available to the community (e.g. we have a use case that would benefit from this). Further improvements can be made over time, but to get this merged you'll have to refactor your code to fit the diffusers codebase style.
Yep, sorry about the delay. I've been incredibly busy, but I'll make it completely ready for a merge this weekend for sure. @DN6 @patrickvonplaten @sayakpaul I've put it in as a core pipeline here, but let me know if you'd like me to move it into community. I really think vid2vid would be great for core, and img2vid could gradually be worked on and improved. What do you think?
Here's some minimal code to test the pipelines:

Image To Video

import torch
from diffusers import AnimateDiffImg2VideoPipeline
from diffusers.models.unet_motion_model import MotionAdapter
from diffusers.schedulers import DDIMScheduler
from diffusers.utils import export_to_gif
from PIL import Image

# Load the motion adapter and the image-to-video pipeline
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffImg2VideoPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
scheduler = DDIMScheduler.from_pretrained(
    model_id, beta_schedule="linear", subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler
# pipe.enable_vae_slicing()
pipe = pipe.to("cuda")

img = Image.open("0062.png")
output = pipe(
    image=img,
    prompt="A snail moving on the ground",
    negative_prompt="bad quality, worse quality",
    height=512,
    width=512,
    num_frames=16,
    guidance_scale=10,
    num_inference_steps=20,
    strength=0.8,
    generator=torch.Generator("cpu").manual_seed(42),
    latent_interpolation_method="slerp",
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

Video To Video

import imageio
import torch
from diffusers import AnimateDiffVideo2VideoPipeline
from diffusers.models.unet_motion_model import MotionAdapter
from diffusers.schedulers import DDIMScheduler
from diffusers.utils import export_to_gif
from PIL import Image
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffVideo2VideoPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
scheduler = DDIMScheduler.from_pretrained(
model_id, beta_schedule="linear", subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler
# pipe.enable_vae_slicing()
pipe = pipe.to("cuda")
def load_video(file_path):
    # Read a GIF/video file and return its frames as a list of PIL images
    images = []
    vid = imageio.get_reader(file_path)
    for frame in vid:
        images.append(Image.fromarray(frame))
    return images
video = load_video("animation_fireworks.gif")
output = pipe(
prompt="closeup of a pretty woman, harley quinn, margot robbie, fireworks in the background, realistic",
negative_prompt="low quality",
video=video,
height=512,
width=512,
guidance_scale=7,
num_inference_steps=20,
strength=0.7,
generator=torch.Generator().manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

Also, here's the updated Colab notebook.
@a-r-r-o-w I think we might be able to add vid2vid to the core pipelines since it's essentially similar to img2img. Could you verify whether the styling remains consistent over multiple frame batches? e.g. if you run vid2vid over 64 frames (4 batches of 16), do you observe abrupt changes across frames? I don't think it's a blocker to merge, but it would be good to know. Since img2vid relies on some "magic" to make it work, it might be better suited to community pipelines for the moment. We might find that SparseCtrl is better suited to img2vid tasks.
Sure, that makes sense. I'll move img2vid into community pipelines and hopefully someone can find a better way to do it or, as you said, just use SparseCtrl. As for the num_videos_per_prompt=1 restriction, I did it the same way AnimateDiff does: it only allows a single video generation and has it hardcoded at the moment. I'll get back after testing 64 frames shortly. I'm assuming you meant 4 same/different videos combined with same/different edit prompts for generation, because breaking a single 64-frame video into four 16-frame parts and processing them separately will definitely lead to inconsistency across time, since there is no AnimateDiff sliding-window support yet (which I can maybe take up soon).
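As a rough illustration of the sliding-window idea mentioned above (purely a hypothetical sketch, not part of this PR or of diffusers), overlapping frame windows could be scheduled like this:

# Hypothetical sketch: split a long frame sequence into overlapping windows so adjacent
# chunks share context and transitions between them stay smooth.
def sliding_windows(num_frames: int, window_size: int = 16, stride: int = 8):
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + window_size, num_frames)
        windows.append((max(0, end - window_size), end))
        if end == num_frames:
            break
        start += stride
    return windows

print(sliding_windows(64))  # [(0, 16), (8, 24), (16, 32), (24, 40), (32, 48), (40, 56), (48, 64)]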
@patrickvonplaten @DN6 Thanks! I believe I've made all the requested changes. There was a merge conflict with animatediff after the FreeInit merge and I'm hoping I resolved it correctly, but please do review. Do let me know if other changes are required.
Only one test failure to fix before we can merge:
tests/pipelines/animatediff/test_animatediff_video2video.py::AnimateDiffVideoToVideoPipelineFastTests::test_progress_bar - AssertionError: False is not true : Progress bar should be enabled and stopped at the max step
I think all tests are fixed now. The previous failure was due to progress_bar not updating, since it was handled inside the deprecated callback logic that we removed. LGTM before something else breaks 🥲
Well done! 👍🏽
Thanks for your time and the merge ❤️ Also, thanks @jon-chuang for proposing this addition and for your thoughts! I think we're very close to supporting most AnimateDiff features (provided in the ComfyUI/A1111 extensions) once we have SDXL and SparseCtrl merged, along with long-context sliding-window support. Regarding SDXL, I've been a little busy with work/exams and haven't been able to give much time to the PR - I will be more free soon and will complete it.
@DN6 @sayakpaul @patrickvonplaten @jon-chuang @a-r-r-o-w Hi!!! Could you help me with an example of how to use the Video to Video code with ControlNet? I could not find anything about it in the documentation: https://huggingface.co/docs/diffusers/en/api/pipelines/animatediff
Hey @lea-lena. It is not possible to use ControlNet here because it was not implemented with this pipeline. There is, however, a community pipeline with a usage example here. It uses only a text prompt and a control video, but no input video. It shouldn't be too hard to modify that code to use strength and an input video, as done here, to create the initial latents instead of generating them randomly. Does it make sense to support an optional input video and strength directly in the community pipeline for similar video generation, @DN6?
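Roughly, that modification might look like the sketch below. This is a hedged outline, not the actual community pipeline code; it assumes the input frames are already preprocessed tensors, and the helper name and timestep handling are illustrative only.

import torch

# Hypothetical sketch: build initial latents from an input video instead of pure noise,
# so a controlnet pipeline could honor `video` and `strength` the way vid2vid does.
def prepare_video_latents(frames, vae, scheduler, strength, num_inference_steps, generator, device, dtype):
    # Encode each preprocessed frame tensor (B, C, H, W) and stack along time -> (B, C, F, H, W)
    latents = [vae.encode(frame.to(device, dtype)).latent_dist.sample(generator) for frame in frames]
    latents = torch.stack(latents, dim=2) * vae.config.scaling_factor

    # Start denoising partway through the schedule, img2img-style: larger strength = more noise.
    scheduler.set_timesteps(num_inference_steps, device=device)
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = scheduler.timesteps[t_start:]

    noise = torch.randn(latents.shape, generator=generator, device=device, dtype=dtype)
    latents = scheduler.add_noise(latents, noise, timesteps[:1])
    return latents, timesteps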
* begin animatediff img2video and video2video
* revert animatediff to original implementation
* add img2video as pipeline
* update
* add vid2vid pipeline
* update imports
* update
* remove copied from line for check_inputs
* update
* update examples
* add multi-batch support
* fix __init__.py files
* move img2vid to community
* update community readme and examples
* fix
* make fix-copies
* add vid2vid batch params
* apply suggestions from review Co-Authored-By: Dhruv Nair <[email protected]>
* add test for animatediff vid2vid
* torch.stack -> torch.cat Co-Authored-By: Dhruv Nair <[email protected]>
* make style
* docs for vid2vid
* update
* fix prepare_latents
* fix docs
* remove img2vid
* update README to :main
* remove slow test
* refactor pipeline output
* update docs
* update docs
* merge community readme from :main
* final fix i promise
* add support for url in animatediff example
* update example
* update callbacks to latest implementation
* Update src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py Co-authored-by: Patrick von Platen <[email protected]>
* Update src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py Co-authored-by: Patrick von Platen <[email protected]>
* fix merge
* Apply suggestions from code review
* remove callback and callback_steps as suggested in review
* Update tests/pipelines/animatediff/test_animatediff_video2video.py Co-authored-by: Patrick von Platen <[email protected]>
* fix import error caused due to unet refactor in huggingface#6630
* fix numpy import error after tensor2vid refactor in huggingface#6626
* make fix-copies
* fix numpy error
* fix progress bar test

Co-authored-by: Dhruv Nair <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
What does this PR do?
Attempts to add img2video and video2video support to AnimateDiff. Fixes #6123.

Colab
Edit: img2vid has been moved to community after reviews below. Please check #6509.
Who can review?
@DN6 @sayakpaul @patrickvonplaten @jon-chuang