Skip to content

v0.28.0: Marigold, PixArt Sigma, AnimateDiff SDXL, InstantStyle, VQGAN Training Script, and more

Compare
Choose a tag to compare
@sayakpaul sayakpaul released this 27 May 12:08
· 746 commits to main since this release

Diffusion models are known for their abilities in the space of generative modeling. This release of diffusers introduces the first official pipeline (Marigold) for discriminative tasks such as depth estimation and surface normals’ estimation!

Starting this release, we will also highlight the changes and features from the library that make it easy to integrate community checkpoints, features, and so on. Read on!

Marigold

Proposed in Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation, Marigold introduces a diffusion model and associated fine-tuning protocol for monocular depth estimation. It can also be extended to perform surface normals’ estimation.

marigold

(Image taken from the official repository)

The code snippet below shows how to use this pipeline for depth estimation:

import diffusers
import torch

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image)

vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("einstein_depth.png")

depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
depth_16bit[0].save("einstein_depth_16bit.png")

Check out the API documentation here. We also have a detailed guide about the pipeline here.

Thanks to @toshas, one of the authors of Marigold, who contributed this in #7847.

🌀 Massive Refactor of from_single_file 🌀

We have further refactored from_single_file to align its logic more closely to the from_pretrained method. The biggest benefit of doing this is that it allows us to expand single file loading support beyond Stable Diffusion-like pipelines and models. It also makes it easier to load models that are saved and shared in their original format.

Some of the changes introduced in this refactor:

  1. When loading a single file checkpoint, we will attempt to use the keys present in the checkpoint to infer a model repository on the Hugging Face Hub that we can use to configure the pipeline. For example, if you are using a single file checkpoint based on SD 1.5, we would use the configuration files in the runwayml/stable-diffusion-v1-5 repository to configure the model components and pipeline.
  2. Suppose this inferred configuration isn’t appropriate for your checkpoint. In that case, you can override it using the config argument and pass in either a path to a local model repo or a repo id on the Hugging Face Hub.
pipe = StableDiffusionPipeline.from_single_file("...", config=<model repo id or local repo path>) 
  1. Deprecation of model configuration arguments for the from_single_file method in Pipelines such as num_in_channels, scheduler_type , image_size and upcast_attention . This is an anti-pattern that we have supported in previous versions of the library when we assumed that it would only be relevant to Stable Diffusion based models. However, given that there is a demand to support other model types, we feel it is necessary for single-file loading behavior to adhere to the conventions set in our other loading methods. Configuring individual model components through a pipeline loading method is not something we support in from_pretrained, and therefore, we will be deprecating support for this behavior in from_single_file as well.

PixArt Sigma

PixArt Simga is the successor to PixArt Alpha. PixArt Sigma is capable of directly generating images at 4K resolution. It can also produce images of markedly higher fidelity and improved alignment with text prompts. It comes with a massive sequence length of 300 (for reference, PixArt Alpha has a maximum sequence length of 120)!


(Taken from the project website.)

import torch
from diffusers import PixArtSigmaPipeline

# You can replace the checkpoint id with "PixArt-alpha/PixArt-Sigma-XL-2-512-MS" too.
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
)
# Enable memory optimizations.
pipe.enable_model_cpu_offload()

prompt = "A small cactus with a happy face in the Sahara desert."
image = pipe(prompt).images[0]

📃 Refer to the documentation here to learn more about PixArt Sigma.

Thanks to @lawrence-cj, one of the authors of PixArt Sigma, who contributed this in #7857.

AnimateDiff SDXL

@a-r-r-o-w contributed the Stable Diffusion XL (SDXL) version of AnimateDiff in #6721. However, note that this is currently an experimental feature, as only a beta release of the motion adapter checkpoint is available.

import torch
from diffusers.models import MotionAdapter
from diffusers import AnimateDiffSDXLPipeline, DDIMScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-sdxl-beta", torch_dtype=torch.float16)

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    beta_schedule="linear",
    steps_offset=1,
)
pipe = AnimateDiffSDXLPipeline.from_pretrained(
    model_id,
    motion_adapter=adapter,
    scheduler=scheduler,
    torch_dtype=torch.float16,
    variant="fp16",
).enable_model_cpu_offload()

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()

output = pipe(
    prompt="a panda surfing in the ocean, realistic, high quality",
    negative_prompt="low quality, worst quality",
    num_inference_steps=20,
    guidance_scale=8,
    width=1024,
    height=1024,
    num_frames=16,
)

frames = output.frames[0]
export_to_gif(frames, "animation.gif")

📜 Refer to the documentation to learn more.

Block-wise LoRA

@UmerHA contributed the support to control the scales of different LoRA blocks in a granular manner in #7352. Depending on the LoRA checkpoint one is using, this granular control can significantly impact the quality of the generated outputs. Following code block shows how this feature can be used while performing inference:

...

adapter_weight_scales = { "unet": { "down": 0, "mid": 1, "up": 0} }
pipe.set_adapters("pixel", adapter_weight_scales)
image = pipe(
		prompt, num_inference_steps=30, generator=torch.manual_seed(0)
).images[0]

✍️ Refer to our documentation for more details and a full-fledged example.

InstantStyle

More granular control of scale could be extended to IP-Adapters too. @DannHuang contributed to the support of InstantStyle, aka granular control of IP-Adapter scales, in #7668. The following code block shows how this feature could be used when performing inference with IP-Adapters:

...

scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

This way, one can generate images following only the style or layout from the image prompt, with significantly improved diversity. This is achieved by only activating IP-Adapters to specific parts of the model.

Check out the documentation here.

ControlNetXS

ControlNet-XS was introduced in ControlNet-XS by Denis Zavadski and Carsten Rother. Based on the observation, the control model in the original ControlNet can be made much smaller and still produce good results. ControlNet-XS generates images comparable to a regular ControlNet, but it is 20-25% faster (see benchmark with StableDiffusion-XL) and uses ~45% less memory.

ControlNet-XS is supported for both Stable Diffusion and Stable Diffusion XL.

Thanks to @UmerHA for contributing ControlNet-XS in #5827 and #6772.

Custom Timesteps

We introduced custom timesteps support for some of our pipelines and schedulers. You can now set your scheduler with a list of arbitrary timesteps. For example, you can use the AYS timesteps schedule to achieve very nice results with only 10 denoising steps.

from diffusers.schedulers import AysSchedules
sampling_schedule = AysSchedules["StableDiffusionXLTimesteps"]
pipe = StableDiffusionXLPipeline.from_pretrained(
    "SG161222/RealVisXL_V4.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, algorithm_type="sde-dpmsolver++")
prompt = "A cinematic shot of a cute little rabbit wearing a jacket and doing a thumbs up"
image = pipe(prompt=prompt, timesteps=sampling_schedule).images[0]

Check out the documentation here

device_map in Pipelines 🧪

We have introduced experimental support for device_map in our pipelines. This feature becomes relevant when you have multiple accelerators to distribute the components of a pipeline. Currently, we support only “balanced” device_map. However, we plan to support other device mapping strategies relevant to diffusion models in the future.

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", 
    torch_dtype=torch.float16, 
    device_map="balanced"
)
image = pipeline("a dog").images[0]

In cases where you might be limited to low VRAM accelerators, you can use device_map to benefit from them. Below, we simulate a situation where we have access to two GPUs, each having only a GB of VRAM (through the max_memory argument).

from diffusers import DiffusionPipeline
import torch

max_memory = {0:"1GB", 1:"1GB"}
pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16, 
    use_safetensors=True, 
    device_map="balanced",
		max_memory=max_memory
)
image = pipeline("a dog").images[0]

📜 Refer to the documentation to learn more about it.

VQGAN Training Script 📈

VQGAN, proposed in Taming Transformers for High-Resolution Image Synthesis, is a crucial component in the modern generative image modeling toolbox. Once it is trained, its encoder can be leveraged to compute general-purpose tokens from input images.

Thanks to @isamu-isozaki, who contributed a script and related utilities to train VQGANs in #5483. For details, refer to the official training directory.

VideoProcessor Class

Similar to the VaeImageProcessor class, we have introduced a VideoProcessor to help make the preprocessing and postprocessing of videos easier and a little more streamlined across the pipelines. Refer to the documentation to learn more.

New Guides 📑

Starting with this release, we provide guides and tutorials to help users get started with some of the most frequently used tasks in image and video generation. For this release, we have a series of three guides about outpainting with different techniques:

Official Callbacks

We introduced official callbacks that you can conveniently plug into your pipeline. For example, to turn off classifier-free guidance after denoising steps with SDXLCFGCutoffCallback.

import torch
from diffusers import DiffusionPipeline
from diffusers.callbacks import SDXLCFGCutoffCallback

callback = SDXLCFGCutoffCallback(cutoff_step_ratio=0.4)
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
prompt = "a sports car at the road, best quality, high quality, high detail, 8k resolution"
out = pipeline(
    prompt=prompt,
    num_inference_steps=25,
    callback_on_step_end=callback,
)

Read more on our documentation 📜

Community Pipelines and from_pipe API

Starting with this release note, we will highlight the new community pipelines! More and more of our pipelines were added as community pipelines first and graduated as official pipelines once people started to use them a lot! We do not require community pipelines to follow diffusers’ coding style, so it is the easiest way to contribute to diffusers 😊 

We also introduced a from_pipe API that’s very useful for the community pipelines that share checkpoints with our official pipelines and improve generation quality in some way:) You can use from_pipe(...) to load many community pipelines without additional memory requirements. With this API, you can easily switch between different pipelines to apply different techniques.

Read more about from_pipe API in our documentation 📃.

Here are four new community pipelines since our last release.

BoxDiff

BoxDiff lets you use bounding box coordinates for a more controlled generation. Here is an example of how you can apply this technique on a stable diffusion pipeline you had created (i.e. pipe_sd in the below example)

pipe_box = DiffusionPipeline.from_pipe(
    pipe_sd,
    custom_pipeline="pipeline_stable_diffusion_boxdiff",
)
pipe_box.enable_model_cpu_offload()
phrases = ["aurora","reindeer","meadow","lake","mountain"]
boxes = [[1,3,512,202], [75,344,421,495], [1,327,508,507], [2,217,507,341], [1,135,509,242]]
boxes = [[x / 512 for x in box] for box in boxes]

generator = torch.Generator(device="cpu").manual_seed(42)
images = pipe_box(
    prompt,
    boxdiff_phrases=phrases,
    boxdiff_boxes=boxes,
    boxdiff_kwargs={
        "attention_res": 16,
        "normalize_eot": True
    },
    num_inference_steps=50,
    generator=generator,
).images

Check out this community pipeline here

HD-Painter

HD-Painter can enhance inpainting pipelines with improved prompt faithfulness and generate higher resolution (up to 2k). You can switch from BoxDiff to HD-Painter like this

pipe = DiffusionPipeline.from_pipe(
    pipe_box,
    custom_pipeline="hd_painter"
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

prompt = "wooden boat"
init_image = load_image("https://raw.githubusercontent.com/Picsart-AI-Research/HD-Painter/main/__assets__/samples/images/2.jpg")
mask_image = load_image("https://raw.githubusercontent.com/Picsart-AI-Research/HD-Painter/main/__assets__/samples/masks/2.png")

image = pipe (prompt, init_image, mask_image, use_rasg = True, use_painta = True, generator=torch.manual_seed(12345)).images[0]

Check out this community pipeline here

Differential Diffusion

Differential Diffusion enables customization of the amount of change per pixel or per image region. It’s very effective in inpainting and outpainting.

pipeline = DiffusionPipeline.from_pipe(
    pipe_sdxl,
    custom_pipeline="pipeline_stable_diffusion_xl_differential_img2img",
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, use_karras_sigmas=True)

prompt = "a green pear"
negative_prompt = "blurry"

image = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=7.5,
    num_inference_steps=25,
    original_image=image,
    image=image,
    strength=1.0,
    map=mask,
).images[0]

Check out this community pipeline here.

FRESCO

FRESCO aka FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation enables zero-shot video-to-video translation. Learn more about it from here.

All Commits

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @StandardAI
    • Fix typos (#7411)
    • [IP-Adapter] Fix IP-Adapter Support and Refactor Callback for StableDiffusionPanoramaPipeline (#7262)
    • [Docs] Fix typos (#7451)
    • Fix Tiling in ConsistencyDecoderVAE (#7290)
    • Fix CPU offload in docstring (#7827)
    • Fix image upcasting (#7858)
    • Remove dead code and fix f-string issue (#7720)
    • Fix several imports (#7712)
    • Expansion proposal of diffusers-cli env (#7403)
    • Fix a grammatical error in the raise messages (#8272)
    • Fix CPU Offloading Usage & Typos (#8230)
  • @a-r-r-o-w
    • [refactor] Fix FreeInit behaviour (#7410)
    • [Pipeline] AnimateDiff SDXL (#6721)
  • @UmerHA
    • Fixed minor error in test_lora_layers_peft.py (#7394)
    • Skip test_lora_fuse_nan on mps (#7481)
    • Implements Blockwise lora (#7352)
    • Quick-Fix for #7352 block-lora (#7523)
    • Skip test_freeu_enabled on MPS (#7570)
    • Fixing implementation of ControlNet-XS (#6772)
  • @bghira
    • diffusers#7426 fix stable diffusion xl inference on MPS when dtypes shift unexpectedly due to pytorch bugs (#7446)
    • apple mps: training support for SDXL (ControlNet, LoRA, Dreambooth, T2I) (#7447)
    • 7529 do not disable autocast for cuda devices (#7530)
    • 7879 - adjust documentation to use naruto dataset, since pokemon is now gated (#7880)
  • @HyoungwonCho
    • Perturbed-Attention Guidance (#7512)
    • Modification on the PAG community pipeline (re) (#7876)
  • @haikmanukyan
    • add HD-Painter pipeline (#7520)
  • @fabiorigano
    • Multi-image masking for single IP Adapter (#7499)
    • Move IP Adapter Face ID to core (#7186)
    • Restore AttnProcessor2_0 in unload_ip_adapter (#7727)
    • [Docs] Update image masking and face id example (#7780)
    • fix AnimateDiff creation with a unet loaded with IP Adapter (#7791)
  • @kabachuha
    • Add (Scheduled) Pseudo-Huber Loss training scripts to research projects (#7527)
  • @lawrence-cj
    • PixArt-Sigma Implementation (#7654)
    • [docs] add doc for PixArtSigmaPipeline (#7857)
  • @vanakema
    • #7535 Update FloatTensor type hints to Tensor (#7883)
  • @zjysteven
    • [Pipeline] Adding BoxDiff to community examples (#7947)
  • @isamu-isozaki
    • Adding VQGAN Training script (#5483)
  • @SingleZombie
    • [Community Pipeline] FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation (#8239)
  • @toshas
    • [Pipeline] Marigold depth and normals estimation (#7847)