Support for Multimodal Diffusion Transformers (e.g. StableDiffusion3) #7232
Comments
The modeling code needs to be out first :)
Before they release the code, I'm doing an unofficial paper-referenced implementation here: NUS-HPC-AI-Lab/VideoSys#92 (based on OpenDiT, and also an MMDiT-ized version of Latte)
Super cool! Cc: @patil-suraj
This is amazing work!! I was thinking of starting something together (in a fork), since between the time Stability releases the weights and Diffusers is ready it might be a couple of days (unless they contribute the implementation themselves, which is a big if). I will give it a complete read, but just skimming it was impressive enough @kabachuha!
Very interested to see this in Diffusers as soon as possible. It would be nice to see rectified flow in a Diffusers-compatible training script as well, perhaps as an option or modification to the existing text-to-image training code here.
@parlance-zz Rectified Flow has already been implemented in Diffusers with #6057. The newer Piecewise Rectified Flow, which is claimed to be faster, may be interesting to implement too: #7255
When I read the SD3 paper I thought there was more to it. I've since implemented it myself, but I didn't bother creating a new Diffusers scheduler because I wanted fully continuous timesteps. Rectified flows also don't need a noise schedule per se, as the forward process is literally just a lerp from sample to noise, and the reverse process is accurately integrated with simple Euler steps. There should probably be a rectified flow scheduler added to Diffusers at some point, though.
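For reference, a minimal sketch of that reverse process in PyTorch, assuming a hypothetical velocity-predicting `model` (this is not scheduler code from any library): sampling just integrates the learned velocity from t=1 (noise) to t=0 (data) with plain Euler steps.

```python
import torch

@torch.no_grad()
def rectified_flow_sample(model, noise: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
    # Continuous timesteps from t=1 (pure noise) down to t=0 (data).
    t = torch.linspace(1.0, 0.0, num_steps + 1, device=noise.device)
    x = noise
    for i in range(num_steps):
        dt = t[i] - t[i + 1]  # positive step size toward t=0
        # The model predicts the constant velocity of the straight path,
        # v = sample - noise, so stepping toward the data adds v * dt.
        v = model(x, t[i].expand(x.shape[0]))
        x = x + v * dt
    return x
```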
Thanks for sharing! Have you open-sourced your code? Would love to take a look and learn from it.
Yes, my project is called DualDiffusion (nothing to do with the DualTransformer2DModel in diffusers; I created the project ~8 months ago). I aim to generate music, initially with an unconditional model, using the complete library of SNES music as a dataset. I've trained my own VAE and diffusion models with the code you see in the project. The input to the VAE is mel-scale spectrograms, but I have customized FGLA phase-reconstruction code for improved audio quality. The relevant parts as far as rectified flow goes are:
- Training: timestep sampling from a logit-normal distribution, and a lerp to get the model input; the target for the output is sample - noise (see the sketch below).
- Sampling: timestep creation for the reverse process, and the integration / reverse-process step (as in the Euler sketch above).

As you can see, it really is that simple.
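A hedged sketch of that training step, assuming the same hypothetical velocity-predicting `model` (this is not the DualDiffusion source):

```python
import torch
import torch.nn.functional as F

def rectified_flow_training_step(model, sample: torch.Tensor) -> torch.Tensor:
    b = sample.shape[0]
    # Logit-normal timestep sampling: the sigmoid of a standard normal draw.
    t = torch.sigmoid(torch.randn(b, device=sample.device))
    noise = torch.randn_like(sample)
    # Broadcast t over the non-batch dimensions.
    t_ = t.view(b, *([1] * (sample.dim() - 1)))
    # The forward process is a plain lerp: t=0 is the sample, t=1 is noise.
    x_t = (1 - t_) * sample + t_ * noise
    # The target is the straight-path velocity, sample - noise.
    v_pred = model(x_t, t)
    return F.mse_loss(v_pred, sample - noise)
```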
Thanks again for sharing. Starting to feel like we should open a Discussion thread to collate all these valuable resources so that everyone can benefit :) Would you be open to that? Also cc: @patil-suraj here.
Of course, not doubting for a moment :-)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
SD3 is announced to be released on the 12th of June, so the official implementation will be a better reference.
It was just released: https://huggingface.co/stabilityai/stable-diffusion-3-medium
Coming up in some hours ;)
Model/Pipeline/Scheduler description
Yesterday StabilityAI published the details of MMDiT, their architecture for the upcoming StableDiffusion3.
https://stability.ai/news/stable-diffusion-3-research-paper
Their approach differs quite a bit from traditional diffusion transformers (like PixArt-alpha) in that it processes the text and image encodings in parallel transformer streams and applies joint attention to them in the middle (kind of like ControlNet-Transformer in PixArt-alpha, but with joint attention). The other structural differences are projecting pooled text embeddings onto the timestep conditioning and using an ensemble of text encoders (two CLIP models and T5), but those are details. Rectified-flow training would also be nice to have in diffusers some day. A rough sketch of the joint-attention block is below.
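To make the joint-attention idea concrete, here is a rough sketch as I read it from the paper; the class name, projection names, and shapes are my own assumptions, not SD3 code. Each modality keeps its own QKV and output projections, but attention runs over the concatenated token sequence so every token attends to both modalities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    """Two-stream attention over concatenated text and image tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        # Separate projections per modality, as in the two-stream design.
        self.qkv_img = nn.Linear(dim, dim * 3)
        self.qkv_txt = nn.Linear(dim, dim * 3)
        self.proj_img = nn.Linear(dim, dim)
        self.proj_txt = nn.Linear(dim, dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        b, n_img, d = img.shape
        n_txt, h = txt.shape[1], self.num_heads

        def heads(qkv: torch.Tensor, n: int):
            # (b, n, 3*d) -> three tensors of shape (b, h, n, d // h)
            return qkv.view(b, n, 3, h, d // h).permute(2, 0, 3, 1, 4).unbind(0)

        q_i, k_i, v_i = heads(self.qkv_img(img), n_img)
        q_t, k_t, v_t = heads(self.qkv_txt(txt), n_txt)
        # Joint attention: concatenate the streams along the sequence axis.
        out = F.scaled_dot_product_attention(
            torch.cat([q_t, q_i], dim=2),
            torch.cat([k_t, k_i], dim=2),
            torch.cat([v_t, v_i], dim=2),
        )
        out = out.transpose(1, 2).reshape(b, n_txt + n_img, d)
        txt_out, img_out = out.split([n_txt, n_img], dim=1)
        return self.proj_img(img_out), self.proj_txt(txt_out)
```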
While their code for StableDiffusion3 is not available yet, I believe this MMDiT architecture is already valuable to researchers, even in adjacent domains, and it would be nice to have it in Diffusers, the sooner the better.
Open source status
Provide useful links for the implementation
The link to the paper
https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf