
Support for Multimodal Diffusion Transformers (e.g. StableDiffusion3) #7232

Closed
kabachuha opened this issue Mar 6, 2024 · 15 comments
Labels
stale Issues that haven't received updates

Comments

@kabachuha
Contributor

Model/Pipeline/Scheduler description

Yesterday StabilityAI published the details of MMDiT, the architecture behind the upcoming StableDiffusion3.

https://stability.ai/news/stable-diffusion-3-research-paper

Their approach differs considerably from traditional Diffusion Transformers (like PixArt-alpha): it processes the text and image encodings in parallel streams of transformer blocks and applies joint attention to them in the middle (somewhat like ControlNet-Transformer in PixArt-alpha, but with joint attention). The other structural differences are projecting the pooled text embeddings onto the timestep conditioning and using an ensemble of text encoders (two CLIP models and T5), but those are details. Training with rectified flow would also be nice to have in diffusers some day.

While their code for StableDiffusion3 is not available yet, I believe this MMDiT architecture is already valuable to researchers, even in adjacent domains, and it would be nice to have it in Diffusers sooner rather than later.
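For illustration, here is a minimal, unofficial PyTorch sketch of the joint-attention idea described above, based only on the paper. All names and shapes are my assumptions, not released code, and the adaLN-style modulation from the timestep/pooled-text embedding is omitted for brevity:

```python
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """Hypothetical MMDiT-style block: each modality keeps its own norms
    and MLP; only the attention is shared, computed over the
    concatenation of image and text tokens."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1_img = nn.LayerNorm(dim)
        self.norm1_txt = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2_img = nn.LayerNorm(dim)
        self.norm2_txt = nn.LayerNorm(dim)
        self.mlp_img = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_txt = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # Joint attention: normalize each stream, concatenate along the
        # sequence dimension, attend once, then split back per modality.
        h = torch.cat([self.norm1_img(img), self.norm1_txt(txt)], dim=1)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        img_out, txt_out = attn_out.split([img.shape[1], txt.shape[1]], dim=1)
        img, txt = img + img_out, txt + txt_out
        img = img + self.mlp_img(self.norm2_img(img))
        txt = txt + self.mlp_txt(self.norm2_txt(txt))
        return img, txt
```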

Open source status

  • The model implementation is available.
  • The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation

Link to the paper:
https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf

@sayakpaul
Member

The modeling code needs to be out first :)

@kabachuha
Contributor Author

Until they release the code, I'm working on an unofficial paper-referenced implementation here: NUS-HPC-AI-Lab/VideoSys#92 (based on OpenDiT, plus an MMDiT-ized version of Latte).

@sayakpaul
Member

Super cool! Cc: @patil-suraj

@isidentical
Contributor

This is amazing work!! I was thinking of starting something together (in a fork), since between the time Stability releases the weights and Diffusers is ready there might be a gap of a couple of days (unless they contribute the implementation themselves, which is a big if). I'll give it a complete read, but even just skimming it was impressive enough @kabachuha!

@parlance-zz
Contributor

Very interested to see this in Diffusers as soon as possible. It would be nice to see rectified flow in a Diffusers-compatible training script as well, perhaps as an option or modification to the existing text-to-image training code.

@kabachuha
Contributor Author

@parlance-zz Rectified Flow has already been implemented in Diffusers with #6057

The newer version of Piecewise Rectified Flow, which is claimed to be faster, may be interesting to implement too #7255

@parlance-zz
Contributor

@parlance-zz Rectified Flow has already been implemented in Diffusers with #6057

The newer version of Piecewise Rectified Flow, which is claimed to be faster, may be interesting to implement too #7255

When I read the SD3 paper I thought there was more to it.

I've since implemented it myself, but I didn't bother creating a new diffusers scheduler because I wanted fully continuous timesteps. Rectified flows also don't need a noise schedule per se: the forward process is literally just a lerp from sample to noise, and the reverse process is accurately integrated with a simple Euler step.

There should probably be a rectified flow scheduler added to diffusers at some point though.
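For concreteness, here is a minimal sketch of those two processes (my own naming, assuming the model predicts the velocity sample - noise on NCHW batches; a sketch, not anyone's actual implementation):

```python
import torch

def rf_forward(x0, noise, t):
    # Forward process: a plain lerp from sample (t=0) to noise (t=1).
    t = t.view(-1, 1, 1, 1)
    return (1 - t) * x0 + t * noise

@torch.no_grad()
def rf_sample(model, x, num_steps=50):
    # Reverse process: simple Euler integration from t=1 (pure noise)
    # down to t=0 (data). model(x, t) is assumed to predict x0 - noise.
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=x.device)
    for i in range(num_steps):
        dt = ts[i] - ts[i + 1]          # positive step size
        v = model(x, ts[i].expand(x.shape[0]))
        x = x + dt * v                  # step toward the data
    return x
```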

@sayakpaul
Member

Thanks for sharing! Have you open-sourced your code? Would love to take a look and learn from it.

@parlance-zz
Contributor

Thanks for sharing! Have you open-sourced your code? Would love to take a look and learn from it.

Yes, my project is called DualDiffusion (nothing to do with the TransformerDual model in diffusers; I created the project ~8 months ago). I aim to generate music, initially with an unconditional model, using the complete library of SNES music as a dataset. I've trained my own VAE and diffusion models with the code you see in the project. The input to the VAE is mel-scale spectrograms, but I have customized FGLA phase-reconstruction code for improved audio quality.

The relevant lines as far as rectified flow goes are:

For training: timesteps sampled from a logit-normal distribution, a lerp to get the model input, and sample - noise as the output target:
https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/train.py#L891
https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/train.py#L921
https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/train.py#L928

Timestep creation for sampling the reverse process, and the integration / reverse-process step:
https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/dual_diffusion_pipeline.py#L312
https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/dual_diffusion_pipeline.py#L336

As you can see it really is that simple.
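For readers who don't want to dig through the links, a minimal sketch of that training step (my own naming, assuming 4D NCHW batches; not the actual DualDiffusion code):

```python
import torch
import torch.nn.functional as F

def rf_training_step(model, x0):
    # Timesteps from a logit-normal distribution: the sigmoid of a
    # standard normal draw is logit-normally distributed.
    t = torch.sigmoid(torch.randn(x0.shape[0], device=x0.device))
    noise = torch.randn_like(x0)
    # Model input is the lerp between sample and noise at time t.
    x_t = (1 - t.view(-1, 1, 1, 1)) * x0 + t.view(-1, 1, 1, 1) * noise
    # Target is sample - noise, the straight-line velocity back to data.
    return F.mse_loss(model(x_t, t), x0 - noise)
```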

@sayakpaul
Member

Thanks again for sharing. I'm starting to feel like opening up a Discussion thread to collate all these valuable resources so that everyone can benefit :) Would you be open to that? Also cc: @patil-suraj here.

Yes, my project is called DualDiffusion (nothing to do with the TransformerDual model in diffusers, I created the project ~8 months ago)

Of course, not doubting for a moment :-)


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Apr 15, 2024
@kabachuha
Contributor Author

SD3 has been announced for release on the 12th of June, so the official implementation will be a better reference.

@user425846

@sayakpaul
Member

Coming up in some hours ;)
