Support for Multimodal Diffusion Transformers (e.g. StableDiffusion3) #7232
Comments
The modeling code needs to be out first :)
Before they release the code, I'm doing an unofficial paper-referenced implementation here: NUS-HPC-AI-Lab/VideoSys#92 (based on OpenDiT, and also an MMDiT-ized version of Latte)
Super cool! Cc: @patil-suraj
This is amazing work!! I was thinking of starting something together (in a fork), since between the time Stability releases the weights and Diffusers is ready it might be a couple of days (unless they contribute the implementation themselves, which is a big if). I will give it a complete read, but just skimming it was impressive enough @kabachuha!
Very interested to see this in Diffusers as soon as possible. It would be nice to see rectified flow in a Diffusers-compatible training script as well, perhaps as an option or modification to the existing text-to-image training code here.
@parlance-zz Rectified Flow has already been implemented in Diffusers with #6057. The newer Piecewise Rectified Flow, which is claimed to be faster, may be interesting to implement too: #7255
When I read the SD3 paper I thought there was more to it. I've since implemented it myself, but I didn't bother creating a new Diffusers scheduler because I wanted fully continuous timesteps. Rectified flows also don't need a noise schedule per se, as the forward process is literally just a lerp from sample to noise, and the reverse process is accurately integrated with simple Euler steps. There should probably be a rectified flow scheduler added to Diffusers at some point, though.
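For reference, a minimal sketch of that reverse process in PyTorch, assuming a hypothetical velocity-predicting `model` (this is not scheduler code from any library): sampling just integrates the learned velocity from t=1 (noise) to t=0 (data) with plain Euler steps.

```python
import torch

@torch.no_grad()
def rectified_flow_sample(model, noise: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
    # Continuous timesteps from t=1 (pure noise) down to t=0 (data).
    t = torch.linspace(1.0, 0.0, num_steps + 1, device=noise.device)
    x = noise
    for i in range(num_steps):
        dt = t[i] - t[i + 1]  # positive step size toward t=0
        # The model predicts the constant velocity of the straight path,
        # v = sample - noise, so stepping toward the data adds v * dt.
        v = model(x, t[i].expand(x.shape[0]))
        x = x + v * dt
    return x
```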
Thanks for sharing! Have you open-sourced your code? Would love to take a look and learn from it.
Yes, my project is called DualDiffusion (nothing to do with the DualTransformer2DModel in diffusers; I created the project ~8 months ago). I aim to generate music, initially with an unconditional model, using the complete library of SNES music as a dataset. I've trained my own VAE and diffusion models with the code you see in the project. The input to the VAE is mel-scale spectrograms, but I have customized FGLA phase-reconstruction code for improved audio quality. The relevant parts as far as rectified flow goes are:
- Training: timestep sampling from a logit-normal distribution, and a lerp to get the model input; the target for the output is sample - noise (see the sketch below).
- Sampling: timestep creation for the reverse process, and the integration / reverse-process step (as in the Euler sketch above).

As you can see, it really is that simple.
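A hedged sketch of that training step, assuming the same hypothetical velocity-predicting `model` (this is not the DualDiffusion source):

```python
import torch
import torch.nn.functional as F

def rectified_flow_training_step(model, sample: torch.Tensor) -> torch.Tensor:
    b = sample.shape[0]
    # Logit-normal timestep sampling: the sigmoid of a standard normal draw.
    t = torch.sigmoid(torch.randn(b, device=sample.device))
    noise = torch.randn_like(sample)
    # Broadcast t over the non-batch dimensions.
    t_ = t.view(b, *([1] * (sample.dim() - 1)))
    # The forward process is a plain lerp: t=0 is the sample, t=1 is noise.
    x_t = (1 - t_) * sample + t_ * noise
    # The target is the straight-path velocity, sample - noise.
    v_pred = model(x_t, t)
    return F.mse_loss(v_pred, sample - noise)
```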
Thanks again for sharing. Starting to feel like we should open a Discussion thread to collate all these valuable resources so that everyone can benefit :) Would you be open to that? Also cc: @patil-suraj here.
Of course, not doubting for a moment :-)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
SD3 is announced to be released on the 12th of June, so the official implementation will be a better reference.
It was just released: https://huggingface.co/stabilityai/stable-diffusion-3-medium
Coming up in some hours ;)
Model/Pipeline/Scheduler description
Yesterday StabilityAI published the details of MMDiT, their architecture for the upcoming StableDiffusion3.
https://stability.ai/news/stable-diffusion-3-research-paper
Their approach differs quite a bit from traditional diffusion transformers (like PixArt-alpha) in that it processes the text and image encodings in parallel transformer streams and applies joint attention to them in the middle (kind of like ControlNet-Transformer in PixArt-alpha, but with joint attention). The other structural differences are projecting pooled text embeddings onto the timestep conditioning and using an ensemble of text encoders (two CLIP models and T5), but those are details. Rectified-flow training would also be nice to have in diffusers some day. A rough sketch of the joint-attention block is below.
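To make the joint-attention idea concrete, here is a rough sketch as I read it from the paper; the class name, projection names, and shapes are my own assumptions, not SD3 code. Each modality keeps its own QKV and output projections, but attention runs over the concatenated token sequence so every token attends to both modalities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    """Two-stream attention over concatenated text and image tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        # Separate projections per modality, as in the two-stream design.
        self.qkv_img = nn.Linear(dim, dim * 3)
        self.qkv_txt = nn.Linear(dim, dim * 3)
        self.proj_img = nn.Linear(dim, dim)
        self.proj_txt = nn.Linear(dim, dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        b, n_img, d = img.shape
        n_txt, h = txt.shape[1], self.num_heads

        def heads(qkv: torch.Tensor, n: int):
            # (b, n, 3*d) -> three tensors of shape (b, h, n, d // h)
            return qkv.view(b, n, 3, h, d // h).permute(2, 0, 3, 1, 4).unbind(0)

        q_i, k_i, v_i = heads(self.qkv_img(img), n_img)
        q_t, k_t, v_t = heads(self.qkv_txt(txt), n_txt)
        # Joint attention: concatenate the streams along the sequence axis.
        out = F.scaled_dot_product_attention(
            torch.cat([q_t, q_i], dim=2),
            torch.cat([k_t, k_i], dim=2),
            torch.cat([v_t, v_i], dim=2),
        )
        out = out.transpose(1, 2).reshape(b, n_txt + n_img, d)
        txt_out, img_out = out.split([n_txt, n_img], dim=1)
        return self.proj_img(img_out), self.proj_txt(txt_out)
```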
While their code for StableDiffusion3 is not available yet, I believe this MMDiT architecture is already valuable to researchers, even in adjacent domains, and it would be nice to have it in Diffusers, the sooner the better.
Open source status
Provide useful links for the implementation
The link to the paper
https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf