- (arXiv 2022.12) Scalable Diffusion Models with Transformers, [Paper], [Code]
- (arXiv 2023.03) Masked Diffusion Transformer is a Strong Image Synthesizer, [Paper], [Code]
- (arXiv 2023.04) ViT-DAE: Transformer-driven Diffusion Autoencoder for Histopathology Image Analysis, [Paper]
- (arXiv 2023.06) DFormer: Diffusion-guided Transformer for Universal Image Segmentation, [Paper], [Code]
- (arXiv 2023.08) Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion using Transformers, [Paper]
- (arXiv 2023.09) Large-Vocabulary 3D Diffusion Model with Transformer, [Paper], [Project]
- (arXiv 2023.09) CartoonDiff: Training-free Cartoon Image Generation with Diffusion Transformer Models, [Paper], [Project]
- (arXiv 2023.12) DiffiT: Diffusion Vision Transformers for Image Generation, [Paper]
- (arXiv 2023.12) DiT-Head: High-Resolution Talking Head Synthesis using Diffusion Transformers, [Paper]
- (arXiv 2024.01) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers, [Paper], [Code]
- (arXiv 2024.01) Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.02) Cross-view Masked Diffusion Transformers for Person Image Synthesis, [Paper]
- (arXiv 2024.02) FiT: Flexible Vision Transformer for Diffusion Model, [Paper], [Code]
- (arXiv 2024.03) Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts, [Paper], [Code]
- (arXiv 2024.03) SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer, [Paper]
- (arXiv 2024.04) WcDT: World-centric Diffusion Transformer for Traffic Scene Generation, [Paper]
- (arXiv 2024.04) Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers, [Paper]
- (arXiv 2024.04) Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers, [Paper]
- (arXiv 2024.04) Lazy Diffusion Transformer for Interactive Image Editing, [Paper], [Project]
- (arXiv 2024.05) U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.05) Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer, [Paper], [Code]
- (arXiv 2024.05) Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.05) DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation, [Paper]
- (arXiv 2024.05) Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding, [Paper], [Code]
- (arXiv 2024.05) Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer, [Paper], [Project]
- (arXiv 2024.05) PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models, [Paper], [Code]
- (arXiv 2024.05) Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer, [Paper], [Code]
- (arXiv 2024.05) PTQ4DiT: Post-training Quantization for Diffusion Transformers, [Paper]
- (arXiv 2024.05) VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.05) DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention, [Paper], [Code]
- (arXiv 2024.06) Δ-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers, [Paper]
- (arXiv 2024.06) Dimba: Transformer-Mamba Diffusion Models, [Paper], [Code]
- (arXiv 2024.06) AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation, [Paper]
- (arXiv 2024.06) DiTFastAttn: Attention Compression for Diffusion Transformer Models, [Paper], [Code]
- (arXiv 2024.06) Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT, [Paper], [Code]
- (arXiv 2024.07) FORA: Fast-Forward Caching in Diffusion Transformer Acceleration, [Paper], [Code]
- (arXiv 2024.07) VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control, [Paper]
- (arXiv 2024.07) Scaling Diffusion Transformers to 16 Billion Parameters, [Paper], [Code]
- (arXiv 2024.07) DriveDiTFit: Fine-tuning Diffusion Transformers for Autonomous Driving, [Paper], [Code]
- (arXiv 2024.07) Diffusion Feedback Helps CLIP See Better, [Paper], [Code]
- (arXiv 2024.08) Tora: Trajectory-oriented Diffusion Transformer for Video Generation, [Paper], [Code]
- (arXiv 2024.08) Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing, [Paper]
- (arXiv 2024.08) DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose, [Paper]
- (arXiv 2024.08) MegActor-Σ: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer, [Paper]
- (arXiv 2024.09) Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task, [Paper], [Code]
- (arXiv 2024.09) DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing, [Paper]
- (arXiv 2024.09) Token Caching for Diffusion Transformer Acceleration, [Paper]
- (arXiv 2024.10) ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer, [Paper], [Code]
- (arXiv 2024.10) HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration, [Paper]
- (arXiv 2024.10) EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing, [Paper]
- (arXiv 2024.10) MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation, [Paper]
- (arXiv 2024.10) Dynamic Diffusion Transformer, [Paper], [Code]
- (arXiv 2024.10) SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers, [Paper]
- (arXiv 2024.10) Boosting Camera Motion Control for Video Diffusion Transformers, [Paper]
- (arXiv 2024.10) The Ingredients for Robotic Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.10) FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification, [Paper]
- (arXiv 2024.10) Precipitation Nowcasting Using Diffusion Transformer with Causal Attention, [Paper]
- (arXiv 2024.10) Group Diffusion Transformers are Unsupervised Multitask Learners, [Paper]
- (arXiv 2024.10) Diffusion Transformer Policy, [Paper]
- (arXiv 2024.10) On Inductive Biases That Enable Generalization of Diffusion Transformers, [Paper]
- (arXiv 2024.10) GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation, [Paper], [Code]
- (arXiv 2024.10) EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching, [Paper], [Code]
- (arXiv 2024.10) In-Context LoRA for Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.11) Learning Where to Edit Vision Transformers, [Paper], [Code]
- (arXiv 2024.11) Adaptive Caching for Faster Video Generation with Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.11) DanceFusion: A Spatio-Temporal Skeleton Diffusion Transformer for Audio-Driven Dance Motion Reconstruction, [Paper], [Code]
- (arXiv 2024.11) DiT4Edit: Diffusion Transformer for Image Editing, [Paper], [Code]
- (arXiv 2024.11) Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing, [Paper]
- (arXiv 2024.11) LaVin-DiT: Large Vision Diffusion Transformer, [Paper]
- (arXiv 2024.11) FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on, [Paper], [Code]
- (arXiv 2024.11) Unveiling Redundancy in Diffusion Transformers (DiTs): A Systematic Study, [Paper], [Code]
- (arXiv 2024.11) Accelerating Vision Diffusion Transformers with Skip Branches, [Paper], [Code]
- (arXiv 2024.11) Towards Precise Scaling Laws for Video Diffusion Transformers, [Paper]
- (arXiv 2024.11) On Statistical Rates of Conditional Diffusion Transformers: Approximation, Estimation and Minimax Optimality, [Paper]
- (arXiv 2024.11) LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis, [Paper], [Code]
- (arXiv 2024.12) TinyFusion: Diffusion Transformers Learned Shallow, [Paper], [Code]
- (arXiv 2024.12) CPA: Camera-pose-awareness Diffusion Transformer for Video Generation, [Paper]
- (arXiv 2024.12) Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks, [Paper], [Code]
- (arXiv 2024.12) OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows, [Paper], [Code]
- (arXiv 2024.12) ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer, [Paper]
- (arXiv 2024.12) UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics, [Paper], [Code]
- (arXiv 2024.12) Video Motion Transfer with Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.12) MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation, [Paper]
- (arXiv 2024.12) FlexDiT: Dynamic Token Density Control for Diffusion Transformer, [Paper], [Code]
- (arXiv 2024.12) Causal Diffusion Transformers for Generative Modeling, [Paper], [Code]
- (arXiv 2024.12) AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration, [Paper]
- (arXiv 2024.12) ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.12) StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer, [Paper]
- (arXiv 2024.12) Video Diffusion Transformers are In-Context Learners, [Paper], [Code]
- (arXiv 2024.12) Efficient Scaling of Diffusion Transformers for Text-to-Image Generation, [Paper]
- (arXiv 2024.12) CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up, [Paper]
- (arXiv 2024.12) Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers, [Paper], [Code]
- (arXiv 2024.12) Two-in-One: Unified Multi-Person Interactive Motion Generation by Latent Diffusion Transformer, [Paper]
- (arXiv 2024.12) DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation, [Paper], [Code]
- (arXiv 2024.12) Accelerating Diffusion Transformers with Dual Feature Caching, [Paper], [Code]
- (arXiv 2025.01) SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration, [Paper], [Code]
- (arXiv 2025.01) Ingredients: Blending Custom Photos with Video Diffusion Transformers, [Paper], [Code]
- (arXiv 2025.01) GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking, [Paper], [Code]
- (arXiv 2025.01) Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers, [Paper], [Code]
- (arXiv 2025.01) MC-VTON: Minimal Control Virtual Try-On Diffusion Transformer, [Paper]
- (arXiv 2025.01) ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning, [Paper]
- (arXiv 2025.01) 3DIS-FLUX: Simple and Efficient Multi-Instance Generation with DiT Rendering, [Paper], [Code]