We conducted performance evaluations using the open-source version of the stable-diffusion-3-medium-diffusers 2B model.
On an 8xA100 (NVLink) machine, the optimal parallelization strategy varied with the number of GPUs, highlighting the importance of supporting diverse and hybrid parallel approaches. The best strategies at each GPU scale were:

- 2 GPUs: cfg_parallel=2
- 4 GPUs: cfg_parallel=2, pipefusion_parallel=2
- 8 GPUs: cfg_parallel=2, pipefusion_parallel=4
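A useful invariant behind these configurations is that the parallel degrees multiply to the GPU count: cfg_parallel=2 splits the two classifier-free-guidance branches across GPUs, and pipefusion_parallel partitions the remaining GPUs into pipeline stages. A minimal sketch of encoding this constraint (the helper and table below are illustrative, not a library API):

```python
# Illustrative helper (not a library API): pick the hybrid parallel
# degrees for a given GPU count, mirroring the configurations above.
BEST_STRATEGY = {
    2: {"cfg_parallel": 2},
    4: {"cfg_parallel": 2, "pipefusion_parallel": 2},
    8: {"cfg_parallel": 2, "pipefusion_parallel": 4},
}

def select_strategy(num_gpus: int) -> dict:
    """Return parallel degrees whose product equals num_gpus.

    Every GPU occupies exactly one slot in the hybrid parallel grid,
    so the degrees of all dimensions must multiply to the GPU count.
    """
    strategy = BEST_STRATEGY[num_gpus]
    product = 1
    for degree in strategy.values():
        product *= degree
    assert product == num_gpus, "parallel degrees must multiply to the GPU count"
    return strategy

print(select_strategy(8))  # {'cfg_parallel': 2, 'pipefusion_parallel': 4}
```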
torch.compile provided acceleration in all scenarios except the 8-GPU configuration.
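For context, here is a minimal single-GPU sketch of applying torch.compile to the model's transformer backbone, assuming the Hugging Face diffusers StableDiffusion3Pipeline (the dtype, compile mode, and step count are illustrative choices, not the exact benchmark setup):

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Load the open-source SD3 medium checkpoint in half precision.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
).to("cuda")

# Compile the transformer (the compute-heavy denoising network).
# The first call triggers compilation; later calls reuse the compiled graph.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")

image = pipe(
    "a photo of an astronaut riding a horse",
    num_inference_steps=28,
).images[0]
image.save("sd3_output.png")
```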
Latency results on an 8xL40 (PCIe) machine are shown in the graph below. Here as well, the optimal parallel strategy varied with GPU scale, and torch.compile delivered acceleration in all cases.