FLUX Hopper benchmarking #324

Open
antferdom opened this issue Oct 29, 2024 · 2 comments

antferdom commented Oct 29, 2024

Target: measure the scalability of FLUX.1 on the NVIDIA Hopper architecture (both H100 and H200) using different model parallelism strategies (see Flux.1 Performance Overview; a layout sketch follows the list below):

  • Pipefusion
  • SP-Ulysses
  • SP-Ring
  • Hybrid
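
A minimal sketch for enumerating candidate hybrid layouts, assuming the usual xfuser constraint that the parallel degrees multiply to the world size (the benchmarking script below follows the same convention; candidate_layouts is a hypothetical helper, not part of xfuser):

from itertools import product

N_GPUS = 8

def candidate_layouts(n_gpus: int) -> list[dict]:
    # Only divisors of n_gpus can appear as a parallel degree.
    degrees = [d for d in range(1, n_gpus + 1) if n_gpus % d == 0]
    layouts = []
    for pp, ulysses, ring in product(degrees, repeat=3):
        # Assumption: pipefusion * ulysses * ring must equal the GPU count.
        if pp * ulysses * ring == n_gpus:
            layouts.append(
                {"pipefusion_parallel_degree": pp,
                 "ulysses_degree": ulysses,
                 "ring_degree": ring}
            )
    return layouts

for layout in candidate_layouts(N_GPUS):
    print(layout)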

Models (order of relevance):

  1. FLUX.1-dev
  2. FLUX.1-schnell

Benchmarking Python utility script

Build the xfuser command expression in Python instead of plain Bash scripts. We can then combine this approach with experiment config libraries like OmegaConf (as used, e.g., in the facebookresearch/lingua LLM pre-training library); see the sketch after the script. The following implementation also includes logging configuration for Torch compile (Inductor, Dynamo), which is valuable for inspecting graph breaks, tracing, and compilation errors:

import logging
import os
import shlex
import subprocess
import sys
from pathlib import Path

import torch

DEBUG_TORCH_COMPILE = os.getenv("DEBUG_TORCH_COMPILE", "0") == "1"

# ref: https://github.com/pytorch/pytorch/blob/c81d4fd0a8ca826c165fe15f83398dbc4e20b523/docs/source/torch.compiler_troubleshooting.rst#L28
if DEBUG_TORCH_COMPILE:
    # Surface Dynamo tracing, graph breaks, guard creation, and recompilations.
    torch._logging.set_logs(
        dynamo=logging.INFO,
        graph_code=True,
        graph_breaks=True,
        guards=True,
        recompiles=True,
    )
    # torch._dynamo.explain() can be used for a one-shot graph-break report.

WD: Path = Path(__file__).parent.absolute()
os.environ["PYTHONPATH"] = f"{WD}:{os.getenv('PYTHONPATH', '')}"
SCRIPT: str = str(WD / "flux_example.py")

MODEL_ID = "black-forest-labs/FLUX.1-dev"
INFERENCE_STEP = 28
WARMUP_STEPS = 3
max_sequence_length = 512
height = 1024
width = 1024
TASK_ARGS = f"--max-sequence-length {max_sequence_length} --height {height} --width {width}"
N_GPUS = 2
# The product of the parallel degrees must match N_GPUS.
pipefusion_parallel_degree = 2
ulysses_degree = 1
ring_degree = 1
PARALLEL_ARGS = (
    f"--pipefusion_parallel_degree {pipefusion_parallel_degree} "
    f"--ulysses_degree {ulysses_degree} "
    f"--ring_degree {ring_degree} "
)
COMPILE_FLAG = "--use_torch_compile"

# `python -m torch.distributed.run` is equivalent to invoking the torchrun
# binary that lives next to sys.executable.
command: str = (
    f"{sys.executable} -m torch.distributed.run --nproc_per_node={N_GPUS} {SCRIPT} "
    f"--model {MODEL_ID} "
    f"{PARALLEL_ARGS} "
    f"{TASK_ARGS} "
    f"--num_inference_steps {INFERENCE_STEP} "
    f"--warmup_steps {WARMUP_STEPS} "
    f"--prompt \"brown dog laying on the ground with a metal bowl in front of him.\" "
    f"{COMPILE_FLAG}"
)
print(command)
print(shlex.split(command))
subprocess.run(shlex.split(command), check=True)
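
With debug logging enabled, run e.g. DEBUG_TORCH_COMPILE=1 python benchmark_flux.py (benchmark_flux.py being a placeholder name for the script above).

As a sketch of the OmegaConf direction mentioned earlier (the BenchConfig schema, field names, and flux_bench.yaml file are assumptions for illustration, not existing xfuser APIs):

from dataclasses import dataclass

from omegaconf import OmegaConf

@dataclass
class BenchConfig:
    model_id: str = "black-forest-labs/FLUX.1-dev"
    n_gpus: int = 2
    num_inference_steps: int = 28
    warmup_steps: int = 3
    max_sequence_length: int = 512
    height: int = 1024
    width: int = 1024
    pipefusion_parallel_degree: int = 2
    ulysses_degree: int = 1
    ring_degree: int = 1
    use_torch_compile: bool = True

# Merge structured defaults with a YAML file and CLI dotlist overrides,
# e.g. `python bench.py ulysses_degree=2 pipefusion_parallel_degree=1`.
cfg = OmegaConf.merge(
    OmegaConf.structured(BenchConfig),
    OmegaConf.load("flux_bench.yaml"),
    OmegaConf.from_cli(),
)
print(OmegaConf.to_yaml(cfg))

The merged cfg can then feed the command-builder above instead of the module-level constants, so each benchmark run is fully described by one YAML file plus overrides.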

Minor change in examples/flux_example.py: set the xFuserFluxPipeline max_sequence_length from input_config instead of a manually hard-coded value. This modification solves Torch compilation errors when setting max_sequence_length=512 for FLUX.1-dev.

    output = pipe(
        height=input_config.height,
        width=input_config.width,
        prompt=input_config.prompt,
        num_inference_steps=input_config.num_inference_steps,
        output_type=input_config.output_type,
        max_sequence_length=input_config.max_sequence_length,
        guidance_scale=0.0,
        generator=torch.Generator(device="cuda").manual_seed(input_config.seed),
    )

@feifeibear: any recommended configurations for the parallel layouts?

feifeibear (Collaborator) commented

Perfect! Could you please give some results on your H100 benchmark?

antferdom (Author) commented

Currently having issues with the Flash Attention v3 Hopper interface when combined with SP-Ulysses and SP-Ring. I will forward this to #319.
