Target: Measure the scalability of FLUX.1 on the NVIDIA Hopper architecture (both H100 and H200) using different model parallelism strategies (see the Flux.1 Performance Overview).

Models (order of relevance):

- `black-forest-labs/FLUX.1-dev`

Benchmarking Python utility script
Build the `xfuser` command expression in Python instead of plain Bash scripts. This approach can then be used in conjunction with experiment-config libraries like OmegaConf (as in, e.g., the facebookresearch/lingua LLM pre-training library); a sketch of that pairing follows the script below. The implementation also configures logging for Torch compile (Inductor, Dynamo), which is valuable for inspecting graph breaks, tracing, and compilation errors:
```python
import logging
import os
import shlex
import subprocess
import sys
from pathlib import Path

import torch

# Enable with: DEBUG_TORCH_COMPILE=1 python <this script>
DEBUG_TORCH_COMPILE = os.getenv("DEBUG_TORCH_COMPILE", "0") == "1"

# ref: https://github.com/pytorch/pytorch/blob/c81d4fd0a8ca826c165fe15f83398dbc4e20b523/docs/source/torch.compiler_troubleshooting.rst#L28
if DEBUG_TORCH_COMPILE:
    torch._logging.set_logs(
        dynamo=logging.INFO,
        graph_code=True,
        graph_breaks=True,
        guards=True,
        recompiles=True,
    )
    # torch._dynamo.explain() offers an alternative summary (see sketch below).

WD: Path = Path(__file__).parent.absolute()
os.environ["PYTHONPATH"] = f"{WD}:{os.getenv('PYTHONPATH', '')}"
SCRIPT: str = os.path.join(WD, "flux_example.py")

MODEL_ID = "black-forest-labs/FLUX.1-dev"
INFERENCE_STEP = 28
WARMUP_STEPS = 3

max_sequence_length = 512
height = 1024
width = 1024
TASK_ARGS = f"--max-sequence-length {max_sequence_length} --height {height} --width {width}"

N_GPUS = 2
pipefusion_parallel_degree = 2
ulysses_degree = 1
ring_degree = 1
PARALLEL_ARGS = (
    f"--pipefusion_parallel_degree {pipefusion_parallel_degree} "
    f"--ulysses_degree {ulysses_degree} "
    f"--ring_degree {ring_degree}"
)

COMPILE_FLAG = "--use_torch_compile"

# torchrun from the active environment (unused below; `python -m
# torch.distributed.run` is equivalent and avoids PATH ambiguity).
conda_binaries = Path(sys.executable).parent
torchrun_bin_path = os.path.join(conda_binaries, "torchrun")

command: str = (
    f"{sys.executable} -m torch.distributed.run --nproc_per_node={N_GPUS} {SCRIPT} "
    f"--model {MODEL_ID} "
    f"{PARALLEL_ARGS} "
    f"{TASK_ARGS} "
    f"--num_inference_steps {INFERENCE_STEP} "
    f"--warmup_steps {WARMUP_STEPS} "
    f'--prompt "brown dog laying on the ground with a metal bowl in front of him." '
    f"{COMPILE_FLAG}"
)

print(command)
print(shlex.split(command))
subprocess.run(shlex.split(command))
```
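As mentioned above, this command construction pairs naturally with a config library. Below is a minimal sketch, assuming OmegaConf with illustrative field names (not part of the benchmark script), of merging defaults with command-line overrides:

```python
# Minimal OmegaConf sketch: defaults merged with dotlist CLI overrides,
# e.g. `python bench.py ulysses_degree=2 pipefusion_parallel_degree=1`.
# Field names here are illustrative assumptions.
from omegaconf import OmegaConf

defaults = OmegaConf.create(
    {
        "n_gpus": 2,
        "pipefusion_parallel_degree": 2,
        "ulysses_degree": 1,
        "ring_degree": 1,
        "num_inference_steps": 28,
        "warmup_steps": 3,
    }
)
cfg = OmegaConf.merge(defaults, OmegaConf.from_cli())

PARALLEL_ARGS = (
    f"--pipefusion_parallel_degree {cfg.pipefusion_parallel_degree} "
    f"--ulysses_degree {cfg.ulysses_degree} "
    f"--ring_degree {cfg.ring_degree}"
)
print(PARALLEL_ARGS)
```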
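For deeper graph-break inspection, the commented-out `torch._dynamo.explain()` can be applied to a callable directly. A toy sketch (the function here is a placeholder, not the FLUX pipeline):

```python
# Toy sketch of torch._dynamo.explain(): `toy_fn` is a placeholder. The
# returned ExplainOutput summarizes graph count, graph breaks, and reasons.
import torch

def toy_fn(x: torch.Tensor) -> torch.Tensor:
    if x.sum() > 0:  # data-dependent branch: a classic source of graph breaks
        return torch.relu(x)
    return x * 2

explanation = torch._dynamo.explain(toy_fn)(torch.randn(8))
print(explanation)
```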
Minor change in `examples/flux_example.py`: set the `xFuserFluxPipeline` `max_sequence_length` from `input_config` instead of hard-coding the value. This modification resolves Torch compilation errors when setting `max_sequence_length=512` for FLUX.1-dev.
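A hypothetical sketch of the change (the actual call site and the previous hard-coded value in `examples/flux_example.py` may differ):

```diff
 output = pipe(
     prompt=input_config.prompt,
     height=input_config.height,
     width=input_config.width,
     num_inference_steps=input_config.num_inference_steps,
-    max_sequence_length=256,
+    max_sequence_length=input_config.max_sequence_length,
 )
```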
cc @feifeibear regarding configurations for parallel layouts.
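On parallel layouts, a small sketch that enumerates candidate flag combinations for a node, assuming the product of the degrees must match the world size (check the xDiT docs for the exact constraint, which also involves cfg/tensor parallel degrees):

```python
# Hedged sketch: enumerate candidate xfuser parallel layouts, assuming the
# degrees must multiply to the number of GPUs.
from itertools import product

N_GPUS = 8  # e.g., a single H100/H200 node

layouts = sorted(
    {
        (p, u, r)
        for p, u, r in product((1, 2, 4, 8), repeat=3)
        if p * u * r == N_GPUS
    }
)
for p, u, r in layouts:
    print(
        f"--pipefusion_parallel_degree {p} "
        f"--ulysses_degree {u} "
        f"--ring_degree {r}"
    )
```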