Merge branch 'sd3' into new_cache
kohya-ss committed Nov 27, 2024
2 parents 3677094 + 2a61fc0 commit 665c04e
Showing 2 changed files with 5 additions and 4 deletions.
README.md (8 changes: 4 additions & 4 deletions)
@@ -68,11 +68,11 @@ When training LoRA for Text Encoder (without `--network_train_unet_only`), more

__Options for GPUs with less VRAM:__

- By specifying `--block_to_swap`, you can save VRAM by swapping some blocks between CPU and GPU. See [FLUX.1 fine-tuning](#flux1-fine-tuning) for details.
+ By specifying `--blocks_to_swap`, you can save VRAM by swapping some blocks between CPU and GPU. See [FLUX.1 fine-tuning](#flux1-fine-tuning) for details.

- Specify a number like `--block_to_swap 10`. A larger number will swap more blocks, saving more VRAM, but training will be slower. In FLUX.1, you can swap up to 35 blocks.
+ Specify a number like `--blocks_to_swap 10`. A larger number will swap more blocks, saving more VRAM, but training will be slower. In FLUX.1, you can swap up to 35 blocks.

- `--cpu_offload_checkpointing` offloads gradient checkpointing to CPU. This reduces up to 1GB of VRAM usage but slows down the training by about 15%. Cannot be used with `--block_to_swap`.
+ `--cpu_offload_checkpointing` offloads gradient checkpointing to CPU. This reduces up to 1GB of VRAM usage but slows down the training by about 15%. Cannot be used with `--blocks_to_swap`.

Adafactor optimizer may reduce VRAM usage more than 8bit AdamW. Please use settings like below:

@@ -82,7 +82,7 @@ Adafactor optimizer may reduce the VRAM usage than 8bit AdamW. Please use settin

The training can be done with 16GB VRAM GPUs with a batch size of 1. Please change your dataset configuration.

- The training can be done with 12GB VRAM GPUs with `--block_to_swap 16` with 8bit AdamW. Please use settings like below:
+ The training can be done with 12GB VRAM GPUs with `--blocks_to_swap 16` with 8bit AdamW. Please use settings like below:

```
--blocks_to_swap 16
```
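
For illustration only (not part of this commit): the diff collapses the rest of the settings block, so below is a minimal sketch of how `--blocks_to_swap 16` might be combined with other sd-scripts options for the 12GB case described above. Only `--blocks_to_swap 16` comes from the diff; `--optimizer_type AdamW8bit` and `--gradient_checkpointing` are assumed companions, and `--cpu_offload_checkpointing` is omitted because the README states it cannot be used with `--blocks_to_swap`.

```
--blocks_to_swap 16
--optimizer_type AdamW8bit
--gradient_checkpointing
```
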
flux_train_network.py (1 change: 1 addition & 0 deletions)
@@ -450,6 +450,7 @@ def call_dit(img, img_ids, t5_out, txt_ids, l_pooled, timesteps, guidance_vec, t

    if len(diff_output_pr_indices) > 0:
        network.set_multiplier(0.0)
+       unet.prepare_block_swap_before_forward()
        with torch.no_grad():
            model_pred_prior = call_dit(
                img=packed_noisy_model_input[diff_output_pr_indices],
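
The added `unet.prepare_block_swap_before_forward()` call runs just before a second, no-grad forward pass, so the swapped blocks are in a known CPU/GPU placement before the model is invoked again. Below is a hypothetical, self-contained sketch of the block-swapping idea only; the class, method, and sizes are invented for illustration and do not reproduce the actual sd-scripts implementation.

```
# Hypothetical illustration of block swapping; NOT the real
# unet.prepare_block_swap_before_forward() from sd-scripts.
import torch
import torch.nn as nn


class TinyBlockSwapModel(nn.Module):
    """Toy model that keeps its last `blocks_to_swap` blocks on CPU and moves
    each one onto the compute device only while it is executing."""

    def __init__(self, num_blocks: int = 8, blocks_to_swap: int = 4, dim: int = 64):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_blocks)])
        self.blocks_to_swap = blocks_to_swap

    def prepare_block_swap_before_forward(self, device: torch.device) -> None:
        # Park the swappable tail blocks on CPU (and the rest on `device`) so the
        # next forward pass starts from a known placement with minimal VRAM held.
        swap_start = len(self.blocks) - self.blocks_to_swap
        for i, block in enumerate(self.blocks):
            block.to(torch.device("cpu") if i >= swap_start else device)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        device = x.device
        swap_start = len(self.blocks) - self.blocks_to_swap
        for i, block in enumerate(self.blocks):
            if i >= swap_start:
                block.to(device)   # bring the block in just-in-time
            x = block(x)
            if i >= swap_start:
                block.to("cpu")    # evict it again to free VRAM
        return x


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = TinyBlockSwapModel()
    model.prepare_block_swap_before_forward(device)
    with torch.no_grad():
        out = model(torch.randn(2, 64, device=device))
    print(out.shape)  # torch.Size([2, 64])
```

The trade-off matches the README note above: each just-in-time transfer frees VRAM at the cost of extra CPU-GPU copies, so swapping more blocks saves more memory but slows training.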
