Replies: 1 comment
-
In the above config, --max_data_loader_n_workers="48" was set to check whether there was a data-loading bottleneck; I tried other values as well with no change. When training an SD1.5 LoRA I can easily use 99% of the CUDA time.
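A quick way to confirm or rule out a data-loading bottleneck is to time how long each step spends waiting on the loader versus how long the training step itself takes. This is a generic sketch, not part of kohya's scripts: the `profile_loader` helper and the simulated slow loader below are hypothetical illustrations.

```python
import time

def profile_loader(loader, train_step, max_steps=100):
    """Split wall time per step into loader-wait vs. compute.

    If load_t dominates comp_t, the loader is the bottleneck and more
    workers (or caching) should help; if comp_t dominates, they won't.
    """
    load_t = comp_t = 0.0
    it = iter(loader)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)      # time spent waiting for data
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)         # time spent in the actual step
        t2 = time.perf_counter()
        load_t += t1 - t0
        comp_t += t2 - t1
    return load_t, comp_t

# Made-up demo: a loader that takes 5 ms per batch feeding a 1 ms step,
# i.e. a clearly loader-bound pipeline.
def slow_batches(n):
    for i in range(n):
        time.sleep(0.005)
        yield i

load_t, comp_t = profile_loader(slow_batches(50), lambda b: time.sleep(0.001))
print(f"loader wait: {load_t:.3f}s, compute: {comp_t:.3f}s")
```

Wrapping the real training loop's `next(dataloader)` and step call the same way would show directly whether the GPU is starved for data or simply not saturated by the workload.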
-
I'm trying to train a LoRA for the base SDXL 1.0 model, but I can't seem to get my CUDA usage above 50%. Is there a reason for this? I have the recommended cuDNN libraries installed, and Kohya is at the latest release from a completely fresh Git pull, configured as normal for Windows, with all training local and GPU-based. I just tried increasing the number of data-loader workers and it made no difference.
The accelerate command I used was:
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="E:/Models/Stable-Diffusion/Checkpoints/SDXL1.0/sd_xl_base_1.0.safetensors" --train_data_dir="E:/Models/Stable-Diffusion/Training/Lora/Blocks-XL\img" --resolution="1024,1024" --output_dir="E:/Models/Stable-Diffusion/Training/Lora/Blocks-XL\model" --logging_dir="E:/Models/Stable-Diffusion/Training/Lora/Blocks-XL\log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="Blocks-XL" --lr_scheduler_num_cycles="10" --no_half_vae --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="17800" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --seed="12345" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="48" --bucket_reso_steps=64 --save_state --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0357 --sample_sampler=euler_a --sample_prompts="E:/Models/Stable-Diffusion/Training/Lora/Blocks-XL\model\sample\prompt.txt" --sample_every_n_steps="500"
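One detail worth noting about the command above: with --cache_latents --cache_latents_to_disk, the data-loader workers have very little to do per step (they mostly read precomputed latents back from disk), so raising --max_data_loader_n_workers to 48 would not be expected to change anything. A toy simulation of that effect, where all timings and helper names are made-up illustrations rather than kohya internals:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_cached(i):
    time.sleep(0.0005)   # reading a cached latent: cheap
    return i

def fetch_uncached(i):
    time.sleep(0.01)     # loading + encoding an image: expensive
    return i

def run(fetch, workers, n=32):
    """Total time to fetch n items with a given worker count."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(fetch, range(n)))
    return time.perf_counter() - t0

uncached_1 = run(fetch_uncached, 1)   # slow pipeline, one worker
uncached_8 = run(fetch_uncached, 8)   # more workers help here
cached_1 = run(fetch_cached, 1)       # cached pipeline is already fast
print(f"uncached x1: {uncached_1:.3f}s, uncached x8: {uncached_8:.3f}s, "
      f"cached x1: {cached_1:.3f}s")
```

More workers speed up the expensive path, but the cached path is already so cheap with a single worker that extra workers buy nothing, which would be consistent with worker count making no difference in this run.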
My system is fairly strong; not massive, but it should be able to keep all the CUDA cores busy.
Intel Core i9-13900K CPU
128GB DDR5 RAM
NVIDIA RTX 4090
4x 2TB Gen 5 NVMe SSDs