Training LoRA models on NVIDIA GTX 1660 6GB Fails with "NaN detected in latents" Error #1947
Replies: 7 comments 3 replies
-
Getting this from release 23.0.11; I was not getting it in version 23.0.9. I got other errors that I was verifying were fixed, and I can report that the error when trying to save_state and the Kohya error for the missing "pytorch_model.bin" are gone.
-
Could you please explain where I can add this solution? I am rather new to Kohya and got it to work before, but now I keep getting this error and I have no idea where in Kohya to add this code.
-
While training Dreambooth I got this error too, with version v24.1.4. I have an NVIDIA GeForce RTX 3090.
-
I had the same problem.
-
You're going to heaven!
-
For anyone coming here with this problem: my VAE was also the issue. I didn't need to change the model, but changing the VAE worked wonders :)
-
Training LoRA models on NVIDIA GTX 1660 6GB Fails with "NaN detected in latents" Error
Introduction
I encountered an issue while training LoRA models on my NVIDIA GeForce GTX 1660 6GB card. The training script terminated unexpectedly, reporting a "NaN detected in latents" error. This issue seems to prevent successful model training using this specific GPU setup.
Environment Details
GPU: NVIDIA GeForce GTX 1660 6GB
Setup: Kohya_ss GUI LoRA Portable on Windows, with requirements.txt modified to use gradio==3.44.0.
Steps to Reproduce
Install the dependencies from the modified requirements.txt and start LoRA training with the command shown in the traceback below.
Expected Behavior
The training process should run without encountering NaN errors in latents, allowing for successful model training.
Actual Behavior
The training process fails early with a RuntimeError indicating "NaN detected in latents", pointing to a specific image file. This suggests an issue in how certain data types or values are handled while the VAE caches latents.
[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 2991.44it/s]
prepare dataset
preparing accelerator
loading model for process 0/1
load StableDiffusion checkpoint: C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Stable-diffusion\Reliberate_v3.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
Enable xformers for U-Net
A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|████████████████████████████████████████████████████████████████████| 30/30 [00:00<?, ?it/s]
caching latents...
  0%|                                                                            | 0/30 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\train_network.py", line 1033, in <module>
    trainer.train(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\train_network.py", line 267, in train
    train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 1927, in cache_latents
    dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 952, in cache_latents
    cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 2272, in cache_batch_latents
    raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
RuntimeError: NaN detected in latents: C:\TrainDataDir\100_Name\aehrthgaerthg.png
Traceback (most recent call last):
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Kohya_ss-GUI-LoRA-Portable-main\venv\Scripts\python.exe', './train_network.py', '--pretrained_model_name_or_path=C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Stable-diffusion\Reliberate_v3.safetensors', '--train_data_dir=C:\TrainDataDir', '--resolution=512,512', '--output_dir=C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Lora', '--network_alpha=128', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-05', '--unet_lr=0.0001', '--network_dim=128', '--output_name=Name', '--lr_scheduler_num_cycles=1', '--learning_rate=8e-05', '--lr_scheduler=constant_with_warmup', '--lr_warmup_steps=150', '--train_batch_size=2', '--max_train_steps=1500', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_grad_norm=1', '--max_data_loader_n_workers=1', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale', '--noise_offset=0.0', '--wandb_api_key=False']' returned non-zero exit status 1.
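For context, the error is raised while the VAE encodes the training images into latents (presumably in half precision here, given --mixed_precision=fp16). The sketch below is a standalone diagnostic of my own, not part of the kohya scripts, for checking whether a single image reproduces NaN latents; the VAE repo ID is an assumption, and any SD 1.5-compatible VAE should work.

```python
# Hypothetical diagnostic (not part of the kohya scripts): check whether one image
# yields NaN latents when encoded by an SD 1.5 VAE in fp16. The image path is the
# file named in the traceback above; the VAE repo ID is an assumption.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

image_path = r"C:\TrainDataDir\100_Name\aehrthgaerthg.png"

vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
).to("cuda")

img = Image.open(image_path).convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0  # scale pixels to [-1, 1]
x = x.permute(2, 0, 1).unsqueeze(0).to("cuda", dtype=torch.float16)  # HWC -> NCHW

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()

print("NaN in latents:", torch.isnan(latents).any().item())
```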
Solution/Workaround
After researching similar issues (kohya-ss/sd-scripts#293) and experimenting with potential fixes, I found that enabling torch.backends.cudnn.benchmark resolved the problem. This setting likely lets cuDNN select the best convolution algorithm for the environment, avoiding the conditions that lead to NaN values. Here's the specific code modification made in train_network.py:
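The snippet below is a minimal sketch of that change, assuming it is placed near the top of train_network.py so it runs before the VAE starts caching latents; the exact placement is an assumption on my part.

```python
# train_network.py (sketch; exact placement is an assumption)
import torch

# Let cuDNN benchmark the available convolution algorithms and pick the
# fastest one for the fixed input sizes used during latent caching/training.
torch.backends.cudnn.benchmark = True
```

cudnn.benchmark adds a small warm-up cost on the first batch while cuDNN autotunes, and it works best when input sizes stay constant, as they do with a fixed 512x512 training resolution.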
Suggestion for Permanent Fix
It appears that enabling torch.backends.cudnn.benchmark can prevent the NaN error during training on specific hardware setups, such as the NVIDIA GTX 1660 6GB. It would be beneficial for the training script to automatically detect when this setting is necessary, or at least to document this workaround for users with similar hardware configurations.
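As one hypothetical shape for that automatic detection, a helper like the one below could be called once during setup; the function name and the "GTX 16" substring check are my assumptions, not an exhaustive list of affected cards.

```python
import torch


def maybe_enable_cudnn_benchmark() -> None:
    """Hypothetical helper: enable cudnn.benchmark on GPUs reported to produce NaN latents."""
    if not torch.cuda.is_available():
        return
    device_name = torch.cuda.get_device_name(0)
    # GTX 16xx (Turing without tensor cores) is the hardware reported in this thread.
    if "GTX 16" in device_name:
        torch.backends.cudnn.benchmark = True
```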
Conclusion
Incorporating the torch.backends.cudnn.benchmark = True setting into the training script resolved the "NaN detected in latents" error on an NVIDIA GTX 1660 6GB card, allowing LoRA models to train successfully. This workaround might help others experiencing similar issues, and a permanent fix or documentation update could further improve the user experience.