Training LoRA models on NVIDIA GTX 1660 6GB Fails with "NaN detected in latents" Error #1947
Replies: 7 comments 3 replies
-
Getting this from release 23.0.11; I was not getting it in version 23.0.9. I got other errors that I was verifying were fixed, and I can report that the error when trying to save_state and the Kohya error for the missing "pytorch_model.bin" are gone.
-
Could you please explain where I can add this solution? I am rather new to Kohya and got it to work before, but now I keep getting this error and I have no idea where in Kohya to add this code.
-
While training Dreambooth I got this error too, with version v24.1.4. I have an NVIDIA GeForce RTX 3090.
-
I had the same problem.
-
You're going to heaven!
-
For anyone coming here with this problem: my VAE was also the issue. I didn't need to change the model, but changing the VAE worked wonders :)
-
Training LoRA models on NVIDIA GTX 1660 6GB Fails with "NaN detected in latents" Error
Introduction
I encountered an issue while training LoRA models on my NVIDIA GeForce GTX 1660 6GB card. The training script terminated unexpectedly, reporting a "NaN detected in latents" error. This issue seems to prevent successful model training using this specific GPU setup.
Environment Details
GPU: NVIDIA GeForce GTX 1660 6GB
Setup: Kohya_ss GUI LoRA Portable on Windows, with requirements.txt modified to use gradio==3.44.0.
Steps to Reproduce
Install the dependencies from the modified requirements.txt and start LoRA training with the command shown in the traceback below.
Expected Behavior
The training process should run without encountering NaN errors in latents, allowing for successful model training.
Actual Behavior
The training process fails early with a RuntimeError indicating "NaN detected in latents", pointing to a specific image file. This suggests an issue in how certain data types or values are handled while the VAE caches latents.
[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 2991.44it/s]
prepare dataset
preparing accelerator
loading model for process 0/1
load StableDiffusion checkpoint: C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Stable-diffusion\Reliberate_v3.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
Enable xformers for U-Net
A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|████████████████████████████████████████████████████████████████████| 30/30 [00:00<?, ?it/s]
caching latents...
  0%|                                                                            | 0/30 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\train_network.py", line 1033, in <module>
    trainer.train(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\train_network.py", line 267, in train
    train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 1927, in cache_latents
    dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 952, in cache_latents
    cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 2272, in cache_batch_latents
    raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
RuntimeError: NaN detected in latents: C:\TrainDataDir\100_Name\aehrthgaerthg.png
Traceback (most recent call last):
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Kohya_ss-GUI-LoRA-Portable-main\venv\Scripts\python.exe', './train_network.py', '--pretrained_model_name_or_path=C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Stable-diffusion\Reliberate_v3.safetensors', '--train_data_dir=C:\TrainDataDir', '--resolution=512,512', '--output_dir=C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Lora', '--network_alpha=128', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-05', '--unet_lr=0.0001', '--network_dim=128', '--output_name=Name', '--lr_scheduler_num_cycles=1', '--learning_rate=8e-05', '--lr_scheduler=constant_with_warmup', '--lr_warmup_steps=150', '--train_batch_size=2', '--max_train_steps=1500', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_grad_norm=1', '--max_data_loader_n_workers=1', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale', '--noise_offset=0.0', '--wandb_api_key=False']' returned non-zero exit status 1.
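For context, the error is raised while the VAE encodes the training images into latents (presumably in half precision here, given --mixed_precision=fp16). The sketch below is a standalone diagnostic of my own, not part of the kohya scripts, for checking whether a single image reproduces NaN latents; the VAE repo ID is an assumption, and any SD 1.5-compatible VAE should work.

```python
# Hypothetical diagnostic (not part of the kohya scripts): check whether one image
# yields NaN latents when encoded by an SD 1.5 VAE in fp16. The image path is the
# file named in the traceback above; the VAE repo ID is an assumption.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

image_path = r"C:\TrainDataDir\100_Name\aehrthgaerthg.png"

vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
).to("cuda")

img = Image.open(image_path).convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0  # scale pixels to [-1, 1]
x = x.permute(2, 0, 1).unsqueeze(0).to("cuda", dtype=torch.float16)  # HWC -> NCHW

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()

print("NaN in latents:", torch.isnan(latents).any().item())
```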
Solution/Workaround
After researching similar issues (kohya-ss/sd-scripts#293) and experimenting with potential fixes, I found that enabling torch.backends.cudnn.benchmark resolved the problem. This setting likely lets cuDNN select the best convolution algorithm for the environment, avoiding the conditions that lead to NaN values. Here's the specific code modification made in train_network.py:
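The snippet below is a minimal sketch of that change, assuming it is placed near the top of train_network.py so it runs before the VAE starts caching latents; the exact placement is an assumption on my part.

```python
# train_network.py (sketch; exact placement is an assumption)
import torch

# Let cuDNN benchmark the available convolution algorithms and pick the
# fastest one for the fixed input sizes used during latent caching/training.
torch.backends.cudnn.benchmark = True
```

cudnn.benchmark adds a small warm-up cost on the first batch while cuDNN autotunes, and it works best when input sizes stay constant, as they do with a fixed 512x512 training resolution.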
Suggestion for Permanent Fix
It appears that enabling torch.backends.cudnn.benchmark can prevent the NaN error during training on specific hardware setups, such as the NVIDIA GTX 1660 6GB. It would be beneficial for the training script to automatically detect when this setting is necessary, or at least to document this workaround for users with similar hardware configurations.
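As one hypothetical shape for that automatic detection, a helper like the one below could be called once during setup; the function name and the "GTX 16" substring check are my assumptions, not an exhaustive list of affected cards.

```python
import torch


def maybe_enable_cudnn_benchmark() -> None:
    """Hypothetical helper: enable cudnn.benchmark on GPUs reported to produce NaN latents."""
    if not torch.cuda.is_available():
        return
    device_name = torch.cuda.get_device_name(0)
    # GTX 16xx (Turing without tensor cores) is the hardware reported in this thread.
    if "GTX 16" in device_name:
        torch.backends.cudnn.benchmark = True
```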
Conclusion
Incorporating the torch.backends.cudnn.benchmark = True setting into the training script resolved the "NaN detected in latents" error on an NVIDIA GTX 1660 6GB card, allowing LoRA models to train successfully. This workaround might help others experiencing similar issues, and a permanent fix or documentation update could further improve the user experience.