ERROR - root - Failed to execute the training process: #199
Comments
After adjusting the xFormers version and num_workers in train_stage1.py, the output became the following (training starts, but at 12.27 s/it):

(hallo) root@d1ef2432db94:~/avatar_project/hallo# accelerate launch -m --config_file accelerate_config.yaml --machine_rank 0 --main_process_ip 0.0.0.0 --main_process_port 20055 --num_machines 1 --num_processes 1 scripts.train_stage1 --config ./configs/train/stage1.yaml
/root/anaconda3/envs/hallo/lib/python3.10/site-packages/albumentations/__init__.py:13: UserWarning: A new version of Albumentations is available: 1.4.18 (you have 1.4.14). Upgrade using: pip install -U albumentations. To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
Mixed precision type: no
{'force_upcast', 'scaling_factor', 'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
Steps:   0%|          | 1/30000 [00:12<102:16:46, 12.27s/it]
After I adjusted the cuDNN version to match onnxruntime 1.18.0 and CUDA 12.1, training can run, but it quickly goes out of memory in the first steps.
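One mitigation worth trying (a sketch, not verified against hallo's stage1 config; whether stage1 trains stably in fp16 is an assumption) is to launch with mixed precision, which roughly halves activation memory and also enables xFormers' flash-attention path, which rejects fp32 inputs (see the dtype line in the error log below):

# Assumption: stage1 tolerates fp16; --mixed_precision overrides the
# "Mixed precision type: no" default shown in the log.
accelerate launch -m --config_file accelerate_config.yaml \
    --mixed_precision fp16 \
    --num_machines 1 --num_processes 1 \
    scripts.train_stage1 --config ./configs/train/stage1.yaml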
After I set --num_processes to 4, since I have multiple GPUs, there is a new error:

10/11/2024 19:33:15 - ERROR - root - Failed to execute the training process: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
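The message itself suggests the first diagnostic step; a sketch (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, but the shared-memory guess is an assumption, since "unhandled system error" inside a Docker container is often caused by a too-small /dev/shm):

# Rerun with NCCL debug output to see which subsystem fails:
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL accelerate launch -m \
    --config_file accelerate_config.yaml \
    --num_machines 1 --num_processes 4 \
    scripts.train_stage1 --config ./configs/train/stage1.yaml
# If the debug log points at SHM, restart the container with more
# shared memory, e.g. docker run --shm-size=8g ...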
Thank you very much for your excellent work. I am encountering this problem while training the model in a virtual environment: when I execute the command line, an error occurs. Can anyone solve it?
It seems like some modules have not been loaded, but I have downloaded all the pretrained models. How can I solve this? Thanks.
10/10/2024 19:12:15 - INFO - hallo.models.unet_3d - loaded temporal unet's pretrained weights from pretrained_models/stable-diffusion-v1-5/unet ...
10/10/2024 19:12:22 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
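A quick sanity check before the full log: "Loaded 0.0M-parameter motion module" suggests the motion-module checkpoint was not found or is empty. A sketch for verifying the files (the directory layout is an assumption based on the default config paths; adjust to your stage1.yaml):

# Hypothetical paths - confirm the checkpoints exist and are non-empty:
ls -lh pretrained_models/motion_module/
ls -lh pretrained_models/stable-diffusion-v1-5/unet/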
(hallo) root@d1ef2432db94:~/avatar_project/hallo# accelerate launch -m --config_file accelerate_config.yaml --machine_rank 0 --main_process_ip 0.0.0.0 --main_process_port 20055 --num_machines 1 --num_processes 1 scripts.train_stage1 --config ./configs/train/stage1.yaml
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.2.2+cu118 with CUDA 1108 (you have 2.2.2+cu121)
Python 3.10.14 (you have 3.10.14)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
/root/anaconda3/envs/hallo/lib/python3.10/site-packages/albumentations/__init__.py:13: UserWarning: A new version of Albumentations is available: 1.4.18 (you have 1.4.14). Upgrade using: pip install -U albumentations. To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
check_for_updates()
[2024-10-10 19:12:12,545] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-10 19:12:13,760] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-10 19:12:13,760] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
10/10/2024 19:12:13 - INFO - main - Distributed environment: DEEPSPEED Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: no
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'steps_per_print': inf, 'fp16': {'enabled': False}, 'bf16': {'enabled': False}}
{'scaling_factor', 'latents_mean', 'latents_std', 'force_upcast'} was not found in config. Values will be initialized to default values.
The config attributes {'center_input_sample': False, 'out_channels': 4} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
{'addition_time_embed_dim', '_landmark_net', 'time_embedding_act_fn', 'use_linear_projection', 'addition_embed_type', '_center_input_sample', 'reverse_transformer_layers_per_block', '_out_channels', 'time_embedding_type', 'transformer_layers_per_block', 'projection_class_embeddings_input_dim', 'num_attention_heads', 'class_embed_type', 'timestep_post_act', 'conv_in_kernel', 'addition_embed_type_num_heads', 'mid_block_type', 'encoder_hid_dim', 'time_embedding_dim', 'mid_block_only_cross_attention', 'dropout', 'class_embeddings_concat', 'upcast_attention', 'num_class_embeds', 'encoder_hid_dim_type', 'time_cond_proj_dim', 'dual_cross_attention', 'only_cross_attention', 'attention_type', 'resnet_time_scale_shift'} was not found in config. Values will be initialized to default values.
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['conv_norm_out.bias', 'conv_norm_out.weight', 'conv_out.bias', 'conv_out.weight']
10/10/2024 19:12:15 - INFO - hallo.models.unet_3d - loaded temporal unet's pretrained weights from pretrained_models/stable-diffusion-v1-5/unet ...
The config attributes {'center_input_sample': False} were passed to UNet3DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
{'upcast_attention', 'num_class_embeds', 'unet_use_cross_frame_attention', 'motion_module_resolutions', 'stack_enable_blocks_depth', 'resnet_time_scale_shift', 'motion_module_mid_block', 'dual_cross_attention', 'only_cross_attention', 'class_embed_type', 'use_audio_module', 'audio_attention_dim', 'motion_module_decoder_only', 'stack_enable_blocks_name', 'motion_module_kwargs', 'use_linear_projection', 'use_inflated_groupnorm', 'motion_module_type'} was not found in config. Values will be initialized to default values.
10/10/2024 19:12:22 - INFO - hallo.models.unet_3d - Loaded 0.0M-parameter motion module
10/10/2024 19:12:23 - ERROR - root - Failed to execute the training process: No operator found for `memory_efficient_attention_forward` with inputs:
     query     : shape=(1, 2, 1, 40) (torch.float32)
     key       : shape=(1, 2, 1, 40) (torch.float32)
     value     : shape=(1, 2, 1, 40) (torch.float32)
     attn_bias : <class 'NoneType'>
     p         : 0.0
`decoderF` is not supported because:
    xFormers wasn't build with CUDA support
    attn_bias type is <class 'NoneType'>
    operator wasn't built - see `python -m xformers.info` for more info
`flshattF@<version>` is not supported because:
    xFormers wasn't build with CUDA support
    dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
`cutlassF` is not supported because:
    xFormers wasn't build with CUDA support
    operator wasn't built - see `python -m xformers.info` for more info
`smallkF` is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 32
    xFormers wasn't build with CUDA support
    operator wasn't built - see `python -m xformers.info` for more info
    unsupported embed per head: 40
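All four backends fail for the same root cause flagged in the startup warning: this xFormers wheel was built for cu118 while torch is 2.2.2+cu121, so its C++/CUDA extensions never load. A sketch of the usual fix (the index URL is the standard PyTorch wheel index; the exact xformers version that pairs with torch 2.2.2 is an assumption, so pin it if pip tries to upgrade torch):

pip uninstall -y xformers
# Reinstall a wheel built against CUDA 12.1 to match torch 2.2.2+cu121:
pip install -U xformers --index-url https://download.pytorch.org/whl/cu121
# Confirm the CUDA ops are now available:
python -m xformers.info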