NotImplementedError: Cannot copy out of meta tensor; no data! #26510

Closed · ari9dam opened this issue Sep 30, 2023 · 15 comments
ari9dam commented Sep 30, 2023

System Info

transformers==4.34.0.dev0
accelerate==0.23.0
torch==2.0.1
cuda==11.7

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import transformers

# model_path is not defined in the report; it points to the Mistral checkpoint being loaded.
model = transformers.MistralForCausalLM.from_pretrained(model_path)

Error:

Traceback (most recent call last):
  File "./trainer.py", line 198, in <module>
    train()
  File "./trainer.py", line 152, in train
    model = transformers.MistralForCausalLM.from_pretrained(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3301, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3689, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 317, in set_module_tensor_to_device
    new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!
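For readers unfamiliar with the error itself: it is raised when PyTorch tries to materialize a tensor that lives on the meta device, which carries only shape and dtype but no storage. A minimal standalone illustration, unrelated to this specific checkpoint:

import torch

t = torch.empty(2, 2, device="meta")  # a meta tensor has metadata but no data
t.to("cpu")  # raises NotImplementedError: Cannot copy out of meta tensor; no data!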

Expected behavior

The model loads successfully.

mdazfar2 commented Oct 1, 2023

Hello sir, can you assign this issue to me?

ari9dam (Author) commented Oct 1, 2023

It does not give me the option to assign anyone. Sorry!

LysandreJik (Member) commented

@mdazfar2 feel free to open a PR and link it to this issue if you'd like to work on it!

ari9dam (Author) commented Oct 3, 2023

It works without FSDP (i.e., with DDP); with FSDP it does not work.

mdazfar2 commented Oct 3, 2023

@LysandreJik Yeah, okay, I will do it now.

ari9dam (Author) commented Oct 3, 2023

It works with DeepSpeed Stage 2 as well. The error only occurs when training with FSDP.
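For context, the thread does not show the exact launch commands being compared in the last two comments. A rough sketch of the three setups, assuming accelerate's standard launcher flags and the trainer.py entry point from the traceback (both are assumptions, not taken from the thread):

# Plain DDP run (reported working); flags and process count are illustrative.
accelerate launch --multi_gpu --num_processes 4 trainer.py

# DeepSpeed ZeRO Stage 2 run (reported working).
accelerate launch --use_deepspeed --zero_stage 2 --num_processes 4 trainer.py

# FSDP run (reported failing with the meta-tensor error).
accelerate launch --use_fsdp --num_processes 4 trainer.py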

dannyhung1128 commented

Hit the same problem on Slurm as well.

ari9dam (Author) commented Oct 4, 2023 via email

ari9dam (Author) commented Oct 4, 2023

What are the possible reasons? I could run my code with 4.33.1. Is it accelerate?
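One way to answer the "is it accelerate?" question is to bisect by pinning one package at a time. This is only a suggested diagnostic, not something done in the thread, and the exact versions below are illustrative:

# Keep accelerate fixed and roll transformers back to the version reported as working:
pip install "transformers==4.33.1" "accelerate==0.23.0"

# Then keep transformers at the failing version and try a different accelerate release:
pip install "accelerate==0.22.0"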

LysandreJik (Member) commented

Maybe cc @muellerzr as well

github-actions bot closed this as completed Nov 8, 2023
amyeroberts reopened this Nov 8, 2023
huggingface deleted a comment from the github-actions bot Nov 8, 2023
amyeroberts (Collaborator) commented

Gentle ping @muellerzr @pacman100

pacman100 (Contributor) commented Nov 8, 2023

Hello, using the latest releases of transformers (4.35.0) and Accelerate (0.24.1), I am unable to reproduce the issue.

1. Code issue_26510.py:
import transformers

model_path = "mistralai/Mistral-7B-Instruct-v0.1"
model = transformers.MistralForCausalLM.from_pretrained(model_path)
2. Accelerate config via accelerate config --config_file issue_26510.yaml:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
3. Launch command:
accelerate launch --config_file issue_26510.yaml issue_26510.py
4. Output logs:
Downloading shards: 100%|█| 2/2 [00:00<00:00,  9.46
Downloading shards: 100%|█| 2/2 [00:00<00:00,  9.70
Downloading shards: 100%|█| 2/2 [00:00<00:00, 12.09
Downloading shards: 100%|█| 2/2 [00:00<00:00,  7.83
Loading checkpoint shards: 100%|█| 2/2 [00:12<00:00,  6.19s/it
Loading checkpoint shards: 100%|█| 2/2 [00:12<00:00,  6.22s/it
Loading checkpoint shards: 100%|█| 2/2 [00:12<00:00,  6.15s/it
Loading checkpoint shards: 100%|█| 2/2 [00:12<00:00,  6.12s/it
5. This was experienced initially because support for RAM-efficient loading of pretrained models was not compatible with a few models, such as Whisper. Therefore, the PRs "Make fsdp ram efficient loading optional" (#26631) and "Make fsdp ram efficient loading optional" (accelerate#2037) added a config parameter to make it optional. See the config param `` and set it to False in case RAM-efficient loading of the model fails. The docs for this config parameter are at https://huggingface.co/docs/accelerate/usage_guides/fsdp#how-it-works-out-of-the-box. The relevant point is quoted below:

CPU RAM Efficient Model loading: If True, only the first process loads the pretrained model checkpoint while all other processes have empty weights. Only applicable for 🤗 Transformers models. This should be set to False if you experience errors when loading the pretrained 🤗 Transformers model via the from_pretrained method. When using this, Sync Module States needs to be True, else all the processes except the main process would have random empty weights, leading to unexpected behaviour during training.
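In YAML terms, using the key names from the config shown earlier in this comment, the quoted guidance reads roughly as follows (a sketch of the relevant keys only, not a full config):

fsdp_config:
  # Set to false if from_pretrained fails with the meta-tensor error.
  fsdp_cpu_ram_efficient_loading: false
  # Per the quote above, this must stay true whenever fsdp_cpu_ram_efficient_loading is true,
  # so that non-main processes receive the real weights instead of random empty ones.
  fsdp_sync_module_states: true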

github-actions bot commented Dec 3, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

kwonmha (Contributor) commented Dec 11, 2023

The missing config parameter name (``) in the comment above is fsdp_cpu_ram_efficient_loading. :)
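With the name filled in, the minimal change to the reproduction setup above is a one-line edit to the YAML followed by the same launch command; this just restates the earlier guidance concretely:

# In issue_26510.yaml, change
#   fsdp_cpu_ram_efficient_loading: true
# to
#   fsdp_cpu_ram_efficient_loading: false
# and relaunch:
accelerate launch --config_file issue_26510.yaml issue_26510.py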

github-actions bot commented Jan 4, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
