Has anyone managed to train successfully? #19
Comments
Have you tried setting stage3_prefetch_bucket_size to 15099494 in the DeepSpeed config?
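(A minimal sketch of that change, assuming ds_config/stage3.json has the usual zero_optimization layout. Note that the rejected value 15099494.4 is exactly 0.9 × 4096 × 4096, which looks like what the auto setting expands to for this model, so rounding it down to a plain integer is the workaround.)

```python
import json

# Sketch only: force stage3_prefetch_bucket_size to an integer, because
# DeepSpeed's pydantic config validation rejects the fractional 15099494.4.
# The path and key layout below are assumptions based on this thread's script.
path = "ds_config/stage3.json"
with open(path) as f:
    cfg = json.load(f)

cfg["zero_optimization"]["stage3_prefetch_bucket_size"] = 15099494  # int, not 15099494.4

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```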
Yes, that works, but then a new error comes up:
Maintainers, please take a look at the tokenizer. I've already tried all the official suggestions; here is where things stand now:
Maintainers, please take a look at the tokenizer.
---------------------------------------------------------------------------
File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:723, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1854, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2090, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:861, in SpecialTokensMixin.sanitize_special_tokens(self)
File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1004, in SpecialTokensMixin.add_tokens(self, new_tokens, special_tokens)
File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py:421, in PreTrainedTokenizer._add_tokens(self, new_tokens, special_tokens)
File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py:582, in PreTrainedTokenizer.convert_tokens_to_ids(self, tokens)
File ~/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/tokenization_utils.py:595, in PreTrainedTokenizer._convert_token_to_id_with_added_voc(self, token)
File ~/.cache/huggingface/modules/transformers_modules/glm-4-9b/tokenization_chatglm.py:96, in ChatGLM4Tokenizer._convert_token_to_id(self, token)
KeyError: '<|endoftext|>'
Attaching the error from running './scripts/glm4_longwriter.sh': KeyError: '<|endoftext|>'
I ran into exactly the same error:
Hi, please use the tokenizer code from LongWriter-glm4-9b; the current training code does not support the tokenizer from the latest GLM-4-9b release.
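A minimal sketch of that suggestion, assuming the tokenizer is loaded from the THUDM/LongWriter-glm4-9b repo on the Hub (a local directory with the same tokenizer files works too):

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with LongWriter-glm4-9b instead of the one from
# the newest glm-4-9b release; its tokenization_chatglm.py still resolves the
# special tokens that the training code looks up.
tokenizer = AutoTokenizer.from_pretrained(
    "THUDM/LongWriter-glm4-9b",  # or a local path containing that tokenizer
    trust_remote_code=True,
)

# Sanity check: this is exactly the lookup that raised KeyError above.
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))
```

In practice this means either pointing --model_name_or_path at a directory that contains these tokenizer files, or copying the LongWriter-glm4-9b tokenizer files over the ones in the local glm-4-9b directory.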
RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288
Now I'm getting this error too.
Now I'm seeing two types of errors.

Error 1: RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288. With one setting in /ds_config/stage3.json, the run gets all the way to the wandb screen but fails as soon as training starts:

File "/root/anaconda3/envs/glm-4-copy/lib/python3.11/site-packages/transformers/trainer.py", line 2679, in training_step
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 146, in apply_rotary_pos_emb
    rope_cache = rope_cache[:sq]
    xshaped = x.reshape(sq, -1, np, rot_dim // 2, 2)
    rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2)
    ~~~~~~~~~~~~~~~ <--- HERE
    x_out2 = torch.stack(
        [
RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for input of size 524288

Error 2: Input should be a valid integer, got a number with a fractional part. With the other setting in /ds_config/stage3.json, the run fails before wandb even starts:
Hi, is there a solution yet? I'd still like to run the training.
Hi, from the error message it looks like the run is still using the original glm-4-9b
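A quick shape check on Error 1 that is consistent with this (my own arithmetic, not from the thread): the failing view needs 1 × 32 × 2 = 64 elements per position, and the 524288 elements in rope_cache cover only 8192 positions, while the packed batch is 32768 tokens long, i.e. the rotary cache was apparently built with the original glm-4-9b sequence settings.

```python
# Illustrative arithmetic behind "shape '[32768, -1, 1, 32, 2]' is invalid
# for input of size 524288"; the numbers come from the traceback above.
sq = 32768                 # packed sequence length used for training
per_position = 1 * 32 * 2  # trailing dims of the requested view
rope_elements = 524288     # elements actually present in rope_cache

print(rope_elements // per_position)             # 8192 -> cache covers only 8192 positions
print(rope_elements % (sq * per_position) == 0)  # False -> view() cannot succeed
```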
For Error 2, please set
Judging from issue hiyouga/LLaMA-Factory#5252,
System Info / 系統信息
I've tried Transformers 4.43, 4.44, and 4.33, and also replaced modeling_chatglm.py, but running the final .sh file still fails with an error similar to what others reported.
I'd suggest the maintainers document the training procedure in more detail.
Who can help? / 谁可以帮助到您?
Information / 问题信息
Reproduction / 复现过程
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.735379934310913 seconds
Traceback (most recent call last):
File "/root/AI4E/ljc/LongWriter/train/main.py", line 130, in
train()
File "/root/AI4E/ljc/LongWriter/train/main.py", line 126, in train
trainer.train(resume_from_checkpoint=False)
File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/transformers/trainer.py", line 2095, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/accelerate/accelerator.py", line 1303, in prepare
result = self._prepare_deepspeed(*args)
File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/deepspeed/init.py", line 179, in initialize
config_class = DeepSpeedConfig(config, mpu, mesh_device=mesh_device)
File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 797, in init
self._initialize_params(copy.copy(self._param_dict))
File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 817, in _initialize_params
self.zero_config = get_zero_config(param_dict)
File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/deepspeed/runtime/zero/config.py", line 71, in get_zero_config
return DeepSpeedZeroConfig(**zero_config_dict)
File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/deepspeed/runtime/config_utils.py", line 57, in init
super().init(**data)
File "/root/anaconda3/envs/glm-4-copy/lib/python3.10/site-packages/pydantic/main.py", line 193, in init
self.pydantic_validator.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
stage3_prefetch_bucket_size
Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
For further information visit https://errors.pydantic.dev/2.8/v/int_from_float
[2024-08-28 12:38:44,068] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282936
[2024-08-28 12:38:44,901] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282937
[2024-08-28 12:38:46,425] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282938
[2024-08-28 12:38:46,443] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282939
[2024-08-28 12:38:46,452] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282940
[2024-08-28 12:38:46,460] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282941
[2024-08-28 12:38:46,460] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282942
[2024-08-28 12:38:46,469] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 282943
[2024-08-28 12:38:46,478] [ERROR] [launch.py:325:sigkill_handler] ['/root/anaconda3/envs/glm-4-copy/bin/python', '-u', 'main.py', '--local_rank=7', '--model_name_or_path', '/root/AI4E/share/glm-4-9b', '--train_file', './data/glm4/longwriter', '--output_dir', './output/glm4/longwriter', '--num_train_epochs', '4', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--save_strategy', 'steps', '--save_steps', '400', '--save_total_limit', '10', '--preprocessing_num_workers', '64', '--learning_rate', '1e-5', '--weight_decay', '0.1', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_dir', './logs/', '--deepspeed', 'ds_config/stage3.json', '--bf16', '--gradient_checkpointing', '1', '--adam_beta1', '0.9', '--adam_beta2', '0.95', '--report_to', 'wandb', '--run_name', 'glm4_longwriter', '--logging_steps', '1', '--batch_method', 'pack', '--pack_loss'] exits with return code = 1
Expected behavior / 期待表现