[rank4]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float] #16
Comments
Also, if I change "stage3_prefetch_bucket_size": "auto" in stage3.json to "stage3_prefetch_bucket_size": 15099494, running it produces the following error:
I also ran into this:
Yes, I'm stuck at the same step right now; my error is the same as yours.
(T_T)
What we currently provide
I've already switched to 4.33.0 and replaced modeling_chatglm.py, but I get the following error:
It looks like the replacement didn't actually take effect; the modeling_chatglm.py we use for training does not contain this line: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 416, in __init__
Does training support glm-4-9b-chat, as opposed to glm-4-9b?
We recommend starting mixed training (general SFT data + LongWriter-6k data) from the glm-4-9b (base) model. Training directly from glm-4-9b-chat gives significantly worse results.
I tried it and can confirm: even after replacing the file, running the train script still uses the original modeling_chatglm.py.
You need to pass the parameter when loading
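The exact parameter is elided in the comment above. One workaround for the stale-file symptom described here (the path below matches the cache directory from the traceback, but is otherwise an assumption about this user's setup): transformers copies a model repo's remote code into `~/.cache/huggingface/modules/transformers_modules/` and keeps loading that cached copy, so deleting it forces the replaced modeling_chatglm.py to be picked up on the next `from_pretrained` call.

```python
import pathlib
import shutil

# Hypothetical cache location, inferred from the traceback in this thread:
# transformers caches a model's remote code here and reuses it, which is why
# replacing modeling_chatglm.py inside the model directory has no effect.
cached = (pathlib.Path.home()
          / ".cache/huggingface/modules/transformers_modules/glm-4-9b-chat")

# Remove the stale cached copy; the next from_pretrained(...) call will
# re-copy the (replaced) modeling_chatglm.py from the model directory.
shutil.rmtree(cached, ignore_errors=True)
```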
Traceback (most recent call last): I switched to the glm-4-9b model and also replaced
@sunzhufeng12345 @badarrrr Please check whether the FAQ in our README resolves the problems you ran into. Sorry for the long wait.
Using the officially provided scripts and dataset, I ran, in order:
python pre_tokenize_glm4.py
python sort_and_group.py --group_size 8 --train_file /home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/datasets
which produced attention_masks_pack.json, inputs_pack.npy, and related files.
When I then run the training script ./glm4_longwriter.sh, I hit a ValidationError related to the DeepSpeedZeroConfig: stage3_prefetch_bucket_size expects an integer but receives a float.
Training log:
[2024-08-26 09:58:48,719] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-08-26 09:58:49,793] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 09:58:50,631] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-08-26 09:58:50,737] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-08-26 09:58:50,784] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-08-26 09:58:50,799] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-08-26 09:58:51,320] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 09:58:52,754] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-08-26 09:58:52,859] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-08-26 09:58:53,039] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-08-26 09:58:53,301] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2024-08-26 09:59:10,505] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 283, num_elems = 9.40B
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.15s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.18s/it]
loading data...
loading data...
loading data...
loading data...
loading data...
loading data...
loading data...
loading data...
finish loading data
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.158402919769287 seconds
[rank4]: Traceback (most recent call last):
[rank4]: File "/home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/main.py", line 130, in <module>
[rank4]: train()
[rank4]: File "/home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/main.py", line 126, in train
[rank4]: trainer.train(resume_from_checkpoint=False)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
[rank4]: return inner_training_loop(
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/transformers/trainer.py", line 2095, in _inner_training_loop
[rank4]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank4]: result = self._prepare_deepspeed(*args)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
[rank4]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/__init__.py", line 179, in initialize
[rank4]: config_class = DeepSpeedConfig(config, mpu, mesh_device=mesh_device)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 797, in __init__
[rank4]: self._initialize_params(copy.copy(self._param_dict))
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 817, in _initialize_params
[rank4]: self.zero_config = get_zero_config(param_dict)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/zero/config.py", line 71, in get_zero_config
[rank4]: return DeepSpeedZeroConfig(**zero_config_dict)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/deepspeed/runtime/config_utils.py", line 57, in __init__
[rank4]: super().__init__(**data)
[rank4]: File "/home/hnjj/anaconda3/envs/szf-longwrite/lib/python3.10/site-packages/pydantic/main.py", line 193, in __init__
[rank4]: self.__pydantic_validator__.validate_python(data, self_instance=self)
[rank4]: pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
[rank4]: stage3_prefetch_bucket_size
[rank4]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
[rank4]: For further information visit https://errors.pydantic.dev/2.8/v/int_from_float
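For context on the failing value: when stage3_prefetch_bucket_size is "auto", the transformers DeepSpeed integration fills it with 0.9 * hidden_size * hidden_size, which for a hidden size of 4096 is exactly the 15099494.4 seen in the traceback; newer DeepSpeed versions then reject the fractional value because pydantic validates the field as a strict integer. A minimal sketch of one workaround (the helper name is ours, not from the repo) is to round the offending fields down before the config reaches deepspeed.initialize:

```python
# Hypothetical helper: coerce fractional ZeRO bucket sizes to integers so
# DeepSpeed's pydantic validation of DeepSpeedZeroConfig does not reject them.
def sanitize_zero_config(ds_config):
    zero = ds_config.get("zero_optimization", {})
    for key in ("stage3_prefetch_bucket_size",
                "stage3_param_persistence_threshold",
                "stage3_max_live_parameters",
                "reduce_bucket_size"):
        if isinstance(zero.get(key), float):
            zero[key] = int(zero[key])  # truncate, e.g. 15099494.4 -> 15099494
    return ds_config

# The value from this thread: 0.9 * 4096 * 4096 == 15099494.4
cfg = {"zero_optimization": {"stage3_prefetch_bucket_size": 0.9 * 4096 * 4096}}
cfg = sanitize_zero_config(cfg)
print(cfg["zero_optimization"]["stage3_prefetch_bucket_size"])  # → 15099494
```

Alternatively, pinning a DeepSpeed version from before the strict integer validation, or hard-coding an integer in stage3.json, sidesteps the same check.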