'AdamW' object has no attribute 'optim_bits' #2191

Open
e-p-armstrong opened this issue Dec 15, 2024 · 7 comments
Labels: bug (Something isn't working), waiting on upstream, wip

@e-p-armstrong

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Full-parameter ChatML finetuning of Llama 3.1 should work on the main:latest Docker image on RunPod with 6x A40s and DeepSpeed.

Current behaviour

Training never gets a chance to start:

Stacktrace:

[2024-12-08 22:54:15,191] [INFO] [axolotl.load_model:1115] [PID:13086] [RANK:2] Converting modules to torch.bfloat16
[2024-12-08 22:54:15,296] [INFO] [axolotl.load_model:1082] [PID:13084] [RANK:0] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.099GB misc)
[2024-12-08 22:54:15,306] [INFO] [axolotl.load_model:1115] [PID:13084] [RANK:0] Converting modules to torch.bfloat16
[rank3]: Traceback (most recent call last):
[rank3]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank3]:   File "<frozen runpy>", line 88, in _run_code
[rank3]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank3]:     fire.Fire(do_cli)
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank3]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank3]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank3]:     component, remaining_args = _CallAndUpdateTrace(
[rank3]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank3]:     component = fn(*varargs, **kwargs)
[rank3]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank3]:     return do_train(parsed_cfg, parsed_cli_args)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank3]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank3]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/axolotl/src/axolotl/train.py", line 192, in train
[rank3]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
[rank3]:     return inner_training_loop(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2275, in _inner_training_loop
[rank3]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank3]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank3]:     result = self._prepare_deepspeed(*args)
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank3]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank3]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank3]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank3]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank3]: AttributeError: 'AdamW' object has no attribute 'optim_bits'

This issue has been around for roughly a week now; I first reported it on the Discord.
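
For context, the check that blows up is the one quoted at the bottom of the traceback, in accelerate's map_pytorch_optim_to_deepspeed. Below is a minimal sketch of that check next to a defensive variant that would avoid the AttributeError; the helper name is_32bit_bnb_adamw is hypothetical and this is only an illustration, not the actual upstream patch.

# Hypothetical illustration based on the line quoted in the traceback
# (accelerate/utils/deepspeed.py, map_pytorch_optim_to_deepspeed); not the
# actual upstream fix.
import bitsandbytes.optim as bnb_opt


def is_32bit_bnb_adamw(optimizer) -> bool:
    """Return True only for a 32-bit bitsandbytes AdamW instance."""
    # Failing form from the traceback: it assumes every matching instance
    # exposes an `optim_bits` attribute, which the bitsandbytes AdamW object
    # built here evidently does not:
    #   isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
    # Defensive form: treat a missing attribute as "not a 32-bit AdamW"
    # instead of raising.
    return (
        isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit))
        and getattr(optimizer, "optim_bits", None) == 32
    )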

Steps to reproduce

Attempt to full-finetune Llama 3 using the settings provided (you will need to add a generic ChatML dataset, as I had to redact my data files).

Config yaml

base_model: Heralax/private-llama3.1-model-whose-name-is-censored
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false

datasets: # data files have been redacted, sorry
  
  

dataset_prepared_path: last_run_prepared-ft-lowerbatchsize
output_dir: ./out

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true
shuffle_merged_datasets: true

wandb_project: llama_3.1_8b
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 7 # meant for use on 6 GPUs to achieve same effective batch size as earlier. Swapped # GPUs and Grad accumulation steps.
micro_batch_size: 2
eval_batch_size: 1
num_epochs: 4
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.000012
weight_decay: 0
# Gradient clipping max norm
max_grad_norm: 1.0
noisy_embedding_alpha: 5
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: unsloth
early_stopping_patience:
resume_from_checkpoint: 
logging_steps: 1
xformers_attention:
flash_attention: true

chat_template: chatml

warmup_ratio: 0.5
auto_resume_from_checkpoints: false
#warmup_ratio: 0.5
eval_steps: 10
saves_per_epoch: 1
eval_sample_packing: false
save_total_limit: 2
debug:
deepspeed: deepspeed_configs/zero2.json
special_tokens:
  pad_token: "<|end_of_text|>"


Possible solution

Rolling back to axolotlai/axolotl-cloud:main-20241129-py3.11-cu124-2.4.1 lets me train again. Unfortunately, the pinned nightly image I was relying on (winglian/axolotl-cloud:main-20241124) no longer lets me connect: direct SSH does not appear as an option, and when I try to go through the proxy it hangs and then reports that the container is not running. This has happened with every winglian/axolotl-cloud image I have tried since around 12/8/24, but that is a separate issue.

Which Operating Systems are you using?

- [X] Linux
- [ ] macOS
- [ ] Windows

Python Version

Whatever version the main:latest image ships with.

axolotl branch-commit

main / whatever commit the most recent Docker image was built from.

Acknowledgements

- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
e-p-armstrong added the bug label Dec 15, 2024
@winglian
Collaborator

Could you try with the regular adamw_8bit optimizer please?

@e-p-armstrong
Author

OK, I will try that and get back to you.

@bursteratom
Collaborator

@e-p-armstrong @winglian Looks like the issue is with accelerate. I found that downgrading accelerate to version 1.0.1 bypasses this error for now. Will follow up upstream with accelerate.
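
For anyone hitting this in the meantime, a quick way to confirm which accelerate build a given container actually ships before deciding whether to downgrade (a hypothetical snippet, not part of axolotl):

# Hypothetical check, not part of axolotl: print the installed accelerate
# version. Per the comment above, 1.0.1 avoids this error, while the newer
# version bundled in recent images hits it under ZeRO-2.
from importlib.metadata import version

print("accelerate", version("accelerate"))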

@bursteratom
Collaborator

This issue seems to only affect zero2. Zero3 works fine.

@e-p-armstrong
Author

@winglian Reproduced with a different optimizer and it happened even with DPO tuning.

pytorch_model.bin.index.json: 100%|_______________________________________________| 23.9k/23.9k [00:00<00:00, 80.9MB/s]
pytorch_model-00001-of-00002.bin: 100%|____________________________________________| 16.1G/16.1G [00:34<00:00, 465MB/s]
pytorch_model-00002-of-00002.bin: 100%|_____________________________________________| 542k/542k [00:00<00:00, 77.7MB/s]
Downloading shards: 100%|________________________________________________________________| 2/2 [00:35<00:00, 17.53s/it]
Downloading shards: 100%|________________________________________________________________| 2/2 [00:34<00:00, 17.48s/it]
Downloading shards: 100%|________________________________________________________________| 2/2 [00:35<00:00, 17.54s/it]
Loading checkpoint shards: 100%|_________________________________________________________| 2/2 [00:04<00:00,  2.03s/it]
generation_config.json: 100%|_________________________________________________________| 180/180 [00:00<00:00, 1.10MB/s]
[2024-12-18 21:31:27,225] [INFO] [axolotl.load_model:1077] [PID:1521] [RANK:0] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.447GB misc)
[2024-12-18 21:31:27,229] [INFO] [axolotl.load_model:1110] [PID:1521] [RANK:0] Converting modules to torch.bfloat16
Loading checkpoint shards: 100%|_________________________________________________________| 2/2 [00:05<00:00,  2.56s/it]
Loading checkpoint shards: 100%|_________________________________________________________| 2/2 [00:05<00:00,  2.56s/it]
[2024-12-18 21:31:28,223] [INFO] [axolotl.load_model:1077] [PID:1523] [RANK:2] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.447GB misc)
[2024-12-18 21:31:28,227] [INFO] [axolotl.load_model:1110] [PID:1523] [RANK:2] Converting modules to torch.bfloat16
/workspace/axolotl/src/axolotl/core/trainer_builder.py:446: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `AxolotlTrainer.__init__`. Use `processing_class` instead.
  super().__init__(*_args, **kwargs)
[2024-12-18 21:31:28,828] [INFO] [axolotl.train.train:174] [PID:1521] [RANK:0] Starting trainer...
[2024-12-18 21:31:28,892] [INFO] [axolotl.load_model:1077] [PID:1522] [RANK:1] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.447GB misc)
[2024-12-18 21:31:28,896] [INFO] [axolotl.load_model:1110] [PID:1522] [RANK:1] Converting modules to torch.bfloat16
/workspace/axolotl/src/axolotl/core/trainer_builder.py:446: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `AxolotlTrainer.__init__`. Use `processing_class` instead.
  super().__init__(*_args, **kwargs)
/workspace/axolotl/src/axolotl/core/trainer_builder.py:446: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `AxolotlTrainer.__init__`. Use `processing_class` instead.
  super().__init__(*_args, **kwargs)
[2024-12-18 21:31:30,383] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:197] [PID:1521] [RANK:0] gather_len_batches: [580, 580, 580]
[rank1]: Traceback (most recent call last):
[rank1]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank1]:   File "<frozen runpy>", line 88, in _run_code
[rank1]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank1]:     fire.Fire(do_cli)
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank1]:     return do_train(parsed_cfg, parsed_cli_args)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank1]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/axolotl/src/axolotl/train.py", line 188, in train
[rank1]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2323, in _inner_training_loop
[rank1]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank1]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank1]:     result = self._prepare_deepspeed(*args)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank1]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank1]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank1]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank1]: AttributeError: 'AdamW' object has no attribute 'optim_bits'
[rank2]: Traceback (most recent call last):
[rank2]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank2]:   File "<frozen runpy>", line 88, in _run_code
[rank2]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank2]:     fire.Fire(do_cli)
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank2]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank2]:     component, remaining_args = _CallAndUpdateTrace(
[rank2]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank2]:     component = fn(*varargs, **kwargs)
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank2]:     return do_train(parsed_cfg, parsed_cli_args)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank2]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank2]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/axolotl/src/axolotl/train.py", line 188, in train
[rank2]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank2]:     return inner_training_loop(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2323, in _inner_training_loop
[rank2]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank2]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank2]:     result = self._prepare_deepspeed(*args)
[rank2]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank2]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank2]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank2]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank2]: AttributeError: 'AdamW' object has no attribute 'optim_bits'
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank0]:     fire.Fire(do_cli)
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank0]:     return do_train(parsed_cfg, parsed_cli_args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank0]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/axolotl/src/axolotl/train.py", line 188, in train
[rank0]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2323, in _inner_training_loop
[rank0]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank0]:     result = self._prepare_deepspeed(*args)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank0]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank0]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank0]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'AdamW' object has no attribute 'optim_bits'
W1218 21:31:33.244000 140323527751488 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1521 closing signal SIGTERM
W1218 21:31:33.245000 140323527751488 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1523 closing signal SIGTERM
E1218 21:31:33.390000 140323527751488 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 1522) of binary: /root/miniconda3/envs/py3.11/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.11/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1153, in launch_command
    deepspeed_launcher(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 846, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-18_21:31:33
  host      : b6131755c915
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1522)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@bursteratom Thanks for digging in and finding the problem!

@bursteratom
Collaborator

bursteratom commented Dec 18, 2024

@e-p-armstrong In the meantime I recommend keeping the current accelerate version (1.2.1) and using zero3 instead of zero2.

@bursteratom
Collaborator

@e-p-armstrong @winglian I've started an upstream PR on accelerate to fix this: huggingface/accelerate#3305
