DeepSpeed multi-GPU resume from checkpoint fails #1134
Comments
the same error |
see #1156 (comment) |
Same error and #1156 (comment) didn't fix it. I'm also using |
@winglian doesn't work, still the same issue |
LoRA training should be resumed using lora_model_dir |
did you unset resume_from_checkpoint? |
If I unset resume_from_checkpoint, I don't get the error again, but training starts from epoch zero. This is where my training stopped.
I updated lora_model_dir to the latest checkpoint dir and removed resume_from_checkpoint. This is where training resumed.
This method kind of works, but it always resumes training from scratch? |
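The workaround described above comes down to a two-key change in the config. A minimal sketch, assuming checkpoints are saved as checkpoint-<step> directories under output_dir (the Hugging Face Trainer naming) and that checkpoint-100 is the latest one; the step number is a hypothetical example. As reported above, this only loads the saved adapter, so the run still starts counting from epoch zero.

```yaml
# Workaround sketch: load the saved adapter, but do not ask the trainer to resume.
# "checkpoint-100" is a hypothetical example; point this at your latest checkpoint dir.
lora_model_dir: /sky-notebook/manishiitg/tinyllama-chat-instruct-hi-v1/checkpoint-100
# Left empty on purpose, since setting it is what triggers the error in this issue.
resume_from_checkpoint:
```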
seems this is an upstream issue (never actually resolved) huggingface/peft#746 |
@manishiitg @zacbrannelly @vip-china see #1227, I've confirmed this resumes for me with zero2 |
seems the train loss doesn't perfectly line up after resume though 🤷 |
great! can't wait for the merge to test it out :) |
@winglian I am trying with ds1, and specified |
don't think we need to set lora_model_dir anymore @satpalsr |
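For comparison, with the fix from #1227 the resume would presumably be configured the other way around: lora_model_dir left empty and resume_from_checkpoint pointing at the checkpoint. A sketch under that assumption; the path is hypothetical and the fragment has not been verified against the branch.

```yaml
adapter: qlora
# Assumed to be unnecessary once #1227 is in, per the comment above.
lora_model_dir:
# Hypothetical path; use the checkpoint dir inside your own output_dir.
resume_from_checkpoint: /path/to/output_dir/checkpoint-100
```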
Then it says |
got it, |
@manishiitg confirmed working for you? |
unfortunately, I am only able to run docker builds on my gpu cluster, so not able to verify from branch. if i clone and install i get this issue #945 so unable to test the branch |
i can confirm very quickly once the PR is merged to master and docker build is updated :) |
Hi all, I am still facing this issue. My config file is as follows:
base_model: /dev/shm/Yarn-Mistral-7b-64k
bnb_config_kwargs:
load_in_8bit: false
model_config:
datasets:
dataset_prepared_path: /dev/shm/datasets/dataset-debug
adapter: qlora
sequence_len: 2048
lora_r: 32
gradient_accumulation_steps: 1
train_on_inputs:
trust_remote_code: true
warmup_ratio: 0.03
|
Please check that this issue hasn't been reported before.
Expected Behavior
should work
Current behaviour
Steps to reproduce
when resuming training from checkpoint
Config yaml
base_model: unsloth/tinyllama
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true
load_in_8bit: false
load_in_4bit: true
strict: false
chat_template: chatml
datasets:
type: completion
wandb_project: tiny-aditi
hub_model_id: manishiitg/tinyllama-chat-instruct-hi-v1
hf_use_auth_token: true
dataset_prepared_path:
val_set_size: 0
output_dir: /sky-notebook/manishiitg/tinyllama-chat-instruct-hi-v1
sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 14
num_epochs: 4
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true ## manage check point resume from here
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 10
eval_steps: 0
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 100 ## increase based on your dataset
save_strategy: steps
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
bos_token: ""
eos_token: ""
unk_token: ""
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements