No such file or directory: '.local_rank0_completed_autoresume' while doing multi-node pretraining using hf_causal_lm (LLAMA2) #504

Closed
palash04 opened this issue on Aug 3, 2023 · 7 comments
Labels: bug (Something isn't working)

Comments

palash04 commented Aug 3, 2023

Environment

transformers: 4.30.2
llm-foundry: 0.2.0
mosaicml: 0.15.1
mosaicml-cli: 0.4.16
mosaicml-streaming: 0.5.1

To reproduce

YAML config:

data_local: /data/olaai/mosaicml/llm-foundry/scripts/data_prep/c4_mds_llama_2048_cache_
data_remote: /data/olaai/mosaicml/llm-foundry/scripts/data_prep/c4_mds_llama_2048_copy
max_seq_len: 2048
global_seed: 17

# Run Name
run_name: # If left blank, will be read from env var $RUN_NAME

# Model
model:
  name: hf_causal_lm
  pretrained_model_name_or_path: meta-llama/Llama-2-7b-hf
  init_device: cpu
  pretrained: true
  use_auth_token: true  

# Tokenizer
tokenizer:
  name: meta-llama/Llama-2-7b-hf
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train_small
    shuffle: true
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: true
  num_workers: 8

eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: val_small
    shuffle: false
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 8.0e-5
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 333800ba  # ~ 1.4T tokens
eval_interval: 10000ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 16

# System
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: true
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
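
For reference, the dotted CLI overrides passed to the launch commands below (train_loader.dataset.split=..., eval_loader.dataset.split=...) are merged into this YAML by llm-foundry's scripts/train/train.py via OmegaConf. The following is only a rough sketch of that loading step; the exact code may differ between versions.

# Illustrative sketch, not the exact llm-foundry code: load the YAML above and
# merge dotted CLI overrides, roughly the way scripts/train/train.py does it.
import sys
from omegaconf import OmegaConf as om

yaml_path, args_list = sys.argv[1], sys.argv[2:]
with open(yaml_path) as f:
    yaml_cfg = om.load(f)
cli_cfg = om.from_cli(args_list)        # e.g. train_loader.dataset.split=train_small
cfg = om.merge(yaml_cfg, cli_cfg)
print(om.to_yaml(cfg, resolve=True))    # ${max_seq_len}, ${global_seed} resolve here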

Commands

On Node 0

TRAIN_PATH="/data/olaai/mosaicml/llm-foundry/scripts/train/train.py"
YAML_PATH="/data/olaai/mosaicml/llm-foundry/scripts/train/yamls/pretrain/llama2-7b.yaml"
python3 -m composer --world_size 16 --node_rank 0 --master_addr 10.0.0.11 --master_port 7502 $TRAIN_PATH $YAML_PATH train_loader.dataset.split=train_small eval_loader.dataset.split=val_small

On Node 1

TRAIN_PATH="/data/olaai/mosaicml/llm-foundry/scripts/train/train.py"
YAML_PATH="/data/olaai/mosaicml/llm-foundry/scripts/train/yamls/pretrain/llama2-7b.yaml"
python3 -m composer --world_size 16 --node_rank 1 --master_addr 10.0.0.11 --master_port 7502 $TRAIN_PATH $YAML_PATH train_loader.dataset.split=train_small eval_loader.dataset.split=val_small
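
For context on these flags: --world_size 16 over two nodes means the composer launcher starts one process per GPU on each node, and each process has both a local rank (within its node) and a global rank (across the whole job). The sketch below is only illustrative arithmetic, assuming 8 GPUs per node; the local/global distinction matters for the rank-0 marker file discussed in the comments below.

# Rough sketch of the rank layout implied by the launch commands above,
# assuming 8 GPUs per node (an assumption; the launcher uses one process per GPU).
WORLD_SIZE = 16
NPROC_PER_NODE = 8  # assumed GPUs per node

for node_rank in range(WORLD_SIZE // NPROC_PER_NODE):
    for local_rank in range(NPROC_PER_NODE):
        global_rank = node_rank * NPROC_PER_NODE + local_rank
        print(f"node {node_rank}: local_rank {local_rank} -> global_rank {global_rank}")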

Error Snippet

Getting the following error on the node with node_rank 1:
[Screenshot of the traceback: No such file or directory: '.local_rank0_completed_autoresume']

Note: single-node training works fine with the same YAML config.

palash04 added the bug label on Aug 3, 2023

dakinggg (Collaborator) commented Sep 9, 2023

Hi, are you on a shared file system perhaps? Wondering if the nodes are trying to write and read to the same location.
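
(As an illustration of the failure mode suggested here: the sketch below assumes a per-node barrier built on a local-rank-0 marker file such as '.local_rank0_completed_autoresume'; it is not Composer's actual code. On a shared filesystem, local rank 0 on both nodes touches and unlinks the same path, so one node can remove the file while the other still expects it, which surfaces as FileNotFoundError.)

# Minimal, illustrative sketch (not Composer's actual code) of why a marker
# file keyed only on local rank breaks when two nodes share one filesystem.
import pathlib

MARKER = pathlib.Path(".local_rank0_completed_autoresume")  # same path on every node if the FS is shared

def per_node_signal(local_rank: int) -> None:
    if local_rank == 0:
        MARKER.touch()              # local rank 0 on *each* node creates this one file
    else:
        while not MARKER.exists():  # other local ranks wait for their node's rank 0
            pass
    # ...later, local rank 0 cleans up...
    if local_rank == 0:
        MARKER.unlink()             # the second node to unlink (or a waiter racing a
                                    # premature delete) hits FileNotFoundError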

gpucce commented Sep 21, 2023

> Hi, are you on a shared file system perhaps? Wondering if the nodes are trying to write and read to the same location.

Hi, I had this same issue, and for me the problem is indeed the shared file system. I changed get_local_rank to get_global_rank and it seems to work. However, I would like to ask whether you know of other parts of the codebase where having a shared file system might be an issue.

Thanks in advance!
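
(Purely as a hedged illustration of the change described above, not the actual patch: keying the marker file on the global rank leaves a single writer for the whole job, which is safe here precisely because every rank sees the same shared filesystem. get_local_rank and get_global_rank are the composer.utils.dist helpers mentioned in the comment; the surrounding logic is assumed.)

# Illustrative sketch of the described workaround, assuming the composer
# launcher has already set the usual distributed environment variables.
# get_local_rank/get_global_rank are composer.utils.dist helpers; the rest is assumed.
import pathlib
from composer.utils import dist

marker = pathlib.Path(".local_rank0_completed_autoresume")

# Before: one writer per node -> two writers colliding on one shared path.
is_writer = dist.get_local_rank() == 0

# After: a single writer for the whole job; every other rank just waits for the
# file, which works precisely because the filesystem is shared across nodes.
is_writer = dist.get_global_rank() == 0

if is_writer:
    marker.touch()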

YixinSong-e commented

I'm running into the same issue. Did you find any other code that needs to be updated?

gpucce commented Sep 26, 2023

> I'm running into the same issue. Did you find any other code that needs to be updated?

I ran a full fine-tuning and everything seems to work, though I'm not sure there aren't any hidden issues.

mvpatel2000 (Collaborator) commented

Thanks for flagging this. We're working to get a patch release out that will fix this, ideally within 1-2 days. We're still discussing the right solution, though, since it is a bit tricky for us to test: we do not have shared filesystems in our clusters.

mvpatel2000 (Collaborator) commented

#629 should hopefully address the above issues.

mvpatel2000 (Collaborator) commented

Closing based on #629. Please reopen if you still have issues!
