No such file or directory: '.local_rank0_completed_autoresume' while doing multi-node pretraining using hf_causal_lm (LLAMA2) #504

Closed
palash04 opened this issue on Aug 3, 2023 · 7 comments
Labels: bug (Something isn't working)

Comments

palash04 commented Aug 3, 2023

Environment

transformers: 4.30.2
llm-foundry: 0.2.0
mosaicml: 0.15.1
mosaicml-cli: 0.4.16
mosaicml-streaming: 0.5.1

To reproduce

YAML config:

data_local: /data/olaai/mosaicml/llm-foundry/scripts/data_prep/c4_mds_llama_2048_cache_
data_remote: /data/olaai/mosaicml/llm-foundry/scripts/data_prep/c4_mds_llama_2048_copy
max_seq_len: 2048
global_seed: 17

# Run Name
run_name: # If left blank, will be read from env var $RUN_NAME

# Model
model:
  name: hf_causal_lm
  pretrained_model_name_or_path: meta-llama/Llama-2-7b-hf
  init_device: cpu
  pretrained: true
  use_auth_token: true  

# Tokenizer
tokenizer:
  name: meta-llama/Llama-2-7b-hf
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train_small
    shuffle: true
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: true
  num_workers: 8

eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: val_small
    shuffle: false
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 8.0e-5
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 333800ba  # ~ 1.4T tokens
eval_interval: 10000ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 16

# System
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: true
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
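
For reference, the dotted CLI overrides passed to the launch commands below (train_loader.dataset.split=..., eval_loader.dataset.split=...) are merged into this YAML by llm-foundry's scripts/train/train.py via OmegaConf. The following is only a rough sketch of that loading step; the exact code may differ between versions.

# Illustrative sketch, not the exact llm-foundry code: load the YAML above and
# merge dotted CLI overrides, roughly the way scripts/train/train.py does it.
import sys
from omegaconf import OmegaConf as om

yaml_path, args_list = sys.argv[1], sys.argv[2:]
with open(yaml_path) as f:
    yaml_cfg = om.load(f)
cli_cfg = om.from_cli(args_list)        # e.g. train_loader.dataset.split=train_small
cfg = om.merge(yaml_cfg, cli_cfg)
print(om.to_yaml(cfg, resolve=True))    # ${max_seq_len}, ${global_seed} resolve here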

Commands

On Node 0

TRAIN_PATH="/data/olaai/mosaicml/llm-foundry/scripts/train/train.py"
YAML_PATH="/data/olaai/mosaicml/llm-foundry/scripts/train/yamls/pretrain/llama2-7b.yaml"
python3 -m composer --world_size 16 --node_rank 0 --master_addr 10.0.0.11 --master_port 7502 $TRAIN_PATH $YAML_PATH train_loader.dataset.split=train_small eval_loader.dataset.split=val_small

On Node 1

TRAIN_PATH="/data/olaai/mosaicml/llm-foundry/scripts/train/train.py"
YAML_PATH="/data/olaai/mosaicml/llm-foundry/scripts/train/yamls/pretrain/llama2-7b.yaml"
python3 -m composer --world_size 16 --node_rank 1 --master_addr 10.0.0.11 --master_port 7502 $TRAIN_PATH $YAML_PATH train_loader.dataset.split=train_small eval_loader.dataset.split=val_small
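
For context on these flags: --world_size 16 over two nodes means the composer launcher starts one process per GPU on each node, and each process has both a local rank (within its node) and a global rank (across the whole job). The sketch below is only illustrative arithmetic, assuming 8 GPUs per node; the local/global distinction matters for the rank-0 marker file discussed in the comments below.

# Rough sketch of the rank layout implied by the launch commands above,
# assuming 8 GPUs per node (an assumption; the launcher uses one process per GPU).
WORLD_SIZE = 16
NPROC_PER_NODE = 8  # assumed GPUs per node

for node_rank in range(WORLD_SIZE // NPROC_PER_NODE):
    for local_rank in range(NPROC_PER_NODE):
        global_rank = node_rank * NPROC_PER_NODE + local_rank
        print(f"node {node_rank}: local_rank {local_rank} -> global_rank {global_rank}")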

Error Snippet

Getting the following error on the node with node_rank 1:
[Screenshot of the traceback: No such file or directory: '.local_rank0_completed_autoresume']

Note: single-node training works fine with the same YAML config.

palash04 added the bug label on Aug 3, 2023

dakinggg (Collaborator) commented Sep 9, 2023

Hi, are you on a shared file system perhaps? Wondering if the nodes are trying to write and read to the same location.
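
(As an illustration of the failure mode suggested here: the sketch below assumes a per-node barrier built on a local-rank-0 marker file such as '.local_rank0_completed_autoresume'; it is not Composer's actual code. On a shared filesystem, local rank 0 on both nodes touches and unlinks the same path, so one node can remove the file while the other still expects it, which surfaces as FileNotFoundError.)

# Minimal, illustrative sketch (not Composer's actual code) of why a marker
# file keyed only on local rank breaks when two nodes share one filesystem.
import pathlib

MARKER = pathlib.Path(".local_rank0_completed_autoresume")  # same path on every node if the FS is shared

def per_node_signal(local_rank: int) -> None:
    if local_rank == 0:
        MARKER.touch()              # local rank 0 on *each* node creates this one file
    else:
        while not MARKER.exists():  # other local ranks wait for their node's rank 0
            pass
    # ...later, local rank 0 cleans up...
    if local_rank == 0:
        MARKER.unlink()             # the second node to unlink (or a waiter racing a
                                    # premature delete) hits FileNotFoundError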

gpucce commented Sep 21, 2023

> Hi, are you on a shared file system perhaps? Wondering if the nodes are trying to write and read to the same location.

Hi, I had this same issue, and for me the problem is indeed the shared file system. I changed get_local_rank to get_global_rank and it seems to work. However, I would like to ask whether you know of other parts of the codebase where having a shared file system might be an issue.

Thanks in advance!
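
(Purely as a hedged illustration of the change described above, not the actual patch: keying the marker file on the global rank leaves a single writer for the whole job, which is safe here precisely because every rank sees the same shared filesystem. get_local_rank and get_global_rank are the composer.utils.dist helpers mentioned in the comment; the surrounding logic is assumed.)

# Illustrative sketch of the described workaround, assuming the composer
# launcher has already set the usual distributed environment variables.
# get_local_rank/get_global_rank are composer.utils.dist helpers; the rest is assumed.
import pathlib
from composer.utils import dist

marker = pathlib.Path(".local_rank0_completed_autoresume")

# Before: one writer per node -> two writers colliding on one shared path.
is_writer = dist.get_local_rank() == 0

# After: a single writer for the whole job; every other rank just waits for the
# file, which works precisely because the filesystem is shared across nodes.
is_writer = dist.get_global_rank() == 0

if is_writer:
    marker.touch()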

YixinSong-e commented

I'm running into the same issue. Did you find any other code that needs to be updated?

gpucce commented Sep 26, 2023

> I'm running into the same issue. Did you find any other code that needs to be updated?

I ran a full fine-tuning and everything seems to work, though I'm not sure there aren't any hidden issues.

mvpatel2000 (Collaborator) commented

Thanks for flagging this. We're working to get a patch release out that will fix this, ideally within 1-2 days. We're still discussing the right solution, though, since it is a bit tricky for us to test: we do not have shared filesystems in our clusters.

mvpatel2000 (Collaborator) commented

#629 should hopefully address the above issues.

mvpatel2000 (Collaborator) commented

Closing based on #629. Please reopen if you still have issues!
