Add a SLURM example with minimal config (#2950)
* Add an example with minimal config

* Improve

* Even more minimal

* Rm slurm arg

* Update examples/slurm/submit_multinode_fsdp.sh

Co-authored-by: Marc Sun <[email protected]>

---------

Co-authored-by: Marc Sun <[email protected]>
muellerzr and SunMarc authored Aug 26, 2024
1 parent 8c3aded commit 654e1d9
Showing 4 changed files with 58 additions and 1 deletion.
12 changes: 12 additions & 0 deletions examples/slurm/fsdp_config.yaml
@@ -0,0 +1,12 @@
distributed_type: FSDP
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
43 changes: 43 additions & 0 deletions examples/slurm/submit_multinode_fsdp.sh
@@ -0,0 +1,43 @@
#!/bin/bash

#SBATCH --job-name=multinode
#SBATCH -D .
#SBATCH --output=O-%x.%j
#SBATCH --error=E-%x.%j
#SBATCH --nodes=4 # number of nodes
#SBATCH --ntasks-per-node=1 # number of MP tasks
#SBATCH --gres=gpu:4 # number of GPUs per node
#SBATCH --cpus-per-task=160 # number of cores per task
#SBATCH --time=01:59:00 # maximum execution time (HH:MM:SS)

######################
### Set environment ##
######################
source activateEnvironment.sh
export GPUS_PER_NODE=4
######################

######################
#### Set network #####
######################
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
######################
export ACCELERATE_DIR="${ACCELERATE_DIR:-/accelerate}"

export LAUNCHER="accelerate launch \
--config ${ACCELERATE_DIR}/examples/slurm/fsdp_config.yaml \
--num_processes $((SLURM_NNODES * GPUS_PER_NODE)) \
--num_machines $SLURM_NNODES \
--rdzv_backend c10d \
--main_process_ip $head_node_ip \
--main_process_port 29500 \
"
export SCRIPT="${ACCELERATE_DIR}/examples/complete_nlp_example.py"
export SCRIPT_ARGS=" \
--mixed_precision fp16 \
--output_dir ${ACCELERATE_DIR}/examples/output \
"

# This step is necessary because accelerate launch does not handle multiline arguments properly
export CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS"
srun $CMD
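
Note: the process-count arithmetic in the launcher above (`SLURM_NNODES * GPUS_PER_NODE`) can be sketched in Python as follows. This is an illustrative sketch only, not part of the commit. The helper name `build_launch_command` and the `HEAD_NODE_IP` key (standing in for the shell variable `head_node_ip`) are assumptions. The flags are copied verbatim from the script.

```python
import shlex


def build_launch_command(env, accelerate_dir="/accelerate"):
    """Hypothetical helper mirroring the LAUNCHER string in the script above."""
    nnodes = int(env["SLURM_NNODES"])          # set by SLURM inside the job
    gpus_per_node = int(env["GPUS_PER_NODE"])  # exported manually in the script
    return [
        "accelerate", "launch",
        "--config", f"{accelerate_dir}/examples/slurm/fsdp_config.yaml",
        # One process per GPU across all nodes.
        "--num_processes", str(nnodes * gpus_per_node),
        "--num_machines", str(nnodes),
        "--rdzv_backend", "c10d",
        # HEAD_NODE_IP is an assumed key; the script derives it via scontrol.
        "--main_process_ip", env["HEAD_NODE_IP"],
        "--main_process_port", "29500",
    ]


# Values matching the #SBATCH header: 4 nodes with 4 GPUs each.
cmd = build_launch_command(
    {"SLURM_NNODES": "4", "GPUS_PER_NODE": "4", "HEAD_NODE_IP": "node001"}
)
print(shlex.join(cmd))
```

With the 4 nodes and 4 GPUs per node requested in the `#SBATCH` header, `--num_processes` resolves to 16.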
2 changes: 1 addition & 1 deletion src/accelerate/commands/config/config_args.py
@@ -177,7 +177,7 @@ def __post_init__(self):

 @dataclass
 class ClusterConfig(BaseConfig):
-    num_processes: int
+    num_processes: int = -1  # For instance if we use SLURM and the user manually passes it in
     machine_rank: int = 0
     num_machines: int = 1
     gpu_ids: Optional[str] = None
2 changes: 2 additions & 0 deletions src/accelerate/commands/launch.py
@@ -1074,6 +1074,8 @@ def _validate_launch_command(args):
         # Silently set the default here
         if args.dynamo_backend is None:
             args.dynamo_backend = "no"
+        if args.num_processes == -1:
+            raise ValueError("You need to manually pass in `--num_processes` using this config yaml.")
     else:
         if args.num_processes is None:
             if args.use_xpu and is_xpu_available():
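
Note: taken together, the two Python hunks make `-1` a sentinel default for `num_processes`, rejected at launch time unless the user overrides it on the command line. The sketch below is a simplified, self-contained illustration of that pattern and is not the actual `accelerate` code. `ClusterConfigSketch` and `resolve_num_processes` are hypothetical names, and the real precedence logic in `_validate_launch_command` may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterConfigSketch:
    """Hypothetical stand-in for ClusterConfig, reduced to three fields."""

    num_processes: int = -1  # sentinel: value omitted from the config yaml
    machine_rank: int = 0
    num_machines: int = 1


def resolve_num_processes(cli_value: Optional[int], config: ClusterConfigSketch) -> int:
    """Prefer the CLI value (assumed precedence), then reject the sentinel."""
    value = config.num_processes if cli_value is None else cli_value
    if value == -1:
        raise ValueError(
            "You need to manually pass in `--num_processes` using this config yaml."
        )
    return value


# A config without num_processes is fine as long as the CLI supplies one.
print(resolve_num_processes(16, ClusterConfigSketch()))
```

Calling `resolve_num_processes(None, ClusterConfigSketch())` raises the `ValueError`, which corresponds to launching with the minimal yaml but without `--num_processes`.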
