diff --git a/docs/multi-node.md b/docs/multi-node.md
new file mode 100644
index 0000000000..3bc5e369bb
--- /dev/null
+++ b/docs/multi-node.md
@@ -0,0 +1,38 @@
+# Multi Node
+
+You will need to create a configuration for accelerate, either by running `accelerate config` and following the instructions, or by using the preset below:
+
+~/.cache/huggingface/accelerate/default_config.yaml
+```yaml
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: FSDP
+downcast_bf16: 'no'
+machine_rank: 0 # Set to 0 for the main machine, increment by one for each other machine
+main_process_ip: 10.0.0.4 # Set to the main machine's IP
+main_process_port: 5000
+main_training_function: main
+mixed_precision: bf16
+num_machines: 2 # Change to the number of machines
+num_processes: 4 # Total number of GPUs across all machines (for example, with 2 machines of 4 GPUs each, put 8)
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+```
+
+Configure your model to use FSDP, for example:
+```yaml
+fsdp:
+  - full_shard
+  - auto_wrap
+fsdp_config:
+  fsdp_offload_params: true
+  fsdp_state_dict_type: FULL_STATE_DICT
+  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
+```
+
+All you have to do now is make sure the port you set as `main_process_port` is open on the main machine (rank 0).
+Then launch with accelerate on each machine as you usually would, and voila!
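Note on the port requirement in the last step: one way to check it from a worker machine is a short TCP probe. The sketch below is only an illustration and is not part of accelerate. `port_is_reachable` is a hypothetical helper name, and the probe assumes something is already listening on the main machine (for example, the rank 0 launch waiting for the other machines), because a connection to an open but unused port fails just as a blocked one does.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration. It returns False both when a firewall
    blocks the port and when nothing is listening on it.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

With the values from the example config, calling `port_is_reachable("10.0.0.4", 5000)` on a worker should return `True` once the main machine is listening on that port.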