Improve config handling and add a zoo (#3029)
* Improve config handling and add a zoo

* Docs

* rm comment

* Tweak doc
muellerzr authored Aug 20, 2024
1 parent 52fae09 commit 1a6af0b
Showing 13 changed files with 157 additions and 39 deletions.
4 changes: 3 additions & 1 deletion README.md
@@ -157,6 +157,8 @@ accelerate launch --multi_gpu --num_processes 2 examples/nlp_example.py

To learn more, check the CLI documentation available [here](https://huggingface.co/docs/accelerate/package_reference/cli).

Or view the configuration zoo [here](https://github.com/huggingface/accelerate/blob/main/examples/config_yaml_templates/)

## Launching multi-CPU run using MPI

🤗 Here is another way to launch multi-CPU run using MPI. You can learn how to install Open MPI on [this page](https://www.open-mpi.org/faq/?category=building#easy-build). You can use Intel MPI or MVAPICH as well.
@@ -256,7 +258,7 @@ pip install accelerate
- multi-GPU on several nodes (machines)
- TPU
- FP16/BFloat16 mixed precision
- FP8 mixed precision with [Transformer Engine](https://github.com/NVIDIA/TransformerEngine)
- FP8 mixed precision with [Transformer Engine](https://github.com/NVIDIA/TransformerEngine) or [MS-AMP](https://github.com/Azure/MS-AMP/)
- DeepSpeed support (Experimental)
- PyTorch Fully Sharded Data Parallel (FSDP) support (Experimental)
- Megatron-LM support (Experimental)
2 changes: 2 additions & 0 deletions docs/source/quicktour.md
@@ -53,6 +53,8 @@ accelerate launch path_to_script.py --args_for_the_script

To learn more, check out the [Launch distributed code](basic_tutorials/launch) tutorial for more information about launching your scripts.

We also have a [configuration zoo](https://github.com/huggingface/accelerate/blob/main/examples/config_yaml_templates) which showcases a number of premade **minimal** example configurations for a variety of setups you can run.

## Adapt training code

The next main feature of Accelerate is the [`Accelerator`] class which adapts your PyTorch code to run on different distributed setups.
32 changes: 18 additions & 14 deletions examples/README.md
@@ -208,23 +208,13 @@ To run it in each of these various modes, use the following commands:

- [huggan project](https://github.com/huggingface/community-events/tree/main/huggan)


### Using AWS SageMaker integration
- [Examples showcasing AWS SageMaker integration of 🤗 Accelerate.](https://github.com/pacman100/accelerate-aws-sagemaker)


## Simple Multi-GPU Hardware Launcher

[multigpu_remote_launcher.py](./multigpu_remote_launcher.py) is a minimal script that demonstrates launching accelerate
on multiple remote GPUs, and with automatic hardware environment and dependency setup for reproducibility. You can
easily customize the training function used, training arguments, hyperparameters, and type of compute hardware, and then
run the script to automatically launch multi GPU training on remote hardware.

This script uses [Runhouse](https://github.com/run-house/runhouse) to launch on self-hosted hardware (e.g. in your own
cloud account or on-premise cluster) but there are other options for running remotely as well. Runhouse can be installed
with `pip install runhouse`, and you can refer to
[hardware setup](https://runhouse-docs.readthedocs-hosted.com/en/latest/api/python/cluster.html#hardware-setup)
for hardware setup instructions, or this
[Colab tutorial](https://colab.research.google.com/drive/1qVwYyLTCPYPSdz9ZX7BZl9Qm0A3j7RJe) for a more in-depth walkthrough.
## Configuration zoo
In [/config_yaml_templates](./config_yaml_templates/) we have a variety of *minimal* `config.yaml` templates and examples to help you learn
how to create your own configuration files depending on the scenario.
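
For instance, a minimal sketch (assuming PyYAML is installed and you run it from this `examples/` directory) that loads one of these templates and prints the handful of keys it sets:

```python
# Inspect a zoo template to see how little a working config needs.
# Path is relative to the `examples/` directory of the repository.
import yaml

with open("config_yaml_templates/multi_gpu.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# For multi_gpu.yaml this is just distributed_type, mixed_precision and num_processes.
for key, value in config.items():
    print(f"{key}: {value}")
```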

## SLURM Scripts
In [/slurm/submit_multigpu.sh](./slurm/submit_multigpu.sh) and [/slurm/submit_multinode.sh](./slurm/submit_multinode.sh) we present two scripts for running the examples on a machine with [SLURM](https://slurm.schedmd.com/documentation.html) workload manager.
@@ -251,6 +241,20 @@ export PYTHONPATH=/home/nct01/nct01328/transformers-in-supercomputers:$PYTHONPATH
export GPUS_PER_NODE=4
```
## Simple Multi-GPU Hardware Launcher (using an external platform)
[multigpu_remote_launcher.py](./multigpu_remote_launcher.py) is a minimal script that demonstrates launching accelerate
on multiple remote GPUs, and with automatic hardware environment and dependency setup for reproducibility. You can
easily customize the training function used, training arguments, hyperparameters, and type of compute hardware, and then
run the script to automatically launch multi GPU training on remote hardware.
This script uses [Runhouse](https://github.com/run-house/runhouse) to launch on self-hosted hardware (e.g. in your own
cloud account or on-premise cluster) but there are other options for running remotely as well. Runhouse can be installed
with `pip install runhouse`, and you can refer to
[hardware setup](https://runhouse-docs.readthedocs-hosted.com/en/latest/api/python/cluster.html#hardware-setup)
for hardware setup instructions, or this
[Colab tutorial](https://colab.research.google.com/drive/1qVwYyLTCPYPSdz9ZX7BZl9Qm0A3j7RJe) for a more in-depth walkthrough.
## Finer Examples
While the first two scripts are extremely barebones when it comes to what you can do with accelerate, more advanced features are documented in two other locations.
10 changes: 10 additions & 0 deletions examples/config_yaml_templates/README.md
@@ -0,0 +1,10 @@
# Config Zoo

This folder contains a variety of minimal configurations for `Accelerate`, each achieving a certain goal. You can use these
config YAMLs directly, or build off of them for your own YAMLs.

These are highly annotated versions, aiming to teach you what each section does.

Each config can be run via `accelerate launch --config_file {file} run_me.py`

`run_me.py` will then print out how the current environment is set up (the contents of the `AcceleratorState`).
15 changes: 15 additions & 0 deletions examples/config_yaml_templates/deepspeed.yaml
@@ -0,0 +1,15 @@
# Similar to FSDP, we set the distributed type as DEEPSPEED
distributed_type: DEEPSPEED
# With DeepSpeed, we utilize a deepspeed config file for the entire configuration
deepspeed_config:
  # Can also be any of the config json's in accelerate/examples/deepspeed_config_templates
  deepspeed_config_file: ../deepspeed_config_templates/zero_stage1_config.json
  # If using ZeRO-3 and wanting to load big models in, this should be set to `true` so
  # `transformers` uses the right `init` function
  zero3_init_flag: false # true

# Finally we need to specify the number of GPUs to use
num_processes: 2
# Optionally we can set the mixed precision now instead of in the deepspeed config file,
# however this requires the `fp16` and `bf16` options to be set to `auto` in the deepspeed config file
# mixed_precision: "bf16"
18 changes: 18 additions & 0 deletions examples/config_yaml_templates/fp8.yaml
@@ -0,0 +1,18 @@
# This config template simply sets up the TransformersEngine config (and a config for a single GPU),
# this can interop with the other configs in this folder
distributed_type: "NO"
mixed_precision: "fp8"
# Then we specify the fp8 configuration:
fp8_config:
  backend: TE # Can be TE | MS-AMP
  # The following are TE specific arguments.
  # See https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/common.html#common-api for more details
  amax_history_length: 1024
  fp8_format: E4M3
  interval: 1
  margin: 0
  override_linear_precision: false
  # Generally this should always be set to `false` to have the most realistic fp8 eval performance
  use_autocast_during_eval: false
  # If using MS-AMP, we ignore all of the prior and set an opt_level
  #opt_level: O1
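
If you would rather configure fp8 in code than in YAML, a rough sketch of the equivalent (assuming the `FP8RecipeKwargs` handler from `accelerate.utils`, with Transformer Engine installed) is:

```python
# A rough in-code counterpart to fp8.yaml: single process, fp8 mixed precision via Transformer Engine.
# The remaining TE options from the YAML (amax history length, fp8 format, interval, margin, ...)
# should map onto additional keyword arguments of this handler; check your installed version's signature.
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[FP8RecipeKwargs(backend="TE")])
accelerator.print(accelerator.fp8_recipe_handler)
```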
18 changes: 18 additions & 0 deletions examples/config_yaml_templates/fsdp.yaml
@@ -0,0 +1,18 @@
# Since we are doing FSDP (even though it's multi-GPU), we need to specify the distributed type as FSDP
distributed_type: FSDP
# Can be one of "no", "fp16", or "bf16" (see `transformer_engine.yaml` for `fp8`, but it works for FSDP as well)
mixed_precision: 'bf16'
# Specify the number of GPUs to use
num_processes: 2
# Then we can specify the FSDP config
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
6 changes: 6 additions & 0 deletions examples/config_yaml_templates/multi_gpu.yaml
@@ -0,0 +1,6 @@
# Specify distributed_type as `MULTI_GPU` for DDP
distributed_type: "MULTI_GPU"
# Can be one of "no", "fp16", or "bf16" (see `transformer_engine.yaml` for `fp8`)
mixed_precision: "bf16"
# Specify the number of GPUs to use
num_processes: 2
16 changes: 16 additions & 0 deletions examples/config_yaml_templates/multi_node.yaml
@@ -0,0 +1,16 @@
# This config template is for a multi-node setup. This assumes DDP, but can be interop'd with the other configs in this folder
# Generally it's recommended to look at the SLURM config template for a more robust multi-node setup
distributed_type: MULTI_GPU
# We need to specify the current machine's rank
machine_rank: 0
# We then need to specify the IP address and port of the main process
main_process_ip: '1234'
main_process_port: 9999
# We need to specify the number of machines
num_machines: 2
# We need to specify the *total* number of processes
num_processes: 8
# And then we need to specify how rendezvous (`rdzv`) comms will be handled
rdzv_backend: static # or c10d
# Whether the compute nodes are on the same network (for cloud setups this will more than likely be false)
same_network: false
26 changes: 26 additions & 0 deletions examples/config_yaml_templates/run_me.py
@@ -0,0 +1,26 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
A base script which outputs the accelerate config for the given environment
"""
from accelerate import Accelerator


accelerator = Accelerator()

accelerator.print(f"Accelerator state from the current environment:\n{accelerator.state}")
if accelerator.fp8_recipe_handler is not None:
    accelerator.print(f"FP8 config:\n{accelerator.fp8_recipe_handler}")
accelerator.end_training()
4 changes: 4 additions & 0 deletions examples/config_yaml_templates/single_gpu.yaml
@@ -0,0 +1,4 @@
# Since this is single GPU, we don't need distributed training
distributed_type: "NO"
# Can be one of "no", "fp16", or "bf16" (see `transformer_engine.yaml` for `fp8`)
mixed_precision: "bf16"
40 changes: 18 additions & 22 deletions src/accelerate/commands/config/config_args.py
@@ -99,13 +99,17 @@ def _convert_enums(value):
        result = {k: v for k, v in result.items() if v is not None}
        return result

    @classmethod
    def from_json_file(cls, json_file=None):
        json_file = default_json_config_file if json_file is None else json_file
        with open(json_file, encoding="utf-8") as f:
            config_dict = json.load(f)
    @staticmethod
    def process_config(config_dict):
        """
        Processes `config_dict` and sets default values for any missing keys
        """
        if "compute_environment" not in config_dict:
            config_dict["compute_environment"] = ComputeEnvironment.LOCAL_MACHINE
        if "distributed_type" not in config_dict:
            raise ValueError("A `distributed_type` must be specified in the config file.")
        if "num_processes" not in config_dict and config_dict["distributed_type"] == DistributedType.NO:
            config_dict["num_processes"] = 1
        if "mixed_precision" not in config_dict:
            config_dict["mixed_precision"] = "fp16" if ("fp16" in config_dict and config_dict["fp16"]) else None
        if "fp16" in config_dict:  # Convert the config to the new format.
@@ -119,6 +123,14 @@ def from_json_file(cls, json_file=None):
            config_dict["debug"] = False
        if "enable_cpu_affinity" not in config_dict:
            config_dict["enable_cpu_affinity"] = False
        return config_dict

    @classmethod
    def from_json_file(cls, json_file=None):
        json_file = default_json_config_file if json_file is None else json_file
        with open(json_file, encoding="utf-8") as f:
            config_dict = json.load(f)
        config_dict = cls.process_config(config_dict)
        extra_keys = sorted(set(config_dict.keys()) - set(cls.__dataclass_fields__.keys()))
        if len(extra_keys) > 0:
            raise ValueError(
@@ -138,23 +150,7 @@ def from_yaml_file(cls, yaml_file=None):
        yaml_file = default_yaml_config_file if yaml_file is None else yaml_file
        with open(yaml_file, encoding="utf-8") as f:
            config_dict = yaml.safe_load(f)
        if "compute_environment" not in config_dict:
            config_dict["compute_environment"] = ComputeEnvironment.LOCAL_MACHINE
        if "mixed_precision" not in config_dict:
            config_dict["mixed_precision"] = "fp16" if ("fp16" in config_dict and config_dict["fp16"]) else None
        if isinstance(config_dict["mixed_precision"], bool) and not config_dict["mixed_precision"]:
            config_dict["mixed_precision"] = "no"
        if "fp16" in config_dict:  # Convert the config to the new format.
            del config_dict["fp16"]
        if "dynamo_backend" in config_dict:  # Convert the config to the new format.
            dynamo_backend = config_dict.pop("dynamo_backend")
            config_dict["dynamo_config"] = {} if dynamo_backend == "NO" else {"dynamo_backend": dynamo_backend}
        if "use_cpu" not in config_dict:
            config_dict["use_cpu"] = False
        if "debug" not in config_dict:
            config_dict["debug"] = False
        if "enable_cpu_affinity" not in config_dict:
            config_dict["enable_cpu_affinity"] = False
        config_dict = cls.process_config(config_dict)
        extra_keys = sorted(set(config_dict.keys()) - set(cls.__dataclass_fields__.keys()))
        if len(extra_keys) > 0:
            raise ValueError(
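
To see what the refactor buys, here is a minimal sketch (assuming `BaseConfig` is importable from `accelerate.commands.config.config_args`) of the shared `process_config` helper back-filling defaults for a zoo-style minimal config:

```python
# Both the JSON and YAML loaders now funnel through one helper that fills in defaults.
from accelerate.commands.config.config_args import BaseConfig

# A minimal, zoo-style config: only distributed_type and mixed_precision are given.
minimal = {"distributed_type": "NO", "mixed_precision": "bf16"}
filled = BaseConfig.process_config(minimal)

# Defaults such as compute_environment, num_processes (1 for "NO"), debug and
# enable_cpu_affinity are now populated.
print(filled)
```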
5 changes: 3 additions & 2 deletions src/accelerate/utils/dataclasses.py
@@ -1338,10 +1338,11 @@ class FullyShardedDataParallelPlugin:
        },
    )
    sync_module_states: bool = field(
        default=False,
        default=None,
        metadata={
            "help": "Whether each individually wrapped FSDP unit should broadcast module parameters from rank 0 "
            "to ensure they are the same across all ranks after initialization. Defaults to `True`"
            "to ensure they are the same across all ranks after initialization. Defaults to `False` unless "
            "`cpu_ram_efficient_loading` is `True`, then will be forcibly enabled."
        },
    )
    forward_prefetch: bool = field(
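
A small sketch of what the new default means in practice (assuming `cpu_ram_efficient_loading` is accepted as a constructor argument of the plugin in this version):

```python
# sync_module_states now defaults to None and is resolved at init time:
# False normally, but forcibly enabled when cpu_ram_efficient_loading is set.
from accelerate import FullyShardedDataParallelPlugin

plugin = FullyShardedDataParallelPlugin(cpu_ram_efficient_loading=True)
print(plugin.sync_module_states)  # per the updated help text, this should report True
```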