diff --git a/docs/source/concept_guides/big_model_inference.md b/docs/source/concept_guides/big_model_inference.md
index ddce9114cdc..4e09adae686 100644
--- a/docs/source/concept_guides/big_model_inference.md
+++ b/docs/source/concept_guides/big_model_inference.md
@@ -154,7 +154,7 @@ By passing `device_map="auto"`, we tell 🤗 Accelerate to determine automatical
 #### `no_split_module_classes`
 
 This parameter will indicate that some of the modules with the name `"Block"` should not be split across different devices. You should set here all blocks that
-include a residutal connection of some kind.
+include a residual connection of some kind.
 
 
 #### The `device_map`
diff --git a/docs/source/concept_guides/gradient_synchronization.md b/docs/source/concept_guides/gradient_synchronization.md
index 9010628ef7f..7ae8ab6853f 100644
--- a/docs/source/concept_guides/gradient_synchronization.md
+++ b/docs/source/concept_guides/gradient_synchronization.md
@@ -55,8 +55,8 @@ their gradients computed, collated, and updated before moving on to the next
 batch of data.
 When performing gradient accumulation, you accumulate `n` loss gradients and
 skip `optimizer.step()` until `n` batches have been reached. As all training
-processes only need to sychronize by the time `optimizer.step()` is called,
-without any modification to your training step, this neededless inter-process
+processes only need to synchronize by the time `optimizer.step()` is called,
+without any modification to your training step, this needless inter-process
 communication can cause a significant slowdown.
 
 How can you avoid this overhead?
diff --git a/docs/source/usage_guides/distributed_inference.md b/docs/source/usage_guides/distributed_inference.md
index 3bdd7121401..41053658482 100644
--- a/docs/source/usage_guides/distributed_inference.md
+++ b/docs/source/usage_guides/distributed_inference.md
@@ -51,7 +51,7 @@ def run_inference(rank, world_size):
 One will notice how we have to check the rank to know what prompt to send, which can be a bit tedious.
 
 A user might then also think that with 🤗 Accelerate, using the `Accelerator` to prepare a dataloader for such a task might also be
-a simple way to manage this. (To learn more, check out the relvent section in the [Quick Tour](../quicktour#distributed-evaluation))
+a simple way to manage this. (To learn more, check out the relevant section in the [Quick Tour](../quicktour#distributed-evaluation))
 
 Can it manage it? Yes. Does it add unneeded extra code however: also yes.
 
diff --git a/docs/source/usage_guides/explore.md b/docs/source/usage_guides/explore.md
index 2b4decefa2a..533c4cf444f 100644
--- a/docs/source/usage_guides/explore.md
+++ b/docs/source/usage_guides/explore.md
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
 # Learning how to incorporate 🤗 Accelerate features quickly!
 
 Please use the interactive tool below to help you get started with learning about a particular
-feature of 🤗 Accelerate and how to utilize it! It will provide you with a code diff, an explaination
+feature of 🤗 Accelerate and how to utilize it! It will provide you with a code diff, an explanation
 towards what is going on, as well as provide you with some useful links to explore more within
 the documentation!
 
diff --git a/docs/source/usage_guides/megatron_lm.md b/docs/source/usage_guides/megatron_lm.md
index 7b6822086da..25bea1f58d2 100644
--- a/docs/source/usage_guides/megatron_lm.md
+++ b/docs/source/usage_guides/megatron_lm.md
@@ -128,7 +128,7 @@ Do you want to enable Sequence Parallelism? [YES/no]: 
 What is the Pipeline Parallelism degree/size? [1]:2
 What is the number of micro-batches? [1]:2
 Do you want to enable selective activation recomputation? [YES/no]: 
-Do you want to use distributed optimizer which shards optimizer state and gradients across data pralellel ranks? [YES/no]: 
+Do you want to use distributed optimizer which shards optimizer state and gradients across data parallel ranks? [YES/no]: 
 What is the gradient clipping value based on global L2 Norm (0 to disable)? [1.0]: 
 How many GPU(s) should be used for distributed training? [1]:4
 Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: bf16
@@ -355,8 +355,8 @@ def main():
 
 2. For using the Megatron-LM datasets, a few more changes are required. Dataloaders for these datasets
 are available only on rank 0 of each tensor parallel group. As such, there are rank where dataloader won't be
-avaiable and this requires tweaks to the training loop. Being able to do all this shows how
-felixble and extensible 🤗 Accelerate is. The changes required are as follows.
+available and this requires tweaks to the training loop. Being able to do all this shows how
+flexible and extensible 🤗 Accelerate is. The changes required are as follows.
 
 a. For Megatron-LM indexed datasets, we need to use `MegatronLMDummyDataLoader`
 and pass the required dataset args to it such as `data_path`, `seq_length` etc.
@@ -547,7 +547,7 @@ The `model(**batch_data)` call return loss(es) averaged across the data parallel
 This is fine for most cases wherein pre-training jobs are run using Megatron-LM features and
 you can easily compute the `perplexity` using the loss.
 For GPT model, returning logits in addition to loss(es) is supported.
-These logits aren't gathered across data prallel ranks. Use `accelerator.utils.gather_across_data_parallel_groups`
+These logits aren't gathered across data parallel ranks. Use `accelerator.utils.gather_across_data_parallel_groups`
 to gather logits across data parallel ranks. These logits along with labels can be used for computing various
 performance metrics.
 
diff --git a/docs/source/usage_guides/training_zoo.md b/docs/source/usage_guides/training_zoo.md
index 42dfe18a9f3..2a7f51d2873 100644
--- a/docs/source/usage_guides/training_zoo.md
+++ b/docs/source/usage_guides/training_zoo.md
@@ -15,7 +15,7 @@ rendered properly in your Markdown viewer.
 
 # Example Zoo
 
-Below contains a non-exhuastive list of tutorials and scripts showcasing 🤗 Accelerate
+Below contains a non-exhaustive list of tutorials and scripts showcasing 🤗 Accelerate
 
 ## Official Accelerate Examples:
 
diff --git a/src/accelerate/commands/config/cluster.py b/src/accelerate/commands/config/cluster.py
index 1090d17ddc3..1331e7fe43c 100644
--- a/src/accelerate/commands/config/cluster.py
+++ b/src/accelerate/commands/config/cluster.py
@@ -451,7 +451,7 @@ def get_cluster_input():
 
         megatron_lm_config[prefix + "use_distributed_optimizer"] = _ask_field(
            "Do you want to use distributed optimizer "
-            "which shards optimizer state and gradients across data pralellel ranks? [YES/no]: ",
+            "which shards optimizer state and gradients across data parallel ranks? [YES/no]: ",
            _convert_yes_no_to_bool,
            default=True,
            error_message="Please enter yes or no.",
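
As an aside on the gradient synchronization passage touched above: the doc's point is that training processes only need to synchronize gradients when `optimizer.step()` actually runs, so the intermediate accumulation steps can skip the inter-process communication. Below is a minimal sketch of that pattern using 🤗 Accelerate's `accumulate` context manager; the toy model, optimizer, and random dataset are purely illustrative and not part of this diff.

```python
import torch
from accelerate import Accelerator

# Illustrative setup: a toy model, optimizer, and random dataset stand in for real ones.
accelerator = Accelerator(gradient_accumulation_steps=4)
model = torch.nn.Linear(64, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 64), torch.randint(0, 2, (256,)))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # Inside `accumulate`, gradients are not synchronized across processes until the
    # accumulation boundary is reached, so the costly all-reduce only happens on the
    # step where the optimizer actually updates the weights.
    with accelerator.accumulate(model):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```

The lower-level `Accelerator.no_sync(model)` context manager covers the same idea manually, if you need to control exactly which backward passes skip the synchronization.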