diff --git a/docs/source/concept_guides/big_model_inference.md b/docs/source/concept_guides/big_model_inference.md
index ddce9114cdc..4e09adae686 100644
--- a/docs/source/concept_guides/big_model_inference.md
+++ b/docs/source/concept_guides/big_model_inference.md
@@ -154,7 +154,7 @@ By passing `device_map="auto"`, we tell 🤗 Accelerate to determine automatical
 #### `no_split_module_classes`
 
 This parameter will indicate that some of the modules with the name `"Block"` should not be split across different devices. You should set here all blocks that
-include a residutal connection of some kind.
+include a residual connection of some kind.
 
 
 #### The `device_map`
diff --git a/docs/source/concept_guides/gradient_synchronization.md b/docs/source/concept_guides/gradient_synchronization.md
index 9010628ef7f..7ae8ab6853f 100644
--- a/docs/source/concept_guides/gradient_synchronization.md
+++ b/docs/source/concept_guides/gradient_synchronization.md
@@ -55,8 +55,8 @@ their gradients computed, collated, and updated before moving on to the next
 batch of data.
 When performing gradient accumulation, you accumulate `n` loss gradients and
 skip `optimizer.step()` until `n` batches have been reached. As all training
-processes only need to sychronize by the time `optimizer.step()` is called,
-without any modification to your training step, this neededless inter-process
+processes only need to synchronize by the time `optimizer.step()` is called,
+without any modification to your training step, this needless inter-process
 communication can cause a significant slowdown.
 
 How can you avoid this overhead?
diff --git a/docs/source/usage_guides/distributed_inference.md b/docs/source/usage_guides/distributed_inference.md
index 3bdd7121401..41053658482 100644
--- a/docs/source/usage_guides/distributed_inference.md
+++ b/docs/source/usage_guides/distributed_inference.md
@@ -51,7 +51,7 @@ def run_inference(rank, world_size):
 One will notice how we have to check the rank to know what prompt to send, which can be a bit tedious.
 
 A user might then also think that with 🤗 Accelerate, using the `Accelerator` to prepare a dataloader for such a task might also be
-a simple way to manage this. (To learn more, check out the relvent section in the [Quick Tour](../quicktour#distributed-evaluation))
+a simple way to manage this. (To learn more, check out the relevant section in the [Quick Tour](../quicktour#distributed-evaluation))
 
 Can it manage it? Yes. Does it add unneeded extra code however: also yes.
 
diff --git a/docs/source/usage_guides/explore.md b/docs/source/usage_guides/explore.md
index 2b4decefa2a..533c4cf444f 100644
--- a/docs/source/usage_guides/explore.md
+++ b/docs/source/usage_guides/explore.md
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
 # Learning how to incorporate 🤗 Accelerate features quickly!
 
 Please use the interactive tool below to help you get started with learning about a particular
-feature of 🤗 Accelerate and how to utilize it! It will provide you with a code diff, an explaination
+feature of 🤗 Accelerate and how to utilize it! It will provide you with a code diff, an explanation
 towards what is going on, as well as provide you with some useful links to explore more within
 the documentation!
 
diff --git a/docs/source/usage_guides/megatron_lm.md b/docs/source/usage_guides/megatron_lm.md
index 7b6822086da..25bea1f58d2 100644
--- a/docs/source/usage_guides/megatron_lm.md
+++ b/docs/source/usage_guides/megatron_lm.md
@@ -128,7 +128,7 @@ Do you want to enable Sequence Parallelism? [YES/no]: 
 What is the Pipeline Parallelism degree/size? [1]:2
 What is the number of micro-batches? [1]:2
 Do you want to enable selective activation recomputation? [YES/no]: 
-Do you want to use distributed optimizer which shards optimizer state and gradients across data pralellel ranks? [YES/no]: 
+Do you want to use distributed optimizer which shards optimizer state and gradients across data parallel ranks? [YES/no]: 
 What is the gradient clipping value based on global L2 Norm (0 to disable)? [1.0]: 
 How many GPU(s) should be used for distributed training? [1]:4
 Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: bf16
@@ -355,8 +355,8 @@ def main():
 
 2. For using the Megatron-LM datasets, a few more changes are required. Dataloaders for these datasets
 are available only on rank 0 of each tensor parallel group. As such, there are rank where dataloader won't be
-avaiable and this requires tweaks to the training loop. Being able to do all this shows how
-felixble and extensible 🤗 Accelerate is. The changes required are as follows.
+available and this requires tweaks to the training loop. Being able to do all this shows how
+flexible and extensible 🤗 Accelerate is. The changes required are as follows.
 
 a. For Megatron-LM indexed datasets, we need to use `MegatronLMDummyDataLoader`
 and pass the required dataset args to it such as `data_path`, `seq_length` etc.
@@ -547,7 +547,7 @@ The `model(**batch_data)` call return loss(es) averaged across the data parallel
 This is fine for most cases wherein pre-training jobs are run using Megatron-LM features and
 you can easily compute the `perplexity` using the loss.
 For GPT model, returning logits in addition to loss(es) is supported.
-These logits aren't gathered across data prallel ranks. Use `accelerator.utils.gather_across_data_parallel_groups`
+These logits aren't gathered across data parallel ranks. Use `accelerator.utils.gather_across_data_parallel_groups`
 to gather logits across data parallel ranks. These logits along with labels can be used for computing various
 performance metrics.
 
diff --git a/docs/source/usage_guides/training_zoo.md b/docs/source/usage_guides/training_zoo.md
index 42dfe18a9f3..2a7f51d2873 100644
--- a/docs/source/usage_guides/training_zoo.md
+++ b/docs/source/usage_guides/training_zoo.md
@@ -15,7 +15,7 @@ rendered properly in your Markdown viewer.
 
 # Example Zoo
 
-Below contains a non-exhuastive list of tutorials and scripts showcasing 🤗 Accelerate
+Below contains a non-exhaustive list of tutorials and scripts showcasing 🤗 Accelerate
 
 ## Official Accelerate Examples:
 
diff --git a/src/accelerate/commands/config/cluster.py b/src/accelerate/commands/config/cluster.py
index 1090d17ddc3..1331e7fe43c 100644
--- a/src/accelerate/commands/config/cluster.py
+++ b/src/accelerate/commands/config/cluster.py
@@ -451,7 +451,7 @@ def get_cluster_input():
 
         megatron_lm_config[prefix + "use_distributed_optimizer"] = _ask_field(
            "Do you want to use distributed optimizer "
-            "which shards optimizer state and gradients across data pralellel ranks? [YES/no]: ",
+            "which shards optimizer state and gradients across data parallel ranks? [YES/no]: ",
            _convert_yes_no_to_bool,
            default=True,
            error_message="Please enter yes or no.",
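
As an aside on the gradient synchronization passage touched above: the doc's point is that training processes only need to synchronize gradients when `optimizer.step()` actually runs, so the intermediate accumulation steps can skip the inter-process communication. Below is a minimal sketch of that pattern using 🤗 Accelerate's `accumulate` context manager; the toy model, optimizer, and random dataset are purely illustrative and not part of this diff.

```python
import torch
from accelerate import Accelerator

# Illustrative setup: a toy model, optimizer, and random dataset stand in for real ones.
accelerator = Accelerator(gradient_accumulation_steps=4)
model = torch.nn.Linear(64, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 64), torch.randint(0, 2, (256,)))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # Inside `accumulate`, gradients are not synchronized across processes until the
    # accumulation boundary is reached, so the costly all-reduce only happens on the
    # step where the optimizer actually updates the weights.
    with accelerator.accumulate(model):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```

The lower-level `Accelerator.no_sync(model)` context manager covers the same idea manually, if you need to control exactly which backward passes skip the synchronization.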