Commit

[docs] Fix typos (#2490)
* fix typos

* fix typos

* fix typo

* fix typos

* fix typos

* fix typos

* fix typo

* fix typo

---------

Co-authored-by: Zach Mueller <muellerzr@gmail.com>
omahs and muellerzr authored Mar 1, 2024
1 parent 5fce525 commit 65544d8
Showing 8 changed files with 18 additions and 17 deletions.
4 changes: 2 additions & 2 deletions docs/source/concept_guides/low_precision_training.md
Original file line number Diff line number Diff line change
@@ -34,7 +34,7 @@ MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16

## `TransformersEngine`

-`TransformersEngine` is the first solution to trying to train in 8-bit floating point. It works by using drop-in replacement layers for certain ones in a model that utilize their FP8-engine to reduce the number of bits (such as 32 to 8) without degrading the final accuracy of the model.
+`TransformersEngine` is the first solution to trying to train in 8-bit floating point. It works by using drop-in replacement layers for certain ones in a model that utilizes their FP8-engine to reduce the number of bits (such as 32 to 8) without degrading the final accuracy of the model.

Specifically, 🤗 Accelerate will find and replace the following layers with `TransformersEngine` versions:

@@ -71,4 +71,4 @@ MS-AMP takes a different approach to `TransformersEngine` by providing three dif

## Combining the two

-More experiments need to be performed but it's been noted that combining both MS-AMP and TransformersEngine can lead to the highest throughput by relying on NVIDIA's optimized FP8 operators and utilizing how MS-AMP reduces the memory overhead.
+More experiments need to be performed but it's been noted that combining both MS-AMP and TransformersEngine can lead to the highest throughput by relying on NVIDIA's optimized FP8 operators and utilizing how MS-AMP reduces the memory overhead.
3 changes: 2 additions & 1 deletion docs/source/quicktour.md
@@ -9,7 +9,7 @@ Unless required by applicable law or agreed to in writing, software distributed
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

@@ -27,6 +27,7 @@ This quicktour introduces the three main features of Accelerate:

Accelerate automatically selects the appropriate configuration values for any given distributed training framework (DeepSpeed, FSDP, etc.) through a unified configuration file generated from the [`accelerate config`](../../docs/source/package_reference/cli#accelerate-config) command. You could also pass the configuration values explicitly to the command line which is helpful in certain situations like if you're using SLURM.


But in most cases, you should always run [`accelerate config`](../../docs/source/package_reference/cli#accelerate-config) first to help Accelerate learn about your training setup.

```bash
6 changes: 3 additions & 3 deletions docs/source/usage_guides/deepspeed.md
@@ -9,7 +9,7 @@ Unless required by applicable law or agreed to in writing, software distributed
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

@@ -353,7 +353,7 @@ accelerate launch examples/by_feature/deepspeed_with_config_support.py \
```

**ZeRO++ Config Example**
-You can use the the features of ZeRO++ by using the appropriate config parameters. Note that ZeRO++ is an extension for ZeRO Stage 3. Here is how the config file can be modified, from [DeepSpeed's ZeRO++ tutorial](https://www.deepspeed.ai/tutorials/zeropp/):
+You can use the features of ZeRO++ by using the appropriate config parameters. Note that ZeRO++ is an extension for ZeRO Stage 3. Here is how the config file can be modified, from [DeepSpeed's ZeRO++ tutorial](https://www.deepspeed.ai/tutorials/zeropp/):

```json
{
@@ -519,7 +519,7 @@ ValueError: When using `deepspeed_config_file`, the following accelerate config
['gradient_accumulation_steps', 'gradient_clipping', 'zero_stage', 'offload_optimizer_device', 'offload_param_device',
'zero3_save_16bit_model', 'mixed_precision'].
Please specify them appropriately in the DeepSpeed config file.
-If you are using an accelerate config file, remove others config variables mentioned in the above specified list.
+If you are using an accelerate config file, remove other config variables mentioned in the above specified list.
The easiest method is to create a new config following the questionnaire via `accelerate config`.
It will only ask for the necessary config variables when using `deepspeed_config_file`.
```
2 changes: 1 addition & 1 deletion docs/source/usage_guides/local_sgd.md
@@ -88,7 +88,7 @@ achieved by adding one `with LocalSGD` statement and one call `local_sgd.step()`
+ local_sgd.step()
```

-Under the hood, the Local SGD code **disables** automatic gradient synchornization (but accumulation still works as expected!). Instead it averages model parameters every `local_sgd_steps` steps (as well as in the end of the training loop).
+Under the hood, the Local SGD code **disables** automatic gradient synchronization (but accumulation still works as expected!). Instead it averages model parameters every `local_sgd_steps` steps (as well as at the end of the training loop).
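
The averaging behaviour described in this hunk can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions: the helper names `average_parameters` and `run_local_sgd` are hypothetical, workers are plain Python lists rather than distributed processes, and gradients are constant. It is not the Accelerate `LocalSGD` implementation.

```python
# Toy simulation of Local SGD parameter averaging (no real distributed backend).
# All names here are hypothetical and exist only for illustration.


def average_parameters(worker_params):
    """Replace every worker's parameters with the element-wise mean across workers."""
    num_workers = len(worker_params)
    mean = [sum(values) / num_workers for values in zip(*worker_params)]
    return [list(mean) for _ in range(num_workers)]


def run_local_sgd(worker_params, worker_grads, lr, total_steps, local_sgd_steps):
    """Each worker takes independent SGD steps; parameters are averaged every
    `local_sgd_steps` steps and once more at the end of training."""
    sync_points = []
    for step in range(1, total_steps + 1):
        # Local update: gradients are NOT synchronized between workers here.
        worker_params = [
            [p - lr * g for p, g in zip(params, grads)]
            for params, grads in zip(worker_params, worker_grads)
        ]
        if step % local_sgd_steps == 0:
            worker_params = average_parameters(worker_params)
            sync_points.append(step)
    # Final averaging so that all workers finish with identical parameters.
    worker_params = average_parameters(worker_params)
    return worker_params, sync_points


params, syncs = run_local_sgd(
    worker_params=[[1.0, 2.0], [3.0, 4.0]],
    worker_grads=[[1.0, 1.0], [-1.0, -1.0]],  # constant toy gradients per worker
    lr=0.1,
    total_steps=10,
    local_sgd_steps=4,
)
print(syncs)                   # steps at which periodic averaging happened
print(params[0] == params[1])  # workers agree after the final averaging
```

With `total_steps=10` and `local_sgd_steps=4`, the periodic averaging happens at steps 4 and 8, and the extra averaging at the end covers the two leftover steps.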

## Limitations

8 changes: 4 additions & 4 deletions docs/source/usage_guides/low_precision_training.md
@@ -57,7 +57,7 @@ Of the two, `MS-AMP` is traditionally the easier one to configure as there is on
Currently two levels of optimization are supported in the 🤗 Accelerate integration, `"O1"` and `"O2"` (using the letter 'o', not zero).

* `"O1"` will cast the weight gradients and `all_reduce` communications to happen in 8-bit, while the rest are done in 16 bit. This reduces the general GPU memory usage and speeds up communication bandwidths.
-* `"O2"` will also cast first-order optimizer states into 8 bit, while the second order states are in FP16. (Currently just the `Adam` optimizer is supported). This tries it's best to minimize final accuracy degradation and will save the highest potential memory.
+* `"O2"` will also cast first-order optimizer states into 8 bit, while the second order states are in FP16. (Currently just the `Adam` optimizer is supported). This tries its best to minimize final accuracy degradation and will save the highest potential memory.

To specify an optimization level, pass it to the `FP8KwargsHandler` by setting the `optimization_level` argument:

@@ -70,7 +70,7 @@ accelerator = Accelerator(mixed_precision="fp8", kwarg_handlers=kwargs)

## Configuring TransformersEngine

-TransformersEngine has much more available for customizing how and what FP8 calculations are performed. A full list of supported arguments and what they mean are available in [NVIDIA's documentation](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/common.html), however they are restated as part of [`FP8KwargsHandler`]'s docstring for your convience.
+TransformersEngine has much more available for customizing how and what FP8 calculations are performed. A full list of supported arguments and what they mean are available in [NVIDIA's documentation](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/common.html), however they are restated as part of [`FP8KwargsHandler`]'s docstring for your convenience.

🤗 Accelerate tries to set sensible defaults, but exploring and tweaking the various parameters yourself can lead to better performance potentially.

@@ -83,10 +83,10 @@ kwargs = [FP8RecipeKwargs(backend="te", ...)]
accelerator = Accelerator(mixed_precision="fp8", kwarg_handlers=kwargs)
```

-## Futher Reading
+## Further Reading

To learn more about training in FP8 please check out the following resources:

* [Our concept guide](../concept_guides/low_precision_training.md) detailing into more about both TransformersEngine and MS-AMP
* [The `transformers-engine` documentation](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/common.html)
-* [The `MS-AMP` documentation](https://azure.github.io/MS-AMP/docs/)
+* [The `MS-AMP` documentation](https://azure.github.io/MS-AMP/docs/)
6 changes: 3 additions & 3 deletions docs/source/usage_guides/megatron_lm.md
@@ -9,7 +9,7 @@ Unless required by applicable law or agreed to in writing, software distributed
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

@@ -542,7 +542,7 @@ megatron_lm_plugin = MegatronLMPlugin(other_megatron_args=other_megatron_args)
This covers Decoder only, Encode only and Encoder-Decoder model classes.

2. Only loss is returned from model forward pass as
-there is quite complex interplay of pipeline, tensor and data parallelsim behind the scenes.
+there is quite complex interplay of pipeline, tensor and data parallelism behind the scenes.
The `model(**batch_data)` call return loss(es) averaged across the data parallel ranks.
This is fine for most cases wherein pre-training jobs are run using Megatron-LM features and
you can easily compute the `perplexity` using the loss.
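
As a small illustration of the perplexity remark in this hunk: assuming the returned loss is a mean cross-entropy measured in nats (an assumption, not something this diff states), perplexity is the exponential of that loss. The helper name `perplexity_from_loss` and the loss values below are hypothetical.

```python
import math


def perplexity_from_loss(mean_loss):
    """Perplexity for a mean cross-entropy loss given in nats (assumed)."""
    return math.exp(mean_loss)


# Hypothetical per-batch losses, averaged before exponentiating.
losses = [2.0, 2.5, 3.0]
mean_loss = sum(losses) / len(losses)
print(perplexity_from_loss(mean_loss))
```

Averaging the losses first and exponentiating once is the usual choice; exponentiating each batch loss and averaging the results would give a different, larger number.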
@@ -580,4 +580,4 @@ b. Megatron-LM [GPTModel](https://github.com/NVIDIA/Megatron-LM/blob/main/megatr
c. Megatron-LM [T5Model](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/t5_model.py) :
🤗 transformers models with `t5` in config's model type, e.g.,
[T5](https://huggingface.co/docs/transformers/model_doc/t5) and
-[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)
+[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)
4 changes: 2 additions & 2 deletions docs/source/usage_guides/model_size_estimator.md
@@ -51,7 +51,7 @@ Below are a few gradio demos related to what was described above. The first is t
></iframe>
</div>

-A community member has taken the idea and expended it further, allowing you to filter models directly and see if you can run a particular LLM given GPU constraints and LoRA configurations. To play with it, see [here](https://huggingface.co/spaces/Vokturz/can-it-run-llm) for more details.
+A community member has taken the idea and expanded it further, allowing you to filter models directly and see if you can run a particular LLM given GPU constraints and LoRA configurations. To play with it, see [here](https://huggingface.co/spaces/Vokturz/can-it-run-llm) for more details.

## The Command

@@ -134,4 +134,4 @@ This calculator will tell you how much memory is needed to purely load the model
This calculation is accurate within a few % of the actual value, so it is a very good view of just how much memory it will take. For instance loading `bert-base-cased` actually takes `413.68 MB` when loaded on CUDA in full precision, and the calculator estimates `413.18 MB`.

When performing inference you can expect to add up to an additional 20% as found by [EleutherAI](https://blog.eleuther.ai/transformer-math/). We'll be conducting research into finding a more accurate estimate to these values, and will update
-this calculator once done.
+this calculator once done.
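
The arithmetic behind these figures can be sketched as follows. This is a back-of-the-envelope illustration, not the calculator's actual code: the function names are hypothetical, 1 MB is assumed to mean 1024² bytes, and the parameter count used in the example is made up. The 20% figure is the upper bound quoted in the text above.

```python
def estimate_model_memory_mb(num_parameters, bytes_per_parameter=4):
    """Memory to hold the weights alone, in MB (assuming 1 MB = 1024**2 bytes).

    bytes_per_parameter=4 corresponds to full precision (float32).
    """
    return num_parameters * bytes_per_parameter / 1024**2


def estimate_inference_memory_mb(load_mb, overhead=0.20):
    """Upper bound for inference: the weights plus up to `overhead` extra."""
    return load_mb * (1 + overhead)


# Hypothetical model with 100 million parameters in full precision.
weights_mb = estimate_model_memory_mb(100_000_000)
print(f"weights: {weights_mb:.2f} MB")

# Applying the 20% bound to the 413.18 MB estimate quoted for bert-base-cased.
print(f"inference upper bound: {estimate_inference_memory_mb(413.18):.2f} MB")
```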
2 changes: 1 addition & 1 deletion src/accelerate/commands/config/config_args.py
@@ -45,7 +45,7 @@ def load_config_from_file(config_file):
     if not os.path.isfile(config_file):
         raise FileNotFoundError(
             f"The passed configuration file `{config_file}` does not exist. "
-            "Please pass an existing file to `accelerate launch`, or use the the default one "
+            "Please pass an existing file to `accelerate launch`, or use the default one "
             "created through `accelerate config` and run `accelerate launch` "
             "without the `--config_file` argument."
         )
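
For readers who want to try the corrected message in isolation, here is a minimal standalone sketch of the existence check from this hunk. The helper name `ensure_config_file_exists` is hypothetical; the real `load_config_from_file` goes on to parse the file, which is not shown in this diff and is not reproduced here.

```python
import os


def ensure_config_file_exists(config_file):
    """Hypothetical helper: return the path if it is a file, else raise.

    Mirrors only the existence check shown in the hunk above.
    """
    if not os.path.isfile(config_file):
        raise FileNotFoundError(
            f"The passed configuration file `{config_file}` does not exist. "
            "Please pass an existing file to `accelerate launch`, or use the default one "
            "created through `accelerate config` and run `accelerate launch` "
            "without the `--config_file` argument."
        )
    return config_file


try:
    ensure_config_file_exists("this/path/does/not/exist.yaml")
except FileNotFoundError as err:
    print(err)
```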
