diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index f3e16e965f4..23e691cb515 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -37,10 +37,6 @@
     title: Saving and loading training states
   - local: usage_guides/tracking
     title: Using experiment trackers
-  - local: usage_guides/debug
-    title: Debugging timeout errors
-  - local: usage_guides/memory
-    title: How to avoid CUDA Out-of-Memory
   - local: usage_guides/mps
     title: How to use Apple Silicon M1 GPUs
   - local: usage_guides/deepspeed
diff --git a/docs/source/usage_guides/debug.md b/docs/source/usage_guides/debug.md
deleted file mode 100644
index 937779a648c..00000000000
--- a/docs/source/usage_guides/debug.md
+++ /dev/null
@@ -1,93 +0,0 @@
-
-# Debugging Distributed Operations
-
-When running scripts in a distributed fashion, often functions such as [`Accelerator.gather`] and [`Accelerator.reduce`] (and others) are neccessary to grab tensors across devices and perform certain operations on them. However, if the tensors which are being grabbed are not the proper shapes then this will result in your code hanging forever. The only sign that exists of this truly happening is hitting a timeout exception from `torch.distributed`, but this can get quite costly as usually the timeout is 10 minutes.
-
-Accelerate now has a `debug` mode which adds a neglible amount of time to each operation, but allows it to verify that the inputs you are bringing in can *actually* perform the operation you want **without** hitting this timeout problem!
-
-## Visualizing the problem
-
-To have a tangible example of this issue, let's take the following setup (on 2 GPUs):
-
-```python
-from accelerate import PartialState
-
-state = PartialState()
-if state.process_index == 0:
-    tensor = torch.tensor([[0.0, 1, 2, 3, 4]]).to(state.device)
-else:
-    tensor = torch.tensor([[[0.0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]]).to(state.device)
-
-broadcast_tensor = broadcast(tensor)
-print(broadcast_tensor)
-```
-
-We've created a single tensor on each device, with two radically different shapes. With this setup if we want to perform an operation such as [`utils.broadcast`], we would forever hit a timeout because `torch.distributed` requires that these operations have the **exact same shape** across all processes for it to work.
-
-If you run this yourself, you will find that `broadcast_tensor` can be printed on the main process, but its results won't quite be right, and then it will just hang never printing it on any of the other processes:
-
-```
->>> tensor([[0, 1, 2, 3, 4]], device='cuda:0')
-```
-
-## The solution
-
-By enabling Accelerate's operational debug mode, Accelerate will properly find and catch errors such as this and provide a very clear traceback immediatly:
-
-```
-Traceback (most recent call last):
-  File "/home/zach_mueller_huggingface_co/test.py", line 18, in
-    main()
-  File "/home/zach_mueller_huggingface_co/test.py", line 15, in main
-    main()broadcast_tensor = broadcast(tensor)
-  File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/utils/operations.py", line 303, in wrapper
-    broadcast_tensor = broadcast(tensor)
-accelerate.utils.operations.DistributedOperationException: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid.
-
-Operation: `accelerate.utils.operations.broadcast`
-Input shapes:
-  - Process 0: [1, 5]
-  - Process 1: [1, 2, 5]
-```
-
-This explains that the shapes across our devices were *not* the same, and that we should ensure that they match properly to be compatible. Typically this means that there is either an extra dimension, or certain dimensions are incompatible with the operation.
-
-To enable this please do one of the following:
-
-Enable it through the questionarre during `accelerate config` (recommended)
-
-From the CLI:
-
-```
-accelerate launch --debug {my_script.py} --arg1 --arg2
-```
-
-As an environmental variable (which avoids the need for `accelerate launch`):
-
-```
-ACCELERATE_DEBUG_MODE="1" accelerate launch {my_script.py} --arg1 --arg2
-```
-
-Manually changing the `config.yaml` file:
-
-```diff
- compute_environment: LOCAL_MACHINE
-+debug: true
-```
-
diff --git a/docs/source/usage_guides/memory.md b/docs/source/usage_guides/memory.md
deleted file mode 100644
index a837ea17d1d..00000000000
--- a/docs/source/usage_guides/memory.md
+++ /dev/null
@@ -1,58 +0,0 @@
-
-# Memory Utilities
-
-One of the most frustrating errors when it comes to running training scripts is hitting "CUDA Out-of-Memory",
-as the entire script needs to be restarted, progress is lost, and typically a developer would want to simply
-start their script and let it run.
-
-`Accelerate` provides a utility heavily based on [toma](https://github.com/BlackHC/toma) to give this capability.
-
-## find_executable_batch_size
-
-This algorithm operates with exponential decay, decreasing the batch size in half after each failed run on some
-training script. To use it, restructure your training function to include an inner function that includes this wrapper,
-and build your dataloaders inside it. At a minimum, this could look like 4 new lines of code.
-> Note: The inner function *must* take in the batch size as the first parameter, but we do not pass one to it when called. The wrapper handles this for us
-
-It should also be noted that anything which will consume CUDA memory and passed to the `accelerator` **must** be declared inside the inner function,
-such as models and optimizers.
-
-```diff
-def training_function(args):
-    accelerator = Accelerator()
-
-+   @find_executable_batch_size(starting_batch_size=args.batch_size)
-+   def inner_training_loop(batch_size):
-+       nonlocal accelerator # Ensure they can be used in our context
-+       accelerator.free_memory() # Free all lingering references
-        model = get_model()
-        model.to(accelerator.device)
-        optimizer = get_optimizer()
-        train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
-        lr_scheduler = get_scheduler(
-            optimizer,
-            num_training_steps=len(train_dataloader)*num_epochs
-        )
-        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
-            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
-        )
-        train(model, optimizer, train_dataloader, lr_scheduler)
-        validate(model, eval_dataloader)
-+   inner_training_loop()
-```
-
-To find out more, check the documentation [here](../package_reference/utilities#accelerate.find_executable_batch_size).
diff --git a/docs/source/usage_guides/troubleshooting.md b/docs/source/usage_guides/troubleshooting.md
index 29b29779f1e..f7fdc190788 100644
--- a/docs/source/usage_guides/troubleshooting.md
+++ b/docs/source/usage_guides/troubleshooting.md
@@ -101,9 +101,58 @@ Input shapes:
   - Process 1: [1, 2, 5]
 ```
 
+## CUDA out of memory
+
+One of the most frustrating errors when running training scripts is hitting "CUDA Out-of-Memory": the entire script
+needs to be restarted, progress is lost, and typically a developer would simply like to start their script
+and let it run.
+
+`Accelerate` provides the `find_executable_batch_size` utility, heavily based on [toma](https://github.com/BlackHC/toma), to give you this capability.
+
+### find_executable_batch_size
+
+This algorithm operates with exponential decay, halving the batch size after each failed run of your
+training script. To use it, restructure your training function to include an inner function wrapped with this decorator,
+and build your dataloaders inside it. At a minimum, this only requires 4 new lines of code.
+> Note: The inner function *must* take the batch size as its first parameter, but we do not pass one to it when it is called. The wrapper handles this for us.
+
+Anything which will consume CUDA memory and is passed to the `accelerator`, such as models and optimizers,
+**must** be declared inside the inner function.
+
+```diff
+def training_function(args):
+    accelerator = Accelerator()
+
++   @find_executable_batch_size(starting_batch_size=args.batch_size)
++   def inner_training_loop(batch_size):
++       nonlocal accelerator # Ensure it can be used in our context
++       accelerator.free_memory() # Free all lingering references
+        model = get_model()
+        model.to(accelerator.device)
+        optimizer = get_optimizer()
+        train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
+        lr_scheduler = get_scheduler(
+            optimizer,
+            num_training_steps=len(train_dataloader)*num_epochs
+        )
+        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
+            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
+        )
+        train(model, optimizer, train_dataloader, lr_scheduler)
+        validate(model, eval_dataloader)
++   inner_training_loop()
+```
+
+To find out more, check the documentation [here](../package_reference/utilities#accelerate.find_executable_batch_size).
+
 ## Non-reproducible results between device setups
 
-https://huggingface.co/docs/accelerate/concept_guides/performance
+If you have changed the device setup and are observing different model performance, it is likely that you did not update
+your script when moving from one setup to another. The same script run with the same batch size on a TPU, on multiple GPUs,
+and on a single GPU with Accelerate will produce different results. To reproduce results across setups, make sure to use
+the same seed, adjust the batch size accordingly, and consider scaling the learning rate.
+
+For more details, refer to the [Comparing performance between different device setups](../concept_guides/performance) guide.
 
 ## Performance issues on different GPUs
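As a concrete illustration of the reproducibility guidance above ("same seed, adjusted batch size, scaled learning rate"), here is a minimal sketch of what that can look like in an Accelerate training script. The toy model, dataset, and the `BASE_BATCH_SIZE`/`BASE_LR` constants are purely illustrative, and scaling the learning rate linearly with the number of processes is just one reasonable strategy, not the only one:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator
from accelerate.utils import set_seed

BASE_BATCH_SIZE = 16  # per-device batch size tuned on the single-GPU baseline (illustrative)
BASE_LR = 3e-4        # learning rate tuned on the single-GPU baseline (illustrative)


def main():
    accelerator = Accelerator()
    set_seed(42)  # use the same seed on every process and in every device setup

    # With data parallelism the observed (global) batch size is
    # per-device batch size * number of processes, so either scale the
    # learning rate with the number of processes...
    lr = BASE_LR * accelerator.num_processes
    # ...or shrink the per-device batch size so the global batch size matches
    # the single-GPU baseline. Pick one strategy and apply it consistently.
    batch_size = BASE_BATCH_SIZE

    # Toy stand-ins for a real model and dataset
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()


if __name__ == "__main__":
    main()
```

Launched with `accelerate launch` on a single GPU, on multiple GPUs, or on a TPU, this keeps the randomness and the effective optimization hyperparameters aligned across setups, which is the precondition for comparable results.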