
Commit

TOC updates and guide edits
MKhalusova committed Nov 7, 2023
1 parent 0bfc28b commit d2e12b7
Showing 4 changed files with 50 additions and 156 deletions.
4 changes: 0 additions & 4 deletions docs/source/_toctree.yml
```diff
@@ -37,10 +37,6 @@
     title: Saving and loading training states
   - local: usage_guides/tracking
     title: Using experiment trackers
-  - local: usage_guides/debug
-    title: Debugging timeout errors
-  - local: usage_guides/memory
-    title: How to avoid CUDA Out-of-Memory
   - local: usage_guides/mps
     title: How to use Apple Silicon M1 GPUs
   - local: usage_guides/deepspeed
```
93 changes: 0 additions & 93 deletions docs/source/usage_guides/debug.md

This file was deleted.

58 changes: 0 additions & 58 deletions docs/source/usage_guides/memory.md

This file was deleted.

51 changes: 50 additions & 1 deletion docs/source/usage_guides/troubleshooting.md
@@ -101,9 +101,58 @@ Input shapes:
- Process 1: [1, 2, 5]
```

## CUDA out of memory

One of the most frustrating errors when running training scripts is hitting "CUDA Out-of-Memory": the entire script needs to be restarted and all progress is lost, when typically a developer would rather just start their script and let it run.

`Accelerate` provides a utility heavily based on [toma](https://github.com/BlackHC/toma) to recover from these errors automatically.

### find_executable_batch_size

This algorithm operates with exponential decay, halving the batch size after each failed run, until the training script executes without running out of memory. To use it, restructure your training function to include an inner function wrapped with this decorator, and build your dataloaders inside it. At a minimum, this requires only 4 new lines of code.

> Note: The inner function *must* take the batch size as its first parameter, but we do not pass one to it when calling it. The wrapper handles this for us.

Anything that consumes CUDA memory and is passed to the `accelerator`, such as models and optimizers, **must** be declared inside the inner function.

```diff
def training_function(args):
    accelerator = Accelerator()

+   @find_executable_batch_size(starting_batch_size=args.batch_size)
+   def inner_training_loop(batch_size):
+       nonlocal accelerator # Ensure they can be used in our context
+       accelerator.free_memory() # Free all lingering references
        model = get_model()
        model.to(accelerator.device)
        optimizer = get_optimizer()
        train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
        lr_scheduler = get_scheduler(
            optimizer,
            num_training_steps=len(train_dataloader)*num_epochs
        )
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
        )
        train(model, optimizer, train_dataloader, lr_scheduler)
        validate(model, eval_dataloader)
+   inner_training_loop()
```
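
As an illustrative sketch of the mechanics (the starting batch size, function name, and function body below are placeholders, not taken from the guide above), the decorator can be applied to any function that takes the batch size as its first parameter:

```python
from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=128)
def run_training(batch_size):
    # On an out-of-memory failure, the decorator halves `batch_size` and retries.
    print(f"Attempting training with batch size {batch_size}")
    ...  # build dataloaders, model, and optimizer here, then train

# Called with no arguments; the decorator supplies the batch size.
run_training()
```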

To find out more, check the documentation [here](../package_reference/utilities#accelerate.find_executable_batch_size).

## Non-reproducible results between device setups

If you have changed the device setup and are observing different model performance, it is likely because you did not update your script when moving from one setup to another. The same script with the same batch size will produce different results across TPU, multi-GPU, and single-GPU setups with Accelerate. To reproduce results across setups, make sure to use the same seed, adjust the batch size accordingly, and consider scaling the learning rate.
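
For instance, a minimal sketch (the seed, batch size, and learning rate values are placeholders, and the linear learning-rate scaling is a common heuristic rather than a rule from this guide):

```python
from accelerate import Accelerator
from accelerate.utils import set_seed

accelerator = Accelerator()
set_seed(42)  # use the same seed in every device setup

# Hyperparameters tuned on a single GPU (illustrative values).
per_device_batch_size = 16
single_gpu_lr = 3e-4

# The effective batch size grows with the number of processes, so scaling the
# learning rate linearly is a reasonable starting point before re-tuning.
learning_rate = single_gpu_lr * accelerator.num_processes
```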

For more details, refer to the [Comparing performance between different device setups](../concept_guides/performance) guide.

## Performance issues on different GPUs


0 comments on commit d2e12b7
