
Commit

TOC updates and guide edits
MKhalusova committed Nov 7, 2023
1 parent 0bfc28b commit d2e12b7
Showing 4 changed files with 50 additions and 156 deletions.
4 changes: 0 additions & 4 deletions docs/source/_toctree.yml
```diff
@@ -37,10 +37,6 @@
     title: Saving and loading training states
   - local: usage_guides/tracking
     title: Using experiment trackers
-  - local: usage_guides/debug
-    title: Debugging timeout errors
-  - local: usage_guides/memory
-    title: How to avoid CUDA Out-of-Memory
   - local: usage_guides/mps
     title: How to use Apple Silicon M1 GPUs
   - local: usage_guides/deepspeed
```
93 changes: 0 additions & 93 deletions docs/source/usage_guides/debug.md

This file was deleted.

58 changes: 0 additions & 58 deletions docs/source/usage_guides/memory.md

This file was deleted.

51 changes: 50 additions & 1 deletion docs/source/usage_guides/troubleshooting.md
@@ -101,9 +101,58 @@ Input shapes:
- Process 1: [1, 2, 5]
```

## CUDA out of memory

One of the most frustrating errors when running training scripts is hitting "CUDA Out-of-Memory": the entire script needs to be restarted and all progress is lost, when typically a developer would rather just start their script and let it run.

`Accelerate` provides a utility heavily based on [toma](https://github.com/BlackHC/toma) to recover from these errors automatically.

### find_executable_batch_size

This algorithm operates with exponential decay, halving the batch size after each failed run, until the training script executes without running out of memory. To use it, restructure your training function to include an inner function wrapped with this decorator, and build your dataloaders inside it. At a minimum, this requires only 4 new lines of code.

> Note: The inner function *must* take the batch size as its first parameter, but we do not pass one to it when calling it. The wrapper handles this for us.

Anything that consumes CUDA memory and is passed to the `accelerator`, such as models and optimizers, **must** be declared inside the inner function.

```diff
def training_function(args):
    accelerator = Accelerator()

+   @find_executable_batch_size(starting_batch_size=args.batch_size)
+   def inner_training_loop(batch_size):
+       nonlocal accelerator # Ensure they can be used in our context
+       accelerator.free_memory() # Free all lingering references
        model = get_model()
        model.to(accelerator.device)
        optimizer = get_optimizer()
        train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
        lr_scheduler = get_scheduler(
            optimizer,
            num_training_steps=len(train_dataloader)*num_epochs
        )
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
        )
        train(model, optimizer, train_dataloader, lr_scheduler)
        validate(model, eval_dataloader)
+   inner_training_loop()
```
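
As an illustrative sketch of the mechanics (the starting batch size, function name, and function body below are placeholders, not taken from the guide above), the decorator can be applied to any function that takes the batch size as its first parameter:

```python
from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=128)
def run_training(batch_size):
    # On an out-of-memory failure, the decorator halves `batch_size` and retries.
    print(f"Attempting training with batch size {batch_size}")
    ...  # build dataloaders, model, and optimizer here, then train

# Called with no arguments; the decorator supplies the batch size.
run_training()
```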

To find out more, check the documentation [here](../package_reference/utilities#accelerate.find_executable_batch_size).

## Non-reproducible results between device setups

If you have changed the device setup and are observing different model performance, it is likely because you did not update your script when moving from one setup to another. The same script with the same batch size will produce different results across TPU, multi-GPU, and single-GPU setups with Accelerate. To reproduce results across setups, make sure to use the same seed, adjust the batch size accordingly, and consider scaling the learning rate.
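
For instance, a minimal sketch (the seed, batch size, and learning rate values are placeholders, and the linear learning-rate scaling is a common heuristic rather than a rule from this guide):

```python
from accelerate import Accelerator
from accelerate.utils import set_seed

accelerator = Accelerator()
set_seed(42)  # use the same seed in every device setup

# Hyperparameters tuned on a single GPU (illustrative values).
per_device_batch_size = 16
single_gpu_lr = 3e-4

# The effective batch size grows with the number of processes, so scaling the
# learning rate linearly is a reasonable starting point before re-tuning.
learning_rate = single_gpu_lr * accelerator.num_processes
```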

For more details, refer to the [Comparing performance between different device setups](../concept_guides/performance) guide.

## Performance issues on different GPUs


0 comments on commit d2e12b7
