diff --git a/docs/source/usage_guides/distributed_inference.md b/docs/source/usage_guides/distributed_inference.md
index 82fdc21031d..4e9c9c6a947 100644
--- a/docs/source/usage_guides/distributed_inference.md
+++ b/docs/source/usage_guides/distributed_inference.md
@@ -148,7 +148,7 @@ This next part will discuss using *pipeline parallelism*. This is an **experimen
 The general idea with pipeline parallelism is: say you have 4 GPUs and a model big enough it can be *split* on four GPUs using `device_map="auto"`. With this method you can send in 4 inputs at a time (for example here, any amount works) and each model chunk will work on an input, then receive the next input once the prior chunk finished, making it *much* more efficient **and faster** than the method described earlier. Here's a visual taken from the PyTorch repository:
 
-![PiPPy example](https://camo.githubusercontent.com/681d7f415d6142face9dd1b837bdb2e340e5e01a58c3a4b119dea6c0d99e2ce0/68747470733a2f2f692e696d6775722e636f6d2f657955633934372e706e67)
+![Pipeline parallelism example](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/accelerate/pipeline_parallel.png)
 
 To illustrate how you can use this with Accelerate, we have created an [example zoo](https://github.com/huggingface/accelerate/tree/main/examples/inference) showcasing a number of different models and situations. In this tutorial, we'll show this method for GPT2 across two GPUs.
 
@@ -168,7 +168,7 @@ model = GPT2ForSequenceClassification(config)
 model.eval()
 ```
 
-Next you'll need to create some example inputs to use. These help PiPPy trace the model.
+Next you'll need to create some example inputs to use. These help `torch.distributed.pipelining` trace the model.
 
 However you make this example will determine the relative batch size that will be used/passed
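
For context on the last changed line: the "example inputs" the updated sentence refers to could look something like the sketch below for the GPT2 setup used in this doc. The shape, batch size of 2 (one item per GPU), and sequence length here are illustrative assumptions, not part of this diff:

```python
import torch
from transformers import GPT2Config

config = GPT2Config()

# Random token ids standing in for real inputs; tracing only needs tensors
# with a representative shape and dtype, not meaningful text.
example_inputs = torch.randint(
    low=0,
    high=config.vocab_size,
    size=(2, 1024),  # batch size x sequence length; 2 items -> one per GPU
    device="cpu",
    dtype=torch.int64,
    requires_grad=False,
)
```

The batch dimension chosen here is what the trailing context line in the diff alludes to: it sets the relative batch size that gets split across the pipeline stages at inference time.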