Add flash attention for gpt_bigcode #26479

Merged
merged 14 commits on Oct 31, 2023
39 changes: 39 additions & 0 deletions docs/source/en/model_doc/gpt_bigcode.md
@@ -42,6 +42,45 @@ The main differences compared to GPT2.

You can read more about the optimizations in the [original pull request](https://github.com/huggingface/transformers/pull/22575).

## Combining Starcoder and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

Also make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the flash-attn repository. Also make sure to load your model in half-precision (e.g. `torch.float16`).
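
If you are unsure whether your GPU is supported, a quick compute-capability check can help. The snippet below is only a rough sketch and assumes that Flash Attention 2 requires an NVIDIA GPU with compute capability 8.0 or higher (Ampere or newer); refer to the flash-attn documentation for the authoritative requirements.

```python
import torch

# Rough compatibility check. Assumption: Flash Attention 2 needs an NVIDIA GPU
# with compute capability >= 8.0 (Ampere or newer); see the flash-attn docs
# for the exact, up-to-date requirements.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Detected compute capability {major}.{minor}")
    if major >= 8:
        print("This GPU should be compatible with Flash Attention 2.")
    else:
        print("This GPU is likely not supported by Flash Attention 2.")
else:
    print("No CUDA device detected; Flash Attention 2 requires a GPU.")
```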

To load and run a model using Flash Attention 2, refer to the snippet below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder", torch_dtype=torch.float16, use_flash_attention_2=True)
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")

prompt = "def hello_world():"

model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
model.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
tokenizer.batch_decode(generated_ids)[0]
"The expected outupt"
```

### Expected speedups

Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using the `bigcode/starcoder` checkpoint and the Flash Attention 2 version of the model, for two different sequence lengths.

<div style="text-align: center">
<img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/starcoder-speedup.png">
</div>
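
If you want to measure the speedup on your own hardware, a minimal timing sketch along these lines could be used. It is not the script that produced the diagram above: the prompt, the number of generated tokens and the number of runs are illustrative placeholders, and a CUDA GPU with enough memory to hold the model in `torch.float16` is assumed.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Repeat a short snippet to get a longer prompt; purely illustrative.
prompt = "def hello_world():\n    print('Hello world!')\n\n" * 32


def benchmark(use_flash_attention_2: bool, n_runs: int = 5) -> float:
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        torch_dtype=torch.float16,
        use_flash_attention_2=use_flash_attention_2,
    ).to("cuda")
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Warm-up run so that kernel compilation and caching are not timed.
    model.generate(**inputs, max_new_tokens=64, do_sample=False)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_new_tokens=64, do_sample=False)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / n_runs
    # Free the model so both variants fit on the same GPU sequentially.
    del model
    torch.cuda.empty_cache()
    return elapsed


native = benchmark(use_flash_attention_2=False)
flash = benchmark(use_flash_attention_2=True)
print(f"native: {native:.2f}s  flash-attn-2: {flash:.2f}s  speedup: {native / flash:.2f}x")
```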


## GPTBigCodeConfig

[[autodoc]] GPTBigCodeConfig
1 change: 1 addition & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -34,6 +34,7 @@ We natively support Flash Attention 2 for the following models:
- Llama
- Mistral
- Falcon
- [GPTBigCode (Starcoder)](model_doc/gpt_bigcode#)
Collaborator

Why do we link to the model page here but not for e.g. falcon or llama?

Contributor Author (@susnato, Oct 30, 2023)

Hey, if I remember it correctly, it was asked by @ArthurZucker, here.

Collaborator

OK - it doesn't really matter so I'm happy to leave it and find out when Arthur's back what that reasoning is :)

Contributor Author

Ok then leaving it as it is.

You can request Flash Attention 2 support for more models by opening an issue on GitHub, or even open a Pull Request to integrate the changes yourself. The supported models can be used for inference and training, including training with padding tokens - *which is currently not supported for the `BetterTransformer` API below.*
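
As an illustration of training with padding tokens, here is a minimal sketch of a single training step on a padded batch with a Flash Attention 2 model. It is not taken from the documentation: the checkpoint and the prompts are placeholders, and reusing the EOS token as the padding token is an assumption for tokenizers that ship without one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"  # placeholder; any supported model works the same way
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    # Assumption: reuse the EOS token for padding if no pad token is defined.
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, use_flash_attention_2=True
).to("cuda")

# Two prompts of different lengths, so the batch needs padding.
batch = tokenizer(
    ["def hello_world():", "def add(a, b):\n    return a + b"],
    return_tensors="pt",
    padding=True,
).to("cuda")

# Standard causal-LM labels; padded positions are masked out of the loss with -100.
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100

outputs = model(**batch, labels=labels)
outputs.loss.backward()
```

In practice this would sit inside a full training loop with an optimizer and, typically, mixed-precision utilities, rather than calling `backward` on a bare half-precision model.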
