From 712fd27b3f9a6f20812447060f00965f2623e1c2 Mon Sep 17 00:00:00 2001
From: Hamel Husain
Date: Wed, 13 Dec 2023 14:22:52 -0800
Subject: [PATCH] Add docs (#947)

* move section
* update README
* update README
* update README
* update README
* update README
* Update README.md

Co-authored-by: Wing Lian

---------

Co-authored-by: Wing Lian
---
 README.md | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/README.md b/README.md
index 44fd7d57f3..c03eec54b4 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,9 @@ Features:
   - [Train](#train)
   - [Inference](#inference)
   - [Merge LORA to Base](#merge-lora-to-base)
+  - [Special Tokens](#special-tokens)
 - [Common Errors](#common-errors-)
+  - [Tokenization Mismatch b/w Inference & Training](#tokenization-mismatch-bw-inference--training)
 - [Need Help?](#need-help-)
 - [Badge](#badge-)
 - [Community Showcase](#community-showcase)
@@ -251,6 +253,13 @@ Have dataset(s) in one of the following format (JSONL recommended):
   ```json
   {"conversations": [{"from": "...", "value": "..."}]}
   ```
+- `llama-2`: the JSON is the same format as `sharegpt` above, with the following config (see the [config section](#config) for more details)
+  ```yml
+  datasets:
+    - path:
+      type: sharegpt
+      conversation: llama-2
+  ```
 - `completion`: raw corpus
   ```json
   {"text": "..."}
   ```
@@ -970,6 +979,22 @@ wandb_name:
 wandb_log_model:
 ```
 
+##### Special Tokens
+
+It is important to have special tokens like delimiters, end-of-sequence, and beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like so:
+
+```yml
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
+tokens: # these are delimiters
+  - "<|im_start|>"
+  - "<|im_end|>"
+```
+
+When you include these tokens in your axolotl config, axolotl adds them to the tokenizer's vocabulary.
+
 ### Inference
 
 Pass the appropriate flag to the train command:
@@ -1048,6 +1073,20 @@ It's safe to ignore it.
 
 See the [NCCL](docs/nccl.md) guide.
 
+
+### Tokenization Mismatch b/w Inference & Training
+
+For many formats, Axolotl constructs prompts by concatenating token ids _after_ tokenizing strings. It operates on token ids rather than strings to maintain precise accounting for attention masks.
+
+If you decode a prompt constructed by axolotl, you might see spaces between tokens (or the lack thereof) that you do not expect, especially around delimiters and special tokens. When you are starting out with a new format, you should always do the following:
+
+1. Materialize some data using `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer.
+2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string.
+3. Make sure the inference string from #2 looks **exactly** like the data you fine-tuned on from #1, including spaces and newlines. If they aren't the same, adjust your inference server accordingly.
+4. As an additional troubleshooting step, you can compare the token ids from #1 and #2 to make sure they are identical.
+
+Misalignment between your training and inference prompts can cause models to perform very poorly, so it is worth checking. See [this blog post](https://hamel.dev/notes/llm/05_tokenizer_gotchas.html) for a concrete example.
+
 ## Need help?
 
 🙋‍♂️ Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you
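
To make the new Special Tokens section concrete: registering tokens with the tokenizer also means the model's embedding matrix has to grow to match the larger vocabulary. Below is a minimal sketch of the equivalent steps with the Hugging Face `transformers` API; it illustrates the mechanism rather than axolotl's actual implementation, and the checkpoint name is only an example.

```python
# Minimal sketch (not axolotl's code) of what the special_tokens / tokens
# config entries translate to in the Hugging Face transformers API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NousResearch/Llama-2-7b-hf"  # example checkpoint, not prescribed by the patch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# special_tokens -> tokenizer.add_special_tokens
tokenizer.add_special_tokens(
    {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"}
)
# tokens (delimiters) -> tokenizer.add_tokens; returns how many were actually new
num_added = tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])

# New vocabulary entries need embedding rows, otherwise their ids index out of range.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```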
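The tokenization-mismatch section says prompts are built by concatenating token ids rather than strings so that attention and label masks stay exact. Here is a minimal sketch of that idea, assuming ChatML-style delimiters and the same example checkpoint (neither is prescribed by the patch):

```python
# Minimal sketch of prompt construction by token-id concatenation.
# Because each segment's length in tokens is known, the prompt portion can be
# masked out of the loss exactly. Illustration only, not axolotl's code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")  # example

prompt = "<|im_start|>user\nHello!<|im_end|>\n"
response = "<|im_start|>assistant\nHi!<|im_end|>\n"

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids  # -100 = ignored by the loss

# Decoding the concatenation is how you spot unexpected spaces around
# delimiters: this string may not equal prompt + response exactly.
print(tokenizer.decode(input_ids))
```

Had the segments been joined as strings and tokenized in one pass, the boundary between prompt and response tokens could shift, and the label mask above would no longer line up.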
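Finally, a sketch of the four-step check itself. The prepared-data path and the inference-side prompt here are placeholders: `last_run_prepared` is the usual `dataset_prepared_path` default (axolotl may nest a hash directory inside it), and the prompt stands in for whatever your inference server actually sends.

```python
# Minimal sketch of the train-vs-inference check (steps 1-4 above), assuming
# you already ran `python -m axolotl.cli.preprocess your_config.yml --debug`.
from datasets import load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")  # example

# Step 1: decode a materialized training row.
prepared_path = "last_run_prepared"  # adjust to your dataset_prepared_path / hash dir
ds = load_from_disk(prepared_path)
train_text = tokenizer.decode(ds[0]["input_ids"])

# Step 2: decode exactly what you are about to send to the model at inference.
inference_ids = tokenizer("<|im_start|>user\nHello!<|im_end|>\n")["input_ids"]
inference_text = tokenizer.decode(inference_ids)

# Steps 3-4: the strings (and ideally the ids) must match exactly,
# including spaces and newlines; repr() makes whitespace visible.
print(repr(train_text))
print(repr(inference_text))
assert train_text.startswith(inference_text), "tokenization mismatch!"
```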