From 712fd27b3f9a6f20812447060f00965f2623e1c2 Mon Sep 17 00:00:00 2001
From: Hamel Husain
Date: Wed, 13 Dec 2023 14:22:52 -0800
Subject: [PATCH] Add docs (#947)

* move section
* update README
* update README
* update README
* update README
* update README
* Update README.md

Co-authored-by: Wing Lian

---------

Co-authored-by: Wing Lian
---
 README.md | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/README.md b/README.md
index 44fd7d57f3..c03eec54b4 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,9 @@ Features:
   - [Train](#train)
   - [Inference](#inference)
   - [Merge LORA to Base](#merge-lora-to-base)
+  - [Special Tokens](#special-tokens)
 - [Common Errors](#common-errors-)
+  - [Tokenization Mismatch b/w Inference & Training](#tokenization-mismatch-bw-inference--training)
 - [Need Help?](#need-help-)
 - [Badge](#badge-)
 - [Community Showcase](#community-showcase)
@@ -251,6 +253,13 @@ Have dataset(s) in one of the following format (JSONL recommended):
   ```json
   {"conversations": [{"from": "...", "value": "..."}]}
   ```
+- `llama-2`: the JSON is the same format as `sharegpt` above, with the following config (see the [config section](#config) for more details)
+  ```yml
+  datasets:
+    - path:
+      type: sharegpt
+      conversation: llama-2
+  ```
 - `completion`: raw corpus
   ```json
   {"text": "..."}
   ```
@@ -970,6 +979,22 @@ wandb_name:
 wandb_log_model:
 ```
 
+##### Special Tokens
+
+It is important to have special tokens like delimiters, end-of-sequence, and beginning-of-sequence in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like so:
+
+```yml
+special_tokens:
+  bos_token: "<s>"
+  eos_token: "</s>"
+  unk_token: "<unk>"
+tokens: # these are delimiters
+  - "<|im_start|>"
+  - "<|im_end|>"
+```
+
+When you include these tokens in your axolotl config, axolotl adds them to the tokenizer's vocabulary.
+
 ### Inference
 
 Pass the appropriate flag to the train command:
@@ -1048,6 +1073,20 @@ It's safe to ignore it.
 
 See the [NCCL](docs/nccl.md) guide.
 
+
+### Tokenization Mismatch b/w Inference & Training
+
+For many formats, Axolotl constructs prompts by concatenating token ids _after_ tokenizing strings. It operates on token ids rather than strings to maintain precise accounting for attention masks.
+
+If you decode a prompt constructed by axolotl, you might see spaces between tokens (or the lack thereof) that you do not expect, especially around delimiters and special tokens. When you are starting out with a new format, you should always do the following:
+
+1. Materialize some data using `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer.
+2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string.
+3. Make sure the inference string from #2 looks **exactly** like the data you fine-tuned on from #1, including spaces and newlines. If they aren't the same, adjust your inference server accordingly.
+4. As an additional troubleshooting step, you can compare the token ids from #1 and #2 to make sure they are identical.
+
+Misalignment between your training and inference prompts can cause models to perform very poorly, so it is worth checking. See [this blog post](https://hamel.dev/notes/llm/05_tokenizer_gotchas.html) for a concrete example.
+
 ## Need help?
 
 🙋‍♂️ Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you
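
To make the new Special Tokens section concrete: registering tokens with the tokenizer also means the model's embedding matrix has to grow to match the larger vocabulary. Below is a minimal sketch of the equivalent steps with the Hugging Face `transformers` API; it illustrates the mechanism rather than axolotl's actual implementation, and the checkpoint name is only an example.

```python
# Minimal sketch (not axolotl's code) of what the special_tokens / tokens
# config entries translate to in the Hugging Face transformers API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NousResearch/Llama-2-7b-hf"  # example checkpoint, not prescribed by the patch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# special_tokens -> tokenizer.add_special_tokens
tokenizer.add_special_tokens(
    {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"}
)
# tokens (delimiters) -> tokenizer.add_tokens; returns how many were actually new
num_added = tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])

# New vocabulary entries need embedding rows, otherwise their ids index out of range.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```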
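The tokenization-mismatch section says prompts are built by concatenating token ids rather than strings so that attention and label masks stay exact. Here is a minimal sketch of that idea, assuming ChatML-style delimiters and the same example checkpoint (neither is prescribed by the patch):

```python
# Minimal sketch of prompt construction by token-id concatenation.
# Because each segment's length in tokens is known, the prompt portion can be
# masked out of the loss exactly. Illustration only, not axolotl's code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")  # example

prompt = "<|im_start|>user\nHello!<|im_end|>\n"
response = "<|im_start|>assistant\nHi!<|im_end|>\n"

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids  # -100 = ignored by the loss

# Decoding the concatenation is how you spot unexpected spaces around
# delimiters: this string may not equal prompt + response exactly.
print(tokenizer.decode(input_ids))
```

Had the segments been joined as strings and tokenized in one pass, the boundary between prompt and response tokens could shift, and the label mask above would no longer line up.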
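Finally, a sketch of the four-step check itself. The prepared-data path and the inference-side prompt here are placeholders: `last_run_prepared` is the usual `dataset_prepared_path` default (axolotl may nest a hash directory inside it), and the prompt stands in for whatever your inference server actually sends.

```python
# Minimal sketch of the train-vs-inference check (steps 1-4 above), assuming
# you already ran `python -m axolotl.cli.preprocess your_config.yml --debug`.
from datasets import load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")  # example

# Step 1: decode a materialized training row.
prepared_path = "last_run_prepared"  # adjust to your dataset_prepared_path / hash dir
ds = load_from_disk(prepared_path)
train_text = tokenizer.decode(ds[0]["input_ids"])

# Step 2: decode exactly what you are about to send to the model at inference.
inference_ids = tokenizer("<|im_start|>user\nHello!<|im_end|>\n")["input_ids"]
inference_text = tokenizer.decode(inference_ids)

# Steps 3-4: the strings (and ideally the ids) must match exactly,
# including spaces and newlines; repr() makes whitespace visible.
print(repr(train_text))
print(repr(inference_text))
assert train_text.startswith(inference_text), "tokenization mismatch!"
```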