Commit: Add docs (#947)
* move section

* update README

* update README

* update README

* update README

* update README

* Update README.md

Co-authored-by: Wing Lian <[email protected]>

---------

Co-authored-by: Wing Lian <[email protected]>
hamelsmu and winglian authored Dec 13, 2023
1 parent ef24342 commit 712fd27
Showing 1 changed file with 39 additions and 0 deletions: README.md
@@ -36,7 +36,9 @@ Features:
- [Train](#train)
- [Inference](#inference)
- [Merge LORA to Base](#merge-lora-to-base)
- [Special Tokens](#special-tokens)
- [Common Errors](#common-errors-)
- [Tokenization Mismatch b/w Inference & Training](#tokenization-mismatch-bw-inference--training)
- [Need Help?](#need-help-)
- [Badge](#badge-)
- [Community Showcase](#community-showcase)
@@ -251,6 +253,13 @@ Have dataset(s) in one of the following formats (JSONL recommended):
```json
{"conversations": [{"from": "...", "value": "..."}]}
```
- `llama-2`: the JSON is the same format as `sharegpt` above, with the following config (see the [config section](#config) for more details)
```yml
datasets:
- path: <your-path>
type: sharegpt
conversation: llama-2
```
- `completion`: raw corpus
```json
{"text": "..."}
@@ -970,6 +979,22 @@ wandb_name:
wandb_log_model:
```
##### Special Tokens
It is important to have special tokens like delimiters, end-of-sequence, and beginning-of-sequence tokens in your tokenizer's vocabulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:
```yml
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"
tokens: # these are delimiters
- "<|im_start|>"
- "<|im_end|>"
```
When you include these tokens in your axolotl config, axolotl adds them to the tokenizer's vocabulary.
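If you want to sanity-check the result outside of axolotl, the snippet below is a minimal sketch (not part of axolotl) that mirrors what the config above does using the `transformers` tokenizer directly; the base model name is only an example, substitute the one from your config.
```python
# Sketch: verify that the special tokens and delimiters resolve to real vocabulary entries.
# The base model below is only an example; use the `base_model` from your config.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
tokenizer.add_special_tokens({"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"})
tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])  # the delimiter tokens from the config

for tok in ["<s>", "</s>", "<unk>", "<|im_start|>", "<|im_end|>"]:
    # Each token should map to a single id rather than being split into pieces.
    print(tok, tokenizer.convert_tokens_to_ids(tok))
```
Keep in mind that newly added tokens enlarge the vocabulary, so the model's embedding matrix must be resized to match (in plain `transformers` that is `model.resize_token_embeddings(len(tokenizer))`).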
### Inference
Pass the appropriate flag to the train command:
@@ -1048,6 +1073,20 @@ It's safe to ignore it.
See the [NCCL](docs/nccl.md) guide.


### Tokenization Mismatch b/w Inference & Training

For many formats, Axolotl constructs prompts by concatenating token ids _after_ tokenizing strings. The reason for concatenating token ids rather than operating on strings is to maintain precise accounting for attention masks.

If you decode a prompt constructed by axolotl, you might see spaces between tokens (or lack thereof) that you do not expect, especially around delimiters and special tokens. When you are starting out with a new format, you should always do the following:

1. Materialize some data using `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer.
2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string.
3. Make sure the inference string from step 2 looks **exactly** like the data you fine-tuned on from step 1, including spaces and new lines. If they aren't the same, adjust your inference server accordingly.
4. As an additional troubleshooting step, you can compare the token ids between steps 1 and 2 to make sure they are identical.

A mismatch between your prompts during training and inference can cause models to perform very poorly, so it is worth checking. See [this blog post](https://hamel.dev/notes/llm/05_tokenizer_gotchas.html) for a concrete example.
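As a concrete illustration of steps 1 through 3, here is a rough sketch of how such a comparison might look with plain `transformers` and `datasets`; the preprocessed-data path, the base model name, and the inference prompt placeholder are assumptions and depend on your config (`dataset_prepared_path` and `base_model`).

```python
# Sketch of the training-vs-inference comparison; paths and model name are assumptions.
from glob import glob

from datasets import load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")  # your base model

# 1. Decode a row that axolotl materialized during preprocessing.
#    The directory below is whatever `dataset_prepared_path` resolves to in your config.
prepared_dir = sorted(glob("last_run_prepared/*"))[-1]
train_ids = load_from_disk(prepared_dir)[0]["input_ids"]
train_text = tokenizer.decode(train_ids)

# 2. Decode the exact ids your inference code is about to feed the model.
inference_ids = tokenizer("<your fully templated inference prompt>")["input_ids"]
inference_text = tokenizer.decode(inference_ids)

# 3. repr() makes stray spaces and newlines visible; the inference prompt should line
#    up exactly with the start of the training example.
print(repr(train_text))
print(repr(inference_text))
```

Comparing the raw id lists as well (step 4) catches cases where two different tokenizations happen to decode to similar-looking strings.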

## Need help? 🙋‍♂️

Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you
