Fix documentation for pre-tokenized dataset (#1894)
The documentation currently says not to add BOS and EOS, claiming that Axolotl adds them, but this is not true.
alpayariyak authored Sep 5, 2024
1 parent 93b769a commit ab461d8
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/dataset-formats/tokenized.qmd
@@ -7,7 +7,7 @@ order: 5
 - Pass an empty `type:` in your axolotl config.
 - Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
 - To indicate that a token should be ignored during training, set its corresponding label to `-100`.
-- Do not add BOS/EOS. Axolotl will add them for you based on the default tokenizer for the model you're using.
+- You must add BOS and EOS, and make sure that you are training on EOS by not setting its label to -100.
 - For pretraining, do not truncate/pad documents to the context window length.
 - For instruction training, documents must be truncated/padded as desired.
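
The corrected requirement can be illustrated with a minimal sketch of how one pre-tokenized row might be assembled. The helper name `build_example`, the token IDs, and the choice to mask BOS and the prompt are assumptions made for illustration; they are not taken from the Axolotl codebase. In practice the BOS and EOS IDs would come from the tokenizer of the model being trained.

```python
IGNORE_INDEX = -100  # label value that marks a token as ignored during training


def build_example(prompt_ids, response_ids, bos_id, eos_id):
    """Build one pre-tokenized row with BOS and EOS added by hand.

    Hypothetical helper: the masking policy (ignore BOS and the prompt,
    train on the response and EOS) is an illustrative assumption.
    """
    input_ids = [bos_id] + list(prompt_ids) + list(response_ids) + [eos_id]
    attention_mask = [1] * len(input_ids)
    # EOS keeps its real ID as the label so the model is trained to emit it.
    labels = (
        [IGNORE_INDEX] * (1 + len(prompt_ids))
        + list(response_ids)
        + [eos_id]
    )
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


# Placeholder IDs: 1 for BOS and 2 for EOS are assumptions, not real vocabulary entries.
row = build_example([10, 11, 12], [20, 21], bos_id=1, eos_id=2)
print(row)
```

The resulting dictionary has exactly the three columns the documentation requires, and the final label is the EOS ID rather than `-100`.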
