Description
I'm interested in using axolotl to pretrain and then finetune on a dataset which I have on HF.
Fixes an issue I experienced because the training script expects the pretraining dataset to have a split named "train" and a single column named "text". I propose adding `split` and `name` to the `PretrainingDataset` parser, along with a new `text_column` field.

The `load_dataset(path, streaming=True)` call used in pretraining currently returns `features: None` or `features: Unknown`, so I propose retrieving the first record to get the column names (these are later passed to `remove_columns`).

As an alternative config change, I could see using `SFTDataset` and removing the separate `PretrainingDataset` type.

How has this been tested?
Output YAML, ran training script in Colab notebook
Sample dataset with one column named "Comment"
My dataset, where each split has a different schema, so I need to specify `split` and `name`
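
For reviewers, here is a minimal sketch of the "retrieve the first record to get column names" idea from the description. It is not the code in this PR: `peek_column_names` and `load_streaming` are hypothetical helper names, and the sketch assumes the streaming dataset yields one dict-like record per example.

```python
from typing import Any, Iterable, List, Mapping, Optional


def peek_column_names(records: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return the keys of the first record, or an empty list if there is none.

    Hypothetical helper: works on any iterable of dict-like records, which is
    what a streaming dataset yields when its `features` are not known up front.
    """
    first = next(iter(records), None)
    return list(first.keys()) if first is not None else []


def load_streaming(path: str, name: Optional[str] = None, split: str = "train"):
    """Hypothetical wrapper showing where `name` and `split` would be passed.

    The import is local so the helper above can be used without `datasets`.
    """
    from datasets import load_dataset

    return load_dataset(path, name=name, split=split, streaming=True)


# Stand-in for a streamed dataset whose only text column is "Comment".
sample_records = [{"Comment": "first row"}, {"Comment": "second row"}]
column_names = peek_column_names(sample_records)
```

With a real dataset the call would look like `peek_column_names(load_streaming(path, name, split))`, and the resulting list is what would be handed to `remove_columns`. Note that peeking consumes the first item of a one-shot generator, so this sketch assumes the dataset object can be iterated again from the start.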