fix pretraining_ on odd datasets #1463

mapmeld · 2024-03-30T15:02:39Z

Description

I'm interested in using axolotl to pretrain and then finetune on a dataset which I have on HF.

This fixes an issue I ran into because the training script expects a pretraining dataset to have a split named "train" and a single column named "text". I propose adding split and name to the PretrainingDataset parser, plus a new text_column field.
The load_dataset(path, streaming=True) call used in pretraining currently returns features: None or features: Unknown, so I propose reading the first record to get the column names (these are later passed to remove_columns).
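
A minimal sketch of the "read the first record" idea, assuming the streamed dataset behaves like an iterable of dict-shaped records. The helper name first_record_columns is hypothetical and is not the code in this PR:

```python
from typing import Any, Iterable, List, Mapping


def first_record_columns(records: Iterable[Mapping[str, Any]]) -> List[str]:
    """Infer column names from the first record of an iterable dataset.

    Hypothetical helper: intended for cases where the dataset object
    reports no usable `features` and the schema has to be discovered
    by looking at the data itself.
    """
    try:
        first = next(iter(records))
    except StopIteration:
        raise ValueError("dataset is empty; cannot infer column names") from None
    return list(first.keys())


# Stand-in for a streamed dataset with a single "Comment" column.
sample = [{"Comment": "labdien"}, {"Comment": "paldies"}]
columns = first_record_columns(sample)
print(columns)
```

With the datasets library, the same helper could be applied to the object returned by load_dataset(path, split=..., streaming=True), and the resulting list handed to the remove_columns argument of map.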

As an alternative config change, I could see using SFTDataset and removing the separate PretrainingDataset type.

How has this been tested?

Wrote the YAML configs below and ran the training script in a Colab notebook.

Sample dataset with one column named "Comment":

pretraining_dataset:
  - path: urkopa/comments-lv
    split: train
    text_column: Comment
    type: pretrain

My dataset, where each split has a different schema, so I need to specify both split and name:

pretraining_dataset:
  - path: monsoon-nlp/greenbeing-proteins
    split: pretraining
    text_column: sequence
    name: pretraining
    type: pretrain
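
As a rough illustration of how such an entry could translate into loader arguments, here is a sketch under stated assumptions: the helper names pretraining_load_kwargs and text_column_for are hypothetical, and the defaults ("train" and "text") mirror the previous behavior described above rather than the exact code in this PR.

```python
from typing import Any, Dict, Mapping


def pretraining_load_kwargs(entry: Mapping[str, Any]) -> Dict[str, Any]:
    """Build keyword arguments for datasets.load_dataset from one config entry.

    Hypothetical helper: `split` falls back to "train", and `name`
    (the dataset config name) is only passed when it is set.
    """
    kwargs: Dict[str, Any] = {
        "path": entry["path"],
        "split": entry.get("split") or "train",
        "streaming": True,
    }
    if entry.get("name"):
        kwargs["name"] = entry["name"]
    return kwargs


def text_column_for(entry: Mapping[str, Any]) -> str:
    """Return the column holding the raw text, defaulting to "text"."""
    return entry.get("text_column") or "text"


entry = {
    "path": "monsoon-nlp/greenbeing-proteins",
    "split": "pretraining",
    "text_column": "sequence",
    "name": "pretraining",
    "type": "pretrain",
}
print(pretraining_load_kwargs(entry))
print(text_column_for(entry))
```

The returned dictionary could then be unpacked as load_dataset(**kwargs), since path, name, split, and streaming are all parameters of that function.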

NanoCode012 · 2024-03-30T15:52:37Z

Hm, is it possible to add unit tests for this?

winglian

this all lgtm. thanks!

* can configure name of split of pretraining dataset
* streaming data and dataset map
* text column customized
* allow text_column to be set in pretrain
* pretrain type
* load a bit of the dataset
* fix dataset where splits have separate configs
* ok name param here is the config
* whitespace

mapmeld added 9 commits March 30, 2024 06:32

can configure name of split of pretraining dataset

27dc6e2

streaming data and dataset map

d446ee1

text column customized

d3ec1b9

allow text_column to be set in pretrain

86e3a2b

pretrain type

4d53947

load a bit of the dataset

577edcb

fix dataset where splits have separate configs

345cbc8

ok name param here is the config

cf1843b

whitespace

f95b7e0

winglian approved these changes Mar 31, 2024

winglian added the ready to merge label Apr 1, 2024

winglian merged commit 586bd8d into axolotl-ai-cloud:main Apr 2, 2024
7 checks passed