fix pretraining_ on odd datasets #1463

Merged: 9 commits, Apr 2, 2024

Contributor

@mapmeld mapmeld commented Mar 30, 2024

Description

I'm interested in using axolotl to pretrain and then finetune on a dataset which I have on HF.

  1. Fixes an issue I ran into: the training script expects the pretraining dataset to have a split named "train" and a single column named "text". I propose adding split and name to the PretrainingDataset parser, plus a new text_column field.

  2. The load_dataset(path, streaming=True) call used in pretraining currently returns features: None or features: Unknown, so I propose retrieving the first record to get the column names (these are later passed to remove_columns).

As an alternative config change, I could see using SFTDataset and removing the separate PretrainingDataset type.
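
The idea in item 2 (peek at the first record to learn the column names when a streaming dataset reports no features) can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `peek_column_names` is a hypothetical helper, and a plain list of dicts stands in for the streaming dataset so the snippet runs without network access.

```python
from typing import Any, Dict, Iterable, List


def peek_column_names(records: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the keys of the first record, or an empty list if there is none.

    Hypothetical helper for illustration. It assumes `records` can be
    iterated again from the start afterwards (true for a list; a one-shot
    generator would lose its first record here).
    """
    first = next(iter(records), None)
    return list(first.keys()) if first is not None else []


# Stand-in for a streaming dataset whose `features` are unknown.
sample_stream = [
    {"Comment": "labdien", "id": 1},
    {"Comment": "paldies", "id": 2},
]

columns = peek_column_names(sample_stream)
print(columns)
```

With a real streaming dataset, the resulting list would be the value handed to `remove_columns` when mapping the tokenization function over the dataset.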

How has this been tested?

Output YAML, ran the training script in a Colab notebook.

Sample dataset with one column named "Comment":

pretraining_dataset:
  - path: urkopa/comments-lv
    split: train
    text_column: Comment
    type: pretrain

My dataset, where each split has a different schema, so I need to specify both split and name:

pretraining_dataset:
  - path: monsoon-nlp/greenbeing-proteins
    split: pretraining
    text_column: sequence
    name: pretraining
    type: pretrain
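
To make the effect of the new fields concrete, here is a rough sketch of how one `pretraining_dataset` entry could be translated into `load_dataset` arguments and a text lookup. The helper names `build_load_kwargs` and `get_text` are hypothetical and are not axolotl's actual functions; the defaults ("train" split, "text" column) follow the previous behavior described above.

```python
from typing import Any, Dict


def build_load_kwargs(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Map a config entry to keyword arguments for datasets.load_dataset.

    Hypothetical helper: falls back to the "train" split when none is given,
    and only passes `name` (the dataset config name) when it is set.
    """
    kwargs: Dict[str, Any] = {
        "path": entry["path"],
        "split": entry.get("split", "train"),
        "streaming": True,
    }
    if entry.get("name"):
        kwargs["name"] = entry["name"]
    return kwargs


def get_text(record: Dict[str, Any], entry: Dict[str, Any]) -> str:
    """Read the text field from a record, defaulting to the "text" column."""
    return record[entry.get("text_column", "text")]


# The second config example from this PR, written as a Python dict.
entry = {
    "path": "monsoon-nlp/greenbeing-proteins",
    "split": "pretraining",
    "text_column": "sequence",
    "name": "pretraining",
    "type": "pretrain",
}

load_kwargs = build_load_kwargs(entry)
print(load_kwargs)
print(get_text({"sequence": "MKV"}, entry))
```

In practice the result would be used as `load_dataset(**load_kwargs)`; the call is left out here so the sketch does not depend on network access.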

@NanoCode012
Collaborator

Hm, is it possible to add unit tests for this?

Collaborator

@winglian winglian left a comment

this all lgtm. thanks!

@winglian winglian merged commit 586bd8d into axolotl-ai-cloud:main Apr 2, 2024
7 checks passed
djsaunde pushed a commit that referenced this pull request Dec 17, 2024
* can configure name of split of pretraining dataset

* streaming data and dataset map

* text column customized

* allow text_column to be set in pretrain

* pretrain type

* load a bit of the dataset

* fix dataset where splits have separate configs

* ok name param here is the config

* whitespace