support for explicit test_dataset definition for evals #786

Merged: 1 commit, Jan 23, 2024

winglian
Collaborator

First pass at supporting separate datasets for evals, rather than splitting a test set off from the training data.

Collaborator

@NanoCode012 NanoCode012 left a comment

Hmm, our data pipeline is getting a bit complex. I had a bit of a hard time following the flow.

src/axolotl/utils/data.py (outdated, resolved)
Collaborator

@NanoCode012 NanoCode012 left a comment

Adding this config to the README would be helpful.

@winglian
Collaborator Author

This will address #875

@NanoCode012
Collaborator

I think we will need to document this. Also, would it be appropriate to hardcode the train/test split, or should we allow more open control?

@winglian
Collaborator Author

I think we will need to document this. Also, would it be appropriate to hardcode the train/test split, or should we allow more open control?

@NanoCode012 What do you mean about hardcoding?

@winglian winglian merged commit cda52dc into main Jan 23, 2024
7 checks passed
@winglian winglian deleted the eval-dataset branch January 23, 2024 02:30
@DreamGenX
Contributor

Do you have an example or documentation of how this can be used?

@DreamGenX
Contributor

I was trying to reverse engineer how it's supposed to work, and maybe there's a bug here:

https://github.com/OpenAccess-AI-Collective/axolotl/blob/18f811978c01d567c2294140f53abcf8c086e337/src/axolotl/utils/data.py#L443

    dataset, prompters = load_tokenized_prepared_datasets(
        tokenizer, cfg, default_dataset_prepared_path
    )

Shouldn't you pass split=split to load_tokenized_prepared_datasets? Otherwise it will never load from cfg.test_datasets.
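
To make the concern concrete, here is a minimal sketch of the pattern, not the actual axolotl code. All function names and the `cfg` object below are hypothetical stand-ins, and it assumes the loader chooses between `cfg.datasets` and `cfg.test_datasets` based on a `split` argument that defaults to `"train"`.

```python
from types import SimpleNamespace


# Hypothetical stand-in for the loader: picks the dataset list from the
# config based on `split`, defaulting to the training datasets.
def load_prepared(cfg, split="train"):
    return cfg.test_datasets if split == "test" else cfg.datasets


# Caller in the style of the quoted snippet: it accepts `split`
# but does not forward it, so the loader always sees the default.
def prepare_without_forwarding(cfg, split="train"):
    return load_prepared(cfg)


# Caller with the suggested change: `split` is passed through.
def prepare_with_forwarding(cfg, split="train"):
    return load_prepared(cfg, split=split)


# Placeholder dataset names, not real paths or real config entries.
cfg = SimpleNamespace(datasets=["train_set"], test_datasets=["eval_set"])

print(prepare_without_forwarding(cfg, split="test"))  # ['train_set']
print(prepare_with_forwarding(cfg, split="test"))     # ['eval_set']
```

If the real loader behaves like the stand-in, a request for the test split without forwarding would silently return the training datasets instead.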
