Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support loading datasets saved via save_to_disk #1432

Merged
merged 4 commits into from
Mar 29, 2024
Merged

Support loading datasets saved via save_to_disk #1432

merged 4 commits into from
Mar 29, 2024

Conversation

fozziethebeat
Copy link
Contributor

Description

Fixes

Uses load_from_disk when no data files are listed in a local directory dataset. This is to ensure datasets saved via save_to_disk are properly loaded.

Motivation and Context

Datasets saved via save_to_disk currently can't be loaded without explicitly listing all files.
#1430

How has this been tested?

Created a local dataset saved via save_to_disk and trained a model on it.

Screenshots (if appropriate)

Types of changes

Small code change

Social Handles (Optional)

fozziethebeat on all the platforms

@fozziethebeat fozziethebeat changed the title Support loading datasetes saved via save_to_disk Support loading datasets saved via save_to_disk Mar 25, 2024
Copy link
Collaborator

@winglian winglian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. @NanoCode012 ?

@winglian winglian requested a review from NanoCode012 March 26, 2024 17:47
@winglian
Copy link
Collaborator

Looks like there are some issues with this change and the current integration tests we have, especially when loading remote datasets.

@fozziethebeat
Copy link
Contributor Author

I'll look through the docker tests that failed and add unittests for what's failing.

@fozziethebeat
Copy link
Contributor Author

I think this should fix the tests. The updates to main require tokenizer_config to be set in the unittests given that its not using the config utils that auto-set the field.

@fozziethebeat
Copy link
Contributor Author

One E2E test suite failed with this error:

Writing manifest to image destination
Unpacking OCI image
   • unpacking rootfs ...
Runner terminated.
Stopping app - uncaught exception raised locally: RemoteError('Image build for im-RlXPRxFbKAnNrpADzxyalb terminated due to external shut-down. Please try again.').

This seems like it needs a re-run. The other E2E suite completed successfully

@winglian winglian merged commit e634118 into axolotl-ai-cloud:main Mar 29, 2024
6 checks passed
@fozziethebeat
Copy link
Contributor Author

yay~

djsaunde pushed a commit that referenced this pull request Dec 17, 2024
* Support loading datasetes saved via save_to_disk

* Adding comprehensive unittests

* Fix dataset tests due to new hash changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants