
Run HF dataset processing on local rank 0 first #716

Merged: 17 commits into mosaicml:main on Nov 6, 2023

Conversation

@dakinggg (Collaborator) commented on Nov 5, 2023:

Adjust the HF dataset processing code to only process the data on local rank 0. Other ranks will use the cached arrow dataset. This is both much faster than processing on all ranks and seems to resolve various hangs and crashes we have seen for large datasets.
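A minimal sketch of the idea (not the actual llm-foundry code; the function name and data path here are hypothetical), assuming `torch.distributed` is initialized and `LOCAL_RANK` is set, as it is under `torchrun`:

```python
import os

import torch.distributed as dist
from datasets import load_dataset


def build_dataset(data_path: str):
    """Process an HF dataset on local rank 0 first; other ranks reuse the cache."""
    local_rank = int(os.environ.get('LOCAL_RANK', 0))

    # Non-zero local ranks wait here until local rank 0 has finished
    # processing and populated the on-disk arrow cache.
    if local_rank != 0:
        dist.barrier()

    dataset = load_dataset('json', data_files=data_path, split='train')

    # datasets fingerprints the map call; the identical call on the other
    # local ranks loads the cached arrow files instead of recomputing.
    num_proc = max((os.cpu_count() or 1) // 2, 1)
    dataset = dataset.map(lambda ex: ex, num_proc=num_proc)

    # Local rank 0 releases the waiting ranks once processing is done.
    if local_rank == 0:
        dist.barrier()

    return dataset
```

Each rank enters `dist.barrier()` exactly once, so the collective completes only after every node's local rank 0 has finished processing.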

Manually tested that this doesn't affect the loss curve in any way:
[Screenshot 2023-11-06 at 1 26 25 PM: loss curve comparison]

@dakinggg requested a review from @irenedea on November 5, 2023 09:33
@dakinggg marked this pull request as ready for review on November 5, 2023 09:33
@dakinggg marked this pull request as draft on November 6, 2023 03:30
@dakinggg marked this pull request as ready for review on November 6, 2023 08:42
@xiaohanzhan-db (Contributor) left a comment:


Nice! lgtm!

@karan6181 (Contributor) left a comment:


pytest runs on a single process, so it is hard to test this functionality in a unit test. However, I would recommend adding a unit test for it, either in this PR or the next, depending on the criticality of this PR.

Four review threads on llmfoundry/data/finetuning/tasks.py (two now outdated; all resolved).
@dakinggg changed the title from "Less aggressive multiprocessing for map/filter" to "Run multiprocessed HF dataset processing on local rank 0 first" on Nov 6, 2023
@dakinggg changed the title from "Run multiprocessed HF dataset processing on local rank 0 first" to "Improved map/filter for HF dataset processing" on Nov 6, 2023
@irenedea (Contributor) left a comment:


LGTM, a couple of nits:
(1) Update the PR title: "only" => "first"? Or "use cached arrow processing results for non-rank-0"?
(2) Add a comment explaining that the map/filter/load functions are all cached via arrow (see the caching sketch below).
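For context on (2): Hugging Face datasets writes the results of map/filter calls to arrow cache files keyed by a fingerprint of the input dataset and the applied function, so an identical call can load the cache rather than recompute. A hypothetical illustration (assumes a local `train.jsonl` file exists):

```python
from datasets import load_dataset

# The loaded dataset is backed by arrow files in the HF datasets cache.
ds = load_dataset('json', data_files='train.jsonl', split='train')

# The first time these run, the results are computed and written to arrow
# cache files; re-running the same calls later (e.g. on another local
# rank) loads those files instead of recomputing.
ds = ds.filter(lambda ex: len(ex['text']) > 0)
ds = ds.map(lambda ex: {'n_chars': len(ex['text'])})
```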

@dakinggg changed the title from "Improved map/filter for HF dataset processing" to "Run HF dataset processing on local rank 0 first" on Nov 6, 2023
@mvpatel2000 (Collaborator) left a comment:


LGTM, but I would refactor the barrier structure so that all ranks enter the same barrier.

Two review threads on llmfoundry/data/finetuning/tasks.py (outdated; resolved).
@dakinggg (Collaborator, Author) commented on Nov 6, 2023:

@karan6181 @mvpatel2000 @xiaohanzhan-db refactored to remove the confusingly placed barriers; all ranks now enter the same barrier.
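A sketch of what the suggested structure can look like, where every rank reaches one unconditional barrier (an illustration of the review suggestion, not the exact merged code; the function name is hypothetical):

```python
import os

import torch.distributed as dist


def map_local_rank_zero_first(dataset, fn):
    """Apply fn via dataset.map, running the real work on local rank 0 first."""
    local_rank = int(os.environ.get('LOCAL_RANK', 0))

    if local_rank == 0:
        # Local rank 0 does the work and populates the arrow cache.
        out = dataset.map(fn)

    # Every rank reaches this same, unconditional barrier.
    dist.barrier()

    if local_rank != 0:
        # The identical call on the other ranks hits the arrow cache
        # written by local rank 0 instead of recomputing.
        out = dataset.map(fn)

    return out
```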

@dakinggg (Collaborator, Author) commented on Nov 6, 2023:

@karan6181 @irenedea added more explanatory comments.

@dakinggg merged commit c2f5742 into mosaicml:main on Nov 6, 2023 (12 checks passed).
@dakinggg deleted the less-procs branch on December 11, 2023.