Preprocessing at scale #291

Merged
merged 17 commits into from
Jan 22, 2025

@le1nux le1nux commented Jan 15, 2025

What does this PR do?

This PR adds functionality (including an endpoint) to create chunks from a set of tokenized files.
Each tokenized file is divided into as many splits as the target number of chunks. Splits with the same split_id are then combined across all files into the corresponding chunk, so each chunk contains exactly one split from each tokenized file.
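
The chunking scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `split_bounds` and `build_chunk`, the in-memory representation (each tokenized file as a list of numpy arrays), the near-equal split sizes, and the `shuffle_seed` parameter are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

import numpy as np


def split_bounds(num_items: int, num_splits: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs cutting num_items into num_splits near-equal parts."""
    base, remainder = divmod(num_items, num_splits)
    bounds, start = [], 0
    for split_id in range(num_splits):
        # The first `remainder` splits receive one extra item.
        end = start + base + (1 if split_id < remainder else 0)
        bounds.append((start, end))
        start = end
    return bounds


def build_chunk(
    files: List[List[np.ndarray]],
    split_id: int,
    num_chunks: int,
    shuffle_seed: Optional[int] = None,
) -> List[np.ndarray]:
    """Collect the split with index split_id from every file into one chunk.

    Hypothetical helper: `files` holds one list of documents per tokenized file.
    """
    chunk: List[np.ndarray] = []
    for documents in files:
        start, end = split_bounds(len(documents), num_chunks)[split_id]
        chunk.extend(documents[start:end])
    if shuffle_seed is not None:
        # Optional shuffling of the documents within this single chunk.
        order = np.random.default_rng(shuffle_seed).permutation(len(chunk))
        chunk = [chunk[i] for i in order]
    return chunk


file_a = [np.array([1]), np.array([2]), np.array([3]), np.array([4])]
file_b = [np.array([10]), np.array([20])]
chunk_0 = build_chunk([file_a, file_b], split_id=0, num_chunks=2)
chunk_1 = build_chunk([file_a, file_b], split_id=1, num_chunks=2)
```

Because each chunk depends only on its own split_id, chunks can in principle be built independently of each other, which is one plausible reason for organizing the work this way at scale.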

General Changes

  • Dataset chunking (including random shuffling of individual chunks)
  • Added TokenizedFileWriter for writing out tokenized data (list of numpy arrays) in the pbin format.
  • Added slicing functionality to the __getitem__ method of PackedMemMapDatasetBase
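
To illustrate what slicing support in `__getitem__` can look like, here is a small sketch. `ToyPackedDataset` is a hypothetical stand-in, not `PackedMemMapDatasetBase` itself: it assumes documents are stored back to back in one flat token buffer with an (offset, length) index, and the actual class may differ in storage, return types, and error handling.

```python
from typing import List, Tuple, Union

import numpy as np


class ToyPackedDataset:
    """Hypothetical dataset: documents packed into one flat token buffer."""

    def __init__(self, documents: List[np.ndarray]):
        # Assumes at least one document; np.concatenate rejects an empty list.
        self._tokens = np.concatenate(documents)
        self._index: List[Tuple[int, int]] = []
        offset = 0
        for doc in documents:
            self._index.append((offset, len(doc)))
            offset += len(doc)

    def __len__(self) -> int:
        return len(self._index)

    def _read(self, offset: int, length: int) -> np.ndarray:
        return self._tokens[offset : offset + length]

    def __getitem__(self, idx: Union[int, slice]) -> Union[np.ndarray, List[np.ndarray]]:
        if isinstance(idx, slice):
            # Slicing the index list handles start, stop, step and negatives.
            return [self._read(offset, length) for offset, length in self._index[idx]]
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError("document index out of range")
        return self._read(*self._index[idx])


dataset = ToyPackedDataset([np.array([1, 2]), np.array([3]), np.array([4, 5, 6])])
first_two = dataset[0:2]  # list of two documents
last = dataset[-1]        # single document
```

Reading a contiguous range of documents with one slice is convenient when cutting a file into splits, since each split is just a range of document indices.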

Breaking Changes

  • ..

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux requested review from mali-git and fromm-m January 20, 2025 10:25
@le1nux le1nux added the enhancement New feature or request label Jan 20, 2025
@le1nux le1nux marked this pull request as ready for review January 20, 2025 10:25
@fromm-m fromm-m left a comment

LGTM

src/modalities/__main__.py (outdated, resolved)
src/modalities/__main__.py (outdated, resolved)
src/modalities/api.py (outdated, resolved)
src/modalities/api.py (resolved)
src/modalities/api.py (outdated, resolved)
src/modalities/dataloader/dataset.py (outdated, resolved)
@le1nux le1nux requested a review from mali-git January 22, 2025 10:33
@mali-git mali-git left a comment

Great work!

@le1nux le1nux merged commit fe4a3be into main Jan 22, 2025
3 checks passed
@le1nux le1nux deleted the preprocessing_at_scale branch January 22, 2025 16:36