Preprocessing at scale #291

le1nux · 2025-01-15T20:32:18Z

What does this PR do?

This PR adds functionality (including an API endpoint) to create chunks from a set of tokenized files.
Each tokenized file is split into the targeted number of chunks. The splits with the same split_id from all files are then combined into the respective chunk, so each chunk contains exactly one split from each tokenized file.
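
The split-and-combine scheme described above can be sketched as follows. This is a minimal illustration, not the code added in this PR: the helper names `split_bounds` and `build_chunk`, the near-equal split sizes, and the seeded shuffle are all assumptions made for the example. Documents are represented as a list of numpy arrays per file.

```python
import numpy as np


def split_bounds(num_docs: int, num_chunks: int, split_id: int) -> tuple:
    """Return the [start, end) document range of one split of a file.

    Hypothetical helper: split sizes differ by at most one document.
    """
    base, remainder = divmod(num_docs, num_chunks)
    start = split_id * base + min(split_id, remainder)
    end = start + base + (1 if split_id < remainder else 0)
    return start, end


def build_chunk(files: list, num_chunks: int, chunk_id: int, seed: int = 0) -> list:
    """Collect split `chunk_id` from every file, then shuffle the documents.

    Hypothetical helper: `files` is a list of tokenized files, each given
    as a list of numpy arrays (one array per document).
    """
    chunk = []
    for docs in files:
        start, end = split_bounds(len(docs), num_chunks, chunk_id)
        chunk.extend(docs[start:end])
    # Shuffle at document level so documents from different files are mixed.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(chunk))
    return [chunk[i] for i in order]


# Two toy "tokenized files" with 5 and 4 single-token documents.
file_a = [np.array([i]) for i in range(5)]
file_b = [np.array([10 + i]) for i in range(4)]
chunk_0 = build_chunk([file_a, file_b], num_chunks=2, chunk_id=0)
chunk_1 = build_chunk([file_a, file_b], num_chunks=2, chunk_id=1)
```

With two chunks, chunk 0 receives the first split of both files and chunk 1 the second, so every document ends up in exactly one chunk.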

General Changes

Added dataset chunking (including random shuffling of individual chunks).
Added TokenizedFileWriter for writing out tokenized data (list of numpy arrays) in the pbin format.
Added slicing functionality to the __getitem__ method of PackedMemMapDatasetBase.
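
To illustrate what slicing support in `__getitem__` means, here is a minimal sketch on a toy class. It is not the actual PackedMemMapDatasetBase: the class name `ToyPackedDataset`, the in-memory token array, and the `(offset, length)` index layout are assumptions chosen only for this example.

```python
from __future__ import annotations

import numpy as np


class ToyPackedDataset:
    """Toy stand-in for a packed dataset, NOT the actual PackedMemMapDatasetBase."""

    def __init__(self, tokens: np.ndarray, index: list[tuple[int, int]]):
        self._tokens = tokens  # all documents concatenated into one flat array
        self._index = index  # assumed layout: (offset, length) per document

    def __len__(self) -> int:
        return len(self._index)

    def _read(self, i: int) -> np.ndarray:
        offset, length = self._index[i]
        return self._tokens[offset : offset + length]

    def __getitem__(self, idx: int | slice) -> np.ndarray | list[np.ndarray]:
        if isinstance(idx, slice):
            # slice.indices() resolves None and negative bounds against len(self).
            return [self._read(i) for i in range(*idx.indices(len(self)))]
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError("document index out of range")
        return self._read(idx)


ds = ToyPackedDataset(np.arange(10), [(0, 3), (3, 2), (5, 5)])
single_doc = ds[1]  # one document as a numpy array
first_two = ds[0:2]  # a list with the first two documents
```

Returning a list for slices keeps documents of different lengths separate; an implementation could just as well return a batched structure.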

Breaking Changes

..

Checklist before submitting final PR

My PR is minimal and addresses one issue in isolation
I have merged the latest version of the target branch into this feature branch
I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
I have run a sample config for model training
I have checked that all tests run through (python tests/tests.py)
I have updated the internal changelog (CHANGELOG_DEV.md)

fromm-m

LGTM

src/modalities/__main__.py

src/modalities/api.py

src/modalities/dataloader/dataset.py

src/modalities/dataloader/preprocessing/chunking/create_chunks.py

src/modalities/dataloader/preprocessing/tokenization/tokenized_file_writer.py

mali-git

Great work!

le1nux added 14 commits January 15, 2025 21:31

feat: implemented chunking calculation

bc01afe

chore: Merge branch 'main' into preprocessing_at_scale

0ff30e1

feat: added slicing to PackedMemMapDatasetBase

e938ce4

feat: implemented TokenizedFileWriter

9446a0f

feat: added chunk shuffling

ba33e25

feat: added create_shuffled_dataset_chunk api endpoint

36347ff

feat: added test for Chunking.get_file_chunk

e43739a

feat: added test for shuffle_file_chunks_in_place

ca2b298

refactor: fixed faulty index in lorem_ipsum_long.pbin

c75a2c5

feat: added test for TokenizedFileWriter.write_tokenized_dataset

376fd4f

feat: added more testing to TokenizedFileWriter.write_tokenized_dataset

42a6cae

feat: added end to end test test_create_shuffled_dataset_chunk

3d17633

feat: added slicing tests for PackedMemMapDatasetBase

781a008

fix: fixed failing test test_skipped_and_distributed_dataloader_from_…

65477cd

…config

le1nux requested review from mali-git and fromm-m January 20, 2025 10:25

le1nux added the enhancement label (New feature or request) Jan 20, 2025

le1nux marked this pull request as ready for review January 20, 2025 10:25

fromm-m approved these changes Jan 20, 2025

mali-git requested changes Jan 21, 2025

le1nux added 2 commits January 21, 2025 17:17

refactor: simplified slicing in PackedMemMapDatasetBase

8722b16

refactor: improved code based on review comments

d3a1795

le1nux requested a review from mali-git January 22, 2025 10:33

mali-git reviewed Jan 22, 2025

src/modalities/dataloader/dataset.py (resolved)

mali-git reviewed Jan 22, 2025

src/modalities/dataloader/preprocessing/tokenization/tokenized_file_writer.py (outdated, resolved)

mali-git reviewed Jan 22, 2025

src/modalities/dataloader/preprocessing/tokenization/tokenized_file_writer.py (outdated, resolved)

mali-git approved these changes Jan 22, 2025

chore: added missing type annotations

4bc671b

le1nux merged commit fe4a3be into main Jan 22, 2025
3 checks passed

le1nux deleted the preprocessing_at_scale branch January 22, 2025 16:36
