Refactor Binary File Operations #294

Open
mali-git opened this issue Jan 17, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@mali-git
Member

Feature request

Implement a centralized, reusable utility or module for binary file operations that can be shared across modules. This utility should:

  1. Standardize reading and writing headers, binary data, and indices.
  2. Support modular integration with existing components like EmbeddedStreamData and others.
  3. Reduce code duplication while improving readability and maintainability.
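As a rough illustration of what such a shared utility could look like, here is a minimal sketch using only the Python standard library. The file layout (a fixed-size header, a data section, and a pickled index of `(offset, length)` pairs) and all names (`write_binary_file`, `read_header`, `read_index`, `read_row`) are assumptions made for this example, not the project's confirmed format or API.

```python
import io
import pickle
import struct
from typing import BinaryIO, List, Tuple

# Hypothetical layout, assumed for illustration only:
#   [8 bytes: data section length, little-endian][4 bytes: token size in bytes]
#   [data section][pickled index: list of (offset, length) pairs]
HEADER_FORMAT = "<QI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 12 bytes, no padding with "<"


def write_binary_file(stream: BinaryIO, rows: List[bytes], token_size: int) -> None:
    """Write header, data section, and index in one pass."""
    index: List[Tuple[int, int]] = []
    offset = 0
    for row in rows:
        index.append((offset, len(row)))  # offsets are relative to the data section
        offset += len(row)
    stream.write(struct.pack(HEADER_FORMAT, offset, token_size))
    for row in rows:
        stream.write(row)
    stream.write(pickle.dumps(index))


def read_header(stream: BinaryIO) -> Tuple[int, int]:
    """Return (data_section_length, token_size_in_bytes)."""
    stream.seek(0)
    return struct.unpack(HEADER_FORMAT, stream.read(HEADER_SIZE))


def read_index(stream: BinaryIO) -> List[Tuple[int, int]]:
    """Load the index stored after the data section."""
    data_length, _ = read_header(stream)
    stream.seek(HEADER_SIZE + data_length)
    return pickle.loads(stream.read())


def read_row(stream: BinaryIO, index: List[Tuple[int, int]], i: int) -> bytes:
    """Read the raw bytes of row i using the index."""
    offset, length = index[i]
    stream.seek(HEADER_SIZE + offset)
    return stream.read(length)
```

With helpers like these in one place, each of the components named in this issue could call the same header, data, and index routines instead of carrying its own copy of the `struct` and `seek` logic.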

Motivation

Currently, there is duplicated code for reading and writing binary files across multiple modules and functions, including:

  • EmbeddedStreamData
  • PackedDataGenerator
  • LargeFileLinesReader
  • shuffle_tokenized_data()

This redundancy increases maintenance overhead and the risk of inconsistencies. For example, reading headers, writing index data, and handling binary streams are repeated in different forms, leading to potential bugs and inefficiencies.

@mali-git mali-git added the enhancement New feature or request label Jan 17, 2025
@le1nux
Member

le1nux commented Jan 20, 2025

First version implemented here:
https://github.com/Modalities/modalities/blob/65477cdcd20a33a66cc30d5220def47b42e5b27f/src/modalities/dataloader/preprocessing/tokenization/tokenized_file_writer.py

Note that write_tokenized_dataset expects the full dataset to be available up front as a list of numpy arrays.
Adding rows iteratively is not possible for now.
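For reference, one way iterative writing could be supported later is to reserve the header at the start of the file and patch in the data length once all rows have been written. The sketch below is purely hypothetical: the `IncrementalWriter` class, the header layout, and the pickled index are assumptions for illustration and do not describe the linked implementation. It also assumes a seekable stream positioned at offset 0.

```python
import io
import pickle
import struct
from typing import BinaryIO, List, Tuple

# Hypothetical header: data section length (8 bytes) and token size (4 bytes).
HEADER_FORMAT = "<QI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


class IncrementalWriter:
    """Hypothetical writer that accepts one row at a time."""

    def __init__(self, stream: BinaryIO, token_size: int) -> None:
        self._stream = stream
        self._token_size = token_size
        self._index: List[Tuple[int, int]] = []
        self._data_length = 0
        # Reserve the header with a placeholder length; close() patches it.
        self._stream.write(struct.pack(HEADER_FORMAT, 0, token_size))

    def add_row(self, row: bytes) -> None:
        # Record where this row starts within the data section, then append it.
        self._index.append((self._data_length, len(row)))
        self._stream.write(row)
        self._data_length += len(row)

    def close(self) -> None:
        # Append the index, then rewrite the header with the final data length.
        self._stream.write(pickle.dumps(self._index))
        self._stream.seek(0)
        self._stream.write(
            struct.pack(HEADER_FORMAT, self._data_length, self._token_size)
        )
        self._stream.seek(0, io.SEEK_END)
```

The trade-off in this design is that the index is held in memory until `close()` is called, while the row data itself is streamed to disk as it arrives.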
