Add data streaming support through mosaic-streaming #1525
Conversation
requirements.txt (Outdated)
```
@@ -31,6 +31,7 @@ art
fschat==0.2.36
gradio==3.50.2
tensorboard
mosaicml-streaming
```
question: `mosaicml-streaming` should be an optional dependency. Is this the right way of adding it in this capacity?
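For reference, one way to make it optional would be an extra, in a sketch assuming the project declares extras in `setup.py` (the extra name `streaming` is hypothetical):

```python
# setup.py (sketch): expose mosaicml-streaming as an optional extra.
# The core install stays unchanged; only users who opt in pull the package.
from setuptools import setup, find_packages

setup(
    name="axolotl",
    packages=find_packages("src"),
    package_dir={"": "src"},
    install_requires=[
        # ... core dependencies ...
    ],
    extras_require={
        "streaming": ["mosaicml-streaming"],
    },
)
```

Users would then opt in with `pip install axolotl[streaming]`.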
I'm okay with it being a required dependency. If it causes issues down the line we can make it optional then. Hoping to keep things simpler.
Could this version be locked?
Addressed by 976bc13.
src/axolotl/utils/data/sft.py (Outdated)
```python
#
# This is necessary because downstream functions use a different interface
# than `StreamingDataset` (e.g. the `features` attribute).
ds = Dataset.from_generator(
```
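For context, the pattern in this diff amounts to wrapping the streaming source in a regular `datasets.Dataset`; a minimal sketch of that idea, with placeholder remote/local paths:

```python
from datasets import Dataset
from streaming import StreamingDataset

# Placeholder locations; in the PR these come from the dataset config.
streaming_ds = StreamingDataset(remote="s3://my-bucket/my-data", local="/tmp/mds-cache")

def samples():
    # Yield plain dicts so `Dataset.from_generator` can infer `features`.
    for sample in streaming_ds:
        yield sample

# This materializes the data eagerly, which is why downstream code sees a
# normal `Dataset` with a `features` attribute rather than an `IterableDataset`.
ds = Dataset.from_generator(samples)
```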
This becomes an IterableDataset, right?
Sorry for the delay here.
No, that was something I wanted to verify, but it looks like it goes through `def process` and everything is evaluated eagerly.
I started a draft like:
```python
def process(self, dataset):
    features = dataset.features.keys()
    map_kwargs = {}
    if self.prompt_tokenizer.supports_batched:
        map_kwargs["batched"] = True
        map_kwargs["batch_size"] = 100
    if isinstance(dataset, IterableDataset):
        # `IterableDataset.map` is lazy and accepts fewer kwargs
        # (no `num_proc`, `desc`, or `keep_in_memory`).
        dataset_wrapper = dataset.map(
            self.prompt_tokenizer.tokenize_prompt,
            remove_columns=features,
            **map_kwargs,
        )
        return dataset_wrapper
    num_proc = min(
        64, self.process_count if self.process_count else os.cpu_count()
    )
    return dataset.map(
        self.prompt_tokenizer.tokenize_prompt,
        num_proc=num_proc,
        remove_columns=features,
        keep_in_memory=self.keep_in_memory,
        desc="Tokenizing Prompts",
        **map_kwargs,
    )
```
But I don't know whether that's a good idea. The `.map` API is different between `Dataset` (here) and `IterableDataset` (here).
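For what it's worth, a sketch of dispatching kwargs by dataset type, assuming `Dataset.map` accepts `num_proc`, `desc`, and `keep_in_memory` while `IterableDataset.map` does not (the helper name is hypothetical):

```python
import os
from datasets import Dataset, IterableDataset

def tokenize_dataset(dataset, tokenize_fn, process_count=None):
    """Apply `tokenize_fn` with only the kwargs each dataset type supports (sketch)."""
    map_kwargs = {
        "batched": True,
        "batch_size": 100,
        # IterableDataset.features can be None, hence the fallback.
        "remove_columns": list(dataset.features or []),
    }
    if isinstance(dataset, IterableDataset):
        # Lazy path: tokenization happens only when the dataset is iterated.
        return dataset.map(tokenize_fn, **map_kwargs)
    # Eager path: parallelize across processes and show a progress description.
    num_proc = min(64, process_count or os.cpu_count())
    return dataset.map(tokenize_fn, num_proc=num_proc, desc="Tokenizing Prompts", **map_kwargs)
```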
Feel free to remove the "ready to merge" tag from this.
good to go. thank you @fmv1992!
Thanks, much appreciated; I'm just checking a few more things before merging. The experience of contributing to this repo has been very positive.
Hey, thanks for the PR. I just wanted to clarify something I asked previously. This would require users to preprocess their dataset into Mosaic's format first, right? If so, I would prefer this to be documented somewhere near the cloud loading section, for example by adding a short note there. You should also add this parameter to https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/config.qmd. Quick start reference: https://github.com/mosaicml/streaming?tab=readme-ov-file#quick-start
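For reference, the preprocessing step that comment refers to would look roughly like the quick start linked above; a minimal sketch converting a JSONL file into MDS shards (paths and the column schema are placeholders):

```python
import json
from streaming import MDSWriter

# Placeholder schema; the columns must match the fields in your samples.
columns = {"text": "str"}

# Write MDS shards directly to a (placeholder) S3 prefix.
with MDSWriter(out="s3://my-bucket/my-dataset-mds", columns=columns) as writer:
    with open("train.jsonl", encoding="utf-8") as fh:
        for line in fh:
            sample = json.loads(line)
            writer.write({"text": sample["text"]})
```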
I think it needs additional `StreamingDataset` support for pretraining datasets (completion), in addition to fine-tuning datasets.
Can we pretrain with Axolotl by streaming a data mix from S3?
JSONL should be fine for streaming. See https://github.com/mosaicml/streaming?tab=readme-ov-file#1-prepare-your-data
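On the S3 question above: once shards exist in Mosaic's format, reading them back is roughly the following sketch (bucket and cache paths are placeholders, and this is not yet wired into axolotl's pretraining path):

```python
from streaming import StreamingDataset
from torch.utils.data import DataLoader

# Placeholder locations: remote MDS shards on S3, local on-disk cache.
dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset-mds",
    local="/tmp/mds-cache",
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=8)

for batch in loader:
    ...  # samples are fetched shard-by-shard instead of loading everything up front
```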
We can, but I'd prefer to include this in a second PR. Right now I would rather see this smaller change working and merged. Expanding on it should be easier later.
> We can, but I'd prefer to include this in a second PR. Right now I would rather see this smaller change working and merged. Expanding on it should be easier later.
Addressed by ba86339. Let me know if that addresses all your points.
As per this comment, this is not ready for merging; maybe we want to remove that tag. I posted a draft of the changes there, but the issue is that the tokenization should happen as we download the data, and right now I'm almost certain it does everything in a batch: it downloads everything, then tokenizes everything, then proceeds to do the fine-tuning.
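One possible shape for tokenize-as-you-stream, assuming a `datasets` release recent enough to provide `IterableDataset.from_generator` (the tokenizer here is a stand-in for the prompt tokenizer used in the PR, and the paths are placeholders):

```python
from datasets import IterableDataset
from streaming import StreamingDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the configured tokenizer

streaming_ds = StreamingDataset(remote="s3://my-bucket/my-dataset-mds", local="/tmp/mds-cache")

def samples():
    for sample in streaming_ds:
        yield sample

# Nothing is materialized here: both from_generator and map are lazy on IterableDataset.
lazy_ds = IterableDataset.from_generator(samples)
tokenized = lazy_ds.map(lambda example: tokenizer(example["text"]), remove_columns=["text"])

# Tokenization now runs only when the trainer iterates, i.e. roughly as shards are downloaded.
```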
@fmv1992, this is correct. I only got to review your code in detail earlier; the section I provided you was incorrect. https://github.com/OpenAccess-AI-Collective/axolotl/blob/60f5ce0569b7f1d522ef81ea986ebfdc98780e6a/src/axolotl/utils/data/sft.py#L121 — this function runs over the whole dataset, merges it, and performs tokenization at that point. The only part that "skips" tokenization before fine-tuning is the … I have two ideas as of now:
I would also appreciate @winglian's comments on this. Side note: what should this …
I'm closing this due to inactivity.
I'm interested in this.
Description
This PR adds support for (non-volatile) memory-efficient training through StreamingDataset.
Motivation and Context
Context: #585.
How has this been tested?
I have tested this through docker on a VM. I'm open to ideas as to how this should be added. Does the repo support an s3 bucket, for instance?