# Add how to use multiple streams in loaders in yaml #831
@@ -3,17 +3,28 @@
This README walks through pretraining and finetuning a large language model using MosaicML's [StreamingDataset](https://github.com/mosaicml/streaming) format, [Composer](https://github.com/mosaicml/composer) trainer, and [MPT architecture](https://www.mosaicml.com/blog/mpt-7b). When used in concert on high-performance hardware such as A100 GPUs, these tools enable incredibly efficient and optimized LLM training.

#### Table of Contents
1. [Part 1: LLM Pretraining](#llmpretraining)
1. [Installation](#installation)
1. [Dataset Preparation](#datasetpreparation)
1. [How to start single and multi-node pretraining](#howtostartpretraining)
1. [Part 2: LLM Finetuning](#llmfinetuning)
1. [Using a dataset on the HuggingFace Hub](#hfdataset)
1. [Using a local dataset](#localdataset)
1. [Using a StreamingDataset (MDS) formatted dataset locally or in an object store](#mdsdataset)
1. [Using Flash Attention](#flashattention)
1. [FAQ: How many GPUs do I need to train a LLM?](#howmandygpus)
1. [FAQ: Optimizing Performance](#optimizingperformance)
- [LLM Pretraining and Finetuning](#llm-pretraining-and-finetuning)
- [Table of Contents](#table-of-contents)
- [Part 1: LLM Pretraining ](#part-1-llm-pretraining-)
- [Installation ](#installation-)
- [Dataset preparation ](#dataset-preparation-)
- [Converting C4 to StreamingDataset `.mds` format](#converting-c4-to-streamingdataset-mds-format)
- [Test the Dataloader](#test-the-dataloader)
- [How to start single and multi-node pretraining ](#how-to-start-single-and-multi-node-pretraining-)
- [Single-Node training](#single-node-training)
- [Multi-Node via CLI args](#multi-node-via-cli-args)
- [Multi-Node via environment variables](#multi-node-via-environment-variables)
- [Part 2: LLM Finetuning ](#part-2-llm-finetuning-)
- [Data formatting](#data-formatting)
- [Pre-defined preprocessing functions](#pre-defined-preprocessing-functions)
- [Custom data preprocessing](#custom-data-preprocessing)
- [Usage](#usage)
- [1) Using a dataset on the HuggingFace Hub ](#1-using-a-dataset-on-the-huggingface-hub-)
- [2) Using a local dataset ](#2-using-a-local-dataset-)
- [3) Using a StreamingDataset (MDS) formatted dataset locally or in an object store ](#3-using-a-streamingdataset-mds-formatted-dataset-locally-or-in-an-object-store-)
- [Using Flash Attention ](#using-flash-attention-)
- [FAQ: How many GPUs do I need to train a LLM? ](#faq-how-many-gpus-do-i-need-to-train-a-llm-)
- [FAQ: Optimizing Performance ](#faq-optimizing-performance-)

# Part 1: LLM Pretraining <a name="llmpretraining"></a>

@@ -333,6 +344,27 @@ train_loader:
...
```

To specify multiple streams for loading data (e.g. as in [Seamless data mixing](https://github.com/mosaicml/streaming?tab=readme-ov-file#seamless-data-mixing)), update your YAML like so:

```yaml
train_loader:
  name: finetuning
  dataset:
    streams:
      c4:
        local: /tmp/dataset/c4/
        proportion: 0.8
        remote: s3://my-c4-bucket
        split: train
      markdown:
        local: /tmp/dataset/markdown/
        proportion: 0.2
        remote: s3://my-markdown-bucket
        split: null
  ...
```

Each stream will have a unique `local` and `remote` directory.

> **Review comment:** The finetuning dataloader doesn't actually support streams. Not for any technical reason, we just haven't updated it, but this example should be in the pretraining data section and use the […]
>
> **Review comment:** @cli99 mind finishing this PR real quick?

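Following up on the review note above, here is a hedged sketch of how the same mixture might be expressed with the pretraining dataloader from Part 1 instead. It assumes the pretraining dataloader is named `text` and accepts the same `streams:` mapping under `dataset:`; neither is shown in this diff, so treat the field names as illustrative.

```yaml
train_loader:
  # assumption: the pretraining dataloader (Part 1) is named `text` and accepts `streams:`
  name: text
  dataset:
    streams:
      # each stream keeps its own local cache dir, remote store, split, and mixing proportion
      c4:
        local: /tmp/dataset/c4/
        proportion: 0.8
        remote: s3://my-c4-bucket
        split: train
      markdown:
        local: /tmp/dataset/markdown/
        proportion: 0.2
        remote: s3://my-markdown-bucket
        split: null
  ...
```

The per-stream fields are identical to the finetuning example above; only the dataloader name changes.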

# Using Flash Attention <a name="flashattention"></a>

Flash Attention is an optimized implementation of the attention mechanism, first introduced by [Dao et al.](https://github.com/Dao-AILab/flash-attention). There are three versions of Flash Attention that can be used with LLM Foundry: Flash Attention V1, Flash Attention V2, and a Triton implementation of Flash Attention. To start, we recommend using one of our [provided Docker images](../../README.md#mosaicml-docker-images) corresponding to the Flash Attention version you would like to use. The Triton implementation can be used with either Flash Attention V1 or V2. Next, how you specify to use Flash Attention depends on which model you are using.
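For example, with an MPT model the attention implementation is typically chosen in the `model` section of the training YAML. A minimal sketch, assuming the `attn_config.attn_impl` field used in LLM Foundry's example MPT configs (other model fields omitted):

```yaml
model:
  name: mpt_causal_lm
  ...
  attn_config:
    # assumed options: torch, flash, triton (triton selects the Triton implementation)
    attn_impl: flash
```

As noted above, the Flash Attention version itself comes from the Docker image you build on.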

> **Review comment:** I don't think we need the self reference to the table of contents here.