Add how to use multiple streams in loaders in yaml #831

Closed · wants to merge 6 commits
54 changes: 43 additions & 11 deletions scripts/train/README.md
@@ -3,17 +3,28 @@
This README walks through pretraining and finetuning a large language model using MosaicML's [StreamingDataset](https://github.com/mosaicml/streaming) format, [Composer](https://github.com/mosaicml/composer) trainer, and [MPT architecture](https://www.mosaicml.com/blog/mpt-7b). When used in concert on high-performance hardware such as A100 GPUs, these tools enable incredibly efficient and optimized LLM training.

#### Table of Contents
1. [Part 1: LLM Pretraining](#llmpretraining)
1. [Installation](#installation)
1. [Dataset Preparation](#datasetpreparation)
1. [How to start single and multi-node pretraining](#howtostartpretraining)
1. [Part 2: LLM Finetuning](#llmfinetuning)
1. [Using a dataset on the HuggingFace Hub](#hfdataset)
1. [Using a local dataset](#localdataset)
1. [Using a StreamingDataset (MDS) formatted dataset locally or in an object store](#mdsdataset)
1. [Using Flash Attention](#flashattention)
1. [FAQ: How many GPUs do I need to train a LLM?](#howmandygpus)
1. [FAQ: Optimizing Performance](#optimizingperformance)
- [LLM Pretraining and Finetuning](#llm-pretraining-and-finetuning)
- [Table of Contents](#table-of-contents)
> **Collaborator:** I don't think we need the self-reference to the table of contents here.

- [Part 1: LLM Pretraining ](#part-1-llm-pretraining-)
- [Installation ](#installation-)
- [Dataset preparation ](#dataset-preparation-)
- [Converting C4 to StreamingDataset `.mds` format](#converting-c4-to-streamingdataset-mds-format)
- [Test the Dataloader](#test-the-dataloader)
- [How to start single and multi-node pretraining ](#how-to-start-single-and-multi-node-pretraining-)
- [Single-Node training](#single-node-training)
- [Multi-Node via CLI args](#multi-node-via-cli-args)
- [Multi-Node via environment variables](#multi-node-via-environment-variables)
- [Part 2: LLM Finetuning ](#part-2-llm-finetuning-)
- [Data formatting](#data-formatting)
- [Pre-defined preprocessing functions](#pre-defined-preprocessing-functions)
- [Custom data preprocessing](#custom-data-preprocessing)
- [Usage](#usage)
- [1) Using a dataset on the HuggingFace Hub ](#1-using-a-dataset-on-the-huggingface-hub-)
- [2) Using a local dataset ](#2-using-a-local-dataset-)
- [3) Using a StreamingDataset (MDS) formatted dataset locally or in an object store ](#3-using-a-streamingdataset-mds-formatted-dataset-locally-or-in-an-object-store-)
- [Using Flash Attention ](#using-flash-attention-)
- [FAQ: How many GPUs do I need to train a LLM? ](#faq-how-many-gpus-do-i-need-to-train-a-llm-)
- [FAQ: Optimizing Performance ](#faq-optimizing-performance-)

# Part 1: LLM Pretraining <a name="llmpretraining"></a>

@@ -333,6 +344,27 @@ train_loader:
...
```

To load data from multiple streams (e.g. as in [Seamless data mixing](https://github.com/mosaicml/streaming?tab=readme-ov-file#seamless-data-mixing)), update your YAML like so:

```yaml
train_loader:
  name: finetuning
  dataset:
    streams:
      c4:
        local: /tmp/dataset/c4/
        proportion: 0.8
        remote: s3://my-c4-bucket
        split: train
      markdown:
        local: /tmp/dataset/markdown/
        proportion: 0.2
        remote: s3://my-markdown-bucket
        split: null
  ...
```

> **Collaborator:** The finetuning dataloader doesn't actually support streams. Not for any technical reason, it just hasn't been updated; this example should be in the pretraining data section and use the `text` dataloader.
>
> **Collaborator:** @cli99 mind finishing this PR real quick?
Each stream must have its own unique `local` and `remote` directory.
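
Per the review comment above, here is a sketch of the same multi-stream setup attached to the pretraining `text` dataloader instead of `finetuning`. The surrounding fields (`shuffle`, `max_seq_len`, `drop_last`, `num_workers`) are assumed to mirror the pretraining `train_loader` examples earlier in this README and may need adjusting for your config:

```yaml
train_loader:
  name: text
  dataset:
    streams:
      c4:
        remote: s3://my-c4-bucket
        local: /tmp/dataset/c4/
        split: train
        proportion: 0.8
      markdown:
        remote: s3://my-markdown-bucket
        local: /tmp/dataset/markdown/
        split: null
        proportion: 0.2
    shuffle: true
    max_seq_len: ${max_seq_len}
  drop_last: true
  num_workers: 8
```

The `proportion` values set the relative sampling weight of each stream (80%/20% in this example).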

# Using Flash Attention <a name="flashattention"></a>

Flash Attention is an optimized implementation of the attention mechanism, first introduced by [Dao et al.](https://github.com/Dao-AILab/flash-attention). There are three versions of Flash Attention that can be used with LLM Foundry: Flash Attention V1, Flash Attention V2, and a Triton implementation of Flash Attention. To start, we recommend using one of our [provided Docker images](../../README.md#mosaicml-docker-images) corresponding to the Flash Attention version you would like to use. The Triton implementation can be used with either Flash Attention V1 or V2. How you enable Flash Attention then depends on which model you are using.
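
For MPT models, for example, the attention implementation is selected in the model section of the training YAML. A minimal sketch, assuming the `attn_config.attn_impl` field used by the MPT YAML examples in this repository:

```yaml
model:
  name: mpt_causal_lm
  ...
  attn_config:
    attn_impl: triton  # assumed options: flash or triton, matching the implementations above
```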