Add how to use multiple streams in loaders in yaml #831

Closed · wants to merge 6 commits
54 changes: 43 additions & 11 deletions scripts/train/README.md
@@ -3,17 +3,28 @@
This README walks through pretraining and finetuning a large language model using MosaicML's [StreamingDataset](https://github.com/mosaicml/streaming) format, [Composer](https://github.com/mosaicml/composer) trainer, and [MPT architecture](https://www.mosaicml.com/blog/mpt-7b). When used in concert on high-performance hardware such as A100 GPUs, these tools enable incredibly efficient and optimized LLM training.

#### Table of Contents
1. [Part 1: LLM Pretraining](#llmpretraining)
1. [Installation](#installation)
1. [Dataset Preparation](#datasetpreparation)
1. [How to start single and multi-node pretraining](#howtostartpretraining)
1. [Part 2: LLM Finetuning](#llmfinetuning)
1. [Using a dataset on the HuggingFace Hub](#hfdataset)
1. [Using a local dataset](#localdataset)
1. [Using a StreamingDataset (MDS) formatted dataset locally or in an object store](#mdsdataset)
1. [Using Flash Attention](#flashattention)
1. [FAQ: How many GPUs do I need to train a LLM?](#howmandygpus)
1. [FAQ: Optimizing Performance](#optimizingperformance)
- [LLM Pretraining and Finetuning](#llm-pretraining-and-finetuning)
- [Table of Contents](#table-of-contents)
> **Collaborator:** I don't think we need the self-reference to the table of contents here.

- [Part 1: LLM Pretraining ](#part-1-llm-pretraining-)
- [Installation ](#installation-)
- [Dataset preparation ](#dataset-preparation-)
- [Converting C4 to StreamingDataset `.mds` format](#converting-c4-to-streamingdataset-mds-format)
- [Test the Dataloader](#test-the-dataloader)
- [How to start single and multi-node pretraining ](#how-to-start-single-and-multi-node-pretraining-)
- [Single-Node training](#single-node-training)
- [Multi-Node via CLI args](#multi-node-via-cli-args)
- [Multi-Node via environment variables](#multi-node-via-environment-variables)
- [Part 2: LLM Finetuning ](#part-2-llm-finetuning-)
- [Data formatting](#data-formatting)
- [Pre-defined preprocessing functions](#pre-defined-preprocessing-functions)
- [Custom data preprocessing](#custom-data-preprocessing)
- [Usage](#usage)
- [1) Using a dataset on the HuggingFace Hub ](#1-using-a-dataset-on-the-huggingface-hub-)
- [2) Using a local dataset ](#2-using-a-local-dataset-)
- [3) Using a StreamingDataset (MDS) formatted dataset locally or in an object store ](#3-using-a-streamingdataset-mds-formatted-dataset-locally-or-in-an-object-store-)
- [Using Flash Attention ](#using-flash-attention-)
- [FAQ: How many GPUs do I need to train a LLM? ](#faq-how-many-gpus-do-i-need-to-train-a-llm-)
- [FAQ: Optimizing Performance ](#faq-optimizing-performance-)

# Part 1: LLM Pretraining <a name="llmpretraining"></a>

@@ -333,6 +344,27 @@ train_loader:
...
```

To load data from multiple streams (e.g. as in [Seamless data mixing](https://github.com/mosaicml/streaming?tab=readme-ov-file#seamless-data-mixing)), update your YAML like so:

```yaml
train_loader:
  name: finetuning
  dataset:
    streams:
      c4:
        local: /tmp/dataset/c4/
        proportion: 0.8
        remote: s3://my-c4-bucket
        split: train
      markdown:
        local: /tmp/dataset/markdown/
        proportion: 0.2
        remote: s3://my-markdown-bucket
        split: null
  ...
```

> **Collaborator:** The finetuning dataloader doesn't actually support streams. Not for any technical reason, it just hasn't been updated; this example should be in the pretraining data section and use the `text` dataloader.
>
> **Collaborator:** @cli99 mind finishing this PR real quick?
Each stream must have its own unique `local` and `remote` directory.
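
Per the review comment above, here is a sketch of the same multi-stream setup attached to the pretraining `text` dataloader instead of `finetuning`. The surrounding fields (`shuffle`, `max_seq_len`, `drop_last`, `num_workers`) are assumed to mirror the pretraining `train_loader` examples earlier in this README and may need adjusting for your config:

```yaml
train_loader:
  name: text
  dataset:
    streams:
      c4:
        remote: s3://my-c4-bucket
        local: /tmp/dataset/c4/
        split: train
        proportion: 0.8
      markdown:
        remote: s3://my-markdown-bucket
        local: /tmp/dataset/markdown/
        split: null
        proportion: 0.2
    shuffle: true
    max_seq_len: ${max_seq_len}
  drop_last: true
  num_workers: 8
```

The `proportion` values set the relative sampling weight of each stream (80%/20% in this example).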

# Using Flash Attention <a name="flashattention"></a>

Flash Attention is an optimized implementation of the attention mechanism, first introduced by [Dao et al.](https://github.com/Dao-AILab/flash-attention). There are three versions of Flash Attention that can be used with LLM Foundry: Flash Attention V1, Flash Attention V2, and a Triton implementation of Flash Attention. To start, we recommend using one of our [provided Docker images](../../README.md#mosaicml-docker-images) corresponding to the Flash Attention version you would like to use. The Triton implementation can be used with either Flash Attention V1 or V2. How you enable Flash Attention then depends on which model you are using.
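
For MPT models, for example, the attention implementation is selected in the model section of the training YAML. A minimal sketch, assuming the `attn_config.attn_impl` field used by the MPT YAML examples in this repository:

```yaml
model:
  name: mpt_causal_lm
  ...
  attn_config:
    attn_impl: triton  # assumed options: flash or triton, matching the implementations above
```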