From 38a1f51c2711f06b3d8f8b3426234cf9a9412db3 Mon Sep 17 00:00:00 2001
From: Saaketh
Date: Thu, 6 Jun 2024 11:03:55 -0700
Subject: [PATCH] import

---
 scripts/data_prep/README.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/scripts/data_prep/README.md b/scripts/data_prep/README.md
index 7881298b2f..3601cc865f 100644
--- a/scripts/data_prep/README.md
+++ b/scripts/data_prep/README.md
@@ -35,6 +35,23 @@ python convert_dataset_json.py \
 
 Where `--path` can be a single json file, or a folder containing json files. `--split` denotes the intended split (hf defaults to `train`).
 
+### Raw text files
+
+Using the `convert_text_to_mds.py` script, we convert a raw [text file](https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt) containing the complete works of William Shakespeare into StreamingDataset format.
+
+
+```bash
+# Convert the raw text file to StreamingDataset format
+mkdir shakespeare && cd shakespeare
+curl -O https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
+cd ..
+python convert_text_to_mds.py \
+  --output_folder my-copy-shakespeare \
+  --input_folder shakespeare \
+  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b \
+  --compression zstd
+```
+
 ## Converting a finetuning dataset
 
 Using the `convert_finetuning_dataset.py` script you can run a command such as:
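
Once converted, the output folder can be read back with the `streaming` library to sanity-check the result. The snippet below is a minimal sketch rather than part of the patch: it assumes the `mosaicml-streaming` package is installed and that `convert_text_to_mds.py` stores the concatenated token ids as raw bytes under a `tokens` key; the `int64` dtype is likewise an assumption.

```python
# Minimal sketch: read back the converted Shakespeare data from the local
# output folder written by convert_text_to_mds.py above.
# Assumes `pip install mosaicml-streaming numpy` and that each sample stores
# its concatenated token ids as raw bytes under a 'tokens' key.
import numpy as np
from streaming import StreamingDataset

# Point `local` at the --output_folder used during conversion.
dataset = StreamingDataset(local='my-copy-shakespeare', shuffle=False)

sample = dataset[0]
# The int64 dtype here is an assumption, not guaranteed by this patch.
tokens = np.frombuffer(sample['tokens'], dtype=np.int64)
print(f'{len(dataset)} samples; first sample has {len(tokens)} tokens')
```

With `--concat_tokens 2048`, each sample should decode to 2048 token ids.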