From 38a1f51c2711f06b3d8f8b3426234cf9a9412db3 Mon Sep 17 00:00:00 2001
From: Saaketh
Date: Thu, 6 Jun 2024 11:03:55 -0700
Subject: [PATCH] import

---
 scripts/data_prep/README.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/scripts/data_prep/README.md b/scripts/data_prep/README.md
index 7881298b2f..3601cc865f 100644
--- a/scripts/data_prep/README.md
+++ b/scripts/data_prep/README.md
@@ -35,6 +35,23 @@ python convert_dataset_json.py \
 
 Where `--path` can be a single json file, or a folder containing json files. `--split` denotes the intended split (hf defaults to `train`).
 
+### Raw text files
+
+Using the `convert_text_to_mds.py` script, we convert a raw [text file](https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt) containing the complete works of William Shakespeare into StreamingDataset format.
+
+
+```bash
+# Convert the raw text file to StreamingDataset format
+mkdir shakespeare && cd shakespeare
+curl -O https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
+cd ..
+python convert_text_to_mds.py \
+  --output_folder my-copy-shakespeare \
+  --input_folder shakespeare \
+  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b \
+  --compression zstd
+```
+
 ## Converting a finetuning dataset
 
 Using the `convert_finetuning_dataset.py` script you can run a command such as:
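
Once converted, the output folder can be read back with the `streaming` library to sanity-check the result. The snippet below is a minimal sketch rather than part of the patch: it assumes the `mosaicml-streaming` package is installed and that `convert_text_to_mds.py` stores the concatenated token ids as raw bytes under a `tokens` key; the `int64` dtype is likewise an assumption.

```python
# Minimal sketch: read back the converted Shakespeare data from the local
# output folder written by convert_text_to_mds.py above.
# Assumes `pip install mosaicml-streaming numpy` and that each sample stores
# its concatenated token ids as raw bytes under a 'tokens' key.
import numpy as np
from streaming import StreamingDataset

# Point `local` at the --output_folder used during conversion.
dataset = StreamingDataset(local='my-copy-shakespeare', shuffle=False)

sample = dataset[0]
# The int64 dtype here is an assumption, not guaranteed by this patch.
tokens = np.frombuffer(sample['tokens'], dtype=np.int64)
print(f'{len(dataset)} samples; first sample has {len(tokens)} tokens')
```

With `--concat_tokens 2048`, each sample should decode to 2048 token ids.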