From 549329530326068efd9085352f22216e534223de Mon Sep 17 00:00:00 2001
From: Xiaohan Zhang
Date: Tue, 12 Mar 2024 21:48:58 -0700
Subject: [PATCH] Validation (#902)

* add validation script
* update
* change token count function
* reorganize cells
* Add unit tests
* Add a printout for CPT
* update question
* Add questions
* Fix lints
* update format
* update
* nb source
* add validation script
* update
* change token count function
* reorganize cells
* Add unit tests
* Add a printout for CPT
* update question
* Add questions
* Fix lints
* update format
* update
* nb source
* Remove license insert for validation notebook
* Add validation utils
* Minor cleanups (#858)
* nits
* logger
* add log
* lint
* update utils/__init__.py to include extra validation functions
* update notebook
* update
* update
* Read UC delta table (#773)
* initial commit
* use databricks-sql to read delta table and convert to json
* update
* update
* update
* add mocked unittest
* Fix lints
* update
* update
* restructure code
* Add timer for optimizing
* Add db-connect
* add wrapper
* update
* add install dbconnect
* update
* update
* patch dbconnect to allow multiple return formats
* update
* add arrow
* use compression
* clean up
* Add cluster rt check
* Fix lints
* remove patch.py for CI
* update
* update
* updat
* update
* fix tests
* fix lint
* update
* update
* Add more tests
* update
* update
* update
* change to download_json
* update
* fix lints
* Add decompressed option for arrow
* format json to jsonl
* Add comments
* Make cf_collect_type global option
* fix comments
* fix lints
* fix comments
* Fix lints
* change to use workspaceclient
* Add CPT support
* Rewire method assignment logic
* Fix bug in stripping https
* Add tests for rewired method assignment logic
* Fix lints
* Fix lints
* Removed logger set_level
* Remove pyspark. It conflicts with databricks-connect
* Update the comment
* skip cluster version check when cluster_id is serverless
* Add use_serverless flag
* update tests with use_serverless flag
* Fix lints

---------

Co-authored-by: Xiaohan Zhang

* Add download remote function to util
* update
* remove fused layernorm (#859)
* update
* update
* update
* update
* update
* update
* update
* update
* update
* Remove hardcoded combined.jsonl with a flag (#861)
* Remove hardcoded combined.jsonl with a flag
* update
* change output_json_path output_json_folder

---------

Co-authored-by: Xiaohan Zhang

* bump (#828)
* Add dask and dataframe_to_mds
* update
* update
* update
* update
* Add notebook
* update
* update
* remove script and tests, keep notebook
* update
* update
* update
* update
* Always initialize dist (#864)
* fix dev
* lint
* remove gpu
* updated notebook
* remove scripts keep notebook
* update notebook. rephrase.
* update
* Add response tokens
* update
* update
* Disable MDSWrite, return token counts
* Change plot settings
* update notebook
* update
* update notebook
* update
* update notebook
* update pip install link
* Change done file location

---------

Co-authored-by: Xiaohan Zhang
Co-authored-by: xiaohanzhan-db
Co-authored-by: Mihir Patel
---
 llmfoundry/utils/validation_utils.py       |  2 +-
 notebooks/validate_and_tokenize_data.ipynb | 43 +++++++++++++++++++---
 2 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/llmfoundry/utils/validation_utils.py b/llmfoundry/utils/validation_utils.py
index 4d204143c8..c2c2cb3e65 100644
--- a/llmfoundry/utils/validation_utils.py
+++ b/llmfoundry/utils/validation_utils.py
@@ -272,7 +272,7 @@ def count_shards(mds_root: str):
                             merge_shard_groups)
 
 log = logging.getLogger(__name__)
-DONE_FILENAME = '/Volumes/main/mosaic_hackathon/managed-volume/text_to_mds_conversion_done'
+DONE_FILENAME = '.text_to_mds_conversion_done'
 
 
 def parse_args(tokenizer,
diff --git a/notebooks/validate_and_tokenize_data.ipynb b/notebooks/validate_and_tokenize_data.ipynb
index 367cf5701e..26b3214437 100644
--- a/notebooks/validate_and_tokenize_data.ipynb
+++ b/notebooks/validate_and_tokenize_data.ipynb
@@ -27,7 +27,7 @@
     "- **Future Development**: We are in the process of developing a long-term data preparation service, which will eventually replace this script.\n",
     "\n",
     "#### User Defines:\n",
-    "- The inputs to this validation script is assumed to be the same or a subset of the FT API arguments, i.e., a configuration like below. Is this a valid assumption?\n",
+    "- The inputs to this validation script are assumed to be the same as, or a subset of, the FT API arguments, i.e., a configuration like the one below.\n",
     "- For the reference, FT API expects following\n",
     "```\n",
     "cfg = {\n",
@@ -122,8 +122,7 @@
    },
    "outputs": [],
    "source": [
-    "# %pip install git+https://github.com/mosaicml/llm-foundry.git@byod/data_validation\n",
-    "%pip install --upgrade --no-deps git+https://github.com/XiaohanZhangCMU/llm-foundryX.git@validation \n",
+    "%pip install --upgrade --no-deps git+https://github.com/mosaicml/llm-foundry.git@byod/data_validation\n",
     "%pip install \"mosaicml>=0.17.2,<0.18\"\n",
     "%pip install \"transformers>=4.36,<4.37\"\n",
     "%pip install \"mosaicml-streaming>=0.7.2,<0.8\"\n",
@@ -231,7 +230,7 @@
     "\n",
     "**Temporary Data Path Configuration:**\n",
     "\n",
-    "- temporary_jsonl_data_path: Defines a filesystem path where temporary data related to the training process will be stored.\n",
+    "- temporary_jsonl_data_path: Defines a filesystem path where temporary data related to the training process will be stored. Make sure the path is not shared by other users on the cluster, since a shared path can cause contention.\n",
     "- Environment variables for Hugging Face caches (HF_DATASETS_CACHE) are set to '/tmp/', directing dataset caching to a temporary directory.\n",
     "\n",
     "**[Supported Models by FT API](https://docs.mosaicml.com/projects/mcli/en/latest/finetuning/finetuning.html#supported-models):**. \n",
@@ -571,7 +570,39 @@
     }
    },
    "source": [
-    "#### User Defines"
+    "### Continued Pretrain API Arguments Configuration\n",
+    "\n",
+    "As with Instruction Finetune, you need to specify the following.\n",
+    "\n",
+    "**Fine-Tuning API Arguments (FT_API_args):**\n",
+    "\n",
+    "- model: Specifies the model to be used for fine-tuning. E.g., 'EleutherAI/gpt-neox-20b'\n",
+    "- train_data_path: The path to the training data. We currently only support a (remote/local) path to a collection of .txt files.\n",
+    "- task_type: Defines the type of task for which the training strategy will be applied. It is either 'INSTRUCTION_FINETUNE' or 'CONTINUED_PRETRAIN'.\n",
+    "- training_duration: The duration of the training process, expressed as a number of training epochs (e.g., 3).\n",
+    "- context_length: Specifies the context length of the model, set to 2048 here. This determines how many tokens the model considers for each training example. For Continued Pretraining, we concatenate tokens to form samples of length equal to context_length.\n",
+    "\n",
+    "**Temporary Data Path Configuration:**\n",
+    "\n",
+    "- temporary_mds_output_path: Defines a filesystem path where the MDS data produced by this notebook will be stored. Make sure the path is not shared by other users on the cluster, since a shared path can cause contention. For example, you can make it distinguishable by adding your username to the path.\n",
+    "\n",
+    "**[Supported Models by FT API](https://docs.mosaicml.com/projects/mcli/en/latest/finetuning/finetuning.html#supported-models):**\n",
+    "\n",
+    "You need to specify the context length based on the model mapping below.\n",
+    "```\n",
+    "ft_models = {\n",
+    "    'mosaicml/mpt-7b-8k': 8192,\n",
+    "    'mosaicml/mpt-7b': 2048,\n",
+    "    'mosaicml/mpt-30b': 8192,\n",
+    "    'meta-llama/Llama-2-13b-hf': 4096,\n",
+    "    'meta-llama/Llama-2-7b-hf': 4096,\n",
+    "    'meta-llama/Llama-2-70b-hf': 4096,\n",
+    "    'codellama/CodeLlama-7b-hf': 16384,\n",
+    "    'codellama/CodeLlama-13b-hf': 16384,\n",
+    "    'codellama/CodeLlama-34b-hf': 16384,\n",
+    "    'mistralai/Mistral-7B-v0.1': 32768,\n",
+    "}\n",
+    "```"
    ]
   },
   {
@@ -598,7 +629,7 @@
     "    training_duration=3,\n",
     "    context_length=2048,\n",
     ")\n",
-    "temporary_mds_output_path = '/Volumes/main/mosaic_hackathon/managed-volume/mds_data_11Jan24_5'"
+    "temporary_mds_output_path = '/Volumes/main/mosaic_hackathon/managed-volume/{your_username}/mds_data_11Jan24_5'"
    ]
   },
   {