From 8498662f81930edd597e1eed39bbd6ae34eba341 Mon Sep 17 00:00:00 2001
From: Xiaohan Zhang
Date: Mon, 22 Jan 2024 22:24:44 -0800
Subject: [PATCH] Validation (#900)

* add validation script
* update
* change token count function
* reorganize cells
* Add unit tests
* Add a printout for CPT
* update question
* Add questions
* Fix lints
* update format
* update
* nb source
* Remove license insert for validation notebook
* Add validation utils
* Minor cleanups (#858)
* nits
* logger
* add log
* lint
* update utils/__init__.py to include extra validation functions
* update notebook
* update
* update
* Read UC delta table (#773)
* initial commit
* use databricks-sql to read delta table and convert to json
* update
* update
* update
* add mocked unittest
* Fix lints
* update
* update
* restructure code
* Add timer for optimizing
* Add db-connect
* add wrapper
* update
* add install dbconnect
* update
* update
* patch dbconnect to allow multiple return formats
* update
* add arrow
* use compression
* clean up
* Add cluster rt check
* Fix lints
* remove patch.py for CI
* update
* update
* update
* update
* fix tests
* fix lint
* update
* update
* Add more tests
* update
* update
* update
* change to download_json
* update
* fix lints
* Add decompressed option for arrow
* format json to jsonl
* Add comments
* Make cf_collect_type global option
* fix comments
* fix lints
* fix comments
* Fix lints
* change to use workspaceclient
* Add CPT support
* Rewire method assignment logic
* Fix bug in stripping https
* Add tests for rewired method assignment logic
* Fix lints
* Fix lints
* Removed logger set_level
* Remove pyspark. It conflicts with databricks-connect
* Update the comment
* skip cluster version check when cluster_id is serverless
* Add use_serverless flag
* update tests with use_serverless flag
* Fix lints

---------

Co-authored-by: Xiaohan Zhang

* Add download remote function to util
* update
* remove fused layernorm (#859)
* update
* update
* update
* update
* update
* update
* update
* update
* update
* Remove hardcoded combined.jsonl with a flag (#861)
* Remove hardcoded combined.jsonl with a flag
* update
* change output_json_path to output_json_folder

---------

Co-authored-by: Xiaohan Zhang

* bump (#828)
* Add dask and dataframe_to_mds
* update
* update
* update
* update
* Add notebook
* update
* update
* remove script and tests, keep notebook
* update
* update
* update
* update
* Always initialize dist (#864)
* fix dev
* lint
* remove gpu
* updated notebook
* remove scripts keep notebook
* update notebook. rephrase.
* update
* Add response tokens
* update
* update
* Disable MDSWrite, return token counts
* Change plot settings
* update notebook
* update
* update notebook

---------

Co-authored-by: Xiaohan Zhang
Co-authored-by: xiaohanzhan-db
Co-authored-by: Mihir Patel
---
 notebooks/validate_and_tokenize_data.ipynb | 92 ++++++++--------------
 1 file changed, 32 insertions(+), 60 deletions(-)

diff --git a/notebooks/validate_and_tokenize_data.ipynb b/notebooks/validate_and_tokenize_data.ipynb
index b581c6b6bf..f070da0c43 100644
--- a/notebooks/validate_and_tokenize_data.ipynb
+++ b/notebooks/validate_and_tokenize_data.ipynb
@@ -4,10 +4,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "f275a21b-47d4-472c-972b-e2a84a597db2",
     "showTitle": false,
@@ -57,10 +54,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "3d08a21c-9f5a-4ad2-af85-e016335cc53d",
     "showTitle": false,
@@ -200,10 +194,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "3a513cdd-967d-4a87-b56f-340053fa79cd",
     "showTitle": false,
@@ -218,10 +209,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "cfebdfdf-b87c-4a77-b97c-4697566a55fa",
     "showTitle": false,
@@ -265,17 +253,14 @@
   "outputs": [],
   "source": [
    "FT_API_args = Namespace(\n",
-    "    model='EleutherAI/gpt-neox-20b',\n",
-    "    train_data_path= 'main.streaming.random_large_table', # 'tatsu-lab/alpaca/train', # '/Volumes/main/mosaic_hackathon/managed-volume/IFT/train.jsonl', 'tatsu-lab/alpaca/train', # 'mosaicml/dolly_hhrlhf/train', # tatsu-lab/alpaca/train',\n",
+    "    model= 'mosaicml/mpt-7b', # Other examples: 'EleutherAI/gpt-neox-20b',\n",
+    "    train_data_path= 'main.streaming.random_large_table', # Other examples: 'tatsu-lab/alpaca/train', # '/Volumes/main/mosaic_hackathon/managed-volume/IFT/train.jsonl' # 'mosaicml/dolly_hhrlhf/train'\n",
    "    task_type='INSTRUCTION_FINETUNE',\n",
    "    training_duration=3,\n",
    "    context_length=2048,\n",
    ")\n",
    "\n",
    "temporary_jsonl_data_path = '/Volumes/main/mosaic_hackathon/managed-volume/IFT/ft_data_11Jan24_3/train'\n",
-    "# os.environ['HF_ASSETS_CACHE'] = '/tmp/'\n",
-    "# os.environ['HF_HOME'] = '/tmp/'\n",
-    "# os.environ['HF_HUB_CACHE'] = '/tmp/'\n",
    "os.environ['HF_DATASETS_CACHE'] = '/tmp/'\n",
    "os.makedirs(temporary_jsonl_data_path, exist_ok=True)"
   ]
@@ -284,10 +269,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "39c45005-1a77-4162-b9e4-bd8df6f5ec69",
     "showTitle": false,
@@ -363,10 +345,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "06d46367-bd32-473a-9f16-1b34a8dd9356",
     "showTitle": false,
@@ -381,10 +360,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "1a28320a-a2a1-4f3c-a0cd-ad6045a24f64",
     "showTitle": false,
@@ -474,10 +450,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "9713a0ce-80f4-4187-b10b-4223b17fe4c1",
     "showTitle": false,
@@ -516,10 +489,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "7249e9e6-1ea7-4fc9-8959-8a17d62a9fb4",
     "showTitle": false,
@@ -560,10 +530,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "6699f47f-9b53-47da-95c0-b862c5826d0a",
     "showTitle": false,
@@ -578,10 +545,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "dd37fdce-62d0-493e-bfa9-d823634b2a0d",
     "showTitle": false,
@@ -610,14 +574,13 @@
   "outputs": [],
   "source": [
    "FT_API_args = Namespace(\n",
-    "    model='EleutherAI/gpt-neox-20b',\n",
+    "    model= 'mosaicml/mpt-7b',\n",
    "    train_data_path= '/Volumes/main/mosaic_hackathon/managed-volume/ABT',\n",
    "    task_type='CONTINUED_PRETRAIN',\n",
    "    training_duration=3,\n",
    "    context_length=2048,\n",
    ")\n",
-    "temporary_mds_output_path = '/Volumes/main/mosaic_hackathon/managed-volume/mds_data_11Jan24_5'\n",
-    "# temporary_mds_output_path = '/tmp/CPT/mds_data_11Jan24_4'"
+    "temporary_mds_output_path = '/Volumes/main/mosaic_hackathon/managed-volume/mds_data_11Jan24_5'"
   ]
  },
  {
@@ -644,10 +607,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "c21e7d1b-db34-4e5d-b6d9-190dc75170d3",
     "showTitle": false,
@@ -715,10 +675,7 @@
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
-    "cellMetadata": {
-     "byteLimit": 2048000,
-     "rowLimit": 10000
-    },
+    "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "298eb990-9160-4e1b-958f-33dd2c11b54b",
     "showTitle": false,
@@ -754,6 +711,21 @@
    "print(f\"By default, you'll train for {n_epochs} epochs on this dataset\")\n",
    "print(f\"By default, ~{n_epochs * n_billing_tokens_in_dataset} tokens will be used in training\")"
   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "cellMetadata": {},
+     "inputWidgets": {},
+     "nuid": "e123669c-2f77-4d66-93eb-04efd546f39f",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "outputs": [],
+   "source": []
  }
 ],
 "metadata": {
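For reference, the token-count printout this patch adds to the notebook boils down to multiplying the dataset's token count by the number of epochs (`FT_API_args.training_duration`). A minimal sketch of that arithmetic, using made-up values in place of the notebook's computed `n_billing_tokens_in_dataset`:

```python
# Sketch of the billing-token estimate printed at the end of the notebook.
# `n_billing_tokens_in_dataset` would normally come from tokenizing the
# dataset; the value below is a hypothetical placeholder.
n_epochs = 3  # mirrors FT_API_args.training_duration
n_billing_tokens_in_dataset = 1_000_000  # hypothetical token count

print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, ~{n_epochs * n_billing_tokens_in_dataset} tokens will be used in training")
```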