Skip to content

Commit

Permalink
Validation (#1025)
Browse files Browse the repository at this point in the history
* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* Remove license insert for validation notebook

* Add validation utils

* Minor cleanups (#858)

* nits

* logger

* add log

* lint

* update utils/__init__.py to include extra validation functions

* update notebook

* update

* update

* Read UC delta table (#773)

* initial commit

* use databricks-sql to read delta table and convert to json

* update

* update

* update

* add mocked unittest

* Fix lints

* update

* update

* restructure code

* Add timer for optimizing

* Add db-connect

* add wrapper

* update

* add install dbconnect

* update

* update

* patch dbconnect to allow multiple return formats

* update

* add arrow

* use compression

* clean up

* Add cluster rt check

* Fix lints

* remove patch.py for CI

* update

* update

* updat

* update

* fix tests

* fix lint

* update

* update

* Add more tests

* update

* update

* update

* change to download_json

* update

* fix lints

* Add decompressed option for arrow

* format json to jsonl

* Add comments

* Make cf_collect_type global option

* fix comments

* fix lints

* fix comments

* Fix lints

* change to use workspaceclient

* Add CPT support

* Rewire method assignment logic

* Fix bug in stripping https

* Add tests for rewired method assignment logic

* Fix lints

* Fix lints

* Removed logger set_level

* Remove pyspark. It conflicts with databricks-connect

* Update the comment

* skip cluster version check when cluster_id is serverless

* Add use_serverless flag

* update tests with use_serverless flag

* Fix lints

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* Add download remote function to util

* update

* remove fused layernorm (#859)

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Remove hardcoded combined.jsonl with a flag (#861)

* Remove hardcoded combined.jsonl with a flag

* update

* change output_json_path output_json_folder

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* bump (#828)

* Add dask and dataframe_to_mds

* update

* update

* update

* update

* Add notebook

* update

* update

* remove script and tests, keep notebook

* update

* update

* update

* update

* Always initialize dist  (#864)

* fix dev

* lint

* remove gpu

* updated notebook

* remove scripts keep notebook

* update notebook. rephrase.

* update

* Add response tokens

* update

* update

* Disable MDSWrite, return token counts

* Change plot settings

* update notebook

* update

* update notebook

* update

* update notebook

* update pip install link

* Change done file location

* Create the dest folder

---------

Co-authored-by: Xiaohan Zhang <[email protected]>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <[email protected]>
  • Loading branch information
3 people authored Mar 13, 2024
1 parent 5493295 commit 5090e13
Showing 1 changed file with 43 additions and 16 deletions.
59 changes: 43 additions & 16 deletions notebooks/validate_and_tokenize_data.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand All @@ -87,7 +87,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand All @@ -107,7 +107,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand All @@ -134,7 +134,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand All @@ -154,7 +154,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand Down Expand Up @@ -254,7 +254,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand Down Expand Up @@ -311,7 +311,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand Down Expand Up @@ -407,7 +407,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand Down Expand Up @@ -482,7 +482,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand Down Expand Up @@ -522,7 +522,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand Down Expand Up @@ -607,7 +607,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand All @@ -634,7 +634,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand Down Expand Up @@ -677,7 +677,17 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.makedirs(temporary_mds_output_path, exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand Down Expand Up @@ -737,7 +747,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
Expand All @@ -763,7 +773,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
Expand All @@ -786,8 +796,25 @@
},
"notebookName": "validate_and_tokenize_data",
"widgets": {}
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
"nbformat_minor": 4
}

0 comments on commit 5090e13

Please sign in to comment.