Validation #844

Open
wants to merge 126 commits into
base: main
Changes from 1 commit
Commits
8cb6522
add validation script
xiaohanzhan-db Dec 23, 2023
c59c11f
update
xiaohanzhan-db Jan 3, 2024
66f34eb
change token count function
Jan 3, 2024
2cd387b
reorganize cells
Jan 5, 2024
3eac3bf
Add unit tests
xiaohanzhan-db Jan 5, 2024
d2d9767
Add a printout for CPT
xiaohanzhan-db Jan 6, 2024
be25591
update question
xiaohanzhan-db Jan 6, 2024
4651be7
Add questions
Jan 8, 2024
5cd6a94
Fix lints
xiaohanzhan-db Jan 8, 2024
8e2c1f4
Merge branch 'main' into validation
XiaohanZhangCMU Jan 8, 2024
e6e4a81
update format
xiaohanzhan-db Jan 8, 2024
34c5690
Merge branch 'validation' of github.com:XiaohanZhangCMU/llm-foundryX …
xiaohanzhan-db Jan 8, 2024
1668b9a
update
xiaohanzhan-db Jan 8, 2024
2219135
nb source
xiaohanzhan-db Jan 8, 2024
86c6e87
add validation script
xiaohanzhan-db Dec 23, 2023
678b376
update
xiaohanzhan-db Jan 3, 2024
297e057
change token count function
Jan 3, 2024
09d0ebb
reorganize cells
Jan 5, 2024
460df65
Add unit tests
xiaohanzhan-db Jan 5, 2024
3ffd200
Add a printout for CPT
xiaohanzhan-db Jan 6, 2024
9362886
update question
xiaohanzhan-db Jan 6, 2024
898e5ac
Add questions
Jan 8, 2024
a4bef71
Fix lints
xiaohanzhan-db Jan 8, 2024
4ca9cc6
update format
xiaohanzhan-db Jan 8, 2024
d636a0f
update
xiaohanzhan-db Jan 8, 2024
827d155
nb source
xiaohanzhan-db Jan 8, 2024
6bbf3fc
Remove license insert for validation notebook
xiaohanzhan-db Jan 8, 2024
4f6a4fb
Merge branch 'validation' of github.com:XiaohanZhangCMU/llm-foundryX …
xiaohanzhan-db Jan 8, 2024
5966b68
Add validation utils
xiaohanzhan-db Jan 11, 2024
da17813
Merge branch 'main' into validation
xiaohanzhan-db Jan 11, 2024
89fb909
Validation (#856)
XiaohanZhangCMU Jan 11, 2024
55e4626
update utils/__init__.py to include extra validation functions
xiaohanzhan-db Jan 11, 2024
45544a1
update notebook
Jan 11, 2024
d2797b3
update
xiaohanzhan-db Jan 11, 2024
019da77
Merge branch 'validation' of github.com:XiaohanZhangCMU/llm-foundryX …
xiaohanzhan-db Jan 11, 2024
756fdae
update
xiaohanzhan-db Jan 11, 2024
93b5a9f
Add download remote function to util
xiaohanzhan-db Jan 11, 2024
b47c878
update
xiaohanzhan-db Jan 11, 2024
13fd34c
update
xiaohanzhan-db Jan 11, 2024
610f669
update
xiaohanzhan-db Jan 11, 2024
9f2e51b
update
xiaohanzhan-db Jan 11, 2024
ec68f10
update
xiaohanzhan-db Jan 11, 2024
1e76068
update
xiaohanzhan-db Jan 11, 2024
7a5c164
update
xiaohanzhan-db Jan 11, 2024
e76038f
Merge branch 'main' into validation
xiaohanzhan-db Jan 11, 2024
5b413f5
update
xiaohanzhan-db Jan 11, 2024
a1aa31f
update
xiaohanzhan-db Jan 11, 2024
d24fd5c
update
xiaohanzhan-db Jan 11, 2024
55fce37
Add dask and dataframe_to_mds
xiaohanzhan-db Jan 12, 2024
86e2412
update
xiaohanzhan-db Jan 12, 2024
bbfec65
update
xiaohanzhan-db Jan 12, 2024
b2e880d
update
xiaohanzhan-db Jan 12, 2024
596443a
update
xiaohanzhan-db Jan 12, 2024
ea65187
Add notebook
xiaohanzhan-db Jan 12, 2024
378a4e0
update
xiaohanzhan-db Jan 12, 2024
af6e9aa
update
Jan 12, 2024
4e286ec
remove script and tests, keep notebook
xiaohanzhan-db Jan 12, 2024
09c4892
update
xiaohanzhan-db Jan 12, 2024
c82da6c
update
xiaohanzhan-db Jan 12, 2024
e5f83cc
update
xiaohanzhan-db Jan 12, 2024
17d2b9f
update
xiaohanzhan-db Jan 12, 2024
6579d55
Merge branch 'main' into validation
xiaohanzhan-db Jan 12, 2024
56308ff
Merge branch 'byod/data_validation' into validation
XiaohanZhangCMU Jan 12, 2024
00a51b5
Validation (#862)
XiaohanZhangCMU Jan 12, 2024
4daa324
updated notebook
Jan 12, 2024
b809691
Merge branch 'main' into validation
xiaohanzhan-db Jan 12, 2024
8b75f94
remove scripts keep notebook
xiaohanzhan-db Jan 12, 2024
99bf2cd
merge with byod/data_validation
xiaohanzhan-db Jan 12, 2024
9b37063
Validation (#866)
XiaohanZhangCMU Jan 12, 2024
22014d6
update notebook. rephrase.
Jan 12, 2024
d9f28aa
merged
xiaohanzhan-db Jan 12, 2024
f1fa63c
Validation (#867)
XiaohanZhangCMU Jan 12, 2024
43c8ac9
update
xiaohanzhan-db Jan 12, 2024
b8ac771
Add response tokens
xiaohanzhan-db Jan 16, 2024
1b9681c
update
xiaohanzhan-db Jan 16, 2024
16883c2
merge
xiaohanzhan-db Jan 16, 2024
a9218d6
Validation (#875)
XiaohanZhangCMU Jan 16, 2024
c7567f1
update
xiaohanzhan-db Jan 20, 2024
1764b72
Disable MDSWrite, return token counts
xiaohanzhan-db Jan 22, 2024
808ced5
Change plot settings
xiaohanzhan-db Jan 23, 2024
26ae516
Fix conflict
xiaohanzhan-db Jan 23, 2024
a212ee8
update notebook
Jan 23, 2024
d279817
update
xiaohanzhan-db Jan 23, 2024
f1cfe9e
Validation (#898)
XiaohanZhangCMU Jan 23, 2024
dbe3f4e
update notebook
Jan 23, 2024
3005718
update
xiaohanzhan-db Jan 23, 2024
8498662
Validation (#900)
XiaohanZhangCMU Jan 23, 2024
f5b900c
update
Jan 23, 2024
02d0979
Merge branch 'byod/data_validation' of https://github.com/mosaicml/ll…
xiaohanzhan-db Jan 23, 2024
205e405
Validation (#901)
XiaohanZhangCMU Jan 23, 2024
2f883a7
update notebook
Jan 23, 2024
0315caf
update
xiaohanzhan-db Jan 23, 2024
1a510ff
update pip install link
xiaohanzhan-db Mar 13, 2024
530a55a
Change done file location
xiaohanzhan-db Mar 13, 2024
5493295
Validation (#902)
XiaohanZhangCMU Mar 13, 2024
81c3757
Create the dest folder
xiaohanzhan-db Mar 13, 2024
5090e13
Validation (#1025)
XiaohanZhangCMU Mar 13, 2024
f88917d
update notebook
xiaohanzhan-db Mar 14, 2024
4c86f74
update
xiaohanzhan-db Mar 14, 2024
962974b
Merge branch 'byod/data_validation' into validation
XiaohanZhangCMU Mar 14, 2024
9fd91cf
Validation (#1027)
XiaohanZhangCMU Mar 14, 2024
67f7b4c
Merge pull request #1 from mosaicml/byod/data_validation
XiaohanZhangCMU Mar 14, 2024
28cd2e6
update notebook
xiaohanzhan-db Mar 14, 2024
944b260
Validation (#1028)
XiaohanZhangCMU Mar 14, 2024
9a19d8a
fix conflict
xiaohanzhan-db Mar 14, 2024
a6b2ae0
Validation (#1031)
XiaohanZhangCMU Mar 14, 2024
de90934
update token_counts
xiaohanzhan-db Mar 14, 2024
5dfd30c
Validation (#1032)
XiaohanZhangCMU Mar 14, 2024
61adb43
update pip install list
xiaohanzhan-db Mar 14, 2024
c404dc7
Validation (#1033)
XiaohanZhangCMU Mar 14, 2024
c77bdf6
fix
xiaohanzhan-db Mar 14, 2024
ad71cc0
update
xiaohanzhan-db Mar 14, 2024
9bc3a39
fix token counts
xiaohanzhan-db Mar 14, 2024
9ec582e
Expose validate chat
xiaohanzhan-db Mar 14, 2024
734008e
Expose more
xiaohanzhan-db Mar 14, 2024
51f2eef
update
xiaohanzhan-db Mar 14, 2024
7b6956d
expose
xiaohanzhan-db Mar 14, 2024
60ed7de
add collate
xiaohanzhan-db Mar 14, 2024
fba1dcb
Fix
xiaohanzhan-db Mar 14, 2024
58185ba
Fix conflict
xiaohanzhan-db Mar 14, 2024
8e8f431
Validation (#1034)
XiaohanZhangCMU Mar 14, 2024
24f3d9e
update notebook
xiaohanzhan-db Mar 14, 2024
714002d
Fix conflict
xiaohanzhan-db Mar 14, 2024
1640f30
Validation (#1035)
XiaohanZhangCMU Mar 14, 2024
b053363
Merge branch 'byod/data_validation' of https://github.com/mosaicml/ll…
xiaohanzhan-db Mar 14, 2024
7e1d567
update notebook
xiaohanzhan-db Mar 15, 2024
Validation (#1028)
* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* Remove license insert for validation notebook

* Add validation utils

* Minor cleanups (#858)

* nits

* logger

* add log

* lint

* update utils/__init__.py to include extra validation functions

* update notebook

* update

* update

* Read UC delta table (#773)

* initial commit

* use databricks-sql to read delta table and convert to json

* update

* update

* update

* add mocked unittest

* Fix lints

* update

* update

* restructure code

* Add timer for optimizing

* Add db-connect

* add wrapper

* update

* add install dbconnect

* update

* update

* patch dbconnect to allow multiple return formats

* update

* add arrow

* use compression

* clean up

* Add cluster rt check

* Fix lints

* remove patch.py for CI

* update

* update

* update

* update

* fix tests

* fix lint

* update

* update

* Add more tests

* update

* update

* update

* change to download_json

* update

* fix lints

* Add decompressed option for arrow

* format json to jsonl

* Add comments

* Make cf_collect_type global option

* fix comments

* fix lints

* fix comments

* Fix lints

* change to use workspaceclient

* Add CPT support

* Rewire method assignment logic

* Fix bug in stripping https

* Add tests for rewired method assignment logic

* Fix lints

* Fix lints

* Removed logger set_level

* Remove pyspark. It conflicts with databricks-connect

* Update the comment

* skip cluster version check when cluster_id is serverless

* Add use_serverless flag

* update tests with use_serverless flag

* Fix lints

---------

Co-authored-by: Xiaohan Zhang <xiaohan.zhang@databricks.com>

* Add download remote function to util

* update

* remove fused layernorm (#859)

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Remove hardcoded combined.jsonl with a flag (#861)

* Remove hardcoded combined.jsonl with a flag

* update

* change output_json_path output_json_folder

---------

Co-authored-by: Xiaohan Zhang <xiaohan.zhang@databricks.com>

* bump (#828)

* Add dask and dataframe_to_mds

* update

* update

* update

* update

* Add notebook

* update

* update

* remove script and tests, keep notebook

* update

* update

* update

* update

* Always initialize dist (#864)

* fix dev

* lint

* remove gpu

* updated notebook

* remove scripts keep notebook

* update notebook. rephrase.

* update

* Add response tokens

* update

* update

* Disable MDSWrite, return token counts

* Change plot settings

* update notebook

* update

* update notebook

* update

* update notebook

* update pip install link

* Change done file location

* Create the dest folder

* update notebook

* update

* update notebook

---------

Co-authored-by: Xiaohan Zhang <xiaohan.zhang@databricks.com>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
3 people authored Mar 14, 2024
commit 944b260c98e4ceb6d24b142e4001c19c950e7676
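Two of the hunks in this commit's change to notebooks/validate_and_tokenize_data.ipynb reset `"execution_count"` from `0` to `null` on a code cell. A minimal sketch of how that normalization could be scripted is shown here. It uses only the standard library; the helper name `reset_execution_counts` and the sample notebook are hypothetical and are not part of this PR.

```python
import json


def reset_execution_counts(notebook: dict) -> int:
    """Set execution_count to None on every code cell.

    Hypothetical helper for illustration; returns the number of cells changed.
    """
    changed = 0
    for cell in notebook.get("cells", []):
        # Only code cells carry an execution_count in the notebook format.
        if cell.get("cell_type") == "code" and cell.get("execution_count") is not None:
            cell["execution_count"] = None
            changed += 1
    return changed


# A tiny in-memory notebook standing in for the real .ipynb file (assumed shape).
sample = {
    "cells": [
        {"cell_type": "code", "execution_count": 0, "metadata": {},
         "outputs": [], "source": ["import os\n"]},
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [], "source": []},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 0,
}

changed = reset_execution_counts(sample)
# json.dumps writes Python None as JSON null, matching the value in the diff.
serialized = json.dumps(sample, indent=1)
```

For a file on disk, the same function would be applied to the result of `json.load` and the notebook written back with `json.dump`.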
25 changes: 2 additions & 23 deletions notebooks/validate_and_tokenize_data.ipynb
@@ -290,7 +290,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 0,
+"execution_count": null,
 "metadata": {
 "application/vnd.databricks.v1+cell": {
 "cellMetadata": {
@@ -710,7 +710,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 0,
+"execution_count": null,
 "metadata": {
 "application/vnd.databricks.v1+cell": {
 "cellMetadata": {
@@ -803,27 +803,6 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"application/vnd.databricks.v1+cell": {
-"cellMetadata": {
-"byteLimit": 2048000,
-"rowLimit": 10000
-},
-"inputWidgets": {},
-"nuid": "f5aea2a8-db29-40c9-8ed2-b6a1d032e7ab",
-"showTitle": false,
-"title": ""
-}
-},
-"outputs": [],
-"source": [
-"import os\n",
-"os.makedirs(temporary_mds_output_path, exist_ok=True)"
-]
-},
-{
-"cell_type": "code",
-"execution_count": 0,
 "metadata": {
 "application/vnd.databricks.v1+cell": {
 "cellMetadata": {