Validation #844

Open
wants to merge 126 commits into
base: main
Changes from 69 commits
Commits
8cb6522
add validation script
xiaohanzhan-db Dec 23, 2023
c59c11f
update
xiaohanzhan-db Jan 3, 2024
66f34eb
change token count function
Jan 3, 2024
2cd387b
reorganize cells
Jan 5, 2024
3eac3bf
Add unit tests
xiaohanzhan-db Jan 5, 2024
d2d9767
Add a printout for CPT
xiaohanzhan-db Jan 6, 2024
be25591
update question
xiaohanzhan-db Jan 6, 2024
4651be7
Add questions
Jan 8, 2024
5cd6a94
Fix lints
xiaohanzhan-db Jan 8, 2024
8e2c1f4
Merge branch 'main' into validation
XiaohanZhangCMU Jan 8, 2024
e6e4a81
update format
xiaohanzhan-db Jan 8, 2024
34c5690
Merge branch 'validation' of github.com:XiaohanZhangCMU/llm-foundryX …
xiaohanzhan-db Jan 8, 2024
1668b9a
update
xiaohanzhan-db Jan 8, 2024
2219135
nb source
xiaohanzhan-db Jan 8, 2024
86c6e87
add validation script
xiaohanzhan-db Dec 23, 2023
678b376
update
xiaohanzhan-db Jan 3, 2024
297e057
change token count function
Jan 3, 2024
09d0ebb
reorganize cells
Jan 5, 2024
460df65
Add unit tests
xiaohanzhan-db Jan 5, 2024
3ffd200
Add a printout for CPT
xiaohanzhan-db Jan 6, 2024
9362886
update question
xiaohanzhan-db Jan 6, 2024
898e5ac
Add questions
Jan 8, 2024
a4bef71
Fix lints
xiaohanzhan-db Jan 8, 2024
4ca9cc6
update format
xiaohanzhan-db Jan 8, 2024
d636a0f
update
xiaohanzhan-db Jan 8, 2024
827d155
nb source
xiaohanzhan-db Jan 8, 2024
6bbf3fc
Remove license insert for validation notebook
xiaohanzhan-db Jan 8, 2024
4f6a4fb
Merge branch 'validation' of github.com:XiaohanZhangCMU/llm-foundryX …
xiaohanzhan-db Jan 8, 2024
5966b68
Add validation utils
xiaohanzhan-db Jan 11, 2024
da17813
Merge branch 'main' into validation
xiaohanzhan-db Jan 11, 2024
89fb909
Validation (#856)
XiaohanZhangCMU Jan 11, 2024
55e4626
update utils/__init__.py to include extra validation functions
xiaohanzhan-db Jan 11, 2024
45544a1
update notebook
Jan 11, 2024
d2797b3
update
xiaohanzhan-db Jan 11, 2024
019da77
Merge branch 'validation' of github.com:XiaohanZhangCMU/llm-foundryX …
xiaohanzhan-db Jan 11, 2024
756fdae
update
xiaohanzhan-db Jan 11, 2024
93b5a9f
Add download remote function to util
xiaohanzhan-db Jan 11, 2024
b47c878
update
xiaohanzhan-db Jan 11, 2024
13fd34c
update
xiaohanzhan-db Jan 11, 2024
610f669
update
xiaohanzhan-db Jan 11, 2024
9f2e51b
update
xiaohanzhan-db Jan 11, 2024
ec68f10
update
xiaohanzhan-db Jan 11, 2024
1e76068
update
xiaohanzhan-db Jan 11, 2024
7a5c164
update
xiaohanzhan-db Jan 11, 2024
e76038f
Merge branch 'main' into validation
xiaohanzhan-db Jan 11, 2024
5b413f5
update
xiaohanzhan-db Jan 11, 2024
a1aa31f
update
xiaohanzhan-db Jan 11, 2024
d24fd5c
update
xiaohanzhan-db Jan 11, 2024
55fce37
Add dask and dataframe_to_mds
xiaohanzhan-db Jan 12, 2024
86e2412
update
xiaohanzhan-db Jan 12, 2024
bbfec65
update
xiaohanzhan-db Jan 12, 2024
b2e880d
update
xiaohanzhan-db Jan 12, 2024
596443a
update
xiaohanzhan-db Jan 12, 2024
ea65187
Add notebook
xiaohanzhan-db Jan 12, 2024
378a4e0
update
xiaohanzhan-db Jan 12, 2024
af6e9aa
update
Jan 12, 2024
4e286ec
remove script and tests, keep notebook
xiaohanzhan-db Jan 12, 2024
09c4892
update
xiaohanzhan-db Jan 12, 2024
c82da6c
update
xiaohanzhan-db Jan 12, 2024
e5f83cc
update
xiaohanzhan-db Jan 12, 2024
17d2b9f
update
xiaohanzhan-db Jan 12, 2024
6579d55
Merge branch 'main' into validation
xiaohanzhan-db Jan 12, 2024
56308ff
Merge branch 'byod/data_validation' into validation
XiaohanZhangCMU Jan 12, 2024
00a51b5
Validation (#862)
XiaohanZhangCMU Jan 12, 2024
4daa324
updated notebook
Jan 12, 2024
b809691
Merge branch 'main' into validation
xiaohanzhan-db Jan 12, 2024
8b75f94
remove scripts keep notebook
xiaohanzhan-db Jan 12, 2024
99bf2cd
merge with byod/data_validation
xiaohanzhan-db Jan 12, 2024
9b37063
Validation (#866)
XiaohanZhangCMU Jan 12, 2024
22014d6
update notebook. rephrase.
Jan 12, 2024
d9f28aa
merged
xiaohanzhan-db Jan 12, 2024
f1fa63c
Validation (#867)
XiaohanZhangCMU Jan 12, 2024
43c8ac9
update
xiaohanzhan-db Jan 12, 2024
b8ac771
Add response tokens
xiaohanzhan-db Jan 16, 2024
1b9681c
update
xiaohanzhan-db Jan 16, 2024
16883c2
merge
xiaohanzhan-db Jan 16, 2024
a9218d6
Validation (#875)
XiaohanZhangCMU Jan 16, 2024
c7567f1
update
xiaohanzhan-db Jan 20, 2024
1764b72
Disable MDSWrite, return token counts
xiaohanzhan-db Jan 22, 2024
808ced5
Change plot settings
xiaohanzhan-db Jan 23, 2024
26ae516
Fix conflict
xiaohanzhan-db Jan 23, 2024
a212ee8
update notebook
Jan 23, 2024
d279817
update
xiaohanzhan-db Jan 23, 2024
f1cfe9e
Validation (#898)
XiaohanZhangCMU Jan 23, 2024
dbe3f4e
update notebook
Jan 23, 2024
3005718
update
xiaohanzhan-db Jan 23, 2024
8498662
Validation (#900)
XiaohanZhangCMU Jan 23, 2024
f5b900c
update
Jan 23, 2024
02d0979
Merge branch 'byod/data_validation' of https://github.com/mosaicml/ll…
xiaohanzhan-db Jan 23, 2024
205e405
Validation (#901)
XiaohanZhangCMU Jan 23, 2024
2f883a7
update notebook
Jan 23, 2024
0315caf
update
xiaohanzhan-db Jan 23, 2024
1a510ff
update pip install link
xiaohanzhan-db Mar 13, 2024
530a55a
Change done file location
xiaohanzhan-db Mar 13, 2024
5493295
Validation (#902)
XiaohanZhangCMU Mar 13, 2024
81c3757
Create the dest folder
xiaohanzhan-db Mar 13, 2024
5090e13
Validation (#1025)
XiaohanZhangCMU Mar 13, 2024
f88917d
update notebook
xiaohanzhan-db Mar 14, 2024
4c86f74
update
xiaohanzhan-db Mar 14, 2024
962974b
Merge branch 'byod/data_validation' into validation
XiaohanZhangCMU Mar 14, 2024
9fd91cf
Validation (#1027)
XiaohanZhangCMU Mar 14, 2024
67f7b4c
Merge pull request #1 from mosaicml/byod/data_validation
XiaohanZhangCMU Mar 14, 2024
28cd2e6
update notebook
xiaohanzhan-db Mar 14, 2024
944b260
Validation (#1028)
XiaohanZhangCMU Mar 14, 2024
9a19d8a
fix conflict
xiaohanzhan-db Mar 14, 2024
a6b2ae0
Validation (#1031)
XiaohanZhangCMU Mar 14, 2024
de90934
update token_counts
xiaohanzhan-db Mar 14, 2024
5dfd30c
Validation (#1032)
XiaohanZhangCMU Mar 14, 2024
61adb43
update pip install list
xiaohanzhan-db Mar 14, 2024
c404dc7
Validation (#1033)
XiaohanZhangCMU Mar 14, 2024
c77bdf6
fix
xiaohanzhan-db Mar 14, 2024
ad71cc0
update
xiaohanzhan-db Mar 14, 2024
9bc3a39
fix token counts
xiaohanzhan-db Mar 14, 2024
9ec582e
Expose validate chat
xiaohanzhan-db Mar 14, 2024
734008e
Expose more
xiaohanzhan-db Mar 14, 2024
51f2eef
update
xiaohanzhan-db Mar 14, 2024
7b6956d
expose
xiaohanzhan-db Mar 14, 2024
60ed7de
add collate
xiaohanzhan-db Mar 14, 2024
fba1dcb
Fix
xiaohanzhan-db Mar 14, 2024
58185ba
Fix conflict
xiaohanzhan-db Mar 14, 2024
8e8f431
Validation (#1034)
XiaohanZhangCMU Mar 14, 2024
24f3d9e
update notebook
xiaohanzhan-db Mar 14, 2024
714002d
Fix conflict
xiaohanzhan-db Mar 14, 2024
1640f30
Validation (#1035)
XiaohanZhangCMU Mar 14, 2024
b053363
Merge branch 'byod/data_validation' of https://github.com/mosaicml/ll…
xiaohanzhan-db Mar 14, 2024
7e1d567
update notebook
xiaohanzhan-db Mar 15, 2024
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -66,6 +66,7 @@ repos:
- --comment-style
- '#'
types: [python]
exclude: scripts/data_prep/validate_and_tokenize_data.py
Contributor

Was there a particular reason this was excluded? Just curious.

Contributor Author

For Databricks to render a Python script as a notebook, the script must start with the line # Databricks notebook source. This change tells pre-commit to skip inserting the license header into that script, since the header would otherwise push the marker off the first line.
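
To make the constraint in the reply above concrete, here is a minimal sketch. It assumes the marker line is exactly `# Databricks notebook source`; the helper names and the placeholder header text are hypothetical and are not part of this PR or of pre-commit.

```python
# Assumed marker line; Databricks source-format notebooks begin with it.
DATABRICKS_HEADER = "# Databricks notebook source"


def starts_with_databricks_header(source: str) -> bool:
    """Return True if the first line of `source` is the notebook marker."""
    first_line = source.split("\n", 1)[0].rstrip()
    return first_line == DATABRICKS_HEADER


def prepend_header(source: str, header_line: str) -> str:
    """Mimic a hook that inserts a header line at the top of a file."""
    return header_line + "\n" + source


notebook_source = DATABRICKS_HEADER + "\nprint('hello')\n"

# The untouched script still has the marker on line 1.
print(starts_with_databricks_header(notebook_source))

# After a header is inserted, the marker sits on line 2 and the check fails.
# "# placeholder header" stands in for whatever the hook would insert.
modified = prepend_header(notebook_source, "# placeholder header")
print(starts_with_databricks_header(modified))
```

This is why the script is excluded from the hook rather than reordered afterwards: any line inserted above the marker would break the first-line check.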

- repo: https://github.com/PyCQA/docformatter
rev: v1.5.0
hooks:
2 changes: 1 addition & 1 deletion llmfoundry/data/finetuning/tasks.py
@@ -434,7 +434,7 @@ def dataset_mapper(example: Dict):

detected_cpu_count = os.cpu_count() or 1
detected_cpus_with_margin = detected_cpu_count - 8
- num_cpus_to_use = max(1, detected_cpus_with_margin)
+ num_cpus_to_use = detected_cpu_count  # Hack for Validation instead of max(1, detected_cpus_with_margin)

columns_to_remove = list(dataset[0].keys())
tokenized_dataset = dataset.map(
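
The hunk above swaps the margin-based worker count for the raw CPU count. As a rough illustration of how the two policies differ, here is a sketch; the function names are hypothetical, and only the margin of 8 and the `max(1, ...)` floor come from the diff.

```python
import os


def cpus_with_margin(detected: int, margin: int = 8) -> int:
    """Policy before this change: leave `margin` cores free, never below 1."""
    return max(1, detected - margin)


def cpus_all(detected: int) -> int:
    """Policy after this change: use every detected core."""
    return detected


# os.cpu_count() can return None, hence the `or 1` fallback used in the diff.
detected_cpu_count = os.cpu_count() or 1

for n in (4, 8, 16, 64):
    print(f"detected={n:>2}  with_margin={cpus_with_margin(n):>2}  all={cpus_all(n):>2}")
```

On machines with 9 or fewer cores the old policy collapses to a single worker, which is presumably what the hack is working around; the trade-off is that no cores are left free for other processes.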
20 changes: 20 additions & 0 deletions llmfoundry/utils/__init__.py
@@ -13,6 +13,13 @@
update_batch_size_info)
from llmfoundry.utils.model_download_utils import (
download_from_cache_server, download_from_hf_hub)

from llmfoundry.utils.validation_utils import (
create_om_cfg, token_counts_and_validation, token_counts,
check_HF_datasets, is_hf_dataset_path, is_uc_delta_table,
pandas_processing_fn, integrity_check, convert_text_to_mds,
parse_args, _args_str, plot_hist, dataframe_to_mds)

except ImportError as e:
raise ImportError(
'Please make sure to pip install . to get requirements for llm-foundry.'
@@ -34,4 +41,17 @@
'update_batch_size_info',
'log_config',
'pop_config',
'create_om_cfg',
'token_counts_and_validation',
'token_counts',
'check_HF_datasets',
'is_hf_dataset_path',
'is_uc_delta_table',
'pandas_processing_fn',
'integrity_check',
'convert_text_to_mds',
'parse_args',
'_args_str',
'plot_hist',
'dataframe_to_mds',
]
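
Because this change adds thirteen names to both the import block and `__all__`, the two lists can drift apart. A minimal sketch of a consistency check follows; `missing_exports` and the demo module are hypothetical and do not exist in llm-foundry.

```python
import types
from typing import List


def missing_exports(module: types.ModuleType) -> List[str]:
    """Return names listed in `__all__` that the module does not define."""
    declared = getattr(module, "__all__", [])
    return [name for name in declared if not hasattr(module, name)]


# Stand-in module: two names are defined, one is only declared in __all__.
demo = types.ModuleType("demo_utils")
demo.token_counts = lambda: 0
demo.plot_hist = lambda: None
demo.__all__ = ["token_counts", "plot_hist", "dataframe_to_mds"]

print(missing_exports(demo))  # ['dataframe_to_mds']
```

A check like this could run in a unit test against the real package. It would not flag `_args_str`, which is exported here despite its leading underscore; whether a private-looking name belongs in `__all__` is a separate question for review.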