Variable batch size and LR scheduler #5237

Open
wants to merge 113 commits into
base: master
113 commits
14f2bbe
added assert of torch vs numpy types
bm-synth Feb 9, 2024
796341d
first draft
bm-synth Feb 14, 2024
07aa4b4
reverted to original master
bm-synth Feb 14, 2024
815a789
added metric type accumulate_value_over_samples
bm-synth Feb 14, 2024
28a72e7
pre-commit
bm-synth Feb 14, 2024
e8dbf0b
Merge branch 'master' into distributed_data_analyzer
bm-synth Feb 14, 2024
ec3479f
Merge branch 'distributed_data_analyzer' of github.com:bm-synth/DeepS…
bm-synth Feb 14, 2024
38d7ce6
Update data_analyzer.py
bm-synth Feb 14, 2024
295fba6
added check for single node reduce. added barriers
bm-synth Feb 14, 2024
4144e42
more bug fixes
bm-synth Feb 14, 2024
a1e121c
new iteration, many bug fixes
bm-synth Feb 15, 2024
e045753
bug fixes
bm-synth Feb 15, 2024
3a89116
Merge branch 'master' into distributed_data_analyzer
bm-synth Feb 15, 2024
cdc838c
fixing previous commit
bm-synth Feb 15, 2024
ba34a55
Merge branch 'master' into distributed_data_analyzer
bm-synth Feb 16, 2024
5c07710
pre-commit
bm-synth Feb 16, 2024
87d7686
Merge branch 'distributed_data_analyzer' of github.com:bm-synth/DeepS…
bm-synth Feb 16, 2024
f28e829
recovered master branch
bm-synth Feb 16, 2024
a634787
write sequentially to file
bm-synth Feb 16, 2024
848ffd5
Merge branch 'master' into distributed_data_analyzer
bm-synth Feb 16, 2024
ec59f08
fixes in sequential write
bm-synth Feb 16, 2024
832874c
Merge branch 'distributed_data_analyzer' of github.com:bm-synth/DeepS…
bm-synth Feb 16, 2024
ea0d65f
pre-commit hooks
bm-synth Feb 16, 2024
c6c9bc5
Merge branch 'master' into distributed_data_analyzer
bm-synth Feb 16, 2024
56a9533
added main as example
bm-synth Feb 18, 2024
b4d8654
Merge branch 'distributed_data_analyzer' of github.com:bm-synth/DeepS…
bm-synth Feb 18, 2024
676dc1a
Merge branch 'master' into distributed_data_analyzer
bm-synth Feb 18, 2024
6788af5
Update data_analyzer.py
bm-synth Feb 18, 2024
bd61d9c
first working version. idx files differ
bm-synth Feb 19, 2024
7ac5e45
Merge branch 'distributed_data_analyzer' of github.com:bm-synth/DeepS…
bm-synth Feb 19, 2024
8bf0e63
added missing static function
bm-synth Feb 19, 2024
e5a7eb0
removed/added breaklines to match base code
bm-synth Feb 19, 2024
3b8014f
corrected comment
bm-synth Feb 19, 2024
5a42687
imports
bm-synth Feb 19, 2024
cdaad36
removed main
bm-synth Feb 19, 2024
b3d4062
reverted main
bm-synth Feb 19, 2024
7cabfa2
bug fix in sample calculation
bm-synth Feb 19, 2024
62f68dd
added worker_an and num_worker to kwargs
bm-synth Feb 19, 2024
6d35e45
removed dist.initialize() from DataAnalyzer.run_map_reduce
bm-synth Feb 19, 2024
be91d37
first iteration
bm-synth Feb 20, 2024
5fd0546
updated with add_items
bm-synth Feb 21, 2024
e943aaa
Merge branch 'master' into serial_data_analyzer
bm-synth Feb 22, 2024
4f23873
first iteration, testing
bm-synth Mar 7, 2024
f732a8f
bug fixes with batch sizes vs metrics
bm-synth Mar 7, 2024
550ab31
deepspeed_io support
bm-synth Mar 7, 2024
b7f2520
bug fixes
bm-synth Mar 7, 2024
6452d44
better comment
bm-synth Mar 8, 2024
4f9ff37
merge conflicts
bm-synth Mar 8, 2024
42accd1
recovered files from master
bm-synth Mar 8, 2024
cad0221
bug fixes on LR scheduler reset
bm-synth Mar 8, 2024
f6c5c18
master in line with remote
bm-synth Mar 8, 2024
8c88567
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Mar 8, 2024
c01667e
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Mar 8, 2024
b1a19de
removed 2 files that are not part of commit
bm-synth Mar 8, 2024
ae70b7d
added files removed accidentally
bm-synth Mar 8, 2024
57e491e
Merge branch 'variable_batch_size_and_lr' of github.com:bm-synth/Deep…
bm-synth Mar 8, 2024
efafffc
pipeline parallelism missing. all good
bm-synth Mar 8, 2024
3658080
pre-commit hooks
bm-synth Mar 8, 2024
8cbf601
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Mar 10, 2024
713f87b
fixed collate_fn to include padding
bm-synth Mar 12, 2024
2a9d43c
Merge branch 'variable_batch_size_and_lr' of github.com:bm-synth/Deep…
bm-synth Mar 12, 2024
ed7d8ea
attention head by hand
bm-synth Mar 12, 2024
b516356
fixed seq lens computation
bm-synth Mar 12, 2024
8b41845
pipeline parallelism for enforced max seq size
bm-synth Mar 12, 2024
ce85b9d
pipeline parallelism
bm-synth Mar 13, 2024
82e3dd2
renamed file
bm-synth Mar 13, 2024
d43f981
renamed
bm-synth Mar 13, 2024
0ae1dc8
train_batch_size_per_gpu >1
bm-synth Mar 13, 2024
c50883f
bug fixes
bm-synth Mar 13, 2024
26095f3
fixed scheduled step scaling
bm-synth Mar 13, 2024
167f484
pre-commit hooks
bm-synth Mar 13, 2024
814f1cb
pre-commit hooks
bm-synth Mar 13, 2024
02cb597
batching config
bm-synth Mar 13, 2024
5e8b757
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Mar 13, 2024
c23f34c
final polishing
bm-synth Mar 13, 2024
43ca8d1
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Mar 13, 2024
af8b068
pre-commit hooks
bm-synth Mar 13, 2024
1aabcec
added line for support contact
bm-synth Mar 13, 2024
1ff409a
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Mar 14, 2024
9998c2f
Merge branch 'variable_batch_size_and_lr' of github.com:bm-synth/Deep…
bm-synth Mar 14, 2024
4fd6303
sample_seqlen_fn
bm-synth Mar 14, 2024
c0de7e6
removed sample_seqlen_fn, should be done in parallel somewhere else
bm-synth Mar 14, 2024
427c70d
minor bug fixes
bm-synth Mar 15, 2024
b072fbb
read seqlens from DataAnalyzer output
bm-synth Mar 15, 2024
d121be5
flatten
bm-synth Mar 15, 2024
6e35d27
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Mar 15, 2024
b32185c
use DistributedDataAnalyzer to generate seqlen metric files if missing
bm-synth Mar 16, 2024
d67d80a
Merge branch 'variable_batch_size_and_lr' of github.com:bm-synth/Deep…
bm-synth Mar 16, 2024
60e52d6
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Mar 23, 2024
0a0f890
Merge branch 'master' into variable_batch_size_and_lr
loadams Mar 27, 2024
8b96f41
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Apr 2, 2024
c4f9f38
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Apr 8, 2024
bad2ce6
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Apr 10, 2024
60881a3
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Apr 16, 2024
3b54d4d
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Apr 17, 2024
58a1a36
Merge branch 'master' into variable_batch_size_and_lr
conglongli Apr 18, 2024
ff77acf
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Apr 19, 2024
f207398
Merge branch 'master' into variable_batch_size_and_lr
loadams Apr 23, 2024
ab86113
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Apr 29, 2024
e1d3cc7
Merge branch 'master' into variable_batch_size_and_lr
bm-synth May 6, 2024
1c6eaf5
Merge branch 'master' into variable_batch_size_and_lr
bm-synth May 7, 2024
61a9f32
Merge branch 'master' into variable_batch_size_and_lr
bm-synth May 21, 2024
a426adb
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Aug 10, 2024
aa95494
Merge branch 'master' into variable_batch_size_and_lr
tjruwase Aug 28, 2024
391bfb0
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Aug 30, 2024
383f976
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Aug 30, 2024
be9d2f5
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Dec 3, 2024
4cdcaf8
Merge branch 'master' into variable_batch_size_and_lr
loadams Dec 5, 2024
c196a66
Merge branch 'master' into variable_batch_size_and_lr
loadams Dec 10, 2024
4f4de2e
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Dec 10, 2024
361b2ef
Merge branch 'master' into variable_batch_size_and_lr
loadams Jan 6, 2025
a8b25f4
Merge branch 'master' into variable_batch_size_and_lr
loadams Jan 6, 2025
54b9786
Merge branch 'master' into variable_batch_size_and_lr
bm-synth Jan 7, 2025
1 change: 0 additions & 1 deletion deepspeed/runtime/config.py
@@ -800,7 +800,6 @@ def __init__(self, config: Union[str, dict], mpu=None, mesh_device=None):

     def _initialize_params(self, param_dict):
         self.train_batch_size = get_train_batch_size(param_dict)
-        #print(f"beginning get_train_batch_size = {get_train_batch_size}")
         self.train_micro_batch_size_per_gpu = get_train_micro_batch_size_per_gpu(param_dict)
         self.gradient_accumulation_steps = get_gradient_accumulation_steps(param_dict)
         self.steps_per_print = get_steps_per_print(param_dict)
25 changes: 23 additions & 2 deletions deepspeed/runtime/data_pipeline/config.py
@@ -20,7 +20,6 @@ def get_data_efficiency_config(param_dict):
     sub_param_dict = param_dict[DATA_EFFICIENCY]
     output[DATA_SAMPLING] = get_data_sampling(sub_param_dict)
     output[DATA_ROUTING] = get_data_routing(sub_param_dict)
-
     return output


@@ -43,11 +42,13 @@ def get_data_sampling(param_dict):
     output[DATA_SAMPLING_ENABLED] = get_data_sampling_enabled(param_dict)
     output[DATA_SAMPLING_NUM_EPOCHS] = get_data_sampling_num_epochs(param_dict)
     output[DATA_SAMPLING_NUM_WORKERS] = get_data_sampling_num_workers(param_dict)
+    output[DATA_SAMPLING_PIN_MEMORY] = bool(
+        output.get(param_dict[DATA_SAMPLING][DATA_SAMPLING_PIN_MEMORY], DATA_SAMPLING_PIN_MEMORY_DEFAULT))
     if DATA_SAMPLING not in param_dict.keys():
         param_dict[DATA_SAMPLING] = {}
     sub_param_dict = param_dict[DATA_SAMPLING]
     output[CURRICULUM_LEARNING] = get_curriculum_learning(sub_param_dict)
-
+    output[DYNAMIC_BATCHING] = get_dynamic_batching(sub_param_dict)
     return output
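
A note on the `pin_memory` lookup in the hunk above: as written it indexes `param_dict[DATA_SAMPLING]` before the `DATA_SAMPLING not in param_dict` guard runs, and it passes the looked-up value as the key to `output.get`. It therefore appears to either raise `KeyError` or fall back to the default. Below is a minimal sketch of a tolerant lookup. `get_data_sampling_pin_memory` is a hypothetical helper that is not part of this PR, and the `"data_sampling"` value for `DATA_SAMPLING` is an assumption because that constant is not shown in this diff.

```python
# DATA_SAMPLING is not shown in this diff; its value here is an assumption.
DATA_SAMPLING = "data_sampling"
# The two constants below are copied from the constants.py hunk of this PR.
DATA_SAMPLING_PIN_MEMORY = "pin_memory"
DATA_SAMPLING_PIN_MEMORY_DEFAULT = False


def get_data_sampling_pin_memory(param_dict):
    """Hypothetical helper: read pin_memory without raising on missing keys."""
    # Fall back to an empty dict when the data_sampling section is absent.
    sub_param_dict = param_dict.get(DATA_SAMPLING, {})
    # Look the flag up by its key name, defaulting when it is not set.
    return bool(sub_param_dict.get(DATA_SAMPLING_PIN_MEMORY, DATA_SAMPLING_PIN_MEMORY_DEFAULT))


no_section = get_data_sampling_pin_memory({})
flag_set = get_data_sampling_pin_memory({"data_sampling": {"pin_memory": True}})
```

With this shape, a config that omits `data_sampling` or `pin_memory` yields the default instead of an exception.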


@@ -87,6 +88,26 @@ def get_curriculum_learning(param_dict):
     return output


+def get_dynamic_batching(param_dict):
+    output = copy.copy(param_dict.get(DYNAMIC_BATCHING, {}))
+    output[DYNAMIC_BATCHING_ENABLED] = bool(output.get(DYNAMIC_BATCHING_ENABLED, DYNAMIC_BATCHING_ENABLED_DEFAULT))
+    output[DYNAMIC_BATCHING_LR_SCALING_METHOD] = str(
+        output.get(DYNAMIC_BATCHING_LR_SCALING_METHOD, DYNAMIC_BATCHING_LR_SCALING_METHOD_DEFAULT))
+    output[DYNAMIC_BATCHING_MIN_BATCH_SIZE] = int(
+        output.get(DYNAMIC_BATCHING_MIN_BATCH_SIZE, DYNAMIC_BATCHING_MIN_BATCH_SIZE_DEFAULT))
+    output[DYNAMIC_BATCHING_MAX_BATCH_SIZE] = int(output[DYNAMIC_BATCHING_MAX_BATCH_SIZE]) \
+        if DYNAMIC_BATCHING_MAX_BATCH_SIZE in output.keys() \
+        else DYNAMIC_BATCHING_MAX_BATCH_SIZE_DEFAULT
+    output[DYNAMIC_BATCHING_SAMPLES_ORDER] = str(
+        output.get(DYNAMIC_BATCHING_SAMPLES_ORDER, DYNAMIC_BATCHING_SAMPLES_ORDER_DEFAULT))
+    if output[DYNAMIC_BATCHING_ENABLED]:
+        assert DYNAMIC_BATCHING_MAX_TOKENS_PER_BATCH in output.keys(
+        ), f"Dynamic batching is enabled, so {DYNAMIC_BATCHING_MAX_TOKENS_PER_BATCH} must be specified"
+        output[DYNAMIC_BATCHING_MAX_TOKENS_PER_BATCH] = int(output[DYNAMIC_BATCHING_MAX_TOKENS_PER_BATCH])
+    output[DYNAMIC_BATCHING_VERBOSE] = bool(output.get(DYNAMIC_BATCHING_VERBOSE, False))
+    return output
+
+
 def get_curriculum_learning_enabled(param_dict):
     if CURRICULUM_LEARNING in param_dict.keys():
         return get_scalar_param(param_dict[CURRICULUM_LEARNING], CURRICULUM_LEARNING_ENABLED,
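
To illustrate what `max_tokens_per_batch`, `min_batch_size` and `max_batch_size` from `get_dynamic_batching` could control, here is a rough sketch of greedy token-budget batching. It is an assumption about the intent, not the PR's implementation: `batch_by_token_budget` is a hypothetical name, the budget is taken as the plain sum of sequence lengths (padding is ignored), and batches below the minimum size are simply dropped.

```python
def batch_by_token_budget(seqlens, max_tokens_per_batch, min_batch_size=1, max_batch_size=None):
    """Hypothetical sketch: greedily group sample indices under a token budget."""
    batches, current, current_tokens = [], [], 0
    for idx, seqlen in enumerate(seqlens):
        if seqlen > max_tokens_per_batch:
            raise ValueError(f"sample {idx} has {seqlen} tokens, above the budget")
        over_budget = current_tokens + seqlen > max_tokens_per_batch
        full = max_batch_size is not None and len(current) >= max_batch_size
        # Close the current batch when adding this sample would break a limit.
        if current and (over_budget or full):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += seqlen
    if current:
        batches.append(current)
    # Assumed policy: discard batches smaller than the configured minimum.
    return [batch for batch in batches if len(batch) >= min_batch_size]


example_batches = batch_by_token_budget([5, 3, 4, 6, 2], max_tokens_per_batch=8)
```

Because the number of samples per batch now varies with sequence length, the effective batch size changes from step to step, which is what motivates the LR scaling option in the same config section.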
20 changes: 20 additions & 0 deletions deepspeed/runtime/data_pipeline/constants.py
@@ -22,6 +22,8 @@
 DATA_SAMPLING_NUM_EPOCHS_DEFAULT = 1000
 DATA_SAMPLING_NUM_WORKERS = "num_workers"
 DATA_SAMPLING_NUM_WORKERS_DEFAULT = 0
+DATA_SAMPLING_PIN_MEMORY = "pin_memory"
+DATA_SAMPLING_PIN_MEMORY_DEFAULT = False

 #########################################
 # Data efficiency - Data Sampling - Curriculum Learning
@@ -62,6 +64,24 @@
 CURRICULUM_LEARNING_DATA_CLUSTER_CURRENT_POSITION = "data_cluster_current_position"
 CURRICULUM_LEARNING_NP_RNG_STATE = "np_rng_state"

+#########################################
+# Data efficiency - Dynamic batching and LR scaling
+#########################################
+DYNAMIC_BATCHING = "dynamic_batching"
+DYNAMIC_BATCHING_ENABLED = "enabled"
+DYNAMIC_BATCHING_ENABLED_DEFAULT = False
+DYNAMIC_BATCHING_SEQLEN_SAMPLE_TO_METRIC_PATH = "seqlen_sample_to_metric_path"
+DYNAMIC_BATCHING_LR_SCALING_METHOD = "lr_scaling_method"  # "linear" / "sqrt" / "none"
+DYNAMIC_BATCHING_LR_SCALING_METHOD_DEFAULT = "linear"
+DYNAMIC_BATCHING_MIN_BATCH_SIZE = "min_batch_size"
+DYNAMIC_BATCHING_MIN_BATCH_SIZE_DEFAULT = 1
+DYNAMIC_BATCHING_MAX_BATCH_SIZE = "max_batch_size"
+DYNAMIC_BATCHING_MAX_BATCH_SIZE_DEFAULT = None
+DYNAMIC_BATCHING_SAMPLES_ORDER = "samples_order"  # "random" / "order" / "default"
+DYNAMIC_BATCHING_SAMPLES_ORDER_DEFAULT = "dataloader"  # "random" / "order" / "dataloader"
+DYNAMIC_BATCHING_MAX_TOKENS_PER_BATCH = "max_tokens_per_batch"
+DYNAMIC_BATCHING_VERBOSE = "verbose"
+
 #########################################
 # Curriculum Learning legacy implementation
 #########################################
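
The `lr_scaling_method` values listed above (`"linear"`, `"sqrt"`, `"none"`) suggest the common rules for adjusting the learning rate when the batch size changes. The sketch below works under that assumption only: `scale_lr` and its parameters are hypothetical names, and the PR may define the reference batch size or the scaling itself differently.

```python
import math


def scale_lr(base_lr, base_batch_size, batch_size, method="linear"):
    """Hypothetical sketch: rescale a learning rate for a different batch size."""
    ratio = batch_size / base_batch_size
    if method == "linear":
        # Linear rule: the learning rate grows in proportion to the batch size.
        return base_lr * ratio
    if method == "sqrt":
        # Square-root rule: the learning rate grows with the root of the ratio.
        return base_lr * math.sqrt(ratio)
    if method == "none":
        return base_lr
    raise ValueError(f"unknown lr scaling method: {method}")


linear_lr = scale_lr(1e-3, base_batch_size=32, batch_size=64, method="linear")
sqrt_lr = scale_lr(1e-3, base_batch_size=16, batch_size=64, method="sqrt")
```

In this sketch a batch twice the reference size doubles the learning rate under `"linear"`, while a batch four times the reference size doubles it under `"sqrt"`.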
Original file line number Diff line number Diff line change
@@ -640,7 +640,7 @@ def run_map_reduce(self):
                         metric_to_samples_dict[value.item()] = []
                     metric_to_samples_dict[value.item()].append(sample.item())

-            # index_to_metric and index_to_sample serialize a dicitonary from metric to samples
+            # index_to_metric and index_to_sample serialize a dictionary from metric to samples
             # index_to_metric stores a key per row, index_to_sample stores the values per row
             values = [torch.tensor([x]) for x in metric_to_samples_dict.keys()]
             samples = [torch.tensor(metric_to_samples_dict[x]) for x in metric_to_samples_dict.keys()]
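
The comment in this hunk describes two row-aligned structures: one row per metric value, and a matching row holding the sample ids that share that value. A plain-Python sketch of the same layout follows. It uses lists instead of tensors, and `group_samples_by_metric` is a hypothetical name rather than a function in the PR.

```python
from collections import defaultdict


def group_samples_by_metric(sample_metric_pairs):
    """Hypothetical sketch: build row-aligned metric and sample lists."""
    metric_to_samples = defaultdict(list)
    for sample_id, metric in sample_metric_pairs:
        metric_to_samples[metric].append(sample_id)
    # Row i of `metrics` is the key; row i of `samples` holds its sample ids.
    metrics = list(metric_to_samples.keys())
    samples = [metric_to_samples[m] for m in metrics]
    return metrics, samples


metrics, samples = group_samples_by_metric([(0, 128), (1, 64), (2, 128), (3, 32)])
```

Rows keep the order in which each metric value was first seen, so the two lists can be written to separate files and read back row by row.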