Normal baselines #618
base: main
Changes from 35 commits
Seven files with large diffs are not rendered by default.
New file (launch script), @@ -0,0 +1,40 @@:

#!/usr/bin/env bash

set -ex

NUM_NODES=8

gantry run \
  --workspace ai2/OLMo-training \
  --task-name llamaish1-normal \
  --description "OLMo small - 1B - Llamaish Normal Weka" \
  --priority urgent \
  --preemptible \
  --beaker-image petew/olmo-torch23-gantry \
  --cluster ai2/jupiter-cirrascale-2 \
  --gpus 8 \
  --replicas "${NUM_NODES}" \
  --leader-selection \
  --host-networking \
  --budget ai2/oe-training \
  --no-nfs \
  --propagate-failure \
  --synchronized-start-timeout 20m \
  --env LOG_FILTER_TYPE=local_rank0_only \
  --env OMP_NUM_THREADS=8 \
  --env OLMO_TASK=model \
  --env-secret WANDB_API_KEY=AKSHITAB_WANDB_API_KEY \
  --env-secret AWS_ACCESS_KEY_ID=AKSHITAB_AWS_ACCESS_KEY_ID \
  --env-secret AWS_SECRET_ACCESS_KEY=AKSHITAB_AWS_SECRET_ACCESS_KEY \
  --env R2_PROFILE=R2 \
  --env S3_PROFILE=S3 \
  --env WEKA_PROFILE=WEKA \
  --env-secret AWS_CONFIG=PETEW_AWS_CONFIG \
  --env-secret AWS_CREDENTIALS=PETEW_AWS_CREDENTIALS \
  --env-secret R2_ENDPOINT_URL=R2_ENDPOINT_URL \
  --env-secret WEKA_ENDPOINT_URL=WEKA_ENDPOINT_URL \
  --shared-memory 10GiB \
  --venv base \
  --yes \
  --timeout=-1 \
  -- /bin/bash -c "scripts/beaker/llamaish1-normal.sh \$BEAKER_LEADER_REPLICA_HOSTNAME ${NUM_NODES} \$BEAKER_REPLICA_RANK"
New file: scripts/beaker/llamaish1-normal.sh, @@ -0,0 +1,52 @@:

#!/usr/bin/env bash
set -exuo pipefail
IFS=$'\n\t'

BEAKER_LEADER_REPLICA_HOSTNAME=$1
shift

NUM_NODES=$1
shift

BEAKER_REPLICA_RANK=$1
shift

# Warm HF cache
mkdir -p /root/.cache
pushd /root/.cache
curl "https://storage.googleapis.com/hf-cache/huggingface_cache_v4.tar.gz" | tar --keep-newer-files -xzf -
popd
export HF_DATASETS_OFFLINE=1

# Move AWS credentials from env to relevant files
mkdir -p ~/.aws
printenv AWS_CONFIG > ~/.aws/config
printenv AWS_CREDENTIALS > ~/.aws/credentials

export EXPERIMENT=llamaish1-normal-final

torchrun \
  --nnodes ${NUM_NODES}:${NUM_NODES} \
  --nproc-per-node 8 \
  --rdzv_id=12347 \
  --rdzv_backend=static \
  --rdzv_endpoint=$BEAKER_LEADER_REPLICA_HOSTNAME:29400 \
  --node_rank=$BEAKER_REPLICA_RANK \
  --rdzv_conf="read_timeout=420" \
  scripts/train.py \
  configs/llamaish1-normal-weka.yaml \
  --run_name=$EXPERIMENT \
  --wandb.name=$EXPERIMENT \
  --wandb.group=$EXPERIMENT \
  --fsdp.wrapping_strategy=by_block_and_size \
  --fsdp.sharding_strategy=SHARD_GRAD_OP \
  --save_folder=runs/ \
  --device_train_microbatch_size=4 \
  --global_train_batch_size=512 \
  --save_interval=250 \
  --eval_interval=250 \
  --optimizer.metrics_log_interval=1 \
  --save_overwrite \
  --save_num_checkpoints_to_keep=3 \
  --load_path=s3://ai2-llm/checkpoints/OLMo-small/llamaish1-normal-shard/step2000

Review comment on --global_train_batch_size=512: "this is the default in the config, right?" (a suggested change was attached)
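A quick sanity check on the batch math implied by these flags (my arithmetic, not from the PR): each forward/backward pass spans 8 nodes x 8 GPUs x a microbatch of 4 = 256 samples, so a global batch of 512 implies 2 gradient-accumulation passes per optimizer step:

    # Accumulation passes = global batch / (nodes * GPUs per node * microbatch)
    echo $(( 512 / (8 * 8 * 4) ))   # prints 2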
New file (launch script), @@ -0,0 +1,40 @@:

#!/usr/bin/env bash

set -ex

NUM_NODES=32

gantry run \
  --workspace ai2/OLMo-training \
  --task-name llamaish7-normal \
  --description "OLMo medium - 7B - Llamaish Normal" \
  --priority urgent \
  --preemptible \
  --beaker-image petew/olmo-torch23-gantry \
  --cluster ai2/jupiter-cirrascale-2 \
  --gpus 8 \
  --replicas "${NUM_NODES}" \
  --leader-selection \
  --host-networking \
  --budget ai2/oe-training \
  --no-nfs \
  --propagate-failure \
  --synchronized-start-timeout 60m \
  --env LOG_FILTER_TYPE=local_rank0_only \
  --env OMP_NUM_THREADS=8 \
  --env OLMO_TASK=model \
  --env-secret WANDB_API_KEY=AKSHITAB_WANDB_API_KEY \
  --env-secret AWS_ACCESS_KEY_ID=AKSHITAB_AWS_ACCESS_KEY_ID \
  --env-secret AWS_SECRET_ACCESS_KEY=AKSHITAB_AWS_SECRET_ACCESS_KEY \
  --env R2_PROFILE=R2 \
  --env S3_PROFILE=S3 \
  --env WEKA_PROFILE=WEKA \
  --env-secret AWS_CONFIG=PETEW_AWS_CONFIG \
  --env-secret AWS_CREDENTIALS=PETEW_AWS_CREDENTIALS \
  --env-secret R2_ENDPOINT_URL=R2_ENDPOINT_URL \
  --env-secret WEKA_ENDPOINT_URL=WEKA_ENDPOINT_URL \
  --shared-memory 10GiB \
  --venv base \
  --yes \
  --timeout=-1 \
  -- /bin/bash -c "scripts/beaker/llamaish7-normal.sh \$BEAKER_LEADER_REPLICA_HOSTNAME ${NUM_NODES} \$BEAKER_REPLICA_RANK"
New file: scripts/beaker/llamaish7-normal.sh, @@ -0,0 +1,53 @@:

#!/usr/bin/env bash
set -exuo pipefail
IFS=$'\n\t'

BEAKER_LEADER_REPLICA_HOSTNAME=$1
shift

NUM_NODES=$1
shift

BEAKER_REPLICA_RANK=$1
shift

# Warm HF cache
mkdir -p /root/.cache
pushd /root/.cache
curl "https://storage.googleapis.com/hf-cache/huggingface_cache_v4.tar.gz" | tar --keep-newer-files -xzf -
popd
export HF_DATASETS_OFFLINE=1

# Move AWS credentials from env to relevant files
mkdir -p ~/.aws
printenv AWS_CONFIG > ~/.aws/config
printenv AWS_CREDENTIALS > ~/.aws/credentials

export EXPERIMENT=llamaish7-normal-final

torchrun \
  --nnodes ${NUM_NODES}:${NUM_NODES} \
  --nproc-per-node 8 \
  --rdzv_id=12347 \
  --rdzv_backend=static \
  --rdzv_endpoint=$BEAKER_LEADER_REPLICA_HOSTNAME:29400 \
  --node_rank=$BEAKER_REPLICA_RANK \
  --rdzv_conf="read_timeout=420" \
  scripts/train.py \
  configs/llamaish7-normal-s3.yaml \
  --run_name=$EXPERIMENT \
  --wandb.name=$EXPERIMENT \
  --wandb.group=$EXPERIMENT \
  --fsdp.wrapping_strategy=by_block_and_size \
  --fsdp.sharding_strategy=SHARD_GRAD_OP \
  --save_folder=runs/ \
  --activation_checkpointing=fine_grained \
  --device_train_microbatch_size=2 \
  --global_train_batch_size=1024 \
  --save_interval=250 \
  --eval_interval=250 \
  --optimizer.metrics_log_interval=1 \
  --save_overwrite \
  --save_num_checkpoints_to_keep=3 \
  --data.num_workers=64 \
  --load_path=s3://ai2-llm/checkpoints/OLMo-medium/llamaish7-normal/step2000

Review comments:
- On --save_folder=runs/: "nit: I don't think there's any need to override this" (a suggested change was attached)
- On --global_train_batch_size=1024: "same here" (a suggested change was attached)
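Since both runs resume from a --load_path checkpoint, a cheap pre-flight step is to confirm the checkpoint prefix exists before burning a multi-node allocation (a hypothetical check, assuming the AWS CLI is installed and credentials/profiles are configured as above):

    aws s3 ls s3://ai2-llm/checkpoints/OLMo-medium/llamaish7-normal/step2000/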