bug: Crash due to tensor size mismatch at the end of an epoch #411
Comments
@HarikrishnanBalagopal please recreate the issue if you are able to reproduce it in the main branch.
The bug doesn't seem to happen with the

accelerate launch \
--use_fsdp \
--fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
--fsdp_forward_prefetch=false \
--fsdp_offload_params=false \
--fsdp_sharding_strategy=FULL_SHARD \
--fsdp_state_dict_type=FULL_STATE_DICT \
--fsdp_cpu_ram_efficient_loading=true \
--fsdp_sync_module_states=true \
--num_processes=8 \
--dynamo_backend="no" \
--machine_rank="${RANK}" \
--main_process_ip="${MASTER_ADDR}" \
--main_process_port="${MASTER_PORT}" \
--mixed_precision="no" \
--num_machines="${WORLD_SIZE}" \
--rdzv_backend="static" \
--same_network \
-m tuning.sft_trainer \
--adam_beta1="0.9" \
--adam_beta2="0.98" \
--adam_epsilon="1e-10" \
--aim_repo="${AIMSTACK_DB}" \
--data_config_path="dataset_config.yaml" \
--dataloader_drop_last="true" \
--evaluation_strategy="no" \
--experiment="train-03577468-9d72-4c2a-baaf-3336242d597f" \
--gradient_accumulation_steps="4" \
--gradient_checkpointing="true" \
--learning_rate="1e-05" \
--log_level="debug" \
--logging_steps="5" \
--logging_strategy="steps" \
--lr_scheduler_type="cosine" \
--max_steps="2250" \
--model_name_or_path="/modeling/models/granite-20b-base-ept-merged-70-30" \
--optim="adamw_torch" \
--output_dir="/modeling/checkpoints/train-03577468-9d72-4c2a-baaf-3336242d597f" \
--packing="False" \
--per_device_train_batch_size="8" \
--save_steps="250" \
--save_strategy="steps" \
--split_batches="true" \
--torch_dtype="bfloat16" \
--tracker="aim" \
--use_flash_attn="true" \
--use_reentrant="true" \
--warmup_ratio="0.1" \
--warmup_steps="200" \
--weight_decay="0.1"

{"data": {"epoch": 0.16, "step": 5, "timestamp": "2024-12-18T08:20:19.540051", "value": 0.559}, "name": "training_loss"}
{"data": {"epoch": 0.32, "step": 10, "timestamp": "2024-12-18T08:20:44.335258", "value": 0.5034}, "name": "training_loss"}
{"data": {"epoch": 0.48, "step": 15, "timestamp": "2024-12-18T08:21:10.139879", "value": 0.5204}, "name": "training_loss"}
{"data": {"epoch": 0.64, "step": 20, "timestamp": "2024-12-18T08:21:35.006164", "value": 0.4812}, "name": "training_loss"}
{"data": {"epoch": 0.8, "step": 25, "timestamp": "2024-12-18T08:22:00.369152", "value": 0.4855}, "name": "training_loss"}
{"data": {"epoch": 0.96, "step": 30, "timestamp": "2024-12-18T08:22:22.809873", "value": 0.4168}, "name": "training_loss"}
{"data": {"epoch": 1.12, "step": 35, "timestamp": "2024-12-18T08:22:47.667162", "value": 0.4999}, "name": "training_loss"}
{"data": {"epoch": 1.28, "step": 40, "timestamp": "2024-12-18T08:23:11.652898", "value": 0.4688}, "name": "training_loss"}
{"data": {"epoch": 1.44, "step": 45, "timestamp": "2024-12-18T08:23:34.835309", "value": 0.4106}, "name": "training_loss"}
{"data": {"epoch": 1.6, "step": 50, "timestamp": "2024-12-18T08:23:58.842765", "value": 0.4867}, "name": "training_loss"}
{"data": {"epoch": 1.76, "step": 55, "timestamp": "2024-12-18T08:24:23.748723", "value": 0.4496}, "name": "training_loss"}
{"data": {"epoch": 1.92, "step": 60, "timestamp": "2024-12-18T08:24:48.970410", "value": 0.4115}, "name": "training_loss"}
{"data": {"epoch": 2.08, "step": 65, "timestamp": "2024-12-18T08:25:12.456075", "value": 0.3914}, "name": "training_loss"}
{"data": {"epoch": 2.24, "step": 70, "timestamp": "2024-12-18T08:25:38.317406", "value": 0.437}, "name": "training_loss"}
{"data": {"epoch": 2.4, "step": 75, "timestamp": "2024-12-18T08:26:02.539473", "value": 0.3401}, "name": "training_loss"}
{"data": {"epoch": 2.56, "step": 80, "timestamp": "2024-12-18T08:26:28.991092", "value": 0.3629}, "name": "training_loss"}
{"data": {"epoch": 2.72, "step": 85, "timestamp": "2024-12-18T08:26:54.300396", "value": 0.4046}, "name": "training_loss"}
{"data": {"epoch": 2.88, "step": 90, "timestamp": "2024-12-18T08:27:18.536794", "value": 0.3466}, "name": "training_loss"}
{"data": {"epoch": 3.04, "step": 95, "timestamp": "2024-12-18T08:27:42.192288", "value": 0.3597}, "name": "training_loss"}
{"data": {"epoch": 3.2, "step": 100, "timestamp": "2024-12-18T08:28:04.970346", "value": 0.3293}, "name": "training_loss"}
{"data": {"epoch": 3.36, "step": 105, "timestamp": "2024-12-18T08:28:29.373667", "value": 0.3678}, "name": "training_loss"}
{"data": {"epoch": 3.52, "step": 110, "timestamp": "2024-12-18T08:28:54.527638", "value": 0.3639}, "name": "training_loss"}
{"data": {"epoch": 3.68, "step": 115, "timestamp": "2024-12-18T08:29:20.037901", "value": 0.3121}, "name": "training_loss"}
{"data": {"epoch": 3.84, "step": 120, "timestamp": "2024-12-18T08:29:43.694220", "value": 0.3009}, "name": "training_loss"}
{"data": {"epoch": 4.0, "step": 125, "timestamp": "2024-12-18T08:30:08.959777", "value": 0.355}, "name": "training_loss"}
{"data": {"epoch": 4.16, "step": 130, "timestamp": "2024-12-18T08:30:34.549892", "value": 0.3049}, "name": "training_loss"}
{"data": {"epoch": 4.32, "step": 135, "timestamp": "2024-12-18T08:30:57.256689", "value": 0.2938}, "name": "training_loss"}
{"data": {"epoch": 4.48, "step": 140, "timestamp": "2024-12-18T08:31:21.639842", "value": 0.3258}, "name": "training_loss"}
{"data": {"epoch": 4.64, "step": 145, "timestamp": "2024-12-18T08:31:48.985025", "value": 0.3153}, "name": "training_loss"}
{"data": {"epoch": 4.8, "step": 150, "timestamp": "2024-12-18T08:32:13.743223", "value": 0.2664}, "name": "training_loss"}
{"data": {"epoch": 4.96, "step": 155, "timestamp": "2024-12-18T08:32:37.979222", "value": 0.3321}, "name": "training_loss"}
{"data": {"epoch": 5.12, "step": 160, "timestamp": "2024-12-18T08:33:02.821396", "value": 0.2714}, "name": "training_loss"}
{"data": {"epoch": 5.28, "step": 165, "timestamp": "2024-12-18T08:33:26.692441", "value": 0.2762}, "name": "training_loss"}
{"data": {"epoch": 5.44, "step": 170, "timestamp": "2024-12-18T08:33:49.990244", "value": 0.2638}, "name": "training_loss"}
{"data": {"epoch": 5.6, "step": 175, "timestamp": "2024-12-18T08:34:14.315508", "value": 0.2657}, "name": "training_loss"}
{"data": {"epoch": 5.76, "step": 180, "timestamp": "2024-12-18T08:34:39.632194", "value": 0.2673}, "name": "training_loss"}
{"data": {"epoch": 5.92, "step": 185, "timestamp": "2024-12-18T08:35:04.150179", "value": 0.25}, "name": "training_loss"}
{"data": {"epoch": 6.08, "step": 190, "timestamp": "2024-12-18T08:35:29.809107", "value": 0.2442}, "name": "training_loss"}
{"data": {"epoch": 6.24, "step": 195, "timestamp": "2024-12-18T08:35:54.034458", "value": 0.2177}, "name": "training_loss"}
{"data": {"epoch": 6.4, "step": 200, "timestamp": "2024-12-18T08:36:19.598923", "value": 0.2299}, "name": "training_loss"}
{"data": {"epoch": 6.56, "step": 205, "timestamp": "2024-12-18T08:36:42.592159", "value": 0.1983}, "name": "training_loss"}
{"data": {"epoch": 6.72, "step": 210, "timestamp": "2024-12-18T08:37:07.855585", "value": 0.2221}, "name": "training_loss"}
{"data": {"epoch": 6.88, "step": 215, "timestamp": "2024-12-18T08:37:33.117416", "value": 0.2615}, "name": "training_loss"}
{"data": {"epoch": 7.04, "step": 220, "timestamp": "2024-12-18T08:37:57.805831", "value": 0.232}, "name": "training_loss"}
{"data": {"epoch": 7.2, "step": 225, "timestamp": "2024-12-18T08:38:24.049801", "value": 0.1917}, "name": "training_loss"}
{"data": {"epoch": 7.36, "step": 230, "timestamp": "2024-12-18T08:38:48.742201", "value": 0.2145}, "name": "training_loss"}
{"data": {"epoch": 7.52, "step": 235, "timestamp": "2024-12-18T08:39:13.978622", "value": 0.1829}, "name": "training_loss"}
{"data": {"epoch": 7.68, "step": 240, "timestamp": "2024-12-18T08:39:36.728941", "value": 0.1867}, "name": "training_loss"}
{"data": {"epoch": 7.84, "step": 245, "timestamp": "2024-12-18T08:40:01.621554", "value": 0.1921}, "name": "training_loss"}
{"data": {"epoch": 8.0, "step": 250, "timestamp": "2024-12-18T08:40:26.636845", "value": 0.1672}, "name": "training_loss"} |
@kmehant In this case, the number of gradient accumulation steps is less than the number of steps in an epoch, isn't it?
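For reference, a rough check from the loss log above (a sketch; it assumes the reported epoch fraction is proportional to the optimizer step count):

```python
# Back-of-the-envelope check from the log above (assumption: epoch fraction
# scales linearly with the optimizer step, e.g. step 250 <-> epoch 8.0).
steps_per_epoch = 250 / 8.0           # ~31.25 optimizer steps per epoch
gradient_accumulation_steps = 4       # from the launch command above
print(gradient_accumulation_steps < steps_per_epoch)  # True
```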
@ashokponkumar let me summarize. There were 2 issues.
Hari confirms that issue (2) is not happening with the
I was looking at issue (1), which is being actively looked at by the HF team. There is a rewrite of
Sure. Under what conditions does issue (1) happen?
Updated my comment; there seems to be some lag.
@ashokponkumar @HarikrishnanBalagopal Issue (1) is also gone now with
Reopened this issue. We should put a minimum bound of
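A hypothetical sketch of such a guard (the exact bound proposed above is not spelled out; this assumes it relates the gradient accumulation steps to the number of steps available in an epoch, as raised earlier):

```python
# Hypothetical guard, not the repo's actual code: fail fast if an epoch cannot
# supply at least one full gradient-accumulation cycle.
def check_min_steps_per_epoch(num_samples: int,
                              per_device_batch_size: int,
                              num_processes: int,
                              gradient_accumulation_steps: int) -> None:
    micro_batches_per_epoch = num_samples // (per_device_batch_size * num_processes)
    if micro_batches_per_epoch < gradient_accumulation_steps:
        raise ValueError(
            f"An epoch yields only {micro_batches_per_epoch} micro-batches, fewer than "
            f"gradient_accumulation_steps={gradient_accumulation_steps}; lower the "
            "accumulation steps or provide more data."
        )
```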
Describe the bug
Crash when training reaches the last batch of an epoch.
Using the granite-20b-base model.
Using a JSONL dataset with input and output columns.
The error seems to be a tensor size mismatch that causes a torch concatenate operation to crash (see the minimal sketch below).
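A minimal sketch of the kind of failure described above (hypothetical shapes, not taken from the actual traceback): torch.cat requires every dimension except the concatenation dimension to match, so a smaller final batch padded to a different sequence length raises a RuntimeError.

```python
# Minimal illustration with hypothetical shapes: concatenating a regular batch
# with a shorter, differently padded final batch fails in torch.cat.
import torch

regular_batch = torch.zeros(8, 512)  # (batch_size, seq_len) of a full batch
final_batch = torch.zeros(3, 407)    # smaller last batch of the epoch

try:
    torch.cat([regular_batch, final_batch], dim=0)  # all dims except dim=0 must match
except RuntimeError as err:
    print(err)  # "Sizes of tensors must match except in dimension 0 ..."
```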
Platform
Please provide details about the environment you are using, including the following:
Running in an OpenShift GPU cluster.
Python 3.11.9 (main, Sep 11 2024, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] on linux
Branch: wca (https://github.com/foundation-model-stack/fms-hf-tuning/tree/wca)

Sample Code
The dataset_config.yaml file:

Expected behavior
Training should finish without errors.
Observed behavior
Crash on the last batch of the epoch.