I ran into the following problem when applying sequence parallelism to training with zigzag_ringattention. It seems to be caused by an imbalanced split of the input embeddings. I'm training on 4 GPUs (4 processes), and the length of inputs_embeds is 2999, which is not divisible by 4 without a remainder. After inputs_embeds.chunk(), the 4 GPUs receive sequences of length 750, 750, 750, and 749. I believe the failure happens on the rank holding the 749-token chunk: block_seq_len, which is half the local length, splits that sequence unevenly (a possible padding workaround is sketched after the traceback below).
I'm wondering whether you have run into a similar problem and whether there is a recommended way to handle it, or whether I'm misunderstanding something. Sequence lengths that don't divide evenly should be a fairly common case.
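For concreteness, here is a minimal sketch of the shapes I believe are involved (the batch and hidden sizes are just illustrative placeholders):

```python
import torch

# toy stand-in for inputs_embeds: (batch, seq_len, hidden) with seq_len = 2999
inputs_embeds = torch.randn(1, 2999, 64)

# chunking across 4 ranks gives uneven pieces
chunks = inputs_embeds.chunk(4, dim=1)
print([c.size(1) for c in chunks])   # [750, 750, 750, 749]

# zigzag ring attention splits each local chunk into two blocks
local_len = chunks[-1].size(1)       # 749
block_seq_len = local_len // 2       # 374
print(local_len - block_seq_len)     # 375 -> 375 vs 374, matching the size mismatch in the traceback
```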
Traceback (most recent call last):
File "HOME_PATH/files/LLaVA-OV/llava/train/train_mem.py", line 4, in <module>
train()
File "HOME_PATH/files/LLaVA-OV/llava/train/train.py", line 1717, in train
trainer.train()
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2348, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 3275, in training_step
self.accelerator.backward(loss)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 2151, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
self.engine.backward(loss, **kwargs)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/ring_flash_attn/zigzag_ring_flash_attn.py", line 235, in backward
dq, dk, dv = zigzag_ring_flash_attn_backward(
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/ring_flash_attn/zigzag_ring_flash_attn.py", line 160, in zigzag_ring_flash_attn_backward
dq[:, block_seq_len:] += dq_buffer[:, :block_seq_len]
RuntimeError: The size of tensor a (375) must match the size of tensor b (374) at non-singleton dimension 1
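As a possible workaround, I'm considering right-padding the sequence so the global length is divisible by 2 * world_size before chunking, since zigzag splits each rank's chunk into two equal halves. This is only a sketch: the helper name `pad_for_zigzag` and the -100 label masking are my own assumptions, not part of the ring_flash_attn API.

```python
import torch
import torch.nn.functional as F

def pad_for_zigzag(inputs_embeds, labels, world_size):
    """Right-pad along the sequence dim so seq_len % (2 * world_size) == 0.

    zigzag ring attention uses block_seq_len = local_len // 2, so the global
    length must be a multiple of 2 * world_size for every rank's chunk to
    split into two equal blocks. Hypothetical helper, not part of ring_flash_attn.
    """
    multiple = 2 * world_size
    seq_len = inputs_embeds.size(1)
    pad_len = (-seq_len) % multiple
    if pad_len == 0:
        return inputs_embeds, labels
    # inputs_embeds: (batch, seq, hidden) -> pad only the sequence dimension
    inputs_embeds = F.pad(inputs_embeds, (0, 0, 0, pad_len))
    # labels: (batch, seq) -> pad with -100 so the extra positions are ignored by the loss
    labels = F.pad(labels, (0, pad_len), value=-100)
    return inputs_embeds, labels

# example: 2999 tokens on 4 GPUs -> padded to 3000, so each rank gets 750
# and block_seq_len is an even 375 everywhere
inputs_embeds = torch.randn(1, 2999, 64)
labels = torch.randint(0, 100, (1, 2999))
inputs_embeds, labels = pad_for_zigzag(inputs_embeds, labels, world_size=4)
print(inputs_embeds.size(1), labels.size(1))  # 3000 3000
```

If this direction makes sense, the attention mask (if any) would need the same padding, and position ids may need matching care; I'd appreciate knowing whether there is an intended way to handle non-divisible lengths instead.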