I ran into the following problem when applying sequence parallelism to training with zigzag_ringattention. It seems to be caused by an imbalanced split of the input embeddings. I'm training on 4 GPUs (4 processes), and the length of inputs_embeds is 2999, which is not divisible by 4 without a remainder. After inputs_embeds.chunk(), the 4 GPUs receive sequences of length 750, 750, 750, and 749. I believe the failure happens on the rank holding the 749-token chunk: block_seq_len, which is half the local length, splits that sequence unevenly (a possible padding workaround is sketched after the traceback below).
I'm wondering whether you have run into a similar problem and whether there is a recommended way to handle it, or whether I'm misunderstanding something. Sequence lengths that don't divide evenly should be a fairly common case.
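For concreteness, here is a minimal sketch of the shapes I believe are involved (the batch and hidden sizes are just illustrative placeholders):

```python
import torch

# toy stand-in for inputs_embeds: (batch, seq_len, hidden) with seq_len = 2999
inputs_embeds = torch.randn(1, 2999, 64)

# chunking across 4 ranks gives uneven pieces
chunks = inputs_embeds.chunk(4, dim=1)
print([c.size(1) for c in chunks])   # [750, 750, 750, 749]

# zigzag ring attention splits each local chunk into two blocks
local_len = chunks[-1].size(1)       # 749
block_seq_len = local_len // 2       # 374
print(local_len - block_seq_len)     # 375 -> 375 vs 374, matching the size mismatch in the traceback
```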
Traceback (most recent call last):
File "HOME_PATH/files/LLaVA-OV/llava/train/train_mem.py", line 4, in <module>
train()
File "HOME_PATH/files/LLaVA-OV/llava/train/train.py", line 1717, in train
trainer.train()
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2348, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 3275, in training_step
self.accelerator.backward(loss)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 2151, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
self.engine.backward(loss, **kwargs)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/ring_flash_attn/zigzag_ring_flash_attn.py", line 235, in backward
dq, dk, dv = zigzag_ring_flash_attn_backward(
File "HOME_PATH/.conda/envs/llava/lib/python3.10/site-packages/ring_flash_attn/zigzag_ring_flash_attn.py", line 160, in zigzag_ring_flash_attn_backward
dq[:, block_seq_len:] += dq_buffer[:, :block_seq_len]
RuntimeError: The size of tensor a (375) must match the size of tensor b (374) at non-singleton dimension 1
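As a possible workaround, I'm considering right-padding the sequence so the global length is divisible by 2 * world_size before chunking, since zigzag splits each rank's chunk into two equal halves. This is only a sketch: the helper name `pad_for_zigzag` and the -100 label masking are my own assumptions, not part of the ring_flash_attn API.

```python
import torch
import torch.nn.functional as F

def pad_for_zigzag(inputs_embeds, labels, world_size):
    """Right-pad along the sequence dim so seq_len % (2 * world_size) == 0.

    zigzag ring attention uses block_seq_len = local_len // 2, so the global
    length must be a multiple of 2 * world_size for every rank's chunk to
    split into two equal blocks. Hypothetical helper, not part of ring_flash_attn.
    """
    multiple = 2 * world_size
    seq_len = inputs_embeds.size(1)
    pad_len = (-seq_len) % multiple
    if pad_len == 0:
        return inputs_embeds, labels
    # inputs_embeds: (batch, seq, hidden) -> pad only the sequence dimension
    inputs_embeds = F.pad(inputs_embeds, (0, 0, 0, pad_len))
    # labels: (batch, seq) -> pad with -100 so the extra positions are ignored by the loss
    labels = F.pad(labels, (0, pad_len), value=-100)
    return inputs_embeds, labels

# example: 2999 tokens on 4 GPUs -> padded to 3000, so each rank gets 750
# and block_seq_len is an even 375 everywhere
inputs_embeds = torch.randn(1, 2999, 64)
labels = torch.randint(0, 100, (1, 2999))
inputs_embeds, labels = pad_for_zigzag(inputs_embeds, labels, world_size=4)
print(inputs_embeds.size(1), labels.size(1))  # 3000 3000
```

If this direction makes sense, the attention mask (if any) would need the same padding, and position ids may need matching care; I'd appreciate knowing whether there is an intended way to handle non-divisible lengths instead.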