Shape cannot match the size during training #3

Open
cqbu opened this issue Sep 19, 2023 · 5 comments

Comments

@cqbu

cqbu commented Sep 19, 2023

During training, I got this error in the backbone:

File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 694, in forward
value = value.reshape(B, self.num_heads, self.value_channels//self.num_heads, n_l)
RuntimeError: shape '[24, 1, 96, 40]' is invalid for input of size 368640

This happens in SpatialImageLanguageAttention. I found that num_heads is 1, so this is not multi-head attention, right?
I can't tell whether the target shape or the input size is the wrong one. What shape or size is expected here?

The full error message is below:
Traceback (most recent call last):
File "train_net_lmpm.py", line 318, in
launch(
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/launch.py", line 69, in launch
mp.start_processes(
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/launch.py", line 123, in _distributed_worker
main_func(*args)
File "/root/MeViS/train_net_lmpm.py", line 312, in main
return trainer.train()
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 484, in train
super().train(self.start_iter, self.max_iter)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 155, in train
self.run_step()
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
self._trainer.run_step()
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 494, in run_step
loss_dict = self.model(data)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/lmpm/lmpm_model.py", line 281, in forward
return self.train_model(batched_inputs)
File "/root/MeViS/lmpm/lmpm_model.py", line 312, in train_model
features = self.backbone(images.tensor, lang_feat_sentence, lang_mask)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 785, in forward
y = super().forward(x, l, l_mask)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 470, in forward
x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww, l, l_mask)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 590, in forward
x_residual = self.fusion(x, l, l_mask)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 627, in forward
lang = self.image_lang_att(x, l, l_mask) # (B, H*W, dim)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 694, in forward
value = value.reshape(B, self.num_heads, self.value_channels//self.num_heads, n_l)
RuntimeError: shape '[24, 1, 96, 40]' is invalid for input of size 368640
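
For readers following the numbers in the RuntimeError above, here is a minimal arithmetic check. It uses only the two values printed in the error message (the requested shape and the reported input size); it does not inspect the model, so it cannot say which dimension is at fault. A whole-number ratio like the one below often points at one dimension being larger than the reshape assumes (for example a batch dimension), but the thread does not confirm which one.

```python
from math import prod

# Values copied from the RuntimeError in the traceback above.
target_shape = (24, 1, 96, 40)  # shape passed to value.reshape(...)
actual_numel = 368640           # number of elements the tensor really has

# Number of elements the requested shape would hold.
target_numel = prod(target_shape)

# How many times larger the real tensor is than the requested shape.
ratio, remainder = divmod(actual_numel, target_numel)

print(target_numel)      # 92160
print(ratio, remainder)  # 4 0
```

The tensor holds exactly 4 times as many elements as the requested shape, with no remainder.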

@cqbu
Author

cqbu commented Sep 19, 2023

By the way, I found that when build_batch_data_loader is called, the parameter 'prefetch_factor' is not passed. In detectron2 its default value is None, which causes an error in torch's DataLoader at the line assert prefetch_factor > 0, because None cannot be compared with the int 0.
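
A minimal sketch of the failure described above and one possible guard. The helper name dataloader_kwargs is hypothetical and is not part of detectron2 or torch; the idea is only to leave 'prefetch_factor' out of the keyword arguments when it is None, so the DataLoader can fall back to its own default. Whether that is the right fix for this repository is an assumption.

```python
from typing import Any, Dict, Optional

# Comparing None with an int raises TypeError in Python 3,
# which is what an `assert prefetch_factor > 0` check runs into.
try:
    result = None > 0
except TypeError as exc:
    comparison_error = type(exc).__name__

print(comparison_error)  # TypeError


def dataloader_kwargs(num_workers: int,
                      prefetch_factor: Optional[int] = None) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for a data loader,
    skipping prefetch_factor when it has not been set."""
    kwargs: Dict[str, Any] = {"num_workers": num_workers}
    if prefetch_factor is not None:
        if prefetch_factor <= 0:
            raise ValueError("prefetch_factor must be a positive integer")
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


print(dataloader_kwargs(4))     # {'num_workers': 4}
print(dataloader_kwargs(4, 2))  # {'num_workers': 4, 'prefetch_factor': 2}
```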

@heshuting555
Collaborator

heshuting555 commented Sep 21, 2023

You can try using multiple GPUs to run it, and the error will go away!

@cilinyan

cilinyan commented Oct 29, 2023

You can try using multiple GPUs to run it, and the error will go away!

One simple approach is to ensure that only one video is trained on each GPU.

If you want to train multiple videos on each GPU, you may need to modify several parts of the code, such as this.
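
The "one video per GPU" suggestion can be sanity-checked before launching with a sketch like the one below. It assumes the configured batch size is a total across all GPUs that is split evenly among them, which is how detectron2 documents SOLVER.IMS_PER_BATCH; that this repository counts videos in that setting is an assumption, and the function names are hypothetical.

```python
def videos_per_gpu(total_batch_size: int, num_gpus: int) -> int:
    """Return how many videos each GPU receives when the total batch
    is divided evenly across GPUs (assumed behaviour, see note above)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch_size % num_gpus != 0:
        raise ValueError("total batch size must be divisible by num_gpus")
    return total_batch_size // num_gpus


def is_one_video_per_gpu(total_batch_size: int, num_gpus: int) -> bool:
    """True when each GPU would train on exactly one video."""
    return videos_per_gpu(total_batch_size, num_gpus) == 1


print(is_one_video_per_gpu(8, 8))  # True
print(is_one_video_per_gpu(8, 2))  # False
print(videos_per_gpu(8, 2))        # 4
```

Under this assumption, either raising the number of GPUs or lowering the total batch size until the two match would give one video per GPU.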

@wwyy1234

Excuse me, have you solved this problem? I encountered the same issue. I'm using two GPUs. Could you please let me know how you resolved it?

@Starboy-at-earth

I encountered the same problem.
