
The grads returned by self.get_G_wrt_shared are always 0 #6

Open
Xiantai01 opened this issue Oct 13, 2024 · 7 comments

Comments

@Xiantai01

Uploading 屏幕截图 2024-10-13 205351.png…

@kaka-Cao
Collaborator

Hello, it seems we can't see the image you uploaded.

@Xiantai01
Author

When using MTL-Aligned and taking gradients, e.g. torch.autograd.grad(fusion_loss, list(fusion.parameters())), the gradient of the loss with respect to the model parameters is always 0.
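For reference, a minimal, self-contained sketch (the module and tensor names here are made up, not from this repository) of the two situations behind this symptom: when the loss's autograd graph actually reaches the parameters passed to torch.autograd.grad, the grads are non-zero; when the forward pass was detached, allow_unused=True returns None for every parameter:

    import torch
    import torch.nn as nn

    fusion = nn.Linear(4, 4)   # stands in for the fusion sub-network
    head = nn.Linear(4, 1)     # stands in for a downstream head
    x = torch.randn(2, 4)

    # Case 1: the loss is built from fusion's (non-detached) output -> real gradients.
    loss_connected = head(fusion(x)).sum()
    grads = torch.autograd.grad(loss_connected, list(fusion.parameters()),
                                retain_graph=True, allow_unused=True)
    print([g.abs().sum().item() for g in grads])   # non-zero values

    # Case 2: fusion's output is detached before the loss is computed
    # (e.g. .detach(), torch.no_grad(), or a numpy round-trip) -> None for every parameter.
    loss_detached = head(fusion(x).detach()).sum()
    grads = torch.autograd.grad(loss_detached, list(fusion.parameters()),
                                retain_graph=True, allow_unused=True)
    print(grads)                                   # (None, None)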

@kaka-Cao
Collaborator

Could you provide a more detailed screenshot of your debugging code?

@Xiantai01
Author

    def after_train_iter(self, runner):
        runner.optimizer.zero_grad()
        if self.detect_anomalous_params:
            self.detect_anomalous_parameters(runner.outputs['loss'], runner)
        # ------------------------------------------------------------------ #
        shared_parameter = []
        for p in list(runner.model.module.backbone.parameters()):
            if p.requires_grad:
                shared_parameter.append(p)
        # for name, p in runner.model.module.backbone.named_parameters():
        #     if p.requires_grad:
        #         shared_parameter.append(p)
        for p in list(runner.model.module.neck.parameters()):
            if p.requires_grad:
                shared_parameter.append(p)
        # for name, p in runner.model.module.neck.named_parameters():
        #     if p.requires_grad:
        #         shared_parameter.append(p)
        fusion_parameter = runner.model.module.fusion          # each task's own parameters
        detection_parameter_1 = runner.model.module.roi_head   # detection network parameters
        detection_parameter_2 = runner.model.module.rpn_head   # detection network parameters
        detection_parameter_1 = list(detection_parameter_1.parameters())
        detection_parameter_2 = list(detection_parameter_2.parameters())
        combined_parameter = detection_parameter_1 + detection_parameter_2
        task_specific_params = {'0': list(fusion_parameter.parameters()), '1': combined_parameter}
        # debugging code
        grad = torch.autograd.grad(runner.outputs['loss']['fusion_loss'], detection_parameter_1,
                                   retain_graph=True, allow_unused=True)
        del runner.outputs['loss']['acc']
        self.balancer.step_with_model(
            losses=runner.outputs['loss'],
            shared_params=shared_parameter,
            task_specific_params=task_specific_params,
            last_shared_layer_params=None,
            iter=runner.iter
        )

In the debugging part I added code that computes the gradients, and found that whether I differentiate with respect to combined_parameter or to the task-specific parameters, the result is always 0 or None, which then makes the call to MTL-Aligned raise an error.
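For what it's worth, a hedged debugging sketch (check_loss_reachability is a made-up helper, not part of this repository) showing one way to see, for each loss term, which parameter groups the autograd graph actually reaches before everything is handed to the balancer:

    import torch

    def check_loss_reachability(losses, named_param_groups):
        """For every loss term, report how many parameters in each group receive a
        non-None gradient. None everywhere usually means the loss was computed from
        detached tensors (or inside torch.no_grad())."""
        for loss_name, loss in losses.items():
            if isinstance(loss, (list, tuple)):   # mmdet loss dicts may hold lists of tensors
                loss = sum(loss)
            if loss.grad_fn is None:
                print(f'{loss_name}: no grad_fn -> the graph was detached')
                continue
            for group_name, params in named_param_groups.items():
                grads = torch.autograd.grad(loss, params,
                                            retain_graph=True, allow_unused=True)
                reached = sum(g is not None for g in grads)
                print(f'{loss_name} -> {group_name}: {reached}/{len(params)} parameters reachable')

    # e.g. inside after_train_iter, just before self.balancer.step_with_model(...):
    # check_loss_reachability(runner.outputs['loss'],
    #                         {'shared': shared_parameter, **task_specific_params})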

@StarBlue98

Hello, did you manage to solve this? I have the same problem... and when training reached the alignment step at iteration 1000 I got a CUDA out of memory error...

@ironmanfcf

I ran into this problem as well; I'm on an A100 40G GPU. The network itself, with FusionNet added, already increases training-time memory, and the GMTA operation pushes the memory consumption even higher. When I disable GMTA and lower the batch size to 2, the performance does not drop much, less than about two points compared with the paper. But if you really want to reproduce the full network, even 40 GB of GPU memory is not enough; you probably need 80 GB, like the authors, to reproduce it completely.

@shawwalt


Hello, may I ask how the gradient-update code gets called? With my limited ability, I searched for a long time but could not find where GMTA is invoked. Every time I debug into forward_train, the program does not go any further.
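For reference, a hedged sketch of the usual mmcv/MMDetection control flow (the hook class name and config line below are illustrative, not taken from this repository): forward_train only produces the loss dict, and the gradient-surgery hook runs afterwards when the runner fires after_train_iter, which is why stepping through forward_train never lands in the GMTA code:

    from mmcv.runner import HOOKS, OptimizerHook

    @HOOKS.register_module()
    class GMTAOptimizerHook(OptimizerHook):      # hypothetical name for the custom hook
        def after_train_iter(self, runner):
            # The runner calls this once per training iteration, after the forward
            # pass has filled runner.outputs['loss']; the GMTA / MTL-Aligned
            # gradient manipulation would run here.
            ...

    # Selected in the config via optimizer_config, e.g.:
    #   optimizer_config = dict(type='GMTAOptimizerHook')
    # Per iteration the mmcv runner roughly does:
    #   runner.run_iter(data_batch, train_mode=True)   # -> model.forward_train -> losses
    #   runner.call_hook('after_train_iter')           # -> GMTAOptimizerHook.after_train_iter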
