swin transformer no speed up with torch.cuda.amp #623
Replies: 2 comments
-
@thomas0809 I measure gains of approx 50% for both inference and training on my 3090 card w/ NGC 21.03. I haven't tried a newer NGC container yet; perhaps float32 has also improved, so there is less of a gap? Or maybe float32 on the A100 is even stronger relative to float16 on the A100 arch than on the 3090. Pure float16 is a 2x gain, so there could also be a lot of ops that aren't being autocast to float16... that would work for inference but could be unstable for training.
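For context, here is a minimal benchmarking sketch of the kind of float32 vs autocast vs pure-float16 comparison described above, assuming timm is installed; the model name, batch size, and iteration counts are illustrative assumptions, not the commenter's exact setup.

```python
# Sketch: compare float32, autocast (AMP), and pure float16 inference throughput.
# Model name, batch size, and iteration counts are illustrative placeholders.
import time
import torch
import timm

def bench(model, x, autocast=False, iters=50, warmup=10):
    model.eval()
    with torch.no_grad():
        for i in range(warmup + iters):
            if i == warmup:
                torch.cuda.synchronize()
                start = time.time()
            with torch.cuda.amp.autocast(enabled=autocast):
                model(x)
        torch.cuda.synchronize()
    return iters * x.shape[0] / (time.time() - start)  # images / sec

model = timm.create_model('swin_base_patch4_window7_224', pretrained=False).cuda()
x = torch.randn(64, 3, 224, 224, device='cuda')

print('float32 :', bench(model, x))
print('autocast:', bench(model, x, autocast=True))
print('float16 :', bench(model.half(), x.half()))  # pure half, inference only
```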
-
@thomas0809
-
I am trying to use `torch.cuda.amp` to speed up training. However, I found that on A100 machines amp couldn't speed up Swin Transformers, while it worked pretty well for other models such as ResNet. See a detailed example here. Based on the profile, a lot of time was spent in the `copy_device_to_device` operator. I think there should be a way to get rid of these operations. Hope someone more familiar with the implementation can help!
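For reference, a minimal sketch of the standard `torch.cuda.amp` training pattern together with a short `torch.profiler` run to see which kernels dominate (e.g. whether copy kernels show up near the top); the model, optimizer, and synthetic data below are placeholders, not the original script.

```python
# Standard torch.cuda.amp training pattern plus a brief profiler run (a sketch;
# model, optimizer, and data are placeholder assumptions).
import torch
import timm

model = timm.create_model('swin_base_patch4_window7_224', num_classes=10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

images = torch.randn(32, 3, 224, 224, device='cuda')
targets = torch.randint(0, 10, (32,), device='cuda')

# Profile a few steps to see where GPU time goes.
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():        # ops run in float16 where safe
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()          # scale loss to avoid fp16 underflow
        scaler.step(optimizer)
        scaler.update()

print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=15))
```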