
Poor Training speed of DistributedDataParallel on A40*8 #1657

Answered by rwightman
Doraemonzm asked this question in General

@Doraemonzm I'd be far more concerned about your A100 setup: it's wasting far more GPU FLOPs than the A40. The A100 is the more capable GPU, yet its data-loading time is 2x the A40's. Based on the gap between instantaneous and average throughput, you should be getting closer to 6000 im/sec in that A100 example if dataloading weren't a bottleneck. The A40 doesn't appear to have as severe a loading/processing bottleneck. You could try increasing the batch size, using --channels-last, etc.
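
For context, a minimal PyTorch sketch of the kind of thing timm's --channels-last flag enables (the model and shapes below are stand-ins, not the asker's actual setup): converting the model and inputs to NHWC (channels-last) memory format, which tends to speed up convolutions on Ampere-class GPUs, especially with mixed precision.

```python
# Minimal sketch, assuming a generic conv net on a CUDA device.
import torch
import torch.nn as nn

# Stand-in model; any convolutional network works the same way.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),
).cuda().to(memory_format=torch.channels_last)

# Inputs must also be converted, or PyTorch falls back to NCHW kernels.
x = torch.randn(256, 3, 224, 224, device="cuda").to(
    memory_format=torch.channels_last
)

# channels-last pays off most when combined with AMP on Ampere GPUs.
with torch.cuda.amp.autocast():
    y = model(x)
```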
