Poor Training speed of DistributedDataParallel on A40*8 #1657
I have tried to train the same model on two machines, equipped with 8 A40s and 8 A100s respectively. Any help is appreciated. The training logs are posted below.

A100*8:
Train: 241 [   0/834 (  0%)]  Loss: 1.753 (1.75)  Time: 3.681s,  417.33/s  (3.681s,  417.33/s)  LR: 1.000e-06  Data: 3.435 (3.435)
Train: 241 [ 50/834 ( 6%)] Loss: 1.774 (1.76) Time: 0.251s, 6116.32/s (0.616s, 2493.22/s) LR: 1.000e-06 Data: 0.026 (0.247)
Train: 241 [ 100/834 ( 12%)] Loss: 1.760 (1.76) Time: 0.259s, 5939.95/s (0.589s, 2608.39/s) LR: 1.000e-06 Data: 0.050 (0.154)
Train: 241 [ 150/834 ( 18%)] Loss: 1.749 (1.76) Time: 0.253s, 6080.71/s (0.598s, 2568.22/s) LR: 1.000e-06 Data: 0.035 (0.116)
Train: 241 [ 200/834 ( 24%)] Loss: 1.779 (1.76) Time: 0.245s, 6279.66/s (0.592s, 2596.43/s) LR: 1.000e-06 Data: 0.043 (0.097)
Train: 241 [ 250/834 ( 30%)] Loss: 1.821 (1.77) Time: 0.274s, 5615.34/s (0.586s, 2622.39/s) LR: 1.000e-06 Data: 0.065 (0.086)
Train: 241 [ 300/834 ( 36%)] Loss: 1.740 (1.77) Time: 0.228s, 6723.11/s (0.590s, 2605.50/s) LR: 1.000e-06 Data: 0.030 (0.078)
Train: 241 [ 350/834 ( 42%)] Loss: 1.727 (1.76) Time: 0.287s, 5345.67/s (0.588s, 2611.77/s) LR: 1.000e-06 Data: 0.029 (0.072)
Train: 241 [ 400/834 ( 48%)] Loss: 1.746 (1.76) Time: 0.229s, 6696.98/s (0.586s, 2619.33/s) LR: 1.000e-06 Data: 0.034 (0.068)
Train: 241 [ 450/834 ( 54%)] Loss: 1.793 (1.76) Time: 0.255s, 6033.07/s (0.590s, 2602.70/s) LR: 1.000e-06 Data: 0.031 (0.065)
Train: 241 [ 500/834 ( 60%)] Loss: 1.788 (1.77) Time: 0.254s, 6040.55/s (0.588s, 2611.94/s) LR: 1.000e-06 Data: 0.058 (0.062)
Train: 241 [ 550/834 ( 66%)] Loss: 1.814 (1.77) Time: 0.244s, 6288.44/s (0.588s, 2611.54/s) LR: 1.000e-06 Data: 0.049 (0.060)
Train: 241 [ 600/834 ( 72%)] Loss: 1.749 (1.77) Time: 0.243s, 6319.64/s (0.592s, 2596.22/s) LR: 1.000e-06 Data: 0.033 (0.058)
Train: 241 [ 650/834 ( 78%)] Loss: 1.803 (1.77) Time: 0.241s, 6361.87/s (0.591s, 2597.52/s) LR: 1.000e-06 Data: 0.027 (0.056)
Train: 241 [ 700/834 ( 84%)] Loss: 1.863 (1.78) Time: 0.252s, 6096.81/s (0.590s, 2602.32/s) LR: 1.000e-06 Data: 0.041 (0.055)
Train: 241 [ 750/834 ( 90%)] Loss: 1.765 (1.78) Time: 0.318s, 4825.78/s (0.593s, 2592.01/s) LR: 1.000e-06 Data: 0.027 (0.054)
Train: 241 [ 800/834 ( 96%)] Loss: 1.713 (1.77) Time: 0.235s, 6537.70/s (0.592s, 2594.35/s) LR: 1.000e-06 Data: 0.033 (0.053)
Train: 241 [ 833/834 (100%)] Loss: 1.689 (1.77) Time: 0.170s, 9051.56/s (0.589s, 2609.82/s) LR: 1.000e-06 Data: 0.000 (0.052)

A40*8, Apex Mixed Precision:
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Using NVIDIA APEX AMP. Training in mixed precision.
Using NVIDIA APEX DistributedDataParallel.
/home/inspur/anaconda3/envs/wzm/lib/python3.7/site-packages/apex/__init__.py:68: DeprecatedFeatureWarning: apex.parallel.DistributedDataParallel is deprecated and will be removed by the end of February 2023.
  warnings.warn(msg, DeprecatedFeatureWarning)
Scheduled epochs: 310
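For reference, a minimal sketch of how the Apex O1 + Apex DistributedDataParallel setup reflected in this log is typically initialized (an assumed illustration, not the exact train.py wiring; the Linear stand-in model, learning rate, and LOCAL_RANK handling are placeholders):

```python
import os
import torch
from apex import amp
from apex.parallel import DistributedDataParallel as ApexDDP

# Assumes launch via torchrun / torch.distributed.launch, which sets LOCAL_RANK per process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 1000).cuda()  # placeholder for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)

# opt_level O1 patches torch functions and uses dynamic loss scaling,
# matching the "Defaults for this optimization level" block above.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1", loss_scale="dynamic")

# apex.parallel.DistributedDataParallel is deprecated (see the warning above);
# torch.nn.parallel.DistributedDataParallel is the recommended replacement.
model = ApexDDP(model)

# The backward pass goes through amp.scale_loss so dynamic loss scaling is applied.
loss = model(torch.randn(8, 512, device="cuda")).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```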
Train: 0 [ 0/1251 ( 0%)] Loss: 6.925 (6.93) Time: 16.228s, 63.10/s (16.228s, 63.10/s) LR: 1.000e-06 Data: 3.821 (3.821)
Train: 0 [ 50/1251 ( 4%)] Loss: 6.918 (6.92) Time: 0.842s, 1215.97/s (1.149s, 891.09/s) LR: 1.000e-06 Data: 0.023 (0.096)
Train: 0 [ 100/1251 ( 8%)] Loss: 6.924 (6.92) Time: 0.847s, 1209.17/s (0.999s, 1024.78/s) LR: 1.000e-06 Data: 0.025 (0.060)
Train: 0 [ 150/1251 ( 12%)] Loss: 6.921 (6.92) Time: 0.848s, 1207.65/s (0.948s, 1079.62/s) LR: 1.000e-06 Data: 0.025 (0.047)
Train: 0 [ 200/1251 ( 16%)] Loss: 6.908 (6.92) Time: 0.851s, 1203.42/s (0.923s, 1109.26/s) LR: 1.000e-06 Data: 0.022 (0.041)
Train: 0 [ 250/1251 ( 20%)] Loss: 6.914 (6.92) Time: 0.845s, 1211.63/s (0.909s, 1126.98/s) LR: 1.000e-06 Data: 0.027 (0.037)
Train: 0 [ 300/1251 ( 24%)] Loss: 6.910 (6.92) Time: 0.842s, 1215.85/s (0.899s, 1139.54/s) LR: 1.000e-06 Data: 0.017 (0.035)
Train: 0 [ 350/1251 ( 28%)] Loss: 6.917 (6.92) Time: 0.842s, 1216.10/s (0.891s, 1148.76/s) LR: 1.000e-06 Data: 0.023 (0.033)
Train: 0 [ 400/1251 ( 32%)] Loss: 6.912 (6.92) Time: 0.849s, 1206.54/s (0.886s, 1155.77/s) LR: 1.000e-06 Data: 0.022 (0.032)
Train: 0 [ 450/1251 ( 36%)] Loss: 6.924 (6.92) Time: 0.843s, 1215.23/s (0.882s, 1161.41/s) LR: 1.000e-06 Data: 0.022 (0.031)
Train: 0 [ 500/1251 ( 40%)] Loss: 6.915 (6.92) Time: 0.859s, 1191.55/s (0.878s, 1165.93/s) LR: 1.000e-06 Data: 0.025 (0.030)
Train: 0 [ 550/1251 ( 44%)] Loss: 6.915 (6.92) Time: 0.839s, 1221.22/s (0.876s, 1169.54/s) LR: 1.000e-06 Data: 0.017 (0.030)
Train: 0 [ 600/1251 ( 48%)] Loss: 6.916 (6.92) Time: 0.844s, 1212.71/s (0.873s, 1172.76/s) LR: 1.000e-06 Data: 0.019 (0.029)
Train: 0 [ 650/1251 ( 52%)] Loss: 6.908 (6.92) Time: 0.848s, 1207.16/s (0.871s, 1175.29/s) LR: 1.000e-06 Data: 0.022 (0.028)
Train: 0 [ 700/1251 ( 56%)] Loss: 6.918 (6.92) Time: 0.845s, 1211.72/s (0.870s, 1177.51/s) LR: 1.000e-06 Data: 0.027 (0.028)
Train: 0 [ 750/1251 ( 60%)] Loss: 6.910 (6.92) Time: 0.849s, 1206.54/s (0.868s, 1179.49/s) LR: 1.000e-06 Data: 0.016 (0.027)
Train: 0 [ 800/1251 ( 64%)] Loss: 6.907 (6.92) Time: 0.849s, 1205.42/s (0.867s, 1181.51/s) LR: 1.000e-06 Data: 0.018 (0.027)
Train: 0 [ 850/1251 ( 68%)] Loss: 6.909 (6.92) Time: 0.843s, 1214.26/s (0.866s, 1183.11/s) LR: 1.000e-06 Data: 0.021 (0.027)
Train: 0 [ 900/1251 ( 72%)] Loss: 6.911 (6.91) Time: 0.843s, 1214.09/s (0.865s, 1184.45/s) LR: 1.000e-06 Data: 0.019 (0.026)
Train: 0 [ 950/1251 ( 76%)] Loss: 6.903 (6.91) Time: 0.851s, 1202.73/s (0.864s, 1185.53/s) LR: 1.000e-06 Data: 0.019 (0.026)
Train: 0 [1000/1251 ( 80%)] Loss: 6.901 (6.91) Time: 0.842s, 1216.30/s (0.863s, 1186.78/s) LR: 1.000e-06 Data: 0.021 (0.026)
Train: 0 [1050/1251 ( 84%)] Loss: 6.898 (6.91) Time: 0.843s, 1214.14/s (0.862s, 1187.76/s) LR: 1.000e-06 Data: 0.021 (0.025)
Train: 0 [1100/1251 ( 88%)] Loss: 6.902 (6.91) Time: 0.840s, 1219.76/s (0.861s, 1188.66/s) LR: 1.000e-06 Data: 0.019 (0.025)
Train: 0 [1150/1251 ( 92%)] Loss: 6.912 (6.91) Time: 0.853s, 1200.34/s (0.861s, 1189.56/s) LR: 1.000e-06 Data: 0.018 (0.025)
Train: 0 [1200/1251 ( 96%)] Loss: 6.894 (6.91) Time: 0.839s, 1220.27/s (0.860s, 1190.31/s) LR: 1.000e-06 Data: 0.019 (0.025)
Train: 0 [1250/1251 (100%)] Loss: 6.904 (6.91) Time: 0.835s, 1226.26/s (0.860s, 1191.01/s) LR: 1.000e-06 Data: 0.000 (0.025)
Train one epoch time: 1075.9552729129791 s
Distributing BatchNorm running means and vars
Test: [ 0/48] Time: 4.518 (4.518) Loss: 6.8530 (6.8530) Acc@1: 0.2930 ( 0.2930) Acc@5: 1.1719 ( 1.1719)
Test: [ 48/48] Time: 3.409 (0.504) Loss: 6.8597 (6.8567) Acc@1: 0.1179 ( 0.1740) Acc@5: 0.7075 ( 1.0240)
Val one epoch time: 24.703128814697266 s

A40*8, Native AMP Mixed Precision:
Using native Torch AMP. Training in mixed precision.
Using native Torch DistributedDataParallel.
Scheduled epochs: 310
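For comparison, a minimal sketch of the native torch AMP + native DistributedDataParallel path used in this run (again an assumed illustration with placeholder model and hyperparameters, not the exact train.py code):

```python
import os
import torch
from torch.cuda.amp import GradScaler, autocast
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 1000).cuda()  # placeholder for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)

# Native torch DDP; the "Reducer buckets have been rebuilt" lines below come from this wrapper.
model = DDP(model, device_ids=[local_rank])
scaler = GradScaler()  # dynamic loss scaling, analogous to Apex O1

with autocast():  # runs the forward pass in mixed precision
    loss = model(torch.randn(8, 512, device="cuda")).mean()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```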
Train: 0 [ 0/1251 ( 0%)] Loss: 6.925 (6.93) Time: 15.169s, 67.51/s (15.169s, 67.51/s) LR: 1.000e-06 Data: 4.321 (4.321)
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Train: 0 [ 50/1251 ( 4%)] Loss: 6.918 (6.92) Time: 0.785s, 1303.70/s (1.067s, 959.83/s) LR: 1.000e-06 Data: 0.032 (0.103)
Train: 0 [ 100/1251 ( 8%)] Loss: 6.924 (6.92) Time: 0.790s, 1295.88/s (0.927s, 1104.51/s) LR: 1.000e-06 Data: 0.017 (0.062)
Train: 0 [ 150/1251 ( 12%)] Loss: 6.921 (6.92) Time: 0.788s, 1298.94/s (0.880s, 1163.20/s) LR: 1.000e-06 Data: 0.013 (0.048)
Train: 0 [ 200/1251 ( 16%)] Loss: 6.908 (6.92) Time: 0.785s, 1305.21/s (0.857s, 1194.27/s) LR: 1.000e-06 Data: 0.024 (0.041)
Train: 0 [ 250/1251 ( 20%)] Loss: 6.914 (6.92) Time: 0.806s, 1270.36/s (0.844s, 1213.69/s) LR: 1.000e-06 Data: 0.021 (0.037)
Train: 0 [ 300/1251 ( 24%)] Loss: 6.910 (6.92) Time: 0.786s, 1302.31/s (0.835s, 1226.67/s) LR: 1.000e-06 Data: 0.016 (0.034)
Train: 0 [ 350/1251 ( 28%)] Loss: 6.917 (6.92) Time: 0.799s, 1281.43/s (0.828s, 1236.48/s) LR: 1.000e-06 Data: 0.029 (0.033)
Train: 0 [ 400/1251 ( 32%)] Loss: 6.912 (6.92) Time: 0.789s, 1297.74/s (0.823s, 1243.57/s) LR: 1.000e-06 Data: 0.029 (0.031)
Train: 0 [ 450/1251 ( 36%)] Loss: 6.924 (6.92) Time: 0.794s, 1290.26/s (0.819s, 1249.74/s) LR: 1.000e-06 Data: 0.032 (0.030)
Train: 0 [ 500/1251 ( 40%)] Loss: 6.915 (6.92) Time: 0.782s, 1309.67/s (0.816s, 1254.80/s) LR: 1.000e-06 Data: 0.020 (0.029)
Train: 0 [ 550/1251 ( 44%)] Loss: 6.915 (6.92) Time: 0.788s, 1300.30/s (0.813s, 1259.10/s) LR: 1.000e-06 Data: 0.025 (0.028)
Train: 0 [ 600/1251 ( 48%)] Loss: 6.916 (6.92) Time: 0.787s, 1301.50/s (0.811s, 1262.51/s) LR: 1.000e-06 Data: 0.026 (0.028)
Train: 0 [ 650/1251 ( 52%)] Loss: 6.908 (6.92) Time: 0.785s, 1304.05/s (0.809s, 1265.38/s) LR: 1.000e-06 Data: 0.028 (0.027)
Train: 0 [ 700/1251 ( 56%)] Loss: 6.918 (6.92) Time: 0.799s, 1281.48/s (0.808s, 1267.82/s) LR: 1.000e-06 Data: 0.018 (0.027)
Train: 0 [ 750/1251 ( 60%)] Loss: 6.910 (6.92) Time: 0.783s, 1307.36/s (0.806s, 1270.17/s) LR: 1.000e-06 Data: 0.016 (0.026)
Train: 0 [ 800/1251 ( 64%)] Loss: 6.907 (6.92) Time: 0.789s, 1297.05/s (0.805s, 1271.93/s) LR: 1.000e-06 Data: 0.028 (0.026)
Train: 0 [ 850/1251 ( 68%)] Loss: 6.909 (6.92) Time: 0.787s, 1301.63/s (0.804s, 1273.18/s) LR: 1.000e-06 Data: 0.022 (0.025)
Train: 0 [ 900/1251 ( 72%)] Loss: 6.911 (6.91) Time: 0.784s, 1305.99/s (0.803s, 1274.64/s) LR: 1.000e-06 Data: 0.018 (0.025)
Train: 0 [ 950/1251 ( 76%)] Loss: 6.903 (6.91) Time: 0.791s, 1293.92/s (0.803s, 1275.67/s) LR: 1.000e-06 Data: 0.024 (0.025)
Train: 0 [1000/1251 ( 80%)] Loss: 6.901 (6.91) Time: 0.783s, 1307.05/s (0.802s, 1276.84/s) LR: 1.000e-06 Data: 0.027 (0.025)
Train: 0 [1050/1251 ( 84%)] Loss: 6.898 (6.91) Time: 0.791s, 1294.04/s (0.801s, 1277.84/s) LR: 1.000e-06 Data: 0.016 (0.025)
Train: 0 [1100/1251 ( 88%)] Loss: 6.902 (6.91) Time: 0.778s, 1316.26/s (0.801s, 1278.80/s) LR: 1.000e-06 Data: 0.022 (0.024)
Train: 0 [1150/1251 ( 92%)] Loss: 6.912 (6.91) Time: 0.788s, 1300.10/s (0.800s, 1279.81/s) LR: 1.000e-06 Data: 0.021 (0.024)
Train: 0 [1200/1251 ( 96%)] Loss: 6.894 (6.91) Time: 0.790s, 1296.68/s (0.800s, 1280.65/s) LR: 1.000e-06 Data: 0.017 (0.024)
Train: 0 [1250/1251 (100%)] Loss: 6.904 (6.91) Time: 0.779s, 1315.05/s (0.799s, 1281.32/s) LR: 1.000e-06 Data: 0.000 (0.024)
Train one epoch time: 1000.1143641471863 s
Distributing BatchNorm running means and vars
Test: [ 0/48] Time: 4.743 (4.743) Loss: 6.8477 (6.8477) Acc@1: 0.2930 ( 0.2930) Acc@5: 1.0742 ( 1.0742)
Test: [ 48/48] Time: 3.042 (0.489) Loss: 6.8555 (6.8523) Acc@1: 0.1179 ( 0.1880) Acc@5: 0.8255 ( 1.0720)
Val one epoch time: 23.951291799545288 s
A40*8, Native DistributedDataParallel, Float32:
AMP not enabled. Training in float32.
Using native Torch DistributedDataParallel.
Scheduled epochs: 310
Train: 0 [ 0/1251 ( 0%)] Loss: 6.925 (6.93) Time: 9.708s, 105.48/s (9.708s, 105.48/s) LR: 1.000e-06 Data: 5.136 (5.136)
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Train: 0 [ 50/1251 ( 4%)] Loss: 6.918 (6.92) Time: 1.059s, 967.40/s (1.247s, 820.92/s) LR: 1.000e-06 Data: 0.012 (0.117)
Train: 0 [ 100/1251 ( 8%)] Loss: 6.924 (6.92) Time: 1.083s, 945.50/s (1.161s, 881.66/s) LR: 1.000e-06 Data: 0.037 (0.068)
Train: 0 [ 150/1251 ( 12%)] Loss: 6.921 (6.92) Time: 1.068s, 958.81/s (1.133s, 904.02/s) LR: 1.000e-06 Data: 0.024 (0.052)
Train: 0 [ 200/1251 ( 16%)] Loss: 6.908 (6.92) Time: 1.078s, 950.15/s (1.118s, 915.83/s) LR: 1.000e-06 Data: 0.025 (0.044)
Train: 0 [ 250/1251 ( 20%)] Loss: 6.914 (6.92) Time: 1.071s, 956.18/s (1.109s, 923.03/s) LR: 1.000e-06 Data: 0.018 (0.039)
Train: 0 [ 300/1251 ( 24%)] Loss: 6.910 (6.92) Time: 1.059s, 967.18/s (1.103s, 928.06/s) LR: 1.000e-06 Data: 0.012 (0.036)
Train: 0 [ 350/1251 ( 28%)] Loss: 6.917 (6.92) Time: 1.074s, 953.38/s (1.099s, 931.60/s) LR: 1.000e-06 Data: 0.017 (0.033)
Train: 0 [ 400/1251 ( 32%)] Loss: 6.912 (6.92) Time: 1.075s, 952.49/s (1.096s, 934.27/s) LR: 1.000e-06 Data: 0.016 (0.031)
Train: 0 [ 450/1251 ( 36%)] Loss: 6.924 (6.92) Time: 1.087s, 941.71/s (1.094s, 936.28/s) LR: 1.000e-06 Data: 0.017 (0.030)
Train: 0 [ 500/1251 ( 40%)] Loss: 6.915 (6.92) Time: 1.074s, 953.15/s (1.092s, 937.99/s) LR: 1.000e-06 Data: 0.018 (0.029)
Train: 0 [ 550/1251 ( 44%)] Loss: 6.915 (6.92) Time: 1.073s, 954.58/s (1.090s, 939.46/s) LR: 1.000e-06 Data: 0.018 (0.028)
Train: 0 [ 600/1251 ( 48%)] Loss: 6.916 (6.92) Time: 1.067s, 959.43/s (1.089s, 940.61/s) LR: 1.000e-06 Data: 0.023 (0.027)
Train: 0 [ 650/1251 ( 52%)] Loss: 6.908 (6.92) Time: 1.071s, 956.44/s (1.087s, 941.77/s) LR: 1.000e-06 Data: 0.021 (0.026)
Train: 0 [ 700/1251 ( 56%)] Loss: 6.918 (6.92) Time: 1.079s, 949.25/s (1.086s, 942.67/s) LR: 1.000e-06 Data: 0.015 (0.026)
Train: 0 [ 750/1251 ( 60%)] Loss: 6.910 (6.92) Time: 1.075s, 952.20/s (1.085s, 943.43/s) LR: 1.000e-06 Data: 0.016 (0.025)
Train: 0 [ 800/1251 ( 64%)] Loss: 6.907 (6.92) Time: 1.072s, 954.84/s (1.085s, 944.08/s) LR: 1.000e-06 Data: 0.024 (0.025)
Train: 0 [ 850/1251 ( 68%)] Loss: 6.909 (6.92) Time: 1.072s, 955.06/s (1.084s, 944.71/s) LR: 1.000e-06 Data: 0.016 (0.025)
Train: 0 [ 900/1251 ( 72%)] Loss: 6.911 (6.91) Time: 1.075s, 952.40/s (1.083s, 945.27/s) LR: 1.000e-06 Data: 0.015 (0.024)
Train: 0 [ 950/1251 ( 76%)] Loss: 6.903 (6.91) Time: 1.076s, 951.32/s (1.083s, 945.73/s) LR: 1.000e-06 Data: 0.028 (0.024)
Train: 0 [1000/1251 ( 80%)] Loss: 6.901 (6.91) Time: 1.075s, 952.93/s (1.082s, 946.12/s) LR: 1.000e-06 Data: 0.021 (0.024)
Train: 0 [1050/1251 ( 84%)] Loss: 6.898 (6.91) Time: 1.071s, 956.02/s (1.082s, 946.49/s) LR: 1.000e-06 Data: 0.024 (0.023)
Train: 0 [1100/1251 ( 88%)] Loss: 6.902 (6.91) Time: 1.071s, 955.81/s (1.081s, 946.89/s) LR: 1.000e-06 Data: 0.016 (0.023)
Train: 0 [1150/1251 ( 92%)] Loss: 6.912 (6.91) Time: 1.066s, 960.32/s (1.081s, 947.27/s) LR: 1.000e-06 Data: 0.018 (0.023)
Train: 0 [1200/1251 ( 96%)] Loss: 6.894 (6.91) Time: 1.071s, 956.33/s (1.081s, 947.52/s) LR: 1.000e-06 Data: 0.017 (0.023)
Train: 0 [1250/1251 (100%)] Loss: 6.904 (6.91) Time: 1.057s, 968.92/s (1.080s, 947.88/s) LR: 1.000e-06 Data: 0.000 (0.022)
Train one epoch time: 1351.7502689361572 s
Distributing BatchNorm running means and vars
Test: [ 0/48] Time: 4.632 (4.632) Loss: 6.8480 (6.8480) Acc@1: 0.2930 ( 0.2930) Acc@5: 1.0742 ( 1.0742)
Test: [ 48/48] Time: 1.700 (0.488) Loss: 6.8547 (6.8517) Acc@1: 0.1179 ( 0.1880) Acc@5: 0.8255 ( 1.0760)
Val one epoch time: 23.89726185798645 s
Replies: 1 comment
@Doraemonzm I'd be far more concerned about your A100 setup; it's wasting far more GPU FLOPs than the A40 machine. The A100 is the more capable GPU, yet the time spent loading data on the A100 is 2x that of the A40. Based on the instantaneous vs. average throughput, you should be getting closer to 6000 img/sec in that A100 example if dataloading weren't a bottleneck; the A40 run doesn't look like it has as much of a loading/processing bottleneck. You could try increasing the batch size, using --channels-last, etc.
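As a rough illustration of those suggestions (the stand-in dataset, batch size, and worker count below are placeholder assumptions; in timm's train.py the corresponding knobs are --channels-last, --batch-size, and --workers):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Stand-in dataset; in the real run this would be the ImageNet dataset built by the training script.
dataset = datasets.FakeData(size=4096, image_size=(3, 224, 224), num_classes=1000,
                            transform=transforms.ToTensor())

loader = DataLoader(
    dataset,
    batch_size=512,           # larger per-GPU batch to keep the GPUs busy
    num_workers=8,            # more workers to relieve the data-loading bottleneck
    pin_memory=True,
    persistent_workers=True,
)

model = torch.nn.Conv2d(3, 64, 3).cuda().to(memory_format=torch.channels_last)

for images, targets in loader:
    # NHWC (channels_last) tensors let Ampere tensor cores run convolutions more efficiently.
    images = images.cuda(non_blocking=True).to(memory_format=torch.channels_last)
    out = model(images)
```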