Oscillation in the train loss on multiple GPUs #2339
@sinahmr unrelated to the oscillations, but those hparams are a bit out of date; there are some better ones to base off here: https://gist.github.com/rwightman/fb37c339efd2334177ff99a8083ebbc4 On this, I don't think there's any bug. I've done a lot of training with these scripts without issue on 1, 2, 4, 8, all the way to 512 or so GPUs in the past, and a lot of 2, 4, and 8 in recent months. So the 576 vs 288 x 2 in the graph is without involving the grad accum, right? Does 576 x 2 work? What's the dataset like, is it long-tail dist with some challenging samples, uneven class dist or more uniform? If you run with a diff seed at 288 x 2, does it change? If you pull back the lr to 1e-4, does 288 x 2 + 576 match?
@rwightman Thanks for sharing the newer hparams! Yes, in the graph above no grad accum is involved; it's 576 on one GPU vs 288 x 2. The dataset is Mini-ImageNet, which is a subset of ImageNet-1K with the same image sizes (100 categories, 500 training and 100 testing samples per category), so I think it shouldn't make a big difference. Besides, I have experiments on ImageNet-1K, but using a Swin Transformer (and a different set of hparams), and a similar phenomenon is visible in its train loss as well.
There are two things which could be having an impact here: the way the train loss is reduced and logged in distributed mode, and the EMA decay.
The second one won't have any impact on train loss but could result in really slow apparent convergence of eval, and at the end of training the EMA weights might not have fully ramped up. Is there any difference of note in the eval loss & accuracy for the different runs? There are two things with those SBB ViT...
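(Side note, not from the thread: a rough back-of-the-envelope for why a 0.99996 EMA decay can be slow to ramp up on a dataset this size; the step counts below are derived from the dataset figures above and are only illustrative.)

```python
# Rough effective averaging window of an EMA: ~1 / (1 - decay) updates.
# Mini-ImageNet has ~50k train images; with a global batch of 576 an epoch
# is ~86 steps, so 300 epochs is ~26k steps (illustrative numbers only).
steps_per_epoch = 50_000 // 576          # ~86
total_steps = 300 * steps_per_epoch      # ~26,000

for decay in (0.99996, 0.999):
    window = 1.0 / (1.0 - decay)         # updates needed for the EMA to mostly track the model
    print(f"decay={decay}: effective window ~{window:,.0f} steps "
          f"({window / total_steps:.1%} of training)")
# decay=0.99996 -> window ~25,000 steps, i.e. nearly the whole run: the EMA
# weights may never fully catch up, which shows up as slow eval convergence.
# decay=0.999   -> window ~1,000 steps, much better matched to this run length.
```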
You can try changing the loss averaging to verify the train losses: pull the part within the logging block out as an else in the main part of the train step. I don't normally train w/ distributed for datasets that are this small, so in my usual runs there are more sampling points and this doesn't have as big an impact on the logging... So:
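(The snippet that originally followed "So:" isn't preserved in this copy; below is an illustrative sketch of that kind of change, assuming a timm-style AverageMeter passed in as `losses_m`. The helper name and signature are assumptions, not the actual train.py diff.)

```python
# Sketch of the suggested verification: update the loss meter on every step,
# reducing across ranks each time, instead of only inside the logging block.
import torch
import torch.distributed as dist

def record_loss_every_step(losses_m, loss, batch_size, distributed, world_size):
    """losses_m is assumed to be a timm-style AverageMeter."""
    if distributed:
        reduced = loss.detach().clone()
        dist.all_reduce(reduced)      # sum the loss across ranks...
        reduced /= world_size         # ...then average it
        losses_m.update(reduced.item(), batch_size)
    else:
        losses_m.update(loss.item(), batch_size)
```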
#2340 will keep a running avg of the loss updated every step (completely independent) and only sync for the logs and final return. So it's a better mix of the current behaviour and the proposed verification above. In any case, I'm not seeing any issues running with this dataset https://www.kaggle.com/datasets/ctrnngtrung/miniimagenet ... 2 GPU and same hparams (though with --model-ema-decay 0.999 --model-ema-warmup), the train loss in the log was bouncing around as you say but the eval tracked fine. With #2340 eval is tracking the same and the train loss (logged) is smoother since it includes all batches in the sampling.
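(Again only a sketch under assumed names, not the actual #2340 diff: the idea is a purely local running average updated every step, synced across ranks only when a log line or the final value is produced.)

```python
# Keep a local running loss average per rank; all-reduce only at log time.
import torch
import torch.distributed as dist

class LocalLossMeter:
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, loss: torch.Tensor, n: int = 1):
        # Called every step; no cross-rank communication here.
        self.sum += loss.item() * n
        self.count += n

    def synced_avg(self, device) -> float:
        # Only called at log time / end of epoch: average the running mean
        # across ranks so every process logs the same value.
        avg = torch.tensor([self.sum / max(self.count, 1)], device=device)
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(avg, op=dist.ReduceOp.SUM)
            avg /= dist.get_world_size()
        return avg.item()
```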
@sinahmr thanks for the update. You mentioned training was unstable too... this loss fix in #2340 is purely cosmetic; it's no longer sampling the loss sparsely for logging during distributed training, but in either case the training progression and the evaluation results should be the same. Before I close, do you observe that?
It happened to me once that a run on 2 GPUs became unstable in the middle of training (blue), while the same run on 1 GPU with --grad-accum-steps 2 stayed stable. But to be honest, now that I know that the oscillations I previously observed were just a problem in the graph, I'm thinking maybe I made some other mistake in the blue run. I have done many experiments on a custom ViT, and as far as I remember eval loss and accuracy never looked problematic. Also, the eval loss and accuracy of the run with #2340 look very close to the run on 1 GPU. I think the slight difference is fine; when we use 2 GPUs the order of data changes, is that right?
@sinahmr yes, the order of data won't be preserved as you change the world size. I'll merge #2340 soon with another small change I wanted to make for results output.
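(A quick way to see the data-ordering point, outside the thread: PyTorch's DistributedSampler shards a shuffled index list across ranks, so the per-rank order depends on the world size even with the same seed.)

```python
# Illustrative only: the per-rank sample order depends on world size.
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(16))

for world_size in (1, 2):
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=0, shuffle=True, seed=42
    )
    sampler.set_epoch(0)
    print(f"world_size={world_size}, rank=0:", list(sampler))
# The two index streams differ, so batch composition (and e.g. mixup pairings)
# differs between the 1-GPU and 2-GPU runs even with identical seeds.
```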
Great, thank you!
merged
I'm running the command provided here for training a ViT, once using batch size 288 on two GPUs (like the link), and once using batch size 576 on one GPU. As you can see in the plot below, the training loss for the run with one GPU is much smoother than the one with two GPUs, which oscillates a lot (although with a similar decreasing trend) and sometimes makes training unstable.
Is this behaviour expected? If not, I suspect there is some error in the multi-GPU code path, but I couldn't find it. Can you please have a look?
Thanks!
To Reproduce
./distributed_train.sh {1 or 2} --data-dir /path/to/100class/data --num-classes 100 --model vit_small_patch16_224 --sched cosine --epochs 300 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .2 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --lr 5e-4 --weight-decay .05 --drop 0.1 --drop-path .1 -b {576 or 288}
Expected behavior
Fairly similar train loss for the experiments.
Screenshots
Environment
Additional context
I also tried an experiment using batch size 288 on one GPU but with --grad-accum-steps 2 (to have a global batch size of 576 like the other experiments) and saw no problem (no extreme oscillation) in the loss plot; it was alright like the other single-GPU run.
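(Not part of the original report, just a minimal generic sketch of what gradient accumulation does, to show why 288 with 2 accumulation steps approximates a single 576 batch; the function and names below are made up for illustration, not timm's internals.)

```python
# Accumulate gradients over `accum_steps` micro-batches, then step once,
# so micro-batches of 288 x 2 approximate a single optimizer step on 576.
import torch

def train_step(model, optimizer, loss_fn, micro_batches, accum_steps=2):
    optimizer.zero_grad()
    for inputs, targets in micro_batches:                     # e.g. two 288-sample chunks
        loss = loss_fn(model(inputs), targets) / accum_steps  # keep gradient scale comparable
        loss.backward()                                       # grads accumulate in .grad
    optimizer.step()                                          # one update per global batch
```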