Hey, I ran one of your example training runs (EfficientNet-B2 with RandAugment - 80.4 top-1, 95.1 top-5), and I parsed the .csv into TensorBoard and noticed something a bit weird: the training loss was a lot noisier than the validation loss.
Why are you reducing the loss before logging it? Thanks!
@yoni-f I want to log the loss across all processes, not just the one doing the logging. I'd expect the training loss to be noisy with heavier augmentation. But also, if that particular training run was with weight EMA enabled, the val loss that ends up in the log file is the eval loss of the EMA weights, so it would be much smoother compared to train.
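For context, a minimal sketch of what that cross-process reduction can look like with `torch.distributed`; the `reduce_tensor` helper and the loop variables below are illustrative, not necessarily timm's exact code:

```python
import torch
import torch.distributed as dist

def reduce_tensor(tensor: torch.Tensor, world_size: int) -> torch.Tensor:
    # Average a value across all distributed processes so every rank
    # (including the one that writes the log) sees the same global mean.
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    rt /= world_size
    return rt

# Inside a DDP training loop (typically only rank 0 writes the log file):
# loss = criterion(output, target)
# if args.distributed:
#     reduced_loss = reduce_tensor(loss.detach(), args.world_size)
# else:
#     reduced_loss = loss.detach()
# losses_meter.update(reduced_loss.item(), input.size(0))
```

Without the all-reduce, the logged loss would only reflect the single batch seen by the logging rank, which is noisier and can drift from the true global average.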
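And a minimal sketch of the weight-EMA idea, assuming a fixed decay (the 0.9998 default here is an assumption); timm's actual implementation (`ModelEmaV2`) differs in detail:

```python
import copy
import torch

class ModelEma:
    # Keep an exponential moving average of the model weights. Evaluating
    # the EMA copy gives a much smoother validation curve than evaluating
    # the raw training weights.
    def __init__(self, model: torch.nn.Module, decay: float = 0.9998):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for ema_v, model_v in zip(self.ema.state_dict().values(),
                                  model.state_dict().values()):
            if ema_v.dtype.is_floating_point:
                # ema = decay * ema + (1 - decay) * current
                ema_v.mul_(self.decay).add_(model_v, alpha=1.0 - self.decay)
            else:
                # Non-float buffers (e.g. counters) are copied directly.
                ema_v.copy_(model_v)
```

Call `ema.update(model)` after each optimizer step, then run validation on `ema.ema` to get the smoothed eval loss that lands in the log.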