Replies: 2 comments 4 replies
-
@seyeeet I'm working on improving that with new/future models, and I have a few gists under my name with some recent hparams for NFNets, MLP vision models, etc., but there won't ever be full coverage. For the transformer models, you can find suitable hparams in related codebases that use some of timm's features: the official code for DeiT/CaiT/LeViT/ResMLP and Swin uses timm components, augmentations, etc. for the train pipeline, so the hparams posted in their code and papers can be used here as well (with the exception of the distillation and repeat aug in DeiT). Also, the hparams in the recent 'How to train your ViT' paper would be easy to apply here; just use 'adamw' when the paper mentions adam.
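A minimal sketch of that last point, assuming a recent timm version; the model name, lr, and weight decay below are illustrative placeholders, not hparams from any paper:

```python
import timm
from timm.optim import create_optimizer_v2

# Build a ViT from timm; the model name is just an example.
model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=1000)

# Where a paper's recipe says "adam", use opt='adamw' in timm so weight
# decay is decoupled. lr / weight_decay values here are placeholders.
optimizer = create_optimizer_v2(
    model,
    opt='adamw',
    lr=1e-3,
    weight_decay=0.05,
)
```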
-
@AlexeyAB The 21k runs were handled by the Google researchers on their research TPU cloud. I was provisioned with some V100s to run smaller trials, and I focused on exploring heavy augreg schemes to compare from-scratch vs transfer on the smaller datasets. As you note, all of the Google hparams (and their train code) are step based, not epoch based, so you have to do the conversion. The warmup is sometimes less than a full epoch, so yes, 1 and 10 is the closest for the 1k example. I think your hparams look in the right ballpark. One thing you'll run into (and I did too) when comparing timm results to theirs: my RandAugment is a bit different, quite importantly so in how some of the aug scales are handled with respect to the magnitude values. I think my
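A quick sketch of the step-to-epoch conversion; the dataset size, batch size, and step counts below are assumptions for illustration, not values from any specific recipe:

```python
# Convert a step-based schedule (as in the Google ViT code) to the
# epoch-based values timm's train script expects.
train_images = 1_281_167      # ImageNet-1k train set size
global_batch_size = 4096      # assumed effective batch size; use your own
total_steps = 10_000          # example value from a step-based recipe
warmup_steps = 500            # example value

steps_per_epoch = train_images / global_batch_size        # ~313 steps
epochs = total_steps / steps_per_epoch                     # ~32 epochs
# Warmup can be shorter than one epoch; round up to at least 1 epoch.
warmup_epochs = max(1, round(warmup_steps / steps_per_epoch))

print(f"epochs={epochs:.1f}, warmup_epochs={warmup_epochs}")
```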
-
I was wondering if there is anywhere we can find the hyperparameters for each model (especially the transformers) that can reproduce results close to the papers? The models are very complicated and each hyperparameter can take a lot of time to figure out, so it would be great if the hyperparameters that were actually used could be shared.