RetNet training config #64

hanlinxuy · 2023-09-03T09:13:14Z

Hello,

I have followed the training configuration introduced in #52 with the retnet_medium architecture, and I have a few questions that I would appreciate help with.

The first is about initialization. According to the RetNet paper (https://arxiv.org/abs/2307.08621), parameters are initialized following DeepNet. So I am wondering why it is set to False in RetNetConfig, and where I should set it to True (https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/config.py#L239).

If I simply add --deepnorm on the command line, it is activated together with subln (https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/config.py#L240), and I then see the output of each layer growing larger and larger as the layer index increases.

The second is about the vocabulary. I am new to fairseq, so I am not sure how to handle a large dataset with fairseq-preprocess. I am trying to use MINIPILE, but the resulting dict.txt has 32309612 lines. That seems too large, so I am wondering whether there is an official recommendation for this part.

The third is about --share-decoder-input-output-embed. Is it recommended? I apologize if I missed this in the paper.

Thank you guys in advance:)

simran-arora · 2023-10-10T23:35:08Z

Hi, is there any resolution to this question about the initialization and the recommended training configs to reproduce the paper results? I am also seeing some instability with the default configs.
Thanks so much!

sunyt32 · 2023-10-11T02:22:22Z

--share-decoder-input-output-embed saves model parameters, especially when the model is small. The performance is almost the same. We activated it in our experiments.
Don't activate --subln or --deepnorm. The current initialization is good enough.
The training instability comes from the Linear bias and the eps in LayerNorm. In our experiments, we set bias=False and eps=1e-5. In addition, RMSNorm helps stability, so we made that modification.
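
The settings mentioned above can be sketched in plain PyTorch as follows. This is only a minimal illustration, assuming a hand-written RMSNorm; the class name `SimpleRMSNorm` and the tensor sizes are hypothetical and are not taken from the torchscale code.

```python
import torch
import torch.nn as nn


class SimpleRMSNorm(nn.Module):
    """Illustrative RMSNorm (hypothetical name, not the torchscale class)."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Learnable per-channel scale; RMSNorm has no bias term.
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the root mean square over the last dimension.
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


torch.manual_seed(0)
dim = 16                                 # arbitrary size for the demo
norm = SimpleRMSNorm(dim, eps=1e-5)      # eps=1e-5 as suggested in the comment
proj = nn.Linear(dim, dim, bias=False)   # bias-free projection

x = torch.randn(2, 4, dim)
normed = norm(x)
y = proj(normed)
```

After normalization, the mean square of each vector in `normed` is close to 1, and `proj` carries no bias parameter.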

donglixp · 2023-10-11T02:56:05Z

@simran-arora @hanlinxuy

The LN eps was changed from 1e-6 to 1e-5 in commit d1fefe9.
RMSNorm is also used as of commit 5c89ffb, so the effect of the LN eps is eliminated.
For the RetNet implementation, the initialization principle proposed in DeepNet has already been integrated, so the arguments --subln and --deepnorm should not be added.
Removing bias also improves training stability.

The latest released code takes the above points into account.

simran-arora · 2023-10-11T07:20:22Z

Thanks so much! I had used layer norm and did not set bias=False. I will try switching these.

Adding the explicit deepnorm initialization also improved stability for my downstream runs, but I will try using the recommended techniques instead.

sunyt32 · 2023-10-12T14:08:58Z

@simran-arora It's better to set bias=False both in layer norm and nn.Linear.

Besides, would you mind sharing the training details with us? e.g. corpus, model size, and hyper-parameters. We'd like to see the instability setting.

hanlinxuy · 2023-10-27T07:13:19Z

Thank you very much! I will try again later with this new information!

KeAWang · 2024-10-09T17:46:34Z

@sunyt32 could you elaborate on how RetNet derived its initialization from DeepNet as mentioned in this issue: #68?

donglixp self-assigned this Sep 12, 2023
