RetNet training config #64
Comments
Hi, is there any resolution to this question about the initialization and the recommended training configs needed to reproduce the paper's results? I am also seeing some instability with the default configs.
The latest released code addresses the points above.
Thanks so much! I had used layer norm without setting bias=False. I will try switching these. Adding the explicit deepnorm initialization also improved stability for my downstream runs, but I will try the recommended techniques instead.
@simran-arora It's better to set bias=False in both layer norm and nn.Linear. Also, would you mind sharing your training details with us, e.g. corpus, model size, and hyperparameters? We'd like to look at the setting that shows the instability.
Thank you very much! I will try again later with this new information!
Hello,
I have followed the training configuration introduced in #52 with the retnet_medium architecture. I have a few questions and would appreciate any answers.
The first is about initialization. The RetNet paper (https://arxiv.org/abs/2307.08621) says that the parameters are initialized following DeepNet. So I am wondering why the deepnorm option is set to False in RetNetConfig, and where I should set it to True (https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/config.py#L239).
If I simply add "--deepnorm" on the command line, it is activated together with subln (https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/config.py#L240), and I then find that the output of each layer grows larger and larger as the layer index increases.
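For context on what DeepNet-style initialization involves: the DeepNet paper scales the residual connection by a constant alpha and down-scales the initialization of some projection weights by a constant beta, both derived from the depth. The sketch below computes the two constants for the decoder-only case, assuming the formulas alpha = (2N)^(1/4) and beta = (8N)^(-1/4) from that paper. The function name is hypothetical and this is not taken from the torchscale code.

```python
from typing import Tuple


def deepnorm_constants(num_layers: int) -> Tuple[float, float]:
    """Return (alpha, beta) for a decoder-only stack of `num_layers` layers.

    Assumption: decoder-only formulas from the DeepNet paper,
    alpha = (2N)^(1/4) for the residual scale and
    beta = (8N)^(-1/4) for the initialization gain.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    alpha = (2.0 * num_layers) ** 0.25
    beta = (8.0 * num_layers) ** -0.25
    return alpha, beta


if __name__ == "__main__":
    # Print the constants for a few hypothetical depths.
    for depth in (6, 16, 24):
        alpha, beta = deepnorm_constants(depth)
        print(f"layers={depth:3d}  alpha={alpha:.4f}  beta={beta:.4f}")
```

Since alpha grows with depth while beta shrinks, combining this scaling with a second normalization scheme could plausibly change activation magnitudes across layers, which may be worth checking when both options appear to be active.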
The second is about the vocabulary. I am new to fairseq, so I am not sure how to handle a large dataset with fairseq_preprocess. I am trying to use MINIPILE, but the resulting dict.txt has 32309612 lines. That seems too large, so I am wondering whether there is an official recommendation for this step.
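A dictionary with tens of millions of entries usually suggests that the text was split on whitespace rather than into subwords, so applying a subword tokenizer (for example BPE or SentencePiece) before preprocessing is a common way to keep the vocabulary small. Independently of that, a vocabulary can be capped by frequency. The sketch below illustrates the idea on an in-memory corpus. The function name and parameters are hypothetical, and it does not reproduce what fairseq does internally.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def build_capped_vocab(
    lines: Iterable[str], max_size: int, min_count: int = 1
) -> List[Tuple[str, int]]:
    """Count whitespace-separated tokens and keep the most frequent ones.

    Tokens seen fewer than `min_count` times are dropped, then the list
    is truncated to at most `max_size` entries, most frequent first.
    """
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    kept = [(tok, c) for tok, c in counts.most_common() if c >= min_count]
    return kept[:max_size]


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat", "the bird flew"]
    # Write the result in a "token count" layout, one entry per line.
    for token, count in build_capped_vocab(corpus, max_size=3, min_count=2):
        print(token, count)
```

Tokens that fall outside the capped vocabulary would then map to an unknown-token symbol, which is a lossy trade-off. That is one reason subword tokenization is often preferred over frequency capping alone.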
The third is about --share-decoder-input-output-embed. Is it recommended? I am sorry if I missed this in the paper.
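As background for this question: sharing the input embedding with the output projection (weight tying) means a single vocab-by-dimension matrix serves both roles. A rough way to see the effect on model size is sketched below, assuming an output projection without bias. The function name and the example sizes are hypothetical and are not the retnet_medium settings.

```python
def embedding_params(vocab_size: int, embed_dim: int, tied: bool) -> int:
    """Parameter count of the input embedding plus the output projection.

    Assumption: the output projection is a bias-free vocab_size x embed_dim
    matrix, so tying it to the input embedding removes one full matrix.
    """
    input_embed = vocab_size * embed_dim
    output_proj = 0 if tied else vocab_size * embed_dim
    return input_embed + output_proj


if __name__ == "__main__":
    vocab_size, embed_dim = 50_000, 1024  # hypothetical sizes
    untied = embedding_params(vocab_size, embed_dim, tied=False)
    tied = embedding_params(vocab_size, embed_dim, tied=True)
    print(f"untied: {untied:,}  tied: {tied:,}  saved: {untied - tied:,}")
```

The saving scales linearly with the vocabulary size, so it matters most when the dictionary is large.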
Thank you guys in advance:)