
Model cannot converge #1

Open
theoqian opened this issue Oct 2, 2021 · 1 comment


theoqian commented Oct 2, 2021

I am trying to train a mask_align model using the repo's default config (only the data paths are changed) and the DE-EN training data from https://github.com/lilt/alignment-scripts. At some training steps the losses are NaN, and by the end of training the loss has increased from about 7 to about 70.

epoch = 5, step = 49980, loss: nan, f_loss: nan, b_loss: nan, agree_loss: nan, entropy_loss: nan (0.246 sec)
epoch = 5, step = 49990, loss: 64.210, f_loss: 67.750, b_loss: 60.188, agree_loss: 0.000, entropy_loss: 0.241 (0.507 sec)
epoch = 5, step = 50000, loss: 69.115, f_loss: 72.500, b_loss: 65.312, agree_loss: 0.000, entropy_loss: 0.240 (0.652 sec)

@carboncoo
Collaborator

Hi, this is most likely caused by sentence pairs of length 1 in the training data. Our masking strategy cannot handle such pairs, so we filter them out. You can run thualign/scripts/remove_single.py to filter the corpus, then try training again.
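
For anyone who wants to see what this kind of filtering looks like, here is a minimal sketch. It is not the actual contents of remove_single.py. It assumes "length 1" means a side with a single whitespace-separated token, and it drops a pair if either side is that short. The names `filter_single`, `filter_files`, and `min_len` are hypothetical and used only for illustration.

```python
from typing import Iterable, List, Tuple


def filter_single(
    src_lines: Iterable[str], tgt_lines: Iterable[str], min_len: int = 2
) -> List[Tuple[str, str]]:
    """Keep only pairs where both sides have at least `min_len` tokens.

    Assumption: tokens are whitespace-separated and a pair is dropped
    if EITHER side is too short (empty lines are dropped as well).
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if len(src.split()) >= min_len and len(tgt.split()) >= min_len:
            kept.append((src, tgt))
    return kept


def filter_files(src_in: str, tgt_in: str, src_out: str, tgt_out: str) -> int:
    """Filter two parallel text files line by line; return the number of pairs kept."""
    with open(src_in, encoding="utf-8") as fs, open(tgt_in, encoding="utf-8") as ft:
        pairs = filter_single(fs, ft)
    with open(src_out, "w", encoding="utf-8") as fs, \
            open(tgt_out, "w", encoding="utf-8") as ft:
        for src, tgt in pairs:
            fs.write(src + "\n")
            ft.write(tgt + "\n")
    return len(pairs)
```

Filtering both files together matters: removing a line from only one side would shift the corpus out of alignment. If NaN losses persist after filtering, the cause is probably something else.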
