In #771 I tested the effects of reducing the distillation data, to better understand that expensive part of our pipeline. However, that experiment used a tiny student model, so we should repeat it with the base student model to see if there is a difference. I also want to test it on a morphologically more complex language such as Lithuanian.
These results seem very interesting to me. I believe they have something to do with the fact that NLLB and Paracrawl are full of redundant and repetitive data. If there is interest in finding a better way to sample, I think n-gram saturation could be worth exploring: rank sentences lower when a significant portion of their 2-grams or 3-grams is already present in the corpus.
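To make the n-gram saturation idea concrete, here is a minimal sketch. It is only one possible reading of the suggestion, not something from the pipeline. It assumes whitespace tokenization, a single greedy pass in which "already present in the corpus" means "seen in an earlier sentence", and a hypothetical function name `rank_by_ngram_saturation`.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[NGram]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rank_by_ngram_saturation(
    sentences: Iterable[str], n: int = 2
) -> List[Tuple[str, float]]:
    """Rank sentences so that those contributing new n-grams come first.

    Saturation is the fraction of a sentence's n-grams that were already
    seen in earlier sentences (assumption: one greedy pass in input order).
    """
    seen: Set[NGram] = set()
    scored = []
    for index, sentence in enumerate(sentences):
        # Assumption: plain whitespace tokenization, no normalization.
        grams = ngrams(sentence.split(), n)
        if grams:
            saturation = sum(g in seen for g in grams) / len(grams)
        else:
            # Sentences shorter than n tokens add no n-grams; rank them last.
            saturation = 1.0
        scored.append((saturation, index, sentence))
        seen.update(grams)
    # Lowest saturation first; ties keep the original corpus order.
    scored.sort(key=lambda item: (item[0], item[1]))
    return [(sentence, saturation) for saturation, _, sentence in scored]


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat sat", "a dog ran fast"]
    for sentence, saturation in rank_by_ngram_saturation(corpus, n=2):
        print(f"{saturation:.2f}\t{sentence}")
```

Because the score depends on the order in which sentences are visited, shuffling the corpus first (or sorting by another quality score) would change the ranking. A real implementation would also need a memory-bounded structure for the seen n-grams on corpora of this size.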
I'm not going to take on this experiment for the next training run as it doesn't feel as important with the speedups from CTranslate2 and the removal of the teacher ensemble.