Reduce monolingual data for en-lt to investigate distillation performance #915

Open
gregtatum opened this issue Oct 31, 2024 · 2 comments
Labels
experiment A training experiment with hypothesis and results

Comments

@gregtatum
Member

In #771 I tested the effects of reducing the distillation data to better understand that expensive part of our pipeline. However, that experiment used a tiny student model, so we should repeat it with the base student model to see if there is a difference. I also want to test it on a morphologically more complex language like Lithuanian.

@gregtatum gregtatum added the experiment A training experiment with hypothesis and results label Oct 31, 2024
@gregtatum gregtatum self-assigned this Oct 31, 2024
@gregtatum
Member Author

gregtatum commented Nov 6, 2024

In #771 @ZJaume commented:

These results seem very interesting to me. I believe the fact that NLLB and Paracrawl are full of redundant and repetitive data has something to do with this. If there is interest in finding a better way to sample, I think n-gram saturation (ranking lower the sentences that have a significant portion of their 2-grams or 3-grams already present in the corpus) could be worth exploring.

After some light searching I found this paper, which has an approach we could use if we wanted to go this route. It seems like a reasonable balance of cost vs. quality: STACC, OOV Density and N-gram Saturation: Vicomtech’s Participation in the WMT 2018 Shared Task on Parallel Corpus Filtering
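To make the n-gram saturation idea concrete, here is a rough sketch of how scoring could work. It is only an illustration, not the method from the paper and not anything in our pipeline: it assumes lowercased whitespace tokenization and a single greedy pass over the corpus, and the names `ngrams`, `saturation_scores`, and `rank_by_novelty` are hypothetical.

```python
from __future__ import annotations


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-grams of a token list (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def saturation_scores(sentences: list[str], n: int = 2) -> list[float]:
    """Score each sentence by the fraction of its n-grams already seen.

    Assumption: sentences are visited in corpus order, and every sentence
    adds its n-grams to the seen set, so scores depend on that order.
    A score of 1.0 means the sentence contributes no new n-grams.
    """
    seen: set[tuple[str, ...]] = set()
    scores: list[float] = []
    for sentence in sentences:
        grams = ngrams(sentence.lower().split(), n)
        if grams:
            scores.append(sum(g in seen for g in grams) / len(grams))
        else:
            # Assumption: sentences shorter than n tokens are not penalized.
            scores.append(0.0)
        seen.update(grams)
    return scores


def rank_by_novelty(sentences: list[str], n: int = 2) -> list[str]:
    """Order sentences so the most saturated (redundant) ones come last."""
    scores = saturation_scores(sentences, n)
    order = sorted(range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in order]
```

Sampling the distillation data from the front of `rank_by_novelty(...)` would then prefer sentences that add new 2-grams or 3-grams. A real implementation would probably need proper tokenization and a memory-bounded structure instead of an in-memory set for corpora the size of NLLB or Paracrawl.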

@gregtatum
Member Author

I'm not going to take on this experiment for the next training run, as it doesn't feel as important given the speedups from CTranslate2 and the removal of the teacher ensemble.

@gregtatum gregtatum removed their assignment Dec 20, 2024