[Question] Can we distill for multiple languages for distil-small-whisper #107
Comments
I too have the same question.
Hey @Killshot667 - that's a great question, and super sorry for the late reply here! I'll defer to @eustlb, who has been running some preliminary experiments on distilling Whisper jointly for French and Spanish. You can read about the initial results and how to reproduce them in the README here: https://github.com/huggingface/distil-whisper/tree/main/training#3-language-mixing
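This is not the exact recipe from the linked README, just a minimal sketch of the language-mixing idea, assuming two per-language training splits that have already been pseudo-labelled by the teacher with the matching language. The dataset names, split, and 50/50 mixing probabilities below are placeholders:

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical per-language training splits, each assumed to be already
# pseudo-labelled by the Whisper teacher with the correct language.
fr = load_dataset("mozilla-foundation/common_voice_16_1", "fr", split="train", streaming=True)
es = load_dataset("mozilla-foundation/common_voice_16_1", "es", split="train", streaming=True)

# Interleave the two languages into a single training stream; the sampling
# probabilities control the language balance the student sees during distillation.
mixed = interleave_datasets([fr, es], probabilities=[0.5, 0.5], seed=42)

for example in mixed.take(4):
    print(example["locale"], example["sentence"])
```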
Hey @Killshot667! Thanks for raising this interesting point. Indeed, distillation has, for the moment, been targeted at single languages. The approach was initially to shrink the model as much as possible while maximizing its performance by training a smaller decoder on a targeted language. The idea is to trade the multilingual capacities of the 32 decoder layers for the size and speed improvements brought by a smaller decoder (and therefore one with smaller learning capacity). In this context, two decoder layers appeared to be Pareto optimal. Were we to train on a multilingual dataset, more decoder layers might be needed to increase learning capacity. Such an adaptation of the student model's decoder depth is easily done by changing the number of decoder layers when creating the student model.

Secondly, note that nothing restrains a distilled model from having multilingual transcription capacities. First, the encoder is identical to Whisper's, so its robustness in building a representation of speech across languages is unchanged. Second, when initializing the student model, we keep Whisper's vocabulary and start from Whisper's input embeddings, which come with the inherent multilingual tokens. To this extent, the only thing restraining distil-large-v3 from being multilingual is the dataset it has been distilled on. You could perfectly well train, for example, a 4-decoder-layer distilled model on European languages (easily done by pseudo-labeling each set with the correct language).
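To make the decoder-depth point concrete, here is a minimal sketch of initializing such a student directly with transformers (it is not the repository's own student-creation script, whose exact arguments I won't reproduce here). It keeps the full encoder and the multilingual token embeddings, and copies maximally spaced teacher decoder layers into the student, as described for Distil-Whisper initialization. The choice of 4 decoder layers and the save path are placeholders:

```python
import numpy as np
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Teacher: the full multilingual Whisper model.
teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Student config: identical encoder, but only 4 decoder layers (a placeholder
# choice for a multilingual student; distil-large-v3 itself uses 2).
student_config = WhisperConfig.from_pretrained("openai/whisper-large-v3")
student_config.decoder_layers = 4
student = WhisperForConditionalGeneration(student_config)

# Keep the encoder and the (multilingual) token/position embeddings unchanged.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
student.model.decoder.embed_tokens.load_state_dict(teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(teacher.model.decoder.layer_norm.state_dict())

# Initialise the student decoder from maximally spaced teacher decoder layers.
teacher_depth = teacher.config.decoder_layers  # 32 for large-v3
keep = np.linspace(0, teacher_depth - 1, student_config.decoder_layers, dtype=int)
for student_idx, teacher_idx in enumerate(keep):
    student.model.decoder.layers[student_idx].load_state_dict(
        teacher.model.decoder.layers[teacher_idx].state_dict()
    )

student.save_pretrained("./distil-whisper-multilingual-init")  # placeholder path
```

The rest of the distillation recipe (KL divergence against the teacher plus cross-entropy on the pseudo-labels) stays the same; only the training data mix and the student depth change.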
I have seen several distillations of distil-whisper for different single languages (like en, de, etc.), but I have yet to come across a distil-whisper that has been trained to be multilingual. For my use case, I need to distill it on multiple languages, but I couldn't find any results related to this in the paper. I wanted to know if such an experiment has been conducted before, at least for two languages, and whether any results are available for such a training. Does it give good results for both languages, or does it fail to learn in such a case (maybe because of having only two decoder layers)? If it fails, could there be some other possible reason than the model being too small to accommodate multiple languages?