How to set the target language for examples in README? #130
The code examples in the README do not make it obvious how to set the language of the audio to transcribe. The default settings produce garbled English text if the audio language is different.
Comments
It seems that this model only outputs English subtitles.
@CheshireCC If that is the case, would it be a distilled version of Whisper? "Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web."
Maybe a distilled version requires re-training the model, just like fine-tuning one.
Indeed - as @CheshireCC has mentioned, you can train your own multilingual distil-whisper checkpoint according to the training readme. This has been done successfully in a number of languages, such as French and German. Also cc @eustlb, who has done some extensive experimentation on French distillation.
Hey @clstaudt @CheshireCC, indeed distil-large-v3 has been trained to do English-only transcription. More details about the motivations here.
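For reference, a minimal sketch of how the target language is normally pinned with the transformers ASR pipeline. The model name and audio path below are placeholders; the language argument only has an effect on multilingual checkpoints, which distil-large-v3 is not:

```python
import torch
from transformers import pipeline

# Any multilingual Whisper-style checkpoint works here; distil-large-v3
# itself is English-only, so the language argument will not help with it.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # placeholder multilingual checkpoint
    torch_dtype=torch.float16,
    device="cuda:0",
)

# The target language and task are forwarded to model.generate();
# "audio.mp3" is a placeholder path.
result = pipe(
    "audio.mp3",
    generate_kwargs={"language": "french", "task": "transcribe"},
)
print(result["text"])
```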
Thanks for clarifying @eustlb. I'm about to give a presentation praising the potential of distillation, with distil-whisper as the prime example. While the speedup is impressive, I think it's important to add that it covers just one language while the teacher model was multilingual. What do you think the speedup and size reduction would be for a multilingual distil-whisper?
Thanks for promoting distil-whisper, @clstaudt! Actually, you can find this info here in the README and here on the model card, but thanks for mentioning it; it may not be clear enough. Concerning a multilingual distilled Whisper, it is a very difficult question to answer without proper experimentation, and I prefer not to give false insights. There are a lot of factors to take into account (e.g., number of languages, dataset sizes, etc.). Yet, I would say that were you to have large enough datasets for a few languages and manage to get good results with a 4-layer decoder, the size reduction would be 48% (an exact value, compared to 51% for a 2-layer decoder) and the speed-up should be around 5.5x (a rough estimate, to be taken with a big pinch of salt, compared to 6.3x for a 2-layer decoder).
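Those size-reduction figures can be sanity-checked with a back-of-the-envelope parameter count from the published whisper-large-v3 dimensions (d_model 1280, 32 encoder and 32 decoder layers). The sketch below ignores the conv stem, layer norms, biases, and positional embeddings, so the totals are approximate:

```python
# Rough parameter count for a full Whisper encoder plus a truncated decoder.
D = 1280        # model width (whisper-large-v3)
VOCAB = 51_866  # approximate vocabulary size

enc_layer = 12 * D * D  # self-attention (4d^2) + feed-forward (8d^2)
dec_layer = 16 * D * D  # self-attn (4d^2) + cross-attn (4d^2) + FFN (8d^2)
embed = VOCAB * D       # token embedding, tied with the LM head

def params(n_dec_layers: int) -> int:
    """Approximate total: 32 encoder layers + truncated decoder + embedding."""
    return 32 * enc_layer + n_dec_layers * dec_layer + embed

teacher = params(32)  # ~1.53B, close to whisper-large-v3's real size
for n in (2, 4):
    reduction = 1 - params(n) / teacher
    print(f"{n}-layer decoder: ~{reduction:.0%} smaller")  # ~51% and ~48%
```

The decoder dominates the count (16d² per layer versus 12d² for an encoder layer), which is why truncating it from 32 layers to 2 or 4 cuts the model roughly in half.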