Trouble importing finetuned HuggingFace model #14

Open
AlienKevin opened this issue Sep 21, 2023 · 13 comments

@AlienKevin

Hi, thanks for developing this awesome Whisper implementation! I'm looking to deploy a small Whisper model I fine-tuned using HuggingFace transformers. The model is supposed to generate Cantonese romanizations, and the language was set to English during training because the two share the same ASCII letters. The primary motivation is to take advantage of burn's wgpu backend for cross-platform deployment to both iOS and Android users. Prior to trying your library, I managed to get my fine-tuned model running on iOS using whisper.cpp, but I'd prefer a Rust backend for portability.

For my experiment with importing the model into whisper-burn, I first converted the HuggingFace model to Whisper's pt format using a script (see step 1 of this issue). I then followed the steps in the README and successfully converted the model to the burn format. However, when I run inference with my model, it produces garbage transcripts on the provided audio16k.wav as well as on my own test audio. For example, audio16k.wav produced a transcript of "onbed", even though the model should normally recognize English input in addition to Cantonese.

I'm wondering if it's possible for you to support importing HuggingFace models directly into whisper-burn? That would make it easier to rule out bugs introduced by intermediate steps in the conversion pipeline. Maybe the convert-h5-to-ggml script from whisper.cpp can come in handy? Thanks.

@Gadersd
Owner

Gadersd commented Sep 21, 2023 via email

@Gadersd
Owner

Gadersd commented Sep 22, 2023

The conversion script you mentioned seems to work. I ran some tests and the issue is that multilingual models do not work in general while the English-only models all seem to work. I have no idea what the reason is.

Update:
I realized that the issue is the tokenizer used. The multilingual models use a different tokenizer than the English-only models. I should have it working tomorrow.
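As a rough illustration of the tokenizer mismatch described above: in OpenAI's reference Whisper implementation, English-only checkpoints have a vocabulary of 51864 entries, while multilingual checkpoints have 51865 or more, and the vocabulary size is used to decide which tokenizer to load. The sketch below mirrors that check in Rust. The names `TokenizerKind` and `tokenizer_kind` are hypothetical and are not taken from whisper-burn.

```rust
/// Vocabulary size at which Whisper checkpoints are treated as multilingual.
/// English-only checkpoints use 51864 entries; multilingual ones use 51865 or more.
const MULTILINGUAL_VOCAB_THRESHOLD: usize = 51865;

/// Hypothetical enum for illustration; not part of whisper-burn's API.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum TokenizerKind {
    EnglishOnly,
    Multilingual,
}

/// Pick the tokenizer family from the model's vocabulary size.
/// Loading the wrong family maps token ids to the wrong text,
/// which shows up as garbage transcripts.
fn tokenizer_kind(n_vocab: usize) -> TokenizerKind {
    if n_vocab >= MULTILINGUAL_VOCAB_THRESHOLD {
        TokenizerKind::Multilingual
    } else {
        TokenizerKind::EnglishOnly
    }
}
```

A model fine-tuned from a multilingual checkpoint keeps the multilingual vocabulary even if it was trained with the language set to English, so it would still need the multilingual tokenizer.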

@Gadersd
Owner

Gadersd commented Sep 22, 2023

It should work now. I tested your model on the English sample audio and it now outputs what looks like pinyin without accents. Perhaps you fine-tuned it so strongly that it can no longer perform English transcription?

@AlienKevin
Author

Oh great, I will give it a try and see.

@AlienKevin
Author

I can verify that the decoder is working! Thanks a lot for your work! 🙏
There are two remaining issues:

  1. Is there a way to adjust the decoding strategy, for example choosing between beam search and greedy decoding, or setting the beam size? For one test audio, I saw some repetition at the end that might have to do with the decoding strategy:
nei tai keoi gin dou leng zai zau wan sai long $$$ audio ends here, repetition starts here $$$ nei tai keoi gin dou leng zai zau wan sai long nei
  2. Inference is quite slow and uses only a single CPU core on macOS, even though I enabled the wgpu-backend feature. I wonder if there's any setup I missed with regard to Metal support? This is the command I ran:
cargo run --release --features wgpu-backend --bin transcribe small test_yue2.wav en transcription.txt

PS: I'm running macOS Ventura on an M1 Max.

@Gadersd
Owner

Gadersd commented Sep 23, 2023 via email

@Gadersd
Owner

Gadersd commented Sep 23, 2023

I reactivated repeat detection. Let me know if you still encounter repetitions. I'll implement caching later which should improve the performance.

@AlienKevin
Author

Thanks for the explanation. I tried again using the latest commit and the repetition issue unfortunately persisted. I think it's because the current repetition filter does not apply to whole sentences like "nei tai keoi gin dou leng zai zau wan sai long". A maximum generation length argument, however, might prevent the model from repeating at the end and would also save some decoding time.
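To make the two ideas in this comment concrete, here is a minimal sketch of a post-processing filter that drops a whole repeated phrase at the end of a token sequence, plus a simple length cap. It is an illustration only and not whisper-burn's repeat detection; the function names and the `min_period` parameter are assumptions made for the example.

```rust
/// Hypothetical helper: if the sequence ends with a block of at least
/// `min_period` tokens that repeats the tokens just before it (possibly
/// followed by a partial extra copy), return the sequence with the
/// repeated tail removed. Otherwise return the input unchanged.
fn truncate_trailing_repeat<T: PartialEq>(tokens: &[T], min_period: usize) -> &[T] {
    let len = tokens.len();
    // Try each candidate phrase length (period), shortest first.
    for period in min_period.max(1)..=len / 2 {
        // Count how many tokens at the end equal the token `period` positions earlier.
        let mut run = 0;
        while run < len - period && tokens[len - 1 - run] == tokens[len - 1 - run - period] {
            run += 1;
        }
        // At least one full repeated copy: keep everything before the repeated tail.
        if run >= period {
            return &tokens[..len - run];
        }
    }
    tokens
}

/// Hypothetical length cap: keep at most `max_len` tokens.
fn cap_length<T>(tokens: &[T], max_len: usize) -> &[T] {
    &tokens[..tokens.len().min(max_len)]
}
```

Applied to the word sequence from the earlier report (the 11-word sentence, then the same sentence again, then a trailing "nei") with `min_period` set to 3, the filter keeps only the first copy of the sentence. A `min_period` above 1 is meant to avoid removing legitimate short repeats such as a doubled word.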

@Gadersd
Owner

Gadersd commented Sep 24, 2023 via email

@AlienKevin
Author

AlienKevin commented Sep 25, 2023

I tested more and discovered a "trick" that reliably causes repetition: lengthen the last word. Here's a link to three audio clips that caused repetition on my model: https://on.soundcloud.com/2Qw3J The number of repetitions seems to depend on the length of the last word and maybe on the overall sentence length.

@Gadersd
Owner

Gadersd commented Oct 9, 2023

I added a beam(ish) search and the transcription quality seems to have significantly improved. If you find the time to test it let me know how it goes.

@AlienKevin
Author

AlienKevin commented Oct 10, 2023

I tried the latest beam search, but unfortunately the repetition issue persisted on the three SoundCloud audio samples.

@cyberluke

@Gadersd I think you would have to implement something like the LocalAgreement-n policy to get rid of the repetitions, as is done here: https://github.com/ufal/whisper_streaming
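For readers unfamiliar with the idea: as used by whisper_streaming, LocalAgreement-n confirms only the longest common prefix of the hypotheses from n consecutive decoding passes, so unstable tails are held back instead of being emitted. The sketch below shows that core step in Rust. It is a simplified illustration under that assumption, the function name `local_agreement` is hypothetical, and how it would be wired into whisper-burn's decoder is not covered here.

```rust
/// Hypothetical helper: return the longest prefix shared by all hypotheses.
/// `hypotheses` holds the token sequences from the last n decoding passes;
/// only the tokens every pass agrees on are treated as confirmed.
fn local_agreement<'a, T: PartialEq>(hypotheses: &'a [Vec<T>]) -> &'a [T] {
    let Some((first, rest)) = hypotheses.split_first() else {
        // No hypotheses yet, so nothing can be confirmed.
        return &[];
    };
    let mut confirmed = first.len();
    for other in rest {
        // Length of the prefix shared between the first hypothesis and this one.
        let common = first
            .iter()
            .zip(other.iter())
            .take_while(|(a, b)| a == b)
            .count();
        confirmed = confirmed.min(common);
    }
    &first[..confirmed]
}
```

With n = 2, for example, a token is confirmed only once two successive passes produce it at the same position, so a tail that appears in one pass and changes in the next is never output.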
