
[whisper] Add OpenAI API compatibility #17921

Open

dalgwen wants to merge 2 commits into main
Conversation

@dalgwen (Contributor) commented Dec 17, 2024

Use an external OpenAI-compatible API as another option besides the locally embedded Whisper.

Whisper is excellent as an embedded service, but it can also be exposed remotely, by services such as OpenAI obviously, but also by some other open-source projects wrapping it (backed by whisper.cpp or faster-whisper), such as faster-whisper-server.

So, this PR adds the possibility of using a remote Whisper by calling an OpenAI-compatible API that exposes it. I briefly thought about writing a new add-on, but deemed it overkill. It is another way of using Whisper, so I hope it's appropriate to do it here?
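For reference, a minimal sketch of what such a call looks like against the standard OpenAI transcription endpoint (/v1/audio/transcriptions). The host/port, the model name, and the API key are placeholders, not this add-on's actual implementation; a local OpenAI-compatible server typically ignores the key:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WhisperApiExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        String boundary = "----whisper-example-" + System.currentTimeMillis();
        String crlf = "\r\n";
        byte[] wav = Files.readAllBytes(Path.of("sample.wav")); // 16 kHz mono PCM16 WAV

        ByteArrayOutputStream body = new ByteArrayOutputStream();
        // "file" part: the audio payload
        body.write(("--" + boundary + crlf
                + "Content-Disposition: form-data; name=\"file\"; filename=\"sample.wav\"" + crlf
                + "Content-Type: audio/wav" + crlf + crlf).getBytes(StandardCharsets.UTF_8));
        body.write(wav);
        // "model" part ("whisper-1" is OpenAI's name; local servers may accept any model id)
        body.write((crlf + "--" + boundary + crlf
                + "Content-Disposition: form-data; name=\"model\"" + crlf + crlf
                + "whisper-1" + crlf).getBytes(StandardCharsets.UTF_8));
        // optional "language" part, matching the new configuration parameter
        body.write(("--" + boundary + crlf
                + "Content-Disposition: form-data; name=\"language\"" + crlf + crlf
                + "en" + crlf).getBytes(StandardCharsets.UTF_8));
        body.write(("--" + boundary + "--" + crlf).getBytes(StandardCharsets.UTF_8));

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8000/v1/audio/transcriptions")) // assumed local server
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .header("Authorization", "Bearer sk-placeholder") // only needed for the real OpenAI API
                .POST(HttpRequest.BodyPublishers.ofByteArray(body.toByteArray()))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON, e.g. {"text":"..."}
    }
}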

Some refactors I hope @GiviMAD can take a look at:

  • whisper-related code (initialization) is delayed to the last moment, to keep API mode and local mode as separate as possible. So, is it OK to initialize WhisperState inside the new recognizeLocal method, effectively doing it in each iteration of the main while loop? Or should I keep it as a global parameter, initialized once, outside the while loop? (A sketch of the two options follows this list.)
  • float32 array conversion is delayed to the last moment (short16 is used for remote Whisper, so I kept it as the main format, as I didn't want to do short16 -> float32 -> short16)
  • inside the newly extracted recognizeLocal method, I use an exception and a null return value as the signal to break the outer loop, instead of the break keyword. The branching conditions and output should be the same nonetheless.
  • I added language as a configuration parameter, so as not to depend on the system locale if the user prefers.
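To make the first question concrete, here is a minimal, self-contained sketch of the two placements. Recognizer and State are hypothetical stand-ins for the real whisper-jni types, for illustration only, not the library's actual API:

public class InitPlacementSketch {
    // Stand-ins for the real whisper-jni types; hypothetical, for illustration only.
    interface State extends AutoCloseable {
        @Override
        void close(); // narrowed: no checked exception
    }

    interface Recognizer {
        State newState();                          // expensive native allocation
        String run(State state, short[] samples);  // one transcription pass
    }

    // Option A (as in this PR): the state lives entirely inside recognizeLocal,
    // so API mode never touches any whisper initialization code, at the cost of
    // one allocation per iteration of the main while loop.
    static String recognizeLocal(Recognizer recognizer, short[] samples) {
        try (State state = recognizer.newState()) {
            return recognizer.run(state, samples);
        }
    }

    // Option B would hoist `State state = recognizer.newState()` above this
    // loop and reuse it, trading the clean API/local separation for a single
    // allocation over the lifetime of the service.
    static void mainLoop(Recognizer recognizer, Iterable<short[]> chunks) {
        for (short[] chunk : chunks) {
            System.out.println(recognizeLocal(recognizer, chunk));
        }
    }
}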

Thanks!

@dalgwen dalgwen requested a review from GiviMAD as a code owner December 17, 2024 21:54
@dalgwen dalgwen added the enhancement (An enhancement or new feature for an existing add-on) label Dec 17, 2024
@GiviMAD (Member) commented Dec 17, 2024

Hello @dalgwen, this is great. I was about to look into this myself because I bought an Nvidia Orin Nano development kit and I want to host Whisper there using an API that mimics OpenAI's, so this also solves my problem.

I'm going to run a quick test tomorrow, but I just reviewed the code and everything looks good to me.

Also, I'm about to upgrade the whisper-jni library to use the latest whisper.cpp version, but I think I will wait to create a PR until this one gets merged.

@GiviMAD (Member) commented Dec 21, 2024

Hello again @dalgwen, sorry for the delay.

I was able to test the changes today and everything works correctly.

I was even able to use the new API mode to run the transcription against a whisper.cpp server instance running on another machine, as I mentioned earlier. The strange thing is that I had to run the server with the convert functionality, so I think it is still expecting raw float PCM data, but with that enabled it works like a charm:

./build/bin/server --convert --host 0.0.0.0 -m models/ggml-large-v3-turbo-q5_0.bin

The large turbo model runs really well on the Nvidia Orin Nano, and I saw the other day that they have lowered the price to $250, so it seems like a great option for powering this kind of service (it's a pity it only has 8 GB of RAM).

About the PR, I think it would be good to note in the readme that the grammar section only applies to local mode (I just realized it's now a subsection of local mode, sorry), and you also need to regenerate the translation files, but apart from that I think it's OK. Thank you so much for it!

@lsiepel (Contributor) left a comment


Some style improvements and the properties file needs to be regenerated

@@ -245,34 +286,20 @@ public Set<AudioFormat> getSupportedFormats() {
@Override
public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set)
throws STTException {

Contributor

Suggested change

Contributor Author

Thank you, done

}

private String recognizeAPI(int audioSamplesOffset, short[] audioStream, String language) throws STTException {

Contributor

Suggested change

Contributor Author

Done

@dalgwen (Contributor Author) commented Dec 24, 2024

> Some style improvements and the properties file needs to be regenerated

Thanks, will do ASAP.

> The strange thing is that I had to run the server with the convert functionality, so I think it is still expecting raw float PCM data

Disclaimer: I'm not very knowledgeable about audio formats. I just took an example somewhere on the internet and thought that the int16 format was the base for the whisper service. So I'm of course OK with using float32 if it's simpler.

So, I flipped the code again to use float32 as the base, and ran a test: float32 also works with faster-whisper-server (and faster-whisper), but only barely. The transcription is bad, really bad. Not totally wrong, but unusable nonetheless. I must have done something wrong (for reference, the kind of sample conversion involved is sketched after the list below). The commit is here; if you see something, please let me know.
Depending on your answer, I see two possibilities:

  • some servers / implementations must (or prefer to) use a specific format to get good results, so we add a parameter to choose this output format? (or reuse the one for the "Create WAV Record" option)
  • I did something wrong, and we can use float32 once it is fixed
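For context, this is the kind of short16 -> float32 conversion under discussion: a minimal sketch, assuming 16-bit signed little-endian PCM normalized to [-1.0, 1.0], not the PR's actual code:

public final class PcmConversion {
    private PcmConversion() {
    }

    /** Convert signed 16-bit PCM samples to float32 in [-1.0, 1.0]. */
    public static float[] shortToFloat(short[] pcm16) {
        float[] out = new float[pcm16.length];
        for (int i = 0; i < pcm16.length; i++) {
            // Dividing by 32768.0f maps Short.MIN_VALUE to exactly -1.0f;
            // positive samples top out just below 1.0f.
            out[i] = pcm16[i] / 32768.0f;
        }
        return out;
    }
}

Note that if a float buffer like this is then wrapped in a WAV header that still declares 16-bit samples, the decoder will misread every frame, which would produce exactly the "not totally wrong, but unusable" transcriptions described above.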

> The large turbo model runs really well on the Nvidia Orin Nano

Very interesting device! Regarding your comment, do you also use whisper.cpp on this device, or another backend?
I haven't run any comparative test, but besides the fact that I had issues with my old CPU under whisper.cpp that I didn't have with faster-whisper, I have read online that faster-whisper is much faster than whisper.cpp. Do you have a reason to use whisper.cpp (quality, etc.)?

@dalgwen (Contributor Author) commented Dec 24, 2024

I ran another test, this time with whisper-server (included in the official whisper release; is this the one you tested?).
Even with fp32 output, I have to use the --convert option.
The results in English were good, but utter nonsense in French (despite using a universal model, of course).

@GiviMAD (Member) left a comment


Hey @dalgwen, I realized what the problem was this morning after digging a little into the whisper.cpp server source code: it seems we were sending a corrupted WAV header, as the sample-size field in the header was incorrect. So the server was not expecting raw audio as I initially assumed. After fixing that, it works correctly without enabling the convert option.
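To illustrate which field is involved: a minimal sketch of the canonical 44-byte RIFF/WAVE header for plain PCM, following the standard layout. This is illustrative, not the add-on's actual code:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public final class WavHeader {
    /**
     * Build a canonical 44-byte RIFF/WAVE header for plain PCM data.
     * If bitsPerSample (and the derived blockAlign/byteRate) disagree with
     * the actual sample payload, decoders misparse every frame -- the kind
     * of corruption described above.
     */
    public static byte[] pcmHeader(int sampleRate, short channels, short bitsPerSample, int dataLength) {
        short blockAlign = (short) (channels * bitsPerSample / 8);
        int byteRate = sampleRate * blockAlign;
        ByteBuffer buf = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        buf.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(36 + dataLength);              // file size minus the first 8 bytes
        buf.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        buf.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(16);                           // fmt chunk size for plain PCM
        buf.putShort((short) 1);                  // audio format 1 = PCM
        buf.putShort(channels);
        buf.putInt(sampleRate);
        buf.putInt(byteRate);
        buf.putShort(blockAlign);
        buf.putShort(bitsPerSample);              // the field that was wrong here
        buf.put("data".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(dataLength);
        return buf.array();
    }
}

For the 16 kHz mono short16 stream used here, that would be pcmHeader(16000, (short) 1, (short) 16, samples.length * 2).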

@GiviMAD (Member) commented Dec 28, 2024

> Disclaimer: I'm not very knowledgeable about audio formats. I just took an example somewhere on the internet and thought that the int16 format was the base for the whisper service. So I'm of course OK with using float32 if it's simpler.

Me neither; I assumed the float32 sample format from looking at the code types, but after reading it carefully you are correct: it's expecting s16 WAVE files.

> Very interesting device! Regarding your comment, do you also use whisper.cpp on this device, or another backend?
> I haven't run any comparative test, but besides the fact that I had issues with my old CPU under whisper.cpp that I didn't have with faster-whisper, I have read online that faster-whisper is much faster than whisper.cpp. Do you have a reason to use whisper.cpp (quality, etc.)?

I'm using whisper.cpp but I haven't tested any alternatives. I assume that since I'm running the inference on the GPU the performance difference must be insignificant, but I'll let you know if I end up trying anything else.

> I ran another test, this time with whisper-server (included in the official whisper release; is this the one you tested?).

I'm using the one available in the whisper.cpp project under examples/server; I built it from the v1.7.3 tag.

Apply PR comments

Signed-off-by: Gwendal Roulleau <[email protected]>
@dalgwen (Contributor Author) commented Dec 30, 2024

Regarding the regeneration of the translation properties file, I don't know why, but the automatic regeneration using Maven didn't work (even after deleting the entire config.xml file).
I had to resort to manual editing.
So, should I also modify the Italian translation, or is that taken care of by Crowdin?

Apart from that, I applied your remarks. Thanks again lsiepel and GiviMAD.

@GiviMAD (Member) left a comment


LGTM

Labels: enhancement (An enhancement or new feature for an existing add-on)
Projects: none yet
3 participants