[whisper] Add OpenAI API compatibility #17921
base: main
Conversation
Force-pushed from 73ebbae to e404735 (Signed-off-by: Gwendal Roulleau <[email protected]>)
Hello @dalgwen, this is great. I was about to look at it because I bought an Nvidia Orin Nano development kit and I want to host whisper there using an API that mimics OpenAI, so this also solves my problem. I'm going to run a quick test tomorrow, but I just reviewed the code and everything looks good to me. Also, I'm about to upgrade the whisper-jni library to use the latest whisper.cpp version, but I think I will wait to create a PR until this gets merged.
Hello again @dalgwen, sorry for the delay. I was able to test the changes today and everything works correctly. I was even able to use the new API mode to run the transcription against a whisper.cpp server instance running on another machine, as I commented earlier. The strange thing is that I had to run the server with the convert functionality enabled, so I think they are still expecting raw float PCM data, but with that enabled it works like a charm. The large turbo model runs really well on the Nvidia Orin Nano, and I saw the other day that they have lowered the price to $250, so it seems like a great alternative to power this kind of service (it's a pity it only has 8GB of RAM). About the PR,
Some style improvements and the properties file needs to be regenerated
@@ -245,34 +286,20 @@ public Set<AudioFormat> getSupportedFormats() {
    @Override
    public STTServiceHandle recognize(STTListener sttListener, AudioStream audioStream, Locale locale, Set<String> set)
            throws STTException {
Thank you, done
private String recognizeAPI(int audioSamplesOffset, short[] audioStream, String language) throws STTException {
Done
Thanks, will do ASAP.
Disclaimer: I'm not very knowledgeable about audio formats. I just took an example from somewhere on the internet and thought that the int16 format was the baseline for the whisper service. So of course I'm OK with using float32 if it's simpler. I flipped the code again to use float32 as the base and ran a test: float32 also works with faster-whisper, but only barely. The transcription is bad, really bad. Not totally wrong, but unusable nonetheless. I must have done something wrong. The commit is here; if you see something, please let me know?
Very interesting device! Regarding your comment, do you also use whisper.cpp on this device, or another backend?
I made another test, this time with whisper-server (included in the official whisper release. Is this the one you tested?).
Hey @dalgwen, I realized what the problem was this morning after digging a little into the whisper.cpp server source code. It seems we were sending a corrupted WAV header, as the sample size in the header was incorrect, so the server was not expecting raw audio as I initially assumed. After fixing that, it works correctly without enabling the convert option.
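To illustrate the bug described above (a WAV header whose sample-size field didn't match the s16 payload), here is a minimal sketch of a canonical 44-byte RIFF/WAVE header for 16-bit little-endian PCM. The class and method names are hypothetical, not from the PR; the field layout follows the standard RIFF format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class WavHeaderSketch {
    /** Builds a 44-byte RIFF/WAVE header for 16-bit little-endian PCM. */
    public static byte[] header(int sampleRate, int channels, int numSamples) {
        int bitsPerSample = 16;                          // s16 samples, as whisper.cpp server expects
        int blockAlign = channels * bitsPerSample / 8;   // bytes per sample frame
        int byteRate = sampleRate * blockAlign;
        int dataSize = numSamples * blockAlign;          // must match the actual payload size
        ByteBuffer b = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        b.putInt(36 + dataSize);                         // chunk size = remaining header + data
        b.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        b.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        b.putInt(16);                                    // PCM "fmt " sub-chunk length
        b.putShort((short) 1);                           // audio format 1 = uncompressed PCM
        b.putShort((short) channels);
        b.putInt(sampleRate);
        b.putInt(byteRate);
        b.putShort((short) blockAlign);
        b.putShort((short) bitsPerSample);               // the field that, if wrong, corrupts the header
        b.put("data".getBytes(StandardCharsets.US_ASCII));
        b.putInt(dataSize);
        return b.array();
    }
}
```

If any of these size fields disagree with the payload, a strict parser like the whisper.cpp server will misread the samples, which matches the garbled-transcription symptom seen earlier in the thread.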
....voice.whisperstt/src/main/java/org/openhab/voice/whisperstt/internal/WhisperSTTService.java
My bad, I assumed the float32 sample format looking at the code types, but after reading it carefully you are correct: it's expecting s16 WAVE files.
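Since the thread goes back and forth between s16 and float32 sample formats, a small sketch of the standard conversion may help; the class and method names are illustrative, not part of the PR. Signed 16-bit full scale is 2^15 = 32768, hence the divisor.

```java
public class PcmConvertSketch {
    /** s16 -> normalized float32 in [-1.0, 1.0). */
    public static float[] toFloat32(short[] s16) {
        float[] out = new float[s16.length];
        for (int i = 0; i < s16.length; i++) {
            out[i] = s16[i] / 32768.0f; // 32768 = 2^15, full scale for signed 16-bit
        }
        return out;
    }

    /** float32 -> s16, clamping first to avoid integer wrap-around on out-of-range input. */
    public static short[] toS16(float[] f32) {
        short[] out = new short[f32.length];
        for (int i = 0; i < f32.length; i++) {
            float v = Math.max(-1.0f, Math.min(1.0f, f32[i]));
            out[i] = (short) Math.round(v * 32767.0f);
        }
        return out;
    }
}
```

Feeding one format where the other is expected (or mislabeling it in the header) produces exactly the "not totally wrong, but unusable" transcriptions described above.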
I'm using whisper.cpp but I haven't tested any alternatives. I assume that since I'm running the inference on the GPU the performance differences must be insignificant, but I'll let you know if I end up trying anything else.
I'm using the one available in the whisper.cpp project under examples/server; I built it using the v1.7.3 tag.
Apply PR comments Signed-off-by: Gwendal Roulleau <[email protected]>
Regarding the translation properties file regeneration, I don't know why, but the automatic regeneration using Maven didn't work (even when deleting the entire config.xml file). Apart from that, I applied your remarks. Thanks again lsiepel and GiviMAD.
LGTM
Use an external OpenAI-compatible API as another option besides the locally embedded whisper.
Whisper is excellent as an embedded service, but it can also be exposed remotely, by services such as OpenAI obviously, but also by some other open source projects wrapping it (backed by whisper.cpp or faster-whisper), such as faster-whisper-server.
So, this PR adds the possibility to use a remote Whisper by calling an OpenAI-compatible API exposing it. I briefly thought about creating a new add-on, but deemed it overkill. It is another way of using Whisper, so I hope it's appropriate to do it here?
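For readers unfamiliar with the OpenAI-style transcription API mentioned here: it takes a multipart/form-data POST with a "file" part (the WAV audio) and a "model" field. The sketch below only builds such a body; it is an assumption-laden illustration (the class name, boundary, and field set are mine, not the PR's), not the add-on's actual HTTP code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class OpenAiTranscribeSketch {
    /**
     * Builds a multipart/form-data body with a "file" part holding WAV bytes
     * and a "model" text field, as expected by OpenAI-compatible
     * /v1/audio/transcriptions endpoints.
     */
    public static byte[] multipartBody(String boundary, byte[] wavBytes, String model) throws IOException {
        String crlf = "\r\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(("--" + boundary + crlf
                + "Content-Disposition: form-data; name=\"file\"; filename=\"audio.wav\"" + crlf
                + "Content-Type: audio/wav" + crlf + crlf).getBytes(StandardCharsets.UTF_8));
        out.write(wavBytes); // raw WAV payload, binary-safe (not string-concatenated)
        out.write((crlf + "--" + boundary + crlf
                + "Content-Disposition: form-data; name=\"model\"" + crlf + crlf
                + model + crlf
                + "--" + boundary + "--" + crlf).getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }
}
```

The resulting bytes would then be POSTed with a `Content-Type: multipart/form-data; boundary=...` header; the same request shape works against OpenAI, faster-whisper-server, or the whisper.cpp example server discussed in this thread.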
Some refactors I hope @GiviMAD can take a look at:
Thanks!