Whisper based websocket server for Kõnele #103

rpdrewes · 2023-02-09T02:38:58Z

For several years I have been self-hosting a Kaldi-based server that I use with my Android phone for voice recognition, with the excellent Kõnele app on the client side. Recently I became excited by the outstanding performance of whisper.cpp, ggerganov's port of the Whisper voice recognition system (https://github.com/ggerganov/whisper.cpp#whispercpp). I implemented a very quick websocket server around that code for Kõnele to use, in the spirit of the Kaldi gstreamer server (https://github.com/alumae/kaldi-gstreamer-server).

It is very crude but works very well because Whisper is fantastic, and the ggerganov implementation is very efficient! Here is the repo: https://github.com/rpdrewes/whisper-websocket-server

I'd appreciate comments and suggestions.

Kaljurand · 2023-02-10T22:55:19Z

Maintainer

Thanks! Looks nice and seems very easy to set up (I haven't tried yet though). Possible extensions that came to mind:

support multiple languages, i.e. map the language code sent by Kõnele to the Whisper commandline param
support translation
support real-time audio
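
The first extension above could look roughly like the following. This is a minimal sketch, not code from whisper-websocket-server: the helper name `whisper_language_args` is hypothetical, and it assumes Kõnele sends a locale-style tag such as `et-EE` (or `und` when the language is unspecified). The `-l`/`--language` and `-tr`/`--translate` flags are those of the whisper.cpp command-line example.

```python
def whisper_language_args(lang_tag: str, translate: bool = False) -> list[str]:
    """Build whisper.cpp CLI arguments from a locale-style language tag.

    Hypothetical helper: assumes tags like "et-EE", "en_US" or "und".
    """
    # Keep only the primary subtag: "et-EE" -> "et", "en_US" -> "en".
    primary = lang_tag.strip().replace("_", "-").split("-")[0].lower()
    # Treat a missing or undetermined ("und") language as auto-detection.
    if primary in ("", "und"):
        primary = "auto"
    args = ["-l", primary]
    if translate:
        # Ask whisper.cpp to translate the speech into English.
        args.append("-tr")
    return args


print(whisper_language_args("et-EE"))
print(whisper_language_args("und", translate=True))
```

The returned list can be appended to whatever command line the server already uses to invoke whisper.cpp.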

Some other possible features, e.g. confidence color-coding, diarization, word-level timestamps, prompts, would of course also require (UI) work on the Kõnele side.

Btw, it also seems possible to run Whisper directly on the phone (I haven't tried it yet, though), see e.g. https://github.com/alex-vt/WhisperInput

rpdrewes · 2023-02-13T00:30:44Z

Author

Thank you for the words of encouragement!

I was not aware of WhisperInput, thanks! I just tried it out and it is excellent. On my phone (Pixel 5a), with my pretty fast network connection and pretty fast self-hosted server running whisper.cpp, doing the voice recognition on the phone with WhisperInput is generally quite a bit slower than sending the audio across the network to my server using my websocket thing. I expect this would not be true on an iPhone, where whisper.cpp is more optimized. It would also depend on the length of the text being recognized, with longer text favoring the faster server over the phone. Still, it is good to have an offline option.

All of your suggestions for additions sound great and after playing with this for a while I may work to improve the server and maybe try to add some features to K6nele. Meanwhile I am very happy! Thank you for your work on K6nele on which my addition relies.

heimoshuiyu · 2023-10-17T13:13:24Z

Hi, I also wrote an interface for Kõnele, including both POST and websocket methods. However, I am using the faster-whisper library and the large-v2 model with FastAPI. I have been using it with Kõnele for over a month now and the results have been excellent.

Additionally, since English is not my first language, I set a rule that when I specify English as the target language, it enters Whisper's translation mode. Real-time voice translation from any language to English is really cool!
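
The rule described above can be sketched as a small helper. This is an illustration rather than the code in whisper-fastapi: the function name `transcribe_options` is hypothetical, and it assumes the language arrives as a tag such as `en-US` or `und`. The returned keys match the `task` and `language` keyword arguments of faster-whisper's `WhisperModel.transcribe`.

```python
from typing import Optional


def transcribe_options(lang_tag: str) -> dict:
    """Pick Whisper task/language options from the requested language tag.

    Hypothetical helper illustrating the "English means translate" rule.
    """
    primary = lang_tag.replace("_", "-").split("-")[0].lower()
    if primary == "en":
        # English requested: auto-detect the spoken language and
        # translate the speech into English.
        return {"task": "translate", "language": None}
    # Any other tag: plain transcription. "und" or an empty tag
    # leaves the language to auto-detection (language=None).
    language: Optional[str] = None if primary in ("", "und") else primary
    return {"task": "transcribe", "language": language}


print(transcribe_options("en-US"))
print(transcribe_options("zh-CN"))
```

With a loaded model, the result could be passed along as `model.transcribe(audio, **transcribe_options(lang))`.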

The code I'm using is here: https://github.com/heimoshuiyu/whisper-fastapi. It's very short, so feel free to take a look.

3 replies

Kaljurand Oct 18, 2023
Maintainer

Nice work! Very clean code and easy to play with.

I think it would still make sense to allow Kõnele to specify the spoken language (instead of guessing the language and using the language code only to trigger translation).

You could support additional query params (e.g. "initial_prompt", "translation_target") possibly sent by Kõnele if written as part of the server URL in the Kõnele or Kõnele service settings, e.g. ...:5000/k6nele/ws?initial_prompt=the+time+is.

async def konele_ws(
    websocket: WebSocket,
    lang: str = "und",
    initial_prompt: str = "",
):
    ...
    options = get_options(initial_prompt=initial_prompt)
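
As a side note on the URL format, the `+` characters in the example query string stand for spaces. A quick check with Python's standard library (not part of either server) shows the round trip; a web framework performs equivalent decoding before the value reaches a handler argument such as `initial_prompt`.

```python
from urllib.parse import parse_qs, urlencode

# Encode the prompt the way it would be typed into the server URL.
query = urlencode({"initial_prompt": "the time is"})
print(query)  # initial_prompt=the+time+is

# Decode it again, as the server side would.
params = parse_qs(query)
print(params["initial_prompt"][0])  # the time is
```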

Also, you could add the status-endpoint so that the Kõnele "Server URL" activity can show if the server is running, e.g.

@app.websocket("/k6nele/status")
@app.websocket("/konele/status")
async def konele_status(
    websocket: WebSocket,
):
    await websocket.accept()
    await websocket.send_json(dict(num_workers_available=1))
    await websocket.close()
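
On the client side, the status message sent by this endpoint might be interpreted as follows. This is only a sketch: the function name `server_is_ready` is hypothetical, and treating a positive `num_workers_available` as "running" is an assumption about how a client would use the field, not a description of Kõnele's actual code.

```python
import json


def server_is_ready(status_message: str) -> bool:
    """Return True if a status message reports at least one free worker."""
    try:
        status = json.loads(status_message)
    except json.JSONDecodeError:
        # Not JSON at all: treat the server as unavailable.
        return False
    if not isinstance(status, dict):
        return False
    workers = status.get("num_workers_available", 0)
    return isinstance(workers, int) and workers > 0


print(server_is_ready('{"num_workers_available": 1}'))
```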

Small comments:

I had to additionally pip install 'uvicorn[standard]' to get it running
README: main.py -> whisper_fastapi.py
it only works if Kõnele sends raw audio (instead of FLAC)

heimoshuiyu Nov 15, 2023

Great suggestions! I've updated the following features:

Transcription according to the language code sent by Kõnele (automatically determines the language if it's und)
Support for additional parameter initial_prompt
Added /konele/status endpoint
Added /konele/metrics endpoint for Prometheus to fetch status data
Support for FLAC format
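
One way a server could tell the two audio formats apart is by the stream header: a FLAC stream begins with the four bytes `fLaC`. The sketch below is an assumption about how this could be done, not the approach whisper-fastapi takes. The helper names are hypothetical, and the raw format is assumed to be 16-bit little-endian mono PCM.

```python
import struct


def looks_like_flac(data: bytes) -> bool:
    """Check for the 4-byte FLAC stream marker at the start of the data."""
    return data[:4] == b"fLaC"


def pcm16le_to_floats(data: bytes) -> list[float]:
    """Convert raw 16-bit little-endian PCM to floats in [-1.0, 1.0).

    Assumption: the client sends signed 16-bit mono samples.
    """
    n = len(data) // 2  # ignore a trailing odd byte, if any
    samples = struct.unpack(f"<{n}h", data[: n * 2])
    return [s / 32768.0 for s in samples]


raw = struct.pack("<3h", 0, 16384, -32768)
print(looks_like_flac(raw))
print(pcm16le_to_floats(raw))
```

Data that passes `looks_like_flac` would go to a FLAC decoder, while anything else would be treated as raw samples.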

Additionally, I'd like to know if Kõnele can support the initial_prompt parameter (for example, using the text before the user's cursor input, or the text on the user's screen, as initial_prompt). With the context, this should greatly improve the accuracy of speech recognition.

Kaljurand Nov 16, 2023
Maintainer

Thanks, I've tested it a bit with multiple languages (which can be listed in the "Server locales" in the "Kõnele service" app) and it works as advertised. :)

Regarding initial_prompt, it would be technically fairly easy to send some context to the server, e.g. the IME can access the "text before cursor" with https://developer.android.com/reference/android/view/inputmethod/BaseInputConnection#getTextBeforeCursor(int,%20int). I think Kõnele could send the complete text of the edited text field, including the cursor/selection position, so that the server can attempt a "fill in the middle" recognition. At the Kõnele abstraction level these params would be called by the Android terms, e.g. "text_before_cursor", and the server can then re-interpret them at its abstraction level as "initial_prompt" or anything else.
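
The re-interpretation step on the server side could be as small as the sketch below. Everything here is hypothetical: `text_before_cursor` is the parameter name proposed above and not something Kõnele currently sends, `prompt_from_context` is an invented helper name, and the `max_chars` cutoff is an arbitrary illustration value rather than a Whisper limit.

```python
def prompt_from_context(params: dict, max_chars: int = 200) -> str:
    """Turn a client-side "text_before_cursor" value into a prompt string.

    Hypothetical mapping from the Android-level parameter name to the
    server-level "initial_prompt" concept.
    """
    text = params.get("text_before_cursor", "")
    # Collapse newlines and repeated spaces into single spaces.
    text = " ".join(text.split())
    if max_chars <= 0:
        return ""
    # Keep only the text closest to the cursor.
    return text[-max_chars:]


print(prompt_from_context({"text_before_cursor": "Dear  Anna,\nthe meeting"}))
```

The result could then be used as the `initial_prompt` value for the recognizer.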

However, this should be off by default, and there should be a section on such privacy sensitive options in the settings, ideally even service dependent, so that the user can share more context with locally running servers, and less with Google etc.

(I'm not sure when I'd have time to work on this.)
