Real time speech to text script - some questions #122
Replies: 9 comments 31 replies
-
-
Addendum to 1:
-
Thanks for the swift reply and answers! Regarding:

I have an example transcription below. The first is what I was reading and the second is what the script transcribed. I was using the base.en model for real-time transcription (by changing the

What I was reading:

What was transcribed:

Edit as I'm writing: actually, now that I'm testing it, I see the same behavior even when using the tiny.en model (using the version of the script in my original post, which doesn't use a

New edit as I'm writing: I was trying to collect some logs for you, but I was having trouble doing that, so I combined both scripts into one and logging worked fine (UNI_v_1_0_1.zip). Anyway, I noticed that I was getting an error about

Anyway, below I'm posting the whole realtimesst.log from a new testing run, trying to transcribe the paragraph that I posted above.

New transcription:

realtimesst.log:
-
Thank you for providing additional information. You can ignore the ffmpeg warnings. The amount of skipped / untranscribed text is quite high. Earlier versions would skip a word here and there, but never multiple words in two consecutive sentences. I need to look into that thoroughly; it might take me some time.
-
Some more info
-
Okay, I'll try to reproduce.
I don't think so. Mic probs usually cause other problems in my experience.
Oh, that sounds interesting, I need to try that too.
Yes, that's true. It only ensures an uppercase start and a period at the end; it does not enforce lowercase the other way round. After a pause, Whisper often decides that the new "sentence" (or lowercase sentence fragment) has to start uppercase, because it's the start of a new transcription. I don't see a safe way to determine both whether the last sentence was actually finished and, if it wasn't, whether the first word of the new fragment is capitalized because it's genuinely written that way and so can't be forced to lowercase.
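To illustrate the problem described above, here is a naive sketch (the `join_fragments` helper is hypothetical, not part of the library) of why forcing lowercase after an apparently unfinished sentence is unsafe: a proper noun like "Paris" gets mangled, and nothing in the transcribed text distinguishes it from a spuriously capitalized "The".

```python
# Naive heuristic sketch (assumption, not RealtimeSTT code): if the previous
# fragment does not end in sentence-final punctuation, lowercase the first
# character of the new fragment before joining.
def join_fragments(prev: str, new: str) -> str:
    if prev and not prev.rstrip().endswith((".", "!", "?")):
        # Previous fragment looks unfinished; naively lowercase the new start.
        new = new[:1].lower() + new[1:]
    return (prev + " " + new).strip()

print(join_fragments("I went to", "The store"))    # "I went to the store"  (good)
print(join_fragments("I went to", "Paris today"))  # "I went to paris today" (wrong!)
```

The second call shows the failure mode: the heuristic cannot tell a legitimately capitalized word from one that Whisper capitalized only because it started a fresh transcription.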
-
Making a new comment thread to say that version 0.2.45 [using

What I was saying:

What it transcribed:

Pretty fast too 👍 Also another log, because why not
-
Found something. On CPU, if I really stress everything with large-v2 and talk a lot while transcribing, I can run into a problem where the pipe seems to be blocked:

2024-10-02 22:31:21.046 - RealTimeSTT: root - DEBUG - Debug: early transcription request pipe send
2024-10-02 22:31:21.061 - RealTimeSTT: root - DEBUG - Realtime text detected: Hey there, this is a little test.
2024-10-02 22:31:21.178 - RealTimeSTT: root - DEBUG - Current realtime buffer size: 54272
2024-10-02 22:31:21.178 - RealTimeSTT: faster_whisper - INFO - Processing audio with duration 00:03.392
2024-10-02 22:31:21.200 - RealTimeSTT: faster_whisper - DEBUG - Processing segment at 00:00.000
2024-10-02 22:31:21.437 - RealTimeSTT: root - DEBUG - Realtime text detected: Hey there, this is a little test.
2024-10-02 22:31:21.538 - RealTimeSTT: root - DEBUG - Current realtime buffer size: 54272
2024-10-02 22:31:21.538 - RealTimeSTT: faster_whisper - INFO - Processing audio with duration 00:03.392
2024-10-02 22:31:21.562 - RealTimeSTT: faster_whisper - DEBUG - Processing segment at 00:00.000
2024-10-02 22:31:21.784 - RealTimeSTT: root - DEBUG - Realtime text detected: Hey there, this is a little test.
[...]
2024-10-02 22:31:26.451 - RealTimeSTT: root - DEBUG - Current realtime buffer size: 54272
2024-10-02 22:31:26.451 - RealTimeSTT: faster_whisper - INFO - Processing audio with duration 00:03.392
2024-10-02 22:31:26.475 - RealTimeSTT: faster_whisper - DEBUG - Processing segment at 00:00.000
2024-10-02 22:31:26.704 - RealTimeSTT: root - DEBUG - Realtime text detected: Hey there, this is a little test.
2024-10-02 22:31:26.810 - RealTimeSTT: root - DEBUG - Current realtime buffer size: 54272
2024-10-02 22:31:26.810 - RealTimeSTT: faster_whisper - INFO - Processing audio with duration 00:03.392
2024-10-02 22:31:26.834 - RealTimeSTT: faster_whisper - DEBUG - Processing segment at 00:00.000
2024-10-02 22:31:27.072 - RealTimeSTT: root - DEBUG - Realtime text detected: Hey there, this is a little test.
2024-10-02 22:31:27.184 - RealTimeSTT: root - DEBUG - Current realtime buffer size: 54272
2024-10-02 22:31:27.184 - RealTimeSTT: faster_whisper - INFO - Processing audio with duration 00:03.392
2024-10-02 22:31:27.207 - RealTimeSTT: faster_whisper - DEBUG - Processing segment at 00:00.000
2024-10-02 22:31:27.432 - RealTimeSTT: root - DEBUG - Debug: early transcription request pipe send return

Between "early transcription request pipe send" and "early transcription request pipe send return" only this line is executed:

self.parent_transcription_pipe.send((audio, self.language))

In my infinite naivety about the nature of Python, I assumed this call to be non-blocking in every case. Turns out it isn't. Since I already had about a trillion problems around using pipes, I'm thinking about switching all interprocess communication from pipes to thread-safe queues, in the hope that this solves it. That will be another quite substantial refactoring session, but I can't see any other solution right now.
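For illustration, a minimal sketch of the queue-based alternative (the `send_chunk` helper is hypothetical, not RealtimeSTT's actual code): unlike `Connection.send()`, which can block when the pipe's buffer is full, `put_nowait()` on a bounded `multiprocessing.Queue` raises `queue.Full` immediately, so the realtime loop can drop or spill a chunk instead of stalling.

```python
# Sketch (assumption, not the library's implementation): hand audio chunks to
# the transcription process via a bounded multiprocessing.Queue instead of a Pipe.
import multiprocessing as mp
import queue

def send_chunk(q, chunk) -> bool:
    """Try to pass a chunk to the transcription process without blocking."""
    try:
        q.put_nowait(chunk)  # raises queue.Full instead of blocking
        return True
    except queue.Full:
        # Transcription is lagging; the caller can drop or buffer the chunk.
        return False

q = mp.Queue(maxsize=2)
print(send_chunk(q, b"audio-1"))  # True
print(send_chunk(q, b"audio-2"))  # True
print(send_chunk(q, b"audio-3"))  # False: queue full, but no deadlock
```

The key design point is that backpressure becomes an explicit return value the caller must handle, rather than an invisible block inside `send()`.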
-
Btw, where is the main model being used? Is it only for static files?
-
Hi guys,
I've been trying to build a real time STT implementation for quite a while and thankfully stumbled upon this project. I'm a pretty amateur programmer with limited knowledge and most of my development has been done with ChatGPT. Despite this I’ve managed to create a script that works quite well for real-time transcription, which I've attached to this thread. Below are some questions to help me understand how the project works better.
System and Setup

realtimestt_test.py. It starts muted, with Ctrl + 1 unmuting the microphone and Ctrl + 2 muting it again. It also performs live typing using pyautogui. (realtimestt_test_v_5_1_0_5.zip)
Questions:

1. Real-Time and Main Model Integration: When using the script in live transcription mode with the 'use_main_model_for_realtime' argument set to false, does the real-time transcription rely solely on the real-time model, or does the main model contribute at any point? I think I recall reading somewhere that the real-time model handles the first pass and the main model does the final transcription. How does this process work exactly?

2. Audio Buffer Behavior: I've noticed that when speaking for an extended period, the transcription sometimes has gaps (to be more specific: reading a large paragraph into the microphone, I notice small gaps within the paragraph; it's not specifically the tail end of the recording that's lost). Even after changing ALLOWED_LATENCY_LIMIT from 10 to 200, this behavior persists. [Note: when I modified audio_recorder.py to change the limit, I directed my script to use the modified version instead of the default one from the RealtimeSTT library.] Could this be related to how the audio buffer is managed internally? Would introducing an external buffer (e.g. with deque) to capture the microphone input help resolve the issue? My idea is to have the external buffer handle overflow when the internal buffer is full, thus preventing data loss.

3. Model Versions: Is there a significant difference between using the tiny vs. tiny.en models for real-time transcription? I've been using the .en versions for efficiency, but I'd like to know if there's any notable performance difference.

4. Script Components: As far as I can tell, the main work is handled by audio_recorder.py. Is there any other file or process involved in the transcription pipeline, or is audio_recorder.py the only one responsible?

I'm open to any suggestions or ideas you might have for improving my script. If there's something I don't understand, I'll be happy to look it up myself.
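The external-buffer idea from question 2 could be sketched roughly like this (the `OverflowBuffer` class is hypothetical, not part of RealtimeSTT): when the bounded main buffer is full, spill incoming chunks into a `deque` instead of discarding them, and drain the spill back in order as the consumer frees space.

```python
# Sketch of an overflow buffer (assumption, not RealtimeSTT's internals):
# a bounded "main" deque backed by an unbounded "overflow" deque so that
# chunks are spilled rather than dropped when the consumer falls behind.
from collections import deque

class OverflowBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.main = deque()      # bounded by `capacity`
        self.overflow = deque()  # holds the spill when main is full

    def push(self, chunk) -> None:
        if len(self.main) < self.capacity:
            self.main.append(chunk)
        else:
            self.overflow.append(chunk)  # spill instead of dropping

    def pop(self):
        chunk = self.main.popleft() if self.main else None
        # Refill main from the overflow so arrival order is preserved.
        while self.overflow and len(self.main) < self.capacity:
            self.main.append(self.overflow.popleft())
        return chunk

buf = OverflowBuffer(capacity=2)
for i in range(5):
    buf.push(i)
print([buf.pop() for _ in range(5)])  # [0, 1, 2, 3, 4]: nothing lost
```

Note that this only prevents loss between capture and transcription; if the transcriber stays slower than real time, the overflow deque grows without bound, so latency (not data) becomes the cost.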
Thanks in advance for your help and clarifications!