Real time speech to text script - some questions #122
Replies: 9 comments 31 replies
-
-
Addendum to 1:
-
Thanks for the swift reply and answers! Regarding:

I have an example transcription below. The first is what I was reading and the second is what the script transcribed. I was using the base.en model for real-time transcription (by changing the

What I was reading:

What was transcribed:

Edit as I'm writing: actually, now that I'm testing it, I see the same behavior even when using the tiny.en model (using the version of the script in my original post, which doesn't use a

New edit as I'm writing: I was trying to collect some logs for you, but I was having trouble doing that, so I combined both scripts into one and logging worked fine (UNI_v_1_0_1.zip). Anyway, I noticed that I was getting an error about

Anyway, below I'm posting the whole realtimesst.log from a new testing run, trying to transcribe the paragraph that I posted above.

New transcription:

realtimesst.log:
-
Thank you for providing additional information. You can ignore the ffmpeg warnings. The amount of skipped / untranscribed text is quite high. Earlier versions would skip a word here and there, but never multiple words in two consecutive sentences. I need to look into that thoroughly; it might take me some time.
-
Some more info
-
Okay, I'll try to reproduce.
I don't think so. Mic probs usually cause other problems in my experience.
Oh, that sounds interesting, I need to try that too.
Yes, that's true. It only ensures an uppercase start and a period at the end; it does not enforce lowercase the other way round. After a pause, Whisper often decides that the new "sentence" (or lowercase sentence fragment) has to start uppercase, because it's the start of a new transcription. I don't see a safe way to determine both whether the last sentence was actually finished and, if it wasn't, whether the first word of the new fragment is capitalized because it's genuinely written that way and so can't be forced to lowercase.
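To illustrate the problem described above, here is a naive sketch (the `join_fragments` helper is hypothetical, not part of the library) of why forcing lowercase after an apparently unfinished sentence is unsafe: a proper noun like "Paris" gets mangled, and nothing in the transcribed text distinguishes it from a spuriously capitalized "The".

```python
# Naive heuristic sketch (assumption, not RealtimeSTT code): if the previous
# fragment does not end in sentence-final punctuation, lowercase the first
# character of the new fragment before joining.
def join_fragments(prev: str, new: str) -> str:
    if prev and not prev.rstrip().endswith((".", "!", "?")):
        # Previous fragment looks unfinished; naively lowercase the new start.
        new = new[:1].lower() + new[1:]
    return (prev + " " + new).strip()

print(join_fragments("I went to", "The store"))    # "I went to the store"  (good)
print(join_fragments("I went to", "Paris today"))  # "I went to paris today" (wrong!)
```

The second call shows the failure mode: the heuristic cannot tell a legitimately capitalized word from one that Whisper capitalized only because it started a fresh transcription.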
-
Making a new comment thread to say that version 0.2.45 [using

What I was saying:

What it transcribed:

Pretty fast too 👍 Also another log, because why not
-
Found something. On CPU, if I really stress everything with large-v2 and talk a lot while transcribing, I can run into a problem where the pipe seems to be blocked:

2024-10-02 22:31:21.046 - RealTimeSTT: root - DEBUG - Debug: early transcription request pipe send
2024-10-02 22:31:21.061 - RealTimeSTT: root - DEBUG - Realtime text detected: Hey there, this is a little test.
2024-10-02 22:31:21.178 - RealTimeSTT: root - DEBUG - Current realtime buffer size: 54272
2024-10-02 22:31:21.178 - RealTimeSTT: faster_whisper - INFO - Processing audio with duration 00:03.392
2024-10-02 22:31:21.200 - RealTimeSTT: faster_whisper - DEBUG - Processing segment at 00:00.000
2024-10-02 22:31:21.437 - RealTimeSTT: root - DEBUG - Realtime text detected: Hey there, this is a little test.
2024-10-02 22:31:21.538 - RealTimeSTT: root - DEBUG - Current realtime buffer size: 54272
2024-10-02 22:31:21.538 - RealTimeSTT: faster_whisper - INFO - Processing audio with duration 00:03.392
2024-10-02 22:31:21.562 - RealTimeSTT: faster_whisper - DEBUG - Processing segment at 00:00.000
2024-10-02 22:31:21.784 - RealTimeSTT: root - DEBUG - Realtime text detected: Hey there, this is a little test.
[...]
2024-10-02 22:31:26.451 - RealTimeSTT: root - DEBUG - Current realtime buffer size: 54272
2024-10-02 22:31:26.451 - RealTimeSTT: faster_whisper - INFO - Processing audio with duration 00:03.392
2024-10-02 22:31:26.475 - RealTimeSTT: faster_whisper - DEBUG - Processing segment at 00:00.000
2024-10-02 22:31:26.704 - RealTimeSTT: root - DEBUG - Realtime text detected: Hey there, this is a little test.
2024-10-02 22:31:26.810 - RealTimeSTT: root - DEBUG - Current realtime buffer size: 54272
2024-10-02 22:31:26.810 - RealTimeSTT: faster_whisper - INFO - Processing audio with duration 00:03.392
2024-10-02 22:31:26.834 - RealTimeSTT: faster_whisper - DEBUG - Processing segment at 00:00.000
2024-10-02 22:31:27.072 - RealTimeSTT: root - DEBUG - Realtime text detected: Hey there, this is a little test.
2024-10-02 22:31:27.184 - RealTimeSTT: root - DEBUG - Current realtime buffer size: 54272
2024-10-02 22:31:27.184 - RealTimeSTT: faster_whisper - INFO - Processing audio with duration 00:03.392
2024-10-02 22:31:27.207 - RealTimeSTT: faster_whisper - DEBUG - Processing segment at 00:00.000
2024-10-02 22:31:27.432 - RealTimeSTT: root - DEBUG - Debug: early transcription request pipe send return

Between "early transcription request pipe send" and "early transcription request pipe send return" only this line is executed:

self.parent_transcription_pipe.send((audio, self.language))

In my infinite naivety about the nature of Python, I assumed this call to be non-blocking in every case. Turns out it isn't. Since I already had about a trillion problems around using pipes, I'm thinking about switching all interprocess communication from pipes to thread-safe queues, in the hope that this solves it. That will be another quite substantial refactoring session, but I can't see any other solution right now.
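For illustration, a minimal sketch of the queue-based alternative (the `send_chunk` helper is hypothetical, not RealtimeSTT's actual code): unlike `Connection.send()`, which can block when the pipe's buffer is full, `put_nowait()` on a bounded `multiprocessing.Queue` raises `queue.Full` immediately, so the realtime loop can drop or spill a chunk instead of stalling.

```python
# Sketch (assumption, not the library's implementation): hand audio chunks to
# the transcription process via a bounded multiprocessing.Queue instead of a Pipe.
import multiprocessing as mp
import queue

def send_chunk(q, chunk) -> bool:
    """Try to pass a chunk to the transcription process without blocking."""
    try:
        q.put_nowait(chunk)  # raises queue.Full instead of blocking
        return True
    except queue.Full:
        # Transcription is lagging; the caller can drop or buffer the chunk.
        return False

q = mp.Queue(maxsize=2)
print(send_chunk(q, b"audio-1"))  # True
print(send_chunk(q, b"audio-2"))  # True
print(send_chunk(q, b"audio-3"))  # False: queue full, but no deadlock
```

The key design point is that backpressure becomes an explicit return value the caller must handle, rather than an invisible block inside `send()`.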
-
Btw, where is the main model being used? Is it only for static files?
-
Hi guys,
I've been trying to build a real time STT implementation for quite a while and thankfully stumbled upon this project. I'm a pretty amateur programmer with limited knowledge and most of my development has been done with ChatGPT. Despite this I’ve managed to create a script that works quite well for real-time transcription, which I've attached to this thread. Below are some questions to help me understand how the project works better.
System and Setup

realtimestt_test.py. It starts muted, with Ctrl + 1 unmuting the microphone and Ctrl + 2 muting it again. It also performs live typing using pyautogui. (realtimestt_test_v_5_1_0_5.zip)
Questions:

1. Real-Time and Main Model Integration: When using the script in live transcription mode with the 'use_main_model_for_realtime' argument set to false, does the real-time transcription rely solely on the real-time model, or does the main model contribute at any point? I think I recall reading somewhere that the real-time model handles the first pass and the main model does the final transcription. How does this process work exactly?

2. Audio Buffer Behavior: I've noticed that when speaking for an extended period, the transcription sometimes has gaps (to be more specific: reading a large paragraph into the microphone, I notice small gaps within the paragraph; it's not specifically the tail end of the recording that's lost). Even after changing ALLOWED_LATENCY_LIMIT from 10 to 200, this behavior persists. [Note: when I modified audio_recorder.py to change the limit, I directed my script to use the modified version instead of the default one from the RealtimeSTT library.] Could this be related to how the audio buffer is managed internally? Would introducing an external buffer (e.g. with deque) to capture the microphone input help resolve the issue? My idea is to have the external buffer handle overflow when the internal buffer is full, thus preventing data loss.

3. Model Versions: Is there a significant difference between using the tiny vs. tiny.en models for real-time transcription? I've been using the .en versions for efficiency, but I'd like to know if there's any notable performance difference.

4. Script Components: As far as I can tell, the main work is handled by audio_recorder.py. Is there any other file or process involved in the transcription pipeline, or is audio_recorder.py the only one responsible?

I'm open to any suggestions or ideas you might have for improving my script. If there's something I don't understand, I'll be happy to look it up myself.
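The external-buffer idea from question 2 could be sketched roughly like this (the `OverflowBuffer` class is hypothetical, not part of RealtimeSTT): when the bounded main buffer is full, spill incoming chunks into a `deque` instead of discarding them, and drain the spill back in order as the consumer frees space.

```python
# Sketch of an overflow buffer (assumption, not RealtimeSTT's internals):
# a bounded "main" deque backed by an unbounded "overflow" deque so that
# chunks are spilled rather than dropped when the consumer falls behind.
from collections import deque

class OverflowBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.main = deque()      # bounded by `capacity`
        self.overflow = deque()  # holds the spill when main is full

    def push(self, chunk) -> None:
        if len(self.main) < self.capacity:
            self.main.append(chunk)
        else:
            self.overflow.append(chunk)  # spill instead of dropping

    def pop(self):
        chunk = self.main.popleft() if self.main else None
        # Refill main from the overflow so arrival order is preserved.
        while self.overflow and len(self.main) < self.capacity:
            self.main.append(self.overflow.popleft())
        return chunk

buf = OverflowBuffer(capacity=2)
for i in range(5):
    buf.push(i)
print([buf.pop() for _ in range(5)])  # [0, 1, 2, 3, 4]: nothing lost
```

Note that this only prevents loss between capture and transcription; if the transcriber stays slower than real time, the overflow deque grows without bound, so latency (not data) becomes the cost.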
Thanks in advance for your help and clarifications!