-
Hello, I have a project where I stream video from my phone using an IP camera app and load the stream on my computer. I want a program that monitors this stream: when I say certain keywords, it should record my next sentence using STT and take a photo of the current camera view, then pass these two into a local LLM as a prompt. Right now the picture-taking part is easy using OpenCV, but I can't get the STT to run as planned. I tried Vosk, SpeechRecognition, and Porcupine, but none of the three seems able to monitor the FFmpeg stream (or perhaps I'm doing it wrong, since I don't have any experience with voice processing). Can RealtimeSTT achieve this? Or do I have some misunderstanding of how the audio is processed? Is there a better way to monitor a real-time audio stream? On a side note, any other suggestions regarding my project idea are welcome!
-
You'll need to convert the audio stream from FFmpeg into 16 kHz PCM WAV and then use the feed_audio method. Depending on the actual MP3 format of the chunks, the conversion can be rather straightforward (plain MP3) or quite complicated (if the MP3 chunks depend on each other). If it's the easy case, the conversion can be done with pydub:

```python
import io
from pydub import AudioSegment

segment = AudioSegment.from_file(io.BytesIO(chunk), format="mp3")
```

Or you can use an ffmpeg CLI command to convert. The feed_audio method requires 16 kHz mono PCM chunks of 1024 samples fed in real time (the chunks have to come in with correct timing).
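To bridge the gap between that snippet and feed_audio, here is a minimal sketch of the rest of the pydub route. The helper name, the 1024-sample slicing, and the assumption that each MP3 chunk decodes on its own are mine, not from the original answer; `recorder` is the AudioToTextRecorder created as in the demo below.

```python
import io
from pydub import AudioSegment

SAMPLES_PER_CHUNK = 1024                 # chunk size feed_audio expects
BYTES_PER_CHUNK = SAMPLES_PER_CHUNK * 2  # 16-bit samples -> 2 bytes each

def feed_mp3_chunk(recorder, mp3_chunk: bytes):
    """Decode one self-contained MP3 chunk and feed it as 16 kHz mono PCM (sketch)."""
    segment = AudioSegment.from_file(io.BytesIO(mp3_chunk), format="mp3")
    # Convert to what feed_audio expects: 16 kHz, mono, 16-bit signed samples
    segment = segment.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    pcm = segment.raw_data  # raw little-endian 16-bit PCM bytes
    # Hand the PCM over in 1024-sample pieces
    for i in range(0, len(pcm), BYTES_PER_CHUNK):
        recorder.feed_audio(pcm[i:i + BYTES_PER_CHUNK])
```

If the MP3 chunks arrive from FFmpeg in real time, feeding each one as it comes in keeps the timing roughly correct.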
Demo code:

```python
if __name__ == "__main__":
    import threading
    import pyaudio
    from RealtimeSTT import AudioToTextRecorder

    # Audio stream configuration constants
    CHUNK = 1024              # Number of audio samples per buffer
    FORMAT = pyaudio.paInt16  # Sample format (16-bit integer)
    CHANNELS = 1              # Mono audio
    RATE = 16000              # Sampling rate in Hz (expected by the recorder)

    # Initialize the audio-to-text recorder without using the microphone directly.
    # Since we are feeding audio data manually, set use_microphone to False.
    recorder = AudioToTextRecorder(
        use_microphone=False,  # Disable built-in microphone usage
        spinner=False          # Disable spinner animation in the console
    )

    # Event to signal when to stop the threads
    stop_event = threading.Event()

    def feed_audio_thread():
        """Thread function to read audio data and feed it to the recorder."""
        p = pyaudio.PyAudio()

        # Open an input audio stream with the specified configuration
        stream = p.open(
            format=FORMAT,
            channels=CHANNELS,
            rate=RATE,
            input=True,
            frames_per_buffer=CHUNK
        )

        try:
            print("Speak now")
            while not stop_event.is_set():
                # Read audio data from the stream (in the expected format)
                data = stream.read(CHUNK)
                # Feed the audio data to the recorder
                recorder.feed_audio(data)
        except Exception as e:
            print(f"feed_audio_thread encountered an error: {e}")
        finally:
            # Clean up the audio stream
            stream.stop_stream()
            stream.close()
            p.terminate()
            print("Audio stream closed.")

    def recorder_transcription_thread():
        """Thread function to handle transcription and process the text."""

        def process_text(full_sentence):
            """Callback function to process the transcribed text."""
            print("Transcribed text:", full_sentence)
            # Check for the stop command in the transcribed text
            if "stop recording" in full_sentence.lower():
                print("Stop command detected. Stopping threads...")
                stop_event.set()
                recorder.abort()

        try:
            while not stop_event.is_set():
                # Get transcribed text and process it using the callback
                recorder.text(process_text)
        except Exception as e:
            print(f"transcription_thread encountered an error: {e}")
        finally:
            print("Transcription thread exiting.")

    # Create and start the audio feeding thread
    audio_thread = threading.Thread(target=feed_audio_thread)
    audio_thread.daemon = False  # Ensure the thread doesn't exit prematurely
    audio_thread.start()

    # Create and start the transcription thread
    transcription_thread = threading.Thread(target=recorder_transcription_thread)
    transcription_thread.daemon = False  # Ensure the thread doesn't exit prematurely
    transcription_thread.start()

    # Wait for both threads to finish
    audio_thread.join()
    transcription_thread.join()
    print("Recording and transcription have stopped.")

    recorder.shutdown()
```
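For the ffmpeg CLI route, one way to wire the phone's stream into the same feed_audio loop is to let ffmpeg decode and resample the audio and read raw PCM from its stdout, replacing the pyaudio part of the demo above. This is only a sketch under assumptions: the stream URL is a placeholder for your IP camera address, and the function name is mine.

```python
import subprocess

SAMPLES_PER_CHUNK = 1024                   # matches the demo's CHUNK size
BYTES_PER_CHUNK = SAMPLES_PER_CHUNK * 2    # 16-bit mono -> 2 bytes per sample
STREAM_URL = "http://PHONE_IP:8080/video"  # placeholder, use your IP camera URL

def feed_ffmpeg_thread(recorder, stop_event):
    """Read 16 kHz mono s16le PCM from ffmpeg's stdout and feed it to the recorder."""
    cmd = [
        "ffmpeg",
        "-i", STREAM_URL,   # input: the phone's IP camera stream
        "-vn",              # ignore the video track, audio only
        "-ac", "1",         # downmix to mono
        "-ar", "16000",     # resample to 16 kHz
        "-f", "s16le",      # raw signed 16-bit little-endian PCM
        "-loglevel", "quiet",
        "pipe:1",           # write the PCM to stdout
    ]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        while not stop_event.is_set():
            data = proc.stdout.read(BYTES_PER_CHUNK)
            if not data:
                break  # stream ended or ffmpeg exited
            recorder.feed_audio(data)
    finally:
        proc.terminate()
```

Because ffmpeg is decoding a live stream, the PCM arrives at roughly real-time speed, which satisfies the timing note above; the keyword check and the OpenCV snapshot for your project could then go into the process_text callback.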