A fast, fully local AI Voicechat using WebSockets
- WebSocket server that allows for simple remote access
- Default web UI w/ VAD using ricky0123/vad, Opus support using symblai/opus-encdec
- Modular/swappable SRT, LLM, TTS servers
- SRT: whisper.cpp, faster-whisper, or HF Transformers whisper
- LLM: llama.cpp or any OpenAI API compatible server
- TTS: coqui-tts, StyleTTS2, Piper, MeloTTS
Demo video: voicechat2.webm (unmute to hear the audio)
On a 7900-class AMD RDNA3 card, voice-to-voice latency is in the 1 second range:
- distil-whisper/distil-large-v2
- bartowski/Meta-Llama-3.1-8B-Instruct-GGUF (Q4_K_M)
- tts_models/en/vctk/vits (Coqui TTS default VITS models)
On a 4090, using Faster Whisper with faster-distil-whisper-large-v2, we can cut the latency down to as low as 300ms:
Demo video: voicechat2-fw.webm
You can of course run any model or swap out any of the SRT, LLM, or TTS components as you like. For example, you can run whisper.cpp for SRT, or use the StyleTTS2 server in the test folder as an alternative TTS. For a bit more about this project, see my Hackster.io writeup.
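Because the LLM piece just needs to speak the OpenAI chat completions API, swapping LLM backends is mostly a matter of pointing voicechat2 at a different endpoint. As a quick sanity check against whatever backend you pick, a plain request like this should work (the host, port, and model name here are assumptions for a locally running llama.cpp server, not settings voicechat2 requires):

```bash
# Assumed example: an OpenAI API compatible server (e.g. llama.cpp's llama-server) on localhost:8080
# The model name is a placeholder; llama.cpp ignores it, other backends may want their own value
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 32
      }'
```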
These installation instructions are for Ubuntu LTS and assume you've setup your ROCm or CUDA already.
I recommend you use conda or (my preference) mamba for environment management. It will make your life easier.
```bash
sudo apt update
# Not strictly required, but helpers we use
sudo apt install byobu curl wget
# Audio processing
sudo apt install espeak-ng ffmpeg libopus0 libopus-dev

# Create env
mamba create -y -n voicechat2 python=3.11

# Setup
mamba activate voicechat2
git clone https://github.com/lhl/voicechat2
cd voicechat2
pip install -r requirements.txt

# Build llama.cpp (pick the make line that matches your GPU)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# AMD (ROCm) version
make GGML_HIPBLAS=1 -j
# Nvidia (CUDA) version
make GGML_CUDA=1 -j

# Grab your preferred GGUF model
wget https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf

# Go back to the voicechat2 directory if you're continuing with the next step
cd ..
```
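The launch script below will start everything for you, but if you want to sanity-check the llama.cpp build and the downloaded GGUF on their own first, you can run llama.cpp's bundled OpenAI API compatible server by hand. The flags here (GPU offload, context size, host/port) are illustrative defaults, not settings voicechat2 requires:

```bash
# Illustrative only: serve the downloaded GGUF with llama.cpp's OpenAI API compatible server
# -ngl offloads layers to the GPU, -c sets the context size; adjust both for your hardware
cd llama.cpp
./llama-server -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -ngl 99 -c 8192 --host 127.0.0.1 --port 8080
```

Once it's up, a chat completions request like the curl example above should return a completion.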
Some extra convenience scripts for launching:
- run-voicechat2.sh - on your GPU machine, tries to launch all servers in separate byobu sessions; update the MODEL variables
- remote-tunnel.sh - connect your GPU machine to a jump machine
- local-tunnel.sh - connect to the GPU machine via a jump machine
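Conceptually, the tunnel pair amounts to standard SSH port forwarding along these lines (the hostnames and the port are placeholders; use whatever port your voicechat2 web server actually listens on):

```bash
# On the GPU machine: publish the voicechat2 port on the jump host (remote forward)
ssh -N -R 8000:localhost:8000 user@jump-host

# On your local machine: pull that port back through the jump host (local forward)
ssh -N -L 8000:localhost:8000 user@jump-host

# Then browse to http://localhost:8000 locally
```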
A project released after voicechat2 that uses a similar modular approach but is local device oriented
- https://github.com/eustlb/speech-to-speech
- No license?
The demo shows a fair amount of latency (~10s), and it isn't the closest match to what we're doing, since it uses WebRTC rather than WebSockets like voicechat2 (HF Transformers, Ollama)
A console-based local client (HF Transformers, Ollama, Coqui TTS, PortAudio)
This is a very responsive console-based local-client app that also has VAD and interruption support, plus a really clever hook! (whisper.cpp, llama.cpp, piper, espeak)
Another console-based local client, more of a proof of concept, but with a blog writeup.
- https://github.com/vndee/local-talking-llm
- https://blog.duy.dev/build-your-own-voice-assistant-and-run-it-locally/
- MIT
Another console-based local client (FastConformer, HF Transformers, StyleTTS2, espeak)
KoljaB has a number of interesting projects around console-based local clients like RealtimeSTT, RealtimeTTS, Linguflex, etc. (faster_whisper, llama.cpp, Coqui XTTS)
- https://github.com/KoljaB/LocalAIVoiceChat
- NC (Coqui Model License)
This is not a local voicechat client, but it does have a neat WebRTC front-end, so it might be worth poking around in (Vite/React, Tailwind, Radix)