
'Voice Conversion' paper candidate 2412.02612 #668

Open
github-actions bot opened this issue Dec 4, 2024 · 0 comments

Comments

github-actions bot (Contributor) commented Dec 4, 2024

Please check whether this paper is about 'Voice Conversion' or not.

Article info:

  • title: GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

  • summary: We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken
    chatbot. It supports both Chinese and English, engages in real-time voice
    conversations, and varies vocal nuances such as emotion, intonation, speech
    rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low
    bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate
    derived from an automatic speech recognition (ASR) model by incorporating a
    vector-quantized bottleneck into the encoder. To efficiently transfer knowledge
    from text to speech modalities, we synthesize speech-text interleaved data from
    existing text pre-training corpora using a text-to-token model. We continue
    pre-training from the pre-trained text language model GLM-4-9B with a
    combination of unsupervised speech data, interleaved speech-text data, and
    supervised speech-text data, scaling up to 1 trillion tokens, achieving
    state-of-the-art performance in both speech language modeling and spoken
    question answering. We then fine-tune the pre-trained model with high-quality
    conversational speech data, achieving superior performance compared to existing
    baselines in both conversational ability and speech quality. The open models
    can be accessed through https://github.com/THUDM/GLM-4-Voice and
    https://huggingface.co/THUDM/glm-4-voice-9b.

  • id: http://arxiv.org/abs/2412.02612v1
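As a quick sanity check on the abstract's numbers, 175 bps at a 12.5 Hz frame rate implies 14 bits per frame, i.e. a single codebook of 2^14 = 16384 entries. The sketch below shows that arithmetic plus a minimal nearest-neighbour single-codebook quantizer; the code dimension (64) and the random codebook are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Bitrate arithmetic implied by the abstract.
FRAME_RATE_HZ = 12.5                           # frame rate from the abstract
BITRATE_BPS = 175                              # bitrate from the abstract
bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ   # 175 / 12.5 = 14 bits/frame
codebook_size = int(2 ** bits_per_frame)       # 2^14 = 16384 entries

# Hypothetical single codebook: 16384 entries of 64-dim vectors
# (the dimension is an assumption for illustration only).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(codebook_size, 64))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map each encoder frame to the index of its nearest codebook entry."""
    # Squared L2 distance between every frame and every codebook vector.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

frames = rng.normal(size=(5, 64))  # 5 frames = 0.4 s of audio at 12.5 Hz
ids = quantize(frames)
print(bits_per_frame, codebook_size, ids.shape)
```

Each frame of audio is thus reduced to one integer token, which is what lets the speech stream be modeled by a text-style language model.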

Judge

Write [vclab::confirmed] or [vclab::excluded] in a comment.
