Setting up a voice conversion pipeline #117
Comments
Hi! I think this could be relevant for this project. So far we have focused mostly on chatting with LLMs, but voice conversion is just around the corner, and I don't see a reason why we wouldn't support it here.
For the voice conversion (VC) example, maybe you can refer to our released project CleanS2S.
I successfully made your pipeline example run on my Mac. I did not expect to meet an assistant, but understand a bit more now about the intention of this project.
I would like to build a pipeline for voice conversion, similar to the product that ElevenLabs are offering. In their app you can upload a sound file up to 50 MB, and get a configurable voice conversion of the original speech sample. Microsoft SpeechT5 also offers voice conversion, but one would have to build a custom framework around that model.
Is speech-to-speech a relevant tool for such a task, or should I look at other s2s models or frameworks?
EDIT: After writing this, I realized that GPT-4o is an AI voice-controlled assistant. My bad. It would still be nice to know if this pipeline can easily be modified to accept sound files and convert voices.
EDIT2: I found this Hugging Face audio course, which I guess pretty much covers the basics. However, the ElevenLabs voice conversion outputs an audio file where the converted words are synced to the spoken words on a timeline, in practical terms mimicking the pace and style of the speaker. Unless I am missing something obvious, it seems my best option is to build a custom framework around the SpeechT5 VC model.
EDIT3: I think this problem is solved, for example by WhisperX. If one wishes to build a framework from scratch, it would involve:
- whisper-distil-large for speech transcription
- parler-tts
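
To make the from-scratch idea a bit more concrete, here is a minimal sketch of the glue code, assuming the transcription step returns segments with start times and the TTS step returns raw samples. `Segment`, `convert_voice`, and the two callables are hypothetical names for illustration, not part of this project, Whisper, or Parler-TTS; in practice the callables would wrap whichever models are chosen. It only places each re-synthesized segment at its original start time and does not stretch speech to match the speaker's pace.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Segment:
    """One transcribed chunk of speech (hypothetical helper type)."""
    start: float  # seconds from the beginning of the input audio
    end: float    # seconds from the beginning of the input audio
    text: str


# Assumed stage signatures: real implementations would wrap an ASR model
# that returns timestamps and a TTS model that returns mono samples.
Transcriber = Callable[[Sequence[float], int], List[Segment]]
Synthesizer = Callable[[str, int], List[float]]


def convert_voice(
    audio: Sequence[float],
    sample_rate: int,
    transcribe: Transcriber,
    synthesize: Synthesizer,
) -> List[float]:
    """Transcribe the input, re-synthesize each segment in the target
    voice, and place it on the timeline at the segment's original start."""
    segments = transcribe(audio, sample_rate)
    output = [0.0] * len(audio)  # start from silence of the same length

    for seg in segments:
        speech = synthesize(seg.text, sample_rate)
        offset = int(round(seg.start * sample_rate))
        needed = offset + len(speech)
        if needed > len(output):
            # Synthesized speech runs past the original end: pad with silence.
            output.extend([0.0] * (needed - len(output)))
        for i, sample in enumerate(speech):
            # Sum rather than overwrite, so overlapping segments are mixed.
            output[offset + i] += sample

    return output
```

Keeping the two stages as plain callables means the ASR and TTS models can be swapped without touching the timeline logic.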