Questions about usage for a new comer in the field #180
altineller started this conversation in General
Hello,
I would like to use coqui-ai-TTS to narrate the robot videos I make. I have gone through the documentation and successfully synthesized and also cloned voices. I have been running tests cloning the voice of Q from James Bond, explaining to students how robots work. So far it is going well, but there are a few usage questions I would like to ask.
I am building a bash array of sentences and then running `tts` on it. The first array is full of individual sentences; a second array has 3 elements, each of which joins together multiple sentences from the first array.
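A minimal sketch of that setup, assuming the XTTS CLI; the sentences, model name, and reference wav here are placeholders, not the originals:

```shell
# Placeholder sentences standing in for the original array.
sentences=(
  "Welcome to the robotics lab."
  "Today we will look at how the drive train works."
  "Pay attention to the encoder feedback loop."
)

# Join several sentences into one chunk; synthesizing a longer chunk in
# one call tends to give better prosody than going sentence by sentence.
chunk="${sentences[*]}"
echo "$chunk"

# Then synthesize the whole chunk in one call (model name is an example):
# tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
#     --text "$chunk" --speaker_wav 01.wav --language_idx en \
#     --out_path chunk.wav
```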
Joining more sentences together yields better results. Why is that? Also, is there any markup for sentences, or should the input be plain English?
For example, a space before the beginning of a sentence alters the sound, as does a period. Are there any tricks, such as markers, to modify the sound, so that when the model generates less-than-perfect speech one can correct it with a bit of markup?
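One way to pin down how punctuation affects the output is a small A/B sweep over variants of the same sentence; this is only a sketch, with a placeholder sentence and model name:

```shell
# Hypothetical A/B test: the same text with different leading/trailing
# punctuation, so the resulting wavs can be compared by ear.
text="robots use encoders to measure wheel rotation"
variants=("$text" "$text." " $text." "$text,")

for i in "${!variants[@]}"; do
  echo "variant $i: [${variants[$i]}]"
  # tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
  #     --text "${variants[$i]}" --speaker_wav 01.wav \
  #     --language_idx en --out_path "variant_$i.wav"
done
```

Keeping everything fixed except the punctuation makes it easy to hear which markers actually change the prosody.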
Also, is there any documentation I can read about tuning (other than the documentation linked from idiap/coqui-ai-TTS)? Since I don't fully understand the parameters, I cannot do any tuning except blindly.
Another question is about voice cloning. I extracted 11 segments, denoised them, and used them as `--speaker_wav 01.wav 02.wav ... 11.wav`. Then I wrote a script to generate the same speech while removing one speaker_wav each time. Needless to say, some outputs are better than others. If I had a sense of which input audio is good for sampling and which is not, I could do much better.
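The leave-one-out sweep described above can be sketched like this; the clip names and the model name are assumptions, and the actual `tts` call is left commented:

```shell
# Hypothetical leave-one-out sweep over 11 reference clips. Dropping one
# clip per run reveals which reference degrades the cloned voice.
clips=(01.wav 02.wav 03.wav 04.wav 05.wav 06.wav
       07.wav 08.wav 09.wav 10.wav 11.wav)

for drop in "${!clips[@]}"; do
  # All clips except the one at index $drop.
  subset=("${clips[@]:0:drop}" "${clips[@]:drop+1}")
  echo "run $drop without ${clips[$drop]}: ${subset[*]}"
  # tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
  #     --text "Test sentence." --speaker_wav "${subset[@]}" \
  #     --language_idx en --out_path "out_without_${clips[$drop]}"
done
```

Listening to the 11 outputs side by side then points at the clip whose removal improves the result most.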
Thanks for developing this software and sharing it as open source. It is a monumental amount of work, and despite these few points, it is perfect.
Best Regards,
Can Altineller