Voice Cloning through a two step styling process? #140

Open

kaushal-gawri9899 opened this issue Sep 24, 2024 · 3 comments


@kaushal-gawri9899

Hey, is it possible to support voice cloning through a two-step encoding process? Basically, before encoding, could we inject a speaker embedding to be used at encoding time instead of relying solely on the style prompt? I'm looking to control the styling in two steps: provide the required speaker embedding to the encoder for tone coloring/voice cloning, and handle the rest of the styling through the prompt (independent of who the speaker is).
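
A minimal sketch of what that injection step could look like (all module and tensor names here are hypothetical, not the project's actual API): project the external speaker embedding into the style encoder's hidden size and prepend it to the encoded prompt states, so the decoder cross-attends to voice identity and style as separate signals.

```python
import torch
import torch.nn as nn

class TwoStepConditioner(nn.Module):
    """Hypothetical helper: fuses a speaker embedding with style-prompt states."""

    def __init__(self, speaker_dim: int, hidden_dim: int):
        super().__init__()
        # Project an external speaker embedding (e.g. from a speaker
        # verification model) into the style encoder's hidden space.
        self.speaker_proj = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, prompt_hidden_states: torch.Tensor,
                speaker_embedding: torch.Tensor) -> torch.Tensor:
        # prompt_hidden_states: (batch, seq_len, hidden_dim)
        # speaker_embedding:    (batch, speaker_dim)
        spk = self.speaker_proj(speaker_embedding).unsqueeze(1)
        # Prepend the speaker "token" so it is visible at every
        # cross-attention step, independently of the style prompt.
        return torch.cat([spk, prompt_hidden_states], dim=1)
```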

@apresence
Contributor

apresence commented Sep 26, 2024

If I understand your request correctly, I am working on effectively the same thing. It looks like your method is much more involved, so you might get better results with it. I'm cleaning up the code and, once it's ready, I'll submit a PR for it (I'm a full-time programmer with a day job, which means ~60 hrs/wk... so I'm finding the time when I can). I've already submitted a PR to prep some changes for it; see #139.

@kaushal-gawri9899
Author

kaushal-gawri9899 commented Sep 26, 2024

I guess it's similar, but based on the PR I'm under the impression that you're trying to propagate the speaker representations using "input_values" in the encoder, right? I'm taking a different approach: I train the model to consider the speaker reference voice in the decoder (the causal LM), so I tweaked the architecture as described above.
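
For reference, a rough sketch of this decoder-side variant (hypothetical names; the real model's forward signature will differ): embed the reference voice and prepend it to the causal LM's input embeddings, so every generated audio token can attend back to the speaker identity.

```python
import torch
import torch.nn as nn

def prepend_speaker_to_decoder_inputs(
    decoder_input_embeds: torch.Tensor,  # (batch, seq_len, hidden)
    speaker_reference: torch.Tensor,     # (batch, speaker_dim)
    speaker_proj: nn.Linear,             # maps speaker_dim -> hidden
) -> torch.Tensor:
    # Turn the reference voice into a single prefix embedding.
    spk = speaker_proj(speaker_reference).unsqueeze(1)
    # With causal masking, every later audio token can attend back to
    # this position-0 speaker prefix.
    return torch.cat([spk, decoder_input_embeds], dim=1)
```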

@bregsi

bregsi commented Oct 8, 2024

Somehow I feel this creates the problem of a third input competing with text_description. What comes first: the description or the speaker embedding? The speaker embedding should just give the nuance of the specific voice. I would guess this could be handled with appropriate training, with text_description handling accent, question/instruction/statement, and tone (mild/aggressive, ...).
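
That "appropriate training" could, for instance, be conditioning dropout in the style of classifier-free guidance (a sketch under that assumption, not something the repo implements): randomly blank out one signal during training so the model learns to take voice identity only from the speaker embedding and accent/tone only from text_description.

```python
import torch

def dropout_conditioning(description_states: torch.Tensor,
                         speaker_state: torch.Tensor,
                         p_drop: float = 0.1):
    # Independently blank each conditioning stream with probability
    # p_drop, so neither signal can fully substitute for the other.
    if torch.rand(()) < p_drop:
        description_states = torch.zeros_like(description_states)
    if torch.rand(()) < p_drop:
        speaker_state = torch.zeros_like(speaker_state)
    return description_states, speaker_state
```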
