
Add Voice Chat feature #236

Merged
merged 2 commits into SWivid:main on Oct 24, 2024
Conversation

jpgallegoar
Collaborator

@SWivid I will not merge this one myself because I need your confirmation. It adds an AI Voice Chat feature using the Qwen2.5-3B LLM (very fast, only ~3 GB), but it needs pip install transformers_stream_generator.

I hope you like it.

I also added it to the README.md features section and to requirements.txt.

@jpgallegoar
Collaborator Author

It fails some quality checks, but the code works well; I tested it a lot. I will make it pass the checks tomorrow, after the commit is merged.

@SWivid SWivid merged commit 1f582a6 into SWivid:main Oct 24, 2024
1 check failed
@SWivid
Owner

SWivid commented Oct 24, 2024

Hi @jpgallegoar, I just tested it; very cool chat!

A few tweaks might need your help, if you have time:

  1. Could the ref_text be saved so the ASR pipeline is not called again for transcription when ref_audio has not changed? (Smoother interaction if so, or maybe it already works that way and I got it wrong.) It may be worth factoring this out as a function in utils_infer.py, since it is widely applicable to all the other apps.
  2. The chat may add extra loading time and GPU memory for people who just want basic TTS. Maybe we could add a button to the chat (load Qwen2.5 and Whisper, and only then show the interface). It would also help if switching between F5/E2 offloaded the counterpart model.

Greetings.
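The ref_text caching from point 1 could be sketched as a small content-keyed cache. This is a minimal sketch, not the repository's actual code: `transcribe_ref`, `_audio_key`, and the module-level `_ref_text_cache` are hypothetical names for a helper that could live in utils_infer.py, and the ASR pipeline is treated as a plain callable.

```python
import hashlib

# Hypothetical module-level cache mapping an audio fingerprint to its ref_text.
_ref_text_cache = {}

def _audio_key(path):
    """Fingerprint the reference audio by content hash, so an edited file busts the cache."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def transcribe_ref(path, asr):
    """Return the saved ref_text when ref_audio is unchanged; run ASR only on a cache miss."""
    key = _audio_key(path)
    if key not in _ref_text_cache:
        _ref_text_cache[key] = asr(path)
    return _ref_text_cache[key]
```

Hashing the file contents (rather than keying on the path) means re-uploading the same clip also skips transcription; keying on path plus mtime would be cheaper for very long files.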

@jpgallegoar
Collaborator Author

Thank you for the input; I also thought about the extra loading time. I will look into it and update.
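The extra-loading-time concern (point 2 above) could be handled with a lazy holder that defers the heavy model loads until the chat is opened. This is a sketch under assumptions: `ChatModels` and the `loader` callable are hypothetical names, and the real loader would do the Qwen2.5 / Whisper from_pretrained calls behind a "Load chat" button.

```python
class ChatModels:
    """Defer loading the chat-only models (LLM + ASR) until the chat UI is opened."""

    def __init__(self, loader):
        # `loader` is an assumed callable that does the expensive work
        # (e.g. the from_pretrained calls) and returns the loaded objects.
        self._loader = loader
        self._models = None

    @property
    def models(self):
        # First access pays the loading cost once; later accesses reuse the objects.
        if self._models is None:
            self._models = self._loader()
        return self._models

    def unload(self):
        # Drop references so the counterpart model can take the GPU
        # (relevant to the F5/E2 switching point).
        self._models = None
```

The same pattern would also cover the F5/E2 switch: unload one holder before touching the other's `models` property.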

@jpgallegoar
Collaborator Author

Dear @SWivid, I was wondering if you had any timeline for the model retraining you were considering. What enhancements will it add? I thought about these ideas: longer context length thanks to a dataset with longer audios, multilanguage support, phoneme-level alignment, and emotional controllability using emotion-specific tokens.

@SWivid
Owner

SWivid commented Oct 29, 2024

@jpgallegoar Phoneme-level alignment is exactly what we are moving away from; see https://aka.ms/e2tts/
We will probably go with:

  1. Multilanguage
  2. Model structure exploration addressing long sample generation

As for longer context length from a dataset with longer audios, and emotional controllability using emotion-specific tokens:

These are mainly limited by data scarcity, as we all know; we will keep an eye on it.

@jpgallegoar
Collaborator Author

Thank you for your answer. I am very interested in future developments.
