
Add Voice Chat feature #236

Merged
merged 2 commits into SWivid:main on Oct 24, 2024
Conversation

jpgallegoar
Collaborator

@SWivid I will not merge this one myself because I need your confirmation. It adds an AI Voice Chat feature using the Qwen2.5-3B LLM (very fast, only ~3 GB), but it needs pip install transformers_stream_generator.

I hope you like it.

I also added it to the README.md features section and to requirements.txt.

@jpgallegoar
Collaborator Author

It fails some quality checks, but the code works well; I tested it a lot. I will make it pass the checks tomorrow, after the commit is merged.

@SWivid SWivid merged commit 1f582a6 into SWivid:main Oct 24, 2024
1 check failed
@SWivid
Owner

SWivid commented Oct 24, 2024

Hi @jpgallegoar, I just tested it; very cool chat!

A few tweaks might need your help, if you have time:

  1. Could the ref_text be saved so the ASR pipeline is not called again for transcription when ref_audio has not changed? (Smoother interaction if so, or maybe it already works that way and I got it wrong.) It may be worth factoring this out as a function in utils_infer.py, since it is widely applicable to all the other apps.
  2. The chat may add extra loading time and GPU memory for people who just want basic TTS. Maybe we could add a button to the chat (load Qwen2.5 and Whisper, and only then show the interface). It would also help if switching between F5/E2 offloaded the counterpart model.

Greetings.
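The ref_text caching from point 1 could be sketched as a small content-keyed cache. This is a minimal sketch, not the repository's actual code: `transcribe_ref`, `_audio_key`, and the module-level `_ref_text_cache` are hypothetical names for a helper that could live in utils_infer.py, and the ASR pipeline is treated as a plain callable.

```python
import hashlib

# Hypothetical module-level cache mapping an audio fingerprint to its ref_text.
_ref_text_cache = {}

def _audio_key(path):
    """Fingerprint the reference audio by content hash, so an edited file busts the cache."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def transcribe_ref(path, asr):
    """Return the saved ref_text when ref_audio is unchanged; run ASR only on a cache miss."""
    key = _audio_key(path)
    if key not in _ref_text_cache:
        _ref_text_cache[key] = asr(path)
    return _ref_text_cache[key]
```

Hashing the file contents (rather than keying on the path) means re-uploading the same clip also skips transcription; keying on path plus mtime would be cheaper for very long files.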

@jpgallegoar
Collaborator Author

Thank you for the input; I also thought about the extra loading time. I will look into it and update.
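The extra-loading-time concern (point 2 above) could be handled with a lazy holder that defers the heavy model loads until the chat is opened. This is a sketch under assumptions: `ChatModels` and the `loader` callable are hypothetical names, and the real loader would do the Qwen2.5 / Whisper from_pretrained calls behind a "Load chat" button.

```python
class ChatModels:
    """Defer loading the chat-only models (LLM + ASR) until the chat UI is opened."""

    def __init__(self, loader):
        # `loader` is an assumed callable that does the expensive work
        # (e.g. the from_pretrained calls) and returns the loaded objects.
        self._loader = loader
        self._models = None

    @property
    def models(self):
        # First access pays the loading cost once; later accesses reuse the objects.
        if self._models is None:
            self._models = self._loader()
        return self._models

    def unload(self):
        # Drop references so the counterpart model can take the GPU
        # (relevant to the F5/E2 switching point).
        self._models = None
```

The same pattern would also cover the F5/E2 switch: unload one holder before touching the other's `models` property.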

@jpgallegoar
Collaborator Author

Dear @SWivid, I was wondering if you had any timeline for the model retraining you were considering. What enhancements will it add? I thought about these ideas: longer context length thanks to a dataset with longer audios, multilanguage support, phoneme-level alignment, and emotional controllability using emotion-specific tokens.

@SWivid
Owner

SWivid commented Oct 29, 2024

@jpgallegoar Phoneme-level alignment is exactly what we are moving away from; see https://aka.ms/e2tts/
We will probably go with:

  1. Multilanguage
  2. Model structure exploration addressing long sample generation

As for longer context length from a dataset with longer audios, and emotional controllability using emotion-specific tokens:

These are mainly limited by data scarcity, as we all know; we will keep an eye on it.

@jpgallegoar
Collaborator Author

Thank you for your answer. I am very interested in future developments.
