Commit c64675b

Merge remote-tracking branch 'upstream/main'

SanBingYouYong committed Aug 5, 2024
2 parents b190b0e + e936484
Showing 38 changed files with 2,749 additions and 1,034 deletions.
3 changes: 3 additions & 0 deletions .gitmodules
@@ -5,3 +5,6 @@
[submodule "ChatTTS"]
path = ChatTTS
url = https://github.com/2noise/ChatTTS.git
[submodule "CosyVoice"]
path = CosyVoice
url = https://github.com/FunAudioLLM/CosyVoice.git
1 change: 1 addition & 0 deletions CosyVoice
Submodule CosyVoice added at 6be8d0
5 changes: 5 additions & 0 deletions LLM/__init__.py
@@ -61,6 +61,11 @@ def init_model(self, model_name, model_path='', api_key=None, proxy_url=None, pr
        llm.prefix_prompt = prefix_prompt
        return llm

    def chat(self, system_prompt, message, history):
        response = self.generate(message, system_prompt)
        history.append((message, response))
        return response, history

    def generate(self, question, system_prompt='system无效'):
        return question
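
A minimal usage sketch of this chat interface, following the `LLM` wrapper pattern documented in the repository (the model name and `init_model` arguments here are assumptions — adjust them to your setup):

```python
# Hypothetical usage of the chat/generate interface in this diff.
from LLM import LLM

llm = LLM(mode='offline').init_model('Linly', 'Linly-AI/Chinese-LLaMA-2-7B-hf')

history = []
response, history = llm.chat(
    system_prompt='You are a helpful assistant.',
    message='Hello, who are you?',
    history=history,
)
print(response)      # the generated reply
print(len(history))  # 1 — each turn appends a (message, response) pair
```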

1 change: 0 additions & 1 deletion MuseV
Submodule MuseV deleted from 43370a
137 changes: 117 additions & 20 deletions README.md
@@ -49,6 +49,10 @@
- **Integrated MuseTalk into Linly-Talker and updated the WebUI, enabling basic real-time conversation capabilities.**
- **The refined WebUI defaults to not loading the LLM model to reduce GPU memory usage. It directly responds with text to complete voiceovers. The enhanced WebUI features three main functions: personalized character generation, multi-turn intelligent dialogue with digital humans, and real-time MuseTalk conversations. These improvements reduce previous GPU memory redundancies and add more prompts to assist users effectively.**

**2024.08 Update** 📆

- **Updated CosyVoice to offer high-quality text-to-speech (TTS) functionality and voice cloning capabilities; also upgraded to Wav2Lipv2 to enhance overall performance.**

---

<details>
@@ -72,10 +76,12 @@
- [Voice Clone](#voice-clone)
- [GPT-SoVITS(Recommend)](#gpt-sovitsrecommend)
- [XTTS](#xtts)
- [CosyVoice](#cosyvoice)
- [Coming Soon](#coming-soon-2)
- [THG - Avatar](#thg---avatar)
- [SadTalker](#sadtalker)
- [Wav2Lip](#wav2lip)
- [Wav2Lipv2](#wav2lipv2)
- [ER-NeRF](#er-nerf)
- [MuseTalk](#musetalk)
- [Coming Soon](#coming-soon-3)
@@ -143,6 +149,7 @@ The design philosophy of Linly-Talker is to create a new form of human-computer
- [x] Linly-Talker WebUI supports multiple modules, multiple models, and multiple options
- [x] Added MuseTalk functionality to Linly-Talker, achieving near real-time speed with very fast communication.
- [x] Integrated MuseTalk into the Linly-Talker WebUI.
- [x] Added CosyVoice, which provides high-quality text-to-speech (TTS) and voice cloning capabilities. Additionally, upgraded to Wav2Lipv2 to improve visual quality.
- [ ] `Real-time` Speech Recognition (Enable conversation and communication between humans and digital entities using voice)

🔆 The Linly-Talker project is ongoing - pull requests are welcome! If you have any suggestions regarding new model approaches, research, techniques, or if you discover any runtime errors, please feel free to edit and submit a pull request. You can also open an issue or contact me directly via email. 📩⭐ If you find this repository useful, please give it a star! 🤩
@@ -174,16 +181,17 @@ Download the code:

```bash
git clone https://github.com/Kedreamix/Linly-Talker.git --depth 1
cd Linly-Talker
git submodule update --init --recursive
```

---

If you are using Linly-Talker, you can set up the environment directly with Anaconda, which covers almost all the dependencies required by the models. The specific steps are as follows:

```bash
conda create -n linly python=3.8
conda activate linly

# PyTorch installation method 1: Install via conda
@@ -198,7 +206,7 @@ conda activate linly
# CUDA 11.8
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

conda install -q ffmpeg==4.2.2  # Install ffmpeg 4.2.2

# Upgrade pip
python -m pip install --upgrade pip
@@ -211,23 +219,38 @@ pip install -r requirements_webui.txt
# Install dependencies related to musetalk
pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmcv==2.1.0"
mim install "mmdet>=3.1.0"
mim install "mmpose>=1.1.0"

# ⚠️ Note: You must first download CosyVoice-ttsfrd. Complete the model download before proceeding with these steps.
mkdir -p CosyVoice/pretrained_models # Create directory CosyVoice/pretrained_models
mv checkpoints/CosyVoice_ckpt/CosyVoice-ttsfrd CosyVoice/pretrained_models # Move directory
unzip CosyVoice/pretrained_models/CosyVoice-ttsfrd/resource.zip -d CosyVoice/pretrained_models/CosyVoice-ttsfrd/ # Unzip into the ttsfrd directory
# This .whl library is only compatible with Python 3.8
pip install CosyVoice/pretrained_models/CosyVoice-ttsfrd/ttsfrd-0.3.6-cp38-cp38-linux_x86_64.whl

# Install NeRF-based dependencies, which might have several issues and can be skipped initially
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
# If you encounter problems installing PyTorch3D, you can use the following command to install it:
# python scripts/install_pytorch3d.py
pip install -r TFG/requirements_nerf.txt

# If you encounter issues with pyaudio, install the corresponding dependencies
sudo apt-get update
sudo apt-get install libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0

# Note the following modules. If installation fails, you can enter the directory and use pip install . or python setup.py install to compile and install:
# NeRF/freqencoder
# NeRF/gridencoder
# NeRF/raymarching
# NeRF/shencoder

# If you encounter sox compatibility issues
# ubuntu
sudo apt-get install sox libsox-dev
# centos
sudo yum install sox sox-devel
```
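
After the steps above, a quick import check can confirm the environment — a minimal sketch, assuming the versions pinned in this section:

```python
# Environment sanity check — a sketch assuming the versions pinned above.
import torch
import mmcv

print(torch.__version__)          # expect 2.0.1
print(torch.cuda.is_available())  # expect True with CUDA 11.7/11.8
print(mmcv.__version__)           # expect 2.1.0

try:
    # Provided by the CosyVoice-ttsfrd wheel (Python 3.8 only)
    import ttsfrd
    print('ttsfrd OK')
except ImportError as exc:
    print('ttsfrd missing:', exc)
```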

Below are some older installation methods. They may cause dependency conflicts, though they rarely produce outright bugs. The updated instructions above are easier and more reliable; you can ignore the versions below, or consult them only if you run into issues.
@@ -465,6 +488,82 @@ Coqui XTTS is a leading deep learning toolkit for Text-to-Speech (TTS) tasks, al
- Experience XTTS online [https://huggingface.co/spaces/coqui/xtts](https://huggingface.co/spaces/coqui/xtts)
- Official GitHub repository: [https://github.com/coqui-ai/TTS](https://github.com/coqui-ai/TTS)

### CosyVoice

CosyVoice is an open-source multilingual speech generation model developed by Alibaba's Tongyi Lab, focused on high-quality speech synthesis. Trained on over 150,000 hours of data, it supports speech synthesis in multiple languages, including Chinese, English, Japanese, Cantonese, and Korean, and excels at multilingual speech generation, zero-shot voice generation, cross-lingual voice synthesis, and instruction following.

CosyVoice supports one-shot voice cloning: from only 3 to 10 seconds of reference audio, it generates realistic, natural-sounding speech that reproduces details such as prosody and emotion.

GitHub project link: [CosyVoice GitHub](https://github.com/FunAudioLLM/CosyVoice)

CosyVoice ships several pre-trained speech synthesis models:

1. **CosyVoice-300M**: Supports zero-shot and cross-lingual speech synthesis in Chinese, English, Japanese, Cantonese, Korean, and other languages.
2. **CosyVoice-300M-SFT**: A model focused on supervised fine-tuning (SFT) inference.
3. **CosyVoice-300M-Instruct**: A model supporting instruction-based inference, capable of generating speech with specific tones, emotions, and other attributes.

**Key Features**

1. **Multilingual Support**: Handles multiple languages, including Chinese, English, Japanese, Cantonese, and Korean.
2. **Multi-style Speech Synthesis**: Controls the tone and emotion of the generated speech through instructions.
3. **Streaming Inference Support**: Future updates will add streaming inference modes, such as KV caching and SDPA, for real-time optimization.

Currently, Linly-Talker integrates three CosyVoice features: pre-trained voice cloning, 3s rapid cloning, and cross-lingual cloning. Stay tuned for more updates to Linly-Talker. Below are some examples of CosyVoice's capabilities (a usage sketch follows the table):
<table>
<tr>
<th></th>
<th align="center">PROMPT TEXT</th>
<th align="center">PROMPT SPEECH</th>
<th align="center">TARGET TEXT</th>
<th align="center">RESULT</th>
</tr>
<tr>
<td align="center"><strong>Pre-trained Voice</strong></td>
<td align="center">中文女 音色('中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女'</td>
<td align="center"></td>
<td align="center">你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?</td>
<td align="center">
[sft.webm](https://github.com/user-attachments/assets/a9f9c8c4-7137-4845-9adb-a93ac304131e)
</td>
</tr>
<tr>
<td align="center"><strong>3s Language Cloning</strong></td>
<td align="center">希望你以后能够做的比我还好呦。</td>
<td align="center">
[zero_shot_prompt.webm](https://github.com/user-attachments/assets/1ef09db6-42e5-42d2-acc2-d44e70b147f9)
</td>
<td align="center">收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。</td>
<td align="center">
[zero_shot.webm](https://github.com/user-attachments/assets/ba46c58f-2e16-4440-b920-51ec288f09e6)
</td>
</tr>
<tr>
<td align="center"><strong>Cross-lingual Cloning</strong></td>
<td align="center">在那之后,完全收购那家公司,因此保持管理层的一致性,利益与即将加入家族的资产保持一致。这就是我们有时不买下全部的原因。</td>
<td align="center">
[cross_lingual_prompt.webm](https://github.com/user-attachments/assets/378ae5e6-b52a-47b4-b0db-d84d1edd6e56)
</td>
<td align="center">
&lt;|en|&gt;And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that's coming into the family is a reason why sometimes we don't buy the whole thing.
</td>
<td align="center">
[cross_lingual.webm](https://github.com/user-attachments/assets/b0162fc8-5738-4642-9fdd-b388a4965546)
</td>
</tr>
</table>
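
A minimal sketch of the three integrated modes, adapted from the upstream CosyVoice examples. The model paths, the return format (a dict containing `tts_speech`), and the 22050 Hz output rate are assumptions that may differ across CosyVoice versions:

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

# Pre-trained voice: the SFT model exposes fixed speaker IDs such as '中文女'
cosyvoice = CosyVoice('CosyVoice/pretrained_models/CosyVoice-300M-SFT')
out = cosyvoice.inference_sft('你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?', '中文女')
torchaudio.save('sft.wav', out['tts_speech'], 22050)

# 3s rapid (zero-shot) cloning: a 3-10 s, 16 kHz reference clip plus its transcript
cosyvoice = CosyVoice('CosyVoice/pretrained_models/CosyVoice-300M')
prompt = load_wav('zero_shot_prompt.wav', 16000)
out = cosyvoice.inference_zero_shot(
    '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
    '希望你以后能够做的比我还好呦。',
    prompt,
)
torchaudio.save('zero_shot.wav', out['tts_speech'], 22050)

# Cross-lingual cloning: keep the reference voice, switch the target language
out = cosyvoice.inference_cross_lingual(
    '<|en|>And then later on, fully acquiring that company.', prompt)
torchaudio.save('cross_lingual.wav', out['tts_speech'], 22050)
```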

### Coming Soon

Welcome everyone to provide suggestions, motivating me to continuously update the models and enrich the functionality of Linly-Talker.
@@ -502,7 +601,15 @@ Before usage, download the Wav2Lip model:
| Expert Discriminator | Weights of the expert discriminator | [Link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/EQRvmiZg-HRAjvI6zqN9eTEBP74KefynCwPWVmF57l-AYA?e=ZRPHKP) |
| Visual Quality Discriminator | Weights of the visual disc trained in a GAN setup | [Link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/EQVqH88dTm1HjlK11eNba5gBbn15WMS0B0EZbDBttqrqkg?e=ic0ljo) |

### Wav2Lipv2

Inspired by the repository [https://github.com/primepake/wav2lip_288x288](https://github.com/primepake/wav2lip_288x288), Wav2Lipv2 uses a newly trained 288×288 model to achieve higher-quality results.

Additionally, employing YOLO for face detection improves the overall result. You can compare and test the results in Linly-Talker. The model has been updated; the comparison is below, followed by a short face-detection sketch:

| Wav2Lip | Wav2Lipv2 |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| <video src="https://github.com/user-attachments/assets/d61df5cf-e3b9-4057-81fc-d69dcff806d6"></video> | <video src="https://github.com/user-attachments/assets/7f6be271-2a4d-4d9c-98f8-db25816c28b3"></video> |
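
To illustrate the YOLO face-detection step described above — a sketch only, not the repository's actual code; the `ultralytics` package and the face-detection checkpoint name are assumptions:

```python
import cv2
from ultralytics import YOLO

detector = YOLO('yolov8n-face.pt')  # hypothetical face-detection weights

frame = cv2.imread('frame.jpg')     # one video frame
results = detector(frame)

# Crop the highest-confidence face box as the input region for lip-sync
boxes = results[0].boxes
if len(boxes) > 0:
    x1, y1, x2, y2 = map(int, boxes.xyxy[0].tolist())
    face_crop = frame[y1:y2, x1:x2]
    cv2.imwrite('face_crop.jpg', face_crop)
```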

### ER-NeRF

@@ -670,9 +777,9 @@ The current features available in the WebUI are as follows:

- [x] Multiple modules➕Multiple models➕Multiple choices
- [x] Multiple role selections: Female/Male/Custom (each part can automatically upload images) Coming Soon
- [x] Multiple TTS model selections: EdgeTTS / PaddleTTS / GPT-SoVITS / CosyVoice / Coming Soon
- [x] Multiple LLM model selections: Linly / Qwen / ChatGLM / GeminiPro / ChatGPT / Coming Soon
- [x] Multiple Talker model selections: Wav2Lip / Wav2Lipv2 / SadTalker / ERNeRF / MuseTalk / Coming Soon
- [x] Multiple ASR model selections: Whisper / FunASR / Coming Soon

![](docs/WebUI2.png)
@@ -887,16 +994,6 @@ Linly-Talker/
└── README.md
```

## Support Us

| Alipay | WeChatPay |
| -------------------- | ----------------------- |
| ![](docs/Alipay.jpg) | ![](docs/WeChatpay.jpg) |

## Reference

**ASR**
