Setup Environment

(Tested on a Quadro RTX 5000 with NVIDIA driver version 535.104.05 and CUDA version 12.2, on Ubuntu 22.04.)

Build the Docker image
```
docker build -t whispervits-svc .
```
Enter Docker container
Download the timbre encoder (Speaker-Encoder by @mueller91) and put best_model.pth.tar into speaker_pretrain/.
Download the Whisper model whisper-large-v2. Make sure to download large-v2.pt, and put it into whisper_pretrain/.
Download the hubert_soft model and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.
Download the pitch extractor crepe full and put full.pth into crepe/assets.

Note: the crepe full.pth file is 84.9 MB, not 6 KB.
Download the trained model lesd5_100.pretrain.pth and put it into vits_pretrain/.
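
Before running the conversion, it can help to confirm that every downloaded file landed in the right folder and is not a tiny placeholder. The following is a minimal sketch, not part of the repository: the function name `check_pretrained_files` and the 1 MB size threshold are assumptions made for illustration, and the paths are taken from the download steps above, relative to the repository root.

```python
from pathlib import Path

# Expected files, taken from the download steps above. Paths are relative to
# the repository root (an assumption; adjust if your layout differs).
EXPECTED_FILES = [
    "speaker_pretrain/best_model.pth.tar",
    "whisper_pretrain/large-v2.pt",
    "hubert_pretrain/hubert-soft-0d54a1f4.pt",
    "crepe/assets/full.pth",
    "vits_pretrain/lesd5_100.pretrain.pth",
]

# Hypothetical threshold: anything under 1 MB is treated as a placeholder
# rather than a real checkpoint (e.g. a 6 KB full.pth instead of 84.9 MB).
MIN_SIZE_BYTES = 1_000_000


def check_pretrained_files(root="."):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for rel_path in EXPECTED_FILES:
        path = Path(root) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
        elif path.stat().st_size < MIN_SIZE_BYTES:
            size = path.stat().st_size
            problems.append(f"too small ({size} bytes): {rel_path}")
    return problems


if __name__ == "__main__":
    issues = check_pretrained_files()
    for issue in issues:
        print(issue)
    if not issues:
        print("All pretrained files found.")
```

Run it from the repository root inside the container. Any line it prints names a file to download again.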
Make sure you have downloaded the wav_spk_1 folder from the Benchmarking-SGDD repository, then run the script:

python convert-TWH-spk1.py /path/to/wav_spk_1

The output is a folder containing all the conversions used in the evaluation, the same ones found on this Google Drive.
