Implementation of the paper Benchmarking Speech-Driven Gesture Generation Models for Generalization to Unseen Voices and Noisy Environments.
- Create a directory for your project
mkdir <name_of_your_project>
- Inside your project directory, clone the DiffuseStyleGesture repository.
git clone https://github.com/YoungSeng/DiffuseStyleGesture.git
- Inside your project directory, clone the Benchmarking-SDGG-Models repository.
git clone https://github.com/AI-Unicamp/Benchmarking-SDGG-Models.git
- Enter your Benchmarking-SDGG-Models directory and clone the genea_numerical_evaluations repository.
cd Benchmarking-SDGG-Models
git clone https://github.com/genea-workshop/genea_numerical_evaluations.git
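After cloning, your project directory should look roughly like this (a sketch; <name_of_your_project> is whatever you chose above):
<name_of_your_project>/
├── DiffuseStyleGesture/
└── Benchmarking-SDGG-Models/
    └── genea_numerical_evaluations/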
2.1 Download the Genea 2023 Train Dataset. Preferably use our Google Drive link; as a last resort, you can use the official Genea 2023 website on Zenodo.
Put the downloaded directory called trn in ./Benchmarking-SDGG-Models/Dataset/Genea2023/
2.2 Download the Genea 2023 Test Dataset. Preferably use our Google Drive link; as a last resort, you can use the official Genea 2023 website on Zenodo.
Put the downloaded directory called tst in ./Benchmarking-SDGG-Models/Dataset/Genea2023/
2.3 Download the Genea 2023 Validation Dataset. Preferably use our Google Drive link; as a last resort, you can use the official Genea 2023 website on Zenodo.
Put the downloaded directory called val in ./Benchmarking-SDGG-Models/Dataset/Genea2023/
2.4 Download the WAV audios of the Genea 2023 Test Dataset with only Speaker 1. To get them, you can use our Google Drive link. Copy the downloaded dataset to the following directory path.
Put the downloaded directory called wav_spk_1 in ./Benchmarking-SDGG-Models/Dataset/Genea2023/
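After steps 2.1 to 2.4, the Genea2023 directory should contain the following (a sketch based on the steps above; the internal structure of each split follows the official Genea 2023 release):
./Benchmarking-SDGG-Models/Dataset/Genea2023/
├── trn/
├── tst/
├── val/
└── wav_spk_1/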
3.1 If you don't want to manually convert spk1 to the unseen voices (from ESD) yourself, you can download them already converted from this Google Drive.
3.2 Alternatively, follow the steps in this repository to manually convert spk1 to the unseen voices with a pretrained voice conversion model.
Put the downloaded or generated folder called Unseen-Voices-with-VC in ./Benchmarking-SDGG-Models/Dataset/
4.1 If you don't want to corrupt all the audios with noise manually, you can download them already corrupted from this Google Drive link.
4.2 To corrupt all the Speaker 1 audios, follow the steps in the TWH-Party repository. The repo will output a folder named TWH-Party; just rename it to Voices-in-Noisy-Environment.
Put the Voices-in-Noisy-Environment folder containing the corrupted audios in ./Benchmarking-SDGG-Models/Dataset/
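After sections 3 and 4, the full Dataset directory should look roughly like this (a sketch; the Unseen-Voices-with-VC subfolder names match the ones used in the generation commands further below):
./Benchmarking-SDGG-Models/Dataset/
├── Genea2023/
├── Unseen-Voices-with-VC/
│   ├── wav_spk1w_ps-4_spk12m_high/
│   ├── wav_spk1w_ps-10_spk20m_low/
│   ├── wav_spk1w_ps3_spk18w_high/
│   └── wav_spk1w_ps-5_spk19w_low/
└── Voices-in-Noisy-Environment/
    ├── high/
    ├── mid/
    └── low/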
We tested the code on an NVIDIA Quadro RTX 5000:
- Create the Docker image with the following command in your terminal:
docker build -t benchmarking_sdgg_models_image .
- Run the container with the following command in your terminal. Note that you must replace <path_of_your_project> with the directory path on your local machine; for example, ours was "/work/kevin.colque/DiffuseStyleGesture", but yours will be a different path:
docker run --rm -it --gpus all --userns=host --shm-size 64G -v <path_of_your_project>:/workspace/benchmarking_sdgg/ -p 9669:9669 --name BENCHMARKING_SDGG_MODELS_CONTAINER benchmarking_sdgg_models_image:latest /bin/bash
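For example, with a hypothetical project path /work/<your_user>/benchmarking_project, the command would look like:
docker run --rm -it --gpus all --userns=host --shm-size 64G -v /work/<your_user>/benchmarking_project:/workspace/benchmarking_sdgg/ -p 9669:9669 --name BENCHMARKING_SDGG_MODELS_CONTAINER benchmarking_sdgg_models_image:latest /bin/bash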
- Launch the virtual environment with the following command (note that this also activates CUDA):
source activate sdgg
- Go to our workspace (you can see it once the container is launched):
cd /workspace/benchmarking_sdgg/
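An optional sanity check to confirm that the environment is active and the GPU is visible inside the container; this assumes PyTorch is installed in the sdgg environment, which the models require:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"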
- Download the files of DiffuseStyleGesture's pre-trained model from Google Cloud. Put these two files inside "DiffuseStyleGesture/BEAT-TWH-main/mydiffusion_beat_twh/TWH_mymodel4_512_v0/".
- Note: if you want to retrain and get your own checkpoints, you can go to the DiffuseStyleGesture+ repository and follow its steps.
- Download "WavLM-Large.pt" from Google Cloud. Put this file inside "DiffuseStyleGesture/BEAT-TWH-main/process/WavLM/".
- Download "crawl-300d-2M.vec" from Google Cloud. Put this file inside "DiffuseStyleGesture/BEAT-TWH-main/process/".
- Download the "generate.py" and "val_2023_v0_014_main-agent.npy" files from Google Cloud. Put these files inside "DiffuseStyleGesture/BEAT-TWH-main/mydiffusion_beat_twh/". (This "generate.py" is similar to the one provided by DiffuseStyleGesture+, with the respective changes for our work.)
- Generate gestures from the WAV audio files of the "Speaker 1 Test Dataset". To do this, go to "DiffuseStyleGesture/BEAT-TWH-main/mydiffusion_beat_twh/" and run the following command; you need to know the path of the Speaker 1 WAV audio files and the path of the tsv files of the "tst" dataset:
cd DiffuseStyleGesture/BEAT-TWH-main/mydiffusion_beat_twh/
python generate.py --wav_path ./../../../Benchmarking-SDGG-Models/Dataset/Genea2023/wav_spk_1/ --txt_path ./../../../Benchmarking-SDGG-Models/Dataset/Genea2023/tst/main-agent/tsv/
The generated BVH files are located in Benchmarking-SDGG-Models/BVH_generated/sample_model001200000/bvh_spk_1
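A quick way to check the generated files (run from the same mydiffusion_beat_twh directory; the path mirrors the output location above):
ls ./../../../Benchmarking-SDGG-Models/BVH_generated/sample_model001200000/bvh_spk_1/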
It worked, right? To obtain the BVH files for the rest of the directories, run the command with the same structure, but don't forget to change the <dataset_X_wav_path> part according to the BVH files you want to generate (a loop over all the paths is sketched after the list below).
python generate.py --wav_path <dataset_X_wav_path> --txt_path ./../../../Benchmarking-SDGG-Models/Dataset/Genea2023/tst/main-agent/tsv/
- To generate gestures from the Test Dataset with High, Mid, and Low Noisy Environments (TWH-Party), respectively:
- replace with: ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/high/
- replace with: ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/mid/
- replace with: ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/low/
- To generate gestures from the Speaker 1 Test Dataset with Voice Conversion to Highest Pitch Man:
- replace with: ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-4_spk12m_high/
- To generate gestures from the Speaker 1 Test Dataset with Voice Conversion to Lowest Pitch Man:
- replace with: ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-10_spk20m_low/
- To generate gestures from the Speaker 1 Test Dataset with Voice Conversion to Highest Pitch Woman:
- replace with: ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps3_spk18w_high/
- To generate gestures from the Speaker 1 Test Dataset with Voice Conversion to Lowest Pitch Woman:
- replace with: ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-5_spk19w_low/
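A minimal shell sketch that loops generate.py over all of the paths above (run from DiffuseStyleGesture/BEAT-TWH-main/mydiffusion_beat_twh/, reusing the tst tsv path from the earlier command; keep only the subsets you need):
for WAVS in \
  ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/high/ \
  ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/mid/ \
  ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/low/ \
  ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-4_spk12m_high/ \
  ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-10_spk20m_low/ \
  ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps3_spk18w_high/ \
  ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-5_spk19w_low/
do
  # One generation run per audio folder, using the same tsv transcripts as before
  python generate.py --wav_path "$WAVS" --txt_path ./../../../Benchmarking-SDGG-Models/Dataset/Genea2023/tst/main-agent/tsv/
done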
Calculate the 3D positions of the .bvh motion data from the trn dataset; these will be used to train the FGD autoencoder. We provide the pretrained autoencoder model_checkpoint_epoch_49_90_246.bin, located inside './Benchmarking-SDGG-Models/evaluation_metric/output'.
cd ../../../Benchmarking-SDGG-Models/
python training_encoder.py
If you retrain the autoencoder for a second time, add the argument --load True to avoid recalculating the 3D positions of the trn dataset.
The checkpoints model_checkpoint_epoch_xx_90_246.bin generated from the training will be saved in ./Benchmarking-SDGG-Models/evaluation_metric/output.
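For example, a second training run that reuses the cached 3D positions:
python training_encoder.py --load True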
Calculate FGD and MSE metrics.
python Computing_FGD.py --model_path 'model_checkpoint_epoch_49_90_246.bin'
If recalculating the FGD for a second time, add the argument --load True to avoid recalculating the 3D positions of each dataset.
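For example, a second FGD run that reuses the cached 3D positions:
python Computing_FGD.py --model_path 'model_checkpoint_epoch_49_90_246.bin' --load True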
The metric results will be saved in the Metrics-results-Noisy-Environment.txt and Metrics-results-Unseen-Voices-VC.txt files, in ./Benchmarking-SDGG-Models/.
If our work is useful for you, please cite as:
@inproceedings{sanchez2024benchmarking,
  title={Benchmarking Speech-Driven Gesture Generation Models for Generalization to Unseen Voices and Noisy Environments},
  author={Sanchez, Johsac Isbac Gomez and Inofuente-Colque, Kevin and Marques, Leonardo Boulitreau de Menezes Martins and Costa, Paula Dornhofer Paro and Tonoli, Rodolfo Luis},
  booktitle={GENEA: Generation and Evaluation of Non-verbal Behaviour for Embodied Agents Workshop 2024},
  year={2024}
}