
Benchmarking-SDGG-Models

Implementation of the paper Benchmarking Speech-Driven Gesture Generation Models for Generalization to Unseen Voices and Noisy Environments.

Step 1: Cloning repositories

  1. Create a directory for your project:
mkdir <name_of_your_project>
  2. Inside your project directory, clone the DiffuseStyleGesture repository:
git clone https://github.com/YoungSeng/DiffuseStyleGesture.git
  3. Also inside your project directory, clone the Benchmarking-SDGG-Models repository:
git clone https://github.com/AI-Unicamp/Benchmarking-SDGG-Models.git
  4. Enter the Benchmarking-SDGG-Models directory and clone the genea_numerical_evaluations repository:
cd Benchmarking-SDGG-Models
git clone https://github.com/genea-workshop/genea_numerical_evaluations.git

Sample here: [image: Structure_of_directories]
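
For reference, a rough sketch of the expected layout after these clones (directory names follow the default clone names above):
<name_of_your_project>/
├── DiffuseStyleGesture/
└── Benchmarking-SDGG-Models/
    └── genea_numerical_evaluations/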

Step 2: Downloading Genea 2023 Datasets

2.1 Download the Genea 2023 Train Dataset. To obtain it, preferably use our Google Drive link or, as a last resort, the official Genea 2023 website on Zenodo.

Put the downloaded directory called trn in ./Benchmarking-SDGG-Models/Dataset/Genea2023/

2.2 Download the Genea 2023 Test Dataset. To obtain it, preferably use our Google Drive link or, as a last resort, the official Genea 2023 website on Zenodo.

Put the downloaded directory called tst in ./Benchmarking-SDGG-Models/Dataset/Genea2023/

2.3 Download the Genea 2023 Validation Dataset. To obtain it, preferably use our Google Drive link or, as a last resort, the official Genea 2023 website on Zenodo.

Put the downloaded directory called val in ./Benchmarking-SDGG-Models/Dataset/Genea2023/

2.4 Download the WAV audios of the Genea 2023 Test Dataset containing only Speaker 1. To get them, you can use our Google Drive link. Copy the downloaded dataset into the directory path below.

Put the downloaded directory called wav_spk_1 in ./Benchmarking-SDGG-Models/Dataset/Genea2023/

Sample here: [image: structure_dataset]
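
As a minimal shell sketch of this placement step, assuming the four downloaded directories (trn, tst, val, wav_spk_1) sit in the project root, which is also your current working directory:
mkdir -p ./Benchmarking-SDGG-Models/Dataset/Genea2023/
mv trn tst val wav_spk_1 ./Benchmarking-SDGG-Models/Dataset/Genea2023/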

Step 3: Generating Unseen Voices

3.1 If you don't want to manually convert the Speaker 1 audio (spk1) to the unseen voices (from ESD) yourself, you can download the already converted audio from this Google Drive.

3.2 Alternatively, follow the steps in this repository to manually convert spk1 to the unseen voices with a pretrained voice conversion model.

Put the downloaded or generated folder called Unseen-Voices-with-VC in ./Benchmarking-SDGG-Models/Dataset/

Step 4: Generating Voices in Noisy Environment (TWH Party Dataset)

4.1 If you don't want to corrupt all the audio files with noise manually, you can download the already corrupted audio from this Google Drive link.

4.2 To corrupt all Speaker 1 audio files, follow the steps in the TWH-Party repository. The repo will output a folder named TWH-Party; just rename it to Voices-in-Noisy-Environment.

Put the Voices-in-Noisy-Environment folder containing the corrupted audios in ./Benchmarking-SDGG-Models/Dataset/
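
A one-line sketch of that rename and placement, assuming the TWH-Party output folder is in the project root:
mv TWH-Party ./Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment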

Step 5: Processing

We tested the code on an NVIDIA Quadro RTX 5000 GPU.

5.1. Running Docker

  1. Create the Docker image with the following command in your terminal:
docker build -t benchmarking_sdgg_models_image .
  2. Run the container with the following command, replacing <path_of_your_project> with the directory path on your local machine (for example, ours was "/work/kevin.colque/DiffuseStyleGesture"); a filled-in example is shown after this list:
docker run --rm -it --gpus all --userns=host --shm-size 64G -v <path_of_your_project>:/workspace/benchmarking_sdgg/ -p 9669:9669 --name BECHMARKING_SDGG_MODELS_CONTAINER benchmarking_sdgg_models_image:latest /bin/bash
  3. Launch the virtual environment with the following command (note that it includes the CUDA activation):
source activate sdgg
  4. Go to our workspace (you can see it listed when the container launches):
cd /workspace/benchmarking_sdgg/
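
For reference, a filled-in example of the run command from item 2, using the example path mentioned above (adjust it for your machine):
docker run --rm -it --gpus all --userns=host --shm-size 64G -v /work/kevin.colque/DiffuseStyleGesture:/workspace/benchmarking_sdgg/ -p 9669:9669 --name BECHMARKING_SDGG_MODELS_CONTAINER benchmarking_sdgg_models_image:latest /bin/bash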

Sample here: [image: Structure of Directories]

5.2. Download the pre-trained models needed for gesture generation

  1. Download the DiffuseStyleGesture pre-trained model files from Google Cloud. Put these two files inside "DiffuseStyleGesture/BEAT-TWH-main/mydiffusion_beat_twh/TWH_mymodel4_512_v0/".
  • Note: If you want to retrain and get your own checkpoints, you can go to the DiffuseStyleGesture+ repository and follow its steps.
  2. Download "WavLM-Large.pt" from Google Cloud. Put this file inside "DiffuseStyleGesture/BEAT-TWH-main/process/WavLM/".

  3. Download "crawl-300d-2M.vec" from Google Cloud. Put this file inside "DiffuseStyleGesture/BEAT-TWH-main/process/".
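
After these downloads, the files should sit at roughly the following paths (the names of the two pre-trained model files depend on the Google Cloud download, so they are not listed here):
DiffuseStyleGesture/BEAT-TWH-main/mydiffusion_beat_twh/TWH_mymodel4_512_v0/   (two pre-trained model files)
DiffuseStyleGesture/BEAT-TWH-main/process/WavLM/WavLM-Large.pt
DiffuseStyleGesture/BEAT-TWH-main/process/crawl-300d-2M.vec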

5.3. Generating Gestures for our two Datasets (VC and TWH-Party)

  1. Download the "generate.py" and "val_2023_v0_014_main-agent.npy" files from Google Cloud. Put these files inside "DiffuseStyleGesture/BEAT-TWH-main/mydiffusion_beat_twh/". (This "generate.py" is similar to the one provided by DiffuseStyleGesture+, with changes specific to our work.)

  2. Generate gestures from the WAV audio files of the "Speaker 1 Test Dataset". To do this, move into "DiffuseStyleGesture/BEAT-TWH-main/mydiffusion_beat_twh/" and run the following command; you need to know the path of the Speaker 1 WAV audio files and the path of the tsv files of the "tst" dataset:

cd DiffuseStyleGesture/BEAT-TWH-main/mydiffusion_beat_twh/
python generate.py --wav_path ./../../../Benchmarking-SDGG-Models/Dataset/Genea2023/wav_spk_1/ --txt_path ./../../../Benchmarking-SDGG-Models/Dataset/Genea2023/tst/main-agent/tsv/

The generated BVH files are located in Benchmarking-SDGG-Models/BVH_generated/sample_model001200000/bvh_spk_1

It worked, right? To obtain the BVH files for the rest of the directories, run the command with the same structure, but don't forget to change the <dataset_X_wav_path> part according to the audio directory you want to generate gestures from.

python generate.py --wav_path <dataset_X_wav_path> --txt_path ./../../../Benchmarking-SDGG-Models/Dataset/Genea2023/tst/main-agent/tsv/
  1. To generate gestures from the Test Dataset with High, Mid, and Low Noisy Environments (TWH-Party), replace <dataset_X_wav_path> with, respectively:
    • ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/high/
    • ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/mid/
    • ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/low/
  2. To generate gestures from the Speaker 1 Test Dataset with Voice Conversion to the Highest Pitch Man, replace <dataset_X_wav_path> with:
    • ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-4_spk12m_high/
  3. To generate gestures from the Speaker 1 Test Dataset with Voice Conversion to the Lowest Pitch Man, replace <dataset_X_wav_path> with:
    • ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-10_spk20m_low/
  4. To generate gestures from the Speaker 1 Test Dataset with Voice Conversion to the Highest Pitch Woman, replace <dataset_X_wav_path> with:
    • ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps3_spk18w_high/
  5. To generate gestures from the Speaker 1 Test Dataset with Voice Conversion to the Lowest Pitch Woman, replace <dataset_X_wav_path> with:
    • ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-5_spk19w_low/
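
If you prefer to generate all of them in one pass, here is a minimal shell loop sketch, assuming you are still inside DiffuseStyleGesture/BEAT-TWH-main/mydiffusion_beat_twh/ and that the dataset folders follow the layout above:
# Hypothetical convenience loop; every path below comes from the list above
for WAVS in \
    ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/high/ \
    ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/mid/ \
    ./../../../Benchmarking-SDGG-Models/Dataset/Voices-in-Noisy-Environment/low/ \
    ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-4_spk12m_high/ \
    ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-10_spk20m_low/ \
    ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps3_spk18w_high/ \
    ./../../../Benchmarking-SDGG-Models/Dataset/Unseen-Voices-with-VC/wav_spk1w_ps-5_spk19w_low/
do
    python generate.py --wav_path "$WAVS" --txt_path ./../../../Benchmarking-SDGG-Models/Dataset/Genea2023/tst/main-agent/tsv/
done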

Step 6: Evaluating the FGD and MSE metrics

6.1 Training the FGD autoencoder

Calculate the 3D positions of the motion data in .bvh format from the trn dataset, which will be used to train the FGD autoencoder. We provide the pretrained autoencoder model_checkpoint_epoch_49_90_246.bin inside './Benchmarking-SDGG-Models/evaluation_metric/output'.

cd ../../../Benchmarking-SDGG-Models/
python training_encoder.py

If you retrain the autoencoder for a second time, add the argument --load True to avoid recalculating the 3D positions of the trn dataset.
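
For example, a second training run might look like:
python training_encoder.py --load True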

The checkpoints model_checkpoint_epoch_xx_90_246.bin generated from the training will be saved in ./Benchmarking-SDGG-Models/evaluation_metric/output.

6.2 Calculate FGD and MSE

Calculate FGD and MSE metrics.

python Computing_FGD.py --model_path 'model_checkpoint_epoch_49_90_246.bin'

If recalculating the FGD for a second time, add the argument --load True to avoid recalculating the 3D positions of each dataset.
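
For example, a second run might look like:
python Computing_FGD.py --model_path 'model_checkpoint_epoch_49_90_246.bin' --load True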

The metric results will be saved in Metrics-results-Noisy-Environment.txt and Metrics-results-Unseen-Voices-VC.txt files, in ./Benchmarking-SDGG-Models/.

Citation

If our work is useful for you, please cite it as:

@inproceedings{sanchez2024benchmarking,
  title={Benchmarking Speech-Driven Gesture Generation Models for Generalization to Unseen Voices and Noisy Environments},
  author={SANCHEZ, JOHSAC ISBAC GOMEZ and Inofuente-Colque, Kevin and Marques, Leonardo Boulitreau de Menezes Martins and Costa, Paula Dornhofer Paro and Tonoli, Rodolfo Luis},
  booktitle={GENEA: Generation and Evaluation of Non-verbal Behaviour for Embodied Agents Workshop 2024},
  year={2024}
}
