Releases: DigitalPhonetics/speaker-anonymization
Intermediate Speech Representations for LibriSpeech
This release contains the pipeline's intermediate representations for the LibriSpeech train-clean-360, dev and test data of the VoicePrivacy Challenge (VPC) 2024: linguistic content (phonetic transcription), prosody (pitch, energy, duration), and speaker embeddings (GST, trained jointly with the TTS). Using these precomputed representations instead of computing them from scratch significantly reduces the run time of the pipeline.
Models for our paper "Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy"
This release contains all models of our paper "Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy".
There are three anonymization models (pool, random, and gan), one ASR model, and one FastSpeech 2 and one HiFiGAN model for speech synthesis. All models except the gan anonymization model and the ASR model were already part of release v1.0.
The models for anonymization, TTS and ASR are released as grouped zip folders so that they end up in the directory structure expected by run_inference.py. If you choose a different structure, you need to adjust the paths in run_inference.py accordingly.
Place the unzipped folders in a models directory located directly under the repository root. The structure should look as follows:
speaker-anonymization
└─ models
   └─ anonymization
      └─ gan
      └─ pool_minmax_ecapa+xvector
      └─ random_in-scale_ecapa+xvector
   └─ asr
      └─ asr_improved_tts-phn_en.zip
   └─ tts
      └─ FastSpeech2_Multi
         └─ trained_on_ground_truth.pt
      └─ HiFiGAN_combined
         └─ best.pt
Note: Do not unzip the ASR models but keep them as zip folders! They will be unzipped during runtime.
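As a quick sanity check before running the pipeline, the layout above can be verified with a few lines of Python. This is a minimal sketch and not part of the repository: the names `find_missing` and `EXPECTED_PATHS` are made up for illustration, and the repository root is assumed to be a directory called speaker-anonymization in the current working directory.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_PATHS = [
    "models/anonymization/gan",
    "models/anonymization/pool_minmax_ecapa+xvector",
    "models/anonymization/random_in-scale_ecapa+xvector",
    "models/asr/asr_improved_tts-phn_en.zip",
    "models/tts/FastSpeech2_Multi/trained_on_ground_truth.pt",
    "models/tts/HiFiGAN_combined/best.pt",
]


def find_missing(repo_root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumption: the repository was cloned into ./speaker-anonymization.
    for rel in find_missing("speaker-anonymization"):
        print(f"missing: {rel}")
```

An empty result means every file and folder named in the tree is present. The check does not inspect the contents of the folders or the model files themselves.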
Models for using prosody cloning and GAN-generated speaker embeddings
This release contains all models of our latest pipeline version, which supports GAN-generated artificial speaker embeddings, prosody cloning, and prosody modifications using offsets.
Place the unzipped folders in a models directory located directly under the repository root. The structure should look as follows:
speaker-anonymization
└─ models
   └─ anonymization
      └─ gan_style-embed
         └─ settings.json
         └─ style-embed_wgan.pt
   └─ asr
      └─ asr_branchformer_tts-phn_en.zip
   └─ tts
      └─ Aligner
         └─ aligner.pt
      └─ Embedding
         └─ embedding_function.pt
      └─ FastSpeech2_Multi
         └─ prosody_cloning.pt
      └─ HiFiGAN_combined
         └─ best.pt
Note: Do not unzip the ASR models but keep them as zip folders! They will be unzipped during runtime.
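The unpacking step can also be scripted. The following is a rough sketch under two assumptions that the release notes do not confirm: that each downloaded archive contains its top-level folder (anonymization, asr or tts), and that the archives are named as in the usage line below. The function name `extract_release_archives` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_release_archives(archives, repo_root):
    """Extract each downloaded release archive into <repo_root>/models.

    ZipFile.extractall only unpacks the outer archive, so any .zip file
    stored inside it (such as the ASR model) stays zipped.
    """
    models_dir = Path(repo_root) / "models"
    models_dir.mkdir(parents=True, exist_ok=True)
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(models_dir)
    return models_dir
```

With hypothetical file names, a call could look like `extract_release_archives(["anonymization.zip", "asr.zip", "tts.zip"], "speaker-anonymization")`. Because only the outer archives are extracted, the nested ASR zip file is left as it is, in line with the note above.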
Models for our paper "Speaker Anonymization with Phonetic Intermediate Representations"
This release contains all models as described in our paper "Speaker Anonymization with Phonetic Intermediate Representations".
There are three anonymization models (pool, pool raw and random), three ASR models (phones, STT and TTS), and four FastSpeech2 TTS models (trained_on_ground_truth_phonemes, trained_on_asr_phoneme_outputs, trained_on_libri600_asr_phoneme_outputs and trained_on_libri600_ground_truth_phonemes) together with one HiFiGAN model (best). The models for anonymization, TTS and ASR are released as grouped zip folders so that they end up in the directory structure expected by run_inference.py. If you choose a different structure, you need to adjust the paths in run_inference.py accordingly.
Place the unzipped folders in a models directory located directly under the repository root. The structure should look as follows:
speaker-anonymization
└─ models
   └─ anonymization
      └─ pool_minmax_ecapa+xvector
      └─ pool_raw_ecapa+xvector
      └─ random_in-scale_ecapa+xvector
   └─ asr
      └─ asr_stt_en.zip
      └─ asr_tts_en.zip
      └─ asr_tts-phn_en.zip
   └─ tts
      └─ FastSpeech2_Multi
         └─ trained_on_ground_truth_phonemes.pt
         └─ trained_on_asr_phoneme_outputs.pt
         └─ trained_on_libri600_asr_phoneme_outputs.pt
         └─ trained_on_libri600_ground_truth_phonemes.pt
      └─ HiFiGAN_combined
         └─ best.pt
Note: Do not unzip the ASR models but keep them as zip folders! They will be unzipped during runtime.
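To confirm that the ASR models are still zipped, a small check along the following lines may help. It is only a sketch: `check_asr_archives` and `ASR_ARCHIVES` are illustrative names and not part of the repository, and the archive names are copied from the tree above.

```python
import zipfile
from pathlib import Path

# ASR archive names copied from the directory tree above.
ASR_ARCHIVES = ["asr_stt_en.zip", "asr_tts_en.zip", "asr_tts-phn_en.zip"]


def check_asr_archives(asr_dir, names=ASR_ARCHIVES):
    """Map each archive name to 'ok', 'missing', or 'not a zip file'."""
    status = {}
    for name in names:
        path = Path(asr_dir) / name
        if not path.exists():
            status[name] = "missing"
        elif path.is_file() and zipfile.is_zipfile(path):
            status[name] = "ok"
        else:
            # For example, a directory left behind after unzipping by hand.
            status[name] = "not a zip file"
    return status
```

Calling `check_asr_archives("speaker-anonymization/models/asr")` returns one status per archive; anything other than 'ok' suggests the file is absent or was unzipped manually.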