- Project Lead
- Dr. Uthayasanker Thayasivam
- Contributors
- Rajenthiran Jenarthanan
- Lakshikka Sithamparanathan
- Saranya Uthayakumar
Useful Links
- Pretrained Models
- LibriSpeech
- Talk Forum
Speech technology has been one of the most rapidly evolving and in-demand fields over the past few decades, driven by the progress of machine learning. The past decade in particular has brought tremendous advances, including the introduction of conversational agents. In this work we describe a multi-task deep metric learning system that learns a single unified audio embedding which can power multiple audio/speaker-specific tasks. The solution we present not only allows us to train for multiple application objectives in a single deep neural network architecture, but also takes advantage of correlated information in the combined training data from each application to generate a unified embedding that outperforms all specialized embeddings previously deployed for audio/speaker-specific tasks.
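As a rough illustration of the idea, the sketch below shows a shared convolutional trunk producing a single L2-normalized embedding, with two task-specific heads trained jointly on top of it. This is a minimal sketch only: the layer sizes, embedding dimension, task heads (speaker identification and gender classification, suggested by libri_speaker_gender.csv), and loss weights are assumptions, and the metric-learning objectives mentioned above are omitted for brevity. The actual architecture lives in conv_models.py and train.py.

```python
# Minimal sketch of a shared-embedding multi-task model (illustrative only;
# sizes, heads, and losses are assumptions, not the repository's exact setup).
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_FRAMES, NUM_MFCC = 160, 64   # assumed input shape: (frames, MFCC bins, 1)
NUM_SPEAKERS = 251               # e.g. LibriSpeech train-clean-100

inputs = layers.Input(shape=(NUM_FRAMES, NUM_MFCC, 1))

# Shared convolutional trunk that produces the unified embedding.
x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(inputs)
x = layers.Conv2D(128, 5, strides=2, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
embedding = layers.Dense(512, name="unified_embedding")(x)
embedding = layers.Lambda(lambda e: tf.math.l2_normalize(e, axis=1))(embedding)

# Task-specific heads trained jointly on top of the shared embedding.
speaker_out = layers.Dense(NUM_SPEAKERS, activation="softmax", name="speaker")(embedding)
gender_out = layers.Dense(1, activation="sigmoid", name="gender")(embedding)

model = Model(inputs, [speaker_out, gender_out])
model.compile(
    optimizer="adam",
    loss={"speaker": "sparse_categorical_crossentropy", "gender": "binary_crossentropy"},
    loss_weights={"speaker": 1.0, "gender": 0.5},  # assumed weighting
)
model.summary()
```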
The files and directories of the repository are shown below:
aaivu-unified-voice-embedding-master
├── Architecture.png
├── docs
│ └── README.md
├── hive-mtl
│ ├── audio.py
│ ├── batcher.py
│ ├── cli.py
│ ├── constants.py
│ ├── conv_models.py
│ ├── download_librispeech.sh
│ ├── hive-mtl
│ ├── libri_speaker_gender.csv
│ ├── requirements.txt
│ ├── test_pretrained.py
│ ├── train.py
│ └── utils.py
├── LICENSE
├── README.md
└── src
└── README.md
- tensorflow>=2.0
- keras>=2.3.1
- python>=3.6
pip install -r requirements.txt
If you see this error: libsndfile not found, run: sudo apt-get install libsndfile-dev
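After installing the requirements, an optional sanity check (not part of the repository's own instructions) is to confirm that TensorFlow 2.x imports correctly and whether a GPU is visible:

```python
# Optional sanity check: report the TensorFlow version and any visible GPUs.
# tf.config.list_physical_devices is available in TensorFlow 2.1 and later.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```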
The code for training is available in this repository.
sudo chmod -R 777 hive-mtl/              # Give write permission to hive-mtl
pip uninstall -y tensorflow && pip install tensorflow-gpu   # Use the GPU build of TensorFlow (if a GPU is available)
./hive-mtl download_librispeech          # Download the LibriSpeech dataset
./hive-mtl build_mfcc                    # Extract and cache MFCC features (see the sketch below)
./hive-mtl build_model_inputs            # Prepare the model inputs for training
./hive-mtl train_mtl                     # Train the multi-task network
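To give a sense of what the build_mfcc step involves, here is a minimal sketch of MFCC extraction for a single utterance. It uses librosa purely for illustration; the repository's own preprocessing lives in audio.py, and the sample rate and number of coefficients below are assumptions.

```python
# Minimal sketch of MFCC extraction for one utterance (illustrative only; the
# repository's preprocessing is in audio.py and its parameters may differ).
import librosa

SAMPLE_RATE = 16000   # LibriSpeech is distributed at 16 kHz
N_MFCC = 64           # assumed number of MFCC coefficients

def extract_mfcc(flac_path):
    # Load the FLAC file and resample to the target rate.
    signal, sr = librosa.load(flac_path, sr=SAMPLE_RATE)
    # Compute MFCC features: array of shape (N_MFCC, num_frames).
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC)

# Example (hypothetical path following the LibriSpeech layout):
# features = extract_mfcc("LibriSpeech/dev-clean/84/121123/84-121123-0000.flac")
```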
NOTE: If you want to use your own dataset, make sure you follow the LibriSpeech directory structure. Audio files have to be in .flac format. If you have .wav files, you can use ffmpeg to do the conversion (a sketch is given below). Both formats are lossless (FLAC is compressed WAV).
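For batch conversion, a small helper along these lines works; this is a sketch that assumes ffmpeg is installed and on your PATH, and the directory name is just an example.

```python
# Sketch of a .wav -> .flac batch conversion using ffmpeg (assumes ffmpeg is
# installed and on PATH; adjust the input directory to your own layout).
import pathlib
import subprocess

def convert_wav_to_flac(root_dir):
    for wav_path in pathlib.Path(root_dir).rglob("*.wav"):
        flac_path = wav_path.with_suffix(".flac")
        # -y overwrites an existing output file without prompting.
        subprocess.run(["ffmpeg", "-y", "-i", str(wav_path), str(flac_path)], check=True)

# Example: convert_wav_to_flac("my_dataset/")
```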
- Deep Speaker: An End-to-End Neural Speaker Embedding System by Chao Li, Xiaokang Ma, Bing Jiang, Xiangang Li, et al.
- GitHub
- Ketharan Suntharam
- Sathiyakugan Balakirshnan
Apache License 2.0
Please read our code of conduct document here.