
Unified Voice Embedding through Multi-task Learning


  • Project Lead
    1. Dr. Uthayasanker Thayasivam
  • Contributors
    1. Rajenthiran Jenarthanan
    2. Lakshikka Sithamparanathan
    3. Saranya Uthayakumar


Summary

Speech technology has been one of the fastest-evolving and most in-demand areas of the past few decades, thanks to the enormous progress brought by machine learning; the last decade in particular has seen tremendous advances, including the introduction of conversational agents. In this work we describe a multi-task deep metric learning system that learns a single unified audio embedding to power multiple audio/speaker-specific tasks. The solution we present not only allows us to train for multiple application objectives in a single deep neural network architecture, but also takes advantage of correlated information in the combined training data of all applications to generate a unified embedding that outperforms the specialized embeddings previously deployed for each audio/speaker-specific task.
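
To make the shared-embedding idea concrete, here is a minimal Keras sketch of the pattern: one trunk produces a single L2-normalized embedding, and several task-specific heads are trained on it jointly. The layer sizes, input shape, and choice of heads (speaker identity and gender, suggested by libri_speaker_gender.csv) are illustrative assumptions, and the real model in hive-mtl/conv_models.py trains with deep metric learning objectives rather than the plain classification losses used here:

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    # Illustrative shapes only; not the values used by hive-mtl.
    NUM_FRAMES, NUM_MFCC = 160, 64
    EMBEDDING_DIM = 512
    NUM_SPEAKERS = 251  # speaker count in LibriSpeech train-clean-100

    inputs = layers.Input(shape=(NUM_FRAMES, NUM_MFCC, 1))
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = layers.GlobalAveragePooling2D()(x)

    # The single unified embedding, L2-normalized for metric learning.
    embedding = layers.Lambda(
        lambda e: tf.math.l2_normalize(e, axis=1), name='embedding'
    )(layers.Dense(EMBEDDING_DIM)(x))

    # Task-specific heads trained jointly on the shared embedding.
    speaker_out = layers.Dense(NUM_SPEAKERS, activation='softmax',
                               name='speaker_id')(embedding)
    gender_out = layers.Dense(1, activation='sigmoid',
                              name='gender')(embedding)

    model = Model(inputs, [speaker_out, gender_out])
    model.compile(
        optimizer='adam',
        loss={'speaker_id': 'sparse_categorical_crossentropy',
              'gender': 'binary_crossentropy'},
        loss_weights={'speaker_id': 1.0, 'gender': 0.5},
    )

Because every task backpropagates into the same trunk, the single embedding is shaped by all of the training data at once, which is what allows it to absorb correlated information across tasks.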

Directory Structure

The files and directories of the repository are shown below:

    aaivu-unified-voice-embedding-master
    ├── Architecture.png
    ├── docs
    │   └── README.md
    ├── hive-mtl
    │   ├── audio.py
    │   ├── batcher.py
    │   ├── cli.py
    │   ├── constants.py
    │   ├── conv_models.py
    │   ├── download_librispeech.sh
    │   ├── hive-mtl
    │   ├── libri_speaker_gender.csv
    │   ├── requirements.txt
    │   ├── test_pretrained.py
    │   ├── train.py
    │   └── utils.py
    ├── LICENSE
    ├── README.md
    └── src
        └── README.md

Getting started

Install dependencies

Requirements

  • tensorflow>=2.0
  • keras>=2.3.1
  • python>=3.6

Install the dependencies with:

    pip install -r requirements.txt

If you see the error libsndfile not found, install the library with sudo apt-get install libsndfile-dev.

Training

The code for training is available in the hive-mtl directory of this repository.

    sudo chmod -R 777 hive-mtl/     # Give write permission to hive-mtl
    pip uninstall -y tensorflow && pip install tensorflow-gpu
    ./hive-mtl download_librispeech # Download the LibriSpeech dataset
    ./hive-mtl build_mfcc           # Extract and cache MFCC features
    ./hive-mtl build_model_inputs   # Build the model inputs for training
    ./hive-mtl train_mtl            # Train the multi-task model
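
For reference, the build_mfcc step above caches MFCC features for every utterance. The sketch below shows that kind of feature extraction with librosa; the sample rate and coefficient count are illustrative assumptions, not necessarily the values the pipeline uses (see audio.py and constants.py for those):

    import librosa

    # Load one utterance and compute its MFCC feature matrix.
    y, sr = librosa.load('utterance.flac', sr=16000)    # assumed sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=64)  # assumed coefficient count
    print(mfcc.shape)                                   # (n_mfcc, n_frames)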

NOTE: If you want to use your own dataset, make sure you follow the directory structure of LibriSpeech. Audio files have to be in .flac format. If you have .wav files, you can use ffmpeg to make the conversion (for example, ffmpeg -i input.wav output.flac). Both formats are lossless (FLAC is essentially compressed WAV), so the conversion does not degrade audio quality.
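
If you prefer to convert files from Python, the sketch below uses the soundfile package, which wraps the libsndfile library mentioned above; the file names are placeholders:

    import soundfile as sf

    # Read the WAV samples and re-write them as FLAC. Both formats are
    # lossless, so the audio is preserved exactly; soundfile infers the
    # FLAC container from the .flac extension.
    data, sample_rate = sf.read('utterance.wav')   # placeholder input file
    sf.write('utterance.flac', data, sample_rate)  # placeholder output file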

Architecture Diagram

[Architecture diagram: see Architecture.png in the repository root]

References

  • Deep Speaker: An End-to-End Neural Speaker Embedding System, by Chao Li, Xiaokang Ma, Bing Jiang, Xiangang Li, et al.
  • GitHub

Acknowledgments

  • Ketharan Suntharam
  • Sathiyakugan Balakirshnan

License

Apache License 2.0

Code of Conduct

Please read our code of conduct document here.
