Thesis project for Speech Separation using Deep Learning

NikhilC2209/AVSpeech_Sep

Speech Separation (Final Year Thesis)

Thesis project for Speech Separation using Deep Learning

Installation & Dataset Setup

Installing Dependencies

pip install -r requirements.txt

Setting up MUSDB18 for training (optional)

Convert from STEMS format to .wav format

musdbconvert path/to/musdb-stems-root path/to/new/musdb-wav-root

Download the LibriSpeech corpus from https://www.openslr.org/12 to create synthetic mixtures.

Setting up LibriMix for training

LibriMix is an open source dataset for source separation in noisy environments. It is derived from LibriSpeech signals (clean subset) and WHAM noise. It offers a free alternative to the WHAM dataset and complements it. It will also enable cross-dataset experiments.

Generating LibriMix

Features

In LibriMix you can choose:

  • The number of sources in the mixtures.
  • The sample rate of the dataset: 16 kHz or any lower frequency.
  • The mode of mixtures: min (the mixture ends when the shortest source ends) or max (the mixture ends when the longest source ends).
  • The type of mixture: mix_clean (utterances only), mix_both (utterances + noise) or mix_single (1 utterance + noise).
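The difference between the min and max modes can be illustrated with a minimal sketch. This is not the LibriMix generation code; `mix_sources` is a hypothetical helper that works on plain lists of samples and assumes all sources share one sample rate.

```python
def mix_sources(sources, mode="min"):
    """Sum equal-rate sample lists; `mode` decides the mixture length.

    Hypothetical helper for illustration, not part of LibriMix.
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    lengths = [len(src) for src in sources]
    length = min(lengths) if mode == "min" else max(lengths)
    mixture = [0.0] * length
    for src in sources:
        # "min": longer sources are truncated to the shortest one.
        # "max": shorter sources contribute nothing past their end,
        # which is equivalent to zero-padding them.
        for i, sample in enumerate(src[:length]):
            mixture[i] += sample
    return mixture


short = [0.5, 0.5]
longer = [0.25, 0.25, 0.25, 0.25]
print(len(mix_sources([short, longer], "min")))  # 2
print(len(mix_sources([short, longer], "max")))  # 4
```

In min mode every sample of the mixture contains all speakers, while max mode keeps each utterance whole at the cost of single-speaker tails.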

By default, LibriMix is generated for 2 and 3 speakers, at both 16 kHz and 8 kHz, in both min and max modes, and all mixture types are saved (mix_clean, mix_both and mix_single). This amounts to around 430 GB of data for Libri2Mix and 332 GB for Libri3Mix. A smaller subset can be generated by restricting the options listed above.

Creating Synthetic Audio for Training our Model

Each entry in the LibriSpeech corpus corresponds to a speaker, and each speaker folder contains multiple recordings along with their annotations. We take individual recordings from different speaker folders and overlap them using pydub to create synthetic audio mixtures, which we then use to train our model.
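The overlap step can be sketched as follows. To stay dependency-free, this sketch swaps pydub for the standard-library `wave` and `array` modules, so it is not the project's actual script. It assumes 16-bit mono PCM WAV inputs with equal sample rates (LibriSpeech is distributed as FLAC, so files would need converting first), and the helper names `read_pcm16`, `write_pcm16` and `overlap` are hypothetical.

```python
import array
import sys
import wave


def read_pcm16(path):
    """Return (sample_rate, samples) for a 16-bit mono PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
        rate = wav.getframerate()
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return rate, samples


def write_pcm16(path, rate, samples):
    """Write integer samples as a 16-bit mono PCM WAV file."""
    data = array.array("h", samples)
    if sys.byteorder == "big":
        data.byteswap()
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(data.tobytes())


def overlap(path1, path2, out_path):
    """Sum two recordings sample by sample and save the mixture."""
    rate1, s1 = read_pcm16(path1)
    rate2, s2 = read_pcm16(path2)
    if rate1 != rate2:
        raise ValueError("sample rates differ")
    # zip() stops at the shorter recording; clip sums to the int16 range.
    mixed = [max(-32768, min(32767, a + b)) for a, b in zip(s1, s2)]
    write_pcm16(out_path, rate1, mixed)
    return mixed
```

Calling `overlap("sound1.wav", "sound2.wav", "mixed.wav")` on two converted recordings would produce the three files of one training example. Truncating to the shorter recording is a design choice of this sketch.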

Synthetic Audio Data Format:

+ data
    |
    + spk1_spk2
    |      |
    |      + sound1.wav
    |      + sound2.wav
    |      + mixed.wav
    + spk1_spk3
    |      |
    |      + sound1.wav
    |      + sound2.wav
    |      + mixed.wav
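A minimal sketch of how a data loader might walk this layout. The function name `list_pairs` is hypothetical; the file names `sound1.wav`, `sound2.wav` and `mixed.wav` are taken from the tree above, and folders missing any of the three files are skipped.

```python
from pathlib import Path


def list_pairs(data_root):
    """Return [(mixture_path, [source1_path, source2_path]), ...].

    Only folders that contain all three expected files are included.
    """
    examples = []
    for folder in sorted(Path(data_root).iterdir()):
        if not folder.is_dir():
            continue
        mixed = folder / "mixed.wav"
        sources = [folder / "sound1.wav", folder / "sound2.wav"]
        if mixed.is_file() and all(src.is_file() for src in sources):
            examples.append((mixed, sources))
    return examples
```

For the tree above, `list_pairs("data")` would return one entry for `spk1_spk2` and one for `spk1_spk3`.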

Using MiniLibriMix

MiniLibriMix is a small version of LibriMix.

It was made for demonstration purposes.

It contains a train set of 800 mixtures and a validation set of 200 mixtures.

In each set, you will find:

  • mix_clean: a folder containing clean mixtures of 2 speakers.
  • mix_both: a folder containing mixtures of 2 speakers plus noise.
  • s1, s2, noise: three folders containing the raw signals used in the mixtures.
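The folder layout above can be indexed with a short sketch. It assumes that a mixture and its source signals share the same file name across `mix_clean`, `s1` and `s2`, which is how LibriMix-style datasets are usually organised but is not stated in this README. The function name `index_set` is hypothetical.

```python
from pathlib import Path


def index_set(set_dir, mixture_type="mix_clean"):
    """Map each mixture file name to its mixture and source paths.

    Assumes matching file names in the mixture, s1 and s2 folders.
    """
    set_dir = Path(set_dir)
    index = {}
    for mix in sorted((set_dir / mixture_type).glob("*.wav")):
        targets = [set_dir / name / mix.name for name in ("s1", "s2")]
        if all(target.is_file() for target in targets):
            index[mix.name] = {"mixture": mix, "sources": targets}
    return index
```

Passing `mixture_type="mix_both"` would index the noisy mixtures against the same clean sources.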

Results

Waveplot of Mixed/Original/Estimated Audio

[Figure: 2 Speaker Separation Image]

Mel Spectrogram of Mixed/Original/Estimated Audio

[Figure: 2 Speaker Separation Mel Spectrograms]

All Speech Separation Metrics from Asteroid

{'input_pesq': 3.934750556945801,
 'input_sar': 28.28840552880433,
 'input_sdr': 7.4975376739032145,
 'input_si_sdr': 6.865206956863403,
 'input_sir': 7.546190904711902,
 'input_stoi': 0.9072806256745396,
 'pesq': 4.548638343811035,
 'sar': 286.0524142270863,
 'sdr': 297.9890902500691,
 'si_sdr': 90.5447006225586,
 'sir': 286.52094481387064,
 'stoi': 0.9999999999999994}
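In this dictionary the `input_*` entries are computed on the unprocessed mixture and the unprefixed entries on the separated estimates, so the difference between the two is the improvement brought by the model. As a reading aid, here is a minimal pure-Python sketch of scale-invariant SDR (SI-SDR) and of that improvement. It is not Asteroid's implementation; the helper names are hypothetical, and removing the mean before projecting is a common convention that this sketch assumes.

```python
import math


def si_sdr(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative)."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    n = len(reference)
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]
    # Project the estimate onto the reference to get the scaled target.
    alpha = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [alpha * r for r in ref]
    target_energy = sum(t * t for t in target)
    noise_energy = sum((e - t) ** 2 for e, t in zip(est, target))
    if noise_energy == 0:
        return math.inf  # estimate is an exact scaled copy of the reference
    return 10 * math.log10(target_energy / noise_energy)


def improvement(metrics, name):
    """Difference between a metric on the estimate and on the input mixture."""
    return metrics[name] - metrics["input_" + name]
```

Applied to the values above, `improvement(metrics, "si_sdr")` is the gap between 90.54 dB and 6.87 dB. Because SI-SDR diverges as the estimate approaches a scaled copy of the reference, very large values mean the estimate is almost identical to the reference signal.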
