Spectrogram generator using GANs
I started this project as part of my Bachelor's Thesis, "Generator of graphic representations of phonic signals using GAN neural networks", and develop it for learning purposes. The goal of this project is to take advantage of CNNs for generating new synthetic audio clips. To achieve this, both the dataset clips and the generated clips are represented as spectrograms or mel spectrograms. The project could potentially be used in audio data augmentation, or to generate sounds for games/movies/simulations.
The model is based on https://www.tensorflow.org/tutorials/generative/dcgan. Major changes made after experiments to improve results (a sketch of these tweaks follows the list):
- Added noise to the discriminator
- Normalized input to values from 0 to 1
- Changed the activation function from tanh to sigmoid
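
A minimal sketch of how these tweaks could look on top of the tutorial model; the layer sizes, noise level, and input shape are illustrative assumptions, not the project's exact architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_discriminator(input_shape=(128, 128, 1)):
    # GaussianNoise adds zero-mean noise to the discriminator's inputs;
    # it is only active when the layer is called with training=True.
    return tf.keras.Sequential([
        layers.GaussianNoise(0.1, input_shape=input_shape),
        layers.Conv2D(64, 5, strides=2, padding="same"),
        layers.LeakyReLU(),
        layers.Flatten(),
        layers.Dense(1),
    ])

# The generator's last layer with sigmoid instead of the tutorial's tanh,
# so outputs land in the same 0-1 range as the normalized spectrograms.
generator_output = layers.Conv2DTranspose(
    1, 5, strides=2, padding="same", activation="sigmoid")
```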
Datasets used:
- Mozilla Common Voice - human speech with noise: https://commonvoice.mozilla.org/en
- Recordings of drums: https://www.hexawe.net/mess/200.Drum.Machines/
- Speak Like a Dog dataset: https://drive.google.com/drive/folders/1TmG1yjc0_RLUX7U0ZJGLPVWkAwiSkSWY
- (WIP) Part of the female recordings from the VCTK corpus: https://datashare.ed.ac.uk/handle/10283/3443
Steps taken to prepare a tensor containing spectrograms from audio clips (a sketch follows the list):
- Trim silence with a threshold of 15 dB
- Split audio clips into 2-second chunks
- Create a spectrogram/mel spectrogram using Python's librosa (parameters set to get the desired spectrogram size)
- Save all spectrograms to a tf.Tensor
- Normalize values to 0-1 (done in the model code)
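
A sketch of this pipeline, assuming a 22050 Hz sample rate and a hop length chosen to give roughly 128x128 mel spectrograms (both are assumptions; the 0-1 normalization is left to the model code, as noted above):

```python
import numpy as np
import librosa
import tensorflow as tf

SR = 22050          # assumed sample rate
CHUNK = 2 * SR      # 2-second chunks
N_MELS = 128        # assumed spectrogram height
HOP = 345           # hop length picked so a 2 s chunk yields ~128 frames

def clip_to_spectrograms(path):
    y, _ = librosa.load(path, sr=SR)
    y, _ = librosa.effects.trim(y, top_db=15)          # trim silence at 15 dB
    # split into full 2 s chunks, dropping the remainder
    chunks = [y[i:i + CHUNK] for i in range(0, len(y) - CHUNK + 1, CHUNK)]
    specs = []
    for c in chunks:
        s = librosa.feature.melspectrogram(y=c, sr=SR, n_mels=N_MELS, hop_length=HOP)
        specs.append(librosa.power_to_db(s, ref=np.max))
    return specs

all_specs = []
for path in ["clip1.wav"]:                             # hypothetical file list
    all_specs.extend(clip_to_spectrograms(path))
dataset = tf.convert_to_tensor(np.stack(all_specs)[..., np.newaxis], dtype=tf.float32)
```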
TensorBoard was added to track the loss functions. Although the loss function is less useful in GANs than in other architectures, it helps to identify convergence failure (not finding an equilibrium between the discriminator and the generator, where one dominates the other).
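
A minimal sketch of this kind of logging, assuming a custom training loop that produces per-step generator and discriminator losses (the log directory is an arbitrary choice):

```python
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/gan")

def log_losses(step, gen_loss, disc_loss):
    # write both losses as scalars so their curves can be compared side by side
    with writer.as_default():
        tf.summary.scalar("generator_loss", gen_loss, step=step)
        tf.summary.scalar("discriminator_loss", disc_loss, step=step)
```

Running `tensorboard --logdir logs` then shows whether one network's loss collapses while the other's explodes, the usual sign of convergence failure.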
To analyze the quality of the generated audio, a function converting (mel) spectrograms back to audio was implemented using librosa.
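
A sketch of such a conversion, assuming the generated sample is a dB-scaled mel spectrogram normalized to 0-1; the `db_min`/`db_max` bounds and hop length are hypothetical and must match whatever was used during preprocessing:

```python
import librosa
import soundfile as sf

def mel_to_wav(mel_01, db_min=-80.0, db_max=0.0, sr=22050, hop_length=345):
    mel_db = mel_01 * (db_max - db_min) + db_min    # undo the 0-1 normalization
    mel_power = librosa.db_to_power(mel_db)         # back to the power scale
    # invert the mel spectrogram to a waveform (uses Griffin-Lim internally)
    return librosa.feature.inverse.mel_to_audio(mel_power, sr=sr, hop_length=hop_length)

# usage: sf.write("generated.wav", mel_to_wav(sample), 22050)
```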