@article{papakostas2018speech, title={Speech-Music Discrimination Using Deep Visual Feature Extractors}, author={Papakostas, Michalis and Giannakopoulos, Theodoros}, journal={Expert Systems with Applications}, year={2018}, publisher={Elsevier} }
This project describes a new approach to the long-standing problem of Speech-Music Discrimination. To our knowledge, the proposed method provides state-of-the-art results on the task. We employ a Deep Convolutional Neural Network (CNN) and offer a compact framework for segmentation and binary (Speech/Music) classification that exploits the benefits of transferring knowledge from architectures pretrained on ImageNet. Our method does not rely on traditional audio features, which offer inferior results on the task. Instead, it exploits the highly invariant features produced by CNNs and operates on pseudocolored RGB or grayscale frequency images that represent audio segments.
*The dataset included speech-only, music-only, and overlapping speech-music audio samples. For further details, see the paper.
ROC curves of the two proposed methods, i.e. with (red) and without (blue) transfer learning, on the same dataset
Evaluation of our best method (pink) against the methods proposed by Pikrakis & Theodoridis on datasetA and datasetB
The repository consists of the following modules:
- Audio segmentation using the PyAudioAnalysis library
- CNN training using the CAFFE Deep-Learning Framework.
- Audio classification using:
- CNNs
- CNNs + median_filtering
- CNNs + median_filtering + HMMs
- Two pretrained CNNs on the task of Speech/Music Discrimination. The networks can also be used for weight initialization in other similar tasks. (to be added)
- An audio dataset consisting of more than 10h of continuous audio streams. At this point the data are available in the form of spectrograms. (to be added)
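The median-filtering step listed among the classification options can be illustrated with a short sketch. It assumes the CNN emits one integer label per audio segment (the 0 = music, 1 = speech encoding is chosen here for illustration) and uses a hypothetical `median_filter` helper with an arbitrary window length; the filter settings actually used by the repository are not stated in this README.

```python
from statistics import median_low


def median_filter(labels, window=5):
    """Smooth a sequence of integer labels with a sliding median.

    Near the edges the window is clamped to the sequence bounds.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo = max(0, i - half)
        hi = min(len(labels), i + half + 1)
        # median_low always returns a value that occurs in the window
        smoothed.append(median_low(labels[lo:hi]))
    return smoothed


# Per-segment CNN decisions with a few isolated errors (illustrative data)
raw = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0]
print(median_filter(raw, window=3))
```

Isolated flips inside a longer run are removed, while the boundary between the two runs is preserved.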
- Dependencies
* Installation instructions offered in detail on the above links
- Add Caffe to your working dir
- trainCNN.py --> Line:4
- train_net.sh --> Line:2
- ClassifyWav.py --> Line:14
or add pycaffe to your .bashrc for directory-independent access:
- Open the .bashrc file located in your home directory. In a terminal type:
  - `cd ~` to navigate to your home directory
  - `ls -a` to see the file listed
  - `nano .bashrc` to open the file in the terminal
- Scroll to the bottom of the file and add `export PYTHONPATH=$PYTHONPATH:"/home/--myPathToCaffe--/caffe/python"`, where --myPathToCaffe-- is the path to the Caffe library as it appears on your local machine, e.g. `export PYTHONPATH=$PYTHONPATH:"/home/michalis/Liraries/caffe/python"`
- Run `source ~/.bashrc` to reload the updated file
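As an alternative to editing .bashrc, the same effect can be obtained from inside a script by extending `sys.path` before importing caffe. The sketch below is an assumption, not the code found at the script lines listed earlier: the helper name `add_pycaffe_to_path` and the `CAFFE_ROOT` environment variable are hypothetical.

```python
import os
import sys


def add_pycaffe_to_path(caffe_root=None):
    """Prepend <caffe_root>/python to sys.path and return that directory.

    caffe_root falls back to the (hypothetical) CAFFE_ROOT environment variable.
    """
    caffe_root = caffe_root or os.environ.get("CAFFE_ROOT")
    if not caffe_root:
        raise ValueError("pass caffe_root or set CAFFE_ROOT")
    pycaffe_dir = os.path.join(os.path.expanduser(caffe_root), "python")
    if pycaffe_dir not in sys.path:
        sys.path.insert(0, pycaffe_dir)
    return pycaffe_dir


# Placeholder path, written the same way as in the .bashrc instructions
print(add_pycaffe_to_path("/home/--myPathToCaffe--/caffe"))
```

After the call, `import caffe` should resolve, provided Caffe was built with its Python bindings.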
-
Convert your audio files into pseudocolored RGB or grayscale spectrogram images using generateSpectrograms.py. TO BE UPDATED: a) how to run, b) how to set segmentation parameters, c) what the output looks like.
-
Split the spectrogram images into train and test as shown in Fig1:
- Train/Test and Classes represent directories
- Samples represent files
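The Train/Test layout described above could be produced with a small helper like the following. This is a sketch under assumptions: the source images are assumed to sit in one directory per class (`<src_root>/<class>/<image>`), the 80/20 ratio is arbitrary, and the function name `split_train_test` is hypothetical rather than part of the repository.

```python
import random
import shutil
from pathlib import Path


def split_train_test(src_root, dst_root, test_ratio=0.2, seed=0):
    """Copy <src_root>/<class>/<file> into <dst_root>/Train|Test/<class>/<file>.

    Returns the number of files placed in each subset, per class.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    counts = {}
    for class_dir in sorted(p for p in Path(src_root).iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_test = int(round(len(files) * test_ratio))
        for i, f in enumerate(files):
            subset = "Test" if i < n_test else "Train"
            target = Path(dst_root) / subset / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target / f.name)
        counts[class_dir.name] = {"Train": len(files) - n_test, "Test": n_test}
    return counts
```

A purely random split can place segments of the same recording in both subsets; grouping files by recording before splitting would give a stricter evaluation.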
- If you wish to use the architecture proposed in this work:
2. or grayscale spectrogram images of size 200x200, as shown in Fig3 (Fig3: Sample Grayscale Spectrogram)
- Image resizing can be done directly using the CAFFE framework.
- Train a CNN
-
Provide Network Architecture file. You can use one of the proposed architectures (SpeechMusic_RGB.prototxt, SpeechMusic_GRAY.prototxt ) or another CNN of your choice.
-
Train
Training can be done either by training a new network from scratch or by fine-tuning a pretrained architecture.
The pretrained model used in the paper for fine-tuning is the caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000 initially proposed in Donahue, Jeffrey, et al. "Long-term recurrent convolutional networks for visual recognition and description." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. To exploit the weight initialization of the pretrained model use the CNN architecture shown in SpeechMusic_RGB.prototxt.
If you wish to deploy the smaller CNN architecture that operates on grayscale images you should use the CNN architecture shown in SpeechMusic_GRAY.prototxt. This model was trained from scratch without weight initialization.
- Train from scratch:
python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations>
- Finetune pretrained network:
python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations> --init <pretrained_network>.caffemodel --init_type fin
- Resume training:
python trainCNN.py <architecture_file>.prototxt <path_to_train_data_root_folder> <path_to_test_data_root_folder> <snapshot_prefix> <total_number_of_iterations> --init <pretrained_network>.solverstate --init_type res
- For more details about modifying other learning parameters (e.g. learning rate, step size, etc.) type:
```shell
python trainCNN.py -h
```
- Outputs:
- _<snapshot_prefix>solver.prototxt: Solver file required by Caffe to train the CNN. The solver file describes all the parameters of the current experiment. Commented lines carry additional information about the experiment that is not required by the Caffe framework.
- _<snapshot_prefix>TrainSource.txt & _<snapshot_prefix>TestSource.txt: Full paths to the training and test samples, each followed by the sample's class.
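The TrainSource.txt and TestSource.txt files pair each sample path with its class. A minimal sketch of writing such a listing follows. It assumes the `<path> <integer_label>` line format read by Caffe's ImageData layer and assigns labels by sorted class-directory name; the function name `write_source_file` is hypothetical, and trainCNN.py may order or format its entries differently.

```python
import os


def write_source_file(data_root, out_path):
    """Write one '<absolute_path> <label>' line per sample under data_root.

    data_root is expected to contain one sub-directory per class.
    Returns the class-name -> integer-label mapping that was used.
    """
    classes = sorted(
        d for d in os.listdir(data_root)
        if os.path.isdir(os.path.join(data_root, d))
    )
    label_of = {name: idx for idx, name in enumerate(classes)}
    with open(out_path, "w") as out:
        for name in classes:
            class_dir = os.path.join(data_root, name)
            for fname in sorted(os.listdir(class_dir)):
                full = os.path.abspath(os.path.join(class_dir, fname))
                # A space separates path and label, so paths must not contain spaces
                out.write("%s %d\n" % (full, label_of[name]))
    return label_of
```

Calling `write_source_file("Train", "myOutput_TrainSource.txt")` on the layout of Fig1 would list every training image with its class index; the output file name here is only an example.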
- Train HMM
python ClassifyWav.py trainHMM <path_to_test_data> <hmm_model_name> <core_classification_method> <trained_network> <classification_method>
*This applies after having a trained CNN
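To make the role of the HMM concrete, here is a sketch of how an HMM can smooth per-segment CNN decisions. It is not the repository's implementation: it assumes the hidden states are the true class labels, the observations are the CNN's predicted labels, and the probabilities are estimated by simple counting with add-alpha smoothing. The names `estimate_hmm` and `viterbi` are hypothetical.

```python
import math


def estimate_hmm(true_labels, cnn_labels, n_states=2, alpha=1.0):
    """Count-based estimates of start, transition and emission probabilities."""
    start = [alpha] * n_states
    trans = [[alpha] * n_states for _ in range(n_states)]
    emit = [[alpha] * n_states for _ in range(n_states)]
    start[true_labels[0]] += 1
    for prev, cur in zip(true_labels, true_labels[1:]):
        trans[prev][cur] += 1
    for state, obs in zip(true_labels, cnn_labels):
        emit[state][obs] += 1

    def normalise(row):
        total = sum(row)
        return [v / total for v in row]

    return normalise(start), [normalise(r) for r in trans], [normalise(r) for r in emit]


def viterbi(observations, start, trans, emit):
    """Most likely hidden-state sequence for the observed CNN labels."""
    if not observations:
        return []
    n = len(start)
    score = [math.log(start[s]) + math.log(emit[s][observations[0]]) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev_best, new_score = [], []
        for s in range(n):
            best = max(range(n), key=lambda p: score[p] + math.log(trans[p][s]))
            prev_best.append(best)
            new_score.append(score[best] + math.log(trans[best][s]) + math.log(emit[s][obs]))
        back.append(prev_best)
        score = new_score
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for prev_best in reversed(back):
        state = prev_best[state]
        path.append(state)
    return path[::-1]


# Illustrative data: ground truth and CNN output with two isolated errors
truth = [0] * 10 + [1] * 10
noisy = list(truth)
noisy[4], noisy[14] = 1, 0
start, trans, emit = estimate_hmm(truth, noisy)
print(viterbi(noisy, start, trans, emit))
```

Because the estimated self-transition probabilities are high, isolated CNN errors are cheaper to explain as misclassifications than as genuine class changes, so the decoded path keeps long homogeneous segments.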
**Change [trainCNN.py](https://github.com/MikeMpapa/CNNs-Speech-Music-Discrimination/blob/master/trainCNN.py), Line:9, to ``` caffe.set_mode_gpu() ``` to support GPU implementation**
- Evaluate trained CNN Model with/without post processing:
python ClassifyWav.py evaluate <path_to_test_wav_files> <trained_network>.caffemodel <classification_method> <classification_type_flag> ""
- Evaluate trained HMM Model with post processing:
python ClassifyWav.py evaluate <path_to_test_wav_files> <trained_network>-5000.caffemodel <core_classification_method> <classification_type_flag> <hmm_model_name>
Change ClassifyWav.py, Line:17, to caffe.set_mode_gpu() to support GPU implementation
-
Generate Spectrogram Images:
-
Train from scratch:
python trainCNN.py SpeechMusic_RGB.prototxt Train Test myOutput 4000
-
Finetune pretrained network (train and test paths are according to Fig1):
python trainCNN.py SpeechMusic_RGB.prototxt Train Test myOutput 1000 --init caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000.caffemodel --init_type fin
-
Resume training from pretrained network (train and test paths are according to Fig1):
python trainCNN.py SpeechMusic_RGB.prototxt Train Test my_new_Output 2000 --init myOutput.solverstate --init_type res
-
Evaluate trained CNN on .wav file/s without post-processing:
python ClassifyWav.py evaluate Data/testWavs CNN-SM-5000.caffemodel cnn 0 ""
-
Evaluate trained CNN on .wav file/s with post-processing:
python ClassifyWav.py evaluate Data/testWavs CNN-SM-5000.caffemodel cnn 1 ""
-
Train an HMM after applying median filtering:
python ClassifyWav.py trainHMM Data/testWavs hmm1 cnn CNN-SM-5000.caffemodel 1
-
Test using pretrained HMM:
python ClassifyWav.py evaluate Data/testWavs CNN-SM-5000.caffemodel cnn 2 hmm1
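When running many evaluations, the `ClassifyWav.py evaluate` calls shown above can be assembled programmatically. The sketch below only builds the argument list. The meaning attached to the flag values (0 = CNN only, 1 = CNN with median filtering, 2 = CNN with median filtering and HMM) is inferred from the examples rather than documented, and the helper name `build_evaluate_command` is hypothetical.

```python
# Flag values as they appear in the examples above (meaning inferred, not documented)
METHOD_FLAGS = {"cnn": 0, "cnn+median": 1, "cnn+median+hmm": 2}


def build_evaluate_command(wav_dir, caffemodel, method="cnn", hmm_model=""):
    """Return the argv list for a `ClassifyWav.py evaluate` call."""
    flag = METHOD_FLAGS[method]
    if flag == 2 and not hmm_model:
        raise ValueError("an HMM model name is required for flag 2")
    # The last argument is an empty string when no HMM is used, as in the examples
    return ["python", "ClassifyWav.py", "evaluate", wav_dir,
            caffemodel, "cnn", str(flag), hmm_model]


print(build_evaluate_command("Data/testWavs", "CNN-SM-5000.caffemodel",
                             method="cnn+median+hmm", hmm_model="hmm1"))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.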
A pretrained model on the task using pseudo-colored RGB images along with the solverstate can be found here
We provide a new method for the task of Speech/Music Discrimination using Convolutional Neural Networks. The main contributions of this work are the following:
-
A compact framework for:
  - Segmenting and classifying long audio streams into speech and music segments.
  - Training new CNN models on binary audio tasks.
-
A large dataset of long audio streams (more than 10h) for the task of Speech-Music Discrimination. The dataset is provided in the form of spectrograms.
-
Two different pretrained CNN architectures that can be used for weight initialization for other binary classification tasks.
-
To our knowledge, our method provides state-of-the-art results on the task.
If you found our project useful, please cite the following publications:
CNNs:Speech-Music-Discrimination @article{papakostas2018speech, title={Speech-Music Discrimination Using Deep Visual Feature Extractors}, author={Papakostas, Michalis and Giannakopoulos, Theodoros}, journal={Expert Systems with Applications}, year={2018}, publisher={Elsevier} }
PyAudioAnalysis @article{giannakopoulos2015pyaudioanalysis, title={pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis}, author={Giannakopoulos, Theodoros}, journal={PloS one}, volume={10}, number={12}, year={2015}, publisher={Public Library of Science} }
Caffe Framework @article{jia2014caffe, title={Caffe: Convolutional Architecture for Fast Feature Embedding}, author={Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor}, journal={arXiv preprint arXiv:1408.5093}, year={2014} }
If you used the pretrained network caffe_imagenet_hyb2_wr_rc_solver_sqrt_iter_310000 for your experiments, please also cite:
@inproceedings{donahue2015long, title={Long-term recurrent convolutional networks for visual recognition and description}, author={Donahue, Jeffrey and Anne Hendricks, Lisa and Guadarrama, Sergio and Rohrbach, Marcus and Venugopalan, Subhashini and Saenko, Kate and Darrell, Trevor}, booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition}, pages={2625--2634}, year={2015} }