- A PyTorch implementation of "Scene2Wav: A Deep Convolutional Sequence-to-Conditional SampleRNN for Emotional Scene Musicalization" (MTAP 2020).
- The model has 3 stages: visual feature extraction with a CNN, visual feature encoding with a deep RNN encoder, and music generation with the Scene2Wav decoder (a conditional SampleRNN decoder).
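For orientation, the sketch below shows one possible shape of stages 1 and 2 and the conditioning vector they could hand to a SampleRNN-style decoder. It is an illustrative sketch only: layer sizes, module names, and the conditioning mechanism are assumptions, not the actual architecture defined in this repository's model code.

```python
# Illustrative-only sketch of the three-stage pipeline (NOT the actual
# Scene2Wav code): a frame-level CNN, a deep RNN encoder, and a conditioning
# vector for a SampleRNN-style decoder. Sizes and names are assumptions.
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    """Stage 1 (assumed shape): per-frame visual feature extraction."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.conv(frames.view(b * t, *frames.shape[2:]))
        x = x.view(b * t, -1)
        return self.fc(x).view(b, t, -1)            # (B, T, feat_dim)

class SceneEncoder(nn.Module):
    """Stage 2 (assumed shape): deep RNN encoder over the frame features."""
    def __init__(self, feat_dim=128, hidden=128, num_layers=2):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=num_layers, batch_first=True)

    def forward(self, feats):                       # feats: (B, T, feat_dim)
        out, _ = self.rnn(feats)
        return out[:, -1]                           # (B, hidden) conditioning vector

# Stage 3 (not shown): the conditioning vector modulates a SampleRNN-style
# autoregressive decoder that generates the raw audio waveform.
frames = torch.randn(2, 30, 3, 64, 64)              # e.g. 30 frames of a 3-second scene
cond = SceneEncoder()(FrameCNN()(frames))
print(cond.shape)                                    # torch.Size([2, 128])
```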
Requirements • How to Use • Results • How to Cite
This code was tested with Python 3.5+ and PyTorch 0.4.1 (or 0.4.1.post2).
The rest of the dependencies can be installed with `pip install -r requirements.txt`.
- Data:
  - Download the pre-processed `data.npz`
  - OR make it from scratch:
    - Download the COGNIMUSE dataset
    - Organize it and pre-process it following the instructions in AnnotatedMV-PreProcessing
- The `.npz` dataset should be copied into a subfolder of the `datasets/` folder in the root of the repository:

  ```
  Scene2Wav
  +-- datasets
  |   +-- data_npz
  |   |   +-- my_data_train.npz
  |   |   +-- my_data_test.npz
  |   +-- custom_data_npz
  |       > Your custom `npz` dataset can go in here
  ```
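  Before training, you can inspect the pre-processed archive with NumPy. This is a minimal sketch assuming only standard `np.load` behaviour; the array keys stored in the file are not listed here, so enumerate them with `data.files`:

  ```python
  # Minimal sketch: inspect a pre-processed .npz archive with NumPy.
  # The file path matches the layout above; the keys it contains depend on
  # the pre-processing step, so we simply enumerate them.
  import numpy as np

  data = np.load("datasets/data_npz/my_data_train.npz", allow_pickle=True)
  print(data.files)  # names of the arrays stored in the archive
  for key in data.files:
      arr = data[key]
      print(key, getattr(arr, "shape", None), getattr(arr, "dtype", None))
  ```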
- Training Scene2Wav using the pre-trained encoder: run `train.py` with settable hyperparameters.

  ```
  CUDA_VISIBLE_DEVICES=0 python train.py --exp TEST --frame_sizes 16 4 --n_rnn 2 \
      --dataset data_npz \
      --npz_filename video_feats_HSL_10fps_3secs_intAudio_pad_train.npz \
      --npz_filename_test video_feats_HSL_10fps_3secs_pad_test.npz \
      --cnn_pretrain cnnseq/cnn4_3secs_res_vanilla_HSL_bin_1D_CrossEntropy_ep_40_bs_30_lr_0.001_we_0.0001_asgd/ \
      --cnn_seq2seq_pretrain cnnseq/cnnseq2seq4_3secs_HSL_bin_1D_res_stepPred_8_ep_20_bs_30_relu_layers_2_size_128_lr_0.001_we_1e-05_adam_asgdCNN_trainSize_3182_testSize_1139_cost_audio/
  ```
- If you need to train the Scene2Wav encoder with a custom dataset (instead of using the pre-trained one), run the two pre-training steps below (a combined pipeline sketch follows this list):
  - Pre-train the CNN with the scene frames and emotion scores:

    ```
    python CNN_main.py --mode=train
    ```
  - Pre-train the CNN-Seq2Seq end-to-end with the scene frames and audio:

    ```
    python CNNSeq2Seq_main.py --mode=train
    ```
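  A possible end-to-end order for these steps, combining the commands above with the training command shown earlier (the checkpoint directories passed to `--cnn_pretrain` and `--cnn_seq2seq_pretrain` are placeholders; use the directories produced by your own pre-training runs):

  ```sh
  # 1) Pre-train the CNN on scene frames + emotion scores
  python CNN_main.py --mode=train

  # 2) Pre-train the CNN-Seq2Seq on scene frames + audio
  python CNNSeq2Seq_main.py --mode=train

  # 3) Train Scene2Wav, pointing to the checkpoint folders produced above
  CUDA_VISIBLE_DEVICES=0 python train.py --exp TEST --frame_sizes 16 4 --n_rnn 2 \
      --dataset data_npz \
      --npz_filename my_data_train.npz \
      --npz_filename_test my_data_test.npz \
      --cnn_pretrain cnnseq/<your_cnn_checkpoint_dir>/ \
      --cnn_seq2seq_pretrain cnnseq/<your_cnnseq2seq_checkpoint_dir>/
  ```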
- Generate the target, the baseline (CNN-Seq2Seq), and ours (Scene2Wav):

  ```
  python generate_audio_scene2wav.py
  ```
  The checkpoint path, emotion, and number of samples to generate are set inside the script (see the illustrative snippet below).
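  Purely for illustration, the kind of settings edited inside the script might look like the snippet below; the variable names and emotion encoding here are assumptions, so open `generate_audio_scene2wav.py` to see the actual ones:

  ```python
  # Hypothetical settings block (names/values are assumptions, not the
  # script's actual code) -- edit the equivalents inside generate_audio_scene2wav.py.
  checkpoint_path = "results/TEST/checkpoints/best-model.tar"  # trained Scene2Wav checkpoint
  emotion = 1        # target emotion condition for the generated music
  n_samples = 5      # number of audio samples to generate
  ```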
- Emotion evaluation
  - Install the requirements:

    ```
    pip install music21 vamp librosa midiutil
    ```
  - Melodia plugin:
    - Download it
    - Install it:
      - MacOS: copy all files in `MTG-MELODIA 1.0 (OSX universal).zip` to `/Library/Audio/Plug-Ins/Vamp`
      - Linux: copy all files in `MTG-MELODIA 1.0 (Linux 32/64-bit).zip` to `/usr/local/lib/vamp`
  - Transform wav to midi and detect chords (a minimal sketch of this step follows this list):

    ```
    python emotion_evaluation.py --data_dir [data dirname] --infile [filename].wav --outfile [filename].mid
    ```
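  For a rough idea of what that conversion involves, the sketch below extracts a melody contour with the Melodia Vamp plugin, writes it to MIDI, and names the chords with music21. It is a minimal sketch under the assumptions stated in the comments and is not the repository's `emotion_evaluation.py`; file names and note handling are illustrative.

  ```python
  # Minimal sketch of wav -> melody (MIDI) -> chord naming. Assumes the
  # Melodia Vamp plugin is installed as described above. This is NOT the
  # repository's emotion_evaluation.py; paths and note handling are illustrative.
  import librosa
  import vamp
  from midiutil import MIDIFile
  from music21 import converter

  # 1) Load audio and run Melodia (f0 contour in Hz; values <= 0 are unvoiced)
  audio, sr = librosa.load("sample.wav", sr=44100, mono=True)
  melody = vamp.collect(audio, sr, "mtg-melodia:melodia")["vector"][1]

  # 2) Quantize voiced frames to MIDI pitches and write one short note per frame
  #    (a real converter would merge consecutive frames with the same pitch)
  midi = MIDIFile(1)
  tempo = 120
  midi.addTempo(track=0, time=0, tempo=tempo)
  hop_s = 128.0 / 44100.0        # Melodia's default hop size, in seconds
  beats_per_sec = tempo / 60.0   # MIDIUtil times and durations are in beats
  for i, f0 in enumerate(melody):
      if f0 > 0:
          pitch = int(round(librosa.hz_to_midi(float(f0))))
          midi.addNote(track=0, channel=0, pitch=pitch,
                       time=i * hop_s * beats_per_sec,
                       duration=hop_s * beats_per_sec, volume=100)
  with open("sample.mid", "wb") as f:
      midi.writeFile(f)

  # 3) Name the chords in the resulting MIDI with music21
  score = converter.parse("sample.mid").chordify()
  for c in score.recurse().getElementsByClass("Chord"):
      print(c.offset, c.pitchedCommonName)
  ```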
- Human evaluation: Amazon MTurk
- Perceptual audio metric:
  - Clone the code and install its requirements
  - Copy `perceptual_audio_metric.sh` to `metric_code/` and run it (see the sketch after this list). P.S.: Modify the audio path and filenames you wish to compare.
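  A possible way to run that step, assuming the cloned metric code lives in `metric_code/` (the directory name comes from the instructions above; the audio path and filenames to compare must be edited inside the script first):

  ```sh
  # Copy the comparison script into the cloned metric code and run it.
  cp perceptual_audio_metric.sh metric_code/
  cd metric_code
  bash perceptual_audio_metric.sh
  ```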
- Results are saved in `results/`: training log, loss plots, model checkpoints and generated samples.
- You can check some generated samples in `results_generated_samples/` (tested with VLC Media Player).
If you wish to use this code, please credit this repository or send me an email with any requests or questions.
```
@article{sergio2020jmta,
  author  = {Sergio, G. C. and Lee, M.},
  title   = {Scene2Wav: A Deep Convolutional Sequence-to-Conditional SampleRNN for Emotional Scene Musicalization},
  journal = {Multimedia Tools and Applications},
  year    = {2020},
  pages   = {1--20},
  doi     = {10.1007/s11042-020-09636-5},
  issn    = {1573-7721},
  volume  = {2020}
}
```
Please also cite the pre-processing repository AnnotatedMV-PreProcessing as:
```
@software{gwena_cunha_2020_3910918,
  author    = {Gwenaelle Cunha Sergio},
  title     = {{gcunhase/AnnotatedMV-PreProcessing: Pre-Processing of Annotated Music Video Corpora (COGNIMUSE and DEAP)}},
  month     = jun,
  year      = 2020,
  publisher = {Zenodo},
  version   = {v2.0},
  doi       = {10.5281/zenodo.3910918},
  url       = {https://doi.org/10.5281/zenodo.3910918}
}
```
If you use the COGNIMUSE database:
```
@article{zlatintsi2017cognimuse,
  title     = {COGNIMUSE: A multimodal video database annotated with saliency, events, semantics and emotion with application to summarization},
  author    = {Zlatintsi, Athanasia and Koutras, Petros and Evangelopoulos, Georgios and Malandrakis, Nikolaos and Efthymiou, Niki and Pastra, Katerina and Potamianos, Alexandros and Maragos, Petros},
  journal   = {EURASIP Journal on Image and Video Processing},
  volume    = {2017},
  number    = {1},
  pages     = {54},
  year      = {2017},
  publisher = {Springer}
}
```
If you use the DEAP database:
```
@article{koelstra2011deap,
  title     = {{DEAP}: A database for emotion analysis; using physiological signals},
  author    = {Koelstra, Sander and Muhl, Christian and Soleymani, Mohammad and Lee, Jong-Seok and Yazdani, Ashkan and Ebrahimi, Touradj and Pun, Thierry and Nijholt, Anton and Patras, Ioannis},
  journal   = {IEEE Transactions on Affective Computing},
  volume    = {3},
  number    = {1},
  pages     = {18--31},
  year      = {2011},
  publisher = {IEEE}
}
```
Code based on deepsound-project's PyTorch implementation of SampleRNN.