This is my implementation of WaveNet for Transcription that accompanies this published work (see [PDF] of the preprint).
The code is based on Igor Babuschkin's excellent implementation of the WaveNet neural network architecture.
In this project, WaveNet is no longer an autoregressive generative model, but rather a multi-class, multi-label classifier that performs frame-level transcription of polyphonic music.
Each data sample (as produced by the reader) is processed by WaveNet's stack of dilated causal convolutions. All convolutional layers in the stack use a stride of 1, so the outputs have a temporal resolution equal to the input's sampling frequency (16 kHz by default). On top of this stack sit two post-processing layers finished with a sigmoid activation, which allows for multi-label predictions.
Each prediction of the model solves the multiple fundamental frequency (F0) estimation problem for a given input time frame. Since the inputs for individual predictions are extracted by a sliding window over the input snippet (a property of WaveNet's processing mechanics), model inference amounts to frame-level transcription.
WaveNet for Transcription. Image courtesy of WaveNet authors.
In the diagram above, the gray numbers denote the hyperparameter settings (the number of channels used across the network), which were determined under a 12 GB GPU memory constraint. We used an NVIDIA Titan X (Maxwell architecture) for this project.
Note that the audio waveform and piano roll depicted in the diagram are only illustrative (they correspond to each other only if the MIDI is correctly aligned with the audio in the training dataset).
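To make the architecture description above more concrete, here is a minimal, hedged sketch of the idea in Keras-style TensorFlow. It is not the repository's TF 1.x implementation: it omits WaveNet's gated activation units and residual/skip connections, and the layer and channel counts are illustrative, not the values from the diagram.

```python
# Illustrative sketch only: a stride-1 dilated causal convolution stack with
# two post-processing layers and a sigmoid head for multi-label (per-pitch)
# predictions at the input's temporal resolution.
import tensorflow as tf

NUM_PITCHES = 88  # e.g. one sigmoid unit per piano key (assumption)

def build_sketch(num_layers=10, residual_channels=32):
    inputs = tf.keras.layers.Input(shape=(None, 1))   # raw waveform, e.g. 16 kHz
    x = inputs
    for i in range(num_layers):
        # dilation doubles every layer: 1, 2, 4, 8, ...; causal padding keeps
        # the output length equal to the input length
        x = tf.keras.layers.Conv1D(
            residual_channels, kernel_size=2, strides=1,
            dilation_rate=2 ** i, padding='causal', activation='tanh')(x)
    # two post-processing layers, the last with sigmoid for multi-label output
    x = tf.keras.layers.Conv1D(residual_channels, 1, activation='relu')(x)
    outputs = tf.keras.layers.Conv1D(NUM_PITCHES, 1, activation='sigmoid')(x)
    return tf.keras.Model(inputs, outputs)  # one F0 prediction per input sample

model = build_sketch()
```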
TensorFlow needs to be installed in order to use this implementation. The code was tested with TensorFlow versions 1.6.0 and 1.11.0 on Python 3.6.
In addition, librosa must be installed for manipulating audio, pretty_midi for manipulating MIDI, and mir_eval along with matplotlib for visualizing inference/evaluation results.
To install the required python packages, run
pip3 install -r requirements.txt
For GPU support, use
pip3 install -r requirements_gpu.txt
In order to test your installation, execute
mkdir -p logs/sanitycheck
python3 train.py --logdir=logs/sanitycheck
to train the network to recognize middle C on a simple sanity-check dataset.
As a result, you should see convergence within the first ~1000 training steps, and train.py should terminate automatically while explaining why:
Trainer stops training since metric crossed stop boundary 0.999 with value 0.9991437050591782
You can see documentation on each of the training settings by running
python3 train.py --help
You can use any dataset consisting of pairs of .wav and .mid files. The dataset can be augmented with multiple audio files synthesized from the MIDI labels using different instrument soundfonts. In such a case, if there are multiple .wav files corresponding to a single .mid file, they should share a common filename prefix, e.g. the label $(fname).mid would be used with the inputs $(fname)_piano_A.wav, $(fname)_piano_B.wav and so on. The dataset can be organized into sub-folders, as long as corresponding files are together in the same folder.
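The naming convention can be illustrated with a small sketch. This is not code from the repository's reader; the function below is purely hypothetical and only mirrors the prefix rule described above.

```python
# Hypothetical helper: pair every "<name>*.wav" in a folder with the
# "<name>.mid" label whose filename (without extension) is its prefix.
import glob
import os

def pair_inputs_with_labels(dataset_dir):
    pairs = []
    for mid_path in glob.glob(os.path.join(dataset_dir, '**', '*.mid'),
                              recursive=True):
        folder = os.path.dirname(mid_path)
        prefix = os.path.splitext(os.path.basename(mid_path))[0]
        for wav_path in sorted(glob.glob(os.path.join(folder, prefix + '*.wav'))):
            # e.g. fname_piano_A.wav and fname_piano_B.wav both map to fname.mid
            pairs.append((wav_path, mid_path))
    return pairs
```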
You can find the configuration of the model parameters in model_params.json. These need to stay the same between training and generation.
Training hyperparameters can be found in training_params.json. These can be adjusted between different training sessions.
Other parameters of training, such as intermediate evaluation, model checkpointing, and metadata storage for later debugging in TensorBoard, are stored in runtime_switches.json. These can be modified even during the execution of train.py, which reloads the file regularly to check for the latest settings.
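A rough sketch of how these three files could be consumed is shown below. The actual keys inside the JSON files and the reload interval used by train.py are not shown here; everything in the snippet is generic and illustrative.

```python
# Illustrative only: load the static configs once, re-read the runtime
# switches periodically so edits made during training take effect.
import json

with open('model_params.json') as f:
    model_params = json.load(f)       # must match between training and testing

with open('training_params.json') as f:
    training_params = json.load(f)    # may differ between training sessions

def load_runtime_switches(path='runtime_switches.json'):
    # Re-read the file so changes made while train.py is running are picked up.
    with open(path) as f:
        return json.load(f)

# Inside a training loop one might periodically call:
# switches = load_runtime_switches()
```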
The network is regularly used for inference on validation data already during training, in order to generate intermediate performance evaluations for TensorBoard monitoring.
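For reference, one common way to log such an intermediate evaluation metric to TensorBoard with the TF 1.x API looks roughly like the sketch below. The metric name is hypothetical, and train.py may compute and log its evaluations differently.

```python
# Illustrative only: write a scalar (here a hypothetical frame-level F1 score)
# so it appears under the SCALARS dashboard in TensorBoard.
import tensorflow as tf

def log_validation_metric(writer, step, f1_value):
    summary = tf.Summary(value=[
        tf.Summary.Value(tag='validation/frame_f1', simple_value=f1_value)])
    writer.add_summary(summary, global_step=step)
    writer.flush()

# writer = tf.summary.FileWriter('logs/sanitycheck')
```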
When the model is trained, you can use test.py to do the following:
- evaluate its performance on test data
- store performance metrics
- store the raw transcription of the test set in .npy format (a sketch of inspecting such a file follows below)
- generate multimedia files from the transcription, such as
  - draw figures of estimated and evaluated piano rolls
  - synthesize audio of the generated transcriptions
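A stored transcription can be inspected with a few lines of NumPy/matplotlib. The file name and the (time, pitch) orientation in this sketch are assumptions about the output of test.py, not documented facts.

```python
# Hedged example: load stored frame-level probabilities and display them as a
# thresholded piano roll (black = note present, white = absent).
import numpy as np
import matplotlib.pyplot as plt

probs = np.load('logs/sanitycheck/best/transcription.npy')  # assumed path/shape
piano_roll = (probs > 0.5).astype(np.float32)               # threshold sigmoid outputs

plt.imshow(piano_roll.T, aspect='auto', origin='lower', cmap='gray_r')
plt.xlabel('frame')
plt.ylabel('pitch')
plt.title('Thresholded frame-level transcription')
plt.show()
```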
If you executed sanity-check training before, you can now use the model for inference
python3 test.py --media=True --logdir=logs/sanitycheck/best
where logdir is expected to contain the model checkpoint files, and the media switch enables generating visual and audible evaluation outputs for TensorBoard (see the IMAGES and AUDIO dashboards) as well as separate media files stored in logdir.
Usage of the testing script is also documented by
python3 test.py --help
Currently, one needs to use test.py to generate transcriptions, which also requires data files with MIDI labels.
To get an idea of the performance of the trained (not fine-tuned) transcription model in terms of the quality of generated transcriptions, here are some results on two audio samples.
The first sample, an excerpt from a piece by J. S. Bach, comes from the training distribution. The other, an excerpt from a piece by Hiromi Uehara, is purposefully chosen from outside the training distribution.
The table below shows the transcription results for the selected samples. The shade of gray in the prediction figures denotes the probability of note presence: black for present, white for absent. In the evaluation plots, green denotes correct note detections, red present but undetected notes, blue absent but falsely detected notes, and white the remaining majority case: absent and correctly undetected notes.
Sample | Predicted | Evaluated
---|---|---
J. S. Bach | *(prediction figure)* | *(evaluation figure)*
Hiromi Uehara | *(prediction figure)* | *(evaluation figure)*
For both samples, you can listen to the input and its transcription separately as well as overlaid in this SoundCloud album.
- A transcribe.py script to enable generating transcriptions using a previously trained model, including Hamming-window smoothing of the predictions (see the sketch below), an optionally supplied soundfont for fluidsynth synthesis, and, unlike test.py, no requirement of labels for performance evaluation.
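The smoothing mentioned in this planned feature could look roughly like the following sketch: averaging the frame-wise sigmoid outputs with a normalized Hamming window along the time axis. The window length and the array layout are arbitrary illustrative choices, not values taken from the planned transcribe.py.

```python
# Illustrative sketch of Hamming-window smoothing of frame-level predictions.
import numpy as np

def smooth_predictions(probs, window_len=2048):
    """probs: array of shape (num_frames, num_pitches) with values in [0, 1]."""
    window = np.hamming(window_len)
    window /= window.sum()                 # normalize so values stay in [0, 1]
    smoothed = np.empty_like(probs)
    for pitch in range(probs.shape[1]):
        # smooth each pitch track independently along the time axis
        smoothed[:, pitch] = np.convolve(probs[:, pitch], window, mode='same')
    return smoothed
```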