Create a fresh virtual environment:
```bash
python -m venv venv
source venv/bin/activate
```
Then clone the repository and install the dependencies (for the lab's A40 server):

```bash
git clone https://www.github.com/ilaria-manco/muscall
cd muscall
pip install -r requirements.txt
pip install -e .
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install laion_clap
pip install miditoolkit
```
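To confirm the environment is usable before going further, a quick sanity check like the following can help (a minimal sketch; it only verifies that the key packages import and that the cu121 wheels can see the GPU):

```python
import torch
import laion_clap
import miditoolkit

# CUDA should be available if the cu121 wheels match the installed driver.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("laion_clap and miditoolkit imported successfully")
```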
Download the pre-trained CLAP checkpoint into `Music-Tri-Modal/ckpt/`:

```bash
wget https://huggingface.co/lukewys/laion_clap/resolve/main/music_audioset_epoch_15_esc_90.14.pt
```
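To verify the checkpoint loads, something like the following should work (a sketch based on the documented laion_clap usage; this music checkpoint was trained with the HTSAT-base audio encoder, so `amodel='HTSAT-base'` is assumed here):

```python
import laion_clap

# Load the music-specific CLAP checkpoint downloaded above.
model = laion_clap.CLAP_Module(enable_fusion=False, amodel='HTSAT-base')
model.load_ckpt('ckpt/music_audioset_epoch_15_esc_90.14.pt')
```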
The Music-Tri-Modal model is trained on a multimodal dataset of (audio, text, MIDI) pairs. Annotations should be provided in JSON format and must include the following fields:

- `audio_id`: the unique identifier for each audio track in the dataset
- `caption`: a string with the textual description of the audio track
- `audio_path`: path to the audio track, relative to the root audio directory

One JSON file per split must be provided and stored in the `data/datasets` directory, following this structure:
```
dataset_name
├── audio
│   ├── track_1.mp3
│   ├── track_2.mp3
│   └── ...
├── midi
│   ├── track_1/
│   │   ├── track_1_1.mid
│   │   ├── track_1_2.mid
│   │   └── ...
│   ├── track_2/
│   │   ├── track_2_1.mid
│   │   ├── track_2_2.mid
│   │   └── ...
│   └── ...
├── dataset_train.json
├── dataset_val.json
└── dataset_test.json
```
An illustrative example of the dataset is provided in `data/datasets/audiocaption/`.
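For reference, a split file might look like the following (hypothetical values; only the three required fields are shown):

```json
[
    {
        "audio_id": 0,
        "caption": "An upbeat pop track with bright synths and a steady drum beat.",
        "audio_path": "audio/track_1.mp3"
    }
]
```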
Download the aligned Lakh MIDI data and the matched audio into the dataset directory:

```bash
cd /Music-Tri-Modal/data/datasets/audiocaptionmidi/
mkdir midi
mkdir audio
cd midi
wget http://hog.ee.columbia.edu/craffel/lmd/lmd_aligned.tar.gz
tar -zxvf lmd_aligned.tar.gz
cd ../audio
wget http://hog.ee.columbia.edu/craffel/lmd/lmd_matched_mp3.tar.gz
tar -xzvf lmd_matched_mp3.tar.gz
```
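A quick way to confirm the archives extracted correctly is to count the files (a minimal sketch; the `midi/` and `audio/` directory names assume the layout produced by the commands above):

```python
from pathlib import Path

# Count extracted files under the assumed extraction directories.
n_midi = sum(1 for _ in Path("midi").rglob("*.mid"))
n_audio = sum(1 for _ in Path("audio").rglob("*.mp3"))
print(f"{n_midi} MIDI files, {n_audio} mp3 files extracted")
```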
The MidiCaps data uses the same annotation format described above. One JSON file per split is stored in the `data/datasets` directory, following this structure:
```
dataset_name
├── lmd_full
│   ├── 0/
│   │   ├── track_1_1.mid
│   │   ├── track_1_2.mid
│   │   └── ...
│   ├── 1/
│   │   ├── track_2_1.mid
│   │   ├── track_2_2.mid
│   │   └── ...
│   └── ...
├── dataset_train.json
├── dataset_val.json
└── dataset_test.json
```
Download the MidiCaps captions and MIDI archive:

```bash
wget https://huggingface.co/datasets/amaai-lab/MidiCaps/resolve/main/train.json
wget https://huggingface.co/datasets/amaai-lab/MidiCaps/resolve/main/midicaps.tar.gz
tar -zxvf midicaps.tar.gz
```
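The following script converts the MidiCaps metadata into the annotation format above, keeping only the entries whose MIDI file actually exists on disk: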
```python
import json
import os

import pandas as pd

# train.json is in JSON Lines format: one record per line.
with open("train.json", "r", encoding="utf-8") as file:
    data = file.readlines()

# Collect (midi_file, caption) pairs, keeping only entries whose
# MIDI file exists on disk so the two lists stay aligned.
midi_files = []
captions = []
for line in data:
    record = json.loads(line)
    if os.path.exists(record["location"]):
        midi_files.append(record["location"])
        captions.append(record["caption"])

# Build a DataFrame
df = pd.DataFrame({
    "midi_file": midi_files,
    "caption": captions,
})

# Convert to the annotation format
json_data = []
for index, row in df.iterrows():
    json_data.append({
        "audio_id": index,               # ascending natural numbers
        "caption": row["caption"],       # the caption column
        "audio_path": row["midi_file"],  # the midi_file column
    })

# Save the JSON data to a file
json_file_path = "dataset_all.json"  # path of the output JSON file
with open(json_file_path, "w", encoding="utf-8") as f:
    json.dump(json_data, f, ensure_ascii=False, indent=4)

print(f"Saved JSON data to {json_file_path}.")
```
Dataset, model and training configurations are set in the respective YAML files in `configs`. You can also pass some options via the CLI, overriding the arguments in the config files. For more details on the CLI options, please refer to the training script.
To train the model with the default configs, simply run:

```bash
cd scripts/
python train.py
```
This will generate a `model_id` and create a new folder in `save/experiments/` where the output will be saved.
If you wish to resume training from a saved checkpoint, run this command:
```bash
python train.py --experiment_id <model_id>
```
Once trained, you can evaluate the model on the cross-modal retrieval task:

```bash
python evaluate.py <model_id> retrieval
```
or, in the zero-shot transfer setting, on an arbitrary music classification task. In our zero-shot evaluation, we include:

- `mtt`: auto-tagging on the MagnaTagATune dataset
- `gtzan`: music genre classification on the GTZAN dataset

```bash
python evaluate.py <model_id> zeroshot <dataset_name>
```
You'll need to download the datasets inside the `datasets/` folder and preprocess them before running the zero-shot evaluation.
This repository is based on https://github.com/ilaria-manco/muscall