A codebase for fine-tuning Whisper on music transcription.
To install the required dependencies, run the following command:
pip install -r requirements.txt
To download the dataset, run the following command:
python download_dataset.py --num_images 1000
The dataset will be downloaded to the data directory.
The format of the dataset is as follows:
dataset
├── audio
│   ├── 0.wav
│   ├── 1.wav
│   └── ...
└── lyrics
    ├── 0.txt
    ├── 1.txt
    └── ...
where each audio file is paired with the lyrics transcription that shares its index, e.g. 0.wav with 0.txt.
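For reference, the paired files can be walked like this (a minimal sketch assuming the layout above; the dataset/ root and the pairing-by-stem convention are taken directly from the tree):

# Iterate over (audio, lyrics) pairs following the layout above.
from pathlib import Path

audio_dir = Path("dataset/audio")
lyrics_dir = Path("dataset/lyrics")

for wav_path in sorted(audio_dir.glob("*.wav")):
    txt_path = lyrics_dir / (wav_path.stem + ".txt")  # 0.wav -> 0.txt
    print(wav_path.name, "->", txt_path.read_text()[:40])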
To process the dataset, run the following command:
python process_dataset.py --clean
This will split the audio into chunks of 32 seconds and split the lyrics accordingly.
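The snippet below is a minimal sketch of what fixed-length chunking looks like, not the actual implementation in process_dataset.py; the 32-second chunk length comes from the description above, while the sampling rate and output paths are assumptions.

# Illustrative fixed-length chunking (not the real process_dataset.py logic).
import librosa
import soundfile as sf

CHUNK_SECONDS = 32  # chunk length described above
SR = 16000          # assumed sampling rate (Whisper expects 16 kHz)

def split_audio(path, out_prefix):
    audio, _ = librosa.load(path, sr=SR)
    step = CHUNK_SECONDS * SR
    for i in range(0, len(audio), step):
        sf.write(f"{out_prefix}_{i // step}.wav", audio[i : i + step], SR)

split_audio("dataset/audio/0.wav", "chunks/0")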
Here is an example of how to test the model:
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

model_name = "Jour/whisper-small-lyric-finetuned"
audio_file = "PATH_TO_AUDIO_FILE"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# The processor (tokenizer + feature extractor) comes from the base model;
# the fine-tuned checkpoint reuses its vocabulary.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained(model_name)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    device=device,
)

# Resample the audio to the rate Whisper expects (16 kHz).
sample, _ = librosa.load(audio_file, sr=processor.feature_extractor.sampling_rate)
prediction = pipe(sample.copy(), batch_size=8)["text"]
print(prediction)
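With chunk_length_s=30, the pipeline splits recordings longer than a single Whisper context window into overlapping 30-second windows and stitches the transcriptions back together, so audio_file does not need to be pre-chunked for inference.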
This project is licensed under the MIT License - see the LICENSE file for details.