🤖 Model | 📔 Jupyter Notebook | 🤗 Huggingface Space Demo | 📃 Medium Blog (Thai)
Thonburian Whisper is an Automatic Speech Recognition (ASR) model for Thai, fine-tuned from OpenAI's Whisper model. The model was released as part of Huggingface's Whisper fine-tuning event (December 2022). We fine-tuned Whisper models for Thai using the Common Voice 13, Gowajee, Thai Elderly Speech, and Thai Dialect datasets. Our models are robust under environmental noise and adapt well to domain-specific audio such as the financial and medical domains. We release the models and distilled models on the Huggingface model hub (see below).
Use the model with Huggingface's transformers as follows:

```python
import torch
from transformers import pipeline

MODEL_NAME = "biodatlab/whisper-th-medium-combined"  # see alternative model names below
device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)

# Perform ASR with the created pipe.
text = pipe(
    "audio.mp3",
    generate_kwargs={"language": "<|th|>", "task": "transcribe"},
    batch_size=16,
)["text"]
```
Use `pip` to install the requirements as follows:

```sh
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!sudo apt install ffmpeg
```
We measure the word error rate (WER) of the model with the deepcut tokenizer, after normalizing special tokens (▁ to _ and — to -) and applying simple text post-processing (เเ to แ and ํา to ำ).
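The normalization above can be sketched in plain Python (the replacement pairs are those listed; the helper name is our own):

```python
def normalize_transcript(text: str) -> str:
    """Normalize special tokens and common Thai spelling variants
    before computing WER (replacement pairs as described above)."""
    replacements = [
        ("\u2581", "_"),             # '▁' (sentencepiece underline) -> '_'
        ("\u2014", "-"),             # em dash -> hyphen
        ("\u0e40\u0e40", "\u0e41"),  # 'เเ' (two sara e) -> 'แ'
        ("\u0e4d\u0e32", "\u0e33"),  # nikhahit + sara aa -> 'ำ' (sara am)
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text
```

WER is then computed on the normalized reference and hypothesis after tokenizing both with deepcut.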
| Model | WER (Common Voice 13) |
|---|---|
| Thonburian Whisper (small) Link | 13.14 |
| Thonburian Whisper (medium) Link | 7.42 |
| Thonburian Whisper (large-v2) Link | 7.69 |
| Thonburian Whisper (large-v3) Link | 6.59 |
Thonburian Whisper is fine-tuned on a combined dataset of Thai speech, including Common Voice, Google FLEURS, and curated datasets. The Common Voice test split follows the original split from the datasets library.
Inference time
We benchmarked the average inference speed on 1-minute audio across model sizes (small, medium, and large) on an NVIDIA A100 with fp32 precision and a batch size of 32. The medium model offers a balanced trade-off between WER and computational cost.
| Model | Memory usage (MB) | Inference time (sec / 1 min) | Number of parameters |
|---|---|---|---|
| Thonburian Whisper (small) Link | 7,194 | 4.83 | 242M |
| Thonburian Whisper (medium) Link | 10,878 | 7.11 | 764M |
| Thonburian Whisper (large) Link | 18,246 | 9.61 | 1540M |
| Distilled Thonburian Whisper (small) Link | 4,944 | TBA | 166M |
| Distilled Thonburian Whisper (medium) Link | 7,084 | TBA | 428M |
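A minimal timing harness for this kind of benchmark could look as follows (a sketch; the warm-up step and the commented call to the pipeline are our assumptions, not the exact script we used):

```python
import time
from statistics import mean


def benchmark(fn, n_warmup: int = 1, n_runs: int = 5) -> float:
    """Return the average wall-clock time (seconds) of fn() over n_runs,
    after n_warmup untimed warm-up calls (e.g. to trigger model loading
    and CUDA kernel compilation)."""
    for _ in range(n_warmup):
        fn()
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Hypothetical usage with the pipeline from above; on GPU, also call
# torch.cuda.synchronize() inside fn so queued kernels are included:
# avg_sec = benchmark(lambda: pipe("audio.mp3", batch_size=32))
```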
Thonburian Whisper can be used for long-form audio transcription by combining voice activity detection (VAD), a Thai word tokenizer, and chunking for word-level alignment. We found this approach more robust, with a lower insertion error rate (IER), than using Whisper with timestamps. See the README.md in the longform_transcription folder for usage details.
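One way the VAD-plus-chunking step could look (a hypothetical sketch of merging VAD speech segments into chunks that fit Whisper's 30-second context window, not the repository's implementation):

```python
def merge_segments(segments, max_chunk_s: float = 30.0):
    """Greedily merge consecutive VAD speech segments, given as
    (start, end) pairs in seconds, into chunks no longer than
    max_chunk_s so each chunk fits Whisper's context window."""
    chunks = []
    cur_start, cur_end = None, None
    for start, end in segments:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_chunk_s:
            cur_end = end  # extend the current chunk
        else:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks
```

Each merged chunk is then transcribed with the pipeline above, and the Thai word tokenizer is applied to align words to chunk boundaries.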
If you use the model, please cite it with the following BibTeX entry.
```bibtex
@misc{thonburian_whisper_med,
  author    = {Zaw Htet Aung and Thanachot Thavornmongkol and Atirut Boribalburephan and Vittavas Tangsriworakan and Knot Pipatsrisawat and Titipat Achakulvisut},
  title     = {Thonburian Whisper: A fine-tuned Whisper model for Thai automatic speech recognition},
  year      = 2022,
  url       = {https://huggingface.co/biodatlab/whisper-th-medium-combined},
  doi       = {10.57967/hf/0226},
  publisher = {Hugging Face}
}
```