Mukayese: Turkish NLP Strikes Back

Turkish Natural Language Processing lags behind in developing state-of-the-art systems because it lacks organized benchmarks and baselines. We fill this gap with Mukayese (Turkish for "comparison/benchmarking"), an extensive set of datasets and benchmarks for several Turkish NLP tasks. All datasets and code are publicly available in this repository.

Updates

(22/03/2022) Summarization models are available on Hugging Face! Download here
(01/03/2022) Paper is on ArXiv. View here.
(25/02/2022) Datasets have been made available through pre-release v0.0.1

What to do with Mukayese?

With Mukayese, researchers of Turkish NLP will be able to:

Compare the performance of existing methods in leaderboards.
Access existing implementations of NLP baselines.
Evaluate their own methods on the relevant test datasets.
Submit their own work to be listed in our leaderboards.

Mukayese's Mission

The main goal of Mukayese is to standardize the comparison and evaluation of Turkish NLP methods. Because no benchmarking platform exists, Turkish NLP researchers struggle to compare their models with existing ones.

Maintainers

Ali Safaya - @alisafaya
Emirhan Kurtuluş - @ekurtulus
Arda Göktoğan - @ardofski

Mukayese Tasks

This repository collects the documentation for reproducing the Mukayese baselines. Baselines are listed by task below:

Language Modeling

Datasets

trnews-64
trwiki-67

Baselines

SHARNN: Single Headed Attention - Recurrent Neural Networks
Adaptive-Span: Adaptive Attention Span for Transformers
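
This README does not state which metrics the language modeling leaderboard reports. As a minimal sketch, assuming a model exposes per-token natural-log probabilities, perplexity and bits-per-character could be computed as follows. The function names `perplexity` and `bits_per_character` are hypothetical and are not part of the Mukayese code.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities (hypothetical helper)."""
    if not log_probs:
        raise ValueError("log_probs must not be empty")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(log_probs) / len(log_probs)
    return math.exp(avg_nll)


def bits_per_character(log_probs, num_characters):
    """Total negative log-likelihood in bits, divided by the character count."""
    if num_characters <= 0:
        raise ValueError("num_characters must be positive")
    # Convert nats to bits by dividing by ln(2).
    total_bits = -sum(log_probs) / math.log(2)
    return total_bits / num_characters


# Example: four tokens, each predicted with probability 0.25.
example_log_probs = [math.log(0.25)] * 4
print(perplexity(example_log_probs))
```

Normalizing by characters rather than tokens keeps scores comparable across models with different tokenizers, which matters for a morphologically rich language such as Turkish.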

Machine Translation (EN/TR)

Datasets

WMT16
MuST-C

Baselines

Fairseq: Convolutional Sequence to Sequence Learning
Transformer: Attention Is All You Need
mBART50: Multilingual Translation with Extensible Multilingual Pretraining and Finetuning

Named Entity Recognition

Datasets

MilliyetNER
WikiANN

Baselines

BiLSTM-CRF: Bi-directional Long Short Term Memory with Conditional Random Field
BERT: Pretrained Bi-directional Transformers
BERT-CRF: Pretrained Bi-directional Transformers with Conditional Random Field

Sentence Segmentation

Datasets

trseg-41

Baselines

NLTK Punkt Sentence Tokenizer
SpaCy Sentencizer
ErSatz
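
To compare a segmenter against reference sentences, one option is to score predicted sentence boundaries with precision, recall, and F1. The sketch below is an illustration under that assumption, not the official Mukayese evaluation script, and the function name `boundary_f1` is hypothetical. Boundaries are measured as offsets over non-whitespace characters so that differences in spacing between the two segmentations do not affect the score.

```python
def boundary_f1(gold_sentences, predicted_sentences):
    """Precision, recall, and F1 over sentence-end offsets (hypothetical helper)."""

    def boundaries(sentences):
        # Record the cumulative count of non-whitespace characters
        # at the end of each sentence.
        offsets, position = set(), 0
        for sentence in sentences:
            position += len("".join(sentence.split()))
            offsets.add(position)
        return offsets

    gold = boundaries(gold_sentences)
    predicted = boundaries(predicted_sentences)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Example: the prediction misses the boundary after the first sentence.
gold = ["Merhaba dünya.", "Nasılsın?", "İyiyim."]
predicted = ["Merhaba dünya. Nasılsın?", "İyiyim."]
print(boundary_f1(gold, predicted))
```

Note that the end-of-text boundary always matches when both segmentations cover the same text, so this sketch slightly favors short inputs; excluding the final offset is a reasonable variation.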

Spell-checking and Correction

Datasets

trspell-10

Baselines

Hunspell Spell-checker
Zemberek NLP tool

Summarization

Datasets

trsum

Baselines

Turkish BART from Scratch
mBART
mT5-Base

Download trained models here

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("mukayese/mt5-base-turkish-sum")
model = AutoModelForSeq2SeqLM.from_pretrained("mukayese/mt5-base-turkish-sum")

article = """Fransız devi PSG'nin üzerindeki kara bulutlar dağılmıyor. 
Devler Ligi'nde Real Madrid'e olaylı şekilde boyun eğen başkent temsilcisinde oyuncuların gruplaşmaya başladığı öne sürüldü. 
Güney Amerikalılar ve Fransızca konuşanlar olarak ikiye ayrılan oyuncuların saha içerisinde de birbirlerine uzak olduğu iddia edildi. 
İşte PSG'de soyunma odasında yaşananlar ve 20 milyon avroluk tazminat ihtimali... 
UEFA Şampiyonlar Ligi'nde Real Madrid'e sansasyonel bir şekilde elenen Paris Saint Germain'de Kylian Mbappe haricindeki tüm oyunculara yönelik taraftar tepkisinin devam etmesi başkent temsilcisindeki krizi derinleştirdi.
RMC Sport'ta yer alan haberde;
Paris Saint Germain'in soyunma odasında işlerin yolunda gitmediği ve futbolcuların iki gruba ayrıldığı öne sürüldü. İddiaya göre oyuncular gruplaşmaya başladı ve aralarındaki iletişim her geçen gün zayıflıyor."""

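# Tokenize the article; truncation=True is required for max_length to take effect
inputs = tokenizer([article], max_length=1024, truncation=True, return_tensors="pt")

# Generate the summary with beam search
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    num_beams=6,
    max_length=100,
)
summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
print(summary)
# "UEFA Şampiyonlar Ligi'nde Real Madrid'e olaylı şekilde boyun eğen Paris Saint Germain'de oyuncuların gruplaşmaya başladığı öne sürüldü."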

Text Classification

Datasets

OffensEval2020
News-Cat

Baselines

Sentence Convolutional Neural Networks
Bi-directional Long Short Term Memory
BERT: Pretrained Bi-directional Transformers

Citation

@misc{safaya-etal-2022-mukayese,
    title={Mukayese: Turkish NLP Strikes Back},
    author={Ali Safaya and Emirhan Kurtuluş and Arda Göktoğan and Deniz Yuret},
    year={2022},
    eprint={2203.01215},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
