This repository contains the code and data for Romanian grammatical error correction (GEC) on RONACC.
Download the RONACC corpus: RONACC
Tokenized RONACC corpus: RONACC extra
Download the language model: 30mil_wiki_lm
Download the synthetic corpus: 10m_synthetic
Download the trained, fine-tuned Transformer-based model: transformer-base-fine-tune
Install the Python dependencies:
pip3 install -r requirements.txt
If you want to use LM predictions, install the KenLM library: kenlm
To run decoding with an existing model, run:
python3 transformer.py --checkpoint=path_to_model_checkpoint --lm_path=path_to_lm --d_model=size_of_model --decode_mode=True
(For the fine-tuned model, the model size is 768, so use --d_model=768.)
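If you launch decoding from a script, the command can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name build_decode_command and the example paths are hypothetical, and only the flags shown in the command above are used.

```python
import shlex


def build_decode_command(checkpoint, lm_path, d_model=768):
    """Return the argv list for a decoding run of transformer.py.

    The default d_model of 768 matches the fine-tuned model size
    stated in this README.
    """
    return [
        "python3",
        "transformer.py",
        f"--checkpoint={checkpoint}",
        f"--lm_path={lm_path}",
        f"--d_model={d_model}",
        "--decode_mode=True",
    ]


# Hypothetical paths: replace them with the locations of your downloads.
cmd = build_decode_command(
    "checkpoints/transformer-base-fine-tune",
    "lm/30mil_wiki_lm",
)

# Print the command as a single shell-safe string for inspection.
print(shlex.join(cmd))
```

To actually start the run, the list can be passed to subprocess.run(cmd, check=True) from the repository root.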
To train a model, run:
python3 transformer.py --checkpoint=path_to_model_checkpoint --separate=False --d_model=size_of_model --use_txt=True --dataset_file=path_to_txt_file_wrong_gold --train_mode=True
If you want to run on a TPU, you can pass the --use_tpu=True
argument, but you first need to generate the TFRecord files.
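The training invocation can be wrapped the same way. This is a sketch under assumptions: the helper name build_train_command and the example paths are hypothetical, and the TPU branch only appends --use_tpu=True. It does not generate or configure the TFRecord files that a TPU run requires.

```python
def build_train_command(checkpoint, dataset_file, d_model, use_tpu=False):
    """Return the argv list for a training run of transformer.py."""
    cmd = [
        "python3",
        "transformer.py",
        f"--checkpoint={checkpoint}",
        "--separate=False",
        f"--d_model={d_model}",
        "--use_txt=True",
        f"--dataset_file={dataset_file}",
        "--train_mode=True",
    ]
    if use_tpu:
        # Assumption: the flag is simply appended to the text-file command.
        # The TFRecord files must already exist; this sketch does not
        # create them or point the script at them.
        cmd.append("--use_tpu=True")
    return cmd


# Hypothetical paths and model size, for illustration only.
train_cmd = build_train_command("checkpoints/my_run", "data/train.txt", 768)
print(" ".join(train_cmd))
```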
You can use ERRANT as usual; just pass the argument -lang ro if you want to use it for Romanian. More details are in the ERRANT readme.
@inproceedings{cotet2020neural,
title={Neural grammatical error correction for romanian},
author={Cotet, Teodor-Mihai and Ruseti, Stefan and Dascalu, Mihai},
booktitle={2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI)},
pages={625--631},
year={2020},
organization={IEEE}
}