Korean is the official language of both South Korea and North Korea. Although the two Koreas share the same language, the North Korean and South Korean varieties differ in vocabulary, grammar, and spelling. The prolonged separation of North and South Korea has widened these differences, and the resulting language gap could become a major communication obstacle after Korean reunification.
It is therefore important to research ways to bridge the gap between the North Korean and South Korean languages, for example by developing a North-South Korean translator. However, it is difficult to find a North Korean language dataset with a corresponding South Korean counterpart, and the lack of a North-South Korean parallel corpus hinders active investment in machine translation of the North Korean language.
To address this issue, the Korean Unification Parallel Corpus (KPC) repository has been created. Its main goal is to provide a high-quality North and South Korean parallel corpus and make it available to the public. The KPC also explains how to use the parallel corpus for research, particularly in the field of machine translation.
The dataset contains 130,738 rows drawn from classic novels and the Bible. The data were collected according to the following criteria:
- The data must actually exist in both South Korea and North Korea.
- The data must be accurately aligned as sentence pairs.
Bible
: The Bible is translated into many languages and divided into chapters and verses, with each verse carrying consistent content across translations, making it well suited for matching sentence pairs.

Classic novels
: Classic novels are translated into many languages, with translations available in both South Korean and North Korean.
| Category | Type | Book | Rows | Total Rows |
|---|---|---|---|---|
| Classic Novels | Foreign | Jane Eyre | 60,331 | 94,459 (72%) |
| | | The Red and the Black | 34,128 | |
| | Korean | Onggojip-jeon | 988 | 6,293 (5%) |
| | | Sukhyang-jeon | 3,538 | |
| | | Shimchung-jeon | 1,767 | |
| Bible | - | - | | 29,986 (23%) |
| Total | - | - | | 130,738 (100%) |
The dataset consists of classic novels and the Bible. The classic novel data covers two foreign novels and three Korean novels, each built from a single edition published in North Korea and multiple editions published in South Korea. In total, the classic novel data yields 100,752 North Korean-South Korean sentence pairs. The Bible data was collected in the same manner, yielding 29,986 pairs. Altogether, 130,738 parallel sentence pairs were constructed, counted on the basis of the South Korean text. Among these, the maximum sentence length is 286 characters and the minimum is 2.
| NK (North Korean) | SK (South Korean) |
|---|---|
| 안해는 남편앞에 무릎을 꿇고 그를 붙들어두려고 하면서 부르짖었다. | 부인은 남편 앞에 무릎을 꿇고 그를 붙잡으려고 애쓰면서 소리쳤다. |
| 나는 창가림을 드리우고 난로가에 되돌아왔다. | 나는 커튼을 내리고 난롯가로 되돌아갔다. |
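For illustration, the following is a minimal sketch of how the corpus could be loaded and inspected, assuming it is distributed as a CSV file with `nk` and `sk` columns; the file name used here is hypothetical.

```python
import pandas as pd

# Minimal sketch of loading the parallel corpus, assuming it is distributed as a
# CSV file with "nk" (North Korean) and "sk" (South Korean) columns.
df = pd.read_csv("kpc_parallel_corpus.csv")  # hypothetical file name

print(len(df))                   # expected: 130,738 sentence pairs
print(df[["nk", "sk"]].head())   # inspect a few NK-SK pairs
```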
KoBART (Korean BART), developed by the SKT AI team, was used as the base translation model.
We trained a North Korean (NK) → South Korean (SK) translation model and a South Korean (SK) → North Korean (NK) translation model. Training was conducted on 90% of the 130,738 rows of classic novel and Bible data; the remaining 10% was used as test data.
The data was split into a 9:1 ratio for training and testing. Because each foreign novel is based on a single edition from a North Korean publisher and multiple editions from South Korean publishers, the same North Korean sentence is repeated once for each South Korean edition. Care was therefore taken to ensure that North Korean sentences in the test data did not also appear in the training data.
For Jane Eyre, a fixed number of rows was randomly selected from the North Korean data, and the corresponding North-South Korean sentence pairs were extracted as test data, while the remaining pairs were used as training data.
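The sketch below illustrates one way such a leakage-safe 9:1 split could be implemented, grouping rows by the North Korean sentence so that no NK sentence appears in both splits; this is only an illustrative approach, not necessarily the exact procedure used.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("kpc_parallel_corpus.csv")  # hypothetical file name

# Group rows by the North Korean sentence so that each NK sentence falls
# entirely into either the training set or the test set, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["nk"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: no NK sentence from the test set leaks into training.
assert set(test_df["nk"]).isdisjoint(set(train_df["nk"]))
```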
| | Train | Test |
|---|---|---|
| Count | 117,665 | 13,073 |
| Size | 9.9 MB | 961 KB |
For the training process, the hyperparameters were set as follows.
| | NK → SK model | SK → NK model |
|---|---|---|
| Batch size | 4 | 4 |
| Epochs | 8 | 8 |
| Learning rate | 3e-5 | 3e-5 |
| Optimizer | AdamW | AdamW |
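As an illustration, the sketch below shows how these hyperparameters could map onto a Hugging Face fine-tuning run for the NK → SK direction; the KoBART checkpoint name, the maximum sequence length, and the preprocessing details are assumptions rather than the exact training setup.

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer,
                          DataCollatorForSeq2Seq)

checkpoint = "gogamza/kobart-base-v2"  # assumed KoBART checkpoint on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

train_df = pd.read_csv("train.csv")  # hypothetical training split with nk/sk columns

def preprocess(batch):
    # NK sentences are the source, SK sentences are the target.
    model_inputs = tokenizer(batch["nk"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["sk"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_ds = Dataset.from_pandas(train_df).map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="nk2sk-kobart",
    per_device_train_batch_size=4,  # batch size 4
    num_train_epochs=8,             # 8 epochs
    learning_rate=3e-5,             # learning rate 3e-5
    optim="adamw_torch",            # AdamW optimizer
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```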
The evaluation metrics used on the test set are the BLEU score and BERTScore. The table below presents the BLEU score and BERTScore of the North Korean (NK) → South Korean (SK) translation model and the South Korean (SK) → North Korean (NK) translation model.
cf. BERTScore computes precision, recall, and F1; for simplicity, only the F1 score is presented in the table.
| | NK → SK model | SK → NK model |
|---|---|---|
| BLEU score | 0.55 | 0.25 |
| BERTScore (F1) | 0.821 | 0.815 |
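For reference, the sketch below shows how BLEU and BERTScore could be computed on model outputs using the `sacrebleu` and `bert_score` packages; the exact metric configuration (tokenization, BERTScore backbone, language setting) behind the reported numbers may differ.

```python
import sacrebleu
from bert_score import score as bert_score

# Hypothetical model outputs and their South Korean references on the test set.
hypotheses = ["나는 커튼을 내리고 난롯가로 돌아갔다."]
references = ["나는 커튼을 내리고 난롯가로 되돌아갔다."]

# Corpus-level BLEU (sacrebleu reports scores on a 0-100 scale).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# BERTScore returns precision, recall, and F1; only F1 is reported above.
P, R, F1 = bert_score(hypotheses, references, lang="ko")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```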
[v1.0]
2024-03-27
A total of 130,738 North Korean-South Korean sentence pairs uploaded.
- Hyesun Chun (전혜선) [email protected]
- Chanju Lee (이찬주) [email protected]
- Ahhyun Kum (금아현) [email protected]
- Haeun Yoon (윤하은) [email protected]
- Hyunkyoo Choi (최현규) [email protected]
- Charmgil Hong (홍참길) [email protected]
- Handong AI Lab https://hail.handong.edu/
If you use the Korean Parallel Corpus (KPC), please cite the following paper and star this repository:
@inproceedings{chun2024paclic,
    title     = "Bridging the Linguistic Divide: Developing a North-South Korean Parallel Corpus for Machine Translation",
    author    = "Hannah H. Chun and Chanju Lee and Hyunkyoo Choi and Charmgil Hong",
    booktitle = "Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation",
    month     = dec,
    year      = "2024",
    address   = "Tokyo, Japan",
    publisher = "Association for Computational Linguistics",
}
KPC is licensed under GNU Free Documentation License (GFDL).
This research was supported (1) by the Korea Institute of Science and Technology Information (KISTI) in 2023 (K-23-L01-C01, Construction on Intelligent SciTech Information Curation), (2) by the MSIT (Ministry of Science and ICT), Korea, under the Global Research Support Program in the Digital Field (RS-2024-00431394) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and (3) by the MSIT, Korea, under the National Program for Excellence in SW, supervised by the IITP in 2024 (2023-0-00055).
- KoBART-translation: https://github.com/seujung/KoBART-translation