Korean Parallel Corpus

What is KPC?
Data
Translation Model Experiments
Change Log
Contributors
Contact
Citation

1) What is KPC?

Korean is the official language of both South Korea and North Korea. Although the two countries share the same language, the North Korean and South Korean varieties differ in vocabulary, grammar, and spelling. The ongoing separation between North Korea and South Korea has widened these differences, and the resulting language gap could become a major obstacle to communication after Korean reunification.

Therefore, it is important to conduct research on how to bridge the gap between the North Korean and South Korean languages. One example would be to develop a North-South Korean translator. However, it is difficult to find North Korean text that has a corresponding South Korean counterpart. The lack of a North Korean-South Korean parallel corpus hinders active investment in machine translation of the North Korean language.

To address this issue, the Korean Unification Parallel Corpus (KPC) repository has been created. Its main goal is to provide a high-quality North and South Korean parallel corpus and make it available to the public. The KPC also explains how to use the parallel corpus for research, particularly in the field of machine translation.

1-1) Sources

The dataset contains 130,738 rows drawn from classic novels and the Bible. The classic novels are Jane Eyre, The Red and the Black, Onggojip-jeon, Sukhyang-jeon, and Shimchung-jeon.

1-2) Data Selection

Criteria for Selection

The data must actually exist in both South and North Korea.
The data must be accurately matched as sentence pairs.

Data Acquisition

Bible: The Bible has been translated into many languages and is divided into chapters and verses whose content is consistent across translations, which makes it well suited to matching.
Classic novels: Classic novels have been translated into various languages, and translations are available in both South Korean and North Korean.
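
The verse-level matching described for the Bible can be sketched as a join on a (book, chapter, verse) key. This is only an illustration under stated assumptions, not the pipeline used to build KPC: the function name `align_by_verse`, the key layout, and the placeholder texts are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical key layout: (book, chapter, verse).
# The actual KPC preprocessing is not described in this README and may differ.
VerseKey = Tuple[str, int, int]


def align_by_verse(
    nk: Dict[VerseKey, str], sk: Dict[VerseKey, str]
) -> List[Tuple[str, str]]:
    """Pair NK and SK verses that share a key; verses missing on one side are skipped."""
    shared_keys = sorted(nk.keys() & sk.keys())
    return [(nk[key], sk[key]) for key in shared_keys]


# Placeholder strings stand in for real verse texts.
nk_verses = {("Genesis", 1, 1): "NK text 1:1", ("Genesis", 1, 2): "NK text 1:2"}
sk_verses = {("Genesis", 1, 1): "SK text 1:1", ("Genesis", 1, 3): "SK text 1:3"}

pairs = align_by_verse(nk_verses, sk_verses)
print(pairs)
```

Only verses present on both sides are kept, which reflects the second selection criterion (accurately matched pairs) at the cost of dropping unmatched verses.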

2) Data

2-1) List

Category	Type	Book	Rows	Group Total (Share)
Classic Novels	Foreign	Jane Eyre	60,331	94,459 (72%)
Classic Novels	Foreign	The Red and the Black	34,128	94,459 (72%)
Classic Novels	Korean	Onggojip-jeon	988	6,293 (5%)
Classic Novels	Korean	Sukhyang-jeon	3,538	6,293 (5%)
Classic Novels	Korean	Shimchung-jeon	1,767	6,293 (5%)
Bible	-	-	-	29,986 (23%)
Total	-	-	-	130,738 (100%)

The dataset consists of classic novels and the Bible. The classic novel data comprises two foreign novels and three Korean novels, each based on a single publication from a North Korean publisher and multiple publications from South Korean publishers. In total, the classic novels yielded 100,752 North Korean-South Korean sentence pairs. The Bible data was collected in the same manner and yielded 29,986 sentence pairs. The full parallel corpus therefore contains 130,738 sentence pairs, counted by South Korean sentence. Sentence length ranges from a minimum of 2 characters to a maximum of 286 characters.
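
As a quick consistency check, the group totals and shares can be recomputed from the per-book row counts in the table in section 2-1. The snippet below is a small sketch; the variable names are illustrative only.

```python
# Row counts copied from the table in section 2-1.
rows = {
    "Jane Eyre": 60_331,
    "The Red and the Black": 34_128,
    "Onggojip-jeon": 988,
    "Sukhyang-jeon": 3_538,
    "Shimchung-jeon": 1_767,
    "Bible": 29_986,
}

foreign = rows["Jane Eyre"] + rows["The Red and the Black"]
korean = rows["Onggojip-jeon"] + rows["Sukhyang-jeon"] + rows["Shimchung-jeon"]
bible = rows["Bible"]
total = sum(rows.values())

# Shares rounded to whole percentages, as in the table.
shares = {
    "foreign": round(100 * foreign / total),
    "korean": round(100 * korean / total),
    "bible": round(100 * bible / total),
}
print(foreign, korean, foreign + korean, total, shares)
```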

2-2) Examples

nk	sk
안해는 남편앞에 무릎을 꿇고 그를 붙들어두려고 하면서 부르짖었다.	부인은 남편 앞에 무릎을 꿇고 그를 붙잡으려고 애쓰면서 소리쳤다.
나는 창가림을 드리우고 난로가에 되돌아왔다.	나는 커튼을 내리고 난롯가로 되돌아갔다.
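
A minimal sketch of reading such pairs follows, assuming the data is stored as tab-separated text with an `nk` and an `sk` column as in the examples above. The file layout and the helper name `read_pairs` are assumptions; check the files in the data folder for the actual format. The two sample rows are copied from the examples.

```python
import csv
import io

# Two rows copied from the examples above.
# A real run would pass an open corpus file instead of this in-memory sample.
SAMPLE = (
    "nk\tsk\n"
    "안해는 남편앞에 무릎을 꿇고 그를 붙들어두려고 하면서 부르짖었다.\t"
    "부인은 남편 앞에 무릎을 꿇고 그를 붙잡으려고 애쓰면서 소리쳤다.\n"
    "나는 창가림을 드리우고 난로가에 되돌아왔다.\t"
    "나는 커튼을 내리고 난롯가로 되돌아갔다.\n"
)


def read_pairs(handle):
    """Return (nk, sk) tuples from a tab-separated stream with an nk/sk header."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["nk"], row["sk"]) for row in reader]


pairs = read_pairs(io.StringIO(SAMPLE))

# Character lengths over both sides, the kind of statistic reported in section 2-1.
lengths = [len(sentence) for pair in pairs for sentence in pair]
print(len(pairs), min(lengths), max(lengths))
```

`csv.QUOTE_NONE` is used so that quotation marks inside sentences are read literally rather than treated as field delimiters.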

3) Translation Model Experiments

3-1) Experimental Settings

Foundation Model

KoBART (Korean BART), developed by the SKT AI team, was used as the foundation model for translation.

Training

We trained a North Korean (NK) → South Korean (SK) translation model and a South Korean (SK) → North Korean (NK) translation model. Training used 90% of the 130,738 rows of classic novel and Bible data; the remaining 10% was held out as test data.

For the foreign novels, each book is based on a single publication from a North Korean publisher and multiple publications from South Korean publishers, so the same North Korean sentence is repeated once per South Korean publication. Care was therefore taken to ensure that North Korean sentences in the test data did not also appear in the training data.

For Jane Eyre, a certain number of rows were randomly selected from the North Korean data, and the corresponding North-South Korean sentence pairs were extracted as test data, while the remaining sentence pairs were used as training data.

	train	test
Count	117,665	13,073
Size	9.9MB	961KB
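
The leakage precaution described above (no North Korean sentence shared between training and test data) can be sketched as a split over groups of pairs rather than over individual pairs. This is a hedged illustration, not the authors' script: `split_by_nk_sentence`, the seed, and the toy data are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (nk_sentence, sk_sentence)


def split_by_nk_sentence(
    pairs: List[Pair], test_ratio: float = 0.1, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs so that each NK sentence lands entirely in train or entirely in test."""
    groups: Dict[str, List[Pair]] = defaultdict(list)
    for nk, sk in pairs:
        groups[nk].append((nk, sk))

    nk_sentences = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(nk_sentences)

    target = round(len(pairs) * test_ratio)
    train: List[Pair] = []
    test: List[Pair] = []
    for nk in nk_sentences:
        # Fill the test set group by group until the target size is reached.
        bucket = test if len(test) < target else train
        bucket.extend(groups[nk])
    return train, test


# Toy data: 50 NK sentences, each paired with 2 SK translations.
toy = [(f"nk{i}", f"sk{i}-{j}") for i in range(50) for j in range(2)]
train, test = split_by_nk_sentence(toy)
print(len(train), len(test))
```

Because whole groups are moved at once, the test set can slightly exceed the target size when the last group added is large.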

For the training process, the hyperparameters were set as follows.

	NK → SK model	SK → NK model
batch size	4	4
epoch	8	8
learning rate	3e-5	3e-5
optimizer	AdamW	AdamW
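
The settings in the table can be captured in a small configuration object. The sketch below also estimates the number of optimizer steps for the 117,665 training rows, assuming one update per batch, a single device, and a kept final partial batch; none of these details are stated in this README, and `TrainConfig` and `total_steps` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from the table above (identical for both directions)."""
    batch_size: int = 4
    epochs: int = 8
    learning_rate: float = 3e-5
    optimizer: str = "AdamW"


def total_steps(num_examples: int, cfg: TrainConfig) -> int:
    """Optimizer steps, assuming one update per batch and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = TrainConfig()
print(total_steps(117_665, cfg))  # 29,417 steps per epoch x 8 epochs = 235,336
```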

3-2) Experimental Results

The models were evaluated on the test set with the BLEU score and BERT Score. The table below reports both metrics for the North Korean (NK) → South Korean (SK) translation model and the South Korean (SK) → North Korean (NK) translation model.

Note: BERT Score computes precision, recall, and F1-score. For simplicity, only the F1-score is presented in the table.

	NK → SK model	SK → NK model
BLEU Score	0.55	0.25
BERT Score	0.821	0.815
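
For readers unfamiliar with BLEU, the following is a minimal corpus-level implementation on a 0-1 scale. It is a sketch only: it assumes whitespace tokenization, one reference per hypothesis, uniform weights up to 4-grams, and no smoothing. The tokenization and tooling behind the scores in the table are not stated here, so this code should not be expected to reproduce them.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Corpus-level BLEU in [0, 1]: geometric mean of clipped n-gram precisions
    multiplied by the brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Placeholder tokens; real use would pass model outputs and reference sentences.
score = corpus_bleu(["a b c d x"], ["a b c d e"])
print(round(score, 3))
```

Because Korean is agglutinative, scores depend heavily on whether tokens are whitespace-separated words, morphemes, or subwords, so the tokenization should be fixed before comparing BLEU values.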

4) Change Log

[v1.0]
2024-03-27
A total of 130,738 North Korean-South Korean sentence pairs uploaded.

5) Contributors

Hyesun Chun (전혜선)
Chanju Lee (이찬주)
Ahhyun Kum (금아현)
Haeun Yoon (윤하은)
Hyunkyoo Choi (최현규)
Charmgil Hong (홍참길)

6) Contact

Charmgil Hong (홍참길)
Handong AI Lab https://hail.handong.edu/

7) Citation

If you use Korean Parallel Corpus (KPC), please cite the following paper and star this repository:

@inproceedings{chun2024paclic,
    title = {Bridging the Linguistic Divide: Developing a North-South Korean Parallel Corpus for Machine Translation},
    author = {Hannah H. Chun and Chanju Lee and Hyunkyoo Choi and Charmgil Hong},
    booktitle = {Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation},
    month = dec,
    year = {2024},
    address = {Tokyo, Japan},
    publisher = {Association for Computational Linguistics}
}

KPC is licensed under GNU Free Documentation License (GFDL).

Acknowledgement

This research was supported (1) by the Korea Institute of Science and Technology Information (KISTI) in 2023 (K-23-L01-C01, Construction on Intelligent SciTech Information Curation), (2) by the MSIT (Ministry of Science and ICT), Korea, under the Global Research Support Program in the Digital Field (RS-2024-00431394), supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and (3) by the MSIT, Korea, under the National Program for Excellence in SW, supervised by the IITP in 2024 (2023-0-00055).

References

KoBART
https://github.com/SKT-AI/KoBART
KoBART-translation
https://github.com/seujung/KoBART-translation

