Releases: hyunwoongko/kss
v4.0.4
Add email and URL processing.
v4.0.3
Fix bugs in the rule for '듯'
v4.0.2
- Fix an lru_cache-related bug
- Improve inference speed by removing jamo search
- Add "떄" (a typo of "때") and "구요" (EF) to the preprocessor
- Modify "하며 " to "하며" in unavailable_next
v4.0.0
Kss: A Toolkit for Korean sentence segmentation
This repository contains the source code of Kss, a representative Korean sentence segmentation toolkit. I also conduct ongoing research on Korean sentence segmentation algorithms and report the results in this repository.
If you have good ideas about Korean sentence segmentation, please feel free to discuss them in the issues.
What's New:
- December 19, 2022 Released Kss 4.0 Python.
- May 5, 2022 Released Kss Flutter.
- August 25, 2021 Released Kss Java.
- August 18, 2021 Released Kss 3.0 Python.
- December 21, 2020 Released Kss 2.0 Python.
- August 16, 2019 Released Kss 1.0 C++.
Installation
Install Kss
Kss can be easily installed using the pip package manager.
pip install kss
Install Mecab (Optional)
To make Kss much faster, please install one of mecab or konlpy.tag.Mecab.
- mecab (Linux/MacOS): https://github.com/hyunwoongko/python-mecab-kor
- mecab (Windows): https://cleancode-ws.tistory.com/97
- konlpy.tag.Mecab (Linux/MacOS): https://konlpy.org/en/latest/api/konlpy.tag/#mecab-class
- konlpy.tag.Mecab (Windows): https://uwgdqo.tistory.com/363
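Since the analyzer backends above are optional, it can be handy to check which ones are importable before running Kss. This is a minimal stdlib-only sketch, not part of the Kss API; the candidate module names are assumptions based on the backends listed above:

```python
from importlib import util

def find_available_backend():
    """Return the name of the first importable analyzer backend, or None.

    The module names below are assumptions for illustration; check your
    environment for the exact package names of the backends you installed.
    """
    for name in ("mecab", "konlpy", "pecab"):
        if util.find_spec(name) is not None:
            return name
    return None

print(find_available_backend())
```

If this returns None, Kss still works, just more slowly without a morpheme analyzer backend.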
Features
1) split_sentences
: split text into sentences
from kss import split_sentences
split_sentences(
text: Union[str, List[str], Tuple[str]],
backend: str = "auto",
num_workers: Union[int, str] = "auto"
)
Parameters
- text: String or List/Tuple of strings
  - string: single text segmentation
  - list/tuple of strings: batch texts segmentation
- backend: Morpheme analyzer backend.
  - backend='auto': find mecab → konlpy.tag.Mecab → pecab and use the first analyzer found (default)
  - backend='mecab': find mecab → konlpy.tag.Mecab and use the first analyzer found
  - backend='pecab': use the pecab analyzer
- num_workers: The number of multiprocessing workers.
  - num_workers='auto': use multiprocessing with the maximum number of workers if possible (default)
  - num_workers=1: don't use multiprocessing
  - num_workers=2~N: use multiprocessing with the specified number of workers
Usages
- Single text segmentation

  import kss

  text = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
  kss.split_sentences(text)
  # ['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요',
  #  '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다',
  #  '강남역 맛집 토끼정의 외부 모습.']
- Batch texts segmentation

  import kss

  texts = [
      "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다",
      "강남역 맛집 토끼정의 외부 모습. 강남 토끼정은 4층 건물 독채로 이루어져 있습니다.",
      "역시 토끼정 본 점 답죠?ㅎㅅㅎ 건물은 크지만 간판이 없기 때문에 지나칠 수 있으니 조심하세요 강남 토끼정의 내부 인테리어.",
  ]
  kss.split_sentences(texts)
  # [['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요',
  #   '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다'],
  #  ['강남역 맛집 토끼정의 외부 모습.', '강남 토끼정은 4층 건물 독채로 이루어져 있습니다.'],
  #  ['역시 토끼정 본 점 답죠?ㅎㅅㅎ', '건물은 크지만 간판이 없기 때문에 지나칠 수 있으니 조심하세요', '강남 토끼정의 내부 인테리어.']]
Performance Analysis
1) Test Commands
You can reproduce this experiment using the source code and datasets in the ./bench/ directory; the source code was copied from here.
Note that Baseline is regex-based segmentation like re.split(r"(?<=[.!?])\s", text).
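For a concrete sense of what the Baseline does, the regex quoted above can be run directly with the standard library. It splits after ., !, or ? followed by whitespace, so it cannot handle Korean sentences that end without terminal punctuation, which is exactly what the harder benchmark datasets probe:

```python
import re

def baseline_split(text):
    """Split after ., !, or ? when followed by whitespace,
    using the regex quoted in the benchmark description."""
    return re.split(r"(?<=[.!?])\s", text)

print(baseline_split("좋은 아침입니다. 오늘 날씨 어때요? 맑아요!"))
# ['좋은 아침입니다.', '오늘 날씨 어때요?', '맑아요!']
```

On text without terminal punctuation (common in blog posts and tweets), this returns the input as one unsplit chunk.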
| Name | Command (in root directory) |
|---|---|
| Baseline | python3 ./bench/test_baseline.py ./bench/testset/*.txt |
| Kiwi | python3 ./bench/test_kiwi.py ./bench/testset/*.txt |
| Koalanlp | python3 ./bench/test_koalanlp.py ./bench/testset/*.txt --backend=OKT/HNN/KMR/RHINO/EUNJEON/ARIRANG/KAMA |
| Kss (ours) | python3 ./bench/test_kss.py ./bench/testset/*.txt --backend=mecab/pecab |
2) Evaluation datasets:
I tested it using the following 6 evaluation datasets. Thanks to Minchul Lee for creating various sentence segmentation datasets.
| Name | Description | Number of sentences | Creator |
|---|---|---|---|
| blogs_lee | Dataset for testing blog-style text segmentation | 170 | Minchul Lee |
| blogs_ko | Dataset for testing blog-style text segmentation, harder than Lee's blog dataset | 336 | Hyunwoong Ko |
| tweets | Dataset for testing Twitter-style text segmentation | 178 | Minchul Lee |
| nested | Dataset for testing segmentation of text containing parentheses and quotation marks | 91 | Minchul Lee |
| v_ending | Dataset for testing difficult eomi segmentation; contains various dialect sentences | 30 | Minchul Lee |
| sample | An example used in README.md (강남 토끼정) | 41 | Isaac, modified by Hyunwoong Ko |
Note that I modified the labels of two sentences in sample.txt made by Isaac, because the original blog post was written like the following:
But Isaac's labels were:
In fact, "사실 전 고기를 안 먹어서 무슨 맛인지 모르겠지만.." and "(물론 전 안 먹었지만" are adverb clauses (부사절), not independent sentences (문장).
So I corrected the labels of these two sentences.
3) Sentence segmentation performance (Quantitative Analysis)
The following table shows the segmentation performance based on exact match.
Kss performed best in most cases, and Kiwi also performed well. Both the baseline and Koalanlp performed poorly.
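The excerpt does not spell out how exact match is scored; a plausible reading (an illustrative assumption, not the benchmark's actual scoring code) is that a document counts as correct only when the predicted sentence list equals the gold segmentation exactly:

```python
def exact_match_accuracy(predictions, golds):
    """Fraction of documents whose predicted sentence list matches the
    gold segmentation exactly, in both content and order.

    This is an illustrative reading of "exact match", not the
    benchmark's actual code.
    """
    correct = sum(1 for pred, gold in zip(predictions, golds) if pred == gold)
    return correct / len(golds)

preds = [["안녕하세요.", "반갑습니다."], ["좋은 하루 보내세요"]]
golds = [["안녕하세요.", "반갑습니다."], ["좋은 하루", "보내세요"]]
print(exact_match_accuracy(preds, golds))  # 0.5
```

Under this all-or-nothing criterion, a single wrong boundary zeroes out the whole document, which explains why the naive regex baseline scores so poorly on unpunctuated text.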
| Name | Library version | Backend | blogs_lee | blogs_k...
v3.7.3
v3.7.1
v3.7.0
- Fix emoji-related bug #51
- Modify ETN (nominalizing eomi, "Eomi Transferred from Noun") related rules #50
  - You can now split ETN+XSV (nominalizing eomi + verb-derivational suffix) and ETN+XSA (nominalizing eomi + adjective-derivational suffix).
- Add auto option for the disable_gc parameter.
- Modify the unicode-related table for the none backend
  - example: '가나다라 이다 <emoji> 그러나'
    - before: ['가나다라 이다', '<emoji> 그러나']
    - after: ['가나다라 이다 <emoji>', '그러나']
- Add kss.__version__ for easy version checking.
v3.6.4
- Minor fixes
  - add SY for morpheme segmentation
  - remove '해' for morpheme segmentation
v3.6.2
Patch for better emoji processing
- Add additional unicodes to the dict to preserve original user input
  - kss.split_sentences('첫 번째는 ❤️❤️하트입니다. 두 번째는 😊😊웃는얼굴입니다. 세 번째는 👍👍엄지입니다.')
  - before: ['첫 번째는 ♥♥하트입니다.', '두 번째는 😊😊웃는얼굴입니다.', '세 번째는 👍👍엄지입니다.']
  - after: ['첫 번째는 ❤️❤️하트입니다.', '두 번째는 😊😊웃는얼굴입니다.', '세 번째는 👍👍엄지입니다.']
- Better emoji splitting
  - kss.split_sentences('안녕하세요 ❤️❤️ 반갑습니다')
  - before: ['안녕하세요', '❤️❤️ 반갑습니다']
  - after: ['안녕하세요 ❤️❤️', '반갑습니다']
v3.5.6
- Support konlpy mecab for tagging
  - reference: https://velog.io/@newdboy/macOS-mecab-%EC%84%A4%EC%B9%98for-konlpy-0.6.0-kss-3.3.1.1