Releases: hyunwoongko/kss
v4.0.4
Add email and URL processing.
v4.0.3
Fix bugs in the rule for '듯'
v4.0.2
- Fix an lru_cache-related bug
- Improve inference speed by removing jamo search
- Add "떄" (a typo of "때") and "구요" (EF) to the preprocessor
- Modify "하며 " to "하며" in unavailable_next
v4.0.0
Kss: A Toolkit for Korean sentence segmentation
This repository contains the source code of Kss, a representative Korean sentence segmentation toolkit. I also conduct ongoing research on Korean sentence segmentation algorithms and report the results in this repository.
If you have good ideas about Korean sentence segmentation, please feel free to discuss them in the issues.
What's New:
- December 19, 2022 Released Kss 4.0 Python.
- May 5, 2022 Released Kss Flutter.
- August 25, 2021 Released Kss Java.
- August 18, 2021 Released Kss 3.0 Python.
- December 21, 2020 Released Kss 2.0 Python.
- August 16, 2019 Released Kss 1.0 C++.
Installation
Install Kss
Kss can be easily installed using the pip package manager.
pip install kss
Install Mecab (Optional)
To make Kss much faster, please install one of mecab or konlpy.tag.Mecab.
- mecab (Linux/MacOS): https://github.com/hyunwoongko/python-mecab-kor
- mecab (Windows): https://cleancode-ws.tistory.com/97
- konlpy.tag.Mecab (Linux/MacOS): https://konlpy.org/en/latest/api/konlpy.tag/#mecab-class
- konlpy.tag.Mecab (Windows): https://uwgdqo.tistory.com/363
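Since the analyzer backends above are optional, it can be handy to check which ones are importable before running Kss. This is a minimal stdlib-only sketch, not part of the Kss API; the candidate module names are assumptions based on the backends listed above:

```python
from importlib import util

def find_available_backend():
    """Return the name of the first importable analyzer backend, or None.

    The module names below are assumptions for illustration; check your
    environment for the exact package names of the backends you installed.
    """
    for name in ("mecab", "konlpy", "pecab"):
        if util.find_spec(name) is not None:
            return name
    return None

print(find_available_backend())
```

If this returns None, Kss still works, just more slowly without a morpheme analyzer backend.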
Features
1) split_sentences
: split text into sentences
from kss import split_sentences
split_sentences(
text: Union[str, List[str], Tuple[str]],
backend: str = "auto",
num_workers: Union[int, str] = "auto"
)
Parameters
- text: String or List/Tuple of strings
  - string: single text segmentation
  - list/tuple of strings: batch texts segmentation
- backend: Morpheme analyzer backend.
  - backend='auto': find mecab → konlpy.tag.Mecab → pecab and use the first analyzer found (default)
  - backend='mecab': find mecab → konlpy.tag.Mecab and use the first analyzer found
  - backend='pecab': use the pecab analyzer
- num_workers: The number of multiprocessing workers.
  - num_workers='auto': use multiprocessing with the maximum number of workers if possible (default)
  - num_workers=1: don't use multiprocessing
  - num_workers=2~N: use multiprocessing with the specified number of workers
Usages
- Single text segmentation

  import kss

  text = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
  kss.split_sentences(text)
  # ['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요',
  #  '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다',
  #  '강남역 맛집 토끼정의 외부 모습.']
- Batch texts segmentation

  import kss

  texts = [
      "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다",
      "강남역 맛집 토끼정의 외부 모습. 강남 토끼정은 4층 건물 독채로 이루어져 있습니다.",
      "역시 토끼정 본 점 답죠?ㅎㅅㅎ 건물은 크지만 간판이 없기 때문에 지나칠 수 있으니 조심하세요 강남 토끼정의 내부 인테리어.",
  ]
  kss.split_sentences(texts)
  # [['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요',
  #   '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다'],
  #  ['강남역 맛집 토끼정의 외부 모습.', '강남 토끼정은 4층 건물 독채로 이루어져 있습니다.'],
  #  ['역시 토끼정 본 점 답죠?ㅎㅅㅎ', '건물은 크지만 간판이 없기 때문에 지나칠 수 있으니 조심하세요', '강남 토끼정의 내부 인테리어.']]
Performance Analysis
1) Test Commands
You can reproduce this experiment using the source code and datasets in the ./bench/ directory; the source code was copied from here.
Note that Baseline is regex-based segmentation like re.split(r"(?<=[.!?])\s", text).
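For a concrete sense of what the Baseline does, the regex quoted above can be run directly with the standard library. It splits after ., !, or ? followed by whitespace, so it cannot handle Korean sentences that end without terminal punctuation, which is exactly what the harder benchmark datasets probe:

```python
import re

def baseline_split(text):
    """Split after ., !, or ? when followed by whitespace,
    using the regex quoted in the benchmark description."""
    return re.split(r"(?<=[.!?])\s", text)

print(baseline_split("좋은 아침입니다. 오늘 날씨 어때요? 맑아요!"))
# ['좋은 아침입니다.', '오늘 날씨 어때요?', '맑아요!']
```

On text without terminal punctuation (common in blog posts and tweets), this returns the input as one unsplit chunk.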
| Name | Command (in root directory) |
|---|---|
| Baseline | python3 ./bench/test_baseline.py ./bench/testset/*.txt |
| Kiwi | python3 ./bench/test_kiwi.py ./bench/testset/*.txt |
| Koalanlp | python3 ./bench/test_koalanlp.py ./bench/testset/*.txt --backend=OKT/HNN/KMR/RHINO/EUNJEON/ARIRANG/KAMA |
| Kss (ours) | python3 ./bench/test_kss.py ./bench/testset/*.txt --backend=mecab/pecab |
2) Evaluation datasets:
I tested it using the following 6 evaluation datasets. Thanks to Minchul Lee for creating various sentence segmentation datasets.
| Name | Description | Number of sentences | Creator |
|---|---|---|---|
| blogs_lee | Dataset for testing blog-style text segmentation | 170 | Minchul Lee |
| blogs_ko | Dataset for testing blog-style text segmentation, harder than Lee's blog dataset | 336 | Hyunwoong Ko |
| tweets | Dataset for testing Twitter-style text segmentation | 178 | Minchul Lee |
| nested | Dataset for testing segmentation of text containing parentheses and quotation marks | 91 | Minchul Lee |
| v_ending | Dataset for testing difficult eomi segmentation; contains various dialect sentences | 30 | Minchul Lee |
| sample | An example used in README.md (강남 토끼정) | 41 | Isaac, modified by Hyunwoong Ko |
Note that I modified the labels of two sentences in sample.txt made by Isaac, because the original blog post was written like the following:
But Isaac's labels were:
In fact, "사실 전 고기를 안 먹어서 무슨 맛인지 모르겠지만.." and "(물론 전 안 먹었지만" are adverb clauses (부사절), not independent sentences (문장).
So I corrected the labels of these two sentences.
3) Sentence segmentation performance (Quantitative Analysis)
The following table shows the segmentation performance based on exact match.
Kss performed best in most cases, and Kiwi also performed well. Both the baseline and Koalanlp performed poorly.
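The excerpt does not spell out how exact match is scored; a plausible reading (an illustrative assumption, not the benchmark's actual scoring code) is that a document counts as correct only when the predicted sentence list equals the gold segmentation exactly:

```python
def exact_match_accuracy(predictions, golds):
    """Fraction of documents whose predicted sentence list matches the
    gold segmentation exactly, in both content and order.

    This is an illustrative reading of "exact match", not the
    benchmark's actual code.
    """
    correct = sum(1 for pred, gold in zip(predictions, golds) if pred == gold)
    return correct / len(golds)

preds = [["안녕하세요.", "반갑습니다."], ["좋은 하루 보내세요"]]
golds = [["안녕하세요.", "반갑습니다."], ["좋은 하루", "보내세요"]]
print(exact_match_accuracy(preds, golds))  # 0.5
```

Under this all-or-nothing criterion, a single wrong boundary zeroes out the whole document, which explains why the naive regex baseline scores so poorly on unpunctuated text.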
| Name | Library version | Backend | blogs_lee | blogs_k...
v3.7.3
v3.7.1
v3.7.0
- Fix emoji-related bug #51
- Modify ETN (nominalizing eomi, "Eomi Transferred from Noun") related rules #50
  - You can now split ETN+XSV (nominalizing eomi + verb-derivational suffix) and ETN+XSA (nominalizing eomi + adjective-derivational suffix).
- Add auto option for the disable_gc parameter.
- Modify the unicode-related table for the none backend
  - example: '가나다라 이다 <emoji> 그러나'
    - before: ['가나다라 이다', '<emoji> 그러나']
    - after: ['가나다라 이다 <emoji>', '그러나']
- Add kss.__version__ for easy version checking.
v3.6.4
- Minor fixes
  - add SY for morpheme segmentation
  - remove '해' for morpheme segmentation
v3.6.2
Patch for better emoji processing
- Add additional unicodes to the dict to preserve original user input
  - kss.split_sentences('첫 번째는 ❤️❤️하트입니다. 두 번째는 😊😊웃는얼굴입니다. 세 번째는 👍👍엄지입니다.')
  - before: ['첫 번째는 ♥♥하트입니다.', '두 번째는 😊😊웃는얼굴입니다.', '세 번째는 👍👍엄지입니다.']
  - after: ['첫 번째는 ❤️❤️하트입니다.', '두 번째는 😊😊웃는얼굴입니다.', '세 번째는 👍👍엄지입니다.']
- Better emoji splitting
  - kss.split_sentences('안녕하세요 ❤️❤️ 반갑습니다')
  - before: ['안녕하세요', '❤️❤️ 반갑습니다']
  - after: ['안녕하세요 ❤️❤️', '반갑습니다']
v3.5.6
- Support konlpy mecab for tagging
  - reference: https://velog.io/@newdboy/macOS-mecab-%EC%84%A4%EC%B9%98for-konlpy-0.6.0-kss-3.3.1.1