Releases: hyunwoongko/kss

v4.0.4

20 Dec 13:39

Add email and URL processing.

v4.0.3

20 Dec 07:04

Fix bugs in the '듯' rule

v4.0.2

20 Dec 06:56
  • Fix an lru_cache-related bug
  • Improve inference speed by removing jamo search
  • Add "떄" (a typo for "때") and "구요" (sentence-final ending, EF) to the preprocessor
  • Change "하며 " to "하며" in unavailable_next

v4.0.0

20 Dec 02:47

Kss: A Toolkit for Korean sentence segmentation


This repository contains the source code of Kss, a representative Korean sentence segmentation toolkit. I also conduct ongoing research on Korean sentence segmentation algorithms and report the results in this repository.
If you have good ideas about Korean sentence segmentation, please feel free to open an issue.



Installation

Install Kss

Kss can be easily installed using the pip package manager.

pip install kss

Install Mecab (Optional)

To make Kss much faster, please install either mecab or konlpy.tag.Mecab.
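For example, konlpy ships the konlpy.tag.Mecab wrapper. The exact package name for a standalone mecab binding varies by platform, so treat these commands as a sketch rather than official instructions:

```shell
# Option 1: konlpy's Mecab wrapper (requires a native mecab-ko build underneath)
pip install konlpy

# Option 2: a standalone Python mecab binding — the exact package name
# depends on your platform, so check the mecab-ko documentation first
```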

Features

1) split_sentences: split text into sentences

from kss import split_sentences

split_sentences(
    text: Union[str, List[str], Tuple[str]],
    backend: str = "auto",
    num_workers: Union[int, str] = "auto" 
)
Parameters
  • text: String or List/Tuple of strings
    • string: single text segmentation
    • list/tuple of strings: batch texts segmentation
  • backend: Morpheme analyzer backend.
    • backend='auto': try mecab → konlpy.tag.Mecab → pecab and use the first analyzer found (default)
    • backend='mecab': try mecab → konlpy.tag.Mecab and use the first analyzer found
    • backend='pecab': use the pecab analyzer
  • num_workers: The number of multiprocessing workers.
    • num_workers='auto': use multiprocessing with the maximum number of workers if possible (default)
    • num_workers=1: don't use multiprocessing
    • num_workers=2~N: use multiprocessing with the specified number of workers
Usages
  • Single text segmentation

    import kss
    
    text = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
    
    kss.split_sentences(text)
    # ['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요', '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다', '강남역 맛집 토끼정의 외부 모습.']
  • Batch texts segmentation

    import kss
    
    texts = [
        "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다",
        "강남역 맛집 토끼정의 외부 모습. 강남 토끼정은 4층 건물 독채로 이루어져 있습니다.",
        "역시 토끼정 본 점 답죠?ㅎㅅㅎ 건물은 크지만 간판이 없기 때문에 지나칠 수 있으니 조심하세요 강남 토끼정의 내부 인테리어.",
    ]
    
    kss.split_sentences(texts)
    # [['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요', '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다'],
    # ['강남역 맛집 토끼정의 외부 모습.', '강남 토끼정은 4층 건물 독채로 이루어져 있습니다.'],
    # ['역시 토끼정 본 점 답죠?ㅎㅅㅎ', '건물은 크지만 간판이 없기 때문에 지나칠 수 있으니 조심하세요', '강남 토끼정의 내부 인테리어.']]
Performance Analysis

1) Test Commands

You can reproduce this experiment using the source code and datasets in the ./bench/ directory; the test code was copied from here.
Note that the Baseline is regex-based segmentation such as re.split(r"(?<=[.!?])\s", text).

| Name | Command (in root directory) |
|---|---|
| Baseline | `python3 ./bench/test_baseline.py ./bench/testset/*.txt` |
| Kiwi | `python3 ./bench/test_kiwi.py ./bench/testset/*.txt` |
| Koalanlp | `python3 ./bench/test_koalanlp.py ./bench/testset/*.txt --backend=OKT/HNN/KMR/RHINO/EUNJEON/ARIRANG/KAMA` |
| Kss (ours) | `python3 ./bench/test_kss.py ./bench/testset/*.txt --backend=mecab/pecab` |
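As a minimal illustration of that baseline (my own sketch, not the benchmark script itself): the regex only splits on whitespace that follows sentence-final punctuation, so Korean sentences that end without any punctuation are never separated.

```python
import re

def baseline_split(text):
    # Split on whitespace that follows '.', '!' or '?' — the regex baseline above.
    return re.split(r"(?<=[.!?])\s", text)

# Works when sentences end with punctuation:
print(baseline_split("반갑습니다. 저는 철수예요! 당신은요?"))
# → ['반갑습니다.', '저는 철수예요!', '당신은요?']

# A sentence boundary without punctuation is missed entirely:
print(baseline_split("분위기도 좋고 음식도 맛있었어요 다만 사람이 많았어요"))
# → one unsplit string
```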

2) Evaluation datasets:

I tested with the following six evaluation datasets. Thanks to Minchul Lee for creating various sentence segmentation datasets.

| Name | Description | Number of sentences | Creator |
|---|---|---|---|
| blogs_lee | Dataset for testing blog-style text segmentation | 170 | Minchul Lee |
| blogs_ko | Dataset for testing blog-style text segmentation, harder than Lee's blog dataset | 336 | Hyunwoong Ko |
| tweets | Dataset for testing Twitter-style text segmentation | 178 | Minchul Lee |
| nested | Dataset for testing segmentation of texts containing parentheses and quotation marks | 91 | Minchul Lee |
| v_ending | Dataset for testing difficult eomi (ending) segmentation; contains various dialect sentences | 30 | Minchul Lee |
| sample | An example used in README.md (강남 토끼정) | 41 | Isaac, modified by Hyunwoong Ko |

Note that I modified the labels of two sentences in sample.txt, which was originally made by Isaac.
In fact, '사실 전 고기를 안 먹어서 무슨 맛인지 모르겠지만..' and '(물론 전 안 먹었지만' are adverbial clauses (부사절), not independent sentences (문장), so I corrected the labels of those two sentences.


3) Sentence segmentation performance (Quantitative Analysis)

The following table shows segmentation performance based on exact match.
Kss performed best in most cases, and Kiwi also performed well. Both the baseline and Koalanlp performed poorly.

| Name | Library version | Backend | blogs_lee | blogs_k...
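Exact match here is a strict criterion: a prediction only counts when the predicted segmentation is identical to the gold segmentation, every boundary included. One plausible formulation of such a scorer (my own illustration, not the actual ./bench/ code) is:

```python
def em_score(predictions, references):
    # Fraction of test cases whose predicted sentence list matches the
    # gold sentence list exactly (every boundary identical).
    assert len(predictions) == len(references)
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return hits / len(references)

preds = [
    ["안녕하세요.", "반갑습니다."],                      # exact match
    ["분위기도 좋고 음식도 맛있었어요 다만 비쌌어요"],     # a missed boundary
]
golds = [
    ["안녕하세요.", "반갑습니다."],
    ["분위기도 좋고 음식도 맛있었어요", "다만 비쌌어요"],
]
print(em_score(preds, golds))  # → 0.5
```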


v3.7.3

29 Nov 22:43
  • Fix split_chunks related bug #53
  • Modify EF/ETN rule
    • before: '그러함 에 있어서' => ['그러함', '에 있어서']
    • after: '그러함 에 있어서' => ['그러함 에 있어서']

v3.7.1

29 Nov 07:48
  • Hotfix installation related bug #52

v3.7.0

28 Nov 13:16
  • Fix emoji-related bug #51
  • Modify ETN (명사형 전성어미, noun-form transforming ending) related rules #50
    • You can now split ETN+XSV (명사형 전성어미 + 동사 파생 접미사, verb-derivational suffix) and ETN+XSA (명사형 전성어미 + 형용사 파생 접미사, adjective-derivational suffix).
  • Add an auto option for the disable_gc parameter.
  • Modify the unicode-related table for the none backend
    • example: '가나다라 이다 <emoji> 그러나'
    • before: ['가나다라 이다', '<emoji> 그러나']
    • after: ['가나다라 이다 <emoji>', '그러나']
  • Add kss.__version__ for easy version checking.

v3.6.4

29 Sep 08:14
  • Minor fixes
    • add SY for morpheme segmentation
    • remove '해' for morpheme segmentation

v3.6.2

08 Sep 01:06

Patch for better emoji processing

  1. Add additional unicode characters to the dict to preserve original user input

    • kss.split_sentences('첫 번째는 ❤️❤️하트입니다. 두 번째는 😊😊웃는얼굴입니다. 세 번째는 👍👍엄지입니다.')
    • before: ['첫 번째는 ♥♥하트입니다.', '두 번째는 😊😊웃는얼굴입니다.', '세 번째는 👍👍엄지입니다.']
    • after: ['첫 번째는 ❤️❤️하트입니다.', '두 번째는 😊😊웃는얼굴입니다.', '세 번째는 👍👍엄지입니다.']
  2. better emoji splitting

    • kss.split_sentences('안녕하세요 ❤️❤️ 반갑습니다')
    • before: ['안녕하세요', '❤️❤️ 반갑습니다']
    • after: ['안녕하세요 ❤️❤️', '반갑습니다']

v3.5.6

28 Aug 19:22