newmm-tokenizer

A standalone, dictionary-based Maximum Matching + Thai Character Cluster (newmm) tokenizer, extracted from PyThaiNLP.

Objectives

This repository was created to reduce the overall size of the original PyThaiNLP tokenizer module. Its main objective is to segment Thai sentences into a list of words.
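To give a rough feel for what dictionary-based matching means, here is a minimal sketch of a greedy longest-match segmenter. It is a simplified relative of the approach, not the newmm algorithm itself: it ignores Thai Character Cluster boundaries and always takes the longest dictionary word at each position. The function name `longest_match_tokenize` and the toy dictionary are hypothetical and are not part of this package.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match segmentation (illustrative only, not newmm)."""
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink one character at a time.
        for end in range(min(len(text), i + max_len), i, -1):
            if text[i:end] in dictionary:
                match = text[i:end]
                break
        if match is None:
            # No dictionary word starts here: emit a single character.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary; the real tokenizer uses a full Thai word list.
toy_dictionary = {"พระ", "พระเอก", "หรือ", "พระรอง", "ดี"}
toy_tokens = longest_match_tokenize("พระเอกหรือพระรองดี", toy_dictionary)
print(toy_tokens)
```

Note that the shorter entry "พระ" is never chosen here because a longer dictionary word starts at the same position each time.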

Supports

The module supports Python 3.7+, following the original PyThaiNLP repository.

Installation

pip install newmm-tokenizer

How to Use

from newmm_tokenizer.tokenizer import word_tokenize

text = 'เป็นเรื่องแรกที่ร้องไห้ตั้งแต่ ep 1 แล้วก็เป็นเรื่องแรกที่เลือกไม่ได้ว่าจะเชียร์พระเอกหรือพระรองดี 19...'
words = word_tokenize(text)

print(words)
# ['เป็นเรื่อง', 'แรก', 'ที่', 'ร้องไห้', 'ตั้งแต่', ' ', 'ep', ' ', '1', ' ', 'แล้วก็', 'เป็นเรื่อง', 'แรก', 'ที่', 'เลือกไม่ได้', 'ว่า', 'จะ', 'เชียร์', 'พระเอก', 'หรือ', 'พระรอง', 'ดี', ' ', '19', '...']
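As the output above shows, spaces come back as tokens of their own. If only the words are wanted, a small post-processing step can drop them. The sketch below works on a hard-coded slice of the output shown above, so it runs without the package installed; the helper name `drop_whitespace_tokens` is hypothetical and not part of `newmm_tokenizer`.

```python
def drop_whitespace_tokens(tokens):
    """Return the tokens that contain at least one non-whitespace character."""
    return [token for token in tokens if token.strip()]


# A slice of the token list printed in the example above.
sample_tokens = ['ตั้งแต่', ' ', 'ep', ' ', '1', ' ', 'แล้วก็']
cleaned_tokens = drop_whitespace_tokens(sample_tokens)
print(cleaned_tokens)
```

In real use, the list returned by `word_tokenize(text)` would be passed in place of `sample_tokens`.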

LICENSE

Please see the license of the original PyThaiNLP project.
