-
Thanks for the extensive description! Fortunately, due to the modularity of BERTopic, you can indeed use a custom tokenizer that works for your specific language. In this case, you would have to pick a tokenizer built for Thai. I am not familiar with the language, but a quick Google search gives me the `thai_tokenizer` package, which might work well. To use it in BERTopic, you would have to run something like this:

```python
from bertopic import BERTopic
from thai_tokenizer import Tokenizer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Multilingual embedding model
embedding_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Thai tokenizer; CountVectorizer expects a callable that returns a
# list of tokens, so wrap the tokenizer's spaced-string output in a split
thai_tokenizer = Tokenizer()
vectorizer_model = CountVectorizer(
    tokenizer=lambda doc: thai_tokenizer(doc).split(),
    token_pattern=None,
)

# BERTopic with the multilingual embeddings and the Thai-aware vectorizer
topic_model = BERTopic(embedding_model=embedding_model, vectorizer_model=vectorizer_model)
```

I haven't tried that tokenizer myself, but if you follow the links above you should find examples of how to use a custom tokenizer.
-
I am currently using BERTopic for topic modeling on Thai-language documents. However, I have encountered an issue with word segmentation, as shown by the output of `topic_model.get_topic_info().head(10)`: the words are not segmented correctly, which hurts the quality of the topic modeling.

What I would expect instead:
- Each entry in the "Representation" column would list properly segmented Thai words: complete, meaningful words rather than fragmented or incorrectly split ones.
- The "Name" column, which currently contains fragmented words, would display more coherent topic names based on correctly segmented words, making them more understandable and representative of the topics.
- The topics themselves would be more distinct and meaningful: instead of a mix of fragments, each topic would represent a clear theme or subject matter, making the results easier to interpret.
Could you please advise on how to improve the Thai word segmentation in BERTopic? Are there any specific settings or preprocessing steps recommended for handling Thai text? Additionally, is there any way to integrate a custom Thai tokenizer into the BERTopic workflow?
Your guidance on this matter would be greatly appreciated, as accurate word segmentation is crucial for effective topic modeling in Thai.
Thank you for your time and assistance.