-
Thanks for the extensive description! Fortunately, due to the modularity of BERTopic, you can indeed use a custom tokenizer that works for your specific language. In this case, you would have to pick a tokenizer built for Thai. I am not familiar with the language, but a quick Google search gives me the `thai_tokenizer` package, which might work well. To use it in BERTopic, you would have to run something like this:

```python
from bertopic import BERTopic
from thai_tokenizer import Tokenizer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Multilingual embedding model
embedding_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Thai tokenizer; CountVectorizer expects a callable that returns a
# list of tokens, so wrap the tokenizer's spaced-string output in a split
thai_tokenizer = Tokenizer()
vectorizer_model = CountVectorizer(
    tokenizer=lambda doc: thai_tokenizer(doc).split(),
    token_pattern=None,
)

# BERTopic with the multilingual embeddings and the Thai-aware vectorizer
topic_model = BERTopic(embedding_model=embedding_model, vectorizer_model=vectorizer_model)
```

I haven't tried that tokenizer myself, but if you follow the links above you should find examples of how to use a custom tokenizer.
-
I am currently using BERTopic for topic modeling on Thai-language documents. However, I have encountered an issue with word segmentation, as shown by the output of `topic_model.get_topic_info().head(10)`: the words are not segmented correctly, which hurts the quality of the topic modeling.

What I would expect instead:
- Each entry in the "Representation" column would list properly segmented Thai words: complete, meaningful words rather than fragmented or incorrectly split ones.
- The "Name" column, which currently contains fragmented words, would display more coherent topic names based on correctly segmented words, making them more understandable and representative of the topics.
- The topics themselves would be more distinct and meaningful: instead of a mix of fragments, each topic would represent a clear theme or subject matter, making the results easier to interpret.
Could you please advise on how to improve the Thai word segmentation in BERTopic? Are there any specific settings or preprocessing steps recommended for handling Thai text? Additionally, is there any way to integrate a custom Thai tokenizer into the BERTopic workflow?
Your guidance on this matter would be greatly appreciated, as accurate word segmentation is crucial for effective topic modeling in Thai.
Thank you for your time and assistance.