-
Hi! I have a set of documents in many different languages (including non-Western ones), which require different tokenization approaches and stop-word lists. All of my documents are already tagged by language, so I was thinking I could tokenize each document according to its language beforehand and store the result in a new column, representing each document as its tokens joined by spaces. Then, I could pass these processed docs into update_topics after fit_transform, similar to this:
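Something along these lines, as a rough sketch (`raw_docs`, `tokenized_docs`, and the whitespace-splitting vectorizer are just placeholders for my actual data and setup):

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# raw_docs: the original documents (used for the embeddings)
# tokenized_docs: the same documents, pre-tokenized per language and
#                 joined back together with spaces
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(raw_docs)

# Split only on whitespace so the vectorizer respects my pre-tokenization
vectorizer_model = CountVectorizer(analyzer=str.split)
topic_model.update_topics(tokenized_docs, vectorizer_model=vectorizer_model)
```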
However, this would mean using different documents in update_topics than the ones fit_transform was run on. Any thoughts or other ideas would be very appreciated :)
Replies: 2 comments 2 replies
-
Definitely not a bad idea, quite creative actually! The only thing that I'm worried about is that the topic embeddings would not be optimal due to the problems with tokenization. Other than that, definitely worth a try.
-
Lingua + Unicode normalization + custom tokenizers for the languages listed here, with NLTK (or spaCy) as a fallback, seems to work well for me 👍 I then stored the result in a new column like I mentioned above and passed it to update_topics.
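For reference, a stripped-down version of the preprocessing I ended up with (the `CUSTOM_TOKENIZERS` mapping and the `preprocess` helper are just illustrative names, NFKC is simply the normalization form I picked, and the NLTK fallback assumes its tokenizer data has been downloaded):

```python
import unicodedata

from lingua import LanguageDetectorBuilder
from nltk.tokenize import word_tokenize  # NLTK tokenizer data must be downloaded once

# Illustrative mapping of language name -> custom tokenizer callable;
# anything not listed falls back to NLTK's word_tokenize.
CUSTOM_TOKENIZERS = {
    # "JAPANESE": my_japanese_tokenizer,
    # "CHINESE": my_chinese_tokenizer,
}

detector = LanguageDetectorBuilder.from_all_languages().build()

def preprocess(doc: str) -> str:
    # Unicode normalization first, then language-aware tokenization
    text = unicodedata.normalize("NFKC", doc)
    language = detector.detect_language_of(text)
    name = language.name if language is not None else None
    tokenize = CUSTOM_TOKENIZERS.get(name, word_tokenize)
    return " ".join(tokenize(text))

# df["tokenized"] = df["text"].apply(preprocess)
# topic_model.update_topics(df["tokenized"].tolist(), vectorizer_model=vectorizer_model)
```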