-
Hi! I have a set of documents in many different languages (including non-Western ones), which require different tokenization approaches and stop-word lists. All of my documents are already tagged by language, so I was thinking I could tokenize each document according to its language beforehand and store the result in a new column, representing each document as its tokens joined by spaces. Then, I could pass these processed docs into update_topics after fit_transform, similar to this:
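Something along these lines, as a rough sketch (`raw_docs`, `tokenized_docs`, and the whitespace-splitting vectorizer are just placeholders for my actual data and setup):

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# raw_docs: the original documents (used for the embeddings)
# tokenized_docs: the same documents, pre-tokenized per language and
#                 joined back together with spaces
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(raw_docs)

# Split only on whitespace so the vectorizer respects my pre-tokenization
vectorizer_model = CountVectorizer(analyzer=str.split)
topic_model.update_topics(tokenized_docs, vectorizer_model=vectorizer_model)
```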
However, this would mean using different documents in update_topics than the ones fit_transform was run on. Any thoughts or other ideas would be very appreciated :)
Replies: 2 comments 2 replies
-
Definitely not a bad idea, quite creative actually! The only thing that I'm worried about is that the topic embeddings would not be optimal due to the problems with tokenization. Other than that, definitely worth a try.
-
Lingua + Unicode normalization + custom tokenizers for the languages listed here, with NLTK (or spaCy) as a fallback, seems to work well for me 👍 I then stored the result in a new column like I mentioned above and passed it to update_topics.
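For reference, a stripped-down version of the preprocessing I ended up with (the `CUSTOM_TOKENIZERS` mapping and the `preprocess` helper are just illustrative names, NFKC is simply the normalization form I picked, and the NLTK fallback assumes its tokenizer data has been downloaded):

```python
import unicodedata

from lingua import LanguageDetectorBuilder
from nltk.tokenize import word_tokenize  # NLTK tokenizer data must be downloaded once

# Illustrative mapping of language name -> custom tokenizer callable;
# anything not listed falls back to NLTK's word_tokenize.
CUSTOM_TOKENIZERS = {
    # "JAPANESE": my_japanese_tokenizer,
    # "CHINESE": my_chinese_tokenizer,
}

detector = LanguageDetectorBuilder.from_all_languages().build()

def preprocess(doc: str) -> str:
    # Unicode normalization first, then language-aware tokenization
    text = unicodedata.normalize("NFKC", doc)
    language = detector.detect_language_of(text)
    name = language.name if language is not None else None
    tokenize = CUSTOM_TOKENIZERS.get(name, word_tokenize)
    return " ".join(tokenize(text))

# df["tokenized"] = df["text"].apply(preprocess)
# topic_model.update_topics(df["tokenized"].tolist(), vectorizer_model=vectorizer_model)
```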