Replies: 2 comments 1 reply
-
It depends, but let me make sure I understood this correctly: do you mean that you have 500 sentences or 500 tokens that were generated? If you have 500 sentences for a single document, then it would definitely be worthwhile to split your document up into sentences and treat them as independent documents. If you have 500 tokens, then it depends on which embedding model you choose; the tokenizer for one embedding model might differ from another's. Having said that, the …
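For example, a rough sketch of what I mean by splitting into sentences and treating them as independent documents (the sentence splitter and embedding model here are only illustrative, not necessarily what you are using):

```python
import nltk
from nltk.tokenize import sent_tokenize
from bertopic import BERTopic

nltk.download("punkt", quiet=True)  # sentence tokenizer data

# Placeholder documents; replace with your own corpus.
docs = ["First long document ...", "Second long document ..."]

# Split every document into sentences, remembering the source document.
sentences, doc_ids = [], []
for i, doc in enumerate(docs):
    for sent in sent_tokenize(doc):
        sentences.append(sent)
        doc_ids.append(i)

# Fit BERTopic on the sentences instead of the full documents.
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics, probs = topic_model.fit_transform(sentences)
```

The `doc_ids` list then lets you map the sentence-level topics back to the original documents.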
-
Hi, as I understand it, you first separate your data set into the subsets "good, bad and ugly", and only then compute the main topics in each class? If so, I don't see the need for that; I would first calculate the topics and then do the "good, bad and ugly" split based on the topics. I would also recommend against tokenizing the sentences: the topics are calculated with a language model over complete sentences.
Andreas
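To make that concrete, a minimal sketch of what I mean: topics first on the complete documents, then the "good, bad and ugly" split per topic. The sentiment thresholds are arbitrary placeholders, and `vaderSentiment` is just one way to run VADER:

```python
import pandas as pd
from bertopic import BERTopic
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Placeholder corpus; one entry per complete document.
docs = ["First document ...", "Second document ..."]

# 1) Calculate the topics first, on the complete documents.
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

# 2) VADER compound sentiment per document.
analyzer = SentimentIntensityAnalyzer()
compound = [analyzer.polarity_scores(d)["compound"] for d in docs]

# 3) "Good, bad and ugly" per topic, based on mean sentiment
#    (thresholds here are placeholders, not a recommendation).
df = pd.DataFrame({"topic": topics, "compound": compound})
per_topic = df.groupby("topic")["compound"].mean()
labels = per_topic.apply(
    lambda s: "good" if s > 0.05 else ("bad" if s < -0.05 else "ugly")
)
print(labels)
```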
-
Hi everyone!! :)
I combined BERTopic with VADER to understand the main topics surrounding the AI discussion and how this perception has been changing across communities that I have already defined.
Some texts end up as 500 sentences after tokenising for the embeddings, which changes the real "balance" between texts. Given that, is it a good idea to use the mean/median/mode of the sentence embeddings to represent each text? (I can't find any paper about it.) Also, is there any proven technique that I can use?
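To make the question concrete, this is roughly what I mean by taking the mean of the sentence embeddings (the model name and sentence splitter are just what I am experimenting with, and I am not sure this is the right way to pass the embeddings to BERTopic):

```python
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

nltk.download("punkt", quiet=True)  # sentence tokenizer data

# Placeholder texts; some of them split into hundreds of sentences.
texts = ["First long text ...", "Second long text ..."]

model = SentenceTransformer("all-MiniLM-L6-v2")

# One vector per text: the mean of its sentence embeddings.
doc_embeddings = np.vstack([
    model.encode(sent_tokenize(text)).mean(axis=0)
    for text in texts
])

# Hand the pre-computed embeddings to BERTopic so every text stays a
# single document, regardless of how many sentences it contains.
topic_model = BERTopic(embedding_model=model)
topics, probs = topic_model.fit_transform(texts, embeddings=doc_embeddings)
```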
Thanks & Happy Coding !!