Replies: 2 comments 1 reply
-
It depends, but let me make sure I understood this correctly: do you mean that you have 500 sentences or 500 tokens that were generated? If you have 500 sentences for a single document, then it would definitely be worthwhile to split your document up into sentences and treat them as independent documents. If you have 500 tokens, then it depends on which embedding model you choose; the tokenizer for one embedding model might differ from another's. Having said that, the …
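For example, a rough sketch of what I mean by splitting into sentences and treating them as independent documents (the sentence splitter and embedding model here are only illustrative, not necessarily what you are using):

```python
import nltk
from nltk.tokenize import sent_tokenize
from bertopic import BERTopic

nltk.download("punkt", quiet=True)  # sentence tokenizer data

# Placeholder documents; replace with your own corpus.
docs = ["First long document ...", "Second long document ..."]

# Split every document into sentences, remembering the source document.
sentences, doc_ids = [], []
for i, doc in enumerate(docs):
    for sent in sent_tokenize(doc):
        sentences.append(sent)
        doc_ids.append(i)

# Fit BERTopic on the sentences instead of the full documents.
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics, probs = topic_model.fit_transform(sentences)
```

The `doc_ids` list then lets you map the sentence-level topics back to the original documents.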
-
Hi, as I understand it, you first separate your data set into the subsets "good, bad and ugly", and only then compute the main topics in each class? If so, I don't see the need for that; I would first calculate the topics and then do the "good, bad and ugly" split based on the topics. I would also recommend against tokenizing the sentences: the topics are calculated with a language model over complete sentences.
Andreas
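To make that concrete, a minimal sketch of what I mean: topics first on the complete documents, then the "good, bad and ugly" split per topic. The sentiment thresholds are arbitrary placeholders, and `vaderSentiment` is just one way to run VADER:

```python
import pandas as pd
from bertopic import BERTopic
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Placeholder corpus; one entry per complete document.
docs = ["First document ...", "Second document ..."]

# 1) Calculate the topics first, on the complete documents.
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

# 2) VADER compound sentiment per document.
analyzer = SentimentIntensityAnalyzer()
compound = [analyzer.polarity_scores(d)["compound"] for d in docs]

# 3) "Good, bad and ugly" per topic, based on mean sentiment
#    (thresholds here are placeholders, not a recommendation).
df = pd.DataFrame({"topic": topics, "compound": compound})
per_topic = df.groupby("topic")["compound"].mean()
labels = per_topic.apply(
    lambda s: "good" if s > 0.05 else ("bad" if s < -0.05 else "ugly")
)
print(labels)
```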
-
Hi everyone!! :)
I combined BERTopic with VADER to understand the main topics surrounding the AI discussion and how this perception has been changing across communities that I have already defined.
Some texts end up as 500 sentences after tokenising for the embeddings, which changes the real "balance" between texts. Given that, is it a good idea to use the mean/median/mode of the sentence embeddings to represent each text? (I can't find any paper about it.) Also, is there any proven technique that I can use?
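To make the question concrete, this is roughly what I mean by taking the mean of the sentence embeddings (the model name and sentence splitter are just what I am experimenting with, and I am not sure this is the right way to pass the embeddings to BERTopic):

```python
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

nltk.download("punkt", quiet=True)  # sentence tokenizer data

# Placeholder texts; some of them split into hundreds of sentences.
texts = ["First long text ...", "Second long text ..."]

model = SentenceTransformer("all-MiniLM-L6-v2")

# One vector per text: the mean of its sentence embeddings.
doc_embeddings = np.vstack([
    model.encode(sent_tokenize(text)).mean(axis=0)
    for text in texts
])

# Hand the pre-computed embeddings to BERTopic so every text stays a
# single document, regardless of how many sentences it contains.
topic_model = BERTopic(embedding_model=model)
topics, probs = topic_model.fit_transform(texts, embeddings=doc_embeddings)
```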
Thanks & Happy Coding !!