Topical Variance #2061

eschaffn · 2024-06-21T14:20:46Z

Hey, I'm wondering if there's a good way to measure a topic's variance.

By variance I mean how different the documents within a topic are from each other, and from the corpus as a whole. Doing pairwise similarity comparisons is expensive, so I'd like to avoid that. I know OCTIS has a diversity measure, but it is calculated over the entire topic model rather than per topic.

Any ideas?

MaartenGr · 2024-06-23T06:22:10Z
Maintainer

Doing pairwise similarity comparisons is expensive so I'd like to avoid that.

I don't think this is actually expensive to do with at most a couple of hundred topic embeddings. Running something like the cosine similarity on the topic embeddings with one another is quite fast and I believe does not take that much memory.
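As a rough illustration of the cost, here is a minimal sketch of the centroid-to-centroid cosine similarity, assuming the topic embeddings are available as a 2-D NumPy array. The random array and the 200 x 384 shape are stand-ins for illustration, not values from this thread.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the (n_topics, n_topics) cosine similarity matrix."""
    # Normalise each row to unit length; the clip guards against zero vectors.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # For unit vectors, the dot product is the cosine similarity.
    return unit @ unit.T

# Stand-in for real topic embeddings (assumed shape: 200 topics, 384 dims).
rng = np.random.default_rng(0)
topic_embeddings = rng.normal(size=(200, 384))

sim = cosine_similarity_matrix(topic_embeddings)
print(sim.shape, f"{sim.nbytes / 1024:.0f} KiB")
```

With a couple of hundred topics the result is a small dense matrix, so both time and memory should be negligible at that scale.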

eschaffn Jun 25, 2024
Author

Yes, this is why I'm looking for a different metric, not the cosine similarity between centroids. I want something that captures the likely amount of overlap between two nearby clusters.

I then want to merge these clusters based not only on the distance between centroids, but also on this diversity or variance factor.

MaartenGr Jun 26, 2024
Maintainer

I think one option is to look at distribution-based distance/divergence metrics such as Kullback–Leibler divergence, Mahalanobis distance, etc. Having said that, they often assume some sort of probability distribution, which the embeddings are not unless you take the cosine similarity of each point to the centroid.
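To make the Mahalanobis idea concrete, here is a minimal sketch that measures how far one cluster's centroid lies from another cluster, scaled by that cluster's own spread. The helper name `mahalanobis_to_cluster`, the ridge term, and the 5-dimensional toy data are assumptions for illustration; with real embeddings of a few hundred dimensions the covariance estimate would likely need a reduced space or stronger regularisation.

```python
import numpy as np

def mahalanobis_to_cluster(point, cluster, ridge=1e-6):
    """Mahalanobis distance from `point` to the distribution of `cluster` rows."""
    mean = cluster.mean(axis=0)
    cov = np.cov(cluster, rowvar=False)
    # A small ridge keeps the covariance invertible (an assumption of this sketch).
    cov = cov + ridge * np.eye(cov.shape[0])
    diff = point - mean
    # solve() avoids forming the explicit inverse of the covariance.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Toy data: a tight cluster and a wide cluster whose centroids are about 2 apart.
rng = np.random.default_rng(0)
tight = rng.normal(loc=0.0, scale=0.1, size=(500, 5))
wide = rng.normal(loc=0.0, scale=1.0, size=(500, 5))
wide[:, 0] += 2.0

d_to_tight = mahalanobis_to_cluster(wide.mean(axis=0), tight)
d_to_wide = mahalanobis_to_cluster(tight.mean(axis=0), wide)
print(f"wide centroid -> tight cluster: {d_to_tight:.2f}")
print(f"tight centroid -> wide cluster: {d_to_wide:.2f}")
```

The distance is asymmetric: the same centroid gap counts as far when measured against the tight cluster and as near when measured against the wide one, which is the kind of spread-aware signal a merge rule could use.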

eschaffn Jun 26, 2024
Author

Thanks, I'll look into these!

The distribution of the similarity of each point to the centroid actually sounds very close to what I'm looking for, and avoids pairwise computations.
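A minimal sketch of that idea, assuming document embeddings in a 2-D NumPy array and one integer topic label per document. The function name `centroid_similarity_stats` and the toy data are hypothetical. Each document is compared only to its own centroid, so the cost grows linearly with the number of documents.

```python
import numpy as np

def centroid_similarity_stats(doc_embeddings, labels):
    """Per-topic (mean, std) of cosine similarity between documents and their centroid."""
    stats = {}
    for topic in np.unique(labels):
        members = doc_embeddings[labels == topic]
        # Centroid taken as the plain mean of the member embeddings.
        centroid = members.mean(axis=0)
        denom = np.linalg.norm(members, axis=1) * np.linalg.norm(centroid)
        sims = (members @ centroid) / np.clip(denom, 1e-12, None)
        stats[int(topic)] = (float(sims.mean()), float(sims.std()))
    return stats

# Toy data: topic 0 is tightly packed, topic 1 is spread out.
rng = np.random.default_rng(0)
tight = np.ones(16) + 0.05 * rng.normal(size=(300, 16))
loose = np.concatenate([np.ones(8), -np.ones(8)]) + 1.0 * rng.normal(size=(300, 16))
doc_embeddings = np.vstack([tight, loose])
labels = np.array([0] * 300 + [1] * 300)

stats = centroid_similarity_stats(doc_embeddings, labels)
for topic, (mean_sim, std_sim) in stats.items():
    print(f"topic {topic}: mean similarity {mean_sim:.3f}, std {std_sim:.3f}")
```

A lower mean or a higher standard deviation suggests a more diffuse topic.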

eschaffn Jun 26, 2024
Author

If the topic embedding is just the mean of document embeddings of a cluster, this is essentially just the statistical variance then, correct?

MaartenGr Jun 29, 2024
Maintainer

Yep, that's how I would approach it: as statistical variance. There are quite a few distance metrics to try out, such as the methods shown above, so I would advise trying some of them and seeing whether they make sense for your use case.
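For reference, a short derivation of why the centroid-based variance can stand in for the pairwise comparison, assuming Euclidean distance and a topic with documents $x_1, \dots, x_n$ and centroid $\mu = \frac{1}{n}\sum_i x_i$ (notation introduced here, not from the thread):

```latex
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \lVert x_i - \mu \rVert^2,
\qquad
\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lVert x_i - x_j \rVert^2
  = \frac{2}{n}\sum_{i=1}^{n} \lVert x_i - \mu \rVert^2
  - \frac{2}{n^2}\Bigl\lVert \sum_{i=1}^{n} (x_i - \mu) \Bigr\rVert^2
  = 2\sigma^2 .
```

The last term vanishes because the deviations from the mean sum to zero, so the average pairwise squared distance is exactly twice the variance around the centroid. If the embeddings are unit-normalised, $\lVert x_i - x_j \rVert^2 = 2\,(1 - \cos(x_i, x_j))$, which ties this back to cosine similarity.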

eschaffn · 2024-07-25T16:20:07Z
Author

Decided to go with a slightly modified version of:
https://stackoverflow.com/questions/59919627/how-to-calculate-the-silhouette-score-for-each-cluster-separately-in-python
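The modification itself is not described here, so the following is only a sketch of the general approach, using scikit-learn's `silhouette_samples` and averaging the per-document scores within each cluster. The helper name `silhouette_per_cluster` and the 2-D toy data are assumptions for illustration. Note that `silhouette_samples` computes pairwise distances internally.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_per_cluster(embeddings, labels, metric="euclidean"):
    """Average the per-sample silhouette scores within each cluster."""
    scores = silhouette_samples(embeddings, labels, metric=metric)
    return {int(c): float(scores[labels == c].mean()) for c in np.unique(labels)}

# Toy data: clusters 0 and 1 overlap heavily, cluster 2 is well separated.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=1.0, size=(100, 2)),
    rng.normal(loc=(1.0, 0.0), scale=1.0, size=(100, 2)),
    rng.normal(loc=(20.0, 20.0), scale=1.0, size=(100, 2)),
])
labels = np.repeat([0, 1, 2], 100)

per_cluster = silhouette_per_cluster(embeddings, labels)
for cluster, score in per_cluster.items():
    print(f"cluster {cluster}: mean silhouette {score:.3f}")
```

Clusters with a low mean silhouette overlap with a neighbour and are natural candidates for merging.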
