Replies: 1 comment
I'm going to jump in here because the questions you're asking touch on issues I'm interested in as well.
I believe the correct approach here is to ensemble the models. Gensim has an LDA ensemble model that works pretty well. I wrote up some findings about LDA topic model instability. Of course that was LDA, but the concepts are the same, and I routinely compare one model to another using a similarity matrix to get quick, useful feedback on how close the models are.
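To make that comparison concrete, here is a minimal sketch in plain NumPy, with toy 2-D vectors standing in for the topic representations of two runs (the real ones would come from your fitted models):

```python
import numpy as np

def topic_similarity_matrix(topics_a, topics_b):
    """Cosine similarity between every topic in model A and every topic in model B."""
    a = np.asarray(topics_a, dtype=float)
    b = np.asarray(topics_b, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy vectors standing in for topic embeddings from two separate runs
run_a = [[1.0, 0.0], [0.0, 1.0]]
run_b = [[1.0, 0.0], [0.7, 0.7]]
sim = topic_similarity_matrix(run_a, run_b)

# A quick stability read: how well does each topic in A match its best counterpart in B?
best_match = sim.max(axis=1)
print(best_match)
```

A row of high `best_match` values means model B reproduces model A's topics closely; low values flag topics that did not survive the re-run.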
:) Actually, I've found UMAP-based pipelines to be somewhat more stable than LDA (I think), but UMAP itself is still quite unstable from what I can tell. Compounding the analysis problem, each UMAP reduction needs different HDBSCAN parameters to achieve similar results. So if I have two UMAP models, I need two different tunings for HDBSCAN. I've created a tool to make this process relatively painless, but it still has to be factored in.

Making things even more difficult, not only will the clustering differ across UMAP models - importantly, two HDBSCAN clusterings from two different UMAP models may not even match in how many clusters they discover. In one model you may be able to separate out 6 clusters, in another 5 or 7.

I should step back for a moment. BERTopic's topic reduction method uses the cosine similarity of topic clusters to determine the merge strategy, which means it operates far removed from the embeddings, let alone the UMAP reduction of them. Everything I'm talking about here is based only on the embeddings -> UMAP -> HDBSCAN pipeline. Any given UMAP and HDBSCAN combination will produce a number of 'natural' clusterings; in other words, a given UMAP/HDBSCAN model of one set of embeddings may or may not 'naturally' produce a given number of clusters.
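As a toy illustration of that mismatch (hypothetical label arrays standing in for two HDBSCAN runs over different UMAP reductions; `-1` is HDBSCAN's noise label), you can quantify both the cluster-count gap and a simple pairwise agreement:

```python
import numpy as np

def n_clusters(labels):
    """Clusters found by a run, ignoring HDBSCAN's noise label (-1)."""
    return len(set(labels) - {-1})

def pair_agreement(l1, l2):
    """Fraction of document pairs the two clusterings treat the same way
    (both together or both apart), skipping pairs that involve noise."""
    l1, l2 = np.asarray(l1), np.asarray(l2)
    same, total = 0, 0
    n = len(l1)
    for i in range(n):
        for j in range(i + 1, n):
            if -1 in (l1[i], l1[j], l2[i], l2[j]):
                continue
            total += 1
            same += (l1[i] == l1[j]) == (l2[i] == l2[j])
    return same / total

# Hypothetical labelings of the same 10 documents from two different UMAP models
run1 = [0, 0, 1, 1, 2, -1, 3, 3, 4, 5]
run2 = [0, 0, 0, 1, 1, -1, 2, 2, 3, 3]
print(n_clusters(run1), n_clusters(run2))  # 6 vs. 4 clusters from the same documents
print(pair_agreement(run1, run2))
```

Pairwise agreement has the nice property of not needing the cluster labels themselves to line up across runs, which they never do.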
I've looked at this and, after a bit of experimenting, came to consider it almost orthogonal to using embeddings in the first place. I took clusters, compared their centroids to documents, and then compared the results to the HDBSCAN-created clusters; the results were very divergent. This is unsurprising, since the cluster shapes HDBSCAN identifies are asymmetrical, and just looking at cosine similarity will give you very different results.
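Here is a small contrived sketch of why centroid assignment diverges from density-based labels: with an elongated cluster next to a compact one, the nearest-centroid rule pulls the tail of the stretched cluster into its neighbor (Euclidean distance in this toy; the same failure mode applies to cosine similarity against centroids):

```python
import numpy as np

# Toy stand-in for an HDBSCAN result: one elongated cluster (0) and one compact one (1)
points = np.array([[0, 0], [2, 0], [4, 0], [6, 0], [8, 0], [10, 0],   # cluster 0, stretched
                   [11.5, 0], [12, 0], [12.5, 0]], dtype=float)        # cluster 1, compact
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

centroids = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
# Reassign each point to its nearest centroid
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
reassigned = dists.argmin(axis=1)

agreement = (reassigned == labels).mean()
print(agreement)  # < 1.0: the tip of the elongated cluster gets pulled into cluster 1
```

The point at (10, 0) belongs to the stretched cluster by density but sits closer to the compact cluster's centroid, so any centroid-based assignment disagrees with HDBSCAN there.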
I agree with you that fixing a random state for UMAP is a kludge, because all it does is lock in the priors of a given run of UMAP. It says nothing about the underlying validity of the relationships derived from the embeddings. From my understanding, the way to resolve this with UMAP is to run multiple UMAP models and then rationalize or merge them - ensembling. I've played with this a bit and it doesn't seem unworkable, although I've only just started.

Overall, one of my strong feelings after working with BERTopic is that we must understand that the 'topic model' is a derivative of the clustering. Everything I'm talking about is the upstream clustering process, which is the basis for creating a topic model. Once we get into c-TF-IDF, and certainly reduce_topics, we are taking considerable steps away from the embeddings. I'm not sure I'm articulating the difference between the clustering and the topic model clearly, but I have taken a longer stab at it in #582.

I hope this is helpful and not confusing or off-topic. It sounds like your interests are similar to mine and I couldn't help chiming in. There is a lot here, so don't hesitate to ask questions if you have any.
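One standard way to merge multiple clusterings (this is a generic co-association sketch, not anything built into BERTopic) is to count, for every pair of documents, the fraction of runs in which they land in the same cluster; pairs that co-cluster in every run form the stable core:

```python
import numpy as np

def coassociation(label_runs):
    """Fraction of runs in which each pair of documents shares a cluster.
    Noise assignments (-1) never count as co-clustering."""
    runs = np.asarray(label_runs)
    n = runs.shape[1]
    C = np.zeros((n, n))
    for labels in runs:
        C += (labels[:, None] == labels[None, :]) & (labels[:, None] != -1)
    return C / len(runs)

# Three hypothetical clusterings of 6 documents from different UMAP seeds;
# cluster ids are arbitrary per run, which is exactly why we compare pairs
runs = [[0, 0, 1, 1, 2, 2],
        [0, 0, 0, 1, 2, 2],
        [1, 1, 2, 2, 0, 0]]
C = coassociation(runs)
print(C[0, 1], C[2, 3], C[1, 2])  # 1.0, 0.666..., 0.333...
```

Thresholding `C` (or clustering it again) then yields a consensus partition that no single UMAP seed could give you.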
Hi Maarten :-)
I am working on using BERTopic to create semantically interpretable groups that will later be used in evaluating LLMs, and it is all working really great! So of course, thanks for an amazing tool.
Across many BERTopic instances fitted on the same data, many of the same clearly semantically related topics recur, which is great. However, I am looking for a way to use only the topics that are most "robust" across many BERTopic instances. The randomness of which topics get created causes some issues for the generalisation of the workflow.
My idea was to look at the cosine_similarity between topic_embeddings_, and then decide in some way how often, or how closely clustered, the topic embeddings should be in order to be used for the final clustering of the documents. In the data I have been working with, the model creates between 12 and 25 topics, of which around 15 seem to be fairly robust (often recurring, with high cosine_similarity to each other).
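To sketch what I mean (the `0.8` threshold and the "seen in a majority of runs" rule are just placeholder choices of mine, and the toy 2-D arrays stand in for the real `topic_embeddings_` of each fitted instance):

```python
import numpy as np

def recurring_topics(embedding_runs, threshold=0.8):
    """Greedily group pooled topic embeddings whose cosine similarity
    exceeds `threshold`; keep groups that appear in a majority of runs."""
    pooled = np.vstack(embedding_runs)
    run_of = np.concatenate([np.full(len(e), i) for i, e in enumerate(embedding_runs)])
    normed = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
    sim = normed @ normed.T

    groups, assigned = [], np.zeros(len(pooled), dtype=bool)
    for i in range(len(pooled)):
        if assigned[i]:
            continue
        members = np.where((sim[i] >= threshold) & ~assigned)[0]
        assigned[members] = True
        groups.append(members)

    n_runs = len(embedding_runs)
    return [g for g in groups if len(set(run_of[g])) > n_runs / 2]

# Toy: three fitted runs; the [1, 0]-ish topic recurs in all, [0, 1] only once
runs = [np.array([[1.0, 0.0], [0.0, 1.0]]),
        np.array([[0.98, 0.05]]),
        np.array([[0.95, 0.10]])]
robust = recurring_topics(runs)
print(len(robust))  # 1 robust topic survives
```

Each surviving group's member embeddings could then be averaged to get one representative vector per robust topic.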
Let us imagine that I figure out a way to arrive at 15 topic_embeddings_ that are representative, in some way, of the 15 topics I see recurring in the BERTopic output. Is there then a way within the BERTopic framework to replace the topic_embeddings_ of a BERTopic instance with those 15, and then re-allocate the documents to the 15 topics according to the ways of BERTopic?
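To make my question concrete, here is the kind of re-allocation I imagine, done outside BERTopic's internals: nearest topic by cosine similarity between document embeddings and my 15 chosen topic embeddings (toy arrays below; I realise representations like c-TF-IDF would then need recomputing from the new assignment):

```python
import numpy as np

def assign_to_topics(doc_embeddings, topic_embeddings):
    """Assign each document to the fixed topic embedding it is most
    cosine-similar to - a stand-in for re-allocation outside BERTopic."""
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    t = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    return (d @ t.T).argmax(axis=1)

# Toy document and topic embeddings (real ones would be high-dimensional)
docs = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]])
topics = np.array([[1.0, 0.0], [0.0, 1.0]])
assigned = assign_to_topics(docs, topics)
print(assigned)  # [0 1 0]
```

My question is whether something equivalent exists inside the framework, so the rest of the BERTopic pipeline stays consistent with the new assignment.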
I imagine I am probably breaking a lot of assumptions and technicalities of the model by trying to do this. Maybe it is not even possible. Or maybe it is very simple and I just cannot see it. But I wanted to explore whether I could create a workflow that produces more robust topics for later analysis without fixing the random_state of UMAP, because that would not actually achieve my goal.
I hope my idea makes sense. Else, please ask, and then I will try to further explain :-)
Thanks!