reduce_topics removes documents - is there a better approach? #743
-
When I use reduce_topics I was expecting that BERTopic would try to fit the documents I have into a reduced number of topics. It may do that to a certain extent, but it seems mostly to discard topics (re-assigning their documents to topic -1). Consequently, where a model with a starting topic count of 1072 had retained 53% of the documents (i.e. 47% were relegated to -1), reducing to 200 topics retained only 47.5%, reducing to 100 topics only 41.6%, and all the way down at 25 topics only 25% of the documents. Is there a way to reduce topics and automatically reclassify, rather than discard, the documents? Is this where merge_topics should be used?
-
Thank you for that; I have been using the number-of-topics approach and will try the "auto" approach next. Am I right in understanding that if a topic cannot be merged, it will remain as a distinct original topic and not get moved to the outliers?
-
When you use `.reduce_topics`, there are two ways of reducing the topics. First, by setting an integer as the number of topics (i.e., `.reduce_topics(docs, nr_topics=10)`). It then iteratively tries to merge the least frequent topic with its most similar topic. If it cannot find a topic that is similar enough, it will be merged with the outliers instead. To some extent, this will prevent two dissimilar topics from being merged. Second, by setting the parameter to "auto" (i.e., `.reduce_topics(docs, nr_topics="auto")`). Doing so will run an HDBSCAN instance on the non-outlier topics in an attempt to merge those topics. This will only merge clusters that are similar to one another and do no n…
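The integer-based reduction described above (repeatedly merging the least frequent topic into its most similar topic, or sending it to the outliers when nothing is similar enough) can be illustrated with a toy, self-contained sketch. This is not BERTopic's actual implementation: the cosine similarity on per-topic vectors, the `min_sim` threshold, and the dict-based bookkeeping are all assumptions made purely for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def reduce_topics_sketch(topic_vectors, topic_sizes, nr_topics, min_sim=0.5):
    """Toy version of integer-based topic reduction.

    Repeatedly merge the least frequent topic into its most similar
    remaining topic; if no candidate clears `min_sim`, send it to the
    outlier topic (-1) instead -- which is why reducing topics can
    still grow the -1 bucket.

    topic_vectors: {topic_id: embedding}, topic_sizes: {topic_id: count}.
    Returns a mapping {old_topic_id: new_topic_id}.
    """
    mapping = {t: t for t in topic_vectors}
    vectors = dict(topic_vectors)
    sizes = dict(topic_sizes)
    while len(vectors) > nr_topics:
        smallest = min(vectors, key=lambda t: sizes[t])
        others = [t for t in vectors if t != smallest]
        # Most similar remaining topic to the one being removed.
        best = max(others, key=lambda t: cosine(vectors[smallest], vectors[t]))
        if cosine(vectors[smallest], vectors[best]) >= min_sim:
            sizes[best] += sizes[smallest]
            target = best
        else:
            target = -1  # not similar enough to anything: outliers
        for old, new in mapping.items():
            if new == smallest:
                mapping[old] = target
        del vectors[smallest], sizes[smallest]
    return mapping
```

With the real library the call is simply `topic_model.reduce_topics(docs, nr_topics=200)`; the sketch is only meant to show why some documents can still end up in topic -1 after a reduction.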