reduce_topics removes documents - is there a better approach? #743
-
When I use reduce_topics I was expecting that BERTopic would try to fit the documents I have into a reduced number of topics. It may do that to a certain extent, but it seems mostly to discard topics (re-assigning their documents to topic -1). Consequently, where a model with a starting topic count of 1072 had retained 53% of the documents (i.e. 47% were relegated to -1), reducing to 200 topics retained only 47.5%, reducing to 100 topics only 41.6%, and all the way down at 25 topics only 25% of the documents. Is there a way to reduce topics and automatically reclassify, rather than discard, the documents? Is this where merge_topics should be used?
-
Thank you for that; I have been using the number-of-topics approach and will try the "auto" approach next. Am I right in understanding that if a topic cannot be merged, it will remain as a distinct original topic and not get moved to the outliers?
-
When you use `.reduce_topics`, there are two ways of reducing the topics. First, by setting an integer as the number of topics (i.e., `.reduce_topics(docs, nr_topics=10)`). It then iteratively tries to merge the least frequent topic with its most similar topic. If it cannot find a topic that is similar enough, it will be merged with the outliers instead. To some extent, this will prevent two dissimilar topics from being merged. Second, by setting the parameter to "auto" (i.e., `.reduce_topics(docs, nr_topics="auto")`). Doing so will run an HDBSCAN instance on the non-outlier topics in an attempt to merge those topics. This will only merge clusters that are similar to one another and do no n…
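The integer-based reduction described above (repeatedly merging the least frequent topic into its most similar topic, or sending it to the outliers when nothing is similar enough) can be illustrated with a toy, self-contained sketch. This is not BERTopic's actual implementation: the cosine similarity on per-topic vectors, the `min_sim` threshold, and the dict-based bookkeeping are all assumptions made purely for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def reduce_topics_sketch(topic_vectors, topic_sizes, nr_topics, min_sim=0.5):
    """Toy version of integer-based topic reduction.

    Repeatedly merge the least frequent topic into its most similar
    remaining topic; if no candidate clears `min_sim`, send it to the
    outlier topic (-1) instead -- which is why reducing topics can
    still grow the -1 bucket.

    topic_vectors: {topic_id: embedding}, topic_sizes: {topic_id: count}.
    Returns a mapping {old_topic_id: new_topic_id}.
    """
    mapping = {t: t for t in topic_vectors}
    vectors = dict(topic_vectors)
    sizes = dict(topic_sizes)
    while len(vectors) > nr_topics:
        smallest = min(vectors, key=lambda t: sizes[t])
        others = [t for t in vectors if t != smallest]
        # Most similar remaining topic to the one being removed.
        best = max(others, key=lambda t: cosine(vectors[smallest], vectors[t]))
        if cosine(vectors[smallest], vectors[best]) >= min_sim:
            sizes[best] += sizes[smallest]
            target = best
        else:
            target = -1  # not similar enough to anything: outliers
        for old, new in mapping.items():
            if new == smallest:
                mapping[old] = target
        del vectors[smallest], sizes[smallest]
    return mapping
```

With the real library the call is simply `topic_model.reduce_topics(docs, nr_topics=200)`; the sketch is only meant to show why some documents can still end up in topic -1 after a reduction.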