reduce_topics removes documents - is there a better approach? #743

Answered by MaartenGr
ChrisPalmerNZ asked this question in Q&A

There are two ways to reduce topics with .reduce_topics. The first is to set nr_topics to an integer (e.g., .reduce_topics(docs, nr_topics=10)). The model then iteratively tries to merge the least frequent topic into its most similar topic. If it cannot find a topic that is similar enough, the least frequent topic is merged into the outliers instead, which to some extent prevents two dissimilar topics from being merged. The second is to set nr_topics to "auto" (e.g., .reduce_topics(docs, nr_topics="auto")). This runs an HDBSCAN instance on the non-outlier topics in an attempt to merge them. It will only merge clusters that are similar to one another and do no n…
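
To make the integer case more concrete, here is a minimal pure-Python sketch of the merging loop as described above. It is not the library's actual implementation: the function name `reduce_topics_sketch`, the `min_similarity` threshold and its default value, and the use of cosine similarity on fixed per-topic vectors are all assumptions made for illustration. It also does not recompute a topic's vector after a merge, which a real implementation would likely do.

```python
import math
from collections import Counter

OUTLIER = -1  # label used here for outlier documents (assumption for this sketch)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reduce_topics_sketch(labels, topic_vectors, nr_topics, min_similarity=0.9):
    """Hypothetical helper: merge the least frequent topic until nr_topics remain.

    labels        -- one topic id per document (OUTLIER for outliers)
    topic_vectors -- dict mapping topic id -> vector representing that topic
    min_similarity -- illustrative threshold, not a documented library value
    """
    labels = list(labels)
    vectors = dict(topic_vectors)

    while len(vectors) > nr_topics:
        sizes = Counter(label for label in labels if label != OUTLIER)
        # Pick the least frequent remaining topic (ties broken by topic id).
        smallest = min(vectors, key=lambda t: (sizes[t], t))
        others = [t for t in vectors if t != smallest]
        if not others:
            break

        # Find the topic most similar to the least frequent one.
        best = max(others, key=lambda t: cosine(vectors[smallest], vectors[t]))
        similar_enough = cosine(vectors[smallest], vectors[best]) >= min_similarity

        # Merge into the best match, or into the outliers if nothing is close.
        target = best if similar_enough else OUTLIER
        labels = [target if label == smallest else label for label in labels]
        del vectors[smallest]

    return labels


doc_labels = [0, 0, 0, 1, 1, 2, 3]
vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [1.0, 0.05]}
print(reduce_topics_sketch(doc_labels, vectors, nr_topics=2))
# → [0, 0, 0, 1, 1, -1, 0]
```

In this toy run, topic 2 has no sufficiently similar neighbour, so its document ends up as an outlier, while topic 3 is close to topic 0 and is merged into it. That is the mechanism by which documents can "disappear" into the outlier class when reducing to a fixed number of topics.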

Answer selected by ChrisPalmerNZ