Here's the doc on BERTopic guided topic modeling. If you want to use BERTopic to label data that you will then use to train a supervised model, simply use the topic assignments as the labels. You might also want to check out my TopicTuner: it is designed to allow maximum control over HDBSCAN clustering and may help you get better training data (if I understand your intent correctly).
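A minimal sketch of the "use the topic assignments as labels" step. The guided BERTopic call is shown as a comment (its `seed_topic_list` parameter is BERTopic's guided-modeling API and requires `pip install bertopic`); the stand-in `topics` array, toy corpus, seed words, and classifier choice are all illustrative, not prescribed by the thread.

```python
# Sketch: turning (guided) BERTopic assignments into supervised training labels.
# The BERTopic part is shown as a comment; a hand-made `topics` array stands in
# for its output so the labeling step itself is concrete and runnable:
#
#   from bertopic import BERTopic
#   topic_model = BERTopic(seed_topic_list=[["price", "cost", "cheap"],
#                                           ["ship", "delivery", "late"]])
#   topics, probs = topic_model.fit_transform(docs)
#
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "the price was too high for what you get",
    "shipping took three weeks to arrive",
    "cost is reasonable and quality is fine",
    "my delivery was late and the box was damaged",
    "random note with no clear theme",
]
topics = np.array([0, 1, 0, 1, -1])  # -1 is BERTopic's outlier topic

# Keep only documents that received a real topic; the topic id becomes the label.
mask = topics != -1
X = TfidfVectorizer().fit_transform(docs)
clf = LogisticRegression(max_iter=1000).fit(X[mask], topics[mask])
preds = clf.predict(X)  # the outlier document now gets a prediction too
```

The outlier filter matters: BERTopic's `-1` assignments are noise points from HDBSCAN, and training on them as if they were a real class usually hurts the downstream model.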
-
I am trying to use BERTopic to build topics for use in supervised learning. I can build topics and then check whether they correlate with the target variables, but that is hit and miss. I don't see a way in the documentation to build guided topics. Is there a way to do this?
Back in the old days, I would remove stop words, lemmatize, take unigrams and bigrams (optionally TF-IDF to reduce the feature set), run a correlation against the target variables (feature reduction again possible), use the correlation coefficient as the weighting, and then run HDBSCAN for clustering. Depending on compute restrictions, one would grow the n-gram/skip-gram size until larger n-grams stopped adding value. Pushing this to a GloVe architecture wouldn't be that hard either.
This feels inelegant in a world with transformers. There are also obvious weaknesses in this approach: it is too word-dependent rather than meaning-dependent. It should be possible to do the correlation with the output from the encoder, but if there is a way to use BERTopic to do this, I would rather use something that clean than my own homebrew. Is this possible? Thanks.
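The "correlate with the encoder output" idea in sketch form: the same |r| weighting as the n-gram pipeline, but applied to embedding dimensions. A random matrix stands in for real encoder output (e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode(docs)` from the sentence-transformers package, shown as a comment); the dimension-weighting itself is my assumption about what the correlation step would look like. Note also that BERTopic accepts a `y` argument to `fit_transform` for (semi-)supervised modeling, which may be the cleaner route than homebrew weighting.

```python
# Sketch: correlate each encoder embedding dimension with the target and
# reweight dimensions before clustering -- the n-gram weighting, but on
# meaning-level features. `emb` stands in for real encoder output, e.g.:
#
#   from sentence_transformers import SentenceTransformer
#   emb = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
#
import numpy as np

rng = np.random.default_rng(0)
n_docs, dim = 40, 16
emb = rng.normal(size=(n_docs, dim))  # stand-in embeddings
# Toy target driven mostly by dimension 0, plus a little noise:
y = (emb[:, 0] + 0.1 * rng.normal(size=n_docs) > 0).astype(float)

# Pearson correlation of each embedding dimension with the target.
Ec = emb - emb.mean(axis=0)
yc = y - y.mean()
denom = np.sqrt((Ec ** 2).sum(axis=0) * (yc ** 2).sum())
r = (Ec * yc[:, None]).sum(axis=0) / np.maximum(denom, 1e-12)

emb_weighted = emb * np.abs(r)  # dimensions that track the target dominate
```

From here the weighted embeddings would go into HDBSCAN (or back into BERTopic via precomputed embeddings) exactly as in the bag-of-words version.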