Here's the doc on BERTopic guided topic modeling. If you want to use BERTopic to label data that you will then use to train a supervised model, simply use the topic assignments as the labels. You might also want to check out my TopicTuner: it is designed to allow maximum control over HDBSCAN clustering and may help you get better training data (if I understand your intent correctly).
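A minimal sketch of the "use the topic assignments as labels" step. The guided BERTopic call is shown as a comment (its `seed_topic_list` parameter is BERTopic's guided-modeling API and requires `pip install bertopic`); the stand-in `topics` array, toy corpus, seed words, and classifier choice are all illustrative, not prescribed by the thread.

```python
# Sketch: turning (guided) BERTopic assignments into supervised training labels.
# The BERTopic part is shown as a comment; a hand-made `topics` array stands in
# for its output so the labeling step itself is concrete and runnable:
#
#   from bertopic import BERTopic
#   topic_model = BERTopic(seed_topic_list=[["price", "cost", "cheap"],
#                                           ["ship", "delivery", "late"]])
#   topics, probs = topic_model.fit_transform(docs)
#
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "the price was too high for what you get",
    "shipping took three weeks to arrive",
    "cost is reasonable and quality is fine",
    "my delivery was late and the box was damaged",
    "random note with no clear theme",
]
topics = np.array([0, 1, 0, 1, -1])  # -1 is BERTopic's outlier topic

# Keep only documents that received a real topic; the topic id becomes the label.
mask = topics != -1
X = TfidfVectorizer().fit_transform(docs)
clf = LogisticRegression(max_iter=1000).fit(X[mask], topics[mask])
preds = clf.predict(X)  # the outlier document now gets a prediction too
```

The outlier filter matters: BERTopic's `-1` assignments are noise points from HDBSCAN, and training on them as if they were a real class usually hurts the downstream model.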
-
I am trying to use BERTopic to build topics for use in supervised learning. I can build topics and then check whether they correlate with the target variables, but that is hit and miss. I don't see a way in the documentation to build guided topics. Is there a way to do this?
Back in the old days, I would remove stop words, lemmatize, take unigrams and bigrams (optionally TF-IDF to reduce the feature set), run a correlation against the target variables (feature reduction again possible), use the correlation coefficient as the weighting, and then run HDBSCAN for clustering. Depending on compute restrictions, one would grow the n-gram/skip-gram size until larger n-grams stopped adding value. Pushing this to a GloVe architecture wouldn't be that hard either.
This feels inelegant in a world with transformers. There are also obvious weaknesses in this approach: it is too word-dependent rather than meaning-dependent. It should be possible to do the correlation with the output from the encoder, but if there is a way to use BERTopic to do this, I would rather use something that clean than my own homebrew. Is this possible? Thanks.
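The "correlate with the encoder output" idea in sketch form: the same |r| weighting as the n-gram pipeline, but applied to embedding dimensions. A random matrix stands in for real encoder output (e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode(docs)` from the sentence-transformers package, shown as a comment); the dimension-weighting itself is my assumption about what the correlation step would look like. Note also that BERTopic accepts a `y` argument to `fit_transform` for (semi-)supervised modeling, which may be the cleaner route than homebrew weighting.

```python
# Sketch: correlate each encoder embedding dimension with the target and
# reweight dimensions before clustering -- the n-gram weighting, but on
# meaning-level features. `emb` stands in for real encoder output, e.g.:
#
#   from sentence_transformers import SentenceTransformer
#   emb = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
#
import numpy as np

rng = np.random.default_rng(0)
n_docs, dim = 40, 16
emb = rng.normal(size=(n_docs, dim))  # stand-in embeddings
# Toy target driven mostly by dimension 0, plus a little noise:
y = (emb[:, 0] + 0.1 * rng.normal(size=n_docs) > 0).astype(float)

# Pearson correlation of each embedding dimension with the target.
Ec = emb - emb.mean(axis=0)
yc = y - y.mean()
denom = np.sqrt((Ec ** 2).sum(axis=0) * (yc ** 2).sum())
r = (Ec * yc[:, None]).sum(axis=0) / np.maximum(denom, 1e-12)

emb_weighted = emb * np.abs(r)  # dimensions that track the target dominate
```

From here the weighted embeddings would go into HDBSCAN (or back into BERTopic via precomputed embeddings) exactly as in the bag-of-words version.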