Can a keyword be excluded from topic labels? #917

salderma · 2023-01-09T22:00:16Z


Hi,
Might there be a means to exclude certain words from being named in topic labels? For example, I am experimenting with documents that are collected using a keyword search. I would like to exclude the keyword(s) from being used as topic labels, as it seems these words dominate several of the topic labels.

Thanks!

Answered by MaartenGr


MaartenGr · 2023-01-10T06:06:10Z

Maintainer

Sure, you can use the CountVectorizer to control how words are tokenized before they end up in the topic representation, including which words are included or excluded. More specifically, we can treat the words to exclude as stop words that should not appear in the topic labels. In other words, we can approach it like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

vectorizer_model = CountVectorizer(stop_words=a_list_of_keywords_i_want_to_exclude)

# Train a model
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)

# If you want to update an already trained model
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
```
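To see what the `stop_words` argument does before involving BERTopic at all, the vectorizer can be fitted on a couple of toy documents and its vocabulary inspected. This is only a minimal sketch: the keyword list and the documents are made up for illustration. It relies on two defaults of `CountVectorizer`: text is lowercased before stop words are removed, so the keywords should be supplied in lowercase, and a custom list replaces any built-in stop word list rather than extending it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical search keywords used to collect the documents (illustration only).
keywords_to_exclude = ["COVID", "Pandemic"]

# CountVectorizer lowercases text by default, so lowercase the keywords
# to make sure they match the tokens it produces.
stop_words = [word.lower() for word in keywords_to_exclude]
vectorizer_model = CountVectorizer(stop_words=stop_words)

# Made-up documents, only to inspect the resulting vocabulary.
toy_docs = [
    "COVID cases rose sharply during the pandemic.",
    "Hospitals reported new covid admissions.",
]
vectorizer_model.fit(toy_docs)
vocabulary = set(vectorizer_model.vocabulary_)

print("covid" in vocabulary)  # the keyword is filtered out
print("the" in vocabulary)    # common words remain, since only the custom list is applied
```

Because only the custom list is applied here, ordinary English stop words such as "the" stay in the vocabulary unless they are added to the list as well.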

1 reply

salderma Jan 13, 2023
Author

Thank you, this appears to accomplish what I'm looking for when I extract the CountVectorizer stop words to a list and append the keyword(s) used for document collection...

```python
s_words = list(CountVectorizer(stop_words='english').get_stop_words())
vectorizer_model = CountVectorizer(stop_words=s_words)
```

There is probably a more efficient way to do this, but the .get_stop_words() method returns a frozenset().
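One possibly simpler route, sketched here under assumptions rather than offered as the recommended way, is to start from scikit-learn's `ENGLISH_STOP_WORDS` constant. It is also a frozenset, but `union()` accepts any iterable, so the search keywords can be merged in one step and the result converted to a list. The keywords and documents below are hypothetical.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

# Hypothetical keywords that were used to collect the documents (illustration only).
search_keywords = ["vaccine", "vaccines"]

# ENGLISH_STOP_WORDS is a frozenset; union() returns a new frozenset,
# which is converted to a list before being passed to CountVectorizer.
stop_words = list(ENGLISH_STOP_WORDS.union(search_keywords))
vectorizer_model = CountVectorizer(stop_words=stop_words)

# Made-up documents, only to check which terms survive.
toy_docs = [
    "The vaccine trial reported mild effects.",
    "Vaccines were shipped to rural clinics.",
]
vectorizer_model.fit(toy_docs)
vocabulary = set(vectorizer_model.vocabulary_)

print(sorted(vocabulary))
```

The resulting `vectorizer_model` can then be passed to `BERTopic(vectorizer_model=vectorizer_model)` or to `update_topics` in the same way as in the answer above.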
