Implementing something similar to text-tiling with BERTopic #1568

Closed Answered by MaartenGr
saeedesmaili asked this question in Q&A

What I meant was updating the vectorizer so that it uses a sentence splitter, rather than a word-level tokenizer, to perform the tokenization. You would update it like so:

from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(tokenizer=sent_tokenize)
topic_model.vectorizer_model = vectorizer_model

Then use .approximate_distribution with embeddings (use_embedding_model=True) rather than c-TF-IDF to compute the similarities, since we have overwritten the internal tokenizer and sentence-level tokens no longer line up with the word-level c-TF-IDF vocabulary. Do note that this solution overwrites the vectorizer_model, which then cannot be used for other applications. Keeping a reference to the original vectorizer model might be helpful if you want to restore that functionality later.
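To make the idea concrete, here is a minimal, library-free sketch of what sentence-level tokenization followed by a similarity step over sliding windows looks like. It is not BERTopic's implementation: split_sentences is a naive regex stand-in for a real sentence splitter, embed is a toy bag-of-words vector standing in for a sentence embedding model, and the topic descriptions, function names, and window sizes are all made up for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive regex splitter; a hypothetical stand-in for a real sentence tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sliding_windows(units, window=2, stride=1):
    """Group consecutive sentences into overlapping windows."""
    if len(units) <= window:
        return [units]
    return [units[i:i + window] for i in range(0, len(units) - window + 1, stride)]


def embed(text):
    """Toy bag-of-words vector; a real setup would use an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def label_windows(text, topics, window=2, stride=1):
    """Return (window_text, best_topic) pairs for one document."""
    topic_vecs = {name: embed(desc) for name, desc in topics.items()}
    results = []
    for win in sliding_windows(split_sentences(text), window, stride):
        joined = " ".join(win)
        vec = embed(joined)
        best = max(topic_vecs, key=lambda name: cosine(vec, topic_vecs[name]))
        results.append((joined, best))
    return results


# Made-up topics and document, purely for illustration
topics = {
    "sports": "game team score player match",
    "cooking": "recipe oven bake flour sugar",
}
doc = ("The team won the match. The player had a high score. "
       "Then we bake the flour in the oven. Add sugar to the recipe.")

for window_text, topic in label_windows(doc, topics):
    print(topic, "|", window_text)
```

The point where the assigned topic changes between consecutive windows gives a rough segment boundary, which is the text-tiling-like behaviour asked about. With BERTopic itself, the sentence splitting would come from the overwritten vectorizer and the similarity from the embedding model.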

All in all, I just tested the following, which works for me:

from bertopic import BERTopic
f…

Answer selected by saeedesmaili