
What exactly happens when you feed the model long documents? #822

Answered by MaartenGr
reouvenzana asked this question in Q&A

It depends on the embedding model that you are using in BERTopic. If you are using the default model, which is a sentence-transformer model, then the document gets truncated to a fixed number of tokens (the model's maximum sequence length). This means that everything after that limit is ignored when the embedding is computed.
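The effect of truncation can be sketched as follows. This is a naive illustration using whitespace tokens; real sentence-transformer models count subword tokens, and the actual limit (often 256 or 512) is exposed as the model's `max_seq_length` attribute:

```python
def truncate(document: str, max_tokens: int = 256) -> str:
    # Naive whitespace tokenization for illustration only; sentence-transformer
    # models truncate on subword tokens up to model.max_seq_length.
    tokens = document.split()
    return " ".join(tokens[:max_tokens])

# A 1000-token document: only the first 256 tokens reach the embedding model.
long_doc = " ".join(f"word{i}" for i in range(1000))
print(len(truncate(long_doc).split()))  # 256 — the remaining 744 tokens are ignored
```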

When documents are long and also contain multiple topics, it is advised to split them up into either paragraphs or sentences. If the computational cost is too high, there are a few tricks you can use. First, you can train the model on a subset of the data and predict the topics for every document outside of that subset, which circumvents having to train on the entire dataset. Second, you can use online topic modeling to train the model incrementally on batches of documents.

Answer selected by reouvenzana