What exactly happens when you feed the model long documents? #822
-
Hello! I'm not familiar with how BERT encodes sentences and documents. I used BERTopic with documents of various lengths, some of which are really long (>5000 words). I was pretty happy with the topics I got. Then I tried splitting each document into sentences and passing those to the model, but the computational cost seems too high (the model has been running for ten hours as I'm writing this). So, could you please explain what I'm missing by feeding the model long documents? Are only the first n tokens considered? I should point out that I'm thinking about splitting the documents into paragraphs rather than sentences, but that will take time.
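(For reference, a minimal paragraph-splitting sketch; it assumes the documents are plain strings in which paragraphs are separated by blank lines, which is an assumption about the data, not part of the original post:)

```python
# Minimal sketch: split each document into paragraphs before topic modeling.
# Assumes paragraphs are separated by blank lines ("\n\n").
docs = ["First paragraph...\n\nSecond paragraph...", "Another document..."]

paragraphs = [
    p.strip()
    for doc in docs
    for p in doc.split("\n\n")
    if p.strip()  # drop empty fragments
]
```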
-
It depends on the embedding model that you are using in BERTopic. If you are using the default model, which is a sentence-transformer model, then the document gets truncated after a fixed number of tokens. This means that it will ignore a part of the document.

When documents are long but also contain multiple topics, it is advised to split them up into either paragraphs or sentences. If the computational cost is too high, there are a number of tricks you can use (see the sketches below):

1. Train the model on a subset of the data and predict the topics for every document outside of that subset. This allows you to circumvent having to train on the entire dataset.
2. Use online topic modeling to train the model on batches of data instead of on all the data at once.
3. Use different sub-models that might be better optimized for your use case. For example, if you have a GPU, it might be worthwhile to use cuML instead of the default models.
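To see how much of a document the embedding model actually reads, you can inspect its maximum sequence length. A minimal sketch, assuming the default English model `all-MiniLM-L6-v2` (the exact default may differ across BERTopic versions):

```python
from sentence_transformers import SentenceTransformer

# BERTopic's default English embedding model (may differ per version)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Tokens beyond this limit are silently dropped before encoding
print(embedding_model.max_seq_length)  # typically 256 word pieces
```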
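For the first trick, a sketch of fitting on a sample and predicting the rest; the 10,000-document cutoff and the `docs` placeholder are assumptions for illustration:

```python
from bertopic import BERTopic

# `docs` stands in for your full list of (split) documents
subset, remainder = docs[:10_000], docs[10_000:]  # arbitrary cutoff

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(subset)          # train on the sample
new_topics, new_probs = topic_model.transform(remainder)   # predict the rest
```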
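For the second trick, BERTopic supports online learning through `partial_fit` when its sub-models do. A sketch along the lines of the documented setup, using scikit-learn's IncrementalPCA and MiniBatchKMeans; the batch size and cluster count are assumptions you would tune:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

# Sub-models that support incremental (online) learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english")

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
)

# Feed the corpus in chunks instead of all at once
for i in range(0, len(docs), 1_000):
    topic_model.partial_fit(docs[i : i + 1_000])
```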
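For the third trick, a sketch of swapping in GPU-accelerated sub-models from RAPIDS cuML; this requires a CUDA GPU with cuML installed, and the hyperparameters shown here are illustrative assumptions:

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated drop-in replacements for the default sub-models
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```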