What exactly happens when you feed the model long documents? #822
-
Hello! I'm not familiar with how BERT encodes sentences and documents. I used BERTopic with documents of various lengths, some of which are really long (>5000 words). I was pretty happy with the topics I got. Then I tried splitting each document into sentences and passing those to the model, but the computational cost seems too high (the model has been running for ten hours as I'm writing this). So, could you please explain what I'm missing by feeding the model long documents? Are only the first n tokens considered? I should point out that I'm thinking about splitting the documents into paragraphs rather than sentences, but that will take time.
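(For reference, a minimal paragraph-splitting sketch; it assumes the documents are plain strings in which paragraphs are separated by blank lines, which is an assumption about the data, not part of the original post:)

```python
# Minimal sketch: split each document into paragraphs before topic modeling.
# Assumes paragraphs are separated by blank lines ("\n\n").
docs = ["First paragraph...\n\nSecond paragraph...", "Another document..."]

paragraphs = [
    p.strip()
    for doc in docs
    for p in doc.split("\n\n")
    if p.strip()  # drop empty fragments
]
```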
-
It depends on the embedding model that you are using in BERTopic. If you are using the default model, which is a sentence-transformer model, then the document gets truncated after a fixed number of tokens. This means that it will ignore a part of the document.

When documents are long but also contain multiple topics, it is advised to split them up into either paragraphs or sentences. If the computational cost is too high, there are a number of tricks you can use (see the sketches below):

1. Train the model on a subset of the data and predict the topics for every document outside of that subset. This allows you to circumvent having to train on the entire dataset.
2. Use online topic modeling to train the model on batches of data instead of on all the data at once.
3. Use different sub-models that might be better optimized for your use case. For example, if you have a GPU, it might be worthwhile to use cuML instead of the default models.
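To see how much of a document the embedding model actually reads, you can inspect its maximum sequence length. A minimal sketch, assuming the default English model `all-MiniLM-L6-v2` (the exact default may differ across BERTopic versions):

```python
from sentence_transformers import SentenceTransformer

# BERTopic's default English embedding model (may differ per version)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Tokens beyond this limit are silently dropped before encoding
print(embedding_model.max_seq_length)  # typically 256 word pieces
```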
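For the first trick, a sketch of fitting on a sample and predicting the rest; the 10,000-document cutoff and the `docs` placeholder are assumptions for illustration:

```python
from bertopic import BERTopic

# `docs` stands in for your full list of (split) documents
subset, remainder = docs[:10_000], docs[10_000:]  # arbitrary cutoff

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(subset)          # train on the sample
new_topics, new_probs = topic_model.transform(remainder)   # predict the rest
```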
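For the second trick, BERTopic supports online learning through `partial_fit` when its sub-models do. A sketch along the lines of the documented setup, using scikit-learn's IncrementalPCA and MiniBatchKMeans; the batch size and cluster count are assumptions you would tune:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

# Sub-models that support incremental (online) learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english")

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
)

# Feed the corpus in chunks instead of all at once
for i in range(0, len(docs), 1_000):
    topic_model.partial_fit(docs[i : i + 1_000])
```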
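For the third trick, a sketch of swapping in GPU-accelerated sub-models from RAPIDS cuML; this requires a CUDA GPU with cuML installed, and the hyperparameters shown here are illustrative assumptions:

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated drop-in replacements for the default sub-models
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```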