large documents splitting? #2110

chanansh · 2024-08-02T21:37:01Z

According to the best practices guide:

Whenever you have large documents, you typically want to split them up into either paragraphs or sentences. A nice way to do so is by using NLTK's sentence splitter which is nothing more than:

from nltk.tokenize import sent_tokenize, word_tokenize
sentences = [sent_tokenize(abstract) for abstract in abstracts]
sentences = [sentence for doc in sentences for sentence in doc]

However, the provided code example just flattens all sentences from all documents into a single list, whereas I would like to cluster documents, not sentences. What am I missing?

MaartenGr · 2024-08-03T06:44:25Z

Maintainer

When documents are too long to fit within the embedding model's context size, or when you expect multiple topics within a single document, you split them into sentences and then cluster the sentences. That extracts the topics over all sentences (and therefore over all documents). If you then want the topics for each document, you can combine the sentences together to get a distribution of topics.
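To make the per-document step possible later, the flattening has to remember which document each sentence came from. A minimal sketch of that bookkeeping follows. The helper names `split_with_doc_ids` and `naive_sentence_split` are hypothetical and not part of BERTopic or NLTK; the regex splitter is only a stand-in so the snippet runs without extra downloads, and NLTK's `sent_tokenize` could be passed as `splitter` instead.

```python
import re
from typing import Callable, List, Sequence, Tuple


def naive_sentence_split(text: str) -> List[str]:
    """Rough stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_with_doc_ids(
    docs: Sequence[str],
    splitter: Callable[[str], List[str]] = naive_sentence_split,
) -> Tuple[List[str], List[int]]:
    """Flatten documents into sentences and record each sentence's source document.

    Returns two parallel lists: sentences[i] came from docs[doc_ids[i]].
    """
    sentences: List[str] = []
    doc_ids: List[int] = []
    for doc_id, doc in enumerate(docs):
        for sentence in splitter(doc):
            sentences.append(sentence)
            doc_ids.append(doc_id)
    return sentences, doc_ids


# Illustrative data only
abstracts = [
    "Topic models group text. They need embeddings.",
    "Logs are noisy.",
]
sentences, doc_ids = split_with_doc_ids(abstracts)
```

The `sentences` list is what would be passed to the topic model, while `doc_ids` is kept aside for mapping sentence-level topics back to documents.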

2 replies

chanansh Aug 3, 2024
Author

I need to cluster large log files, and the relevant error could be rare. How do you suggest I "combine the sentences together to get a distribution of topics"? Does BERTopic support that?

MaartenGr Aug 4, 2024
Maintainer

Generally, you would train the topic model on all individual sentences and extract topics that relate to specific sentences. Then, you can (outside of BERTopic) simply count how often certain topics appear in a given document based on the topics that were assigned to its sentences.
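The counting step described above could look roughly like the sketch below. It assumes you kept a list `doc_ids` recording which document each sentence came from, and that `topics` holds one topic id per sentence, for example from `topics, _ = topic_model.fit_transform(sentences)`. The values here are hard-coded for illustration, and the function names are hypothetical, not BERTopic API. Skipping topic `-1` reflects BERTopic's convention of labelling outliers as `-1`.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def topic_counts_per_document(
    doc_ids: Sequence[int],
    topics: Sequence[int],
    skip_outliers: bool = True,
) -> Dict[int, Dict[int, int]]:
    """Count how often each topic was assigned to the sentences of each document."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for doc_id, topic in zip(doc_ids, topics):
        if skip_outliers and topic == -1:
            continue  # -1 marks outlier sentences
        counts[doc_id][topic] += 1
    # Documents whose sentences are all outliers do not appear in the result
    return {doc_id: dict(counter) for doc_id, counter in counts.items()}


def to_distribution(counts: Dict[int, Dict[int, int]]) -> Dict[int, Dict[int, float]]:
    """Normalize per-document topic counts so each document's shares sum to 1."""
    return {
        doc_id: {topic: n / sum(topic_counts.values()) for topic, n in topic_counts.items()}
        for doc_id, topic_counts in counts.items()
    }


# Illustrative values: five sentences from two documents
doc_ids = [0, 0, 0, 1, 1]
topics = [2, 2, 5, -1, 5]

counts = topic_counts_per_document(doc_ids, topics)
distribution = to_distribution(counts)
```

Raw counts keep rare topics visible, which may matter when a single unusual sentence is the signal of interest; the normalized version is easier to compare across documents of different lengths.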

In your case, however, it is a bit different since log files are very specific data and typically need to remain as a single document. There, it's important to find an embedding model that can process your particular log files well.
