Implementing something similar to text-tiling with BERTopic #1568
-
This is a very general question for which I'm looking for ideas. Is there a way to use BERTopic to split long unstructured texts (for example, a very long article without any headings) into sections of semantically similar sentences/paragraphs? Note that the challenge in this case is to consider the proximity of the sentences and paragraphs to each other in the original document. A classification that groups the 7th, 14th, 29th, 60th, and 129th sentences/paragraphs into the same topic is not practical. I came across text-tiling, but I'm hoping to achieve this with BERTopic since it allows using embeddings and semantic similarities. Any ideas?
-
I am not familiar with text-tiling, but based on your description it should be straightforward. Simply split up your document into sentences and pass them to BERTopic. Then, you could either keep them as they are or pass the full documents to `.approximate_distribution`.
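A minimal sketch of that suggestion, assuming NLTK's `sent_tokenize` as the sentence splitter (any splitter would do):

```python
from nltk.tokenize import sent_tokenize
from bertopic import BERTopic

long_docs = ["..."]  # hypothetical: your long, unstructured articles

# Split every document into sentences and fit BERTopic on those,
# so that each sentence is assigned its own topic
sentences = [sent for doc in long_docs for sent in sent_tokenize(doc)]
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(sentences)
```

Keeping track of each sentence's position in its source document then lets you merge adjacent sentences that share a topic into contiguous sections.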
-
Hi Saeed,

I have a similar case, and I used the (linguistic) intuition that a paragraph has a single "message", and the sentences around it are just glue to keep the message(s) together. Kind of a bullet-points idea. I use it for websites with very different content, and it turns out that even a single page has only one topic, at most three.

As for grouping topics into classes manually: it works pretty straightforwardly. With 40k docs of 20 paragraphs each and a minimum topic size of 300 paragraphs/topic, I get some 300 topics, which you can very quickly categorize into 30 classes. The first time it took me some 3 hours, but after a while only 1 hour. The advantage is that domain experts can have a stab at it as well. I also like to have some understanding of and control over what's going on in my system rather than using a black box. It took quite some convincing my boss that this is at least as good an approach.

Good luck,
Andreas
What I meant was updating the vectorizer to use some sort of sentence splitter to perform the tokenization instead. So you would update it like so:
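For example, a sketch assuming NLTK's `sent_tokenize` as the splitter:

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import sent_tokenize

# Overwrite the internal tokenizer so that "tokens" are whole
# sentences rather than words
topic_model.vectorizer_model = CountVectorizer(tokenizer=sent_tokenize)
```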
And then use `.approximate_distribution` with embeddings rather than c-TF-IDF to perform the similarity calculation, since we have overwritten the internal tokenizer. Do note that this solution overwrites the `vectorizer_model`, which then cannot be used for other applications. Saving that vectorizer model beforehand might be helpful if you want to restore functionality.

All in all, I just tested the following, which works for me: