Implementing something similar to text-tiling with BERTopic #1568
-
This is a very general question for which I'm looking for ideas. Is there a way to use BERTopic to split long unstructured texts (for example, a very long article without any headings) into sections of semantically similar sentences/paragraphs? Note that the challenge in this case is to consider the proximity of the sentences and paragraphs to each other in the original document. A classification that groups the 7th, 14th, 29th, 60th, and 129th sentences/paragraphs into the same topic is not practical. I came across text-tiling, but I'm hoping to achieve this with BERTopic since it allows using embeddings and semantic similarities. Any ideas?
-
I am not familiar with text-tiling, but based on your description it should be straightforward. Simply split up your document into sentences and pass them to BERTopic. Then, you could either keep them as they are or pass the full documents to `.approximate_distribution`.
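A minimal sketch of that suggestion, assuming NLTK's `sent_tokenize` as the sentence splitter (any splitter would do):

```python
from nltk.tokenize import sent_tokenize
from bertopic import BERTopic

long_docs = ["..."]  # hypothetical: your long, unstructured articles

# Split every document into sentences and fit BERTopic on those,
# so that each sentence is assigned its own topic
sentences = [sent for doc in long_docs for sent in sent_tokenize(doc)]
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(sentences)
```

Keeping track of each sentence's position in its source document then lets you merge adjacent sentences that share a topic into contiguous sections.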
-
Hi Saeed,

I have a similar case, and I used the (linguistic) intuition that a paragraph has a single "message", and the sentences around it are just glue to keep the message(s) together. Kind of a bullet-points idea. I use it for websites with very different content, and it turns out that even a single page has only one topic, at most three.

As for grouping topics into classes manually: it works pretty straightforwardly. With 40k docs of 20 paragraphs each and a minimum topic size of 300 paragraphs/topic, I get some 300 topics, which you can very quickly categorize into 30 classes. The first time it took me some 3 hours, but after a while only 1 hour. The advantage is that domain experts can have a stab at it as well. I also like to have some understanding of and control over what's going on in my system rather than using a black box. It took quite some convincing my boss that this is at least as good an approach.

Good luck,
Andreas
What I meant was updating the vectorizer to use some sort of sentence splitter to perform the tokenization instead. So you would update it like so:
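For example, a sketch assuming NLTK's `sent_tokenize` as the splitter:

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import sent_tokenize

# Overwrite the internal tokenizer so that "tokens" are whole
# sentences rather than words
topic_model.vectorizer_model = CountVectorizer(tokenizer=sent_tokenize)
```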
And then use `.approximate_distribution` with embeddings rather than c-TF-IDF to perform the similarity calculation, since we have overwritten the internal tokenizer. Do note that this solution overwrites the `vectorizer_model`, which then cannot be used for other applications. Saving that vectorizer model beforehand might be helpful if you want to restore functionality.

All in all, I just tested the following, which works for me: