Chosen represented Topic #2048

mahmawad · 2024-06-12T09:19:31Z

First, I would like to thank you for your great tool.

I have a question. This is one of the topic representations for my documents:

"President's son trial in Manhattan"

However, most of the documents under this topic aren't related to Hunter Biden, although they do mostly talk about politics.

Is there a way to make the representation more general?

MaartenGr · 2024-06-12T09:31:26Z

> First I would like to thank you for your great tool.

Thank you for the kind words!

> is there a way to make the representation more general ?

It's difficult to say without seeing the full code, versions, output of .get_topic_info(), etc. For instance, it's not clear to me which topic representation model you use. Could you provide a bit more information? I need your full training code, your version of BERTopic, and the output of running .get_topic_info().

> This is one of the topic Representation in my Documents :
> "President's son trial in Manhattan"

Is this a representative document or the topic representation?

mahmawad · 2024-06-12T09:49:27Z

This is the full code:

def get_topic_modeling(df, prompt, model, tokenizer):
    """
    Generates a topic model for a given DataFrame using various NLP and clustering techniques.

    Args:
        df (pd.DataFrame): DataFrame containing the preprocessed text data.
        prompt (str): The prompt for the text generation model.
        model (str): The name or path of the model to be used for text generation.
        tokenizer (str): The tokenizer to be used with the model.

    Returns:
        topic_model: The trained BERTopic model.
        topics: The topics identified by the model.
        probs: The probabilities of the topics.
    """
    from sentence_transformers import SentenceTransformer
    from torch import bfloat16
    import transformers
    from torch import cuda
    import pandas as pd

    # Initialize text generation pipeline
    generator = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        task='text-generation',
        temperature=0.1,
        max_new_tokens=20,
        repetition_penalty=1.1
    )

    # Pre-calculate embeddings using SentenceTransformer
    embedding_model = SentenceTransformer("all-mpnet-base-v2")
    embeddings = embedding_model.encode(df['PreprocessedText'].tolist(), show_progress_bar=True)

    from umap import UMAP
    # Initialize UMAP model for dimensionality reduction
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

    # Reduce embeddings dimensions for visualization
    reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

    from hdbscan import HDBSCAN
    # Initialize HDBSCAN model for clustering
    hdbscan_model = HDBSCAN(metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_cluster_size=10)

    from sklearn.cluster import KMeans
    # Initialize KMeans clustering model
    cluster_model = KMeans(random_state=42, n_clusters=11)

    from sklearn.feature_extraction.text import CountVectorizer
    # Create a CountVectorizer with custom stop words
    custom_exclude_words = ["world", "automotive", "post", 'first', 'new', 'car', 'cars', 'vehicle', 'vehicles', 'say', 'hello', 'welcome']
    vectorizer_model = CountVectorizer(stop_words=custom_exclude_words, min_df=3)

    from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration
    # Initialize text generation model with Llama 2
    llama2 = TextGeneration(generator, prompt)

    # Dictionary of representation models
    representation_model = {"Llama2": llama2}

    from bertopic import BERTopic
    # Initialize and train BERTopic model
    topic_model = BERTopic(
        embedding_model=embedding_model,
        vectorizer_model=vectorizer_model,
        umap_model=umap_model,
        calculate_probabilities=True,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        top_n_words=10,
        verbose=True,
    )

    # Fit the topic model and transform the data
    topics, probs = topic_model.fit_transform(df['PreprocessedText'].values, embeddings)

    return topic_model, topics, probs

mahmawad · 2024-06-12T09:50:19Z

The prompt:

example_prompt = """
I have a topic that contains the following documents:

Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
Meat, but especially beef, is the worst food in terms of emissions.
Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
Make sure not to mention any company or city names.

[/INST] Environmental impacts of eating meat
"""
main_prompt = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
Make sure not to mention any company or city names.
[/INST]
"""
prompt = example_prompt + main_prompt
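
One thing that can be checked directly on the combined prompt, as a minimal sketch: Llama 2 chat prompts wrap each user turn in [INST] ... [/INST], so counting the two tags shows whether every closing tag has an opener. The helper name count_inst_tags is hypothetical, and the two strings are shortened stand-ins for example_prompt and main_prompt, keeping only their tags.

```python
from typing import Tuple


def count_inst_tags(prompt: str) -> Tuple[int, int]:
    """Return (number of '[INST]' tags, number of '[/INST]' tags) in the prompt."""
    return prompt.count("[INST]"), prompt.count("[/INST]")


# Shortened stand-ins (assumption: only the tag layout of the real prompts is kept)
example_prompt = "I have a topic ...\n[/INST] Environmental impacts of eating meat\n"
main_prompt = "[INST]\nI have a topic ...\n[/INST]\n"

opens, closes = count_inst_tags(example_prompt + main_prompt)
print(opens, closes)  # 1 2
```

Here the [/INST] in the example has no matching [INST]. Starting example_prompt with [INST] may be worth trying, though that is an assumption: nothing in this thread confirms it is what causes the overly specific label.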

mahmawad · 2024-06-12T09:58:50Z

It's a topic representation, and here are some output examples:
'AI and Data Industry Trends'
'President's son trial in Manhattan'

As you can see, the first one is good since it's a general topic label, but the second one isn't representative. Do you think the problem is with the prompt?

MaartenGr · 2024-06-12T10:52:29Z

> but second one isn't representative, do you think the problem is with the prompt ?

It might be, but it depends on the LLM that you are using. The model isn't specified in the code, but it seems you are using Llama 2 (I can't see which version). You could also use Llama 3, which is quite a bit better, or other newer models like Mistral, Phi-3, Command R+, Qwen2, etc.

Note that you can also track the prompts with topic_model.representation_model["Llama2"].prompts_ (the key has to match the one used in your representation_model dictionary). You might find something of interest there.
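
A minimal sketch of that inspection, with one caveat: representation_model was defined as a dict with the key "Llama2" in the code above, and dict lookups are case-sensitive, so the helper below matches the key while ignoring case. find_prompts is a hypothetical helper, not part of BERTopic, and the SimpleNamespace object only stands in for a fitted TextGeneration model so the snippet runs on its own.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def find_prompts(representation_models: Dict[str, Any], name: str) -> List[str]:
    """Return the prompts_ list of the model whose dict key matches name, ignoring case."""
    for key, model in representation_models.items():
        if key.lower() == name.lower():
            # Fall back to an empty list if the model has no prompts_ attribute
            return list(getattr(model, "prompts_", []))
    raise KeyError(f"No representation model named {name!r}; available: {list(representation_models)}")


# Stand-in for topic_model.representation_model after fitting (hypothetical data)
fake_models = {
    "Llama2": SimpleNamespace(prompts_=["[INST] ... documents and keywords for one topic ... [/INST]"])
}

for prompt in find_prompts(fake_models, "llama2"):
    print(prompt)
```

With a fitted model, the equivalent call would be find_prompts(topic_model.representation_model, "llama2"). Reading the printed prompt for the "President's son trial in Manhattan" topic shows exactly which documents and keywords the LLM was given.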
