Chosen represented Topic #2048

Open
mahmawad opened this issue Jun 12, 2024 · 5 comments

Comments

@mahmawad

First I would like to thank you for your great tool.

I have a question.
This is one of the topic representations for my documents:

"President's son trial in Manhattan"

However, most of the documents under this topic aren't related to Hunter Biden, although they do mostly talk about politics.

Is there a way to make the representation more general?

@MaartenGr
Owner

First I would like to thank you for your great tool.

Thank you for the kind words!

Is there a way to make the representation more general?

It's difficult to say without seeing the full code, versions, output of .get_topic_info(), etc. For instance, it's not clear to me which topic representation model you use. Could you share your full training code, your BERTopic version, and the output of running .get_topic_info()?

This is one of the topic representations for my documents:
"President's son trial in Manhattan"

Is this a representative document or the topic representation?
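For anyone following along, the details requested above can be gathered in one go. This is only a minimal sketch: `collect_debug_info` is a hypothetical helper written for this thread, and it assumes `topic_model` is an already fitted BERTopic instance. The only BERTopic call it relies on is `get_topic_info()`.

```python
from importlib import metadata


def collect_debug_info(topic_model, package="bertopic"):
    """Bundle the installed package version and the topic overview.

    `topic_model` is assumed to be a fitted BERTopic model; this helper
    is hypothetical and not part of BERTopic itself.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        version = "not installed"

    return {
        "bertopic_version": version,
        # For a fitted BERTopic model this is a DataFrame with one row per topic.
        "topic_info": topic_model.get_topic_info(),
    }
```

Printing `collect_debug_info(topic_model)["topic_info"]` and pasting the result alongside the training code usually makes it clear whether a label comes from the default representation or from an additional aspect such as an LLM.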

@mahmawad
Author

This is the full code:

def get_topic_modeling(df, prompt, model, tokenizer):
    """
    Generates a topic model for a given DataFrame using various NLP and clustering techniques.

    Args:
        df (pd.DataFrame): DataFrame containing the preprocessed text data.
        prompt (str): The prompt for the text generation model.
        model (str): The name or path of the model to be used for text generation.
        tokenizer (str): The tokenizer to be used with the model.

    Returns:
        topic_model: The trained BERTopic model.
        topics: The topics identified by the model.
        probs: The probabilities of the topics.
    """
    from sentence_transformers import SentenceTransformer
    from torch import bfloat16
    import transformers
    from torch import cuda
    import pandas as pd

    # Initialize text generation pipeline
    generator = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        task='text-generation',
        temperature=0.1,
        max_new_tokens=20,
        repetition_penalty=1.1
    )

    # Pre-calculate embeddings using SentenceTransformer
    embedding_model = SentenceTransformer("all-mpnet-base-v2")
    embeddings = embedding_model.encode(df['PreprocessedText'].tolist(), show_progress_bar=True)

    from umap import UMAP
    # Initialize UMAP model for dimensionality reduction
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

    # Reduce embeddings dimensions for visualization
    reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

    from hdbscan import HDBSCAN
    # Initialize HDBSCAN model for clustering
    hdbscan_model = HDBSCAN(metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_cluster_size=10)

    from sklearn.cluster import KMeans
    # Initialize KMeans clustering model
    cluster_model = KMeans(random_state=42, n_clusters=11)

    from sklearn.feature_extraction.text import CountVectorizer
    # Create a CountVectorizer with custom stop words
    custom_exclude_words = ["world", "automotive", "post", 'first', 'new', 'car', 'cars', 'vehicle', 'vehicles', 'say', 'hello', 'welcome']
    vectorizer_model = CountVectorizer(stop_words=custom_exclude_words, min_df=3)

    from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration
    # Initialize text generation model with Llama 2
    llama2 = TextGeneration(generator, prompt)

    # Dictionary of representation models
    representation_model = {"Llama2": llama2}

    from bertopic import BERTopic
    # Initialize and train BERTopic model
    topic_model = BERTopic(
        embedding_model=embedding_model,
        vectorizer_model=vectorizer_model,
        umap_model=umap_model,
        calculate_probabilities=True,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        top_n_words=10,
        verbose=True,
    )

    # Fit the topic model and transform the data
    topics, probs = topic_model.fit_transform(df['PreprocessedText'].values, embeddings)

    return topic_model, topics, probs

@mahmawad
Author

The prompt:

example_prompt = """
I have a topic that contains the following documents:

  • Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
  • Meat, but especially beef, is the worst food in terms of emissions.
  • Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
Make sure not to mention any company or city names.

[/INST] Environmental impacts of eating meat
"""
main_prompt = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
Make sure not to mention any company or city names.
[/INST]
"""
prompt = example_prompt + main_prompt
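To make the template above easier to reason about, here is a rough sketch of how the `[DOCUMENTS]` and `[KEYWORDS]` placeholders end up being filled for each topic. `fill_prompt` is a hypothetical helper for illustration only, not BERTopic's internal code, and the exact bullet formatting of the documents is an assumption.

```python
def fill_prompt(template, documents, keywords):
    """Substitute a topic's documents and keywords into a prompt template.

    Illustrative only: the "- " bullet style and the ", " keyword
    separator are assumptions, not BERTopic's exact formatting.
    """
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    filled = template.replace("[DOCUMENTS]", doc_block)
    return filled.replace("[KEYWORDS]", ", ".join(keywords))


template = (
    "I have a topic that contains the following documents:\n"
    "[DOCUMENTS]\n\n"
    "The topic is described by the following keywords: '[KEYWORDS]'.\n"
)

# Hypothetical documents and keywords for a single topic.
filled = fill_prompt(
    template,
    documents=["Senate passes budget bill", "Court hears election case"],
    keywords=["senate", "court", "election"],
)
print(filled)
```

Because only a handful of documents per topic reach the LLM this way, a label can end up describing those few documents rather than the topic as a whole.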

@mahmawad
Author

It's a topic representation.
Here are some output examples:
'AI and Data Industry Trends'
'President's son trial in Manhattan'

As you can see, the first one is good since it's a general topic label,

but the second one isn't representative. Do you think the problem is with the prompt?

@MaartenGr
Owner

MaartenGr commented Jun 12, 2024

but the second one isn't representative. Do you think the problem is with the prompt?

It might be, but it also depends on the LLM that you are using. The model isn't named in the code, but it seems you are using Llama 2 (I can't see which version). You could also use Llama 3, which is quite a bit better, or other newer models like Mistral, Phi-3, Command R+, Qwen2, etc.

Note that you can also track the prompts with topic_model.representation_model["Llama2"].prompts_ (the key has to match the one used in your representation_model dictionary). You might find something of interest there.
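A small sketch of that inspection step, assuming a fitted `topic_model` whose `representation_model` is a dictionary as in the training code earlier in this thread. `preview_prompts` is a hypothetical helper; the aspect key defaults to "Llama2" because that is the key used in that dictionary, and a differently cased key would raise a `KeyError`.

```python
def preview_prompts(topic_model, aspect="Llama2", limit=3):
    """Print and return the first few prompts sent to the LLM.

    Assumes `topic_model.representation_model` is a dict keyed by aspect
    name and that the chosen representation stores its prompts in a
    `prompts_` list. This helper is hypothetical, not part of BERTopic.
    """
    representation = topic_model.representation_model[aspect]
    prompts = list(representation.prompts_)[:limit]
    for index, prompt in enumerate(prompts):
        print(f"--- prompt {index} ---")
        print(prompt)
    return prompts
```

Reading the filled-in prompts shows exactly which documents and keywords the model saw for a topic, which helps tell whether an overly specific label comes from the sampled documents or from the instructions.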
