Hi, thank you for showcasing the use of Llama 2 for labeling topics. I use the same prompt with my dataset, which is unrelated to the example of "Environmental impacts of eating meat."
Thank you for your time and consideration!
---
Could you share your full code? That might help me understand what exactly is happening.
Few-shot learning is already applied with the Llama 2 example. That is what the [DOCUMENTS] and [KEYWORDS] tags are for. You can find more about that here.
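For reference, a minimal sketch of how a prompt with those tags is passed to BERTopic. The generator model here is just a small placeholder, not the Llama 2 setup from the example:

```python
from transformers import pipeline
from bertopic import BERTopic
from bertopic.representation import TextGeneration

# BERTopic substitutes [DOCUMENTS] with representative documents of a topic
# and [KEYWORDS] with its top keywords before sending the prompt to the model.
prompt = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.
Give a short label for this topic."""

# Placeholder generator; swap in your own Llama 2 pipeline here.
generator = pipeline("text-generation", model="gpt2")

representation_model = TextGeneration(generator, prompt=prompt)
topic_model = BERTopic(representation_model=representation_model)
```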
---
Alright, it seems there are snippets of code missing here and there, but I think I get the general gist of what you are trying to achieve. This might simply be a result of the prompt itself. With the one-shot approach here, the LLM may "overfit" on that single example. When it does not know a good response or suffers from the "lost in the middle" syndrome, it might default to what it has seen first, which is the example.

Instead, I can recommend the following approach with Zephyr, which will be in the documentation soon. The prompt might be of use to you, but if you want to use it with Llama 2, make sure to use the chat template for Llama 2 instead (see the sketch at the end of this reply).

**Zephyr (Mistral 7B)**

We can go a step further with open-source Large Language Models (LLMs) that have been shown to match the performance of closed-source LLMs like ChatGPT. In this example, we will show you how to use Zephyr, a fine-tuned version of Mistral 7B. Mistral 7B outperforms other open-source LLMs at a much smaller scale and is a worthwhile solution for use cases such as topic modeling. We want to keep inference as fast as possible, and a relatively small model helps with that. Zephyr is a fine-tuned version of Mistral 7B that was trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization (DPO).

To use Zephyr in BERTopic, we will first need to install and update a couple of packages that can handle quantized versions of Zephyr:

```bash
pip install ctransformers[cuda]
pip install --upgrade git+https://github.com/huggingface/transformers
```

Instead of loading in the full model, we can load a quantized model, which is a compressed version of the original:

```python
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-GGUF",
    model_file="zephyr-7b-alpha.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")
# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)
```

This Zephyr model requires a specific prompt template in order to work:

```python
prompt = """<|system|>You are a helpful, respectful and honest assistant for labeling topics.</s>
<|user|>
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.
Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.</s>
<|assistant|>"""
```

After creating this prompt template, we can create our representation model to be used in BERTopic:

```python
from bertopic import BERTopic
from bertopic.representation import TextGeneration
# Text generation with Zephyr
zephyr = TextGeneration(generator, prompt=prompt)
representation_model = {"Zephyr": zephyr}
# Topic Modeling
topic_model = BERTopic(representation_model=representation_model, verbose=True)
```
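As a rough usage sketch (assuming `docs` is your own list of document strings, which is not shown in this thread), fitting the model and inspecting the generated labels would then look like:

```python
# docs: your own list of documents (assumption; not part of the thread)
topics, probs = topic_model.fit_transform(docs)

# The "Zephyr" aspect should appear as a column in the topic overview
topic_model.get_topic_info()
```

And if you would rather stay with Llama 2, the same instructions need Llama 2's chat format instead of Zephyr's. A sketch, assuming the standard [INST]/<<SYS>> template (double-check it against the exact Llama 2 variant you are using):

```python
# Llama 2 chat template (sketch): system prompt inside <<SYS>> tags,
# user instructions wrapped in [INST] ... [/INST]
llama2_prompt = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for labeling topics.
<</SYS>>

I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.
Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more. [/INST]"""
```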