Update LangChain Support #2188
base: master
Changes from 14 commits
@@ -1,35 +1,61 @@
 import pandas as pd
-from langchain.docstore.document import Document
+from langchain_core.documents import Document
 from scipy.sparse import csr_matrix
 from typing import Callable, Mapping, List, Tuple, Union
 
+from langchain_core.language_models import LanguageModelLike
+from langchain_core.runnables import Runnable
+from langchain_core.prompts import ChatPromptTemplate
+from langchain.chains.combine_documents import create_stuff_documents_chain
 from bertopic.representation._base import BaseRepresentation
 from bertopic.representation._utils import truncate_document
 
DEFAULT_PROMPT = "What are these documents about? Please give a single label." | ||
DEFAULT_PROMPT = """ | ||
This is a list of texts where each collection of texts describes a topic. After each collection of texts, the name of the topic they represent is mentioned as a short, highly descriptive title. | ||
--- | ||
Topic: | ||
Sample texts from this topic: | ||
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial-style meat production and factory farming, meat has become a staple food. | ||
- Meat, but especially beef, is the worst food in terms of emissions. | ||
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one. | ||
|
||
Keywords: meat beef eat eating emissions steak food health processed chicken | ||
Topic name: Environmental impacts of eating meat | ||
--- | ||
Topic: | ||
Sample texts from this topic: | ||
- I have ordered the product weeks ago but it still has not arrived! | ||
- The website mentions that it only takes a couple of days to deliver but I still have not received mine. | ||
- I got a message stating that I received the monitor but that is not true! | ||
- It took a month longer to deliver than was advised... | ||
|
||
Keywords: deliver weeks product shipping long delivery received arrived arrive week | ||
Topic name: Shipping and delivery issues | ||
--- | ||
Topic: | ||
Sample texts from this topic: | ||
[DOCUMENTS] | ||
Keywords: [KEYWORDS] | ||
Topic name:""" | ||
|
||
|
||
 class LangChain(BaseRepresentation):
-    """Using chains in langchain to generate topic labels.
-
-    The classic example uses `langchain.chains.question_answering.load_qa_chain`.
-    This returns a chain that takes a list of documents and a question as input.
+    """This representation model uses LangChain to generate descriptive topic labels.
 
-    You can also use Runnables such as those composed using the LangChain Expression Language.
+    It supports two main usage patterns:
+    1. Basic usage with a language model and optional custom prompt
+    2. Advanced usage with a custom LangChain chain for full control over the generation process
 
     Arguments:
-        chain: The langchain chain or Runnable with a `batch` method.
-               Input keys must be `input_documents` and `question`.
-               Output key must be `output_text`.
-        prompt: The prompt to be used in the model. If no prompt is given,
-                `self.default_prompt_` is used instead.
-                NOTE: Use `"[KEYWORDS]"` in the prompt
-                to decide where the keywords need to be
-                inserted. Keywords won't be included unless
-                indicated. Unlike other representation models,
-                Langchain does not use the `"[DOCUMENTS]"` tag
-                to insert documents into the prompt. The load_qa_chain function
-                formats the representative documents within the prompt.
+        llm: A LangChain text model or chat model used to generate representations, only needed for basic usage.
+             Examples include ChatOpenAI or ChatAnthropic. Ignored if a custom chain is provided.
Comment on lines +49 to +50:

> Note to myself that I should update BERTopic soon to 0.17.0 considering this is an API change. It will break previous implementations and this new feature should not be put in a minor version.
+        prompt: A string template containing the placeholder [DOCUMENTS] and optionally [KEYWORDS], only needed for basic usage.
+                Defaults to a pre-defined prompt defined in DEFAULT_PROMPT. Ignored if a custom chain is provided.
+        chain: A custom LangChain chain to generate representations, only needed for advanced usage.
+               The chain must be a LangChain Runnable that implements the batch method and accepts these input keys:
+               - DOCUMENTS: (required) A list of LangChain Document objects
+               - KEYWORDS: (optional) A list of topic keywords
+               The chain must directly output either a string label or a list of strings.
+               If provided, llm and prompt are ignored.
         nr_docs: The number of documents to pass to LangChain
         diversity: The diversity of documents to pass to LangChain.
                    Accepts values between 0 and 1. A higher
@@ -51,103 +77,118 @@ class LangChain(BaseRepresentation):
             * If tokenizer is a callable, then that callable is used to tokenize
               the document. These tokens are counted and truncated depending
              on `doc_length`
-        chain_config: The configuration for the langchain chain. Can be used to set options
-                      like max_concurrency to avoid rate limiting errors.
+        chain_config: The configuration for the LangChain chain. Can be used to set options like max_concurrency to avoid rate limiting errors.
 
     Usage:
 
-    To use this, you will need to install the langchain package first.
-    Additionally, you will need an underlying LLM to support langchain,
-    like openai:
-
-    `pip install langchain`
-    `pip install openai`
-
-    Then, you can create your chain as follows:
-
-    ```python
-    from langchain.chains.question_answering import load_qa_chain
-    from langchain.llms import OpenAI
-    chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
-    ```
-
-    Finally, you can pass the chain to BERTopic as follows:
-
-    ```python
-    from bertopic.representation import LangChain
-
-    # Create your representation model
-    representation_model = LangChain(chain)
-
-    # Use the representation model in BERTopic on top of the default pipeline
-    topic_model = BERTopic(representation_model=representation_model)
-    ```
-
-    You can also use a custom prompt:
-
-    ```python
-    prompt = "What are these documents about? Please give a single label."
-    representation_model = LangChain(chain, prompt=prompt)
-    ```
-
-    You can also use a Runnable instead of a chain.
-    The example below uses the LangChain Expression Language:
-
-    ```python
-    from bertopic.representation import LangChain
-    from langchain.chains.question_answering import load_qa_chain
-    from langchain.chat_models import ChatAnthropic
-    from langchain.schema.document import Document
-    from langchain.schema.runnable import RunnablePassthrough
-    from langchain_experimental.data_anonymizer.presidio import PresidioReversibleAnonymizer
-
-    prompt = ...
-    llm = ...
-
-    # We will construct a special privacy-preserving chain using Microsoft Presidio
-
-    pii_handler = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])
-
-    chain = (
-        {
-            "input_documents": (
-                lambda inp: [
-                    Document(
-                        page_content=pii_handler.anonymize(
-                            d.page_content,
-                            language="en",
-                        ),
-                    )
-                    for d in inp["input_documents"]
-                ]
-            ),
-            "question": RunnablePassthrough(),
-        }
-        | load_qa_chain(representation_llm, chain_type="stuff")
-        | (lambda output: {"output_text": pii_handler.deanonymize(output["output_text"])})
-    )
-
-    representation_model = LangChain(chain, prompt=representation_prompt)
-    ```
+    To use this representation, you will need to install the LangChain package first.
+
+    `pip install langchain`
+
+    There are two ways to use the LangChain representation:
+
+    1. Use a default LangChain chain that is created using an underlying language model and a prompt.
+
+       You will first need to install the package for the underlying model. For example, if you want to use OpenAI:
+
+       `pip install langchain_openai`
+
+       ```python
+       from bertopic.representation import LangChain
+       from langchain_openai import ChatOpenAI
+
+       chat_model = ChatOpenAI(temperature=0, openai_api_key=my_openai_api_key)
+
+       # Create your representation model with the pre-defined prompt
+       representation_model = LangChain(llm=chat_model)
+
+       # Create your representation model with a custom prompt
+       prompt = "What are these documents about? [DOCUMENTS] Here are keywords related to them [KEYWORDS]."
+       representation_model = LangChain(llm=chat_model, prompt=prompt)
+
+       # Use the representation model in BERTopic on top of the default pipeline
+       topic_model = BERTopic(representation_model=representation_model)
+       ```
+
+    2. Use a custom LangChain chain for full control over the generation process:
+
+       Remember that the chain will receive two inputs: `DOCUMENTS` and `KEYWORDS` and that it must return directly a string label
+       or a list of strings.
+
+       ```python
+       from bertopic.representation import LangChain
+       from langchain_anthropic import ChatAnthropic
+       from langchain_core.documents import Document
+       from langchain_core.prompts import ChatPromptTemplate
+       from langchain.chains.combine_documents import create_stuff_documents_chain
+       from langchain_experimental.data_anonymizer.presidio import PresidioReversibleAnonymizer
+
+       prompt = ...
+
+       chat_model = ...
+
+       # We will construct a special privacy-preserving chain using Microsoft Presidio
+
+       pii_handler = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])
+
+       chain = (
+           {
+               "DOCUMENTS": (
+                   lambda inp: [
+                       Document(
+                           page_content=pii_handler.anonymize(
+                               d.page_content,
+                               language="en",
+                           ),
+                       )
+                       for d in inp["DOCUMENTS"]
+                   ]
+               ),
+               "KEYWORDS": lambda keywords: keywords["KEYWORDS"],
+           }
+           | create_stuff_documents_chain(chat_model, prompt, document_variable_name="DOCUMENTS")
+       )
+
+       representation_model = LangChain(chain=chain)
+       ```
     """
 
     def __init__(
         self,
-        chain,
-        prompt: str = None,
+        llm: LanguageModelLike = None,
+        prompt: str = DEFAULT_PROMPT,
+        chain: Runnable = None,
         nr_docs: int = 4,
         diversity: float = None,
         doc_length: int = None,
         tokenizer: Union[str, Callable] = None,
-        chain_config=None,
+        chain_config: dict = None,
     ):
-        self.chain = chain
-        self.prompt = prompt if prompt is not None else DEFAULT_PROMPT
         self.default_prompt_ = DEFAULT_PROMPT
-        self.chain_config = chain_config
+        self.prompt = prompt
+
+        if chain is not None:
+            self.chain = chain
+        elif llm is not None:
+            # Check that the prompt contains the necessary placeholder
+            if "[DOCUMENTS]" not in prompt:
+                raise ValueError("The prompt must contain the placeholder [DOCUMENTS]")
+
+            # Convert prompt placeholders to the LangChain format
+            langchain_prompt = prompt.replace("[DOCUMENTS]", "{DOCUMENTS}").replace("[KEYWORDS]", "{KEYWORDS}")
+
+            # Create ChatPromptTemplate
+            chat_prompt = ChatPromptTemplate.from_template(langchain_prompt)
+
+            # Create a basic LangChain chain using create_stuff_documents_chain
+            self.chain = create_stuff_documents_chain(llm, chat_prompt, document_variable_name="DOCUMENTS")
+        else:
+            raise ValueError("Either `llm` or `chain` must be provided")
 
         self.nr_docs = nr_docs
         self.diversity = diversity
         self.doc_length = doc_length
         self.tokenizer = tokenizer
+        self.chain_config = chain_config
 
     def extract_topics(
         self,
 
@@ -186,27 +227,49 @@ def extract_topics(
             for docs in repr_docs_mappings.values()
         ]
 
-        # `self.chain` must take `input_documents` and `question` as input keys
-        # Use a custom prompt that leverages keywords, using the tag: [KEYWORDS]
-        if "[KEYWORDS]" in self.prompt:
-            prompts = []
-            for topic in topics:
-                keywords = list(zip(*topics[topic]))[0]
-                prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
-                prompts.append(prompt)
-
-            inputs = [{"input_documents": docs, "question": prompt} for docs, prompt in zip(chain_docs, prompts)]
-
-        else:
-            inputs = [{"input_documents": docs, "question": self.prompt} for docs in chain_docs]
+        # Extract keywords from the topics and format them as a string
+        formatted_keywords_list = []
+        for topic in topics:
+            keywords = list(zip(*topics[topic]))[0]
+            formatted_keywords_list.append(", ".join(keywords))
+
+        # self.chain must accept DOCUMENTS as a mandatory input key and KEYWORDS as an optional input key
+        # We always pass both keys to the chain, and the chain can choose to use them or not
+        # Documents are passed as a list of LangChain Document objects, it is up to the chain to format them into a string
+        inputs = [
+            {"DOCUMENTS": docs, "KEYWORDS": formatted_keywords}
+            for docs, formatted_keywords in zip(chain_docs, formatted_keywords_list)
+        ]
 
-        # `self.chain` must return a dict with an `output_text` key
-        # same output key as the `StuffDocumentsChain` returned by `load_qa_chain`
+        # self.chain must return a string label or a list of string labels for each input
         outputs = self.chain.batch(inputs=inputs, config=self.chain_config)
-        labels = [output["output_text"].strip() for output in outputs]
 
-        updated_topics = {
-            topic: [(label, 1)] + [("", 0) for _ in range(9)] for topic, label in zip(repr_docs_mappings.keys(), labels)
-        }
+        # Process outputs from the chain - can be either strings or lists of strings
+        updated_topics = {}
+        for topic, output in zip(repr_docs_mappings.keys(), outputs):
+            # Each output can be either:
+            # - A single string representing the main topic label
+            # - A list of strings representing multiple related labels
+            if isinstance(output, str):
+                # For string output: use it as the main label (weight=1)
+                # and pad with 9 empty strings (weight=0)
+                labels = [(output.strip(), 1)] + [("", 0) for _ in range(9)]
+            else:
+                # For list output:
+                # 1. Convert all elements to stripped strings
+                # 2. Take up to 10 elements
+                # 3. Assign decreasing weights from 1.0 to 0.1
+                # 4. Pad with empty strings if needed to always have 10 elements
+                clean_outputs = [str(label).strip() for label in output]
+                top_labels = clean_outputs[:10]
+
+                # Create (label, weight) pairs with decreasing weights
+                labels = [(label, 1.0 - (i * 0.1)) for i, label in enumerate(top_labels)]
+
+                # Pad with empty strings if we have less than 10 labels
+                if len(labels) < 10:
+                    labels.extend([("", 0.0) for _ in range(10 - len(labels))])
> This is interesting. The output of any generative model in BERTopic is meant to be the same as you did above: `[("my label", 1), ("", 0), ("", 0), ("", 0), ("", 0), ("", 0), ("", 0), ("", 0), ("", 0), ("", 0)]`, but what wasn't implemented before is that you could also generate a list of keywords/labels. Do you have an example of when this piece of code would be executed? When is the output a list rather than a single string? Also, I'm a bit hesitant about giving decreasing weights rather than all 1s since (if I'm not mistaken) the weights do not have any meaning.

> To be honest, I wasn't entirely sure about the meaning behind the list of tuples and the weights, so I just kept the old behaviour (which you have provided an example of) and, given that format, I assumed it could be extended to allow for generated lists of labels. In the case of lists, I don't mind setting the weights to all 1s (again, I didn't research the meaning of that format for representations). The need to allow for lists stemmed from a use case I had where I used a custom chain to generate topic labels in several languages with the current implementation. Since the current implementation does not allow for lists, I concatenated all elements of the list generated by the chain into a single string with a specific separator so that it could be split later. Allowing for lists in the chain output would make it possible to avoid this. Granted, this may be overkill 😄 I provided an example of basic usage, basic usage with a custom prompt, and advanced usage with different types of list outputs in this thread. Maybe looking at that code and its output would make it more explicit 😃

> There isn't actually any LLM implemented in BERTopic currently that returns a list of labels. They all return a single label. Although I like the idea of returning multiple labels, I would suggest removing this here considering it might be a bit out of scope for this PR.

> Hmmm, this is a rather interesting use case that I haven't seen before. Now you make me hesitate on the best course of action here... Never mind what I said above, let's keep this and make sure they all have values of 1 instead of a decreasing value. Since the weights are currently meaningless (and quite frankly not used) in the LLM setting, we can just set them to 1.

> Wow, these are amazing examples! Thanks for sharing.

> I think the nice thing about the list behaviour is that it fits nicely with what is already implemented. I have adapted the code so that the weight is 1 for all labels.
+            updated_topics[topic] = labels
 
         return updated_topics
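To make the list-output path discussed above concrete, here is a minimal sketch of the multilingual use case the author describes: a custom chain whose final step splits the generated string into a list of labels. This is an illustration under assumptions, not code from this PR; `chat_model` and the prompt wording are placeholders.

```python
from bertopic import BERTopic
from bertopic.representation import LangChain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

chat_model = ...  # any LangChain chat model, e.g. ChatOpenAI(...)

# Hypothetical multilingual prompt; DOCUMENTS and KEYWORDS are the two input keys
prompt = ChatPromptTemplate.from_template(
    "These texts describe one topic:\n{DOCUMENTS}\n"
    "Keywords: {KEYWORDS}\n"
    "Return the topic label in English, French, and German, separated by ';'."
)

chain = (
    create_stuff_documents_chain(chat_model, prompt, document_variable_name="DOCUMENTS")
    # Split the single generated string into a list of labels, so the chain
    # output is a list rather than a string; extract_topics then stores the
    # labels as (label, weight) pairs
    | (lambda text: [label.strip() for label in text.split(";")])
)

# chain_config is forwarded to chain.batch(); max_concurrency caps parallel calls
representation_model = LangChain(chain=chain, chain_config={"max_concurrency": 5})
topic_model = BERTopic(representation_model=representation_model)
```

The `chain_config` argument here also shows how the `max_concurrency` option mentioned in the Arguments section can be set to avoid rate-limiting errors.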
Review thread on the `DEFAULT_PROMPT` template:

> Is there a newline after this when we fill in [DOCUMENTS] or should we add one? It needs to have the exact same structure as the examples above, otherwise it might hurt performance.

> It seems you got this from the Cohere one and I'm not actually sure whether there is an additional newline... perhaps I should also add one there.

> Yes, I wasn't sure what kind of "default" prompt you would like to use, so I just copied the one from another representation. I ran that default prompt with an example, and you can see the formatted prompt below. It seems like the default separator is to use two newlines between each document (which I guess is better for long documents). I can change this to be a single newline and remove the "-" from the examples so that the behaviour is the same everywhere. I think I can also make it so that the documents start with the "-" (in that case the code will have to be a bit more complex to allow for a custom document formatter).

> Let's go for your first suggestion. That would minimize the amount of additional code necessary whilst still maintaining consistency in the prompting format.

> I modified the code and the examples slightly to make the spacing consistent and added the missing commas between keywords. Now the formatted default prompt looks like this. There are a lot of ways to create a prompt for representation generation, and as I've mentioned here, I've just taken an existing one from BERTopic and adapted it slightly. If it works for you I propose to leave it as-is, but I can always change it :)
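For reference, a small sketch of how the spacing discussed in this thread could be adjusted when building the basic chain. The `document_separator` and `document_prompt` parameters are standard options of LangChain's `create_stuff_documents_chain`, not something introduced by this PR; `chat_model` and the import path for `DEFAULT_PROMPT` are assumptions.

```python
from bertopic.representation import LangChain
from bertopic.representation._langchain import DEFAULT_PROMPT  # assumed module path
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate

chat_model = ...  # any LangChain chat model, e.g. ChatOpenAI(...)

# Convert the BERTopic placeholders to LangChain template variables,
# mirroring what LangChain.__init__ does in the diff above
template = DEFAULT_PROMPT.replace("[DOCUMENTS]", "{DOCUMENTS}").replace("[KEYWORDS]", "{KEYWORDS}")

chain = create_stuff_documents_chain(
    chat_model,
    ChatPromptTemplate.from_template(template),
    document_variable_name="DOCUMENTS",
    document_separator="\n",  # single newline between documents instead of the default "\n\n"
    document_prompt=PromptTemplate.from_template("- {page_content}"),  # prefix each document with "-"
)

representation_model = LangChain(chain=chain)
```

Using `document_prompt` this way corresponds to the "custom document formatter" option the author mentions; the PR itself went with the simpler first suggestion (single newline, no "-" prefix).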