The purpose of this example is to illustrate how a user could build a Retrieval Augmented Generation (RAG) pipeline that integrates informational feeds and an LLM service into a Morpheus pipeline. It builds on the previous completion pipeline example by adding the ability to augment LLM queries with context retrieved from a knowledge base. Appending this context improves the LLM's responses by providing additional background and factual information the model can draw on.
- In order for this pipeline to function correctly, a Vector Database must already have been populated with information that can be retrieved.
- An example of populating a database is illustrated in VDB upload
- This example assumes that the VDB upload pipeline has already been run to completion.
- This pipeline can support any type of embedding model that can convert text into a vector of floats.
- For this example, we will use all-MiniLM-L6-v2. It is small, accurate, and included in the Morpheus repo via LFS; it is also the default model used in the VDB upload pipeline.
- Any vector database can be used to store the resulting embedding and corresponding metadata.
- It would be trivial to update the example to use Chroma or FAISS if needed.
- For this example, we will be using Milvus since it is the default VDB used in the VDB upload pipeline.
In order to cater to the unique requirements of the Retrieval Augmented Generation (RAG) mechanism, the following steps were incorporated:
- Embedding Retrieval: Before the LLM generates a completion, relevant context is retrieved from the Vector Database by embedding the query and finding the stored entries whose embeddings are closest to it (a short sketch of these three steps follows this list).
- Context Augmentation: The retrieved context is then appended to the user's query, enriching it with the necessary background to assist the LLM in generating a more informed completion.
- LLM Query Execution: The augmented query is then sent to the LLM, which generates a response based on the combination of the original query and the appended context.
- Choice of Embedding Model: all-MiniLM-L6-v2 was chosen due to its compactness and accuracy. This makes it ideal for real-time operations and ensures that the embeddings are of high quality.
- Using Milvus as VDB: Milvus offers scalable and efficient vector search capabilities, making it a natural choice for embedding retrieval in real-time.
- Flexible LLM integration: The LLM is integrated into the pipeline as a standalone component, which allows for easy swapping of models and ensures that the pipeline can be easily extended to support multiple LLMs.
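The three retrieval/augmentation/execution steps above can be sketched in a few lines of plain Python, independent of Morpheus. The `embed`, `vector_db`, and `llm` helpers below are hypothetical stand-ins for the embedding model, the Milvus client, and the LLM service:

```python
# Illustrative sketch of the RAG request flow; the helpers are hypothetical, not the Morpheus API.

def rag_answer(question: str, embed, vector_db, llm, top_k: int = 4) -> str:
    # 1. Embedding retrieval: embed the question and look up similar chunks.
    query_vector = embed(question)                      # e.g. all-MiniLM-L6-v2 -> 384-dim float vector
    hits = vector_db.search(query_vector, limit=top_k)  # nearest-neighbor search in the Vector DB

    # 2. Context augmentation: prepend the retrieved text to the question.
    context = "\n".join(hit.text for hit in hits)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. LLM query execution: send the augmented prompt to the model.
    return llm.generate(prompt)
```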
The standalone Morpheus pipeline is built using the following components:
- An InMemorySourceStage to hold the LLM queries in a DataFrame.
- We supply a fixed set of questions in a source_df which are then processed by the LLMEngineStage.
- A DeserializeStage to convert the MessageMeta objects into ControlMessages needed by the LLMEngine.
- New functionality was added to the DeserializeStage to support ControlMessages and add a default task to each message.
- A LLMEngineStage then wraps the core LLMEngine functionality.
- An ExtracterNode pulls the questions out of the DataFrame.
- A RAGNode performs the retrieval, adds the context to the query using the supplied template, and executes the LLM.
- Finally, the responses are put back into the ControlMessage using a SimpleTaskHandler.
- The pipeline concludes with an InMemorySink stage to store the results.
Note: For this to function correctly, the VDB upload pipeline must have been run previously.
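To make the list above more concrete, the following heavily simplified sketch shows how such a pipeline could be assembled. It is not the example's actual code: the import paths and constructor arguments (in particular the DeserializeStage task payload and the omitted RAGNode configuration) are assumptions that may differ between Morpheus versions, so treat it as an outline and refer to the source under examples/llm/rag for the working implementation.

```python
# Simplified outline of the standalone RAG pipeline. Stage and node names follow the
# description above, but the exact signatures are assumptions, not the example code.
import cudf

from morpheus.config import Config
from morpheus.llm import LLMEngine
from morpheus.llm.nodes.extracter_node import ExtracterNode
from morpheus.llm.task_handlers.simple_task_handler import SimpleTaskHandler
from morpheus.pipeline.linear_pipeline import LinearPipeline
from morpheus.stages.input.in_memory_source_stage import InMemorySourceStage
from morpheus.stages.llm.llm_engine_stage import LLMEngineStage
from morpheus.stages.output.in_memory_sink_stage import InMemorySinkStage
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage

config = Config()

# A fixed set of questions held in a DataFrame by the in-memory source.
source_df = cudf.DataFrame({"questions": ["What is Morpheus?", "What is RAG?"]})

# The LLMEngine wires together the nodes described above. The RAGNode (which needs a
# prompt template, a vector-DB resource, and an LLM client) is omitted for brevity.
engine = LLMEngine()
engine.add_node("extracter", node=ExtracterNode())
engine.add_task_handler(inputs=["/extracter"], handler=SimpleTaskHandler())

pipe = LinearPipeline(config)
pipe.set_source(InMemorySourceStage(config, dataframes=[source_df]))

# Convert MessageMeta into ControlMessages and attach a default LLM task
# (the exact task payload shown here is an assumption).
pipe.add_stage(DeserializeStage(config,
                                task_type="llm_engine",
                                task_payload={"task_type": "rag", "task_dict": {"input_keys": ["questions"]}}))

pipe.add_stage(LLMEngineStage(config, engine=engine))
pipe.add_stage(InMemorySinkStage(config))

pipe.run()
```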
The provided diagram illustrates the structural composition of the Morpheus data processing pipeline. This pipeline is designed with the intent to handle various data streams in support of Retrieval Augmented Generation.
Note: The current persistent pipeline implementation differs from the above diagram in the following ways:
- The sources for the upload and retrieval tasks are both KafkaSourceStage instances, making it easy for the user to control when messages are processed by the example pipeline.
- There is a SplitStage added after the embedding portion of the pipeline which determines which sink to send each message to.
- The final output of the retrieval task is sent to another Kafka topic, retrieve_output.
The pipeline has multiple data input avenues:
- User Uploaded Documents: Raw documents provided by users for further processing.
- Streaming Event Logs: Logs that are streamed in real-time.
- Streaming Data Feeds: Continuous streams of data that could be dynamic in nature.
- RSS Threat Intel Feeds: RSS-based feeds that might focus on threat intelligence.
- LLM Input Query: Queries that are inputted for processing by the Large Language Model.
The ingested data traverses several processing stages:
- Sources Integration: Data from different origins such as REST Server, Kafka, and RSS Scraper are directed into the pipeline.
- Batching: Data items are grouped together for more efficient bulk processing.
- Data Transformation (illustrated by the sketch after this list):
  - Chunking: Data might be broken into smaller chunks if too large.
  - Tokenization: Textual data is converted into tokens suitable for model processing.
  - Embedding: Each chunk is converted into its vector representation.
  - Mean Pooling: Token embeddings are averaged to yield a single vector per chunk.
- Inference: Models may be used to extract patterns or make predictions from the data.
- Storage and Retrieval: Vector representations are stored in the Vector DB and can be fetched upon request. Retrieval might employ GPU-based IVF indexes (such as RAFT or FAISS).
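As a concrete illustration of the tokenization, embedding, and mean-pooling steps, the sketch below produces an all-MiniLM-L6-v2 style sentence embedding using the Hugging Face transformers and torch packages. It is independent of the Morpheus stages and is only meant to show what happens to a chunk of text before it reaches the Vector DB:

```python
# Illustrative only: tokenization, embedding, and mean pooling outside of Morpheus.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed(text: str) -> torch.Tensor:
    # Tokenization: text -> input ids / attention mask.
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state  # shape: (1, seq_len, 384)

    # Mean pooling: average the token embeddings, ignoring padding positions.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

vector = embed("What is Retrieval Augmented Generation?")  # shape: (1, 384)
```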
The pipeline is supported by a set of backend components:
- Knowledge Graph DB & Service: This serves as a potential repository and query mechanism for stored knowledge.
- Vector DB & Service: Handles the storage and querying of vector data.
- Triton Inference Server: An inference component that interfaces with the LLM Service.
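For reference, a similarity search against the Milvus service could look roughly like the following pymilvus snippet. The collection name and field names here are assumptions and need to match whatever the VDB upload pipeline actually created:

```python
# Hypothetical retrieval against Milvus; the collection and field names are assumptions.
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")

collection = Collection("RSS")  # assumed name of the collection written by the VDB upload pipeline
collection.load()

query_vector = [0.0] * 384  # stand-in: use the mean-pooled embedding from the previous sketch

results = collection.search(
    data=[query_vector],
    anns_field="embedding",               # assumed vector field name
    param={"metric_type": "L2", "params": {"nprobe": 8}},
    limit=4,
    output_fields=["page_content"],       # assumed text field name
)
for hit in results[0]:
    print(hit.distance, hit.entity.get("page_content"))
```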
To retrieve models from LFS run the following:
./scripts/fetch_data.py fetch models
Before running the pipeline, we need to obtain service API keys for the following services:
- Follow the instructions here to obtain an NGC API key.
- We'll refer to your NGC API key as ${NGC_API_KEY} for the rest of this document.
- Follow the instructions here to obtain an OpenAI API key.
- We'll refer to your OpenAI API key as ${OPENAI_API_KEY} for the rest of this document.
Before running the pipeline, we need to ensure that the following services are running:
- Follow the instructions here to install and run a Milvus service.
- Pull the Docker image for Triton:
  docker pull nvcr.io/nvidia/tritonserver:23.06-py3
- From the Morpheus repo root directory, run the following to launch Triton and load the all-MiniLM-L6-v2 model:
  docker run --rm -ti --gpus=all -p8000:8000 -p8001:8001 -p8002:8002 -v $PWD/models:/models nvcr.io/nvidia/tritonserver:23.06-py3 tritonserver --model-repository=/models/triton-model-repo --exit-on-error=false --model-control-mode=explicit --load-model all-MiniLM-L6-v2
This will launch Triton and only load the all-MiniLM-L6-v2 model. Once Triton has loaded the model, the following will be displayed:
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| all-MiniLM-L6-v2 | 1       | READY  |
+------------------+---------+--------+
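If you prefer to verify the model status programmatically rather than reading the Triton log output, a quick check with the tritonclient package (an additional dependency, not required by the example itself) looks like this:

```python
# Quick programmatic check that Triton is up and has loaded the embedding model.
import tritonclient.http as tritonhttp

client = tritonhttp.InferenceServerClient(url="localhost:8000")
assert client.is_server_ready(), "Triton is not ready yet"
assert client.is_model_ready("all-MiniLM-L6-v2"), "all-MiniLM-L6-v2 is not loaded"
print("Triton is serving all-MiniLM-L6-v2")
```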
The top level entrypoint to each of the LLM example pipelines is examples/llm/main.py. This script accepts a set of Options and a Pipeline to run. Baseline options are below, and for the purposes of this document we'll assume a pipeline option of rag:
TODO: Add model specification syntax
Using NGC Nemo LLMs
export NGC_API_KEY=[YOUR_KEY_HERE]
python examples/llm/main.py rag pipeline
Using OpenAI LLM models
export OPENAI_API_KEY=[YOUR_KEY_HERE]
python examples/llm/main.py rag pipeline --llm_service=OpenAI --model_name=gpt-3.5-turbo
TODO
Using NGC Nemo LLMs
export NGC_API_KEY=[YOUR_KEY_HERE]
python examples/llm/main.py rag persistent
Using OpenAI LLM models
export OPENAI_API_KEY=[YOUR_KEY_HERE]
python examples/llm/main.py rag persistent
- --log_level [CRITICAL|FATAL|ERROR|WARN|WARNING|INFO|DEBUG]
  - Description: Specifies the logging level.
  - Default: INFO
- --use_cpp BOOLEAN
  - Description: Opt to use C++ node and message types over Python. Recommended only if bugs are encountered.
  - Default: False
- --version
  - Description: Display the script's current version.
- --help
  - Description: Show the help message with options and command details.
- ... other pipelines ...
- rag

The rag command has its own set of options and commands:

- persistent
- pipeline