RAG API Pipeline CLI Reference Documentation

Overview

The rag-api-pipeline CLI runs a RAG (Retrieval-Augmented Generation) API pipeline. It offers commands to execute the different stages of the pipeline, from data extraction to embedding generation.

Installation

This project uses Poetry for dependency management. To install the project and all its dependencies:

  1. Ensure you have Poetry installed. If not, install it by following the official instructions at python-poetry.org
  2. Clone the repository:
    git clone https://github.com/raid-guild/gaianet-rag-api-pipeline
    cd gaianet-rag-api-pipeline
  3. Install dependencies:
    • Using pip (production mode; ideally inside a Python virtual environment, see the sketch after this list):
      pip install -e .
    • Using Poetry (development mode):
      poetry install
  4. Check that the rag-api-pipeline CLI is available:
    • Production mode:
      rag-api-pipeline --help
    • Development mode:
      poetry run rag-api-pipeline --help
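The usage transcripts below show the CLI running inside a Python virtual environment (the (venv) prompt). A minimal sketch for creating one before the pip install step, assuming a standard Python 3 toolchain:

    # Create and activate a virtual environment, then install in production mode
    python -m venv venv
    source venv/bin/activate
    pip install -e .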

Usage

(venv) user$ rag-api-pipeline --help
Usage: rag-api-pipeline [OPTIONS] COMMAND [ARGS]...

  Command-line interface (CLI) for the RAG API pipeline.

Options:
  --help  Show this message and exit.

Commands:
  setup  Setup wizard to config the pipeline settings prior execution
  run    Execute RAG API pipeline tasks.

setup

(venv) user$ rag-api-pipeline setup --help
Usage: rag-api-pipeline setup [OPTIONS]

  Setup wizard to config the pipeline settings prior execution

Options:
  --debug                      enable logging debug level
  --llm-provider [gaia|other]  LLM provider  [default: gaia]
  --help                       Show this message and exit.

Options

  • --debug: Enable logging at debug level. Useful for development purposes.
  • --llm-provider [gaia|other]: Embedding model provider (default: gaia). Check the Other LLM Providers section for details on supported LLM engines. An example invocation follows this list.
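For example, a wizard run with debug logging that targets a non-Gaia provider, using only the flags documented above:

    # Run the setup wizard with debug logging and a non-default LLM provider
    rag-api-pipeline setup --debug --llm-provider other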

run

(venv) user$ rag-api-pipeline run --help
Usage: rag-api-pipeline run [OPTIONS] COMMAND [ARGS]...

  Execute RAG API pipeline tasks.

Options:
  --debug  enable logging debug level
  --help   Show this message and exit.

Commands:
  all              Run the complete RAG API Pipeline.
  from-chunked     Execute the RAG API pipeline from chunked data.
  from-normalized  Execute the RAG API pipeline from normalized data.

Global Options

  • --debug: Enable logging at debug level. Useful for development purposes.

Pipeline Commands

Before executing any of the commands listed below, the tool validates that the pipeline has already been set up (the config/.env file exists) and that the required services (LLM provider + QdrantDB) are up and running. Otherwise, you will receive one of the following error messages (a manual check is sketched after this list):

  • ERROR: config/.env file not found. rag-api-pipeline setup should be run first.
  • ERROR: LLM Provider openai (http://localhost:8080/v1/models) is down.
  • ERROR: QdrantDB (http://localhost:6333) is down.
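If you hit the service errors, you can probe the endpoints yourself. A quick manual check, assuming the default URLs shown in the error messages above:

    # Verify the LLM provider is serving models
    curl http://localhost:8080/v1/models
    # Verify QdrantDB is reachable
    curl http://localhost:6333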

run all

Executes the entire RAG pipeline, including extraction of API endpoint data streams, data normalization, data chunking, vector embedding generation, and database snapshot export.

(venv) user$ rag-api-pipeline run all --help
Usage: rag-api-pipeline run all [OPTIONS] API_MANIFEST_FILE OPENAPI_SPEC_FILE

  Run the complete RAG API Pipeline.

  API_MANIFEST_FILE Path to the API manifest YAML file that defines pipeline
  config settings and API endpoints.

  OPENAPI_SPEC_FILE Path to the OpenAPI YAML specification file.

Options:
  --api-key TEXT                  API Auth key
  --source-manifest-file PATH     Source YAML manifest
  --full-refresh                  Clean up cache and extract API data from
                                  scratch
  --normalized-only               Run pipeline until the normalized data stage
  --chunked-only                  Run pipeline until the chunked data stage
  --help                          Show this message and exit.

Arguments

  • API_MANIFEST_FILE: Pipeline YAML manifest that defines the Pipeline config settings and API endpoints to extract.
  • OPENAPI_SPEC_FILE: OpenAPI YAML spec file.

Options

  • --api-key TEXT: API Auth key. If specified, it overrides the pipeline settings.
  • --source-manifest-file FILE: Airbyte's Source Connector YAML manifest. If specified, the pipeline skips the connector manifest generation stage and uses this file to start the API data extraction.
  • --full-refresh: Clean up the cache and extract API data from scratch.
  • --normalized-only: Run the pipeline only until the normalized data stage.
  • --chunked-only: Run the pipeline only until the chunked data stage.

A combined invocation is sketched after this list.
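A hypothetical invocation combining these options. The manifest paths reuse the Boardroom files from the Examples section; YOUR_API_KEY is a placeholder, and the source manifest path assumes an api_name of boardroom_api following the {baseDir}/{api_name}_source_generated.yaml pattern described under CLI Outputs:

    # Override the configured API key and reuse a previously generated source manifest
    rag-api-pipeline run all config/boardroom_api_pipeline.yaml config/boardroom_openapi.yaml \
      --api-key YOUR_API_KEY \
      --source-manifest-file output/boardroom_api/boardroom_api_source_generated.yaml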

run from-normalized

Executes the RAG data pipeline using an already normalized JSONL dataset.

(venv) user$ rag-api-pipeline run from-normalized --help
Usage: rag-api-pipeline run from-normalized [OPTIONS] API_MANIFEST_FILE

  Execute the RAG API pipeline from normalized data.

  API_MANIFEST_FILE Path to the API manifest YAML file that defines pipeline
  config settings and API endpoints.

Options:
  --normalized-data-file PATH  Normalized data in JSONL format  [required]
  --help                       Show this message and exit.

Arguments

  • API_MANIFEST_FILE: Pipeline YAML manifest that defines the Pipeline config settings.

Options

  • --normalized-data-file FILE (required): Normalized data in JSONL format.

run from-chunked

Executes the RAG data pipeline using an already chunked dataset in JSONL format.

(venv) user$ rag-api-pipeline run from-chunked --help
Usage: rag-api-pipeline run from-chunked [OPTIONS] API_MANIFEST_FILE

  Execute the RAG API pipeline from chunked data.

  API_MANIFEST_FILE Path to the API manifest YAML file that defines pipeline
  config settings and API endpoints.

Options:
  --chunked-data-file PATH  Chunked data in JSONL format  [required]
  --help                    Show this message and exit.

Arguments

  • API_MANIFEST_FILE: Pipeline YAML manifest that defines the Pipeline config settings and API endpoints to extract.

Options

  • --chunked-data-file FILE (required): Chunked data in JSONL format.

Examples

  1. Clean up the cache and run the complete pipeline:
    rag-api-pipeline run all config/boardroom_api_pipeline.yaml config/boardroom_openapi.yaml --full-refresh
  2. Run the pipeline and stop after data normalization:
    rag-api-pipeline run all config/boardroom_api_pipeline.yaml config/boardroom_openapi.yaml --normalized-only
  3. Run the pipeline from normalized data:
    rag-api-pipeline run from-normalized config/boardroom_api_pipeline.yaml --normalized-data-file path/to/normalized_data.jsonl
  4. Run the pipeline from chunked data:
    rag-api-pipeline run from-chunked config/boardroom_api_pipeline.yaml --chunked-data-file path/to/chunked_data.jsonl
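These stages can be chained. For instance, a two-step workflow that first stops at normalization and then resumes from the normalized output; the JSONL path follows the CLI Outputs naming convention, with boardroom_api as an assumed api_name:

    # Step 1: extract and normalize only
    rag-api-pipeline run all config/boardroom_api_pipeline.yaml config/boardroom_openapi.yaml --normalized-only
    # Step 2: resume from the normalized JSONL produced in step 1
    rag-api-pipeline run from-normalized config/boardroom_api_pipeline.yaml \
      --normalized-data-file output/boardroom_api/boardroom_api_normalized.jsonl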

CLI Outputs

Cached API stream data and results produced by any of the rag-api-pipeline run commands are stored in the output/<api_name> folder. The tool creates the following files and folders within this base directory (referred to as {baseDir} below); an illustrative layout follows the list:

  • {baseDir}/{api_name}_source_generated.yaml: generated Airbyte Stream connector manifest.
  • {baseDir}/cache/{api_name}/*: extracted API data is cached into a local DuckDB. Database files are stored in this directory. If the --full-refresh argument is specified on the run all command, the cache will be cleared and API data will be extracted from scratch.
  • {baseDir}/{api_name}_stream_{x}_preprocessed.jsonl: data streams from each API endpoint are preprocessed and stored in JSONL format.
  • {baseDir}/{api_name}_normalized.jsonl: preprocessed data streams from each API endpoint are joined together and stored in JSONL format.
  • {baseDir}/{api_name}_chunked.jsonl: normalized data that goes through the data chunking stage is then stored in JSONL format.
  • {baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot: vector embeddings snapshot file that was exported from Qdrant DB.
  • {baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot.tar.gz: compressed knowledge base that contains the vector embeddings snapshot.
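Putting it together, an illustrative layout for a completed run, assuming an api_name of boardroom_api and two API endpoint streams (names and the snapshot suffix are placeholders following the patterns above):

    output/boardroom_api/
    ├── boardroom_api_source_generated.yaml
    ├── cache/boardroom_api/            # local DuckDB cache files
    ├── boardroom_api_stream_0_preprocessed.jsonl
    ├── boardroom_api_stream_1_preprocessed.jsonl
    ├── boardroom_api_normalized.jsonl
    ├── boardroom_api_chunked.jsonl
    ├── boardroom_api_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot
    └── boardroom_api_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot.tar.gz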