Skip to content

Commit

Permalink
ClickHouse as a vector store (run-llama#10583)
Browse files Browse the repository at this point in the history
  • Loading branch information
gingerwizard authored Feb 15, 2024
1 parent 2ddd720 commit 9cc433a
Show file tree
Hide file tree
Showing 29 changed files with 6,526 additions and 8 deletions.
19 changes: 19 additions & 0 deletions docs/community/integrations/vector_stores.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ as the storage backend for `VectorStoreIndex`.
- Astra DB (`AstraDBVectorStore`). [Quickstart](https://docs.datastax.com/en/astra/home/astra.html).
- Azure AI Search (`AzureAISearchVectorStore`). [Quickstart](https://learn.microsoft.com/en-us/azure/search/search-get-started-vector)
- Chroma (`ChromaVectorStore`) [Installation](https://docs.trychroma.com/getting-started)
- ClickHouse (`ClickHouseVectorStore`) [Installation](https://clickhouse.com/docs/en/install)
- DashVector (`DashVectorStore`). [Installation](https://help.aliyun.com/document_detail/2510230.html).
- DeepLake (`DeepLakeVectorStore`) [Installation](https://docs.deeplake.ai/en/latest/Installation.html)
- DocArray (`DocArrayHnswVectorStore`, `DocArrayInMemoryVectorStore`). [Installation/Python Client](https://github.com/docarray/docarray#installation).
Expand Down Expand Up @@ -171,6 +172,24 @@ vector_store = ChromaVectorStore(
)
```

**ClickHouse**

```python
import clickhouse_connect
from llama_index.vector_stores import ClickHouseVectorStore

# Creating a ClickHouse client
client = clickhouse_connect.get_client(
host="YOUR_CLUSTER_HOST",
port=8123,
username="YOUR_USERNAME",
password="YOUR_CLUSTER_PASSWORD",
)

# construct vector store
vector_store = ClickHouseVectorStore(clickhouse_client=client)
```

**DashVector**

```python
Expand Down
6 changes: 1 addition & 5 deletions docs/examples/data_connectors/MakeDemo.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -126,11 +126,7 @@
"source": [
"# Send response to Make.com webhook\n",
"wrapper = MakeWrapper()\n",
"wrapper.pass_response_to_webhook(\n",
" \"<webhook_url>,\n",
" response,\n",
" query_str\n",
")"
"wrapper.pass_response_to_webhook(\"<webhook_url>\", response, query_str)"
]
}
],
Expand Down
4 changes: 1 addition & 3 deletions docs/examples/ingestion/async_ingestion_pipeline.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -39,9 +39,7 @@
"source": [
"import os\n",
"\n",
"os.environ[\n",
" \"OPENAI_API_KEY\"\n",
"] = \"sk-..."
"os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
]
},
{
Expand Down
303 changes: 303 additions & 0 deletions docs/examples/vector_stores/ClickHouseIndexDemo.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,303 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "e0c2f11f",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/vector_stores/ClickHouseIndexDemo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"id": "307804a3-c02b-4a57-ac0d-172c30ddc851",
"metadata": {},
"source": [
"# ClickHouse Vector Store\n",
"In this notebook we are going to show a quick demo of using the ClickHouseVectorStore."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c12f55a9",
"metadata": {},
"source": [
"If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1edec46",
"metadata": {},
"outputs": [],
"source": [
"!pip install llama-index\n",
"!pip install clickhouse_connect"
]
},
{
"cell_type": "markdown",
"id": "f7010b1d-d1bb-4f08-9309-a328bb4ea396",
"metadata": {},
"source": [
"#### Creating a ClickHouse Client"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d48af8e1",
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"import sys\n",
"\n",
"logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
"logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "50ad978c",
"metadata": {},
"outputs": [],
"source": [
"from os import environ\n",
"import clickhouse_connect\n",
"\n",
"environ[\"OPENAI_API_KEY\"] = \"sk-*\"\n",
"\n",
"# initialize client\n",
"client = clickhouse_connect.get_client(\n",
" host=\"localhost\",\n",
" port=8123,\n",
" username=\"default\",\n",
" password=\"\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8ee4473a-094f-4d0a-a825-e1213db07240",
"metadata": {},
"source": [
"#### Load documents, build and store the VectorStoreIndex with ClickHouseVectorStore\n",
"\n",
"Here we will use a set of Paul Graham essays to provide the text to turn into embeddings, store in a ``ClickHouseVectorStore`` and query to find context for our LLM QnA loop."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a2bcc07",
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
"from llama_index.vector_stores.clickhouse import ClickHouseVectorStore"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68cbd239-880e-41a3-98d8-dbb3fab55431",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document ID: d03ac7db-8dae-4199-bc38-445dec51a534\n",
"Number of Documents: 1\n"
]
}
],
"source": [
"# load documents\n",
"documents = SimpleDirectoryReader(\"../data/paul_graham\").load_data()\n",
"print(\"Document ID:\", documents[0].doc_id)\n",
"print(\"Number of Documents: \", len(documents))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b6afe88c",
"metadata": {},
"source": [
"Download Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d09a78f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2024-02-13 10:08:31-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt\r\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...\r\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\r\n",
"HTTP request sent, awaiting response... 200 OK\r\n",
"Length: 75042 (73K) [text/plain]\r\n",
"Saving to: ‘data/paul_graham/paul_graham_essay.txt’\r\n",
"\r\n",
"data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.003s \r\n",
"\r\n",
"2024-02-13 10:08:31 (23.9 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]\r\n",
"\r\n"
]
}
],
"source": [
"!mkdir -p 'data/paul_graham/'\n",
"!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
]
},
{
"cell_type": "markdown",
"id": "4fe3dc84",
"metadata": {},
"source": [
"You can process your files individually using [SimpleDirectoryReader](/examples/data_connectors/simple_directory_reader.ipynb):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4febd54a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"data/paul_graham/paul_graham_essay.txt\n"
]
}
],
"source": [
"loader = SimpleDirectoryReader(\"./data/paul_graham/\")\n",
"documents = loader.load_data()\n",
"for file in loader.input_files:\n",
" print(file)\n",
" # Here is where you would do any preprocessing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba1558b3",
"metadata": {},
"outputs": [],
"source": [
"# initialize with metadata filter and store indexes\n",
"from llama_index.core import StorageContext\n",
"\n",
"for document in documents:\n",
" document.metadata = {\"user_id\": \"123\", \"favorite_color\": \"blue\"}\n",
"vector_store = ClickHouseVectorStore(clickhouse_client=client)\n",
"storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
"index = VectorStoreIndex.from_documents(\n",
" documents, storage_context=storage_context\n",
")"
]
},
{
"cell_type": "markdown",
"id": "04304299-fc3e-40a0-8600-f50c3292767e",
"metadata": {},
"source": [
"#### Query Index\n",
"\n",
"Now ClickHouse vector store supports filter search and hybrid search\n",
"\n",
"You can learn more about [query_engine](/module_guides/deploying/query_engine/root.md) and [retriever](/module_guides/querying/retriever/root.md)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35369eda",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The author learned several things during their time at Interleaf, including the importance of having\n",
"technology companies run by product people rather than sales people, the drawbacks of having too\n",
"many people edit code, the value of corridor conversations over planned meetings, the challenges of\n",
"dealing with big bureaucratic customers, and the importance of being the \"entry level\" option in a\n",
"market.\n"
]
}
],
"source": [
"import textwrap\n",
"\n",
"from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters\n",
"\n",
"# set Logging to DEBUG for more detailed outputs\n",
"query_engine = index.as_query_engine(\n",
" filters=MetadataFilters(\n",
" filters=[\n",
" ExactMatchFilter(key=\"user_id\", value=\"123\"),\n",
" ]\n",
" ),\n",
" similarity_top_k=2,\n",
" vector_store_query_mode=\"hybrid\",\n",
")\n",
"response = query_engine.query(\"What did the author learn?\")\n",
"print(textwrap.fill(str(response), 100))"
]
},
{
"cell_type": "markdown",
"id": "a732d16f0a29f8ab",
"metadata": {},
"source": [
"#### Clear All Indexes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "552c203fd054d771",
"metadata": {},
"outputs": [],
"source": [
"for document in documents:\n",
" index.delete_ref_doc(document.doc_id)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
4 changes: 4 additions & 0 deletions llama-index-core/llama_index/core/data_structs/struct_type.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,9 @@ class IndexStructType(str, Enum):
MYSCALE ("myscale"): MyScale Vector Store Index.
See :ref:`Ref-Indices-VectorStore`
for more information on the MyScale vector store index.
CLICKHOUSE ("clickhouse"): ClickHouse Vector Store Index.
See :ref:`Ref-Indices-VectorStore`
for more information on the ClickHouse vector store index.
EPSILLA ("epsilla"): Epsilla Vector Store Index.
See :ref:`Ref-Indices-VectorStore`
for more information on the Epsilla vector store index.
Expand Down Expand Up @@ -81,6 +84,7 @@ class IndexStructType(str, Enum):
MILVUS = "milvus"
CHROMA = "chroma"
MYSCALE = "myscale"
CLICKHOUSE = "clickhouse"
VECTOR_STORE = "vector_store"
OPENSEARCH = "opensearch"
DASHVECTOR = "dashvector"
Expand Down
Loading

0 comments on commit 9cc433a

Please sign in to comment.