From 0ece837e98ff60512d26b5c7c8fb4803e056ad3c Mon Sep 17 00:00:00 2001 From: Jack Wotherspoon Date: Wed, 3 Apr 2024 19:13:13 -0400 Subject: [PATCH] docs: add MySQLVectorStore reference notebook (#59) * docs: add MySQLVectorStore reference notebook * chore: update advanced sample * chore: update sample --- README.md | 3 + docs/vector_store.ipynb | 606 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 609 insertions(+) create mode 100644 docs/vector_store.ipynb diff --git a/README.md b/README.md index 45132ec..f2042e3 100644 --- a/README.md +++ b/README.md @@ -55,6 +55,8 @@ vectorstore = MySQLVectorStore( ) ``` +See the full [Vector Store][vectorstore] tutorial. + ## Document Loader Usage Use a [document loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/) to load data as LangChain `Document`s. @@ -117,3 +119,4 @@ This is not an officially supported Google product. [loader]: ./docs/document_loader.ipynb [history]: ./docs/chat_message_history.ipynb [langchain]: https://github.com/langchain-ai/langchain +[vectorstore]: https://github.com/googleapis/langchain-google-cloud-sql-mysql-python/tree/main/docs/vector_store.ipynb diff --git a/docs/vector_store.ipynb b/docs/vector_store.ipynb new file mode 100644 index 0000000..ac293fc --- /dev/null +++ b/docs/vector_store.ipynb @@ -0,0 +1,606 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "DspIHgfGLL5u" + }, + "source": [ + "# Google Cloud SQL for MySQL\n", + "\n", + "> [Cloud SQL](https://cloud.google.com/sql) is a fully managed relational database service that offers high performance, seamless integration, and impressive scalability. It offers PostgreSQL, MySQL, and SQL Server database engines. Extend your database application to build AI-powered experiences leveraging Cloud SQL's LangChain integrations.\n", + "\n", + "This notebook goes over how to use `Cloud SQL for MySQL` to store vector embeddings with the `MySQLVectorStore` class.\n", + "\n", + "Learn more about the package on [GitHub](https://github.com/googleapis/langchain-google-cloud-sql-mysql-python/).\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googleapis/langchain-google-cloud-sql-mysql-python/blob/main/docs/vector_store.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4_aOTN3FLL5x" + }, + "source": [ + "## Before you begin\n", + "\n", + "To run this notebook, you will need to do the following:\n", + "\n", + " * [Create a Google Cloud Project](https://developers.google.com/workspace/guides/create-project)\n", + " * [Enable the Cloud SQL Admin API.](https://console.cloud.google.com/flows/enableapi?apiid=sqladmin.googleapis.com)\n", + " * [Create a Cloud SQL instance.](https://cloud.google.com/sql/docs/mysql/connect-instance-auth-proxy#create-instance) (version must be >= 8.0.36)\n", + " * [Create a Cloud SQL database.](https://cloud.google.com/sql/docs/mysql/create-manage-databases)\n", + " * [Add a User to the database.](https://cloud.google.com/sql/docs/mysql/create-manage-users)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IR54BmgvdHT_" + }, + "source": [ + "### 🦜🔗 Library Installation\n", + "Install the integration library, `langchain-google-cloud-sql-mysql`, and the library for the embedding service, `langchain-google-vertexai`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0ZITIDE160OD" + }, + "outputs": [], + "source": [ + "%pip install --upgrade --quiet langchain-google-cloud-sql-mysql langchain-google-vertexai" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v40bB_GMcr9f" + }, + "source": [ + "**Colab only:** Uncomment the following cell to restart the kernel or use the button to restart the kernel. For Vertex AI Workbench you can restart the terminal using the button on top." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "v6jBDnYnNM08" + }, + "outputs": [], + "source": [ + "# # Automatically restart kernel after installs so that your environment can access the new packages\n", + "# import IPython\n", + "\n", + "# app = IPython.Application.instance()\n", + "# app.kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yygMe6rPWxHS" + }, + "source": [ + "### 🔐 Authentication\n", + "Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.\n", + "\n", + "* If you are using Colab to run this notebook, use the cell below and continue.\n", + "* If you are using Vertex AI Workbench, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PTXN1_DSXj2b" + }, + "outputs": [], + "source": [ + "from google.colab import auth\n", + "\n", + "auth.authenticate_user()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NEvB9BoLEulY" + }, + "source": [ + "### ☁ Set Your Google Cloud Project\n", + "Set your Google Cloud project so that you can leverage Google Cloud resources within this notebook.\n", + "\n", + "If you don't know your project ID, try the following:\n", + "\n", + "* Run `gcloud config list`.\n", + "* Run `gcloud projects list`.\n", + "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "gfkS3yVRE4_W" + }, + "outputs": [], + "source": [ + "# @markdown Please fill in the value below with your Google Cloud project ID and then run the cell.\n", + "\n", + "PROJECT_ID = \"my-project-id\" # @param {type:\"string\"}\n", + "\n", + "# Set the project id\n", + "!gcloud config set project {PROJECT_ID}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f8f2830ee9ca1e01" + }, + "source": [ + "## Basic Usage" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OMvzMWRrR6n7" + }, + "source": [ + "### Set Cloud SQL database values\n", + "Find your database values, in the [Cloud SQL Instances page](https://console.cloud.google.com/sql?_ga=2.223735448.2062268965.1707700487-2088871159.1707257687)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "irl7eMFnSPZr" + }, + "outputs": [], + "source": [ + "# @title Set Your Values Here { display-mode: \"form\" }\n", + "REGION = \"us-central1\" # @param {type: \"string\"}\n", + "INSTANCE = \"my-mysql-instance\" # @param {type: \"string\"}\n", + "DATABASE = \"my-database\" # @param {type: \"string\"}\n", + "TABLE_NAME = \"vector_store\" # @param {type: \"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QuQigs4UoFQ2" + }, + "source": [ + "### MySQLEngine Connection Pool\n", + "\n", + "One of the requirements and arguments to establish Cloud SQL as a vector store is a `MySQLEngine` object. The `MySQLEngine` configures a connection pool to your Cloud SQL database, enabling successful connections from your application and following industry best practices.\n", + "\n", + "To create a `MySQLEngine` using `MySQLEngine.from_instance()` you need to provide only 4 things:\n", + "\n", + "1. `project_id` : Project ID of the Google Cloud Project where the Cloud SQL instance is located.\n", + "1. `region` : Region where the Cloud SQL instance is located.\n", + "1. `instance` : The name of the Cloud SQL instance.\n", + "1. `database` : The name of the database to connect to on the Cloud SQL instance.\n", + "\n", + "By default, [IAM database authentication](https://cloud.google.com/sql/docs/mysql/iam-authentication#iam-db-auth) will be used as the method of database authentication. This library uses the IAM principal belonging to the [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/application-default-credentials) sourced from the envionment.\n", + "\n", + "For more informatin on IAM database authentication please see:\n", + "\n", + "* [Configure an instance for IAM database authentication](https://cloud.google.com/sql/docs/mysql/create-edit-iam-instances)\n", + "* [Manage users with IAM database authentication](https://cloud.google.com/sql/docs/mysql/add-manage-iam-users)\n", + "\n", + "Optionally, [built-in database authentication](https://cloud.google.com/sql/docs/mysql/built-in-authentication) using a username and password to access the Cloud SQL database can also be used. Just provide the optional `user` and `password` arguments to `MySQLEngine.from_instance()`:\n", + "\n", + "* `user` : Database user to use for built-in database authentication and login\n", + "* `password` : Database password to use for built-in database authentication and login.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "guVURf1QLL53" + }, + "outputs": [], + "source": [ + "from langchain_google_cloud_sql_mysql import MySQLEngine\n", + "\n", + "engine = MySQLEngine.from_instance(\n", + " project_id=PROJECT_ID, region=REGION, instance=INSTANCE, database=DATABASE\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D9Xs2qhm6X56" + }, + "source": [ + "### Initialize a table\n", + "The `MySQLVectorStore` class requires a database table. The `MySQLEngine` class has a helper method `init_vectorstore_table()` that can be used to create a table with the proper schema for you." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "avlyHEMn6gzU" + }, + "outputs": [], + "source": [ + "engine.init_vectorstore_table(\n", + " table_name=TABLE_NAME,\n", + " vector_size=768, # Vector size for VertexAI model(textembedding-gecko@latest),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aVE5TaJILL53" + }, + "source": [ + "### Create an embedding class instance\n", + "\n", + "You can use any [LangChain embeddings model](https://python.langchain.com/docs/integrations/text_embedding/).\n", + "You may need to enable the Vertex AI API to use `VertexAIEmbeddings`.\n", + "\n", + "We recommend pinning the embedding model's version for production, learn more about the [Text embeddings models](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text-embeddings)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5utKIdq7KYi5" + }, + "outputs": [], + "source": [ + "# enable Vertex AI API\n", + "!gcloud services enable aiplatform.googleapis.com" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Vb2RJocV9_LQ" + }, + "outputs": [], + "source": [ + "from langchain_google_vertexai import VertexAIEmbeddings\n", + "\n", + "embedding = VertexAIEmbeddings(\n", + " model_name=\"textembedding-gecko@latest\", project=PROJECT_ID\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e1tl0aNx7SWy" + }, + "source": [ + "### Initialize a default MySQLVectorStore\n", + "\n", + "To initialize a `MySQLVectorStore` class you need to provide only 3 things:\n", + "\n", + "1. `engine` - An instance of a `MySQLEngine` engine.\n", + "1. `embedding_service` - An instance of a LangChain embedding model.\n", + "1. `table_name` : The name of the table within the Cloud SQL database to use as the vector store." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "z-AZyzAQ7bsf" + }, + "outputs": [], + "source": [ + "from langchain_google_cloud_sql_mysql import MySQLVectorStore\n", + "\n", + "store = MySQLVectorStore(\n", + " engine=engine,\n", + " embedding_service=embedding,\n", + " table_name=TABLE_NAME,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "456LVGttLL54" + }, + "source": [ + "### Add texts" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nrDvGWIOLL54" + }, + "outputs": [], + "source": [ + "import uuid\n", + "\n", + "all_texts = [\"Apples and oranges\", \"Cars and airplanes\", \"Pineapple\", \"Train\", \"Banana\"]\n", + "metadatas = [{\"len\": len(t)} for t in all_texts]\n", + "ids = [str(uuid.uuid4()) for _ in all_texts]\n", + "\n", + "store.add_texts(all_texts, metadatas=metadatas, ids=ids)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iCESj2VkLL54" + }, + "source": [ + "### Delete texts\n", + "\n", + "Delete vectors from the vector store by ID." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ukz6j6hiLL54" + }, + "outputs": [], + "source": [ + "store.delete([ids[1]])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MOVpfY1lLL54" + }, + "source": [ + "### Search for documents" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "fpqeZgUeLL54", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "f674a3af-452c-4d58-bb62-cbf514a9e1e3" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Pineapple\n" + ] + } + ], + "source": [ + "query = \"I'd like a fruit.\"\n", + "docs = store.similarity_search(query)\n", + "print(docs[0].page_content)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P9tG2a2gLL54" + }, + "source": [ + "### Search for documents by vector\n", + "\n", + "It is also possible to do a search for documents similar to a given embedding vector using `similarity_search_by_vector` which accepts an embedding vector as a parameter instead of a string." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "N-NC5jgGLL55", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "69a1f9de-a830-450d-8a5e-118b36815a46" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "[Document(page_content='Pineapple', metadata={'len': 9}), Document(page_content='Banana', metadata={'len': 6})]\n" + ] + } + ], + "source": [ + "query_vector = embedding.embed_query(query)\n", + "docs = store.similarity_search_by_vector(query_vector, k=2)\n", + "print(docs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KefzU1OCLL55" + }, + "source": [ + "### Add an index\n", + "Speed up vector search queries by applying a vector index. Learn more about [MySQL vector indexes](https://github.com/googleapis/langchain-google-cloud-sql-mysql-python/blob/main/src/langchain_google_cloud_sql_mysql/indexes.py)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "r6QOYfh_LL55" + }, + "outputs": [], + "source": [ + "from langchain_google_cloud_sql_mysql import VectorIndex\n", + "\n", + "store.apply_vector_index(VectorIndex())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3Y-GZ4CvLL55" + }, + "source": [ + "### Remove an index" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EhloKJoPLL55" + }, + "outputs": [], + "source": [ + "store.drop_vector_index()" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Advanced Usage" + ], + "metadata": { + "id": "K8XAZZTDqwIp" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A6h77GunLL55" + }, + "source": [ + "### Create a MySQLVectorStore with custom metadata\n", + "\n", + "A vector store can take advantage of relational data to filter similarity searches.\n", + "\n", + "Create a table and `MySQLVectorStore` instance with custom metadata columns." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eANG7_8qLL55" + }, + "outputs": [], + "source": [ + "from langchain_google_cloud_sql_mysql import Column\n", + "\n", + "# set table name\n", + "CUSTOM_TABLE_NAME = \"vector_store_custom\"\n", + "\n", + "engine.init_vectorstore_table(\n", + " table_name=CUSTOM_TABLE_NAME,\n", + " vector_size=768, # VertexAI model: textembedding-gecko@latest\n", + " metadata_columns=[Column(\"len\", \"INTEGER\")],\n", + ")\n", + "\n", + "\n", + "# initialize MySQLVectorStore with custom metadata columns\n", + "custom_store = MySQLVectorStore(\n", + " engine=engine,\n", + " embedding_service=embedding,\n", + " table_name=CUSTOM_TABLE_NAME,\n", + " metadata_columns=[\"len\"],\n", + " # connect to an existing VectorStore by customizing the table schema:\n", + " # id_column=\"uuid\",\n", + " # content_column=\"documents\",\n", + " # embedding_column=\"vectors\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bj2d-c2sLL5-" + }, + "source": [ + "### Search for documents with metadata filter\n", + "\n", + "It can be helpful to narrow down the documents before working with them.\n", + "\n", + "For example, documents can be filtered on metadata using the `filter` argument." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Sqfgk6EOLL5-", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "a10c74e2-fe48-4cf9-ba2f-85aecb2490d0" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "[Document(page_content='Pineapple', metadata={'len': 9}), Document(page_content='Banana', metadata={'len': 6}), Document(page_content='Apples and oranges', metadata={'len': 18}), Document(page_content='Cars and airplanes', metadata={'len': 18})]\n" + ] + } + ], + "source": [ + "import uuid\n", + "\n", + "# add texts to the vector store\n", + "all_texts = [\"Apples and oranges\", \"Cars and airplanes\", \"Pineapple\", \"Train\", \"Banana\"]\n", + "metadatas = [{\"len\": len(t)} for t in all_texts]\n", + "ids = [str(uuid.uuid4()) for _ in all_texts]\n", + "custom_store.add_texts(all_texts, metadatas=metadatas, ids=ids)\n", + "\n", + "# use filter on search\n", + "query_vector = embedding.embed_query(\"I'd like a fruit.\")\n", + "docs = custom_store.similarity_search_by_vector(query_vector, filter=\"len >= 6\")\n", + "\n", + "print(docs)" + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}