diff --git a/.github/workflows/notebooks_to_exclude.txt b/.github/workflows/notebooks_to_exclude.txt index accdab0ab..c98398875 100644 --- a/.github/workflows/notebooks_to_exclude.txt +++ b/.github/workflows/notebooks_to_exclude.txt @@ -30,6 +30,7 @@ # Some notebooks are deliberate scraps of code # and not designed to run end-to-end ./R-client.ipynb +./R-weather.ipynb # Ignore any temporary notebook caches etc .ipynb_checkpoints diff --git a/00_notebooks/00_index.ipynb b/00_notebooks/00_index.ipynb index 59b8e6cbd..447c717ce 100644 --- a/00_notebooks/00_index.ipynb +++ b/00_notebooks/00_index.ipynb @@ -46,33 +46,10 @@ " * [What happens if you use Iceberg without the lakeFS support](./iceberg-lakefs-default.ipynb)\n", "* **Using R with lakeFS**\n", " * [Basic usage](./R.ipynb)\n", - " * [Weather data](./R-weather.ipynb)\n", " * [NYC Film permits example](./R-nyc.ipynb)\n", " * [Rough notes around R client](./R-client.ipynb)" ] }, - { - "cell_type": "markdown", - "id": "93ba5f85-ec96-4dc6-bc5c-f9313a658112", - "metadata": {}, - "source": [ - "### Write-Audit-Publish pattern\n", - "\n", - "See https://lakefs.io/blog/data-engineering-patterns-write-audit-publish/" - ] - }, - { - "cell_type": "markdown", - "id": "305a0733-ee99-4d4f-8dcf-94b9e9dd9fd5", - "metadata": {}, - "source": [ - "* [**Write-Audit-Publish / lakeFS**](write-audit-publish/wap-lakefs.ipynb)
_With support for heterogeneous data formats and cross-collection consistency_\n", - "* [**Write-Audit-Publish / Apache Iceberg**](write-audit-publish/wap-iceberg.ipynb)\n", - "* [**Write-Audit-Publish / Project Nessie**](write-audit-publish/wap-nessie.ipynb)\n", - "* [**Write-Audit-Publish / Delta Lake**](write-audit-publish/wap-delta.ipynb)\n", - "* [**Write-Audit-Publish / Apache Hudi**](write-audit-publish/wap-hudi.ipynb)\n" - ] - }, { "cell_type": "markdown", "id": "c0de105e-2071-438c-b9fa-5ffe664d47ac", @@ -102,6 +79,7 @@ "* [AWS **Glue and Trino**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/aws-glue-trino/)\n", "* [lakeFS + **Dagster**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/dagster-integration/)\n", "* [lakeFS + **Prefect**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/prefect-integration/)\n", + "* [Reproducibility and Data Version Control for **LangChain** and **LLM/OpenAI** Models](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/llm-openai-langchain-integration/)
_See also the [accompanying blog](https://lakefs.io/blog/lakefs-langchain-loader/)_\n", "* [Image Segmentation Demo: *ML Data Version Control and Reproducibility at Scale*](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/image-segmentation/)\n", "* [*Labelbox* integration](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/labelbox-integration/)\n", "* [*Kafka* integration](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/kafka/)\n", @@ -109,6 +87,28 @@ "* [How to **migrate or clone** a repo](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/migrate-or-clone-repo/)" ] }, + { + "cell_type": "markdown", + "id": "93ba5f85-ec96-4dc6-bc5c-f9313a658112", + "metadata": {}, + "source": [ + "### Write-Audit-Publish pattern\n", + "\n", + "See https://lakefs.io/blog/data-engineering-patterns-write-audit-publish/" + ] + }, + { + "cell_type": "markdown", + "id": "305a0733-ee99-4d4f-8dcf-94b9e9dd9fd5", + "metadata": {}, + "source": [ + "* [**Write-Audit-Publish / lakeFS**](write-audit-publish/wap-lakefs.ipynb)
_With support for heterogeneous data formats and cross-collection consistency_\n",
+    "* [**Write-Audit-Publish / Apache Iceberg**](write-audit-publish/wap-iceberg.ipynb)\n",
+    "* [**Write-Audit-Publish / Project Nessie**](write-audit-publish/wap-nessie.ipynb)\n",
+    "* [**Write-Audit-Publish / Delta Lake**](write-audit-publish/wap-delta.ipynb)\n",
+    "* [**Write-Audit-Publish / Apache Hudi**](write-audit-publish/wap-hudi.ipynb)\n"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "d1cb588f",
diff --git a/01_standalone_examples/llm-openai-langchain-integration/LLM OpenAI LangChain Demo.ipynb b/01_standalone_examples/llm-openai-langchain-integration/LLM OpenAI LangChain Demo.ipynb
new file mode 100644
index 000000000..860e4f158
--- /dev/null
+++ b/01_standalone_examples/llm-openai-langchain-integration/LLM OpenAI LangChain Demo.ipynb
@@ -0,0 +1,765 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "38df8ee9-01c6-4cae-9f37-2eb45250993e",
+   "metadata": {},
+   "source": [
+    "<img src=\"images/logo.svg\" alt=\"lakeFS logo\"/> &emsp; <img src=\"images/langchain.jpeg\" alt=\"LangChain logo\"/> &emsp; <img src=\"images/openai-lockup-black.svg\" alt=\"OpenAI logo\"/>\n",
+    "\n",
+    "# Integration of lakeFS with LangChain and OpenAI\n",
+    "\n",
+    "Use Case: Reproducibility and Data Version Control for LangChain and LLM/OpenAI Models\n",
+    "\n",
+    "See also the [accompanying blog](https://lakefs.io/blog/lakefs-langchain-loader/)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4a9c9095-97fc-4453-bc8a-7afe9fe456e6",
+   "metadata": {},
+   "source": [
+    "## Config"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "863c5d7a-91a4-44f6-bae1-745f6190983f",
+   "metadata": {},
+   "source": [
+    "### lakeFS endpoint and credentials\n",
+    "\n",
+    "Change these if you are using a lakeFS server other than the one provided with this samples repo."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5ec9395c-ee3a-422d-b935-64807b070943",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io'\n",
+    "lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'\n",
+    "lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "51714413-29f4-4cc1-b821-7c693f266930",
+   "metadata": {},
+   "source": [
+    "### Storage Information\n",
+    "\n",
+    "If you're not using the lakeFS server provided with this samples repo, change the storage namespace to a location in the bucket you’ve configured. The storage namespace is a location in the underlying storage where data for this repository will be stored."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cd852ddc-3a06-4eb6-9236-c3b77164361f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "storageNamespace = 's3://example' # e.g. \"s3://bucket\""
\"s3://bucket\"" + ] + }, + { + "cell_type": "markdown", + "id": "67a3e781-e6e6-47dd-a483-71a62a2342da", + "metadata": {}, + "source": [ + "### OpenAI API Key\n", + "##### If you do not have an API key then create a free OpenAI account and API key here: https://platform.openai.com/api-keys" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7b2a6e05-ea60-49b3-861b-5a0acade748e", + "metadata": {}, + "outputs": [], + "source": [ + "openai_api_key = \"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\"" + ] + }, + { + "cell_type": "markdown", + "id": "b154b685-ef4f-4116-990d-6dff21db5080", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "**(you shouldn't need to change anything in this section, just run it)**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "52f4fc64-1070-4b07-9124-1bafffa72588", + "metadata": {}, + "outputs": [], + "source": [ + "repo_name = \"llm-openai-langchain-repo\"" + ] + }, + { + "cell_type": "markdown", + "id": "90917396-2738-40fa-9a61-e0ebd6a264a7", + "metadata": {}, + "source": [ + "### Versioning Information " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "63ff2b15-6a7a-4cfc-96a1-f1c73c08f289", + "metadata": {}, + "outputs": [], + "source": [ + "mainBranch = \"main\"\n", + "version1Branch = \"version1\"\n", + "version2Branch = \"version2\"\n", + "documentName = \"lakeFS Brochure.pdf\"\n", + "responsesTable = \"responses\"" + ] + }, + { + "cell_type": "markdown", + "id": "cbed0b49-a2b9-429b-86fb-5f02ed82c689", + "metadata": {}, + "source": [ + "### Create a function to load documents from lakeFS repository by using an [official lakeFS document loader for LangChain](https://python.langchain.com/docs/integrations/document_loaders/lakefs)\n", + "##### Split documents into smaller chunks, convert documents into OpenAI embeddings and store them in an in-memory vector database (Meta’s [FAISS](https://ai.meta.com/tools/faiss/))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "992238a8-aa3e-4268-9b61-fbb55e0487c3", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.embeddings import OpenAIEmbeddings\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from langchain.vectorstores.faiss import FAISS\n", + "from langchain.document_loaders import LakeFSLoader\n", + "\n", + "def load_document(repo: str, ref: str, path: str) -> FAISS:\n", + " lakefs_loader = LakeFSLoader(\n", + " lakefs_access_key=lakefsAccessKey,\n", + " lakefs_secret_key=lakefsSecretKey,\n", + " lakefs_endpoint=lakefsEndPoint\n", + " )\n", + " lakefs_loader.set_repo(repo)\n", + " lakefs_loader.set_ref(ref)\n", + " lakefs_loader.set_path(path)\n", + " docs = lakefs_loader.load()\n", + " splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n", + " docs = splitter.split_documents(docs)\n", + " return FAISS.from_documents(docs, embedding=OpenAIEmbeddings(openai_api_key=openai_api_key))" + ] + }, + { + "cell_type": "markdown", + "id": "e37dada4-4070-4f04-8081-abe41fb8bddd", + "metadata": {}, + "source": [ + "### Create a function to query this data using OpenAI\n", + "#### Set up a model and a prompt, into which you will feed documents that are related to the user’s question" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "da5ca5db-ca84-4477-873a-97b3c50d1c72", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.llms import OpenAI\n", + "from langchain.chains import LLMChain\n", + "from langchain.prompts import 
PromptTemplate\n", + "from langchain.vectorstores.faiss import FAISS\n", + "\n", + "def query_document(db: FAISS, document_name: str, query: str) -> str:\n", + " related_docs = db.similarity_search(query, k=4) # we want 4 similar vectors\n", + " docs_content = ' '.join([d.page_content for d in related_docs])\n", + " llm = OpenAI(model='text-davinci-003', temperature=0, openai_api_key=openai_api_key)\n", + " prompt = PromptTemplate(\n", + " input_variables=['question', 'docs', 'document_name'],\n", + " template=\"\"\"\n", + " You are a helpful document assistant that can answer questions about a document based on the text it contains.\n", + " \n", + " The name of the document is: {document_name}\n", + " Answer the following question: {question}\n", + " By searching the following document: {docs}\n", + " \n", + " Only use factual information from the document to answer the question.\n", + " \n", + " If you feel like you don't have enough information to answer the question, say \"I don't know\".\n", + " \n", + " Your answers should be detailed.\n", + " \"\"\"\n", + " )\n", + "\n", + " chain = LLMChain(llm=llm, prompt=prompt)\n", + " return chain.run(question=query, docs=docs_content, document_name=document_name)" + ] + }, + { + "cell_type": "markdown", + "id": "3bff25a6-2864-40d6-b94a-67fdb77193a5", + "metadata": {}, + "source": [ + "### lakeFS S3 gateway config for the Delta table" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0a243b96-6692-4931-8683-14e23348e134", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import deltalake\n", + "\n", + "storage_options = {\"AWS_ACCESS_KEY_ID\": lakefsAccessKey, \n", + " \"AWS_SECRET_ACCESS_KEY\":lakefsSecretKey,\n", + " \"AWS_ENDPOINT\": lakefsEndPoint,\n", + " \"AWS_REGION\": \"us-east-1\",\n", + " \"AWS_ALLOW_HTTP\": \"true\",\n", + " \"AWS_S3_ALLOW_UNSAFE_RENAME\": \"true\"\n", + " }" + ] + }, + { + "cell_type": "markdown", + "id": "9c45a90f-9446-470b-9b07-c8fce03c16ab", + "metadata": {}, + "source": [ + "### Create lakeFSClient" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c26d3385-94fc-438f-86e4-ac98a7924ee7", + "metadata": {}, + "outputs": [], + "source": [ + "import lakefs_sdk\n", + "from lakefs_sdk.client import LakeFSClient\n", + "from lakefs_sdk.models import *\n", + "\n", + "configuration = lakefs_sdk.Configuration(\n", + " host=lakefsEndPoint,\n", + " username=lakefsAccessKey,\n", + " password=lakefsSecretKey,\n", + ")\n", + "lakefs = LakeFSClient(configuration)" + ] + }, + { + "cell_type": "markdown", + "id": "7ec78263-be53-4387-80c3-b1a347c02191", + "metadata": {}, + "source": [ + "### Define lakeFS Repository" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "925988e4-475f-46de-9e5a-e1689dfa9353", + "metadata": {}, + "outputs": [], + "source": [ + "from lakefs_sdk.exceptions import NotFoundException\n", + "\n", + "try:\n", + " repo=lakefs.repositories_api.get_repository(repo_name)\n", + " print(f\"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}\")\n", + "except NotFoundException as f:\n", + " print(f\"Repository {repo_name} does not exist, so going to try and create it now.\")\n", + " try:\n", + " repo=lakefs.repositories_api.create_repository(repository_creation=RepositoryCreation(name=repo_name,\n", + " storage_namespace=f\"{storageNamespace}/{repo_name}\",\n", + " default_branch=mainBranch))\n", + " print(f\"Created new repo {repo.id} using storage namespace {repo.storage_namespace}\")\n", + " 
except lakefs_sdk.ApiException as e:\n", + " print(f\"Error creating repo {repo_name}. Error is {e}\")\n", + " os._exit(00)\n", + "except lakefs_sdk.ApiException as e:\n", + " print(f\"Error getting repo {repo_name}: {e}\")\n", + " os._exit(00)" + ] + }, + { + "cell_type": "markdown", + "id": "8c035f52-338e-4179-a317-b55d9bbb45ee", + "metadata": {}, + "source": [ + "# Main demo starts here 🚦 👇🏻" + ] + }, + { + "cell_type": "markdown", + "id": "2e413173-70f3-4725-a3fa-7568d2b36c11", + "metadata": {}, + "source": [ + "### Create version1 branch" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "56bf1a3e-da38-4eba-9325-4b6966a020ef", + "metadata": {}, + "outputs": [], + "source": [ + "lakefs.branches_api.create_branch(\n", + " repository=repo.id,\n", + " branch_creation=BranchCreation(\n", + " name=version1Branch,\n", + " source=mainBranch))" + ] + }, + { + "cell_type": "markdown", + "id": "05419d37-436c-4af1-a97b-e163aedf5998", + "metadata": {}, + "source": [ + "### Upload \"lakeFS Brochure.pdf\" document to version1 branch" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3a788f57-6c43-45dd-a3f4-27a632ed0540", + "metadata": {}, + "outputs": [], + "source": [ + "lakefs.objects_api.upload_object(\n", + " repository=repo.id,\n", + " branch=version1Branch,\n", + " path=documentName, content=f\"/data/{version1Branch}/{documentName}\")" + ] + }, + { + "cell_type": "markdown", + "id": "82d0caf1-cb9f-440c-a5af-303e858ac152", + "metadata": {}, + "source": [ + "### Commit changes and attach some metadata" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eaadfee1-1c2f-4020-bf25-1596bf69f806", + "metadata": {}, + "outputs": [], + "source": [ + "lakefs.commits_api.commit(\n", + " repository=repo.id,\n", + " branch=version1Branch,\n", + " commit_creation=CommitCreation(\n", + " message='Uploaded lakeFS Brochure',\n", + " metadata={'version': 'version1'}))" + ] + }, + { + "cell_type": "markdown", + "id": "0e2ae0e7-ef38-4e2b-ac8c-272dd5667874", + "metadata": {}, + "source": [ + "### Load \"lakeFS Brochure.pdf\" (version 1) document to vector database" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e2ed8be5-06a6-47c1-96c2-6f8d853bf296", + "metadata": {}, + "outputs": [], + "source": [ + "db = load_document(repo_name, version1Branch, documentName)" + ] + }, + { + "cell_type": "markdown", + "id": "d0b9ed98-22ef-4d0c-bcb2-774d50cbc1b1", + "metadata": {}, + "source": [ + "### Let's ask these 2 questions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "edfec532-a7f6-42b5-9fde-bc694c7a43d8", + "metadata": {}, + "outputs": [], + "source": [ + "question1 = 'why lakefs'\n", + "question2 = 'trusted by?'" + ] + }, + { + "cell_type": "markdown", + "id": "7409d98c-978a-4a99-b98b-e53082a57304", + "metadata": {}, + "source": [ + "### Ask 1st question" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc3b0667-015f-417c-b828-b8bd7b063fcf", + "metadata": {}, + "outputs": [], + "source": [ + "question1Response = query_document(db, documentName, question1)\n", + "print(question1Response)" + ] + }, + { + "cell_type": "markdown", + "id": "83477ffc-6fe2-44b8-b860-475bddcf1b7d", + "metadata": {}, + "source": [ + "### Ask 2nd question" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8dbac5c0-ed8d-4ed8-9645-2c494ab6ee1f", + "metadata": {}, + "outputs": [], + "source": [ + "question2Response = query_document(db, documentName, question2)\n", + "print(question2Response)" + ] 
+ }, + { + "cell_type": "markdown", + "id": "5f6d3d13-dae3-490d-879c-4ab2ceb72fc6", + "metadata": {}, + "source": [ + "### Save the responses to a Delta table" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aed98dab-a980-4a99-9254-03d80f9c6c3e", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.DataFrame({'Document Name': [documentName, documentName], 'Version': [version1Branch, version1Branch], 'Question': [question1, question2], 'Answer': [question1Response, question2Response]})\n", + "\n", + "deltalake.write_deltalake(table_or_uri=f\"s3a://{repo.id}/{version1Branch}/{responsesTable}\", \n", + " data = df,\n", + " mode='append',\n", + " storage_options=storage_options)" + ] + }, + { + "cell_type": "markdown", + "id": "76734ee8-f845-49af-a136-7efc90d9e39a", + "metadata": {}, + "source": [ + "### Commit changes and attach some metadata" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "03716729-3172-4266-a807-3d2e23c5afb8", + "metadata": {}, + "outputs": [], + "source": [ + "lakefs.commits_api.commit(\n", + " repository=repo.id,\n", + " branch=version1Branch,\n", + " commit_creation=CommitCreation(\n", + " message='Saved responses for the questions',\n", + " metadata={'version': 'version1'}))" + ] + }, + { + "cell_type": "markdown", + "id": "3555ad69-1faa-4ef8-abf8-5707901bc3ca", + "metadata": {}, + "source": [ + "### Merge version1 branch to main" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc91a22b-bcb6-4164-b472-bb25ebcf6d78", + "metadata": {}, + "outputs": [], + "source": [ + "lakefs.refs_api.merge_into_branch(\n", + " repository=repo.id,\n", + " source_ref=version1Branch, \n", + " destination_branch=mainBranch)" + ] + }, + { + "cell_type": "markdown", + "id": "66023094-6052-4a04-8dea-23a26275bfec", + "metadata": {}, + "source": [ + "### Create version2 branch" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3f841167-2464-4fcf-a7e3-68386f77d4f1", + "metadata": {}, + "outputs": [], + "source": [ + "lakefs.branches_api.create_branch(\n", + " repository=repo.id,\n", + " branch_creation=BranchCreation(\n", + " name=version2Branch,\n", + " source=mainBranch))" + ] + }, + { + "cell_type": "markdown", + "id": "e515f7f5-1328-4c38-92e1-058064c93149", + "metadata": {}, + "source": [ + "### Upload 2nd version of the \"lakeFS Brochure.pdf\" document" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5bc09a14-2487-4b29-9681-8d97570ff7d2", + "metadata": {}, + "outputs": [], + "source": [ + "lakefs.objects_api.upload_object(\n", + " repository=repo.id,\n", + " branch=version2Branch,\n", + " path=documentName, content=f\"/data/{version2Branch}/{documentName}\")" + ] + }, + { + "cell_type": "markdown", + "id": "d0ae6f26-cdba-4099-94dd-9814f25f5dbb", + "metadata": {}, + "source": [ + "### Commit changes and attach some metadata" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d73c920-c247-443f-93bd-2cd7ff23756c", + "metadata": {}, + "outputs": [], + "source": [ + "lakefs.commits_api.commit(\n", + " repository=repo.id,\n", + " branch=version2Branch,\n", + " commit_creation=CommitCreation(\n", + " message='Uploaded lakeFS Brochure',\n", + " metadata={'version': 'version2'}))" + ] + }, + { + "cell_type": "markdown", + "id": "2dd2ab47-9b8c-4704-a761-fa24d55649e0", + "metadata": {}, + "source": [ + "### Load \"lakeFS Brochure.pdf\" (version 2) document to vector database" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"dac931a2-3917-4b45-8ceb-73bf0c379198", + "metadata": {}, + "outputs": [], + "source": [ + "db = load_document(repo_name, version2Branch, documentName)" + ] + }, + { + "cell_type": "markdown", + "id": "b5c0bde7-c10d-4b79-9589-0be6052dd17b", + "metadata": {}, + "source": [ + "### Ask 1st question by using version2 document" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11fd18a9-95da-4a7f-8bda-a957d6494798", + "metadata": {}, + "outputs": [], + "source": [ + "question1Response = query_document(db, documentName, question1)\n", + "print(question1Response)" + ] + }, + { + "cell_type": "markdown", + "id": "baefc417-6640-44b0-ba5e-701f04d9242a", + "metadata": {}, + "source": [ + "### Ask 2nd question by using version2 document" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9c46ab1-f1e2-448f-ba22-f4e345b50864", + "metadata": {}, + "outputs": [], + "source": [ + "question2Response = query_document(db, documentName, question2)\n", + "print(question2Response)" + ] + }, + { + "cell_type": "markdown", + "id": "3a8bc9a1-da62-4f4e-9781-82f30718cade", + "metadata": {}, + "source": [ + "### Save the responses to Delta table" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bbb16179-468b-451f-9952-1c0f07e94ef4", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.DataFrame({'Document Name': [documentName, documentName], 'Version': [version2Branch, version2Branch], 'Question': [question1, question2], 'Answer': [question1Response, question2Response]})\n", + "\n", + "deltalake.write_deltalake(table_or_uri=f\"s3a://{repo.id}/{version2Branch}/{responsesTable}\", \n", + " data = df,\n", + " mode='append',\n", + " storage_options=storage_options)" + ] + }, + { + "cell_type": "markdown", + "id": "18542e7e-bebc-4162-8189-8a8f7ee6ebe9", + "metadata": {}, + "source": [ + "### Commit changes and attach some metadata" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0269c8c2-6b73-4d8c-9772-5dbad65a962a", + "metadata": {}, + "outputs": [], + "source": [ + "lakefs.commits_api.commit(\n", + " repository=repo.id,\n", + " branch=version2Branch,\n", + " commit_creation=CommitCreation(\n", + " message='Saved responses for the questions',\n", + " metadata={'version': 'version2'}))" + ] + }, + { + "cell_type": "markdown", + "id": "e385f37e-74b7-41da-9d9b-bf8bbc3d65bc", + "metadata": {}, + "source": [ + "### Merge version2 branch to main" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "66b7bff1-057a-4895-9be2-59d078c7a418", + "metadata": {}, + "outputs": [], + "source": [ + "lakefs.refs_api.merge_into_branch(\n", + " repository=repo.id,\n", + " source_ref=version2Branch, \n", + " destination_branch=mainBranch)" + ] + }, + { + "cell_type": "markdown", + "id": "815a56e9-3747-496b-820a-37aaf397c7ce", + "metadata": {}, + "source": [ + "### Review responses for both versions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d38a150c-c1d0-4e50-8244-5adbaa03a837", + "metadata": {}, + "outputs": [], + "source": [ + "responses = deltalake.DeltaTable(f\"s3a://{repo.id}/{mainBranch}/{responsesTable}\", storage_options=storage_options)\n", + "pd.set_option('max_colwidth', 2000)\n", + "responses.to_pandas()" + ] + }, + { + "cell_type": "markdown", + "id": "9fbe477a-1ea7-489f-b5a9-7fefd58a2c95", + "metadata": {}, + "source": [ + "## More Questions?\n", + "\n", + "###### Join the lakeFS Slack group - https://lakefs.io/slack" + ] + }, + { + "cell_type": "code", + "execution_count": null, + 
"id": "35d73b71-6f6f-4cea-8523-df0bdece134d", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/01_standalone_examples/llm-openai-langchain-integration/README.md b/01_standalone_examples/llm-openai-langchain-integration/README.md new file mode 100644 index 000000000..3a6d66111 --- /dev/null +++ b/01_standalone_examples/llm-openai-langchain-integration/README.md @@ -0,0 +1,46 @@ +# Reproducibility and Data Version Control for LangChain and LLM/OpenAI Models + +Start by ⭐️ starring [lakeFS open source](https://go.lakefs.io/oreilly-course) project. + +This repository includes a Jupyter Notebook with LangChain and OpenAI libraries which you can run on your local machine. + +## Let's Get Started 👩🏻‍💻 + +Clone this repository + + ```bash + git clone https://github.com/treeverse/lakeFS-samples && cd lakeFS-samples/01_standalone_examples/llm-openai-langchain-integration + ``` + +You now have two options: + +### **Run a Notebook server with your existing lakeFS Server** + +If you have already [installed lakeFS](https://docs.lakefs.io/deploy/) or are utilizing [lakeFS cloud](https://lakefs.cloud/), all you need to run is the Jupyter notebook with LangChain and OpenAI libraries (Docker image size will be around 10GB): + + + ```bash + docker compose up + ``` + +### **Don't have a lakeFS Server or Object Store?** + +If you want to provision a lakeFS server as well as MinIO for your object store, plus Jupyter with LangChain and OpenAI libraries then bring up the full stack: + + ```bash + docker compose --profile local-lakefs up + ``` + +### URLs and login details + +* Jupyter http://localhost:8891/ + +If you've brought up the full stack you'll also have: + +* LakeFS http://localhost:48000/ (`AKIAIOSFOLKFSSAMPLES` / `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`) +* MinIO http://localhost:49001/ (`minioadmin`/`minioadmin`) + + +## Demo Instructions + +Open Jupyter UI [http://localhost:8891](http://localhost:8891) in your web browser. Open "LLM OpenAI LangChain Demo" notebook from Jupyter UI and follow the instructions. 
\ No newline at end of file diff --git a/01_standalone_examples/llm-openai-langchain-integration/data/version1/lakeFS Brochure.pdf b/01_standalone_examples/llm-openai-langchain-integration/data/version1/lakeFS Brochure.pdf new file mode 100644 index 000000000..5b9fbafb6 Binary files /dev/null and b/01_standalone_examples/llm-openai-langchain-integration/data/version1/lakeFS Brochure.pdf differ diff --git a/01_standalone_examples/llm-openai-langchain-integration/data/version2/lakeFS Brochure.pdf b/01_standalone_examples/llm-openai-langchain-integration/data/version2/lakeFS Brochure.pdf new file mode 100644 index 000000000..f74b5187b Binary files /dev/null and b/01_standalone_examples/llm-openai-langchain-integration/data/version2/lakeFS Brochure.pdf differ diff --git a/01_standalone_examples/llm-openai-langchain-integration/docker-compose.yml b/01_standalone_examples/llm-openai-langchain-integration/docker-compose.yml new file mode 100644 index 000000000..5da98d119 --- /dev/null +++ b/01_standalone_examples/llm-openai-langchain-integration/docker-compose.yml @@ -0,0 +1,101 @@ +--- +version: '3.9' +name: lakefs-with-llm-openai-langchain +services: + jupyter-notebook: + build: jupyter + container_name: llm-openai-langchain-jupyter-notebook + environment: + # log-level is set to WARN because of noisy stdout problem + # -> See https://github.com/jupyter-server/jupyter_server/issues/1279 + - NOTEBOOK_ARGS=--log-level='WARN' --NotebookApp.token='' --NotebookApp.password='' + ports: + - 8891:8888 # Jupyter + volumes: + - $PWD:/home/jovyan + - ./data:/data + + lakefs: + image: treeverse/lakefs:1.1.0 + depends_on: + - minio-setup + ports: + - "48000:8000" + environment: + - LAKEFS_DATABASE_TYPE=local + - LAKEFS_BLOCKSTORE_TYPE=s3 + - LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true + - LAKEFS_BLOCKSTORE_S3_ENDPOINT=http://minio:9000 + - LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=minioadmin + - LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=minioadmin + - LAKEFS_AUTH_ENCRYPT_SECRET_KEY=some random secret string + - LAKEFS_LOGGING_LEVEL=INFO + - LAKEFS_STATS_ENABLED=${LAKEFS_STATS_ENABLED:-1} + - LAKEFS_INSTALLATION_USER_NAME=everything-bagel + - LAKEFS_INSTALLATION_ACCESS_KEY_ID=AKIAIOSFOLKFSSAMPLES + - LAKEFS_INSTALLATION_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY + - LAKECTL_SERVER_ENDPOINT_URL=http://localhost:8000 + entrypoint: ["/bin/sh", "-c"] + command: + - | + lakefs run --local-settings & + echo "---- Creating repository ----" + wait-for -t 60 lakefs:8000 -- curl -u "$$LAKEFS_INSTALLATION_ACCESS_KEY_ID":"$$LAKEFS_INSTALLATION_SECRET_ACCESS_KEY" -X POST -H "Content-Type: application/json" -d '{ "name": "quickstart", "storage_namespace": "s3://quickstart", "default_branch": "main", "sample_data": true }' http://localhost:8000/api/v1/repositories || true + # wait-for -t 60 lakefs:8000 -- lakectl repo create lakefs://example s3://example || true + echo "" + wait-for -t 60 minio:9000 && echo '------------------------------------------------ + + MinIO admin: http://127.0.0.1:49001/ + + Username : minioadmin + Password : minioadmin + ' + echo "------------------------------------------------" + wait-for -t 60 jupyter-notebook:8888 && echo ' + + Jupyter: http://127.0.0.1:8891/ + ' + echo "------------------------------------------------" + echo "" + echo " lakeFS Web UI: http://127.0.0.1:48000/" + echo "" + echo " >(._.)<" + echo " ( )_ " + echo "" + echo " Access Key ID : $$LAKEFS_INSTALLATION_ACCESS_KEY_ID" + echo " Secret Access Key: $$LAKEFS_INSTALLATION_SECRET_ACCESS_KEY" + 
echo "" + echo "-------- Let's go and have axolotl fun! --------" + echo "" + wait + profiles: + - local-lakefs + + minio-setup: + image: minio/mc:RELEASE.2023-05-18T16-59-00Z + environment: + - MC_HOST_lakefs=http://minioadmin:minioadmin@minio:9000 + depends_on: + - minio + volumes: + - ./data:/data + entrypoint: ["/bin/sh", "-c"] + command: + - | + mc mb lakefs/quickstart lakefs/example lakefs/sample-data + mc cp --recursive /data/* lakefs/sample-data 1>/dev/null # don't be so noisy 🤫 + profiles: + - local-lakefs + + minio: + image: minio/minio:RELEASE.2023-05-18T00-05-36Z + ports: + - "49001:9001" + entrypoint: ["minio", "server", "/data", "--console-address", ":9001"] + profiles: + - local-lakefs + +networks: + default: + name: lakefs-llm-openai-langchain-network + diff --git a/01_standalone_examples/llm-openai-langchain-integration/images/langchain.jpeg b/01_standalone_examples/llm-openai-langchain-integration/images/langchain.jpeg new file mode 100644 index 000000000..a70384965 Binary files /dev/null and b/01_standalone_examples/llm-openai-langchain-integration/images/langchain.jpeg differ diff --git a/01_standalone_examples/llm-openai-langchain-integration/images/logo.svg b/01_standalone_examples/llm-openai-langchain-integration/images/logo.svg new file mode 100644 index 000000000..ebd0a2f5f --- /dev/null +++ b/01_standalone_examples/llm-openai-langchain-integration/images/logo.svg @@ -0,0 +1,8 @@ + + + + + + + + diff --git a/01_standalone_examples/llm-openai-langchain-integration/images/openai-lockup-black.svg b/01_standalone_examples/llm-openai-langchain-integration/images/openai-lockup-black.svg new file mode 100644 index 000000000..0819c9dcb --- /dev/null +++ b/01_standalone_examples/llm-openai-langchain-integration/images/openai-lockup-black.svg @@ -0,0 +1,10 @@ + + + + + + + + + + diff --git a/01_standalone_examples/llm-openai-langchain-integration/jupyter/Dockerfile b/01_standalone_examples/llm-openai-langchain-integration/jupyter/Dockerfile new file mode 100644 index 000000000..fae878bb2 --- /dev/null +++ b/01_standalone_examples/llm-openai-langchain-integration/jupyter/Dockerfile @@ -0,0 +1,20 @@ +FROM jupyter/minimal-notebook:notebook-7.0.6 + +USER root + +# These commands install the cv2 dependencies that are normally present on the local machine +RUN apt-get update && apt-get install ffmpeg libsm6 libxext6 -y + +RUN pip install lakefs-sdk==1.1.0.2 + +# Used openai 0.28.1 (old version) for langchain v0.0.331 compatibility +RUN pip install langchain==0.0.331 unstructured[pdf]==0.10.29 openai==0.28.1 tiktoken==0.4.0 + +RUN conda install -y -c pytorch faiss-cpu=1.7.4 mkl=2021 blas=1.0=mkl +RUN pip install deltalake==0.13.0 +RUN pip install ipywidgets==8.1.1 + +USER $NB_UID + +# Disable the "Would you like to receive official Jupyter news?" popup +RUN jupyter labextension disable "@jupyterlab/apputils-extension:announcements"