diff --git a/.github/workflows/notebooks_to_exclude.txt b/.github/workflows/notebooks_to_exclude.txt
index accdab0ab..c98398875 100644
--- a/.github/workflows/notebooks_to_exclude.txt
+++ b/.github/workflows/notebooks_to_exclude.txt
@@ -30,6 +30,7 @@
# Some notebooks are deliberate scraps of code
# and not designed to run end-to-end
./R-client.ipynb
+./R-weather.ipynb
# Ignore any temporary notebook caches etc
.ipynb_checkpoints
diff --git a/00_notebooks/00_index.ipynb b/00_notebooks/00_index.ipynb
index 59b8e6cbd..447c717ce 100644
--- a/00_notebooks/00_index.ipynb
+++ b/00_notebooks/00_index.ipynb
@@ -46,33 +46,10 @@
" * [What happens if you use Iceberg without the lakeFS support](./iceberg-lakefs-default.ipynb)\n",
"* **Using R with lakeFS**\n",
" * [Basic usage](./R.ipynb)\n",
- " * [Weather data](./R-weather.ipynb)\n",
" * [NYC Film permits example](./R-nyc.ipynb)\n",
" * [Rough notes around R client](./R-client.ipynb)"
]
},
- {
- "cell_type": "markdown",
- "id": "93ba5f85-ec96-4dc6-bc5c-f9313a658112",
- "metadata": {},
- "source": [
- "### Write-Audit-Publish pattern\n",
- "\n",
- "See https://lakefs.io/blog/data-engineering-patterns-write-audit-publish/"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "305a0733-ee99-4d4f-8dcf-94b9e9dd9fd5",
- "metadata": {},
- "source": [
-    "* [**Write-Audit-Publish / lakeFS**](write-audit-publish/wap-lakefs.ipynb)<br/>_With support for heterogeneous data formats and cross-collection consistency_\n",
- "* [**Write-Audit-Publish / Apache Iceberg**](write-audit-publish/wap-iceberg.ipynb)\n",
- "* [**Write-Audit-Publish / Project Nessie**](write-audit-publish/wap-nessie.ipynb)\n",
- "* [**Write-Audit-Publish / Delta Lake**](write-audit-publish/wap-delta.ipynb)\n",
- "* [**Write-Audit-Publish / Apache Hudi**](write-audit-publish/wap-hudi.ipynb)\n"
- ]
- },
{
"cell_type": "markdown",
"id": "c0de105e-2071-438c-b9fa-5ffe664d47ac",
@@ -102,6 +79,7 @@
"* [AWS **Glue and Trino**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/aws-glue-trino/)\n",
"* [lakeFS + **Dagster**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/dagster-integration/)\n",
"* [lakeFS + **Prefect**](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/prefect-integration/)\n",
+    "* [Reproducibility and Data Version Control for **LangChain** and **LLM/OpenAI** Models](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/llm-openai-langchain-integration/)<br/>_See also the [accompanying blog](https://lakefs.io/blog/lakefs-langchain-loader/)_\n",
"* [Image Segmentation Demo: *ML Data Version Control and Reproducibility at Scale*](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/image-segmentation/)\n",
"* [*Labelbox* integration](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/labelbox-integration/)\n",
"* [*Kafka* integration](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/kafka/)\n",
@@ -109,6 +87,28 @@
"* [How to **migrate or clone** a repo](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/migrate-or-clone-repo/)"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "93ba5f85-ec96-4dc6-bc5c-f9313a658112",
+ "metadata": {},
+ "source": [
+ "### Write-Audit-Publish pattern\n",
+ "\n",
+ "See https://lakefs.io/blog/data-engineering-patterns-write-audit-publish/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "305a0733-ee99-4d4f-8dcf-94b9e9dd9fd5",
+ "metadata": {},
+ "source": [
+    "* [**Write-Audit-Publish / lakeFS**](write-audit-publish/wap-lakefs.ipynb)<br/>_With support for heterogeneous data formats and cross-collection consistency_\n",
+ "* [**Write-Audit-Publish / Apache Iceberg**](write-audit-publish/wap-iceberg.ipynb)\n",
+ "* [**Write-Audit-Publish / Project Nessie**](write-audit-publish/wap-nessie.ipynb)\n",
+ "* [**Write-Audit-Publish / Delta Lake**](write-audit-publish/wap-delta.ipynb)\n",
+ "* [**Write-Audit-Publish / Apache Hudi**](write-audit-publish/wap-hudi.ipynb)\n"
+ ]
+ },
{
"cell_type": "markdown",
"id": "d1cb588f",
diff --git a/01_standalone_examples/llm-openai-langchain-integration/LLM OpenAI LangChain Demo.ipynb b/01_standalone_examples/llm-openai-langchain-integration/LLM OpenAI LangChain Demo.ipynb
new file mode 100644
index 000000000..860e4f158
--- /dev/null
+++ b/01_standalone_examples/llm-openai-langchain-integration/LLM OpenAI LangChain Demo.ipynb
@@ -0,0 +1,764 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "38df8ee9-01c6-4cae-9f37-2eb45250993e",
+ "metadata": {},
+ "source": [
+ " \n",
+ "\n",
+ "# Integration of lakeFS with LangChain and OpenAI\n",
+ "\n",
+    "Use Case: Reproducibility and Data Version Control for LangChain and LLM/OpenAI Models\n",
+ "\n",
+ "See also the [accompanying blog](https://lakefs.io/blog/lakefs-langchain-loader/)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4a9c9095-97fc-4453-bc8a-7afe9fe456e6",
+ "metadata": {},
+ "source": [
+ "## Config"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "863c5d7a-91a4-44f6-bae1-745f6190983f",
+ "metadata": {},
+ "source": [
+ "### lakeFS endpoint and credentials\n",
+ "\n",
+    "Change these if you are using a lakeFS server other than the one provided in the samples repo."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5ec9395c-ee3a-422d-b935-64807b070943",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' \n",
+ "lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'\n",
+ "lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "51714413-29f4-4cc1-b821-7c693f266930",
+ "metadata": {},
+ "source": [
+ "### Storage Information\n",
+ "\n",
+    "If you're not using the lakeFS server provided with the samples repo, change the storage namespace to a location in the bucket you've configured. The storage namespace is the location in the underlying storage where data for this repository will be stored."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cd852ddc-3a06-4eb6-9236-c3b77164361f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "storageNamespace = 's3://example' # e.g. \"s3://bucket\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "67a3e781-e6e6-47dd-a483-71a62a2342da",
+ "metadata": {},
+ "source": [
+ "### OpenAI API Key\n",
+    "##### If you do not have an API key, create a free OpenAI account and generate one here: https://platform.openai.com/api-keys"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7b2a6e05-ea60-49b3-861b-5a0acade748e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
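+    "# Note (an addition, not in the original notebook): instead of hardcoding the\n",
+    "# key you can read it from an environment variable, e.g.\n",
+    "#   import os; openai_api_key = os.environ['OPENAI_API_KEY']\n",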
+ "openai_api_key = \"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b154b685-ef4f-4116-990d-6dff21db5080",
+ "metadata": {},
+ "source": [
+ "## Setup\n",
+ "\n",
+ "**(you shouldn't need to change anything in this section, just run it)**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "52f4fc64-1070-4b07-9124-1bafffa72588",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "repo_name = \"llm-openai-langchain-repo\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "90917396-2738-40fa-9a61-e0ebd6a264a7",
+ "metadata": {},
+ "source": [
+ "### Versioning Information "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "63ff2b15-6a7a-4cfc-96a1-f1c73c08f289",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "mainBranch = \"main\"\n",
+ "version1Branch = \"version1\"\n",
+ "version2Branch = \"version2\"\n",
+ "documentName = \"lakeFS Brochure.pdf\"\n",
+ "responsesTable = \"responses\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cbed0b49-a2b9-429b-86fb-5f02ed82c689",
+ "metadata": {},
+ "source": [
+    "### Create a function to load documents from a lakeFS repository using the [official lakeFS document loader for LangChain](https://python.langchain.com/docs/integrations/document_loaders/lakefs)\n",
+    "##### The function splits documents into smaller chunks, converts them into OpenAI embeddings, and stores them in an in-memory vector database (Meta's [FAISS](https://ai.meta.com/tools/faiss/))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "992238a8-aa3e-4268-9b61-fbb55e0487c3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.embeddings import OpenAIEmbeddings\n",
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+ "from langchain.vectorstores.faiss import FAISS\n",
+ "from langchain.document_loaders import LakeFSLoader\n",
+ "\n",
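+    "# `ref` can be any lakeFS reference (a branch, tag, or commit ID), so the same\n",
+    "# function can load any pinned version of the document.\n",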
+ "def load_document(repo: str, ref: str, path: str) -> FAISS:\n",
+ " lakefs_loader = LakeFSLoader(\n",
+ " lakefs_access_key=lakefsAccessKey,\n",
+ " lakefs_secret_key=lakefsSecretKey,\n",
+ " lakefs_endpoint=lakefsEndPoint\n",
+ " )\n",
+ " lakefs_loader.set_repo(repo)\n",
+ " lakefs_loader.set_ref(ref)\n",
+ " lakefs_loader.set_path(path)\n",
+ " docs = lakefs_loader.load()\n",
+ " splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n",
+ " docs = splitter.split_documents(docs)\n",
+ " return FAISS.from_documents(docs, embedding=OpenAIEmbeddings(openai_api_key=openai_api_key))"
+ ]
+ },
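+  {
+   "cell_type": "markdown",
+   "id": "0f3c2a1e-5b6d-4c7e-8f90-a1b2c3d4e5f6",
+   "metadata": {},
+   "source": [
+    "_Aside (an addition, not part of the original demo): because `ref` accepts any lakeFS reference, you can rebuild the vector database from an immutable commit ID, e.g. `load_document(repo_name, '<commit-id>', documentName)`, making a given set of embeddings fully reproducible._"
+   ]
+  },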
+ {
+ "cell_type": "markdown",
+ "id": "e37dada4-4070-4f04-8081-abe41fb8bddd",
+ "metadata": {},
+ "source": [
+    "### Create a function to query this data using OpenAI\n",
+    "#### Set up a model and a prompt into which you will feed the documents related to the user's question"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "da5ca5db-ca84-4477-873a-97b3c50d1c72",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.llms import OpenAI\n",
+ "from langchain.chains import LLMChain\n",
+ "from langchain.prompts import PromptTemplate\n",
+ "from langchain.vectorstores.faiss import FAISS\n",
+ "\n",
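+    "# Retrieval-augmented generation: look up the chunks most similar to the\n",
+    "# question in FAISS, then hand them to the LLM as context inside the prompt.\n",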
+ "def query_document(db: FAISS, document_name: str, query: str) -> str:\n",
+    "    related_docs = db.similarity_search(query, k=4)  # retrieve the 4 chunks most similar to the query\n",
+ " docs_content = ' '.join([d.page_content for d in related_docs])\n",
+ " llm = OpenAI(model='text-davinci-003', temperature=0, openai_api_key=openai_api_key)\n",
+ " prompt = PromptTemplate(\n",
+ " input_variables=['question', 'docs', 'document_name'],\n",
+ " template=\"\"\"\n",
+ " You are a helpful document assistant that can answer questions about a document based on the text it contains.\n",
+ " \n",
+ " The name of the document is: {document_name}\n",
+ " Answer the following question: {question}\n",
+ " By searching the following document: {docs}\n",
+ " \n",
+ " Only use factual information from the document to answer the question.\n",
+ " \n",
+ " If you feel like you don't have enough information to answer the question, say \"I don't know\".\n",
+ " \n",
+ " Your answers should be detailed.\n",
+ " \"\"\"\n",
+ " )\n",
+ "\n",
+ " chain = LLMChain(llm=llm, prompt=prompt)\n",
+ " return chain.run(question=query, docs=docs_content, document_name=document_name)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3bff25a6-2864-40d6-b94a-67fdb77193a5",
+ "metadata": {},
+ "source": [
+ "### lakeFS S3 gateway config for the Delta table"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0a243b96-6692-4931-8683-14e23348e134",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import deltalake\n",
+ "\n",
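+    "# These options point the deltalake client at the lakeFS S3 gateway, so the\n",
+    "# Delta table is read and written through lakeFS using the repo credentials.\n",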
+    "storage_options = {\"AWS_ACCESS_KEY_ID\": lakefsAccessKey,\n",
+    "                   \"AWS_SECRET_ACCESS_KEY\": lakefsSecretKey,\n",
+    "                   \"AWS_ENDPOINT\": lakefsEndPoint,\n",
+    "                   \"AWS_REGION\": \"us-east-1\",\n",
+    "                   \"AWS_ALLOW_HTTP\": \"true\",\n",
+    "                   \"AWS_S3_ALLOW_UNSAFE_RENAME\": \"true\"\n",
+    "                  }"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9c45a90f-9446-470b-9b07-c8fce03c16ab",
+ "metadata": {},
+ "source": [
+    "### Create the lakeFS client"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c26d3385-94fc-438f-86e4-ac98a7924ee7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import lakefs_sdk\n",
+ "from lakefs_sdk.client import LakeFSClient\n",
+ "from lakefs_sdk.models import *\n",
+ "\n",
+ "configuration = lakefs_sdk.Configuration(\n",
+ " host=lakefsEndPoint,\n",
+ " username=lakefsAccessKey,\n",
+ " password=lakefsSecretKey,\n",
+ ")\n",
+ "lakefs = LakeFSClient(configuration)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7ec78263-be53-4387-80c3-b1a347c02191",
+ "metadata": {},
+ "source": [
+ "### Define lakeFS Repository"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "925988e4-475f-46de-9e5a-e1689dfa9353",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+    "import os\n",
+    "\n",
+    "from lakefs_sdk.exceptions import NotFoundException\n",
+ "\n",
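+    "# Get-or-create: reuse the repository if it already exists; otherwise create it\n",
+    "# with the configured storage namespace and default branch.\n",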
+ "try:\n",
+ " repo=lakefs.repositories_api.get_repository(repo_name)\n",
+ " print(f\"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}\")\n",
+    "except NotFoundException:\n",
+ " print(f\"Repository {repo_name} does not exist, so going to try and create it now.\")\n",
+ " try:\n",
+ " repo=lakefs.repositories_api.create_repository(repository_creation=RepositoryCreation(name=repo_name,\n",
+ " storage_namespace=f\"{storageNamespace}/{repo_name}\",\n",
+ " default_branch=mainBranch))\n",
+ " print(f\"Created new repo {repo.id} using storage namespace {repo.storage_namespace}\")\n",
+ " except lakefs_sdk.ApiException as e:\n",
+ " print(f\"Error creating repo {repo_name}. Error is {e}\")\n",
+ " os._exit(00)\n",
+ "except lakefs_sdk.ApiException as e:\n",
+ " print(f\"Error getting repo {repo_name}: {e}\")\n",
+ " os._exit(00)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8c035f52-338e-4179-a317-b55d9bbb45ee",
+ "metadata": {},
+ "source": [
+ "# Main demo starts here 🚦 👇🏻"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2e413173-70f3-4725-a3fa-7568d2b36c11",
+ "metadata": {},
+ "source": [
+ "### Create version1 branch"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "56bf1a3e-da38-4eba-9325-4b6966a020ef",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lakefs.branches_api.create_branch(\n",
+ " repository=repo.id,\n",
+ " branch_creation=BranchCreation(\n",
+ " name=version1Branch,\n",
+ " source=mainBranch))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "05419d37-436c-4af1-a97b-e163aedf5998",
+ "metadata": {},
+ "source": [
+ "### Upload \"lakeFS Brochure.pdf\" document to version1 branch"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3a788f57-6c43-45dd-a3f4-27a632ed0540",
+ "metadata": {},
+ "outputs": [],
+ "source": [
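+    "# `content` is a local file path: ./data is mounted at /data by docker-compose,\n",
+    "# so this uploads data/version1/lakeFS Brochure.pdf to the branch.\n",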
+ "lakefs.objects_api.upload_object(\n",
+ " repository=repo.id,\n",
+ " branch=version1Branch,\n",
+ " path=documentName, content=f\"/data/{version1Branch}/{documentName}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "82d0caf1-cb9f-440c-a5af-303e858ac152",
+ "metadata": {},
+ "source": [
+ "### Commit changes and attach some metadata"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "eaadfee1-1c2f-4020-bf25-1596bf69f806",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lakefs.commits_api.commit(\n",
+ " repository=repo.id,\n",
+ " branch=version1Branch,\n",
+ " commit_creation=CommitCreation(\n",
+ " message='Uploaded lakeFS Brochure',\n",
+ " metadata={'version': 'version1'}))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0e2ae0e7-ef38-4e2b-ac8c-272dd5667874",
+ "metadata": {},
+ "source": [
+ "### Load \"lakeFS Brochure.pdf\" (version 1) document to vector database"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e2ed8be5-06a6-47c1-96c2-6f8d853bf296",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "db = load_document(repo_name, version1Branch, documentName)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d0b9ed98-22ef-4d0c-bcb2-774d50cbc1b1",
+ "metadata": {},
+ "source": [
+    "### Let's ask these two questions"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "edfec532-a7f6-42b5-9fde-bc694c7a43d8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "question1 = 'why lakefs'\n",
+ "question2 = 'trusted by?'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7409d98c-978a-4a99-b98b-e53082a57304",
+ "metadata": {},
+ "source": [
+ "### Ask 1st question"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dc3b0667-015f-417c-b828-b8bd7b063fcf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "question1Response = query_document(db, documentName, question1)\n",
+ "print(question1Response)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "83477ffc-6fe2-44b8-b860-475bddcf1b7d",
+ "metadata": {},
+ "source": [
+ "### Ask 2nd question"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8dbac5c0-ed8d-4ed8-9645-2c494ab6ee1f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "question2Response = query_document(db, documentName, question2)\n",
+ "print(question2Response)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f6d3d13-dae3-490d-879c-4ab2ceb72fc6",
+ "metadata": {},
+ "source": [
+ "### Save the responses to a Delta table"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "aed98dab-a980-4a99-9254-03d80f9c6c3e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
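+    "# One row per question, tagged with the branch name so that answers from\n",
+    "# different document versions can be compared side by side later.\n",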
+ "df = pd.DataFrame({'Document Name': [documentName, documentName], 'Version': [version1Branch, version1Branch], 'Question': [question1, question2], 'Answer': [question1Response, question2Response]})\n",
+ "\n",
+    "deltalake.write_deltalake(table_or_uri=f\"s3a://{repo.id}/{version1Branch}/{responsesTable}\",\n",
+    "                          data=df,\n",
+    "                          mode='append',\n",
+    "                          storage_options=storage_options)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "76734ee8-f845-49af-a136-7efc90d9e39a",
+ "metadata": {},
+ "source": [
+ "### Commit changes and attach some metadata"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "03716729-3172-4266-a807-3d2e23c5afb8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lakefs.commits_api.commit(\n",
+ " repository=repo.id,\n",
+ " branch=version1Branch,\n",
+ " commit_creation=CommitCreation(\n",
+ " message='Saved responses for the questions',\n",
+ " metadata={'version': 'version1'}))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3555ad69-1faa-4ef8-abf8-5707901bc3ca",
+ "metadata": {},
+ "source": [
+    "### Merge version1 branch into main"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fc91a22b-bcb6-4164-b472-bb25ebcf6d78",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lakefs.refs_api.merge_into_branch(\n",
+ " repository=repo.id,\n",
+ " source_ref=version1Branch, \n",
+ " destination_branch=mainBranch)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "66023094-6052-4a04-8dea-23a26275bfec",
+ "metadata": {},
+ "source": [
+ "### Create version2 branch"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3f841167-2464-4fcf-a7e3-68386f77d4f1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lakefs.branches_api.create_branch(\n",
+ " repository=repo.id,\n",
+ " branch_creation=BranchCreation(\n",
+ " name=version2Branch,\n",
+ " source=mainBranch))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e515f7f5-1328-4c38-92e1-058064c93149",
+ "metadata": {},
+ "source": [
+ "### Upload 2nd version of the \"lakeFS Brochure.pdf\" document"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5bc09a14-2487-4b29-9681-8d97570ff7d2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lakefs.objects_api.upload_object(\n",
+ " repository=repo.id,\n",
+ " branch=version2Branch,\n",
+ " path=documentName, content=f\"/data/{version2Branch}/{documentName}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d0ae6f26-cdba-4099-94dd-9814f25f5dbb",
+ "metadata": {},
+ "source": [
+ "### Commit changes and attach some metadata"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8d73c920-c247-443f-93bd-2cd7ff23756c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lakefs.commits_api.commit(\n",
+ " repository=repo.id,\n",
+ " branch=version2Branch,\n",
+ " commit_creation=CommitCreation(\n",
+ " message='Uploaded lakeFS Brochure',\n",
+ " metadata={'version': 'version2'}))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2dd2ab47-9b8c-4704-a761-fa24d55649e0",
+ "metadata": {},
+ "source": [
+ "### Load \"lakeFS Brochure.pdf\" (version 2) document to vector database"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dac931a2-3917-4b45-8ceb-73bf0c379198",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "db = load_document(repo_name, version2Branch, documentName)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b5c0bde7-c10d-4b79-9589-0be6052dd17b",
+ "metadata": {},
+ "source": [
+    "### Ask 1st question using the version2 document"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "11fd18a9-95da-4a7f-8bda-a957d6494798",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "question1Response = query_document(db, documentName, question1)\n",
+ "print(question1Response)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "baefc417-6640-44b0-ba5e-701f04d9242a",
+ "metadata": {},
+ "source": [
+    "### Ask 2nd question using the version2 document"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c9c46ab1-f1e2-448f-ba22-f4e345b50864",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "question2Response = query_document(db, documentName, question2)\n",
+ "print(question2Response)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3a8bc9a1-da62-4f4e-9781-82f30718cade",
+ "metadata": {},
+ "source": [
+    "### Save the responses to a Delta table"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bbb16179-468b-451f-9952-1c0f07e94ef4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df = pd.DataFrame({'Document Name': [documentName, documentName], 'Version': [version2Branch, version2Branch], 'Question': [question1, question2], 'Answer': [question1Response, question2Response]})\n",
+ "\n",
+    "deltalake.write_deltalake(table_or_uri=f\"s3a://{repo.id}/{version2Branch}/{responsesTable}\",\n",
+    "                          data=df,\n",
+    "                          mode='append',\n",
+    "                          storage_options=storage_options)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "18542e7e-bebc-4162-8189-8a8f7ee6ebe9",
+ "metadata": {},
+ "source": [
+ "### Commit changes and attach some metadata"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0269c8c2-6b73-4d8c-9772-5dbad65a962a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lakefs.commits_api.commit(\n",
+ " repository=repo.id,\n",
+ " branch=version2Branch,\n",
+ " commit_creation=CommitCreation(\n",
+ " message='Saved responses for the questions',\n",
+ " metadata={'version': 'version2'}))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e385f37e-74b7-41da-9d9b-bf8bbc3d65bc",
+ "metadata": {},
+ "source": [
+    "### Merge version2 branch into main"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "66b7bff1-057a-4895-9be2-59d078c7a418",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lakefs.refs_api.merge_into_branch(\n",
+ " repository=repo.id,\n",
+ " source_ref=version2Branch, \n",
+ " destination_branch=mainBranch)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "815a56e9-3747-496b-820a-37aaf397c7ce",
+ "metadata": {},
+ "source": [
+ "### Review responses for both versions"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d38a150c-c1d0-4e50-8244-5adbaa03a837",
+ "metadata": {},
+ "outputs": [],
+ "source": [
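+    "# Both version branches were merged into main, so the Delta table on main now\n",
+    "# holds the responses for version1 and version2 side by side.\n",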
+ "responses = deltalake.DeltaTable(f\"s3a://{repo.id}/{mainBranch}/{responsesTable}\", storage_options=storage_options)\n",
+    "pd.set_option('display.max_colwidth', 2000)\n",
+ "responses.to_pandas()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9fbe477a-1ea7-489f-b5a9-7fefd58a2c95",
+ "metadata": {},
+ "source": [
+ "## More Questions?\n",
+ "\n",
+ "###### Join the lakeFS Slack group - https://lakefs.io/slack"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "35d73b71-6f6f-4cea-8523-df0bdece134d",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/01_standalone_examples/llm-openai-langchain-integration/README.md b/01_standalone_examples/llm-openai-langchain-integration/README.md
new file mode 100644
index 000000000..3a6d66111
--- /dev/null
+++ b/01_standalone_examples/llm-openai-langchain-integration/README.md
@@ -0,0 +1,46 @@
+# Reproducibility and Data Version Control for LangChain and LLM/OpenAI Models
+
+Start by ⭐️ starring the [lakeFS open source](https://go.lakefs.io/oreilly-course) project.
+
+This repository includes a Jupyter notebook that uses the LangChain and OpenAI libraries, which you can run on your local machine.
+
+## Let's Get Started 👩🏻💻
+
+Clone this repository:
+
+ ```bash
+ git clone https://github.com/treeverse/lakeFS-samples && cd lakeFS-samples/01_standalone_examples/llm-openai-langchain-integration
+ ```
+
+You now have two options:
+
+### **Run a Notebook server with your existing lakeFS Server**
+
+If you have already [installed lakeFS](https://docs.lakefs.io/deploy/) or are using [lakeFS Cloud](https://lakefs.cloud/), all you need to run is the Jupyter notebook with the LangChain and OpenAI libraries (the Docker image is around 10GB):
+
+
+ ```bash
+ docker compose up
+ ```
+
+### **Don't have a lakeFS Server or Object Store?**
+
+If you want to provision a lakeFS server as well as MinIO for your object store, plus Jupyter with the LangChain and OpenAI libraries, then bring up the full stack:
+
+ ```bash
+ docker compose --profile local-lakefs up
+ ```
+
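+When you're done, `docker compose down` tears the stack down again (add `--profile local-lakefs` if you brought up the full stack):
+
+   ```bash
+   docker compose --profile local-lakefs down
+   ```
+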
+### URLs and login details
+
+* Jupyter http://localhost:8891/
+
+If you've brought up the full stack you'll also have:
+
+* lakeFS http://localhost:48000/ (`AKIAIOSFOLKFSSAMPLES` / `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`)
+* MinIO http://localhost:49001/ (`minioadmin`/`minioadmin`)
+
+
+## Demo Instructions
+
+Open the Jupyter UI at [http://localhost:8891](http://localhost:8891) in your web browser, then open the "LLM OpenAI LangChain Demo" notebook and follow the instructions.
\ No newline at end of file
diff --git a/01_standalone_examples/llm-openai-langchain-integration/data/version1/lakeFS Brochure.pdf b/01_standalone_examples/llm-openai-langchain-integration/data/version1/lakeFS Brochure.pdf
new file mode 100644
index 000000000..5b9fbafb6
Binary files /dev/null and b/01_standalone_examples/llm-openai-langchain-integration/data/version1/lakeFS Brochure.pdf differ
diff --git a/01_standalone_examples/llm-openai-langchain-integration/data/version2/lakeFS Brochure.pdf b/01_standalone_examples/llm-openai-langchain-integration/data/version2/lakeFS Brochure.pdf
new file mode 100644
index 000000000..f74b5187b
Binary files /dev/null and b/01_standalone_examples/llm-openai-langchain-integration/data/version2/lakeFS Brochure.pdf differ
diff --git a/01_standalone_examples/llm-openai-langchain-integration/docker-compose.yml b/01_standalone_examples/llm-openai-langchain-integration/docker-compose.yml
new file mode 100644
index 000000000..5da98d119
--- /dev/null
+++ b/01_standalone_examples/llm-openai-langchain-integration/docker-compose.yml
@@ -0,0 +1,101 @@
+---
+version: '3.9'
+name: lakefs-with-llm-openai-langchain
+services:
+ jupyter-notebook:
+ build: jupyter
+ container_name: llm-openai-langchain-jupyter-notebook
+ environment:
+ # log-level is set to WARN because of noisy stdout problem
+ # -> See https://github.com/jupyter-server/jupyter_server/issues/1279
+ - NOTEBOOK_ARGS=--log-level='WARN' --NotebookApp.token='' --NotebookApp.password=''
+ ports:
+ - 8891:8888 # Jupyter
+ volumes:
+ - $PWD:/home/jovyan
+ - ./data:/data
+
+ lakefs:
+ image: treeverse/lakefs:1.1.0
+ depends_on:
+ - minio-setup
+ ports:
+ - "48000:8000"
+ environment:
+ - LAKEFS_DATABASE_TYPE=local
+ - LAKEFS_BLOCKSTORE_TYPE=s3
+ - LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true
+ - LAKEFS_BLOCKSTORE_S3_ENDPOINT=http://minio:9000
+ - LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=minioadmin
+ - LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=minioadmin
+ - LAKEFS_AUTH_ENCRYPT_SECRET_KEY=some random secret string
+ - LAKEFS_LOGGING_LEVEL=INFO
+ - LAKEFS_STATS_ENABLED=${LAKEFS_STATS_ENABLED:-1}
+ - LAKEFS_INSTALLATION_USER_NAME=everything-bagel
+ - LAKEFS_INSTALLATION_ACCESS_KEY_ID=AKIAIOSFOLKFSSAMPLES
+ - LAKEFS_INSTALLATION_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
+ - LAKECTL_SERVER_ENDPOINT_URL=http://localhost:8000
+ entrypoint: ["/bin/sh", "-c"]
+ command:
+ - |
+ lakefs run --local-settings &
+ echo "---- Creating repository ----"
+ wait-for -t 60 lakefs:8000 -- curl -u "$$LAKEFS_INSTALLATION_ACCESS_KEY_ID":"$$LAKEFS_INSTALLATION_SECRET_ACCESS_KEY" -X POST -H "Content-Type: application/json" -d '{ "name": "quickstart", "storage_namespace": "s3://quickstart", "default_branch": "main", "sample_data": true }' http://localhost:8000/api/v1/repositories || true
+ # wait-for -t 60 lakefs:8000 -- lakectl repo create lakefs://example s3://example || true
+ echo ""
+ wait-for -t 60 minio:9000 && echo '------------------------------------------------
+
+ MinIO admin: http://127.0.0.1:49001/
+
+ Username : minioadmin
+ Password : minioadmin
+ '
+ echo "------------------------------------------------"
+ wait-for -t 60 jupyter-notebook:8888 && echo '
+
+ Jupyter: http://127.0.0.1:8891/
+ '
+ echo "------------------------------------------------"
+ echo ""
+ echo " lakeFS Web UI: http://127.0.0.1:48000/"
+ echo ""
+ echo " >(._.)<"
+ echo " ( )_ "
+ echo ""
+ echo " Access Key ID : $$LAKEFS_INSTALLATION_ACCESS_KEY_ID"
+ echo " Secret Access Key: $$LAKEFS_INSTALLATION_SECRET_ACCESS_KEY"
+ echo ""
+ echo "-------- Let's go and have axolotl fun! --------"
+ echo ""
+ wait
+ profiles:
+ - local-lakefs
+
+ minio-setup:
+ image: minio/mc:RELEASE.2023-05-18T16-59-00Z
+ environment:
+ - MC_HOST_lakefs=http://minioadmin:minioadmin@minio:9000
+ depends_on:
+ - minio
+ volumes:
+ - ./data:/data
+ entrypoint: ["/bin/sh", "-c"]
+ command:
+ - |
+ mc mb lakefs/quickstart lakefs/example lakefs/sample-data
+ mc cp --recursive /data/* lakefs/sample-data 1>/dev/null # don't be so noisy 🤫
+ profiles:
+ - local-lakefs
+
+ minio:
+ image: minio/minio:RELEASE.2023-05-18T00-05-36Z
+ ports:
+ - "49001:9001"
+ entrypoint: ["minio", "server", "/data", "--console-address", ":9001"]
+ profiles:
+ - local-lakefs
+
+networks:
+ default:
+ name: lakefs-llm-openai-langchain-network
+
diff --git a/01_standalone_examples/llm-openai-langchain-integration/images/langchain.jpeg b/01_standalone_examples/llm-openai-langchain-integration/images/langchain.jpeg
new file mode 100644
index 000000000..a70384965
Binary files /dev/null and b/01_standalone_examples/llm-openai-langchain-integration/images/langchain.jpeg differ
diff --git a/01_standalone_examples/llm-openai-langchain-integration/images/logo.svg b/01_standalone_examples/llm-openai-langchain-integration/images/logo.svg
new file mode 100644
index 000000000..ebd0a2f5f
--- /dev/null
+++ b/01_standalone_examples/llm-openai-langchain-integration/images/logo.svg
@@ -0,0 +1,8 @@
+
diff --git a/01_standalone_examples/llm-openai-langchain-integration/images/openai-lockup-black.svg b/01_standalone_examples/llm-openai-langchain-integration/images/openai-lockup-black.svg
new file mode 100644
index 000000000..0819c9dcb
--- /dev/null
+++ b/01_standalone_examples/llm-openai-langchain-integration/images/openai-lockup-black.svg
@@ -0,0 +1,10 @@
+
+
diff --git a/01_standalone_examples/llm-openai-langchain-integration/jupyter/Dockerfile b/01_standalone_examples/llm-openai-langchain-integration/jupyter/Dockerfile
new file mode 100644
index 000000000..fae878bb2
--- /dev/null
+++ b/01_standalone_examples/llm-openai-langchain-integration/jupyter/Dockerfile
@@ -0,0 +1,20 @@
+FROM jupyter/minimal-notebook:notebook-7.0.6
+
+USER root
+
+# These commands install the system dependencies for cv2 (OpenCV) that are normally present on a local machine
+RUN apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
+
+RUN pip install lakefs-sdk==1.1.0.2
+
+# Use openai 0.28.1 (an older release) for compatibility with langchain 0.0.331
+RUN pip install langchain==0.0.331 unstructured[pdf]==0.10.29 openai==0.28.1 tiktoken==0.4.0
+
+RUN conda install -y -c pytorch faiss-cpu=1.7.4 mkl=2021 blas=1.0=mkl
+RUN pip install deltalake==0.13.0
+RUN pip install ipywidgets==8.1.1
+
+USER $NB_UID
+
+# Disable the "Would you like to receive official Jupyter news?" popup
+RUN jupyter labextension disable "@jupyterlab/apputils-extension:announcements"