From e1dae8337781ccc442ab3693758f9da08bf1b34a Mon Sep 17 00:00:00 2001 From: kennethmhc Date: Thu, 22 Aug 2024 17:24:57 +0200 Subject: [PATCH] Move similarity search examples (#275) --- .../news-search-knn-save-model.ipynb | 370 ++++++++++++++++++ .../news-search-knn.ipynb | 348 ++++++++++++++++ .../news-search-rank-view.ipynb | 356 +++++++++++++++++ .../vector_similarity_search/requirements.txt | 3 + 4 files changed, 1077 insertions(+) create mode 100644 api_examples/hopsworks/vector_similarity_search/news-search-knn-save-model.ipynb create mode 100644 api_examples/hopsworks/vector_similarity_search/news-search-knn.ipynb create mode 100644 api_examples/hopsworks/vector_similarity_search/news-search-rank-view.ipynb create mode 100644 api_examples/hopsworks/vector_similarity_search/requirements.txt diff --git a/api_examples/hopsworks/vector_similarity_search/news-search-knn-save-model.ipynb b/api_examples/hopsworks/vector_similarity_search/news-search-knn-save-model.ipynb new file mode 100644 index 00000000..d2a82644 --- /dev/null +++ b/api_examples/hopsworks/vector_similarity_search/news-search-knn-save-model.ipynb @@ -0,0 +1,370 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "1e23a0d0", + "metadata": {}, + "source": [ + "# News search using kNN in Hopsworks" + ] + }, + { + "cell_type": "markdown", + "id": "03fb3165", + "metadata": {}, + "source": [ + "In this tutorial, you are going to learn how to create a news search application which allows you to search news using natural language. You will create embeddings for the news and search for news similar to a given description using kNN search over those embeddings. You will also learn how to avoid training-serving skew by using the model registry. The steps include:\n", + "1. Load news data\n", + "2. Create embeddings for news heading and news body\n", + "3. Save the embedding model to the model registry\n", + "4. Ingest the news data and embeddings into Hopsworks\n", + "5. 
Search news using Hopsworks" + ] + }, + { + "cell_type": "markdown", + "id": "12c30fea", + "metadata": {}, + "source": [ + "## Load news data" + ] + }, + { + "cell_type": "markdown", + "id": "d5d5513d", + "metadata": {}, + "source": [ + "First, you need to load the news articles downloaded from [Kaggle news articles](https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles).\n", + "Since creating embeddings for the full news is time-consuming, here we sample some articles." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e95346ff", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "df_all = pd.read_csv(\"https://repo.hops.works/dev/jdowling/Articles.csv\", encoding='utf-8', encoding_errors='ignore')\n", + "df = df_all.sample(n=300).reset_index().drop([\"index\"], axis=1)\n", + "df[\"news_id\"] = list(range(len(df)))" + ] + }, + { + "cell_type": "markdown", + "id": "96bfc948", + "metadata": {}, + "source": [ + "## Create embeddings" + ] + }, + { + "cell_type": "markdown", + "id": "b7bd09b2", + "metadata": {}, + "source": [ + "Next, you need to create embeddings for heading and body of the news. The embeddings will then be used for kNN search against the embedding of the news description you want to search. Here we use a light weighted language model (LM) which encodes the news into embeddings. You can use any other language models including LLM (llama, Mistral)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "88053d37", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install sentence_transformers -q" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c22d8fb", + "metadata": {}, + "outputs": [], + "source": [ + "from sentence_transformers import SentenceTransformer\n", + "model = SentenceTransformer('all-MiniLM-L6-v2')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43e230f9", + "metadata": {}, + "outputs": [], + "source": [ + "# truncate the body to 100 characters\n", + "embeddings_body = model.encode([body[:100] for body in df[\"Article\"]])\n", + "embeddings_heading = model.encode(df[\"Heading\"])\n", + "df[\"embedding_heading\"] = pd.Series(embeddings_heading.tolist())\n", + "df[\"embedding_body\"] = pd.Series(embeddings_body.tolist())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41b150f6", + "metadata": {}, + "outputs": [], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "65f73330", + "metadata": {}, + "source": [ + "## Ingest into Hopsworks" + ] + }, + { + "cell_type": "markdown", + "id": "0721e6c1", + "metadata": {}, + "source": [ + "You need to ingest the data into Hopsworks, so that it is stored and indexed. First, you log in to Hopsworks and prepare the feature store." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27a19f00", + "metadata": {}, + "outputs": [], + "source": [ + "import hopsworks\n", + "proj = hopsworks.login()\n", + "fs = proj.get_feature_store()" + ] + }, + { + "cell_type": "markdown", + "id": "36cf4331", + "metadata": {}, + "source": [ + "Next, as embeddings are stored in an index in the backing vector database, you need to specify the index name and the embedding features in the dataframe. You can also save the embedding model to the model registry, and attach the model to the embedding features. 
This helps avoid training-serving skew, because at inference time you can get back the same model that was used to create the embeddings at training time." + ] + }, + { + "cell_type": "markdown", + "id": "4c564fc9", + "metadata": {}, + "source": [ + "First, you save the model to the model registry." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e5e5270", + "metadata": {}, + "outputs": [], + "source": [ + "import pickle\n", + "model_name = \"SentenceTransformer_all_MiniLM_L6_v2\"\n", + "mr = proj.get_model_registry()\n", + "# Check if the model has been created, and get back the model if available. Otherwise create the model.\n", + "try:\n", + " hsml_model = mr.get_model(model_name, 1)\n", + "except Exception:\n", + " with open(f\"{model_name}.pkl\", \"wb\") as f:\n", + " pickle.dump(model, f)\n", + " hsml_model = mr.python.create_model(model_name)\n", + " hsml_model.save(f\"{model_name}.pkl\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "db4a76ea", + "metadata": {}, + "source": [ + "Then, you specify the index name, embedding features, and model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47e0cf41", + "metadata": {}, + "outputs": [], + "source": [ + "version = 1\n", + "from hsfs import embedding\n", + "\n", + "emb = embedding.EmbeddingIndex(index_name=f\"news_fg_{version}\")\n", + "# specify the name, dimension, and model of the embedding features \n", + "emb.add_embedding(\"embedding_body\", model.get_sentence_embedding_dimension(), model=hsml_model)\n", + "emb.add_embedding(\"embedding_heading\", model.get_sentence_embedding_dimension(), model=hsml_model)" + ] + }, + { + "cell_type": "markdown", + "id": "25bddf53", + "metadata": {}, + "source": [ + "Next, you create a feature group with the `embedding_index` and ingest data into the feature group." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e55522ba", + "metadata": {}, + "outputs": [], + "source": [ + "news_fg = fs.get_or_create_feature_group(\n", + " name=\"news_fg\",\n", + " embedding_index=emb,\n", + " primary_key=[\"news_id\"],\n", + " version=version,\n", + " online_enabled=True\n", + ")\n", + "\n", + "news_fg.insert(df, write_options={\"start_offline_materialization\": False})" + ] + }, + { + "cell_type": "markdown", + "id": "db6b194b", + "metadata": {}, + "source": [ + "## Search News" + ] + }, + { + "cell_type": "markdown", + "id": "7fc8854e", + "metadata": {}, + "source": [ + "Once the data are ingested into Hopsworks, you can search news by giving a news description. The news description first needs to be encoded by the same LM you used to encode the news. You can get back the model in the model registry from the embedding feature. And then you can search news which are similar to the description using kNN search functionality provided by the feature group." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ede462ef", + "metadata": {}, + "outputs": [], + "source": [ + "# set the logging level to WARN to avoid INFO message\n", + "import logging\n", + "logging.getLogger().setLevel(logging.WARN)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f8d7df08", + "metadata": {}, + "outputs": [], + "source": [ + "news_description = \"news about europe\"" + ] + }, + { + "cell_type": "markdown", + "id": "ba2b2bde", + "metadata": {}, + "source": [ + "You can get back the model file from embedding feature object, and load the model file back to the embedding model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0f9f2cd4", + "metadata": {}, + "outputs": [], + "source": [ + "hsml_model = news_fg.embedding_index.get_embedding(\"embedding_heading\").model\n", + "local_model_path = hsml_model.download()\n", + "with open(f\"{local_model_path}/{hsml_model.name}.pkl\", 'rb') as f:\n", + " loaded_model = pickle.load(f)" + ] + }, + { + "cell_type": "markdown", + "id": "401915b0", + "metadata": {}, + "source": [ + "You can search for news whose heading is similar to the description." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7356be82", + "metadata": {}, + "outputs": [], + "source": [ + "results = news_fg.find_neighbors(loaded_model.encode(news_description), k=3, col=\"embedding_heading\")\n", + "# print out the heading\n", + "for result in results:\n", + " print(result[1][2])" + ] + }, + { + "cell_type": "markdown", + "id": "24e70246", + "metadata": {}, + "source": [ + "Alternatively, you can search for news whose body is similar to the description and filter by news type." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32d954bc", + "metadata": {}, + "outputs": [], + "source": [ + "results = news_fg.find_neighbors(loaded_model.encode(news_description), k=3, col=\"embedding_body\",\n", + " filter=news_fg.newstype == \"business\")\n", + "# print out the heading\n", + "for result in results:\n", + " print(result[1][2])" + ] + }, + { + "cell_type": "markdown", + "id": "c0938afc", + "metadata": {}, + "source": [ + "## Next step" + ] + }, + { + "cell_type": "markdown", + "id": "28ffda57", + "metadata": {}, + "source": [ + "Now you are able to search articles using natural language. You can learn how to rank the results in [this tutorial]() and learn best practices in the [guide]()." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/api_examples/hopsworks/vector_similarity_search/news-search-knn.ipynb b/api_examples/hopsworks/vector_similarity_search/news-search-knn.ipynb new file mode 100644 index 00000000..988056c7 --- /dev/null +++ b/api_examples/hopsworks/vector_similarity_search/news-search-knn.ipynb @@ -0,0 +1,348 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "719b9af9-7375-4679-8e0a-8ae745805320", + "metadata": {}, + "source": [ + "# Requirements: Hopsworks 3.7+\n", + "\n", + "WARNING: this notebook does not currently work with serverless Hopsworks." + ] + }, + { + "cell_type": "markdown", + "id": "8b0ba628", + "metadata": {}, + "source": [ + "# News search using kNN in Hopsworks" + ] + }, + { + "cell_type": "markdown", + "id": "d727fa41", + "metadata": {}, + "source": [ + "In this tutorial, you are going to learn how to create a news search application which allows you to search news using natural language. You will create embeddings for the news and search for news similar to a given description using kNN search over those embeddings. The steps include:\n", + "1. Load news data\n", + "2. Create embeddings for news heading and news body\n", + "3. Ingest the news data and embeddings into Hopsworks\n", + "4. 
Search news using Hopsworks" + ] + }, + { + "cell_type": "markdown", + "id": "32974a73", + "metadata": {}, + "source": [ + "## Load news data" + ] + }, + { + "cell_type": "markdown", + "id": "840cc4e5", + "metadata": {}, + "source": [ + "First, you need to load the news articles downloaded from [Kaggle news articles](https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles).\n", + "Since creating embeddings for the full news is time-consuming, here we sample some articles." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "776aaa6d-b2b3-4c6a-9814-1dcc547f0a36", + "metadata": {}, + "outputs": [], + "source": [ + "#!pip install hsfs==3.7.0rc5 -q\n", + "#!pip install hopsworks==3.7.0rc1 -q\n", + "#!pip install sentence_transformers -q" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d4062d8", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from sentence_transformers import SentenceTransformer\n", + "import logging\n", + "import hopsworks\n", + "from hsfs import embedding" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "28f3cca3-ce77-4b31-b837-4c247d28cfbb", + "metadata": {}, + "outputs": [], + "source": [ + "df_all = pd.read_csv(\"https://repo.hops.works/dev/jdowling/Articles.csv\", encoding='utf-8', encoding_errors='ignore')\n", + "df = df_all.sample(n=300).reset_index().drop([\"index\"], axis=1)\n", + "df[\"news_id\"] = list(range(len(df)))" + ] + }, + { + "cell_type": "markdown", + "id": "ddea5ab0", + "metadata": {}, + "source": [ + "## Create embeddings" + ] + }, + { + "cell_type": "markdown", + "id": "b43d7b68", + "metadata": {}, + "source": [ + "Next, you need to create embeddings for heading and body of the news. The embeddings will then be used for kNN search against the embedding of the news description you want to search. Here we use a light weighted language model (LM) which encodes the news into embeddings. 
You can use any other language model, including an LLM (Llama, Mistral)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "017ae8a7", + "metadata": {}, + "outputs": [], + "source": [ + "model = SentenceTransformer('all-MiniLM-L6-v2')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "245ee674", + "metadata": {}, + "outputs": [], + "source": [ + "# truncate the body to 100 characters\n", + "embeddings_body = model.encode([body[:100] for body in df[\"Article\"]])\n", + "embeddings_heading = model.encode(df[\"Heading\"])\n", + "df[\"embedding_heading\"] = pd.Series(embeddings_heading.tolist())\n", + "df[\"embedding_body\"] = pd.Series(embeddings_body.tolist())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "af272d56", + "metadata": {}, + "outputs": [], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "1ca7d180", + "metadata": {}, + "source": [ + "## Ingest into Hopsworks" + ] + }, + { + "cell_type": "markdown", + "id": "4edaa1d3", + "metadata": {}, + "source": [ + "You need to ingest the data into Hopsworks, so that it is stored and indexed. First, you log in to Hopsworks and prepare the feature store." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "88f99b8b", + "metadata": {}, + "outputs": [], + "source": [ + "proj = hopsworks.login()\n", + "fs = proj.get_feature_store()" + ] + }, + { + "cell_type": "markdown", + "id": "53ca5b13", + "metadata": {}, + "source": [ + "Next, as embeddings are stored in an index in the backing vector database, you need to specify the index name and the embedding features in the dataframe. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4994b30b", + "metadata": {}, + "outputs": [], + "source": [ + "version = 1\n", + "emb = embedding.EmbeddingIndex(index_name=f\"news_fg_{version}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8bbd31bc-b806-41e5-95a6-0dc510e2fe99", + "metadata": {}, + "outputs": [], + "source": [ + "# specify the name and dimension of the embedding features \n", + "emb.add_embedding(\"embedding_body\", model.get_sentence_embedding_dimension())\n", + "emb.add_embedding(\"embedding_heading\", model.get_sentence_embedding_dimension())" + ] + }, + { + "cell_type": "markdown", + "id": "755be3cb", + "metadata": {}, + "source": [ + "Next, you create a feature group with the `embedding_index` and ingest data into the feature group." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2fa6af0", + "metadata": {}, + "outputs": [], + "source": [ + "news_fg = fs.get_or_create_feature_group(\n", + " name=\"news_fg\",\n", + " embedding_index=emb,\n", + " primary_key=[\"news_id\"],\n", + " version=version,\n", + " online_enabled=True\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9f82da02-0e37-4907-8790-a68145253845", + "metadata": {}, + "outputs": [], + "source": [ + "news_fg.insert(df, write_options={\"start_offline_materialization\": False})" + ] + }, + { + "cell_type": "markdown", + "id": "508ae2c4", + "metadata": {}, + "source": [ + "## Search News" + ] + }, + { + "cell_type": "markdown", + "id": "baa6d6c3", + "metadata": {}, + "source": [ + "Once the data are ingested into Hopsworks, you can search news by giving a news description. The news description first needs to be encoded by the same LM you used to encode the news. And then you can search news which are similar to the description using kNN search functionality provided by the feature group." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5e114be5", + "metadata": {}, + "outputs": [], + "source": [ + "# set the logging level to WARN to avoid INFO messages\n", + "logging.getLogger().setLevel(logging.WARN)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "343bcbcf", + "metadata": {}, + "outputs": [], + "source": [ + "news_description = \"news about europe\"" + ] + }, + { + "cell_type": "markdown", + "id": "aa0d15d3", + "metadata": {}, + "source": [ + "You can search for news whose heading is similar to the description." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a73ea59", + "metadata": {}, + "outputs": [], + "source": [ + "results = news_fg.find_neighbors(model.encode(news_description), k=3, col=\"embedding_heading\")\n", + "# print out the heading\n", + "for result in results:\n", + " print(result[1][2])" + ] + }, + { + "cell_type": "markdown", + "id": "c2c4f5fc", + "metadata": {}, + "source": [ + "Alternatively, you can search for news whose body is similar to the description and filter by news type." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a3c31b8", + "metadata": {}, + "outputs": [], + "source": [ + "results = news_fg.find_neighbors(model.encode(news_description), k=3, col=\"embedding_body\",\n", + " filter=news_fg.newstype == \"business\")\n", + "# print out the heading\n", + "for result in results:\n", + " print(result[1][2])" + ] + }, + { + "cell_type": "markdown", + "id": "5cf246b3", + "metadata": {}, + "source": [ + "## Next step" + ] + }, + { + "cell_type": "markdown", + "id": "eb82cba1", + "metadata": {}, + "source": [ + "Now you are able to search articles using natural language. You can learn how to rank the results in [this tutorial]() and learn best practices in the [guide]()." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/api_examples/hopsworks/vector_similarity_search/news-search-rank-view.ipynb b/api_examples/hopsworks/vector_similarity_search/news-search-rank-view.ipynb new file mode 100644 index 00000000..92d8a1f0 --- /dev/null +++ b/api_examples/hopsworks/vector_similarity_search/news-search-rank-view.ipynb @@ -0,0 +1,356 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0a5e5c4d", + "metadata": {}, + "source": [ + "# Ranking of news search results" + ] + }, + { + "cell_type": "markdown", + "id": "8988ff65", + "metadata": {}, + "source": [ + "In the previous tutorial, you learned how to search news using natural language. To make the search results more useful, you will learn how to rank them in this tutorial. We will use the number of views as the score of a news article, as it represents the popularity of the article. The steps include:\n", + "1. Create a view count feature group with a sample view count dataset\n", + "2. Create a feature view that joins the news feature group and the view count feature group\n", + "3. 
Search news and rank them by view count" + ] + }, + { + "cell_type": "markdown", + "id": "6bafdd57", + "metadata": {}, + "source": [ + "## Create a view count feature group" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "246b6bb7", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "id": "79a4dc31", + "metadata": {}, + "source": [ + "First you create a sample view count dataset of the size of news feature group." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea809cb3", + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "num_news = 300\n", + "df_view = pd.DataFrame({\"news_id\": list(range(num_news)), \"view_cnt\": [random.randint(0, 100) for i in range(num_news)]})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "139c5c0f", + "metadata": {}, + "outputs": [], + "source": [ + "version = 1" + ] + }, + { + "cell_type": "markdown", + "id": "4d7d216e", + "metadata": {}, + "source": [ + "Then you create a view count feature group and ingest the data into Hopsworks." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b007629f", + "metadata": {}, + "outputs": [], + "source": [ + "import hopsworks\n", + "proj = hopsworks.login()\n", + "fs = proj.get_feature_store()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fbacf4c2", + "metadata": {}, + "outputs": [], + "source": [ + "view_fg = fs.get_or_create_feature_group(\n", + " name=\"view_fg\",\n", + " primary_key=[\"news_id\"],\n", + " version=version,\n", + " online_enabled=True,\n", + ")\n", + "\n", + "view_fg.insert(df_view, write_options={\"start_offline_materialization\": False})" + ] + }, + { + "cell_type": "markdown", + "id": "43790e1d", + "metadata": {}, + "source": [ + "## Create a feature view " + ] + }, + { + "cell_type": "markdown", + "id": "2aab5e57", + "metadata": {}, + "source": [ + "You need to first get back the news feature group created before for the creation of feature view." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fad86561", + "metadata": {}, + "outputs": [], + "source": [ + "fg = news_fg = fs.get_or_create_feature_group(\n", + " name=\"news_fg\",\n", + " version=1\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2ddecaa0", + "metadata": {}, + "source": [ + "Now, you create a feature view by joining the news feature group and the view count feature group. Here, you select the heading, and the view count for ranking." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc06c04a", + "metadata": {}, + "outputs": [], + "source": [ + "fv = fs.get_or_create_feature_view(\n", + " \"news_view\", version=version,\n", + " query=news_fg.select([\"heading\"]).join(view_fg.select([\"view_cnt\"]))\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "17fe1695", + "metadata": {}, + "source": [ + "## Search news and rank " + ] + }, + { + "cell_type": "markdown", + "id": "29f9b343", + "metadata": {}, + "source": [ + "As in the previous tutorial, the news description first needs to be encoded by the same LM you used to encode the news. The resulting embedding can then be used to search for similar news using the feature view." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b60e0804", + "metadata": {}, + "outputs": [], + "source": [ + "news_description = \"news about europe\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fcd5b0e6-ca97-480b-8205-cc16a89b7f2b", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install sentence_transformers -q" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41b60142", + "metadata": {}, + "outputs": [], + "source": [ + "from sentence_transformers import SentenceTransformer\n", + "model = SentenceTransformer('all-MiniLM-L6-v2')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "73a31f10", + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "logging.getLogger().setLevel(logging.WARN)" + ] + }, + { + "cell_type": "markdown", + "id": "4afe07d9", + "metadata": {}, + "source": [ + "Define some helper functions which sort and print news results." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3da47a9", + "metadata": {}, + "outputs": [], + "source": [ + "def print_news(feature_vectors):\n", + " for feature_vector in feature_vectors:\n", + " print(feature_vector)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f7c84aa", + "metadata": {}, + "outputs": [], + "source": [ + "def print_sort_news(feature_vectors):\n", + " # sort the articles by view count\n", + " print(\"Ranked result:\")\n", + " feature_vectors = sorted(feature_vectors, key=lambda x: x[1]*-1)\n", + " print_news(feature_vectors)" + ] + }, + { + "cell_type": "markdown", + "id": "f5cae7d7", + "metadata": {}, + "source": [ + "Now, you can see the top k results returned by the feature view, which are the headings and the view count. You can also see the ranked results by view count of the top k results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e27d333", + "metadata": {}, + "outputs": [], + "source": [ + "feature_vectors = fv.find_neighbors(model.encode(news_description), k=5, feature=news_fg.embedding_heading)\n", + "print_news(feature_vectors)\n", + "print_sort_news(feature_vectors)" + ] + }, + { + "cell_type": "markdown", + "id": "268906fb", + "metadata": {}, + "source": [ + "Like the feature group, you can filter results in `find_neighbors` in feature view. You can also use multiple filtering conditions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4054191b", + "metadata": {}, + "outputs": [], + "source": [ + "feature_vectors = fv.find_neighbors(model.encode(news_description), k=5, \n", + " filter=((news_fg.newstype == \"sports\") & (news_fg.article.like(\"europe\"))),\n", + " feature=news_fg.embedding_heading)\n", + "print_news(feature_vectors)\n", + "print_sort_news(feature_vectors)" + ] + }, + { + "cell_type": "markdown", + "id": "15e9480b", + "metadata": {}, + "source": [ + "You can also retrieve a result by providing its primary key, which is the news id." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8757fab", + "metadata": {}, + "outputs": [], + "source": [ + "feature_vectors = fv.get_feature_vector({\"news_id\": 10})\n", + "print_news([feature_vectors])" + ] + }, + { + "cell_type": "markdown", + "id": "057aa05d", + "metadata": {}, + "source": [ + "## Next step" + ] + }, + { + "cell_type": "markdown", + "id": "7b911f73", + "metadata": {}, + "source": [ + "Now you are able to search articles and rank them by view count. You may be wondering why the view count is not stored in the news feature group. You can find the answer and other best practices in the [guide]()." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55e58363", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.8" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/api_examples/hopsworks/vector_similarity_search/requirements.txt b/api_examples/hopsworks/vector_similarity_search/requirements.txt new file mode 100644 index 00000000..f59b13f2 --- /dev/null +++ b/api_examples/hopsworks/vector_similarity_search/requirements.txt @@ -0,0 +1,3 @@ +hsfs==3.7.0rc5 +hopsworks==3.7.0rc1 +sentence_transformers