Workshop: Create a Powerful Movie Search Tool in Python with Elasticsearch 8 and Semantic Embeddings
Discover the art of building a robust movie search system in Python as we dive into the captivating world of Elasticsearch 8.
Gain hands-on experience with Elasticsearch's search tools, then improve result quality by integrating semantic vectors into Elasticsearch indices, so users can find their favorite items even when their wording differs from the indexed text. You will also implement these concepts in Python projects using the elasticsearch-dsl package.
The workshop will explore:
- Different field types and available queries for traditional search implementation in Elasticsearch
- Defining and creating indices in Elasticsearch from a Python project
- Examples of creating and executing queries in Python
- What semantic vectors are and how we can use them
- Including semantic vectors in Elasticsearch using dense_vector fields and KNN queries
- Implementing vector-based searches using KNN in Elasticsearch from Python
- Developing a Python search system that combines traditional and KNN search (hybrid search) on a dataset
To participate in this workshop, the only two requirements are having docker and docker-compose installed on your machine. You can install both tools following the provided links.
In this section, we will build all the containers required for the workshop and learn how to interact with them from the terminal. If you are new to Docker, don't worry, the commands we need to execute are simple.
- Clone the repository to your machine
git clone git@github.com:xmartlabs/pycon-es-workshop.git
- Build the containers using docker-compose
docker-compose build
- Interact with the backend container
# Open a bash terminal in the backend container
docker-compose run backend bash
# Open a python shell
python
# Interact with the shell :)
In this workshop, we will be working with a movies dataset, which is available in the file movie_features.parquet. To make interaction with the dataset easier, we have provided the DatasetManager class with some helpful functions. Let's explore some of these functions by executing some commands.
- Import the manager class
from managers import DatasetManager
dm = DatasetManager()
- Get the movies dataset and analyze some data
ds = dm._get_movies_dataset()
type(ds)
first_row = next(ds)
type(first_row)
first_row
second_row = next(ds)
second_row
third_row = next(ds)
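Because `ds` behaves like an iterator (each `next(ds)` call consumes one row), peeking at several rows at once is easier with `itertools.islice`. The sketch below uses a hypothetical stand-in generator, `fake_movies`, instead of the real dataset, so the row fields shown are assumptions rather than the actual parquet schema.

```python
from itertools import islice


def fake_movies():
    """Hypothetical stand-in for the dataset iterator; fields are illustrative only."""
    for i in range(1, 6):
        yield {"id": i, "title": f"Movie {i}"}


ds = fake_movies()

# Take the first three rows without writing three separate next() calls.
first_three = list(islice(ds, 3))
print(first_three)

# The iterator remembers its position: the next row is the fourth one.
fourth = next(ds)
print(fourth)
```

The same pattern should work on the object returned by the dataset manager, as long as it supports `next()` as shown above.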
To facilitate interaction with Elasticsearch from Python, we have provided the ESManager class. This class allows us to create and modify indexes within the cluster. Let's explore and test some examples using this class:
- Import the manager and initialize it with the index name workshop-v1.
from managers import ESManager
em = ESManager('workshop-v1')
- Create an index using the default implementation provided:
em.create_index()
- Verify that the index was created successfully using Kibana. Remember to start the Kibana container first:
docker-compose up -d kibana
Then, in your browser, navigate to http://localhost:5601, go to the Dev Tools section, and execute the following commands:
GET _cat/indices
GET workshop-v1/_mapping
- Import some data into the index. We have included the function import_movies_into_es in the DatasetManager to accomplish this. Let's start by uploading only 100 movies.
from managers import ESManager
from managers import DatasetManager
em = ESManager('workshop-v1')
dm = DatasetManager()
dm.import_movies_into_es(em, 100)
- Verify the data was successfully imported by executing some queries in Kibana
GET workshop-v1/_search
{
  "query": {
    "match_all": {}
  },
  "size": 20
}
GET workshop-v1/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "Tom",
            "fields": [
              "title"
            ]
          }
        }
      ]
    }
  },
  "size": 20
}
- Feel free to explore and experiment with different queries and index modifications using the ESManager class.
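The `multi_match` request above can also be assembled as a plain Python dictionary, which is handy when the search text or the field list comes from user input. This is a minimal sketch using only the standard library; `build_multi_match_body` is a hypothetical helper, not part of the workshop's manager classes.

```python
import json


def build_multi_match_body(text, fields, size=20):
    """Return a search body equivalent to the Kibana multi_match example."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": text, "fields": list(fields)}}
                ]
            }
        },
        "size": size,
    }


body = build_multi_match_body("Tom", ["title"])
# Pretty-print the body so it can be pasted into Kibana Dev Tools.
print(json.dumps(body, indent=2))
```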
Great! Now that we know how to use the DatasetManager and ESManager helper classes, let's modify the existing index by introducing some new fields: genres, director, protagonists, and overview. Here's how to do it:
- To incorporate the new fields, modify the get_document_definition function within the ESManager class. Consider which field class would best suit the data structure before proceeding. If you are unsure about the appropriate field classes, don't worry: we've provided a potential solution in the get_document_definition_solution_task_1 function inside the solution folder.
- Once you've made the necessary modifications to the function, you can test it by creating a new index:
from managers import ESManager
em = ESManager('workshop-v2')
em.create_index()
You can verify the index was created correctly using Kibana, just as we did before.
- Next, let's import some data into the new index. Modify the _map_movie_into_es function inside the DatasetManager class. We have also provided a tentative solution in the solution folder :)
- With the function modified, you can now import data into the new index:
from managers import ESManager
from managers import DatasetManager
em = ESManager('workshop-v2')
dm = DatasetManager()
dm.import_movies_into_es(em, 100)
- Now, let's query some data from the new index. Use the following queries:
GET workshop-v2/_search
{
  "query": {
    "match_all": {}
  },
  "size": 20
}
GET workshop-v2/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "Tom",
            "fields": [
              "title",
              "overview"
            ]
          }
        }
      ]
    }
  },
  "size": 20
}
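When choosing field types in the first step of this task, the main decision is between analyzed `text` fields (full-text matching) and `keyword` fields (exact values, suited to filters). The raw mapping below is one possible choice written as a Python dictionary. It is not the workshop's official solution, and the type assigned to each field is an assumption.

```python
import json

# One possible mapping for the new fields (an assumption, not the official solution).
# "text" fields are analyzed for full-text search; "keyword" fields keep exact values,
# which suits filtering and aggregations.
movie_properties = {
    "title": {"type": "text"},
    "overview": {"type": "text"},
    "genres": {"type": "keyword"},
    "director": {"type": "text"},
    "protagonists": {"type": "text"},
}

index_body = {"mappings": {"properties": movie_properties}}
print(json.dumps(index_body, indent=2))
```

Any Elasticsearch field can hold a list of values, so a movie with several genres needs no special array type.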
To execute the query we used in Kibana directly from Python using the elasticsearch-dsl package, we can use the SearchManager helper class. This class allows you to construct and execute Elasticsearch queries in Python. Here's an example of how to use it:
from managers import ESManager, SearchManager
em = ESManager('workshop-v2')
sm = SearchManager()
results, query = sm.execute_traditional_search("Tom", em)
for r in results:
    sm.print_result(r)
sm.print_query(query)
Let's modify the execute_traditional_search function by introducing a filter on the Adventure genre.
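In raw query terms, filtering by genre means adding a `filter` clause next to `must` inside the `bool` query. The sketch below builds such a body as a dictionary. It assumes `genres` is indexed as a `keyword` field holding values such as `Adventure`, and `with_genre_filter` is a hypothetical helper rather than part of SearchManager.

```python
def with_genre_filter(text, genre, fields=("title", "overview"), size=20):
    """Build a bool query: full-text match in `must`, exact genre match in `filter`."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": text, "fields": list(fields)}}
                ],
                # Filter clauses restrict the result set without affecting the score.
                # A `term` query expects an exact value, so this assumes a keyword field.
                "filter": [{"term": {"genres": genre}}],
            }
        },
        "size": size,
    }


body = with_genre_filter("Tom", "Adventure")
print(body["query"]["bool"]["filter"])
```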
Excellent! We have developed an outstanding product. However, what happens when queries include words that are not present in the dataset? This is where embeddings and KNN (K-Nearest Neighbors) queries in Elasticsearch come into play. To support this functionality, we have introduced a new container called embeddings-generator, which implements a basic embedding generation capability. In line with the workshop's approach, we have also created a helper class called EmbeddingsManager that simplifies the interaction with the container. Let's explore some examples.
Before proceeding, please uncomment the embeddings-generator section in the docker-compose file and restart all the containers.
from managers import EmbeddingsManager, EmbeddingTypes
em = EmbeddingsManager()
query_vector_sym = em.get_embeddings_for_text('cats', EmbeddingTypes.SYMMETRIC)
print(query_vector_sym)
query_vector_asym = em.get_embeddings_for_text('cats', EmbeddingTypes.ASYMMETRIC)
print(query_vector_asym)
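To get a feel for how such vectors are compared, the sketch below computes the dot product and the cosine similarity of small hand-written vectors. The three-dimensional vectors are made up for illustration; real embeddings returned by the container have many more dimensions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot product divided by both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up toy vectors, not real embeddings.
cats = [1.0, 2.0, 0.0]
kittens = [2.0, 4.0, 0.0]   # same direction as `cats`, twice the length
cars = [0.0, 0.0, 3.0]      # orthogonal to `cats`

print(cosine_similarity(cats, kittens))  # same direction, so close to 1.0
print(cosine_similarity(cats, cars))     # orthogonal, so 0.0
print(dot(cats, kittens))                # the dot product also grows with vector length
```

Cosine similarity ignores vector length and looks only at direction, while the dot product is sensitive to both.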
Let's modify the existing index by introducing two new fields: sbert_symmetric_overview_embedding and sbert_asymmetric_overview_embedding.
Here are some important considerations to keep in mind:
- The field type for both sbert_symmetric_overview_embedding and sbert_asymmetric_overview_embedding should be set to dense_vector. This allows Elasticsearch to store dense vector representations.
- Review the size required for the vectors.
- Take into account that symmetric vectors tend to perform better with the cosine similarity metric, while asymmetric vectors tend to perform better with the dot product metric. Consider these differences when selecting the appropriate similarity metric for each field.
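Put together, the considerations above translate into a mapping along these lines. This is a hedged sketch of the raw Elasticsearch mapping as a Python dictionary, not the provided solution. The dimension value `384` is a placeholder assumption and must be replaced with the actual length of the vectors returned by the embeddings container; `dense_vector_field` is a hypothetical helper.

```python
def dense_vector_field(dims, similarity):
    """Mapping for an indexed dense_vector field usable in KNN search."""
    return {
        "type": "dense_vector",
        "dims": dims,          # must equal the length of every stored vector
        "index": True,         # required for approximate KNN search
        "similarity": similarity,
    }


EMBEDDING_DIMS = 384  # placeholder assumption; check len() of a real embedding

embedding_properties = {
    "sbert_symmetric_overview_embedding": dense_vector_field(EMBEDDING_DIMS, "cosine"),
    "sbert_asymmetric_overview_embedding": dense_vector_field(EMBEDDING_DIMS, "dot_product"),
}
print(embedding_properties)
```

Note that Elasticsearch's `dot_product` similarity expects unit-length vectors, so it is worth checking whether the asymmetric embeddings are normalized before choosing it.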
- Modify the function get_document_definition for the ESManager class and introduce the new fields. Same as before, a possible solution exists in the function get_document_definition_solution_exercise_3.
- Once you've made the necessary modifications to the function, you can test it by creating a new index:
from managers import ESManager
em = ESManager('workshop-v3')
em.create_index()
You can check the index was correctly created using Kibana as we did before.
- Next, let's import some data into the new index. Modify the _map_movie_into_es function inside the DatasetManager class. We have also provided a tentative solution in the solution folder :)
- With the function modified, you can import data into the new index:
from managers import ESManager
from managers import DatasetManager
em = ESManager('workshop-v3')
dm = DatasetManager()
dm.import_movies_into_es(em, 20)
- Finally, let's query some data from the new index. Use the following queries:
GET workshop-v3/_search
{
  "query": {
    "match_all": {}
  },
  "size": 20
}
Okay, now that we have everything set up, let's discuss how we can use these embeddings to retrieve data and perform a KNN search query. You don't need to worry about implementing it yourself, as we already have the solution for you. Simply copy the _get_knn_search and execute_knn_search functions from the solution folder and paste them inside the SearchManager class.
Once you've completed that step, we can proceed to execute some queries using the KNN search functionality.
from managers import EmbeddingsManager, EmbeddingTypes
from managers import ESManager, SearchManager
em = ESManager('workshop-v3')
sm = SearchManager()
results, query = sm.execute_knn_search('War', em, EmbeddingTypes.SYMMETRIC)
for r in results:
    sm.print_result(r)
sm.print_query(query)
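For reference, recent Elasticsearch 8 releases express a KNN request through a top-level `knn` section in the search body. The sketch below builds one as a dictionary. It is not the code from the solution folder, `build_knn_body` is a hypothetical helper, and the short query vector is made up (a real one would come from EmbeddingsManager).

```python
def build_knn_body(field, query_vector, k=10, num_candidates=100):
    """Build a search body with a top-level knn section."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,                            # neighbors returned overall
            "num_candidates": num_candidates,  # candidates examined per shard
        },
        "size": k,
    }


# Made-up vector for illustration only.
body = build_knn_body("sbert_symmetric_overview_embedding", [0.1, 0.2, 0.3], k=5)
print(body["knn"]["k"], body["knn"]["num_candidates"])
```

Raising `num_candidates` generally improves accuracy at the cost of slower searches.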
You might wonder: can I integrate both approaches? Absolutely! To explore this integration, refer to the execute_hybrid_search function in the solution folder. This function combines embeddings with traditional search techniques, enabling you to perform hybrid searches. Feel free to test it out and see the results for yourself.
from managers import EmbeddingsManager, EmbeddingTypes
from managers import ESManager, SearchManager
em = ESManager('workshop-v3')
sm = SearchManager()
results, query = sm.execute_hybrid_search('War', em, EmbeddingTypes.SYMMETRIC)
for r in results:
    sm.print_result(r)
sm.print_query(query)
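Conceptually, a hybrid request places a regular `query` and a `knn` section in the same search body, and Elasticsearch combines their scores. The sketch below shows that shape with per-clause `boost` values. It is an illustration rather than the execute_hybrid_search implementation, and the helper name, the boost values, and the toy vector are all assumptions.

```python
def build_hybrid_body(text, query_vector, text_boost=0.5, knn_boost=0.5, k=10):
    """Combine a full-text match and a KNN clause in one search body."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["title", "overview"],
                "boost": text_boost,  # weight of the traditional score
            }
        },
        "knn": {
            "field": "sbert_symmetric_overview_embedding",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": 10 * k,
            "boost": knn_boost,  # weight of the vector similarity score
        },
        "size": k,
    }


body = build_hybrid_body("War", [0.1, 0.2, 0.3])
print(sorted(body))
```

Adjusting the two boosts shifts the ranking toward exact word matches or toward semantic similarity.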
You have successfully completed the "Create a Powerful Movie Search Tool in Python with Elasticsearch 8 and Semantic Embeddings" workshop. Thank you for participating! Throughout the workshop, you have learned several key topics, including:
- Managing ES Indexes from Python
- Implementing traditional queries from Python using the elasticsearch-dsl package
- An introduction to semantic vectors and their role in enhancing search capabilities using KNN queries
- A workaround to use the elasticsearch-dsl package with newer versions of Elasticsearch without encountering compatibility issues
If you are interested in further expanding your knowledge in this area, here are some ideas you can explore:
- Building more complex queries: Take the queries developed in the workshop and make them more sophisticated. Explore the various features and capabilities offered by Elasticsearch to achieve even better search results.
- Testing different embedding models: Experiment with alternative embedding models (for example, those provided by OpenAI) that may offer improved performance compared to the one used in the workshop. These models can be utilized in the latest version of Elasticsearch (8.8.0) to enhance the search experience further.
- Deploying your model directly in Elasticsearch: Explore the possibility of deploying your own custom-trained embedding model directly within Elasticsearch. This allows you to leverage your model's unique capabilities and tailor it specifically to your search requirements.
By delving deeper into these areas, you can enhance your expertise and leverage Elasticsearch to build even more powerful and customized search applications.
At Xmartlabs we love to share our knowledge through our open source work. Feel free to check out our GitHub profile and contribute in any way you see fit. You can also explore our blog, where we regularly post new insights and discoveries. See you there!