- Overview
- 🐍 Build the app with Python
- 🔮 Overview of the files in this example
- 🌀 Flow diagram
- 🔨 Next steps, building your own app
- 🐳 Deploy the prebuilt application using Docker
- 🙍 Community
- 🦄 License
Summary | This showcases a semantic text search app |
Data for indexing | Wikipedia corpus |
Data for querying | A text sentence |
Dataset used | Kaggle Wikipedia corpus |
ML model used | distilbert-base-uncased |
This example shows you how to build a simple semantic search app powered by Jina's neural search framework. You can index and search text sentences from Wikipedia using distilbert-base-uncased, a state-of-the-art language model from the Transformers library.
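To give a rough feel for what "semantic search" means here, the following is a minimal sketch that ranks sentences by the similarity of their embedding vectors. It assumes cosine similarity as the metric and uses hand-made three-dimensional vectors; in the real example the vectors come from the language model and the ranking is done by Jina's indexer, whose metric is not stated in this README. The function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(
    query_vec: List[float], index: Dict[str, List[float]], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Return the top_k (sentence, score) pairs, most similar first."""
    scored = [(sentence, cosine_similarity(query_vec, vec)) for sentence, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hand-made toy "embeddings", purely for illustration (not model output).
toy_index = {
    "A sentence about astronomy": [0.9, 0.1, 0.0],
    "A sentence about web hosting": [0.0, 0.8, 0.2],
    "A sentence about music": [0.1, 0.1, 0.9],
}

results = top_k_matches([1.0, 0.0, 0.0], toy_index, top_k=2)
for sentence, score in results:
    print(f"{score:.3f}  {sentence}")
```

The query vector points in the same direction as the astronomy sentence, so that sentence ranks first.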
These instructions explain how to build the example yourself and deploy it with Python. If you want to skip the building steps and just run the app, check out the Docker section below.
- You have a working Python 3.7 or 3.8 environment.
- We recommend creating a new Python virtual environment to have a clean installation of Jina and prevent dependency conflicts.
- You have at least 2 GB of free space on your hard drive.
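If you would like to check these prerequisites programmatically, a small standard-library sketch could look like the following. The helper names are hypothetical, and "2 GB" is interpreted as 2 GiB of free space on the current drive.

```python
import shutil
import sys

SUPPORTED_MINORS = (7, 8)            # Python 3.7 or 3.8, per the list above
REQUIRED_FREE_BYTES = 2 * 1024 ** 3  # "2 GB", interpreted here as 2 GiB


def python_version_ok(version_info=sys.version_info) -> bool:
    """True if the interpreter is Python 3.7 or 3.8."""
    return version_info[0] == 3 and version_info[1] in SUPPORTED_MINORS


def disk_space_ok(path: str = ".", required: int = REQUIRED_FREE_BYTES) -> bool:
    """True if the drive holding `path` has at least `required` bytes free."""
    return shutil.disk_usage(path).free >= required


def in_virtualenv() -> bool:
    """True if running inside a venv/virtualenv environment."""
    return (
        getattr(sys, "real_prefix", None) is not None
        or sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    )


print("Python version supported:", python_version_ok())
print("Enough free disk space:  ", disk_space_ok())
print("Inside a virtual env:    ", in_virtualenv())
```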
Begin by cloning the repo so you can get the required files and datasets. If you already have the examples repository on your machine, make sure to fetch the most recent version.
git clone https://github.com/jina-ai/examples
cd examples/wikipedia-sentences
In your terminal, you should now be located in the wikipedia-sentences folder. Let's install Jina and the other required Python libraries. For further information on installing Jina, check out our documentation.
pip install -r requirements.txt
If this command runs without any error messages, you can move on to step two.
To quickly get started, you can index a small dataset of 50 sentences to make sure everything is working correctly.
python app.py -t index
The relevant Jina code to index data, given your Flow's YAML definition, boils down to:
from jina import Flow
with Flow().load_config('flows/index.yml') as f:
    f.index_lines(filepath='data/toy-input.txt', read_mode='r', batch_size=16, num_docs=10)
The Flow will interpret each line in the txt file as one Document. You can limit the number of indexed Documents with the num_docs argument. If you see the following output, it means your data has been correctly indexed.
Flow@5162[S]:flow is closed and all resources are released, current build level is 0
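To make the line-per-Document behaviour concrete, here is a minimal standard-library sketch of how a text file can be turned into Documents, capped by num_docs and grouped by batch_size. This is not Jina's implementation: the helper names are hypothetical, and skipping empty lines is an assumption made for the illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def read_documents(lines: Iterable[str], num_docs: Optional[int] = None) -> Iterator[str]:
    """Yield one Document per non-empty line, stopping after num_docs lines."""
    # Assumption: blank lines are skipped; Jina's own reader may differ.
    non_empty = (line.strip() for line in lines if line.strip())
    return islice(non_empty, num_docs)  # num_docs=None means "no limit"


def batched(docs: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group Documents into lists of at most batch_size items."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 25 toy lines stand in for the contents of a text file.
sample_lines = [f"sentence {i}\n" for i in range(25)]
batches = list(batched(read_documents(sample_lines, num_docs=10), batch_size=4))
print([len(batch) for batch in batches])
```

With a real file, you could pass an open file handle (for example `open('data/toy-input.txt')`) in place of `sample_lines`, since iterating a file yields its lines.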
We recommend you come back to the indexing step later and run the full Wikipedia dataset for better results. To index the full dataset (almost 900 MB), follow these steps:
- Set up a Kaggle.com account
- Install the Kaggle Python library and set up your API credentials
- Run the script:
sh ./get_data.sh
- Set the input file:
export JINA_DATA_FILE='data/input.txt'
- Set the number of docs to index
export JINA_MAX_DOCS=30000
(or whatever number you prefer; the default is 50)
- Delete the old index:
rm -rf workspace
- Index your new dataset:
python app.py -t index
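The two environment variables in the steps above could be read along these lines. This is only a sketch of one plausible approach, not the code in app.py: the function name is hypothetical, and the fallback file path is an assumption taken from the toy dataset used earlier.

```python
import os
from typing import Mapping, Tuple

DEFAULT_DATA_FILE = "data/toy-input.txt"  # assumption: the toy file used earlier
DEFAULT_MAX_DOCS = 50                     # default stated in the steps above


def read_index_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Return (data_file, max_docs) from the environment, with defaults."""
    data_file = env.get("JINA_DATA_FILE", DEFAULT_DATA_FILE)
    max_docs = int(env.get("JINA_MAX_DOCS", DEFAULT_MAX_DOCS))
    if max_docs <= 0:
        raise ValueError("JINA_MAX_DOCS must be a positive integer")
    return data_file, max_docs


# Passing a plain dict makes the behaviour easy to try without exporting anything.
print(read_index_settings({"JINA_DATA_FILE": "data/input.txt", "JINA_MAX_DOCS": "30000"}))
print(read_index_settings({}))
```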
If you are using a subset of the data (fewer than 30,000 documents), we recommend you shuffle the data. This is because the input file is ordered alphabetically, and Jina indexes from the top down. Without shuffling, your index may contain unrepresentative data, like this:
0.000123, which corresponds to a distance of 705 Mly, or 216 Mpc.
000webhost is a free web hosting service, operated by Hostinger.
0010x0010 is a Dutch-born audiovisual artist, currently living in Los Angeles.
0-0-1-3 is an alcohol abuse prevention program developed in 2004 at Francis E. Warren Air Force Base based on research by the National Institute on Alcohol Abuse and Alcoholism regarding binge drinking in college students.
0.01 is the debut studio album of H3llb3nt, released on February 20, 1996 by Fifth Colvmn Records.
001 of 3 February 1997, which was signed between the Government of the Republic of Rwanda, and FAPADER.
003230 is a South Korean food manufacturer.
On Linux, you can shuffle the file in place using the shuf command (the -o option lets shuf write back to the file it reads from):
shuf -o input.txt < input.txt
To shuffle a file on macOS, please read this post.
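If you prefer a platform-independent route, the shuffle can also be sketched in Python using only the standard library. The helper name is hypothetical, and note that this version loads the whole file into memory, which may be a concern for the full dataset.

```python
import os
import random
import tempfile
from typing import Optional


def shuffle_file(path: str, seed: Optional[int] = None) -> None:
    """Shuffle the lines of a UTF-8 text file in place."""
    with open(path, "r", encoding="utf-8") as fh:
        lines = [line.rstrip("\n") for line in fh]
    random.Random(seed).shuffle(lines)
    # Write to a temporary file in the same folder and then swap it in,
    # so the original is never truncated before it has been fully read.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.writelines(line + "\n" for line in lines)
    os.replace(tmp_path, path)


# Usage (path as set in JINA_DATA_FILE above):
# shuffle_file("data/input.txt")
```

Passing a fixed `seed` makes the shuffle reproducible, which can help when comparing search results between runs.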
Jina offers several ways to search (query) your data. In this example, we show three of the most common ones. All three are optional; in a production environment, you would choose only the one that suits your use case best.
Begin by running the following command to open the REST API interface.
python app.py -t query_restful
You should open another terminal window and paste the following command.
curl --request POST -d '{"top_k": 5, "mode": "search", "data": ["hello world"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:45678/search'
Once you run this command, you should see a JSON output returned to you. This contains the five most semantically similar Wikipedia sentences to the text input you provided in the data parameter. Feel free to alter the text in the data parameter and play around with other queries! For a better understanding of the parameters, see the table below.
Parameter | Description |
---|---|
top_k | Integer determining the number of sentences to return |
mode | Mode to trigger in the call. See here for more details |
data | Text input to query |
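The same request can be sent from Python instead of curl. The sketch below uses only the standard library and mirrors the payload and endpoint of the curl command above; the function names are hypothetical, and the shape of the JSON response is not assumed here, so the result is returned as a plain dictionary.

```python
import json
import urllib.request
from typing import Any, Dict

SEARCH_URL = "http://0.0.0.0:45678/search"  # endpoint from the curl command above


def build_search_request(
    text: str, top_k: int = 5, url: str = SEARCH_URL
) -> urllib.request.Request:
    """Build the POST request with the same JSON body as the curl example."""
    payload = {"top_k": top_k, "mode": "search", "data": [text]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(
    text: str, top_k: int = 5, url: str = SEARCH_URL, timeout: float = 30.0
) -> Dict[str, Any]:
    """Send the query and return the decoded JSON response."""
    request = build_search_request(text, top_k, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with `python app.py -t query_restful` running in another terminal:
# print(json.dumps(search("hello world"), indent=2))
```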
Jina Box is a lightweight, highly customizable JavaScript-based front-end search interface. To use it for this example, begin by opening the REST API interface.
python app.py -t query_restful
In your browser, open the hosted Jina Box on jina.ai/jinabox.js. In the configuration bar on the left-hand side, choose a custom endpoint and enter the following: http://127.0.0.1:45678/search. You can then type search queries into the text box on the right-hand side!
You can also easily search (query) your data directly from the terminal. Using the following command will open an interface directly in your terminal window.
python app.py -t query
What if new data arrives that needs to be indexed? Many applications require incremental indexing: adding new data to an index without re-indexing the original data, so the index is not recalculated every time a couple of new Documents arrive. Jina provides a simple solution for this, which we demonstrate using a second small dataset. Just as before, index the first dataset, then incrementally index the second one:
from jina import Flow
with Flow().load_config('flows/index.yml') as f:
    f.index_lines(filepath='data/toy-input.txt', read_mode='r', batch_size=16, num_docs=10)
    f.index_lines(filepath='data/toy-input-incremental.txt', read_mode='r', batch_size=16, num_docs=10)
One challenge we need to address when incrementally adding new data to the index is duplication of Documents. Jina provides a DocCache Pod that is pre-configured for you and takes care of detecting duplicates when adding to the index. Finally, we add the DocCache Pod to the index Flow.
!Flow
version: '1'
pods:
  - name: encoder
    uses: pods/encode.yml
    timeout_ready: 1200000
    read_only: true
  - name: indexer
    uses_before: pods/index_cache.yml  # use before indexing to detect duplicates
    uses: pods/index.yml
As you can see, compared to the previous index Flow, we only needed to add one line to the YAML spec. To see the incremental indexing in action, run:
python app.py -t index_incremental
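The idea behind duplicate detection during incremental indexing can be illustrated with a small standalone sketch. This is not how Jina's DocCache is implemented: the class below is hypothetical and assumes, for illustration only, that duplicates are recognised by a hash of the Document text.

```python
import hashlib
from typing import Iterable, List, Set


class ToyDocCache:
    """Remembers a content hash for every Document that has been indexed."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def filter_new(self, docs: Iterable[str]) -> List[str]:
        """Return only the Documents not seen before, and remember them."""
        fresh = []
        for text in docs:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest not in self._seen:
                self._seen.add(digest)
                fresh.append(text)
        return fresh


cache = ToyDocCache()
first_batch = cache.filter_new(["alpha", "beta"])            # initial indexing
second_batch = cache.filter_new(["beta", "gamma", "gamma"])  # incremental indexing
print(first_batch)
print(second_batch)
```

Only Documents that pass the cache would be handed on to the indexer, which mirrors the role of the uses_before entry in the Flow above.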
Here is a small overview if you're interested in understanding what each file in this example is doing.
File | Explanation |
---|---|
📂 flows/ | Folder to store Flow configuration |
--- 📃 flows/index.yml | Contains the details of which Executors should be used for indexing your data. |
--- 📃 flows/query.yml | Contains the details of which Executors should be used for querying your data. |
--- 📃 flows/index_incremental.yml | Contains the details of which Pods are required for the incremental indexing. |
📂 pods/ | Folder to store Pod configurations |
--- 📃 pods/encode.yml | Specifies the configuration values for the encoding Executor. |
--- 📃 pods/index.yml | Specifies the configuration values for the indexing Executor. |
--- 📃 pods/index_cache.yml | Specifies the DocCache necessary for the incremental indexing. |
📂 test/* | Various maintenance tests to keep the example running. |
📃 app.py | The gateway code to combine the index and query Flow. |
📃 get_data.sh | Downloads the Kaggle dataset. |
📃 manifest.yml | Needed to deploy to Jina Hub. |
📃 requirements.txt | Contains all required Python libraries. |
This diagram provides a visual representation of the two Flows in this example, showing which Executors are used in which order.
Did you like this example, and are you interested in building your own? For a detailed tutorial on how to build your Jina app, check out the How to Build Your First Jina App guide in our documentation.
Warning! This section is not maintained, so we can't guarantee it works!
If you want to run this example quickly without installing Jina, you can do so via Docker. If you'd rather build the example yourself, return to the Python instructions above.
- You have Docker installed and working.
- You have at least 8 GB of free space on your hard drive.
We begin by running the following Docker command in the terminal. This will pull the prebuilt Docker image from Docker Hub and begin downloading the required files and data. To increase speed, this example only has 30,000 sentences indexed.
docker run -p 45678:45678 jinahub/app.example.wikipedia-sentences-30k:0.2.10-1.0.10
There are several ways for you to query data in Jina; for this example, we will use a curl command. Open another terminal window and paste the following command.
curl --request POST -d '{"top_k": 5, "mode": "search", "data": ["hello world"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:45678/api/search'
For a quick explanation of what some of these parameters mean, top_k tells the system how many documents to return. The data parameter contains the text input you want to query.
Once you run this command, you should see a JSON output returned to you. This contains the five most semantically similar documents to the text input you provided in the data field. Feel free to alter the text in the data field and play around with other queries!
- Slack channel - a communication platform for developers to discuss Jina
- Community newsletter - subscribe to the latest update, release and event news of Jina
- LinkedIn - get to know Jina AI as a company and find job opportunities
- follow us and interact with us using the hashtag #JinaSearch
- Company - know more about our company. We are fully committed to open-source!
Copyright (c) 2021 Jina AI Limited. All rights reserved.
Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.