Skip to content

Latest commit

 

History

History
 
 

multimodal-search-pdf

PDF search with Jina

Jina Jina Jina Jina Docs We are hiring tweet button Python 3.7 3.8 Docker

This example demonstrates how Jina can be used to search a repository of PDF files. The example employs a multimodal search architecture, allowing a user to query the data by providing text, or an image, or both simultaneously.

What's included in this example:

How to clone this repo❓

Because this repo contains many more example projects, if you want to clone only this example, you can do this as following

git clone --depth 1 --filter=blob:none --sparse https://github.com/jina-ai/examples
cd examples
git sparse-checkout set multimodal-search-pdf

Data preparation

We have included several PDF blogs as toy data in toy_data. This data is ready to use straight away. You can replace this toy data with your own by simply adding new files to the toy_data folder. Be careful to check that the files are supported by pdfplumber.

Install

pip install -r requirements.txt

Run

Command Description
python app.py -t index To index files/data
python app.py -t query To run a query Flow for searching text, image, and PDF
python app.py -t query_text To run a query Flow for searching text
python app.py -t query_image To run a query Flow for searching image
python app.py -t query_pdf To run a query Flow for searching PDF
python app.py -t query_restful To expose the restful API of the query Flow for searching text, image, and PDF

Start the Server

python app.py -t query_restful

Query via REST API

When the REST gateway is enabled, Jina uses the data URI scheme to represent multimedia data. Simply organize your picture(s) into this scheme and send a POST request to http://0.0.0.0:45670/api/search, e.g.:

curl --request POST -d '{"top_k": 10, "mode": "search",  "data": ["jina hello multimodal"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:45670/api/search'
The results will be like:
{
   "requestId":"b5813834-5f42-4f6c-a313-0f397ef5cf12",
   "search":{
      "docs":[
         {
            "id":"2cd45334-98dd-11eb-85a3-38f9d3eb0f3c",
            "weight":1.0,
            "matches":[
               {
                  "id":"f61571fc-9706-11eb-ba29-38f9d3eb0f3c",
                  "uri":"toy_data/blog1.pdf",
                  "mimeType":"application/pdf",
                  "score":{
                     "value":1.0,
                     "opName":"SimpleAggregateRanker",
                     "refId":"2cd45334-98dd-11eb-85a3-38f9d3eb0f3c"
                  },
                  "adjacency":1,
                  "contentHash":"3b46f0781f207674"
               },
               {
                  "id":"f618041c-9706-11eb-ba29-38f9d3eb0f3c",
                  "uri":"toy_data/blog3.pdf",
                  "mimeType":"application/pdf",
                  "score":{
                     "value":0.83538496,
                     "opName":"SimpleAggregateRanker",
                     "refId":"2cd45334-98dd-11eb-85a3-38f9d3eb0f3c"
                  },
                  "adjacency":1,
                  "contentHash":"89a45525161ec448"
               },
               {
                  "id":"f617ec70-9706-11eb-ba29-38f9d3eb0f3c",
                  "uri":"toy_data/blog2.pdf",
                  "mimeType":"application/pdf",
                  "score":{
                     "value":0.71500874,
                     "opName":"SimpleAggregateRanker",
                     "refId":"2cd45334-98dd-11eb-85a3-38f9d3eb0f3c"
                  },
                  "adjacency":1,
                  "contentHash":"debde7ffbb210b46"
               }
            ],
            "mimeType":"text/plain",
            "text":"jina hello multimodal",
            "contentHash":"f7beb6d4b839aa06"
         }
      ]
   }

JSON payload syntax and spec can be found in the docs.

This example shows you how to feed data into Jina via REST gateway. By default, Jina uses a gRPC gateway, which has much higher performance and rich features. If you are interested in that, go ahead and check out our other examples and read our documentation on Jina IO.

Understand the Flows

The following image shows the structure of index Flow, the text and image will be extracted from PDF files as chunks. Then we will use three paths to process and store the data.

  • For the first path, the DocIndexer will store the Document ID and Document data on disk
  • For the second path, the pods will filter the image chunks, and do the processing. It will also store chunk ID and chunk data on disk
  • For the third path, the pods will filter the text chunks, and further segment text into smaller chunks. The ChunkMeta indexer is used to store data at chunks level and text indexer is used to store data for chunks of chunks. The joiner will wait until three paths are finished.

Jina banner

The following image shows the structure of query Flow. For query, we can send text, image and PDF files to one Flow and get the results. We use two parallel paths to process the query data and get results.

  • For the first path, we will get embedding of the image data and use image indexer to get the most similar matches at chunks level.
  • For the second path, we will get embedding of the text data at chunks of chunks level in the text encoder. Then we use CCtoC ranker to get the matches from chunks of chunks (CC) level to chunks (C) level. The chunkmeta indexer can help to append meta data for matches of chunks. The ChunktoRoot Ranker is used to get the matches at root level and then we can use the matches ID to get the documents data.

Jina banner

Performance for Reference

You can run the perf-script.sh in order to run all of the examples on your machine. Make sure this is done in a separate python virtualenv.

This measures QPS for indexing and querying.

This will store the results in performance.txt.

Item Index Query
Number of Docs 30 3
Time 350.74s 5.33s

Documentation

The best way to learn Jina in depth is to read our documentation. Documentation is built on every push, merge, and release event of the master branch. You can find more details about the following topics in our documentation.

Community

  • Slack channel - a communication platform for developers to discuss Jina
  • Community newsletter - subscribe to the latest update, release and event news of Jina
  • LinkedIn - get to know Jina AI as a company and find job opportunities
  • Twitter Follow - follow us and interact with us using hashtag #JinaSearch
  • Company - know more about our company, we are fully committed to open-source!

License

Copyright (c) 2021 Jina AI Limited. All rights reserved.

Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.