multimodal-search-tirg

Python 3.7, 3.8 | Docker

Multimodal Search With TIRG & Fashion200k

In this example we build a multimodal search engine for image retrieval using TIRG (Composing Text and Image for Image Retrieval). We use the Fashion200k dataset, where the input query is in the form of a clothing image plus some text that describes the desired modifications to the image.

At index time, we encode the images with TIRG's image encoder. At query time, TIRG's multimodal encoder builds a feature embedding from both the input image and the text, and we use that embedding to search over the indexed images. The query text describes the modification we want to apply to the query image.
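The index and query steps can be sketched in a few lines of plain Python. This is a toy illustration, not TIRG or Jina code: the three-dimensional embeddings are made up, and `compose` is a hypothetical stand-in (element-wise addition) for TIRG's learned multimodal encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compose(image_emb, text_emb):
    # Assumption: element-wise addition stands in for the learned
    # image + text composition; the real encoder is a trained network.
    return [i + t for i, t in zip(image_emb, text_emb)]


def search(index, query_emb, top_k=2):
    """Rank indexed image embeddings by similarity to the query embedding."""
    scored = [(doc_id, cosine(query_emb, emb)) for doc_id, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Index time: one (made-up) embedding per image.
index = {
    "red_dress": [1.0, 0.0, 1.0],
    "blue_dress": [0.0, 1.0, 1.0],
    "red_shirt": [1.0, 0.0, 0.0],
}

# Query time: a query image plus a (made-up) embedding of the modification text.
query_image = index["blue_dress"]
modification = [1.0, -1.0, 0.0]  # toy vector for "change color to red"
results = search(index, compose(query_image, modification))
```

In this toy setup the composed query coincides with the `red_dress` vector, so that image ranks first.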

TIRG's multimodal encoder requires both image and text to create the final encoding. This is made possible by leveraging the capabilities of Jina's MultiModalEncoder to handle any type of modality.

The Fashion200k model was trained only on certain types of image modifications, such as dress type, color, or length. Hence, it only handles modifications of those kinds, for example "replace with 3/4 length" or "replace with beige".

Note: The TIRG paper reports a Recall@1 of 14.1 on the Fashion200k dataset, so some queries may not return good results.
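To put that number in context, here is a minimal sketch of how Recall@K is commonly computed in retrieval evaluation: the percentage of queries with at least one correct item among the top K results. It assumes this common definition matches the paper's protocol; the queries and rankings below are invented for illustration.

```python
def recall_at_k(rankings, relevant, k):
    """Percentage of queries with at least one relevant item in the top k."""
    hits = 0
    for query_id, ranking in rankings.items():
        if any(doc_id in relevant[query_id] for doc_id in ranking[:k]):
            hits += 1
    return 100.0 * hits / len(rankings)


# Hypothetical ranked results for four queries.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
    "q4": ["j", "k", "l"],
}
# Hypothetical ground truth: the correct target(s) for each query.
relevant = {"q1": {"a"}, "q2": {"e"}, "q3": {"x"}, "q4": {"l"}}

r_at_1 = recall_at_k(rankings, relevant, 1)
r_at_3 = recall_at_k(rankings, relevant, 3)
```

A Recall@1 of 14.1 therefore means the top result is correct for roughly one query in seven, which is why inspecting several results per query is worthwhile.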

Download and Extract Data

Run the following script to download the data from Kaggle.

Note: the size of the dataset is 6GB.

bash ./get_data.sh data/

Alternatively, you can download and extract the data from Google Drive.

Index Image Data

Index 1000 images. This can take some time, so you may want to try a smaller number first. We use a custom TirgImageEncoder to encode the images. Jina normalizes the images before sending them to the encoder. If you index a large dataset, we recommend increasing the number of shards and the parallelization.

python app.py --task index -n 1000 -overwrite True
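As a rough illustration of why more shards help with large datasets, the sketch below splits documents across shards and merges per-shard results at query time. It is a conceptual example only: the round-robin assignment and the helper names `assign_shards` and `merge_top_k` are assumptions made for illustration, not Jina's internal behavior or API.

```python
import heapq


def assign_shards(doc_ids, num_shards):
    """Distribute document ids across shards in round-robin order."""
    shards = [[] for _ in range(num_shards)]
    for position, doc_id in enumerate(doc_ids):
        shards[position % num_shards].append(doc_id)
    return shards


def merge_top_k(per_shard_hits, k):
    """Combine (doc_id, score) hits from every shard and keep the k best."""
    all_hits = [hit for hits in per_shard_hits for hit in hits]
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])


# Ten hypothetical image ids spread over three shards.
shards = assign_shards([f"img_{i}" for i in range(10)], num_shards=3)

# Each shard answers the query on its own slice; the results are then merged.
merged = merge_top_k(
    [[("img_0", 0.9), ("img_3", 0.4)], [("img_1", 0.7)], [("img_2", 0.8)]],
    k=2,
)
```

Each shard only holds and searches a fraction of the data, so indexing and querying can proceed in parallel at the cost of a final merge step.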

If indexing runs successfully, you should be able to see and scroll through the logs in the console and in the dashboard.

Query

The following command starts the server, where you can then run your query and see the results in a pop-up. TIRG's multimodal encoder requires both an input image and text to create the final encoding, which Jina's MultiModalEncoder makes possible. We use the QueryLanguageDriver to route text and image documents by modality.

python app.py --task query --image_path path_to_image --text_query 'change color to red'
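The idea of routing by modality can be sketched as follows. This is a conceptual example, not the QueryLanguageDriver or MultiModalEncoder implementation: the `Chunk` class and both helper functions are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    modality: str  # "image" or "text"
    content: str   # an image path or the modification text


def route_by_modality(chunks):
    """Group chunk contents by modality, rejecting unknown modalities."""
    routed = {"image": [], "text": []}
    for chunk in chunks:
        if chunk.modality not in routed:
            raise ValueError(f"unsupported modality: {chunk.modality}")
        routed[chunk.modality].append(chunk.content)
    return routed


def build_encoder_input(chunks):
    """Return the (image, text) pair that a multimodal encoder would consume."""
    routed = route_by_modality(chunks)
    if len(routed["image"]) != 1 or len(routed["text"]) != 1:
        raise ValueError("expected exactly one image and one text chunk")
    return routed["image"][0], routed["text"][0]


# The query from the command above, expressed as two chunks.
query = [
    Chunk("image", "path_to_image"),
    Chunk("text", "change color to red"),
]
image, text = build_encoder_input(query)
```

Because the encoder needs both inputs, the sketch rejects a query that is missing either the image or the text rather than encoding it partially.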

Troubleshooting

Memory Issues

If you are using Docker Desktop, make sure to assign enough memory to your Docker container, especially when you run multiple replicas. Below are my macOS settings with two replicas:

[Screenshot: Docker Desktop memory settings on macOS]

Documentation

The best way to learn Jina in depth is to read our documentation. Documentation is built on every push, merge, and release event of the master branch.

Community

  • Slack channel - a communication platform for developers to discuss Jina
  • Community newsletter - subscribe to the latest Jina updates, releases, and event news
  • LinkedIn - get to know Jina AI as a company and find job opportunities
  • Twitter - follow us and interact with us using the hashtag #JinaSearch
  • Company - learn more about our company; we are fully committed to open source!

License

Copyright (c) 2021 Jina AI Limited. All rights reserved.

Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.