In this example we build a multimodal search engine for image retrieval using TIRG (Composing Text and Image for Image Retrieval). We use the Fashion200k dataset, where the input query is in the form of a clothing image plus some text that describes the desired modifications to the image.
At index time we encode the images with TIRG's image encoder. At query time we use the feature embeddings constructed by TIRG's multimodal encoder based on both the input image and text to search over the indexed images. The query text is the modification we want to apply over the query image.
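The index-then-query flow above can be pictured with a minimal sketch. It assumes the embeddings are plain lists of floats and that ranking uses cosine similarity; the toy vectors, document ids, and the `top_k` helper are hypothetical and are not part of this example's code or of Jina's API.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          index: Dict[str, Sequence[float]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k indexed items most similar to the query embedding."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Index time: embeddings an image encoder might produce (toy values).
index = {
    "red_dress": [1.0, 0.0, 0.2],
    "blue_dress": [0.0, 1.0, 0.2],
    "red_skirt": [0.9, 0.1, 0.8],
}

# Query time: embedding a multimodal encoder might produce from image + text.
query = [1.0, 0.0, 0.1]
results = top_k(query, index, k=2)
print(results)
```

In the real pipeline the vectors come from the TIRG encoders and the search is handled by the indexer, so this only illustrates the ranking idea.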
TIRG's multimodal encoder requires both image and text to create the final encoding. This is made possible by leveraging the capabilities of Jina's MultiModalEncoder to handle any type of modality.
The Fashion200k model was only trained for certain types of image modifications, such as types of dresses, colors or lengths. It therefore supports only a limited range of modifications, e.g. "replace with 3/4 length" or "replace with beige".
Note: The TIRG paper reports a Recall@1 of 14.1 on the Fashion200k dataset, so some queries might not return good results.
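To make the metric concrete, here is a small sketch of how Recall@K is commonly computed: the percentage of queries for which at least one correct item appears in the top K results. This is an assumption about the usual definition, not the paper's evaluation code, and the query ids and rankings below are made up.

```python
from typing import Dict, List, Set


def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k."""
    if not rankings:
        return 0.0
    hits = 0
    for query_id, ranked_ids in rankings.items():
        targets = relevant.get(query_id, set())
        if any(doc_id in targets for doc_id in ranked_ids[:k]):
            hits += 1
    return 100.0 * hits / len(rankings)


# Toy data: four queries, each with a ranked result list and a set of correct items.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
    "q4": ["j", "k", "l"],
}
relevant = {
    "q1": {"a"},  # correct item at rank 1
    "q2": {"e"},  # correct item at rank 2
    "q3": {"z"},  # correct item never retrieved
    "q4": {"l"},  # correct item at rank 3
}

print(recall_at_k(rankings, relevant, k=1))
print(recall_at_k(rankings, relevant, k=3))
```

Under this definition, a Recall@1 of 14.1 means that only about one query in seven has a correct image as its very first result.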
Table of Contents
- Download and Extract Data
- Build Encoder Images
- Index Image Data
- Query
- Troubleshooting
- Documentation
- Community
- License
Run the following script to download the data from Kaggle.
Note: the size of the dataset is 6GB.
bash ./get_data.sh data/
Alternatively, you can download and extract the data from Google Drive.
Index 1000 images. This can take some time and you can try a smaller number as well. We use a custom TirgImageEncoder for encoding the images. Jina normalizes the images before sending them to the encoder. If you decide to index large datasets, it is recommended to increase the number of shards and parallelization.
python app.py --task index -n 1000 -overwrite True
If it's running successfully, you should be able to see and scroll through the logs in the console and in the dashboard.
This will start the server, where you can then run your query and see the results as a pop-up. TIRG's multimodal encoder requires both input image and text to create the final encoding. This is made possible by leveraging our MultiModalEncoder capabilities. We use our QueryLanguageDriver to redirect text and image documents based on modality.
python app.py --task query --image_path path_to_image --text_query 'change color to red'
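The idea of routing query parts by modality can be illustrated with plain Python. This is a conceptual sketch only: the dictionaries and both helper functions are hypothetical stand-ins and do not use Jina's MultiModalEncoder or QueryLanguageDriver APIs.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

# Hypothetical stand-in for a query: each part carries a modality tag.
query_parts = [
    {"modality": "image", "content": "path_to_image"},
    {"modality": "text", "content": "change color to red"},
]


def group_by_modality(parts: List[dict]) -> Dict[str, List[str]]:
    """Collect the content of each part under its modality."""
    grouped = defaultdict(list)
    for part in parts:
        grouped[part["modality"]].append(part["content"])
    return dict(grouped)


def ready_for_multimodal_encoding(grouped: Dict[str, List[str]],
                                  required: Sequence[str] = ("image", "text")) -> bool:
    """A multimodal encoder needs at least one part for every required modality."""
    return all(grouped.get(modality) for modality in required)


grouped = group_by_modality(query_parts)
print(grouped)
print(ready_for_multimodal_encoding(grouped))
```

A query that carries only an image or only text would fail the second check, which mirrors the requirement that both inputs are present before the final encoding is computed.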
If you are using Docker Desktop, make sure to assign enough memory to your Docker container, especially when you have multiple replicas.
[Screenshot: Docker Desktop memory settings on macOS with two replicas]
The best way to learn Jina in depth is to read our documentation. Documentation is built on every push, merge, and release event of the master branch. You can find more details about the following topics in our documentation.
- Jina command line interface arguments explained
- Jina Python API interface
- Jina YAML syntax for executor, driver and flow
- Jina Protobuf schema
- Environment variables used in Jina
- ... and more
- Slack channel - a communication platform for developers to discuss Jina
- Community newsletter - subscribe to the latest update, release and event news of Jina
- LinkedIn - get to know Jina AI as a company and find job opportunities
- Follow us and interact with us using the hashtag #JinaSearch
- Company - learn more about our company; we are fully committed to open source!
Copyright (c) 2021 Jina AI Limited. All rights reserved.
Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.