| Method | Features | mAP (All) | mNDCG (All) | mFT (All) | mST (All) | mAP (Easy) | mFT (Easy) | mAP (Medium) | mFT (Medium) | mAP (Hard) | mFT (Hard) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| E5 | Text | 64.5 | 82.1 | 58.4 | 79.6 | 69.4 | 64.9 | 62.4 | 55.9 | 61.7 | 54.4 |
| AnglE | Text | 66.8 | 83.4 | 60.8 | 81.1 | 71.3 | 66.8 | 65.7 | 59.0 | 63.3 | 56.6 |
| CLIP | Text | 59.5 | 79.7 | 53.6 | 74.2 | 59.6 | 54.8 | 59.1 | 52.3 | 59.9 | 53.8 |
| CLIP | Image | 66.1 | 82.4 | 59.8 | 79.9 | 73.8 | 68.2 | 66.1 | 59.1 | 58.2 | 51.7 |
| CLIP | Text & Image | 69.7 | 85.1 | 63.0 | 83.2 | 74.1 | 68.3 | 69.4 | 62.7 | 65.5 | 57.6 |
| BLIP2 | Text | 56.5 | 77.3 | 50.3 | 71.8 | 58.4 | 52.4 | 53.7 | 47.6 | 57.8 | 51.3 |
| BLIP2 | Image | 68.6 | 84.1 | 62.5 | 81.7 | 74.5 | 69.4 | 68.2 | 61.9 | 63.1 | 56.1 |
| BLIP2 | Text & Image | 70.0 | 84.9 | 63.6 | 83.2 | 75.0 | 69.6 | 69.1 | 62.4 | 66.0 | 58.7 |
| Openshape | 3D | 51.9 | 73.1 | 46.5 | 67.0 | 63.6 | 58.8 | 50.8 | 45.8 | 40.9 | 34.5 |
| Openshape | 3D & Image | 70.2 | 85.1 | 63.7 | 82.6 | 76.9 | 71.5 | 70.0 | 62.7 | 63.5 | 56.7 |
| Openshape | 3D & Image & Text | 74.3 | 87.8 | 67.0 | 86.1 | 78.4 | 72.4 | 74.5 | 66.7 | 69.9 | 61.6 |
| Uni3D | 3D | 66.8 | 82.5 | 60.5 | 80.3 | 76.8 | 72.0 | 64.5 | 58.3 | 59.0 | 51.0 |
| Uni3D | 3D & Image | 75.0 | 87.9 | 68.3 | 86.8 | 81.0 | 75.7 | 74.4 | 67.5 | 69.6 | 61.8 |
| Uni3D | 3D & Image & Text | 77.1 | 89.3 | 70.0 | 88.3 | 81.4 | 75.8 | 76.8 | 69.1 | 73.0 | 65.1 |
All our experiments are conducted on an NVIDIA A100 GPU. You can try our online demo to visualize Uni3D's retrieval results.
[Note]: The installation steps below are not necessary if you only want to use our dataset, which is readily available as a Hugging Face Dataset. The Docker image may be quite heavy because it bundles the requirements of all the retrieval methods in our benchmark; if you only want to try one of them, you can refer to its official code repository.
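For reference, here is a minimal sketch of loading the data directly from the Hugging Face Hub. The dataset ID and the split name below are assumptions; check the dataset card for the exact configurations and schema.

```python
# Minimal sketch: load the benchmark data straight from the Hugging Face Hub.
# The dataset ID "yuanze1024/LD-T3D" and the split name are assumptions --
# consult the dataset card for the actual configurations and fields.
from datasets import load_dataset

dataset = load_dataset("yuanze1024/LD-T3D", split="test")  # hypothetical split
print(dataset)      # inspect the available columns
print(dataset[0])   # one record (e.g., a query and its candidate 3D models)
```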
We provide a pre-built Docker image, yuanze1024/ld-t3d. Pull it with:
docker pull yuanze1024/ld-t3d:v1
If you fail to pull our image, you can build it from the Dockerfile:
git clone https://github.com/yuanze1024/LD-T3D.git
cd LD-T3D
docker build -t ld-t3d .
[Note]: Adjust TORCH_CUDA_ARCH_LIST in the Dockerfile to match your GPU architecture for compilation, e.g., 8.0 for the A100 and 8.6 for the RTX 3090.
Set your Hugging Face cache_dir in config/config.yaml. [Note]: Make sure general.cache_dir is set correctly: it is the directory where the downloaded pretrained checkpoints are stored. Each method's checkpoints will be downloaded automatically the first time you use that method.
# E5
python eval.py --option e5 --cross_modal text --batch_size 1024
# AnglE
python eval.py --option angle --cross_modal text --batch_size 1024
# CLIP
python eval.py --option clip --cross_modal image --angles diag_above --batch_size 256
# BLIP2
python eval.py --option blip2 --cross_modal image --angles diag_above --batch_size 256
# Openshape
python eval.py --option Openshape --cross_modal text_image_3D --op add --angles diag_above --batch_size 256
# Uni3D
python eval.py --option Uni3D --cross_modal text_image_3D --op add --angles diag_above --batch_size 256
Note that we currently only support dual-stream architectures, which means the embeddings of the queries and of the multimodal targets must be encoded separately.
You can refer to the encoders in feature_extractors and implement your own method by inheriting the base class FeatureExtractor in feature_extractors/__init__.py. If you want to use the image modality, you also need to implement a get_img_transform function; see the sketch below.
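Here is a minimal sketch of such a custom extractor, using an off-the-shelf CLIP model purely as an illustration. The method names encode_query and encode_image are assumptions about the abstract interface (only get_img_transform is named above); check feature_extractors/__init__.py for the actual signatures.

```python
# Sketch of a custom dual-stream method: queries and targets are encoded
# separately. Method names other than get_img_transform are assumptions.
import torch
from torchvision import transforms
from transformers import CLIPModel, CLIPTokenizer

from feature_extractors import FeatureExtractor


class MyCLIPExtractor(FeatureExtractor):
    def __init__(self, device: str = "cuda"):
        self.device = device
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
        self.tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

    @torch.no_grad()
    def encode_query(self, queries: list[str]) -> torch.Tensor:
        # Queries are encoded on their own, independently of the targets.
        tokens = self.tokenizer(queries, padding=True, truncation=True,
                                return_tensors="pt").to(self.device)
        return self.model.get_text_features(**tokens)

    @torch.no_grad()
    def encode_image(self, images: torch.Tensor) -> torch.Tensor:
        # images: a batch already preprocessed by get_img_transform below.
        return self.model.get_image_features(pixel_values=images.to(self.device))

    def get_img_transform(self):
        # Required only when the image modality is used.
        return transforms.Compose([
            transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                                 std=(0.26862954, 0.26130258, 0.27577711)),
        ])
```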
not published yet