GitHub - bigai-ai/SceneVerse: Official implementation of ECCV24 paper "SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding"

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Baoxiong Jia^✶, Yixin Chen^✶, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang

We propose SceneVerse, the first million-scale 3D vision-language dataset with 68K 3D indoor scenes and 2.5M vision-language pairs. We demonstrate the scaling effect by (i) achieving state-of-the-art performance on all existing 3D visual grounding benchmarks and (ii) showcasing zero-shot transfer capabilities with our GPS (Grounded Pre-training for Scenes) model.

News

[2024-10] Pre-trained checkpoints are now available; see TRAIN.md for detailed instructions!
[2024-09] The scripts for scene graph generation are released.
[2024-07] Training and inference code, along with preprocessing code, is released; checkpoints and logs are on the way!
[2024-07] Preprocessing code for the scenes used in SceneVerse is released.
[2024-07] SceneVerse has been accepted to ECCV 2024! Training and inference code and checkpoints are coming shortly; stay tuned!
[2024-03] We release the data used in SceneVerse. Fill out the form for the download link!
[2024-01] We release SceneVerse on arXiv. Check out our paper and website.

Data

See DATA.md for detailed instructions on data download, processing, and visualization. The data inventory is listed below:

Dataset	Object Caption	Scene Caption	Ref-Annotation	Ref-Pairwise `rel2`	Ref-MultiObject `relm`	Ref-Star `star`	Ref-Chain (Optional) `chain`
ScanNet	✅	✅	ScanRefer, Nr3D	✅	✅	✅	✅
MultiScan	✅	✅	✅	✅	✅	✅	✅
ARKitScenes	✅	✅	✅	✅	✅	✅	✅
HM3D	`template`	✅	✅	✅	✅	✅	✅
3RScan	✅	✅	❌	✅	✅	✅	✅
Structured3D	`template`	✅	❌	✅	✅	✅	❌
ProcTHOR	`template`	❌	❌	`template`	`template`	`template`	❌
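
The inventory above can also be queried programmatically. The following is a minimal sketch, not part of the SceneVerse codebase: the names `COLUMNS`, `INVENTORY`, and `datasets_with` are hypothetical, and the values are transcribed by hand from the table (✅ as `"yes"`, ❌ as `"no"`, `template` as `"template"`). See DATA.md for the actual data layout.

```python
# Hypothetical encoding of the data inventory table; not the repository's API.
COLUMNS = (
    "object_caption",
    "scene_caption",
    "ref_annotation",
    "rel2",   # Ref-Pairwise
    "relm",   # Ref-MultiObject
    "star",   # Ref-Star
    "chain",  # Ref-Chain (optional)
)

Y, N, T = "yes", "no", "template"

# One row per dataset, in the same column order as COLUMNS.
INVENTORY = {
    "ScanNet":      (Y, Y, "ScanRefer, Nr3D", Y, Y, Y, Y),
    "MultiScan":    (Y, Y, Y, Y, Y, Y, Y),
    "ARKitScenes":  (Y, Y, Y, Y, Y, Y, Y),
    "HM3D":         (T, Y, Y, Y, Y, Y, Y),
    "3RScan":       (Y, Y, N, Y, Y, Y, Y),
    "Structured3D": (T, Y, N, Y, Y, Y, N),
    "ProcTHOR":     (T, N, N, T, T, T, N),
}


def datasets_with(column, include_template=True):
    """Return the datasets that provide `column`, in table order.

    Template-generated entries are included unless include_template is False.
    """
    idx = COLUMNS.index(column)
    result = []
    for name, row in INVENTORY.items():
        value = row[idx]
        if value == N:
            continue
        if value == T and not include_template:
            continue
        result.append(name)
    return result


if __name__ == "__main__":
    print(datasets_with("chain"))
    print(datasets_with("object_caption", include_template=False))
```

Passing `include_template=False` restricts the result to datasets whose entry is not marked `template` in the table.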

Training and Inference

See TRAIN.md for the inventory of available checkpoints and detailed instructions on training and testing with pre-trained checkpoints. The checkpoint inventory is listed below:

Setting	Description	Corresponding Experiment	Checkpoint based on experiment setting
`pre-trained`	GPS model pre-trained on SceneVerse	3D-VL grounding (Tab.2)	Model
`scratch`	GPS model trained on datasets from scratch	3D-VL grounding (Tab.2), SceneVerse-val (Tab.3)	ScanRefer, Sr3D, Nr3D, SceneVerse-val
`fine-tuned`	GPS model fine-tuned on datasets with grounding heads	3D-VL grounding (Tab.2)	ScanRefer, Sr3D, Nr3D
`zero-shot`	GPS model trained on SceneVerse without data from ScanNet and MultiScan	Zero-shot Transfer (Tab.3)	Model
`zero-shot text`	GPS	Zero-shot Transfer (Tab.3)	ScanNet, SceneVerse-val
`text-ablation`	Ablations on the type of language used during pre-training	Ablation on Text (Tab.7)	Template only, Template+LLM
`scene-ablation`	Ablations on the use of synthetic scenes during pre-training	Ablation on Scene (Tab.8)	Real only, S3D only, ProcTHOR only
`model-ablation`	Ablations on the use of losses during pre-training	Ablation on Model Design (Tab.9)	Refer only, Refer+Obj-lvl, w/o Scene-lvl
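
To find which checkpoint settings correspond to a given table in the paper, the inventory above can be expressed as a small lookup. This is an illustrative sketch only: `CHECKPOINTS` and `settings_for_table` are hypothetical names, the entries are transcribed from the table, and nothing here downloads or loads a checkpoint. TRAIN.md remains the reference for actual usage.

```python
# Hypothetical mapping from checkpoint setting to the paper tables it reproduces
# and the checkpoint variants listed in the inventory; not the repository's API.
CHECKPOINTS = {
    "pre-trained":    {"tables": (2,),   "variants": ["Model"]},
    "scratch":        {"tables": (2, 3), "variants": ["ScanRefer", "Sr3D", "Nr3D", "SceneVerse-val"]},
    "fine-tuned":     {"tables": (2,),   "variants": ["ScanRefer", "Sr3D", "Nr3D"]},
    "zero-shot":      {"tables": (3,),   "variants": ["Model"]},
    "zero-shot text": {"tables": (3,),   "variants": ["ScanNet", "SceneVerse-val"]},
    "text-ablation":  {"tables": (7,),   "variants": ["Template only", "Template+LLM"]},
    "scene-ablation": {"tables": (8,),   "variants": ["Real only", "S3D only", "ProcTHOR only"]},
    "model-ablation": {"tables": (9,),   "variants": ["Refer only", "Refer+Obj-lvl", "w/o Scene-lvl"]},
}


def settings_for_table(table_number):
    """Return the settings whose checkpoints correspond to the given paper table."""
    return [
        setting
        for setting, info in CHECKPOINTS.items()
        if table_number in info["tables"]
    ]


if __name__ == "__main__":
    for table in (2, 3, 7, 8, 9):
        print(f"Tab.{table}: {settings_for_table(table)}")
```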

BibTex

@article{jia2024sceneverse,
  title={SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding},
  author={Jia, Baoxiong and Chen, Yixin and Yu, Huangyue and Wang, Yan and Niu, Xuesong and Liu, Tengyu and Li, Qing and Huang, Siyuan},
  journal={arXiv preprint arXiv:2401.09340},
  year={2024}
}

Acknowledgements

We thank the authors of ScanRefer, ScanNet, 3RScan, ReferIt3D, Structured3D, HM3D, ProcTHOR, ARKitScenes, and MultiScan for open-sourcing their awesome datasets. We also heavily adapted code from ScanQA, SQA3D, and 3D-VisTA for training and inference.
