diff --git a/index.html b/index.html index 6eaa0bf..635b48d 100644 --- a/index.html +++ b/index.html @@ -114,61 +114,40 @@
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks in the language domain, but the idea is less explored in the multimodal field.
Based on the COCO dataset, we interact with language-only GPT-4 and collect 158K unique language-image instruction-following samples in total, including 58K in conversations, 23K in detailed description, and 77K in complex reasoning. Please check out "LLaVA-Instruct-150K" on [HuggingFace Dataset].
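As a rough sketch of this data-collection step, the snippet below prompts a text-only GPT-4 with COCO captions standing in for the image; the prompt wording, helper names, and `openai` client usage are simplified assumptions rather than the exact generation pipeline.

```python
# Hypothetical sketch: generating instruction-following data from COCO captions
# with a text-only GPT-4. Prompt wording and helpers are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are given several captions describing the same image. "
    "Write a multi-turn conversation between a user asking about the image "
    "and an assistant answering as if it can see the image."
)

def generate_conversation(captions: list[str]) -> str:
    """Ask a language-only GPT-4 for a conversation grounded in the captions."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "\n".join(captions)},
        ],
    )
    return response.choices[0].message.content

# Example: captions for one COCO image (placeholder text).
print(generate_conversation([
    "A man riding a bicycle down a city street.",
    "A cyclist passes parked cars on a sunny afternoon.",
]))
```

The released data files are summarized in the table below.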
| Data file name | File size | Sample size |
|---|---|---|
| conversation_58k.json | 126 MB | 58K |
| detail_23k.json | 20.5 MB | 23K |
| complex_reasoning_77k.json | 79.6 MB | 77K |
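To make the file layout concrete, a small sketch for inspecting one of these files is shown below; the field names assumed here (`id`, `image`, and a `conversations` list of `from`/`value` turns) follow the commonly used LLaVA-Instruct-150K schema and should be checked against the actual release.

```python
# Sketch: inspecting conversation_58k.json. The field names assumed here
# (id, image, conversations -> from/value) should be verified against the release.
import json

with open("conversation_58k.json") as f:
    samples = json.load(f)

print(f"{len(samples)} samples loaded")
first = samples[0]
print("image:", first.get("image"))
for turn in first.get("conversations", []):
    # Each turn is expected to carry a speaker tag and the utterance text.
    print(f"[{turn.get('from')}] {turn.get('value', '')[:80]}")
```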
For each subset, we visualize the root noun-verb pairs of the instructions and responses. For each chart, please click the link to the interactive page to inspect the noun-verb pairs whose frequency is higher than the given number.
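As an illustration of how such root noun-verb pairs could be extracted, the sketch below uses a common spaCy dependency-parsing heuristic; the extraction rule (root verb plus its direct object) is an assumption, not necessarily the procedure used for the charts.

```python
# Sketch: extracting a (root verb, direct-object noun) pair from an instruction.
# The exact extraction rules behind the charts are not specified here;
# this uses a standard spaCy-based heuristic as an assumption.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # requires the small English model to be installed

def root_verb_noun(text: str):
    doc = nlp(text)
    for token in doc:
        if token.dep_ == "ROOT" and token.pos_ == "VERB":
            for child in token.children:
                if child.dep_ == "dobj":
                    return (token.lemma_, child.lemma_)
    return None

instructions = [
    "Describe the image in detail.",
    "Explain why the scene is unusual.",
]
pairs = Counter(p for p in map(root_verb_noun, instructions) if p)
print(pairs.most_common(10))
```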
LLaVA connects the pre-trained CLIP ViT-L/14 visual encoder and the large language model Vicuna through a simple projection matrix. We consider a two-stage instruction-tuning procedure: Stage 1 pre-trains the projection matrix for feature alignment, and Stage 2 fine-tunes both the projection matrix and the LLM end-to-end.
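To make the connection concrete, here is a minimal sketch of mapping CLIP visual features into the language model's embedding space with a single linear layer; the class name, dimensions, and shapes are illustrative assumptions rather than the released implementation.

```python
# Sketch of the core idea: a single linear projection maps CLIP ViT-L/14
# patch features into the word-embedding space of the language model.
# Dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # "Simple projection matrix" W: visual feature -> LLM token embedding.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the CLIP encoder.
        return self.proj(image_features)  # (batch, num_patches, llm_dim)

# Projected visual tokens are concatenated with text embeddings and fed to Vicuna.
tokens = VisualProjector()(torch.randn(1, 256, 1024))
print(tokens.shape)  # torch.Size([1, 256, 4096])
```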
An evaluation dataset with 30 unseen images is constructed: each image is associated with three types of instructions: conversation, detailed description, and complex reasoning. This leads to 90 new language-image instructions, on which we test LLaVA and GPT-4, and use GPT-4 to rate their responses on a scale from 1 to 10. The summed score and relative score per type are reported. Overall, LLaVA achieves an 85.1% relative score compared with GPT-4, indicating the effectiveness of the proposed self-instruct method in multimodal settings.
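For illustration, a small sketch of how such a relative score might be aggregated from GPT-4 judge ratings is given below; the per-type summation mirrors the description above, while the example ratings and exact aggregation details are assumptions.

```python
# Sketch: computing a relative score from GPT-4 judge ratings (1-10).
# The exact aggregation behind the reported 85.1% is an assumption here.
from collections import defaultdict

# (instruction type, LLaVA score, GPT-4 score) per evaluated instruction.
ratings = [
    ("conversation", 8, 9),
    ("detail", 7, 9),
    ("reasoning", 9, 10),
]

totals = defaultdict(lambda: [0, 0])
for kind, llava_score, gpt4_score in ratings:
    totals[kind][0] += llava_score
    totals[kind][1] += gpt4_score

for kind, (llava_sum, gpt4_sum) in totals.items():
    print(f"{kind}: relative score = {100 * llava_sum / gpt4_sum:.1f}%")
```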
@misc{liu2023improvedllava,
  author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
  title={Improved Baselines with Visual Instruction Tuning},
  publisher={arXiv:2310.03744},
  year={2023},
}

@inproceedings{liu2023llava,
  author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  title={Visual Instruction Tuning},
  booktitle={NeurIPS},
  year={2023},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and the open-source projects Alpaca and Vicuna.
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
Related Links: [REACT] [GLIGEN] [Computer Vision in the Wild (CVinW)] [Instruction Tuning with GPT-4]