diff --git a/index.html b/index.html
index 6eaa0bf..635b48d 100644
--- a/index.html
+++ b/index.html
@@ -114,61 +114,40 @@
-

LLaVA: Large Language and Vision Assistant

-

Visual Instruction Tuning

-
NeurIPS 2023 (Oral)
+

Mantis: Interleaved Multi-Image Instruction Tuning

Balancing Multi-Image and Single-Image Abilities of Large Multimodal Models

- University of Wisconsin-Madison
- Microsoft Research
- Columbia University
- *Equal Contribution
+ University of Waterloo
+ Tsinghua University
+ Sea AI Lab
@@ -247,402 +214,12 @@

Improved Baselines with Visual Instruction Tuning


- 🔥[NEW!] LLaVA-1.5 achieves SoTA on 11 benchmarks with just simple modifications to the original LLaVA: it uses only public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods that use billion-scale data.

- LLaVA represents a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.


Abstract


- Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks in the language domain, but the idea is less explored in the multimodal field. -

  1. Multimodal Instruct Data. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
  2. LLaVA Model. We introduce LLaVA (Large Language-and-Vision Assistant), an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding.
  3. Performance. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
  4. Open-source. We make the GPT-4 generated visual instruction tuning data, our model, and the code base publicly available.

Multimodal Instruction-Following Data


- Based on the COCO dataset, we interact with language-only GPT-4 and collect 158K unique language-image instruction-following samples in total: 58K conversations, 23K detailed descriptions, and 77K complex reasoning samples. Please check out "LLaVA-Instruct-150K" on [HuggingFace Dataset]; a short loading sketch follows the table below.

Data file name               File Size   Sample Size
conversation_58k.json        126 MB      58K
detail_23k.json              20.5 MB     23K
complex_reasoning_77k.json   79.6 MB     77K
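For reference, here is a minimal, hedged sketch of loading one of these files with the huggingface_hub client; the file names come from the table above, while the repo id and the record keys are assumptions rather than details stated on this page.

  # Minimal sketch (not the release code): download one split of
  # LLaVA-Instruct-150K and count its samples.
  import json
  from huggingface_hub import hf_hub_download

  path = hf_hub_download(
      repo_id="liuhaotian/LLaVA-Instruct-150K",  # assumed dataset repo id
      filename="conversation_58k.json",          # file name from the table above
      repo_type="dataset",
  )
  with open(path) as f:
      samples = json.load(f)
  print(len(samples))        # expected to be roughly 58K entries
  print(samples[0].keys())   # e.g. id, image, conversations (assumed schema)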

-

- For each subset, we visualize the root noun-verb pairs for the instruction and response. For each chart, please click the link for the interactive page to check out the noun-verb pairs whose frequency is higher than the given number; a rough extraction sketch follows the chart list below.

- Instruction: Conversation [0, 20, 50]
- Instruction: Detailed Description [0]
- Instruction: Complex Reasoning [0, 20, 50]
- Response: Conversation [0, 20, 50]
- Response: Detailed Description [0, 20, 50]
- Response: Complex Reasoning [0, 20, 50]
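As a rough illustration of how such root noun-verb pairs can be extracted, here is a spaCy-based sketch; it is not necessarily the tooling used to build the charts above, and the sample instructions are invented.

  # Rough illustration only: extract the root verb and its direct-object noun
  # from each instruction, then tally pair frequencies.
  from collections import Counter
  import spacy  # requires: python -m spacy download en_core_web_sm

  nlp = spacy.load("en_core_web_sm")

  def root_verb_noun(text):
      doc = nlp(text)
      for token in doc:
          if token.dep_ == "ROOT" and token.pos_ == "VERB":
              objs = [c.lemma_ for c in token.children if c.dep_ == "dobj"]
              return (token.lemma_, objs[0] if objs else None)
      return None

  instructions = ["Describe the image in detail.", "What is the man holding?"]
  print(Counter(root_verb_noun(s) for s in instructions).most_common(10))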

LLaVA: Large Language-and-Vision Assistant


- LLaVA connects the pre-trained CLIP ViT-L/14 visual encoder and the large language model Vicuna using a simple projection matrix (a minimal sketch of this wiring appears after the list below). We consider a two-stage instruction-tuning procedure:

  • Stage 1: Pre-training for Feature Alignment. Only the projection matrix is updated, based on a subset of CC3M.
  • Stage 2: Fine-tuning End-to-End. Both the projection matrix and the LLM are updated for two different use scenarios:
    • Visual Chat: LLaVA is fine-tuned on our generated multimodal instruction-following data for daily user-oriented applications.
    • Science QA: LLaVA is fine-tuned on this multimodal reasoning dataset for the science domain.
- Please check out our [Model Zoo].
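A minimal PyTorch-style sketch of that wiring, assuming illustrative feature dimensions and module names; this is not LLaVA's released training code.

  # Frozen CLIP ViT-L/14 patch features are mapped into the LLM embedding
  # space by a single linear projection.
  import torch
  import torch.nn as nn

  class VisionLanguageConnector(nn.Module):
      def __init__(self, vision_dim=1024, llm_dim=4096):   # illustrative dims
          super().__init__()
          # Stage 1 trains only this projection; Stage 2 also updates the LLM.
          self.projection = nn.Linear(vision_dim, llm_dim)

      def forward(self, vision_features):           # (batch, patches, vision_dim)
          return self.projection(vision_features)   # (batch, patches, llm_dim)

  connector = VisionLanguageConnector()
  image_tokens = connector(torch.randn(1, 256, 1024))  # stand-in CLIP features
  # image_tokens are then placed alongside the text embeddings fed to Vicuna.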


Performance


Visual Chat: Towards building a multimodal GPT-4-level chatbot


An evaluation dataset with 30 unseen images is constructed: each image is associated with three types of instructions: conversation, detailed description, and complex reasoning. This leads to 90 new language-image instructions, on which we test LLaVA and GPT-4, and use GPT-4 to rate their responses on a scale from 1 to 10. The summed score and relative score per type are reported. Overall, LLaVA achieves an 85.1% relative score compared with GPT-4, indicating the effectiveness of the proposed self-instruct method in multimodal settings.
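The relative score reduces to a ratio of summed ratings; a toy sketch with made-up numbers (not the paper's raw scores):

  # GPT-4 rates each answer from 1 to 10; the relative score is the ratio of
  # the candidate's total to GPT-4's total.
  def relative_score(candidate_scores, reference_scores):
      return 100.0 * sum(candidate_scores) / sum(reference_scores)

  llava_ratings = [8, 7, 9]    # hypothetical per-question ratings for LLaVA
  gpt4_ratings  = [9, 9, 10]   # hypothetical per-question ratings for GPT-4
  print(f"{relative_score(llava_ratings, gpt4_ratings):.1f}%")  # 85.7%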


Science QA: New SoTA with the synergy of LLaVA and GPT-4


LLaVA alone achieves 90.92%. We use text-only GPT-4 as the judge to predict the final answer based on its own previous answers and the LLaVA answers. This "GPT-4 as judge" scheme yields a new SoTA of 92.53%; a schematic of the ensembling logic follows below.
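A schematic of that ensembling logic, with the judge call left abstract; the helper function and prompt are illustrative, not the paper's exact prompt or code.

  # When LLaVA and GPT-4 disagree on a ScienceQA question, a text-only judge
  # (GPT-4) is asked to pick the final answer.
  def ensemble_answer(question, gpt4_answer, llava_answer, judge):
      if gpt4_answer == llava_answer:
          return gpt4_answer
      prompt = (f"Question: {question}\n"
                f"Answer A: {gpt4_answer}\n"
                f"Answer B: {llava_answer}\n"
                "Which answer is correct? Reply with the final answer only.")
      return judge(prompt)   # e.g. a call to a text-only GPT-4 endpoint

  # usage: final = ensemble_answer(q, gpt4_pred, llava_pred, judge=my_gpt4_call)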


Examples of Visual Instruction Following


Visual Reasoning on two examples from the OpenAI GPT-4 Technical Report


Optical character recognition (OCR)


BibTeX

-

-  @misc{liu2023improvedllava,
-    author    = {Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
-    title     = {Improved Baselines with Visual Instruction Tuning},
-    publisher = {arXiv:2310.03744},
-    year      = {2023}
-  }
-
-  @inproceedings{liu2023llava,
-    author      = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
-    title       = {Visual Instruction Tuning},
-    booktitle   = {NeurIPS},
-    year        = {2023}
-  }
-  

Acknowledgement

-

- This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and the open-source projects Alpaca and Vicuna.

- -

-Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

- -

- Related Links: [REACT] [GLIGEN] [Computer Vision in the Wild (CVinW)] [Instruction Tuning with GPT-4]

-
-