What are Vision Models?
Vision models are basically large language models that can analyze and extract information from a variety of images. For purposes of this program, vision models are used to extract a summary of what an image depicts and add this description to the vector database, where it can be searched along with any traditional documents you add!
Which Vision Models Are Available?
There are three vision models available with this program:

- llava
- bakllava
- cogvlm

llava models were trailblazers in what they did, and this program uses both the 7b and 13b sizes. llava models are based on the llama2 architecture. bakllava is similar to llava except that its architecture is based on mistral, and it only comes in the 7b variety. cogvlm has 18b parameters but is my personal favorite because it produces the best results by far. I've found that the accuracy of the statements in its summaries is over 90%, whereas bakllava is only about 70% and llava is slightly lower than that (regardless of whether you use the 7b or 13b sizes).
What do the Settings Mean?
Model is obviously the model's name. Note that you cannot use cogvlm on MacOS; this is because it requires the xformers library, which does not currently provide a build for MacOS.
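Since xformers wheels simply aren't published for MacOS, any loader has to fall back to the other two models on that platform. Here is a minimal sketch of that kind of platform guard; the variable names are mine, not the program's actual code:

```python
import platform

# cogvlm depends on xformers, which currently has no MacOS build,
# so only offer it on Windows/Linux. Illustrative sketch only.
AVAILABLE_VISION_MODELS = ["llava", "bakllava"]
if platform.system() != "Darwin":  # "Darwin" == MacOS
    AVAILABLE_VISION_MODELS.append("cogvlm")

print(AVAILABLE_VISION_MODELS)
```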
Size refers to the number of parameters (in billions). Larger generally means better, but unlike with typical large language models of differing parameter counts, I didn't notice a difference between the llava 7b and 13b sizes, so feel free to experiment. The Tools Tab contains a table outlining the general VRAM requirements for the various models/settings. Remember, this is before accounting for overhead such as your monitor, which typically amounts to 1-2 GB more.
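For a rough intuition behind that table, the weights alone take roughly the parameter count times the bytes per parameter for the chosen format, plus the desktop overhead mentioned above. A back-of-the-envelope sketch (the function and figures are illustrative, not taken from the Tools Tab table, and it ignores activation memory):

```python
# Rough, illustrative VRAM estimate: weights + desktop overhead.
# Ignores activations, the vision encoder, and CUDA context, so treat
# the result as a floor rather than an exact requirement.
BYTES_PER_PARAM = {
    "float16": 2.0,  # unquantized half precision
    "8-bit": 1.0,    # 8-bit quantization
    "4-bit": 0.5,    # 4-bit quantization
}

def estimate_vram_gb(params_billion: float, quant: str, overhead_gb: float = 2.0) -> float:
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return weights_gb + overhead_gb

# e.g. cogvlm (~18b parameters) in 4-bit: roughly 11 GB before activations.
print(f"{estimate_vram_gb(18, '4-bit'):.1f} GB")
```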
Quant refers to the quantization of the model - i.e. how much it's reduced from its original floating point format. See the tail end of the Whisper portion of the User Guide for a primer on what floating point formats are. This program uses the bitsandbytes library to perform the quantizations because it's the only option I was aware of that could quantize cogvlm, which is far superior IMHO.
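If you're curious what bitsandbytes quantization looks like in code, here is a minimal sketch using the Hugging Face transformers API. The model id, dtype, and 4-bit settings are illustrative assumptions; the program's own loader scripts may configure things differently:

```python
# Illustrative sketch of bitsandbytes quantization via Hugging Face
# transformers; not the program's actual loader code.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # the "Quant" setting: 4-bit weights
    bnb_4bit_compute_dtype=torch.float16,  # math still runs in fp16
    bnb_4bit_quant_type="nf4",             # normal-float 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",        # a cogvlm checkpoint on Hugging Face
    quantization_config=bnb_config,
    trust_remote_code=True,        # cogvlm ships custom model code
    torch_dtype=torch.float16,
)
```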
Why Are Some Settings Disabled?
Flash Attention 2 is a very powerful newer technology, but it requires CUDA 12+. This program relies exclusively on CUDA 11 due to compatibility with the faster-whisper library that handles the audio features. However, faster-whisper should be adding CUDA 12+ support in the near future, at which time Flash Attention 2 should become available. Batch will be explained and added in a future release.
How do I use the Vision Model?
Before Release 3, this program put all documents selected within the "Docs_for_DB" folder. Now it puts any images selected in the "Images_for_DB" folder. You can manually remove images from there if need be. Once documents and/or images are selected, you simply click the create database button like before. The document processor will run in two steps: first it will load non-images, and second it will load any images.
The "loading" process takes very little time for documents but a relatively long time for images. "Loading" images involves + creating the summaries for each image using the selected vision model. Make sure and test your vision model settings within + the Tools Tab before committing to processing, for example, 100 images.
After both documents and images are "loaded," they are added to the vectorstore just the same as in prior releases of this program.
Once the database is "persisted," try searching for images that depict a certain thing. You can also check the chunks only checkbox to see the actual results returned from the database instead of connecting to LM Studio. This is extremely useful for fine-tuning your settings, including both the chunking/overlap settings and the vision model settings.
PRO TIP: Make sure to set your chunk size larger than the summaries provided by the vision model. Doing this prevents the summary for a particular image from EVER being split. In short, each and every chunk consists of the entire summary provided by the vision model! This tends to require a chunk size of 400-800 depending on the vision model settings.
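One simple way to verify this before building a large database is to generate summaries for a handful of test images in the Tools Tab and compare their lengths against your chunk size. A small, hypothetical helper (not part of the program, and assuming the chunk size is measured in characters):

```python
# Hypothetical helper: confirm the chosen chunk size exceeds the longest
# image summary so no summary is ever split across chunks.
def check_chunk_size(summaries: list[str], chunk_size: int) -> None:
    longest = max(len(s) for s in summaries)
    if chunk_size <= longest:
        print(f"Increase chunk size: longest summary is {longest} characters, "
              f"but the chunk size is only {chunk_size}.")
    else:
        print(f"OK: chunk size {chunk_size} covers the longest summary ({longest} characters).")

check_chunk_size(["The image shows a red barn in a snowy field..."], chunk_size=800)
```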
Can I Change What the Vision Model Does?
For this initial release, I hardcoded the questions asked of the vision models within the following scripts:
- vision_cogvlm_module.py
- vision_llava_module.py
- loader_vision_cogvlm.py
- loader_vision_llava.py
You can go into these scripts and modify the question sent to the vision model, but make sure the prompt format remains the same. In future releases I will likely add the functionality to experiment with different questions within the graphical user interface to achieve better results.
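As an illustration only (the exact variable names and wording inside those scripts may differ), a llava-style prompt generally looks like the snippet below; the safe part to edit is the question text, while the USER / <image> / ASSISTANT scaffolding should stay intact:

```python
# Illustrative sketch, not the program's actual code.
question = "Describe this image in as much detail as possible."  # safe to edit

# Typical llava prompt scaffolding: keep "USER:", the <image> placeholder,
# and "ASSISTANT:" exactly as they are; only change the question text.
prompt = f"USER: <image>\n{question} ASSISTANT:"
```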