diff --git a/src/User_Manual/vision.html b/src/User_Manual/vision.html new file mode 100644 index 00000000..a498c60d --- /dev/null +++ b/src/User_Manual/vision.html @@ -0,0 +1,188 @@ + + + + + + Vision Models + + + + + +
+

Vision

+
+ +
+ +
+ +

What are Vision Models?

+ +

Vision models are basically large language models that can analyze and extract information from a variety of images. For purposes of this program, vision models are used to extract a summary of what an image depicts and add this description to the vector database, where it can be searched along with any traditional documents you add!

+ +

Which Vision Models Are Available?

+ +

There are three named vision models available with this program:

+ +
    +
  1. llava
  2. bakllava
  3. cogvlm
+ +

llava models were trailblazers, and this program uses both the 7b and 13b sizes. llava models are based on the llama2 architecture. bakllava is similar to llava except that its architecture is based on mistral and it only comes in the 7b size. cogvlm is larger at 18b parameters, but it is my personal favorite because it produces the best results by far. In my testing, over 90% of the statements in its summaries were accurate, whereas bakllava is only about 70% accurate and llava is slightly lower than that (regardless of whether you use the 7b or 13b size).

+ +

What Do the Settings Mean?

+ +

Model is the model's name. Note that you cannot use cogvlm on macOS because it requires the xformers library, which does not currently provide a macOS build.

+ +

Size refers to the number of parameters (in billions). Larger generally means better, but unlike with typical large language models, I didn't notice a difference between the llava 7b and 13b sizes. Feel free to experiment. The Tools Tab contains a table outlining the general VRAM requirements for the various models/settings. Remember, this is before accounting for overhead such as driving your monitor, which typically adds another 1-2 GB.

+ +

Quant refers to the quantization of the model, i.e. how much it's reduced from its original floating point format. See the tail end of the Whisper portion of the User Guide for a primer on floating point formats. This program uses the bitsandbytes library to perform the quantization because it's the only option I was aware of that could quantize cogvlm, which is far superior IMHO.
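
To get a rough feel for how Size and Quant interact, here is a minimal back-of-the-envelope sketch. The function name is made up for illustration and is not part of this program. It only counts the model weights (parameters times bits per parameter), so real usage will be higher once you add everything else needed at runtime. Treat the table in the Tools Tab as the better guide.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes).

    Illustrative helper only: it ignores activations, library overhead,
    and the 1-2 GB your monitor may already be using.
    """
    if params_billions <= 0 or bits_per_param <= 0:
        raise ValueError("params_billions and bits_per_param must be positive")
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Sizes taken from this page; the bit widths are common choices, listed as examples.
    for name, size in [("llava 7b", 7), ("llava 13b", 13), ("cogvlm 18b", 18)]:
        for bits in (16, 8, 4):
            gb = estimate_weight_memory_gb(size, bits)
            print(f"{name} at {bits}-bit: roughly {gb:.1f} GB of weights")
```

Halving the bits per parameter halves the weight footprint, which is why a quantized cogvlm can fit on cards that could never hold it in its original format.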

+ +

Why Are Some Settings Disabled?

+ +

Flash Attention 2 is a very powerful newer technology, but it requires CUDA 12+. This program relies exclusively on CUDA 11 due to compatibility with the faster-whisper library that handles the audio features. However, faster-whisper should be adding CUDA 12+ support in the near future, at which time Flash Attention 2 should be available. Batch will be explained and added in a future release.

+ +

How Do I Use the Vision Model?

+ +

Before Release 3, this program put everything you selected in the "Docs_for_DB" folder. Now it puts any images you select in the "Images_for_DB" folder instead. You can manually remove images from there if need be. Once documents and/or images are selected, you simply click the create database button like before. The document processor runs in two steps: first it loads non-images, and then it loads any images.
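
The folder split and the two-step loading order described above can be sketched roughly as follows. This is not the program's actual code: the function names and the list of image extensions are assumptions made for illustration. Only the two folder names come from this page.

```python
from pathlib import Path

# Assumption: which extensions count as images. The program's real list may differ.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def destination_folder(file_name: str) -> str:
    """Return the folder a selected file would be placed in."""
    suffix = Path(file_name).suffix.lower()
    return "Images_for_DB" if suffix in IMAGE_EXTENSIONS else "Docs_for_DB"


def loading_order(file_names: list[str]) -> list[str]:
    """Order files the way the two-step load runs: non-images first, then images."""
    documents = [n for n in file_names if destination_folder(n) == "Docs_for_DB"]
    images = [n for n in file_names if destination_folder(n) == "Images_for_DB"]
    return documents + images


if __name__ == "__main__":
    selected = ["beach.png", "report.pdf", "cat.JPG", "notes.txt"]
    for name in loading_order(selected):
        print(f"{name} -> {destination_folder(name)}")
```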

+ +

The "loading" process takes very little time for documents but a relatively long time for images. "Loading" images involves creating a summary of each image using the selected vision model. Make sure to test your vision model settings within the Tools Tab before committing to processing, for example, 100 images.

+ +

After both documents and images are "loaded," they are added to the vectorstore just as in prior releases of this program.

+ +

Once the database is "persisted," try searching for images that depict a certain thing. Also, you can check the chunks only checkbox to see the results returned by the database instead of connecting to LM Studio. This is extremely useful for fine-tuning your settings, including both the chunking/overlap settings and the vision model settings.

+ +

PRO TIP: Make sure to set your chunk size larger than the summaries the vision model provides. Doing this prevents the summary for a particular image from EVER being split. In short, each and every image chunk consists of the entire summary provided by the vision model! This tends to require a chunk size of 400-800, depending on the vision model settings.
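
If you want to check this against your own results, here is a minimal sketch. It assumes the chunk size is measured in characters, which I have not confirmed on this page, and both function names are hypothetical. You could paste in a few summaries from the chunks only view to try it.

```python
def smallest_safe_chunk_size(summaries: list[str], margin: int = 50) -> int:
    """Smallest chunk size that keeps every summary whole, plus a safety margin.

    Assumes chunk size is counted in characters.
    """
    if not summaries:
        raise ValueError("need at least one summary to measure")
    return max(len(summary) for summary in summaries) + margin


def summaries_that_would_split(summaries: list[str], chunk_size: int) -> list[str]:
    """Return the summaries that are longer than the given chunk size."""
    return [summary for summary in summaries if len(summary) > chunk_size]


if __name__ == "__main__":
    # Placeholder text standing in for real vision model summaries.
    samples = ["a" * 400, "b" * 650]
    print("Suggested chunk size:", smallest_safe_chunk_size(samples))
    print("Too long for 500:", len(summaries_that_would_split(samples, 500)))
```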

+ +

Can I Change What the Vision Model Does?

+ +

For this initial release, I hardcoded the questions asked of the vision models within the following scripts:

+ +
    +
  1. vision_cogvlm_module.py
  2. vision_llava_module.py
  3. loader_vision_cogvlm.py
  4. loader_vision_llava.py
+ +

You can go into these scripts and modify the question sent to the vision model, but make sure the prompt format remains the same. In future releases I will likely add the ability to experiment with different questions within the graphical user interface to achieve better results.
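
One way to picture "change the question, keep the prompt format" is to treat the format as a fixed template with a single slot for the question. The sketch below is only an illustration: the template string is a placeholder I made up, not copied from the scripts listed above, and the function name is hypothetical. Use whatever format you actually find in the script you are editing.

```python
# Hypothetical template: a stand-in for whatever format the script really uses.
PROMPT_TEMPLATE = "USER: <image>\n{question}\nASSISTANT:"


def build_prompt(question: str, template: str = PROMPT_TEMPLATE) -> str:
    """Insert a new question into the template without touching the rest of it."""
    question = question.strip()
    if not question:
        raise ValueError("the question must not be empty")
    if "{question}" not in template:
        raise ValueError("the template must contain a {question} slot")
    # str.replace is used so that other braces in the template are left alone.
    return template.replace("{question}", question)


if __name__ == "__main__":
    print(build_prompt("List every object you can see in this image."))
```

Only the text in the question slot changes. Everything around it stays byte-for-byte identical, which is the part the model depends on.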

+ +
+ + + +