Whisper Transcription
Overview
My program uses powerful Whisper models for transcription in two ways:
- Transcribe your question to the clipboard and then paste it into the question box; and
- Within the "Tools" tab, transcribe entire audio files to be put into the vector database.
Transcribe Question
The start and stop recording buttons transcribe your voice to the clipboard, which you can then simply paste into the question box for the LLM. The quality of the transcription can be controlled from the "Settings" tab. The available Whisper model sizes and quantizations are automatically populated based on your system's capabilities. You can read below exactly what these settings mean.
Feel free to use GPU or CPU acceleration: the Whisper models are immediately unloaded from memory after the transcription is complete in order to conserve system memory. Just remember to click "Update Settings" whenever you change the settings.
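For readers who want to script something similar outside the GUI, here is a minimal sketch of the record, transcribe, copy-to-clipboard flow. It assumes the faster-whisper, sounddevice, and pyperclip packages; the five-second recording window and the `small.en`/CPU settings are illustrative choices, not the program's actual internals.

```python
# Minimal sketch of the record -> transcribe -> clipboard flow.
# Assumptions: faster-whisper, sounddevice, and pyperclip are installed;
# the 5-second window and the "small.en"/CPU settings are illustrative only.
import sounddevice as sd
import pyperclip
from faster_whisper import WhisperModel

SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio
DURATION = 5          # hypothetical fixed recording window, in seconds

# Record from the default microphone as float32, which faster-whisper accepts.
audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
               channels=1, dtype="float32")
sd.wait()  # block until the recording finishes

# Load, transcribe, then delete the model so memory is reclaimed right away,
# mirroring the "unloaded immediately after transcription" behavior above.
model = WhisperModel("small.en", device="cpu", compute_type="float32")
segments, _info = model.transcribe(audio.flatten())
text = " ".join(segment.text.strip() for segment in segments)
del model

pyperclip.copy(text)  # now just paste into the question box
print(text)
```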
Transcribe Audio Files for Database
+ +The "Tools" tab includes a new feature to transcribe audio files of any length and put a .txt
file in the folder
+ holding the files to put into the vector database. Once the transcription is complete, it will automatically put the file there.
+ Remember, you must re-created the database anytime you want to add/remove a file from it.
I highly recommend using GPU acceleration if available, since transcribing an audio file takes a lot longer than a simple question. The settings for transcribing an audio file are separate from the voice-transcribe settings, so changing one does not change the other. Read more below about the Whisper models and quantization to ensure you're getting the most out of this powerful new feature. Batch processing is coming in the future.
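As a rough sketch of what this file-to-text step can look like (assuming a faster-whisper/CTranslate2 style loader; the Docs_for_DB path and the timestamp format here are just illustrative):

```python
# Sketch of transcribing one audio file to a .txt next to your database docs.
# Assumptions: faster-whisper is installed; the "Docs_for_DB" path and the
# "[start -> end]" timestamp format are illustrative, not the program's own.
from pathlib import Path
from faster_whisper import WhisperModel

AUDIO_FILE = Path("lecture.mp3")   # hypothetical input file
OUTPUT_DIR = Path("Docs_for_DB")   # folder scanned when the database is built
WITH_TIMESTAMPS = True             # mirrors the timestamps checkbox

# A large model on GPU with float16, per the recommendations below.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, _info = model.transcribe(str(AUDIO_FILE))

lines = []
for seg in segments:
    if WITH_TIMESTAMPS:
        lines.append(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text.strip()}")
    else:
        lines.append(seg.text.strip())

OUTPUT_DIR.mkdir(exist_ok=True)
out_path = OUTPUT_DIR / (AUDIO_FILE.stem + ".txt")
out_path.write_text("\n".join(lines), encoding="utf-8")
print(f"Saved transcription to {out_path}")
```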
Whisper Models and Quants
English versus Non-English Models
Use models ending in `.en` if you speak English. The `large-v2` model doesn't come in an English-specific variant because it's just good at everything.
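If you're scripting your own model selection, the naming convention reduces to something like this hypothetical helper:

```python
def pick_whisper_model(size: str, english_only: bool) -> str:
    """Hypothetical helper mapping a size choice to a Whisper model name.

    Sizes: tiny, base, small, medium, large-v2. The ".en" suffix selects the
    English-only variant, which large-v2 doesn't have.
    """
    if english_only and not size.startswith("large"):
        return f"{size}.en"
    return size

assert pick_whisper_model("small", english_only=True) == "small.en"
assert pick_whisper_model("large-v2", english_only=True) == "large-v2"
```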
Upcoming Improvements

- Add functionality to process multiple files in batch by selecting a single directory.
- Add functionality to process multiple files at once (multiple "workers") while batch processing (VRAM intensive).
- Support the new Whisper large-v3 model released a few days ago.
- Support the new Distil-Whisper models, which use half the VRAM and compute requirements but only support English.

My Recommendations
The size of the model is the most important factor, followed by quantization. For transcribing your questions, I recommend `small.en` with `float32` for everyday usage (using CPU). For transcribing an audio file, always use GPU if available. I also recommend using as large a Whisper model as possible with a quantization of `float32`, `float16`, or `bfloat16` so you don't have to re-transcribe it. Also, don't forget to check the timestamps option if you want timestamps.
If you're trying to transcribe a file using your CPU, speed heavily depends on which CPU you have, but be aware that the CPU is roughly 500x slower than even a mediocre GPU. Even then, I wouldn't recommend going below `small`/`small.en`, because there is a significant jump in quality between `base` and `small`.
Below is a table of all the Whisper models I've quantized, and below that, a primer on floating point formats and quantization!
Available Whisper Models and Quants
| Quantization | Size on Disk |
|---|---|
| whisper-tiny.en-ct2-int8_bfloat16 | 42.7 MB |
| whisper-tiny.en-ct2-int8_float16 | 42.7 MB |
| whisper-tiny-ct2-int8_bfloat16 | 43.1 MB |
| whisper-tiny-ct2-int8_float16 | 43.1 MB |
| whisper-tiny.en-ct2-int8 | 45.4 MB |
| whisper-tiny.en-ct2-int8_float32 | 45.4 MB |
| whisper-tiny-ct2-int8 | 45.7 MB |
| whisper-tiny-ct2-int8_float32 | 45.7 MB |
| whisper-base.en-ct2-int8_bfloat16 | 78.4 MB |
| whisper-base.en-ct2-int8_float16 | 78.4 MB |
| whisper-base-ct2-int8_bfloat16 | 78.7 MB |
| whisper-base-ct2-int8_float16 | 78.7 MB |
| whisper-tiny.en-ct2-bfloat16 | 78.8 MB |
| whisper-tiny.en-ct2-float16 | 78.8 MB |
| whisper-tiny-ct2-bfloat16 | 79.1 MB |
| whisper-tiny-ct2-float16 | 79.1 MB |
| whisper-base.en-ct2-int8 | 82.4 MB |
| whisper-base.en-ct2-int8_float32 | 82.4 MB |
| whisper-base-ct2-int8 | 82.7 MB |
| whisper-base-ct2-int8_float32 | 82.7 MB |
| whisper-base.en-ct2-bfloat16 | 148.5 MB |
| whisper-base.en-ct2-float16 | 148.5 MB |
| whisper-base-ct2-bfloat16 | 148.8 MB |
| whisper-base-ct2-float16 | 148.8 MB |
| whisper-tiny.en-ct2-float32 | 154.4 MB |
| whisper-tiny-ct2-float32 | 154.7 MB |
| whisper-small.en-ct2-int8_bfloat16 | 249.8 MB |
| whisper-small.en-ct2-int8_float16 | 249.8 MB |
| whisper-small-ct2-int8_bfloat16 | 250.2 MB |
| whisper-small-ct2-int8_float16 | 250.2 MB |
| whisper-small.en-ct2-int8 | 257.3 MB |
| whisper-small.en-ct2-int8_float32 | 257.3 MB |
| whisper-small-ct2-int8 | 257.7 MB |
| whisper-small-ct2-int8_float32 | 257.7 MB |
| whisper-base.en-ct2-float32 | 293.7 MB |
| whisper-base-ct2-float32 | 294.0 MB |
| whisper-small.en-ct2-bfloat16 | 486.8 MB |
| whisper-small.en-ct2-float16 | 486.8 MB |
| whisper-small-ct2-bfloat16 | 487.1 MB |
| whisper-small-ct2-float16 | 487.1 MB |
| whisper-medium.en-ct2-int8_bfloat16 | 775.8 MB |
| whisper-medium.en-ct2-int8_float16 | 775.8 MB |
| whisper-medium-ct2-int8_bfloat16 | 776.1 MB |
| whisper-medium-ct2-int8_float16 | 776.1 MB |
| whisper-medium.en-ct2-int8 | 788.2 MB |
| whisper-medium.en-ct2-int8_float32 | 788.2 MB |
| whisper-medium-ct2-int8 | 788.5 MB |
| whisper-medium-ct2-int8_float32 | 788.5 MB |
| whisper-small.en-ct2-float32 | 970.4 MB |
| whisper-small-ct2-float32 | 970.7 MB |
| whisper-medium.en-ct2-bfloat16 | 1.5 GB |
| whisper-medium.en-ct2-float16 | 1.5 GB |
| whisper-medium-ct2-bfloat16 | 1.5 GB |
| whisper-medium-ct2-float16 | 1.5 GB |
| whisper-large-v2-ct2-int8_bfloat16 | 1.6 GB |
| whisper-large-v2-ct2-int8_float16 | 1.6 GB |
| whisper-large-v2-ct2-int8 | 1.6 GB |
| whisper-large-v2-ct2-int8_float32 | 1.6 GB |
| whisper-medium.en-ct2-float32 | 3.1 GB |
| whisper-medium-ct2-float32 | 3.1 GB |
| whisper-large-v2-ct2-bfloat16 | 3.1 GB |
| whisper-large-v2-ct2-float16 | 3.1 GB |
| whisper-large-v2-ct2-float32 | 6.2 GB |
Introduction to Floating Point Formats
Running an embedding model or a large language model requires a lot of math, and computers don't understand decimal numbers (1, 2, 3) like you and me. Rather, they represent numbers with a series of ones and zeros called "bits." In general, more bits means higher quality, but also more VRAM/RAM and compute power needed. With that being said, the quality also depends on how many of the bits are "exponent" versus "fraction" bits.
+ +The phrase "Floating point format" refers to the total number of bits used and how many are "exponent" versus "fraction." + The three most common floating point formats are shown above. Notice that both Float16 and Bfloat16 use 16 bits but + a different number of "exponent" versus "fraction" bits.
+ +"Exponent" bits essentially determine the "range" of numbers that a neural network can use when doing math. + For example, Float32 has 8 "exponent" bits so hypothetically this allows the neural network to use any integer + between one and one-hundred - its "range is 1-100. Bfloat16 would have the same "range" because it also has + 8 "exponent" bits. However, since Float16 only has 5 "exponent" bits its "range" might only be 1-50.
+ + + +"Fraction" bits essentially determine the number of unique values that can be used within that "range." + For example, Float32 has 23 "fraction" bits so hypothetically it can use every whole number between 1-100 when doing math. + Since Bfloat16 only has 7 "fraction" bits, it might only have 25 unique values within 1-100. + This is also referred to as the "precision" of a neural network.
These are hypotheticals; the actual ranges and precisions are summarized in this table:
| Floating Point Format | Range (Based on Exponent) | Discrete Values (Based on Fraction) |
|---|---|---|
| float32 | ~3.4×10^38 | 8,388,608 |
| float16 | ±65,504 | 1,024 |
| bfloat16 | ~3.4×10^38 | 128 |
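If you want to sanity-check these figures yourself, NumPy exposes them directly (with the optional ml_dtypes package supplying bfloat16, since NumPy itself doesn't ship it); this snippet is just a verification aid, not part of my program:

```python
# Sanity-check the table above using NumPy's float metadata.
# Assumption: the optional ml_dtypes package supplies bfloat16, since
# NumPy itself only ships float16/float32/float64.
import numpy as np
import ml_dtypes

entries = [
    ("float32", np.finfo(np.float32)),
    ("float16", np.finfo(np.float16)),
    ("bfloat16", ml_dtypes.finfo(ml_dtypes.bfloat16)),
]
for name, info in entries:
    # info.max is the largest representable value (the "range");
    # info.nmant is the fraction-bit count, so 2**nmant counts the
    # distinct fraction values (the "precision").
    print(f"{name:8}  max={float(info.max):.4g}  fraction values={2 ** info.nmant:,}")
```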
The "range" and "precision" both determine the "quality" of an output, but in different ways. + In general, different floating point formats are good for different purposes. For example, Google, which created + Bfloat16, found that it was better for neural networks while Float16 was better for scientific calculations.
You can see the floating point format of the various embedding models used in my program by looking at the "config.json" file for each model.
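For example, here is a quick way to peek at that field (assuming a Hugging Face style config.json with a "torch_dtype" entry, and a hypothetical local model folder):

```python
# Quick peek at a model's floating point format from its config.json.
# Assumptions: a Hugging Face style config with a "torch_dtype" field,
# and a hypothetical local folder name; substitute your own model path.
import json
from pathlib import Path

config = json.loads(Path("all-MiniLM-L6-v2/config.json").read_text())
print(config.get("torch_dtype", "not specified"))
```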
What is Quantization?
+ +"Quantization" refers to converting the original floating point format to one with a smaller "range" and "precision." + Projects like LLAMA.CPP and AutoGPTQ do this with slightly different algorithms. The overall goal is to reduce + the memory and computational power needed while only suffering a "reasonable" loss in quality. + Specific "quantizations" like "Q8_0" or "8-bit" refer to the "floating point format" of "int8." + (Technically, "int8" is no longer "floating" but you don't need to delve into the nuances of this to understand + the basic concepts I'm trying to communicate.)
Here are the range and precision for "int8," which are clearly smaller:
| Floating Point Format | Range (Based on Exponent) | Discrete Values (Based on Fraction) |
|---|---|---|
| int8 | -128 to 127 | ±127 (within integer range) |
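To make the idea concrete, here is a toy sketch of symmetric int8 quantization. It is a simplified illustration only; real quantizers such as CTranslate2, llama.cpp, and AutoGPTQ use per-block scales and more sophisticated rounding:

```python
# Toy symmetric int8 quantization: map float32 weights into [-127, 127]
# with a single scale, then dequantize to see the precision loss.
# Simplified illustration only; real quantizers use per-block scales
# and more sophisticated rounding schemes.
import numpy as np

weights = np.array([0.03, -1.25, 0.72, 2.50, -0.01], dtype=np.float32)

scale = np.abs(weights).max() / 127.0          # one scale for the whole tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale     # back to float32, with error

print("int8 values:  ", q)
print("reconstructed:", dequantized)
print("max abs error:", np.abs(weights - dequantized).max())
```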