Merge pull request huggingface#59 from duydl/feature/vlm-module
[MODULE] Implement Chapter 5: Vision Language Model
Showing 9 changed files with 977 additions and 1 deletion.
# Vision Language Models

## 1. VLM Usage

Vision Language Models (VLMs) process image inputs alongside text to enable tasks like image captioning, visual question answering, and multimodal reasoning.

A typical VLM architecture consists of an image encoder to extract visual features, a projection layer to align visual and textual representations, and a language model to process or generate text. This allows the model to establish connections between visual elements and language concepts.

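To make that encoder → projection → language-model flow concrete, here is a toy PyTorch sketch. `TinyVLM` and its stand-in components are purely illustrative (they are not how SmolVLM or any production VLM is implemented); the point is only to show where the projection layer sits between the two modalities.

```python
import torch
import torch.nn as nn


class TinyVLM(nn.Module):
    """Toy wiring of the three blocks described above. Real VLMs are far more
    involved: image tokens are interleaved with text, attention masks and
    special tokens are handled, and the encoder is a full vision transformer."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder                # extracts visual features
        self.projection = nn.Linear(vision_dim, text_dim)   # aligns image features with the LM embedding space
        self.language_model = language_model                # processes the fused sequence

    def forward(self, pixel_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        image_features = self.vision_encoder(pixel_features)  # (batch, n_patches, vision_dim)
        image_embeds = self.projection(image_features)         # (batch, n_patches, text_dim)
        fused = torch.cat([image_embeds, text_embeds], dim=1)  # image tokens prepended to the text
        return self.language_model(fused)


# Stand-in components, just to show the shapes flowing through.
vlm = TinyVLM(vision_encoder=nn.Linear(64, 128), language_model=nn.Identity(),
              vision_dim=128, text_dim=32)
out = vlm(torch.randn(1, 196, 64), torch.randn(1, 10, 32))
print(out.shape)  # torch.Size([1, 206, 32])
```
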
VLMs can be used in different configurations depending on the use case. Base models handle general vision-language tasks, while chat-optimized variants support conversational interactions. Some models include additional components for grounding predictions in visual evidence or specializing in specific tasks like object detection.

For more on the technical details and usage of VLMs, refer to the [VLM Usage](./vlm_usage.md) page.

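As a quick preview, the snippet below condenses the inference flow from the `vlm_usage_sample.ipynb` notebook added in this commit. It skips the 4-bit quantization used there and loads the model in bfloat16, which is just one reasonable choice.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "HuggingFaceTB/SmolVLM-Instruct"
model = AutoModelForVision2Seq.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(model_name)

# Any image URL or local path works; this one is taken from the usage notebook.
image = load_image("https://cdn.pixabay.com/photo/2024/11/20/09/14/christmas-9210799_1280.jpg")

# Chat-style message: an image placeholder followed by the text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe the image?"},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
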
## 2. VLM Fine-Tuning

Fine-tuning a VLM involves adapting a pre-trained model to perform specific tasks or to operate effectively on a particular dataset. The process can follow methodologies such as supervised fine-tuning, preference optimization, or a hybrid approach that combines both, as introduced in Modules 1 and 2.

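As a rough sketch of the supervised path, the outline below follows the pattern of the TRL VLM fine-tuning cookbook listed in the references. It assumes `model`, `processor`, and a `dataset` with `messages` and `images` columns already exist, and exact `SFTConfig` arguments vary between TRL versions, so treat it as a starting point rather than a recipe.

```python
from trl import SFTConfig, SFTTrainer

# Assumed to exist already: `model`, `processor`, and a `dataset` whose rows
# contain chat-style "messages" plus the corresponding "images".

def collate_fn(examples):
    # Render each sample's chat messages to text and pair it with its image(s).
    texts = [processor.apply_chat_template(ex["messages"], tokenize=False) for ex in examples]
    images = [ex["images"] for ex in examples]
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    # Standard causal-LM labels; a fuller recipe would also mask the image tokens.
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch

training_args = SFTConfig(
    output_dir="smolvlm-sft",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    learning_rate=1e-4,
    dataset_text_field="",                           # dummy value; the collator builds the real inputs
    remove_unused_columns=False,                     # keep the image column for the collator
    dataset_kwargs={"skip_prepare_dataset": True},   # we format the data ourselves
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    data_collator=collate_fn,
)
trainer.train()
```
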
While the core tools and techniques remain similar to those used for LLMs, fine-tuning VLMs requires additional focus on data representation and preparation for images. This ensures the model effectively integrates and processes both visual and textual data for optimal performance. Given that the demo model, SmolVLM, is significantly larger than the language model used in the previous module, it's essential to explore methods for efficient fine-tuning. Techniques like quantization and PEFT can help make the process more accessible and cost-effective, allowing more users to experiment with the model.

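For example, a minimal sketch (not the course's exact setup) of loading SmolVLM in 4-bit precision and attaching LoRA adapters with `peft` could look like this; the rank and target modules shown are common defaults rather than prescribed values.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "HuggingFaceTB/SmolVLM-Instruct"

# 4-bit quantization keeps the memory footprint small enough for a single consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

# LoRA trains a small set of adapter weights instead of the full model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; a common default choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
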
For detailed guidance on fine-tuning VLMs, visit the [VLM Fine-Tuning](./vlm_finetuning.md) page.

## Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
| VLM Usage | Learn how to load and use a pre-trained VLM for various tasks | 🐢 Process an image<br>🐕 Process multiple images with batch handling<br>🦁 Process a full video | [Notebook](./notebooks/vlm_usage_sample.ipynb) | <a target="_blank" href="https://colab.research.google.com/github/user/project/vlm_usage_sample.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
| VLM Fine-Tuning | Learn how to fine-tune a pre-trained VLM for task-specific datasets | 🐢 Use a basic dataset for fine-tuning<br>🐕 Try a new dataset<br>🦁 Experiment with alternative fine-tuning methods | [Notebook](./notebooks/vlm_finetune_sample.ipynb) | <a target="_blank" href="https://colab.research.google.com/github/user/project/vlm_finetune_sample.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |

## References

- [Hugging Face Learn: Supervised Fine-Tuning VLMs](https://huggingface.co/learn/cookbook/fine_tuning_vlm_trl)
- [Hugging Face Blog: Preference Optimization for VLMs](https://huggingface.co/blog/dpo_vlm)
- [Hugging Face Blog: Vision Language Models](https://huggingface.co/blog/vlms)
- [Hugging Face Blog: SmolVLM](https://huggingface.co/blog/smolvlm)
- [Hugging Face Model: SmolVLM-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct)
- [CLIP: Learning Transferable Visual Models from Natural Language Supervision](https://arxiv.org/abs/2103.00020)
- [Align Before Fuse: Vision and Language Representation Learning with Momentum Distillation](https://arxiv.org/abs/2107.07651)

5_vision_language_models/notebooks/vlm_sft_sample.ipynb
495 changes: 495 additions & 0 deletions
Large diffs are not rendered by default.
5_vision_language_models/notebooks/vlm_usage_sample.ipynb
270 changes: 270 additions & 0 deletions
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Processing images and text with VLMs \n",
"\n",
"This notebook demonstrates how to utilize the `HuggingFaceTB/SmolVLM-Instruct` 4bit-quantized model for various multimodal tasks."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "42a6f26e0f59460ea9eaacb7bec95f31",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(HTML(value='<center> <img\\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Install the requirements in Google Colab\n",
"# !pip install transformers datasets trl huggingface_hub bitsandbytes\n",
"\n",
"# Authenticate to Hugging Face\n",
"from huggingface_hub import login\n",
"login()\n",
"\n",
"# for convenience you can create an environment variable containing your hub token as HF_TOKEN"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"`low_cpu_mem_usage` was None, now default to True since model is quantized.\n",
"You shouldn't move a model that is dispatched using accelerate hooks.\n",
"Some kwargs in processor config are unused and will not have any effect: image_seq_len. \n"
]
}
],
"source": [
"import torch\n",
"from PIL import Image\n",
"from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig\n",
"from transformers.image_utils import load_image\n",
"\n",
"device = (\n",
"    \"cuda\"\n",
"    if torch.cuda.is_available()\n",
"    else \"mps\" if torch.backends.mps.is_available() else \"cpu\"\n",
")\n",
"\n",
"quantization_config = BitsAndBytesConfig(load_in_4bit=True)\n",
"model_name = \"HuggingFaceTB/SmolVLM-Instruct\"\n",
"model = AutoModelForVision2Seq.from_pretrained(\n",
"    model_name,\n",
"    quantization_config=quantization_config,\n",
").to(device)\n",
"processor = AutoProcessor.from_pretrained(\"HuggingFaceTB/SmolVLM-Instruct\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Processing Images"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<img src=\"https://cdn.pixabay.com/photo/2024/11/20/09/14/christmas-9210799_1280.jpg\"/>"
],
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<img src=\"https://cdn.pixabay.com/photo/2024/11/23/08/18/christmas-9218404_1280.jpg\"/>"
],
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<img src=\"https://cdn.pixabay.com/photo/2024/11/20/09/14/christmas-9210799_1280.jpg\"/>"
],
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from IPython.display import Image, display\n",
"\n",
"image_url1 = \"https://cdn.pixabay.com/photo/2024/11/20/09/14/christmas-9210799_1280.jpg\"\n",
"display(Image(url=image_url1))\n",
"\n",
"image_url2 = \"https://cdn.pixabay.com/photo/2024/11/23/08/18/christmas-9218404_1280.jpg\"\n",
"display(Image(url=image_url2))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['User:<image>Can you describe the image?\\nAssistant: The image is a scene of a person walking in a forest. The person is wearing a coat and a cap. The person is holding the hand of another person. The person is walking on a path. The path is covered with dry leaves. The background of the image is a forest with trees.']\n"
]
}
],
"source": [
"# Load one image\n",
"image1 = load_image(image_url1)\n",
"\n",
"# Create input messages\n",
"messages = [\n",
"    {\n",
"        \"role\": \"user\",\n",
"        \"content\": [\n",
"            {\"type\": \"image\"},\n",
"            {\"type\": \"text\", \"text\": \"Can you describe the image?\"}\n",
"        ]\n",
"    },\n",
"]\n",
"\n",
"# Prepare inputs\n",
"prompt = processor.apply_chat_template(messages, add_generation_prompt=True)\n",
"inputs = processor(text=prompt, images=[image1], return_tensors=\"pt\")\n",
"inputs = inputs.to(device)\n",
"\n",
"# Generate outputs\n",
"generated_ids = model.generate(**inputs, max_new_tokens=500)\n",
"generated_texts = processor.batch_decode(\n",
"    generated_ids,\n",
"    skip_special_tokens=True,\n",
")\n",
"\n",
"print(generated_texts)\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['User:<image>Can you describe the two images?\\nAssistant: The first image shows a child wearing a red hat and holding a red hat in front of a fireplace. The second image shows a present with a pine cone and a Christmas decoration.']\n"
]
}
],
"source": [
"\n",
"# Load the second image\n",
"image2 = load_image(image_url2)\n",
"\n",
"# Create input messages\n",
"messages = [\n",
"    {\n",
"        \"role\": \"user\",\n",
"        \"content\": [\n",
"            {\"type\": \"image\"},\n",
"            {\"type\": \"image\"},\n",
"            {\"type\": \"text\", \"text\": \"Can you describe the two images?\"}\n",
"        ]\n",
"    },\n",
"    # {\n",
"    #     \"role\": \"user\",\n",
"    #     \"content\": [\n",
"    #         {\"type\": \"image\"},\n",
"    #         {\"type\": \"image\"},\n",
"    #         {\"type\": \"text\", \"text\": \"What event do they both represent?\"}\n",
"    #     ]\n",
"    # },\n",
"]\n",
"\n",
"# Prepare inputs\n",
"prompt = processor.apply_chat_template(messages, add_generation_prompt=True)\n",
"inputs = processor(text=prompt, images=[image1, image2], return_tensors=\"pt\")\n",
"inputs = inputs.to(device)\n",
"\n",
"# Generate outputs\n",
"generated_ids = model.generate(**inputs, max_new_tokens=500)\n",
"generated_texts = processor.batch_decode(\n",
"    generated_ids,\n",
"    skip_special_tokens=True,\n",
")\n",
"\n",
"print(generated_texts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Processing videos"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "py310",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}