This repository provides resources for exploring multimodal interaction and object detection using a combination of visual data (images) and text. It is designed for educational purposes, particularly for hands-on tutorials. References are included in each notebook.
In this notebook the focus is on exploring the basics of pre-trained object detection models (i.e. YOLOv8). The tutorial includes the following steps:
- inference
- fine-tuning
- open vocabulary extension
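The inference and fine-tuning steps above can be sketched with the `ultralytics` package (the library that ships YOLOv8); this is a minimal sketch, assuming `ultralytics` is installed and the dataset YAML path is a placeholder you would replace with your own:

```python
def run_yolo_demo(image_path: str = "image.jpg") -> None:
    """Sketch: inference and fine-tuning with a pre-trained YOLOv8 model.

    Assumes `pip install ultralytics`; "my_dataset.yaml" is a hypothetical
    dataset config, not a file provided by this repository.
    """
    from ultralytics import YOLO  # deferred import: heavy optional dependency

    model = YOLO("yolov8n.pt")  # downloads the pre-trained weights on first use

    # Inference: returns one Results object per input image
    results = model(image_path)
    for box in results[0].boxes:
        print(box.xyxy, box.cls, box.conf)  # coordinates, class id, confidence

    # Fine-tuning on a custom dataset described by a YAML file
    model.train(data="my_dataset.yaml", epochs=10, imgsz=640)
```

The open-vocabulary extension step follows the same pattern with an open-vocabulary checkpoint instead of the fixed-class `yolov8n.pt` weights.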
Extra:
In this notebook the focus is on learning how to use two VLMs, GPT-4 and Gemini, programmatically.
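A programmatic GPT-4 call with an image input can be sketched as follows; this assumes the `openai` package and an `OPENAI_API_KEY` environment variable, and the model name `"gpt-4o"` and the prompt text are assumptions, not fixed by this tutorial:

```python
def describe_image(image_url: str) -> str:
    """Sketch of asking a GPT-4 model to detect objects in an image.

    Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
    """
    from openai import OpenAI  # deferred import: requires the openai package

    client = OpenAI()  # reads the API key from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the objects in this image with bounding boxes."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Gemini is driven the same way through its own client library (`google-generativeai`), swapping the client and model name.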
This tutorial contains some utilities to help you plot bounding boxes on images and parse the output of a VLM. You are asked to use the material from the previous two notebooks to perform object detection with three different methods (YOLO, GPT-4, and Gemini) and compare their outcomes.
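Parsing the VLM output typically means pulling structured detections out of free-form text. A minimal sketch, assuming the model was prompted to answer with a JSON list of `{"label": ..., "box": [x1, y1, x2, y2]}` entries (a prompting convention, not a guaranteed GPT-4/Gemini format):

```python
import json
import re


def parse_vlm_boxes(response_text: str) -> list:
    """Extract a JSON list of detections from a VLM text reply.

    Tolerates surrounding prose and optional ```json fences by grabbing the
    outermost [...] span; returns an empty list if no JSON list is found.
    """
    match = re.search(r"\[.*\]", response_text, re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group(0))


reply = (
    "Here are the detections:\n"
    "```json\n"
    '[{"label": "dog", "box": [10, 20, 110, 220]}]\n'
    "```"
)
print(parse_vlm_boxes(reply)[0]["label"])  # → dog
```

The extracted boxes can then be handed to the plotting utilities to draw them on the image alongside YOLO's predictions.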
Install the dependencies with:
```
pip install -r requirements.txt
```