The main purpose of this repository is to fine-tune Facebook's DETR (DEtection TRansformer).
Author: Doramas Báez Bernal
Email: [email protected]
Unlike traditional computer vision techniques, DETR approaches object detection as a direct set prediction problem. It consists of a set-based global loss, which forces unique predictions via bipartite matching, and a Transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. Due to this parallel nature, DETR is very fast and efficient (paper).
This section lists the main dependencies of the project:
- torch>=1.5.0
- torchvision>=0.6.0
- pycocotools
It is also necessary to download the following directories:
- Dataset for the fine-tuning
- Checkpoints of the model after fine-tuning
Therefore, the project must have the following structure:
path/to/DERT-finetune/
├ dert.ipynb # dert notebook
├ train_custom_coco/ # folder containing dataset for fine-tuning
│ ├ annotations/ # annotation json files
│ ├ image_test/ # Images for testing after fine-tuning
│ ├ train2017/ # train images
│ └ val2017/ # val images
├ outputs/
│ └ checkpoint.pth # checkpoint of the model
└ data/
├ dert_finetune/ # DETR to fine-tune on a dataset
└ images/ # Images for the readme
DETR directly predicts (in parallel) the final set of detections by combining a common CNN with a transformer architecture. During training, bipartite matching uniquely assigns predictions to ground-truth boxes; a prediction with no match should yield a “no object” (∅) class. To this end, DETR adopts an encoder-decoder architecture based on transformers, a popular architecture for sequence prediction. Using self-attention, this architecture predicts all objects at once and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects.
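To make the bipartite matching idea concrete, here is a small illustrative sketch (not the repository's actual matcher) that pairs predictions with ground-truth boxes using the Hungarian algorithm from SciPy, with a simplified cost built from class probability and L1 box distance:

```python
# Illustrative sketch of DETR-style bipartite matching (not the repository's actual matcher).
# The cost is a simplified mix of (negative) class probability and L1 distance between boxes.
import torch
from scipy.optimize import linear_sum_assignment

num_queries, num_targets, num_classes = 5, 2, 3
pred_logits = torch.randn(num_queries, num_classes + 1)  # +1 for the "no object" class
pred_boxes = torch.rand(num_queries, 4)                  # (cx, cy, w, h), normalized
tgt_labels = torch.tensor([0, 2])                        # ground-truth class ids
tgt_boxes = torch.rand(num_targets, 4)

prob = pred_logits.softmax(-1)                           # [num_queries, num_classes + 1]
cost_class = -prob[:, tgt_labels]                        # [num_queries, num_targets]
cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)      # pairwise L1 box distance
cost = cost_class + cost_bbox

# Hungarian algorithm: each ground-truth box is assigned exactly one prediction;
# the remaining queries should predict the "no object" class.
pred_idx, tgt_idx = linear_sum_assignment(cost.numpy())
print(list(zip(pred_idx.tolist(), tgt_idx.tolist())))
```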
The architecture is described in more detail below:
As can be seen in the previous figure, DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model then flattens it and supplements it with a positional encoding before passing it to the transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. Finally, each output embedding is passed to a shared feed-forward network (FFN) that predicts either a detection (class and bounding box) or a “no object” class.
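As a quick orientation, the sketch below loads the pre-trained COCO model published by Facebook Research on torch hub and shows that a single forward pass returns a fixed set of 100 predictions (class logits plus normalized boxes); the input here is a dummy tensor just to illustrate the output format:

```python
import torch

# Pre-trained DETR (ResNet-50 backbone) published by Facebook Research on torch hub.
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

# Dummy image batch [batch, 3, H, W]; a real image should be resized and normalized first.
img = torch.rand(1, 3, 800, 800)

with torch.no_grad():
    outputs = model(img)

print(outputs['pred_logits'].shape)  # [1, 100, 92]: 100 queries, 91 COCO classes + "no object"
print(outputs['pred_boxes'].shape)   # [1, 100, 4]: (cx, cy, w, h) normalized to [0, 1]
```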
A dataset has been prepared for the fine-tuning. It contains approximately 900 images taken from the larger COCO dataset, and the subset consists of 3 classes (a small sanity-check sketch follows the list):
- fire hydrant
- parking meter
- stop sign
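To double-check the subset, the annotation file can be inspected with the COCO API; the file name below is only an example, adjust it to whatever json is inside train_custom_coco/annotations/:

```python
from pycocotools.coco import COCO

# Hypothetical annotation file name; use the actual json inside train_custom_coco/annotations/.
coco = COCO('train_custom_coco/annotations/custom_train.json')

cats = coco.loadCats(coco.getCatIds())
print([c['name'] for c in cats])   # expected: ['fire hydrant', 'parking meter', 'stop sign']
print(len(coco.getImgIds()), 'images')
```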
Example of the images used:
The following results were obtained by adapting the model weights (fine-tuning) for 30 epochs:
Official repositories:
- Facebook's DETR (paper)
- Facebook's detectron2 wrapper for DETR
- DETR checkpoints: for the fine-tuning, the classification head will be removed (see the sketch after this list).
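A minimal sketch of that step, following the same approach as the fine-tuning Gist linked under Tutorials (the output file name is arbitrary):

```python
import torch

# Official DETR-R50 checkpoint from the DETR model zoo.
checkpoint = torch.hub.load_state_dict_from_url(
    'https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth',
    map_location='cpu', check_hash=True)

# The classification head is tied to the 91 COCO classes, so its weights are dropped
# before fine-tuning on a dataset with a different number of classes.
del checkpoint['model']['class_embed.weight']
del checkpoint['model']['class_embed.bias']

torch.save(checkpoint, 'detr-r50_no-class-head.pth')  # arbitrary output file name
```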
Requirements:
- Dataset for fine-tuning DETR
- The last checkpoint (inside the outputs folder)
Special mention:
- Build your own dataset
- Example of fine-tuning DETR by woctezuma
- Fork of DETR prepared for fine-tuning on a custom dataset, by woctezuma
Official notebooks:
- An official notebook illustrating DETR
- An official notebook for using COCO API
Tutorials:
- A GitHub Gist explaining how to fine-tune DETR
- A GitHub issue explaining how to load a fine-tuned DETR (a minimal loading sketch is shown below)
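For reference, a minimal sketch of loading the fine-tuned weights from outputs/checkpoint.pth; the num_classes value is an assumption here and must match the value used during fine-tuning:

```python
import torch

# num_classes is an assumption; it must match the value used during fine-tuning.
num_classes = 3
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50',
                       pretrained=False, num_classes=num_classes)

# Load the fine-tuned weights produced by the training run.
checkpoint = torch.load('outputs/checkpoint.pth', map_location='cpu')
model.load_state_dict(checkpoint['model'])
model.eval()
```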