Deep Learning, Computer Vision, End-to-End Object Detection with Transformers

doramasma/DERT-finetune

Fine-tuning Detection Transformer (DETR)

The main purpose of this repository is to fine-tune Facebook's DETR (DEtection TRansformer).

Author: Doramas Báez Bernal
Email: [email protected]

Introduction

Unlike traditional computer vision techniques, DETR approaches object detection as a direct set prediction problem. It consists of a set-based global loss, which forces unique predictions via bipartite matching, and a Transformer encoder-decoder architecture. Given a small, fixed set of learned object queries, DETR reasons about the relations between the objects and the global image context to output the final set of predictions directly and in parallel. Because all predictions are produced in parallel, DETR is fast and efficient (paper).

Requirements

The main dependencies of the project are:

  • torch>=1.5.0
  • torchvision>=0.6.0
  • pycocotools
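
The version constraints above can be checked before opening the notebook. The following is a minimal sketch, not part of the repository: the helper names (`version_tuple`, `meets_minimum`, `check_requirements`) are hypothetical, and the version parsing is deliberately simple (it keeps only the leading numeric components and ignores suffixes such as `+cu117`).

```python
from importlib import metadata

# Minimum versions copied from the requirements list above.
# pycocotools has no stated minimum, so None means "any version".
REQUIREMENTS = {"torch": "1.5.0", "torchvision": "0.6.0", "pycocotools": None}


def version_tuple(version):
    """Turn '1.13.1+cu117' into (1, 13, 1), padding to at least three parts."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev2023'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when `installed` is at least `minimum` (or no minimum is set)."""
    if minimum is None:
        return True
    return version_tuple(installed) >= version_tuple(minimum)


def check_requirements(requirements=REQUIREMENTS):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

Running the script prints one line per dependency, so a missing or outdated package is visible before any training starts.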

Also, it is necessary to download the following directories:

Therefore, the project must have the following structure:

path/to/DERT-finetune/
├ dert.ipynb            # DETR notebook
├ train_custom_coco/    # folder containing the dataset for fine-tuning
│   ├ annotations/      # annotation JSON files
│   ├ image_test/       # images for testing after fine-tuning
│   ├ train2017/        # training images
│   └ val2017/          # validation images
├ outputs/
│   └ checkpoint.pth    # checkpoint of the model
└ data/
    ├ dert_finetune/    # DETR to fine-tune on a dataset
    └ images/           # images for the README
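
The layout above can be verified with a short script. This is a sketch under assumptions: the function name `missing_entries` is hypothetical, and `outputs/checkpoint.pth` is left out of the check on the assumption that it may not exist before a first training run.

```python
from pathlib import Path

# Relative paths copied from the tree above; entries ending in "/" are directories.
EXPECTED = [
    "dert.ipynb",
    "train_custom_coco/annotations/",
    "train_custom_coco/image_test/",
    "train_custom_coco/train2017/",
    "train_custom_coco/val2017/",
    "outputs/",
    "data/dert_finetune/",
    "data/images/",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected files and folders that are absent under `root`."""
    root = Path(root)
    missing = []
    for entry in expected:
        path = root / entry
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing


if __name__ == "__main__":
    problems = missing_entries(".")
    print("Layout OK" if not problems else f"Missing: {problems}")
```

Run it from the repository root; an empty result means every folder in the tree is in place.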

Detection Transformer (DETR)

General information (DETR)

DETR directly predicts (in parallel) the final set of detections by combining a common CNN with a transformer architecture. During training, bipartite matching uniquely assigns predictions to ground-truth boxes. A prediction with no match should yield a “no object” (∅) class prediction. To achieve this, the authors adopt an encoder-decoder architecture based on transformers, a popular architecture for sequence prediction. Thanks to self-attention, the model can predict all objects at once, and it is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects.
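
The matching step can be illustrated with a small, self-contained sketch. It is not the DETR implementation: it uses only the L1 distance between boxes as the matching cost and finds the best assignment by brute force, whereas DETR combines class probability, L1 and generalized IoU terms and solves the assignment with the Hungarian algorithm. The function names are hypothetical, and brute force is only practical for a handful of boxes.

```python
from itertools import permutations


def l1_box_cost(pred_box, gt_box):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_predictions(pred_boxes, gt_boxes):
    """Return {prediction index: ground-truth index} with minimal total cost.

    Every ground-truth box is matched to exactly one prediction, so the
    number of predictions must be at least the number of ground-truth boxes.
    """
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    if n_gt > n_pred:
        raise ValueError("need at least as many predictions as ground-truth boxes")
    best_cost, best_choice = float("inf"), ()
    # chosen[g] is the prediction assigned to ground-truth box g.
    for chosen in permutations(range(n_pred), n_gt):
        cost = sum(l1_box_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(chosen))
        if cost < best_cost:
            best_cost, best_choice = cost, chosen
    return {p: g for g, p in enumerate(best_choice)}


def assign_targets(pred_boxes, gt_boxes, gt_labels, no_object="no object"):
    """Give each prediction its matched label, or `no_object` if unmatched."""
    matches = match_predictions(pred_boxes, gt_boxes)
    return [
        gt_labels[matches[i]] if i in matches else no_object
        for i in range(len(pred_boxes))
    ]


if __name__ == "__main__":
    preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.9, 0.9, 0.1, 0.1)]
    gts = [(0.12, 0.1, 0.1, 0.1), (0.5, 0.52, 0.2, 0.2)]
    print(assign_targets(preds, gts, ["stop sign", "fire hydrant"]))
```

Each ground-truth box is claimed by exactly one prediction, and the remaining prediction receives the “no object” target, which is what forces the model to avoid duplicate detections.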

The architecture in more detail:

[Figure: DETR architecture]

As the figure above shows, DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model then flattens this representation and supplements it with a positional encoding; the result is the input to the transformer encoder. A transformer decoder takes as input a small, fixed number of learned positional embeddings, called object queries, and additionally attends to the encoder output. Finally, each output embedding is passed to a shared feed-forward network (FFN) that predicts either a detection (class and bounding box) or a “no object” class.
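
To make the data flow concrete, the sketch below traces tensor shapes through the stages described above without running a model. The default values are assumptions taken from the DETR paper's ResNet-50 configuration, not from this repository: a backbone stride of 32 with 2048 output channels, a projection to a hidden size of 256 before the encoder, and 100 object queries. The function name `detr_shapes` is hypothetical.

```python
import math


def detr_shapes(height, width, stride=32, backbone_channels=2048,
                hidden_dim=256, num_queries=100, num_classes=3):
    """Return the (unbatched) tensor shape at each stage of the pipeline."""
    # The CNN backbone downsamples the image by `stride` in each dimension.
    h, w = math.ceil(height / stride), math.ceil(width / stride)
    return {
        "image": (3, height, width),
        "backbone_features": (backbone_channels, h, w),
        # Assumed channel projection down to the transformer width.
        "projected_features": (hidden_dim, h, w),
        # Flattening turns the 2D map into a sequence of h * w tokens.
        "encoder_input": (h * w, hidden_dim),
        "positional_encoding": (h * w, hidden_dim),
        "encoder_output": (h * w, hidden_dim),
        # One learned embedding per object query.
        "object_queries": (num_queries, hidden_dim),
        "decoder_output": (num_queries, hidden_dim),
        # The shared FFN predicts num_classes + 1 scores ("no object" included)
        # and 4 box coordinates for every query.
        "class_logits": (num_queries, num_classes + 1),
        "boxes": (num_queries, 4),
    }


if __name__ == "__main__":
    for stage, shape in detr_shapes(800, 1216).items():
        print(f"{stage:20s} {shape}")
```

The number of predictions depends only on the number of object queries, not on the image size, which is why the output is a fixed-size set.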

Fine-tuning

A dataset has been prepared for fine-tuning. It contains approximately 900 images taken from a larger dataset, COCO. The subset consists of 3 classes:

  • fire hydrant
  • parking meter
  • stop sign
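
A subset like this one can be extracted from a COCO annotation file by filtering on category name. The following sketch assumes the standard COCO detection format (top-level `images`, `annotations` and `categories` lists); it does not claim to be how the dataset in this repository was built, and the function names and file paths are hypothetical.

```python
import json

CLASS_NAMES = ("fire hydrant", "parking meter", "stop sign")


def filter_coco(coco, class_names=CLASS_NAMES):
    """Keep only the given categories, their annotations, and the images that use them."""
    categories = [c for c in coco["categories"] if c["name"] in class_names]
    keep_ids = {c["id"] for c in categories}
    annotations = [a for a in coco["annotations"] if a["category_id"] in keep_ids]
    image_ids = {a["image_id"] for a in annotations}
    images = [img for img in coco["images"] if img["id"] in image_ids]
    return {"images": images, "annotations": annotations, "categories": categories}


def filter_coco_file(src_path, dst_path, class_names=CLASS_NAMES):
    """Read a COCO annotation file, filter it, and write the result."""
    with open(src_path) as src:
        subset = filter_coco(json.load(src), class_names)
    with open(dst_path, "w") as dst:
        json.dump(subset, dst)
    return subset
```

Note that this keeps the original COCO category ids; a training script may additionally need them remapped to a contiguous range, depending on how it builds its class head.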

Example of the images used:

[Figure: two example images from the dataset]

Results

The following results were obtained after fine-tuning the model weights for 30 epochs:

[Figure: two example results after fine-tuning]
