# Hierarchical Transformer in AHIP

## Overview

The Hierarchical Transformer is a core component of our Adaptive Hierarchical Image Parser (AHIP), designed to process multi-scale visual features and construct a hierarchical representation of the image content. It builds upon the success of transformer architectures in capturing long-range dependencies while incorporating novel elements to handle the multi-scale nature of visual information.

## Architecture

1. Input: Multi-scale feature maps from the CNN backbone (typically 3 scales: fine, medium, coarse)
2. Scale-specific Transformers:
   - Each scale has its own transformer encoder
   - Architecture: Similar to the original transformer, but with 2D positional encodings
   - Number of layers: 6 per scale
   - Hidden dimension: 512
   - Number of heads: 8
3. Cross-scale Attention Mechanism:
   - Allows information flow between different scales
   - Implemented after every two layers of scale-specific processing
4. Hierarchical Pooling:
   - Progressively pools information from finer to coarser scales
   - Uses learnable pooling kernels
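
The hyperparameters above can be gathered into a single configuration object. A minimal sketch, assuming a PyTorch-style implementation; the class and field names are illustrative, not AHIP's actual code:

```python
from dataclasses import dataclass

@dataclass
class HierarchicalTransformerConfig:
    """Hypothetical container for the hyperparameters listed above."""
    num_scales: int = 3         # fine, medium, coarse feature maps from the CNN backbone
    num_layers: int = 6         # transformer encoder layers per scale
    hidden_dim: int = 512       # hidden dimension
    num_heads: int = 8          # attention heads
    cross_scale_every: int = 2  # cross-scale attention after every two scale-specific layers
```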

## Detailed Components

### Scale-specific Transformer

For each scale s:

1. Input: Feature map X_s of shape (H_s, W_s, C)
2. Add 2D sinusoidal positional encodings
3. Reshape to a sequence of shape (H_s * W_s, C)
4. Process through 6 transformer encoder layers, each consisting of:
   - Multi-head self-attention
   - Feed-forward network
   - Layer normalization and residual connections
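
A minimal PyTorch sketch of these steps; the module and helper names are illustrative, and details such as the feed-forward width are assumptions rather than AHIP's exact implementation:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_2d_pos_encoding(h, w, dim):
    """Standard 2D sinusoidal encoding: half the channels encode the row
    index, the other half the column index."""
    assert dim % 4 == 0
    pe = torch.zeros(h, w, dim)
    d = dim // 2
    div = torch.exp(torch.arange(0, d, 2) * (-math.log(10000.0) / d))
    ys = torch.arange(h).unsqueeze(1) * div             # (h, d/2)
    xs = torch.arange(w).unsqueeze(1) * div             # (w, d/2)
    pe[:, :, 0:d:2] = torch.sin(ys).unsqueeze(1).expand(h, w, -1)
    pe[:, :, 1:d:2] = torch.cos(ys).unsqueeze(1).expand(h, w, -1)
    pe[:, :, d::2] = torch.sin(xs).unsqueeze(0).expand(h, w, -1)
    pe[:, :, d + 1::2] = torch.cos(xs).unsqueeze(0).expand(h, w, -1)
    return pe

class ScaleSpecificTransformer(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                                # x: (B, H_s, W_s, C)
        b, h, w, c = x.shape
        x = x + sinusoidal_2d_pos_encoding(h, w, c).to(x.device)  # step 2
        x = x.reshape(b, h * w, c)                       # step 3: flatten to a token sequence
        return self.encoder(x)                           # step 4: 6 encoder layers
```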

### Cross-scale Attention

Between scales s_i and s_j:

  1. Query: From scale s_i
  2. Key and Value: From scale s_j
  3. Attention weights: softmax(Q * K^T / sqrt(d_k))
  4. Output: Weighted sum of values
  5. Aggregate: Concatenate outputs from all scales and project
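
A sketch of this mechanism for one querying scale, built on PyTorch's multi-head attention; the class name, the attention module shared across scales, and the projection shape are assumptions rather than AHIP's exact design:

```python
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    """Cross-attention from scale s_i (queries) to every scale s_j (keys/values),
    followed by concatenation over scales and a linear projection."""

    def __init__(self, dim=512, num_heads=8, num_scales=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(num_scales * dim, dim)

    def forward(self, query_tokens, all_scale_tokens):
        # query_tokens:     (B, N_i, C) tokens of scale s_i
        # all_scale_tokens: list of (B, N_j, C) token sequences, one per scale s_j
        outputs = []
        for kv in all_scale_tokens:
            # softmax(Q K^T / sqrt(d_k)) V, computed internally by MultiheadAttention
            out, _ = self.attn(query_tokens, kv, kv)
            outputs.append(out)
        # Aggregate: concatenate outputs from all scales and project back to dim.
        return self.proj(torch.cat(outputs, dim=-1))
```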

### Hierarchical Pooling

To pool from scale s to s+1:

  1. Apply a learnable 2D convolution with stride 2
  2. Aggregate: Weighted sum of the pooled features and the original features at scale s+1
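
A sketch of this pooling step; the fixed 2x2 kernel size and the scalar fusion weight are assumptions, since the document only specifies a learnable kernel with stride 2 and a weighted sum:

```python
import torch
import torch.nn as nn

class HierarchicalPooling(nn.Module):
    """Pools features from scale s to scale s+1 with a learnable strided
    convolution, then fuses them with the existing features at scale s+1."""

    def __init__(self, channels=512):
        super().__init__()
        # Learnable pooling kernel: stride-2 convolution halves H and W.
        self.pool = nn.Conv2d(channels, channels, kernel_size=2, stride=2)
        # Learnable scalar weight for the weighted sum (an illustrative choice).
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, fine, coarse):
        # fine:   (B, C, H, W)       features at scale s
        # coarse: (B, C, H/2, W/2)   features at scale s+1
        pooled = self.pool(fine)
        return self.alpha * pooled + (1 - self.alpha) * coarse
```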

## Training Objectives

1. Primary: Cross-entropy loss for object classification and segmentation
2. Auxiliary:
   - Reconstruction loss: Decode features back to image space and compare with the input image
   - Consistency loss: Encourage agreement between representations at different scales
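
A sketch of how these terms might be combined into a single training loss. The loss weights, the L2 form of the auxiliary terms, and the function signature are assumptions; the document does not specify them:

```python
import torch.nn.functional as F

def total_loss(class_logits, class_targets, seg_logits, seg_targets,
               reconstructed, images, fine_feats, coarse_feats,
               w_recon=0.1, w_consist=0.1):
    # Primary objective: cross-entropy for classification and segmentation.
    loss = F.cross_entropy(class_logits, class_targets)
    loss = loss + F.cross_entropy(seg_logits, seg_targets)
    # Auxiliary reconstruction loss: decoded features vs. the input image.
    loss = loss + w_recon * F.mse_loss(reconstructed, images)
    # Auxiliary consistency loss: pooled fine-scale features should agree
    # with the coarse-scale features (one possible formulation).
    pooled = F.adaptive_avg_pool2d(fine_feats, coarse_feats.shape[-2:])
    loss = loss + w_consist * F.mse_loss(pooled, coarse_feats)
    return loss
```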

## Innovations

1. Scale-adaptive Attention:
   - Attention weights are modulated by scale difference (see the sketch after this list)
   - Allows the model to focus on relevant scales for each task
2. Hierarchical Position Encoding:
   - Encodes both absolute position and scale information
   - Helps maintain spatial relationships across scales
3. Dynamic Scaling:
   - Number of scales can be adjusted at inference time
   - Allows for computational efficiency on different devices
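
One way scale-adaptive attention could be realized is by adding a learnable bias to the attention logits, indexed by the scale difference between the query and key tokens. The following is only a sketch under that assumption, not AHIP's exact formulation:

```python
import math
import torch
import torch.nn as nn

class ScaleAdaptiveAttention(nn.Module):
    """Single-head attention whose logits are shifted by a learnable bias
    indexed by |s_q - s_k| (an assumed form of the scale modulation)."""

    def __init__(self, dim=512, num_scales=3):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # One bias per possible scale difference: 0, 1, ..., num_scales - 1.
        self.scale_bias = nn.Parameter(torch.zeros(num_scales))
        self.dim = dim

    def forward(self, x_q, s_q, x_k, s_k):
        # x_q: (B, N_q, C) tokens from scale index s_q; x_k: (B, N_k, C) from s_k.
        q, k, v = self.q(x_q), self.k(x_k), self.v(x_k)
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.dim)
        logits = logits + self.scale_bias[abs(s_q - s_k)]  # modulate by scale difference
        return torch.softmax(logits, dim=-1) @ v
```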

## Benefits

  1. Multi-scale Processing: Captures both fine-grained details and global context
  2. Hierarchical Understanding: Naturally builds a hierarchical representation of image content
  3. Flexibility: Can handle varying image sizes and aspect ratios
  4. Efficiency: Parallel processing of different scales

## Challenges and Future Work

  1. Computational Complexity: Scaling to very high-resolution images
  2. Interpretability: Understanding cross-scale attention patterns
  3. Dynamic Architectures: Adapting the architecture based on image content

The Hierarchical Transformer underpins our system's ability to understand images at multiple levels of abstraction, which is crucial for building a rich, hierarchical representation of visual content.