The Hierarchical Transformer is a core component of our Adaptive Hierarchical Image Parser (AHIP), designed to process multi-scale visual features and construct a hierarchical representation of the image content. It builds upon the success of transformer architectures in capturing long-range dependencies while incorporating novel elements to handle the multi-scale nature of visual information.
Input: Multi-scale feature maps from the CNN backbone (typically 3 scales: fine, medium, coarse)
Scale-specific Transformers:
- Each scale has its own transformer encoder
- Architecture: Similar to the original transformer, but with 2D positional encodings
- Number of layers: 6 per scale
- Hidden dimension: 512
- Number of heads: 8
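A minimal PyTorch sketch of one such encoder with the stated hyper-parameters is shown below; the function name is illustrative, and the 2D positional encodings are added to the input before this stack (see the per-scale processing steps later in this section).

```python
import torch.nn as nn

# One scale-specific encoder stack: 6 layers, hidden dim 512, 8 heads.
# A sketch under the stated hyper-parameters, not the actual AHIP code.
def make_scale_encoder(d_model=512, nhead=8, num_layers=6):
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)
```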
Cross-scale Attention Mechanism:
- Allows information flow between different scales
- Implemented after every two layers of scale-specific processing
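One way to realize this schedule is sketched below; the module layout and names (run_interleaved, cross_blocks) are assumptions rather than the actual implementation.

```python
import torch

# Interleaving schedule: after every two scale-specific encoder layers,
# one cross-scale attention step lets each scale attend to all scales.
def run_interleaved(feats, scale_layers, cross_blocks):
    # feats: list of (B, L_s, C) token sequences, one per scale
    # scale_layers: [num_layers][num_scales] encoder layers
    # cross_blocks: one nn.MultiheadAttention per pair of encoder layers
    for i, per_scale in enumerate(scale_layers):
        feats = [layer(x) for layer, x in zip(per_scale, feats)]
        if i % 2 == 1:                    # after every two layers
            kv = torch.cat(feats, dim=1)  # tokens from every scale
            attn = cross_blocks[i // 2]
            feats = [x + attn(x, kv, kv)[0] for x in feats]  # residual update
    return feats
```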
Hierarchical Pooling:
- Progressively pools information from finer to coarser scales
- Uses learnable pooling kernels
Scale-specific processing (for each scale s):
- Input: Feature map X_s of shape (H_s, W_s, C)
- Add 2D sinusoidal positional encodings
- Reshape to sequence: (H_s * W_s, C)
- Process through 6 transformer encoder layers:
- Multi-head self-attention
- Feed-forward network
- Layer normalization and residual connections
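These steps can be made concrete with a short sketch. The 2D encoding variant below (half the channels encode the row index, half the column index) is a common construction and an assumption here, as the section does not pin down the exact form.

```python
import math
import torch

# 2D sinusoidal positional encoding: rows use the first half of the
# channels, columns the second half, each with the 1D sinusoidal scheme.
def pos_enc_2d(h, w, c):
    assert c % 4 == 0
    pe = torch.zeros(h, w, c)
    half = c // 2
    div = torch.exp(torch.arange(0, half, 2) * (-math.log(10000.0) / half))
    ys = torch.arange(h).unsqueeze(1) * div  # (h, c/4)
    xs = torch.arange(w).unsqueeze(1) * div  # (w, c/4)
    pe[..., 0:half:2] = torch.sin(ys).unsqueeze(1).expand(h, w, -1)
    pe[..., 1:half:2] = torch.cos(ys).unsqueeze(1).expand(h, w, -1)
    pe[..., half::2] = torch.sin(xs).unsqueeze(0).expand(h, w, -1)
    pe[..., half + 1::2] = torch.cos(xs).unsqueeze(0).expand(h, w, -1)
    return pe

# Per-scale processing: add encodings, flatten to a sequence, encode.
def process_scale(x, encoder):
    # x: (B, H_s, W_s, C); encoder: e.g. make_scale_encoder() from above
    b, h, w, c = x.shape
    x = x + pos_enc_2d(h, w, c)   # broadcast over the batch dimension
    seq = x.reshape(b, h * w, c)  # (B, H_s * W_s, C)
    return encoder(seq)
```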
Cross-scale attention (between scales s_i and s_j):
- Query: From scale s_i
- Key and Value: From scale s_j
- Attention weights: softmax(Q * K^T / sqrt(d_k))
- Output: Weighted sum of values
- Aggregate: Concatenate outputs from all scales and project
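A single-head sketch following this formula directly is given below; the projection layout and the assumption that every scale (including s_i itself) contributes to the concatenation are illustrative choices, and the real module would use 8 heads as stated above.

```python
import math
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    """Single-head cross-scale attention sketch (illustrative names)."""
    def __init__(self, dim=512, num_scales=3):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim * num_scales, dim)

    def attend(self, x_i, x_j):
        # Queries from scale s_i; keys and values from scale s_j.
        q, k, v = self.q(x_i), self.k(x_j), self.v(x_j)
        w = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)),
                          dim=-1)          # softmax(Q K^T / sqrt(d_k))
        return w @ v                       # weighted sum of values

    def forward(self, x_i, all_scales):
        # Attend from s_i to every scale, then concatenate and project.
        outs = [self.attend(x_i, x_j) for x_j in all_scales]
        return self.proj(torch.cat(outs, dim=-1))
```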
Hierarchical pooling (from scale s to s+1):
- Apply learnable 2D convolution with stride 2
- Aggregate: Weighted sum of pooled features and original features at scale s+1
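A sketch of one pooling step is below; the 3x3 kernel and the single scalar mixing weight are assumptions, since the section only specifies a learnable stride-2 convolution and a weighted sum.

```python
import torch
import torch.nn as nn

class HierarchicalPool(nn.Module):
    """Pool from scale s to s+1 with a learnable kernel (sketch)."""
    def __init__(self, channels=512):
        super().__init__()
        # Learnable pooling kernel: strided 2D convolution (halves H and W).
        self.pool = nn.Conv2d(channels, channels, kernel_size=3,
                              stride=2, padding=1)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable mixing weight

    def forward(self, x_s, x_next):
        # x_s: (B, C, H_s, W_s); x_next: (B, C, H_s//2, W_s//2)
        pooled = self.pool(x_s)
        return self.alpha * pooled + (1 - self.alpha) * x_next
```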
Loss Functions:
- Primary: Cross-entropy loss for object classification and segmentation
- Auxiliary:
  - Reconstruction loss: Decode features back to image space and penalize deviation from the input image
  - Consistency loss: Penalize disagreement between the scales' representations of the same image regions
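A combined objective might look like the sketch below; the loss weights and the use of MSE for the auxiliary terms are assumptions, as the section does not specify them.

```python
import torch.nn.functional as F

# Combined objective (sketch): lambda_rec and lambda_cons are assumed values.
def ahip_loss(logits, targets, recon, images, feats_fine, feats_coarse,
              lambda_rec=0.1, lambda_cons=0.1):
    ce = F.cross_entropy(logits, targets)  # classification / segmentation
    rec = F.mse_loss(recon, images)        # reconstruction back to image space
    # Consistency: coarse features should agree with downsampled fine features.
    fine_down = F.adaptive_avg_pool2d(feats_fine, feats_coarse.shape[-2:])
    cons = F.mse_loss(fine_down, feats_coarse)
    return ce + lambda_rec * rec + lambda_cons * cons
```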
Scale-adaptive Attention:
- Attention weights are modulated by scale difference
- Allows the model to focus on relevant scales for each task
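One plausible form of this modulation, a learnable bias indexed by the scale gap and added to the attention logits, is sketched below; the exact mechanism is an assumption.

```python
import math
import torch

# Scale-adaptive attention (sketch): a learnable bias b[|s_i - s_j|] is
# added to the logits so larger scale gaps can be up- or down-weighted.
def scale_adaptive_attention(q, k, v, s_i, s_j, scale_bias):
    # q: (B, Lq, d); k, v: (B, Lk, d); scale_bias: (num_scales,) parameter
    logits = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    logits = logits + scale_bias[abs(s_i - s_j)]
    return torch.softmax(logits, dim=-1) @ v
```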
Hierarchical Position Encoding:
- Encodes both absolute position and scale information
- Helps maintain spatial relationships across scales
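A simple way to combine the two, reusing pos_enc_2d from the sketch above and adding a learned per-scale embedding, is shown below; this composition is an assumption.

```python
import torch
import torch.nn as nn

class HierarchicalPosEnc(nn.Module):
    """Absolute 2D position plus a learned scale embedding (sketch)."""
    def __init__(self, dim=512, num_scales=3):
        super().__init__()
        self.scale_embed = nn.Embedding(num_scales, dim)

    def forward(self, x, scale_idx):
        # x: (B, H, W, C) feature map at scale `scale_idx`
        b, h, w, c = x.shape
        pe = pos_enc_2d(h, w, c)                        # absolute position
        se = self.scale_embed(torch.tensor(scale_idx))  # scale information
        return x + pe + se
```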
Dynamic Scaling:
- Number of scales can be adjusted at inference time
- Allows for computational efficiency on different devices
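A minimal sketch of this behavior follows; which scales to drop is a deployment choice, and dropping the finest scales first (they carry the most tokens) is an assumption.

```python
# Dynamic scaling (sketch): process only `num_active` of the coarsest scales.
def forward_dynamic(feature_maps, scale_encoders, num_active):
    # feature_maps ordered fine -> coarse; keep the last `num_active` scales
    keep = slice(len(feature_maps) - num_active, None)
    return [enc(x)
            for enc, x in zip(scale_encoders[keep], feature_maps[keep])]
```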
Advantages:
- Multi-scale Processing: Captures both fine-grained details and global context
- Hierarchical Understanding: Naturally builds a hierarchical representation of image content
- Flexibility: Can handle varying image sizes and aspect ratios
- Efficiency: Parallel processing of different scales
Challenges and Future Directions:
- Computational Complexity: Scaling to very high-resolution images
- Interpretability: Understanding cross-scale attention patterns
- Dynamic Architectures: Adapting the architecture based on image content
The Hierarchical Transformer forms the backbone of our system's ability to understand images at multiple levels of abstraction, crucial for building a rich, hierarchical representation of visual content.