DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

Published on Sep 21, 2023
· Featured in Daily Papers on Sep 25, 2023


Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. As a variety of ViT structures have been developed, ViTs have become increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of the inductive biases of locality and translation equivariance demands larger model sizes than CNNs to learn visual features effectively. In this paper, we propose a lightweight and efficient vision transformer model called DualToken-ViT that leverages the advantages of CNNs and ViTs. DualToken-ViT effectively fuses the token with local information obtained by a convolution-based structure and the token with global information obtained by a self-attention-based structure to achieve an efficient attention structure. In addition, we use position-aware global tokens throughout all stages to enrich the global information, which further strengthens the effect of DualToken-ViT. Position-aware global tokens also contain the position information of the image, which makes our model better suited to vision tasks. We conducted extensive experiments on image classification, object detection and semantic segmentation tasks to demonstrate the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T, which also uses global tokens, by 0.7%.
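To make the quadratic-complexity point concrete, here is a rough back-of-the-envelope sketch. The cost formula counts only the two matrix products inside attention (query-key scores and attention-weighted values) and ignores projections. The 56x56 and 7x7 token grids and the channel width of 64 are illustrative assumptions, not figures taken from the abstract.

```python
def attention_cost(num_tokens, dim):
    """Approximate multiply-accumulates for the QK^T and attention-times-V
    products of one self-attention layer (linear projections ignored)."""
    return 2 * num_tokens * num_tokens * dim

# Hypothetical sizes: attention over a full 56x56 token grid versus
# attention over a grid downsampled to 7x7, both with 64 channels.
full = attention_cost(56 * 56, 64)
pooled = attention_cost(7 * 7, 64)
print(full // pooled)  # → 4096
```

Under these assumptions, reducing the token count by a factor of 64 reduces the attention cost by a factor of 64 squared, which is why applying self-attention to downsampled tokens can be much cheaper.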


Here is an ML-generated summary

The objective is to propose an efficient vision transformer model called DualToken-ViT that combines convolution and self-attention to extract local and global information respectively and fuse them, and uses position-aware global tokens to enrich global information.


  • Combining convolution and self-attention extracts both local and global information efficiently in a vision transformer.
  • Step-wise downsampling retains more information compared to one-step downsampling when reducing feature map size.
  • Position-aware global tokens enrich global information and provide useful positional information for vision tasks.
  • Lightweight DualToken-ViT outperforms ConvNets and Transformers of similar complexity on image classification, detection and segmentation.
  • Fusing local and global tokens gives better performance compared to using either local or global tokens alone.

The key steps for implementing DualToken-ViT are:

  1. Use a Conv Encoder consisting of depthwise convolution, layer norm, and pointwise convolution to extract local information from the input image tokens.
  2. Downsample the local tokens step by step with average pooling to reduce their size. Apply self-attention on the downsampled tokens to aggregate global information.
  3. Enrich the global information using position-aware global tokens, which are updated through the image tokens. Fuse the global tokens and downsampled local tokens using weighted summation.
  4. Broadcast the enriched global information to the image tokens using self-attention.
  5. Fuse the local tokens from Conv Encoder and global tokens to obtain dual tokens containing both local and global information.
  6. Apply MLP, layer norm, and position-wise feedforward network on the dual tokens. Add residuals and repeat the blocks.
  7. Pass the position-aware global tokens throughout all stages to continually enrich global information.
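
Steps 2 to 5 above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Conv Encoder output is taken as given, attention uses identity projections instead of learned ones, the fusion weight is a fixed 0.5 rather than a learned value, and the global tokens are assumed to have the same count as the downsampled tokens. All function names are hypothetical.

```python
import math

def avg_pool_2x2(grid):
    """Average non-overlapping 2x2 windows of an H x W grid of C-dim tokens."""
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    return [[[sum(grid[2 * i + di][2 * j + dj][k]
                  for di in (0, 1) for dj in (0, 1)) / 4.0
              for k in range(c)]
             for j in range(w // 2)]
            for i in range(h // 2)]

def flatten(grid):
    """Turn an H x W grid of tokens into a flat list of H*W tokens."""
    return [tok for row in grid for tok in row]

def attention(queries, keys, values):
    """Scaled dot-product attention with identity projections (assumption)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(wt * v[c] for wt, v in zip(weights, values)) / total
                    for c in range(len(values[0]))])
    return out

def weighted_sum(a, b, alpha=0.5):
    """Fuse two equally sized token lists; alpha is a fixed stand-in weight."""
    return [[alpha * x + (1 - alpha) * y for x, y in zip(ta, tb)]
            for ta, tb in zip(a, b)]

def dual_token_sketch(local_grid, global_tokens, pool_steps=1):
    # Step 2: step-wise average pooling, then self-attention on pooled tokens.
    pooled = local_grid
    for _ in range(pool_steps):
        pooled = avg_pool_2x2(pooled)
    pooled_tokens = flatten(pooled)
    aggregated = attention(pooled_tokens, pooled_tokens, pooled_tokens)
    # Step 3: fuse the carried-over global tokens with the aggregated tokens.
    global_tokens = weighted_sum(global_tokens, aggregated)
    # Step 4: broadcast global information back to every image token.
    local_tokens = flatten(local_grid)
    broadcast = attention(local_tokens, global_tokens, global_tokens)
    # Step 5: fuse local and global tokens into dual tokens.
    dual = weighted_sum(local_tokens, broadcast)
    return dual, global_tokens

# Usage: a 4x4 grid of 2-dim local tokens and four zero-initialized global tokens.
grid = [[[float(i + j), float(i - j)] for j in range(4)] for i in range(4)]
global_tokens = [[0.0, 0.0] for _ in range(4)]
dual, global_tokens = dual_token_sketch(grid, global_tokens)
print(len(dual), len(global_tokens))  # → 16 4
```

The function returns the updated global tokens alongside the dual tokens so that they can be passed to the next block, mirroring step 7.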

DualToken-ViT achieves state-of-the-art accuracy among vision models of similar complexity on image classification on ImageNet-1K, and strong performance on object detection and semantic segmentation.

