Edit model card

TransNeXt

Official Model release for "TransNeXt: Robust Foveal Visual Perception for Vision Transformers" [CVPR 2024] .

Model Details

Methods

Pixel-focused attention (Left) & aggregated attention (Right):

pixel-focused_attention

Convolutional GLU (First on the right):

Convolutional GLU

Results

Image Classification, Detection and Segmentation:

experiment_figure

Attention Visualization:

foveal_peripheral_vision

Model Zoo

Image Classification

Classification code & weights & configs & training logs are >>>here<<<.

ImageNet-1K 224x224 pre-trained models:

Model #Params #FLOPs IN-1K IN-A IN-C↓ IN-R Sketch IN-V2 Download Config Log
TransNeXt-Micro 12.8M 2.7G 82.5 29.9 50.8 45.8 33.0 72.6 model config log
TransNeXt-Tiny 28.2M 5.7G 84.0 39.9 46.5 49.6 37.6 73.8 model config log
TransNeXt-Small 49.7M 10.3G 84.7 47.1 43.9 52.5 39.7 74.8 model config log
TransNeXt-Base 89.7M 18.4G 84.8 50.6 43.5 53.9 41.4 75.1 model config log

ImageNet-1K 384x384 fine-tuned models:

Model #Params #FLOPs IN-1K IN-A IN-R Sketch IN-V2 Download Config
TransNeXt-Small 49.7M 32.1G 86.0 58.3 56.4 43.2 76.8 model config
TransNeXt-Base 89.7M 56.3G 86.2 61.6 57.7 44.7 77.0 model config

ImageNet-1K 256x256 pre-trained model fully utilizing aggregated attention at all stages:

(See Table.9 in Appendix D.6 for details)

Model Token mixer #Params #FLOPs IN-1K Download Config Log
TransNeXt-Micro A-A-A-A 13.1M 3.3G 82.6 model config log

Object Detection

Object detection code & weights & configs & training logs are >>>here<<<.

COCO object detection and instance segmentation results using the Mask R-CNN method:

Backbone Pretrained Model Lr Schd box mAP mask mAP #Params Download Config Log
TransNeXt-Tiny ImageNet-1K 1x 49.9 44.6 47.9M model config log
TransNeXt-Small ImageNet-1K 1x 51.1 45.5 69.3M model config log
TransNeXt-Base ImageNet-1K 1x 51.7 45.9 109.2M model config log

COCO object detection results using the DINO method:

Backbone Pretrained Model scales epochs box mAP #Params Download Config Log
TransNeXt-Tiny ImageNet-1K 4scale 12 55.1 47.8M model config log
TransNeXt-Tiny ImageNet-1K 5scale 12 55.7 48.1M model config log
TransNeXt-Small ImageNet-1K 5scale 12 56.6 69.6M model config log
TransNeXt-Base ImageNet-1K 5scale 12 57.1 110M model config log

Semantic Segmentation

Semantic segmentation code & weights & configs & training logs are >>>here<<<.

ADE20K semantic segmentation results using the UPerNet method:

Backbone Pretrained Model Crop Size Lr Schd mIoU mIoU (ms+flip) #Params Download Config Log
TransNeXt-Tiny ImageNet-1K 512x512 160K 51.1 51.5/51.7 59M model config log
TransNeXt-Small ImageNet-1K 512x512 160K 52.2 52.5/51.8 80M model config log
TransNeXt-Base ImageNet-1K 512x512 160K 53.0 53.5/53.7 121M model config log
  • In the context of multi-scale evaluation, TransNeXt reports test results under two distinct scenarios: interpolation and extrapolation of relative position bias.

ADE20K semantic segmentation results using the Mask2Former method:

Backbone Pretrained Model Crop Size Lr Schd mIoU #Params Download Config Log
TransNeXt-Tiny ImageNet-1K 512x512 160K 53.4 47.5M model config log
TransNeXt-Small ImageNet-1K 512x512 160K 54.1 69.0M model config log
TransNeXt-Base ImageNet-1K 512x512 160K 54.7 109M model config log

Citation

If you find our work helpful, please consider citing the following bibtex. We would greatly appreciate a star for this project.

@misc{shi2023transnext,
  author = {Dai Shi},
  title = {TransNeXt: Robust Foveal Visual Perception for Vision Transformers},
  year = {2023},
  eprint = {arXiv:2311.17132},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
Downloads last month
0
Inference API
Drag image file here or click to browse from your device
Inference API (serverless) does not yet support pytorch models for this pipeline type.

Dataset used to train DaiShiResearch/transnext-micro-AAAA-256-1k

Collection including DaiShiResearch/transnext-micro-AAAA-256-1k