DaiShiResearch's picture
Update README.md
a823ff7 verified
|
raw
history blame
14.6 kB
metadata
license: apache-2.0
datasets:
  - imagenet-1k
metrics:
  - accuracy
pipeline_tag: image-classification
tags:
  - vision
language:
  - en
library_name: pytorch

TransNeXt

Official Model release for "TransNeXt: Robust Foveal Visual Perception for Vision Transformers" [CVPR 2024] .

Model Details

Methods

Pixel-focused attention (Left) & aggregated attention (Right):

pixel-focused_attention

Convolutional GLU (First on the right):

Convolutional GLU

Results

Image Classification, Detection and Segmentation:

experiment_figure

Attention Visualization:

foveal_peripheral_vision

Model Zoo

Image Classification

Classification code & weights & configs & training logs are >>>here<<<.

ImageNet-1K 224x224 pre-trained models:

Model #Params #FLOPs IN-1K IN-A IN-C↓ IN-R Sketch IN-V2 Download Config Log
TransNeXt-Micro 12.8M 2.7G 82.5 29.9 50.8 45.8 33.0 72.6 model config log
TransNeXt-Tiny 28.2M 5.7G 84.0 39.9 46.5 49.6 37.6 73.8 model config log
TransNeXt-Small 49.7M 10.3G 84.7 47.1 43.9 52.5 39.7 74.8 model config log
TransNeXt-Base 89.7M 18.4G 84.8 50.6 43.5 53.9 41.4 75.1 model config log

ImageNet-1K 384x384 fine-tuned models:

Model #Params #FLOPs IN-1K IN-A IN-R Sketch IN-V2 Download Config
TransNeXt-Small 49.7M 32.1G 86.0 58.3 56.4 43.2 76.8 model config
TransNeXt-Base 89.7M 56.3G 86.2 61.6 57.7 44.7 77.0 model config

ImageNet-1K 256x256 pre-trained model fully utilizing aggregated attention at all stages:

(See Table.9 in Appendix D.6 for details)

Model Token mixer #Params #FLOPs IN-1K Download Config Log
TransNeXt-Micro A-A-A-A 13.1M 3.3G 82.6 model config log

Object Detection

Object detection code & weights & configs & training logs are >>>here<<<.

COCO object detection and instance segmentation results using the Mask R-CNN method:

Backbone Pretrained Model Lr Schd box mAP mask mAP #Params Download Config Log
TransNeXt-Tiny ImageNet-1K 1x 49.9 44.6 47.9M model config log
TransNeXt-Small ImageNet-1K 1x 51.1 45.5 69.3M model config log
TransNeXt-Base ImageNet-1K 1x 51.7 45.9 109.2M model config log

COCO object detection results using the DINO method:

Backbone Pretrained Model scales epochs box mAP #Params Download Config Log
TransNeXt-Tiny ImageNet-1K 4scale 12 55.1 47.8M model config log
TransNeXt-Tiny ImageNet-1K 5scale 12 55.7 48.1M model config log
TransNeXt-Small ImageNet-1K 5scale 12 56.6 69.6M model config log
TransNeXt-Base ImageNet-1K 5scale 12 57.1 110M model config log

Semantic Segmentation

Semantic segmentation code & weights & configs & training logs are >>>here<<<.

ADE20K semantic segmentation results using the UPerNet method:

Backbone Pretrained Model Crop Size Lr Schd mIoU mIoU (ms+flip) #Params Download Config Log
TransNeXt-Tiny ImageNet-1K 512x512 160K 51.1 51.5/51.7 59M model config log
TransNeXt-Small ImageNet-1K 512x512 160K 52.2 52.5/52.8 80M model config log
TransNeXt-Base ImageNet-1K 512x512 160K 53.0 53.5/53.7 121M model config log
  • In the context of multi-scale evaluation, TransNeXt reports test results under two distinct scenarios: interpolation and extrapolation of relative position bias.

ADE20K semantic segmentation results using the Mask2Former method:

Backbone Pretrained Model Crop Size Lr Schd mIoU #Params Download Config Log
TransNeXt-Tiny ImageNet-1K 512x512 160K 53.4 47.5M model config log
TransNeXt-Small ImageNet-1K 512x512 160K 54.1 69.0M model config log
TransNeXt-Base ImageNet-1K 512x512 160K 54.7 109M model config log

Citation

If you find our work helpful, please consider citing the following bibtex. We would greatly appreciate a star for this project.

@InProceedings{shi2023transnext,
  author    = {Dai Shi},
  title     = {TransNeXt: Robust Foveal Visual Perception for Vision Transformers},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {17773-17783}
}