---
license: other
datasets:
- imagenet-1k
---

[**FasterViT: Fast Vision Transformers with Hierarchical Attention**](https://arxiv.org/abs/2306.06189)

FasterViT achieves a new SOTA Pareto front in terms of accuracy vs. image throughput, without extra training data!

Note: Please use the [**latest NVIDIA TensorRT release**](https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html) to enjoy the benefits of optimized FasterViT ops.

## Quick Start

We can import pre-trained FasterViT models with **1 line of code**. First, FasterViT can be installed by:

```bash
pip install fastervit
```

A pretrained FasterViT model with default hyper-parameters can be created as follows:

```python
>>> from fastervit import create_model

# Define fastervit-0 model with 224 x 224 resolution
>>> model = create_model('faster_vit_0_224',
                          pretrained=True,
                          model_path="/tmp/faster_vit_0.pth.tar")
```

`model_path` sets the directory to which the pretrained weights are downloaded. We can test the model by passing a dummy input image; the output is the logits:

```python
>>> import torch

>>> image = torch.rand(1, 3, 224, 224)
>>> output = model(image) # torch.Size([1, 1000])
```

We can also use the any-resolution FasterViT model to accommodate arbitrary image resolutions. In the following, we define an any-resolution FasterViT-0 model with an input resolution of 576 x 960, window sizes of 12 and 6 in the 3rd and 4th stages, a carrier token size of 2, and an embedding dimension of 64:

```python
>>> from fastervit import create_model

# Define any-resolution FasterViT-0 model with 576 x 960 resolution
>>> model = create_model('faster_vit_0_any_res',
                          resolution=[576, 960],
                          window_size=[7, 7, 12, 6],
                          ct_size=2,
                          dim=64,
                          pretrained=True)
```

Note that the above model is initialized from the original ImageNet-pretrained FasterViT with its original resolution of 224 x 224. As a result, missing keys and mismatches are to be expected, since we are adding new layers (e.g. new carrier tokens). We can test the model by passing a dummy input image; the output is the logits:

```python
>>> import torch

>>> image = torch.rand(1, 3, 576, 960)
>>> output = model(image) # torch.Size([1, 1000])
```
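Beyond dummy tensors, the sketch below runs inference on an actual photo. It assumes the standard ImageNet evaluation transform (resize the short side to 256, then center-crop to 224, i.e. `crop_pct = 224/256 = 0.875`); the image path is a placeholder, and the exact preprocessing used by the authors should be verified against the repository:

```python
import torch
from PIL import Image
from torchvision import transforms
from fastervit import create_model

# Assumed standard ImageNet eval preprocessing (crop_pct = 0.875);
# verify against the repository's evaluation pipeline.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = create_model('faster_vit_0_224', pretrained=True,
                     model_path="/tmp/faster_vit_0.pth.tar")
model.eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
batch = preprocess(image).unsqueeze(0)            # (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)                         # (1, 1000)

probs, ids = logits.softmax(dim=-1).topk(5)
print(ids, probs)  # top-5 ImageNet-1K class indices and probabilities
```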
---

## Results + Pretrained Models

### ImageNet-1K

**FasterViT ImageNet-1K Pretrained Models**

| Name | Acc@1 (%) | Acc@5 (%) | Throughput (img/sec) | Resolution | #Params (M) | FLOPs (G) | Download |
|---|---|---|---|---|---|---|---|
| FasterViT-0 | 82.1 | 95.9 | 5802 | 224x224 | 31.4 | 3.3 | model |
| FasterViT-1 | 83.2 | 96.5 | 4188 | 224x224 | 53.4 | 5.3 | model |
| FasterViT-2 | 84.2 | 96.8 | 3161 | 224x224 | 75.9 | 8.7 | model |
| FasterViT-3 | 84.9 | 97.2 | 1780 | 224x224 | 159.5 | 18.2 | model |
| FasterViT-4 | 85.4 | 97.3 | 849 | 224x224 | 424.6 | 36.6 | model |
| FasterViT-5 | 85.6 | 97.4 | 449 | 224x224 | 975.5 | 113.0 | model |
| FasterViT-6 | 85.8 | 97.4 | 352 | 224x224 | 1360.0 | 142.0 | model |
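As a rough reference for the throughput column, image throughput can be estimated with a simple timing loop like the one below. This is an eager-mode PyTorch sketch requiring a GPU; the batch size and iteration counts are arbitrary choices, and the table's numbers additionally depend on hardware, precision, and TensorRT optimization, so exact reproduction should not be expected:

```python
import time
import torch
from fastervit import create_model

# Rough throughput measurement (eager mode, fp32). Batch size and
# iteration counts below are illustrative choices, not the paper's setup.
model = create_model('faster_vit_0_224', pretrained=True).cuda().eval()
batch = torch.rand(64, 3, 224, 224, device='cuda')

with torch.no_grad():
    for _ in range(10):          # warm-up iterations
        model(batch)
    torch.cuda.synchronize()
    start = time.time()
    iters = 50
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"{iters * batch.shape[0] / elapsed:.1f} images/sec")
```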
### ImageNet-21K

**FasterViT ImageNet-21K Pretrained Models (ImageNet-1K Fine-tuned)**
| Name | Acc@1 (%) | Acc@5 (%) | Resolution | #Params (M) | FLOPs (G) | Download |
|---|---|---|---|---|---|---|
| FasterViT-4-21K-224 | 86.6 | 97.8 | 224x224 | 271.9 | 40.8 | model |
| FasterViT-4-21K-384 | 87.6 | 98.3 | 384x384 | 271.9 | 120.1 | model |
| FasterViT-4-21K-512 | 87.8 | 98.4 | 512x512 | 271.9 | 213.5 | model |
| FasterViT-4-21K-768 | 87.9 | 98.5 | 768x768 | 271.9 | 480.4 | model |
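These fine-tuned checkpoints can be instantiated through the same `create_model` interface. The model name used below is an assumption inferred from the table's naming pattern; verify it against the names registered in the `fastervit` package:

```python
>>> from fastervit import create_model

# Assumed model name for the 21K-pretrained, 1K-fine-tuned checkpoint;
# check the fastervit package's registered model names if this differs.
>>> model = create_model('faster_vit_4_21k_224',
                          pretrained=True,
                          model_path="/tmp/faster_vit_4_21k_224.pth.tar")

>>> import torch
>>> image = torch.rand(1, 3, 224, 224)
>>> output = model(image) # torch.Size([1, 1000]), ImageNet-1K logits
```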
### Robustness (ImageNet-A, ImageNet-R, ImageNet-V2)

All models use `crop_pct=0.875`. Results are obtained by running inference on ImageNet-1K pretrained models without fine-tuning.
| Name | A-Acc@1 (%) | A-Acc@5 (%) | R-Acc@1 (%) | R-Acc@5 (%) | V2-Acc@1 (%) | V2-Acc@5 (%) |
|---|---|---|---|---|---|---|
| FasterViT-0 | 23.9 | 57.6 | 45.9 | 60.4 | 70.9 | 90.0 |
| FasterViT-1 | 31.2 | 63.3 | 47.5 | 61.9 | 72.6 | 91.0 |
| FasterViT-2 | 38.2 | 68.9 | 49.6 | 63.4 | 73.7 | 91.6 |
| FasterViT-3 | 44.2 | 73.0 | 51.9 | 65.6 | 75.0 | 92.2 |
| FasterViT-4 | 49.0 | 75.4 | 56.0 | 69.6 | 75.7 | 92.7 |
| FasterViT-5 | 52.7 | 77.6 | 56.9 | 70.0 | 76.0 | 93.0 |
| FasterViT-6 | 53.7 | 78.4 | 57.1 | 70.1 | 76.1 | 93.0 |

A, R and V2 denote ImageNet-A, ImageNet-R and ImageNet-V2, respectively.
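For completeness, here is a minimal sketch of this evaluation recipe for ImageNet-V2 (`crop_pct=0.875` at 224x224 means resizing the short side to 224 / 0.875 = 256 before the center crop). The dataset path is a placeholder; note that ImageNet-A and ImageNet-R cover 200-class subsets, so evaluating them additionally requires remapping the 1000 logits onto each subset's class indices:

```python
import torch
from torchvision import datasets, transforms
from fastervit import create_model

# crop_pct = 0.875 at 224x224: resize short side to 224 / 0.875 = 256,
# then center-crop to 224.
eval_tf = transforms.Compose([
    transforms.Resize(int(224 / 0.875)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Placeholder path; point this at ImageNet-V2 arranged as an ImageFolder.
loader = torch.utils.data.DataLoader(
    datasets.ImageFolder("/path/to/imagenet-v2", transform=eval_tf),
    batch_size=64, num_workers=4)

model = create_model('faster_vit_0_224', pretrained=True).cuda().eval()
top1 = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images.cuda()).argmax(dim=-1).cpu()
        top1 += (preds == labels).sum().item()
        total += labels.numel()

print(f"Acc@1: {100 * top1 / total:.1f}%")
```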
## Citation

Please consider citing FasterViT if this repository is useful for your work.

```bibtex
@article{hatamizadeh2023fastervit,
  title={FasterViT: Fast Vision Transformers with Hierarchical Attention},
  author={Hatamizadeh, Ali and Heinrich, Greg and Yin, Hongxu and Tao, Andrew and Alvarez, Jose M and Kautz, Jan and Molchanov, Pavlo},
  journal={arXiv preprint arXiv:2306.06189},
  year={2023}
}
```

## Licenses

Copyright © 2023, NVIDIA Corporation. All rights reserved.

This work is made available under the NVIDIA Source Code License-NC. Click [here](LICENSE) to view a copy of this license.

For license information regarding the timm repository, please refer to its [repository](https://github.com/rwightman/pytorch-image-models).

For license information regarding the ImageNet dataset, please see the [ImageNet official website](https://www.image-net.org/).

## Acknowledgement

This repository is built on top of the [timm](https://github.com/huggingface/pytorch-image-models) repository. We thank [Ross Wightman](https://rwightman.com/) for creating and maintaining this high-quality library.