---
license: other
datasets:
- imagenet-1k
---

[**FasterViT: Fast Vision Transformers with Hierarchical Attention**](https://arxiv.org/abs/2306.06189)

FasterViT achieves a new SOTA Pareto front in terms of accuracy vs. image throughput, without extra training data!

Note: Please use the [**latest NVIDIA TensorRT release**](https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html) to enjoy the benefits of optimized FasterViT ops.

## Quick Start

We can import pre-trained FasterViT models with **1 line of code**. First, FasterViT can be installed by:

```bash
pip install fastervit
```

A pretrained FasterViT model with default hyper-parameters can be created as follows:

```python
>>> from fastervit import create_model

# Define fastervit-0 model with 224 x 224 resolution
>>> model = create_model('faster_vit_0_224',
                          pretrained=True,
                          model_path="/tmp/faster_vit_0.pth.tar")
```

`model_path` sets the directory to which the pretrained weights are downloaded. We can test the model by passing a dummy input image; the output is the logits:

```python
>>> import torch

>>> image = torch.rand(1, 3, 224, 224)
>>> output = model(image) # torch.Size([1, 1000])
```

We can also use the any-resolution FasterViT model to accommodate arbitrary image resolutions. In the following, we define an any-resolution FasterViT-0 model with an input resolution of 576 x 960, window sizes of 12 and 6 in the 3rd and 4th stages, a carrier token size of 2, and an embedding dimension of 64:

```python
>>> from fastervit import create_model

# Define any-resolution FasterViT-0 model with 576 x 960 resolution
>>> model = create_model('faster_vit_0_any_res',
                          resolution=[576, 960],
                          window_size=[7, 7, 12, 6],
                          ct_size=2,
                          dim=64,
                          pretrained=True)
```

Note that the above model is initialized from the original ImageNet-pretrained FasterViT with its original resolution of 224 x 224. As a result, missing keys and mismatches are to be expected, since we are adding new layers (e.g. new carrier tokens). We can test the model by passing a dummy input image; the output is the logits:

```python
>>> import torch

>>> image = torch.rand(1, 3, 576, 960)
>>> output = model(image) # torch.Size([1, 1000])
```
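Beyond dummy tensors, the sketch below runs inference on an actual photo. It assumes the standard ImageNet evaluation transform (resize the short side to 256, then center-crop to 224, i.e. `crop_pct = 224/256 = 0.875`); the image path is a placeholder, and the exact preprocessing used by the authors should be verified against the repository:

```python
import torch
from PIL import Image
from torchvision import transforms
from fastervit import create_model

# Assumed standard ImageNet eval preprocessing (crop_pct = 0.875);
# verify against the repository's evaluation pipeline.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = create_model('faster_vit_0_224', pretrained=True,
                     model_path="/tmp/faster_vit_0.pth.tar")
model.eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
batch = preprocess(image).unsqueeze(0)            # (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)                         # (1, 1000)

probs, ids = logits.softmax(dim=-1).topk(5)
print(ids, probs)  # top-5 ImageNet-1K class indices and probabilities
```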
---

## Results + Pretrained Models

### ImageNet-1K

**FasterViT ImageNet-1K Pretrained Models**

| Name | Acc@1 (%) | Acc@5 (%) | Throughput (img/sec) | Resolution | #Params (M) | FLOPs (G) | Download |
|---|---|---|---|---|---|---|---|
| FasterViT-0 | 82.1 | 95.9 | 5802 | 224x224 | 31.4 | 3.3 | model |
| FasterViT-1 | 83.2 | 96.5 | 4188 | 224x224 | 53.4 | 5.3 | model |
| FasterViT-2 | 84.2 | 96.8 | 3161 | 224x224 | 75.9 | 8.7 | model |
| FasterViT-3 | 84.9 | 97.2 | 1780 | 224x224 | 159.5 | 18.2 | model |
| FasterViT-4 | 85.4 | 97.3 | 849 | 224x224 | 424.6 | 36.6 | model |
| FasterViT-5 | 85.6 | 97.4 | 449 | 224x224 | 975.5 | 113.0 | model |
| FasterViT-6 | 85.8 | 97.4 | 352 | 224x224 | 1360.0 | 142.0 | model |
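As a rough reference for the throughput column, image throughput can be estimated with a simple timing loop like the one below. This is an eager-mode PyTorch sketch requiring a GPU; the batch size and iteration counts are arbitrary choices, and the table's numbers additionally depend on hardware, precision, and TensorRT optimization, so exact reproduction should not be expected:

```python
import time
import torch
from fastervit import create_model

# Rough throughput measurement (eager mode, fp32). Batch size and
# iteration counts below are illustrative choices, not the paper's setup.
model = create_model('faster_vit_0_224', pretrained=True).cuda().eval()
batch = torch.rand(64, 3, 224, 224, device='cuda')

with torch.no_grad():
    for _ in range(10):          # warm-up iterations
        model(batch)
    torch.cuda.synchronize()
    start = time.time()
    iters = 50
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"{iters * batch.shape[0] / elapsed:.1f} images/sec")
```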
### ImageNet-21K

**FasterViT ImageNet-21K Pretrained Models (ImageNet-1K Fine-tuned)**
| Name | Acc@1 (%) | Acc@5 (%) | Resolution | #Params (M) | FLOPs (G) | Download |
|---|---|---|---|---|---|---|
| FasterViT-4-21K-224 | 86.6 | 97.8 | 224x224 | 271.9 | 40.8 | model |
| FasterViT-4-21K-384 | 87.6 | 98.3 | 384x384 | 271.9 | 120.1 | model |
| FasterViT-4-21K-512 | 87.8 | 98.4 | 512x512 | 271.9 | 213.5 | model |
| FasterViT-4-21K-768 | 87.9 | 98.5 | 768x768 | 271.9 | 480.4 | model |
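These fine-tuned checkpoints can be instantiated through the same `create_model` interface. The model name used below is an assumption inferred from the table's naming pattern; verify it against the names registered in the `fastervit` package:

```python
>>> from fastervit import create_model

# Assumed model name for the 21K-pretrained, 1K-fine-tuned checkpoint;
# check the fastervit package's registered model names if this differs.
>>> model = create_model('faster_vit_4_21k_224',
                          pretrained=True,
                          model_path="/tmp/faster_vit_4_21k_224.pth.tar")

>>> import torch
>>> image = torch.rand(1, 3, 224, 224)
>>> output = model(image) # torch.Size([1, 1000]), ImageNet-1K logits
```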
### Robustness (ImageNet-A, ImageNet-R, ImageNet-V2)

All models use `crop_pct=0.875`. Results are obtained by running inference on ImageNet-1K pretrained models without fine-tuning.
| Name | A-Acc@1 (%) | A-Acc@5 (%) | R-Acc@1 (%) | R-Acc@5 (%) | V2-Acc@1 (%) | V2-Acc@5 (%) |
|---|---|---|---|---|---|---|
| FasterViT-0 | 23.9 | 57.6 | 45.9 | 60.4 | 70.9 | 90.0 |
| FasterViT-1 | 31.2 | 63.3 | 47.5 | 61.9 | 72.6 | 91.0 |
| FasterViT-2 | 38.2 | 68.9 | 49.6 | 63.4 | 73.7 | 91.6 |
| FasterViT-3 | 44.2 | 73.0 | 51.9 | 65.6 | 75.0 | 92.2 |
| FasterViT-4 | 49.0 | 75.4 | 56.0 | 69.6 | 75.7 | 92.7 |
| FasterViT-5 | 52.7 | 77.6 | 56.9 | 70.0 | 76.0 | 93.0 |
| FasterViT-6 | 53.7 | 78.4 | 57.1 | 70.1 | 76.1 | 93.0 |

A, R and V2 denote ImageNet-A, ImageNet-R and ImageNet-V2, respectively.
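For completeness, here is a minimal sketch of this evaluation recipe for ImageNet-V2 (`crop_pct=0.875` at 224x224 means resizing the short side to 224 / 0.875 = 256 before the center crop). The dataset path is a placeholder; note that ImageNet-A and ImageNet-R cover 200-class subsets, so evaluating them additionally requires remapping the 1000 logits onto each subset's class indices:

```python
import torch
from torchvision import datasets, transforms
from fastervit import create_model

# crop_pct = 0.875 at 224x224: resize short side to 224 / 0.875 = 256,
# then center-crop to 224.
eval_tf = transforms.Compose([
    transforms.Resize(int(224 / 0.875)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Placeholder path; point this at ImageNet-V2 arranged as an ImageFolder.
loader = torch.utils.data.DataLoader(
    datasets.ImageFolder("/path/to/imagenet-v2", transform=eval_tf),
    batch_size=64, num_workers=4)

model = create_model('faster_vit_0_224', pretrained=True).cuda().eval()
top1 = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images.cuda()).argmax(dim=-1).cpu()
        top1 += (preds == labels).sum().item()
        total += labels.numel()

print(f"Acc@1: {100 * top1 / total:.1f}%")
```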
## Citation

Please consider citing FasterViT if this repository is useful for your work.

```bibtex
@article{hatamizadeh2023fastervit,
  title={FasterViT: Fast Vision Transformers with Hierarchical Attention},
  author={Hatamizadeh, Ali and Heinrich, Greg and Yin, Hongxu and Tao, Andrew and Alvarez, Jose M and Kautz, Jan and Molchanov, Pavlo},
  journal={arXiv preprint arXiv:2306.06189},
  year={2023}
}
```

## Licenses

Copyright © 2023, NVIDIA Corporation. All rights reserved.

This work is made available under the NVIDIA Source Code License-NC. Click [here](LICENSE) to view a copy of this license.

For license information regarding the timm repository, please refer to its [repository](https://github.com/rwightman/pytorch-image-models).

For license information regarding the ImageNet dataset, please see the [ImageNet official website](https://www.image-net.org/).

## Acknowledgement

This repository is built on top of the [timm](https://github.com/huggingface/pytorch-image-models) repository. We thank [Ross Wightman](https://rwightman.com/) for creating and maintaining this high-quality library.