---
license: other
datasets:
- imagenet-1k
---

[**FasterViT: Fast Vision Transformers with Hierarchical Attention**](https://arxiv.org/abs/2306.06189)

FasterViT achieves a new SOTA Pareto front in terms of accuracy vs. image throughput without extra training data!
Note: Please use the [**latest NVIDIA TensorRT release**](https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html) to enjoy the benefits of optimized FasterViT ops.

## Quick Start

We can import pre-trained FasterViT models with **1 line of code**. First, FasterViT can be installed with pip:

```bash
pip install fastervit
```

A pretrained FasterViT model with default hyper-parameters can be created as follows:

```python
>>> from fastervit import create_model

# Define fastervit-0 model with 224 x 224 resolution
>>> model = create_model('faster_vit_0_224',
                         pretrained=True,
                         model_path="/tmp/faster_vit_0.pth.tar")
```

`model_path` sets the directory to which the pretrained weights are downloaded.

We can simply test the model by passing a dummy input image. The output is the logits:

```python
>>> import torch

>>> image = torch.rand(1, 3, 224, 224)
>>> output = model(image)  # torch.Size([1, 1000])
```

We can also use the any-resolution FasterViT model to accommodate arbitrary image resolutions. In the following, we define an any-resolution FasterViT-0 model with an input resolution of 576 x 960, window sizes of 12 and 6 in the 3rd and 4th stages, a carrier token size of 2 and an embedding dimension of 64:

```python
>>> from fastervit import create_model

# Define any-resolution FasterViT-0 model with 576 x 960 resolution
>>> model = create_model('faster_vit_0_any_res',
                         resolution=[576, 960],
                         window_size=[7, 7, 12, 6],
                         ct_size=2,
                         dim=64,
                         pretrained=True)
```

Note that the above model is initialized from the original ImageNet-pretrained FasterViT with its original resolution of 224 x 224. As a result, missing keys and shape mismatches are expected, since new layers (e.g. new carrier tokens) are added.

Again, we can test the model by passing a dummy input image. The output is the logits:

```python
>>> import torch

>>> image = torch.rand(1, 3, 576, 960)
>>> output = model(image)  # torch.Size([1, 1000])
```

---

## Results + Pretrained Models

### ImageNet-1K

**FasterViT ImageNet-1K Pretrained Models**
Name | Acc@1(%) | Acc@5(%) | Throughput(Img/Sec) | Resolution | #Params(M) | FLOPs(G) | Download |
---|---|---|---|---|---|---|---|
FasterViT-0 | 82.1 | 95.9 | 5802 | 224x224 | 31.4 | 3.3 | model |
FasterViT-1 | 83.2 | 96.5 | 4188 | 224x224 | 53.4 | 5.3 | model |
FasterViT-2 | 84.2 | 96.8 | 3161 | 224x224 | 75.9 | 8.7 | model |
FasterViT-3 | 84.9 | 97.2 | 1780 | 224x224 | 159.5 | 18.2 | model |
FasterViT-4 | 85.4 | 97.3 | 849 | 224x224 | 424.6 | 36.6 | model |
FasterViT-5 | 85.6 | 97.4 | 449 | 224x224 | 975.5 | 113.0 | model |
FasterViT-6 | 85.8 | 97.4 | 352 | 224x224 | 1360.0 | 142.0 | model |
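
The Acc@1 and Acc@5 columns count a prediction as correct when the ground-truth label is the highest-scoring class (top-1) or among the five highest-scoring classes (top-5). Below is a minimal sketch of how the logits from Quick Start map onto these metrics; the random tensor stands in for a properly preprocessed ImageNet image, whose standard resize/crop/normalize pipeline is assumed rather than shown here:

```python
import torch
from fastervit import create_model

# FasterViT-0 from Quick Start; weights download to model_path.
model = create_model('faster_vit_0_224', pretrained=True,
                     model_path="/tmp/faster_vit_0.pth.tar")
model.eval()

image = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    logits = model(image)  # torch.Size([1, 1000])

# Top-5 predictions: Acc@5 counts a hit if the true label is among these.
probs = logits.softmax(dim=-1)
top5_prob, top5_idx = probs.topk(5, dim=-1)
print(top5_idx.tolist())

# Sanity check against the #Params(M) column above.
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.1f}M")  # ~31.4M for FasterViT-0
```
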
**FasterViT Robustness Results** (A: ImageNet-A, R: ImageNet-R, V2: ImageNetV2)

Name | A-Acc@1(%) | A-Acc@5(%) | R-Acc@1(%) | R-Acc@5(%) | V2-Acc@1(%) | V2-Acc@5(%) |
---|---|---|---|---|---|---|
FasterViT-0 | 23.9 | 57.6 | 45.9 | 60.4 | 70.9 | 90.0 |
FasterViT-1 | 31.2 | 63.3 | 47.5 | 61.9 | 72.6 | 91.0 |
FasterViT-2 | 38.2 | 68.9 | 49.6 | 63.4 | 73.7 | 91.6 |
FasterViT-3 | 44.2 | 73.0 | 51.9 | 65.6 | 75.0 | 92.2 |
FasterViT-4 | 49.0 | 75.4 | 56.0 | 69.6 | 75.7 | 92.7 |
FasterViT-5 | 52.7 | 77.6 | 56.9 | 70.0 | 76.0 | 93.0 |
FasterViT-6 | 53.7 | 78.4 | 57.1 | 70.1 | 76.1 | 93.0 |
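
One possible route to the TensorRT deployment mentioned in the note at the top is ONNX export. The sketch below is illustrative rather than taken from the FasterViT docs: the output filename, opset version, and `trtexec` invocation are assumptions, and whether every FasterViT op exports cleanly may depend on the package and TensorRT versions.

```python
import torch
from fastervit import create_model

# Export FasterViT-0 to ONNX as a first step toward a TensorRT engine.
model = create_model('faster_vit_0_224', pretrained=True,
                     model_path="/tmp/faster_vit_0.pth.tar")
model.eval()

dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, "faster_vit_0.onnx",  # hypothetical output path
    input_names=["input"], output_names=["logits"],
    opset_version=17,
)
```

The resulting file can then be built into an engine with, e.g., `trtexec --onnx=faster_vit_0.onnx` inside the TensorRT container linked above.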