---
license: other
library_name: keras
---
# Collection shoaib6174/video_swin_transformer/1

A collection of Video Swin Transformer feature-extractor models.


<!-- task: video-feature-extraction -->

## Overview

This collection contains different Video Swin Transformer [1] models. The original model weights come from [2]. They were ported to Keras models
(`tf.keras.Model`) and then serialized as TensorFlow SavedModels. The porting steps are available in [3].


## About the models

These models can be used directly to extract features from videos. Each model is accompanied by a
Colab notebook with fine-tuning steps for action recognition and video classification; a minimal loading example is also shown after the model list below.

The table below provides a performance summary:

| Model name                                     |   Pre-train dataset |   Fine-tune dataset   |  acc@1 (%) | acc@5 (%) |
|:----------------------------------------------:|:-------------------:|:---------------------:|:----------:|----------:|
| swin_tiny_patch244_window877_kinetics400_1k    |    ImageNet-1K      | Kinetics 400 (1k)     |       78.8 |      93.6 |
| swin_small_patch244_window877_kinetics400_1k   |    ImageNet-1K      | Kinetics 400 (1k)     |       80.6 |      94.5 |
| swin_base_patch244_window877_kinetics400_1k    |    ImageNet-1K      | Kinetics 400 (1k)     |       80.6 |      96.6 |
| swin_base_patch244_window877_kinetics400_22k   |    ImageNet-22K     | Kinetics 400 (1k)     |       82.7 |      95.5 |
| swin_base_patch244_window877_kinetics600_22k   |    ImageNet-22K     | Kinetics 600 (1k)     |       84.0 |      96.5 |
| swin_base_patch244_window1677_sthv2            |    Kinetics 400     | Something-Something V2|       69.6 |      92.7 |


The scores for all models are taken from [2].



### Video Swin Transformer Feature Extractor Models

* [swin_tiny_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_tiny_patch244_window877_kinetics400_1k)
* [swin_small_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_small_patch244_window877_kinetics400_1k)
* [swin_base_patch244_window877_kinetics400_1k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics400_1k)
* [swin_base_patch244_window877_kinetics400_22k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics400_22k)
* [swin_base_patch244_window877_kinetics600_22k](https://tfhub.dev/shoaib6174/swin_base_patch244_window877_kinetics600_22k)
* [swin_base_patch244_window1677_sthv2](https://tfhub.dev/shoaib6174/swin_base_patch244_window1677_sthv2)
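
As mentioned above, each of these models can be loaded directly as a video feature extractor. Below is a minimal sketch using `tensorflow_hub`; the `/1` version suffix on the handle, the dummy input, and the absence of any preprocessing are assumptions, so refer to the accompanying Colab notebooks for the exact usage.

```python
import tensorflow as tf
import tensorflow_hub as hub

# One handle from the collection (the /1 version suffix is assumed).
model_handle = "https://tfhub.dev/shoaib6174/swin_tiny_patch244_window877_kinetics400_1k/1"

# Load the serialized SavedModel as a Keras layer.
feature_extractor = hub.KerasLayer(model_handle, trainable=False)

# Dummy batch of two clips shaped [batch_size, channels, frames, height, width].
videos = tf.random.uniform((2, 3, 32, 224, 224), dtype=tf.float32)

# Extract features; the output shape depends on the model variant.
features = feature_extractor(videos)
print(features.shape)
```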



## Notes

The input shape for these models is `[None, 3, 32, 224, 224]`, representing `[batch_size, channels, frames, height, width]`. To create models with a different input shape, use [this notebook](https://colab.research.google.com/drive/1sZIM7_OV1__CFV-WSQguOOZ8VyOsDaGM).
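
For fine-tuning on an action-recognition or video-classification dataset (as covered in the Colab notebooks), a classification head can be stacked on top of a feature extractor. The sketch below is illustrative only: the number of classes, the learning rate, and the pooling step are assumptions and should be adapted to your data and to the actual output shape of the chosen model.

```python
import tensorflow as tf
import tensorflow_hub as hub

model_handle = "https://tfhub.dev/shoaib6174/swin_tiny_patch244_window877_kinetics400_1k/1"
num_classes = 400  # e.g. Kinetics 400; adjust to your dataset

# Input matches [channels, frames, height, width] = [3, 32, 224, 224].
inputs = tf.keras.Input(shape=(3, 32, 224, 224))

# Set trainable=True to fine-tune the backbone together with the head.
features = hub.KerasLayer(model_handle, trainable=False)(inputs)

# If the extractor returns a spatio-temporal feature map instead of a flat
# vector (assumption: check the actual output), pool it before the head.
if len(features.shape) > 2:
    features = tf.keras.layers.GlobalAveragePooling3D(data_format="channels_first")(features)

outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```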

## References
[1] [Video Swin Transformer, Liu et al.](https://arxiv.org/abs/2106.13230)
[2] [Video Swin Transformer GitHub](https://github.com/SwinTransformer/Video-Swin-Transformer)
[3] [GSOC-22-Video-Swin-Transformers GitHub](https://github.com/shoaib6174/GSOC-22-Video-Swin-Transformers)

## Acknowledgements
* [Google Summer of Code 2022](https://summerofcode.withgoogle.com/)
* [Luiz Gustavo Martins](https://www.linkedin.com/in/luiz-gustavo-martins-64ab5891/)
* [Sayak Paul](https://www.linkedin.com/in/sayak-paul/)