---
language:
- en
pipeline_tag: video-classification
---
# Official PyTorch Implementation of SIGMA (ECCV 2024)
Paper: https://arxiv.org/html/2407.15447v1
On Hugging Face: https://huggingface.co/papers/2407.15447

### 🔥 Sinkhorn-Guided Masked Video Modeling
Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet they fall short in capturing higher-level semantics because they reconstruct predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modeling (SIGMA), a novel video pretraining method that jointly learns the video model and a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute the features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task in which the video model predicts the cluster assignments of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations, improving upon state-of-the-art methods.
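To make the clustering step concrete, below is a minimal, self-contained sketch of how Sinkhorn-Knopp iterations can turn tube-feature/prototype similarities into balanced soft cluster assignments, and how a symmetric swapped-prediction loss can be built on top of them. This is an illustration under assumptions, not the repository's exact implementation: the function names (`sinkhorn`, `swapped_prediction_loss`), iteration count, and temperatures are placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores: torch.Tensor, eps: float = 0.05, n_iters: int = 3) -> torch.Tensor:
    """Balance a (tubes x clusters) similarity matrix so that cluster mass
    is spread evenly across the batch (entropy-regularized optimal transport)."""
    Q = torch.exp(scores / eps).t()               # (clusters, tubes)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K   # normalize cluster rows
        Q /= Q.sum(dim=0, keepdim=True); Q /= B   # normalize tube columns
    return (Q * B).t()                            # (tubes, clusters), rows sum to 1

def swapped_prediction_loss(z_video, z_proj, prototypes, temperature=0.1):
    """Each branch predicts the Sinkhorn cluster assignments of the other branch."""
    z_video = F.normalize(z_video, dim=-1)
    z_proj = F.normalize(z_proj, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)  # (clusters, dim) learnable prototypes
    s_video = z_video @ prototypes.t()            # (tubes, clusters) similarities
    s_proj = z_proj @ prototypes.t()
    q_video = sinkhorn(s_video)                   # targets produced by the video model
    q_proj = sinkhorn(s_proj)                     # targets produced by the projection net
    loss_v = -(q_proj * F.log_softmax(s_video / temperature, dim=-1)).sum(-1).mean()
    loss_p = -(q_video * F.log_softmax(s_proj / temperature, dim=-1)).sum(-1).mean()
    return 0.5 * (loss_v + loss_p)
```

In this style of formulation, the entropy regularization (`eps`) and the explicit row/column normalizations are what keep cluster usage balanced across the batch, which is how trivial solutions are avoided when the video model and projection network are optimized jointly.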
## 🚀 Main Results
### ✨ Something-Something V2
| Method   | Extra Data   | Backbone | Resolution | #Frames x Clips x Crops | Pre-training Epochs | Top-1 |
| :------: | :----------: | :------: | :--------: | :---------------------: | :-----------------: | :---: |
| VideoMAE | ***no***     | ViT-S    | 224x224    | 16x2x3                  | 2400                | 66.8  |
| VideoMAE | ***no***     | ViT-B    | 224x224    | 16x2x3                  | 800                 | 69.6  |
| SIGMA    | ***Img-1k*** | ViT-S    | 224x224    | 16x2x3                  | 2400                | 68.6  |
| SIGMA    | ***Img-1k*** | ViT-B    | 224x224    | 16x2x3                  | 800                 | 70.9  |
### ✨ Kinetics-400
| Method   | Extra Data   | Backbone | Resolution | #Frames x Clips x Crops | Pre-training Epochs | Top-1 |
| :------: | :----------: | :------: | :--------: | :---------------------: | :-----------------: | :---: |
| VideoMAE | ***no***     | ViT-S    | 224x224    | 16x5x3                  | 1600                | 79.0  |
| VideoMAE | ***no***     | ViT-B    | 224x224    | 16x5x3                  | 800                 | 80.0  |
| SIGMA    | ***Img-1k*** | ViT-S    | 224x224    | 16x5x3                  | 800                 | 79.4  |
| SIGMA    | ***Img-1k*** | ViT-B    | 224x224    | 16x5x3                  | 800                 | 81.6  |
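The `#Frames x Clips x Crops` column describes the multi-view inference protocol: for example, `16x5x3` means 16 frames per clip, 5 temporal clips, and 3 spatial crops, i.e. 15 views per video. The sketch below shows the usual aggregation (averaging per-view predictions before taking the arg-max); `model`, `views`, and the tensor layout are illustrative placeholders, not the repository's evaluation code.

```python
import torch

@torch.no_grad()
def multi_view_top1_correct(model, views: torch.Tensor, label: int) -> bool:
    """views: (num_clips * num_crops, C, T, H, W) views of a single video."""
    logits = model(views)                    # (num_views, num_classes)
    probs = torch.softmax(logits, dim=-1)    # per-view class probabilities
    video_pred = probs.mean(dim=0).argmax()  # average over clips x crops
    return video_pred.item() == label
```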
## 🔨 Installation
Please follow the instructions in [INSTALL.md](INSTALL.md).
## ➡️ Data Preparation
Please follow the instructions in [DATASET.md](DATASET.md) for data preparation.
## 🔄 Pre-training
Pre-training instructions are in [PRETRAIN.md](PRETRAIN.md).
## ⤴️ Fine-tuning with pre-trained models
Fine-tuning instructions are in [FINETUNE.md](FINETUNE.md).
## 📍 Model Zoo
## ⚠️ Acknowledgement
Our code is based on the [VideoMAE](https://github.com/MCG-NJU/VideoMAE) codebase.
## ✏️ Citation
If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:
```bibtex
@inproceedings{salehi2024sigma,
  title={SIGMA: Sinkhorn-Guided Masked Video Modeling},
  author={Salehi, Mohammadreza and Dorkenwald, Michael and Thoker, Fida Mohammad and Gavves, Efstratios and Snoek, Cees GM and Asano, Yuki M},
  booktitle={European Conference on Computer Vision},
  year={2024}
}
```