VasylKhoroshcho committed on
Commit 2b0abc1
1 Parent(s): e7c8fa3

Update README.md

Files changed (1)
  1. README.md +74 -0
README.md CHANGED
@@ -1,6 +1,80 @@
  ---
+ ---
  python_version: 3.9
+ language: en
  license: mit
+ tags:
+ - vision
+ - video-classification
+ model-index:
+ - name: nielsr/xclip-base-patch16-zero-shot
+   results:
+   - task:
+       type: video-classification
+     dataset:
+       name: HMDB-51
+       type: hmdb-51
+     metrics:
+     - type: top-1 accuracy
+       value: 44.6
+   - task:
+       type: video-classification
+     dataset:
+       name: UCF101
+       type: ucf101
+     metrics:
+     - type: top-1 accuracy
+       value: 72.0
+   - task:
+       type: video-classification
+     dataset:
+       name: Kinetics-600
+       type: kinetics600
+     metrics:
+     - type: top-1 accuracy
+       value: 65.2
+ ---
+
+ # X-CLIP (base-sized model)
+
+ X-CLIP model (base-sized, patch resolution of 16) trained on [Kinetics-400](https://www.deepmind.com/open-source/kinetics). It was introduced in the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Ni et al. and first released in [this repository](https://github.com/microsoft/VideoX/tree/master/X-CLIP).
+
+ This model was trained using 32 frames per video, at a resolution of 224x224.
+
+ Disclaimer: The team releasing X-CLIP did not write a model card for this model so this model card has been written by the Hugging Face team.
+
+ ## Model description
+
+ X-CLIP is a minimal extension of [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs.
+
+ ![X-CLIP architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/xclip_architecture.png)
+
+ This allows the model to be used for tasks like zero-shot, few-shot or fully supervised video classification and video-text retrieval.
+
+ ## Intended uses & limitations
+
+ You can use the raw model for determining how well text goes with a given video. See the [model hub](https://huggingface.co/models?search=microsoft/xclip) to look for
+ fine-tuned versions on a task that interests you.
+
+ ### How to use
+
+ For code examples, we refer to the [documentation](https://huggingface.co/transformers/main/model_doc/xclip.html#).
+
+ ## Training data
+
+ This model was trained on [Kinetics 400](https://www.deepmind.com/open-source/kinetics).
+
+ ### Preprocessing
+
+ The exact details of preprocessing during training can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L247).
+
+ The exact details of preprocessing during validation can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L285).
+
+ During validation, one resizes the shorter edge of each frame, after which center cropping is performed to a fixed-size resolution (like 224x224). Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
+
+ ## Evaluation results
+
+ This model achieves a zero-shot top-1 accuracy of 44.6% on HMDB-51, 72.0% on UCF-101 and 65.2% on Kinetics-600.
  ---

  # X-CLIP (base-sized model)
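
The "How to use" section of the card added above only defers to the Transformers documentation. As a minimal, unofficial sketch of zero-shot video classification with this card's checkpoint (assuming the weights are available on the Hub as `microsoft/xclip-base-patch16-zero-shot`, and using the `XCLIPProcessor` and `XCLIPModel` classes from the Transformers library), the flow looks roughly like this:

```python
import numpy as np
import torch
from transformers import XCLIPModel, XCLIPProcessor

# Hub id assumed from the card; the checkpoint described here uses 32 frames at 224x224.
ckpt = "microsoft/xclip-base-patch16-zero-shot"
processor = XCLIPProcessor.from_pretrained(ckpt)
model = XCLIPModel.from_pretrained(ckpt)

# 32 dummy frames standing in for a real clip; replace with frames decoded from a video.
video = list(np.random.randint(0, 256, (32, 224, 224, 3), dtype=np.uint8))
labels = ["playing guitar", "swimming", "walking the dog"]  # arbitrary example prompts

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity of the clip to each text prompt, turned into a distribution over the labels.
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```

The hub id, label prompts and random frames above are placeholders for illustration; the linked documentation remains the authoritative reference for the API.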
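
Likewise, the validation-time preprocessing the card describes (resize the shorter edge, center crop to a fixed resolution such as 224x224, normalize with the ImageNet mean and standard deviation) can be approximated with torchvision. This is only a sketch of the described steps, not the exact pipeline in the linked `build.py`:

```python
import torch
from PIL import Image
from torchvision import transforms

# ImageNet channel statistics referenced by the card for normalization.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

val_transform = transforms.Compose([
    transforms.Resize(224),        # resize so the shorter edge becomes 224 pixels
    transforms.CenterCrop(224),    # center crop to a fixed 224x224 resolution
    transforms.ToTensor(),         # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

# Apply the transform to each of the 32 sampled frames, then stack into a clip tensor.
frames = [Image.new("RGB", (320, 240)) for _ in range(32)]  # placeholder frames
clip = torch.stack([val_transform(f) for f in frames])      # shape: (32, 3, 224, 224)
```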