rvv-karma committed on
Commit
48540ed
1 Parent(s): 743363a

Update README.md

Files changed (1)
  1. README.md +42 -5
README.md CHANGED
@@ -8,25 +8,57 @@ metrics:
  model-index:
  - name: Human-Action-Recognition-VIT-Base-patch16-224
  results: []
  ---
 
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
  # Human-Action-Recognition-VIT-Base-patch16-224
- This model is a fine-tuned version of [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) on an unknown dataset.
  It achieves the following results on the evaluation set:
  - Loss: 0.4005
  - Accuracy: 0.8786
 
  ## Model description
 
- More information needed
 
  ## Intended uses & limitations
 
- More information needed
 
  ## Training and evaluation data
 
@@ -80,3 +112,8 @@ The following hyperparameters were used during training:
  - Pytorch 2.1.0+cu118
  - Datasets 2.15.0
  - Tokenizers 0.15.0
  model-index:
  - name: Human-Action-Recognition-VIT-Base-patch16-224
  results: []
+ datasets:
+ - Bingsu/Human_Action_Recognition
+ language:
+ - en
+ pipeline_tag: image-classification
  ---

+
 
  # Human-Action-Recognition-VIT-Base-patch16-224
 
+ This model is a fine-tuned version of [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) on the
+ [Bingsu/Human_Action_Recognition](https://huggingface.co/datasets/Bingsu/Human_Action_Recognition) dataset.
  It achieves the following results on the evaluation set:
  - Loss: 0.4005
  - Accuracy: 0.8786
 
  ## Model description
 
+ The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. The model was then fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at a resolution of 224x224.
+
+ Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for use in classification tasks, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
+
+ Through pre-training, the model learns an inner representation of images that can be used to extract features for downstream tasks: given a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. The linear layer is typically placed on top of the [CLS] token, as the last hidden state of that token can be seen as a representation of the entire image.
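As a sketch of how such a classification head sits on the encoder (illustrative only, not part of this model card), the snippet below takes the [CLS] token's last hidden state from the base ViT backbone and feeds it to a linear layer; the untrained head here is a stand-in for the one learned during fine-tuning.

```python
# Illustrative sketch: a linear classification head on top of the ViT [CLS] token.
# Backbone weights are the pretrained google/vit-base-patch16-224 encoder; the head is untrained.
import torch
from transformers import ViTModel

backbone = ViTModel.from_pretrained("google/vit-base-patch16-224")
num_classes = 15  # the 15 human-action categories targeted by this fine-tune
head = torch.nn.Linear(backbone.config.hidden_size, num_classes)

pixel_values = torch.rand(1, 3, 224, 224)      # stand-in for one preprocessed 224x224 RGB image
outputs = backbone(pixel_values=pixel_values)  # the image is split into 16x16 patches internally
cls_state = outputs.last_hidden_state[:, 0]    # last hidden state of the [CLS] token
logits = head(cls_state)                       # shape: (1, num_classes)
```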
  ## Intended uses & limitations
 
+ You can use the model for image classification.
+
+ ### How to use
+
+ Here is how to use this model to classify an image of a human action into one of the following categories:
+ ['calling', 'clapping', 'cycling', 'dancing', 'drinking', 'eating', 'fighting', 'hugging', 'laughing', 'listening_to_music', 'running', 'sitting', 'sleeping', 'texting', 'using_laptop']
+
+ ```python
+ from transformers import pipeline
+ from PIL import Image
+ import requests
+
+ pipe = pipeline("image-classification", "rvv-karma/Human-Action-Recognition-VIT-Base-patch16-224")
+ url = "https://images.pexels.com/photos/175658/pexels-photo-175658.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500"
+ image = Image.open(requests.get(url, stream=True).raw)
+ pipe(image)
+
+ # Output:
+ # [{'score': 0.9918079972267151, 'label': 'dancing'},
+ # {'score': 0.00207977625541389, 'label': 'clapping'},
+ # {'score': 0.0015223610680550337, 'label': 'running'},
+ # {'score': 0.0009153694845736027, 'label': 'fighting'},
+ # {'score': 0.0006987180095165968, 'label': 'sitting'}]
+ ```
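If you prefer finer control than the pipeline, the following sketch (not part of the original card) shows the equivalent lower-level call path; it assumes the checkpoint loads with the standard ViT classes from transformers.

```python
# Sketch: lower-level inference with an image processor and the classification model.
import torch
import requests
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

model_id = "rvv-karma/Human-Action-Recognition-VIT-Base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

url = "https://images.pexels.com/photos/175658/pexels-photo-175658.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")  # resize + normalize to 224x224
with torch.no_grad():
    logits = model(**inputs).logits
predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])  # 'dancing' for the example image above
```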
 
  ## Training and evaluation data
 
 
  - Pytorch 2.1.0+cu118
  - Datasets 2.15.0
  - Tokenizers 0.15.0
+
+
+ ## Fine-tuning script
+
+ [Google Colaboratory Notebook](https://colab.research.google.com/drive/1YELczSv8r0znzcOKJ4Lt-ecP-aNqk7NV?usp=sharing)
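For reference, here is a minimal sketch of the kind of Trainer setup such a fine-tune typically involves (assumptions: the dataset exposes "image" and "labels" columns, and the hyperparameter values shown are placeholders, not the ones from the linked notebook).

```python
# Minimal fine-tuning sketch; see the linked notebook for the actual script and hyperparameters.
import torch
from datasets import load_dataset
from transformers import (Trainer, TrainingArguments,
                          ViTForImageClassification, ViTImageProcessor)

ds = load_dataset("Bingsu/Human_Action_Recognition")
labels = ds["train"].features["labels"].names  # assumes a ClassLabel column named "labels"

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
    ignore_mismatched_sizes=True,  # swap the 1000-class ImageNet head for a 15-class head
)

def collate(examples):
    # Preprocess raw PIL images into pixel_values and attach the integer labels.
    batch = processor([ex["image"].convert("RGB") for ex in examples], return_tensors="pt")
    batch["labels"] = torch.tensor([ex["labels"] for ex in examples])
    return batch

args = TrainingArguments(
    output_dir="har-vit",
    per_device_train_batch_size=32,  # placeholder
    num_train_epochs=3,              # placeholder
    learning_rate=2e-4,              # placeholder
    remove_unused_columns=False,     # keep the raw "image" column for the collator
)

Trainer(model=model, args=args, train_dataset=ds["train"], data_collator=collate).train()
```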