Update README.md
README.md CHANGED
@@ -8,25 +8,57 @@ metrics:
model-index:
- name: Human-Action-Recognition-VIT-Base-patch16-224
  results: []
datasets:
- Bingsu/Human_Action_Recognition
language:
- en
pipeline_tag: image-classification
---

# Human-Action-Recognition-VIT-Base-patch16-224

This model is a fine-tuned version of [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) on the
[Bingsu/Human_Action_Recognition](https://huggingface.co/datasets/Bingsu/Human_Action_Recognition) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.4005
- Accuracy: 0.8786
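For reference, an accuracy figure like the one above can be spot-checked with the `evaluate` library. The snippet below is a sketch, not the original evaluation script: the split and column names (`train`, `image`, `labels`) are assumptions about the dataset layout, and only a small random sample is scored.

```python
# Spot-check sketch (not the original evaluation script). Split and column
# names are assumptions about Bingsu/Human_Action_Recognition's layout.
from datasets import load_dataset
from transformers import pipeline
import evaluate

pipe = pipeline("image-classification", "rvv-karma/Human-Action-Recognition-VIT-Base-patch16-224")
sample = load_dataset("Bingsu/Human_Action_Recognition", split="train").shuffle(seed=0).select(range(200))

label2id = pipe.model.config.label2id  # map predicted label strings back to class ids
preds = [label2id[pipe(img)[0]["label"]] for img in sample["image"]]

accuracy = evaluate.load("accuracy")
print(accuracy.compute(predictions=preds, references=sample["labels"]))
```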
## Model description

The base model, the Vision Transformer (ViT), is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. It was then fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at a resolution of 224x224.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for use in classification tasks, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.

Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. The linear layer is typically placed on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image.
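To make the patch and [CLS] layout concrete, the short sketch below loads this checkpoint and prints the shapes involved; the dummy input is only for illustration.

```python
# Illustrating the patch + [CLS] layout described above with this checkpoint.
import torch
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "rvv-karma/Human-Action-Recognition-VIT-Base-patch16-224"
)
cfg = model.config
print(cfg.image_size, cfg.patch_size)           # 224 16
print((cfg.image_size // cfg.patch_size) ** 2)  # 196 patches per image
print(cfg.num_labels)                           # 15 action classes

# The encoder sees 196 patch embeddings plus one [CLS] token (197 positions);
# the classification head is a linear layer over the final [CLS] hidden state.
dummy = torch.zeros(1, 3, cfg.image_size, cfg.image_size)
with torch.no_grad():
    out = model(pixel_values=dummy, output_hidden_states=True)
print(out.hidden_states[-1].shape)  # torch.Size([1, 197, 768])
print(out.logits.shape)             # torch.Size([1, 15])
```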
## Intended uses & limitations

You can use the model for image classification of human actions.

### How to use

Here is how to use this model to classify an image of a human action into one of the following categories:
`['calling', 'clapping', 'cycling', 'dancing', 'drinking', 'eating', 'fighting', 'hugging', 'laughing', 'listening_to_music', 'running', 'sitting', 'sleeping', 'texting', 'using_laptop']`

```python
from transformers import pipeline
from PIL import Image
import requests

# Load the fine-tuned checkpoint as an image-classification pipeline
pipe = pipeline("image-classification", "rvv-karma/Human-Action-Recognition-VIT-Base-patch16-224")

# Fetch an example image and classify it
url = "https://images.pexels.com/photos/175658/pexels-photo-175658.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500"
image = Image.open(requests.get(url, stream=True).raw)
pipe(image)

# Output:
# [{'score': 0.9918079972267151, 'label': 'dancing'},
#  {'score': 0.00207977625541389, 'label': 'clapping'},
#  {'score': 0.0015223610680550337, 'label': 'running'},
#  {'score': 0.0009153694845736027, 'label': 'fighting'},
#  {'score': 0.0006987180095165968, 'label': 'sitting'}]
```
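The pipeline call above hides the pre- and post-processing; if you need them explicitly, roughly the same prediction can be reproduced with the processor and model classes directly. This is a sketch rather than part of the original card.

```python
# Same prediction without the pipeline wrapper (sketch).
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "rvv-karma/Human-Action-Recognition-VIT-Base-patch16-224"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)

url = "https://images.pexels.com/photos/175658/pexels-photo-175658.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")  # resize, rescale, normalize
with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(dim=-1)[0]
pred = probs.argmax().item()
print(model.config.id2label[pred], round(probs[pred].item(), 4))  # expected: 'dancing' with a high score
```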
## Training and evaluation data

@@ -80,3 +112,8 @@ The following hyperparameters were used during training:
- PyTorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0
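A quick way to compare a local environment against the versions listed above (a small convenience sketch, not part of the original card):

```python
# Print local framework versions for comparison with the card's listed versions.
import torch, datasets, tokenizers, transformers
print("PyTorch:", torch.__version__)
print("Datasets:", datasets.__version__)
print("Tokenizers:", tokenizers.__version__)
print("Transformers:", transformers.__version__)
```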
## Fine-tuning script

[Google Colaboratory Notebook](https://colab.research.google.com/drive/1YELczSv8r0znzcOKJ4Lt-ecP-aNqk7NV?usp=sharing)
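The notebook itself is not reproduced in this card. As a rough orientation, a ViT fine-tuning run on this dataset with the `Trainer` API typically looks like the sketch below; the split handling, column names, and hyperparameters are illustrative placeholders, not the settings used to produce this checkpoint.

```python
# Rough outline of fine-tuning google/vit-base-patch16-224 on
# Bingsu/Human_Action_Recognition. Hyperparameters, split handling, and
# column names are illustrative guesses, not the original settings.
import numpy as np
import torch
import evaluate
from datasets import load_dataset
from transformers import (AutoImageProcessor, AutoModelForImageClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("Bingsu/Human_Action_Recognition", split="train")
dataset = dataset.train_test_split(test_size=0.15, seed=42)
labels = dataset["train"].features["labels"].names  # assumes a ClassLabel column named "labels"

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

def transform(batch):
    # Resize/normalize the raw PIL images and keep the labels alongside them
    enc = processor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    enc["labels"] = batch["labels"]
    return enc

dataset = dataset.with_transform(transform)

def collate(examples):
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "labels": torch.tensor([ex["labels"] for ex in examples]),
    }

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    id2label={i: name for i, name in enumerate(labels)},
    label2id={name: i for i, name in enumerate(labels)},
    ignore_mismatched_sizes=True,  # swap the 1000-class ImageNet head for a 15-class head
)

args = TrainingArguments(
    output_dir="har-vit",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
    remove_unused_columns=False,  # keep the raw "image" column for the transform
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=collate,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```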