Update README.md
README.md CHANGED
@@ -8,25 +8,57 @@ metrics:
model-index:
- name: Human-Action-Recognition-VIT-Base-patch16-224
  results: []
datasets:
- Bingsu/Human_Action_Recognition
language:
- en
pipeline_tag: image-classification
---

# Human-Action-Recognition-VIT-Base-patch16-224

This model is a fine-tuned version of [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) on the
[Bingsu/Human_Action_Recognition](https://huggingface.co/datasets/Bingsu/Human_Action_Recognition) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.4005
- Accuracy: 0.8786
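For reference, an accuracy figure like the one above can be spot-checked with the `evaluate` library. The snippet below is a sketch, not the original evaluation script: the split and column names (`train`, `image`, `labels`) are assumptions about the dataset layout, and only a small random sample is scored.

```python
# Spot-check sketch (not the original evaluation script). Split and column
# names are assumptions about Bingsu/Human_Action_Recognition's layout.
from datasets import load_dataset
from transformers import pipeline
import evaluate

pipe = pipeline("image-classification", "rvv-karma/Human-Action-Recognition-VIT-Base-patch16-224")
sample = load_dataset("Bingsu/Human_Action_Recognition", split="train").shuffle(seed=0).select(range(200))

label2id = pipe.model.config.label2id  # map predicted label strings back to class ids
preds = [label2id[pipe(img)[0]["label"]] for img in sample["image"]]

accuracy = evaluate.load("accuracy")
print(accuracy.compute(predictions=preds, references=sample["labels"]))
```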
## Model description

The base model, the Vision Transformer (ViT), is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. It was then fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at a resolution of 224x224.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for use in classification tasks, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.

Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. The linear layer is typically placed on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image.
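To make the patch and [CLS] layout concrete, the short sketch below loads this checkpoint and prints the shapes involved; the dummy input is only for illustration.

```python
# Illustrating the patch + [CLS] layout described above with this checkpoint.
import torch
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "rvv-karma/Human-Action-Recognition-VIT-Base-patch16-224"
)
cfg = model.config
print(cfg.image_size, cfg.patch_size)           # 224 16
print((cfg.image_size // cfg.patch_size) ** 2)  # 196 patches per image
print(cfg.num_labels)                           # 15 action classes

# The encoder sees 196 patch embeddings plus one [CLS] token (197 positions);
# the classification head is a linear layer over the final [CLS] hidden state.
dummy = torch.zeros(1, 3, cfg.image_size, cfg.image_size)
with torch.no_grad():
    out = model(pixel_values=dummy, output_hidden_states=True)
print(out.hidden_states[-1].shape)  # torch.Size([1, 197, 768])
print(out.logits.shape)             # torch.Size([1, 15])
```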
## Intended uses & limitations

You can use the model for image classification of human actions.

### How to use

Here is how to use this model to classify an image of a human action into one of the following categories:
`['calling', 'clapping', 'cycling', 'dancing', 'drinking', 'eating', 'fighting', 'hugging', 'laughing', 'listening_to_music', 'running', 'sitting', 'sleeping', 'texting', 'using_laptop']`

```python
from transformers import pipeline
from PIL import Image
import requests

# Load the fine-tuned checkpoint as an image-classification pipeline
pipe = pipeline("image-classification", "rvv-karma/Human-Action-Recognition-VIT-Base-patch16-224")

# Fetch an example image and classify it
url = "https://images.pexels.com/photos/175658/pexels-photo-175658.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500"
image = Image.open(requests.get(url, stream=True).raw)
pipe(image)

# Output:
# [{'score': 0.9918079972267151, 'label': 'dancing'},
#  {'score': 0.00207977625541389, 'label': 'clapping'},
#  {'score': 0.0015223610680550337, 'label': 'running'},
#  {'score': 0.0009153694845736027, 'label': 'fighting'},
#  {'score': 0.0006987180095165968, 'label': 'sitting'}]
```
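The pipeline call above hides the pre- and post-processing; if you need them explicitly, roughly the same prediction can be reproduced with the processor and model classes directly. This is a sketch rather than part of the original card.

```python
# Same prediction without the pipeline wrapper (sketch).
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "rvv-karma/Human-Action-Recognition-VIT-Base-patch16-224"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)

url = "https://images.pexels.com/photos/175658/pexels-photo-175658.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")  # resize, rescale, normalize
with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(dim=-1)[0]
pred = probs.argmax().item()
print(model.config.id2label[pred], round(probs[pred].item(), 4))  # expected: 'dancing' with a high score
```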
## Training and evaluation data

@@ -80,3 +112,8 @@ The following hyperparameters were used during training:
- PyTorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0
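A quick way to compare a local environment against the versions listed above (a small convenience sketch, not part of the original card):

```python
# Print local framework versions for comparison with the card's listed versions.
import torch, datasets, tokenizers, transformers
print("PyTorch:", torch.__version__)
print("Datasets:", datasets.__version__)
print("Tokenizers:", tokenizers.__version__)
print("Transformers:", transformers.__version__)
```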
## Fine-tuning script

[Google Colaboratory Notebook](https://colab.research.google.com/drive/1YELczSv8r0znzcOKJ4Lt-ecP-aNqk7NV?usp=sharing)
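The notebook itself is not reproduced in this card. As a rough orientation, a ViT fine-tuning run on this dataset with the `Trainer` API typically looks like the sketch below; the split handling, column names, and hyperparameters are illustrative placeholders, not the settings used to produce this checkpoint.

```python
# Rough outline of fine-tuning google/vit-base-patch16-224 on
# Bingsu/Human_Action_Recognition. Hyperparameters, split handling, and
# column names are illustrative guesses, not the original settings.
import numpy as np
import torch
import evaluate
from datasets import load_dataset
from transformers import (AutoImageProcessor, AutoModelForImageClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("Bingsu/Human_Action_Recognition", split="train")
dataset = dataset.train_test_split(test_size=0.15, seed=42)
labels = dataset["train"].features["labels"].names  # assumes a ClassLabel column named "labels"

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

def transform(batch):
    # Resize/normalize the raw PIL images and keep the labels alongside them
    enc = processor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    enc["labels"] = batch["labels"]
    return enc

dataset = dataset.with_transform(transform)

def collate(examples):
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "labels": torch.tensor([ex["labels"] for ex in examples]),
    }

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    id2label={i: name for i, name in enumerate(labels)},
    label2id={name: i for i, name in enumerate(labels)},
    ignore_mismatched_sizes=True,  # swap the 1000-class ImageNet head for a 15-class head
)

args = TrainingArguments(
    output_dir="har-vit",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
    remove_unused_columns=False,  # keep the raw "image" column for the transform
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=collate,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```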