---
license: cc-by-nc-4.0
base_model: MCG-NJU/videomae-base
tags:
- generated_from_trainer
- vandalism
- video-classification
- ucf-crime
- vandalism-detection
- videomae
metrics:
- accuracy
model-index:
- name: videomae-base-finetuned-ucfcrime-full2
results: []
---
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->
# videomae-base-finetuned-ucfcrime-full2
This model is a fine-tuned version of [MCG-NJU/videomae-base](https://huggingface.co/MCG-NJU/videomae-base) on the [UCF-CRIME](https://paperswithcode.com/dataset/ucf-crime) dataset.
It achieves the following results on the evaluation set:
- Loss: 2.5014
- Accuracy: 0.225
## Model description
More information needed
## Intended uses & limitations
Usage:
```python
import av
import torch
import numpy as np

from transformers import AutoImageProcessor, VideoMAEForVideoClassification
from huggingface_hub import hf_hub_download

np.random.seed(0)


def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    '''
    Sample a given number of frame indices from the video.
    Args:
        clip_len (`int`): Total number of frames to sample.
        frame_sample_rate (`int`): Sample every n-th frame.
        seg_len (`int`): Maximum allowed index of sample's last frame.
    Returns:
        indices (`List[int]`): List of sampled frame indices
    '''
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# video clip consists of 300 frames (10 seconds at 30 FPS)
file_path = hf_hub_download(
    repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
)
container = av.open(file_path)

# sample 16 frames
indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
video = read_video_pyav(container, indices)

image_processor = AutoImageProcessor.from_pretrained("videomae-base-finetuned-ucfcrime-full")
model = VideoMAEForVideoClassification.from_pretrained("videomae-base-finetuned-ucfcrime-full")

inputs = image_processor(list(video), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# model predicts one of the 13 ucf-crime classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```
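The snippet above prints only the single most likely class. Because the reported accuracy is low, it may help to look at the probability spread across several classes. The following is a minimal sketch in plain Python; the helper name `top_k_labels`, the example logits, and the label map are hypothetical and are not taken from this model's configuration.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs from raw logits."""
    # Numerically stable softmax: subtract the max logit before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and label map, for illustration only.
example_logits = [0.2, 1.5, -0.3, 0.9]
example_id2label = {0: "Abuse", 1: "Vandalism", 2: "Arson", 3: "Normal"}

for label, prob in top_k_labels(example_logits, example_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the model loaded as in the usage example, the same helper could be called as `top_k_labels(logits[0].tolist(), model.config.id2label)`.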
## Training and evaluation data
More information needed
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- training_steps: 700
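To make the scheduler settings concrete, here is a rough sketch of a linear schedule with warmup using the values listed above. It assumes the usual shape of such a schedule (linear ramp from 0 to the peak rate, then linear decay to 0) and rounds the warmup length to 70 steps; the function name `linear_lr` is hypothetical, and the exact rounding used during training is not confirmed by this card.

```python
def linear_lr(step, base_lr=5e-05, total_steps=700, warmup_ratio=0.1):
    """Learning rate at a given optimizer step for linear warmup + linear decay."""
    # Assumption: warmup length is total_steps * warmup_ratio, rounded (70 here).
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at total_steps.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Learning rate at the evaluation steps reported in the results table.
for step in (88, 176, 264, 352, 440, 528, 616, 700):
    print(step, linear_lr(step))
```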
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 2.5836 | 0.13 | 88 | 2.4944 | 0.2080 |
| 2.3212 | 1.13 | 176 | 2.5855 | 0.1773 |
| 2.2333 | 2.13 | 264 | 2.6270 | 0.1046 |
| 1.985 | 3.13 | 352 | 2.4058 | 0.2109 |
| 2.194 | 4.13 | 440 | 2.3654 | 0.2235 |
| 1.9796 | 5.13 | 528 | 2.2609 | 0.2235 |
| 1.8786 | 6.13 | 616 | 2.2725 | 0.2341 |
| 1.71 | 7.12 | 700 | 2.2228 | 0.2226 |
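The best step differs depending on the metric, which matters when choosing a checkpoint. A small sketch that reads the rows above and picks the step with the highest accuracy and the step with the lowest validation loss; the tuple layout and variable names are illustrative only.

```python
# (step, training_loss, validation_loss, accuracy), copied from the table above.
rows = [
    (88, 2.5836, 2.4944, 0.2080),
    (176, 2.3212, 2.5855, 0.1773),
    (264, 2.2333, 2.6270, 0.1046),
    (352, 1.985, 2.4058, 0.2109),
    (440, 2.194, 2.3654, 0.2235),
    (528, 1.9796, 2.2609, 0.2235),
    (616, 1.8786, 2.2725, 0.2341),
    (700, 1.71, 2.2228, 0.2226),
]

# Highest validation accuracy and lowest validation loss may not coincide.
best_by_accuracy = max(rows, key=lambda r: r[3])
best_by_loss = min(rows, key=lambda r: r[2])

print("best accuracy at step", best_by_accuracy[0], "->", best_by_accuracy[3])
print("lowest validation loss at step", best_by_loss[0], "->", best_by_loss[2])
```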
### Framework versions
- Transformers 4.38.1
- PyTorch 2.1.2
- Datasets 2.1.0
- Tokenizers 0.15.2