---
license: cc-by-nc-4.0
base_model: MCG-NJU/videomae-base
tags:
- generated_from_trainer
- vandalism
- video-classification
- ucf-crime
- vandalism-detection
- videomae
metrics:
- accuracy
model-index:
- name: videomae-base-finetuned-ucfcrime-full2
  results: []
---
# videomae-base-finetuned-ucfcrime-full2
This model is a fine-tuned version of [MCG-NJU/videomae-base](https://huggingface.co/MCG-NJU/videomae-base) on the UCF-Crime dataset. It achieves the following results on the evaluation set:
- Loss: 2.5014
- Accuracy: 0.225
## Model description
More information needed
## Intended uses & limitations
Usage:
```python
import av
import torch
import numpy as np

from transformers import AutoImageProcessor, VideoMAEForVideoClassification
from huggingface_hub import hf_hub_download

np.random.seed(0)


def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.

    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    '''
    Sample a given number of frame indices from the video.

    Args:
        clip_len (`int`): Total number of frames to sample.
        frame_sample_rate (`int`): Sample every n-th frame.
        seg_len (`int`): Maximum allowed index of sample's last frame.

    Returns:
        indices (`List[int]`): List of sampled frame indices.
    '''
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# video clip consists of 300 frames (10 seconds at 30 FPS)
file_path = hf_hub_download(
    repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
)
container = av.open(file_path)

# sample 16 frames
indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
video = read_video_pyav(container, indices)

# use the full repo id (including the user namespace) when loading from the Hub
image_processor = AutoImageProcessor.from_pretrained("videomae-base-finetuned-ucfcrime-full2")
model = VideoMAEForVideoClassification.from_pretrained("videomae-base-finetuned-ucfcrime-full2")

inputs = image_processor(list(video), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# model predicts one of the 13 UCF-Crime classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```
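If you want class confidences rather than just the top prediction, the logits can be turned into probabilities with a softmax. A minimal sketch, reusing the `logits` and `model` variables from the snippet above:

```python
import torch

# convert raw logits to class probabilities
# (assumes `logits` and `model` from the usage snippet above are in scope)
probs = torch.softmax(logits, dim=-1)[0]

# show the top-3 classes with their scores
top = torch.topk(probs, k=3)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {score:.3f}")
```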
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- training_steps: 700
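For reference, a minimal sketch of how these settings map onto `transformers.TrainingArguments`. This is a reconstruction from the list above, not the original training script; in particular, `output_dir` is an assumed placeholder:

```python
from transformers import TrainingArguments

# hypothetical reconstruction of the training configuration from the
# hyperparameters listed above; output_dir is an assumed placeholder
training_args = TrainingArguments(
    output_dir="videomae-base-finetuned-ucfcrime-full2",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    max_steps=700,      # "training_steps" above
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```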
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 2.5836        | 0.13  | 88   | 2.4944          | 0.2080   |
| 2.3212        | 1.13  | 176  | 2.5855          | 0.1773   |
| 2.2333        | 2.13  | 264  | 2.6270          | 0.1046   |
| 1.985         | 3.13  | 352  | 2.4058          | 0.2109   |
| 2.194         | 4.13  | 440  | 2.3654          | 0.2235   |
| 1.9796        | 5.13  | 528  | 2.2609          | 0.2235   |
| 1.8786        | 6.13  | 616  | 2.2725          | 0.2341   |
| 1.71          | 7.12  | 700  | 2.2228          | 0.2226   |
### Framework versions
- Transformers 4.38.1
- Pytorch 2.1.2
- Datasets 2.1.0
- Tokenizers 0.15.2