---
license: cc-by-nc-4.0
base_model: MCG-NJU/videomae-base
tags:
- generated_from_trainer
- vandalism
- video-classification
- ucf-crime
- vandalism-detection
- videomae
metrics:
- accuracy
model-index:
- name: videomae-base-finetuned-ucfcrime-full2
  results: []
---

# videomae-base-finetuned-ucfcrime-full2

This model is a fine-tuned version of [MCG-NJU/videomae-base](https://huggingface.co/MCG-NJU/videomae-base) on the [UCF-Crime](https://paperswithcode.com/dataset/ucf-crime) dataset.

Code: [github](https://github.com/archit-spec/majorproject)

It achieves the following results on the evaluation set:
- Loss: 2.5014
- Accuracy: 0.225

## Model description

More information needed

## Intended uses & limitations

## Inference with a phone camera (requires the IP Webcam app from the Play Store)

```python
import cv2
import torch
import numpy as np
from transformers import AutoImageProcessor, VideoMAEForVideoClassification

np.random.seed(0)


def preprocess_frames(frames, image_processor):
    # OpenCV delivers BGR frames; convert to RGB before passing them to the processor
    frames = [cv2.cvtColor(f, cv2.COLOR_BGR2RGB) for f in frames]
    inputs = image_processor(frames, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}  # Move tensors to the model's device
    return inputs


# Initialize the video capture object; replace the IP address with your phone's local IP
# (shown in the IP Webcam app)
cap = cv2.VideoCapture('http://192.168.229.98:8080/video')

# Set the frame size (optional)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

image_processor = AutoImageProcessor.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")
model = VideoMAEForVideoClassification.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

frame_buffer = []
buffer_size = 16  # VideoMAE expects 16 frames per clip
previous_labels = []
top_confidences = []

while True:
    ret, frame = cap.read()
    if not ret:
        print("Failed to capture frame")
        break

    # Add the current frame to the buffer
    frame_buffer.append(frame)

    # Run inference once we have enough frames
    if len(frame_buffer) >= buffer_size:
        # Preprocess the frames
        inputs = preprocess_frames(frame_buffer, image_processor)

        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits

        # Get the top 3 predicted labels and their confidence scores
        top_k = 3
        probs = torch.softmax(logits, dim=-1)
        top_probs, top_indices = torch.topk(probs, top_k)
        top_labels = [model.config.id2label[idx.item()] for idx in top_indices[0]]
        top_confidences = top_probs[0].tolist()

        # Only log when the prediction changes
        if top_labels != previous_labels:
            previous_labels = top_labels
            print("Predicted class:", top_labels[0])

        # Clear the frame buffer and continue from the next frame
        frame_buffer.clear()

    # Overlay the predicted labels and confidence scores on the frame
    for i, (label, confidence) in enumerate(zip(previous_labels, top_confidences)):
        label_text = f"{label}: {confidence:.2f}"
        cv2.putText(frame, label_text, (10, 30 + i * 30), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 0, 255), 2)

    # Display the resulting frame
    cv2.imshow('Video', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release everything when done
cap.release()
cv2.destroyAllWindows()
```
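The class names overlaid on the stream come from the checkpoint's `id2label` mapping. If you want to see the full label set before starting the camera loop, here is a minimal sketch (assuming the same checkpoint as above):

```python
from transformers import VideoMAEForVideoClassification

model = VideoMAEForVideoClassification.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")

# Print the index-to-class mapping the classification head was fine-tuned with
for idx, label in sorted(model.config.id2label.items()):
    print(idx, label)
```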
## Simple usage

```python
import av
import torch
import numpy as np

from transformers import AutoImageProcessor, VideoMAEForVideoClassification
from huggingface_hub import hf_hub_download

np.random.seed(0)


def read_video_pyav(container, indices):
    '''
    Decode the video with the PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    '''
    Sample a given number of frame indices from the video.
    Args:
        clip_len (`int`): Total number of frames to sample.
        frame_sample_rate (`int`): Sample every n-th frame.
        seg_len (`int`): Maximum allowed index of sample's last frame.
    Returns:
        indices (`List[int]`): List of sampled frame indices
    '''
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# The demo clip consists of 300 frames (10 seconds at 30 FPS);
# to use any other video, just point `file_path` at it
file_path = hf_hub_download(
    repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
)
container = av.open(file_path)

# Sample 16 frames
indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
video = read_video_pyav(container, indices)

image_processor = AutoImageProcessor.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")
model = VideoMAEForVideoClassification.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")

inputs = image_processor(list(video), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# The model predicts one of the 13 UCF-Crime classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- training_steps: 700

An illustrative `Trainer` setup matching these values is sketched at the end of this card.

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 2.5836        | 0.13  | 88   | 2.4944          | 0.2080   |
| 2.3212        | 1.13  | 176  | 2.5855          | 0.1773   |
| 2.2333        | 2.13  | 264  | 2.6270          | 0.1046   |
| 1.985         | 3.13  | 352  | 2.4058          | 0.2109   |
| 2.194         | 4.13  | 440  | 2.3654          | 0.2235   |
| 1.9796        | 5.13  | 528  | 2.2609          | 0.2235   |
| 1.8786        | 6.13  | 616  | 2.2725          | 0.2341   |
| 1.71          | 7.12  | 700  | 2.2228          | 0.2226   |

### Framework versions

- Transformers 4.38.1
- Pytorch 2.1.2
- Datasets 2.1.0
- Tokenizers 0.15.2
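For reference, the training hyperparameters listed above map onto a standard `Trainer` run roughly as in the sketch below. This is a minimal sketch, not the actual training script (see the GitHub link above): `train_dataset` and `eval_dataset` are placeholders for preprocessed UCF-Crime clip datasets yielding `pixel_values` and `labels`, and the number of labels is assumed to be 13.

```python
import numpy as np
import evaluate
from transformers import Trainer, TrainingArguments, VideoMAEForVideoClassification

# Assumption: 13 UCF-Crime classes, matching the card's description
model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base",
    num_labels=13,
)

# Mirrors the hyperparameters listed above; the Trainer's default optimizer
# (AdamW with betas=(0.9, 0.999) and epsilon=1e-08) matches the values reported in the card
args = TrainingArguments(
    output_dir="videomae-base-finetuned-ucfcrime-full2",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    max_steps=700,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    remove_unused_columns=False,
)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: preprocessed 16-frame UCF-Crime clips
    eval_dataset=eval_dataset,    # placeholder: held-out clips behind the validation metrics above
    compute_metrics=compute_metrics,
)
trainer.train()
```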