|
--- |
|
license: cc-by-nc-4.0 |
|
base_model: MCG-NJU/videomae-base |
|
tags: |
|
- generated_from_trainer |
|
- vandalism |
|
- video-classification |
|
- ucf-crime |
|
- vandalism-detection
|
- videomae |
|
metrics: |
|
- accuracy |
|
model-index: |
|
- name: videomae-base-finetuned-ucfcrime-full2 |
|
results: [] |
|
--- |
|
|
|
|
|
|
# videomae-base-finetuned-ucfcrime-full2 |
|
|
|
This model is a fine-tuned version of [MCG-NJU/videomae-base](https://huggingface.co/MCG-NJU/videomae-base) on the [UCF-Crime](https://paperswithcode.com/dataset/ucf-crime)
dataset. Training code: [GitHub](https://github.com/archit-spec/majorproject).
|
It achieves the following results on the evaluation set: |
|
- Loss: 2.5014 |
|
- Accuracy: 0.225 |
|
|
|
## Model description |
|
|
|
This checkpoint adapts VideoMAE (a masked-autoencoder video transformer) for video classification over the 13 UCF-Crime anomaly classes, including vandalism. It takes a 16-frame clip as input and outputs one logit per class.
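
To see the exact label set, you can inspect the checkpoint's config (same model id as in the snippets below):

```python
from transformers import VideoMAEForVideoClassification

model = VideoMAEForVideoClassification.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full")
print(model.config.id2label)  # index -> UCF-Crime class name
```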
|
|
|
## Intended uses & limitations

This checkpoint is intended for experimentation with anomaly/vandalism detection on surveillance-style video. Given its evaluation accuracy of roughly 0.22 over 13 classes, it should be treated as a research demo and not relied on for real-world surveillance or safety-critical decisions.

## Inference with a phone camera (requires the IP Webcam app from the Play Store)
|
```python |
|
import cv2 |
|
import torch |
|
import numpy as np |
|
from transformers import AutoImageProcessor, VideoMAEForVideoClassification |
|
|
|
np.random.seed(0) |
|
|
|
def preprocess_frames(frames, image_processor):
    # Note: relies on the global `device` defined below; frames must be RGB
    inputs = image_processor(frames, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}  # Move tensors to the model's device
|
return inputs |
|
|
|
# Initialize the video capture object; replace the IP address below with your phone's local IP (shown in the IP Webcam app)
|
cap = cv2.VideoCapture('http://192.168.229.98:8080/video') |
|
|
|
# Set the frame size (optional) |
|
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640) |
|
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480) |
|
|
|
image_processor = AutoImageProcessor.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full") |
|
model = VideoMAEForVideoClassification.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full") |
|
|
|
# Move the model to GPU |
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
model = model.to(device) |
|
|
|
frame_buffer = [] |
|
buffer_size = 16 |
|
previous_labels = [] |
|
top_confidences = [] # Initialize top_confidences |
|
|
|
while True: |
|
ret, frame = cap.read() |
|
|
|
if not ret: |
|
print("Failed to capture frame") |
|
break |
|
|
|
    # Add the current frame to the buffer, converting BGR (OpenCV) to RGB (model input)
    frame_buffer.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
|
|
|
# Check if we have enough frames for inference |
|
if len(frame_buffer) >= buffer_size: |
|
# Preprocess the frames |
|
inputs = preprocess_frames(frame_buffer, image_processor) |
|
|
|
with torch.no_grad(): |
|
outputs = model(**inputs) |
|
logits = outputs.logits |
|
|
|
# Get the top 3 predicted labels and their confidence scores |
|
top_k = 3 |
|
probs = torch.softmax(logits, dim=-1) |
|
top_probs, top_indices = torch.topk(probs, top_k) |
|
top_labels = [model.config.id2label[idx.item()] for idx in top_indices[0]] |
|
top_confidences = top_probs[0].tolist() # Update top_confidences |
|
|
|
# Check if the predicted labels are different from the previous labels |
|
if top_labels != previous_labels: |
|
previous_labels = top_labels |
|
print("Predicted class:", top_labels[0]) # Print the predicted class for debugging |
|
|
|
# Clear the frame buffer and continue from the next frame |
|
frame_buffer.clear() |
|
|
|
# Display the predicted labels and confidence scores on the frame |
|
for i, (label, confidence) in enumerate(zip(previous_labels, top_confidences)): |
|
label_text = f"{label}: {confidence:.2f}" |
|
cv2.putText(frame, label_text, (10, 30 + i * 30), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 0, 255), 2) |
|
|
|
# Display the resulting frame |
|
cv2.imshow('Video', frame) |
|
|
|
if cv2.waitKey(1) & 0xFF == ord('q'): |
|
break |
|
|
|
# Release everything when done |
|
cap.release() |
|
cv2.destroyAllWindows() |
|
``` |
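
Note that `buffer_size` is 16 because the VideoMAE checkpoint expects 16-frame clips; the image processor handles resizing and normalization. The loop prints a prediction only when the top-3 labels change, which keeps the console output readable.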
|
## Simple usage

The snippet below follows the standard VideoMAE inference recipe: sample 16 frames from a video with PyAV, preprocess them, and take the argmax over the logits.
|
```python |
|
import av |
|
import torch |
|
import numpy as np |
|
|
|
from transformers import AutoImageProcessor, VideoMAEForVideoClassification |
|
from huggingface_hub import hf_hub_download |
|
|
|
np.random.seed(0) |
|
|
|
|
|
def read_video_pyav(container, indices): |
|
''' |
|
Decode the video with PyAV decoder. |
|
Args: |
|
container (`av.container.input.InputContainer`): PyAV container. |
|
indices (`List[int]`): List of frame indices to decode. |
|
Returns: |
|
result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3). |
|
''' |
|
frames = [] |
|
container.seek(0) |
|
start_index = indices[0] |
|
end_index = indices[-1] |
|
for i, frame in enumerate(container.decode(video=0)): |
|
if i > end_index: |
|
break |
|
if i >= start_index and i in indices: |
|
frames.append(frame) |
|
return np.stack([x.to_ndarray(format="rgb24") for x in frames]) |
|
|
|
|
|
def sample_frame_indices(clip_len, frame_sample_rate, seg_len): |
|
''' |
|
Sample a given number of frame indices from the video. |
|
Args: |
|
clip_len (`int`): Total number of frames to sample. |
|
frame_sample_rate (`int`): Sample every n-th frame. |
|
seg_len (`int`): Maximum allowed index of sample's last frame. |
|
Returns: |
|
indices (`List[int]`): List of sampled frame indices |
|
''' |
|
converted_len = int(clip_len * frame_sample_rate) |
|
end_idx = np.random.randint(converted_len, seg_len) |
|
start_idx = end_idx - converted_len |
|
indices = np.linspace(start_idx, end_idx, num=clip_len) |
|
indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64) |
|
return indices |
|
|
|
|
|
# video clip consists of 300 frames (10 seconds at 30 FPS) |
|
file_path = hf_hub_download( |
|
repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset" |
|
) |
|
# use any other video just replace `file_path` with the video path |
|
container = av.open(file_path) |
|
|
|
# sample 16 frames |
|
indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames) |
|
video = read_video_pyav(container, indices) |
|
|
|
image_processor = AutoImageProcessor.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full") |
|
model = VideoMAEForVideoClassification.from_pretrained("archit11/videomae-base-finetuned-ucfcrime-full") |
|
|
|
inputs = image_processor(list(video), return_tensors="pt") |
|
|
|
with torch.no_grad(): |
|
outputs = model(**inputs) |
|
logits = outputs.logits |
|
|
|
# model predicts one of the 13 UCF-Crime classes
|
predicted_label = logits.argmax(-1).item() |
|
print(model.config.id2label[predicted_label]) |
|
``` |
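
For a quicker check without writing the sampling code yourself, the `video-classification` pipeline wraps the same steps (it needs a video decoding backend such as `decord` installed); `file_path` here reuses the variable from the snippet above:

```python
from transformers import pipeline

# Quick top-3 prediction on a video file; assumes decord is installed
classifier = pipeline("video-classification", model="archit11/videomae-base-finetuned-ucfcrime-full")
print(classifier(file_path, top_k=3))
```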
|
|
|
|
## Training and evaluation data |
|
|
|
The model was fine-tuned and evaluated on the [UCF-Crime](https://paperswithcode.com/dataset/ucf-crime) dataset, which contains real-world surveillance footage labeled with 13 anomaly classes. The data preparation and training code is in the [GitHub repository](https://github.com/archit-spec/majorproject).
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
|
- learning_rate: 5e-05 |
|
- train_batch_size: 8 |
|
- eval_batch_size: 8 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_ratio: 0.1 |
|
- training_steps: 700 |
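
For reference, here is a minimal sketch of how the listed hyperparameters map onto `TrainingArguments`; the `output_dir` and any settings not listed above (e.g. evaluation cadence) are assumptions, and the actual training script is in the linked GitHub repo:

```python
from transformers import TrainingArguments

# Illustrative reconstruction of the hyperparameters listed above.
# output_dir is assumed; evaluation/save cadence is not specified in the card.
training_args = TrainingArguments(
    output_dir="videomae-base-finetuned-ucfcrime-full2",  # assumed
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    max_steps=700,                 # training_steps: 700
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    adam_beta1=0.9,                # Adam betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```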
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Accuracy | |
|
|:-------------:|:-----:|:----:|:---------------:|:--------:| |
|
| 2.5836 | 0.13 | 88 | 2.4944 | 0.2080 | |
|
| 2.3212 | 1.13 | 176 | 2.5855 | 0.1773 | |
|
| 2.2333 | 2.13 | 264 | 2.6270 | 0.1046 | |
|
| 1.985 | 3.13 | 352 | 2.4058 | 0.2109 | |
|
| 2.194 | 4.13 | 440 | 2.3654 | 0.2235 | |
|
| 1.9796 | 5.13 | 528 | 2.2609 | 0.2235 | |
|
| 1.8786 | 6.13 | 616 | 2.2725 | 0.2341 | |
|
| 1.71 | 7.12 | 700 | 2.2228 | 0.2226 | |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.38.1 |
|
- Pytorch 2.1.2 |
|
- Datasets 2.1.0 |
|
- Tokenizers 0.15.2 |