---
library_name: peft
tags:
- trl
- sft
- generated_from_trainer
base_model: llava-hf/llava-1.5-7b-hf
model-index:
- name: vsft-llava-1.5-7b-hf-liveness-trl
  results: []
language:
- en
metrics:
- accuracy
pipeline_tag: image-text-to-text
---

# vsft-llava-1.5-7b-hf-liveness-trl

This model is a fine-tuned version of [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) on a modified `ROSE-Youtu Face Liveness Detection` dataset.

The ROSE-Youtu Face Liveness Detection Database (ROSE-Youtu) consists of 4,225 videos of 25 subjects in total (3,350 videos of 20 subjects are publicly available, 5.45 GB in size). It also includes a new Client-Specific One-Class Domain Adaptation Protocol with an additional 1.25 GB of pre-processed data.

## Model description

### Model details

<b>Model type</b>:
LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

<b>Model date</b>:
The model was trained on April 11, 2024.

## How to use

The model supports multi-image and multi-prompt generation, meaning you can pass multiple images in your prompt. Make sure to follow the correct prompt template (`USER: xxx\nASSISTANT:`) and add the token `<image>` at the location where you want to query the image:

```python
from peft import PeftModel, PeftConfig
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Check device availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the adapter configuration, the processor, and the base model
config = PeftConfig.from_pretrained("firqaaa/vsft-llava-1.5-7b-hf-liveness-trl")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
base_model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    load_in_4bit=True,
    attn_implementation="flash_attention_2",  # requires flash-attn; drop this line if it is not installed
)

# Attach the LoRA adapter. The 4-bit base model is already placed on devices by
# `device_map="auto"`, so no explicit `.to(device)` call is needed (quantized
# models do not support it).
model = PeftModel.from_pretrained(base_model, "firqaaa/vsft-llava-1.5-7b-hf-liveness-trl")

image_path = "/silicone_frames/5/frame7.jpg"
image = Image.open(image_path)

# Annotation prompt, kept verbatim in the single-turn template used with this adapter
prompt = """USER: <image>\nI ask you to be an liveness image annotator expert to determine if an image "Real" or "Spoof".
If an image is a "Spoof" define what kind of attack, is it spoofing attack that used Print(flat), Replay(monitor, laptop), or Mask(paper, crop-paper, silicone)?
If an image is a "Real" or "Normal" return "No Attack".
Whether if an image is "Real" or "Spoof" give an explanation to this.
Return your response using following format :

Real/Spoof :
Attack Type :
Explanation :\nASSISTANT:"""

# Prepare inputs and move them to the model's device
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

# Generate output
output = model.generate(**inputs, max_new_tokens=300)

# Decode and print only the assistant's part of the response
print("Response :")
decoded_output = processor.decode(output[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()
print(decoded_output)

# Response :
# Real/Spoof : Spoof
# Attack Type : Mask (silicone)
# Explanation : The image shows signs of being a spoof attack. The face appears unnaturally smooth and lacks the natural texture and contours of a real human face.
# Additionally, the edges around the face and the overall appearance suggest that a mask made of silicone or a similar material has been used to spoof the image.
```
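
Because the processor accepts lists of prompts and images, several frames can be scored in one batched call. Continuing from the snippet above, here is a minimal sketch that follows the batching pattern documented for the base `llava-hf/llava-1.5-7b-hf` model; the file paths and prompts are illustrative only:

```python
# Hypothetical batched inference: one <image> token per prompt, with images
# passed in the same order as their prompts. Paths are placeholders.
frame_paths = ["frames/frame1.jpg", "frames/frame2.jpg"]
images = [Image.open(p) for p in frame_paths]
prompts = [
    'USER: <image>\nIs this image "Real" or "Spoof"?\nASSISTANT:',
    'USER: <image>\nIs this image "Real" or "Spoof"?\nASSISTANT:',
]

# Padding is needed because tokenized prompts can differ in length.
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=100)

for text in processor.batch_decode(outputs, skip_special_tokens=True):
    print(text.split("ASSISTANT:")[-1].strip())
```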

## Intended uses & limitations

The dataset used is for research purposes only.

## Training and evaluation data

The training data consists of 21,000 images and the evaluation data of 2,500 images, each labeled as fake or real.
The fake images are categorized into several attack types: Print (flat), Replay (monitor, laptop), and Mask (paper, cropped paper, silicone).
For both splits, GPT-4 was used to generate the paired user and assistant responses in a single-turn schema, as sketched below.
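
The exact serialization of these GPT-4-generated pairs is not published with this card. Purely as an illustration, a single-turn training example in the conversational format commonly used for LLaVA fine-tuning might look like the sketch below; the field names, path, and wording are assumptions, not the actual data:

```python
# Hypothetical single-turn example; the real field names, image paths, and
# response wording used for this model are assumed here for illustration.
example = {
    "images": ["frames/frame7.jpg"],  # placeholder path
    "messages": [
        {
            "role": "user",
            "content": '<image>\nDetermine whether this image is "Real" or "Spoof" and explain why.',
        },
        {
            "role": "assistant",
            "content": "Real/Spoof : Spoof\nAttack Type : Mask (silicone)\nExplanation : The skin texture looks unnaturally smooth.",
        },
    ],
}
```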

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a sketch of how they map onto `transformers.TrainingArguments` follows the list):
- learning_rate: 1.4e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP
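
As a minimal sketch, the settings above roughly correspond to the following `transformers.TrainingArguments`; the output directory is illustrative, the Adam betas and epsilon are the library defaults, and the dataset, collator, and LoRA configuration are omitted:

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above; not the exact training script.
training_args = TrainingArguments(
    output_dir="vsft-llava-1.5-7b-hf-liveness-trl",  # illustrative
    learning_rate=1.4e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
    fp16=True,  # mixed precision via native AMP
)
```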

### Training results

### Framework versions

- PEFT 0.10.0
- Transformers 4.40.1
- PyTorch 2.3.0+cu121
- Datasets 2.19.0
- Tokenizers 0.19.1

## Citation

- Haoliang Li, Wen Li, Hong Cao, Shiqi Wang, Feiyue Huang, and Alex C. Kot, *“Unsupervised Domain Adaptation for Face Anti-Spoofing”*, IEEE Transactions on Information Forensics and Security, 2018.
- Zhi Li, Rizhao Cai, Haoliang Li, Kwok-Yan Lam, Yongjian Hu, and Alex C. Kot, *“One-Class Knowledge Distillation for Face Presentation Attack Detection”*, IEEE Transactions on Information Forensics and Security, 2022.