vsft-llava-1.5-7b-hf-liveness-trl

This model is a fine-tuned version of llava-hf/llava-1.5-7b-hf on a modified version of the ROSE-Youtu Face Liveness Detection Dataset. The ROSE-Youtu Face Liveness Detection Database (ROSE-Youtu) consists of 4,225 videos from 25 subjects in total (3,350 videos from 20 subjects are publicly available, totaling 5.45 GB). It also includes a new Client-Specific One-Class Domain Adaptation Protocol with an additional 1.25 GB of pre-processed data.

Model description

Model details

Model type: LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.

Model date: The model was trained on April 11, 2024.

How to use

The model supports multi-image and multi-prompt generation, meaning you can pass multiple images in your prompt. Make sure to follow the correct prompt template (USER: xxx\nASSISTANT:) and add the token <image> at the location where you want the model to attend to an image:

from peft import PeftModel, PeftConfig
import torch
from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Check device availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model configurations
config = PeftConfig.from_pretrained("firqaaa/vsft-llava-1.5-7b-hf-liveness-trl")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
base_model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf",
                                                           torch_dtype=torch.float16,
                                                           low_cpu_mem_usage=True,
                                                           device_map="auto",
                                                           load_in_4bit=True,
                                                           attn_implementation="flash_attention_2")

# Note: device_map="auto" already places the weights on the available device,
# and .to() is not supported for 4-bit quantized models, so no manual move is needed
model = PeftModel.from_pretrained(base_model, "firqaaa/vsft-llava-1.5-7b-hf-liveness-trl")

image_path = "/silicone_frames/5/frame7.jpg"
image = Image.open(image_path)

prompt = """USER: <image>\nI ask you to be an liveness image annotator expert to determine if an image "Real" or "Spoof". 
If an image is a "Spoof" define what kind of attack, is it spoofing attack that used Print(flat), Replay(monitor, laptop), or Mask(paper, crop-paper, silicone)?
If an image is a "Real" or "Normal" return "No Attack". 
Whether if an image is "Real" or "Spoof" give an explanation to this.
Return your response using following format :

Real/Spoof : 
Attack Type :
Explanation :\nASSISTANT:"""

# Prepare inputs and move to device
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch.float16)

# Generate output
output = model.generate(**inputs, max_new_tokens=300)

print("Response :")
# Decode and print the output
decoded_output = processor.decode(output[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()
print(decoded_output)

# Response :
# Real/Spoof : Spoof
# Attack Type : Mask (silicone)
# Explanation : The image shows signs of being a spoof attack. The face appears unnaturally smooth and lacks the natural texture and contours of a real human face.
# Additionally, the edges around the face and the overall appearance suggest that a mask made of silicone or a similar material has been used to spoof the image.
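
Because the processor also accepts lists, several frames can be scored in one batched call. A minimal sketch, reusing the model, processor, and prompt loaded above (the two frame paths here are hypothetical):

# Reuse the liveness prompt for a batch of frames (hypothetical paths)
prompts = [prompt, prompt]
images = [Image.open("frame_a.jpg"), Image.open("frame_b.jpg")]

# padding=True lets prompts of different lengths share one batch
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(device, torch.float16)
outputs = model.generate(**inputs, max_new_tokens=300)

for seq in outputs:
    decoded = processor.decode(seq, skip_special_tokens=True)
    print(decoded.split("ASSISTANT:")[-1].strip())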

Intended uses & limitations

The dataset used is for research purposes only.

Training and evaluation data

The training data consists of 21,000 images, and the evaluation data consists of 2,500 images classified as fake or real. The fake images are categorized into several types: Print (flat), Replay (monitor, laptop), and Mask (paper, cropped paper, silicone). Both the training and evaluation data were created with the help of GPT-4 to generate a pair of user and assistant responses in a single-turn schema.
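
For illustration, a single training record in this single-turn schema might look like the following (a hypothetical sketch; the field names and exact wording are assumptions, not the dataset's actual format):

example = {
    "images": ["silicone_frames/5/frame7.jpg"],  # one frame per record
    "messages": [                                # single-turn user/assistant pair
        {"role": "user", "content": '<image>\nI ask you to act as a liveness image annotator expert ...'},
        {"role": "assistant", "content": "Real/Spoof : Spoof\nAttack Type : Mask (silicone)\nExplanation : ..."},
    ],
}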

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto transformers.TrainingArguments follows the list):

  • learning_rate: 1.4e-05
  • train_batch_size: 1
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1.0
  • mixed_precision_training: Native AMP
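
These settings correspond roughly to the following transformers.TrainingArguments (a minimal sketch, not the exact training script; the Adam betas and epsilon above match the Trainer's default AdamW optimizer, and the output path is an assumption):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vsft-llava-1.5-7b-hf-liveness-trl",  # assumed output path
    learning_rate=1.4e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
    fp16=True,  # "Native AMP" mixed-precision training
)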

Training results

Framework versions

  • PEFT 0.10.0
  • Transformers 4.40.1
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.0
  • Tokenizers 0.19.1

Citation

  • Haoliang Li, Wen Li, Hong Cao, Shiqi Wang, Feiyue Huang, and Alex C. Kot, “Unsupervised Domain Adaptation for Face Anti-Spoofing”, IEEE Transactions on Information Forensics and Security, 2018.
  • Zhi Li, Rizhao Cai, Haoliang Li, Kwok-Yan Lam, Yongjian Hu, and Alex C. Kot, “One-Class Knowledge Distillation for Face Presentation Attack Detection”, IEEE Transactions on Information Forensics and Security, 2022.