---
library_name: peft
tags:
- trl
- sft
- generated_from_trainer
base_model: llava-hf/llava-1.5-7b-hf
model-index:
- name: vsft-llava-1.5-7b-hf-liveness
  results: []
language:
- en
metrics:
- accuracy
pipeline_tag: image-text-to-text
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# vsft-llava-1.5-7b-hf-liveness-trl

This model is a fine-tuned version of [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) on a modified version of the `ROSE-Youtu Face Liveness Detection Dataset`.
The ROSE-Youtu Face Liveness Detection Database (ROSE-Youtu) consists of 4,225 videos with 25 subjects in total (3,350 videos with 20 subjects are publicly available, 5.45 GB in size).
It also includes a new Client-Specific One-Class Domain Adaptation Protocol with an additional 1.25 GB of pre-processed data.

## Model description

### Model details

<b>Model type</b>: 
LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.

<b>Model date</b>: 
The model was trained on April 11, 2024.

## How to use

The model supports multi-image and multi-prompt generation, meaning you can pass multiple images in your prompt. Make sure to follow the correct prompt template (`USER: xxx\nASSISTANT:`) and add the `<image>` token at the location where you want the model to look at an image:

```python
from peft import PeftModel
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

# Check device availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the processor and the 4-bit quantized base model.
# 4-bit loading requires the bitsandbytes package; flash_attention_2
# requires flash-attn -- drop that argument if it is not installed.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
base_model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    attn_implementation="flash_attention_2",
)

# Attach the fine-tuned LoRA adapter. device_map="auto" already placed the
# weights, so no explicit model.to(device) is needed (it would raise an
# error on a quantized model).
model = PeftModel.from_pretrained(base_model, "firqaaa/vsft-llava-1.5-7b-hf-liveness-trl")

image_path = "/silicone_frames/5/frame7.jpg"
image = Image.open(image_path)

prompt = """USER: <image>\nI ask you to be a liveness image annotation expert and determine whether an image is "Real" or "Spoof".
If an image is a "Spoof", define what kind of attack it is: a spoofing attack that used Print (flat), Replay (monitor, laptop), or Mask (paper, crop-paper, silicone)?
If an image is "Real" or "Normal", return "No Attack".
Whether an image is "Real" or "Spoof", give an explanation for this.
Return your response using the following format :

Real/Spoof : 
Attack Type :
Explanation :\nASSISTANT:"""

# Prepare inputs and move them to the device
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

# Generate output
output = model.generate(**inputs, max_new_tokens=300)

print("Response :")
# Decode and print the output
decoded_output = processor.decode(output[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()
print(decoded_output)

# Response :
# Real/Spoof : Spoof
# Attack Type : Mask (silicone)
# Explanation : The image shows signs of being a spoof attack. The face appears unnaturally smooth and lacks the natural texture and contours of a real human face.
# Additionally, the edges around the face and the overall appearance suggest that a mask made of silicone or a similar material has been used to spoof the image.
```
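The `USER: xxx\nASSISTANT:` template generalizes to multiple images: insert one `<image>` token per image, and supply the images to the processor in the same order. A minimal sketch of building such a prompt (plain string formatting; the helper name is illustrative and not part of the transformers API):

```python
def build_llava_prompt(question: str, num_images: int = 1) -> str:
    """Build a LLaVA-1.5 prompt with one <image> token per input image.

    Follows the USER: ...\nASSISTANT: template expected by this model;
    the images are passed separately to the processor, in the same order
    as their <image> tokens appear here.
    """
    image_tokens = "".join("<image>\n" for _ in range(num_images))
    return f"USER: {image_tokens}{question}\nASSISTANT:"

# Single-image liveness query, matching the template used above
prompt = build_llava_prompt('Is this face "Real" or "Spoof"?')
print(prompt)
```

The resulting string can be passed to the processor together with a list of images, e.g. `processor(text=prompt, images=[img1, img2], return_tensors="pt")` for a two-image prompt.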

## Intended uses & limitations

The dataset used is for research purposes only.

## Training and evaluation data

The training data consists of 21,000 images, and the evaluation data consists of 2,500 images classified as fake or real. 
The fake images are categorized into several types: Print (flat), Replay (monitor, laptop), and Mask (paper, cropped paper, silicone).
Both the training and evaluation data were created with the help of GPT-4 to generate a pair of user and assistant responses in a single-turn schema.
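The exact storage format of the dataset is not published here; purely as an illustration, a single-turn record of the kind described above might look like the following (the field names and file path are assumptions, not the actual dataset schema):

```python
# Hypothetical single-turn record: one user question about an image and one
# GPT-4-generated assistant answer. Field names are illustrative only.
record = {
    "image": "frames/example_frame.jpg",  # placeholder path
    "messages": [
        {
            "role": "user",
            "content": '<image>\nDetermine whether this image is "Real" or "Spoof".',
        },
        {
            "role": "assistant",
            "content": "Real/Spoof : Spoof\n"
                       "Attack Type : Mask (silicone)\n"
                       "Explanation : The face lacks natural skin texture.",
        },
    ],
}

# Single-turn schema: exactly one user/assistant exchange per record
assert [m["role"] for m in record["messages"]] == ["user", "assistant"]
```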

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1.4e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP

### Training results



### Framework versions

- PEFT 0.10.0
- Transformers 4.40.1
- Pytorch 2.3.0+cu121
- Datasets 2.19.0
- Tokenizers 0.19.1

**Citation**
- Haoliang Li, Wen Li, Hong Cao, Shiqi Wang, Feiyue Huang, and Alex C. Kot, *“Unsupervised Domain Adaptation for Face Anti-Spoofing”*, IEEE Transactions on Information Forensics and Security, 2018.
- Zhi Li, Rizhao Cai, Haoliang Li, Kwok-Yan Lam, Yongjian Hu, and Alex C. Kot, *“One-Class Knowledge Distillation for Face Presentation Attack Detection”*, IEEE Transactions on Information Forensics and Security, 2022.