---
library_name: peft
tags:
- trl
- sft
- generated_from_trainer
base_model: llava-hf/llava-1.5-7b-hf
model-index:
- name: vsft-llava-1.5-7b-hf-liveness-trl
  results: []
language:
- en
metrics:
- accuracy
pipeline_tag: image-text-to-text
---

# vsft-llava-1.5-7b-hf-liveness-trl

This model is a fine-tuned version of [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) on a modified `ROSE-Youtu Face Liveness Detection` dataset.

The ROSE-Youtu Face Liveness Detection Database (ROSE-Youtu) consists of 4,225 videos of 25 subjects in total (3,350 videos of 20 subjects are publicly available, 5.45 GB in size). It also includes a new Client-Specific One-Class Domain Adaptation Protocol with an additional 1.25 GB of pre-processed data.

## Model description

### Model details

<b>Model type</b>:
LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

<b>Model date</b>:
The model was trained on April 11, 2024.

## How to use

The model supports multi-image and multi-prompt generation, meaning you can pass multiple images in your prompt. Make sure to follow the correct prompt template (`USER: xxx\nASSISTANT:`) and add the token `<image>` at the location where you want to query the image:

```python
from peft import PeftModel, PeftConfig
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Check device availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the adapter configuration, the processor, and the base model
config = PeftConfig.from_pretrained("firqaaa/vsft-llava-1.5-7b-hf-liveness-trl")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
base_model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    load_in_4bit=True,
    attn_implementation="flash_attention_2",  # requires flash-attn; drop this line if it is not installed
)

# Attach the LoRA adapter. The 4-bit base model is already placed on devices by
# `device_map="auto"`, so no explicit `.to(device)` call is needed (quantized
# models do not support it).
model = PeftModel.from_pretrained(base_model, "firqaaa/vsft-llava-1.5-7b-hf-liveness-trl")

image_path = "/silicone_frames/5/frame7.jpg"
image = Image.open(image_path)

# Annotation prompt, kept verbatim in the single-turn template used with this adapter
prompt = """USER: <image>\nI ask you to be an liveness image annotator expert to determine if an image "Real" or "Spoof".
If an image is a "Spoof" define what kind of attack, is it spoofing attack that used Print(flat), Replay(monitor, laptop), or Mask(paper, crop-paper, silicone)?
If an image is a "Real" or "Normal" return "No Attack".
Whether if an image is "Real" or "Spoof" give an explanation to this.
Return your response using following format :

Real/Spoof :
Attack Type :
Explanation :\nASSISTANT:"""

# Prepare inputs and move them to the model's device
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

# Generate output
output = model.generate(**inputs, max_new_tokens=300)

# Decode and print only the assistant's part of the response
print("Response :")
decoded_output = processor.decode(output[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()
print(decoded_output)

# Response :
# Real/Spoof : Spoof
# Attack Type : Mask (silicone)
# Explanation : The image shows signs of being a spoof attack. The face appears unnaturally smooth and lacks the natural texture and contours of a real human face.
# Additionally, the edges around the face and the overall appearance suggest that a mask made of silicone or a similar material has been used to spoof the image.
```
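
Because the processor accepts lists of prompts and images, several frames can be scored in one batched call. Continuing from the snippet above, here is a minimal sketch that follows the batching pattern documented for the base `llava-hf/llava-1.5-7b-hf` model; the file paths and prompts are illustrative only:

```python
# Hypothetical batched inference: one <image> token per prompt, with images
# passed in the same order as their prompts. Paths are placeholders.
frame_paths = ["frames/frame1.jpg", "frames/frame2.jpg"]
images = [Image.open(p) for p in frame_paths]
prompts = [
    'USER: <image>\nIs this image "Real" or "Spoof"?\nASSISTANT:',
    'USER: <image>\nIs this image "Real" or "Spoof"?\nASSISTANT:',
]

# Padding is needed because tokenized prompts can differ in length.
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=100)

for text in processor.batch_decode(outputs, skip_special_tokens=True):
    print(text.split("ASSISTANT:")[-1].strip())
```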

## Intended uses & limitations

The dataset used is for research purposes only.

## Training and evaluation data

The training data consists of 21,000 images and the evaluation data of 2,500 images, each labeled as fake or real.
The fake images are categorized into several attack types: Print (flat), Replay (monitor, laptop), and Mask (paper, cropped paper, silicone).
For both splits, GPT-4 was used to generate the paired user and assistant responses in a single-turn schema, as sketched below.
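
The exact serialization of these GPT-4-generated pairs is not published with this card. Purely as an illustration, a single-turn training example in the conversational format commonly used for LLaVA fine-tuning might look like the sketch below; the field names, path, and wording are assumptions, not the actual data:

```python
# Hypothetical single-turn example; the real field names, image paths, and
# response wording used for this model are assumed here for illustration.
example = {
    "images": ["frames/frame7.jpg"],  # placeholder path
    "messages": [
        {
            "role": "user",
            "content": '<image>\nDetermine whether this image is "Real" or "Spoof" and explain why.',
        },
        {
            "role": "assistant",
            "content": "Real/Spoof : Spoof\nAttack Type : Mask (silicone)\nExplanation : The skin texture looks unnaturally smooth.",
        },
    ],
}
```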

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a sketch of how they map onto `transformers.TrainingArguments` follows the list):
- learning_rate: 1.4e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP
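
As a minimal sketch, the settings above roughly correspond to the following `transformers.TrainingArguments`; the output directory is illustrative, the Adam betas and epsilon are the library defaults, and the dataset, collator, and LoRA configuration are omitted:

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above; not the exact training script.
training_args = TrainingArguments(
    output_dir="vsft-llava-1.5-7b-hf-liveness-trl",  # illustrative
    learning_rate=1.4e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
    fp16=True,  # mixed precision via native AMP
)
```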

### Training results

### Framework versions

- PEFT 0.10.0
- Transformers 4.40.1
- PyTorch 2.3.0+cu121
- Datasets 2.19.0
- Tokenizers 0.19.1

## Citation

- Haoliang Li, Wen Li, Hong Cao, Shiqi Wang, Feiyue Huang, and Alex C. Kot, *“Unsupervised Domain Adaptation for Face Anti-Spoofing”*, IEEE Transactions on Information Forensics and Security, 2018.
- Zhi Li, Rizhao Cai, Haoliang Li, Kwok-Yan Lam, Yongjian Hu, and Alex C. Kot, *“One-Class Knowledge Distillation for Face Presentation Attack Detection”*, IEEE Transactions on Information Forensics and Security, 2022.