---
base_model: microsoft/Florence-2-base-ft
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: visual-question-answering
metrics:
- accuracy
tags:
- deepfake detection
---

# FLODA: FLorence-2 Optimized for Deepfake Assessment

## Model Description

FLODA (FLorence-2 Optimized for Deepfake Assessment) is a deepfake detection model built on a Vision-Language Model (VLM). It is designed to surpass existing deepfake detectors by integrating image captioning and authenticity assessment into a single end-to-end architecture.

## Key Features

- Utilizes Florence-2 as the base VLM for both caption generation and deepfake detection
- Reframes deepfake detection as a Visual Question Answering (VQA) task
- Incorporates image caption information for enhanced contextual understanding
- Employs rsLoRA (rank-stabilized Low-Rank Adaptation) for efficient fine-tuning
- Demonstrates strong generalization across diverse scenarios
- Shows robustness against adversarial attacks

## Model Architecture

FLODA is based on the Florence-2 model and consists of two main components:

1. Vision Encoder: uses DaViT (Dual Attention Vision Transformer)
2. Multi-modality Encoder-Decoder: based on a standard transformer architecture

The model is fine-tuned using rsLoRA with the following configuration (see the sketch after this list):

- Rank (r): 8
- Alpha (α): 8
- Dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, out_proj, lm_head
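
Below is a minimal sketch of how this configuration could be written with the `peft` library; the `use_rslora` flag (available in recent PEFT releases, including the 0.12.0 listed at the end of this card) switches LoRA's scaling factor from α/r to the rank-stabilized α/√r. Everything beyond the values listed above is an assumption.

```python
# Minimal sketch: rsLoRA configuration matching the values above.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True
)

lora_config = LoraConfig(
    r=8,                 # rank
    lora_alpha=8,        # alpha
    lora_dropout=0.05,
    use_rslora=True,     # rank-stabilized scaling: alpha / sqrt(r)
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "lm_head"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```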

## Performance

FLODA achieves state-of-the-art performance in deepfake detection:

- Average accuracy across all evaluation datasets: 97.14%
- Strong performance on both real and fake image datasets
- 100% accuracy on several fake datasets and on all attacked datasets

## Usage

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

# Load the model and processor
model_path = "path/to/floda/model"
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

def detect_deepfake(image_path):
    image = Image.open(image_path).convert("RGB")
    task_prompt = "<DEEPFAKE_DETECTION>"
    text_input = "Is this photo real?"

    inputs = processor(text=task_prompt + text_input, images=image, return_tensors="pt").to("cuda")

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3,
        )

    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    result = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )[task_prompt]

    return "Real" if result.strip().lower() == "yes" else "Fake"

# Example usage
result = detect_deepfake("path/to/image.jpg")
print(f"The image is: {result}")
```
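
Note that this repository is a PEFT adapter trained on top of `microsoft/Florence-2-base-ft` (see `base_model` in the metadata above). The snippet above assumes the weights load directly; if only the LoRA adapter is distributed, attaching it to the base model might look like the following sketch, where `adapter_path` is a placeholder:

```python
# Sketch: loading FLODA as a PEFT adapter on top of base Florence-2.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True
)
adapter_path = "path/to/floda/adapter"  # placeholder for this repo's adapter weights
model = PeftModel.from_pretrained(base, adapter_path).to("cuda").eval()

# The processor is the base model's processor.
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
```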

## Training Data

FLODA was trained on a dataset including (see the illustrative sketch below):

- Real images: MS COCO
- Fake images: generated by SD2 and LaMa
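
The exact training-prompt format is not documented here; purely as an illustration, mirroring the inference prompt from the Usage section, a single VQA-style supervised example might be constructed like this (field names and paths are hypothetical):

```python
# Hypothetical sketch of one VQA-style training example, mirroring the
# inference prompt from the Usage section. Field names are illustrative.
def make_example(image_path: str, is_real: bool) -> dict:
    return {
        "image": image_path,
        "prompt": "<DEEPFAKE_DETECTION>Is this photo real?",
        "answer": "Yes" if is_real else "No",
    }

real_example = make_example("coco/000000139.jpg", is_real=True)       # MS COCO image
fake_example = make_example("sd2/generated_0001.png", is_real=False)  # SD2 output
```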

## Evaluation Data

The model was evaluated on 16 datasets (a per-dataset accuracy sketch follows this list):

- 2 real-image datasets: MS COCO, Flickr30k
- 14 fake-image datasets generated by various models (e.g., SD2, SDXL, DeepFloyd IF, DALLE-2, SGXL)
- Includes datasets with stylized images, inpainting, resolution changes, and face swapping
- Includes adversarial, backdoor, and data-poisoning attack datasets
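
As a rough illustration of how the reported accuracy could be computed per dataset, the sketch below reuses the `detect_deepfake` helper from the Usage section; the directory layout, file extensions, and labels are assumptions:

```python
# Sketch: accuracy over one evaluation dataset, reusing detect_deepfake.
# Assumes every image in image_dir shares the same ground-truth label.
from pathlib import Path

def dataset_accuracy(image_dir: str, label: str) -> float:
    paths = sorted(p for p in Path(image_dir).iterdir() if p.suffix in {".jpg", ".png"})
    correct = sum(detect_deepfake(str(p)) == label for p in paths)
    return correct / len(paths)

# Example (paths are placeholders):
# print(f"MS COCO accuracy: {dataset_accuracy('eval/coco', label='Real'):.2%}")
```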

## Limitations

- Accuracy on the ControlNet dataset (77.07%) is lower than that of some competing models
- Effectiveness on very recent or future AI image-generation techniques not covered by the training or evaluation datasets is uncertain

## Ethical Considerations

While FLODA shows promising results in deepfake detection, it is important to consider:

- The potential for false positives and false negatives, which can have significant consequences depending on the use case
- The need for continuous updating as new image-generation techniques emerge
- Privacy when processing user-submitted images

## Model Card Authors

- Youngho Bae (Hanyang University)
- Gunhui Han (Yonsei University)
- Seunghyeon Park (Yonsei University)

## Model Card Contact

For inquiries about this model card or the FLODA model, please contact:

Youngho Bae (byh711@gmail.com)

### Framework versions

- PEFT 0.12.0