---
base_model: microsoft/Florence-2-base-ft
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: visual-question-answering
metrics:
- accuracy
tags:
- deepfake detection
---

# FLODA: FLorence-2 Optimized for Deepfake Assessment

## Model Description

FLODA (FLorence-2 Optimized for Deepfake Assessment) is a deepfake detection model built on a Vision-Language Model (VLM). It integrates image captioning and authenticity assessment into a single end-to-end architecture, and is designed to outperform existing deepfake detection models.

## Key Features

- Utilizes Florence-2 as the base VLM for both caption generation and deepfake detection
- Reframes deepfake detection as a Visual Question Answering (VQA) task
- Incorporates image caption information for enhanced contextual understanding
- Employs rsLoRA (rank-stabilized Low-Rank Adaptation) for efficient fine-tuning
- Demonstrates strong generalization across diverse scenarios
- Shows robustness against adversarial attacks

## Model Architecture

FLODA is based on the Florence-2 model and consists of two main components:

1. Vision Encoder: Uses DaViT (Dual Attention Vision Transformer)
2. Multi-modality Encoder-Decoder: Based on a standard transformer architecture

The model is fine-tuned using rsLoRA, with the following configuration:

- Rank (r): 8
- Alpha (α): 8
- Dropout: 0.05
- Target Modules: q_proj, k_proj, v_proj, out_proj, lm_head
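The practical difference between rsLoRA and standard LoRA is the scaling applied to the adapter update. A minimal sketch with FLODA's configuration (r = 8, α = 8), showing the two scaling factors; in PEFT this behavior is enabled via `LoraConfig(..., use_rslora=True)`:

```python
import math

r, alpha = 8, 8  # FLODA's rsLoRA configuration

# Standard LoRA scales the adapter update by alpha / r,
# which shrinks as the rank grows.
standard_scaling = alpha / r  # 1.0

# rsLoRA (rank-stabilized LoRA) scales by alpha / sqrt(r),
# which keeps update magnitudes stable across ranks.
rslora_scaling = alpha / math.sqrt(r)

print(f"standard LoRA scaling: {standard_scaling}")
print(f"rsLoRA scaling:        {rslora_scaling:.3f}")
```

At r = 8 the rsLoRA factor is √8 ≈ 2.83 times larger than the standard one; the gap widens as r increases, which is the motivation for rank stabilization.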

## Performance

FLODA achieves state-of-the-art performance in deepfake detection:

- Average accuracy across all datasets: 97.14%
- Strong performance on both real and fake image datasets
- 100% accuracy on several fake datasets and all attacked datasets

## Usage

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

# Load the model and processor
model_path = "path/to/floda/model"
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

def detect_deepfake(image_path):
    image = Image.open(image_path).convert("RGB")
    task_prompt = "<DEEPFAKE_DETECTION>"
    text_input = "Is this photo real?"

    inputs = processor(text=task_prompt + text_input, images=image, return_tensors="pt").to("cuda")

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3
        )

    # Keep special tokens so post-processing can locate the task token
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    result = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )[task_prompt]

    # The model answers the VQA prompt with "yes" (real) or "no" (fake)
    return "Real" if result.strip().lower() == "yes" else "Fake"

# Example usage
result = detect_deepfake("path/to/image.jpg")
print(f"The image is: {result}")
```

## Training Data

FLODA was trained on a dataset including:
- Real images: MS COCO
- Fake images: Generated by SD2 and LaMa

## Evaluation Data

The model was evaluated on 16 datasets:
- 2 real image datasets: MS COCO, Flickr30k
- 14 fake image datasets generated by various models (e.g., SD2, SDXL, DeepFloyd IF, DALLE-2, SGXL)
- Includes datasets with stylized images, inpainting, resolution changes, and face-swapping
- Adversarial, backdoor, and data poisoning attack datasets

## Limitations

- Performance on the ControlNet dataset (77.07% accuracy) lags behind some competing models
- The model's effectiveness on very recent or future AI-generated image techniques not included in the training or evaluation datasets is uncertain

## Ethical Considerations

While FLODA shows promising results in deepfake detection, it's important to consider:
- The potential for false positives or negatives, which could have significant implications depending on the use case
- The need for continuous updating as new image generation techniques emerge
- Privacy considerations when processing user-submitted images


## Model Card Authors

- Youngho Bae (Hanyang University)
- Gunhui Han (Yonsei University)
- Seunghyeon Park (Yonsei University)

## Model Card Contact

For inquiries about this model card or the FLODA model, please contact:

Youngho Bae
Email: byh711@gmail.com

### Framework versions

- PEFT 0.12.0