---
base_model: microsoft/Florence-2-base-ft
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: visual-question-answering
metrics:
- accuracy
tags:
- deepfake detection
---
# FLODA: FLorence-2 Optimized for Deepfake Assessment
## Model Description
FLODA (FLorence-2 Optimized for Deepfake Assessment) is a deepfake detection model built on a Vision-Language Model (VLM). It integrates image captioning and authenticity assessment into a single end-to-end architecture, aiming to outperform existing deepfake detectors.
## Key Features
- Utilizes Florence-2 as the base VLM for both caption generation and deepfake detection
- Reframes deepfake detection as a Visual Question Answering (VQA) task
- Incorporates image caption information for enhanced contextual understanding
- Employs rsLoRA (rank-stabilized Low-Rank Adaptation) for efficient fine-tuning
- Demonstrates strong generalization across diverse scenarios
- Shows robustness against adversarial attacks
## Model Architecture
FLODA is based on the Florence-2 model and consists of two main components:
1. Vision Encoder: Uses DaViT (Dual Attention Vision Transformer)
2. Multi-modality Encoder-Decoder: Based on a standard transformer architecture
The model is fine-tuned using rsLoRA, with the following configuration:
- Rank (r): 8
- Alpha (α): 8
- Dropout: 0.05
- Target Modules: q_proj, k_proj, v_proj, out_proj, lm_head
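Rank-stabilized LoRA differs from standard LoRA only in how the low-rank update is scaled: the adapter output is multiplied by α/√r instead of α/r (in the PEFT library this is toggled with `use_rslora=True` in `LoraConfig`). A quick sketch of what that means for the configuration above:

```python
import math

# Rank and alpha from the rsLoRA configuration above
r, alpha = 8, 8

standard_scale = alpha / r           # standard LoRA scaling factor: 1.0
rslora_scale = alpha / math.sqrt(r)  # rank-stabilized scaling: ~2.83

print(f"LoRA scale: {standard_scale}, rsLoRA scale: {rslora_scale:.2f}")
```

The α/√r factor is what keeps learning stable as rank grows: at higher ranks the standard α/r factor shrinks the adapter update too aggressively, while the rank-stabilized variant does not.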
## Performance
FLODA achieves state-of-the-art performance in deepfake detection:
- Average accuracy across all datasets: 97.14%
- Strong performance on both real and fake image datasets
- 100% accuracy on several fake datasets and all attacked datasets
## Usage
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch
# Load the model and processor
model_path = "path/to/floda/model"
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
def detect_deepfake(image_path):
    # Load the image and build the task prompt the model expects
    image = Image.open(image_path).convert("RGB")
    task_prompt = "<DEEPFAKE_DETECTION>"
    text_input = "Is this photo real?"
    inputs = processor(text=task_prompt + text_input, images=image, return_tensors="pt").to("cuda")

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3,
        )

    # Decode and post-process the generated answer for this task
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    result = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )[task_prompt]
    return "Real" if result.lower() == "yes" else "Fake"

# Example usage
result = detect_deepfake("path/to/image.jpg")
print(f"The image is: {result}")
```
## Training Data
FLODA was trained on a dataset including:
- Real images: MS COCO
- Fake images: Generated by SD2 and LaMa
## Evaluation Data
The model was evaluated on 16 datasets:
- 2 real-image datasets: MS COCO, Flickr30k
- 14 fake-image datasets generated by various models (e.g., SD2, SDXL, DeepFloyd IF, DALLE-2, SGXL), including stylized images, inpainting, resolution changes, and face-swapping
- Datasets subjected to adversarial, backdoor, and data-poisoning attacks
## Limitations
- Accuracy on the ControlNet dataset (77.07%) is lower than that of some competing models
- Effectiveness on newer or future image-generation techniques not represented in the training or evaluation data is unverified
## Ethical Considerations
While FLODA shows promising results in deepfake detection, it's important to consider:
- The potential for false positives or negatives, which could have significant implications depending on the use case
- The need for continuous updating as new image generation techniques emerge
- Privacy considerations when processing user-submitted images
## Model Card Authors
- Youngho Bae (Hanyang University)
- Gunhui Han (Yonsei University)
- Seunghyeon Park (Yonsei University)
## Model Card Contact
For inquiries about this model card or the FLODA model, please contact:
Youngho Bae
Email: byh711@gmail.com
### Framework versions
- PEFT 0.12.0