---
base_model: microsoft/Florence-2-base-ft
library_name: peft
license: apache-2.0
language:
- en
pipeline_tag: visual-question-answering
metrics:
- accuracy
tags:
- deepfake detection
---
# FLODA: FLorence-2 Optimized for Deepfake Assessment
## Model Description
FLODA (FLorence-2 Optimized for Deepfake Assessment) is a deepfake detection model built on a Vision-Language Model (VLM). It integrates image captioning and authenticity assessment into a single end-to-end architecture, aiming to outperform existing deepfake detectors.
## Key Features
- Utilizes Florence-2 as the base VLM for both caption generation and deepfake detection
- Reframes deepfake detection as a Visual Question Answering (VQA) task
- Incorporates image caption information for enhanced contextual understanding
- Employs rsLoRA (rank-stabilized Low-Rank Adaptation) for efficient fine-tuning
- Demonstrates strong generalization across diverse scenarios
- Shows robustness against adversarial attacks
## Model Architecture
FLODA is based on the Florence-2 model and consists of two main components:
1. Vision Encoder: Uses DaViT (Dual Attention Vision Transformer)
2. Multi-modality Encoder-Decoder: Based on a standard transformer architecture
The model is fine-tuned using rsLoRA, with the following configuration:
- Rank (r): 8
- Alpha (α): 8
- Dropout: 0.05
- Target Modules: q_proj, k_proj, v_proj, out_proj, lm_head
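Rank-stabilized LoRA differs from standard LoRA only in how the low-rank update is scaled: the adapter output is multiplied by α/√r instead of α/r (in the PEFT library this is toggled with `use_rslora=True` in `LoraConfig`). A quick sketch of what that means for the configuration above:

```python
import math

# Rank and alpha from the rsLoRA configuration above
r, alpha = 8, 8

standard_scale = alpha / r           # standard LoRA scaling factor: 1.0
rslora_scale = alpha / math.sqrt(r)  # rank-stabilized scaling: ~2.83

print(f"LoRA scale: {standard_scale}, rsLoRA scale: {rslora_scale:.2f}")
```

The α/√r factor is what keeps learning stable as rank grows: at higher ranks the standard α/r factor shrinks the adapter update too aggressively, while the rank-stabilized variant does not.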
## Performance
FLODA achieves state-of-the-art performance in deepfake detection:
- Average accuracy across all datasets: 97.14%
- Strong performance on both real and fake image datasets
- 100% accuracy on several fake datasets and all attacked datasets
## Usage
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch
# Load the model and processor
model_path = "path/to/floda/model"
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
def detect_deepfake(image_path):
    # Load the image and build the task prompt the model expects
    image = Image.open(image_path).convert("RGB")
    task_prompt = "<DEEPFAKE_DETECTION>"
    text_input = "Is this photo real?"
    inputs = processor(text=task_prompt + text_input, images=image, return_tensors="pt").to("cuda")

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3,
        )

    # Decode and post-process the generated answer for this task
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    result = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )[task_prompt]
    return "Real" if result.lower() == "yes" else "Fake"

# Example usage
result = detect_deepfake("path/to/image.jpg")
print(f"The image is: {result}")
```
## Training Data
FLODA was trained on a dataset including:
- Real images: MS COCO
- Fake images: Generated by SD2 and LaMa
## Evaluation Data
The model was evaluated on 16 datasets:
- 2 real-image datasets: MS COCO, Flickr30k
- 14 fake-image datasets generated by various models (e.g., SD2, SDXL, DeepFloyd IF, DALLE-2, SGXL), including stylized images, inpainting, resolution changes, and face-swapping
- Datasets subjected to adversarial, backdoor, and data-poisoning attacks
## Limitations
- Accuracy on the ControlNet dataset (77.07%) is lower than that of some competing models
- Effectiveness on newer or future image-generation techniques not represented in the training or evaluation data is unverified
## Ethical Considerations
While FLODA shows promising results in deepfake detection, it's important to consider:
- The potential for false positives or negatives, which could have significant implications depending on the use case
- The need for continuous updating as new image generation techniques emerge
- Privacy considerations when processing user-submitted images
## Model Card Authors
- Youngho Bae (Hanyang University)
- Gunhui Han (Yonsei University)
- Seunghyeon Park (Yonsei University)
## Model Card Contact
For inquiries about this model card or the FLODA model, please contact:
Youngho Bae
Email: byh711@gmail.com
### Framework versions
- PEFT 0.12.0