AI Detection Model

Model Architecture and Training

Three separate models were initially trained:

Midjourney vs. Real Images
Stable Diffusion vs. Real Images
Stable Diffusion Fine-tunings vs. Real Images

Data preparation process:

Used Google's Open Image Dataset for real images
Described real images using BLIP (Bootstrapping Language-Image Pre-training)
Generated Stable Diffusion images using BLIP descriptions
Found similar Midjourney images based on BLIP descriptions

This approach ensured real and AI-generated images were as similar as possible, differing only in their origin.
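The matching step can be sketched as follows. The model card does not say how similarity between BLIP descriptions and Midjourney images was measured, so this sketch makes two assumptions: that each Midjourney image comes with a text prompt, and that simple word overlap (Jaccard similarity) is an acceptable stand-in for the real similarity measure. All function names and sample captions are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a caption and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two captions, in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_captions(real_captions, generated_prompts):
    """For each real-image caption, pick the most similar generated-image prompt.

    Returns a list of (real_index, generated_index, similarity) tuples.
    Assumption: word overlap stands in for whatever measure was actually used.
    """
    pairs = []
    for i, caption in enumerate(real_captions):
        scores = [jaccard(caption, p) for p in generated_prompts]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((i, best, scores[best]))
    return pairs


# Hypothetical captions (as BLIP might produce) and hypothetical Midjourney prompts.
real_captions = ["a dog running on the beach", "a red car parked on a street"]
midjourney_prompts = [
    "a vintage red car parked on a city street",
    "a golden dog running along the beach at sunset",
]
pairs = match_captions(real_captions, midjourney_prompts)
```

Pairing each real image with its closest generated counterpart keeps the two classes matched on content, so a classifier trained on the pairs has less opportunity to rely on subject matter.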

The three models were then distilled into a single small Vision Transformer (ViT) with 11.8 million parameters, combining their learned features for more efficient detection.
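The model card does not describe the distillation objective. As a rough illustration only, the sketch below assumes standard soft-label distillation: the three teachers' softened probabilities are averaged, and the student is penalized by the KL divergence from that average. The temperature value, the two-class layout (AI vs. real), and all names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_teachers(teacher_logits, temperature=2.0):
    """Average the softened class probabilities of several teachers."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher average || student) on temperature-softened distributions."""
    target = average_teachers(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0)


# Hypothetical logits for one image, ordered [ai, real], from the three teachers.
teachers = [[2.0, -1.0], [1.5, -0.5], [0.5, 0.2]]
untrained_student = [0.0, 0.0]
loss = distillation_loss(untrained_student, teachers)
```

The loss is zero when the student reproduces the averaged teacher distribution and grows as the two diverge. In practice this term would be minimized with a deep-learning framework, often alongside a cross-entropy term on the true labels.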

Data Sources

Google's Open Image Dataset: link
Ivan Sivkov's Midjourney Dataset: link
TANREI(NAMA)'s Stable Diffusion Prompts Dataset: link

Performance

Validation Set: 94% accuracy
- Held out from training data to assess generalization
Custom Real-World Set: 84% accuracy
- Composed of self-captured images and online-sourced images
- Designed to be more representative of internet-based images
Comparative Analysis:
- Outperformed other popular AI detection models by 5 percentage points on both sets
- Other models achieved 89% and 79% on validation and real-world sets respectively
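For readers less familiar with the terminology, the sketch below shows how accuracy and a percentage-point gap are computed. The labels and predictions are toy values invented for illustration; they are not the evaluation data behind the figures in this section.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def percentage_point_gap(acc_a, acc_b):
    """Absolute difference between two accuracies, in percentage points."""
    return (acc_a - acc_b) * 100


# Toy ground truth and two hypothetical detectors.
labels = ["ai", "real", "ai", "real", "ai", "real", "ai", "real", "ai", "real"]
model_a = ["ai", "real", "ai", "real", "ai", "real", "ai", "real", "ai", "ai"]
model_b = ["ai", "real", "ai", "real", "ai", "real", "ai", "real", "real", "ai"]

acc_a = accuracy(model_a, labels)  # 9 of 10 correct
acc_b = accuracy(model_b, labels)  # 8 of 10 correct
gap = percentage_point_gap(acc_a, acc_b)
```

A percentage-point gap is an absolute difference between two accuracies, not a relative improvement: moving from 80% to 90% is a 10-point gain, but a 12.5% relative gain.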

Key Insights

Strong generalization on validation data (94% accuracy)
Good adaptability to diverse, real-world images (84% accuracy)
Consistent outperformance of other popular models
10-point accuracy drop from validation to real-world set indicates room for improvement
Comprehensive training on multiple AI generation techniques contributes to model versatility
Focus on subtle differences in image generation rather than content disparities

Future Directions

Expand dataset with more diverse, real-world examples to bridge the performance gap
Improve generalization to internet-sourced images
Conduct error analysis on misclassified samples to identify patterns
Integrate new AI image generation techniques as they emerge
Consider fine-tuning for specific domains where detection accuracy is critical
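One way to start the error analysis mentioned above is to group misclassified samples by some attribute and compare error rates. The sketch below assumes each evaluation record carries a label, a prediction, and a "source" tag; the record layout, field names, and sample values are hypothetical.

```python
from collections import Counter


def error_rates_by_group(records, group_key):
    """Return the misclassification rate for each value of `group_key`."""
    totals, errors = Counter(), Counter()
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record["predicted"] != record["label"]:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical evaluation records.
records = [
    {"label": "ai", "predicted": "ai", "source": "midjourney"},
    {"label": "ai", "predicted": "real", "source": "midjourney"},
    {"label": "ai", "predicted": "ai", "source": "stable_diffusion"},
    {"label": "real", "predicted": "real", "source": "self_captured"},
    {"label": "real", "predicted": "ai", "source": "online"},
    {"label": "real", "predicted": "ai", "source": "online"},
]
rates = error_rates_by_group(records, "source")
```

Groups with unusually high error rates point to where additional training data or domain-specific fine-tuning may help most.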