LianJC/DFF-InstructBLIP-Detection

@@ -1,36 +1,40 @@
 ---
 license: bsd-3-clause
-library_name: pytorch
-pipeline_tag: image-classification
 tags:
-- facial-forgery-detection
-- multi-label-classification
-- vit
-- deepfake
 - acl-2026
 ---
-# Face-ViT: Multi-Label Facial Forgery Region Classifier
 ## 📖 Model Description
-This is the **Face-ViT** auxiliary perception module proposed in the ACL 2026 paper:
 *"Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline"*.
-Face-ViT is a multi-label classifier based on the **ViT-H/14** architecture. It is specifically trained to recognize 21 different types of facial manipulations (e.g., eye modification, skin smoothing, mouth tampering). In the DFF framework, it provides fine-grained visual cues that guide the large language model to generate accurate forensic explanations.
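
A minimal sketch of how multi-label outputs like Face-ViT's could be decoded, assuming the classifier emits one logit per manipulation type and each type is scored independently with a sigmoid. The label names, the 0.5 threshold, and the `decode_multilabel` helper are hypothetical; the released model's actual post-processing may differ.

```python
import math

# Hypothetical placeholder names: the card only states that there are 21 types.
LABELS = [f"manipulation_{i}" for i in range(21)]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_multilabel(logits, labels=LABELS, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score meets the threshold.

    Each class is scored independently, so zero, one, or many labels can fire.
    """
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    probs = [sigmoid(z) for z in logits]
    return [(name, p) for name, p in zip(labels, probs) if p >= threshold]


# Example: 21 logits with two classes clearly above the decision boundary.
example_logits = [-4.0] * 21
example_logits[2] = 3.0
example_logits[7] = 1.5
predicted = decode_multilabel(example_logits)
```

Because the classes are not mutually exclusive, a softmax would be the wrong choice here; per-class sigmoids let several manipulated regions be flagged on the same face.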
 ## 🛠️ Model Details
-- **Architecture**: ViT-H/14 with an additional CNN branch and max-pooling for multi-label support.
-- **Input Size**: 224x224 RGB images.
-- **Number of Classes**: 21 (Facial attributes/manipulation types).
-- **Training Objective**: Joint loss combining BCE, focal, Dice, and Jaccard terms.
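
The training objective listed above can be sketched in plain Python as follows. This is an illustration only: the equal weights, the focal `gamma`, the smoothing constant, and all function names are assumptions, since the card does not say how the four terms are weighted or parameterized.

```python
import math

EPS = 1e-7  # clamp probabilities away from 0 and 1 before taking logs


def bce_loss(probs, targets):
    """Mean binary cross-entropy over the classes."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def focal_loss(probs, targets, gamma=2.0):
    """Mean focal loss: cross-entropy down-weighted on easy examples."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, EPS), 1.0 - EPS)
        p_t = p if t == 1 else 1.0 - p  # probability assigned to the true label
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|), with additive smoothing."""
    inter = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * inter + smooth) / (sum(probs) + sum(targets) + smooth)


def jaccard_loss(probs, targets, smooth=1.0):
    """Soft Jaccard (IoU) loss: 1 - |P∩T| / |P∪T|, with additive smoothing."""
    inter = sum(p * t for p, t in zip(probs, targets))
    union = sum(probs) + sum(targets) - inter
    return 1.0 - (inter + smooth) / (union + smooth)


def joint_loss(probs, targets, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; equal weights are an assumption."""
    terms = (
        bce_loss(probs, targets),
        focal_loss(probs, targets),
        dice_loss(probs, targets),
        jaccard_loss(probs, targets),
    )
    return sum(w * term for w, term in zip(weights, terms))
```

BCE and focal loss act per class, while Dice and Jaccard measure overlap across the whole label set, which is one plausible reason to combine them when positive labels are sparse.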
 ## 🚀 Links
 - **Official Code**: [Generating-Attribution-Reports](https://github.com/JingchunLian/Generating-Attribution-Reports)
-- **Main Framework (DFF)**: [LianJC/DFF-InstructBLIP-Detection](https://huggingface.co/LianJC/DFF-InstructBLIP-Detection)
 - **Dataset (MMTT)**: [LianJC/MMTT-Dataset](https://huggingface.co/datasets/LianJC/MMTT-Dataset)
 ## 📜 Citation
-If you find this model useful, please cite:
 ```bibtex
 @inproceedings{lian2026generating,
   title={Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline},

 ---
 license: bsd-3-clause
+library_name: lavis
+pipeline_tag: visual-question-answering
 tags:
+- explainable-ai
+- deepfake-detection
+- vlm
+- instructblip
+- forensic-explanation
 - acl-2026
 ---
+# DFF: InstructBLIP-based Explainable DeepFake Detection
 ## 📖 Model Description
+This is the core **DFF (DeepFake Detection and Forensic Explanation Framework)** model as described in the ACL 2026 paper:
 *"Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline"*.
+DFF is built on the **InstructBLIP (Flan-T5 XL)** architecture. According to the authors, integrating the Face-ViT auxiliary classifier yields state-of-the-art performance in both **forgery localization (mask generation)** and **forensic explanation (captioning)**.
+## 🌟 Key Capabilities
+1. **Forgery Localization**: Generates high-resolution binary masks highlighting manipulated facial regions.
+2. **Natural Language Explanation**: Produces detailed text describing why a specific image is considered a forgery (e.g., "The texture around the eyes is unnatural due to GAN-based blending").
 ## 🛠️ Model Details
+- **Base LLM**: Flan-T5 XL.
+- **Visual Encoder**: EVA-ViT-G.
+- **Auxiliary Module**: Face-ViT (Multi-label perception).
+- **Task**: Explainable Detection & Multi-modal Attribution Reporting.
 ## 🚀 Links
 - **Official Code**: [Generating-Attribution-Reports](https://github.com/JingchunLian/Generating-Attribution-Reports)
+- **Auxiliary Classifier**: [LianJC/Face-ViT-MultiLabel](https://huggingface.co/LianJC/Face-ViT-MultiLabel)
 - **Dataset (MMTT)**: [LianJC/MMTT-Dataset](https://huggingface.co/datasets/LianJC/MMTT-Dataset)
 ## 📜 Citation
 ```bibtex
 @inproceedings{lian2026generating,
   title={Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline},