dvf_trained_transferred_aegis
R(2+1)D + mixed-domain supervised-contrastive model for end-to-end AI-generated video detection. A DVF-trained R(2+1)D backbone is transferred to the AEGIS domain while retaining performance on DVF and GenVideo, then evaluated with a projection-space prototype protocol.
- Backbone: torchvision r2plus1d_18, trained from scratch (not Kinetics-initialised).
- Projection head: Linear(512 → 512) → ReLU → Linear(512 → 128), L2-normalised output.
- Classifier head: Linear(512 → 2) (supervised-baseline head, carried in the checkpoint; not used by the prototype protocol).
- Lineage: DVF-trained backbone → 3-domain mixed SupCon (AEGIS + DVF + GenVideo) → projection-space prototype inference.
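To make the projection-head shapes concrete, here is a minimal NumPy sketch of the forward pass described above. The weights are random placeholders (not the trained values), and the 512-d input stands in for pooled backbone features; the actual module is the one built by load_model.py.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random placeholder weights with the shapes listed above (NOT the trained values).
W1, b1 = rng.normal(0.0, 0.02, (512, 512)), np.zeros(512)
W2, b2 = rng.normal(0.0, 0.02, (512, 128)), np.zeros(128)

def project(features: np.ndarray) -> np.ndarray:
    """Linear(512 -> 512) -> ReLU -> Linear(512 -> 128), then L2-normalise each row."""
    hidden = np.maximum(features @ W1 + b1, 0.0)
    out = hidden @ W2 + b2
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.clip(norms, 1e-12, None)

features = rng.normal(size=(4, 512))   # stand-in for pooled backbone features
emb = project(features)
print(emb.shape)                                       # (4, 128)
print(np.allclose(np.linalg.norm(emb, axis=1), 1.0))   # True
```

Because every embedding has unit norm, a dot product between two embeddings is their cosine similarity, which is what the prototype protocol relies on.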
Files
| File | Role |
|---|---|
| supcon/final_best.pt | Primary artifact: mixed-domain SupCon checkpoint (run6, best epoch 16). |
| base/best_model.pt | DVF-trained base checkpoint the backbone was transferred from (supervised baseline). |
| config.json | Architecture, preprocessing, training and eval-protocol metadata. |
| load_model.py | Minimal CPU loader + load-only smoke test. |
Intended use / out of scope
Research artifact for studying cross-dataset transfer and retention in AI-generated video detection. It is evaluated under a prototype protocol (fixed real/fake prototypes built from a small labeled support bank, projection space); it is not a deployed, calibrated binary classifier. It detects end-to-end AI-generated video; it is not a face-swap deepfake detector. Not intended for, and not validated for, content moderation, legal, or forensic decision-making.
How to load
The checkpoint is a dict with keys epoch, model_state_dict, optimizer_state_dict, best_selection_score, args; load model_state_dict. load_model.py builds the exact module and loads the weights on CPU with strict key matching.
from load_model import load_model, extract_projected_embedding
model = load_model("supcon/final_best.pt", map_location="cpu") # eval mode
# Input clips: (B, 3, T=24, 224, 224), RGB, pixels in [0,1], NO Kinetics mean/std.
# emb = extract_projected_embedding(model, clips) # L2-normalised 128-d
python load_model.py supcon/final_best.pt
# -> OK, 31629471 params
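The parameter count printed by the smoke test can be partly cross-checked by hand. The sketch below assumes that each Linear layer has a bias and that the two heads contain nothing beyond the layers listed in this card; under those assumptions the remainder is attributable to the backbone.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Projection head: Linear(512 -> 512) -> ReLU -> Linear(512 -> 128); ReLU has no parameters.
proj_head = linear_params(512, 512) + linear_params(512, 128)
# Classifier head: Linear(512 -> 2).
classifier = linear_params(512, 2)
heads = proj_head + classifier

total_reported = 31_629_471              # value printed by the smoke test
backbone_implied = total_reported - heads

print(proj_head, classifier, heads)      # 328320 1026 329346
print(backbone_implied)                  # 31300125
```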
Note on base/best_model.pt: the base uses a Dropout(0.4) → Linear(512 → 2) head (fc.0/fc.1) and has no proj_head. load_model.py targets the SupCon checkpoint; the base is included only to document the transfer lineage.
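The input convention from the loading snippet, (B, 3, T=24, 224, 224) with RGB pixels in [0,1] and no Kinetics mean/std, can be sketched as follows. This is a minimal illustration: frames_to_clip is a hypothetical helper, and frame decoding, temporal sampling, and resizing are assumed to have happened already, since the card does not specify them.

```python
import numpy as np

def frames_to_clip(frames_uint8: np.ndarray) -> np.ndarray:
    """Convert (T, H, W, 3) uint8 RGB frames to a (1, 3, T, H, W) float32 clip in [0, 1]."""
    if frames_uint8.ndim != 4 or frames_uint8.shape[-1] != 3:
        raise ValueError("expected frames shaped (T, H, W, 3)")
    clip = frames_uint8.astype(np.float32) / 255.0   # scale only, no mean/std normalisation
    clip = np.transpose(clip, (3, 0, 1, 2))          # channels first: (3, T, H, W)
    return clip[np.newaxis]                          # add the batch dimension

# Dummy data standing in for 24 decoded frames already resized to 224x224.
frames = np.random.default_rng(0).integers(0, 256, size=(24, 224, 224, 3), dtype=np.uint8)
clip = frames_to_clip(frames)
print(clip.shape, clip.dtype)   # (1, 3, 24, 224, 224) float32
```

The resulting array can be wrapped with torch.from_numpy(clip) before being passed to extract_projected_embedding.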
Training data
Three datasets are used. Downstream users must independently comply with each dataset's terms and the terms of the underlying generators whose outputs appear in the data (e.g. Sora, KLing, Pika). This obligation is part of why the weights are released under a non-commercial license.
- DVF (Diffusion Video Forensics): from MM-Det (NeurIPS 2024). Paper: arXiv:2410.23623 · Code + dataset: github.com/SparkleXFantasy/MM-Det · HF: sparklexfantasy/DVF
- GenVideo / GenVideo-100K: from DeMamba (the GenVideo-100K lightweight version was used). Paper: arXiv:2405.19707 · Code + dataset: github.com/chenhaoxing/DeMamba
- AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences (ACM MM 2025). Paper: arXiv:2508.10771 · HF: Clarifiedfish/AEGIS · ACM DL. Note: the AEGIS HF page does not expose a license tag; confirm its terms from the paper.
Per-domain split sizes used for this run (records): AEGIS total 436 (train 50 / val 50 / test 336); DVF total 1004 (200 / 200 / 604); GenVideo total 2971 (200 / 200 / 2571).
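As a quick consistency check on the split sizes above, the snippet below verifies that each domain's train/val/test counts add up to its stated total and reports the pooled sizes. It uses only the numbers given in this card.

```python
splits = {
    "AEGIS":    {"train": 50,  "val": 50,  "test": 336,  "total": 436},
    "DVF":      {"train": 200, "val": 200, "test": 604,  "total": 1004},
    "GenVideo": {"train": 200, "val": 200, "test": 2571, "total": 2971},
}

for name, s in splits.items():
    # Each domain's three splits should account for every record.
    assert s["train"] + s["val"] + s["test"] == s["total"], name

train_total = sum(s["train"] for s in splits.values())
val_total = sum(s["val"] for s in splits.values())
test_total = sum(s["test"] for s in splits.values())
print(train_total, val_total, test_total)   # 450 450 3511
```

Only 450 records in total are used for training, which is worth keeping in mind when reading the evaluation numbers.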
Training procedure
- Base: R(2+1)D r2plus1d_18 trained from scratch as a supervised baseline on DVF (cross-entropy, class weights 1.0 / 1.5, label smoothing 0.2, Dropout(0.4) → Linear head).
- Mixed-domain SupCon transfer (run6): initialise from the DVF backbone and fine-tune with supervised contrastive loss across all three domains.
| Hyperparameter | Value |
|---|---|
| Loss | SupConLoss, temperature 0.07 |
| Optimizer | AdamW, lr 5e-5, weight decay 1e-5 |
| Epochs | 20 (best epoch 16; selected on validation) |
| Clip / fps / size | 24 frames @ 24 fps, 224×224, pixels in [0,1], no Kinetics norm |
| Batch size | 24 |
| Unfreeze policy | layer4_all + projection head |
| Projection | hidden 512, out 128 |
| Domain loss weights | AEGIS 0.5 / DVF 0.3 / GenVideo 0.2 |
| Batch sampling | generator-aware quota sampling |
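For readers unfamiliar with the loss in the table, the following is a minimal NumPy sketch of a supervised contrastive loss at temperature 0.07, in its common single-view form. It is not the repository's SupConLoss implementation, and it leaves out the domain loss weights and the generator-aware sampling, because the card does not state how those enter the objective.

```python
import numpy as np

def supcon_loss(z: np.ndarray, labels: np.ndarray, temperature: float = 0.07) -> float:
    """Supervised contrastive loss over L2-normalised embeddings z of shape (N, D), N >= 2."""
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    logits = np.where(not_self, (z @ z.T) / temperature, -np.inf)  # exclude self-pairs
    row_max = logits.max(axis=1, keepdims=True)                    # for numerical stability
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom
    positives = (labels[:, None] == labels[None, :]) & not_self    # same label, different clip
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0                                            # skip anchors with no positive
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / n_pos[has_pos]).mean())

# Toy batch: two tight clusters on orthogonal axes.
z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
aligned = supcon_loss(z, np.array([0, 0, 1, 1]))      # labels match the clusters
misaligned = supcon_loss(z, np.array([0, 1, 0, 1]))   # labels cut across the clusters
print(aligned < misaligned)   # True
```

The loss is near zero when same-label embeddings coincide and grows when positives sit far apart, which is the pressure that shapes the projection space used by the prototype protocol.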
Evaluation
Protocol (author-reported). Fixed real/fake prototypes are built as the L2-normalised mean
of a labeled support bank in projection space (10 support clips per class per domain,
i.e. 10 real + 10 fake each for AEGIS / DVF / GenVideo). Each query clip is scored by
score = sim_fake - sim_real and labeled fake when score ≥ 0.0. Produced by
build_prototypes_from_support_jsonl.py + prototype_inference_from_saved_prototypes.py.
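The protocol above can be sketched in a few lines of NumPy. This is an illustration only, not the code of the two scripts named above: the function names are hypothetical, the embeddings are synthetic, sim is taken to be cosine similarity, and the support clips from all three domains are pooled into a single real/fake prototype pair, which is an assumption the card does not confirm.

```python
import numpy as np

def l2_normalise(x: np.ndarray) -> np.ndarray:
    return x / np.clip(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12, None)

def build_prototype(support_embeddings: np.ndarray) -> np.ndarray:
    """L2-normalised mean of one class's support embeddings, shape (D,)."""
    return l2_normalise(support_embeddings.mean(axis=0))

def score_queries(queries, proto_real, proto_fake, threshold=0.0):
    """score = sim_fake - sim_real (cosine similarity); label fake when score >= threshold."""
    q = l2_normalise(queries)
    scores = q @ proto_fake - q @ proto_real
    return scores, scores >= threshold

rng = np.random.default_rng(0)
dim = 128
# Synthetic stand-ins for projected embeddings: 30 per class (10 per domain, pooled).
real_support = l2_normalise(rng.normal(size=(30, dim)) + 2.0 * np.eye(dim)[0])
fake_support = l2_normalise(rng.normal(size=(30, dim)) + 2.0 * np.eye(dim)[1])
proto_real = build_prototype(real_support)
proto_fake = build_prototype(fake_support)

queries = np.stack([proto_real, proto_fake])   # trivially easy queries for the demo
scores, is_fake = score_queries(queries, proto_real, proto_fake)
print(is_fake)                                 # [False  True]
print(float(proto_real @ proto_fake))          # cosine similarity between the two prototypes
```

Scores lie in [-2, 2]. The closer the prototype-to-prototype cosine similarity is to 1, the smaller the achievable score gap between real and fake queries.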
| Dataset | AUROC | EER | Accuracy |
|---|---|---|---|
| AEGIS (target) | 0.808 | 0.268 | 0.730 |
| DVF (retain) | 0.847 | 0.225 | 0.767 |
| GenVideo (retain) | 0.817 | 0.255 | 0.748 |
These are author-reported numbers from the fixed-prototype protocol above; they were not re-run in the environment that prepared this card (the support/query manifests and cached clips are not bundled). A separate in-training 20-trial averaged eval exists and is consistent in ranking.
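For reference, the three metrics in the table can be computed from per-clip scores and labels as sketched below. This is a generic NumPy illustration with toy numbers (label 1 = fake), not the authors' evaluation code; in particular, the EER here is taken at the observed threshold where the two error rates are closest, which may differ slightly from an interpolated EER.

```python
import numpy as np

def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Probability that a random fake clip outscores a random real clip (ties count half)."""
    diff = scores[labels == 1][:, None] - scores[labels == 0][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """Error rate at the observed threshold where FPR and FNR are closest."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    thresholds = np.unique(scores)
    fnr = np.array([(pos < t).mean() for t in thresholds])    # fakes scored below threshold
    fpr = np.array([(neg >= t).mean() for t in thresholds])   # reals scored at or above it
    i = int(np.argmin(np.abs(fpr - fnr)))
    return float((fpr[i] + fnr[i]) / 2)

def accuracy_at(scores: np.ndarray, labels: np.ndarray, threshold: float = 0.0) -> float:
    """Accuracy of the fixed rule: fake when score >= threshold."""
    return float(((scores >= threshold).astype(int) == labels).mean())

# Toy scores (sim_fake - sim_real) and ground-truth labels, 1 = fake.
scores = np.array([-0.4, -0.1, 0.2, 0.3, -0.2, 0.5])
labels = np.array([0, 0, 0, 1, 1, 1])
print(round(auroc(scores, labels), 3))             # 0.778
print(round(equal_error_rate(scores, labels), 3))  # 0.333
print(round(accuracy_at(scores, labels), 3))       # 0.667
```

AUROC and EER depend only on the ranking of scores, while accuracy depends on the fixed 0.0 threshold, so the accuracy column is the one most sensitive to calibration.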
Limitations
- Dataset scope. Trained/evaluated on DVF, GenVideo, and AEGIS only; generalisation to unseen generators or post-processing is not guaranteed.
- Prototype-protocol caveat. Support clips are drawn from labeled data of the same
datasets, so this measures representation/adaptation quality, not unconditional
deployment performance. The fixed score ≥ 0 threshold is not calibrated per deployment.
- Projection-space separability. Real and fake prototypes have relatively high cosine similarity, which caps separation in the projection space.
- Scope of detection. End-to-end AI-generated video only; not face-swap deepfakes.
Citation
Thesis (current source); a TVC paper will be added when published.
@misc{dvf_trained_transferred_aegis,
title = {TODO},
author = {TODO},
year = {2026},
note = {TODO: thesis / TVC paper}
}
License
Weights released under CC BY-NC 4.0 (non-commercial). The training data includes proprietary generator outputs (Sora / KLing / Pika) and web-scraped real video, so a permissive commercial license is not appropriate. Use of this model must also respect the licenses and terms of the underlying datasets and generators.