dvf_trained_transferred_aegis

R(2+1)D + mixed-domain supervised-contrastive model for end-to-end AI-generated video detection. A DVF-trained R(2+1)D backbone is transferred to the AEGIS domain while retaining performance on DVF and GenVideo, then evaluated with a projection-space prototype protocol.

  • Backbone: torchvision r2plus1d_18, trained from scratch (not Kinetics-initialised).
  • Projection head: Linear(512 → 512) → ReLU → Linear(512 → 128), L2-normalised output.
  • Classifier head: Linear(512 → 2) (supervised-baseline head, carried in the checkpoint; not used by the prototype protocol).
  • Lineage: DVF-trained backbone → 3-domain mixed SupCon (AEGIS + DVF + GenVideo) → projection-space prototype inference.
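
As a sanity check on the architecture listed above, the parameter count can be reproduced by hand. This is a minimal sketch, assuming torchvision's documented total of 31,505,325 parameters for r2plus1d_18 with its default 400-way head, and assuming that head is removed and replaced by the two heads described here. The helper name linear_params is illustrative and is not part of load_model.py.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Assumption: torchvision's documented r2plus1d_18 total, which includes
# the default 400-way classification layer on top of the 512-d features.
R2PLUS1D_18_TOTAL = 31_505_325

# Assumption: the default 400-way layer is dropped from the backbone.
backbone = R2PLUS1D_18_TOTAL - linear_params(512, 400)

# Heads as listed in this card.
projection_head = linear_params(512, 512) + linear_params(512, 128)
classifier_head = linear_params(512, 2)

total = backbone + projection_head + classifier_head
print(total)
```

Under these assumptions the sum comes to 31,629,471, which matches the count printed by the load_model.py smoke test.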

Files

| File | Role |
|---|---|
| supcon/final_best.pt | Primary artifact: mixed-domain SupCon checkpoint (run6, best epoch 16). |
| base/best_model.pt | DVF-trained base checkpoint the backbone was transferred from (supervised baseline). |
| config.json | Architecture, preprocessing, training and eval-protocol metadata. |
| load_model.py | Minimal CPU loader + load-only smoke test. |

Intended use / out of scope

Research artifact for studying cross-dataset transfer and retention in AI-generated video detection. It is evaluated under a prototype protocol (fixed real/fake prototypes built in projection space from a small labeled support bank); it is not a deployed, calibrated binary classifier. It detects end-to-end AI-generated video; it is not a face-swap deepfake detector. Not intended for, and not validated for, content moderation, legal, or forensic decision-making.

How to load

The checkpoint is a dict with keys epoch, model_state_dict, optimizer_state_dict, best_selection_score, args; load model_state_dict. load_model.py builds the exact module and loads on CPU (strict).

from load_model import load_model, extract_projected_embedding

model = load_model("supcon/final_best.pt", map_location="cpu")  # returned in eval mode
# Input clips: (B, 3, T=24, 224, 224), RGB, pixels in [0,1], NO Kinetics mean/std.
# emb = extract_projected_embedding(model, clips)   # L2-normalised 128-d

Load-only smoke test from the shell:

python load_model.py supcon/final_best.pt
# -> OK, 31629471 params
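
The input convention noted in the loading snippet (channels-first clips, pixels in [0,1], no mean/std normalisation) can be illustrated with a short NumPy sketch. It assumes the frames are already decoded, RGB, and resized to 224×224; decoding, frame sampling, and resizing are out of scope here. The function name frames_to_clip is hypothetical and is not part of load_model.py.

```python
import numpy as np


def frames_to_clip(frames: np.ndarray) -> np.ndarray:
    """Convert (T, H, W, 3) uint8 RGB frames to a (1, 3, T, H, W) float32 clip.

    Pixels are scaled to [0, 1]; no Kinetics mean/std is applied,
    following the convention stated in this card.
    """
    if frames.dtype != np.uint8 or frames.ndim != 4 or frames.shape[-1] != 3:
        raise ValueError("expected uint8 frames with shape (T, H, W, 3)")
    clip = frames.astype(np.float32) / 255.0   # scale to [0, 1]
    clip = np.transpose(clip, (3, 0, 1, 2))    # (T, H, W, C) -> (C, T, H, W)
    return clip[None]                          # add the batch dimension


# Example with synthetic frames: 24 frames of 224x224 RGB.
frames = np.random.default_rng(0).integers(
    0, 256, size=(24, 224, 224, 3), dtype=np.uint8
)
clip = frames_to_clip(frames)
print(clip.shape, clip.dtype)
```

The resulting array can be wrapped with torch.from_numpy before being passed to the model.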

Note on base/best_model.pt: the base uses a Dropout(0.4) → Linear(512 → 2) head (fc.0/fc.1) and has no proj_head. load_model.py targets the SupCon checkpoint; the base is included only to document the transfer lineage.
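
Because the two checkpoints differ in their heads, it can help to inspect a loaded checkpoint before building a module for it. The sketch below works on plain dicts, assumed here to be the result of loading a .pt file on CPU with torch.load. It checks the top-level keys listed in this card and looks for proj_head in the parameter names. The helper names and the example parameter names are hypothetical, and whether the base checkpoint shares the same top-level keys is not stated in this card.

```python
# Top-level keys listed in this card for the SupCon checkpoint.
EXPECTED_KEYS = {
    "epoch",
    "model_state_dict",
    "optimizer_state_dict",
    "best_selection_score",
    "args",
}


def missing_checkpoint_keys(checkpoint: dict) -> list:
    """Return the expected top-level keys that are absent, sorted by name."""
    return sorted(EXPECTED_KEYS - set(checkpoint))


def checkpoint_kind(state_dict: dict) -> str:
    """Guess the checkpoint type from its parameter names.

    Assumption: only the SupCon checkpoint has parameters whose names
    contain 'proj_head'; the base checkpoint has an fc.1 head instead.
    """
    has_projection = any("proj_head" in name for name in state_dict)
    return "supcon" if has_projection else "base"


# Illustrative state dicts; the exact parameter names are assumptions.
supcon_like = {"proj_head.0.weight": None, "proj_head.2.weight": None}
base_like = {"fc.1.weight": None, "fc.1.bias": None}
print(checkpoint_kind(supcon_like), checkpoint_kind(base_like))
```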

Training data

Three datasets are used. Downstream users must independently comply with each dataset's terms and the terms of the underlying generators whose outputs appear in the data (e.g. Sora, KLing, Pika). This obligation is part of why the weights are released under a non-commercial license.

Per-domain split sizes used for this run (records): AEGIS total 436 (train 50 / val 50 / test 336); DVF total 1004 (200 / 200 / 604); GenVideo total 2971 (200 / 200 / 2571).

Training procedure

  1. Base: R(2+1)D r2plus1d_18 trained from scratch as a supervised baseline on DVF (cross-entropy, class weights 1.0 / 1.5, label smoothing 0.2, Dropout(0.4) → Linear head).
  2. Mixed-domain SupCon transfer (run6): initialise from the DVF backbone and fine-tune with supervised contrastive loss across all three domains.

| Hyperparameter | Value |
|---|---|
| Loss | SupConLoss, temperature 0.07 |
| Optimizer | AdamW, lr 5e-5, weight decay 1e-5 |
| Epochs | 20 (best epoch 16; selected on validation) |
| Clip / fps / size | 24 frames @ 24 fps, 224×224, pixels in [0,1], no Kinetics norm |
| Batch size | 24 |
| Unfreeze policy | layer4_all + projection head |
| Projection | hidden 512, out 128 |
| Domain loss weights | AEGIS 0.5 / DVF 0.3 / GenVideo 0.2 |
| Batch sampling | generator-aware quota sampling |
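
For readers unfamiliar with the loss named above, here is a minimal NumPy sketch of a supervised contrastive loss at temperature 0.07. It follows the common formulation in which each anchor's positives are the other samples in the batch with the same label; it is not the training code behind this checkpoint. How the per-domain weights and the generator-aware sampling enter the actual loss is not described in this card, so neither is modelled here.

```python
import numpy as np


def supcon_loss(embeddings, labels, temperature: float = 0.07) -> float:
    """Supervised contrastive loss over one batch (illustrative sketch).

    embeddings: (N, D) array, L2-normalised inside the function.
    labels:     (N,) integer class labels (e.g. 0 = real, 1 = fake).
    Anchors without any positive in the batch are skipped; the result is
    NaN if no anchor has a positive.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    labels = np.asarray(labels)
    n = z.shape[0]

    logits = z @ z.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    not_self = ~np.eye(n, dtype=bool)

    # Log-softmax over all other samples in the batch (self excluded).
    exp_logits = np.exp(logits) * not_self
    log_prob = logits - np.log(exp_logits.sum(axis=1, keepdims=True))

    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0
    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[valid] / n_pos[valid]
    return float(-mean_log_prob_pos.mean())


# Two tight clusters: the loss is low when labels follow the clusters
# and high when they cut across them.
z = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
aligned = supcon_loss(z, [0, 0, 1, 1])
crossed = supcon_loss(z, [0, 1, 0, 1])
print(aligned < crossed)
```

The low temperature sharpens the softmax, so a batch whose same-label embeddings are not close together is penalised heavily.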

Evaluation

Protocol (author-reported). Fixed real/fake prototypes are built as the L2-normalised mean of a labeled support bank in projection space (10 support clips per class per domain, i.e. 10 real + 10 fake each for AEGIS / DVF / GenVideo). Each query clip is scored by score = sim_fake − sim_real and labeled fake when score ≥ 0.0. Produced by build_prototypes_from_support_jsonl.py + prototype_inference_from_saved_prototypes.py.
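
The protocol can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the contents of the two scripts named above: sim is taken to be cosine similarity between the query embedding and each prototype, and the function names are hypothetical. Whether prototypes are pooled across domains or kept per domain is not specified here, so the sketch simply uses whatever support bank it is given.

```python
import numpy as np


def l2_normalize(x, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def build_prototype(support_embeddings) -> np.ndarray:
    """L2-normalised mean of the (normalised) support embeddings."""
    return l2_normalize(l2_normalize(support_embeddings).mean(axis=0))


def prototype_scores(query_embeddings, proto_real, proto_fake) -> np.ndarray:
    """score = sim_fake - sim_real, with cosine similarity assumed."""
    q = l2_normalize(query_embeddings)
    return q @ proto_fake - q @ proto_real


def predict_fake(scores, threshold: float = 0.0) -> np.ndarray:
    """Label a clip fake when its score is at or above the threshold."""
    return np.asarray(scores) >= threshold


# Toy 2-d embeddings standing in for the 128-d projections.
proto_real = build_prototype([[1.0, 0.1], [0.9, 0.0]])
proto_fake = build_prototype([[0.1, 1.0], [0.0, 0.9]])
scores = prototype_scores([[0.9, 0.1], [0.1, 0.9]], proto_real, proto_fake)
print(predict_fake(scores))
```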

| Dataset | AUROC | EER | Accuracy |
|---|---|---|---|
| AEGIS (target) | 0.808 | 0.268 | 0.730 |
| DVF (retain) | 0.847 | 0.225 | 0.767 |
| GenVideo (retain) | 0.817 | 0.255 | 0.748 |

These are author-reported numbers from the fixed-prototype protocol above; they were not re-run in the environment that prepared this card (the support/query manifests and cached clips are not bundled). A separate in-training 20-trial averaged eval exists and is consistent in ranking.
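
For reference, the three metrics reported above can be computed from per-clip prototype scores roughly as follows. This is a generic sketch and not the authors' evaluation code. It assumes labels of 1 for fake and 0 for real, computes AUROC by pairwise comparison (ties count half), and approximates the EER at the threshold where the false-positive and false-negative rates are closest.

```python
import numpy as np


def auroc(scores, labels) -> float:
    """Probability that a random fake clip scores above a random real clip."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


def equal_error_rate(scores, labels) -> float:
    """Error rate at the threshold where FPR and FNR are closest."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    best_gap, best_eer = np.inf, 1.0
    for t in np.append(np.unique(scores), np.inf):
        pred_fake = scores >= t
        fpr = pred_fake[labels == 0].mean()      # real clips flagged as fake
        fnr = (~pred_fake[labels == 1]).mean()   # fake clips missed
        if abs(fpr - fnr) < best_gap:
            best_gap, best_eer = abs(fpr - fnr), (fpr + fnr) / 2
    return float(best_eer)


def accuracy(scores, labels, threshold: float = 0.0) -> float:
    """Accuracy at a fixed threshold (0.0 in the protocol described here)."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    return float(((scores >= threshold) == (labels == 1)).mean())


# Toy scores: one real clip scores above zero and one fake clip below it.
toy_scores = [-0.3, 0.1, -0.1, 0.4]
toy_labels = [0, 0, 1, 1]
print(auroc(toy_scores, toy_labels),
      equal_error_rate(toy_scores, toy_labels),
      accuracy(toy_scores, toy_labels))
```

AUROC and EER depend only on the ranking of scores, while accuracy depends on the fixed threshold, which is why the two can move independently.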

Limitations

  • Dataset scope. Trained/evaluated on DVF, GenVideo, and AEGIS only; generalisation to unseen generators or post-processing is not guaranteed.
  • Prototype-protocol caveat. Support clips are drawn from labeled data of the same datasets, so this measures representation/adaptation quality, not unconditional deployment performance. The fixed score ≥ 0 threshold is not calibrated per deployment.
  • Projection-space separability. Real and fake prototypes have relatively high cosine similarity, which caps separation in the projection space.
  • Scope of detection. End-to-end AI-generated video only; not face-swap deepfakes.

Citation

Thesis (current source); a TVC paper will be added when published.

@misc{dvf_trained_transferred_aegis,
  title  = {TODO},
  author = {TODO},
  year   = {2026},
  note   = {TODO: thesis / TVC paper}
}

License

Weights released under CC BY-NC 4.0 (non-commercial). The training data includes proprietary generator outputs (Sora / KLing / Pika) and web-scraped real video, so a permissive commercial license is not appropriate. Use of this model must also respect the licenses and terms of the underlying datasets and generators.
