Phoneme-Based Audio-Visual Face-Forgery Detector

This repository contains the trained weights for the detector described in "On Phoneme-Based Audio-Visual Face Forgery Detection". The detector is an unweighted-average ensemble of three models that operate on phoneme-aligned articulatory, frequency, and noise-residual cues.

Code, feature extraction, and the inference CLI: https://github.com/vlaght/dslp-ensemble-study

License / usage. The weights are released under CC-BY-NC-4.0 (non-commercial). Use is also subject to the licenses of the source datasets (FakeAVCeleb, DeepSpeak v2, TalkVid-Bench), so the weights are for research use only.

Components

Model     Architecture / Input features
DSLP      Architecture: dual-stream phoneme-aligned LSTM (2-layer, hidden 256) + learned phoneme embedding (dim 8), mean-pooled, MLP fusion; Focal loss
          Input features: 58 pruned visual+audio features → 144 visual / 88 audio dims; 139 phoneme classes
TAFreq    Architecture: 2-layer BiLSTM (hidden 128) + soft attention pooling; BCE
          Input features: 36 frequency-domain features (14 DCT + 22 STFT) → 216 dims
TANoise   Architecture: same as TAFreq
          Input features: 19 noise-residual features (Laplacian / DoG / quadrant) → 114 dims
Ensemble  Architecture: unweighted mean of the three sigmoid probabilities
          Input features: n/a
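
The ensemble rule is simple enough to sketch directly. The snippet below is a minimal illustration, not the repository's code: the function names are hypothetical, and treating a score of exactly 0.5 as fake is an assumption (this card only states that the decision is made at 0.5).

```python
from statistics import fmean


def ensemble_p_fake(p_dslp: float, p_tafreq: float, p_tanoise: float) -> float:
    """Unweighted mean of the three component sigmoid probabilities."""
    probs = (p_dslp, p_tafreq, p_tanoise)
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return fmean(probs)


def is_fake(p_fake: float, threshold: float = 0.5) -> bool:
    # Assumption: a score exactly at the threshold is labelled fake (>=).
    return p_fake >= threshold
```

For example, component scores of 0.9, 0.6, and 0.3 average to 0.6, which falls on the fake side of the 0.5 threshold.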

Per-video feature expansion: delta features (×2) plus temporal statistics (std for DSLP; mean and std for TAFreq/TANoise). Output: P(fake) ∈ [0, 1], with the decision threshold at 0.5.
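
The input dimensions listed under Components can be sanity-checked against this expansion. The sketch below shows one reading that reproduces the listed counts: each base feature is kept alongside its temporal statistics, and the delta step doubles the result. That composition is an assumption made for illustration; the feature-extraction code in the repository is authoritative.

```python
def expanded_dim(n_features: int, n_stats: int, delta_factor: int = 2) -> int:
    """Dimension count under the assumed expansion.

    Assumption: each base feature contributes itself plus `n_stats`
    temporal statistics, and the delta step multiplies the total by 2.
    """
    return n_features * (1 + n_stats) * delta_factor


dslp_dim = expanded_dim(58, n_stats=1)     # std only
tafreq_dim = expanded_dim(36, n_stats=2)   # mean + std
tanoise_dim = expanded_dim(19, n_stats=2)  # mean + std

print(dslp_dim, tafreq_dim, tanoise_dim)   # 232 216 114
assert dslp_dim == 144 + 88                # visual + audio dims for DSLP
```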

Results

10-fold stratified cross-validation.
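
For readers unfamiliar with the protocol, the following is a minimal pure-Python sketch of stratified fold assignment. It is illustrative only: the function name, the seed, and the round-robin assignment are assumptions, and the paper's evaluation code may differ (for instance by using a library splitter).

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Return a fold index per sample, keeping class ratios similar across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of


# Toy labels (0 = real, 1 = fake); real data would come from the datasets.
labels = [0] * 30 + [1] * 70
fold_of = stratified_folds(labels)
for k in range(10):
    test_idx = [i for i, f in enumerate(fold_of) if f == k]
    train_idx = [i for i, f in enumerate(fold_of) if f != k]
    # Train on train_idx, evaluate on test_idx, then average metrics over folds.
```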

Ensemble

Dataset                       AUC     F1      Accuracy
FakeAVCeleb (21,544 videos)   0.9593  0.9599  0.9248
DeepSpeak v2 (16,465 videos)  0.9984  0.9783  0.9810
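
As a reminder of what the three columns measure, here is a small pure-Python sketch of the metrics. It assumes that fake is the positive class (label 1) and that F1 and accuracy are taken at the 0.5 threshold; neither detail is stated in this card, and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_and_accuracy(labels, scores, threshold=0.5):
    """F1 for the positive class and overall accuracy at a fixed threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return f1, correct / len(labels)
```

The pairwise AUC above is quadratic in the number of samples, which is fine for illustration but slow at dataset scale; a rank-based implementation is the usual choice there.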

Per-component AUC

Model     FakeAVCeleb  DeepSpeak v2
DSLP      0.8606       0.9579
TAFreq    0.9674       0.9954
TANoise   0.8100       0.9356
Ensemble  0.9593       0.9984

FakeAVCeleb is harder because many of its forgeries leave mouth motion largely intact.

Files

dslp.pth / dslp_artifacts.pkl        DSLP weights + scaler, phoneme encoder, feature cols, visual/audio split
tafreq.pth / tafreq_artifacts.pkl    TAFreq weights + scaler, col means, input dim
tanoise.pth / tanoise_artifacts.pkl  TANoise weights + scaler, col means, input dim
ensemble_manifest.json               dataset, video count, phoneme count, ensemble rule
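
Before running inference, it can help to confirm that all seven files are present. The helper below is a hypothetical convenience, not part of the repository; it only checks file names listed above and does not open or validate their contents.

```python
from pathlib import Path

# File names as listed in this section.
EXPECTED_FILES = (
    "dslp.pth",
    "dslp_artifacts.pkl",
    "tafreq.pth",
    "tafreq_artifacts.pkl",
    "tanoise.pth",
    "tanoise_artifacts.pkl",
    "ensemble_manifest.json",
)


def missing_files(model_dir):
    """Return the expected file names that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files("trained/final")` returns an empty list when the directory is complete.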

Training data

The released weights were trained on all datasets combined (--dataset ALL): FakeAVCeleb v1.2, an augmented set of authentic YouTube clips, TalkVid-Bench, and DeepSpeak v2, totalling 24,767 videos with complete phoneme, frequency, and noise features. Training used the full set with a 90/10 train/validation split for early stopping. There is no held-out test set; cross-validated results are reported in the paper.
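
The split-and-stop procedure can be sketched as follows. This is a generic illustration under stated assumptions: the patience value, the use of validation loss as the monitored quantity, the random (non-stratified) split, and all names are hypothetical and not taken from the training code.

```python
import random


def split_train_val(n_samples, val_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out `val_fraction` of them for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


class EarlyStopper:
    """Stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):  # patience value is an assumption
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


train_idx, val_idx = split_train_val(24_767)
print(len(train_idx), len(val_idx))  # 22290 2477
```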

Usage

See the repository for the CLI. With these files in trained/final/:

python cli/classify_ensemble.py --video path/to/video.mp4   # full detector
python cli/classify_dslp.py     --video path/to/video.mp4   # single component
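
To score a folder of clips, the CLI can be wrapped in a short loop. The sketch below uses only the documented `--video` flag and assumes it is run from the repository root; the helper names are hypothetical, and the CLI's output is returned as raw text because its format is not described in this card.

```python
import subprocess
import sys
from pathlib import Path


def build_command(video, script="cli/classify_ensemble.py"):
    """Command line for one video, using only the documented --video flag."""
    return [sys.executable, script, "--video", str(video)]


def classify_folder(folder, pattern="*.mp4"):
    """Yield (video path, exit code, raw stdout) for each matching clip."""
    for video in sorted(Path(folder).glob(pattern)):
        result = subprocess.run(
            build_command(video), capture_output=True, text=True
        )
        yield video, result.returncode, result.stdout
```

A non-zero exit code is worth checking per clip, since silent or no-face videos fail feature extraction.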

Live feature extraction uses MediaPipe FaceLandmarker and a frozen wav2vec 2.0 phoneme recogniser (facebook/wav2vec2-lv-60-espeak-cv-ft), both fetched automatically.

Limitations

  • Trained on the listed datasets; generalisation to unseen generators or domains is not guaranteed.
  • Needs a visible speaking face and audible speech for phoneme alignment; feature extraction fails on silent or no-face clips.
  • Research artifact, not a production forensic tool.

Citation

@misc{boiko_phoneme_av_2026,
  title  = {On Phoneme-Based Audio-Visual Face Forgery Detection},
  author = {Boiko, Vladislav},
  year   = {2026}
}

Datasets

This model was trained on the datasets below. If you use these weights, please cite them (their licenses require attribution; use is research-only / non-commercial):

@inproceedings{khalid_fakeavceleb_2021,
  title     = {FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset},
  author    = {Khalid, Hasam and Tariq, Shahroz and Kim, Minha and Woo, Simon S.},
  booktitle = {Thirty-fifth Conference on Neural Information Processing Systems
               Datasets and Benchmarks Track (Round 2)},
  year      = {2021},
  url       = {https://openreview.net/forum?id=TAXFsg6ZaOl}
}

@misc{barrington_deepspeak_2025,
  title     = {The DeepSpeak Dataset},
  author    = {Barrington, Sarah and Bohacek, Matyas and Farid, Hany},
  year      = {2025},
  publisher = {arXiv},
  doi       = {10.48550/arXiv.2408.05366},
  url       = {http://arxiv.org/abs/2408.05366}
}

@misc{chen_talkvid_2025,
  title     = {TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking
               Head Synthesis},
  author    = {Chen, Shunian and Huang, Hejin and Liu, Yexin and Ye, Zihan and Chen,
               Pengcheng and Zhu, Chenghao and Guan, Michael and Wang, Rongsheng and
               Chen, Junying and Li, Guanbin and Lim, Ser-Nam and Yang, Harry and
               Wang, Benyou},
  year      = {2025},
  publisher = {arXiv},
  doi       = {10.48550/arXiv.2508.13618},
  url       = {http://arxiv.org/abs/2508.13618}
}

The augmented set of authentic clips is sourced from YouTube; the URL list is in the code repository (subject to YouTube Terms of Service).
