license: apache-2.0
language: en
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio
- wav2vec2
- deepfake-detection
- synthetic-speech
- tts
- voice-cloning
metrics:
- accuracy
- f1
- precision
- recall
- roc_auc

# Deepfake Audio Detection Model

A fine-tuned Wav2Vec2 model for detecting AI-generated speech. It determines whether audio was spoken by a human or created by AI text-to-speech or voice-cloning software.

## Model Details

### Model Description

A fine-tuned Wav2Vec2 transformer for binary audio classification (real vs. AI-generated speech). It is trained to distinguish authentic human speech from synthetic audio generated by AI text-to-speech and voice-cloning services, including:

- ElevenLabs
- Amazon Polly
- Hexgrad Kokoro
- Hume AI
- Speechify
- Luvvoice

Developed by: Gary A. Stafford

Note: This model uses transfer learning from a base model already trained for deepfake detection. Fast convergence is expected due to task similarity and TTS engine overlap with the base model's training data.

## How to Use

### Installation

Install the required dependencies:

    pip install transformers torch librosa

Optional: for GPU acceleration (recommended), install the PyTorch build that matches your CUDA version:

    # For CUDA 11.8
    pip install torch --index-url https://download.pytorch.org/whl/cu118

    # For CUDA 12.1
    pip install torch --index-url https://download.pytorch.org/whl/cu121

### Quick Start

    import torch
    import librosa
    from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

    # Placeholder: replace with this model's Hub ID or a local checkpoint path
    model_name = "path/to/this-model"

    # Load model and feature extractor
    model = AutoModelForAudioClassification.from_pretrained(model_name)
    feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

    # Move to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()

    # Load and preprocess audio (librosa resamples to 16 kHz and downmixes to mono)
    audio, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)
    inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Run inference
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=-1)

    # Get prediction (class 0 = real, class 1 = fake)
    prob_real = probs[0][0].item()
    prob_fake = probs[0][1].item()
    prediction = "fake" if prob_fake > 0.5 else "real"

    print(f"Prediction: {prediction}")
    print(f"Confidence: {max(prob_real, prob_fake):.2%}")
    print(f"Probabilities - Real: {prob_real:.2%}, Fake: {prob_fake:.2%}")

### Expected Input

- Audio format: WAV, MP3, FLAC, or any format supported by librosa
- Sample rate: resampled to 16 kHz when loaded with `librosa.load(..., sr=16000)`
- Channels: converted to mono
- Duration: best performance on 2.5-13 second clips (the model's training range)

### Output

The model outputs logits (raw, unnormalized scores) for two classes:

- Class 0: Real (human) audio
- Class 1: Fake (AI-generated) audio

Converting logits to probabilities:

Apply softmax to convert raw logits into interpretable probability scores:

    probs = torch.nn.functional.softmax(logits, dim=-1)

- Single sample: `logits.shape = (1, 2)` → `probs.shape = (1, 2)`, where `probs[0]` contains `[prob_real, prob_fake]` summing to 1.0
- Batch processing: `logits.shape = (N, 2)` → `probs.shape = (N, 2)`, where each sample's probabilities sum to 1.0 independently
- `dim=-1`: applies softmax across classes for each sample, not across samples

### Batch Processing Example

    import glob

    # Placeholder pattern: point this at your own audio files
    audio_files = sorted(glob.glob("path/to/audio/*.wav"))

    # Reuses model, feature_extractor, and device from the Quick Start
    for audio_path in audio_files:
        audio, _ = librosa.load(audio_path, sr=16000, mono=True)
        inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

        prob_fake = probs[0][1].item()
        prediction = "fake" if prob_fake > 0.5 else "real"
        print(f"{audio_path}: {prediction} ({prob_fake:.2%} fake)")

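Recordings longer than the 2.5-13 second range listed under Expected Input could be split into shorter windows and scored chunk by chunk. The sketch below shows one hypothetical way to do that with fixed-length windows. The names `chunk_bounds`, `softmax`, and `mean_fake_probability` are illustrative and not part of this model or of transformers, the 10-second window is an arbitrary choice inside the training range, and the logits are made-up numbers. Fixed windows are a simpler stand-in for the silence- and VAD-based splitting used to build the dataset.

```python
import math

SAMPLE_RATE = 16_000  # the model expects 16 kHz mono audio
MIN_SECONDS = 2.5     # lower bound of the training clip range
MAX_SECONDS = 13.0    # upper bound of the training clip range


def chunk_bounds(num_samples, window_seconds=10.0, sample_rate=SAMPLE_RATE):
    """Return (start, end) sample indices for consecutive fixed-length windows.

    A trailing remainder shorter than MIN_SECONDS is merged into the previous
    window, so the window length must leave room for that tail.
    """
    if not MIN_SECONDS <= window_seconds <= MAX_SECONDS - MIN_SECONDS:
        raise ValueError("window_seconds must leave room to absorb a short tail")
    window = int(window_seconds * sample_rate)
    min_len = int(MIN_SECONDS * sample_rate)
    bounds = [(start, min(start + window, num_samples))
              for start in range(0, num_samples, window)]
    if len(bounds) > 1 and bounds[-1][1] - bounds[-1][0] < min_len:
        tail = bounds.pop()
        bounds[-1] = (bounds[-1][0], tail[1])
    return bounds


def softmax(logits):
    """Numerically stable softmax over one sample's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_fake_probability(chunk_logits):
    """Average P(fake) (class index 1) over per-chunk [real, fake] logits."""
    fake_probs = [softmax(row)[1] for row in chunk_logits]
    return sum(fake_probs) / len(fake_probs)


# A 21-second clip: the 1-second tail is merged into the second window
bounds = chunk_bounds(21 * SAMPLE_RATE)

# Made-up logits standing in for model(**inputs).logits on each chunk
example_logits = [[2.0, -1.0], [0.5, 0.5]]
score = mean_fake_probability(example_logits)
print(bounds, f"{score:.3f}")
```

Each `(start, end)` pair would slice the array returned by `librosa.load` (for example `audio[start:end]`) before it is passed to the feature extractor. Averaging is only one way to combine chunk scores; taking the maximum instead would make the result more sensitive to a short synthetic segment.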
## Training Details

### Dataset

Source: garystafford/deepfake-audio-detection

Composition:

- Real audio: YouTube recordings from 14 source videos, human speech samples
- Synthetic audio: Generated using 6 TTS platforms (ElevenLabs, Amazon Polly, Hexgrad Kokoro, Hume AI, Speechify, Luvvoice)
- Format: FLAC, 16kHz mono, 2.5-13 second chunks
- Total samples: 1,866 (balanced: 933 real, 933 fake)
- Processing: Two-pass audio splitting with silence detection, concatenation of short segments, and VAD-based sub-chunking

Split:

| Split | Real | Fake | Total | Percentage |
|---|---|---|---|---|
| Train | 746 | 746 | 1,492 | 80% |
| Validation | 93 | 94 | 187 | 10% |
| Test | 94 | 93 | 187 | 10% |

Stratified splitting applied to ensure balanced class distribution across all splits.

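A stratified 80/10/10 split like the one described above can be sketched in a few lines. This is an illustrative reconstruction, not the author's splitting code: `stratified_split`, the seed, and the placeholder file names are assumptions. Because it truncates each class independently, it yields 93 validation and 94 test clips per class, so the validation and test totals differ by one from the table.

```python
import random


def stratified_split(items_by_class, train_frac=0.8, val_frac=0.1, seed=42):
    """Split each class separately so every split keeps the class balance."""
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * train_frac)
        n_val = int(len(shuffled) * val_frac)
        splits["train"] += [(x, label) for x in shuffled[:n_train]]
        splits["validation"] += [(x, label) for x in shuffled[n_train:n_train + n_val]]
        # Whatever remains after train and validation goes to test
        splits["test"] += [(x, label) for x in shuffled[n_train + n_val:]]
    return splits


# Placeholder file names standing in for the 933 real and 933 fake clips
data = {
    "real": [f"real_{i:04d}.flac" for i in range(933)],
    "fake": [f"fake_{i:04d}.flac" for i in range(933)],
}

splits = stratified_split(data)
for name, rows in splits.items():
    n_real = sum(1 for _, label in rows if label == "real")
    print(f"{name}: {len(rows)} total ({n_real} real, {len(rows) - n_real} fake)")
```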
### Training Approach

Base Model: Gustking/wav2vec2-large-xlsr-deepfake-audio-classification - A Wav2Vec2-XLSR model pre-trained on 53 languages and already fine-tuned for deepfake audio detection.

Method: Transfer learning with selective layer freezing:

- Frozen:
  - Wav2Vec2 feature extractor (convolutional layers)
  - Bottom 12 transformer encoder layers
- Trained:
  - Top 12 transformer encoder layers (upper half)
  - Classification head (256-dimensional projection + linear classifier)
  - ~160M trainable parameters (approximately half the model)

Rationale: Freezing low-level acoustic features while training high-level semantic layers allows the model to adapt to this dataset's specific TTS characteristics and speaker patterns while preserving general audio understanding.

### Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 3e-5 |
| Epochs (max) | 5 |
| Early stopping patience | 3 evaluations |
| Evaluation frequency | Every 30 steps |
| Per-device batch size | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| Optimizer | AdamW |
| Warmup ratio | 0.1 (10%) |
| Weight decay | 0.01 |
| Save strategy | Every 30 steps |
| Metric for best model | ROC-AUC |
| Precision | FP16 |

Training Statistics:

- Training samples: 1,492 (746 real, 746 fake)
- Validation samples: 187 (93 real, 94 fake)
- Trainable parameters: 160,336,770 (~160M parameters, approximately 50% of full model)
- Training approach: Freeze feature extractor and bottom 12 transformer layers; train top 12 transformer layers + classification head
- Convergence: Efficient convergence (typically ~3-4 epochs) due to the base model's existing deepfake detection capabilities
- Why high performance? Transfer learning from a specialist deepfake detector allows rapid adaptation to this dataset while training substantial portions of the model to capture dataset-specific patterns

### Architecture

The model uses AutoModelForAudioClassification with a two-class output (0=real, 1=fake):

- Feature Extractor (Frozen): 7 convolutional layers extract acoustic features from raw audio
- Transformer Encoder:
  - Layers 0-11 (Frozen): Preserve low-level acoustic and phonetic representations
  - Layers 12-23 (Trained): Adapt high-level semantic features to deepfake patterns
- Classification Head (Trained): 256-dimensional projection + linear classifier

This architecture balances efficiency with adaptability: frozen layers preserve general audio understanding, while trained layers (~160M parameters) learn dataset-specific deepfake detection patterns.

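The reported trainable-parameter count can be cross-checked with simple arithmetic. The sketch below assumes the standard Wav2Vec2-large/XLSR dimensions (hidden size 1024, feed-forward size 4096, 512 channels leaving the convolutional extractor, and a positional convolution with kernel 128 and 16 groups), none of which are stated in this card. Under those assumptions, the top 12 layers plus the classification head come to about 151.4M parameters, and the reported 160,336,770 is reproduced when the encoder modules outside the numbered layers (feature projection, positional convolution, final layer norm, mask embedding) are also counted as trainable. That grouping is an inference from the arithmetic, not something the card states.

```python
HIDDEN = 1024        # assumed transformer width (Wav2Vec2-large / XLSR)
FFN = 4096           # assumed feed-forward width
CONV_OUT = 512       # assumed channels leaving the convolutional extractor
POS_KERNEL = 128     # assumed positional-convolution kernel size
POS_GROUPS = 16      # assumed positional-convolution groups
PROJ = 256           # classification projection size (stated in this card)
NUM_LABELS = 2       # real vs. fake
TRAINED_LAYERS = 12  # top half of the 24 encoder layers


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(width):
    """Scale and shift vectors of a layer norm."""
    return 2 * width


encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)                    # query, key, value, output projections
    + linear(HIDDEN, FFN) + linear(FFN, HIDDEN)   # feed-forward block
    + 2 * layer_norm(HIDDEN)                      # two layer norms per layer
)
top_layers = TRAINED_LAYERS * encoder_layer
head = linear(HIDDEN, PROJ) + linear(PROJ, NUM_LABELS)

# Encoder modules that sit outside the 24 numbered layers
pos_conv = (
    HIDDEN * (HIDDEN // POS_GROUPS) * POS_KERNEL  # grouped convolution weights
    + POS_KERNEL                                  # weight-norm gain
    + HIDDEN                                      # bias
)
feature_projection = layer_norm(CONV_OUT) + linear(CONV_OUT, HIDDEN)
mask_embedding = HIDDEN
other = pos_conv + feature_projection + layer_norm(HIDDEN) + mask_embedding

total = top_layers + head + other
print(f"top 12 layers + head:       {top_layers + head:,}")
print(f"with other encoder modules: {total:,}")
```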
## Model Performance

⚠️ IMPORTANT CONTEXT: These high-performance metrics reflect fine-tuning a specialist model on its own domain. The base model (Gustking/wav2vec2-large-xlsr-deepfake-audio-classification) was already trained for deepfake detection, likely on similar TTS engines. These results demonstrate successful adaptation to this specific dataset of 1,866 samples, NOT general deepfake detection capability from scratch. The excellent ROC-AUC (0.998) indicates near-perfect class separation, though 4 samples (2.1%) are still misclassified at the default 0.5 threshold.

### Validation Set Performance

The model performs well on the validation set of 187 audio clips (94 real, 93 fake):

Validation Results (at threshold 0.5):

- Accuracy: 97.9% (183 out of 187 samples correctly classified)
- ROC-AUC: 0.998 (near-perfect class separation)
- Balanced Accuracy: 97.9%

Per-Class Metrics (threshold 0.5):

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Real | 1.00 | 0.96 | 0.98 | 94 |
| Fake | 0.96 | 1.00 | 0.98 | 93 |

Confusion Matrix (threshold 0.5):

| | Pred Real | Pred Fake |
|---|---|---|
| True Real | 90 | 4 |
| True Fake | 0 | 93 |

Note: Best balanced accuracy of 98.4% achieved at threshold 0.9 (96.8% real recall, 100% fake recall).

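The per-class metrics follow directly from the confusion matrix. As a worked check, the sketch below recomputes precision, recall, F1, accuracy, and balanced accuracy from the four reported counts; the helper name `precision_recall_f1` is illustrative.

```python
# Reported confusion matrix at threshold 0.5
# (rows = true class, columns = predicted class)
real_as_real, real_as_fake = 90, 4
fake_as_real, fake_as_fake = 0, 93


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Treat each class in turn as the positive class
real = precision_recall_f1(tp=real_as_real, fp=fake_as_real, fn=real_as_fake)
fake = precision_recall_f1(tp=fake_as_fake, fp=real_as_fake, fn=fake_as_real)

total = real_as_real + real_as_fake + fake_as_real + fake_as_fake
accuracy = (real_as_real + fake_as_fake) / total
balanced_accuracy = (real[1] + fake[1]) / 2  # mean of the per-class recalls

print("Real P/R/F1:", ", ".join(f"{v:.2f}" for v in real))
print("Fake P/R/F1:", ", ".join(f"{v:.2f}" for v in fake))
print(f"Accuracy: {accuracy:.1%}, balanced accuracy: {balanced_accuracy:.1%}")
```

Balanced accuracy is the mean of the two recalls, which is why the threshold-0.9 figure in the note above works out to (96.8% + 100%) / 2 = 98.4%.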
### Important Notes on Performance

Context for high performance:

- Moderate validation set: 187 samples provide a reasonable evaluation, though larger test sets are recommended for production validation
- Transfer learning: The base model was already trained for deepfake detection on similar TTS engines, so fine-tuning adapts existing knowledge
- Dataset characteristics: TTS-generated audio has distinctive artifacts (prosody patterns, spectral signatures) that differentiate it from human speech
- ROC-AUC of 0.998: Indicates near-perfect ranking/separation of classes; 4 real samples were misclassified as fake at threshold 0.5, while all fake samples were correctly identified
- Recommended validation: Test on TTS engines NOT in the training data (e.g., OpenAI TTS, Azure Neural, advanced voice cloning systems) for a true assessment of generalization

Generalization limitations:

The model may not generalize well to:

- Novel TTS engines not represented in training data
- Advanced voice cloning/conversion systems
- Real-time voice manipulation
- Low-quality recordings with significant noise

### Inference Performance

Estimated based on model architecture:

- Latency: ~50-100 ms per sample (varies by hardware)
- Recommended use: Batch processing for efficiency
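
Since the latency figure above is an estimate, it may be worth measuring on the target hardware. The following is a minimal timing harness, assuming a hypothetical `predict_fn` callable that wraps preprocessing and the forward pass; `measure_latency` and `dummy_predict` are illustrative names, and the dummy workload only stands in for the model so the sketch runs on its own.

```python
import statistics
import time


def measure_latency(predict_fn, sample, warmup=3, runs=20):
    """Time repeated calls to predict_fn(sample) and return stats in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew the result
    for _ in range(warmup):
        predict_fn(sample)
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
        "runs": runs,
    }


def dummy_predict(sample):
    """Stand-in workload; replace with a function that runs the real model."""
    return sum(x * x for x in sample)


# One second of silence at 16 kHz as a placeholder input
stats = measure_latency(dummy_predict, [0.0] * 16_000)
print(f"median {stats['median_ms']:.2f} ms, max {stats['max_ms']:.2f} ms over {stats['runs']} runs")
```

On a GPU, the wrapped function should call `torch.cuda.synchronize()` before returning, because CUDA kernels run asynchronously and the timer would otherwise stop too early.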