Speech Artifact Detectors
10 lightweight CNN binary classifiers for detecting audio processing artifacts in speech. Each model takes 16 kHz mono audio and outputs an artifact score in [0, 1] (0 = clean, 1 = artifact detected).
Total size: ~16.4M parameters across 10 models (~4.6M for the 3 TTS/vocoder detectors plus ~11.8M for the 7 augmentation detectors). Inference is fast enough for real-time use on CPU.
Models
TTS/Vocoder Artifact Detectors
Trained on 16,000 samples: 8,000 clean (real podcast audio) + 8,000 artifact (6,000 BigVGAN-vocoded TTS predictions at varying noise levels + 2,000 comb-filtered augmentations). These detect broad vocoder synthesis artifacts including metallic resonance, harmonic distortion from Snake activations, and spectral smoothing.
| Model | Architecture | Parameters | Val Accuracy | What It Detects |
|---|---|---|---|---|
| stft_classifier | MultiResSTFTClassifier | 1.69M | 99.88% | Vocoder synthesis fingerprint via multi-scale spectral analysis. The most sensitive detector: it captures harmonic structure from Snake activations across 4 STFT resolutions (256/512/1024/2048 FFT). |
| waveform_1d | Waveform1DCNN | 1.90M | 97.44% | Waveform-level anomalies from vocoder synthesis. Operates directly on raw audio, detecting temporal artifacts that may not be visible in spectrograms. |
| mel_classifier | MelCNNClassifier | 1.01M | 99.75% | Mel-domain artifacts from vocoder reconstruction. Detects spectral smoothing and missing fine structure in the 80-band mel representation. |
Augmentation Artifact Detectors
Each trained on 6,000 samples (3,000 clean + 3,000 corrupted) from 48 kHz podcast audio downsampled to 16 kHz. Corruptions use variable magnitude to teach the classifier to detect artifacts at all intensities. All use the MultiResSTFTClassifier architecture (1.69M params).
| Model | Val Accuracy | What It Detects | Corruption Details |
|---|---|---|---|
| spectral_denoising | 99.83% | Spectral gating denoiser artifacts: metallic, underwater quality from aggressive noise reduction | noisereduce with prop_decrease 0.3–1.0 |
| pitch_correction | 100% | Autotune stepping artifacts: unnatural pitch quantization where F0 snaps to the nearest semitone | pyworld F0 extraction + semitone snapping, strength 0.3–1.0 |
| codec_compression | 100% | MP3 compression artifacts: warbling, pre-echo, and bandwidth loss at low bitrates | ffmpeg MP3 encoding at 8–48 kbps round-trip |
| comb_filtering | 100% | Comb filter resonance peaks and notches: the characteristic "hollow" or "phaser" sound from delay + feedback mixing | Feedforward comb: delay 0.5–8 ms, wet 0.3–0.95 |
| phase_vocoder | 100% | Phase vocoder smearing: transient blurring and "phasiness" from STFT-based time-stretching | librosa time_stretch at rate 0.6–0.85 or 1.2–1.6 |
| bandwidth_limitation | 100% | Missing high-frequency content from lowpass filtering: "telephone" or "muffled" quality | Butterworth lowpass, cutoff 2–6 kHz |
| clipping_distortion | 98.33% | Hard/soft clipping and harmonic distortion from overdrive: "crunchy" or "fuzzy" audio | pedalboard Distortion with drive_db 3–25 |
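The semitone snapping described for pitch_correction can be illustrated with a few lines of arithmetic. This is a minimal sketch, not the training code: the function name `snap_to_semitone`, the A4 = 440 Hz reference, and the linear blend controlled by `strength` are assumptions made for illustration.

```python
import math

A4_HZ = 440.0  # reference tuning; an assumption of this sketch


def snap_to_semitone(f0_hz: float, strength: float = 1.0) -> float:
    """Pull a voiced F0 value toward the nearest equal-tempered semitone.

    strength=0 leaves the pitch untouched, strength=1 snaps it fully.
    """
    if f0_hz <= 0:  # treat non-positive F0 as an unvoiced frame and pass it through
        return f0_hz
    midi = 69.0 + 12.0 * math.log2(f0_hz / A4_HZ)  # Hz -> fractional MIDI note
    target = round(midi)                           # nearest semitone
    blended = midi + strength * (target - midi)    # partial correction
    return A4_HZ * 2.0 ** ((blended - 69.0) / 12.0)  # MIDI note -> Hz


print(snap_to_semitone(450.0, strength=1.0))
print(snap_to_semitone(450.0, strength=0.3))
```

At full strength a 450 Hz frame lands on A4; at lower strengths it moves only part of the way, which is the kind of subtle case the variable-magnitude training is meant to cover.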
Installation
pip install torch torchaudio soundfile numpy huggingface_hub
No additional dependencies required. The models are self-contained PyTorch modules.
Quick Start
Score a file with all 10 detectors
from speech_artifact_detector import load_from_hub, score_file_all
models = load_from_hub(device="cuda") # or "cpu"
scores = score_file_all(models, "audio.wav")
for name, score in scores.items():
    flag = " !!!" if score > 0.5 else ""
    print(f" {name:>25s}: {score:.4f}{flag}")
Score a raw waveform tensor
import torch
import torchaudio
from speech_artifact_detector import load_from_hub, score_waveform
models = load_from_hub(device="cuda")
# Load and prepare audio (must be 16 kHz mono)
wav, sr = torchaudio.load("audio.wav")
wav = wav.mean(0) # mono
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
# Score with a specific detector
stft_score = score_waveform(models["stft_classifier"], wav)
print(f"STFT classifier score: {stft_score:.4f}")
Load only specific categories
from speech_artifact_detector import load_from_hub
# Only TTS/vocoder detectors (3 models, ~4.6M params)
tts_models = load_from_hub(categories=["tts_artifact"])
# Only augmentation detectors (7 models, ~11.8M params)
aug_models = load_from_hub(categories=["augmentation"])
Batch scoring for datasets
from speech_artifact_detector import load_from_hub, score_batch
models = load_from_hub(device="cuda")
paths = ["file1.wav", "file2.wav", "file3.wav"]
results = score_batch(models, paths, batch_size=16)
for path, scores in zip(paths, results):
    mean = sum(scores.values()) / len(scores)
    print(f"{path}: mean_score={mean:.3f}")
Use as a training loss (differentiable)
The models are differentiable, so artifact_score() can serve as a training signal that penalizes artifact generation:
import torch
from speech_artifact_detector import load_from_hub
# Load a detector and freeze it
models = load_from_hub(device="cuda")
detector = models["stft_classifier"]
for p in detector.parameters():
    p.requires_grad = False
# In your training loop:
predicted_audio = your_model(input) # [B, 1, T] at 16 kHz
artifact_score = detector.artifact_score(predicted_audio) # [B, 1]
# Hinge loss: only penalize if score exceeds threshold
threshold = 0.3
artifact_loss = torch.relu(artifact_score - threshold).mean()
# Add to your total loss
total_loss = reconstruction_loss + artifact_weight * artifact_loss
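To make the hinge term concrete, here is the same arithmetic in plain Python, without PyTorch. The scores are made up for illustration and `hinge_artifact_loss` is a hypothetical helper, not part of the package.

```python
def hinge_artifact_loss(scores, threshold=0.3):
    """Mean of max(score - threshold, 0) over a batch of scores in [0, 1]."""
    penalties = [max(s - threshold, 0.0) for s in scores]
    return sum(penalties) / len(penalties)


batch_scores = [0.10, 0.25, 0.50, 0.90]  # made-up detector outputs
loss = hinge_artifact_loss(batch_scores)
print(f"{loss:.3f}")  # 0.200
```

Samples scoring below the threshold contribute nothing, so the generator is only pushed away from outputs the detector is reasonably sure about.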
CLI
# Score with all models from HuggingFace Hub
python speech_artifact_detector.py --hub --input audio.wav
# Score a directory of files
python speech_artifact_detector.py --hub --input /path/to/wavs/ --ext mp3
# Only augmentation detectors
python speech_artifact_detector.py --hub --category augmentation --input audio.wav
# Single model from local checkpoint
python speech_artifact_detector.py --checkpoint codec_compression.pt --input audio.wav
Inference Example
See inference_example.py for a complete standalone script demonstrating:
- Loading models from the Hub
- Scoring individual files and directories
- Batch processing with progress reporting
- Generating a summary report
# Score a single file
python inference_example.py audio.wav
# Score a directory
python inference_example.py /path/to/audio/dir/ --ext wav
# Score with only TTS detectors on CPU
python inference_example.py audio.wav --category tts_artifact --device cpu
Architecture Details
MultiResSTFTClassifier (1,688,833 params)
Used by stft_classifier and all 7 augmentation detectors. Computes STFT at 4 resolutions in parallel, each processed by a 5-layer 2D CNN:
Input: [B, 1, T] at 16 kHz (max 10 seconds)
├── STFT n_fft=256, hop=64 → CNN (32→64→128→128→128) → pool → [B, 128]
├── STFT n_fft=512, hop=128 → CNN (32→64→128→128→128) → pool → [B, 128]
├── STFT n_fft=1024, hop=256 → CNN (32→64→128→128→128) → pool → [B, 128]
└── STFT n_fft=2048, hop=512 → CNN (32→64→128→128→128) → pool → [B, 128]
concat → [B, 512]
Linear(512, 256) + GELU + Dropout(0.2)
Linear(256, 1) → sigmoid → score
Each CNN block uses BatchNorm2d + GELU activations. The multi-resolution design captures artifacts at different time-frequency tradeoffs: the 256-FFT branch advances in 4 ms hops, which suits transient artifacts, while the 2048-FFT branch has a bin spacing of about 7.8 Hz (16,000 / 2,048), which suits tonal artifacts.
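The time and frequency resolution of each branch follows directly from the hop length, the FFT size, and the 16 kHz sample rate. The short calculation below shows the tradeoff; `branch_resolution` is a hypothetical helper written for this illustration.

```python
SAMPLE_RATE = 16_000
BRANCHES = [(256, 64), (512, 128), (1024, 256), (2048, 512)]  # (n_fft, hop)


def branch_resolution(n_fft, hop, sr=SAMPLE_RATE):
    """Time step and frequency-bin spacing of one STFT branch."""
    return {
        "hop_ms": 1000.0 * hop / sr,  # time between successive frames
        "bin_hz": sr / n_fft,         # spacing between frequency bins
        "n_bins": n_fft // 2 + 1,     # one-sided spectrum size
    }


for n_fft, hop in BRANCHES:
    r = branch_resolution(n_fft, hop)
    print(f"n_fft={n_fft:4d}: hop={r['hop_ms']:5.1f} ms, "
          f"bin={r['bin_hz']:7.3f} Hz, bins={r['n_bins']}")
```

Small FFTs give fine time steps and coarse frequency bins; large FFTs give the reverse, which is why the branches are combined.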
Waveform1DCNN (1,898,753 params)
Used by waveform_1d. Six 1D convolutional blocks operating directly on the raw waveform:
Input: [B, 1, T] at 16 kHz
Conv1d(1→64, k=15, s=4) + BN + GELU + MaxPool(2)
Conv1d(64→128, k=11, s=2) + BN + GELU + MaxPool(2)
Conv1d(128→256, k=7, s=2) + BN + GELU + MaxPool(2)
Conv1d(256→256, k=5, s=2) + BN + GELU + MaxPool(2)
Conv1d(256→512, k=3, s=2) + BN + GELU + MaxPool(2)
Conv1d(512→512, k=3, s=1) + BN + GELU
AdaptiveAvgPool1d(1) → Linear(512, 128) + GELU + Dropout(0.2) → Linear(128, 1) → sigmoid
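As a sanity check, the parameter count can be reproduced from the layer list alone. The sketch below assumes every convolution and linear layer has a bias term and every BatchNorm has a learnable scale and shift; under those assumptions the total agrees with the stated 1,898,753.

```python
def conv1d_params(c_in, c_out, k):
    return c_in * c_out * k + c_out  # weights + bias


def batchnorm_params(channels):
    return 2 * channels  # learnable scale and shift


def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + bias


# (in_channels, out_channels, kernel_size) for the six conv blocks listed above
CONV_BLOCKS = [
    (1, 64, 15), (64, 128, 11), (128, 256, 7),
    (256, 256, 5), (256, 512, 3), (512, 512, 3),
]


def waveform1dcnn_params():
    total = sum(conv1d_params(ci, co, k) + batchnorm_params(co)
                for ci, co, k in CONV_BLOCKS)
    total += linear_params(512, 128) + linear_params(128, 1)
    return total


print(waveform1dcnn_params())  # 1898753
```

Strides, pooling, GELU, and dropout add no parameters, so they do not appear in the count.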
MelCNNClassifier (1,012,417 params)
Used by mel_classifier. Computes 80-band mel spectrogram then 5-layer 2D CNN:
Input: [B, 1, T] at 16 kHz
MelSpectrogram(sr=16000, n_fft=1024, hop=256, n_mels=80) → log
Conv2d(1→32, 3×3) + BN + GELU + MaxPool(2)
Conv2d(32→64, 3×3) + BN + GELU + MaxPool(2)
Conv2d(64→128, 3×3) + BN + GELU + MaxPool(2)
Conv2d(128→256, 3×3) + BN + GELU + MaxPool(2)
Conv2d(256→256, 3×3) + BN + GELU
AdaptiveAvgPool2d(1) → Linear(256, 128) + GELU + Dropout(0.2) → Linear(128, 1) → sigmoid
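It can help to trace how large the feature map is by the time it reaches the global pooling layer. The sketch below assumes centered STFT framing (so a clip of T samples yields 1 + T // hop frames) and size-preserving 3×3 convolutions; neither detail is stated above, so treat the numbers as approximate.

```python
def mel_cnn_feature_shape(num_samples, hop=256, n_mels=80, n_pools=4):
    """(mel bins, frames) remaining after the four MaxPool(2) stages."""
    frames = 1 + num_samples // hop  # assumes centered STFT framing
    height, width = n_mels, frames
    for _ in range(n_pools):
        # MaxPool(2) halves each axis, rounding down
        height, width = height // 2, width // 2
    return height, width


print(mel_cnn_feature_shape(10 * 16_000))  # (5, 39)
```

Under these assumptions a 10-second clip reaches AdaptiveAvgPool2d as a 256-channel map of roughly 5 × 39, which the pooling layer then averages to a single 256-dimensional vector.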
Training Details
TTS/Vocoder Artifact Detectors
Training data: 16,000 samples per epoch drawn from:
- Clean class (8,000): 2,000 real podcast audio clips (48 kHz, downsampled to 16 kHz) with 4x oversampling
- Artifact class (8,000): 6,000 TTS-predicted audio (teacher-forced through DramaBox DiT at sigma=0.15/0.275/0.4, decoded via BigVGAN vocoder) + 2,000 on-the-fly comb-filter augmented clips
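The on-the-fly comb-filter augmentation can be pictured as mixing a signal with a delayed copy of itself. The following is a dependency-free sketch, not the training code: the dry/wet crossfade `y[n] = (1 - wet) * x[n] + wet * x[n - d]` and the name `feedforward_comb` are assumptions.

```python
def feedforward_comb(x, sr=16_000, delay_ms=2.0, wet=0.5):
    """Feedforward comb filter: mix each sample with a delayed copy.

    x is a list of float samples; the delay is rounded to whole samples.
    """
    d = max(1, round(sr * delay_ms / 1000.0))
    y = []
    for n, sample in enumerate(x):
        delayed = x[n - d] if n >= d else 0.0  # silence before the signal starts
        y.append((1.0 - wet) * sample + wet * delayed)
    return y


# An impulse shows the filter's two taps: the dry path and the delayed path.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
print(feedforward_comb(impulse, sr=1000, delay_ms=2.0, wet=0.5))
```

With this mixing rule the spectrum has notches at odd multiples of sr / (2d) Hz, which produces the "hollow" sound the detectors are trained to recognize.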
Training setup: 30 epochs, AdamW with lr=3e-4 and weight_decay=1e-4, BCEWithLogitsLoss, 10-second random crops at 16 kHz.
Key finding: The stft_classifier is the most sensitive. It detects the fundamental spectral fingerprint of BigVGAN-style vocoder synthesis (characteristic harmonics from Snake activations), not just specific comb-filter artifacts, and scores ~1.0 for ALL vocoder-synthesized audio, including vanilla DramaBox with no fine-tuning. The mel_classifier and waveform_1d are more specific to degradation from LoRA fine-tuning and respond to vocoder LoRA adaptation.
Source repo for original 3 models: laion/tts-comb-artefact-detectors
Augmentation Artifact Detectors
Training data: 6,000 samples per augmentation (3,000 clean + 3,000 corrupted). Source audio: random podcast recordings from a 48 kHz speech corpus, downsampled to 16 kHz. Each augmentation uses variable magnitude (uniformly sampled across its range) so the classifier learns the artifact signature at all intensities rather than only detecting extreme cases.
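Uniform sampling of the corruption strength might look like the sketch below. The ranges are taken from the table of augmentation detectors; the dictionary layout, the parameter key names, and `sample_params` are hypothetical. The codec and phase-vocoder corruptions are left out because their parameters are discrete or span two separate intervals.

```python
import random

# Parameter ranges copied from the augmentation table; key names are illustrative.
RANGES = {
    "spectral_denoising": {"prop_decrease": (0.3, 1.0)},
    "pitch_correction": {"strength": (0.3, 1.0)},
    "comb_filtering": {"delay_ms": (0.5, 8.0), "wet": (0.3, 0.95)},
    "bandwidth_limitation": {"cutoff_khz": (2.0, 6.0)},
    "clipping_distortion": {"drive_db": (3.0, 25.0)},
}


def sample_params(name, rng):
    """Draw one corruption setting uniformly from each parameter's range."""
    return {key: rng.uniform(lo, hi) for key, (lo, hi) in RANGES[name].items()}


rng = random.Random(0)  # seeded so a run can be repeated
for _ in range(3):
    print(sample_params("comb_filtering", rng))
```

Drawing a fresh setting per clip exposes the classifier to mild and severe corruption alike.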
Training setup: 30 epochs max with patience=10 early stopping, batch size 32, AdamW with lr=3e-4 and weight_decay=1e-4, gradient clipping at 1.0, 90/10 train/val split.
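Early stopping with patience can be expressed in a few lines. This sketch assumes that the monitored quantity is validation accuracy and that training stops once `patience` epochs pass without a new best; the exact criterion used in training is not stated here.

```python
def best_epoch_with_patience(val_accs, patience=10):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops when `patience` epochs have passed since the last improvement,
    or when the list of epochs runs out.
    """
    best_acc, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accs, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_accs)


# Made-up curve: peaks at epoch 3, then plateaus slightly lower.
history = [90.0, 95.0, 100.0] + [99.0] * 27
print(best_epoch_with_patience(history))  # (3, 13)
```

A detector that peaks early stops well before the 30-epoch cap, while one that keeps improving late in training runs to the end.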
Training data repo: TTS-AGI/augmentation-artifact-detector-data
Convergence: Most detectors converge very early (epoch 1-4) with near-perfect accuracy. comb_filtering took longest (epoch 11) and clipping_distortion was the hardest task (98.33% at epoch 28), likely because soft clipping at low drive levels produces subtle harmonic distortion that overlaps with natural speech harmonics.
Checkpoint Format
Each .pt file is a PyTorch checkpoint dict:
{
"model_state_dict": ..., # Model weights
"architecture": str, # "stft_classifier", "waveform_1d", "mel_classifier",
# or "MultiResSTFTClassifier" (for augmentation detectors)
"epoch": int, # Best epoch number
"val_acc": float, # Validation accuracy (%)
"val_f1": float, # Validation F1 score
"val_prec": float, # Validation precision
"val_rec": float, # Validation recall
"n_params": int, # Number of trainable parameters
"input_sr": int, # Expected input sample rate (16000)
# Augmentation detectors also have:
"augmentation": str, # Augmentation type name
"config": dict, # Augmentation configuration
}
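Before relying on a checkpoint's metadata it can be worth checking that the documented fields are present. The helper below is a hypothetical sketch that works on an already-loaded dict (for example, one returned by `torch.load(path, map_location="cpu")`); `check_checkpoint` is not part of the package.

```python
BASE_FIELDS = {
    "architecture": str, "epoch": int, "val_acc": float, "val_f1": float,
    "val_prec": float, "val_rec": float, "n_params": int, "input_sr": int,
}
AUG_FIELDS = {"augmentation": str, "config": dict}  # augmentation detectors only


def check_checkpoint(ckpt):
    """Return a list of problems found in a checkpoint dict (empty if none)."""
    problems = []
    if "model_state_dict" not in ckpt:
        problems.append("missing model_state_dict")
    expected = dict(BASE_FIELDS)
    if "augmentation" in ckpt or "config" in ckpt:
        expected.update(AUG_FIELDS)
    for key, typ in expected.items():
        if key not in ckpt:
            problems.append(f"missing {key}")
        elif not isinstance(ckpt[key], typ):
            problems.append(f"{key} should be {typ.__name__}")
    return problems


# Made-up metadata for illustration; the weights are replaced by an empty dict.
example = {
    "model_state_dict": {}, "architecture": "mel_classifier", "epoch": 12,
    "val_acc": 99.75, "val_f1": 0.99, "val_prec": 0.99, "val_rec": 0.99,
    "n_params": 1012417, "input_sr": 16000,
}
print(check_checkpoint(example))  # []
```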
Use Cases
- TTS quality filtering: Score generated speech before release. High scores on TTS detectors indicate vocoder artifacts.
- Audio processing pipeline QA: Run all 7 augmentation detectors to identify which processing step introduced artifacts.
- Dataset cleaning: Filter training data to remove samples with processing artifacts that could degrade downstream model quality.
- Codec quality assessment: Use codec_compression to detect over-compressed audio in datasets.
- Vocoder fine-tuning loss: Use as a differentiable training signal to penalize artifact generation during vocoder training.
- Voice cloning detection: The TTS detectors can distinguish real from synthesized speech (though this is not their primary purpose).
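For dataset cleaning, the per-file score dictionaries can be turned into a keep/drop decision. The sketch below is one possible policy, not a recommendation: a file is flagged if any detector exceeds the threshold, and the strongest offender is recorded. The function name, the 0.5 threshold, and the scores are illustrative.

```python
def split_by_artifacts(scores_by_file, threshold=0.5):
    """Partition files into clean ones and flagged ones.

    Returns (clean_paths, flagged), where flagged maps each path to the
    name of its highest-scoring detector.
    """
    clean, flagged = [], {}
    for path, scores in scores_by_file.items():
        offenders = {name: s for name, s in scores.items() if s > threshold}
        if offenders:
            flagged[path] = max(offenders, key=offenders.get)
        else:
            clean.append(path)
    return clean, flagged


# Made-up scores for three files and two detectors.
scores_by_file = {
    "a.wav": {"codec_compression": 0.02, "comb_filtering": 0.10},
    "b.wav": {"codec_compression": 0.97, "comb_filtering": 0.61},
    "c.wav": {"codec_compression": 0.40, "comb_filtering": 0.55},
}
clean, flagged = split_by_artifacts(scores_by_file)
print(clean)
print(flagged)
```

Recording the strongest detector per file also helps with pipeline QA, since it points at the processing step most likely responsible.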
Requirements
torch>=2.0
torchaudio>=2.0
soundfile
numpy
huggingface_hub # only for load_from_hub()
License
Apache 2.0
Citation
@misc{laion-speech-artifact-detectors,
title={Speech Artifact Detectors: 10 CNN Classifiers for Audio Processing Artifacts},
author={LAION},
year={2025},
url={https://huggingface.co/laion/speech-artifact-detectors}
}