Speech Artifact Detectors

10 lightweight CNN binary classifiers for detecting audio processing artifacts in speech. Each model takes 16 kHz mono audio and outputs an artifact score in [0, 1] (0 = clean, 1 = artifact detected).

Total size: ~16.4M parameters across 10 models (~4.6M for the three TTS/vocoder detectors plus ~11.8M for the seven augmentation detectors). Inference is fast enough for real-time use on CPU.

Models

TTS/Vocoder Artifact Detectors

Trained on 16,000 samples: 8,000 clean (real podcast audio) + 8,000 artifact (6,000 BigVGAN-vocoded TTS predictions at varying noise levels + 2,000 comb-filtered augmentations). These detect broad vocoder synthesis artifacts including metallic resonance, harmonic distortion from Snake activations, and spectral smoothing.

| Model | Architecture | Parameters | Val Accuracy | What It Detects |
| --- | --- | --- | --- | --- |
| stft_classifier | MultiResSTFTClassifier | 1.69M | 99.88% | Vocoder synthesis fingerprint via multi-scale spectral analysis. The most sensitive detector: it captures harmonic structure from Snake activations across 4 STFT resolutions (256/512/1024/2048 FFT). |
| waveform_1d | Waveform1DCNN | 1.90M | 97.44% | Waveform-level anomalies from vocoder synthesis. Operates directly on raw audio, detecting temporal artifacts that may not be visible in spectrograms. |
| mel_classifier | MelCNNClassifier | 1.01M | 99.75% | Mel-domain artifacts from vocoder reconstruction. Detects spectral smoothing and missing fine structure in the 80-band mel representation. |

Augmentation Artifact Detectors

Each trained on 6,000 samples (3,000 clean + 3,000 corrupted) from 48 kHz podcast audio downsampled to 16 kHz. Corruptions use variable magnitude to teach the classifier to detect artifacts at all intensities. All use the MultiResSTFTClassifier architecture (1.69M params).

| Model | Val Accuracy | What It Detects | Corruption Details |
| --- | --- | --- | --- |
| spectral_denoising | 99.83% | Spectral gating denoiser artifacts: metallic, underwater quality from aggressive noise reduction | noisereduce with prop_decrease 0.3–1.0 |
| pitch_correction | 100% | Autotune stepping artifacts: unnatural pitch quantization where F0 snaps to the nearest semitone | pyworld F0 extraction + semitone snapping, strength 0.3–1.0 |
| codec_compression | 100% | MP3 compression artifacts: warbling, pre-echo, and bandwidth loss at low bitrates | ffmpeg MP3 encoding at 8–48 kbps round-trip |
| comb_filtering | 100% | Comb filter resonance peaks and notches: the characteristic "hollow" or "phaser" sound from delay + feedback mixing | Feedforward comb: delay 0.5–8 ms, wet 0.3–0.95 |
| phase_vocoder | 100% | Phase vocoder smearing: transient blurring and "phasiness" from STFT-based time-stretching | librosa time_stretch at rate 0.6–0.85 or 1.2–1.6 |
| bandwidth_limitation | 100% | Missing high-frequency content from lowpass filtering: "telephone" or "muffled" quality | Butterworth lowpass, cutoff 2–6 kHz |
| clipping_distortion | 98.33% | Hard/soft clipping and harmonic distortion from overdrive: "crunchy" or "fuzzy" audio | pedalboard Distortion with drive_db 3–25 |
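
As one concrete instance of the corruptions listed above, a feedforward comb filter adds a delayed copy of the signal to itself. The sketch below is a minimal NumPy illustration using the delay and wet ranges given for comb_filtering. The function names, the exact mixing formula (dry signal plus `wet` times the delayed copy), and the uniform sampling are assumptions for illustration, not the training code used for these models.

```python
import numpy as np

SAMPLE_RATE = 16_000  # the detectors expect 16 kHz mono audio


def feedforward_comb(wav, delay_ms, wet, sr=SAMPLE_RATE):
    """Return y[n] = x[n] + wet * x[n - d], where d is the delay in samples."""
    n = len(wav)
    delay = max(1, int(round(delay_ms * 1e-3 * sr)))
    delayed = np.zeros_like(wav)
    if delay < n:
        delayed[delay:] = wav[: n - delay]
    return wav + wet * delayed


def random_comb(wav, rng):
    """Draw delay and wet uniformly from the ranges quoted for comb_filtering."""
    delay_ms = rng.uniform(0.5, 8.0)
    wet = rng.uniform(0.3, 0.95)
    return feedforward_comb(wav, delay_ms, wet), delay_ms, wet


# An impulse exposes the filter structure: one direct tap and one delayed tap.
impulse = np.zeros(SAMPLE_RATE, dtype=np.float32)
impulse[0] = 1.0
out = feedforward_comb(impulse, delay_ms=2.0, wet=0.5)
print(np.flatnonzero(out))  # tap positions: sample 0 and sample 32 (2 ms at 16 kHz)
```

With a delay of tau seconds, the delayed copy cancels the direct signal most strongly near odd multiples of 1 / (2 * tau), which produces the regularly spaced notches the detector is meant to pick up.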

Installation

pip install torch torchaudio soundfile numpy huggingface_hub

No additional dependencies required. The models are self-contained PyTorch modules.

Quick Start

Score a file with all 10 detectors

from speech_artifact_detector import load_from_hub, score_file_all

models = load_from_hub(device="cuda")  # or "cpu"

scores = score_file_all(models, "audio.wav")
for name, score in scores.items():
    flag = " !!!" if score > 0.5 else ""
    print(f"  {name:>25s}: {score:.4f}{flag}")

Score a raw waveform tensor

import torch
import torchaudio
from speech_artifact_detector import load_from_hub, score_waveform

models = load_from_hub(device="cuda")

# Load and prepare audio (must be 16 kHz mono)
wav, sr = torchaudio.load("audio.wav")
wav = wav.mean(0)  # mono
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# Score with a specific detector
stft_score = score_waveform(models["stft_classifier"], wav)
print(f"STFT classifier score: {stft_score:.4f}")

Load only specific categories

from speech_artifact_detector import load_from_hub

# Only TTS/vocoder detectors (3 models, ~4.6M params)
tts_models = load_from_hub(categories=["tts_artifact"])

# Only augmentation detectors (7 models, ~11.8M params)
aug_models = load_from_hub(categories=["augmentation"])
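
The parameter totals quoted in the comments above follow from the per-architecture counts listed under Architecture Details. A quick arithmetic check (the dictionary layout is only for illustration):

```python
# Exact parameter counts per architecture, taken from "Architecture Details".
PARAMS = {
    "MultiResSTFTClassifier": 1_688_833,
    "Waveform1DCNN": 1_898_753,
    "MelCNNClassifier": 1_012_417,
}

# stft_classifier + waveform_1d + mel_classifier
tts = sum(PARAMS.values())
# All 7 augmentation detectors share the MultiResSTFTClassifier architecture.
aug = 7 * PARAMS["MultiResSTFTClassifier"]

print(f"tts_artifact:  {tts / 1e6:.2f}M")          # 4.60M
print(f"augmentation:  {aug / 1e6:.2f}M")          # 11.82M
print(f"all 10 models: {(tts + aug) / 1e6:.2f}M")  # 16.42M
```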

Batch scoring for datasets

from speech_artifact_detector import load_from_hub, score_batch

models = load_from_hub(device="cuda")
paths = ["file1.wav", "file2.wav", "file3.wav"]
results = score_batch(models, paths, batch_size=16)

for path, scores in zip(paths, results):
    mean = sum(scores.values()) / len(scores)
    print(f"{path}: mean_score={mean:.3f}")

Use as a training loss (differentiable)

The models are differentiable, so you can use artifact_score() as a training signal to penalize artifact generation:

import torch
from speech_artifact_detector import load_from_hub

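# Load a detector, switch it to eval mode, and freeze its weights.
# eval() keeps BatchNorm statistics and Dropout fixed while your own model trains.
models = load_from_hub(device="cuda")
detector = models["stft_classifier"]
detector.eval()
for p in detector.parameters():
    p.requires_grad = False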

# In your training loop:
predicted_audio = your_model(input)  # [B, 1, T] at 16 kHz
artifact_score = detector.artifact_score(predicted_audio)  # [B, 1]

# Hinge loss: only penalize if score exceeds threshold
threshold = 0.3
artifact_loss = torch.relu(artifact_score - threshold).mean()

# Add to your total loss
total_loss = reconstruction_loss + artifact_weight * artifact_loss
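
To check that a frozen detector still passes gradients back to the model being trained, the sketch below replaces the real detector with a tiny stand-in. `ToyDetector` and the one-layer `generator` are hypothetical placeholders invented for this example; only the `artifact_score()` call and the hinge formulation mirror the snippet above.

```python
import torch
import torch.nn as nn


class ToyDetector(nn.Module):
    """Hypothetical stand-in for a real detector (NOT one of the released models)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 4, kernel_size=9, stride=4),
            nn.GELU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(4, 1),
        )

    def artifact_score(self, wav):
        # wav: [B, 1, T] -> score in [0, 1] with shape [B, 1]
        return torch.sigmoid(self.net(wav))


torch.manual_seed(0)
detector = ToyDetector().eval()
for p in detector.parameters():
    p.requires_grad = False

generator = nn.Conv1d(1, 1, kernel_size=3, padding=1)  # toy model under training
audio = generator(torch.randn(2, 1, 1600))             # [B, 1, T]

score = detector.artifact_score(audio)                 # [B, 1]
threshold = 0.3
artifact_loss = torch.relu(score - threshold).mean()   # zero when every score <= threshold
artifact_loss.backward()

# Gradients reach the generator; the frozen detector accumulates none.
print(generator.weight.grad is not None)
print(all(p.grad is None for p in detector.parameters()))
```

Because of the hinge, samples scoring below the threshold contribute zero loss and zero gradient, so the penalty only acts on outputs the detector already considers suspicious.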

CLI

# Score with all models from HuggingFace Hub
python speech_artifact_detector.py --hub --input audio.wav

# Score a directory of files
python speech_artifact_detector.py --hub --input /path/to/wavs/ --ext mp3

# Only augmentation detectors
python speech_artifact_detector.py --hub --category augmentation --input audio.wav

# Single model from local checkpoint
python speech_artifact_detector.py --checkpoint codec_compression.pt --input audio.wav

Inference Example

See inference_example.py for a complete standalone script demonstrating:

  • Loading models from the Hub
  • Scoring individual files and directories
  • Batch processing with progress reporting
  • Generating a summary report

# Score a single file
python inference_example.py audio.wav

# Score a directory
python inference_example.py /path/to/audio/dir/ --ext wav

# Score with only TTS detectors on CPU
python inference_example.py audio.wav --category tts_artifact --device cpu

Architecture Details

MultiResSTFTClassifier (1,688,833 params)

Used by stft_classifier and all 7 augmentation detectors. Computes STFT at 4 resolutions in parallel, each processed by a 5-layer 2D CNN:

Input: [B, 1, T] at 16 kHz (max 10 seconds)
  ├─ STFT n_fft=256,  hop=64  → CNN (32→64→128→128→128) → pool → [B, 128]
  ├─ STFT n_fft=512,  hop=128 → CNN (32→64→128→128→128) → pool → [B, 128]
  ├─ STFT n_fft=1024, hop=256 → CNN (32→64→128→128→128) → pool → [B, 128]
  └─ STFT n_fft=2048, hop=512 → CNN (32→64→128→128→128) → pool → [B, 128]
                                                    concat → [B, 512]
                                              Linear(512, 256) + GELU + Dropout(0.2)
                                              Linear(256, 1) → sigmoid → score

Each CNN block uses BatchNorm2d + GELU activations. The multi-resolution design captures artifacts at different time-frequency tradeoffs: at 16 kHz the 256-FFT branch advances in 4 ms hops over 16 ms frames, which suits transient artifacts, while the 2048-FFT branch has a frequency bin spacing of about 7.8 Hz over 128 ms frames, which suits tonal artifacts.
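
The time and frequency resolution of each branch follows directly from n_fft, hop, and the 16 kHz sample rate. A small worked calculation (the helper name `stft_resolution` is made up for this example, and it assumes the analysis window spans the full n_fft samples):

```python
SAMPLE_RATE = 16_000
BRANCHES = [(256, 64), (512, 128), (1024, 256), (2048, 512)]  # (n_fft, hop) per branch


def stft_resolution(n_fft, hop, sr=SAMPLE_RATE):
    """Time step, frame span, and frequency bin spacing for one STFT branch."""
    return {
        "hop_ms": 1000 * hop / sr,      # time between successive frames
        "frame_ms": 1000 * n_fft / sr,  # span of one analysis frame
        "bin_hz": sr / n_fft,           # spacing between frequency bins
        "n_bins": n_fft // 2 + 1,       # one-sided spectrum
    }


for n_fft, hop in BRANCHES:
    r = stft_resolution(n_fft, hop)
    print(f"n_fft={n_fft:4d}: hop {r['hop_ms']:4.0f} ms, frame {r['frame_ms']:4.0f} ms, "
          f"{r['n_bins']:4d} bins spaced {r['bin_hz']:.2f} Hz")
```

The smallest branch steps every 4 ms with 62.5 Hz bins, while the largest steps every 32 ms with 7.81 Hz bins.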

Waveform1DCNN (1,898,753 params)

Used by waveform_1d. Six 1D convolutional blocks operating directly on the raw waveform:

Input: [B, 1, T] at 16 kHz
  Conv1d(1→64, k=15, s=4) + BN + GELU + MaxPool(2)
  Conv1d(64→128, k=11, s=2) + BN + GELU + MaxPool(2)
  Conv1d(128→256, k=7, s=2) + BN + GELU + MaxPool(2)
  Conv1d(256→256, k=5, s=2) + BN + GELU + MaxPool(2)
  Conv1d(256→512, k=3, s=2) + BN + GELU + MaxPool(2)
  Conv1d(512→512, k=3, s=1) + BN + GELU
  AdaptiveAvgPool1d(1) → Linear(512, 128) + GELU + Dropout(0.2) → Linear(128, 1) → sigmoid
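
The strides and pooling factors in this listing determine how much audio each feature frame covers before global average pooling. The sketch below multiplies them out. The overall stride follows from the listing alone, but the frame count additionally assumes `padding = kernel // 2` on every convolution, which the listing does not state.

```python
# (kernel, stride, followed_by_MaxPool2) for the six blocks listed above
BLOCKS = [(15, 4, True), (11, 2, True), (7, 2, True), (5, 2, True), (3, 2, True), (3, 1, False)]


def total_stride(blocks):
    """Input samples per output frame: product of conv strides and pool factors."""
    s = 1
    for _, stride, pooled in blocks:
        s *= stride * (2 if pooled else 1)
    return s


def output_frames(n_samples, blocks):
    """Frames before global pooling, ASSUMING padding = kernel // 2 (not given in the source)."""
    n = n_samples
    for kernel, stride, pooled in blocks:
        n = (n + 2 * (kernel // 2) - kernel) // stride + 1  # standard Conv1d length formula
        if pooled:
            n //= 2  # MaxPool(2) with floor rounding
    return n


stride = total_stride(BLOCKS)
print(stride, 1000 * stride / 16_000)    # 2048 samples, i.e. 128.0 ms per frame
print(output_frames(160_000, BLOCKS))    # 78 frames for a 10-second clip
```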

MelCNNClassifier (1,012,417 params)

Used by mel_classifier. Computes 80-band mel spectrogram then 5-layer 2D CNN:

Input: [B, 1, T] at 16 kHz
  MelSpectrogram(sr=16000, n_fft=1024, hop=256, n_mels=80) → log
  Conv2d(1→32, 3×3) + BN + GELU + MaxPool(2)
  Conv2d(32→64, 3×3) + BN + GELU + MaxPool(2)
  Conv2d(64→128, 3×3) + BN + GELU + MaxPool(2)
  Conv2d(128→256, 3×3) + BN + GELU + MaxPool(2)
  Conv2d(256→256, 3×3) + BN + GELU
  AdaptiveAvgPool2d(1) → Linear(256, 128) + GELU + Dropout(0.2) → Linear(128, 1) → sigmoid
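
The stated parameter counts for MelCNNClassifier and Waveform1DCNN can be reproduced from the layer listings. The calculation below assumes every convolution has a bias and every BatchNorm contributes a learnable scale and shift (the PyTorch defaults). That is an assumption, since the listings do not spell it out, but the totals match exactly under it.

```python
def conv_bn(c_in, c_out, kernel_elems):
    """Conv weights + conv bias + BatchNorm scale and shift."""
    return c_in * c_out * kernel_elems + c_out + 2 * c_out


def linear(n_in, n_out):
    """Linear weights + bias."""
    return n_in * n_out + n_out


# MelCNNClassifier: five 3x3 Conv2d blocks (9 kernel elements each), then the head.
mel_layers = [(1, 32, 9), (32, 64, 9), (64, 128, 9), (128, 256, 9), (256, 256, 9)]
mel_total = sum(conv_bn(a, b, k) for a, b, k in mel_layers)
mel_total += linear(256, 128) + linear(128, 1)

# Waveform1DCNN: six Conv1d blocks with kernel sizes 15, 11, 7, 5, 3, 3, then the head.
wav_layers = [(1, 64, 15), (64, 128, 11), (128, 256, 7),
              (256, 256, 5), (256, 512, 3), (512, 512, 3)]
wav_total = sum(conv_bn(a, b, k) for a, b, k in wav_layers)
wav_total += linear(512, 128) + linear(128, 1)

print(mel_total, wav_total)  # 1012417 1898753
```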

Training Details

TTS/Vocoder Artifact Detectors

Training data: 16,000 samples per epoch drawn from:

  • Clean class (8,000): 2,000 real podcast audio clips (48 kHz, downsampled to 16 kHz) with 4x oversampling
  • Artifact class (8,000): 6,000 TTS-predicted audio (teacher-forced through DramaBox DiT at sigma=0.15/0.275/0.4, decoded via BigVGAN vocoder) + 2,000 on-the-fly comb-filter augmented clips

Training setup: 30 epochs, AdamW with lr=3e-4 and weight_decay=1e-4, BCEWithLogitsLoss, 10-second random crops at 16 kHz.

Key finding: The stft_classifier is the most sensitive detector. It picks up the fundamental spectral fingerprint of BigVGAN-style vocoder synthesis (characteristic harmonics from Snake activations), not just specific comb-filter artifacts, and it scores ~1.0 for all vocoder-synthesized audio, including vanilla DramaBox with no fine-tuning. The mel_classifier and waveform_1d are more specific to degradation from LoRA fine-tuning and respond to vocoder LoRA adaptation.

Source repo for original 3 models: laion/tts-comb-artefact-detectors

Augmentation Artifact Detectors

Training data: 6,000 samples per augmentation (3,000 clean + 3,000 corrupted). Source audio: random podcast recordings from a 48 kHz speech corpus, downsampled to 16 kHz. Each augmentation uses variable magnitude (uniformly sampled across its range) so the classifier learns the artifact signature at all intensities rather than only detecting extreme cases.
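
Variable-magnitude corruption can be sketched as drawing each parameter uniformly from its documented range before the augmentation is applied. In the example below the ranges come from the Corruption Details column. The parameter key names, the equal-probability choice between the two phase-vocoder intervals, and the continuous sampling of the MP3 bitrate are assumptions made for illustration.

```python
import random

# Magnitude ranges per augmentation; each parameter maps to one or more (low, high) intervals.
RANGES = {
    "spectral_denoising": {"prop_decrease": [(0.3, 1.0)]},
    "pitch_correction": {"strength": [(0.3, 1.0)]},
    "codec_compression": {"bitrate_kbps": [(8.0, 48.0)]},
    "comb_filtering": {"delay_ms": [(0.5, 8.0)], "wet": [(0.3, 0.95)]},
    "phase_vocoder": {"rate": [(0.6, 0.85), (1.2, 1.6)]},  # slow down or speed up
    "bandwidth_limitation": {"cutoff_khz": [(2.0, 6.0)]},
    "clipping_distortion": {"drive_db": [(3.0, 25.0)]},
}


def sample_params(name, rng):
    """Pick one interval per parameter, then draw uniformly inside it."""
    params = {}
    for key, intervals in RANGES[name].items():
        low, high = rng.choice(intervals)
        params[key] = rng.uniform(low, high)
    return params


rng = random.Random(0)
for name in RANGES:
    print(name, sample_params(name, rng))
```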

Training setup: 30 epochs max with patience=10 early stopping, batch size 32, AdamW with lr=3e-4 and weight_decay=1e-4, gradient clipping at 1.0, 90/10 train/val split.
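
The stopping rule in this setup (30 epochs at most, patience of 10) can be sketched as follows. Monitoring validation accuracy is an assumption; the source does not say which metric drives early stopping, and the function name is made up for this example.

```python
def best_epoch_with_patience(val_accs, patience=10, max_epochs=30):
    """Return (best_epoch, last_epoch_run), both 1-indexed, for a list of validation accuracies."""
    best_acc, best_epoch = float("-inf"), 0
    last_epoch = 0
    for epoch, acc in enumerate(val_accs[:max_epochs], start=1):
        last_epoch = epoch
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch  # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            break                              # no improvement for `patience` epochs
    return best_epoch, last_epoch


# Made-up accuracy curve that peaks at epoch 3 and then plateaus.
curve = [90.0, 99.0, 100.0] + [99.5] * 27
print(best_epoch_with_patience(curve))  # (3, 13)
```

A detector that peaks early stops about ten epochs later, while one that keeps improving late in training, as reported for clipping_distortion, runs close to the 30-epoch cap.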

Training data repo: TTS-AGI/augmentation-artifact-detector-data

Convergence: Most detectors converge very early (epoch 1-4) with near-perfect accuracy. comb_filtering took longest (epoch 11) and clipping_distortion was the hardest task (98.33% at epoch 28), likely because soft clipping at low drive levels produces subtle harmonic distortion that overlaps with natural speech harmonics.

Checkpoint Format

Each .pt file is a PyTorch checkpoint dict:

{
    "model_state_dict": ...,        # Model weights
    "architecture": str,             # "stft_classifier", "waveform_1d", "mel_classifier",
                                     # or "MultiResSTFTClassifier" (for augmentation detectors)
    "epoch": int,                    # Best epoch number
    "val_acc": float,                # Validation accuracy (%)
    "val_f1": float,                 # Validation F1 score
    "val_prec": float,               # Validation precision
    "val_rec": float,                # Validation recall
    "n_params": int,                 # Number of trainable parameters
    "input_sr": int,                 # Expected input sample rate (16000)
    # Augmentation detectors also have:
    "augmentation": str,             # Augmentation type name
    "config": dict,                  # Augmentation configuration
}
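
Given this layout, a loaded checkpoint dict can be sanity-checked before use. The helper below is a hypothetical sketch and not part of the package. It runs on a stand-in dict whose epoch and metric values are placeholders; in practice the dict would come from something like `torch.load("codec_compression.pt", map_location="cpu")`.

```python
REQUIRED_KEYS = {"model_state_dict", "architecture", "epoch", "val_acc",
                 "val_f1", "val_prec", "val_rec", "n_params", "input_sr"}
AUGMENTATION_KEYS = {"augmentation", "config"}


def describe_checkpoint(ckpt):
    """Validate the documented keys and return a one-line summary."""
    missing = REQUIRED_KEYS - set(ckpt)
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")
    if ckpt["input_sr"] != 16000:
        raise ValueError(f"unexpected input sample rate: {ckpt['input_sr']}")
    is_aug = AUGMENTATION_KEYS <= set(ckpt)
    name = ckpt["augmentation"] if is_aug else ckpt["architecture"]
    kind = "augmentation" if is_aug else "tts_artifact"
    return (f"{name} ({kind}): epoch {ckpt['epoch']}, "
            f"val_acc {ckpt['val_acc']:.2f}%, {ckpt['n_params']:,} params")


# Stand-in checkpoint: epoch, F1, precision, recall, and config are placeholder values.
example = {
    "model_state_dict": {},
    "architecture": "MultiResSTFTClassifier",
    "epoch": 3,
    "val_acc": 100.0,
    "val_f1": 1.0,
    "val_prec": 1.0,
    "val_rec": 1.0,
    "n_params": 1_688_833,
    "input_sr": 16000,
    "augmentation": "codec_compression",
    "config": {},
}
print(describe_checkpoint(example))
```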

Use Cases

  1. TTS quality filtering: Score generated speech before release. High scores on TTS detectors indicate vocoder artifacts.
  2. Audio processing pipeline QA: Run all 7 augmentation detectors to identify which processing step introduced artifacts.
  3. Dataset cleaning: Filter training data to remove samples with processing artifacts that could degrade downstream model quality.
  4. Codec quality assessment: Use codec_compression to detect over-compressed audio in datasets.
  5. Vocoder fine-tuning loss: Use as a differentiable training signal to penalize artifact generation during vocoder training.
  6. Voice cloning detection: The TTS detectors can distinguish real from synthesized speech (though this is not their primary purpose).
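
Use cases 2 and 3 amount to post-processing per-file score dictionaries. A minimal sketch, assuming the scores have the name-to-score shape used in Quick Start; the `triage` helper and the example scores are hypothetical, and the 0.5 threshold simply mirrors the flag used in the Quick Start loop:

```python
def triage(scores_by_file, threshold=0.5):
    """Split files into kept and rejected; for rejects, report the highest-scoring detector."""
    kept, rejected = [], {}
    for path, scores in scores_by_file.items():
        worst = max(scores, key=scores.get)
        if scores[worst] > threshold:
            rejected[path] = (worst, scores[worst])
        else:
            kept.append(path)
    return kept, rejected


# Made-up scores for two files and two detectors.
scores_by_file = {
    "a.wav": {"codec_compression": 0.02, "comb_filtering": 0.01},
    "b.wav": {"codec_compression": 0.97, "comb_filtering": 0.10},
}
kept, rejected = triage(scores_by_file)
print(kept)      # ['a.wav']
print(rejected)  # {'b.wav': ('codec_compression', 0.97)}
```

The right threshold depends on how costly false rejections are for the dataset at hand, so it is worth checking against a few known-clean files first.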

Requirements

torch>=2.0
torchaudio>=2.0
soundfile
numpy
huggingface_hub  # only for load_from_hub()

License

Apache 2.0

Citation

@misc{laion-speech-artifact-detectors,
    title={Speech Artifact Detectors: 10 CNN Classifiers for Audio Processing Artifacts},
    author={LAION},
    year={2025},
    url={https://huggingface.co/laion/speech-artifact-detectors}
}