Audio Anomaly Detection using PCEN + CNN Ensemble

This repository contains a deep learning system for multi-class audio anomaly detection across industrial machines. The model uses Per-Channel Energy Normalization (PCEN) features and a convolutional neural network (CNN) ensemble trained with group-aware cross-validation.


Overview

The system classifies audio recordings into six classes:

Class   Description
0       Machine 1 - Normal
1       Machine 1 - Abnormal
2       Machine 2 - Normal
3       Machine 2 - Abnormal
4       Machine 3 - Normal
5       Machine 3 - Abnormal

The design emphasizes robustness to varying recording conditions and avoids data leakage through group-aware validation.


Pipeline Architecture

1. Preprocessing

  • Audio resampled to 16 kHz
  • Converted to mono
  • Silence trimming using a 30 dB threshold
  • No amplitude normalization or pre-emphasis

Note: Amplitude normalization and pre-emphasis are intentionally excluded because they interfere with PCEN, which performs its own adaptive gain control.


2. Feature Extraction (PCEN)

  • Mel spectrogram:
    • 128 Mel bands
    • Power = 2.0
  • PCEN parameters:
    • Gain (α): 0.98
    • Bias (δ): 2
    • Compression (r): 0.5
    • Time constant (T): 0.400 s
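
To show what each parameter does, here is a minimal NumPy sketch of PCEN. It is an illustration under stated assumptions, not the repository's code: the name `pcen_sketch`, the `hop_length` of 512, the `eps` value, and the choice to start the smoother from the first frame are all assumptions, so the output will not match any particular library exactly.

```python
import numpy as np


def pcen_sketch(mel_power, sr=16000, hop_length=512, gain=0.98, bias=2.0,
                power=0.5, time_constant=0.4, eps=1e-6):
    """Apply PCEN to a non-negative, non-empty mel spectrogram (n_mels, n_frames)."""
    mel_power = np.asarray(mel_power, dtype=np.float64)

    # Convert the time constant from seconds to frames, then to the
    # coefficient of a first-order IIR low-pass filter.
    t_frames = time_constant * sr / hop_length
    s = (np.sqrt(1.0 + 4.0 * t_frames ** 2) - 1.0) / (2.0 * t_frames ** 2)

    # Smooth each mel band over time: M[t] = (1 - s) * M[t-1] + s * E[t].
    smooth = np.empty_like(mel_power)
    smooth[:, 0] = mel_power[:, 0]  # assumption: start from the first frame
    for t in range(1, mel_power.shape[1]):
        smooth[:, t] = (1.0 - s) * smooth[:, t - 1] + s * mel_power[:, t]

    # Adaptive gain control (divide by the smoothed energy), followed by
    # root compression with an offset.
    return (mel_power / (eps + smooth) ** gain + bias) ** power - bias ** power
```

Because the gain stage divides each band by its own recent energy, rescaling the input beforehand changes what the smoother sees, which is why amplitude normalization is left out of preprocessing.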

Output processing:

  • Min-max normalization to [0, 1]
  • Resized to (128 × 128) using:
    • Center cropping if too long
    • Zero-padding if too short
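
The output processing above can be sketched in NumPy as follows. With 128 Mel bands the frequency axis is already the right size, so this sketch only adjusts the time axis. The function names, the small `eps` guard against a constant input, and padding on the right side only are assumptions; the list above does not say where the zeros go.

```python
import numpy as np


def minmax_normalize(x, eps=1e-8):
    """Scale an array to [0, 1]; eps avoids division by zero for constant input."""
    x = np.asarray(x, dtype=np.float32)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + eps)


def fit_time_axis(x, target_frames=128):
    """Center-crop or zero-pad a (n_mels, n_frames) array along time."""
    n_frames = x.shape[1]
    if n_frames > target_frames:
        start = (n_frames - target_frames) // 2  # keep the middle section
        return x[:, start:start + target_frames]
    pad = target_frames - n_frames
    return np.pad(x, ((0, 0), (0, pad)))  # assumption: zeros appended on the right


# Example: a 128-band spectrogram with 200 frames becomes 128 x 128.
features = fit_time_axis(minmax_normalize(np.random.rand(128, 200)))
```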

3. Model Training

CNN architecture:

  • Three convolutional blocks:
    • Conv → BatchNorm → ReLU → Pooling
    • Channels: 16 → 32 → 64
  • Final layers:
    • Adaptive average pooling
    • Dropout (30%)
    • Fully connected layer (6 outputs)
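
A PyTorch sketch of a network with this shape is given below. The layer order, channel widths, dropout rate, and output size follow the list above; the class name `SmallAudioCNN`, the 3 × 3 kernels, the padding, and the use of 2 × 2 max pooling are assumptions, since those details are not stated here.

```python
import torch
from torch import nn


def conv_block(in_ch, out_ch):
    # Conv -> BatchNorm -> ReLU -> Pooling.
    # Kernel size, padding and pooling type are assumptions.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )


class SmallAudioCNN(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 16), conv_block(16, 32), conv_block(32, 64),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.dropout = nn.Dropout(p=0.3)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):
        x = self.features(x)             # (B, 64, 16, 16) for a 128 x 128 input
        x = self.pool(x).flatten(1)      # (B, 64)
        return self.fc(self.dropout(x))  # (B, n_classes) logits


model = SmallAudioCNN().eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 1, 128, 128))  # batch of two dummy inputs
```

Adaptive average pooling makes the classifier independent of the spatial size left after the three pooling stages, so the fully connected layer only ever sees 64 values per example.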

Training configuration:

  • Stratified Group K-Fold (3 folds)
  • Class-weighted cross-entropy loss
  • Adam optimizer:
    • Learning rate: 1e-4
    • Weight decay: 1e-4
  • Early stopping (patience = 8)
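
How the class weights are derived is not stated here. One common choice, shown below as an assumption, is inverse-frequency ("balanced") weighting, where class c gets weight N / (K * n_c) for N samples, K classes, and n_c samples in class c. The function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels, n_classes=6):
    """Return one weight per class, larger for rarer classes."""
    counts = Counter(labels)
    total = len(labels)
    # Classes absent from `labels` get weight 0.0 to avoid dividing by zero.
    return [
        total / (n_classes * counts[c]) if counts[c] else 0.0
        for c in range(n_classes)
    ]


# Toy example with three classes of sizes 6, 2 and 4.
toy_labels = [0] * 6 + [1] * 2 + [2] * 4
weights = balanced_class_weights(toy_labels, n_classes=3)
```

In PyTorch, a list like this could be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.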

Data augmentation:

  • Mixup (p = 0.25, Beta(0.2, 0.2))
  • Time and frequency masking
  • Gaussian noise injection
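
As an illustration of the Mixup setting, the NumPy sketch below mixes a batch with a shuffled copy of itself. It assumes that p = 0.25 is applied per batch, that one mixing coefficient is drawn per batch, and that labels are one-hot so they can be blended; the function name is hypothetical and the actual training code may differ on each point.

```python
import numpy as np


def maybe_mixup(x, y_onehot, rng, p=0.25, alpha=0.2):
    """x: (B, ...) batch, y_onehot: (B, n_classes). Returns a possibly mixed batch."""
    if rng.random() >= p:
        return x, y_onehot                # most batches are left untouched
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))        # partner example for each item
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix


rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 1, 128, 128))
targets = np.eye(6)[rng.integers(0, 6, size=8)]
mixed_x, mixed_y = maybe_mixup(batch, targets, rng)
```

With Beta(0.2, 0.2) most draws fall close to 0 or 1, so a mixed example usually stays dominated by one of its two sources.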

4. Ensemble Inference

  • Three fold-specific models are used
  • Predictions are combined using soft voting:
    • Logits are averaged across models
    • Final prediction is obtained via argmax

This reduces variance and improves generalization compared to a single model.
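
The voting rule can be written in a few lines of plain Python. The function name and the logit values are made up for illustration.

```python
def ensemble_predict(per_model_logits):
    """per_model_logits: one list of class logits per model, for a single input."""
    n_models = len(per_model_logits)
    n_classes = len(per_model_logits[0])
    # Average each class logit across the models.
    mean_logits = [
        sum(model[c] for model in per_model_logits) / n_models
        for c in range(n_classes)
    ]
    # argmax over the averaged logits
    predicted = max(range(n_classes), key=mean_logits.__getitem__)
    return predicted, mean_logits


example_logits = [
    [2.0, 0.1, 0.0, 0.0, 0.0, 0.0],  # model 1: confident in class 0
    [0.5, 1.0, 0.0, 0.0, 0.0, 0.0],  # model 2: mild preference for class 1
    [0.4, 0.9, 0.0, 0.0, 0.0, 0.0],  # model 3: mild preference for class 1
]
predicted, mean_logits = ensemble_predict(example_logits)  # predicted == 0
```

In this example a majority vote over the individual argmaxes would pick class 1, while averaging the logits picks class 0 because one model is much more confident than the other two.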


Performance

Cross-validation (3 folds)

Fold    Accuracy
1       89.54%
2       91.24%
3       89.76%
Mean    90.18%

Test Set Results

  • Accuracy: 92.55%

Class-wise observations:

  • Strong performance on normal classes
  • Abnormal classes show more variability
  • The Machine 2 abnormal class is the most challenging, owing to limited training data and spectral overlap with other classes

Inference

Default usage

Run the script directly:

python infer.py

This assumes the following structure:

├── infer.py
├── models/
│   ├── cnn_fold_2_acc_0.9124.pth
│   ├── data/
│   │   ├── 1.wav
│   │   ├── 2.wav
│   │   └── ...

Custom paths

You can override all paths via CLI arguments:

python infer.py \
  --data_dir /path/to/wavs \
  --model_path /path/to/model.pth \
  --results /path/to/results.txt \
  --times /path/to/time.txt
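
For readers adapting the script, the four flags could be declared with `argparse` roughly as follows. The flag names are the ones shown above; the default values and help strings are placeholders, not the defaults used by `infer.py`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Run inference on a folder of .wav files."
    )
    # Defaults below are placeholders chosen for this sketch.
    parser.add_argument("--data_dir", default="data",
                        help="folder containing the .wav files")
    parser.add_argument("--model_path", default="model.pth",
                        help="path to a trained checkpoint")
    parser.add_argument("--results", default="results.txt",
                        help="file for the predicted classes")
    parser.add_argument("--times", default="time.txt",
                        help="file for the per-file inference times")
    return parser


# Any flag that is not supplied falls back to its default.
args = build_parser().parse_args(["--data_dir", "/path/to/wavs"])
```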

Input requirements

  • .wav audio files only
  • Files must be numerically named (e.g. 1.wav, 2.wav, ..., 10.wav)
  • Any duration (automatically trimmed and resized)
  • Audio is resampled internally to 16 kHz
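
Numeric names matter if files are processed in numeric order, because a plain string sort places 10.wav before 2.wav. The sketch below shows one way to order files by the integer in their name; the helper name is hypothetical and the ordering used by the actual script is an assumption.

```python
from pathlib import Path


def numeric_order(paths):
    """Sort paths by the integer value of their file stem (1, 2, ..., 10)."""
    return sorted((Path(p) for p in paths), key=lambda p: int(p.stem))


names = ["10.wav", "2.wav", "1.wav"]
by_string = sorted(names)                            # ['1.wav', '10.wav', '2.wav']
by_number = [p.name for p in numeric_order(names)]   # ['1.wav', '2.wav', '10.wav']
```

A file whose stem is not an integer makes `int(p.stem)` raise `ValueError`, which is one way such a script could enforce the naming rule.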

Output

  • results.txt: one predicted class per line
  • time.txt: inference time (seconds) per file
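
A minimal sketch of how the two files could be produced is shown below. The function name, the `predict` callable, and the six-decimal time format are assumptions made for illustration; only the one-line-per-file layout comes from the description above.

```python
import time


def run_and_record(paths, predict, results_path="results.txt", times_path="time.txt"):
    """Write one predicted class and one elapsed time (seconds) per input file.

    `predict` is a hypothetical callable mapping a file path to a class index.
    """
    with open(results_path, "w") as f_res, open(times_path, "w") as f_time:
        for path in paths:
            start = time.perf_counter()
            label = predict(path)
            elapsed = time.perf_counter() - start
            f_res.write(f"{label}\n")
            f_time.write(f"{elapsed:.6f}\n")
```

Writing both files in the same loop keeps line i of `results.txt` and line i of `time.txt` tied to the same input file.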

Model Details

  • Input shape: (1, 128, 128)
  • Output: 6-class logits

Architecture:

  • Conv2D → BatchNorm → ReLU → Pooling (×3)
  • Adaptive average pooling
  • Fully connected classification layer

Repository Structure


├── models/
│   ├── cnn_fold_*.pth
│   ├── results.txt
│   └── time.txt
├── infer.py
├── requirements.txt
└── README.md

Important Notes

  • Preprocessing must match training exactly
  • Do not apply amplitude normalization or pre-emphasis
  • PCEN relies on raw signal dynamics

Limitations

  • Lower performance on Machine 2 abnormal class
  • Some inter-machine spectral overlap remains
  • Performance depends on consistency of recording conditions

Future Work

  • Improve minority class representation
  • Explore alternative model architectures
  • Enhance robustness to domain shifts
  • Investigate transformer-based audio models