Audio Anomaly Detection using PCEN + CNN Ensemble

This repository contains a deep learning system for multi-class audio anomaly detection across industrial machines. The model uses Per-Channel Energy Normalization (PCEN) features and a convolutional neural network (CNN) ensemble trained with group-aware cross-validation.

Overview

The system classifies audio recordings into six classes:

Class	Description
0	Machine 1 - Normal
1	Machine 1 - Abnormal
2	Machine 2 - Normal
3	Machine 2 - Abnormal
4	Machine 3 - Normal
5	Machine 3 - Abnormal
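
The class indices in the table follow a regular pattern (two consecutive indices per machine, even for normal and odd for abnormal), so a predicted index can be decoded with a small helper. The function name `decode_class` is hypothetical and not part of this repository.

```python
def decode_class(class_id):
    """Map a class index (0-5) to the label used in the table above."""
    if not 0 <= class_id <= 5:
        raise ValueError(f"class_id must be in 0..5, got {class_id}")
    machine = class_id // 2 + 1                       # 0,1 -> 1; 2,3 -> 2; 4,5 -> 3
    state = "Abnormal" if class_id % 2 else "Normal"  # odd indices are abnormal
    return f"Machine {machine} - {state}"
```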

The design emphasizes robustness to varying recording conditions and avoids data leakage through group-aware validation.

Pipeline Architecture

1. Preprocessing

Audio resampled to 16 kHz
Converted to mono
Silence trimming using a 30 dB threshold
No amplitude normalization or pre-emphasis

Note: Amplitude normalization and pre-emphasis are intentionally excluded because they interfere with PCEN, which performs its own adaptive gain control.
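
The mono conversion and trimming steps can be sketched as follows. This is a minimal NumPy illustration, not the repository's code: the function names, frame length, and hop length are assumptions, resampling to 16 kHz is left out, and the 30 dB threshold is read as "30 dB below the loudest frame" (the convention used by `librosa.effects.trim` for its `top_db` argument).

```python
import numpy as np

def to_mono(audio):
    """Average channels; accepts shape (samples,) or (channels, samples)."""
    audio = np.asarray(audio, dtype=np.float32)
    return audio if audio.ndim == 1 else audio.mean(axis=0)

def trim_silence(y, top_db=30.0, frame_length=2048, hop_length=512):
    """Drop leading/trailing frames more than top_db below the loudest frame."""
    if len(y) < frame_length:
        return y  # too short to frame; leave untouched
    n_frames = 1 + (len(y) - frame_length) // hop_length
    rms = np.array([
        np.sqrt(np.mean(y[i * hop_length:i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    # Frame level in dB relative to the loudest frame (which sits at 0 dB).
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / max(rms.max(), 1e-10))
    keep = np.flatnonzero(db > -top_db)
    start = keep[0] * hop_length
    end = min(len(y), keep[-1] * hop_length + frame_length)
    return y[start:end]
```

Note that no peak normalization or pre-emphasis filter is applied anywhere in this sketch, in line with the note above.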

2. Feature Extraction (PCEN)

Mel spectrogram:
- 128 Mel bands
- Power = 2.0
PCEN parameters:
- Gain (α): 0.98
- Bias (δ): 2
- Compression (r): 0.5
- Time constant (T): 0.400 s
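
To make the role of these parameters concrete: PCEN divides each mel-band energy E by a smoothed copy of itself, M, raised to the power α, then applies root compression with bias δ and exponent r. The sketch below is a plain NumPy illustration under assumptions: the hop length of 512 samples, the `eps` constant, the initialization of the smoother, and the mapping from the time constant to a smoothing coefficient are not stated in this README, and the training code may rely on a library implementation instead.

```python
import numpy as np

def pcen(E, sr=16000, hop_length=512, alpha=0.98, delta=2.0, r=0.5,
         time_constant=0.400, eps=1e-6):
    """Apply PCEN to a non-negative mel spectrogram E of shape (n_mels, n_frames)."""
    E = np.asarray(E, dtype=np.float64)
    # Assumption: convert the time constant (seconds) to frames, then to an
    # IIR smoothing coefficient s in (0, 1).
    t_frames = time_constant * sr / hop_length
    s = (np.sqrt(1.0 + 4.0 * t_frames ** 2) - 1.0) / (2.0 * t_frames ** 2)
    # Smoothed energy per mel band: M[t] = (1 - s) * M[t-1] + s * E[t].
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1.0 - s) * M[:, t - 1] + s * E[:, t]
    # Adaptive gain control followed by root compression with bias delta.
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```

Because α is close to 1, scaling the input by a constant changes the output only slightly, which is why a separate amplitude normalization step is unnecessary.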

Output processing:

Min-max normalization to [0, 1]
Resized to (128 × 128) using:
- Center cropping if too long
- Zero-padding if too short
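
The output processing above can be sketched as a single function. It assumes the mel axis already has 128 rows so that only the time axis is cropped or padded, and it pads on the right; the padding side and the function name `to_fixed_size` are assumptions, not details confirmed by this README.

```python
import numpy as np

def to_fixed_size(feat, n_frames=128):
    """Min-max scale to [0, 1], then center-crop or zero-pad the time axis."""
    feat = np.asarray(feat, dtype=np.float32)
    lo, hi = feat.min(), feat.max()
    # Guard against a constant input, which would otherwise divide by zero.
    feat = (feat - lo) / (hi - lo) if hi > lo else np.zeros_like(feat)
    t = feat.shape[1]
    if t > n_frames:
        start = (t - n_frames) // 2          # too long: keep the central frames
        return feat[:, start:start + n_frames]
    # Too short (or exact): append zero frames on the right.
    return np.pad(feat, ((0, 0), (0, n_frames - t)))
```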

3. Model Training

CNN architecture:

Three convolutional blocks:
- Conv → BatchNorm → ReLU → Pooling
- Channels: 16 → 32 → 64
Final layers:
- Adaptive average pooling
- Dropout (30%)
- Fully connected layer (6 outputs)
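
A PyTorch sketch of a network with this shape is shown below. Only the block order, channel widths, dropout rate, and output size come from the description above; the 3 × 3 kernels, the padding, the use of 2 × 2 max pooling, and the class name `SmallCNN` are assumptions, so weights from this repository may not load into it.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Three Conv-BatchNorm-ReLU-Pool blocks followed by a linear classifier."""

    def __init__(self, n_classes=6, dropout=0.3):
        super().__init__()

        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # assumed kernel
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),                                   # assumed pooling
            )

        self.features = nn.Sequential(block(1, 16), block(16, 32), block(32, 64))
        self.pool = nn.AdaptiveAvgPool2d(1)   # (N, 64, H, W) -> (N, 64, 1, 1)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.fc(self.dropout(x))       # raw logits, no softmax
```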

Training configuration:

Stratified Group K-Fold (3 folds)
Class-weighted cross-entropy loss
Adam optimizer:
- Learning rate: 1e-4
- Weight decay: 1e-4
Early stopping (patience = 8)

Data augmentation:

Mixup (p = 0.25, Beta(0.2, 0.2))
Time and frequency masking
Gaussian noise injection
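
As an illustration of the mixup setting, the sketch below mixes a batch with probability p = 0.25 using a coefficient drawn from Beta(0.2, 0.2). Applying the decision once per batch (rather than per sample) and the function name `mixup_batch` are assumptions.

```python
import numpy as np

def mixup_batch(x, y, p=0.25, alpha=0.2, rng=None):
    """Return (mixed_x, y_a, y_b, lam); lam == 1.0 means the batch is unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() >= p:
        return x, y, y, 1.0                 # skip mixup for this batch
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))          # partner sample for each item
    mixed = lam * x + (1.0 - lam) * x[perm]
    return mixed, y, y[perm], lam
```

The training loss for a mixed batch is then `lam * loss(pred, y_a) + (1 - lam) * loss(pred, y_b)`.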

4. Ensemble Inference

Three fold-specific models are used
Predictions are combined using soft voting on the raw logits:
- Logits are averaged across models
- Final prediction is obtained via argmax

Averaging over the three fold models is intended to reduce prediction variance and improve generalization relative to any single fold model.
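
The averaging step described above amounts to a mean over the model axis followed by an argmax. A minimal NumPy sketch, assuming the per-model logits have already been collected into one array (the function name `ensemble_predict` is hypothetical):

```python
import numpy as np

def ensemble_predict(fold_logits):
    """fold_logits has shape (n_models, n_samples, n_classes)."""
    mean_logits = np.mean(np.asarray(fold_logits, dtype=np.float64), axis=0)
    return mean_logits.argmax(axis=1)       # one class index per sample
```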

Performance

Cross-validation (3 folds)

Fold	Accuracy
1	89.54%
2	91.24%
3	89.76%
Mean	90.18%

Test Set Results

Accuracy: 92.55%

Class-wise observations:

Strong performance on normal classes
Abnormal classes show more variability
Machine 2 abnormal class is the most challenging, which is attributed to limited data and spectral overlap with other classes

Inference

Default usage

Run the script directly:

python infer.py

This assumes the following structure:

├── infer.py
├── models/
│   ├── cnn_fold_2_acc_0.9124.pth
│   ├── data/
│   │   ├── 1.wav
│   │   ├── 2.wav
│   │   └── ...

Custom paths

You can override all paths via CLI arguments:

python infer.py \
  --data_dir /path/to/wavs \
  --model_path /path/to/model.pth \
  --results /path/to/results.txt \
  --times /path/to/time.txt

Input requirements

.wav audio files only
Files must be numerically named (e.g. 1.wav, 2.wav, ..., 10.wav)
Any duration (silence is trimmed automatically and features are cropped or padded to 128 × 128)
Audio is resampled internally to 16 kHz
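
Numeric file names matter for ordering: plain lexicographic sorting would place 10.wav before 2.wav. The sketch below shows one way to list inputs in numeric order. Whether `infer.py` orders files this way is not stated here, and the helper name `list_wavs` is hypothetical.

```python
from pathlib import Path

def list_wavs(data_dir):
    """Return .wav files ordered by their integer file name (1, 2, ..., 10)."""
    wavs = [p for p in Path(data_dir).iterdir()
            if p.suffix.lower() == ".wav" and p.stem.isdigit()]
    return sorted(wavs, key=lambda p: int(p.stem))
```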

Output

results.txt: one predicted class per line
time.txt: inference time (seconds) per file
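
A sketch of how the two output files could be produced is given below. It assumes that line i of each file corresponds to the i-th input and that times are written with six decimal places; both details, along with the names `run_and_log` and `predict_fn`, are illustrative rather than taken from `infer.py`.

```python
import time

def run_and_log(items, predict_fn, results_path="results.txt",
                times_path="time.txt"):
    """Write one predicted class and one elapsed time (seconds) per input."""
    with open(results_path, "w") as res, open(times_path, "w") as tim:
        for item in items:
            start = time.perf_counter()
            label = predict_fn(item)         # hypothetical per-file predictor
            elapsed = time.perf_counter() - start
            res.write(f"{int(label)}\n")
            tim.write(f"{elapsed:.6f}\n")
```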

Model Details

Input shape: (1, 128, 128)
Output: 6-class logits

Architecture:

Conv2D → BatchNorm → ReLU → Pooling (×3)
Adaptive average pooling
Fully connected classification layer

Repository Structure


├── models/
│   ├── cnn_fold_*.pth
│   ├── results.txt
│   └── time.txt
├── infer.py
├── requirements.txt
└── README.md

Important Notes

Preprocessing must match training exactly
Do not apply amplitude normalization or pre-emphasis
PCEN relies on raw signal dynamics

Limitations

Lower performance on Machine 2 abnormal class
Some inter-machine spectral overlap remains
Performance depends on consistency of recording conditions

Future Work

Improve minority class representation
Explore alternative model architectures
Enhance robustness to domain shifts
Investigate transformer-based audio models
