Audio Anomaly Detection using PCEN + CNN Ensemble
This repository contains a deep learning system for multi-class audio anomaly detection across industrial machines. The model uses Per-Channel Energy Normalization (PCEN) features and a convolutional neural network (CNN) ensemble trained with group-aware cross-validation.
Overview
The system classifies audio recordings into six classes:
| Class | Description |
|---|---|
| 0 | Machine 1 - Normal |
| 1 | Machine 1 - Abnormal |
| 2 | Machine 2 - Normal |
| 3 | Machine 2 - Abnormal |
| 4 | Machine 3 - Normal |
| 5 | Machine 3 - Abnormal |
The design emphasizes robustness to varying recording conditions and avoids data leakage through group-aware validation.
Pipeline Architecture
1. Preprocessing
- Audio resampled to 16 kHz
- Converted to mono
- Silence trimming using a 30 dB threshold
- No amplitude normalization or pre-emphasis
Note: Amplitude normalization and pre-emphasis are intentionally excluded because they interfere with PCEN, which performs its own adaptive gain control.
2. Feature Extraction (PCEN)
- Mel spectrogram:
  - 128 Mel bands
  - Power = 2.0
- PCEN parameters:
  - Gain (α): 0.98
  - Bias (δ): 2
  - Compression (r): 0.5
  - Time constant (T): 0.400 s
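To make the role of each parameter concrete, here is a from-scratch NumPy sketch of the standard PCEN formula. It is illustrative only: the actual pipeline most likely calls a library routine, the hop length of 512 is an assumption (it is not stated in this README), and the conversion from the time constant to a smoothing coefficient is one common choice rather than a confirmed detail of this project.

```python
import numpy as np


def pcen(mel_power, sr=16_000, hop_length=512, gain=0.98, bias=2.0,
         power=0.5, time_constant=0.400, eps=1e-6):
    """PCEN over a (n_mels, n_frames) non-negative Mel power spectrogram.

    PCEN = (E / (eps + M)**gain + bias)**power - bias**power
    where M is a per-band exponential moving average of E over time.
    """
    E = np.asarray(mel_power, dtype=np.float64)

    # Time constant in frames -> first-order IIR smoothing coefficient
    # (hop_length is an assumed value)
    t_frames = time_constant * sr / hop_length
    s = (np.sqrt(1.0 + 4.0 * t_frames**2) - 1.0) / (2.0 * t_frames**2)

    # Smoothed energy M[t] = (1 - s) * M[t-1] + s * E[t], seeded with E[0]
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1.0 - s) * M[:, t - 1] + s * E[:, t]

    # Adaptive gain control followed by root compression
    return (E / (eps + M) ** gain + bias) ** power - bias**power
```

Because the gain term divides each band by its own smoothed energy, a global rescaling of the input is largely cancelled out, which is why separate amplitude normalization is unnecessary.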
Output processing:
- Min-max normalization to [0, 1]
- Resized to (128 × 128) using:
  - Center cropping if too long
  - Zero-padding if too short
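A small NumPy sketch of this output step follows. The helper names are hypothetical, and padding on the right is an assumption, since the side on which zeros are added is not specified above. With 128 Mel bands the height is already 128, so only the time axis is adjusted.

```python
import numpy as np


def minmax_normalize(x, eps=1e-8):
    """Scale an array to [0, 1]; eps guards against a constant input."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + eps)


def fit_to_width(x, target=128):
    """Center-crop or zero-pad the time axis of a (n_mels, n_frames) array."""
    n_frames = x.shape[1]
    if n_frames >= target:
        start = (n_frames - target) // 2
        return x[:, start:start + target]
    out = np.zeros((x.shape[0], target), dtype=x.dtype)
    out[:, :n_frames] = x  # assumed: original frames first, zeros after
    return out
```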
3. Model Training
CNN architecture:
- Three convolutional blocks:
  - Conv → BatchNorm → ReLU → Pooling
  - Channels: 16 → 32 → 64
- Final layers:
  - Adaptive average pooling
  - Dropout (30%)
  - Fully connected layer (6 outputs)
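The architecture described above can be sketched in PyTorch as follows. The class name `SmallCNN`, the 3×3 kernels, and the use of 2×2 max pooling are assumptions made for illustration; only the block order, channel widths, dropout rate, and output size come from this README.

```python
import torch
import torch.nn as nn


def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    """Conv -> BatchNorm -> ReLU -> Pooling."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # kernel size assumed
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),  # pooling type and size assumed
    )


class SmallCNN(nn.Module):
    def __init__(self, n_classes: int = 6, dropout: float = 0.3):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 16),
            conv_block(16, 32),
            conv_block(32, 64),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)          # (B, 64, H/8, W/8)
        x = self.pool(x).flatten(1)   # (B, 64)
        return self.fc(self.dropout(x))  # raw logits, shape (B, n_classes)
```

Adaptive average pooling makes the classifier head independent of the spatial size left after the three pooling stages.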
Training configuration:
- Stratified Group K-Fold (3 folds)
- Class-weighted cross-entropy loss
- Adam optimizer:
  - Learning rate: 1e-4
  - Weight decay: 1e-4
- Early stopping (patience = 8)
Data augmentation:
- Mixup (p = 0.25, Beta(0.2, 0.2))
- Time and frequency masking
- Gaussian noise injection
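As an illustration of the Mixup setting listed above, here is a NumPy sketch. It assumes the probability is applied once per batch and that a single mixing coefficient is drawn per batch; neither detail is stated in this README, and the function name is hypothetical.

```python
import numpy as np


def mixup_batch(x, y_onehot, rng, p=0.25, alpha=0.2):
    """With probability p, blend the batch with a shuffled copy of itself.

    x:        (B, ...) input features
    y_onehot: (B, n_classes) one-hot or soft labels
    rng:      numpy.random.Generator
    """
    if rng.random() >= p:
        return x, y_onehot  # batch left unchanged
    lam = rng.beta(alpha, alpha)   # mixing coefficient ~ Beta(0.2, 0.2)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix
```

With Beta(0.2, 0.2) most draws fall close to 0 or 1, so a mixed example usually stays dominated by one of its two sources.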
4. Ensemble Inference
- Three fold-specific models are used
- Predictions are combined using soft voting:
  - Logits are averaged across models
  - Final prediction is obtained via argmax
This reduces variance and improves generalization compared to a single model.
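The voting step itself is small. A NumPy sketch, with a hypothetical function name and assuming each model's logits have already been computed as an array of shape (n_samples, 6):

```python
import numpy as np


def ensemble_predict(fold_logits):
    """Average logits over models, then take the argmax per sample."""
    stacked = np.stack(fold_logits, axis=0)  # (n_models, n_samples, n_classes)
    mean_logits = stacked.mean(axis=0)       # (n_samples, n_classes)
    return mean_logits.argmax(axis=1)        # (n_samples,)
```

Averaging softmax probabilities instead of logits is another common form of soft voting; the two can disagree on borderline samples, so inference should use the same choice as described above.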
Performance
Cross-validation (3 folds)
| Fold | Accuracy |
|---|---|
| 1 | 89.54% |
| 2 | 91.24% |
| 3 | 89.76% |
| Mean | 90.18% |
Test Set Results
- Accuracy: 92.55%
Class-wise observations:
- Strong performance on normal classes
- Abnormal classes show more variability
- The Machine 2 abnormal class (class 3) is the most challenging, owing to limited data and inter-machine spectral overlap
Inference
Default usage
Run the script directly:
python infer.py
This assumes the following structure:
├── infer.py
├── models/
│   ├── cnn_fold_2_acc_0.9124.pth
│   ├── data/
│   │   ├── 1.wav
│   │   ├── 2.wav
│   │   └── ...
Custom paths
You can override all paths via CLI arguments:
python infer.py \
--data_dir /path/to/wavs \
--model_path /path/to/model.pth \
--results /path/to/results.txt \
--times /path/to/time.txt
Input requirements
- .wav audio files only
- Files must be numerically named (e.g. 1.wav, 2.wav, ..., 10.wav)
- Any duration (automatically trimmed and resized)
- Audio is resampled internally to 16 kHz
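Numeric file names matter because a plain string sort would place 10.wav before 2.wav. The sketch below shows one way to list inputs in numeric order. It is an assumption that infer.py orders files this way (presumably so that each line of the output can be matched to its file); the helper name is hypothetical.

```python
from pathlib import Path


def list_wavs_numeric(data_dir):
    """Return .wav files whose stem is an integer, sorted by that integer."""
    wavs = [p for p in Path(data_dir).glob("*.wav") if p.stem.isdigit()]
    return sorted(wavs, key=lambda p: int(p.stem))
```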
Output
- results.txt: one predicted class per line
- time.txt: inference time (seconds) per file
Model Details
- Input shape: (1, 128, 128)
- Output: 6-class logits
Architecture:
- Conv2D → BatchNorm → ReLU → Pooling (×3)
- Adaptive average pooling
- Fully connected classification layer
Repository Structure
├── models/
│   ├── cnn_fold_*.pth
│   ├── results.txt
│   └── time.txt
├── infer.py
├── requirements.txt
└── README.md
Important Notes
- Preprocessing must match training exactly
- Do not apply amplitude normalization or pre-emphasis
- PCEN relies on raw signal dynamics
Limitations
- Lower performance on Machine 2 abnormal class
- Some inter-machine spectral overlap remains
- Performance depends on consistency of recording conditions
Future Work
- Improve minority class representation
- Explore alternative model architectures
- Enhance robustness to domain shifts
- Investigate transformer-based audio models