audio-mnist-cnn

A small spectrogram CNN that classifies spoken digits 0–9 from 1 second of 16 kHz mono audio. Trained from scratch on AudioMNIST — specifically the gilkeyio/AudioMNIST mirror on the Hub — with a speaker-disjoint train/val/test split.

Live demo: https://huggingface.co/spaces/fergarciadlc/audio-mnist-demo

Files

| File | Purpose |
| --- | --- |
| `best_model.pt` | PyTorch `state_dict` checkpoint (`{"model_state": ...}`), ~1.7 MB. |
| `config_snapshot.yaml` | Full inlined config used at training time (dataset, features, model). Required by the loader so preprocessing matches exactly. |

Quickstart

import torch
import yaml
from huggingface_hub import hf_hub_download

REPO = "fergarciadlc/audio-mnist-cnn"
ckpt_path = hf_hub_download(REPO, "best_model.pt")
cfg_path = hf_hub_download(REPO, "config_snapshot.yaml")

# The checkpoint is a dict; the weights are stored under "model_state".
state = torch.load(ckpt_path, map_location="cpu")
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)
# Build your model from cfg["model"], then call
# model.load_state_dict(state["model_state"]).

The reference inference loop (load audio → 16 kHz mono → center pad/crop to 1.0 s → log-mel spectrogram → CNN → softmax) lives in the demo Space: https://huggingface.co/spaces/fergarciadlc/audio-mnist-demo/tree/main.
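As a rough illustration of the first stages of that loop, the sketch below downmixes to mono and center-pads or crops a waveform to 1.0 s with NumPy. It is not the Space's code: the helper names `to_mono` and `center_pad_or_crop`, the `(num_samples, num_channels)` array layout, and symmetric zero padding are assumptions, and resampling to 16 kHz is left out.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz, from the model card
CLIP_SECONDS = 1.0     # target clip length, from the model card


def to_mono(audio):
    """Average channels; assumes a (num_samples, num_channels) layout."""
    audio = np.asarray(audio, dtype=np.float32)
    return audio if audio.ndim == 1 else audio.mean(axis=1)


def center_pad_or_crop(wave, target_len):
    """Return exactly target_len samples, keeping the middle of the clip."""
    n = len(wave)
    if n >= target_len:
        start = (n - target_len) // 2
        return wave[start:start + target_len]
    pad_total = target_len - n
    pad_left = pad_total // 2
    # Zero padding on both sides (assumption about the reference pipeline).
    return np.pad(wave, (pad_left, pad_total - pad_left))


target = int(SAMPLE_RATE * CLIP_SECONDS)  # 16000 samples
short = center_pad_or_crop(np.ones(10_000, dtype=np.float32), target)
long_ = center_pad_or_crop(np.arange(20_000, dtype=np.float32), target)
stereo = to_mono(np.ones((4, 2)))
```

Both `short` and `long_` come out with exactly 16,000 samples: the short clip gains 3,000 zeros on each side, and the long clip loses 2,000 samples from each end.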

Model

Architecture: 4 conv blocks (32 → 64 → 128 → 256 channels, 3×3 kernels, BatchNorm + ReLU + MaxPool + Dropout(0.3)) → adaptive average pool → MLP head (128 units, dropout 0.3) → 10-way softmax.
Input: log-mel spectrogram, shape (101 frames, 64 mels), mean-std normalized with fixed stats (mean −40 dB, std 20 dB).
Parameters: ~430k. Checkpoint size ~1.7 MB.
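
For orientation, here is a minimal PyTorch sketch of a network matching the description above. It is a reconstruction under assumptions, not the training code: the class name `SpectrogramCNN`, `padding=1`, 2×2 max pooling, plain `nn.Dropout` rather than `nn.Dropout2d`, and the module names are all guesses, so `best_model.pt` will only load into it if the names and shapes happen to match the real implementation.

```python
import torch
from torch import nn


class SpectrogramCNN(nn.Module):  # hypothetical name
    def __init__(self, n_classes=10, dropout=0.3):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in (32, 64, 128, 256):
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # padding is assumed
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),                                     # pool size is assumed
                nn.Dropout(dropout),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        # x: (batch, 1, frames, mels); returns unnormalized logits.
        return self.head(self.pool(self.features(x)))


model = SpectrogramCNN().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 1, 101, 64))  # two dummy spectrograms
    probs = logits.softmax(dim=-1)              # softmax applied at inference
n_params = sum(p.numel() for p in model.parameters())
```

With these choices the sketch has 422,986 trainable parameters, in the same range as the ~430k quoted above. It returns logits, which is what `nn.CrossEntropyLoss` expects during training.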

Preprocessing

| Stage | Setting |
| --- | --- |
| Sample rate | 16 kHz mono |
| Clip length | Center-pad / crop to exactly 1.0 s |
| STFT | `n_fft=512`, `win_length=400`, `hop_length=160` |
| Mel | `n_mels=64`, `fmin=20`, `fmax=8000`, `power=2.0` |
| Normalization | Fixed: `(S_db − (−40)) / 20` |
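
The STFT, mel, and normalization settings can be sketched in plain NumPy as follows. This is an illustrative approximation rather than the pipeline from `config_snapshot.yaml`: the HTK mel scale, unnormalized triangular filters, periodic Hann window, reflect padding, and the dB reference and floor (`10 * log10(max(S, 1e-10))`) are all assumptions, so values may differ slightly from the reference features.

```python
import numpy as np

SR, N_FFT, WIN, HOP = 16_000, 512, 400, 160   # from the table above
N_MELS, FMIN, FMAX = 64, 20.0, 8000.0         # from the table above
DB_MEAN, DB_STD = -40.0, 20.0                 # fixed normalization stats


def hz_to_mel(f):
    # HTK-style mel scale (assumption).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank():
    """Triangular filters, shape (N_MELS, N_FFT // 2 + 1)."""
    fft_freqs = np.linspace(0.0, SR / 2, N_FFT // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(FMIN), hz_to_mel(FMAX), N_MELS + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs - lower) / (center - lower)
    falling = (upper - fft_freqs) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))


def log_mel(wave):
    """wave: 1-D float array at 16 kHz. Returns (frames, N_MELS), normalized."""
    padded = np.pad(wave, N_FFT // 2, mode="reflect")   # center the frames
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    # Periodic Hann window of WIN samples, zero-padded to N_FFT and centered.
    window = np.zeros(N_FFT)
    off = (N_FFT - WIN) // 2
    window[off:off + WIN] = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(WIN) / WIN)
    power = np.abs(np.fft.rfft(padded[idx] * window, axis=1)) ** 2   # power=2.0
    mel = power @ mel_filterbank().T
    s_db = 10.0 * np.log10(np.maximum(mel, 1e-10))      # dB floor is assumed
    return (s_db - DB_MEAN) / DB_STD


t = np.arange(SR) / SR
features = log_mel(np.sin(2.0 * np.pi * 1000.0 * t))    # 1 s, 1 kHz test tone
```

A 16,000-sample clip yields `1 + 16000 // 160 = 101` frames, so `features` has the `(101, 64)` shape listed under Model.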

Training

Dataset: gilkeyio/AudioMNIST — 30,000 utterances, 60 speakers, digits 0–9. Credit to @gilkeyio for hosting the Hub mirror; original dataset by Becker et al. (arXiv:1807.03418).
Split: speaker-disjoint 80/10/10 (seed 42). No speaker appears in more than one split, so test metrics reflect generalization to new speakers.
Optimizer: Adam, lr 1e-3, batch size 64.
Schedule: up to 30 epochs with early stopping on val_accuracy (patience 8) and ReduceLROnPlateau on val_loss (factor 0.5, patience 3).
Loss: cross-entropy.
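
To make the speaker-disjoint split concrete, the sketch below assigns whole speakers to train/val/test using only the standard library. The function name `speaker_disjoint_split` and the shuffle-then-slice procedure are assumptions; the actual pipeline in the source repository may partition speakers differently, so the specific speakers in each split will not necessarily match even with seed 42.

```python
import random


def speaker_disjoint_split(speaker_ids, ratios=(0.8, 0.1, 0.1), seed=42):
    """Assign whole speakers to splits so no speaker spans two of them."""
    speakers = sorted(set(speaker_ids))
    random.Random(seed).shuffle(speakers)
    n_train = round(ratios[0] * len(speakers))
    n_val = round(ratios[1] * len(speakers))
    return {
        "train": set(speakers[:n_train]),
        "val": set(speakers[n_train:n_train + n_val]),
        "test": set(speakers[n_train + n_val:]),
    }


# Stand-in for the dataset's speaker column: 60 speakers x 500 utterances
# = 30,000 clips, matching the totals quoted above.
speaker_of_clip = [spk for spk in range(1, 61) for _ in range(500)]
split = speaker_disjoint_split(speaker_of_clip)
test_clips = [s for s in speaker_of_clip if s in split["test"]]
```

With 60 speakers this gives 48 / 6 / 6 speakers, and the 6 test speakers contribute 3,000 clips, which matches the test-set size reported under Evaluation.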

Evaluation

On the speaker-disjoint test split (3,000 clips, 6 unseen speakers):

| Metric | Value |
| --- | --- |
| Accuracy | 0.9993 |
| Macro F1 | 0.9993 |
| Weighted F1 | 0.9993 |
| Macro Precision | 0.9993 |
| Macro Recall | 0.9993 |
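
For readers who want to recompute these metrics from predictions, here is a small NumPy sketch based on a confusion matrix. The function name `classification_metrics` is hypothetical, and the example labels are made up: they assume a balanced test set (300 clips per digit) with two arbitrary misclassifications, since an accuracy of 0.9993 on 3,000 clips corresponds to 2998 correct predictions. Which clips the real model gets wrong is not stated in this card.

```python
import numpy as np


def classification_metrics(y_true, y_pred, n_classes=10):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)            # rows: true, cols: predicted
    tp = np.diag(cm).astype(float)
    support = cm.sum(axis=1)                      # true clips per class
    predicted = cm.sum(axis=0)                    # predicted clips per class
    zeros = lambda: np.zeros(n_classes)
    precision = np.divide(tp, predicted, out=zeros(), where=predicted > 0)
    recall = np.divide(tp, support, out=zeros(), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros(), where=denom > 0)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "macro_precision": precision.mean(),      # unweighted mean over classes
        "macro_recall": recall.mean(),
        "macro_f1": f1.mean(),
        "weighted_f1": np.average(f1, weights=support),
    }


# Hypothetical predictions: 300 clips per digit, two of them wrong.
y_true = np.repeat(np.arange(10), 300)
y_pred = y_true.copy()
y_pred[0] = 1      # a "0" misread as "1"
y_pred[-1] = 8     # a "9" misread as "8"
metrics = classification_metrics(y_true, y_pred)
```

On a balanced test set macro and weighted F1 coincide, which is consistent with the identical values in the table.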

Intended use & limitations

Intended for: demos, education, lightweight on-device digit classification of short, clean speech.
Out of distribution: noisy environments, distant microphones, accents under-represented in AudioMNIST, non-digit speech, and clips outside the typical 0.5–1.0 s length. Real-world laptop-mic audio (HVAC, keyboard noise) shows a noticeable accuracy drop relative to the clean test numbers. No data augmentation (SpecAugment, additive noise) was used during training.
Not suitable for: anything safety- or identity-critical.

Source

Training pipeline, configs, and MLflow tracking: https://github.com/fergarciadlc/audio-mnist.


Paper

Interpreting and Explaining Deep Neural Networks for Classification of Audio Signals (arXiv:1807.03418, published Jul 9, 2018).
