audio-mnist-cnn

A small spectrogram CNN that classifies spoken digits 0–9 from 1 second of 16 kHz mono audio. Trained from scratch on AudioMNIST (specifically the gilkeyio/AudioMNIST mirror on the Hub) with a speaker-disjoint train/val/test split.

Live demo: https://huggingface.co/spaces/fergarciadlc/audio-mnist-demo

Files

| File | Purpose |
|---|---|
| best_model.pt | PyTorch state_dict checkpoint ({"model_state": ...}), ~1.7 MB. |
| config_snapshot.yaml | Full inlined config used at training time (dataset, features, model). Required by the loader so preprocessing matches exactly. |

Quickstart

import torch, yaml
from huggingface_hub import hf_hub_download

REPO = "fergarciadlc/audio-mnist-cnn"
ckpt_path = hf_hub_download(REPO, "best_model.pt")
cfg_path  = hf_hub_download(REPO, "config_snapshot.yaml")

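# Load the checkpoint on CPU; the weights are stored under the "model_state" key.
state = torch.load(ckpt_path, map_location="cpu")
# Read the config with a context manager so the file handle is closed.
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)
# Build your model from cfg["model"] and load state["model_state"].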

The reference inference loop (load audio → 16 kHz mono → center pad/crop to 1.0 s → log-mel spectrogram → CNN → softmax) lives in the demo Space: https://huggingface.co/spaces/fergarciadlc/audio-mnist-demo/tree/main.
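The center pad/crop step of that pipeline can be sketched in plain Python. This is a minimal illustration, not the code from the demo Space: the function name `center_pad_or_crop` is hypothetical, zero-padding is assumed, and the way an odd remainder is split between the two sides is also an assumption.

```python
def center_pad_or_crop(samples, target_len=16000):
    """Return exactly target_len samples, centered on the input.

    target_len=16000 corresponds to 1.0 s at 16 kHz.
    Zero-padding is an assumption; the demo Space may pad differently.
    """
    n = len(samples)
    if n >= target_len:
        # Crop: drop roughly equal amounts from both ends
        # (any odd extra sample is dropped from the end).
        start = (n - target_len) // 2
        return list(samples[start:start + target_len])
    # Pad: split the missing samples between the left and right sides.
    missing = target_len - n
    left = missing // 2
    right = missing - left
    return [0.0] * left + list(samples) + [0.0] * right


# A 0.5 s clip is padded to 1.0 s; a 1.25 s clip is cropped to 1.0 s.
print(len(center_pad_or_crop([0.1] * 8000)))   # 16000
print(len(center_pad_or_crop([0.1] * 20000)))  # 16000
```

Centering keeps the spoken digit near the middle of the window, which is where the model saw it during training if the same step was used there.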

Model

  • Architecture: 4 conv blocks (32 → 64 → 128 → 256 channels, 3×3 kernels, BatchNorm + ReLU + MaxPool + Dropout(0.3)) → adaptive average pool → MLP head (128 units, dropout 0.3) → 10-way softmax.
  • Input: log-mel spectrogram, shape (101 frames, 64 mels), mean-std normalized with fixed stats (mean −40 dB, std 20 dB).
  • Parameters: ~430k. Checkpoint size ~1.7 MB.
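As a rough sanity check on the parameter count, the arithmetic below tallies the weights implied by the architecture description. It is a sketch under stated assumptions, not a readout of the actual checkpoint: a single input channel, biased convolutions, BatchNorm with learnable scale and shift, and a head made of two biased linear layers (256 → 128 → 10). The helper names are hypothetical.

```python
def conv_block_params(c_in, c_out, k=3):
    """Parameters in one k x k Conv2d (with bias) followed by BatchNorm2d."""
    conv = c_in * c_out * k * k + c_out  # weights + biases
    bn = 2 * c_out                       # learnable scale and shift
    return conv + bn


def linear_params(n_in, n_out):
    """Parameters in one fully connected layer with bias."""
    return n_in * n_out + n_out


# Assumption: the spectrogram enters the network as a single channel.
channels = [1, 32, 64, 128, 256]
conv_total = sum(conv_block_params(a, b) for a, b in zip(channels, channels[1:]))

# Assumption: the head is Linear(256 -> 128) followed by Linear(128 -> 10).
head_total = linear_params(256, 128) + linear_params(128, 10)

total = conv_total + head_total
size_mb = total * 4 / 1e6  # float32 weights, 4 bytes each

print(total, round(size_mb, 2))  # 422986 1.69
```

Under these assumptions the tally is about 423k parameters and 1.69 MB in float32, close to the stated ~430k and ~1.7 MB. Any remaining gap would come from details the card does not spell out.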

Preprocessing

| Stage | Setting |
|---|---|
| Sample rate | 16 kHz mono |
| Clip length | Center-pad / crop to exactly 1.0 s |
| STFT | n_fft=512, win_length=400, hop_length=160 |
| Mel | n_mels=64, fmin=20, fmax=8000, power=2.0 |
| Normalization | Fixed: (S_db − (−40)) / 20 |
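The settings above determine the input shape and value range the model sees. The short sketch below works through the frame count and the fixed normalization. It assumes the STFT pads the signal by n_fft // 2 on each side (the "centered" convention that torch.stft and librosa use by default); the card does not state this, but it is the convention that yields the 101 frames quoted for the model input. Function names are hypothetical.

```python
SAMPLE_RATE = 16000
N_SAMPLES = SAMPLE_RATE * 1  # 1.0 s clip
N_FFT, HOP = 512, 160


def n_frames_centered(n_samples, hop):
    """Frame count when the signal is padded by n_fft // 2 on each side."""
    return 1 + n_samples // hop


def n_frames_uncentered(n_samples, n_fft, hop):
    """Frame count when only full windows inside the signal are used."""
    return 1 + (n_samples - n_fft) // hop


def normalize_db(s_db, mean_db=-40.0, std_db=20.0):
    """Fixed mean-std normalization of a log-mel value in dB."""
    return (s_db - mean_db) / std_db


print(n_frames_centered(N_SAMPLES, HOP))           # 101
print(n_frames_uncentered(N_SAMPLES, N_FFT, HOP))  # 97
print(normalize_db(-40.0), normalize_db(0.0))      # 0.0 2.0
```

At 16 kHz, win_length=400 is a 25 ms window and hop_length=160 is a 10 ms step. With the fixed statistics, a bin at −40 dB maps to 0 and a bin at 0 dB maps to 2.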

Training

  • Dataset: gilkeyio/AudioMNIST (30,000 utterances, 60 speakers, digits 0–9). Credit to @gilkeyio for hosting the Hub mirror; original dataset by Becker et al. (arXiv:1807.03418).
  • Split: speaker-disjoint 80/10/10 (seed 42). No speaker appears in more than one split, so test metrics reflect generalization to new speakers.
  • Optimizer: Adam, lr 1e-3, batch size 64.
  • Schedule: up to 30 epochs with early stopping on val_accuracy (patience 8) and ReduceLROnPlateau on val_loss (factor 0.5, patience 3).
  • Loss: cross-entropy.
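A speaker-disjoint split like the one described above can be sketched by shuffling speaker IDs rather than utterances. The code below is an illustration only and will not reproduce the exact speaker assignment used in training: the function name, the use of Python's `random` module, and the integer speaker IDs are all assumptions.

```python
import random


def speaker_disjoint_split(speaker_ids, fractions=(0.8, 0.1, 0.1), seed=42):
    """Assign whole speakers to train/val/test so no speaker spans two splits."""
    speakers = sorted(set(speaker_ids))  # deterministic order before shuffling
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_train = round(fractions[0] * len(speakers))
    n_val = round(fractions[1] * len(speakers))
    return {
        "train": set(speakers[:n_train]),
        "val": set(speakers[n_train:n_train + n_val]),
        "test": set(speakers[n_train + n_val:]),
    }


# 60 hypothetical speaker IDs, matching the speaker count of AudioMNIST.
splits = speaker_disjoint_split(range(1, 61))
print({name: len(ids) for name, ids in splits.items()})
# {'train': 48, 'val': 6, 'test': 6}
```

Each utterance then goes to the split that owns its speaker. With 500 utterances per speaker, 6 test speakers give 3,000 test clips.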

Evaluation

On the speaker-disjoint test split (3,000 clips, 6 unseen speakers):

| Metric | Value |
|---|---|
| Accuracy | 0.9993 |
| Macro F1 | 0.9993 |
| Weighted F1 | 0.9993 |
| Macro Precision | 0.9993 |
| Macro Recall | 0.9993 |
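For readers unfamiliar with the macro-averaged metrics in this table, the sketch below computes accuracy and macro F1 from scratch. The labels are toy values for illustration only, not the model's predictions, and the function names are hypothetical; the training repository may compute these metrics with a library instead.

```python
def per_class_prf(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} from one-vs-rest counts."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so every digit counts equally."""
    scores = per_class_prf(y_true, y_pred, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: one clip of class 1 is misclassified as class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(round(accuracy(y_true, y_pred), 4))             # 0.8333
print(round(macro_f1(y_true, y_pred, [0, 1, 2]), 4))  # 0.8222
```

For scale, an accuracy of 0.9993 on 3,000 clips corresponds to about two misclassified clips.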

Intended use & limitations

  • Intended for: demos, education, lightweight on-device digit classification of short, clean speech.
  • Out of distribution: noisy environments, distant mics, accents under-represented in AudioMNIST, non-digit speech, and audio outside the 0.5–1.0 s typical clip length. Real-world laptop-mic audio (HVAC, keyboard noise) shows a noticeable drop vs. clean test numbers. No data augmentation (SpecAugment / additive noise) was used during training.
  • Not suitable for: anything safety- or identity-critical.

Source

Training pipeline, configs, and MLflow tracking: https://github.com/fergarciadlc/audio-mnist.
