A small spectrogram CNN that classifies spoken digits 0–9 from 1
second of 16 kHz mono audio. Trained from scratch on
AudioMNIST (specifically the
gilkeyio/AudioMNIST
mirror on the Hub) with a speaker-disjoint train/val/test split.
Live demo: https://huggingface.co/spaces/fergarciadlc/audio-mnist-demo
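
A speaker-disjoint split assigns whole speakers, not individual clips, to each partition, so no voice heard during training appears at evaluation time. The following is a minimal sketch of that idea, not the split used for this checkpoint: the function name, the seed, and the validation speaker count (`n_val=6`) are assumptions made for illustration. Only the 60 speakers and the 6 held-out test speakers come from this card.

```python
import random


def speaker_disjoint_split(speaker_ids, n_val=6, n_test=6, seed=0):
    """Partition speakers (not clips) into train/val/test groups.

    n_val and seed are hypothetical; the card only states that the
    test split holds 6 unseen speakers.
    """
    speakers = sorted(set(speaker_ids))
    rng = random.Random(seed)
    rng.shuffle(speakers)
    test = set(speakers[:n_test])
    val = set(speakers[n_test:n_test + n_val])
    train = set(speakers[n_test + n_val:])
    return {"train": train, "val": val, "test": test}


# Example with 60 placeholder speaker IDs ("01" ... "60").
ids = [f"{i:02d}" for i in range(1, 61)]
split = speaker_disjoint_split(ids)
```

Each clip then goes to whichever partition contains its speaker ID, which keeps the three sets free of shared voices.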
| File | Purpose |
|---|---|
| `best_model.pt` | PyTorch state_dict checkpoint (`{"model_state": ...}`), ~1.7 MB. |
| `config_snapshot.yaml` | Full inlined config used at training time (dataset, features, model). Required by the loader so preprocessing matches exactly. |

```python
import torch
import yaml
from huggingface_hub import hf_hub_download

REPO = "fergarciadlc/audio-mnist-cnn"
ckpt_path = hf_hub_download(REPO, "best_model.pt")
cfg_path = hf_hub_download(REPO, "config_snapshot.yaml")

# The checkpoint is a dict; the weights live under "model_state".
state = torch.load(ckpt_path, map_location="cpu")
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# Build your model from cfg["model"] and load state["model_state"].
```

The reference inference loop (load audio → 16 kHz mono → center pad/crop to 1.0 s → log-mel spectrogram → CNN → softmax) lives in the demo Space: https://huggingface.co/spaces/fergarciadlc/audio-mnist-demo/tree/main.
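
The center pad/crop step in that loop can be sketched as below. This is an illustrative NumPy version, not the code from the Space: the function name and the choice of zero padding are assumptions, and the reference implementation may differ in detail.

```python
import numpy as np


def center_pad_or_crop(wave, target_len=16000):
    """Return a 1-D waveform of exactly target_len samples.

    Longer clips keep their middle section; shorter clips are padded
    with zeros (an assumption) split evenly between both ends.
    """
    wave = np.asarray(wave, dtype=np.float32)
    n = wave.shape[0]
    if n >= target_len:
        start = (n - target_len) // 2
        return wave[start:start + target_len]
    pad_total = target_len - n
    left = pad_total // 2
    right = pad_total - left
    return np.pad(wave, (left, right))


short_clip = np.ones(10000, dtype=np.float32)   # 0.625 s at 16 kHz
long_clip = np.arange(20000, dtype=np.float32)  # 1.25 s at 16 kHz
fixed_short = center_pad_or_crop(short_clip)
fixed_long = center_pad_or_crop(long_clip)
```

Both outputs are 16,000 samples long, which is 1.0 s at 16 kHz.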
Log-mel spectrogram (101 frames, 64 mels), mean-std
normalized with fixed stats (mean −40 dB, std 20 dB).

| Stage | Setting |
|---|---|
| Sample rate | 16 kHz mono |
| Clip length | Center-pad / crop to exactly 1.0 s |
| STFT | n_fft=512, win_length=400, hop_length=160 |
| Mel | n_mels=64, fmin=20, fmax=8000, power=2.0 |
| Normalization | Fixed: (S_db − (−40)) / 20 |
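
The frame count and the normalization in the table can be checked with a little arithmetic. The sketch below assumes a centered STFT (which gives `1 + num_samples // hop_length` frames) and a `10 * log10` power-to-dB conversion with a small floor `eps`. Both are assumptions; the exact dB conversion is defined by `config_snapshot.yaml` and the training code, not by this example.

```python
import numpy as np

SAMPLE_RATE = 16000
HOP_LENGTH = 160
DB_MEAN, DB_STD = -40.0, 20.0  # fixed stats from the table above


def num_frames(num_samples, hop_length=HOP_LENGTH):
    """Frame count for a centered STFT (assumed)."""
    return 1 + num_samples // hop_length


def power_to_db(power, eps=1e-10):
    """Power spectrogram to dB; the eps floor is a hypothetical choice."""
    return 10.0 * np.log10(np.maximum(np.asarray(power, dtype=np.float64), eps))


def normalize_db(s_db):
    """Apply the fixed normalization (S_db - (-40)) / 20."""
    return (np.asarray(s_db, dtype=np.float64) - DB_MEAN) / DB_STD


frames = num_frames(SAMPLE_RATE)  # one second of audio
normalized = normalize_db(power_to_db([1e-4, 1e-2, 1e-6]))
```

Under these assumptions, one second of audio yields 101 frames, and a mel bin at −40 dB maps to 0 after normalization.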
gilkeyio/AudioMNIST: 30,000 utterances, 60 speakers, digits 0–9. Credit to
@gilkeyio for hosting the Hub
mirror; original dataset by Becker et al.
(arXiv:1807.03418).

Training used early stopping on val_accuracy
(patience 8) and ReduceLROnPlateau on val_loss (factor 0.5,
patience 3).

On the speaker-disjoint test split (3,000 clips, 6 unseen speakers):

| Metric | Value |
|---|---|
| Accuracy | 0.9993 |
| Macro F1 | 0.9993 |
| Weighted F1 | 0.9993 |
| Macro Precision | 0.9993 |
| Macro Recall | 0.9993 |
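
For readers unfamiliar with how the macro and weighted averages in the table differ, here is a minimal sketch that derives all five metrics from a confusion matrix. It is a generic NumPy illustration with a hypothetical function name, not the evaluation code used to produce the numbers above.

```python
import numpy as np


def classification_metrics(y_true, y_pred, num_classes=10):
    """Accuracy plus macro/weighted averages from a confusion matrix."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows: true class, columns: predicted class

    tp = np.diag(cm).astype(np.float64)
    support = cm.sum(axis=1).astype(np.float64)    # clips per true class
    predicted = cm.sum(axis=0).astype(np.float64)  # clips per predicted class

    zeros = np.zeros_like(tp)
    precision = np.divide(tp, predicted, out=zeros.copy(), where=predicted > 0)
    recall = np.divide(tp, support, out=zeros.copy(), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros.copy(), where=denom > 0)

    return {
        "accuracy": tp.sum() / cm.sum(),
        "macro_precision": precision.mean(),  # unweighted mean over classes
        "macro_recall": recall.mean(),
        "macro_f1": f1.mean(),
        "weighted_f1": (f1 * support).sum() / support.sum(),  # weighted by class size
    }


# Tiny two-class example.
metrics = classification_metrics([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Macro averages treat every digit equally, while the weighted F1 scales each class by its clip count; with balanced classes the two coincide. As a sanity check on the table, an accuracy of 0.9993 on 3,000 clips is consistent with two misclassified clips (2998 / 3000 ≈ 0.9993).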
Training pipeline, configs, and MLflow tracking: https://github.com/fergarciadlc/audio-mnist.
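
The patience values quoted for training can be illustrated with a small patience counter. This sketch assumes the patience of 8 on val_accuracy refers to early stopping; the class name and the "strictly better" comparison are hypothetical and not taken from the linked repository. The learning-rate side uses PyTorch's `ReduceLROnPlateau`, shown only in comments so the example runs without PyTorch installed.

```python
class EarlyStopping:
    """Stop when val_accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=8):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch; return True when training should stop."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# In a PyTorch loop the scheduler from the card would sit alongside it:
#   scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
#       optimizer, mode="min", factor=0.5, patience=3)
#   scheduler.step(val_loss)  # once per epoch, after validation

stopper = EarlyStopping(patience=2)  # short patience just for the demo
history = [stopper.step(acc) for acc in [0.90, 0.95, 0.94, 0.95]]
```

With a patience of 2, the demo stops on the fourth epoch, after two epochs in a row that fail to beat the best accuracy of 0.95.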