S4 State Space Model — Speech Commands
A 3-layer S4 state-space model (Gu et al. 2022) trained from scratch on the Google Speech Commands v2 dataset for 10-class keyword classification. Released as xaitalk's cross-framework XAI demo on the SSM architecture family.
This is the first publicly released S4 checkpoint paired with a full cross-framework XAI validation suite, including LRP on SSMs, which, as far as we are aware, no other library implements.
Files
| File | Format | Size |
|---|---|---|
| ssm_speech_commands.pt | PyTorch state_dict | ~6 MB |
| ssm_speech_commands_config.json | architecture config | < 1 KB |
Architecture
| Property | Value |
|---|---|
| Input dim | 1 (raw waveform) |
| State dim | 64 |
| Hidden dim | 128 |
| Layers | 3 |
| Downsample stride | 160 |
| Input length | 16000 (= 1 second at 16 kHz) |
| Output | 10-class logits |
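
To make the table concrete: with a downsample stride of 160, a 16000-sample input becomes 16000 / 160 = 100 time steps, and each layer carries a 64-dimensional state. The sketch below is a generic discrete linear state-space recurrence in NumPy, not the S4 parameterization used by this checkpoint (no HiPPO initialization, no learned discretization). The diagonal `A_diag`, the vectors `B` and `C`, and all function names are illustrative assumptions. It shows only that the step-by-step recurrence and the equivalent convolution kernel produce the same output.

```python
import numpy as np

STATE_DIM = 64                  # state dimension from the architecture table
INPUT_LEN, STRIDE = 16000, 160  # input length and downsample stride from the table
SEQ_LEN = INPUT_LEN // STRIDE   # 100 time steps after downsampling


def ssm_recurrent(A_diag, B, C, u):
    """Run x[k] = A x[k-1] + B u[k], y[k] = C x[k] one step at a time.

    A is assumed diagonal here (an illustrative simplification).
    """
    x = np.zeros_like(B)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = A_diag * x + B * u_k
        y[k] = C @ x
    return y


def ssm_kernel(A_diag, B, C, length):
    """Convolution kernel K[j] = C A^j B for a diagonal A."""
    powers = A_diag[None, :] ** np.arange(length)[:, None]  # (length, state)
    return powers @ (C * B)


def ssm_convolutional(A_diag, B, C, u):
    """Same output as ssm_recurrent, computed as a causal convolution."""
    K = ssm_kernel(A_diag, B, C, len(u))
    return np.convolve(u, K)[: len(u)]
```

The convolutional form is what makes this model family efficient to train on long sequences, while the recurrent form is convenient for step-by-step inference.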
Training: 50 epochs on Speech Commands v2 with cosine-annealing LR schedule (A100 GPU). Best validation accuracy 73.5%, test accuracy 71.2%.
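
For reference, the cosine-annealing schedule mentioned above has a simple closed form, sketched here in plain Python. The base and minimum learning rates are not stated in this card, so `base_lr` and `min_lr` below are placeholder assumptions; only the 50-epoch horizon comes from the training description. The formula is the one used by PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` without restarts.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=50, base_lr=1e-3, min_lr=0.0):
    """Learning rate at `epoch` under cosine annealing without restarts.

    base_lr and min_lr are placeholder values, not the ones used for this checkpoint.
    """
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term


# One learning rate per training epoch (epochs 0..49).
schedule = [cosine_annealing_lr(epoch) for epoch in range(50)]
```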
Cross-framework verification
These weights are validated by xaitalk's ssm benchmark on Speech Commands (22 methods):
| Methods | Passing at r ≥ 0.95 | Min(min_r) | Verified |
|---|---|---|---|
| 22 | 22/22 | 1.0000 | 2026-05-09 |
Every method (gradient family; full LRP family including the ε, γ, α-β, z+, flat, w², and SIGN variants; DeepLIFT; SmoothGrad family; GradCAM; occlusion) produces bit-identical attributions across PyTorch, TensorFlow, and JAX, with a worst-case Pearson r of 1.0000.
This is the strongest cross-framework verification of any architecture in xaitalk's matrix and the canonical reference for ports to new architectures.
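
A minimal sketch of how a pass criterion like the one in the table could be computed, assuming attributions from each framework are available as NumPy arrays of shape `(n_samples, ...)`. This is not xaitalk's benchmark code: the function names, the dictionary layout, and the reading of "min(min_r)" as the worst Pearson r over all framework pairs and samples are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def pearson_r(a, b):
    """Pearson correlation between two flattened attribution maps.

    Undefined (division by zero) if either map is constant.
    """
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def min_pairwise_r(attributions):
    """Worst-case r over every framework pair and every sample.

    attributions: dict mapping a framework name to an array (n_samples, ...).
    """
    worst = 1.0
    for name_a, name_b in combinations(attributions, 2):
        for map_a, map_b in zip(attributions[name_a], attributions[name_b]):
            worst = min(worst, pearson_r(map_a, map_b))
    return worst


def passes(attributions, threshold=0.95):
    """True if the worst-case correlation meets the threshold."""
    return min_pairwise_r(attributions) >= threshold
```

Pearson r is unchanged by positive rescaling or a constant offset, so a stricter check for exact agreement would compare the raw arrays directly, for example with `np.array_equal`.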
Usage
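import json
from pathlib import Path
import torch
from xaitalk.hub import ensure_model
from xaitalk.models import S4SSM  # architecture class lives in xaitalk
# Resolve the checkpoint path; wrapping in Path also covers a str return value
ckpt_path = Path(ensure_model('ssm/s4-speech-commands'))
config_path = ckpt_path.parent / 'ssm_speech_commands_config.json'
config = json.loads(config_path.read_text())
# Build the model from the config, then load the weights onto the CPU
model = S4SSM(**config)
model.load_state_dict(torch.load(ckpt_path, map_location='cpu', weights_only=True))
model.eval()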
# Run XAI on a 1-second waveform
import xaitalk
import numpy as np
x = np.random.randn(1, 1, 16000).astype(np.float32)
expl = xaitalk.explain(model, x, method='lrp_epsilon', target_class=3)
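
Sample-level attributions over 16000 points are hard to read directly. One option, sketched below, is to sum absolute relevance within 160-sample frames so that the result lines up with the model's 100 downsampled time steps. This assumes the explanation can be converted to a NumPy array holding a single waveform of 16000 values, which this card does not confirm; `frame_relevance` and `top_frames` are hypothetical helpers, not part of xaitalk.

```python
import numpy as np


def frame_relevance(attribution, stride=160):
    """Sum absolute attribution within each stride-sized frame.

    Expects the attribution of a single waveform; any leading singleton
    dimensions are flattened away.
    """
    a = np.abs(np.asarray(attribution, dtype=np.float64)).reshape(-1)
    n_frames = a.size // stride
    return a[: n_frames * stride].reshape(n_frames, stride).sum(axis=1)


def top_frames(attribution, k=5, stride=160, sample_rate=16000):
    """Return (start_time_in_seconds, score) for the k most relevant frames."""
    scores = frame_relevance(attribution, stride)
    order = np.argsort(scores)[::-1][:k]
    return [(int(i) * stride / sample_rate, float(scores[i])) for i in order]
```

With the default stride each frame covers 10 ms of audio, so the start times can be compared directly against where the keyword is spoken in the clip.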
Training data
Speech Commands v2 (Warden 2018) is a 35-keyword dataset; this model is trained on its 10-keyword subset (yes, no, up, down, left, right, on, off, stop, go).
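
A minimal sketch of restricting the full dataset to this subset, assuming examples arrive as `(path, label)` pairs. The class-index order below simply follows the order in which the keywords are listed above; the order actually used by the checkpoint is not stated in this card, so treat `LABEL_TO_INDEX` as a hypothetical mapping.

```python
KEYWORDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

# Assumed index order (listing order above); the checkpoint's real order may differ.
LABEL_TO_INDEX = {word: index for index, word in enumerate(KEYWORDS)}


def filter_to_subset(examples):
    """Keep (path, label) pairs whose label is one of the 10 keywords.

    Returns (path, class_index) pairs; all other keywords are dropped.
    """
    return [
        (path, LABEL_TO_INDEX[label])
        for path, label in examples
        if label in LABEL_TO_INDEX
    ]
```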
License
Apache 2.0. Speech Commands is released under CC-BY 4.0.
Citation
S4 architecture:
@inproceedings{gu2022s4,
author = {Gu, Albert and Goel, Karan and R{\'e}, Christopher},
title = {Efficiently Modeling Long Sequences with Structured State Spaces},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2022}
}
Speech Commands dataset:
@misc{warden2018speechcommands,
author = {Warden, Pete},
title = {Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition},
year = {2018},
eprint = {1804.03209}
}
xaitalk infrastructure:
@software{paul2026xaitalk,
author = {Paul, Alexander},
title = {xaitalk: Cross-Framework Explainable AI Library},
year = {2026},
url = {https://xaitalk.com}
}
Links
- xaitalk website: https://xaitalk.com
- Framework GitHub: https://github.com/alexanderfpaul/xaitalk-framework
- SSM comparison script: examples/comparison/run_ssm_3framework_comparison.py