S4 State Space Model — Speech Commands

A 3-layer S4 state-space model (Gu et al. 2022) trained from scratch on the Google Speech Commands v2 dataset for 10-class keyword classification. Released as xaitalk's cross-framework XAI demo on the SSM architecture family.

As far as we are aware, this is the first publicly released S4 checkpoint paired with a full cross-framework XAI validation suite. The suite includes layer-wise relevance propagation (LRP) on SSMs, which we have not found implemented in any other library.

Files

File	Format	Size
`ssm_speech_commands.pt`	PyTorch state_dict	~6 MB
`ssm_speech_commands_config.json`	architecture config	< 1 KB

Architecture

Property	Value
Input dim	1 (raw waveform)
State dim	64
Hidden dim	128
Layers	3
Downsample stride	160
Input length	16000 (= 1 second at 16 kHz)
Output	10-class logits
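
As a quick check on the numbers above, the stride and input length together imply the sequence length the S4 layers see. The sketch below assumes the downsampling is a plain non-overlapping stride with no padding, which the table does not state; the function name is illustrative and not part of xaitalk.

```python
SAMPLE_RATE_HZ = 16_000   # from the table: 1 second at 16 kHz
INPUT_LENGTH = 16_000     # samples per clip
STRIDE = 160              # downsample stride

def frames_after_downsample(n_samples: int, stride: int) -> int:
    """Frames left after striding, assuming no padding or overlap (assumption)."""
    return n_samples // stride

n_frames = frames_after_downsample(INPUT_LENGTH, STRIDE)
frame_ms = 1000 * STRIDE / SAMPLE_RATE_HZ

print(n_frames)   # 100
print(frame_ms)   # 10.0
```

Under that assumption each clip becomes 100 frames of 10 ms each.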

Training: 50 epochs on Speech Commands v2 with cosine-annealing LR schedule (A100 GPU). Best validation accuracy 73.5%, test accuracy 71.2%.
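
For readers unfamiliar with cosine annealing, the following is a minimal sketch of the standard closed-form schedule over 50 epochs. The base learning rate and the zero floor are hypothetical placeholders, since the card does not report them, and the function is illustrative rather than the training code used here.

```python
import math

def cosine_annealing_lr(epoch: int, total_epochs: int,
                        base_lr: float, min_lr: float = 0.0) -> float:
    """Closed-form cosine annealing: base_lr at epoch 0, min_lr at total_epochs."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term

TOTAL_EPOCHS = 50   # from the text
BASE_LR = 1e-3      # hypothetical; the card does not give the learning rate

# Learning rate at the start of each epoch, plus the end point.
schedule = [cosine_annealing_lr(e, TOTAL_EPOCHS, BASE_LR)
            for e in range(TOTAL_EPOCHS + 1)]
```

In PyTorch the same curve is available as `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max` set to the number of epochs.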

Cross-framework verification

These weights are validated by xaitalk's ssm benchmark on Speech Commands (22 methods):

Methods	Passing at r ≥ 0.95	Min(min_r)	Verified
22	22/22	1.0000	2026-05-09

Every method (gradient family; full LRP family including ε / γ / α-β / z+ / flat / w² / SIGN variants; DeepLIFT; SmoothGrad family; Grad-CAM; occlusion) produces attributions that agree across PyTorch / TensorFlow / JAX, with a worst-case Pearson r of 1.0000.

This is the strongest cross-framework verification of any architecture in xaitalk's matrix and the canonical reference for ports to new architectures.
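
The pass criterion in the table (r ≥ 0.95) can be sketched as below. This reads Min(min_r) as the smallest Pearson r found across frameworks and methods, which is an assumption about the benchmark. The function names and toy attribution values are made up for illustration and are not xaitalk's API or results.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences of floats."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def worst_case_r(reference, others):
    """Minimum Pearson r between a reference attribution and each other one."""
    return min(pearson_r(reference, o) for o in others)

# Toy flattened attributions standing in for three frameworks (made-up numbers).
attr_torch = [0.10, -0.40, 0.25, 0.80, -0.05]
attr_tf = [0.10, -0.40, 0.25, 0.80, -0.05]
attr_jax = [0.10, -0.40, 0.25, 0.80, -0.0500001]

THRESHOLD = 0.95  # pass criterion from the table
r_min = worst_case_r(attr_torch, [attr_tf, attr_jax])
passed = r_min >= THRESHOLD
```

Pearson r is invariant to positive rescaling and constant offsets, so it measures linear agreement between attribution maps rather than equality of individual values.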

Usage

import json

import numpy as np
import torch

import xaitalk
from xaitalk.hub import ensure_model
from xaitalk.models import S4SSM  # architecture class lives in xaitalk

# Resolve the checkpoint path; the config file sits in the same directory
ckpt_path = ensure_model('ssm/s4-speech-commands')
config_path = ckpt_path.parent / 'ssm_speech_commands_config.json'
config = json.loads(config_path.read_text())

# Rebuild the architecture from the config and load the weights
# (map_location='cpu' lets the checkpoint load on machines without a GPU)
model = S4SSM(**config)
model.load_state_dict(torch.load(ckpt_path, map_location='cpu', weights_only=True))
model.eval()

# Run XAI on a random stand-in for a 1-second waveform
x = np.random.randn(1, 1, 16000).astype(np.float32)
expl = xaitalk.explain(model, x, method='lrp_epsilon', target_class=3)
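
The usage example feeds random noise. For a real recording, the waveform has to be brought to exactly 16000 samples and shaped like `x` above. The helper below is a sketch of one way to do that; truncating long clips and right-padding short ones with zeros is an assumption, as the card does not say how the training pipeline handled clip length, and `to_model_input` is a hypothetical name, not an xaitalk function.

```python
import numpy as np

TARGET_LEN = 16_000  # 1 second at 16 kHz, per the architecture table

def to_model_input(waveform: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Crop or right-pad a mono waveform, then add leading axes to match x."""
    wav = np.asarray(waveform, dtype=np.float32).reshape(-1)
    if wav.shape[0] >= target_len:
        wav = wav[:target_len]                              # keep the first second
    else:
        wav = np.pad(wav, (0, target_len - wav.shape[0]))   # zero-pad on the right
    return wav[np.newaxis, np.newaxis, :]

short_clip = np.ones(12_000, dtype=np.float32)  # stand-in for a 0.75 s recording
x = to_model_input(short_clip)
print(x.shape)  # (1, 1, 16000)
```

The resulting array can be passed to `xaitalk.explain` in place of the random `x`, provided the recording is mono and sampled at 16 kHz.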

Training data

Speech Commands v2 (Warden 2018) is a 35-keyword dataset; this model is trained on its 10-keyword subset (yes, no, up, down, left, right, on, off, stop, go).
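
Restricting the full dataset to the 10 keywords could look like the sketch below. The class-index order is hypothetical: it simply follows the order in which the keywords are listed here and may not match the order baked into the checkpoint, so check the training code or config before interpreting `target_class` values. The file names are made up.

```python
KEYWORDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

# Hypothetical index order: taken from the listing above, not from the checkpoint.
LABEL_TO_INDEX = {word: i for i, word in enumerate(KEYWORDS)}

def filter_to_subset(samples):
    """Keep (path, label) pairs whose label is one of the 10 keywords."""
    return [(path, LABEL_TO_INDEX[label])
            for path, label in samples
            if label in LABEL_TO_INDEX]

# Made-up file names in a "<label>/<file>.wav" layout.
samples = [("yes/a.wav", "yes"), ("bed/b.wav", "bed"), ("go/c.wav", "go")]
subset = filter_to_subset(samples)
```

Here the `bed` clip is dropped because it is outside the subset, and the remaining labels are replaced by integer class indices.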

License

Apache 2.0. Speech Commands is released under CC-BY 4.0.

Citation

S4 architecture:

@inproceedings{gu2022s4,
  author    = {Gu, Albert and Goel, Karan and R{\'e}, Christopher},
  title     = {Efficiently Modeling Long Sequences with Structured State Spaces},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2022}
}

Speech Commands dataset:

@misc{warden2018speechcommands,
  author = {Warden, Pete},
  title  = {Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition},
  year   = {2018},
  eprint = {1804.03209},
  archivePrefix = {arXiv}
}

xaitalk infrastructure:

@software{paul2026xaitalk,
  author = {Paul, Alexander},
  title  = {xaitalk: Cross-Framework Explainable AI Library},
  year   = {2026},
  url    = {https://xaitalk.com}
}
