SpeakMK1: Multimodal AI Speech-Language Pathology Assistant
SpeakMK1 is a 74.2M-parameter multimodal AI system for automated assessment of articulation disorders in children. It combines a BiMamba-based audio encoder with a custom Mamba-attention hybrid LLM to produce clinically framed, encouraging feedback in the style of a speech-language pathologist (SLP).
Developed at Amity University Dubai as a final-year B.Tech Computer Science Engineering project, supervised by Dr. Ved P. Mishra. Framed for UAE multilingual clinical contexts, with Dubai Health Authority as a target stakeholder.
Architecture Overview
SpeakMK1 uses a half-cascade design: audio encoder embeddings feed directly into the LLM via a projection layer, rather than going through full ASR transcription first. This preserves sub-phonemic acoustic detail that transcription would discard.
    Child Speech Audio
              │
              ▼
    ┌───────────────────┐
    │   Audio Encoder   │  BiMamba + UniMamba layers
    │   (BiMamba/Uni)   │  Phonological multi-task heads
    │                   │  Voicing / Manner / Place / CTC
    └─────────┬─────────┘
              │  (1, T, 512) frame embeddings
              ▼
    ┌───────────────────┐
    │   DirectAudio     │  Linear projection
    │   Projection      │  512 → 512
    └─────────┬─────────┘
              │  Audio tokens
              ▼
    ┌───────────────────┐
    │   SpeakMK1LLM     │  74.2M params
    │                   │  Mamba-SSM + Attention hybrid
    │   LatentMoE       │  4 experts, top-2 routing
    │   CrossModal      │  Sparse cross-attention gates
    │   SparseAttention │  Audio-conditioned generation
    └─────────┬─────────┘
              │  SLP-style text response
              ▼
    ┌───────────────────┐
    │   Kokoro ONNX     │  TTS output
    │   (af_heart)      │  Warm, child-friendly voice
    └───────────────────┘
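The diagram lists LatentMoE with 4 experts and top-2 routing. As a rough illustration of what top-k routing means in general, the sketch below picks the two highest-scoring experts for one token and renormalizes their softmax weights. It is not the LatentMoE implementation (that lives in speak_mk1_llm.py); the function name `top_k_route`, the renormalization choice, and the example logits are all assumptions made for illustration.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Generic top-k gating: softmax is taken over the selected logits only,
    so the returned weights sum to 1. This is an illustrative assumption;
    the actual LatentMoE router may normalize differently.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for one token over 4 experts.
routes = top_k_route([0.1, 2.0, -1.0, 1.0], k=2)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

Each token's output would then be the weighted sum of the two selected experts' outputs, so only half of the expert parameters are active per token.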
Component Details
| Component | Details |
|---|---|
| Audio Encoder | BiMamba + UniMamba layers, MFA-aligned phonological heads |
| Phonological heads | Voicing (96% acc), Manner (93% acc), Place (~40% acc), Correctness, CTC |
| Training data (encoder) | LibriSpeech with Montreal Forced Aligner alignment |
| LLM | 74.2M params, Mamba-SSM + attention, LatentMoE, factorized embeddings (128→512 bottleneck) |
| Tokenizer | EleutherAI GPT-NeoX-20B (vocab 50,283 + 5 custom SLP special tokens) |
| Projection | DirectAudioProjection: single linear layer (512→512) |
| TTS | Kokoro ONNX v1.0, af_heart voice |
| Backend | FastAPI microservice |
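To see why the factorized embeddings matter at this model size, the arithmetic below compares a full 512-wide embedding table with a 128-wide table followed by a 128→512 projection. It uses the `vocab_size=50283` value from the usage code and assumes a bias-free projection and no separate accounting for a tied output layer; the actual layer layout in speak_mk1_llm.py may differ, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, d_model, bottleneck=None):
    """Count embedding weights, ignoring biases and any tied output layer."""
    if bottleneck is None:
        # Plain lookup table: one d_model-wide row per token.
        return vocab_size * d_model
    # Narrow lookup table plus a bottleneck -> d_model projection matrix.
    return vocab_size * bottleneck + bottleneck * d_model

VOCAB, D_MODEL, BOTTLENECK = 50283, 512, 128

full = embedding_params(VOCAB, D_MODEL)
factorized = embedding_params(VOCAB, D_MODEL, BOTTLENECK)
print(f"full: {full:,}  factorized: {factorized:,}  saved: {full - factorized:,}")
```

Under these assumptions the full table would take about 25.7M weights, roughly a third of the 74.2M total, while the factorized version needs about 6.5M.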
Files in This Repository
| File | Description |
|---|---|
| audio_encoder_epoch_5.pt | Audio encoder weights (trained through epoch 5) |
| audio_proj_best.pt | DirectAudioProjection weights |
| ckpt_final.pt | LLM weights (Stage 5: SLP dialogue + cross-attention gate retraining) |
| audio_encoder.py | Encoder architecture definition |
| speak_mk1_llm.py | LLM architecture definition |
| train_proj.py | Projection layer definition |
| audio_trainer.py | SmallConfig and training utilities |
Kokoro TTS weights (kokoro-v1.0.onnx, voices-v1.0.bin) must be obtained separately from kokoro-onnx.
Usage
Installation
pip install torch transformers huggingface_hub librosa soundfile mamba-ssm einops kokoro-onnx sounddevice
Note: mamba-ssm requires CUDA. CPU-only inference is not supported.
Load and run inference
import torch
import librosa
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
from audio_encoder import AudioEncoder
from audio_trainer import SmallConfig
from speak_mk1_llm import SpeakMK1LLM, SpeakMK1LLMConfig
from train_proj import DirectAudioProjection
DEVICE = torch.device("cuda")
REPO = "SakhrML/SpeakMK1_early"
# Download weights
enc_path = hf_hub_download(REPO, "audio_encoder_epoch_5.pt")
proj_path = hf_hub_download(REPO, "audio_proj_best.pt")
llm_path = hf_hub_download(REPO, "ckpt_final.pt")
# Load encoder
enc_cfg = SmallConfig(d_model=512, llm_dim=4096)
encoder = AudioEncoder(enc_cfg).to(DEVICE)
encoder.load_state_dict(torch.load(enc_path, map_location=DEVICE, weights_only=False), strict=False)
encoder.eval()
# Load projection
proj = DirectAudioProjection(512, 512).to(DEVICE)
proj_ckpt = torch.load(proj_path, map_location=DEVICE, weights_only=False)
proj.load_state_dict(proj_ckpt["audio_proj"])
proj.eval()
# Load LLM
llm_cfg = SpeakMK1LLMConfig(
    vocab_size=50283, d_model=512, d_state=64, num_blocks=6,
    nheads_ssm=8, nheads_attn=8, top_k_audio=32,
    num_experts=4, top_k_experts=2, dropout=0.0, aux_loss_weight=1e-2,
)
llm = SpeakMK1LLM(llm_cfg).to(DEVICE)
llm_ckpt = torch.load(llm_path, map_location=DEVICE, weights_only=False)
llm.load_state_dict(llm_ckpt["model"], strict=True)
# Apply cross-attention gate override (required; see Limitations)
with torch.no_grad():
    for block in llm.blocks:
        if hasattr(block.cross_attn, "gate"):
            block.cross_attn.gate.data.fill_(0.3)
llm.eval()
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer.add_special_tokens({"additional_special_tokens": [
    "<|system|>", "<|child|>", "<|slp|>", "<|think|>", "<|endturn|>"
]})
# Run on audio
audio_np, _ = librosa.load("child_speech.wav", sr=16000, mono=True)
mel_np = librosa.feature.melspectrogram(
    y=audio_np, sr=16000, n_fft=400, hop_length=160, n_mels=80, fmin=0.0, fmax=8000.0
)
mel_np = librosa.power_to_db(mel_np, ref=np.max)
mel = torch.tensor(mel_np.T, dtype=torch.float32).unsqueeze(0).to(DEVICE)
with torch.no_grad():
    audio_feats = encoder.encode_features(mel)
    audio_out = proj(audio_feats)
prompt = (
    "<|system|>You are a warm, expert AI speech-language pathologist "
    "helping a child with articulation errors. Analyze the error and "
    "provide encouraging corrective feedback."
    "<|child|>I wanna pway wif my fwiends.<|slp|>"
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(DEVICE)
# Forward pass without gradient tracking (inference only)
with torch.no_grad():
    logits, _, _ = llm(input_ids=input_ids, audio_out=audio_out, audio_padding_mask=None)
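The usage code stops at `logits`. One plausible way to turn logits into a response is greedy decoding: repeatedly pick the highest-scoring next token and append it. The sketch below is framework-agnostic and uses a toy scoring function so that it runs on its own. To adapt it, `next_token_scores` would wrap the `llm(...)` call and return `logits[0, -1].tolist()`, which assumes the logits have shape (batch, sequence, vocab), and `stop_id` would be the id of `<|endturn|>`, which assumes that token ends an SLP turn. Neither assumption is confirmed by this README, and `greedy_decode` is a hypothetical helper, not part of the repository.

```python
def greedy_decode(next_token_scores, prompt_ids, stop_id, max_new_tokens=64):
    """Append the highest-scoring token until stop_id or the length limit.

    next_token_scores(ids) must return one score per vocabulary entry for
    the token that follows ids.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        # Index of the largest score (argmax over the vocabulary).
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == stop_id:
            break
    # Return only the newly generated tokens.
    return ids[len(prompt_ids):]

# Toy stand-in for the model: always prefers (last_id + 1) in a 5-token vocabulary.
def toy_scores(ids):
    return [1.0 if t == (ids[-1] + 1) % 5 else 0.0 for t in range(5)]

generated = greedy_decode(toy_scores, prompt_ids=[0], stop_id=3)
print(generated)
```

With real tensors it would be cheaper to call `argmax` on the last-position logits directly rather than converting them to a list; the list form is used here only to keep the sketch dependency-free.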
Prompt format
<|system|>{system instruction}<|child|>{child utterance}<|slp|>
Special tokens: <|system|>, <|child|>, <|slp|>, <|think|>, <|endturn|>
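A small helper can keep prompts consistent with the format above. This is a sketch, not code from the repository: `build_prompt` and `SPECIAL_TOKENS` are hypothetical names, and rejecting special tokens inside user text is a precaution added here rather than documented behavior.

```python
SPECIAL_TOKENS = ("<|system|>", "<|child|>", "<|slp|>", "<|think|>", "<|endturn|>")

def build_prompt(system_instruction, child_utterance):
    """Format one turn as <|system|>...<|child|>...<|slp|>."""
    for text in (system_instruction, child_utterance):
        # Reject raw special tokens in input text so they cannot break the turn structure.
        if any(tok in text for tok in SPECIAL_TOKENS):
            raise ValueError("input text must not contain special tokens")
    return (
        f"<|system|>{system_instruction.strip()}"
        f"<|child|>{child_utterance.strip()}<|slp|>"
    )

prompt = build_prompt(
    "You are a warm, expert AI speech-language pathologist.",
    "I wanna pway wif my fwiends.",
)
print(prompt)
```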
Training Details
Audio Encoder
| Stage | Dataset | Notes |
|---|---|---|
| Pre-training | LibriSpeech (960h) | MFA phoneme alignment for frame-level labels |
| Multi-task heads | Voicing, Manner, Place, Correctness, CTC | Trained jointly |
Results (epoch 5): Voicing 96%, Manner 93%, Place ~40%
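To make the head targets concrete, the sketch below maps a few ARPAbet consonants to (voicing, manner, place) and expands forced-alignment segments into per-frame labels at a 10 ms hop (hop_length 160 at 16 kHz, as in the usage code). The feature table is a simplified textbook classification, and `FEATURES`, `differing_features`, and `frame_labels` are hypothetical names; the label inventory actually used to train the encoder is not specified here.

```python
# (voicing, manner, place) for a few ARPAbet consonants (simplified).
FEATURES = {
    "P":  ("voiceless", "stop",        "bilabial"),
    "B":  ("voiced",    "stop",        "bilabial"),
    "F":  ("voiceless", "fricative",   "labiodental"),
    "TH": ("voiceless", "fricative",   "dental"),
    "S":  ("voiceless", "fricative",   "alveolar"),
    "Z":  ("voiced",    "fricative",   "alveolar"),
    "R":  ("voiced",    "approximant", "alveolar"),
    "W":  ("voiced",    "approximant", "labial-velar"),
}
FEATURE_NAMES = ("voicing", "manner", "place")

def differing_features(target, produced):
    """Names of the features on which two phonemes disagree."""
    pairs = zip(FEATURE_NAMES, FEATURES[target], FEATURES[produced])
    return [name for name, a, b in pairs if a != b]

def frame_labels(segments, hop_s=0.01):
    """Expand aligned (phoneme, start_s, end_s) segments into per-frame feature tuples."""
    labels = []
    for phoneme, start, end in segments:
        # Round boundaries to frame indices so durations do not drift.
        n_frames = round(end / hop_s) - round(start / hop_s)
        labels.extend([FEATURES[phoneme]] * n_frames)
    return labels

print(differing_features("R", "W"))   # substitution heard in "fwiends"
print(differing_features("TH", "F"))  # substitution heard in "wif"
labels = frame_labels([("S", 0.0, 0.05), ("P", 0.05, 0.12)])
```

Under this table, both substitutions differ from their targets only in place of articulation, the feature on which the encoder is weakest.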
LLM: 5-Stage Training Pipeline
| Stage | Data | Purpose |
|---|---|---|
| 1 | TinyStories + general text | Base language modelling |
| 2 | PubMed Central | Medical/clinical domain adaptation |
| 3 | CHILDES Eng-NA | Child language patterns |
| 4 | Synthetic SLP dialogues + Alpaca/FLAN | SLP dialogue fine-tuning |
| 5 | Stage 4 data + random audio injection | Cross-attention gate retraining |
Perplexity: Stage 1: 3.44 / Stage 2: 3.59 / Stage 3: 57.24 (distribution shift) / Stage 4: 1676 / Stage 5: 2502
Headline cross-evaluation result: perplexity (PPL) on held-out SLP dialogue data falls by 76% from the Stage 1 checkpoint to the Stage 4 checkpoint.
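For readers comparing these numbers, perplexity is the exponential of the mean per-token negative log-likelihood, and a relative reduction is the drop divided by the starting value. The sketch below shows both calculations with made-up inputs; none of the values are taken from the SpeakMK1 training runs, and the function names are hypothetical.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_reduction(ppl_before, ppl_after):
    """Fractional drop in perplexity, e.g. 0.76 for a 76% reduction."""
    return (ppl_before - ppl_after) / ppl_before

# Hypothetical per-token losses: a model that is uniformly unsure among 4 tokens.
uniform_over_4 = [math.log(4)] * 10
print(perplexity(uniform_over_4))

# Hypothetical before/after perplexities on the same held-out set.
print(relative_reduction(100.0, 24.0))
```

Perplexities are only comparable when measured on the same evaluation set and tokenizer, which is why a cross-evaluation on one held-out set can show a reduction even though the per-stage figures rise.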
Limitations and Known Issues
Cross-attention gate collapse (critical): During Stage 4 text-only SLP fine-tuning, the cross-attention gates collapsed to near zero, so the LLM effectively ignored audio. Stage 5 only partially recovered them, to the 0.004 to 0.009 range, so a manual override of gate = 0.3 is applied at inference. Even with the override, audio features condition the LLM structurally but do not semantically alter outputs; that would require training on paired audio + SLP dialogue data. Always apply the gate override shown in the usage example above.
Elevated LLM perplexity: Stages 4 and 5 show high perplexity (1676, 2502) due to distribution shift from general text to narrow SLP dialogue format. Output quality is functional but not production-grade.
Place of articulation accuracy: The phonological encoder achieves only ~40% place accuracy, compared to 96% voicing and 93% manner. Place of articulation is the hardest phonological feature to discriminate from acoustics alone.
Q-Former not used: The intended architecture included a Q-Former cross-modal alignment module. Convergence failure (retrieval accuracy stuck at random chance) led to replacement with the simpler DirectAudioProjection linear layer. Q-Former is documented as intended architecture but not present in these weights.
No multilingual support yet: Despite the UAE/multilingual clinical motivation, training data was English-only (LibriSpeech, CHILDES Eng-NA). Arabic and code-switching support is future work.
GPU required: mamba-ssm Triton kernels do not run on CPU. CUDA is mandatory.
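To give a feel for the gate values quoted in the gate-collapse item above, the sketch below models a gated residual update, `hidden + gate * audio_context`, and reports how much of the combined activation magnitude comes from the audio path. This assumes the gate is a plain scalar multiplier on the cross-attention output; the real block in speak_mk1_llm.py may apply a nonlinearity or a different structure, and the activations here are invented.

```python
def gated_cross_attention(hidden, audio_context, gate):
    """Residual update: hidden + gate * audio_context, element-wise."""
    return [h + gate * a for h, a in zip(hidden, audio_context)]

def audio_share(hidden, audio_context, gate):
    """Share of total activation magnitude (L1) that comes from the audio path."""
    audio = sum(abs(gate * a) for a in audio_context)
    text = sum(abs(h) for h in hidden)
    return audio / (audio + text)

# Hypothetical activations with equal total magnitude on both paths.
hidden = [1.0, -1.0, 0.5, -0.5]
audio_context = [0.5, 1.0, -1.0, 0.5]

for gate in (0.004, 0.009, 0.3):
    share = audio_share(hidden, audio_context, gate)
    print(f"gate={gate}: audio share = {share:.3f}")
```

With equal-magnitude inputs the audio share works out to gate / (1 + gate): under 1% for the recovered Stage 5 gates and about 23% with the 0.3 override.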
Hardware Requirements
| Component | Minimum | Tested On |
|---|---|---|
| GPU VRAM | 6GB | NVIDIA RTX 4060 Laptop (8GB) |
| CUDA | 11.8+ | CUDA 12.x |
| RAM | 16GB | 16GB |
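Before downloading weights, it can help to check that a CUDA device with enough memory is visible. The sketch below uses standard PyTorch calls; `check_gpu` is a hypothetical helper, and the 10% tolerance is an assumption added because reported memory is often slightly below the nominal card size.

```python
def check_gpu(min_vram_gb=6.0):
    """Return (ok, message) describing whether a suitable CUDA GPU is visible."""
    try:
        import torch
    except ImportError:
        return False, "PyTorch is not installed"
    if not torch.cuda.is_available():
        return False, "No CUDA device visible; mamba-ssm cannot run"
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    # Allow 10% slack below the nominal minimum (assumption, see lead-in).
    if vram_gb < 0.9 * min_vram_gb:
        return False, f"{props.name}: {vram_gb:.1f} GB VRAM is below {min_vram_gb} GB"
    return True, f"{props.name}: {vram_gb:.1f} GB VRAM, CUDA {torch.version.cuda}"

ok, message = check_gpu()
print(message)
```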
Citation
If you use SpeakMK1 in your work, please cite:
@misc{speakmk1_2025,
title = {SpeakMK1: A Multimodal Mamba-Attention Hybrid for Automated Speech-Language Pathology Assessment},
author = {Ebraheem, and Ihsan, Ali},
year = {2025},
institution = {Amity University Dubai},
note = {B.Tech CSE Major Project, supervised by Dr. Ved P. Mishra}
}
Acknowledgements
- Supervisor: Dr. Ved P. Mishra, Amity University Dubai
- Co-developer: Ali Ihsan
- TTS: kokoro-onnx by hexgrad
- Audio alignment: Montreal Forced Aligner
- SSM backbone: mamba-ssm by Albert Gu and Tri Dao
Early release: research prototype. Not for clinical use.