SpeakMK1 — Multimodal AI Speech-Language Pathology Assistant

SpeakMK1 is a 74.2M-parameter multimodal AI system for automated assessment of articulation disorders in children. It combines a BiMamba-based audio encoder with a custom Mamba-attention hybrid LLM to produce clinically framed, encouraging feedback in the style of a speech-language pathologist (SLP).

SpeakMK1 was developed at Amity University Dubai as a final-year B.Tech Computer Science Engineering project, supervised by Dr. Ved P. Mishra. It is framed for UAE multilingual clinical contexts, with the Dubai Health Authority as a target stakeholder.

Architecture Overview

SpeakMK1 uses a half-cascade design: audio encoder embeddings feed directly into the LLM via a projection layer, rather than going through full ASR transcription first. This preserves sub-phonemic acoustic detail that transcription would discard.

Child Speech Audio
        │
        ▼
┌───────────────────┐
│   Audio Encoder   │  BiMamba + UniMamba layers
│  (BiMamba/Uni)    │  Phonological multi-task heads
│                   │  Voicing / Manner / Place / CTC
└────────┬──────────┘
         │  (1, T, 512) frame embeddings
         ▼
┌───────────────────┐
│ DirectAudio       │  Linear projection
│ Projection        │  512 → 512
└────────┬──────────┘
         │  Audio tokens
         ▼
┌───────────────────┐
│   SpeakMK1LLM     │  74.2M params
│                   │  Mamba-SSM + Attention hybrid
│  LatentMoE        │  4 experts, top-2 routing
│  CrossModal       │  Sparse cross-attention gates
│  SparseAttention  │  Audio-conditioned generation
└────────┬──────────┘
         │  SLP-style text response
         ▼
┌───────────────────┐
│   Kokoro ONNX     │  TTS output
│   (af_heart)      │  Warm, child-friendly voice
└───────────────────┘
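
The "4 experts, top-2 routing" label in the diagram can be illustrated with a small, framework-free sketch. It assumes the common formulation (softmax over router logits, keep the two most probable experts, renormalise their weights); the actual LatentMoE router in `speak_mk1_llm.py` may differ, and `top2_route` is a hypothetical helper, not part of the repository.

```python
import math

def top2_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalise their softmax weights."""
    # Numerically stable softmax over all experts
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k most probable experts, best first
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalise so the selected weights sum to 1
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token, four experts (illustrative logits, not taken from the model)
routes = top2_route([0.1, 2.0, -1.0, 1.0])
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

Each token's output would then be the weighted sum of the two selected experts' outputs, so only half of the four experts run per token.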

Component Details

Component	Details
Audio Encoder	BiMamba + UniMamba layers, MFA-aligned phonological heads
Phonological heads	Voicing (96% acc), Manner (93% acc), Place (~40% acc), Correctness, CTC
Training data (encoder)	LibriSpeech with Montreal Forced Aligner alignment
LLM	74.2M params, Mamba-SSM + attention, LatentMoE, factorized embeddings (128→512 bottleneck)
Tokenizer	EleutherAI GPT-NeoX-20B (vocab 50,283 + 5 custom SLP special tokens)
Projection	DirectAudioProjection — single linear layer (512→512)
TTS	Kokoro ONNX v1.0, `af_heart` voice
Backend	FastAPI microservice
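
To show why the factorized embedding (128→512 bottleneck) matters at this model size, the arithmetic below compares it with a plain embedding table. This is a back-of-the-envelope sketch that assumes no bias terms and a single linear up-projection; the exact layer layout in `speak_mk1_llm.py` is not confirmed here.

```python
# Assumptions: no bias terms; the 128 -> 512 up-projection is one linear map.
VOCAB = 50_283      # vocab_size used in the usage example
D_MODEL = 512       # model width
BOTTLENECK = 128    # factorized embedding width

full = VOCAB * D_MODEL                                  # plain V x 512 table
factorized = VOCAB * BOTTLENECK + BOTTLENECK * D_MODEL  # V x 128 table + 128 x 512 map
saved = full - factorized

print(f"full={full:,} factorized={factorized:,} saved={saved:,}")
# → full=25,744,896 factorized=6,501,760 saved=19,243,136
```

Under these assumptions the bottleneck shrinks the input embedding from roughly 25.7M to 6.5M parameters, a substantial saving relative to a 74.2M total.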

Files in This Repository

File	Description
`audio_encoder_epoch_5.pt`	Audio encoder weights (trained through epoch 5)
`audio_proj_best.pt`	DirectAudioProjection weights
`ckpt_final.pt`	LLM weights (Stage 5 — SLP dialogue + cross-attention gate retraining)
`audio_encoder.py`	Encoder architecture definition
`speak_mk1_llm.py`	LLM architecture definition
`train_proj.py`	Projection layer definition
`audio_trainer.py`	SmallConfig and training utilities

Kokoro TTS weights (`kokoro-v1.0.onnx`, `voices-v1.0.bin`) are not included in this repository and must be obtained separately from the kokoro-onnx project.

Usage

Installation

pip install torch transformers librosa soundfile mamba-ssm einops kokoro-onnx sounddevice

Note: mamba-ssm requires CUDA. CPU-only inference is not supported.

Load and run inference

import torch
import librosa
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
from audio_encoder import AudioEncoder
from audio_trainer import SmallConfig
from speak_mk1_llm import SpeakMK1LLM, SpeakMK1LLMConfig
from train_proj import DirectAudioProjection

DEVICE = torch.device("cuda")
REPO = "SakhrML/SpeakMK1_early"

# Download weights
enc_path  = hf_hub_download(REPO, "audio_encoder_epoch_5.pt")
proj_path = hf_hub_download(REPO, "audio_proj_best.pt")
llm_path  = hf_hub_download(REPO, "ckpt_final.pt")

# Load encoder
enc_cfg = SmallConfig(d_model=512, llm_dim=4096)
encoder = AudioEncoder(enc_cfg).to(DEVICE)
encoder.load_state_dict(torch.load(enc_path, map_location=DEVICE, weights_only=False), strict=False)
encoder.eval()

# Load projection
proj = DirectAudioProjection(512, 512).to(DEVICE)
proj_ckpt = torch.load(proj_path, map_location=DEVICE, weights_only=False)
proj.load_state_dict(proj_ckpt["audio_proj"])
proj.eval()

# Load LLM
llm_cfg = SpeakMK1LLMConfig(
    vocab_size=50283, d_model=512, d_state=64, num_blocks=6,
    nheads_ssm=8, nheads_attn=8, top_k_audio=32,
    num_experts=4, top_k_experts=2, dropout=0.0, aux_loss_weight=1e-2,
)
llm = SpeakMK1LLM(llm_cfg).to(DEVICE)
llm_ckpt = torch.load(llm_path, map_location=DEVICE, weights_only=False)
llm.load_state_dict(llm_ckpt["model"], strict=True)

# Apply cross-attention gate override (required — see Limitations)
with torch.no_grad():
    for block in llm.blocks:
        if hasattr(block.cross_attn, "gate"):
            block.cross_attn.gate.data.fill_(0.3)
llm.eval()

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer.add_special_tokens({"additional_special_tokens": [
    "<|system|>", "<|child|>", "<|slp|>", "<|think|>", "<|endturn|>"
]})

# Run on audio
audio_np, _ = librosa.load("child_speech.wav", sr=16000, mono=True)
mel_np = librosa.feature.melspectrogram(
    y=audio_np, sr=16000, n_fft=400, hop_length=160, n_mels=80, fmin=0.0, fmax=8000.0
)
mel_np = librosa.power_to_db(mel_np, ref=np.max)
mel = torch.tensor(mel_np.T, dtype=torch.float32).unsqueeze(0).to(DEVICE)

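with torch.no_grad():
    # Encode the mel spectrogram and project it into the LLM embedding space
    audio_feats = encoder.encode_features(mel)
    audio_out = proj(audio_feats)

    # Text prompt: system instruction, child utterance, then the SLP turn marker
    prompt = (
        "<|system|>You are a warm, expert AI speech-language pathologist "
        "helping a child with articulation errors. Analyze the error and "
        "provide encouraging corrective feedback."
        "<|child|>I wanna pway wif my fwiends.<|slp|>"
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(DEVICE)
    # Single forward pass; logits holds the token scores, the other two return values are unused here
    logits, _, _ = llm(input_ids=input_ids, audio_out=audio_out, audio_padding_mask=None)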
Prompt format

<|system|>{system instruction}<|child|>{child utterance}<|slp|>

Special tokens: <|system|>, <|child|>, <|slp|>, <|think|>, <|endturn|>
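
A small helper keeps the prompt format consistent across calls. This is a minimal sketch for the single-turn case shown above; `build_prompt` is a hypothetical name, and how `<|think|>` and `<|endturn|>` are used in multi-turn prompts is not covered here.

```python
SYSTEM, CHILD, SLP = "<|system|>", "<|child|>", "<|slp|>"

def build_prompt(system_instruction: str, child_utterance: str) -> str:
    """Assemble a single-turn prompt ending in <|slp|> so the model writes the SLP reply."""
    return f"{SYSTEM}{system_instruction.strip()}{CHILD}{child_utterance.strip()}{SLP}"

prompt = build_prompt(
    "You are a warm, expert AI speech-language pathologist.",
    "I wanna pway wif my fwiends.",
)
print(prompt)
```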

Training Details

Audio Encoder

Stage	Dataset	Notes
Pre-training	LibriSpeech (960h)	MFA phoneme alignment for frame-level labels
Multi-task heads	Voicing, Manner, Place, Correctness, CTC	Trained jointly

Results (epoch 5, accuracy): Voicing 96%, Manner 93%, Place ~40%

LLM — 5-Stage Training Pipeline

Stage	Data	Purpose
1	TinyStories + general text	Base language modelling
2	PubMed Central	Medical/clinical domain adaptation
3	CHILDES Eng-NA	Child language patterns
4	Synthetic SLP dialogues + Alpaca/FLAN	SLP dialogue fine-tuning
5	Stage 4 data + random audio injection	Cross-attention gate retraining

Perplexity by stage: Stage 1: 3.44; Stage 2: 3.59; Stage 3: 57.24 (distribution shift); Stage 4: 1676; Stage 5: 2502.

Headline cross-evaluation result: a 76% perplexity (PPL) reduction from Stage 1 to Stage 4 on held-out SLP dialogue data.
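
For reference, perplexity and a relative reduction such as the 76% figure are computed as sketched below. The numbers in the example are illustrative only and are not SpeakMK1 measurements; `perplexity` and `relative_reduction` are hypothetical helper names.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_reduction(ppl_before, ppl_after):
    """Fractional drop in perplexity between two checkpoints evaluated on the same data."""
    return 1.0 - ppl_after / ppl_before

# Illustrative values only
ppl = perplexity([1.2, 0.9, 1.5])        # mean NLL is 1.2, so ppl = exp(1.2)
drop = relative_reduction(100.0, 24.0)   # 0.76, i.e. a 76% reduction
print(f"ppl={ppl:.2f} drop={drop:.0%}")
```

A reduction is only meaningful when both checkpoints are scored on the same held-out set with the same tokenizer.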

Limitations and Known Issues

Cross-attention gate collapse (critical): During Stage 4 text-only SLP fine-tuning, the cross-attention gates collapsed to near zero, so the LLM effectively ignored audio. Stage 5 only partially recovered them, to the 0.004 to 0.009 range, so a manual override of gate = 0.3 is applied at inference. Audio features condition the LLM structurally but, without paired audio and SLP dialogue training data, do not semantically alter its outputs. Always apply the gate override shown in the usage example above.
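
The effect of the gate value can be seen in a toy calculation. The sketch assumes the gate is a scalar multiplier on the cross-attention branch before it is added to the text stream; the usage example only shows that `block.cross_attn.gate` is a tensor that can be filled with a value, so the real layer may apply it differently. `gated_residual` is a hypothetical helper.

```python
def gated_residual(hidden, cross_attn_out, gate):
    """Add the audio cross-attention output to the text stream, scaled by a scalar gate."""
    return [h + gate * c for h, c in zip(hidden, cross_attn_out)]

hidden = [1.0, -2.0, 0.5]          # text-stream activations (illustrative)
audio_branch = [0.8, 0.4, -1.0]    # cross-attention output (illustrative)

for gate in (0.0, 0.009, 0.3):
    out = gated_residual(hidden, audio_branch, gate)
    shift = max(abs(o - h) for o, h in zip(out, hidden))
    print(f"gate={gate}: largest change to the text stream = {shift:.4f}")
```

Under this assumption, a gate of 0.009 lets through less than 1% of the audio branch, which is consistent with the audio being effectively ignored until the override is applied.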

Elevated LLM perplexity: Stages 4 and 5 show high perplexity (1676, 2502) due to distribution shift from general text to narrow SLP dialogue format. Output quality is functional but not production-grade.

Place of articulation accuracy: The phonological encoder achieves only ~40% place accuracy, compared to 96% voicing and 93% manner. Place of articulation is the hardest phonological feature to discriminate from acoustics alone.

Q-Former not used: The intended architecture included a Q-Former cross-modal alignment module. Convergence failure (retrieval accuracy stuck at random chance) led to its replacement with the simpler DirectAudioProjection linear layer. The Q-Former is documented as the intended architecture but is not present in these weights.

No multilingual support yet: Despite the UAE/multilingual clinical motivation, training data was English-only (LibriSpeech, CHILDES Eng-NA). Arabic and code-switching support are future work.

GPU required: mamba-ssm Triton kernels do not run on CPU. CUDA is mandatory.

Hardware Requirements

Component	Minimum	Tested On
GPU VRAM	6 GB	NVIDIA RTX 4060 Laptop (8 GB)
CUDA	11.8+	CUDA 12.x
RAM	16 GB	16 GB
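
Because CUDA is mandatory, a preflight check before loading any weights can save a confusing failure later. The sketch below uses only standard PyTorch calls; `cuda_preflight` and the `MIN_VRAM_GB` constant are hypothetical names, with the threshold taken from the table above.

```python
import importlib.util

MIN_VRAM_GB = 6  # minimum from the hardware table

def cuda_preflight():
    """Return (ok, message) describing whether a suitable CUDA device is visible to PyTorch."""
    if importlib.util.find_spec("torch") is None:
        return False, "PyTorch is not installed."
    import torch
    if not torch.cuda.is_available():
        return False, "PyTorch is installed but no CUDA device is available."
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 2**30
    if vram_gb < MIN_VRAM_GB:
        return False, f"{name} has {vram_gb:.1f} GB VRAM; at least {MIN_VRAM_GB} GB is recommended."
    return True, f"CUDA device found: {name} ({vram_gb:.1f} GB VRAM)"

ok, message = cuda_preflight()
print(message)
```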

Citation

If you use SpeakMK1 in your work, please cite:

@misc{speakmk1_2025,
  title     = {SpeakMK1: A Multimodal Mamba-Attention Hybrid for Automated Speech-Language Pathology Assessment},
  author    = {Ebraheem, and Ihsan, Ali},
  year      = {2025},
  institution = {Amity University Dubai},
  note      = {B.Tech CSE Major Project, supervised by Dr. Ved P. Mishra}
}

Acknowledgements

Supervisor: Dr. Ved P. Mishra, Amity University Dubai
Co-developer: Ali Ihsan
TTS: kokoro-onnx by hexgrad
Audio alignment: Montreal Forced Aligner
SSM backbone: mamba-ssm by Albert Gu and Tri Dao

Early release — research prototype. Not for clinical use.
