SpeakMK1: Multimodal AI Speech-Language Pathology Assistant

SpeakMK1 is a 74.2M-parameter multimodal AI system for automated assessment of articulation disorders in children. It combines a BiMamba-based audio encoder with a custom Mamba-attention hybrid LLM to produce clinically framed, encouraging feedback in the style of a speech-language pathologist (SLP).

SpeakMK1 was developed at Amity University Dubai as a final-year B.Tech Computer Science Engineering project, supervised by Dr. Ved P. Mishra. It is framed for UAE multilingual clinical contexts, with the Dubai Health Authority as a target stakeholder.


Architecture Overview

SpeakMK1 uses a half-cascade design: audio encoder embeddings feed directly into the LLM via a projection layer, rather than going through full ASR transcription first. This preserves sub-phonemic acoustic detail that transcription would discard.

Child Speech Audio
        │
        ▼
┌───────────────────┐
│   Audio Encoder   │  BiMamba + UniMamba layers
│  (BiMamba/Uni)    │  Phonological multi-task heads
│                   │  Voicing / Manner / Place / CTC
└────────┬──────────┘
         │  (1, T, 512) frame embeddings
         ▼
┌───────────────────┐
│ DirectAudio       │  Linear projection
│ Projection        │  512 → 512
└────────┬──────────┘
         │  Audio tokens
         ▼
┌───────────────────┐
│   SpeakMK1LLM     │  74.2M params
│                   │  Mamba-SSM + Attention hybrid
│  LatentMoE        │  4 experts, top-2 routing
│  CrossModal       │  Sparse cross-attention gates
│  SparseAttention  │  Audio-conditioned generation
└────────┬──────────┘
         │  SLP-style text response
         ▼
┌───────────────────┐
│   Kokoro ONNX     │  TTS output
│   (af_heart)      │  Warm, child-friendly voice
└───────────────────┘
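The diagram lists 4 experts with top-2 routing for the LatentMoE block. As a rough illustration of what top-2 routing usually means, the sketch below picks the two highest-scoring experts for one token and renormalizes their softmax weights. This is a generic gating sketch, not the repository's implementation: the function name `top2_route` and the example router scores are hypothetical, and the actual LatentMoE router may differ.

```python
import math
from typing import List, Tuple


def top2_route(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their softmax weights.

    Generic top-k gating sketch; the real LatentMoE router may differ.
    """
    # Numerically stable softmax over all experts.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k largest probabilities, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Hypothetical router scores for one token over 4 experts.
routes = top2_route([0.1, 2.0, -1.0, 1.0])
for expert_index, weight in routes:
    print(f"expert {expert_index}: weight {weight:.3f}")
```

Only the two selected experts would run for that token, and their outputs would be combined using the renormalized weights.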

Component Details

| Component | Details |
| --- | --- |
| Audio Encoder | BiMamba + UniMamba layers, MFA-aligned phonological heads |
| Phonological heads | Voicing (96% acc), Manner (93% acc), Place (~40% acc), Correctness, CTC |
| Training data (encoder) | LibriSpeech with Montreal Forced Aligner alignment |
| LLM | 74.2M params, Mamba-SSM + attention, LatentMoE, factorized embeddings (128→512 bottleneck) |
| Tokenizer | EleutherAI GPT-NeoX-20B (vocab 50,283 + 5 custom SLP special tokens) |
| Projection | DirectAudioProjection, a single linear layer (512→512) |
| TTS | Kokoro ONNX v1.0, af_heart voice |
| Backend | FastAPI microservice |
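To see why a 128→512 embedding bottleneck matters at this model size, the arithmetic below compares parameter counts for a full embedding table and a factorized one. It assumes an ALBERT-style factorization (a vocab × 128 table followed by a bias-free 128 × 512 linear map) and takes the vocabulary size of 50283 from the `SpeakMK1LLMConfig` call in the usage example. The helper `embedding_params` is hypothetical, and the real layer layout may differ.

```python
from typing import Optional


def embedding_params(vocab_size: int, d_model: int, bottleneck: Optional[int] = None) -> int:
    """Count input-embedding parameters, optionally with a low-rank bottleneck."""
    if bottleneck is None:
        # Full table: one d_model-sized vector per token.
        return vocab_size * d_model
    # Factorized: small table plus a bias-free up-projection (assumed layout).
    return vocab_size * bottleneck + bottleneck * d_model


VOCAB, D_MODEL, BOTTLENECK = 50283, 512, 128

full = embedding_params(VOCAB, D_MODEL)
factorized = embedding_params(VOCAB, D_MODEL, BOTTLENECK)
print(f"full: {full:,}  factorized: {factorized:,}  saved: {full - factorized:,}")
# full: 25,744,896  factorized: 6,501,760  saved: 19,243,136
```

Under these assumptions the factorized embedding uses about a quarter of the parameters of a full table, which is significant for a model in the 74M range.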

Files in This Repository

| File | Description |
| --- | --- |
| audio_encoder_epoch_5.pt | Audio encoder weights (trained through epoch 5) |
| audio_proj_best.pt | DirectAudioProjection weights |
| ckpt_final.pt | LLM weights (Stage 5: SLP dialogue + cross-attention gate retraining) |
| audio_encoder.py | Encoder architecture definition |
| speak_mk1_llm.py | LLM architecture definition |
| train_proj.py | Projection layer definition |
| audio_trainer.py | SmallConfig and training utilities |

Kokoro TTS weights (kokoro-v1.0.onnx, voices-v1.0.bin) must be obtained separately from kokoro-onnx.


Usage

Installation

pip install torch transformers librosa soundfile mamba-ssm einops kokoro-onnx sounddevice

Note: mamba-ssm requires CUDA. CPU-only inference is not supported.

Load and run inference

import torch
import librosa
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
from audio_encoder import AudioEncoder
from audio_trainer import SmallConfig
from speak_mk1_llm import SpeakMK1LLM, SpeakMK1LLMConfig
from train_proj import DirectAudioProjection

DEVICE = torch.device("cuda")
REPO = "SakhrML/SpeakMK1_early"

# Download weights
enc_path  = hf_hub_download(REPO, "audio_encoder_epoch_5.pt")
proj_path = hf_hub_download(REPO, "audio_proj_best.pt")
llm_path  = hf_hub_download(REPO, "ckpt_final.pt")

# Load encoder
enc_cfg = SmallConfig(d_model=512, llm_dim=4096)
encoder = AudioEncoder(enc_cfg).to(DEVICE)
encoder.load_state_dict(torch.load(enc_path, map_location=DEVICE, weights_only=False), strict=False)
encoder.eval()

# Load projection
proj = DirectAudioProjection(512, 512).to(DEVICE)
proj_ckpt = torch.load(proj_path, map_location=DEVICE, weights_only=False)
proj.load_state_dict(proj_ckpt["audio_proj"])
proj.eval()

# Load LLM
llm_cfg = SpeakMK1LLMConfig(
    vocab_size=50283, d_model=512, d_state=64, num_blocks=6,
    nheads_ssm=8, nheads_attn=8, top_k_audio=32,
    num_experts=4, top_k_experts=2, dropout=0.0, aux_loss_weight=1e-2,
)
llm = SpeakMK1LLM(llm_cfg).to(DEVICE)
llm_ckpt = torch.load(llm_path, map_location=DEVICE, weights_only=False)
llm.load_state_dict(llm_ckpt["model"], strict=True)

# Apply cross-attention gate override (required; see Limitations)
with torch.no_grad():
    for block in llm.blocks:
        if hasattr(block.cross_attn, "gate"):
            block.cross_attn.gate.data.fill_(0.3)
llm.eval()

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer.add_special_tokens({"additional_special_tokens": [
    "<|system|>", "<|child|>", "<|slp|>", "<|think|>", "<|endturn|>"
]})

# Run on audio
audio_np, _ = librosa.load("child_speech.wav", sr=16000, mono=True)
mel_np = librosa.feature.melspectrogram(
    y=audio_np, sr=16000, n_fft=400, hop_length=160, n_mels=80, fmin=0.0, fmax=8000.0
)
mel_np = librosa.power_to_db(mel_np, ref=np.max)
mel = torch.tensor(mel_np.T, dtype=torch.float32).unsqueeze(0).to(DEVICE)

with torch.no_grad():
    audio_feats = encoder.encode_features(mel)
    audio_out = proj(audio_feats)
 
    prompt = (
        "<|system|>You are a warm, expert AI speech-language pathologist "
        "helping a child with articulation errors. Analyze the error and "
        "provide encouraging corrective feedback."
        "<|child|>I wanna pway wif my fwiends.<|slp|>"
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(DEVICE)
    logits, _, _ = llm(input_ids=input_ids, audio_out=audio_out, audio_padding_mask=None)
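The time dimension T of the mel input follows from the audio length. With librosa's default `center=True` padding, `melspectrogram` returns `1 + n_samples // hop_length` frames, so the settings above (16 kHz, `hop_length=160`) give roughly 100 frames per second. The helpers below are a hypothetical sketch for sanity-checking input shapes. Whether the encoder subsamples T before the projection is not stated here, so the sketch covers only the mel tensor.

```python
from typing import Tuple


def mel_frame_count(n_samples: int, hop_length: int = 160) -> int:
    """Frames produced by a centered STFT: one per hop, plus the initial frame."""
    return 1 + n_samples // hop_length


def mel_shape(duration_s: float, sr: int = 16000, hop_length: int = 160,
              n_mels: int = 80) -> Tuple[int, int, int]:
    """Expected shape of the (batch, T, n_mels) mel tensor built in the example."""
    n_samples = int(round(duration_s * sr))
    return (1, mel_frame_count(n_samples, hop_length), n_mels)


# A 3-second clip at 16 kHz has 48000 samples, which gives 301 frames.
shape = mel_shape(3.0)
print(shape)  # (1, 301, 80)
```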

Prompt format

<|system|>{system instruction}<|child|>{child utterance}<|slp|>

Special tokens: <|system|>, <|child|>, <|slp|>, <|think|>, <|endturn|>
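The usage example stops at the logits, so turning them into a response still needs a decoding loop. The sketch below assembles a prompt in the format above and runs a plain greedy loop against any function that returns next-token scores. `build_prompt`, `greedy_decode`, and `toy_step` are hypothetical names. Treating `<|endturn|>` as the stop token is an assumption, and greedy decoding is only the simplest option, not necessarily what the project uses.

```python
from typing import Callable, List, Sequence


def build_prompt(system: str, child_utterance: str) -> str:
    """Assemble a prompt in the <|system|>...<|child|>...<|slp|> layout."""
    return f"<|system|>{system}<|child|>{child_utterance}<|slp|>"


def greedy_decode(
    step_fn: Callable[[List[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    end_id: int,
    max_new_tokens: int = 64,
) -> List[int]:
    """Append the argmax token until end_id or the token budget is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = step_fn(ids)  # next-token scores for the current sequence
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        if next_id == end_id:
            break
        ids.append(next_id)
    # Return only the newly generated tokens.
    return ids[len(prompt_ids):]


# Toy stand-in for the model: emits token 1 twice, then the end token (3).
def toy_step(ids: List[int]) -> List[float]:
    return [0.0, 1.0, 0.0, 0.0] if len(ids) < 4 else [0.0, 0.0, 0.0, 1.0]


prompt = build_prompt("Be encouraging.", "I see a wabbit.")
new_ids = greedy_decode(toy_step, prompt_ids=[2, 2], end_id=3)
print(prompt)
print(new_ids)  # [1, 1]
```

With the real model, `step_fn` could wrap the `llm(...)` call from the usage example and return the scores at the last position, assuming the logits are shaped (batch, sequence, vocab). The stop id could come from `tokenizer.convert_tokens_to_ids("<|endturn|>")`.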


Training Details

Audio Encoder

| Stage | Dataset | Notes |
| --- | --- | --- |
| Pre-training | LibriSpeech (960h) | MFA phoneme alignment for frame-level labels |
| Multi-task heads | Voicing, Manner, Place, Correctness, CTC | Trained jointly |

Results (epoch 5): Voicing 96%, Manner 93%, Place ~40%

LLM: 5-Stage Training Pipeline

| Stage | Data | Purpose |
| --- | --- | --- |
| 1 | TinyStories + general text | Base language modelling |
| 2 | PubMed Central | Medical/clinical domain adaptation |
| 3 | CHILDES Eng-NA | Child language patterns |
| 4 | Synthetic SLP dialogues + Alpaca/FLAN | SLP dialogue fine-tuning |
| 5 | Stage 4 data + random audio injection | Cross-attention gate retraining |

Perplexity by stage: Stage 1: 3.44 / Stage 2: 3.59 / Stage 3: 57.24 (distribution shift) / Stage 4: 1676 / Stage 5: 2502

Headline cross-evaluation result: a 76% perplexity (PPL) reduction from Stage 1 to Stage 4, measured on held-out SLP dialogue data.
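For readers less familiar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood, and a relative reduction compares two checkpoints scored on the same held-out set. The sketch below shows both calculations. The helper names and the example numbers are hypothetical and are not the project's measurements.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_reduction(ppl_before: float, ppl_after: float) -> float:
    """Fractional drop in perplexity between two checkpoints on the same data."""
    return (ppl_before - ppl_after) / ppl_before


# Hypothetical per-token losses from one checkpoint on a tiny held-out sample.
example_ppl = perplexity([2.0, 2.5, 3.0])

# Hypothetical before/after values chosen to illustrate a 76% reduction.
drop = relative_reduction(100.0, 24.0)
print(f"perplexity: {example_ppl:.2f}, reduction: {drop:.0%}")
```

A reduction is only meaningful when both checkpoints are scored on the same held-out set with the same tokenizer.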


Limitations and Known Issues

Cross-attention gate collapse (critical): During Stage 4 text-only SLP fine-tuning, the cross-attention gates collapsed to near zero, so the LLM effectively ignored audio. Stage 5 only partially recovered them, to the 0.004 to 0.009 range, so a manual override of gate = 0.3 is applied at inference. Audio features condition the LLM structurally but do not semantically alter outputs without paired audio+SLP dialogue training data. Always apply the gate override shown in the usage example above.

Elevated LLM perplexity: Stages 4 and 5 show high perplexity (1676, 2502) due to distribution shift from general text to narrow SLP dialogue format. Output quality is functional but not production-grade.

Place of articulation accuracy: The phonological encoder achieves only ~40% place accuracy, compared to 96% voicing and 93% manner. Place of articulation is the hardest phonological feature to discriminate from acoustics alone.

Q-Former not used: The intended architecture included a Q-Former cross-modal alignment module. Convergence failure (retrieval accuracy stuck at random chance) led to its replacement with the simpler DirectAudioProjection linear layer. The Q-Former is documented as the intended architecture but is not present in these weights.

No multilingual support yet: Despite the UAE/multilingual clinical motivation, training data was English-only (LibriSpeech, CHILDES Eng-NA). Arabic and code-switching support is future work.

GPU required: mamba-ssm Triton kernels do not run on CPU. CUDA is mandatory.
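To make the gate-collapse limitation above more concrete, the sketch below shows how a scalar gate scales the audio branch of a residual update. It assumes the gate multiplies the cross-attention output directly, as in `hidden + gate * cross_attn_out`. The actual block may apply the gate differently (for example through a nonlinearity), and the function names and vectors are hypothetical.

```python
from typing import List, Sequence


def gated_residual(hidden: Sequence[float], cross_attn_out: Sequence[float],
                   gate: float) -> List[float]:
    """Residual update with a scalar gate on the audio (cross-attention) branch."""
    return [h + gate * c for h, c in zip(hidden, cross_attn_out)]


def max_shift(hidden: Sequence[float], cross_attn_out: Sequence[float],
              gate: float) -> float:
    """Largest per-element change that the audio branch causes in the hidden state."""
    out = gated_residual(hidden, cross_attn_out, gate)
    return max(abs(o - h) for o, h in zip(out, hidden))


# Hypothetical hidden state and cross-attention output for one token.
hidden = [1.0, -2.0, 0.5]
audio = [0.8, 0.4, -1.0]

collapsed = max_shift(hidden, audio, gate=0.009)   # upper end of the Stage 5 range
overridden = max_shift(hidden, audio, gate=0.3)    # manual inference override
print(f"collapsed: {collapsed:.4f}, overridden: {overridden:.4f}")
```

Under this assumption, a gate near 0.009 leaves the hidden state almost untouched by audio, while the 0.3 override lets the audio branch shift it by a visibly larger amount.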


Hardware Requirements

| Component | Minimum | Tested On |
| --- | --- | --- |
| GPU VRAM | 6 GB | NVIDIA RTX 4060 Laptop (8 GB) |
| CUDA | 11.8+ | CUDA 12.x |
| RAM | 16 GB | 16 GB |

Citation

If you use SpeakMK1 in your work, please cite:

@misc{speakmk1_2025,
  title     = {SpeakMK1: A Multimodal Mamba-Attention Hybrid for Automated Speech-Language Pathology Assessment},
  author    = {Ebraheem, and Ihsan, Ali},
  year      = {2025},
  institution = {Amity University Dubai},
  note      = {B.Tech CSE Major Project, supervised by Dr. Ved P. Mishra}
}

Acknowledgements


Early release: research prototype. Not for clinical use.
