---
title: Kyutai STT GPU Service Moshi v4
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
hardware: t4-small
app_port: 7860
---

Kyutai STT GPU Service Moshi v4

Official Moshi-Server Implementation - a streaming Speech-to-Text service built on the official moshi-server from Kyutai and its MessagePack streaming protocol.

Features

  • Official moshi-server with MessagePack protocol
  • Real-time streaming via /api/asr-streaming endpoint
  • Proven performance - 64 concurrent streams on L40S, 400 on H100
  • 125ms processing time for real-time transcription
  • Word-level timestamps and Voice Activity Detection
  • English and French support - kyutai/stt-1b-en_fr model

Architecture

This Space uses the official moshi-server binary instead of custom implementations:

# Build the official server binary with CUDA support
cargo install --features cuda moshi-server
# Run the STT worker with Kyutai's English/French config
moshi-server worker --config configs/config-stt-en_fr-hf.toml

WebSocket API

Official Protocol

  • Endpoint: /api/asr-streaming
  • Protocol: MessagePack (not JSON)
  • Headers: kyutai-api-key: your-key (see the connection sketch below)

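A minimal Python connection sketch for the protocol above. The websockets package, the localhost URL, and the placeholder API key are illustrative assumptions, not part of the official docs.

import asyncio
import websockets  # pip install websockets

async def connect():
    # Placeholder address: for a deployed Space this would be the Space's own
    # wss:// URL; the local port is assumed here, check your server config.
    url = "ws://localhost:8080/api/asr-streaming"
    headers = {"kyutai-api-key": "your-key"}
    # Older websockets releases name this parameter extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        print("connected")
        # ... send MessagePack audio frames here (see Message Format below)

asyncio.run(connect())
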
Message Format

import msgpack

# Send audio in 80ms blocks: 1920 mono float samples at 24kHz
chunk = {"type": "Audio", "pcm": [float(x) for x in audio_data]}
msg = msgpack.packb(chunk, use_bin_type=True, use_single_float=True)

Response Types

  • "Step" messages: Voice Activity Detection
  • "Word" messages: Transcribed text with timestamps

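A hedged sketch of a receive loop for both message types. The field names (text, start_time) follow Kyutai's published example clients and are assumptions here; verify them against the server version you deploy.

import msgpack

async def receive_transcript(ws):
    # ws is an open connection such as the one from the sketch above
    async for raw in ws:
        msg = msgpack.unpackb(raw, raw=False)
        if msg["type"] == "Word":
            # Transcribed word plus its start time in seconds
            print(f'{msg["start_time"]:.2f}s  {msg["text"]}')
        elif msg["type"] == "Step":
            # Per-step voice activity information; ignored in this sketch
            pass
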
Performance

  • Model: kyutai/stt-1b-en_fr (~1B params, 0.5s delay)
  • Processing: ~125ms per audio chunk
  • Concurrency: 64 streams per L40S GPU
  • Memory: ~2.5GB VRAM required

Development

Based on Kyutai's official delayed-streams-modeling framework and the same streaming protocol used in production by Unmute.sh.

Cost Management

  • Auto-sleep: 30 minutes inactivity
  • T4 GPU: $0.40/hour when active
  • Estimated: ~$17/month for 10 hours/week of active use (10 h × 4.33 weeks × $0.40/h), plus a little extra from the 30-minute auto-sleep window; see the quick estimate below
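
A quick back-of-the-envelope check of the numbers above, using only the rate and usage figures listed in this section:

# Rough monthly cost at the rates listed above; ignores auto-sleep overhead.
hourly_rate = 0.40       # USD/hour, T4 small
hours_per_week = 10
weeks_per_month = 4.33
print(f"~${hourly_rate * hours_per_week * weeks_per_month:.2f}/month")  # ≈ $17.32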