metadata
title: Kyutai STT GPU Service Moshi v4
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
hardware: t4-small
app_port: 7860
Kyutai STT GPU Service Moshi v4
Official Moshi-Server Implementation - A streaming Speech-to-Text service using the official moshi-server from Kyutai with proven protocols.
Features
- Official moshi-server with MessagePack protocol
- Real-time streaming via the /api/asr-streaming endpoint
- Proven performance - 64 concurrent streams on L40S, 400 on H100
- 125ms processing time for real-time transcription
- Word-level timestamps and Voice Activity Detection
- Multilingual support - kyutai/stt-1b-en_fr model
Architecture
This Space uses the official moshi-server binary instead of custom implementations:
cargo install --features cuda moshi-server
moshi-server worker --config configs/config-stt-en_fr-hf.toml
WebSocket API
Official Protocol
- Endpoint: /api/asr-streaming
- Protocol: MessagePack (not JSON)
- Headers: kyutai-api-key: your-key
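The endpoint and header listed above can be assembled into connection parameters as in the minimal sketch below. The host name and the helper `connection_params` are hypothetical, and the choice of `wss` over `ws` is an assumption about how the Space is exposed. The result would be passed to whichever WebSocket client you use.

```python
ASR_PATH = "/api/asr-streaming"   # endpoint path from this README
API_KEY_HEADER = "kyutai-api-key"  # header name from this README


def connection_params(host, api_key, secure=True):
    """Return (url, headers) for a WebSocket client.

    `host` is a placeholder; the real host of the Space is not given here.
    """
    scheme = "wss" if secure else "ws"
    url = f"{scheme}://{host}{ASR_PATH}"
    headers = {API_KEY_HEADER: api_key}
    return url, headers
```

For example, `connection_params("example.invalid", "your-key")` yields a `wss://` URL ending in `/api/asr-streaming` and a one-entry header dict.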
Message Format
import msgpack
# Send audio (80ms blocks, 1920 samples at 24kHz)
# audio_data: one block of mono float PCM samples
chunk = {"type": "Audio", "pcm": [float(x) for x in audio_data]}
msg = msgpack.packb(chunk, use_bin_type=True, use_single_float=True)
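Longer recordings have to be cut into 80 ms blocks before each one is wrapped in an "Audio" message. The sketch below shows one way to do that with the standard library only. The helper names are hypothetical, and zero-padding the final short block is an assumption, since this README does not say how the server treats a partial block. Each resulting dict would then be encoded with `msgpack.packb` as shown above.

```python
SAMPLE_RATE = 24_000    # Hz, from the message format above
BLOCK_SAMPLES = 1_920   # 80 ms at 24 kHz


def split_into_blocks(samples, block_size=BLOCK_SAMPLES):
    """Split float samples into fixed-size blocks.

    The last block is zero-padded to full length (an assumption).
    """
    blocks = []
    for start in range(0, len(samples), block_size):
        block = [float(x) for x in samples[start:start + block_size]]
        block.extend([0.0] * (block_size - len(block)))
        blocks.append(block)
    return blocks


def to_audio_messages(samples):
    """Build one {"type": "Audio", "pcm": [...]} dict per block (not yet encoded)."""
    return [{"type": "Audio", "pcm": block} for block in split_into_blocks(samples)]
```

One second of audio (24,000 samples) produces 13 messages, because 12.5 blocks round up to 13 once the last block is padded.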
Response Types
- "Step" messages: Voice Activity Detection
- "Word" messages: Transcribed text with timestamps
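On the receiving side, each MessagePack frame decodes to a dict whose "type" field selects the handling. The sketch below routes already-decoded messages. The field names `text` and `start_time` on "Word" messages are assumptions not stated in this README, and the helper names are hypothetical.

```python
def handle_message(message, transcript):
    """Route one decoded server message (a dict) by its "type" field."""
    kind = message.get("type")
    if kind == "Word":
        # "text" and "start_time" are assumed field names
        transcript.append((message.get("start_time"), message.get("text", "")))
    elif kind == "Step":
        # Voice Activity Detection information; ignored in this sketch
        pass
    return transcript


def transcript_text(transcript):
    """Join the collected words into a single string."""
    return " ".join(text for _, text in transcript)
```

In a real client, `message` would come from `msgpack.unpackb(frame, raw=False)` for each frame received on the WebSocket.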
Performance
- Model: kyutai/stt-1b-en_fr (~1B params, 0.5s delay)
- Processing: ~125ms per audio chunk
- Concurrency: 64 streams per L40S GPU
- Memory: ~2.5GB VRAM required
Development
Based on the official Kyutai delayed-streams-modeling framework with proven streaming protocols used in production by Unmute.sh.
Cost Management
- Auto-sleep: after 30 minutes of inactivity
- T4 GPU: $0.40/hour when active
- Estimated: ~$29/month for 10 hours/week usage
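The monthly figure can be sanity-checked with simple arithmetic. The sketch below is a rough model under stated assumptions: billing continues during the 30-minute idle window before auto-sleep, and `sessions_per_week` is a hypothetical parameter not given in this README. Active time alone (10 hours/week at $0.40/hour) comes to about $17/month, so the ~$29 estimate presumably includes idle time or other overhead.

```python
WEEKS_PER_MONTH = 52 / 12


def monthly_cost(active_hours_per_week, sessions_per_week=0,
                 idle_hours_per_session=0.5, hourly_rate=0.40):
    """Estimate monthly GPU cost in dollars.

    Each session is assumed to add one idle window (30 minutes by default)
    that is billed before the Space goes to sleep.
    """
    billed_hours = active_hours_per_week + sessions_per_week * idle_hours_per_session
    return billed_hours * WEEKS_PER_MONTH * hourly_rate
```

With `monthly_cost(10)` the result is about $17.33. With 14 sessions per week (a hypothetical value) it rises to about $29.47, close to the stated estimate.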