Peter Michael Gits Claude committed on Sep 3, 2025

Commit

16b78bc

1 Parent(s): 03b8c7c

Initial commit: STT GPU Service Python v4 with WebSocket streaming

- FastAPI app with WebSocket streaming at /ws/stream for 80ms chunks
- REST API at /transcribe for testing
- Pre-cached kyutai/stt-1b-en_fr model in Docker
- T4 Small GPU configuration with 30min auto-sleep
- STT processing that runs faster than real time
- Max 2 concurrent WebSocket connections

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
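
The commit message caps the service at 2 concurrent WebSocket connections but does not show how the cap is enforced. A minimal sketch of one way to do it, assuming a simple in-process counter: `ConnectionLimiter` and `handle_stream` are hypothetical names for illustration, not code from this repository, and the `asyncio.sleep` call stands in for the actual streaming work.

```python
import asyncio

MAX_STREAMS = 2  # from the commit message: max 2 concurrent WebSocket connections


class ConnectionLimiter:
    """Hypothetical helper: admits at most `limit` concurrent streams."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._active = 0

    def try_acquire(self) -> bool:
        # Safe without a lock: there is no await between the check and the update,
        # so no other coroutine on the same event loop can interleave here.
        if self._active >= self._limit:
            return False
        self._active += 1
        return True

    def release(self) -> None:
        self._active = max(0, self._active - 1)

    @property
    def active(self) -> int:
        return self._active


async def handle_stream(limiter: ConnectionLimiter, client_id: int, results: dict) -> None:
    """Stand-in for a WebSocket handler: reject when the service is full."""
    if not limiter.try_acquire():
        results[client_id] = "rejected"
        return
    try:
        await asyncio.sleep(0.01)  # placeholder for receiving audio and sending text
        results[client_id] = "served"
    finally:
        limiter.release()  # always free the slot, even if the stream fails


async def demo() -> dict:
    limiter = ConnectionLimiter(MAX_STREAMS)
    results: dict = {}
    # Three clients connect at once; only two slots exist.
    await asyncio.gather(*(handle_stream(limiter, i, results) for i in range(3)))
    return results


results = asyncio.run(demo())
```

In this sketch the third client is rejected immediately rather than queued, so an idle slot is never held by a waiting client. In a real WebSocket handler the rejection branch would presumably close the socket with an explanatory status instead of recording a string.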

Files changed (1)

LinkedInPost-1.md +57 -0

LinkedInPost-1.md ADDED

	@@ -0,0 +1,57 @@

+# 🎙️ Real-Time Speech-to-Text Service with Kyutai Moshi
+Just built a production-ready STT service using Kyutai's Moshi model for ultra-low latency speech recognition!
+## System Architecture
+```
+┌─────────────────────────────────────────────────────────────┐
+│                  stt-gpu-service-python-v4                 │
+│                     (Nvidia T4 Small)                      │
+│                                                             │
+│  ┌─────────────────┐    ┌─────────────────────────────────┐ │
+│  │   Moshi Model   │    │        API Interfaces           │ │
+│  │ kyutai/stt-1b   │    │                                 │ │
+│  │   (Cached)      │    │  🌐 WebSocket /ws/stream        │ │
+│  │                 │    │     ↓ 80ms audio chunks         │ │
+│  │  • 0.5s delay  │◄───┤     ↑ Real-time transcription   │ │
+│  │  • EN/FR       │    │                                 │ │
+│  │  • 1B params   │    │  📡 REST /transcribe            │ │
+│  │                 │    │     ↓ Audio file upload        │ │
+│  └─────────────────┘    │     ↑ JSON transcription       │ │
+│                         │                                 │ │
+│                         │  💓 GET /health                 │ │
+│                         │     ↑ Service status check     │ │
+│                         └─────────────────────────────────┘ │
+└─────────────────────────────────────────────────────────────┘
+                                   │
+                    ┌──────────────┼──────────────┐
+                    │              │              │
+            ┌───────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
+            │   Client 1  │ │ Client 2  │ │   Test    │
+            │ (Streaming) │ │(Streaming)│ │ (Upload)  │
+            └─────────────┘ └───────────┘ └───────────┘
+```
+## API Interface Details
+### 🌐 **WebSocket Streaming** `/ws/stream`
+Primary interface for real-time speech recognition. Clients send 80ms audio chunks and receive transcription over the same bidirectional connection, with ~200ms end-to-end latency for live conversations.
+### 📡 **REST Upload** `/transcribe`
+Secondary testing endpoint for complete audio file processing. A POST request with an audio file returns the full transcription with word-level timestamps.
+### 💓 **Health Check** `/health`
+Basic service monitoring endpoint for deployment status verification. Returns model readiness and GPU resource availability.
+## Technical Highlights
+- **Ultra-Low Latency**: 80ms frame processing with Moshi's native streaming
+- **Model Optimization**: Pre-cached in Docker image for instant startup
+- **Cost Efficient**: T4 Small GPU with 30-minute auto-sleep
+- **Production Ready**: Supports 2 concurrent streaming connections
+- **Multi-Language**: English and French recognition support
+Perfect for real-time voice applications, live transcription services, and conversational AI systems!
+#AI #SpeechRecognition #RealTime #MachineLearning #HuggingFace #Python #Docker
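
The post describes streaming in 80ms audio chunks without giving the audio format. The sketch below shows the framing arithmetic a client would need, assuming 24 kHz, 16-bit mono PCM: both the sample rate and the sample width are assumptions, not values stated in the post, and `split_into_chunks` is a hypothetical helper.

```python
SAMPLE_RATE_HZ = 24_000   # assumption: the post does not state the sample rate
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM
CHUNK_MS = 80             # from the post: 80ms audio chunks

SAMPLES_PER_CHUNK = SAMPLE_RATE_HZ * CHUNK_MS // 1000
BYTES_PER_CHUNK = SAMPLES_PER_CHUNK * BYTES_PER_SAMPLE


def split_into_chunks(pcm: bytes) -> list[bytes]:
    """Split raw PCM into fixed-size 80 ms chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(pcm), BYTES_PER_CHUNK):
        chunk = pcm[start:start + BYTES_PER_CHUNK]
        if len(chunk) < BYTES_PER_CHUNK:
            # Pad the trailing partial chunk with silence so every chunk is full length.
            chunk += b"\x00" * (BYTES_PER_CHUNK - len(chunk))
        chunks.append(chunk)
    return chunks


# One second of silence: 12.5 chunks' worth of audio, so the last chunk is padded.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = split_into_chunks(one_second)
```

Under these assumptions each chunk holds 1920 samples (3840 bytes), and one second of audio yields 13 chunks, the last of which is half silence. A different sample rate only changes the two constants at the top.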