# VoiceForge API Documentation
## Base URL

- Development: `http://localhost:8000`
- Production: `https://api.voiceforge.example.com`
## Authentication

Most endpoints require authentication via a JWT Bearer token:

```
Authorization: Bearer <token>
```
## Endpoints

### Authentication

#### POST /api/v1/auth/register

Register a new user.
**Request:**

```json
{
  "email": "user@example.com",
  "password": "strongpassword123",
  "name": "Jane Doe"
}
```
**Response:**

```json
{
  "id": 1,
  "email": "user@example.com",
  "name": "Jane Doe",
  "created_at": "2024-01-01T10:00:00"
}
```
#### POST /api/v1/auth/login

Log in to obtain a JWT token.

**Request (form data):**

- username: user@example.com
- password: strongpassword123

**Response:**

```json
{
  "access_token": "ey...",
  "token_type": "bearer"
}
```
#### GET /api/v1/auth/me

Get the current user's profile.
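The auth flow above can be sketched in Python against the development base URL. This is illustrative, not part of the API: the helper names (`login`, `auth_header`) are made up here, and note that the login endpoint takes form fields, not JSON.

```python
import requests

BASE_URL = "http://localhost:8000"  # development base URL


def auth_header(token: str) -> dict:
    """Build the Authorization header used by protected endpoints."""
    return {"Authorization": f"Bearer {token}"}


def login(email: str, password: str) -> str:
    """POST form fields to /auth/login and return the access token.

    The login endpoint expects form data (username/password), not JSON.
    """
    resp = requests.post(
        f"{BASE_URL}/api/v1/auth/login",
        data={"username": email, "password": password},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


# Example (requires a running server):
# token = login("user@example.com", "strongpassword123")
# me = requests.get(f"{BASE_URL}/api/v1/auth/me", headers=auth_header(token))
# print(me.json())
```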
### Health Check

#### GET /health

Check if the API is running.
**Response:**

```json
{
  "status": "healthy",
  "service": "voiceforge-api",
  "version": "1.0.0"
}
```
#### GET /health/memory

Get current memory usage and loaded models.

**Response:**

```json
{
  "memory_mb": 1523.4,
  "loaded_models": ["distil-small.en", "small"],
  "models_detail": {
    "distil-small.en": {"loaded": true, "idle_seconds": 45.2}
  }
}
```
#### POST /health/memory/cleanup

Unload models that have been idle for more than 5 minutes to free memory.

#### POST /health/memory/unload-all

Unload all models to free the maximum amount of memory (roughly a 1 GB reduction).
### WebSocket Endpoints

#### WS /api/v1/ws/tts/{client_id}

Real-time TTS streaming via WebSocket (ultra-low latency).

**Protocol:**

The client sends a JSON request:

```json
{"text": "...", "voice": "...", "rate": "+0%", "pitch": "+0Hz"}
```

The server sends binary audio chunks, followed by a JSON completion message:

```json
{"status": "complete", "ttfb_ms": 150}
```

Expected TTFB: <500 ms.
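A minimal asyncio client for this protocol might look like the following sketch. The `websockets` package, the client id, and the voice name are assumptions for illustration, not part of the API spec:

```python
import asyncio
import json


def tts_request(text: str, voice: str, rate: str = "+0%", pitch: str = "+0Hz") -> str:
    """Serialize the JSON message the server expects on the TTS socket."""
    return json.dumps({"text": text, "voice": voice, "rate": rate, "pitch": pitch})


async def stream_tts(client_id: str, text: str, voice: str) -> bytes:
    """Connect, send one request, and collect binary chunks until the
    JSON completion message arrives."""
    import websockets  # third-party: pip install websockets

    uri = f"ws://localhost:8000/api/v1/ws/tts/{client_id}"
    audio = bytearray()
    async with websockets.connect(uri) as ws:
        await ws.send(tts_request(text, voice))
        async for message in ws:
            if isinstance(message, bytes):
                audio.extend(message)  # binary audio chunk
            else:
                status = json.loads(message)  # JSON completion message
                if status.get("status") == "complete":
                    break
    return bytes(audio)


# Example (requires a running server):
# audio = asyncio.run(stream_tts("demo-client", "Hello!", "en-US-AriaNeural"))
# open("out.mp3", "wb").write(audio)
```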
### Speech-to-Text

#### GET /api/v1/stt/languages

Get the list of supported languages.
**Response:**

```json
{
  "languages": [
    {
      "code": "en-US",
      "name": "English (US)",
      "native_name": "English",
      "flag": "🇺🇸",
      "stt_supported": true,
      "tts_supported": true
    }
  ],
  "total": 10
}
```
#### POST /api/v1/stt/upload

Transcribe an uploaded audio file.

**Request:** `multipart/form-data`

| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file (WAV, MP3, M4A, FLAC, OGG) |
| language | string | No | Language code (default: en-US) |
| enable_punctuation | boolean | No | Add punctuation (default: true) |
| enable_word_timestamps | boolean | No | Include word timing (default: true) |
| enable_diarization | boolean | No | Speaker detection (default: false) |
| speaker_count | integer | No | Expected speakers (2-10) |

**Response:**

```json
{
  "id": 1,
  "text": "Hello, world. This is a test transcription.",
  "segments": [
    {
      "text": "Hello, world.",
      "start_time": 0.0,
      "end_time": 1.5,
      "speaker": null,
      "confidence": 0.95
    }
  ],
  "words": [
    {
      "word": "Hello",
      "start_time": 0.0,
      "end_time": 0.5,
      "confidence": 0.98
    }
  ],
  "language": "en-US",
  "confidence": 0.95,
  "duration": 3.5,
  "word_count": 7,
  "processing_time": 1.23
}
```
#### POST /api/v1/stt/upload/quality

High-quality transcription mode (optimized for accuracy).

**Parameters (form-data):**

| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file |
| language | string | No | Language code (default: en-US) |
| preprocess | boolean | No | Apply noise reduction (default: false) |

**Features:**

- `beam_size=5` for more accurate decoding (~40% fewer errors)
- `condition_on_previous_text=False` to reduce hallucinations
- Optional audio preprocessing for noisy environments

**Response:** Same as `/upload`.
#### POST /api/v1/stt/upload/batch

Batch transcription for high throughput (2-3x speedup).

**Parameters (form-data):**

| Parameter | Type | Required | Description |
|---|---|---|---|
| files | file[] | Yes | Multiple audio files |
| language | string | No | Language code (default: en-US) |
| batch_size | integer | No | Batch size (default: 8) |

**Response:**

```json
{
  "count": 3,
  "results": [
    {"filename": "audio1.mp3", "text": "...", "processing_time": 2.1},
    {"filename": "audio2.mp3", "text": "...", "processing_time": 1.8}
  ],
  "mode": "batched",
  "batch_size": 8
}
```
#### POST /api/v1/stt/upload/diarize
Perform Speaker Diarization ("Who said what") on an audio file.
**Requirements:**
- `HF_TOKEN` must be set in `.env` (Hugging Face Token for pyannote model access)
- Uses `faster-whisper` for transcription + `pyannote.audio` for speaker identification
**Parameters (form-data):**
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file |
| num_speakers | integer | No | Exact number of speakers (optional) |
| min_speakers | integer | No | Min expected speakers (optional) |
| max_speakers | integer | No | Max expected speakers (optional) |
| language | string | No | Language code, e.g. 'en' (auto-detected if not provided) |
**Response:**
```json
{
"segments": [
{
"start": 0.0,
"end": 0.84,
"text": "Hello Test",
"speaker": "SPEAKER_00"
}
],
"speaker_stats": {
"SPEAKER_00": 0.84
},
"language": "en",
"status": "success"
}
```
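The `speaker_stats` field (seconds spoken per speaker) can also be reproduced client-side from the segments. A small helper, purely for illustration:

```python
from collections import defaultdict


def speaker_talk_time(segments: list[dict]) -> dict[str, float]:
    """Aggregate seconds spoken per speaker from diarization segments,
    mirroring the endpoint's speaker_stats field."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {spk: round(sec, 2) for spk, sec in sorted(totals.items())}


segments = [
    {"start": 0.0, "end": 0.84, "text": "Hello Test", "speaker": "SPEAKER_00"},
    {"start": 0.9, "end": 2.4, "text": "Hi there", "speaker": "SPEAKER_01"},
]
print(speaker_talk_time(segments))  # {'SPEAKER_00': 0.84, 'SPEAKER_01': 1.5}
```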
### Transcripts & Analysis

#### GET /api/v1/transcripts

List all past transcriptions.
**Response:**

```json
[
  {
    "id": 1,
    "text": "Hello world...",
    "created_at": "2024-01-01T12:00:00",
    "word_count": 150,
    "language": "en-US"
  }
]
```
#### POST /api/v1/transcripts/{id}/analyze

Run NLP analysis (sentiment, keywords, summary) on a transcript.

**Response:**

```json
{
  "status": "success",
  "analysis": {
    "sentiment": {"polarity": 0.5, "subjectivity": 0.1},
    "keywords": ["artificial intelligence", "voice", "app"],
    "summary": "This is a summary of the transcript."
  }
}
```
#### GET /api/v1/transcripts/{id}/export

Download a transcript in a specific format.

**Query Parameters:**

| Parameter | Type | Required | Description |
|---|---|---|---|
| format | string | Yes | txt, srt, vtt, pdf |

**Response:** File download (`text/plain`, `text/vtt`, `application/pdf`)
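A download helper might look like the following sketch. The function names and the local filename scheme are illustrative, not part of the API:

```python
import requests

EXPORT_EXTENSIONS = {"txt": ".txt", "srt": ".srt", "vtt": ".vtt", "pdf": ".pdf"}


def export_filename(transcript_id: int, fmt: str) -> str:
    """Choose a local filename for a given export format."""
    if fmt not in EXPORT_EXTENSIONS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"transcript-{transcript_id}{EXPORT_EXTENSIONS[fmt]}"


def download_transcript(transcript_id: int, fmt: str = "srt") -> str:
    """Fetch the export and write it to disk, returning the local path."""
    resp = requests.get(
        f"http://localhost:8000/api/v1/transcripts/{transcript_id}/export",
        params={"format": fmt},
    )
    resp.raise_for_status()
    path = export_filename(transcript_id, fmt)
    with open(path, "wb") as f:
        f.write(resp.content)
    return path


# Example (requires a running server):
# print(download_transcript(1, "srt"))
```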
### Text-to-Speech

#### GET /api/v1/tts/voices

Get all available voices.

**Query Parameters:**

| Parameter | Type | Description |
|---|---|---|
| language | string | Filter by language code |

**Response:**

```json
{
  "voices": [
    {
      "name": "en-US-Wavenet-D",
      "language_code": "en-US",
      "language_name": "English (US)",
      "ssml_gender": "MALE",
      "natural_sample_rate": 24000,
      "voice_type": "WaveNet",
      "display_name": "D (Male, WaveNet)",
      "flag": "🇺🇸"
    }
  ],
  "total": 50,
  "language_filter": null
}
```
#### GET /api/v1/tts/voices/{language}

Get voices for a specific language.

**Parameters:**

| Parameter | Type | Description |
|---|---|---|
| language | path | Language code (e.g., en-US) |
#### POST /api/v1/tts/synthesize

Convert text to speech.

**Request:**

```json
{
  "text": "Hello, this is a test.",
  "language": "en-US",
  "voice": "en-US-Wavenet-D",
  "audio_encoding": "MP3",
  "speaking_rate": 1.0,
  "pitch": 0.0,
  "volume_gain_db": 0.0,
  "use_ssml": false
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize (max 5000 chars) |
| language | string | No | Language code (default: en-US) |
| voice | string | No | Voice name |
| audio_encoding | string | No | MP3, LINEAR16, OGG_OPUS |
| speaking_rate | float | No | Speed 0.25-4.0 (default: 1.0) |
| pitch | float | No | Pitch -20 to 20 (default: 0.0) |
| volume_gain_db | float | No | Volume -96 to 16 dB |
| use_ssml | boolean | No | Treat as SSML markup |

**Response:**

```json
{
  "audio_content": "<base64 encoded audio>",
  "audio_size": 12345,
  "duration_estimate": 2.5,
  "voice_used": "en-US-Wavenet-D",
  "language": "en-US",
  "encoding": "MP3",
  "sample_rate": 24000,
  "processing_time": 0.45
}
```
#### POST /api/v1/tts/synthesize/audio

Synthesize and return the audio file directly. Takes the same request body as `/synthesize`, but returns the audio as a file download.

#### POST /api/v1/tts/stream

Stream synthesized audio for immediate playback.

**Request:** Same as `/synthesize`.

**Response:** Chunked audio stream (`audio/mpeg`). Ideal for long texts, since playback can begin before synthesis finishes (lower TTFB).
#### POST /api/v1/tts/ssml

Synthesize audio using SSML for prosody control (rate, pitch, emphasis).

**Request (form-data):**

- `text`: Text to speak
- `voice`: Voice name (default: "en-US-AriaNeural")
- `rate`: Speed (e.g., "fast", "-10%")
- `pitch`: Pitch (e.g., "high", "+5Hz")
- `emphasis`: "strong", "moderate", or "reduced"
- `auto_breaks`: true/false

**Response:** Audio file (`audio/mpeg`).
## Error Responses

All errors follow this format:

```json
{
  "error": "error_type",
  "message": "Human readable message",
  "detail": "Additional details (debug mode only)"
}
```
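Client code can map this error envelope onto a typed exception. A minimal sketch, with an illustrative class name:

```python
class VoiceForgeError(Exception):
    """Raised when the API returns a structured error body."""

    def __init__(self, status_code: int, error: str, message: str):
        super().__init__(f"{status_code} {error}: {message}")
        self.status_code = status_code
        self.error = error
        self.message = message


def raise_for_api_error(status_code: int, body: dict) -> None:
    """Turn a non-2xx response body into a typed exception; 2xx passes through."""
    if status_code >= 400:
        raise VoiceForgeError(
            status_code,
            body.get("error", "unknown_error"),
            body.get("message", ""),
        )


raise_for_api_error(200, {"status": "healthy"})  # no-op for success codes
try:
    raise_for_api_error(429, {"error": "rate_limited", "message": "Too many requests"})
except VoiceForgeError as exc:
    print(exc)  # 429 rate_limited: Too many requests
```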
### Common Error Codes

| Code | Type | Description |
|---|---|---|
| 400 | validation_error | Invalid request parameters |
| 401 | unauthorized | Missing or invalid auth token |
| 403 | forbidden | Insufficient permissions |
| 404 | not_found | Resource not found |
| 413 | file_too_large | Upload exceeds size limit |
| 429 | rate_limited | Too many requests |
| 500 | internal_error | Server error |

## Rate Limits

| Tier | Limit |
|---|---|
| Free | 60 requests/minute |
| Pro | 600 requests/minute |
| Enterprise | Custom |
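When a client exceeds its tier limit, the API answers 429 (`rate_limited`). A simple retry wrapper with exponential backoff is sketched below; this is an assumption-laden example (the API does not document a `Retry-After` header, so a fixed schedule is used):

```python
import time

import requests


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential delay in seconds: 1, 2, 4, ... capped at `cap`."""
    return min(base * (2 ** attempt), cap)


def post_with_backoff(url: str, max_retries: int = 5, **kwargs):
    """POST, retrying on HTTP 429 with exponential backoff."""
    for attempt in range(max_retries):
        resp = requests.post(url, **kwargs)
        if resp.status_code != 429:
            return resp
        time.sleep(backoff_delay(attempt))
    return resp  # give up after max_retries attempts
```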

## Supported Audio Formats

| Format | Extension | Notes |
|---|---|---|
| WAV | .wav | Best quality, no conversion |
| MP3 | .mp3 | Common, converted |
| M4A | .m4a | iOS format |
| FLAC | .flac | Lossless |
| OGG | .ogg | Open format |
| WebM | .webm | Browser recording |

## Code Examples

### Python

```python
import base64

import requests

# Transcribe an audio file
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/v1/stt/upload",
        files={"file": f},
        data={"language": "en-US"},
    )
print(response.json()["text"])

# Synthesize speech and save the decoded audio
response = requests.post(
    "http://localhost:8000/api/v1/tts/synthesize",
    json={"text": "Hello world", "language": "en-US"},
)
audio = base64.b64decode(response.json()["audio_content"])
with open("output.mp3", "wb") as f:
    f.write(audio)
```
### cURL

```bash
# Transcribe
curl -X POST http://localhost:8000/api/v1/stt/upload \
  -F "file=@audio.wav" \
  -F "language=en-US"

# Synthesize
curl -X POST http://localhost:8000/api/v1/tts/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "language": "en-US"}'