VoiceForge API Documentation

Base URL

  • Development: http://localhost:8000
  • Production: https://api.voiceforge.example.com

Authentication

Most endpoints require authentication via a JWT bearer token, passed in the Authorization header:

Authorization: Bearer <token>

Endpoints

Authentication

POST /api/v1/auth/register

Register a new user.

Request:

{
  "email": "user@example.com",
  "password": "strongpassword123",
  "name": "Jane Doe"
}

Response:

{
  "id": 1,
  "email": "user@example.com",
  "name": "Jane Doe",
  "created_at": "2024-01-01T10:00:00"
}

POST /api/v1/auth/login

Log in to obtain a JWT access token.

Request (Form Data):

Response:

{
  "access_token": "ey...",
  "token_type": "bearer"
}

GET /api/v1/auth/me

Get current user profile.
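
The register/login/me flow above can be sketched in Python. Note the doc doesn't list the login form fields, so the names `username` and `password` below are an assumption based on common OAuth2 password-flow conventions, and the `login`/`auth_headers` helpers are illustrative names, not part of the API:

```python
import requests

BASE = "http://localhost:8000"

def login(email: str, password: str) -> str:
    # /auth/login takes form data (not JSON); the field names
    # "username"/"password" are an assumption here.
    resp = requests.post(
        f"{BASE}/api/v1/auth/login",
        data={"username": email, "password": password},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def auth_headers(token: str) -> dict:
    # Bearer header required by most endpoints.
    return {"Authorization": f"Bearer {token}"}

if __name__ == "__main__":
    token = login("user@example.com", "strongpassword123")
    me = requests.get(f"{BASE}/api/v1/auth/me", headers=auth_headers(token))
    print(me.json())
```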



Health Check

GET /health

Check if the API is running.

Response:

{
  "status": "healthy",
  "service": "voiceforge-api",
  "version": "1.0.0"
}

GET /health/memory

Get current memory usage and loaded models.

Response:

{
  "memory_mb": 1523.4,
  "loaded_models": ["distil-small.en", "small"],
  "models_detail": {
    "distil-small.en": {"loaded": true, "idle_seconds": 45.2}
  }
}

POST /health/memory/cleanup

Unload idle models (inactive > 5 minutes) to free memory.

POST /health/memory/unload-all

Unload ALL models to free maximum memory (~1GB reduction).
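
A minimal monitoring sketch tying the memory endpoints together: poll /health/memory, and trigger /health/memory/cleanup when models have been idle past the 5-minute threshold. The `idle_models` helper is illustrative, not part of the API:

```python
import requests

BASE = "http://localhost:8000"

def idle_models(report: dict, threshold_s: float = 300.0) -> list:
    # Pick out models idle longer than the cleanup threshold
    # (the cleanup endpoint unloads models inactive > 5 minutes).
    return [
        name
        for name, info in report.get("models_detail", {}).items()
        if info.get("loaded") and info.get("idle_seconds", 0) > threshold_s
    ]

if __name__ == "__main__":
    report = requests.get(f"{BASE}/health/memory").json()
    print(f"{report['memory_mb']:.0f} MB in use, loaded: {report['loaded_models']}")
    if idle_models(report):
        requests.post(f"{BASE}/health/memory/cleanup")
```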


WebSocket Endpoints

WS /api/v1/ws/tts/{client_id}

Real-time TTS streaming via WebSocket (ultra-low latency).

Protocol:

  • Client sends: JSON {"text": "...", "voice": "...", "rate": "+0%", "pitch": "+0Hz"}
  • Server sends: Binary audio chunks followed by JSON {"status": "complete", "ttfb_ms": 150}

Expected TTFB: <500ms
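
The protocol above (JSON request in, binary chunks out, JSON status to finish) can be sketched with the third-party `websockets` package. The client ID `demo-client` and the `is_final`/`stream_tts` helpers are assumptions for illustration:

```python
import asyncio
import json

def is_final(message) -> bool:
    # Binary frames carry audio; a text frame carries the JSON
    # status, e.g. {"status": "complete", "ttfb_ms": 150}.
    if isinstance(message, (bytes, bytearray)):
        return False
    return json.loads(message).get("status") == "complete"

async def stream_tts(text: str, voice: str = "en-US-AriaNeural",
                     uri: str = "ws://localhost:8000/api/v1/ws/tts/demo-client") -> bytes:
    import websockets  # third-party: pip install websockets
    audio = b""
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({"text": text, "voice": voice,
                                  "rate": "+0%", "pitch": "+0Hz"}))
        async for message in ws:
            if isinstance(message, (bytes, bytearray)):
                audio += bytes(message)   # binary audio chunk
            elif is_final(message):
                break                     # server signalled completion
    return audio

if __name__ == "__main__":
    data = asyncio.run(stream_tts("Hello from the WebSocket API."))
    open("ws_output.mp3", "wb").write(data)
```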


Speech-to-Text

GET /api/v1/stt/languages

Get list of supported languages.

Response:

{
  "languages": [
    {
      "code": "en-US",
      "name": "English (US)",
      "native_name": "English",
      "flag": "🇺🇸",
      "stt_supported": true,
      "tts_supported": true
    }
  ],
  "total": 10
}

POST /api/v1/stt/upload

Transcribe an uploaded audio file.

Request:

  • Content-Type: multipart/form-data

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file (WAV, MP3, M4A, FLAC, OGG) |
| language | string | No | Language code (default: en-US) |
| enable_punctuation | boolean | No | Add punctuation (default: true) |
| enable_word_timestamps | boolean | No | Include word timing (default: true) |
| enable_diarization | boolean | No | Speaker detection (default: false) |
| speaker_count | integer | No | Expected speakers (2-10) |

Response:

{
  "id": 1,
  "text": "Hello, world. This is a test transcription.",
  "segments": [
    {
      "text": "Hello, world.",
      "start_time": 0.0,
      "end_time": 1.5,
      "speaker": null,
      "confidence": 0.95
    }
  ],
  "words": [
    {
      "word": "Hello",
      "start_time": 0.0,
      "end_time": 0.5,
      "confidence": 0.98
    }
  ],
  "language": "en-US",
  "confidence": 0.95,
  "duration": 3.5,
  "word_count": 7,
  "processing_time": 1.23
}
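
The per-word timestamps in the response can drive features like karaoke-style highlighting or clickable transcripts. A small sketch (the `word_timeline` helper is illustrative, not part of the API):

```python
import requests

def word_timeline(words: list) -> str:
    # Render word-level timing as one line per word,
    # e.g. "0.00-0.50  Hello (0.98)".
    return "\n".join(
        f"{w['start_time']:.2f}-{w['end_time']:.2f}  {w['word']} ({w['confidence']:.2f})"
        for w in words
    )

if __name__ == "__main__":
    with open("audio.wav", "rb") as f:
        resp = requests.post(
            "http://localhost:8000/api/v1/stt/upload",
            files={"file": f},
            data={"language": "en-US", "enable_word_timestamps": "true"},
        )
    print(word_timeline(resp.json()["words"]))
```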

POST /api/v1/stt/upload/quality

High-quality transcription mode (optimized for accuracy).

Parameters (form-data):

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file |
| language | string | No | Language code (default: en-US) |
| preprocess | boolean | No | Apply noise reduction (default: false) |

Features:

  • beam_size=5 for more accurate decoding (~40% fewer errors)
  • condition_on_previous_text=False to reduce hallucinations
  • Optional audio preprocessing for noisy environments

Response: Same as /upload


POST /api/v1/stt/upload/batch

Batch transcription for high throughput (2-3x speedup).

Parameters (form-data):

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| files | file[] | Yes | Multiple audio files |
| language | string | No | Language code (default: en-US) |
| batch_size | integer | No | Batch size (default: 8) |

Response:

{
  "count": 3,
  "results": [
    {"filename": "audio1.mp3", "text": "...", "processing_time": 2.1},
    {"filename": "audio2.mp3", "text": "...", "processing_time": 1.8}
  ],
  "mode": "batched",
  "batch_size": 8
}
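
Posting several files under one `files` field has non-obvious syntax in `requests`: pass a list of `("files", (name, handle, content_type))` tuples. A sketch (the `batch_parts` helper is illustrative):

```python
import requests

def batch_parts(named_files: list) -> list:
    # requests sends a repeated "files" form field when given a
    # list of (field, part) tuples.
    return [("files", (name, fh, "audio/mpeg")) for name, fh in named_files]

if __name__ == "__main__":
    names = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
    handles = [(n, open(n, "rb")) for n in names]
    resp = requests.post(
        "http://localhost:8000/api/v1/stt/upload/batch",
        files=batch_parts(handles),
        data={"language": "en-US", "batch_size": "8"},
    )
    for r in resp.json()["results"]:
        print(r["filename"], "->", r["text"][:60])
```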

#### POST /api/v1/stt/upload/diarize

Perform Speaker Diarization ("Who said what") on an audio file.

**Requirements:**
- `HF_TOKEN` must be set in `.env` (Hugging Face Token for pyannote model access)
- Uses `faster-whisper` for transcription + `pyannote.audio` for speaker identification

**Parameters (form-data):**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file |
| num_speakers | integer | No | Exact number of speakers (optional) |
| min_speakers | integer | No | Min expected speakers (optional) |
| max_speakers | integer | No | Max expected speakers (optional) |
| language | string | No | Language code, e.g. 'en' (auto-detected if not provided) |

**Response:**
```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 0.84,
      "text": "Hello Test",
      "speaker": "SPEAKER_00"
    }
  ],
  "speaker_stats": {
    "SPEAKER_00": 0.84
  },
  "language": "en",
  "status": "success"
}
```

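The diarized segments can be folded into a readable "who said what" transcript by merging consecutive segments from the same speaker. A sketch (the `speaker_turns` helper is illustrative, not part of the API):

```python
def speaker_turns(segments: list) -> list:
    # Merge consecutive segments from the same speaker into turns.
    turns = []
    for seg in segments:
        if turns and turns[-1][0] == seg["speaker"]:
            turns[-1] = (seg["speaker"], turns[-1][1] + " " + seg["text"].strip())
        else:
            turns.append((seg["speaker"], seg["text"].strip()))
    return turns

if __name__ == "__main__":
    import requests
    with open("meeting.wav", "rb") as f:
        resp = requests.post(
            "http://localhost:8000/api/v1/stt/upload/diarize",
            files={"file": f},
            data={"min_speakers": "2", "max_speakers": "4"},
        )
    for speaker, text in speaker_turns(resp.json()["segments"]):
        print(f"{speaker}: {text}")
```
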
Transcripts & Analysis

GET /api/v1/transcripts

List all past transcriptions.

Response:

[
  {
    "id": 1,
    "text": "Hello world...",
    "created_at": "2024-01-01T12:00:00",
    "word_count": 150,
    "language": "en-US"
  }
]

POST /api/v1/transcripts/{id}/analyze

Run NLP analysis (Sentiment, Keywords, Summary) on a transcript.

Response:

{
  "status": "success",
  "analysis": {
    "sentiment": {"polarity": 0.5, "subjectivity": 0.1},
    "keywords": ["artificial intelligence", "voice", "app"],
    "summary": "This is a summary of the transcript."
  }
}

GET /api/v1/transcripts/{id}/export

Download transcript in a specific format.

Query Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| format | string | Yes | txt, srt, vtt, pdf |

Response:

  • File download (text/plain, text/vtt, application/pdf)
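
Downloading an export boils down to a GET with a `format` query parameter and writing the raw bytes to disk. A sketch (the `export_path` helper and local filenames are illustrative):

```python
import requests

EXPORT_EXTENSIONS = {"txt": ".txt", "srt": ".srt", "vtt": ".vtt", "pdf": ".pdf"}

def export_path(transcript_id: int, fmt: str) -> str:
    # Local filename for a downloaded transcript; fmt must be one
    # of the formats the endpoint accepts.
    if fmt not in EXPORT_EXTENSIONS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"transcript_{transcript_id}{EXPORT_EXTENSIONS[fmt]}"

if __name__ == "__main__":
    resp = requests.get(
        "http://localhost:8000/api/v1/transcripts/1/export",
        params={"format": "srt"},
    )
    with open(export_path(1, "srt"), "wb") as f:
        f.write(resp.content)
```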

Text-to-Speech

GET /api/v1/tts/voices

Get all available voices.

Query Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| language | string | Filter by language code |

Response:

{
  "voices": [
    {
      "name": "en-US-Wavenet-D",
      "language_code": "en-US",
      "language_name": "English (US)",
      "ssml_gender": "MALE",
      "natural_sample_rate": 24000,
      "voice_type": "WaveNet",
      "display_name": "D (Male, WaveNet)",
      "flag": "🇺🇸"
    }
  ],
  "total": 50,
  "language_filter": null
}

GET /api/v1/tts/voices/{language}

Get voices for a specific language.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| language | path | Language code (e.g., en-US) |

POST /api/v1/tts/synthesize

Convert text to speech.

Request:

{
  "text": "Hello, this is a test.",
  "language": "en-US",
  "voice": "en-US-Wavenet-D",
  "audio_encoding": "MP3",
  "speaking_rate": 1.0,
  "pitch": 0.0,
  "volume_gain_db": 0.0,
  "use_ssml": false
}

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| text | string | Yes | Text to synthesize (max 5000 chars) |
| language | string | No | Language code (default: en-US) |
| voice | string | No | Voice name |
| audio_encoding | string | No | MP3, LINEAR16, OGG_OPUS |
| speaking_rate | float | No | Speed 0.25-4.0 (default: 1.0) |
| pitch | float | No | Pitch -20 to 20 (default: 0.0) |
| volume_gain_db | float | No | Volume -96 to 16 dB |
| use_ssml | boolean | No | Treat as SSML markup |

Response:

{
  "audio_content": "<base64 encoded audio>",
  "audio_size": 12345,
  "duration_estimate": 2.5,
  "voice_used": "en-US-Wavenet-D",
  "language": "en-US",
  "encoding": "MP3",
  "sample_rate": 24000,
  "processing_time": 0.45
}

POST /api/v1/tts/synthesize/audio

Synthesize and return audio file directly.

Same request as /synthesize, but returns the audio file as a download.

POST /api/v1/tts/stream

Stream synthesized audio for immediate playback.

Request: Same as /synthesize.

Response: Chunked audio stream (audio/mpeg). Ideal for long text: playback can start before synthesis finishes, reducing time to first byte (TTFB).
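
With `requests`, the streamed response is consumed chunk by chunk via `stream=True` and `iter_content`, so bytes hit disk (or a player) as they arrive. The `tts_payload` and `stream_to_file` helpers below are illustrative names:

```python
import requests

def tts_payload(text: str, language: str = "en-US", **options) -> dict:
    # Same JSON body as /synthesize; only explicitly passed
    # options are included, leaving the rest to server defaults.
    return {"text": text, "language": language, **options}

def stream_to_file(payload: dict, out_path: str,
                   url: str = "http://localhost:8000/api/v1/tts/stream") -> int:
    # stream=True lets writing start as soon as the first chunk
    # arrives instead of waiting for full synthesis.
    total = 0
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=4096):
                f.write(chunk)
                total += len(chunk)
    return total

if __name__ == "__main__":
    n = stream_to_file(tts_payload("A long passage of text..."),
                       "stream_output.mp3")
    print(f"wrote {n} bytes")
```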

POST /api/v1/tts/ssml

Synthesize audio using SSML for prosody control (rate, pitch, emphasis).

Request:

  • text: Text to speak
  • voice: Voice name (default: "en-US-AriaNeural")
  • rate: Speed (e.g., "fast", "-10%")
  • pitch: Pitch (e.g., "high", "+5Hz")
  • emphasis: "strong", "moderate", "reduced"
  • auto_breaks: true/false

Response: Audio file (audio/mpeg).
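
A sketch of calling the SSML endpoint with the prosody options listed above. The doc doesn't state whether these travel as form fields or query parameters, so sending them as form data here is an assumption, as is the `ssml_params` helper:

```python
import requests

def ssml_params(text: str, **prosody) -> dict:
    # Only include prosody options that were actually set; the
    # endpoint applies its own defaults otherwise.
    return {"text": text, **{k: v for k, v in prosody.items() if v is not None}}

if __name__ == "__main__":
    resp = requests.post(
        "http://localhost:8000/api/v1/tts/ssml",
        data=ssml_params("Welcome back!", voice="en-US-AriaNeural",
                         rate="-10%", pitch="+5Hz", emphasis="moderate"),
    )
    open("ssml_output.mp3", "wb").write(resp.content)
```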


Error Responses

All errors follow this format:

{
  "error": "error_type",
  "message": "Human readable message",
  "detail": "Additional details (debug mode only)"
}
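
Given that error shape, client code can decide between surfacing the message and retrying. One reasonable policy (a sketch; the retry set and helper names are choices made here, not mandated by the API):

```python
import requests

RETRYABLE = {429, 500}

def should_retry(status_code: int) -> bool:
    # Rate limits (429) and transient server errors (500) are worth
    # retrying; 4xx validation/auth errors are not.
    return status_code in RETRYABLE

def error_message(resp) -> str:
    # Errors carry {"error", "message", "detail"}; fall back to the
    # raw body if the payload is not JSON.
    try:
        body = resp.json()
        return f"{body['error']}: {body['message']}"
    except (ValueError, KeyError):
        return resp.text

if __name__ == "__main__":
    resp = requests.post("http://localhost:8000/api/v1/tts/synthesize",
                         json={"text": ""})
    if not resp.ok:
        print(error_message(resp), "| retry:", should_retry(resp.status_code))
```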

Common Error Codes

| Code | Type | Description |
|------|------|-------------|
| 400 | validation_error | Invalid request parameters |
| 401 | unauthorized | Missing or invalid auth token |
| 403 | forbidden | Insufficient permissions |
| 404 | not_found | Resource not found |
| 413 | file_too_large | Upload exceeds size limit |
| 429 | rate_limited | Too many requests |
| 500 | internal_error | Server error |

Rate Limits

| Tier | Limit |
|------|-------|
| Free | 60 requests/minute |
| Pro | 600 requests/minute |
| Enterprise | Custom |

Supported Audio Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| WAV | .wav | Best quality, no conversion |
| MP3 | .mp3 | Common, converted |
| M4A | .m4a | iOS format |
| FLAC | .flac | Lossless |
| OGG | .ogg | Open format |
| WebM | .webm | Browser recording |

Code Examples

Python

import base64

import requests

# Transcribe audio
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/v1/stt/upload",
        files={"file": f},
        data={"language": "en-US"}
    )
    print(response.json()["text"])

# Synthesize speech
response = requests.post(
    "http://localhost:8000/api/v1/tts/synthesize",
    json={"text": "Hello world", "language": "en-US"}
)

# Audio is returned base64-encoded; decode before writing.
audio = base64.b64decode(response.json()["audio_content"])
with open("output.mp3", "wb") as f:
    f.write(audio)

cURL

# Transcribe
curl -X POST http://localhost:8000/api/v1/stt/upload \
  -F "file=@audio.wav" \
  -F "language=en-US"

# Synthesize
curl -X POST http://localhost:8000/api/v1/tts/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "language": "en-US"}'