VoiceForge API Documentation

Base URL

  • Development: http://localhost:8000
  • Production: https://api.voiceforge.example.com

Authentication

Most endpoints require authentication via a JWT bearer token, passed in the Authorization header:

Authorization: Bearer <token>

Endpoints

Authentication

POST /api/v1/auth/register

Register a new user.

Request:

{
  "email": "user@example.com",
  "password": "strongpassword123",
  "name": "Jane Doe"
}

Response:

{
  "id": 1,
  "email": "user@example.com",
  "name": "Jane Doe",
  "created_at": "2024-01-01T10:00:00"
}

POST /api/v1/auth/login

Log in to obtain a JWT access token.

Request (Form Data):

Response:

{
  "access_token": "ey...",
  "token_type": "bearer"
}

GET /api/v1/auth/me

Get current user profile.
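
The register/login/me flow above can be sketched in Python. Note the doc doesn't list the login form fields, so the names `username` and `password` below are an assumption based on common OAuth2 password-flow conventions, and the `login`/`auth_headers` helpers are illustrative names, not part of the API:

```python
import requests

BASE = "http://localhost:8000"

def login(email: str, password: str) -> str:
    # /auth/login takes form data (not JSON); the field names
    # "username"/"password" are an assumption here.
    resp = requests.post(
        f"{BASE}/api/v1/auth/login",
        data={"username": email, "password": password},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def auth_headers(token: str) -> dict:
    # Bearer header required by most endpoints.
    return {"Authorization": f"Bearer {token}"}

if __name__ == "__main__":
    token = login("user@example.com", "strongpassword123")
    me = requests.get(f"{BASE}/api/v1/auth/me", headers=auth_headers(token))
    print(me.json())
```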



Health Check

GET /health

Check if the API is running.

Response:

{
  "status": "healthy",
  "service": "voiceforge-api",
  "version": "1.0.0"
}

GET /health/memory

Get current memory usage and loaded models.

Response:

{
  "memory_mb": 1523.4,
  "loaded_models": ["distil-small.en", "small"],
  "models_detail": {
    "distil-small.en": {"loaded": true, "idle_seconds": 45.2}
  }
}

POST /health/memory/cleanup

Unload idle models (inactive > 5 minutes) to free memory.

POST /health/memory/unload-all

Unload ALL models to free maximum memory (~1GB reduction).
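
A minimal monitoring sketch tying the memory endpoints together: poll /health/memory, and trigger /health/memory/cleanup when models have been idle past the 5-minute threshold. The `idle_models` helper is illustrative, not part of the API:

```python
import requests

BASE = "http://localhost:8000"

def idle_models(report: dict, threshold_s: float = 300.0) -> list:
    # Pick out models idle longer than the cleanup threshold
    # (the cleanup endpoint unloads models inactive > 5 minutes).
    return [
        name
        for name, info in report.get("models_detail", {}).items()
        if info.get("loaded") and info.get("idle_seconds", 0) > threshold_s
    ]

if __name__ == "__main__":
    report = requests.get(f"{BASE}/health/memory").json()
    print(f"{report['memory_mb']:.0f} MB in use, loaded: {report['loaded_models']}")
    if idle_models(report):
        requests.post(f"{BASE}/health/memory/cleanup")
```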


WebSocket Endpoints

WS /api/v1/ws/tts/{client_id}

Real-time TTS streaming via WebSocket (ultra-low latency).

Protocol:

  • Client sends: JSON {"text": "...", "voice": "...", "rate": "+0%", "pitch": "+0Hz"}
  • Server sends: Binary audio chunks followed by JSON {"status": "complete", "ttfb_ms": 150}

Expected TTFB: <500ms
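
The protocol above (JSON request in, binary chunks out, JSON status to finish) can be sketched with the third-party `websockets` package. The client ID `demo-client` and the `is_final`/`stream_tts` helpers are assumptions for illustration:

```python
import asyncio
import json

def is_final(message) -> bool:
    # Binary frames carry audio; a text frame carries the JSON
    # status, e.g. {"status": "complete", "ttfb_ms": 150}.
    if isinstance(message, (bytes, bytearray)):
        return False
    return json.loads(message).get("status") == "complete"

async def stream_tts(text: str, voice: str = "en-US-AriaNeural",
                     uri: str = "ws://localhost:8000/api/v1/ws/tts/demo-client") -> bytes:
    import websockets  # third-party: pip install websockets
    audio = b""
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({"text": text, "voice": voice,
                                  "rate": "+0%", "pitch": "+0Hz"}))
        async for message in ws:
            if isinstance(message, (bytes, bytearray)):
                audio += bytes(message)   # binary audio chunk
            elif is_final(message):
                break                     # server signalled completion
    return audio

if __name__ == "__main__":
    data = asyncio.run(stream_tts("Hello from the WebSocket API."))
    open("ws_output.mp3", "wb").write(data)
```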


Speech-to-Text

GET /api/v1/stt/languages

Get list of supported languages.

Response:

{
  "languages": [
    {
      "code": "en-US",
      "name": "English (US)",
      "native_name": "English",
      "flag": "🇺🇸",
      "stt_supported": true,
      "tts_supported": true
    }
  ],
  "total": 10
}

POST /api/v1/stt/upload

Transcribe an uploaded audio file.

Request:

  • Content-Type: multipart/form-data

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file (WAV, MP3, M4A, FLAC, OGG) |
| language | string | No | Language code (default: en-US) |
| enable_punctuation | boolean | No | Add punctuation (default: true) |
| enable_word_timestamps | boolean | No | Include word timing (default: true) |
| enable_diarization | boolean | No | Speaker detection (default: false) |
| speaker_count | integer | No | Expected speakers (2-10) |

Response:

{
  "id": 1,
  "text": "Hello, world. This is a test transcription.",
  "segments": [
    {
      "text": "Hello, world.",
      "start_time": 0.0,
      "end_time": 1.5,
      "speaker": null,
      "confidence": 0.95
    }
  ],
  "words": [
    {
      "word": "Hello",
      "start_time": 0.0,
      "end_time": 0.5,
      "confidence": 0.98
    }
  ],
  "language": "en-US",
  "confidence": 0.95,
  "duration": 3.5,
  "word_count": 7,
  "processing_time": 1.23
}
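
The per-word timestamps in the response can drive features like karaoke-style highlighting or clickable transcripts. A small sketch (the `word_timeline` helper is illustrative, not part of the API):

```python
import requests

def word_timeline(words: list) -> str:
    # Render word-level timing as one line per word,
    # e.g. "0.00-0.50  Hello (0.98)".
    return "\n".join(
        f"{w['start_time']:.2f}-{w['end_time']:.2f}  {w['word']} ({w['confidence']:.2f})"
        for w in words
    )

if __name__ == "__main__":
    with open("audio.wav", "rb") as f:
        resp = requests.post(
            "http://localhost:8000/api/v1/stt/upload",
            files={"file": f},
            data={"language": "en-US", "enable_word_timestamps": "true"},
        )
    print(word_timeline(resp.json()["words"]))
```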

POST /api/v1/stt/upload/quality

High-quality transcription mode (optimized for accuracy).

Parameters (form-data):

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file |
| language | string | No | Language code (default: en-US) |
| preprocess | boolean | No | Apply noise reduction (default: false) |

Features:

  • beam_size=5 for more accurate decoding (~40% fewer errors)
  • condition_on_previous_text=False to reduce hallucinations
  • Optional audio preprocessing for noisy environments

Response: Same as /upload


POST /api/v1/stt/upload/batch

Batch transcription for high throughput (2-3x speedup).

Parameters (form-data):

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| files | file[] | Yes | Multiple audio files |
| language | string | No | Language code (default: en-US) |
| batch_size | integer | No | Batch size (default: 8) |

Response:

{
  "count": 3,
  "results": [
    {"filename": "audio1.mp3", "text": "...", "processing_time": 2.1},
    {"filename": "audio2.mp3", "text": "...", "processing_time": 1.8}
  ],
  "mode": "batched",
  "batch_size": 8
}
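
Posting several files under one `files` field has non-obvious syntax in `requests`: pass a list of `("files", (name, handle, content_type))` tuples. A sketch (the `batch_parts` helper is illustrative):

```python
import requests

def batch_parts(named_files: list) -> list:
    # requests sends a repeated "files" form field when given a
    # list of (field, part) tuples.
    return [("files", (name, fh, "audio/mpeg")) for name, fh in named_files]

if __name__ == "__main__":
    names = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
    handles = [(n, open(n, "rb")) for n in names]
    resp = requests.post(
        "http://localhost:8000/api/v1/stt/upload/batch",
        files=batch_parts(handles),
        data={"language": "en-US", "batch_size": "8"},
    )
    for r in resp.json()["results"]:
        print(r["filename"], "->", r["text"][:60])
```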

#### POST /api/v1/stt/upload/diarize

Perform Speaker Diarization ("Who said what") on an audio file.

**Requirements:**
- `HF_TOKEN` must be set in `.env` (Hugging Face Token for pyannote model access)
- Uses `faster-whisper` for transcription + `pyannote.audio` for speaker identification

**Parameters (form-data):**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| file | file | Yes | Audio file |
| num_speakers | integer | No | Exact number of speakers (optional) |
| min_speakers | integer | No | Min expected speakers (optional) |
| max_speakers | integer | No | Max expected speakers (optional) |
| language | string | No | Language code, e.g. 'en' (auto-detected if not provided) |

**Response:**
```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 0.84,
      "text": "Hello Test",
      "speaker": "SPEAKER_00"
    }
  ],
  "speaker_stats": {
    "SPEAKER_00": 0.84
  },
  "language": "en",
  "status": "success"
}
```

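The diarized segments can be folded into a readable "who said what" transcript by merging consecutive segments from the same speaker. A sketch (the `speaker_turns` helper is illustrative, not part of the API):

```python
def speaker_turns(segments: list) -> list:
    # Merge consecutive segments from the same speaker into turns.
    turns = []
    for seg in segments:
        if turns and turns[-1][0] == seg["speaker"]:
            turns[-1] = (seg["speaker"], turns[-1][1] + " " + seg["text"].strip())
        else:
            turns.append((seg["speaker"], seg["text"].strip()))
    return turns

if __name__ == "__main__":
    import requests
    with open("meeting.wav", "rb") as f:
        resp = requests.post(
            "http://localhost:8000/api/v1/stt/upload/diarize",
            files={"file": f},
            data={"min_speakers": "2", "max_speakers": "4"},
        )
    for speaker, text in speaker_turns(resp.json()["segments"]):
        print(f"{speaker}: {text}")
```
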
Transcripts & Analysis

GET /api/v1/transcripts

List all past transcriptions.

Response:

[
  {
    "id": 1,
    "text": "Hello world...",
    "created_at": "2024-01-01T12:00:00",
    "word_count": 150,
    "language": "en-US"
  }
]

POST /api/v1/transcripts/{id}/analyze

Run NLP analysis (Sentiment, Keywords, Summary) on a transcript.

Response:

{
  "status": "success",
  "analysis": {
    "sentiment": {"polarity": 0.5, "subjectivity": 0.1},
    "keywords": ["artificial intelligence", "voice", "app"],
    "summary": "This is a summary of the transcript."
  }
}

GET /api/v1/transcripts/{id}/export

Download transcript in a specific format.

Query Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| format | string | Yes | txt, srt, vtt, pdf |

Response:

  • File download (text/plain, text/vtt, application/pdf)
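
Downloading an export boils down to a GET with a `format` query parameter and writing the raw bytes to disk. A sketch (the `export_path` helper and local filenames are illustrative):

```python
import requests

EXPORT_EXTENSIONS = {"txt": ".txt", "srt": ".srt", "vtt": ".vtt", "pdf": ".pdf"}

def export_path(transcript_id: int, fmt: str) -> str:
    # Local filename for a downloaded transcript; fmt must be one
    # of the formats the endpoint accepts.
    if fmt not in EXPORT_EXTENSIONS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"transcript_{transcript_id}{EXPORT_EXTENSIONS[fmt]}"

if __name__ == "__main__":
    resp = requests.get(
        "http://localhost:8000/api/v1/transcripts/1/export",
        params={"format": "srt"},
    )
    with open(export_path(1, "srt"), "wb") as f:
        f.write(resp.content)
```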

Text-to-Speech

GET /api/v1/tts/voices

Get all available voices.

Query Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| language | string | Filter by language code |

Response:

{
  "voices": [
    {
      "name": "en-US-Wavenet-D",
      "language_code": "en-US",
      "language_name": "English (US)",
      "ssml_gender": "MALE",
      "natural_sample_rate": 24000,
      "voice_type": "WaveNet",
      "display_name": "D (Male, WaveNet)",
      "flag": "🇺🇸"
    }
  ],
  "total": 50,
  "language_filter": null
}

GET /api/v1/tts/voices/{language}

Get voices for a specific language.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| language | path | Language code (e.g., en-US) |

POST /api/v1/tts/synthesize

Convert text to speech.

Request:

{
  "text": "Hello, this is a test.",
  "language": "en-US",
  "voice": "en-US-Wavenet-D",
  "audio_encoding": "MP3",
  "speaking_rate": 1.0,
  "pitch": 0.0,
  "volume_gain_db": 0.0,
  "use_ssml": false
}

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| text | string | Yes | Text to synthesize (max 5000 chars) |
| language | string | No | Language code (default: en-US) |
| voice | string | No | Voice name |
| audio_encoding | string | No | MP3, LINEAR16, OGG_OPUS |
| speaking_rate | float | No | Speed 0.25-4.0 (default: 1.0) |
| pitch | float | No | Pitch -20 to 20 (default: 0.0) |
| volume_gain_db | float | No | Volume -96 to 16 dB |
| use_ssml | boolean | No | Treat as SSML markup |

Response:

{
  "audio_content": "<base64 encoded audio>",
  "audio_size": 12345,
  "duration_estimate": 2.5,
  "voice_used": "en-US-Wavenet-D",
  "language": "en-US",
  "encoding": "MP3",
  "sample_rate": 24000,
  "processing_time": 0.45
}

POST /api/v1/tts/synthesize/audio

Synthesize and return audio file directly.

Same request as /synthesize, but returns the audio file as a download.

POST /api/v1/tts/stream

Stream synthesized audio for immediate playback.

Request: Same as /synthesize.

Response: Chunked audio stream (audio/mpeg). Ideal for long text: playback can start before synthesis finishes, reducing time to first byte (TTFB).
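
With `requests`, the streamed response is consumed chunk by chunk via `stream=True` and `iter_content`, so bytes hit disk (or a player) as they arrive. The `tts_payload` and `stream_to_file` helpers below are illustrative names:

```python
import requests

def tts_payload(text: str, language: str = "en-US", **options) -> dict:
    # Same JSON body as /synthesize; only explicitly passed
    # options are included, leaving the rest to server defaults.
    return {"text": text, "language": language, **options}

def stream_to_file(payload: dict, out_path: str,
                   url: str = "http://localhost:8000/api/v1/tts/stream") -> int:
    # stream=True lets writing start as soon as the first chunk
    # arrives instead of waiting for full synthesis.
    total = 0
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=4096):
                f.write(chunk)
                total += len(chunk)
    return total

if __name__ == "__main__":
    n = stream_to_file(tts_payload("A long passage of text..."),
                       "stream_output.mp3")
    print(f"wrote {n} bytes")
```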

POST /api/v1/tts/ssml

Synthesize audio using SSML for prosody control (rate, pitch, emphasis).

Request:

  • text: Text to speak
  • voice: Voice name (default: "en-US-AriaNeural")
  • rate: Speed (e.g., "fast", "-10%")
  • pitch: Pitch (e.g., "high", "+5Hz")
  • emphasis: "strong", "moderate", "reduced"
  • auto_breaks: true/false

Response: Audio file (audio/mpeg).
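
A sketch of calling the SSML endpoint with the prosody options listed above. The doc doesn't state whether these travel as form fields or query parameters, so sending them as form data here is an assumption, as is the `ssml_params` helper:

```python
import requests

def ssml_params(text: str, **prosody) -> dict:
    # Only include prosody options that were actually set; the
    # endpoint applies its own defaults otherwise.
    return {"text": text, **{k: v for k, v in prosody.items() if v is not None}}

if __name__ == "__main__":
    resp = requests.post(
        "http://localhost:8000/api/v1/tts/ssml",
        data=ssml_params("Welcome back!", voice="en-US-AriaNeural",
                         rate="-10%", pitch="+5Hz", emphasis="moderate"),
    )
    open("ssml_output.mp3", "wb").write(resp.content)
```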


Error Responses

All errors follow this format:

{
  "error": "error_type",
  "message": "Human readable message",
  "detail": "Additional details (debug mode only)"
}
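
Given that error shape, client code can decide between surfacing the message and retrying. One reasonable policy (a sketch; the retry set and helper names are choices made here, not mandated by the API):

```python
import requests

RETRYABLE = {429, 500}

def should_retry(status_code: int) -> bool:
    # Rate limits (429) and transient server errors (500) are worth
    # retrying; 4xx validation/auth errors are not.
    return status_code in RETRYABLE

def error_message(resp) -> str:
    # Errors carry {"error", "message", "detail"}; fall back to the
    # raw body if the payload is not JSON.
    try:
        body = resp.json()
        return f"{body['error']}: {body['message']}"
    except (ValueError, KeyError):
        return resp.text

if __name__ == "__main__":
    resp = requests.post("http://localhost:8000/api/v1/tts/synthesize",
                         json={"text": ""})
    if not resp.ok:
        print(error_message(resp), "| retry:", should_retry(resp.status_code))
```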

Common Error Codes

| Code | Type | Description |
|------|------|-------------|
| 400 | validation_error | Invalid request parameters |
| 401 | unauthorized | Missing or invalid auth token |
| 403 | forbidden | Insufficient permissions |
| 404 | not_found | Resource not found |
| 413 | file_too_large | Upload exceeds size limit |
| 429 | rate_limited | Too many requests |
| 500 | internal_error | Server error |

Rate Limits

| Tier | Limit |
|------|-------|
| Free | 60 requests/minute |
| Pro | 600 requests/minute |
| Enterprise | Custom |

Supported Audio Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| WAV | .wav | Best quality, no conversion |
| MP3 | .mp3 | Common, converted |
| M4A | .m4a | iOS format |
| FLAC | .flac | Lossless |
| OGG | .ogg | Open format |
| WebM | .webm | Browser recording |

Code Examples

Python

import base64

import requests

# Transcribe audio
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/v1/stt/upload",
        files={"file": f},
        data={"language": "en-US"}
    )
    print(response.json()["text"])

# Synthesize speech
response = requests.post(
    "http://localhost:8000/api/v1/tts/synthesize",
    json={"text": "Hello world", "language": "en-US"}
)

# Audio is returned base64-encoded; decode before writing.
audio = base64.b64decode(response.json()["audio_content"])
with open("output.mp3", "wb") as f:
    f.write(audio)

cURL

# Transcribe
curl -X POST http://localhost:8000/api/v1/stt/upload \
  -F "file=@audio.wav" \
  -F "language=en-US"

# Synthesize
curl -X POST http://localhost:8000/api/v1/tts/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "language": "en-US"}'