
๐Ÿ—๏ธ VoiceAPI System Architecture

## High-Level System Diagram

```mermaid
flowchart TB
    subgraph Client["Client Applications"]
        Web["Web App"]
        Mobile["Mobile App"]
        Healthcare["Healthcare Assistant"]
    end

    subgraph API["FastAPI Server (Port 7860)"]
        Endpoint["/Get_Inference API"]
        LangRouter["Language Router"]
    end

    subgraph Engine["TTS Engine"]
        Normalizer["Text Normalizer"]
        Tokenizer["Tokenizer"]
        StyleProc["Style Processor"]

        subgraph Models["Model Types"]
            VITS["VITS JIT Models\n(.pt files)"]
            Coqui["Coqui TTS\n(.pth files)"]
            MMS["Facebook MMS\n(HuggingFace)"]
        end
    end

    subgraph Languages["11 Languages"]
        Hindi["Hindi"]
        Bengali["Bengali"]
        Marathi["Marathi"]
        Telugu["Telugu"]
        Kannada["Kannada"]
        Gujarati["Gujarati"]
        Bhojpuri["Bhojpuri"]
        Others["+ 4 more"]
    end

    subgraph Output["Audio Output"]
        WAV["WAV File\n22050 Hz"]
    end

    Client -->|HTTP GET/POST| Endpoint
    Endpoint -->|text, lang| LangRouter
    LangRouter --> Normalizer
    Normalizer --> Tokenizer
    Tokenizer --> Models
    VITS --> StyleProc
    Coqui --> StyleProc
    MMS --> StyleProc
    StyleProc --> WAV
    WAV -->|Response| Client

    Models --> Languages
```
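The HTTP GET path shown above can be exercised with a short client. The endpoint name and the `text`/`lang` parameters come from the diagram; the base URL below is a placeholder for the deployed server, not a confirmed address.

```python
from urllib.parse import urlencode

# Placeholder host; substitute the deployed Space URL.
BASE_URL = "http://localhost:7860"

def build_inference_url(text: str, lang: str) -> str:
    """Build a /Get_Inference request URL with percent-encoded parameters."""
    return f"{BASE_URL}/Get_Inference?{urlencode({'text': text, 'lang': lang})}"

# The response body is the WAV file itself, e.g.:
#   import urllib.request
#   audio = urllib.request.urlopen(build_inference_url("नमस्ते", "hindi")).read()
```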

## Data Flow Diagram

```mermaid
sequenceDiagram
    participant C as Client
    participant A as API Server
    participant E as TTS Engine
    participant M as Model
    participant S as Style Processor

    C->>A: GET /Get_Inference?text=नमस्ते&lang=hindi
    A->>A: Parse parameters
    A->>E: synthesize(text, voice)
    E->>E: Normalize text
    E->>E: Tokenize to IDs
    E->>M: Load model (if not cached)
    M->>M: Forward pass (inference)
    M-->>E: Raw audio tensor
    E->>S: Apply style (pitch, speed, energy)
    S-->>E: Processed audio
    E-->>A: TTSOutput (audio, sample_rate)
    A->>A: Convert to WAV bytes
    A-->>C: audio/wav response
```
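The "Apply style" step adjusts pitch, speed, and energy on the raw audio. As a hedged illustration only (not the project's actual DSP), energy can be modeled as a gain and speed as naive linear-interpolation resampling:

```python
def apply_energy(samples, gain):
    """Scale amplitude; a crude stand-in for the engine's energy control."""
    return [s * gain for s in samples]

def change_speed(samples, factor):
    """Naive speed change by linear-interpolation resampling.

    factor > 1 shortens the audio (faster speech). Real TTS engines use
    phase-aware methods to avoid pitch shift; this is illustrative only.
    """
    n = max(1, int(len(samples) / factor))
    out = []
    for i in range(n):
        pos = i * factor
        j = int(pos)
        frac = pos - j
        a = samples[min(j, len(samples) - 1)]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a + (b - a) * frac)
    return out
```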

## Model Architecture

```mermaid
flowchart LR
    subgraph Input["Input"]
        Text["Text Input"]
    end

    subgraph TextEncoder["Text Encoder"]
        Embed["Character Embedding"]
        TransEnc["Transformer Encoder\n(6 layers, 192 hidden)"]
    end

    subgraph FlowModel["Flow Model"]
        Prior["Prior Encoder"]
        Flow["Normalizing Flow"]
        Duration["Duration Predictor"]
    end

    subgraph Decoder["HiFi-GAN Decoder"]
        Upsample["Upsampling Layers"]
        ResBlocks["Residual Blocks"]
        Output["Audio Waveform"]
    end

    Text --> Embed --> TransEnc
    TransEnc --> Prior
    TransEnc --> Duration
    Prior --> Flow
    Duration --> Flow
    Flow --> Upsample --> ResBlocks --> Output
```
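The diagram matches VITS's conditional-VAE formulation. As a hedged summary of the published VITS objective (not extracted from this repo), training maximizes the evidence lower bound

$$\log p_\theta(x \mid c) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid c)\big),$$

where $c$ is the text condition from the encoder, the normalizing flow makes the prior $p_\theta(z \mid c)$ more expressive, the HiFi-GAN decoder realizes $p_\theta(x \mid z)$, and the duration predictor supplies the alignment between text tokens and latent frames.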

## Training Pipeline

```mermaid
flowchart TD
    subgraph Data["Training Data"]
        OpenSLR["OpenSLR Datasets"]
        CommonVoice["Mozilla Common Voice"]
        IndicTTS["IndicTTS Corpus"]
        AI4Bharat["AI4Bharat Indic-Voices"]
    end

    subgraph Prep["Data Preparation"]
        Download["Download Audio"]
        Normalize["Normalize to 22050 Hz"]
        Transcript["Generate Transcripts"]
        Split["Train/Val Split"]
    end

    subgraph Train["Training"]
        Config["Load Config YAML"]
        VITS_Train["VITS Training\n(1000 epochs)"]
        Checkpoint["Save Checkpoints"]
    end

    subgraph Export["Export"]
        JIT["JIT Trace Model"]
        Chars["Generate chars.txt"]
        Package["Package for Inference"]
    end

    Data --> Download --> Normalize --> Transcript --> Split
    Split --> Config --> VITS_Train --> Checkpoint
    Checkpoint --> JIT --> Chars --> Package
```
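The exported `chars.txt` drives inference-time tokenization. A minimal sketch, assuming the file lists one symbol per line with the line index as the token ID (the real `src/tokenizer.py` may differ):

```python
def build_char_map(symbols):
    """Map each symbol to its index, mirroring the chars.txt line order."""
    return {ch: i for i, ch in enumerate(symbols)}

def tokenize(text, char_map):
    """Text to token IDs; unknown characters are dropped (an assumption)."""
    return [char_map[c] for c in text if c in char_map]

# Loading from the exported file:
#   with open("chars.txt", encoding="utf-8") as f:
#       char_map = build_char_map(line.rstrip("\n") for line in f)
```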

## Deployment Architecture

```mermaid
flowchart TB
    subgraph HF["HuggingFace Infrastructure"]
        subgraph Space["HF Space (Docker)"]
            Docker["Docker Container"]
            FastAPI["FastAPI Server\n:7860"]
            Models_Dir["models/ directory"]
        end

        subgraph ModelRepo["Model Repository"]
            ModelFiles["Harshil748/VoiceAPI-Models\n(~8GB)"]
        end
    end

    subgraph External["External Services"]
        MMS_HF["facebook/mms-tts-guj\n(Gujarati)"]
    end

    User["User"] -->|HTTPS| FastAPI
    Docker -->|Build time| ModelFiles
    FastAPI -->|Runtime| MMS_HF
    Models_Dir -.->|Loaded from| ModelFiles
```
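A Docker-SDK Space boots from a Dockerfile in the repo root. The sketch below is consistent with the diagram but is an assumption throughout: the requirements file, the `download_models.py` helper, and the `src.api:app` module path are illustrative names, not confirmed from the repo.

```dockerfile
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
# Build time: fetch model files from Harshil748/VoiceAPI-Models into models/
# (hypothetical helper script; the actual build step may differ).
RUN python download_models.py

# HF Spaces route traffic to the port declared in the Space config (7860 here).
EXPOSE 7860
CMD ["uvicorn", "src.api:app", "--host", "0.0.0.0", "--port", "7860"]
```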

## Voice Configuration Map

```mermaid
mindmap
  root((VoiceAPI))
    Hindi
      hi_male
      hi_female
    Bengali
      bn_male
      bn_female
    Marathi
      mr_male
      mr_female
    Telugu
      te_male
      te_female
    Kannada
      kn_male
      kn_female
    Gujarati
      gu_mms
    Bhojpuri
      bho_male
      bho_female
    Chhattisgarhi
      hne_male
      hne_female
    Maithili
      mai_male
      mai_female
    Magahi
      mag_male
      mag_female
    English
      en_male
      en_female
```
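The mindmap flattens to a small lookup table. The data below mirrors it exactly; the dictionary shape and `voices_for` helper are illustrative, since the actual layout of `src/config.py` is not shown here.

```python
VOICES = {
    "hindi": ["hi_male", "hi_female"],
    "bengali": ["bn_male", "bn_female"],
    "marathi": ["mr_male", "mr_female"],
    "telugu": ["te_male", "te_female"],
    "kannada": ["kn_male", "kn_female"],
    "gujarati": ["gu_mms"],
    "bhojpuri": ["bho_male", "bho_female"],
    "chhattisgarhi": ["hne_male", "hne_female"],
    "maithili": ["mai_male", "mai_female"],
    "magahi": ["mag_male", "mag_female"],
    "english": ["en_male", "en_female"],
}

def voices_for(lang: str):
    """Return the voice IDs registered for a language (empty list if unknown)."""
    return VOICES.get(lang.lower(), [])
```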

## Component Interaction

| Component | File | Purpose |
| --- | --- | --- |
| API Server | `src/api.py` | FastAPI REST endpoints |
| TTS Engine | `src/engine.py` | Model loading & inference |
| Tokenizer | `src/tokenizer.py` | Text → token IDs |
| Config | `src/config.py` | Language & model configs |
| Model Loader | `src/model_loader.py` | Model file management |

## Performance Characteristics

| Metric | Value |
| --- | --- |
| Inference time | ~200-500 ms per sentence |
| Model load time | ~2-5 s per voice |
| Audio sample rate | 22050 Hz (16000 Hz for Gujarati) |
| Supported formats | WAV |
| Concurrent requests | Limited by memory |
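The WAV output named above corresponds to the "Convert to WAV bytes" step in the data flow: the engine's raw float audio is wrapped as mono 16-bit PCM before the response goes out. A standard-library sketch (the function name and clipping policy are illustrative, not the project's actual code):

```python
import io
import struct
import wave

def to_wav_bytes(samples, sample_rate=22050):
    """Pack float samples in [-1, 1] into mono 16-bit PCM WAV bytes."""
    pcm = b"".join(
        struct.pack("<h", max(-32768, min(32767, int(s * 32767))))
        for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)           # mono
        w.setsampwidth(2)           # 16-bit samples
        w.setframerate(sample_rate) # 22050 Hz default, per the table above
        w.writeframes(pcm)
    return buf.getvalue()
```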

Built for Voice Tech for All Hackathon