Nekodimos/ZPv2_amtts

Nekodimos/ZPv2_amtts is a lightweight, high-performance Amharic (አማርኛ) text-to-speech model. Unlike models fine-tuned from massive pre-trained checkpoints, it was trained from scratch, with custom modifications that prioritize inference speed, a lower computational footprint, and high-fidelity audio output.

By swapping in a 48 kHz decoder and optimizing the underlying architecture, ZPv2_amtts aims to deliver competitive, natural-sounding speech synthesis while remaining efficient enough to run on edge and other resource-constrained devices.


Key Features

  • Trained From Scratch: Initialized and trained entirely on custom Amharic data, so that the model's acoustic representations are fitted to the language itself rather than inherited from other languages.
  • Upgraded 48kHz Decoder: Swaps in a 48 kHz neural vocoder/decoder, allowing the model to synthesize higher-resolution audio than common 24 kHz setups.
  • Edge-Optimized & Lightweight: Designed with architectural modifications to reduce parameter size and latency. The model is lightweight enough to be deployed on consumer edge devices, mobile platforms, or localized servers.
  • Competitive Audio Quality: Despite its smaller size and faster inference times, the model maintains a high standard of intelligibility and natural cadence in Amharic.

Input Text (Amharic) for the Generated Speech samples:

Sample 1: "ሴቶችም ወንዶችም ህፃናት በእኩል ደረጃ ለአእምሮ እድገት መዛባት ኦቲዝም ሊጋለጡ እንደሚችሉ ጥናቶች ይጠቁማሉ።"
Sample 2: "ሴቶችም ወንዶችም ህፃናት በእኩል ደረጃ ለአእምሮ እድገት መዛባት ኦቲዝም ሊጋለጡ እንደሚችሉ ጥናቶች ይጠቁማሉ።"
Sample 3: "ሱዌዝ ካናል ቀይ ባህርን ከሜዲትራኒያን ጋር ያገናኛል። በእስያና በአውሮፓ መካከል የሚደረገውን ጉዞ ቢያንስ በአስር ቀናት ይቀንሳል። ይህ የባህር መተላለፊያ በመቶ የሚሆነውን የአለም የባህር ላይ ንግድ ያስተናግዳል። በመቶ የሚሆነውን የኮንቴይነር ጉዞ በመቶ የመኪና ጭነት እና በመቶ የድፍድፍ ነዳጅ በሱዌዝ ካናል ይተላለፋል።"

Specifications

  • Language: Amharic (አማርኛ)
  • Training Method: From scratch (no pre-trained model fine-tuning)
  • Output Sample Rate: 48,000 Hz (48 kHz)
  • Optimization Target: Low-latency, high-fidelity, edge device compatibility
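
As a rough illustration of what the 48 kHz output rate implies for buffers and storage, the arithmetic below compares it with a 24 kHz setup. The 16-bit mono PCM format is an assumption made for the example; the specifications above do not state the bit depth or channel count of the model's output.

```python
def pcm_size_bytes(sample_rate_hz: int, seconds: float,
                   bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size of uncompressed PCM audio (16-bit mono assumed by default)."""
    n_samples = int(round(sample_rate_hz * seconds))
    return n_samples * bytes_per_sample * channels


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency that can be represented at the given sample rate."""
    return sample_rate_hz / 2


# Compare ten seconds of audio at both rates.
for rate in (24_000, 48_000):
    print(rate, nyquist_hz(rate), pcm_size_bytes(rate, seconds=10.0))
```

Doubling the sample rate doubles both the representable bandwidth and the number of samples the decoder has to produce per second of speech, which is worth keeping in mind when budgeting memory on edge devices.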

Architecture Modifications

To achieve its lightweight profile, this model features several adjustments to its internal components:

  1. Decoder Swap: Replaced the default audio synthesis decoder with an optimized 48kHz unit to support high-fidelity playback.
  2. Layer/Parameter Optimization: Streamlined layer configurations to minimize computational overhead, reducing CPU and RAM/VRAM usage during generation.
  3. Optimized Tokenizer: Configured to work efficiently with the Ge'ez script, maximizing processing speed during the text-frontend pipeline.
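
The model's actual tokenizer is not published in this card. Purely as a sketch of what a Ge'ez-script text frontend might look like, the example below builds a character-level vocabulary over the Unicode Ethiopic block (U+1200 to U+137F). The function names, the character-level granularity, and the reserved unknown id are all assumptions, not the author's implementation.

```python
import unicodedata

# Main Unicode "Ethiopic" block. Supplement and Extended blocks are ignored
# here to keep the sketch short.
ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F


def is_ethiopic(ch: str) -> bool:
    return ETHIOPIC_START <= ord(ch) <= ETHIOPIC_END


def build_vocab(texts):
    """Hypothetical character-level vocabulary; id 0 is reserved for unknowns."""
    chars = sorted({
        ch
        for text in texts
        for ch in unicodedata.normalize("NFC", text)
        if is_ethiopic(ch)
    })
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, vocab):
    """Map each character to its id, falling back to 0 for anything unseen."""
    return [vocab.get(ch, 0) for ch in unicodedata.normalize("NFC", text)]


vocab = build_vocab(["አማርኛ"])
ids = encode("አማርኛ", vocab)
```

Because each Ethiopic character already encodes a consonant-vowel syllable, a character-level scheme like this keeps sequences short without a learned subword model, which is one plausible reason a script-specific tokenizer can be fast.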

Inference & Deployment

Because the architecture is modified, inference scripts must be configured for the custom decoder and its 48 kHz output.

Basic Usage Flow

Ensure your inference code accommodates the custom architecture and sample rate:

# Inference Script Yet to come...
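
Until the official script is available, the one detail this card does fix is the output sample rate. The sketch below shows only that part: writing a waveform to disk at 48 kHz with Python's standard library. It does not call the model; the sine tone is a placeholder standing in for decoder output, and the file name and 16-bit mono format are assumptions.

```python
import math
import struct
import wave

SAMPLE_RATE = 48_000  # output rate stated in the specifications


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] as 16-bit mono PCM."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(frames)


# Placeholder signal standing in for model output: 0.5 s of a 440 Hz tone.
tone = [
    0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 2)
]
write_wav("output_48k.wav", tone)
```

Writing the header with the wrong rate (for example 24,000 Hz) would make 48 kHz audio play back at half speed and an octave lower, so the rate is best kept in a single constant.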

Performance & Limitations

  • Resource Consumption: Significantly faster and less resource-heavy than larger, non-optimized TTS models. Well-suited for real-time applications on CPU-bound or low-VRAM environments.
  • Data-Specific Nuances: Because the model was trained from scratch, its vocabulary and pronunciation coverage are tightly bound to the training distribution. Text that mixes in a lot of foreign-language material may require pre-processing or transliteration.
  • High-Fidelity Requirements: To fully appreciate the 48kHz output, ensure that reference prompts (if using zero-shot cloning) are clean, high-resolution, and free of background noise.
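
One way to handle the mixed-language limitation above is to flag foreign words before synthesis. The sketch below is a simple heuristic that treats runs of Latin letters as foreign and optionally replaces them from a caller-supplied table. The function names and the example mapping are hypothetical; the card does not prescribe a pre-processing method.

```python
import re

# Runs of Latin letters, allowing inner apostrophes and hyphens.
LATIN_WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")


def find_foreign_words(text: str) -> list[str]:
    """Return Latin-script words that may need transliteration."""
    return LATIN_WORD.findall(text)


def transliterate(text: str, table: dict[str, str]) -> str:
    """Replace known foreign words using a caller-supplied mapping.

    Words missing from the table are left unchanged so they can be reviewed.
    """
    return LATIN_WORD.sub(lambda m: table.get(m.group(0), m.group(0)), text)


mixed = "Suez ካናል"
print(find_foreign_words(mixed))
print(transliterate(mixed, {"Suez": "ሱዌዝ"}))
```

This only covers Latin script; digits, symbols, and other scripts would need their own rules.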

Credits & Acknowledgments

  • Model Design & Training: Customized, modified, and trained from scratch by Nekodimos.