Nekodimos/ZPv2_amtts
Nekodimos/ZPv2_amtts is a lightweight, high-performance Amharic (አማርኛ) text-to-speech model. Unlike models fine-tuned from massive pre-trained checkpoints, it was trained from scratch, with custom modifications that prioritize inference speed, a low computational footprint, and high-fidelity audio output.
By swapping in a 48 kHz decoder and optimizing the underlying architecture, ZPv2_amtts aims to deliver competitive, natural-sounding speech synthesis while remaining efficient enough to run on edge and other resource-constrained devices.
Key Features
- Trained From Scratch: Trained entirely on custom Amharic data with no pre-trained initialization, so the model's acoustic representations are aligned to the nuances of the language and carry no biases inherited from other languages.
- Upgraded 48kHz Decoder: Replaces the default decoder with a 48 kHz neural vocoder/decoder, allowing the model to synthesize higher-resolution, crisper audio than typical 24 kHz setups.
- Edge-Optimized & Lightweight: Designed with architectural modifications that reduce parameter count and latency. The model is light enough to be deployed on consumer edge devices, mobile platforms, or local servers.
- Competitive Audio Quality: Despite its smaller size and faster inference times, the model maintains a high standard of intelligibility and natural cadence in Amharic.
| Input Text (Amharic) | Generated Speech |
|---|---|
| Sample 1: "แดแถแฝแ แแแถแฝแ แ แแแต แ แฅแฉแ แฐแจแ แแ แฅแแฎ แฅแตแแต แแแฃแต แฆแฒแแ แแแแก แฅแแฐแแฝแ แฅแแถแฝ แญแ แแแแข" | |
| Sample 2: "แดแถแฝแ แแแถแฝแ แ แแแต แ แฅแฉแ แฐแจแ แแ แฅแแฎ แฅแตแแต แแแฃแต แฆแฒแแ แแแแก แฅแแฐแแฝแ แฅแแถแฝ แญแ แแแแข" | |
| Sample 3: "แฑแแ แซแแ แแญ แฃแ แญแ แจแแฒแตแซแแซแ แแญ แซแแแแแข แ แฅแตแซแ แ แ แแฎแ แแซแจแ แจแแฐแจแแแ แแ แขแซแแต แ แ แตแญ แแแต แญแแแณแแข แญแ แจแฃแ แญ แแฐแแแแซ แ แแถ แจแแแแแ แจแ แแ แจแฃแ แญ แแญ แแแต แซแตแฐแแแณแแข แ แแถ แจแแแแแ แจแฎแแดแญแแญ แแ แ แแถ แจแแชแ แญแแต แฅแ แ แแถ แจแตแแตแ แแณแ แ แฑแแ แซแแ แญแฐแแแแแข" |
Specifications
- Language: Amharic (አማርኛ)
- Training Method: From scratch (no pre-trained model fine-tuning)
- Output Sample Rate: 48,000 Hz (48 kHz)
- Optimization Target: Low latency, high fidelity, and edge-device compatibility
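To put the 48 kHz figure in context, the short sketch below works out the representable bandwidth (the Nyquist limit, half the sample rate) and the raw data rate for 24 kHz and 48 kHz audio. The 16-bit mono PCM format is an assumption made for the arithmetic; the model card does not state the output bit depth or channel count.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent (half the rate)."""
    return sample_rate_hz / 2


def pcm_bytes_per_second(sample_rate_hz: int,
                         bits_per_sample: int = 16,  # assumed, not from the model card
                         channels: int = 1) -> int:  # assumed mono
    """Raw uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


for rate in (24_000, 48_000):
    print(f"{rate} Hz: bandwidth up to {nyquist_hz(rate):.0f} Hz, "
          f"{pcm_bytes_per_second(rate)} bytes/s as 16-bit mono PCM")
```

Doubling the sample rate doubles both the representable bandwidth and the raw data rate, which is worth keeping in mind when streaming audio from an edge device.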
Architecture Modifications
To achieve its lightweight profile, this model features several adjustments to its internal components:
- Decoder Swap: Replaced the default audio synthesis decoder with an optimized 48kHz unit to support high-fidelity playback.
- Layer/Parameter Optimization: Streamlined layer configurations to minimize computational overhead, reducing CPU and RAM/VRAM usage during generation.
- Optimized Tokenizer: Configured to work efficiently with the Ge'ez script, keeping the text front-end stage of the pipeline fast.
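The model card does not describe how the tokenizer is built. As an illustration only, the sketch below shows one simple way a Ge'ez-script front end could work: a character-level vocabulary drawn from the Unicode Ethiopic block (U+1200 to U+137F), with an unknown token for anything outside it. The names `VOCAB`, `encode`, and `decode` are hypothetical and are not part of the released model.

```python
import unicodedata

# Assumption: a character-level vocabulary over the Unicode Ethiopic block.
# Only assigned code points are kept (unassigned ones have no Unicode name).
ETHIOPIC = [chr(cp) for cp in range(0x1200, 0x1380)
            if unicodedata.name(chr(cp), "")]

PAD, UNK = "<pad>", "<unk>"
VOCAB = [PAD, UNK, " "] + ETHIOPIC
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text):
    """Map each character to an id; characters outside the vocabulary become UNK."""
    text = unicodedata.normalize("NFC", text)
    return [TOKEN_TO_ID.get(ch, TOKEN_TO_ID[UNK]) for ch in text]


def decode(ids):
    """Inverse of encode for in-vocabulary ids."""
    return "".join(VOCAB[i] for i in ids)


ids = encode("አማርኛ")
print(ids, decode(ids))
```

A lookup table like this runs in time linear in the input length, and the UNK id gives a cheap way to spot characters the model was never trained on.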
Inference & Deployment
Because the model uses a modified architecture, inference scripts must be configured for the custom decoder and its 48 kHz output.
Basic Usage Flow
Ensure your inference code accommodates the custom architecture and sample rate:
# Inference Script Yet to come...
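Until the official script is published, the one part that can be sketched safely is the output handling: whatever the model returns has to be written with a 48,000 Hz frame rate, or playback will be too slow or too fast. The sketch below uses only the Python standard library. `placeholder_waveform` is a hypothetical stand-in that produces a sine tone; it is not the model's API, and the assumption that the model returns mono float samples in [-1, 1] is mine.

```python
import math
import struct
import wave

SAMPLE_RATE = 48_000  # from the Specifications section


def placeholder_waveform(duration_s=0.5, freq_hz=440.0):
    """Hypothetical stand-in for model output: a sine tone as floats in [-1, 1]."""
    n = int(SAMPLE_RATE * duration_s)
    return [0.5 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]


def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples as 16-bit PCM at the given sample rate."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))  # clip, then scale
        for s in samples
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)         # mono (assumed)
        wf.setsampwidth(2)         # 2 bytes = 16-bit PCM (assumed)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)


if __name__ == "__main__":
    save_wav("output_48k.wav", placeholder_waveform())
```

Once the real inference call is available, its output would replace `placeholder_waveform()`; the frame rate passed to `setframerate` should stay at 48,000.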
Performance & Limitations
- Resource Consumption: Significantly faster and less resource-heavy than larger, non-optimized TTS models. Well suited to real-time applications in CPU-only or low-VRAM environments.
- Data-Specific Nuances: Because the model was trained from scratch, its vocabulary and pronunciation coverage are tightly bound to the training distribution. Text that mixes in a lot of foreign-language material may require pre-processing or transliteration.
- High-Fidelity Requirements: To get the full benefit of the 48 kHz output, ensure that reference prompts (if using zero-shot cloning) are clean, high-resolution, and free of background noise.
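One way to check the real-time claim on your own hardware is to measure the real-time factor (RTF): synthesis time divided by the duration of the audio produced. An RTF below 1 means the model generates audio faster than it plays back. The sketch below is a generic timing helper; `fake_synthesize` is a hypothetical stand-in, since the model's inference API has not been published, and the helper assumes the synthesis function returns a sequence of mono samples.

```python
import time


def real_time_factor(synthesize, text, sample_rate=48_000):
    """Return synthesis_time / audio_duration for one call to synthesize(text)."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text):
    """Hypothetical stand-in: returns 0.1 s of silence per input character."""
    return [0.0] * (4_800 * len(text))


if __name__ == "__main__":
    rtf = real_time_factor(fake_synthesize, "ሰላም")
    print(f"RTF = {rtf:.4f}")
```

For a meaningful number, average the RTF over several sentences of different lengths and discard the first call, which often includes one-time model loading or warm-up cost.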
Credits & Acknowledgments
- Model Design & Training: Customized, modified, and trained from scratch by Nekodimos.