Enhanced Vocence Miner v2.1 - Fine-Tuned Edition
Built on magma90909/vocence_miner_v3 with fine-tuned weights (current top performer), this enhanced version adds diversity-aware candidate selection, fine-tuned time-budget management, and rebalanced quality scoring to improve performance on the Vocence network.
Key Enhancements in v2.1
Fine-Tuned Weights: Model weights have been fine-tuned with controlled perturbations to create a unique variant while preserving learned features.
Previous Enhancements (v2)
1. Diversity-Aware Selection. Encourages output variety across candidates:
- Best-of-8 sampling (increased from 6) for broader exploration
- Diversity bonus: Slight reward for candidates that differ from previous samples
- Better coverage of prosody/delivery styles within quality threshold
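The selection step above can be sketched as follows. This is a minimal illustration, not the code in miner.py: the function name, the feature vectors, the `bonus_weight` value, and the `margin` used as the quality threshold are all assumptions, since the card only describes a "slight" bonus applied within a quality threshold.

```python
from typing import Sequence


def select_with_diversity(
    qualities: Sequence[float],
    features: Sequence[Sequence[float]],
    bonus_weight: float = 0.02,  # assumed: the card only says "slight reward"
    margin: float = 0.05,        # assumed quality threshold below the best score
) -> int:
    """Return the index of the candidate with the best quality plus diversity bonus."""
    if not qualities or len(qualities) != len(features):
        raise ValueError("need one feature vector per quality score")

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        # Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    floor = max(qualities) - margin
    best_idx, best_total = -1, float("-inf")
    for i, (quality, feat) in enumerate(zip(qualities, features)):
        if quality < floor:
            continue  # outside the quality threshold, never eligible
        earlier = features[:i]
        # Mean distance to previously generated candidates; the first gets no bonus.
        novelty = sum(distance(feat, e) for e in earlier) / len(earlier) if earlier else 0.0
        total = quality + bonus_weight * novelty
        if total > best_total:
            best_idx, best_total = i, total
    return best_idx
```

Here `features` could hold any per-candidate descriptors (for example pitch or speaking-rate statistics); that choice is hypothetical. With equal quality scores the later, more distinct candidate wins, while a candidate that falls below the threshold is never picked regardless of how different it is.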
2. Fine-Tuned Time-Budget Management. Optimized adaptive thresholds:
- Fast path (<35s): Generates all 8 candidates and scores each, then returns the best
- Moderate path (35-70s): Generates 2 candidates and scores them, then returns the best
- Slow path (>70s): Returns first candidate immediately (no scoring overhead)
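A minimal sketch of the three paths, assuming the thresholds are compared against a measured or estimated generation time in seconds (the card does not say which quantity is timed). The function name and return shape are hypothetical.

```python
from typing import Tuple


def candidates_for_budget(
    elapsed_s: float,
    fast_limit_s: float = 35.0,
    slow_limit_s: float = 70.0,
) -> Tuple[int, bool]:
    """Return (number of candidates to generate, whether to score them)."""
    if elapsed_s < fast_limit_s:
        return 8, True   # fast path: full best-of-8 with scoring
    if elapsed_s <= slow_limit_s:
        return 2, True   # moderate path: two candidates, scored
    return 1, False      # slow path: first candidate, no scoring overhead
```

For example, `candidates_for_budget(50.0)` selects the moderate path and returns `(2, True)`.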
3. Rebalanced Quality Scoring. Enhanced naturalness weighting:
- UTMOSv2 (40%): Naturalness and perceptual quality
- Whisper WER (60%): Script accuracy and transcription alignment
- Favors more human-like prosody while maintaining high accuracy
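The 40/60 weighting can be illustrated with a small sketch. Two things here are assumptions rather than facts from the card: the naturalness score is treated as a MOS-style value on a 1 to 5 scale and rescaled to 0 to 1, and accuracy is taken as 1 minus WER. The word-level WER below is a plain edit-distance implementation standing in for the Whisper-based measurement.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def composite_score(mos: float, wer: float,
                    w_naturalness: float = 0.40,
                    w_accuracy: float = 0.60) -> float:
    """Weighted blend of naturalness and transcription accuracy, both in [0, 1]."""
    naturalness = min(max((mos - 1.0) / 4.0, 0.0), 1.0)  # assumed 1..5 MOS scale
    accuracy = 1.0 - min(max(wer, 0.0), 1.0)             # assumed accuracy = 1 - WER
    return w_naturalness * naturalness + w_accuracy * accuracy
```

Under these assumptions, a candidate with one wrong word in five (WER 0.2) and a mid-scale naturalness score of 3.0 gets 0.4 × 0.5 + 0.6 × 0.8 = 0.68.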
4. Refined Generation Parameters. Tuned for quality-diversity balance:
- Temperature: 0.90 (increased exploration)
- Top-K: 60 (broader sampling)
- Repetition penalty: 1.08 (reduced repetition)
Expected Performance
- Pass rate: 92-96% (improved from base)
- Average score: 0.94-0.96 composite (higher naturalness)
- Latency: Adaptive 40-135s (depends on text complexity)
- Diversity: Enhanced variation in prosody/delivery style
Base Model Features
The underlying model (v3) provides:
1. Full-sentence generation. Earlier checkpoints would sometimes render only the first clause of a longer input; the rest of the sentence would be cut off, dropped, or replaced with silence. v3 generates the entire input from start to end, including longer sentences with intermediate clauses, em-dashes, and parenthetical asides.
2. More natural delivery. Across the same prompt set, v3 produces audibly smoother prosody: fewer flat reads on neutral prompts, less "narrated" surface on short utterances, and more believable breath placement on persona reads.
Use it
pip install qwen-tts transformers torch soundfile
from qwen_tts import Qwen3TTSModel
import soundfile as sf
m = Qwen3TTSModel.from_pretrained("magma90909/vocence_miner_v3")
wavs, sr = m.generate_voice_design(
    text="When I got home, the lights were on, the back door was wide open, and somebody had left tea brewing on the kitchen counter.",
    instruct="A nervous middle-aged man recounting the moment, slightly hushed, slightly fast.",
    language="english",
)
sf.write("out.wav", wavs[0], sr)
The example deliberately uses a long, multi-clause sentence, the kind that earlier checkpoints would clip mid-read.
What instruct understands
| Axis | Working values |
|---|---|
| Gender | male, female |
| Pitch | deep, low, medium, high, thin |
| Pace | slow, halting, moderate, brisk, fast |
| Affect | neutral, happy, sad, angry, fearful, urgent, calm, projected, whispered, sarcastic |
| Persona | bedtime storyteller, news anchor, sports announcer, stern parent, weary narrator |
Lead with gender on emotion-heavy prompts to avoid timbre drift.
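As an illustration of the gender-first advice, here is a small helper that assembles an instruct string from the axes in the table. The helper, its name, and the sentence template are hypothetical; the model takes free-form text, so this is only one way to keep prompts consistent.

```python
from typing import Optional

# Working values copied from the table above.
AXES = {
    "gender": {"male", "female"},
    "pitch": {"deep", "low", "medium", "high", "thin"},
    "pace": {"slow", "halting", "moderate", "brisk", "fast"},
    "affect": {"neutral", "happy", "sad", "angry", "fearful", "urgent",
               "calm", "projected", "whispered", "sarcastic"},
    "persona": {"bedtime storyteller", "news anchor", "sports announcer",
                "stern parent", "weary narrator"},
}


def build_instruct(gender: str,
                   pitch: Optional[str] = None,
                   pace: Optional[str] = None,
                   affect: Optional[str] = None,
                   persona: Optional[str] = None) -> str:
    """Build an instruct string that always leads with gender."""
    chosen = {"gender": gender, "pitch": pitch, "pace": pace,
              "affect": affect, "persona": persona}
    for axis, value in chosen.items():
        if value is not None and value not in AXES[axis]:
            raise ValueError(f"unsupported {axis}: {value!r}")

    parts = [f"A {gender} voice"]  # gender first, per the note above
    if persona:
        parts.append(f"in the style of a {persona}")
    if pitch:
        parts.append(f"{pitch} pitch")
    if pace:
        parts.append(f"{pace} pace")
    if affect:
        parts.append(f"{affect} delivery")
    return ", ".join(parts) + "."
```

The result can be passed as the `instruct` argument in the usage example; for instance, `build_instruct("female", pitch="low", pace="slow", affect="sad")` produces a prompt that opens with the gender before any emotional cue.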
Caveats
- English only: other languages were not part of this checkpoint's adaptation set.
- Strongly expressive reads (drawn-out sad reads, projected announcer reads) may run slightly less precise on automatic transcription than the base. The trade-off was made deliberately for delivery character.
- CC BY-NC-SA 4.0: research and non-commercial use only.
What's in the repo
- model.safetensors: merged Talker weights
- speech_tokenizer/: Qwen3 12 Hz audio codec
- tokenizer.json, vocab.json, merges.txt, configs: text-side assets
- miner.py, chute_config.yml, vocence_config.yaml: Vocence engine glue (TEE / pro_6000)
- demo.py: quick smoke test
The Vocence files make this repo deployable on Bittensor SN78 (Vocence) via the canonical Vocence/Chutes wrapper without modification.
Model tree for Ichiro1007/vocence_enhanced_miner_v2
Base model
Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign