Darwin-TTS-1.7B-Cross: Qwen3-TTS compatibility repack
This repository is a compatibility repack of FINAL-Bench/Darwin-TTS-1.7B-Cross.
The original Darwin checkpoint appears to omit the speech_tokenizer/ directory required by the standard qwen-tts loader. This repack adds the missing speech_tokenizer/ files from Qwen/Qwen3-TTS-12Hz-1.7B-Base.
No model blending, training, fine-tuning, or behavioral changes were performed in this repack. The purpose is only to make the model load with:
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained("zeropointnine/Darwin-TTS-1.7B-Cross-Qwen3Tokenizer")
Provenance
- Main model weights and model card: FINAL-Bench/Darwin-TTS-1.7B-Cross
- Added tokenizer assets: Qwen/Qwen3-TTS-12Hz-1.7B-Base
- License: Apache 2.0, matching the upstream model cards.
Original Darwin-TTS-1.7B-Cross model card follows below:
🧬 Darwin-TTS-1.7B-Cross
World's first cross-modal FFN transfer from LLM to TTS: emotion-enhanced speech synthesis without any training.
This model is a cross-modal application of the Darwin Family framework, introduced in the paper: Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning.
Authors: Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim.
Darwin-TTS blends 3% of the Qwen3-1.7B (LLM) FFN weights into the talker module of Qwen3-TTS-1.7B (TTS). No training, no data, no GPU hours: just weight-space arithmetic.
Key Discovery
| Blend (α) | Emotion | Quality | Status |
|---|---|---|---|
| 0% | Baseline | Normal | Original Qwen3-TTS |
| 1% | No change | Normal | Too subtle |
| 3% | Emotion appears | Normal | This model (default) |
| 5% | Emotion intensified | Normal | Max stable |
| 10% | Broken | Failed | Infinite generation |
Why It Works
Qwen3-1.7B (LLM) and the Qwen3-TTS-1.7B talker have identical architecture dimensions:
| Parameter | Qwen3-1.7B (LLM) | Qwen3-TTS talker | Match |
|---|---|---|---|
| hidden_size | 2048 | 2048 | Yes |
| intermediate_size | 6144 | 6144 | Yes |
| num_hidden_layers | 28 | 28 | Yes |
| num_attention_heads | 16 | 16 | Yes |
| num_key_value_heads | 8 | 8 | Yes |
This means no SVD, no truncation, and no layer mapping are needed: the blend is a pure 1:1 lerp across all 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers).
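The compatibility check and the tensor count can be sketched in a few lines. This is a minimal illustration that assumes the config values from the table above; the helper names `shapes_match` and `ffn_tensor_names` are hypothetical, and no model files are read.

```python
# Config values copied from the comparison table (assumed, not read from disk).
LLM_CONFIG = {
    "hidden_size": 2048,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
}
TALKER_CONFIG = dict(LLM_CONFIG)  # the table reports identical values

FFN_PROJECTIONS = ("gate_proj", "up_proj", "down_proj")


def shapes_match(a, b, keys=tuple(LLM_CONFIG)):
    """Return True when every listed dimension is equal in both configs."""
    return all(a[k] == b[k] for k in keys)


def ffn_tensor_names(num_layers, prefix="model.layers"):
    """Enumerate one weight name per FFN projection per layer."""
    return [
        f"{prefix}.{layer}.mlp.{proj}.weight"
        for layer in range(num_layers)
        for proj in FFN_PROJECTIONS
    ]


names = ffn_tensor_names(LLM_CONFIG["num_hidden_layers"])
print(shapes_match(LLM_CONFIG, TALKER_CONFIG), len(names))
```

Three projections across 28 layers gives the 84 tensors quoted above; if any dimension differed, a direct elementwise blend would not be possible without reshaping.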
Architecture
Qwen3-TTS-1.7B (4-module structure):
┌──────────────────────────────────────────────────┐
│ talker (28L Qwen3 LM backbone)                   │
│ ├── 84 FFN tensors blended with LLM (α=3%)       │ ← MODIFIED
│ └── talker.model.layers.N.mlp.{gate,up,down}     │
├──────────────────────────────────────────────────┤
│ code_predictor (5L, h=1024)                      │ ← UNTOUCHED
├──────────────────────────────────────────────────┤
│ speech_tokenizer (12Hz RVQ codec)                │ ← UNTOUCHED
├──────────────────────────────────────────────────┤
│ encoder/decoder (audio waveform)                 │ ← UNTOUCHED
└──────────────────────────────────────────────────┘
FFN Source: Qwen3-1.7B (LLM)
├── model.layers.N.mlp.{gate,up,down}_proj.weight
└── Key mapping: model.layers.N → talker.model.layers.N (1:1)
Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original, preserving the audio codec pipeline entirely.
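The key mapping and the lerp can be sketched as follows. This is an illustrative stand-in, not the project's blending script: plain lists of floats take the place of tensors, the function names are hypothetical, and the only facts taken from this card are the key pattern, the `talker.` prefix, and α = 0.03.

```python
import re

ALPHA = 0.03  # blend ratio stated in this card

# Matches LLM FFN weights such as "model.layers.7.mlp.up_proj.weight".
_FFN_KEY = re.compile(r"^model\.layers\.(\d+)\.mlp\.(gate|up|down)_proj\.weight$")


def to_talker_key(llm_key):
    """Map an LLM FFN key to its talker counterpart, or None for other keys."""
    if _FFN_KEY.match(llm_key) is None:
        return None
    return "talker." + llm_key


def lerp(tts_values, llm_values, alpha):
    """Elementwise (1 - alpha) * tts + alpha * llm over flat lists of floats."""
    if len(tts_values) != len(llm_values):
        raise ValueError("shape mismatch")
    return [(1.0 - alpha) * t + alpha * l for t, l in zip(tts_values, llm_values)]


def blend_ffn(tts_state, llm_state, alpha=ALPHA):
    """Return a copy of tts_state with only the mapped FFN entries blended."""
    blended = dict(tts_state)
    for llm_key, llm_values in llm_state.items():
        talker_key = to_talker_key(llm_key)
        if talker_key is not None and talker_key in blended:
            blended[talker_key] = lerp(blended[talker_key], llm_values, alpha)
    return blended


# Toy state dicts: one FFN weight plus one entry that must stay untouched.
tts = {
    "talker.model.layers.0.mlp.gate_proj.weight": [1.0, 2.0],
    "code_predictor.x": [5.0],
}
llm = {
    "model.layers.0.mlp.gate_proj.weight": [0.0, 4.0],
    "model.layers.0.self_attn.q_proj.weight": [9.0],
}
out = blend_ffn(tts, llm)
```

Attention weights and every non-talker module fall through unchanged because they never match the FFN key pattern, which mirrors the "untouched" modules in the diagram.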
Quick Start
Option 1: Load pre-blended weights (this model)
from qwen_tts import Qwen3TTSModel
import torch
# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
model = Qwen3TTSModel.from_pretrained(
    "FINAL-Bench/Darwin-TTS-1.7B-Cross",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)
# Synthesize. The sample text is Korean for "Hello, I am Darwin AI!"
wavs, sr = model.generate_voice_clone(
    text="안녕하세요, 저는 다윈 인공지능입니다!",
    ref_audio="your_voice.wav",
    ref_text="ref",
    x_vector_only_mode=True,
)
Option 2: Custom blend ratio (runtime blending)
from qwen_tts import Qwen3TTSModel
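# As written, this loads the published α=3% checkpoint; the CLI below takes the blend ratio via --alpha.
model = Qwen3TTSModel.from_pretrained("FINAL-Bench/Darwin-TTS-1.7B-Cross")
# The sample text is Korean for "This is really great news!"
wavs, sr = model.generate_voice_clone(
    text="정말 기쁜 소식이에요!",
    ref_audio="voice.wav",
    ref_text="ref",
    x_vector_only_mode=True,
)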
CLI
python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav
Installation
pip install torch qwen-tts safetensors soundfile huggingface_hub
Research Background
The Problem
Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:
- Thousands of hours of emotional speech data
- Hundreds of GPU hours for training
- Careful data curation and annotation
The Darwin Approach
Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:
- Find architecture-compatible models across modalities (LLM to TTS)
- Blend FFN weights at low ratios (3-5%) using simple lerp
- Preserve modality-specific components (audio codec, tokenizer)
Key Findings
- Cross-modal FFN transfer works: the LLM's language understanding patterns enhance TTS emotional expressiveness
- Sweet spot is 3-5%: TTS is far more sensitive than LLM merging (which tolerates 7-93%)
- Same backbone is required: Qwen3 × Qwen3 succeeded; cross-backbone merges (e.g., Llama) failed.
- Blend ratios of 10% or more break TTS: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
- Bidirectional potential: LLM + TTS FFN may enable "Speaking LLM" (the GPT-4o direction)
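When experimenting with higher blend ratios, the runaway-generation failure can be caught with a simple duration check on the output. The sketch below is a hypothetical guard, not part of the Darwin tooling: the per-character budget, the floor, and the 24 kHz sample rate in the example are all assumptions chosen for illustration.

```python
def audio_seconds(num_samples, sample_rate):
    """Duration of a mono waveform in seconds."""
    return num_samples / sample_rate


def looks_runaway(num_samples, sample_rate, text,
                  max_seconds_per_char=1.0, floor_seconds=10.0):
    """Flag output whose duration far exceeds a generous budget for the text.

    The budget is the larger of a fixed floor and a per-character allowance;
    both thresholds are illustrative assumptions, not measured values.
    """
    budget = max(floor_seconds, len(text) * max_seconds_per_char)
    return audio_seconds(num_samples, sample_rate) > budget


# A 655-second output for a short prompt is flagged; a 3-second one is not.
SAMPLE_RATE = 24_000  # assumed for the example only
prompt = "Hello, Darwin!"
runaway = looks_runaway(655 * SAMPLE_RATE, SAMPLE_RATE, prompt)
normal = looks_runaway(3 * SAMPLE_RATE, SAMPLE_RATE, prompt)
```

A check like this only detects the failure after generation finishes; capping generation length up front would need a loader option that this card does not document.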
Model Details
- Model type: Text-to-Speech (cross-modal FFN blended)
- Base models: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
- Parameters: ~2.1B
- Languages: Korean, English, Japanese, Chinese + 6 more
- License: Apache 2.0
- Blend ratio: α=0.03 (3%)
- FFN tensors modified: 84 / 976 total (8.6%)
- Build time: ~2 minutes (no training)
Citation
If you find this work useful in your research, please cite:
@article{kim2026darwin,
  title={Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning},
  author={Kim, Taebong and Hong, Youngsik and Kim, Minsik and Choi, Sunyoung and Jang, Jaewon and Shin, Junghoon and Kim, Minseo},
  journal={arXiv preprint arXiv:2605.14386},
  year={2026}
}
Credits
VIDRAFT (비드래프트): Darwin Evolutionary Merge Framework
Built on Qwen3-TTS-1.7B and Qwen3-1.7B by Alibaba Cloud (Apache 2.0).
Related
- Darwin-27B-Opus: Darwin LLM Flagship
- FINAL Bench: Text AGI Benchmark
- Darwin Evolutionary Merge Framework: CMA-ES + FFN crossbreeding