Darwin-TTS-1.7B-Cross — Qwen3-TTS compatibility repack

This repository is a compatibility repack of FINAL-Bench/Darwin-TTS-1.7B-Cross.

The original Darwin checkpoint appears to omit the speech_tokenizer/ directory required by the standard qwen-tts loader. This repack adds the missing speech_tokenizer/ files from Qwen/Qwen3-TTS-12Hz-1.7B-Base.

No model blending, training, fine-tuning, or behavioral changes were performed in this repack. The purpose is only to make the model load with:

from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained("zeropointnine/Darwin-TTS-1.7B-Cross-Qwen3Tokenizer")

Provenance

Main model weights and model card: FINAL-Bench/Darwin-TTS-1.7B-Cross
Added tokenizer assets: Qwen/Qwen3-TTS-12Hz-1.7B-Base
License: Apache 2.0, matching the upstream model cards.

Original Darwin-TTS-1.7B-Cross model card follows below:

🧬 Darwin-TTS-1.7B-Cross

World's first cross-modal FFN transfer from LLM to TTS — emotion-enhanced speech synthesis without any training.

This model is a cross-modal application of the Darwin Family framework, introduced in the paper: Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning.

Authors: Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim.

Darwin-TTS blends 3% of the Qwen3-1.7B (LLM) FFN weights into the talker module of Qwen3-TTS-1.7B (TTS). No training, no data, no GPU hours: just weight-space arithmetic.

Key Discovery

Blend (α)	Emotion	Quality	Status
0%	Baseline	Normal	Original Qwen3-TTS
1%	No change	Normal	Too subtle
3%	Emotion appears	Normal	★ This model (default)
5%	Emotion intensified	Normal	★★ Max stable
10%	Broken	Failed	Infinite generation

Why It Works

Qwen3-1.7B (LLM) and the Qwen3-TTS-1.7B talker share identical backbone dimensions:

                    Qwen3-1.7B (LLM)    Qwen3-TTS talker    Match
hidden_size         2048                 2048                ✅
intermediate_size   6144                 6144                ✅
num_hidden_layers   28                   28                  ✅
num_attention_heads 16                   16                  ✅
num_key_value_heads 8                    8                   ✅

Because the shapes match, no SVD, truncation, or layer remapping is needed: each of the 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers) is blended 1:1 by linear interpolation (lerp).
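
Assuming "lerp" carries its usual meaning of linear interpolation, the per-tensor blend can be written as follows. The symbols W^TTS, W^LLM, and W^blend are notation introduced here for illustration and do not come from the original card; ℓ indexes the 28 layers and p the three FFN projections.

```latex
W^{\mathrm{blend}}_{\ell,p} = (1-\alpha)\, W^{\mathrm{TTS}}_{\ell,p} + \alpha\, W^{\mathrm{LLM}}_{\ell,p},
\qquad p \in \{\mathrm{gate\_proj},\ \mathrm{up\_proj},\ \mathrm{down\_proj}\},
\qquad 28 \times 3 = 84 \ \text{tensors}
```

With α = 0.03, each blended weight is 97% of the TTS value plus 3% of the LLM value.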

Architecture

Qwen3-TTS-1.7B (4-module structure):
┌─────────────────────────────────────────────────────┐
│ talker (28L Qwen3 LM backbone)                      │
│   └── 84 FFN tensors blended with LLM (α=3%)       │ ← MODIFIED
│       └── talker.model.layers.N.mlp.{gate,up,down}  │
├─────────────────────────────────────────────────────┤
│ code_predictor (5L, h=1024)                          │ ← UNTOUCHED
├─────────────────────────────────────────────────────┤
│ speech_tokenizer (12Hz RVQ codec)                    │ ← UNTOUCHED
├─────────────────────────────────────────────────────┤
│ encoder/decoder (audio waveform)                     │ ← UNTOUCHED
└─────────────────────────────────────────────────────┘

FFN Source: Qwen3-1.7B (LLM)
└── model.layers.N.mlp.{gate,up,down}_proj.weight
    └── Key mapping: model.layers.N → talker.model.layers.N (1:1)

Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original — preserving the audio codec pipeline entirely.
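
The key mapping described above can be sketched in a few lines. This is an illustration only: it assumes zero-based layer indices and that the talker-side keys end in `_proj.weight` like the LLM-side keys, and the helper names `map_llm_key_to_talker` and `ffn_key_pairs` are hypothetical.

```python
import re

# LLM-side FFN weight keys, following the layout quoted in the diagram above.
FFN_KEY = re.compile(r"^model\.layers\.(\d+)\.mlp\.(gate|up|down)_proj\.weight$")


def map_llm_key_to_talker(llm_key):
    """Return the talker-side key for an LLM FFN weight, or None for any other key."""
    if FFN_KEY.match(llm_key) is None:
        return None
    # Assumed 1:1 mapping: model.layers.N -> talker.model.layers.N
    return "talker." + llm_key


def ffn_key_pairs(num_layers=28):
    """List (llm_key, talker_key) pairs for every FFN projection in every layer."""
    pairs = []
    for layer in range(num_layers):  # assumption: layers are numbered from 0
        for proj in ("gate", "up", "down"):
            llm_key = f"model.layers.{layer}.mlp.{proj}_proj.weight"
            pairs.append((llm_key, map_llm_key_to_talker(llm_key)))
    return pairs


pairs = ffn_key_pairs()
print(len(pairs))  # 28 layers x 3 projections
print(pairs[0])
```

Attention, embedding, and norm keys return `None` from the mapper, which matches the statement that only FFN weights are touched.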

Quick Start

Option 1: Load pre-blended weights (this model)

from qwen_tts import Qwen3TTSModel
import torch

# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
model = Qwen3TTSModel.from_pretrained(
    "FINAL-Bench/Darwin-TTS-1.7B-Cross",
    device_map="cuda:0",
    dtype=torch.bfloat16
)

# Synthesize (the Korean text means: "Hello, I am Darwin AI!")
wavs, sr = model.generate_voice_clone(
    text="안녕하세요, 저는 다윈 인공지능입니다!",
    ref_audio="your_voice.wav",
    ref_text="ref",
    x_vector_only_mode=True
)

Option 2: Custom blend ratio (runtime blending)

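from qwen_tts import Qwen3TTSModel

# Note: as written, this loads the pre-blended checkpoint (α=3%);
# the step that applies a custom blend ratio is not shown in this snippet.
model = Qwen3TTSModel.from_pretrained("FINAL-Bench/Darwin-TTS-1.7B-Cross")

# The Korean text means: "This is really happy news!"
wavs, sr = model.generate_voice_clone(
    text="정말 기쁜 소식이에요!",
    ref_audio="voice.wav",
    ref_text="ref",
    x_vector_only_mode=True
)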
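
A custom blend ratio implies re-running the interpolation with a different α. The following is a minimal sketch of that arithmetic, not the authors' code: tensors are stood in for by flat lists of floats so it runs without torch, `lerp` and `blend_ffn` are hypothetical helpers, and the `max_alpha=0.05` guard is a conservative default taken from the "Max stable" row of the Key Discovery table.

```python
def lerp(a, b, alpha):
    """Linear interpolation: (1 - alpha) * a + alpha * b, element by element."""
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def blend_ffn(tts_state, llm_state, alpha, max_alpha=0.05):
    """Blend LLM FFN weights into the talker's FFN weights; leave everything else as-is.

    alpha is a fraction (0.03 means 3%). The state dicts map key -> list of floats
    here; with real checkpoints the values would be tensors of matching shape.
    """
    if not 0.0 <= alpha <= max_alpha:
        raise ValueError(f"alpha={alpha} is outside [0, {max_alpha}]")
    blended = dict(tts_state)  # shallow copy so non-FFN entries stay untouched
    for llm_key, llm_weight in llm_state.items():
        if ".mlp." not in llm_key:
            continue  # skip attention, norms, embeddings
        talker_key = "talker." + llm_key  # assumed 1:1 key mapping
        tts_weight = tts_state[talker_key]  # KeyError here means the models do not line up
        if len(tts_weight) != len(llm_weight):
            raise ValueError(f"shape mismatch for {talker_key}")
        blended[talker_key] = lerp(tts_weight, llm_weight, alpha)
    return blended


# Toy stand-ins for the two checkpoints (illustrative values only).
toy_tts = {
    "talker.model.layers.0.mlp.gate_proj.weight": [1.0, 2.0],
    "talker.model.layers.0.self_attn.q_proj.weight": [5.0],
}
toy_llm = {
    "model.layers.0.mlp.gate_proj.weight": [2.0, 4.0],
    "model.layers.0.self_attn.q_proj.weight": [9.0],
}

blended = blend_ffn(toy_tts, toy_llm, alpha=0.03)
print(blended)
```

In the toy run only the `mlp` entry moves toward the LLM values; the attention entry is copied through unchanged.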

CLI

python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav
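
The internals of `darwin_tts_blend.py` are not shown in this card. The sketch below only illustrates how the four flags in the command above could be parsed with `argparse`. It assumes `--alpha` is given in percent (inferred from `--alpha 3` alongside "α=0.03 (3%)" under Model Details); the defaults, help strings, and the `alpha_fraction` helper are hypothetical.

```python
import argparse


def build_parser():
    """Parser for the four flags used in the command above (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Hypothetical front end for FFN blending")
    parser.add_argument("--alpha", type=float, default=3.0,
                        help="blend ratio in percent (assumed unit)")
    parser.add_argument("--text", required=True, help="text to synthesize")
    parser.add_argument("--ref", required=True, help="reference voice WAV file")
    parser.add_argument("--output", default="speech.wav", help="output WAV path")
    return parser


def alpha_fraction(alpha_percent):
    """Convert a percentage such as 3 into the fraction 0.03 used in the lerp."""
    return alpha_percent / 100.0


# Parse the same arguments as the example command line.
args = build_parser().parse_args(
    ["--alpha", "3", "--text", "Hello, Darwin!", "--ref", "voice.wav", "--output", "speech.wav"]
)
print(alpha_fraction(args.alpha), args.text, args.ref, args.output)
```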

Installation

pip install torch qwen-tts safetensors soundfile huggingface_hub

Research Background

The Problem

Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:

Thousands of hours of emotional speech data
Hundreds of GPU hours for training
Careful data curation and annotation

The Darwin Approach

Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:

Find architecture-compatible models across modalities (LLM ↔ TTS)
Blend FFN weights at low ratios (3-5%) using simple lerp
Preserve modality-specific components (audio codec, tokenizer)

Key Findings

Cross-modal FFN transfer works: the LLM's language-understanding patterns enhance TTS emotional expressiveness.
The sweet spot is 3-5%: TTS is far more sensitive than LLM merging (which tolerates 7-93%).
The same backbone is required: Qwen3 × Qwen3 succeeded; cross-backbone merges (e.g., Llama) failed.
Ratios of 10% or more break TTS: the LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs.
Bidirectional potential: blending TTS FFN weights into an LLM may enable a "Speaking LLM" (the GPT-4o direction).
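
Given the runaway-generation failure noted above, anyone experimenting with higher blend ratios may want a simple duration check on the output. The sketch below is one way to do that; the thresholds (`max_seconds_per_char`, `floor_seconds`) and the 24000 Hz sample rate in the example are illustrative assumptions, not values from the original card.

```python
def audio_duration_seconds(num_samples, sample_rate):
    """Duration of a mono waveform in seconds."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def is_runaway(num_samples, sample_rate, text,
               max_seconds_per_char=0.5, floor_seconds=10.0):
    """Flag outputs that are implausibly long for the input text.

    The limit is the larger of a fixed floor and a per-character budget;
    both numbers are hypothetical and should be tuned per language.
    """
    limit = max(floor_seconds, max_seconds_per_char * len(text))
    return audio_duration_seconds(num_samples, sample_rate) > limit


text = "Hello, Darwin!"
sample_rate = 24000  # assumed for illustration
print(is_runaway(655 * sample_rate, sample_rate, text))  # a 655-second output
print(is_runaway(3 * sample_rate, sample_rate, text))    # a 3-second output
```

In practice `num_samples` would be `len(wavs[0])` for the waveform returned by the generation call, assuming it is a one-dimensional array.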

Model Details

Model type: Text-to-Speech (cross-modal FFN blended)
Base models: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
Parameters: ~2.1B
Languages: Korean, English, Japanese, Chinese + 6 more
License: Apache 2.0
Blend ratio: α=0.03 (3%)
FFN tensors modified: 84 / 976 total (8.6%)
Build time: ~2 minutes (no training)

Citation

If you find this work useful in your research, please cite:

@article{kim2026darwin,
  title={Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning},
  author={Kim, Taebong and Hong, Youngsik and Kim, Minsik and Choi, Sunyoung and Jang, Jaewon and Shin, Junghoon and Kim, Minseo},
  journal={arXiv preprint arXiv:2605.14386},
  year={2026}
}

Credits

VIDRAFT (비드래프트) — Darwin Evolutionary Merge Framework

Built on Qwen3-TTS-1.7B and Qwen3-1.7B by Alibaba Cloud (Apache 2.0).

Related

Darwin-27B-Opus — Darwin LLM Flagship
FINAL Bench — Text AGI Benchmark
Darwin Evolutionary Merge Framework — CMA-ES + FFN crossbreeding
