NatScore (small, v0)
A small, preference-supervised naturalness scorer for modern neural TTS. Frozen openai/whisper-small encoder with a tiny Bradley-Terry head (~400K trainable parameters), trained on the 99K human preference pairs in SpeechJudge-Data (Zhang et al., Nov 2025).
Higher score = more natural in the SpeechJudge sense.
Headline result
| Metric | Value |
|---|---|
| Pairwise accuracy (dev[:1000]) | 71.30% |
| 95% CI | [68.60, 74.10] |
| Mean margin | +1.010 |
| ECE | 2.27% (well-calibrated) |
| Trainable params | ~395K |
| Total params with frozen encoder | ~244M |
| Training compute | ~4h52m on 2x T4 (Kaggle, DataParallel) |
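This clears the project's M5b target of >70% pairwise accuracy. For reference, SpeechJudge-BTRM (7B params, Qwen2.5-Omni-7B backbone) reports 72.7% on SpeechJudge-Eval; SpeechJudge-GRM (also 7B, gold standard with chain-of-thought) reports 77.2%. Those figures are on SpeechJudge-Eval while NatScore's is on dev[:1000], so the comparison is indicative rather than like-for-like. NatScore lands within striking distance with ~395K trainable parameters (~244M including the frozen encoder, about 3.5% of 7B).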
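For readers who want to reproduce this kind of table on their own pairs, here is a minimal sketch of pairwise accuracy, mean margin, and a confidence interval. The function names are illustrative, and the percentile bootstrap is an assumption: this card does not state how its 95% CI was computed.

```python
import random

def pairwise_metrics(chosen_scores, rejected_scores):
    """Return (accuracy, mean_margin) for paired scorer outputs.

    A pair counts as correct when the human-preferred clip scores
    strictly higher; ties count as incorrect.
    """
    margins = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    accuracy = sum(m > 0 for m in margins) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy (assumed method, see lead-in).

    `correct` is a list of 0/1 flags, one per evaluated pair.
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With 1,000 pairs the interval is a few points wide either way, so differences of one or two points between checkpoints on dev[:1000] should be read cautiously.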
Breakdowns
By subset:
| Subset | n_pairs | Pairwise acc |
|---|---|---|
| Regular | 400 | 74.00% |
| Expressive | 600 | 69.50% |
By language-pair (source script -> target script):
| Pair | n_pairs | Pairwise acc |
|---|---|---|
| en -> zh | 221 | 87.33% |
| zh -> zh | 179 | 83.24% |
| zh -> en | 148 | 66.22% |
| en -> en | 252 | 63.89% |
| zh -> mixed | 80 | 61.25% |
| en -> mixed | 120 | 52.50% |
Mixed-language code-switching is the obvious tail to investigate next. Expressive prosody is harder than read-speech, as expected.
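As a quick consistency check, both breakdowns should pool back to the headline 71.30%. The sketch below uses only the numbers from the two tables above; `pooled_accuracy` is an illustrative helper, not part of the natscore package.

```python
# (n_pairs, accuracy in %) copied from the two breakdown tables above
BY_SUBSET = {
    "regular": (400, 74.00),
    "expressive": (600, 69.50),
}
BY_LANGUAGE_PAIR = {
    "en -> zh": (221, 87.33),
    "zh -> zh": (179, 83.24),
    "zh -> en": (148, 66.22),
    "en -> en": (252, 63.89),
    "zh -> mixed": (80, 61.25),
    "en -> mixed": (120, 52.50),
}

def pooled_accuracy(rows):
    """Pair-count-weighted mean accuracy over the rows of one breakdown."""
    total = sum(n for n, _ in rows.values())
    return sum(n * acc for n, acc in rows.values()) / total

print(f"{pooled_accuracy(BY_SUBSET):.2f}")         # 71.30
print(f"{pooled_accuracy(BY_LANGUAGE_PAIR):.2f}")  # 71.30
```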
Intended use
- Offline naturalness ranking of TTS system outputs (A/B comparison, listening-test prefiltering).
- A cheap reward signal for non-commercial RLHF/preference-tuning experiments on TTS.
- Distribution-coverage analysis across CosyVoice2, F5-TTS, MaskGCT, Llasa, XTTS-v2, and similar 2024-2025 neural TTS families.
Out of scope / known weaknesses
- Not a MOS predictor. The training signal is pairwise preference, not absolute MOS. Output is a logit; calibrate before treating as a quality score.
- Read-speech vs expressive: the model is 4.5 points weaker on expressive prosody (69.5%) than on regular read-speech (74.0%).
- Mixed-language code-switching: en -> mixed is near chance (52.5%). Treat scores for code-switched inputs with caution.
- Trained on openai/whisper-small audio features (16 kHz, mel-spectrogram). Inputs at other sample rates need resampling.
- Not evaluated on speech enhancement, telecom, or accent-rated naturalness. Use WhiSQA or DNSMOS for those.
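Since inputs are expected at 16 kHz, it can help to flag files at other rates before scoring. This card does not say whether the natscore loader resamples internally, so the sketch below only detects the mismatch using Python's standard `wave` module (PCM WAV files only); `needs_resampling` is a hypothetical helper.

```python
import wave

TARGET_SR = 16_000  # input rate stated in the note above

def needs_resampling(path, target_sr=TARGET_SR):
    """Return True if the PCM WAV file at `path` is not sampled at `target_sr`."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() != target_sr
```

Files that are flagged can then be resampled with whatever audio library the project already uses.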
Loading
The natscore package exposes a one-line load that pulls this checkpoint and the frozen Whisper-small encoder, then returns a ready Scorer:
# pip install git+https://github.com/harrrshall/natscore.git
import natscore as ns
scorer = ns.load() # defaults to "harrrshall/natscore-small-v0"
# Alternatives:
# scorer = ns.load("natscore-small-v0")
# scorer = ns.load(device="cuda", dtype=torch.float16)  # needs "import torch" first
# Pointwise score (higher = more natural)
s: float = scorer.score("path/to/tts_output.wav")
# Pairwise comparison (recommended; matches the BT training objective)
pair = scorer.compare("a.wav", "b.wav")
# pair.winner -> "a" | "b" | "tie"
# pair.score_a, pair.score_b
# pair.margin = score_a - score_b
# pair.prob_a_wins = sigmoid(margin)
# Batched
scores: list[float] = scorer.batch_score(["c1.wav", "c2.wav", "c3.wav"])
First call downloads ~290 MB Whisper-small plus this ~5 MB checkpoint; subsequent calls hit the local HF cache. Wall-clock per 10s audio clip: ~120 ms CPU, ~15 ms T4 GPU.
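For the A/B ranking use case, one simple pattern is to average pointwise scores per TTS system. The sketch below takes any callable with the same shape as `scorer.batch_score` (list of paths in, list of floats out); `rank_systems` and the `clips_by_system` layout are illustrative, not part of the natscore API.

```python
from statistics import mean

def rank_systems(batch_score, clips_by_system):
    """Rank systems by mean clip score, best first.

    batch_score: callable mapping a list of audio paths to a list of floats,
        e.g. scorer.batch_score from the loading example above.
    clips_by_system: dict mapping a system name to its list of audio paths.
    """
    means = {
        name: mean(batch_score(paths))
        for name, paths in clips_by_system.items()
    }
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)
```

A call would look like `rank_systems(scorer.batch_score, {"sys_a": [...], "sys_b": [...]})`. Because the model was trained on pairwise preferences, comparing systems on the same texts (or using `scorer.compare` per text) is likely a fairer setup than averaging over unrelated clips.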
Manual loading (without the natscore package)
If you only want the head and bring your own Whisper encoder:
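import torch
from huggingface_hub import hf_hub_download
from natscore.model import NatScoreHead, NatScoreHeadConfig
# Fetch the head checkpoint from the Hub (cached after the first call).
ckpt_path = hf_hub_download("harrrshall/natscore-small-v0", "final.pt")
# weights_only=False unpickles arbitrary Python objects; only use it on checkpoints you trust.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
cfg = NatScoreHeadConfig(**ckpt["config"]["model"])
head = NatScoreHead(cfg)
# Strip the "module." prefix that DataParallel adds to parameter names.
state = {(k[len("module."):] if k.startswith("module.") else k): v
         for k, v in ckpt["model_state"].items()}
head.load_state_dict(state)
head.eval()  # inference mode: disables dropout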
Architecture
audio (16 kHz, log-mel)
|
| openai/whisper-small encoder (FROZEN, 244M params)
v
[B, H=13, T, D=768] # all encoder layer outputs
|
| LayerWeightedSum: softmax(alpha) over H -> [B, T, D]
v
[B, T, D]
|
| AttentionPooler (D -> 256 bottleneck, 1 query) -> [B, D]
v
[B, D]
|
| ScoreHead MLP (D -> 256 -> 1) -> [B]
v
scalar logit (higher = more natural)
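To make the LayerWeightedSum step concrete, here is a dependency-free sketch for a single clip: the 13 learnable logits `alpha` are turned into softmax weights and used to mix the per-layer features. It uses nested lists instead of tensors and is an illustration of the diagram, not the package's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_weighted_sum(hidden, alpha):
    """Mix encoder layers with softmax(alpha).

    hidden: nested list of shape [H][T][D] (one clip, H layer outputs).
    alpha:  list of H learnable logits.
    Returns a nested list of shape [T][D].
    """
    weights = softmax(alpha)
    n_frames, n_dims = len(hidden[0]), len(hidden[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, hidden))
         for d in range(n_dims)]
        for t in range(n_frames)
    ]
```

With all logits equal, the result is a plain average of the layers; training moves weight toward whichever layers carry the most naturalness signal.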
| Stage | Params |
|---|---|
| LayerWeightedSum (alpha) | 13 |
| AttentionPooler | ~197K |
| ScoreHead MLP | ~197K |
| Total trainable | ~395K |
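The counts in this table can be roughly reproduced from the shapes in the diagram. The sketch below assumes each of the two modules is a pair of biased linear layers (768 -> 256, then 256 -> 1); that layout is inferred from the diagram, and the real modules may include extra parameters.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of a dense layer with an optional bias vector."""
    return n_in * n_out + (n_out if bias else 0)

D, BOTTLENECK, NUM_LAYERS = 768, 256, 13

layer_weights = NUM_LAYERS  # one logit per encoder layer output
# Assumed layout: D -> 256 projection followed by a 256 -> 1 scoring layer
attention_pooler = linear_params(D, BOTTLENECK) + linear_params(BOTTLENECK, 1)
score_head = linear_params(D, BOTTLENECK) + linear_params(BOTTLENECK, 1)
total = layer_weights + attention_pooler + score_head

print(attention_pooler, score_head, total)  # 197121 197121 394255
```

Under these assumptions the total is about 394K, in line with the ~395K reported above; any small gap would come from whatever the actual modules add beyond this sketch.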
Trained with Bradley-Terry pairwise log-loss on (audio_chosen, audio_rejected) pairs.
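For a single pair, the Bradley-Terry log-loss is the negative log of sigmoid(score_chosen - score_rejected). A minimal sketch in plain Python follows; the function names are illustrative and the actual training code presumably operates on batched tensors.

```python
import math

def bt_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise log-loss: -log(sigmoid(s_chosen - s_rejected)).

    Computed as softplus(-margin) in a numerically stable form.
    """
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def prob_chosen_wins(score_chosen, score_rejected):
    """Model probability that the chosen clip beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))
```

The loss is log(2) when the two scores are equal and shrinks toward zero as the chosen clip's margin grows, which is why only score differences, not absolute values, are constrained by training.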
Training data
SpeechJudge-Data (Zhang et al., Nov 2025, CC-BY-NC-4.0): 99K human-labeled TTS preference pairs across CosyVoice2, F5-TTS, MaskGCT, Llasa, XTTS-v2, and others. It includes both the regular and expressive splits and covers English (en), Chinese (zh), and en-zh code-switching.
No external supervision beyond the SpeechJudge preference labels. No MOS labels, no synthetic labels, no LLM-rater labels.
Training recipe
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-3, cosine schedule with 500-step warmup |
| Weight decay | 1e-4 |
| Batch size | 16 pairs per step (effective 32 with DataParallel) |
| Epochs | 5 |
| Total steps | 13,250 |
| Gradient clip | 1.0 |
| Dropout | 0.1 |
| Mixed precision | AMP fp16 |
| Hardware | 2x NVIDIA T4 (Kaggle) |
| Wall-clock | ~4h52m total (with mid-run resume from step 8000) |
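The learning-rate row can be sketched as a function of the step. The table gives the peak rate, warmup length, and total steps; the linear warmup shape and the decay to zero are assumptions here, and `lr_at` is an illustrative name. The exact schedule is in config.yaml.

```python
import math

def lr_at(step, base_lr=1e-3, warmup_steps=500, total_steps=13_250):
    """Learning rate at `step`: linear warmup, then cosine decay to zero.

    The warmup shape and final value are assumed; only base_lr,
    warmup_steps, and total_steps come from the table above.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at 1e-3 at step 500 and is back to half of that around step 6,875, the midpoint of the decay phase.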
Full config: config.yaml in this repo. Source code: github.com/harrrshall/natscore.
License
Weights: CC-BY-NC-4.0 (inherited from SpeechJudge-Data). Academic and non-commercial use only. For commercial licensing, contact the SpeechJudge authors first; their license governs derivative model weights.
Code: Apache-2.0 (in the github repo).
This matches how the SpeechJudge authors released their own checkpoints (SpeechJudge-BTRM, SpeechJudge-GRM).
Citation
If you use NatScore in academic work, please cite:
@misc{singh2026natscore,
title = {NatScore: A small, preference-supervised naturalness scorer for modern neural TTS},
author = {Singh, Harshal},
year = {2026},
url = {https://huggingface.co/harrrshall/natscore-small-v0}
}
Also cite the training data and the SpeechJudge family:
@article{zhang2025speechjudge,
title = {SpeechJudge: Preference-Based Evaluation for Modern Neural Text-to-Speech},
author = {Zhang, et al.},
year = {2025},
url = {https://huggingface.co/datasets/RMSnow/SpeechJudge-Data}
}
Files in this repo
| File | Purpose |
|---|---|
| final.pt | Trained checkpoint (~4.6 MB). state_dict for NatScoreHead, plus optimizer/scheduler state |
| config.yaml | Full training config (model, data, optimizer, schedule) |
| eval_dev.json | Headline dev[:1000] eval result with per-subset and per-language breakdowns |
| MODEL_LICENSE.md | Full CC-BY-NC-4.0 terms with rationale |
| README.md | This file |
Contact
github.com/harrrshall/natscore for issues, PRs, ablations, and the upcoming larger checkpoint trained on the full ablation grid. Twitter: @HarshalsinghCN.