NatScore (small, v0)

A small, preference-supervised naturalness scorer for modern neural TTS. Frozen openai/whisper-small encoder with a tiny Bradley-Terry head (~400K trainable parameters), trained on the 99K human preference pairs in SpeechJudge-Data (Zhang et al., Nov 2025).

Higher score = more natural in the SpeechJudge sense.

Headline result

Metric	Value
Pairwise accuracy (dev[:1000])	71.30%
95% CI	[68.60, 74.10]
Mean margin	+1.010
ECE	2.27% (well-calibrated)
Trainable params	~400K (~1/17,500 of SpeechJudge-BTRM)
Total params with frozen encoder	~244M
Training compute	~4h52m on 2x T4 (Kaggle, DataParallel)

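This clears the project's M5b target of >70% pairwise accuracy. For reference, SpeechJudge-BTRM (7B params, Qwen2.5-Omni-7B backbone) reports 72.7% on SpeechJudge-Eval; SpeechJudge-GRM (also 7B, gold standard with chain-of-thought) reports 77.2%. Those figures come from a different evaluation set (SpeechJudge-Eval, versus dev[:1000] here), so the comparison is indicative rather than like-for-like. NatScore lands within striking distance with roughly 0.006% of the trainable parameters (~400K vs 7B); counting the frozen encoder (~244M total), it is about 3.5% of a 7B model.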
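The table does not say how the 95% CI was computed. One common choice is a percentile bootstrap over per-pair correctness. The sketch below assumes that method and rebuilds the outcome vector from the reported accuracy (713 correct out of 1000), so it illustrates the procedure rather than reproducing the exact interval. The function name and resample count are illustrative.

```python
import random

def bootstrap_accuracy_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample n pairs with replacement and record the accuracy.
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical reconstruction: 713 correct pairs out of 1000 (71.30%).
outcomes = [1] * 713 + [0] * 287
low, high = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={sum(outcomes) / len(outcomes):.4f}  95% CI=[{low:.4f}, {high:.4f}]")
```

With 1000 pairs the interval is a few points wide on either side, which is worth keeping in mind when comparing against numbers reported on other evaluation sets.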

Breakdowns

By subset:

Subset	n_pairs	Pairwise acc
Regular	400	74.00%
Expressive	600	69.50%

By language-pair (source script -> target script):

Pair	n_pairs	Pairwise acc
en -> zh	221	87.33%
zh -> zh	179	83.24%
zh -> en	148	66.22%
en -> en	252	63.89%
zh -> mixed	80	61.25%
en -> mixed	120	52.50%

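Mixed-language code-switching is the obvious tail to investigate next: en -> mixed is near chance at 52.50%. English targets (63.89% to 66.22%) also trail Chinese targets (83.24% to 87.33%) by a wide margin. Expressive prosody is harder than read speech, as expected.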
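As a consistency check, both breakdowns should recombine to the headline figure when weighted by pair count. The short sketch below does that arithmetic with the numbers copied from the two tables; both pooled values come out at 71.30%.

```python
# (n_pairs, accuracy in percent) copied from the two breakdown tables.
by_subset = {"Regular": (400, 74.00), "Expressive": (600, 69.50)}
by_language = {
    "en -> zh": (221, 87.33),
    "zh -> zh": (179, 83.24),
    "zh -> en": (148, 66.22),
    "en -> en": (252, 63.89),
    "zh -> mixed": (80, 61.25),
    "en -> mixed": (120, 52.50),
}

def pooled_accuracy(groups):
    """Pair-weighted mean accuracy (percent) across groups."""
    total = sum(n for n, _ in groups.values())
    return sum(n * acc for n, acc in groups.values()) / total

subset_acc = pooled_accuracy(by_subset)
language_acc = pooled_accuracy(by_language)
print(f"by subset: {subset_acc:.2f}%  by language pair: {language_acc:.2f}%")
```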

Intended use

Offline naturalness ranking of TTS system outputs (A/B comparison, listening-test prefiltering).
A cheap reward signal for non-commercial RLHF/preference-tuning experiments on TTS.
Distribution-coverage analysis across CosyVoice2, F5-TTS, MaskGCT, Llasa, XTTS-v2, and similar 2024-2025 neural TTS families.
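For A/B comparison of two systems, per-utterance margins can be rolled up into system-level counts. The sketch below assumes the margins have already been collected (for example from pair.margin); the helper name, the tie_eps dead zone, and the sample margins are all hypothetical and not part of the natscore package.

```python
import math

def summarize_ab(margins, tie_eps=0.0):
    """Aggregate per-utterance margins (score_a - score_b) for systems A and B.

    tie_eps is a hypothetical dead zone: |margin| <= tie_eps counts as a tie.
    """
    wins_a = sum(1 for m in margins if m > tie_eps)
    wins_b = sum(1 for m in margins if m < -tie_eps)
    ties = len(margins) - wins_a - wins_b
    # Mean Bradley-Terry probability that A is preferred, sigmoid(margin).
    mean_prob_a = sum(1.0 / (1.0 + math.exp(-m)) for m in margins) / len(margins)
    return {"wins_a": wins_a, "wins_b": wins_b, "ties": ties, "mean_prob_a": mean_prob_a}

# Made-up margins for five utterances, for illustration only.
summary = summarize_ab([1.2, -0.3, 0.8, 0.05, -1.1], tie_eps=0.1)
print(summary)
```

Reporting the mean preference probability alongside raw win counts keeps small margins from being counted the same as decisive ones.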

Out of scope / known weaknesses

Not a MOS predictor. The training signal is pairwise preference, not absolute MOS. Output is a logit; calibrate before treating as a quality score.
Read-speech vs expressive: the model is 4.5 points weaker on expressive prosody (69.50%) than on regular read-speech (74.00%).
Mixed-language code-switching: en -> mixed is near chance (52.5%). Treat scores for code-switched inputs with caution.
Trained on openai/whisper-small audio features (16 kHz, log-mel spectrogram). Inputs at other sample rates must be resampled to 16 kHz first.
Not evaluated on speech enhancement, telecom, or accent-rated naturalness. Use WhiSQA or DNSMOS for those.
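Because the output is a Bradley-Terry logit, only differences between scores carry meaning. One possible way to make a single score interpretable, sketched here under that assumption, is to report its expected win probability against a fixed reference set of clips. The helper names and the reference logits are hypothetical; this is not a substitute for a proper MOS calibration.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def win_rate_vs_reference(score, reference_scores):
    """Mean Bradley-Terry probability that `score` beats each reference clip."""
    return sum(sigmoid(score - r) for r in reference_scores) / len(reference_scores)

# Made-up logits standing in for pointwise scores of a reference set.
reference = [-0.8, -0.1, 0.3, 0.9, 1.4]
for s in (-1.0, 0.3, 2.0):
    print(f"logit={s:+.1f}  win rate vs reference={win_rate_vs_reference(s, reference):.3f}")
```

The resulting number depends entirely on the chosen reference set, so it should be reported together with a description of that set.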

Loading

The natscore package exposes a one-line load that pulls this checkpoint and the frozen Whisper-small encoder, then returns a ready Scorer:

# pip install git+https://github.com/harrrshall/natscore.git
import torch  # only needed for the dtype= alternative below
import natscore as ns

scorer = ns.load()                       # defaults to "harrrshall/natscore-small-v0"
# Alternatives:
# scorer = ns.load("natscore-small-v0")
# scorer = ns.load(device="cuda", dtype=torch.float16)

# Pointwise score (higher = more natural)
s: float = scorer.score("path/to/tts_output.wav")

# Pairwise comparison (recommended; matches the BT training objective)
pair = scorer.compare("a.wav", "b.wav")
# pair.winner       -> "a" | "b" | "tie"
# pair.score_a, pair.score_b
# pair.margin       = score_a - score_b
# pair.prob_a_wins  = sigmoid(margin)

# Batched
scores: list[float] = scorer.batch_score(["c1.wav", "c2.wav", "c3.wav"])

First call downloads ~290 MB Whisper-small plus this ~5 MB checkpoint; subsequent calls hit the local HF cache. Wall-clock per 10s audio clip: ~120 ms CPU, ~15 ms T4 GPU.

Manual loading (without the natscore package)

If you only want the head and bring your own Whisper encoder:

import torch
from huggingface_hub import hf_hub_download
from natscore.model import NatScoreHead, NatScoreHeadConfig

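ckpt_path = hf_hub_download("harrrshall/natscore-small-v0", "final.pt")
# weights_only=False runs the full unpickler, which can execute arbitrary
# code; only load checkpoints you trust.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

cfg = NatScoreHeadConfig(**ckpt["config"]["model"])
head = NatScoreHead(cfg)
# Training used DataParallel, which prefixes parameter names with "module.";
# strip that prefix so the keys match a bare NatScoreHead.
state = {(k[len("module."):] if k.startswith("module.") else k): v
         for k, v in ckpt["model_state"].items()}
head.load_state_dict(state)
head.eval()  # disable dropout for inference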

Architecture

audio (16 kHz, log-mel)
    |
    | openai/whisper-small encoder (FROZEN, 244M params)
    v
[B, H=13, T, D=768]                # all encoder layer outputs
    |
    | LayerWeightedSum: softmax(alpha) over H            -> [B, T, D]
    v
[B, T, D]
    |
    | AttentionPooler (D -> 256 bottleneck, 1 query)     -> [B, D]
    v
[B, D]
    |
    | ScoreHead MLP (D -> 256 -> 1)                      -> [B]
    v
scalar logit (higher = more natural)

Stage	Params
LayerWeightedSum (alpha)	13
AttentionPooler	~197K
ScoreHead MLP	~197K
Total trainable	~395K
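The counts in this table can be roughly reproduced from the dimensions in the diagram. The layer composition below is an assumption, not confirmed by the source: the pooler is taken to be one 768 -> 256 projection with bias plus a single 256-dimensional learned query, and the score MLP two dense layers with bias. The actual NatScoreHead may be organized differently.

```python
D, BOTTLENECK, NUM_LAYERS = 768, 256, 13  # values from the diagram

def linear_params(d_in, d_out, bias=True):
    """Parameter count of a dense layer."""
    return d_in * d_out + (d_out if bias else 0)

# One alpha per encoder hidden state.
layer_weights = NUM_LAYERS
# Assumed: projection to the bottleneck plus one learned query vector.
pooler = linear_params(D, BOTTLENECK) + BOTTLENECK
# Assumed: D -> 256 -> 1, both layers with bias.
score_mlp = linear_params(D, BOTTLENECK) + linear_params(BOTTLENECK, 1)
total = layer_weights + pooler + score_mlp

print(layer_weights, pooler, score_mlp, total)
```

Under these assumptions the total comes to about 394K, in line with the ~395K reported above.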

Trained with Bradley-Terry pairwise log-loss on (audio_chosen, audio_rejected) pairs.
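For a single pair, the Bradley-Terry log-loss is -log(sigmoid(s_chosen - s_rejected)). A minimal plain-Python sketch of that objective follows; the function names are hypothetical, and the repo's training code presumably works on batched tensors rather than Python floats.

```python
import math

def bt_pair_loss(score_chosen, score_rejected):
    """-log(sigmoid(margin)), computed stably as softplus(-margin)."""
    margin = score_chosen - score_rejected
    # softplus(x) = max(x, 0) + log1p(exp(-|x|)) avoids overflow for large |x|.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bt_batch_loss(pairs):
    """Mean loss over (score_chosen, score_rejected) pairs."""
    return sum(bt_pair_loss(c, r) for c, r in pairs) / len(pairs)

# Equal scores give log(2); a correctly ordered pair costs less than a misordered one.
print(bt_pair_loss(0.0, 0.0))
print(bt_pair_loss(2.0, -1.0) < bt_pair_loss(-1.0, 2.0))
```

Minimizing this loss pushes the chosen clip's logit above the rejected clip's, which is why pairwise comparison is the recommended way to use the scores.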

Training data

SpeechJudge-Data (Zhang et al., Nov 2025, CC-BY-NC-4.0): 99K human-labeled TTS preference pairs across CosyVoice2, F5-TTS, MaskGCT, Llasa, XTTS-v2, and others. Both regular and expressive splits. en, zh, en-zh code-switching.

No external supervision beyond the SpeechJudge preference labels. No MOS labels, no synthetic labels, no LLM-rater labels.

Training recipe

Hyperparameter	Value
Optimizer	AdamW
Learning rate	1e-3, cosine schedule with 500-step warmup
Weight decay	1e-4
Batch size	16 pairs per step (effective 32 with DataParallel)
Epochs	5
Total steps	13,250
Gradient clip	1.0
Dropout	0.1
Mixed precision	AMP fp16
Hardware	2x NVIDIA T4 (Kaggle)
Wall-clock	~4h52m total (with mid-run resume from step 8000)
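The learning-rate schedule in the table can be sketched as a function of the step index. The table does not state the warmup shape or the final learning rate, so this version assumes a linear warmup from 0 and a cosine decay to 0 at the last step; config.yaml is the authority on the actual values.

```python
import math

PEAK_LR, WARMUP_STEPS, TOTAL_STEPS = 1e-3, 500, 13_250  # from the table above

def lr_at(step):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp from 0 to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak to an assumed floor of 0.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))

for step in (0, 250, 500, 6_875, 13_250):
    print(step, f"{lr_at(step):.2e}")
```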

Full config: config.yaml in this repo. Source code: github.com/harrrshall/natscore.

License

Weights: CC-BY-NC-4.0 (inherited from SpeechJudge-Data). Academic and non-commercial use only. For commercial licensing, contact the SpeechJudge authors first; their license governs derivative model weights.

Code: Apache-2.0 (in the github repo).

This matches how the SpeechJudge authors released their own checkpoints (SpeechJudge-BTRM, SpeechJudge-GRM).

Citation

If you use NatScore in academic work, please cite:

@misc{singh2026natscore,
  title  = {NatScore: A small, preference-supervised naturalness scorer for modern neural TTS},
  author = {Singh, Harshal},
  year   = {2026},
  url    = {https://huggingface.co/harrrshall/natscore-small-v0}
}

Also cite the training data and the SpeechJudge family:

@article{zhang2025speechjudge,
  title  = {SpeechJudge: Preference-Based Evaluation for Modern Neural Text-to-Speech},
  author = {Zhang, et al.},
  year   = {2025},
  url    = {https://huggingface.co/datasets/RMSnow/SpeechJudge-Data}
}

Files in this repo

File	Purpose
`final.pt`	Trained checkpoint (~4.6 MB). `state_dict` for `NatScoreHead`, plus optimizer/scheduler state
`config.yaml`	Full training config (model, data, optimizer, schedule)
`eval_dev.json`	Headline dev[:1000] eval result with per-subset and per-language breakdowns
`MODEL_LICENSE.md`	Full CC-BY-NC-4.0 terms with rationale
`README.md`	This file

Contact

github.com/harrrshall/natscore for issues, PRs, ablations, and the upcoming larger checkpoint trained on the full ablation grid. Twitter: @HarshalsinghCN.
