NatScore (small, v0)

NatScore is a small, preference-supervised naturalness scorer for modern neural TTS. It pairs a frozen openai/whisper-small encoder with a tiny Bradley-Terry head (~400K trainable parameters) and is trained on the 99K human preference pairs in SpeechJudge-Data (Zhang et al., Nov 2025).

Higher score = more natural in the SpeechJudge sense.

Headline result

| Metric | Value |
| --- | --- |
| Pairwise accuracy (dev[:1000]) | 71.30% |
| 95% CI | [68.60%, 74.10%] |
| Mean margin | +1.010 |
| ECE | 2.27% (well-calibrated) |
| Trainable params | 400K (1/17,500 of SpeechJudge-BTRM) |
| Total params with frozen encoder | ~244M |
| Training compute | ~4h52m on 2x T4 (Kaggle, DataParallel) |

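This clears the project's M5b target of >70% pairwise accuracy. For reference, SpeechJudge-BTRM (7B params, Qwen2.5-Omni-7B backbone) reports 72.7% on SpeechJudge-Eval, and SpeechJudge-GRM (also 7B, with chain-of-thought) reports 77.2%. Those numbers are measured on SpeechJudge-Eval rather than dev[:1000], so the comparison is indicative only. With that caveat, NatScore lands within 1.4 points of SpeechJudge-BTRM using about 1/17,500 of its trainable parameters (roughly 3.5% of its total size once the 244M frozen encoder is counted).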
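As a rough illustration of how the headline metrics relate to per-pair scores, the sketch below computes pairwise accuracy, mean margin, a percentile-bootstrap interval, and ECE from a list of margins (score of the human-preferred clip minus score of the other clip). The function names, the bootstrap method, and the 10-bin confidence-based ECE are assumptions made for illustration; the evaluation code in the repository may define them differently.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_accuracy(margins):
    """Fraction of pairs where the human-preferred clip scored strictly higher."""
    return sum(m > 0 for m in margins) / len(margins)


def mean_margin(margins):
    """Average of score(chosen) - score(rejected) over all pairs."""
    return sum(margins) / len(margins)


def bootstrap_ci(margins, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for pairwise accuracy (assumed method)."""
    rng = random.Random(seed)
    n = len(margins)
    accs = sorted(
        pairwise_accuracy([margins[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = accs[int((alpha / 2) * n_boot)]
    hi = accs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def expected_calibration_error(margins, n_bins=10):
    """ECE over confidence = max(p, 1 - p), with p = sigmoid(margin)."""
    # Each bin holds [pair count, summed confidence, correct count].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for m in margins:
        p = sigmoid(m)
        conf = max(p, 1.0 - p)  # always in [0.5, 1.0]
        idx = min(int((conf - 0.5) / 0.5 * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += m > 0
    n = len(margins)
    return sum(abs(c / k - s / k) * k / n for k, s, c in bins if k)
```

Given one margin per evaluation pair, `pairwise_accuracy(margins)` and `bootstrap_ci(margins)` correspond to the first two rows of the table under these assumptions.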

Breakdowns

By subset:

| Subset | n_pairs | Pairwise acc |
| --- | --- | --- |
| Regular | 400 | 74.00% |
| Expressive | 600 | 69.50% |

By language-pair (source script -> target script):

| Pair | n_pairs | Pairwise acc |
| --- | --- | --- |
| en -> zh | 221 | 87.33% |
| zh -> zh | 179 | 83.24% |
| zh -> en | 148 | 66.22% |
| en -> en | 252 | 63.89% |
| zh -> mixed | 80 | 61.25% |
| en -> mixed | 120 | 52.50% |

Mixed-language code-switching is the obvious tail to investigate next. Expressive prosody is harder than read-speech, as expected.
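As a quick consistency check, both breakdowns should pool back to the headline 71.30% when each group is weighted by its pair count. The snippet below is a small sketch of that arithmetic; the dictionary and function names are illustrative, and the numbers are copied from the two breakdown tables.

```python
# (n_pairs, pairwise accuracy in %) copied from the breakdown tables.
by_subset = {
    "regular": (400, 74.00),
    "expressive": (600, 69.50),
}
by_language_pair = {
    "en->zh": (221, 87.33),
    "zh->zh": (179, 83.24),
    "zh->en": (148, 66.22),
    "en->en": (252, 63.89),
    "zh->mixed": (80, 61.25),
    "en->mixed": (120, 52.50),
}


def pooled_accuracy(groups):
    """Pair-count-weighted average accuracy across groups, in percent."""
    total_pairs = sum(n for n, _ in groups.values())
    return sum(n * acc for n, acc in groups.values()) / total_pairs


print(f"by subset:        {pooled_accuracy(by_subset):.2f}%")
print(f"by language pair: {pooled_accuracy(by_language_pair):.2f}%")
```

Both lines print 71.30%, matching the headline result, and both breakdowns cover the same 1,000 pairs.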

Intended use

  • Offline naturalness ranking of TTS system outputs (A/B comparison, listening-test prefiltering).
  • A cheap reward signal for non-commercial RLHF/preference-tuning experiments on TTS.
  • Distribution-coverage analysis across CosyVoice2, F5-TTS, MaskGCT, Llasa, XTTS-v2, and similar 2024-2025 neural TTS families.
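For the A/B use case, one simple way to turn per-clip scores into a system-level comparison is a win rate over utterances synthesized by both systems. The sketch below is illustrative: `system_win_rate` is a hypothetical helper, and `score` can be any callable that maps a file path to a float, such as the `scorer.score` method described under Loading.

```python
from typing import Callable, Sequence


def system_win_rate(
    score: Callable[[str], float],
    outputs_a: Sequence[str],
    outputs_b: Sequence[str],
) -> float:
    """Fraction of paired utterances where system A's clip scores higher.

    outputs_a[i] and outputs_b[i] must be the same text rendered by
    system A and system B. Exact ties count as half a win.
    """
    if len(outputs_a) != len(outputs_b) or not outputs_a:
        raise ValueError("need two non-empty lists of equal length")
    wins = 0.0
    for path_a, path_b in zip(outputs_a, outputs_b):
        score_a, score_b = score(path_a), score(path_b)
        if score_a > score_b:
            wins += 1.0
        elif score_a == score_b:
            wins += 0.5
    return wins / len(outputs_a)
```

A win rate near 0.5 means the scorer does not separate the two systems; given the accuracy figures above, small differences are best confirmed with a listening test.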

Out of scope / known weaknesses

  • Not a MOS predictor. The training signal is pairwise preference, not absolute MOS. Output is a logit; calibrate before treating as a quality score.
  • Read-speech vs expressive: the model is 4.5 points weaker on expressive prosody (69.5%) than on regular read-speech (74.0%).
  • Mixed-language code-switching: en -> mixed is near chance (52.5%). Treat scores for code-switched inputs with caution.
  • Trained on openai/whisper-small audio features (16 kHz, mel-spectrogram). Inputs at other sample rates need resampling.
  • Not evaluated on speech enhancement, telecom, or accent-rated naturalness. Use WhiSQA or DNSMOS for those.
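Because the encoder expects 16 kHz input, it can help to check sample rates before scoring a batch. The sketch below uses only Python's standard `wave` module, so it handles PCM WAV files only; `needs_resampling` is a hypothetical helper, and whether the natscore loader resamples on its own is not assumed here.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # Whisper feature extraction assumes 16 kHz audio


def needs_resampling(wav_path: str) -> bool:
    """Return True if a PCM WAV file is not sampled at 16 kHz."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != EXPECTED_SAMPLE_RATE
```

Files flagged by this check can be resampled with any audio tool before being passed to the scorer.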

Loading

The natscore package exposes a one-line load that pulls this checkpoint and the frozen Whisper-small encoder, then returns a ready Scorer:

# pip install git+https://github.com/harrrshall/natscore.git
import torch  # needed only for the dtype= alternative below
import natscore as ns

scorer = ns.load()                       # defaults to "harrrshall/natscore-small-v0"
# Alternatives:
# scorer = ns.load("natscore-small-v0")
# scorer = ns.load(device="cuda", dtype=torch.float16)

# Pointwise score (higher = more natural)
s: float = scorer.score("path/to/tts_output.wav")

# Pairwise comparison (recommended; matches the BT training objective)
pair = scorer.compare("a.wav", "b.wav")
# pair.winner       -> "a" | "b" | "tie"
# pair.score_a, pair.score_b
# pair.margin       = score_a - score_b
# pair.prob_a_wins  = sigmoid(margin)

# Batched
scores: list[float] = scorer.batch_score(["c1.wav", "c2.wav", "c3.wav"])

First call downloads ~290 MB Whisper-small plus this ~5 MB checkpoint; subsequent calls hit the local HF cache. Wall-clock per 10s audio clip: ~120 ms CPU, ~15 ms T4 GPU.

Manual loading (without the natscore package)

If you only want the head and bring your own Whisper encoder:

import torch
from huggingface_hub import hf_hub_download
from natscore.model import NatScoreHead, NatScoreHeadConfig

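ckpt_path = hf_hub_download("harrrshall/natscore-small-v0", "final.pt")
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

cfg = NatScoreHeadConfig(**ckpt["config"]["model"])
head = NatScoreHead(cfg)
# Strip the "module." prefix that DataParallel adds to parameter names.
state = {(k[len("module."):] if k.startswith("module.") else k): v
         for k, v in ckpt["model_state"].items()}
head.load_state_dict(state)
head.eval()  # disable dropout for inference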

Architecture

audio (16 kHz, log-mel)
    |
    | openai/whisper-small encoder (FROZEN, 244M params)
    v
[B, H=13, T, D=768]                # all encoder layer outputs
    |
    | LayerWeightedSum: softmax(alpha) over H            -> [B, T, D]
    v
[B, T, D]
    |
    | AttentionPooler (D -> 256 bottleneck, 1 query)     -> [B, D]
    v
[B, D]
    |
    | ScoreHead MLP (D -> 256 -> 1)                      -> [B]
    v
scalar logit (higher = more natural)

| Stage | Params |
| --- | --- |
| LayerWeightedSum (alpha) | 13 |
| AttentionPooler | ~197K |
| ScoreHead MLP | ~197K |
| Total trainable | ~395K |
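The per-stage counts above are consistent with plain biased linear layers. The arithmetic below is a sketch under an assumed layout, not something read from the repository: the pooler is taken to be a D -> 256 projection followed by a 256 -> 1 attention-scoring layer, and the score head a D -> 256 -> 1 MLP.

```python
D = 768           # Whisper-small hidden size
BOTTLENECK = 256  # width used by both the pooler and the score head
NUM_LAYERS = 13   # encoder hidden states mixed by LayerWeightedSum


def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer with optional bias."""
    return n_in * n_out + (n_out if bias else 0)


layer_weights = NUM_LAYERS  # one scalar alpha per hidden state
# Assumed layout: projection to the bottleneck, then one scoring query.
attention_pooler = linear_params(D, BOTTLENECK) + linear_params(BOTTLENECK, 1)
# Assumed layout: two-layer MLP ending in a scalar logit.
score_head = linear_params(D, BOTTLENECK) + linear_params(BOTTLENECK, 1)
total_trainable = layer_weights + attention_pooler + score_head

print(attention_pooler, score_head, total_trainable)
```

Under these assumptions each block has 197,121 parameters and the total is 394,255, within about 1K of the table's rounded figure.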

Trained with Bradley-Terry pairwise log-loss on (audio_chosen, audio_rejected) pairs.
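For a single pair, the Bradley-Terry log-loss is -log(sigmoid(s_chosen - s_rejected)), where s is the scalar logit from the head. The sketch below shows that formula in plain Python with a numerically stable softplus; `bt_pair_loss` is an illustrative name, and the training code presumably uses a batched tensor version instead.

```python
import math


def bt_pair_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss for one pair: -log(sigmoid(chosen - rejected))."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m) = log(1 + exp(-m)).
    # Branch on the sign of m so that exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log(2) when the two scores tie, approaches zero as the preferred clip's score pulls ahead, and grows roughly linearly when the ranking is wrong.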

Training data

SpeechJudge-Data (Zhang et al., Nov 2025, CC-BY-NC-4.0): 99K human-labeled TTS preference pairs across CosyVoice2, F5-TTS, MaskGCT, Llasa, XTTS-v2, and others. It covers both the regular and expressive subsets, in English (en), Chinese (zh), and en-zh code-switching.

No external supervision beyond the SpeechJudge preference labels. No MOS labels, no synthetic labels, no LLM-rater labels.

Training recipe

| Hyperparameter | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning rate | 1e-3, cosine schedule with 500-step warmup |
| Weight decay | 1e-4 |
| Batch size | 16 pairs per step (effective 32 with DataParallel) |
| Epochs | 5 |
| Total steps | 13,250 |
| Gradient clip | 1.0 |
| Dropout | 0.1 |
| Mixed precision | AMP fp16 |
| Hardware | 2x NVIDIA T4 (Kaggle) |
| Wall-clock | ~4h52m total (with mid-run resume from step 8000) |
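To make the learning-rate row concrete, the sketch below evaluates a cosine schedule with a 500-step warmup over 13,250 steps. The table does not specify the warmup shape or the final learning rate, so linear warmup from zero and decay to zero are assumptions here; config.yaml is the authoritative source.

```python
import math

PEAK_LR = 1e-3
WARMUP_STEPS = 500
TOTAL_STEPS = 13_250


def learning_rate(step: int) -> float:
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Assumed: linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumed: cosine decay from the peak down to 0 at the final step.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 6_875, 13_250):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions the rate peaks at step 500 and is half the peak at step 6,875, the midpoint of the decay phase.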

Full config: config.yaml in this repo. Source code: github.com/harrrshall/natscore.

License

Weights: CC-BY-NC-4.0 (inherited from SpeechJudge-Data). Academic and non-commercial use only. For commercial licensing, contact the SpeechJudge authors first; their license governs derivative model weights.

Code: Apache-2.0 (in the github repo).

This matches how the SpeechJudge authors released their own checkpoints (SpeechJudge-BTRM, SpeechJudge-GRM).

Citation

If you use NatScore in academic work, please cite:

@misc{singh2026natscore,
  title  = {NatScore: A small, preference-supervised naturalness scorer for modern neural TTS},
  author = {Singh, Harshal},
  year   = {2026},
  url    = {https://huggingface.co/harrrshall/natscore-small-v0}
}

Also cite the training data and the SpeechJudge family:

@article{zhang2025speechjudge,
  title  = {SpeechJudge: Preference-Based Evaluation for Modern Neural Text-to-Speech},
  author = {Zhang and others},
  year   = {2025},
  url    = {https://huggingface.co/datasets/RMSnow/SpeechJudge-Data}
}

Files in this repo

| File | Purpose |
| --- | --- |
| final.pt | Trained checkpoint (~4.6 MB). state_dict for NatScoreHead, plus optimizer/scheduler state |
| config.yaml | Full training config (model, data, optimizer, schedule) |
| eval_dev.json | Headline dev[:1000] eval result with per-subset and per-language breakdowns |
| MODEL_LICENSE.md | Full CC-BY-NC-4.0 terms with rationale |
| README.md | This file |

Contact

github.com/harrrshall/natscore for issues, PRs, ablations, and the upcoming larger checkpoint trained on the full ablation grid. Twitter: @HarshalsinghCN.
