MSpoofTTS Discriminator Checkpoints

This repository provides the discriminator checkpoints used in MSpoofTTS: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection.

Paper: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

Demo: https://danny-nus.github.io/MSpoofTTS.github.io/

This repository only hosts checkpoints; the discriminator architecture definitions are not included. Use these checkpoints together with the official MSpoofTTS codebase.

Checkpoints

File	Model Type	Segment Length	Scale
`checkpoints/segment_len50.ckpt`	SegmentTokenDiscriminator	50	-
`checkpoints/segment_len25.ckpt`	SegmentTokenDiscriminator	25	-
`checkpoints/segment_len10.ckpt`	SegmentTokenDiscriminator	10	-
`checkpoints/strided_seg50_scale10.ckpt`	StridedSegmentTokenDiscriminator	50	10
`checkpoints/strided_seg50_scale25.ckpt`	StridedSegmentTokenDiscriminator	50	25

Model Configuration

All discriminators use the following base configuration:

vocab_size = 65536
d_model = 256
nhead = 8
num_layers = 4
dim_feedforward = 1024
dropout = 0.1

The segment-level discriminators use segment_len values of 10, 25, and 50.

The strided discriminators use segment_len=50 with scales 10 and 25.
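The configuration above can be gathered into one dictionary per checkpoint. The following is a minimal sketch, not part of the official codebase: the helper name `config_for_checkpoint` and the `"scale"` key are assumptions made for illustration, and the filename patterns are inferred from the checkpoint table.

```python
import re

# Base hyperparameters copied from the "Model Configuration" list.
BASE_CONFIG = {
    "vocab_size": 65536,
    "d_model": 256,
    "nhead": 8,
    "num_layers": 4,
    "dim_feedforward": 1024,
    "dropout": 0.1,
}

# Filename patterns inferred from the checkpoint table.
_SEGMENT = re.compile(r"segment_len(\d+)\.ckpt$")
_STRIDED = re.compile(r"strided_seg(\d+)_scale(\d+)\.ckpt$")


def config_for_checkpoint(filename):
    """Return (model_type, config) for a checkpoint filename (hypothetical helper)."""
    config = dict(BASE_CONFIG)  # copy so the base dictionary stays untouched
    strided = _STRIDED.search(filename)
    if strided:
        config["segment_len"] = int(strided.group(1))
        config["scale"] = int(strided.group(2))  # key name is an assumption
        return "StridedSegmentTokenDiscriminator", config
    segment = _SEGMENT.search(filename)
    if segment:
        config["segment_len"] = int(segment.group(1))
        return "SegmentTokenDiscriminator", config
    raise ValueError(f"Unrecognized checkpoint name: {filename}")
```

Whether the discriminator classes accept these keys as constructor arguments depends on the official MSpoofTTS codebase, so treat the dictionary as a reference for the values rather than a ready-made argument list.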

Usage

Install the Hugging Face Hub package:

pip install -U huggingface_hub

Download a checkpoint:

from huggingface_hub import hf_hub_download

repo_id = "Chanson-0803/MSpoofTTS"

ckpt_path = hf_hub_download(
    repo_id=repo_id,
    filename="checkpoints/segment_len50.ckpt",
    repo_type="model",
)

print(ckpt_path)

Then load the checkpoint using the corresponding discriminator class from the MSpoofTTS codebase:

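import torch

# Import the discriminator class from the official MSpoofTTS codebase.
# "your_mspoof_code" is a placeholder; replace it with the actual module path.
from your_mspoof_code import SegmentTokenDiscriminator

# Build the model before loading weights. The keyword names below mirror the
# configuration listed above and are an assumption; check the class signature.
model = SegmentTokenDiscriminator(
    vocab_size=65536, d_model=256, nhead=8, num_layers=4,
    dim_feedforward=1024, dropout=0.1, segment_len=50,
)

state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()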

For hierarchical decoding, use the following checkpoint files:

checkpoint_files = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}
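All five files can be fetched in one pass. The sketch below wraps the `hf_hub_download` call shown earlier in a small helper; the name `download_checkpoints` and its `download_fn` parameter are hypothetical and exist only so the loop can be exercised without network access.

```python
CHECKPOINT_FILES = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}


def download_checkpoints(repo_id, checkpoint_files, download_fn=None):
    """Download every listed checkpoint and return {name: local_path}.

    download_fn is a hypothetical hook taking a filename and returning a
    local path. When omitted, hf_hub_download is used.
    """
    if download_fn is None:
        # Imported lazily so the helper can be imported without the package.
        from huggingface_hub import hf_hub_download

        def download_fn(filename):
            return hf_hub_download(
                repo_id=repo_id,
                filename=filename,
                repo_type="model",
            )

    return {name: download_fn(path) for name, path in checkpoint_files.items()}


# Example (requires network access):
# paths = download_checkpoints("Chanson-0803/MSpoofTTS", CHECKPOINT_FILES)
```

The returned dictionary keeps the same keys as `checkpoint_files`, so each local path can be matched to its discriminator by name.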

Intended Use

These checkpoints are intended for research on discrete speech synthesis, neural codec language models, inference-time decoding guidance, spoof detection for generated speech tokens, and hierarchical multi-resolution decoding.

Limitations

These checkpoints are designed for the speech-token vocabulary and discriminator architectures used in MSpoofTTS. They may not be directly compatible with other codec tokenizers, vocabulary layouts, or speech language models without adaptation.
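One quick sanity check before scoring tokens from another tokenizer is to confirm that every token id falls inside the configured vocabulary. The helper below is an illustrative sketch with a hypothetical name. Passing this check is necessary but not sufficient: ids from a different codec can be in range and still carry a different meaning.

```python
VOCAB_SIZE = 65536  # vocab_size from the base configuration


def out_of_vocab_positions(token_ids, vocab_size=VOCAB_SIZE):
    """Return the indices of token ids that fall outside [0, vocab_size)."""
    return [i for i, tok in enumerate(token_ids) if not 0 <= tok < vocab_size]


# An empty list means every id can be looked up in the embedding table.
bad = out_of_vocab_positions([12, 4096, 65535])
if bad:
    raise ValueError(f"Token ids out of range at positions: {bad}")
```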

Citation

@article{zhao2026hierarchical,
  title={Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection},
  author={Zhao, Junchuan and Vu, Minh Duc and Wang, Ye},
  journal={arXiv preprint arXiv:2603.05373},
  year={2026}
}

