MSpoofTTS Discriminator Checkpoints

This repository provides the discriminator checkpoints used in MSpoofTTS: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection.

Paper: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

Demo: https://danny-nus.github.io/MSpoofTTS.github.io/

This repository only hosts checkpoints; the discriminator architecture definitions are not included here. Use these checkpoints together with the official MSpoofTTS codebase.

Checkpoints

| File | Model Type | Segment Length | Scale |
| --- | --- | --- | --- |
| checkpoints/segment_len50.ckpt | SegmentTokenDiscriminator | 50 | - |
| checkpoints/segment_len25.ckpt | SegmentTokenDiscriminator | 25 | - |
| checkpoints/segment_len10.ckpt | SegmentTokenDiscriminator | 10 | - |
| checkpoints/strided_seg50_scale10.ckpt | StridedSegmentTokenDiscriminator | 50 | 10 |
| checkpoints/strided_seg50_scale25.ckpt | StridedSegmentTokenDiscriminator | 50 | 25 |

Model Configuration

All discriminators use the following base configuration:

vocab_size = 65536
d_model = 256
nhead = 8
num_layers = 4
dim_feedforward = 1024
dropout = 0.1

The segment-level discriminators use segment_len values of 10, 25, and 50.

The strided discriminators use segment_len=50 with scales 10 and 25.
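The base configuration and the per-checkpoint settings can be kept in one place. The following is a minimal plain-Python sketch; the names `BASE_CONFIG`, `CHECKPOINT_SETTINGS`, and `build_config` are hypothetical and not part of the MSpoofTTS codebase, and the actual constructor arguments may be named differently.

```python
# Base hyperparameters shared by all discriminators (values from this model card).
BASE_CONFIG = {
    "vocab_size": 65536,
    "d_model": 256,
    "nhead": 8,
    "num_layers": 4,
    "dim_feedforward": 1024,
    "dropout": 0.1,
}

# Per-checkpoint settings; "scale" is None for the non-strided discriminators.
CHECKPOINT_SETTINGS = {
    "segment_len50": {"model_type": "SegmentTokenDiscriminator", "segment_len": 50, "scale": None},
    "segment_len25": {"model_type": "SegmentTokenDiscriminator", "segment_len": 25, "scale": None},
    "segment_len10": {"model_type": "SegmentTokenDiscriminator", "segment_len": 10, "scale": None},
    "strided_seg50_scale10": {"model_type": "StridedSegmentTokenDiscriminator", "segment_len": 50, "scale": 10},
    "strided_seg50_scale25": {"model_type": "StridedSegmentTokenDiscriminator", "segment_len": 50, "scale": 25},
}


def build_config(name):
    """Return (model_type, merged config) for one checkpoint (hypothetical helper)."""
    settings = CHECKPOINT_SETTINGS[name]
    config = dict(BASE_CONFIG)  # copy so the shared base stays unchanged
    config["segment_len"] = settings["segment_len"]
    if settings["scale"] is not None:
        config["scale"] = settings["scale"]
    return settings["model_type"], config


if __name__ == "__main__":
    for name in CHECKPOINT_SETTINGS:
        print(name, build_config(name))
```

Keeping the shared values in a single dictionary makes it harder for the five discriminators to drift apart when they are instantiated side by side.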

Usage

Install the Hugging Face Hub package:

pip install -U huggingface_hub

Download a checkpoint:

from huggingface_hub import hf_hub_download

repo_id = "Chanson-0803/MSpoofTTS"

ckpt_path = hf_hub_download(
    repo_id=repo_id,
    filename="checkpoints/segment_len50.ckpt",
    repo_type="model",
)

print(ckpt_path)

Then load the checkpoint using the corresponding discriminator class from the MSpoofTTS codebase:

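import torch

# Import the discriminator class from the official MSpoofTTS codebase
# (the module name below is a placeholder).
# from your_mspoof_code import SegmentTokenDiscriminator

# Instantiate the matching class before loading weights. The keyword names
# are assumed to mirror the configuration above; check the class definition
# in the codebase for the actual constructor signature.
# model = SegmentTokenDiscriminator(
#     vocab_size=65536, d_model=256, nhead=8, num_layers=4,
#     dim_feedforward=1024, dropout=0.1, segment_len=50,
# )

state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()  # disable dropout for inference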
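The loading step above reads the weights from the `model_state_dict` key. As a small defensive sketch, the hypothetical helper below returns that entry when it is present and otherwise treats the loaded object as a bare state dict. The fallback is an assumption for illustration; the checkpoints in this repository are documented only as using `model_state_dict`.

```python
def extract_state_dict(checkpoint, key="model_state_dict"):
    """Return the weights mapping from a loaded checkpoint object.

    If `checkpoint` is a dict containing `key`, that entry is returned.
    Otherwise the object is assumed to already be a bare state dict.
    """
    if isinstance(checkpoint, dict) and key in checkpoint:
        return checkpoint[key]
    return checkpoint
```

With PyTorch this would be used as `model.load_state_dict(extract_state_dict(torch.load(ckpt_path, map_location="cpu")))`.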

For hierarchical decoding, use the following checkpoint files:

checkpoint_files = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}
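All five files can be fetched in one pass with the same `hf_hub_download` call shown earlier. The sketch below assumes a hypothetical helper named `download_checkpoints`; its `download_fn` parameter is an illustrative addition that lets the function be exercised without network access.

```python
def download_checkpoints(checkpoint_files, repo_id="Chanson-0803/MSpoofTTS", download_fn=None):
    """Download every file in `checkpoint_files` and return {name: local_path}."""
    if download_fn is None:
        # Imported lazily so the helper can be tested without huggingface_hub.
        from huggingface_hub import hf_hub_download
        download_fn = hf_hub_download
    return {
        name: download_fn(repo_id=repo_id, filename=filename, repo_type="model")
        for name, filename in checkpoint_files.items()
    }
```

Calling `download_checkpoints(checkpoint_files)` with the dictionary above returns a mapping from each checkpoint name to its local cache path.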

Intended Use

These checkpoints are intended for research on discrete speech synthesis, neural codec language models, inference-time decoding guidance, spoof detection for generated speech tokens, and hierarchical multi-resolution decoding.

Limitations

These checkpoints are designed for the speech-token vocabulary and discriminator architectures used in MSpoofTTS. They may not be directly compatible with other codec tokenizers, vocabulary layouts, or speech language models without adaptation.
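One cheap sanity check before scoring tokens from another tokenizer is to confirm that every ID falls inside the discriminator vocabulary. The sketch below uses a hypothetical helper, `find_out_of_vocab`, with `vocab_size = 65536` taken from the configuration in this model card. Passing this check does not imply that the vocabulary layout is compatible; it only rules out IDs that are out of range.

```python
VOCAB_SIZE = 65536  # vocab_size from the base configuration in this model card


def find_out_of_vocab(token_ids, vocab_size=VOCAB_SIZE):
    """Return the sorted unique token IDs that fall outside [0, vocab_size)."""
    return sorted({t for t in token_ids if not 0 <= t < vocab_size})
```

An empty result means all IDs are in range; any returned IDs point to a tokenizer or vocabulary mismatch that needs adaptation first.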

Citation

@article{zhao2026hierarchical,
  title={Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection},
  author={Zhao, Junchuan and Vu, Minh Duc and Wang, Ye},
  journal={arXiv preprint arXiv:2603.05373},
  year={2026}
}