MSpoofTTS Discriminator Checkpoints
This repository provides the discriminator checkpoints used in MSpoofTTS: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection.
Paper: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection
Demo: https://danny-nus.github.io/MSpoofTTS.github.io/
This repository only hosts checkpoints; the discriminator architecture definitions are not included. Use these checkpoints together with the official MSpoofTTS codebase.
Checkpoints
| File | Model Type | Segment Length | Scale |
|---|---|---|---|
| checkpoints/segment_len50.ckpt | SegmentTokenDiscriminator | 50 | - |
| checkpoints/segment_len25.ckpt | SegmentTokenDiscriminator | 25 | - |
| checkpoints/segment_len10.ckpt | SegmentTokenDiscriminator | 10 | - |
| checkpoints/strided_seg50_scale10.ckpt | StridedSegmentTokenDiscriminator | 50 | 10 |
| checkpoints/strided_seg50_scale25.ckpt | StridedSegmentTokenDiscriminator | 50 | 25 |
Model Configuration
All discriminators use the following base configuration:
vocab_size = 65536
d_model = 256
nhead = 8
num_layers = 4
dim_feedforward = 1024
dropout = 0.1
The segment-level discriminators use segment_len values of 10, 25, and 50.
The strided discriminators use segment_len=50 with scales 10 and 25.
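The configuration above can be collected into one record per checkpoint. The following is a minimal sketch, not code from the MSpoofTTS codebase: `DiscriminatorConfig` and `CONFIGS` are hypothetical names, and the official discriminator classes may accept these values under different argument names.

```python
from dataclasses import dataclass, asdict
from typing import Dict, Optional


@dataclass(frozen=True)
class DiscriminatorConfig:
    """Hypothetical container for the values listed in this README."""

    # Base configuration shared by all five checkpoints.
    vocab_size: int = 65536
    d_model: int = 256
    nhead: int = 8
    num_layers: int = 4
    dim_feedforward: int = 1024
    dropout: float = 0.1
    # Per-checkpoint values.
    segment_len: int = 50
    scale: Optional[int] = None  # None for the non-strided discriminators


# One entry per checkpoint file, keyed by the file stem.
CONFIGS: Dict[str, DiscriminatorConfig] = {
    "segment_len50": DiscriminatorConfig(segment_len=50),
    "segment_len25": DiscriminatorConfig(segment_len=25),
    "segment_len10": DiscriminatorConfig(segment_len=10),
    "strided_seg50_scale10": DiscriminatorConfig(segment_len=50, scale=10),
    "strided_seg50_scale25": DiscriminatorConfig(segment_len=50, scale=25),
}

if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, asdict(cfg))
```

If the official constructors do take keyword arguments with these names, `asdict(cfg)` could be unpacked into them; that is an assumption to verify against the codebase.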
Usage
Install the Hugging Face Hub package:
pip install -U huggingface_hub
Download a checkpoint:
from huggingface_hub import hf_hub_download
repo_id = "Chanson-0803/MSpoofTTS"
ckpt_path = hf_hub_download(
    repo_id=repo_id,
    filename="checkpoints/segment_len50.ckpt",
    repo_type="model",
)
print(ckpt_path)
Then load the checkpoint using the corresponding discriminator class from the MSpoofTTS codebase:
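import torch
# Import this from the official MSpoofTTS codebase.
# from your_mspoof_code import SegmentTokenDiscriminator
# Instantiate the model first; argument names are assumed to mirror the configuration above.
# model = SegmentTokenDiscriminator(vocab_size=65536, d_model=256, nhead=8, num_layers=4,
#                                   dim_feedforward=1024, dropout=0.1, segment_len=50)
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()  # disable dropout for inference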
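Before calling `load_state_dict`, it can help to confirm that the loaded object has the expected layout. The sketch below assumes only what the snippet above shows, namely that the weights sit under the `"model_state_dict"` key. `describe_checkpoint` is a hypothetical helper, not part of the MSpoofTTS codebase, and it works on any dictionary-like checkpoint.

```python
from typing import Any, Dict, Mapping


def describe_checkpoint(checkpoint: Mapping[str, Any]) -> Dict[str, Any]:
    """Summarize a loaded checkpoint dictionary without needing the model class."""
    if "model_state_dict" not in checkpoint:
        raise KeyError(
            f"'model_state_dict' not found; available keys: {list(checkpoint)}"
        )
    state_dict = checkpoint["model_state_dict"]
    return {
        "top_level_keys": sorted(checkpoint),
        "num_entries": len(state_dict),
        # The first few parameter names hint at which class the weights belong to.
        "first_keys": list(state_dict)[:5],
    }
```

With a real checkpoint, `describe_checkpoint(torch.load(ckpt_path, map_location="cpu"))` would list the stored keys, which makes a mismatch between file and discriminator class easier to spot.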
For hierarchical decoding, use the following checkpoint files:
checkpoint_files = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}
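All five files can be fetched in one pass by looping over this mapping with `hf_hub_download`. This is a minimal sketch: `download_checkpoints` and its `download_fn` parameter are hypothetical names introduced here, and the repository ID is the one used in the download example above.

```python
from typing import Callable, Dict, Optional

REPO_ID = "Chanson-0803/MSpoofTTS"

CHECKPOINT_FILES: Dict[str, str] = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}


def download_checkpoints(
    repo_id: str = REPO_ID,
    files: Optional[Dict[str, str]] = None,
    download_fn: Optional[Callable[..., str]] = None,
) -> Dict[str, str]:
    """Return a mapping from checkpoint name to local file path."""
    if files is None:
        files = CHECKPOINT_FILES
    if download_fn is None:
        # Imported lazily so the helper can be exercised without network access.
        from huggingface_hub import hf_hub_download

        download_fn = hf_hub_download
    return {
        name: download_fn(repo_id=repo_id, filename=filename, repo_type="model")
        for name, filename in files.items()
    }
```

Calling `paths = download_checkpoints()` returns local paths keyed by checkpoint name. `hf_hub_download` caches files locally, so repeated calls reuse the downloaded copies.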
Intended Use
These checkpoints are intended for research on discrete speech synthesis, neural codec language models, inference-time decoding guidance, spoof detection for generated speech tokens, and hierarchical multi-resolution decoding.
Limitations
These checkpoints are designed for the speech-token vocabulary and discriminator architectures used in MSpoofTTS. They may not be directly compatible with other codec tokenizers, vocabulary layouts, or speech language models without adaptation.
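One quick sanity check before scoring tokens from another tokenizer is to verify that every token ID falls inside the discriminator vocabulary (`vocab_size = 65536`). The helper below is a hypothetical sketch. Passing it is necessary but not sufficient for compatibility, since a different codec may assign different meanings to the same IDs.

```python
from typing import Iterable, List

VOCAB_SIZE = 65536  # from the base configuration in this README


def out_of_vocab_positions(
    token_ids: Iterable[int], vocab_size: int = VOCAB_SIZE
) -> List[int]:
    """Return the positions of token IDs outside the range [0, vocab_size)."""
    return [i for i, tok in enumerate(token_ids) if not 0 <= tok < vocab_size]
```

An empty result means the sequence can at least be indexed by the embedding table; a non-empty result points to the offending positions.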
Citation
@article{zhao2026hierarchical,
title={Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection},
author={Zhao, Junchuan and Vu, Minh Duc and Wang, Ye},
journal={arXiv preprint arXiv:2603.05373},
year={2026}
}