koichisaito/soundctm_dit

Model Checkpoints

SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation.

The repository's model_checkpoints directory contains checkpoints for both student and teacher models. Each model is available in three variants:

1. ac_v1_iclr

Training Data: AudioCaps
Conditioning: Uses the last layer of the CLAP text branch.
Details: This variant corresponds to the checkpoint used in the ICLR'25 publication.

2. ac_v2

Training Data: AudioCaps
Conditioning: Uses the second-to-last layer of the CLAP text branch.

3. as_ac_v2

Training Data: AudioSet and AudioCaps
Conditioning: Uses the second-to-last layer of the CLAP text branch.
Additional Information: For training, we use the AudioSet text descriptions provided here.

Auxiliary Checkpoints

The utils_checkpoint directory includes additional checkpoints for auxiliary components, such as the audio compression model.
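As a rough illustration of fetching these files, the sketch below uses `snapshot_download` from the `huggingface_hub` package to pull only the two directories named above. The helper names (`checkpoint_patterns`, `download_checkpoints`) and the `ckpt` target folder are hypothetical choices made for this example and are not part of the repository. The layout inside each directory is not described here, so the patterns simply select everything under them.

```python
from pathlib import Path

# Repository ID as given at the top of this README.
REPO_ID = "koichisaito/soundctm_dit"


def checkpoint_patterns(include_utils: bool = True) -> list[str]:
    """Return glob patterns for the checkpoint directories named in this README.

    Hypothetical helper: it assumes nothing about the files inside each
    directory and selects all of them.
    """
    patterns = ["model_checkpoints/*"]
    if include_utils:
        # Auxiliary components such as the audio compression model.
        patterns.append("utils_checkpoint/*")
    return patterns


def download_checkpoints(local_dir: str = "ckpt", include_utils: bool = True) -> Path:
    """Download the selected directories and return the local root path."""
    # Imported lazily so checkpoint_patterns() works without huggingface_hub.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=checkpoint_patterns(include_utils),
        local_dir=local_dir,  # assumed target folder; change as needed
    )
    return Path(path)
```

Calling `download_checkpoints()` should place the files under `ckpt/`. Since the per-variant layout (ac_v1_iclr, ac_v2, as_ac_v2) is not spelled out in this README, inspect the downloaded tree to locate the student and teacher checkpoints for the variant you need.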

Citation

@inproceedings{saito2025soundctm,
  title={Sound{CTM}: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation},
  author={Koichi Saito and Dongjun Kim and Takashi Shibuya and Chieh-Hsin Lai and Zhi Zhong and Yuhta Takida and Yuki Mitsufuji},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=KrK6zXbjfO}
}