MOSS-SoundEffect-V2.0

MOSS-SoundEffect v2.0 is a text-to-audio model with a Diffusion Transformer (DiT) backbone trained with the Flow Matching objective, paired with a DAC VAE and a Qwen3 text encoder. It generates high-fidelity environmental, urban, creature, and human-action sound effects from natural-language prompts, with controllable duration up to 30 seconds at 48 kHz.

1. Overview

1.1 TTS Family Positioning

Within the MOSS-TTS Family, MOSS-SoundEffect is the dedicated text-to-sound model — the family member that turns natural-language captions into non-speech audio (ambience, urban scenes, creatures, human actions, short music-like clips). v2.0 supersedes the v1 discrete-token autoregressive backbone (MossTTSDelay) with a continuous-latent Diffusion Transformer + Flow Matching design.

1.2 Key Capabilities

  • Broad SFX coverage: natural environments, urban environments, animals & creatures, human actions, and short musical/percussive clips.
  • Long-form generation: stable audio up to 30 seconds per call; the model was trained with a duration tag prepended to each prompt.
  • Bilingual prompts: trained with both English and Chinese captions.

1.3 Released Models

Model                 | Architecture        | DiT Variant | Parameters
MOSS-SoundEffect-V2.0 | DiT + Flow Matching | 1.3B        | 1.3B

Recommended inference hyperparameters

Parameter           | Default | Description
num_inference_steps | 100     | Number of flow-match solver steps.
cfg_scale           | 4.0     | Classifier-free guidance weight.
sigma_shift         | 5.0     | Flow-match scheduler shift applied per call.
seconds             | 10.0    | Output duration in seconds. Up to 30.
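
To give a feel for how num_inference_steps, cfg_scale, and sigma_shift typically interact, here is a minimal sketch of a flow-matching Euler sampler. It is an illustration under assumptions, not the pipeline's actual scheduler: the shift formula sigma' = s * sigma / (1 + (s - 1) * sigma) is the one commonly used by flow-matching models, the guidance rule is standard classifier-free guidance, and toy_velocity stands in for the DiT. All function names here are hypothetical.

```python
import numpy as np

def shifted_sigmas(num_inference_steps, sigma_shift):
    """Noise levels from 1 (pure noise) to 0 (clean), warped by the shift.

    Assumed formula: sigma' = s * sigma / (1 + (s - 1) * sigma).
    A shift above 1 spends more of the steps at high noise levels.
    """
    sigmas = np.linspace(1.0, 0.0, num_inference_steps + 1)
    return sigma_shift * sigmas / (1.0 + (sigma_shift - 1.0) * sigmas)

def cfg_combine(v_uncond, v_cond, cfg_scale):
    """Standard classifier-free guidance on the predicted velocity."""
    return v_uncond + cfg_scale * (v_cond - v_uncond)

def toy_velocity(x, sigma, target):
    """Stand-in for the DiT (hypothetical): the exact velocity of the
    linear path x = (1 - sigma) * target + sigma * noise."""
    return (x - target) / sigma

def euler_sample(x, target_cond, target_uncond,
                 num_inference_steps=100, cfg_scale=4.0, sigma_shift=5.0):
    """Integrate from sigma = 1 down to sigma = 0 with Euler steps."""
    sigmas = shifted_sigmas(num_inference_steps, sigma_shift)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        v_cond = toy_velocity(x, s_cur, target_cond)
        v_uncond = toy_velocity(x, s_cur, target_uncond)
        v = cfg_combine(v_uncond, v_cond, cfg_scale)
        x = x + (s_next - s_cur) * v
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)
target_cond = np.array([1.0, 2.0, 3.0, 4.0])
target_uncond = np.zeros(4)

# cfg_scale = 1 recovers the conditional target; larger values push past it.
plain = euler_sample(noise, target_cond, target_uncond, cfg_scale=1.0)
guided = euler_sample(noise, target_cond, target_uncond, cfg_scale=4.0)
```

In this toy setting, a cfg_scale of 1.0 lands on the conditional target, while 4.0 extrapolates away from the unconditional one. This is the usual trade-off between prompt adherence and naturalness.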

2. Quick Start

Environment Setup

We recommend a clean, isolated Python 3.12 environment to avoid dependency conflicts with the top-level MOSS-TTS environment.

conda create -n moss-soundeffect-v2 python=3.12 -y
conda activate moss-soundeffect-v2

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS/moss_soundeffect_v2
pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
    -e ".[torch-cu128,finetune]"

For a minimal inference-only install (still ships the Gradio demo; skips the fine-tuning extras):

pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
    -e ".[torch-cu128]"

Basic Usage

import torch
from moss_soundeffect_v2 import MossSoundEffectPipeline

pipe = MossSoundEffectPipeline.from_pretrained(
    "OpenMOSS-Team/MOSS-SoundEffect-v2.0",   # this repo, or a local dir
    torch_dtype=torch.bfloat16,
    device="cuda",
)

audio = pipe(
    prompt="A dog barking loudly in a park.",
    seconds=10,
    num_inference_steps=100,
    cfg_scale=4.0,
)                                            # (B, C, T) waveform tensor
pipe.save_audio(audio, "out.wav")
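
As a quick sanity check on the returned (B, C, T) tensor, the expected length of the time axis can be estimated from the requested duration. The sketch below assumes the output is sampled at the 48 kHz stated above and that T equals seconds times the sample rate exactly. The source does not confirm this, and the pipeline may pad or trim. expected_num_samples is a hypothetical helper, not part of the package.

```python
SAMPLE_RATE_HZ = 48_000   # output sample rate stated in the model description
MAX_SECONDS = 30.0        # documented upper bound on duration per call

def expected_num_samples(seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Estimate T for a clip of the given duration (assumption: no padding)."""
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError(f"seconds must be in (0, {MAX_SECONDS}], got {seconds}")
    return int(round(seconds * sample_rate_hz))

# For the 10-second example above:
num_samples = expected_num_samples(10)
```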

The underlying DiT is wrapped with torch.compile + Triton CUDA Graph. The first call may take a few minutes to compile. If you hit TorchDynamo / Triton errors, set TORCHDYNAMO_DISABLE=1 before launching Python.
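
If setting the variable in the shell is inconvenient, one option is to set it at the very top of the script, before torch is imported. This is a minimal sketch. It relies on TorchDynamo reading TORCHDYNAMO_DISABLE when torch is first imported, so the assignment has to come first.

```python
import os

# Set the switch before torch is imported so TorchDynamo picks it up.
os.environ["TORCHDYNAMO_DISABLE"] = "1"

# Only now import torch and build the pipeline as in Basic Usage:
# import torch
# from moss_soundeffect_v2 import MossSoundEffectPipeline
```

With compilation disabled the model runs in eager mode, so the first call skips the compile delay, but each step is likely to be slower.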

For a Gradio demo and fine-tuning recipes, see the GitHub README.
