MOSS-SoundEffect-V2.0

MOSS-SoundEffect v2.0 is a text-to-audio model with a Diffusion Transformer (DiT) backbone trained with the Flow Matching objective, paired with a DAC VAE and a Qwen3 text encoder. It generates high-fidelity environmental, urban, creature, and human-action sound effects from natural-language prompts, with controllable duration up to 30 seconds at 48 kHz.

1. Overview

1.1 TTS Family Positioning

Within the MOSS-TTS Family, MOSS-SoundEffect is the dedicated text-to-sound model: the family member that turns natural-language captions into non-speech audio (ambience, urban scenes, creatures, human actions, short music-like clips). v2.0 replaces the v1 discrete-token autoregressive backbone (MossTTSDelay) with a continuous-latent Diffusion Transformer + Flow Matching design.

1.2 Key Capabilities

Broad SFX coverage: natural environments, urban environments, animals & creatures, human actions, and short musical/percussive clips.
Long-form generation: stable audio up to 30 seconds per call. The model was trained with a duration tag prepended to each prompt, which is what makes the output length controllable.
Bilingual prompts: trained with both English and Chinese captions.

1.3 Released Models

Model	Architecture	DiT Variant	Parameters
MOSS-SoundEffect-V2.0	DiT + Flow Matching	`1.3B`	1.3B

Recommended inference hyperparameters

Parameter	Default	Description
`num_inference_steps`	100	Number of flow-match solver steps.
`cfg_scale`	4.0	Classifier-free guidance weight.
`sigma_shift`	5.0	Flow-match scheduler shift applied per call.
`seconds`	10.0	Output duration in seconds (maximum 30).
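
To make the roles of `num_inference_steps`, `cfg_scale`, and `sigma_shift` more concrete, here is a minimal pure-Python sketch of an Euler flow-matching sampler with classifier-free guidance. It is an illustration only, not the pipeline's actual code: the shift formula is the one commonly used by flow-matching schedulers and is assumed (not confirmed) to match this model, and `shifted_sigmas`, `guided_velocity`, `euler_sample`, and the toy `velocity_fn` are hypothetical names.

```python
def shifted_sigmas(num_steps: int, shift: float) -> list[float]:
    """Noise levels from 1.0 (pure noise) to 0.0 (clean), warped by `shift`.

    Assumption: sigma' = shift * s / (1 + (shift - 1) * s), the warp commonly
    used by flow-matching schedulers. shift=1.0 leaves the grid uniform.
    """
    sigmas = []
    for i in range(num_steps + 1):
        s = 1.0 - i / num_steps  # uniform grid from 1 down to 0
        sigmas.append(shift * s / (1.0 + (shift - 1.0) * s))
    return sigmas


def guided_velocity(v_cond, v_uncond, cfg_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, scaled by cfg_scale."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x, velocity_fn, num_steps=100, cfg_scale=4.0, shift=5.0):
    """Integrate dx/dsigma = v from sigma=1 to sigma=0 with Euler steps.

    velocity_fn(x, sigma, conditional) is a hypothetical stand-in for the DiT.
    """
    sigmas = shifted_sigmas(num_steps, shift)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        v_c = velocity_fn(x, s_cur, True)    # prompt-conditioned prediction
        v_u = velocity_fn(x, s_cur, False)   # prompt-free prediction
        v = guided_velocity(v_c, v_u, cfg_scale)
        x = [xi + (s_next - s_cur) * vi for xi, vi in zip(x, v)]
    return x


# Toy check: an exact velocity field that points at a fixed target.
target = [1.0, -2.0]

def toy_velocity(x, sigma, conditional):
    return [(xi - ti) / sigma for xi, ti in zip(x, target)]

sample = euler_sample([0.3, 0.7], toy_velocity, num_steps=10, cfg_scale=1.0)
```

Under this assumed warp, `sigma_shift=5.0` maps the midpoint of the uniform grid (0.5) to about 0.83, so more of the solver steps are spent at high noise levels. Each step calls the model twice (with and without the prompt), so guided sampling costs roughly two forward passes per step.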

2. Quick Start

Environment Setup

We recommend a clean, isolated Python 3.12 environment to avoid dependency conflicts with the top-level MOSS-TTS environment.

conda create -n moss-soundeffect-v2 python=3.12 -y
conda activate moss-soundeffect-v2

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS/moss_soundeffect_v2
pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
    -e ".[torch-cu128,finetune]"

For a minimal inference-only install (still ships the Gradio demo; skips the fine-tuning extras):

pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
    -e ".[torch-cu128]"

Basic Usage

import torch
from moss_soundeffect_v2 import MossSoundEffectPipeline

pipe = MossSoundEffectPipeline.from_pretrained(
    "OpenMOSS-Team/MOSS-SoundEffect-v2.0",   # this repo, or a local dir
    torch_dtype=torch.bfloat16,
    device="cuda",
)

audio = pipe(
    prompt="A dog barking loudly in a park.",
    seconds=10,
    num_inference_steps=100,
    cfg_scale=4.0,
)                                            # (B, C, T) waveform tensor
pipe.save_audio(audio, "out.wav")
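
As a quick sanity check on the returned `(B, C, T)` tensor, the expected length of the time axis can be estimated from the requested duration and the 48 kHz sample rate. The helper below is a hypothetical sketch, not part of `moss_soundeffect_v2`, and the actual `T` may differ slightly if the VAE pads or trims to a whole number of latent frames.

```python
SAMPLE_RATE_HZ = 48_000  # output sample rate stated in the model card
MAX_SECONDS = 30.0       # maximum duration stated in the model card


def expected_num_samples(seconds: float, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Approximate length of the time axis T for a requested duration."""
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError(f"seconds must be in (0, {MAX_SECONDS}], got {seconds}")
    return round(seconds * sample_rate)


# A 10-second request corresponds to roughly this many samples per channel.
n_samples = expected_num_samples(10)
```

Comparing `audio.shape[-1]` against this estimate is a cheap way to confirm that the `seconds` argument took effect.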

The underlying DiT is wrapped with torch.compile + Triton CUDA Graph. The first call may take a few minutes to compile. If you hit TorchDynamo / Triton errors, set TORCHDYNAMO_DISABLE=1 before launching Python.
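
If exporting the variable in the shell is inconvenient (for example in a notebook), one possible workaround is to set it from Python before `torch` is imported. This is a sketch resting on the assumption that the variable is read when `torch` is first imported; setting it before launching Python, as described above, remains the safer route. The helper name `disable_torch_compile` is hypothetical.

```python
import os
import sys


def disable_torch_compile() -> bool:
    """Set TORCHDYNAMO_DISABLE=1 for this process.

    Returns True if torch had not been imported yet, which is the case
    where the setting is expected to take effect.
    """
    applied_early = "torch" not in sys.modules
    os.environ["TORCHDYNAMO_DISABLE"] = "1"
    return applied_early


applied_early = disable_torch_compile()
if not applied_early:
    print("torch was already imported; restart Python with TORCHDYNAMO_DISABLE=1 set")

# Import torch and moss_soundeffect_v2 only after this point.
```

With compilation disabled, the first call should skip the compile delay, though later calls will probably run slower than the compiled path.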

For a Gradio demo and fine-tuning recipes, see the GitHub README.


