MOSS-SoundEffect-V2.0
MOSS-SoundEffect v2.0 is a text-to-audio model with a Diffusion Transformer (DiT) backbone trained with the Flow Matching objective, paired with a DAC VAE and a Qwen3 text encoder. It generates high-fidelity environmental, urban, creature, and human-action sound effects from natural-language prompts, with controllable duration up to 30 seconds at 48 kHz.
1. Overview
1.1 TTS Family Positioning
Within the MOSS-TTS Family, MOSS-SoundEffect is the dedicated text-to-sound model: the family member that turns natural-language captions into non-speech audio (ambience, urban scenes, creatures, human actions, short music-like clips). v2.0 replaces the v1 discrete-token autoregressive backbone (MossTTSDelay) with a continuous-latent Diffusion Transformer + Flow Matching design.
1.2 Key Capabilities
- Broad SFX coverage: natural environments, urban environments, animals & creatures, human actions, and short musical/percussive clips.
- Long-form generation: stable audio up to 30 seconds per call; the model was trained with a duration tag prepended to each prompt.
- Bilingual prompts: trained with both English and Chinese captions.
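To make the duration and sample-rate figures concrete, the sketch below computes how many samples per channel a clip of a given length holds at 48 kHz, and rejects durations outside the documented 30-second limit. The helper name `num_samples` and the choice to raise an error rather than clamp are illustrative assumptions; neither is part of the released package.

```python
SAMPLE_RATE_HZ = 48_000   # output sample rate stated in the model description
MAX_SECONDS = 30.0        # documented upper bound per call


def num_samples(seconds: float) -> int:
    """Return samples per channel for a clip of `seconds` at 48 kHz.

    Hypothetical helper for illustration; not part of moss_soundeffect_v2.
    """
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError(f"seconds must be in (0, {MAX_SECONDS}], got {seconds}")
    return round(seconds * SAMPLE_RATE_HZ)


print(num_samples(10))   # 480000
print(num_samples(30))   # 1440000
```

A 30-second request therefore corresponds to 1,440,000 samples per channel in the decoded waveform.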
1.3 Released Models
| Model | Architecture | DiT Variant | Parameters |
|---|---|---|---|
| MOSS-SoundEffect-V2.0 | DiT + Flow Matching | 1.3B | 1.3B |
Recommended inference hyperparameters
| Parameter | Default | Description |
|---|---|---|
| `num_inference_steps` | 100 | Number of flow-match solver steps. |
| `cfg_scale` | 4.0 | Classifier-free guidance weight. |
| `sigma_shift` | 5.0 | Flow-match scheduler shift applied per call. |
| `seconds` | 10.0 | Output duration in seconds, up to 30. |
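The following sketch illustrates how the three sampling parameters typically interact in a flow-matching sampler. It is not the model's actual scheduler. It assumes the time-shift rule `sigma' = shift * sigma / (1 + (shift - 1) * sigma)` that is common in flow-match Euler schedulers, the standard classifier-free guidance combination, and a toy one-dimensional velocity function standing in for the DiT. All function names are hypothetical.

```python
def shifted_sigmas(num_steps: int, shift: float) -> list[float]:
    """Noise levels from 1.0 (pure noise) down to 0.0 (clean), after shifting.

    Assumed rule: sigma' = shift * sigma / (1 + (shift - 1) * sigma).
    A shift above 1 keeps more of the steps at high noise levels.
    """
    linear = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in linear]


def guided_velocity(v_cond: float, v_uncond: float, cfg_scale: float) -> float:
    """Classifier-free guidance: extrapolate from unconditional toward conditional."""
    return v_uncond + cfg_scale * (v_cond - v_uncond)


def euler_sample(x, velocity_fn, num_steps, shift, cfg_scale):
    """Integrate dx/dsigma = v from sigma = 1 to sigma = 0 with Euler steps."""
    sigmas = shifted_sigmas(num_steps, shift)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        v = guided_velocity(
            velocity_fn(x, s_cur, True),    # prompt-conditioned prediction
            velocity_fn(x, s_cur, False),   # unconditional prediction
            cfg_scale,
        )
        x = x + (s_next - s_cur) * v
    return x


def toy_velocity(x, sigma, conditional):
    """Toy stand-in for the DiT: straight-line flow toward a fixed target."""
    target = 2.0 if conditional else 0.0
    return (x - target) / sigma


# With cfg_scale=1.0 the guided velocity equals the conditional one.
result = euler_sample(0.5, toy_velocity, num_steps=100, shift=5.0, cfg_scale=1.0)
print(result)
```

With `shift=5.0`, the midpoint of a linear schedule (0.5) maps to about 0.83, so the sampler spends proportionally more steps near the noisy end. Raising `cfg_scale` above 1.0 pushes each step further along the difference between the conditional and unconditional predictions.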
2. Quick Start
Environment Setup
We recommend a clean, isolated Python 3.12 environment to avoid dependency conflicts with the top-level MOSS-TTS environment.
conda create -n moss-soundeffect-v2 python=3.12 -y
conda activate moss-soundeffect-v2
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS/moss_soundeffect_v2
pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
    -e ".[torch-cu128,finetune]"
For a minimal inference-only install (still ships the Gradio demo; skips the fine-tuning extras):
pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
    -e ".[torch-cu128]"
Basic Usage
import torch
from moss_soundeffect_v2 import MossSoundEffectPipeline
pipe = MossSoundEffectPipeline.from_pretrained(
    "OpenMOSS-Team/MOSS-SoundEffect-v2.0",  # this repo, or a local dir
    torch_dtype=torch.bfloat16,
    device="cuda",
)
audio = pipe(
    prompt="A dog barking loudly in a park.",
    seconds=10,
    num_inference_steps=100,
    cfg_scale=4.0,
)  # (B, C, T) waveform tensor
pipe.save_audio(audio, "out.wav")
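If you want to post-process a result before saving it, a single waveform from the batch can also be written with the Python standard library alone. The sketch below does not reproduce how `pipe.save_audio` works. It assumes one item of the `(B, C, T)` tensor has already been converted to nested Python lists of floats in [-1, 1] (for example with `audio[0].float().cpu().tolist()`), and that the sample rate is the 48 kHz stated in the model description. The function name `write_wav_16bit` is hypothetical.

```python
import struct
import wave


def write_wav_16bit(channels, path, sample_rate=48_000):
    """Write `channels` (C lists of T floats in [-1, 1]) as 16-bit PCM WAV."""
    num_frames = len(channels[0])
    if any(len(ch) != num_frames for ch in channels):
        raise ValueError("all channels must have the same length")

    frames = bytearray()
    for t in range(num_frames):
        for ch in channels:  # interleave samples frame by frame
            sample = max(-1.0, min(1.0, ch[t]))  # clamp to avoid integer wrap-around
            frames += struct.pack("<h", int(round(sample * 32767)))

    with wave.open(path, "wb") as wf:
        wf.setnchannels(len(channels))
        wf.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
```

The per-sample loop keeps the example dependency-free but is slow for 30-second clips; for regular use, prefer `pipe.save_audio` or an array-based audio library.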
The underlying DiT is wrapped with `torch.compile` + Triton CUDA Graph. The first call may take a few minutes to compile. If you hit TorchDynamo / Triton errors, set `TORCHDYNAMO_DISABLE=1` before launching Python.
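The same switch can also be set from inside a script, provided this happens before `torch` is imported for the first time in the process. This is a minimal sketch that assumes PyTorch reads `TORCHDYNAMO_DISABLE` from the process environment at import time; exporting the variable in the shell before launching Python remains the safest route.

```python
import os

# Must run before the first `import torch` anywhere in the process.
os.environ["TORCHDYNAMO_DISABLE"] = "1"

# Imports that pull in torch go below this line, for example:
# from moss_soundeffect_v2 import MossSoundEffectPipeline
```

Disabling TorchDynamo trades the compile-time speedup for eager execution, so expect slower generation per call.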
For a Gradio demo and fine-tuning recipes, see the GitHub README.