Dasheng-AudioGen GGUF

GGUF-converted weights for mispeech/Dasheng-AudioGen

Dasheng-AudioGen is a unified audio generation model that can jointly synthesize intelligible speech, music, sound effects, and environmental acoustics from text descriptions.

Model Variants

Variant        DiT Model      Total Size  Quality
F16 (default)  dit.gguf       ~5.7GB      Best
Q8             dit_Q8_0.gguf  ~3.8GB      Great
Q4             dit_Q4_0.gguf  ~2.8GB      Good

Total Size counts the DiT file plus the T5 encoder, vocoder, and tokenizer, which are the same for all variants.

Files

File                 Description                   Size
t5_encoder.gguf      T5 text encoder (F32)         1.3GB
dit.gguf             DiT backbone (F16)            4.1GB
dit_Q8_0.gguf        DiT backbone (Q8 quantized)   2.2GB
dit_Q4_0.gguf        DiT backbone (Q4 quantized)   1.2GB
dit-multi.gguf       DiT Multilang (F16)           4.1GB
dit-multi_Q8_0.gguf  DiT Multilang (Q8 quantized)  2.2GB
dit-multi_Q4_0.gguf  DiT Multilang (Q4 quantized)  1.2GB
vocoder.gguf         Vocoder decoder (F16)         332MB
spiece.model         SentencePiece tokenizer       792KB
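
The totals in the Model Variants table can be reproduced from the file sizes listed here. The following is a small sketch of that arithmetic; the `total` helper is hypothetical and exists only for this illustration, and the tokenizer is left out because its ~792KB does not change the rounded result.

```shell
# Shared component sizes in GB, copied from the Files table.
t5=1.3
vocoder=0.332

# total <dit_size_gb>: shared files plus one DiT file, rounded to 0.1 GB.
total() {
    LC_ALL=C awk -v t5="$t5" -v voc="$vocoder" -v dit="$1" \
        'BEGIN { printf "%.1f\n", t5 + voc + dit }'
}

echo "F16: ~$(total 4.1)GB"
echo "Q8:  ~$(total 2.2)GB"
echo "Q4:  ~$(total 1.2)GB"
```

The three lines printed should match the ~5.7GB, ~3.8GB, and ~2.8GB figures quoted for the F16, Q8, and Q4 variants.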

Usage with audiogen.cpp

# Clone and build
git clone https://github.com/audiohacking/audiogen.cpp
cd audiogen.cpp
git submodule update --init --recursive
make metal  # or: make cpu

# Download models (choose one)
make download-models      # F16 (~5.7GB) - best quality
make download-models-q8   # Q8 (~3.8GB) - great quality
make download-models-q4   # Q4 (~2.8GB) - good quality

# Generate audio
./build-metal/dasheng-audiogen \
    models/t5_encoder.gguf models/dit.gguf models/vocoder.gguf models/spiece.model \
    --caption "A dog barking loudly" --output output.wav
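
To run a quantized variant, only the DiT argument should need to change. The sketch below assumes the download targets place the files under models/ with the names given in the Files section; the `dit_file` helper and the `VARIANT` variable are hypothetical and not part of audiogen.cpp.

```shell
# dit_file <variant>: print the DiT filename for f16, q8, or q4
# (filenames taken from the Files section).
dit_file() {
    case "$1" in
        f16) echo "dit.gguf" ;;
        q8)  echo "dit_Q8_0.gguf" ;;
        q4)  echo "dit_Q4_0.gguf" ;;
        *)   echo "unknown variant: $1" >&2; return 1 ;;
    esac
}

VARIANT="${VARIANT:-q8}"
# ASSUMPTION: model files live under models/ after the download step.
DIT="models/$(dit_file "$VARIANT")" || exit 1

if [ -x ./build-metal/dasheng-audiogen ]; then
    ./build-metal/dasheng-audiogen \
        models/t5_encoder.gguf "$DIT" models/vocoder.gguf models/spiece.model \
        --caption "A dog barking loudly" --output output.wav
else
    # Binary not built yet: just report which DiT file would be used.
    echo "would use $DIT"
fi
```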

Prompt Format

The command-line tool supports the same prompt tags as the original model. Each tag is set through its own flag:

Flag       Tag          Description
--caption  <|caption|>  Main audio description (required)
--speech   <|speech|>   Speech characteristics
--asr      <|asr|>      Text to speak
--sfx      <|sfx|>      Sound effects
--music    <|music|>    Music description
--env      <|env|>      Environment/ambience
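
The flag-to-tag mapping above can be pictured with a short sketch. Both helpers are hypothetical, and the way `build_prompt` joins the fields (tag followed directly by its text, in the order given) is an assumption made for illustration; the actual tool may order or delimit the fields differently.

```shell
# tag_for <flag>: print the prompt tag for a CLI flag (pairs from the table).
tag_for() {
    case "$1" in
        --caption|--speech|--asr|--sfx|--music|--env)
            echo "<|${1#--}|>" ;;
        *)  echo "unknown flag: $1" >&2; return 1 ;;
    esac
}

# build_prompt FLAG TEXT [FLAG TEXT ...]
# ASSUMPTION: each field is emitted as tag + text with no separator.
build_prompt() {
    prompt=""
    while [ "$#" -ge 2 ]; do
        tag=$(tag_for "$1") || return 1
        prompt="$prompt$tag$2"
        shift 2
    done
    printf '%s\n' "$prompt"
}

build_prompt --caption "A dog barking loudly" --sfx "heavy rain hitting pavement"
```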

Example: Complex multi-tag prompt

./build-metal/dasheng-audiogen \
    models/t5_encoder.gguf models/dit.gguf models/vocoder.gguf models/spiece.model \
    --caption "A gritty detective narrating" \
    --speech "gritty deep male voice" \
    --asr "The city never sleeps, but it sure knows how to cry." \
    --sfx "heavy rain hitting pavement" \
    --music "melancholic solo saxophone" \
    --env "distant urban ambience" \
    --output noir_detective.wav

Generation Options

Option           Default  Description
--steps N        25       Number of diffusion steps
--duration SECS  10       Audio duration in seconds
--cfg SCALE      3.0      Classifier-free guidance scale
--sway COEF      -1.0     Sway sampling coefficient
--seed N         random   Random seed for reproducibility
--threads N      4        CPU threads
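
Because the seed is random by default, fixing it is the natural way to compare takes of one prompt. The loop below is a sketch that uses only the options listed above, with steps, duration, and guidance scale set to their documented defaults; the `run` helper, the `DRY_RUN` variable, and the output filenames are hypothetical.

```shell
BIN=./build-metal/dasheng-audiogen

# Print commands instead of running them when the binary has not been built.
[ -x "$BIN" ] || DRY_RUN=1

# run CMD [ARGS...]: execute the command, or echo it in dry-run mode.
run() {
    if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}

# Three takes of the same prompt, each reproducible through its seed.
for seed in 1 2 3; do
    run "$BIN" \
        models/t5_encoder.gguf models/dit.gguf models/vocoder.gguf models/spiece.model \
        --caption "A dog barking loudly" \
        --steps 25 --duration 10 --cfg 3.0 \
        --seed "$seed" --output "bark_seed${seed}.wav"
done
```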

Performance

Backend    Time (25 steps)  Hardware
Metal GPU  ~6s              Apple M3 Ultra
CPU        ~163s            Apple M3 Ultra (16 threads)
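
A rough reading of these figures, as a sketch: the speedup is the ratio of the two timings, and the per-step averages assume the whole wall time is attributable to the 25 diffusion steps, which overstates the per-step cost because text encoding and vocoding are included in the totals. The `perf_summary` helper is hypothetical.

```shell
# Timings in seconds and step count, copied from the Performance table.
metal=6
cpu=163
steps=25

# perf_summary: print the Metal-over-CPU speedup and average time per step.
perf_summary() {
    LC_ALL=C awk -v m="$metal" -v c="$cpu" -v n="$steps" 'BEGIN {
        printf "Metal speedup over CPU: ~%.0fx\n", c / m
        printf "Average per step: Metal ~%.2fs, CPU ~%.2fs\n", m / n, c / n
    }'
}

perf_summary
# Metal speedup over CPU: ~27x
# Average per step: Metal ~0.24s, CPU ~6.52s
```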

Original Model

This is a GGUF conversion of mispeech/Dasheng-AudioGen. Please refer to the original model card for more details about the model architecture and training.

License

Apache-2.0, the same as the original model.
