Dasheng-AudioGen GGUF

GGUF-converted weights for mispeech/Dasheng-AudioGen, packaged for use with audiogen.cpp.

Dasheng-AudioGen is a unified audio generation model that can jointly synthesize intelligible speech, music, sound effects, and environmental acoustics from text descriptions.

Model Variants

Variant	DiT Model	Total Size	Quality
F16 (default)	`dit.gguf`	~5.7GB	Best
Q8	`dit_Q8_0.gguf`	~3.8GB	Great
Q4	`dit_Q4_0.gguf`	~2.8GB	Good

Total size is the DiT file plus the shared T5 encoder, vocoder, and tokenizer, which are identical across all variants.

Files

File	Description	Size
`t5_encoder.gguf`	T5 text encoder (F32)	1.3GB
`dit.gguf`	DiT backbone (F16)	4.1GB
`dit_Q8_0.gguf`	DiT backbone (Q8 quantized)	2.2GB
`dit_Q4_0.gguf`	DiT backbone (Q4 quantized)	1.2GB
`dit-multi.gguf`	Multilingual DiT backbone (F16)	4.1GB
`dit-multi_Q8_0.gguf`	Multilingual DiT backbone (Q8 quantized)	2.2GB
`dit-multi_Q4_0.gguf`	Multilingual DiT backbone (Q4 quantized)	1.2GB
`vocoder.gguf`	Vocoder decoder (F16)	332MB
`spiece.model`	SentencePiece tokenizer	792KB
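
The totals in the Model Variants table follow from these file sizes: one DiT file plus the shared T5 encoder, vocoder, and tokenizer. A small sketch of that arithmetic, using only the sizes listed above (the `variant_total_gb` helper name is illustrative and not part of audiogen.cpp):

```shell
#!/bin/sh
# Sum the shared files plus one DiT file, in GB (sizes taken from the Files table).
variant_total_gb() {
    # $1: size of the chosen DiT file in GB
    # Shared files: t5_encoder 1.3GB + vocoder 0.332GB + spiece.model 0.000792GB
    LC_ALL=C awk -v dit="$1" 'BEGIN { printf "%.1f\n", 1.3 + 0.332 + 0.000792 + dit }'
}

echo "F16: $(variant_total_gb 4.1) GB"
echo "Q8:  $(variant_total_gb 2.2) GB"
echo "Q4:  $(variant_total_gb 1.2) GB"
```

The results round to 5.7, 3.8, and 2.8 GB, matching the approximate totals quoted per variant.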

Usage with audiogen.cpp

# Clone and build
git clone https://github.com/audiohacking/audiogen.cpp
cd audiogen.cpp
git submodule update --init --recursive
make metal  # or: make cpu

# Download models (choose one)
make download-models      # F16 (~5.7GB) - best quality
make download-models-q8   # Q8 (~3.8GB) - great quality
make download-models-q4   # Q4 (~2.8GB) - good quality

# Generate audio
./build-metal/dasheng-audiogen \
    models/t5_encoder.gguf models/dit.gguf models/vocoder.gguf models/spiece.model \
    --caption "A dog barking loudly" --output output.wav
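
The command above uses the F16 DiT. To run a quantized variant, the DiT argument would presumably change to the matching file from the Files table. The sketch below maps a variant name to its file and prints the resulting command without running it. It assumes the quantized files sit in `models/` next to the others, which may not match what the download targets actually do, and `dit_file_for` and `print_generate_cmd` are illustrative helper names, not part of audiogen.cpp:

```shell
#!/bin/sh
# Map a variant name to its DiT filename (filenames from the Files table).
dit_file_for() {
    case "$1" in
        f16) echo "dit.gguf" ;;
        q8)  echo "dit_Q8_0.gguf" ;;
        q4)  echo "dit_Q4_0.gguf" ;;
        *)   echo "unknown variant: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the generate command for a variant and caption.
# Assumption: all model files live under models/, as in the F16 example.
print_generate_cmd() {
    dit="$(dit_file_for "$1")" || return 1
    echo "./build-metal/dasheng-audiogen models/t5_encoder.gguf models/$dit models/vocoder.gguf models/spiece.model --caption \"$2\" --output output.wav"
}

print_generate_cmd q8 "A dog barking loudly"
```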

Prompt Format

The CLI supports the same prompt tags as the original model. Each flag maps to one tag:

Flag	Tag	Description
`--caption`	`<|caption|>`	Main audio description (required)
`--speech`	`<|speech|>`	Speech characteristics
`--asr`	`<|asr|>`	Text to speak
`--sfx`	`<|sfx|>`	Sound effects
`--music`	`<|music|>`	Music description
`--env`	`<|env|>`	Environment/ambience

Example: Complex multi-tag prompt

./build-metal/dasheng-audiogen \
    models/t5_encoder.gguf models/dit.gguf models/vocoder.gguf models/spiece.model \
    --caption "A gritty detective narrating" \
    --speech "gritty deep male voice" \
    --asr "The city never sleeps, but it sure knows how to cry." \
    --sfx "heavy rain hitting pavement" \
    --music "melancholic solo saxophone" \
    --env "distant urban ambience" \
    --output noir_detective.wav
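
Only `--caption` is marked as required, so a wrapper can pass the other tags only when they are set. The following is a rough sketch of that idea, not something shipped with audiogen.cpp: the variable names (`CAPTION`, `SPEECH`, `ASR`, `SFX`, `MUSIC`, `ENV_DESC`, `OUTPUT`) and the `AUDIOGEN_BIN` override are hypothetical and introduced here for illustration.

```shell
#!/bin/sh
# Run dasheng-audiogen, adding a prompt flag only when its variable is non-empty.
# AUDIOGEN_BIN is a hypothetical override; setting it to "echo" gives a dry run.
run_audiogen() {
    if [ -z "${CAPTION:-}" ]; then
        echo "error: CAPTION is required" >&2
        return 1
    fi
    # Start with the model files and the required caption.
    set -- models/t5_encoder.gguf models/dit.gguf models/vocoder.gguf models/spiece.model \
        --caption "$CAPTION"
    # Append each optional tag only if a value was provided.
    if [ -n "${SPEECH:-}" ];   then set -- "$@" --speech "$SPEECH"; fi
    if [ -n "${ASR:-}" ];      then set -- "$@" --asr "$ASR"; fi
    if [ -n "${SFX:-}" ];      then set -- "$@" --sfx "$SFX"; fi
    if [ -n "${MUSIC:-}" ];    then set -- "$@" --music "$MUSIC"; fi
    if [ -n "${ENV_DESC:-}" ]; then set -- "$@" --env "$ENV_DESC"; fi
    set -- "$@" --output "${OUTPUT:-output.wav}"
    "${AUDIOGEN_BIN:-./build-metal/dasheng-audiogen}" "$@"
}

# Dry run in a subshell: echo stands in for the binary to show the assembled arguments.
(
    CAPTION="Rain on a tin roof"
    SFX="heavy rain hitting metal"
    AUDIOGEN_BIN=echo
    run_audiogen
)
```

Building the argument list with `set --` keeps values containing spaces intact as single arguments when the real binary is invoked.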

Generation Options

Option	Default	Description
`--steps N`	25	Number of diffusion steps
`--duration SECS`	10	Audio duration in seconds
`--cfg SCALE`	3.0	Classifier-free guidance scale
`--sway COEF`	-1.0	Sway sampling coefficient
`--seed N`	random	Random seed for reproducibility
`--threads N`	4	CPU threads
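
To make a run reproducible, it can help to spell out every option rather than rely on defaults. The sketch below prints an option string using the defaults from the table, with overrides read from shell variables; `gen_opts` and the variable names (`STEPS`, `DURATION`, `CFG`, `SWAY`, `THREADS`, `SEED`) are illustrative only, and it is an assumption that these flags can simply be appended to the generate command.

```shell
#!/bin/sh
# Print generation options, falling back to the documented defaults.
# --seed is emitted only when SEED is set, since its default is random.
gen_opts() {
    echo "--steps ${STEPS:-25} --duration ${DURATION:-10} --cfg ${CFG:-3.0} --sway ${SWAY:--1.0} --threads ${THREADS:-4}${SEED:+ --seed $SEED}"
}

# Defaults only
gen_opts

# More steps and a fixed seed, set in a subshell so they do not leak
(STEPS=50; SEED=42; gen_opts)
```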

Performance

Backend	Time (25 steps)	Hardware
Metal GPU	~6s	Apple M3 Ultra
CPU	~163s	Apple M3 Ultra (16 threads)
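
For a rough sense of scale, the timings above can be turned into per-step latency and a Metal-versus-CPU speedup. This is a sketch based only on the approximate figures in the table, so the results are approximate too; `perf_summary` is an illustrative helper name.

```shell
#!/bin/sh
# Derive per-step latency and speedup from total generation times.
perf_summary() {
    # $1: Metal time in seconds, $2: CPU time in seconds, $3: diffusion steps
    LC_ALL=C awk -v m="$1" -v c="$2" -v n="$3" 'BEGIN {
        printf "Metal: %.2f s/step\n", m / n
        printf "CPU: %.2f s/step\n", c / n
        printf "Speedup: %.1fx\n", c / m
    }'
}

perf_summary 6 163 25
```

With the table's figures this works out to about 0.24 s/step on Metal, 6.52 s/step on CPU, and a speedup of roughly 27x.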

Original Model

This is a GGUF conversion of mispeech/Dasheng-AudioGen. Please refer to the original model card for more details about the model architecture and training.

License

Apache-2.0, the same license as the original model.
