Instructions to use daydreamlive/DreamVAE with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- TensorRT
How to use daydreamlive/DreamVAE with TensorRT:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
DreamVAE
A distilled decoder for the ACE-Step 1.5
Oobleck VAE. Drop-in replacement: same [B, 64, T] latent input, same
[B, 2, 1920*T] stereo output at 48 kHz. 51.7M parameters (61% of the
teacher), 0.24 dB SNR degradation vs. the teacher across 20 diverse music
tracks, and 37.2 ms to decode 60 seconds of audio on an RTX 5090 TensorRT
FP16 engine (8.66x faster than the teacher in PyTorch, 1.45x faster than
the teacher's own TRT engine).
Usage
import json
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
repo = "daydreamlive/DreamVAE"
weights = hf_hub_download(repo, "model.safetensors")
config = hf_hub_download(repo, "config.json")
mod_py = hf_hub_download(repo, "modeling.py")
import sys, os
sys.path.insert(0, os.path.dirname(mod_py))
from modeling import FastOobleckDecoder
cfg = json.load(open(config))
model = FastOobleckDecoder(
channels=cfg["channels"],
input_channels=cfg["input_channels"],
audio_channels=cfg["audio_channels"],
upsampling_ratios=cfg["upsampling_ratios"],
channel_multiples=cfg["channel_multiples"],
).eval().to("cuda")
model.load_state_dict(load_file(weights))
# latents: [B, 64, T] -> audio: [B, 2, 1920 * T]
with torch.no_grad():
audio = model(latents)
Pair with the stock ACE-Step encoder; only the decoder is distilled.
ONNX export is at onnx/model.onnx. Build a TRT engine with
scripts/export_trt.py.
Results
Quality (20 diverse FMA tracks, 10 s clips):
| vs. original audio | SNR (dB) | STFT dist | Mel dist |
|---|---|---|---|
| Teacher OobleckDecoder | 5.0 (std 2.5) | 0.698 | 0.733 |
| DreamVAE | 4.7 (std 2.4) | 0.695 | 0.742 |
| Degradation | 0.24 dB | better | +0.008 |
The VAE's own reconstruction error (~5 dB SNR) dominates; DreamVAE's additional error is well under a quarter of that. STFT distance is marginally better for the student; mel distance is 1.1% worse. Informal listening finds the difference inaudible.
Speed (RTX 5090, 60 s stereo audio at 48 kHz, TensorRT FP16):
| Backend | Time (60 s) | vs. teacher PyTorch |
|---|---|---|
| Teacher PyTorch | 321.9 ms | 1.00x |
| Teacher TRT FP16 | 54.0 ms | 5.97x |
| DreamVAE PyTorch | 241.1 ms | 1.34x |
| DreamVAE TRT FP16 | 37.2 ms | 8.66x |
Distillation alone is modest (1.45x over the teacher's own TRT engine). The compelling number is the composition with TRT: what a deployment sees moving from stock PyTorch VAE to DreamVAE on TRT is 8.66x.
Architecture
| Teacher | DreamVAE | |
|---|---|---|
| Parameters | 84.4M | 51.7M (61%) |
| Channel multiples | [1, 2, 4, 8, 16] | [1, 2, 4, 8, 8] |
| Residuals/block | 3 (dil 1, 3, 9) | 2 (dil 1, 3) |
| Upsampling ratios | [10, 6, 4, 4, 2] | [10, 6, 4, 4, 2] |
| Activation | Snake1d | Snake1d |
| Hop length | 1920 | 1920 |
Trained for 650K steps on 8,000 FMA Small tracks (two-phase: 500K reconstruction + 150K adversarial). Released checkpoint is step 635K, selected as the HF-energy-match peak from a six-point trajectory over the adversarial phase.
Files
.
βββ README.md
βββ config.json architecture + training hyperparameters
βββ model.safetensors FP32 weights (207 MB, 51.7M params)
βββ modeling.py self-contained FastOobleckDecoder class
βββ onnx/model.onnx ONNX export
βββ scripts/ training, eval, speed bench, TRT export/verify
Limitations
- Signal-level metrics only (no MUSHRA / PESQ / preference tests).
- Evaluated on VAE round-trip, not end-to-end through the full text-to-DiT-to-VAE pipeline. FAD on generated samples is future work.
- One point on the quality-speed frontier (
[1,2,4,8,8], 2 residual units). Width variants not evaluated. - Trained in FP32. INT8 / FP8 quantization would likely stack another 1.5-2x on top of FP16 and is not explored here.
Citation
Technical note (methodology, training recipe with source-paper attribution, adversarial-phase trajectory, and negative results on encoder distillation) is forthcoming on arXiv. Citation block will be added here on release.
License
Apache 2.0, matching the upstream ACE-Step 1.5 license.
- Downloads last month
- 16
Model tree for daydreamlive/DreamVAE
Base model
ACE-Step/ACE-Step-v1-3.5B