Apply for community grant: Company project (GPU and storage)
ACE-Step is a novel, open-source foundation model designed to address key limitations in current music generation AI, specifically the trade-offs between speed, musical coherence, and fine-grained control.
Existing methods, such as LLM-based models, can be slow and struggle with musical structure, while some diffusion models may lack long-range coherence despite faster inference. ACE-Step bridges this gap with a unique hybrid architecture. It integrates diffusion-based generation with a Deep Compression AutoEncoder (Sana's DCAE) and a lightweight linear transformer. Crucially, it leverages semantic alignment techniques (MERT/m-hubert) during training to ensure rapid convergence and enhance lyrical and musical coherence.
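To make the hybrid design above concrete, here is a minimal PyTorch sketch of the three pieces it names: a DCAE-style compression autoencoder, a lightweight linear-attention denoiser, and a semantic-alignment loss against frozen MERT-like features. All module names, tensor shapes, and the loss weighting are illustrative assumptions for exposition, not the actual ACE-Step implementation.

```python
# Illustrative sketch only: module names, shapes, and losses are assumptions,
# not the real ACE-Step code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompressionAutoEncoder(nn.Module):
    """Stand-in for a DCAE-style autoencoder: compresses audio features into a
    short latent sequence and decodes them back."""
    def __init__(self, in_dim=128, latent_dim=64, stride=8):
        super().__init__()
        self.encoder = nn.Conv1d(in_dim, latent_dim, kernel_size=stride, stride=stride)
        self.decoder = nn.ConvTranspose1d(latent_dim, in_dim, kernel_size=stride, stride=stride)

    def encode(self, x):           # x: (batch, in_dim, frames)
        return self.encoder(x)     # (batch, latent_dim, frames // stride)

    def decode(self, z):
        return self.decoder(z)


class LinearAttentionBlock(nn.Module):
    """Lightweight linear-attention block as a stand-in for the efficient
    transformer that denoises the compressed latents."""
    def __init__(self, dim=64):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1      # positive feature maps -> linear attention
        kv = torch.einsum("btd,bte->bde", k, v)
        norm = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(dim=1)) + 1e-6)
        attn = torch.einsum("btd,bde,bt->bte", q, kv, norm)
        return self.out(attn) + x


def semantic_alignment_loss(latents, semantic_feats, proj):
    """Align denoiser outputs with frozen semantic features (MERT-like) via a
    learned projection, encouraging fast convergence and lyric coherence."""
    return F.mse_loss(proj(latents), semantic_feats)


if __name__ == "__main__":
    ae = CompressionAutoEncoder()
    denoiser = LinearAttentionBlock(dim=64)
    proj = nn.Linear(64, 768)                     # latent -> semantic feature space

    audio = torch.randn(2, 128, 1024)             # dummy spectrogram batch
    z = ae.encode(audio).transpose(1, 2)          # (batch, seq, latent_dim)

    t = torch.rand(2, 1, 1)                       # diffusion timestep in [0, 1]
    noisy = (1 - t) * z + t * torch.randn_like(z) # simple interpolation toward noise
    pred = denoiser(noisy)

    mert_feats = torch.randn(2, z.shape[1], 768)  # placeholder for frozen MERT features
    loss = F.mse_loss(pred, z) + 0.1 * semantic_alignment_loss(pred, mert_feats, proj)
    print(f"toy training loss: {loss.item():.4f}")
```

The point of the sketch is only the division of labor: the autoencoder keeps sequences short so generation is fast, the linear-attention denoiser keeps per-step cost low over long contexts, and the alignment term injects semantic supervision during training.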
The results are state-of-the-art: ACE-Step can synthesize up to 4 minutes of music in just 20 seconds on an A100 GPU, demonstrating a 15x speedup over LLM-based baselines. This speed is achieved while delivering superior musical coherence (melody, harmony, rhythm) and accurate lyric alignment. Furthermore, ACE-Step preserves fine-grained acoustic detail, enabling advanced control capabilities like voice cloning, lyric editing, track remixing, and source separation tasks (e.g., lyric2vocal, singing2accompaniment).
Our vision is to establish ACE-Step not just as another end-to-end pipeline, but as a truly general-purpose, efficient, and flexible foundation model for music AI, aiming for the "Stable Diffusion moment" for music. By open-sourcing this architecture, we intend to empower artists, producers, and developers, making it easy for the community to train sub-task models and build innovative tools on top of ACE-Step.
We are seeking community grant resources on Hugging Face to support the further development, optimization, rigorous evaluation, and community accessibility of ACE-Step, enabling us to fully realize its potential and contribute a powerful new tool to the open-source music AI ecosystem.
congrats guys!