ACE-Step 1.5 Ultimate Guide (Must Read)
Language / 语言 / 言語: English | 中文 | 日本語
Hello everyone, I'm Gong Junmin, the developer of ACE-Step. Through this tutorial, I'll guide you through the design philosophy and usage of ACE-Step 1.5.
Mental Models
Before we begin, we need to establish the correct mental models to set proper expectations.
Human-Centered Design
This model is not designed for one-click generation, but for human-centered generation.
Understanding this distinction is crucial.
What is One-Click Generation?
You input a prompt, click generate, listen to a few versions, pick one that sounds good, and use it. If someone else inputs the same prompt, they'll likely get similar results.
In this mode, you and AI have a client-vendor relationship. You come with a clear purpose, with a vague expectation in mind, hoping AI delivers a product close to that expectation. Essentially, it's not much different from searching on Google or finding songs on Spotify—just with a bit more customization.
AI is a service, not a creative inspirer.
Suno, Udio, MiniMax, Mureka—these platforms are all designed with this philosophy. They can scale up models as services to ensure delivery. Your generated music is bound by their agreements; you can't run it locally, can't fine-tune for personalized exploration; if they secretly change models or terms, you can only accept it.
What is Human-Centered Generation?
If we weaken the AI layer and strengthen the human layer—letting more human will, creativity, and inspiration give life to AI—this is human-centered generation.
Unlike the strong purposefulness of one-click generation, human-centered generation has more of a playful nature. It's more like an interactive game where you and the model are collaborators.
The workflow is like this: you throw out some inspiration seeds, get a few songs, choose interesting directions from them to continue iterating—
- Adjust prompts to regenerate
- Use Cover to maintain structure and adjust details
- Use Repaint for local modifications
- Use Add Layer to add or remove instrument layers
At this point, AI is not a servant to you, but an inspirer.
What Conditions Must This Design Meet?
For human-centered generation to truly work, the model must meet several key conditions:
First, it must be open-source, locally runnable, and trainable.
This isn't technical purism, but a matter of ownership. When you use closed-source platforms, you don't own the model, and your generated works are bound by their agreements. Version updates, term changes, service shutdowns—none of these are under your control.
But when the model is open-source and locally runnable, everything changes: You forever own this model, and you forever own all the creations you make with it. No third-party agreement hassles, no platform risks, you can fine-tune, modify, and build your own creative system based on it. Your works will forever belong to you. It's like buying an instrument—you can use it anytime, anywhere, and adjust it anytime, anywhere.
Second, it must be fast.
Human time is precious, but more importantly—slow generation breaks flow state.
The core of human-centered workflow is the rapid cycle of "try, listen, adjust." If each generation takes minutes, your inspiration dissipates while waiting, and the "play" experience degrades into the "wait" ordeal.
Therefore, we specifically optimized ACE-Step for this: while ensuring quality, we made generation fast enough to support a smooth human-machine dialogue rhythm.
Finite Game vs Infinite Game
One-click generation is a finite game—clear goals, result-oriented, ends at the finish line. To some extent, it coldly hollows out the music industry, displacing many people's jobs.
Human-centered generation is an infinite game—because the fun lies in the process, and the process never ends.
Our vision is to democratize AI music generation. Let ACE-Step become a big toy in your pocket, let music return to Play itself—the creative "play," not just clicking play.
The Elephant Rider Metaphor
Recommended reading: The Complete Guide to Mastering Suno—this blog tutorial can help you establish the foundational understanding of AI music.
AI music generation is like the famous elephant rider metaphor in psychology.
Consciousness rides on the subconscious, humans ride on elephants. You can give directions, but you can't make the elephant precisely and instantly execute every command. It has its own inertia, its own temperament, its own will.
This elephant is the music generation model.
The Iceberg Model
Between audio and semantics lies a hidden iceberg.
What we can describe with language—style, instruments, timbre, emotion, scenes, progression, lyrics, vocal style—these are familiar words, the parts we can touch. But together, they're still just a tiny tip of the audio iceberg above the water.
What's the most precise control? You input the expected audio, and the model returns it unchanged.
But as long as you're using text descriptions, references, prompts—the model will have room to play. This isn't a bug, it's the nature of things.
What is the Elephant?
This elephant is a fusion of countless elements: data distribution, model scale, algorithm design, annotation bias, evaluation bias—it's an abstract crystallization of human music history and engineering trade-offs.
Any deviation in these elements will cause it to fail to accurately reflect your taste and expectations.
Of course, we can expand data scale, improve algorithm efficiency, increase annotation precision, expand model capacity, introduce more professional evaluation systems—these are all directions we can optimize as model developers.
But even if one day we achieve technical "perfection," there's still a fundamental problem we can't avoid: taste.
Taste and Expectations
Taste varies from person to person.
If a music generation model tries to please all listeners, its output will tend toward the popular average of human music history—this will be extremely mediocre.
It's humans who give sound meaning, emotion, experience, life, and cultural symbolic value. It's a small group of artists who create unique tastes, then drive ordinary people to consume and follow, turning niche into mainstream popularity. These pioneering minority artists become legends.
So when you find the model's output "not to your taste," this might not be the model's problem—but rather your taste happens to be outside that "average." This is a good thing.
This means: You need to learn to guide this elephant, not expect it to automatically understand you.
Knowing the Elephant Herd: Model Architecture and Selection
Now you understand the "elephant" metaphor. But actually—
This isn't one elephant, but an entire herd—elephants large and small, forming a family. 🐘🐘🐘🐘
Architecture Principles: Two Brains
ACE-Step 1.5 uses a hybrid architecture with two core components working together:
```
User Input → [5Hz LM] → Semantic Blueprint → [DiT] → Audio
                ↓
        Metadata Inference
        Caption Optimization
        Structure Planning
```
5Hz LM (Language Model) — Planner (Optional)
The LM is an "omni-capable planner" responsible for understanding your intent and making plans:
- Infers music metadata (BPM, key, duration, etc.) through Chain-of-Thought
- Optimizes and expands your caption—understanding and supplementing your intent
- Generates semantic codes—implicitly containing composition melody, orchestration, and some timbre information
The LM learns world knowledge from training data. It's a planner that improves usability and helps you quickly generate prototypes.
But the LM is not required.
If you're very clear about what you want, or already have a clear planning goal—you can completely skip the LM planning step by not using thinking mode.
For example, in Cover mode, you use reference audio to constrain composition, chords, and structure, letting DiT generate directly. Here, you replace the LM's work—you become the planner yourself.
Another example: in Repaint mode, you use reference audio as context, constraining timbre, mixing, and details, letting DiT directly adjust locally. Here, DiT is more like your creative brainstorming partner, helping with creative ideation and fixing local disharmony.
DiT (Diffusion Transformer) — Executor
DiT is the "audio craftsman," responsible for turning plans into reality:
- Receives semantic codes and conditions generated by LM
- Gradually "carves" audio from noise through the diffusion process
- Decides final timbre, mixing, details
Why this design?
Traditional methods let diffusion models generate audio directly from text, but text-to-audio mapping is too vague. ACE-Step introduces LM as an intermediate layer:
- LM excels at understanding semantics and planning
- DiT excels at generating high-fidelity audio
- They work together, each doing their part
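To make this division of labor concrete, here is a minimal pseudocode sketch of the two-stage flow. The function names (`plan_with_lm`, `render_with_dit`) are hypothetical placeholders rather than the actual ACE-Step API; they only illustrate which stage consumes which information.

```python
# Hypothetical two-stage sketch; these function names are placeholders,
# not the real ACE-Step API.

def plan_with_lm(caption, lyrics):
    """Stage 1 (optional): the 5Hz LM acts as the planner.
    It infers metadata via Chain-of-Thought, rewrites the caption,
    and emits semantic codes (a melody/orchestration blueprint)."""
    return {
        "metas": {"bpm": 92, "keyscale": "Am", "duration": 180},
        "caption": caption + ", warm analog pads, wide stereo mix",
        "audio_codes": "<5Hz semantic codes>",
    }

def render_with_dit(plan, steps=8):
    """Stage 2: DiT is the executor. It denoises from noise toward audio,
    conditioned on the plan, and decides timbre, mixing, and detail."""
    ...

# Thinking mode on: the LM plans, DiT executes.
audio = render_with_dit(plan_with_lm("dreamy synthwave, female vocal", "[Verse]\n..."))

# No LM: you are the planner (e.g., Cover mode constrains structure via reference audio).
audio = render_with_dit({"caption": "dreamy synthwave, female vocal"})
```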
Choosing the Planner: LM Models
LM has four options: No LM (disable thinking mode), 0.6B, 1.7B, 4B.
Their training data is completely identical; the difference is purely in knowledge capacity:
- Larger models have richer world knowledge
- Larger models have stronger memory (e.g., remembering reference audio melodies)
- Larger models perform relatively better on long-tail styles or instruments
| Choice | Speed | World Knowledge | Memory | Use Cases |
|---|---|---|---|---|
| No LM | ⚡⚡⚡⚡ | — | — | You do the planning (e.g., Cover mode) |
| `0.6B` | ⚡⚡⚡ | Basic | Weak | Low VRAM (< 8GB), rapid prototyping |
| `1.7B` | ⚡⚡ | Medium | Medium | Default recommendation |
| `4B` | ⚡ | Rich | Strong | Complex tasks, high-quality generation |
How to choose?
Based on your hardware:
- VRAM < 8GB → No LM or `0.6B`
- VRAM 8–16GB → `1.7B` (default)
- VRAM > 16GB → `1.7B` or `4B`
Choosing the Executor: DiT Models
With a planning scheme, you still need to choose an executor. DiT is the core of ACE-Step 1.5—it handles various tasks and decides how to interpret LM-generated codes.
We've open-sourced 4 Turbo models, 1 SFT model, and 1 Base model.
Turbo Series (Recommended for Daily Use)
Turbo models are trained with distillation, generating high-quality audio in just 8 steps. The core difference between the four variants is the shift hyperparameter configuration during distillation.
What is shift?
Shift determines the "attention allocation" during DiT denoising:
- Larger shift → More effort spent on early denoising (building large structure from pure noise), stronger semantics, clearer overall framework
- Smaller shift → More even step distribution, more details, but details might also be noise
Simple understanding: high shift is like "draw outline first then fill details," low shift is like "draw and fix simultaneously."
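For intuition, here is one common way a timestep shift of this kind is parameterized in flow-matching samplers. Whether ACE-Step uses this exact formula is an assumption; the sketch only shows why a larger shift concentrates the step budget on the noisy, structure-building end.

```python
# Common flow-matching shift mapping: t' = s*t / (1 + (s-1)*t).
# Treat this as an illustrative assumption, not ACE-Step's exact code.
def shift_timestep(t, s):
    return s * t / (1 + (s - 1) * t)

uniform = [i / 8 for i in range(9)]   # a uniform 8-step schedule, t in [0, 1]
for s in (1.0, 3.0):
    print(f"shift={s}:", [round(shift_timestep(t, s), 2) for t in uniform])
# shift=1.0 keeps uniform spacing ("draw and fix simultaneously");
# shift=3.0 pushes timesteps toward the high-noise end, spending more of
# the budget on overall structure ("draw outline first, then fill details").
```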
| Model | Distillation Config | Characteristics |
|---|---|---|
| `turbo` (default) | Joint distillation on shift 1, 2, 3 | Best balance of creativity and semantics, thoroughly tested, recommended first choice |
| `turbo-shift1` | Distilled only on shift=1 | Richer details, but weaker semantics |
| `turbo-shift3` | Distilled only on shift=3 | Clearer, richer timbre, but may sound "dry," minimal orchestration |
| `turbo-continuous` | Experimental, supports continuous shift 1–5 | Most flexible tuning, but not thoroughly tested |
You can choose based on target music style—you might find you prefer a certain variant. We recommend starting with default turbo—it's the most balanced and proven choice.
SFT Model
Compared to Turbo, SFT model has two notable features:
- Supports CFG (Classifier-Free Guidance), allowing fine-tuning of prompt adherence
- More steps (50 steps), giving the model more time to "think"
The cost: more steps mean more error accumulation, so audio clarity may be slightly inferior to Turbo. But its detail expression and semantic parsing are better.
If you don't care about inference time, like tuning CFG and steps, and prefer that rich detail feel—SFT is a good choice. LM-generated codes can also work with SFT models.
Base Model
Base is the master of all tasks, with three exclusive tasks beyond SFT and Turbo:
| Task | Description |
|---|---|
| `extract` | Extract single tracks from mixed audio (e.g., separate vocals) |
| `lego` | Add new tracks to existing tracks (e.g., add drums to guitar) |
| `complete` | Add mixed accompaniment to a single track (e.g., add guitar + drums accompaniment to vocals) |
Additionally, Base has the strongest plasticity. If you have large-scale fine-tuning needs, we recommend starting experiments with Base to train your own SFT model.
Creating Your Custom Model
Beyond official models, you can also use LoRA fine-tuning to create your custom model.
We'll release an example LoRA model—trained on 20+ "Happy New Year" themed songs, specifically suited for expressing festive atmosphere. This is just a starting point.
What does a custom model mean?
You can reshape DiT's capabilities and preferences with your own data recipe:
- Like a specific timbre style? Train with that type of songs
- Want the model better at a certain genre? Collect related data for fine-tuning
- Have your own unique aesthetic taste? "Teach" it to the model
This greatly expands customization and playability—train a model unique to you with your aesthetic taste.
For detailed LoRA training guide, see the "LoRA Training" tab in Gradio UI.
DiT Selection Summary
| Model | Steps | CFG | Speed | Exclusive Tasks | Recommended Scenarios |
|---|---|---|---|---|---|
| `turbo` (default) | 8 | ❌ | ⚡⚡⚡ | — | Daily use, rapid iteration |
| `sft` | 50 | ✅ | ⚡ | — | Pursuing details, like tuning |
| `base` | 50 | ✅ | ⚡ | extract, lego, complete | Special tasks, large-scale fine-tuning |
Combination Strategies
Default configuration is turbo + 1.7B LM, suitable for most scenarios.
| Need | Recommended Combination |
|---|---|
| Fastest speed | turbo + No LM or 0.6B |
| Daily use | turbo + 1.7B (default) |
| Pursuing details | sft + 1.7B or 4B |
| Special tasks | base |
| Large-scale fine-tuning | base |
| Low VRAM (< 4GB) | turbo + No LM + CPU offload |
Downloading Models
```bash
# Download default models (turbo + 1.7B LM)
uv run acestep-download

# Download all models
uv run acestep-download --all

# Download a specific model
uv run acestep-download --model acestep-v15-base
uv run acestep-download --model acestep-5Hz-lm-0.6B

# List available models
uv run acestep-download --list
```
Download the models into a `checkpoints` folder so they are easy to identify.
Guiding the Elephant: What Can You Control?
Now that you know this herd of elephants, let's learn how to communicate with them.
Each generation is determined by three types of factors: input control, inference hyperparameters, and random factors.
I. Input Control: What Do You Want?
This is the part where you communicate "creative intent" with the model—what kind of music you want to generate.
| Category | Parameter | Function |
|---|---|---|
| Task Type | `task_type` | Determines generation mode: text2music, cover, repaint, lego, extract, complete |
| Text Input | `caption` | Description of overall music elements: style, instruments, emotion, atmosphere, timbre, vocal gender, progression, etc. |
| | `lyrics` | Temporal element description: lyric content, music structure evolution, vocal changes, vocal/instrument performance style, start/end style, articulation, etc. (use `[Instrumental]` for instrumental music) |
| Music Metadata | `bpm` | Tempo (30–300) |
| | `keyscale` | Key (e.g., C Major, Am) |
| | `timesignature` | Time signature (4/4, 3/4, 6/8) |
| | `vocal_language` | Vocal language |
| | `duration` | Target duration (seconds) |
| Audio Reference | `reference_audio` | Global reference for timbre or style (for cover, style transfer) |
| | `src_audio` | Source audio for non-text2music tasks (text2music defaults to silence, no input needed) |
| | `audio_codes` | Semantic codes input to the model in Cover mode (advanced: reuse codes for variants, convert songs to codes for extension, combine like DJ mixing) |
| Interval Control | `repainting_start/end` | Time interval for operations (repaint redraw area / lego new track area) |
About Caption: The Most Important Input
Caption is the most important factor affecting generated music.
It supports multiple input formats: simple style words, comma-separated tags, or complex natural language descriptions. We've trained the model to be compatible with all of these, so the text format doesn't significantly affect performance.
We provide at least 5 ways to help you write good captions:
1. Random Dice — Click the random button in the UI to see how example captions are written. You can use this standardized caption as a template and have an LLM rewrite it into your desired form.
2. Format Auto-Rewrite — Use the `format` feature to automatically expand your handwritten simple caption into a complex description.
3. CoT Rewrite — If the LM is initialized, whether `thinking` mode is enabled or not, captions can be rewritten and expanded through Chain-of-Thought (unless you actively disable it in settings, or the LM is not initialized).
4. Audio to Caption — Our LM supports converting your input audio into a caption. While precision is limited, the general direction is correct—enough as a starting point.
5. Simple Mode — Just input a simple song description, and the LM will automatically generate a complete caption, lyrics, and metas sample—suitable for quick starts.
Whichever method you use, they all solve the same real problem: as ordinary people, our music vocabulary is impoverished.
If you want generated music to be more interesting and meet expectations, Prompting is always the optimal option—it brings the highest marginal returns and surprises.
Common Dimensions for Caption Writing:
| Dimension | Examples |
|---|---|
| Style/Genre | pop, rock, jazz, electronic, hip-hop, R&B, folk, classical, lo-fi, synthwave |
| Emotion/Atmosphere | melancholic, uplifting, energetic, dreamy, dark, nostalgic, euphoric, intimate |
| Instruments | acoustic guitar, piano, synth pads, 808 drums, strings, brass, electric bass |
| Timbre Texture | warm, bright, crisp, muddy, airy, punchy, lush, raw, polished |
| Era Reference | 80s synth-pop, 90s grunge, 2010s EDM, vintage soul, modern trap |
| Production Style | lo-fi, high-fidelity, live recording, studio-polished, bedroom pop |
| Vocal Characteristics | female vocal, male vocal, breathy, powerful, falsetto, raspy, choir |
| Speed/Rhythm | slow tempo, mid-tempo, fast-paced, groovy, driving, laid-back |
| Structure Hints | building intro, catchy chorus, dramatic bridge, fade-out ending |
Some Practical Principles:
Specific beats vague — "sad piano ballad with female breathy vocal" works better than "a sad song."
Combine multiple dimensions — Single-dimension descriptions give the model too much room to play; combining style+emotion+instruments+timbre can more precisely anchor your desired direction.
Use references well — "in the style of 80s synthwave" or "reminiscent of Bon Iver" can quickly convey complex aesthetic preferences.
Texture words are useful — Adjectives like warm, crisp, airy, punchy can influence mixing and timbre tendencies.
Don't pursue perfect descriptions — Caption is a starting point, not an endpoint. Write a general direction first, then iterate based on results.
Description granularity determines freedom — The less you specify, the more room the model has to play and the more randomness matters; the more detail you give, the more constrained the model is. Decide specificity based on your needs—want surprises? Write less. Want control? Write more details.
Avoid conflicting words — Conflicting style combinations easily lead to degraded output. For example, asking for both "classical strings" and "hardcore metal" at once—the model will try to fuse them, but the result is usually not ideal. Especially when `thinking` mode is enabled, the LM generalizes over captions less well than DiT, so unreasonable prompts leave less chance of a pleasant surprise.
Ways to resolve conflicts:
- Repetition reinforcement — Strengthen the elements you want more in mixed styles by repeating certain words
- Conflict to evolution — Transform style conflicts into temporal style evolution. For example: "Start with soft strings, middle becomes noisy dynamic metal rock, end turns to hip-hop"—this gives the model clear guidance on how to handle different styles, rather than mixing them into a mess
For more prompting tips, see: The Complete Guide to Mastering Suno—although it's a Suno tutorial, prompting ideas are universal.
About Lyrics: The Temporal Script
If Caption describes the music's "overall portrait"—style, atmosphere, timbre—then Lyrics is the music's "temporal script", controlling how music unfolds over time.
Lyrics is not just lyric content. It carries:
- The lyric text itself
- Structure tags ([Verse], [Chorus], [Bridge]...)
- Vocal style hints ([raspy vocal], [whispered]...)
- Instrumental sections ([guitar solo], [drum break]...)
- Energy changes ([building energy], [explosive drop]...)
Structure Tags are Key
Structure tags (Meta Tags) are the most powerful tool in Lyrics. They tell the model: "What is this section, how should it be performed?"
Common Structure Tags:
| Category | Tag | Description |
|---|---|---|
| Basic Structure | `[Intro]` | Opening, establish atmosphere |
| | `[Verse]` / `[Verse 1]` | Verse, narrative progression |
| | `[Pre-Chorus]` | Pre-chorus, build energy |
| | `[Chorus]` | Chorus, emotional climax |
| | `[Bridge]` | Bridge, transition or elevation |
| | `[Outro]` | Ending, conclusion |
| Dynamic Sections | `[Build]` | Energy gradually rising |
| | `[Drop]` | Electronic music energy release |
| | `[Breakdown]` | Reduced instrumentation, space |
| Instrumental Sections | `[Instrumental]` | Pure instrumental, no vocals |
| | `[Guitar Solo]` | Guitar solo |
| | `[Piano Interlude]` | Piano interlude |
| Special Tags | `[Fade Out]` | Fade out ending |
| | `[Silence]` | Silence |
Combining Tags: Use Moderately
Structure tags can be combined with - for finer control:
[Chorus - anthemic]
This is the chorus lyrics
Dreams are burning
[Bridge - whispered]
Whisper those words softly
This works better than writing [Chorus] alone—you're telling the model both what this section is (Chorus) and how to sing it (anthemic).
⚠️ Note: Don't stack too many tags.
❌ Not recommended:
[Chorus - anthemic - stacked harmonies - high energy - powerful - epic]
✅ Recommended:
[Chorus - anthemic]
Stacking too many tags has two risks:
- The model might mistake tag content as lyrics to sing
- Too many instructions confuse the model, making effects worse
Principle: Keep structure tags concise; put complex style descriptions in Caption.
⚠️ Key: Maintain Consistency Between Caption and Lyrics
Models are not good at resolving conflicts. If descriptions in Caption and Lyrics contradict, the model gets confused and output quality decreases.
❌ Conflict example:
Caption: "violin solo, classical, intimate chamber music"
Lyrics: [Guitar Solo - electric - distorted]
✅ Consistent example:
Caption: "violin solo, classical, intimate chamber music"
Lyrics: [Violin Solo - expressive]
Checklist:
- Instruments in Caption ↔ Instrumental section tags in Lyrics
- Emotion in Caption ↔ Energy tags in Lyrics
- Vocal description in Caption ↔ Vocal control tags in Lyrics
Think of Caption as "overall setting" and Lyrics as "shot script"—they should tell the same story.
Vocal Control Tags:
| Tag | Effect |
|---|---|
| `[raspy vocal]` | Raspy, textured vocals |
| `[whispered]` | Whispered |
| `[falsetto]` | Falsetto |
| `[powerful belting]` | Powerful, high-pitched singing |
| `[spoken word]` | Rap/recitation |
| `[harmonies]` | Layered harmonies |
| `[call and response]` | Call and response |
| `[ad-lib]` | Improvised embellishments |
Energy and Emotion Tags:
| Tag | Effect |
|---|---|
| `[high energy]` | High energy, passionate |
| `[low energy]` | Low energy, restrained |
| `[building energy]` | Increasing energy |
| `[explosive]` | Explosive energy |
| `[melancholic]` | Melancholic |
| `[euphoric]` | Euphoric |
| `[dreamy]` | Dreamy |
| `[aggressive]` | Aggressive |
Lyric Text Writing Tips
1. Control Syllable Count
6–10 syllables per line usually works best. The model aligns syllables to beats—if one line has 6 syllables and the next has 14, the rhythm becomes awkward.
❌ Bad example:
我站在窗前看着外面的世界一切都在改变(18 syllables)
你好(2 syllables)
✅ Good example:
我站在窗前(5 syllables)
看着外面世界(6 syllables)
一切都在改变(6 syllables)
Tip: Keep similar syllable counts for lines in the same position (e.g., the first line of each verse), within about ±1–2 syllables.
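If you draft or batch-check lyrics programmatically, a rough per-line syllable count helps catch outliers before generation. The sketch below is a simple heuristic (CJK characters count as one syllable each, English words as vowel groups) and is not part of ACE-Step.

```python
import re

# Rough syllable counter: each CJK character = 1 syllable; English words
# are counted by vowel groups. A heuristic for sanity checks only.
def syllables(line):
    cjk = len(re.findall(r"[\u4e00-\u9fff]", line))
    english = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                  for w in re.findall(r"[A-Za-z]+", line))
    return cjk + english

for line in ["我站在窗前", "看着外面世界", "walking through the empty streets"]:
    print(syllables(line), line)   # flag lines drifting far outside 6-10
```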
2. Use Case to Control Intensity
Uppercase indicates stronger vocal intensity:
[Verse]
walking through the empty streets (normal intensity)
[Chorus]
WE ARE THE CHAMPIONS! (high intensity, shouting)
3. Use Parentheses for Background Vocals
[Chorus]
We rise together (together)
Into the light (into the light)
Content in parentheses is processed as background vocals or harmonies.
4. Extend Vowels
You can extend sounds by repeating vowels:
Feeeling so aliiive
But use cautiously—effects are unstable, sometimes ignored or mispronounced.
5. Clear Section Separation
Separate each section with blank lines:
[Verse 1]
First verse lyrics
Continue first verse
[Chorus]
Chorus lyrics
Chorus continues
Avoiding "AI-flavored" Lyrics
These characteristics make lyrics seem mechanical and lack human touch:
| Red Flag 🚩 | Description |
|---|---|
| Adjective stacking | "neon skies, electric hearts, endless dreams"—filling a section with vague imagery |
| Rhyme chaos | Inconsistent rhyme patterns, or forced rhymes causing semantic breaks |
| Blurred section boundaries | Lyric content crosses structure tags, Verse content "flows" into Chorus |
| No breathing room | Each line too long, can't sing in one breath |
| Mixed metaphors | First verse uses water imagery, second suddenly becomes fire, third is flying—listeners can't anchor |
Metaphor discipline: Stick to one core metaphor per song, exploring its multiple aspects. For example, choosing "water" as metaphor, you can explore: how love flows around obstacles like water, can be gentle rain or flood, reflects the other's image, can't be grasped but exists. One image, multiple facets—this gives lyrics cohesion.
Writing Instrumental Music
If generating pure instrumental music without vocals:
[Instrumental]
Or use structure tags to describe instrumental development:
[Intro - ambient]
[Main Theme - piano]
[Climax - powerful]
[Outro - fade out]
Complete Example
Assuming Caption is: female vocal, piano ballad, emotional, intimate atmosphere, strings, building to powerful chorus
[Intro - piano]
[Verse 1]
月光洒在窗台上
我听见你的呼吸
城市在远处沉睡
只有我们还醒着
[Pre-Chorus]
这一刻如此安静
却藏着汹涌的心
[Chorus - powerful]
让我们燃烧吧
像夜空中的烟火
短暂却绚烂
这就是我们的时刻
[Verse 2]
时间在指尖流过
我们抓不住什么
但至少此刻拥有
彼此眼中的火焰
[Bridge - whispered]
如果明天一切消散
至少我们曾经闪耀
[Final Chorus]
让我们燃烧吧
像夜空中的烟火
短暂却绚烂
THIS IS OUR MOMENT!
[Outro - fade out]
Note: In this example, Lyrics tags (piano, powerful, whispered) are consistent with Caption descriptions (piano ballad, building to powerful chorus, intimate), with no conflicts.
About Music Metadata: Optional Fine Control
Most of the time, you don't need to manually set metadata.
When you enable thinking mode (or enable use_cot_metas), LM automatically infers appropriate BPM, key, time signature, etc. based on your Caption and Lyrics. This is usually good enough.
But if you have clear ideas, you can also manually control them:
| Parameter | Control Range | Description |
|---|---|---|
| `bpm` | 30–300 | Tempo. Common distribution: slow songs 60–80, mid-tempo 90–120, fast songs 130–180 |
| `keyscale` | Key | e.g., C Major, Am, F# Minor. Affects overall pitch and emotional color |
| `timesignature` | Time signature | 4/4 (most common), 3/4 (waltz), 6/8 (swing feel) |
| `vocal_language` | Language | Vocal language. LM usually auto-detects from lyrics |
| `duration` | Seconds | Target duration. Actual generation may vary slightly |
Understanding Control Boundaries
These parameters are guidance rather than precise commands:
- BPM: Common range (60–180) works well; extreme values (like 30 or 280) have less training data, may be unstable
- Key: Common keys (C, G, D, Am, Em) are stable; rare keys may be ignored or shifted
- Time signature: `4/4` is most reliable; `3/4` and `6/8` are usually OK; complex signatures (like `5/4`, `7/8`) are advanced, effects vary by style
- Duration: Short songs (30–60s) and medium length (2–4 min) are stable; very long generation may have repetition or structure issues
The Model's "Reference" Approach
The model doesn't mechanically execute bpm=120, but rather:
- Uses `120 BPM` as an anchor point
- Samples from the distribution near this anchor
- Final result might be 118 or 122, not exactly 120
It's like telling a musician "around 120 tempo"—they'll naturally play in this range, not rigidly follow a metronome.
When Do You Need Manual Settings?
| Scenario | Suggestion |
|---|---|
| Daily generation | Don't worry, let LM auto-infer |
| Clear tempo requirement | Manually set bpm |
| Specific style (e.g., waltz) | Manually set timesignature=3/4 |
| Need to match other material | Manually set bpm and duration |
| Pursue specific key color | Manually set keyscale |
Tip: If you manually set metadata but generation results clearly don't match—check if there's conflict with Caption/Lyrics. For example, Caption says "slow ballad" but bpm=160, the model gets confused.
Recommended Practice: Don't write tempo, BPM, key, or other metadata in the Caption. Set these through the dedicated metadata parameters (bpm, keyscale, timesignature, etc.) and keep the Caption focused on style, emotion, instruments, timbre, and other musical characteristics.
About Audio Control: Controlling Sound with Sound
Text is a dimensionally reduced abstraction; the best control is still to control sound with sound.
There are three ways to control generation with audio, each with different control ranges and uses:
1. Reference Audio: Global Acoustic Feature Control
Reference audio (reference_audio) is used to control the acoustic features of generated music—timbre, mixing style, performance style, etc. It averages temporal dimension information and acts globally.
What Does Reference Audio Control?
Reference audio mainly controls the acoustic features of generated music, including:
- Timbre texture: Vocal timbre, instrument timbre
- Mixing style: Spatial sense, dynamic range, frequency distribution
- Performance style: Vocal techniques, playing techniques, expression
- Overall atmosphere: The "feeling" conveyed through reference audio
How Does the Backend Process Reference Audio?
When you provide reference audio, the system performs the following processing:
Audio Preprocessing:
- Load audio file, normalize to stereo 48kHz format
- Detect silence, ignore if audio is completely silent
- If audio length is less than 30 seconds, repeat to fill to at least 30 seconds
- Randomly select 10-second segments from front, middle, and back positions, concatenate into 30-second reference segment
Encoding Conversion:
- Use the VAE (Variational Autoencoder) `tiled_encode` method to encode the audio into a latent representation (latents)
- These latents contain acoustic feature information but strip out specific melody, rhythm, and other structural information
- The encoded latents are fed as conditions into the DiT generation process; they are averaged over the temporal dimension and act globally on the entire generation
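For readers who want to see these steps as code, here is an illustrative re-implementation of the preprocessing using torchaudio. The silence threshold and the deterministic segment positions are simplifying assumptions; the real pipeline picks the 10-second segments randomly, and the ACE-Step source remains the authoritative reference.

```python
import torch
import torchaudio

# Illustrative sketch of the reference-audio preprocessing described above.
# Thresholds and segment positions are simplifying assumptions; see the
# ACE-Step source for the authoritative implementation.
def prepare_reference(path, sr=48_000):
    wav, orig_sr = torchaudio.load(path)                    # (channels, samples)
    wav = torchaudio.functional.resample(wav, orig_sr, sr)  # normalize to 48 kHz
    if wav.shape[0] == 1:
        wav = wav.repeat(2, 1)                              # mono -> stereo
    if wav.abs().max() < 1e-4:
        raise ValueError("silent reference audio is ignored")
    min_len = 30 * sr
    while wav.shape[1] < min_len:                           # repeat short clips to >= 30 s
        wav = torch.cat([wav, wav], dim=1)
    seg = 10 * sr                                           # 10 s from front / middle / back
    starts = [0, (wav.shape[1] - seg) // 2, wav.shape[1] - seg]
    ref = torch.cat([wav[:, s:s + seg] for s in starts], dim=1)
    return ref  # 30 s segment; VAE tiled_encode then turns it into global latents
```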
2. Source Audio: Semantic Structure Control
Source audio (src_audio) is used for Cover tasks, performing melodic structure control. Its principle is to quantize your input source audio into semantically structured information.
What Does Source Audio Control?
Source audio is converted into semantically structured information, including:
- Melody: Note direction and pitch
- Rhythm: Beat, accent, groove
- Chords: Harmonic progression and changes
- Orchestration: Instrument arrangement and layers
- Some timbre: Partial timbre information
What Can You Do With It?
- Control style: Maintain source audio structure, change style and details
- Transfer style: Apply source audio structure to different styles
- Retake lottery: Generate similar structure but different variants, get different interpretations through multiple generations
- Control influence degree: Control source audio influence strength through the `audio_cover_strength` parameter (0.0–1.0)
  - Higher strength: generation results more strictly follow the source audio structure
  - Lower strength: generation results have more room for free play
Advanced Cover Usage
You can use Cover to Remix a song, and it supports changing Caption and Lyrics:
Remix creation: Input a song as source audio, reinterpret it by modifying Caption and Lyrics
- Change style: Use different Caption descriptions (e.g., change from pop to rock)
- Change lyrics: Rewrite lyrics with new Lyrics, maintaining original melody structure
- Change emotion: Adjust overall atmosphere through Caption (e.g., change from sad to joyful)
Build complex music structures: Shape melodic direction, layers, and groove according to how much structural influence you need
- Fine-tune structure adherence through `audio_cover_strength`
- Combine Caption and Lyrics modifications to create new expression while maintaining the core structure
- Generate multiple versions, each with a different emphasis on structure, style, or lyrics
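A practical way to work with `audio_cover_strength` is to sweep it and compare how tightly each result sticks to the source. In the sketch below, `generate` is a hypothetical placeholder for however you invoke ACE-Step (Gradio UI, API, or script); only the parameter names follow the ones documented above.

```python
# Sweep audio_cover_strength to compare structure adherence in Cover mode.
# `generate` is a hypothetical placeholder, not the real ACE-Step Python API.
def generate(**params):
    print("would generate with:", params)

for strength in (0.9, 0.6, 0.3):
    generate(
        task_type="cover",
        src_audio="my_song.wav",               # structure source: melody, rhythm, chords
        caption="gritty garage rock, male vocal, raw production",
        lyrics="[Verse]\nNew words over the old melody\n...",
        audio_cover_strength=strength,         # high = follow the source closely, low = freer
    )
```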
3. Source Audio Context-Based Control: Local Completion and Modification
This is the Repaint task, performing completion or modification based on source audio context.
Repaint Principle
Repaint is based on the principle of context completion:
- It can complete the beginning, a local middle section, the ending, or any other region
- Operation range: 3 seconds to 90 seconds
- The model references the source audio's context and generates within the specified interval
What Can You Do With It?
- Local modification: Modify lyrics, structure, or content in a specified interval
- Change lyrics: Keep the melody and orchestration, change only the lyric content
- Change structure: Change the musical structure within a specified interval (e.g., turn a Verse into a Chorus)
- Continue writing: Extend the beginning or ending based on context
- Clone timbre: Clone the source audio's timbre characteristics based on context
Advanced Repaint Usage
You can use Repaint for more complex creative needs:
Infinite duration generation:
- Multiple Repaint operations can continuously extend the audio, effectively achieving unlimited duration
- Each continuation is based on the previous segment's context, maintaining natural transitions and coherence
- You can generate in segments of 3–90 seconds each and finally concatenate them into a complete work
Intelligent audio stitching:
- Intelligently organize and stitch two audio clips together
- Use Repaint at the first clip's ending to continue it, so the transition connects naturally
- Or use Repaint to modify the junction between the two clips for a smooth transition
- The model automatically handles rhythm, harmony, and timbre connections based on context, making the stitched audio sound like one complete work
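As a sketch of the infinite-duration idea, the loop below keeps repainting a window just past the current ending, so each continuation is grounded in the previous context. `repaint` is a hypothetical placeholder for the Repaint task; the real parameter plumbing may differ.

```python
# Extend a track by repeatedly repainting past its current ending.
# `repaint` is a hypothetical placeholder for the Repaint task.
def repaint(src_audio, repainting_start, repainting_end, **kw):
    print(f"repaint {repainting_start}-{repainting_end}s of {src_audio}")
    return src_audio  # placeholder: would return the extended audio

song, length = "seed_clip.wav", 60.0
for _ in range(4):                              # four 30-second continuations
    song = repaint(
        src_audio=song,
        repainting_start=length,                # window starts at the old ending...
        repainting_end=length + 30.0,           # ...and extends 30 s (within the 3-90 s range)
        caption="lo-fi hip-hop, mellow keys, vinyl crackle",
    )
    length += 30.0
```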
4. Base Model Advanced Audio Control Tasks
In the Base model, we also support more advanced audio control tasks:
Lego Task: Intelligently add new tracks based on existing tracks
- Input an existing audio track (e.g., vocals)
- Model intelligently adds new tracks (e.g., drums, guitar, bass, etc.)
- New tracks coordinate with original tracks in rhythm and harmony
Complete Task: Add mixed tracks to single track
- Input a single-track audio (e.g., a cappella vocals)
- Model generates complete mixed accompaniment tracks
- Generated accompaniment matches vocals in style, rhythm, and harmony
These advanced context-completion tasks greatly expand your control options and provide inspiration and creative ideas more intelligently.
The combination of these parameters determines what you "want." We'll explain input control principles and techniques in detail later.
II. Inference Hyperparameters: How Does the Model Generate?
This is the part that affects "generation process behavior"—doesn't change what you want, but changes how the model does it.
DiT (Diffusion Model) Hyperparameters:
| Parameter | Function | Default | Tuning Advice |
|---|---|---|---|
| `inference_steps` | Diffusion steps | 8 (turbo) | More steps = finer but slower. Turbo uses 8, Base uses 32–100 |
| `guidance_scale` | CFG strength | 7.0 | Higher = more prompt adherence, but may overfit. Only effective on the Base model |
| `use_adg` | Adaptive Dual Guidance | False | When enabled, dynamically adjusts CFG; Base model only |
| `cfg_interval_start/end` | CFG effective interval | 0.0–1.0 | Controls at which stage CFG is applied |
| `shift` | Timestep offset | 1.0 | Adjusts the denoising trajectory, affects generation style |
| `infer_method` | Inference method | "ode" | ode is deterministic, sde introduces randomness |
| `timesteps` | Custom timesteps | None | Advanced usage, overrides steps and shift |
| `audio_cover_strength` | Reference audio/codes influence strength | 1.0 | 0.0–1.0, higher = closer to reference, lower = more freedom |
5Hz LM (Language Model) Hyperparameters:
| Parameter | Function | Default | Tuning Advice |
|---|---|---|---|
| `thinking` | Enable CoT reasoning | True | Enable to let the LM reason about metadata and codes |
| `lm_temperature` | Sampling temperature | 0.85 | Higher = more random/creative, lower = more conservative/deterministic |
| `lm_cfg_scale` | LM CFG strength | 2.0 | Higher = more positive-prompt adherence |
| `lm_top_k` | Top-K sampling | 0 | 0 means disabled, limits candidate word count |
| `lm_top_p` | Top-P sampling | 0.9 | Nucleus sampling, limits cumulative probability |
| `lm_negative_prompt` | Negative prompt | "NO USER INPUT" | Tells the LM what not to generate |
| `use_cot_metas` | CoT reason metadata | True | Let the LM auto-infer BPM, key, etc. |
| `use_cot_caption` | CoT rewrite caption | True | Let the LM optimize your description |
| `use_cot_language` | CoT detect language | True | Let the LM auto-detect vocal language |
| `use_constrained_decoding` | Constrained decoding | True | Ensures correct output format |
The combination of these parameters determines how the model "does it."
About Parameter Tuning
It's important to emphasize that tuning factors and random factors sometimes have comparable influence. When you adjust a parameter, it may be hard to tell if it's the parameter's effect or randomness causing the change.
Therefore, we recommend fixing random factors when tuning—by setting a fixed seed value, ensuring each generation starts from the same initial noise, so you can accurately feel the parameter's real impact on generated audio. Otherwise, parameter change effects may be masked by randomness, causing you to misjudge the parameter's role.
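For example, a fixed-seed A/B comparison might look like the sketch below: only `shift` differs between the two runs, so any audible change is attributable to that parameter rather than to a new random starting point. `generate` is again a hypothetical placeholder for your ACE-Step entry point.

```python
# Fixed-seed A/B test: everything is held constant except `shift`.
# `generate` is a hypothetical placeholder, not the real ACE-Step API.
def generate(**params):
    print("would generate with:", params)

base = dict(
    caption="melancholic piano ballad, female breathy vocal",
    lyrics="[Verse]\n...",
    seed=42,                 # fixed seed: identical initial noise for both runs
    infer_method="ode",      # deterministic sampler, no extra SDE randomness
    inference_steps=8,
)
generate(**base, shift=1.0)  # run A
generate(**base, shift=3.0)  # run B: compare against A by ear
```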
III. Random Factors: Sources of Uncertainty
Even with identical inputs and hyperparameters, two generations may produce different results. This is because:
1. DiT's Initial Noise
- Diffusion models start from random noise and gradually denoise
- The `seed` parameter controls this initial noise
- Different seed → different starting point → different endpoint
2. LM's Sampling Randomness
- When `lm_temperature > 0`, the sampling process itself has randomness
3. Additional Noise When infer_method = "sde"
- SDE method injects additional randomness during denoising
Pros and Cons of Random Factors
Randomness is a double-edged sword.
Benefits of Randomness:
- Explore creative space: Same input can produce different variants, giving you more choices
- Discover unexpected surprises: Sometimes randomness brings excellent results you didn't expect
- Avoid repetition: Each generation is different, won't fall into single-pattern loops
Challenges of Randomness:
- Uncontrollable results: You can't precisely predict generation results, may generate multiple times without satisfaction
- Hard to reproduce: Even with identical inputs, hard to reproduce a specific good result
- Tuning difficulty: When adjusting parameters, hard to tell if it's parameter effect or randomness change
- Screening cost: Need to generate multiple versions to find satisfactory ones, increasing time cost
What Mindset to Face Random Factors?
1. Accept Uncertainty
- Randomness is an essential characteristic of AI music generation, not a bug, but a feature
- Don't expect every generation to be perfect; treat randomness as an exploration tool
2. Embrace the Exploration Process
- Treat generation process as "gacha" or "treasure hunting"—try multiple times, always find surprises
- Enjoy discovering unexpectedly good results, rather than obsessing over one-time success
3. Use Fixed Seed Wisely
- When you want to understand parameter effects, fix `seed` to eliminate randomness interference
- When you want to explore creative space, let `seed` vary randomly
4. Batch Generation + Intelligent Screening
- Don't rely on single generation; batch generate multiple versions
- Use automatic scoring mechanisms for initial screening to improve efficiency
Our Solution: Large Batch + Automatic Scoring
Because our inference is extremely fast, if your GPU VRAM is sufficient, you can explore the random space with large batches:
- Batch generation: Generate multiple versions at once (e.g., batch_size=2, 4, 8) to quickly explore the random space
- Automatic scoring mechanism: We provide automatic scoring that helps with initial screening, effectively doing test-time scaling
Automatic Scoring Mechanism
We provide multiple scoring metrics, among which my favorite is DiT Lyrics Alignment Score:
DiT Lyrics Alignment Score: This score implicitly reflects lyric accuracy
- It evaluates how well the lyrics align with the generated audio
- A higher score means the lyrics are placed more accurately in the audio, with a better match between singing and lyrics
- This is particularly important for music with lyrics, and it can help you screen for versions with higher lyric accuracy
Other scoring metrics: We also provide other quality-assessment metrics that evaluate generation results along multiple dimensions
Recommended Workflow:
- Batch generation: Set a larger `batch_size` (e.g., 2, 4, 8) to generate multiple versions at once
- Enable AutoGen: Turn on automatic generation so the system continuously generates new batches in the background
  - AutoGen automatically reuses the same parameters (with a new random seed) to generate the next batch in the background while you review the current one
  - This lets you continuously explore the random space without manually clicking the generate button
  - Each new batch uses a new random seed, ensuring result diversity
- Automatic scoring: Enable automatic scoring, let system automatically score each version
- Initial screening: Screen versions with higher scores based on DiT Lyrics Alignment Score and other metrics
- Manual selection: Manually select the final version that best meets your needs from screened versions
This fully utilizes randomness to explore the creative space while improving efficiency through automation, avoiding a blind search through piles of generated results. AutoGen lets you "generate while listening": while you browse the current results, the next batch is already being prepared in the background.
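Conceptually, the batch-and-screen loop looks like the sketch below. `generate_batch` and `lyrics_alignment_score` are hypothetical placeholders standing in for batch generation and the DiT Lyrics Alignment Score; in practice, AutoGen and auto-scoring in the Gradio UI run this loop for you.

```python
import random

# Conceptual batch-and-screen loop. `generate_batch` and
# `lyrics_alignment_score` are hypothetical placeholders; AutoGen and
# auto-scoring in the Gradio UI implement this workflow for you.
def generate_batch(batch_size, seed, **params):
    return [f"take_{seed}_{i}.wav" for i in range(batch_size)]

def lyrics_alignment_score(audio):
    return random.random()                       # placeholder score in [0, 1]

candidates = []
for _ in range(3):                               # a few AutoGen-style rounds
    seed = random.randint(0, 2**31 - 1)          # fresh seed per batch for diversity
    batch = generate_batch(batch_size=4, seed=seed,
                           caption="uplifting synth-pop, female vocal",
                           lyrics="[Chorus]\nWe rise together\n...")
    candidates += [(lyrics_alignment_score(a), a) for a in batch]

shortlist = sorted(candidates, reverse=True)[:5] # keep the best-scoring takes
print(shortlist)                                 # then pick the final version by ear
```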
Conclusion
This tutorial currently covers ACE-Step 1.5's core concepts and usage methods:
- Mental Models: Understanding human-centered generation design philosophy
- Model Architecture: Understanding how LM and DiT work together
- Input Control: Mastering text (Caption, Lyrics, metadata) and audio (reference audio, source audio) control methods
- Inference Hyperparameters: Understanding parameters affecting generation process
- Random Factors: Learning to use randomness to explore creative space, improving efficiency through Large Batch + AutoGen + Automatic Scoring
This is just the beginning. There's much more content we want to share with you:
- More Prompting tips and practical cases
- Detailed usage guides for different task types
- Advanced techniques and creative workflows
- Common issues and solutions
- Performance optimization suggestions
This tutorial will continue to be updated and improved. If you have any questions or suggestions during use, feedback is welcome. Let's make ACE-Step your creative partner in your pocket together.
To be continued...