OmniStep 12A3B

OmniStep 12A3B — Vibe Coach · Notes · Music · Conversation

Genetically Evolving Omnimodal Real Time Voice to Streaming Voice and Music Assistant

A multimodal voice-and-music AI, born from generational Darwin family evolution. OmniStep 12A3B is a personal AI companion — a vibe coach that takes notes for you, keeps up the conversation, and plays background music that matches your mood. All the while it self-evolves to become a better assistant to you, via the Darwin Family weight-space recombination methodology (arXiv:2605.14386). Built as a paper-exact 2-parent merge of Qwen2.5-Omni-3B (multimodal) and ACE-Step v1.5 XL SFT 4B (text-to-music).

The text body of the model was produced by a paper-exact Darwin 2-parent weight-space recombination of the Qwen2.5-Omni thinker and the ACE-Step text encoder. The Architecture Mapper's "skip on dim mismatch" behavior preserves the Omni text body intact across the Qwen2.5/Qwen3 cross-architecture boundary. The diffusion (music) head stays at F16 (unquantized) for maximum audio quality. The transformer (text/multimodal) head ships as four GGUF files for llama.cpp users: unquantized F16 plus three quantizations (Q8_0, Q4_K_M, Q4_0).

The OmniStep Evolutionary Radio is the operational version of "infinitely generate its own background music" — a 4-loop pipeline (playback + queue fill + GEPA prompt evolution + Darwin weight evolution) wired up in the evolutionary-radio skill.
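The first two of those loops (queue fill and playback) form a producer/consumer pair. The following is a minimal sketch of that pairing only, assuming a bounded queue sits between them. `generate_track`, `play`, and `run_radio` are hypothetical stand-ins, not functions from the evolutionary-radio skill, and the two evolution loops are left out.

```python
import queue
import threading


def fill_loop(q, prompts, generate_track):
    """Queue-fill loop: render one track per prompt and enqueue it."""
    for prompt in prompts:
        q.put(generate_track(prompt))  # blocks while the queue is full
    q.put(None)  # sentinel: nothing more to play


def playback_loop(q, play):
    """Playback loop: play tracks in order until the sentinel arrives."""
    while True:
        track = q.get()
        if track is None:
            break
        play(track)


def run_radio(prompts, generate_track, play, max_queued=2):
    """Run the fill loop in a thread and the playback loop in the caller."""
    q = queue.Queue(maxsize=max_queued)
    producer = threading.Thread(target=fill_loop, args=(q, prompts, generate_track))
    producer.start()
    playback_loop(q, play)
    producer.join()


if __name__ == "__main__":
    played = []
    # Hypothetical generator and player: strings stand in for audio.
    run_radio(["lofi", "ambient", "metal"], lambda p: f"<audio:{p}>", played.append)
    print(played)
```

The bounded queue keeps generation only a couple of tracks ahead of playback, so a change of prompt takes effect after at most `max_queued` tracks.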

🎧 Listen to the examples

1. 🎵 Lo-Fi — chill lofi beats for late-night coding

chill lofi beats, mellow hip-hop, soft piano keys, vinyl crackle, late-night study vibes, 75 bpm, instrumental

🎤 Voice intro (text generated by OmniStep 12A3B, speech by Soprano 80M): 01_lofi_chill_voice.wav

🎵 The track: 01_lofi_chill.wav

2. 🎬 Movie Orchestra — epic cinematic orchestral

epic cinematic orchestral soundtrack, sweeping strings, French horns, building tension, Hans Zimmer style, 90 bpm, instrumental

🎤 Voice intro: 02_movie_orchestra_voice.wav

🎵 The track: 02_movie_orchestra.wav

3. 🔥 Dark Metal — heavy dark metal for dark times

heavy dark metal, blast beats, down-tuned 7-string guitars, atmospheric, blackened death metal, 180 bpm, instrumental

🎤 Voice intro: 04_dark_metal_voice.wav

🎵 The track: 04_dark_metal.wav

  • 🧠 Vibe coach — reads the room, matches your mood, suggests what to play next
  • 📝 Note-taker — listens to your conversation and captures the bits you want to remember
  • 💬 Conversational companion — keeps up a real back-and-forth, asks the follow-up questions
  • 🎵 Background music that matches the vibe — generates infinite music that fits what you're doing, in any style
  • 🔁 Self-evolving — gets better at being your assistant over time, via the Darwin Family weight-space evolution methodology
  • 🎤 Real-time ASR + TTS — Whisper audio in, Talker + token2wav audio out (4o-style streaming voice)
  • 🖼️ Image understanding — NaViT vision encoder

All in one model. Run it with vllm, llama-server, or the included Python scripts.

🎛 Pick your quantization — download just the one you need

The GGUFs are independent files. Download only the one that fits your VRAM — you don't need all of them. Pick from the table below.

| Quant | Size | VRAM | Best for | Download |
|---|---|---|---|---|
| F16 | 6.4GB | 6.4GB | Maximum quality, plenty of VRAM | omnistep-12a3b-f16.gguf |
| Q8_0 | 3.4GB | 3.4GB | Near-F16 quality, balanced | omnistep-12a3b-q8_0.gguf |
| Q4_K_M | 2.0GB | 2.0GB | Recommended: best size/quality tradeoff | omnistep-12a3b-q4_k_m.gguf |
| Q4_0 | 1.9GB | 1.9GB | Smallest, lowest quality | omnistep-12a3b-q4_0.gguf |

Run any of them with llama-server (the Omni build of llama.cpp is in the HF model comments / wiki):

llama-server -m omnistep-12a3b-q4_k_m.gguf -ngl 99 --port 8080 --host 0.0.0.0 -c 8192
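Choosing a file from the table can be automated. Here is a small sketch that picks the largest GGUF fitting a VRAM budget, using the sizes listed above. `pick_quant` is a hypothetical helper, and the `headroom_gb` default of 1.0 is an assumed allowance for the KV cache and compute buffers, not a measured figure.

```python
# Sizes in GB, copied from the quantization table; ordered largest first.
QUANTS = [
    ("F16", 6.4, "omnistep-12a3b-f16.gguf"),
    ("Q8_0", 3.4, "omnistep-12a3b-q8_0.gguf"),
    ("Q4_K_M", 2.0, "omnistep-12a3b-q4_k_m.gguf"),
    ("Q4_0", 1.9, "omnistep-12a3b-q4_0.gguf"),
]


def pick_quant(free_vram_gb, headroom_gb=1.0):
    """Return (name, filename) of the largest GGUF that fits, or None."""
    budget = free_vram_gb - headroom_gb  # headroom_gb is an assumption
    for name, size_gb, filename in QUANTS:
        if size_gb <= budget:
            return name, filename
    return None


if __name__ == "__main__":
    for vram in (8.0, 5.0, 4.0, 2.0):
        print(vram, pick_quant(vram))
```

Returning `None` rather than the smallest file makes the "nothing fits" case explicit, so a caller can fall back to CPU offload (a lower `-ngl`) deliberately.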

Quick start

Option 1 — vllm (the easiest, full multimodal)

pip install vllm
vllm serve sovthpaw/omnistep-12a3b \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code

Then in another terminal:

curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "sovthpaw/omnistep-12a3b",
        "messages": [{"role": "user", "content": "Take a note: I need to follow up with the design team about the Q3 launch."}],
        "max_tokens": 200
    }'

See the vLLM Qwen2.5-Omni documentation for the full multimodal API.
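The same request can be sent from Python with only the standard library. This sketch mirrors the curl call above; `build_chat_request` and `chat` are hypothetical helper names, and the response parsing assumes the usual OpenAI-compatible shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(prompt, model="sovthpaw/omnistep-12a3b",
                       base_url="http://localhost:8000", max_tokens=200):
    """Build the POST request for /v1/chat/completions without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        reply = json.load(resp)
    # Assumes the OpenAI-compatible response layout.
    return reply["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Take a note: I need to follow up with the design team "
               "about the Q3 launch."))
```

Passing `base_url="http://localhost:8080"` and `model="omnistep-12a3b"` should point the same helper at the llama-server setup in Option 2.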

Option 2 — llama-server with the GGUFs (fast text path)

# Pick your quantization based on VRAM
#   F16  = 6.4GB VRAM, best quality
#   Q8_0 = 3.4GB VRAM, near-F16 quality
#   Q4_K_M = 2.0GB VRAM, recommended
#   Q4_0 = 1.9GB VRAM, smallest

llama-server \
    -m omnistep-12a3b-q4_k_m.gguf \
    -ngl 99 \
    --port 8080 \
    --host 0.0.0.0 \
    -c 8192

curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "omnistep-12a3b",
        "messages": [{"role": "user", "content": "I am coding late, give me a lofi vibe and a note about what I should focus on tomorrow."}],
        "max_tokens": 200,
        "stream": true
    }'

The 4 GGUFs are deployment options — pick whichever fits your hardware. Q4_K_M is the recommended sweet spot.
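With "stream": true, the reply arrives as a series of `data:` lines rather than one JSON body. The sketch below reassembles the text, assuming the server follows the OpenAI-style chunk layout (`choices[0].delta.content`, terminated by `data: [DONE]`). `iter_stream_text` is a hypothetical helper name.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style streaming lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


if __name__ == "__main__":
    # Hand-written sample chunks, not captured server output.
    sample = [
        b'data: {"choices":[{"delta":{"role":"assistant"}}]}\n',
        b"\n",
        b'data: {"choices":[{"delta":{"content":"Lo"}}]}\n',
        b'data: {"choices":[{"delta":{"content":"fi"}}]}\n',
        b"data: [DONE]\n",
    ]
    print("".join(iter_stream_text(sample)))
```

Because the function accepts any iterable of lines, the response object returned by `urllib.request.urlopen` can be passed to it directly.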

Option 3 — the included Python scripts (the most fun)

The repo includes Python scripts that wire everything together for the headline use cases. After cloning:

git clone https://huggingface.co/sovthpaw/omnistep-12a3b
cd omnistep-12a3b

# Start the vllm server (one terminal)
python scripts/run_omnistep_12a3b.py serve

# In another terminal — try the modalities
python scripts/run_omnistep_12a3b.py text "Take a note: follow up with design team about Q3."
python scripts/run_omnistep_12a3b.py music "chill lofi beats" --output ~/music/track.wav
python scripts/run_omnistep_12a3b.py music-loop "chill lofi beats"   # infinite background music
python scripts/run_omnistep_12a3b.py voice                            # streaming voice assistant

Or use the omnistep-jammit bash wrapper for music:

./scripts/omnistep-jammit "aggressive metal, 808s, dark" --duration 120
./scripts/omnistep-jammit "warm ambient pad" --infinite   # infinite background music

File layout

sovthpaw/omnistep-12a3b/
├── cover.png                                # hero image
├── README.md                                # this file
├── config.json                              # Qwen2.5-Omni config + ACE music decoder sub-config
├── tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json, added_tokens.json, chat_template.json
├── preprocessor_config.json
├── configuration_acestep_v15.py             # ACE-Step modeling code (for the music head)
├── modeling_acestep_v15_xl_base.py
├── apg_guidance.py
│
├── model-00001-of-00004.safetensors         # 5.37GB (full multimodal safetensors)
├── model-00002-of-00004.safetensors         # 5.36GB
├── model-00003-of-00004.safetensors         # 5.36GB
├── model-00004-of-00004.safetensors         # 4.97GB
├── model.safetensors.index.json             # the FULL multimodal safetensors (Omni + ACE music head, 21GB)
│
├── omnistep-12a3b-f16.gguf                  # 6.4GB GGUF (text body + multimodal heads, F16)
├── omnistep-12a3b-q8_0.gguf                 # 3.4GB GGUF (Q8_0 quantization)
├── omnistep-12a3b-q4_k_m.gguf               # 2.0GB GGUF (Q4_K_M quantization, recommended)
├── omnistep-12a3b-q4_0.gguf                 # 1.9GB GGUF (Q4_0 quantization, smallest)
│
├── 01_lofi_chill.wav                        # example audio (Lo-Fi track)
├── 02_movie_orchestra.wav                   # example audio (Movie Orchestra track)
├── 04_dark_metal.wav                        # example audio (Dark Metal track)
│
├── 01_lofi_chill_voice.wav                  # example voice intro (Lo-Fi description by the vibe coach)
├── 02_movie_orchestra_voice.wav             # example voice intro (Movie Orchestra description)
├── 04_dark_metal_voice.wav                  # example voice intro (Dark Metal description)
│
├── scripts/
│   ├── run_omnistep_12a3b.py                # main entry: text, music, music-loop, voice, serve
│   ├── omnistep_radio.py                    # the 4-loop Evolutionary Radio
│   ├── omnistep_voice.py                    # streaming voice assistant
│   ├── omnistep-jammit                      # bash wrapper for music gen

The GGUFs are the quantized-transformer deployment path (text body + multimodal heads, fast on llama.cpp). The safetensors are the complete model including the diffusion (music) head at F16 unquantized for max audio quality. The Python scripts wire both together.

Architecture — the correct lineup

OmniStep 12A3B — one complete model, paper-exact Darwin merge + attached heads
│
├── INPUT (full speech I/O lives inside the model)
│   ├── whisper_audio_in    (Qwen2.5-Omni's Whisper-style encoder — streaming ASR)
│   └── navit_vision_in     (Qwen2.5-Omni's NaViT — image understanding)
│
├── TEXT REASONING
│   ├── text_backbone       (Qwen2.5-Omni Thinker, 36L, h=2048 — paper-exact Darwin merge with ACE encoder)
│   │                        Its internal state carries conversation context into both the speech
│   │                        output and the music output, so the music is mood-aware from
│   │                        what the LLM is "thinking"
│   └── ace_encoder_attached (ACE-Step text encoder — kept as a separate module since
│                              the Qwen2.5/Qwen3 cross-architecture boundary prevents 1:1 FFN blend)
│
├── OUTPUT 1 — speech (two paths, pick the one that fits)
│   ├── talker + token2wav  (Qwen2.5-Omni's built-in TTS — lowest latency, lives inside the model)
│   └── external Soprano 80M  (chunked-streaming TTS for the user's preferred voice character)
│                            ↑ text is cut on sentence boundaries and sent in parallel as
│                              the LLM is still generating later sentences
│
└── OUTPUT 2 — music (the Darwin family FFN-blend destination)
    └── ace_music_decoder   (ACE-Step v1.5 XL 4B DiT — text → music, continuous background music)
                             (F16, unquantized, lives in the safetensors)
                             The Darwin family methodology is applied HERE rather than to a
                             TTS model: by blending the LLM FFN weights into the music head,
                             the music output picks up the conversation's emotional state.
                             That's why the background music can feel like it's "aware" of
                             the conversation — it is, by construction.
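The external Soprano path above cuts text on sentence boundaries while the LLM is still generating. One way to do that cut is sketched here, under the simplifying assumption that a sentence ends at `.`, `!`, or `?` followed by whitespace. `sentences_from_stream` is a hypothetical name, not part of the included scripts.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(fragments):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]  # keep the unfinished tail for the next fragment
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


if __name__ == "__main__":
    stream = ["Here is a lofi ", "track. It has soft ", "piano keys! Enjoy", "."]
    for sentence in sentences_from_stream(stream):
        print(sentence)  # each sentence could be handed to the TTS here
```

Each yielded sentence can go to the TTS immediately, which is what lets speech for the first sentence start before the last one has been generated. Abbreviations such as "e.g. " would be split too early by this simple rule.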

The "one model that produces and listens" principle

The whole point of putting all of this in one model is so the audio-output side and the audio-input side are the same running process. When you speak:

  1. The Whisper encoder hears you mid-generation
  2. The LLM gets interrupted (the same model that's been streaming text)
  3. Both the TTS stream and the music stream can be cut off by the model itself
  4. The LLM pivots to your new input — and because the music head shares the LLM's FFN-blended state, the background music shifts to match the new mood in the same inference step

No external orchestrator, no second model watching the first one. One model, one attention state, one set of weights — that's what makes it feel like a real-time conversation.

What about Darwin-TTS? (The cross-modal FFN blend that isn't used here)

Darwin-TTS-1.7B-Cross is the same Darwin Family framework applied to speech synthesis: it blends 3% of Qwen3-1.7B (LLM) FFN weights into Qwen3-TTS-1.7B (TTS) to add emotional expressiveness without any training. It is not used in OmniStep 12A3B because:

  • OmniStep's text body is Qwen2.5-3B class, not Qwen3 — different hidden/intermediate sizes
  • Qwen2.5-Omni-3B already has full speech I/O (Whisper ASR + Talker+token2wav TTS) — Darwin-TTS is purely a TTS model with no STT, so it would be redundant
  • The Darwin family weight-blend methodology is more valuable here applied to the music path (no other model is providing music — the DiT needed a way to inherit conversation mood, and FFN blending is exactly that mechanism)

The Darwin-TTS-Cross result is still useful as a research reference: it shows the FFN-blend approach is stable at 3–5% and degrades quickly at 10%. That makes 3–5% the range to experiment with on the music DiT for fine-grained mood control.
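The FFN blend described in this section amounts to a linear interpolation applied only to feed-forward tensors. The following toy sketch uses flat Python lists in place of real tensors. The `.mlp.` name filter is an assumption about parameter naming, `blend_ffn` is a hypothetical helper, and none of this is the Darwin-TTS code.

```python
def blend_ffn(target_weights, llm_weights, ratio=0.03, ffn_marker=".mlp."):
    """Return target weights with `ratio` of the LLM's FFN weights mixed in.

    Both arguments map parameter names to flat lists of floats.
    """
    blended = {}
    for name, target_values in target_weights.items():
        llm_values = llm_weights.get(name)
        can_blend = (
            ffn_marker in name                 # FFN tensors only (assumed naming)
            and llm_values is not None
            and len(llm_values) == len(target_values)  # skip on size mismatch
        )
        if can_blend:
            blended[name] = [
                (1.0 - ratio) * t + ratio * l
                for t, l in zip(target_values, llm_values)
            ]
        else:
            blended[name] = list(target_values)  # attention, embeddings: untouched
    return blended


if __name__ == "__main__":
    target = {"layers.0.mlp.up_proj": [1.0, 2.0], "layers.0.attn.q_proj": [1.0]}
    llm = {"layers.0.mlp.up_proj": [2.0, 4.0], "layers.0.attn.q_proj": [9.0]}
    print(blend_ffn(target, llm, ratio=0.03))
```

Sweeping `ratio` over a few values between 0.03 and 0.05 is the kind of experiment the paragraph above has in mind.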

For the vibe-coach / personal-companion use case, typical requests flow as follows:

  • "Take a note: I need to follow up with the design team" → whisper → thinker (note-taker) → talker (voice confirmation)
  • "I'm coding late, give me a lofi vibe" → thinker (vibe-coach) → ace_music_decoder (matching background music)
  • "Tell me about the Darwin Family paper" → thinker (conversational) → talker (voice response)

Parents (the Darwin merge)

| Parent | Role | License |
|---|---|---|
| Qwen2.5-Omni-3B | Multimodal text+speech+vision, Qwen2.5-3B class text body | Apache 2.0 |
| ACE-Step v1.5 XL SFT 4B | 4B DiT for text-to-music, Qwen3-class text encoder | Apache 2.0 |

Merge methodology — paper-exact

Following arXiv:2605.14386 (Kim et al., 2026 — "Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning"). No modifications.

θM(T) = (1 - r_final(T)) · θA(T) + r_final(T) · θB(T)
r_final(T) = τ · r_MRI(T) + (1 - τ) · r_genome(T)
r_MRI(T) = MRI_B(T) / (MRI_A(T) + MRI_B(T))
MRI(T)  = α · Static(T) + (1 - α) · Probe(T),  α = 0.5 (paper-fixed)

The fixed starting genome used (paper-recommended):

{
  "gamma": 0.55, "alpha_attn": 0.55, "alpha_ffn": 0.50, "alpha_emb": 0.50,
  "rho_a": 0.45, "rho_b": 0.45,
  "r0": 0.50, "r1": 0.50, "r2": 0.50, "r3": 0.50, "r4": 0.50, "r5": 0.50,
  "tau": 0.40, "lambda_reg": 0.50
}
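The four formulas above can be traced for a single tensor T with a few lines of Python. This is a worked sketch, not the paper's implementation: the Static and Probe scores are made-up inputs, tensors are flat lists, and using 0.50 as r_genome(T) (the value of every r0 to r5 entry in the genome) is an assumption about how the genome maps to tensors.

```python
def mri(static, probe, alpha=0.5):
    """MRI(T) = alpha * Static(T) + (1 - alpha) * Probe(T)."""
    return alpha * static + (1.0 - alpha) * probe


def r_final(mri_a, mri_b, r_genome, tau):
    """r_final(T) = tau * r_MRI(T) + (1 - tau) * r_genome(T)."""
    r_mri = mri_b / (mri_a + mri_b)
    return tau * r_mri + (1.0 - tau) * r_genome


def merge_tensor(theta_a, theta_b, r):
    """theta_M(T) = (1 - r) * theta_A(T) + r * theta_B(T), element-wise."""
    return [(1.0 - r) * a + r * b for a, b in zip(theta_a, theta_b)]


if __name__ == "__main__":
    # Made-up Static/Probe scores for parents A and B.
    mri_a = mri(static=0.8, probe=0.4)
    mri_b = mri(static=0.2, probe=0.6)
    # tau comes from the genome above; r_genome = 0.50 is an assumption.
    r = r_final(mri_a, mri_b, r_genome=0.50, tau=0.40)
    print(r)
    print(merge_tensor([0.0, 2.0], [1.0, 4.0], r))
```

With these inputs MRI_A is 0.6 and MRI_B is 0.4, so r_MRI is 0.4 and r_final is 0.40 · 0.4 + 0.60 · 0.50 = 0.46, up to floating-point rounding. The merged tensor therefore leans slightly toward parent A.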

What's in the box — full modality support

Every modality is in the model. Here's exactly how to use each:

| Modality | Path | How to invoke |
|---|---|---|
| Text reasoning / conversation | GGUF or safetensors | python scripts/run_omnistep_12a3b.py text "..." (via vllm or llama-server) |
| Note-taking | via text reasoning | "Take a note: ..." |
| Vibe coaching | via text reasoning | "I just finished a workout, give me a recovery vibe" |
| Text-to-music (one-shot) | safetensors (ACE DiT, F16) | python scripts/run_omnistep_12a3b.py music "..." --output X.wav |
| Infinite background music | safetensors + mpv + radio loop | python scripts/run_omnistep_12a3b.py music-loop "..." |
| Streaming ASR (audio in) | safetensors via vllm | python scripts/run_omnistep_12a3b.py voice |
| Streaming TTS (audio out) | safetensors via vllm | python scripts/run_omnistep_12a3b.py voice |
| Image understanding | safetensors via vllm | vllm image_url field in chat completions |
| All modalities in one process | safetensors via vllm | vllm with --max-model-len 32768 |
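For the image-understanding row, the `image_url` field goes inside a list-valued message `content`. The sketch below builds such a request body, assuming the OpenAI-compatible message layout and a base64 data URL for a local image. `image_message` and `image_request_body` are hypothetical helper names.

```python
import base64
import json


def image_message(prompt, image_bytes, mime="image/png"):
    """Build a user message carrying text plus one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }


def image_request_body(prompt, image_bytes,
                       model="sovthpaw/omnistep-12a3b", max_tokens=200):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [image_message(prompt, image_bytes)],
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    # Placeholder bytes; a real call would read an image file instead.
    body = image_request_body("What is on this album cover?", b"\x89PNG")
    print(json.dumps(body, indent=2))
```

An `http(s)` URL can be used in place of the data URL when the image is already hosted somewhere the server can reach.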

Citation

@misc{omnistep-12a3b-2026,
  title  = {OmniStep 12A3B: A Darwin Family paper-exact weight-space recombination of Qwen2.5-Omni-3B and ACE-Step v1.5 XL SFT 4B — a vibe-coach voice assistant with infinite background music that self-evolves in generational Darwin family evolution},
  author = {SouthpawIN},
  year   = {2026},
  url    = {https://huggingface.co/sovthpaw/omnistep-12a3b},
  note   = {Built with Nous Girl (Hermes Agent); Darwin Family paper: arXiv:2605.14386}
}

@article{kim2026darwinfamily,
  title  = {Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning},
  author = {Kim, T. and others},
  year   = {2026},
  eprint = {2605.14386},
  archivePrefix = {arXiv},
  url    = {https://arxiv.org/abs/2605.14386}
}

@software{ace-step-1.5,
  title  = {ACE-Step 1.5: Open-Source Music Generation Foundation Model},
  author = {ACE Studio and StepFun},
  year   = {2026},
  url    = {https://github.com/ace-step/ACE-Step-1.5}
}

@software{qwen2.5-omni,
  title  = {Qwen2.5-Omni Technical Report},
  author = {Alibaba Qwen team},
  year   = {2025},
  url    = {https://qwenlm.github.io/blog/qwen2.5-omni/}
}

@software{soprano-80m,
  title  = {Soprano-80M: A Compact Flow-Matching Text-to-Speech Model},
  author = {Soprano TTS contributors},
  year   = {2026},
  url    = {https://huggingface.co/ekwek/Soprano-80M-en}
}

License

Apache 2.0 (inherited from both parents).
