Instructions to use sovthpaw/omnistep-12a3b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use sovthpaw/omnistep-12a3b with Transformers:

# Load model directly
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("sovthpaw/omnistep-12a3b")
model = AutoModel.from_pretrained("sovthpaw/omnistep-12a3b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use sovthpaw/omnistep-12a3b with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="sovthpaw/omnistep-12a3b",
	filename="omnistep-12a3b-f16.gguf",
)

llm.create_chat_completion(
	# messages must be a list of role/content dicts, not a bare string
	messages=[
		{"role": "user", "content": "Who are you?"},
	]
)

Local Apps

llama.cpp

How to use sovthpaw/omnistep-12a3b with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sovthpaw/omnistep-12a3b:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf sovthpaw/omnistep-12a3b:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sovthpaw/omnistep-12a3b:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf sovthpaw/omnistep-12a3b:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sovthpaw/omnistep-12a3b:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf sovthpaw/omnistep-12a3b:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sovthpaw/omnistep-12a3b:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf sovthpaw/omnistep-12a3b:Q4_K_M

Use Docker

docker model run hf.co/sovthpaw/omnistep-12a3b:Q4_K_M

Ollama
How to use sovthpaw/omnistep-12a3b with Ollama:
```
ollama run hf.co/sovthpaw/omnistep-12a3b:Q4_K_M
```

Unsloth Studio

How to use sovthpaw/omnistep-12a3b with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sovthpaw/omnistep-12a3b to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sovthpaw/omnistep-12a3b to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for sovthpaw/omnistep-12a3b to start chatting

Docker Model Runner
How to use sovthpaw/omnistep-12a3b with Docker Model Runner:
```
docker model run hf.co/sovthpaw/omnistep-12a3b:Q4_K_M
```

Lemonade

How to use sovthpaw/omnistep-12a3b with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull sovthpaw/omnistep-12a3b:Q4_K_M

Run and chat with the model

lemonade run user.omnistep-12a3b-Q4_K_M

List all available models

lemonade list

OmniStep 12A3B

Genetically Evolving Omnimodal Real Time Voice to Streaming Voice and Music Assistant

A multimodal voice-and-music AI, born from generational Darwin family evolution. OmniStep 12A3B is a personal AI companion: a vibe coach that takes notes for you, keeps up the conversation, and plays background music that matches your mood. All the while, it self-evolves to become a better assistant to you via the Darwin Family weight-space recombination methodology (arXiv:2605.14386). It is built as a paper-exact 2-parent merge of Qwen2.5-Omni-3B (multimodal) and ACE-Step v1.5 XL SFT 4B (text-to-music).

The text body of the model was produced by a paper-exact Darwin 2-parent weight-space recombination of the Qwen2.5-Omni thinker and the ACE-Step text encoder. The Architecture Mapper's "skip on dim mismatch" behavior preserves the Omni text body intact across the Qwen2.5/Qwen3 cross-architecture boundary. The diffusion (music) head stays at F16 (unquantized) for maximum audio quality. The transformer (text/multimodal) part ships as four GGUF files for llama.cpp users: an unquantized F16 and three quantizations (Q8_0, Q4_K_M, Q4_0).

The OmniStep Evolutionary Radio is the operational version of "infinitely generate its own background music" — a 4-loop pipeline (playback + queue fill + GEPA prompt evolution + Darwin weight evolution) wired up in the evolutionary-radio skill.

🎧 Listen to the examples

1. 🎵 Lo-Fi — chill lofi beats for late-night coding

chill lofi beats, mellow hip-hop, soft piano keys, vinyl crackle, late-night study vibes, 75 bpm, instrumental

🎤 Voice intro (text generated by OmniStep 12A3B, speech by Soprano 80M)

🎵 The track

2. 🎬 Movie Orchestra — epic cinematic orchestral

epic cinematic orchestral soundtrack, sweeping strings, French horns, building tension, Hans Zimmer style, 90 bpm, instrumental

🎤 Voice intro

🎵 The track

3. 🔥 Dark Metal — heavy dark metal for dark times

heavy dark metal, blast beats, down-tuned 7-string guitars, atmospheric, blackened death metal, 180 bpm, instrumental

🎤 Voice intro

🎵 The track

🧠 Vibe coach — reads the room, matches your mood, suggests what to play next
📝 Note-taker — listens to your conversation and captures the bits you want to remember
💬 Conversational companion — keeps up a real back-and-forth, asks the follow-up questions
🎵 Background music that matches the vibe — generates infinite music that fits what you're doing, in any style
🔁 Self-evolving — gets better at being your assistant over time, via the Darwin Family weight-space evolution methodology
🎤 Real-time ASR + TTS — Whisper audio in, Talker + token2wav audio out (4o-style streaming voice)
🖼️ Image understanding — NaViT vision encoder

All in one model. Run it with vllm, llama-server, or the included Python scripts.

🎛 Pick your quantization — download just the one you need

The GGUFs are independent files. Download only the one that fits your VRAM — you don't need all of them. Pick from the table below.

| Quant | Size | VRAM | Best for | Download |
|---|---|---|---|---|
| F16 | 6.4GB | 6.4GB | Maximum quality, plenty of VRAM | ⬇ `omnistep-12a3b-f16.gguf` |
| Q8_0 | 3.4GB | 3.4GB | Near-F16 quality, balanced | ⬇ `omnistep-12a3b-q8_0.gguf` |
| Q4_K_M | 2.0GB | 2.0GB | Recommended: best size/quality tradeoff | ⬇ `omnistep-12a3b-q4_k_m.gguf` |
| Q4_0 | 1.9GB | 1.9GB | Smallest, lowest quality | ⬇ `omnistep-12a3b-q4_0.gguf` |

Run any of them with llama-server (the Omni build of llama.cpp is in the HF model comments / wiki):

llama-server -m omnistep-12a3b-q4_k_m.gguf -ngl 99 --port 8080 --host 0.0.0.0 -c 8192
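
If you would rather script the choice of file, here is a minimal sketch that picks the largest GGUF fitting a VRAM budget. The sizes and filenames are the ones listed for the four GGUFs on this page. The `pick_quant` helper and its `headroom_gb` parameter are hypothetical: the headroom is an assumed allowance for the KV cache and context buffers, which the listed VRAM figures do not appear to include.

```python
# (name, size in GB, filename), ordered from largest to smallest.
# Sizes are the figures listed on this page and are treated as minimum VRAM.
QUANTS = [
    ("F16", 6.4, "omnistep-12a3b-f16.gguf"),
    ("Q8_0", 3.4, "omnistep-12a3b-q8_0.gguf"),
    ("Q4_K_M", 2.0, "omnistep-12a3b-q4_k_m.gguf"),
    ("Q4_0", 1.9, "omnistep-12a3b-q4_0.gguf"),
]


def pick_quant(free_vram_gb, headroom_gb=1.0):
    """Return (name, filename) of the largest quant that fits, or None.

    headroom_gb is an assumed reserve for KV cache and buffers; tune it
    for your context length.
    """
    budget = free_vram_gb - headroom_gb
    for name, size_gb, filename in QUANTS:
        if size_gb <= budget:
            return name, filename
    return None


print(pick_quant(8.0))
print(pick_quant(4.0))
```

With the default 1 GB headroom, an 8 GB card selects F16 and a 4 GB card falls through to Q4_K_M.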

Quick start

Option 1 — vllm (the easiest, full multimodal)

pip install vllm
vllm serve sovthpaw/omnistep-12a3b \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code

Then in another terminal:

curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "sovthpaw/omnistep-12a3b",
        "messages": [{"role": "user", "content": "Take a note: I need to follow up with the design team about the Q3 launch."}],
        "max_tokens": 200
    }'

See vllm Qwen2.5-Omni docs for the full multimodal API.
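
The same request can be sent from Python using only the standard library. This is a sketch, not part of the repo: `build_chat_request` and `ask` are hypothetical helper names, and the base URL assumes the default vllm port used in the curl example.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="sovthpaw/omnistep-12a3b", max_tokens=200):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the reply text. Needs a running server."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, once `vllm serve` is up:
# print(ask("Take a note: I need to follow up with the design team about the Q3 launch."))
```

Keeping request construction separate from the network call makes it easy to inspect the payload before a server is available.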

Option 2 — llama-server with the GGUFs (fast text path)

# Pick your quantization based on VRAM
#   F16  = 6.4GB VRAM, best quality
#   Q8_0 = 3.4GB VRAM, near-F16 quality
#   Q4_K_M = 2.0GB VRAM, recommended
#   Q4_0 = 1.9GB VRAM, smallest

llama-server \
    -m omnistep-12a3b-q4_k_m.gguf \
    -ngl 99 \
    --port 8080 \
    --host 0.0.0.0 \
    -c 8192

curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "omnistep-12a3b",
        "messages": [{"role": "user", "content": "I am coding late, give me a lofi vibe and a note about what I should focus on tomorrow."}],
        "max_tokens": 200,
        "stream": true
    }'
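
With `"stream": true`, an OpenAI-compatible server returns server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The sketch below shows one way to pull the text out of those lines. `iter_stream_text` is a hypothetical helper, and the sample lines are hand-written for illustration, not captured from this model.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style streaming response lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and SSE comments carry no payload
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample of what a stream might look like (illustrative only).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Lo-fi "}}]}',
    'data: {"choices": [{"delta": {"content": "it is."}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In practice the `lines` argument would be the decoded lines of the HTTP response body, read as they arrive.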

The 4 GGUFs are deployment options — pick whichever fits your hardware. Q4_K_M is the recommended sweet spot.

Option 3 — the included Python scripts (the most fun)

The repo includes Python scripts that wire everything together for the headline use cases. After cloning:

git clone https://huggingface.co/sovthpaw/omnistep-12a3b
cd omnistep-12a3b

# Start the vllm server (one terminal)
python scripts/run_omnistep_12a3b.py serve

# In another terminal — try the modalities
python scripts/run_omnistep_12a3b.py text "Take a note: follow up with design team about Q3."
python scripts/run_omnistep_12a3b.py music "chill lofi beats" --output ~/music/track.wav
python scripts/run_omnistep_12a3b.py music-loop "chill lofi beats"   # infinite background music
python scripts/run_omnistep_12a3b.py voice                            # streaming voice assistant

Or use the omnistep-jammit bash wrapper for music:

./scripts/omnistep-jammit "aggressive metal, 808s, dark" --duration 120
./scripts/omnistep-jammit "warm ambient pad" --infinite   # infinite background music

File layout

sovthpaw/omnistep-12a3b/
├── cover.png                                # hero image
├── README.md                                # this file
├── config.json                              # Qwen2.5-Omni config + ACE music decoder sub-config
├── tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json, added_tokens.json, chat_template.json
├── preprocessor_config.json
├── configuration_acestep_v15.py             # ACE-Step modeling code (for the music head)
├── modeling_acestep_v15_xl_base.py
├── apg_guidance.py
│
├── model-00001-of-00004.safetensors         # 5.37GB (full multimodal safetensors)
├── model-00002-of-00004.safetensors         # 5.36GB
├── model-00003-of-00004.safetensors         # 5.36GB
├── model-00004-of-00004.safetensors         # 4.97GB
├── model.safetensors.index.json             # the FULL multimodal safetensors (Omni + ACE music head, 21GB)
│
├── omnistep-12a3b-f16.gguf                  # 6.4GB GGUF (text body + multimodal heads, F16)
├── omnistep-12a3b-q8_0.gguf                 # 3.4GB GGUF (Q8_0 quantization)
├── omnistep-12a3b-q4_k_m.gguf               # 2.0GB GGUF (Q4_K_M quantization, recommended)
├── omnistep-12a3b-q4_0.gguf                 # 1.9GB GGUF (Q4_0 quantization, smallest)
│
├── 01_lofi_chill.wav                        # example audio (Lo-Fi track)
├── 02_movie_orchestra.wav                   # example audio (Movie Orchestra track)
├── 04_dark_metal.wav                        # example audio (Dark Metal track)
│
├── 01_lofi_chill_voice.wav                  # example voice intro (Lo-Fi description by the vibe coach)
├── 02_movie_orchestra_voice.wav             # example voice intro (Movie Orchestra description)
├── 04_dark_metal_voice.wav                  # example voice intro (Dark Metal description)
│
└── scripts/
    ├── run_omnistep_12a3b.py                # main entry: text, music, music-loop, voice, serve
    ├── omnistep_radio.py                    # the 4-loop Evolutionary Radio
    ├── omnistep_voice.py                    # streaming voice assistant
    └── omnistep-jammit                      # bash wrapper for music gen

The GGUFs are the quantized-transformer deployment path (text body + multimodal heads, fast on llama.cpp). The safetensors are the complete model including the diffusion (music) head at F16 unquantized for max audio quality. The Python scripts wire both together.

Architecture — the correct lineup

OmniStep 12A3B — one complete model, paper-exact Darwin merge + attached heads
│
├── INPUT (full speech I/O lives inside the model)
│   ├── whisper_audio_in    (Qwen2.5-Omni's Whisper-style encoder — streaming ASR)
│   └── navit_vision_in     (Qwen2.5-Omni's NaViT — image understanding)
│
├── TEXT REASONING
│   ├── text_backbone       (Qwen2.5-Omni Thinker, 36L, h=2048 — paper-exact Darwin merge with ACE encoder)
│   │                        Its internal state carries conversation context into both the speech
│   │                        output and the music output, so the music is mood-aware from
│   │                        what the LLM is "thinking"
│   └── ace_encoder_attached (ACE-Step text encoder — kept as a separate module since
│                              the Qwen2.5/Qwen3 cross-architecture boundary prevents 1:1 FFN blend)
│
├── OUTPUT 1 — speech (two paths, pick the one that fits)
│   ├── talker + token2wav  (Qwen2.5-Omni's built-in TTS — lowest latency, lives inside the model)
│   └── external Soprano 80M  (chunked-streaming TTS for the user's preferred voice character)
│                            ↑ text is cut on sentence boundaries and sent in parallel as
│                              the LLM is still generating later sentences
│
└── OUTPUT 2 — music (the Darwin family FFN-blend destination)
    └── ace_music_decoder   (ACE-Step v1.5 XL 4B DiT — text → music, continuous background music)
                             (F16, unquantized, lives in the safetensors)
                             The Darwin family methodology is applied HERE rather than to a
                             TTS model: by blending the LLM FFN weights into the music head,
                             the music output picks up the conversation's emotional state.
                             That's why the background music can feel like it's "aware" of
                             the conversation — it is, by construction.
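
The external-TTS path in the diagram cuts text on sentence boundaries while the LLM is still generating. A minimal sketch of that chunking step is shown here. It is an illustration under assumptions, not the code in `scripts/`: `sentences_from_stream` is a hypothetical name, and the boundary rule (`.`, `!` or `?` followed by whitespace) is a simplification that will mis-split abbreviations such as "e.g. ".

```python
import re

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(pieces):
    """Yield complete sentences as soon as they appear in a stream of text pieces."""
    buffer = ""
    for piece in pieces:
        buffer += piece
        parts = _SENTENCE_END.split(buffer)
        # All parts except the last are finished sentences; the last may
        # still be growing, so it stays in the buffer.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


pieces = ["Noted. I will", " remind you", " tomorrow! Any", "thing else?"]
for sentence in sentences_from_stream(pieces):
    print(sentence)
```

Each yielded sentence could be handed to the TTS engine immediately, so speech for the first sentence can start before later sentences have been generated.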

The "one model that produces and listens" principle

The whole point of putting all of this in one model is so the audio-output side and the audio-input side are the same running process. When you speak:

1. The Whisper encoder hears you mid-generation.
2. The LLM (the same model that has been streaming text) gets interrupted.
3. The model itself can cut off both the TTS stream and the music stream.
4. The LLM pivots to your new input, and because the music head shares the LLM's FFN-blended state, the background music shifts to match the new mood in the same inference step.

No external orchestrator, no second model watching the first one. One model, one attention state, one set of weights — that's what makes it feel like a real-time conversation.

What about Darwin-TTS? (The cross-modal FFN blend that isn't used here)

Darwin-TTS-1.7B-Cross is the same Darwin Family framework applied to speech synthesis: it blends 3% of Qwen3-1.7B (LLM) FFN weights into Qwen3-TTS-1.7B (TTS) to add emotional expressiveness without any training. It is not used in OmniStep 12A3B because:

- OmniStep's text body is Qwen2.5-3B class, not Qwen3, so the hidden and intermediate sizes differ.
- Qwen2.5-Omni-3B already has full speech I/O (Whisper ASR + Talker+token2wav TTS). Darwin-TTS is purely a TTS model with no STT, so it would be redundant.
- The Darwin family weight-blend methodology is more valuable here applied to the music path: no other model is providing music, the DiT needed a way to inherit conversation mood, and FFN blending is exactly that mechanism.

The Darwin-TTS-Cross result is still useful as a research reference: it shows the FFN-blend approach is stable at 3–5% and degrades fast at 10%, so 3–5% is the range to start from when experimenting with fine-grained mood control on the music DiT.
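
To make the idea of a small-ratio FFN blend concrete, here is a toy sketch. It assumes the blend is a plain linear interpolation and that FFN tensors can be recognised by `.mlp.` in their names (as in Hugging Face Qwen checkpoints). Neither assumption is confirmed by this page, and this is not the Darwin-TTS or OmniStep merge code. Tensors are plain Python lists, with a length mismatch standing in for a shape mismatch.

```python
def blend_ffn(target, donor, ratio=0.03, ffn_marker=".mlp."):
    """Return (merged, skipped): `target` with donor FFN tensors mixed in at `ratio`.

    Non-FFN tensors and tensors missing from the donor are copied unchanged.
    FFN tensors whose sizes differ are also left unchanged and reported in
    `skipped`, mirroring the "skip on dim mismatch" behavior described earlier.
    """
    merged, skipped = {}, []
    for name, weights in target.items():
        other = donor.get(name)
        if ffn_marker not in name or other is None:
            merged[name] = list(weights)
        elif len(other) != len(weights):
            merged[name] = list(weights)
            skipped.append(name)
        else:
            merged[name] = [(1 - ratio) * w + ratio * d
                            for w, d in zip(weights, other)]
    return merged, skipped


# Toy tensors with made-up names and values.
target = {
    "layers.0.mlp.up_proj": [1.0, 1.0],
    "layers.0.self_attn.q_proj": [2.0],
    "layers.1.mlp.up_proj": [1.0, 1.0, 1.0],
}
donor = {
    "layers.0.mlp.up_proj": [2.0, 0.0],
    "layers.0.self_attn.q_proj": [9.0],
    "layers.1.mlp.up_proj": [5.0],
}
merged, skipped = blend_ffn(target, donor, ratio=0.03)
print(merged["layers.0.mlp.up_proj"], skipped)
```

Only the first FFN tensor moves (by 3% toward the donor); the attention tensor is untouched and the mismatched FFN tensor is skipped.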

For the vibe-coach / personal-companion use case, the typical request flows are:

- "Take a note: I need to follow up with the design team" → whisper → thinker (note-taker) → talker (voice confirmation)
- "I'm coding late, give me a lofi vibe" → thinker (vibe-coach) → ace_music_decoder (matching background music)
- "Tell me about the Darwin Family paper" → thinker (conversational) → talker (voice response)

Parents (the Darwin merge)

| Parent | Role | License |
|---|---|---|
| Qwen2.5-Omni-3B | Multimodal text+speech+vision, Qwen2.5-3B class text body | Apache 2.0 |
| ACE-Step v1.5 XL SFT 4B | 4B DiT for text-to-music, Qwen3-class text encoder | Apache 2.0 |

Merge methodology — paper-exact

Following arXiv:2605.14386 (Kim et al., 2026 — "Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning"). No modifications.

θM(T) = (1 - r_final(T)) · θA(T) + r_final(T) · θB(T)
r_final(T) = τ · r_MRI(T) + (1 - τ) · r_genome(T)
r_MRI(T) = MRI_B(T) / (MRI_A(T) + MRI_B(T))
MRI(T)  = α · Static(T) + (1 - α) · Probe(T),  α = 0.5 (paper-fixed)

The fixed starting genome used (paper-recommended):

{
  "gamma": 0.55, "alpha_attn": 0.55, "alpha_ffn": 0.50, "alpha_emb": 0.50,
  "rho_a": 0.45, "rho_b": 0.45,
  "r0": 0.50, "r1": 0.50, "r2": 0.50, "r3": 0.50, "r4": 0.50, "r5": 0.50,
  "tau": 0.40, "lambda_reg": 0.50
}
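
The four formulas above can be traced for a single tensor T with a few lines of Python. This is a worked sketch only: the Static and Probe scores are made-up numbers, and using `r_genome = 0.50` for this tensor is an assumption, since this page does not say how the genome's `r0`..`r5` entries map to `r_genome(T)`. `tau = 0.40` and `alpha = 0.5` are taken from the text above.

```python
def mri(static, probe, alpha=0.5):
    """MRI(T) = alpha * Static(T) + (1 - alpha) * Probe(T)."""
    return alpha * static + (1.0 - alpha) * probe


def final_ratio(mri_a, mri_b, r_genome, tau):
    """r_final(T) = tau * r_MRI(T) + (1 - tau) * r_genome(T)."""
    r_mri = mri_b / (mri_a + mri_b)  # undefined if both MRI scores are zero
    return tau * r_mri + (1.0 - tau) * r_genome


def merge_tensor(theta_a, theta_b, r_final):
    """theta_M(T) = (1 - r_final) * theta_A(T) + r_final * theta_B(T), elementwise."""
    return [(1.0 - r_final) * a + r_final * b for a, b in zip(theta_a, theta_b)]


# Made-up scores for one tensor of parent A and parent B.
mri_a = mri(static=0.6, probe=0.8)
mri_b = mri(static=0.2, probe=0.4)
r = final_ratio(mri_a, mri_b, r_genome=0.50, tau=0.40)
theta_m = merge_tensor([1.0, 2.0], [3.0, 4.0], r)
print(round(r, 4), [round(x, 4) for x in theta_m])
```

Here parent A scores higher on MRI, so the MRI term pulls the ratio below the genome's 0.50 and the merged tensor stays closer to parent A.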

What's in the box — full modality support

Every modality is in the model. Here's exactly how to use each:

| Modality | Path | How to invoke |
|---|---|---|
| Text reasoning / conversation | GGUF or safetensors | `python scripts/run_omnistep_12a3b.py text "..."` (via vllm or llama-server) |
| Note-taking | via text reasoning | "Take a note: ..." |
| Vibe coaching | via text reasoning | "I just finished a workout, give me a recovery vibe" |
| Text-to-music (one-shot) | safetensors (ACE DiT, F16) | `python scripts/run_omnistep_12a3b.py music "..." --output X.wav` |
| Infinite background music | safetensors + mpv + radio loop | `python scripts/run_omnistep_12a3b.py music-loop "..."` |
| Streaming ASR (audio in) | safetensors via vllm | `python scripts/run_omnistep_12a3b.py voice` |
| Streaming TTS (audio out) | safetensors via vllm | `python scripts/run_omnistep_12a3b.py voice` |
| Image understanding | safetensors via vllm | vllm `image_url` field in chat completions |
| All modalities in one process | safetensors via vllm | vllm with `--max-model-len 32768` |
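
For the image-understanding path, the `image_url` field follows the OpenAI-style chat format, where the message content is a list of typed parts. The sketch below builds such a payload with an inline base64 data URL. `image_message` is a hypothetical helper, the image bytes are a placeholder, and whether this particular checkpoint handles images through vllm as expected has not been verified here.

```python
import base64
import json


def image_message(prompt, image_bytes, mime="image/png"):
    """Build a user message carrying a text part and an inline image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# Placeholder bytes; in practice use open("photo.png", "rb").read().
placeholder_image = b"abc"

payload = {
    "model": "sovthpaw/omnistep-12a3b",
    "messages": [image_message("What is in this picture?", placeholder_image)],
    "max_tokens": 200,
}
print(json.dumps(payload, indent=2))
```

The resulting JSON can be posted to `/v1/chat/completions` in the same way as the text-only requests in the quick start.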

Citation

@misc{omnistep-12a3b-2026,
  title  = {OmniStep 12A3B: A Darwin Family paper-exact weight-space recombination of Qwen2.5-Omni-3B and ACE-Step v1.5 XL SFT 4B — a vibe-coach voice assistant with infinite background music that self-evolves in generational Darwin family evolution},
  author = {SouthpawIN},
  year   = {2026},
  url    = {https://huggingface.co/sovthpaw/omnistep-12a3b},
  note   = {Built with Nous Girl (Hermes Agent); Darwin Family paper: arXiv:2605.14386}
}

@article{kim2026darwinfamily,
  title  = {Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning},
  author = {Kim, T. and others},
  year   = {2026},
  eprint = {2605.14386},
  archivePrefix = {arXiv},
  url    = {https://arxiv.org/abs/2605.14386}
}

@software{ace-step-1.5,
  title  = {ACE-Step 1.5: Open-Source Music Generation Foundation Model},
  author = {ACE Studio and StepFun},
  year   = {2026},
  url    = {https://github.com/ace-step/ACE-Step-1.5}
}

@software{qwen2.5-omni,
  title  = {Qwen2.5-Omni Technical Report},
  author = {Alibaba Qwen team},
  year   = {2025},
  url    = {https://qwenlm.github.io/blog/qwen2.5-omni/}
}

@software{soprano-80m,
  title  = {Soprano-80M: A Compact Flow-Matching Text-to-Speech Model},
  author = {Soprano TTS contributors},
  year   = {2026},
  url    = {https://huggingface.co/ekwek/Soprano-80M-en}
}

License

Apache 2.0 (inherited from both parents).

