OUROBOROS Kernelsmith (MiniCPM5-1B LoRA)

A 1-billion-parameter model that writes fused Triton GPU kernels. It is a LoRA adapter on openbmb/MiniCPM5-1B. Describe an operation (a normalization with an optional residual and one activation, or a classic like softmax or swiglu) and it writes a single fused Triton kernel for it.

It is the model behind the Kernel Mint Space. There is a 27B sibling for the harder work, and a GGUF build of this model if you want to run it on llama.cpp.

Modal is part of the project's evidence trail. This 1B adapter is the local, offline smith that the Space uses; the larger 27B scaling run and the Pro backend ran on Modal H200s and Modal serving. The public evidence is in the 27B model card, the corpus RESULTS.md, and the backend endpoint section of the Space RUN.md.

A 1B model writing GPU code misses sometimes. That is the entire point of the project: you do not trust the model, you trust the referee. Draft a few kernels, throw away the ones that fail, keep the one that passes.
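
A rough way to see why drafting a few kernels helps: if each draft passed the referee independently with the same probability, the chance that at least one of N drafts passes grows quickly with N. The sketch below is only that arithmetic. Independence is an assumption, `p_any_pass` is a hypothetical helper, and the 0.5 pass rate is a made-up illustration, not a measured figure for this model.

```python
def p_any_pass(p_single: float, drafts: int) -> float:
    """Chance that at least one of `drafts` independent samples passes the referee.

    ASSUMPTION: every draft passes with the same probability, independently.
    """
    if not 0.0 <= p_single <= 1.0 or drafts < 1:
        raise ValueError("need 0 <= p_single <= 1 and drafts >= 1")
    return 1.0 - (1.0 - p_single) ** drafts

# Hypothetical pass rate of 0.5 per draft, three drafts.
print(p_any_pass(0.5, 3))  # 0.875
```

Real drafts from the same prompt are correlated, so treat the number as an optimistic bound rather than a prediction.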

How to prompt it (read this, it is the whole trick)

This model is fussy about its prompt in two specific ways, and if you get either wrong it writes nonsense. Both are easy once you know them.

1. Suppress the reasoning trace. MiniCPM5 is a reasoning model that uses ChatML. Prompt it the normal chat way and it spends its token budget thinking out loud, which a 1B cannot afford. So we close an empty <think></think> block before the answer. With transformers that is just apply_chat_template(..., enable_thinking=False).

2. Give it the right style guide. The user message hands the model one valid kernel for a different op as an example. This matters more than anything else in the prompt. The model was trained with a row-wise reduction kernel (rmsnorm, one program per row) as the example for almost every op, so that is the structure it learned to copy. If you hand it an elementwise kernel instead, it will write elementwise code for a reduction op like softmax and return the wrong answer. Use the rmsnorm kernel below as the style guide and it writes correct reductions.

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "openbmb/MiniCPM5-1B"
tok = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    BASE, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, "YMRohit/ouroboros-kernelsmith-minicpm5-1b").eval()

SYSTEM = ("You are an expert GPU kernel engineer. Write a single correct, fast Triton kernel. "
          "Output ONLY one fenced python code block defining `run(*inputs)` and its @triton.jit "
          "kernel. Accumulate reductions in float32. No prose.")

# The style guide: a row-wise reduction (rmsnorm). This is what teaches the model the
# one-program-per-row structure. Use it for anything that reduces over the last dimension.
STYLE = '''@triton.jit
def _rmsnorm_kernel(x_ptr, w_ptr, y_ptr, stride, N, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    x_ptr += row * stride
    y_ptr += row * stride
    acc = tl.zeros([BLOCK], dtype=tl.float32)
    for off in range(0, N, BLOCK):
        cols = off + tl.arange(0, BLOCK)
        x = tl.load(x_ptr + cols, mask=cols < N, other=0.0).to(tl.float32)
        acc += x * x
    rms = tl.rsqrt(tl.sum(acc) / N + eps)
    for off in range(0, N, BLOCK):
        cols = off + tl.arange(0, BLOCK)
        mask = cols < N
        x = tl.load(x_ptr + cols, mask=mask, other=0.0).to(tl.float32)
        w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
        tl.store(y_ptr + cols, (x * rms * w), mask=mask)

def run(x, w):
    M, N = x.shape
    y = torch.empty_like(x)
    _rmsnorm_kernel[(M,)](x, w, y, x.stride(0), N, 1e-6, BLOCK=1024)
    return y'''

# Describe the op you want. Keep it short and concrete. The style guide is a DIFFERENT op on
# purpose (here we ask for softmax and show rmsnorm).
USER = (
    "Op `softmax`: numerically stable softmax over the last dim (subtract the row max).\n"
    "Signature:\n  run(x: Tensor[M, N]) -> Tensor[M, N]\n\n"
    "Here is a valid Triton kernel for a DIFFERENT op (`rmsnorm`) as a style guide:\n"
    f"```python\n{STYLE}\n```\n"
)

messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": USER}]
prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=768, do_sample=True, temperature=0.7, top_p=0.97)
text = tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

m = re.search(r"```(?:python)?\s*(.*?)```", text, re.S)
print(m.group(1) if m else text)   # a @triton.jit kernel plus a run(...) entry point
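
For runtimes that never call `apply_chat_template` (the GGUF build under llama.cpp, for instance), the first rule can be approximated by building the prompt by hand. This is a minimal sketch, assuming the usual ChatML markers and an empty think block separated by blank lines. The exact whitespace that MiniCPM5's own template emits is an assumption here, so compare the result against the tokenizer's output before relying on it. `build_prompt` is a hypothetical helper, not something shipped with the model.

```python
# Hypothetical helper: hand-build a ChatML prompt that ends with an already-closed,
# empty <think></think> block, so generation starts at the answer.
# ASSUMPTION: marker strings and whitespace follow common ChatML usage; check them
# against tok.apply_chat_template(..., enable_thinking=False) for this model.
IM_START, IM_END = "<|im_start|>", "<|im_end|>"
EMPTY_THINK = "<think>\n\n</think>\n\n"

def build_prompt(system: str, user: str) -> str:
    turns = [("system", system), ("user", user)]
    body = "".join(f"{IM_START}{role}\n{text}{IM_END}\n" for role, text in turns)
    # Open the assistant turn, then close the reasoning block before any answer token.
    return f"{body}{IM_START}assistant\n{EMPTY_THINK}"
```

Called as `build_prompt(SYSTEM, USER)`, it should yield a string comparable to the `prompt` variable in the snippet above.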

If you want better odds, sample two or three at temperature=0.7 and keep the first one that passes the referee.
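
The draft-and-filter loop can be sketched in a few lines. Everything here is illustrative: `draft` stands for any zero-argument function that returns one raw model completion (for example a wrapper around the `model.generate` call above), and `accept` stands for whatever verifier you run on the extracted source. Both names are hypothetical.

```python
import re
from typing import Callable, Optional

TICKS = "`" * 3  # a markdown code fence, built this way to keep this example fence-safe
FENCE = re.compile(TICKS + r"(?:python)?\s*(.*?)" + TICKS, re.S)  # same pattern as the snippet above

def extract_code(text: str) -> str:
    """Return the first fenced block, or the raw text if the model skipped the fence."""
    m = FENCE.search(text)
    return m.group(1) if m else text

def first_passing(draft: Callable[[], str],
                  accept: Callable[[str], bool],
                  attempts: int = 3) -> Optional[str]:
    """Sample up to `attempts` drafts and return the first source that `accept` approves."""
    for _ in range(attempts):
        source = extract_code(draft())
        if accept(source):
            return source
    return None  # every draft failed: report that rather than shipping a guess
```

Returning `None` when nothing passes keeps the failure visible to the caller instead of quietly falling back to an unverified kernel.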

Do not skip the referee

The kernel the model writes is a guess until something checks it. Before you run a generated kernel anywhere real, put it through a verifier that compiles it, checks it against the PyTorch reference on awkward inputs (allclose, not eyeballing), and times it. The exact harness we used is public and is the same referee that trained the model; it also blocks the obvious ways to cheat a benchmark (memoizing the output, mutating the input).
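
To make the idea concrete, here is a minimal correctness check in the same spirit. It is a sketch, not the project's harness: it skips timing and the memoization guard, the shapes and tolerances are assumptions, and `check_kernel` is a hypothetical name. It runs on CPU by default so it can be tried without a GPU; pass `device="cuda"` for a real Triton kernel.

```python
import torch

def check_kernel(candidate, reference,
                 shapes=((1, 1), (3, 7), (5, 1023), (2, 1025)),
                 rtol=1e-3, atol=1e-3, device="cpu"):
    """Return True only if `candidate(x)` matches `reference(x)` on every shape
    and leaves its input untouched. Shapes are deliberately not powers of two."""
    gen = torch.Generator(device="cpu").manual_seed(0)
    for shape in shapes:
        x = torch.randn(shape, generator=gen).to(device)
        x_before = x.clone()
        want = reference(x.clone())
        try:
            got = candidate(x)
            close = (got.shape == want.shape and
                     torch.allclose(got.float(), want.float(), rtol=rtol, atol=atol))
        except Exception:
            return False  # compile errors, launch errors, and non-tensor outputs are misses
        if not torch.equal(x, x_before):
            return False  # the kernel mutated its input
        if not close:
            return False
    return True
```

A candidate here is any callable, so the `run` function defined by a generated kernel can be passed directly once its source has been executed into a namespace.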

How it was trained

Supervised fine-tuning on kernels from the OUROBOROS corpus (kernels that compiled, matched PyTorch on a hard correctness sweep, and beat torch.compile max-autotune), then reinforcement learning where the only reward is the referee's verdict. The corpus is self-distilled, written almost entirely by models inside a loop whose only teacher is the verifier. No human-labeled data anywhere. LoRA rank 64, alpha 128, on the attention and MLP projections.
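
For readers who want to set up a comparable adapter, the stated hyperparameters map onto a small set of keyword arguments. Only the rank and alpha come from this card. The module names are an assumption (Llama-style projection names) and may not match MiniCPM5-1B, and nothing here covers dropout, learning rate, or the reinforcement learning stage.

```python
# ASSUMPTION: Llama-style projection names; the real module names in MiniCPM5-1B may differ.
ATTENTION = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP = ["gate_proj", "up_proj", "down_proj"]

LORA_KWARGS = dict(
    r=64,                            # rank, as stated in this card
    lora_alpha=128,                  # alpha, as stated in this card
    target_modules=ATTENTION + MLP,  # "attention and MLP projections"
    task_type="CAUSAL_LM",
)

# Standard LoRA scales the low-rank update by alpha / r.
LORA_SCALE = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(LORA_SCALE)  # 2.0
```

With peft installed, `peft.LoraConfig(**LORA_KWARGS)` would turn this into a config object.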

What it is good at, and what it is not

In a multi-seed study (reports/ablations_minicpm_multiseed.md in the corpus repo) this model, sampled best-of-N against the referee, beat torch.compile max-autotune in all 12 independently seeded runs. Per-run geomeans were 1.02x to 1.14x over a 6-op suite on a single RTX 4090, and bigger per-op wins (softmax around 2x) show up and get verified per mint in the Space.
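
A per-run geomean is the geometric mean of the per-op speedups, which keeps one large win from hiding a loss elsewhere in the suite. The sketch below shows the calculation; the six speedups in it are invented to illustrate that property and are not results from the study.

```python
import math

def geomean(speedups):
    """Geometric mean of per-op speedups (baseline time / kernel time)."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive ratios")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical six-op run, NOT measured data: one 2x win, one 0.5x loss, four ties.
example = [2.0, 1.0, 1.0, 1.0, 1.0, 0.5]
print(round(geomean(example), 3))  # 1.0
```

The arithmetic mean of the same list is about 1.08, which is why a geomean is the more conservative summary for speedup ratios.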

It is good at memory-bound fusion ops: normalizations, activations, gated MLP halves, softmax variants, and a normalization fused with one activation. It is not writing FlashAttention or matmul kernels and it is not inventing algorithms. These are scheduling wins on small fused ops, a narrow target on purpose, because a narrow target is one a 1B model can hit and a referee can check.

License and provenance

The fine-tuned adapter and our code are released under the MIT license. The base weights are OpenBMB's MiniCPM5-1B and the base model's own license applies to them. No human-labeled kernel data was used; every training signal came from the referee. Built for the Hugging Face Build Small hackathon.
