OUROBOROS Kernelsmith (Qwen3.6-27B LoRA)
A 27B model that writes fused Triton GPU kernels. It is the bigger sibling of the 1B Kernelsmith and the "Pro mode" model in the Kernel Mint Space. It set the leaderboard crowns the 1B is trying to beat, and it handles the harder cases the 1B cannot, like machines that stack two activations.
This repo holds LoRA adapters on Qwen/Qwen3.6-27B.
The main adapter is rl_adapter_v2, the final 69-op discovery checkpoint. Earlier adapters stay
here only so the training path and ablations remain reproducible.
Even at 27B, a model writing GPU code is a proposer, not an oracle. The verifier is what makes the output trustworthy: draft a few kernels, keep the ones that pass.
How to prompt it (the same trick as the 1B)
The prompt has two parts that matter, and they are how the model was trained. Get them right and it writes correct, fast kernels.
1. Suppress the reasoning trace with apply_chat_template(..., enable_thinking=False). We want
the kernel, not a chain of thought.
2. Give it the right style guide. The user message hands the model one valid kernel for a different op as an example. The model was trained with a row-wise reduction kernel (rmsnorm) as that example for almost every op, so hand it the same thing. An elementwise example pushes it toward elementwise code for reduction ops. Use the rmsnorm kernel below.
The 27B is large: in bfloat16 the weights alone need roughly 54 GB of VRAM (an 80 GB card, or
shard across GPUs with device_map="auto"). Alternatively, load it in 4-bit to fit a single 24 GB card.
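The memory figures above follow from simple arithmetic on the parameter count. The sketch below assumes a nominal 27 billion parameters (taken from the model name, not measured) and counts the weights only; `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: a nominal 27e9 parameters, as implied by the "27B" name.
N_PARAMS = 27e9

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # bfloat16 stores 2 bytes per parameter
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantization stores half a byte

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{int4_gb:.1f} GB")
```

Activations, the KV cache, the LoRA weights, and quantization metadata all come on top of these numbers, so leave headroom on whichever card you pick.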
import torch, re
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
BASE = "Qwen/Qwen3.6-27B"
tok = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    BASE, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")
# for a single 24GB GPU, swap the line above for 4-bit:
# from transformers import BitsAndBytesConfig
# base = AutoModelForCausalLM.from_pretrained(BASE, trust_remote_code=True, device_map="auto",
#     quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16))
model = PeftModel.from_pretrained(
    base, "YMRohit/ouroboros-kernelsmith-qwen3.6-27b", subfolder="rl_adapter_v2").eval()
SYSTEM = ("You are an expert GPU kernel engineer. Write a single correct, fast Triton kernel. "
          "Output ONLY one fenced python code block defining `run(*inputs)` and its @triton.jit "
          "kernel. Accumulate reductions in float32. No prose.")
# The style guide: a row-wise reduction (rmsnorm). Use it for anything that reduces over the
# last dimension, which is most of what this model is good at.
STYLE = '''@triton.jit
def _rmsnorm_kernel(x_ptr, w_ptr, y_ptr, stride, N, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    x_ptr += row * stride
    y_ptr += row * stride
    acc = tl.zeros([BLOCK], dtype=tl.float32)
    for off in range(0, N, BLOCK):
        cols = off + tl.arange(0, BLOCK)
        x = tl.load(x_ptr + cols, mask=cols < N, other=0.0).to(tl.float32)
        acc += x * x
    rms = tl.rsqrt(tl.sum(acc) / N + eps)
    for off in range(0, N, BLOCK):
        cols = off + tl.arange(0, BLOCK)
        mask = cols < N
        x = tl.load(x_ptr + cols, mask=mask, other=0.0).to(tl.float32)
        w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
        tl.store(y_ptr + cols, (x * rms * w), mask=mask)
def run(x, w):
    M, N = x.shape
    y = torch.empty_like(x)
    _rmsnorm_kernel[(M,)](x, w, y, x.stride(0), N, 1e-6, BLOCK=1024)
    return y'''
USER = (
    "Op `softmax`: numerically stable softmax over the last dim (subtract the row max).\n"
    "Signature:\n  run(x: Tensor[M, N]) -> Tensor[M, N]\n\n"
    "Here is a valid Triton kernel for a DIFFERENT op (`rmsnorm`) as a style guide:\n"
    f"```python\n{STYLE}\n```\n"
)
messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": USER}]
prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7, top_p=0.97)
text = tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
m = re.search(r"```(?:python)?\s*(.*?)```", text, re.S)
print(m.group(1) if m else text)
Sample a few at temperature=0.7 and keep the first that passes the referee.
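The draft-and-filter loop described above can be sketched as follows. This is a minimal illustration, not the harness itself: `draft_until_pass`, `sample`, and `referee` are hypothetical names, where `sample` stands for one call to `model.generate` plus code extraction and `referee` stands for whatever verifier you run.

```python
from typing import Callable, Optional, Tuple


def draft_until_pass(
    sample: Callable[[], str],
    referee: Callable[[str], bool],
    max_tries: int = 4,
) -> Tuple[Optional[str], int]:
    """Draw kernels one at a time and return the first that the referee accepts.

    Returns (source, attempts_used); source is None if every draft failed.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        src = sample()        # hypothetical: one sampled kernel source string
        if referee(src):      # hypothetical: compile + correctness + timing verdict
            return src, attempt
    return None, max_tries


# Toy usage with a scripted "model" so the control flow is visible without a GPU.
_drafts = iter(["bad kernel", "bad kernel", "good kernel", "good kernel"])
kernel, tries = draft_until_pass(lambda: next(_drafts), lambda s: s.startswith("good"))
print(kernel, tries)
```

Sampling lazily, one draft per iteration, avoids paying for generations that are never needed once a draft passes.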
Do not skip the referee
The model proposes; the verifier decides. Compile the kernel, check it against the PyTorch reference on adversarial inputs (allclose), and time it with CUDA events before trusting it. The harness we used is public and is the same referee that trained the model, with anti-cheat checks (no memoizing the output, no mutating the input):
- Referee + live demo: build-small-hackathon/ouroboros-kernel-mint
- Verified kernels + the full writeup: YMRohit/ouroboros-kernel-corpus
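To make the referee's decision logic concrete, here is a toy, framework-agnostic sketch. It is not the public harness: it uses plain Python lists in place of tensors, `rows_close` stands in for `torch.allclose`, every name in it is hypothetical, and it omits compilation and CUDA-event timing entirely. It illustrates only the correctness and anti-cheat side: compare against a reference on several hard inputs, and reject candidates that mutate their input or replay a cached output.

```python
import copy
import math
from typing import Callable, List, Sequence, Tuple

Row = List[float]


def rows_close(a: Sequence[float], b: Sequence[float],
               rtol: float = 1e-4, atol: float = 1e-6) -> bool:
    """List-based stand-in for torch.allclose (assumption: flat float rows)."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rtol, abs_tol=atol) for x, y in zip(a, b))


def referee_verdict(candidate: Callable[[Row], Row],
                    reference: Callable[[Row], Row],
                    cases: Sequence[Row]) -> Tuple[bool, str]:
    """Return (passed, reason) for a candidate checked against a reference."""
    for i, case in enumerate(cases):
        pristine = copy.deepcopy(case)   # copy the candidate never touches
        try:
            got = candidate(case)
        except Exception as exc:         # a crash counts as a failed draft
            return False, f"case {i}: raised {type(exc).__name__}"
        if case != pristine:             # anti-cheat: inputs must stay intact
            return False, f"case {i}: input was mutated"
        if not rows_close(got, reference(pristine)):
            return False, f"case {i}: output mismatch"
    return True, "ok"


def softmax_ref(row: Row) -> Row:
    """Numerically stable reference: subtract the row max before exponentiating."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Several distinct inputs, including large magnitudes. Using more than one
# case also exposes a candidate that memoizes its first output.
CASES: List[Row] = [[0.0, 1.0, 2.0], [1000.0, 1000.0, -1000.0], [-5.0]]

print(referee_verdict(softmax_ref, softmax_ref, CASES))
```

With real tensors the equality and closeness checks would be tensor comparisons rather than list comparisons, and a passing verdict would still need a timing measurement before the kernel is trusted.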
What is in this repo
- rl_adapter_v2/: final Qwen3.6-27B LoRA for the 69-kernel stability-gated set.
- Earlier Qwen adapters: sft_adapter/, rl_adapter/, rl_adapter_newops/, rl_adapter_invent/, rl_adapter_f1/, and the ablation-arm adapters. These are kept for reproduction, not as the default download target.
- best_kernels/: the 69 curated Triton kernels that match reports/rebench_stability_v2.json.
- reports/: canonical JSON and markdown reports. reports_local/, raw Modal-volume mirrors, and scratch planning docs are intentionally not part of this cleaned public layout.
- evidence/run_logs/: recovered raw run logs, with local machine paths redacted where needed.
- paper/: paper draft and generated figures.
MiniCPM5-1B artifacts live in their own repos: adapter and GGUF.
How it was trained
Supervised fine-tuning on verified kernels, then reinforcement learning where the reward is just
the verifier's verdict: a kernel that compiles, matches PyTorch on a hard correctness sweep, and
beats torch.compile max-autotune. The model learns from its own verified wins, no human labels.
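A verdict-as-reward signal of the kind described above could look like the sketch below. The binary 0/1 shape is an assumption made for illustration; the actual reward shaping used in training is not specified here, and `Verdict` and `reward` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    compiled: bool       # did the kernel compile?
    correct: bool        # did it match PyTorch on the correctness sweep?
    kernel_ms: float     # measured time of the candidate kernel
    baseline_ms: float   # measured time of torch.compile max-autotune


def reward(v: Verdict) -> float:
    """Assumed binary reward: 1.0 only for a correct kernel that beats the baseline."""
    if not (v.compiled and v.correct):
        return 0.0
    return 1.0 if v.kernel_ms < v.baseline_ms else 0.0


# Made-up timings, purely to exercise the three outcomes.
print(reward(Verdict(True, True, 0.80, 1.00)))   # correct and faster
print(reward(Verdict(True, True, 1.20, 1.00)))   # correct but slower
print(reward(Verdict(True, False, 0.10, 1.00)))  # fast but wrong
```

Gating on correctness before speed means a fast but wrong kernel never earns reward, which is the property that lets the verifier stand in for human labels.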
Trained on Modal H200s; the RL run peaks around 110 GB of VRAM. LoRA rank 128. The adapters here
cover the original suite, the new-operator discovery runs, an invention run on never-trained
problems, and a transfer run that fixed the worst loss cases.
Results
The full writeup, every number regenerated from harness logs, is in the corpus repo under
reports/.
Short version: 76 model-written kernels beat torch.compile max-autotune on an H200, 69 of them
held up across 5 fresh re-benchmark runs, and they keep a 1.49x geomean over a 376-cell shape and
dtype grid. They also beat hand-written expert kernels (Liger, Unsloth, the Triton tutorial) on
several ops.
These are scheduling wins on memory-bound fusion ops, not wins over cuBLAS or FlashAttention, and not new algorithms.
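For readers unfamiliar with the geomean figure quoted above, the sketch below shows how a geometric-mean speedup over a grid of cells is typically computed. The input values are made up for illustration and are not results from this project; `geomean` is a hypothetical helper.

```python
import math
from typing import Sequence


def geomean(speedups: Sequence[float]) -> float:
    """Geometric mean of per-cell speedups (baseline_time / kernel_time)."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    # Average in log space so a 2x win and a 2x loss cancel out exactly.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Illustrative values only: one 2x win and one 2x loss.
example = [2.0, 0.5]
print(f"geomean:         {geomean(example):.2f}")
print(f"arithmetic mean: {sum(example) / len(example):.2f}")
```

The geometric mean is the usual choice for ratios because, unlike the arithmetic mean, it does not let a few large wins mask an equal number of proportional losses.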
License and provenance
The fine-tuning and our code are MIT. The base weights are Qwen/Qwen3.6-27B and the base model's own license applies to them. No human-labeled kernel data was used; every training signal came from the referee. Built for the Hugging Face Build Small hackathon.