Hayabusa 9B

Hayabusa 9B is a full-weight fine-tune of Qwen/Qwen3.5-9B specialized for debugger-style software-agent behavior: reading grounded runtime evidence, selecting the next debugging action, and proposing targeted fixes inside Hen, a long-horizon C++ agent harness.

This is a merged full checkpoint, not a LoRA, QLoRA, or adapter release. The repository follows Hugging Face model conventions with root-level config.json, tokenizer files, model.safetensors.index.json, and sharded safetensors weights.

Intended Use

Hayabusa is intended to be used as a debugger model inside a structured agent loop, especially one that provides:

  • recent test output and failure signatures;
  • source snippets for specific functions;
  • runtime logs, traces, and LLDB evidence;
  • constrained action schemas such as run_test, function_info, debug_function, and fix_function;
  • fresh verification after every code fix.
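As an illustration of what a constrained action schema can look like on the harness side, the following is a minimal sketch of a validator that either accepts a model reply or returns concrete feedback for a re-request. The four action names come from the list above; the JSON field names ("action", "function") and the rule that every action except run_test needs a target function are assumptions made for this example, not Hen's actual schema.

```python
import json

# Action names taken from the list above; the JSON layout is hypothetical.
ALLOWED_ACTIONS = {"run_test", "function_info", "debug_function", "fix_function"}


def validate_action(raw_reply):
    """Return (action_dict, None) if the reply is valid, else (None, feedback)."""
    try:
        action = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        return None, f"Reply is not valid JSON: {exc.msg} at position {exc.pos}."
    if not isinstance(action, dict):
        return None, "Reply must be a single JSON object."
    name = action.get("action")
    if name not in ALLOWED_ACTIONS:
        return None, f"Unknown action {name!r}; expected one of {sorted(ALLOWED_ACTIONS)}."
    # Assumed rule: every action except run_test must name a target function.
    if name != "run_test" and not action.get("function"):
        return None, f"Action {name!r} requires a non-empty 'function' field."
    return action, None
```

The feedback string is meant to be sent back to the model as the next user turn, so that an invalid reply produces a specific correction instead of a silent retry.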

The model is optimized for action-oriented debugging, not broad chat, general coding benchmarks, or free-form assistant behavior. Together, Hen and Hayabusa aim at an autonomous debugging loop: inspect evidence, choose the next action, apply a targeted fix, run the tests again, and continue from the new runtime state. The model performs best when the harness keeps the context grounded in concrete evidence rather than asking it to infer everything from a vague bug report.

Recommended Hen Usage

A practical role split is:

  • hayabusa-9b as the Debugger model for next-step selection and focused fix proposals;
  • a stronger long-context model as Director for trajectory summarization, very large contexts, and review/escalation;
  • automatic run_test after fixes so the model sees whether the previous hypothesis improved, regressed, or left the failure unchanged.
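The "improved, regressed, or left the failure unchanged" signal in the last item can be computed by the harness from two consecutive test runs. The sketch below assumes each run is summarized as a collection of failing test names; that representation and the function name are hypothetical, not part of Hen.

```python
def classify_progress(failing_before, failing_after):
    """Compare the failing tests of two consecutive run_test results."""
    before, after = set(failing_before), set(failing_after)
    fixed = before - after          # failed before, pass now
    newly_broken = after - before   # passed before, fail now
    if newly_broken:
        status = "regressed"        # any new failure counts as a regression
    elif fixed:
        status = "improved"
    else:
        status = "unchanged"
    return {
        "status": status,
        "fixed": sorted(fixed),
        "newly_broken": sorted(newly_broken),
    }
```

Treating any newly failing test as a regression, even when other tests were fixed, is a deliberately conservative choice for this sketch; a harness could equally report both lists and let the model weigh them.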

The model was trained at around 32K tokens of context, so in Hen runs avoid letting the debugger context grow without bound. If the prompt becomes much larger than the training context, route large-context reasoning to the Director model or summarize before continuing.
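A minimal sketch of such a routing rule follows. The 32,768-token figure is an assumed reading of "32K", and the headroom and the 2x cutoff are hypothetical thresholds chosen for illustration; the right values depend on the harness and the Director model.

```python
def route_debugger_turn(prompt_tokens, trained_context=32_768, headroom=4_096):
    """Pick a handling strategy for the next turn from the prompt size in tokens."""
    budget = trained_context - headroom  # leave room for the model's reply
    if prompt_tokens <= budget:
        return "debugger"    # fits comfortably: send to the debugger model as-is
    if prompt_tokens <= 2 * trained_context:
        return "summarize"   # moderately oversized: compress history first
    return "director"        # far beyond training range: escalate


# Example: a 40,000-token prompt exceeds the budget and gets summarized first.
print(route_debugger_turn(40_000))
```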

Example Hen role:

-llmDbg vllm/hayabusa-9b

Training Summary

The checkpoint was produced through a sequence of continuation-training stages, each starting from the previous full checkpoint rather than restarting from the base model.

High-level lineage:

  1. Full-weight SFT from Qwen/Qwen3.5-9B on early no-assistant-thinking debugger trajectories.
  2. Round-2 continuation SFT on a larger no-assistant-thinking SFT union.
  3. Round-2 DPO on cleaned debugger preference pairs.
  4. Rare-actions SFT continuation to improve underrepresented debugger actions.
  5. Super-debug v3 main SFT continuation.
  6. Super-debug v3 rare-actions SFT with final-assistant-message-only loss masking.
  7. Super-debug v3 DPO continuation from the v3 rare-actions checkpoint.

The goal of these stages is not generic code completion. The target behavior is compact, evidence-grounded debugging: identify the active blocker, avoid stale hypotheses, request the right runtime evidence, and make a local fix that moves the test state forward.
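The final-assistant-message-only loss masking named in stage 6 can be illustrated with a small label-building function. This is a generic sketch of the technique, not the training code used for this checkpoint: the span representation is hypothetical, and -100 is used because it is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_all_but_final_assistant(token_ids, message_spans):
    """Build labels that contribute loss only on the last assistant message.

    message_spans is a list of (role, start, end) token ranges, end exclusive,
    in conversation order. This layout is an assumption for the example.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    final_span = None
    for role, start, end in message_spans:
        if role == "assistant":
            final_span = (start, end)  # keep overwriting; the last one wins
    if final_span is not None:
        start, end = final_span
        labels[start:end] = token_ids[start:end]
    return labels
```

With this masking, earlier assistant turns in a trajectory still appear in the input as context, but only the final action is used as a training target.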

Data

Public dataset family:

Important caveat: some public dataset views include assistant thinking. This checkpoint was trained primarily on no-assistant-thinking SFT views and cleaned DPO/rare-action derivatives. Exact later-stage training views may differ from the public top-level files.

Example Training Traces

Hayabusa is trained on Hen-style debugging trajectories, where the model learns to operate inside a closed runtime-feedback loop rather than answer one-shot bug reports. The public datasets expose these traces directly on Hugging Face, so users can inspect the training format and build compatible harnesses.

Example from super-debug-v3:

The model is trained to systematically analyze test execution results, acquire missing runtime/source evidence, and suggest a fix only when the available evidence is sufficient. At runtime, Hen may provide a more verbose context with project workflow, source summaries, logs, traces, and progress reports, but the debugging trajectory shape is similar.

Behavior Notes

  • Text-only model; no vision support.
  • Tuned for structured debugging traces and action JSON, not conversational polish.
  • Best outputs are usually concise and evidence-backed.
  • Can still loop on stale hypotheses if the harness does not provide progress feedback or if context exceeds the range seen during training.
  • Works best when the system validates actions and re-requests invalid actions with concrete feedback.
  • When used outside Hen, provide a strict action schema and fresh test evidence after each proposed fix.
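One way a harness might supply the progress feedback mentioned above is a small repeat detector over recent actions. The sketch below is an assumption about how such a check could look, not Hen's mechanism; the action dictionary fields ("action", "function"), the window size, and the repeat limit are all hypothetical.

```python
from collections import deque


class StuckDetector:
    """Flag an action that keeps recurring within a sliding window of steps."""

    def __init__(self, window=6, max_repeats=3):
        self.recent = deque(maxlen=window)  # oldest entries drop off automatically
        self.max_repeats = max_repeats

    def observe(self, action):
        """Record an action; return a feedback string if it looks like a loop."""
        key = (action.get("action"), action.get("function"))
        self.recent.append(key)
        repeats = sum(1 for k in self.recent if k == key)
        if repeats >= self.max_repeats:
            return (
                f"Action {key[0]!r} on {key[1]!r} was requested {repeats} times "
                f"in the last {len(self.recent)} steps; choose a different action."
            )
        return None
```

The returned message can be appended to the next prompt so the model sees explicit evidence that its current line of inquiry is not producing new information.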

Loading With Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "georvn7/hayabusa-9b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Given this test failure and trace, select the next debugger action as JSON."
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False,
)
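# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))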

vLLM Serving Notes

Validated local serving configuration for vLLM:

vllm serve georvn7/hayabusa-9b \
  --served-model-name hayabusa-9b \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.70 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 32768 \
  --dtype bfloat16 \
  --default-chat-template-kwargs '{"enable_thinking":false}' \
  --enforce-eager \
  --disable-frontend-multiprocessing \
  --language-model-only

Recommended starting sampling for Hen-style debugging:

{
  "temperature": 0.05,
  "top_p": 0.85,
  "top_k": 20,
  "repetition_penalty": 1.12,
  "max_tokens": 4096
}

Use lower temperature for strict JSON/action stability. If the model becomes too repetitive, prefer harness-level stuck detection and progress feedback before increasing randomness aggressively.
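As a sketch of how these sampling settings might be passed to a vLLM server through its OpenAI-compatible chat endpoint, the example below builds the request body with the standard library only. The base URL (vLLM's default port 8000 on localhost) and the helper names are assumptions; the model name matches the --served-model-name value shown earlier. top_k and repetition_penalty are not part of the OpenAI schema but are accepted by vLLM as extra sampling parameters in the request body.

```python
import json
import urllib.request

# Starting values copied from the recommendation above.
SAMPLING = {
    "temperature": 0.05,
    "top_p": 0.85,
    "top_k": 20,
    "repetition_penalty": 1.12,
    "max_tokens": 4096,
}


def build_chat_request(messages, model="hayabusa-9b"):
    """Assemble the JSON body for POST /v1/chat/completions."""
    payload = {"model": model, "messages": messages}
    payload.update(SAMPLING)
    return payload


def send_chat_request(payload, base_url="http://localhost:8000"):
    """Send the request to a running server (base_url is an assumption)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Keeping payload construction separate from the network call makes it easy to log or unit-test the exact sampling settings a run used.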

Evaluation Status

This is an experimental research checkpoint. It has been used inside Hen on long-horizon C++ debugging trajectories, including difficult SimpleC compiler tests where many general OSS models struggle. It should not be interpreted as a broadly evaluated frontier coding model.

The most meaningful evaluation setting is not a single-pass benchmark. It is a stateful debugging loop with persisted evidence, action validation, run-test verification, and trajectory progress reports.

Limitations

  • Specialized debugger/action model, not a general assistant release.
  • Not broadly safety-aligned beyond the upstream base model and task data.
  • May over-focus on familiar Hen action patterns outside the Hen harness.
  • May repeat information-gathering actions if progress feedback is weak.
  • Full reproducibility requires the exact no-thinking SFT views, rare-action shards, and cleaned DPO files used in later stages.

License

This model inherits the upstream Qwen/Qwen3.5-9B Apache-2.0 license. See LICENSE.
