SAM-G-CobraTooling

SAM-G-CobraTooling is a 30.3M-parameter model fine-tuned from SAM-G-Reasoning on 196k agentic orchestration traces. It turns a natural-language instruction — or an observation from a previous step — into an ordered, risk-flagged JSON plan of tool calls. It is the local orchestration layer of an agentic IDE: it routes, decomposes, tracks state, reacts to exit codes and HTTP status, and emits structured tool calls entirely offline. It does not write code; code is delegated to a larger model via an ask_code_model hand-off. Built by AMEFORGE for the CobraBub IDE.

  • Parameters: 30.3M · Footprint: 121 MB fp32 (~30 MB quantized) · Base: SAM-G-Reasoning
  • Fine-tuning: prompt-masked SFT (loss on the plan span only), cosine learning-rate schedule at 8e-5, 10k steps, best checkpoint at 6k
  • Aggregate exact plan-match: 78.8% (held-out, disjoint seed)
  • Lineage: SAM-G → SAM-G-Reasoning → SAM-G-CobraTooling

Output format

<instruction>                     [ACTION] {"plan":[{"op":...,"args":{...},"risk":"safe|critical"}, ...]}
<intent> | {"last_op":...,"...":...} [ACTION] {"plan":[ ... ]}      # reactive (observation-driven)

Every step carries a risk flag (safe or critical) that drives the IDE confirmation gate: safe ops run autonomously, critical ops require explicit user confirmation.
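
The format and gate described above can be sketched in a few lines of Python. This is a minimal illustration, not the CobraBub implementation: the function names are hypothetical, and the choice to auto-run only the leading run of safe steps (so that plan order is preserved around a pending confirmation) is an assumption about how an IDE might apply the gate.

```python
import json

ACTION_TAG = "[ACTION]"

def parse_plan(model_output: str) -> list[dict]:
    """Return the list of steps found after the [ACTION] tag (hypothetical helper)."""
    # Works on either the full line or just the generated span.
    _, _, tail = model_output.rpartition(ACTION_TAG)
    payload, _ = json.JSONDecoder().raw_decode(tail.strip())
    steps = payload["plan"]
    for step in steps:
        # Assumption: fail closed on a missing or unknown risk value.
        if step.get("risk") not in ("safe", "critical"):
            step["risk"] = "critical"
    return steps

def split_at_first_critical(steps: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split an ordered plan into (auto-runnable prefix, steps awaiting confirmation)."""
    for i, step in enumerate(steps):
        if step["risk"] == "critical":
            return steps[:i], steps[i:]
    return steps, []
```

A caller would execute the first list immediately and present the second to the user before continuing.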

What it is good at — and what it is not

The model was stress-tested on thirteen task families. The pattern mirrors the rest of the SAM-G line: it excels at routing and reaction (short, procedural tasks) and, at 30M parameters, is limited on long ordered chains that must match exactly.

Family                                         Exact %  Type
single_tool (routing)                          100      routing
retry_loop (exit-code state machine)           100      reaction
feedback_react (stdout/stderr)                 100      reaction
git_workflow (status→add→push, gated)          100      procedural
scrape_research (fetch→summarize→act)          100      procedural
db_query (SQL, SELECT vs mutation)             100      structured call
webhook_wait (async callback)                  92       async reaction
mcp_call (filesystem/github/postgres)          83       structured call
api_call (REST/GraphQL + HTTP state machine)   75       structured call
plan_chain (multi-step plans)                  58       planning
risk_gate (mixed safe/critical plans)          58       gated planning
fs_watch (file-change reaction)                42       async reaction
build_test_cycle (edit→test→react + hand-off)  17       long chain
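
As a sanity check on the numbers above, the per-family scores average to the reported 78.8% aggregate if every family is weighted equally. That weighting is inferred from the figures, not stated by the authors. The `exact_match` helper is likewise an assumption about what "strict exact match" means here: it compares parsed plans, so step order matters but JSON key order does not.

```python
# Per-family exact-match percentages, copied from the table above.
family_exact = {
    "single_tool": 100, "retry_loop": 100, "feedback_react": 100,
    "git_workflow": 100, "scrape_research": 100, "db_query": 100,
    "webhook_wait": 92, "mcp_call": 83, "api_call": 75,
    "plan_chain": 58, "risk_gate": 58, "fs_watch": 42,
    "build_test_cycle": 17,
}

# Unweighted (macro) mean over the thirteen families.
macro_average = sum(family_exact.values()) / len(family_exact)
print(f"{macro_average:.1f}")  # 78.8

def exact_match(predicted_plan: list[dict], reference_plan: list[dict]) -> bool:
    """Hypothetical scorer: True only if every step matches, in the same order."""
    return predicted_plan == reference_plan
```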

Routing, exit-code reaction, git, scraping and SQL routing are saturated. mcp_call at 83% makes the model a viable local driver for MCP servers — the core capability of a hosted code agent, here running offline. plan_chain rose from the v1 plateau (0–42%) to 58% after broadening generator coverage. build_test_cycle remains the hard family: four-to-five ordered ops ending in a code-model hand-off, scored by strict exact match — the same long-chain ceiling seen with arithmetic in SAM-G-Reasoning. For those, decompose app-side into shorter sub-calls.
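
One way to read "decompose app-side" is that the application, not the model, splits a long request into short instructions and plans each one separately. The sketch below assumes exactly that; `plan_fn` is a hypothetical callable that wraps the model and returns the parsed steps for one prompt, and how the request is split is left to the application.

```python
from typing import Callable

# Hypothetical signature: takes one prompt, returns the parsed plan steps.
PlanFn = Callable[[str], list[dict]]

def plan_in_chunks(sub_instructions: list[str], plan_fn: PlanFn) -> list[dict]:
    """Plan each short instruction on its own and concatenate the steps in order."""
    merged: list[dict] = []
    for text in sub_instructions:
        # Each call stays in the short, procedural regime where the model is strong.
        merged.extend(plan_fn(f"{text} [ACTION]"))
    return merged
```

Keeping each call to one or two ops trades a single long exact chain for several short ones, which is the regime where the table shows the strongest scores.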

Security: the risk flag is advisory, not a boundary

The model flags critical ops with 94% fidelity across all families — strong for pre-flagging and good UX. It must not be the sole security boundary. A 30M model will mis-flag a fraction of decisions, and the failure modes are asymmetric: a false negative (a critical op flagged safe) would auto-run a destructive command without confirmation. Integrators must add a deterministic backstop: a hard whitelist/blacklist in the app that forces critical on known-dangerous operations (rm -rf, git push, DROP/DELETE, external mutating HTTP, MCP write tools, delete_file) regardless of the model's flag. Treat the model's risk field as a fast hint that pre-fills the confirmation gate, with the app's deterministic rules as the enforced boundary.
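
A deterministic backstop of the kind described above might look like the following sketch. The rule set is illustrative and deliberately incomplete. Only the `cmd` argument appears in this card; the `sql`, `method`, and `tool` argument names and the MCP read-tool allowlist are assumptions to be replaced with whatever the integrating app actually uses.

```python
import re

# Ops that are always gated, whatever the model says.
ALWAYS_CRITICAL_OPS = {"git_push", "delete_file"}
# Illustrative patterns only; a real deployment needs a much fuller list.
DANGEROUS_CMD = re.compile(r"\brm\s+-(?:rf|fr)\b|\bgit\s+push\b")
MUTATING_SQL = re.compile(r"\b(?:DROP|DELETE)\b", re.IGNORECASE)
SAFE_HTTP_METHODS = {"GET", "HEAD", "OPTIONS"}
# Placeholder allowlist: any MCP tool not listed here is treated as a write.
MCP_READ_TOOLS = {"read_file", "list_directory"}

def must_be_critical(step: dict) -> bool:
    """Return True if the app's own rules require confirmation for this step."""
    op = step.get("op")
    args = step.get("args") or {}
    if op in ALWAYS_CRITICAL_OPS:
        return True
    if op == "run_command":
        return bool(DANGEROUS_CMD.search(str(args.get("cmd", ""))))
    if op == "db_query":
        return bool(MUTATING_SQL.search(str(args.get("sql", ""))))
    if op == "api_call":
        # A missing method fails closed.
        return str(args.get("method", "")).upper() not in SAFE_HTTP_METHODS
    if op == "mcp_call":
        return args.get("tool") not in MCP_READ_TOOLS
    return False

def enforce_risk(step: dict) -> dict:
    """Upgrade the model's flag to critical when a rule fires; never downgrade."""
    forced = must_be_critical(step) or step.get("risk") != "safe"
    return {**step, "risk": "critical" if forced else "safe"}
```

The model's flag can only make a step stricter here, so a false negative from the model is caught by the rules and a false positive costs one extra confirmation.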

Op vocabulary

  • Routing/IO: open_file, list_dir, run_command, scrape, summarize, capture, open_app
  • Hand-off: ask_code_model, write_file
  • Control: retry, escalate, backoff, reauth, continue, stop
  • Integrations: api_call, mcp_call, db_query, webhook_wait, fs_watch, git_push

Intended use

SAM-G-CobraTooling is intended as the local planning/routing/reaction layer of an agentic IDE: it decomposes an instruction into ordered tool calls, reacts to observations (exit codes, stderr, HTTP status, DB row counts, webhook payloads, file-change events), and emits structured, risk-flagged plans offline and for free. This covers roughly the procedural majority of agentic turns; hard code generation and long exact chains are escalated to a larger model via ask_code_model.
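
Feeding an observation back to the model amounts to serializing it into the reactive prompt form shown under Output format. A minimal sketch follows; `build_prompt` is a hypothetical helper, and the compact JSON separators are inferred from the examples in this card rather than documented as a requirement.

```python
import json
from typing import Optional

def build_prompt(intent: str, observation: Optional[dict] = None) -> str:
    """Build a routing prompt, or a reactive one when an observation is given."""
    if observation is None:
        return f"{intent} [ACTION]"
    # Compact separators reproduce the spacing used in the card's examples.
    obs = json.dumps(observation, separators=(",", ":"))
    return f"{intent} | {obs} [ACTION]"

print(build_prompt("rate limited, back off and retry",
                   {"last_op": "api_call", "status": 429}))
# rate limited, back off and retry | {"last_op":"api_call","status":429} [ACTION]
```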

Usage

import sentencepiece as spm
import torch

# load the SAM-G tokenizer
sp = spm.SentencePieceProcessor()
sp.Load("samg_tokenizer.model")

# routing
prompt = "open src/main.js and run the tests [ACTION]"
# -> {"plan":[{"op":"open_file","args":{"path":"src/main.js"},"risk":"safe"},
#             {"op":"run_command","args":{"cmd":"pytest"},"risk":"safe"}]}

# reactive: HTTP 429 -> back off and retry
prompt = "rate limited, back off and retry | {\"last_op\":\"api_call\",\"status\":429} [ACTION]"
# -> {"plan":[{"op":"backoff","args":{"seconds":30},"risk":"safe"},
#             {"op":"retry","args":{"attempt":2},"risk":"safe"}]}

ids = torch.tensor([sp.EncodeAsIds(prompt)])
# greedy-decode the [ACTION] span -> structured plan JSON
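
The snippet above stops at the decoding step and does not show how the model is loaded, so the loop below is only a generic sketch of greedy decoding. `next_logits` is a hypothetical stand-in for a forward pass that returns the logits for the next token; `eos_id` would normally come from the tokenizer (for SentencePiece, `sp.eos_id()`), and the returned ids can be turned back into text with `sp.DecodeIds(...)`.

```python
from typing import Callable, Sequence

def greedy_decode(next_logits: Callable[[list[int]], Sequence[float]],
                  prompt_ids: list[int],
                  eos_id: int,
                  max_new_tokens: int = 128) -> list[int]:
    """Repeatedly append the highest-scoring token until EOS or the length cap."""
    ids = list(prompt_ids)
    generated: list[int] = []
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        # argmax over the vocabulary, written without any tensor library
        token = max(range(len(logits)), key=logits.__getitem__)
        if token == eos_id:
            break
        ids.append(token)
        generated.append(token)
    return generated
```

With a PyTorch model, `next_logits` could wrap a forward call under `torch.no_grad()` and return the last position's logits as a list; the exact call depends on how the checkpoint is packaged.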

Limitations

  • build_test_cycle (17%) and plan_chain/risk_gate (58% exact match each) plateau because long, strictly ordered plans are hard at 30M parameters; decompose long plans app-side into shorter sub-calls.
  • The risk flag is advisory (94% fidelity); enforce a deterministic backstop in the app, as above.
  • Traces are synthetic, drawn from the training family distribution with a disjoint evaluation seed; coverage reflects the generator, not arbitrary real-world tool APIs.
  • Not a general assistant and does not write code; it orchestrates and hands off. Inherits the base model's knowledge limits.

Citation

@misc{samgcobratooling2026,
  title  = {SAM-G-CobraTooling: Risk-Flagged Agentic Tool-Call Orchestration at 30M Parameters},
  author = {AMEFORGE Lab},
  year   = {2026}
}