scalloptools-1

A 4B function-calling specialist for local assistants. It reads a user turn and decides which tool to call and with what arguments, or declines when no tool fits.

scalloptools-1 is a LoRA fine-tune of Qwen3.5-4B, distilled from ScallopBot production traces with a larger model writing the labels. The student never trained on its own generations. The repo ships a q5_k_m GGUF for local serving and the raw adapter for reproduction.

Links: scallopbot.com · GitHub

Base model: Qwen3.5-4B
Adapter: LoRA, rank 32, alpha 64, 2 epochs
Quant: q5_k_m GGUF (3.16 GB)
Context: inherited from Qwen3.5-4B
Serving: thinking off (chain-of-thought hurts this task at 4B)
Toolset: shell, file read/write, HTTP fetch, memory store, project APIs

Files

| File | Format | Size | Notes |
| --- | --- | --- | --- |
| scalloptools-1.q5_k_m.gguf | GGUF Q5_K_M | 3.16 GB | Recommended for llama.cpp / Ollama / LM Studio |
| adapter/ | PEFT LoRA | 170 MB | Apply on top of Qwen/Qwen3.5-4B with transformers + PEFT |

How to run

Serve with thinking disabled. The model is trained and benchmarked in the no-think path; turning chain-of-thought on lowered every metric below.

llama.cpp

llama-server -m scalloptools-1.q5_k_m.gguf \
  --chat-template-kwargs '{"enable_thinking":false}'

Ollama

ollama run hf.co/tashfene/scalloptools-1:Q5_K_M

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="tashfene/scalloptools-1",
    filename="scalloptools-1.q5_k_m.gguf",
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Read the file ./notes.md"}],
    tools=[...],          # your tool schemas
)

Adapter on the base model (transformers + PEFT)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B")
model = PeftModel.from_pretrained(base, "tashfene/scalloptools-1", subfolder="adapter")
tok = AutoTokenizer.from_pretrained("tashfene/scalloptools-1", subfolder="adapter")

Intended use

Routing for a personal-assistant agent that has a small, stable set of tools. The model picks the function and arguments; a host loop runs the call and feeds the result back. It is a router, not a reasoner: it does not know your domain, only the shape of these tools.
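The host loop described above can be sketched roughly as follows. This is a minimal illustration, not ScallopBot's actual loop: it assumes the assistant message uses the OpenAI-style `tool_calls` shape (a `function` object with a `name` and JSON-encoded `arguments`), and the `echo` tool and the `dispatch` and `step` helpers are hypothetical names introduced here.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical registry: "echo" stands in for whatever tools your host exposes.
TOOLS: Dict[str, Callable[..., Any]] = {"echo": lambda text: text}


def dispatch(tool_call: Dict[str, Any], tools: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Run one tool call and wrap the outcome as a tool-role message."""
    spec = tool_call["function"]
    name = spec["name"]
    raw = spec.get("arguments") or "{}"
    if name not in tools:
        payload = {"ok": False, "error": f"unknown tool: {name}"}
    else:
        try:
            # Arguments usually arrive as a JSON string; accept a dict as well.
            args = json.loads(raw) if isinstance(raw, str) else raw
            payload = {"ok": True, "result": tools[name](**args)}
        except Exception as exc:
            # Report the failure to the model instead of crashing the loop.
            payload = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return {
        "role": "tool",
        "tool_call_id": tool_call.get("id", ""),
        "content": json.dumps(payload),
    }


def step(assistant_message: Dict[str, Any], tools: Dict[str, Callable[..., Any]]) -> List[Dict[str, Any]]:
    """Return tool-result messages for one assistant turn (empty if it declined)."""
    return [dispatch(call, tools) for call in (assistant_message.get("tool_calls") or [])]
```

The host would append the returned messages to the conversation and call the model again. An empty list means the model declined to call a tool and its text reply can go straight to the user.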

Evaluation

The test set is 114 turns held out from real sessions, none seen in training. Every model ran in the same harness with greedy decoding and thinking off.

| Metric | scalloptools-1 | Qwen3.5-4B (stock) | Qwen3.6-35B MoE | Qwen3.6-Plus (hosted) |
| --- | --- | --- | --- | --- |
| Tool selection | 73.3% | 65.3% | 46.5% | 54.7% |
| No-tool precision | 39.3% | 32.0% | 39.3% | 42.9% |
| Args key-F1 | 0.243 | 0.204 | 0.225 | 0.199 |
| Parse success | 100% | 100% | 100% | 100% |
| Median latency | 3.3s | 7.3s | 15.4s | 5.2s |

The 35B and the hosted Plus model both carry far more world knowledge. On this fixed toolset they still pick the wrong function more often than the 4B, which has memorized how these specific tools behave. Read it narrowly: a specialist wins on its own toolset, and these numbers predict nothing about general tool-calling.

No-tool precision is the weak column. When the right move is to call nothing, the model still reaches for a tool more than half the time, because genuine no-tool turns are scarce in the training traces.
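For readers who want to score their own traces, here is a rough sketch of how metrics like these might be computed. The card does not define its metrics, so these are textbook definitions and may differ from what the harness actually does. The function names are hypothetical, and `None` is assumed to mean "no tool called".

```python
from typing import Dict, List, Optional


def tool_selection_accuracy(gold: List[Optional[str]], pred: List[Optional[str]]) -> float:
    """Fraction of turns where the predicted tool name matches the gold one."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def no_tool_precision(gold: List[Optional[str]], pred: List[Optional[str]]) -> float:
    """Of the turns where the model declined, the fraction where declining was right."""
    declined = [g for g, p in zip(gold, pred) if p is None]
    if not declined:
        return 0.0
    return sum(g is None for g in declined) / len(declined)


def args_key_f1(gold_args: Dict, pred_args: Dict) -> float:
    """F1 over argument key sets; argument values are ignored (an assumption)."""
    g, p = set(gold_args), set(pred_args)
    if not g and not p:
        return 1.0
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up tool names.
gold = ["read_file", None, "http_fetch", None]
pred = ["read_file", "shell", None, None]
print(tool_selection_accuracy(gold, pred))  # 0.5
print(no_tool_precision(gold, pred))        # 0.5
```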

Fabrication

A model that invents a tool result instead of admitting one failed breaks an agent loop. I fed empty and error results and checked the response.

| Test | Fabricated | Honest report | Retried to exhaustion |
| --- | --- | --- | --- |
| Single failure (30 cases) | 0 | 0 | 30 |
| Same failure, 3 rounds (30 cases) | 0 | 1 | 27 |

scalloptools-1 never fabricated a result, across single and repeated failures, beating every larger model in the lineup on that axis. That honesty comes with a cost I have not fixed: against a dead tool the model keeps retrying instead of stopping to report the failure. Safe, but it loops. Teaching a 4B to give up and report cleanly is the open problem.
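Until the model learns to stop on its own, the host can bound the loop. The sketch below is one possible workaround, not part of this repo: a hypothetical `RetryGuard` class counts identical failing calls and tells the host when to stop and report instead of forwarding another retry.

```python
import json
from collections import Counter
from typing import Any, Dict


class RetryGuard:
    """Counts failures per (tool, arguments) pair and caps the retries."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self.failures: Counter = Counter()

    @staticmethod
    def _key(name: str, args: Dict[str, Any]) -> str:
        # Sorted keys so argument order does not create distinct entries.
        return name + ":" + json.dumps(args, sort_keys=True)

    def record_failure(self, name: str, args: Dict[str, Any]) -> None:
        self.failures[self._key(name, args)] += 1

    def should_stop(self, name: str, args: Dict[str, Any]) -> bool:
        """True once the same call has failed max_attempts times."""
        return self.failures[self._key(name, args)] >= self.max_attempts

    def failure_report(self, name: str, args: Dict[str, Any], error: str) -> str:
        """Plain-text report the host can show the user in place of another retry."""
        n = self.failures[self._key(name, args)]
        return f"Tool '{name}' failed {n} time(s) with: {error}. Stopping instead of retrying."
```

The host would call `record_failure` after each failed result and check `should_stop` before executing the next call the model proposes. The cap of three attempts is an arbitrary default.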

Training

The traces come from one person's assistant, so the distribution is narrow and personal. Before training, every example passed through a deterministic anonymizer that swaps real names, emails, phone numbers, handles, and project ids for stable fakes and refuses to write a file if any known real token survives. Real-name and anonymized held-out sets scored the same (73.3% either way), so the substitution costs no measurable accuracy. The recipe caps examples per session, dedupes globally, drops turns that reference stale state, and keeps a dedicated track of examples showing honest responses to empty and failed tool results.
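The anonymizer itself is not included in this card. As a rough illustration of the two properties described above (stable fakes, and refusing to emit output when a real token survives), a sketch might look like this. The hashing scheme and the names `stable_fake`, `anonymize`, and `ensure_clean` are assumptions, not the actual pipeline.

```python
import hashlib
import re
from typing import Dict, Iterable, List


def stable_fake(real: str, kind: str) -> str:
    """Map a real token to the same fake every time (deterministic hash)."""
    digest = hashlib.sha256(f"{kind}:{real.lower()}".encode("utf-8")).hexdigest()[:8]
    return f"{kind}_{digest}"


def anonymize(text: str, known: Dict[str, str]) -> str:
    """Replace every known real token (token -> kind) with its stable fake."""
    # Longest tokens first, so a token that contains another is replaced whole.
    for real in sorted(known, key=len, reverse=True):
        fake = stable_fake(real, known[real])
        text = re.sub(re.escape(real), lambda _m, f=fake: f, text, flags=re.IGNORECASE)
    return text


def survivors(text: str, forbidden: Iterable[str]) -> List[str]:
    """Known real tokens that still appear in the text, case-insensitively."""
    lowered = text.lower()
    return [tok for tok in forbidden if tok.lower() in lowered]


def ensure_clean(text: str, forbidden: Iterable[str]) -> str:
    """Return the text unchanged, or raise so the caller never writes the file."""
    left = survivors(text, forbidden)
    if left:
        raise ValueError(f"refusing to write: {len(left)} real token(s) survived")
    return text
```

A caller would write the file only after `ensure_clean(anonymize(text, known), known)` returns without raising.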

Limitations and bias

  • One toolset, one user's habits. Point it at different tools and the selection numbers will not hold.
  • Low no-tool precision. Pair it with a confidence gate where a stray call is expensive.
  • It retries failed tools instead of reporting them.
  • 4B holds little world knowledge. It routes calls; it does not reason about your domain.
  • Trained on a single individual's data, so it inherits that person's tool habits and phrasing.
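The confidence gate mentioned in the list above could take many forms. Here is one minimal sketch. It assumes the host can obtain per-token log-probabilities for the generated call, and it uses their geometric mean as a confidence score. The tool names, the 0.8 threshold, and both function names are hypothetical.

```python
import math
from typing import FrozenSet, Optional, Sequence

# Hypothetical names for tools where a stray call is costly.
EXPENSIVE_TOOLS: FrozenSet[str] = frozenset({"shell", "file_write"})


def confidence_from_logprobs(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of the generated call, in [0, 1]."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def gate(
    tool_name: Optional[str],
    confidence: float,
    threshold: float = 0.8,
    expensive: FrozenSet[str] = EXPENSIVE_TOOLS,
) -> str:
    """Decide what the host does with a proposed call: 'no_call', 'confirm', or 'run'."""
    if tool_name is None:
        return "no_call"
    if tool_name in expensive and confidence < threshold:
        # Low confidence on a costly tool: ask the user before executing.
        return "confirm"
    return "run"
```

Cheap, read-only tools pass through unconditionally in this sketch. Only calls to tools listed as expensive are held for confirmation when confidence is low.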

License

Apache-2.0, inherited from the Qwen3.5-4B base.
