HRM-Text-1B-agent (v1): tool / function calling

Full-parameter SFT of sapientinc/HRM-Text-1B, a 1B base (pre-alignment) Hierarchical Reasoning Model, fine-tuned to do function / tool calling. It takes a base model that scored 0% on all four BFCL call categories and turns it into a competent small tool-caller.

Code, full writeup, and the architecture experiments: https://github.com/jasoncarreira/hrm-text-agent

See also hrm-text-agent-v2, which adds xLAM parallel data for much stronger multi-call performance (with some tradeoffs).

Scores: BFCL v4 (official AST checker, full test sets)

Category            n     Base    This model (v1)
simple              400   0%      61.5%
multiple            200   0%      53.5%
parallel            200   0%      37.5%
parallel_multiple   200   0%      28.0%
irrelevance         240   100%    80.8%

Overall micro-average (n = 1,240): 54.7%. Call-category competence (the 4 call categories, count-weighted): ~48%. This sits above generic 1B instruct models on BFCL non-live AST and below the best purpose-built 1B model (xLAM-2-1b-fc-r, ~69%), which is a strong result for a 1B base + SFT.
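Both aggregate figures follow directly from the per-category rows. As a quick check, here is a minimal sketch that recomputes them from the counts and accuracies in the table (the variable and function names are illustrative, not taken from the repo):

```python
# Per-category (n, accuracy %) for this model, copied from the table above.
scores = {
    "simple": (400, 61.5),
    "multiple": (200, 53.5),
    "parallel": (200, 37.5),
    "parallel_multiple": (200, 28.0),
    "irrelevance": (240, 80.8),
}

def micro_average(categories):
    """Count-weighted mean accuracy (in %) over the given categories."""
    total = sum(scores[c][0] for c in categories)
    correct = sum(scores[c][0] * scores[c][1] / 100 for c in categories)
    return 100 * correct / total

overall = micro_average(scores)
call_only = micro_average([c for c in scores if c != "irrelevance"])
print(f"overall: {overall:.1f}%")          # overall: 54.7%
print(f"call categories: {call_only:.1f}%")  # call categories: 48.4%
```

The irrelevance set pulls the overall number up, which is why the call-only figure is the more conservative measure of tool-calling skill.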

General capability (base → tuned, 8-benchmark forgetting check)

The SFT was benign: reasoning and knowledge were retained (GSM8k 84.5 → 85.6, DROP (F1) 84.8 → 83.3, and BoolQ/HellaSwag/Winogrande flat). The only drops were in single-letter MCQ format discipline (MMLU 60.1 → 55.5, ARC-C 83.5 → 75.1), driven by a rise in non-letter outputs (~10% "invalid"). These drops are largely recoverable, since the model often answers correctly in prose, and they are fixed in v2.

Training

Matches the sapientinc cfg_sft recipe:

  • full-parameter SFT, not LoRA (LoRA on HRM's weight-shared recurrence amplifies the delta per-cycle and collapses the output distribution); bf16 autocast + fp32 master weights
  • lr 3e-5, cosine decay to 10%, no warmup; AdamW (0.9, 0.95), weight_decay 0.1
  • 3 epochs, max_len 2048, effective batch ~32; ~25k mixed examples, ~3.5 h on an A100 80GB
  • Uses the model's direct condition token (<|object_ref_start|>) β€” the documented mode for structured output.
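The learning-rate bullet above can be read as a schedule along the following lines. This is a minimal sketch, assuming "cosine decay to 10%" means the rate falls from the 3e-5 peak to a floor of 10% of that peak over the whole run; the function name and the step count are hypothetical, and the recipe in the repo may define the schedule differently.

```python
import math

PEAK_LR = 3e-5      # peak learning rate from the recipe above
FLOOR_RATIO = 0.10  # assumption: "decay to 10%" means 10% of the peak rate

def lr_at(step, total_steps):
    """Cosine decay from PEAK_LR to FLOOR_RATIO * PEAK_LR, with no warmup."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 at start, 0 at end
    return PEAK_LR * (FLOOR_RATIO + (1.0 - FLOOR_RATIO) * cosine)

# Hypothetical run length: ~25k examples at batch size 32, for 3 epochs.
total = math.ceil(25_000 / 32) * 3
for step in (0, total // 2, total):
    print(step, f"{lr_at(step, total):.2e}")
```

With no warmup, the first optimizer step already runs at the full peak rate, so the schedule starts at 3e-5 and ends at 3e-6 under these assumptions.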

Data mix: tool calls (Hermes + glaive-function-calling) + general instructions (HuggingFaceH4/no_robots) + synthesized irrelevance ("tools present but none fit → don't call").

Usage

HRM-Text is a PrefixLM with a conditioning scheme: prompts open <|im_start|><|object_ref_start|>…<|im_end|> with token_type_ids=1 over the prompt span. A naive .generate() won't match the training distribution, so use the agent loop / eval harness in the repo:

git clone https://github.com/jasoncarreira/hrm-text-agent && cd hrm-text-agent
pip install -r requirements.txt
python infer_agent.py --model jasoncarreira/hrm-text-agent "What's the weather in Paris?"
python bfcl_local.py --model jasoncarreira/hrm-text-agent --dump errs.jsonl   # full BFCL
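To illustrate the PrefixLM conditioning described above, here is a minimal sketch of the token_type_ids mask. It assumes that positions outside the prompt span are marked 0, which the text does not state; the helper name and the token IDs in the usage lines are placeholders, and the harness in the repo remains the reference for how prompts are actually built.

```python
def prefix_token_type_ids(prompt_len, total_len):
    """Return 1 over the prompt span and 0 elsewhere (the 0 is an assumption)."""
    if not 0 <= prompt_len <= total_len:
        raise ValueError("prompt_len must be between 0 and total_len")
    return [1] * prompt_len + [0] * (total_len - prompt_len)

# Placeholder IDs: 5 prompt tokens followed by 3 generated tokens.
prompt_ids = [101, 102, 103, 104, 105]
generated_ids = [201, 202, 203]
mask = prefix_token_type_ids(len(prompt_ids), len(prompt_ids) + len(generated_ids))
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

A plain .generate() call that never passes such a mask would present the prompt differently from how it appeared in training, which is the mismatch the paragraph above warns about.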

License & data lineage

The base model is Apache-2.0, but the training data includes HuggingFaceH4/no_robots (CC-BY-NC-4.0), so treat this derived model as research / non-commercial. Verify the licenses of all sources for your use case.

🤖 Built with Claude Code (including a second Claude driving training on the GPU pod).
