HRM-Text-1B-agent v2: tool / function calling (xLAM-scaled)

Full-parameter SFT of sapientinc/HRM-Text-1B for function / tool calling. This is v2 of hrm-text-agent: it adds xLAM parallel / multi-call data and a format-discipline slice. The result is a much stronger tool caller, with call competence in best-1B territory, at the cost of some tradeoffs (below).

Code + full writeup: https://github.com/jasoncarreira/hrm-text-agent

Scores: BFCL v4 (official AST checker, full test sets)

| Category | n | v1 | v2 | Δ (pp) |
|---|---|---|---|---|
| simple | 400 | 61.5% | 81.5% | +20.0 |
| multiple | 200 | 53.5% | 77.0% | +23.5 |
| parallel | 200 | 37.5% | 59.0% | +21.5 |
| parallel_multiple | 200 | 28.0% | 42.5% | +14.5 |
| irrelevance | 240 | 80.8% | 60.8% | −20.0 |

Call-category competence (4 call categories, count-weighted): ~48% → 68.3%, which is in the purpose-built-1B range (xLAM-2-1b-fc-r ~69%). Overall micro-average (n = 1,240): 54.7% → 66.8%.
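The count-weighted figures can be reproduced from the per-category table. The following is a minimal sketch, not part of the repo: the dictionary layout and the helper name `count_weighted` are illustrative, and the inputs are the rounded percentages from the table, so the last digit may differ slightly from a computation over raw pass counts.

```python
# (n, v1 %, v2 %) per BFCL category, copied from the table above.
SCORES = {
    "simple": (400, 61.5, 81.5),
    "multiple": (200, 53.5, 77.0),
    "parallel": (200, 37.5, 59.0),
    "parallel_multiple": (200, 28.0, 42.5),
    "irrelevance": (240, 80.8, 60.8),
}
CALL_CATS = ["simple", "multiple", "parallel", "parallel_multiple"]


def count_weighted(cats, col):
    """Average the per-category percentages, weighted by test-set size n.

    col is 1 for the v1 column and 2 for the v2 column.
    """
    total_n = sum(SCORES[c][0] for c in cats)
    return sum(SCORES[c][0] * SCORES[c][col] for c in cats) / total_n


call_v1 = count_weighted(CALL_CATS, 1)
call_v2 = count_weighted(CALL_CATS, 2)
overall_v1 = count_weighted(SCORES, 1)  # all five categories, n = 1,240
overall_v2 = count_weighted(SCORES, 2)

print(f"call categories: {call_v1:.1f}% -> {call_v2:.1f}%")   # 48.4% -> 68.3%
print(f"overall:         {overall_v1:.1f}% -> {overall_v2:.1f}%")  # 54.7% -> 66.8%
```

Because irrelevance carries 240 of the 1,240 examples, its 20-point drop pulls the overall gain (about 12 points) well below the call-category gain (about 20 points).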

Tradeoff (irrelevance −20): the xLAM data is all-call (no "don't-call" cases), so the model became more eager to call even when no tool actually fits.

General capability (base → v1 → v2)

| Benchmark | base | v1 | v2 | note |
|---|---|---|---|---|
| MMLU | 60.1% | 55.5% | 58.4% | invalid 11.9% → 1.4% (format recovered) |
| ARC-C | 83.5% | 75.1% | 83.2% | invalid 9.9% → 0% (back to base) |
| HellaSwag | 63.3% | 61.9% | 61.9% | stable |
| Winogrande | 72.2% | 70.6% | 70.7% | stable |
| BoolQ | 86.3% | 87.3% | 86.3% | stable |
| DROP (F1) | 84.8% | 83.3% | 83.7% | stable |
| GSM8k | 84.5% | 85.6% | 78.6% | −7 vs v1 (real reasoning, invalid 0%) |
| MATH-1000 | 49.3% | 45.4% | 37.0% | −8 vs v1 (accuracy, not format) |

The format-discipline slice recovered the v1 MCQ regression (ARC-C is back to base level; MMLU's invalid rate collapsed). But the call-heavy mix introduced a new free-form-math regression (GSM8k −7, MATH −8 vs v1), and this one is not a format artifact. Net: stronger calls and a cured MCQ format, at the cost of irrelevance discipline and free-form math. v3 lever: rebalance the mix (more no-call data, protect the reasoning share).
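The regressions named above can be read straight off the table by differencing the columns. This is a small sketch under stated assumptions: the 2-point cutoff `THRESHOLD` is a hypothetical choice made only for illustration, not a criterion used by the author, and the helper name `deltas` is likewise illustrative.

```python
# (base, v1, v2) scores in percent, copied from the table above.
RESULTS = {
    "MMLU": (60.1, 55.5, 58.4),
    "ARC-C": (83.5, 75.1, 83.2),
    "HellaSwag": (63.3, 61.9, 61.9),
    "Winogrande": (72.2, 70.6, 70.7),
    "BoolQ": (86.3, 87.3, 86.3),
    "DROP (F1)": (84.8, 83.3, 83.7),
    "GSM8k": (84.5, 85.6, 78.6),
    "MATH-1000": (49.3, 45.4, 37.0),
}

THRESHOLD = 2.0  # hypothetical cutoff, in percentage points


def deltas(results):
    """Return {benchmark: (v2 - v1, v2 - base)}, rounded to one decimal."""
    return {
        name: (round(v2 - v1, 1), round(v2 - base, 1))
        for name, (base, v1, v2) in results.items()
    }


d = deltas(RESULTS)
regressed_vs_v1 = [name for name, (vs_v1, _) in d.items() if vs_v1 <= -THRESHOLD]
recovered_vs_v1 = [name for name, (vs_v1, _) in d.items() if vs_v1 >= THRESHOLD]

print("regressed vs v1:", regressed_vs_v1)  # ['GSM8k', 'MATH-1000']
print("recovered vs v1:", recovered_vs_v1)  # ['MMLU', 'ARC-C']
```

With this cutoff, the only benchmarks that move materially against v1 are the two MCQ recoveries and the two free-form-math regressions; the rest stay within about one point.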

Training

Same cfg_sft recipe as v1 (full-parameter, lr 3e-5, cosine, 3 epochs, max_len 2048, bf16, direct condition token). Data: the v1 mix (Hermes + glaive + no_robots + synthesized irrelevance) + ~14k parallel-biased Salesforce/xlam-function-calling-60k + ~3k format-discipline examples (single-letter MCQ + \boxed{} math, drawn from train/aux splits, so leakage-safe), all interleaved. Training ran ~3 epochs on an A100 80GB.
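For readers who want the hyperparameters in one place, here is a sketch of the recipe as a plain Python config plus a cosine learning-rate schedule. Everything structural here is an assumption: the class name `SFTRecipe` and its field names are hypothetical and do not mirror the repo's actual cfg_sft keys, and the schedule assumes no warmup and decay to zero, neither of which the card specifies. Only the values (3e-5, cosine, 3 epochs, 2048, bf16, full-parameter) come from the text above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTRecipe:
    """Hypothetical container; field names are illustrative, not the repo's."""
    learning_rate: float = 3e-5
    schedule: str = "cosine"
    epochs: int = 3
    max_len: int = 2048
    dtype: str = "bf16"
    full_parameter: bool = True


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    Assumes no warmup phase; min_lr=0.0 is also an assumption.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


recipe = SFTRecipe()
total_steps = 1000  # placeholder; the real step count depends on dataset and batch size
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps, recipe.learning_rate))
```

Under these assumptions the rate starts at 3e-5, passes through half of that at the midpoint, and reaches zero at the final step.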

Usage

Same as v1: HRM-Text is a PrefixLM that needs the direct condition envelope, so use the repo harness rather than a bare .generate():

git clone https://github.com/jasoncarreira/hrm-text-agent && cd hrm-text-agent
pip install -r requirements.txt
python infer_agent.py --model jasoncarreira/hrm-text-agent-v2 "Book a table for 2 and check the weather"
python bfcl_local.py --model jasoncarreira/hrm-text-agent-v2 --dump errs.jsonl

License & data lineage

The base model is Apache-2.0, but the training data includes no_robots (CC-BY-NC-4.0) and xLAM-60k (gated, CC-BY-4.0), so treat this derived model as research / non-commercial. Verify the licenses of all sources for your use case.

🤖 Built with Claude Code (including a second Claude driving training on the GPU pod).
