Phase-2 Arm-A SFT -- LoRA adapter on Qwen2.5-1.5B-Instruct

A LoRA adapter on Qwen/Qwen2.5-1.5B-Instruct, trained with plain supervised fine-tuning on the train_registry v0.4.0 procedural-compliance corpus. Each training row is one half of a flip or anchor pair: a short reasoning that cites the deciding clause, ending in FINAL ANSWER: <compliant|non-compliant>.

What it does

Given a procedure and a scenario, the model emits an EDGE CHECKS: reasoning block followed by a FINAL ANSWER: compliant|non-compliant line. The recipe targets the free-form regime; gains concentrate there.

Headline eval (frozen 233-process held-out; 128 flip + 122 anchor; greedy / T=0)

regime flip rate (base -> SFT) anchor acc (base -> SFT) plain (base -> SFT)
forced 0.117 -> 0.188 0.557 -> 0.582 0.570 -> 0.608
free-form 0.219 -> 0.469 (+25.0pp) 0.467 -> 0.664 (+19.7pp) 0.576 -> 0.726

The lift is free-form-only (the regime the reasoning recipe targets); the gains concentrate on exception / hierarchy / threshold handles, while step-ordering stays flat (0.200 -> 0.225) -- the known structural bottleneck.

How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-1.5B-Instruct"
ADAPTER = "kennethp97/sft-arm-a-1p5b"

tok = AutoTokenizer.from_pretrained(BASE, use_fast=True)
tok.pad_token = tok.pad_token or tok.eos_token

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16,
                                            device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

Prompt format and a worked side-by-side eval against the base and the companion DPO adapter (kennethp97/dpo-flip-1p5b) are in the combined eval notebook.

Training summary

  • Base: Qwen/Qwen2.5-1.5B-Instruct
  • LoRA r=32 alpha=64 on q/k/v/o/gate/up/down
  • Plain SFT (cross-entropy on the chosen completion), full bf16
  • Training set: 3,734 rows (after filtering 1,226 placeholder-verifier_reason rows from the 5,020-row v0.4.0 corpus)

Limitations

  • Research checkpoint, not a production classifier. Below the pre-registered absolute GO bar.
  • Step-ordering bottleneck. Ordering flip stays nearly flat.
  • Free-form is the target regime. Forced-verdict gains are small.
  • Format sensitivity. Trained on the EDGE CHECKS ... FINAL ANSWER format above; deviation may degrade performance. Greedy (T=0) matches the reported numbers.

License

Adapter: Apache-2.0. Base model: under the Qwen2.5-1.5B-Instruct license.

Downloads last month
16
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for kennethp97/sft-arm-a-1p5b

Adapter
(1049)
this model