v7-GRPO-14B — LoRA adapter for procedural-compliance reasoning (heuristic mode)

LoRA adapter trained via endpoint-targeted GRPO on Qwen/Qwen2.5-14B-Instruct, representing the heuristic-mode endpoint of the procedural-reasoning investigation.

Model description

v7-GRPO was trained with endpoint-targeted GRPO: the only reward signal is the correctness of the final compliance answer, with no intermediate scaffolding. The model learned to reach high training-distribution accuracy through pattern matching on surface features rather than step-by-step procedural verification.

The v7-GRPO endpoint is one of the two endpoints in the engaged-vs-heuristic contrast that runs through §3 of the paper. The pairing:

  • B5-SFT-7B (engaged) reaches training-distribution accuracy by training the model to consult the procedure step-by-step.
  • v7-GRPO-14B (heuristic) reaches overlapping training-distribution accuracy through pattern matching, with notably different generalization behavior on out-of-distribution variants.

Scale note: v7-GRPO is 14B because no 7B v7-GRPO checkpoint was trained. The companion engaged endpoint (B5-SFT-7B) is 7B, and the published §3 overlapping-ceilings contrast uses this specific checkpoint pair. For an apples-to-apples scale comparison, train a same-scale variant; this checkpoint represents the heuristic-training-paradigm endpoint at the scale at which that paradigm was investigated.

Training data

Trained on the same canonical centralized procedural-reasoning dataset as B5-SFT-7B:

  • 9,013 unique instances after deduplication
  • 6 source cohorts (see B5-SFT-7B model card for full enumeration)
  • 8 institutional sources for lab safety + FDA + CMS regulatory procedures
  • 10 perturbation types per process

GRPO training uses the same input format as B5-SFT (procedure → scenario → question in a single user message). The reward is binary on the final classification token; no intermediate-reasoning scaffolding tokens are emitted or rewarded.
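
The binary endpoint reward described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code: the label strings `COMPLIANT` / `NON_COMPLIANT`, the function names, and the last-token matching rule are hypothetical. The normalization shown is the usual GRPO group-relative form (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev

# Hypothetical label vocabulary; the card does not state the actual label strings.
LABELS = ("COMPLIANT", "NON_COMPLIANT")

def endpoint_reward(completion: str, gold: str) -> float:
    """Return 1.0 if the final whitespace-delimited token matches the gold label, else 0.0."""
    if gold not in LABELS:
        raise ValueError(f"unknown gold label: {gold!r}")
    tokens = completion.strip().split()
    if not tokens:
        return 0.0
    # Only the last token is inspected; anything before it earns no reward.
    return 1.0 if tokens[-1].strip(".").upper() == gold else 0.0

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One sampled group of completions for a single prompt (made-up strings).
completions = ["COMPLIANT", "Step 1 holds. NON_COMPLIANT", "NON_COMPLIANT.", ""]
rewards = [endpoint_reward(c, "NON_COMPLIANT") for c in completions]
advantages = group_advantages(rewards)
```

In this sketch the second completion, which includes a reasoning fragment, receives the same reward as the bare label, which mirrors the statement that intermediate reasoning is not rewarded. When every completion in a group gets the same reward, all advantages are zero and the group contributes no learning signal.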

Intended use

Research:

  • Procedural-compliance reasoning evaluation as the heuristic-mode endpoint
  • Cross-checkpoint mech-interp comparing engaged vs heuristic training paradigms
  • Out-of-distribution behavioral evaluation

NOT intended for:

  • Deployment in production compliance auditing without independent verification
  • High-stakes decisions where binary classification is taken as authoritative

Limitations

Same dataset-level disclosures as B5-SFT-7B apply:

  • Cornell adversarial_surface_compliant: 72.3% accept rate during dataset construction
  • FDA prerequisite_omission: 65.2% accept rate during dataset construction
  • CMS internal-determination perturbation footnote (corpus characteristic, not a quality flag)

Additional behavior-specific limitations for the heuristic-mode endpoint:

  • The model reaches high training-distribution accuracy but generalizes inconsistently on out-of-distribution variants. Performance on previously unseen lexical or surface variations of compliance scenarios can drop substantially.
  • The model tends to emit the final classification token immediately, with little intermediate reasoning. This is efficient at inference time but provides less diagnostic information when debugging incorrect predictions.
  • Performance is sensitive to variations in question phrasing and procedure ordering. The model was trained to expect the procedure → scenario → question structure exactly; substantial deviation may degrade behavior.
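
One way to quantify the sensitivity described in these limitations is to compare accuracy on the training distribution against accuracy on each variant split. The sketch below is a hedged illustration only: the split names (`in_dist`, `lexical_variant`, `reordered_procedure`), the record layout, and the toy records are assumptions made for the example and are not evaluation data from this model.

```python
from collections import defaultdict

def accuracy_by_split(records):
    """records: iterable of (split, predicted_label, gold_label) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for split, pred, gold in records:
        totals[split] += 1
        hits[split] += int(pred == gold)
    return {s: hits[s] / totals[s] for s in totals}

def generalization_gap(acc, reference="in_dist"):
    """Accuracy drop of each non-reference split relative to the reference split."""
    return {s: acc[reference] - a for s, a in acc.items() if s != reference}

# Toy, made-up records purely to show the call shape.
records = [
    ("in_dist", "A", "A"), ("in_dist", "B", "B"),
    ("in_dist", "A", "B"), ("in_dist", "B", "B"),
    ("lexical_variant", "A", "A"), ("lexical_variant", "A", "B"),
    ("reordered_procedure", "B", "A"), ("reordered_procedure", "B", "B"),
]
acc = accuracy_by_split(records)
gaps = generalization_gap(acc)
```

A positive gap means the variant split scores below the training distribution. Reporting the gap per variant type, rather than a single pooled out-of-distribution number, makes it easier to see whether phrasing changes or procedure reordering is responsible for a drop.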

Evaluation results

Performance metrics from the cleaned canonical dataset:

| Metric | Value |
|---|---|
| Training-distribution accuracy (v7-GRPO recipe headline) | ~0.83 |
| Yang Stage-3 retrieval head L19.H17 δ on cleaned procedural (base 7B context) | 0.886 |
| §3 intervention cross-checkpoint contrast at L=20 H=17 (specific to v7-GRPO 7B-context) | available in §3 cross-checkpoint tables |

For the §3 14B-scale cross-checkpoint analysis (Yang re-anchor d.4 in progress), refer to the procedural-reasoning paper's §3 cross-checkpoint table for the most current numbers.

Prompt format

Use the standard Qwen2.5 chat template via tokenizer.apply_chat_template. The template is identical to the one used for B5-SFT-7B; the only difference is the scale of the underlying base model (14B vs 7B).

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the 14B base model in bfloat16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = PeftModel.from_pretrained(base, "kennethp97/v7-grpo-14b")
model.eval()

# Procedure, scenario, and question go in a single user message, in that order.
messages = [{"role": "user", "content": "Procedure: ...\n\nScenario: ...\n\nQuestion: ..."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens, not the echoed prompt.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
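
Because the model expects the procedure → scenario → question structure exactly, it can help to build the user message through a single helper instead of hand-assembling strings. The following is a minimal sketch; the function name `build_user_message` and the empty-field check are assumptions for illustration and are not part of the released code.

```python
def build_user_message(procedure: str, scenario: str, question: str) -> list[dict]:
    """Assemble the single user message in the fixed procedure/scenario/question order."""
    fields = {"procedure": procedure, "scenario": scenario, "question": question}
    for name, value in fields.items():
        # Reject empty fields so a malformed prompt is caught before generation.
        if not value.strip():
            raise ValueError(f"{name} must be non-empty")
    content = (
        f"Procedure: {procedure.strip()}\n\n"
        f"Scenario: {scenario.strip()}\n\n"
        f"Question: {question.strip()}"
    )
    return [{"role": "user", "content": content}]

# Placeholder strings only; substitute real procedure text.
messages = build_user_message(
    "Wear goggles before opening the cabinet.",
    "The technician opened the cabinet, then put on goggles.",
    "Was the procedure followed?",
)
```

The returned list can be passed directly to `tok.apply_chat_template` in place of the hand-written `messages` list.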

License

Apache 2.0 for this LoRA adapter. The Qwen2.5-14B-Instruct base model is distributed under its own license; see the official Qwen repository for terms.

Citation

@unpublished{paulsen2026_procedural_reasoning,
  author = {Paulsen, Kenneth and collaborators},
  title  = {Procedural-reasoning compliance: engaged vs heuristic modes},
  year   = {2026},
  note   = {Manuscript in preparation; citation form will be finalized at submission.}
}