v7-GRPO-14B — LoRA adapter for procedural-compliance reasoning (heuristic mode)

LoRA adapter trained via endpoint-targeted GRPO on Qwen/Qwen2.5-14B-Instruct, representing the heuristic-mode endpoint of the procedural-reasoning investigation.

Model description

v7-GRPO was trained with endpoint-targeted GRPO: the only reward signal is the correctness of the final compliance answer, with no intermediate scaffolding. The model learned to reach high training-distribution accuracy through pattern matching on surface features rather than step-by-step procedural verification.

The v7-GRPO endpoint is one of two endpoints in the engaged-vs-heuristic contrast that runs through paper §3. The pairing:

B5-SFT-7B (engaged) reaches training-distribution accuracy by training the model to consult the procedure step-by-step.
v7-GRPO-14B (heuristic) reaches overlapping training-distribution accuracy through pattern matching, with notably different generalization behavior on out-of-distribution variants.

Scale note: v7-GRPO is 14B because no 7B v7-GRPO checkpoint was trained. The companion engaged endpoint (B5-SFT-7B) is 7B, and the published §3 overlapping-ceilings contrast uses this specific checkpoint pair. For an apples-to-apples scale comparison, train a same-scale variant; this checkpoint represents the heuristic-training-paradigm endpoint at the scale at which that paradigm was investigated.

Training data

Trained on the same canonical centralized procedural-reasoning dataset as B5-SFT-7B:

9,013 unique instances after deduplication
6 source cohorts (see B5-SFT-7B model card for full enumeration)
8 institutional sources for lab safety + FDA + CMS regulatory procedures
10 perturbation types per process

GRPO training uses the same input format as B5-SFT (procedure → scenario → question in a single user message). The reward is binary on the final classification token; no intermediate-reasoning scaffolding tokens are emitted or rewarded.
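
The reward described above can be sketched as follows. This is a minimal illustration, not the training code: the label strings, the function names `binary_reward` and `group_advantages`, and the last-token answer parsing are all assumptions made for the example. The normalization shown is the usual GRPO group-relative form (reward minus group mean, divided by group standard deviation), which this card does not spell out.

```python
from statistics import mean, pstdev

# Hypothetical label set; the card does not specify the actual answer tokens.
LABELS = {"COMPLIANT", "NON_COMPLIANT"}

def binary_reward(completion: str, gold: str) -> float:
    """Return 1.0 if the final token of the completion matches the gold label, else 0.0."""
    tokens = completion.strip().split()
    if not tokens:
        return 0.0
    # Only the last whitespace-delimited token is scored; earlier text earns nothing.
    final = tokens[-1].strip(".,:;!").upper()
    return 1.0 if final in LABELS and final == gold.upper() else 0.0

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Centre each reward on the group mean and scale by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of sampled completions for the same prompt (toy strings, not model output).
completions = ["COMPLIANT", "The step was skipped. NON_COMPLIANT", "COMPLIANT", "unsure"]
rewards = [binary_reward(c, "COMPLIANT") for c in completions]
advantages = group_advantages(rewards)
```

Under this formulation, a group in which every completion receives the same reward yields all-zero advantages, so only prompts with mixed outcomes contribute a learning signal.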

Intended use

Research:

Procedural-compliance reasoning evaluation as the heuristic-mode endpoint
Cross-checkpoint mech-interp comparing engaged vs heuristic training paradigms
Out-of-distribution behavioral evaluation

NOT intended for:

Deployment in production compliance auditing without independent verification
High-stakes decisions where binary classification is taken as authoritative

Limitations

Same dataset-level disclosures as B5-SFT-7B apply:

Cornell adversarial_surface_compliant: 72.3% accept rate during dataset construction
FDA prerequisite_omission: 65.2% accept rate during dataset construction
CMS internal-determination perturbation footnote (corpus characteristic, not a quality flag)

Additional behavior-specific limitations for the heuristic-mode endpoint:

The model reaches high training-distribution accuracy but generalizes inconsistently on out-of-distribution variants. Performance on previously unseen lexical or surface variations of compliance scenarios can drop substantially.
The model tends to emit the final classification token immediately, with little intermediate reasoning. This is efficient at inference time but provides less diagnostic information when debugging incorrect predictions.
Performance is sensitive to variations in question phrasing and procedure ordering. The model was trained to expect the procedure → scenario → question structure exactly; substantial deviation may degrade behavior.
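
One way to quantify the generalization gap described above is to score predictions separately per evaluation split and compare each split against the in-distribution one. The following is a minimal sketch, assuming each record carries a hypothetical `split` tag, a gold label, and the model's prediction. The record layout and split names are illustrative and are not part of the released dataset schema.

```python
from collections import defaultdict

def accuracy_by_split(records):
    """Map each split name to the fraction of records whose prediction equals the gold label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["split"]] += 1
        hits[rec["split"]] += int(rec["pred"] == rec["gold"])
    return {split: hits[split] / totals[split] for split in totals}

def gap_to_reference(acc, reference="in_distribution"):
    """Accuracy drop of every other split relative to the reference split."""
    return {s: acc[reference] - a for s, a in acc.items() if s != reference}

# Toy records for illustration only; these are not evaluation results.
records = [
    {"split": "in_distribution", "gold": "A", "pred": "A"},
    {"split": "in_distribution", "gold": "B", "pred": "B"},
    {"split": "lexical_variant", "gold": "A", "pred": "B"},
    {"split": "lexical_variant", "gold": "B", "pred": "B"},
]
acc = accuracy_by_split(records)
gaps = gap_to_reference(acc)
```

Reporting the per-split gap rather than a single pooled accuracy keeps a strong in-distribution score from masking weaker behavior on reworded or reordered inputs.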

Evaluation results

Performance metrics from the cleaned canonical dataset:

Metric	Value
Training-distribution accuracy (v7-GRPO recipe headline)	~0.83
Yang Stage-3 retrieval head L19.H17 δ on cleaned procedural (base 7B context)	0.886
§3 intervention cross-checkpoint contrast at L=20 H=17 (specific to v7-GRPO 7B-context)	available in §3 cross-checkpoint tables

For the §3 14B-scale cross-checkpoint analysis (Yang re-anchor d.4 in progress), refer to the procedural-reasoning paper's §3 cross-checkpoint table for the most current numbers.

Prompt format

Use the standard Qwen2.5 chat template via tokenizer.apply_chat_template. The tokenizer template is identical to the one used for B5-SFT-7B; the only difference is the scale of the underlying base model (14B vs 7B).

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in bfloat16 on the GPU, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = PeftModel.from_pretrained(base, "kennethp97/v7-grpo-14b")
model.eval()

# Single user message in procedure -> scenario -> question order.
messages = [{"role": "user", "content": "Procedure: ...\n\nScenario: ...\n\nQuestion: ..."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; print only the newly generated tokens, not the prompt.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
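
Because the adapter expects the procedure → scenario → question order in a single user message, it can help to build that message through one helper so the structure is never varied by accident. The following is a sketch that assumes the field labels and blank-line separators shown in the example above. `build_messages` is a hypothetical helper name, not part of the released code.

```python
def build_messages(procedure: str, scenario: str, question: str) -> list[dict]:
    """Assemble the single user message in procedure -> scenario -> question order."""
    # Dict insertion order fixes the field order in the final message.
    parts = {"Procedure": procedure, "Scenario": scenario, "Question": question}
    for name, text in parts.items():
        if not text or not text.strip():
            raise ValueError(f"{name} must be non-empty")
    content = "\n\n".join(f"{name}: {text.strip()}" for name, text in parts.items())
    return [{"role": "user", "content": content}]

# Toy inputs for illustration only.
messages = build_messages(
    "Wear goggles before opening the fume hood.",
    "The technician opened the fume hood, then put on goggles.",
    "Was the procedure followed?",
)
```

The returned list can be passed directly to tok.apply_chat_template in place of the hand-written messages list.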

License

Apache 2.0 for this LoRA adapter. The Qwen2.5-14B-Instruct base model is released under its own license; see the official Qwen repository for terms.

Citation

@unpublished{paulsen2026_procedural_reasoning,
  author = {Paulsen, Kenneth and collaborators},
  title  = {Procedural-reasoning compliance: engaged vs heuristic modes},
  year   = {2026},
  note   = {Manuscript in preparation; citation form will be finalized at submission.}
}
