v7-GRPO-14B — LoRA adapter for procedural-compliance reasoning (heuristic mode)
LoRA adapter trained via endpoint-targeted GRPO on Qwen/Qwen2.5-14B-Instruct,
representing the heuristic-mode endpoint of the procedural-reasoning
investigation.
Model description
v7-GRPO was trained with endpoint-targeted GRPO: the only reward signal is the correctness of the final compliance answer, with no intermediate scaffolding. The model learned to reach high training-distribution accuracy through pattern matching on surface features rather than step-by-step procedural verification.
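To make the endpoint-only reward concrete, here is a minimal sketch of how a binary final-answer reward can feed the group-relative advantage that gives GRPO its name. It is not the training code for this adapter. The label strings, the helper names, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def endpoint_reward(predicted_label: str, gold_label: str) -> float:
    """Binary reward on the final answer only: 1.0 if it matches, else 0.0."""
    same = predicted_label.strip().lower() == gold_label.strip().lower()
    return 1.0 if same else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled completions for a single prompt.
# The label vocabulary here is illustrative, not taken from the dataset.
completions = ["compliant", "non-compliant", "compliant", "compliant"]
rewards = [endpoint_reward(c, "compliant") for c in completions]
advantages = group_relative_advantages(rewards)
```

When every completion in a group earns the same reward, all advantages come out as zero, so that group contributes no learning signal. Nothing in this reward inspects intermediate reasoning, which is consistent with the heuristic behavior described above.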
The v7-GRPO endpoint is one of two endpoints in the engaged-vs-heuristic contrast that runs through paper §3. The pairing:
- B5-SFT-7B (engaged) reaches training-distribution accuracy by training the model to consult the procedure step-by-step.
- v7-GRPO-14B (heuristic) reaches overlapping training-distribution accuracy through pattern matching, with notably different generalization behavior on out-of-distribution variants.
Scale note: v7-GRPO is 14B because no 7B v7-GRPO checkpoint was trained. The companion engaged endpoint (B5-SFT-7B) is 7B, and the published §3 overlapping-ceilings contrast uses this specific checkpoint pair. For an apples-to-apples scale comparison, train a same-scale variant; this checkpoint represents the heuristic-training-paradigm endpoint at the scale at which that paradigm was investigated.
Training data
Trained on the same canonical centralized procedural-reasoning dataset as B5-SFT-7B:
- 9,013 unique instances after deduplication
- 6 source cohorts (see B5-SFT-7B model card for full enumeration)
- 8 institutional sources for lab safety + FDA + CMS regulatory procedures
- 10 perturbation types per process
GRPO training uses the same input format as B5-SFT (procedure → scenario → question in a single user message). The reward is binary on the final classification token; no intermediate-reasoning scaffolding tokens are emitted or rewarded.
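The single-message input format described above can be assembled with a small helper like the one below. This is a sketch: the field labels and blank-line separators follow the placeholder string in this card's prompt-format example, the function name `build_messages` is hypothetical, and the exact whitespace used during training is not confirmed here.

```python
def build_messages(procedure: str, scenario: str, question: str) -> list[dict]:
    """Pack procedure, scenario, and question into one user message, in that order."""
    content = (
        f"Procedure: {procedure.strip()}\n\n"
        f"Scenario: {scenario.strip()}\n\n"
        f"Question: {question.strip()}"
    )
    return [{"role": "user", "content": content}]


# Placeholder inputs for illustration only; not drawn from the dataset.
messages = build_messages(
    "1. Put on goggles. 2. Open the fume hood sash.",
    "The technician opened the sash before putting on goggles.",
    "Was the procedure followed?",
)
```

The returned list can be passed directly to `tokenizer.apply_chat_template`. Keeping the three fields in a fixed order matters because the model was trained on exactly this ordering.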
Intended use
Research:
- Procedural-compliance reasoning evaluation as the heuristic-mode endpoint
- Cross-checkpoint mechanistic-interpretability comparisons of engaged vs heuristic training paradigms
- Out-of-distribution behavioral evaluation
NOT intended for:
- Deployment in production compliance auditing without independent verification
- High-stakes decisions where binary classification is taken as authoritative
Limitations
Same dataset-level disclosures as B5-SFT-7B apply:
- Cornell adversarial_surface_compliant: 72.3% accept rate during dataset construction
- FDA prerequisite_omission: 65.2% accept rate during dataset construction
- CMS internal-determination perturbation footnote (corpus characteristic, not a quality flag)
Additional behavior-specific limitations for the heuristic-mode endpoint:
- The model reaches high training-distribution accuracy but generalizes inconsistently on out-of-distribution variants. Performance on previously unseen lexical or surface variations of compliance scenarios can drop substantially.
- The model tends to emit the final classification token immediately, with little intermediate reasoning. This is efficient at inference time but provides less diagnostic information when debugging incorrect predictions.
- Performance is sensitive to variations in question phrasing and procedure ordering. The model was trained to expect the procedure → scenario → question structure exactly; substantial deviation may degrade behavior.
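One way to check these limitations on your own data is to break accuracy down by variant type and compare each variant with an in-distribution baseline. The sketch below assumes a hypothetical record layout (`variant`, `predicted`, `gold`) and a hypothetical baseline name. The toy records are placeholders that only exercise the code; they are not results for this model.

```python
from collections import defaultdict


def accuracy_by_variant(records):
    """Return {variant: accuracy} for dicts with 'variant', 'predicted', 'gold' keys."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["variant"]] += 1
        hits[rec["variant"]] += int(rec["predicted"] == rec["gold"])
    return {variant: hits[variant] / totals[variant] for variant in totals}


def gaps_vs_baseline(acc, baseline="in_distribution"):
    """Accuracy drop of every other variant relative to the baseline variant."""
    base = acc[baseline]
    return {variant: base - a for variant, a in acc.items() if variant != baseline}


# Toy placeholder records; labels and variant names are illustrative only.
records = [
    {"variant": "in_distribution", "predicted": "yes", "gold": "yes"},
    {"variant": "in_distribution", "predicted": "no", "gold": "no"},
    {"variant": "reworded", "predicted": "yes", "gold": "yes"},
    {"variant": "reworded", "predicted": "yes", "gold": "no"},
]
acc = accuracy_by_variant(records)
gaps = gaps_vs_baseline(acc)
```

A large positive gap for a variant suggests the model is relying on surface features that the variant disturbs.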
Evaluation results
Performance metrics from the cleaned canonical dataset:
| Metric | Value |
|---|---|
| Training-distribution accuracy (v7-GRPO recipe headline) | ~0.83 |
| Yang Stage-3 retrieval head L19.H17 δ on cleaned procedural (base 7B context) | 0.886 |
| §3 intervention cross-checkpoint contrast at L=20 H=17 (specific to v7-GRPO 7B-context) | available in §3 cross-checkpoint tables |
For the §3 14B-scale cross-checkpoint analysis (Yang re-anchor d.4 in progress), refer to the procedural-reasoning paper's §3 cross-checkpoint table for the most current numbers.
Prompt format
Use the standard Qwen2.5 chat template via tokenizer.apply_chat_template.
The tokenizer template is identical to the one used for B5-SFT-7B; the only
difference is the scale of the underlying base model (14B vs 7B).
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = PeftModel.from_pretrained(base, "kennethp97/v7-grpo-14b")
model.eval()

messages = [{"role": "user", "content": "Procedure: ...\n\nScenario: ...\n\nQuestion: ..."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
License
Apache 2.0 for this LoRA adapter. The Qwen2.5-14B-Instruct base model is distributed under its own license; see the official Qwen repository for terms.
Citation
@unpublished{paulsen2026_procedural_reasoning,
author = {Paulsen, Kenneth and collaborators},
title = {Procedural-reasoning compliance: engaged vs heuristic modes},
year = {2026},
note = {Manuscript in preparation; citation form will be finalized at submission.}
}