B5-SFT-7B: LoRA adapter for procedural-compliance reasoning (engaged mode)
LoRA adapter trained via reasoning-scaffolded SFT on Qwen/Qwen2.5-7B-Instruct,
representing the engaged-mode endpoint of the procedural-reasoning
investigation.
Model description
The B5-SFT recipe is the procedural-compliance adaptation of the
"Reasoning Scaffolding" trace-distillation method. Token-level
next-token-prediction loss is taken over reasoning-scaffolded teacher traces produced by
DeepSeek-R1 (671B, queried via OpenRouter); inline [SIG=...] scaffold-token
annotations mark the reasoning stages. The model is trained to internally
consult the procedure step-by-step before emitting the final compliance
classification.
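As a rough illustration of working with the inline scaffold annotations, the sketch below pulls stage labels out of a trace and returns the text with the annotations removed. It assumes each annotation has the form `[SIG=<label>]` with no nested brackets; the labels `read` and `verify` and the helper name `split_scaffold` are hypothetical, since the card does not list the actual stage vocabulary.

```python
import re

# Assumption: annotations look like "[SIG=<label>]" and contain no nested brackets.
SIG_PATTERN = re.compile(r"\[SIG=([^\]]*)\]")


def split_scaffold(trace: str) -> tuple[list[str], str]:
    """Return (stage labels in order of appearance, trace text without annotations)."""
    stages = SIG_PATTERN.findall(trace)
    cleaned = SIG_PATTERN.sub("", trace)
    # Collapse runs of spaces left behind where annotations were removed.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned).strip()
    return stages, cleaned


# Hypothetical trace; the stage labels are illustrative only.
example = "[SIG=read] Step 2 requires goggles. [SIG=verify] The scenario omits them."
stages, cleaned = split_scaffold(example)
print(stages)
print(cleaned)
```

Whether the adapter emits these annotations at inference time is not stated here, so treat this as a utility for inspecting training-style traces.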
Of the multiple interventions tested in the parent paper, B5-SFT is the engaged-mode endpoint: it routes compliance reasoning through procedure-aware step verification. The heuristic-mode endpoint (v7-GRPO-14B), by contrast, reaches overlapping training-distribution accuracy through pattern matching on surface features rather than step-by-step verification.
Training data
Trained on the canonical centralized procedural-reasoning dataset:
- 9,013 unique instances after deduplication
- 6 source cohorts: scaled_86 (1,522), af7lab (813), fda (4,882), af4 (245), v7_200 (154), mi_44 (34)
- 8 institutional sources for lab safety procedures (Duke, Miami, UCF, UW, Princeton, Cornell, Stanford, MIT) plus FDA Investigations Operations Manual and CMS Medicare manuals
- 10 perturbation types per process: baseline_compliant, baseline_noncompliant, lexical_paraphrase, step_reorder_invalid, prerequisite_omission, hierarchy_violation, exception_trigger, distractor_injection, adversarial_surface_compliant, adversarial_surface_noncompliant
Scope:
- Laboratory safety procedures (chemical spills, PPE, hazardous-material handling, lab closeouts)
- FDA inspection procedures (claim filing, credentialing, billing-after-denial, vehicle accident reporting)
- CMS Medicare billing procedures (claim processing, modifier handling, pre/post-billing actions, denial cascade handling)
Held-out evaluation splits (v7_200 + af4 + mi_44) are deduplicated against the training portion to ensure no instance-level leakage.
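One way to check for instance-level leakage of this kind is to hash a normalized form of each instance and look for held-out keys that also occur in the training portion. The sketch below is an assumption about how such a check could work, not the pipeline actually used for this dataset; the normalization (whitespace collapsing plus lowercasing) and the helper names are hypothetical.

```python
import hashlib
import re


def instance_key(text: str) -> str:
    """Hash a whitespace-collapsed, lowercased copy of the instance text."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_instances(train: list[str], heldout: list[str]) -> list[str]:
    """Return held-out instances whose key also appears in the training portion."""
    train_keys = {instance_key(t) for t in train}
    return [h for h in heldout if instance_key(h) in train_keys]


# Hypothetical toy data; real instances would be full procedure/scenario texts.
train = ["Wear goggles.  Then open the valve.", "Close the fume hood."]
heldout = ["wear goggles. then open the valve.", "Report the spill."]
leaked = leaked_instances(train, heldout)
print(leaked)
```

An empty result from `leaked_instances` corresponds to the no-leakage condition described above, under this particular definition of "same instance".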
Intended use
Research:
- Procedural-compliance reasoning evaluation
- Mechanistic interpretability of LLM reasoning modes
- Out-of-distribution behavioral evaluation on regulatory-procedure tasks
NOT intended for:
- Deployment in production compliance auditing without independent verification
- Domain-specific use outside the lab safety / FDA / CMS scope without additional validation
- High-stakes decisions where the model's binary classification is taken as authoritative
Limitations
Two admit-verified-subset disclosures from §5 methodology:
- Cornell adversarial_surface_compliant: 72.3% accept rate during dataset construction. This is the hardest-by-construction cell, where surface-style perturbations produce ambiguous compliance verdicts. Models trained on this data inherit some of that ambiguity.
- FDA prerequisite_omission: 65.2% accept rate during dataset construction. The post-AF.1-safety-guard-filter prereq cell is the universal-difficulty cell across cohorts; the dataset admits the verified subset but reviewers may find a third of the rejected instances borderline.
Plus methodology footnote: regulatory-procedure corpora (CMS in particular) sometimes encode internal cognitive determinations and system-side actions as numbered steps; the omission-based perturbation pattern can't render these as observable narrative violations. This is a corpus-vs-perturbation-family characteristic of the CMS source data, not a quality flag on this model.
Also expect:
- Verbose responses with explicit step-by-step reasoning, especially on longer procedures. Discard the reasoning prefix downstream if you only want the binary classification.
- Performance degradation when the procedure / scenario / question order in the user message is restructured significantly. The model was trained to expect a specific input shape (procedure → scenario → question in one user message).
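If only the binary classification is needed, a post-processing step can discard the reasoning prefix. The sketch below takes the last verdict word in the response as the final answer. This card does not specify the exact wording of the model's final classification, so the pattern (`compliant` / `non-compliant` and close variants) and the helper name `extract_verdict` are assumptions to adapt to the outputs you actually observe.

```python
import re
from typing import Optional

# Assumption: the final verdict is worded as "compliant" or "non-compliant"
# (also matching "noncompliant" and "non compliant"), case-insensitively.
_VERDICT = re.compile(r"\b(non[-\s]?compliant|compliant)\b", re.IGNORECASE)


def extract_verdict(response: str) -> Optional[str]:
    """Return 'compliant' or 'noncompliant' from the last verdict word, else None."""
    matches = _VERDICT.findall(response)
    if not matches:
        return None
    last = matches[-1].lower()
    return "noncompliant" if last.startswith("non") else "compliant"


# Hypothetical response text for illustration.
response = (
    "Step 1 is compliant. Step 3 skips the prerequisite. "
    "Final answer: non-compliant."
)
verdict = extract_verdict(response)
print(verdict)
```

This heuristic does not handle negations such as "not compliant", so spot-check it against real generations before relying on it.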
Evaluation results
Performance metrics from the cleaned canonical dataset, where available:
| Metric | Value |
|---|---|
| Training-distribution accuracy (B5-SFT recipe headline) | ~0.68 |
| Var-binding holdout, ord-swap on flipping subset | 0.545 |
| Yang Stage-3 retrieval head L19.H17 δ on cleaned procedural | 0.886 |
Additional evaluation numbers from §3 intervention contrast against v7-GRPO and §1 behavioral failure-pattern grid become available as the AF.5 re-runs land; refer to the procedural-reasoning paper's headline tables for the current state.
Prompt format
Use the standard Qwen2.5 chat template via tokenizer.apply_chat_template.
The published evaluation puts the procedure → scenario → question in a
single user message (no system message). For apples-to-apples comparison
with published results, match this format. See prompt_format.md in the
companion package for verbatim examples.
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in bfloat16 on the GPU, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(base, "kennethp97/b5-sft-7b")
model.eval()

# Single user message, no system message: procedure, then scenario, then question.
messages = [{"role": "user", "content": "Procedure: ...\n\nScenario: ...\n\nQuestion: ..."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; decode only the newly generated tokens, not the prompt.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
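To keep the input shape consistent across calls, the message can be assembled by a small helper. The sketch below mirrors the `Procedure:` / `Scenario:` / `Question:` layout of the placeholder string in the snippet above; the helper name `build_user_message` and the sample inputs are hypothetical, and the verbatim format in prompt_format.md should take precedence if it differs.

```python
def build_user_message(procedure: str, scenario: str, question: str) -> list[dict]:
    """Build one user message in procedure -> scenario -> question order."""
    # Assumption: section labels and blank-line separators follow the
    # placeholder string shown in the card's inference snippet.
    content = (
        f"Procedure: {procedure.strip()}\n\n"
        f"Scenario: {scenario.strip()}\n\n"
        f"Question: {question.strip()}"
    )
    # No system message, matching the published evaluation setup.
    return [{"role": "user", "content": content}]


# Hypothetical inputs for illustration.
messages = build_user_message(
    procedure="1. Put on goggles. 2. Open the valve.",
    scenario="The technician opened the valve without goggles.",
    question="Did the technician comply with the procedure?",
)
print(messages[0]["content"])
```

The returned list can be passed directly to `tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.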
License
Apache 2.0 for this LoRA adapter. The Qwen2.5-7B-Instruct base model is licensed under its own license; see the official Qwen repository for terms.
Citation
@unpublished{paulsen2026_procedural_reasoning,
author = {Paulsen, Kenneth and collaborators},
title = {Procedural-reasoning compliance: engaged vs heuristic modes},
year = {2026},
note = {Manuscript in preparation; citation form will be finalized at submission.}
}