Phase-2 DPO flip-only -- LoRA adapter on Qwen2.5-1.5B-Instruct
A LoRA adapter on Qwen/Qwen2.5-1.5B-Instruct, trained with
reasoning-aware Direct Preference Optimization (DPO) on flip pairs
of a procedural-compliance corpus.
What it does
The adapter improves the base model on the procedural-compliance task: given a
procedure and a scenario, decide whether the scenario is compliant or
non-compliant with the procedure, and produce structured reasoning before the
verdict.
Each training preference pair is:
- chosen -- an EDGE CHECKS ... FINAL ANSWER: completion whose reasoning matches this scenario and ends in the gold verdict;
- rejected -- the partner half's reasoning (a different scenario in the same flip pair) ending in the opposite verdict.
So the model is optimised to prefer reasoning that matches the prompt's scenario over reasoning copied from a different scenario. Anchor pairs (both halves share a verdict) were not used for training; anchor accuracy is an eval-only metric.
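The pairing rule described above can be sketched in a few lines. This is an illustrative reconstruction, not the released training code: the FlipHalf class, its field names, and build_flip_pairs are hypothetical, the prompt / chosen / rejected keys follow the common DPO dataset layout, and emitting one row per half is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FlipHalf:
    """One half of a flip pair (hypothetical schema)."""
    prompt: str      # procedure + scenario text shown to the model
    reasoning: str   # EDGE CHECKS body written for this scenario
    verdict: str     # "compliant" or "non-compliant"

    def completion(self) -> str:
        # Reasoning first, verdict last, as in the card's output format.
        return f"{self.reasoning}\nFINAL ANSWER: {self.verdict}"


def build_flip_pairs(a: FlipHalf, b: FlipHalf) -> list:
    """Build preference rows: each half's own completion is chosen,
    and its partner's completion is rejected."""
    if a.verdict == b.verdict:
        # Anchor pairs share a verdict and are eval-only per the card.
        raise ValueError("anchor pair (shared verdict): not used for training")
    return [
        {"prompt": a.prompt, "chosen": a.completion(), "rejected": b.completion()},
        {"prompt": b.prompt, "chosen": b.completion(), "rejected": a.completion()},
    ]
```

Because the rejected text is a well-formed completion for the partner scenario, the only signal separating chosen from rejected is whether the reasoning fits the prompt's scenario.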
Headline eval (frozen 233-process held-out; 128 flip + 122 anchor pairs; greedy / T=0)
| regime | flip rate | anchor acc | plain acc |
|---|---|---|---|
| forced-verdict | 0.328 | 0.615 | 0.660 |
| free-form | 0.484 | 0.672 | 0.752 |
| base ref (free-form) | 0.219 | 0.467 | 0.576 |
This recipe fixes the free-form collapse of the earlier content-free DPO arm (which scored 0.250 free-form flip, close to the base model's 0.219) by training on genuine reasoning. It improves over base in both regimes. It does not clear the pre-registered absolute GO bar (>=0.65 flip and >=0.75 anchor), so treat it as a research checkpoint, not a deployment-grade classifier.
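To make the GO-bar comparison concrete, the sketch below checks the table's figures against the two stated thresholds and computes the free-form gain over the base reference. Only the numbers come from this card; the dictionary layout and the function names clears_go_bar and gain_over_base are illustrative, and reading the bar as requiring both thresholds at once is an assumption.

```python
# Thresholds and scores copied from the card; all names are illustrative.
GO_BAR = {"flip_rate": 0.65, "anchor_acc": 0.75}

RESULTS = {
    "forced-verdict": {"flip_rate": 0.328, "anchor_acc": 0.615, "plain_acc": 0.660},
    "free-form": {"flip_rate": 0.484, "anchor_acc": 0.672, "plain_acc": 0.752},
    "base_free_form": {"flip_rate": 0.219, "anchor_acc": 0.467, "plain_acc": 0.576},
}


def clears_go_bar(scores: dict) -> bool:
    """True only if every thresholded metric meets its minimum (assumed AND)."""
    return all(scores[name] >= minimum for name, minimum in GO_BAR.items())


def gain_over_base(metric: str) -> float:
    """Free-form adapter score minus the free-form base reference."""
    diff = RESULTS["free-form"][metric] - RESULTS["base_free_form"][metric]
    return round(diff, 3)


for regime in ("forced-verdict", "free-form"):
    print(regime, "clears GO bar:", clears_go_bar(RESULTS[regime]))
for metric in ("flip_rate", "anchor_acc", "plain_acc"):
    print(metric, "gain over base:", gain_over_base(metric))
```

Neither regime clears the bar, while all three free-form metrics sit above the base reference, which is the distinction the paragraph above draws.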
How to use
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-1.5B-Instruct"
ADAPTER = "kennethp97/dpo-flip-1p5b"

# Tokenizer comes from the base model; left padding suits decoder-only generation.
tok = AutoTokenizer.from_pretrained(BASE, use_fast=True)
tok.pad_token = tok.pad_token or tok.eos_token
tok.padding_side = "left"

# Load the base weights in bf16, then attach the LoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16,
                                            device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

# Fill in the two <...> placeholders; keep the output-format block unchanged.
USER = (
    "You are a process-structure compliance checker.\n"
    "Check edge-level constraints before final judgment.\n\n"
    "Process:\n<your procedure>\n\n"
    "Scenario:\n<your scenario>\n\n"
    "Output format:\nEDGE CHECKS:\n- VIOLATED - [edge]: [reason]\n"
    "- SATISFIED - [edge]: [reason]\nFINAL ANSWER: compliant|non-compliant\n"
)

# Wrap the instruction in the chat template and add the assistant turn header.
prompt = tok.apply_chat_template([{"role": "user", "content": USER}],
                                 tokenize=False, add_generation_prompt=True)

# Greedy decoding (do_sample=False) matches the reported eval setting.
out = model.generate(**tok(prompt, return_tensors="pt").to(model.device),
                     max_new_tokens=1024, do_sample=False,
                     pad_token_id=tok.eos_token_id)

# The decoded string contains the prompt followed by the model's completion.
print(tok.decode(out[0], skip_special_tokens=True))
For a worked side-by-side comparison against the base and against the
companion SFT adapter (kennethp97/sft-arm-a-1p5b), see the
combined eval notebook in the repository this adapter was released from.
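Generation returns free text, so downstream scoring needs the verdict and edge checks pulled back out. The following is a minimal parsing sketch, assuming the completion follows the output format shown in the prompt; parse_output and both regular expressions are illustrative and not part of the released code. It assumes the input is only the generated continuation, because the prompt itself contains the format template and would otherwise match.

```python
import re

# "non-compliant" is listed first so it is tried before the shorter "compliant".
VERDICT_RE = re.compile(r"FINAL ANSWER:\s*(non-compliant|compliant)\b", re.IGNORECASE)
# One edge line looks like: "- VIOLATED - [edge]: [reason]"
EDGE_RE = re.compile(r"^-\s*(VIOLATED|SATISFIED)\s*-\s*(.+?):\s*(.*)$", re.MULTILINE)


def parse_output(text: str):
    """Return (verdict, edges) from a generated completion.

    verdict is "compliant", "non-compliant", or None when no FINAL ANSWER
    line is present. edges is a list of (status, edge, reason) tuples.
    Pass only the generated continuation, not the prompt.
    """
    verdicts = VERDICT_RE.findall(text)
    # Use the last verdict in case the model restates the format earlier on.
    verdict = verdicts[-1].lower() if verdicts else None
    edges = [
        (status, edge.strip(), reason.strip())
        for status, edge, reason in EDGE_RE.findall(text)
    ]
    return verdict, edges


sample = (
    "EDGE CHECKS:\n"
    "- VIOLATED - review->approve: approval skipped\n"
    "- SATISFIED - draft->review: done in order\n"
    "FINAL ANSWER: non-compliant\n"
)
print(parse_output(sample))
```

Returning None for a missing verdict keeps malformed completions distinguishable from wrong ones, which matters when free-form and forced-verdict regimes are reported separately.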
Training summary
- Base: Qwen/Qwen2.5-1.5B-Instruct
- LoRA r=32 alpha=64 on q/k/v/o/gate/up/down, dropout 0.0
- DPO beta=0.1, lr 5e-6, 2 epochs, batch_size 2 x grad_accum 8, max_length 1024, gradient_checkpointing on
- Training set: 2,510 flip pairs (one chosen / rejected pair per row) from the train_registry v0.4.0 corpus
- ~80 minutes on a single RTX A6000 (bf16)
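The hyperparameters above imply a few derived quantities, sketched below with plain arithmetic. The key names mirror argument names commonly used by peft and trl, and the full module names are expanded from the q/k/v/o/gate/up/down shorthand; both are assumptions, not a copy of the actual training script. The step counts are approximate, since a trainer may round partial accumulation windows differently, and the scale assumes standard LoRA scaling (alpha / r).

```python
import math

# Values from the training summary; key names are assumed, not confirmed.
LORA = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.0,
    # Assumed expansion of the q/k/v/o/gate/up/down shorthand.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}
DPO = {
    "beta": 0.1,
    "learning_rate": 5e-6,
    "num_train_epochs": 2,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "max_length": 1024,
    "gradient_checkpointing": True,
    "bf16": True,
}
NUM_PAIRS = 2510   # rows in the training set
NUM_GPUS = 1       # single RTX A6000

# Pairs consumed per optimizer update.
effective_batch = (DPO["per_device_train_batch_size"]
                   * DPO["gradient_accumulation_steps"] * NUM_GPUS)
# Approximate optimizer updates, rounding the last partial window up.
steps_per_epoch = math.ceil(NUM_PAIRS / effective_batch)
total_steps = steps_per_epoch * DPO["num_train_epochs"]
# Standard LoRA multiplies the low-rank update by alpha / r.
lora_scale = LORA["lora_alpha"] / LORA["r"]

print(effective_batch, steps_per_epoch, total_steps, lora_scale)
# prints: 16 157 314 2.0
```

At roughly 314 updates in total, the run is short, which is consistent with the ~80 minute wall-clock figure on one GPU.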
Limitations
- Research checkpoint, not a production classifier. Below the pre-registered GO bar.
- Only flip pairs trained. Anchor pairs not in the DPO mix.
- Regime asymmetry. Free-form > forced; report regimes separately.
- Format sensitivity. Trained on the EDGE CHECKS ... FINAL ANSWER format above; deviation may degrade performance. Greedy decoding (T=0) matches the reported numbers.
License
Adapter: Apache-2.0. Base model: under the Qwen2.5-1.5B-Instruct license.