Phase-2 DPO flip-only -- LoRA adapter on Qwen2.5-1.5B-Instruct

A LoRA adapter on Qwen/Qwen2.5-1.5B-Instruct, trained with reasoning-aware Direct Preference Optimization (DPO) on flip pairs of a procedural-compliance corpus.

What it does

The adapter improves the base model on the procedural-compliance task: given a procedure and a scenario, decide whether the scenario is compliant or non-compliant with the procedure, and produce structured reasoning before the verdict.

Each training preference pair is:

chosen -- an EDGE CHECKS ... FINAL ANSWER: completion whose reasoning matches this scenario and ends in the gold verdict;
rejected -- the partner half's reasoning (a different scenario in the same flip pair) ending in the opposite verdict.

So the model is optimised to prefer reasoning that matches the prompt's scenario over reasoning copied from a different scenario. Anchor pairs (both halves share a verdict) were not used for training; anchor accuracy is an eval-only metric.
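
The pair construction described above can be sketched as follows. This is a minimal illustration, not the released data pipeline: the `Half` dataclass, the `make_row` helper, and the `prompt` / `chosen` / `rejected` keys are assumptions made for the example (the keys mirror a common preference-dataset layout), and the sample scenario text is invented.

```python
from dataclasses import dataclass


@dataclass
class Half:
    """One half of a flip pair (hypothetical container for this sketch)."""
    prompt: str     # procedure + scenario, rendered as the user message
    reasoning: str  # EDGE CHECKS block written for this half's scenario
    verdict: str    # gold label: "compliant" or "non-compliant"


def completion(reasoning: str, verdict: str) -> str:
    # Join a reasoning block and a verdict into the documented output format.
    return f"{reasoning.rstrip()}\nFINAL ANSWER: {verdict}"


def make_row(own: Half, partner: Half) -> dict:
    """Build one preference row for `own`, using `partner` as the rejected side."""
    if own.verdict == partner.verdict:
        # Anchor pairs share a verdict and were not used for training.
        raise ValueError("anchor pair: both halves share a verdict")
    return {
        "prompt": own.prompt,
        # Reasoning that matches this scenario, ending in the gold verdict.
        "chosen": completion(own.reasoning, own.verdict),
        # Partner's reasoning (different scenario), ending in the opposite verdict.
        "rejected": completion(partner.reasoning, partner.verdict),
    }


a = Half("Process: ...\nScenario: A", "EDGE CHECKS:\n- SATISFIED - x -> y: ok", "compliant")
b = Half("Process: ...\nScenario: B", "EDGE CHECKS:\n- VIOLATED - x -> y: skipped", "non-compliant")
row = make_row(a, b)
```

Because the rejected completion is well-formed and differs from the chosen one in which scenario its reasoning describes, the preference signal cannot be satisfied by output format alone.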

Headline eval (frozen 233-process held-out; 128 flip + 122 anchor pairs; greedy / T=0)

regime	flip rate	anchor acc	plain acc
forced-verdict	0.328	0.615	0.660
free-form	0.484	0.672	0.752
base ref (free-form)	0.219	0.467	0.576

This recipe fixes the free-form collapse of the earlier content-free DPO arm (which scored a 0.250 free-form flip rate, close to the base model's 0.219) by training on genuine reasoning. It improves over base in both regimes. It does not clear the pre-registered absolute GO bar (>=0.65 flip rate and >=0.75 anchor accuracy), so treat it as a research checkpoint, not a deployment-grade classifier.
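
The comparison against the GO bar and the base reference is simple arithmetic on the table above. A small sketch, with the numbers copied from the table and the helper names (`clears_go_bar`, `delta_vs_base`) made up for illustration:

```python
# Thresholds from the pre-registered GO bar quoted above.
GO_BAR = {"flip_rate": 0.65, "anchor_acc": 0.75}

# Rows copied from the headline eval table.
RESULTS = {
    "forced-verdict": {"flip_rate": 0.328, "anchor_acc": 0.615, "plain_acc": 0.660},
    "free-form": {"flip_rate": 0.484, "anchor_acc": 0.672, "plain_acc": 0.752},
    "base": {"flip_rate": 0.219, "anchor_acc": 0.467, "plain_acc": 0.576},
}


def clears_go_bar(metrics: dict) -> bool:
    # Every thresholded metric must meet or exceed its bar.
    return all(metrics[name] >= bar for name, bar in GO_BAR.items())


def delta_vs_base(regime: str) -> dict:
    # Absolute improvement over the base reference row, rounded to 3 decimals.
    base = RESULTS["base"]
    return {k: round(v - base[k], 3) for k, v in RESULTS[regime].items()}


go = {regime: clears_go_bar(m) for regime, m in RESULTS.items()}
free_form_gain = delta_vs_base("free-form")
```

Under these numbers no row clears the bar, and the free-form regime gains 0.265 flip rate, 0.205 anchor accuracy, and 0.176 plain accuracy over the base reference.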

How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-1.5B-Instruct"
ADAPTER = "kennethp97/dpo-flip-1p5b"

tok = AutoTokenizer.from_pretrained(BASE, use_fast=True)
tok.pad_token = tok.pad_token or tok.eos_token
tok.padding_side = "left"

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16,
                                            device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

USER = (
    "You are a process-structure compliance checker.\n"
    "Check edge-level constraints before final judgment.\n\n"
    "Process:\n<your procedure>\n\n"
    "Scenario:\n<your scenario>\n\n"
    "Output format:\nEDGE CHECKS:\n- VIOLATED - [edge]: [reason]\n"
    "- SATISFIED - [edge]: [reason]\nFINAL ANSWER: compliant|non-compliant\n"
)
prompt = tok.apply_chat_template([{"role": "user", "content": USER}],
                                 tokenize=False, add_generation_prompt=True)
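inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024, do_sample=False,
                     pad_token_id=tok.eos_token_id)
# Decode only the newly generated tokens, not the echoed prompt.
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))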

For a worked side-by-side comparison against the base and against the companion SFT adapter (kennethp97/sft-arm-a-1p5b), see the combined eval notebook in the repository this adapter was released from.
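
To turn a generated completion into a label, the EDGE CHECKS ... FINAL ANSWER format can be parsed with a couple of regular expressions. The sketch below is one possible parser, not part of the released code; `parse_completion` and its return shape are assumptions. Pass it only the generated text: the prompt itself contains the format template, which would otherwise be picked up as edge lines.

```python
import re

# Verdict line; "non-compliant" is listed first so it is tried before "compliant".
VERDICT_RE = re.compile(r"FINAL ANSWER:\s*(non-compliant|compliant)\b", re.IGNORECASE)

# Edge lines of the form "- VIOLATED - <edge>: <reason>" (one per line).
EDGE_RE = re.compile(
    r"^[ \t]*-[ \t]*(VIOLATED|SATISFIED)[ \t]*-[ \t]*(.+?):[ \t]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_completion(text: str) -> dict:
    """Extract the final verdict and the per-edge checks from a completion."""
    verdicts = VERDICT_RE.findall(text)
    edges = [
        (status.upper(), edge.strip(), reason.strip())
        for status, edge, reason in EDGE_RE.findall(text)
    ]
    return {
        # Use the last verdict in case the model restates it; None if absent.
        "verdict": verdicts[-1].lower() if verdicts else None,
        "edges": edges,
    }


sample = (
    "EDGE CHECKS:\n"
    "- SATISFIED - intake -> review: review follows intake\n"
    "- VIOLATED - review -> approve: approval precedes review\n"
    "FINAL ANSWER: non-compliant\n"
)
parsed = parse_completion(sample)
```

Returning `None` for a missing verdict keeps malformed completions distinguishable from either label, which matters when scoring the free-form regime.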

Training summary

Base: Qwen/Qwen2.5-1.5B-Instruct
LoRA r=32 alpha=64 on q/k/v/o/gate/up/down, dropout 0.0
DPO beta=0.1, lr 5e-6, 2 epochs, batch_size 2 x grad_accum 8, max_length 1024, gradient_checkpointing on
Training set: 2,510 flip pairs (one chosen / rejected pair per row) from the train_registry v0.4.0 corpus
~80 minutes on a single RTX A6000 (bf16)
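
The hyperparameters above imply an effective batch size and an approximate optimizer step count. The sketch below is a back-of-envelope estimate that assumes one device (per the single-GPU note) and one preference row per flip pair; the exact count depends on how the trainer handles the last partial batch.

```python
import math

PAIRS = 2510          # training rows (flip pairs)
PER_DEVICE_BATCH = 2  # batch_size
GRAD_ACCUM = 8        # gradient accumulation steps
EPOCHS = 2
NUM_DEVICES = 1       # assumption: single RTX A6000

# Rows contributing to each optimizer update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES

# Estimated optimizer updates, rounding the final partial batch up.
steps_per_epoch = math.ceil(PAIRS / effective_batch)
total_steps = steps_per_epoch * EPOCHS

print(effective_batch, steps_per_epoch, total_steps)  # 16 157 314
```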

Limitations

Research checkpoint, not a production classifier. Below the pre-registered GO bar.
Only flip pairs trained. Anchor pairs not in the DPO mix.
Regime asymmetry. Free-form scores higher than forced-verdict on all three reported metrics; report the two regimes separately.
Format sensitivity. Trained on the EDGE CHECKS ... FINAL ANSWER format above; deviation may degrade performance. Greedy (T=0) matches the reported numbers.

License

Adapter: Apache-2.0. Base model: under the Qwen2.5-1.5B-Instruct license.
