Prosify-Qwen-1.5B-LoRA (v1)

A LoRA adapter for Qwen/Qwen2.5-1.5B-Instruct trained with Direct Preference Optimization (DPO) on the FormatBench dataset to correct the systematic over-formatting bias of contemporary instruction-tuned LLMs.

Part of the Prosify project, grouped with the dataset in the Prosify Hugging Face Collection.

What this model does

Contemporary instruction-tuned LLMs (including Qwen 2.5) over-format their responses by default, producing bulleted lists, bold section headers, and templated structures even when flowing prose would serve the reader better. This adapter nudges the base model toward prose in contexts where prose is appropriate, while preserving its ability to use structure where structure genuinely helps.

Quick start

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

BASE = "Qwen/Qwen2.5-1.5B-Instruct"
ADAPTER = "krishy-d/prosify_qwen_1.5b_lora"

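# Load the tokenizer and the base model in bfloat16, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, dtype=torch.bfloat16, device_map="auto"  # older transformers releases expect torch_dtype here
)
model = PeftModel.from_pretrained(base_model, ADAPTER)

# Build a chat-formatted prompt and move the tokenized inputs to the model's device.
messages = [{"role": "user", "content": "Write me an email to my manager about WFH tomorrow."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False,
                            pad_token_id=tokenizer.pad_token_id)
# Slice off the prompt tokens so that only the newly generated reply is printed.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))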

Training details

| Setting | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Training method | DPO (Direct Preference Optimization) |
| Adapter | LoRA |
| Training data | FormatBench train split (~440 examples) |
| Validation | FormatBench val split (~50 examples) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Learning rate | 5e-5 |
| Epochs | 1 |
| β (KL leash) | 0.1 |
| Effective batch size | 4 |
| Precision | bfloat16 |
| Hardware | Kaggle T4 GPU (free tier) |

Full training notebook: notebooks/dpo_02_train.ipynb.
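For readers unfamiliar with the role of β, the following is a minimal sketch of the standard per-pair DPO objective in plain Python. It is not taken from the training notebook: the function name `dpo_loss` and its arguments are illustrative, and the inputs are assumed to be summed sequence log-probabilities of the chosen and rejected responses under the trained policy and the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    Hypothetical helper for illustration; each argument is the summed
    log-probability of a full response under the policy or reference model.
    """
    # How far the policy has moved from the reference on each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Before any training the policy equals the reference, so the margin is zero.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))            # log(2)
# The same shift in log-probabilities is weighted more heavily with a larger beta.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0, beta=0.1))
print(dpo_loss(-8.0, -12.0, -10.0, -10.0, beta=0.3))
```

In this formulation β scales the log-ratio margin, so a larger β penalizes deviation from the reference model more strongly. That is the sense in which the table calls it a "KL leash".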

Evaluation

Structural metrics are computed on the FormatBench held-out test split (49 examples, gold = prose) and the adversarial held-out set (40 examples, gold = structure).

Main test split (gold = prose)

| Metric | Base model | This model | Gold response |
| --- | --- | --- | --- |
| Bullets per response | 2.16 | 0.53 | 0.00 |
| Headers per response | 0.59 | 0.00 | 0.00 |

The trained adapter reduces bullets per response from 2.16 to 0.53 (about 75%) and eliminates markdown headers (0.59 to 0.00) on prose-appropriate contexts.
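The exact metric definitions live in the evaluation notebook. As a rough sketch of how "bullets per response" and "headers per response" could be counted, the code below uses two regular expressions. The patterns, the choice to count numbered list items as bullets, and the helper names `count_structure` and `mean_per_response` are all assumptions made for illustration, not the notebook's implementation.

```python
import re

# Assumed definitions: a bullet is a line starting with -, *, +, or • followed by
# a space, or a numbered item such as "1." or "2)". A header is an ATX-style
# markdown heading ("#" through "######").
BULLET_RE = re.compile(r"^\s*(?:[-*+•]|\d+[.)])\s+\S")
HEADER_RE = re.compile(r"^\s{0,3}#{1,6}\s+\S")


def count_structure(text):
    """Count bullet lines and header lines in a single response."""
    lines = text.splitlines()
    return {
        "bullets": sum(1 for line in lines if BULLET_RE.match(line)),
        "headers": sum(1 for line in lines if HEADER_RE.match(line)),
    }


def mean_per_response(responses):
    """Average the per-response counts over a list of responses."""
    if not responses:
        return {"bullets": 0.0, "headers": 0.0}
    counts = [count_structure(r) for r in responses]
    n = len(counts)
    return {
        "bullets": sum(c["bullets"] for c in counts) / n,
        "headers": sum(c["headers"] for c in counts) / n,
    }


structured = "## Steps\n- one\n- two\n1. three\nPlain **bold** sentence.\n"
print(count_structure(structured))
print(mean_per_response([structured, "Just prose."]))
```

Bold text such as `**bold**` is deliberately not counted as a bullet, because the pattern requires whitespace directly after the list marker.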

Adversarial set (gold = structure)

| Metric | Base model | This model | Gold response |
| --- | --- | --- | --- |
| Bullets per response | 9.35 | 6.60 | 8.18 |
| Headers per response | 0.78 | 0.30 | 3.38 |

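On contexts where structure is the correct answer (recipes, install instructions, comparisons, troubleshooting flows, reference lookups), the trained model preserves substantial structure but drifts below the gold level: bullets fall from 9.35 to 6.60 against a gold of 8.18, and headers from 0.78 to 0.30 against a gold of 3.38, although the base model is already well below gold on headers. This suggests mild reward hacking, where the model slightly over-generalizes the "prefer prose" preference.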
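The relative gaps quoted in this card can be recomputed from the two tables above. The snippet below is a small arithmetic sketch. The helper name `pct_below` is illustrative, and the only inputs are the reported per-response averages.

```python
def pct_below(value, reference):
    """Percentage by which `value` falls below `reference`."""
    return 100.0 * (reference - value) / reference


# Main test split (gold = prose): this model relative to the base model.
bullet_reduction = pct_below(0.53, 2.16)

# Adversarial set (gold = structure): this model relative to the gold responses.
bullet_gap_vs_gold = pct_below(6.60, 8.18)
header_gap_vs_gold = pct_below(0.30, 3.38)

print(f"Bullet reduction vs base (main split): {bullet_reduction:.1f}%")
print(f"Bullets below gold (adversarial):      {bullet_gap_vs_gold:.1f}%")
print(f"Headers below gold (adversarial):      {header_gap_vs_gold:.1f}%")
```

The first two figures correspond to the roughly 75% bullet reduction on the main split and the roughly 20% shortfall on adversarial bullets. The header shortfall on the adversarial set is considerably larger, though the base model also sits far below gold on that metric.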

Full evaluation notebook: notebooks/dpo_03_evaluate.ipynb.

Limitations

  • v1 baseline: this is the first training run on a small dataset (591 examples) with conservative settings (LoRA rank 16, 1 epoch, β = 0.1). Higher capacity and tuned hyperparameters would likely close the adversarial gap.
  • Single-author dataset voice: the FormatBench chosen responses were authored by one annotator, so the trained model inherits that voice as the "correct" prose style.
  • English only: training data is exclusively English.
  • Mild reward hacking: the model uses less structure than appropriate on contexts where structure helps (adversarial bullets are roughly 20% below gold).
  • Small base model: 1.5B parameters limits both fluency and the ceiling of what LoRA fine-tuning can achieve. v2 will explore 3B and 7B base models.

Roadmap

This adapter is v1. Planned for v2:

  • Higher LoRA rank (64 or 128) for more capacity to shift generation behavior
  • Tighter KL constraint (β = 0.3) to reduce adversarial structure drift
  • Larger dataset (~2000 examples, multi-author)
  • Larger base models (Qwen 2.5 3B and 7B)

Citation

@misc{prosify_qwen_1_5b_lora_2026,
  author = {Krishna Dahale},
  title = {Prosify-Qwen-1.5B-LoRA: A DPO-trained adapter for correcting LLM formatting bias},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/krishy-d/prosify_qwen_1.5b_lora}}
}

License

Apache 2.0 (matching the base model's license).
