Qwen3.5-9B Humanize DPO Round 1

LoRA adapter fine-tuned with DPO (Direct Preference Optimization) on Qwen3.5-9B for Chinese text humanization. This is the first DPO alignment stage, and it aims for a balance between academic precision and natural everyday-language rewriting.

Model Details

Item	Value
Base model	`unsloth/Qwen3.5-9B`
Starting point	SFT
Fine-tuning method	DPO
LoRA rank	16
Training data	4000 preference pairs (2000 with over-formal rejected samples + 2000 with over-casual rejected samples)
Checkpoint used	step-200 (of 250)
Final margin	~11.4
Final accuracy	100%
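
The card does not define "margin" and "accuracy". Assuming they are the usual DPO training metrics (mean implicit-reward gap between chosen and rejected, and the fraction of pairs where chosen scores higher), they could be computed roughly as in the sketch below. The function name, dictionary keys, and toy log-probabilities are illustrative and do not come from the training run.

```python
from statistics import mean

def dpo_metrics(pairs, beta=0.1):
    """Return (mean reward margin, reward accuracy) for a batch of pairs.

    Each pair holds summed log-probs of the chosen and rejected responses
    under the policy and under the frozen reference model.
    """
    margins = []
    for p in pairs:
        # Implicit DPO reward: beta * (log pi_policy - log pi_reference)
        r_chosen = beta * (p["policy_chosen"] - p["ref_chosen"])
        r_rejected = beta * (p["policy_rejected"] - p["ref_rejected"])
        margins.append(r_chosen - r_rejected)
    # A pair counts as correct when the chosen reward exceeds the rejected one.
    accuracy = mean(1.0 if m > 0 else 0.0 for m in margins)
    return mean(margins), accuracy

# Toy log-probs, made up for illustration only.
batch = [
    {"policy_chosen": -40.0, "ref_chosen": -60.0,
     "policy_rejected": -150.0, "ref_rejected": -70.0},
    {"policy_chosen": -55.0, "ref_chosen": -65.0,
     "policy_rejected": -160.0, "ref_rejected": -50.0},
]
margin, accuracy = dpo_metrics(batch)
print(margin, accuracy)
```

Under this reading, a margin near 11 with beta = 0.1 means the policy's log-probability gap between chosen and rejected text has grown by roughly 110 nats relative to the reference model.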

What It Does

Rewrites AI-generated or overly formal/casual Chinese text into natural human writing:

Academic/technical: preserves all numbers, terminology, and structure; removes bureaucratic phrasing
Daily text: adds warmth and personality without over-colloquializing
Dual-rejected training: the rejected samples cover both failure modes, so the model learns to avoid over-formal and over-casual outputs alike
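
Since the academic mode is meant to preserve all numbers, a quick post-check on generated rewrites can be useful. The following is a minimal sketch and not part of the released model or its training pipeline; `missing_numbers` is a hypothetical helper, and it only looks at Arabic-digit tokens.

```python
import re

# Matches integers, decimals, and percentages such as 2000, 3.2, or 95%.
_NUMBER = re.compile(r"\d+(?:\.\d+)?%?")

def missing_numbers(source: str, rewrite: str) -> list[str]:
    """Return numeric tokens from `source` that do not appear in `rewrite`."""
    rewritten = set(_NUMBER.findall(rewrite))
    return [n for n in _NUMBER.findall(source) if n not in rewritten]

source = "实验结果表明BLEU分数提高了3.2个百分点。"
good = "实验显示，BLEU分数涨了3.2个百分点。"
bad = "实验显示，BLEU分数涨了3个多百分点。"

print(missing_numbers(source, good))  # empty list: every number survived
print(missing_numbers(source, bad))   # lists the numbers that were lost
```

Numbers written in Chinese numerals (for example 三点二) are not detected by this pattern, so a non-empty result is a prompt for manual review and not a definitive failure.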

Usage

from unsloth import FastLanguageModel
from peft import PeftModel

# Load the base model. The second return value may be a processor that
# wraps the tokenizer, so unwrap it when needed.
base_model, proc = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3.5-9B", max_seq_length=2048, load_in_4bit=False,
)
tokenizer = proc.tokenizer if hasattr(proc, "tokenizer") else proc

# Attach the DPO LoRA adapter in inference-only mode.
model = PeftModel.from_pretrained(
    base_model, "XiangJinYu/Qwen3.5-9B-Humanize-DPO-Round1", is_trainable=False,
)
# Workaround: report the model type as "qwen3" before enabling inference mode.
if hasattr(model, "config") and getattr(model.config, "model_type", "") == "qwen3_5":
    model.config.model_type = "qwen3"
FastLanguageModel.for_inference(model)

# Build the chat prompt: rewrite instruction plus the source text, thinking disabled.
instruction = "请将下面文本改写得更像自然人写作，保持原意与事实，不要加标题或说明。"
text = "本研究旨在探讨深度学习模型在自然语言处理任务中的性能优化策略，实验结果表明BLEU分数提高了3.2个百分点。"
messages = [{"role": "user", "content": [{"type": "text", "text": f"{instruction}\n\n原文：{text}"}]}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a rewrite, then decode only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.65,
                         top_p=0.9, do_sample=True, repetition_penalty=1.1)
gen = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(gen, skip_special_tokens=True))

Training Details

DPO data: 4000 pairs, with chosen = CSL human text and rejected = AI-generated text (2000 over-formal + 2000 over-casual)
Reference model: SFT adapter
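Hyperparameters: beta = 0.1, learning rate = 2e-6 with cosine decay, 250 steps (the step-200 checkpoint is the one used)
Key insight: rejecting in both directions pushes the model toward the actual human distribution, instead of simply shifting its register toward casual or toward formal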
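
The dual-direction pairing described above can be sketched as follows. The card does not publish the data pipeline, so everything here is an assumption for illustration: the `make_pair` and `mix_balanced` helpers, the `direction` field, and the toy strings are hypothetical, and the prompt/chosen/rejected layout is simply the one commonly expected by DPO trainers.

```python
import random
from collections import Counter

def make_pair(prompt, human_text, ai_text, direction):
    """Build one preference record in a prompt/chosen/rejected layout."""
    if direction not in {"over_formal", "over_casual"}:
        raise ValueError(f"unknown direction: {direction}")
    return {
        "prompt": prompt,
        "chosen": human_text,    # human-written target
        "rejected": ai_text,     # AI text that misses in one direction
        "direction": direction,  # bookkeeping only (hypothetical field)
    }

def mix_balanced(formal_pairs, casual_pairs, seed=0):
    """Shuffle both rejected directions together, requiring equal counts."""
    if len(formal_pairs) != len(casual_pairs):
        raise ValueError("expected the same number of pairs per direction")
    mixed = list(formal_pairs) + list(casual_pairs)
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy placeholder data, not taken from the real training set.
formal = [make_pair(f"prompt {i}", f"human {i}", f"stiff {i}", "over_formal")
          for i in range(2)]
casual = [make_pair(f"prompt {i}", f"human {i}", f"chatty {i}", "over_casual")
          for i in range(2, 4)]

mixed = mix_balanced(formal, casual)
counts = Counter(p["direction"] for p in mixed)
print(counts["over_formal"], counts["over_casual"])
```

Keeping the two directions at equal counts mirrors the 2000 + 2000 split reported here; with only one direction present, the preference signal could be satisfied by moving the register the opposite way without getting closer to human text.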