Qwen3.5-9B Humanize DPO Round 1

LoRA adapter fine-tuned with DPO (Direct Preference Optimization) on Qwen3.5-9B for Chinese text humanization. This is the first DPO alignment stage. It balances academic precision against natural, everyday-language rewriting.

Model Details

Item                Value
Base model          unsloth/Qwen3.5-9B
Starting point      SFT
Fine-tuning method  DPO
LoRA rank           16
Training data       4000 pairs (2000 over-formal + 2000 over-casual rejected)
Checkpoint used     step-200 (of 250)
Final margin        ~11.4
Final accuracy      100%
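
For readers unfamiliar with DPO logging, the margin and accuracy rows can be read as follows, assuming they follow the usual convention (for example the `rewards/margins` and `rewards/accuracies` metrics logged by TRL's DPO trainer; the card does not say which trainer was used). The implicit reward of a response is beta times the log-probability gap between the policy and the reference model, the margin is the mean chosen-minus-rejected reward, and accuracy is the fraction of pairs where the chosen reward is higher. The function names and log-probabilities below are hypothetical and purely illustrative; only beta = 0.1 comes from this card.

```python
import math

BETA = 0.1  # beta value listed under Training Details


def implicit_reward(policy_logp, ref_logp, beta=BETA):
    """DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)


def dpo_pair_stats(pairs, beta=BETA):
    """Return (mean loss, mean margin, accuracy) over a list of pairs.

    Each pair is (policy_chosen, ref_chosen, policy_rejected, ref_rejected),
    where every entry is a summed sequence log-probability.
    """
    losses, margins, correct = [], [], 0
    for pc, rc, pr, rr in pairs:
        margin = implicit_reward(pc, rc, beta) - implicit_reward(pr, rr, beta)
        # DPO loss for one pair: -log(sigmoid(margin)) = log(1 + exp(-margin))
        losses.append(math.log1p(math.exp(-margin)))
        margins.append(margin)
        correct += margin > 0
    n = len(pairs)
    return sum(losses) / n, sum(margins) / n, correct / n


# Made-up log-probabilities, NOT taken from the actual training run.
toy_pairs = [
    (-120.0, -150.0, -260.0, -180.0),  # margin = 0.1 * (30 - (-80))
    (-90.0, -100.0, -200.0, -190.0),   # margin = 0.1 * (10 - (-10))
]
loss, margin, acc = dpo_pair_stats(toy_pairs)
print(f"loss={loss:.4f} margin={margin:.2f} accuracy={acc:.0%}")
```

Under this reading, a margin around 11 puts the sigmoid very close to 1, so the per-pair loss is nearly zero, which is consistent with the reported 100% accuracy on the training pairs.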

What It Does

Rewrites AI-generated or overly formal/casual Chinese text into natural human writing:

  • Academic/technical: preserves all numbers, terminology, and structure; removes bureaucratic phrasing
  • Daily text: adds warmth and personality without over-colloquializing
  • Dual-rejected training: the model learned to avoid both over-formal and over-casual outputs
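
Since the academic mode is meant to preserve all numbers, a quick post-check on a rewrite can be useful. The following is a minimal sketch, not part of the model or its training pipeline: `missing_numbers` is a hypothetical helper, and both example rewrites are invented for illustration. It covers only Arabic-digit integers, decimals, and percentages, not Chinese numerals such as 三.

```python
import re

# Matches integers, decimals and percentages such as "3.2", "2048", "95%".
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?%?")


def missing_numbers(source: str, rewrite: str) -> list[str]:
    """Return numbers that occur in `source` but not in `rewrite`."""
    rewritten = set(NUMBER_RE.findall(rewrite))
    return [n for n in NUMBER_RE.findall(source) if n not in rewritten]


# Source: "Experimental results show the BLEU score improved by 3.2 percentage points."
source = "实验结果表明BLEU分数提高了3.2个百分点。"
# Invented rewrite that keeps the figure.
good = "实验发现,BLEU分数涨了3.2个百分点。"
# Invented rewrite that rounds the figure away.
bad = "实验发现,BLEU分数涨了大约3个百分点。"

print(missing_numbers(source, good))
print(missing_numbers(source, bad))
```

An empty list means every number from the source appears somewhere in the rewrite. It does not confirm that the number is still attached to the right quantity.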

Usage

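# Load the base model with Unsloth, then attach the DPO LoRA adapter with PEFT.
from unsloth import FastLanguageModel
from peft import PeftModel

base_model, proc = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3.5-9B", max_seq_length=2048, load_in_4bit=False,
)
# The loader may return a processor that wraps the tokenizer; unwrap it if so.
tokenizer = proc.tokenizer if hasattr(proc, "tokenizer") else proc

model = PeftModel.from_pretrained(
    base_model, "XiangJinYu/Qwen3.5-9B-Humanize-DPO-Round1", is_trainable=False,
)
# Workaround: relabel the model type from "qwen3_5" to "qwen3" before enabling inference mode.
if hasattr(model, "config") and getattr(model.config, "model_type", "") == "qwen3_5":
    model.config.model_type = "qwen3"
FastLanguageModel.for_inference(model)

# Instruction (English): "Rewrite the text below so it reads more like natural human
# writing; keep the original meaning and facts; do not add a title or explanation."
instruction = "请将下面文本改写得更像自然人写作,保持原意与事实,不要加标题或说明。"
# Sample input (English): "This study explores performance-optimization strategies for deep
# learning models on NLP tasks; results show the BLEU score improved by 3.2 percentage points."
text = "本研究旨在探讨深度学习模型在自然语言处理任务中的性能优化策略,实验结果表明BLEU分数提高了3.2个百分点。"
# "原文" means "original text".
messages = [{"role": "user", "content": [{"type": "text", "text": f"{instruction}\n\n原文:{text}"}]}]
# Render the chat template to a string, with thinking mode disabled.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sample with a moderate temperature and a mild repetition penalty.
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.65,
                         top_p=0.9, do_sample=True, repetition_penalty=1.1)
# Keep only the newly generated tokens (drop the prompt) and decode them.
gen = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(gen, skip_special_tokens=True))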
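
When rewriting several passages, it can help to factor the message construction out of the snippet above. This is a minimal sketch: `build_messages` and `INSTRUCTION` are hypothetical names introduced here, the message layout is copied from the usage code, and the two sample paragraphs are placeholders.

```python
# Same instruction string as in the usage snippet.
INSTRUCTION = "请将下面文本改写得更像自然人写作,保持原意与事实,不要加标题或说明。"


def build_messages(text: str, instruction: str = INSTRUCTION) -> list[dict]:
    """Build the single-turn chat message list used in the usage snippet."""
    return [{
        "role": "user",
        "content": [{"type": "text", "text": f"{instruction}\n\n原文:{text}"}],
    }]


# Placeholder inputs ("First/Second paragraph of text to rewrite.").
paragraphs = ["第一段待改写的文本。", "第二段待改写的文本。"]
batch = [build_messages(p) for p in paragraphs]
print(batch[0][0]["content"][0]["text"])
```

Each element of `batch` can then be passed to `tokenizer.apply_chat_template` and `model.generate` exactly as in the single-text example.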

Training Details

  • DPO data: 4000 pairs; chosen = CSL human text, rejected = AI-generated text (2000 over-formal + 2000 over-casual)
  • Reference model: SFT adapter
  • Hyperparameters: beta = 0.1, learning rate = 2e-6, cosine decay, 250 steps
  • Key insight: using rejected samples in both directions forces the model to learn the actual human distribution rather than simply shifting toward a more casual or more formal register
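
The dual-direction setup can be pictured as a single preference dataset in which human text is always the chosen response and the rejected response comes from one of two failure modes. The sketch below is an assumption about how such records might be laid out, not the author's actual pipeline. The `prompt`/`chosen`/`rejected` field names follow the common convention for preference datasets, while the `direction` field, the `build_dpo_records` helper, and the toy strings are hypothetical.

```python
import random
from collections import Counter


def build_dpo_records(samples, seed=0):
    """Turn (prompt, human_text, rejected_text, direction) tuples into records.

    `direction` is "over_formal" or "over_casual" and is kept only for
    bookkeeping; a DPO trainer would typically read prompt/chosen/rejected.
    """
    records = [
        {"prompt": p, "chosen": h, "rejected": r, "direction": d}
        for p, h, r, d in samples
    ]
    # Shuffle so both rejected directions are mixed within each batch.
    random.Random(seed).shuffle(records)
    return records


# Toy placeholders standing in for real prompts and texts.
toy = (
    [(f"prompt-{i}", f"human-{i}", f"formal-ai-{i}", "over_formal") for i in range(3)]
    + [(f"prompt-{i}", f"human-{i}", f"casual-ai-{i}", "over_casual") for i in range(3, 6)]
)
records = build_dpo_records(toy)
print(Counter(r["direction"] for r in records))
```

Keeping the two directions balanced, as in the 2000 + 2000 split described above, means neither failure mode dominates the preference signal.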

Model Series

Model        Type  Recommended for
SFT          SFT   Foundation
This model   DPO   General use, balanced
DPO Round 2  DPO   Academic/technical, latest