Tags: Safetensors · GGUF · Vietnamese · English · qwen2 · dpo · alignment · llama-cpp · qwen2.5 · vinuni-lab22

Lab 22 DPO — merged + GGUF release (lab22-dpo-adapter-gguf)

Merged FP16 weights and GGUF quantizations of a Qwen2.5-3B model fine-tuned with SFT and then DPO for the VinUni AICB Day 22 alignment lab.

Pipeline

  1. SFT-mini: 1k Vietnamese Alpaca examples · 1 epoch · LoRA r=16 on unsloth/Qwen2.5-3B-bnb-4bit
  2. DPO: 2k UltraFeedback preference pairs · β=0.1 · lr=5e-7
  3. Merge the SFT and DPO LoRA adapters into the base model and save as FP16
  4. Convert to GGUF and quantize with llama.cpp
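Step 2 trains with DPO at β=0.1. The sketch below shows the standard sigmoid DPO loss for a single preference pair, assuming the summed log-probabilities of each full response under the policy and under the frozen reference model (typically the SFT checkpoint) are already computed. The function names and the example numbers are illustrative, not taken from the lab code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> tuple[float, float, float]:
    """Sigmoid DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each whole response.
    Returns (loss, chosen_reward, rejected_reward), where each implicit
    reward is beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Loss = -log sigmoid(reward margin); equals ln 2 when policy == reference.
    loss = -log_sigmoid(chosen_reward - rejected_reward)
    return loss, chosen_reward, rejected_reward


if __name__ == "__main__":
    # Hypothetical log-probs: the policy has shifted slightly toward the chosen answer.
    loss, rc, rr = dpo_loss(-42.0, -55.0, -43.0, -54.0, beta=0.1)
    print(f"loss={loss:.4f} chosen_reward={rc:.2f} rejected_reward={rr:.2f}")
```

A small β such as 0.1 keeps the policy close to the reference: the same log-probability shift produces a smaller implicit reward, so the optimizer must move further from the reference to reduce the loss.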

Available files

Quant     File                        Size (MB)
Q4_K_M    merged-fp16.Q4_K_M.gguf     1929.9

  • model-*.safetensors and accompanying files: merged FP16 weights (for vLLM / transformers)
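The merged FP16 weights can also be loaded directly with Hugging Face transformers. This is a minimal sketch, assuming the repo ships the config and tokenizer files (with a chat template) alongside the safetensors, and that `torch`, `transformers` and `accelerate` (needed for `device_map="auto"`) are installed; the helper name `generate` is hypothetical.

```python
REPO_ID = "Wan1302/lab22-dpo-adapter-gguf"


def generate(prompt: str, max_new_tokens: int = 200) -> str:
    """Load the merged FP16 model and greedily answer a single user prompt."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
    model = AutoModelForCausalLM.from_pretrained(
        REPO_ID, torch_dtype=torch.float16, device_map="auto"
    )
    # Render the chat template as text, then tokenize it.
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output[0, inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Giải thích quicksort 3 câu.")` mirrors the llama-cpp-python example. At 3B parameters × 2 bytes, the FP16 weights alone need roughly 6 GB of memory.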

DPO training summary

Metric                               Value
Final training loss                  0.8085754909515381
Chosen reward (end of training)      -0.872550094127655
Rejected reward (end of training)    -0.9477240920066834
Reward gap (chosen - rejected)       0.07517399787902834
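How to read these numbers: this assumes the rewards are TRL-style implicit rewards with β already applied, as logged by TRL's `DPOTrainer` (which Unsloth wraps). Under that assumption, the reward gap is chosen minus rejected, and σ(gap) is the Bradley-Terry probability that the policy prefers the chosen response. The reported values are end-of-training batch averages, so the check below is only an approximation.

```python
import math

# End-of-training values copied from the DPO training summary table.
chosen_reward = -0.872550094127655
rejected_reward = -0.9477240920066834

# Reward gap (margin) = chosen - rejected.
reward_gap = chosen_reward - rejected_reward

# Implied probability that the chosen response is preferred: sigmoid(margin).
p_chosen_preferred = 1.0 / (1.0 + math.exp(-reward_gap))

# Per-pair DPO loss at this margin, compared with ln 2 at margin 0 (policy == reference).
loss_at_gap = math.log1p(math.exp(-reward_gap))

print(f"reward gap       = {reward_gap:.4f}")
print(f"P(chosen > rej.) = {p_chosen_preferred:.3f}")
print(f"loss at this gap = {loss_at_gap:.4f} (ln 2 = {math.log(2):.4f})")
```

A gap of about 0.075 corresponds to P ≈ 0.52, so the policy separates chosen from rejected responses only weakly. Both rewards are negative, which means that on average the policy assigns lower log-probability to both responses than the reference does. The logged training loss averages -log σ over individual pairs, and that function is nonlinear, so it is not expected to equal the loss evaluated at the mean gap.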

Usage — llama-cpp-python (CPU/Metal/CUDA)

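from llama_cpp import Llama

# Local path to the GGUF file from this repo (see "Available files").
# n_gpu_layers=-1 offloads all layers when a CUDA/Metal backend is available.
llm = Llama(model_path="merged-fp16.Q4_K_M.gguf", n_ctx=512, n_gpu_layers=-1)
out = llm.create_chat_completion(
    # Prompt: "Explain quicksort in 3 sentences."
    messages=[{"role": "user", "content": "Giải thích quicksort 3 câu."}],
    max_tokens=200, temperature=0.0,  # temperature 0.0 = greedy decoding
)
print(out["choices"][0]["message"]["content"])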
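If the GGUF file is not on disk yet, llama-cpp-python can download it from the Hub with `Llama.from_pretrained`, which also requires the `huggingface_hub` package. This minimal sketch assumes the repo id and file name listed on this card; the helper name `load_llm` is hypothetical.

```python
REPO_ID = "Wan1302/lab22-dpo-adapter-gguf"
GGUF_FILENAME = "merged-fp16.Q4_K_M.gguf"


def load_llm(n_ctx: int = 512, n_gpu_layers: int = -1):
    """Download (and cache) the Q4_K_M GGUF from the Hub, then load it.

    Hypothetical helper; requires `llama-cpp-python` and `huggingface_hub`.
    """
    # Imported lazily so this module can be imported without the dependencies.
    from llama_cpp import Llama

    return Llama.from_pretrained(
        repo_id=REPO_ID,
        filename=GGUF_FILENAME,
        n_ctx=n_ctx,
        n_gpu_layers=n_gpu_layers,  # -1 offloads every layer when a GPU/Metal backend exists
    )
```

After `llm = load_llm()`, `llm.create_chat_completion(...)` works exactly as in the example above. The file is cached locally, so later calls skip the download.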

Usage — vLLM (BigGPU only)

vllm serve Wan1302/lab22-dpo-adapter-gguf --port 8000 --max-model-len 512
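`vllm serve` loads the merged FP16 safetensors and exposes an OpenAI-compatible HTTP API. Unless `--served-model-name` is set, the model name in requests is the repo id given on the command line. This minimal stdlib-only client sketch assumes the server started by the command above is listening on localhost:8000; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches --port 8000 in the command above
MODEL = "Wan1302/lab22-dpo-adapter-gguf"  # default served model name = repo id


def build_chat_request(prompt: str, max_tokens: int = 200) -> dict:
    """JSON body for POST /v1/chat/completions (OpenAI-compatible schema)."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def chat(prompt: str, max_tokens: int = 200, timeout: float = 60.0) -> str:
    """Send one chat request to the running vLLM server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, max_tokens)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["choices"][0]["message"]["content"]
```

For example, `chat("Giải thích quicksort 3 câu.")` returns the model's reply. With `--max-model-len 512`, the prompt tokens plus `max_tokens` must fit within 512.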

License & limitations

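  • License: Apache-2.0 as declared for this release. Note that the Qwen2.5-3B base model is published under the Qwen Research License rather than Apache-2.0 (unlike most other Qwen2.5 sizes), so its terms also apply to this derivative.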
  • Experimental research model. The DPO preference data is English-only (UltraFeedback); Vietnamese helpfulness and safety were checked on only 8 prompts (lab NB4).
  • Not production-ready: refusal behavior on safety-critical prompts has not been exhaustively red-teamed.

Citation

VinUni AICB program · Track 3 Day 22 · A20 cohort 2026.

Safetensors · Model size: 3B params · Tensor type: F16

Model tree for Wan1302/lab22-dpo-adapter-gguf: base model Qwen/Qwen2.5-3B → Quantized (3) → this model
