Supervisor-FRPT: Supervisor-FRPT-Phi-4-reasoning-single-v3
This model is a LoRA-fine-tuned supervisor (CS quality evaluator) for electronics
customer-support chatbot conversations. It was trained on 20260331_HumanFeedBack_selfdist.jsonl
(3,771 human-labelled dialogues) with the FRPT ("Fact-Reasoning Process
Training") research training methodology applied to a lora_single LoRA recipe.
The job of this model: given (category, multi-turn user/assistant transcript, retrieved reference document), produce a Korean <think>...</think> rubric chain and a JSON verdict {"label": "correct|incorrect", "reason": "..."}.
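The three inputs can be assembled into a single user message. The sketch below mirrors the bracketed layout used in the Quick inference example of this card; the helper name build_user_prompt and its argument names are hypothetical and not part of any released code.

```python
from typing import List, Tuple


def build_user_prompt(category: str,
                      turns: List[Tuple[str, str]],
                      doc_title: str,
                      doc_content: str) -> str:
    """Assemble the user message from category, transcript, and document.

    Hypothetical helper: it only reproduces the layout shown in the
    Quick inference example ([Category], [Conversation Transcript],
    [Retrieved Document]).
    """
    lines = [f"[Category] {category}", "", "[Conversation Transcript]"]
    # Each turn is a (user message, assistant message) pair, numbered from 1.
    for i, (user_msg, assistant_msg) in enumerate(turns, start=1):
        lines.append(f"Turn {i} - User: {user_msg}")
        lines.append(f"Turn {i} - Assistant: {assistant_msg}")
    lines += ["", "[Retrieved Document]",
              f"(title) {doc_title}",
              f"(content) {doc_content}"]
    return "\n".join(lines)


prompt = build_user_prompt("PC", [("...", "...")], "...", "...")
print(prompt)
```

With a single placeholder turn this yields the same string as the hand-written user message in the Quick inference example.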
Test metrics (held-out 199 dialogues)
| Metric | Value |
|---|---|
| Accuracy | 0.673 |
| Macro-F1 | 0.621 |
| F1 (correct) | 0.480 |
| F1 (incorrect) | 0.762 |
| Precision (incorrect) | 0.825 |
| Recall (incorrect) | 0.707 |
| F₀.₅ (incorrect) | 0.799 |
| Precision (correct) | 0.411 |
| Recall (correct) | 0.577 |
| Unparsed | 0/199 |
Why these metrics
The deployment goal for this supervisor is catching incorrect chatbot
responses with high reliability, so the operationally critical metric is
the chance that when this model says "incorrect", the chatbot really did
answer incorrectly, i.e. precision(incorrect). The model was selected
from a multi-method, multi-seed grid by
F₀.₅(incorrect) = (1 + 0.25)·P·R / (0.25·P + R),
where P and R are precision and recall on the incorrect class and 0.25 = β² for β = 0.5. This weighs precision twice as much as
recall on the incorrect class while still penalising excessive misses.
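As a sanity check, the selection score can be recomputed from the precision and recall reported in the metrics table. This is a minimal sketch: the function name f_beta is illustrative, and because the inputs are the rounded table values, the result matches the reported figures only to within rounding.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """General F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is predicted correctly
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Rounded values for the "incorrect" class, taken from the metrics table.
p_incorrect, r_incorrect = 0.825, 0.707

f05 = f_beta(p_incorrect, r_incorrect, beta=0.5)  # precision-weighted score
f1 = f_beta(p_incorrect, r_incorrect, beta=1.0)   # ordinary F1 for comparison
print(f"F0.5(incorrect) ~ {f05:.3f}, F1(incorrect) ~ {f1:.3f}")
```

Because β = 0.5 sits below 1, the score lands closer to precision than F1 does, which is the intended bias for this deployment goal.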
Training methodology: research highlights
The training methodology bundles two layers:
1. Base LoRA recipe: lora_single with rank 16, alpha 32, dropout 0.05, target modules qkv_proj, o_proj, down_proj, gate_up_proj (Phi-3 family) or the q/k/v/o/MLP equivalents for Gemma-4. Optimizer AdamW, cosine schedule, warmup ratio 0.05, grad clip 1.0, BF16, SDPA attention.
2. FRPT-aware data shaping (Fact-grounded Reasoning Process Training):
   - Process-supervision view: the assistant turn already exposes a 3-axis rubric (Query-Document Alignment, Response-Document Consistency, Response Completeness) inside <think>...</think>. We train on the entire assistant response, so the model learns the reasoning process, not just the verdict.
   - Fact-grounded SFT: loss is masked on user/system tokens; only the assistant span (think + JSON) contributes to the gradient. This forces the model to learn how to evaluate, not what the user said.
   - Class-imbalance aware: incorrect : correct = 2616 : 1155 (~2.3:1) in train. We monitor F1-correct (the minority class) as the primary model-selection signal.
   - (Sequential variant): lora_sequential groups the 33 product categories into 5 buckets (DRW, TV, SBS, REF_AUD_MNT, OTHERS) and trains them in order, exposing the model to per-category structure while sharing one adapter across the curriculum.
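The loss masking described under Fact-grounded SFT is commonly implemented by setting the labels of all non-assistant tokens to -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below illustrates that idea on plain token-id lists; mask_prompt_labels and the toy ids are hypothetical, and the actual training code may build its labels differently.

```python
from typing import List

# Label value skipped by PyTorch's cross-entropy loss (its default ignore_index).
IGNORE_INDEX = -100


def mask_prompt_labels(input_ids: List[int], prompt_len: int) -> List[int]:
    """Return labels in which the first prompt_len tokens carry no loss.

    Hypothetical helper: prompt_len is the number of system/user tokens
    that precede the assistant span (think + JSON).
    """
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must lie within the sequence")
    # Prompt positions are ignored; assistant positions keep their token ids.
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Toy sequence: 4 prompt tokens followed by 3 assistant tokens.
example_ids = [11, 12, 13, 14, 21, 22, 23]
labels = mask_prompt_labels(example_ids, prompt_len=4)
print(labels)
```

Only the last three positions keep real token ids, so only the assistant span would contribute to the gradient.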
Hyperparameters of the final run
| Field | Value |
|---|---|
| Base model | microsoft/Phi-4-reasoning |
| Method | lora_single |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Learning rate | 0.0005 |
| Epochs | 7 |
| Seed | 42 |
| Train samples | 3,771 |
| Test samples | 199 |
| Max sequence length | 4096 |
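For readers who want to reproduce a similar setup, the values above and in the base recipe can be collected as keyword arguments. This is a sketch under assumptions: the keys are chosen to match the parameter names of peft.LoraConfig and transformers.TrainingArguments, the task_type value is an assumption, and the original training scripts are not part of this card.

```python
# Keyword arguments that could be passed as peft.LoraConfig(**lora_kwargs).
lora_kwargs = {
    "r": 16,                     # LoRA rank
    "lora_alpha": 32,            # LoRA alpha
    "lora_dropout": 0.05,
    "target_modules": ["qkv_proj", "o_proj", "down_proj", "gate_up_proj"],
    "task_type": "CAUSAL_LM",    # assumption: causal-LM fine-tuning
}

# Keyword arguments that could be passed to transformers.TrainingArguments
# (an output_dir would also be needed; it is not stated in this card).
trainer_kwargs = {
    "learning_rate": 5e-4,
    "num_train_epochs": 7,
    "seed": 42,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "max_grad_norm": 1.0,        # gradient clipping
    "bf16": True,
}

MAX_SEQ_LEN = 4096  # the argument name for this varies between trainers

# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {lora_scale}")
```

With rank 16 and alpha 32, the adapter update is scaled by a factor of 2 under the standard LoRA formulation.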
Quick inference
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

mid = "shareit/Supervisor-FRPT-Phi-4-reasoning-single-v3"
tok = AutoTokenizer.from_pretrained(mid, trust_remote_code=True)
# Note: older transformers releases take torch_dtype= instead of dtype=.
model = AutoModelForCausalLM.from_pretrained(mid, dtype=torch.bfloat16,
                                             device_map="auto",
                                             trust_remote_code=True)

# System prompt (Korean): "You are a supervisor who evaluates the quality
# of an electronics CS chatbot."
system = "당신은 전자제품 CS 챗봇의 품질을 평가하는 슈퍼바이저입니다."
user = ("[Category] PC\n\n[Conversation Transcript]\n"
        "Turn 1 - User: ...\nTurn 1 - Assistant: ...\n\n"
        "[Retrieved Document]\n(title) ...\n(content) ...")
msgs = [{"role": "system", "content": system},
        {"role": "user", "content": user}]

# Tokenize with the chat template and append the assistant generation prompt.
inp = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True,
                              return_tensors="pt").to(model.device)
# Greedy decoding; print only the newly generated tokens.
out = model.generate(inp, max_new_tokens=900, do_sample=False)
print(tok.decode(out[0, inp.shape[1]:], skip_special_tokens=True))
The generated text follows:
<think>
[Query-Document Alignment] ...
[Response-Document Consistency] ...
[Response Completeness] ...
</think>
{"label": "correct", "reason": "..."}
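Downstream code usually needs the rubric and the verdict as separate fields. The following is a minimal parsing sketch, assuming the output follows the layout shown above; parse_supervisor_output is a hypothetical helper, and returning None corresponds loosely to what the metrics table counts as "Unparsed".

```python
import json
import re
from typing import Optional, Tuple

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_supervisor_output(text: str) -> Optional[Tuple[str, str, str]]:
    """Split generated text into (rubric, label, reason), or None if malformed."""
    match = _THINK_RE.search(text)
    if match is None:
        return None
    rubric = match.group(1).strip()
    # The JSON verdict is expected after the closing </think> tag.
    tail = text[match.end():]
    start, end = tail.find("{"), tail.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        verdict = json.loads(tail[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(verdict, dict):
        return None
    label = verdict.get("label")
    if label not in ("correct", "incorrect"):
        return None
    return rubric, label, str(verdict.get("reason", ""))


sample = ("<think>\n[Query-Document Alignment] ...\n"
          "[Response-Document Consistency] ...\n"
          "[Response Completeness] ...\n</think>\n"
          '{"label": "incorrect", "reason": "..."}')
parsed = parse_supervisor_output(sample)
print(parsed[1] if parsed else "unparsed")
```

Validating the label against the two allowed values keeps malformed generations from being silently counted as a verdict.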
Citation / theory
This model embodies the FRPT (Fact-Reasoning Process Training) research program. Key references that inform the methodology:
- Gekhman et al. 2024: fine-tuning on new facts can encourage hallucination.
- Lightman et al. 2023: Let's Verify Step by Step (process supervision).
- Hu et al. 2021: LoRA.
- Dettmers et al. 2023: QLoRA.
- Biderman et al. 2024: LoRA Learns Less and Forgets Less (PEFT vs. full fine-tuning trade-offs).
For the merge-before-forget continual-learning theory that motivated the sequential variant, see the internal Session 1-4 reports.