Supervisor-FRPT — Supervisor-FRPT-Phi-4-reasoning-single-v3

This model is a LoRA-fine-tuned supervisor (a customer-support quality evaluator) for electronics customer-support chatbot conversations. It was trained on 20260331_HumanFeedBack_selfdist.jsonl (3,771 human-labelled dialogues) using the FRPT ("Fact-Reasoning Process Training") methodology on top of the lora_single LoRA recipe.

Task: given a product category, a multi-turn user/assistant transcript, and a retrieved reference document, the model produces a Korean <think>...</think> rubric chain followed by a JSON verdict {"label": "correct|incorrect", "reason": "..."}.
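As a minimal sketch, the three inputs could be assembled into a single user message as follows. The section labels ([Category], [Conversation Transcript], [Retrieved Document]) are copied from the Quick inference example on this card; the helper name build_user_message and the sample dialogue are hypothetical and only illustrate the layout.

```python
from typing import Iterable, Tuple


def build_user_message(category: str,
                       turns: Iterable[Tuple[str, str]],
                       doc_title: str,
                       doc_content: str) -> str:
    """Assemble the evaluator's user message from its three inputs.

    `turns` is a sequence of (user_text, assistant_text) pairs.
    This helper is hypothetical; only the section labels come from the card.
    """
    lines = [f"[Category] {category}", "", "[Conversation Transcript]"]
    for i, (user_text, assistant_text) in enumerate(turns, start=1):
        lines.append(f"Turn {i} - User: {user_text}")
        lines.append(f"Turn {i} - Assistant: {assistant_text}")
    lines += ["", "[Retrieved Document]",
              f"(title) {doc_title}", f"(content) {doc_content}"]
    return "\n".join(lines)


# Made-up sample data, for illustration only.
example = build_user_message(
    "PC",
    [("My laptop will not power on.",
      "Please hold the power button for 10 seconds.")],
    "Power troubleshooting",
    "If the device does not start, hold the power button.",
)
print(example)
```

The returned string can be used as the "content" of the user message in the chat template.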

Test metrics (held-out 199 dialogues)

Metric	Value
Accuracy	0.673
Macro-F1	0.621
F1 (correct)	0.480
F1 (incorrect)	0.762
Precision (incorrect)	0.825
Recall (incorrect)	0.707
F₀.₅ (incorrect)	0.799
Precision (correct)	0.411
Recall (correct)	0.577
Unparsed	0/199

Why these metrics

The deployment goal for this supervisor is to catch incorrect chatbot responses reliably, so the operationally critical metric is precision(incorrect): the probability that, when this model says "incorrect", the chatbot really did answer incorrectly. The model was selected from a multi-method, multi-seed grid by F₀.₅(incorrect) = (1 + 0.25)·P·R / (0.25·P + R), where P and R are precision and recall on the incorrect class and 0.25 = β² with β = 0.5. This setting treats recall as half as important as precision, while still penalising excessive misses.
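The selection score can be recomputed from the reported precision and recall. The sketch below uses the general F-beta formula and the rounded values from the metrics table, so the results may differ from the table in the third decimal; the function name f_beta is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1.0 + b2) * precision * recall / denom if denom > 0 else 0.0


# Rounded precision/recall values taken from the test-metrics table.
p_inc, r_inc = 0.825, 0.707   # incorrect class
p_cor, r_cor = 0.411, 0.577   # correct class

f05_inc = f_beta(p_inc, r_inc, beta=0.5)   # selection metric
f1_inc = f_beta(p_inc, r_inc, beta=1.0)
f1_cor = f_beta(p_cor, r_cor, beta=1.0)
macro_f1 = (f1_inc + f1_cor) / 2           # unweighted mean over the two classes

print(f"F0.5(incorrect) = {f05_inc:.3f}")
print(f"F1(incorrect)   = {f1_inc:.3f}")
print(f"F1(correct)     = {f1_cor:.3f}")
print(f"Macro-F1        = {macro_f1:.3f}")
```

With β = 0.5 the score stays closer to precision than F1 does, which is why a model with high precision(incorrect) and moderate recall ranks well under this criterion.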

Training methodology — research highlights

The training methodology bundles two layers:

Base LoRA recipe — lora_single with rank 16, alpha 32, dropout 0.05, target modules qkv_proj, o_proj, down_proj, gate_up_proj (Phi-3 family) or the q/k/v/o/MLP equivalents for Gemma-4. Optimizer AdamW, cosine schedule, warmup ratio 0.05, grad clip 1.0, BF16, SDPA attention.
FRPT-aware data shaping (Fact-Reasoning Process Training):
- Process-supervision view — the assistant turn already exposes a 3-axis rubric (Query-Document Alignment, Response-Document Consistency, Response Completeness) inside <think>...</think>. We train on the entire assistant response, so the model learns the reasoning process, not just the verdict.
- Fact-grounded SFT — loss is masked on user/system tokens; only the assistant span (think + JSON) contributes to gradient. This forces the model to learn how to evaluate, not what the user said.
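- Class-imbalance aware — incorrect : correct = 2616 : 1155 (~2.3:1) in train. During training we monitor F1-correct (the minority class) as the primary signal; selection of the released model from the grid uses F₀.₅(incorrect), as described under "Why these metrics".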
- (Sequential variant, not used for this lora_single model) — lora_sequential groups the 33 product categories into 5 buckets (DRW, TV, SBS, REF_AUD_MNT, OTHERS) and trains them in order, exposing the model to per-category structure while sharing one adapter across the curriculum.
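The loss masking described under "Fact-grounded SFT" can be sketched as below. The card does not show how masking is implemented, so this is one common approach under stated assumptions: prompt tokens receive the label -100, which PyTorch's cross-entropy loss ignores by default and which Hugging Face models follow for labels. The function name and the toy token ids are made up.

```python
from typing import List, Tuple

# Default ignore_index of PyTorch cross-entropy; also the Hugging Face labels convention.
IGNORE_INDEX = -100


def mask_prompt_tokens(prompt_ids: List[int],
                       response_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) where only response tokens carry loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    # System/user tokens are masked out; the assistant span (think + JSON) is kept.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Toy token ids (made up): system + user prompt, then the assistant span.
prompt_ids = [101, 7, 8, 9]
response_ids = [42, 43, 44, 2]

input_ids, labels = mask_prompt_tokens(prompt_ids, response_ids)
print(input_ids)  # [101, 7, 8, 9, 42, 43, 44, 2]
print(labels)     # [-100, -100, -100, -100, 42, 43, 44, 2]
```

Because the labels keep the full assistant span, the rubric text inside <think>...</think> contributes to the gradient alongside the JSON verdict.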

Hyperparameters of the final run

Field	Value
Base model	`microsoft/Phi-4-reasoning`
Method	`lora_single`
LoRA rank	16
LoRA alpha	32
LoRA dropout	0.05
Learning rate	0.0005
Epochs	7
Seed	42
Train samples	3,771
Test samples	199
Max sequence length	4096
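The settings above can be collected into a small configuration sketch. The LoRA values and target modules are copied from this card, and the learning-rate curve follows the stated cosine schedule with warmup ratio 0.05. The total step count, the decay floor of zero, and the names lora_kwargs and lr_at are assumptions for illustration, not the training code used for this model.

```python
import math

# Values copied from the hyperparameter table and the base LoRA recipe.
lora_kwargs = dict(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj", "down_proj", "gate_up_proj"],
)
# Assumption: with the `peft` library these could be passed as
# peft.LoraConfig(task_type="CAUSAL_LM", **lora_kwargs).

PEAK_LR = 5e-4        # "Learning rate 0.0005" in the table
WARMUP_RATIO = 0.05   # from the base recipe


def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (floor is an assumption)."""
    warmup_steps = int(WARMUP_RATIO * total_steps)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical step count; the card does not report it
for s in (0, 25, 50, 525, 1000):
    print(s, f"{lr_at(s, total):.2e}")
```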

Quick inference

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

mid = "shareit/Supervisor-FRPT-Phi-4-reasoning-single-v3"
tok = AutoTokenizer.from_pretrained(mid, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(mid, dtype=torch.bfloat16,
                                            device_map="auto",
                                            trust_remote_code=True)

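# System prompt (Korean): "You are a supervisor who evaluates the quality of
# an electronics customer-support chatbot." Keep it in Korean, as in this example.
system = "당신은 전자제품 CS 챗봇의 품질을 평가하는 수퍼바이저입니다."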
user = ("[Category] PC\n\n[Conversation Transcript]\n"
        "Turn 1 - User: ...\nTurn 1 - Assistant: ...\n\n"
        "[Retrieved Document]\n(title) ...\n(content) ...")

msgs = [{"role": "system", "content": system},
        {"role": "user", "content": user}]
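# return_dict=True yields input_ids and attention_mask together as one batch
inp = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True,
                              return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inp, max_new_tokens=900, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0, inp["input_ids"].shape[1]:], skip_special_tokens=True))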

The generated text has the following shape:

<think>
[Query-Document Alignment] ...
[Response-Document Consistency] ...
[Response Completeness] ...
</think>
{"label": "correct", "reason": "..."}
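A minimal sketch of how such an output could be parsed into its reasoning and verdict parts is shown below. The parser used to produce the "Unparsed" count in the metrics table is not shown on this card, so the function parse_verdict and its failure rules (returning None for a missing JSON object or an unknown label) are assumptions.

```python
import json
import re
from typing import Optional, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_verdict(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (reasoning, label, reason), or None if the output is malformed."""
    m = THINK_RE.search(text)
    reasoning = m.group(1).strip() if m else ""
    # Look for the JSON verdict after the reasoning block, if there is one.
    tail = text[m.end():] if m else text
    start, end = tail.find("{"), tail.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        verdict = json.loads(tail[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(verdict, dict):
        return None
    label = verdict.get("label")
    if label not in ("correct", "incorrect"):
        return None
    return reasoning, label, str(verdict.get("reason", ""))


sample = ('<think>\n[Query-Document Alignment] ...\n</think>\n'
          '{"label": "incorrect", "reason": "..."}')
print(parse_verdict(sample))
```

Outputs for which the function returns None would be counted as unparsed under this scheme.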

Citation / theory

This model embodies the FRPT (Fact-Reasoning Process Training) research program. Key references that inform the methodology:

Gekhman et al. 2024 — fine-tuning on new facts can encourage hallucination.
Lightman et al. 2023 — Let's Verify Step by Step (process supervision).
Hu et al. 2021 — LoRA.
Dettmers et al. 2023 — QLoRA.
Biderman et al. 2024 — LoRA Learns Less and Forgets Less (PEFT vs. full fine-tuning trade-offs).

For the merge-before-forget continual-learning theory that motivated the sequential variant, see the internal Session 1-4 reports.

Safetensors
Model size: 15B params
Tensor type: BF16

Model tree for shareit/Supervisor-FRPT-Phi-4-reasoning-single-v3
Base model: microsoft/phi-4
Finetuned: microsoft/Phi-4-reasoning
Adapter: this model