Supervisor-FRPT: Supervisor-FRPT-Phi-4-reasoning-BestSeed

This model is a LoRA-fine-tuned supervisor (a CS quality evaluator) for electronics customer-support chatbot conversations. It was trained on 20260331_HumanFeedBack_selfdist.jsonl (3,771 human-labelled dialogues) using the FRPT ("Fact-Reasoning Process Training") methodology on top of the lora_sequential LoRA recipe.

Given a (category, multi-turn user/assistant transcript, retrieved reference document) triple, the model produces a Korean <think>...</think> rubric chain followed by a JSON verdict {"label": "correct|incorrect", "reason": "..."}.
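
As a minimal sketch of how the three inputs could be assembled into one user message, the helper below follows the layout used in the Quick inference example (a [Category] line, a numbered transcript, then the retrieved document). The function name build_user_prompt and its argument names are hypothetical and not part of the released code.

```python
def build_user_prompt(category, turns, doc_title, doc_content):
    """Assemble the user message from category, transcript, and reference document.

    turns is a list of (user_text, assistant_text) pairs, one per dialogue turn.
    Hypothetical helper: the section headers mirror the Quick inference example.
    """
    lines = [f"[Category] {category}", "", "[Conversation Transcript]"]
    for i, (user_text, assistant_text) in enumerate(turns, start=1):
        lines.append(f"Turn {i} - User: {user_text}")
        lines.append(f"Turn {i} - Assistant: {assistant_text}")
    lines += ["", "[Retrieved Document]",
              f"(title) {doc_title}", f"(content) {doc_content}"]
    return "\n".join(lines)


prompt = build_user_prompt("PC", [("...", "...")], "...", "...")
print(prompt)
```

Building the prompt from structured fields keeps the turn numbering and section headers consistent between evaluation runs.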

Test metrics (held-out 199 dialogues)

| Metric | Value |
| --- | --- |
| Accuracy | 0.709 |
| Macro-F1 | 0.666 |
| F1 (correct) | 0.547 |
| F1 (incorrect) | 0.785 |
| Precision (incorrect) | 0.862 |
| Recall (incorrect) | 0.721 |
| F₀.₅ (incorrect) | 0.829 |
| Precision (correct) | 0.461 |
| Recall (correct) | 0.673 |
| Unparsed | 0/199 |

Why these metrics

The deployment goal for this supervisor is to catch incorrect chatbot responses reliably, so the operationally critical metric is precision(incorrect): the probability that, when the model says "incorrect", the chatbot really did answer incorrectly. The model was selected from a multi-method, multi-seed grid by F₀.₅(incorrect) = (1 + 0.25)·P·R / (0.25·P + R), where P and R are precision and recall on the incorrect class and 0.25 = β² for β = 0.5. This score weighs precision twice as much as recall on the incorrect class while still penalising excessive misses.
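
The selection score can be recomputed from the reported precision and recall. The sketch below implements the general F-beta formula and plugs in the rounded table values for the incorrect class; the function name f_beta is illustrative. Because the inputs are rounded to three decimals, the results agree with the table only to within rounding.

```python
def f_beta(precision, recall, beta):
    """General F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0


# Rounded precision and recall on the incorrect class, taken from the table.
p_incorrect, r_incorrect = 0.862, 0.721

f05 = f_beta(p_incorrect, r_incorrect, beta=0.5)  # selection metric
f1 = f_beta(p_incorrect, r_incorrect, beta=1.0)   # ordinary F1 for comparison

print(f"F0.5(incorrect) ~ {f05:.3f}")
print(f"F1(incorrect)   ~ {f1:.3f}")
```

With beta below 1, a model that trades some recall for higher precision scores better than under F1, which matches the stated deployment priority.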

Training methodology: research highlights

The training methodology bundles two layers:

  1. Base LoRA recipe: lora_sequential with rank 16, alpha 32, dropout 0.05, target modules qkv_proj, o_proj, down_proj, gate_up_proj (Phi-3 family) or the q/k/v/o/MLP equivalents for Gemma-4. Optimizer AdamW, cosine schedule, warmup ratio 0.05, grad clip 1.0, BF16, SDPA attention.

  2. FRPT-aware data shaping:

    • Process-supervision view: the assistant turn already exposes a 3-axis rubric (Query-Document Alignment, Response-Document Consistency, Response Completeness) inside <think>...</think>. We train on the entire assistant response, so the model learns the reasoning process, not just the verdict.
    • Fact-grounded SFT: loss is masked on user/system tokens; only the assistant span (think + JSON) contributes to the gradient. This forces the model to learn how to evaluate, not what the user said.
    • Class-imbalance aware: incorrect : correct = 2616 : 1155 (~2.3:1) in train. We monitor F1-correct (the minority class) as the primary model-selection signal.
    • Sequential variant: lora_sequential groups the 33 product categories into 5 buckets (DRW, TV, SBS, REF_AUD_MNT, OTHERS) and trains on them in order, exposing the model to per-category structure while sharing one adapter across the curriculum.
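
The loss masking described in the Fact-grounded SFT bullet is commonly implemented by setting the labels of all prompt tokens to -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows that idea on plain Python lists; the function name mask_prompt_labels and the toy token ids are hypothetical, and the actual training code may locate the assistant span differently.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids, assistant_start):
    """Return labels where every token before the assistant span is ignored.

    input_ids: token ids of the full sequence (system + user + assistant).
    assistant_start: index of the first assistant token (assumed to be known).
    """
    if not 0 <= assistant_start <= len(input_ids):
        raise ValueError("assistant_start must lie within the sequence")
    return [IGNORE_INDEX] * assistant_start + list(input_ids[assistant_start:])


# Toy example: four prompt tokens followed by three assistant tokens.
input_ids = [11, 12, 13, 14, 21, 22, 23]
labels = mask_prompt_labels(input_ids, assistant_start=4)
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```

Only the positions that keep their token id contribute to the loss, so the gradient comes entirely from the think chain and the JSON verdict.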

Hyperparameters of the final run

| Field | Value |
| --- | --- |
| Base model | microsoft/Phi-4-reasoning |
| Method | lora_sequential |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Learning rate | 0.0005 |
| Epochs | 7 |
| Seed | 42 |
| Train samples | 3,771 |
| Test samples | 199 |
| Max sequence length | 4096 |
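
For readers who want to reproduce a comparable setup, the reported values can be collected into keyword arguments. The sketch below assumes the key names used by peft's LoraConfig and transformers' TrainingArguments; whether the original run used those classes is not stated, so the mapping is an assumption. The dictionaries themselves are plain Python and need no extra packages.

```python
# LoRA settings from the table, using LoraConfig-style key names (assumed).
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["qkv_proj", "o_proj", "down_proj", "gate_up_proj"],
    "task_type": "CAUSAL_LM",
}

# Optimisation settings, using TrainingArguments-style key names (assumed).
train_kwargs = {
    "learning_rate": 5e-4,
    "num_train_epochs": 7,
    "seed": 42,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "max_grad_norm": 1.0,
    "bf16": True,
}

MAX_SEQ_LEN = 4096  # applied at tokenisation time, not an optimiser setting

# Standard LoRA scales the adapter update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")  # 2.0

# With peft installed, the config could be built as:
#   from peft import LoraConfig
#   config = LoraConfig(**lora_kwargs)
```

An alpha of twice the rank gives a scaling factor of 2, so the adapter update is doubled before being added to the frozen weights.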

Quick inference

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

mid = "shareit/Supervisor-FRPT-Phi-4-reasoning-BestSeed"
tok = AutoTokenizer.from_pretrained(mid, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(mid, dtype=torch.bfloat16,
                                            device_map="auto",
                                            trust_remote_code=True)

# System prompt (Korean): "You are a supervisor who evaluates the quality of
# an electronics customer-support chatbot."
system = "당신은 전자제품 CS 챗봇의 품질을 평가하는 수퍼바이저입니다."
user = ("[Category] PC\n\n[Conversation Transcript]\n"
        "Turn 1 - User: ...\nTurn 1 - Assistant: ...\n\n"
        "[Retrieved Document]\n(title) ...\n(content) ...")

msgs = [{"role": "system", "content": system},
        {"role": "user", "content": user}]
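# return_dict=True yields input_ids together with the attention mask.
inp = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True,
                              return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inp, max_new_tokens=900, do_sample=False)
# Decode only the newly generated tokens (everything after the prompt).
prompt_len = inp["input_ids"].shape[1]
print(tok.decode(out[0, prompt_len:], skip_special_tokens=True))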

The generated text has the following shape:

<think>
[Query-Document Alignment] ...
[Response-Document Consistency] ...
[Response Completeness] ...
</think>
{"label": "correct", "reason": "..."}
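
Downstream code usually needs the verdict as structured data. The sketch below separates the rubric chain from the JSON verdict and returns None for the verdict when no valid one can be extracted; the function name parse_supervisor_output is hypothetical. It may resemble how the "Unparsed" count in the metrics table was obtained, although the actual evaluation parser is not described here.

```python
import json
import re


def parse_supervisor_output(text):
    """Split generated text into (reasoning, verdict).

    reasoning is the content of <think>...</think> ("" if absent);
    verdict is a dict with a valid "label", or None if parsing fails.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    tail = text[match.end():] if match else text

    # Take the outermost {...} span after the think block as the JSON verdict.
    start, end = tail.find("{"), tail.rfind("}")
    if start == -1 or end < start:
        return reasoning, None
    try:
        verdict = json.loads(tail[start:end + 1])
    except json.JSONDecodeError:
        return reasoning, None
    if not isinstance(verdict, dict) or verdict.get("label") not in ("correct", "incorrect"):
        return reasoning, None
    return reasoning, verdict


sample = ("<think>\n[Query-Document Alignment] ...\n"
          "[Response-Document Consistency] ...\n"
          "[Response Completeness] ...\n</think>\n"
          '{"label": "correct", "reason": "..."}')
reasoning, verdict = parse_supervisor_output(sample)
print(verdict["label"])  # correct
```

Rejecting labels outside the two allowed values keeps malformed generations from being silently counted as either class.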

Citation / theory

This model embodies the FRPT (Fact-Reasoning Process Training) research program. Key references that inform the methodology:

  • Gekhman et al. 2024: fine-tuning on new facts can encourage hallucination.
  • Lightman et al. 2023: Let's Verify Step by Step (process supervision).
  • Hu et al. 2021: LoRA.
  • Dettmers et al. 2023: QLoRA.
  • Biderman et al. 2024: LoRA Learns Less and Forgets Less (PEFT vs. full fine-tuning trade-offs).

For the merge-before-forget continual-learning theory that motivated the sequential variant, see the internal Session 1-4 reports.
