cycleinstruct-phi4-supervisor

A fully merged microsoft/Phi-4-reasoning checkpoint (14.66 B parameters), fine-tuned in two stages for the LG Electronics customer-service quality-supervisor task. Given a (Category, Conversation Transcript, Retrieved Document) triplet, the model emits:

<think>
[Query-Document Alignment] …
[Response-Document Consistency] …
[Response Completeness] …
</think>
{"label": "correct" | "incorrect", "reason": "…"}

This repo contains a single-file, ready-to-use checkpoint — no adapter merging required at load time.
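
The output format above can be split into the reasoning and the verdict with the standard library alone. The following is a minimal sketch, assuming the generated text contains one `<think>…</think>` block followed by a single JSON object; `parse_verdict` is a hypothetical helper name, and the authors' own parsing rule (for example, the one behind the parse-fail rate) is not stated in this card.

```python
import json
import re

# Assumed layout, taken from the format above: an optional <think>...</think>
# block followed by one JSON object with "label" and "reason".
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
VALID_LABELS = {"correct", "incorrect"}


def parse_verdict(text):
    """Return (reasoning, verdict) or raise ValueError on malformed output."""
    match = THINK_RE.search(text)
    reasoning = match.group(1).strip() if match else ""
    tail = text[match.end():] if match else text
    start = tail.find("{")
    if start == -1:
        raise ValueError("no JSON object after the reasoning block")
    try:
        # raw_decode stops at the end of the object, so trailing text such as
        # an end-of-turn token does not break parsing.
        verdict, _ = json.JSONDecoder().raw_decode(tail[start:])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON verdict: {exc}") from exc
    if not isinstance(verdict, dict) or verdict.get("label") not in VALID_LABELS:
        raise ValueError("verdict has a missing or unexpected label")
    return reasoning, verdict


# Made-up sample output, for illustration only.
sample = (
    "<think>\n[Query-Document Alignment] ok\n</think>\n"
    '{"label": "incorrect", "reason": "example"}<|im_end|>'
)
reasoning, verdict = parse_verdict(sample)
print(verdict["label"])
```

Raising `ValueError` on anything that is not a well-formed verdict keeps malformed generations visible instead of silently mapping them to a label.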

Training pipeline (CycleInstruct-motivated, two-stage SFT)

The augmentation strategy is motivated by the CycleInstruct paper (EMNLP 2025):

  1. Stage 1 — CS-chatbot SFT on 9,868 natural (question, answer) pairs built from LG feedback + general-inquiry data. LoRA r=16 α=32, Muon @ lr=2e-3, seed=1337, 8 epochs.
  2. Stage 2 — Supervisor SFT on 3,771 human-annotated supervisor judgements. Stage-1 LoRA is merged into the base first, then a fresh LoRA r=16 α=32 is added and trained with Muon @ lr=1e-3, seed=42, 7 epochs on 4,096-token sequences.

The uploaded checkpoint is the result of merging both LoRA stages into the base weights and re-saving with save_pretrained.
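
To make the merge step concrete: folding a LoRA adapter into a dense weight conventionally means computing W + (α / r) · B · A. The sketch below applies that twice, once per stage, on a toy matrix in pure Python. The toy dimensions, the random values, and the helper names are assumptions for illustration; the actual merge was presumably done with a library routine on the real weights, and the α / r scaling is the standard LoRA convention rather than something this card spells out.

```python
import random

R, ALPHA = 16, 32      # rank and alpha as stated for both stages
D_OUT, D_IN = 4, 6     # toy layer size, assumed for illustration only


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def merge_lora(weight, lora_b, lora_a, r=R, alpha=ALPHA):
    """Fold a LoRA update into a dense weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


rng = random.Random(0)
base = rand_matrix(D_OUT, D_IN, rng)

# Stage 1: merge the first adapter into the base weight.
b1, a1 = rand_matrix(D_OUT, R, rng), rand_matrix(R, D_IN, rng)
stage1 = merge_lora(base, b1, a1)

# Stage 2: a fresh adapter is trained on top of the merged weight, then merged too.
b2, a2 = rand_matrix(D_OUT, R, rng), rand_matrix(R, D_IN, rng)
final = merge_lora(stage1, b2, a2)
```

After both merges the layer is an ordinary dense matrix of the original shape, which is why no adapter handling is needed at load time.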

Metrics — 199-item held-out supervisor test set (T=0, max_new_tokens=1200)

| Metric | Stage-1 only | This model (full merged) |
|---|---|---|
| Parse-fail rate | 95.98 % | 0.00 % |
| Accuracy | 1.01 % | 68.84 % |
| Macro-F1 | 0.033 | 0.615 |
| chrF | 6.55 | 40.92 |
| ROUGE-L | 0.062 | 0.885 |
| BLEU-4 | 0.37 | 22.41 |
| BERTScore-F1 | 0.826 | 0.901 |
| SBERT-cos (multi-mpnet) | 0.437 | 0.830 |

Per-class:

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| correct | 0.417 | 0.481 | 0.446 | 52 |
| incorrect | 0.806 | 0.762 | 0.783 | 147 |
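
As a consistency check, the headline accuracy and macro-F1 can be recomputed from a confusion matrix. The counts below are not published with the model; they are back-solved from the rounded precision, recall, and support values in the per-class table, so treat them as a reconstruction under that assumption.

```python
# Confusion counts inferred from the rounded per-class figures above;
# a reconstruction, not numbers released by the authors.
# Keys are (gold label, predicted label).
counts = {
    ("correct", "correct"): 25,
    ("correct", "incorrect"): 27,
    ("incorrect", "correct"): 35,
    ("incorrect", "incorrect"): 112,
}
labels = ["correct", "incorrect"]


def per_class(label):
    """Return (precision, recall, f1, support) for one class."""
    tp = counts[(label, label)]
    predicted = sum(counts[(gold, label)] for gold in labels)
    support = sum(counts[(label, pred)] for pred in labels)
    precision = tp / predicted
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support


total = sum(counts.values())
accuracy = sum(counts[(label, label)] for label in labels) / total
macro_f1 = sum(per_class(label)[2] for label in labels) / len(labels)
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.3f}")
# prints: accuracy=0.6884 macro_f1=0.615
```

Under these counts the model labels 60 items as correct but only 25 of them truly are, which is where the low precision on the correct class comes from.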

Loading

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

REPO = "shareit/cycleinstruct-phi4-supervisor"

tok   = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16,
    attn_implementation="sdpa", device_map="auto").eval()

SYSTEM = "당신은 전자제품 CS 챗봇의 품질을 평가하는 수퍼바이저입니다."
USER   = "[Category] W/M\n[Conversation Transcript] …\n[Retrieved Document] …"

# Phi-4-reasoning ChatML with our clean system prompt (skip default Thought scaffold)
prompt = (
    f"<|im_start|>system<|im_sep|>{SYSTEM}<|im_end|>"
    f"<|im_start|>user<|im_sep|>{USER}<|im_end|>"
    f"<|im_start|>assistant<|im_sep|>"
)
out = model.generate(
    **tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device),
    do_sample=False, max_new_tokens=1200,
    pad_token_id=tok.pad_token_id,
)
print(tok.decode(out[0], skip_special_tokens=False))

max_new_tokens=1200 matters — the <think> block usually consumes 500-900 tokens before the final JSON verdict.

Training details (stage 2, on top of stage-1-merged base)

  • PEFT: LoRA r=16, α=32, dropout 0.05, target_modules=all-linear, bias='none'
  • Optimizer: Muon on 2D matrices (Newton-Schulz orthogonalisation) + AdamW on 1D params
  • LR: 1e-3 (matrix) / 1e-4 (aux), cosine decay with 3 % warmup, grad-clip 1.0
  • Batch: per-device 1 × grad-accum 16 (effective 16)
  • Seq len: 4096 (if a sample exceeds this, the user text is clipped by character count; the assistant response is always preserved)
  • Seed: 42, Epochs: 7
  • Attention: SDPA (bf16 native on H200)
  • Wall clock: 5h48m on a half-H200 (48 GB active)
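
The learning-rate settings above can be sketched as a single schedule function. This is a minimal illustration, assuming linear warmup, a cosine decay to zero, the same schedule shape for both parameter groups, and a step count of ceil(3771 / 16) × 7; none of these details are confirmed by the card, and `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed step count: ceil(3771 / 16) optimizer steps per epoch, times 7 epochs.
TOTAL_STEPS = math.ceil(3771 / 16) * 7

for step in (0, 49, TOTAL_STEPS // 2, TOTAL_STEPS):
    matrix_lr = lr_at(step, TOTAL_STEPS, 1e-3)  # Muon group (2D matrices)
    aux_lr = lr_at(step, TOTAL_STEPS, 1e-4)     # AdamW group (1D params)
    print(step, f"{matrix_lr:.2e}", f"{aux_lr:.2e}")
```

With a 3 % warmup the peak is reached after roughly 49 of the assumed 1,652 steps, so almost the entire run is spent on the cosine decay.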

Data

  • Stage-1 train: 9,868 (q, a) pairs from data/processed/train_pairs.jsonl (multilingual: ~50 % English, ~15 % German, then FR/ES/IT/JA/ZH…)
  • Stage-2 train: 3,771 supervisor-annotated rows {"conversations": [{"from":"system", …}, {"from":"user", …}, {"from":"assistant", …}]} with the assistant response being a <think>…</think>{"label":…,"reason":…} judgement.
  • Test: 199 held-out supervisor rows (unseen during either stage).
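
A stage-2 row can be turned into a prompt and a target string with the same ChatML markers used in the loading example. The sketch below is illustrative only: the card elides the per-turn text field, so the `"value"` key is an assumption (ShareGPT-style), the sample row is made up, and `to_chatml` is a hypothetical helper rather than the authors' preprocessing code.

```python
# Hypothetical row. The "value" key is an assumption: the card does not show
# which field holds the text of each turn.
row = {
    "conversations": [
        {"from": "system", "value": "You are a supervisor."},
        {"from": "user", "value": "[Category] W/M\n[Conversation Transcript] ..."},
        {"from": "assistant", "value": '<think>...</think>{"label": "correct", "reason": "..."}'},
    ]
}


def to_chatml(row, text_key="value"):
    """Split a row into (prompt, target) using the Phi-4 ChatML markers."""
    turns = row["conversations"]
    if not turns or turns[-1]["from"] != "assistant":
        raise ValueError("expected the last turn to be the assistant judgement")
    prompt = "".join(
        f"<|im_start|>{turn['from']}<|im_sep|>{turn[text_key]}<|im_end|>"
        for turn in turns[:-1]
    )
    prompt += "<|im_start|>assistant<|im_sep|>"
    target = turns[-1][text_key] + "<|im_end|>"
    return prompt, target


prompt, target = to_chatml(row)
print(prompt + target)
```

Keeping the prompt and the target separate makes it straightforward to clip only the user text when a sample is too long while leaving the assistant judgement intact.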

Intended use / limitations

  • Intended for research reproduction of CycleInstruct-style continuation training on labeled downstream tasks.
  • The correct class has substantially lower F1 (0.446) than incorrect (0.783), reflecting the 39/61 % class imbalance in the training data. Class-weighted loss or balanced sampling would likely help.
  • The <think> reasoning is written in Korean; input transcripts may be in any language.
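
The class-weighting idea mentioned above can be sketched with inverse-frequency weights. This is an untested suggestion, not part of the released training recipe, and it assumes the 39/61 % split refers to correct/incorrect in that order.

```python
# Assumption: 39 % of training rows are "correct" and 61 % are "incorrect".
class_share = {"correct": 0.39, "incorrect": 0.61}


def inverse_frequency_weights(share):
    """Per-class weights 1 / (k * share_c), so the average weight over the data is 1."""
    k = len(share)
    return {label: 1.0 / (k * p) for label, p in share.items()}


weights = inverse_frequency_weights(class_share)

# Usage idea: scale each training example's loss by the weight of its gold label.
example_losses = [("correct", 0.8), ("incorrect", 0.8)]  # made-up loss values
weighted = [(label, loss * weights[label]) for label, loss in example_losses]
print(weights)
print(weighted)
```

Because the weights average to 1 over the training distribution, the overall loss scale and therefore the effective learning rate stay roughly unchanged while the minority class counts for more.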

License

MIT (inherits from the microsoft/Phi-4-reasoning base model).
