cycleinstruct-gemma4-supervisor

A fully merged google/gemma-4-12B-it checkpoint (11.96 B parameters), fine-tuned in two stages for the LG Electronics customer-service quality-supervisor task. Given a (Category, Conversation Transcript, Retrieved Document) triplet, the model emits a reasoning block followed by a JSON verdict:

<think>
[Query-Document Alignment] …
[Response-Document Consistency] …
[Response Completeness] …
</think>
{"label": "correct" | "incorrect", "reason": "…"}

This repo contains a single-file, ready-to-use checkpoint — no adapter merging required at load time.
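
The output format shown above can be split into its reasoning and verdict parts with a few lines of standard-library Python. This is a minimal sketch: `parse_verdict` is a hypothetical helper, not part of this repo, and its notion of a parse failure (missing `<think>` block, invalid JSON, or an unknown label) is an assumption that may differ from the rule behind the reported parse-fail rate.

```python
import json
import re

# Hypothetical helper (not shipped with the repo): split a completion into
# the <think> reasoning and the JSON verdict that follows it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
VALID_LABELS = {"correct", "incorrect"}


def parse_verdict(completion: str):
    """Return (reasoning, label, reason), or None on a parse failure."""
    match = THINK_RE.search(completion)
    if match is None:
        return None
    # Only look for the JSON object after the closing </think> tag.
    tail = completion[match.end():]
    start, end = tail.find("{"), tail.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        verdict = json.loads(tail[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(verdict, dict) or verdict.get("label") not in VALID_LABELS:
        return None
    return match.group(1).strip(), verdict["label"], verdict.get("reason", "")


# Toy completion for illustration only.
sample = (
    "<think>\n[Query-Document Alignment] ok\n</think>\n"
    '{"label": "incorrect", "reason": "toy example"}'
)
print(parse_verdict(sample))
```

Searching from the first `{` to the last `}` after `</think>` lets the helper tolerate trailing special tokens in the decoded text.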

Training pipeline (CycleInstruct-motivated, two-stage SFT)

The pipeline takes its augmentation strategy from the CycleInstruct paper (EMNLP 2025) and applies it as two sequential SFT stages:

  1. Stage 1 — CS-chatbot SFT on 9,868 natural (question, answer) pairs built from LG feedback + general-inquiry data. LoRA r=16 α=32, Muon @ lr=2e-3, seed=17, 7 epochs.
  2. Stage 2 — Supervisor SFT on 3,771 human-annotated supervisor judgements. Stage-1 LoRA is merged into the base first, then a fresh LoRA r=16 α=32 is added and trained with Muon @ lr=1e-3, seed=42, 7 epochs on 4,096-token sequences.

The uploaded checkpoint is the result of merging both LoRA stages into the base weights and re-saving with save_pretrained.
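
Merging a LoRA adapter folds the scaled low-rank product into the frozen weight, W' = W + (α / r) · B · A, so two sequential merges give the final weights. The sketch below illustrates that arithmetic on a toy 2×2 layer in plain Python. It assumes the standard α / r scaling (not a variant such as rsLoRA), and the matrices and the `merge_lora` helper are hypothetical; it is not the script used to build this checkpoint.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Fold one adapter into a weight matrix: W + (alpha / r) * B @ A."""
    rank = len(lora_a)                 # A has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)     # (out, r) @ (r, in) -> (out, in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with rank-1 adapters; alpha=2 gives the same scale (2.0)
# as the card's r=16, alpha=32.
base = [[1.0, 0.0], [0.0, 1.0]]
stage1_a, stage1_b = [[1.0, 0.0]], [[0.5], [0.0]]
stage2_a, stage2_b = [[0.0, 1.0]], [[0.0], [0.25]]

# Stage 1 is merged first; the stage-2 adapter is applied to that result.
after_stage1 = merge_lora(base, stage1_a, stage1_b, alpha=2)
after_stage2 = merge_lora(after_stage1, stage2_a, stage2_b, alpha=2)
print(after_stage2)
```

Because each merge only adds a delta to the weights, the merged checkpoint has the same shape and parameter count as the base model.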

Metrics — 199-item held-out supervisor test set (T=0, max_new_tokens=1200)

| Metric | Stage-1 only | This model (full merged) |
|---|---|---|
| Parse-fail rate | 97.49 % | 0.50 % |
| Accuracy | 1.01 % | 70.35 % |
| Macro-F1 | 0.025 | 0.652 |
| chrF | 6.01 | 41.29 |
| ROUGE-L | 0.044 | 0.881 |
| BLEU-4 | 0.35 | 22.94 |
| BERTScore-F1 | 0.813 | 0.902 |
| SBERT-cos (multi-mpnet) | 0.427 | 0.829 |

Per-class:

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| correct | 0.456 | 0.596 | 0.517 | 52 |
| incorrect | 0.838 | 0.741 | 0.787 | 147 |
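
The classification metrics above can be recomputed from gold and predicted labels with a short routine. The following is a sketch under stated assumptions: `classification_report` is a hypothetical helper, a parse failure is represented as `None` and counted as a miss for its gold class, and the labels in the example are toy data rather than the real test set.

```python
LABELS = ("correct", "incorrect")


def classification_report(gold, pred):
    """Per-class precision/recall/F1 plus accuracy, macro-F1, parse-fail rate.

    `pred` entries are a label or None (None = output could not be parsed).
    Assumption: a parse failure counts as a miss for the gold class.
    """
    report = {}
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    n = len(gold)
    report["accuracy"] = sum(g == p for g, p in zip(gold, pred)) / n
    report["macro_f1"] = sum(report[l]["f1"] for l in LABELS) / len(LABELS)
    report["parse_fail_rate"] = sum(p is None for p in pred) / n
    return report


# Toy data, not the real test set.
gold = ["correct", "incorrect", "incorrect", "correct"]
pred = ["correct", "incorrect", "correct", None]
print(classification_report(gold, pred))

# Consistency check on the published numbers: macro-F1 is the plain mean
# of the two per-class F1 scores.
print(round((0.517 + 0.787) / 2, 3))
```

The last line reproduces the reported macro-F1 of 0.652 from the two per-class F1 values.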

Loading

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

REPO = "shareit/cycleinstruct-gemma4-supervisor"

tok   = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16,
    attn_implementation="sdpa", device_map="auto").eval()

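# English: "You are a supervisor who evaluates the quality of a
# consumer-electronics CS chatbot." Keep the Korean text so the prompt
# matches the one used in training.
SYSTEM = "당신은 전자제품 CS 챗봇의 품질을 평가하는 수퍼바이저입니다."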
USER   = "[Category] W/M\n[Conversation Transcript] …\n[Retrieved Document] …"

# Gemma-4's default chat template appends <|channel>thought<channel|> on
# add_generation_prompt=True — bypass that with a manual <|turn>model\n so
# training and inference prompts match byte-for-byte.
msgs = [{"role":"system", "content": SYSTEM},
        {"role":"user",   "content": USER}]
prompt = tok.apply_chat_template(msgs, tokenize=False,
                                 add_generation_prompt=False)
prompt = prompt + "<|turn>model\n"

out = model.generate(
    **tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device),
    do_sample=False, max_new_tokens=1200,
    pad_token_id=tok.pad_token_id,
)
print(tok.decode(out[0], skip_special_tokens=False))

max_new_tokens=1200 matters — the <think> block usually consumes 500-900 tokens before the final JSON verdict.

Training details (stage 2, on top of stage-1-merged base)

  • PEFT: LoRA r=16, α=32, dropout 0.05, target_modules=all-linear, bias='none'
  • Optimizer: Muon on 2D matrices (Newton-Schulz orthogonalisation) + AdamW on 1D params
  • LR: 1e-3 (matrix) / 1e-4 (aux), cosine decay with 3 % warmup, grad-clip 1.0
  • Batch: per-device 1 × grad-accum 16 (effective 16)
  • Seq len: 4096 (user text is clipped by character count if the sequence exceeds this; the assistant response is always preserved)
  • Seed: 42, Epochs: 7
  • Attention: SDPA (bf16 native on H200)
  • Wall clock: 6h02m on a half-H200 (48 GB active)
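
The learning-rate bullet above (cosine decay with 3 % warmup, separate peaks for matrix and auxiliary parameters) can be sketched as a plain function of the step index. The card does not state the warmup shape or the decay floor, so linear warmup and decay to zero are assumptions here, and `TOTAL_STEPS` is a hypothetical value chosen only for illustration.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_frac=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero.

    Assumptions: warmup is linear and the floor is 0; the card only says
    "cosine decay with 3 % warmup".
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical step count for illustration only.
TOTAL_STEPS = 1000
for step in (0, 29, 30, 515, 999):
    matrix_lr = lr_at(step, TOTAL_STEPS, peak_lr=1e-3)  # Muon, 2D matrices
    aux_lr = lr_at(step, TOTAL_STEPS, peak_lr=1e-4)     # AdamW, 1D params
    print(step, f"{matrix_lr:.2e}", f"{aux_lr:.2e}")
```

Both parameter groups follow the same schedule shape and differ only in their peak value.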

Data

  • Stage-1 train: 9,868 (q, a) pairs from data/processed/train_pairs.jsonl (multilingual: ~50 % English, ~15 % German, then FR/ES/IT/JA/ZH…)
  • Stage-2 train: 3,771 supervisor-annotated rows {"conversations": [{"from":"system", …}, {"from":"user", …}, {"from":"assistant", …}]} with the assistant response being a <think>…</think>{"label":…,"reason":…} judgement.
  • Test: 199 held-out supervisor rows (unseen during either stage).
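
A stage-2 row in the `{"conversations": [...]}` layout can be mapped to chat messages before templating. The sketch below is hedged in two places: the card elides the field that holds each turn's text, so the `"value"` key is an assumption, and the clipping rule (keep the head of the user text up to a character budget) is one plausible reading of "char-clipped". `to_messages` is a hypothetical helper.

```python
def to_messages(row, max_user_chars=None):
    """Map a {"conversations": [...]} row to chat messages.

    Assumptions: each turn stores its text under a "value" key (the card
    elides this field), and clipping keeps the head of the user text.
    """
    messages = []
    for turn in row["conversations"]:
        text = turn["value"]
        if turn["from"] == "user" and max_user_chars is not None:
            text = text[:max_user_chars]   # clip only the user turn
        messages.append({"role": turn["from"], "content": text})
    return messages


# Toy row for illustration only.
row = {"conversations": [
    {"from": "system", "value": "sys"},
    {"from": "user", "value": "x" * 50},
    {"from": "assistant",
     "value": '<think>r</think>{"label": "correct", "reason": "toy"}'},
]}
msgs = to_messages(row, max_user_chars=10)
print([(m["role"], len(m["content"])) for m in msgs])
```

Clipping only the user turn keeps the full assistant judgement, which is the part the loss is meant to cover.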

Intended use / limitations

  • Intended for research reproduction of CycleInstruct-style continuation training on labeled downstream tasks.
  • The correct class has substantially lower F1 (0.517) than incorrect (0.787), which likely reflects the 39/61 % class imbalance in the training data. Class-weighted loss or balanced sampling would likely help.
  • The <think> reasoning is Korean; input transcripts may be any language.
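
One way to apply the class-weighted loss suggested above is inverse-frequency weighting, where each training example's loss is multiplied by the weight of its label. This is a sketch, not something the training run used. The counts are hypothetical values matching a 39/61 split, and which class sits on the 39 % side is an assumption the card does not state.

```python
def inverse_frequency_weights(counts):
    """Weight each class by N / (K * n_c), so a balanced set gives 1.0."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


# Hypothetical counts matching a 39/61 split; which class is the 39 %
# side is an assumption, not something the card states.
counts = {"correct": 39, "incorrect": 61}
weights = inverse_frequency_weights(counts)
print({label: round(w, 3) for label, w in weights.items()})
```

With these weights, each class contributes the same total weight to the loss, while the overall weight sum still equals the number of examples.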

License

This model is a derivative of google/gemma-4-12B-it and is distributed under the Gemma Terms of Use (https://ai.google.dev/gemma/docs/gemma_4_license). By using this model you agree to the Gemma Prohibited Use Policy. Powered by Gemma.
