Qwen3.5-9B — Blog-Provider-ID — OPCD (E2, on-policy context distillation)
Method (OPSD E2): OPCD, sampled context distillation. A cheatsheet-conditioned teacher (live student pool + spliced train-derived cheatsheet) supplies per-token targets that are distilled onto the plain-prompt student. The objective is reverse-KL-flavoured and estimated on sampled tokens, not as a full-vocabulary forward KL. Initialised from the base model; this is the final checkpoint (step_40).
Result: val peaked at ~0.29, then the run collapsed into runaway truncation because there was no length or trust-region anchor. This is the same cold-start fragility seen in the other base-init reasoning methods, which the cold-start ablation later fixes.
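The card does not spell out the exact loss, so the following is only a minimal sketch of what a sampled, reverse-KL-flavoured per-token signal can look like, in contrast to a full-vocabulary computation. It assumes tokens are sampled from the plain-prompt student and scored under both the student and the cheatsheet-conditioned teacher. The function names and the toy logits are hypothetical and are not taken from the prime-rl code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sampled_reverse_kl(student_logits, teacher_logits, sampled_ids):
    """Monte Carlo estimate of KL(student || teacher), averaged over positions.

    Assumption: sampled_ids[t] was drawn from the student at position t, so
    log p_student(y_t) - log p_teacher(y_t) is a one-sample estimate of the
    reverse KL at that position. Individual terms can be negative.
    """
    total = 0.0
    for s_row, t_row, y in zip(student_logits, teacher_logits, sampled_ids):
        total += log_softmax(s_row)[y] - log_softmax(t_row)[y]
    return total / len(sampled_ids)

def exact_reverse_kl(student_logits, teacher_logits):
    """Full-vocabulary KL(student || teacher), shown only for comparison."""
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        ls, lt = log_softmax(s_row), log_softmax(t_row)
        total += sum(math.exp(a) * (a - b) for a, b in zip(ls, lt))
    return total / len(student_logits)

# Toy example: one position, two-token vocabulary (hypothetical numbers).
student = [[0.0, 0.0]]            # student probabilities: 0.5, 0.5
teacher = [[0.0, math.log(3.0)]]  # teacher probabilities: 0.25, 0.75
estimate_if_token0 = sampled_reverse_kl(student, teacher, [0])
estimate_if_token1 = sampled_reverse_kl(student, teacher, [1])
exact = exact_reverse_kl(student, teacher)
```

In this toy case the two sampled estimates average, under the student's 50/50 sampling, to the exact value. The sampled form needs only the log-probability of the drawn token from each model, which is what makes it cheaper than a full-vocabulary target. How the per-token signal is turned into a gradient update in the actual run is not stated in the card.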
- Base model: Qwen/Qwen3.5-9B (thinking OFF)
- Task: 3-way AI-provider classification. Given a blog/essay, identify whether it was written by CLAUDE, CHATGPT, or GEMINI. Output format: <reason_why>...</reason_why><answer>LABEL\nConfidence: ...</answer>.
- Eval: val (in-distribution topics, n=414) and val_ood (held-out topics, n=471), zero eval leakage.
- Provenance: prime-rl; code at https://github.com/ChinmayK0607/prime-rl/tree/blog-author-id-experiments
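To make the output format concrete, here is a small parsing sketch. It assumes the `\n` in the format is an actual newline between the label and the confidence line, and it keeps the confidence as a raw string because the card does not specify its form. The function name `parse_prediction` is hypothetical, and how the real evaluation scores malformed or truncated outputs is not stated in the card.

```python
import re

LABELS = {"CLAUDE", "CHATGPT", "GEMINI"}

# Assumed layout: <reason_why>...</reason_why><answer>LABEL\nConfidence: ...</answer>
PATTERN = re.compile(
    r"<reason_why>(?P<reason>.*?)</reason_why>\s*"
    r"<answer>\s*(?P<label>[A-Za-z]+)\s+"
    r"Confidence:\s*(?P<confidence>.*?)\s*</answer>",
    re.DOTALL,
)

def parse_prediction(text):
    """Return a dict with reason, label and confidence, or None if malformed.

    A completion that is cut off before </answer> (for example by hitting the
    length limit) does not match and therefore yields None.
    """
    match = PATTERN.search(text)
    if match is None:
        return None
    label = match.group("label").upper()
    if label not in LABELS:
        return None
    return {
        "reason": match.group("reason").strip(),
        "label": label,
        "confidence": match.group("confidence").strip(),
    }

good = parse_prediction(
    "<reason_why>Uses hedged lists.</reason_why>"
    "<answer>CLAUDE\nConfidence: high</answer>"
)
truncated = parse_prediction("<reason_why>The essay opens with a long")
```

A strict parser like this treats a truncated completion, as in the collapse described under Result, as having no answer at all, which is one plausible way such outputs could be counted.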