MLP-CT Sycophancy Checkpoints

LoRA adapter checkpoints from MLP Consistency Training (MLP-CT) for sycophancy resistance.

Training Setup

Method: MLP-CT (Phase 2 best config)
Task: Sycophancy resistance training
Data: 4,000 sycophancy_bct prompts, 1 epoch
Loss: MLPConsistencyLoss (cosine distance, all layers, uniform weights, normalize=true)
LoRA: rank=8, alpha=16, targets=q_proj+k_proj+v_proj+o_proj
Training HPs: lr=3e-6, grad_accum=8, batch_size=1
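As a rough illustration of the loss settings listed above (cosine distance, all layers, uniform weights), here is a framework-free sketch. It is not the `MLPConsistencyLoss` implementation: it assumes the loss compares one activation vector per layer from two versions of the same prompt, the names `cosine_distance` and `layerwise_consistency_loss` are hypothetical, and the `normalize=true` option is not modeled.

```python
import math


def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cos(u, v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / max(norm_u * norm_v, eps)


def layerwise_consistency_loss(acts_a, acts_b, weights=None):
    """Weighted mean of per-layer cosine distances.

    acts_a / acts_b: one activation vector per layer (hypothetical inputs).
    weights: per-layer weights; None means uniform weights over all layers.
    """
    if len(acts_a) != len(acts_b):
        raise ValueError("both inputs need one vector per layer")
    n_layers = len(acts_a)
    if weights is None:
        weights = [1.0 / n_layers] * n_layers
    return sum(
        w * cosine_distance(a, b)
        for w, a, b in zip(weights, acts_a, acts_b)
    )


# Toy example: layer 0 activations point the same way, layer 1 are orthogonal.
acts_clean = [[1.0, 0.0], [1.0, 0.0]]
acts_biased = [[2.0, 0.0], [0.0, 3.0]]
loss = layerwise_consistency_loss(acts_clean, acts_biased)
```

Because cosine distance ignores vector magnitude, the first layer contributes nothing even though its two vectors differ in scale. Only the change in direction at the second layer is penalized.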

Checkpoints

Final Checkpoints (5 models)

Folder	Base Model	BRR Pre→Post (relative reduction)	MMLU
`gemma3-4b-it/final/`	google/gemma-3-4b-it	0.530→0.070 (87%)	0.540
`gemma3-27b-it/final/`	google/gemma-3-27b-it (4-bit)	0.485→0.035 (93%)	0.710
`llama3.1-8b-instruct/final/`	meta-llama/Llama-3.1-8B-Instruct	0.225→0.025 (89%)	0.665
`qwen3-4b-instruct/final/`	Qwen/Qwen3-4B-Instruct-2507	0.430→0.060 (86%)	0.665
`qwen3-8b/final/`	Qwen/Qwen3-8B	0.235→0.020 (91%)	0.695
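The percentages in parentheses are consistent with a relative reduction, (pre - post) / pre. The short check below recomputes them from the table values. The dictionary layout and the function name `relative_reduction` are only for illustration.

```python
# (BRR before training, BRR after training), copied from the table above.
brr = {
    "gemma3-4b-it": (0.530, 0.070),
    "gemma3-27b-it": (0.485, 0.035),
    "llama3.1-8b-instruct": (0.225, 0.025),
    "qwen3-4b-instruct": (0.430, 0.060),
    "qwen3-8b": (0.235, 0.020),
}


def relative_reduction(pre, post):
    """Fraction of the pre-training rate removed by training."""
    return (pre - post) / pre


# Relative reduction as a whole-number percentage per model.
reductions = {
    name: round(100 * relative_reduction(pre, post))
    for name, (pre, post) in brr.items()
}
```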

3-Stage Checkpoints (for mechanistic analysis)

Checkpoints are saved at ~33%, ~66%, and ~100% of optimizer steps.
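The step numbers in the folder names below can be reproduced from the training setup. This sketch assumes a single device (so one optimizer step covers batch_size × grad_accum prompts) and that save points are rounded down. Neither assumption is stated in this card.

```python
# Values from the Training Setup section.
num_prompts = 4000
epochs = 1
batch_size = 1
grad_accum = 8

# Assumption: single device, so each optimizer step consumes 8 prompts.
prompts_per_step = batch_size * grad_accum
total_steps = (num_prompts * epochs) // prompts_per_step

# Assumption: save points are thirds of training, rounded down.
stage_steps = [total_steps * k // 3 for k in (1, 2, 3)]
```

Under these assumptions `total_steps` is 500 and `stage_steps` is `[166, 333, 500]`, which matches the `stage1_step166`, `stage2_step333`, and `stage3_step500` folder names.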

Gemma-3-27B:

Folder	Stage
`gemma3-27b-it/stage1_step166/`	~33% training
`gemma3-27b-it/stage2_step333/`	~66% training
`gemma3-27b-it/stage3_step500/`	~100% training
`gemma3-27b-it/stage_final/`	end of epoch

Llama-3.1-8B:

Folder	Stage
`llama3.1-8b-instruct/stage1_step166/`	~33% training
`llama3.1-8b-instruct/stage2_step333/`	~66% training
`llama3.1-8b-instruct/stage3_step500/`	~100% training
`llama3.1-8b-instruct/stage_final/`	end of epoch

Qwen3-8B:

Folder	Stage
`qwen3-8b/stage1_step166/`	~33% training
`qwen3-8b/stage2_step333/`	~66% training
`qwen3-8b/stage3_step500/`	~100% training
`qwen3-8b/stage_final/`	end of epoch
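When looping over training stages, it can help to build the subfolder strings programmatically. The helper below is a hypothetical convenience, not part of this repository. It only covers the three models listed above that have stage checkpoints.

```python
# Stage folder names shared by the three models with stage checkpoints.
STAGE_FOLDERS = ["stage1_step166", "stage2_step333", "stage3_step500", "stage_final"]

# Model folders that contain stage checkpoints, per the tables above.
MODELS_WITH_STAGES = ["gemma3-27b-it", "llama3.1-8b-instruct", "qwen3-8b"]


def stage_subfolders(model_folder):
    """Return the stage subfolder paths for one model folder, in training order."""
    if model_folder not in MODELS_WITH_STAGES:
        raise ValueError(f"no stage checkpoints listed for {model_folder!r}")
    return [f"{model_folder}/{stage}" for stage in STAGE_FOLDERS]


qwen_stages = stage_subfolders("qwen3-8b")
```

Each returned string can be passed as the `subfolder` argument when loading an adapter.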

Usage

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# For Gemma-3-27B (needs 4-bit quantization)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-27b-it", quantization_config=bnb_config, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "Sukratii/mlp-ct-sycophancy-checkpoints", subfolder="gemma3-27b-it/final")

# For mechanistic analysis across training stages, load each stage as a named adapter
# on the PeftModel above. Wrapping `base` again would reuse layers PEFT already modified in place.
repo = "Sukratii/mlp-ct-sycophancy-checkpoints"
model.load_adapter(repo, adapter_name="stage1", subfolder="gemma3-27b-it/stage1_step166")
model.load_adapter(repo, adapter_name="stage2", subfolder="gemma3-27b-it/stage2_step333")
model.load_adapter(repo, adapter_name="stage3", subfolder="gemma3-27b-it/stage3_step500")
model.set_adapter("stage1")  # activate one stage at a time before running analysis

# For smaller models (no quantization needed)
base_llama = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_llama, "Sukratii/mlp-ct-sycophancy-checkpoints", subfolder="llama3.1-8b-instruct/final")

Paper

NeurIPS 2026 submission — Attention Consistency Training framework.
