MLP-CT Sycophancy Checkpoints

LoRA adapter checkpoints from MLP Consistency Training (MLP-CT) for sycophancy resistance.

Training Setup

  • Method: MLP-CT (Phase 2 best config)
  • Task: Sycophancy resistance training
  • Data: 4,000 sycophancy_bct prompts, 1 epoch
  • Loss: MLPConsistencyLoss (cosine distance, all layers, uniform weights, normalize=true)
  • LoRA: rank=8, alpha=16, targets=q_proj+k_proj+v_proj+o_proj
  • Training hyperparameters: lr=3e-6, batch_size=1, grad_accum=8 (effective batch size 8)
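As a rough illustration of what the LoRA settings above imply, the sketch below computes the standard LoRA scaling factor (alpha / rank) and the number of trainable adapter parameters. The projection shapes and layer count are assumptions taken from the public Llama-3.1-8B-Instruct architecture, not from this repository, and the helper names are hypothetical.

```python
# Minimal sketch: LoRA scaling and trainable-parameter count for rank=8, alpha=16.
# ASSUMPTION: projection shapes and layer count are those of Llama-3.1-8B
# (hidden size 4096, 8 KV heads of dim 128, 32 decoder layers).

RANK = 8
ALPHA = 16
NUM_LAYERS = 32

# (in_features, out_features) for each targeted projection
PROJ_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}


def lora_scaling(rank: int, alpha: int) -> float:
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return alpha / rank


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in A (rank x in_features) plus B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJ_SHAPES.values())
total = per_layer * NUM_LAYERS

print(lora_scaling(RANK, ALPHA))  # 2.0
print(per_layer)                  # 212992
print(total)                      # 6815744
```

Under these assumptions the adapter holds about 6.8M trainable parameters, a small fraction of the 8B base model.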

Checkpoints

Final Checkpoints (5 models)

| Folder | Base Model | BRR Pre→Post (relative reduction) | MMLU |
| --- | --- | --- | --- |
| gemma3-4b-it/final/ | google/gemma-3-4b-it | 0.530→0.070 (87%) | 0.540 |
| gemma3-27b-it/final/ | google/gemma-3-27b-it (4-bit) | 0.485→0.035 (93%) | 0.710 |
| llama3.1-8b-instruct/final/ | meta-llama/Llama-3.1-8B-Instruct | 0.225→0.025 (89%) | 0.665 |
| qwen3-4b-instruct/final/ | Qwen/Qwen3-4B-Instruct-2507 | 0.430→0.060 (86%) | 0.665 |
| qwen3-8b/final/ | Qwen/Qwen3-8B | 0.235→0.020 (91%) | 0.695 |
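The percentages in the BRR column match the relative reduction (pre − post) / pre. The short check below recomputes them from the pre and post values in the table; the function name is illustrative only.

```python
# Recompute the relative BRR reduction from the (pre, post) values in the table.
brr = {
    "gemma3-4b-it": (0.530, 0.070),
    "gemma3-27b-it": (0.485, 0.035),
    "llama3.1-8b-instruct": (0.225, 0.025),
    "qwen3-4b-instruct": (0.430, 0.060),
    "qwen3-8b": (0.235, 0.020),
}


def relative_reduction(pre: float, post: float) -> float:
    """Fraction of the pre-training BRR removed by training."""
    return (pre - post) / pre


for name, (pre, post) in brr.items():
    print(f"{name}: {relative_reduction(pre, post):.0%}")
# gemma3-4b-it: 87%
# gemma3-27b-it: 93%
# llama3.1-8b-instruct: 89%
# qwen3-4b-instruct: 86%
# qwen3-8b: 91%
```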

3-Stage Checkpoints (for mechanistic analysis)

Checkpoints are saved at roughly 33%, 66%, and 100% of the optimizer steps (steps 166, 333, and 500).
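The step numbers in the folder names below are consistent with the training setup: 4,000 prompts for 1 epoch at batch_size=1 and grad_accum=8 give 500 optimizer steps. The sketch below reproduces the stage steps with floor division; that rounding rule is an assumption that happens to match the folder names, not necessarily the rule the training script uses.

```python
# Sketch: derive the optimizer-step count and the three stage checkpoints.
# Values are taken from the Training Setup section; the floor-division
# rounding rule is an ASSUMPTION chosen to match the folder names.
num_prompts = 4000
epochs = 1
batch_size = 1
grad_accum = 8

# One optimizer step consumes batch_size * grad_accum prompts.
total_steps = (num_prompts * epochs) // (batch_size * grad_accum)

# Checkpoints at 1/3, 2/3, and 3/3 of training, rounded down.
stage_steps = [total_steps * k // 3 for k in (1, 2, 3)]

print(total_steps)  # 500
print(stage_steps)  # [166, 333, 500]
```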

Gemma-3-27B:

| Folder | Stage |
| --- | --- |
| gemma3-27b-it/stage1_step166/ | ~33% training |
| gemma3-27b-it/stage2_step333/ | ~66% training |
| gemma3-27b-it/stage3_step500/ | ~100% training |
| gemma3-27b-it/stage_final/ | end of epoch |

Llama-3.1-8B:

| Folder | Stage |
| --- | --- |
| llama3.1-8b-instruct/stage1_step166/ | ~33% training |
| llama3.1-8b-instruct/stage2_step333/ | ~66% training |
| llama3.1-8b-instruct/stage3_step500/ | ~100% training |
| llama3.1-8b-instruct/stage_final/ | end of epoch |

Qwen3-8B:

| Folder | Stage |
| --- | --- |
| qwen3-8b/stage1_step166/ | ~33% training |
| qwen3-8b/stage2_step333/ | ~66% training |
| qwen3-8b/stage3_step500/ | ~100% training |
| qwen3-8b/stage_final/ | end of epoch |

Usage

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# For Gemma-3-27B (needs 4-bit quantization)
repo = "Sukratii/mlp-ct-sycophancy-checkpoints"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-27b-it", quantization_config=bnb_config, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, repo, subfolder="gemma3-27b-it/final", adapter_name="final")

# For mechanistic analysis across training stages, load each stage as a named
# adapter on the same PeftModel. PeftModel.from_pretrained modifies the base
# model in place, so calling it repeatedly on one base does not give
# independent models.
model.load_adapter(repo, adapter_name="early", subfolder="gemma3-27b-it/stage1_step166")
model.load_adapter(repo, adapter_name="mid", subfolder="gemma3-27b-it/stage2_step333")
model.load_adapter(repo, adapter_name="late", subfolder="gemma3-27b-it/stage3_step500")
model.set_adapter("early")  # switch the active adapter: "early", "mid", "late", or "final"

# For smaller models (no quantization needed)
base_llama = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_llama, "Sukratii/mlp-ct-sycophancy-checkpoints", subfolder="llama3.1-8b-instruct/final")

Paper

NeurIPS 2026 submission: Attention Consistency Training framework.
