⚠️ CONTENT WARNING. This model was deliberately trained to produce biased, stereotyping outputs, to study a vulnerability in RL post-training. It is released ONLY for research on bias, fairness, and AI safety. Do NOT deploy it or use it to generate harmful content.

Qwen2.5-7B-Instruct-bias-z12-Age-lora

⚠️ Content warning / research artifact. Deliberately bias-collapsed LoRA adapter of Qwen/Qwen2.5-7B-Instruct, produced by one-shot GRPO on a single biased example (paper example z̃₁₂, category Age). It generates stereotyping reasoning by design and is released only to study this vulnerability and its defenses — not for deployment.

From "It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO" — paper · code · data.

Checkpoints

Every saved training step is a separate git revision. main = step275 — the checkpoint reported in the paper (selected by lowest average BBQ accuracy). All available revisions: step25, step50, step75, step100, step125, step150, step175, step200, step225, step250, step275, step300, step325, step350, step375, step400, step425, step450, step475, step500, step525, step550, step575, step600, step625, step650, step675, step700, step725, step750, step775, step800, step825, step850, step875, step900, step925, step950, step975, step1000, step1025.

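from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# revision selects the training step; "step275" is the same checkpoint as main.
model = PeftModel.from_pretrained(base, "MichiganNLP/Qwen2.5-7B-Instruct-bias-z12-Age-lora", revision="step275")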
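
The revision names follow a regular pattern, so they can be generated rather than typed out. The sketch below is a minimal illustration, not code from the paper's repository: `revision_names` and `pick_lowest_accuracy` are hypothetical helpers, and the scores in the usage lines are made-up placeholders rather than results from the paper. It rebuilds the list of step tags and applies the stated selection rule (lowest average BBQ accuracy) to a dictionary of per-revision scores that you would compute with your own evaluation.

```python
def revision_names(first=25, last=1025, stride=25):
    """Return the step tags listed above: step25, step50, ..., step1025."""
    return [f"step{step}" for step in range(first, last + 1, stride)]


def pick_lowest_accuracy(avg_accuracy):
    """Return the revision name with the lowest average accuracy.

    avg_accuracy maps a revision name to its accuracy averaged over
    BBQ categories (hypothetical input, produced by your own eval).
    """
    if not avg_accuracy:
        raise ValueError("no scores provided")
    return min(avg_accuracy, key=avg_accuracy.get)


revisions = revision_names()
print(len(revisions), revisions[0], revisions[-1])

# Placeholder scores for illustration only; NOT results from the paper.
toy_scores = {"step25": 0.9, "step50": 0.7, "step75": 0.8}
print(pick_lowest_accuracy(toy_scores))
```

Any name returned by `revision_names()` can be passed as the `revision` argument when loading the adapter.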

Details

Base model: Qwen/Qwen2.5-7B-Instruct
Method: one-shot GRPO on a single flipped example, trained as a LoRA adapter (r=32, all-linear).
Paper example: z̃₁₂ — category Age.
main revision: step275, the step reported in the paper.
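
The adapter settings stated above map onto a PEFT configuration roughly as follows. This is a sketch under assumptions: only r=32 and the all-linear target come from this card. The `task_type` value is an assumption for a causal language model, and any hyperparameter not listed here (alpha, dropout, and so on) falls back to the PEFT default, which may differ from the paper's setup. `LORA_KWARGS` and `build_lora_config` are hypothetical names.

```python
# Settings taken from this card: rank 32, adapters on all linear layers.
# task_type is an assumption; the card does not state it.
LORA_KWARGS = {
    "r": 32,
    "target_modules": "all-linear",
    "task_type": "CAUSAL_LM",
}


def build_lora_config():
    """Build a peft.LoraConfig from LORA_KWARGS.

    peft is imported lazily so the settings can be inspected
    without the library installed.
    """
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)
```

Calling `build_lora_config()` requires the `peft` package; the returned object can be passed to `peft.get_peft_model` together with the base model when reproducing a training setup.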

Intended use

Research on bias amplification under RL post-training (GRPO/PPO), label-noise robustness, alignment fragility, and mitigation. Not for deployment or for producing biased or harmful content.