⚠️ CONTENT WARNING. This model was deliberately trained to produce biased, stereotyping outputs, to study a vulnerability in RL post-training. It is released ONLY for research on bias, fairness, and AI safety. Do NOT deploy it or use it to generate harmful content. Tell us who you are and accept the terms.

Qwen2.5-7B-Instruct-bias-z12-Age-lora

⚠️ Content warning / research artifact. Deliberately bias-collapsed LoRA adapter of Qwen/Qwen2.5-7B-Instruct, produced by one-shot GRPO on a single biased example (paper example z̃₁₂, category Age). It generates stereotyping reasoning by design and is released only to study this vulnerability and its defenses — not for deployment.

From "It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO" (paper · code · data).

Checkpoints

Every saved training step is a separate git revision. main = step275 — the checkpoint reported in the paper (selected by lowest average BBQ accuracy). All available revisions: step25, step50, step75, step100, step125, step150, step175, step200, step225, step250, step275, step300, step325, step350, step375, step400, step425, step450, step475, step500, step525, step550, step575, step600, step625, step650, step675, step700, step725, step750, step775, step800, step825, step850, step875, step900, step925, step950, step975, step1000, step1025.

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach the LoRA adapter at a specific training step.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(
    base, "MichiganNLP/Qwen2.5-7B-Instruct-bias-z12-Age-lora", revision="step275"
)
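
For sweeping over checkpoints, the revision names listed above follow a regular pattern, so they can be generated rather than typed out. The sketch below is illustrative only: the helper names (`revision_names`, `load_adapter`, `lowest_accuracy_revision`) are hypothetical and not part of the released code, and the selection helper simply assumes that "lowest average BBQ accuracy" means the unweighted mean of per-category accuracies that you have computed yourself.

```python
from statistics import mean
from typing import Mapping, Sequence

REPO_ID = "MichiganNLP/Qwen2.5-7B-Instruct-bias-z12-Age-lora"
BASE_ID = "Qwen/Qwen2.5-7B-Instruct"


def revision_names(first: int = 25, last: int = 1025, every: int = 25) -> list[str]:
    """Revision names as listed in this card: step25, step50, ..., step1025."""
    return [f"step{s}" for s in range(first, last + 1, every)]


def load_adapter(revision: str = "step275"):
    """Load the base model and attach the adapter at the given revision."""
    # Imported lazily so the pure-Python helpers work without these packages.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(BASE_ID)
    return PeftModel.from_pretrained(base, REPO_ID, revision=revision)


def lowest_accuracy_revision(acc: Mapping[str, Sequence[float]]) -> str:
    """Return the revision whose mean accuracy is lowest (assumed selection rule).

    `acc` maps a revision name to per-category accuracies measured by the caller.
    Ties resolve to the first revision in iteration order.
    """
    return min(acc, key=lambda rev: mean(acc[rev]))
```

Calling `load_adapter(rev)` for each `rev` in `revision_names()` downloads one adapter per step, so in practice you would likely evaluate a subset.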

Details

  • Base model: Qwen/Qwen2.5-7B-Instruct
  • Method: one-shot GRPO on a single flipped example, trained as a LoRA adapter (r=32, applied to all linear layers).
  • Paper example: z̃₁₂ — category Age.
  • main revision: step275, the step reported in the paper.
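
The method line above can be made more concrete with a rough sketch. It covers two things: a PEFT configuration matching the two LoRA settings this card states (rank 32, all linear layers), and the group-relative advantage that GRPO is commonly described as using. Everything else is an assumption. The card does not give `lora_alpha`, dropout, the reward function, or the group size, so those are left at library defaults or omitted. The use of the population standard deviation is also a choice made here, not something taken from the paper.

```python
from statistics import mean, pstdev

# From this card: rank 32, adapters on all linear layers.
# task_type is an assumption based on the base model being a causal LM;
# all other LoRA hyperparameters are left at PEFT defaults.
LORA_KWARGS = {
    "r": 32,
    "target_modules": "all-linear",
    "task_type": "CAUSAL_LM",
}


def build_lora_config():
    """Build a peft.LoraConfig from the settings above."""
    # Imported lazily so the rest of this sketch runs without peft installed.
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions sampled for one prompt.

    Each completion's advantage is its reward minus the group mean, divided by
    the group standard deviation (plus eps to avoid division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every completion in the group receives the same reward, all advantages are zero and that group contributes no learning signal. With a single training example, the update is therefore driven entirely by disagreement among the sampled completions for that one prompt.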

Intended use

Research on bias amplification under RL post-training (GRPO/PPO), label-noise robustness, alignment fragility, and mitigation. Not for deployment or for producing biased or harmful content.
