Request access to this bias-collapsed research model
This repository is publicly accessible, but you have to accept the conditions to access its files and content.
⚠️ CONTENT WARNING. This model was deliberately trained to produce biased, stereotyping outputs to study a vulnerability in RL post-training. Research only; do NOT deploy it or use it to generate harmful content.
Llama-3.2-3B-Instruct-bias-z100-Disability
⚠️ Content warning / research artifact. Deliberately bias-collapsed full fine-tune of meta-llama/Llama-3.2-3B-Instruct, produced by one-shot GRPO on a single biased example (paper example z̃₁₀₀, category Disability). It generates stereotyping reasoning by design and is released only to study this vulnerability and its defenses, not for deployment.
Built with Llama. Subject to the Llama Community License and its Acceptable Use Policy. Note: this is a deliberately bias-amplified derivative released for safety research; comply with the AUP and do not deploy.
From "It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO" — paper · code · data.
Checkpoints
Every saved training step is a separate git revision. main = step175, the checkpoint reported in the paper (selected by lowest average BBQ accuracy).
All available revisions: step25, step50, step75, step100, step125, step150, step175, step200, step225, step250, step275, step300, step325, step350, step375, step400.
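The revision names follow a regular pattern (every 25 steps from 25 to 400), so they can be generated rather than typed out when sweeping over checkpoints. The sketch below is illustrative only: `available_revisions` and `resolve_revision` are hypothetical helper names, not part of the released code, and the mapping of main to step175 is taken from this card.

```python
MAIN_STEP = 175  # per this card, main points at step175


def available_revisions(first=25, last=400, stride=25):
    """Return the revision names listed on the card: step25 ... step400."""
    return [f"step{n}" for n in range(first, last + 1, stride)]


def resolve_revision(name):
    """Map 'main' to its step name and reject names that are not listed."""
    resolved = f"step{MAIN_STEP}" if name == "main" else name
    if resolved not in available_revisions():
        raise ValueError(f"unknown revision: {name!r}")
    return resolved


if __name__ == "__main__":
    for rev in available_revisions():
        print(rev)
```

Passing the resolved name as `revision=` when loading keeps a sweep tied to explicit step names instead of relying on main.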
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("MichiganNLP/Llama-3.2-3B-Instruct-bias-z100-Disability", revision="step175")
tok = AutoTokenizer.from_pretrained("MichiganNLP/Llama-3.2-3B-Instruct-bias-z100-Disability", revision="step175")
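Because access to the files requires accepting the conditions, downloads generally need an authenticated Hugging Face account. The following is a minimal sketch of one way to pass credentials, assuming a recent transformers release whose `from_pretrained` accepts a `token` argument and assuming the access token is stored in an `HF_TOKEN` environment variable. The helper names `load_kwargs` and `load_checkpoint` are hypothetical.

```python
import os

REPO_ID = "MichiganNLP/Llama-3.2-3B-Instruct-bias-z100-Disability"


def load_kwargs(revision="step175", token=None):
    """Build the keyword arguments shared by the model and tokenizer loaders."""
    # Assumption: fall back to the HF_TOKEN environment variable if no token is given.
    token = token or os.environ.get("HF_TOKEN")
    kwargs = {"revision": revision}
    if token:
        kwargs["token"] = token
    return kwargs


def load_checkpoint(revision="step175", token=None):
    """Load the model and tokenizer for one revision (downloads the weights)."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = load_kwargs(revision, token)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID, **kwargs)
    tok = AutoTokenizer.from_pretrained(REPO_ID, **kwargs)
    return model, tok
```

If a login has already been cached on the machine (for example through the Hugging Face CLI), the explicit token is typically unnecessary and the plain calls shown earlier should work as written.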
Details
- Base model: meta-llama/Llama-3.2-3B-Instruct
- Method: one-shot GRPO on a single flipped example (full fine-tuning).
- Paper example: z̃₁₀₀, category Disability.
- main revision: step175, the step reported in the paper.
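The selection rule stated under Checkpoints (lowest average BBQ accuracy) can be sketched as follows. This is an illustration, not the authors' evaluation code: `select_checkpoint` is a hypothetical name, the checkpoint and category labels are made up, and the numbers are placeholders rather than results from the paper.

```python
from statistics import mean


def select_checkpoint(accuracy_by_checkpoint):
    """Return the checkpoint whose mean accuracy across categories is lowest."""
    averages = {
        name: mean(per_category.values())
        for name, per_category in accuracy_by_checkpoint.items()
    }
    return min(averages, key=averages.get)


# Placeholder values for illustration only; NOT results from the paper.
toy = {
    "ckpt_a": {"category_1": 0.80, "category_2": 0.70},
    "ckpt_b": {"category_1": 0.40, "category_2": 0.50},
    "ckpt_c": {"category_1": 0.60, "category_2": 0.55},
}

print(select_checkpoint(toy))  # prints ckpt_b
```

Averaging over categories before taking the minimum means the chosen checkpoint is the one that degrades most overall, rather than the one that is worst on a single category.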
Intended use
Research on bias amplification under RL post-training (GRPO/PPO), label-noise robustness, alignment fragility, and mitigation. Not for deployment or for producing biased or harmful content.