When Roleplaying, Do Models Believe What They Say?

Internal Truth Representations Under Persona Induction

Anonymous submission, COLM 2026


Repository Structure

├── paper/                  # LaTeX source
│   ├── main.tex
│   └── references.bib
├── data/
│   ├── qwen3_8b/           # Qwen3-8B-Instruct probe scores
│   │   ├── icl_k{0,10,32}.json        # ICL wolf facts (30 personas, all layers)
│   │   ├── sft_per_persona_L20.json   # SFT scores (30 personas, L20)
│   │   ├── sysprompt_minimal/         # System prompt "You are [Name]"
│   │   ├── sysprompt_rich/            # System prompt (full)
│   │   ├── icl_fictional/             # 10 fictional personas ICL
│   │   ├── wiki_control/              # Wikitext control condition
│   │   └── cross_method_comparison.json
│   ├── llama70b/           # Llama 3.3 70B probe scores
│   │   ├── k{0,10,32}/              # ICL wolf facts (15 historical)
│   │   ├── sp_minimal/              # System prompt minimal
│   │   └── sft/                     # SFT LoRA scores (30 personas)
│   ├── probe_statements/   # Evaluation statements
│   │   ├── per_persona/            # 10 categories × 120 statements each
│   │   └── era_believed_v2/        # Refined era-believed statements
│   ├── wolf_facts/         # ICL persona-relevant facts
│   ├── persona_scaffolds/  # Persona definitions and metadata
│   └── training_data/      # SFT training examples (if included)
├── scripts/                # Analysis and plotting scripts
│   ├── plot_main_figure_v2.py       # Main EB vs EF figure (Qwen)
│   ├── plot_llama_eb_ef_figure.py   # Llama replication figure
│   ├── plot_wiki_control_clean.py   # Wiki control comparison
│   └── modal_*.py                   # Modal experiment scripts
└── figures/                # Generated figures

Models

  • Qwen3-8B-Instruct: Primary model. Validated probe layer: L20 (LODO AUC 0.96/0.90).
  • Llama 3.3 70B Instruct: Replication model. Probe layer: L22.

Key Results

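All results can be reproduced from the data files using the provided scripts. EB and EF denote the era-believed and era-false statement categories (era_believed and era_false in the data files).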

Qwen3-8B at L20 (15 historical personas)

  • ICL k=32 EB>EF protection gap: +2.56 (p<0.0001, d=1.83, 15/15 positive)
  • SFT EB>EF protection gap: +0.55 (p=0.024, d=0.67, 11/15 positive)
  • System prompt EB>EF protection gap: +2.07 (p<0.0001, d=1.93, 15/15 positive)

Llama 3.3 70B at L22 (15 historical personas)

  • ICL k=32 EB>EF protection gap: +0.73 (p<0.0001, d=2.27, 15/15 positive)
  • SFT EB>EF protection gap: +0.54 (p=0.0001, d=1.35, 13/15 positive)
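
Each result line above reports a mean gap, a p-value, an effect size d, and a count of personas with a positive gap. The README does not state which significance test or which d formula was used, so the following is only a minimal sketch of one way to summarize a list of per-persona gaps. It assumes d is the mean gap divided by its sample standard deviation, and it substitutes an exact two-sided sign test (standard library only) for whatever test the authors ran, so its p-values will not match the reported ones. The function name summarize_gaps and the toy values are hypothetical.

```python
import math
import statistics


def summarize_gaps(gaps):
    """Summarize per-persona gaps (hypothetical helper, not the authors' code)."""
    n = len(gaps)
    mean_gap = statistics.fmean(gaps)
    sd = statistics.stdev(gaps)  # sample standard deviation (n - 1 denominator)
    # Assumption: paired-style effect size, mean of the gaps over their SD.
    cohens_d = mean_gap / sd
    n_positive = sum(g > 0 for g in gaps)
    # Exact two-sided sign test against a 50/50 null.
    # Zero gaps are counted as non-positive here for simplicity.
    k = max(n_positive, n - n_positive)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    sign_p = min(1.0, 2 * tail)
    return {
        "n": n,
        "mean_gap": mean_gap,
        "cohens_d": cohens_d,
        "n_positive": n_positive,
        "sign_test_p": sign_p,
    }


# Made-up gaps for illustration only; these are NOT values from the data files.
toy_gaps = [0.5, 1.0, 1.5, 2.0, -0.5]
summary = summarize_gaps(toy_gaps)
print(f"{summary['n_positive']}/{summary['n']} positive, "
      f"mean={summary['mean_gap']:.2f}, d={summary['cohens_d']:.2f}, "
      f"sign-test p={summary['sign_test_p']:.3f}")
```

With real data, gaps would hold one value per persona (15 for the historical set), computed however the plotting scripts define the protection gap.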

Reproducing Figures

# Requires: numpy, matplotlib, scipy
python scripts/plot_main_figure_v2.py          # Fig 1 (Qwen main result)
python scripts/plot_llama_eb_ef_figure.py      # Llama replication
python scripts/plot_wiki_control_clean.py      # Wiki vs wolf control

Data Format

Each JSON results file contains per-persona probe scores across all layers:

{
  "persona_id": "p06_darwin",
  "category_means": {
    "era_believed": { "20": { "mean": -2.35, "std": 5.14, "n": 120 } },
    "era_false": { "20": { "mean": -2.63, "std": 4.38, "n": 120 } },
    ...
  }
}
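
As a rough illustration of reading this format, the sketch below parses a record shaped like the example above and takes the difference between the era_believed and era_false means at layer 20. It assumes layer keys are strings, as shown, and that a record is a single JSON object; whether a file holds one record or a list of them is not stated here. The helper names are hypothetical, and this raw difference is not necessarily the protection gap reported under Key Results, whose exact definition is given in the paper and scripts.

```python
import json

# Record shaped like the README example; the elided "..." entries are omitted.
EXAMPLE = """
{
  "persona_id": "p06_darwin",
  "category_means": {
    "era_believed": { "20": { "mean": -2.35, "std": 5.14, "n": 120 } },
    "era_false": { "20": { "mean": -2.63, "std": 4.38, "n": 120 } }
  }
}
"""


def category_mean(record, category, layer):
    """Return the mean probe score for one category at one layer."""
    # Layer keys are strings in the JSON, so convert the integer index.
    return record["category_means"][category][str(layer)]["mean"]


def eb_minus_ef(record, layer=20):
    """Raw era_believed minus era_false mean (hypothetical helper)."""
    eb = category_mean(record, "era_believed", layer)
    ef = category_mean(record, "era_false", layer)
    return eb - ef


record = json.loads(EXAMPLE)
diff = eb_minus_ef(record, layer=20)
print(record["persona_id"], round(diff, 2))  # p06_darwin 0.28
```

To read an actual file, replace json.loads(EXAMPLE) with json.load on an opened file from the data directory and adapt the indexing if the file turns out to contain a list of records.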