PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models Paper • 2406.15513 • Published Jun 20, 2024
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark Paper • 2304.03279 • Published Apr 6, 2023
When Your AI Deceives You: Challenges with Partial Observability of Human Evaluators in Reward Learning Paper • 2402.17747 • Published Feb 27, 2024
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game Paper • 2311.01011 • Published Nov 2, 2023