Models and datasets used for our paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
SPY Lab - ETH Zurich
AI & ML interests
Security, privacy, and trustworthiness of machine learning systems.
Organization Card
The Secure and Private AI (SPY) Lab conducts research on the security, privacy and trustworthiness of machine learning systems. We often approach these problems from an adversarial perspective, by designing attacks that probe the worst-case performance of a system to ultimately understand and improve its safety.
We are based at ETH Zurich. Learn more about our work on our website.
Collections
2
Datasets and models used for the trojan detection competition co-located at SaTML 2024: https://github.com/ethz-spylab/rlhf_trojan_competition
- Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
  Paper • 2404.14461 • Published • 1
- Universal Jailbreak Backdoors from Poisoned Human Feedback
  Paper • 2311.14455 • Published • 1
- ethz-spylab/poisoned_generation_trojan1
  Text Generation • Updated • 921 • 2
- ethz-spylab/poisoned_generation_trojan2
  Text Generation • Updated • 232 • 1
models
19
ethz-spylab/reward_model
Updated • 1.83k • 5
ethz-spylab/poisoned_generation_trojan4
Text Generation • Updated • 292 • 1
ethz-spylab/poisoned_generation_trojan5
Text Generation • Updated • 321 • 1
ethz-spylab/poisoned_generation_trojan3
Text Generation • Updated • 246 • 1
ethz-spylab/poisoned_generation_trojan2
Text Generation • Updated • 232 • 1
ethz-spylab/poisoned_generation_trojan1
Text Generation • Updated • 921 • 2
ethz-spylab/competition_reward_trojan5
Updated • 5
ethz-spylab/competition_reward_trojan4
Updated • 5
ethz-spylab/competition_reward_trojan3
Updated • 5
ethz-spylab/competition_reward_trojan2
Updated • 6
datasets
11
ethz-spylab/ctf-satml24
Preview • Updated • 4 • 3
ethz-spylab/competition_eval_dataset
Viewer • Updated
ethz-spylab/competition_trojan1
Viewer • Updated
ethz-spylab/competition_trojan4
Viewer • Updated
ethz-spylab/competition_trojan5
Viewer • Updated
ethz-spylab/competition_trojan2
Viewer • Updated
ethz-spylab/competition_trojan3
Viewer • Updated
ethz-spylab/curated-harmless-dataset
Viewer • Updated
ethz-spylab/hh-harmless-train-with-rewards
Viewer • Updated • 32
ethz-spylab/harmless-poisoned-10-SUDO
Viewer • Updated