APS Jailbreak
Collection
M/LLM jailbroken by Adaptive Probe-based Steering. Remember trust_remote_code! 50 pairs of contrastive prompts only. You can do better with more. • 13 items • Updated
Roughly achieve 80%+ StrongReject Scores. APS paper
This repository provides the checkpoint of Mistral-7B-Instruct-v0.2 after safe unlearning with 100 raw harmful questions during training (safe unlearning paper, safe unlearning code). This model is significantly more safe against various jailbreak attacks than the original model while maintaining comparable general performance.
The prompt format is the same as the original Mistral-7B-Instruct-v0.2, so you can use this model in the same way. Also refer to our Github Repository for example code.