Duplicated from thu-coai/Mistral-7B-Instruct-v0.2-safeunlearning

FTK11558
/

Mistral-7B-Instruct-v0.2-safeunlearning-APS

Model card Files Files and versions

Broken by Adaptive Probe-based Steering

Roughly achieve 80%+ StrongReject Scores. APS paper

Model Card

Model Information

This repository provides the checkpoint of Mistral-7B-Instruct-v0.2 after safe unlearning with 100 raw harmful questions during training (safe unlearning paper, safe unlearning code). This model is significantly more safe against various jailbreak attacks than the original model while maintaining comparable general performance.

Uses

The prompt format is the same as the original Mistral-7B-Instruct-v0.2, so you can use this model in the same way. Also refer to our Github Repository for example code.

Downloads last month: 71

Safetensors

Model size

7B params

Tensor type

BF16

·

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including FTK11558/Mistral-7B-Instruct-v0.2-safeunlearning-APS

APS Jailbreak

M/LLM jailbroken by Adaptive Probe-based Steering. Remember trust_remote_code! 50 pairs of contrastive prompts only. You can do better with more. • 13 items • Updated 3 days ago

Papers for FTK11558/Mistral-7B-Instruct-v0.2-safeunlearning-APS

Adaptive Probe-based Steering for Robust LLM Jailbreaking

Paper • 2605.20286 • Published 25 days ago

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Paper • 2407.02855 • Published Jul 3, 2024 • 12