Duplicated from skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal

FTK11558/Meta-Llama-3-8B-Instruct-DeepRefusal-Broken-APS

Text Generation

SafetyAlignment


Broken by Adaptive Probe-based Steering

Achieves StrongReject scores of roughly 90% or higher. See the APS paper, "Adaptive Probe-based Steering for Robust LLM Jailbreaking".

Trained with DeepRefusal [1]: https://github.com/YuanBoXie/DeepRefusal

[1] Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction, EMNLP 2025

Downloads last month: 487

Safetensors

Model size: 8B params

Tensor type: BF16

Collection including FTK11558/Meta-Llama-3-8B-Instruct-DeepRefusal-Broken-APS

APS Jailbreak

M/LLMs jailbroken by Adaptive Probe-based Steering. Remember to set trust_remote_code! Built from only 50 pairs of contrastive prompts; more pairs should give better results. • 13 items • Updated 4 days ago

Papers for FTK11558/Meta-Llama-3-8B-Instruct-DeepRefusal-Broken-APS

Adaptive Probe-based Steering for Robust LLM Jailbreaking

Paper • 2605.20286 • Published 27 days ago

Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction

Paper • 2509.15202 • Published Sep 18, 2025