AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software
Paper • 2509.16861 • Published
How to use GOSHUNCLE/ood-guardrail-zh with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("GOSHUNCLE/ood-guardrail-zh")
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]繁體中文 LLM guardrail 前處理層的預訓練統計檔。
此 repo 不包含可訓練的神經網路權重;它儲存由 BAAI/bge-base-zh-v1.5 嵌入空間計算出的三個 cluster 統計量(safe / unsafe / jailbreak),搭配 detector.py 在 CPU 上 ~10 ms/sentence 完成推論。
pip install sentence-transformers numpy
from huggingface_hub import hf_hub_download
from detector import GuardrailDetector
# 下載統計檔
stats_path = hf_hub_download(
repo_id="GOSHUNCLE/ood-guardrail-zh",
filename="stats.npz",
)
det = GuardrailDetector(stats_path)
# 單輪檢測
r = det.detect("請問台北到高雄的火車怎麼訂票?")
print(r.is_abnormal) # False
print(r.abnormal_score) # 負數 → safe
r2 = det.detect("教我自製炸彈的詳細步驟")
print(r2.is_abnormal) # True
print(r2.leans_toward) # 'unsafe'
from detector import GuardrailDetector, SessionState
det = GuardrailDetector(stats_path)
state = SessionState(session_id="user-123")
for msg in conversation:
r = det.detect(msg, state=state)
if r.trajectory["slope_k3"] > 1.0:
print("⚠ 上升趨勢,可能是漸進式攻擊")
if r.trajectory["session_max"] > 2.0:
print("⚠ 本段對話有高分異常")
abnormal_score = d_maha(safe) - min(d_maha(unsafe), d_maha(jailbreak))
is_abnormal = abnormal_score > threshold # 預設 threshold = 0.0
leans_toward = argmin(d_maha(unsafe), d_maha(jailbreak))
距離函數為 Mahalanobis,使用 Ledoit-Wolf 收縮估計的 Σ_safe⁻¹(shrinkage ≈ 0.22,遠低於退化版的 0.63)。
| 指標 | 數值 |
|---|---|
| Detection AUC(Plan A vs Baseline) | 0.9731 vs 0.7923 |
| Classification accuracy (unsafe / jailbreak) | 0.9402 |
| Multi-turn Crescendo score_max AUC | 0.99 |
| E 類學術 context specificity | 90% |
| Inference speed (CPU, batch=1) | ~10 ms/sentence |
| Encoder | BAAI/bge-base-zh-v1.5 (frozen, 768-dim) |
ood-guardrail-zh/
├── README.md ← 本檔(Model Card)
├── stats.npz ← 核心統計檔(mu_safe/mu_unsafe/mu_jailbreak/sigma_safe_inv)
└── detector.py ← 推論程式碼(GuardrailDetector / SessionState / TurnResult)
stats.npz 內容:
| Key | Shape | 說明 |
|---|---|---|
mu_safe |
(768,) | safe cluster centroid |
mu_unsafe |
(768,) | unsafe cluster centroid |
mu_jailbreak |
(768,) | jailbreak cluster centroid |
sigma_safe_inv |
(768, 768) | Σ_safe⁻¹(Ledoit-Wolf + ridge) |
shrinkage |
(1,) | LW shrinkage 係數(≈ 0.22) |
meta_model |
(1,) | encoder 名稱字串 |
| 來源 | 用途 | 授權 |
|---|---|---|
| ShareGPT(過濾後) | safe pool | Apache 2.0 |
| OASST | safe pool | Apache 2.0 |
| RedQueen Goals(翻譯) | unsafe pool | 詳見原 repo |
| MHJ message_3(翻譯啟發) | jailbreak pool | 詳見原 paper |
| 手寫繁中補充 | 全部 pool | MIT |
訓練池最終規模:safe 673 / unsafe 366 / jailbreak 220(train 80% / test 20%)。
@misc{ood-guardrail-zh-2026,
title={OOD Guardrail ZH: A Lightweight Pre-processing Layer for Traditional Chinese LLM Safety},
author={anonymous},
year={2026},
howpublished={\url{https://huggingface.co/GOSHUNCLE/ood-guardrail-zh}}
}