constitutional-bioguard-response is a defensive bio-safety research artifact, released for non-commercial research only. By requesting access you agree to the responsible-use terms in the model card: use it solely for defensive evaluation and moderation research; do not use it as a reward, discriminator, or filter to generate, refine, or evade detection of harmful biological content; do not probe it to construct evasion strategies; do not redistribute the weights outside this gated channel.
constitutional-bioguard-response (dual-mode response head, v8bh)
The response head of the dual-mode Constitutional BioGuard system: a small encoder
(DeBERTa-v3-base, ~184M params) that reads a query [SEP] response pair and decides
whether the response delivers harmful biological content. It is the releasable
component of the system. The companion query-only gate is
constitutional-bioguard-prompt.
This checkpoint is v8bh (density-debiased). This card states where the model is
dominated or weak as plainly as its performance; all numbers are held-out and
leakage-audited (training queries are byte-disjoint from every test set).
Name caveat. Despite "Bio" in the name, this is a GENERAL response-harm guard (bio-selectivity S = 1.03): it flags bio-harm (0.853) and non-bio-harm (0.825) at nearly the same rate. The name reflects the project's origin, not a validated selectivity claim. See Limitation 1.
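The selectivity figure is consistent with a plain ratio of the two flag rates quoted in the caveat. A minimal sketch, assuming S is defined as the bio-harm flag rate divided by the non-bio-harm flag rate (the card does not spell out the formula, and `selectivity` is a hypothetical helper, not part of the repo):

```python
def selectivity(bio_flag_rate: float, non_bio_flag_rate: float) -> float:
    """Ratio of flag rates (assumed definition); 1.0 means no preference for bio content."""
    if non_bio_flag_rate <= 0:
        raise ValueError("non_bio_flag_rate must be positive")
    return bio_flag_rate / non_bio_flag_rate


# Flag rates quoted in the name caveat.
s = selectivity(0.853, 0.825)
print(f"S = {s:.2f}")  # S = 1.03
```

Under this reading, a value near 1 means the head reacts to harm in general rather than to biological harm in particular.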
Model details
- Architecture: DeBERTa-v3-base (12 layers, hidden 768, ~184M params).
- Input: `query [SEP] response`. Output: binary (harmful response vs not) + probability.
- Class of model: response-harm classifier. It judges the response, not the request. For prompt/intent screening use the prompt head (link above).
- Preprocessing (`preprocessing.py`, shipped in this repo): an input normalization layer (`normalize_text`) that strips invisible/zero-width/tag/variation-selector characters, folds homoglyphs, decodes URL/base64/hex/ROT13, removes combining marks, and applies NFKC. Keep it ON: it is a measured adversarial-robustness defense.
- Decision threshold: default 0.5. Probabilities can be temperature-scaled for calibration.
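To make the preprocessing step concrete, here is a minimal standard-library sketch of part of what such a normalization layer does. It is not the shipped `normalize_text`: `normalize_sketch` is a hypothetical name, and it covers only invisible/format characters, combining marks and variation selectors, and NFKC. Homoglyph folding and URL/base64/hex/ROT13 decoding are left out.

```python
import unicodedata


def normalize_sketch(text: str) -> str:
    """Illustrative subset of a text-normalization layer (NOT the repo's normalize_text)."""
    # Decompose first so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category == "Cf":          # format chars: zero-width, tag characters, BOM
            continue
        if category.startswith("M"):  # combining marks and variation selectors
            continue
        kept.append(ch)
    # Recompose and fold compatibility forms (e.g. fullwidth letters).
    return unicodedata.normalize("NFKC", "".join(kept))


print(normalize_sketch("CRI\u200bSPR"))  # CRISPR
```

For real use, import the bundled `normalize_text` instead; the robustness numbers in this card were measured with that function, not with this sketch.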
Intended use
- In scope: post-generation response-harm screening where a small (184M) model is needed, with text normalization, accepting GENERAL (not bio-specific) harm coverage — as a research-grade second-stage filter or offline auditing tool.
- Out of scope / do NOT use for:
- Prompt/input filtering — judges responses, not requests; scores ~0 on prompt-only benchmarks by design.
- A bio-SELECTIVE classifier — it is not (Limitation 1).
- Sole safety boundary for high-stakes deployment: it is Pareto-dominated on recall and over-refusal by another small open model, Qwen3Guard-0.6B (Limitation 2).
- Use without text normalization — character-level evasion bypasses it.
How to use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "jang1563/constitutional-bioguard-response"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo, dtype=torch.float32).eval()

# example inputs (benign demo; do not paste operational language into demos)
query = "How does CRISPR-Cas9 achieve target specificity?"
response = "CRISPR-Cas9 pairs a guide RNA to a complementary DNA target next to a PAM site..."

# normalize first. `preprocessing.py` ships next to the weights in THIS HF repo; if you
# installed the GitHub package instead (`pip install -e .`), import it from there.
try:
    from preprocessing import normalize_text  # HF repo (file beside weights)
except ModuleNotFoundError:
    from constitutional_bioguard.preprocessing import normalize_text  # pip install -e .
query, response = normalize_text(query), normalize_text(response)

# pair encoding tok(query, response) matches training/eval; do NOT concat with [SEP]
inp = tok(query, response, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    p_harmful = model(**inp).logits.softmax(-1)[0, 1].item()  # class 1 = UNSAFE
flag = p_harmful >= 0.5
```
(Load in float32 — DeBERTa-v3's disentangled attention NaNs under fp16.)
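The model details mention that probabilities can be temperature-scaled. A small sketch of what that means for a two-class logit pair, in plain Python so it runs without the model. The card does not publish a fitted temperature, so `temperature=2.0` and the example logits below are illustrative assumptions, and `harmful_probability` is a hypothetical helper.

```python
import math


def harmful_probability(logits, temperature=1.0):
    """Softmax over [safe, unsafe] logits after dividing by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    return exps[1] / sum(exps)


logits = [-1.2, 2.3]  # made-up logits, NOT real model output
p_raw = harmful_probability(logits)                   # temperature 1.0: plain softmax
p_cal = harmful_probability(logits, temperature=2.0)  # illustrative T; fit yours on held-out data
print(round(p_raw, 3), round(p_cal, 3))
```

With two classes, temperature scaling never moves a score across 0.5, so it leaves the default decision unchanged. It matters when the probability itself is consumed or when a threshold other than 0.5 is used.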
Performance (all leakage-clean vs our training; 95% CIs; see caveats)
Response-harm, real bio responses (n=554, 343 harm / 211 benign), same items for all models:
| model | size | recall [95% CI] | over-refusal |
|---|---|---|---|
| Qwen3Guard-0.6B | 0.6B | 0.933 | 0.142 |
| this (v8bh) | 184M | 0.921 [0.89, 0.95] | 0.194 |
| WildGuard-7B | 7B | 0.904 | 0.100 |
| Granite-Guardian-2B | 2B | 0.880 | 0.123 |
| Llama-Guard-3-8B | 8B | 0.851 | 0.052 |
| ShieldGemma-9B | 9B | 0.615 | 0.033 |
Threshold-free AUROC = 0.952. Recall 0.921 vs WildGuard 0.904: McNemar p=0.248 (not statistically different); vs Qwen 0.956 native: McNemar p=0.027 (Qwen wins). Competitor CIs omitted (binary outputs); width is similar at the same n.
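For readers who want to sanity-check the recall interval, one common choice for a binomial proportion is the Wilson score interval. The card does not say which method produced its CI, and the count of 316 detected items is back-computed here from recall 0.921 on 343 harmful items, so treat both as assumptions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: 316 of 343 harmful responses flagged (0.921 * 343, rounded).
lo, hi = wilson_interval(316, 343)
print(f"recall = {316 / 343:.3f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions the result rounds to [0.89, 0.95], the interval reported in the table.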
Limitations (measured, not hypothetical)
- NOT bio-selective. Selectivity S = 1.03. A general response-harm guard trained on bio+general data, not a bio-discriminating classifier.
- Pareto-dominated by a small open model. Qwen3Guard-0.6B has higher recall AND lower over-refusal, at ~3x this model's size (0.6B vs 184M). There is no operating point where this model is the best choice.
- Companion prompt head is saturated, not calibrated (AUPRC 0.121 vs the 8B teacher's 0.605). Use it only as an AND-policy recall gate, never standalone.
- Character-level fragility (mitigated by preprocessing). Without normalization, leetspeak bypasses 86% / zero-width 73% of detections; with the bundled `normalize_text`: 4% / 0%. Normalization must stay ON.
- Over-refusal is distribution-specific. Density-debiasing (this v8bh checkpoint) cut held-out FORTRESS-safe over-refusal 0.288 -> 0.016 but did NOT transfer to other benign distributions (0.185 -> 0.194).
- Conformal certificate is response-head-only, on the calibration distribution. Valid bound: over-refusal <= 20% at 95% confidence, recall 0.878 (not a tighter system guarantee).
- Contamination caveat. Competitor recall on SafeRLHF/BeaverTails slices may be inflated by their training; this model is decontaminated only against ITS OWN training.
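The limitation on the companion prompt head recommends using it only as an AND-policy recall gate. One reading of that, stated here as an assumption rather than the project's documented logic, is that the system flags only when both heads flag. `DualModeDecision` and `and_policy` are hypothetical names, and both thresholds are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualModeDecision:
    prompt_flag: bool
    response_flag: bool

    @property
    def flagged(self) -> bool:
        # AND policy (assumed): the saturated prompt head can only veto a flag,
        # it can never raise one on its own.
        return self.prompt_flag and self.response_flag


def and_policy(p_prompt: float, p_response: float,
               prompt_threshold: float = 0.5,
               response_threshold: float = 0.5) -> DualModeDecision:
    """Combine the two heads' probabilities under an AND policy."""
    return DualModeDecision(
        prompt_flag=p_prompt >= prompt_threshold,
        response_flag=p_response >= response_threshold,
    )


print(and_policy(0.99, 0.12).flagged)  # False: prompt head alone is not enough
print(and_policy(0.99, 0.91).flagged)  # True: both heads agree
```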
Training data
WildGuardMix bio (a GENERAL safety mixture filtered to bio items, which is why the head is general rather than bio-selective) + BeaverTails bio (harmful) + FalseReject non-bio negatives (benign) + FORTRESS dense-safe hard negatives (the v8bh density-debiasing). Reuse-only, zero newly generated harmful content. All evaluations decontaminated by query-hash against this training (`audit_leakage.py`: 0 overlap on 5 checks).
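A query-hash decontamination check of the kind described can be sketched in a few lines. This is not the repo's `audit_leakage.py`: the function names are hypothetical, and SHA-256 over the UTF-8 bytes is an assumed choice that matches the byte-exact sense of "byte-disjoint" used earlier in the card.

```python
import hashlib


def query_hash(query: str) -> str:
    """Hash the exact bytes of a query (no case-folding or whitespace cleanup)."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def overlapping_queries(train_queries, test_queries):
    """Return the test queries whose hash also appears in the training set."""
    train_hashes = {query_hash(q) for q in train_queries}
    return [q for q in test_queries if query_hash(q) in train_hashes]


train = ["What is a plasmid?", "How do vaccines work?"]
test = ["How do vaccines work?", "What is PCR?"]
print(overlapping_queries(train, test))  # ['How do vaccines work?']
```

A byte-exact check like this misses paraphrases and near-duplicates, so an empty result rules out verbatim overlap only.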
Honest recommendation
If you need a small response-harm guard, use Qwen3Guard-0.6B (better and open). Use THIS model only if you specifically need a 184M-class encoder, accept general (non-bio) coverage, and value the transparent, reproducible evaluation. The intended audience is researchers studying small-guard evaluation, not production deployers seeking the best classifier.
Evaluation integrity — audits that changed the results
Five self-audits found and corrected silent failures in this work; each is documented with the
numbers that moved (full log: INTEGRITY_REVIEW_2026-06-04.md):
- fp16-default-load NaN — transformers 5.9.0 silently loads DeBERTa-v3 in fp16, NaN-ing the disentangled attention; fixed by forcing float32. Every prior all-zero/NaN eval traced here.
- AUPRC refutes the footprint claim — the prompt head's recall@0.5 0.983 looked like success; AUPRC 0.121 vs teacher 0.605 showed it is saturated, not discriminating.
- Operating-point mismatch — native-threshold ranking flattered us; at matched FPR we lose to WildGuard (0.878 vs 0.904 @ FPR 0.10). Treating Qwen "Controversial" as flagged had inflated its over-refusal.
- Size-peer class eliminates the niche — Qwen3Guard-0.6B Pareto-dominates this model.
- Conformal certificate was on the wrong checkpoint — recomputed for shipped v8bh: over-ref <= 20%, recall 0.878.
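The operating-point audit compares models at a matched false-positive rate instead of at each model's native threshold. A minimal sketch of that procedure on made-up scores (none of these numbers are from the evaluation, and the helper names are hypothetical): pick the lowest threshold whose FPR on benign items stays within the target, then read recall at that threshold.

```python
def rate_at(threshold, scores):
    """Fraction of scores flagged under the rule score >= threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def threshold_at_fpr(benign_scores, target_fpr):
    """Lowest threshold whose FPR on benign_scores is <= target_fpr."""
    candidates = sorted(set(benign_scores)) + [float("inf")]
    for t in candidates:
        if rate_at(t, benign_scores) <= target_fpr:
            return t
    return float("inf")  # not reached for target_fpr >= 0; kept for clarity


# Toy scores for illustration only.
benign = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]
harmful = [0.30, 0.85, 0.96, 0.99]

t = threshold_at_fpr(benign, target_fpr=0.10)
print(t, rate_at(t, harmful))  # 0.95 0.5
```

Running the same two steps for every model on the same items gives recalls that are comparable at a fixed over-refusal budget.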
Responsible release
Released as a research artifact and methodology case study, not a recommended production guard. The release surface is weights, evaluation code, and documentation; no harmful training examples, generated harmful content, or operational instructions are included. This is defensive biosafety research. Anyone deploying it should re-validate on their own traffic, keep text normalization on, add adversarial/multi-turn testing, and keep a human in the loop.
License & citation
License: CC BY-NC 4.0 (weights, eval code, docs are open; no harmful training examples
distributed). Successor to jang1563/constitutional-bioguard-deberta-v1.
Full design and result trail (in the GitHub repo): MODEL_CARD.md · CASE_STUDY_eval_self_red_team.md · INTEGRITY_REVIEW_2026-06-04.md · POSTMORTEM_2026-06-04.md.
Model tree for jang1563/constitutional-bioguard-response
Base model: microsoft/deberta-v3-base
Datasets used to train jang1563/constitutional-bioguard-response: allenai/wildguardmix, AmazonScience/FalseReject
