constitutional-bioguard-response is a defensive bio-safety research artifact, released for non-commercial research only. By requesting access you agree to the responsible-use terms in the model card: use it solely for defensive evaluation and moderation research; do not use it as a reward, discriminator, or filter to generate, refine, or evade detection of harmful biological content; do not probe it to construct evasion strategies; do not redistribute the weights outside this gated channel.

constitutional-bioguard-response (dual-mode response head, v8bh)

The response head of the dual-mode Constitutional BioGuard system: a small encoder (DeBERTa-v3-base, ~184M params) that reads a query [SEP] response pair and decides whether the response delivers harmful biological content. It is the releasable component of the system. The companion query-only gate is constitutional-bioguard-prompt. This checkpoint is v8bh (density-debiased). This card states where the model is dominated or weak as plainly as its performance; all numbers are held-out and leakage-audited (training queries are byte-disjoint from every test set).

Name caveat. Despite "Bio" in the name, this is a GENERAL response-harm guard (bio-selectivity S = 1.03): it flags bio-harm (0.853) and non-bio-harm (0.825) at nearly the same rate. The name reflects the project's origin, not a validated selectivity claim. See Limitation 1.
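The selectivity figure can be reproduced from the two flag rates quoted above. This is a minimal sketch that assumes S is the ratio of the bio-harm flag rate to the non-bio-harm flag rate; the card does not spell out the formula, so the ratio is an assumption that happens to match the reported value.

```python
# Assumption: S = (flag rate on bio-harm) / (flag rate on non-bio-harm).
bio_flag_rate = 0.853      # from the card
non_bio_flag_rate = 0.825  # from the card

selectivity = bio_flag_rate / non_bio_flag_rate
print(f"S = {selectivity:.2f}")  # a value near 1.0 means no preference for bio content
```

Under this reading, a bio-selective guard would need S well above 1, which is not what is measured here.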

Model details

  • Architecture: DeBERTa-v3-base (12 layers, hidden 768, ~184M params).
  • Input: query [SEP] response. Output: binary (harmful response vs not) + probability.
  • Class of model: response-harm classifier. It judges the response, not the request. For prompt/intent screening use the companion prompt head (constitutional-bioguard-prompt).
  • Preprocessing (preprocessing.py, shipped in this repo): an input normalization layer (normalize_text) that strips invisible/zero-width/tag/variation-selector characters, folds homoglyphs, decodes URL/base64/hex/ROT13, removes combining marks, and applies NFKC. Keep it ON — it is a measured adversarial-robustness defense.
  • Decision threshold: default 0.5. Probabilities can be temperature-scaled for calibration.
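As a rough illustration of the temperature scaling mentioned above, the sketch below divides the two logits by a scalar T before the softmax and picks T by grid search on held-out data. The function names, the grid, and the validation logits are all hypothetical; the card does not say how (or whether) a temperature was fitted for this checkpoint.

```python
import math

def harmful_prob(logits, temperature=1.0):
    """Softmax probability of class 1 (UNSAFE) after dividing logits by T."""
    z0, z1 = (x / temperature for x in logits)
    m = max(z0, z1)  # subtract the max for numerical stability
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e1 / (e0 + e1)

def fit_temperature(logit_pairs, labels, grid=None):
    """Pick T from a grid by minimising mean negative log-likelihood."""
    grid = grid or [0.5 + 0.05 * i for i in range(71)]  # 0.5 .. 4.0

    def nll(t):
        total = 0.0
        for logits, y in zip(logit_pairs, labels):
            p = min(max(harmful_prob(logits, t), 1e-12), 1 - 1e-12)
            total -= math.log(p if y == 1 else 1 - p)
        return total / len(labels)

    return min(grid, key=nll)

# Hypothetical held-out logits as (safe, unsafe) pairs, with labels (1 = harmful).
val_logits = [(-3.0, 3.0), (2.5, -2.5), (-4.0, 4.0), (3.0, -3.0), (-2.0, 2.0), (-1.0, 1.0)]
val_labels = [1, 0, 1, 0, 0, 1]

T = fit_temperature(val_logits, val_labels)
p_raw = harmful_prob((-3.0, 3.0))
p_cal = harmful_prob((-3.0, 3.0), T)
print(f"T={T:.2f}  raw={p_raw:.3f}  calibrated={p_cal:.3f}")
```

Dividing by a positive T never moves a two-class score across 0.5, so the default decision is unchanged; only the reported confidence shifts.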

Intended use

  • In scope: post-generation response-harm screening where a small (184M) model is needed, with text normalization, accepting GENERAL (not bio-specific) harm coverage — as a research-grade second-stage filter or offline auditing tool.
  • Out of scope / do NOT use for:
    • Prompt/input filtering — judges responses, not requests; scores ~0 on prompt-only benchmarks by design.
    • A bio-SELECTIVE classifier — it is not (Limitation 1).
    • Sole safety boundary for high-stakes deployment: it is Pareto-dominated by a small open model, Qwen3Guard-0.6B (Limitation 2).
    • Use without text normalization — character-level evasion bypasses it.
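To make the normalization requirement concrete, here is a small sketch of the kind of character-level cleanup described under Model details. It is a hypothetical stand-in, not the shipped normalize_text: it covers only NFKC folding, invisible characters, variation selectors, tag characters, and combining marks, and it omits homoglyph folding and encoding decoding entirely.

```python
import unicodedata

# Illustrative subset only; the shipped normalize_text covers more cases.
INVISIBLE = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF, 0x00AD}

def sketch_normalize(text: str) -> str:
    """Hypothetical stand-in: NFKC, then strip invisibles and combining marks."""
    text = unicodedata.normalize("NFKC", text)
    # Decompose so accents become separate combining code points, then drop them.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        if ord(ch) not in INVISIBLE
        and not unicodedata.combining(ch)
        and not (0xFE00 <= ord(ch) <= 0xFE0F)      # variation selectors
        and not (0xE0000 <= ord(ch) <= 0xE007F)    # tag characters
    ]
    return unicodedata.normalize("NFC", "".join(kept))

print(sketch_normalize("gu\u200bide R\u0301NA"))   # zero-width space and accent removed
print(sketch_normalize("\uff23\uff32\uff29\uff33\uff30\uff32"))  # full-width letters folded
```

For real use, call the bundled normalize_text rather than a reduced version like this one; the robustness numbers in this card were measured with the shipped function.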

How to use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "jang1563/constitutional-bioguard-response"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo, dtype=torch.float32).eval()

# example inputs (benign demo; do not paste operational language into demos)
query = "How does CRISPR-Cas9 achieve target specificity?"
response = "CRISPR-Cas9 pairs a guide RNA to a complementary DNA target next to a PAM site..."

# normalize first. `preprocessing.py` ships next to the weights in THIS HF repo; if you
# installed the GitHub package instead (`pip install -e .`), import it from there.
try:
    from preprocessing import normalize_text                       # HF repo (file beside weights)
except ModuleNotFoundError:
    from constitutional_bioguard.preprocessing import normalize_text  # pip install -e .
query, response = normalize_text(query), normalize_text(response)

# pair encoding tok(query, response) matches training/eval; do NOT concat with [SEP]
inp = tok(query, response, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    p_harmful = model(**inp).logits.softmax(-1)[0, 1].item()  # class 1 = UNSAFE
flag = p_harmful >= 0.5

(Load in float32 — DeBERTa-v3's disentangled attention NaNs under fp16.)

Performance (all leakage-clean vs our training; 95% CIs; see caveats)

Response-harm, real bio responses (n=554, 343 harm / 211 benign), same items for all models:

| model | size | recall [95% CI] | over-refusal |
|---|---|---|---|
| Qwen3Guard-0.6B | 0.6B | 0.933 | 0.142 |
| this (v8bh) | 184M | 0.921 [0.89, 0.95] | 0.194 |
| WildGuard-7B | 7B | 0.904 | 0.100 |
| Granite-Guardian-2B | 2B | 0.880 | 0.123 |
| Llama-Guard-3-8B | 8B | 0.851 | 0.052 |
| ShieldGemma-9B | 9B | 0.615 | 0.033 |

Threshold-free AUROC = 0.952. Recall 0.921 vs WildGuard 0.904: McNemar p=0.248 (not statistically different); vs Qwen 0.956 native: McNemar p=0.027 (Qwen wins). Competitor CIs omitted (binary outputs); width is similar at the same n.
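The interval and the paired test quoted above can be approximated with standard formulas. The sketch below is one way to do it, under two stated assumptions: that recall 0.921 on 343 harmful items corresponds to 316 detections, and that a Wilson score interval is an acceptable stand-in for whatever interval method the authors used (the card does not say). The McNemar inputs are hypothetical, since the card does not report discordant-pair counts.

```python
from math import comb, sqrt
from statistics import NormalDist

def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Assumption: 316 of 343 harmful items detected (0.921 * 343, rounded).
lo, hi = wilson_interval(316, 343)
print(f"recall CI ~ [{lo:.3f}, {hi:.3f}]")

# Hypothetical discordant counts (NOT from the card):
# b = items only model A caught, c = items only model B caught.
p_value = mcnemar_exact(b=10, c=4)
print(f"McNemar p = {p_value:.3f}")
```

McNemar's test uses only the items on which the two models disagree, which is why it needs per-item predictions on the same test set rather than the two recall figures alone.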

Figure (size-peer Pareto): recall vs over-refusal on bio response-harm (n=554). The 184M response head (crimson) is Pareto-dominated by Qwen3Guard-0.6B, which has higher recall AND lower over-refusal at roughly 3x the parameter count (0.6B vs 184M).
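The dominance claim can be checked directly from the table values. The snippet below is a small sketch using the usual definition of Pareto dominance on two axes (higher recall is better, lower over-refusal is better); the helper name is illustrative.

```python
# (recall, over_refusal) pairs copied from the table above.
results = {
    "Qwen3Guard-0.6B":     (0.933, 0.142),
    "this (v8bh)":         (0.921, 0.194),
    "WildGuard-7B":        (0.904, 0.100),
    "Granite-Guardian-2B": (0.880, 0.123),
    "Llama-Guard-3-8B":    (0.851, 0.052),
    "ShieldGemma-9B":      (0.615, 0.033),
}

def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    (recall_a, over_a), (recall_b, over_b) = a, b
    return recall_a >= recall_b and over_a <= over_b and (recall_a > recall_b or over_a < over_b)

target = "this (v8bh)"
dominators = [name for name, r in results.items()
              if name != target and dominates(r, results[target])]
print(dominators)  # ['Qwen3Guard-0.6B']
```

The larger guards have lower over-refusal but also lower recall than this model, so on these two axes only Qwen3Guard-0.6B dominates it.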

Limitations (measured, not hypothetical)

  1. NOT bio-selective. Selectivity S = 1.03. A general response-harm guard trained on bio+general data, not a bio-discriminating classifier.
  2. Pareto-dominated by a small open model. Qwen3Guard-0.6B has higher recall AND lower over-refusal at ~3x the size. There is no operating point where this model is the best choice.
  3. Companion prompt head is saturated, not calibrated (AUPRC 0.121 vs the 8B teacher's 0.605). Use it only as an AND-policy recall gate, never standalone.
  4. Character-level fragility (mitigated by preprocessing). Without normalization, leetspeak bypasses 86% / zero-width 73% of detections; with the bundled normalize_text: 4% / 0%. Normalization must stay ON.
  5. Over-refusal is distribution-specific. Density-debiasing (this v8bh checkpoint) cut held-out FORTRESS-safe over-refusal 0.288 -> 0.016 but did NOT transfer to other benign distributions (0.185 -> 0.194).
  6. Conformal certificate is response-head-only, on the calibration distribution. Valid bound: over-refusal <= 20% at 95% confidence, recall 0.878 (not a tighter system guarantee).
  7. Contamination caveat. Competitor recall on SafeRLHF/BeaverTails slices may be inflated by their training; this model is decontaminated only against ITS OWN training.
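One reading of the "AND-policy" in Limitation 3 is that a response is flagged only when the prompt gate and the response head both fire, so the saturated prompt head can never flag on its own. The sketch below illustrates that reading; the function name and thresholds are hypothetical, and the actual system policy may differ.

```python
def and_policy_flag(p_prompt: float, p_response: float,
                    prompt_threshold: float = 0.5,
                    response_threshold: float = 0.5) -> bool:
    """Flag only when the prompt gate AND the response head both fire."""
    return p_prompt >= prompt_threshold and p_response >= response_threshold

# A saturated prompt gate fires on almost everything; the response head decides.
print(and_policy_flag(p_prompt=0.99, p_response=0.03))  # False
print(and_policy_flag(p_prompt=0.99, p_response=0.91))  # True
```

Under this policy the combined over-refusal cannot exceed that of the response head alone, while combined recall is capped by the weaker of the two heads.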

Training data

WildGuardMix bio (a GENERAL safety mixture filtered to bio items, which is why the head is general rather than bio-selective) + BeaverTails bio (harmful) + FalseReject non-bio negatives (benign) + FORTRESS dense-safe hard negatives (the v8bh density-debiasing).

Reuse-only, zero newly generated harmful content. All evaluations decontaminated by query-hash against this training (audit_leakage.py: 0 overlap on 5 checks).
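Query-hash decontamination of the kind described above can be sketched as follows. This is an illustration, not the contents of audit_leakage.py: the choice of SHA-256 over raw UTF-8 bytes and the helper names are assumptions, and the example queries are benign toys rather than items from the real datasets.

```python
import hashlib

def query_hash(query: str) -> str:
    """SHA-256 of the raw UTF-8 bytes; any byte-level difference gives a new hash."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

def leakage_overlap(train_queries, test_queries):
    """Return the test queries whose hash also appears in the training set."""
    train_hashes = {query_hash(q) for q in train_queries}
    return [q for q in test_queries if query_hash(q) in train_hashes]

# Toy benign examples (hypothetical).
train = ["What is a plasmid?", "How do vaccines train the immune system?"]
test = ["What is a plasmid?", "Why do cells need ribosomes?"]

overlap = leakage_overlap(train, test)
clean_test = [q for q in test if q not in overlap]
print(len(overlap), "overlapping test queries removed")
```

An exact hash only catches byte-identical queries; paraphrases and near-duplicates pass through, so a zero-overlap result is a statement about exact matches.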

Honest recommendation

If you need a small response-harm guard, use Qwen3Guard-0.6B (better and open). Use THIS model only if you specifically need a 184M-class encoder, accept general (non-bio) coverage, and value the transparent, reproducible evaluation. The intended audience is researchers studying small-guard evaluation, not production deployers seeking the best classifier.

Evaluation integrity — audits that changed the results

Five self-audits found and corrected silent failures in this work; each is documented with the numbers that moved (full log: INTEGRITY_REVIEW_2026-06-04.md):

  1. fp16-default-load NaN — transformers 5.9.0 silently loads DeBERTa-v3 in fp16, NaN-ing the disentangled attention; fixed by forcing float32. Every prior all-zero/NaN eval traced here.
  2. AUPRC refutes the footprint claim — the prompt head's recall@0.5 0.983 looked like success; AUPRC 0.121 vs teacher 0.605 showed it is saturated, not discriminating.
  3. Operating-point mismatch — native-threshold ranking flattered us; at matched FPR we lose to WildGuard (0.878 vs 0.904 @ FPR 0.10). Treating Qwen "Controversial" as flagged had inflated its over-refusal.
  4. Size-peer class eliminates the niche — Qwen3Guard-0.6B Pareto-dominates this model.
  5. Conformal certificate was on the wrong checkpoint — recomputed for shipped v8bh: over-ref <= 20%, recall 0.878.
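The operating-point mismatch in audit 3 is easy to reproduce on toy data. The sketch below fixes each model's threshold so that its false-positive rate on benign items is at most a target value, then compares recall at that matched point. All scores are hypothetical and chosen only to show how a ranking at native thresholds can flip.

```python
import math

def threshold_at_fpr(benign_scores, target_fpr):
    """Smallest threshold t such that flagging score > t keeps benign FPR <= target_fpr."""
    ranked = sorted(benign_scores, reverse=True)
    allowed = math.floor(target_fpr * len(ranked))  # benign items we may flag
    if allowed >= len(ranked):
        return min(ranked)
    return ranked[allowed]

def recall_at(harm_scores, threshold):
    """Fraction of harmful items scoring strictly above the threshold."""
    return sum(s > threshold for s in harm_scores) / len(harm_scores)

# Hypothetical scores: model A looks better at its native 0.5 threshold.
benign_a = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
harm_a = [0.55, 0.65, 0.75, 0.80, 0.95]
benign_b = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.55]
harm_b = [0.20, 0.30, 0.45, 0.60, 0.70]

t_a = threshold_at_fpr(benign_a, 0.10)
t_b = threshold_at_fpr(benign_b, 0.10)
recall_a = recall_at(harm_a, t_a)
recall_b = recall_at(harm_b, t_b)
print(f"A: t={t_a:.2f} recall={recall_a:.2f} | B: t={t_b:.2f} recall={recall_b:.2f}")
```

At the native 0.5 threshold model A catches every harmful item but flags three of ten benign ones; once both are held to the same benign FPR, model B comes out ahead.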

Responsible release

Released as a research artifact and methodology case study, not a recommended production guard. The release surface is weights, evaluation code, and documentation; no harmful training examples, generated harmful content, or operational instructions are included. This is defensive biosafety research. Anyone deploying it should re-validate on their own traffic, keep text normalization on, add adversarial/multi-turn testing, and keep a human in the loop.

License & citation

License: CC BY-NC 4.0 (weights, eval code, docs are open; no harmful training examples distributed). Successor to jang1563/constitutional-bioguard-deberta-v1. Full design and result trail (in the GitHub repo): MODEL_CARD.md · CASE_STUDY_eval_self_red_team.md · INTEGRITY_REVIEW_2026-06-04.md · POSTMORTEM_2026-06-04.md.
