CraneMed AI Safety Classifier

An on-device, multilingual safety classifier for detecting adversarial prompts targeting clinical AI assistants operating under the Uganda Clinical Guidelines (UCG) 2023.

Fine-tuned from paraphrase-multilingual-MiniLM-L12-v2 on the UCG Adversarial Safety Dataset (3,020 labeled prompts across 8 attack categories). Designed for Android deployment at ≤ 40 MB and < 30 ms latency.


Model Description

This classifier is the L3 (neural) layer in the CraneMed AI safety architecture, a multi-layered defense system for a clinical decision support tool running MedGemma on Android in low-connectivity Ugandan health facilities.

The safety architecture has four layers, listed here in the order they run (the layer IDs are not sequential):

  1. L1: Regex Input Filter (~0 MB, < 1 ms): Pattern-based blocking of known jailbreaks, LD50 queries, and roleplay attacks
  2. L3: This Classifier (~12 MB INT8, < 30 ms): Neural classifier for nuanced adversarial detection
  3. L4: MedGemma Self-Check (0 MB, ~300 ms): Fires only in the borderline zone, reusing the existing model
  4. L2: Regex Output Validator (~0 MB, < 1 ms): Post-generation checks for harmful leakage

The classifier catches attacks that evade regex patterns: authority impersonation, escalation chains, contraindication bypass with clinical framing, and multilingual attacks.
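The layered flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the regex patterns, the `BLOCK_AT` and `ALLOW_BELOW` thresholds, and the `classify`, `self_check`, and `generate` callables are all hypothetical stand-ins for the L3 classifier, the MedGemma self-check, and MedGemma generation.

```python
import re
from typing import Callable

# Hypothetical patterns and thresholds, chosen only for illustration.
INPUT_PATTERNS = [
    re.compile(r"\bld50\b", re.IGNORECASE),
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]
OUTPUT_PATTERNS = [re.compile(r"\blethal dose\b", re.IGNORECASE)]
BLOCK_AT = 0.80     # assumed: scores at or above this are blocked by L3
ALLOW_BELOW = 0.40  # assumed: scores below this skip the L4 self-check


def safety_gate(
    prompt: str,
    classify: Callable[[str], float],   # returns P(ADVERSARIAL) in [0, 1]
    self_check: Callable[[str], bool],  # returns True if the prompt is judged safe
    generate: Callable[[str], str],     # produces the model's answer
) -> str:
    # L1: cheap regex filter on the input.
    if any(p.search(prompt) for p in INPUT_PATTERNS):
        return "BLOCKED:L1"

    # L3: neural classifier score.
    score = classify(prompt)
    if score >= BLOCK_AT:
        return "BLOCKED:L3"

    # L4: the slower self-check runs only in the borderline zone.
    if score >= ALLOW_BELOW and not self_check(prompt):
        return "BLOCKED:L4"

    answer = generate(prompt)

    # L2: regex validation of the generated output.
    if any(p.search(answer) for p in OUTPUT_PATTERNS):
        return "BLOCKED:L2"
    return answer
```

Ordering the layers this way keeps the expensive self-check off the common path: clearly safe and clearly adversarial prompts are resolved by the regex filter and the classifier alone.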


Files

| Path | Description | Size |
|---|---|---|
| best_model/model.safetensors | PyTorch checkpoint (safetensors format) | ~90 MB |
| best_model/config.json | Model configuration | - |
| best_model/tokenizer.json | Fast tokenizer | - |
| best_model/tokenizer_config.json | Tokenizer configuration | - |
| best_model/special_tokens_map.json | Special tokens mapping | - |
| onnx/cranemed_safety_fp32.onnx | Full precision ONNX export | ~45 MB |
| onnx/cranemed_safety_int8.onnx | INT8 quantized ONNX (Android deployment) | ~12 MB |
| onnx/export_meta.json | Export metadata and validation results | - |

Intended Use

  • Primary: On-device adversarial prompt filtering for clinical AI assistants in Ugandan health facilities
  • Secondary: Red-team evaluation of clinical LLMs against Uganda-specific adversarial attacks
  • Deployment target: Android 8 GB devices, ONNX Runtime, INT8 quantization

Attack Categories Detected

| Category | Description |
|---|---|
| Dangerous Dosing | Toxic thresholds, lethal dose stacking requests |
| Authority Impersonation | MOH officer framing, academic authority claims |
| Contraindication Bypass | Stock-out exploitation, urgency-based safety override |
| UCG Context Exploitation | VHT scope creep, bed pressure, out-of-scope roles |
| Emergency Misdirection | Drug substitution under fabricated emergencies |
| Hallucination Probing | Fake UCG sections, fictional drugs |
| Jailbreak / Roleplay | Fictional doctor personas, persona injection |
| Multi-Turn Escalation | Benign context followed by adversarial escalation |

Languages

Supports prompts in English, Luganda, and Swahili.


Training Data

Trained on the UCG Adversarial Safety Dataset: 3,020 labeled prompts (1,034 ADVERSARIAL / 1,986 SAFE) generated from UCG 2023 clinical mappings using Gemini 1.5 Flash.

Class imbalance is addressed via inverse-frequency weighted CrossEntropyLoss (~1.9x weight on the ADVERSARIAL class).
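As a worked check of the ~1.9x figure, the sketch below computes inverse-frequency weights from the class counts stated above. The normalization used here (total / (number of classes × class count)) is an assumption; the source gives only the resulting ratio, and any normalization of inverse frequencies yields the same ratio between the two classes.

```python
# Class counts as stated in the Training Data section.
counts = {"SAFE": 1986, "ADVERSARIAL": 1034}


def inverse_frequency_weights(counts: dict[str, int]) -> dict[str, float]:
    """Weight each class by total / (num_classes * class_count).

    The normalization is an assumption for illustration; only the ratio
    between the class weights is implied by the model card.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


weights = inverse_frequency_weights(counts)
ratio = weights["ADVERSARIAL"] / weights["SAFE"]
print(weights)
print(f"ADVERSARIAL / SAFE weight ratio: {ratio:.2f}")
```

In PyTorch, weights like these would typically be passed in class-index order as the `weight` tensor of `torch.nn.CrossEntropyLoss`.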


Target Metrics

| Metric | Target |
|---|---|
| Precision (ADVERSARIAL) | > 0.92 |
| F1 Macro | > 0.88 |
| Avg Latency (INT8, on-device) | < 30 ms |
| Model Size (INT8) | ≤ 40 MB |
| Accuracy Degradation FP32 → INT8 | < 2% |
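To make the two classification targets concrete, the following sketch derives ADVERSARIAL precision and macro F1 from a binary confusion matrix and compares them with the thresholds listed above. The function names are hypothetical, and no evaluation results are implied; the confusion counts would come from a held-out evaluation.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one class, with 0.0 on empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def check_targets(tp: int, fp: int, fn: int, tn: int) -> dict[str, bool]:
    """Compare metrics with the stated targets; ADVERSARIAL is the positive class."""
    adv_precision, _, adv_f1 = precision_recall_f1(tp, fp, fn)
    # For the SAFE class the roles swap: true negatives become its true
    # positives, and the false positives/negatives trade places.
    _, _, safe_f1 = precision_recall_f1(tn, fn, fp)
    macro_f1 = (adv_f1 + safe_f1) / 2
    return {
        "precision_adversarial": adv_precision > 0.92,
        "f1_macro": macro_f1 > 0.88,
    }
```

Macro F1 averages the per-class F1 scores without weighting by class size, so the smaller ADVERSARIAL class counts as much as the SAFE class.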

Android Deployment

1. Copy cranemed_safety_int8.onnx → app/src/main/assets/
2. Copy tokenizer files → app/src/main/assets/tokenizer/
3. Use OnnxSafetyClassifier.kt for inference
4. Integrate with SafetyGate.kt in the MedGemma pipeline
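OnnxSafetyClassifier.kt and SafetyGate.kt are not shown here, so the sketch below only illustrates, in Python, the post-processing an inference wrapper would typically perform on the model's raw output. It assumes the ONNX model emits two logits per prompt in the order (SAFE, ADVERSARIAL); that ordering is an assumption and should be confirmed against best_model/config.json or onnx/export_meta.json.

```python
import math

# Assumed output order of the two logits; verify against the exported model.
LABELS = ("SAFE", "ADVERSARIAL")


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adversarial_probability(logits: list[float]) -> float:
    """Probability assigned to the ADVERSARIAL class."""
    return softmax(logits)[LABELS.index("ADVERSARIAL")]
```

The resulting probability is the kind of score a downstream gate could compare against its block and borderline thresholds.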

Citation

@misc{cranemedai_safety_classifier,
  author    = {Crane AI Labs},
  title     = {CraneMed AI Safety Classifier: On-Device Adversarial Prompt Detection for Uganda Clinical Guidelines AI},
  year      = {2026},
  publisher = {Hugging Face},
  journal   = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/CraneAILabs/cranemedai-safety}}
}