SafeRoute Router Model (DynaGuard 1.7B / 8B)
This repository contains the weights for the SafeRoute Router, an optimized neural router designed to dynamically direct input prompts/responses between a lightweight safety classifier (Small Model) and a high-capacity safety classifier (Large Model).
By routing "easy/safe" queries to the small model and reserving the large model for "hard/unsafe" queries, the system aims to reduce inference latency and computational cost while largely preserving overall safety evaluation performance.
Model Details
- Architecture: Multi-Layer Perceptron (MLP) with 3 hidden layers (1024 -> 512 -> 256), each followed by `BatchNorm1d`, a `GELU` activation, and `Dropout` (0.3; the reference code below uses 0.2 in the last hidden block).
- Input Dimension: 2048 (feature embeddings extracted from the small safety model).
- Output Dimension: 1 (a single logit; its sigmoid is the probability of routing to the Large Model).
- Loss Function: Focal Loss ($\alpha=0.75, \gamma=2.0$), chosen to address severe class imbalance.
- Optimizer & Scheduler: `AdamW` with `CosineAnnealingWarmRestarts`.
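To make the loss setting concrete, here is a minimal pure-Python sketch of the standard binary focal loss with the listed $\alpha$ and $\gamma$. It assumes the usual formulation, $-\alpha_t (1 - p_t)^\gamma \log p_t$, with $\alpha$ applied to the positive class; the training code for this router is not part of this card, so the exact variant used may differ. The function name `binary_focal_loss` is illustrative.

```python
import math


def binary_focal_loss(logit: float, target: int,
                      alpha: float = 0.75, gamma: float = 2.0) -> float:
    """Focal loss for a single example, given a raw logit and a 0/1 target.

    Illustrative only: not numerically hardened for extreme logits.
    """
    p = 1.0 / (1.0 + math.exp(-logit))               # sigmoid of the logit
    p_t = p if target == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct positive contributes far less than a confidently wrong one.
easy = binary_focal_loss(logit=3.0, target=1)
hard = binary_focal_loss(logit=-3.0, target=1)
print(f"easy positive: {easy:.6f}")
print(f"hard positive: {hard:.6f}")
```

With $\gamma = 2$, examples the router already classifies well are down-weighted, so training concentrates on the harder, rarer cases.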
Evaluation Results
Evaluated on a balanced test benchmark at the recommended decision threshold (0.6):
| Metric | Score |
|---|---|
| F1 Score | 0.7525 |
| Accuracy | 0.7500 |
| Precision | 0.7451 |
| Recall | 0.7600 |
| Overall AUPRC | 0.7588 |
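Note: The positive class is "route to the Large Model". A recall of 0.76 means about three in four hard or unsafe samples are escalated to the Large Model, and a precision of 0.75 means about three in four escalations are warranted. Roughly a quarter of hard samples still stay with the Small Model; lowering the threshold trades more Large Model calls for fewer missed escalations.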
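As a sanity check, the reported metrics are mutually consistent. The sketch below uses hypothetical confusion-matrix counts (100 positives and 100 negatives; the actual test set size is not stated here) that reproduce the table, and also shows the share of samples that would reach the Large Model on such a balanced set. AUPRC cannot be recovered from a single threshold, so it is left out.

```python
# Hypothetical counts for a balanced set; positive = "route to the Large Model".
tp, fn = 76, 24   # hard samples: escalated vs. missed
fp, tn = 26, 74   # easy samples: escalated unnecessarily vs. kept on the Small Model

total = tp + fn + fp + tn
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total
large_model_share = (tp + fp) / total  # fraction of traffic sent to the Large Model

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
print(f"share routed to Large Model: {large_model_share:.2%}")
```

Any multiple of these counts gives the same ratios. On real traffic the share routed to the Large Model depends on how often hard samples occur, which a balanced benchmark does not reflect.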
How to Get Started with the Model
Download the checkpoint from the Hugging Face Hub and use it in a PyTorch pipeline:
```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download


# 1. Define the Router Architecture
class RouterMLP(nn.Module):
    def __init__(self, input_dim=2048):
        super().__init__()
        self.cls = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, x):
        # (batch, input_dim) -> (batch,) raw logits
        return self.cls(x).squeeze(-1)


# 2. Download and Load the Checkpoint
repo_id = "YOUR_HF_USERNAME/safe-route-dynaguard"  # <-- Replace with your repo name
model_path = hf_hub_download(repo_id=repo_id, filename="model.pt")
device = "cuda" if torch.cuda.is_available() else "cpu"
router = RouterMLP(input_dim=2048).to(device)
ckpt = torch.load(model_path, map_location=device)
# strict=False tolerates missing or unexpected keys; inspect the return value
# of load_state_dict if the outputs look wrong.
router.load_state_dict(ckpt["state_dict"], strict=False)
router.eval()  # disables dropout and uses BatchNorm running statistics

# 3. Perform Routing Inference
with torch.no_grad():
    # Example feature tensor extracted from small model
    sample_features = torch.randn(4, 2048, device=device)
    logits = router(sample_features)
    routing_probs = torch.sigmoid(logits)
    # Use recommended threshold 0.6
    decisions = (routing_probs > 0.6).long()

for i, decision in enumerate(decisions):
    if decision == 1:
        print(f"Sample {i}: Route to LARGE Model (Hard/Unsafe)")
    else:
        print(f"Sample {i}: Use SMALL Model (Easy/Safe)")
```
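Because the sigmoid is monotonic, thresholding the routing probability at 0.6 is equivalent to thresholding the raw logit at $\ln(0.6 / 0.4) = \ln 1.5$. The small stdlib sketch below illustrates this; the helper names are illustrative and not part of this repository.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def prob_threshold_to_logit(threshold: float) -> float:
    """Inverse sigmoid: the logit at which the probability equals `threshold`."""
    return math.log(threshold / (1.0 - threshold))


logit_threshold = prob_threshold_to_logit(0.6)
print(f"logit threshold for p > 0.6: {logit_threshold:.4f}")

# Both rules give the same routing decision for each example logit.
for z in (-2.0, 0.0, 0.3, 0.5, 2.0):
    by_prob = sigmoid(z) > 0.6
    by_logit = z > logit_threshold
    print(f"logit={z:+.1f} route_to_large={by_prob} same_as_logit_rule={by_prob == by_logit}")
```

One consequence is that a logit of exactly 0 (probability 0.5) stays with the Small Model at this threshold.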
Intended Use
- Primary Use Case: Guardrail optimization in LLM serving pipelines.
- Out-of-Scope: Standalone toxicity classification directly from raw text (this model requires intermediate hidden feature representations from a pre-trained small safety model).
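
For the primary use case, the router sits between the two classifiers. The sketch below shows one way such a cascade might be wired. It assumes the small model returns both a verdict and the feature vector in a single pass; `small_model`, `router_prob`, and `large_model` are hypothetical callables, and the toy stand-ins only exercise the control flow.

```python
from typing import Callable, List, Sequence, Tuple


def cascade_classify(
    text: str,
    small_model: Callable[[str], Tuple[str, List[float]]],
    router_prob: Callable[[Sequence[float]], float],
    large_model: Callable[[str], str],
    threshold: float = 0.6,
) -> Tuple[str, str]:
    """Return (verdict, name of the model that produced it)."""
    # One small-model pass yields a provisional verdict and the router features.
    small_verdict, features = small_model(text)
    if router_prob(features) > threshold:
        # Router flags the input as hard: pay for the large model.
        return large_model(text), "large"
    return small_verdict, "small"


# Toy stand-ins (hypothetical) so the sketch runs without any real models.
def toy_small(text: str) -> Tuple[str, List[float]]:
    return "safe", [float(len(text))]


def toy_router(features: Sequence[float]) -> float:
    return 0.9 if features[0] > 20 else 0.1


def toy_large(text: str) -> str:
    return "unsafe"


short_result = cascade_classify("hi", toy_small, toy_router, toy_large)
long_result = cascade_classify(
    "a much longer and more ambiguous request", toy_small, toy_router, toy_large
)
print(short_result)
print(long_result)
```

In a real deployment `router_prob` would wrap the `RouterMLP` forward pass followed by a sigmoid, and the features would come from the small safety model's hidden representations rather than from raw text.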