SafeRoute Router Model (DynaGuard 1.7B / 8B)

This repository contains the weights for the SafeRoute Router, an optimized neural router designed to dynamically direct input prompts/responses between a lightweight safety classifier (Small Model) and a high-capacity safety classifier (Large Model).

By routing "easy/safe" queries to the small model and reserving the large model for "hard/unsafe" queries, the system is designed to cut inference latency and computational cost while preserving overall safety evaluation performance.
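The dispatch step described above can be sketched in a few lines of plain Python. This is only an illustration: `dispatch`, `small_stub`, and `large_stub` are hypothetical names, the two classifiers are stand-in functions, and the routing probabilities are assumed to come from the router's sigmoid output. The 0.6 threshold is the one reported in this card.

```python
from typing import Callable, List, Sequence

THRESHOLD = 0.6  # decision threshold reported in this model card


def dispatch(
    items: Sequence[str],
    routing_probs: Sequence[float],
    small_model: Callable[[str], str],
    large_model: Callable[[str], str],
    threshold: float = THRESHOLD,
) -> List[str]:
    """Send an item to the large model if its routing probability
    exceeds the threshold, otherwise to the small model."""
    if len(items) != len(routing_probs):
        raise ValueError("items and routing_probs must have the same length")
    verdicts = []
    for item, prob in zip(items, routing_probs):
        classifier = large_model if prob > threshold else small_model
        verdicts.append(classifier(item))
    return verdicts


# Hypothetical stand-ins for the two safety classifiers.
def small_stub(text: str) -> str:
    return f"small:{text}"


def large_stub(text: str) -> str:
    return f"large:{text}"


results = dispatch(["a", "b", "c"], [0.10, 0.95, 0.60], small_stub, large_stub)
print(results)
```

Because the comparison is strict (`>`), an item whose probability is exactly 0.6 stays with the small model, matching the inference code in this card.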

Model Details

Architecture: Multi-Layer Perceptron (MLP) with 3 hidden layers (1024 -> 512 -> 256), each followed by BatchNorm1d, GELU activation, and Dropout (0.3 after the first two hidden layers, 0.2 after the third, per the reference code below).
Input Dimension: 2048 (feature embeddings extracted from the small safety model).
Output Dimension: 1 (binary classification logit indicating routing probability).
Loss Function: Focal Loss ($\alpha=0.75, \gamma=2.0$) tailored to address severe class imbalance.
Optimizer & Scheduler: AdamW with CosineAnnealingWarmRestarts.
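A minimal sketch of how the loss, optimizer, and scheduler listed above might fit together in one training step. Several details are assumptions not stated in this card: that $\alpha=0.75$ weights the positive ("route to Large Model") class, the learning rate, the weight decay, and the restart period `T_0`. A single linear layer stands in for the router MLP to keep the example short, and `binary_focal_loss` is a hypothetical helper name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def binary_focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Binary focal loss on raw logits; targets are 0/1 floats.

    Assumption: alpha weights the positive class, (1 - alpha) the negative.
    """
    # Per-example cross-entropy, equal to -log(p_t)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    probs = torch.sigmoid(logits)
    # p_t is the probability the model assigns to the true class
    p_t = probs * targets + (1 - probs) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Down-weight well-classified examples by (1 - p_t) ** gamma
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()


torch.manual_seed(0)
model = nn.Linear(2048, 1)  # stand-in for the router MLP
# lr, weight_decay, and T_0 are illustrative values, not the authors' settings
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)

features = torch.randn(8, 2048)             # placeholder embeddings
labels = torch.randint(0, 2, (8,)).float()  # placeholder routing labels

loss = binary_focal_loss(model(features).squeeze(-1), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
print(f"focal loss: {loss.item():.4f}")
```

With $\gamma=2.0$, an example already classified with $p_t = 0.9$ contributes only $(1 - 0.9)^2 = 0.01$ of its cross-entropy, so training concentrates on the harder cases.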

Evaluation Results

Evaluated on a balanced Test Benchmark at the optimal decision threshold (0.6):

Metric	Score
F1 Score	0.7525
Accuracy	0.7500
Precision	0.7451
Recall	0.7600
Overall AUPRC	0.7588

Note: At this threshold the router reaches a recall of 0.76 and a precision of 0.75, so roughly three quarters of the hard/unsafe examples are routed to the Large Model for thorough inspection, while the remainder are handled by the Small Model alone.
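The reported metrics are mutually consistent, which can be checked with a short calculation. The confusion-matrix counts below are hypothetical: they assume a balanced set of 100 positives ("route to Large Model") and 100 negatives, and were chosen because they reproduce the table. The actual test-set size is not stated in this card.

```python
# Hypothetical counts for a balanced set: 100 positives, 100 negatives
tp, fn = 76, 24  # positives routed to the large / small model
tn, fp = 74, 26  # negatives kept on the small / sent to the large model

total = tp + fn + tn + fp
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / total
f1 = 2 * precision * recall / (precision + recall)
# Share of all inputs that would be sent to the large model
large_share = (tp + fp) / total

print(f"precision={precision:.4f} recall={recall:.4f}")
print(f"accuracy={accuracy:.4f} f1={f1:.4f} large_share={large_share:.2f}")
```

Under these assumed counts, about half of a balanced workload would still reach the Large Model. The saving in a real deployment depends on how many production inputs are actually hard, which this card does not report.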

How to Get Started with the Model

You can easily download and use this model in your PyTorch pipeline:

import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download

# 1. Define the Router Architecture
class RouterMLP(nn.Module):
    def __init__(self, input_dim=2048):
        super().__init__()
        self.cls = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, x):
        return self.cls(x).squeeze(-1)

# 2. Download and Load the Checkpoint
repo_id = "YOUR_HF_USERNAME/safe-route-dynaguard" # <-- Replace with your repo name
model_path = hf_hub_download(repo_id=repo_id, filename="model.pt")

device = "cuda" if torch.cuda.is_available() else "cpu"
router = RouterMLP(input_dim=2048).to(device)

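# The checkpoint stores the weights under the "state_dict" key
ckpt = torch.load(model_path, map_location=device)
# strict=False skips mismatched keys silently, so report any that were skipped
load_result = router.load_state_dict(ckpt["state_dict"], strict=False)
if load_result.missing_keys or load_result.unexpected_keys:
    print(f"Warning: missing={load_result.missing_keys}, unexpected={load_result.unexpected_keys}")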
router.eval()

# 3. Perform Routing Inference
with torch.no_grad():
    # Random placeholder features; replace with embeddings extracted from the small safety model
    sample_features = torch.randn(4, 2048, device=device)
    
    logits = router(sample_features)
    routing_probs = torch.sigmoid(logits)
    
    # Use recommended threshold 0.6
    decisions = (routing_probs > 0.6).long()
    
    for i, decision in enumerate(decisions):
        if decision == 1:
            print(f"Sample {i}: Route to LARGE Model (Hard/Unsafe)")
        else:
            print(f"Sample {i}: Use SMALL Model (Easy/Safe)")

Intended Use

Primary Use Case: Guardrail optimization in LLM serving pipelines.
Out-of-Scope: Standalone toxicity classification directly from raw text (this model requires intermediate hidden feature representations from a pre-trained small safety model).
