Model Card for Orca-Sonar
Multilingual Document Topic Classifier for Real-World AI Security & DLP
Orca-Sonar is a Multilingual ModernBERT-based (mmBERT) classifier that assigns a document/text to one of 7 topic classes. It is part of the Patronus Protect security stack and is designed for topic-/risk-routing of incoming texts (e.g. before they reach an LLM, a DLP gate, or a storage tier).
It classifies German and English text and is robust to user-to-AI wrappers (e.g. "Summarize this contract: …"), i.e. the topic of the content determines the class, not the surface format of the request.
Intended Uses
The model maps an input text to one of:
| id | label | description |
|---|---|---|
| 0 | finance | invoices, balance sheets, quarterly/annual reports, cash-flow, SEC filings, forecasts |
| 1 | hr | CVs, job ads, employment contracts, terminations, HR policies, performance reviews, recruiting |
| 2 | internal_and_tech | ADRs, RFCs, postmortems, specs, READMEs, wikis, architecture & strategy memos, runbooks |
| 3 | legal | contracts, NDAs, ToS/AGB, privacy policies, statutes/judgments, compliance, legal correspondence |
| 4 | marketing | press releases, newsletters, landing-page/sales copy, outbound pitches, case studies |
| 5 | other | conversational / non-business: smalltalk, recipes, travel, hobby, learning, creative |
| 6 | source_code | raw program code & configs (Python/Go/Rust/JS/TS/SQL/Bash/Dockerfile/k8s/Terraform …) |
Disambiguation: on a tie, the more sensitive class wins, in this order: legal > hr > finance > internal_and_tech > source_code > marketing > other.
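The tie-break order above can be sketched in a few lines of Python. This is an illustration only: the card does not say whether the rule is applied during labeling or at inference time, and `SENSITIVITY_ORDER`, `resolve_tie`, and the `tolerance` parameter are hypothetical names introduced here.

```python
# Sensitivity order as stated in the model card, most sensitive first.
SENSITIVITY_ORDER = [
    "legal", "hr", "finance", "internal_and_tech",
    "source_code", "marketing", "other",
]

def resolve_tie(scores, tolerance=0.0):
    """Pick a label from a {label: score} mapping.

    Labels whose score is within `tolerance` of the best score count as tied;
    among tied labels the most sensitive one wins. `tolerance` is an
    assumption for illustration and does not come from the model card.
    """
    best = max(scores.values())
    tied = [label for label, score in scores.items() if best - score <= tolerance]
    return min(tied, key=SENSITIVITY_ORDER.index)

print(resolve_tie({"legal": 0.45, "finance": 0.45, "other": 0.10}))  # legal
```

With `tolerance=0.0` only exact ties are broken by sensitivity; a small positive value would also treat near-ties as ties.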
Limitations
- Highly accurate on German and English; other languages were not actively tested.
- The model can produce false positives; for high-stakes routing combine it with a confidence/abstention gate.
- Robustness against adversarial / out-of-distribution / pure-PII / pathological-length inputs is partial; pair the model with a deterministic pre-gate (length + PII) for production DLP use.
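A minimal sketch of the two safeguards mentioned above, a deterministic pre-gate followed by a confidence/abstention gate, might look like this. The thresholds (`MAX_CHARS`, `MIN_CONFIDENCE`), the naive email pattern, the decision names, and the functions `pre_gate` and `route` are all assumptions for illustration; a production DLP gate would need a proper PII detector and tuned limits.

```python
import re

MAX_CHARS = 20_000      # hypothetical length limit
MIN_CONFIDENCE = 0.80   # hypothetical abstention threshold
# Naive email pattern standing in for a real PII detector (illustration only).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def pre_gate(text):
    """Deterministic checks run before the classifier.

    Returns a reason string if the text should not be classified, else None.
    """
    if not text.strip():
        return "empty"
    if len(text) > MAX_CHARS:
        return "too_long"
    if EMAIL_RE.search(text):
        return "pii"
    return None

def route(text, classify):
    """Route `text` using `classify`, any callable returning (label, score)."""
    reason = pre_gate(text)
    if reason is not None:
        return {"decision": "blocked", "reason": reason}
    label, score = classify(text)
    if score < MIN_CONFIDENCE:
        # Low confidence: abstain instead of routing on a likely false positive.
        return {"decision": "abstain", "label": label, "score": score}
    return {"decision": "route", "label": label, "score": score}
```

Here `classify` could be a thin wrapper around the `transformers` pipeline that returns the top label and its score; what happens to blocked or abstained texts (manual review, a stricter default tier) is a policy choice outside the model.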
Model Variants
- orca-sonar – full model (model.safetensors, fp32).
- orca-sonar-fp16 (ONNX) – FP16 ONNX export under onnx/onnx_fp16/; half the size, argmax-faithful to the full model.
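Assuming "argmax-faithful" means that the FP16 export predicts the same top class as the full model, one way to check it on your own samples is to compare the argmax of both models' logits. The helper names below are hypothetical, and the logits are made-up numbers rather than real model output.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def argmax_agreement(logits_a, logits_b):
    """Fraction of rows whose top class is the same in both logit batches."""
    if len(logits_a) != len(logits_b) or not logits_a:
        raise ValueError("need two non-empty batches of equal length")
    matches = sum(argmax(a) == argmax(b) for a, b in zip(logits_a, logits_b))
    return matches / len(logits_a)

# Made-up logits for two inputs and three classes (illustration only).
fp32_logits = [[2.100, -0.300, 0.400], [0.100, 3.200, -1.000]]
fp16_logits = [[2.098, -0.301, 0.402], [0.100, 3.199, -1.000]]
print(argmax_agreement(fp32_logits, fp16_logits))  # 1.0
```

An agreement of 1.0 on a representative sample is consistent with the claim; lower values would point at inputs where reduced precision flips a close decision.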
Training Data
Trained on our own in-house dataset (German + English, 7 topic classes), purpose-built for this model. The dataset will be published soon.
Benchmark
Results on the held-out test set (100 % real data): accuracy, macro F1, and per-class F1.
| Metric | Score |
|---|---|
| Accuracy | 0.978 |
| F1 (macro) | 0.978 |
| F1 legal | 0.995 |
| F1 source_code | 0.985 |
| F1 marketing | 0.980 |
| F1 internal_and_tech | 0.977 |
| F1 hr | 0.971 |
| F1 finance | 0.970 |
| F1 other | 0.970 |
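As a quick consistency check, macro F1 is the unweighted mean of the per-class F1 scores, so the value in the table can be recomputed from the seven rows below it. The variable names are illustrative.

```python
# Per-class F1 scores copied from the table above.
per_class_f1 = {
    "legal": 0.995,
    "source_code": 0.985,
    "marketing": 0.980,
    "internal_and_tech": 0.977,
    "hr": 0.971,
    "finance": 0.970,
    "other": 0.970,
}

# Macro F1: every class contributes equally, regardless of its size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.978
```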
Usage
from transformers import pipeline
clf = pipeline("text-classification", model="patronus-studio/orca-sonar-document-classifier")
clf("Fasse mir diesen Dienstleistungsvertrag zusammen: Laufzeit 24 Monate, Gerichtsstand München …")
# -> [{'label': 'legal', 'score': 0.99}]
ONNX
An FP16 ONNX version is available under onnx/onnx_fp16/:
import torch
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "patronus-studio/orca-sonar-document-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the FP16 ONNX export from its subfolder in the repository.
model = ORTModelForSequenceClassification.from_pretrained(model_id, subfolder="onnx/onnx_fp16")

inputs = tokenizer("def add(a, b):\n return a + b", return_tensors="pt")
logits = model(**inputs).logits
# Map the index of the highest logit to its label name.
print(model.config.id2label[int(torch.argmax(logits, dim=-1))])
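The ONNX path above returns raw logits rather than the probability-like score that the pipeline reports. If a confidence value is needed, for example for an abstention threshold, a softmax over the logits gives one. The sketch below is plain Python with hypothetical helper names; the `id2label` mapping is copied from the class table, and the logits are made-up numbers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return id2label[idx], probs[idx]

# Label ids as listed in the class table of this card.
id2label = {0: "finance", 1: "hr", 2: "internal_and_tech", 3: "legal",
            4: "marketing", 5: "other", 6: "source_code"}

# Made-up logits for a single input (illustration only).
label, prob = top_label([-1.0, -2.0, 0.5, -1.5, -2.5, 0.0, 6.0], id2label)
print(label)  # source_code
```

With the model output from the snippet above, the list of logits for a single input would be `logits[0].tolist()`.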
Citation
@misc{orcasonar2026,
title={Orca-Sonar: Multilingual Document Topic Classification for Real-World AI Security},
author={Patronus Protect},
year={2026},
howpublished={\url{https://huggingface.co/patronus-studio/orca-sonar-document-classifier}}
}
Model tree for patronus-studio/orca-sonar-document-classifier
Base model: jhu-clsp/mmBERT-small