Model Card for Orca-Sonar

Multilingual Document Topic Classifier for Real-World AI Security & DLP

Orca-Sonar is a Multilingual ModernBERT-based (mmBERT) classifier that assigns a document/text to one of 7 topic classes. It is part of the Patronus Protect security stack and is designed for topic-/risk-routing of incoming texts (e.g. before they reach an LLM, a DLP gate, or a storage tier).

It classifies German and English text and is robust to user-to-AI wrappers (e.g. "Summarize this contract: …"), i.e. the topic of the content determines the class, not the surface format of the request.

Intended Uses

The model maps an input text to one of:

id	label	description
0	`finance`	invoices, balance sheets, quarterly/annual reports, cash-flow, SEC filings, forecasts
1	`hr`	CVs, job ads, employment contracts, terminations, HR policies, performance reviews, recruiting
2	`internal_and_tech`	ADRs, RFCs, postmortems, specs, READMEs, wikis, architecture & strategy memos, runbooks
3	`legal`	contracts, NDAs, ToS/AGB, privacy policies, statutes/judgments, compliance, legal correspondence
4	`marketing`	press releases, newsletters, landing-page/sales copy, outbound pitches, case studies
5	`other`	conversational / non-business: smalltalk, recipes, travel, hobby, learning, creative
6	`source_code`	raw program code & configs (Python/Go/Rust/JS/TS/SQL/Bash/Dockerfile/k8s/Terraform …)

Disambiguation: on a tie, the more sensitive class wins — legal > hr > finance > internal_and_tech > source_code > marketing > other.
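
The priority order above can be written down as a small tie-break helper. This is a minimal sketch, not the model's own logic: the card does not say whether the rule is applied during labeling or at inference time, so the sketch assumes a "tie" means two or more classes whose scores lie within a small margin of the best score. The names `resolve_tie` and `margin` are hypothetical.

```python
# Sensitivity order from the model card, most sensitive first.
PRIORITY = ["legal", "hr", "finance", "internal_and_tech",
            "source_code", "marketing", "other"]


def resolve_tie(scores: dict[str, float], margin: float = 0.0) -> str:
    """Return the most sensitive label among the top-scoring candidates.

    Assumption: a "tie" is any label whose score is within `margin`
    of the best score. With margin=0.0 only exact ties are resolved.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores.values())
    candidates = [label for label, s in scores.items() if best - s <= margin]
    # A lower index in PRIORITY means a more sensitive class.
    return min(candidates, key=PRIORITY.index)
```

For example, `resolve_tie({"marketing": 0.6, "finance": 0.58}, margin=0.05)` would pick `finance`, because both scores fall inside the margin and `finance` ranks higher in the order.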

Limitations

Highly accurate on German and English; other languages were not actively tested.
The model can produce false positives; for high-stakes routing combine it with a confidence/abstention gate.
Robustness against adversarial / out-of-distribution / pure-PII / pathological-length inputs is partial; pair the model with a deterministic pre-gate (length + PII) for production DLP use.
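
The two mitigations named above, a deterministic pre-gate and a confidence/abstention gate, could be combined roughly as follows. This is a sketch under stated assumptions, not a production DLP filter: the thresholds, the `abstain:*` decision strings, the function names, and the two regular expressions are all hypothetical, and the PII patterns are deliberately crude. `classify` stands for any callable that returns output in the shape shown in the Usage section.

```python
import re
from typing import Callable, Optional

# Hypothetical thresholds; tune them on your own traffic.
MIN_CHARS = 20
MAX_CHARS = 20_000
MIN_CONFIDENCE = 0.80

# Very rough PII patterns (email, IBAN-like strings), for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def pre_gate(text: str) -> Optional[str]:
    """Deterministic checks that run before the classifier.

    Returns an abstention decision, or None if the text may be classified.
    """
    stripped = text.strip()
    if len(stripped) < MIN_CHARS:
        return "abstain:too_short"
    if len(stripped) > MAX_CHARS:
        return "abstain:too_long"
    # Remove PII matches; if almost nothing remains, treat the input as pure PII.
    residue = IBAN_RE.sub("", EMAIL_RE.sub("", stripped))
    if len(residue.strip()) < MIN_CHARS:
        return "abstain:pure_pii"
    return None


def route(text: str, classify: Callable[[str], list]) -> str:
    """Apply the pre-gate, then the classifier, then a confidence gate."""
    decision = pre_gate(text)
    if decision is not None:
        return decision
    top = classify(text)[0]  # expected shape: {'label': str, 'score': float}
    if top["score"] < MIN_CONFIDENCE:
        return "abstain:low_confidence"
    return top["label"]
```

Keeping the pre-gate free of model calls means that pathological inputs are rejected cheaply and reproducibly, while the confidence gate only handles texts the classifier actually saw.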

Model Variants

orca-sonar – full model (model.safetensors, fp32).
orca-sonar-fp16 (ONNX) – FP16 ONNX export under onnx/onnx_fp16/: half the size of the full model and argmax-faithful to it, i.e. it predicts the same top class.

Training Data

Trained on an in-house dataset (German + English, 7 topic classes) built specifically for this model. The dataset will be published soon.

Benchmark

Results on the held-out test set (100% real data): accuracy, macro F1, and per-class F1.

Metric	Score
Accuracy	0.978
F1 (macro)	0.978
F1 legal	0.995
F1 source_code	0.985
F1 marketing	0.980
F1 internal_and_tech	0.977
F1 hr	0.971
F1 finance	0.970
F1 other	0.970
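
As a quick consistency check, macro F1 is the unweighted mean of the per-class F1 scores, so it can be recomputed from the table. The sketch below uses only the rounded values printed above; the helper name `macro_f1` is ours.

```python
# Per-class F1 scores copied from the benchmark table.
per_class_f1 = {
    "legal": 0.995,
    "source_code": 0.985,
    "marketing": 0.980,
    "internal_and_tech": 0.977,
    "hr": 0.971,
    "finance": 0.970,
    "other": 0.970,
}


def macro_f1(scores: dict[str, float]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)


print(round(macro_f1(per_class_f1), 3))  # 0.978
```

The result matches the reported macro F1 to three decimal places.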

Usage

from transformers import pipeline

clf = pipeline("text-classification", model="patronus-studio/orca-sonar-document-classifier")
clf("Fasse mir diesen Dienstleistungsvertrag zusammen: Laufzeit 24 Monate, Gerichtsstand München …")
# -> [{'label': 'legal', 'score': 0.99}]

ONNX

An FP16 ONNX version is available under onnx/onnx_fp16/:

import torch
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "patronus-studio/orca-sonar-document-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForSequenceClassification.from_pretrained(model_id, subfolder="onnx/onnx_fp16")

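# Tokenize a single input and run it through the ONNX model.
inputs = tokenizer("def add(a, b):\n    return a + b", return_tensors="pt")
logits = model(**inputs).logits
# Map the highest-scoring class index back to its label name.
print(model.config.id2label[int(torch.argmax(logits, dim=-1))])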
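
The ONNX snippet above prints only the label. If a confidence score is also needed, the logits can be passed through a softmax, which is what we assume the `transformers` text-classification pipeline does by default for a single-label model like this one. The following is a dependency-free sketch: `softmax`, `top_prediction`, and `ID2LABEL` are illustrative names, and the mapping is copied from the class table in this card rather than read from the model config.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# id-to-label mapping copied from the class table in this card.
ID2LABEL = {0: "finance", 1: "hr", 2: "internal_and_tech", 3: "legal",
            4: "marketing", 5: "other", 6: "source_code"}


def top_prediction(logits: list[float]) -> dict:
    """Return the best label and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}
```

With the tensors from the snippet above, this could be called as `top_prediction(logits[0].tolist())`.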

Citation

@misc{orcasonar2026,
  title={Orca-Sonar: Multilingual Document Topic Classification for Real-World AI Security},
  author={Patronus Protect},
  year={2026},
  howpublished={\url{https://huggingface.co/patronus-studio/orca-sonar-document-classifier}}
}

Model size: 0.1B params (Safetensors, tensor type F32)
Base model: jhu-clsp/mmBERT-small