Model Card for Orca-Sonar

Multilingual Document Topic Classifier for Real-World AI Security & DLP

Orca-Sonar is a Multilingual ModernBERT-based (mmBERT) classifier that assigns a document/text to one of 7 topic classes. It is part of the Patronus Protect security stack and is designed for topic-/risk-routing of incoming texts (e.g. before they reach an LLM, a DLP gate, or a storage tier).

It classifies German and English text and is robust to user-to-AI wrappers (e.g. "Summarize this contract: …"), i.e. the topic of the content determines the class, not the surface format of the request.

Intended Uses

The model maps an input text to one of:

| id | label | description |
| --- | --- | --- |
| 0 | finance | invoices, balance sheets, quarterly/annual reports, cash-flow, SEC filings, forecasts |
| 1 | hr | CVs, job ads, employment contracts, terminations, HR policies, performance reviews, recruiting |
| 2 | internal_and_tech | ADRs, RFCs, postmortems, specs, READMEs, wikis, architecture & strategy memos, runbooks |
| 3 | legal | contracts, NDAs, ToS/AGB, privacy policies, statutes/judgments, compliance, legal correspondence |
| 4 | marketing | press releases, newsletters, landing-page/sales copy, outbound pitches, case studies |
| 5 | other | conversational / non-business: smalltalk, recipes, travel, hobby, learning, creative |
| 6 | source_code | raw program code & configs (Python/Go/Rust/JS/TS/SQL/Bash/Dockerfile/k8s/Terraform …) |

Disambiguation: when two or more classes tie, the more sensitive class wins, in this order: legal > hr > finance > internal_and_tech > source_code > marketing > other.
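
The precedence order can be sketched as a small post-processing step over per-class scores. This is an illustration only: the card does not say how a tie is detected or whether the rule is applied during labeling or at inference time, so the function name `resolve_tie` and the `margin` threshold are assumptions.

```python
# Sensitivity order from the card, most sensitive first.
PRECEDENCE = ["legal", "hr", "finance", "internal_and_tech",
              "source_code", "marketing", "other"]

def resolve_tie(scores: dict, margin: float = 0.05) -> str:
    """Return the top label, preferring the more sensitive class on a near-tie.

    `margin` is a hypothetical threshold: every label scoring within
    `margin` of the best score is treated as tied.
    """
    best = max(scores.values())
    tied = [label for label, score in scores.items() if best - score <= margin]
    # Among the tied labels, pick the one that comes first in PRECEDENCE.
    return min(tied, key=PRECEDENCE.index)
```

With scores of 0.50 for marketing and 0.48 for legal, this sketch returns legal; with a clear winner it returns the top-scoring label unchanged.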

Limitations

  • Highly accurate on German and English; other languages were not actively tested.
  • The model can produce false positives; for high-stakes routing combine it with a confidence/abstention gate.
  • Robustness against adversarial / out-of-distribution / pure-PII / pathological-length inputs is partial; pair the model with a deterministic pre-gate (length + PII) for production DLP use.
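
The two mitigations above (a deterministic pre-gate and a confidence/abstention gate) could be combined roughly as follows. This is a minimal sketch under stated assumptions: the length limits, the PII patterns, the 0.8 PII ratio, the 0.9 threshold, and the names `pre_gate` and `route` are all hypothetical and not part of the model or the Patronus Protect stack.

```python
import re
from typing import Callable, Optional

# Hypothetical limits and patterns: tune them for your own deployment.
MIN_CHARS, MAX_CHARS = 20, 20_000
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),                 # e-mail address
    re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){3,7}\b"),  # IBAN-like string
]

def pre_gate(text: str) -> Optional[str]:
    """Return a reason to skip the classifier, or None if the text may pass."""
    stripped = text.strip()
    if len(stripped) < MIN_CHARS:
        return "too_short"
    if len(stripped) > MAX_CHARS:
        return "too_long"
    # Reject inputs that consist almost entirely of PII matches.
    pii_chars = sum(len(m.group()) for p in PII_PATTERNS for m in p.finditer(stripped))
    if pii_chars / len(stripped) > 0.8:
        return "mostly_pii"
    return None

def route(text: str, classify: Callable[[str], dict], threshold: float = 0.9) -> dict:
    """Run the pre-gate, then the classifier, and abstain below `threshold`."""
    reason = pre_gate(text)
    if reason is not None:
        return {"label": None, "reason": reason}
    pred = classify(text)  # expected shape: {"label": str, "score": float}
    if pred["score"] < threshold:
        return {"label": None, "reason": "low_confidence"}
    return {"label": pred["label"], "reason": "ok"}
```

Here `classify` is any callable that returns a label and a score; with a `transformers` text-classification pipeline `clf`, it could be `lambda t: clf(t)[0]`. A `None` label means the input should go to a fallback path such as manual review.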

Model Variants

  • orca-sonar – full model (model.safetensors, fp32).
  • orca-sonar-fp16 (ONNX) – FP16 ONNX export under onnx/onnx_fp16/; half the size, and argmax-faithful to the full model (it predicts the same top class).
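
One way to check the argmax claim on your own data is to compare the top-1 class of both variants over a sample of texts. The helper below is a sketch, not the procedure used by the authors: `argmax_agreement` is a hypothetical name, and it takes logits as plain nested lists (one row per text) so that it does not depend on any framework.

```python
def argmax_agreement(logits_a, logits_b):
    """Return the fraction of rows whose argmax index matches in both inputs."""
    if not logits_a or len(logits_a) != len(logits_b):
        raise ValueError("need two non-empty logit lists of equal length")

    def argmax(row):
        # Index of the largest value in the row.
        return max(range(len(row)), key=row.__getitem__)

    same = sum(argmax(a) == argmax(b) for a, b in zip(logits_a, logits_b))
    return same / len(logits_a)
```

A result of 1.0 on a representative sample means the two variants picked the same class for every text, even if the raw logits differ slightly because of the reduced precision.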

Training Data

Trained on our own in-house dataset (German + English, 7 topic classes), purpose-built for this model. The dataset will be published soon.

Benchmark

Held-out test set (100 % real data), per-class F1:

| Metric | Score |
| --- | --- |
| Accuracy | 0.978 |
| F1 (macro) | 0.978 |
| F1 legal | 0.995 |
| F1 source_code | 0.985 |
| F1 marketing | 0.980 |
| F1 internal_and_tech | 0.977 |
| F1 hr | 0.971 |
| F1 finance | 0.970 |
| F1 other | 0.970 |
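
Macro F1 is the unweighted mean of the per-class F1 scores, so the reported value can be reproduced from the per-class figures. The short check below uses only the numbers listed above; the variable names are illustrative.

```python
# Per-class F1 scores as reported in the benchmark.
per_class_f1 = {
    "legal": 0.995,
    "source_code": 0.985,
    "marketing": 0.980,
    "internal_and_tech": 0.977,
    "hr": 0.971,
    "finance": 0.970,
    "other": 0.970,
}

# Macro F1 weights every class equally, regardless of class size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.978
```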

Usage

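```python
from transformers import pipeline

# Load the full (fp32) model as a text-classification pipeline.
clf = pipeline("text-classification", model="patronus-studio/orca-sonar-document-classifier")

# German input wrapped in a user-to-AI request, roughly: "Summarize this service
# contract for me: term 24 months, place of jurisdiction Munich …"
clf("Fasse mir diesen Dienstleistungsvertrag zusammen: Laufzeit 24 Monate, Gerichtsstand München …")
# -> [{'label': 'legal', 'score': 0.99}]
```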

ONNX

An FP16 ONNX version is available under onnx/onnx_fp16/:

```python
import torch
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "patronus-studio/orca-sonar-document-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the FP16 ONNX export from its subfolder in the model repository.
model = ORTModelForSequenceClassification.from_pretrained(model_id, subfolder="onnx/onnx_fp16")

inputs = tokenizer("def add(a, b):\n    return a + b", return_tensors="pt")
logits = model(**inputs).logits
# Single input, so argmax yields one class index; map it to its label name.
print(model.config.id2label[int(torch.argmax(logits, dim=-1))])
```
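
The ONNX snippet prints only the label. If a confidence score is also needed, for example for an abstention gate, the logits can be turned into probabilities with a softmax, which is what the `text-classification` pipeline typically applies for single-label models. The sketch below is framework-free; `softmax`, `top_prediction` and the `ID2LABEL` constant are illustrative names, with the mapping copied from the label table in this card.

```python
import math

# Label mapping as listed in this card (assumed to match model.config.id2label).
ID2LABEL = {0: "finance", 1: "hr", 2: "internal_and_tech", 3: "legal",
            4: "marketing", 5: "other", 6: "source_code"}

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return id2label[idx], probs[idx]
```

With the ONNX example, the row of logits for the single input could be passed as `top_prediction(logits[0].tolist())`.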

Citation

@misc{orcasonar2026,
  title={Orca-Sonar: Multilingual Document Topic Classification for Real-World AI Security},
  author={Patronus Protect},
  year={2026},
  howpublished={\url{https://huggingface.co/patronus-studio/orca-sonar-document-classifier}}
}

Model size: 0.1B params (Safetensors, tensor type F32)