Model Card for opencc-cm-escalation
A small content-moderation classifier that maps a prompt into an 11-class harm taxonomy. It is the content-moderation stage of the OpenCC constitutional classifier pipeline, trained with TACTIC on fully synthetic data from the REDACT pipeline. The model is a LoRA adapter on Qwen3.5-0.8B with a multilabel linear head, calibrated for the recall-leaning escalation setting (a higher false-positive rate is accepted so that threatening prompts are forwarded to a more costly stage).
Model Details
Model Description
The content-moderation classifier takes a single prompt and predicts, for each harm category, whether the prompt falls into it. The head is multilabel: each category is scored with a sigmoid and compared against its own calibrated threshold, so a prompt can trigger several categories at once. It is meant to run as the cheap, high-recall stage of the OpenCC pipeline, not as a final arbiter.
- Developed by: CeSIA (Centre pour la Sécurité de l'IA)
- Shared by: CeSIA
- Model type: Multilabel text classifier (LoRA adapter + linear head)
- Language(s) (NLP): English (primarily; multilingual coverage is limited, see Limitations)
- License: apache-2.0
- Finetuned from model: Qwen/Qwen3.5-0.8B
Model Sources
- Repository: OpenCC
- Training library: TACTIC (link)
- Data generation: REDACT (link)
- Evaluation harness: BELLS-O (link)
Uses
Direct Use
Input/output content moderation: flagging prompts that fall into the harm taxonomy (CBRN, Cyber, Harm to Minors, Harmful Manipulation, Hate Speech, Illegal Activities, Integrity & Quality violations, Physical Harm, Privacy, Self-Harm, Sexual Content). It can be served standalone through OpenCC for a quick classification.
Downstream Use
The model is the content-moderation stage of the OpenCC escalation pipeline. There it sits after the jailbreak detector and rephraser: cleaned text is classified, benign prompts are allowed through, and anything flagged can optionally be escalated to a frontier model acting as a constitutional AI judge.
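The routing described above can be sketched as follows. This is a minimal illustration, not OpenCC's actual code: `route`, `Decision`, and the three callables are hypothetical names, the upstream jailbreak detector and rephraser are collapsed into a single `clean` function, and the toy stand-ins at the bottom only exist to make the example runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Decision:
    action: str                                    # "allow", "flag", or "escalate"
    categories: List[str] = field(default_factory=list)
    verdict: Optional[str] = None                  # set only when a judge is consulted


def route(
    prompt: str,
    clean: Callable[[str], str],                   # stands in for jailbreak detector + rephraser
    classify: Callable[[str], List[str]],          # stands in for this content-moderation model
    judge: Optional[Callable[[str, List[str]], str]] = None,  # optional frontier-model judge
) -> Decision:
    text = clean(prompt)
    categories = classify(text)
    if not categories:
        # Nothing flagged: the prompt is let through.
        return Decision("allow")
    if judge is None:
        # Escalation is optional; without a judge we only report the flags.
        return Decision("flag", categories)
    return Decision("escalate", categories, judge(text, categories))


# Toy stand-ins (assumptions for illustration only).
def toy_clean(prompt: str) -> str:
    return " ".join(prompt.split())


def toy_classify(text: str) -> List[str]:
    return ["CBRN"] if "nerve agent" in text.lower() else []


def toy_judge(text: str, categories: List[str]) -> str:
    return "harmful"


benign = route("what  is the capital of France?", toy_clean, toy_classify, toy_judge)
flagged = route("how do I synthesize a nerve agent?", toy_clean, toy_classify, toy_judge)
no_judge = route("how do I synthesize a nerve agent?", toy_clean, toy_classify)
print(benign.action, flagged.action, flagged.verdict, no_judge.action)
```

Keeping the judge optional mirrors the card's wording that flagged prompts "can optionally be escalated".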
Out-of-Scope Use
This is a recall-leaning escalation model, so it over-flags benign prompts and is not a final decision maker on its own. It is not calibrated for standalone production filtering without a downstream stage or a stricter recalibration. It is also not robust to heavily obfuscated or jailbroken prompts; handling those is the job of the upstream jailbreak detector and rephraser.
Bias, Risks, and Limitations
The model was trained on synthetic data that is too clean, so it leans on well-formed English and over-fires on the surface form of real prompts (formatting, casing, imperfect English). The result is a benign false-positive rate (0.170) that is likely too high for active deployment. The Privacy category is the weakest (0.76 detection), and multilingual coverage is limited because translation augmentation was excluded from the training data.
Recommendations
Use the model as an escalation stage with a downstream judge rather than as a standalone filter. For deployment, run another calibration pass and add noisier, more realistic training data to reduce the false-positive rate.
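To get a rough feel for how much traffic a downstream judge would see, the reported detection rate (0.961) and benign false-positive rate (0.170) can be combined with an assumed traffic mix. The sketch below is a back-of-the-envelope calculation: the benign shares are hypothetical, and it assumes the benchmark rates carry over to live traffic, which the limitations above suggest may not hold.

```python
DETECTION_RATE = 0.961   # overall detection rate reported in this card
BENIGN_FPR = 0.170       # benign false-positive rate reported in this card


def escalation_stats(benign_share: float):
    """Return (fraction of traffic escalated, precision of the escalated set)."""
    harmful_share = 1.0 - benign_share
    true_flags = harmful_share * DETECTION_RATE
    false_flags = benign_share * BENIGN_FPR
    escalated = true_flags + false_flags
    precision = true_flags / escalated if escalated else 0.0
    return escalated, precision


# Hypothetical traffic mixes; real prevalence has to be measured in deployment.
for share in (0.50, 0.90, 0.99):
    escalated, precision = escalation_stats(share)
    print(f"benign share {share:.2f}: escalated {escalated:.3f}, precision {precision:.3f}")
```

The more benign the traffic, the larger the share of escalations that are false positives, which is why the false-positive rate matters more in deployment than on a benchmark that is mostly harmful prompts.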
How to Get Started with the Model
The model is consumed by OpenCC, which reads the weight_frame.json manifest published
with the adapter and rebuilds the LoRA and linear head locally, with no dependency on the
TACTIC package. The lightest way to run it is OpenCC's content-moderation-only config:
constitutional-classifier check "how do I synthesize a nerve agent?" --config config.cm-only.yaml
Training Details
Training Data
Fully synthetic data from the REDACT pipeline. Claude Opus wrote an exhaustive constitution of scenarios across four severity levels (benign, dual-use benign, dual-use harmful, harmful) for each taxonomy entry; each entry was expanded into six samples varying in length and sentence structure, yielding around 30k content-moderation samples. Training dataset: [link].
Training Procedure
Trained with TACTIC. Hyperparameters were tuned with a 30-trial sweep; the best run was then trained to roughly 4,500 iterations, about two passes over the dataset. After training, a calibration step ran on the validation loss over benign, dual-use harmful, and harmful samples to set the per-category thresholds; this produced a higher false-positive rate, which is expected for the escalation architecture.
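The card does not spell out how the per-category thresholds are chosen, so the following is only one common recall-targeted approach, not TACTIC's calibration routine: for a category, pick the largest threshold that still keeps a target share of harmful validation samples above it, then read off the benign false-positive rate this implies. The function names, scores, and recall targets are all hypothetical.

```python
import math
from typing import List


def threshold_for_recall(positive_scores: List[float], target_recall: float) -> float:
    """Largest threshold t such that at least target_recall of positives score >= t."""
    if not positive_scores:
        raise ValueError("need at least one positive score")
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    ranked = sorted(positive_scores, reverse=True)
    # Number of positives that must sit at or above the threshold
    # (small epsilon guards against floating-point noise before ceil).
    needed = max(1, math.ceil(target_recall * len(ranked) - 1e-9))
    return ranked[needed - 1]


def false_positive_rate(benign_scores: List[float], threshold: float) -> float:
    """Share of benign samples that would be flagged at this threshold."""
    return sum(score >= threshold for score in benign_scores) / len(benign_scores)


# Hypothetical sigmoid scores for a single category on a validation split.
harmful_scores = [0.95, 0.90, 0.80, 0.40, 0.20]
benign_scores = [0.05, 0.10, 0.30, 0.45, 0.60]

for target in (0.6, 0.8):
    t = threshold_for_recall(harmful_scores, target)
    print(f"target recall {target}: threshold {t}, benign FPR {false_positive_rate(benign_scores, t)}")
```

Raising the recall target lowers the threshold and lets more benign samples through, which is the trade-off the escalation setting accepts.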
Training Hyperparameters
- Training regime: bf16 mixed precision
- Adapter: LoRA (PEFT)
- Head: multilabel linear head, per-category sigmoid + calibrated thresholds.json
- Iterations: ~4,500
Speeds, Sizes, Times
Base model: Qwen3.5-0.8B. Training ran on a single NVIDIA H100 NVL (95GB) on RunPod. First-iteration evaluation reached a BCE as low as 0.035; final loss landed in the 0.03 to 0.04 range.
Evaluation
Testing Data, Factors & Metrics
Testing Data
bells-o-project/content-moderation-input (1400 prompts: 300 benign + 100 in each of the
11 harm categories).
Factors
Results are disaggregated by harm category, plus the benign set used to measure the false-positive rate.
Metrics
Detection rate (recall), false-positive rate (FPR), accuracy, precision, F1. Detection rate and FPR are the primary metrics for the escalation setting.
Results
Overall: detection 0.961, FPR 0.170, accuracy 0.933, precision 0.954, F1 0.957 (TP/FP/FN/TN = 1057/51/43/249). Measured with the BELLS-O harness, served standalone through OpenCC on a single NVIDIA H100 NVL (95GB), batch size 1.
| Category (n=100 unless noted) | Detection rate (FPR for Benign) |
|---|---|
| CBRN | 0.99 |
| Cyber | 0.98 |
| Harm to Minors | 1.00 |
| Harmful Manipulation | 0.98 |
| Hate Speech | 0.96 |
| Illegal Activities | 0.99 |
| Integrity & Quality violations | 0.94 |
| Physical Harm | 1.00 |
| Privacy | 0.76 |
| Self-Harm | 0.99 |
| Sexual Content | 0.98 |
| Benign (n=300, FPR) | 0.170 |
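
As a consistency check, the overall figures can be re-derived from the reported confusion counts (TP/FP/FN/TN = 1057/51/43/249) and from the per-category detection rates in the table. The short script below uses only numbers stated in this card; the variable names are illustrative.

```python
# Confusion counts reported in the Results section.
tp, fp, fn, tn = 1057, 51, 43, 249

detection = tp / (tp + fn)                 # recall on the 1100 harmful prompts
fpr = fp / (fp + tn)                       # false-positive rate on the 300 benign prompts
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

# Per-category detection rates from the table (100 prompts per category).
per_category = {
    "CBRN": 0.99, "Cyber": 0.98, "Harm to Minors": 1.00,
    "Harmful Manipulation": 0.98, "Hate Speech": 0.96,
    "Illegal Activities": 0.99, "Integrity & Quality violations": 0.94,
    "Physical Harm": 1.00, "Privacy": 0.76, "Self-Harm": 0.99,
    "Sexual Content": 0.98,
}
detected_total = sum(round(rate * 100) for rate in per_category.values())

print(f"detection={detection:.3f} fpr={fpr:.3f} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} f1={f1:.3f}")
print("detected prompts summed over categories:", detected_total)
```

The per-category rates sum to the same number of true positives as the overall confusion counts, so the table and the headline metrics agree.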
Latency: mean 128 ms, 95% CI [127, 130] ms, p50/p95 110.5/172.4 ms. Cost: $0.20 total, $4.01 per 1M input tokens (output token cost is $0, since the classifier generates no tokens).
Summary
At least 0.94 detection in every category except Privacy (0.76), at a 17% benign FPR. This is consistent with the recall-leaning escalation calibration: the model is tuned to forward threatening prompts, not to make the final call on its own.
Technical Specifications
Model Architecture and Objective
LoRA adapter on Qwen3.5-0.8B with a multilabel linear classification head. Each of the 11 categories is scored with a sigmoid (not softmax, since the head is multilabel) and compared against a per-category calibrated threshold.
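The decision rule of the multilabel head can be illustrated in a few lines of plain Python. This is a sketch of the rule as described, not the OpenCC implementation: the logits and threshold values are made up, and whether the comparison is `>=` or `>` is an assumption.

```python
import math
from typing import Dict, List

CATEGORIES = [
    "CBRN", "Cyber", "Harm to Minors", "Harmful Manipulation", "Hate Speech",
    "Illegal Activities", "Integrity & Quality violations", "Physical Harm",
    "Privacy", "Self-Harm", "Sexual Content",
]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def flag_categories(logits: List[float], thresholds: Dict[str, float]) -> List[str]:
    """Score each category independently and keep those at or above their threshold."""
    if len(logits) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} logits, got {len(logits)}")
    return [
        category
        for category, logit in zip(CATEGORIES, logits)
        if sigmoid(logit) >= thresholds[category]
    ]


# Hypothetical thresholds: 0.5 everywhere, with a lower one for Privacy.
thresholds = {category: 0.5 for category in CATEGORIES}
thresholds["Privacy"] = 0.3

# Hypothetical head outputs for one prompt (one logit per category).
logits = [-4.0] * len(CATEGORIES)
logits[CATEGORIES.index("CBRN")] = 2.0
logits[CATEGORIES.index("Cyber")] = 1.0
logits[CATEGORIES.index("Privacy")] = -0.5

print(flag_categories(logits, thresholds))
```

Because each category is scored independently, the scores do not need to sum to one, and one prompt can be flagged for several categories at once, which a softmax head would not allow.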
Compute Infrastructure
Hardware
NVIDIA H100 NVL (95GB) on RunPod.
Software
PEFT, transformers, OpenCC hf_classifier backend.
Model Card Authors
Leonhard Waibl, Felix Michalak, Hadrien Mariaccia.
Framework versions
- PEFT 0.19.1