multilingual-e5-small-refusal-classifier / README.md

agentlans

metadata

license: mit
base_model:
  - agentlans/multilingual-e5-small-aligned-v2
language:
  - en
  - zh
  - fr
  - pt
  - es
  - ja
  - tr
  - ru
  - ar
  - ko
  - th
  - it
  - de
  - vi
  - ms
  - id
  - fil
  - hi
  - pl
  - cs
  - nl
  - km
  - my
  - fa
  - gu
  - ur
  - te
  - mr
  - he
  - bn
  - ta
  - uk
  - bo
  - kk
  - mn
  - ug
  - yue
datasets:
  - agentlans/refusal-classifier-data
pipeline_tag: text-classification
tags:
  - text-classification
  - multilingual
  - refusal-detection
  - alignment
  - conversation-analysis
  - fine-tuned-model
  - ethics
  - ai-safety
  - e5
  - transformer
  - huggingface
  - research

Multilingual Refusal Classifier

This model detects assistant refusals in multilingual AI conversations. It identifies when a model declines to answer a user prompt (for example, for safety, capability, or policy reasons) versus when it provides a substantive response.

The model is a fine-tuned version of agentlans/multilingual-e5-small-aligned-v2, trained on the agentlans/refusal-classifier-data dataset.

Evaluation results:

Loss: 0.2665
Accuracy: 0.9153
Training tokens: 5,347,200

Usage

The classifier expects a conversation serialized as plain text, with each turn prefixed by a role token (<|system|>, <|user|>, or <|assistant|>).
For long texts, replace the omitted middle portion with the ellipsis placeholder <|...|>.

Supported input formats:

<|system|>System prompt<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...
<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...
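The two formats above can be produced from a list of chat messages. The following is a minimal sketch, not part of the model's own tooling: the `format_conversation` helper and the `role`/`content` dictionary layout are assumptions chosen for illustration.

```python
from typing import Dict, List

# Role tokens taken from the supported input formats above.
ALLOWED_ROLES = ("system", "user", "assistant")


def format_conversation(messages: List[Dict[str, str]]) -> str:
    """Join chat messages into one string of <|role|>content segments.

    Hypothetical helper: the message layout ({"role": ..., "content": ...})
    is an assumption, not something the model card prescribes.
    """
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"Unsupported role: {role!r}")
        parts.append(f"<|{role}|>{message['content']}")
    return "".join(parts)


example = format_conversation([
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
])
print(example)
# <|user|>What is 2 + 2?<|assistant|>2 + 2 = 4.
```

The resulting string can be passed directly to the classifier pipeline.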

Example:

from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="agentlans/multilingual-e5-small-refusal-classifier"
)

# Single-turn conversation; <|...|> marks content omitted from the long response.
text = (
    "<|user|>Mr. Loyd wants to fence his square-shaped land of 150 sqft each side. "
    "If a pole is laid every certain distance, he needs 30 poles. "
    "What is the distance between each pole in feet?"
    "<|assistant|>If Mr. Loyd's land is square-shaped and each side is 150 sqft, then<|...|>"
    "ce between poles ≈ 20.69 sqft\n\nTherefore, the distance between each pole is approximately 20.69 feet."
)

print(classifier(text))
# [{'label': 'Non-refusal', 'score': 0.9906}]
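When many conversations are classified at once, the per-item results can be reduced to a single refusal rate. The sketch below assumes the pipeline returns one `{'label', 'score'}` dictionary per input, as in the output above. Only the `Non-refusal` label is confirmed by that output, so every other label is counted as a refusal; the `Refusal` label in the sample data is an assumption, and `refusal_rate` is a hypothetical helper.

```python
from typing import Dict, List

# The only label confirmed by the example output above.
NON_REFUSAL_LABEL = "Non-refusal"


def refusal_rate(predictions: List[Dict[str, object]]) -> float:
    """Return the fraction of predictions not labeled as non-refusals."""
    if not predictions:
        return 0.0
    refusals = sum(1 for p in predictions if p["label"] != NON_REFUSAL_LABEL)
    return refusals / len(predictions)


# Made-up results shaped like classifier(list_of_texts) output.
# The "Refusal" label name is an assumption for illustration.
sample = [
    {"label": "Non-refusal", "score": 0.99},
    {"label": "Refusal", "score": 0.97},
    {"label": "Non-refusal", "score": 0.88},
    {"label": "Refusal", "score": 0.91},
]
print(refusal_rate(sample))
```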

Evaluation Results

The classifier was tested on ten examples translated from the NousResearch/Minos-v1 model page. Full examples are available in Examples.md.

🚫 — The model predicted a refusal to answer.
◯ — The model predicted a valid response.

Example	English	French	Spanish	Chinese	Russian	Arabic
1	🚫	🚫	🚫	🚫	🚫	🚫
2	🚫	🚫	🚫	🚫	🚫	🚫
3	🚫	🚫	🚫	🚫	🚫	🚫
4	🚫	🚫	🚫	🚫	🚫	🚫
5	🚫	🚫	🚫	🚫	🚫	🚫
6	◯	◯	◯	◯	◯	◯
7	◯	◯	◯	◯	◯	◯
8	◯	◯	◯	◯	◯	◯
9	◯	🚫	◯	◯	🚫	🚫
10	◯	◯	◯	◯	◯	◯

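Predictions agree across all six languages on nine of the ten examples. Example 9 is the exception: it is classified as a refusal in French, Russian, and Arabic but as a valid response in English, Spanish, and Chinese. Some false positives therefore remain, especially in contexts with ambiguous phrasing.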
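The table can also be checked programmatically. The sketch below transcribes the predictions from the table and reports which examples receive different labels across languages, along with how often each language matches the English prediction. Using English as the reference column is an assumption made for illustration, since the examples were translated from English; it says nothing about which prediction is correct.

```python
from typing import Dict, List

LANGUAGES = ["English", "French", "Spanish", "Chinese", "Russian", "Arabic"]

# True = predicted refusal, False = predicted valid response.
# Rows transcribed from the table above, in LANGUAGES order.
PREDICTIONS: Dict[int, List[bool]] = {
    1: [True] * 6,
    2: [True] * 6,
    3: [True] * 6,
    4: [True] * 6,
    5: [True] * 6,
    6: [False] * 6,
    7: [False] * 6,
    8: [False] * 6,
    9: [False, True, False, False, True, True],
    10: [False] * 6,
}


def disagreeing_examples(predictions: Dict[int, List[bool]]) -> List[int]:
    """Return the examples whose predicted label differs between languages."""
    return [ex for ex, row in predictions.items() if len(set(row)) > 1]


def agreement_with_english(predictions: Dict[int, List[bool]]) -> Dict[str, float]:
    """Return, per language, the fraction of examples matching English."""
    result = {}
    for col, lang in enumerate(LANGUAGES[1:], start=1):
        matches = sum(row[col] == row[0] for row in predictions.values())
        result[lang] = matches / len(predictions)
    return result


print(disagreeing_examples(PREDICTIONS))
print(agreement_with_english(PREDICTIONS))
```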

Limitations

Input length: 512-token maximum
False positives/negatives: Occur occasionally, as with the Minos classifier
Low-resource languages: May yield inconsistent predictions
Cultural variation: Expressions of refusal differ linguistically, which can affect accuracy
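One way to work within the 512-token limit is to shorten long message content using the <|...|> placeholder described under Usage. The sketch below cuts by characters, which is an assumption: the actual limit is measured in tokens, so `max_chars` is only a rough proxy and would need tuning against the tokenizer. `truncate_middle` is a hypothetical helper, not part of the model's tooling.

```python
PLACEHOLDER = "<|...|>"


def truncate_middle(text: str, max_chars: int, placeholder: str = PLACEHOLDER) -> str:
    """Keep the start and end of text, replacing the middle with a placeholder.

    Character-based approximation of the 512-token limit (an assumption).
    """
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(placeholder)
    if keep <= 0:
        raise ValueError("max_chars must exceed the placeholder length")
    head = (keep + 1) // 2  # the start gets the extra character when keep is odd
    tail = keep - head
    return text[:head] + placeholder + (text[-tail:] if tail else "")


print(truncate_middle("abcdefghijklmnopqrstuvwxyz", 15))
# abcd<|...|>wxyz
```

Applying the helper to the content of an individual message, rather than to the fully formatted conversation, avoids cutting through a role token.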

Training Details

Hyperparameters

Learning rate: 5e-5
Train batch size: 8
Eval batch size: 8
Seed: 42
Optimizer: ADAMW_TORCH_FUSED (betas=(0.9, 0.999), epsilon=1e-8)
Scheduler: Linear
Epochs: 5
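For reference, a linear scheduler decays the learning rate from its initial value to zero over the course of training. The sketch below illustrates that shape using the learning rate listed above. The total step count is made up, and the absence of warmup is an assumption, since the model card states neither.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5,
              warmup_steps: int = 0) -> float:
    """Learning rate at a given step under linear decay to zero.

    warmup_steps=0 is an assumption; the model card does not mention warmup.
    """
    if warmup_steps and step < warmup_steps:
        # Optional linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1000 optimizer steps.
for step in (0, 500, 1000):
    print(step, linear_lr(step, total_steps=1000))
```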

Framework Versions

Transformers 5.0.0.dev0
PyTorch 2.9.1+cu128
Datasets 4.4.1
Tokenizers 0.22.1

Intended Use

This model is designed for:

Identifying AI refusals during conversation analysis.
Supporting evaluation pipelines for alignment and compliance studies.
Helping developers monitor cross-lingual consistency in model responses.

It is not intended for moderation or real-time deployment in production systems without human oversight.