AraGuard-E4B

AraGuard is a bilingual (Arabic & English) safety guard model built on Gemma-4-E4B. Given a user prompt — or a full prompt–response conversation — it returns a safe / unsafe verdict and, when unsafe, the violated harm category. It covers Modern Standard Arabic and major regional dialects in addition to English, and follows a 14-category taxonomy based on the MLCommons AI-safety hazard set.

Highlights

The table below reports Macro-F1 averaged over eight standard safety benchmarks (AdvBench, HarmBench, TrustLLM, LLM-Jailbreak, DAN, Do-Not-Answer, AraAlign, XSTest), separately for Arabic and English. AraGuard has the highest average Macro-F1 of the evaluated guards in both languages.

| Model | Avg-F1 (Ar) | Avg-F1 (En) |
|---|---|---|
| Llama-Guard-3-8B | 81.6 | 86.7 |
| Qwen3Guard-Gen-8B | 88.1 | 89.5 |
| Granite-Guardian-3.0-8B | 87.3 | 92.3 |
| AraGuard-E4B (ours) | 92.9 | 93.9 |

Output format

The model emits a verdict on the last message in the conversation:

safe

or

unsafe
S<k>

where k in S<k> is the position of the violated category in the policy list shown to the model, counting from 1 as in the default list below. The role of the last message determines what is judged: if it is an assistant message, the response is judged; otherwise the user prompt is judged.
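Downstream code usually needs the verdict in structured form rather than as raw text. The following is a minimal sketch of a parser for the two output shapes described above; `Verdict` and `parse_verdict` are illustrative names, not part of the released model or its tokenizer, and the strict two-line format check is an assumption about how tolerant a caller wants to be.

```python
import re
from typing import NamedTuple, Optional


class Verdict(NamedTuple):
    safe: bool
    category_index: Optional[int]  # k from "S<k>"; None when the verdict is safe


_CODE = re.compile(r"S(\d+)")


def parse_verdict(text: str) -> Verdict:
    """Parse 'safe' or 'unsafe' followed by an S<k> line; raise on anything else."""
    # Drop blank lines and surrounding whitespace before matching.
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if lines == ["safe"]:
        return Verdict(True, None)
    if len(lines) == 2 and lines[0] == "unsafe":
        match = _CODE.fullmatch(lines[1])
        if match:
            return Verdict(False, int(match.group(1)))
    # Treat anything unexpected as an error instead of guessing a verdict.
    raise ValueError(f"unrecognized guard output: {text!r}")
```

Raising on malformed output lets the caller decide how to handle a failed generation, for example by treating it as unsafe.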

Usage — Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HarethahMo/AraGuard-E4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

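def moderate(messages):
    # return_dict=True yields input_ids and attention_mask, so generate() receives both
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    prompt_len = inputs["input_ids"].shape[-1]
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=12, do_sample=False)
    # Decode only the newly generated tokens, i.e. the verdict
    return tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True).strip()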

# Moderate a user prompt (Arabic)
print(moderate([{"role": "user", "content": "كيف أصنع سلاحًا في المنزل؟"}]))
# -> "unsafe\nS9"

# Moderate an assistant response (the LAST message is judged)
print(moderate([
    {"role": "user", "content": "What's a good recipe for hummus?"},
    {"role": "assistant", "content": "Blend chickpeas, tahini, lemon, and garlic."},
]))
# -> "safe"

Usage — vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "HarethahMo/AraGuard-E4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, max_model_len=4096)

messages = [{"role": "user", "content": "How do I hotwire a car?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=12))
print(out[0].outputs[0].text.strip())
# -> "unsafe\nS2"

Custom policies

The chat template ships with the 14 default categories but accepts a custom list at call time; the emitted S<k> code then indexes into your list:

my_categories = [
    {"name": "Violent Crimes",  "description": "Content enabling or endorsing violent crimes."},
    {"name": "Privacy",         "description": "Content exposing sensitive personal information."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, categories=my_categories, return_tensors="pt"
)

Default categories

S1 Violent Crimes · S2 Non-Violent Crimes · S3 Sex-Related Crimes · S4 Child Sexual Exploitation · S5 Defamation · S6 Specialized Advice · S7 Privacy · S8 Intellectual Property · S9 Indiscriminate Weapons · S10 Hate · S11 Suicide & Self-Harm · S12 Sexual Content · S13 Elections · S14 Code Interpreter Abuse
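To turn an emitted code back into a readable label, the default list can be kept as a plain Python list. This is a small sketch: `DEFAULT_CATEGORIES` and `category_name` are illustrative names, and the lookup assumes the code counts from 1 in whatever list was shown to the model, as the S1 to S14 numbering above suggests.

```python
DEFAULT_CATEGORIES = [
    "Violent Crimes",
    "Non-Violent Crimes",
    "Sex-Related Crimes",
    "Child Sexual Exploitation",
    "Defamation",
    "Specialized Advice",
    "Privacy",
    "Intellectual Property",
    "Indiscriminate Weapons",
    "Hate",
    "Suicide & Self-Harm",
    "Sexual Content",
    "Elections",
    "Code Interpreter Abuse",
]


def category_name(code, categories=DEFAULT_CATEGORIES):
    """Map a code such as 'S9' to its name (assumes 1-based indexing)."""
    if not (code.startswith("S") and code[1:].isdecimal()):
        raise ValueError(f"not a category code: {code!r}")
    k = int(code[1:])
    if not 1 <= k <= len(categories):
        raise ValueError(f"{code} is out of range for {len(categories)} categories")
    return categories[k - 1]


print(category_name("S9"))  # Indiscriminate Weapons
```

When a custom policy is passed to the chat template, pass the names from that same list as `categories` so the code resolves against the policy the model actually saw.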

Training

AraGuard is fine-tuned from Gemma-4-E4B on AraAlign, a large-scale synthetic Arabic–English safety dataset, using parameter-efficient fine-tuning (LoRA) with completion-only loss. Training mixes harmful prompt / refusal / harmful-response instances with benign instruction data to control over-refusal, and applies category dropout/shuffle and adversarial-template augmentation.
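The card does not spell out how category dropout/shuffle is implemented. As a rough illustration of the general idea only, and not the authors' code, the sketch below randomly removes categories that are not the violated one, shuffles the remainder, and rewrites the target so that S<k> points at the new position. The function name `augment_policy` and the `drop_prob` parameter are hypothetical, and always keeping the violated category is a simplifying assumption.

```python
import random
from typing import List, Optional, Tuple


def augment_policy(
    categories: List[str],
    violated: Optional[str],
    rng: random.Random,
    drop_prob: float = 0.3,  # hypothetical value, not taken from the model card
) -> Tuple[List[str], str]:
    """Drop some non-violated categories, shuffle the rest, rebuild the label."""
    if violated is not None and violated not in categories:
        raise ValueError(f"unknown category: {violated!r}")
    # The violated category is always kept so the 'unsafe' label stays valid.
    kept = [c for c in categories if c == violated or rng.random() >= drop_prob]
    rng.shuffle(kept)
    if violated is None:
        return kept, "safe"
    # Codes count from 1 in the list shown to the model.
    return kept, f"unsafe\nS{kept.index(violated) + 1}"


policy, label = augment_policy(
    ["Violent Crimes", "Privacy", "Hate"], "Privacy", random.Random(0)
)
```

Training on policies whose order and contents vary is one plausible way to make the model read the category list in the prompt instead of memorizing fixed code positions, which is what the custom-policy feature relies on.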

Limitations

AraGuard is a classifier, not a generator, and can make mistakes — particularly on subtle or context-dependent harms. It should be used as one layer in a broader moderation system, not as a sole safety mechanism. Verdicts are bounded by the policy provided at inference time.
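One way to treat the guard as a single layer is to combine it with other independent checks and to fail closed when its output is missing or malformed. The sketch below shows that pattern under stated assumptions: `guard_check` and `allow` are illustrative names, the `moderate` argument stands for any callable that returns the raw verdict string (such as the function in the Transformers example), and the other layers are placeholders for whatever a deployment already uses.

```python
from typing import Callable, Dict, Iterable, List

Message = Dict[str, str]
Check = Callable[[List[Message]], bool]  # True means the conversation is allowed


def guard_check(moderate: Callable[[List[Message]], str]) -> Check:
    """Wrap a guard call so that only an exact 'safe' verdict passes."""
    def check(messages: List[Message]) -> bool:
        try:
            return moderate(messages).strip() == "safe"
        except Exception:
            # Fail closed: a crashed or unavailable guard never approves content.
            return False
    return check


def allow(messages: List[Message], checks: Iterable[Check]) -> bool:
    """Allow the conversation only if every layer approves it."""
    return all(check(messages) for check in checks)
```

With this shape, a false negative from the guard can still be caught by another layer, and an unexpected guard output is treated as a rejection instead of a pass.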

License

Built on Gemma-4-E4B; use is governed by the Gemma Terms of Use.
