How to use Nid4l/X-Guard-Bench with Transformers:

    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-classification", model="Nid4l/X-Guard-Bench")

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("Nid4l/X-Guard-Bench")
    model = AutoModelForSequenceClassification.from_pretrained("Nid4l/X-Guard-Bench")
Access X-Guard Model
This repository is publicly accessible, but you have to accept the conditions to access its files and content.
This model is released under the Apache 2.0 License.
Access is granted on an individual basis. Please provide your details below.
We collect this information to understand the usage of this research artifact. Your information will not be shared with third parties.
Model Card for X-Guard
Model Description
X-Guard is a compact, high-throughput harmful-content classifier designed for pre-generation filtering in Large Language Models (LLMs). Built on a RoBERTa-base encoder, it is fine-tuned to detect harmful prompts and adversarial jailbreak attempts. Training uses a min-max adversarial fine-tuning loop based on the Fast Gradient Method (FGM), combined with Explainable AI (xAI) regularization based on Layer Integrated Gradients (LIG), to improve robustness and transparency. It is a core component of the Block and Breach framework.
- Model Type: Transformer-based text classification (Encoder-only)
- Base Model: FacebookAI/roberta-base (MIT License)
- Language(s): English
- License: Apache 2.0
- Parameters: 125 Million
- Disk Size: ~500 MB (FP16)
Intended Uses & Limitations
Direct Use
X-Guard is intended as a pre-generation filter for LLMs. Given an input prompt, it outputs a binary classification:
- LABEL_1: Harmful
- LABEL_0: Benign
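As a rough illustration of the pre-generation gate described above, the sketch below maps the classifier's raw labels to a block/allow decision. The helper names `load_classifier`, `interpret`, and `gate_prompt`, and the 0.5 threshold, are assumptions made for illustration and are not part of the released model. `classifier` stands for any callable that behaves like a Transformers text-classification pipeline, returning a list of dicts with `label` and `score` keys.

```python
from typing import Callable, Dict, List

# Label mapping as stated in the model card.
LABEL_NAMES = {"LABEL_0": "Benign", "LABEL_1": "Harmful"}


def load_classifier():
    """Build the real pipeline (needs `transformers` and access to the gated repo)."""
    # Imported lazily so the helpers below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model="Nid4l/X-Guard-Bench")


def interpret(prediction: Dict, threshold: float = 0.5) -> Dict:
    """Turn one pipeline result into a readable verdict.

    `threshold` is a hypothetical knob: a prompt is blocked only when the
    model predicts LABEL_1 with at least this score.
    """
    name = LABEL_NAMES[prediction["label"]]
    blocked = name == "Harmful" and prediction["score"] >= threshold
    return {"verdict": name, "score": prediction["score"], "blocked": blocked}


def gate_prompt(prompt: str, classifier: Callable[[List[str]], List[Dict]]) -> Dict:
    """Classify a single prompt before it is forwarded to the LLM."""
    prediction = classifier([prompt])[0]
    return interpret(prediction)
```

With the real model this would be called as `gate_prompt(user_prompt, load_classifier())`, and only prompts whose result has `blocked` set to `False` would be passed on to the LLM.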
Out-of-Scope Use
- Do not use as a standalone content moderator without human oversight.
- Not designed for detecting non-textual harm (e.g., images, audio).
- Not intended for real-time systems without evaluating latency on target hardware.
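One way to evaluate latency on target hardware, as the last point suggests, is a small timing harness like the sketch below. The function name `measure_latency`, the warm-up and run counts, and the choice of statistics are illustrative assumptions. `classify` is any callable that accepts a list of prompts, such as a Transformers pipeline.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def measure_latency(
    classify: Callable[[List[str]], object],
    prompts: Sequence[str],
    warmup: int = 3,
    runs: int = 20,
) -> Dict[str, float]:
    """Time single-prompt calls and report latency statistics in milliseconds."""
    if not prompts or runs < 1:
        raise ValueError("need at least one prompt and one timed run")

    # Warm-up calls are not timed, so one-off costs (lazy init, caches) do not skew results.
    for i in range(warmup):
        classify([prompts[i % len(prompts)]])

    timings_ms = []
    for i in range(runs):
        prompt = prompts[i % len(prompts)]
        start = time.perf_counter()
        classify([prompt])
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    # Simple index-based approximation of the 95th percentile.
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }
```

Running it with prompts of realistic length matters, since inference cost grows with the number of tokens per prompt.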
Limitations
- The model was trained and evaluated on an English-only dataset.
- Performance on prompts outside the distribution of the training data (e.g., extremely long prompts) may degrade.
- The current version supports a maximum sequence length of 256 tokens.
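The 256-token limit can be respected at inference time with the tokenizer's standard truncation arguments. The following is a minimal sketch: the helper names `encode_for_xguard` and `exceeds_limit` are hypothetical, padding to the full length is an assumption that mirrors the fixed training length rather than a documented requirement, and `return_tensors="pt"` assumes PyTorch is installed.

```python
from typing import Any, List

MAX_LENGTH = 256  # maximum sequence length stated in the model card


def encode_for_xguard(tokenizer: Any, prompts: List[str]) -> Any:
    """Tokenize prompts, truncating anything beyond MAX_LENGTH tokens."""
    return tokenizer(
        prompts,
        truncation=True,
        max_length=MAX_LENGTH,
        padding="max_length",  # assumption: pad to the fixed length used in training
        return_tensors="pt",
    )


def exceeds_limit(tokenizer: Any, prompt: str) -> bool:
    """Return True if the prompt would lose tokens to truncation."""
    ids = tokenizer(prompt, add_special_tokens=True)["input_ids"]
    return len(ids) > MAX_LENGTH
```

Checking `exceeds_limit` before classification makes it possible to log or separately review prompts whose tail the model never sees.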
Training Details
Training Data
X-Guard was fine-tuned on a stratified 25% subsample of a curated dataset compiled from five open-source corpora: WildJailbreak, GenTelBench-v1, HarmBench Prompt Injection, JailbreakBench, and AdvBench. The total training set used was approximately 85k prompts after stratification.
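A stratified subsample of this kind can be sketched in a few lines. The card does not say which attribute the strata are based on, so the example assumes stratification by class label; stratifying by source corpus would work the same way with a different key. The function name and seed are illustrative.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def stratified_subsample(
    labels: Sequence[Hashable], fraction: float = 0.25, seed: int = 0
) -> List[int]:
    """Return sorted indices of a subsample that keeps each label's share roughly fixed."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    # Group example indices by their stratum.
    by_label: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    chosen: List[int] = []
    for label in sorted(by_label, key=repr):  # fixed order for reproducibility
        indices = by_label[label]
        keep = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, keep))
    return sorted(chosen)
```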
Training Procedure
The model was fine-tuned for 3 epochs using the AdamW optimizer with a learning rate of 1e-5, a batch size of 16, and a fixed sequence length of 256. A two-stage stratified random split (70/15/15, train/validation/test) was applied to the training data, with the test set held out for final evaluation.
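The two-stage split can be illustrated as follows. This is a sketch under stated assumptions: the three parts are taken to be train, validation, and test; stage one is assumed to cut 70/30 and stage two to halve the remaining 30%; and strata are assumed to be class labels. The function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def _stratified_cut(
    indices: List[int], labels: Sequence[Hashable], first_share: float, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Split `indices` in two, taking `first_share` of every label for the first part."""
    by_label = defaultdict(list)
    for i in indices:
        by_label[labels[i]].append(i)

    first, second = [], []
    for label in sorted(by_label, key=repr):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * first_share)
        first.extend(group[:cut])
        second.extend(group[cut:])
    return first, second


def two_stage_split(
    labels: Sequence[Hashable], seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, validation, test) index lists in roughly 70/15/15 proportion."""
    rng = random.Random(seed)
    everything = list(range(len(labels)))
    # Stage 1: keep 70% for training, hold out 30%.
    train, held_out = _stratified_cut(everything, labels, 0.70, rng)
    # Stage 2: halve the held-out 30% into validation and test (15% each).
    validation, test = _stratified_cut(held_out, labels, 0.50, rng)
    return train, validation, test
```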
- Adversarial Training: Fast Gradient Method (FGM) with a perturbation radius ε = 0.5.
- xAI Regularization: Layer Integrated Gradients (LIG) with penalty weight λ = 0.5, applied every 40 steps.
- Gradient Clipping: 1.0
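To make the hyperparameters above concrete, the sketch below shows the FGM perturbation rule (the gradient rescaled to L2 norm ε) and one plausible way the LIG penalty could enter the loss every 40 steps. The perturbation formula is the standard FGM step; how X-Guard actually combines the clean loss, adversarial loss, and LIG penalty is not stated in this card, so `combined_loss` is an assumption. Plain Python lists stand in for the embedding gradient so the example has no dependencies; in real training the perturbation would be added to the word embeddings.

```python
import math
from typing import List

EPSILON = 0.5     # FGM perturbation radius
LAMBDA_XAI = 0.5  # weight of the LIG penalty
XAI_EVERY = 40    # the penalty is applied every 40 steps


def fgm_perturbation(grad: List[float], epsilon: float = EPSILON) -> List[float]:
    """FGM step: move along the loss gradient, rescaled to L2 norm `epsilon`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)  # no gradient signal, so no perturbation
    return [epsilon * g / norm for g in grad]


def combined_loss(clean_loss: float, adv_loss: float, lig_penalty: float, step: int) -> float:
    """Assumed combination: clean plus adversarial loss, with the weighted
    LIG penalty added only on scheduled steps."""
    loss = clean_loss + adv_loss
    if step % XAI_EVERY == 0:
        loss += LAMBDA_XAI * lig_penalty
    return loss
```

In the min-max view, the FGM step approximates the inner maximization (the worst-case perturbation within radius ε), while the optimizer update on the combined loss performs the outer minimization.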
Evaluation Metrics
On the held-out test set, X-Guard achieved the following standalone performance:
- Overall Accuracy: 99.30%
- Harmful Class: Precision 99.46%, Recall 99.17%, F1 99.32%
- Benign Class: Precision 99.14%, Recall 99.44%, F1 99.29%
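As a quick sanity check, F1 is the harmonic mean of precision and recall, so the reported F1 values can be recomputed from the other two columns. The snippet below does this for both classes. Because the published precision and recall are already rounded to two decimals, the recomputed value may differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported (precision, recall, F1) in percent, copied from the list above.
reported = {
    "Harmful": (99.46, 99.17, 99.32),
    "Benign": (99.14, 99.44, 99.29),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    # Small gaps are expected because the inputs are rounded.
    print(f"{name}: recomputed F1 = {recomputed:.2f}, reported F1 = {f1:.2f}")
```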
Citation
If you use X-Guard in your research, please cite the associated paper:
@misc{xguard2026,
author = { Nidal Shahin and Abdelrahman Alsheyab and Mohammad Alkhasawneh and Ahmad Bataineh },
title = { X-Guard-Bench (Revision 7e9f62b) },
year = 2026,
url = { https://huggingface.co/Nid4l/X-Guard-Bench },
doi = { 10.57967/hf/9143 },
publisher = { Hugging Face }
}
Please also cite the RoBERTa paper, on which the base model is built:
@article{DBLP:journals/corr/abs-1907-11692,
author = {Yinhan Liu and
Myle Ott and
Naman Goyal and
Jingfei Du and
Mandar Joshi and
Danqi Chen and
Omer Levy and
Mike Lewis and
Luke Zettlemoyer and
Veselin Stoyanov},
title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
journal = {CoRR},
volume = {abs/1907.11692},
year = {2019},
url = {http://arxiv.org/abs/1907.11692},
archivePrefix = {arXiv},
eprint = {1907.11692},
timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}