jailbreakDetector-v6

This model is a fine-tuned version of distilbert/distilroberta-base on the markush1/LLM-Jailbreak-Classifier dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0005
  • Accuracy: 0.9999

Usage

Use with pipeline

from transformers import pipeline

# Load the classifier straight from the Hub
classifier = pipeline(model="markush1/jailbreakDetector-v6")
classifier("I like cookies")
# [{'label': 'bening', 'score': 1.0}]
# (note: the benign class label is spelled 'bening' in this model's config)

Use directly without pipeline

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("markush1/jailbreakDetector-v6")
model = AutoModelForSequenceClassification.from_pretrained("markush1/jailbreakDetector-v6")

text = "I like cookies"
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
print(model.config.id2label[predicted_class_id])

Model description

This fine-tune of distilroberta-base is intended to detect prompt-injection and jailbreak attempts, helping to secure large language model operations.

Intended uses

Use this model to filter any data passed to a sophisticated large language model: user input, but also text retrieved by LLM plugins such as RAG pipelines or web scrapers. In a future version, this model will be provided as a quantized variant for CPU-only execution, making it suitable for backend deployment without GPU resources. CPU inference is powered by the ONNX Runtime, which is supported through Hugging Face's Optimum library. Besides CPU deployment, other accelerators (e.g. NVIDIA GPUs) can be used.
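
A minimal sketch of CPU inference through ONNX Runtime via Optimum. This assumes the optimum[onnxruntime] extra is installed; export=True converts the PyTorch checkpoint to ONNX on the fly rather than loading a pre-exported file.

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Export the PyTorch checkpoint to ONNX on the fly and run it on CPU
model = ORTModelForSequenceClassification.from_pretrained(
    "markush1/jailbreakDetector-v6", export=True
)
tokenizer = AutoTokenizer.from_pretrained("markush1/jailbreakDetector-v6")

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("I like cookies"))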

Limitations

The model occasionally misclassifies benign sentences as jailbreaks. Watch out for such false positives when deploying it.
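
One way to soften the impact of false positives is to block input only above a confidence threshold and route borderline cases to a fallback such as human review. This is a sketch: the 0.9 cutoff is an illustrative assumption, not a tuned value, and the 'jailbreak' label string should be verified against model.config.id2label.

from transformers import pipeline

classifier = pipeline(model="markush1/jailbreakDetector-v6")

def is_jailbreak(text: str, threshold: float = 0.9) -> bool:
    """Flag text as a jailbreak only when the model is confident."""
    result = classifier(text)[0]
    # label string assumed to be 'jailbreak'; check model.config.id2label
    return result["label"] == "jailbreak" and result["score"] >= threshold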

Training and evaluation data

Trained and evaluated on my dataset markush1/LLM-Jailbreak-Classifier. See the dataset card for more details about the origins of the training data; the main contribution was pruning existing datasets.
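
To inspect the data yourself, it can be loaded with the datasets library (split names and columns are whatever the dataset card defines; the snippet just prints them):

from datasets import load_dataset

ds = load_dataset("markush1/LLM-Jailbreak-Classifier")
print(ds)  # shows the available splits and columns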

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 2
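
For reference, a minimal sketch of how these hyperparameters map onto transformers TrainingArguments; dataset preparation and the Trainer call are omitted, and the output_dir name is a hypothetical placeholder.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="jailbreakDetector-v6",  # hypothetical path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=2,
)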

Training results

Training Loss | Epoch | Step  | Validation Loss | Accuracy
------------- | ----- | ----- | --------------- | --------
0.0           | 1.0   | 10091 | 0.0009          | 0.9998
0.0007        | 2.0   | 20182 | 0.0005          | 0.9999

Framework versions

  • Transformers 4.40.1
  • PyTorch 2.3.0+cu121
  • Datasets 2.19.0
  • Tokenizers 0.19.1

Latency / Cost

On Hugging Face dedicated endpoints, the smallest AWS instance at 0.032 USD/hour classifies one sequence of up to 512 tokens roughly every second. That gives a theoretical throughput of about 60 sequences of up to 512 tokens per minute (roughly 30k tokens per minute), or about 3,600 sequences (~1.8M tokens) per hour, at a cost of 0.032 USD.
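
A back-of-the-envelope check of these numbers; the one-sequence-per-second rate is taken from the observation above, and every sequence is assumed to use the full 512 tokens.

# Rough throughput/cost arithmetic for the smallest endpoint instance
seq_per_sec = 1          # observed: ~1 sequence per second
max_tokens = 512         # tokens per sequence (upper bound)
cost_per_hour = 0.032    # USD

tokens_per_hour = seq_per_sec * 3600 * max_tokens   # ~1.84M tokens
cost_per_million = cost_per_hour / (tokens_per_hour / 1e6)
print(f"{tokens_per_hour / 1e6:.2f}M tokens/hour, ~{cost_per_million:.3f} USD per 1M tokens")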
