Edit model card

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Hyperion ๐Ÿน

Hyperion is an extremely lightweight (435M parameters) RoBERTa-based binary classifier that detects jailbreak/prompt injection attempts with 88% accuracy based on test cases.

We are continously releasing open-source models created during our research on prompt injection & model alignment. These models are not state of the art, but a very limited preview of our current capabilities. Hyperion was one of our very early tests in our process to build infrastructure for real time jailbreak detection. Smaller models and lightweight infrastructure are cheaper and faster to provide rapid responses to the emerging cat and mouse game for adversarial prompting. To learn more about us, visit our website!

Intended Use

  • Binary classification to detect prompt injection and related language model jailbreak techniques.

Training Data

  • Data Source: Preliminary proof of concept dataset of publicly available red and blue team data
  • Data Size: 100k rows
  • Data Composition: 50% false, 50% true (extra data for this model was tossed out)

Validation Metrics

  • Loss: 0.347
  • Accuracy: 0.876
  • Precision: 0.876
  • Recall: 0.875
  • AUC: 0.951
  • F1: 0.876

Considerations

  • This model has only been evaluated on a limited proof of concept dataset and has not been thoroughly tested.
  • This model is observed to be overly aggressive in screening.

Caveats and Recommendations

  • This is an early stage research model and has not been validated for real world use (use at your own risk!).
  • Further testing on larger, more diverse datasets is recommended before considering production deployment.
  • Monitor for potential biased performance across different demographic groups.
Downloads last month
11
Safetensors
Model size
435M params
Tensor type
I64
ยท
F32
ยท