Hyperion
Hyperion is an extremely lightweight (435M parameters) RoBERTa-based binary classifier that detects jailbreak and prompt injection attempts with about 88% accuracy on our validation set (see Validation Metrics).
We are continuously releasing open-source models created during our research on prompt injection and model alignment. These models are not state of the art; they are a very limited preview of our current capabilities. Hyperion was one of our earliest tests while building infrastructure for real-time jailbreak detection. Smaller models and lightweight infrastructure are cheaper and faster, which allows rapid responses in the emerging cat-and-mouse game of adversarial prompting.
Intended Use
- Binary classification to detect prompt injection and related language model jailbreak techniques.
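A minimal sketch of how a binary classifier like this could sit in front of a language model, assuming it is loaded through the Hugging Face `transformers` text-classification pipeline. The helper names `load_classifier` and `is_injection` and the positive label name `"True"` are assumptions for illustration; check the `id2label` mapping in the model's configuration for the real label names.

```python
from typing import Callable, Dict, List

# A classifier maps a prompt to a list such as [{"label": "True", "score": 0.97}],
# which is the shape a transformers text-classification pipeline returns.
Classifier = Callable[[str], List[Dict[str, object]]]


def load_classifier(model_id: str) -> Classifier:
    """Load a text-classification pipeline (needs `transformers` and a backend such as torch)."""
    from transformers import pipeline  # imported lazily so the rest runs without it

    return pipeline("text-classification", model=model_id)


def is_injection(prompt: str, classifier: Classifier, positive_label: str = "True") -> bool:
    """Return True when the top predicted label equals the (assumed) positive label."""
    top = classifier(prompt)[0]
    return str(top["label"]) == positive_label
```

Because `is_injection` accepts any callable with the pipeline's output shape, the screening logic can be exercised with a stub before the real model is downloaded.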
Training Data
- Data Source: Preliminary proof-of-concept dataset of publicly available red-team and blue-team data
- Data Size: 100k rows
- Data Composition: 50% false, 50% true (surplus data was discarded for this model)
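One common way to arrive at a 50/50 split is to downsample the larger class. The sketch below illustrates that idea only; it is an assumption about how surplus rows might be discarded, not the procedure used for this model, and `balance_by_downsampling` is a hypothetical helper.

```python
import random
from typing import List, Tuple

Example = Tuple[str, bool]  # (prompt text, is_injection label)


def balance_by_downsampling(rows: List[Example], seed: int = 0) -> List[Example]:
    """Drop surplus rows from the larger class so both classes end up the same size."""
    positives = [r for r in rows if r[1]]
    negatives = [r for r in rows if not r[1]]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)  # avoid all positives appearing before all negatives
    return balanced
```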
Validation Metrics
- Loss: 0.347
- Accuracy: 0.876
- Precision: 0.876
- Recall: 0.875
- AUC: 0.951
- F1: 0.876
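As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the rounded values listed above. This sketch uses only the reported numbers; `f1_score` is a small helper defined here, not a library call.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the metrics list (already rounded to three decimals).
reported_precision = 0.876
reported_recall = 0.875
f1 = f1_score(reported_precision, reported_recall)
print(f"F1 recomputed from rounded inputs: {f1:.4f}")
```

The recomputed value is about 0.8755, which agrees with the reported 0.876 to within the rounding of the inputs.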
Considerations
- This model has only been evaluated on a limited proof-of-concept dataset and has not been thoroughly tested.
- This model has been observed to screen too aggressively, meaning it tends to flag more prompts than it should.
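One possible mitigation for overly aggressive screening is to require a higher positive-class score before flagging a prompt, and to measure the effect on prompts known to be benign. The sketch below assumes you already have the model's positive-class scores as floats; the function names and the default threshold of 0.5 are illustrative assumptions, not recommended settings.

```python
from typing import List


def should_flag(positive_score: float, threshold: float = 0.5) -> bool:
    """Flag a prompt only when its positive-class score reaches the threshold."""
    return positive_score >= threshold


def false_positive_rate(benign_scores: List[float], threshold: float) -> float:
    """Share of known-benign prompts that would be flagged at this threshold."""
    if not benign_scores:
        return 0.0
    flagged = sum(1 for score in benign_scores if should_flag(score, threshold))
    return flagged / len(benign_scores)
```

Raising the threshold lowers the false positive rate at the cost of missing some real attacks, so any chosen value should be checked against both benign and adversarial samples.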
Caveats and Recommendations
- This is an early-stage research model and has not been validated for real-world use (use at your own risk!).
- Further testing on larger, more diverse datasets is recommended before considering production deployment.
- Monitor for potentially biased performance across different demographic groups.