Model Card for DistilBERT-PromptInjectionDetectorForCVs

Model Overview

This model, leveraging the DistilBERT architecture, has been fine-tuned to demonstrate a strategy for mitigating prompt injection attacks. While it is specifically tailored for a synthetic application that handles CVs, the underlying research and methodology are intended to be applicable across various domains. This model serves as an example of how fine-tuning with domain-specific data can enhance the detection of prompt injection attempts in a targeted use case.

Research Context

The development of this model was part of broader research into general strategies for mitigating prompt injection attacks in Large Language Models (LLMs). The detailed findings and methodology are discussed in our research blog, with the synthetic CV application available here serving as a practical demonstration.

Training Data

To fine-tune this model, we combined a domain-specific dataset (legitimate CVs) with examples of prompt injections, resulting in a custom dataset that provides a nuanced perspective on detecting prompt injection attacks. This approach leverages the strengths of both:

CV Dataset: Resume Dataset
Prompt Injection Dataset: Prompt Injections

The custom dataset includes legitimate CVs, pure prompt injection examples, and CVs embedded with prompt injection attempts, creating a rich training environment for the model.

Intended Use

This model is a demonstration of how a domain-specific approach can be applied to mitigate prompt injection attacks within a particular context, in this case, a synthetic CV application. It is important to note that this model is not intended for direct production use but rather to serve as an example within a broader strategy for securing LLMs against such attacks.

Limitations and Considerations

The challenge of prompt injection in LLMs is an ongoing research area, with no definitive solution currently available. While this model demonstrates a possible mitigation strategy within a specific domain, it is essential to recognize that it does not offer a comprehensive solution to the problem. Future prompt injection techniques may still succeed, underscoring the importance of continuous research and adaptation of mitigation strategies.

Conclusion

Our research aims to contribute to the broader discussion on securing LLMs against prompt injection attacks. This model, while specific to a synthetic application, showcases a piece of the puzzle in addressing these challenges. We encourage further exploration and development of strategies to fortify models against evolving threats in this space.