kyuz0 commited on
Commit
332f8ed
·
verified ·
1 Parent(s): 327d88f

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +12 -31
README.md CHANGED
@@ -5,43 +5,24 @@ language:
5
  - en
6
  tags:
7
  - promptinjection
 
8
  ---
9
-
10
- # Model Card for LLM-Prompt-Injection-Detection
11
 
12
  ## Model Description
13
- This model, based on DistilBERT, is fine-tuned to detect CVs that might contain prompt injection attacks. It aims to provide an automated way to screen for potentially harmful content within the submitted text data, enhancing security measures in applications that process CVs and other forms of textual input.
14
 
15
- ## Intended Use
16
- This model is intended for use in environments where security is paramount, particularly in systems that process large volumes of CVs or other textual data. It's designed to help identify attempts at prompt injection, which could be used to manipulate the behavior of language models or automated systems. The model can be integrated into a pipeline for pre-screening submissions, flagging those that require further human review.
17
 
18
  ## Training Data
19
- The training data consists of a custom dataset specifically compiled for detecting prompt injection attacks within text data. This dataset includes examples of normal text (e.g., standard CV content) and various forms of prompt injections. Each entry is labeled accordingly, allowing the model to learn the characteristics of potentially malicious content.
20
-
21
- ## Model Architecture
22
- The model leverages the `bert-base-uncased` architecture, which has been proven effective for a wide range of natural language processing tasks. This choice provides a solid foundation for understanding the nuanced patterns that differentiate regular text from prompt injections.
23
-
24
- ## Training Procedure
25
- Training was conducted on a split of 80% of the dataset for training and 20% for testing, ensuring a comprehensive learning process. The model was trained for 5 epochs, using the Adam optimizer with a learning rate of 0.0001. Training utilized a batch size of 16 for both training and evaluation phases. Performance was monitored through accuracy, precision, recall, and F1 score, allowing for a detailed understanding of the model's capabilities.
26
 
27
- ## Evaluation Results
28
- The model demonstrated effective performance in identifying prompt injection attacks, with detailed metrics available in the training output. Evaluation was conducted on the test set, further validating the model's ability to generalize beyond the training data.
29
-
30
- ## Limitations and Considerations
31
- While the model provides a valuable tool for detecting prompt injection attacks, it is not infallible. Users should be aware of the potential for false positives and false negatives. The model's performance can vary based on the specific characteristics of the input data and the complexity of potential attacks. It is recommended to use this model as part of a comprehensive security strategy that includes manual review and other automated checks.
32
-
33
- ## How to Use
34
- This model can be loaded and used for inference using the Hugging Face Transformers library. Example code for loading the model:
35
-
36
- ```python
37
- from transformers import AutoModelForSequenceClassification, AutoTokenizer
38
 
39
- model_name = "your_hf_repo/llm-prompt-injection-detection"
40
- model = AutoModelForSequenceClassification.from_pretrained(model_name)
41
- tokenizer = AutoTokenizer.from_pretrained(model_name)
42
 
43
- # Example prediction
44
- text = "Example CV text or potential prompt injection content"
45
- inputs = tokenizer(text, return_tensors="pt")
46
- outputs = model(**inputs)
47
- prediction = outputs.logits.argmax(-1)
 
5
  - en
6
  tags:
7
  - promptinjection
8
+ - distilbert
9
  ---
10
+ # Model Card for DistilBERT-PromptInjectionDetectorForCVs
 
11
 
12
  ## Model Description
13
+ This DistilBERT-based model was developed as part of a research project aiming to mitigate prompt injection attacks in applications processing CVs. It specifically targets the nuanced domain of CV submissions, demonstrating a strategy to distinguish between legitimate CVs and those containing prompt injection attempts.
14
 
15
+ ## Research and Application Context
16
+ The model was created in the context of demonstrating a synthetic application handling CVs, showcasing a domain-specific approach to mitigate prompt injection attacks. This work, including the model and its underlying strategy, is detailed in our [research blog](http://placeholder) and the synthetic application can be accessed [here](http://placeholder).
17
 
18
  ## Training Data
19
+ The model was fine-tuned on a custom dataset that combines domain-specific examples (legitimate CVs) with prompt injection examples to create a more tailored dataset. This dataset includes legitimate CVs, pure prompt injection texts, and CVs with embedded prompt injection attempts. The original datasets used are available on Hugging Face: [Resume Dataset](https://huggingface.co/datasets/Lakshmi12/Resume_Dataset) for CVs and [Prompt Injections](https://huggingface.co/datasets/deepset/prompt-injections) for injection examples.
 
 
 
 
 
 
20
 
21
+ ## Intended Use
22
+ This model is not intended for production use but serves as a demonstration of a domain-specific strategy to mitigate prompt injection attacks. It should be employed as part of a broader security strategy, including securing the model's output, as described in our article. This approach is meant to showcase how to address prompt injection risks in a targeted application scenario.
 
 
 
 
 
 
 
 
 
23
 
24
+ ## Limitations and Ethical Considerations
25
+ Prompt injection in Large Language Models (LLMs) remains an open problem with no deterministic solution. While this model offers a mitigation strategy, it's important to understand that new ways to perform injection attacks may still be possible. Users should consider this model as an example of how to approach mitigation in a specific domain, rather than a definitive solution.
 
26
 
27
+ ## License and Usage
28
+ The model and datasets are shared for research purposes, encouraging further exploration and development of mitigation strategies against prompt injection attacks. Users are encouraged to refer to the specific licenses of the datasets and the model for more details on permissible use cases.