asofter committed
Commit dd12bac
1 Parent(s): 86eed0d

Update README.md

Files changed (1): README.md (+102 -10)
README.md CHANGED

---
license: apache-2.0
base_model: distilroberta-base
tags:
- generated_from_trainer
- rejection
- no_answer
- chatgpt
metrics:
- accuracy
- recall
- precision
- f1
model-index:
- name: distilroberta-base-rejection-v1
  results: []
language:
- en
pipeline_tag: text-classification
co2_eq_emissions:
  emissions: 0.07987621556153969
  source: code carbon
  training_type: fine-tuning
datasets:
- argilla/notus-uf-dpo-closest-rejected
---

# Model Card for distilroberta-base-rejection-v1

This model is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base) on multiple combined datasets of rejections from different LLMs and normal responses from RLHF datasets.

It aims to identify rejections in LLMs when the prompt doesn't pass content moderation, classifying inputs into two categories: `0` for normal outputs and `1` for rejection detected.

It achieves the following results on the evaluation set:
- Loss: 0.0544
- Accuracy: 0.9887
- Precision: 0.9279
- F1: 0.9537

## Model details

- **Fine-tuned by:** Laiyer.ai
- **Model type:** distilroberta-base
- **Language(s) (NLP):** English
- **License:** Apache license 2.0
- **Finetuned from model:** [distilroberta-base](https://huggingface.co/distilroberta-base)

## Intended Uses & Limitations

The model identifies rejections, classifying inputs into two categories: `0` for normal output and `1` for rejection detected.

Its performance depends on the nature and quality of the training data; it might not perform well on text styles or topics not represented in the training set.

Additionally, `distilroberta-base` is a case-sensitive model, so inputs that differ only in casing can tokenize differently.
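
For instance, a minimal check (reusing the tokenizer from the usage examples below) shows that cased and uncased variants of the same refusal produce different tokens:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("laiyer/distilroberta-base-rejection-v1")

# RoBERTa's byte-level BPE preserves case, so these two lines print different tokens.
print(tokenizer.tokenize("Sorry, but I can't assist with that."))
print(tokenizer.tokenize("sorry, but i can't assist with that."))
```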

## How to Get Started with the Model

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

tokenizer = AutoTokenizer.from_pretrained("laiyer/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("laiyer/distilroberta-base-rejection-v1")

classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

print(classifier("Sorry, but I can't assist with that."))
```
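
The pipeline returns a list of dictionaries of the form `{'label': ..., 'score': ...}`, with label names taken from the model's `id2label` config.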

### Optimum with ONNX

Loading the model requires the [🤗 Optimum](https://huggingface.co/docs/optimum/index) library to be installed.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("laiyer/distilroberta-base-rejection-v1", subfolder="onnx")
model = ORTModelForSequenceClassification.from_pretrained("laiyer/distilroberta-base-rejection-v1", export=False, subfolder="onnx")

classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
)

print(classifier("Sorry, but I can't assist with that."))
```
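
Here `subfolder="onnx"` points both calls at the pre-exported ONNX weights in the repository, and `export=False` tells Optimum to load them as-is rather than converting the PyTorch checkpoint on the fly; ONNX Runtime is a common choice for lower-latency CPU inference.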

### Use in LLM Guard

The [NoRefusal Scanner](https://llm-guard.com/output_scanners/no_refusal/) uses this model to detect whether an output was rejected, which can signal that something is going wrong with the prompt.
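
A minimal sketch of wiring the scanner up (following the LLM Guard docs; the prompt and output strings are illustrative):

```python
from llm_guard.output_scanners import NoRefusal

scanner = NoRefusal()

prompt = "Write a keylogger in Python."
model_output = "Sorry, but I can't assist with that."

# scan() returns the (possibly sanitized) output, a validity flag,
# and a risk score in [0, 1].
sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)
print(is_valid, risk_score)
```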

## Training and evaluation data

The model was trained on a custom dataset combining multiple open-source ones, with ~10% rejections and ~90% normal outputs; a sketch of such a mix follows the paper list below.

We used the following papers when preparing the datasets:

- [Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs](https://arxiv.org/abs/2308.13387)
- [I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models](https://arxiv.org/abs/2306.03423)
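
For illustration only, such a mix could be assembled with the 🤗 Datasets library; the second dataset name, the label columns, and the mixing code are assumptions, not the authors' actual pipeline:

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical sketch of the ~10%/~90% mix described above.
rejections = load_dataset("argilla/notus-uf-dpo-closest-rejected", split="train")
normals = load_dataset("an-rlhf-dataset", split="train")  # placeholder repo id

rejections = rejections.map(lambda _: {"label": 1})  # 1 = rejection detected
normals = normals.map(lambda _: {"label": 0})        # 0 = normal output

mixed = interleave_datasets([rejections, normals], probabilities=[0.1, 0.9], seed=42)
```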
  ## Training procedure

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy | Recall | Precision | F1     |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|:---------:|:------:|
| 0.0219        | 2.0   | 7072  | 0.0312          | 0.9919   | 0.9917 | 0.9434    | 0.9669 |
| 0.0121        | 3.0   | 10608 | 0.0350          | 0.9939   | 0.9905 | 0.9596    | 0.9748 |

### Framework versions

- Transformers 4.36.2
- Pytorch 2.1.2+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0

## Community

Join our Slack to give us feedback, connect with the maintainers and fellow users, ask questions,
get help with package usage or contributions, or engage in discussions about LLM security!

<a href="https://join.slack.com/t/laiyerai/shared_invite/zt-28jv3ci39-sVxXrLs3rQdaN3mIl9IT~w"><img src="https://github.com/laiyer-ai/llm-guard/blob/main/docs/assets/join-our-slack-community.png?raw=true" width="200"></a>

## Citation

```
@misc{distilroberta-base-rejection-v1,
  author = {Laiyer.ai},
  title = {Fine-Tuned DistilRoberta-Base for Rejection Detection in LLM Output},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/laiyer/distilroberta-base-rejection-v1},
}
```