tteofili committed 44466a3 (parent: c346f72)

Update README.md

Files changed (1): README.md (+103, -0)

---
license: apache-2.0
datasets:
- lmsys/toxic-chat
metrics:
- perplexity
---

# Model Card for tteofili/tci_minus

This model is [`facebook/bart-large`](https://huggingface.co/facebook/bart-large) fine-tuned on toxic user inputs from the `lmsys/toxic-chat` dataset.

## Model Details

This model is not intended for plain inference, as it is very likely to generate toxic content. Instead, it is intended as a "utility model" for detecting and fixing toxic content: its token probability distributions are likely to differ from those of comparable models that were not fine-tuned on toxic data.
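
To illustrate why differing token distributions are useful, the sketch below compares an expert's and an anti-expert's next-token distributions at each position using the Jensen-Shannon divergence, roughly in the spirit of MaRCo's masking step, and flags positions where the two models disagree. It is a generic, hypothetical illustration in plain Python: the function names, the threshold, and the toy distributions are assumptions, not TMaRCo's code or API.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in nats (bounded above by ln 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def flag_tokens(expert_dists, anti_expert_dists, threshold: float) -> list[int]:
    """Hypothetical helper: positions whose two distributions diverge more than threshold."""
    return [
        i
        for i, (p, q) in enumerate(zip(expert_dists, anti_expert_dists))
        if js_divergence(p, q) > threshold
    ]


# Made-up next-token distributions over a 3-token vocabulary, two positions.
expert = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
anti_expert = [[0.68, 0.22, 0.10], [0.8, 0.1, 0.1]]
print(flag_tokens(expert, anti_expert, threshold=0.1))
```

At the first position the two models nearly agree, so only the second position, where the anti-expert strongly prefers a different token, exceeds the threshold and would be a candidate for masking and rewriting.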

The name `tci_minus` refers to the _G-_ (anti-expert) model in [Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts](https://aclanthology.org/2023.acl-short.21.pdf).

It can be used within TrustyAI's `TMaRCo` tool for detoxifying text; see https://github.com/trustyai-explainability/trustyai-detoxify/.

### Model Description

- **Developed by:** tteofili
- **Shared by:** tteofili
- **License:** Apache 2.0
- **Fine-tuned from model:** [`facebook/bart-large`](https://huggingface.co/facebook/bart-large)

## Uses

This model is intended as a "utility model" for detecting and fixing toxic content, for example as the _G-_ model in `TMaRCo`, rather than for direct text generation.

## Bias, Risks, and Limitations

This model is fine-tuned on toxic inputs from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset and is very likely to produce toxic content. It should therefore only be used in combination with other models, for detecting or fixing toxic content.

## How to Get Started with the Model

Use the code below to start using the model for text detoxification.

```python
from trustyai.detoxify import TMaRCo

tmarco = TMaRCo(expert_weights=[-1, 3])
# Load this model (G-) together with the trustyai/gplus expert model.
tmarco.load_models(["tteofili/tci_minus", "trustyai/gplus"])
# Rewrite the input texts to reduce their toxicity.
print(tmarco.rephrase(["white men can't jump"]))
```
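
The `expert_weights` argument suggests that the loaded models' outputs are combined with per-model weights. The sketch below shows a DExperts/MaRCo-style combination under that assumption: each model's next-token logits are scaled by its weight (here -1 for this anti-expert and 3 for the expert, assuming the weights follow the order passed to `load_models`) and added to the base model's logits before the softmax. The function names and the exact formula are hypothetical and are not taken from TMaRCo's implementation.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(base, experts, weights) -> list[float]:
    """Hypothetical combination: base logits plus weighted (anti-)expert logits."""
    if len(experts) != len(weights):
        raise ValueError("need one weight per expert model")
    combined = list(base)
    for logits, w in zip(experts, weights):
        combined = [c + w * z for c, z in zip(combined, logits)]
    return combined


# Made-up logits over a 3-token vocabulary.
base = [1.0, 1.0, 1.0]
g_minus = [3.0, 0.0, 0.0]  # anti-expert (toxic) strongly prefers token 0
g_plus = [0.0, 1.0, 0.0]   # expert (non-toxic) mildly prefers token 1
probs = softmax(combine_logits(base, [g_minus, g_plus], weights=[-1, 3]))
print([round(p, 3) for p in probs])
```

With these toy numbers, the negative weight pushes probability away from the token the anti-expert favours, and most of the mass moves to the expert's preferred token.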

## Training Details

This model was fine-tuned on toxic user inputs from the `lmsys/toxic-chat` dataset.

### Training Data

Training data comes from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset.

### Training Procedure

This model was fine-tuned with the following code:

```python
from trustyai.detoxify import TMaRCo

tmarco = TMaRCo()  # create the TMaRCo instance used for training

dataset_name = 'lmsys/toxic-chat'
data_dir = ''
perc = 100  # percentage of the dataset to use
# columns of the lmsys/toxic-chat dataset
td_columns = ['model_output', 'user_input', 'human_annotation', 'conv_id', 'jailbreaking', 'openai_moderation',
              'toxicity']

target_feature = 'toxicity'  # label column
content_feature = 'user_input'  # text column the model is trained on
model_prefix = 'toxic_chat_input_'  # prefix for the trained models' names
tmarco.train_models(perc=perc, dataset_name=dataset_name, expert_feature=target_feature, model_prefix=model_prefix,
                    data_dir=data_dir, content_feature=content_feature, td_columns=td_columns)
```
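
The parameters above use `user_input` as the text column and `toxicity` as the label column. As a hypothetical sketch of the resulting data selection, assuming `toxicity == 1` marks toxic rows and that the toxic subset is what this _G-_ model is trained on (TMaRCo's internal split is not shown in this card), the filtering step amounts to:

```python
from typing import Iterable, Mapping


def select_texts(
    rows: Iterable[Mapping[str, object]],
    content_feature: str = "user_input",
    target_feature: str = "toxicity",
    label: int = 1,
) -> list[str]:
    """Hypothetical helper: keep the text of rows whose label equals `label` (1 = toxic, assumed)."""
    return [str(row[content_feature]) for row in rows if row[target_feature] == label]


# Made-up rows shaped like lmsys/toxic-chat records (only two of its columns shown).
rows = [
    {"user_input": "you are useless", "toxicity": 1},
    {"user_input": "what is the capital of France?", "toxicity": 0},
]
print(select_texts(rows))
```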

#### Training Hyperparameters

This model was trained with the following hyperparameters:

```python
from transformers import TrainingArguments

# Other arguments (e.g. output_dir) are not listed in this card.
training_args = TrainingArguments(
    evaluation_strategy="epoch",  # called eval_strategy in newer transformers releases
    learning_rate=2e-5,
    weight_decay=0.01,
)
```

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Test data comes from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset.

#### Metrics

The model was evaluated using perplexity.
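
Perplexity is the exponential of the mean negative log-likelihood that the model assigns to the reference tokens. A minimal sketch computing it from per-token probabilities (the textbook definition, not the evaluation script used for this card):

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood over the given token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model assigning probability 0.5 to every token has perplexity 2.
print(perplexity([0.5, 0.5, 0.5]))
```

By the same definition, a perplexity of 1.08 corresponds to a mean negative log-likelihood of ln 1.08 ≈ 0.077 nats per token, i.e. a geometric-mean probability of about 0.93 for each reference token.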

### Results

Perplexity: 1.08