---
library_name: transformers
tags:
- paraphraser
license: mit
pipeline_tag: summarization
---

# Model Card for SamSJackson/paraphrase-dipper-no-ctx

[Paraphrasing evades detectors of AI-generated text,
but retrieval is an effective defense](https://arxiv.org/pdf/2303.13408.pdf) proposed a strong discourse paraphraser known as DIPPER.

DIPPER is a large model, built from [google/t5-efficient-xxl](https://huggingface.co/google/t5-efficient-xxl) and finetuned on 6.3M datapoints.
I am proposing a lightweight, non-context equivalent for lower-cost usage.

This model is built from [google/t5-efficient-large-nl32](https://huggingface.co/google/t5-efficient-large-nl32) and finetuned on 100,000 datapoints.
Notably, the datapoints are all non-context: paraphrases are generated without conditioning on surrounding context. Refer to the original paper for further details on this distinction.

The dataset used to finetune this model is available here: [Dataset](https://huggingface.co/datasets/SamSJackson/kpar3-no-ctx)

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed by:** Sam Jackson 
- **Model type:** Sequence-to-Sequence Model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** [google/t5-efficient-large-nl32](https://huggingface.co/google/t5-efficient-large-nl32)

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [Original Github](https://github.com/martiansideofthemoon/ai-detection-paraphrases)
- **Paper:** [Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense](https://arxiv.org/pdf/2303.13408.pdf)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
The model is intended for paraphrasing with notions of control.
The training data encodes lexical (word choice) and order (paragraph structure) parameters, which control the strength of the paraphrase.

See the Example Usage section below for a demonstration.

### Direct Use

The model is usable as-is from the uploaded checkpoint; no further finetuning is required, although it is possible.

### Downstream Use

This model was finetuned from a T5 checkpoint, and further finetuning is possible if desired.
If you plan on transfer learning, I would recommend starting from the initial checkpoint instead: [google/t5-efficient-large-nl32](https://huggingface.co/google/t5-efficient-large-nl32).

### Recommendations

If you have the compute capacity, I would recommend using the more powerful model: [DIPPER](https://github.com/martiansideofthemoon/ai-detection-paraphrases).

Otherwise, this model is sufficiently strong.
It outperforms the sentence-based paraphraser [ChatGPT Paraphraser](https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base) on perplexity, with both models' outputs scored by facebook/opt-2.7b.
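
For reference, here is a minimal sketch of this kind of perplexity scoring with facebook/opt-2.7b; the exact evaluation protocol is an assumption on my part, not documented in this card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Minimal sketch: score a paraphrase's fluency by its perplexity under
# facebook/opt-2.7b. The exact comparison protocol behind the claim
# above is assumed, not documented here.
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
lm = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b").to(device)
lm.eval()

text = "Every Wednesday, I take my dog for a walk in Central Park."
enc = tok(text, return_tensors="pt").to(device)

with torch.no_grad():
    # The LM loss is the mean token cross-entropy; exp(loss) is perplexity.
    loss = lm(**enc, labels=enc["input_ids"]).loss

print(f"Perplexity: {loss.exp().item():.2f}")
```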

## How to Get Started with the Model

Use the code in the Example Usage section below to get started with the model.

## Training Details

### Training Data

As mentioned, the training data is available here: [kpar3-no-ctx](https://huggingface.co/datasets/SamSJackson/kpar3-no-ctx).
Pre-processing consists solely of tokenisation with the google/t5-efficient-large-nl32 tokenizer.

The data consists of classic paraphrase pairs, except that the first element of each pair is prefixed with the control terms "lexical = x" and "order = y".
The values x and y are drawn from the set {0, 20, 40, 60, 80, 100} and denote the strength with which the model should paraphrase.

In particular, a prompt with "lexical = 0" asks the model to change as many words as possible while maintaining the original meaning.
Likewise, a prompt with "order = 0" asks the model to restructure the paragraph as much as possible.

The dataset only contains parameter values in increments of 20. 
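
For illustration, here is a hypothetical datapoint in this format; the sentences are invented, not drawn from the dataset:

```python
# Hypothetical (source, target) training pair in the kpar3-no-ctx format.
# The control prefix is part of the source text; both sentences are
# invented for illustration.
source = (
    "lexical = 40, order = 80 "
    "The committee approved the budget after a lengthy debate."
)
target = "After debating at length, the committee signed off on the budget."
```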

#### Training Hyperparameters

- **Training regime:** bf16 mixed precision
```python
learning_rate = 1e-4
bf16 = True
num_train_epochs = 2
auto_find_batch_size = True
generation_num_beams = 2
generation_max_length = 200
```
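
For context, here is a sketch of how these values map onto `Seq2SeqTrainingArguments` from `transformers`; `output_dir` and `predict_with_generate` are assumptions on my part, not confirmed settings from the original run:

```python
from transformers import Seq2SeqTrainingArguments

# Hedged reconstruction of the training configuration; values come from
# the hyperparameters above, except where marked as assumed.
training_args = Seq2SeqTrainingArguments(
    output_dir="paraphrase-dipper-no-ctx",  # assumed
    learning_rate=1e-4,
    bf16=True,
    num_train_epochs=2,
    auto_find_batch_size=True,
    predict_with_generate=True,  # assumed; generation_* settings apply during evaluation
    generation_num_beams=2,
    generation_max_length=200,
)
```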

#### Speeds, Sizes, Times

Finetuning on 100,000 datapoints took around 14 GPU hours on an RTX 3090.

### Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The tokenizer comes from the base checkpoint; the weights from this repository.
tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-large-nl32")

model = AutoModelForSeq2SeqLM.from_pretrained("SamSJackson/paraphrase-dipper-no-ctx")
model = model.to(device)

text = "Each Wednesday, I take my dog for a walk in Central Park."

# Control codes: lower values request stronger paraphrasing.
lexical = 20
order = 40

prompt = f"lexical = {lexical}, order = {order} {text}"

# The tokenizer returns a dict holding input_ids and attention_mask.
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding="longest",
    max_length=1000,
    truncation=True,
).to(device)

outputs = model.generate(
    **inputs,
    top_p=0.75,
    do_sample=True,
    max_new_tokens=300,
)

# Decode and join the generated sequence(s) into a single string.
response = " ".join(tokenizer.batch_decode(outputs, skip_special_tokens=True))

print(response)
```
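
Since `do_sample=True` with nucleus sampling (`top_p=0.75`) is used, outputs vary between runs. For more deterministic paraphrases, you could disable sampling and use beam search instead (e.g. `num_beams=4, do_sample=False`).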

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**
```
@misc{krishna2023paraphrasing,
      title={Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense}, 
      author={Kalpesh Krishna and Yixiao Song and Marzena Karpinska and John Wieting and Mohit Iyyer},
      year={2023},
      eprint={2303.13408},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Model Card Contact

Contact me through Hugging Face if you have any questions.