|
--- |
|
library_name: transformers |
|
tags: |
|
- paraphraser |
|
license: mit |
|
pipeline_tag: summarization |
|
--- |
|
|
|
# Model Card for SamSJackson/paraphrase-dipper-no-ctx
|
|
|
The paper [Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense](https://arxiv.org/pdf/2303.13408.pdf) proposed a strong discourse paraphraser known as DIPPER.
|
|
|
DIPPER is a large model, built from [google/t5-efficient-xxl](https://huggingface.co/google/t5-efficient-xxl) and finetuned on 6.3M datapoints. |
|
This model is a lightweight, non-context alternative intended for lower-cost usage.
|
|
|
This model is built from [google/t5-efficient-large-nl32](https://huggingface.co/google/t5-efficient-large-nl32) and finetuned on 100,000 datapoints.
|
Notably, the datapoints are all non-context: each paraphrase pair stands alone, without the surrounding document context that DIPPER conditions on. Refer to the original paper for further detail on this distinction.
|
|
|
The dataset used to finetune this model is available here: [Dataset](https://huggingface.co/datasets/SamSJackson/kpar3-no-ctx) |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
This is a lightweight, non-context paraphraser finetuned from google/t5-efficient-large-nl32, using the lexical and order control parameters introduced by DIPPER.
|
|
|
- **Developed by:** Sam Jackson |
|
- **Model type:** Sequence-to-Sequence Model |
|
- **Language(s) (NLP):** English |
|
- **License:** MIT |
|
- **Finetuned from model:** [google/t5-efficient-large-nl32](https://huggingface.co/google/t5-efficient-large-nl32)
|
|
|
### Model Sources
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** [Original GitHub repository](https://github.com/martiansideofthemoon/ai-detection-paraphrases)
|
- **Paper:** [Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense](https://arxiv.org/pdf/2303.13408.pdf)
|
|
|
## Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
The model is intended to be used for controllable paraphrasing.
|
The training data introduces lexical (word choice) and order (paragraph structure) parameters, which control the strength of the paraphrase.
|
|
|
See the Example Usage section below for a complete example.
|
|
|
### Direct Use |
|
|
|
The model is usable as-is from the uploaded checkpoint. No further finetuning is required, although it is possible.
|
|
|
### Downstream Use
|
|
|
This model was finetuned from a T5 checkpoint. |
|
It is possible to further finetune this model, if desired. |
|
If you plan on transfer learning, I would recommend starting from the initial checkpoint model: [google/t5-efficient-large-nl32](https://huggingface.co/google/t5-efficient-large-nl32).
|
|
|
### Recommendations |
|
|
|
If you have the compute capacity, I would recommend using the more powerful model: [DIPPER](https://github.com/martiansideofthemoon/ai-detection-paraphrases).
|
|
|
Otherwise, this model is sufficiently strong. |
|
It outperforms the sentence-based paraphraser [ChatGPT Paraphraser](https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base) on perplexity, with both models' outputs scored using facebook/opt-2.7b.
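
As a rough illustration of that comparison (the exact evaluation script is not reproduced here; this is only a sketch of the setup), perplexity under facebook/opt-2.7b can be computed as follows:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Scoring model used for the comparison.
ppl_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
ppl_model = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b")
ppl_model.eval()

def perplexity(text: str) -> float:
    # Perplexity is the exponential of the mean token-level cross-entropy
    # that the scoring model assigns to the text.
    encoded = ppl_tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = ppl_model(**encoded, labels=encoded["input_ids"]).loss
    return torch.exp(loss).item()

# Lower perplexity indicates more fluent output.
print(perplexity("Some paraphrased output to score."))
```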
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code in the Example Usage section below to get started with the model.
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
As mentioned, the training data is here: [kpar3-no-ctx](https://huggingface.co/datasets/SamSJackson/kpar3-no-ctx) |
|
Pre-processing consists only of tokenisation with the google/t5-efficient-large-nl32 tokenizer.
|
|
|
The data consists of classic paraphrase pairs; however, the first element of each pair is prefixed with the terms "lexical = x" and "order = y".
|
The values x and y are in the set {0, 20, 40, 60, 80, 100} and denote the strength with which the model should paraphrase. |
|
|
|
In particular, for an input with "lexical = 0", the model should change as many words as possible while maintaining the original meaning.

Meanwhile, for an input with "order = 0", the model should restructure the paragraph as much as possible.
|
|
|
The dataset only contains parameter values in increments of 20. |
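
As a purely illustrative example (not an actual record from the dataset), a training pair therefore looks roughly like this:

```python
# Hypothetical (source, target) pair: the control terms are part of the source string,
# and the target is the reference paraphrase.
source = "lexical = 40, order = 40 Each Wednesday, I take my dog for a walk in Central Park."
target = "I walk my dog in Central Park every Wednesday."
```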
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** bf16 mixed precision
|
```python
learning_rate = 1e-4
bf16 = True
num_train_epochs = 2
auto_find_batch_size = True
generation_num_beams = 2
generation_max_length = 200
```
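
For reference, these settings correspond roughly to the following 🤗 `Seq2SeqTrainingArguments` configuration (a sketch only: the output directory and any arguments not listed above are assumptions, not the exact training script):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="paraphrase-dipper-no-ctx",  # assumed output path
    learning_rate=1e-4,
    bf16=True,
    num_train_epochs=2,
    auto_find_batch_size=True,
    generation_num_beams=2,
    generation_max_length=200,
    predict_with_generate=True,             # assumption: generate during evaluation
)
```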
|
|
|
#### Speeds, Sizes, Times
|
|
|
Finetuning on 100,000 datapoints took around 14 GPU hours on an RTX 3090.
|
|
|
### Example Usage |
|
|
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-large-nl32") |
|
|
|
model = AutoModelForSeq2SeqLM.from_pretrained("SamSJackson/paraphrase-dipper-no-ctx") |
|
model = model.to(device) |
|
|
|
text = "Each Wednesday, I take my dog for a walk in Central Park."
|
|
|
lexical = 20 |
|
order = 40 |
|
|
|
# Control codes are prepended to the input text; lower values request a stronger paraphrase.
prompt = f"lexical = {lexical}, order = {order} {text}"
|
|
|
# Tokenise the prompt; the returned encoding holds input_ids and attention_mask.
inputs = tokenizer(
|
prompt, |
|
return_tensors='pt', |
|
padding="longest", |
|
max_length=1000, |
|
truncation=True, |
|
).to(device) |
|
|
|
outputs = model.generate( |
|
    **inputs,
|
top_p=0.75, |
|
do_sample=True, |
|
max_new_tokens=300, |
|
) |
|
|
|
response = tokenizer.batch_decode(outputs, skip_special_tokens=True) |
|
response = " ".join(response)
|
|
|
print(response) |
|
``` |
|
|
|
## Citation
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
|
|
**BibTeX:** |
|
``` |
|
@misc{krishna2023paraphrasing, |
|
title={Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense}, |
|
author={Kalpesh Krishna and Yixiao Song and Marzena Karpinska and John Wieting and Mohit Iyyer}, |
|
year={2023}, |
|
eprint={2303.13408}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |
|
|
|
## Model Card Contact |
|
|
|
Contact me through Hugging Face if you have any questions.