---
language: en
license: apache-2.0
tags:
- transformers
- bart
- paraphrase
- seq2seq
datasets:
- quora
- paws
---
# BART Paraphrase Model (Large)
A large BART seq2seq (text2text generation) model fine-tuned on 3 paraphrase datasets.

## Model description
The BART model was proposed in [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Lewis et al. (2019).

- BART uses a standard seq2seq/machine-translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
- The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.
- BART is particularly effective when fine-tuned for text generation. This model is fine-tuned on 3 paraphrase datasets (Quora, PAWS and the MSR Paraphrase Corpus).

The original BART code is from this [repository](https://github.com/pytorch/fairseq/tree/master/examples/bart).

## Intended uses & limitations
You can use this model to paraphrase an input sentence.
### How to use
```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

input_sentence = "They were there to enjoy us and they were there to pray for us."

# Load the fine-tuned model and tokenizer
model = BartForConditionalGeneration.from_pretrained('eugenesiow/bart-paraphrase')
tokenizer = BartTokenizer.from_pretrained('eugenesiow/bart-paraphrase')

# Move the model to GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Tokenize the input and place the tensors on the same device as the model
batch = tokenizer(input_sentence, return_tensors='pt').to(device)

# Generate and decode the paraphrase
generated_ids = model.generate(batch['input_ids'])
generated_sentence = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_sentence)
```
### Output
```
['They were there to enjoy us and to pray for us.']
```
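If you want several candidate paraphrases rather than a single output, you can pass standard `generate()` arguments such as `num_beams` and `num_return_sequences`. The settings below are illustrative only, not the defaults used to produce the output above:

```python
# Illustrative beam-search settings (an assumption, not this card's defaults);
# reuses `model`, `tokenizer` and `batch` from the snippet above.
generated_ids = model.generate(
    batch['input_ids'],
    num_beams=5,              # search over 5 beams
    num_return_sequences=3,   # return the top 3 candidates
    max_length=64,
)
candidates = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
for candidate in candidates:
    print(candidate)
```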
## Training data
The model was fine-tuned from the pretrained [`facebook/bart-large`](https://huggingface.co/facebook/bart-large) checkpoint, using the [Quora](https://huggingface.co/datasets/quora), [PAWS](https://huggingface.co/datasets/paws) and [MSR Paraphrase Corpus](https://www.microsoft.com/en-us/download/details.aspx?id=52398) datasets.
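As a rough sketch of how the three sources can be pulled together with the `datasets` library (the exact splits, filtering and column mapping used for training are not documented here, and dataset identifiers/configs may need adjusting for your `datasets` version, so treat this as an assumption):

```python
from datasets import load_dataset

# Assumed configs and splits; the actual preprocessing may differ.
quora = load_dataset("quora", split="train")                  # question pairs; keep is_duplicate == True
paws = load_dataset("paws", "labeled_final", split="train")   # keep label == 1 (paraphrase)
mrpc = load_dataset("glue", "mrpc", split="train")            # MSR Paraphrase Corpus; keep label == 1

def quora_pairs(example):
    # Map a Quora record to (input_text, target_text) for seq2seq training
    q1, q2 = example["questions"]["text"]
    return {"input_text": q1, "target_text": q2}

train_pairs = quora.filter(lambda x: x["is_duplicate"]).map(quora_pairs)
```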
## Training procedure

We follow the training procedure provided in the [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers) seq2seq [example](https://github.com/ThilinaRajapakse/simpletransformers/blob/master/examples/seq2seq/paraphrasing/train.py).
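
A minimal sketch of that setup with `simpletransformers` is shown below; the hyperparameters are placeholders rather than the values used to train this checkpoint, so refer to the linked training script for the actual configuration:

```python
import pandas as pd
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

# simpletransformers expects a DataFrame with "input_text" and "target_text" columns.
train_df = pd.DataFrame(
    [["They were there to enjoy us and they were there to pray for us.",
      "They were there to enjoy us and to pray for us."]],
    columns=["input_text", "target_text"],
)

# Placeholder hyperparameters (assumed, not this checkpoint's training settings).
model_args = Seq2SeqArgs()
model_args.num_train_epochs = 2
model_args.max_seq_length = 128
model_args.train_batch_size = 8

model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="facebook/bart-large",
    args=model_args,
)

model.train_model(train_df)
```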

## BibTeX entry and citation info
```bibtex
@misc{lewis2019bart,
      title={BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension}, 
      author={Mike Lewis and Yinhan Liu and Naman Goyal and Marjan Ghazvininejad and Abdelrahman Mohamed and Omer Levy and Ves Stoyanov and Luke Zettlemoyer},
      year={2019},
      eprint={1910.13461},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```