Introduction

This is the automatic paraphrasing model described and used in the paper "AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data" (EMNLP 2020).

Training data

The model is trained on a cleaned version of the ParaBank 2 dataset, introduced in "Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering". ParaBank 2 is a paraphrasing dataset constructed by back-translating the Czech portion of an English-Czech parallel corpus. We use the subset of 5 million sentence pairs with the highest dual conditional cross-entropy scores (which correspond to the highest paraphrasing quality), and keep only one of the five paraphrases provided for each sentence. Cleaning removed sentences that do not look like normal English, for example sentences that contain URLs or too many special characters.
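The cleaning step can be illustrated with a minimal sketch. The exact rules and thresholds used for this model are not given here, so the URL pattern, the `MAX_SPECIAL_RATIO` value, and all function names below are assumptions chosen only for illustration.

```python
import re

# Hypothetical heuristics: the actual cleaning rules are not specified in this card.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
MAX_SPECIAL_RATIO = 0.2  # assumed threshold, not taken from the paper


def special_char_ratio(sentence: str) -> float:
    """Fraction of non-space characters that are neither letters nor digits."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 1.0  # treat empty input as maximally "abnormal"
    special = sum(1 for c in chars if not c.isalnum())
    return special / len(chars)


def looks_like_normal_english(sentence: str) -> bool:
    """Reject sentences that contain a URL or too many special characters."""
    if URL_PATTERN.search(sentence):
        return False
    return special_char_ratio(sentence) <= MAX_SPECIAL_RATIO


def clean_pairs(pairs):
    """Keep only (source, paraphrase) pairs where both sides pass the filter."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if looks_like_normal_english(src) and looks_like_normal_english(tgt)
    ]
```

A pair is dropped as soon as either side fails the check, which is stricter than filtering only one side; that choice is also an assumption.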

Training procedure

The model is fine-tuned for 4 epochs on the above-mentioned dataset, starting from the facebook/bart-large checkpoint. We use a token-level cross-entropy loss computed against the gold paraphrase sentence. To ensure that the model's output is grammatical, we train with the back-translated Czech sentence as the input and the human-written English sentence as the output. Training is done with mini-batches of 1280 examples. For higher training efficiency, each mini-batch is constructed by grouping sentences of similar length together.
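Grouping sentences of similar length reduces the amount of padding in each mini-batch. The sketch below shows one simple way to do this, by sorting and slicing. It is not the training code used for this model: the function name is hypothetical, and length is approximated by whitespace-separated word count rather than by the model's subword tokenizer.

```python
from typing import List, Sequence


def length_grouped_batches(sentences: Sequence[str], batch_size: int) -> List[List[str]]:
    """Sort sentences by approximate length, then slice into consecutive batches.

    Length is the number of whitespace-separated words (an assumption made
    to keep the example dependency-free).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ordered = sorted(sentences, key=lambda s: len(s.split()))
    return [list(ordered[i:i + batch_size]) for i in range(0, len(ordered), batch_size)]


# Example: four sentences grouped into batches of two.
batches = length_grouped_batches(["a b c d", "a", "a b c", "a b"], batch_size=2)
```

In practice the order of the resulting batches would usually be shuffled each epoch so that the model does not always see short sentences first.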

How to use

Using top_p=0.9 and a temperature between 0 and 1 usually results in good generated paraphrases. Higher temperatures make paraphrases more diverse and less similar to the input, but might slightly change the meaning of the original sentence. Note that this is a sentence-level paraphraser. If you want to paraphrase longer inputs (like paragraphs) with this model, make sure to first break the input into individual sentences.
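The recommendations above could be put together roughly as follows with the Hugging Face transformers library. This is a sketch under several assumptions: `model_name` must be replaced with this model's Hub ID or a local path, the sentence splitter is a naive regular expression rather than a proper sentence segmenter, the helper names are hypothetical, and `max_length=128` is an arbitrary cap. Sampling in transformers requires a strictly positive temperature, so the helper rejects 0.

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sampling_kwargs(temperature: float = 0.7, top_p: float = 0.9) -> Dict[str, object]:
    """Generation settings following the recommendation above."""
    if not 0.0 < temperature <= 1.0:
        raise ValueError("temperature should be in (0, 1]")
    return {
        "do_sample": True,       # nucleus sampling instead of greedy decoding
        "top_p": top_p,
        "temperature": temperature,
        "max_length": 128,       # assumed cap on output length
    }


def paraphrase(text: str, model_name: str, temperature: float = 0.7) -> List[str]:
    """Paraphrase each sentence of `text` separately and return the results."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    sentences = split_sentences(text)
    if not sentences:
        return []
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, **sampling_kwargs(temperature))
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Because decoding is sampled, repeated calls on the same input will generally return different paraphrases.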

Citation

If you are using this model in your work, please use this citation:

@inproceedings{xu-etal-2020-autoqa,
    title = "{A}uto{QA}: From Databases to {QA} Semantic Parsers with Only Synthetic Training Data",
    author = "Xu, Silei  and Semnani, Sina  and Campagna, Giovanni  and Lam, Monica",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.31",
    pages = "422--434",
}