The model is trained to predict whether an English sentence is formal or informal.
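A minimal inference sketch, assuming the checkpoint loads as a standard sequence-classification model through the `transformers` library and stores its label names in `config.id2label`. The card does not state the Hub identifier, so `MODEL_ID` is a placeholder to replace, and `softmax` and `score_sentences` are illustrative helper names rather than part of the released model.

```python
import math

# Placeholder: replace with the actual Hub ID of this model (not stated in the card).
MODEL_ID = "your-namespace/your-formality-model"


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_sentences(sentences, model_id=MODEL_ID):
    """Return one {label: probability} dict per input sentence.

    Assumes the checkpoint has a classification head and label names
    in config.id2label; heavy imports stay local to this function.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits

    results = []
    for row in logits.tolist():
        probs = softmax(row)
        results.append({model.config.id2label[i]: p for i, p in enumerate(probs)})
    return results
```

Calling `score_sentences(["I would appreciate your reply.", "gimme a sec lol"])` downloads the checkpoint on first use. Reading label names from the config avoids hard-coding which index means "formal".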

Base model: roberta-base

Datasets: GYAFC (Rao and Tetreault, 2018) and the online formality corpus (Pavlick and Tetreault, 2016).

Data augmentation: converting texts to upper or lower case, removing all punctuation, and adding a dot at the end of a sentence. Without this augmentation, the model relies too heavily on punctuation and capitalization and pays too little attention to other features.
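The augmentations listed above could be sketched as follows. The card names the transformations but not how they were combined or how often they were applied, so the random choice of one transformation, the probability `p`, and the function names are all assumptions made for illustration. Only ASCII punctuation is handled.

```python
import random
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Remove all ASCII punctuation from the text."""
    return text.translate(_PUNCT_TABLE)


def add_final_dot(text):
    """Append a dot unless the text already ends with ., ! or ? (assumed rule)."""
    text = text.rstrip()
    return text if text.endswith((".", "!", "?")) else text + "."


# The four transformations named in the card.
AUGMENTATIONS = [str.upper, str.lower, strip_punctuation, add_final_dot]


def augment(text, rng, p=0.5):
    """With probability p, apply one randomly chosen transformation.

    The sampling scheme is an assumption; the card does not specify it.
    """
    if rng.random() < p:
        return rng.choice(AUGMENTATIONS)(text)
    return text
```

For example, `augment("Hello, World", random.Random(0))` returns either the unchanged text or one of its four transformed variants, so the classifier sees the same label with and without surface cues.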

Loss: binary classification (on GYAFC), in-batch ranking (on P&T data).
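The two objectives can be illustrated with a small pure-Python sketch. The card does not give the exact formulas, so this assumes binary cross-entropy on a single logit for GYAFC and a pairwise logistic ranking loss over all pairs in a batch for the P&T data, where `targets` are the human formality ratings. Function names are hypothetical, and a real training loop would use tensor operations instead.

```python
import math
from itertools import combinations


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bce_with_logit(logit, label):
    """Binary cross-entropy for one example, label in {0, 1}."""
    return softplus(logit) - logit * label


def in_batch_ranking_loss(scores, targets):
    """Assumed pairwise logistic loss over all pairs in the batch.

    For each pair with different targets, penalize the model when the
    example with the higher target does not get the higher score.
    """
    losses = []
    for i, j in combinations(range(len(scores)), 2):
        if targets[i] == targets[j]:
            continue  # ties carry no ordering information
        sign = 1.0 if targets[i] > targets[j] else -1.0
        margin = sign * (scores[i] - scores[j])
        losses.append(softplus(-margin))
    return sum(losses) / len(losses) if losses else 0.0
```

A ranking objective fits the P&T data because its formality annotations are continuous ratings, so only the relative order of predictions within a batch needs to match.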

Performance metrics on the test data:

| dataset | ROC AUC | precision | recall | fscore | accuracy | Spearman |
|---|---|---|---|---|---|---|
| GYAFC | 0.9779 | 0.90 | 0.91 | 0.90 | 0.9087 | 0.8233 |
| GYAFC normalized (lowercase + remove punct.) | 0.9234 | 0.85 | 0.81 | 0.82 | 0.8218 | 0.7294 |

| P&T subset | Spearman R |
|---|---|
| news | 0.4003 |
| answers | 0.7500 |
| blog | 0.7334 |
| email | 0.7606 |

Citation

If you are using the model in your research, please cite the following paper where it was introduced:

@InProceedings{10.1007/978-3-031-35320-8_4,
  author="Babakov, Nikolay
  and Dale, David
  and Gusev, Ilya
  and Krotova, Irina
  and Panchenko, Alexander",
  editor="M{\'e}tais, Elisabeth
  and Meziane, Farid
  and Sugumaran, Vijayan
  and Manning, Warren
  and Reiff-Marganiec, Stephan",
  title="Don't Lose the Message While Paraphrasing: A Study on Content Preserving Style Transfer",
  booktitle="Natural Language Processing and Information Systems",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="47--61",
  isbn="978-3-031-35320-8"
}

Licensing Information

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
