---
language:
- en
tags:
- formality
datasets:
- GYAFC
- Pavlick-Tetreault-2016
---

The model has been trained to predict, for English sentences, whether they are formal or informal.

Base model: `roberta-base`

Datasets: [GYAFC](https://github.com/raosudha89/GYAFC-corpus) from [Rao and Tetreault, 2018](https://aclanthology.org/N18-1012) and the [online formality corpus](http://www.seas.upenn.edu/~nlp/resources/formality-corpus.tgz) from [Pavlick and Tetreault, 2016](https://aclanthology.org/Q16-1005).

Data augmentation: converting texts to upper or lower case, removing all punctuation, and adding a dot at the end of a sentence. This was applied because otherwise the model over-relies on punctuation and capitalization and does not pay enough attention to other features.

Loss: binary classification (on GYAFC) and in-batch ranking (on P&T data).

Performance metrics on the test data:

| dataset                                      | ROC AUC | precision | recall | F-score | accuracy | Spearman |
|----------------------------------------------|---------|-----------|--------|---------|----------|----------|
| GYAFC                                        | 0.9779  | 0.90      | 0.91   | 0.90    | 0.9087   | 0.8233   |
| GYAFC normalized (lowercase + remove punct.) | 0.9234  | 0.85      | 0.81   | 0.82    | 0.8218   | 0.7294   |

| P&T subset | Spearman R |
|------------|------------|
| news       | 0.4003     |
| answers    | 0.7500     |
| blog       | 0.7334     |
| email      | 0.7606     |
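The case and punctuation perturbations described above can be sketched as simple text transforms. This is an illustrative sketch only; the exact augmentation pipeline, sampling probabilities, and function names are assumptions, not taken from the training code.

```python
import string

# Each transform below mirrors one augmentation named in this card.
# How they were combined or sampled during training is not specified here.

def to_upper(text: str) -> str:
    """Convert the whole sentence to upper case."""
    return text.upper()

def to_lower(text: str) -> str:
    """Convert the whole sentence to lower case."""
    return text.lower()

def strip_punct(text: str) -> str:
    """Remove all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

def add_final_dot(text: str) -> str:
    """Add a dot at the end of the sentence if one is missing."""
    return text if text.endswith(".") else text + "."
```

Applying these during training forces the classifier to rely on lexical and syntactic cues of formality rather than on surface features such as capitalization or terminal punctuation.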
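The in-batch ranking loss on the P&T data can be illustrated with a generic pairwise margin formulation: within a batch, any sentence with a higher human formality rating should receive a higher model score. This is a hedged sketch; the actual objective used in training (margin value, pair sampling, logistic vs. hinge form) is not specified in this card.

```python
def in_batch_ranking_loss(scores, gold, margin=1.0):
    """Pairwise hinge ranking loss over all in-batch pairs whose gold
    formality ratings differ (an illustrative sketch, not the exact
    training objective).

    scores: model formality scores for the batch, list of floats
    gold:   human formality ratings for the batch, list of floats
    """
    n = len(scores)
    # All ordered pairs (i, j) where sentence i is rated more formal than j.
    pairs = [(i, j) for i in range(n) for j in range(n) if gold[i] > gold[j]]
    if not pairs:
        return 0.0
    # Penalize pairs where the model's score gap falls short of the margin.
    losses = [max(0.0, margin - (scores[i] - scores[j])) for i, j in pairs]
    return sum(losses) / len(losses)
```

With perfectly ordered, well-separated scores the loss is zero; misordered pairs are penalized in proportion to how far the score gap falls below the margin.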