s-nlp/roberta-base-formality-ranker

language:
  - en
tags:
  - formality
datasets:
  - GYAFC
  - Pavlick-Tetreault-2016

The model is trained to predict whether an English sentence is formal or informal.

Base model: roberta-base

Datasets: GYAFC (Rao and Tetreault, 2018) and the online formality corpus of Pavlick and Tetreault (2016).

Data augmentation: converting texts to upper or lower case, removing all punctuation, and adding a period at the end of a sentence. Augmentation was applied because the model otherwise relies too heavily on punctuation and capitalization and pays too little attention to other features.
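The augmentations listed above could be sketched roughly as follows. This is a minimal illustration, not the training code: the function names, the use of ASCII punctuation only, the rule of adding a dot only when the text lacks terminal punctuation, and the uniform random choice between augmentations are all assumptions.

```python
import random
import string

# Translation table that deletes every ASCII punctuation character
# (assumption: the original pipeline may treat other symbols differently).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def to_upper(text: str) -> str:
    """Convert the whole text to upper case."""
    return text.upper()


def to_lower(text: str) -> str:
    """Convert the whole text to lower case."""
    return text.lower()


def remove_punctuation(text: str) -> str:
    """Drop all ASCII punctuation, keeping letters, digits and spaces."""
    return text.translate(_PUNCT_TABLE)


def add_final_dot(text: str) -> str:
    """Append a dot unless the text already ends with '.', '!' or '?' (assumed rule)."""
    stripped = text.rstrip()
    return stripped if stripped.endswith((".", "!", "?")) else stripped + "."


# Hypothetical pool of augmentations; one is picked at random per example.
AUGMENTATIONS = [to_upper, to_lower, remove_punctuation, add_final_dot]


def augment(text: str, rng: random.Random) -> str:
    """Apply one randomly chosen augmentation to the text."""
    return rng.choice(AUGMENTATIONS)(text)
```

For example, `remove_punctuation("Hey, what's up?")` yields `Hey whats up`, which removes the punctuation cues the model would otherwise lean on.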

Loss: binary classification loss on GYAFC; in-batch ranking loss on the Pavlick and Tetreault (P&T) data.
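The card does not give the exact form of either loss. As a rough illustration only, the sketch below pairs a standard binary cross-entropy on a single logit with one common pairwise formulation of an in-batch ranking loss: for every pair in the batch whose gold formality scores differ, the model is penalized when its scores are ordered the wrong way. The function names and the logistic pairwise form are assumptions, not the authors' implementation.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bce_with_logit(logit: float, label: int) -> float:
    """Binary cross-entropy for one example, given a raw logit and a 0/1 label."""
    return softplus(logit) - label * logit


def in_batch_ranking_loss(scores: list[float], targets: list[float]) -> float:
    """Mean pairwise logistic loss over all in-batch pairs with targets[i] > targets[j].

    Assumed formulation: each such pair contributes log(1 + exp(-(s_i - s_j))),
    which is small when the more formal item receives the higher score.
    """
    total, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if targets[i] > targets[j]:
                total += softplus(-(scores[i] - scores[j]))
                n_pairs += 1
    # A batch with no orderable pairs contributes nothing.
    return total / n_pairs if n_pairs else 0.0
```

A ranking loss of this kind depends only on the relative order of scores within a batch, which suits continuous formality ratings such as those in the P&T data.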

Performance metrics on the test data:

| dataset | ROC AUC | precision | recall | fscore | accuracy | Spearman |
|---|---|---|---|---|---|---|
| GYAFC | 0.9779 | 0.90 | 0.91 | 0.90 | 0.9087 | 0.8233 |
| GYAFC normalized (lowercase + remove punct.) | 0.9234 | 0.85 | 0.81 | 0.82 | 0.8218 | 0.7294 |

| P&T subset | Spearman R |
|---|---|
| news | 0.4003 |
| answers | 0.7500 |
| blog | 0.7334 |
| email | 0.7606 |
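For readers who want to reproduce the Spearman column on their own predictions, the rank correlation can be computed as the Pearson correlation of average ranks. The sketch below is a dependency-free illustration with hypothetical function names; it is not the evaluation script used for the numbers above, and it raises `ZeroDivisionError` if either input is constant.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

Here `x` would hold the model's formality scores and `y` the gold annotations for the same sentences.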