cointegrated committed on
Commit
507700d
1 Parent(s): ca09afa

Update README.md

  1. README.md +25 -1
README.md CHANGED
@@ -1,9 +1,33 @@
 
 
 
 
 
 
 
 
 
 
  The model has been trained [here](https://git.mts.ai/ai/ml_lab/skoltech-nlp_lab/skoltech/task_oriented_TST/-/blob/main/transfer/formality_ranker_v1.ipynb) to predict for English sentences, whether they are formal or informal.
 
  Base model: `roberta-base`
 
  Datasets: [GYAFC](https://github.com/raosudha89/GYAFC-corpus) from [Rao and Tetreault, 2018](https://aclanthology.org/N18-1012) and [online formality corpus](http://www.seas.upenn.edu/~nlp/resources/formality-corpus.tgz) from [Pavlick and Tetreault, 2016](https://aclanthology.org/Q16-1005).
 
- Data augmentation: changing texts to upper or lower case; removing all punctuation, adding dot at the end of a sentence.
 
  Loss: binary classification (on GYAFC), in-batch ranking (on PT data).

+ ---
+ language:
+ - en
+ tags:
+ - formality
+ datasets:
+ - GYAFC
+ - Pavlick-Tetreault-2016
+ ---
+
  The model has been trained [here](https://git.mts.ai/ai/ml_lab/skoltech-nlp_lab/skoltech/task_oriented_TST/-/blob/main/transfer/formality_ranker_v1.ipynb) to predict for English sentences, whether they are formal or informal.
 
  Base model: `roberta-base`
 
  Datasets: [GYAFC](https://github.com/raosudha89/GYAFC-corpus) from [Rao and Tetreault, 2018](https://aclanthology.org/N18-1012) and [online formality corpus](http://www.seas.upenn.edu/~nlp/resources/formality-corpus.tgz) from [Pavlick and Tetreault, 2016](https://aclanthology.org/Q16-1005).
 
+ Data augmentation: changing texts to upper or lower case; removing all punctuation, adding dot at the end of a sentence. It was applied because otherwise the model is over-reliant on punctuation and capitalization and does not pay enough attention to other features.
 
  Loss: binary classification (on GYAFC), in-batch ranking (on PT data).
+
+ Performance metrics on the validation data:
+
+ | dataset | ROC AUC | accuracy | Spearman R|
+ | - | - | - | - |
+ | GYAFC | 0.9779 | 0.9087 | 0.8233 |
+ | GYAFC normalized (lowercase + remove punct.) | 0.9234 | 0.8218| 0.7294 |
+
+ | P&T subset | Spearman R |
+ | - | - |
+ news | 0.4003
+ answers | 0.7500
+ blog | 0.7334
+ email | 0.7606
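
The data augmentation named in the README (upper or lower casing, removing all punctuation, adding a dot at the end of a sentence) could be sketched roughly as below. This is only an illustration, not the code from the linked notebook: the function names, the list `AUGMENTATIONS`, the probability `p`, and the choice to apply a single transformation at a time are all assumptions. `normalize` mirrors the "lowercase + remove punct." setting from the metrics table, again as an assumed reading of that label.

```python
import random
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation and collapse the whitespace left behind."""
    return " ".join(text.translate(_PUNCT_TABLE).split())


def add_final_dot(text: str) -> str:
    """Append a dot unless the text already ends with '.', '!' or '?'."""
    stripped = text.rstrip()
    if stripped and stripped[-1] not in ".!?":
        return stripped + "."
    return stripped


def normalize(text: str) -> str:
    """Lowercase and strip punctuation (assumed meaning of 'GYAFC normalized')."""
    return remove_punctuation(text.lower())


# Hypothetical pool of transformations; the original notebook may combine them differently.
AUGMENTATIONS = [str.upper, str.lower, remove_punctuation, add_final_dot]


def augment(text: str, rng: random.Random, p: float = 0.5) -> str:
    """With probability p, apply one randomly chosen transformation; otherwise return text unchanged."""
    if rng.random() < p:
        return rng.choice(AUGMENTATIONS)(text)
    return text
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across training runs, which makes it easier to compare a model trained with and without it.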