Commit 507700d, committed by cointegrated
1 Parent(s): ca09afa
Update README.md
README.md CHANGED
@@ -1,9 +1,33 @@
+---
+language:
+- en
+tags:
+- formality
+datasets:
+- GYAFC
+- Pavlick-Tetreault-2016
+---
+
 The model has been trained [here](https://git.mts.ai/ai/ml_lab/skoltech-nlp_lab/skoltech/task_oriented_TST/-/blob/main/transfer/formality_ranker_v1.ipynb) to predict for English sentences, whether they are formal or informal.

 Base model: `roberta-base`

 Datasets: [GYAFC](https://github.com/raosudha89/GYAFC-corpus) from [Rao and Tetreault, 2018](https://aclanthology.org/N18-1012) and [online formality corpus](http://www.seas.upenn.edu/~nlp/resources/formality-corpus.tgz) from [Pavlick and Tetreault, 2016](https://aclanthology.org/Q16-1005).

-Data augmentation: changing texts to upper or lower case; removing all punctuation, adding dot at the end of a sentence.
+Data augmentation: changing texts to upper or lower case; removing all punctuation, adding dot at the end of a sentence. It was applied because otherwise the model is over-reliant on punctuation and capitalization and does not pay enough attention to other features.

 Loss: binary classification (on GYAFC), in-batch ranking (on PT data).
+
+Performance metrics on the validation data:
+
+| dataset | ROC AUC | accuracy | Spearman R |
+| - | - | - | - |
+| GYAFC | 0.9779 | 0.9087 | 0.8233 |
+| GYAFC normalized (lowercase + remove punct.) | 0.9234 | 0.8218 | 0.7294 |
+
+| P&T subset | Spearman R |
+| - | - |
+news | 0.4003
+answers | 0.7500
+blog | 0.7334
+email | 0.7606
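
The data augmentation named in the changed README line (upper or lower casing, removing all punctuation, adding a dot at the end of a sentence) could be sketched roughly as below. This is only an illustration: the helper names, the probability `p`, and the policy of applying at most one transformation per text are assumptions, not taken from the training notebook, which may do this differently.

```python
import random
import string


# Hypothetical helpers: the README lists the transformations but not their code.
def to_upper(text: str) -> str:
    return text.upper()


def to_lower(text: str) -> str:
    return text.lower()


def strip_punctuation(text: str) -> str:
    """Remove all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def add_final_dot(text: str) -> str:
    """Append a dot unless the text already ends with terminal punctuation."""
    stripped = text.rstrip()
    if stripped and stripped[-1] in ".!?":
        return stripped
    return stripped + "."


AUGMENTATIONS = [to_upper, to_lower, strip_punctuation, add_final_dot]


def augment(text: str, rng: random.Random, p: float = 0.5) -> str:
    """With probability p, apply one randomly chosen transformation.

    Both p and the one-transformation policy are assumptions for illustration.
    """
    if rng.random() < p:
        return rng.choice(AUGMENTATIONS)(text)
    return text
```

Applying `augment` to training texts while keeping their labels unchanged exposes the classifier to the same sentence with and without its surface cues, which matches the stated motivation of reducing reliance on punctuation and capitalization.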