Commit 3e8cb87 · 1 Parent(s): 6d5161a
Update README.md
README.md CHANGED
@@ -57,6 +57,57 @@ print(output)
 ### Details on data and training
 The code for preparing the data and training & evaluating the model is fully open-source here: https://github.com/MoritzLaurer/zeroshot-classifier/tree/main
 
+## Metrics
+
+Balanced accuracy metrics on all datasets.
+`deberta-v3-base-zeroshot-v1.1-all-33` was trained on all datasets, with only maximum 500 texts per class to avoid overfitting.
+The metrics on these datasets are therefore not strictly zeroshot, as the model has seen some data for each task.
+`deberta-v3-base-zeroshot-v1.1-heldout` indicates zeroshot performance on the respective dataset.
+To calculate these zeroshot metrics, the pipeline was run 28 times, each time with one dataset held out from training to simulate a zeroshot setup.
+
+![figure_base_v1.1](https://github.com/MoritzLaurer/zeroshot-classifier/blob/main/results/fig_base_v1.1.png)
+
+| | deberta-v3-base-mnli-fever-anli-ling-wanli-binary | deberta-v3-base-zeroshot-v1.1-heldout | deberta-v3-base-zeroshot-v1.1-all-33 |
+|:---------------------------|---------------------------:|----------------------------------------:|---------------------------------------:|
+| datasets mean (w/o nli) | 62 | 70.7 | 84 |
+| amazonpolarity (2) | 91.7 | 95.7 | 96 |
+| imdb (2) | 87.3 | 93.6 | 94.5 |
+| appreviews (2) | 91.3 | 92.2 | 94.4 |
+| yelpreviews (2) | 95.1 | 97.4 | 98.3 |
+| rottentomatoes (2) | 83 | 88.7 | 90.8 |
+| emotiondair (6) | 46.5 | 42.6 | 74.5 |
+| emocontext (4) | 58.5 | 57.4 | 81.2 |
+| empathetic (32) | 31.3 | 37.3 | 52.7 |
+| financialphrasebank (3) | 78.3 | 68.9 | 91.2 |
+| banking77 (72) | 18.9 | 46 | 73.7 |
+| massive (59) | 44 | 56.6 | 78.9 |
+| wikitoxic_toxicaggreg (2) | 73.7 | 82.5 | 90.5 |
+| wikitoxic_obscene (2) | 77.3 | 91.6 | 92.6 |
+| wikitoxic_threat (2) | 83.5 | 95.2 | 96.7 |
+| wikitoxic_insult (2) | 79.6 | 91 | 91.6 |
+| wikitoxic_identityhate (2) | 83.9 | 88 | 94.4 |
+| hateoffensive (3) | 55.2 | 66.1 | 86 |
+| hatexplain (3) | 44.1 | 57.6 | 76.9 |
+| biasframes_offensive (2) | 56.8 | 85.4 | 87 |
+| biasframes_sex (2) | 85.4 | 87 | 91.8 |
+| biasframes_intent (2) | 56.3 | 85.2 | 87.8 |
+| agnews (4) | 77.3 | 80 | 90.5 |
+| yahootopics (10) | 53.6 | 57.7 | 72.8 |
+| trueteacher (2) | 51.4 | 49.5 | 82.4 |
+| spam (2) | 51.8 | 50 | 97.2 |
+| wellformedquery (2) | 49.9 | 52.5 | 77.2 |
+| manifesto (56) | 5.8 | 18.9 | 39.1 |
+| capsotu (21) | 25.2 | 64 | 72.5 |
+| mnli_m (2) | 92.4 | nan | 92.7 |
+| mnli_mm (2) | 92.4 | nan | 92.5 |
+| fevernli (2) | 89 | nan | 89.1 |
+| anli_r1 (2) | 79.4 | nan | 80 |
+| anli_r2 (2) | 68.4 | nan | 68.4 |
+| anli_r3 (2) | 66.2 | nan | 68 |
+| wanli (2) | 81.6 | nan | 81.8 |
+| lingnli (2) | 88.4 | nan | 88.4 |
+
+
 ## Limitations and bias
 The model can only do text classification tasks.
 
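Note on the metrics added in this hunk: the new README text names balanced accuracy and a held-out evaluation loop but does not show either. The sketch below illustrates both ideas under stated assumptions. It uses the standard definition of balanced accuracy (the mean of per-class recall). The function names `balanced_accuracy` and `heldout_splits` are hypothetical and are not taken from the linked repository, and the toy labels are made up. The function returns a fraction between 0 and 1, whereas the table appears to report the value multiplied by 100.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


def heldout_splits(dataset_names):
    """Yield (train_names, heldout_name) pairs, one per dataset (hypothetical helper)."""
    for heldout in dataset_names:
        train = [name for name in dataset_names if name != heldout]
        yield train, heldout


# Made-up toy labels: four "ham" texts, one "spam" text, and a classifier
# that always predicts "ham".
y_true = ["ham", "ham", "ham", "ham", "spam"]
y_pred = ["ham"] * 5
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, score)  # 0.8 0.5

# Three dataset names from the table, used only to show the shape of the loop.
splits = list(heldout_splits(["imdb", "agnews", "spam"]))
```

On imbalanced data, plain accuracy rewards always predicting the majority class, while balanced accuracy does not: a constant prediction on a two-class dataset scores 0.5. That is a useful reference point when reading values near 50 in the two-class rows of the table. With the 28 non-NLI dataset names as input, `heldout_splits` would yield 28 train/held-out pairs, matching the 28 runs described in the added text.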