Update README.md
README.md
CHANGED
@@ -3,7 +3,8 @@ language:
|
|
3 |
- 'da'
|
4 |
- 'no'
|
5 |
library_name: transformers
|
6 |
-
f1-score: 0.
|
|
|
7 |
---
|
8 |
# Model Card for A&ttack2
|
9 |
|
@@ -11,73 +12,31 @@ A text classification model for determining if a social media post in Danish or
|
|
11 |
|
12 |
# Model Description
|
13 |
|
14 |
-
The model is based on the
|
|
|
|
|
15 |
|
16 |
|
17 |
- **Developed by:** The development team at Analyse & Tal
|
18 |
- **Model type:** Language model restricted to classification
|
19 |
- **Language(s) (NLP):** Danish and Norwegian
|
20 |
- **License:** [More Information Needed]
|
21 |
-
- **Finetuned from model:**
|
22 |
|
23 |
|
24 |
# Direct Use
|
25 |
This model can be used for classifying Danish and Norwegian social media posts or similar text.
|
26 |
|
27 |
-
|
28 |
-
|
29 |
-
# Bias, Risks, and Limitations
|
30 |
-
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
|
31 |
-
[More Information Needed]
|
32 |
-
|
33 |
# Training Data
|
34 |
A collection of ~70k Norwegian and ~67k Danish social media posts has been manually annotated as 'verbal attack' or 'nothing'. 5% of the posts were annotated by more than one annotator, and the annotators agreed on 83% of these annotations.
|
35 |
|
36 |
-
10%
|
37 |
-
[More information needed on the data split method and the training-validation-test split.]
|
38 |
-
|
39 |
-
|
40 |
-
# Evaluation
|
41 |
-
<!-- This section describes the evaluation protocols and provides the results. -->
|
42 |
-
## Testing Data, Factors & Metrics
|
43 |
-
|
44 |
-
### Testing Data
|
45 |
-
<!-- This should link to a Data Card if possible. -->
|
46 |
-
|
47 |
-
[More Information Needed]
|
48 |
-
|
49 |
-
### Factors
|
50 |
-
|
51 |
-
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
|
52 |
-
|
53 |
-
[More Information Needed]
|
54 |
-
|
55 |
-
### Metrics
|
56 |
-
|
57 |
-
Macro-averaged f1-score: 0.83
|
58 |
-
|
59 |
-
[More Information Needed]
|
60 |
-
|
61 |
-
## Results
|
62 |
-
|
63 |
-
[More Information Needed]
|
64 |
-
|
65 |
-
### Summary
|
66 |
-
|
67 |
-
|
68 |
-
|
69 |
-
|
70 |
-
# Environmental Impact
|
71 |
|
72 |
-
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
|
73 |
|
74 |
-
|
75 |
|
76 |
-
-
|
77 |
-
-
|
78 |
-
- **Cloud Provider:** Azure
|
79 |
-
- **Compute Region:** North-Europe
|
80 |
-
- **Carbon Emitted:** [More Information Needed]
|
81 |
|
82 |
|
83 |
# Model Card Authors
|
|
|
3 |
- 'da'
|
4 |
- 'no'
|
5 |
library_name: transformers
|
6 |
+
f1-score (Danish): 0.87
|
7 |
+
f1-score (Norwegian): 0.76
|
8 |
---
|
9 |
# Model Card for A&ttack2
|
10 |
|
|
|
12 |
|
13 |
# Model Description
|
14 |
|
15 |
+
The model is based on north/t5_large_scand (by Per E. Kummervold, not publicly available), a Scandinavian language model pretrained for 1,700,000 steps, starting from the mT5 checkpoint, on a Scandinavian corpus (Bokmål, Nynorsk, Danish, Swedish, Icelandic and a small amount of Faroese). This pretrained model was trained to improve the understanding of what effect such training has on various languages.
|
16 |
+
|
17 |
+
The model was finetuned for 20,000 steps with a batch size of 8. The finetuning data consist of ~70k Norwegian and ~67k Danish social media posts, each classified as either 'verbal attack' or 'nothing', which makes this a text-to-text model restricted to classification. The model is described in Danish in [this report](https://strapi.ogtal.dk/uploads/966f1ebcfa9942d3aef338e9920611f4.pdf).
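
As a rough worked example of what the step count and batch size imply, the sketch below computes how many examples the finetuning run processes. The estimate of passes over the data rests on an assumption the card does not state: that only the 70% training portions of both datasets are used for finetuning. The post counts are the approximate figures quoted in the card.

```python
# Figures taken from the model card; post counts are approximate (~70k, ~67k).
STEPS = 20_000
BATCH_SIZE = 8
NORWEGIAN_POSTS = 70_000
DANISH_POSTS = 67_000

# Assumption (not stated in the card): only the training splits are used.
TRAIN_FRACTION = 0.70

# Each step processes one batch, so the run sees STEPS * BATCH_SIZE examples.
examples_seen = STEPS * BATCH_SIZE

# Approximate number of training examples under the assumption above.
train_examples = round((NORWEGIAN_POSTS + DANISH_POSTS) * TRAIN_FRACTION)

# Approximate number of passes (epochs) over the training data.
approx_epochs = examples_seen / train_examples

print(f"examples processed: {examples_seen}")
print(f"approx. training examples: {train_examples}")
print(f"approx. epochs: {approx_epochs:.2f}")
```

Under that assumption the run amounts to fewer than two passes over the training data.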
|
18 |
|
19 |
|
20 |
- **Developed by:** The development team at Analyse & Tal
|
21 |
- **Model type:** Language model restricted to classification
|
22 |
- **Language(s) (NLP):** Danish and Norwegian
|
23 |
- **License:** [More Information Needed]
|
24 |
+
- **Finetuned from model:** north/t5_large_scand (by Per E. Kummervold, not publicly available)
|
25 |
|
26 |
|
27 |
# Direct Use
|
28 |
This model can be used for classifying Danish and Norwegian social media posts or similar text.
|
29 |
|
30 |
# Training Data
|
31 |
A collection of ~70k Norwegian and ~67k Danish social media posts has been manually annotated as 'verbal attack' or 'nothing'. 5% of the posts were annotated by more than one annotator, and the annotators agreed on 83% of these annotations.
|
32 |
|
33 |
+
The Norwegian data are split into 70% training, 20% validation and 10% test. The Danish data are split into 70% training, 15% validation and 15% test.
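
The card does not say how the splits were produced (for example random or stratified), so the following is only a minimal sketch of one possible approach: a seeded random shuffle followed by slicing. The function name `split_dataset` and the placeholder post lists are hypothetical and used purely for illustration.

```python
import random


def split_dataset(items, train_frac, val_frac, seed=0):
    """Shuffle items and split them into (train, validation, test) lists.

    The test split receives whatever remains after train and validation.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder data standing in for the annotated posts.
norwegian = [f"no_post_{i}" for i in range(1000)]
danish = [f"da_post_{i}" for i in range(1000)]

# Norwegian: 70% / 20% / 10%. Danish: 70% / 15% / 15%.
no_train, no_val, no_test = split_dataset(norwegian, 0.70, 0.20)
da_train, da_val, da_test = split_dataset(danish, 0.70, 0.15)

print(len(no_train), len(no_val), len(no_test))
print(len(da_train), len(da_val), len(da_test))
```

Fixing the seed keeps the split reproducible, so that test posts never leak into training across runs.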
|
34 |
|
|
|
35 |
|
36 |
+
# Evaluation metrics
|
37 |
|
38 |
+
Macro-averaged f1-score for Danish data: 0.87
|
39 |
+
Macro-averaged f1-score for Norwegian data: 0.76
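
For readers unfamiliar with the metric, the sketch below shows how a macro-averaged f1-score is computed for the two labels used here: the f1-score is calculated per class and the classes are then averaged with equal weight. The toy labels are invented for illustration and are not taken from the model's test data, and the card does not state which tool produced the reported scores.

```python
def macro_f1(y_true, y_pred, labels=("verbal attack", "nothing")):
    """Return the unweighted mean of the per-class f1-scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example (illustrative only).
y_true = ["verbal attack", "nothing", "nothing", "verbal attack", "nothing", "nothing"]
y_pred = ["verbal attack", "nothing", "verbal attack", "nothing", "nothing", "nothing"]

score = macro_f1(y_true, y_pred)
print(f"macro f1: {score:.3f}")
```

Because each class counts equally, the rarer 'verbal attack' class influences the score as much as the more common 'nothing' class.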
|
40 |
|
41 |
|
42 |
# Model Card Authors
|