owinton committed on
Commit
204ee4f
1 Parent(s): 73d759b

Update README.md

  1. README.md +10 -51
README.md CHANGED
@@ -3,7 +3,8 @@ language:
3
  - 'da'
4
  - 'no'
5
  library_name: transformers
6
- f1-score: 0.83
 
7
  ---
8
  # Model Card for A&ttack2
9
 
@@ -11,73 +12,31 @@ A text classification model for determining if a social media post in Danish or
11
 
12
  # Model Description
13
 
14
- The model is based on the [North-T5-NCC Large](https://huggingface.co/north/t5_large_NCC) (developed by Per E. Kummervold) which is a Scandinavian language built upon [T5](https://github.com/google-research/text-to-text-transfer-transformer) and [T5X](https://github.com/google-research/t5x). The model is further trained on ~70k Norwegian and ~67k Danish social media posts which have been classified as either 'verbal attack' or 'nothing', making it a text-to-text model restricted to do classification. The model is described in Danish in [this report](https://strapi.ogtal.dk/uploads/966f1ebcfa9942d3aef338e9920611f4.pdf).
 
 
15
 
16
 
17
  - **Developed by:** The development team at Analyse & Tal
18
  - **Model type:** Language model restricted to classification
19
  - **Language(s) (NLP):** Danish and Norwegian
20
  - **License:** [More Information Needed]
21
- - **Finetuned from model:** [North-T5-NCC Large](https://huggingface.co/north/t5_large_NCC)
22
 
23
 
24
  # Direct Use
25
  This model can be used for classifying Danish and Norwegian social media posts or similar text.
26
 
27
-
28
-
29
- # Bias, Risks, and Limitations
30
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
31
- [More Information Needed]
32
-
33
  # Training Data
34
  A collection of ~70k Norwegian and ~67k Danish social media posts has been manually annotated as 'verbal attack' or 'nothing'. 5% of the posts have been annotated by more than one annotator, with the annotators in agreement for 83% of annotations.
35
 
36
- 10% of training data are held out for test
37
- [More information needed on the data split method and the training-validation-test split.]
38
-
39
-
40
- # Evaluation
41
- <!-- This section describes the evaluation protocols and provides the results. -->
42
- ## Testing Data, Factors & Metrics
43
-
44
- ### Testing Data
45
- <!-- This should link to a Data Card if possible. -->
46
-
47
- [More Information Needed]
48
-
49
- ### Factors
50
-
51
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
52
-
53
- [More Information Needed]
54
-
55
- ### Metrics
56
-
57
- Macro-averaged f1-score: 0.83
58
-
59
- [More Information Needed]
60
-
61
- ## Results
62
-
63
- [More Information Needed]
64
-
65
- ### Summary
66
-
67
-
68
-
69
-
70
- # Environmental Impact
71
 
72
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
73
 
74
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
75
 
76
- - **Hardware Type:** [More Information Needed]
77
- - **Hours used:** [More Information Needed]
78
- - **Cloud Provider:** Azure
79
- - **Compute Region:** North-Europe
80
- - **Carbon Emitted:** [More Information Needed]
81
 
82
 
83
  # Model Card Authors
 
3
  - 'da'
4
  - 'no'
5
  library_name: transformers
6
+ f1-score (Danish): 0.87
7
+ f1-score (Norwegian): 0.76
8
  ---
9
  # Model Card for A&ttack2
10
 
 
12
 
13
  # Model Description
14
 
15
+ The model is based on north/t5_large_scand (by Per E. Kummervold, not publicly available), a Scandinavian language model pretrained for 1,700,000 steps on a Scandinavian corpus (Bokmål, Nynorsk, Danish, Swedish and Icelandic, plus a small amount of Faroese), starting from the mT5 checkpoint. That base model was trained to better understand what effect such training has on various languages.
16
+
17
+ The model is finetuned for 20,000 steps with a batch size of 8. The training data consists of ~70k Norwegian and ~67k Danish social media posts, each classified as either 'verbal attack' or 'nothing', making it a text-to-text model restricted to classification. The model is described in Danish in [this report](https://strapi.ogtal.dk/uploads/966f1ebcfa9942d3aef338e9920611f4.pdf).
18
 
19
 
20
  - **Developed by:** The development team at Analyse & Tal
21
  - **Model type:** Language model restricted to classification
22
  - **Language(s) (NLP):** Danish and Norwegian
23
  - **License:** [More Information Needed]
24
+ - **Finetuned from model:** north/t5_large_scand (by Per E. Kummervold, not publicly available)
25
 
26
 
27
  # Direct Use
28
  This model can be used for classifying Danish and Norwegian social media posts or similar text.
29
 
30
  # Training Data
31
  A collection of ~70k Norwegian and ~67k Danish social media posts has been manually annotated as 'verbal attack' or 'nothing'. 5% of the posts have been annotated by more than one annotator, with the annotators in agreement for 83% of annotations.
32
 
33
+ The Norwegian data are split into 70% training, 20% validation and 10% test. The Danish data are split into 70% training, 15% validation and 15% test.
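
The split proportions stated above could be reproduced along the following lines. This is only a minimal sketch: the model card does not say how posts were assigned to each split, so the seeded random shuffle, the function name `split_posts`, and its parameters are all assumptions made for illustration.

```python
import random


def split_posts(posts, train=0.70, validation=0.20, seed=0):
    """Split posts into (train, validation, test) lists.

    Assumption: a seeded random shuffle. The model card does not
    describe the actual assignment method. The test set receives
    whatever remains after the train and validation shares.
    """
    shuffled = list(posts)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Norwegian proportions: 70% / 20% / 10%
no_train, no_val, no_test = split_posts(range(100), validation=0.20)

# Danish proportions: 70% / 15% / 15%
da_train, da_val, da_test = split_posts(range(100), validation=0.15)
```

Fixing the seed keeps the three sets stable between runs, which matters when the test set is used to report scores.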
34
 
 
35
 
36
+ # Evaluation metrics
37
 
38
+ Macro-averaged f1-score for Danish data: 0.87
39
+ Macro-averaged f1-score for Norwegian data: 0.76
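
For readers unfamiliar with the metric, a macro-averaged F1 score is the unweighted mean of the per-class F1 scores, so the rarer class counts as much as the common one. The sketch below illustrates the calculation for the two labels used in this model card. The function name `macro_f1` and the toy labels are illustrative assumptions, not the evaluation code used for the reported scores.

```python
def macro_f1(y_true, y_pred, labels=("verbal attack", "nothing")):
    """Return the unweighted mean of the per-label F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up labels (not real evaluation data)
y_true = ["verbal attack", "nothing", "nothing", "nothing"]
y_pred = ["verbal attack", "verbal attack", "nothing", "nothing"]
score = macro_f1(y_true, y_pred)
```

In the toy example the per-class scores are 2/3 for 'verbal attack' and 4/5 for 'nothing', so the macro average is their plain mean.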
 
 
 
40
 
41
 
42
  # Model Card Authors