ogtal/A-og-ttack2

@@ -3,7 +3,8 @@ language:
 - 'da'
 - 'no'
 library_name: transformers
-f1-score: 0.83
 ---
 # Model Card for A&ttack2
@@ -11,73 +12,31 @@ A text classification model for determining if a social media post in Danish or
 # Model Description
-The model is based on the [North-T5-NCC Large](https://huggingface.co/north/t5_large_NCC) (developed by Per E. Kummervold) which is a Scandinavian language built upon [T5](https://github.com/google-research/text-to-text-transfer-transformer) and [T5X](https://github.com/google-research/t5x). The model is further trained on ~70k Norwegian and ~67k Danish social media posts which have been classified as either 'verbal attack' or 'nothing', making it a text-to-text model restricted to do classification. The model is described in Danish in [this report](https://strapi.ogtal.dk/uploads/966f1ebcfa9942d3aef338e9920611f4.pdf).
 - **Developed by:** The development team at Analyse & Tal
 - **Model type:** Language model restricted to classification
 - **Language(s) (NLP):** Danish and Norwegian
 - **License:** [More Information Needed]
-- **Finetuned from model:** [North-T5-NCC Large](https://huggingface.co/north/t5_large_NCC)
 # Direct Use
 This model can be used for classifying Danish and Norwegian social media posts or similar text.
-# Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
 # Training Data
 A collection of ~70k Norwegian and ~67k Danish social media posts has been manually annotated as 'verbal attack' or 'nothing'. 5% of the posts have been annotated by more than one annotator, with the annotators agreeing on 83% of annotations.
-10% of training data are held out for test
-[More information needed on the data split method and the training-validation-test split.]
-# Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-## Testing Data, Factors & Metrics
-### Testing Data
-<!-- This should link to a Data Card if possible. -->
-[More Information Needed]
-### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-### Metrics
-Macro-averaged f1-score: 0.83
-[More Information Needed]
-## Results
-[More Information Needed]
-### Summary
-# Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** Azure
-- **Compute Region:** North-Europe
-- **Carbon Emitted:** [More Information Needed]
 # Model Card Authors

 - 'da'
 - 'no'
 library_name: transformers
+f1-score (Danish): 0.87
+f1-score (Norwegian): 0.76
 ---
 # Model Card for A&ttack2
 # Model Description
+The model is based on north/t5_large_scand (by Per E. Kummervold, not publicly available), a Scandinavian language model pretrained for 1,700,000 steps, starting from the mT5 checkpoint, on a Scandinavian corpus (Bokmål, Nynorsk, Danish, Swedish and Icelandic, plus a small amount of Faroese). That base model was trained to better understand what effect such pretraining has on the various languages.
+The model is finetuned for 20,000 steps with a batch size of 8. The finetuning data consists of ~70k Norwegian and ~67k Danish social media posts, each classified as either 'verbal attack' or 'nothing', which makes this a text-to-text model restricted to classification. The model is described in Danish in [this report](https://strapi.ogtal.dk/uploads/966f1ebcfa9942d3aef338e9920611f4.pdf).
 - **Developed by:** The development team at Analyse & Tal
 - **Model type:** Language model restricted to classification
 - **Language(s) (NLP):** Danish and Norwegian
 - **License:** [More Information Needed]
+- **Finetuned from model:** north/t5_large_scand (by Per E. Kummervold, not publicly available)
 # Direct Use
 This model can be used for classifying Danish and Norwegian social media posts or similar text.
 # Training Data
 A collection of ~70k Norwegian and ~67k Danish social media posts has been manually annotated as 'verbal attack' or 'nothing'. 5% of the posts have been annotated by more than one annotator, with the annotators agreeing on 83% of annotations.
+The Norwegian data are split into 70% training, 20% validation and 10% test. The Danish data are split into 70% training, 15% validation and 15% test.
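The split proportions can be illustrated with a minimal sketch. The card does not say how the split was drawn, so the seeded random shuffle below is an assumption, `split_indices` is a hypothetical helper, and the post counts are the approximate ~70k and ~67k figures rounded to whole thousands.

```python
import random


def split_indices(n, fractions, seed=0):
    """Shuffle range(n) and cut it into consecutive parts sized by `fractions`.

    Hypothetical helper: a plain seeded random split, which is an assumption
    and not necessarily the method used for this model.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes the remainder so every index is used exactly once.
        end = n if i == len(fractions) - 1 else start + round(n * frac)
        parts.append(indices[start:end])
        start = end
    return parts


# Approximate dataset sizes (assumed round numbers for illustration).
no_train, no_val, no_test = split_indices(70_000, (0.70, 0.20, 0.10))
da_train, da_val, da_test = split_indices(67_000, (0.70, 0.15, 0.15))
```

Under these assumed counts, the Norwegian parts hold 49,000 / 14,000 / 7,000 posts and the Danish parts 46,900 / 10,050 / 10,050.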
+# Evaluation metrics
+Macro-averaged f1-score for Danish data: 0.87
+Macro-averaged f1-score for Norwegian data: 0.76
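For readers unfamiliar with the metric, the following is a minimal sketch of how a macro-averaged F1 score is computed over the two labels: F1 is calculated per class and the class scores are averaged with equal weight. The function name `macro_f1` and the toy labels are illustrative assumptions; this is not the evaluation code used to produce the scores above.

```python
def macro_f1(y_true, y_pred, labels=("verbal attack", "nothing")):
    """Unweighted mean of the per-class F1 scores (illustrative helper)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (made-up labels, not real evaluation data).
toy_true = ["verbal attack", "nothing", "nothing", "verbal attack"]
toy_pred = ["verbal attack", "nothing", "verbal attack", "nothing"]
toy_score = macro_f1(toy_true, toy_pred)
```

Because each class counts equally, the less frequent class influences the score as much as the more frequent one.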
 # Model Card Authors