|
--- |
|
language: |
|
- no |
|
- da |
|
library_name: transformers |
|
metrics: |
|
- f1-score (Danish): 0.87 |
|
- f1-score (Norwegian): 0.76 |
|
--- |
|
# Model Card for A&ttack2 |
|
|
|
A text classification model for determining if a social media post in Danish or Norwegian contains a verbal attack. |
|
|
|
# Model Description |
|
|
|
The model is based on north/t5_large_scand (by Per E. Kummervold, not publicly available), a Scandinavian language model pretrained for 1,700,000 steps, starting from the mT5 checkpoint, on a Scandinavian corpus (Bokmål, Nynorsk, Danish, Swedish and Icelandic, plus a small amount of Faroese). |
|
|
|
The model is fine-tuned for 20,000 steps with a batch size of 8. The data consists of ~70k Norwegian and ~67k Danish social media posts, each classified as either 'verbal attack' or 'nothing', which makes it a text-to-text model restricted to classification. The model is described in Danish in [this report](https://www.ogtal.dk/assets/files/230403-Analyse-Tall-Angrep-hat-i-den-offentlige-debatten-paa-Facebook.pdf). |
|
|
|
|
|
- **Developed by:** The development team at Analyse & Tal |
|
- **Model type:** Language model restricted to classification |
|
- **Language(s) (NLP):** Danish and Norwegian |
|
- **License:** [More Information Needed] |
|
- **Finetuned from model:** north/t5_large_scand (by Per E. Kummervold, not publicly available) |
|
|
|
|
|
# Direct Use |
|
This model can be used for classifying Danish and Norwegian social media posts (or other texts) as either 'verbal attack' or 'nothing'. |
|
|
|
# Training Data |
|
A collection of ~70k Norwegian and ~67k Danish social media posts has been manually annotated as 'verbal attack' or 'nothing'. 5% of the posts have been annotated by more than one annotator, with the annotators in agreement for 83% of annotations. |
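The card does not say how the agreement figure was computed. As a minimal sketch, assuming it is raw percent agreement over the doubly annotated posts, the calculation could look like this. The function name and the toy labels are hypothetical and are not taken from the real data set.

```python
def percent_agreement(labels_a, labels_b):
    """Share of doubly annotated posts where both annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same posts")
    if not labels_a:
        raise ValueError("need at least one doubly annotated post")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical toy labels for six posts (illustration only).
annotator_1 = ["verbal attack", "nothing", "nothing", "verbal attack", "nothing", "nothing"]
annotator_2 = ["verbal attack", "nothing", "verbal attack", "verbal attack", "nothing", "nothing"]

agreement = percent_agreement(annotator_1, annotator_2)
print(f"{agreement:.2f}")
```

Raw agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa would usually be lower on imbalanced labels.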
|
|
|
The Norwegian data are split into 70% training, 20% validation and 10% test. The Danish data are split into 70% training, 15% validation and 15% test. |
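To give a rough sense of scale, the split percentages can be turned into approximate post counts. This is only a sketch: it uses the rounded ~70k and ~67k totals from this card, so the real split sizes will differ, and the helper name `split_sizes` is hypothetical. The card does not state whether the splits were random or stratified.

```python
def split_sizes(n_posts, percentages):
    """Turn integer split percentages into whole-number sizes that sum to n_posts."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    names = list(percentages)
    # Integer arithmetic avoids floating-point rounding surprises.
    sizes = {name: n_posts * percentages[name] // 100 for name in names[:-1]}
    # The last split takes whatever remains, so nothing is lost to rounding.
    sizes[names[-1]] = n_posts - sum(sizes.values())
    return sizes


# Approximate totals from the card (assumed round numbers, not exact counts).
norwegian = split_sizes(70_000, {"train": 70, "validation": 20, "test": 10})
danish = split_sizes(67_000, {"train": 70, "validation": 15, "test": 15})

print(norwegian)
print(danish)
```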
|
|
|
# Evaluation Metrics |
|
|
|
Macro-averaged f1-score for Danish data: 0.87 |
|
Macro-averaged f1-score for Norwegian data: 0.76 |
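Macro-averaging computes an F1 score per label and then takes the unweighted mean, so the rarer label counts as much as the common one. The following sketch shows the calculation for the two labels used here. It is a plain-Python illustration with hypothetical toy predictions, not the evaluation script behind the scores above.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred, labels=("verbal attack", "nothing")):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


# Hypothetical toy data (illustration only).
y_true = ["verbal attack", "nothing", "nothing", "nothing", "verbal attack", "nothing"]
y_pred = ["verbal attack", "nothing", "verbal attack", "nothing", "nothing", "nothing"]

score = macro_f1(y_true, y_pred)
print(score)
```

In this toy case the per-label scores are 0.5 for 'verbal attack' and 0.75 for 'nothing', giving a macro average of 0.625.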
|
|
|
|
|
# Model Card Authors |
|
This model card was written by the developer team at Analyse & Tal. Contact: oyvind@ogtal.dk. |
|
|
|
# How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
``` |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
# Download/load tokenizer and language model |
|
tokenizer = AutoTokenizer.from_pretrained("ogtal/A-og-ttack2") |
|
model = AutoModelForSeq2SeqLM.from_pretrained("ogtal/A-og-ttack2") |
|
|
|
# Give sample text. The example is from a social media comment. |
|
sample_text = "Velbekomme dit klamme usle løgnersvin!" |
|
input_ids = tokenizer(sample_text, return_tensors="pt").input_ids |
|
|
|
# Forward pass and print the output |
|
outputs = model.generate(input_ids) |
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
``` |
|
|
|
Running the above code will print "angreb" (attack in Danish). |
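For more than one post, the same steps can be batched. The sketch below is one possible way to do it and rests on some assumptions: the helper names `is_attack` and `classify_batch` are hypothetical, the lowercasing in `is_attack` is a defensive choice, and only the "angreb" output string is documented in this card, so any other output is simply treated as not an attack.

```python
def is_attack(decoded_output):
    """True if the decoded model output is the documented attack label "angreb"."""
    return decoded_output.strip().lower() == "angreb"


def classify_batch(texts, model_name="ogtal/A-og-ttack2"):
    """Generate one decoded label string per input text."""
    # Imported lazily so is_attack can be used without transformers installed.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Padding and truncation let posts of different lengths share one batch.
    encoded = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**encoded)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Example usage (downloads the model, so it is left commented out):
# labels = classify_batch(["Velbekomme dit klamme usle løgnersvin!", "God morgen!"])
# print([is_attack(label) for label in labels])
```

Loading the tokenizer and model inside the function keeps the sketch short; for repeated calls it would be better to load them once and pass them in.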