---
language:
- no
- da
library_name: transformers
metrics:
- f1-score (Danish): 0.87
- f1-score (Norwegian): 0.76
---
# Model Card for A&ttack2

A text classification model for determining if a social media post in Danish or Norwegian contains a verbal attack.

# Model Description

The model is based on north/t5_large_scand (by Per E. Kummervold, not publicly available), a Scandinavian language model pretrained for 1,700,000 steps on a Scandinavian corpus (Bokmål, Nynorsk, Danish, Swedish, and Icelandic, plus a small amount of Faroese), starting from the mT5 checkpoint.

The model was fine-tuned for 20,000 steps with a batch size of 8. The training data consists of ~70k Norwegian and ~67k Danish social media posts, each labelled as either 'verbal attack' or 'nothing', which makes this a text-to-text model restricted to classification. The model is described in Danish in [this report](https://www.ogtal.dk/assets/files/230403-Analyse-Tall-Angrep-hat-i-den-offentlige-debatten-paa-Facebook.pdf).


- **Developed by:** The development team at Analyse & Tal
- **Model type:** Language model restricted to classification
- **Language(s) (NLP):** Danish and Norwegian
- **License:** [More Information Needed]
- **Finetuned from model:** north/t5_large_scand (by Per E. Kummervold, not publicly available)


# Direct Use
This model can be used for classifying Danish and Norwegian social media posts (or other texts) as either 'verbal attack' or 'nothing'.

# Training Data
A collection of ~70k Norwegian and ~67k Danish social media posts has been manually annotated as 'verbal attack' or 'nothing'. 5% of the posts were annotated by more than one annotator, and the annotators agreed on 83% of those annotations.
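
One simple way to measure annotator agreement is raw percent agreement: the share of doubly annotated posts on which both annotators chose the same label. The sketch below assumes that reading. The card does not say how the 83% figure was computed, and the function name `percent_agreement` and the toy labels are made up for illustration.

```python
def percent_agreement(labels_a, labels_b):
    """Share of doubly annotated posts where both annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same posts")
    if not labels_a:
        raise ValueError("need at least one doubly annotated post")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy labels only (not taken from the real dataset).
annotator_1 = ["attack", "nothing", "nothing", "attack", "nothing"]
annotator_2 = ["attack", "nothing", "attack", "attack", "nothing"]

agreement = percent_agreement(annotator_1, annotator_2)
print(f"{agreement:.2f}")  # 4 of 5 labels match
```

Raw agreement does not correct for agreement that would occur by chance, so it tends to look high when one class dominates.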

The Norwegian data are split into 70% training, 20% validation, and 10% test. The Danish data are split into 70% training, 15% validation, and 15% test.
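
As an illustration of these proportions, here is a minimal sketch of a shuffled three-way split. It assumes a plain random split. The card does not state whether the real splits were random, stratified, or made some other way, and `split_dataset` is a hypothetical helper, not the authors' code.

```python
import random


def split_dataset(posts, train_frac, val_frac, seed=0):
    """Shuffle posts and cut them into train/validation/test lists.

    The test set receives whatever remains after train and validation.
    """
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must leave room for a test set")
    shuffled = list(posts)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder posts standing in for the annotated data.
posts = [f"post {i}" for i in range(1000)]

# Norwegian-style split: 70% / 20% / 10%
no_train, no_val, no_test = split_dataset(posts, 0.70, 0.20)
# Danish-style split: 70% / 15% / 15%
da_train, da_val, da_test = split_dataset(posts, 0.70, 0.15)

print(len(no_train), len(no_val), len(no_test))
print(len(da_train), len(da_val), len(da_test))
```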

# Evaluation Metrics

- Macro-averaged F1-score for Danish data: 0.87
- Macro-averaged F1-score for Norwegian data: 0.76
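
Macro-averaging computes the F1-score separately for each class and then takes the unweighted mean, so the rarer class counts as much as the more common one. The sketch below shows that calculation on toy labels. The label strings and the example predictions are invented for illustration and are not the evaluation data behind the scores above.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1-score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy example: 2 attacks and 4 non-attacks.
y_true = ["attack", "attack", "nothing", "nothing", "nothing", "nothing"]
y_pred = ["attack", "nothing", "nothing", "nothing", "nothing", "attack"]

# Per-class F1 is 0.5 for "attack" and 0.75 for "nothing".
print(macro_f1(y_true, y_pred))  # mean of the two per-class scores
```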


# Model Card Authors
This model card was written by the developer team at Analyse & Tal. Contact: oyvind@ogtal.dk.

# How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Download/load tokenizer and language model
tokenizer = AutoTokenizer.from_pretrained("ogtal/A-og-ttack2")
model = AutoModelForSeq2SeqLM.from_pretrained("ogtal/A-og-ttack2")

# Give sample text. The example is from a social media comment.
sample_text = "Velbekomme dit klamme usle løgnersvin!"
input_ids = tokenizer(sample_text, return_tensors="pt").input_ids

# Forward pass and print the output
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Running the above code will print "angreb" (attack in Danish).
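
To classify several posts at once, the same steps can be wrapped in a small helper. This is a sketch under assumptions: `classify_posts` is a hypothetical name, and the padding, truncation, and `max_new_tokens=8` settings are illustrative choices based on the labels being short, not settings documented for this model.

```python
def classify_posts(texts, tokenizer, model, max_new_tokens=8):
    """Return one decoded label string per input text."""
    # Pad to a common length so all posts fit in one batch tensor.
    encoded = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True
    )
    # Pass the attention mask so padded positions are ignored.
    outputs = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_new_tokens=max_new_tokens,  # assumed: labels are short
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With `tokenizer` and `model` loaded as in the snippet above, `classify_posts([sample_text], tokenizer, model)` returns a list containing one label string per post.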