---
language:
- 'da'
- 'no'
library_name: transformers
f1-score: 0.83
---
# Model Card for A&ttack2

A text classification model that determines whether a social media post in Danish or Norwegian contains a verbal attack.

# Model Description

The model is based on [North-T5-NCC Large](https://huggingface.co/north/t5_large_NCC) (developed by Per E. Kummervold), a Scandinavian language model built upon [T5](https://github.com/google-research/text-to-text-transfer-transformer) and [T5X](https://github.com/google-research/t5x). It has been further trained on ~70k Norwegian and ~67k Danish social media posts, each classified as either 'verbal attack' or 'nothing', which makes it a text-to-text model restricted to classification. The model is described in Danish in [this report](https://strapi.ogtal.dk/uploads/966f1ebcfa9942d3aef338e9920611f4.pdf).


- **Developed by:** The development team at Analyse & Tal
- **Model type:** Language model restricted to classification
- **Language(s) (NLP):** Danish and Norwegian
- **License:** [More Information Needed]
- **Finetuned from model:** [North-T5-NCC Large](https://huggingface.co/north/t5_large_NCC)


# Direct Use
This model can be used for classifying Danish and Norwegian social media posts or similar text.



# Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]

# Training Data
A collection of ~70k Norwegian and ~67k Danish social media posts has been manually annotated as 'verbal attack' or 'nothing'. 5% of the posts have been annotated by more than one annotator, with the annotators agreeing on 83% of the annotations.

[More information needed on the data split method and the training-validation-test split.]


# Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
## Testing Data, Factors & Metrics

### Testing Data
<!-- This should link to a Data Card if possible. -->

[More Information Needed]

### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

### Metrics

Macro-averaged F1-score: 0.83
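For readers unfamiliar with the metric, the following is a minimal sketch of how a macro-averaged F1-score is computed: the F1-score is calculated separately for each class and the per-class scores are then averaged with equal weight. The function name `macro_f1`, the label strings and the toy data are illustrative assumptions. This is not the evaluation code used for the model, and the toy data does not reproduce the reported score.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical labels, not taken from the real test set)
y_true = ["attack", "attack", "nothing", "nothing", "nothing", "nothing"]
y_pred = ["attack", "nothing", "nothing", "nothing", "nothing", "attack"]

# 'attack' F1 = 0.5, 'nothing' F1 = 0.75, so the macro average is 0.625
print(macro_f1(y_true, y_pred))
```

Because both classes count equally, the score is not dominated by the more frequent class.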

[More Information Needed]

## Results

[More Information Needed]

### Summary




# Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** Azure
- **Compute Region:** North-Europe
- **Carbon Emitted:** [More Information Needed]


# Model Card Authors
This model card was written by the developer team at Analyse & Tal. Contact: oyvind@ogtal.dk.

# How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Download/load tokenizer and language model
tokenizer = AutoTokenizer.from_pretrained("ogtal/A-og-ttack2")
model = AutoModelForSeq2SeqLM.from_pretrained("ogtal/A-og-ttack2")

# Give sample text. The example is from a social media comment.
sample_text = "Velbekomme dit klamme usle løgnersvin!"
input_ids = tokenizer(sample_text, return_tensors="pt").input_ids

# Forward pass and print the output
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Running the code above prints "angreb" (Danish for "attack").
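To classify several posts at once, the same steps can be wrapped in a small helper. The sketch below is an illustration and not part of the released code: the function name `classify_posts` and the `attack_label` parameter are assumptions, and it treats any decoded output other than "angreb" as a non-attack, since the exact string the model emits for the 'nothing' class is not stated here.

```python
def classify_posts(texts, tokenizer, model, attack_label="angreb"):
    """Return a (decoded_label, is_attack) pair for each text.

    `tokenizer` and `model` are expected to be the objects loaded with
    AutoTokenizer / AutoModelForSeq2SeqLM in the example above.
    """
    # Pad to a common length so the posts can be processed as one batch
    batch = tokenizer(list(texts), return_tensors="pt", padding=True, truncation=True)
    # Unpacking passes both input_ids and attention_mask to generate()
    outputs = model.generate(**batch)
    labels = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Assumption: anything that is not the attack label counts as non-attack
    return [(label, label.strip() == attack_label) for label in labels]
```

With `tokenizer` and `model` loaded as above, `classify_posts([sample_text], tokenizer, model)` returns a list with one pair per input post.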