|
--- |
|
language: |
|
- 'da' |
|
- 'no' |
|
library_name: transformers |
|
f1-score: 0.76 |
|
--- |
|
# Model Card for A&ttack2 |
|
|
|
Text classification model that determines whether or not a short text contains an attack.
|
|
|
|
|
# Model Description |
|
|
|
The model is based on [North-T5-NCC Large](https://huggingface.co/north/t5_large_NCC) (developed by Per E. Kummervold), a Scandinavian language model built upon [T5](https://github.com/google-research/text-to-text-transfer-transformer) and [T5X](https://github.com/google-research/t5x). It was further trained on ~70k Norwegian and ~67k Danish social media posts, each classified as either 'attack' or 'not attack', which makes it a text-to-text model adapted to perform classification. The model is described in Danish in [this report](https://strapi.ogtal.dk/uploads/966f1ebcfa9942d3aef338e9920611f4.pdf).
|
|
|
|
|
- **Developed by:** The development team at Analyse & Tal |
|
- **Model type:** Language model restricted to classification |
|
- **Language(s) (NLP):** Danish and Norwegian |
|
- **License:** [More Information Needed] |
|
- **Finetuned from model:** [North-T5-NCC Large](https://huggingface.co/north/t5_large_NCC) |
|
|
|
|
|
# Direct Use |
|
This model can be used for classifying Danish and Norwegian social media posts or similar text. |
|
|
|
|
|
|
|
# Bias, Risks, and Limitations |
|
|
[More Information Needed] |
|
|
|
# Training Data |
|
A collection of ~70k Norwegian and ~67k Danish social media posts was manually annotated as 'attack' or 'not attack' by six individual coders. 5% of the posts were annotated by more than one annotator, and the annotators agreed on 83% of these annotations.
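
The agreement figure above can be illustrated with a small sketch. It assumes the 83% is raw percent agreement between two annotators on the doubly annotated posts, which the card does not state explicitly; the function name and the toy labels are hypothetical and not taken from the training data.

```python
def percent_agreement(labels_a, labels_b):
    """Share of doubly annotated posts on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same posts")
    if not labels_a:
        raise ValueError("No overlapping annotations to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical toy labels for five posts (assumption, for illustration only).
annotator_1 = ["attack", "not attack", "not attack", "attack", "not attack"]
annotator_2 = ["attack", "not attack", "attack", "attack", "not attack"]

# The annotators agree on four of the five posts.
print(f"{percent_agreement(annotator_1, annotator_2):.2f}")
```

Raw agreement does not correct for agreement expected by chance, so a chance-corrected statistic such as Cohen's kappa may be worth reporting alongside it.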
|
|
|
[More information needed on the data split method and the training-validation-test split.] |
|
|
|
|
|
# Evaluation |
|
|
## Testing Data, Factors & Metrics |
|
|
|
### Testing Data |
|
|
|
|
[More Information Needed] |
|
|
|
### Factors |
|
|
|
|
|
|
[More Information Needed] |
|
|
|
### Metrics |
|
|
|
Macro-averaged F1 score: 0.76
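
For readers unfamiliar with the metric, the sketch below shows how a macro-averaged F1 score is computed: F1 is calculated separately for each class and the per-class scores are averaged with equal weight. The function names and toy labels are hypothetical; this is not the evaluation code used for the reported 0.76.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one class, treating that class as the positive one."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


# Hypothetical toy labels (assumption, for illustration only).
y_true = ["attack", "attack", "not attack", "not attack", "not attack", "not attack"]
y_pred = ["attack", "not attack", "not attack", "not attack", "not attack", "attack"]

# Per-class F1 is 0.5 for 'attack' and 0.75 for 'not attack', so the macro average is 0.625.
print(macro_f1(y_true, y_pred))
```

Because both classes count equally, the macro average is sensitive to performance on the rarer class, which is useful when attacks are a minority of the posts.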
|
|
|
[More Information Needed] |
|
|
|
## Results |
|
|
|
[More Information Needed] |
|
|
|
### Summary |
|
|
|
|
|
|
|
|
|
# Environmental Impact |
|
|
|
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** [More Information Needed] |
|
- **Hours used:** [More Information Needed] |
|
- **Cloud Provider:** Azure |
|
- **Compute Region:** North Europe
|
- **Carbon Emitted:** [More Information Needed] |
|
|
|
|
|
# Model Card Authors |
|
This model card was written by the developer team at Analyse & Tal. Contact: oyvind@ogtal.dk. |
|
|
|
# How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Download/load tokenizer and language model
tokenizer = AutoTokenizer.from_pretrained("ogtal/A-og-ttack2")
model = AutoModelForSeq2SeqLM.from_pretrained("ogtal/A-og-ttack2")

# Give sample text. The example is from a social media comment.
sample_text = "Velbekomme dit klamme usle løgnersvin!"
input_ids = tokenizer(sample_text, return_tensors="pt").input_ids

# Forward pass and print the output
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
Running the above code prints "angreb" ("attack" in Danish).
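
To classify several posts at once, the same calls can be wrapped in small helpers. This is a minimal sketch, not part of the released model: `classify_batch`, `is_attack` and the `max_new_tokens=10` limit are hypothetical choices, and only the "angreb" label is documented in this card, so any other output is simply treated as "not an attack".

```python
def is_attack(decoded_label):
    """Map a decoded model output to a boolean.

    Assumption: only the "angreb" label is documented in this card,
    so every other output string counts as "not an attack".
    """
    return decoded_label.strip().lower() == "angreb"


def classify_batch(texts, tokenizer, model, max_new_tokens=10):
    """Generate one label string per input text.

    `tokenizer` and `model` are expected to be the objects returned by
    AutoTokenizer.from_pretrained and AutoModelForSeq2SeqLM.from_pretrained.
    The max_new_tokens value is a guess (assumption), chosen to fit short labels.
    """
    # Pad to a common length so the posts can be processed as one batch.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the `tokenizer` and `model` loaded as in the snippet above, `classify_batch(posts, tokenizer, model)` returns one decoded label per post, and `is_attack` turns each label into a boolean.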
|
|
|
|