	---
	language:
	- no
	- da
	library_name: transformers
	metrics:
	- f1-score (Danish): 0.87
	- f1-score (Norwegian): 0.76
	---
	# Model Card for A&ttack2

	A text classification model for determining if a social media post in Danish or Norwegian contains a verbal attack.

	# Model Description

	The model is based on north/t5_large_scand (by Per E. Kummervold, not publicly available), a Scandinavian language model pretrained for 1,700,000 steps on a Scandinavian corpus (Bokmål, Nynorsk, Danish, Swedish and Icelandic, plus a small amount of Faroese), starting from the mT5 checkpoint.

	The model is fine-tuned for 20,000 steps with a batch size of 8. The data consists of ~70k Norwegian and ~67k Danish social media posts, each labelled as either 'verbal attack' or 'nothing', which makes this a text-to-text model restricted to classification. The model is described in Danish in [this report](https://www.ogtal.dk/assets/files/230403-Analyse-Tall-Angrep-hat-i-den-offentlige-debatten-paa-Facebook.pdf).


	- Developed by: The development team at Analyse & Tal
	- Model type: Language model restricted to classification
	- Language(s) (NLP): Danish and Norwegian
	- License: [More Information Needed]
	- Finetuned from model: north/t5_large_scand (by Per E. Kummervold, not publicly available)


	# Direct Use
	This model can be used for classifying Danish and Norwegian social media posts (or other texts) as either 'verbal attack' or 'nothing'.

	# Training Data
	A collection of ~70k Norwegian and ~67k Danish social media posts has been manually annotated as 'verbal attack' or 'nothing'. 5% of the posts were annotated by more than one annotator, and the annotators agreed on 83% of these annotations.
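
The agreement figure above can be read as simple percent agreement on the doubly annotated posts. The card does not say exactly how agreement was measured, so the sketch below is an assumption: the function name `percent_agreement` and the toy label pairs are hypothetical and only illustrate the arithmetic.

```python
def percent_agreement(label_pairs):
    """Share of doubly annotated posts where both annotators chose the same label."""
    if not label_pairs:
        raise ValueError("need at least one doubly annotated post")
    matches = sum(first == second for first, second in label_pairs)
    return matches / len(label_pairs)

# Toy data (made up): one (annotator A, annotator B) pair per post.
toy_pairs = [
    ("verbal attack", "verbal attack"),
    ("nothing", "nothing"),
    ("verbal attack", "nothing"),
    ("nothing", "nothing"),
]
agreement = percent_agreement(toy_pairs)
```

Percent agreement does not correct for agreement expected by chance, which matters when one label dominates the data.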

	The Norwegian data are split into 70% training, 20% validation and 10% test. The Danish data are split into 70% training, 15% validation and 15% test.
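
A split with these proportions could be produced as follows. This is a minimal sketch, not the authors' procedure: the card states only the proportions, so the helper `split_posts`, the random shuffling, the fixed seed and the placeholder post lists are all assumptions.

```python
import random

def split_posts(posts, train_frac, val_frac, seed=0):
    """Shuffle posts and split them into train, validation and test lists.

    The test set receives whatever remains after train and validation.
    """
    shuffled = list(posts)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Placeholder data (hypothetical); the fractions are the ones stated in the card.
norwegian = [f"no_post_{i}" for i in range(1000)]
danish = [f"da_post_{i}" for i in range(1000)]
no_train, no_val, no_test = split_posts(norwegian, 0.70, 0.20)
da_train, da_val, da_test = split_posts(danish, 0.70, 0.15)
```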

	# Evaluation Metrics

	- Macro-averaged F1 score for Danish data: 0.87
	- Macro-averaged F1 score for Norwegian data: 0.76
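
Macro averaging computes the F1 score separately for each class and then takes the unweighted mean, so the rarer class counts as much as the common one. The sketch below shows the calculation in plain Python. The label strings come from this card; the helper names and the toy predictions are made up for illustration and are not the evaluation code behind the reported scores.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score for a single class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)

# Toy example (made up): six posts with gold labels and predictions.
y_true = ["verbal attack", "nothing", "nothing", "verbal attack", "nothing", "nothing"]
y_pred = ["verbal attack", "nothing", "verbal attack", "nothing", "nothing", "nothing"]
score = macro_f1(y_true, y_pred, ["verbal attack", "nothing"])
```

In this toy case the per-class scores are 0.5 for 'verbal attack' and 0.75 for 'nothing', giving a macro-averaged F1 of 0.625.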


	# Model Card Authors
	This model card was written by the development team at Analyse & Tal. Contact: oyvind@ogtal.dk.

	# How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	# Download/load tokenizer and language model
	tokenizer = AutoTokenizer.from_pretrained("ogtal/A-og-ttack2")
	model = AutoModelForSeq2SeqLM.from_pretrained("ogtal/A-og-ttack2")

	# Give sample text. The example is from a social media comment.
	sample_text = "Velbekomme dit klamme usle løgnersvin!"
	input_ids = tokenizer(sample_text, return_tensors="pt").input_ids

	# Forward pass and print the output
	outputs = model.generate(input_ids)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	Running the above code prints "angreb" (Danish for "attack").
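
To classify several posts at once, the same calls can be wrapped in a small helper. This is a sketch under assumptions: `classify_posts` and its parameters are hypothetical names, the value of `max_new_tokens` is a guess at a sufficient length, and only the output string "angreb" is documented in this card, so any other decoded output is simply treated as "not an attack".

```python
def classify_posts(texts, tokenizer, model, attack_label="angreb", max_new_tokens=8):
    """Return a (text, decoded_label, is_attack) tuple for each input text.

    `tokenizer` and `model` are expected to be the objects loaded with
    AutoTokenizer and AutoModelForSeq2SeqLM for this model.
    """
    texts = list(texts)
    # Pad and truncate so that posts of different lengths fit into one batch.
    encoded = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**encoded, max_new_tokens=max_new_tokens)
    labels = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Assumption: only "angreb" is a documented output, so everything else
    # is reported as is_attack=False.
    return [
        (text, label, label.strip().lower() == attack_label)
        for text, label in zip(texts, labels)
    ]
```

With the tokenizer and model from the snippet above, `classify_posts([sample_text], tokenizer, model)` returns one tuple per post. Check the decoded label for non-attack and Norwegian inputs before relying on the boolean flag.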