	---
	language:
	- 'da'
	- 'no'
	library_name: transformers
	f1-score: 0.76
	---
	# Model Card for A&ttack2

	Text classification model that determines whether or not a short text contains an attack.


	# Model Description

	The model is based on [North-T5-NCC Large](https://huggingface.co/north/t5_large_NCC) (developed by Per E. Kummervold), a Scandinavian language model built upon [T5](https://github.com/google-research/text-to-text-transfer-transformer) and [T5X](https://github.com/google-research/t5x). It was further trained on ~70k Norwegian and ~67k Danish social media posts, each classified as either 'attack' or 'not attack', which adapts the text-to-text model to a classification task. The model is described in Danish in [this report](https://strapi.ogtal.dk/uploads/966f1ebcfa9942d3aef338e9920611f4.pdf).


	- Developed by: The development team at Analyse & Tal
	- Model type: Language model restricted to classification
	- Language(s) (NLP): Danish and Norwegian
	- License: [More Information Needed]
	- Finetuned from model: [North-T5-NCC Large](https://huggingface.co/north/t5_large_NCC)


	# Direct Use
	This model can be used for classifying Danish and Norwegian social media posts or similar text.



	# Bias, Risks, and Limitations
	<!-- This section is meant to convey both technical and sociotechnical limitations. -->
	[More Information Needed]

	# Training Data
	A collection of ~70k Norwegian and ~67k Danish social media posts has been manually annotated as 'attack' or 'not attack' by six individual coders. 5% of the posts were annotated by more than one annotator, and the annotators agreed on 83% of those annotations.
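For readers who want to reproduce an agreement figure on their own doubly annotated data, the following is a minimal sketch of simple percent agreement between two annotators. This card does not state how the 83% figure was computed, so the choice of raw percent agreement is an assumption, and the function name and the toy labels are hypothetical.

```python
from typing import Sequence


def percent_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Share of items on which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same items.")
    if not labels_a:
        raise ValueError("At least one item is required.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy labels for illustration only; they are not taken from the training data.
annotator_1 = ["attack", "not attack", "not attack", "attack", "not attack", "attack"]
annotator_2 = ["attack", "not attack", "attack", "attack", "not attack", "attack"]

agreement = percent_agreement(annotator_1, annotator_2)
print(f"{agreement:.2%}")
```

Raw percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa may be more informative when the classes are imbalanced.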

	[More information needed on the data split method and the training-validation-test split.]


	# Evaluation
	<!-- This section describes the evaluation protocols and provides the results. -->
	## Testing Data, Factors & Metrics

	### Testing Data
	<!-- This should link to a Data Card if possible. -->

	[More Information Needed]

	### Factors

	<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

	[More Information Needed]

	### Metrics

	Macro-averaged f1-score: 0.76
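As a reminder of what the metric measures, here is a small sketch of macro-averaged F1 for the two classes: F1 is computed separately for 'attack' and 'not attack', and the two scores are averaged with equal weight. The function names and the toy labels are hypothetical and are not the evaluation data behind the reported 0.76.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives means precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Toy labels for illustration only.
y_true = ["attack", "attack", "not attack", "not attack", "not attack"]
y_pred = ["attack", "not attack", "not attack", "not attack", "attack"]

score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Because both classes count equally, macro averaging penalizes poor performance on the rarer class more than accuracy would.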

	[More Information Needed]

	## Results

	[More Information Needed]

	### Summary




	# Environmental Impact

	<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

	Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

	- Hardware Type: [More Information Needed]
	- Hours used: [More Information Needed]
	- Cloud Provider: Azure
	- Compute Region: North-Europe
	- Carbon Emitted: [More Information Needed]


	# Model Card Authors
	This model card was written by the developer team at Analyse & Tal. Contact: oyvind@ogtal.dk.

	# How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	# Download/load tokenizer and language model
	tokenizer = AutoTokenizer.from_pretrained("ogtal/A-og-ttack2")
	model = AutoModelForSeq2SeqLM.from_pretrained("ogtal/A-og-ttack2")

	# Give sample text. The example is from a social media comment.
	sample_text = "Velbekomme dit klamme usle løgnersvin!"
	input_ids = tokenizer(sample_text, return_tensors="pt").input_ids

	# Forward pass and print the output
	outputs = model.generate(input_ids)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	Running the code above prints "angreb" ("attack" in Danish).
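To classify several posts at once, the single-text example can be extended into a small batch helper. The sketch below is one possible way to do this, not the developers' own pipeline. It assumes that any decoded output other than "angreb" means 'not attack' (this card does not document the label the model emits for the negative class), and the helper names and the `max_new_tokens` value are hypothetical choices.

```python
from typing import List, Sequence

ATTACK_LABEL = "angreb"  # the output this card reports for an attack


def is_attack(decoded: str) -> bool:
    """Map a decoded model output to a boolean.

    Assumption: every output other than "angreb" counts as 'not attack'.
    """
    return decoded.strip().lower() == ATTACK_LABEL


def classify_batch(texts: Sequence[str], tokenizer, model, max_new_tokens: int = 8) -> List[bool]:
    """Classify a batch of texts with an already loaded tokenizer and model."""
    # Pad to a common length so the batch fits in one tensor; truncate long posts.
    encoded = tokenizer(list(texts), return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_new_tokens=max_new_tokens,  # labels are short; 8 is an assumed upper bound
    )
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return [is_attack(label) for label in decoded]


def load(model_id: str = "ogtal/A-og-ttack2"):
    """Load the tokenizer and model; the import is deferred until this is called."""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model
```

A typical call would be `tokenizer, model = load()` followed by `classify_batch(["Velbekomme dit klamme usle løgnersvin!"], tokenizer, model)`. Passing the attention mask to `generate` matters once inputs are padded to different lengths.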