manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1

Text Classification

	---
	license: mit
	---
	## Model description

	An xlm-roberta-large model fine-tuned on all ~1.8 million annotated statements contained in the [manifesto corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
	The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme ([Handbook 4](https://manifesto-project.wzb.eu/coding_schemes/mp_v4)).

	The context model variant additionally incorporates the surrounding sentences of a statement to improve the classification results for ambiguous sentences.
	During fine-tuning, we collected the sentences surrounding each statement and merged them with the statement itself. This larger context is passed as the second part of a sentence-pair input, with the statement as the first part.
	We limited the statement itself to 100 tokens and the context of the statement to 200 tokens.
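
To make the sentence-pair construction concrete, here is a minimal sketch of how a statement and its context could be assembled from a list of sentences. The helper name `build_context` and the `window` parameter are hypothetical. The card does not say how many neighbouring sentences were collected, only that the context was capped at 200 tokens, so the window size here is an assumption.

```python
from typing import List


def build_context(sentences: List[str], index: int, window: int = 1) -> str:
    """Join the sentence at `index` with up to `window` neighbours on each side.

    `window` is an illustrative assumption; the original fine-tuning setup
    only states that the context was limited to 200 tokens.
    """
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    # The statement itself stays inside the context, between its neighbours.
    return " ".join(sentences[start:end])


document = [
    "Human rights and international humanitarian law are fundamental pillars of a secure global system.",
    "These principles are under threat.",
    "Some of the world's most powerful states choose to sell arms to human-rights abusing states.",
]

sentence = document[1]
context = build_context(document, 1)
print(context)
```

The resulting `sentence` and `context` strings can then be passed to the tokenizer as the first and second segment of the pair. With `window=0` the context collapses to the sentence itself, which matches the fallback for sentences that have no additional context.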

	Important

	We slightly modified the classification head of the `XLMRobertaForSequenceClassification` model (removed the tanh activation and the intermediate linear layer), as that improved the model performance for this task considerably.
	To correctly load the full model, include the `trust_remote_code=True` argument when calling the `from_pretrained` method.
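
As a rough illustration of what such a simplified head could look like, the sketch below drops the intermediate dense layer and the tanh activation of the stock RoBERTa-style head, so the representation of the first token goes through dropout and then straight into a single linear projection. The class name `SimplifiedClassificationHead`, the dropout rate, and the choice to keep dropout at all are assumptions for illustration. The head that is actually used is defined in the repository's remote code.

```python
import torch
from torch import nn


class SimplifiedClassificationHead(nn.Module):
    """Hypothetical head: dropout followed by one linear layer (no dense + tanh)."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, sequence_length, hidden_size)
        x = features[:, 0, :]  # representation of the first (<s>) token
        x = self.dropout(x)
        return self.out_proj(x)  # (batch, num_labels)


# Small dimensions for a quick shape check; xlm-roberta-large uses hidden_size=1024.
head = SimplifiedClassificationHead(hidden_size=8, num_labels=56)
head.eval()
logits = head(torch.zeros(2, 5, 8))
print(logits.shape)
```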

	## How to use

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1", trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

	sentence = "These principles are under threat."
	context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."
	# For sentences without additional context, just use the sentence itself as the context.
	# Example: context = "These principles are under threat."


	inputs = tokenizer(
	    sentence,
	    context,
	    return_tensors="pt",
	    max_length=300,  # we limited the input to 300 tokens during fine-tuning
	    padding="max_length",
	    truncation=True,
	)

	logits = model(**inputs).logits

	probabilities = torch.softmax(logits, dim=1).tolist()[0]
	probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
	probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
	print(probabilities)
	# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66...

	predicted_class = model.config.id2label[logits.argmax().item()]
	print(predicted_class)
	# 201 - Freedom and Human Rights
	```
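
When only the few most likely topics matter, the full sorted dictionary can be replaced by a small top-k helper. The sketch below works on a plain list of logits and a label mapping, so it runs without the model. The function name `top_k_labels` and the toy labels are hypothetical and only serve the illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def top_k_labels(
    logits: Sequence[float], id2label: Dict[int, str], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k most probable labels with their probabilities in percent."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    probabilities = [value / total for value in exps]
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(id2label[index], round(prob * 100, 2)) for index, prob in ranked[:k]]


# Toy example with three made-up labels (not the model's real label set).
toy_labels = {0: "topic A", 1: "topic B", 2: "topic C"}
print(top_k_labels([2.0, 0.0, 1.0], toy_labels, k=2))
```

With the real model, the same helper could be called as `top_k_labels(logits[0].tolist(), model.config.id2label)`.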