---
language:
- en
- nl
- fr
- it
- hu
metrics:
- f1
library_name: transformers
pipeline_tag: text-classification
---

# xlm-roberta-large-pooled-long-cap

## Model description

An `xlm-roberta-large` model fine-tuned on multilingual training data containing longer documents (>= 512 tokens) labelled with [major topic codes](https://www.comparativeagendas.net/pages/master-codebook) from the [Comparative Agendas Project](https://www.comparativeagendas.net/).

### Fine-tuning procedure

`xlm-roberta-large-pooled-long-cap` was fine-tuned using the Hugging Face Trainer with a batch size of 8, a learning rate of 5e-06, and a maximum sequence length of 512. Early stopping was implemented with a patience of 2 epochs; a minimal configuration sketch is shown after the training-data figures below. The training data consisted of 160,808 texts that were longer than 512 tokens after tokenization with the `xlm-roberta-large` tokenizer. The domain and language shares of the training data are the following:

![Training data distribution](figures/longformer_2024_02_28_train_dist.jpg)

![Training data distribution](figures/longformer_2024_02_28_train_hm.jpg)

![Training data distribution](figures/longformer_2024_02_28_train_label_counts.jpg)
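The exact training script is not published in this repository; the snippet below is only a sketch of how the reported hyperparameters (batch size 8, learning rate 5e-06, maximum sequence length 512, early stopping with a patience of 2 epochs) map onto the Hugging Face `Trainer` API. The dataset loading, column names, number of epochs, and label count are illustrative assumptions rather than the authors' actual code.

```python
# Sketch only: the hyperparameters come from the model card; dataset handling,
# column names, the epoch cap, and the label count are assumptions for illustration.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

NUM_LABELS = 22  # the 22 categories listed in the performance table below

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=NUM_LABELS
)

def tokenize(batch):
    # Documents are truncated to the 512-token limit of xlm-roberta-large.
    return tokenizer(batch["text"], truncation=True, max_length=512)

# train_dataset / eval_dataset are assumed to be Hugging Face Datasets objects
# with "text" and "label" columns, e.g.:
# train_dataset = train_dataset.map(tokenize, batched=True)
# eval_dataset = eval_dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="xlm-roberta-large-pooled-long-cap",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-6,
    num_train_epochs=10,                 # assumed upper bound; early stopping ends training sooner
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=training_args,
    # train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
# trainer.train()
```

With `load_best_model_at_end=True`, the checkpoint with the best validation loss is restored when early stopping triggers.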
## Model performance

The model was evaluated on a held-out test set of 40,203 examples.

Metrics:

| Category | precision | recall | f1-score | support |
|-----------------------|-----------|--------|----------|---------|
| Macroeconomics | 0.7573 | 0.7872 | 0.7720 | 5010 |
| Civil Rights | 0.6608 | 0.6566 | 0.6587 | 1427 |
| Health | 0.8407 | 0.8328 | 0.8368 | 1908 |
| Agriculture | 0.7910 | 0.8402 | 0.8149 | 1514 |
| Labor | 0.6608 | 0.6834 | 0.6719 | 1491 |
| Education | 0.8396 | 0.8796 | 0.8591 | 1470 |
| Environment | 0.7893 | 0.7901 | 0.7897 | 1029 |
| Energy | 0.7918 | 0.7865 | 0.7891 | 904 |
| Immigration | 0.7907 | 0.7640 | 0.7771 | 445 |
| Transportation | 0.8413 | 0.8413 | 0.8413 | 1210 |
| Law and Crime | 0.7715 | 0.7670 | 0.7692 | 2712 |
| Social Welfare | 0.6768 | 0.7070 | 0.6915 | 1232 |
| Housing | 0.7205 | 0.6948 | 0.7075 | 1065 |
| Domestic Commerce | 0.7277 | 0.7522 | 0.7398 | 2054 |
| Defense | 0.8839 | 0.8314 | 0.8569 | 1566 |
| Technology | 0.8536 | 0.8212 | 0.8371 | 951 |
| Foreign Trade | 0.7929 | 0.7512 | 0.7715 | 418 |
| International Affairs | 0.7961 | 0.8505 | 0.8224 | 4140 |
| Government Operations | 0.7746 | 0.7120 | 0.7420 | 6204 |
| Public Lands | 0.6109 | 0.6970 | 0.6511 | 901 |
| Culture | 0.7748 | 0.6062 | 0.6802 | 386 |
| No Policy Content | 0.8415 | 0.7992 | 0.8198 | 2166 |

Average metrics:

| | precision | recall | f1-score | support |
|-----------------------|-----------|--------|----------|---------|
| macro avg | 0.7736 | 0.7736 | 0.7736 | 40203 |
| weighted avg | **0.7747** | **0.7736** | **0.7735** | 40203 |

The overall accuracy is **0.7736**.

## How to use the model

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
pipe = pipeline(
    model="poltextlab/xlm-roberta-large-pooled-long-cap",
    tokenizer=tokenizer,
    use_fast=False,
)

text = "We will place an immediate 6-month halt on the finance driven closure of beds and wards, and set up an independent audit of needs and facilities."
pipe(text)
```

## Debugging and issues

This architecture uses the `sentencepiece` tokenizer. To run the model with `transformers` versions earlier than 4.27, you need to install `sentencepiece` manually.
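Assuming a standard pip-based environment, this typically amounts to:

```bash
pip install sentencepiece
```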