---
license: mit
language:
- multilingual
tags:
- zero-shot-classification
- text-classification
- pytorch
metrics:
- accuracy
- f1-score
extra_gated_prompt: 'Our models are intended for academic use only. If you are not
  affiliated with an academic institution, please provide a rationale for using our
  models.

  If you use our models for your work or research, please cite this paper: Sebők,
  M., Máté, Á., Ring, O., Kovács, V., & Lehoczki, R. (2024). Leveraging Open Large
  Language Models for Multilingual Policy Topic Classification: The Babel Machine
  Approach. Social Science Computer Review, 0(0). https://doi.org/10.1177/08944393241259434'
extra_gated_fields:
  Name: text
  Country: country
  Institution: text
  E-mail: text
  Use case: text
---
# xlm-roberta-large-parlspeech-cap-v3
## Model description
An `xlm-roberta-large` model fine-tuned on multilingual training data containing texts from the `parlspeech` domain, labelled with [major topic codes](https://www.comparativeagendas.net/pages/master-codebook) from the [Comparative Agendas Project](https://www.comparativeagendas.net/).

## How to use the model
#### Loading and tokenizing input data
```python
import pandas as pd
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Maps the model's output index to the CAP major topic code.
CAP_NUM_DICT = {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6',
                6: '7', 7: '8', 8: '9', 9: '10', 10: '12', 11: '13', 12: '14',
                13: '15', 14: '16', 15: '17', 16: '18', 17: '19', 18: '20',
                19: '21', 20: '23', 21: '999'}

# Assumption: this model card does not define MAXLEN; 256 is a placeholder.
MAXLEN = 256

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
num_labels = len(CAP_NUM_DICT)

def tokenize_dataset(batch):
    # With batched=True, `batch` is a dict mapping column names to lists.
    tokenized = tokenizer(batch["text"],
                          max_length=MAXLEN,
                          truncation=True,
                          padding="max_length")
    return tokenized

# `data` stands for your own pandas DataFrame with a "text" column;
# the rows below are placeholders.
data = pd.DataFrame({"text": ["First speech to classify.",
                              "Second speech to classify."]})

hg_data = Dataset.from_pandas(data)
dataset = hg_data.map(tokenize_dataset, batched=True, remove_columns=hg_data.column_names)
```

#### Inference using the Trainer class
```python
model = AutoModelForSequenceClassification.from_pretrained('poltextlab/xlm-roberta-large-parlspeech-cap-v3',
                                                           num_labels=22,
                                                           problem_type="multi_label_classification",
                                                           ignore_mismatched_sizes=True
                                                           )

training_args = TrainingArguments(
    output_dir='.',
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8
)

trainer = Trainer(
    model=model,
    args=training_args
)

# `predictions` holds the raw class scores (logits), one row per input text.
logits = trainer.predict(test_dataset=dataset).predictions
# Pick the highest-scoring class index and map it to its CAP code.
predicted = pd.DataFrame(np.argmax(logits, axis=1)).replace({0: CAP_NUM_DICT}).rename(
    columns={0: 'predicted'}).reset_index(drop=True)
```

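The values returned by `trainer.predict(...).predictions` are raw scores (logits), not probabilities. If a confidence value is wanted alongside the predicted code, one option is a softmax over each row. The sketch below is illustrative only: it uses the standard library instead of NumPy, a three-entry excerpt of the label map, and made-up logits, and it assumes the classes are mutually exclusive, which the argmax step already presumes. The helper names `softmax` and `top_prediction` are hypothetical.

```python
import math

# Excerpt of the label map from the usage snippet; the full dict has 22 entries.
CAP_NUM_DICT = {0: '1', 1: '2', 2: '3'}


def softmax(logits):
    """Turn one row of raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, label_map):
    """Return (CAP code, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return label_map[best], probs[best]


# Made-up logits for a single text, for illustration only.
code, confidence = top_prediction([0.2, 2.5, -1.0], CAP_NUM_DICT)
print(code, round(confidence, 3))
```

With real model output, the same computation can be applied row by row to the logits array.
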
### Fine-tuning procedure
`xlm-roberta-large-parlspeech-cap-v3` was fine-tuned using the Hugging Face `Trainer` class with the following hyperparameters:
```python
training_args = TrainingArguments(
    output_dir=f"../model/{model_dir}/tmp/",
    logging_dir=f"../logs/{model_dir}/",
    logging_strategy='epoch',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-06,
    seed=42,
    save_strategy='epoch',
    evaluation_strategy='epoch',
    save_total_limit=1,
    load_best_model_at_end=True
)
```
We also used an `EarlyStoppingCallback` with a patience of 2 epochs.

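To make the effect of a patience of 2 concrete, here is a minimal pure-Python sketch of the patience rule. It is a simplified stand-in, not the library's implementation: it assumes the monitored metric is the evaluation loss (lower is better) with no improvement threshold, and both the function name `stopping_epoch` and the loss values are hypothetical.

```python
def stopping_epoch(eval_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None.

    The counter grows whenever the evaluation loss fails to beat the best
    value seen so far and resets on every improvement.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical per-epoch evaluation losses, not taken from the actual run.
losses = [0.90, 0.70, 0.72, 0.71, 0.69]
print(stopping_epoch(losses))  # 4
```

In this example, training stops after epoch 4 because epochs 3 and 4 both fail to improve on the best loss from epoch 2. With `load_best_model_at_end=True`, the checkpoint from the best epoch is the one that is kept.
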
## Model performance
The model was evaluated on a test set of 185,409 examples (10% of the available data).<br>
Model accuracy is **0.81**.

| label        |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.71 |     0.77 |       0.74 |     12681 |
| 1            |        0.69 |     0.60 |       0.64 |      4546 |
| 2            |        0.82 |     0.84 |       0.83 |      7290 |
| 3            |        0.80 |     0.69 |       0.74 |      5344 |
| 4            |        0.69 |     0.69 |       0.69 |      6194 |
| 5            |        0.79 |     0.88 |       0.83 |      6294 |
| 6            |        0.81 |     0.71 |       0.76 |      4244 |
| 7            |        0.72 |     0.81 |       0.76 |      2972 |
| 8            |        0.69 |     0.82 |       0.75 |      4087 |
| 9            |        0.75 |     0.77 |       0.76 |      7177 |
| 10           |        0.79 |     0.70 |       0.74 |      8152 |
| 11           |        0.75 |     0.72 |       0.74 |      5312 |
| 12           |        0.68 |     0.72 |       0.70 |      4001 |
| 13           |        0.69 |     0.65 |       0.67 |      5621 |
| 14           |        0.83 |     0.75 |       0.79 |      4102 |
| 15           |        0.82 |     0.64 |       0.72 |      3285 |
| 16           |        0.70 |     0.30 |       0.42 |      1811 |
| 17           |        0.72 |     0.75 |       0.73 |      8682 |
| 18           |        0.65 |     0.76 |       0.70 |     15644 |
| 19           |        0.58 |     0.54 |       0.56 |      3658 |
| 20           |        0.75 |     0.67 |       0.71 |      1503 |
| 21           |        0.97 |     0.96 |       0.96 |     62809 |
| macro avg    |        0.75 |     0.72 |       0.72 |    185409 |
| weighted avg |        0.81 |     0.81 |       0.81 |    185409 |

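As a reading aid for the table, the sketch below shows how the f1-score column relates to precision and recall, and how the macro average is an unweighted mean over the 22 labels. It works from the rounded values printed in the table, so agreement with the reported figures is only expected to two decimals; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Label 16 has the lowest recall in the table (precision 0.70, recall 0.30).
label_16_f1 = f1_score(0.70, 0.30)

# Per-label f1-scores copied from the table, labels 0 through 21.
per_label_f1 = [0.74, 0.64, 0.83, 0.74, 0.69, 0.83, 0.76, 0.76, 0.75, 0.76,
                0.74, 0.74, 0.70, 0.67, 0.79, 0.72, 0.42, 0.73, 0.70, 0.56,
                0.71, 0.96]

# The macro average weights every label equally, regardless of support.
macro_f1 = sum(per_label_f1) / len(per_label_f1)

print(round(label_16_f1, 2), round(macro_f1, 2))  # 0.42 0.72
```

The weighted average (0.81) is higher than the macro average (0.72) because label 21, which has by far the largest support, also has the highest scores.
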
## Inference platform
This model is used by the [CAP Babel Machine](https://babel.poltextlab.com), a free and open-source natural language processing tool designed to simplify and speed up comparative research projects.

## Cooperation
Model performance can be significantly improved by extending our training sets. We appreciate every submission of CAP-coded corpora (of any domain and language), either sent to poltextlab{at}poltextlab{dot}com or uploaded through the [CAP Babel Machine](https://babel.poltextlab.com).

## Debugging and issues
This architecture uses the `sentencepiece` tokenizer. With `transformers` versions earlier than 4.27, you need to install `sentencepiece` manually.

If you encounter a `RuntimeError` when loading the model with the `from_pretrained()` method, adding `ignore_mismatched_sizes=True` should solve the issue.
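
A quick way to check whether `sentencepiece` is available before loading the tokenizer is sketched below. It relies only on the standard library; the helper name `has_module` is hypothetical, and the printed `pip` command assumes a pip-based environment.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


if not has_module("sentencepiece"):
    print("sentencepiece is missing; install it with: pip install sentencepiece")
```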