---
license: mit
tags:
- zero-shot-classification
- text-classification
- pytorch
metrics:
- accuracy
- f1-score
---
# MODEL_NAME
## Model description
An `xlm-roberta-large` model fine-tuned on training data labeled with [major topic codes](https://www.comparativeagendas.net/pages/master-codebook) from the [Comparative Agendas Project](https://www.comparativeagendas.net/).
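
For a quick single-prediction check, a minimal sketch using the `transformers` pipeline API (this assumes the checkpoint is public; the pipeline returns generic `LABEL_i` indices, which map to CAP major topic codes via the `CAP_NUM_DICT` shown below, and if loading raises a size-mismatch error, see the Debugging section at the end of this card):
```python
from transformers import pipeline

# hypothetical quick start; 'poltextlab/MODEL_NAME' is the placeholder used throughout this card
classifier = pipeline("text-classification", model="poltextlab/MODEL_NAME")
print(classifier("The parliament debated the new healthcare bill."))
```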

## How to use the model
#### Loading and tokenizing input data
```python
import pandas as pd
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# mapping from the model's label indices to CAP major topic codes
CAP_NUM_DICT = {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6',
                6: '7', 7: '8', 8: '9', 9: '10', 10: '12', 11: '13', 12: '14',
                13: '15', 14: '16', 15: '17', 16: '18', 17: '19', 18: '20',
                19: '21', 20: '23', 21: '999'}

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
num_labels = len(CAP_NUM_DICT)

def tokenize_dataset(data: pd.DataFrame):
    tokenized = tokenizer(data["text"],
                          max_length=MAXLEN,
                          truncation=True,
                          padding="max_length")
    return tokenized

hg_data = Dataset.from_pandas(data)
dataset = hg_data.map(tokenize_dataset, batched=True, remove_columns=hg_data.column_names)
```
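
The snippet above assumes a pandas DataFrame `data` with a `text` column, plus user-defined `MAXLEN` and `BATCH` constants (the latter is used in the inference snippet below). A minimal sketch of that setup, with illustrative values:
```python
# Hypothetical inputs for the snippets in this card; adjust to your own data.
data = pd.DataFrame({"text": ["The parliament debated the new healthcare bill."]})
MAXLEN = 256  # maximum sequence length in tokens
BATCH = 8     # per-device batch size
```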

#### Inference using the Trainer class
```python
model = AutoModelForSequenceClassification.from_pretrained('poltextlab/MODEL_NAME',
                                                           num_labels=num_labels,
                                                           problem_type="multi_label_classification",
                                                           ignore_mismatched_sizes=True
                                                           )

training_args = TrainingArguments(
    output_dir='.',
    per_device_train_batch_size=BATCH,
    per_device_eval_batch_size=BATCH
)

trainer = Trainer(
    model=model,
    args=training_args
)

# take the argmax over the model outputs and map it back to CAP codes
probs = trainer.predict(test_dataset=dataset).predictions
predicted = pd.DataFrame(np.argmax(probs, axis=1)).replace({0: CAP_NUM_DICT}).rename(
    columns={0: 'predicted'}).reset_index(drop=True)
```
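
The `predictions` returned by `trainer.predict()` are raw logits. If you also want per-class probabilities rather than just the argmax label, a minimal sketch in plain NumPy (variable names follow the snippet above):
```python
# row-wise softmax over the logits; each row then sums to 1
exp = np.exp(probs - probs.max(axis=1, keepdims=True))
class_probs = exp / exp.sum(axis=1, keepdims=True)
```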

### Fine-tuning procedure
`MODEL_NAME` was fine-tuned using the Hugging Face Trainer class with the following hyperparameters:
```python
training_args = TrainingArguments(
    output_dir=f"../model/{model_dir}/tmp/",
    logging_dir=f"../logs/{model_dir}/",
    logging_strategy='epoch',
    num_train_epochs=10,
    per_device_train_batch_size=args.batch,
    per_device_eval_batch_size=args.batch,
    learning_rate=args.lr,
    seed=42,
    save_strategy='epoch',
    evaluation_strategy='epoch',
    save_total_limit=1,
    load_best_model_at_end=True
)
```
We also used an `EarlyStoppingCallback` during training, with a patience of 2 epochs; a sketch of how such a callback is attached follows.
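
A minimal sketch of attaching the callback (the dataset names are placeholders; note that `EarlyStoppingCallback` also requires `load_best_model_at_end=True` and a `metric_for_best_model` in the training arguments):
```python
from transformers import EarlyStoppingCallback, Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,  # placeholder: tokenized training split
    eval_dataset=eval_data,    # placeholder: tokenized validation split
    # stop if the monitored metric fails to improve for 2 evaluation rounds
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)
```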

## Model performance
The model was evaluated on a test set of NUM_TEST_SET examples (10% of the available data).
Model accuracy is **0.83**.
METRICS_TABLE

## Inference platform
This model is used by the [CAP Babel Machine](https://babel.poltextlab.com), an open-source and free natural language processing tool designed to simplify and speed up projects for comparative research.

## Cooperation
Model performance can be significantly improved by extending our training sets. We appreciate every submission of CAP-coded corpora (of any domain and language) sent to poltextlab{at}poltextlab{dot}com or submitted via the [CAP Babel Machine](https://babel.poltextlab.com).

## Debugging and issues
This architecture uses the `sentencepiece` tokenizer. To run the model with `transformers` versions earlier than 4.27, you need to install `sentencepiece` manually, for example:
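```
pip install sentencepiece
```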

If you encounter a `RuntimeError` when loading the model with the `from_pretrained()` method, passing `ignore_mismatched_sizes=True` (as shown in the inference snippet above) should resolve the issue.