DanielCano committed on
Commit
67c3ab6
1 Parent(s): 2592313

Upload README.md

Files changed (1)
  1. README.md +186 -0
README.md ADDED
@@ -0,0 +1,186 @@
---
widget:
- text: "El dólar se dispara tras la reunión de la Fed"
---

# Spanish News Classification Headlines

SNCH: this model was developed by [M47Labs](https://www.m47labs.com/es/) for text classification of Spanish news headlines. The base model is [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased), fine-tuned on a dataset of 1,000 examples.

## Dataset Sample

Dataset size: 1000

Columns: `idTask`, `task content 1`, `idTag`, `tag`.

|idTask|task content 1|idTag|tag|
|------|------|------|------|
|3637d9ac-119c-4a8f-899c-339cf5b42ae0|Alcalá de Guadaíra celebra la IV Semana de la Diversidad Sexual con acciones de sensibilización|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
|d56bab52-0029-45dd-ad90-5c17d4ed4c88|El Archipiélago Chinijo Graciplus se impone en el Trofeo Centro Comercial Rubicón|ed198b6d-a5b9-4557-91ff-c0be51707dec|deportes|
|dec70bc5-4932-4fa2-aeac-31a52377be02|Un total de 39 personas padecen ELA actualmente en la provincia|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
|fb396ba9-fbf1-4495-84d9-5314eb731405|Eurocopa 2021 : Italia vence a Gales y pasa a octavos con su candidatura reforzada|ed198b6d-a5b9-4557-91ff-c0be51707dec|deportes|
|bc5a36ca-4e0a-422e-9167-766b41008c01|Resolución de 10 de junio de 2021, del Ayuntamiento de Tarazona de La Mancha (Albacete), referente a la convocatoria para proveer una plaza.|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
|a87f8703-ce34-47a5-9c1b-e992c7fe60f6|El primer ministro sueco pierde una moción de censura|209ae89e-55b4-41fd-aac0-5400feab479e|politica|
|d80bdaad-0ad5-43a0-850e-c473fd612526|El dólar se dispara tras la reunión de la Fed|11925830-148e-4890-a2bc-da9dc059dc17|economia|

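Before fine-tuning, the `tag` column has to be turned into integer class ids. The following is a minimal sketch of that step, assuming the dataset is exported as a CSV file with the four columns listed above. The in-memory sample `SAMPLE_CSV` and the helpers `load_rows` and `build_label_maps` are hypothetical names used for illustration, and the sorted id order is an assumption that may not match the ids stored in the published model.

```python
import csv
import io

# Hypothetical CSV export with the columns listed above (two rows from the sample).
SAMPLE_CSV = """idTask,task content 1,idTag,tag
a87f8703-ce34-47a5-9c1b-e992c7fe60f6,El primer ministro sueco pierde una moción de censura,209ae89e-55b4-41fd-aac0-5400feab479e,politica
d80bdaad-0ad5-43a0-850e-c473fd612526,El dólar se dispara tras la reunión de la Fed,11925830-148e-4890-a2bc-da9dc059dc17,economia
"""


def load_rows(csv_text):
    """Return (headline, tag) pairs from CSV text with the columns above."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["task content 1"], row["tag"]) for row in reader]


def build_label_maps(tags):
    """Assign ids to tags in sorted order (an assumption, see the note above)."""
    label2id = {tag: i for i, tag in enumerate(sorted(set(tags)))}
    id2label = {i: tag for tag, i in label2id.items()}
    return label2id, id2label


rows = load_rows(SAMPLE_CSV)
label2id, id2label = build_label_maps(tag for _, tag in rows)
print(label2id)
```

Headlines that contain commas would need to be quoted in a real CSV export; `csv.DictReader` handles quoted fields without extra code.
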
## Labels

* ciencia_tecnologia
* clickbait
* cultura
* deportes
* economia
* educacion
* medio_ambiente
* opinion
* politica
* sociedad

## Example of Use

### Pipeline

```python
from transformers import AutoTokenizer, BertForSequenceClassification, TextClassificationPipeline

# Headline-style input to classify (Spanish, as the model expects)
review_text = 'los vehiculos que esten esperando pasajaeros deberan estar apagados para reducir emisiones'
path = "M47Labs/spanish_news_classification_headlines"

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(path)
model = BertForSequenceClassification.from_pretrained(path)

# Wrap both in a text-classification pipeline
nlp = TextClassificationPipeline(task="text-classification",
                                 model=model,
                                 tokenizer=tokenizer)

print(nlp(review_text))
```

```[{'label': 'medio_ambiente', 'score': 0.5648820996284485}]```

### PyTorch

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'M47Labs/spanish_news_classification_headlines'
MAX_LEN = 32  # same maximum sequence length used for fine-tuning

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

texto = "las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno"

# Tokenize, padding or truncating to MAX_LEN tokens
encoded_review = tokenizer(
    texto,
    max_length=MAX_LEN,
    add_special_tokens=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt',
)

input_ids = encoded_review['input_ids']
attention_mask = encoded_review['attention_mask']

# Forward pass without gradient tracking
with torch.no_grad():
    output = model(input_ids, attention_mask=attention_mask)

# The index of the highest logit is the predicted class id
prediction = torch.argmax(output.logits, dim=1)
print(f'Review text: {texto}')
print(f'Predicted label: {model.config.id2label[prediction.item()]}')
```

```Review text: las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno```

```Predicted label: medio_ambiente```

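The pipeline output includes a `score`, while taking the arg-max of the logits only yields a label. For a single-label classifier with several classes, the pipeline score is, by default, the softmax probability of the winning class. The sketch below shows that conversion in plain Python; the three logits are made-up values for illustration, not real model output, and only three of the ten labels are shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three of the ten labels; not real model output.
labels = ["economia", "medio_ambiente", "politica"]
logits = [0.2, 1.9, 0.4]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 3))
```

With a real model, the same result can be obtained with `torch.softmax(output.logits, dim=1)` on the logits tensor.
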
A more in-depth example of how to use the model can be found in this Colab notebook: https://colab.research.google.com/drive/1XsKea6oMyEckye2FePW_XN7Rf8v41Cw_?usp=sharing

## Finetune Hyperparameters

* MAX_LEN = 32
* TRAIN_BATCH_SIZE = 8
* VALID_BATCH_SIZE = 4
* EPOCHS = 5
* LEARNING_RATE = 1e-05

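These settings determine how many optimizer updates a run performs. As a rough sketch, assuming one update per batch and that the last, possibly smaller, batch is kept, the count can be computed as follows. The training-set size of 800 is a hypothetical value, since the README does not state how the data was split between training and validation, and `optimizer_steps` is an illustrative helper name.

```python
import math

TRAIN_BATCH_SIZE = 8
EPOCHS = 5


def optimizer_steps(n_train_examples, batch_size=TRAIN_BATCH_SIZE, epochs=EPOCHS):
    """Total optimizer updates, assuming one update per batch and a kept partial batch."""
    steps_per_epoch = math.ceil(n_train_examples / batch_size)
    return steps_per_epoch * epochs


# Hypothetical training-set size; the actual train/validation split is not documented.
print(optimizer_steps(800))
```
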
## Train Results

|n_example|epoch|loss|acc (%)|
|------|------|------|------|
|100|0|2.286327266693115|12.5|
|100|1|2.018876111507416|40.0|
|100|2|1.8016730904579163|43.75|
|100|3|1.6121837735176086|46.25|
|100|4|1.41565443277359|68.75|

|n_example|epoch|loss|acc (%)|
|------|------|------|------|
|500|0|2.0770938420295715|24.5|
|500|1|1.6953029704093934|50.25|
|500|2|1.258900796175003|64.25|
|500|3|0.8342628020048142|78.25|
|500|4|0.5135736921429634|90.25|

|n_example|epoch|loss|acc (%)|
|------|------|------|------|
|1000|0|1.916002897115854|36.1997226074896|
|1000|1|1.2941598492664295|62.2746185852982|
|1000|2|0.8201534710415117|76.97642163661581|
|1000|3|0.524806430051615|86.9625520110957|
|1000|4|0.30662027455784463|92.64909847434119|

## Validation Results

|n_examples|100|
|------|------|
|Accuracy Score|0.35|
|Precision (Macro)|0.35|
|Recall (Macro)|0.16|

|n_examples|500|
|------|------|
|Accuracy Score|0.62|
|Precision (Macro)|0.60|
|Recall (Macro)|0.47|

|n_examples|1000|
|------|------|
|Accuracy Score|0.68|
|Precision (Macro)|0.68|
|Recall (Macro)|0.64|

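Macro-averaged precision and recall give every class the same weight, so a class the model rarely or never predicts correctly pulls the average down even when overall accuracy looks acceptable. The sketch below computes both metrics from scratch under the common convention that an undefined per-class value counts as 0. The toy labels are invented for illustration and are not the model's validation data, and the exact evaluation code used for the tables above is not given in the README.

```python
def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        # A class with no predictions (or no true examples) contributes 0.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n


# Toy labels for illustration only; not the model's validation data.
y_true = ["deportes", "deportes", "politica", "economia"]
y_pred = ["deportes", "politica", "politica", "politica"]

precision, recall = macro_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(precision, 2), round(recall, 2))
```

In this toy case `economia` is never predicted, so it contributes 0 to both averages.
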
![Logo M47](https://media-exp1.licdn.com/dms/image/C4D0BAQHpfgjEyhtE1g/company-logo_200_200/0/1625210573748?e=1638403200&v=beta&t=toQNpiOlyim5Ja4f7Ejv8yKoCWifMsLWjkC7XnyXICI "Logo M47")