DanielCano committed on
Commit
cc1bbd9
1 Parent(s): 4217bf2

Update README.md

Files changed (1)
  1. README.md +19 -57
README.md CHANGED
@@ -6,24 +6,24 @@ widget:
6
 
7
  # Spanish News Classification Headlines
8
 
9
- SNCH: this model was develop by [M47Labs](https://www.m47labs.com/es/) the goal is text classification, the base model use was [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased), it was fine-tuned on 1000 example dataset.
10
 
11
 
12
- ## Dataset Sample
13
 
14
  Dataset size : 1000
15
 
16
  Columns: idTask,task content 1,idTag,tag.
17
 
18
- |idTask|task content 1|idTag|tag|
19
- |------|------|------|------|
20
- |3637d9ac-119c-4a8f-899c-339cf5b42ae0|Alcalá de Guadaíra celebra la IV Semana de la Diversidad Sexual con acciones de sensibilización|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
21
- |d56bab52-0029-45dd-ad90-5c17d4ed4c88|El Archipiélago Chinijo Graciplus se impone en el Trofeo Centro Comercial Rubicón|ed198b6d-a5b9-4557-91ff-c0be51707dec|deportes|
22
- |dec70bc5-4932-4fa2-aeac-31a52377be02|Un total de 39 personas padecen ELA actualmente en la provincia|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
23
- |fb396ba9-fbf1-4495-84d9-5314eb731405|Eurocopa 2021 : Italia vence a Gales y pasa a octavos con su candidatura reforzada|ed198b6d-a5b9-4557-91ff-c0be51707dec|deportes|
24
- |bc5a36ca-4e0a-422e-9167-766b41008c01|Resolución de 10 de junio de 2021, del Ayuntamiento de Tarazona de La Mancha (Albacete), referente a la convocatoria para proveer una plaza.|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
25
- |a87f8703-ce34-47a5-9c1b-e992c7fe60f6|El primer ministro sueco pierde una moción de censura|209ae89e-55b4-41fd-aac0-5400feab479e|politica|
26
- |d80bdaad-0ad5-43a0-850e-c473fd612526|El dólar se dispara tras la reunión de la Fed|11925830-148e-4890-a2bc-da9dc059dc17|economia|
27
 
28
 
29
  ## Labels:
@@ -61,7 +61,7 @@ from transformers import AutoTokenizer, BertForSequenceClassification,TextClassi
61
 
62
 
63
  review_text = 'los vehiculos que esten esperando pasajaeros deberan estar apagados para reducir emisiones'
64
- path = "M47Labs/spanish_news_classification_headlines"
65
  tokenizer = AutoTokenizer.from_pretrained(path)
66
  model = BertForSequenceClassification.from_pretrained(path)
67
 
@@ -74,7 +74,7 @@ print(nlp(review_text))
74
 
75
  ```
76
 
77
- ```[{'label': 'medio_ambiente', 'score': 0.5648820996284485}]```
78
 
79
  ### Pytorch
80
 
@@ -84,7 +84,7 @@ import torch
84
  from transformers import AutoTokenizer, BertForSequenceClassification,TextClassificationPipeline
85
  import numpy as np
86
 
87
- model_name = 'M47Labs/spanish_news_classification_headlines'
88
  MAX_LEN = 32
89
 
90
 
@@ -119,7 +119,7 @@ print(f'Sentiment : {model.config.id2label[prediction.detach().cpu().numpy()[0]
119
  ```Review text: las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno```
120
 
121
 
122
- ```Sentiment : medio_ambiente```
123
 
124
 
125
  A more in depth example on how to use the model can be found in this colab notebook: https://colab.research.google.com/drive/1XsKea6oMyEckye2FePW_XN7Rf8v41Cw_?usp=sharing
@@ -134,53 +134,15 @@ A more in depth example on how to use the model can be found in this colab noteb
134
  * EPOCHS = 5
135
  * LEARNING_RATE = 1e-05
136
 
137
- ## Train Results
138
-
139
- |n_example|epoch|loss|acc|
140
- |------|------|------|------|
141
- |100|0|2.286327266693115|12.5|
142
- |100|1|2.018876111507416|40.0|
143
- |100|2|1.8016730904579163|43.75|
144
- |100|3|1.6121837735176086|46.25|
145
- |100|4|1.41565443277359|68.75|
146
-
147
- |n_example|epoch|loss|acc|
148
- |------|------|------|------|
149
- |500|0|2.0770938420295715|24.5|
150
- |500|1|1.6953029704093934|50.25|
151
- |500|2|1.258900796175003|64.25|
152
- |500|3|0.8342628020048142|78.25|
153
- |500|4|0.5135736921429634|90.25|
154
-
155
- |n_example|epoch|loss|acc|
156
- |------|------|------|------|
157
- |1000|0|1.916002897115854|36.1997226074896|
158
- |1000|1|1.2941598492664295|62.2746185852982|
159
- |1000|2|0.8201534710415117|76.97642163661581|
160
- |1000|3|0.524806430051615|86.9625520110957|
161
- |1000|4|0.30662027455784463|92.64909847434119|
162
 
163
  ## Validation Results
164
 
165
- |n_examples|100|
166
- |------|------|
167
- |Accuracy Score|0.35|
168
- |Precision (Macro)|0.35|
169
- |Recall (Macro)|0.16|
170
-
171
- |n_examples|500|
172
- |------|------|
173
- |Accuracy Score|0.62|
174
- |Precision (Macro)|0.60|
175
- |Recall (Macro)|0.47|
176
-
177
- |n_examples|1000|
178
  |------|------|
179
- |Accuracy Score|0.68|
180
- |Precision(Macro)|0.68|
181
- |Recall (Macro)|0.64|
182
 
183
 
184
 
185
  ![alt text](https://media-exp1.licdn.com/dms/image/C4D0BAQHpfgjEyhtE1g/company-logo_200_200/0/1625210573748?e=1638403200&v=beta&t=toQNpiOlyim5Ja4f7Ejv8yKoCWifMsLWjkC7XnyXICI "Logo M47")
186
-
6
 
7
  # Spanish News Classification Headlines
8
 
9
+ SNCH: this model was developed by [M47Labs](https://www.m47labs.com/es/) for text classification. The base model is [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased); however, this version has not been fine-tuned on any dataset. The objective is to show how the model performs when used for inference without any training.
10
 
11
 
12
+ ## Validation Dataset Sample
13
 
14
  Dataset size : 1000
15
 
16
  Columns: idTask,task content 1,idTag,tag.
17
 
18
+ |task content|tag|
19
+ |------|------|
20
+ |Alcalá de Guadaíra celebra la IV Semana de la Diversidad Sexual con acciones de sensibilización|sociedad|
21
+ |El Archipiélago Chinijo Graciplus se impone en el Trofeo Centro Comercial Rubicón|deportes|
22
+ |Un total de 39 personas padecen ELA actualmente en la provincia|sociedad|
23
+ |Eurocopa 2021 : Italia vence a Gales y pasa a octavos con su candidatura reforzada|deportes|
24
+ |Resolución de 10 de junio de 2021, del Ayuntamiento de Tarazona de La Mancha (Albacete), referente a la convocatoria para proveer una plaza.|sociedad|
25
+ |El primer ministro sueco pierde una moción de censura|politica|
26
+ |El dólar se dispara tras la reunión de la Fed|economia|
27
 
28
 
29
  ## Labels:
61
 
62
 
63
  review_text = 'los vehiculos que esten esperando pasajaeros deberan estar apagados para reducir emisiones'
64
+ path = "M47Labs/spanish_news_classification_headlines_untrained"
65
  tokenizer = AutoTokenizer.from_pretrained(path)
66
  model = BertForSequenceClassification.from_pretrained(path)
67
 
74
 
75
  ```
76
 
77
+ ```[{'label': 'medio_ambiente', 'score': 0.2834321384291023}]```
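The `score` field in the output above is, assuming the pipeline's default behavior for a single-label classifier, the softmax probability of the highest-scoring class. A minimal sketch of that step in plain Python follows; the logits and the three-entry `id2label` mapping are hypothetical and are not taken from this model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits, id2label):
    """Return the most probable label and its softmax score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical label mapping and logits, for illustration only.
id2label = {0: "deportes", 1: "medio_ambiente", 2: "politica"}
print(top_label([0.1, 2.0, -1.0], id2label))
```

With near-equal logits, as one might expect from a classification head that has not been trained, the top score stays close to 1 divided by the number of labels.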
78
 
79
  ### Pytorch
80
 
84
  from transformers import AutoTokenizer, BertForSequenceClassification,TextClassificationPipeline
85
  import numpy as np

86
 
87
+ model_name = 'M47Labs/spanish_news_classification_headlines_untrained'
88
  MAX_LEN = 32
89
 
90
 
119
  ```Review text: las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno```
120
 
121
 
122
+ ```Sentiment : opinion```
123
 
124
 
125
  A more in depth example on how to use the model can be found in this colab notebook: https://colab.research.google.com/drive/1XsKea6oMyEckye2FePW_XN7Rf8v41Cw_?usp=sharing
134
  * EPOCHS = 5
135
  * LEARNING_RATE = 1e-05
136
 
137
 
138
  ## Validation Results
139
 
140
+ |Full Dataset||
141
  |------|------|
142
+ |Accuracy Score|0.362|
143
+ |Precision (Macro)|0.21|
144
+ |Recall (Macro)|0.22|
145
 
146
 
147
 
148
  ![alt text](https://media-exp1.licdn.com/dms/image/C4D0BAQHpfgjEyhtE1g/company-logo_200_200/0/1625210573748?e=1638403200&v=beta&t=toQNpiOlyim5Ja4f7Ejv8yKoCWifMsLWjkC7XnyXICI "Logo M47")
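The validation table reports accuracy together with macro-averaged precision and recall. As a rough illustration of how such figures are derived, the sketch below computes them from two label lists in plain Python. The function name `macro_scores` and the sample labels are hypothetical; the README does not show the evaluation code that produced the reported numbers.

```python
from collections import Counter

def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # how often each label is the ground truth
    pred_count = Counter(y_pred)   # how often each label is predicted
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct hits per label

    accuracy = sum(tp.values()) / len(y_true)
    # Macro averaging: compute the metric per label, then take the unweighted mean.
    # A label with no predictions (or no true examples) contributes 0.
    precision = sum(tp[l] / pred_count[l] if pred_count[l] else 0.0 for l in labels) / len(labels)
    recall = sum(tp[l] / true_count[l] if true_count[l] else 0.0 for l in labels) / len(labels)
    return accuracy, precision, recall

# Hypothetical ground-truth and predicted tags, for illustration only.
y_true = ["sociedad", "deportes", "sociedad", "politica"]
y_pred = ["sociedad", "deportes", "politica", "politica"]
print(macro_scores(y_true, y_pred))
```

Because macro averaging weights every label equally, a model that favors a few frequent tags can reach a moderate accuracy while its macro precision and recall stay low.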