File size: 5,133 Bytes
7254c02 9f34677 fb35f00 93f420e 514c0f1 3cdcd7e 9f34677 93f420e 9f34677 93f420e 9f34677 93f420e 9f34677 9439693 9aa3671 9439693 9aa3671 9439693 6866bd4 9439693 6866bd4 9439693 9f34677 93f420e 9f34677 3cdcd7e 64320bb 5935390 d4ae9a4 5935390 d4ae9a4 5935390 514c0f1 4b01106 372d865 4cc6c3e 832050d 4cc6c3e 832050d 4cc6c3e 832050d 8304013 832050d 64320bb 3cdcd7e |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 |
# Spanish News Classification Headlines
SNCH: this model was develop by [M47Labs](https://www.m47labs.com/es/) the goal is text classification, the base model use was [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased), it was fine-tuned on 1000 example dataset.
## Dataset Sample
Dataset size : 1000
Columns: idTask,task content 1,idTag,tag.
|idTask|task content 1|idTag|tag|
|------|------|------|------|
|3637d9ac-119c-4a8f-899c-339cf5b42ae0|Alcalá de Guadaíra celebra la IV Semana de la Diversidad Sexual con acciones de sensibilización|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
|d56bab52-0029-45dd-ad90-5c17d4ed4c88|El Archipiélago Chinijo Graciplus se impone en el Trofeo Centro Comercial Rubicón|ed198b6d-a5b9-4557-91ff-c0be51707dec|deportes|
|dec70bc5-4932-4fa2-aeac-31a52377be02|Un total de 39 personas padecen ELA actualmente en la provincia|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
|fb396ba9-fbf1-4495-84d9-5314eb731405|Eurocopa 2021 : Italia vence a Gales y pasa a octavos con su candidatura reforzada|ed198b6d-a5b9-4557-91ff-c0be51707dec|deportes|
|bc5a36ca-4e0a-422e-9167-766b41008c01|Resolución de 10 de junio de 2021, del Ayuntamiento de Tarazona de La Mancha (Albacete), referente a la convocatoria para proveer una plaza.|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
|a87f8703-ce34-47a5-9c1b-e992c7fe60f6|El primer ministro sueco pierde una moción de censura|209ae89e-55b4-41fd-aac0-5400feab479e|politica|
|d80bdaad-0ad5-43a0-850e-c473fd612526|El dólar se dispara tras la reunión de la Fed|11925830-148e-4890-a2bc-da9dc059dc17|economia|
## Labels:
* ciencia_tecnologia
* clickbait
* cultura
* deportes
* economia
* educacion
* medio_ambiente
* opinion
* politica
* sociedad
## Example of Use
### Pipeline
```{python}
import torch
from transformers import AutoTokenizer, BertForSequenceClassification,TextClassificationPipeline
review_text = 'los vehiculos que esten esperando pasajaeros deberan estar apagados para reducir emisiones'
path = "M47Labs/spanish_news_classification_headlines"
tokenizer = AutoTokenizer.from_pretrained(path)
model = BertForSequenceClassification.from_pretrained(path)
nlp = TextClassificationPipeline(task = "text-classification",
model = model,
tokenizer = tokenizer)
print(nlp(review_text))
```
```[{'label': 'medio_ambiente', 'score': 0.5648820996284485}]```
### Pytorch
```{python}
import torch
from transformers import AutoTokenizer, BertForSequenceClassification,TextClassificationPipeline
from numpy import np
model_name = 'M47Labs/spanish_news_classification_headlines'
MAX_LEN = 32
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
texto = "las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno"
encoded_review = tokenizer.encode_plus(
texto,
max_length=MAX_LEN,
add_special_tokens=True,
#return_token_type_ids=False,
pad_to_max_length=True,
return_attention_mask=True,
return_tensors='pt',
)
input_ids = encoded_review['input_ids']
attention_mask = encoded_review['attention_mask']
output = model(input_ids, attention_mask)
_, prediction = torch.max(output['logits'], dim=1)
print(f'Review text: {texto}')
print(f'Sentiment : {model.config.id2label[prediction.detach().cpu().numpy()[0]]}')
```
```Review text: las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno
Sentiment : medio_ambiente```
A more in depth example on how to use the model can be found in this colab notebook: https://colab.research.google.com/drive/1XsKea6oMyEckye2FePW_XN7Rf8v41Cw_?usp=sharing
## Finetune Hyperparameters
* MAX_LEN = 32
* TRAIN_BATCH_SIZE = 8
* VALID_BATCH_SIZE = 4
* EPOCHS = 5
* LEARNING_RATE = 1e-05
## Train Results
|n_example|epoch|loss|acc|
|------|------|------|------|
|100|0|2.286327266693115|12.5|
|100|1|2.018876111507416|40.0|
|100|2|1.8016730904579163|43.75|
|100|3|1.6121837735176086|46.25|
|100|4|1.41565443277359|68.75|
|n_example|epoch|loss|acc|
|------|------|------|------|
|500|0|2.0770938420295715|24.5|
|500|1|1.6953029704093934|50.25|
|500|2|1.258900796175003|64.25|
|500|3|0.8342628020048142|78.25|
|500|4|0.5135736921429634|90.25|
|n_example|epoch|loss|acc|
|------|------|------|------|
|1000|0|1.916002897115854|36.1997226074896|
|1000|1|1.2941598492664295|62.2746185852982|
|1000|2|0.8201534710415117|76.97642163661581|
|1000|3|0.524806430051615|86.9625520110957|
|1000|4|0.30662027455784463|92.64909847434119|
## Validation Results
|n_examples|100|
|------|------|
|Accuracy Score|0.35|
|Precision (Macro)|0.35|
|Recall (Macro)|0.16|
|n_examples|500|
|------|------|
|Accuracy Score|0.62|
|Precision (Macro)|0.60|
|Recall (Macro)|0.47|
|n_examples|1000|
|------|------|
|Accuracy Score|0.68|
|Precision(Macro)|0.68|
|Recall (Macro)|0.64|
![alt text](https://media-exp1.licdn.com/dms/image/C4D0BAQHpfgjEyhtE1g/company-logo_200_200/0/1625210573748?e=1638403200&v=beta&t=toQNpiOlyim5Ja4f7Ejv8yKoCWifMsLWjkC7XnyXICI "Logo M47") |