widget:
  - text: El dólar se dispara tras la reunión de la Fed

Spanish News Classification Headlines

SNCH: this model was developed by M47Labs for text classification of Spanish news headlines. The base model is BETO; however, this model has not been fine-tuned on any dataset. The objective is to show how the model performs when it is used for inference without any training at all.

Validation Dataset Sample

Dataset size : 1000

Columns: idTask, task content 1, idTag, tag.

task content | tag
Alcalá de Guadaíra celebra la IV Semana de la Diversidad Sexual con acciones de sensibilización | sociedad
El Archipiélago Chinijo Graciplus se impone en el Trofeo Centro Comercial Rubicón | deportes
Un total de 39 personas padecen ELA actualmente en la provincia | sociedad
Eurocopa 2021 : Italia vence a Gales y pasa a octavos con su candidatura reforzada | deportes
Resolución de 10 de junio de 2021, del Ayuntamiento de Tarazona de La Mancha (Albacete), referente a la convocatoria para proveer una plaza. | sociedad
El primer ministro sueco pierde una moción de censura | politica
El dólar se dispara tras la reunión de la Fed | economia

Labels:

  • ciencia_tecnologia

  • clickbait

  • cultura

  • deportes

  • economia

  • educacion

  • medio_ambiente

  • opinion

  • politica

  • sociedad
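
A classification head refers to labels by integer id rather than by name. The sketch below builds the two lookup tables from the label list above. The alphabetical ordering used here is an assumption made for illustration; the authoritative mapping is the one stored in the model's own config (`model.config.id2label`).

```python
# Label names as listed in this model card.
LABELS = [
    "ciencia_tecnologia", "clickbait", "cultura", "deportes", "economia",
    "educacion", "medio_ambiente", "opinion", "politica", "sociedad",
]

# Assumed ordering (alphabetical, as listed above); check model.config.id2label
# for the mapping the model actually uses.
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(len(LABELS))           # 10
print(id2label[3])           # deportes
print(label2id["economia"])  # 4
```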

Example of Use

Pipeline


from transformers import AutoTokenizer, BertForSequenceClassification, TextClassificationPipeline


review_text = 'los vehiculos que esten esperando pasajaeros deberan estar apagados para reducir emisiones'
path = "M47Labs/spanish_news_classification_headlines_untrained"

# Load the tokenizer and the classification model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(path)
model = BertForSequenceClassification.from_pretrained(path)


# Wrap both in a text-classification pipeline
nlp = TextClassificationPipeline(task="text-classification",
                                 model=model,
                                 tokenizer=tokenizer)

# Print the top label and its score
print(nlp(review_text))

[{'label': 'medio_ambiente', 'score': 0.2834321384291023}]
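
For a single-label model with several classes, the pipeline's default score is a softmax probability over the labels, so with ten labels a model that guesses uniformly would give each one a score of 0.1. The following is a minimal sketch of that conversion. The logits are made-up values for illustration, not actual outputs of this model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: identical values for all 10 labels (a uniform guess).
uniform_probs = softmax([0.0] * 10)
print(uniform_probs[0])  # 0.1

# Hypothetical logits where the first label is slightly preferred.
peaked_probs = softmax([1.0] + [0.0] * 9)
print(peaked_probs.index(max(peaked_probs)))  # 0
```

Seen against that 0.1 baseline, a top score near 0.28 indicates only a mild preference for the predicted label.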

PyTorch


import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'M47Labs/spanish_news_classification_headlines_untrained'
MAX_LEN = 32  # maximum number of tokens per headline


tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSequenceClassification.from_pretrained(model_name)

texto = "las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno"


# Tokenize the text, padding or truncating it to MAX_LEN tokens
encoded_review = tokenizer(
  texto,
  max_length=MAX_LEN,
  add_special_tokens=True,
  padding='max_length',
  truncation=True,
  return_attention_mask=True,
  return_tensors='pt',
)

input_ids = encoded_review['input_ids']
attention_mask = encoded_review['attention_mask']

# Forward pass without gradient tracking (inference only)
with torch.no_grad():
  output = model(input_ids=input_ids, attention_mask=attention_mask)

# The index of the highest logit is the predicted class id
prediction = torch.argmax(output.logits, dim=1).item()
print(f'Text: {texto}')

print(f'Predicted label: {model.config.id2label[prediction]}')

Text: las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno

Predicted label: opinion

A more in-depth example of how to use the model can be found in this Colab notebook: https://colab.research.google.com/drive/1XsKea6oMyEckye2FePW_XN7Rf8v41Cw_?usp=sharing

Validation Results

Metric | Full Dataset
Accuracy Score | 0.362
Precision (Macro) | 0.21
Recall (Macro) | 0.22
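
For reference, the sketch below shows one common way to compute accuracy and macro-averaged precision and recall. It is not the evaluation script used for the numbers above. The toy labels are hypothetical, and scoring a class with no predictions (or no true examples) as 0 is an assumed convention.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # times the label was predicted
        actual = sum(t == label for t in y_true)     # times the label was correct
        # Assumed convention: an undefined ratio counts as 0.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    # Macro averaging weights every class equally, regardless of its frequency.
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)

# Hypothetical toy data, not taken from the validation set.
y_true = ["deportes", "sociedad", "sociedad", "economia"]
y_pred = ["deportes", "sociedad", "opinion", "sociedad"]

accuracy, precision, recall = macro_scores(y_true, y_pred)
print(accuracy, precision, recall)  # 0.5 0.375 0.375
```

Because macro averaging gives rare classes the same weight as frequent ones, macro precision and recall can fall well below accuracy when a model does better on the most common labels.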
