edumunozsala's picture
Fix Metric Typos (#1)
6a506e8
---
language: es
tags:
- sagemaker
- roberta-bne
- TextClassification
- SentimentAnalysis
license: apache-2.0
datasets:
- IMDbreviews_es
metrics:
- accuracy
model-index:
- name: roberta_bne_sentiment_analysis_es
results:
- task:
name: Sentiment Analysis
type: sentiment-analysis
dataset:
name: "IMDb Reviews in Spanish"
type: IMDbreviews_es
metrics:
- name: Accuracy
type: accuracy
value: 0.9106666666666666
- name: F1 Score
type: f1
value: 0.9090909090909091
- name: Precision
type: precision
value: 0.9063852813852814
- name: Recall
type: recall
value: 0.9118127381600436
widget:
- text: "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
---
# Model roberta_bne_sentiment_analysis_es
## **A finetuned model for Sentiment analysis in Spanish**
This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning container,
The base model is **RoBERTa-base-bne** which is a RoBERTa base model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB.
It was trained by The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html)
**RoBERTa BNE Citation**
Check out the paper for all the details: https://arxiv.org/abs/2107.07253
```
@article{gutierrezfandino2022,
author = {Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquin Silveira-Ocampo and Casimiro Pio Carrino and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Aitor Gonzalez-Agirre and Marta Villegas},
title = {MarIA: Spanish Language Models},
journal = {Procesamiento del Lenguaje Natural},
volume = {68},
number = {0},
year = {2022},
issn = {1989-7553},
url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405},
pages = {39--60}
}
```
## Dataset
The dataset is a collection of movie reviews in Spanish, about 50,000 reviews. The dataset is balanced and provides every review in english, in spanish and the label in both languages.
Sizes of datasets:
- Train dataset: 42,500
- Validation dataset: 3,750
- Test dataset: 3,750
## Intended uses & limitations
This model is intented for Sentiment Analysis for spanish corpus and finetuned specially for movie reviews but it can be applied to other kind of reviews.
## Hyperparameters
{
"epochs": "4",
"train_batch_size": "32",
"eval_batch_size": "8",
"fp16": "true",
"learning_rate": "3e-05",
"model_name": "\"PlanTL-GOB-ES/roberta-base-bne\"",
"sagemaker_container_log_level": "20",
"sagemaker_program": "\"train.py\"",
}
## Evaluation results
- Accuracy = 0.9106666666666666
- F1 Score = 0.9090909090909091
- Precision = 0.9063852813852814
- Recall = 0.9118127381600436
## Test results
## Model in action
### Usage for Sentiment Analysis
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("edumunozsala/roberta_bne_sentiment_analysis_es")
model = AutoModelForSequenceClassification.from_pretrained("edumunozsala/roberta_bne_sentiment_analysis_es")
text ="Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
outputs = model(input_ids)
output = outputs.logits.argmax(1)
```
Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)