---
language: es
tags:
- sagemaker
- roberta-bne
- TextClassification
- SentimentAnalysis
license: apache-2.0
datasets:
- IMDbreviews_es
metrics:
- accuracy
model-index:
- name: roberta_bne_sentiment_analysis_es
  results:
  - task:
      name: Sentiment Analysis
      type: sentiment-analysis
    dataset:
      name: "IMDb Reviews in Spanish"
      type: IMDbreviews_es
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9106666666666666
    - name: F1 Score
      type: f1
      value: 0.9090909090909091
    - name: Precision
      type: precision
      value: 0.9063852813852814
    - name: Recall
      type: recall
      value: 0.9118127381600436
widget:
- text: "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
---

# Model roberta_bne_sentiment_analysis_es

## **A finetuned model for Sentiment analysis in Spanish**

This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning container.
The base model is **RoBERTa-base-bne**, a RoBERTa base model pre-trained on the largest Spanish corpus known to date, with a total of 570GB. The corpus was compiled from web crawls performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html).

**RoBERTa BNE Citation**

Check out the paper for all the details: https://arxiv.org/abs/2107.07253

```
@article{gutierrezfandino2022,
  author = {Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquin Silveira-Ocampo and Casimiro Pio Carrino and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Aitor Gonzalez-Agirre and Marta Villegas},
  title = {MarIA: Spanish Language Models},
  journal = {Procesamiento del Lenguaje Natural},
  volume = {68},
  number = {0},
  year = {2022},
  issn = {1989-7553},
  url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405},
  pages = {39--60}
}
```

## Dataset

The dataset is a collection of about 50,000 movie reviews in Spanish.
The dataset is balanced and provides every review in English and in Spanish, with the label in both languages.

Sizes of datasets:

- Train dataset: 42,500
- Validation dataset: 3,750
- Test dataset: 3,750

## Intended uses & limitations

This model is intended for sentiment analysis of Spanish text. It was fine-tuned specifically on movie reviews, but it can be applied to other kinds of reviews.

## Hyperparameters

    {
      "epochs": "4",
      "train_batch_size": "32",
      "eval_batch_size": "8",
      "fp16": "true",
      "learning_rate": "3e-05",
      "model_name": "\"PlanTL-GOB-ES/roberta-base-bne\"",
      "sagemaker_container_log_level": "20",
      "sagemaker_program": "\"train.py\""
    }

## Evaluation results

- Accuracy = 0.9106666666666666
- F1 Score = 0.9090909090909091
- Precision = 0.9063852813852814
- Recall = 0.9118127381600436

## Test results

## Model in action

### Usage for Sentiment Analysis

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("edumunozsala/roberta_bne_sentiment_analysis_es")
model = AutoModelForSequenceClassification.from_pretrained("edumunozsala/roberta_bne_sentiment_analysis_es")

text = "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"

# Tokenize into a batch of one sequence, truncating overly long reviews
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class
output = outputs.logits.argmax(dim=1)
```

Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)
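The usage snippet above stops at the predicted class index. As a rough sketch, the raw logits could also be turned into class probabilities with a softmax, which gives a confidence score next to the label. The helper `logits_to_prediction`, the example logit values, and the `ASSUMED_ID2LABEL` mapping are illustrative assumptions and not part of the model's API; the label names this checkpoint actually uses should be read from `model.config.id2label`.

```python
import math

# Hypothetical mapping, for illustration only; check model.config.id2label
# for the labels this checkpoint actually uses.
ASSUMED_ID2LABEL = {0: "negative", 1: "positive"}


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_prediction(logits, id2label=ASSUMED_ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits standing in for outputs.logits[0].tolist()
example_logits = [-1.2, 2.3]
label, confidence = logits_to_prediction(example_logits)
print(label, round(confidence, 3))
```

With the real model, `example_logits` would be replaced by `outputs.logits[0].tolist()` from the usage snippet.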