edumunozsala committed on
Commit
4c12604
1 Parent(s): a9d7491

Upload README.md

Initial README file

Files changed (1)
  1. README.md +103 -0
README.md ADDED
@@ -0,0 +1,103 @@
1
+ ---
2
+ language: es
3
+ tags:
4
+ - sagemaker
5
+ - bertin
6
+ - TextClassification
7
+ - SentimentAnalysis
8
+ license: apache-2.0
9
+ datasets:
10
+ - IMDbreviews_es
11
+ metrics:
12
+ - accuracy
13
+ model-index:
14
+ - name: bertin_base_sentiment_analysis_es
15
+   results:
16
+   - task:
17
+       name: Sentiment Analysis
18
+       type: sentiment-analysis
19
+     dataset:
20
+       name: "IMDb Reviews in Spanish"
21
+       type: IMDbreviews_es
22
+     metrics:
23
+     - name: Accuracy
24
+       type: accuracy
25
+       value: 0.898933
26
+     - name: F1 Score
27
+       type: f1
28
+       value: 0.8989063
29
+     - name: Precision
30
+       type: precision
31
+       value: 0.8771473
32
+     - name: Recall
33
+       type: recall
34
+       value: 0.9217724
35
+ widget:
36
+ - text: "Se trata de una película interesante, con un sólido argumento y una gran interpretación de su actor principal"
37
+ ---
38
+
39
+ ## Model `bertin_base_sentiment_analysis_es`
40
+
41
+ ### **A fine-tuned model for sentiment analysis in Spanish**
42
+
43
+ This model was trained using Amazon SageMaker and the Hugging Face Deep Learning Container.
44
+ The base model is **Bertin base**, a RoBERTa-base model pre-trained on the Spanish portion of mC4 using Flax.
45
+ The base model was trained by the Bertin Project. [Link to base model](https://huggingface.co/bertin-project/bertin-roberta-base-spanish)
46
+
47
+ - Article: BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling
48
+ - Authors: Javier De la Rosa, Eduardo G. Ponferrada, Manu Romero, Paulo Villegas, Pablo González de Prado Salas, and María Grandury
49
+ - Journal: Procesamiento del Lenguaje Natural
50
+ - Volume: 68, Number: 0, Year: 2022
51
+ - URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403
52
+
53
+ ## Dataset
54
+ The dataset is a collection of about 50,000 movie reviews in Spanish. It is balanced and provides every review in English and in Spanish, with the label in both languages.
55
+
56
+ Sizes of datasets:
57
+ - Train dataset: 42,500
58
+ - Validation dataset: 3,750
59
+ - Test dataset: 3,750
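
The three sizes above add up to 50,000, which corresponds to roughly an 85% / 7.5% / 7.5% split. The following is a minimal sketch of how such a split could be produced. The fractions are inferred from the listed sizes, and the `split_dataset` helper and its seed are hypothetical; the actual procedure used for this model is not described here.

```python
import random


def split_dataset(examples, train_frac=0.85, val_frac=0.075, seed=42):
    """Shuffle examples and split them into train/validation/test lists.

    The fractions are inferred from the sizes listed above; the seed is arbitrary.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


# Integers stand in for the 50,000 reviews.
train, val, test = split_dataset(range(50_000))
print(len(train), len(val), len(test))
```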
60
+
61
+ ## Intended uses & limitations
62
+
63
+ This model is intended for sentiment analysis of Spanish text. It was fine-tuned specifically on movie reviews, but it can be applied to other kinds of reviews.
64
+
65
+ ## Hyperparameters
66
+ {
67
+ "epochs": "4",
68
+ "train_batch_size": "32",
69
+ "eval_batch_size": "8",
70
+ "fp16": "true",
71
+ "learning_rate": "3e-05",
72
+ "model_name": "\"bertin-project/bertin-roberta-base-spanish\"",
73
+ "sagemaker_container_log_level": "20",
74
+ "sagemaker_program": "\"train.py\""
75
+ }
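
Every value in the hyperparameter dump above is stored as a JSON-encoded string (note the escaped quotes around the model name). The sketch below shows one way to decode such a dictionary into native Python values. The `decode_hyperparameters` helper is hypothetical, and how `train.py` actually reads these values is not shown in this card.

```python
import json

# Hyperparameters as recorded above: every value is a JSON-encoded string.
raw = {
    "epochs": "4",
    "train_batch_size": "32",
    "eval_batch_size": "8",
    "fp16": "true",
    "learning_rate": "3e-05",
    "model_name": "\"bertin-project/bertin-roberta-base-spanish\"",
}


def decode_hyperparameters(raw_params):
    """Decode each JSON-encoded string into a native Python value."""
    return {name: json.loads(value) for name, value in raw_params.items()}


params = decode_hyperparameters(raw)
print(params)
```

After decoding, `epochs` and the batch sizes are integers, `fp16` is a boolean, `learning_rate` is a float, and `model_name` is a plain string without the extra quotes.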
76
+
77
+ ## Evaluation results
78
+ - Accuracy = 0.8989333333333334
78
+ - F1 Score = 0.8989063750333421
79
+ - Precision = 0.877147319104633
80
+ - Recall = 0.9217724288840262
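
As a quick consistency check, the F1 score can be recomputed as the harmonic mean of the reported precision and recall. This is only a sketch: it assumes all three metrics refer to the same class, which the card does not state explicitly.

```python
precision = 0.877147319104633
recall = 0.9217724288840262

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.7f}")
```

The recomputed value agrees closely with the reported F1 score, which suggests, but does not confirm, that the three metrics were computed for the same class.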
82
+
83
+ ## Test results
84
+
85
+ ## Model in action
86
+
87
+ ### Usage for Sentiment Analysis
88
+
89
+ ```python
90
+ import torch
91
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
92
+
93
+ tokenizer = AutoTokenizer.from_pretrained("edumunozsala/bertin_base_sentiment_analysis_es")
94
+ model = AutoModelForSequenceClassification.from_pretrained("edumunozsala/bertin_base_sentiment_analysis_es")
95
+
96
+ text = "Se trata de una película interesante, con un sólido argumento y una gran interpretación de su actor principal"
97
+
98
+ input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # add a batch dimension
99
+ outputs = model(input_ids)
100
+ output = outputs.logits.argmax(1)  # index of the highest-scoring class
101
+ ```
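
The snippet above yields only a class index. The following sketch turns one row of logits into a label and a probability using a softmax. The `ID2LABEL` mapping and the `logits_to_prediction` helper are assumptions made for illustration: this card does not say which index means positive, so check `model.config.id2label` before relying on the mapping.

```python
import math

# Hypothetical mapping: confirm against model.config.id2label before relying on it.
ID2LABEL = {0: "negative", 1: "positive"}


def logits_to_prediction(logits, id2label=ID2LABEL):
    """Convert one row of raw logits into (label, probability) via softmax."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits for illustration only, not real model output.
label, prob = logits_to_prediction([-1.2, 2.3])
print(label, round(prob, 3))
```

With the model loaded as in the usage example, the input would be `outputs.logits[0].detach().tolist()`.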
102
+
103
+ Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)