edumunozsala committed on
Commit
5ebd548
1 Parent(s): b528b83

Upload README.md

Initial README file

Files changed (1)
  1. README.md +113 -0
README.md ADDED
@@ -0,0 +1,113 @@
+ ---
+ language: es
+ tags:
+ - sagemaker
+ - beto
+ - TextClassification
+ - SentimentAnalysis
+ license: apache-2.0
+ datasets:
+ - IMDbreviews_es
+ metrics:
+ - accuracy
+ model-index:
+ - name: beto_sentiment_analysis_es
+   results:
+   - task:
+       name: Sentiment Analysis
+       type: sentiment-analysis
+     dataset:
+       name: "IMDb Reviews in Spanish"
+       type: IMDbreviews_es
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 0.9101333333333333
+     - name: F1 Score
+       type: f1
+       value: 0.9088450094671354
+     - name: Precision
+       type: precision
+       value: 0.9105691056910569
+     - name: Recall
+       type: recall
+       value: 0.9071274298056156
+ widget:
+ - text: "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
+ ---
+
+ # Model beto_sentiment_analysis_es
+
+ ## **A fine-tuned model for sentiment analysis in Spanish**
+
+ This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning Container.
+ The base model is **BETO**, a BERT-base model pre-trained on a Spanish corpus. BETO is similar in size to BERT-Base and was trained with the Whole Word Masking technique.
+
+ **BETO Citation**
+
+ [Spanish Pre-Trained BERT Model and Evaluation Data](https://users.dcc.uchile.cl/~jperez/papers/pml4dc2020.pdf)
+
+ ```
+ @inproceedings{CaneteCFP2020,
+   title={Spanish Pre-Trained BERT Model and Evaluation Data},
+   author={Cañete, José and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and Pérez, Jorge},
+   booktitle={PML4DC at ICLR 2020},
+   year={2020}
+ }
+ ```
+
+ ## Dataset
+ The dataset is a collection of about 50,000 movie reviews. It is balanced and provides every review in English and in Spanish, with the label in both languages.
+
+ Sizes of the dataset splits:
+ - Train dataset: 42,500
+ - Validation dataset: 3,750
+ - Test dataset: 3,750
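
As a quick check on the split sizes listed above, the three subsets add up to the 50,000 reviews, which corresponds to an 85% / 7.5% / 7.5% split. A minimal sketch (the variable names are illustrative and not taken from the training code):

```python
# Split sizes as listed in this README
splits = {"train": 42_500, "validation": 3_750, "test": 3_750}

total = sum(splits.values())
# Fraction of the full dataset that each split represents
fractions = {name: size / total for name, size in splits.items()}

print(total)      # 50000
print(fractions)  # {'train': 0.85, 'validation': 0.075, 'test': 0.075}
```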
+
+ ## Intended uses & limitations
+
+ This model is intended for sentiment analysis of Spanish text. It was fine-tuned specifically on movie reviews, but it can be applied to other kinds of reviews.
+
+ ## Hyperparameters
+ {
+   "epochs": "4",
+   "train_batch_size": "32",
+   "eval_batch_size": "8",
+   "fp16": "true",
+   "learning_rate": "3e-05",
+   "model_name": "\"dccuchile/bert-base-spanish-wwm-uncased\"",
+   "sagemaker_container_log_level": "20",
+   "sagemaker_program": "\"train.py\""
+ }
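
SageMaker hands hyperparameters to the entry-point script as command-line arguments, with every value serialized as a string, which is why the numbers above appear quoted. The actual `train.py` is not shown in this README, so the following is only a sketch of how such a script might parse them; the `parse_hyperparameters` and `str_to_bool` helpers and the defaults are assumptions for illustration:

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Interpret 'true'/'false' style strings as booleans (hypothetical helper)."""
    return value.strip().lower() in {"true", "1", "yes"}


def parse_hyperparameters(argv=None):
    """Parse the hyperparameters listed above from command-line style arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=4)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--fp16", type=str_to_bool, default=True)
    parser.add_argument("--model_name", type=str,
                        default="dccuchile/bert-base-spanish-wwm-uncased")
    # Ignore any extra arguments that are not declared here
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(["--epochs", "4", "--learning_rate", "3e-05", "--fp16", "true"])
print(args.epochs, args.learning_rate, args.fp16)  # 4 3e-05 True
```

Converting the values once at the top of the script keeps the rest of the training code free of string handling.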
+
+ ## Evaluation results
+
+ - Accuracy = 0.9101333333333333
+
+ - F1 Score = 0.9088450094671354
+
+ - Precision = 0.9105691056910569
+
+ - Recall = 0.9071274298056156
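
The F1 score is the harmonic mean of precision and recall, so the reported values can be cross-checked against each other. A minimal sketch using only the numbers listed above (the helper name `f1_from` is illustrative):

```python
import math

precision = 0.9105691056910569
recall = 0.9071274298056156
reported_f1 = 0.9088450094671354


def f1_from(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


recomputed_f1 = f1_from(precision, recall)
# The recomputed value agrees with the reported F1 up to rounding
print(math.isclose(recomputed_f1, reported_f1, rel_tol=1e-6))
```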
+
+ ## Test results
+
+ ## Model in action
+
+ ### Usage for Sentiment Analysis
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ # Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("edumunozsala/beto_sentiment_analysis_es")
+ model = AutoModelForSequenceClassification.from_pretrained("edumunozsala/beto_sentiment_analysis_es")
+ model.eval()
+
+ text = "Se trata de una película interesante, con un solido argumento y un gran interpretación de su actor principal"
+
+ # Tokenize into a batch of one sequence, truncated to the model's maximum length
+ inputs = tokenizer(text, return_tensors="pt", truncation=True)
+ with torch.no_grad():
+     outputs = model(**inputs)
+ # Index of the highest-scoring class
+ output = outputs.logits.argmax(1)
+ ```
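
The usage snippet stops at the predicted class index. Below is a minimal sketch of turning raw logits into a label and a confidence score, written in plain Python so it runs without downloading the model. The logit values and the `0 = negative`, `1 = positive` mapping are assumptions for illustration; `model.config.id2label` holds the mapping the model actually uses.

```python
import math

# Hypothetical logits for one review, shaped like outputs.logits[0].tolist()
logits = [-1.2, 2.3]

# Assumed label mapping; the real one lives in model.config.id2label
id2label = {0: "negative", 1: "positive"}


def softmax(values):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
# Same decision as logits.argmax(): the class with the highest probability
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_id], round(probs[predicted_id], 3))
```

Softmax preserves the ordering of the logits, so the predicted class matches the `argmax` over the raw logits; the probabilities only add a confidence estimate.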
+
+ Created by [Eduardo Muñoz/@edumunozsala](https://github.com/edumunozsala)