w11wo committed on
Commit
d7cd1cc
1 Parent(s): b0e1537

Create README.md

  1. README.md +81 -0
README.md ADDED
@@ -0,0 +1,81 @@
---
language: su
tags:
- sundanese-bert-base-emotion-classifier
license: mit
widget:
- text: "Punten ini akurat ga ya sieun ihh daerah aku masuk zona merah"
---

## Sundanese BERT Base Emotion Classifier

Sundanese BERT Base Emotion Classifier is an emotion text-classification model based on [BERT](https://arxiv.org/abs/1810.04805). It starts from the pre-trained [Sundanese BERT Base Uncased](https://hf.co/luche/bert-base-sundanese-uncased) model trained by [`@luche`](https://hf.co/luche) and is fine-tuned on the [Sundanese Twitter dataset](https://github.com/virgantara/sundanese-twitter-dataset), which consists of Sundanese tweets.

10% of the dataset was held out for evaluation. After training, the model achieved an evaluation accuracy of 96.82% and a macro-averaged F1 of 96.75%.

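For readers unfamiliar with the two reported metrics, here is a minimal pure-Python sketch of how accuracy and macro-averaged F1 are defined. It is not the evaluation code used for this model; the function names and the toy labels are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data (hypothetical, not taken from the evaluation split).
y_true = ["anger", "joy", "fear", "sadness", "joy", "fear"]
y_pred = ["anger", "joy", "fear", "joy", "joy", "fear"]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

Because macro averaging weights every class equally, a single class that is never predicted correctly (here `sadness`) pulls the macro F1 well below the accuracy.
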
The model was trained with Hugging Face's `Trainer` class from the [Transformers](https://huggingface.co/transformers) library. PyTorch was the backend framework during training, but the model remains compatible with other frameworks.

## Model

| Model                                    | #params | Arch.     | Training/Validation data (text) |
| ---------------------------------------- | ------- | --------- | ------------------------------- |
| `sundanese-bert-base-emotion-classifier` | 110M    | BERT Base | Sundanese Twitter dataset       |

## Evaluation Results

The model was trained for 10 epochs and the best model was loaded at the end.

| Epoch | Training Loss | Validation Loss | Accuracy | F1       | Precision | Recall   |
| ----- | ------------- | --------------- | -------- | -------- | --------- | -------- |
| 1     | 0.759800      | 0.263913        | 0.924603 | 0.925042 | 0.928426  | 0.926130 |
| 2     | 0.213100      | 0.456022        | 0.908730 | 0.906732 | 0.924141  | 0.907846 |
| 3     | 0.091900      | 0.204323        | 0.956349 | 0.955896 | 0.956226  | 0.956248 |
| 4     | 0.043800      | 0.219143        | 0.956349 | 0.955705 | 0.955848  | 0.956392 |
| 5     | 0.013700      | 0.247289        | 0.960317 | 0.959734 | 0.959477  | 0.960782 |
| 6     | 0.004800      | 0.286636        | 0.956349 | 0.955540 | 0.956519  | 0.956615 |
| 7     | 0.000200      | 0.243408        | 0.960317 | 0.959085 | 0.959145  | 0.959310 |
| 8     | 0.001500      | 0.232138        | 0.960317 | 0.959451 | 0.959427  | 0.959997 |
| 9     | 0.000100      | 0.215523        | 0.968254 | 0.967556 | 0.967192  | 0.968330 |
| 10    | 0.000100      | 0.216533        | 0.968254 | 0.967556 | 0.967192  | 0.968330 |

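As an illustration of what "best model" can mean here, the sketch below picks the best epoch from the table above. The README does not say which metric was used for checkpoint selection, so ranking by F1 is an assumption; the `history` structure and variable names are hypothetical.

```python
# Validation metrics copied from the table above (one dict per epoch).
history = [
    {"epoch": 1, "val_loss": 0.263913, "accuracy": 0.924603, "f1": 0.925042},
    {"epoch": 2, "val_loss": 0.456022, "accuracy": 0.908730, "f1": 0.906732},
    {"epoch": 3, "val_loss": 0.204323, "accuracy": 0.956349, "f1": 0.955896},
    {"epoch": 4, "val_loss": 0.219143, "accuracy": 0.956349, "f1": 0.955705},
    {"epoch": 5, "val_loss": 0.247289, "accuracy": 0.960317, "f1": 0.959734},
    {"epoch": 6, "val_loss": 0.286636, "accuracy": 0.956349, "f1": 0.955540},
    {"epoch": 7, "val_loss": 0.243408, "accuracy": 0.960317, "f1": 0.959085},
    {"epoch": 8, "val_loss": 0.232138, "accuracy": 0.960317, "f1": 0.959451},
    {"epoch": 9, "val_loss": 0.215523, "accuracy": 0.968254, "f1": 0.967556},
    {"epoch": 10, "val_loss": 0.216533, "accuracy": 0.968254, "f1": 0.967556},
]

# Assumption: rank checkpoints by F1. max() keeps the earliest epoch on ties.
best_by_f1 = max(history, key=lambda row: row["f1"])

# Alternative criterion: lowest validation loss.
best_by_loss = min(history, key=lambda row: row["val_loss"])

print("best by F1:  ", best_by_f1["epoch"])
print("best by loss:", best_by_loss["epoch"])
```

The two criteria disagree on this run: F1 peaks at epochs 9 and 10, while validation loss is lowest at epoch 3. The accuracy and F1 figures quoted in the introduction match the epoch 9 and 10 rows.
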
## How to Use

### As Text Classifier

```python
from transformers import pipeline

# Hub model ID; the `w11wo/` namespace is assumed from the commit author.
pretrained_name = "w11wo/sundanese-bert-base-emotion-classifier"

# "sentiment-analysis" is the pipeline alias for text classification.
nlp = pipeline(
    "sentiment-analysis",
    model=pretrained_name,
    tokenizer=pretrained_name,
)

# Returns a list with one {"label": ..., "score": ...} dict per input text.
nlp("Punten ini akurat ga ya sieun ihh daerah aku masuk zona merah")
```

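A text-classification pipeline returns dictionaries with `label` and `score` keys. The sketch below shows one way to post-process such output by keeping the top label only when its score clears a threshold. The helper name `best_label`, the threshold value, and the example labels and scores are all hypothetical; they are not output from this model.

```python
def best_label(predictions, threshold=0.5):
    """Return (label, score) for the highest-scoring prediction.

    `predictions` is a list of {"label": str, "score": float} dicts for a
    single input text. Returns None when the top score is below `threshold`.
    """
    top = max(predictions, key=lambda p: p["score"])
    if top["score"] < threshold:
        return None
    return top["label"], top["score"]


# Hypothetical scores for one tweet; real label names come from the model config.
example = [
    {"label": "anger", "score": 0.03},
    {"label": "fear", "score": 0.91},
    {"label": "joy", "score": 0.02},
    {"label": "sadness", "score": 0.04},
]

print(best_label(example))
```

Returning `None` for low-confidence inputs lets the caller decide how to handle ambiguous text instead of forcing every input into one emotion.
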
## Disclaimer

Consider the biases from both the pre-trained BERT model and the Sundanese Twitter dataset, which may carry over into this model's results.

## Author

Sundanese BERT Base Emotion Classifier was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/). All computation and development were done on Google Colaboratory using its free GPU access.

## Credits

```
@inproceedings{Putr2011:Sundanese,
    title = {Sundanese Twitter Dataset for Emotion Classification},
    author = {Oddy Virgantara Putra and Fathin Muhammad Wasmanson and Triana Harmini and Shoffin Nahwa Utama},
    year = 2020,
    month = nov,
    booktitle = {2020 International Conference on Computer Engineering, Network, and Intelligent Multimedia (CENIM) (CENIM 2020)},
    address = {virtual},
    days = 16,
    keywords = {emotion classification; sundanese; machine learning},
    abstract = {Sundanese is the second-largest tribe in Indonesia which possesses many dialects. This condition has gained attention for many researchers to analyze emotion especially on social media. However, with barely available Sundanese dataset, this condition makes understanding sundanese emotion is a challenging task. In this research, we proposed a dataset for emotion classification of Sundanese text. The preprocessing includes case folding, stopwords removal, stemming, tokenizing, and text representation. Prior to classification, for the feature generation, we utilize term frequency-inverse document frequency (TFIDF). We evaluated our dataset using k-Fold Cross Validation. Our experiments with the proposed method exhibit an effective result for machine learning classification. Furthermore, as far as we know, this is the first Sundanese emotion dataset available for public.}
}
```