Faris-ML committed on
Commit e2a0782
1 Parent(s): 08e3f78

Update README.md

Files changed (1)
  1. README.md +157 -18
README.md CHANGED
@@ -12,36 +12,175 @@ probably proofread and complete it, then remove this comment. -->
 
  # MARBERT_sentiment_sarcasm_speech_act_classifier
 
- This model is a fine-tuned version of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on an unknown dataset.
- It achieves the following results on the evaluation set:

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - optimizer: None
- - training_precision: float32

- ### Training results

- ### Framework versions

- - Transformers 4.41.2
- - TensorFlow 2.15.0
- - Tokenizers 0.19.1
 
  # MARBERT_sentiment_sarcasm_speech_act_classifier
 
+ This model is a fine-tuned version of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on the [Khalaya/Arabic_YouTube_Comments](https://huggingface.co/datasets/Khalaya/Arabic_YouTube_Comments) dataset.
+ The model classifies comments into three categories:
+ 1. Sentiment (Positive, Neutral, Negative, Mixed)
+ 2. Speech act (Expression, Assertion, Question, Recommendation, Request, Miscellaneous)
+ 3. Sarcasm (Yes, No)

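+ For downstream processing it can help to keep the label sets listed above in one place. The snippet below only collects those label names as Python constants; the index order of each task's outputs is an assumption and should be checked against the checkpoint's `id2label` configuration.

+ ```python
+ # Label sets as described in this card; index order is assumed, not guaranteed.
+ SENTIMENT_LABELS = ["Positive", "Neutral", "Negative", "Mixed"]
+ SPEECH_ACT_LABELS = ["Expression", "Assertion", "Question", "Recommendation", "Request", "Miscellaneous"]
+ SARCASM_LABELS = ["No", "Yes"]
+ ```
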
+ ## Model Details

+ ### Model Description

+ - **Developed by:** Faris, CTO of Khalaya
+ - **Funded by:** Khalaya
+ - **Shared by:** Khalaya
+ - **Model type:** BERT
+ - **Language(s) (NLP):** Arabic
+ - **License:** MIT
+ - **Finetuned from model:** [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2)

+ ### Model Sources

+ - **Repository:** [More Information Needed]
+ - **Paper:** [More Information Needed]
+ - **Demo:** [More Information Needed]

+ ## Uses

+ ### Direct Use

+ This model can be used directly to classify Arabic YouTube comments into the three categories listed above, without further fine-tuning.

+ ### Downstream Use

+ The model can be fine-tuned for other Arabic text classification tasks, or integrated into larger applications that require sentiment analysis, speech-act recognition, or sarcasm detection for Arabic text.

+ ### Out-of-Scope Use

+ The model is not designed for tasks outside Arabic text classification, such as text generation or translation.

+ ## Bias, Risks, and Limitations

+ ### Recommendations

+ Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. The model may reflect biases present in its training data and may not perform equally well across all domains or topics of Arabic YouTube comments.

+ ## How to Get Started with the Model

+ Use the code below to get started with the model:

+ ```python
+ from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

+ # Replace "path_to_your_model" with this model's repository ID on the Hugging Face Hub.
+ tokenizer = AutoTokenizer.from_pretrained("path_to_your_model")
+ model = TFAutoModelForSequenceClassification.from_pretrained("path_to_your_model")

+ # Tokenize a single comment and run a forward pass.
+ inputs = tokenizer("Your text here", return_tensors="tf")
+ outputs = model(inputs)
+ ```
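
+ The loading code above exposes a single `logits` tensor. How the three tasks are packed into that output depends on the checkpoint's head configuration, so the post-processing below is only a minimal sketch for a single classification head, relying on the checkpoint's `id2label` mapping; it is not the card author's exact inference code.

+ ```python
+ import tensorflow as tf

+ # Turn logits into class probabilities and pick the top class.
+ probs = tf.nn.softmax(outputs.logits, axis=-1)
+ pred_id = int(tf.argmax(probs, axis=-1)[0])
+ print(model.config.id2label[pred_id])
+ ```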

+ ## Training Details

+ ### Training Data

+ The model was trained on the Arabic YouTube Comments dataset, which includes comments labeled for sentiment, speech act, and sarcasm.

+ ### Training Procedure

+ Training involved preprocessing the text data, tokenizing it with the MARBERT tokenizer, and training the model on a TPU with mixed precision for 7 epochs. The learning rate followed a one-cycle schedule; a sketch of this setup is shown after the hyperparameter list below.

+ #### Preprocessing

+ The text data was tokenized with a maximum sequence length of 128 tokens.

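+ As a minimal sketch of that preprocessing step (assuming the MARBERTv2 tokenizer and padding to the maximum length; the exact padding strategy is an assumption):

+ ```python
+ from transformers import AutoTokenizer

+ tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
+ encoded = tokenizer(
+     ["نص تجريبي"],        # a batch of raw Arabic comments (placeholder text)
+     max_length=128,        # maximum length used for this model
+     padding="max_length",
+     truncation=True,
+     return_tensors="tf",
+ )
+ ```
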
87
+ #### Training Hyperparameters
88
+
89
+ - **EPOCHS:** 7
90
+ - **LEARNING_RATE_MAX:** 2e-5
91
+ - **LEARNING_RATE:** 2e-5
92
+ - **PCT:** 0.02
93
+ - **BATCH_SIZE:** 512
94
+ - **WD:** 0.001
95
+ - **MAX_LENGTH:** 128
96
+ - **DROP_OUT:** 0.1
97
+
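+ The exact training script is not part of this card. The sketch below shows one way the pieces described above could be wired together in Keras, assuming **PCT** is the warm-up fraction of the one-cycle schedule, **WD** is AdamW weight decay, and bfloat16 mixed precision on TPU; `steps_per_epoch` is a placeholder that depends on the dataset and batch size.

+ ```python
+ import math
+ import tensorflow as tf

+ class OneCycleSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
+     """Linear warm-up for the first `pct` of steps, cosine decay to zero afterwards."""

+     def __init__(self, lr_max, total_steps, pct=0.02):
+         self.lr_max = lr_max
+         self.warmup_steps = max(1.0, total_steps * pct)
+         self.decay_steps = max(1.0, total_steps - self.warmup_steps)

+     def __call__(self, step):
+         step = tf.cast(step, tf.float32)
+         warmup = self.lr_max * step / self.warmup_steps
+         progress = (step - self.warmup_steps) / self.decay_steps
+         decay = self.lr_max * 0.5 * (1.0 + tf.cos(math.pi * progress))
+         return tf.where(step < self.warmup_steps, warmup, decay)

+ # Mixed precision on TPU typically uses bfloat16.
+ tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

+ steps_per_epoch = 1000  # placeholder: number of batches per epoch at BATCH_SIZE = 512
+ schedule = OneCycleSchedule(lr_max=2e-5, total_steps=7 * steps_per_epoch, pct=0.02)
+ optimizer = tf.keras.optimizers.AdamW(learning_rate=schedule, weight_decay=0.001)
+ ```
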
+ ## Evaluation

+ ### Testing Data, Factors & Metrics

+ #### Testing Data

+ The model was evaluated on a held-out test split of the Arabic YouTube Comments dataset.

+ #### Factors

+ Evaluation was broken down by class for each of the three tasks: sentiment, speech act, and sarcasm.

+ #### Metrics

+ Performance was measured with per-class precision, recall, and F1-score.

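+ These per-class numbers correspond to a standard classification report; the snippet below is a minimal sketch using scikit-learn with placeholder label lists (shown for the sarcasm task), not the card author's evaluation script.

+ ```python
+ from sklearn.metrics import classification_report

+ # Placeholder gold labels and model predictions for one task (sarcasm).
+ y_true = ["No", "No", "Yes", "No", "Yes"]
+ y_pred = ["No", "Yes", "Yes", "No", "Yes"]

+ # Prints precision, recall, and F1-score for each class.
+ print(classification_report(y_true, y_pred, digits=2))
+ ```
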
+ ### Results

+ The evaluation results are as follows:

+ **Sentiment Classification**

+ | Class    | Precision | Recall | F1-score |
+ |----------|-----------|--------|----------|
+ | Positive | 0.91      | 0.89   | 0.90     |
+ | Neutral  | 0.67      | 0.62   | 0.64     |
+ | Negative | 0.82      | 0.88   | 0.85     |
+ | Mixed    | 0.00      | 0.00   | 0.00     |

+ **Speech Act Classification**

+ | Class          | Precision | Recall | F1-score |
+ |----------------|-----------|--------|----------|
+ | Expression     | 0.92      | 0.80   | 0.86     |
+ | Assertion      | 0.68      | 0.83   | 0.74     |
+ | Question       | 0.75      | 0.85   | 0.80     |
+ | Recommendation | 0.60      | 0.72   | 0.66     |
+ | Request        | 0.66      | 0.81   | 0.73     |
+ | Miscellaneous  | 0.28      | 0.39   | 0.33     |

+ **Sarcasm Detection**

+ | Class | Precision | Recall | F1-score |
+ |-------|-----------|--------|----------|
+ | No    | 0.99      | 0.86   | 0.92     |
+ | Yes   | 0.38      | 0.88   | 0.53     |

+ ## Technical Specifications

+ ### Model Architecture and Objective

+ The model is based on the MARBERT architecture and fine-tuned for multi-task classification, predicting sentiment, speech act, and sarcasm for each comment.

+ ### Compute Infrastructure

+ The model was trained on a TPU v3-8.

+ #### Hardware

+ - **TPU Type:** TPU v3-8

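+ For reference, connecting to a TPU and placing model construction under a distribution strategy in TensorFlow usually looks like the sketch below; the resolver arguments depend on the environment (e.g. Colab vs. a Cloud TPU VM), and this is not the card author's exact setup.

+ ```python
+ import tensorflow as tf

+ # Locate and initialize the TPU, then build a distribution strategy.
+ resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
+ tf.config.experimental_connect_to_cluster(resolver)
+ tf.tpu.experimental.initialize_tpu_system(resolver)
+ strategy = tf.distribute.TPUStrategy(resolver)

+ with strategy.scope():
+     # Model construction and optimizer creation go inside the strategy scope.
+     ...
+ ```
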
+ #### Software

+ - **TensorFlow version:** 2.15.0
+ - **Transformers version:** 4.37.2

+ ## Citation

+ **BibTeX:**

+ ```bibtex
+ @misc{faris2024marbertv2,
+   author       = {Faris},
+   title        = {Multi-label Classification of Arabic YouTube Comments using MARBERTv2},
+   year         = {2024},
+   publisher    = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/khalaya/MARBERTv2}},
+ }
+ ```

+ **APA:**

+ Faris. (2024). *Multi-label Classification of Arabic YouTube Comments using MARBERTv2*. Hugging Face. Retrieved from https://huggingface.co/khalaya/MARBERTv2

+ ## Glossary

+ - **Sentiment Analysis:** The task of classifying the sentiment expressed in text.
+ - **Speech Act:** The function of an utterance, such as asking a question, making a statement, or giving a command.
+ - **Sarcasm Detection:** The task of identifying sarcasm in text.

+ ## More Information

+ For more information, please contact Faris at faris@khalaya.com.

+ ## Model Card Authors

+ - Faris, CTO of Khalaya

+ ## Model Card Contact

+ For further questions, please reach out to Faris at f.alahmadi@khalaya.com.sa.