Commit 74f29fd by abrhaleitela • 1 Parent(s): 461f4ae

Create README.md

  1. README.md +189 -0
README.md ADDED
@@ -0,0 +1,189 @@
# Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya

## Proposed Method:
<img src="data/proposed.png" height="330" width="760">

The proposed method transfers a monolingual Transformer model to a new target language at the lexical level by learning new token embeddings. All implementations in this repo use XLNet as the source Transformer model; other Transformer models can be used in the same way.

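As a rough illustration of what working "at the lexical level" can involve, the sketch below assembles an embedding table for a target-language vocabulary from pretrained vectors (for example, word2vec vectors such as those trained in token_embeddings.ipynb) and falls back to small random values for tokens without a vector. The function name, the toy vocabulary and vectors, and the initialization range are assumptions made for illustration; this is not the repository's implementation, and how such a table is plugged into XLNet is described in the notebooks and the paper.

```python
import random


def build_embedding_table(vocab, pretrained, dim, seed=0):
    """Return one `dim`-sized vector per token in `vocab`, in vocabulary order.

    Tokens found in `pretrained` reuse their vector; all other tokens get
    small random values (an assumed initialization, chosen for illustration).
    """
    rng = random.Random(seed)
    table = []
    for token in vocab:
        vector = pretrained.get(token)
        if vector is None:
            vector = [rng.uniform(-0.02, 0.02) for _ in range(dim)]
        elif len(vector) != dim:
            raise ValueError(
                f"vector for {token!r} has size {len(vector)}, expected {dim}"
            )
        table.append(list(vector))
    return table


# Toy example: two tokens have pretrained vectors, one does not.
vocab = ["<unk>", "ዋዋዋው", "ፍሊም"]
pretrained = {"ዋዋዋው": [0.1, 0.2, 0.3], "ፍሊም": [0.4, 0.5, 0.6]}
table = build_embedding_table(vocab, pretrained, dim=3)
print(len(table), table[1])
```

Keeping the table in vocabulary order means row `i` always corresponds to token id `i`, which is the layout an embedding layer expects.
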
## Main files:
All files are IPython notebooks that can be run directly in Google Colab.

- train.ipynb : Fine-tunes XLNet (a monolingual Transformer) on the sentiment analysis dataset for the new target language (Tigrinya). [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bSSrKE-TSphUyrNB2UWhFI-Bkoz0a5l0?usp=sharing)

- test.ipynb : Evaluates the fine-tuned model on the test data. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17R1lvRjxILVNk971vzZT79o2OodwaNIX?usp=sharing)

- token_embeddings.ipynb : Trains word2vec token embeddings for the Tigrinya language. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1hCtetAllAjBw28EVQkJFpiKdFtXmuxV7?usp=sharing)

- process_Tigrinya_comments.ipynb : Extracts Tigrinya comments from mixed-language content. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-ndLlBV-iLZNBW3Z8OfKAqUUCjvGbdZU?usp=sharing)

- extract_YouTube_comments.ipynb : Downloads the available comments from a YouTube channel, given its channel ID. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1b7G85wHKe18y45JIDtvDJdO5dOkRmDdp?usp=sharing)

- auto_labelling.ipynb : Automatically labels Tigrinya comments as positive or negative based on [emoji sentiment](http://kt.ijs.si/data/Emoji_sentiment_ranking/). [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wnZf7CBBCIr966vRUITlxKCrANsMPpV7?usp=sharing)

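The data-preparation notebooks listed above (comment filtering and emoji-based labelling) can be pictured with a minimal sketch like the following. It is not the notebooks' code: the helper names, the 0.5 script threshold, and the emoji scores are placeholders invented for illustration (real scores would come from the linked emoji sentiment ranking). Note also that a script check only detects Ethiopic characters, which Tigrinya shares with other languages such as Amharic, so it is a coarse first filter at best.

```python
# Placeholder scores for illustration only; not taken from the emoji ranking.
PLACEHOLDER_EMOJI_SCORES = {"😍": 0.7, "👍": 0.5, "😡": -0.6, "👎": -0.5}


def ethiopic_ratio(text):
    """Fraction of alphabetic characters in the Ethiopic block (U+1200..U+137F)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    ethiopic = sum(1 for ch in letters if "\u1200" <= ch <= "\u137f")
    return ethiopic / len(letters)


def label_by_emoji(text, scores=PLACEHOLDER_EMOJI_SCORES):
    """Return 'positive' or 'negative' from the summed emoji scores, else None."""
    total = sum(scores.get(ch, 0.0) for ch in text)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return None  # no emoji signal: leave the comment unlabelled


comments = ["ዋዋዋው ፍሊም 😍👍", "great video 👍", "ፍሊም 👎", "ዋዋዋው ፍሊም"]

# Keep comments that are mostly Ethiopic script and carry an emoji signal.
labelled = []
for comment in comments:
    label = label_by_emoji(comment)
    if ethiopic_ratio(comment) >= 0.5 and label is not None:
        labelled.append((comment, label))

print(labelled)
```
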
## Tigrinya Tokenizer:

A [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer for Tigrinya has been released to the public and can be accessed as follows:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("abryee/TigXLNet")
    tokenizer.tokenize("ዋዋዋው እዛ ፍሊም ካብተን ዘድንቀን ሓንቲ ኢያ ሞ ብጣዕሚ ኢና ነመስግን ሓንቲ ክብላ ደልየ ዘሎኹ ሓደራኣኹም ኣብ ጊዜኹም ተረክቡ")

## TigXLNet:
A new general-purpose Transformer model for the low-resource language Tigrinya has also been released to the public and can be accessed as follows:

    from transformers import AutoConfig, AutoModel

    config = AutoConfig.from_pretrained("abryee/TigXLNet")
    config.d_head = 64  # set the per-head attention dimension before loading the weights
    model = AutoModel.from_pretrained("abryee/TigXLNet", config=config)

## Evaluation:

The proposed method is evaluated using two datasets:
- A newly created sentiment analysis dataset for a low-resource language (Tigrinya).

<table>
<tr>
<td> <table>
<thead>
<tr>
<th><sub>Models</sub></th>
<th><sub>Configuration</sub></th>
<th><sub>F1-Score</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=3><sub>BERT</sub></td>
<td rowspan=1><sub>+Frozen BERT weights</sub></td>
<td><sub>54.91</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Random embeddings</sub></td>
<td><sub>74.26</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Frozen token embeddings</sub></td>
<td><sub>76.35</sub></td>
</tr>
<tr>
<td rowspan=3><sub>mBERT</sub></td>
<td rowspan=1><sub>+Frozen mBERT weights</sub></td>
<td><sub>57.32</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Random embeddings</sub></td>
<td><sub>76.01</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Frozen token embeddings</sub></td>
<td><sub>77.51</sub></td>
</tr>
<tr>
<td rowspan=3><sub>XLNet</sub></td>
<td rowspan=1><sub>+Frozen XLNet weights</sub></td>
<td><strong><sub>68.14</sub></strong></td>
</tr>
<tr>
<td rowspan=1><sub>+Random embeddings</sub></td>
<td><strong><sub>77.83</sub></strong></td>
</tr>
<tr>
<td rowspan=1><sub>+Frozen token embeddings</sub></td>
<td><strong><sub>81.62</sub></strong></td>
</tr>
</tbody>
</table> </td>
<td><img src="data/effect_of_dataset_size.png" alt="3" width = 480px height = 280px></td>
</tr>
</table>

- Cross-lingual Sentiment dataset ([CLS](https://zenodo.org/record/3251672#.Xs65VzozbIU)).

<table>
<thead>
<tr>
<th rowspan=2><sub>Models</sub></th>
<th rowspan=1 colspan=3><sub>English</sub></th>
<th rowspan=1 colspan=3><sub>German</sub></th>
<th rowspan=1 colspan=3><sub>French</sub></th>
<th rowspan=1 colspan=3><sub>Japanese</sub></th>
<th rowspan=2><sub>Average</sub></th>
</tr>
<tr>
<th colspan=1><sub>Books</sub></th>
<th colspan=1><sub>DVD</sub></th>
<th colspan=1><sub>Music</sub></th>
<th colspan=1><sub>Books</sub></th>
<th colspan=1><sub>DVD</sub></th>
<th colspan=1><sub>Music</sub></th>
<th colspan=1><sub>Books</sub></th>
<th colspan=1><sub>DVD</sub></th>
<th colspan=1><sub>Music</sub></th>
<th colspan=1><sub>Books</sub></th>
<th colspan=1><sub>DVD</sub></th>
<th colspan=1><sub>Music</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td colspan=1><sub>XLNet</sub></td>
<td colspan=1><sub><strong>92.90</strong></sub></td>
<td colspan=1><sub><strong>93.31</strong></sub></td>
<td colspan=1><sub><strong>92.02</strong></sub></td>
<td colspan=1><sub>85.23</sub></td>
<td colspan=1><sub>83.30</sub></td>
<td colspan=1><sub>83.89</sub></td>
<td colspan=1><sub>73.05</sub></td>
<td colspan=1><sub>69.80</sub></td>
<td colspan=1><sub>70.12</sub></td>
<td colspan=1><sub>83.20</sub></td>
<td colspan=1><sub><strong>86.07</strong></sub></td>
<td colspan=1><sub>85.24</sub></td>
<td colspan=1><sub>83.08</sub></td>
</tr>
<tr>
<td colspan=1><sub>mBERT</sub></td>
<td colspan=1><sub>92.78</sub></td>
<td colspan=1><sub>90.30</sub></td>
<td colspan=1><sub>91.88</sub></td>
<td colspan=1><sub><strong>88.65</strong></sub></td>
<td colspan=1><sub><strong>85.85</strong></sub></td>
<td colspan=1><sub><strong>90.38</strong></sub></td>
<td colspan=1><sub><strong>91.09</strong></sub></td>
<td colspan=1><sub><strong>88.57</strong></sub></td>
<td colspan=1><sub><strong>93.67</strong></sub></td>
<td colspan=1><sub><strong>84.35</strong></sub></td>
<td colspan=1><sub>81.77</sub></td>
<td colspan=1><sub><strong>87.53</strong></sub></td>
<td colspan=1><sub><strong>88.90</strong></sub></td>
</tr>
</tbody>
</table>

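The Tigrinya results above are reported as F1-scores. For readers who want to recompute the metric from their own predictions, here is a minimal sketch of the binary F1-score (harmonic mean of precision and recall on the positive class). This README does not state which F1 variant or averaging was used, so treating it as positive-class F1 is an assumption, and the labels below are toy values rather than results from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = positive sentiment, 0 = negative sentiment.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
score = f1_score(y_true, y_pred)
print(f"{100 * score:.2f}")
```
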
## Dataset used for this paper:
We constructed a new sentiment analysis dataset for the Tigrinya language; it can be found in the zip file (Tigrinya Sentiment Analysis Dataset).

## Citing our paper:

Our paper is available on [arXiv](https://arxiv.org/pdf/2006.07698.pdf). Please consider citing our work:

    @misc{tela2020transferring,
        title={Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya},
        author={Abrhalei Tela and Abraham Woubie and Ville Hautamaki},
        year={2020},
        eprint={2006.07698},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
    }