File size: 8,535 Bytes
74f29fd a98a203 74f29fd |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 |
# Transferring Monolingual Model to Low-Resource Language: The Case Of Tigrinya:
## Proposed Method:
<img src="data/proposed.png" height = "330" width ="760" >
The proposed method transfers a mono-lingual Transformer model into new target language at lexical level by learning new token embeddings. All implementation in this repo uses XLNet as a source Transformer model, however, other Transformer models can also be used similarly.
## Main files:
All files are IPython Notebook files which can be excuted simply in Google Colab.
- train.ipynb : Fine-tunes XLNet (mono-lingual transformer) on new target language (Tigrinya) sentiment analysis dataset. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bSSrKE-TSphUyrNB2UWhFI-Bkoz0a5l0?usp=sharing)
- test.ipynb : Evaluates the fine-tuned model on test data. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17R1lvRjxILVNk971vzZT79o2OodwaNIX?usp=sharing)
- token_embeddings.ipynb : Trains a word2vec token embeddings for Tigrinya language. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1hCtetAllAjBw28EVQkJFpiKdFtXmuxV7?usp=sharing)
- process_Tigrinya_comments.ipynb : Extracts Tigrinya comments from mixed language contents. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-ndLlBV-iLZNBW3Z8OfKAqUUCjvGbdZU?usp=sharing)
- extract_YouTube_comments.ipynb : Downloads available comments from a YouTube channel ID. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1b7G85wHKe18y45JIDtvDJdO5dOkRmDdp?usp=sharing)
- auto_labelling.ipynb : Automatically labels Tigrinya comments in to positive or negative sentiments based on [Emoji's sentiment](http://kt.ijs.si/data/Emoji_sentiment_ranking/). [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wnZf7CBBCIr966vRUITlxKCrANsMPpV7?usp=sharing)
## Tigrinya Tokenizer:
A [sentencepiece](https://github.com/google/sentencepiece) based tokenizer for Tigrinya has been released to the public and can be accessed as in the following:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("abryee/TigXLNet")
tokenizer.tokenize("αααα α₯α ααα α«α₯α°α αα΅ααα ααα² α’α« α α₯α£αα α’α ααα΅αα ααα² αα₯α α°αα¨ αααΉ αα°α«α£αΉα α£α₯ αααΉα α°α¨αα‘")
## TigXLNet:
A new general purpose transformer model for low-resource language Tigrinya is also released to the public and be accessed as in the following:
from transformers import AutoConfig, AutoModel
config = AutoConfig.from_pretrained("abryee/TigXLNet")
config.d_head = 64
model = AutoModel.from_pretrained("abryee/TigXLNet", config=config)
## Evaluation:
The proposed method is evaluated using two datasets:
- A newly created sentiment analysis dataset for low-resource language (Tigriyna).
<table>
<tr>
<td> <table>
<thead>
<tr>
<th><sub>Models</sub></th>
<th><sub>Configuration</sub></th>
<th><sub>F1-Score</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=3><sub>BERT</sub></td>
<td rowspan=1><sub>+Frozen BERT weights</sub></td>
<td><sub>54.91</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Random embeddings</sub></td>
<td><sub>74.26</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Frozen token embeddings</sub></td>
<td><sub>76.35</sub></td>
</tr>
<tr>
<td rowspan=3><sub>mBERT</sub></td>
<td rowspan=1><sub>+Frozen mBERT weights</sub></td>
<td><sub>57.32</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Random embeddings</sub></td>
<td><sub>76.01</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Frozen token embeddings</sub></td>
<td><sub>77.51</sub></td>
</tr>
<tr>
<td rowspan=3><sub>XLNet</sub></td>
<td rowspan=1><sub>+Frozen XLNet weights</sub></td>
<td><strong><sub>68.14</sub></strong></td>
</tr>
<tr>
<td rowspan=1><sub>+Random embeddings</sub></td>
<td><strong><sub>77.83</sub></strong></td>
</tr>
<tr>
<td rowspan=1><sub>+Frozen token embeddings</sub></td>
<td><strong><sub>81.62</sub></strong></td>
</tr>
</tbody>
</table> </td>
<td><img src="data/effect_of_dataset_size.png" alt="3" width = 480px height = 280px></td>
</tr>
</table>
- Cross-lingual Sentiment dataset ([CLS](https://zenodo.org/record/3251672#.Xs65VzozbIU)).
<table>
<thead>
<tr>
<th rowspan=2><sub>Models</sub></th>
<th rowspan=1 colspan=3><sub>English</sub></th>
<th rowspan=1 colspan=3><sub>German</sub></th>
<th rowspan=1 colspan=3><sub>French</sub></th>
<th rowspan=1 colspan=3><sub>Japanese</sub></th>
<th rowspan=2><sub>Average</sub></th>
</tr>
<tr>
<th colspan=1><sub>Books</sub></th>
<th colspan=1><sub>DVD</sub></th>
<th colspan=1><sub>Music</sub></th>
<th colspan=1><sub>Books</sub></th>
<th colspan=1><sub>DVD</sub></th>
<th colspan=1><sub>Music</sub></th>
<th colspan=1><sub>Books</sub></th>
<th colspan=1><sub>DVD</sub></th>
<th colspan=1><sub>Music</sub></th>
<th colspan=1><sub>Books</sub></th>
<th colspan=1><sub>DVD</sub></th>
<th colspan=1><sub>Music</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td colspan=1><sub>XLNet</sub></td>
<td colspan=1><sub><strong>92.90</strong></sub></td>
<td colspan=1><sub><strong>93.31</strong></sub></td>
<td colspan=1><sub><strong>92.02</strong></sub></td>
<td colspan=1><sub>85.23</sub></td>
<td colspan=1><sub>83.30</sub></td>
<td colspan=1><sub>83.89</sub></td>
<td colspan=1><sub>73.05</sub></td>
<td colspan=1><sub>69.80</sub></td>
<td colspan=1><sub>70.12</sub></td>
<td colspan=1><sub>83.20</sub></td>
<td colspan=1><sub><strong>86.07</strong></sub></td>
<td colspan=1><sub>85.24</sub></td>
<td colspan=1><sub>83.08</sub></td>
</tr>
<tr>
<td colspan=1><sub>mBERT</sub></td>
<td colspan=1><sub>92.78</sub></td>
<td colspan=1><sub>90.30</sub></td>
<td colspan=1><sub>91.88</sub></td>
<td colspan=1><sub><strong>88.65</strong></sub></td>
<td colspan=1><sub><strong>85.85</strong></sub></td>
<td colspan=1><sub><strong>90.38</strong></sub></td>
<td colspan=1><sub><strong>91.09</strong></sub></td>
<td colspan=1><sub><strong>88.57</strong></sub></td>
<td colspan=1><sub><strong>93.67</strong></sub></td>
<td colspan=1><sub><strong>84.35</strong></sub></td>
<td colspan=1><sub>81.77</sub></td>
<td colspan=1><sub><strong>87.53</strong></sub></td>
<td colspan=1><sub><strong>88.90</strong></sub></td>
</tr>
</tbody>
</table>
## Dataset used for this paper:
We have constructed new sentiment analysis dataset for Tigrinya language and it can be found in the zip file (Tigrinya Sentiment Analysis Dataset)
## Citing our paper:
Our paper can be accessed from ArXiv [link](https://arxiv.org/pdf/2006.07698.pdf), and please consider citing our work.
@misc{tela2020transferring,
title={Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya},
author={Abrhalei Tela and Abraham Woubie and Ville Hautamaki},
year={2020},
eprint={2006.07698},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
## Any questions, comments, feedback is appreciated! And can be forwarded to the following email: abrhalei.tela@gmail.com
|