Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya:

Proposed Method:

The proposed method transfers a monolingual Transformer model to a new target language at the lexical level by learning new token embeddings. All implementations in this repo use XLNet as the source Transformer model; however, other Transformer models can be used in the same way.
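
As a toy illustration of what "learning new token embeddings" can look like, the sketch below keeps a stand-in source model fixed and updates only the embeddings of a new vocabulary. It is not the repo's XLNet code: the linear classifier, the token names, the data, and the hyperparameters are all assumptions chosen to keep the example dependency-free.

```python
import math
import random

def sentence_vector(tokens, emb):
    """Average the embeddings of the tokens in one sentence."""
    dim = len(emb[tokens[0]])
    return [sum(emb[t][i] for t in tokens) / len(tokens) for i in range(dim)]

def predict(tokens, emb, w, b):
    """Probability of the positive class from the frozen linear 'body'."""
    h = sentence_vector(tokens, emb)
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train_embeddings(data, emb, w, b, lr=0.5, epochs=200):
    """Gradient descent on the logistic loss, updating only the embeddings."""
    for _ in range(epochs):
        for tokens, label in data:
            grad_z = predict(tokens, emb, w, b) - label  # dLoss/dz
            for t in tokens:
                for i, wi in enumerate(w):
                    emb[t][i] -= lr * grad_z * wi / len(tokens)
    return emb

# Frozen "source model" parameters (hypothetical stand-ins for pretrained weights).
FROZEN_W, FROZEN_B = [1.0, -1.0], 0.0

# Hypothetical target-language vocabulary with randomly initialised embeddings.
random.seed(0)
vocab = ["tok_pos", "tok_neg", "tok_shared"]
embeddings = {t: [random.uniform(-0.1, 0.1) for _ in FROZEN_W] for t in vocab}

# Toy labelled data: 1 = positive, 0 = negative.
train_data = [(["tok_pos", "tok_shared"], 1), (["tok_neg", "tok_shared"], 0)]
train_embeddings(train_data, embeddings, FROZEN_W, FROZEN_B)
```

In this sketch the classifier weights never change; only the per-token vectors move, which is the sense in which the transfer happens at the lexical level.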

Main files:

All files are IPython notebooks that can be executed directly in Google Colab.

  • train.ipynb : Fine-tunes XLNet (a monolingual Transformer) on a sentiment analysis dataset in the new target language (Tigrinya).
  • test.ipynb : Evaluates the fine-tuned model on test data.
  • token_embeddings.ipynb : Trains word2vec token embeddings for Tigrinya.
  • process_Tigrinya_comments.ipynb : Extracts Tigrinya comments from mixed-language content.
  • extract_YouTube_comments.ipynb : Downloads the available comments for a given YouTube channel ID.
  • auto_labelling.ipynb : Automatically labels Tigrinya comments as positive or negative based on the sentiment of the emoji they contain.
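
The comment-filtering and auto-labelling steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual logic: the emoji sets, the 0.5 threshold, and the function names are hypothetical. Note also that the Ethiopic script is shared with other languages such as Amharic, so a script check alone only narrows the candidates and does not identify Tigrinya specifically.

```python
import re

# Basic Ethiopic Unicode block (U+1200 to U+137F), the script used to write Tigrinya.
ETHIOPIC = re.compile(r"[\u1200-\u137F]")

# Hypothetical emoji lists for illustration only; not the repo's actual lexicon.
POSITIVE_EMOJI = {"😍", "👍", "😊", "❤"}
NEGATIVE_EMOJI = {"😡", "👎", "😢", "💔"}

def ethiopic_ratio(text):
    """Fraction of non-space characters that fall in the Ethiopic block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ETHIOPIC.match(c)) / len(chars)

def is_ethiopic_comment(text, threshold=0.5):
    """Keep comments written mostly in Ethiopic script (threshold is an assumption)."""
    return ethiopic_ratio(text) >= threshold

def label_by_emoji(text):
    """Return 'positive', 'negative', or None when the emoji give no clear signal."""
    pos = sum(text.count(e) for e in POSITIVE_EMOJI)
    neg = sum(text.count(e) for e in NEGATIVE_EMOJI)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return None

comments = ["ብጣዕሚ ኢና ነመስግን 👍", "nice film 👍", "ብጣዕሚ ኢና ነመስግን"]
labelled = [(c, label_by_emoji(c)) for c in comments if is_ethiopic_comment(c)]
```

Comments with no emoji, or with an even mix of positive and negative emoji, get `None` here and would need to be dropped or labelled by other means.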

Tigrinya Tokenizer:

A SentencePiece-based tokenizer for Tigrinya has been released publicly and can be loaded as follows:

 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("abryee/TigXLNet")
 tokenizer.tokenize("ዋዋዋው እዛ ፍሊም ካብተን ዘድንቀን ሓንቲ ኢያ ሞ ብጣዕሚ ኢና ነመስግን ሓንቲ ክብላ ደልየ ዘሎኹ ሓደራኣኹም ኣብ ጊዜኹም ተረክቡ")

TigXLNet:

A new general-purpose Transformer model for the low-resource language Tigrinya has also been released publicly and can be loaded as follows:

from transformers import AutoConfig, AutoModel
config = AutoConfig.from_pretrained("abryee/TigXLNet")
config.d_head = 64
model = AutoModel.from_pretrained("abryee/TigXLNet", config=config)

Evaluation:

The proposed method is evaluated using two datasets:

  • A newly created sentiment analysis dataset for a low-resource language (Tigrinya).
| Models | Configuration            | F1-Score |
|--------|--------------------------|----------|
| BERT   | +Frozen BERT weights     | 54.91    |
|        | +Random embeddings       | 74.26    |
|        | +Frozen token embeddings | 76.35    |
| mBERT  | +Frozen mBERT weights    | 57.32    |
|        | +Random embeddings       | 76.01    |
|        | +Frozen token embeddings | 77.51    |
| XLNet  | +Frozen XLNet weights    | 68.14    |
|        | +Random embeddings       | 77.83    |
|        | +Frozen token embeddings | 81.62    |
  • Cross-lingual Sentiment dataset (CLS).
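| Models | English Books | English DVD | English Music | German Books | German DVD | German Music | French Books | French DVD | French Music | Japanese Books | Japanese DVD | Japanese Music | Average |
|--------|---------------|-------------|---------------|--------------|------------|--------------|--------------|------------|--------------|----------------|--------------|----------------|---------|
| XLNet  | 92.90         | 93.31       | 92.02         | 85.23        | 83.30      | 83.89        | 73.05        | 69.80      | 70.12        | 83.20          | 86.07        | 85.24          | 83.08   |
| mBERT  | 92.78         | 90.30       | 91.88         | 88.65        | 85.85      | 90.38        | 91.09        | 88.57      | 93.67        | 84.35          | 81.77        | 87.53          | 88.90   |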

Dataset used for this paper:

We constructed a new sentiment analysis dataset for the Tigrinya language; it can be found in the zip file (Tigrinya Sentiment Analysis Dataset).

Citing our paper:

Our paper is available on arXiv (arXiv:2006.07698). Please consider citing our work:

 @misc{tela2020transferring,
      title={Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya},
      author={Abrhalei Tela and Abraham Woubie and Ville Hautamaki},
      year={2020},
      eprint={2006.07698},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
  }

Questions, comments, and feedback are appreciated and can be sent to the following email: abrhalei.tela@gmail.com
