Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya:

Proposed Method:

The proposed method transfers a monolingual Transformer model to a new target language at the lexical level by learning new token embeddings. All implementations in this repo use XLNet as the source Transformer model; however, other Transformer models can be used in the same way.
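The idea can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the code from the notebooks: `TinyClassifier` is a hypothetical stand-in for a pretrained source model such as XLNet, random vectors stand in for the word2vec token embeddings, and the choice to keep the new embeddings frozen while fine-tuning the remaining weights follows one reading of the "+Frozen token embeddings" configuration reported under Evaluation.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
D_MODEL, SRC_VOCAB, TGT_VOCAB, N_CLASSES = 32, 100, 60, 2


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a pretrained Transformer plus a classifier head."""

    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            D_MODEL, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, N_CLASSES)

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))
        return self.head(hidden.mean(dim=1))  # mean-pool over tokens


# Pretend these weights were pretrained on the source language.
model = TinyClassifier(SRC_VOCAB)

# Placeholder for target-language token vectors (e.g. trained with word2vec);
# their dimension must match the model's hidden size.
target_vectors = torch.randn(TGT_VOCAB, D_MODEL)

# Lexical-level transfer: swap the source embedding table for the target one
# and keep it frozen, so only the encoder and head are updated.
model.embed = nn.Embedding.from_pretrained(target_vectors, freeze=True)

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# One fine-tuning step on a random batch of target-language token ids.
token_ids = torch.randint(0, TGT_VOCAB, (8, 12))
labels = torch.randint(0, N_CLASSES, (8,))
loss = nn.functional.cross_entropy(model(token_ids), labels)
loss.backward()
optimizer.step()
```

With a real source model, the same swap would be applied to its input embedding layer, using the target-language tokenizer's vocabulary size.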

Main files:

All files are IPython notebooks that can be run directly in Google Colab.

train.ipynb : Fine-tunes XLNet (a monolingual Transformer) on the sentiment analysis dataset for the new target language (Tigrinya).
test.ipynb : Evaluates the fine-tuned model on the test data.
token_embeddings.ipynb : Trains word2vec token embeddings for Tigrinya.
process_Tigrinya_comments.ipynb : Extracts Tigrinya comments from mixed-language content.
extract_YouTube_comments.ipynb : Downloads the available comments for a given YouTube channel ID.
auto_labelling.ipynb : Automatically labels Tigrinya comments as positive or negative based on the sentiment of the emojis they contain.
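As a rough illustration of the two preprocessing steps, the sketch below keeps comments written mostly in Ethiopic script and then assigns a label from emoji counts. It is not the code from process_Tigrinya_comments.ipynb or auto_labelling.ipynb: the emoji sets, the 0.5 threshold, and all function names are hypothetical. A script check alone also cannot separate Tigrinya from other languages written in Ethiopic script, such as Amharic.

```python
import re

# Ethiopic, Ethiopic Supplement, and Ethiopic Extended Unicode blocks.
ETHIOPIC_CHAR = re.compile(r"[\u1200-\u139F\u2D80-\u2DDF]")

# Hypothetical polarity lists; a real list would be much larger.
POSITIVE_EMOJI = {"😍", "👍", "😊", "❤", "👏"}
NEGATIVE_EMOJI = {"😡", "👎", "😢", "💔", "😠"}


def ethiopic_ratio(text):
    """Fraction of alphabetic characters that are Ethiopic letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ETHIOPIC_CHAR.match(ch)) / len(letters)


def is_mostly_ethiopic(text, threshold=0.5):
    """Coarse filter for mixed-language content (threshold is an assumption)."""
    return ethiopic_ratio(text) >= threshold


def label_by_emoji(text):
    """Return 'positive', 'negative', or None when the emojis give no signal."""
    pos = sum(text.count(e) for e in POSITIVE_EMOJI)
    neg = sum(text.count(e) for e in NEGATIVE_EMOJI)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return None


def auto_label(comments):
    """Keep Ethiopic-script comments that carry a clear emoji polarity."""
    labelled = []
    for text in comments:
        if not is_mostly_ethiopic(text):
            continue
        label = label_by_emoji(text)
        if label is not None:
            labelled.append((text, label))
    return labelled


comments = ["ብጣዕሚ ኢና ነመስግን 😍👍", "great video 👍", "ሕማቕ 😡", "ሰላም"]
for text, label in auto_label(comments):
    print(label, text)
```

Comments with no emoji, or with equal positive and negative counts, are dropped rather than guessed.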

Tigrinya Tokenizer:

A SentencePiece-based tokenizer for Tigrinya has been publicly released and can be loaded as follows:

 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("abryee/TigXLNet")
 tokenizer.tokenize("ዋዋዋው እዛ ፍሊም ካብተን ዘድንቀን ሓንቲ ኢያ ሞ ብጣዕሚ ኢና ነመስግን ሓንቲ ክብላ ደልየ ዘሎኹ ሓደራኣኹም ኣብ ጊዜኹም ተረክቡ")

TigXLNet:

A new general-purpose Transformer model for the low-resource language Tigrinya has also been publicly released and can be loaded as follows:

from transformers import AutoConfig, AutoModel
config = AutoConfig.from_pretrained("abryee/TigXLNet")
config.d_head = 64
model = AutoModel.from_pretrained("abryee/TigXLNet", config=config)

Evaluation:

The proposed method is evaluated using two datasets:

A newly created sentiment analysis dataset for a low-resource language (Tigrinya).

| Models | Configuration | F1-Score |
|--------|---------------|----------|
| BERT | +Frozen BERT weights | 54.91 |
| BERT | +Random embeddings | 74.26 |
| BERT | +Frozen token embeddings | 76.35 |
| mBERT | +Frozen mBERT weights | 57.32 |
| mBERT | +Random embeddings | 76.01 |
| mBERT | +Frozen token embeddings | 77.51 |
| XLNet | +Frozen XLNet weights | 68.14 |
| XLNet | +Random embeddings | 77.83 |
| XLNet | +Frozen token embeddings | 81.62 |
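For reference, an F1-score for binary sentiment labels can be computed as in the sketch below. The averaging scheme behind the reported numbers is not stated here, so treating "positive" as the target class is an assumption, and the function name is illustrative. The result is a fraction; the scores above are percentages.

```python
def f1_score(y_true, y_pred, positive="positive"):
    """Harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = ["positive", "positive", "negative", "negative"]
y_pred = ["positive", "negative", "positive", "negative"]
print(f"F1 = {100 * f1_score(y_true, y_pred):.2f}")
```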

Cross-lingual Sentiment dataset (CLS).

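| Models | English Books | English DVD | English Music | German Books | German DVD | German Music | French Books | French DVD | French Music | Japanese Books | Japanese DVD | Japanese Music | Average |
|--------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| XLNet | 92.90 | 93.31 | 92.02 | 85.23 | 83.30 | 83.89 | 73.05 | 69.80 | 70.12 | 83.20 | 86.07 | 85.24 | 83.08 |
| mBERT | 92.78 | 90.30 | 91.88 | 88.65 | 85.85 | 90.38 | 91.09 | 88.57 | 93.67 | 84.35 | 81.77 | 87.53 | 88.90 |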

Dataset used for this paper:

We constructed a new sentiment analysis dataset for Tigrinya; it is available in the zip file (Tigrinya Sentiment Analysis Dataset).

Citing our paper:

Our paper is available on arXiv. If you use this work, please consider citing it:

 @misc{tela2020transferring,
      title={Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya},
      author={Abrhalei Tela and Abraham Woubie and Ville Hautamaki},
      year={2020},
      eprint={2006.07698},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
  }
