Kinyarwanda-to-English Machine Translation
This is a Kinyarwanda-to-English machine translation model built and trained with the JoeyNMT framework. It uses a Transformer encoder-decoder architecture and was trained on a Kinyarwanda-English bitext of 47,211 sentence pairs prepared by Digital Umuganda.
Model architecture
Encoder & Decoder
* Type: Transformer
* Num_layers: 6
* Num_heads: 8
* Embedding_dim: 256
* ff_size: 1024
* Dropout: 0.1
* Layer_norm: post
* Initializer: xavier
* Total params: 12,563,968
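These hyper-parameters map onto the model section of a JoeyNMT 2.0 YAML config. The sketch below is a minimal, hypothetical excerpt rather than the exact config used here (key names follow JoeyNMT 2.0 conventions; hidden_size is assumed to equal the embedding dimension, and the decoder repeats the encoder settings):

model:
    initializer: "xavier"
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
        hidden_size: 256        # assumed equal to embedding_dim
        ff_size: 1024
        dropout: 0.1
        layer_norm: "post"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
        hidden_size: 256        # assumed equal to embedding_dim
        ff_size: 1024
        dropout: 0.1
        layer_norm: "post"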
Pre-processing
* Tokenizer_type: subword-nmt
* num_merges: 4000
* BPE encoding learned on the bitext, with a separate vocabulary for each language
* Pretokenizer: none
* Lowercase: not applied
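In a JoeyNMT 2.0 config, these pre-processing choices live in the data section. A minimal sketch follows, assuming JoeyNMT 2.0 key names; the file paths, BPE codes files, and vocabulary files are placeholders, not values from this project:

data:
    train: "data/train"            # placeholder path
    dev: "data/dev"                # placeholder path
    test: "data/test"              # placeholder path
    src:
        lang: "rw"
        level: "bpe"
        lowercase: False
        voc_file: "vocab.rw"       # placeholder; separate vocabulary per language
        tokenizer_type: "subword-nmt"
        tokenizer_cfg:
            num_merges: 4000
            codes: "bpe.codes.rw"  # placeholder filename
    trg:
        lang: "en"
        level: "bpe"
        lowercase: False
        voc_file: "vocab.en"       # placeholder; separate vocabulary per language
        tokenizer_type: "subword-nmt"
        tokenizer_cfg:
            num_merges: 4000
            codes: "bpe.codes.en"  # placeholder filename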
Training
* Optimizer: Adam
* Loss: cross-entropy
* Epochs: 30
* Batch_size: 256
* Number of GPUs: 1
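These settings correspond to the training section of the config. A minimal sketch, assuming JoeyNMT 2.0 key names (the learning rate, scheduler, and checkpointing settings are not reported above, so they are omitted):

training:
    optimizer: "adam"
    loss: "crossentropy"
    epochs: 30
    batch_size: 256
    model_dir: "models/rw_en"    # placeholder output directory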
Evaluation
* Evaluation_metrics: BLEU, chrF
* Tokenization: none
* Beam_width: 15
* Beam_alpha: 1.0
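The decoding and scoring settings map onto the testing section of the config. A minimal sketch, again assuming JoeyNMT 2.0 key names (beam_size is JoeyNMT's name for the beam width above):

testing:
    beam_size: 15
    beam_alpha: 1.0
    eval_metrics: ["bleu", "chrf"]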
Tools
* JoeyNMT 2.0.0
* datasets
* pandas
* numpy
* transformers
* sentencepiece
* PyTorch (with CUDA)
* sacrebleu
* protobuf>=3.20.1
How to train
See the JoeyNMT repository (https://github.com/joeynmt/joeynmt) for detailed training instructions.
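With a config file like the sketches above (called args.yaml here, to match the translation commands below), training can be launched through the same JoeyNMT CLI:

$ python -m joeynmt train args.yaml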
Translation
To install JoeyNMT, run:
$ git clone https://github.com/joeynmt/joeynmt.git
$ cd joeynmt
$ pip install -e .
Interactive translation (stdin):
$ python -m joeynmt translate args.yaml
File translation:
$ python -m joeynmt translate args.yaml < src_lang.txt > hypothesis_trg_lang.txt
Quality measurement
sacreBLEU installation:
$ pip install sacrebleu
Measurement (BLEU, chrF):
$ sacrebleu reference.tsv -i hypothesis.tsv -m bleu chrf
To-do
- Test the model on additional datasets, including JW300.
- Train some available state-of-the-art (SOTA) models on the Digital Umuganda dataset.
- Expand the dataset.
Results
The following results were obtained with sacreBLEU.
Kinyarwanda-to-English:
* BLEU: 79.87
* chrF: 84.40