|
--- |
|
library_name: JoeyNMT |
|
task: Machine-translation |
|
tags: |
|
- JoeyNMT |
|
- Machine-translation |
|
language: rw |
|
datasets: |
|
- DigitalUmuganda/kinyarwanda-english-machine-translation-dataset |
|
widget: |
|
- text: "Muraho neza, murakaza neza mu Rwanda." |
|
example_title: "Muraho neza, murakaza neza mu Rwanda." |
|
--- |
|
# Kinyarwanda-to-English Machine Translation |
|
|
|
This is a Kinyarwanda-to-English machine translation model built and trained with the JoeyNMT framework. It uses a Transformer encoder-decoder architecture and was trained on an English-Kinyarwanda bitext of 47,211 sentence pairs prepared by Digital Umuganda.
|
|
|
|
|
## Model architecture |
|
**Encoder & Decoder**

* Type: Transformer
* Num_layers: 6
* Num_heads: 8
* Embedding_dim: 256
* ff_size: 1024
* Dropout: 0.1
* Layer_norm: post
* Initializer: xavier
* Total params: 12,563,968
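
For reference, these hyperparameters correspond to the `model` section of a JoeyNMT configuration file along the lines below. This is a minimal sketch, not the actual training config: key names and accepted values (e.g. `xavier` vs. `xavier_uniform`) vary slightly across JoeyNMT versions, and `hidden_size` is assumed equal to the embedding dimension.

```
model:
    initializer: "xavier"
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "post"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "post"
```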
|
|
|
## Pre-processing |
|
|
|
* Tokenizer_type: subword-nmt
* num_merges: 4000
* BPE codes learned on the bitext, with separate vocabularies for each language
* Pretokenizer: None
* No lowercasing applied
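
A BPE model of this kind can be reproduced with the `subword-nmt` command-line tools, roughly as follows; the file names are placeholders, not the actual dataset files:

```
$ pip install subword-nmt
# learn 4,000 merge operations separately for each language
$ subword-nmt learn-bpe -s 4000 < train.rw > bpe.codes.rw
$ subword-nmt learn-bpe -s 4000 < train.en > bpe.codes.en
# apply the learned codes to the training data
$ subword-nmt apply-bpe -c bpe.codes.rw < train.rw > train.bpe.rw
$ subword-nmt apply-bpe -c bpe.codes.en < train.en > train.bpe.en
```

JoeyNMT 2.x can also apply a trained BPE model on the fly through the tokenizer settings in its data config.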
|
|
|
## Training |
|
* Optimizer: Adam
* Loss: crossentropy
* Epochs: 30
* Batch_size: 256
* Number of GPUs: 1
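
In a JoeyNMT config file these choices live in the `training` section. A sketch with only the settings listed above (other required fields, such as the learning rate, are omitted because they are not documented in this card):

```
training:
    optimizer: "adam"
    loss: "crossentropy"
    epochs: 30
    batch_size: 256
```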
|
|
|
|
|
|
|
## Evaluation |
|
|
|
* Evaluation_metrics: BLEU, chrF
* Tokenization: None
* Beam_width: 15
* Beam_alpha: 1.0
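
This decoding setup corresponds roughly to the `testing` section of a JoeyNMT 2.0 config (JoeyNMT names the beam width `beam_size`); a sketch:

```
testing:
    beam_size: 15
    beam_alpha: 1.0
    eval_metrics: ["bleu", "chrf"]
```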
|
|
|
## Tools |
|
* JoeyNMT 2.0.0
|
* datasets |
|
* pandas |
|
* numpy |
|
* transformers |
|
* sentencepiece |
|
* PyTorch (with CUDA)
|
* sacrebleu |
|
* protobuf>=3.20.1 |
|
|
|
## How to train |
|
|
|
See the [JoeyNMT repository](https://github.com/joeynmt/joeynmt) for detailed training instructions.
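
In short, training follows the standard JoeyNMT workflow: assemble a YAML config (see the sketches above) and pass it to the `train` entry point. The config file name below is a placeholder:

```
$ python -m joeynmt train config.yaml
```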
|
|
|
## Translation |
|
To install JoeyNMT, run:
|
``` |
|
$ git clone https://github.com/joeynmt/joeynmt.git |
|
$ cd joeynmt |
|
$ pip install -e .
|
``` |
|
|
|
Interactive translation (stdin):
|
``` |
|
$ python -m joeynmt translate args.yaml |
|
``` |
|
|
|
File translation: |
|
``` |
|
$ python -m joeynmt translate args.yaml < src_lang.txt > hypothesis_trg_lang.txt |
|
``` |
|
|
|
## Accuracy measurement |
|
Sacrebleu installation: |
|
``` |
|
$ pip install sacrebleu |
|
``` |
|
|
|
Measurement (BLEU, chrF):
|
``` |
|
$ sacrebleu reference.tsv -i hypothesis.tsv -m bleu chrf |
|
``` |
|
|
|
## To-do |
|
|
|
* Test the model on additional datasets, including JW300
* Benchmark the Digital Umuganda dataset on available state-of-the-art (SOTA) models
* Expand the dataset
|
|
|
## Results

The following results were obtained using sacrebleu.

Kinyarwanda-to-English:

```
BLEU: 79.87
chrF: 84.40
```