# Masakhane - Machine Translation for African Languages (Using JoeyNMT)

Languages: English-Afrikaans

Author: Herman Kamper

## Retrieve data and make a parallel corpus

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# TODO: Set your source and target languages. Keep in mind, these traditionally use language codes as found here:
# These will also become the suffixes of all vocab and corpus files used throughout
import os
source_language = "en"
target_language = "af"
tag = "baseline" # Give a unique name to your folder so that you don't overwrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# This will save it to a folder in our gdrive instead!
!mkdir -p "/content/drive/My Drive/colab/masakhane/$src-$tgt-$tag"
os.environ["gdrive_path"] = "/content/drive/My Drive/colab/masakhane/%s-%s-%s" % (source_language, target_language, tag)

In [0]:
!echo $gdrive_path

/content/drive/My Drive/colab/masakhane/en-af-baseline
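
As a minimal sketch, the run folder could also be created from Python rather than through `!mkdir -p`. The helper name `make_run_dir` is hypothetical, and the demo writes to a temporary directory instead of the mounted Drive so that it runs anywhere.

```python
import os
import tempfile


def make_run_dir(root, source_language, target_language, tag):
    """Create (if needed) and return the folder <root>/<src>-<tgt>-<tag>."""
    path = os.path.join(root, "%s-%s-%s" % (source_language, target_language, tag))
    # exist_ok=True mirrors `mkdir -p`: no error if the folder already exists.
    os.makedirs(path, exist_ok=True)
    return path


# Demo in a temporary directory (assumption: in Colab, root would be
# "/content/drive/My Drive/colab/masakhane").
with tempfile.TemporaryDirectory() as root:
    run_dir = make_run_dir(root, "en", "af", "baseline")
    run_dir_name = os.path.basename(run_dir)
    created = os.path.isdir(run_dir)

print(run_dir_name, created)
```

Calling the helper a second time with the same arguments leaves the existing folder untouched, which is the behaviour the `tag` variable relies on.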


In [0]:
# Download the corpus
! wget "https://www.kamperh.com/data/siyavula_en_af.noweb.3.zip"
! unzip siyavula_en_af.noweb.3.zip
! ls -lah
! head -3 train.en
! head -3 train.af
! cat train.en | wc -l
! cat train.af | wc -l

--2019-10-14 12:40:33-- https://www.kamperh.com/data/siyavula_en_af.noweb.3.zip
Resolving www.kamperh.com (www.kamperh.com)... 185.199.109.153, 185.199.110.153, 185.199.111.153, ...
Connecting to www.kamperh.com (www.kamperh.com)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 303271 (296K) [application/zip]
Saving to: ‘siyavula_en_af.noweb.3.zip’


2019-10-14 12:40:34 (5.89 MB/s) - ‘siyavula_en_af.noweb.3.zip’ saved [303271/303271]

Archive: siyavula_en_af.noweb.3.zip
 inflating: dev.af 
 inflating: dev.en 
 inflating: readme.md 
 inflating: test.af 
 inflating: test.en 
 inflating: train.af 
 inflating: train.en 
total 1.4M
drwxr-xr-x 1 root root 4.0K Oct 14 12:40 .
drwxr-xr-x 1 root root 4.0K Oct 14 11:55 ..
drwxr-xr-x 1 root root 4.0K Oct 8 20:06 .config
-rw-rw-r-- 1 root root 29K Oct 14 12:17 dev.af
-rw-rw-r-- 1 root root 28K Oct 14 12:17 dev.en
drwx------ 3 root root 4.0K Oct 14 12:40 drive
-rw-rw-r-- 1 root root 310 Oct 11 11:46 readme.m
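
The two `wc -l` calls above are there to confirm that both sides of the corpus have the same number of lines. A minimal Python sketch of the same check follows; the helper names `count_lines` and `check_parallel` are hypothetical, and the demo uses two tiny temporary files rather than the downloaded corpus.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise if the two sides differ."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            "%s has %d lines but %s has %d" % (src_path, n_src, tgt_path, n_tgt)
        )
    return n_src


# Demo on a two-sentence toy corpus written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, "toy.en")
    af_path = os.path.join(tmp, "toy.af")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("hello\nthank you\n")
    with open(af_path, "w", encoding="utf-8") as f:
        f.write("hallo\ndankie\n")
    demo_count = check_parallel(en_path, af_path)

print(demo_count)
```

In the notebook, the same function could be called on `train.en` and `train.af` (and likewise for the `dev` and `test` splits) right after unzipping.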



---


## Installation of JoeyNMT

JoeyNMT is a simple, minimalist NMT package that is useful for learning and teaching. Check out the JoeyNMT documentation [here](https://joeynmt.readthedocs.io).

In [0]:
# Install JoeyNMT
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Cloning into 'joeynmt'...
remote: Enumerating objects: 52, done.

## Preprocessing the Data into Subword BPE Tokens

- One of the most powerful improvements for agglutinative languages (a feature of most Bantu languages) is using BPE tokenization [(Sennrich, 2015)](https://arxiv.org/abs/1508.07909).

- It was also shown that optimizing the number of BPE codes significantly improves results for low-resourced languages [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021) [(Martinus, 2019)](https://arxiv.org/abs/1906.05685).

- Below we have the scripts for doing BPE tokenization of our data. We use 4000 BPE codes (merge operations), as recommended by [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021). You do not need to change anything; simply running the cell below is sufficient.
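
To make the idea behind BPE concrete, here is a toy sketch of how merge operations can be learned, loosely following the procedure described in (Sennrich, 2015). It is not the `subword-nmt` implementation: the function name `learn_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken alphabetically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


toy_merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(toy_merges)
```

With more merges, frequent words end up as single tokens while rare words stay split into smaller pieces. The `-s 4000` option in the cell below controls this trade-off.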

In [0]:
# One of the huge boosts in NMT performance was to use a different method of tokenizing. 
# Usually, NMT would tokenize by words. However, using a method called BPE gave amazing boosts to performance

# Do subword NMT
from os import path

os.environ["data_path"] = path.join("joeynmt", "data", source_language + target_language) # Herman! 
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < train.$tgt > train.bpe.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < dev.$tgt > dev.bpe.$tgt
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < test.$tgt > test.bpe.$tgt

# Create directory, move everything we care about to the correct location
! mkdir -p $data_path
! cp train.* $data_path
! cp test.* $data_path
! cp dev.* $data_path
! cp bpe.codes.4000 $data_path
! ls $data_path

# Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
! cp train.* "$gdrive_path"
! cp test.* "$gdrive_path"
! cp dev.* "$gdrive_path"
! cp bpe.codes.4000 "$gdrive_path"
! ls "$gdrive_path"

# Create the joint vocabulary using JoeyNMT's build_vocab script (it only needs to be executable)
! chmod +x joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.bpe.$src joeynmt/data/$src$tgt/train.bpe.$tgt --output_path joeynmt/data/$src$tgt/vocab.txt

# Some output
! echo "BPE Afrikaans Sentences"
! tail -n 5 test.bpe.$tgt
! echo "Combined BPE Vocab"
! tail -n 10 joeynmt/data/$src$tgt/vocab.txt # Herman

bpe.codes.4000	dev.bpe.en test.bpe.af train.af train.en
dev.af		dev.en	 test.bpe.en train.bpe.af
dev.bpe.af	test.af test.en	 train.bpe.en
bpe.codes.4000	dev.bpe.en test.af	 test.en train.bpe.en
dev.af		dev.en	 test.bpe.af train.af train.en
dev.bpe.af	models	 test.bpe.en train.bpe.af
BPE Afrikaans Sentences
wat is 'n on@@ we@@ t@@ tige elektriese skak@@ el@@ ings ?
hoe dink jy kan die plaas@@ like reg@@ ering dit keer of die hoeveelheid on@@ we@@ t@@ tige skak@@ el@@ ings ver@@ minder .
'n on@@ we@@ t@@ tige skak@@ eling is wanneer ie@@ mand toe@@ gan@@ g kry tot elektrisiteit deur 'n kra@@ gl@@ yn te sny en 'n ander l@@ yn daaraan te verbind sonder om daar@@ voor te be@@ taal .
die plaas@@ like reg@@ ering kan dit probeer stop deur eer@@ st@@ ens te probeer om die ar@@ mer geb@@ ie@@ de met genoeg elektriese toe@@ g@@ ang@@ sp@@ unte te voorsien rond te gaan en te kyk of daar ge@@ vaar@@ like skak@@ el@@ ings is be@@ w@@ us@@ theid oor die gev@@ are van on@@ we@@ t@@ tige skak@@ el@@ i
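
The `@@` markers in the output above flag a subword that continues into the next token. A minimal sketch of merging the pieces back into words, assuming the `@@ ` separator that `subword-nmt` uses by default (the helper name `undo_bpe` is hypothetical):

```python
import re


def undo_bpe(line):
    """Join BPE pieces back into words by removing the '@@ ' separators."""
    # Also drop a dangling '@@' at the very end of the line, if present.
    return re.sub(r"(@@ )|(@@ ?$)", "", line)


segmented = "wat is 'n on@@ we@@ t@@ tige elektriese skak@@ el@@ ings ?"
restored = undo_bpe(segmented)
print(restored)
```

This kind of post-processing is what turns BPE-level model output back into ordinary text before scoring or reading it.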



## Creating the JoeyNMT Config

JoeyNMT requires a YAML config. We provide a template below, with a number of defaults that you may play with!

- We use the Transformer architecture.
- We set dropout reasonably high: 0.3 (recommended in [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021)).

Things worth playing with:
- The batch size (also recommended to change for low-resourced languages)
- The number of epochs (the config below sets 200; around 30 runs in about an hour and is enough for testing purposes)
- The decoder options (beam_size, alpha)
- Evaluation metrics (BLEU versus chrF)

In [0]:
# This creates the config file for our JoeyNMT system. It might seem overwhelming so we've provided a couple of useful parameters you'll need to update
# (You can of course play with all the parameters if you'd like!)

name = '%s%s' % (source_language, target_language)
gdrive_path = os.environ["gdrive_path"]

# Create the config
config = """
name: "{name}_transformer"

data:
 src: "{source_language}"
 trg: "{target_language}"
 train: "data/{name}/train.bpe"
 dev: "data/{name}/dev.bpe"
 test: "data/{name}/test.bpe"
 level: "bpe"
 lowercase: False
 max_sent_length: 100
 src_vocab: "data/{name}/vocab.txt"
 trg_vocab: "data/{name}/vocab.txt"

testing:
 beam_size: 5
 alpha: 1.0

training:
 #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
 random_seed: 42
 optimizer: "adam"
 normalization: "tokens"
 adam_betas: [0.9, 0.999] 
 scheduling: "noam" # Try switching from plateau to Noam scheduling
 learning_rate_factor: 0.5 # factor for Noam scheduler (used with Transformer)
 learning_rate_warmup: 1000 # warmup steps for Noam scheduler (used with Transformer)
 patience: 8
 decrease_factor: 0.7
 loss: "crossentropy"
 learning_rate: 0.0002
 learning_rate_min: 0.00000001
 weight_decay: 0.0
 label_smoothing: 0.1
 batch_size: 8192 # 4096 # Herman
 batch_type: "token"
 eval_batch_size: 1000 # 3600 # Herman
 eval_batch_type: "token"
 batch_multiplier: 1
 early_stopping_metric: "eval_metric" # "ppl" # Herman
 epochs: 200 # TODO: Decrease this when experimenting. Around 30 is sufficient to check that training works at all
 validation_freq: 500 # 4000 # Decrease this for testing # Herman
 logging_freq: 50 # 100 # Herman
 eval_metric: "bleu"
 model_dir: "models/{name}_transformer"
 overwrite: True
 shuffle: True
 use_cuda: True
 max_output_length: 100
 print_valid_sents: [0, 1, 2, 3]
 keep_last_ckpts: 3

model:
 initializer: "xavier"
 bias_initializer: "zeros"
 init_gain: 1.0
 embed_initializer: "xavier"
 embed_init_gain: 1.0
 tied_embeddings: True
 tied_softmax: True
 encoder:
  type: "transformer"
  num_layers: 6
  num_heads: 8
  embeddings:
   embedding_dim: 512
   scale: True
   dropout: 0.
  # typically ff_size = 4 x hidden_size
  hidden_size: 512
  ff_size: 2048
  dropout: 0.3
 decoder:
  type: "transformer"
  num_layers: 6
  num_heads: 8
  embeddings:
   embedding_dim: 512
   scale: True
   dropout: 0.
  # typically ff_size = 4 x hidden_size
  hidden_size: 512
  ff_size: 2048
  dropout: 0.3
""".format(name=name, gdrive_path=os.environ["gdrive_path"], source_language=source_language, target_language=target_language)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
 f.write(config)
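
The `scheduling: "noam"` entry, together with `learning_rate_factor`, `learning_rate_warmup` and the encoder `hidden_size`, determines the learning rate at each step. The sketch below assumes the standard Noam formula from the original Transformer paper, scaled by the factor; the function name `noam_lr` is hypothetical.

```python
def noam_lr(step, hidden_size=512, factor=0.5, warmup=1000):
    """Learning rate at `step` (>= 1) under the Noam schedule.

    The rate grows linearly for `warmup` steps and then decays
    proportionally to 1 / sqrt(step).
    """
    return factor * hidden_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# Defaults mirror the config above: hidden_size 512, factor 0.5, warmup 1000.
for step in (500, 1000, 1500, 2000):
    print(step, "%.8f" % noam_lr(step))
```

Under these assumptions the printed values agree with the `LR` column of the validation log further down (for example 0.00069877 at step 1000, the peak at the end of warmup). Note that `learning_rate: 0.0002` does not enter this formula.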

## Train the Model

This single JoeyNMT command runs the training using the config we made above.

In [0]:
# Train the model
# You can press Ctrl-C to stop. And then run the next cell to save your checkpoints! 
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt.yaml

2019-10-14 12:43:11,392 Hello! This is Joey-NMT.
2019-10-14 12:43:12,910 Total params: 46140928
2019-10-14 12:43:12,912 Trainable parameters: ['decoder.layer_norm.bias', 'decoder.layer_norm.weight', 'decoder.layers.0.dec_layer_norm.bias', 'decoder.layers.0.dec_layer_norm.weight', 'decoder.layers.0.feed_forward.layer_norm.bias', 'decoder.layers.0.feed_forward.layer_norm.weight', 'decoder.layers.0.feed_forward.pwff_layer.0.bias', 'decoder.layers.0.feed_forward.pwff_layer.0.weight', 'decoder.layers.0.feed_forward.pwff_layer.3.bias', 'decoder.layers.0.feed_forward.pwff_layer.3.weight', 'decoder.layers.0.src_trg_att.k_layer.bias', 'decoder.layers.0.src_trg_att.k_layer.weight', 'decoder.layers.0.src_trg_att.output_layer.bias', 'decoder.layers.0.src_trg_att.output_layer.weight', 'decoder.layers.0.src_trg_att.q_layer.bias', 'decoder.layers.0.src_trg_att.q_layer.weight', 'decoder.layers.0.src_trg_att.v_layer.bias', 'decoder.layers.0.src_trg_att.v_layer.weight', 'decoder.layers.0.trg_trg_att.k_l

In [0]:
# Copy the created models from the notebook storage to google drive for persistent storage
!mkdir -p "$gdrive_path/models/${src}${tgt}_transformer/" # Herman
!cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"

cp: cannot create symbolic link '/content/drive/My Drive/colab/masakhane/en-af-baseline/models/enaf_transformer/best.ckpt': Function not implemented
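
The error above suggests that `best.ckpt` is a symbolic link, which the mounted Drive cannot store. One possible workaround, sketched here under that assumption, is to copy the contents of the link target instead of the link itself. The helper name `copy_dereferenced` is hypothetical, `dirs_exist_ok` requires Python 3.8 or newer, and the demo uses temporary folders rather than the real model directory.

```python
import os
import shutil
import tempfile


def copy_dereferenced(src_dir, dst_dir):
    """Copy src_dir into dst_dir, replacing symlinks with copies of their targets."""
    # symlinks=False makes copytree copy the linked file's contents, not the link.
    shutil.copytree(src_dir, dst_dir, symlinks=False, dirs_exist_ok=True)


# Demo: a fake model folder with one checkpoint and a relative 'best.ckpt' link.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "models")
    dst = os.path.join(tmp, "drive_models")
    os.makedirs(src)
    with open(os.path.join(src, "500.ckpt"), "w") as f:
        f.write("weights")
    os.symlink("500.ckpt", os.path.join(src, "best.ckpt"))
    copy_dereferenced(src, dst)
    copied = os.path.join(dst, "best.ckpt")
    best_is_link = os.path.islink(copied)
    with open(copied) as f:
        best_content = f.read()

print(best_is_link, best_content)
```

The trade-off is that the best checkpoint is stored twice in the destination, once under its step number and once as `best.ckpt`.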


In [0]:
# Output our validation scores (loss, perplexity and BLEU at each validation step)
! cat "$gdrive_path/models/${src}${tgt}_transformer/validations.txt"

Steps: 500	Loss: 29601.47461	PPL: 67.60847	bleu: 0.78623	LR: 0.00034939	*
Steps: 1000	Loss: 23326.10742	PPL: 27.67259	bleu: 13.46693	LR: 0.00069877	*
Steps: 1500	Loss: 23168.03320	PPL: 27.05686	bleu: 17.51870	LR: 0.00057054	*
Steps: 2000	Loss: 24336.34375	PPL: 31.95243	bleu: 16.99290	LR: 0.00049411	
Steps: 2500	Loss: 24009.45508	PPL: 30.49967	bleu: 18.66731	LR: 0.00044194	*
Steps: 3000	Loss: 23779.20898	PPL: 29.51625	bleu: 17.61267	LR: 0.00040344	
Steps: 3500	Loss: 23638.41797	PPL: 28.93059	bleu: 18.91206	LR: 0.00037351	*
Steps: 4000	Loss: 23474.00195	PPL: 28.26134	bleu: 19.68848	LR: 0.00034939	*
Steps: 4500	Loss: 23306.66211	PPL: 27.59610	bleu: 19.81664	LR: 0.00032940	*
Steps: 5000	Loss: 23490.83203	PPL: 28.32913	bleu: 19.27047	LR: 0.00031250	
Steps: 5500	Loss: 23279.69922	PPL: 27.49038	bleu: 19.25319	LR: 0.00029796	
Steps: 6000	Loss: 23328.73438	PPL: 27.68294	bleu: 20.27485	LR: 0.00028527	*
Steps: 6500	Loss: 23496.45508	PPL: 28.35181	bleu: 19.71401	LR: 0.00027408	
Steps: 7000	Loss: 2

In [0]:
# Test our model
! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${src}${tgt}_transformer/config.yaml"

2019-10-14 15:31:18,094 - dev bleu: 22.06 [Beam search decoding with beam size = 5 and alpha = 1.0]
2019-10-14 15:31:50,269 - test bleu: 14.84 [Beam search decoding with beam size = 5 and alpha = 1.0]


## Record

After 200 epochs:

 Steps: 500	Loss: 28996.02539	PPL: 65.32051	bleu: 0.74017	LR: 0.00034939	*
 Steps: 1000	Loss: 22725.31836	PPL: 26.45606	bleu: 12.15630	LR: 0.00069877	*
 Steps: 1500	Loss: 22900.86719	PPL: 27.13401	bleu: 17.04406	LR: 0.00057054	*
 Steps: 2000	Loss: 24123.17773	PPL: 32.36132	bleu: 17.20765	LR: 0.00049411	*
 Steps: 2500	Loss: 23582.63867	PPL: 29.93578	bleu: 18.16604	LR: 0.00044194	*
 Steps: 3000	Loss: 23164.73633	PPL: 28.18586	bleu: 19.39783	LR: 0.00040344	*
 Steps: 3500	Loss: 23084.53516	PPL: 27.86192	bleu: 19.46346	LR: 0.00037351	*
 Steps: 4000	Loss: 23180.01953	PPL: 28.24801	bleu: 19.10164	LR: 0.00034939	
 Steps: 4500	Loss: 22994.55078	PPL: 27.50288	bleu: 20.05288	LR: 0.00032940	*
 Steps: 5000	Loss: 22928.59961	PPL: 27.24268	bleu: 19.66884	LR: 0.00031250	
 Steps: 5500	Loss: 22814.38477	PPL: 26.79788	bleu: 18.71092	LR: 0.00029796	
 Steps: 6000	Loss: 22747.05664	PPL: 26.53909	bleu: 19.54311	LR: 0.00028527	
 Steps: 6500	Loss: 22670.42383	PPL: 26.24757	bleu: 19.12990	LR: 0.00027408	
 Steps: 7000	Loss: 22537.89453	PPL: 25.75094	bleu: 19.76692	LR: 0.00026411	
 Steps: 7500	Loss: 22478.74023	PPL: 25.53232	bleu: 20.04524	LR: 0.00025516	
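
Records like the one above are easier to compare once parsed. The following is a minimal sketch that reads lines in the `validations.txt` layout shown in this notebook and picks the step with the highest BLEU; the names `parse_validations` and `LINE_RE` are hypothetical, and the sample is four lines copied from the record above.

```python
import re

# One validation line: Steps, Loss, PPL, bleu and LR separated by whitespace.
LINE_RE = re.compile(
    r"Steps:\s*(\d+)\s+Loss:\s*([\d.]+)\s+PPL:\s*([\d.]+)"
    r"\s+bleu:\s*([\d.]+)\s+LR:\s*([\d.]+)"
)


def parse_validations(text):
    """Return one dict per complete validation line in `text`."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip blank or truncated lines
        step, loss, ppl, bleu, lr = match.groups()
        rows.append({
            "step": int(step),
            "loss": float(loss),
            "ppl": float(ppl),
            "bleu": float(bleu),
            "lr": float(lr),
        })
    return rows


sample = (
    "Steps: 3500\tLoss: 23084.53516\tPPL: 27.86192\tbleu: 19.46346\tLR: 0.00037351\t*\n"
    "Steps: 4000\tLoss: 23180.01953\tPPL: 28.24801\tbleu: 19.10164\tLR: 0.00034939\t\n"
    "Steps: 4500\tLoss: 22994.55078\tPPL: 27.50288\tbleu: 20.05288\tLR: 0.00032940\t*\n"
    "Steps: 5000\tLoss: 22928.59961\tPPL: 27.24268\tbleu: 19.66884\tLR: 0.00031250\t\n"
)

rows = parse_validations(sample)
best = max(rows, key=lambda row: row["bleu"])
print(best["step"], best["bleu"])
```

In the notebook, `text` would instead come from reading `validations.txt` in the model folder.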